Deep Dive: How Trigger.dev works

Matt Aitken

Trigger.dev is an open source framework for building background jobs in your existing codebase. Right now we have support for Next.js and serverless. We’re adding support for more JavaScript platforms soon.

This post will explain how it works and dive into some technical detail. Hold on to your butts.

Architectural overview

First of all, why would I need something like this?!

Let’s say your web app is hosted on a serverless platform like Vercel, Cloudflare or Netlify. You want to perform a background job – a series of operations that take some time, can fail, and must be retried.

Without a lot of extra work, serverless is not a good fit for this problem. Serverless functions have timeouts which can be as low as 10 seconds. You must send a response with some data during this time otherwise the job fails. You also want to keep users informed of what’s happening during this time.

How is this problem normally solved?

Most teams end up creating a separate service for background jobs that doesn’t run on serverless. This approach works well but introduces significant work and ongoing maintenance.

You must:

  • Create API endpoints on both sides – to send data back and forth. You really want strong type safety as well.
  • Store the state of Runs so you can report the status and recover when servers go down (e.g. during deployments).
  • Be able to retrieve the state of Runs for displaying to users.
  • Add logging so you can debug when things go wrong.
  • Be able to rerun failed jobs, either from the start with the same input data or by retrying the last failed task and continuing from there.
  • And of course you need to write the code for each job and deploy it.

How is Trigger.dev built?

Here’s a helicopter view before we dive into a real example in detail.

[Diagram: Architectural overview]

Your servers

You write background jobs alongside your normal code, using the Trigger.dev SDKs. You can access your database, existing code, whatever you normally do. It’s just code.

But the twist is: if you want to make something retryable or logged to the Trigger.dev dashboard, you wrap it in a Task. We make this easy for APIs by providing Integrations (we’ve already done it for you). More on that in a bit.

You’ll need to use an adaptor (like our Next.js one) that creates an API endpoint so we can send messages back to your servers.

The Trigger.dev API, dashboard and Postgres

The API triggers jobs, manages Tasks and saves the state of all Runs. It also allows you to get the current state of jobs.

The dashboard is a great UI your whole team can use to view all your jobs and Runs (logs), and to retry or cancel things.

The glorious dashboard

Postgres is used both as a store of state for Runs/Tasks and for the job queue (we use Graphile Worker).

Trigger.dev is fully open source and can be self-hosted. We have a cloud product too.

Let’s check out an example job: GitHub issue reminders

When a new GitHub issue is created, if no one has acted on it after 24 hours, assign it to someone and send a Slack reminder.

Here’s the code:


    import { client } from "@/trigger";
    import { Github, events } from "@trigger.dev/github";
    import { Slack } from "@trigger.dev/slack";
    import { Job } from "@trigger.dev/sdk";

    const github = new Github({
      id: "github-api-key",
      //this token doesn't leave your servers
      token: process.env.GITHUB_API_KEY!,
    });

    //this uses OAuth (setup in the dashboard)
    const slack = new Slack({ id: "slack" });

    client.defineJob({
      id: "new-github-issue-reminder",
      name: "New GitHub issue reminder",
      version: "0.1.0",
      //include the integrations so they can be used in run()
      integrations: { github, slack },
      //this is what causes run() to fire
      trigger: github.triggers.repo({
        event: events.onIssueOpened,
        owner: "triggerdotdev",
        repo: "trigger.dev",
      }),
      //where the magic happens
      run: async (payload, io, ctx) => {
        //delay for 24 hours (or 60 seconds in development)
        const delayDuration =
          ctx.environment.type === "DEVELOPMENT" ? 60 : 60 * 60 * 24;
        await io.wait("wait 24 hours", delayDuration);

        const issue = await io.github.getIssue("get issue", {
          owner: payload.repository.owner.login,
          repo: payload.repository.name,
          issueNumber: payload.issue.number,
        });

        //if the issue has had no activity
        if (issue.updated_at === payload.issue.updated_at) {
          await io.slack.postMessage("Slack reminder", {
            text: `Issue needs attention: <${issue.html_url}|${issue.title}>`,
            channel: "C04GWUTDC3W",
          });

          //assign it to someone, in this case… Matt
          await io.github.addIssueAssignees("add assignee", {
            owner: payload.repository.owner.login,
            repo: payload.repository.name,
            issueNumber: payload.issue.number,
            assignees: ["matt-aitken"],
          });
        }
      },
    });

There's a YouTube walkthrough of how to create this job from start to finish.

Job registration

When you run our CLI during local development or when you deploy, your jobs will get registered with the Trigger.dev platform. This makes us aware of them so we can trigger them to start.

There are currently three types of Triggers:

  1. Event Triggers – define the name of an event and expected payload, then send a matching event to trigger the job(s).
  2. Scheduled – either use a CRON pattern or an interval that you want a job to run at.
  3. Webhooks – subscribe to changes on another service using their API.

We’re going to dig into webhooks in detail because it’s the most interesting Trigger and is used in our example above.

How job registration works

  1. You start an endpoint refresh (either by using the CLI dev command or because you have deployment set up).
  2. The absolute URL for your Trigger endpoint is updated.
  3. A request to your endpoint is made with the INDEX_ENDPOINT action.
  4. Data about all your jobs, Sources, Dynamic Triggers and Dynamic Sources is sent back.
  5. Jobs are registered: new jobs are created, old jobs are updated. Any Integrations that don’t exist are created (let’s assume for simplicity that the Slack OAuth Integration with id slack has already been set up).
  6. Sources are registered – a source for the GitHub triggerdotdev/trigger.dev repo with the issues event doesn’t exist, so it needs to be created and registered. If it existed and the config had changed it would be updated.
  7. Registering webhooks uses a job. We kick this off by creating records, then sending an event to your server.
  8. The internal registration job starts (this job is defined inside the GitHub Integration).
    1. It gets existing webhooks for triggerdotdev/trigger.dev from GitHub.
    2. None existed so it uses their API to create a webhook for the issues event. The URL registered for the webhook is on the Trigger.dev platform.
    3. That webhook data is sent back to the Trigger.dev API.
  9. The source is updated and is ready to receive webhook events.
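
The source part of step 6 boils down to a create/update/leave-alone decision. Here is a minimal sketch of that decision, assuming sources are tracked by a string key and a serialised config. `planSource`, `SourceAction` and the key format are hypothetical, and the real platform stores far more than this.

```typescript
// Hypothetical model: sources keyed by a string, with their config serialised.
type SourceAction = "create" | "update" | "unchanged";

function planSource(
  registered: Map<string, string>,
  key: string,
  config: string
): SourceAction {
  const current = registered.get(key);
  if (current === undefined) return "create"; // never seen: a webhook must be registered
  return current === config ? "unchanged" : "update";
}

const registered = new Map<string, string>();
const sourceKey = "github:triggerdotdev/trigger.dev";

const firstIndex = planSource(registered, sourceKey, "issues"); // "create"
registered.set(sourceKey, "issues");
const secondIndex = planSource(registered, sourceKey, "issues"); // "unchanged"
const changedIndex = planSource(registered, sourceKey, "issues,push"); // "update"
```

In the walkthrough above the first case applies, which is what kicks off the internal registration job.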

Triggering a Run

Let’s dig into the details of how this job gets triggered and a Run starts.

How Run triggering works

  1. Someone creates an issue in your GitHub repo.
  2. GitHub sends a webhook to the URL that was registered on Trigger.dev.
  3. An EventRecord is created, which stores the payload and metadata in the database (used for logging and retries).
  4. All the jobs which subscribe to this webhook event are found (it can be more than one).
  5. A job Run record is created in the database, for each job and environment combo (e.g. dev and prod).
  6. Any Integrations are prepared, so they can be used in the run() function.
    1. The Slack Integration uses OAuth, so the latest token is retrieved from the secret store. OAuth token refreshing is a totally separate flow which is handled for you.
  7. A request with the EXECUTE_JOB action is made to your Trigger API endpoint.
  8. The run() function is called on your servers.

In the (very likely) scenario where the Run doesn’t complete in a single go (e.g. a timeout is hit, a wait is used, your server goes down…) then steps 6 onwards are repeated.
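
Steps 4 and 5 are a fan-out: one incoming event can produce several Runs. A rough sketch of that, with made-up record shapes (`EventRecord` is named in the steps above, but its fields here, along with `JobSubscription`, `RunRecord` and `createRuns`, are assumptions for illustration):

```typescript
// Hypothetical record shapes; the real database schema is richer.
interface EventRecord { id: string; name: string; payload: unknown }
interface JobSubscription { jobId: string; eventName: string; environments: string[] }
interface RunRecord { jobId: string; environment: string; eventId: string }

// Create one Run per subscribed job and environment combination.
function createRuns(event: EventRecord, subscriptions: JobSubscription[]): RunRecord[] {
  const runs: RunRecord[] = [];
  for (const subscription of subscriptions) {
    if (subscription.eventName !== event.name) continue; // job doesn't listen to this event
    for (const environment of subscription.environments) {
      runs.push({ jobId: subscription.jobId, environment, eventId: event.id });
    }
  }
  return runs;
}

const issueOpened: EventRecord = { id: "evt_1", name: "issues.opened", payload: {} };
const createdRuns = createRuns(issueOpened, [
  { jobId: "new-github-issue-reminder", eventName: "issues.opened", environments: ["dev", "prod"] },
  { jobId: "label-new-issues", eventName: "issues.opened", environments: ["prod"] },
  { jobId: "on-push", eventName: "push", environments: ["prod"] },
]);
// createdRuns holds three Runs: two for the reminder job, one for the labelling job.
```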

Tasks, resuming, retrying and idempotent-ness

Inside the run() function you can just use normal code. The code isn’t sent to our servers so if you need to perform a super secret operation like accessing very private data from your database you can do that.

It's classified

run() is called at least once

It’s critical to understand that the run() function will very likely be called more than once. So anything that has side effects, like mutating something, should be idempotent and/or wrapped in a Task.

Idempotent?

Excuse me?

Idempotent is a fancy word that has a disappointingly low Scrabble score (15 points). It means no matter how many times you call something (with the same inputs) the result will be the same. A function that sets name = "Rick Astley" in your database is idempotent. But doing rickrollCount += 1 in your database is not because each time you call it, the result is different from the previous times.
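
The two database examples above, written out as a small sketch. A plain object stands in for the database row, and the function names are invented for the illustration.

```typescript
// A plain object standing in for a database row.
interface Row { name: string; rickrollCount: number }

// Idempotent: calling it again leaves the row exactly as it was.
function setName(row: Row): void {
  row.name = "Rick Astley";
}

// Not idempotent: every extra call changes the result.
function incrementCount(row: Row): void {
  row.rickrollCount += 1;
}

const row: Row = { name: "", rickrollCount: 0 };
setName(row);
setName(row); // a re-run does no harm
incrementCount(row);
incrementCount(row); // a re-run double counts
// row is now { name: "Rick Astley", rickrollCount: 2 }
```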

Tasks

Tasks are very useful for several reasons, and we strongly recommend you use them.

  1. Once they have succeeded (i.e. not thrown an error) the result is stored so they won’t run again. The result is simply retrieved from the Trigger.dev database.
  2. This success storage/retrieval makes them idempotent most of the time. But not fully on their own. If an error is thrown or a network failure happens that Task will get retried (by default). So if you’ve done something that isn’t idempotent you will get unwanted side effects.
  3. If a Task fails it will be retried, and you can configure the retrying behaviour. The default behaviour is retrying with exponential back off.
  4. When you create a Task you are given an idempotency key that you can use. Some APIs support these, like Stripe.
  5. Tasks are logged to the dashboard which gives you great visibility into your jobs.
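
For a feel of what retrying with exponential back off means in practice, here is a minimal sketch of the delay calculation. The option names (`minDelayMs`, `maxDelayMs`, `factor`) and the values are assumptions picked for the example, not the SDK's actual settings or defaults.

```typescript
// Hypothetical option names; check the SDK for the real retry settings.
interface BackoffOptions {
  minDelayMs: number; // delay before the first retry
  maxDelayMs: number; // upper bound so delays don't grow forever
  factor: number; // multiplier applied on each further retry
}

// Delay before retry number `attempt` (1 = first retry).
function backoffDelayMs(attempt: number, options: BackoffOptions): number {
  const delay = options.minDelayMs * Math.pow(options.factor, attempt - 1);
  return Math.min(delay, options.maxDelayMs);
}

const backoff: BackoffOptions = { minDelayMs: 1000, maxDelayMs: 60000, factor: 2 };
const delays = [1, 2, 3, 4].map((attempt) => backoffDelayMs(attempt, backoff));
// delays is [1000, 2000, 4000, 8000]
const capped = backoffDelayMs(10, backoff);
// capped is 60000, because 1000 * 2^9 exceeds maxDelayMs
```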

Now you know where all my great programming jokes come from…

Tasks always have a key as the first parameter. This uniquely identifies that Task inside a run. When the job is re-run the key is used to look up the previous result. Think of it a bit like the React key in a map.
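
Here is a rough, in-memory sketch of that lookup. `TaskStore` is a hypothetical stand-in for the results Trigger.dev keeps in its database, and it is synchronous to keep the example short. It shows two things from the list above: a failed Task stores nothing (so it runs again), and a successful Task is never executed twice.

```typescript
// Hypothetical in-memory stand-in for the Task results stored by Trigger.dev.
class TaskStore {
  private readonly results = new Map<string, unknown>();

  runTask<T>(key: string, fn: () => T): T {
    if (this.results.has(key)) {
      return this.results.get(key) as T; // succeeded before: reuse the result
    }
    const result = fn(); // a thrown error skips the next line, so nothing is stored
    this.results.set(key, result);
    return result;
  }
}

const store = new TaskStore();
let calls = 0;
const flakyFetch = (): string => {
  calls += 1;
  if (calls === 1) throw new Error("network failure");
  return "issue #1";
};

let firstError = "";
try {
  store.runTask("get issue", flakyFetch); // first attempt fails
} catch (error) {
  firstError = error instanceof Error ? error.message : String(error);
}
const retried = store.runTask("get issue", flakyFetch); // runs again and succeeds
const replayed = store.runTask("get issue", flakyFetch); // served from the store
```

After this, `flakyFetch` has been called twice, not three times. If the first attempt had changed something before throwing, the retry would have repeated that change, which is why non-idempotent work still needs care.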

Integrations

You can install Integration packages in your project and you can also create your own. They expose useful API calls as Tasks which have been configured to give a great experience.

The most important properties from the request/response are highlighted in the dashboard.

The run() in detail

You were warned it was going to go deep. Here goes:

How a Run works

  1. The job is prepared, as per the previous diagram.
  2. The run() function gets called.
  3. The io.wait function is called.
    1. A request is made to the Trigger.dev API to notify that a task has been called.
    2. No Task exists so one is created with the status of WAITING. Note you can add delayUntil to any Task to defer it being run until later; io.wait isn’t special.
    3. A Task with the WAITING status causes the SDK to throw a special ResumeWithTaskError. This stops the run from continuing.
    4. An API response is sent with the RESUME_WITH_TASK status.
    5. The Task is updated.
    6. The continue time is added to the scheduler.
    7. 24 hours passes.
    8. The wait is complete so the scheduler notifies that the Run is ready to continue. The Task status is updated.
  4. The job is prepared again. Because this isn’t the first attempt, the state of any Tasks is sent as well.
  5. run() is called again. Any Tasks that were received are added to the cache.
  6. The io.wait() function is called.
    1. This time there is a Task in the cache with the key "wait 24 hours". That Task is COMPLETED so we can continue.
  7. run() continues to the next code.
  8. io.github.getIssue is called, which is from the @trigger.dev/github Integration.
    1. A request is made to the Trigger.dev API to notify that a task has been called.
    2. The Task doesn’t exist, so it’s created with the PENDING state. It’s returned in the response.
    3. Inside the github.getIssue function, the underlying code just wraps the rest.issues.get call from GitHub’s official SDK in runTask() (with some nice properties set so it looks great in the dashboard).
    4. Uh-oh, you’ve done too many requests in a short period using that API key. You’ve hit the GitHub API rate limit, so the response had a 429 status code. GitHub provides information in their response about when you can next retry.
    5. When an error is thrown from runTask(), by default it will retry. The retry behaviour can be controlled and has sensible defaults. In this case the GitHub Integration set the retry options using the info the GitHub API provided. A RetryWithTaskError is thrown (which stops the run from continuing).
    6. An API response is sent with the RETRY_WITH_TASK status.
    7. The Task is updated.
    8. The continue time is added to the scheduler.
    9. Some time passes until the GitHub rate limit has passed.
    10. The wait is complete so the scheduler notifies that the Run is ready to continue. The Task status is updated.
  9. The job is prepared again. Because this isn’t the first attempt, the state of any Tasks is sent as well.
  10. run() is called again. Any Tasks that were received are added to the cache…

  11–15. By this point you should get the idea of how the run() function works. Messages are sent back and forth, re-running happens and reliability is achieved.

  16. When the run() function gets to the end you can return data.
  17. That data is sent in the response, as well as metadata about the Run status.
  18. Finally, the Run is updated in the database (to COMPLETE in this case).
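
The whole stop-and-replay loop can be modelled in a few lines. This is a simplified, synchronous, in-memory sketch and not the SDK's implementation: in reality the pieces talk over HTTP and the Task state lives in the Trigger.dev database. ResumeWithTaskError, RESUME_WITH_TASK, WAITING and COMPLETED come from the steps above; `Io`, `attempt`, `TaskRecord` and the SUCCESS status are hypothetical names.

```typescript
// Simplified model of the replay loop; see the assumptions listed above.
type TaskRecord = { status: "WAITING" | "COMPLETED"; output?: unknown };

class ResumeWithTaskError extends Error {
  constructor(public readonly taskKey: string) {
    super(`Run paused at task "${taskKey}"`);
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, ResumeWithTaskError.prototype);
  }
}

class Io {
  constructor(private readonly tasks: Map<string, TaskRecord>) {}

  // Pauses the run unless a completed Task with this key already exists.
  wait(key: string): void {
    if (this.tasks.get(key)?.status === "COMPLETED") return;
    this.tasks.set(key, { status: "WAITING" });
    throw new ResumeWithTaskError(key);
  }

  // Runs fn once per key; later attempts reuse the stored output.
  runTask<T>(key: string, fn: () => T): T {
    const cached = this.tasks.get(key);
    if (cached?.status === "COMPLETED") return cached.output as T;
    const output = fn();
    this.tasks.set(key, { status: "COMPLETED", output });
    return output;
  }
}

type Attempt =
  | { status: "RESUME_WITH_TASK"; taskKey: string }
  | { status: "SUCCESS"; output: string };

// One call to run() on your server, translated into a response status.
function attempt(run: (io: Io) => string, tasks: Map<string, TaskRecord>): Attempt {
  try {
    return { status: "SUCCESS", output: run(new Io(tasks)) };
  } catch (error) {
    if (error instanceof ResumeWithTaskError) {
      return { status: "RESUME_WITH_TASK", taskKey: error.taskKey };
    }
    throw error;
  }
}

// Example run(): a task, then a wait, then another task.
let apiCalls = 0;
const run = (io: Io): string => {
  const issue = io.runTask("get issue", () => {
    apiCalls += 1;
    return "issue #1";
  });
  io.wait("wait 24 hours");
  return io.runTask("remind", () => `Reminder sent for ${issue}`);
};

const tasks = new Map<string, TaskRecord>();
const first = attempt(run, tasks); // pauses at the wait
tasks.set("wait 24 hours", { status: "COMPLETED" }); // what the scheduler does later
const second = attempt(run, tasks); // replays from the top, then finishes
```

The first attempt stops at the wait. The second starts run() from the top, but "get issue" comes straight from the stored Tasks, so `apiCalls` stays at 1.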

Too Long; Already Finished Reading

Trigger.dev provides resilient background jobs for serverless environments.

  • The API and SDK allow you to write jobs in your codebase that run on your servers.
  • We’re Postgres maximalists, like Supabase.
  • The state of a Run and its Tasks/Subtasks are stored.
  • The run() function is called multiple times. This can be caused by waits, errors, server timeouts, network failures, server outages…
  • Making sure your code is idempotent is important. This is also the case in a lot of situations outside of background jobs.
  • Tasks are important as they create resume points for rerunning and make Runs easier to debug in the dashboard.

You don’t need to understand how it works to start writing background jobs. But hopefully this was a fun deep dive. If you are excited by this, we’d love for you to give Trigger.dev a try (cloud or self-hosted) or you can contribute to the project.