Major v3 reliability improvements

We strongly recommend you upgrade to the latest version of the v3 SDK because we've made some major improvements to run execution reliability.

Run this command in your repo to easily upgrade:


npx trigger.dev@beta update

Run attempts

Each v3 run has at least one attempt. An attempt is a single execution of your code. If the attempt succeeds, the run succeeds. If it fails, the run is retried with further attempts until one succeeds or the maximum number of attempts is reached.

We recommend reading the Errors & Retrying guide. Applying its guidance will make your runs highly reliable.
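
To make the attempt semantics concrete, here is a minimal sketch of a run that keeps making attempts until one succeeds or the limit is reached. It is an illustration only, not the Trigger.dev SDK: `executeRun`, `RunOutcome` and `maxAttempts` are hypothetical names, and real retries would normally include delays between attempts.

```typescript
// Hypothetical sketch of attempt semantics; not the Trigger.dev SDK API.
interface RunOutcome<T> {
  succeeded: boolean;
  attempts: number; // how many attempts were made
  value?: T; // result of the successful attempt, if any
  lastError?: unknown; // error from the final failed attempt, if any
}

// Executes `task` up to `maxAttempts` times. The run succeeds as soon as
// one attempt succeeds, and fails once the attempt limit is reached.
function executeRun<T>(
  task: (attempt: number) => T,
  maxAttempts: number,
): RunOutcome<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const value = task(attempt);
      return { succeeded: true, attempts: attempt, value };
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  return { succeeded: false, attempts: maxAttempts, lastError };
}
```

For example, a task that throws on its first two attempts and returns on the third yields a successful run with `attempts` equal to 3, provided `maxAttempts` is at least 3.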

What's changed and why is it better?

When we first shipped v3 this is how attempts worked:

  1. Your run is taken from the queue on the platform.
  2. We create a run attempt on the platform.
  3. We spin up a new worker to execute your run.
  4. The worker runs (hopefully).
  5. The attempt succeeds or fails.
  6. Failed attempts go back into the queue.
  7. Repeat 1–6.

This had a couple of problems:

  • If the worker failed to start, the attempt would hang. We automatically fail attempts that haven't communicated with the platform recently, so the run would eventually try again.
  • Each failed attempt needed to go back into the queue, which is inefficient and puts load on the platform.
  • If an attempt didn't start, it would still count against your run's attempt limit.
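
The problems listed above can be illustrated with a toy model of the original flow. This is a sketch under stated assumptions, not platform code: `simulateOldFlow`, `FlowStats` and the parameters are hypothetical, and worker start failures and code failures are modelled as simple counters.

```typescript
// Hypothetical model of the original flow; none of these names are real APIs.
interface FlowStats {
  succeeded: boolean;
  attemptsCreated: number; // attempts counted against the run's limit
  queueTrips: number; // times the run was taken from the queue
}

// `startFailures`: how many worker spin-ups fail before one works.
// `codeFailures`: how many executions of your code fail before one succeeds.
function simulateOldFlow(
  startFailures: number,
  codeFailures: number,
  maxAttempts: number,
): FlowStats {
  let attemptsCreated = 0;
  let queueTrips = 0;
  let remainingStartFailures = startFailures;
  let remainingCodeFailures = codeFailures;

  while (attemptsCreated < maxAttempts) {
    queueTrips++; // 1. run is taken from the queue
    attemptsCreated++; // 2. attempt is created before any worker exists
    if (remainingStartFailures > 0) {
      remainingStartFailures--; // 3. worker never starts: hung attempt
      continue; // 6. attempt is failed and the run is requeued
    }
    if (remainingCodeFailures > 0) {
      remainingCodeFailures--; // 5. attempt fails
      continue; // 6. back into the queue
    }
    return { succeeded: true, attemptsCreated, queueTrips };
  }
  return { succeeded: false, attemptsCreated, queueTrips };
}
```

In this model, `simulateOldFlow(2, 1, 3)` exhausts all three attempts without the code ever succeeding, because two of them were spent on workers that never started.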

We've moved creating run attempts to the worker, so now:

  1. Your run is taken from the queue on the platform.
  2. We spin up a new worker to execute your run.
  3. The worker creates a run attempt via the platform.
  4. The worker runs (hopefully).
  5. The attempt succeeds or fails.
  6. Failed attempts are retried by the worker, with no need to requeue them.
  7. Repeat 3–6.

This is far more reliable because attempts are only created when the worker is actually running. It's also more efficient because we don't need to requeue failed attempts. Win win.
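
A toy model of the worker-side flow shows both effects. As before, this is only a sketch: `simulateNewFlow` and `FlowStats` are hypothetical names, and the model assumes that a run whose worker fails to start is simply picked up from the queue again, which the text above does not spell out.

```typescript
// Hypothetical model of the worker-side flow; none of these names are real APIs.
interface FlowStats {
  succeeded: boolean;
  attemptsCreated: number; // attempts counted against the run's limit
  queueTrips: number; // times the run was taken from the queue
}

// `startFailures`: how many worker spin-ups fail before one works.
// `codeFailures`: how many executions of your code fail before one succeeds.
function simulateNewFlow(
  startFailures: number,
  codeFailures: number,
  maxAttempts: number,
): FlowStats {
  let queueTrips = 0;
  let remainingStartFailures = startFailures;
  let workerRunning = false;

  // Steps 1-2: dequeue the run and spin up a worker. No attempt exists yet,
  // so a failed start costs nothing against the attempt limit.
  // ASSUMPTION: a failed start leads to the run being dequeued again.
  while (!workerRunning) {
    queueTrips++;
    if (remainingStartFailures > 0) {
      remainingStartFailures--;
    } else {
      workerRunning = true;
    }
  }

  // Steps 3-6: the running worker creates attempts and retries in place.
  let attemptsCreated = 0;
  let remainingCodeFailures = codeFailures;
  while (attemptsCreated < maxAttempts) {
    attemptsCreated++; // 3. worker creates the attempt via the platform
    if (remainingCodeFailures > 0) {
      remainingCodeFailures--; // 5. attempt fails
      continue; // 6. retried by the worker, no requeue
    }
    return { succeeded: true, attemptsCreated, queueTrips };
  }
  return { succeeded: false, attemptsCreated, queueTrips };
}
```

In this model, `simulateNewFlow(2, 1, 3)` succeeds on its second attempt because the two failed worker starts consume no attempts, and `simulateNewFlow(0, 2, 3)` retries twice while touching the queue only once.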