Automatic ratelimit retries

How we improved the SDK's automatic retries when an API rate limit is exceeded.

Eric Allam

CTO, Trigger.dev

We've shipped an improvement to our SDK retry logic when calling functions like tasks.trigger, runs.retrieve, etc. Previously, if the SDK received a rate limit error from the API, it would use a simple exponential backoff strategy to retry the request.

It went a little something like this:

  • Request: GET /runs/run_1233
  • Response: 429 Rate Limit Error
  • Retry GET /runs/run_1233 after 1 second
  • Response: 429 Rate Limit Error
  • Retry GET /runs/run_1233 after 2 seconds
  • Response: 429 Rate Limit Error
  • Retry GET /runs/run_1233 after 4 seconds
  • Response: 429 Rate Limit Error
  • Throw an error
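
The schedule above can be reproduced with a few lines of TypeScript. This is only an illustrative sketch: `exponentialDelaysMs` is a hypothetical helper, not part of the SDK, and the 1 second base and doubling factor are read off the example trace rather than taken from the SDK source.

```typescript
// Hypothetical helper (not an SDK function): returns the wait before each retry.
// Assumes a 1 second base delay that doubles on every retry, as in the trace above.
function exponentialDelaysMs(retries: number, baseMs = 1000, factor = 2): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(baseMs * Math.pow(factor, i));
  }
  return delays;
}

// Three retries, matching the 1s, 2s, 4s waits in the list above.
const oldSchedule = exponentialDelaysMs(3);
console.log(oldSchedule);
```

Nothing in this schedule depends on when the rate limit actually resets, which is why every retry in the trace could land inside the same limited window.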

Luckily, the Trigger.dev API server uses standard ratelimit headers to communicate rate limit status back to the client, including the x-ratelimit-reset header, which tells the client when the rate limit will reset. The SDK now uses that header to wait until the rate limit resets before retrying the request:

  • Request: GET /runs/run_1233
  • Response: 429 Rate Limit Error
  • Wait until x-ratelimit-reset time
  • Retry GET /runs/run_1233
  • Response: 200 OK
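
As a rough sketch of the "wait until x-ratelimit-reset time" step, the helper below turns the header value into a delay. It assumes the header carries a Unix timestamp in milliseconds, which this post does not confirm, and `msUntilReset` is a hypothetical name rather than an SDK function.

```typescript
// Hypothetical helper (not an SDK function).
// Assumption: x-ratelimit-reset is a Unix timestamp in milliseconds.
function msUntilReset(resetHeader: string | null, nowMs: number): number | undefined {
  if (resetHeader === null) return undefined; // header missing: caller falls back to backoff
  const resetAtMs = Number.parseInt(resetHeader, 10);
  if (Number.isNaN(resetAtMs)) return undefined; // header unparsable: same fallback
  // Never return a negative wait if the reset time has already passed.
  return Math.max(0, resetAtMs - nowMs);
}

// Example with fixed numbers so the result does not depend on the clock.
const waitMs = msUntilReset("5000", 2000);
console.log(waitMs);
```

With a fetch-style response, the inputs would be something like `response.headers.get("x-ratelimit-reset")` and `Date.now()`. Returning `undefined` when the header is missing or unparsable leaves room for the caller to fall back to ordinary backoff.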

We've also updated our API server to use a different rate limit strategy. Previously, we used the Sliding Window strategy, but that could lead to long periods of rate limiting if a client made a burst of requests. We've now switched to the Token Bucket strategy, which should provide shorter delays between requests.
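
For readers unfamiliar with the Token Bucket strategy, here is a minimal, generic sketch. It is not the Trigger.dev server's implementation; the class, its capacity, and its refill rate are all illustrative assumptions.

```typescript
// Generic token bucket, for illustration only (not the Trigger.dev server code).
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full, so a short burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request is allowed, false if it should be rate limited.
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    // Refill continuously, but never beyond the bucket's capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// A bucket holding 2 tokens that refills at 1 token per second.
const bucket = new TokenBucket(2, 1, 0);
const results = [
  bucket.tryConsume(0), // allowed, uses the first token
  bucket.tryConsume(0), // allowed, uses the second token
  bucket.tryConsume(0), // rejected, bucket is empty
  bucket.tryConsume(1000), // allowed, one token has refilled after 1 second
];
console.log(results);
```

Because tokens refill continuously, a client that exhausts the bucket in a burst only has to wait for the next token to be added before its next request is allowed.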

Task retries

We've also updated the retry logic for tasks that fail with a rate limit error from our SDK. As an example, consider this task, which fetches a run:


import { runs, task } from "@trigger.dev/sdk/v3";

export const taskRetries = task({
  id: "task-retries",
  retry: {
    maxAttempts: 5,
    minTimeoutInMs: 500,
    maxTimeoutInMs: 30_000,
    factor: 1.8,
  },
  run: async (payload: { runId: string }, { ctx }) => {
    // We override the default retry logic so this call will throw a RateLimitError
    await runs.retrieve(payload.runId, {
      retry: {
        maxAttempts: 1,
      },
    });
  },
});

When the above task runs, and the runs.retrieve call fails with a rate limit error, the task will now wait until the rate limit resets before attempting to retry the task.

Custom request options

By default, the SDK will retry requests up to 3 times, with an exponential backoff delay between retries.

You can customize the retry behavior by passing a requestOptions option to the configure function:


import { configure } from "@trigger.dev/sdk/v3";

configure({
  requestOptions: {
    retry: {
      maxAttempts: 5,
      minTimeoutInMs: 1000,
      maxTimeoutInMs: 5000,
      factor: 1.8,
      randomize: true,
    },
  },
});
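
To get a feel for what these options mean in practice, the sketch below computes the delay before each retry. It assumes the common formula of `minTimeoutInMs * factor^(retry - 1)` capped at `maxTimeoutInMs`, and it ignores `randomize`. The SDK's exact calculation may differ, and `retryDelaysMs` is a hypothetical helper.

```typescript
// Field names mirror the retry options shown above; the formula itself is an assumption.
interface RetryOptions {
  maxAttempts: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  factor: number;
}

// Hypothetical helper (not an SDK function): delay before each retry, without jitter.
function retryDelaysMs(opts: RetryOptions): number[] {
  const delays: number[] = [];
  // maxAttempts counts the first attempt, so there are maxAttempts - 1 retries.
  for (let retry = 1; retry < opts.maxAttempts; retry++) {
    const raw = opts.minTimeoutInMs * Math.pow(opts.factor, retry - 1);
    delays.push(Math.round(Math.min(opts.maxTimeoutInMs, raw)));
  }
  return delays;
}

const delays = retryDelaysMs({
  maxAttempts: 5,
  minTimeoutInMs: 1000,
  maxTimeoutInMs: 5000,
  factor: 1.8,
});
console.log(delays);
```

Under these assumptions the fourth retry would be capped by `maxTimeoutInMs`, since 1000 × 1.8³ is above 5000. With `randomize: true`, the real delays would include jitter and would not match these values exactly.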

All SDK functions also take a requestOptions parameter as the last argument, which can be used to customize the request options. You can use this to disable retries for a specific request:


import { runs } from "@trigger.dev/sdk/v3";

async function main() {
  const run = await runs.retrieve("run_1234", {
    retry: {
      maxAttempts: 1, // Disable retries
    },
  });
}

NOTE

When running inside a task, the SDK ignores customized retry options for certain functions (e.g., task.trigger, task.batchTrigger), and uses retry settings optimized for task execution.

SDK OpenTelemetry spans

The SDK now outputs OpenTelemetry spans for all SDK functions (previously we only emitted spans for task triggering). This includes any retry waits.

The following example tells the story. Note that I ran this against my local Trigger.dev instance and configured the API server to randomly respond with a 500 response 25% of the time:


import { logger, runs, task, wait } from "@trigger.dev/sdk/v3";

export const sdkSpans = task({
  id: "sdk-spans",
  run: async () => {
    logger.log("Starting spans subtask without a runId");
    const handle = await sdkSpansSubtask.trigger({});

    logger.log("Starting spans subtask with a runId", { runId: handle.id });
    await sdkSpansSubtask.triggerAndWait({ runId: handle.id });
  },
});

export const sdkSpansSubtask = task({
  id: "sdk-spans-subtask",
  run: async (payload: { runId?: string }) => {
    await wait.for({ seconds: 5 });

    // Only the second invocation receives a runId and exercises the runs API
    if (payload.runId) {
      logger.log("Retrieving run", { runId: payload.runId });
      const run = await runs.retrieve(payload.runId);

      logger.log("Cancelling run", { runId: run.id });
      await runs.cancel(run.id);

      logger.log("Replaying run", { runId: run.id });
      await runs.replay(run.id);
    }

    await wait.for({ seconds: 30 });
  },
});

As you can see in the screenshot, all calls to the SDK functions are logged, including spans for the retries:

SDK spans with waits
