Automatic ratelimit retries

Eric Allam


We've shipped an improvement to our SDK retry logic when calling functions like tasks.trigger, runs.retrieve, etc. Previously, if the SDK received a rate limit error from the API, it would use a simple exponential backoff strategy to retry the request.

It went a little something like this:

  • Request: GET /runs/run_1233
  • Response: 429 Rate Limit Error
  • Retry GET /runs/run_1233 after 1 second
  • Response: 429 Rate Limit Error
  • Retry GET /runs/run_1233 after 2 seconds
  • Response: 429 Rate Limit Error
  • Retry GET /runs/run_1233 after 4 seconds
  • Response: 429 Rate Limit Error
  • Throw an error
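To make the old schedule concrete, here is a minimal sketch that reproduces the delays from the list above. It is not the SDK's actual code: `backoffDelaysMs` and its parameters are hypothetical names, and the 1 second base and doubling factor are simply read off the example.

```typescript
// Hypothetical helper (not part of the SDK): the delay in milliseconds
// before each retry under plain exponential backoff.
function backoffDelaysMs(retries: number, baseMs = 1000, factor = 2): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    // Retry 0 waits baseMs, retry 1 waits baseMs * factor, and so on.
    delays.push(baseMs * Math.pow(factor, i));
  }
  return delays;
}

console.log(backoffDelaysMs(3)); // [1000, 2000, 4000]
```

Note that nothing in this schedule depends on when the rate limit actually resets, which is why all three retries in the example could still land inside the limited window.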

Luckily, the Trigger.dev API server uses standard ratelimit headers to communicate the rate limit status back to the client, including the x-ratelimit-reset header that tells the client when the rate limit will reset. This update now uses that header to wait until the rate limit resets before retrying the request:

  • Request: GET /runs/run_1233
  • Response: 429 Rate Limit Error
  • Wait until x-ratelimit-reset time
  • Retry GET /runs/run_1233
  • Response: 200 OK
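The "wait until reset" step boils down to turning the header into a delay. The sketch below shows one way that could look. It assumes the `x-ratelimit-reset` value is a Unix timestamp in milliseconds, which this post does not specify, and `msUntilReset` is a hypothetical helper rather than an SDK function.

```typescript
// Assumption: x-ratelimit-reset carries a Unix timestamp in milliseconds.
// Returns how long to wait before retrying, or undefined when the header
// is missing or unusable (a caller could then fall back to backoff).
function msUntilReset(
  headers: Record<string, string | undefined>,
  nowMs: number = Date.now()
): number | undefined {
  const raw = headers["x-ratelimit-reset"];
  if (raw === undefined || raw.trim() === "") return undefined;

  const resetAtMs = Number(raw);
  if (!Number.isFinite(resetAtMs)) return undefined;

  // Never return a negative wait if the reset time has already passed.
  return Math.max(0, resetAtMs - nowMs);
}

// With the clock at 2000 ms and a reset at 5000 ms, the wait is 3000 ms.
console.log(msUntilReset({ "x-ratelimit-reset": "5000" }, 2000)); // 3000
```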

We've also updated our API server to use a different rate limit strategy. Previously, we used the Sliding Window strategy, but that could lead to long periods of rate limiting if a client made a burst of requests. We've now switched to the Token Bucket strategy, which should provide shorter delays between requests.
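For readers who haven't met the Token Bucket strategy, here is a small generic sketch of the idea. It is not our API server's implementation, and the capacity and refill rate below are made-up numbers for illustration. The clock is passed in so the behavior is easy to follow.

```typescript
// Generic token bucket: each request spends one token, and tokens
// refill continuously over time up to a fixed capacity.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request is allowed, false if it should get a 429.
  tryRemove(nowMs: number): boolean {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Illustrative numbers only: a burst of 2 is allowed, refilling 1 token per second.
const bucket = new TokenBucket(2, 1, 0);
console.log(bucket.tryRemove(0)); // true
console.log(bucket.tryRemove(0)); // true
console.log(bucket.tryRemove(0)); // false (bucket is empty)
console.log(bucket.tryRemove(1000)); // true (one token refilled after 1 second)
```

The point to notice is the last line: after a burst empties the bucket, a client only has to wait for the next token to refill rather than for a whole window to pass.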

Task retries

We've also updated the retry logic for tasks that fail with a rate limit error from our SDK. As an example, let's imagine this task that fetches a run:


import { runs, task } from "@trigger.dev/sdk/v3";

export const taskRetries = task({
  id: "task-retries",
  retry: {
    maxAttempts: 5,
    minTimeoutInMs: 500,
    maxTimeoutInMs: 30_000,
    factor: 1.8,
  },
  run: async (payload: { runId: string }, { ctx }) => {
    // We override the default retry logic so this call will throw a RateLimitError
    await runs.retrieve(payload.runId, {
      retry: {
        maxAttempts: 1,
      },
    });
  },
});

When the above task runs, and the runs.retrieve call fails with a rate limit error, the task will now wait until the rate limit resets before attempting to retry the task.
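One way to picture the decision is as a function that picks the delay before the next attempt. The sketch below is a guess at the shape of that logic, not the actual implementation: `nextTaskDelayMs` is a hypothetical name, and both the backoff formula and the idea that `maxAttempts` still caps rate-limited retries are assumptions.

```typescript
interface TaskRetryOptions {
  maxAttempts: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  factor: number;
}

// Hypothetical: choose the delay before the next task attempt.
// `attempt` is the 1-based number of the attempt that just failed.
// Returns undefined when no attempts are left.
function nextTaskDelayMs(
  retry: TaskRetryOptions,
  attempt: number,
  rateLimitResetAtMs: number | undefined,
  nowMs: number
): number | undefined {
  if (attempt >= retry.maxAttempts) return undefined;

  if (rateLimitResetAtMs !== undefined) {
    // Rate limited: wait for the reset instead of following the backoff curve.
    return Math.max(0, rateLimitResetAtMs - nowMs);
  }

  // Assumed backoff formula: minTimeoutInMs * factor^(attempt - 1), capped at maxTimeoutInMs.
  const backoff = retry.minTimeoutInMs * Math.pow(retry.factor, attempt - 1);
  return Math.min(retry.maxTimeoutInMs, backoff);
}

const retry = { maxAttempts: 5, minTimeoutInMs: 500, maxTimeoutInMs: 30_000, factor: 1.8 };

// An ordinary failure on the first attempt follows the backoff settings.
console.log(nextTaskDelayMs(retry, 1, undefined, 10_000)); // 500
// A rate limit failure waits for the reset, here 2 seconds away.
console.log(nextTaskDelayMs(retry, 1, 12_000, 10_000)); // 2000
```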

Custom request options

By default, the SDK will retry requests up to 3 times, with an exponential backoff delay between retries.

You can customize the retry behavior by passing a requestOptions option to the configure function:


import { configure } from "@trigger.dev/sdk/v3";

configure({
  requestOptions: {
    retry: {
      maxAttempts: 5,
      minTimeoutInMs: 1000,
      maxTimeoutInMs: 5000,
      factor: 1.8,
      randomize: true,
    },
  },
});

All SDK functions also take a requestOptions parameter as the last argument, which can be used to customize the request options. You can use this to disable retries for a specific request:


import { runs } from "@trigger.dev/sdk/v3";

async function main() {
  const run = await runs.retrieve("run_1234", {
    retry: {
      maxAttempts: 1, // Disable retries
    },
  });
}
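If it helps to think about how the two levels of options relate, the sketch below models per-request options layered over global ones. This is an illustration of the precedence implied above, not the SDK's code: `resolveRetry` is a hypothetical helper, and apart from `maxAttempts: 3` the default values are placeholders rather than the SDK's real defaults.

```typescript
interface RequestRetryOptions {
  maxAttempts?: number;
  minTimeoutInMs?: number;
  maxTimeoutInMs?: number;
  factor?: number;
  randomize?: boolean;
}

// Placeholder defaults. Only maxAttempts: 3 is stated in this post;
// the other values are assumptions made for the sake of the example.
const assumedDefaults: Required<RequestRetryOptions> = {
  maxAttempts: 3,
  minTimeoutInMs: 1000,
  maxTimeoutInMs: 30_000,
  factor: 2,
  randomize: false,
};

// Hypothetical: per-request options win over global options,
// which in turn win over the defaults.
function resolveRetry(
  globalOptions?: RequestRetryOptions,
  perRequest?: RequestRetryOptions
): Required<RequestRetryOptions> {
  return { ...assumedDefaults, ...globalOptions, ...perRequest };
}

const resolved = resolveRetry({ maxAttempts: 5, factor: 1.8 }, { maxAttempts: 1 });
console.log(resolved.maxAttempts); // 1 (the per-request value)
console.log(resolved.factor); // 1.8 (inherited from the global options)
```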

NOTE

When running inside a task, the SDK ignores customized retry options for certain functions (e.g., task.trigger, task.batchTrigger), and uses retry settings optimized for task execution.

SDK OpenTelemetry spans

The SDK now outputs OpenTelemetry spans for all SDK functions (previously we only emitted spans for task triggering). This includes any retry waits.

The following example tells the story. Note that I ran this against my local Trigger.dev instance and configured the API server to randomly respond with a 500 response 25% of the time:


import { logger, runs, task, wait } from "@trigger.dev/sdk/v3";

export const sdkSpans = task({
  id: "sdk-spans",
  run: async () => {
    logger.log("Starting spans subtask without a runId");
    const handle = await sdkSpansSubtask.trigger({});
    logger.log("Starting spans subtask with a runId", { runId: handle.id });
    await sdkSpansSubtask.triggerAndWait({ runId: handle.id });
  },
});

export const sdkSpansSubtask = task({
  id: "sdk-spans-subtask",
  run: async (payload: { runId?: string }) => {
    await wait.for({ seconds: 5 });

    if (payload.runId) {
      logger.log("Retrieving run", { runId: payload.runId });
      const run = await runs.retrieve(payload.runId);
      logger.log("Cancelling run", { runId: run.id });
      await runs.cancel(run.id);
      logger.log("Replaying run", { runId: run.id });
      await runs.replay(run.id);
    }

    await wait.for({ seconds: 30 });
  },
});

As you can see in the screenshot, all calls to the SDK functions are logged and include spans for the retries:

SDK spans with waits