Changelog #70
Automatic ratelimit retries
How we improved automatic ratelimit retries when rate limits are exceeded.
CTO, Trigger.dev
We've shipped an improvement to our SDK retry logic when calling functions like `tasks.trigger`, `runs.retrieve`, etc. Previously, if the SDK received a rate limit error from the API, it would use a simple exponential backoff strategy to retry the request.
It went a little something like this:
- Request: GET /runs/run_1233
- Response: 429 Rate Limit Error
- Retry GET /runs/run_1233 after 1 second
- Response: 429 Rate Limit Error
- Retry GET /runs/run_1233 after 2 seconds
- Response: 429 Rate Limit Error
- Retry GET /runs/run_1233 after 4 seconds
- Response: 429 Rate Limit Error
- Throw an error
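The old behavior listed above can be sketched as a retry loop with doubling delays. This is a minimal illustration, not the SDK's actual code: `backoffDelayMs`, `retryWithBackoff`, and the injected `attempt` and `sleep` callbacks are hypothetical names, and the 1 second base delay is taken from the sequence above.

```typescript
// Hypothetical helper: delay before retry number `retry` (1-based), doubling each time.
function backoffDelayMs(retry: number, baseMs: number = 1000): number {
  return baseMs * Math.pow(2, retry - 1);
}

// Generic retry loop. `attempt` resolves to an HTTP status code, and `sleep` is
// injected so the sketch can be exercised without real waiting.
async function retryWithBackoff(
  attempt: () => Promise<number>,
  maxRetries: number,
  sleep: (ms: number) => Promise<void>,
): Promise<number> {
  for (let retry = 0; ; retry++) {
    const status = await attempt();
    if (status !== 429) return status; // not rate limited, so stop retrying
    if (retry === maxRetries) {
      throw new Error("Rate limit exceeded after retries");
    }
    await sleep(backoffDelayMs(retry + 1)); // 1s, 2s, 4s, ...
  }
}
```

The weakness is that the delays are blind guesses: the client has no idea whether the limit has actually reset when it retries.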
Luckily, the Trigger.dev API server uses standard ratelimit headers to communicate the rate limit status back to the client, including the `x-ratelimit-reset` header that tells the client when the rate limit will reset. The SDK now uses that header to wait until the rate limit resets before retrying the request:
- Request: GET /runs/run_1233
- Response: 429 Rate Limit Error
- Wait until `x-ratelimit-reset` time
- Retry GET /runs/run_1233
- Response: 200 OK
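To make the "wait until reset" step concrete, here is a small sketch of how a client could turn the header into a delay. It assumes the header value is a Unix timestamp in milliseconds, which this post does not specify, and `msUntilReset` is a hypothetical helper rather than an SDK function.

```typescript
// Assumption: `x-ratelimit-reset` carries a Unix timestamp in milliseconds.
// Returns the time to wait in ms, or undefined if the header is missing or unparseable.
function msUntilReset(resetHeader: string | null, nowMs: number): number | undefined {
  if (resetHeader === null || resetHeader.trim() === "") return undefined;
  const resetAtMs = Number(resetHeader);
  if (!Number.isFinite(resetAtMs)) return undefined;
  // Never return a negative delay if the reset time has already passed.
  return Math.max(0, resetAtMs - nowMs);
}
```

With `fetch`, the header value would come from `response.headers.get("x-ratelimit-reset")`, which returns `null` when the header is absent. In that case a client could fall back to exponential backoff.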
We've also updated our API server to use a different rate limit strategy. Previously, we used the Sliding Window strategy, but that could lead to long periods of rate limiting if a client made a burst of requests. We've now switched to the Token Bucket strategy, which should provide shorter delays between requests.
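For readers unfamiliar with the Token Bucket strategy, the sketch below shows the general idea. It is not the Trigger.dev server implementation: the `TokenBucket` class, its capacity, and its refill rate are all hypothetical. The bucket starts full, each request consumes one token, and tokens refill continuously over time.

```typescript
class TokenBucket {
  private capacity: number;
  private refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns true and consumes one token if the request is allowed.
  tryRemove(nowMs: number): boolean {
    // Top up the bucket for the time elapsed since the last call, capped at capacity.
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false; // rate limited
    this.tokens -= 1;
    return true;
  }
}
```

After a burst empties the bucket, the client only has to wait for the next token to refill instead of waiting for a whole window of earlier requests to expire.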
Task retries
We've also updated the retry logic for tasks that fail with a rate limit error from our SDK. As an example, let's imagine this task that fetches a run:
```ts
import { runs, task } from "@trigger.dev/sdk/v3";

export const taskRetries = task({
  id: "task-retries",
  retry: {
    maxAttempts: 5,
    minTimeoutInMs: 500,
    maxTimeoutInMs: 30_000,
    factor: 1.8,
  },
  run: async (payload: { runId: string }, { ctx }) => {
    // We override the default retry logic so this call will throw a RateLimitError
    await runs.retrieve(payload.runId, {
      retry: {
        maxAttempts: 1,
      },
    });
  },
});
```
When the above task runs and the `runs.retrieve` call fails with a rate limit error, the task will now wait until the rate limit resets before attempting to retry the task.
Custom request options
By default, the SDK will retry requests up to 3 times, with an exponential backoff delay between retries.
You can customize the retry behavior by passing a `requestOptions` option to the `configure` function:
```ts
import { configure } from "@trigger.dev/sdk/v3";

configure({
  requestOptions: {
    retry: {
      maxAttempts: 5,
      minTimeoutInMs: 1000,
      maxTimeoutInMs: 5000,
      factor: 1.8,
      randomize: true,
    },
  },
});
```
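To get a feel for what these options mean, the sketch below computes a plausible delay schedule from them. The formula is an assumption modeled on common exponential backoff implementations, not a statement of what the SDK does internally. `RetryOptions` and `retryDelays` are hypothetical names, and `random` is injectable so the output can be made deterministic.

```typescript
interface RetryOptions {
  maxAttempts: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  factor: number;
  randomize: boolean;
}

// Returns the wait before each retry under the assumed formula:
// min(maxTimeoutInMs, jitter * minTimeoutInMs * factor^(attempt - 1)).
function retryDelays(opts: RetryOptions, random: () => number = Math.random): number[] {
  const delays: number[] = [];
  // maxAttempts counts the first try, so there are at most maxAttempts - 1 waits.
  for (let attempt = 1; attempt < opts.maxAttempts; attempt++) {
    // Assumed jitter: a multiplier between 1 and 2 when randomize is on.
    const jitter = opts.randomize ? 1 + random() : 1;
    const raw = jitter * opts.minTimeoutInMs * Math.pow(opts.factor, attempt - 1);
    delays.push(Math.round(Math.min(opts.maxTimeoutInMs, raw)));
  }
  return delays;
}
```

Under these assumptions, the options above with `randomize` turned off would give waits of 1000, 1800, 3240, and 5000 ms, with the last one capped by `maxTimeoutInMs`.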
All SDK functions also accept a `requestOptions` parameter as their last argument. For example, you can use it to disable retries for a specific request:
```ts
import { runs } from "@trigger.dev/sdk/v3";

async function main() {
  const run = await runs.retrieve("run_1234", {
    retry: {
      maxAttempts: 1, // Disable retries
    },
  });
}
```
NOTE
When running inside a task, the SDK ignores customized retry options for certain functions (e.g., `task.trigger`, `task.batchTrigger`), and uses retry settings optimized for task execution.
SDK OpenTelemetry spans
The SDK now outputs OpenTelemetry spans for all SDK functions (previously we only emitted spans for task triggering). This includes any retry waits.
The following example tells the story. Note that I ran this against my local Trigger.dev instance and configured the API server to randomly respond with a 500 response 25% of the time:
```ts
import { logger, runs, task, wait } from "@trigger.dev/sdk/v3";

export const sdkSpans = task({
  id: "sdk-spans",
  run: async () => {
    logger.log("Starting spans subtask without a runId");
    const handle = await sdkSpansSubtask.trigger({});
    logger.log("Starting spans subtask with a runId", { runId: handle.id });
    await sdkSpansSubtask.triggerAndWait({ runId: handle.id });
  },
});

export const sdkSpansSubtask = task({
  id: "sdk-spans-subtask",
  run: async (payload: { runId?: string }) => {
    await wait.for({ seconds: 5 });

    if (payload.runId) {
      logger.log("Retrieving run", { runId: payload.runId });
      const run = await runs.retrieve(payload.runId);
      logger.log("Cancelling run", { runId: run.id });
      await runs.cancel(run.id);
      logger.log("Replaying run", { runId: run.id });
      await runs.replay(run.id);
    }

    await wait.for({ seconds: 30 });
  },
});
```
As you can see in the screenshot, all calls to the SDK functions are logged and include spans for the retries: