July 9, 2024

Automatic ratelimit retries

How we improved automatic ratelimit retries when rate limits are exceeded.

CTO, Trigger.dev

We've shipped an improvement to our SDK's retry logic for calls to functions like tasks.trigger and runs.retrieve. Previously, if the SDK received a rate limit error from the API, it retried the request using a simple exponential backoff strategy, regardless of when the rate limit would actually reset.

It went a little something like this:

Request: GET /runs/run_1233
Response: 429 Rate Limit Error
Retry GET /runs/run_1233 after 1 second
Response: 429 Rate Limit Error
Retry GET /runs/run_1233 after 2 seconds
Response: 429 Rate Limit Error
Retry GET /runs/run_1233 after 4 seconds
Response: 429 Rate Limit Error
Throw an error

Luckily, the Trigger.dev API server uses standard ratelimit headers to communicate the rate limit status back to the client, including the x-ratelimit-reset header, which tells the client when the rate limit will reset. The SDK now uses that header to wait until the rate limit resets before retrying the request:

Request: GET /runs/run_1233
Response: 429 Rate Limit Error
Wait until x-ratelimit-reset time
Retry GET /runs/run_1233
Response: 200 OK
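
As a rough sketch of the idea, the wait can be derived directly from the header value. The helper name delayUntilReset and the fallback delay are hypothetical, and the sketch assumes x-ratelimit-reset carries a Unix timestamp in milliseconds; none of this is taken from the SDK source:

```typescript
// Hypothetical helper, not the SDK's actual implementation.
// Assumption: x-ratelimit-reset is a Unix timestamp in milliseconds.
function delayUntilReset(
  resetHeader: string | null,
  nowMs: number,
  fallbackMs: number = 1_000
): number {
  // Missing or empty header: fall back to a fixed delay.
  if (resetHeader === null || resetHeader.trim() === "") {
    return fallbackMs;
  }

  const resetAtMs = Number(resetHeader);

  // Unparseable header: fall back rather than waiting for NaN milliseconds.
  if (!Number.isFinite(resetAtMs)) {
    return fallbackMs;
  }

  // Never return a negative delay if the reset time has already passed.
  return Math.max(0, resetAtMs - nowMs);
}
```

A caller would sleep for the returned number of milliseconds and then reissue the request once, instead of guessing with a backoff schedule.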

We've also updated our API server to use a different rate limit strategy. Previously, we used the Sliding Window strategy, but that could lead to long periods of rate limiting if a client made a burst of requests. We've now switched to the Token Bucket strategy, which should provide shorter delays between requests.
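
For readers unfamiliar with the Token Bucket strategy, here is a minimal sketch of the general technique. It is not our server implementation; the class name, capacity, and refill rate are made up for illustration:

```typescript
// Generic token bucket sketch (illustrative only, not the Trigger.dev server code).
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // Start full so an initial burst is allowed.
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request is allowed, false if it should get a 429.
  tryConsume(nowMs: number): boolean {
    // Add tokens for the time elapsed since the last call, capped at capacity.
    const elapsedMs = Math.max(0, nowMs - this.lastRefillMs);
    this.tokens = Math.min(
      this.capacity,
      this.tokens + (elapsedMs / 1000) * this.refillPerSecond
    );
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Because tokens refill continuously, a client that has spent its burst only needs to wait for the next token (one second at a refill rate of 1 per second) rather than for a whole window to pass.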

Task retries

We've also updated the retry logic for tasks that fail with a rate limit error from our SDK. As an example, let's imagine this task that fetches a run:

import { runs, task } from "@trigger.dev/sdk/v3";
export const taskRetries = task({
  id: "task-retries",
  retry: {
    maxAttempts: 5,
    minTimeoutInMs: 500,
    maxTimeoutInMs: 30_000,
    factor: 1.8,
  },
  run: async (payload: { runId: string }, { ctx }) => {
    // Override the default retry logic (a single attempt) so this call throws a RateLimitError instead of retrying
    await runs.retrieve(payload.runId, {
      retry: {
        maxAttempts: 1,
      },
    });
  },
});

When the task above runs and the runs.retrieve call fails with a rate limit error, the task now waits until the rate limit resets before its next attempt.

Custom request options

By default, the SDK will retry requests up to 3 times, with an exponential backoff delay between retries.

You can customize the retry behavior by passing a requestOptions option to the configure function:

import { configure } from "@trigger.dev/sdk/v3";
configure({
  requestOptions: {
    retry: {
      maxAttempts: 5,
      minTimeoutInMs: 1000,
      maxTimeoutInMs: 5000,
      factor: 1.8,
      randomize: true,
    },
  },
});
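
To get a feel for how these options interact, the sketch below computes the delay before each retry using a common exponential backoff formula. The function backoffDelayMs and the RetryOptions type are hypothetical, and the exact formula the SDK applies is an assumption here:

```typescript
// Hypothetical illustration of exponential backoff with optional jitter.
// The SDK's exact formula may differ; this follows a commonly used scheme.
type RetryOptions = {
  maxAttempts: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  factor: number;
  randomize: boolean;
};

// `attempt` is 1-based: the number of the attempt that just failed.
// `random` is injectable so the jitter can be made deterministic.
function backoffDelayMs(
  opts: RetryOptions,
  attempt: number,
  random: () => number = Math.random
): number {
  // With randomize enabled, scale the delay by a factor in [1, 2).
  const jitter = opts.randomize ? 1 + random() : 1;
  const raw = jitter * opts.minTimeoutInMs * Math.pow(opts.factor, attempt - 1);
  // Clamp to the configured maximum.
  return Math.min(opts.maxTimeoutInMs, Math.round(raw));
}

// Delays before each retry; the final attempt has no retry after it.
function backoffSchedule(opts: RetryOptions, random?: () => number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < opts.maxAttempts; attempt++) {
    delays.push(backoffDelayMs(opts, attempt, random));
  }
  return delays;
}
```

Under this formula, the configuration above with randomize turned off would wait 1000, 1800, 3240, and then 5000 milliseconds (capped by maxTimeoutInMs) between its five attempts.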

All SDK functions also accept a requestOptions parameter as their last argument, which customizes the request options for that call. For example, you can use it to disable retries for a specific request:

import { runs } from "@trigger.dev/sdk/v3";
async function main() {
  const run = await runs.retrieve("run_1234", {
    retry: {
      maxAttempts: 1, // Disable retries
    },
  });
}

NOTE

When running inside a task, the SDK ignores customized retry options for certain functions (e.g., task.trigger, task.batchTrigger), and uses retry settings optimized for task execution.

SDK OpenTelemetry spans

The SDK now outputs OpenTelemetry spans for all SDK functions (previously we only emitted spans for task triggering). This includes any retry waits.

The following example tells the story. Note that I ran this against my local Trigger.dev instance and configured the API server to randomly respond with a 500 response 25% of the time:

import { logger, runs, task, wait } from "@trigger.dev/sdk/v3";
export const sdkSpans = task({
  id: "sdk-spans",
  run: async () => {
    logger.log("Starting spans subtask without a runId");
    const handle = await sdkSpansSubtask.trigger({});
    logger.log("Starting spans subtask with a runId", { runId: handle.id });
    await sdkSpansSubtask.triggerAndWait({ runId: handle.id });
  },
});
export const sdkSpansSubtask = task({
  id: "sdk-spans-subtask",
  run: async (payload: { runId?: string }) => {
    await wait.for({ seconds: 5 });
    if (payload.runId) {
      logger.log("Retrieving run", { runId: payload.runId });
      const run = await runs.retrieve(payload.runId);
      logger.log("Cancelling run", { runId: run.id });
      await runs.cancel(run.id);
      logger.log("Replaying run", { runId: run.id });
      await runs.replay(run.id);
    }
    await wait.for({ seconds: 30 });
  },
});

As you can see in the screenshot, all calls to the SDK functions are logged, including spans for the retries.