May 13, 2024

On May 10, 2024 there was a major incident that caused runs to process extremely slowly and made the dashboard unavailable. Here's what happened and what we've changed to prevent it from happening again.


Before we get into what happened, I want to emphasise how important reliability is to us. This incident fell far short of the great experience and reliable service you should be able to expect from us. We're really sorry for the problems it has caused you all.

All paying customers will get a full refund for the entirety of May. You don't need to do anything: we'll be adding a 100% coupon to your Stripe subscription for this month.

What caused this?

This issue started with a huge number of queued events in v2. Our internal system that enforces concurrency limits on v2 was slowing down how quickly they were processed, but in doing so it was also putting more database load on the underlying v2 queuing engine, Graphile Worker. Graphile Worker is powered by Postgres, so the extra load slowed the queue down further and created a vicious cycle.
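
To make that feedback loop concrete, here's a minimal sketch of how a concurrency limiter sitting on top of a Postgres-backed queue can amplify its own load. This is not our actual implementation; the table, column names and immediate-requeue logic are purely illustrative.

```ts
// Illustrative only: a limiter that immediately requeues over-limit jobs on a
// Postgres-backed queue. Every skipped job still costs queries, and an
// immediate requeue means the same job is read and skipped again on the next
// poll, so a big backlog turns into a flood of useless database work.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function processNextJob(orgConcurrencyLimit: number): Promise<void> {
  // One query to claim the oldest queued job (locking subtleties omitted)...
  const { rows } = await pool.query(
    `UPDATE jobs SET state = 'claimed'
     WHERE id = (SELECT id FROM jobs WHERE state = 'queued' ORDER BY created_at LIMIT 1)
     RETURNING id, org_id`
  );
  if (rows.length === 0) return;
  const job = rows[0];

  // ...and another to count that org's running jobs. Under a large backlog,
  // most claimed jobs are over the limit, so both queries produce no useful work.
  const { rows: running } = await pool.query(
    "SELECT count(*)::int AS n FROM jobs WHERE org_id = $1 AND state = 'running'",
    [job.org_id]
  );

  if (running[0].n >= orgConcurrencyLimit) {
    // Immediate requeue: the job becomes eligible again straight away, so the
    // next poll re-reads and re-skips it, and the loop tightens as Postgres
    // slows down under the extra queries.
    await pool.query("UPDATE jobs SET state = 'queued' WHERE id = $1", [job.id]);
    return;
  }

  await pool.query("UPDATE jobs SET state = 'running' WHERE id = $1", [job.id]);
  // ...execute the run...
}
```

The rescheduling change described further down breaks this loop by making a skipped run ineligible for a few seconds instead of handing it straight back to the queue.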

We've handled larger spikes than this before, but the base load has increased a lot over time, and that's what left us vulnerable this time.

We've scaled by many orders of magnitude in the past year, but none of our normal upgrades and tuning worked here. We upgraded servers and, most importantly, the database (twice). We tuned some parameters. But as the backlog grew, it became harder and harder to recover from because of the concurrency limits built into the v2 system. Ordinarily this limiter distributes v2 runs fairly and prevents very high load by smoothing out spiky demand.

What was impacted?

  1. Queues got very long for v2 and processed slowly.
  2. Queues got long for v3. The v3 queuing system is built on Redis, so queuing itself was fine, but the actual run data lives in Postgres, which couldn't be read because of the v2 issues (see the sketch after this list). We also use Graphile Worker to trigger v3 scheduled tasks.
  3. The dashboard was very slow to load or was showing timeout errors (ironic I know).
  4. When we took the brakes off the v2 concurrency filter, a massive number of runs executed very quickly. Mostly this was fine, but in some cases it caused downstream issues in runs.
  5. Taking the brakes off the v2 concurrency filter also meant that, in some cases, v2 concurrency limits weren't respected.
  6. In some cases v3 runs still say they're "Executing" even though they actually completed: when those runs finished, the completion messages sent back to the platform failed because of the database load. We're working on UI and SDK/API tools for you (and us) to bulk cancel and replay runs.
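
To illustrate point 2, here's a minimal sketch of why the v3 split between Redis and Postgres behaved the way it did: enqueuing only touches Redis, so it kept working, while reading run data touches Postgres, which was overloaded. The table, key and function names here are hypothetical.

```ts
// Sketch of the v3 split: the queue lives in Redis, run records live in
// Postgres. When Postgres is saturated, enqueuing still succeeds while reads
// of run data time out, which is what made the dashboard slow or unavailable.
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis(process.env.REDIS_URL!);
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  statement_timeout: 5_000, // reads fail fast when Postgres is struggling
});

// Queueing a run only touches Redis, so it kept working during the incident.
async function enqueueRun(runId: string): Promise<void> {
  await redis.lpush("v3:run-queue", runId);
}

// Reading run details touches Postgres, so these are the queries that were
// timing out.
async function getRun(runId: string) {
  const { rows } = await pool.query("SELECT * FROM task_runs WHERE id = $1", [runId]);
  return rows[0] ?? null;
}
```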

What we've done to fix it (so far)

  • We upgraded some hardware. This means we can process more runs, but on its own it didn't help us escape the spiral.
  • We modified the v2 concurrency filter so it reschedules runs that are over the limit with a slight delay (see the sketch after this list). Before, it was thrashing the database with the same runs over and over, which could cause huge load in edge cases like this one.
  • We've upgraded Graphile Worker, the core of the v2 queuing system, to v0.16.6. This has a lot of performance improvements so we can cope with more load than before.
  • We have far better diagnostic tools than before.
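
As a rough illustration of that rescheduling change, here's a sketch using Graphile Worker's runAt option. It isn't our actual filter: the isOverConcurrencyLimit check and the task names are hypothetical. The idea is simply that an over-limit run gets pushed a few seconds into the future instead of being handed straight back to the queue.

```ts
import { run, type Task } from "graphile-worker";

// Hypothetical check: stands in for however the real limiter counts an
// org's currently running jobs.
async function isOverConcurrencyLimit(orgId: string): Promise<boolean> {
  return false; // placeholder
}

const executeRun: Task = async (payload, helpers) => {
  const { runId, orgId } = payload as { runId: string; orgId: string };

  if (await isOverConcurrencyLimit(orgId)) {
    // Re-enqueue the same run, but not before ~5 seconds from now, so the
    // worker doesn't immediately pick it up and skip it again.
    await helpers.addJob("executeRun", payload, {
      runAt: new Date(Date.now() + 5_000),
      jobKey: `run:${runId}`, // keep a single pending job per run
    });
    return;
  }

  helpers.logger.info(`executing run ${runId}`);
  // ...perform the actual run here...
};

async function main() {
  await run({
    connectionString: process.env.DATABASE_URL,
    concurrency: 10,
    taskList: { executeRun },
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```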

Today we reached a new level of scale that exposed weaknesses in parts of the system that, until now, had been working well. Reliability is never finished, so work continues tomorrow.
