Trigger.dev Cloud can now scale to thousands of nodes per region and handle significantly more concurrent workloads during peak usage. When we add capacity, new servers come online faster, which means better availability for medium and large machines under pressure. And when we perform infrastructure updates, we can roll out changes faster with less manual intervention - more frequent fixes and fine-tuning without interrupting your runs.
Technical details:
- Expanded pod CIDR blocks to eliminate the previous worker node scaling limits caused by IP exhaustion in our Kubernetes overlay network (the node-count math is illustrated after this list)
- Migrated all service images to AWS ECR with chained pull-through caching (GHCR → US → EU), cutting image pull times from ~10s typical and 60s+ worst case (thanks GitHub) down to ~500ms - see the cache-rule sketch below
- Automated worker node updates via node drain rules that wait for runs to complete before replacing outdated nodes (a drain sketch follows below)
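
Why the pod CIDR size matters: each worker node is typically handed a fixed-size pod range carved out of the cluster-wide pod CIDR, so the size of that block caps how many nodes can join the cluster. The prefix lengths below are purely illustrative (we haven't published our actual values); the sketch only shows the arithmetic behind the old ceiling.

```ts
// Illustrative only: how a cluster-wide pod CIDR bounds node count when each
// node is allocated a fixed-size pod range. Prefix lengths are hypothetical,
// not the values we actually use.
function maxNodes(clusterPrefix: number, perNodePrefix: number): number {
  // Each node consumes one /perNodePrefix block out of the /clusterPrefix pool.
  return 2 ** (perNodePrefix - clusterPrefix);
}

console.log(maxNodes(16, 24)); // 256 nodes  -> the kind of wall we hit
console.log(maxNodes(13, 24)); // 2048 nodes -> room for thousands per region
```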
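
The pull-through caching from the second point can be expressed with the AWS SDK. This is a minimal sketch rather than our exact configuration: the account ID, secret ARN, regions, and repository prefix are placeholders, and the EU → US hop assumes an ECR-to-ECR pull-through cache rule created with the same call (the real parameters for an ECR upstream may differ).

```ts
import {
  ECRClient,
  CreatePullThroughCacheRuleCommand,
} from "@aws-sdk/client-ecr";

// All identifiers below (account ID, secret ARN, regions, prefix) are placeholders.
async function createCacheChain() {
  // Rule 1: the US registry caches images pulled through from GHCR.
  // GHCR upstreams require credentials stored in Secrets Manager under the
  // "ecr-pullthroughcache/" prefix.
  const usEast = new ECRClient({ region: "us-east-1" });
  await usEast.send(
    new CreatePullThroughCacheRuleCommand({
      ecrRepositoryPrefix: "ghcr",
      upstreamRegistryUrl: "ghcr.io",
      credentialArn:
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:ecr-pullthroughcache/ghcr-creds",
    })
  );

  // Rule 2 (assumption): the EU registry chains off the US registry, so EU
  // nodes pull from a nearby cache that is itself backed by the GHCR rule above.
  const euCentral = new ECRClient({ region: "eu-central-1" });
  await euCentral.send(
    new CreatePullThroughCacheRuleCommand({
      ecrRepositoryPrefix: "ghcr",
      upstreamRegistryUrl: "123456789012.dkr.ecr.us-east-1.amazonaws.com",
    })
  );
}

createCacheChain().catch(console.error);
```

Once a rule exists, images are pulled as `<account>.dkr.ecr.<region>.amazonaws.com/ghcr/<org>/<image>`, and ECR serves cached layers after the first pull.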
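
And for the last point, a rough sketch of what "wait for runs to complete before replacing a node" can look like. The label selector, polling interval, and kubectl-based approach are illustrative assumptions, not our actual drain controller.

```ts
import { execFileSync } from "node:child_process";

// Hedged sketch: replace a node only after its run pods have finished.
async function drainWhenIdle(node: string): Promise<void> {
  // Stop new runs from being scheduled onto the outgoing node.
  execFileSync("kubectl", ["cordon", node]);

  // Wait until no run pods remain on the node.
  for (;;) {
    const out = execFileSync("kubectl", [
      "get", "pods",
      "--all-namespaces",
      "--field-selector", `spec.nodeName=${node},status.phase=Running`,
      "--selector", "app=trigger-run", // hypothetical label for run pods
      "--no-headers",
    ]).toString().trim();
    if (out === "") break;
    await new Promise((resolve) => setTimeout(resolve, 30_000)); // poll every 30s
  }

  // Evict whatever is left (daemonsets are skipped) so the node can be replaced.
  execFileSync("kubectl", [
    "drain", node,
    "--ignore-daemonsets",
    "--delete-emptydir-data",
  ]);
}

// usage: drainWhenIdle("ip-10-0-42-1.ec2.internal").catch(console.error);
```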