Eric Allam

CTO, Trigger.dev

Between November 28 and December 1, 2025, we experienced intermittent failures when ingesting OpenTelemetry trace data into our ClickHouse server instance. The root cause was a partition key design issue that created thousands of tiny data parts when long-running tasks completed, overwhelming ClickHouse's background merge capacity.

The issue was fully resolved by creating a new table with a corrected partition key design, implementing a fix to prevent late-arriving events from creating parts in old partitions, and scaling ClickHouse infrastructure to provide merge headroom.

Task execution itself was not affected. All tasks continued to run normally; only observability data (traces and spans in the dashboard) was impacted during the incident windows.


Impact

Trace and log data for task runs was intermittently not being recorded. Some users experienced missing or incomplete trace data in the dashboard for runs that occurred during the incident windows. Some OpenTelemetry events were dropped during peak incident periods when ClickHouse rejected inserts.

Error Windows

The following time periods had the highest concentration of insert failures. If you notice missing trace data for runs during these windows, this incident is the likely cause.

Period                               Duration    Severity
Nov 28, 16:30 - 22:26 UTC            ~6 hours    High - initial incident, frequent insert rejections
Nov 28, 23:40 - Nov 29, 00:10 UTC    ~30 min     Moderate - brief recurrence
Nov 30, ~15:00 - Dec 1, ~10:00 UTC   ~19 hours   High - sustained errors before v2 migration

During these windows, more than 275,000 insert operations failed across the affected tables. Events that failed to insert were retried with exponential backoff, but some events may have been permanently lost after retry exhaustion.


Background: how ClickHouse stores data

To understand this incident, it helps to know how ClickHouse's MergeTree engine works.

When you insert data into a ClickHouse MergeTree table, the data is written as a "part," an immutable, self-contained unit stored on disk. Each part contains the actual column data (compressed), primary key index, secondary indices, and metadata. Parts are organized into partitions based on a partition key expression (for example, toDate(timestamp) creates daily partitions). Parts within the same partition can be merged together, but parts in different partitions cannot.
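For concreteness, a minimal daily-partitioned MergeTree table (a hypothetical example, not our production schema) looks like this:

CREATE TABLE example_events (
    timestamp DateTime,
    payload String
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp) -- one partition per calendar day
ORDER BY timestamp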

ClickHouse continuously runs background merges to consolidate small parts into larger ones. This is critical for query performance (fewer parts means fewer files to scan), storage efficiency (merged parts compress better), and system health (too many parts exhausts file descriptors and memory).

The merge process works like this:


Insert 1 -> Part A (1000 rows)
Insert 2 -> Part B (1000 rows)
Insert 3 -> Part C (1000 rows)

Background merge: Part A + Part B  -> Part AB  (2000 rows)
Background merge: Part AB + Part C -> Part ABC (3000 rows)

ClickHouse enforces a limit on active parts per partition (default: 3000). When this limit is exceeded, inserts are rejected with:


Too many parts (N) in table 'X'. Merges are processing significantly slower than inserts.

This is a protective mechanism. Without it, the system would eventually crash from resource exhaustion.
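The thresholds involved are ordinary MergeTree settings, so you can inspect them on any cluster. parts_to_throw_insert is the hard limit behind the error above, and parts_to_delay_insert throttles inserts before that point (defaults vary by ClickHouse version):

-- Inspect the per-partition part limits on your cluster
SELECT name, value, description
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');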


What went wrong

Our task_events_v1 table stored OpenTelemetry span data with this schema:


CREATE TABLE task_events_v1 (
    environment_id String,
    run_id String,
    start_time DateTime64(9),
    span_id String,
    -- ... other columns
)
ENGINE = MergeTree
PARTITION BY toDate(start_time) -- Daily partitions by span start time
ORDER BY (environment_id, toUnixTimestamp(start_time), trace_id)

The critical design flaw was that we partitioned by toDate(start_time), the timestamp when the span started, not when we received the data.

When a task runs, we create span events to track its execution. A start event is created when the span begins, with start_time = now(). An end event is created when the span completes, but it uses the same start_time as the start event.

Note: This dual event creation is what allows us to display "in progress" spans in the dashboard. With most OpenTelemetry dashboards, you only ever see spans when the span is completed, and they only ever create a single record per span.

Consider a task that runs for 3 weeks:


Day 1: Task starts -> Start event inserted -> Partition: 2025-11-01
Day 22: Task ends -> End event inserted -> Partition: 2025-11-01 (same start_time!)

On Day 22, we're inserting data into a partition from 3 weeks ago. Here's where it gets interesting. By Day 22, the 2025-11-01 partition has already gone through its merge cycles:


Day 1-3: Thousands of parts created and merged
Day 4-7: Large merges consolidate into bigger parts
Day 8+: Partition is "stable" with maybe 10-50 large parts (GBs each)

When our late-arriving end event creates a tiny new part (maybe 10 rows, 3KB), ClickHouse faces a dilemma:


Partition 2025-11-01:
- Part A: 30 GB (500M rows) <- Already merged, stable
- Part B: 25 GB (400M rows) <- Already merged, stable
- Part C: 3 KB (10 rows) <- NEW! Tiny orphan part

To merge Part C, ClickHouse must rewrite Part A or B entirely, a 30GB operation to add 10 rows. This merge takes hours (or even days!), during which more tiny parts accumulate.

With thousands of active long-running tasks, we were constantly inserting tiny parts into old partitions:


Insert -> Partition 2025-11-15 (2 rows)
Insert -> Partition 2025-11-20 (5 rows)
Insert -> Partition 2025-11-01 (1 row)
Insert -> Partition 2025-11-18 (8 rows)
...

Each insert created a part that couldn't be efficiently merged. Parts accumulated faster than merges could consolidate them, eventually hitting the 3000-part limit.
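This is how the problem shows up in practice. A query along these lines against system.parts, run against task_events_v1 at the time, would have listed hundreds of tiny active parts scattered across partitions that were already weeks old:

-- Active parts per partition, with the smallest part size as a fragmentation signal
SELECT
    partition,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS total_size,
    formatReadableSize(min(bytes_on_disk)) AS smallest_part
FROM system.parts
WHERE table = 'task_events_v1' AND active
GROUP BY partition
ORDER BY partition;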


Timeline

Friday, November 28

Time (UTC)   Event
16:30        First "too many parts" error: 3008 parts in task_events_v1
17:04        Team begins investigating, initially suspects insert batching settings
18:25        Configuration changes deployed to adjust batch sizes
18:53        Situation worsens, ClickHouse reads also failing
19:26        Deploy blocked because migrations can't run due to query overload
19:37        Emergency PR to add SKIP_CLICKHOUSE_MIGRATIONS env var
19:50        ClickHouse support recommends adding replicas
22:26        System stabilizes after reduced load
23:40        Another ~30 minute incident

Saturday, November 29

Time (UTC)   Event
09:00        Investigation continues with 701 parts, merges appear stalled
09:14        Root cause identified: partition key design combined with late-arriving spans
09:22        Team realizes the partition key choice was fundamentally flawed

Monday, December 1

Time (UTC)   Event
11:27        PR #2719: new task_events_v2 table with inserted_at partition key
13:35        New table also hits part limits due to Materialized Views blocking merges
14:24        Full understanding of old-partition merge costs achieved
15:18        PR #2721: fix for task_events_v1 to clamp old start_time values
15:49        Additional ClickHouse replicas provisioned

Resolution

For immediate mitigation, we increased ClickHouse replicas from 2 to 3 for additional merge capacity, disabled debug event logging to reduce write volume, and dropped Materialized Views that were blocking merges.

We then created a new task_events_v2 table with a corrected partition key:


CREATE TABLE task_events_v2 (
    inserted_at DateTime DEFAULT now(), -- When we received the data
    start_time DateTime64(9),           -- Original span start time (preserved)
    -- ... other columns
)
ENGINE = MergeTree
PARTITION BY toDate(inserted_at) -- Partition by insertion time, not span time
ORDER BY (environment_id, toUnixTimestamp(start_time), trace_id)

This works because all rows from a single insert batch go into the same partition (today's date). No more fragmenting across old partitions.
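To make that concrete, here's a hypothetical insert, trimmed to the columns shown in the schemas above and using made-up identifiers. One span started three weeks ago and one just started, yet both rows land in today's partition because inserted_at defaults to now():

-- Hypothetical insert: start_time values are weeks apart, but the partition key
-- only looks at inserted_at, so both rows go into today's partition.
INSERT INTO task_events_v2 (environment_id, run_id, start_time, span_id)
VALUES
    ('env_1', 'run_old', toDateTime64('2025-11-01 08:00:00', 9), 'span_a'),
    ('env_1', 'run_new', now64(9), 'span_b');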

For the legacy task_events_v1 table, we implemented start time clamping:


// If start_time is more than 12 hours old, clamp it to 12 hours ago
const maxAge = 12 * 60 * 60 * 1000; // 12 hours in ms
const clampedStartTime = Math.max(startTime, Date.now() - maxAge);

This ensures late-arriving events go into recent partitions that still have active merge cycles, rather than creating orphan parts in old, stable partitions. The tradeoff is that start_time for closing span events may not exactly match the original span start. However, this doesn't affect span display because the opening event has the correct start_time, and when reading spans we merge all events and use the earliest start_time.
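As a simplified sketch of the read side (our real queries do more than this), recovering a span's display start time just means taking the earliest start_time across that span's events:

-- Sketch: the opening event keeps the true start_time, so min() recovers it
SELECT
    span_id,
    min(start_time) AS display_start_time
FROM task_events_v1
WHERE run_id = 'run_123' -- hypothetical run id
GROUP BY span_id;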


Lessons learned

Never partition by a field that can have "late arrivals." Event time, span start time, and similar fields can arrive arbitrarily late. Partition by insertion time instead.

Good partition keys:


PARTITION BY toDate(inserted_at) -- When we received it
PARTITION BY toYYYYMM(created_at) -- When the record was created

Risky partition keys:


PARTITION BY toDate(event_time) -- Events can arrive late
PARTITION BY toDate(start_time) -- Spans can complete weeks later
PARTITION BY toDate(order_date) -- Orders can be backdated

Small parts in old partitions are extremely expensive. A 10-row part in a 30GB partition requires rewriting 30GB to merge. You should track system.parts for active part count per table/partition, system.merges for running merge operations, and part size distribution to watch for tiny parts in old partitions.
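In practice, queries along the following lines (standard ClickHouse system tables) cover those signals:

-- Active part counts and smallest part per partition, to spot fragmentation
SELECT
    table,
    partition,
    count() AS active_parts,
    formatReadableSize(min(bytes_on_disk)) AS smallest_part
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY active_parts DESC
LIMIT 20;

-- Merges currently in flight and how long they have been running
SELECT
    table,
    partition_id,
    elapsed,
    round(progress, 2) AS merge_progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS merge_size
FROM system.merges;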

Our testing didn't include long-running tasks that would have exposed the partition key issue. Test scenarios should include edge cases like tasks running for days or weeks, high-cardinality partition keys, and backfill scenarios.

MergeTree's merge behavior is well documented but has subtle implications: designing a schema well requires understanding the merge algorithm, part limits, and how partitions interact.


Technical details for the curious

You might wonder why we didn't just increase the part limit. ClickHouse's default limit of 3000 parts exists for good reasons: each part requires file handles (potentially 10+ per part), part metadata consumes memory, queries must check all parts which degrades performance, and eventually the system becomes unstable. Increasing the limit just delays the inevitable crash.

You can manually trigger merges with OPTIMIZE TABLE, but it's a blocking operation that can timeout. Merging tiny parts into huge partitions takes hours, and during that time more tiny parts accumulate. It doesn't fix the root cause.
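For reference, a manual merge of a single partition looks like this (partition value illustrative); against a partition holding multi-GB parts it can run for hours and may time out on the client side:

-- Force-merge one partition; blocks until ClickHouse finishes or the client times out
OPTIMIZE TABLE task_events_v1 PARTITION '2025-11-01' FINAL;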

Why partition by day instead of month? Daily partitions provide efficient TTL-based data expiration (you can drop whole partitions), better query pruning for time-range queries, and reasonable merge granularity. Monthly partitions would reduce the problem but not eliminate it since a span from day 1 completing on day 30 still creates an orphan part. The real solution is partitioning by insertion time, which guarantees all data from one insert stays together.
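To make the expiration point concrete: with daily partitions, ageing out old data is a cheap metadata operation rather than a row-level delete (partition value illustrative):

-- Drops the whole day's data at once, no row-level deletes or merges required
ALTER TABLE task_events_v2 DROP PARTITION '2025-09-01';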


Questions?

If you have questions about this incident or how it may have affected your data, please reach out to [email protected] or join our Discord community.

We're committed to transparency about incidents that affect our users, and we appreciate your patience as we resolved this issue.
