Eric Allam

CTO, Trigger.dev

Between November 28 and December 1, 2025, we experienced intermittent failures when ingesting OpenTelemetry trace data into our ClickHouse server instance. The root cause was a partition key design issue that created thousands of tiny data parts when long-running tasks completed, overwhelming ClickHouse's background merge capacity.

The issue was fully resolved by creating a new table with a corrected partition key design, implementing a fix to prevent late-arriving events from creating parts in old partitions, and scaling ClickHouse infrastructure to provide merge headroom.

Task execution itself was not affected. All tasks continued to run normally; only observability data (traces and spans in the dashboard) was impacted during the incident windows.


Impact

Trace and log data for task runs was intermittently not being recorded. Some users experienced missing or incomplete trace data in the dashboard for runs that occurred during the incident windows. Some OpenTelemetry events were dropped during peak incident periods when ClickHouse rejected inserts.

Error Windows

The following time periods had the highest concentration of insert failures. If you notice missing trace data for runs during these windows, this incident is the likely cause.

Period                               Duration    Severity
Nov 28, 16:30 - 22:26 UTC            ~6 hours    High - initial incident, frequent insert rejections
Nov 28, 23:40 - Nov 29, 00:10 UTC    ~30 min     Moderate - brief recurrence
Nov 30, ~15:00 - Dec 1, ~10:00 UTC   ~19 hours   High - sustained errors before v2 migration

During these windows, more than 275,000 insert operations failed across the affected tables. Events that failed to insert were retried with exponential backoff, but some events may have been permanently lost after retry exhaustion.


Background: how ClickHouse stores data

To understand this incident, it helps to know how ClickHouse's MergeTree engine works.

When you insert data into a ClickHouse MergeTree table, the data is written as a "part," an immutable, self-contained unit stored on disk. Each part contains the actual column data (compressed), primary key index, secondary indices, and metadata. Parts are organized into partitions based on a partition key expression (for example, toDate(timestamp) creates daily partitions). Parts within the same partition can be merged together, but parts in different partitions cannot.
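For concreteness, a minimal daily-partitioned MergeTree table (a hypothetical example, not our production schema) looks like this:

CREATE TABLE example_events (
    timestamp DateTime,
    payload String
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp) -- one partition per calendar day
ORDER BY timestamp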

ClickHouse continuously runs background merges to consolidate small parts into larger ones. This is critical for query performance (fewer parts means fewer files to scan), storage efficiency (merged parts compress better), and system health (too many parts exhausts file descriptors and memory).

The merge process works like this:


Insert 1 -> Part A (1000 rows)
Insert 2 -> Part B (1000 rows)
Insert 3 -> Part C (1000 rows)

Background merge: Part A + Part B  -> Part AB  (2000 rows)
Background merge: Part AB + Part C -> Part ABC (3000 rows)

ClickHouse enforces a limit on active parts per partition (default: 3000). When this limit is exceeded, inserts are rejected with:


Too many parts (N) in table 'X'. Merges are processing significantly slower than inserts.

This is a protective mechanism. Without it, the system would eventually crash from resource exhaustion.
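The thresholds involved are ordinary MergeTree settings, so you can inspect them on any cluster. parts_to_throw_insert is the hard limit behind the error above, and parts_to_delay_insert throttles inserts before that point (defaults vary by ClickHouse version):

-- Inspect the per-partition part limits on your cluster
SELECT name, value, description
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');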


What went wrong

Our task_events_v1 table stored OpenTelemetry span data with this schema:


CREATE TABLE task_events_v1 (
    environment_id String,
    run_id String,
    start_time DateTime64(9),
    span_id String,
    -- ... other columns
)
ENGINE = MergeTree
PARTITION BY toDate(start_time) -- Daily partitions by span start time
ORDER BY (environment_id, toUnixTimestamp(start_time), trace_id)

The critical design flaw was that we partitioned by toDate(start_time), the timestamp when the span started, not when we received the data.

When a task runs, we create span events to track its execution. A start event is created when the span begins, with start_time = now(). An end event is created when the span completes, but it uses the same start_time as the start event.

Note: This dual event creation is what allows us to display "in progress" spans in the dashboard. With most OpenTelemetry dashboards, you only ever see spans when the span is completed, and they only ever create a single record per span.

Consider a task that runs for 3 weeks:


Day 1: Task starts -> Start event inserted -> Partition: 2025-11-01
Day 22: Task ends -> End event inserted -> Partition: 2025-11-01 (same start_time!)

On Day 22, we're inserting data into a partition from 3 weeks ago. Here's where it gets interesting. By Day 22, the 2025-11-01 partition has already gone through its merge cycles:


Day 1-3: Thousands of parts created and merged
Day 4-7: Large merges consolidate into bigger parts
Day 8+: Partition is "stable" with maybe 10-50 large parts (GBs each)

When our late-arriving end event creates a tiny new part (maybe 10 rows, 3KB), ClickHouse faces a dilemma:


Partition 2025-11-01:
- Part A: 30 GB (500M rows) <- Already merged, stable
- Part B: 25 GB (400M rows) <- Already merged, stable
- Part C: 3 KB (10 rows) <- NEW! Tiny orphan part

To merge Part C, ClickHouse must rewrite Part A or B entirely, a 30GB operation to add 10 rows. This merge takes hours (or even days!), during which more tiny parts accumulate.

With thousands of active long-running tasks, we were constantly inserting tiny parts into old partitions:


Insert -> Partition 2025-11-15 (2 rows)
Insert -> Partition 2025-11-20 (5 rows)
Insert -> Partition 2025-11-01 (1 row)
Insert -> Partition 2025-11-18 (8 rows)
...

Each insert created a part that couldn't be efficiently merged. Parts accumulated faster than merges could consolidate them, eventually hitting the 3000-part limit.
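This is how the problem shows up in practice. A query along these lines against system.parts, run against task_events_v1 at the time, would have listed hundreds of tiny active parts scattered across partitions that were already weeks old:

-- Active parts per partition, with the smallest part size as a fragmentation signal
SELECT
    partition,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS total_size,
    formatReadableSize(min(bytes_on_disk)) AS smallest_part
FROM system.parts
WHERE table = 'task_events_v1' AND active
GROUP BY partition
ORDER BY partition;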


Timeline

Friday, November 28

Time (UTC)   Event
16:30        First "too many parts" error: 3008 parts in task_events_v1
17:04        Team begins investigating, initially suspects insert batching settings
18:25        Configuration changes deployed to adjust batch sizes
18:53        Situation worsens, ClickHouse reads also failing
19:26        Deploy blocked because migrations can't run due to query overload
19:37        Emergency PR to add SKIP_CLICKHOUSE_MIGRATIONS env var
19:50        ClickHouse support recommends adding replicas
22:26        System stabilizes after reduced load
23:40        Another ~30 minute incident

Saturday, November 29

Time (UTC)   Event
09:00        Investigation continues with 701 parts, merges appear stalled
09:14        Root cause identified: partition key design combined with late-arriving spans
09:22        Team realizes the partition key choice was fundamentally flawed

Monday, December 1

Time (UTC)   Event
11:27        PR #2719: new task_events_v2 table with inserted_at partition key
13:35        New table also hits part limits due to Materialized Views blocking merges
14:24        Full understanding of old-partition merge costs achieved
15:18        PR #2721: fix for task_events_v1 to clamp old start_time values
15:49        Additional ClickHouse replicas provisioned

Resolution

For immediate mitigation, we increased ClickHouse replicas from 2 to 3 for additional merge capacity, disabled debug event logging to reduce write volume, and dropped Materialized Views that were blocking merges.

We then created a new task_events_v2 table with a corrected partition key:


CREATE TABLE task_events_v2 (
    inserted_at DateTime DEFAULT now(), -- When we received the data
    start_time DateTime64(9),           -- Original span start time (preserved)
    -- ... other columns
)
ENGINE = MergeTree
PARTITION BY toDate(inserted_at) -- Partition by insertion time, not span time
ORDER BY (environment_id, toUnixTimestamp(start_time), trace_id)

This works because all rows from a single insert batch go into the same partition (today's date). No more fragmenting across old partitions.
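To make that concrete, here's a hypothetical insert, trimmed to the columns shown in the schemas above and using made-up identifiers. One span started three weeks ago and one just started, yet both rows land in today's partition because inserted_at defaults to now():

-- Hypothetical insert: start_time values are weeks apart, but the partition key
-- only looks at inserted_at, so both rows go into today's partition.
INSERT INTO task_events_v2 (environment_id, run_id, start_time, span_id)
VALUES
    ('env_1', 'run_old', toDateTime64('2025-11-01 08:00:00', 9), 'span_a'),
    ('env_1', 'run_new', now64(9), 'span_b');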

For the legacy task_events_v1 table, we implemented start time clamping:


// If start_time is more than 12 hours old, clamp it to 12 hours ago
const maxAge = 12 * 60 * 60 * 1000; // 12 hours in ms
const clampedStartTime = Math.max(startTime, Date.now() - maxAge);

This ensures late-arriving events go into recent partitions that still have active merge cycles, rather than creating orphan parts in old, stable partitions. The tradeoff is that start_time for closing span events may not exactly match the original span start. However, this doesn't affect span display because the opening event has the correct start_time, and when reading spans we merge all events and use the earliest start_time.
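As a simplified sketch of the read side (our real queries do more than this), recovering a span's display start time just means taking the earliest start_time across that span's events:

-- Sketch: the opening event keeps the true start_time, so min() recovers it
SELECT
    span_id,
    min(start_time) AS display_start_time
FROM task_events_v1
WHERE run_id = 'run_123' -- hypothetical run id
GROUP BY span_id;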


Lessons learned

Never partition by a field that can have "late arrivals." Event time, span start time, and similar fields can arrive arbitrarily late. Partition by insertion time instead.

Good partition keys:


PARTITION BY toDate(inserted_at) -- When we received it
PARTITION BY toYYYYMM(created_at) -- When the record was created

Risky partition keys:


PARTITION BY toDate(event_time) -- Events can arrive late
PARTITION BY toDate(start_time) -- Spans can complete weeks later
PARTITION BY toDate(order_date) -- Orders can be backdated

Small parts in old partitions are extremely expensive. A 10-row part in a 30GB partition requires rewriting 30GB to merge. You should track system.parts for active part count per table/partition, system.merges for running merge operations, and part size distribution to watch for tiny parts in old partitions.
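In practice, queries along the following lines (standard ClickHouse system tables) cover those signals:

-- Active part counts and smallest part per partition, to spot fragmentation
SELECT
    table,
    partition,
    count() AS active_parts,
    formatReadableSize(min(bytes_on_disk)) AS smallest_part
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY active_parts DESC
LIMIT 20;

-- Merges currently in flight and how long they have been running
SELECT
    table,
    partition_id,
    elapsed,
    round(progress, 2) AS merge_progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS merge_size
FROM system.merges;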

Our testing didn't include long-running tasks that would have exposed the partition key issue. Test scenarios should include edge cases like tasks running for days or weeks, high-cardinality partition keys, and backfill scenarios.

MergeTree's merge behavior is well documented but has subtle implications: designing a schema well requires understanding the merge algorithm, part limits, and how partitions interact.


Technical details for the curious

You might wonder why we didn't just increase the part limit. ClickHouse's default limit of 3000 parts exists for good reasons: each part requires file handles (potentially 10+ per part), part metadata consumes memory, queries must check all parts which degrades performance, and eventually the system becomes unstable. Increasing the limit just delays the inevitable crash.

You can manually trigger merges with OPTIMIZE TABLE, but it's a blocking operation that can timeout. Merging tiny parts into huge partitions takes hours, and during that time more tiny parts accumulate. It doesn't fix the root cause.
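For reference, a manual merge of a single partition looks like this (partition value illustrative); against a partition holding multi-GB parts it can run for hours and may time out on the client side:

-- Force-merge one partition; blocks until ClickHouse finishes or the client times out
OPTIMIZE TABLE task_events_v1 PARTITION '2025-11-01' FINAL;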

Why partition by day instead of month? Daily partitions provide efficient TTL-based data expiration (you can drop whole partitions), better query pruning for time-range queries, and reasonable merge granularity. Monthly partitions would reduce the problem but not eliminate it since a span from day 1 completing on day 30 still creates an orphan part. The real solution is partitioning by insertion time, which guarantees all data from one insert stays together.
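To make the expiration point concrete: with daily partitions, ageing out old data is a cheap metadata operation rather than a row-level delete (partition value illustrative):

-- Drops the whole day's data at once, no row-level deletes or merges required
ALTER TABLE task_events_v2 DROP PARTITION '2025-09-01';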


Questions?

If you have questions about this incident or how it may have affected your data, please reach out to [email protected] or join our Discord community.

We're committed to transparency about incidents that affect our users, and we appreciate your patience as we resolved this issue.
