Between November 28 and December 1, 2025, we experienced intermittent failures when ingesting OpenTelemetry trace data into our ClickHouse server instance. The root cause was a partition key design issue that created thousands of tiny data parts when long-running tasks completed, overwhelming ClickHouse's background merge capacity.
The issue was fully resolved by creating a new table with a corrected partition key design, implementing a fix to prevent late-arriving events from creating parts in old partitions, and scaling ClickHouse infrastructure to provide merge headroom.
Task execution itself was not affected. All tasks continued to run normally; only observability data (traces and spans in the dashboard) was impacted during the incident windows.
Impact
Trace and log data for task runs was intermittently not recorded. Some users saw missing or incomplete trace data in the dashboard for runs that occurred during the incident windows, and some OpenTelemetry events were dropped during peak periods when ClickHouse rejected inserts.
Error Windows
The following time periods had the highest concentration of insert failures. If you notice missing trace data for runs during these windows, this incident is the likely cause.
| Period | Duration | Severity |
|---|---|---|
| Nov 28, 16:30 - 22:26 UTC | ~6 hours | High - initial incident, frequent insert rejections |
| Nov 28, 23:40 - Nov 29, 00:10 UTC | ~30 min | Moderate - brief recurrence |
| Nov 30, ~15:00 - Dec 1, ~10:00 UTC | ~19 hours | High - sustained errors before v2 migration |
During these windows, more than 275,000 insert operations failed across the affected tables. Events that failed to insert were retried with exponential backoff, but some events may have been permanently lost after retry exhaustion.
Background: how ClickHouse stores data
To understand this incident, it helps to know how ClickHouse's MergeTree engine works.
When you insert data into a ClickHouse MergeTree table, the data is written as a "part," an immutable, self-contained unit stored on disk. Each part contains the actual column data (compressed), primary key index, secondary indices, and metadata. Parts are organized into partitions based on a partition key expression (for example, toDate(timestamp) creates daily partitions). Parts within the same partition can be merged together, but parts in different partitions cannot.
ClickHouse continuously runs background merges to consolidate small parts into larger ones. This is critical for query performance (fewer parts means fewer files to scan), storage efficiency (merged parts compress better), and system health (too many parts exhausts file descriptors and memory).
The merge process works like this:
```
Insert 1 -> Part A (1000 rows)
Insert 2 -> Part B (1000 rows)
Insert 3 -> Part C (1000 rows)

Background merge: Part A + Part B  -> Part AB  (2000 rows)
Background merge: Part AB + Part C -> Part ABC (3000 rows)
```
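You can watch this happen in the system.parts table. Here is a minimal sketch using a throwaway table (the demo_events name and values are made up for illustration):

```sql
-- Hypothetical toy table: each INSERT below produces one part on disk.
CREATE TABLE demo_events
(
    ts DateTime,
    message String
)
ENGINE = MergeTree
PARTITION BY toDate(ts)
ORDER BY ts;

INSERT INTO demo_events VALUES (now(), 'a');
INSERT INTO demo_events VALUES (now(), 'b');
INSERT INTO demo_events VALUES (now(), 'c');

-- Each insert appears as its own active part until a background merge runs.
SELECT partition, name, rows
FROM system.parts
WHERE table = 'demo_events' AND active
ORDER BY name;
```

Run the SELECT again a little later and the three small parts will have been replaced by a single merged part.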
ClickHouse enforces a limit on active parts per partition (default: 3000). When this limit is exceeded, inserts are rejected with:
```
Too many parts (N) in table 'X'. Merges are processing significantly slower than inserts.
```
This is a protective mechanism. Without it, the system would eventually crash from resource exhaustion.
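To see how close a table is getting to that threshold, you can count active parts per partition and read the setting itself. A sketch (the table name is a placeholder; it matches the table discussed in the next section):

```sql
-- Active parts per partition for one table.
SELECT partition, count() AS active_parts
FROM system.parts
WHERE table = 'task_events_v1' AND active
GROUP BY partition
ORDER BY active_parts DESC
LIMIT 10;

-- The threshold that triggers "Too many parts" rejections (default 3000).
SELECT name, value
FROM system.merge_tree_settings
WHERE name = 'parts_to_throw_insert';
```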
What went wrong
Our task_events_v1 table stored OpenTelemetry span data with this schema:
```sql
CREATE TABLE task_events_v1
(
    environment_id String,
    run_id String,
    start_time DateTime64(9),
    span_id String,
    -- ... other columns
)
ENGINE = MergeTree
PARTITION BY toDate(start_time) -- Daily partitions by span start time
ORDER BY (environment_id, toUnixTimestamp(start_time), trace_id)
```
The critical design flaw was that we partitioned by toDate(start_time), the timestamp when the span started, not when we received the data.
When a task runs, we create span events to track its execution. A start event is created when the span begins, with start_time = now(). An end event is created when the span completes, but it uses the same start_time as the start event.
Note: This dual event creation is what allows us to display "in progress" spans in the dashboard. Most OpenTelemetry dashboards only show a span once it has completed, because only a single record is ever created per span.
Consider a task that runs for 3 weeks:
```
Day 1:  Task starts -> Start event inserted -> Partition: 2025-11-01
Day 22: Task ends   -> End event inserted   -> Partition: 2025-11-01 (same start_time!)
```
On Day 22, we're inserting data into a partition from 3 weeks ago. Here's where it gets interesting. By Day 22, the 2025-11-01 partition has already gone through its merge cycles:
```
Day 1-3: Thousands of parts created and merged
Day 4-7: Large merges consolidate into bigger parts
Day 8+:  Partition is "stable" with maybe 10-50 large parts (GBs each)
```
When our late-arriving end event creates a tiny new part (maybe 10 rows, 3KB), ClickHouse faces a dilemma:
```
Partition 2025-11-01:
  - Part A: 30 GB (500M rows) <- Already merged, stable
  - Part B: 25 GB (400M rows) <- Already merged, stable
  - Part C: 3 KB  (10 rows)   <- NEW! Tiny orphan part
```
To merge Part C, ClickHouse must rewrite Part A or B entirely, a 30GB operation to add 10 rows. This merge takes hours (or even days!), during which more tiny parts accumulate.
With thousands of active long-running tasks, we were constantly inserting tiny parts into old partitions:
```
Insert -> Partition 2025-11-15 (2 rows)
Insert -> Partition 2025-11-20 (5 rows)
Insert -> Partition 2025-11-01 (1 row)
Insert -> Partition 2025-11-18 (8 rows)
...
```
Each insert created a part that couldn't be efficiently merged. Parts accumulated faster than merges could consolidate them, eventually hitting the 3000-part limit.
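This pattern is visible in system.parts as tiny, freshly written parts sitting inside partitions that are weeks old. A sketch of a query that surfaces it (the row and age thresholds are arbitrary):

```sql
-- Tiny parts recently written into old partitions: the signature of this incident.
SELECT
    partition,
    name,
    rows,
    formatReadableSize(bytes_on_disk) AS size,
    modification_time
FROM system.parts
WHERE table = 'task_events_v1'
  AND active
  AND rows < 100                        -- tiny parts
  AND partition < toString(today() - 7) -- in partitions older than a week
ORDER BY modification_time DESC
LIMIT 20;
```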
Timeline
Friday, November 28
| Time (UTC) | Event |
|---|---|
| 16:30 | First "too many parts" error: 3008 parts in task_events_v1 |
| 17:04 | Team begins investigating, initially suspects insert batching settings |
| 18:25 | Configuration changes deployed to adjust batch sizes |
| 18:53 | Situation worsens, ClickHouse reads also failing |
| 19:26 | Deploy blocked because migrations can't run due to query overload |
| 19:37 | Emergency PR to add SKIP_CLICKHOUSE_MIGRATIONS env var |
| 19:50 | ClickHouse support recommends adding replicas |
| 22:26 | System stabilizes after reduced load |
| 23:40 | Another ~30 minute incident |
Saturday, November 29
| Time (UTC) | Event |
|---|---|
| 09:00 | Investigation continues with 701 parts, merges appear stalled |
| 09:14 | Root cause identified: partition key design combined with late-arriving spans |
| 09:22 | Team realizes the partition key choice was fundamentally flawed |
Monday, December 1
| Time (UTC) | Event |
|---|---|
| 11:27 | PR #2719: new task_events_v2 table with inserted_at partition key |
| 13:35 | New table also hits part limits due to Materialized Views blocking merges |
| 14:24 | Full understanding of old-partition merge costs achieved |
| 15:18 | PR #2721: fix for task_events_v1 to clamp old start_time values |
| 15:49 | Additional ClickHouse replicas provisioned |
Resolution
For immediate mitigation, we increased ClickHouse replicas from 2 to 3 for additional merge capacity, disabled debug event logging to reduce write volume, and dropped Materialized Views that were blocking merges.
We then created a new task_events_v2 table with a corrected partition key:
```sql
CREATE TABLE task_events_v2
(
    inserted_at DateTime DEFAULT now(), -- When we received the data
    start_time DateTime64(9),           -- Original span start time (preserved)
    -- ... other columns
)
ENGINE = MergeTree
PARTITION BY toDate(inserted_at) -- Partition by insertion time, not span time
ORDER BY (environment_id, toUnixTimestamp(start_time), trace_id)
```
This works because all rows from a single insert batch go into the same partition (today's date). No more fragmenting across old partitions.
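A quick way to verify the new behavior is to check which partitions of task_events_v2 receive active parts; with toDate(inserted_at) as the key, only the most recent day or two should ever appear (a sketch):

```sql
-- With toDate(inserted_at) as the partition key, new parts should only show up
-- in today's (or, briefly around midnight, yesterday's) partition.
SELECT partition, count() AS active_parts, sum(rows) AS total_rows
FROM system.parts
WHERE table = 'task_events_v2' AND active
GROUP BY partition
ORDER BY partition DESC;
```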
For the legacy task_events_v1 table, we implemented start time clamping:
```ts
// If start_time is more than 12 hours old, clamp it to 12 hours ago
const maxAge = 12 * 60 * 60 * 1000; // 12 hours in ms
const clampedStartTime = Math.max(startTime, Date.now() - maxAge);
```
This ensures late-arriving events go into recent partitions that still have active merge cycles, rather than creating orphan parts in old, stable partitions. The tradeoff is that start_time for closing span events may not exactly match the original span start. However, this doesn't affect span display because the opening event has the correct start_time, and when reading spans we merge all events and use the earliest start_time.
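As a rough illustration of that read path (the query below is a simplification with placeholder filter values, not our actual dashboard query):

```sql
-- Hypothetical read-side aggregation: collapse a span's open/close events and
-- take the earliest start_time, so the clamped close-event timestamp never wins.
SELECT
    trace_id,
    span_id,
    min(start_time) AS start_time
FROM task_events_v2
WHERE environment_id = 'env_123' -- placeholder environment
  AND trace_id = 'trace_abc'     -- placeholder trace
GROUP BY trace_id, span_id;
```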
Lessons learned
Never partition by a field that can have "late arrivals." Event time, span start time, and similar fields can arrive arbitrarily late. Partition by insertion time instead.
Good partition keys:
```sql
PARTITION BY toDate(inserted_at)  -- When we received it
PARTITION BY toYYYYMM(created_at) -- When the record was created
```
Risky partition keys:
```sql
PARTITION BY toDate(event_time) -- Events can arrive late
PARTITION BY toDate(start_time) -- Spans can complete weeks later
PARTITION BY toDate(order_date) -- Orders can be backdated
```
Small parts in old partitions are extremely expensive. Merging a 10-row part into a 30GB part means rewriting that entire 30GB. You should track system.parts for active part counts per table and partition, system.merges for running merge operations, and the part size distribution to watch for tiny parts landing in old partitions.
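For the merge side of that monitoring, system.merges shows what is being consolidated right now and how far along it is; a sketch:

```sql
-- Currently running merges: long elapsed times on huge parts are the warning sign.
SELECT
    table,
    partition_id,
    round(elapsed) AS elapsed_s,
    round(progress * 100, 1) AS progress_pct,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges
ORDER BY elapsed DESC;
```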
Our testing didn't include long-running tasks that would have exposed the partition key issue. Test suites and load tests should cover edge cases like tasks running for days or weeks, high-cardinality partition keys, and backfill scenarios.
MergeTree's merge behavior is well documented but has subtle implications. The merge algorithm, part limits, and partition interactions all need to be understood deeply when designing a schema.
Technical details for the curious
You might wonder why we didn't just increase the part limit. ClickHouse's default limit of 3000 parts exists for good reasons: each part requires file handles (potentially 10+ per part), part metadata consumes memory, queries must check all parts which degrades performance, and eventually the system becomes unstable. Increasing the limit just delays the inevitable crash.
You can manually trigger merges with OPTIMIZE TABLE, but it's a blocking operation that can time out. Merging tiny parts into huge partitions takes hours, and during that time more tiny parts accumulate. It doesn't fix the root cause.
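For reference, a manual merge of a single partition looks like this (the partition value is illustrative). FINAL forces the merge even when ClickHouse would not schedule one on its own, and the statement blocks until the merge finishes or the client gives up:

```sql
-- Force-merge one partition; blocks the client until the merge completes.
OPTIMIZE TABLE task_events_v1 PARTITION '2025-11-01' FINAL;
```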
Why partition by day instead of month? Daily partitions provide efficient TTL-based data expiration (you can drop whole partitions), better query pruning for time-range queries, and reasonable merge granularity. Monthly partitions would reduce the problem but not eliminate it since a span from day 1 completing on day 30 still creates an orphan part. The real solution is partitioning by insertion time, which guarantees all data from one insert stays together.
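Daily partitions also keep expiration cheap: dropping a whole partition is a metadata-only operation, and a table-level TTL can expire whole parts without rewriting data. A sketch against the v2 table (the partition value and retention period are illustrative):

```sql
-- Drop an entire day of data instantly (metadata operation, no rewrite).
ALTER TABLE task_events_v2 DROP PARTITION '2025-09-01';

-- Or let ClickHouse expire data automatically after 30 days.
ALTER TABLE task_events_v2 MODIFY TTL inserted_at + INTERVAL 30 DAY;
```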
Questions?
If you have questions about this incident or how it may have affected your data, please reach out to [email protected] or join our Discord community.
We're committed to transparency about incidents that affect our users, and we appreciate your patience as we resolved this issue.