How Ksolves Doubled Cassandra Compaction Throughput for a Leading Telecom with UCS in Cassandra 5.x and an AI-First Approach
The engineering team at a leading North American telecom operator was spending 18 hours a week managing Cassandra compaction. Not building data products. Not improving query performance. Just keeping three separate compaction strategies tuned, monitored, and stable across 340 tables. That number was going up, not down, as data volumes grew.
The client’s 120-node cluster, spread across 3 data centres, managed a total storage footprint of approximately 540 TB (180 TB per replica at replication factor 3) of call records, network telemetry, and IoT device events, processing billions of time-series and transactional events daily. The problem was the workload mix. Write-heavy ingestion tables and read-heavy analytics tables had fundamentally different data organisation requirements. So the team ran STCS for write-heavy tables, LCS for analytics, and TWCS for time-series, each with its own monitoring stack, its own tuning discipline, and its own failure modes.
Ksolves replaced all three with a single Unified Compaction Strategy deployment across all 24 keyspaces. Ksolves' AI-assisted configuration review completed in 8 days what would typically take 14 days of manual topology analysis. The live table-by-table migration ran without a single maintenance window.
The problems were not theoretical. Each one had an engineering cost attached every week.
- Fragmented Compaction Management: Three separate strategies meant three separate monitoring setups, three tuning disciplines, and 18 hours per week of engineering time on data organisation overhead alone. Every new high-traffic table required a strategy evaluation, a separate tuning pass, and a new entry in the monitoring stack. The operational surface area grew with every table added to the cluster, and the team had no path to simplifying it without a fundamental architecture change (representative legacy table definitions are sketched after this list).
- I/O Bottlenecks on High-Density Nodes: At telecom ingestion rates, major compaction cycles consumed up to 65% of available disk I/O on high-density nodes. Write throughput during those windows dropped from a baseline of 180 MB/s to under 80 MB/s, triggering timeouts on time-sensitive network event tables. The load was not evenly distributed either. Hotspot nodes absorbed disproportionate I/O while adjacent nodes sat underutilised, creating a cluster-wide imbalance that manual rebalancing could not sustainably fix.
- Performance Inconsistency Under Mixed Workloads: With STCS and LCS running concurrently on heavily loaded nodes, p99 read latency fluctuated between 45 ms and 180 ms depending on active data organisation state. Analytics queries that should have completed in 50 ms were occasionally timing out. The unpredictability was not a Cassandra limitation. It was a direct consequence of running incompatible strategies on the same hardware simultaneously.
- Strategy Changes Required Maintenance Windows: Switching a high-traffic table from STCS to LCS, or retuning a TWCS window parameter, triggered a full data reorganisation cycle consuming 4 to 6 hours of elevated I/O. In an always-on telecom environment, that meant planned maintenance windows, reduced ingest rates, and degraded write performance for the duration. The team was scheduling these windows monthly at minimum.
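To make the fragmentation concrete, the legacy per-table definitions looked roughly like the following. The table names and option values here are illustrative stand-ins, not the client's actual schema:

```sql
-- Write-heavy ingestion: size-tiered (STCS)
ALTER TABLE telemetry.raw_events WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': '4',
  'max_threshold': '32'
};

-- Read-heavy analytics: leveled (LCS)
ALTER TABLE analytics.usage_rollups WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '160'
};

-- Time-series events: time-window (TWCS)
ALTER TABLE telemetry.device_events WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': '1'
};
```

Three classes, three sets of options, three tuning disciplines, multiplied across 340 tables.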
As an Apache Cassandra consulting partner, Ksolves built the migration plan around one principle: one strategy for everything, tuned per workload, adjusted live. Before touching a single table, Ksolves used AI tools to map the complete Cassandra topology: all 340 tables across 24 keyspaces, data organisation history, I/O profiles per node, and hotspot distribution patterns. This discovery work typically takes 14 days when done manually across a cluster this size. AI-assisted analysis completed it in 8, producing a validated table-by-table migration sequence with T4/L10 assignments and base_shard_count values confirmed before the first production table moved. As Apache Cassandra engineering specialists, the Ksolves team ran the entire migration live, without a single maintenance window or cluster restart.
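The raw input for that topology mapping is available directly from Cassandra's own schema tables. As a minimal illustration of the inventory step (not Ksolves' proprietary tooling), a single CQL query returns every table's current compaction strategy and options:

```sql
-- Inventory the compaction strategy of every table in the cluster.
-- system_schema.tables is standard in Cassandra 3.0+; the compaction
-- column is a map containing the strategy class and its options.
SELECT keyspace_name, table_name, compaction
FROM system_schema.tables;
```

Joined with per-node I/O metrics and SSTable histograms, an inventory like this is what the migration sequence was validated against.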
- One Unified Framework Across All 340 Tables: UCS replaced STCS, LCS, and TWCS across all 24 keyspaces, migrated in the AI-sequenced order. The operational surface area collapsed from three separate monitoring and tuning disciplines into one. Every table in the cluster now runs under the same framework, regardless of its read/write profile.
- Adaptive T4/L10 Tuning Per Workload: Write-heavy ingestion tables were configured with T4, which provides UCS tiered-like behaviour that prioritises write throughput through larger, less frequent data reorganisation cycles. Analytics tables were configured with L10, which provides UCS leveled-like behaviour that limits read amplification by maintaining a more structured SSTable layout. The same underlying strategy, dynamically matched to each table's actual workload pattern (see the configuration sketch after this list).
- Parallel Sharded Operations via base_shard_count: Setting base_shard_count distributed data organisation work across multiple disk shards per node, enabling parallel operations where previously everything ran sequentially. I/O load, previously concentrated on single-disk operations that bottlenecked at 65% utilisation, redistributed across the full disk array. The per-cycle I/O window dropped from 4 to 6 hours down to under 90 minutes at peak load.
- Live Parameter Adjustments Without Downtime: UCS settings including T4/L10 configuration, base_shard_count, and thresholds can be changed on a live cluster without triggering a major reorganisation cycle or requiring a maintenance window. The team verified this in the first week: three parameter changes were made across active production tables during business hours without any service impact.
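In CQL terms, the change reduces to one ALTER TABLE per table, issued against the live cluster. A hedged sketch using the same hypothetical table names as above; scaling_parameters and base_shard_count are the UCS option names in Cassandra 5.x, while the specific values shown are illustrative rather than the client's production settings:

```sql
-- Write-heavy ingestion: T4 (tiered-like, favours write throughput)
ALTER TABLE telemetry.raw_events WITH compaction = {
  'class': 'UnifiedCompactionStrategy',
  'scaling_parameters': 'T4',
  'base_shard_count': '8'
};

-- Read-heavy analytics: L10 (leveled-like, limits read amplification)
ALTER TABLE analytics.usage_rollups WITH compaction = {
  'class': 'UnifiedCompactionStrategy',
  'scaling_parameters': 'L10',
  'base_shard_count': '8'
};
```

Subsequent live tuning uses the same statement form with different option values; no restart or maintenance window is involved.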
Technology Stack: Cassandra 5.0.8, UCS, T4/L10, base_shard_count
| Component | Detail |
|---|---|
| Database | Apache Cassandra 5.0.8 |
| Cluster Size | 120 nodes |
| Data Centres | 3 |
| Total Storage Footprint | ~540 TB (180 TB per replica at RF=3) |
| Daily Event Volume | Billions of time-series and transactional events |
| Tables Migrated to UCS | 340 across 24 keyspaces |
- 2x Compaction Throughput: Parallel sharded UCS operations doubled the data organisation throughput that legacy sequential strategies had managed. The backlog that accumulated during peak ingestion windows disappeared, and the write timeout events that had been a recurring on-call trigger were eliminated entirely.
- 70% Reduction in Management Complexity: Three separate frameworks with three separate monitoring stacks collapsed into one. Engineering time spent on data organisation monitoring, tuning, and incident response dropped from 18 hours per week to under 6. The team now manages 340 tables across 24 keyspaces from a single UCS dashboard.
- p99 Latency Stabilised from 45-180 ms to 28 ms: Under mixed workloads, p99 read latency had ranged from 45 ms when the cluster was quiet to 180 ms during active reorganisation cycles. With UCS managing read and write amplification dynamically per table, p99 stabilised to 28 ms plus or minus 4 ms, consistent enough that the analytics query timeout alerts the team had been investigating monthly stopped firing.
- Zero Maintenance Downtime: Strategy changes that previously required 4 to 6 hour maintenance windows with reduced ingest rates were eliminated. Post-deployment, the team performed 11 live parameter adjustments across active production tables during business hours. Zero maintenance windows. Zero service impact events.
- I/O Variance Reduced from 40% to Under 8%: Disk I/O imbalance across nodes dropped from up to 40% variance to under 8% with base_shard_count distributing work evenly. Write throughput during reorganisation windows recovered from the previous low of 80 MB/s back to the baseline 180 MB/s, and peak-window timeouts were eliminated.
- 340 Tables Migrated With Zero Application Code Changes: Every application writing to or reading from the cluster continued operating without modification throughout the live migration. No schema changes, no application redeployment, no client library updates. UCS operates transparently within the existing Cassandra API surface.
- AI-Compressed Planning: 14 Days to 8: Ksolves' AI-assisted topology analysis mapped all 340 tables, their data organisation histories, per-node I/O profiles, and T4/L10 suitability across 24 keyspaces in 8 days. A manual assessment of a cluster this size typically requires 14. The earlier start and validated migration sequence meant the live deployment ran without rework or rollback.
“Managing three strategies across hundreds of tables was a constant drain. Every new high-traffic table meant hours of tuning and another maintenance window to plan. UCS changed that. We stopped thinking about compaction as a management problem and started treating it as a tuning parameter. Throughput doubled, maintenance windows disappeared, and our on-call team stopped getting paged for latency spikes.”
— VP of Engineering, Leading North American Telecom Operator (Anonymised per NDA)
From 18 engineering hours per week managing three fragmented strategies across 340 tables, to a single UCS framework running 2x faster with 70% less overhead and zero maintenance downtime. That is what a Cassandra 5.x modernisation looks like when the migration is planned with precision and executed live. With 12+ years of experience, a team of seasoned Apache Cassandra engineers and UCS specialists, and a 90% client retention rate, Ksolves brings the same discipline to every Cassandra engagement. Whether you are managing fragmented STCS/LCS/TWCS deployments, evaluating a Cassandra 5.x UCS migration, or need big data database consulting to benchmark and modernise your cluster, Ksolves can scope the right path for your environment.
Simplify Operations and Boost Throughput by Upgrading to Cassandra 5.x UCS.