Project Name

How Ksolves Migrated Cassandra to On-Premises VMs and Reduced Costs by 40%

Industry
Telecommunications
Technology
Apache Cassandra 5.x, Kubernetes, On-Premises VMs, NVMe Storage, G1GC JVM Tuning, NUMA-Aware CPU Pinning, nodetool, sstableloader, rsync, NoSQLBench, SAI, UCS

How Ksolves Migrated Cassandra to On-Premises VMs and Reduced Costs by 40%
CLIENT OVERVIEW

A global telecom operator running Apache Cassandra 5.x on Kubernetes had a 50TB cluster of call records, device telemetry, and AI model training data that was delivering inconsistent read latency of 15 to 20ms at P95. Kubernetes containerisation was blocking the hardware-level tuning the workload needed, cloud storage costs were rising with data volume, and strict data residency obligations meant the cluster could not remain on cloud infrastructure. Ksolves was engaged to re-platform the entire cluster onto dedicated on-premises virtual machines, using dual migration paths designed in advance for variable network conditions. The migration delivered 35% faster reads, 30% faster writes, 40% lower storage costs, and zero service interruption across all 50TB.

KEY CHALLENGES

The client came to Ksolves with five compounding constraints that were actively limiting AI platform performance and business agility:

  • Slow and Inconsistent Read Performance at AI Workload Scale: Shared storage and containerised I/O on Kubernetes introduced performance variance that real-time AI decisioning systems could not tolerate. P95 read latency was running at 15 to 20ms under normal load and degrading further under peak AI inference load, making reliable real-time decisions impossible.
  • Cloud Storage Costs Scaling Faster Than the Business: Persistent volume charges, cross-zone data transfer fees, and cloud storage overhead for a 50TB cluster were growing at a rate that outpaced the business case for cloud hosting. Cost containment required moving to a model where storage costs were fixed and predictable.
  • Kubernetes Blocking Hardware-Level Tuning: The containerisation layer prevented the team from implementing memory pinning, NUMA-aware CPU allocation, and storage I/O scheduling that Cassandra requires for peak performance on AI workloads. Every tuning action was filtered through Kubernetes abstractions that added latency rather than removing it.
  • Excessive Operational Overhead for a Stable Workload: Managing storage volumes, Kubernetes operators, and StatefulSet configurations consumed significant engineering hours each week. For a workload with stable and predictable characteristics, this operational burden was disproportionate and was diverting capacity from AI feature development.
  • Data Residency Obligations Requiring On-Premises Hosting: AI models trained on subscriber and call record data were subject to strict data residency requirements that could not be met by cloud-hosted infrastructure. Moving the cluster to on-premises virtual machines was a regulatory necessity as much as a performance decision.

OUR SOLUTION

Ksolves used AI-assisted workload analysis to optimise VM sizing and accelerate migration planning from weeks to days. Dual migration paths were prepared in advance to ensure reliability across varying network conditions.

  • Online Migration Path: Direct Network Connectivity: Where network connectivity between the Kubernetes cluster and the target on-premises VMs was stable, new VM-hosted Cassandra nodes were joined directly into the existing ring. Cassandra's native replication streamed all data to the new nodes automatically. Token distribution was monitored to equilibrium, and Kubernetes nodes were decommissioned cleanly once data integrity was confirmed, with no interruption to live AI or telemetry workloads (see the node-join sketch after this list).
  • Offline Migration Path: Unreliable Network Conditions: For environments where direct network connectivity was unreliable, full cluster snapshots were captured using nodetool snapshot, transferred to the on-premises VM targets via rsync, and reloaded using nodetool import and sstableloader. For high-priority AI feature tables requiring faster transfer, cqlsh COPY provided an accelerated path with immediate validation (see the snapshot-and-reload sketch after this list).
  • VM Resource Configuration and Cassandra Tuning: On-premises VMs were provisioned with NVMe-backed virtual disks for consistent low-latency storage I/O. The JVM was tuned with G1GC garbage collection parameters sized to the VM resource reservations. NUMA-aware CPU pinning was applied to eliminate cross-socket memory access latency, a tuning lever unavailable under Kubernetes, and dual-bonded 10G network interfaces were configured for inter-node replication reliability (see the tuning sketch after this list).
  • Cassandra 5.x Feature Preservation: Storage-Attached Indexing (SAI) and Unified Compaction Strategy (UCS), both critical to the AI query access patterns, were fully preserved and validated on the new VM environment. No regression in index performance or compaction behaviour was recorded in any keyspace (see the SAI/UCS check after this list).
  • Post-Migration Validation with NoSQLBench: NoSQLBench replayed live AI workload patterns against the new VM cluster, confirming that P95 read and write latency targets had been met. Hash comparisons across all keyspaces validated complete data integrity across the full 50TB dataset (see the validation sketch after this list).
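
The online path is simple at the command level. The sketch below, with hypothetical hostnames, shows the ring checks used while a new VM node streams data and the clean retirement of a Kubernetes-hosted node afterwards; seed lists and addresses will differ per environment.

    # On the new VM node, cassandra.yaml points at the existing seeds;
    # once started, the node joins the ring and streaming begins.
    nodetool status      # new node should reach state UN (Up/Normal)
    nodetool netstats    # shows active streaming sessions
    # After token distribution settles and integrity checks pass, retire
    # each Kubernetes-hosted node; decommission streams its token ranges
    # to the remaining nodes before the node leaves the ring.
    nodetool -h k8s-cass-0.example.internal decommission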
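For the offline path, the snapshot-and-reload sequence maps directly onto standard Cassandra tooling. A minimal sketch, with illustrative keyspace, table, host, and path names:

    # On each source node: capture a consistent snapshot of the keyspace
    nodetool snapshot -t migration telemetry_ks
    # Ship the snapshot SSTables to a staging area on a target VM
    rsync -avz --partial /var/lib/cassandra/data/telemetry_ks/ \
        cassvm1:/staging/telemetry_ks/
    # Bulk-load the SSTables; the path must point at the keyspace/table
    # directory containing the snapshot SSTables
    sstableloader -d cassvm1,cassvm2 /staging/telemetry_ks/device_events/
    # Accelerated path for high-priority AI feature tables: export on the
    # source, then reimport on the target via cqlsh COPY
    cqlsh k8s-cass-0 -e "COPY ai_ks.features TO 'features.csv' WITH HEADER = TRUE"
    cqlsh cassvm1 -e "COPY ai_ks.features FROM 'features.csv' WITH HEADER = TRUE"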
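The JVM and NUMA settings referred to above are conventional Cassandra tuning levers. The values below are illustrative rather than the engagement's actual numbers, and would be sized to each VM's resource reservation:

    # jvm-server.options excerpt: G1GC sized within the VM's memory reservation
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    -Xms16G
    -Xmx16G
    # Start Cassandra pinned to a single NUMA node so heap pages and CPU
    # cores stay on the same socket, avoiding cross-socket memory access
    numactl --cpunodebind=0 --membind=0 /usr/sbin/cassandra -f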
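Verifying that SAI and UCS survived the move needs only cqlsh. A sketch with illustrative keyspace, table, and index names:

    # Confirm the storage-attached index definition is intact
    cqlsh cassvm1 -e "DESCRIBE INDEX ai_ks.features_model_idx"
    # Confirm the table still compacts under UCS; output should include
    # 'class': '...UnifiedCompactionStrategy'
    cqlsh cassvm1 -e "DESCRIBE TABLE telemetry_ks.device_events" | grep -i compaction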
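The validation step combines a load replay with a content check. The NoSQLBench invocation below assumes a workload YAML modelled on the live AI access pattern (the file name, rates, and cycle count are illustrative), and the hash comparison is a simplified stand-in for a per-keyspace integrity check:

    # Replay the AI read/write mix against the new VM cluster
    nb5 run driver=cql workload=ai_readwrite.yaml tags=block:main \
        hosts=cassvm1 localdc=dc1 cycles=1000000 threads=auto
    # Order-independent content hash of the same table on both clusters;
    # matching digests indicate identical exported contents
    cqlsh k8s-cass-0 -e "COPY ai_ks.features TO STDOUT" | sort | md5sum
    cqlsh cassvm1 -e "COPY ai_ks.features TO STDOUT" | sort | md5sum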

TECHNOLOGY STACK

  • Database (Apache Cassandra 5.x): Distributed NoSQL store underpinning all AI and telemetry workloads; SAI and UCS features fully preserved and validated on the new VM environment with no performance regression.
  • Source Environment (Kubernetes StatefulSets, PVCs): Original cluster environment hosting the 50TB dataset; decommissioned post-migration to eliminate container I/O overhead, Kubernetes operator complexity, and cloud storage costs.
  • Target Environment (On-Premises VMs, NVMe Virtual Disks): Dedicated VM cluster with NVMe-backed storage, NUMA-aware CPU pinning, and dual-bonded 10G networking, delivering the consistent low-latency I/O that Kubernetes containerisation had blocked.
  • JVM Tuning (G1GC, NUMA-Aware CPU Pinning): G1GC garbage collection and CPU affinity settings tuned within VM resource reservations, eliminating JVM-induced latency spikes and cross-socket memory access overhead.
  • Online Migration (Cassandra Native Replication): New VM nodes joined the existing ring directly; Cassandra replicated all data automatically with zero service interruption to live AI and telemetry workloads.
  • Offline Migration (nodetool, sstableloader, rsync): Full cluster snapshots captured via nodetool snapshot, transferred via rsync, and reloaded via sstableloader for environments where direct network connectivity was unreliable.
  • Validation (NoSQLBench, Hash Comparison): Live AI workload patterns replayed against the new VM cluster via NoSQLBench; hash comparisons across all keyspaces confirmed 100% data integrity across the full 50TB dataset.
  • AI Tooling (AI-Assisted Workload Analysis): Performance requirements and VM resource configuration were modelled using AI-assisted analysis before any production changes, compressing the planning phase from weeks to days.

IMPACT

Following migration, the platform delivered seven measurable outcomes across performance, cost, operations, and regulatory compliance:

  • 35% Faster Reads: P95 latency dropped from 15-20ms to 9-11ms. NVMe-backed VM storage and NUMA-aware tuning eliminated the containerised I/O variance that had made reliable real-time AI decisioning impossible on the Kubernetes cluster.
  • 30% Faster Writes: P95 latency dropped from 10-12ms to 6-8ms. Faster writes across call records and device telemetry eliminated processing backlogs in downstream AI pipelines and accelerated ingestion across all systems.
  • 40% Reduction in Annual Storage Costs: Eliminating cloud PVC charges and cross-zone transfer fees replaced a cost model that scaled with data volume with fixed, predictable on-premises storage costs.
  • 25% Improvement in AI Pipeline Stability: NUMA-aware CPU pinning and NVMe storage eliminated the I/O scheduling variance from Kubernetes that had been causing unpredictable delays in real-time AI inference and batch model training.
  • 50% Reduction in Cassandra Operational Overhead: Direct VM management replaced Kubernetes StatefulSet and operator management, freeing significant engineering capacity from infrastructure maintenance for AI platform feature development.
  • Zero Downtime: 50TB Migrated with Full Data Integrity: Both online and offline migration paths completed with zero service interruptions to live AI systems; hash comparisons confirmed 100% data integrity across all keyspaces and AI feature tables.
  • Data Residency Requirements Fully Met: All 50TB of subscriber data, call records, and AI training datasets now reside on on-premises VMs under strict access controls, fully compliant with data residency obligations that cloud hosting could not satisfy.

SOLUTION ARCHITECTURE

[Solution architecture diagram: stream-dfd]

CLIENT TESTIMONIAL

“Planning for both connected and disconnected migration scenarios before the project started proved to be exactly the right approach. Our inter-environment network was less reliable than anticipated, and having both paths ready meant the project never stalled. Every performance metric improved after the move to on-premises VMs.”
Head of Data Platform Engineering, Global Telecom Operator

CONCLUSION

Before this engagement, the operator's AI platform was constrained by a 50TB Cassandra cluster on Kubernetes: inconsistent reads at 15 to 20ms P95, rising cloud storage costs, and no viable path to the hardware-level tuning the workload required. Ksolves delivered a fully re-platformed Cassandra 5.x environment on dedicated on-premises VMs, with P95 reads at 9 to 11ms, writes at 6 to 8ms, 40% lower annual storage costs, 50% less operational overhead, and all 50TB migrated with zero service interruption over dual migration paths. All data residency obligations are met, SAI and UCS features are fully preserved, and the VM environment gives the engineering team direct access to the hardware-level tuning levers that Kubernetes had blocked. For telecom operators running Cassandra on Kubernetes with performance, cost, or compliance constraints, explore Ksolves Cassandra Consulting Services.

Facing Cassandra Performance Limits, Rising Cloud Costs, or Data Residency Challenges?
