Kubernetes Migration Cuts Deployment Time by 65% for a Fast-Growing EdTech Platform
Semester launch day for an EdTech platform is not the day you want your payment API to go down.
For a Southeast Asian platform with 400,000+ learners, that had already happened twice. The engineering team knew exactly why: no autoscaling, four-hour manual deployments, and an infrastructure that needed a human watching it at all times. They weren’t looking for a partial fix. They needed to move to Kubernetes and get off the treadmill entirely.
Ksolves came in, designed a production-grade AWS EKS setup, and migrated all fourteen services over fourteen weeks without taking the platform offline once.
The challenges faced by the client are as follows:
**Manual Deployment Process Across Three Environments**

Every release meant SSH-ing into servers in sequence, pulling images, restarting services, and validating health by hand. A standard deployment took three to four hours. Any mid-way failure left the environment partially updated with no clean recovery path. The team was running twelve deployments a month and losing a significant chunk of engineering time just to shipping code.
**Inability to Handle Enrollment Traffic Spikes**

Traffic hit 8x baseline on semester start days. The infrastructure had no autoscaling. On two consecutive launches, the course listing and payment APIs went unresponsive under load. VM provisioning was done manually ahead of peak periods and was consistently mis-sized in one direction or the other.
**No Centralised Observability**

Logs lived on individual server disks. When something degraded, engineers SSH'd into each machine and grepped. Mean time to detect a production issue was around 22 minutes, mostly because the first signal was support tickets, not alerts. There were no dashboards and no unified visibility into latency or error rates.
**Environment Inconsistency Between Staging and Production**

Staging had drifted from production in configuration, dependency versions, and resource limits. Things that passed QA in staging regularly behaved differently in production. The team had resorted to long change-freeze windows around every major release.
**Absence of Rollback Capability**

A bad release meant manually reverting, rebuilding the image, and redeploying: up to 45 minutes of recovery time. Two incidents in the prior year had caused outages of over an hour each, both for the same underlying reason: no versioned, automated rollback existed.
**Operational Toil Limiting Engineering Capacity**

About 35% of the team's time went to infrastructure upkeep: patching, disk monitoring, restarting crashed services, rotating credentials. Feature requests from the product team were being declined not because the work was hard, but because there was no headroom to take anything on.
Here is how Ksolves addressed each of these challenges:
**AWS EKS Cluster Design and Provisioned Node Groups**

Ksolves designed a multi-node EKS cluster with separate node groups for stateless and stateful workloads, provisioned via Terraform. Development, staging, and production were each isolated in their own namespace with resource quotas and network policies. The design was validated against the platform's 18-month growth forecast before build began.
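Namespace isolation of this kind is typically expressed as a ResourceQuota plus a default-deny NetworkPolicy per environment. A minimal sketch, assuming a `production` namespace; the quota numbers are illustrative, not the client's actual limits:

```yaml
# Hypothetical quota for the production namespace; numbers are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
  namespace: production
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "60"
    limits.memory: 120Gi
---
# Default-deny ingress: only traffic explicitly allowed by other
# NetworkPolicies can reach pods in this namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
```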
**Helm-Based Service Packaging and GitOps Deployment**

All fourteen services were containerised and packaged as Helm charts. ArgoCD was configured to monitor the Git repositories and automatically apply changes to the correct environment on merge. Deployments became declarative and auditable. Drift detection ensured any manual cluster change was flagged and reconciled automatically.
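In a setup like this, each service typically gets an ArgoCD `Application` per environment pointing at its Helm chart. A sketch with placeholder repository, chart path, and service names, none of which are the client's actual values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-staging        # hypothetical service name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/platform/helm-charts.git   # placeholder repo
    targetRevision: develop
    path: charts/payments-api
    helm:
      valueFiles:
        - values-staging.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual cluster changes (the drift reconciliation described above)
```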
**Horizontal Pod Autoscaler and Event-Driven Autoscaling**

HPAs were configured for the six highest-traffic services, using two weeks of baseline data to set thresholds. For the video transcoding queue and assessment submission service, KEDA was implemented to scale on queue depth rather than resource consumption, which responds much faster during enrollment bursts. Behaviour was validated with load tests before the next semester launch.
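Scaling on queue depth with KEDA is declared as a `ScaledObject`. A sketch assuming the transcoding queue lives on SQS; the queue URL, region, replica bounds, and credential reference are hypothetical:

```yaml
# Hypothetical KEDA ScaledObject for the video transcoding workers.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: transcode-worker
  namespace: production
spec:
  scaleTargetRef:
    name: transcode-worker            # the Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-southeast-1.amazonaws.com/123456789012/transcode-jobs
        queueLength: "20"             # target messages per replica
        awsRegion: ap-southeast-1
      authenticationRef:
        name: keda-aws-credentials    # assumed TriggerAuthentication
```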
**Prometheus, Grafana, and Loki Observability Stack**

A centralised observability stack was deployed covering metrics, dashboards, and log aggregation. Per-service dashboards covered latency, error rates, pod restarts, and resource saturation. Alerts were routed to Slack via Alertmanager. Mean time to detect production issues dropped from 22 minutes to under 3 minutes in the first week.
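Error-rate alerting of this kind is commonly declared as a `PrometheusRule` picked up by the Prometheus Operator and routed onward by Alertmanager. A sketch only; the `http_requests_total` metric name, service label, and 5% threshold are assumptions, not the client's actual rules:

```yaml
# Hypothetical alerting rule; metric name and thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-alerts
  namespace: monitoring
spec:
  groups:
    - name: payments-api
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{service="payments-api",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="payments-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: page            # matched by an Alertmanager route to Slack
          annotations:
            summary: "payments-api 5xx rate above 5% for 5 minutes"
```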
**Automated CI/CD Pipeline with GitHub Actions**

A GitHub Actions pipeline handled testing, image builds, ECR pushes, and Helm chart updates automatically on merge. A branch-based promotion model meant staging deployments triggered on develop merges; production deployments triggered via ArgoCD sync on main. Build-to-production time dropped from four hours to 18 minutes. Secrets were managed through AWS Secrets Manager with automated pod injection.
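The build-and-push stage of such a pipeline might look like the following sketch, assuming OIDC-based AWS authentication; the repository layout, IAM role ARN, region, and image name are placeholders:

```yaml
# Sketch of the build stage; role, region, and image names are hypothetical.
name: build-and-push
on:
  push:
    branches: [develop, main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC auth to AWS instead of long-lived keys
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-ecr-push   # placeholder
          aws-region: ap-southeast-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - run: |
          docker build -t "${{ steps.ecr.outputs.registry }}/payments-api:${{ github.sha }}" .
          docker push "${{ steps.ecr.outputs.registry }}/payments-api:${{ github.sha }}"
```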
**Zero-Downtime Migration with Parallel Traffic Routing**

Kubernetes services were stood up alongside existing VMs, with traffic shifted service by service using weighted routing in AWS ALB. Each service retained a VM fallback for 72 hours of healthy cluster operation before being decommissioned. All fourteen services were migrated without a single unplanned outage.
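Weighted routing between a legacy VM fleet and a new cluster can be expressed through the AWS Load Balancer Controller's action annotations on an Ingress. A sketch; the target group ARN, service name, and 80/20 split are illustrative:

```yaml
# Sketch of a weighted cutover; the ARN, names, and weights are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-api
  namespace: production
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/actions.weighted-cutover: >
      {"type":"forward","forwardConfig":{"targetGroups":[
        {"targetGroupARN":"arn:aws:elasticloadbalancing:...:targetgroup/vm-fleet/abc123","weight":20},
        {"serviceName":"payments-api","servicePort":"80","weight":80}]}}
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: weighted-cutover   # refers to the action annotation above
                port:
                  name: use-annotation
```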
The results after migration:
**Deployment Frequency and Speed**

Monthly deployments went from 12 manual to 40+ automated. Build-to-production time dropped from four hours to 18 minutes. Time spent on deployment-related work fell from 35% of engineering hours to under 5%, freeing capacity that went straight into a mobile feature backlog that had been stalled for three quarters.
**Uptime and Reliability During Peak Load**

The next semester launch processed 3.2x the prior year's traffic without degradation. The platform held 99.9% uptime across a 30-day window covering two peak enrollment events. Autoscaling absorbed a spike from 1,200 to 9,400 concurrent users in 90 minutes without manual intervention. Infrastructure-related support escalations dropped 80%.
**Infrastructure Cost Reduction**

Switching from always-on, over-provisioned VMs to autoscaling node groups cut the monthly AWS bill by 60%. The savings came from right-sizing workloads, scaling down during off-peak hours, and consolidating fourteen loosely managed instances into one efficiently scheduled cluster. The engagement paid for itself within five months.
**Engineering Team Confidence and Release Cadence**

A 23-item product backlog accumulated during the infrastructure-constrained period was cleared within two months of migration. Engineers cited identical Helm chart configuration across environments as the reason staging-to-production parity finally felt real. Change-freeze windows before major events were dropped entirely.
**Observability and Incident Response**

Three weeks after the observability stack went live, the team found and fixed a memory leak in the video processing service that had been causing restarts for over six months; it had been completely invisible in the old setup. Mean time to detect fell from 22 minutes to under 3 minutes. Mean time to resolve went from 47 minutes to 14.
The platform went from an infrastructure that needed constant human attention to one that scales, deploys, and recovers on its own. The engineering team’s job changed from keeping systems alive to building products. That shift is what a well-executed Kubernetes migration actually delivers, and it’s replicable for any platform at a similar inflection point.
Thinking about a Kubernetes migration for your platform?