Project Name: Zero-Downtime HA Platform Built for Multi-Cluster Analytics Resilience
Industry: Telecommunications
Technology: PostgreSQL, Patroni, Keycloak, Big Data

Overview

A large telecommunications group operating across multiple countries in West and Central Africa was running a distributed analytics platform across two server clusters that had never been designed to work together as a redundant system.

What began as a primary and secondary cluster had drifted into two independent silos, each hosting critical services with no automated failover, self-signed certificates blocking cross-service authentication, and a single point of failure inside every application tier. PostgreSQL, Keycloak, and the custom web application each ran on a single node. A node failure would take the platform down, and restoring it would require manual intervention. The platform was already serving active B2B operations.

Ksolves redesigned the entire infrastructure layer, deploying a fully redundant, cluster-spanning HA architecture with automated failover, trusted SSL, and zero downtime during migration.

Challenge
  • Fragmented Dual-Cluster Deployment with No Redundancy: Services had been distributed across two clusters without a planned HA design. The original primary and secondary intent had been abandoned, leaving critical application, identity, and database services stranded on single nodes with no failover path.
  • Single Points of Failure Across Every Critical Service Tier: PostgreSQL, Keycloak, and the custom web application each ran on a single node. Any hardware fault or planned reboot would immediately take the platform offline, with no automated recovery possible.
  • No Automated Database Failover Mechanism: The PostgreSQL tier had no replication, no leader election, and no routing layer. Recovery from a database failure required manual reconfiguration, exposing the platform to extended unplanned outages during any incident.
  • Untrusted HTTPS Causing Authentication Failures: All applications were served over self-signed SSL certificates, generating persistent browser security warnings, blocking cross-origin requests, and preventing Keycloak-brokered SSO flows from completing reliably across services.
  • No Session Consistency Across Identity Instances: With Keycloak isolated on a single node, there was no mechanism to synchronise authentication sessions if a second identity instance were introduced. Clustered identity was not possible without a shared backend.
  • Live Migration Risk Across Active B2B Operations: The platform was already serving active external partners. Any migration to a new HA topology had to preserve all live data and maintain service continuity, with zero data loss and no user disruption.
Solution

Ksolves conducted a full infrastructure audit before touching any configuration, mapping every service to its physical host, identifying all network access paths, and documenting port dependencies. The redesign introduced redundancy at every critical tier simultaneously: application, identity, and database, with a shared Virtual IP binding the two clusters into a single logical platform. The governing principle was automated recovery: every failover scenario was validated before go-live, and migration of all live services was completed with zero downtime.

  • Dual-Node WebApp Deployment: The ReactJS and NodeJS web application was redeployed on one node in each cluster, managed by PM2 and served through NGINX with SSL termination. A common domain maps to a Virtual IP that is load-balanced across both nodes, making the application cluster-independent and resilient to individual node failures.
  • Keycloak Identity Cluster with Shared HA Backend: Keycloak was clustered across both environments with a centralised PostgreSQL HA backend providing session consistency and synchronised authentication data, ensuring that a login on one node remains valid on the other and that identity failover is seamless.
  • PostgreSQL HA with Patroni, Etcd, HAProxy, and Keepalived: A three-node PostgreSQL cluster was deployed across both environments. Patroni handles replication and automatic leader election, Etcd acts as the consensus store, HAProxy provides leader-aware connection routing, and Keepalived manages a VIP across cluster nodes. Failover is fully automated, and validated recovery scenarios complete in under 45 seconds.
  • Trusted SSL via CA-Signed Certificates: Valid CA-signed SSL/TLS certificates were issued and deployed across all application and SSO endpoints, eliminating browser security warnings, resolving CORS failures, and enabling Keycloak-brokered authentication to function reliably across every service in the platform.
  • Zero-Downtime Service Migration: All existing realm configurations, client registrations, and user accounts were exported from the standalone Keycloak instance and re-imported into the clustered deployment. WebApp redeployment and NGINX reconfiguration were executed without service interruption, with full validation of login flows, SSL trust, data persistence, and dashboard access before go-live.
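
The leader-aware routing described for the PostgreSQL tier can be illustrated with a minimal Python sketch. It is not the configuration used in this engagement: in the deployed stack HAProxy performs this check itself. The sketch assumes Patroni's REST API on its default port 8008, where GET /primary returns HTTP 200 only on the current leader and 503 on replicas. The node names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

# Patroni answers GET /primary with 200 only on the node holding the leader lock.
PRIMARY_OK = 200


def find_leader(nodes: Iterable[str], probe: Callable[[str], int]) -> Optional[str]:
    """Return the single node whose probe reports 200, else None."""
    leaders = [node for node in nodes if probe(node) == PRIMARY_OK]
    # No leader means an election is in progress; more than one would suggest
    # split-brain. In both cases the sketch refuses to pick a write target.
    return leaders[0] if len(leaders) == 1 else None


def http_probe(node: str, port: int = 8008, timeout: float = 2.0) -> int:
    """Probe a live Patroni node; returns the HTTP status, or 0 if unreachable."""
    url = f"http://{node}:{port}/primary"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 from a replica
    except OSError:
        return 0  # connection refused, timeout, DNS failure


# Simulated statuses for three hypothetical nodes (no network access needed).
statuses = {"pg-node-1": 503, "pg-node-2": 200, "pg-node-3": 503}
leader = find_leader(statuses, statuses.get)
```

Passing `http_probe` instead of `statuses.get` would run the same selection against live nodes. Keeping the probe injectable lets the routing decision be checked without a running cluster.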

Technology Stack

Category          Technology
Frontend and API  ReactJS, NodeJS, NGINX, PM2
Identity          Keycloak (Clustered)
Database HA       PostgreSQL, Patroni, Etcd
Load Balancing    HAProxy, Keepalived
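
As a rough illustration of why a three-node layout suits a consensus-backed database tier, the Python sketch below computes majority quorum for a given member count. It assumes the consensus store runs with three members, which the case study does not state explicitly, and the function names are illustrative only.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)
```

Under this arithmetic, three members need two votes to agree and can lose one member, whereas two members can lose none. That is one common reason for adding a third node instead of running a simple pair.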
Results: Automated Failover Under 45 Seconds, Zero Downtime Migration, 100% HTTPS Trust Restored
  • Automated Database Failover in Under 45 Seconds: Patroni, HAProxy, and Keepalived handle leader election and VIP migration automatically. Validated failover scenarios complete in under 45 seconds without any manual intervention, replacing a process that previously required hours of manual reconfiguration by a DBA.
  • Zero Downtime Across the Full Migration: All services, including Keycloak realms, user accounts, WebApp instances, and database nodes, were migrated to the HA topology with zero downtime and zero data loss confirmed, with no disruption to active B2B partner operations throughout.
  • Single Points of Failure Eliminated Across All Tiers: Every application tier now runs as a dual-node or three-node cluster. Simulated failover tests confirmed that services remained fully operational through node reboots across both clusters, removing the manual recovery dependency that previously defined every incident response.
  • HTTPS Trust Restored Across 100% of Endpoints: Valid CA-signed SSL/TLS certificates replaced self-signed certificates across every endpoint. Browser warnings were eliminated, CORS failures were resolved, and SSO authentication flows now function reliably across all services without interruption.
  • Operational Overhead from Manual Failover Eliminated: Automated leader election, VIP migration, and connection rerouting operate without human involvement. The platform's first simulated failure ran to full recovery with no operator action required, establishing a self-healing infrastructure that the team does not need to monitor manually.
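
A failover figure such as "under 45 seconds" is typically checked by probing the service at a fixed interval during a simulated failure and measuring the gap between the first failed probe and the next successful one. The Python sketch below shows one way to compute that from a probe log. The log format and the sample data are hypothetical and are not measurements from this engagement.

```python
from typing import List, Optional, Tuple

# (seconds since the test started, whether the probe succeeded)
Probe = Tuple[float, bool]


def longest_outage(probes: List[Probe]) -> Optional[float]:
    """Longest span from a first failed probe to the next successful probe.

    Returns 0.0 if no probe failed, or None if the log ends while the
    service is still down.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, ok in sorted(probes):
        if not ok and outage_start is None:
            outage_start = timestamp  # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # service recovered
    return None if outage_start is not None else longest


# Hypothetical probe log: failures from t=10s, recovery observed at t=45s.
sample = [(0.0, True), (5.0, True), (10.0, False), (15.0, False),
          (40.0, False), (45.0, True), (50.0, True)]
outage = longest_outage(sample)
within_target = outage is not None and outage < 45.0
```

The result is only as precise as the probe interval, so a shorter interval gives a tighter bound on the real recovery time.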
Data Flow Diagram
[Figure: data flow diagram; image not reproduced]
Conclusion

Ksolves delivers high-availability infrastructure design and Big Data platform engineering for telecommunications groups and enterprise organisations that need to eliminate single points of failure from mission-critical analytics infrastructure without disrupting live operations.

Before this engagement, the group’s analytics platform was one node failure away from an extended manual recovery. After Ksolves delivered the HA redesign, validated database failover completes automatically in under 45 seconds, the full migration was delivered with zero downtime, and the platform is ready to serve mission-critical workloads across the group’s expanding country footprint.

The HA foundation removes infrastructure fragility as a constraint on the group’s continued platform expansion, enabling new country rollouts and additional analytics modules to be onboarded without redesigning the reliability layer.

Copyright 2026© Ksolves.com | All Rights Reserved