50,000+ CI/CD Jobs, Zero Failures: Ksolves Scaled HashiCorp Vault for a North American SaaS Company
A mid-size North American SaaS company with approximately 500 employees was running 50,000+ CI/CD pipeline jobs daily across GitLab CI and GitHub Actions. Every job required secure access to database credentials, API keys, cloud provider tokens, and signing certificates, all managed through a single HashiCorp Vault node. Every morning at 07:00 UTC, the build surge began. Response times spiked past 500 milliseconds. The rate limiter kicked in. Deployments started failing. Engineers reverted to hardcoded credentials just to ship. Ksolves re-architected the secrets infrastructure into a 5-node Raft HA cluster with performance standbys and agent-side caching, sustaining the full daily pipeline volume at sub-50ms p99 latency with zero pipeline failures from secrets retrieval.
- Single-Node Vault Was the CI/CD Bottleneck: Every pipeline job fetched secrets from a lone Vault instance. During morning build surges and release windows, the node processed thousands of KV requests per minute, saturating CPU and memory. The rate limiter throttled requests, causing pipeline jobs to time out and forcing engineers to skip secrets retrieval or hardcode credentials to meet deadlines.
- Latency Spikes Exceeding 500ms During Peak Demand: Under load, Vault response times routinely spiked past 500 milliseconds for simple KV reads. With tens of thousands of jobs hitting the secrets store daily, the cumulative latency added minutes to pipeline execution, eroding developer productivity and pushing CI/CD queues past operational windows.
- Static Long-Lived Tokens Creating Security and Operational Risk: Pipelines authenticated using long-lived static tokens stored as CI/CD variables. Token rotation was a manual, infrequent process that required coordination across teams and application configs, and stale tokens periodically caused hard-to-diagnose pipeline failures when a rotation happened without every runner being updated.
- No Caching Layer for Repeated Secret Fetches: Identical secrets were fetched hundreds of times per pipeline run with no caching mechanism. Every KV read traversed the full Vault API path, compounding the load on the single node and wasting network round-trips for static secrets that rarely changed.
- Single Point of Failure in the Delivery Pipeline: Any maintenance window, unplanned restart, or infrastructure incident rendered secrets completely unavailable, blocking every team's ability to build, test, and deploy, a business-continuity risk for a company shipping code continuously.
- No Visibility into Secrets Access Patterns: Without structured telemetry on secrets requests, the platform team could not identify which pipelines drove the most load, which secrets were fetched most frequently, or whether latency degradation originated at the Vault layer or downstream.
Ksolves re-architected the secrets infrastructure on a "cache close, authenticate short, serve fast" principle: reduce load at the Vault cluster through agent-side caching, replace long-lived static tokens with short-lived OIDC-based credentials, and scale horizontally across a Raft-based HA cluster with performance standbys. Every component was instrumented with Prometheus telemetry and surfaced through Grafana dashboards, giving the platform team real-time visibility into secrets access patterns for the first time.
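The "cache close" part of that principle can be illustrated with a minimal sketch of a TTL cache sitting in front of an upstream secrets read. This is not Vault Agent's implementation; the class name, the `fetch` callback, the secret path, and the timings are all hypothetical and chosen only to show how repeated reads of one secret collapse into a handful of upstream calls.

```python
import time
from typing import Callable, Dict, Tuple


class TTLSecretCache:
    """Tiny in-memory cache; an entry expires ttl_seconds after it was fetched."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical upstream call, e.g. a KV read
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the sketch can be simulated
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.upstream_calls = 0

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit: no upstream round-trip
        self.upstream_calls += 1
        value = self._fetch(path)
        self._entries[path] = (now + self._ttl, value)
        return value


def simulate_reads(n_reads: int, ttl_seconds: float,
                   seconds_between_reads: float) -> int:
    """Return how many upstream fetches n_reads of a single path cause."""
    fake_now = [0.0]
    cache = TTLSecretCache(lambda path: "example-value", ttl_seconds,
                           clock=lambda: fake_now[0])
    for _ in range(n_reads):
        cache.get("kv/ci/db-password")   # hypothetical secret path
        fake_now[0] += seconds_between_reads
    return cache.upstream_calls
```

Under these assumed numbers, 300 reads spaced one second apart with a 60-second TTL reach the upstream only five times, while a TTL of zero sends all 300 reads upstream. The TTL is the trade-off knob: longer values shed more load but delay pickup of a rotated secret.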
- 5-Node Raft HA Cluster with Performance Standbys: A five-node HashiCorp Vault cluster was deployed using integrated Raft storage for consensus and data replication. Three active nodes handle client requests with automatic leader election, while two performance standby nodes serve read-heavy KV workloads, distributing secrets traffic across five endpoints and eliminating the single-node bottleneck that previously caused throttling and timeouts.
- Vault Agent Caching with Auto-Auth: Vault Agent was introduced as a sidecar on CI/CD runner environments, configured with auto-auth and an in-memory proxy cache. Frequently fetched secrets are cached at the agent layer with configurable TTLs, reducing repeated KV reads against the cluster by orders of magnitude and letting cache hits return without a network round-trip to Vault.
- OIDC-Based Short-Lived Tokens for GitLab and GitHub Runners: Vault's OIDC auth method was configured to trust GitLab and GitHub as identity providers. Each pipeline runner authenticates using its CI/CD provider's OIDC-issued JWT, receiving a short-lived Vault token scoped to its specific repository, branch, and environment. This eliminated static long-lived tokens stored in CI/CD variables, enforced least-privilege access per pipeline, and removed manual token rotation from the operations workload.
- Tuned Lease TTLs and Optimised KV Store Access: Secrets access patterns were analysed across all pipeline types, and KV v2 lease TTLs, max TTLs, and token TTLs were tuned to match actual pipeline durations. Short-running build and test jobs receive tokens with TTLs measured in minutes. Longer deployment jobs receive appropriately scoped durations. This reduced unnecessary token renewal traffic and ensured no pipeline outlived its access window.
- Prometheus and Grafana Observability: Vault telemetry was enabled to export request latency, auth success and failure rates, token counts, and Raft consensus metrics to Prometheus. Curated Grafana dashboards show cluster health, per-node request rates, p50/p95/p99 latency, and pipeline-specific secrets usage trends, with proactive alerts before latency crosses defined thresholds.
- High Availability and Auto-Unseal: The Raft cluster was configured for automatic unseal using cloud KMS integration, ensuring node restarts or failovers recover without manual operator intervention. With five nodes and automatic failover, the secrets infrastructure now tolerates the loss of up to two nodes without impacting availability.
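The two-node fault tolerance quoted for the five-node cluster follows from standard Raft majority arithmetic. A small sketch, assuming all five nodes are voting members of the Raft cluster (the source does not state the voter configuration explicitly):

```python
def raft_quorum(voters: int) -> int:
    """Smallest majority of a Raft cluster with the given number of voters."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority can still be formed."""
    return voters - raft_quorum(voters)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, raft_quorum(n), tolerated_failures(n))
```

With five voters the quorum is three, so two nodes can fail. Four voters tolerate only one failure, the same as three, which is why odd cluster sizes are the usual choice.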
Technology Stack
| Category | Technology |
|---|---|
| Secrets Management | HashiCorp Vault (Raft HA) |
| Agent Caching | Vault Agent |
| Authentication | OIDC Auth Method (GitLab, GitHub) |
| Storage | Raft Integrated Storage |
| Observability | Prometheus, Grafana |
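To make the OIDC row concrete, the sketch below shows the kind of claim matching that scopes a token to a repository and branch. It illustrates the idea only: it is not Vault's code, it skips JWT signature verification entirely, and the function name, the role contents, and the `acme/payments` repository are hypothetical. The claim names `repository` and `ref` follow GitHub Actions OIDC tokens; GitLab tokens use different claim names.

```python
from fnmatch import fnmatchcase
from typing import Mapping


def claims_match(bound_claims: Mapping[str, str],
                 token_claims: Mapping[str, str]) -> bool:
    """True only if every bound claim is present in the token and matches.

    Bound values may use glob patterns such as 'refs/heads/release/*'.
    """
    for name, pattern in bound_claims.items():
        value = token_claims.get(name)
        if value is None or not fnmatchcase(value, pattern):
            return False
    return True


# Hypothetical role: only the main branch of one repository may log in.
deploy_role = {"repository": "acme/payments", "ref": "refs/heads/main"}

main_branch_job = {"repository": "acme/payments", "ref": "refs/heads/main"}
feature_branch_job = {"repository": "acme/payments", "ref": "refs/heads/feature-x"}

allowed = claims_match(deploy_role, main_branch_job)
denied = claims_match(deploy_role, feature_branch_job)
```

Here `allowed` is `True` and `denied` is `False`: a job on a feature branch presents a valid identity but does not satisfy the role, so it would receive no token for that role.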
- 50,000+ Daily CI/CD Jobs Without Pipeline Failures: The 5-node Raft HA cluster with performance standbys and agent caching sustains the full daily pipeline volume with zero deployment failures attributable to secrets infrastructure, replacing a system that was failing every morning at peak load.
- p99 Latency from 500ms to Under 50ms: Horizontal scale-out across five nodes combined with Vault Agent in-memory caching delivers p99 latency under 50ms, with most cache-hit reads served in single-digit milliseconds, down from routine 500ms+ spikes under the single-node architecture.
- Static Token Exposure Eliminated: OIDC-based auth issues short-lived, per-runner tokens scoped to repository and branch, eliminating static secrets from CI/CD configurations and enforcing least-privilege access automatically across every pipeline job.
- Single Point of Failure Removed: The 5-node Raft cluster tolerates loss of up to two nodes. Automatic leader election and cloud KMS auto-unseal ensure secrets remain available through node failures and maintenance windows without operator intervention.
- Real-Time Secrets Observability Established: Prometheus telemetry and Grafana dashboards provide real-time visibility into request rates, latency percentiles, auth success rates, and per-pipeline usage. This enables proactive capacity planning and faster root-cause analysis of incidents, which previously depended on manual log correlation.
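Because the results above are stated as latency percentiles, a short sketch of how a p99 differs from a median may help. It uses the nearest-rank definition on raw samples; the sample values are invented for illustration and are not the client's measurements, and monitoring systems typically estimate percentiles from aggregated data rather than raw lists like this.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct/100 * n
    return ordered[rank - 1]


# Invented example: 98 fast reads and 2 slow ones, in milliseconds.
latencies_ms = [8.0] * 98 + [520.0] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

In this invented sample the median is 8 ms while the p99 is 520 ms, which shows why a tail percentile, rather than an average or median, is the figure that captures the slow requests a pipeline actually waits on.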
Ksolves delivers HashiCorp Vault architecture and DevOps consulting services for SaaS companies running high-throughput CI/CD pipelines that need secrets infrastructure engineered for delivery scale.
Before this engagement, the company’s single-node Vault deployment was failing daily under CI/CD load, forcing engineers into unsafe workarounds and blocking releases. After the 5-node Raft HA re-architecture, 50,000+ daily pipeline jobs complete without secrets-related failures, p99 latency sits under 50ms, and the platform team has real-time visibility into secrets access patterns for the first time.
The architecture positions the client to double their CI/CD workload without re-architecture, extend OIDC trust to additional CI/CD providers, and layer in dynamic database credentials as the next security maturity milestone.
Your Secrets Infrastructure Should Never Be the Reason a Release Does Not Ship.