Reduced Alert Volume by 70% and MTTR to Under 30 Minutes Using AIOps
Our client is a mid-to-large fintech enterprise running 80-plus microservices on OCI Kubernetes Engine, serving financial products across retail and institutional customer segments.
Their SRE organisation managed a high-velocity production environment experiencing 15 to 20 incidents per month, but the signal-to-noise ratio of their alerting had collapsed: 500-plus daily alerts generated more engineer fatigue than operational clarity. On-call rotations were burning out senior engineers, MTTR averaged 3.5 hours, and no automated remediation existed for any class of known incident.
Leadership engaged Ksolves to deploy an AI-native observability solution that would restore operational signal, automate routine remediation, and reduce on-call cognitive load across the platform, shifting the SRE team from reactive fire brigade to proactive reliability engineering.
Five hundred daily alerts, 70% of them noise, and an SRE team spending its entire on-call capacity on incidents that should have resolved themselves.
- 500+ Daily Alerts, 70% Noise: High alert volume with mostly false positives, causing alert fatigue and missed real issues.
- 45-Minute Root Cause Diagnosis: Engineers manually correlated logs across tools, delaying initial incident understanding.
- 3.5-Hour MTTR Average: Recovery took hours per incident, with full reliance on human intervention across 80+ microservices.
- No Automated Fixes for Known Issues: Recurring problems like OOMKills and overloads always required manual remediation.
- 15–20 Monthly Incidents, No Pattern Detection: Each incident was treated as new due to lack of correlation or learning across events.
- No Leadership-Level Visibility: No unified reporting on trends, MTTR, or remediation effectiveness across services.
Ksolves, an AI-first DevOps consulting services company, deployed Dynatrace with Davis AI integrated across the existing Kubernetes and OCI monitoring stack. ML-based anomaly detection correlates alerts, suppresses noise, and identifies root cause automatically, replacing 45-minute manual triage with AI-driven incident intelligence in under 5 minutes.
- Dynatrace Davis AI/ML-Based Anomaly Detection: Deployed Dynatrace with Davis AI to ingest telemetry from Prometheus, OCI Monitoring, and OCI Logging, applying ML-based correlation to identify root cause across distributed OKE service dependencies and replacing manual cross-tool triage with automated causal problem cards that arrive with diagnosis already complete.
- Alert Noise Suppression and Intelligent Problem Grouping: Davis AI correlates related alerts into single root-cause-grouped problem cards and suppresses redundant and informational noise, reducing 500-plus daily alerts to a manageable set of diagnosed, actionable problem records that the SRE team can act on without filtering through noise.
- Dagger-Triggered Automated Remediation Runbooks: When Dynatrace root cause confidence exceeds the configured threshold, Dagger pipelines automatically execute the appropriate remediation runbook (pod restart, horizontal scale-out, or deployment rollback), with every outcome logged to the incident record for audit and pattern analysis.
- PagerDuty Intelligent Alert Routing: PagerDuty receives only AI-unresolved incidents from Dynatrace, routing to the correct on-call engineer based on Backstage service catalog ownership, eliminating the noise-driven pages that had made on-call rotations unsustainable and ensuring every human page represents a genuine escalation.
- Backstage Custom AIOps Dashboard Plugin: A custom Backstage plugin surfaces incident history, MTTR trend lines, auto-remediation success rates per service, and recurring failure patterns, giving engineering leadership a data-driven reliability improvement feedback loop across the full 80-plus service platform for the first time.
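The confidence-gated flow described above (remediate automatically when root cause confidence exceeds the threshold, otherwise page the owning team) can be sketched as follows. This is a minimal illustration, not the client's implementation: the threshold value, the root-cause labels, the `OWNERS` map standing in for the Backstage service catalog, and the `run_runbook` and `page` callables are all hypothetical, and no real Dynatrace, Dagger, or PagerDuty API is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical value; the case study does not state the configured threshold.
CONFIDENCE_THRESHOLD = 0.9


@dataclass
class Problem:
    service: str
    root_cause: str    # illustrative labels, e.g. "oom_kill", "overload", "bad_deploy"
    confidence: float  # root cause confidence in [0, 1]
    audit_log: List[str] = field(default_factory=list)


# Illustrative mapping from diagnosed root cause to a remediation runbook.
RUNBOOKS: Dict[str, str] = {
    "oom_kill": "pod_restart",
    "overload": "horizontal_scale_out",
    "bad_deploy": "deployment_rollback",
}

# Hypothetical ownership map standing in for the service catalog.
OWNERS: Dict[str, str] = {"payments-api": "team-payments"}


def handle_problem(
    problem: Problem,
    run_runbook: Callable[[str, str], bool],
    page: Callable[[str, Problem], None],
) -> str:
    """Run a runbook when confidence exceeds the threshold, else escalate."""
    runbook = RUNBOOKS.get(problem.root_cause)
    if runbook is not None and problem.confidence > CONFIDENCE_THRESHOLD:
        succeeded = run_runbook(runbook, problem.service)
        # Every outcome is recorded on the incident for audit and pattern analysis.
        problem.audit_log.append(f"{runbook}:{'success' if succeeded else 'failed'}")
        if succeeded:
            return "auto_remediated"
    # Unknown cause, low confidence, or failed runbook: page the owning team.
    owner = OWNERS.get(problem.service, "sre-on-call")
    page(owner, problem)
    problem.audit_log.append(f"escalated:{owner}")
    return "escalated"
```

In this sketch a failed runbook falls through to escalation, so a human page always means automation either did not apply or did not work. Whether the deployed system behaves the same way on runbook failure is an assumption.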
Technology Stack
| Category | Technology |
|---|---|
| AI/ML | Dynatrace (Davis AI) |
| CI/CD Portability | Dagger |
| Alerting | PagerDuty |
| Developer Portal | Backstage (AIOps Plugin) |
| Observability | Prometheus + Grafana + OCI Logging |
| Platform | OCI Kubernetes Engine (OKE) + OCI Monitoring |
From 500 daily alerts and a 3.5-hour MTTR to AI-correlated root cause in under 5 minutes and automated remediation before the on-call engineer opens their laptop.
- Alert Volume Reduced by 70%: Davis AI groups related signals into root-cause problems and filters redundant noise, turning 500+ daily alerts into actionable incidents.
- Root Cause in Under 5 Minutes: Dynatrace automatically identifies causal dependencies across OKE services, replacing 45-minute manual triage with automated diagnosis.
- MTTR Under 30 Minutes: Dagger runbooks resolve known issues automatically when AI confidence is high, bringing average recovery time down from 3.5 hours.
- Automated Remediation for Known Failures: Routine issues like restarts, scaling, and rollbacks are handled without human intervention, freeing on-call teams for novel incidents.
- Real-Time Cross-Service Visibility: Backstage AIOps dashboard tracks incidents, MTTR trends, and remediation success, enabling continuous reliability improvement across all services.
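As a rough illustration of the per-service figures such a dashboard tracks, the sketch below computes MTTR and auto-remediation success rate from a list of incident records. The `Incident` fields and the `summarize` function are hypothetical names chosen for this example; the actual Backstage plugin's data model is not described in the case study.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List


@dataclass
class Incident:
    service: str
    minutes_to_resolve: float
    auto_remediated: bool


def summarize(incidents: Iterable[Incident]) -> Dict[str, Dict[str, float]]:
    """Return MTTR (minutes) and auto-remediation rate for each service."""
    by_service: Dict[str, List[Incident]] = defaultdict(list)
    for incident in incidents:
        by_service[incident.service].append(incident)

    return {
        service: {
            # MTTR is the mean resolution time across the service's incidents.
            "mttr_minutes": mean(i.minutes_to_resolve for i in items),
            # Share of incidents closed by a runbook without a human page.
            "auto_remediation_rate": sum(i.auto_remediated for i in items) / len(items),
        }
        for service, items in by_service.items()
    }
```

Computing the same summary over successive time windows would give the MTTR trend lines mentioned above.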
An SRE team overwhelmed by repetitive incidents and alert noise is facing a platform limitation, not just an operational challenge. With 80+ microservices and 500+ daily alerts, this fintech environment lacked any automated resolution path, forcing constant manual triage and remediation. Ksolves addressed this with an AI-first model where Davis AI correlates alerts, identifies root cause in under 5 minutes, and triggers Dagger runbooks to resolve known issues before human intervention is needed. PagerDuty now escalates only the incidents that automation could not resolve, already diagnosed, to the right engineer, while Backstage provides visibility into trends, patterns, and remediation effectiveness.