Project Name

Reduced Alert Noise by 70% and MTTR to 30 Min Using AIOps

Industry
Fintech
Technology
Dynatrace (Davis AI), Dagger, PagerDuty, Backstage (AIOps Plugin), Prometheus, Grafana, OCI Logging, OCI Kubernetes Engine (OKE), OCI Monitoring

Overview

Our client is a mid-to-large fintech enterprise running 80-plus microservices on OCI Kubernetes Engine, serving financial products across retail and institutional customer segments.

Their SRE organisation managed a high-velocity production environment with 15 to 20 incidents per month, but the signal-to-noise ratio of its alerting had collapsed: 500-plus daily alerts generated more engineer fatigue than operational clarity. On-call rotations were burning out senior engineers, MTTR averaged 3.5 hours, and no automated remediation existed for any class of known incident.

Leadership engaged Ksolves to deploy an AI-native observability solution that would restore operational signal, automate routine remediation, and reduce on-call cognitive load across the platform, shifting the SRE team from reactive fire brigade to proactive reliability engineering.

Key Challenges

Five hundred daily alerts, 70% of them noise, and an SRE team spending its entire on-call capacity on incidents that should have resolved themselves.

  • 500+ Daily Alerts, 70% Noise: High alert volume with mostly false positives, causing alert fatigue and missed real issues.
  • 45-Minute Root Cause Diagnosis: Engineers manually correlated logs across tools, delaying initial incident understanding.
  • 3.5-Hour MTTR Average: Recovery took hours per incident, with full reliance on human intervention across 80+ microservices.
  • No Automated Fixes for Known Issues: Recurring problems like OOMKills and overloads always required manual remediation.
  • 15–20 Monthly Incidents, No Pattern Detection: Each incident was treated as new because nothing correlated events or carried lessons from one incident to the next.
  • No Leadership-Level Visibility: No unified reporting on trends, MTTR, or remediation effectiveness across services.
Our Solution

Ksolves, an AI-first DevOps consulting services company, deployed Dynatrace with Davis AI across the existing Kubernetes and OCI monitoring stack. ML-based anomaly detection now correlates alerts, suppresses noise, and identifies root cause automatically, replacing 45-minute manual triage with AI-driven incident intelligence in under 5 minutes.

  • Dynatrace Davis AI/ML-Based Anomaly Detection: Deployed Dynatrace with Davis AI to ingest telemetry from Prometheus, OCI Monitoring, and OCI Logging, applying ML-based correlation to identify root cause across distributed OKE service dependencies and replacing manual cross-tool triage with automated causal problem cards that arrive with diagnosis already complete.
  • Alert Noise Suppression and Intelligent Problem Grouping: Davis AI correlates related alerts into single root-cause-grouped problem cards and suppresses redundant and informational noise, reducing 500-plus daily alerts to a manageable set of diagnosed, actionable problem records that the SRE team can act on without filtering through noise.
  • Dagger-Triggered Automated Remediation Runbooks: When Dynatrace root cause confidence exceeds the configured threshold, Dagger pipelines automatically execute the appropriate remediation runbook (pod restart, horizontal scale-out, or deployment rollback), with every outcome logged to the incident record for audit and pattern analysis.
  • PagerDuty Intelligent Alert Routing: PagerDuty receives only AI-unresolved incidents from Dynatrace, routing to the correct on-call engineer based on Backstage service catalog ownership, eliminating the noise-driven pages that had made on-call rotations unsustainable and ensuring every human page represents a genuine escalation.
  • Backstage Custom AIOps Dashboard Plugin: A custom Backstage plugin surfaces incident history, MTTR trend lines, auto-remediation success rates per service, and recurring failure patterns, giving engineering leadership a data-driven reliability improvement feedback loop across the full 80-plus service platform for the first time.
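
The grouping, threshold, and escalation logic described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the deployed integration: the `Alert` and `Decision` types, the `RUNBOOKS` mapping, the 0.9 threshold, and the rule of using the lowest confidence in a group are all assumptions made for the example, and no Dynatrace, Dagger, or PagerDuty API is called.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical threshold: the case study does not state the configured value.
CONFIDENCE_THRESHOLD = 0.9

# Hypothetical mapping from a diagnosed root cause to the three runbook
# types named in the text (restart, scale-out, rollback).
RUNBOOKS = {
    "oom_kill": "pod_restart",
    "overload": "horizontal_scale_out",
    "bad_deploy": "deployment_rollback",
}


@dataclass(frozen=True)
class Alert:
    service: str             # service that raised the alert
    root_cause_service: str  # service the correlation step blames
    root_cause: str          # e.g. "oom_kill"
    confidence: float        # 0.0 to 1.0


@dataclass(frozen=True)
class Decision:
    root_cause_service: str
    root_cause: str
    alert_count: int
    action: str              # a runbook name, or "page_on_call"


def group_alerts(alerts):
    """Collapse alerts that share a root cause into one problem record."""
    problems = defaultdict(list)
    for alert in alerts:
        problems[(alert.root_cause_service, alert.root_cause)].append(alert)
    return problems


def decide(alerts, threshold=CONFIDENCE_THRESHOLD):
    """Return one decision per problem: run a runbook or page a human."""
    decisions = []
    for (svc, cause), members in group_alerts(alerts).items():
        # Assumption: be conservative and use the lowest confidence in the group.
        confidence = min(a.confidence for a in members)
        runbook = RUNBOOKS.get(cause)
        if runbook is not None and confidence >= threshold:
            action = runbook
        else:
            # Unknown cause or low confidence: escalate to the service owner.
            action = "page_on_call"
        decisions.append(Decision(svc, cause, len(members), action))
    return decisions


if __name__ == "__main__":
    sample = [
        Alert("checkout", "payments", "oom_kill", 0.95),
        Alert("ledger", "payments", "oom_kill", 0.92),
        Alert("payments", "payments", "oom_kill", 0.97),
        Alert("reports", "reports", "unknown", 0.40),
    ]
    for decision in decide(sample):
        print(decision)
```

In this sketch, three alerts that trace back to the same `payments` out-of-memory problem collapse into one record that is handled by a runbook, while the undiagnosed `reports` alert is the only one that would reach a human.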

Technology Stack

Category           | Technology
AI/ML              | Dynatrace (Davis AI)
CI/CD Portability  | Dagger
Alerting           | PagerDuty
Developer Portal   | Backstage (AIOps Plugin)
Observability      | Prometheus + Grafana + OCI Logging
Platform           | OCI Kubernetes Engine (OKE) + OCI Monitoring
Impact

From 500 daily alerts and a 3.5-hour MTTR to AI-correlated root cause in under 5 minutes and automated remediation before the on-call engineer opens their laptop.

  • Alert Volume Reduced by 70%: Davis AI groups related signals into root-cause problems and filters redundant noise, turning 500+ daily alerts into actionable incidents.
  • Root Cause in Under 5 Minutes: Dynatrace automatically identifies causal dependencies across OKE services, replacing 45-minute manual triage with automated diagnosis.
  • MTTR Under 30 Minutes: Dagger runbooks resolve known issues automatically when AI confidence is high, bringing MTTR down from a 3.5-hour average.
  • Automated Remediation for Known Failures: Routine fixes such as pod restarts, scale-outs, and rollbacks run without human intervention, freeing on-call teams for novel incidents.
  • Real-Time Cross-Service Visibility: Backstage AIOps dashboard tracks incidents, MTTR trends, and remediation success, enabling continuous reliability improvement across all services.
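
Taking the stated figures at face value, the headline numbers work out as follows: a 70% cut to 500 daily alerts leaves about 150, and going from 3.5 hours (210 minutes) to 30 minutes is a reduction of roughly 86%. The sketch below shows that arithmetic together with the kind of per-service summary the dashboard bullet describes. The `Incident` record and its field names are hypothetical, and the sample incidents are made-up illustration data, not client results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; field names are assumptions for this sketch.
    service: str
    minutes_to_resolve: float
    auto_remediated: bool


def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


def service_summary(incidents):
    """Per-service incident count, MTTR in minutes, and auto-remediation rate."""
    by_service = {}
    for inc in incidents:
        by_service.setdefault(inc.service, []).append(inc)
    return {
        svc: {
            "incidents": len(items),
            "mttr_minutes": mean(i.minutes_to_resolve for i in items),
            "auto_remediation_rate": sum(i.auto_remediated for i in items) / len(items),
        }
        for svc, items in by_service.items()
    }


if __name__ == "__main__":
    # Figures quoted in the case study.
    print(f"alerts: {percent_reduction(500, 150):.1f}% fewer")
    print(f"MTTR:   {percent_reduction(210, 30):.1f}% lower")
    print(f"triage: {percent_reduction(45, 5):.1f}% faster")

    # Made-up incidents, only to show the shape of the summary.
    sample = [
        Incident("payments", 20.0, True),
        Incident("payments", 40.0, False),
        Incident("ledger", 10.0, True),
    ]
    for svc, stats in service_summary(sample).items():
        print(svc, stats)
```
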
Solution Architecture
[Figure: solution architecture diagram]
Conclusion

An SRE team overwhelmed by repetitive incidents and alert noise is facing a platform limitation, not just an operational challenge. With 80+ microservices and 500+ daily alerts, this fintech environment lacked any automated resolution path, forcing constant manual triage and remediation. Ksolves addressed this with an AI-first model where Davis AI correlates alerts, identifies root cause in under 5 minutes, and triggers Dagger runbooks to resolve known issues before human intervention is needed. PagerDuty now escalates only the incidents that automation could not resolve, routed to the right engineer, while Backstage provides visibility into trends, patterns, and remediation effectiveness.

Is Your On-Call Team Spending More Time Triaging Noise than Resolving Real Incidents?