Project Name

Zero-Regression VM to AKS Migration: NiFi 2.8 and Airflow 3.2 Delivered in 6 Weeks

Zero-Regression VM to AKS Migration: NiFi 2.8 and Airflow 3.2 Delivered in 6 Weeks
Industry
Financial Services
Technology
Apache NiFi 2.8.0 + nifikop Operator v1.17.0, Apache Airflow 3.2.0, KubernetesExecutor, Azure Kubernetes Service (AKS), cert-manager - Cluster Issuer, keytool Jobs, Azure AD OIDC - composite identity provider, Prometheus, Grafana - kube-prometheus-stack

Loading

Zero-Regression VM to AKS Migration: NiFi 2.8 and Airflow 3.2 Delivered in 6 Weeks
Overview

Our client is a leading financial services organisation operating across multiple cities, serving institutional clients in capital markets with data-intensive pipeline workloads. The client runs a large-scale data engineering function processing inbound market data and distributing enriched feeds to downstream systems at scale.

 

With regulatory scrutiny increasing and a technology estate that had grown organically on bare virtual machines, leadership mandated a move to a cloud-native container platform.

 

The organisation required high availability, automated security controls, and consistent environments between UAT and PROD – none of which the existing VM architecture could provide – alongside a major-version upgrade of both core pipeline components without disrupting any existing production flow or DAG.

Key Challenges

A data pipeline processing millions of financial transactions nightly, running on bare VMs with no redundancy, no version control, and a certificate renewal process that had already caused security incidents.

  • VM-Bound Infrastructure With No High Availability: NiFi, NiFi Registry, Airflow, and the monitoring stack all ran on bare VMs with no redundancy. Any VM failure took down the entire data pipeline instantly with no automated recovery path, an unacceptable availability posture for a capital markets environment under increasing regulatory scrutiny.
  • Manual TLS Certificate Management: TLS certificates across all components were created and renewed by hand, producing recurring expiry incidents and leaving security gaps in what should have been an always-encrypted pipeline, with no automated issuance, no rotation, and no visibility into impending expirations.
  • Aging Versions With Breaking Changes: NiFi 1.28 and Airflow 2.10.4 lacked modern security features, had deprecated processors and APIs, and required significant compatibility work to migrate existing flows and DAGs across major version boundaries without regression.
  • No Centralised Observability: Prometheus and Grafana on separate VMs delivered only basic host metrics - no per-node NiFi scraping, no automated token refresh, and no unified view of pipeline health across NiFi, Airflow, and infrastructure simultaneously.
  • Environment Configuration Drift and No Audit Trail: UAT and PROD were managed as independent VMs with no shared manifest strategy, leading to configuration divergence and deployment errors. NiFi flows stored on VM file systems with no version control made rollbacks and audits impossible, a direct compliance risk in a regulated capital markets environment.
  • Manual and Slow Scaling: Adding processing capacity meant provisioning new VMs, installing dependencies by hand, and reconfiguring services — a process taking 4 to 6 hours with no repeatability guarantee and no path to on-demand horizontal scaling.
Our Solution

Ksolves, an AI-first DevOps consulting services company, successfully modernized the client's data platform by migrating Apache NiFi, NiFi Registry, Apache Airflow, and monitoring workloads from virtual machines to Azure Kubernetes Service in just six weeks.

  • NiFi Operator v1.17.0 - 3-Node NiFi 2.8.0 Cluster: Each NiFi node deployed with dedicated PersistentVolumeClaims for flow, content, and provenance repositories; per-node JVM tuning via operator CRD; and Istio cookie-based node affinity. Azure AD OIDC authentication configured as a composite provider, resolving the absence of centralised identity that had left the VM setup relying on per-VM local credentials.
  • cert-manager Cluster Issuer and Keytool Jobs: Automated TLS certificate issuance configured for every NiFi node, with cert-manager generating per-node PKCS12 keystores via post-issuance keytool Jobs. This permanently eliminated the recurring manual certificate renewal incidents and closed the security gaps they had been creating.
  • Apache Airflow 3.2.0 with KubernetesExecutor: All 17 production DAGs migrated from VM filesystem to Azure Files RWX PVCs, with breaking changes resolved across the api-server rename, SSL environment variable renames, and Airflow 3.x DAG bundle discovery. Full high availability deployed with two replicas across all four Airflow components and pod anti-affinity rules enforcing resilience.
  • kube-prometheus-stack - Per-Node NiFi Scraping: Three separate Prometheus scrape jobs configured for EdDSA JWT-authenticated per-node NiFi 2.x metrics, with a CronJob refreshing tokens every four hours. Custom Grafana dashboards unified NiFi, Airflow, and AKS metrics in a single pane of glass, replacing the siloed, basic-metric monitoring that had provided no cross-component visibility.
  • Helm and envsubst - Dual-Environment Manifest Strategy: All Kubernetes manifests templated with environment variable placeholders and processed via envsubst, with a deploy.sh wrapper supporting component and environment flags. UAT and PROD share identical manifests with environment-specific overrides only. This eliminated configuration drift and provided NiFi Registry version control for a full pipeline audit history that the VM setup had never had.

Technology Stack

Category Technology
Architecture Apache NiFi 2.8.0 + nifikop Operator v1.17.0
Processing Apache Airflow 3.2.0 - KubernetesExecutor
Infrastructure Azure Kubernetes Service (AKS)
Integration cert-manager - Cluster Issuer + keytool Jobs
Security Azure AD OIDC - composite identity provider
Monitoring Prometheus + Grafana - kube-prometheus-stack
Impact

From standalone VMs with no redundancy and six-hour deployments to a fully automated AKS platform with high availability, automated security, and five-minute deployments in just six weeks.

  • High Availability Across the Entire Platform: A three-node NiFi cluster and redundant Airflow components eliminated single points of failure, significantly improving resilience for nightly financial data processing.
  • TLS Certificate Management Fully Automated: cert-manager now handles certificate issuance and renewal across all services, eliminating manual renewals and reducing security and outage risks.
  • Zero Data Loss During Migration: All 17 production Airflow DAGs and NiFi flows were migrated successfully, with full version history, auditability, and rollback capabilities preserved.
  • Deployment Time Reduced to Under 5 Minutes: Standardized Helm-based deployments deliver identical UAT and PROD environments, cutting deployment time from 4–6 hours to less than 5 minutes while eliminating configuration drift.
  • Unified Monitoring and Observability: Centralized Prometheus and Grafana dashboards provide end-to-end visibility into NiFi, Airflow, Kubernetes, and infrastructure health from a single monitoring platform.
Solution Architecture
stream-dfd
Conclusion

Ksolves successfully transformed the client’s data processing infrastructure from a VM-based environment into a highly available, Kubernetes-powered platform without disrupting existing operations. The migration delivered automated deployments, automated certificate management, centralized observability, and resilient NiFi and Airflow workloads while preserving every existing flow and DAG.

Is Your Data Infrastructure Migration-Ready, and Can it Support a Zero-Regression Cutover to Kubernetes?

Copyright 2026© Ksolves.com | All Rights Reserved
Ksolves USP