HIPAA-Compliant Spark and Airflow Cluster Stood Up on Proxmox in 7 Weeks for a US Healthcare Enterprise
A US-based enterprise in the healthcare benefits sector had no dedicated data engineering infrastructure, only ad-hoc processing tools that could not scale, could not be audited, and could not meet HIPAA requirements. The organisation manages sensitive patient and beneficiary data, with approximately 2TB of full-load data across file and database sources. It had outgrown its legacy data processing approach but lacked the in-house infrastructure and expertise to build a modern, scalable data stack. Cloud-based solutions were not viable given internal policy constraints and the need for full on-premises control. The client needed a production-grade Apache Spark and Airflow cluster built on its existing on-premises Proxmox hardware, including separate Development and Production environments, full security hardening, and a structured knowledge transfer that would leave its team self-sufficient. Ksolves was engaged to design and deliver the complete environment from scratch.
- No Data Engineering Infrastructure: The organisation had no dedicated Apache Spark or Airflow environment. All data processing ran through ad-hoc scripts and manual pipelines that could not scale to 2TB workloads, could not be audited, and provided no foundation for future analytics or compliance-grade operations.
- HIPAA Compliance Across All Layers: Healthcare data processing mandated HIPAA-aligned technical safeguards, including data encryption in transit, role-based access control, audit logging, and OS hardening. None of these requirements could be met with the existing informal setup, creating direct regulatory exposure.
- No Cluster Blueprint on Proxmox: The client was committed to on-premises Proxmox-based infrastructure, but had no VM configuration strategy, no resource allocation framework, and no topology documentation for a multi-node distributed cluster. The entire architecture had to be designed from first principles.
- Separate Dev and Prod Environments: The organisation needed logically and physically separate Development and Production environments to support safe workflow testing, change management, and compliance auditing. Neither environment existed, and no framework for environment separation had been established.
- No In-House Operational Expertise: The internal team had no experience setting up or operating Spark, Airflow, HDFS/YARN, or Keycloak - creating risk not only around delivery but around post-handoff self-sufficiency. The engagement had to deliver both the platform and the capability to run it.
- 2TB Source Integration at Scale: Spark workloads needed validated connectivity and performance against both file-based sources (1.5-2TB full load) and database sources (2TB full load), with confirmed end-to-end execution as a formal UAT acceptance criterion before handover.
Ksolves delivered a 6-phase structured engagement - Discovery, Development Environment, Production Environment, Security Hardening, UAT, and Knowledge Transfer - building both Apache Spark and Apache Airflow clusters on a 3-node Proxmox architecture per environment. The governing design principle was compliance-first: every configuration decision was mapped to HIPAA technical safeguards before implementation, ensuring security and auditability were built into the platform from day one rather than retrofitted after delivery.
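One way to picture the compliance-first principle is a checklist that refuses to pass until every required safeguard has at least one control mapped to it. The sketch below is illustrative only: the safeguard names come from the challenges listed above, while `SAFEGUARD_CONTROLS`, `unmapped_safeguards`, and the control descriptions marked "assumed" are hypothetical and not taken from the engagement.

```python
# Hypothetical safeguard-to-control map. Safeguard names follow the case
# study; control descriptions marked "assumed" are illustrative only.
SAFEGUARD_CONTROLS = {
    "encryption_in_transit": ["TLS on cluster endpoints (assumed)"],
    "role_based_access_control": ["Keycloak RBAC for Spark and Airflow"],
    "audit_logging": ["Keycloak audit trail"],
    "os_hardening": ["Hardened OS baseline on every VM (assumed)"],
}

REQUIRED_SAFEGUARDS = (
    "encryption_in_transit",
    "role_based_access_control",
    "audit_logging",
    "os_hardening",
)


def unmapped_safeguards(mapping, required=REQUIRED_SAFEGUARDS):
    """Return the required safeguards that have no control mapped to them."""
    return [name for name in required if not mapping.get(name)]


if __name__ == "__main__":
    missing = unmapped_safeguards(SAFEGUARD_CONTROLS)
    print("Unmapped safeguards:", missing or "none")
```

A check like this could run before each phase sign-off, so a configuration change that drops a control is caught before implementation rather than during an audit.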
- Proxmox VM Setup: Virtual Machines provisioned across two separate Proxmox clusters - Development and Production - each comprising three nodes, with CPU, RAM, and NVMe storage allocated per cluster topology recommendations on the client's existing on-premises hardware.
- Apache Spark with HDFS/YARN: Full Apache Spark stack installed and configured with HDFS for distributed storage, YARN for resource management, and Spark History Server for job-level visibility - validated against 2TB file and database source workloads as part of formal UAT acceptance criteria.
- Apache Airflow Orchestration: Apache Airflow configured with DAG-based scheduling for all ingestion, transformation, and load workflows, with email-on-failure alerting and end-to-end pipeline monitoring, providing a production-grade orchestration layer for all current and future data pipelines.
- Keycloak IAM: Keycloak integrated with Active Directory and OpenLDAP for centralised authentication, RBAC enforcement, and session management across Spark and Airflow - directly meeting HIPAA access control, audit trail, and identity verification requirements.
- Prometheus and Grafana Monitoring: Prometheus and Grafana deployed for cluster-level performance monitoring and alerting - providing real-time visibility into CPU, memory, storage utilisation, and job execution health across all nodes in both environments.
- Knowledge Transfer and Docs: Comprehensive configuration documentation, operational guides, and structured KT sessions delivered covering cluster management, DAG development, security operations, and troubleshooting - leaving the internal team fully qualified to manage both environments independently.
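The email-on-failure alerting mentioned for Airflow is usually switched on through a DAG's `default_args`. The following is a minimal sketch, assuming Airflow 2.x-style operator arguments. The helper name `build_default_args`, the owner label, the retry values, and the address are all placeholders and not the client's settings. It deliberately avoids importing Airflow so it runs anywhere.

```python
from datetime import timedelta


def build_default_args(alert_email, retries=1, retry_minutes=5):
    """Build task-level defaults that enable email-on-failure alerting.

    The keys are standard Airflow operator arguments; the values are
    placeholders chosen for illustration.
    """
    if "@" not in alert_email:
        raise ValueError(f"not an email address: {alert_email!r}")
    return {
        "owner": "data-engineering",   # hypothetical owner label
        "email": [alert_email],        # recipients of failure emails
        "email_on_failure": True,      # send mail when a task fails
        "email_on_retry": False,       # stay quiet on automatic retries
        "retries": retries,
        "retry_delay": timedelta(minutes=retry_minutes),
    }


if __name__ == "__main__":
    args = build_default_args("data-alerts@example.com")
    # In a DAG file this dict would be passed as DAG(..., default_args=args).
    print(args)
```

Failure emails are only delivered if SMTP is configured for the Airflow deployment, so the mail settings would need to be in place alongside these defaults.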
Technology Stack
| Category | Technology |
|---|---|
| Processing | Apache Spark |
| Orchestration | Apache Airflow |
| Virtualisation | Proxmox VE |
| Infrastructure | HDFS + YARN |
| Security | Keycloak + RBAC |
| Monitoring | Prometheus + Grafana |
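To make the Processing and Infrastructure rows more concrete, the sketch below assembles a `spark-submit` command for a job running on YARN with event logging enabled, which is what the Spark History Server reads. It is an illustration under stated assumptions: the function name `spark_submit_command`, the executor sizing, the script name, and the HDFS log path are hypothetical and do not reflect the client's configuration.

```python
import shlex


def spark_submit_command(app_path, event_log_dir, num_executors=3,
                         executor_memory="4g", executor_cores=2):
    """Return a spark-submit argument list for a YARN cluster-mode job."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        # Event logs are what the Spark History Server displays.
        "--conf", "spark.eventLog.enabled=true",
        "--conf", f"spark.eventLog.dir={event_log_dir}",
        app_path,
    ]


if __name__ == "__main__":
    # Placeholder script name and log directory.
    cmd = spark_submit_command("ingest_job.py", "hdfs:///spark-logs")
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run` or to an orchestrator task without extra shell quoting.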
- 2 HIPAA-Compliant Environments in 7 Weeks: From zero dedicated infrastructure to fully operational Dev and Prod Spark/Airflow clusters on Proxmox, with all HIPAA technical safeguards implemented and formally accepted within a 7-week fixed-cost engagement.
- 2TB Data Processing Validated: Spark connectivity confirmed against file-based (1.5-2TB) and database (2TB) sources, with Airflow DAGs orchestrating end-to-end workflows validated through formal UAT.
- Zero to Self-Sufficient Team: The internal team began with no Spark, Airflow, HDFS/YARN, or Keycloak experience. Structured KT with full documentation left them independently qualified to manage both environments post-handoff.
- 6-Phase Delivery on Schedule: All phases completed within the $33K fixed-cost SOW with formal acceptance criteria signed off at each milestone and zero production disruption throughout.
“We went from having nothing to having a fully compliant, production-grade data platform in seven weeks. The team understood our HIPAA requirements from day one and did not cut a single corner on security.”
– Head of Data Engineering / CTO.
A US healthcare enterprise managing 2TB of sensitive beneficiary data had no data engineering infrastructure, no HIPAA-compliant processing environment, and an internal team with zero Spark or Airflow expertise. Ksolves delivered a production-grade, HIPAA-compliant Spark and Airflow platform on Proxmox in 7 weeks – with separate Dev and Prod environments, full security hardening across all layers, and a structured knowledge transfer that left the team fully self-sufficient. The $33K fixed-cost engagement delivered two fully operational cluster environments, validated 2TB data processing across file and database sources, and established a compliance-first architecture with Keycloak IAM, OS hardening, RBAC, and audit logging built in from design. With this foundation operational, the client can now extend the platform to support AI/ML workloads, advanced analytics, and additional data sources without rebuilding the security or infrastructure layer. To explore our complete range of Big Data Services, visit our site.