The Challenges
- Real-time data collection with high volume.
- Analysis of time series & historical data for use in Predictive Maintenance via Machine Learning.
- Volume of data started growing exponentially.
- Frequency of collecting & processing data from millions of IoT Devices.
- Offline data processing capability.
- Selecting and processing aggregated Data.
- Storing and retrieving unstructured data with a traditional RDBMS.
- Scalability & Performance.
- Delivery of data is not guaranteed.
- Real-time reporting of historical data.
- Data model not scalable.
The Solution
We analyzed the customer's domain and ongoing challenges and found that this was a typical Big Data use case, so we proposed the following technology stack. We also modified the data model to fit the Cassandra cluster architecture.
We created an Apache NiFi cluster with 5 nodes and an Apache Cassandra cluster with 10 nodes.

Apache Cassandra
- Cassandra's elastic scalability lets us add new nodes to the cluster with zero downtime as the data volume grows.
- Inbuilt Fault tolerance and High Availability.
- Cassandra handles data replication on its own which minimizes administrative tasks.
- With Apache Cassandra, we model tables around the queries or pages that read them, so no join queries are needed. Only the partition key is required to get the desired data.
- We are able to store both structured and unstructured data.
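The query-driven modelling described above can be sketched as follows. This is a minimal illustration, not the project's actual schema: the table name `readings_by_device_day`, its columns, the daily bucketing, and the helper `partition_key` are all assumptions made for the example. The point it shows is that a read supplies the whole partition key, so no join is involved.

```python
from datetime import datetime, timezone
from typing import Tuple

# Hypothetical CQL schema (table and column names are assumptions, not the
# project's real data model). The partition key is (device_id, day); rows
# inside a partition are clustered by reading time, newest first.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS readings_by_device_day (
    device_id text,
    day       date,
    read_at   timestamp,
    payload   text,
    PRIMARY KEY ((device_id, day), read_at)
) WITH CLUSTERING ORDER BY (read_at DESC);
"""

# A read names the whole partition key, so no join is needed.
SELECT_CQL = (
    "SELECT read_at, payload FROM readings_by_device_day "
    "WHERE device_id = %s AND day = %s"
)


def partition_key(device_id: str, read_at: datetime) -> Tuple[str, str]:
    """Return the (device_id, day) pair that locates a reading's partition.

    Expects a timezone-aware datetime; the day bucket is computed in UTC.
    """
    day = read_at.astimezone(timezone.utc).date().isoformat()
    return device_id, day
```

Bucketing by day is one common way to keep partitions bounded for time series data; a different bucket size may suit other read patterns better.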
Apache NiFi
- With NiFi, we get guaranteed delivery of processed data: a flow that stops because of an error resumes from the point where it stopped.
- Due to low latency and high throughput, we get real-time responses.
- With the web-based interface, users can track the data transformation performed by each processor and easily debug issues.
- Parallel data processing with high throughput is possible due to the multi-node cluster architecture.
- This allowed us to increase the frequency of data collection and processing for millions of devices from twice a day to every hour.
- NiFi data aggregation: NiFi's split processors and MergeContent allow us to combine a number of flow files into a single flow file to get aggregated data.
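The merge step in the last bullet can be pictured with a small sketch. This is plain Python illustrating the binning idea only; it is not NiFi configuration or NiFi API code, and the function name `merge_in_bins` and the `bin_size` parameter are hypothetical.

```python
from typing import Iterable, Iterator, List


def merge_in_bins(records: Iterable[str], bin_size: int) -> Iterator[str]:
    """Combine small records into newline-joined batches of up to bin_size."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    current: List[str] = []
    for record in records:
        current.append(record)
        if len(current) == bin_size:
            yield "\n".join(current)
            current = []
    if current:  # flush the final, partially filled bin
        yield "\n".join(current)
```

For example, `list(merge_in_bins(["a", "b", "c", "d", "e"], 2))` yields three batches: `"a\nb"`, `"c\nd"`, and `"e"`.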
Apache Kafka
Apache Kafka receives and processes data streams from various sources.
Architecture Diagram

The Result
- Able to collect real-time data from millions of devices with high frequency.
- The database now handles high volumes of data with faster reads and writes. Approximately 40 GB of data is generated per day, and a data retention period of two weeks is supported.
- Able to perform distributed data processing.
- High data availability with built-in fault tolerance.
- Thanks to the clustered database and its inbuilt replication, no administrative work is needed for data replication.
- Horizontal scaling of hardware is possible without shutting down the application or the database.
- Able to handle millions of concurrent data requests without any performance impact.
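As a rough back-of-the-envelope check on the figures above, the retained volume can be estimated as follows. The replication factor is an assumption (the source does not state it), and the estimate ignores compression, compaction overhead, and uneven data distribution.

```python
DAILY_INGEST_GB = 40        # approximate daily volume, from the results above
RETENTION_DAYS = 14         # two-week retention, from the results above
CASSANDRA_NODES = 10        # cluster size stated in the solution
REPLICATION_FACTOR = 3      # ASSUMPTION: not stated in the source

raw_gb = DAILY_INGEST_GB * RETENTION_DAYS            # data held before replication
replicated_gb = raw_gb * REPLICATION_FACTOR          # total across all replicas
per_node_gb = replicated_gb / CASSANDRA_NODES        # assumes even distribution
retention_seconds = RETENTION_DAYS * 24 * 60 * 60    # e.g. for a table-level TTL

print(raw_gb, replicated_gb, per_node_gb, retention_seconds)
```

Under these assumptions the cluster holds about 560 GB before replication, about 1,680 GB in total, and roughly 168 GB per node, with a two-week retention window of 1,209,600 seconds.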