If you had to name one data processing platform that is both fast and scalable, I believe the first thing that would come to your mind is Apache NiFi, and why not? It is one of the most common and widely used data processing platforms. But have you ever wondered just how fast Apache NiFi is? How can a single NiFi cluster process billions of events while maintaining complete data lineage? Ksolves, an Apache NiFi development company, brings you the answers to all of these questions.
In this article we will discuss how Apache NiFi accomplishes the complex task of processing one billion events per second, amounting to petabytes of data per day.
Apache NiFi overview
Apache NiFi is a real-time event processing platform that automates the movement of data between systems. It offers real-time control of data in motion and manages its movement from source to destination. It is based on ‘NiagaraFiles’, software developed by the United States National Security Agency and later donated to the Apache Software Foundation.
Apache NiFi: Processing a billion events per second
The world today is filled with rapidly growing data volumes. With such growing demand, users need tools that can keep up with their data rates, and every tool in the enterprise has to keep up or accessing the data becomes really difficult. A customer adopting Apache NiFi should know how much hardware they will need to sustain those data rates.
Apache NiFi performs such a wide variety of tasks and operations that it is difficult to say in advance what hardware will be needed. When NiFi simply moves data from an FTP server to HDFS, few resources are required. But if NiFi is ingesting data from hundreds of sources while filtering, routing, and performing complex transformations, it will require considerably more resources.
However, NiFi can scale to achieve the required performance, and Ksolves will demonstrate this with a use case.
Before diving deeper, let’s first look at a use case that is both realistic and simple to explain. The use case is as follows-
- There is a bucket in Google Cloud Storage that contains 1.5 TB of NiFi log data.
- NiFi will monitor this bucket, and when data lands in it, NiFi will pull any file whose filename contains “nifi-app”.
- The compression format of each incoming log file needs to be detected; if the file is compressed, it needs to be decompressed.
- Filter out all messages except those with a log level of “WARN” or “ERROR”.
- Convert these log messages into JSON.
- Now compress the JSON.
- Finally, deliver the compressed WARN- and ERROR-level log messages to their destination.
In this fairly common use case, Apache NiFi monitors for data, retrieves it, makes routing decisions, filters it, transforms it, and then pushes it to its final destination.
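In a real deployment each step maps to a NiFi processor (for example, ListGCSBucket/FetchGCSObject for ingest and CompressContent for decompression and compression). Purely as an illustration, the per-file logic of the flow above can be sketched in plain Python; the function name and the assumed log-line layout are ours, not NiFi's:

```python
import gzip
import json
import re
from typing import List, Optional

# Assumed log-line layout: "<date> <time> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$")

def process_log_file(filename: str, payload: bytes) -> Optional[bytes]:
    """Sketch of the per-file logic of the flow described above."""
    # Only pull files whose name contains "nifi-app"
    if "nifi-app" not in filename:
        return None
    # Detect compression (gzip magic bytes) and decompress if needed
    if payload[:2] == b"\x1f\x8b":
        payload = gzip.decompress(payload)
    # Keep only WARN/ERROR messages and convert them to JSON records
    records: List[dict] = []
    for line in payload.decode("utf-8", errors="replace").splitlines():
        m = LOG_LINE.match(line)
        if m and m.group("level") in ("WARN", "ERROR"):
            records.append(m.groupdict())
    # Compress the JSON before delivery
    return gzip.compress(json.dumps(records).encode("utf-8"))
```

In NiFi these decisions are made per FlowFile by individual processors, with back pressure and provenance handled by the framework; the sketch only shows the data-shaping logic.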
The type of hardware needed is the most important topic of discussion when ingesting data with Apache NiFi. In this use case we use Google Kubernetes Engine, which provides 32 cores and 28.8 GB of RAM per node. Here we limit NiFi’s container to 26 cores to make sure that other services on each node, such as the DNS service, have sufficient resources to carry out their tasks.
Apache NiFi also stores the data it is processing on disk, and Apache NiFi developers need to ensure that if a node is lost and its workload moves to a different host, no data is lost. That is why the data is stored on SSD persistent volumes. In Google Kubernetes Engine, larger volumes provide better throughput, so we use a 1 TB volume for the content repository.
The data rate achieved with Apache NiFi depends not only on the hardware but also on the design of the data flow. This is why we grew the cluster in stages to reach the required data rate.
- Single-node cluster- A single node processed 56.41 GB of incoming data, or 283,727,739 records over the five-minute observation window, roughly 946,000 records per second. If a single node is not enough, we need to scale out to more nodes.
- 5-node cluster- Here the cluster processed 264.42 GB of data, at about 4.97 million records per second.
- 25-node cluster- Here we see 1.71 TB of incoming data and about 26 million events per second, roughly linear scaling from the single-node and 5-node results.
We then increased the cluster size to 100 nodes and then to 150 nodes. With 150 nodes, the result was about 12.2 trillion events per day, or roughly 141 million events per second.
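As a quick sanity check on these figures, if we take the single-node count above as a five-minute total and assume throughput scales linearly with node count, a back-of-the-envelope extrapolation lands in the same ballpark as the reported 12.2 trillion events per day:

```python
WINDOW_SECONDS = 300      # assumed five-minute observation window
SECONDS_PER_DAY = 86_400

single_node_rps = 283_727_739 / WINDOW_SECONDS  # ~946,000 records/sec on 1 node
per_node_rps = 26_000_000 / 25                  # ~1.04M records/sec per node at 25 nodes

# Linear extrapolation from the 25-node rate to 150 nodes:
projected_per_day = per_node_rps * 150 * SECONDS_PER_DAY
print(f"{projected_per_day / 1e12:.1f} trillion events/day")  # 13.5, close to the reported 12.2
```

The small gap between the 13.5 trillion linear projection and the 12.2 trillion observed is what one would expect from coordination overhead in a larger cluster.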
Apache NiFi offers high scalability, as the previous section shows, but the question is whether clusters of smaller nodes can match the 32-core machines. To find out, we created clusters of virtual machines of different sizes.
- 4-core Virtual Machines- We checked how Apache NiFi would perform with 4-core VMs and found that a 150-node cluster worked well, although the UI showed some lag. Scaling to 500 nodes degraded the user experience further, and a 750-node cluster ran into node availability problems. Here we limited the NiFi container to 2.5 cores.
- 6-core Virtual Machines- We increased the container limit to 4.5 cores. A 500-node cluster was still a little sluggish, but web requests completed faster. Growing to 700 nodes made little further difference in UI responsiveness. Here we also tried a 1,000-node cluster; although the cluster was stable, it lacked performance.
- 12-core Virtual Machines- With 250 nodes, the cluster processed about 45 million events per second, and with 500 nodes about 90 million events per second.
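The two 12-core data points imply a constant per-node rate, i.e. linear scaling at this VM size; a quick check:

```python
# Reported 12-core results: cluster size -> total events per second
results = {250: 45_000_000, 500: 90_000_000}

# Per-node rate is identical at both sizes: 180,000 events/sec per node
per_node = {nodes: eps // nodes for nodes, eps in results.items()}
print(per_node)  # {250: 180000, 500: 180000}
```

A constant per-node rate as the cluster doubles is the signature of linear scaling: doubling the node count doubles total throughput.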
In a Nutshell
With Apache NiFi it is fast to move data between any two points, but it is just as important to keep up as data volumes grow and unlock new opportunities. With this use case, we can confidently say that a single NiFi cluster can process one billion events per second, though we also need to make sure that all the surrounding tools can handle the data rates. Properly sized and with a well-designed flow, NiFi can perform extremely well.
Ksolves is a leading Apache NiFi consulting company with 350+ Apache NiFi experts serving global as well as domestic clients. If you are looking for NiFi as a service, Ksolves is the right place for you. For further information, give us a call or write to us in the comment section below.