- The client had a product for enhancing database performance.
- The product already supported a couple of RDBMS engines.
- Benchmarking the HDFS, HBase, and Hive file systems with it was not yet a standard practice.
- OLAP performance had to be analysed and monitored per node.
- Routing distributed queries through the product was a challenge.
- A Spark job with a Cassandra connector was needed to analyse time-series data in Cassandra.
- The default Spark queries had to be modified to utilize the product's capabilities.
- A 3-node Apache Hadoop cluster was created.
- Data sets of different sizes were written to and read back from HDFS to baseline the cluster.
- Hive and HBase queries were built and executed in different combinations to baseline the performance of the cluster services.
- Memory and system-level tuning was performed to obtain optimal performance from the cluster.
- Fraud detection: for the Cassandra DB workload, time-series data for 100M transactions was populated on the cluster, and a file containing 1M lost cards was supplied to a Spark job to detect fraudulent transactions.
- Different executor combinations were tested to find the best performance point.
- Spark SQL queries were tuned to route through the product's cache, thereby achieving optimal throughput.
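The core of the fraud check above is a membership test of each transaction's card against the lost-card set; in the Spark job this would be a broadcast join or `isin` filter at scale. A minimal plain-Python sketch of that logic, with illustrative field names (the actual schema is not given in the source):

```python
# Flag any transaction whose card id appears in the lost-card file.
# Field names (card_id, amount, ts) are illustrative assumptions.

def load_lost_cards(lines):
    """Parse one card id per line into a set for O(1) membership tests."""
    return {line.strip() for line in lines if line.strip()}

def flag_fraud(transactions, lost_cards):
    """Yield transactions made with a reported-lost card."""
    for txn in transactions:
        if txn["card_id"] in lost_cards:
            yield txn

lost = load_lost_cards(["4111-0001", "4111-0007"])
txns = [
    {"card_id": "4111-0001", "amount": 250.0, "ts": "2021-03-01T10:00:00"},
    {"card_id": "4111-0002", "amount": 40.0,  "ts": "2021-03-01T10:01:00"},
]
flagged = list(flag_fraud(txns, lost))
# flagged contains only the 4111-0001 transaction
```

With 1M lost cards, the set fits comfortably in executor memory, which is why a broadcast-style lookup (rather than a shuffle join against 100M rows) is the natural Spark formulation.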
- The client was able to baseline HDFS, HBase, and Hive performance.
- Cassandra DB performance was enhanced so that fraud-detection scenarios ran up to 20% faster than with the default configuration.
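The executor sweep mentioned above can be sketched as a small sizing helper that enumerates candidate `spark-submit` settings for a fixed-size worker. The per-node core and memory figures below are illustrative assumptions, not the client's actual hardware; only the 3-node count comes from the source.

```python
# Enumerate candidate (executors-per-node, cores, memory) combinations to
# try with spark-submit. Node sizes here are illustrative assumptions.

NODE_CORES = 16      # usable cores per worker (after OS/daemon overhead)
NODE_MEM_GB = 60     # usable memory per worker
NUM_NODES = 3        # cluster size from the project description

def executor_combos(node_cores=NODE_CORES, node_mem_gb=NODE_MEM_GB):
    """Yield (executors_per_node, cores_per_executor, mem_per_executor_gb)."""
    for cores in (2, 4, 5, 8):
        per_node = node_cores // cores
        if per_node == 0:
            continue
        mem = node_mem_gb // per_node
        yield per_node, cores, mem

for per_node, cores, mem in executor_combos():
    print(f"spark-submit --num-executors {per_node * NUM_NODES} "
          f"--executor-cores {cores} --executor-memory {mem}g ...")
```

Each combination would then be benchmarked against the fraud-detection workload to find the best performance point.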