- The client had a product for enhancing database performance.
- The product already supported a couple of RDBMS engines.
- Benchmarking the HDFS, HBase, and Hive file systems with it was not yet a standard practice.
- OLAP performance had to be analysed and monitored per node.
- Routing distributed queries through the product was a challenge.
- A Spark job with a Cassandra connector was needed to analyse time-series data in Cassandra.
- The default Spark queries had to be modified to utilize the product's capabilities.
- A 3-node Apache Hadoop cluster was created.
- Data sets of different sizes were written to and read back from HDFS to baseline the cluster.
- Hive and HBase queries were built and executed in different combinations to baseline the performance of the cluster services.
- Memory and system-level tuning was performed to obtain optimal performance from the cluster.
- Fraud detection: for the Cassandra DB workload, time-series data for 100M transactions was populated on the cluster, and a file containing 1M lost cards was supplied to a Spark job to detect fraudulent transactions.
- Different executor combinations were tested to find the best performance point.
- Spark SQL queries were tuned to route through the product's cache, thereby achieving optimal throughput.
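The core of the fraud check above is a membership test of each transaction's card against the lost-card set; in the Spark job this would be a broadcast join or `isin` filter at scale. A minimal plain-Python sketch of that logic, with illustrative field names (the actual schema is not given in the source):

```python
# Flag any transaction whose card id appears in the lost-card file.
# Field names (card_id, amount, ts) are illustrative assumptions.

def load_lost_cards(lines):
    """Parse one card id per line into a set for O(1) membership tests."""
    return {line.strip() for line in lines if line.strip()}

def flag_fraud(transactions, lost_cards):
    """Yield transactions made with a reported-lost card."""
    for txn in transactions:
        if txn["card_id"] in lost_cards:
            yield txn

lost = load_lost_cards(["4111-0001", "4111-0007"])
txns = [
    {"card_id": "4111-0001", "amount": 250.0, "ts": "2021-03-01T10:00:00"},
    {"card_id": "4111-0002", "amount": 40.0,  "ts": "2021-03-01T10:01:00"},
]
flagged = list(flag_fraud(txns, lost))
# flagged contains only the 4111-0001 transaction
```

With 1M lost cards, the set fits comfortably in executor memory, which is why a broadcast-style lookup (rather than a shuffle join against 100M rows) is the natural Spark formulation.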
- The client was able to baseline HDFS, HBase, and Hive performance.
- Cassandra DB performance was enhanced so that fraud-detection scenarios ran up to 20% faster than with the default configuration.
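The executor sweep mentioned above can be sketched as a small sizing helper that enumerates candidate `spark-submit` settings for a fixed-size worker. The per-node core and memory figures below are illustrative assumptions, not the client's actual hardware; only the 3-node count comes from the source.

```python
# Enumerate candidate (executors-per-node, cores, memory) combinations to
# try with spark-submit. Node sizes here are illustrative assumptions.

NODE_CORES = 16      # usable cores per worker (after OS/daemon overhead)
NODE_MEM_GB = 60     # usable memory per worker
NUM_NODES = 3        # cluster size from the project description

def executor_combos(node_cores=NODE_CORES, node_mem_gb=NODE_MEM_GB):
    """Yield (executors_per_node, cores_per_executor, mem_per_executor_gb)."""
    for cores in (2, 4, 5, 8):
        per_node = node_cores // cores
        if per_node == 0:
            continue
        mem = node_mem_gb // per_node
        yield per_node, cores, mem

for per_node, cores, mem in executor_combos():
    print(f"spark-submit --num-executors {per_node * NUM_NODES} "
          f"--executor-cores {cores} --executor-memory {mem}g ...")
```

Each combination would then be benchmarked against the fraud-detection workload to find the best performance point.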