hatS: A Heterogeneity-Aware Tiered Storage for Hadoop
Hadoop has become the de-facto large-scale data processing framework for modern analytics applications. A major obstacle to sustaining high performance and scalability in Hadoop is managing data growth while meeting ever-higher I/O demand. To this end, a promising trend in storage systems is to utilize hybrid and heterogeneous devices, such as Solid State Disks (SSDs), RAM disks, and Network Attached Storage (NAS), which can help achieve very high I/O rates at acceptable cost. However, the Hadoop Distributed File System (HDFS) is unable to exploit such heterogeneous storage: HDFS assumes that the underlying devices are homogeneous storage blocks, disregarding their individual I/O characteristics, which leads to performance degradation.
In this paper, we present hatS, a Heterogeneity-Aware Tiered Storage, a novel redesign of HDFS into a multi-tiered storage system that seamlessly integrates heterogeneous storage technologies into the Hadoop ecosystem. hatS also introduces data placement and retrieval policies that improve the utilization of storage devices based on characteristics such as I/O throughput and capacity. We evaluate hatS using an actual implementation on a medium-sized cluster consisting of HDDs and two types of SSDs (SATA SSD and PCIe SSD). Experiments show that hatS achieves 32.6% higher read bandwidth, on average, than HDFS for the test Hadoop jobs (such as Grep and TestDFSIO) by directing 64% of the I/O accesses to the SSD tiers. We also evaluate our approach with trace-driven simulations using synthetic Facebook workloads, and show that, compared to the standard setup, hatS improves the average I/O rate by 36%, which results in a 26% improvement in job completion time.
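The abstract does not spell out the placement policy itself; below is a minimal, hypothetical sketch of one way a heterogeneity-aware placement decision could work, greedily preferring the highest-throughput tier that still has capacity so faster devices (e.g., PCIe SSD) absorb most I/O. The `Tier` fields, numbers, and function names are illustrative assumptions, not the paper's actual design.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    # Illustrative tier description; throughput and capacity are assumed
    # to be profiled per device class (e.g., PCIe SSD, SATA SSD, HDD).
    name: str
    throughput_mbps: float  # measured sequential I/O throughput
    capacity_gb: float      # total capacity
    used_gb: float = 0.0    # space already allocated

def place_block(tiers, block_gb):
    """Pick the fastest tier that can still hold the block.

    As faster tiers fill up, placement falls back to slower ones,
    so high-throughput devices serve the bulk of accesses while
    capacity lasts (the behavior the abstract attributes to hatS).
    """
    candidates = [t for t in tiers if t.capacity_gb - t.used_gb >= block_gb]
    if not candidates:
        raise RuntimeError("no tier has free capacity for this block")
    best = max(candidates, key=lambda t: t.throughput_mbps)
    best.used_gb += block_gb
    return best.name

# Hypothetical three-tier setup mirroring the paper's cluster mix.
tiers = [
    Tier("pcie_ssd", throughput_mbps=1500, capacity_gb=1),
    Tier("sata_ssd", throughput_mbps=500, capacity_gb=2),
    Tier("hdd", throughput_mbps=150, capacity_gb=10),
]
placements = [place_block(tiers, 0.5) for _ in range(6)]
# First blocks land on the PCIe tier, then spill to SATA as it fills.
```

A real system would also weigh replication, locality, and retrieval-time tier selection, which this sketch omits for brevity.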
Similar IEEE Project Titles
- Performance Implications of SSDs in Virtualized Hadoop Clusters
- ALOJA: A systematic study of Hadoop deployment variables to enable automated characterization of cost-effectiveness
- An Adaptive Auto-configuration Tool for Hadoop
- Exploiting Hadoop Topology in Virtualized Environments
- Automating the Hadoop configuration for easy setup in resilient cloud systems