Improving performance of small-file accessing in Hadoop
The Hadoop Distributed File System (HDFS) is an open-source system designed to run on commodity hardware and suited to applications with large data sets (terabytes). Because the HDFS architecture relies on a single master (NameNode) to manage metadata for multiple slaves (DataNodes), the NameNode often becomes a bottleneck, especially when handling a large number of small files. To maximize efficiency, the NameNode stores the entire metadata of HDFS in its main memory. With too many small files, the NameNode can run out of memory.
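The memory pressure described above can be illustrated with a rough back-of-the-envelope estimate. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block) and a 128 MB block size; neither figure comes from the paper, and the function names are hypothetical.

```python
import math

# Assumption: ~150 bytes of NameNode heap per file or block object.
# This is a widely quoted rule of thumb, not a figure from the paper.
BYTES_PER_OBJECT = 150
# Assumption: 128 MB HDFS block size.
BLOCK_SIZE = 128 * 1024 * 1024


def unpacked_metadata_bytes(num_files, bytes_per_object=BYTES_PER_OBJECT):
    """Estimate heap used when each small file is stored on its own.

    Each small file fits in one block, so it costs one file object
    plus one block object.
    """
    return num_files * 2 * bytes_per_object


def packed_metadata_bytes(num_files, avg_file_size,
                          block_size=BLOCK_SIZE,
                          bytes_per_object=BYTES_PER_OBJECT):
    """Estimate heap used when the same files are packed into one large file.

    Counts one file object plus one object per block of the packed file.
    Index files that an archive format would also need are ignored here.
    """
    total_bytes = num_files * avg_file_size
    blocks = math.ceil(total_bytes / block_size)
    return (1 + blocks) * bytes_per_object


if __name__ == "__main__":
    files = 10_000_000          # ten million small files
    size = 10 * 1024            # 10 KB each
    print("unpacked:", unpacked_metadata_bytes(files), "bytes")
    print("packed:  ", packed_metadata_bytes(files, size), "bytes")
```

Under these assumptions, ten million 10 KB files cost on the order of gigabytes of NameNode heap when stored individually, but only about a hundred kilobytes when packed, which is the motivation for archive-based approaches.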
In this paper, we propose a mechanism based on Hadoop Archive (HAR), called New Hadoop Archive (NHAR), to improve memory utilization for metadata and enhance the efficiency of accessing small files in HDFS. In addition, we extend HAR's capabilities to allow additional files to be inserted into existing archive files. Our experimental results show that our approach can drastically improve the access efficiency of small files, outperforming HAR by up to 85.47%.
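The general idea of an archive that packs small files behind an index, and that accepts new files after creation, can be sketched with a toy in-memory model. This is only an illustration of the concept: it is not the HAR on-disk format and not the NHAR design from the paper, and the class and method names are hypothetical.

```python
class ToyArchive:
    """In-memory toy archive: one data blob plus a name -> (offset, length) index.

    Illustrative only; it does not reproduce HAR or NHAR internals.
    """

    def __init__(self):
        self._data = bytearray()   # concatenated contents of all files
        self._index = {}           # file name -> (offset, length)

    def add(self, name, content):
        """Append a file to the archive, even after earlier reads."""
        if name in self._index:
            raise KeyError(f"{name!r} is already in the archive")
        self._index[name] = (len(self._data), len(content))
        self._data += content

    def read(self, name):
        """Look up the file in the index and slice it out of the blob."""
        offset, length = self._index[name]
        return bytes(self._data[offset:offset + length])

    def names(self):
        """Return archived file names in insertion order."""
        return list(self._index)


if __name__ == "__main__":
    archive = ToyArchive()
    archive.add("a.txt", b"alpha")
    archive.add("b.txt", b"beta")
    print(archive.read("b.txt"))
    # Insert an additional file into the existing archive.
    archive.add("c.txt", b"gamma")
    print(archive.names())
```

Because every lookup is a single index probe followed by a slice, many small files share one large backing object, and appending only extends the blob and the index.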
Similar IEEE Project Titles
- Experiments on Networking of Hadoop
- Analytical review on Hadoop Distributed file system
- Load balancing solution based on AHP for Hadoop
- Medical Image Retrieval System in Grid Using Hadoop Framework
- hatS: A Heterogeneity-Aware Tiered Storage for Hadoop