Hadoop Research Projects

Hadoop research projects help process large amounts of data on clusters of commodity hardware.



  • Understanding the behavior of in-memory computing workloads
    The increasing demands of big data applications have led researchers and practitioners to turn to in-memory computing to speed processing. For instance, the Apache Spark framework stores intermediate results in memory to deliver good performance on iterative machine learning and interactive data analysis tasks. To the best of our knowledge, though, little work has been done to understand Spark's architectural and microarchitectural behaviors. Furthermore, although conventional commodity processors have been well optimized for traditional desktops and HPC, their effectiveness for Spark workloads remains to be studied. To shed some light on the effectiveness of conventional general-purpose processors on Spark workloads, we study their behavior in comparison to those of Hadoop, CloudSuite, SPEC CPU2006, TPC-C, and DesktopCloud. We evaluate the benchmarks on a 17-node Xeon cluster. Our performance results reveal that Spark workloads have significantly different characteristics from Hadoop and traditional HPC benchmarks. At the system level, Spark workloads have good memory bandwidth utilization (up to 50%), stable memory accesses, and high disk IO request frequency (200 per second). At the microarchitectural level, the cache and TLB are effective for Spark workloads, but the L2 cache miss rate is high. We hope this work yields insights for chip and datacenter system designers.


  • SMARTH: Enabling Multi-pipeline Data Transfer in HDFS
    Hadoop is a popular open-source implementation of the MapReduce programming model for handling large data sets, and HDFS is one of Hadoop's most commonly used distributed file systems. Surprisingly, we found that HDFS is inefficient when handling uploads of data files from the client's local file system, especially when the storage cluster is configured to use replicas. The root cause is HDFS's synchronous pipeline design. In this paper, we introduce an improved HDFS design called SMARTH. It uses asynchronous multi-pipeline data transfers instead of a single-pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode sends it a list of "high performance" datanodes that it expects to yield the highest throughput for that client. By choosing higher-performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. Specifically, SMARTH improves data transfer throughput by 27-245% in a heterogeneous virtual cluster on Amazon EC2.


  • Columnar NoSQL CUBE: Aggregation operator for columnar NoSQL data warehouse
    The large volumes of data generated by the major web players call for new management models and new storage and processing architectures that can find information quickly in a large volume of data. Column-oriented NoSQL (Not Only SQL) databases offer the model best suited to big data warehouses and to multidimensional data structured as OLAP cubes. Because these systems lack OLAP cube computation operators, we propose in this paper a new aggregation operator called CN-CUBE (Columnar NoSQL CUBE), which allows data cubes to be computed from data warehouses stored in a column-oriented NoSQL database management system. We implemented the CN-CUBE operator using the Phoenix SQL interface of the HBase DBMS and conducted experiments on a public data warehouse in a distributed environment built on the Hadoop platform. The results show that the CN-CUBE operator achieves OLAP cube computation times well suited to NoSQL warehouses.


  • Parallel Randomly Compressed Cubes: A scalable distributed architecture for big tensor decomposition
    This article combines a tutorial on state-of-the-art tensor decomposition as it relates to big data analytics, with original research on parallel and distributed computation of low-rank decomposition for big tensors, and a concise primer on Hadoop-MapReduce. A novel architecture for parallel and distributed computation of low-rank tensor decomposition that is especially well suited for big tensors is proposed. The new architecture is based on parallel processing of a set of randomly compressed, reduced-size replicas of the big tensor. Each replica is independently decomposed, and the results are joined via a master linear equation per tensor mode. The approach enables massive parallelism with guaranteed identifiability properties: if the big tensor is of low rank and the system parameters are appropriately chosen, then the rank-one factors of the big tensor will indeed be recovered from the analysis of the reduced-size replicas. Furthermore, the architecture affords memory/storage and complexity gains for a big tensor of rank F. No sparsity is required in the tensor or the underlying latent factors, although such sparsity can be exploited to improve memory, storage, and computational savings.


  • Maiter: An Asynchronous Graph Processing Framework for Delta-Based Accumulative Iterative Computation
    A myriad of graph-based algorithms in machine learning and data mining require parsing relational data iteratively. These algorithms are implemented in a large-scale distributed environment to scale to massive data sets. To accelerate these large-scale graph-based iterative computations, we propose delta-based accumulative iterative computation (DAIC). Unlike traditional iterative computations, which iteratively update the result based on the result from the previous iteration, DAIC updates the result by accumulating the "changes" between iterations. With DAIC, we can process only the "changes" and so avoid negligible updates. Furthermore, we can perform DAIC asynchronously to bypass the high-cost synchronous barriers in heterogeneous distributed environments. Based on the DAIC model, we design and implement an asynchronous graph processing framework, Maiter. We evaluate Maiter on a local cluster as well as on the Amazon EC2 cloud. The results show that Maiter achieves as much as a 60× speedup over Hadoop and outperforms other state-of-the-art frameworks.
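
The datanode ranking described in the SMARTH entry above can be illustrated with a small single-process sketch. It is not SMARTH or HDFS code: the class name `NameNodeSketch`, its methods, and the exponential smoothing of reported speeds are all assumptions made for illustration. The abstract only says that transfer speeds arrive with heartbeats and that the namenode sorts datanodes by past performance relative to each client.

```python
from collections import defaultdict


class NameNodeSketch:
    """Hypothetical stand-in for the namenode's bookkeeping (not real HDFS code)."""

    def __init__(self, smoothing=0.5):
        # Weight given to the newest report; the smoothing rule is an assumption.
        self.smoothing = smoothing
        # speeds[client][datanode] -> smoothed observed throughput in MB/s
        self.speeds = defaultdict(dict)

    def heartbeat(self, datanode, client, mb_per_s):
        """Record a transfer speed that a datanode observed for a given client."""
        known = self.speeds[client]
        if datanode in known:
            known[datanode] = (self.smoothing * mb_per_s
                               + (1 - self.smoothing) * known[datanode])
        else:
            known[datanode] = mb_per_s

    def pick_datanodes(self, client, count):
        """Return up to `count` datanodes, fastest first, for this client."""
        known = self.speeds[client]
        ranked = sorted(known, key=lambda dn: known[dn], reverse=True)
        return ranked[:count]


namenode = NameNodeSketch()
namenode.heartbeat("dn1", "clientA", 40.0)
namenode.heartbeat("dn2", "clientA", 90.0)
namenode.heartbeat("dn3", "clientA", 60.0)
namenode.heartbeat("dn1", "clientA", 120.0)  # dn1 improves to a smoothed 80 MB/s
print(namenode.pick_datanodes("clientA", 2))
```

Keeping one table per client reflects the abstract's point that datanodes are ranked relative to each client rather than globally.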
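
For the CN-CUBE entry above, it may help to see what a CUBE aggregation produces. The sketch below shows only the generic semantics of a CUBE (one group-by per subset of the dimensions) on in-memory rows. It is not the CN-CUBE algorithm, which operates on a column-oriented store through Phoenix and HBase. The function name `cube` and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (the 2^n group-bys of a CUBE)."""
    result = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                # Group key is the row's values on the chosen dimensions.
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            result[dims] = dict(totals)
    return result


# Made-up sample data, for illustration only.
sales = [
    {"city": "Paris", "year": 2014, "amount": 10},
    {"city": "Paris", "year": 2015, "amount": 5},
    {"city": "Lyon", "year": 2014, "amount": 7},
]
for dims, totals in cube(sales, ("city", "year"), "amount").items():
    print(dims, totals)
```

With two dimensions the result holds four groupings: the grand total, totals per city, totals per year, and totals per (city, year) pair.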
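
The delta-based accumulation described in the Maiter entry above can be sketched for PageRank. This is a single-process illustration under stated assumptions, not Maiter's implementation: the function name `daic_pagerank`, the damping value, and the stopping rule are chosen here for the example. Each vertex keeps an accumulated value and a pending delta; processing a vertex folds its delta into the value and forwards a damped share of that delta to its out-neighbours. Vertices are visited without a barrier between rounds, so a vertex may consume deltas produced earlier in the same sweep.

```python
def daic_pagerank(graph, damping=0.8, tolerance=1e-9):
    """Delta-accumulative PageRank on `graph`, a dict of vertex -> list of out-neighbours.

    Every neighbour must also appear as a key of `graph`.
    """
    rank = {v: 0.0 for v in graph}
    delta = {v: 1.0 - damping for v in graph}  # initial "change" for each vertex
    while sum(delta.values()) > tolerance:
        for v in graph:
            change = delta[v]
            if change == 0.0:
                continue  # nothing new to propagate from this vertex
            delta[v] = 0.0
            rank[v] += change  # accumulate the change into the result
            out_links = graph[v]
            for neighbour in out_links:
                # Only the change is forwarded, never the full rank.
                delta[neighbour] += damping * change / len(out_links)
    return rank


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(daic_pagerank(web))
```

The loop stops once the total pending change is negligible, which mirrors the abstract's point that DAIC skips updates that no longer matter.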



Hadoop Solutions offers Hadoop research projects (topics, theses, and projects) at an affordable price with guaranteed output. Contact us for more details.

Work Progress

PHD - 24

M.TECH - 125

B.TECH - 95

BIG DATA - 110


