Category Archives: hadoop related projects

Hadoop related projects

Hadoop-related projects help process large amounts of data using clusters of commodity hardware.



  • An FPGA-Based Tightly Coupled Accelerator for Data-Intensive Applications

    Computation beside a data source plays an important role in achieving high performance with low energy consumption in Big Data processing. In contrast to a conventional workload, Big Data processing frequently requires scanning a massive amount of data in distributed storage. A key technique for reducing energy-consuming processor loads is to install a reconfigurable accelerator that is tightly coupled to a computational resource with interfaces. The accelerator can configure application-specific hardware modules that perform logical and arithmetic operations on data streams transmitted between interfaces, and it can offload control protocols for communication with other computing nodes or storage. In this paper, an FPGA-based accelerator, which is directly attached to DRAM, the network, and storage, is proposed in order to realize an energy-efficient computing system. A simple application that counts the words appearing in the data is implemented to evaluate a prototype system. As the accelerator outperforms similar applications executed on an SSD-based Hadoop framework by a factor of 80.66 to 429, we confirm that using the accelerator for Big Data processing is beneficial.
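The abstract does not describe the word-counting application in further detail. As a rough illustration only, the following Python sketch shows the kind of map-and-reduce word count commonly used as a Hadoop benchmark. The function names (`map_words`, `reduce_counts`, `word_count`) and the tokenization rule are assumptions made for this example; they are not taken from the paper and say nothing about how the FPGA accelerator implements the task.

```python
import re
from collections import Counter
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line.

    Tokenization (lowercase letters, digits, apostrophes) is an assumption.
    """
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", line.lower())]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    return reduce_counts(chain.from_iterable(map_words(line) for line in lines))


if __name__ == "__main__":
    sample = ["Big data needs big clusters", "data beside the data source"]
    for word, n in sorted(word_count(sample).items()):
        print(word, n)
```

In a real Hadoop job the map and reduce steps would run on different nodes, with the framework shuffling pairs between them; here both run in one process to keep the sketch self-contained.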


  • Correlation Aware Technique for SQL to NoSQL Transformation

    For better efficiency of parallel and distributed computing, Apache Hadoop distributes the imported data randomly across data nodes. This mechanism provides some advantages for general data analysis. Following the same concept, Apache Sqoop separates each table into four parts and randomly distributes them across data nodes. However, there is still a database performance concern with this data placement mechanism. This paper proposes a Correlation Aware method on Sqoop (CA_Sqoop) to improve the data placement. It gathers related data as close together as possible to reduce the data transformation cost on the network and improve performance in terms of database usage. CA_Sqoop also considers table correlation and size for better data locality and query efficiency. Simulation results show that the data locality of CA_Sqoop is two times better than that of the original Apache Sqoop.


  • A Fault-Tolerant Strategy of Redeploying the Lost Replicas in Cloud

    In cloud storage centers, replicas of a file may be lost due to node failures, which affects the efficiency of file access as well as users’ satisfaction. To cope with this problem, the method of redeploying the lost replicas on other servers to maintain system availability is often adopted. Normally, a file is divided into many blocks of the same size, and the popularity of the blocks differs within a cloud storage system, so popularity can be used as a parameter in deploying replicas. Therefore, in this paper, the Scarlett system is utilized to determine the optimal number of block replicas based on block popularity. Then, considering the system load, the total cost, and quality of service, we present a selective data recovery method for node failures. In the meantime, a cost-efficient replica deployment strategy, namely CERD, is provided. The strategy has been verified in HDFS. Finally, we simulate an environment with random cloud node failures and compare our strategy with the Hadoop default strategies. The results verify that the CERD strategy can balance the load of the whole system, reduce the total cost of service, and provide higher service quality, which is consistent with the theoretical analysis.


  • SILVERBACK: Scalable association mining for temporal data in columnar probabilistic databases

    We address the problem of large-scale probabilistic association rule mining and consider the trade-offs between the accuracy of the mining results and the quest for scalability on modest hardware infrastructure. We demonstrate how extensions and adaptations of research findings can be integrated in an industrial application, and we present the commercially deployed SILVERBACK framework, developed at Voxsup Inc. SILVERBACK tackles the storage efficiency problem by proposing a probabilistic columnar infrastructure and using Bloom filters and reservoir sampling techniques. In addition, a probabilistic pruning technique based on Apriori has been introduced for mining frequent item-sets. The proposed target-driven technique yields a significant reduction in the size of the frequent item-set candidates. We present extensive experimental evaluations which demonstrate the benefits of a context-aware incorporation of infrastructure limitations into corresponding research techniques. The experiments indicate that, when compared to the traditional Hadoop-based approach of improving scalability by adding more hosts, SILVERBACK (which has been commercially deployed and developed at Voxsup Inc. since May 2011) has much better run-time performance with negligible accuracy sacrifices.
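The abstract names Bloom filters and reservoir sampling but does not say how SILVERBACK configures or combines them. The Python sketch below shows textbook versions of both techniques for orientation only. The class and function names, the bit-array size, the number of hash functions, and the use of SHA-256 are all assumptions chosen for this example, not details of the SILVERBACK implementation.

```python
import hashlib
import random


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k / (i + 1)
    return sample


if __name__ == "__main__":
    seen = BloomFilter()
    for user in ["alice", "bob"]:
        seen.add(user)
    print("alice" in seen)
    print(reservoir_sample(range(1000), 5, random.Random(42)))
```

Both structures trade a small, controllable loss of accuracy for a fixed memory footprint, which matches the storage-efficiency motivation stated in the abstract.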


  • Organizing and Storing Method for Large-Scale Unstructured Data Set with Complex Content

    With the arrival of the big data era, traditional geological industries still produce and collect data in traditional ways, and geoscience information is represented as unstructured data in various forms. These data are often grouped together in a relatively simple way, forming a number of datasets with complex internal structure. However, this does not express well the rich geoscience information carried by unstructured data; it is also inconvenient for expressing complex relationships among the information, and it even hinders the discovery of in-depth knowledge across datasets. Meanwhile, the forms in which such data exist also impede the application of advanced technological methods. In an attempt to solve the problem, this paper proposes a multi-granularity content tree model and a pay-as-you-go mode to support evolving data modeling. These features help to split the data model, position data content precisely, and expand the dimensions of the main features described according to the data subject, and then gradually discover the information contained in the data and the relationships among that information. Considering the large size of the data features, this paper designs a data persistence mode based on HBase, so that data can be processed using technologies within the Hadoop system. This article also presents data content extraction and content tree initial-state algorithms under the MapReduce framework, as well as dynamic loading and local caching algorithms for the content tree, thus forming a basic extract-store-load process. An application example of the model in the geological industries is given at the end.

Hadoop Solutions offers Hadoop Related Projects (topics, theses, and projects) at an affordable price with guaranteed output. Contact us for more details.
