Category Archives: projects in hadoop

Projects in Hadoop

Projects in Hadoop help process large amounts of data using clusters of commodity hardware.



  • Kylix: A Sparse Allreduce for Commodity Clusters
    Allreduce is a basic building block for parallel computing. Our target here is “Big Data” processing on commodity clusters (mostly sparse power-law data). Allreduce can be used to synchronize models, to maintain distributed datasets, and to perform operations on distributed data such as sparse matrix multiply. We first review a key constraint on cluster communication, the minimum efficient packet size, which hampers the use of direct all-to-all protocols on large networks. Our allreduce network is a nested, heterogeneous-degree butterfly. We show that communication volume in lower layers is typically much less than in the top layer, and that total communication across all layers is a small constant factor larger than that of the top layer, which is close to optimal. A chart of network communication volume across layers has a characteristic “Kylix” shape, which gives the method its name. For optimum performance, the butterfly degrees also decrease down the layers. Furthermore, to efficiently route sparse updates to the nodes that need them, the network must be nested. While the approach is amenable to various kinds of sparse data, almost all “Big Data” sets show power-law statistics, and from the properties of these, we derive methods for optimal network design. Finally, we present experiments with Kylix on Amazon EC2 that demonstrate significant improvements over existing systems such as PowerGraph and Hadoop.
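
The allreduce idea in the Kylix abstract can be illustrated with a much simpler, in-process simulation. The sketch below is not the Kylix design: it assumes a plain degree-2 butterfly with no nesting, a power-of-two node count, and sparse vectors stored as `{index: value}` dictionaries. The names `merge_sparse` and `butterfly_allreduce` are hypothetical.

```python
def merge_sparse(a, b):
    """Sum two sparse vectors stored as {index: value} dicts."""
    out = dict(a)
    for idx, val in b.items():
        out[idx] = out.get(idx, 0.0) + val
    return out


def butterfly_allreduce(vectors):
    """Simulate a degree-2 butterfly allreduce (sum) over sparse vectors.

    vectors[i] is the local sparse vector of node i. This is a simplified
    stand-in for illustration: no nesting, no heterogeneous degrees.
    """
    n = len(vectors)
    if n == 0 or n & (n - 1):
        raise ValueError("node count must be a power of two")
    state = [dict(v) for v in vectors]
    step = 1
    while step < n:
        # In each layer, node i combines its partial sum with that of
        # its partner, node (i XOR step).
        state = [merge_sparse(state[i], state[i ^ step]) for i in range(n)]
        step *= 2
    return state


if __name__ == "__main__":
    nodes = [{0: 1.0, 5: 2.0}, {5: 3.0}, {7: 1.0}, {0: 4.0, 7: 1.0}]
    for rank, vec in enumerate(butterfly_allreduce(nodes)):
        print(rank, sorted(vec.items()))
```

After log2(n) layers every simulated node holds the same summed vector. A real implementation would exchange only the indices each partner needs, which is where the nesting described in the abstract comes in.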


  • Testing iOS Apps with HadoopUnit: Rapid Distributed GUI Testing
    Smartphone users have come to expect high-quality apps. This has increased the importance of software testing in mobile software development. Unfortunately, testing apps, particularly the GUI, can be very time-consuming. Exercising every user interface element and verifying transitions between different views of the app under test quickly becomes problematic. For example, execution of iOS GUI test suites using Apple’s UI Automation framework can take an hour or more if the app’s interface is complicated. The longer it takes to run a test, the less frequently the test can be run, which in turn reduces software quality. This book describes how to accelerate the testing process for iOS apps using HadoopUnit, a distributed test execution environment that leverages the parallelism inherent in the Hadoop platform. HadoopUnit was previously used to run unit and system tests in the cloud. It has been modified to perform GUI testing of iOS apps on a small-scale cluster, a modest computing infrastructure available to almost every developer. Experimental results have shown that distributed test execution with HadoopUnit can significantly outperform test execution on a single machine, even if the cluster used for the execution is as small as two nodes. This means that the approach described in this book could be adopted without a huge investment in IT resources. HadoopUnit is a cost-effective solution for reducing lengthy test execution times of system-level GUI testing of iOS apps.
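
To see why even a two-node cluster can shorten a GUI test run, as the HadoopUnit abstract reports, consider a rough sketch of spreading test cases across nodes. The abstract does not describe how HadoopUnit schedules tests, so the greedy longest-first assignment below is an assumption made for illustration, and the function name, test names, and durations are all hypothetical.

```python
import heapq


def distribute_tests(durations, nodes):
    """Greedily assign tests to nodes, longest test first.

    durations maps a test name to its estimated run time in minutes.
    Returns (assignment, makespan), where assignment[i] lists the tests
    given to node i and makespan is the busiest node's total time.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    heap = [(0.0, i) for i in range(nodes)]  # (current load, node id)
    assignment = [[] for _ in range(nodes)]
    for name, minutes in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)  # least-loaded node so far
        assignment[i].append(name)
        heapq.heappush(heap, (load + minutes, i))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


if __name__ == "__main__":
    suite = {"login": 20, "settings": 15, "checkout": 15, "search": 10}
    print(distribute_tests(suite, 1))
    print(distribute_tests(suite, 2))
```

With these made-up durations, one node needs 60 minutes while two nodes finish in 30. Real runs also pay for job startup and result collection, which this sketch ignores.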


  • The Role of Text Pre-processing in Opinion Mining on a Social Media Language Dataset
    This work describes an opinion mining application over a dataset extracted from the web and composed of reviews containing Internet slang, abbreviations, and typos. Opinion mining is a field of study that tries to identify and classify subjectivity, such as opinions, emotions, or sentiments, in natural language. In this research, 759,176 Portuguese reviews were extracted from the Google Play app store. Due to the large number of reviews, large-scale processing techniques were needed, involving powerful frameworks such as Hadoop and Mahout. Based on the tests conducted, it was concluded that preprocessing has an insignificant role in the opinion mining task for the specific domain of mobile app reviews. The work also contributed a corpus of about 759 thousand reviews and a dictionary of slang and abbreviations commonly used on the Internet.
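
For readers unfamiliar with the kind of preprocessing the opinion mining abstract evaluates, here is a minimal sketch. It assumes three steps (lowercasing, collapsing characters repeated three or more times, and slang expansion), which may differ from the steps used in the study. The `SLANG` entries are common Portuguese abbreviations chosen for illustration and are not taken from the authors' dictionary.

```python
import re

# Illustrative entries only; the study's actual dictionary is not shown here.
SLANG = {"vc": "você", "pq": "porque", "mt": "muito"}


def preprocess(review, slang=SLANG):
    """Normalize one review and return its tokens."""
    text = review.lower()
    # Collapse any character repeated 3+ times ("boooom" -> "bom").
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Keep word tokens only; punctuation is dropped.
    tokens = re.findall(r"\w+", text)
    # Expand known slang and abbreviations.
    return [slang.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    print(preprocess("Vc é mt boooom!!!"))
```

In a Hadoop setting, a function like this would typically run inside the map step, once per review, before feature extraction.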


  • Towards the optimization of a parallel streaming engine for telco applications
    Parallel and distributed computing is becoming essential for processing, in real time, the increasingly massive volume of data collected by telecommunications companies. Existing computational paradigms such as MapReduce (and its popular open-source implementation Hadoop) provide a scalable, fault-tolerant mechanism for large-scale batch computations. However, many applications in the telco ecosystem require a real-time, incremental streaming approach to process data and enable proactive care. Storm is a scalable, fault-tolerant framework for the analysis of real-time streaming data. In this paper we provide a motivation for the use of real-time streaming analytics in the telco ecosystem. We perform an experimental investigation into the performance of Storm, focusing in particular on the impact of parameter configuration. This investigation reveals that optimal parameter choice is highly non-trivial, and we use this as motivation to create a parameter configuration engine. As first steps towards the creation of this engine, we provide a deep analysis of the inner workings of Storm and a set of models describing data flow cost, central processing unit (CPU) cost, and system management cost.


  • BigCache for big-data systems
    Big-data systems are increasingly used in many disciplines for important tasks such as knowledge discovery and decision making by processing large volumes of data. Big-data systems rely on hard-disk drive (HDD) based storage to provide the necessary capacity. However, as big-data applications rapidly grow more diverse and demanding, HDD storage becomes insufficient to satisfy their performance requirements. Emerging solid-state drives (SSDs) promise great IO performance that can be exploited by big-data applications, but they still face serious limitations in capacity, cost, and endurance and therefore must be strategically incorporated into big-data systems. This paper presents BigCache, an SSD-based distributed caching layer for big-data systems. It is designed to be seamlessly integrated with existing big-data systems and to transparently accelerate IOs for diverse big-data applications. The management of the distributed SSD caches in BigCache is coordinated with the job management of big-data systems in order to support cache-locality-driven job scheduling. BigCache is prototyped in Hadoop to provide caching upon HDFS for MapReduce applications. It is evaluated using typical MapReduce applications, and the results show that BigCache reduces the runtime of WordCount by 38% and the runtime of TeraSort by 52%. The results also show that BigCache is able to achieve significant speedup by caching only partial input for the benchmarks, owing to its ability to cache partial input and its replacement policy that recognizes application access patterns.
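
The runtime reductions quoted in the BigCache abstract can be restated as speedup factors. The short sketch below only does that arithmetic: a fractional reduction r in runtime corresponds to a speedup of 1 / (1 - r). The function name is hypothetical.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional runtime reduction into a speedup factor.

    A reduction of 0.38 means the new runtime is 62% of the old one.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    for name, reduction in [("WordCount", 0.38), ("TeraSort", 0.52)]:
        print(f"{name}: {speedup_from_reduction(reduction):.2f}x")
```

So the reported 38% and 52% reductions correspond to speedups of roughly 1.6x for WordCount and 2.1x for TeraSort.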


Hadoop Solutions offers Hadoop projects (topics, theses, and projects) at an affordable price with guaranteed output. Contact us for more details.

Work Progress

PHD - 24

M.TECH - 125

B.TECH - 95

BIG DATA - 110

