Projects Based on Hadoop
Hadoop solutions offers final year projects based on Hadoop, including big data projects, Hadoop MapReduce projects, Hadoop research projects, and Hadoop project ideas and topics for students.
- Distributed evolutionary approach to data clustering and modeling.
In this article we describe a framework (DEGA-Gen) for applying distributed genetic algorithms to the detection of communities in networks. The framework proposes efficient ways of encoding the network in the chromosomes, which greatly reduces memory use and computation and makes the framework scalable. Different objective functions may be used to divide the network into communities. The framework is implemented using Hadoop, an open source implementation of the MapReduce paradigm. We validate the framework by developing a community detection algorithm that uses modularity as the measure of the division. The result of the algorithm is the network partitioned into non-overlapping communities in such a way that network modularity is maximized. We apply the algorithm to well-known data sets, such as Zachary's karate club, the bottlenose dolphins network, the college football dataset, and the US political books dataset. The framework shows comparable results in achieved modularity while using much less memory for network representation. Further, the framework is scalable and can deal with large graphs, as it was tested on a larger youtube.com dataset.
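As a rough illustration of the modularity objective mentioned in this abstract, the sketch below computes Newman modularity for a given partition of an undirected, unweighted graph. It is not the DEGA-Gen implementation: the chromosome encoding and the genetic operators are not described here, and the names `modularity`, `edges`, and `community_of` are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # sum of node degrees inside the community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (internal edge fraction - expected fraction)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by a single edge, split into their natural communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, partition)
```

A genetic algorithm of the kind described would use a function like this as the fitness of each candidate partition, keeping the partitions that score higher.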
- Map-reduce processing of k-means algorithm with FPGA-accelerated computer cluster.
The design and implementation of the k-means clustering algorithm on an FPGA-accelerated computer cluster is presented. The implementation followed the Map-Reduce programming model, with both the map and reduce functions executing independently of the CPU on multiple FPGAs. A hardware/software framework was developed to manage gateware execution on multiple FPGAs across the cluster. Using this k-means implementation as an example, a system-level tradeoff study between computation and I/O performance in the target multi-FPGA execution environment was performed. When compared to a similar software implementation executing over the Hadoop MapReduce framework, a 15.5× to 20.6× performance improvement has been achieved across a range of input data sets.
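To show how k-means fits the Map-Reduce model referred to in this abstract, here is a minimal single-process sketch of one iteration. It is a software illustration only and says nothing about the FPGA gateware or the Hadoop job used for comparison. The function names `kmeans_map`, `kmeans_reduce`, and `kmeans_iteration` are hypothetical.

```python
from collections import defaultdict

def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )
    return nearest, (point, 1)

def kmeans_reduce(values):
    """Reduce step: average all points assigned to one centroid."""
    total, count = None, 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [t + p for t, p in zip(total, point)]
        count += n
    return tuple(t / count for t in total)

def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)
    for point in points:                 # map phase
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)        # shuffle: group values by key
    # A centroid with no assigned points is kept unchanged.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
new_centroids = kmeans_iteration(points, [(0, 0), (10, 10)])
```

Iterating `kmeans_iteration` until the centroids stop moving gives the full algorithm. In a distributed setting the map calls run in parallel over partitions of the points.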
- Distributed image file system based on human cognition.
The Big Data era is characterized by an explosive increase of image files on the Internet, and these massive image collections bring great challenges to storage. Beyond storage efficiency, the management and retrieval of massive image files must also be accurate and robust. To meet these requirements, a distributed image file storage system based on cognition is proposed. The human brain can correlate image files with thousands of distinct object and action categories and store them sorted by category. We therefore propose to store image files sorted by visual category, following human cognition. The experimental results demonstrate that the proposed distributed image file system (DIFS) based on cognition performs better than the Hadoop Distributed File System (HDFS).
- Understanding Log Lines Using Development Knowledge.
Logs are generated by output statements that developers insert into the code. By recording the system behaviour during runtime, logs play an important role in the maintenance of large software systems. The rich nature of logs has introduced a new market of log management applications (e.g., Splunk, XpoLog and Logstash) that assist in storing, querying and analyzing logs. Moreover, recent research has demonstrated the importance of logs in operating, understanding and improving software systems. Thus log maintenance is an important task for developers. However, all too often practitioners (i.e., operators and administrators) are left without any support to help them unravel the meaning and impact of specific log lines. By spending over 100 human hours manually examining all the email threads in the mailing lists of three open source systems (Hadoop, Cassandra and Zookeeper) and performing web searches on sampled logging statements, we found 15 email inquiries and 73 inquiries from web search about different log lines. We identified five types of development knowledge that practitioners often seek from logs: meaning, cause, context, impact and solution. Due to the frequency and nature of the log lines about which real customers inquire, documenting all the log lines or identifying which ones to document is not efficient. Hence in this paper we propose an on-demand approach, which associates the development knowledge present in various development repositories (e.g., code commits and issue reports) with the log lines. Our case studies show that the derived development knowledge can be used to resolve real-life inquiries about logs.
- RABID: A Distributed Parallel R for Large Datasets.
Large-scale data mining and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling and have a large user base. R is one of the most widely used of these languages, but it is limited to a single-threaded execution model and to problem sizes that fit in a single node. This paper describes a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like distributed Spark, and achieves high performance and scaling across clusters. Our experimental evaluation shows that RABID performs up to 5x faster than Hadoop and 20x faster than RHIPE on two data mining applications.