Project Idea With Hadoop Mapreduce
Need a project idea with Hadoop MapReduce? Contact Hadoop Solutions to choose the best topic for your final-year project or thesis.
- Finding nearest facility location with open box query using Geohashing and MapReduce
This paper gives an algorithm for finding the nearest facility location using geohashing. An open box query, which operates on unbounded data, is used to find the nearest location. Searching is based on the longitude and latitude of locations. Geohashing is a technique that converts a longitude-latitude pair into a single value represented in binary format. The data accumulated by geospatial queries is very large, so the MapReduce framework is used for a parallel implementation: MapReduce splits the input into independent chunks and executes them in parallel across different mappers. Fusing geohashing and MapReduce to find the nearest facility location gives very good results.
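The geohash step described above can be sketched in a few lines. This is a minimal standard geohash encoder written for illustration, not the paper's own implementation: it bisects the longitude and latitude ranges alternately, interleaves the resulting bits, and packs each group of five bits into a base-32 character.

```python
# Minimal geohash encoder (illustrative sketch, not the paper's code).
# Even-indexed bits encode longitude, odd-indexed bits encode latitude.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=12):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    even = True  # start with longitude, per the geohash convention
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid   # keep the upper half of the interval
        else:
            bits.append(0)
            rng[1] = mid   # keep the lower half of the interval
        even = not even
    # pack each group of 5 bits into one base-32 character
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)

# prints an 11-character geohash beginning "u4pru" (Jutland, Denmark)
print(geohash_encode(57.64911, 10.40744, precision=11))
```

A useful property for nearest-location search is that truncating a geohash yields the hash of an enclosing cell, so nearby points tend to share prefixes; MapReduce can then partition the search space by hash prefix.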
- MapReduce-Based Distributed K-Shell Decomposition for Online Social Networks
Social network analysis comprises a popular set of tools for the analysis of online social networks. Among these techniques, k-shell decomposition of a graph is a popular one that has been used for centrality analysis, community discovery, the detection of influential spreaders, and so on. The huge volume of input graphs and the environments where the algorithm needs to run (i.e., large datacenters) make none of the existing algorithms appropriate for decomposing graphs into shells. In this article, we develop, for the first time in the literature, a distributed algorithm based on MapReduce for the k-shell decomposition of a graph. We furthermore provide an implementation and assessment of the algorithm using real social network datasets. We analyze the tradeoffs and speedup of the proposed algorithm and conclude on its virtues and shortcomings.
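The decomposition the paper distributes can be understood from its sequential form: repeatedly peel away all vertices of degree at most k, and assign shell index k to each vertex removed at that level. The sketch below is a single-machine illustration of that peeling process (function and variable names are my own), not the paper's MapReduce algorithm, which parallelizes the same peeling across a cluster.

```python
# Sequential k-shell decomposition (illustrative sketch).
# Vertices removed while the threshold is k belong to the k-shell.
from collections import defaultdict

def k_shell(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    shell = {}
    k = 0
    while adj:
        low = [v for v in adj if len(adj[v]) <= k]
        if not low:
            k += 1            # the k-shell is exhausted; raise the threshold
            continue
        for v in low:         # peel this batch; cascades are caught on the
            for w in adj[v]:  # next pass of the outer loop at the same k
                if w in adj and w != v:
                    adj[w].discard(v)
            shell[v] = k
            del adj[v]
    return shell

# Triangle 1-2-3 with pendant vertex 4: the triangle is the 2-shell,
# the pendant vertex is in the 1-shell.
print(k_shell([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

In the distributed version, each peeling round maps naturally onto a MapReduce iteration: mappers emit degree updates for neighbors of removed vertices, and reducers decide which vertices fall below the current threshold.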
- Implementing TaaS-based stress testing by MapReduce computing mode
In this paper we propose to employ the MapReduce computing model to implement Testing as a Service (TaaS) for stress testing. We focus on stress testing of heavyweight network transactions. The computational power of a MapReduce system is used to simulate concurrent network transactions issued by many users. The user first describes the testing scenario to be simulated in a testing script. The TaaS platform analyzes the script and then automatically distributes the required testing data, including details of transactions and files, into a MapReduce computing system. We propose three different schemes for distributing the testing data and measure their performance. We compare them with the popular stress-testing tool JMeter and find that our schemes consistently drive the tested system to a higher error rate during stress testing.
- Distributed MapReduce engine with fault tolerance
Hadoop is the de facto engine that drives current cloud computing practice. The current Hadoop architecture suffers from a single point of failure: its job management lacks fault tolerance. If the job manager fails, the job loses all state information and has to restart from scratch, even if its tasks remain active on cloud nodes. In this work, we propose a distributed MapReduce engine for Hadoop based on the Distributed Hash Table (DHT) algorithm that drives today's scalable peer-to-peer networks. The distributed Hadoop engine provides the fault tolerance needed for efficient job computation in cloud environments where numerous jobs run at any moment. We have implemented the proposed distributed solution in Hadoop and evaluated its performance under job failures across various network deployments.
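The fault-tolerance idea behind a DHT can be sketched with consistent hashing: job state is keyed onto a ring of nodes, so when one job manager fails its keys resolve to a surviving node instead of being lost. The sketch below is a generic consistent-hashing ring written for illustration (node and job names are invented), not the paper's actual engine.

```python
# Consistent-hashing ring (illustrative sketch of the DHT idea).
import bisect
import hashlib

def ring_hash(s):
    # place names on the ring using a stable hash
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def lookup(self, key):
        # first node clockwise from the key's position on the ring
        h = ring_hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]

    def remove(self, node):
        # a failed node's keys fall through to its ring successor
        self.ring = [(h, n) for h, n in self.ring if n != node]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("job-42")   # node currently responsible for the job state
ring.remove(owner)              # simulate that job manager failing
print(ring.lookup("job-42"))    # the job now resolves to a surviving node
```

Because only the keys owned by the failed node move, the remaining job managers keep serving their own jobs undisturbed, which is the property that makes the approach attractive for clouds running many jobs at once.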
- MapReduce Analysis for Cloud-Archived Data
Public storage clouds have become a popular choice for archiving certain classes of enterprise data, for example application and infrastructure logs. These logs contain sensitive information such as IP addresses and user logins, so regulatory and security requirements often require the data to be encrypted before it is moved to the cloud. To extract any business value from such data, analytics systems (e.g., Hadoop/MapReduce) first download it from these public clouds, decrypt it, and then process it at the secure enterprise site. We propose VNCache: an efficient solution for MapReduce analysis of such cloud-archived log data without requiring an a priori data transfer and load into the local Hadoop cluster. VNCache dynamically integrates cloud-archived data into a virtual namespace at the enterprise Hadoop cluster. Through a seamless data streaming and prefetching model, Hadoop jobs can begin execution as soon as they are launched, without any a priori downloading. With VNCache's accurate prefetching and caching, jobs often run on a locally cached copy of the data block, significantly improving performance. When no longer needed, data is safely evicted from the enterprise cluster, reducing the total storage footprint. Uniquely, VNCache is implemented with no changes to the Hadoop application stack.
- MapReduce-Based RESTMD: Enabling Large-Scale Sampling Tasks with Distributed HPC Systems
A novel implementation of Replica Exchange Statistical Temperature Molecular Dynamics (RESTMD), a generalized ensemble method also known as parallel tempering, is presented. Our implementation employs a MapReduce (MR)-based iterative framework for launching RESTMD over high-performance computing (HPC) clusters, including our test-bed system, Cyber-infrastructure for Reconfigurable Optical Networks (CRON), which simulates a network-connected distributed system. Our main contributions are a new implementation of STMD plugged into the well-known CHARMM molecular dynamics package and a RESTMD implementation powered by Hadoop that scales out effectively within a cluster and across distributed systems. To address the challenges of using Hadoop MapReduce, we examined the factors contributing to the performance of the proposed framework through runtime analysis experiments with two biological systems of different sizes and over different types of HPC resources. The many advantages of RESTMD suggest its effectiveness for enhanced sampling, one of the grand challenges in areas ranging from chemical systems to statistical inference. Lastly, with its support for scale-across capacity over distributed computing infrastructure (DCI) and its use of Hadoop for coarse-grained task-level parallelism, MapReduce-based RESTMD is truly a good example of the next generation of applications increasingly demanded by science gateway projects, in particular those backed by IaaS clouds.
- Dependency-Aware Data Locality for MapReduce
- Large-scale neural modeling in MapReduce and Giraph
- Extending MapReduce across Clouds with BStream
- Large Imbalance Data Classification Based on MapReduce for Traffic Accident Prediction
- Applying MapReduce principle to high level information fusion
- Spam filtering techniques and MapReduce with SVM: A study
- Degraded-First Scheduling for MapReduce in Erasure-Coded Storage Clusters
- Optimizing cost and performance trade-offs for MapReduce job processing in the cloud
- AJIRA: A Lightweight Distributed Middleware for MapReduce and Stream Processing
- Pigeon: A spatial MapReduce language
- Optimizing MapReduce with low memory requirements for shared-memory systems
- Energy-Aware Scheduling of MapReduce Jobs
- Joint scheduling of MapReduce jobs with servers: Performance bounds and experiments
- An Incremental and Distributed Inference Method for Large-Scale Ontologies Based on MapReduce Paradigm
- TSIR: A Chinese Temporal semantics Information Retrieval system based on MapReduce