Hadoop based projects


Hadoop-based projects help process large amounts of data using clusters of commodity hardware.
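To make the processing model concrete, here is a minimal sketch of the MapReduce paradigm that Hadoop implements. This is an illustrative stdlib-only simulation, not the Hadoop API: the map phase emits key-value pairs, the framework shuffles (groups) them by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict

# Illustrative simulation of MapReduce semantics (word count).
# In a real Hadoop job, map and reduce run distributed across the
# cluster; the shuffle is performed by the framework.

def map_phase(lines):
    # Map: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
```

The same three-stage structure underlies every project listed below, whatever the domain-specific map and reduce logic.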



  • A Repair Framework for Scalar MDS Codes      

Several works have developed vector-linear maximum-distance separable (MDS) storage codes that minimize the total communication cost required to repair a single coded symbol after an erasure, referred to as repair bandwidth (BW). Vector codes allow communicating fewer sub-symbols per node, instead of the entire content, which permits non-trivial savings in repair BW. In sharp contrast, classic codes like Reed-Solomon (RS), used in current storage systems, are deemed to suffer from naive repair, i.e., downloading the entire stored message to repair one failed node. This happens mainly because they are scalar-linear. In this work, we present a simple framework that treats scalar codes as vector-linear. In some cases, this allows significant savings in repair BW. We show that vectorized scalar codes exhibit properties that simplify the design of repair schemes. Our framework can be seen as a finite-field analogue of real interference alignment. Using our simplified framework, we design a scheme that we call clique-repair, which provably identifies the best linear repair strategy for any scalar 2-parity MDS code, under some conditions on the sub-field chosen for vectorization. We specify optimal repair schemes for specific (5,3)- and (6,4)-Reed-Solomon (RS) codes. Further, we present a repair strategy for the RS code currently deployed in the Facebook Analytics Hadoop cluster that leads to 20% repair BW savings over naive repair, which is the repair scheme currently used for this code.
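The repair-bandwidth arithmetic in this abstract can be sketched numerically. Naive scalar repair of one failed node in an (n, k) MDS code downloads k full symbols; the abstract reports roughly 20% savings over that baseline. The (n, k) = (14, 10) parameters and the symbol size below are assumptions for illustration only.

```python
# Illustrative repair-bandwidth comparison for an (n, k) MDS code.
# Assumed parameters: k = 10 data symbols, 256 units per symbol,
# and the ~20% savings figure reported in the abstract.

def naive_repair_bw(k, symbol_size):
    """Naive repair: download k whole symbols to rebuild one node."""
    return k * symbol_size

def saved_repair_bw(k, symbol_size, savings=0.20):
    """Bandwidth after applying the reported fractional savings."""
    return naive_repair_bw(k, symbol_size) * (1 - savings)

naive = naive_repair_bw(k=10, symbol_size=256)
reduced = saved_repair_bw(k=10, symbol_size=256)
```

The point of the framework is that vectorizing the scalar code lets a repair scheme read sub-symbols rather than whole symbols, which is where the gap between `naive` and `reduced` comes from.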

  • Support for bioinformatics applications through volunteer and scalable computing frameworks

Many e-science applications can benefit from the elasticity of resources provided by volunteer and distributed platforms. While the former is based on resources assigned voluntarily by their owners, the latter is based on resources specially configured for this purpose. In this paper, we present the integration of BOINC and Hadoop, two of the most popular middlewares available today, at the infrastructure level of mc2, a platform designed to support scientific applications. We discuss some case studies related to DNA sequencing running in CloudBurst, model selection running in the MEGA Computational Core, and similarity searches running in BLAST. Our results show that scientific users can run their experiments on our platform with a high level of abstraction and, in some cases, with very good performance.


  • Data clustering-based anomaly detection in industrial control systems    

Modern Networked Critical Infrastructures (NCI), involving cyber and physical systems, are exposed to intelligent cyber attacks targeting the stable operation of these systems. In order to ensure anomaly awareness, the observed data can be used with data mining techniques to develop Intrusion Detection Systems (IDS) or Anomaly Detection Systems (ADS). There is an increase in the volume of sensor data generated by both cyber and physical sensors, so there is a need to apply Big Data technologies for real-time analysis of large data sets. In this paper, we propose a clustering-based approach for detecting cyber attacks that cause anomalies in NCI. Various clustering techniques are explored to choose the most suitable for clustering the time-series data features, thus classifying the states and potential cyber attacks on the physical system. The Hadoop implementation of the MapReduce paradigm is used to provide a suitable processing environment for large datasets. A case study on an NCI consisting of multiple gas compressor stations is presented.
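The core detection step described in this abstract can be sketched simply: after clustering normal time-series features, a new feature vector is flagged as anomalous when it lies far from every learned cluster centroid. The centroids and threshold below are hypothetical; the paper's specific clustering technique is not given here.

```python
import math

# Sketch of clustering-based anomaly detection on feature vectors.
# Assumed: centroids were learned from normal traffic; the distance
# threshold is a hypothetical tuning parameter.

def distance(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_anomaly(point, centroids, threshold):
    """Flag a point whose nearest centroid is beyond the threshold."""
    return min(distance(point, c) for c in centroids) > threshold

centroids = [(0.0, 0.0), (10.0, 10.0)]  # learned from normal operation
normal = is_anomaly((0.5, 0.5), centroids, threshold=2.0)
attack = is_anomaly((5.0, 5.0), centroids, threshold=2.0)
```

In the MapReduce setting, the distance computations for large batches of sensor readings would be parallelized in the map phase, with anomaly counts aggregated in the reduce phase.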


  • An effective algorithmic approach for cost optimization in Cloud-based data centers

Cloud computing offers efficacy-oriented IT services to users throughout the world. It enables the hosting of assorted applications from user, scientific, commercial, and business domains. The core inspiration behind cloud computing is that the entire system can be controlled and operated using simply an HTTP client. The cloud user needs only a web-based client to work with cloud systems and all their applications, including office apps, business modules, and personal information systems. This applies to both modern and legacy systems. A cloud operating system is an open-source platform designed to host a wide assortment of web applications: everything inside it can be accessed and acquired from anywhere inside a network. All the user has to do is log into the cloud application server with a normal web client to access the desktop, with documents, applications, movies, music, and so on. A cloud operating system lets users upload their files and work with them regardless of location, and it contains applications such as a word processor, PDF reader, address book, and many more developed by cloud developers and vendors. Cloud storage provides web-based users with storage space and makes acquiring and storing data user-friendly and timely, which is the foundation of every type of cloud application. However, deep analysis is required on how to optimize cloud storage in order to improve data-access and data-storage performance.

In this paper, we propose a mathematical description as well as an algorithmic approach for cloud storage optimization, posed as an objective optimization problem that is solved by our proposed optimized algorithm; as a result, the data is distributed to appropriate nodes with the best efficiency. The simulation and experimental results demonstrate the performance of the algorithms, analogous to MapReduce Hadoop technology in the Big Data paradigm.
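The paper's exact optimization algorithm is not given in this abstract, so the following is only a generic baseline sketch of the problem it poses: distributing data blocks across storage nodes so that load stays balanced. A common greedy heuristic assigns each block to the currently least-loaded node.

```python
# Generic greedy data-placement baseline (illustrative only; not the
# paper's algorithm). Each block goes to the node with the smallest
# current load, keeping storage roughly balanced across nodes.

def greedy_placement(blocks, node_count):
    """blocks: list of (block_id, size); returns placement and node loads."""
    loads = [0] * node_count
    placement = {}
    for block, size in blocks:
        node = loads.index(min(loads))  # pick the least-loaded node
        placement[block] = node
        loads[node] += size
    return placement, loads

blocks = [("b1", 4), ("b2", 3), ("b3", 2), ("b4", 1)]
placement, loads = greedy_placement(blocks, node_count=2)
```

An actual cost-optimization formulation would also weigh access frequency and network distance, which is where an objective function and a dedicated solver, as the paper proposes, improve on this greedy baseline.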


  • An automated bot detection system through honeypots for large-scale

One of the purposes of active cyber defense systems is identifying infected machines in enterprise networks, which are presumably the root cause and main agent of various cyber-attacks. To achieve this, researchers have suggested many detection systems that rely on host-monitoring techniques and require deep packet inspection, or which are trained on malware samples by applying machine learning and clustering techniques. To our knowledge, most approaches either cannot be deployed easily to real enterprise networks, because of the impracticability of a training system that must be fed malware samples, or they depend on host-based or deep-packet-inspection analysis, which requires a large amount of storage capacity for an enterprise. Besides this, honeypot systems are mostly used to collect malware samples for analysis purposes and to identify incoming attacks. Rather than keeping experimental results of bot detection techniques as theory and using honeypots only for analysis, in this paper we present a novel automated bot-infected machine detection system, BFH (BotFinder through Honeypots), based on BotFinder, that identifies infected hosts in a real enterprise network through a learning approach. Our solution relies on NetFlow data and is capable of detecting bots infected by the most recent malware, whose samples are caught via 97 different honeypot systems. We train BFH with models created from malware samples provided and updated by the 97 honeypot systems. The BFH system automatically sends caught malware to a classification unit to construct family groups. Later, samples are automatically given to a training unit for modeling, and detection is performed over NetFlow data. Results are double-checked using a month of full packet capture and through tools that identify rogue domains. Our results show that BFH is able to detect infected hosts with very few false positives and handles the most recent malware families successfully, since it is fed by 97 honeypots; it supports large networks with the scalability of its Hadoop infrastructure, as deployed in a large-scale enterprise network in Turkey.
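BotFinder-style detection over NetFlow works on statistical features of the flows between a host pair, such as average flow duration, byte count, and inter-flow interval, which are then matched against trained family models. The sketch below shows only the feature-extraction step; the field layout of the flow tuples is an assumption for illustration.

```python
from statistics import mean

# Hypothetical sketch of NetFlow feature extraction in the style of
# BotFinder. Each flow is assumed to be a (start_time, duration,
# byte_count) tuple, sorted by start time.

def flow_features(flows):
    """Average inter-flow interval, duration, and bytes for one host pair."""
    starts = [f[0] for f in flows]
    intervals = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return {
        "avg_interval": mean(intervals) if intervals else 0.0,
        "avg_duration": mean(f[1] for f in flows),
        "avg_bytes": mean(f[2] for f in flows),
    }

# Bots often beacon at regular intervals; a trained model would be
# matched against averages like these.
features = flow_features([(0, 2.0, 500), (60, 2.2, 520), (120, 1.8, 480)])
```

Because these features are computed independently per host pair, the extraction parallelizes naturally as a MapReduce job, which is what gives the described system its Hadoop scalability.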



Hadoop Solutions offers Hadoop Based Projects – topics, theses, and projects – at an affordable price with guaranteed output. Enquire with us for more details.

Work Progress

PHD - 24

M.TECH - 125

B.TECH - 95

BIG DATA - 110


ON-GOING Hadoop Projects





Achievements – Hadoop Solutions



Customer Review

Hadoop Solutions – 5 Star Rating: 4.9 out of 5, based on 1000+ user reviews.