Category Archives: big data hadoop projects


Big Data Hadoop Projects


Big Data Hadoop projects involve collecting information from various sources and interlinking the data to form global information.



  • A Study of Effective Replica Reconstruction Schemes at Node Deletion for HDFS


    Big Data is transforming healthcare, business, and ultimately society itself, as eHealth becomes one of the key driving factors in the innovation process. We investigate BDeHS (Big Data eHealth Service) to support Big Data applications in the eHealth service domain. In this paper we explain why existing Big Data technologies such as Hadoop, MapReduce, STORM and the like cannot simply be applied to eHealth services directly. We then describe the additional capabilities required to make Big Data services for eHealth practical. Next we report our design of the BDeHS architecture, which supplies data operation management capabilities, regulatory compliance, and eHealth meaningful usages.

  • Adaptive Indexing for Distributed Array Processing

    Scientists are facing a data deluge in scientific exploration. Big data are collected by scientific instruments and experiments. The data are usually multidimensional arrays stored in many files. Distributed computing techniques such as MapReduce make exploring the large datasets practical. An index is a well-known means of shortening query processing time. Most existing indexing methods need a full load of the raw data to build the index. In this paper, we propose a distributed adaptive indexing method for distributed array-oriented query processing. Our method does not require a full scan of the array data. For each subarray accessed by a subtask, we divide the array into multiple logical blocks with a proper block size. The normal processing routine is executed when handling a query. Meanwhile, the index for the blocks accessed by the query is built at a low cost, so the whole index grows as queries are processed. This incremental manner exploits the data accessed by historical queries and eliminates the long load procedure. The experiments show that our adaptive indexing, implemented over Hadoop and Hive, is effective for accelerating array-oriented query processing without introducing much overhead.
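The incremental idea in the abstract above can be illustrated with a small single-machine sketch. It is not the paper's method: the 1-D array, the per-block min/max summary, the threshold query, and every name here (`AdaptiveBlockIndex`, `query`, `blocks_scanned`) are assumptions chosen for illustration. It only shows how an index entry can be built as a side effect of the first scan of a block and then used to skip that block later.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveBlockIndex:
    """Lazily built min/max index over fixed-size logical blocks of a 1-D array."""
    data: list
    block_size: int
    # block id -> (min, max); filled in only for blocks a query has touched
    stats: dict = field(default_factory=dict)
    blocks_scanned: int = 0  # counts block scans, to observe skipped work

    def query(self, start, stop, threshold):
        """Return positions p in [start, stop) with data[p] >= threshold."""
        start = max(start, 0)
        stop = min(stop, len(self.data))
        hits = []
        if start >= stop:
            return hits
        first = start // self.block_size
        last = (stop - 1) // self.block_size
        for b in range(first, last + 1):
            lo = b * self.block_size
            hi = min(lo + self.block_size, len(self.data))
            known = self.stats.get(b)
            if known is not None and known[1] < threshold:
                continue  # indexed block cannot contain a match: skip it
            block = self.data[lo:hi]
            self.blocks_scanned += 1
            if known is None:
                # Build the index entry as a side effect of the normal scan.
                self.stats[b] = (min(block), max(block))
            for offset, value in enumerate(block):
                pos = lo + offset
                if start <= pos < stop and value >= threshold:
                    hits.append(pos)
        return hits
```

Running the same query twice on a fresh index scans every overlapping block the first time and only the blocks whose recorded maximum reaches the threshold the second time, so the index grows with the query history instead of requiring an upfront load.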

  • Towards realistic benchmarking for cloud file systems: Early experiences

    Over the past few years, cloud file systems such as Google File System (GFS) and Hadoop Distributed File System (HDFS) have attracted considerable research effort aimed at optimizing their designs and implementations. A common issue for these efforts is performance benchmarking. Unfortunately, many system researchers and engineers face challenges in building a benchmark that reflects real-life workloads, due to the complexity of cloud file systems and the vagueness of I/O workload characteristics. They can easily make incorrect assumptions about their systems and workloads, leading to benchmark results that diverge from reality. As a preliminary step towards designing a realistic benchmark, we explore the characteristics of data and I/O workload in a production environment. We collected a two-week I/O workload trace from a 2,500-node production cluster, which is one of the largest cloud platforms in Asia. This cloud platform provides two public cloud services: data storage service (DSS) and data processing service (DPS). We analyze the commonalities and differences between the two cloud services from multiple perspectives, including the request arrival pattern, request size, data population and so on. Eight key observations are highlighted from the comprehensive study, including that the arrival rate of requests follows a Lognormal distribution rather than a Poisson distribution, that request arrival presents multiple periodicities, and that cloud file systems fit a partly-open model rather than a purely open or closed model. Based on the comparative analysis results, we derive several interesting implications to guide system researchers and engineers in building a realistic benchmark on their own systems. Finally, we discuss several open issues and challenges in benchmarking cloud file systems.
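As a rough illustration of the Lognormal observation in the abstract above, a synthetic workload generator could draw per-interval request rates from a Lognormal distribution instead of assuming a fixed Poisson rate. The sketch below is an assumption-laden example, not the paper's benchmark: the function name `lognormal_rates`, its parameters, and the choice of one rate per interval are all hypothetical. It relies only on the standard fact that a Lognormal with parameters mu and sigma has mean exp(mu + sigma²/2).

```python
import math
import random


def lognormal_rates(mean_rate, sigma, n, seed=None):
    """Draw n per-interval request rates from a Lognormal with the given mean."""
    rng = random.Random(seed)
    # For Lognormal(mu, sigma) the mean is exp(mu + sigma**2 / 2),
    # so solve for mu to hit the requested mean rate.
    mu = math.log(mean_rate) - sigma ** 2 / 2
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]
```

A larger `sigma` yields burstier intervals at the same average load, which is the kind of variation a fixed-rate arrival model would not reproduce.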

  • Towards a Fine-Grained Access Control for Cloud

    The centerpiece of an efficient Cloud security architecture is a well-defined access control policy. In the literature we can find several access control models such as Mandatory Access Control (MAC), Discretionary Access Control (DAC), Role-Based Access Control (RBAC) and the latest one, Usage Control Authorization, oBligation and Condition (UCONABC). UCONABC is very suitable for the context of distributed systems like cloud computing, but it doesn’t give any implementation method. In this paper we define the profile centric model using graph formalism and its implementation using matrices. We define the profile as the combination of all possible authorizations, obligations, conditions, roles and so on, together with other access parameters, such as attributes, that can be found in a Cloud system. We discuss its application using three matrices (profile definition, profile inheritance and user assignment). Profile centric modeling is an optimum paradigm for defining access control policy in complex distributed and elastic systems like cloud computing. The proposed solution is validated and implemented over Hadoop distributed file system in the context of Safe Box as a service.
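One plausible way to read the three matrices named in the abstract above is as Boolean relations that compose into a user-to-permission table. The sketch below is an assumption, not the paper's definition: the way the matrices are combined, the function names, and the example profiles are all hypothetical. It treats `inheritance[i][j]` as "profile i inherits from profile j", closes that relation transitively with Warshall's algorithm, and multiplies user assignment by the closure and then by the profile definition.

```python
def bool_matmul(a, b):
    """Boolean matrix product: result[i][j] is True if some k links i to j."""
    return [[any(a[i][k] and b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def reflexive_transitive_closure(m):
    """Warshall's algorithm on a square Boolean matrix, with the diagonal set."""
    n = len(m)
    c = [[bool(m[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if c[i][k]:
                for j in range(n):
                    if c[k][j]:
                        c[i][j] = True
    return c


def effective_permissions(user_assignment, inheritance, definition):
    """Rows are users, columns are permissions reachable via assigned profiles."""
    reach = reflexive_transitive_closure(inheritance)
    return bool_matmul(bool_matmul(user_assignment, reach), definition)


# Hypothetical example: profiles (reader, editor, admin),
# permissions (read, write, delete), users (alice, bob).
definition = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
inheritance = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]  # editor <- reader, admin <- editor
user_assignment = [[0, 1, 0], [1, 0, 0]]         # alice: editor, bob: reader
permissions = effective_permissions(user_assignment, inheritance, definition)
```

In this example alice, assigned the editor profile, ends up with read and write, while bob has read only.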

  • Privacy-Preserving WebID Analytics on the Decentralized Policy-Aware Social Web

    We address the research challenges of privacy-preserving Web ID analytics on the decentralized Social Web. We first argue why we should use open and decentralized control rather than closed and centralized control of personal data management. Then, we present a policy-aware architecture, where a data owner hand-picks a trusted data controller to mask his/her personally identifiable information (PII) and other sensitive social relationships of the Web ID so that only anonymous RDF(S) linked datasets are available for analytics. Moreover, we advocate using an R and Hadoop integration paradigm, called RHadoop, for effective hybrid Web ID analytics of large-scale social network linked datasets. Finally, we propose various types of semantics-enabled policies to call for the RHadoop hybrid Web ID analytics and further balance data utility and protection on the privacy-aware Social Web.

  • Modeling of Distributed File Systems for Practical Performance Analysis

    Cloud computing has received significant attention recently. Delivering quality-guaranteed services in clouds is highly desired. Distributed file systems (DFSs) are the key component of any cloud-scale data processing middleware. Evaluating the performance of DFSs is accordingly very important. To avoid the cost of late life cycle performance fixes and architectural redesign, providing performance analysis before the deployment of DFSs is also particularly important. In this paper, we propose a systematic and practical performance analysis framework, driven by architecture and design models for defining the structure and behavior of typical master/slave DFSs. We put forward a configuration guideline for specifying configuration alternatives of such DFSs, and a practical approach for both qualitative and quantitative performance analysis of DFSs with various configuration settings in a systematic way. What distinguishes our approach from others is that 1) most existing works rely on performance measurements under a variety of workloads/strategies, comparing with other DFSs or running application programs, whereas our approach is based on architecture and design level models and systematically derived performance models; 2) our approach is able to both qualitatively and quantitatively evaluate the performance of DFSs; and 3) our approach can evaluate not only the overall performance of a DFS but also its components and individual steps. We demonstrate the effectiveness of our approach by evaluating Hadoop distributed file system (HDFS). A series of real-world experiments on EC2 (Amazon Elastic Compute Cloud), Tansuo and Inspur clusters were conducted to qualitatively evaluate the effectiveness of our approach. We also performed a set of experiments on HDFS on EC2 to quantitatively analyze the performance and limitations of the metadata server of DFSs. Results show that our approach can achieve sufficient performance analysis. Similarly, the proposed approach could also be applied to evaluate other DFSs such as MooseFS, GFS, and zFS.

  • A Visualized Framework of Automatic Orchestration Engine Supporting Hybrid Cloud Resources
  • A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
  • Forensic disk image indexing and search in an HPC environment

Hadoop Solutions offers Big Data Hadoop Projects (topics, theses, and projects) at an affordable price with guaranteed output. Contact us for more details.

Work Progress

PHD - 24

M.TECH - 125

B.TECH - 95

BIG DATA - 110



Customer Review

Hadoop Solutions 5 Star Rating: Recommended 4.9 - 5 based on 1000+ ratings. 1000+ user reviews.