Big Data Framework for Storage, Extraction and Identification of Data using Hadoop Distributed File System

Big data refers to the growing challenge that organizations face today as they manage enormous and rapidly growing sources of data. The problem spans a complex range of analyses and includes computing infrastructure and access to mixed data, both structured and unstructured, from various sources such as networking, recordings, and stored images. Hadoop is an open-source software framework comprising a number of components that are specifically designed for large-scale distributed data storage. MapReduce is a parallel programming design for processing
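
As a rough illustration of the MapReduce programming model named in this abstract, the following single-process Python sketch counts words using separate map, shuffle, and reduce steps. It does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example, and a real Hadoop job would distribute these phases across cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between the phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory data set."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls could in principle run in parallel on different machines, which is the property the model relies on.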

2021, Vol 8 (1)
Author(s): Houshyar Honar Pajooh, Mohammed A. Rashid, Fakhrul Alam, Serge Demidenko

The diversity and sheer increase in the number of connected Internet of Things (IoT) devices have brought significant concerns associated with storing and protecting a large volume of IoT data. Storage volume requirements and computational costs are continuously rising in the conventional cloud-centric IoT structures. Besides, dependence on a centralized server imposes significant trust issues and makes the system vulnerable to security risks. In this paper, a layer-based distributed data storage design and implementation of a blockchain-enabled large-scale IoT system are proposed. It has been developed to mitigate the above-mentioned challenges by using the Hyperledger Fabric (HLF) platform for distributed ledger solutions. The need for a centralized server and a third-party auditor is eliminated by leveraging HLF peers that perform transaction verifications and record audits in a big data system with the help of blockchain technology. The HLF blockchain facilitates storing the lightweight verification tags on the blockchain ledger. In contrast, the actual metadata are stored in the off-chain big data system to reduce the communication overheads and enhance data integrity. Additionally, a prototype has been implemented on embedded hardware, showing the feasibility of deploying the proposed solution in IoT edge computing and big data ecosystems. Finally, experiments have been conducted to evaluate the performance of the proposed scheme in terms of its throughput, latency, communication, and computation costs. The obtained results indicate the feasibility of the proposed solution to retrieve and store the provenance of large-scale IoT data within the Big Data ecosystem using the HLF blockchain. The experimental results show a throughput of about 600 transactions, a 500 ms average response time, about 2–3% CPU consumption at the peer process, and approximately 10–20% at the client node. The minimum latency remained below 1 s; however, the maximum latency increased when the sending rate reached around 200 transactions per second (TPS).
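
The split described in this abstract, with lightweight verification tags on the ledger and the actual data off-chain, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag is taken to be a SHA-256 digest (the abstract does not say how tags are computed), and `Ledger` and `OffChainStore` are hypothetical in-memory stand-ins, not Hyperledger Fabric APIs.

```python
import hashlib
import hmac
from typing import Dict, List, Tuple


class OffChainStore:
    """Hypothetical stand-in for the off-chain big data system."""

    def __init__(self) -> None:
        self._records: Dict[str, bytes] = {}

    def put(self, record_id: str, payload: bytes) -> None:
        self._records[record_id] = payload

    def get(self, record_id: str) -> bytes:
        return self._records[record_id]


class Ledger:
    """Hypothetical stand-in for the ledger: an append-only list of tags."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str]] = []

    def append(self, record_id: str, tag: str) -> None:
        self._entries.append((record_id, tag))

    def latest_tag(self, record_id: str) -> str:
        # Scan from the newest entry backwards; entries are never modified.
        for entry_id, tag in reversed(self._entries):
            if entry_id == record_id:
                return tag
        raise KeyError(record_id)


def verification_tag(payload: bytes) -> str:
    """Assumed tag scheme: hex-encoded SHA-256 digest of the payload."""
    return hashlib.sha256(payload).hexdigest()


def store(record_id: str, payload: bytes, ledger: Ledger, off_chain: OffChainStore) -> None:
    """Keep the bulky payload off-chain and record only its tag on the ledger."""
    off_chain.put(record_id, payload)
    ledger.append(record_id, verification_tag(payload))


def verify(record_id: str, ledger: Ledger, off_chain: OffChainStore) -> bool:
    """Recompute the tag of the off-chain payload and compare it with the ledger."""
    current = verification_tag(off_chain.get(record_id))
    return hmac.compare_digest(current, ledger.latest_tag(record_id))
```

Under these assumptions the ledger entry stays a fixed 64 hexadecimal characters regardless of payload size, which is one way to read the abstract's claim about reduced communication overhead.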


2013, pp. 294-321
Author(s): Alexandru Costan

To accommodate the needs of large-scale distributed systems, scalable data storage and management strategies are required, allowing applications to efficiently cope with continuously growing, highly distributed data. This chapter addresses the key issues of data handling in grid environments, focusing on storing, accessing, managing and processing data. We start by providing the background for the data storage issue in grid environments. We outline the main challenges addressed by distributed storage systems: high availability, which translates into high resilience and consistency, corruption handling regarding arbitrary faults, fault tolerance, asynchrony, fairness, access control and transparency. The core part of the chapter presents how existing solutions cope with these demanding requirements. The most important research results are organized along several themes: grid data storage, distributed file systems, data transfer and retrieval, and data management. Important characteristics such as performance, efficient use of resources, fault tolerance, security, and others are strongly determined by the adopted system architectures and the technologies behind them. For each topic, we briefly present previous work, describe the most recent achievements, highlight their advantages and limitations, and indicate future research trends in distributed data storage and management.


2018, Vol 2018, pp. 1-14
Author(s): Vasileios Moysiadis, Panagiotis Sarigiannidis, Ioannis Moscholios

In the emerging area of the Internet of Things (IoT), the exponential growth in the number of smart devices leads to a growing need for efficient data storage mechanisms. Cloud Computing has so far been an efficient solution for storing and manipulating such huge amounts of data. However, in the coming years Cloud Computing is expected to be unable to handle the huge number of IoT devices efficiently due to bandwidth limitations. An emerging technology that promises to overcome many drawbacks of large-scale IoT networks is Fog Computing. Fog Computing provides high-quality Cloud services in the physical proximity of mobile users. Computational power and storage capacity can be offered from the Fog with low latency and high bandwidth. This survey discusses the main features of Fog Computing, introduces representative simulators and tools, highlights the benefits of Fog Computing for applications of large-scale IoT networks, and identifies various issues that may be encountered when designing and implementing social IoT systems in the context of the Fog Computing paradigm. The rationale behind this work lies in the data storage discussion, which takes into account the importance of storage capabilities in modern Fog Computing systems. In addition, we provide a comprehensive comparison of previously developed distributed data storage systems, which constitute a promising solution for data storage allocation in Fog Computing.


A great deal of research is under way on the efficient storage, processing, and analysis of large volumes of data that are generated in real time and vary in nature and quality. The most common open-source framework for efficient computation on such large volumes of data is Hadoop, which processes big data sets by employing clusters of networked computers. Cloud computing, on the other hand, refers to storing data and applications on cloud servers and accessing them over the Internet on demand. For organizations that want to reduce the costs and complexities associated with a big data framework, the most suitable option is therefore to use cloud infrastructure. The biggest concern in this regard is the security of data and applications in the cloud. Although Hadoop provides a built-in encryption scheme and secured HTTP protocol, once data and applications are stored in a public cloud they become vulnerable to various security breaches that remain uncontrolled by the cloud service providers, giving rise to distrust. In this scenario, encrypting sensitive business data before uploading it to the cloud may help prevent access by malicious intruders. In this paper, an extension to Hadoop security for the shared cloud is proposed by designing a software framework in which files are encrypted before being uploaded to the cloud. The framework secures data both in storage and in transit, such that retrieval of the data is not possible without using the framework. An extra layer of security based on symmetric key cryptography is proposed, which enhances the security of customers' resources alongside the existing standard security measures of a cloud system. A software system performs symmetric encryption before transmitting a file of any format to the cloud. To access this encrypted file, the same software system has to be used to download and decrypt it. This paper also investigates the performance of the most common symmetric key techniques, AES, DES and triple DES, with respect to the successful encryption of customer data. This software framework can be applied to provide an extra security layer at the client's end for users of the cloud platform.
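
The encrypt-before-upload and download-then-decrypt flow described in this abstract can be sketched as follows. This is a stand-in, not the paper's framework: the Python standard library provides no AES, DES, or triple DES, so the sketch swaps in a simple keystream built from HMAC-SHA256 in counter mode with an integrity tag. It is meant only to show the flow and is not a vetted cipher; a real deployment would use an audited AES implementation. All function names here are hypothetical.

```python
import hashlib
import hmac
import os

NONCE_LEN = 16  # random per-file value so the same key never reuses a keystream
TAG_LEN = 32    # length of the HMAC-SHA256 integrity tag


def _subkey(key: bytes, label: bytes) -> bytes:
    """Derive separate encryption and authentication keys from one shared key."""
    return hmac.new(key, label, hashlib.sha256).digest()


def _keystream(enc_key: bytes, nonce: bytes, length: int) -> bytes:
    """Illustrative keystream: HMAC-SHA256 over nonce plus a block counter."""
    blocks = []
    for counter in range((length + 31) // 32):
        message = nonce + counter.to_bytes(8, "big")
        blocks.append(hmac.new(enc_key, message, hashlib.sha256).digest())
    return b"".join(blocks)[:length]


def encrypt_for_upload(key: bytes, plaintext: bytes) -> bytes:
    """Return nonce + ciphertext + tag; only this blob would leave the client."""
    nonce = os.urandom(NONCE_LEN)
    stream = _keystream(_subkey(key, b"enc"), nonce, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hmac.new(_subkey(key, b"mac"), nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def decrypt_after_download(key: bytes, blob: bytes) -> bytes:
    """Check the tag first, then recover the plaintext with the same key."""
    if len(blob) < NONCE_LEN + TAG_LEN:
        raise ValueError("blob is too short")
    nonce = blob[:NONCE_LEN]
    ciphertext = blob[NONCE_LEN:-TAG_LEN]
    tag = blob[-TAG_LEN:]
    expected = hmac.new(_subkey(key, b"mac"), nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or tampered file")
    stream = _keystream(_subkey(key, b"enc"), nonce, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))
```

As in the abstract, the same software and the same symmetric key are needed on both sides: a blob produced by `encrypt_for_upload` can only be read back through `decrypt_after_download` with the matching key.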


Author(s): Shailesh Pancham Khapre, Chandramohan Dhasarathan, Puviyarasi T., Sam Goundar

In the internet era, vast amounts of data are generated every day. In the process of data sharing, complex issues such as data privacy and ownership are emerging. Blockchain is a decentralized distributed data storage technology. The introduction of blockchain can eliminate the disadvantages of the centralized data market, but at the same time, distributed data markets have created security and privacy issues. This work summarizes the industry status and research progress of the domestic and foreign big data trading markets and refines the nature of the blockchain-based big data sharing and circulation platform. Based on these properties, a blockchain-based data market (BCBDM) framework is proposed, and the security and privacy issues in this framework, as well as corresponding solutions, are analyzed and discussed. Based on this framework, a data market testing system was implemented, and the feasibility and security of the framework were confirmed.


2011, pp. 112-139
Author(s): Alexandru Costan

To accommodate the needs of large-scale distributed systems, scalable data storage and management strategies are required, allowing applications to efficiently cope with continuously growing, highly distributed data. This chapter addresses the key issues of data handling in grid environments, focusing on storing, accessing, managing and processing data. We start by providing the background for the data storage issue in grid environments. We outline the main challenges addressed by distributed storage systems: high availability, which translates into high resilience and consistency, corruption handling regarding arbitrary faults, fault tolerance, asynchrony, fairness, access control and transparency. The core part of the chapter presents how existing solutions cope with these demanding requirements. The most important research results are organized along several themes: grid data storage, distributed file systems, data transfer and retrieval, and data management. Important characteristics such as performance, efficient use of resources, fault tolerance, security, and others are strongly determined by the adopted system architectures and the technologies behind them. For each topic, we briefly present previous work, describe the most recent achievements, highlight their advantages and limitations, and indicate future research trends in distributed data storage and management.


2021, Vol 21 (1), pp. 24-36
Author(s): Francisco Neves, Ricardo Vilaça, José Pereira

Modern containerized distributed systems, such as big data storage and processing stacks or microservice-based applications, are inherently hard to monitor and optimize, as resource usage does not directly match hardware resources due to multiple virtualization layers. For instance, although interapplication traffic is an important factor, as it directly indicates how components interact, it has not been possible to monitor it accurately in an application-independent way and without severe overhead, which has put it out of reach of cloud platforms. In this paper we present an efficient black-box monitoring approach for gathering detailed structural information about collaborating processes in a distributed system. This information can be queried for various purposes, as it covers processes, containers, and hosts, as well as resource usage and the amount of data exchanged. The key to achieving high detail and low overhead without custom application instrumentation is to use a kernel-aided, event-driven strategy. We validate a prototype implementation by applying it to multi-platform microservice deployments, evaluate its performance with micro-benchmarks, and demonstrate its usefulness for container placement in a distributed data storage and processing stack (i.e., Cassandra and Spark).

