Data Lake Governance: Towards a Systemic and Natural Ecosystem Analogy

2020 ◽  
Vol 12 (8) ◽  
pp. 126
Author(s):  
Marzieh Derakhshannia ◽  
Carmen Gervet ◽  
Hicham Hajj-Hassan ◽  
Anne Laurent ◽  
Arnaud Martin

The realm of big data has brought new avenues for knowledge acquisition, but also major challenges, including data interoperability and effective management. The great volume of miscellaneous data makes the generation of new knowledge a complex data analysis process. Presently, big data technologies provide multiple solutions and tools for the semantic analysis of heterogeneous data, including their accessibility and reusability. However, in addition to learning from data, we are faced with the issue of storing and managing data in a cost-effective and reliable manner. This is the core topic of this paper. A data lake, inspired by the natural lake, is a centralized repository that stores all kinds of data in any format and structure. This allows any type of data to be ingested into the data lake without restriction or normalization, which can lead to a critical problem known as a data swamp: a repository of invalid or incoherent data that adds no value for further knowledge acquisition. To deal with the potential avalanche of data, governance rules are required to turn such heterogeneous datasets into manageable data. In this article, we address this problem and propose solutions based on innovative methods, derived from a multidisciplinary science perspective, to manage the data lake. The proposed methods imitate supply chain management and natural lake principles, with an emphasis on the data life cycle, to implement responsible data governance for the data lake.
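To make the governance idea concrete, the following minimal sketch (an illustration under assumed names such as DataLake and ingest, not the authors' method) shows an ingestion gate that records lifecycle metadata and rejects empty payloads, one simple way a data lake can avoid degenerating into a data swamp:

```python
import json
import hashlib
from datetime import datetime, timezone

class DataLake:
    """Hypothetical ingestion gate: every dataset entering the lake gets a
    catalog entry, so nothing lands without minimal governance metadata."""

    def __init__(self):
        self.storage = {}   # raw data zone
        self.catalog = {}   # metadata zone: the defense against a data swamp

    def ingest(self, source, payload, schema_hint=None):
        # Reject obviously invalid data instead of letting it rot in the lake.
        if payload is None or (isinstance(payload, (str, bytes)) and not payload):
            raise ValueError(f"refusing empty payload from {source}")
        raw = payload if isinstance(payload, bytes) else json.dumps(payload).encode()
        key = hashlib.sha256(raw).hexdigest()[:16]
        self.storage[key] = raw
        # Lifecycle metadata recorded at ingestion time.
        self.catalog[key] = {
            "source": source,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "size_bytes": len(raw),
            "schema_hint": schema_hint,
        }
        return key

lake = DataLake()
key = lake.ingest("sensor-feed", {"temp_c": 21.4}, schema_hint="reading:v1")
print(lake.catalog[key])
```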

Author(s):  
Abou_el_ela Abdou Hussein

Day by day, advanced web technologies have led to tremendous growth in the volume of data generated daily. This mountain of huge, dispersed data sets leads to the phenomenon called big data: a collection of massive, heterogeneous, unstructured, and complex data sets. The big data life cycle can be represented as collecting (capture), storing, distributing, manipulating, interpreting, analyzing, investigating, and visualizing big data. Traditional techniques such as Relational Database Management Systems (RDBMS) cannot handle big data because of their inherent limitations, so advances in computing architecture are required to handle both the data storage requirements and the heavy processing needed to analyze huge volumes and varieties of data economically. Among the many technologies for manipulating big data, Hadoop is one of the most prominent and well-known solutions: an open-source distributed data processing framework based on the Google File System and the MapReduce programming paradigm. In this paper we survey big data characteristics, starting from the first three V's, which researchers have extended over time to more than fifty-six V's, and compare the literature to reach the best representation and a precise clarification of all big data V's characteristics. We highlight the challenges facing big data processing and how to overcome them using Hadoop, and its use in processing big data sets as a solution for various problems in a distributed cloud-based environment. The paper mainly focuses on different components of Hadoop, such as Hive, Pig, and HBase. We also give a thorough description of Hadoop's pros and cons, and of improvements to address Hadoop's problems through a proposed cost-efficient scheduler algorithm for heterogeneous Hadoop systems.
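As an illustration of the MapReduce programming paradigm the abstract refers to, here is the classic word-count example expressed as explicit map, shuffle, and reduce phases in plain Python (a sketch of the paradigm, not code from the paper or from Hadoop itself):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (key, value) pair for every word occurrence.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values emitted for one key.
    return key, sum(values)

documents = ["big data needs big tools", "hadoop handles big data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, ...}
```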


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0255562
Author(s):  
Eman Khashan ◽  
Ali Eldesouky ◽  
Sally Elghamrawy

The growing popularity of big data analysis and cloud computing has created new big data management standards. Programmers may have to interact with a number of heterogeneous data stores, depending on the information they are responsible for: SQL and NoSQL data stores. Interacting with heterogeneous data models via numerous APIs and query languages imposes challenging tasks on multi-data processing developers. Indeed, complex queries over heterogeneous data structures cannot currently be performed declaratively when the data resides in separate data storage applications, and therefore require additional development effort. Many models have been presented to address complex queries via multistore applications. Some implement a unified and fast model, while others are not efficient enough to solve this type of complex database query. This paper provides an automated, fast, and easy unified architecture for solving simple and complex SQL and NoSQL queries over heterogeneous data stores (CQNS). The proposed framework can be used in cloud environments or in any big data application to automatically help developers manage basic and complicated database queries. CQNS consists of three layers: a matching selector layer, a processing layer, and a query execution layer. The matching selector layer is the heart of the architecture: incoming user queries are examined and matched against queries stored per engine in the architecture's library, through a proposed algorithm that directs each query to the right SQL or NoSQL database engine. Furthermore, CQNS deals with many NoSQL databases, such as MongoDB, Cassandra, Riak, CouchDB, and Neo4j. The paper presents a Spark framework that can handle both SQL and NoSQL databases. Four benchmark scenarios' datasets are used to evaluate the proposed CQNS for querying different NoSQL databases in terms of optimization performance and query execution time. The results show that CQNS achieves the best latency and throughput among the compared systems.
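To illustrate the routing idea behind the matching selector layer (a hypothetical sketch only; the names and the keyword-based classifier below are assumptions, not CQNS's actual algorithm), a dispatcher might classify each query by dialect and hand it to the matching engine:

```python
# Hypothetical sketch of a matching selector that routes a query to the
# engine whose dialect it matches; CQNS's real algorithm is more involved.
SQL_KEYWORDS = ("select", "insert", "update", "delete")

ENGINES = {
    "sql": lambda q: f"[postgres] executed: {q}",
    "document": lambda q: f"[mongodb] executed: {q}",
    "graph": lambda q: f"[neo4j] executed: {q}",
}

def classify(query: str) -> str:
    q = query.strip().lower()
    if q.startswith(SQL_KEYWORDS):
        return "sql"
    if q.startswith("match"):          # Cypher-style graph pattern
        return "graph"
    return "document"                  # fall back to a document store

def execute(query: str) -> str:
    engine = ENGINES[classify(query)]
    return engine(query)

print(execute("SELECT * FROM orders"))
print(execute("MATCH (n:User) RETURN n"))
print(execute('{"find": "orders", "filter": {"total": {"$gt": 100}}}'))
```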


2019 ◽  
Vol 4 (1) ◽  
pp. 14-25
Author(s):  
Saiful Rizal

The development of information technology produces very large data volumes, with great variety and complex data structures. Traditional data storage techniques are not sufficient for storing and analyzing such large volumes of data. Many researchers have conducted research on analyzing big data with various analytics models. The purpose of this survey paper is therefore to provide an understanding of analytics models in big data for various uses, based on data mining algorithms. Preprocessing big data is the key to turning big data into big value.
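Since the abstract singles out preprocessing as the key step, the following generic sketch (assumed for illustration, not taken from the surveyed paper) shows three common preprocessing operations, deduplication, missing-value imputation, and min-max normalization, applied before any mining algorithm runs:

```python
# Generic preprocessing sketch: cleaning, deduplication, and normalization.
records = [
    {"id": 1, "age": "34", "income": 52000},
    {"id": 1, "age": "34", "income": 52000},   # duplicate
    {"id": 2, "age": None, "income": 61000},   # missing value
    {"id": 3, "age": "29", "income": 48000},
]

# 1. Deduplicate on the record id.
seen, unique = set(), []
for r in records:
    if r["id"] not in seen:
        seen.add(r["id"])
        unique.append(r)

# 2. Impute missing ages with the mean of the observed ones.
ages = [int(r["age"]) for r in unique if r["age"] is not None]
mean_age = sum(ages) / len(ages)
for r in unique:
    r["age"] = int(r["age"]) if r["age"] is not None else mean_age

# 3. Min-max normalize income into [0, 1] so scales are comparable.
lo = min(r["income"] for r in unique)
hi = max(r["income"] for r in unique)
for r in unique:
    r["income_norm"] = (r["income"] - lo) / (hi - lo)

print(unique)
```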


2021 ◽  
Vol 42 (1) ◽  
pp. 113
Author(s):  
Vitor Valerio de Souza Campos ◽  
Jacques Duílio Brancher ◽  
Francyelcyo Pussi Farias ◽  
José Luiz Villela Marcondes Mioni ◽  
Pedro Luiz Garbim Brahim

In integration approaches, heterogeneity is one of the main challenging factors in providing integration among different data sources, whose solution lies in finding equivalences among them. This work describes the state of the art and the theoretical foundations involved in the structural and semantic analysis of heterogeneous data and information. The work aims to review methods and techniques used in data integration in big data, considering data heterogeneity, covering techniques that use the concepts of the Semantic Web, cloud computing, data analysis, big data, data warehouses, and other technologies to solve the problem of data heterogeneity. The research was divided into three stages. In the first stage, articles were selected from digital libraries according to their titles and keywords. In the second stage, the papers went through a second filter based on their abstracts, and duplicate articles were removed. The introductions and conclusions of the works were analyzed in the third stage to select the articles belonging to this systematic review. Throughout the study, articles were analyzed, compared, and categorized. At the end of each section, the interrelationships and possible areas for future work are shown.


2019 ◽  
Vol 4 (2) ◽  
pp. 137-150
Author(s):  
Twana Saeed Ali ◽  
Tugberk Kaya

Big data refers to large volumes of information, varying from pictures, videos, texts, and audio to other heterogeneous data. In recent years, the amount of such big data has exceeded the capacity of online or cloud storage systems. The amount of data collected yearly has doubled in recent years, and its yearly volume has reached the exabyte range. This paper focuses on the major issues and opportunities of big data storage, drawing on academic tools and earlier research by scholars on big data analysis. The modern learning environment (MLE) has to be understood in order to know how it supports learning in areas of big data such as university education systems. The use of online resources and web pages via laptops and mobile phones needs to be understood as an attempt to integrate the modern learning environment and improve teaching in international business. Big data can be fine-tuned and used to create new online learning programmes. Data collected by government departments, universities, and institutions could feed a new, innovative learning system such as the MLE, which has both a passive and an active character, i.e., it can be accessed anywhere at any time. This would also help minimize extended classroom activities, because students would have controlled access to online knowledge from their homes.


Author(s):  
Arvind Panwar ◽  
Vishal Bhatnagar

Data is the biggest asset after people for businesses, and it is a new driver of the world economy. The volume of data that enterprises gather every day is growing rapidly. This rapid growth of data in terms of volume, variety, and velocity is known as big data. Big data is a challenge for enterprises, and the biggest challenge is how to store it. In the past, and in some organizations currently, data warehouses have been used to store big data. Enterprise data warehouses work on the schema-on-write concept, but big data analytics demands storage that works on the schema-on-read concept. To fulfill market demand, researchers are working on a new data repository system for big data storage known as a data lake, defined as a landing area for raw data from many sources. There is still confusion about data lakes, and questions that must be answered. The objective of this article is to reduce that confusion and address some of these questions with the help of architecture.
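To make the schema-on-write versus schema-on-read contrast concrete (an illustrative sketch with hypothetical helper names, not the article's code), a warehouse-style writer validates structure before storing, while a lake-style store accepts raw bytes and applies a schema only at read time:

```python
import json

# Schema-on-write (warehouse style): validate before storing; bad rows never land.
def warehouse_write(table, row, schema):
    for column, col_type in schema.items():
        if not isinstance(row.get(column), col_type):
            raise TypeError(f"row rejected: {column!r} must be {col_type.__name__}")
    table.append(row)

# Schema-on-read (lake style): store anything; interpret only at query time.
def lake_write(store, raw):
    store.append(raw)

def lake_read(store, schema):
    for raw in store:
        row = json.loads(raw)
        # Apply the schema now, skipping rows that don't fit this reading.
        if all(isinstance(row.get(c), t) for c, t in schema.items()):
            yield row

schema = {"user": str, "clicks": int}
warehouse, lake = [], []
warehouse_write(warehouse, {"user": "ana", "clicks": 3}, schema)
lake_write(lake, '{"user": "ana", "clicks": 3}')
lake_write(lake, '{"user": "bob", "clicks": "n/a"}')  # accepted raw, filtered on read
print(list(lake_read(lake, schema)))
```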


2018 ◽  
Vol 2 (3) ◽  
pp. 169
Author(s):  
Manishankar S ◽  
S. Sathayanarayana

In this digital world, the storage capacity required by an enterprise is huge, and processing that big data is one of the major challenges in today's information technology. As heterogeneous data from various sources grows rapidly, each enterprise needs a proficient way to store it. Most enterprises tend to migrate their data to servers with high processing capability to handle varied and voluminous data. A major problem arising in such big data servers is segregating data according to type. In this research, an efficient methodology is proposed that handles the segregation of data inside a server with multi-valued distribution-based clustering. This clustering-based solution provides an efficient visualization of the varying data in the server, as well as a separate visualization of employee data. The paper discusses the simulation of the clustering technique on enterprise data, the visualization of the file storage structure and the categorization of data, and gives a picture of the performance of the big data server.
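As a rough illustration of distribution-based clustering for segregating server data (a stand-in sketch using a Gaussian mixture on synthetic file features; the paper's exact multi-valued method is not reproduced here):

```python
# Group files by simple numeric features (size, idle time) with a Gaussian
# mixture, one common form of distribution-based clustering.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic file features: [size_mb, days_since_access] for three rough kinds
# of content (small fresh documents, medium logs, large cold archives).
files = np.vstack([
    rng.normal([2, 5], [1, 2], (50, 2)),
    rng.normal([200, 30], [40, 10], (50, 2)),
    rng.normal([4000, 300], [500, 60], (50, 2)),
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(files)
labels = gmm.predict(files)
for k in range(3):
    members = files[labels == k]
    print(f"cluster {k}: {len(members)} files, "
          f"mean size {members[:, 0].mean():.0f} MB, "
          f"mean idle {members[:, 1].mean():.0f} days")
```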


2019 ◽  
Vol 41 (2) ◽  
pp. 75-106
Author(s):  
Sunyoung Kim ◽  
Byungwoong Kwon
