Data Stores, Warehouses, Big Data, Lakes, and Cloud Data

2021
pp. 99-124
Author(s):
Scott Burk
David E. Sweenor
Gary Miner
Keyword(s):
Big Data


Author(s):
J. Boehm
K. Liu
C. Alis

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore a natural step to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data are typically encoded in specialised binary file formats, which are not natively supported by existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capabilities of the proposed method by loading billions of points into a commodity-hardware compute cluster, and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
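The map-based ingestion pattern described above can be sketched in a few lines of PySpark. The snippet below is a minimal illustration, not the authors' implementation: it assumes the laspy library as the single-CPU LAS reader, that the listed file paths are visible to every worker node, and that only x/y/z coordinates are of interest.

```python
# Minimal sketch of the described pattern: distribute a list of point cloud
# file paths across Spark workers and let each worker use an ordinary
# single-CPU reader library (here laspy, as an assumption) to ingest its files.
from pyspark.sql import SparkSession
import laspy  # assumed reader; must be installed on every worker node

def read_points(path):
    """Read one LAS file on a worker and yield (x, y, z) tuples."""
    las = laspy.read(path)
    for x, y, z in zip(las.x, las.y, las.z):
        yield (float(x), float(y), float(z))

if __name__ == "__main__":
    spark = SparkSession.builder.appName("pointcloud-ingestion").getOrCreate()
    sc = spark.sparkContext

    # Placeholder paths; they must be reachable from every node
    # (e.g. a shared or distributed file system).
    paths = ["/data/tiles/tile_001.las", "/data/tiles/tile_002.las"]

    # Parallelize the *paths*, not the data; the flatMap runs the reader
    # library on whichever node owns each partition.
    points = sc.parallelize(paths, numSlices=len(paths)).flatMap(read_points)

    print("ingested points:", points.count())
    spark.stop()
```

The key design choice, as in the paper, is that only the list of file paths is shipped around the cluster; each worker then runs the unmodified single-CPU reader locally on its share of the files.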


Author(s):  
Robert Vrbić

Cloud computing provides a powerful, scalable and flexible infrastructure into which previously established techniques and methods of data mining can be integrated. The result of such integration should be a robust, high-capacity platform able to cope with the ever-increasing production of data, creating the conditions for efficiently mining massive amounts of data from various data warehouses with the aim of producing useful information or new knowledge. This paper discusses such a technology: the technology of big data mining known as Cloud Data Mining (CDM).
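As a purely illustrative sketch of the CDM idea, the snippet below runs one classical mining technique (k-means clustering, via Spark MLlib) on data read from a cloud store; the bucket path and column names are placeholders and are not taken from the paper.

```python
# Illustrative sketch only: a classical data mining technique (k-means)
# executed on a cloud-scale framework (Spark), standing in for the CDM idea
# of pushing established mining methods onto elastic cloud infrastructure.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("cloud-data-mining-demo").getOrCreate()

# In a real deployment this would read from a cloud data warehouse or lake;
# the path and column names here are placeholders.
df = spark.read.parquet("s3a://warehouse/transactions.parquet")
features = VectorAssembler(inputCols=["amount", "items"], outputCol="features")
model = KMeans(k=5, seed=1).fit(features.transform(df))
print(model.clusterCenters())  # the extracted "new knowledge"
spark.stop()
```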


2021
Vol 2021
pp. 1-11
Author(s):  
Lin Yang

In recent years, cloud data has attracted more and more attention. However, because users do not have absolute control over data stored on a cloud server, the cloud storage server must provide evidence that the data are stored intact if users are to retain control over their data. When users are given full management rights, they can independently install operating systems and applications and can choose self-service platforms and various remote management tools to manage and control the host according to their own preferences. This paper introduces a cloud data integrity verification algorithm for sustainable computing in accounting informatization, and studies the advantages and disadvantages of existing data integrity proof mechanisms as well as the new requirements of the cloud storage environment. An LBT-based big data integrity proof mechanism is proposed, which introduces a multibranch path tree as the data structure of the integrity proof mechanism, together with a rank-annotated multibranch path structure and a data integrity detection algorithm. The proposed data integrity verification algorithm and two other integrity verification algorithms are compared in simulation experiments. The results show that, for 500 data blocks, the proposed scheme is about 10% faster than scheme 1 and about 5% faster than scheme 2 in computing time; as the number of operated data blocks grows, the execution time of schemes 1 and 2 increases, whereas the execution time of the proposed scheme remains essentially unchanged, and its computational cost is also lower than that of schemes 1 and 2. The scheme in this paper can verify the integrity of cloud storage data and offers clear verification advantages, which makes it relevant to practical big data integrity verification.
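The abstract does not give the exact LBT construction, but the general idea of a rank-annotated multibranch hash tree can be sketched as follows; the branching factor, hash function and byte encoding below are assumptions made for illustration only, not the paper's scheme.

```python
# Hedged sketch of the general idea (not the paper's exact LBT construction):
# a multibranch hash tree whose inner nodes also store a rank (number of
# leaves below), so a proof fixes both the content and the position of a block.
import hashlib

BRANCH = 4  # multibranch degree; the paper's choice may differ

def h(*parts):
    m = hashlib.sha256()
    for p in parts:
        m.update(p)
    return m.digest()

def build(blocks):
    """Return the tree as a list of levels, leaves first; nodes are (digest, rank)."""
    level = [(h(b), 1) for b in blocks]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), BRANCH):
            group = level[i:i + BRANCH]
            rank = sum(r for _, r in group)
            digest = h(rank.to_bytes(8, "big"), *[d for d, _ in group])
            nxt.append((digest, rank))
        level = nxt
        levels.append(level)
    return levels

def root(levels):
    return levels[-1][0]

blocks = [f"block-{i}".encode() for i in range(10)]
levels = build(blocks)
print("root digest:", root(levels)[0].hex(), "covers", root(levels)[1], "blocks")

# In the cloud-storage setting the verifier keeps only the root; the server
# returns a challenged block plus the sibling digests/ranks along its path,
# and the verifier recomputes the root and compares.
```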


2020
Vol 39 (4)
pp. 5027-5036
Author(s):
You Lu
Qiming Fu
Xuefeng Xi
Zhenping Chen

Data outsourcing has gradually become a mainstream solution, but once data are outsourced, data owners no longer control the underlying hardware, so the integrity of the data may objectively be compromised. Many current studies achieve cloud data verification with low network overhead by designing algorithmic structures (e.g., hashing, Merkle verification trees); however, cloud service providers may refuse to acknowledge the incompleteness of cloud data in order to avoid liability or for business reasons. There is therefore a need to build a secure, reliable, tamper-proof, and non-forgeable verification system that supports accountability. Blockchain is a chain-like data structure constructed from data signatures, timestamps, hash functions, and proof-of-work mechanisms, and using blockchain technology to build an integrity verification system makes fault accountability achievable. This paper uses the Hadoop framework to implement data collection and storage in an HBase system based on a big data architecture. In summary, building on research into blockchain-based cloud data collection and storage technology and on existing big data storage middleware, a high-throughput, highly concurrent, and highly available data collection and processing system has been realized.
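As a hedged illustration of the accountability argument (not the paper's Hadoop/HBase system), the following sketch chains integrity records of outsourced data blocks with hashes and timestamps, so that any later alteration of a record breaks the chain; all names and fields are invented for the example.

```python
# Minimal sketch of the accountability idea: each verification record stores a
# digest of the cloud data block, a timestamp, and the hash of the previous
# record, so neither the owner nor the provider can silently rewrite history.
import hashlib, json, time

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def append_record(chain, block_id: str, block_data: bytes):
    prev_hash = chain[-1]["record_hash"] if chain else "0" * 64
    body = {
        "block_id": block_id,
        "data_digest": sha256(block_data),
        "timestamp": time.time(),
        "prev_hash": prev_hash,
    }
    body["record_hash"] = sha256(json.dumps(body, sort_keys=True).encode())
    chain.append(body)

def chain_is_consistent(chain) -> bool:
    for i, rec in enumerate(chain):
        expected_prev = chain[i - 1]["record_hash"] if i else "0" * 64
        payload = {k: v for k, v in rec.items() if k != "record_hash"}
        if rec["prev_hash"] != expected_prev:
            return False
        if rec["record_hash"] != sha256(json.dumps(payload, sort_keys=True).encode()):
            return False
    return True

chain = []
append_record(chain, "blk-001", b"outsourced data block 1")
append_record(chain, "blk-002", b"outsourced data block 2")
print("chain consistent:", chain_is_consistent(chain))
```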


Author(s):  
Kiritkumar J. Modi ◽  
Prachi Devangbhai Shah ◽  
Zalak Prajapati

The rapid growth of digitization in the present era has led to an exponential increase of information, which demands a Big Data paradigm. Big Data denotes complex, unstructured, massive, heterogeneous data. Big Data is essential to success in many applications; however, it has a major setback regarding security and privacy issues. These issues arise because Big Data is scattered across distributed systems by various users. The security of Big Data relates to all the solutions and measures that protect the data from threats and malicious activities. Privacy concerns the processing of personal data, while security means protecting information assets from unauthorized access. The existence of cloud computing and cloud data storage has been a predecessor of and facilitator for the emergence of Big Data computing. This article highlights open issues related to traditional techniques of Big Data privacy and security. Moreover, it also provides a comprehensive overview of possible security techniques and future directions addressing Big Data privacy and security issues.


2012
Vol 12 (3)
pp. 173-181
Author(s):
Harry Enke
Adrian Partl
Alexander Reinefeld
Florian Schintke

Author(s):  
A. Olasz
B. Nguyen Thai
D. Kristóf

Within recent years, several new approaches and solutions for Big Data processing have been developed. The geospatial world still faces a lack of well-established distributed processing solutions tailored to the amount and heterogeneity of geodata, especially when fast data processing is a must. The goal of such systems is to improve processing time by distributing data transparently across processing (and/or storage) nodes. These types of methodology are based on the concept of divide and conquer. Nevertheless, in the context of geospatial processing, most distributed computing frameworks have important limitations regarding both data distribution and data partitioning methods. Moreover, flexibility and extensibility for handling various data types (often in binary formats) are also strongly required. This paper presents a concept for tiling, stitching and processing of big geospatial data. The system is based on the IQLib concept (https://github.com/posseidon/IQLib/) developed in the frame of the IQmulus EU FP7 research and development project (http://www.iqmulus.eu). The data distribution framework has no limitations on programming language environment and can execute scripts (and workflows) written in different development frameworks (e.g. Python, R or C#). It is capable of processing raster, vector and point cloud data. The above-mentioned prototype is presented through a case study dealing with country-wide processing of raster imagery. Further investigations on algorithmic and implementation details are in focus for the near future.
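The tile/process/stitch pattern described above can be illustrated independently of IQLib's actual API; the sketch below uses plain NumPy and a local process pool, with the tile size and the per-tile operation chosen arbitrarily for the example.

```python
# Hedged illustration of the tile / process / stitch pattern, not IQLib itself:
# split a raster into tiles, process the tiles in parallel, then reassemble.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

TILE = 256  # tile edge length in pixels; an arbitrary choice for the sketch

def split(raster):
    """Yield ((row, col), tile) pairs covering the raster."""
    for r in range(0, raster.shape[0], TILE):
        for c in range(0, raster.shape[1], TILE):
            yield (r, c), raster[r:r + TILE, c:c + TILE]

def process(item):
    """Stand-in per-tile operation; a real workflow might invoke a Python or R script."""
    (r, c), tile = item
    return (r, c), tile.astype(np.float32) * 2.0  # e.g. a radiometric rescale

def stitch(shape, results):
    out = np.empty(shape, dtype=np.float32)
    for (r, c), tile in results:
        out[r:r + tile.shape[0], c:c + tile.shape[1]] = tile
    return out

if __name__ == "__main__":
    raster = np.random.randint(0, 255, size=(1024, 1024)).astype(np.uint8)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process, split(raster)))
    print("stitched raster shape:", stitch(raster.shape, results).shape)
```

In a distributed deployment the same three roles (partitioning, per-tile execution, stitching) would simply be spread across cluster nodes instead of local processes.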

