Data Stores, Warehouses, Big Data, Lakes, and Cloud Data

2021
pp. 99-124
Author(s):
Scott Burk
David E. Sweenor
Gary Miner
Keyword(s):
Big Data


Author(s):
J. Boehm
K. Liu
C. Alis

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore a natural step to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data are typically encoded in specialised binary file formats, which are not natively supported by existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capabilities of the proposed method by loading billions of points into a commodity-hardware compute cluster, and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
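The map-based ingestion pattern described above can be sketched in a few lines of PySpark. The snippet below is a minimal illustration, not the authors' implementation: it assumes the laspy library as the single-CPU LAS reader, that the listed file paths are visible to every worker node, and that only x/y/z coordinates are of interest.

```python
# Minimal sketch of the described pattern: distribute a list of point cloud
# file paths across Spark workers and let each worker use an ordinary
# single-CPU reader library (here laspy, as an assumption) to ingest its files.
from pyspark.sql import SparkSession
import laspy  # assumed reader; must be installed on every worker node

def read_points(path):
    """Read one LAS file on a worker and yield (x, y, z) tuples."""
    las = laspy.read(path)
    for x, y, z in zip(las.x, las.y, las.z):
        yield (float(x), float(y), float(z))

if __name__ == "__main__":
    spark = SparkSession.builder.appName("pointcloud-ingestion").getOrCreate()
    sc = spark.sparkContext

    # Placeholder paths; they must be reachable from every node
    # (e.g. a shared or distributed file system).
    paths = ["/data/tiles/tile_001.las", "/data/tiles/tile_002.las"]

    # Parallelize the *paths*, not the data; the flatMap runs the reader
    # library on whichever node owns each partition.
    points = sc.parallelize(paths, numSlices=len(paths)).flatMap(read_points)

    print("ingested points:", points.count())
    spark.stop()
```

The key design choice, as in the paper, is that only the list of file paths is shipped around the cluster; each worker then runs the unmodified single-CPU reader locally on its share of the files.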


Author(s):  
Robert Vrbić

Cloud computing provides a powerful, scalable and flexible infrastructure into which previously established techniques and methods of data mining can be integrated. The result of such integration should be a robust, high-capacity platform able to cope with the ever-increasing production of data, creating the conditions for efficiently mining massive amounts of data from various data warehouses with the aim of producing useful information or new knowledge. This paper discusses such a technology: the technology of big data mining known as Cloud Data Mining (CDM).
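As a purely illustrative sketch of the CDM idea, the snippet below runs one classical mining technique (k-means clustering, via Spark MLlib) on data read from a cloud store; the bucket path and column names are placeholders and are not taken from the paper.

```python
# Illustrative sketch only: a classical data mining technique (k-means)
# executed on a cloud-scale framework (Spark), standing in for the CDM idea
# of pushing established mining methods onto elastic cloud infrastructure.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("cloud-data-mining-demo").getOrCreate()

# In a real deployment this would read from a cloud data warehouse or lake;
# the path and column names here are placeholders.
df = spark.read.parquet("s3a://warehouse/transactions.parquet")
features = VectorAssembler(inputCols=["amount", "items"], outputCol="features")
model = KMeans(k=5, seed=1).fit(features.transform(df))
print(model.clusterCenters())  # the extracted "new knowledge"
spark.stop()
```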


2021
Vol 2021
pp. 1-11
Author(s):  
Lin Yang

In recent years, cloud data has attracted more and more attention. However, because users do not have absolute control over data stored on a cloud server, the cloud storage server must provide evidence that the data are stored intact if users are to retain control over their data. When users are given full management rights, they can independently install operating systems and applications and can choose self-service platforms and various remote management tools to manage and control the host according to their own preferences. This paper introduces a cloud data integrity verification algorithm for sustainable computing in accounting informatization, and studies the advantages and disadvantages of existing data integrity proof mechanisms as well as the new requirements of the cloud storage environment. An LBT-based big data integrity proof mechanism is proposed, which introduces a multibranch path tree as the data structure of the integrity proof mechanism, together with a rank-annotated multibranch path structure and a data integrity detection algorithm. The proposed data integrity verification algorithm and two other integrity verification algorithms are compared in simulation experiments. The results show that, for 500 data blocks, the proposed scheme is about 10% faster than scheme 1 and about 5% faster than scheme 2 in computing time; as the number of operated data blocks grows, the execution time of schemes 1 and 2 increases, whereas the execution time of the proposed scheme remains essentially unchanged, and its computational cost is also lower than that of schemes 1 and 2. The scheme in this paper can verify the integrity of cloud storage data and offers clear verification advantages, which makes it relevant to practical big data integrity verification.
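The abstract does not give the exact LBT construction, but the general idea of a rank-annotated multibranch hash tree can be sketched as follows; the branching factor, hash function and byte encoding below are assumptions made for illustration only, not the paper's scheme.

```python
# Hedged sketch of the general idea (not the paper's exact LBT construction):
# a multibranch hash tree whose inner nodes also store a rank (number of
# leaves below), so a proof fixes both the content and the position of a block.
import hashlib

BRANCH = 4  # multibranch degree; the paper's choice may differ

def h(*parts):
    m = hashlib.sha256()
    for p in parts:
        m.update(p)
    return m.digest()

def build(blocks):
    """Return the tree as a list of levels, leaves first; nodes are (digest, rank)."""
    level = [(h(b), 1) for b in blocks]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), BRANCH):
            group = level[i:i + BRANCH]
            rank = sum(r for _, r in group)
            digest = h(rank.to_bytes(8, "big"), *[d for d, _ in group])
            nxt.append((digest, rank))
        level = nxt
        levels.append(level)
    return levels

def root(levels):
    return levels[-1][0]

blocks = [f"block-{i}".encode() for i in range(10)]
levels = build(blocks)
print("root digest:", root(levels)[0].hex(), "covers", root(levels)[1], "blocks")

# In the cloud-storage setting the verifier keeps only the root; the server
# returns a challenged block plus the sibling digests/ranks along its path,
# and the verifier recomputes the root and compares.
```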


2020
Vol 39 (4)
pp. 5027-5036
Author(s):
You Lu
Qiming Fu
Xuefeng Xi
Zhenping Chen

Data outsourcing has gradually become a mainstream solution, but once data are outsourced, data owners no longer control the underlying hardware, so the integrity of the data may objectively be compromised. Many current studies achieve cloud data verification with low network overhead by designing algorithmic structures (e.g., hashing, Merkle verification trees); however, cloud service providers may refuse to acknowledge the incompleteness of cloud data in order to avoid liability or for business reasons. There is therefore a need to build a secure, reliable, tamper-proof, and non-forgeable verification system that supports accountability. Blockchain is a chain-like data structure constructed from data signatures, timestamps, hash functions, and proof-of-work mechanisms, and using blockchain technology to build an integrity verification system makes fault accountability achievable. This paper uses the Hadoop framework to implement data collection and storage in an HBase system based on a big data architecture. In summary, building on research into blockchain-based cloud data collection and storage technology and on existing big data storage middleware, a high-throughput, highly concurrent, and highly available data collection and processing system has been realized.
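As a hedged illustration of the accountability argument (not the paper's Hadoop/HBase system), the following sketch chains integrity records of outsourced data blocks with hashes and timestamps, so that any later alteration of a record breaks the chain; all names and fields are invented for the example.

```python
# Minimal sketch of the accountability idea: each verification record stores a
# digest of the cloud data block, a timestamp, and the hash of the previous
# record, so neither the owner nor the provider can silently rewrite history.
import hashlib, json, time

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def append_record(chain, block_id: str, block_data: bytes):
    prev_hash = chain[-1]["record_hash"] if chain else "0" * 64
    body = {
        "block_id": block_id,
        "data_digest": sha256(block_data),
        "timestamp": time.time(),
        "prev_hash": prev_hash,
    }
    body["record_hash"] = sha256(json.dumps(body, sort_keys=True).encode())
    chain.append(body)

def chain_is_consistent(chain) -> bool:
    for i, rec in enumerate(chain):
        expected_prev = chain[i - 1]["record_hash"] if i else "0" * 64
        payload = {k: v for k, v in rec.items() if k != "record_hash"}
        if rec["prev_hash"] != expected_prev:
            return False
        if rec["record_hash"] != sha256(json.dumps(payload, sort_keys=True).encode()):
            return False
    return True

chain = []
append_record(chain, "blk-001", b"outsourced data block 1")
append_record(chain, "blk-002", b"outsourced data block 2")
print("chain consistent:", chain_is_consistent(chain))
```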


Author(s):  
Kiritkumar J. Modi ◽  
Prachi Devangbhai Shah ◽  
Zalak Prajapati

The rapid growth of digitization in the present era has led to an exponential increase of information, which demands a Big Data paradigm. Big Data denotes complex, unstructured, massive, heterogeneous data. Big Data is essential to success in many applications; however, it has a major setback regarding security and privacy issues. These issues arise because Big Data is scattered across distributed systems by various users. The security of Big Data relates to all the solutions and measures that protect the data from threats and malicious activities. Privacy concerns the processing of personal data, while security means protecting information assets from unauthorized access. The existence of cloud computing and cloud data storage has been a predecessor of and facilitator for the emergence of Big Data computing. This article highlights open issues related to traditional techniques of Big Data privacy and security. Moreover, it also provides a comprehensive overview of possible security techniques and future directions addressing Big Data privacy and security issues.


2012
Vol 12 (3)
pp. 173-181
Author(s):
Harry Enke
Adrian Partl
Alexander Reinefeld
Florian Schintke

Author(s):  
A. Olasz
B. Nguyen Thai
D. Kristóf

Within recent years, several new approaches and solutions for Big Data processing have been developed. The geospatial world still faces a lack of well-established distributed processing solutions tailored to the amount and heterogeneity of geodata, especially when fast data processing is a must. The goal of such systems is to improve processing time by distributing data transparently across processing (and/or storage) nodes. These types of methodology are based on the concept of divide and conquer. Nevertheless, in the context of geospatial processing, most distributed computing frameworks have important limitations regarding both data distribution and data partitioning methods. Moreover, flexibility and extensibility for handling various data types (often in binary formats) are also strongly required. This paper presents a concept for tiling, stitching and processing of big geospatial data. The system is based on the IQLib concept (https://github.com/posseidon/IQLib/) developed in the frame of the IQmulus EU FP7 research and development project (http://www.iqmulus.eu). The data distribution framework has no limitations on programming language environment and can execute scripts (and workflows) written in different development frameworks (e.g. Python, R or C#). It is capable of processing raster, vector and point cloud data. The above-mentioned prototype is presented through a case study dealing with country-wide processing of raster imagery. Further investigations on algorithmic and implementation details are in focus for the near future.
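The tile/process/stitch pattern described above can be illustrated independently of IQLib's actual API; the sketch below uses plain NumPy and a local process pool, with the tile size and the per-tile operation chosen arbitrarily for the example.

```python
# Hedged illustration of the tile / process / stitch pattern, not IQLib itself:
# split a raster into tiles, process the tiles in parallel, then reassemble.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

TILE = 256  # tile edge length in pixels; an arbitrary choice for the sketch

def split(raster):
    """Yield ((row, col), tile) pairs covering the raster."""
    for r in range(0, raster.shape[0], TILE):
        for c in range(0, raster.shape[1], TILE):
            yield (r, c), raster[r:r + TILE, c:c + TILE]

def process(item):
    """Stand-in per-tile operation; a real workflow might invoke a Python or R script."""
    (r, c), tile = item
    return (r, c), tile.astype(np.float32) * 2.0  # e.g. a radiometric rescale

def stitch(shape, results):
    out = np.empty(shape, dtype=np.float32)
    for (r, c), tile in results:
        out[r:r + tile.shape[0], c:c + tile.shape[1]] = tile
    return out

if __name__ == "__main__":
    raster = np.random.randint(0, 255, size=(1024, 1024)).astype(np.uint8)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process, split(raster)))
    print("stitched raster shape:", stitch(raster.shape, results).shape)
```

In a distributed deployment the same three roles (partitioning, per-tile execution, stitching) would simply be spread across cluster nodes instead of local processes.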

