Big Data Ingestion and Streaming Patterns

Author(s):  
Nitin Sawant ◽  
Himanshu Shah
Keyword(s):  
Big Data ◽ 


Author(s):  
J. Boehm ◽  
K. Liu ◽  
C. Alis

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore natural to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into these frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not natively supported by existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of the cluster. We test the capability of the proposed method to load billions of points into a commodity-hardware compute cluster, and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
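
The core idea, wrapping an existing single-CPU point cloud reader in a Spark map over file paths, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: PySpark and the laspy LAS reader stand in for whichever format library was actually used, and the file layout is hypothetical.

```python
# Minimal sketch: distribute an existing single-CPU point cloud reader across
# a Spark cluster by mapping it over a list of file paths. PySpark and the
# laspy LAS library stand in for the libraries the paper actually used.
from pyspark.sql import SparkSession
import laspy  # must be installed on every worker node

def read_las(path):
    """Run the single-threaded reader on one file; return (x, y, z) tuples."""
    las = laspy.read(path)
    return list(zip(las.x, las.y, las.z))

spark = SparkSession.builder.appName("pointcloud-ingest").getOrCreate()
sc = spark.sparkContext

# Hypothetical tile layout on storage that every node can reach (e.g. HDFS).
las_paths = ["/data/tiles/tile_%04d.las" % i for i in range(1000)]

# Each worker ingests its share of the files; flatMap merges the per-file
# point lists into one distributed dataset of billions of points.
points = sc.parallelize(las_paths).flatMap(read_las)
print(points.count())
```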


2021 ◽  
Vol 10 (11) ◽  
pp. 743
Author(s):  
Xiaohui Huang ◽  
Junqing Fan ◽  
Ze Deng ◽  
Jining Yan ◽  
Jiabao Li ◽  
...  

Multi-source Internet of Things (IoT) data, archived in institutional repositories, are increasingly being opened up via web services for public access by scientists, developers, and decision makers, in order to promote research on geohazard prevention. In this paper, we design and implement a big-data-turbocharged system for effective IoT data management following the data lake architecture. We first propose a multi-threading parallel data ingestion method to ingest IoT data from institutional data repositories in parallel. Next, we design storage strategies for both ingested and processed IoT data to store them in a scalable, reliable storage environment. We also build a distributed cache layer to enable fast access to the IoT data. Then, we provide users with a unified, SQL-based interactive environment that enables IoT data exploration by leveraging the processing power of Apache Spark. In addition, we design a standards-based metadata model to describe ingested IoT data and thus support IoT dataset discovery. Finally, we implement a prototype system and conduct experiments on real IoT data repositories to evaluate the efficiency of the proposed system.
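
The multi-threading parallel ingestion step could plausibly look like the sketch below: a thread pool pulls records from several repository endpoints concurrently and lands them in the lake's raw zone. The endpoints, payload shape, and raw-zone path are all illustrative assumptions, not details from the paper.

```python
# Sketch of multi-threaded parallel ingestion from several IoT repositories
# into a data lake's raw zone. Endpoints, payload shape, and paths are
# hypothetical; the paper's actual repositories are not specified here.
import json
import pathlib
import urllib.request
from concurrent.futures import ThreadPoolExecutor

RAW_ZONE = pathlib.Path("/datalake/raw/iot")
RAW_ZONE.mkdir(parents=True, exist_ok=True)

def fetch_repository(endpoint: str) -> str:
    """Download one repository's records and land them in the raw zone."""
    with urllib.request.urlopen(endpoint, timeout=30) as resp:
        records = json.load(resp)
    out = RAW_ZONE / (endpoint.rsplit("/", 1)[-1] + ".json")
    out.write_text(json.dumps(records))
    return f"{endpoint}: {len(records)} records"

endpoints = [
    "https://repo.example.org/iot/seismic",
    "https://repo.example.org/iot/rainfall",
    "https://repo.example.org/iot/gnss",
]

# The downloads are I/O-bound, so a thread pool overlaps them and total
# ingestion time approaches that of the slowest repository alone.
with ThreadPoolExecutor(max_workers=8) as pool:
    for summary in pool.map(fetch_repository, endpoints):
        print(summary)
```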


Author(s):  
Arpna Joshi ◽  
Chirag Singla ◽  
Mr. Pankaj

A data pipeline is the set of actions performed from the time data becomes available for ingestion until value is obtained from that data. Such actions include Extraction (getting the value fields from the dataset), Transformation, and Loading (putting the valuable data into a form that is useful for upstream use). In this big data project, we will simulate a simple batch data pipeline. The dataset of interest, which records US health data for the past 125 years, will be obtained from https://www.githubarchive.org/. The objective of this Spark project is to create a small but real-world pipeline that downloads the dataset as it becomes available, initiates various forms of transformation, and loads it into the forms of storage needed for further use. In this project, Apache Kafka is used for data ingestion, Apache Spark for data processing, and Cassandra for storing the processed results.
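
A hedged skeleton of such a pipeline is sketched below, with Spark Structured Streaming reading the ingested records from Kafka and appending each micro-batch to Cassandra through the DataStax spark-cassandra-connector. The topic name, record schema, keyspace, and table are illustrative, not taken from the project.

```python
# Sketch of the pipeline described above: Kafka for ingestion, Spark for
# processing, Cassandra for storage. Topic name, record schema, keyspace,
# and table are illustrative; they are not from the original project.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, IntegerType

spark = (SparkSession.builder
         .appName("health-data-pipeline")
         # Requires the Kafka and Cassandra connector packages on the classpath.
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

schema = (StructType()
          .add("state", StringType())
          .add("year", IntegerType())
          .add("cases", IntegerType()))

# Extraction: read raw JSON records from the Kafka ingestion topic.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "health-records")
       .load())

# Transformation: parse the JSON payload into typed columns.
parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

# Loading: append each micro-batch to a Cassandra table.
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write.format("org.apache.spark.sql.cassandra")
     .options(keyspace="health", table="records")
     .mode("append").save())

query = parsed.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```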


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Muhammad Babar ◽  
Muhammad Usman Tariq ◽  
Ahmed S. Almasoud ◽  
Mohammad Dahman Alshehri

The present spread of big data has enabled the realization of AI and machine learning. With the rise of big data and machine learning, the idea of improving the accuracy and efficacy of AI applications is also gaining prominence. In the context of traffic applications, machine learning solutions provide improved safeguards in hazardous traffic circumstances. Existing architectures face various challenges, of which data privacy is the foremost for vulnerable road users (VRUs). A key reason for failure in traffic control for pedestrians is flawed handling of user privacy. User data are at risk and are prone to several privacy and security gaps. If an attacker succeeds in infiltrating the setup, the exposed data can be maliciously influenced, fabricated, and misrepresented for illegitimate purposes. In this study, an architecture based on machine learning is proposed to analyze and process big data efficiently in a secure environment. The proposed model considers the privacy of users during big data processing. The proposed architecture is a layered framework with a parallel and distributed module that uses machine learning on big data to achieve secure big data analytics. The architecture includes a distinct unit for privacy management using a machine learning classifier. A stream processing unit is also integrated into the architecture to process the information. The proposed system is implemented using real-time datasets from various sources and experimentally tested with reliable datasets, which demonstrates the effectiveness of the proposed architecture. The data ingestion results are also presented, along with the training and validation results.
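
The privacy management unit is described only at the architecture level; one way such a unit could operate is sketched below, assuming scikit-learn, a binary sensitive/non-sensitive label, and toy feature vectors, none of which come from the paper.

```python
# Hedged sketch of a privacy-management unit: a classifier screens incoming
# records and withholds the payload of those predicted privacy-sensitive
# before they reach the analytics layer. Features, labels, and the masking
# policy are illustrative assumptions, not the paper's actual design.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data: per-record feature vectors (e.g. counts of identifier-like
# fields) with labels 1 = privacy-sensitive, 0 = safe to process.
X_train = np.array([[3, 1, 0], [0, 0, 1], [2, 2, 0], [0, 1, 2]])
y_train = np.array([1, 0, 1, 0])

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

def screen(records):
    """Pass safe records through; mask the payload of sensitive ones."""
    feats = np.array([r["features"] for r in records])
    for record, label in zip(records, clf.predict(feats)):
        if label == 0:
            yield record                         # safe: forward to analytics
        else:
            yield {**record, "payload": None}    # sensitive: withhold payload

stream = [{"features": [2, 1, 0], "payload": "gps-trace"},
          {"features": [0, 0, 2], "payload": "traffic-count"}]
for out in screen(stream):
    print(out)
```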


Sensors ◽  
2016 ◽  
Vol 16 (3) ◽  
pp. 279 ◽  
Author(s):  
Cun Ji ◽  
Qingshi Shao ◽  
Jiao Sun ◽  
Shijun Liu ◽  
Li Pan ◽  
...  
