Applying Apache Spark on Streaming Big Data for Health Status Prediction

2022 ◽ Vol 70 (2) ◽ pp. 3511-3527
Author(s): Ahmed Ismail Ebada ◽ Ibrahim Elhenawy ◽ Chang-Won Jeong ◽ Yunyoung Nam ◽ Hazem Elbakry ◽ ...
Author(s): Muhammad Junaid ◽ Shiraz Ali Wagan ◽ Nawab Muhammad Faseeh Qureshi ◽ Choon Sung Nam ◽ Dong Ryeol Shin

2021 ◽ Vol 464 ◽ pp. 432-437
Author(s): Mario Juez-Gil ◽ Álvar Arnaiz-González ◽ Juan J. Rodríguez ◽ Carlos López-Nozal ◽ César García-Osorio
Keyword(s): Big Data

2018 ◽ Vol 14 (1) ◽ pp. 30-50
Author(s): William H. Money ◽ Stephen J. Cohen

This article analyzes the properties of unknown faults in knowledge management and Big Data systems that process data in real time. These faults introduce risks, threatening the knowledge pyramid and the decisions based on knowledge gleaned from volumes of complex data. The authors hypothesize that faults not yet encountered may require dedicated fault handling, an analytic model, and an architectural framework to assess and manage the faults, mitigate the risks of correlating or integrating otherwise uncorrelated Big Data, and ensure the source pedigree, quality, set integrity, freshness, and validity of the data. New architectures, methods, and tools for handling and analyzing Big Data systems functioning in real time will contribute to organizational knowledge and performance. System designs must mitigate faults arising from real-time streaming processes while addressing variables such as synchronization, redundancy, and latency. The article concludes that, with improved designs, real-time Big Data systems can continuously deliver the value of streaming Big Data.
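As a rough illustration only (the abstract prescribes no particular framework), two of the fault concerns named above, latency of late or out-of-order events and recovery after process failure, map directly onto watermarking and checkpointing in Spark Structured Streaming. The sketch below assumes a hypothetical Kafka topic, schema, and output paths; none of these come from the article.

```python
# Minimal sketch, assuming a Kafka source with JSON events carrying an
# event-time column "ts". Broker address, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fault-tolerant-stream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "ts TIMESTAMP, sensor STRING, reading DOUBLE").alias("e"))
    .select("e.*")
)

# Watermark: accept events up to 10 minutes late before dropping their state,
# bounding memory while tolerating out-of-order arrival (the latency concern).
counts = (
    events.withWatermark("ts", "10 minutes")
    .groupBy(F.window("ts", "5 minutes"), "sensor")
    .count()
)

# Checkpointing: after a failure the query restarts from durable offsets and
# state, mitigating faults in the streaming process (the recovery concern).
query = (
    counts.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "/data/out")                 # placeholder output path
    .option("checkpointLocation", "/data/ckpt")  # placeholder checkpoint dir
    .start()
)
query.awaitTermination()
```

The design choice worth noting is that both mechanisms are declarative: the engine, not the application, replays offsets and restores state, which is what allows a real-time system to keep delivering value across faults.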


Author(s):  
J. Boehm ◽  
K. Liu ◽  
C. Alis

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore natural to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not natively supported by existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of the cluster. We test the capability of the proposed method to load billions of points into a commodity-hardware compute cluster, and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
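As a minimal sketch of the map-based idea (not the authors' implementation), the list of point cloud file paths can be parallelized so that each Spark executor runs an ordinary single-CPU reader on its own files. The use of the laspy LAS reader and all paths below are assumptions made for illustration.

```python
# Sketch of map-distributed point cloud ingestion, assuming the single-CPU
# Python LAS reader laspy; the authors' actual file format library and
# cluster layout are not specified in the abstract.
import numpy as np
import laspy  # conventional single-node LAS/LAZ reader

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pointcloud-ingest").getOrCreate()
sc = spark.sparkContext

# Hypothetical LAS tiles on shared storage visible to every cluster node.
paths = ["/data/tiles/tile_%04d.las" % i for i in range(256)]

def read_tile(path):
    """Run an ordinary single-CPU reader on one file; executed per task."""
    las = laspy.read(path)
    xs = np.asarray(las.x, dtype=float)  # scaled coordinates as arrays
    ys = np.asarray(las.y, dtype=float)
    zs = np.asarray(las.z, dtype=float)
    return [(float(x), float(y), float(z)) for x, y, z in zip(xs, ys, zs)]

# One partition per file, so each ingestion task handles exactly one tile;
# flatMap applies the reader across the cluster and flattens the points.
points = sc.parallelize(paths, len(paths)).flatMap(read_tile)

# Materialize as a DataFrame for downstream processing.
df = points.toDF(["x", "y", "z"])
print(df.count())  # total number of ingested points
```

Giving each file its own partition is the key move: the single-CPU restriction of the format library becomes irrelevant because each reader instance runs inside its own task, and the cluster scales out over the number of files.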

