Two-alternative optimization of moderate batch data processing

2020
Vol 1658
pp. 012027
Author(s):
A V Kolnogorov


2022
pp. 1162-1191
Author(s):
Dinesh Chander
Hari Singh
Abhinav Kirti Gupta

Data processing has become an important field in today's big-data-dominated world. Data is being generated at a tremendous pace from many different sources. The nature of data has shifted from batch data to streaming data, and data processing methodologies have changed accordingly. Traditional SQL is no longer capable of dealing with this big data. This chapter describes the nature of data and the various tools, techniques, and technologies available to handle big data. It also discusses the need to shift big data onto the cloud, the challenges of big data processing in the cloud, the migration from data processing to data analytics, the tools used in data analytics, and the issues and challenges in data processing and analytics. The chapter then touches on an important application area of streaming data, sentiment analysis, and explores it through test-case demonstrations and results.
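
As a rough illustration of the streaming sentiment-analysis application mentioned above, the following Python sketch scores messages incrementally as they arrive; the tiny lexicon, the messages, and the scoring rule are placeholder assumptions for illustration only, not the chapter's actual test cases.

```python
# Toy streaming sentiment scoring: messages arrive one at a time and are scored
# incrementally, in contrast to scoring a finished batch. Lexicon and messages are
# placeholders, not data from the chapter.
POSITIVE = {"good", "great", "fast", "reliable"}
NEGATIVE = {"bad", "slow", "broken", "late"}

def sentiment(text: str) -> int:
    """Naive lexicon score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def process_stream(messages):
    """Score each message as it 'arrives' and keep a running total (streaming style)."""
    running = 0
    for msg in messages:
        running += sentiment(msg)
        yield msg, running

stream = ["service was great and fast", "delivery was late", "app is broken and slow"]
for msg, total in process_stream(stream):
    print(f"{total:+d}  {msg}")
```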


Author(s):  
Arpna Joshi
Chirag Singla
Mr. Pankaj

A data pipeline is a set of actions performed from the time data becomes available for ingestion until value is obtained from that data. Such actions include extraction (getting the value fields from the dataset), transformation, and loading (putting the data of value into a form that is useful for downstream use). In this big data project, we simulate a simple batch data pipeline. The dataset of interest, which records US health data for the past 125 years, will be obtained from https://www.githubarchive.org/. The objective of this Spark project is to create a small but real-world pipeline that downloads the dataset as it becomes available, applies various forms of transformation, and loads it into storage for further use. In this project, Apache Kafka is used for data ingestion, Apache Spark for data processing, and Cassandra for storing the processed results.
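
A minimal sketch of the extract-transform-load steps described above, assuming a PySpark batch job with the spark-sql-kafka and spark-cassandra-connector packages on the classpath; the topic name, broker address, column layout, keyspace, and table are hypothetical placeholders, not details taken from the original project.

```python
# Sketch of a batch ETL pipeline: Kafka (extract) -> Spark (transform) -> Cassandra (load).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("batch-health-pipeline")   # hypothetical application name
    .getOrCreate()
)

# Extract: read a batch of raw records previously ingested via Kafka.
raw = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "health-records")                # assumed topic name
    .load()
)

# Transform: take the value field of each Kafka message and parse its CSV payload.
records = (
    raw.selectExpr("CAST(value AS STRING) AS csv_line")
    .select(F.split("csv_line", ",").alias("cols"))
    .select(
        F.col("cols")[0].alias("year"),                   # assumed column layout
        F.col("cols")[1].alias("state"),
        F.col("cols")[2].cast("double").alias("metric"),
    )
)

# Load: write the cleaned rows to Cassandra for downstream use.
(
    records.write.format("org.apache.spark.sql.cassandra")
    .options(table="health_metrics", keyspace="pipeline")  # assumed keyspace/table
    .mode("append")
    .save()
)
```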


Sensors
2020
Vol 20 (8)
pp. 2224
Author(s):
Lipei Chen
Cheng Xu
Shuai Lin
Siqi Li
Xiaohan Tu

The overhead contact system (OCS) is a critical railway infrastructure for train power supply. Periodic inspections, aimed at assessing the operational condition of the OCS and detecting problems, are necessary to guarantee the safety of railway operations. One means of OCS inspection is to analyze point cloud data collected by mobile 2D LiDAR, and recognizing OCS components in the collected point clouds is a critical task of this analysis. However, the complex composition of the OCS makes the task difficult. To solve the problem of recognizing multiple OCS components, we propose a new deep learning-based method that performs semantic segmentation on the point cloud collected by mobile 2D LiDAR. Because our method classifies points into meaningful object categories scan line by scan line, both online data processing and batch data processing are supported. Local features are important for the success of point cloud semantic segmentation, so we design an iterative point partitioning algorithm and a module named Spatial Fusion Network, the two critical components of our method for multi-scale local feature extraction. We evaluate our method on point clouds in which sixteen categories of common OCS components have been manually labeled. Experimental results show that our method is effective for multiple object recognition, with mean Intersection-over-Union (mIoU) of 96.12% for online data processing and 97.17% for batch data processing.
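
For reference, the mean Intersection-over-Union (mIoU) metric quoted above can be computed per class over per-point predictions as in the following sketch; the sixteen-class setup mirrors the labeled OCS component categories, but the label arrays here are random placeholders.

```python
# Per-class IoU averaged over classes present in the data (mIoU) for point-wise labels.
import numpy as np

def mean_iou(pred_labels: np.ndarray, true_labels: np.ndarray, num_classes: int) -> float:
    """Average IoU over classes that occur in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c = pred_labels == c
        true_c = true_labels == c
        union = np.logical_or(pred_c, true_c).sum()
        if union == 0:          # class absent from both: skip it
            continue
        intersection = np.logical_and(pred_c, true_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Example with random placeholder labels for 10,000 points and 16 classes.
rng = np.random.default_rng(0)
pred = rng.integers(0, 16, size=10_000)
true = rng.integers(0, 16, size=10_000)
print(f"mIoU: {mean_iou(pred, true, num_classes=16):.4f}")
```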


Author(s):  
Timo Bingmann
Michael Axtmann
Emanuel Jöbstl
Sebastian Lamm
Huyen Chau Nguyen
...  

2021
Vol 2052 (1)
pp. 012020
Author(s):  
A V Kolnogorov

Abstract We consider the two-alternative processing of big data in the framework of the two-armed bandit problem. We assume that there are two processing methods with different, fixed, but a priori unknown efficiencies, which may differ for various reasons, including those caused by legislation. Results of data processing are interpreted as random incomes. During the control process, one has to determine the more efficient method and ensure its primary usage. The difficulty of the problem is that its solution essentially depends on the distributions of the one-step incomes corresponding to the results of data processing. However, in the case of big data, we show that there are universal processing strategies for a wide class of distributions of one-step incomes. To this end, we consider the Gaussian two-armed bandit, which naturally arises when batch data processing is analyzed. The minimax risk and minimax strategy are sought as Bayesian ones corresponding to the worst-case prior distribution. We present a recursive integro-difference equation for computing the Bayesian risk and Bayesian strategy with respect to the worst-case prior distribution, and a second-order partial differential equation into which the integro-difference equation turns in the limiting case as the control horizon goes to infinity. We also show that, in the case of big data, processing data one-by-one is no more efficient than optimal batch data processing for some types of distributions of one-step incomes, e.g., Bernoulli and Poisson distributions. Numerical experiments show that the proposed universal strategies provide high performance in two-alternative big data processing.
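
The following simulation sketches the batch two-armed bandit setting described in the abstract, assuming approximately Gaussian batch incomes; the simple explore-then-commit rule used here is only an illustration of the setup, not the paper's minimax or Bayesian strategy, and all parameter values are placeholders.

```python
# Batch two-armed bandit: two processing methods with unknown mean incomes, data handled
# in equal-size batches, regret measured against always using the better method.
import numpy as np

rng = np.random.default_rng(42)

def simulate(means, sigma, n_batches, batch_size, explore_batches):
    """Total income and regret of an explore-then-commit batch strategy (illustrative only)."""
    totals = np.zeros(2)      # cumulative income per method
    counts = np.zeros(2)      # number of items processed per method
    income = 0.0
    for t in range(n_batches):
        if t < 2 * explore_batches:
            arm = t % 2                               # alternate methods while exploring
        else:
            arm = int(np.argmax(totals / counts))     # then commit to the better-looking method
        # batch income is approximately Gaussian by the central limit theorem
        batch_income = rng.normal(means[arm] * batch_size,
                                  sigma * np.sqrt(batch_size))
        totals[arm] += batch_income
        counts[arm] += batch_size
        income += batch_income
    best = max(means) * n_batches * batch_size
    return income, best - income

income, regret = simulate(means=(1.0, 1.1), sigma=1.0,
                          n_batches=50, batch_size=1000, explore_batches=5)
print(f"total income ≈ {income:.0f}, regret ≈ {regret:.0f}")
```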

