Two-alternative optimization of moderate batch data processing

2020
Vol 1658
pp. 012027
Author(s):
A V Kolnogorov


2022
pp. 1162-1191
Author(s):
Dinesh Chander
Hari Singh
Abhinav Kirti Gupta

Data processing has become an important field in today's big-data-dominated world. Data is being generated at a tremendous pace from many different sources. The nature of data has shifted from batch data to streaming data, and data processing methodologies have changed accordingly. Traditional SQL is no longer capable of dealing with this big data. This chapter describes the nature of data and the various tools, techniques, and technologies available to handle big data. It also discusses the need to shift big data onto the cloud, the challenges of big data processing in the cloud, the migration from data processing to data analytics, the tools used in data analytics, and the issues and challenges in data processing and analytics. The chapter then touches on an important application area of streaming data, sentiment analysis, and explores it through test-case demonstrations and results.
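
As a rough illustration of the streaming sentiment-analysis application mentioned above, the following Python sketch scores messages incrementally as they arrive; the tiny lexicon, the messages, and the scoring rule are placeholder assumptions for illustration only, not the chapter's actual test cases.

```python
# Toy streaming sentiment scoring: messages arrive one at a time and are scored
# incrementally, in contrast to scoring a finished batch. Lexicon and messages are
# placeholders, not data from the chapter.
POSITIVE = {"good", "great", "fast", "reliable"}
NEGATIVE = {"bad", "slow", "broken", "late"}

def sentiment(text: str) -> int:
    """Naive lexicon score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def process_stream(messages):
    """Score each message as it 'arrives' and keep a running total (streaming style)."""
    running = 0
    for msg in messages:
        running += sentiment(msg)
        yield msg, running

stream = ["service was great and fast", "delivery was late", "app is broken and slow"]
for msg, total in process_stream(stream):
    print(f"{total:+d}  {msg}")
```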


Author(s):  
Arpna Joshi
Chirag Singla
Mr. Pankaj

A data pipeline is a set of actions performed from the time data becomes available for ingestion until value is obtained from that data. Such actions include extraction (getting the value fields from the dataset), transformation, and loading (putting the data of value into a form that is useful for downstream use). In this big data project, we simulate a simple batch data pipeline. The dataset of interest, which records US health data for the past 125 years, will be obtained from https://www.githubarchive.org/. The objective of this Spark project is to create a small but real-world pipeline that downloads the dataset as it becomes available, applies various forms of transformation, and loads it into storage for further use. In this project, Apache Kafka is used for data ingestion, Apache Spark for data processing, and Cassandra for storing the processed results.
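
A minimal sketch of the extract-transform-load steps described above, assuming a PySpark batch job with the spark-sql-kafka and spark-cassandra-connector packages on the classpath; the topic name, broker address, column layout, keyspace, and table are hypothetical placeholders, not details taken from the original project.

```python
# Sketch of a batch ETL pipeline: Kafka (extract) -> Spark (transform) -> Cassandra (load).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("batch-health-pipeline")   # hypothetical application name
    .getOrCreate()
)

# Extract: read a batch of raw records previously ingested via Kafka.
raw = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "health-records")                # assumed topic name
    .load()
)

# Transform: take the value field of each Kafka message and parse its CSV payload.
records = (
    raw.selectExpr("CAST(value AS STRING) AS csv_line")
    .select(F.split("csv_line", ",").alias("cols"))
    .select(
        F.col("cols")[0].alias("year"),                   # assumed column layout
        F.col("cols")[1].alias("state"),
        F.col("cols")[2].cast("double").alias("metric"),
    )
)

# Load: write the cleaned rows to Cassandra for downstream use.
(
    records.write.format("org.apache.spark.sql.cassandra")
    .options(table="health_metrics", keyspace="pipeline")  # assumed keyspace/table
    .mode("append")
    .save()
)
```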


Sensors
2020
Vol 20 (8)
pp. 2224
Author(s):
Lipei Chen
Cheng Xu
Shuai Lin
Siqi Li
Xiaohan Tu

The overhead contact system (OCS) is a critical railway infrastructure for train power supply. Periodic inspections, aimed at assessing the operational condition of the OCS and detecting problems, are necessary to guarantee the safety of railway operations. One means of OCS inspection is to analyze point cloud data collected by mobile 2D LiDAR, and recognizing OCS components in the collected point clouds is a critical task of this analysis. However, the complex composition of the OCS makes the task difficult. To solve the problem of recognizing multiple OCS components, we propose a new deep learning-based method that performs semantic segmentation on the point cloud collected by mobile 2D LiDAR. Because our method classifies points into meaningful object categories scan line by scan line, both online data processing and batch data processing are supported. Local features are important for the success of point cloud semantic segmentation, so we design an iterative point partitioning algorithm and a module named Spatial Fusion Network, the two critical components of our method for multi-scale local feature extraction. We evaluate our method on point clouds in which sixteen categories of common OCS components have been manually labeled. Experimental results show that our method is effective for multiple object recognition, with mean Intersection-over-Union (mIoU) of 96.12% for online data processing and 97.17% for batch data processing.
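
For reference, the mean Intersection-over-Union (mIoU) metric quoted above can be computed per class over per-point predictions as in the following sketch; the sixteen-class setup mirrors the labeled OCS component categories, but the label arrays here are random placeholders.

```python
# Per-class IoU averaged over classes present in the data (mIoU) for point-wise labels.
import numpy as np

def mean_iou(pred_labels: np.ndarray, true_labels: np.ndarray, num_classes: int) -> float:
    """Average IoU over classes that occur in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c = pred_labels == c
        true_c = true_labels == c
        union = np.logical_or(pred_c, true_c).sum()
        if union == 0:          # class absent from both: skip it
            continue
        intersection = np.logical_and(pred_c, true_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Example with random placeholder labels for 10,000 points and 16 classes.
rng = np.random.default_rng(0)
pred = rng.integers(0, 16, size=10_000)
true = rng.integers(0, 16, size=10_000)
print(f"mIoU: {mean_iou(pred, true, num_classes=16):.4f}")
```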


Author(s):  
Timo Bingmann
Michael Axtmann
Emanuel Jöbstl
Sebastian Lamm
Huyen Chau Nguyen
...  

2021
Vol 2052 (1)
pp. 012020
Author(s):  
A V Kolnogorov

Abstract We consider the two-alternative processing of big data in the framework of the two-armed bandit problem. We assume that there are two processing methods with different, fixed, but a priori unknown efficiencies, which may differ for various reasons, including those caused by legislation. Results of data processing are interpreted as random incomes. During the control process, one has to determine the more efficient method and ensure its primary usage. The difficulty of the problem is that its solution essentially depends on the distributions of the one-step incomes corresponding to the results of data processing. However, in the case of big data, we show that there are universal processing strategies for a wide class of distributions of one-step incomes. To this end, we consider the Gaussian two-armed bandit, which naturally arises when batch data processing is analyzed. The minimax risk and minimax strategy are sought as Bayesian ones corresponding to the worst-case prior distribution. We present a recursive integro-difference equation for computing the Bayesian risk and Bayesian strategy with respect to the worst-case prior distribution, and a second-order partial differential equation into which the integro-difference equation turns in the limiting case as the control horizon goes to infinity. We also show that, in the case of big data, processing data one-by-one is no more efficient than optimal batch data processing for some types of distributions of one-step incomes, e.g., Bernoulli and Poisson distributions. Numerical experiments show that the proposed universal strategies provide high performance in two-alternative big data processing.
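
The following simulation sketches the batch two-armed bandit setting described in the abstract, assuming approximately Gaussian batch incomes; the simple explore-then-commit rule used here is only an illustration of the setup, not the paper's minimax or Bayesian strategy, and all parameter values are placeholders.

```python
# Batch two-armed bandit: two processing methods with unknown mean incomes, data handled
# in equal-size batches, regret measured against always using the better method.
import numpy as np

rng = np.random.default_rng(42)

def simulate(means, sigma, n_batches, batch_size, explore_batches):
    """Total income and regret of an explore-then-commit batch strategy (illustrative only)."""
    totals = np.zeros(2)      # cumulative income per method
    counts = np.zeros(2)      # number of items processed per method
    income = 0.0
    for t in range(n_batches):
        if t < 2 * explore_batches:
            arm = t % 2                               # alternate methods while exploring
        else:
            arm = int(np.argmax(totals / counts))     # then commit to the better-looking method
        # batch income is approximately Gaussian by the central limit theorem
        batch_income = rng.normal(means[arm] * batch_size,
                                  sigma * np.sqrt(batch_size))
        totals[arm] += batch_income
        counts[arm] += batch_size
        income += batch_income
    best = max(means) * n_batches * batch_size
    return income, best - income

income, regret = simulate(means=(1.0, 1.1), sigma=1.0,
                          n_batches=50, batch_size=1000, explore_batches=5)
print(f"total income ≈ {income:.0f}, regret ≈ {regret:.0f}")
```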

