SPOT: Testing Stream Processing Programs with Symbolic Execution and Stream Synthesizing

2021 ◽  
Vol 11 (17) ◽  
pp. 8057
Author(s):  
Qian Ye ◽  
Minyan Lu

Adoption of distributed stream processing (DSP) systems such as Apache Flink in real-time big data processing is increasing. However, DSP programs are prone to bugs, especially when a programmer neglects certain DSP features (e.g., source data reordering), which motivates the development of approaches for testing and verification. In this paper, we focus on the test data generation problem for DSP programs. Currently, no approach generates test data for DSP programs that both achieves high path coverage and covers different stream reordering situations. We present a novel solution, SPOT (i.e., Stream Processing Program Test), to achieve these two goals simultaneously. First, SPOT generates a set of individual test data items, each representing one path of the DSP program, through symbolic execution. Then, SPOT composes these independent items into various time-series data (i.e., streams) with diverse reorderings. Finally, we can perform a test by continuously feeding these streams to the DSP program. To support symbolic analysis automatically, we also developed JPF-Flink, an extension of JPF (i.e., Java Pathfinder) that coordinates the execution of Flink programs. We present four case studies to illustrate that: (1) SPOT can support symbolic analysis for the commonly used DSP operators; (2) test data generated by SPOT achieve high JDU (i.e., Joint Dataflow and UDF) path coverage more efficiently than two recent DSP testing approaches; (3) test data generated by SPOT trigger software failures more easily than data from those two approaches; and (4) the data randomly generated by those two techniques are highly skewed in terms of stream reordering, as measured by an entropy metric, whereas the reordering distribution of test data from SPOT is even.
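The entropy measure of reordering skew can be illustrated with a short sketch. This is a reconstruction under stated assumptions: summarizing each stream by the permutation that sorts it, and normalizing by the maximum entropy, are illustrative choices, not necessarily the paper's exact metric.

```python
import math
from collections import Counter

def reordering_entropy(streams):
    """Normalized Shannon entropy of reordering-pattern frequencies.

    Each stream is summarized by the permutation that would sort it,
    standing in for its reordering pattern. An even spread of patterns
    yields 1.0; a highly skewed spread yields a value near 0.
    """
    patterns = Counter(
        tuple(sorted(range(len(s)), key=lambda i: s[i])) for s in streams
    )
    total = sum(patterns.values())
    h = -sum((c / total) * math.log2(c / total) for c in patterns.values())
    max_h = math.log2(len(patterns)) if len(patterns) > 1 else 1.0
    return h / max_h

# Skewed: 9 of 10 streams share the same ordering pattern
skewed = [[1, 2, 3]] * 9 + [[3, 2, 1]]
# Even: all six orderings of three items appear equally often
even = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 1, 2], [3, 2, 1]]
```

Under this measure, randomly generated streams that mostly arrive in one order score near 0, while a generator that covers reorderings uniformly scores near 1.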

Processes ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 1115
Author(s):  
Gilseung Ahn ◽  
Hyungseok Yun ◽  
Sun Hur ◽  
Si-Yeong Lim

Accurate prediction of the remaining useful life (RUL) of equipment, using machine learning (ML) or deep learning (DL) models trained on data collected up to the point of failure, is crucial for maintenance scheduling. Because such data become available only once the equipment fails, collecting enough of them to train a model without overfitting can be challenging. Here, we propose a method of generating time-series data for RUL models to resolve the problems posed by insufficient data. The proposed method converts every training time series into a sequence of alphabetical strings by symbolic aggregate approximation and identifies occurrence patterns in the converted sequences. The method then generates a new sequence and inversely transforms it into a new time series. Experiments with various RUL prediction datasets and ML/DL models verified that the proposed data-generation method helps avoid overfitting in RUL prediction models.
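The symbolic conversion step can be illustrated with a minimal symbolic aggregate approximation (SAX) sketch. The segment count, 4-letter alphabet, and function names below are illustrative assumptions, not the paper's exact configuration.

```python
import statistics

# Gaussian breakpoints that split N(0, 1) into equiprobable regions,
# here for an alphabet of size 4 (quartiles of the standard normal).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax_transform(series, n_segments=8):
    """Minimal SAX sketch: z-normalize, piecewise aggregate approximation
    (PAA), then map each segment mean to a letter."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0
    z = [(v - mu) / sigma for v in series]          # z-normalization
    seg_len = len(z) / n_segments
    word = []
    for k in range(n_segments):
        seg = z[round(k * seg_len):round((k + 1) * seg_len)]
        m = statistics.fmean(seg)                   # PAA segment mean
        word.append(ALPHABET[sum(m > b for b in BREAKPOINTS)])
    return "".join(word)

# A steadily falling sensor signal maps to a non-increasing letter sequence
word = sax_transform([10.0 - 10.0 * i / 63 for i in range(64)])
```

Once each training series is a word like this, occurrence patterns can be mined over the letters, and a generated word can be inverted back to a numeric series by sampling within each letter's value region.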


Author(s):  
Soo-Tai Nam ◽  
Chan-Yong Jin ◽  
Seong-Yoon Shin

Big data refers to large sets of structured or unstructured data that are difficult to collect, store, manage, and analyze with existing database management tools, and to the techniques for extracting value from such data and interpreting the results. Big data has three characteristics: the volume of data (volume), the speed of data generation (velocity), and the variety of information forms (variety). Time-series data are obtained by collecting and recording data generated over the flow of time. If analysis of these time-series data uncovers their characteristics, those features help in understanding and analyzing the series. The concept of distance is the simplest and most intuitive way to deal with similarity between objects, and the most commonly used and widely known distance measure is the Euclidean distance. This study analyzes the similarity of stock price movements using 793,800 closing prices of 1,323 companies in Korea. The Euclidean distances were calculated with Visual Studio and Excel as analysis tools. We selected "000100" as the target domestic company and prepared the data for big data analysis. The analysis shows that the company at the shortest Euclidean distance has code "143860", with a calculated value of 11.147. Based on these results, the limitations of the study and its theoretical implications are discussed.
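The distance computation behind this comparison can be sketched in a few lines. The company codes and prices below are illustrative, not the study's actual data.

```python
import math

def euclidean_distance(series_a, series_b):
    """Euclidean distance between two equal-length price series."""
    if len(series_a) != len(series_b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(series_a, series_b)))

def nearest_company(target, candidates):
    """Return (code, distance) of the candidate series closest to the
    target series; `candidates` maps company codes to price series."""
    return min(
        ((code, euclidean_distance(target, s)) for code, s in candidates.items()),
        key=lambda pair: pair[1],
    )

# Toy example: company "A" tracks the target more closely than "B"
closest = nearest_company([1.0, 2.0, 3.0],
                          {"A": [1.0, 2.0, 4.0], "B": [5.0, 5.0, 5.0]})
```

In the study's setting, the target would be the 600 closing prices of "000100" and the candidates the remaining 1,322 companies' series.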


2019 ◽  
Vol 290 ◽  
pp. 02002
Author(s):  
Crina Narcisa Deac ◽  
Gicu Calin Deac ◽  
Florina Chiscop ◽  
Cicerone Laurentiu Popa

Anomaly detection is a crucial analysis topic in the field of Industry 4.0 data mining, as is estimating the probability that a specific machine will go down due to the failure of a component in the next time interval. In this article, we used time-series data collected from machines, covering both classes: data leading up to machine failures and data from healthy operational periods. We used telemetry data, error logs from still-operational components, and maintenance records comprising historical breakdowns and component replacements to build and compare several different models. The proposed methods were validated by comparing the actual failures in the test data with the predicted component failures.
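One common way to frame "will this machine fail in the next time interval" as a supervised problem is to label each telemetry timestamp by its proximity to a recorded failure. This is a minimal sketch of that framing; the horizon and names are assumptions, not the article's exact setup.

```python
from datetime import datetime, timedelta

def label_failure_window(timestamps, failure_times, horizon_hours=24):
    """Return 1 for each telemetry timestamp that is followed by a
    recorded failure within `horizon_hours`, else 0."""
    horizon = timedelta(hours=horizon_hours)
    return [
        int(any(t <= f <= t + horizon for f in failure_times))
        for t in timestamps
    ]

# Telemetry every 6 hours; one failure recorded at 13:00
telemetry = [datetime(2021, 1, 1, h) for h in (0, 6, 12)]
failures = [datetime(2021, 1, 1, 13)]
labels = label_failure_window(telemetry, failures, horizon_hours=2)
```

A classifier trained on such labels, with features built from telemetry, error logs, and maintenance history, yields exactly the kind of predicted component failures that the article compares against actual failures in the test data.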


Author(s):  
Haji A. Haji ◽  
Kusman Sadik ◽  
Agus Mohamad Soleh

A simulation study is used when real-world data are hard to find or time-consuming to gather; it involves generating data sets from a specific statistical model or by random sampling. Simulating the process is useful for testing theories and understanding the behavior of statistical methods. This study aimed to compare ARIMA and fuzzy time series (FTS) models in order to identify the best model for forecasting time-series data, based on 100 replicates of 100 observations generated from an ARIMA(1,0,1) model. Sixteen scenarios were used, combining four error-variance values (0.5, 1, 3, 5) with four ARMA(1,1) parameter settings. Performance was evaluated with three criteria, mean absolute percentage error (MAPE), root mean squared error (RMSE), and bias, to determine the more appropriate method. The results show the lowest bias for the Chen fuzzy time series model, whose error measures are smaller than those of the other models. The results also indicate that the Chen method is competitive with advanced forecasting techniques in all of the considered situations, providing better forecasting accuracy.
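The three evaluation criteria are standard and can be computed directly from forecasts and actuals; a plain sketch follows (the dictionary keys and the actual-minus-predicted sign convention for bias are illustrative choices).

```python
import math

def forecast_metrics(actual, predicted):
    """Compute MAPE (in percent), RMSE, and mean bias (actual - predicted)
    for a forecast evaluated against actual values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / len(actual)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    bias = sum(errors) / len(errors)
    return {"MAPE": mape, "RMSE": rmse, "Bias": bias}

# Toy example: over- and under-forecast by 10 on two points
m = forecast_metrics([100.0, 200.0], [110.0, 190.0])
```

In a simulation design like the one above, these metrics would be averaged over the 100 replicates within each of the 16 scenarios before comparing models.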


2018 ◽  
Vol 149 ◽  
pp. 68-81 ◽  
Author(s):  
Devesh K. Jha ◽  
Nurali Virani ◽  
Jan Reimann ◽  
Abhishek Srivastav ◽  
Asok Ray

2013 ◽  
Vol 50 (3) ◽  
pp. 415-423 ◽  
Author(s):  
Erica Chenoweth ◽  
Orion A Lewis

Recent studies indicate that strategic nonviolent campaigns have been more successful over time in achieving their political objectives than violent insurgencies. But additional research has been limited by a lack of time-series data on nonviolent and violent campaigns, as well as a lack of more nuanced and detailed data on the attributes of the campaigns. In this article, we introduce the Nonviolent and Violent Campaigns and Outcomes (NAVCO) 2.0 dataset, which compiles annual data on 250 nonviolent and violent mass movements for regime change, anti-occupation, and secession from 1945 to 2006. NAVCO 2.0 also includes features of each campaign, such as participation size and diversity, the behavior of regime elites, repression and its effects on the campaign, support (or lack thereof) from external actors, and progress toward the campaign outcomes. After describing the data generation process and the dataset itself, we demonstrate why studying nonviolent resistance may yield novel insights for conflict scholars by replicating an influential study of civil war onset. This preliminary study reveals strikingly divergent findings regarding the systematic drivers of nonviolent campaign onset. Nonviolent campaign onset may be driven by separate – and in some cases, opposing – processes relative to violent campaigns. This finding underscores the value-added of the dataset, as well as the importance of evaluating methods of conflict within a unified research design.

