SPOT: Testing Stream Processing Programs with Symbolic Execution and Stream Synthesizing

2021 ◽  
Vol 11 (17) ◽  
pp. 8057
Author(s):  
Qian Ye ◽  
Minyan Lu

Adoption of distributed stream processing (DSP) systems such as Apache Flink in real-time big data processing is increasing. However, DSP programs are prone to bugs, especially when a programmer neglects certain DSP features (e.g., source data reordering), which motivates the development of approaches for testing and verification. In this paper, we focus on the test data generation problem for DSP programs. Currently, no approach generates test data for DSP programs that both achieves high path coverage and covers different stream reordering situations. We present a novel solution, SPOT (i.e., Stream Processing Program Test), to achieve these two goals simultaneously. First, SPOT generates a set of individual test data items, each representing one path of the DSP program, through symbolic execution. Then, SPOT composes these independent items into various time-series data (i.e., streams) with diverse reorderings. Finally, we can perform a test by continuously feeding these streams to the DSP program. To support symbolic analysis automatically, we also developed JPF-Flink, an extension of JPF (i.e., Java Pathfinder) that coordinates the execution of Flink programs. We present four case studies to illustrate that: (1) SPOT can support symbolic analysis for the commonly used DSP operators; (2) test data generated by SPOT achieve high JDU (i.e., Joint Dataflow and UDF) path coverage more efficiently than two recent DSP testing approaches; (3) test data generated by SPOT trigger software failures more easily than data from those two approaches; and (4) the data randomly generated by those two techniques are highly skewed in terms of stream reordering, as measured by an entropy metric, whereas the reordering distribution of test data from SPOT is even.
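The entropy measure of reordering skew can be illustrated with a short sketch. This is a reconstruction under stated assumptions: summarizing each stream by the permutation that sorts it, and normalizing by the maximum entropy, are illustrative choices, not necessarily the paper's exact metric.

```python
import math
from collections import Counter

def reordering_entropy(streams):
    """Normalized Shannon entropy of reordering-pattern frequencies.

    Each stream is summarized by the permutation that would sort it,
    standing in for its reordering pattern. An even spread of patterns
    yields 1.0; a highly skewed spread yields a value near 0.
    """
    patterns = Counter(
        tuple(sorted(range(len(s)), key=lambda i: s[i])) for s in streams
    )
    total = sum(patterns.values())
    h = -sum((c / total) * math.log2(c / total) for c in patterns.values())
    max_h = math.log2(len(patterns)) if len(patterns) > 1 else 1.0
    return h / max_h

# Skewed: 9 of 10 streams share the same ordering pattern
skewed = [[1, 2, 3]] * 9 + [[3, 2, 1]]
# Even: all six orderings of three items appear equally often
even = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 1, 2], [3, 2, 1]]
```

Under this measure, randomly generated streams that mostly arrive in one order score near 0, while a generator that covers reorderings uniformly scores near 1.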

Processes ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 1115
Author(s):  
Gilseung Ahn ◽  
Hyungseok Yun ◽  
Sun Hur ◽  
Si-Yeong Lim

Accurate prediction of the remaining useful life (RUL) of equipment, using machine learning (ML) or deep learning (DL) models trained on data collected up to the point of failure, is crucial for maintenance scheduling. Because such data become available only once the equipment fails, collecting enough of them to train a model without overfitting can be challenging. Here, we propose a method of generating time-series data for RUL models to resolve the problems posed by insufficient data. The proposed method converts every training time series into a sequence of alphabetical strings by symbolic aggregate approximation and identifies occurrence patterns in the converted sequences. The method then generates a new sequence and inversely transforms it into a new time series. Experiments with various RUL prediction datasets and ML/DL models verified that the proposed data-generation method helps avoid overfitting in RUL prediction models.
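The symbolic conversion step can be illustrated with a minimal symbolic aggregate approximation (SAX) sketch. The segment count, 4-letter alphabet, and function names below are illustrative assumptions, not the paper's exact configuration.

```python
import statistics

# Gaussian breakpoints that split N(0, 1) into equiprobable regions,
# here for an alphabet of size 4 (quartiles of the standard normal).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax_transform(series, n_segments=8):
    """Minimal SAX sketch: z-normalize, piecewise aggregate approximation
    (PAA), then map each segment mean to a letter."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0
    z = [(v - mu) / sigma for v in series]          # z-normalization
    seg_len = len(z) / n_segments
    word = []
    for k in range(n_segments):
        seg = z[round(k * seg_len):round((k + 1) * seg_len)]
        m = statistics.fmean(seg)                   # PAA segment mean
        word.append(ALPHABET[sum(m > b for b in BREAKPOINTS)])
    return "".join(word)

# A steadily falling sensor signal maps to a non-increasing letter sequence
word = sax_transform([10.0 - 10.0 * i / 63 for i in range(64)])
```

Once each training series is a word like this, occurrence patterns can be mined over the letters, and a generated word can be inverted back to a numeric series by sampling within each letter's value region.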


Author(s):  
Soo-Tai Nam ◽  
Chan-Yong Jin ◽  
Seong-Yoon Shin

Big data refers to large sets of structured or unstructured data that are difficult to collect, store, manage, and analyze with existing database management tools, and to the techniques for extracting value from such data and interpreting the results. Big data has three characteristics: the volume of data (volume), the speed of data generation (velocity), and the variety of information forms (variety). Time-series data are obtained by collecting and recording data generated over the flow of time. If analysis of these time-series data uncovers their characteristics, those features help in understanding and analyzing the series. The concept of distance is the simplest and most intuitive way to deal with similarity between objects, and the most commonly used and widely known distance measure is the Euclidean distance. This study analyzes the similarity of stock price movements using 793,800 closing prices of 1,323 companies in Korea. The Euclidean distances were calculated with Visual Studio and Excel as analysis tools. We selected "000100" as the target domestic company and prepared the data for big data analysis. The analysis shows that the company at the shortest Euclidean distance has code "143860", with a calculated value of 11.147. Based on these results, the limitations of the study and its theoretical implications are discussed.
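The distance computation behind this comparison can be sketched in a few lines. The company codes and prices below are illustrative, not the study's actual data.

```python
import math

def euclidean_distance(series_a, series_b):
    """Euclidean distance between two equal-length price series."""
    if len(series_a) != len(series_b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(series_a, series_b)))

def nearest_company(target, candidates):
    """Return (code, distance) of the candidate series closest to the
    target series; `candidates` maps company codes to price series."""
    return min(
        ((code, euclidean_distance(target, s)) for code, s in candidates.items()),
        key=lambda pair: pair[1],
    )

# Toy example: company "A" tracks the target more closely than "B"
closest = nearest_company([1.0, 2.0, 3.0],
                          {"A": [1.0, 2.0, 4.0], "B": [5.0, 5.0, 5.0]})
```

In the study's setting, the target would be the 600 closing prices of "000100" and the candidates the remaining 1,322 companies' series.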


2019 ◽  
Vol 290 ◽  
pp. 02002
Author(s):  
Crina Narcisa Deac ◽  
Gicu Calin Deac ◽  
Florina Chiscop ◽  
Cicerone Laurentiu Popa

Anomaly detection is a crucial analysis topic in the field of Industry 4.0 data mining, as is estimating the probability that a specific machine will go down due to the failure of a component in the next time interval. In this article, we used time-series data collected from machines, covering both classes: data leading up to machine failures and data from healthy operational periods. We used telemetry data, error logs from still-operational components, and maintenance records comprising historical breakdowns and component replacements to build and compare several different models. The proposed methods were validated by comparing the actual failures in the test data with the predicted component failures.
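One common way to frame "will this machine fail in the next time interval" as a supervised problem is to label each telemetry timestamp by its proximity to a recorded failure. This is a minimal sketch of that framing; the horizon and names are assumptions, not the article's exact setup.

```python
from datetime import datetime, timedelta

def label_failure_window(timestamps, failure_times, horizon_hours=24):
    """Return 1 for each telemetry timestamp that is followed by a
    recorded failure within `horizon_hours`, else 0."""
    horizon = timedelta(hours=horizon_hours)
    return [
        int(any(t <= f <= t + horizon for f in failure_times))
        for t in timestamps
    ]

# Telemetry every 6 hours; one failure recorded at 13:00
telemetry = [datetime(2021, 1, 1, h) for h in (0, 6, 12)]
failures = [datetime(2021, 1, 1, 13)]
labels = label_failure_window(telemetry, failures, horizon_hours=2)
```

A classifier trained on such labels, with features built from telemetry, error logs, and maintenance history, yields exactly the kind of predicted component failures that the article compares against actual failures in the test data.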


Author(s):  
Haji A. Haji ◽  
Kusman Sadik ◽  
Agus Mohamad Soleh

A simulation study is used when real-world data are hard to find or time-consuming to gather; it involves generating data sets from a specific statistical model or by random sampling. Simulating the process is useful for testing theories and understanding the behavior of statistical methods. This study aimed to compare ARIMA and fuzzy time series (FTS) models in order to identify the best model for forecasting time-series data, based on 100 replicates of 100 observations generated from an ARIMA(1,0,1) model. Sixteen scenarios were used, combining four error-variance values (0.5, 1, 3, 5) with four ARMA(1,1) parameter settings. Performance was evaluated with three criteria, mean absolute percentage error (MAPE), root mean squared error (RMSE), and bias, to determine the more appropriate method. The results show the lowest bias for the Chen fuzzy time series model, whose error measures are smaller than those of the other models. The results also indicate that the Chen method is competitive with advanced forecasting techniques in all of the considered situations, providing better forecasting accuracy.
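The three evaluation criteria are standard and can be computed directly from forecasts and actuals; a plain sketch follows (the dictionary keys and the actual-minus-predicted sign convention for bias are illustrative choices).

```python
import math

def forecast_metrics(actual, predicted):
    """Compute MAPE (in percent), RMSE, and mean bias (actual - predicted)
    for a forecast evaluated against actual values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / len(actual)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    bias = sum(errors) / len(errors)
    return {"MAPE": mape, "RMSE": rmse, "Bias": bias}

# Toy example: over- and under-forecast by 10 on two points
m = forecast_metrics([100.0, 200.0], [110.0, 190.0])
```

In a simulation design like the one above, these metrics would be averaged over the 100 replicates within each of the 16 scenarios before comparing models.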


2018 ◽  
Vol 149 ◽  
pp. 68-81 ◽  
Author(s):  
Devesh K. Jha ◽  
Nurali Virani ◽  
Jan Reimann ◽  
Abhishek Srivastav ◽  
Asok Ray

2013 ◽  
Vol 50 (3) ◽  
pp. 415-423 ◽  
Author(s):  
Erica Chenoweth ◽  
Orion A Lewis

Recent studies indicate that strategic nonviolent campaigns have been more successful over time in achieving their political objectives than violent insurgencies. But additional research has been limited by a lack of time-series data on nonviolent and violent campaigns, as well as a lack of more nuanced and detailed data on the attributes of the campaigns. In this article, we introduce the Nonviolent and Violent Campaigns and Outcomes (NAVCO) 2.0 dataset, which compiles annual data on 250 nonviolent and violent mass movements for regime change, anti-occupation, and secession from 1945 to 2006. NAVCO 2.0 also includes features of each campaign, such as participation size and diversity, the behavior of regime elites, repression and its effects on the campaign, support (or lack thereof) from external actors, and progress toward the campaign outcomes. After describing the data generation process and the dataset itself, we demonstrate why studying nonviolent resistance may yield novel insights for conflict scholars by replicating an influential study of civil war onset. This preliminary study reveals strikingly divergent findings regarding the systematic drivers of nonviolent campaign onset. Nonviolent campaign onset may be driven by separate – and in some cases, opposing – processes relative to violent campaigns. This finding underscores the value-added of the dataset, as well as the importance of evaluating methods of conflict within a unified research design.

