Algorithm for preprocessing and unification of time series based on machine learning for data structuring

Author(s):  
Andrey Sergeevich Kopyrin ◽  
Irina Leonidovna Makarova

The subject of the research is the process of collecting and preliminarily preparing data from heterogeneous sources. Economic information is heterogeneous and semi-structured or unstructured in nature. Owing to the heterogeneity of the primary documents, as well as the human factor, the initial statistical data may contain a large amount of noise, as well as records that are very difficult to process automatically. This makes the preprocessing of dynamic input data an important precondition for discovering meaningful patterns and domain knowledge, and makes the research topic relevant. Data preprocessing comprises a series of distinct tasks that have given rise to various algorithms and heuristic methods for solving problems such as merging and cleaning and the identification of variables. In this work, a preprocessing algorithm is formulated that makes it possible to bring together and structure time series information from different sources in a single database. The key modification of the preprocessing method proposed by the authors is a technology for automated data integration. The proposed technology combines methods for constructing fuzzy time series with machine lexical comparison on a thesaurus network, together with a universal database built using the MIVAR concept. The preprocessing algorithm forms a single data model with the ability to transform the periodicity and semantics of the data set and to integrate data arriving from various sources into a single information bank.
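The periodicity-transformation step described above can be illustrated with a minimal sketch: two series sampled at different frequencies (monthly and quarterly) are brought to a common quarterly grid and merged into one table. The column names and values here are illustrative, not from the paper.

```python
# Sketch of periodicity transformation and integration for heterogeneous
# time series: resample the finer series to the coarser grid, then join.
import pandas as pd

monthly = pd.Series(
    [10, 12, 11, 14, 13, 15],
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
    name="sales",
)
quarterly = pd.Series(
    [100, 110],
    index=pd.date_range("2020-01-01", periods=2, freq="QS"),
    name="gdp",
)

# Transform the monthly series to quarterly periodicity (mean aggregation),
# then align both series on the shared time index.
sales_q = monthly.resample("QS").mean()
unified = pd.concat([sales_q, quarterly], axis=1)
print(unified)
```

A real pipeline would also reconcile the semantics of the columns (units, naming via the thesaurus network), but the resample-then-join pattern is the core of bringing mixed-frequency data into one bank.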

2020 ◽  
Vol 224 ◽  
pp. 01017
Author(s):  
A.S. Kopyrin ◽  
E.V. Vidishcheva ◽  
Yu.I. Dreizis

The subject of the study is the process of collecting, preparing, and searching for anomalies in data from heterogeneous sources. Economic information is naturally heterogeneous and semi-structured or unstructured. This makes the preprocessing of dynamic input data an important prerequisite for the detection of significant patterns and knowledge in the subject area, so the topic of the research is relevant. Data preprocessing comprises several distinct problems that have led to the emergence of various algorithms and heuristic methods for solving such preprocessing tasks as merging, cleaning, and identifying variables. In this work, an algorithm for preprocessing and searching for anomalies using an LSTM is formulated, which makes it possible to consolidate time series information from different sources into a single, structured database, and to search for anomalies in an automated mode. A key modification of the preprocessing method proposed by the authors is a technology for automated data integration. The proposed technology combines methods for building fuzzy time series with machine lexical matching on a thesaurus network, as well as a universal database built using the MIVAR concept. The preprocessing algorithm forms a single data model with the possibility of transforming the periodicity and semantics of the data set and integrating data that can come from various sources into a single information bank.
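The anomaly-search idea can be sketched without the neural network itself: the paper uses an LSTM as the one-step predictor, but the surrounding logic, flag points whose prediction residual is far outside the typical residual spread, is the same whatever the predictor. Below, a simple moving average stands in for the LSTM so that the thresholding is visible; the window and threshold factor are illustrative choices.

```python
# Prediction-error anomaly flagging. A moving average stands in for the
# LSTM one-step predictor; points whose residual exceeds k standard
# deviations of all residuals are flagged as anomalies.
import statistics

def flag_anomalies(series, window=3, k=2.0):
    """Return indices whose one-step prediction error exceeds k * sigma."""
    residuals = []
    for i in range(window, len(series)):
        prediction = sum(series[i - window:i]) / window  # stand-in for LSTM output
        residuals.append(series[i] - prediction)
    sigma = statistics.pstdev(residuals)
    return [
        i for i, r in zip(range(window, len(series)), residuals)
        if abs(r) > k * sigma
    ]

data = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]  # one injected spike at index 6
print(flag_anomalies(data))  # → [6]
```

Swapping the moving average for a trained LSTM changes only the `prediction` line; the automated-mode part is the residual thresholding.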


2018 ◽  
pp. 1773-1791 ◽  
Author(s):  
Prateek Pandey ◽  
Shishir Kumar ◽  
Sandeep Shrivastava

In recent years, there has been growing interest in time series forecasting, and a number of time series forecasting methods have been proposed by various researchers. However, a common trend found in these methods is that they all underperform on data sets that exhibit uneven ups and downs (turbulences). In this paper, a new method based on fuzzy time series (henceforth FTS) that forecasts on the basis of turbulences in the data set is proposed. The results show that turbulence-based fuzzy time series forecasting is effective, especially when the available data indicate a high degree of instability. A few benchmark FTS methods are identified from the literature, their limitations and gaps are discussed, and it is observed that the proposed method successfully overcomes their deficiencies to produce better results. To validate the proposed model, a performance comparison with various conventional time series models is also presented.
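The benchmark FTS methods this paper compares against share a classical workflow (in the style of Chen's model): partition the universe of discourse into intervals, fuzzify each observation to its interval, collect fuzzy logical relationships between successive states, and defuzzify forecasts as interval midpoints. A compact sketch of that baseline workflow, with an illustrative series and interval count, is:

```python
# Classical fuzzy-time-series forecasting sketch (Chen-style baseline):
# partition -> fuzzify -> fuzzy logical relationship groups -> defuzzify.

def fts_forecast(series, n_intervals=4):
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_intervals

    def fuzzify(x):
        # Map a crisp value to the index of the interval containing it.
        return min(int((x - lo) / width), n_intervals - 1)

    states = [fuzzify(x) for x in series]
    # Fuzzy logical relationship groups: state -> set of successor states.
    groups = {}
    for a, b in zip(states, states[1:]):
        groups.setdefault(a, set()).add(b)

    def midpoint(i):
        return lo + (i + 0.5) * width

    successors = groups.get(states[-1], {states[-1]})
    # Defuzzify: average of midpoints of all successor intervals.
    return sum(midpoint(i) for i in successors) / len(successors)

prices = [20, 22, 21, 25, 27, 26, 30, 29, 28, 31]
print(fts_forecast(prices))  # → 28.25
```

The turbulence-based method proposed in the paper modifies this baseline; the sketch only shows the shared scaffolding the comparison rests on.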


2020 ◽  
Author(s):  
Prashant Verma ◽  
Mukti Khetan ◽  
Shikha Dwivedi ◽  
Shweta Dixit

Abstract. Purpose: The whole world faces an extraordinary challenge due to COVID-19, caused by the 2019 novel coronavirus (SARS-CoV-2). After taking hundreds of thousands of lives, the virus still holds millions of people in its substantial grasp. It is highly contagious, with a reproduction number R0 as high as 6.5 worldwide and between 1.5 and 2.6 in India. Hence, the total numbers of infections and of deaths will rise day by day until the curve flattens. Under the current circumstances, it becomes inevitable to develop a model that can anticipate future morbidities, recoveries, and deaths. Methods: We developed models based on ARIMA and fuzzy time series methodology for forecasting COVID-19 infections, mortalities, and recoveries in India and, specifically, in Maharashtra, the most affected state in India, following the COVID-19 statistics up to "Lockdown 3.0" (17th May 2020). Results: Both models suggest that there will be an exponential uplift in COVID-19 cases in the near future. We forecast the COVID-19 data set for the next seven days; the forecasted values are in good agreement with the real ones for all six COVID-19 scenarios, for Maharashtra and for India as a whole. Conclusion: The forecasts of the ARIMA and fuzzy time series models will be useful to policymakers in health care systems, so that the system and medical personnel can be prepared to combat the pandemic.
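The role of R0 in the abstract's argument is simple arithmetic: in a fully susceptible population, each generation of infections is roughly R0 times the previous one, which is why cases climb exponentially until the curve flattens. A back-of-the-envelope projection (numbers illustrative, not the paper's forecasts) makes this concrete:

```python
# Crude generation-by-generation case projection, ignoring immunity,
# interventions, and reporting effects; it only illustrates why R0 > 1
# implies exponential growth.

def project_cases(initial, r0, generations):
    cases = [initial]
    for _ in range(generations):
        cases.append(cases[-1] * r0)  # each generation multiplies by R0
    return cases

# 100 initial cases, R0 = 2.0, five serial intervals ahead.
print(project_cases(100, 2.0, 5))  # → [100, 200.0, 400.0, 800.0, 1600.0, 3200.0]
```

The ARIMA and fuzzy time series models in the paper are far richer than this, but both are ultimately fitting and extrapolating a curve driven by this multiplicative mechanism.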


Author(s):  
Haji A. Haji ◽  
Kusman Sadik ◽  
Agus Mohamad Soleh

A simulation study is used when real-world data are hard to find or time-consuming to gather; it involves generating data sets from a specific statistical model or by random sampling. Simulating the process is useful for testing theories and understanding the behavior of statistical methods. This study aimed to compare ARIMA and fuzzy time series (FTS) models in order to identify the best model for forecasting time series data, based on 100 replicates of 100 generated observations from an ARIMA(1,0,1) model. Sixteen scenarios were used, combining 4 error-variance values for data generation (0.5, 1, 3, 5) with 4 ARMA(1,1) parameter settings. Performance was evaluated using three criteria, mean absolute percentage error (MAPE), root mean squared error (RMSE), and bias, to determine the more appropriate method and model performance. The results of the study show the lowest bias for the Chen fuzzy time series model, whose error measures are smaller than those of the other models. The results also show that the Chen method is competitive with advanced forecasting techniques, providing better forecasting accuracy in all of the considered situations.
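The three evaluation criteria used in the comparison are standard, but writing them out removes any ambiguity (note that bias is taken here as the mean error, actual minus forecast; sign conventions vary across texts). The data values below are illustrative:

```python
# The three evaluation metrics of the study, written out explicitly.
import math

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def bias(actual, forecast):
    """Mean error (actual minus forecast); sign conventions vary."""
    return sum(a - f for a, f in zip(actual, forecast)) / len(actual)

actual = [100, 110, 120, 130]
forecast = [98, 112, 118, 133]
print(mape(actual, forecast), rmse(actual, forecast), bias(actual, forecast))
```

In a simulation study these metrics would be averaged over the 100 replicates per scenario before comparing models.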


Author(s):  
Wei Wei ◽  
Xiaojun Wan

Accuracy is one of the basic principles of journalism. However, it is increasingly hard to maintain given the diversity of news media. Some editors of online news use catchy headlines that trick readers into clicking. These headlines are either ambiguous or misleading, degrading the reading experience of the audience. Thus, identifying inaccurate news headlines is a task worth studying. Previous work names these headlines "clickbait" and mainly focuses on features extracted from the headlines themselves, which limits performance because the consistency between headlines and news bodies is underappreciated. In this paper, we redefine the problem and identify ambiguous and misleading headlines separately. We utilize class sequential rules to exploit structural information when detecting ambiguous headlines. For the identification of misleading headlines, we extract features based on the congruence between headlines and bodies. To make use of a large unlabeled data set, we apply a co-training method and gain an increase in performance. The experimental results show the effectiveness of our methods. We then use our classifiers to detect inaccurate headlines crawled from different sources and conduct a data analysis.
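A toy version of one headline-body congruence feature the misleading-headline classifier could use is lexical overlap, measured as cosine similarity over bag-of-words counts. The paper's actual feature set is richer; this sketch (with invented example headlines) only illustrates the idea of comparing a headline against its body rather than inspecting the headline alone:

```python
# Cosine similarity between headline and body bag-of-words vectors as a
# crude congruence signal: misleading headlines tend to score low.
import math
from collections import Counter

def cosine_congruence(headline, body):
    h, b = Counter(headline.lower().split()), Counter(body.lower().split())
    shared = set(h) & set(b)
    dot = sum(h[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in h.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

congruent = cosine_congruence("storm closes airport",
                              "the storm closes the airport today")
misleading = cosine_congruence("you will not believe this",
                               "the storm closes the airport today")
print(congruent, misleading)
```

A classifier would combine several such congruence scores with headline-only features; co-training then lets two feature views label unlabeled examples for each other.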


2016 ◽  
Vol 34 (4) ◽  
pp. 437-449 ◽  
Author(s):  
Costel Munteanu ◽  
Catalin Negrea ◽  
Marius Echim ◽  
Kalevi Mursula

Abstract. In this paper we investigate quantitatively the effect of data gaps for four methods of estimating the amplitude spectrum of a time series: fast Fourier transform (FFT), discrete Fourier transform (DFT), Z transform (ZTR) and the Lomb–Scargle algorithm (LST). We devise two tests: the single-large-gap test, which probes the effect of a single data gap of varying size, and the multiple-small-gaps test, used to study the effect of numerous small gaps of variable size distributed within the time series. The tests are applied to two data sets: a synthetic data set composed of a superposition of four sinusoidal modes, and one component of the magnetic field measured by the Venus Express (VEX) spacecraft in orbit around the planet Venus. For single data gaps, FFT and DFT give an amplitude monotonically decreasing with gap size. However, the shape of their amplitude spectrum remains unmodified even for a large data gap. On the other hand, ZTR and LST preserve the absolute level of amplitude but lead to greatly increased spectral noise for increasing gap size. For multiple small data gaps, DFT, ZTR and LST can, unlike FFT, find the correct amplitude of sinusoidal modes even for a large data gap percentage. However, for in situ data collected in a turbulent plasma environment, these three methods overestimate the high-frequency part of the amplitude spectrum above a threshold depending on the maximum gap size, while FFT slightly underestimates it.
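The single-large-gap effect reported for the FFT can be checked numerically in a few lines: zero-filling a gap that covers a fraction g of a pure sinusoid lowers its estimated Fourier amplitude by roughly the factor (1 - g), while the peak stays at the correct frequency. The signal parameters below are illustrative, not those of the paper's synthetic data set:

```python
# Numeric check of the single-large-gap effect for FFT amplitude estimates:
# a 25% zero-filled gap in a unit-amplitude sinusoid drops the estimated
# amplitude to about 0.75, with the peak frequency unchanged.
import numpy as np

n = 1024
t = np.arange(n)
signal = np.sin(2 * np.pi * 32 * t / n)  # unit-amplitude sinusoid, 32 cycles

def fft_amplitude(x):
    # Single-sided amplitude spectrum; return the height of the main peak.
    spectrum = 2 * np.abs(np.fft.rfft(x)) / len(x)
    return spectrum.max()

gapped = signal.copy()
gapped[400:656] = 0.0  # zero-fill a single gap covering 25% of the record

print(fft_amplitude(signal), fft_amplitude(gapped))  # → 1.0 0.75
```

This is the monotonic amplitude decrease the paper describes; methods like LST avoid it by fitting sinusoids only to the samples that exist, at the cost of extra spectral noise.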


2017 ◽  
Vol 09 (01) ◽  
pp. 1750001 ◽  
Author(s):  
Riswan Efendi ◽  
Mustafa Mat Deris

Fuzzy time series has been applied to data prediction in various sectors, such as education, finance and economics, energy, and traffic accidents, and many models have been proposed to improve forecasting accuracy. However, the adjustment of the interval length and the out-of-sample forecast procedure are still open issues in fuzzy time series forecasting, neither of which has been clearly investigated in previous studies. In this paper, a new adjustment of the interval length and the partition number of the data set is proposed, and the determination of the out-of-sample forecast is also discussed. The yearly oil production (OP) and oil consumption (OC) of Malaysia and Indonesia from 1965 to 2012 are examined to evaluate the performance of the fuzzy time series and probabilistic time series models. The results indicate that the fuzzy time series model outperforms probabilistic models, such as regression time series and exponential smoothing, in terms of forecasting accuracy. This paper thus highlights the effect of the proposed interval length in reducing the forecasting error significantly, as well as the main differences between the fuzzy and probabilistic time series models.
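To see why the interval length matters, consider one data-driven rule from the FTS literature (the average-based idea): take the mean absolute first difference of the series, halve it to get the interval length, and derive the partition count from the data range. The paper's own adjustment rule may differ in detail; the series below is illustrative, not the Malaysian or Indonesian oil data.

```python
# Average-based interval-length selection sketch for fuzzy time series:
# the length tracks the typical year-to-year change, so intervals are
# fine enough to distinguish genuine movements in the data.

def average_based_partition(series):
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    length = (sum(diffs) / len(diffs)) / 2          # half the mean step
    lo, hi = min(series), max(series)
    n_intervals = max(1, round((hi - lo) / length))  # partitions of the range
    return length, n_intervals

production = [620, 650, 641, 683, 702, 695, 731]  # illustrative yearly values
print(average_based_partition(production))
```

Too few intervals blur distinct values into one fuzzy set; too many make the fuzzy logical relationships sparse, so a data-driven length like this is a common compromise.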


2017 ◽  
Vol 6 (4) ◽  
pp. 83-98 ◽  
Author(s):  
Prateek Pandey ◽  
Shishir Kumar ◽  
Sandeep Shrivastava

In recent years, there has been growing interest in time series forecasting, and a number of time series forecasting methods have been proposed by various researchers. However, a common trend found in these methods is that they all underperform on data sets that exhibit uneven ups and downs (turbulences). In this paper, a new method based on fuzzy time series (henceforth FTS) that forecasts on the basis of turbulences in the data set is proposed. The results show that turbulence-based fuzzy time series forecasting is effective, especially when the available data indicate a high degree of instability. A few benchmark FTS methods are identified from the literature, their limitations and gaps are discussed, and it is observed that the proposed method successfully overcomes their deficiencies to produce better results. To validate the proposed model, a performance comparison with various conventional time series models is also presented.


2015 ◽  
Vol 14 (4) ◽  
pp. 165-181 ◽  
Author(s):  
Sarah Dudenhöffer ◽  
Christian Dormann

Abstract. The purpose of this study was to replicate the dimensions of the customer-related social stressors (CSS) concept across service jobs, to investigate their consequences for service providers' well-being, and to examine emotional dissonance as a mediator. Data from 20 studies comprising different service jobs (N = 4,199) were integrated into a single data set and meta-analyzed. Confirmatory factor analyses and exploratory principal component analysis confirmed four CSS scales: disproportionate expectations, verbal aggression, ambiguous expectations, and disliked customers. These CSS scales were associated with burnout and job satisfaction, and most of the effects were partially mediated by emotional dissonance. Further analyses revealed that differences among jobs exist with regard to the factor solution. However, associations between CSS and outcomes are largely invariant across service jobs.

