Processing Exact Results for Sliding Window Joins over Time-Sequence, Streaming Data Using a Disk Archive

Author(s):  
Abhirup Chakraborty ◽  
Ajit Singh


2019 ◽  
Vol 15 (12) ◽  
pp. 155014771989454
Author(s):  
Hao Luo ◽  
Kexin Sun ◽  
Junlu Wang ◽  
Chengfeng Liu ◽  
Linlin Ding ◽  
...  

With the development of streaming data processing technology, real-time event monitoring and querying have become a hot issue in this field. In this article, an investigation based on coal mine disaster events is carried out, and a new anti-aliasing model for abnormal events is proposed, together with a multistage identification method. Coal mine micro-seismic signals are of great importance in investigating the vibration characteristics, attenuation laws, and disaster assessment of coal mine disasters. However, affected by factors such as geological structure and energy losses, the micro-seismic signals of the same kind of disaster may exhibit data drift during time-domain transmission, such as weakened or enhanced signals, which reduces the accuracy of identifying abnormal events (the coal mine disaster events). The current mine disaster monitoring method identifies events with a lag, monitoring a series of sensors with a 10-s data waveform as the monitoring unit. The identification method proposed in this article first takes advantage of the dynamic time warping algorithm, widely applied in the field of audio recognition, to build an anti-aliasing model that identifies whether the perceived data are a disaster signal based on similarity fitting between the data and template waveforms of historical disaster data. Second, since the real-time monitoring data are continuous streaming data, the start point of the disaster waveform must be located before the disaster signal can be identified; this article therefore proposes a strategy based on a variable sliding window to align two waveforms, locating the start points of the perceived disaster wave and the template wave by gradually sliding the perception window, which guarantees the accuracy of the matching. Finally, this article proposes a multistage identification mechanism based on the sliding window matching strategy and the waveform characteristics of coal mine disasters: the early warning level is adjusted according to how much of the disaster signal has been identified, increasing gradually with each successful match of a 1/N-sized piece of the template, and the piecewise aggregate approximation method is used to optimize the calculation process. Experimental results show that the method proposed in this article is more accurate and can be used in real time.
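As a concrete illustration of the two building blocks this abstract names, the sketch below pairs piecewise aggregate approximation (to compress a waveform) with dynamic time warping (to score similarity against a historical disaster template). The function names, segment count, and threshold are illustrative assumptions, not values from the paper:

```python
import numpy as np

def paa(signal, n_segments):
    """Piecewise aggregate approximation: mean of equal-width segments."""
    segments = np.array_split(np.asarray(signal, dtype=float), n_segments)
    return np.array([seg.mean() for seg in segments])

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

def matches_template(window, template, n_segments=32, threshold=5.0):
    """Score a perceived window against a historical disaster template,
    compressing both with PAA first to cut the DTW cost."""
    return dtw_distance(paa(window, n_segments),
                        paa(template, n_segments)) < threshold
```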


Author(s):  
George B. Mertzios ◽  
Hendrik Molter ◽  
Viktor Zamaraev

Graph coloring is one of the most famous computational problems, with applications in a wide range of areas such as planning and scheduling, resource allocation, and pattern matching. So far, coloring problems have mostly been studied on static graphs, which often stands in stark contrast to practice, where data is inherently dynamic and subject to discrete changes over time. A temporal graph is a graph whose edges are assigned a set of integer time labels, indicating at which discrete time steps the edge is active. In this paper we present a natural temporal extension of the classical graph coloring problem. Given a temporal graph and a natural number ∆, we ask for a coloring sequence for each vertex such that (i) in every sliding time window of ∆ consecutive time steps in which an edge is active, this edge is properly colored (i.e. its endpoints are assigned two different colors) at least once during that time window, and (ii) the total number of different colors is minimized. This sliding window temporal coloring problem abstractly captures many realistic graph coloring scenarios in which the underlying network changes over time, such as dynamically assigning communication channels to moving agents. We present a thorough investigation of the computational complexity of this temporal coloring problem. More specifically, we prove strong computational hardness results, complemented by efficient exact and approximation algorithms. Some of our algorithms are linear-time fixed-parameter tractable with respect to appropriate parameters, while others are asymptotically almost optimal under the Exponential Time Hypothesis (ETH).
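As a minimal sketch of condition (i), the checker below verifies a given coloring sequence against a temporal graph, assuming 1-indexed time steps and that "properly colored during the window" refers to a step at which the edge is active. It is a hypothetical helper for fixing the semantics, not code from the paper:

```python
def is_valid_sw_coloring(edges, coloring, delta, lifetime):
    """Check condition (i) of sliding window temporal coloring.

    edges:    dict mapping (u, v) -> set of time steps where the edge is active
    coloring: dict mapping vertex -> list of colors, one per step 1..lifetime
    delta:    window length (the Delta of the problem)
    """
    for (u, v), active in edges.items():
        for start in range(1, lifetime - delta + 2):
            window = set(range(start, start + delta))
            steps = active & window
            if not steps:
                continue  # the edge is not active in this window
            # The edge must be properly colored at least once in the window.
            if not any(coloring[u][t - 1] != coloring[v][t - 1] for t in steps):
                return False
    return True

# Hypothetical toy instance: one edge active at steps 1 and 3, delta = 2.
edges = {("a", "b"): {1, 3}}
coloring = {"a": [1, 1, 1], "b": [2, 1, 2]}
print(is_valid_sw_coloring(edges, coloring, delta=2, lifetime=3))  # True
```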


Algorithms ◽  
2018 ◽  
Vol 11 (10) ◽  
pp. 158 ◽  
Author(s):  
Sathya Madhusudhanan ◽  
Suresh Jaganathan ◽  
Jayashree L S

Unstructured data are irregular information with no predefined data model. Streaming data, which arrive constantly over time, are unstructured, and classifying them is a tedious task as they lack class labels and accumulate over time. As the data keep growing, it becomes difficult to train and create a model from scratch each time. Incremental learning, a self-adaptive approach, reuses the previously learned model, then learns and accommodates new information from newly arrived data to produce an updated model, which avoids retraining. The incrementally learned knowledge helps to classify the unstructured data. In this paper, we propose CUIL (Classification of Unstructured data using Incremental Learning), a framework which clusters the metadata, assigns a label to each cluster, and then incrementally creates a model using the Extreme Learning Machine (ELM), a feed-forward neural network, for each batch of data that arrives. The proposed framework trains the batches separately, significantly reducing memory usage and training time, and is tested with metadata created for standard image datasets such as MNIST, STL-10, CIFAR-10, Caltech101, and Caltech256. The tabulated results show that the proposed work achieves greater accuracy and efficiency.
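For readers unfamiliar with ELM, the sketch below shows the basic batch version such a framework builds on: hidden weights are random and fixed, and only the output weights are solved by least squares. CUIL's incremental per-batch update of these weights and its metadata clustering are not shown; the class and parameter names are illustrative:

```python
import numpy as np

class SimpleELM:
    """Basic batch extreme learning machine with one hidden layer.

    Hidden weights are random and never trained; only the output weights
    (beta) are solved, via least squares, which makes training very fast.
    For classification, Y should be one-hot encoded; predict() then returns
    scores to be argmax-ed into class labels.
    """
    def __init__(self, n_hidden=128, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, Y):
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = self._hidden(X)                  # hidden-layer activations
        self.beta = np.linalg.pinv(H) @ Y    # least-squares output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta
```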


2020 ◽  
Vol 34 (01) ◽  
pp. 370-377
Author(s):  
Lu Cheng ◽  
Jundong Li ◽  
K. Selcuk Candan ◽  
Huan Liu

Social media has become an indispensable tool in the face of natural disasters due to its broad appeal and ability to quickly disseminate information. For instance, Twitter is an important source for disaster responders to search for (1) topics that have been identified as being of particular interest over time, i.e., common topics such as “disaster rescue”; (2) newly emerging themes of disaster-related discussion that are quickly gathering in social media streams (Saha and Sindhwani 2012), i.e., distinct topics such as “the latest tsunami destruction”. To understand the status quo and allocate limited resources to the most urgent areas, emergency managers need to quickly sift through relevant topics generated over time and investigate their commonness and distinctiveness. A major obstacle to the effective usage of social media, however, is its massive amount of noisy and undesired data. Hence, a naive method, such as set intersection/difference to find common/distinct topics, is often not practical. To address this challenge, this paper studies a new topic tracking problem that seeks to effectively identify the common and distinct topics with social streaming data. The problem is important as it presents a promising new way to efficiently search for accurate information during emergency response. This is achieved by an online Nonnegative Matrix Factorization (NMF) scheme that conducts a faster update of latent factors, and a joint NMF technique that seeks a balance between the reconstruction error of topic identification and the losses induced by discovering common and distinct topics. Extensive experimental results on real-world datasets collected during Hurricanes Harvey and Florence reveal the effectiveness of our framework.
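The paper's scheme is an online, joint NMF with extra terms for common and distinct topics; the sketch below shows only the standard multiplicative-update NMF that such schemes extend, with illustrative parameter defaults:

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Standard multiplicative-update NMF: V (terms x docs) ~= W @ H.

    Columns of W are topics; columns of H give per-document topic weights.
    The updates monotonically decrease the Frobenius reconstruction error.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```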


2012 ◽  
Vol 256-259 ◽  
pp. 2910-2913
Author(s):  
Jun Tan

Online mining of frequent closed itemsets over streaming data is one of the most important issues in mining data streams. In this paper, we propose a novel sliding window based algorithm. The algorithm exploits lattice properties to limit the search to frequent closed itemsets that share at least one item with the new transaction. Experimental results on synthetic datasets show that the proposed algorithm is both time and space efficient.
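To fix the semantics, here is a deliberately brute-force sketch of frequent closed itemset mining over a sliding window; the paper's contribution is precisely to avoid this exponential enumeration by exploiting lattice properties. All names, data, and thresholds are illustrative:

```python
from collections import deque
from itertools import combinations

def frequent_closed_itemsets(window, min_support):
    """Brute-force frequent closed itemsets over the current window.

    An itemset is closed if no proper superset has the same support.
    Exponential in the number of items; real algorithms prune the lattice.
    """
    items = sorted({i for t in window for i in t})
    support = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = sum(1 for t in window if set(cand) <= t)
            if s >= min_support:
                support[frozenset(cand)] = s
    return {x: s for x, s in support.items()
            if not any(x < y and s == sy for y, sy in support.items())}

# Sliding the window: the deque evicts the oldest transaction automatically.
window = deque(maxlen=4)
for txn in [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "c"}, {"a", "b"}]:
    window.append(txn)
    print(frequent_closed_itemsets(window, min_support=2))
```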


2020 ◽  
Vol 76 (10) ◽  
pp. 7619-7634 ◽  
Author(s):  
Wen Xiao ◽  
Juan Hu

Abstract Finding frequent itemsets in continuous streaming data is an important data mining task, widely used in network monitoring, Internet of Things data analysis, and so on. In the era of big data, distributed frequent itemset mining algorithms are needed to meet the demands of massive streaming data processing. Apache Spark is a unified analytics engine for massive data processing that has been successfully used in many data mining fields. In this paper, we propose SWEclat, a distributed algorithm for mining frequent itemsets over massive streaming data. The algorithm uses a sliding window to process the streaming data and a vertical data structure to store the dataset within the sliding window. It is implemented on Apache Spark, using Spark RDDs to store the streaming data and the dataset in vertical format, so that these RDDs can be divided into partitions for distributed processing. Experimental results show that the SWEclat algorithm achieves good acceleration, parallel scalability, and load balancing.
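Below is a minimal RDD sketch of the vertical data format this abstract describes, mapping each item to its set of transaction ids so that the support of a larger itemset can be computed by intersecting tidsets rather than rescanning the data. SWEclat itself adds sliding-window maintenance and distributed candidate generation, which are not shown; this assumes a local Spark installation, and the names and data are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext(appName="vertical-eclat-sketch")

# Transactions currently inside the sliding window, as (tid, items) pairs.
window = sc.parallelize([
    (1, ["a", "b"]), (2, ["a", "b", "c"]), (3, ["b", "c"]), (4, ["a", "c"]),
])

# Vertical data format: item -> set of transaction ids (a "tidset").
vertical = (window
            .flatMap(lambda kv: [(item, {kv[0]}) for item in kv[1]])
            .reduceByKey(lambda s1, s2: s1 | s2))

min_support = 2
frequent_items = vertical.filter(lambda kv: len(kv[1]) >= min_support)
print(frequent_items.collect())  # e.g. [('a', {1, 2, 4}), ('b', {1, 2, 3}), ...]

# In Eclat-style mining, support({x, y}) = |tidset(x) & tidset(y)|, so
# larger itemsets are counted by intersecting tidsets within partitions.
```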


Author(s):  
Agnes Tegen ◽  
Paul Davidsson ◽  
Jan A. Persson

Abstract The advances in the Internet of Things have led to an increasing number of devices generating and streaming data. These devices can be useful data sources for activity recognition using machine learning. However, the set of available sensors may vary over time, e.g. due to sensor mobility and technical failures. Since the machine learning model takes the sensor data streams as input, it must be able to handle a varying number of input variables, i.e. a feature space that may change over time. Moreover, the labelled data necessary for training is often costly to acquire. In active learning, the model is given a budget for requesting labels from an oracle and aims to maximize accuracy through careful selection of which data instances to label. It is generally assumed that the oracle's only role is to respond to queries and that it will always do so. In many real-world scenarios, however, the oracle is a human user, and these assumptions are simplifications that might not give a proper depiction of the setting. In this work we investigate different interactive machine learning strategies, of which active learning is one, exploring the effects of an oracle that can be more proactive and the factors that might influence a user to provide or withhold labels. We implement five interactive machine learning strategies, as well as hybrid versions of them, and evaluate them on two datasets. The results show that a more proactive user can improve the performance, especially when the user is influenced by the accuracy of earlier predictions. The experiments also highlight challenges related to evaluating performance when the set of classes is changing over time.
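For context, the sketch below shows the standard machine-initiated active learning baseline (uncertainty sampling under a label budget) that the paper's more proactive, user-initiated strategies depart from; the synthetic data and helper name are illustrative, and the label array stands in for an oracle that always answers, which is exactly the assumption the paper questions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_pool, budget):
    """Query the pool instances the model is least certain about
    (smallest margin between its top two class probabilities)."""
    proba = np.sort(model.predict_proba(X_pool), axis=1)
    margin = proba[:, -1] - proba[:, -2]
    return np.argsort(margin)[:budget]

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X[:20], y[:20])          # seed labels
idx = uncertainty_sampling(model, X[20:], budget=10) + 20  # global indices
model.fit(np.vstack([X[:20], X[idx]]), np.concatenate([y[:20], y[idx]]))
```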


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5829 ◽  
Author(s):  
Jen-Wei Huang ◽  
Meng-Xun Zhong ◽  
Bijay Prasad Jaysawal

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent advances in outlier detection based on the density-based local outlier factor (LOF) algorithm do not consider variations in data that change over time; for example, a new cluster of data points may appear in the data stream over time. Therefore, we present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF), to overcome this issue. In addition, we have developed a means of estimating the LOF score, termed "approximate LOF," based on historical information following the removal of outdated data. The results of experiments demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar performance in terms of execution time. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.
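As a point of reference, the sketch below is a naive windowed baseline that re-fits a standard LOF model on each full window; TADILOF instead updates scores incrementally and approximates the LOF of points whose neighbours have expired. The window size, neighbour count, and slide step are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def windowed_lof(stream, window_size=200, n_neighbors=20):
    """Naive baseline: re-fit LOF from scratch on each full window.

    TADILOF instead updates LOF scores incrementally and approximates the
    scores of points whose neighbours have expired from the window.
    """
    window = []
    for point in stream:
        window.append(point)
        if len(window) == window_size:
            X = np.array(window)
            labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(X)
            yield X[labels == -1]                # -1 marks detected outliers
            window = window[window_size // 2:]   # slide forward by half a window

# Usage sketch: for outliers in windowed_lof(sensor_readings): handle(outliers)
```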

