Hierarchical Clustering for Real-Time Stream Data with Noise

Author(s):  
Philipp Kranen ◽  
Felix Reidl ◽  
Fernando Sanchez Villaamil ◽  
Thomas Seidl


Algorithms ◽  
2019 ◽  
Vol 12 (2) ◽  
pp. 37 ◽  
Author(s):  
Zhigang Hu ◽  
Hui Kang ◽  
Meiguang Zheng

A distributed data stream processing system must handle real-time, variable, and bursty streaming data loads. Elastic resource allocation is therefore a fundamental and challenging problem: a fixed allocation strategy results either in wasted resources or in reduced QoS (quality of service). Spark Streaming is an emerging system that processes real-time stream analytics using a micro-batch approach. In this paper, we first propose an improved SVR (support vector regression)-based stream data load prediction scheme. We then design a Spark-based maximum sustainable throughput of time window (MSTW) performance model to find the optimal number of virtual machines. Finally, we present TWRES (time window resource elasticity scaling algorithm), a resource scaling algorithm that combines the MSTW constraint with streaming data load prediction. The evaluation results show that TWRES improves resource utilization and mitigates SLA (service level agreement) violations.
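
As a rough, self-contained illustration of how the two pieces could fit together, the sketch below trains a support vector regressor (scikit-learn's SVR) on a sliding window of past load values and then sizes the cluster by dividing the predicted load by an assumed per-VM sustainable throughput. The window length, SVR parameters, and MSTW_PER_VM constant are hypothetical and not taken from the paper.

```python
# Illustrative sketch (not the paper's implementation): predict the next
# window's streaming load with SVR, then size the cluster against an
# assumed maximum sustainable throughput per VM.
import numpy as np
from sklearn.svm import SVR

WINDOW = 5                      # number of past time windows used as features (assumption)
MSTW_PER_VM = 10_000            # hypothetical records/s one VM can sustain

def make_training_set(load_history):
    """Turn a 1-D load series into (features, target) pairs of sliding windows."""
    X, y = [], []
    for i in range(len(load_history) - WINDOW):
        X.append(load_history[i:i + WINDOW])
        y.append(load_history[i + WINDOW])
    return np.array(X), np.array(y)

def predict_next_load(load_history):
    X, y = make_training_set(load_history)
    model = SVR(kernel="rbf", C=100.0, epsilon=0.1)   # illustrative parameters
    model.fit(X, y)
    recent = np.array(load_history[-WINDOW:]).reshape(1, -1)
    return float(model.predict(recent)[0])

def required_vms(predicted_load):
    """Smallest VM count whose aggregate sustainable throughput covers the load."""
    return max(1, int(np.ceil(predicted_load / MSTW_PER_VM)))

history = [8200, 9100, 10400, 12000, 11500, 13200, 14800, 16100, 15400, 17000]
load = predict_next_load(history)
print(f"predicted load: {load:.0f} rec/s -> VMs: {required_vms(load)}")
```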


Entropy ◽  
2021 ◽  
Vol 23 (7) ◽  
pp. 859
Author(s):  
Abdulaziz O. AlQabbany ◽  
Aqil M. Azmi

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift, a change in the data’s underlying distribution, is a significant issue when learning from data streams: it requires learners to adapt to dynamic changes. Random forest is an ensemble approach widely used in classical, non-streaming machine learning applications, while Adaptive Random Forest (ARF) is a stream learning algorithm that has shown promising results in terms of accuracy and its ability to deal with various types of drift. Because instances arrive continuously, the binomial distribution used for resampling can be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms’ efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects of online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. The results indicate that our proposed enhancement yields considerable improvement in most situations.
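
The Poisson-based resampling the study tunes is easy to picture in isolation: in online bagging (and in ARF), each incoming instance is shown to each ensemble member k times, with k drawn from Poisson(λ). The sketch below illustrates only that weighting loop with a toy base learner; the TinyLearner class, the synthetic stream, and λ = 1.0 are placeholders, and the paper's tuned λ values and the ρ measure are not reproduced here.

```python
# Minimal sketch of Poisson(lambda) resampling for an online ensemble.
# lambda = 1 mimics bootstrap sampling; larger lambda shows instances more often.
import numpy as np

rng = np.random.default_rng(42)

class TinyLearner:
    """Stand-in base learner; a real ARF member is a Hoeffding tree with drift detection."""
    def __init__(self):
        self.counts = {}
    def learn_one(self, x, y, weight=1):
        self.counts[y] = self.counts.get(y, 0) + weight
    def predict_one(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

def train_ensemble(stream, n_models=10, lam=1.0):
    ensemble = [TinyLearner() for _ in range(n_models)]
    for x, y in stream:
        for model in ensemble:
            k = rng.poisson(lam)          # how many times this member "sees" the instance
            if k > 0:
                model.learn_one(x, y, weight=k)
    return ensemble

stream = [({"f": i % 3}, i % 2) for i in range(100)]   # toy labeled stream
ensemble = train_ensemble(stream, lam=1.0)
```

Increasing λ makes each member train on more (weighted) copies of every instance, which is exactly the accuracy-versus-execution-time trade-off that ρ is meant to capture.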


Sensors ◽  
2018 ◽  
Vol 18 (9) ◽  
pp. 3084 ◽  
Author(s):  
Kyoungsoo Bok ◽  
Daeyun Kim ◽  
Jaesoo Yoo

As large amounts of stream data are generated by sensors in Internet of Things environments, studies on complex event processing have been conducted to detect the information required by users or specific applications in real time. A complex event is made by combining primitive events through a number of operators. However, existing complex event-processing methods are slow because they do not consider the similarity and redundancy of operators. In this paper, we propose a new complex event-processing method that accounts for similar and redundant operations on real-time sensor stream data. In the proposed method, similar operations on common events are converted into a virtual operator, and redundant operations on the same events are merged into a single operator. The event query tree for complex event detection is then reconstructed using the converted operators. This reduces the cost of comparing and inspecting similar and redundant operations, thereby decreasing the overall processing cost. To demonstrate the superiority of the proposed method, its performance is evaluated against existing methods.
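
A minimal way to picture the operator sharing is common-subexpression elimination over the query trees: every (operator, inputs) combination is canonicalized, so redundant operators over the same events collapse into one shared node that is evaluated once. The OperatorRegistry below is an illustrative stand-in with hypothetical names, not the paper's virtual-operator construction.

```python
# Illustrative sketch: share one node per distinct (operator, inputs) so that
# operators repeated across several event queries are evaluated a single time.
class OperatorRegistry:
    def __init__(self):
        self.shared = {}                        # canonical key -> evaluation count

    def node(self, op, *inputs):
        if op in ("AND", "OR"):                 # commutative: normalize operand order
            inputs = tuple(sorted(inputs, key=repr))
        key = (op, tuple(inputs))
        self.shared.setdefault(key, 0)
        return key                              # the key itself acts as the shared node

    def evaluate(self, key):
        self.shared[key] += 1                   # in a real engine: run the operator once
        return f"result of {key}"

registry = OperatorRegistry()

# Two user queries that both contain SEQ(TempHigh, SmokeDetected):
seq = registry.node("SEQ", "TempHigh", "SmokeDetected")
q1 = registry.node("AND", seq, "WindowOpen")
q2 = registry.node("OR", seq, "ManualAlarm")

registry.evaluate(seq)                          # evaluated once, reused by q1 and q2
print(len(registry.shared), "shared nodes")     # 3 nodes instead of 4 separate operators
```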


Computers ◽  
2021 ◽  
Vol 10 (10) ◽  
pp. 127
Author(s):  
Besmir Sejdiu ◽  
Florije Ismaili ◽  
Lule Ahmedi

Sensors and other Internet of Things (IoT) technologies are increasingly finding application in various fields, such as air quality monitoring, weather alert monitoring, water quality monitoring, healthcare monitoring, etc. IoT sensors continuously generate large volumes of observed stream data; therefore, processing them requires a special approach. Extracting the contextual information essential for situational knowledge from sensor stream data is very difficult, especially when these data must be processed and interpreted in real time. This paper focuses on processing and interpreting sensor stream data in real time by integrating different semantic annotations. In this context, a system named IoT Semantic Annotations System (IoTSAS) is developed. Furthermore, the performance of IoTSAS is evaluated in the air quality monitoring and weather alert monitoring IoT domains, by extending the Open Geospatial Consortium (OGC) standards and the Sensor Observation Service (SOS) standard, respectively. The developed system provides citizens with real-time information about the health implications of air pollution and of weather conditions such as blizzards and flurries.
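
To make the idea of semantic annotation concrete, the sketch below attaches a health-implication annotation to a raw PM2.5 observation before it is delivered to a citizen-facing consumer. The property names, category bands, and advice strings are illustrative placeholders, not the OGC/SOS vocabularies or the annotations defined in IoTSAS.

```python
# Illustrative sketch: enrich a raw air-quality observation with situational
# context (health category and advice) before delivery. Bands are examples only.
PM25_BANDS = [                       # (upper bound in ug/m3, label, advice)
    (12.0,  "Good",      "Air quality is satisfactory."),
    (35.4,  "Moderate",  "Unusually sensitive people should consider reducing exertion."),
    (55.4,  "Unhealthy for sensitive groups", "Sensitive groups should limit outdoor exertion."),
    (float("inf"), "Unhealthy", "Everyone should reduce prolonged outdoor exertion."),
]

def annotate_observation(obs):
    """Attach a health-implication annotation to a PM2.5 observation dict."""
    value = obs["result"]
    for upper, label, advice in PM25_BANDS:
        if value <= upper:
            return {**obs, "annotation": {"category": label, "advice": advice}}

raw = {"sensor": "station-42", "observedProperty": "PM2.5",
       "result": 41.7, "unit": "ug/m3", "time": "2021-06-01T10:00:00Z"}
print(annotate_observation(raw)["annotation"]["category"])
```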


Author(s):  
M. Asif Naeem ◽  
Gillian Dobbie ◽  
Gerald Weber

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation, where a stream of updates, huge in volume and potentially infinite, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to join stream data with disk-based master data using limited memory, making it a candidate for a resource-aware system setup. The problem that the authors consider in this chapter is that MESHJOIN is not very selective. In particular, the performance of the algorithm is always inversely proportional to the size of the master data table. As a consequence, the resource consumption is in some scenarios suboptimal. The authors present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN and better in realistic scenarios, particularly when parts of the master data are accessed with different frequencies. In order to quantify the performance differences, the authors compare both algorithms on a synthetic dataset with a known skewed distribution, as well as on TPC-H and real-life datasets.
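
The intuition behind keeping the frequently used part of the master data in memory can be sketched independently of the full algorithm: stream tuples whose join key is cached are joined immediately, and only cache misses fall back to the disk-based phase. The FrequencyCache below uses a simple LRU policy and hypothetical names; CACHEJOIN's actual cache management and the MESHJOIN-style disk phase are not reproduced.

```python
# Simplified sketch of a cache-assisted stream/master-data join under a skewed
# key distribution. Cache hits join in memory; misses fall back to disk lookup.
from collections import OrderedDict

class FrequencyCache:
    """Tiny LRU cache standing in for the frequently used part of the master data."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()

    def lookup(self, key):
        if key in self.rows:
            self.rows.move_to_end(key)          # mark as recently used
            return self.rows[key]
        return None

    def insert(self, key, row):
        self.rows[key] = row
        self.rows.move_to_end(key)
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)       # evict least recently used

def join_stream(stream, cache, disk_lookup):
    for tup in stream:
        row = cache.lookup(tup["key"])
        if row is None:                          # miss: stands in for the disk-based phase
            row = disk_lookup(tup["key"])
            cache.insert(tup["key"], row)
        yield {**tup, **row}

master = {k: {"customer": f"cust-{k}"} for k in range(1000)}   # toy master data
cache = FrequencyCache(capacity=100)
stream = [{"key": i % 50, "amount": i} for i in range(500)]    # skewed toward few keys
joined = list(join_stream(stream, cache, master.__getitem__))
```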


2020 ◽  
pp. 1260-1284
Author(s):  
Laura Belli ◽  
Simone Cirani ◽  
Luca Davoli ◽  
Gianluigi Ferrari ◽  
Lorenzo Melegari ◽  
...  

The Internet of Things (IoT) is expected to interconnect billions (around 50 billion by 2020) of heterogeneous sensor/actuator-equipped devices denoted as “Smart Objects” (SOs), characterized by constrained resources in terms of memory, processing, and communication reliability. Several IoT applications have real-time and low-latency requirements and must rely on architectures specifically designed to manage gigantic streams of information (in terms of number of data sources and transmission data rate). We refer to “Big Stream” as the paradigm that best fits this IoT scenario, in contrast to the traditional “Big Data” concept, which does not consider real-time constraints. Moreover, there are many security concerns related to IoT devices and to the Cloud. In this paper, we analyze security aspects in a novel Cloud architecture for Big Stream applications, which efficiently handles Big Stream data through a Graph-based platform and delivers processed data to consumers with low latency. We detail each module defined in the system architecture, describing the refinements required to make the platform able to secure large data streams. An experimental evaluation is also conducted to assess the performance of the proposed architecture when the security mechanisms are integrated.
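
As a loose illustration of a graph-based push pipeline with a per-hop integrity check, the sketch below forwards data units from node to node only when an HMAC tag verifies. The GraphNode API, the shared secret, and the tagging scheme are placeholders and do not correspond to the security mechanisms evaluated in the paper.

```python
# Illustrative sketch: nodes of a processing graph push data to downstream
# listeners, verifying a keyed hash before processing and forwarding.
import hmac, hashlib

SECRET = b"shared-demo-secret"                  # placeholder key, not a real scheme

def sign(payload: bytes) -> str:
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

class GraphNode:
    def __init__(self, name, transform):
        self.name = name
        self.transform = transform
        self.listeners = []                     # downstream nodes in the graph

    def subscribe(self, node):
        self.listeners.append(node)

    def push(self, payload: bytes, tag: str):
        if not hmac.compare_digest(tag, sign(payload)):
            return                              # drop unauthenticated data units
        out = self.transform(payload)
        for node in self.listeners:             # forward with a fresh tag
            node.push(out, sign(out))

source = GraphNode("normalize", lambda b: b.strip().lower())
sink = GraphNode("deliver", lambda b: print("consumer got:", b))
source.subscribe(sink)

msg = b"  Temperature=21C  "
source.push(msg, sign(msg))
```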

