DEN-DIS: “May Get Life in Future” - Hybridized Data Stream Clustering Framework in Market Research Arena

Data streams pose several computational challenges because massive volumes of data arrive at a very fast rate. They are gaining the attention of today’s research community for their utility in almost all fields. Organizing the data into groups, in turn, enables researchers to derive useful and valuable information and conclusions from the categories that are discovered. Clustering makes this grouping easier and plays an important role in exploratory data analysis. This paper focuses on the amalgamation of two important algorithms: density-based clustering, used to group the data, and the dissimilarity matrix algorithm, used to find outliers among the data. Before the data are fed in, the algorithm filters out sparse data, and a continuous monitoring system performs frequent outlier and inlier checks on the live stream data using a buffer timer. This approach offers an optimistic solution for recognizing outlier data that may later revert to inliers based on certain criteria. The DenDis approach opens the way to treating every data point as one that “May Get Life in Future”.

2015 ◽  
Vol 77 (18) ◽  
Author(s):  
Maryam Mousavi ◽  
Azuraliza Abu Bakar

In recent years, clustering methods have attracted growing attention for analysing and monitoring data streams. Density-based techniques are a notable category of clustering techniques that can detect arbitrarily shaped clusters as well as noise. However, finding clusters whose local densities vary is a difficult task. To handle this problem, this paper proposes a new density-based clustering algorithm for data streams. The algorithm improves the offline phase of density-based clustering based on the MinPts parameter. The experimental results show that the proposed technique can improve clustering quality in data streams with different densities.
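To make the role of MinPts concrete, here is a minimal sketch of the standard density-based core-point test, not the authors' algorithm. The radius `eps`, the helper names, and the sample points are illustrative assumptions. It suggests why a single global setting struggles when local densities vary.

```python
import math

def neighbors(points, i, eps):
    """Indices of points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def core_points(points, eps, min_pts):
    """Indices of points whose eps-neighborhood holds at least min_pts points."""
    return [i for i in range(len(points))
            if len(neighbors(points, i, eps)) >= min_pts]

# Hypothetical data: one tight group and one loose group.
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
sparse = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
points = dense + sparse

# A strict setting finds core points only in the tight group.
strict = core_points(points, eps=0.2, min_pts=4)
# A looser setting also picks up the centre of the loose group.
loose = core_points(points, eps=1.1, min_pts=3)
print(strict, loose)
```

With the strict setting the loose group is treated entirely as noise, while the looser setting recovers only part of it. Handling such local density differences is the difficulty the abstract describes.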


2020 ◽  
Vol 11 (2) ◽  
pp. 19-36
Author(s):  
Umesh Kokate ◽  
Arviand V. Deshpande ◽  
Parikshit N. Mahalle

Evolution of data in the data stream environment generates patterns at different time instances. Cluster formation changes over time because the behaviour and membership of clusters change. Data stream clustering (DSC) allows us to investigate changes in group behaviour. These changes in the behaviour of group members over time lead to the formation of new clusters and may make old clusters extinct. Extinct clusters may also recur over time. The problem is to identify and record these change patterns in evolving data streams. The knowledge obtained from these change patterns is then used for trend analysis over evolving data streams. To address this flexible clustering requirement, a density-based clustering method is proposed to dynamically cluster evolving data streams. The decay factor identifies the formation of new clusters and the diminishing of older clusters as data points arrive. This indicates trends in evolving data streams.
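The abstract does not give the form of the decay factor. As a hedged illustration, the sketch below assumes the exponential fading function 2^(-λ·Δt) that is common in density-based stream clustering. The class name, `decay_rate`, and the timestamps are hypothetical.

```python
class DecayingCluster:
    """Tracks a cluster weight that fades unless new points keep arriving."""

    def __init__(self, decay_rate):
        self.decay_rate = decay_rate  # assumed lambda in 2 ** (-lambda * dt)
        self.weight = 0.0
        self.last_update = 0.0

    def fade(self, now):
        """Apply the decay accumulated since the last update."""
        elapsed = now - self.last_update
        self.weight *= 2.0 ** (-self.decay_rate * elapsed)
        self.last_update = now

    def absorb(self, now):
        """Fade the old weight, then count the newly arrived point."""
        self.fade(now)
        self.weight += 1.0

cluster = DecayingCluster(decay_rate=1.0)
for t in (0.0, 1.0, 2.0):      # points arrive at regular intervals
    cluster.absorb(t)
active_weight = cluster.weight  # 0.75 of faded history plus the new point

cluster.fade(5.0)               # no arrivals for three time units
idle_weight = cluster.weight
print(active_weight, idle_weight)
```

A cluster whose weight falls below some threshold could be marked as diminishing, and one whose weight rises above it as newly formed. The threshold itself is another assumption that the abstract does not specify.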


2018 ◽  
Vol 2018 ◽  
pp. 1-10
Author(s):  
Li Li ◽  
Fenghua Li ◽  
Guozhen Shi ◽  
Kui Geng

In view of the demand for high-concurrency, massive data encryption and decryption services in the security field, this paper proposes a dual-channel pipeline parallel data processing model (DPP) based on the characteristics of cryptographic operations, and realizes cryptographic operations on cross-data streams with different service requirements in a multiuser environment. Cryptographic operation requirements are encapsulated in job packages, and the input data flow is divided by the dual-channel mechanism and parallel scheduling of job packages. This ensures synchronization between the processing of dependent job packages and parallel packages, and hides the processing of independent job packages within the processing of dependent job packages. Prototyping experiments show that this model achieves correct and rapid processing of multiservice cross-data streams. Increasing the pipeline depth and improving the processing performance of each pipeline stage are the keys to improving system performance.


2015 ◽  
Vol 22 (3) ◽  
pp. 99-104 ◽  
Author(s):  
Henryk Krawczyk ◽  
Michał Nykiel ◽  
Jerzy Proficz

The recently deployed supercomputer Tryton, located in the Academic Computer Center of Gdansk University of Technology, provides powerful means for massively parallel processing. Moreover, the Center's status as one of the main nodes in the PIONIER network enables fast and reliable transfer of data produced by miscellaneous devices scattered across the whole country. Typical examples of such data are streams containing radio-telescope and satellite observations. Their analysis, especially under real-time constraints, can be challenging and requires dedicated software components. We propose a solution for such parallel analysis using the supercomputer, supervised by the KASKADA platform, which, in conjunction with immersive 3D visualization techniques, can be used to solve problems such as pulsar detection and chronometry or oil-spill simulation on the sea surface.


2012 ◽  
Vol 184 (1) ◽  
pp. 196-214 ◽  
Author(s):  
Xiaofeng Ding ◽  
Xiang Lian ◽  
Lei Chen ◽  
Hai Jin

Entropy ◽  
2021 ◽  
Vol 23 (7) ◽  
pp. 859
Author(s):  
Abdulaziz O. AlQabbany ◽  
Aqil M. Azmi

We are living in the age of big data, the majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift, a change in the data’s underlying distribution, is a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. The Adaptive Random Forest (ARF), in turn, is a stream learning algorithm that has shown promising results in terms of its accuracy and ability to deal with various types of drift. The incoming instances’ continuity allows their binomial distribution to be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms’ efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects of online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed method of enhancement exhibited considerable improvement in most of the situations.
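The Poisson-based resampling mentioned above can be sketched as follows. This is an illustration of online bagging in general, not the ARF implementation or the authors' tuning procedure. The function names, the seed, and the sample sizes are assumptions. Each incoming instance is shown to a base learner k times, with k drawn from Poisson(λ).

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def replication_counts(n_instances, lam, seed=0):
    """How many times one base learner would train on each streamed instance."""
    rng = random.Random(seed)
    return [poisson_sample(lam, rng) for _ in range(n_instances)]

# lambda = 1 mimics bootstrap sampling: many instances are skipped (k = 0).
counts_1 = replication_counts(20000, lam=1.0)
# A larger lambda trains on nearly every instance, several times over.
counts_6 = replication_counts(20000, lam=6.0)

mean_1 = sum(counts_1) / len(counts_1)
mean_6 = sum(counts_6) / len(counts_6)
skipped_1 = counts_1.count(0) / len(counts_1)
print(mean_1, mean_6, skipped_1)
```

Raising λ increases the training work per instance, which is the accuracy versus execution-time trade-off that the measure ρ is meant to capture.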


Author(s):  
Diêgo Lima Crispim ◽  
Lindemberg Lima Fernandes ◽  
Roberta Luiza de Oliveira Albuquerque

Indicators are important tools to guide and assist decision-makers. They also help in understanding the situation of a given place and in monitoring its development. This study aimed to analyze the behavior of the municipalities of Marajó-PA through indicators covering social, economic, housing, and sanitation dimensions, using a multivariate statistical technique to group the municipalities into a small number of homogeneous groups. To choose the indicators, we compiled a checklist from national, regional, and local academic papers dealing with sustainability. The indicators were then standardized to account for their different units and scales of measurement, so that these do not influence the result and the indicators carry similar weights in the calculation of the similarity coefficient. The measure of dissimilarity used was the Euclidean distance, and the Ward and k-means methods were applied to form the groupings. Ward’s hierarchical grouping method reduced the municipalities to 4 probable groups, with similar attributes within each group and distinct attributes between groups. It also yielded a cophenetic correlation coefficient (CCC) of r = 0.81, indicating a good degree of fit between the dendrogram and the dissimilarity matrix. The results indicated that the clusters formed, and the municipalities assigned to them, were similar in both the hierarchical and non-hierarchical methods. With the k-means method, almost all municipalities that make up the territorial division remained within the same group.
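The standardization and dissimilarity steps can be illustrated with a short sketch. The abstract does not name the standardization used, so z-scores are an assumption here. The indicator values are invented for illustration and are not data from the study.

```python
import math
import statistics

def standardize(rows):
    """Z-score each column so indicators on different scales weigh alike.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]

def dissimilarity_matrix(rows):
    """Pairwise Euclidean distances between standardized rows."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Hypothetical municipalities: (income per capita, share of homes with sanitation).
rows = [[200.0, 0.10], [220.0, 0.12], [900.0, 0.80]]
z = standardize(rows)
d = dissimilarity_matrix(z)
for line in d:
    print([round(x, 3) for x in line])
```

Without standardization the income column would dominate the distances simply because its values are larger. A matrix like `d` is what a hierarchical method such as Ward's takes as its starting point.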

