An Approach for Data Labelling and Concept Drift Detection Based on Entropy Model in Rough Sets for Clustering Categorical Data

Clustering is an important technique in data mining. Clustering a large data set is difficult and time consuming. An approach called data labelling has been suggested for clustering large databases using sampling technique to improve efficiency of clustering. A sampled data is selected randomly for initial clustering and data points which are not sampled and unclustered are given cluster label or an outlier based on various data labelling techniques. Data labelling is an easy task in numerical domain because it is performed based on distance between a cluster and an unlabelled data point. However, in categorical domain since the distance is not defined properly between data points and data points with cluster, then data labelling is a difficult task for categorical data. This paper proposes a method for data labelling using entropy model in rough sets for categorical data. The concept of entropy, introduced by Shannon with particular reference to information theory is a powerful mechanism for the measurement of uncertainty information. In this method, data labelling is performed by integrating entropy with rough sets. This method is also applied to drift detection to establish if concept drift occurred or not when clustering categorical data. The cluster purity is also discussed using Rough Entropy for data labelling and for outlier detection. The experimental results show that the efficiency and clustering quality of this algorithm are better than the previous algorithms.

Download Full-text

A Support Based Initialization Algorithm for Categorical Data Clustering

Journal of Information Technology Research ◽

10.4018/jitr.2018040104 ◽

2018 ◽

Vol 11 (2) ◽

pp. 53-67

Author(s):

Ajay Kumar ◽

Shishir Kumar

Keyword(s):

Categorical Data ◽

Selection Process ◽

Numerical Data ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Data Object ◽

Data Points ◽

Wu Method ◽

Selection Algorithms

Several initial center selection algorithms are proposed in the literature for numerical data, but the values of the categorical data are unordered so, these methods are not applicable to a categorical data set. This article investigates the initial center selection process for the categorical data and after that present a new support based initial center selection algorithm. The proposed algorithm measures the weight of unique data points of an attribute with the help of support and then integrates these weights along the rows, to get the support of every row. Further, a data object having the largest support is chosen as an initial center followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.

Download Full-text

Exploring biological data: Mappings between ontology- and cluster-based representations

Information Visualization ◽

10.1177/1473871612468880 ◽

2013 ◽

Vol 12 (3-4) ◽

pp. 291-307 ◽

Cited By ~ 1

Author(s):

Ilir Jusufi ◽

Andreas Kerren ◽

Falk Schreiber

Keyword(s):

Hierarchical Clustering ◽

Visual Analysis ◽

Large Data ◽

Biological Data ◽

Data Set ◽

Metabolomics Data ◽

Domain Experts ◽

Data Points ◽

Biological Input ◽

Insight Into

Ontologies and hierarchical clustering are both important tools in biology and medicine to study high-throughput data such as transcriptomics and metabolomics data. Enrichment of ontology terms in the data is used to identify statistically overrepresented ontology terms, giving insight into relevant biological processes or functional modules. Hierarchical clustering is a standard method to analyze and visualize data to find relatively homogeneous clusters of experimental data points. Both methods support the analysis of the same data set but are usually considered independently. However, often a combined view is desired: visualizing a large data set in the context of an ontology under consideration of a clustering of the data. This article proposes new visualization methods for this task. They allow for interactive selection and navigation to explore the data under consideration as well as visual analysis of mappings between ontology- and cluster-based space-filling representations. In this context, we discuss our approach together with specific properties of the biological input data and identify features that make our approach easily usable for domain experts.

Download Full-text

A dynamic K-means clustering for data mining

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i2.pp521-526 ◽

2019 ◽

Vol 13 (2) ◽

pp. 521

Author(s):

Md. Zakir Hossain ◽

Md.Nasim Akhtar ◽

R.B. Ahmad ◽

Mostafijur Rahman

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Large Data ◽

Threshold Value ◽

Specific Pattern ◽

Large Data Sets ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Data Points

Data mining is the process of finding structure of data from large data sets. With this process, the decision makers can make a particular decision for further development of the real-world problems. Several data clusteringtechniques are used in data mining for finding a specific pattern of data. The K-means method isone of the familiar clustering techniques for clustering large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters are fixed.The main problem of this method is that if the number of clusters is to be chosen small then there is a higher probability of adding dissimilar items into the same group. On the other hand, if the number of clusters is chosen to be high, then there is a higher chance of adding similar items in the different groups. In this paper, we address this issue by proposing a new K-Means clustering algorithm. The proposed method performs data clustering dynamically. The proposed method initially calculates a threshold value as a centroid of K-Means and based on this value the number of clusters are formed. At each iteration of K-Means, if the Euclidian distance between two points is less than or equal to the threshold value, then these two data points will be in the same group. Otherwise, the proposed method will create a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-Means method.

Download Full-text

Optimal fast Johnson–Lindenstrauss embeddings for large data sets

Sampling Theory, Signal Processing, and Data Analysis ◽

10.1007/s43670-021-00003-5 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Stefan Bamberger ◽

Felix Krahmer

Keyword(s):

Fast Algorithms ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Embedding Dimension ◽

Data Set ◽

Optimal Dimension ◽

Discrete Algorithms ◽

Fast Multiplication ◽

Data Points

AbstractJohnson–Lindenstrauss embeddings are widely used to reduce the dimension and thus the processing time of data. To reduce the total complexity, also fast algorithms for applying these embeddings are necessary. To date, such fast algorithms are only available either for a non-optimal embedding dimension or up to a certain threshold on the number of data points. We address a variant of this problem where one aims to simultaneously embed larger subsets of the data set. Our method follows an approach by Nelson et al. (New constructions of RIP matrices with fast multiplication and fewer rows. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1515-1528, 2014): a subsampled Hadamard transform maps points into a space of lower, but not optimal dimension. Subsequently, a random matrix with independent entries projects to an optimal embedding dimension. For subsets whose size scales at least polynomially in the ambient dimension, the complexity of this method comes close to the number of operations just to read the data under mild assumptions on the size of the data set that are considerably less restrictive than in previous works. We also prove a lower bound showing that subsampled Hadamard matrices alone cannot reach an optimal embedding dimension. Hence, the second embedding cannot be omitted.

Download Full-text

Massively scalable density based clustering (DBSCAN) on the HPCC systems big data platform

IAES International Journal of Artificial Intelligence (IJ-AI) ◽

10.11591/ijai.v10.i1.pp207-214 ◽

2021 ◽

Vol 10 (1) ◽

pp. 207

Author(s):

Yatish H. R. ◽

Shubham Milind Phal ◽

Tanmay Sanjay Hukkeri ◽

Lili Xu ◽

Shobha G ◽

...

Keyword(s):

Clustering Algorithm ◽

Spatial Clustering ◽

Computation Time ◽

Large Data ◽

Single Node ◽

Data Set ◽

Traffic Pattern ◽

Density Based Clustering ◽

Data Points ◽

Hpcc Systems

Dealing with large samples of unlabeled data is a key challenge in today’s world, especially in applications such as traffic pattern analysis and disaster management. DBSCAN, or density based spatial clustering of applications with noise, is a well-known density-based clustering algorithm. Its key strengths lie in its capability to detect outliers and handle arbitrarily shaped clusters. However, the algorithm, being fundamentally sequential in nature, proves expensive and time consuming when operated on extensively large data chunks. This paper thus presents a novel implementation of a parallel and distributed DBSCAN algorithm on the HPCC Systems platform. The algorithm seeks to fully parallelize the algorithm implementation by making use of HPCC Systems optimal distributed architecture and performing a tree-based union to merge local clusters. The proposed approach* was tested both on synthetic as well as standard datasets (MFCCs Data Set) and found to be completely accurate. Additionally, when compared against a single node setup, a significant decrease in computation time was observed with no impact to accuracy. The parallelized algorithm performed eight times better for higher number of data points and takes exponentially lesser time as the number of data points increases.

Download Full-text

Visualization of Predictive Modeling for Big Data Using Various Approaches When There Are Rare Events at Differing Levels

Advances in Data Mining and Database Management - Handbook of Research on Big Data Storage and Visualization Techniques ◽

10.4018/978-1-5225-3142-5.ch021 ◽

2018 ◽

pp. 604-631 ◽

Cited By ~ 1

Author(s):

Alan Olinsky ◽

John Thomas Quinn ◽

Phyllis A. Schumacher

Keyword(s):

Predictive Modeling ◽

Rare Events ◽

Rare Event ◽

Sampling Technique ◽

Large Data ◽

Occurrence Rate ◽

Data Sets ◽

Target Variable ◽

Data Set ◽

Home Mortgage

Many techniques exist for predictive modeling of a bivariate target variable in large data sets. When the target variable represents a rare event with an occurrence in the data set of approximately 10% or less, traditional modeling techniques may fail to identify the rare events. In this chapter, different methods, including oversampling of rare events, undersampling of common events and the Synthetic Minority Over-Sampling Technique are used to improve the prediction outcomes of rare events. The predictive models of decision trees, logistic regression and rule induction are applied with SAS Enterprise Miner (EM) to the revised data. Using a data set of home mortgage applications, misclassification percentages of a target variable with a rare event occurrence rate of 0.8% are obtained by running a multiple comparison node. The percentage is varied from 0.8% up to 50% and the results are compared to see which predictive method worked the best.

Download Full-text

A posteriori noise estimation in variable data sets

Astronomy and Astrophysics ◽

10.1051/0004-6361/201730618 ◽

2018 ◽

Vol 609 ◽

pp. A39 ◽

Cited By ~ 7

Author(s):

S. Czesla ◽

T. Molle ◽

J. H. M. M. Schmitt

Keyword(s):

Standard Deviation ◽

Synthetic Data ◽

Light Curves ◽

Weighted Sums ◽

Data Sets ◽

A Posteriori ◽

Sampled Data ◽

Data Set ◽

Specific Parameter ◽

Data Points

Most physical data sets contain a stochastic contribution produced by measurement noise or other random sources along with the signal. Usually, neither the signal nor the noise are accurately known prior to the measurement so that both have to be estimated a posteriori. We have studied a procedure to estimate the standard deviation of the stochastic contribution assuming normality and independence, requiring a sufficiently well-sampled data set to yield reliable results. This procedure is based on estimating the standard deviation in a sample of weighted sums of arbitrarily sampled data points and is identical to the so-called DER_SNR algorithm for specific parameter settings. To demonstrate the applicability of our procedure, we present applications to synthetic data, high-resolution spectra, and a large sample of space-based light curves and, finally, give guidelines to apply the procedure in situation not explicitly considered here to promote its adoption in data analysis.

Download Full-text

A characteristic turning point between summer and autumn in north China

10.21203/rs.3.rs-96099/v1 ◽

2020 ◽

Author(s):

Yaoqin Xie

Keyword(s):

North China ◽

Turning Point ◽

Sunshine Duration ◽

Large Data ◽

Annual Average ◽

Data Set ◽

Knowledge Based ◽

Gradual Process ◽

Data Points ◽

Average Distribution

Abstract It is generally believed that seasonal alternation is a gradual process marked by temperature. Explored from a large data set containing 1,686,528 data points of temperature, humidity and sunshine duration, we established a seasonal dynamic model of north China. Based on the model, we discovered a turning point on the 220th day in the annual average distribution of humidity and sunshine duration, which can be used as a characteristic node to define the date of summer-autumn alternation in north China. Our results demonstrate that the alternation of summer and autumn is not a gradual process in this region, but a mutation in the annual distributions of humidity and sunshine duration, thus revealing the statistical invariance based on the local knowledge. The study also shows that humidity and sunshine duration can better reflect the climate characteristics of north China than temperature. Because the model is region-specific, the proposed method using big data can be further extended to quantitatively define other seasonal alternations and explore other climate characteristics in different regions, so as to benefit indigenous knowledge-based climate prediction.

Download Full-text

The Mouse Heart Attack Research Tool 1.0 database

AJP Heart and Circulatory Physiology ◽

10.1152/ajpheart.00172.2018 ◽

2018 ◽

Vol 315 (3) ◽

pp. H522-H530 ◽

Cited By ~ 5

Author(s):

Kristine Y. DeLeon-Pennell ◽

Rugmani Padmanabhan Iyer ◽

Yonggang Ma ◽

Andriy Yabluchanskiy ◽

Rogelio Zamilpa ◽

...

Keyword(s):

Myocardial Infarction ◽

Heart Attack ◽

Ventricular Remodeling ◽

Left Ventricular Remodeling ◽

Large Data ◽

Left Ventricular ◽

Research Tool ◽

Mouse Heart ◽

Data Set ◽

Data Points

The generation of big data has enabled systems-level dissections into the mechanisms of cardiovascular pathology. Integration of genetic, proteomic, and pathophysiological variables across platforms and laboratories fosters discoveries through multidisciplinary investigations and minimizes unnecessary redundancy in research efforts. The Mouse Heart Attack Research Tool (mHART) consolidates a large data set of over 10 yr of experiments from a single laboratory for cardiovascular investigators to generate novel hypotheses and identify new predictive markers of progressive left ventricular remodeling after myocardial infarction (MI) in mice. We designed the mHART REDCap database using our own data to integrate cardiovascular community participation. We generated physiological, biochemical, cellular, and proteomic outputs from plasma and left ventricles obtained from post-MI and no-MI (naïve) control groups. We included both male and female mice ranging in age from 3 to 36 mo old. After variable collection, data underwent quality assessment for data curation (e.g., eliminate technical errors, check for completeness, remove duplicates, and define terms). Currently, mHART 1.0 contains >888,000 data points and includes results from >2,100 unique mice. Database performance was tested, and an example is provided to illustrate database utility. This report explains how the first version of the mHART database was established and provides researchers with a standard framework to aid in the integration of their data into our database or in the development of a similar database. NEW & NOTEWORTHY The Mouse Heart Attack Research Tool combines >888,000 cardiovascular data points from >2,100 mice. We provide this large data set as a REDCap database to generate novel hypotheses and identify new predictive markers of adverse left ventricular remodeling following myocardial infarction in mice and provide examples of use. The Mouse Heart Attack Research Tool is the first database of this size that integrates data sets across platforms that include genomic, proteomic, histological, and physiological data.

Download Full-text

Weighted Fuzzy-Possibilistic C-Means Over Large Data Sets

International Journal of Data Warehousing and Mining ◽

10.4018/jdwm.2012100104 ◽

2012 ◽

Vol 8 (4) ◽

pp. 82-107 ◽

Cited By ~ 2

Author(s):

Renxia Wan ◽

Yuelin Gao ◽

Caixia Li

Keyword(s):

Large Data ◽

Real Data ◽

Large Data Sets ◽

Data Sets ◽

Data Set ◽

Data Segment ◽

Possibilistic Clustering ◽

Data Points ◽

Weighted Data ◽

Natural Classes

Up to now, several algorithms for clustering large data sets have been presented. Most clustering approaches for data sets are the crisp ones, which cannot be well suitable to the fuzzy case. In this paper, the authors explore a single pass approach to fuzzy possibilistic clustering over large data set. The basic idea of the proposed approach (weighted fuzzy-possibilistic c-means, WFPCM) is to use a modified possibilistic c-means (PCM) algorithm to cluster the weighted data points and centroids with one data segment as a unit. Experimental results on both synthetic and real data sets show that WFPCM can save significant memory usage when comparing with the fuzzy c-means (FCM) algorithm and the possibilistic c-means (PCM) algorithm. Furthermore, the proposed algorithm is of an excellent immunity to noise and can avoid splitting or merging the exact clusters into some inaccurate clusters, and ensures the integrity and purity of the natural classes.

Download Full-text