Exploring performance of clustering methods on document sentiment analysis

Clustering is a powerful unsupervised tool for sentiment analysis from text. However, the clustering results may be affected by any step of the clustering process, such as data pre-processing strategy, term weighting method in Vector Space Model and clustering algorithm. This paper presents the results of an experimental study of some common clustering techniques with respect to the task of sentiment analysis. Different from previous studies, in particular, we investigate the combination effects of these factors with a series of comprehensive experimental studies. The experimental results indicate that, first, the K-means-type clustering algorithms show clear advantages on balanced review datasets, while performing rather poorly on unbalanced datasets by considering clustering accuracy. Second, the comparatively newly designed weighting models are better than the traditional weighting models for sentiment clustering on both balanced and unbalanced datasets. Furthermore, adjective and adverb words extraction strategy can offer obvious improvements on clustering performance, while strategies of adopting stemming and stopword removal will bring negative influences on sentiment clustering. The experimental results would be valuable for both the study and usage of clustering methods in online review sentiment analysis.

Download Full-text

An improved OPTICS clustering algorithm for discovering clusters with uneven densities

Intelligent Data Analysis ◽

10.3233/ida-205497 ◽

2021 ◽

Vol 25 (6) ◽

pp. 1453-1471

Author(s):

Chunhua Tang ◽

Han Wang ◽

Zhiwen Wang ◽

Xiangkun Zeng ◽

Huaran Yan ◽

...

Keyword(s):

Time Complexity ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Substantial Improvement ◽

Experimental Results ◽

High Time ◽

Parameter Setting ◽

K Nearest Neighbor ◽

Density Based Clustering

Most density-based clustering algorithms have the problems of difficult parameter setting, high time complexity, poor noise recognition, and weak clustering for datasets with uneven density. To solve these problems, this paper proposes FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), which is a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) from the Augmented Cluster-Ordering generated by OPTICS and uses the reachability-distance of DP as the radius of neighborhood eps of its corresponding cluster. It overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance of the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity, and outperforms other algorithms in parameter setting and noise recognition.

Download Full-text

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers are interested in the problem of clustering categorical data and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR) which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

Download Full-text

Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i4.2803 ◽

2021 ◽

Vol 10 (4) ◽

pp. 2170-2180

Author(s):

Untari N. Wisesty ◽

Tati Rajab Mengko

Keyword(s):

Dimensionality Reduction ◽

Dimensional Reduction ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Reduction Process ◽

Principal Component ◽

Gaussian Mixture ◽

Clustering Methods

This paper aims to conduct an analysis of the SARS-CoV-2 genome variation was carried out by comparing the results of genome clustering using several clustering algorithms and distribution of sequence in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, the clustering algorithm has a weakness in grouping data that has very high dimensions such as genome data, so that a dimensional reduction process is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and autoencoder method with three models that produce 2, 10, and 50 features. The main contributions achieved were the dimensional reduction and clustering scheme of SARS-CoV-2 sequence data and the performance analysis of each experiment on each scheme and hyper parameters for each method. Based on the results of experiments conducted, PCA and DBSCAN algorithm achieve the highest silhouette score of 0.8770 with three clusters when using two features. However, dimensionality reduction using autoencoder need more iterations to converge. On the testing process with Indonesian sequence data, more than half of them enter one cluster and the rest are distributed in the other two clusters.

Download Full-text

Majority Voting Based Multi-Task Clustering of Air Quality Monitoring Network in Turkey

Applied Sciences ◽

10.3390/app9081610 ◽

2019 ◽

Vol 9 (8) ◽

pp. 1610

Author(s):

Goksu Tuysuzoglu ◽

Derya Birant ◽

Aysegul Pala

Keyword(s):

Air Pollution ◽

Air Quality ◽

Clustering Algorithms ◽

Experimental Studies ◽

Monitoring Network ◽

Environmental Parameters ◽

Global Scale ◽

Majority Voting ◽

Clustering Methods ◽

Task Clustering

Air pollution, which is the result of the urbanization brought by modern life, has a dramatic impact on the global scale as well as local and regional scales. Since air pollution has important effects on human health and other living things, the issue of air quality is of great importance all over the world. Accordingly, many studies based on classification, clustering and association rule mining applications for air pollution have been proposed in the field of data mining and machine learning to extract hidden knowledge from environmental parameters. One approach is to model a region in a way that cities having similar characteristics are determined and placed into the same clusters. Instead of using traditional clustering algorithms, a novel algorithm, named Majority Voting based Multi-Task Clustering (MV-MTC), is proposed and utilized to consider multiple air pollutants jointly. Experimental studies showed that the proposed method is superior to five well-known clustering algorithms: K-Means, Expectation Maximization, Canopy, Farthest First and Hierarchical clustering methods.

Download Full-text

Spectrogram Classification Using Dissimilarity Space

Applied Sciences ◽

10.3390/app10124176 ◽

2020 ◽

Vol 10 (12) ◽

pp. 4176 ◽

Cited By ~ 1

Author(s):

Loris Nanni ◽

Andrea Rigo ◽

Alessandra Lumini ◽

Sheryl Brahnam

Keyword(s):

Ad Hoc ◽

Classification Problem ◽

Space Representation ◽

Support Vector ◽

Clustering Methods ◽

Audio Classification ◽

Classification Problems ◽

Clustering Techniques ◽

Vector Space Representation ◽

Better Than

In this work, we combine a Siamese neural network and different clustering techniques to generate a dissimilarity space that is then used to train an SVM for automated animal audio classification. The animal audio datasets used are (i) birds and (ii) cat sounds, which are freely available. We exploit different clustering methods to reduce the spectrograms in the dataset to a number of centroids that are used to generate the dissimilarity space through the Siamese network. Once computed, we use the dissimilarity space to generate a vector space representation of each pattern, which is then fed into an support vector machine (SVM) to classify a spectrogram by its dissimilarity vector. Our study shows that the proposed approach based on dissimilarity space performs well on both classification problems without ad-hoc optimization of the clustering methods. Moreover, results show that the fusion of CNN-based approaches applied to the animal audio classification problem works better than the stand-alone CNNs.

Download Full-text

LINEAR COHERENT BI-CLUSTERING VIA BEAM SEARCHING AND SAMPLE SET CLUSTERING

Discrete Mathematics Algorithms and Applications ◽

10.1142/s1793830912500231 ◽

2012 ◽

Vol 04 (02) ◽

pp. 1250023 ◽

Cited By ~ 2

Author(s):

YI SHI ◽

MARYAM HASAN ◽

ZHIPENG CAI ◽

GUOHUI LIN ◽

DALE SCHUURMANS

Keyword(s):

Gene Expression ◽

Microarray Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Gene Expression Microarray ◽

Expression Microarray ◽

Gene Expression Microarray Data ◽

Sample Set

We propose a new bi-clustering algorithm, LinCoh, for finding linear coherent bi-clusters in gene expression microarray data. Our method exploits a robust technique for identifying conditionally correlated genes, combined with an efficient density-based search for clustering sample sets. Experimental results on both synthetic and real datasets demonstrated that LinCoh consistently finds more accurate and higher quality bi-clusters than existing bi-clustering algorithms.

Download Full-text

A Quantitative Discriminant Method of Elbow Point for the Optimal Number of Clusters in Clustering Algorithm

10.21203/rs.3.rs-58011/v3 ◽

2021 ◽

Author(s):

Congming Shi ◽

Bingtao Wei ◽

Shoulin Wei ◽

Wen Wang ◽

Hai Liu ◽

...

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Machine Learning Method ◽

Cluster Number ◽

Number Of Clusters ◽

Public Dataset ◽

Optimal Cluster ◽

Better Than ◽

Optimal Number Of Clusters

Abstract Clustering, a traditional machine learning method, plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, clusters are usually unpredictable. Although the Elbow method is one of the most commonly used methods to discriminate the optimal cluster number, the discriminant of the number of clusters depends on the manual identification of the elbow points on the visualization curve. Thus, experienced analysts cannot clearly identify the elbow point from the plotted curve when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed to yield a statistical metric that estimates an optimal cluster number when clustering on a dataset. First, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosine of intersection angles between elbow points. Third, this calculated cosine of intersection angles and the arccosine theorem are used to compute the intersection angles between elbow points. Finally, the index of the above computed minimal intersection angles between elbow points is used as the estimated potential optimal cluster number. The experimental results based on simulated datasets and a well-known public dataset (Iris Dataset) demonstrated that the estimated optimal cluster number obtained by our newly proposed method is better than the widely used Silhouette method.

Download Full-text

VDPC: Variational Density Peak Clustering Algorithm

10.36227/techrxiv.17597669.v1 ◽

2021 ◽

Author(s):

Yizhang Wang ◽

Di Wang ◽

You Zhou ◽

Chai Quek ◽

Xiaofeng Zhang

Keyword(s):

Clustering Algorithm ◽

Cluster Formation ◽

Clustering Algorithms ◽

Data Distribution ◽

Distribution Patterns ◽

Clustering Methods ◽

Density Peak ◽

Global Parameter ◽

Density Peak Clustering ◽

Parameter Values

<div>Clustering is an important unsupervised knowledge acquisition method, which divides the unlabeled data into different groups \cite{atilgan2021efficient,d2021automatic}. Different clustering algorithms make different assumptions on the cluster formation, thus, most clustering algorithms are able to well handle at least one particular type of data distribution but may not well handle the other types of distributions. For example, K-means identifies convex clusters well \cite{bai2017fast}, and DBSCAN is able to find clusters with similar densities \cite{DBSCAN}. </div><div>Therefore, most clustering methods may not work well on data distribution patterns that are different from the assumptions being made and on a mixture of different distribution patterns. Taking DBSCAN as an example, it is sensitive to the loosely connected points between dense natural clusters as illustrated in Figure~\ref{figconnect}. The density of the connected points shown in Figure~\ref{figconnect} is different from the natural clusters on both ends, however, DBSCAN with fixed global parameter values may wrongly assign these connected points and consider all the data points in Figure~\ref{figconnect} as one big cluster.</div>

Download Full-text

Ensemble Hybrid K- Means and DBSCAN Clustering Algorithm – HDKA for Cancer Dataset

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d8257.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 6036-6040

Keyword(s):

Machine Learning ◽

Data Mining ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Cancer Dataset ◽

Dbscan Clustering ◽

Selection Of

Data Mining is the foremost vital space of analysis and is pragmatically utilized in totally different domains, It becomes a highly demanding field because huge amounts of data have been collected in various applications. The database can be clustered in more number of ways depending on the clustering algorithm used, parameter settings and other factors. Multiple clustering algorithms can be combined to get the final partitioning of data which provides better clustering results. In this paper, Ensemble hybrid KMeans and DBSCAN (HDKA) algorithm has been proposed to overcome the drawbacks of DBSCAN and KMeans clustering algorithms. The performance of the proposed algorithm improves the selection of centroid points through the centroid selection strategy.For experimental results we have used two dataset Colon and Leukemia from UCI machine learning repository.

Download Full-text

Hierarchical kt jet clustering for parallel architectures

Acta Universitatis Sapientiae Informatica ◽

10.1515/ausi-2017-0012 ◽

2017 ◽

Vol 9 (2) ◽

pp. 195-213

Author(s):

Richárd Forster ◽

Ágnes Fülöp

Keyword(s):

Hierarchical Clustering ◽

Particle Physics ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

High Energy ◽

Theoretical Physics ◽

High Energy Particle ◽

Clustering Methods ◽

Hierarchical Clustering Methods ◽

Using Data

AbstractThe reconstruction and analyze of measured data play important role in the research of high energy particle physics. This leads to new results in both experimental and theoretical physics. This requires algorithm improvements and high computer capacity. Clustering algorithm makes it possible to get to know the jet structure more accurately. More granular parallelization of the kt cluster algorithms was explored by combining it with the hierarchical clustering methods used in network evaluations. The kt method allows to know the development of particles due to the collision of high-energy nucleus-nucleus. The hierarchical clustering algorithms works on graphs, so the particle information used by the standard kt algorithm was first transformed into an appropriate graph, representing the network of particles. Testing was done using data samples from the Alice offine library, which contains the required modules to simulate the ALICE detector that is a dedicated Pb-Pb detector. The proposed algorithm was compared to the FastJet toolkit's standard longitudinal invariant kt implementation. Parallelizing the standard non-optimized version of this algorithm utilizing the available CPU architecture proved to be 1:6 times faster, than the standard implementation, while the proposed solution in this paper was able to achieve a 12 times faster computing performance, also being scalable enough to efficiently run on GPUs.

Download Full-text