Comparison of Clustering Algorithms on Air Quality Substances in Peninsular Malaysia

2018 ◽  
Vol 2 (1) ◽  
pp. 36-44
Author(s):  
Sitti Sufiah Atirah Rosly ◽  
Balkiah Moktar ◽  
Muhamad Hasbullah Mohd Razali

Air quality is one of the most prominent environmental problems in this era of globalization. Air pollution is harmful air arising from car emissions, smog, open burning, chemical releases from factories, and other particles and gases. This harmful air can have adverse effects on human health and the environment. In order to provide information on which areas are better for residents in Malaysia, cluster analysis is used to determine the areas that can be clustered together based on their air quality, as measured through several air quality substances. Monthly data from 37 monitoring stations in Peninsular Malaysia from 2013 to 2015 were used in this study. The K-Means (KM), Expectation Maximization (EM) and Density Based (DB) clustering algorithms were chosen as the techniques for the cluster analysis, carried out with the Waikato Environment for Knowledge Analysis (WEKA) tools. Results show that the K-Means clustering algorithm is the best of the three methods due to its simplicity and the time taken to build the model. The output of the K-Means clustering algorithm shows that it can partition the area into two clusters, namely cluster 0 and cluster 1. Cluster 0 consists of 16 monitoring stations and cluster 1 consists of 36 monitoring stations in Peninsular Malaysia.
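To make the comparison above concrete, here is a minimal sketch in Python of the same three algorithm families, K-Means, EM, and density-based clustering, applied to a generic table of monthly pollutant readings. It is not the authors' WEKA workflow; the file name, pollutant columns, and DBSCAN parameters are assumptions for illustration only.

```python
import time
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Hypothetical input: one row per station-month, pollutant concentrations as columns.
df = pd.read_csv("air_quality_monthly.csv")
X = StandardScaler().fit_transform(df[["PM10", "SO2", "NO2", "CO", "O3"]])

models = {
    "K-Means (k=2)": KMeans(n_clusters=2, n_init=10, random_state=0),
    "EM (Gaussian mixture)": GaussianMixture(n_components=2, random_state=0),
    "Density-based (DBSCAN)": DBSCAN(eps=0.8, min_samples=5),
}

for name, model in models.items():
    start = time.time()
    labels = model.fit_predict(X)            # all three estimators support fit_predict
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # DBSCAN marks noise as -1
    print(f"{name}: {n_clusters} clusters, model built in {time.time() - start:.3f}s")
```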

2015 ◽  
pp. 125-138 ◽  
Author(s):  
I. V. Goncharenko

In this article we propose a new method of non-hierarchical cluster analysis using the k-nearest-neighbor graph and discuss it with respect to vegetation classification. The method of k-nearest-neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later the term "k-NN graph" and several k-NN clustering algorithms appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an "excessive" graph, a so-called hypergraph, and then truncate it to subgraphs by partitioning and coarsening the hypergraph. We developed a different, "upward" clustering strategy, forming (consecutively assembling) one cluster after another. Until now, graph-based cluster analysis has not been considered for the classification of vegetation datasets.
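As a rough illustration of the graph primitive this method builds on, the sketch below constructs a mutual k-nearest-neighbor graph over a synthetic relevé-by-species matrix and reads off its connected components as candidate groups. This is not the author's "upward" assembly procedure, only the shared k-NN-graph starting point; the data and the choice of k are assumptions.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # stand-in for a releve-by-species abundance matrix

k = 5
A = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
A_mutual = A.minimum(A.T)               # keep an edge only if both points list each other

n_groups, labels = connected_components(A_mutual, directed=False)
print(f"{n_groups} connected components in the mutual {k}-NN graph")
```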


Author(s):  
Junjie Wu ◽  
Jian Chen ◽  
Hui Xiong

Cluster analysis (Jain & Dubes, 1988) provides insight into the data by dividing the objects into groups (clusters), such that objects in a cluster are more similar to each other than to objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), have been well established. A recent research focus in cluster analysis is to understand the strengths and weaknesses of various clustering algorithms with respect to data factors. Indeed, people have identified some data characteristics that may strongly affect cluster analysis, including high dimensionality and sparseness, large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is needed to reveal whether and how data distributions can affect the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions: 1. What are the systematic differences between the distributions of the clusters produced by different clustering algorithms? 2. How can the distribution of the "true" cluster sizes affect the performance of clustering algorithms? 3. How should an appropriate clustering algorithm be chosen in practice? The answers to these questions can guide us toward a better understanding and use of clustering methods. This is noteworthy, since 1) in theory, people have seldom realized that there are strong relationships between clustering algorithms and cluster size distributions, and 2) in practice, choosing an appropriate clustering algorithm is still a challenging task, especially after the algorithm boom in the data mining area. This chapter thus takes an initial step toward filling this void. To this end, we carefully select two widely used categories of clustering algorithms, i.e., K-means and Agglomerative Hierarchical Clustering (AHC), as the representative algorithms for illustration. In the chapter, we first show that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. Then we demonstrate that UPGMA, one of the robust AHC methods, acts in the opposite way to K-means; that is, UPGMA tends to generate clusters with high variation in cluster sizes. Indeed, the experimental results indicate that the variations of the resultant cluster sizes produced by K-means and UPGMA, measured by the Coefficient of Variation (CV), fall in specific intervals, namely [0.3, 1.0] and [1.0, 2.5] respectively. Finally, we put K-means and UPGMA together for a further comparison, and propose some rules for a better choice of clustering scheme from the data distribution point of view.
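A hedged sketch of the comparison described above: run K-means and UPGMA (average-linkage AHC) on the same data and compare the Coefficient of Variation (CV = standard deviation / mean) of the resulting cluster sizes. The data below are synthetic; the intervals [0.3, 1.0] and [1.0, 2.5] come from the chapter's experiments, not from this toy run.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

def cluster_size_cv(labels):
    """Coefficient of Variation of the cluster sizes."""
    sizes = np.bincount(labels)
    return sizes.std() / sizes.mean()

# Synthetic data with clusters of unequal spread, so the two algorithms can differ.
X, _ = make_blobs(n_samples=1000, centers=5,
                  cluster_std=[1.0, 1.5, 2.0, 2.5, 3.0], random_state=0)

km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
upgma_labels = AgglomerativeClustering(n_clusters=5, linkage="average").fit_predict(X)

print("CV of K-means cluster sizes:", round(cluster_size_cv(km_labels), 3))
print("CV of UPGMA cluster sizes:  ", round(cluster_size_cv(upgma_labels), 3))
```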


2020 ◽  
Author(s):  
Woo-Sik Jung ◽  
Woo-Gon Do

With increasing interest in air pollution, the installation of air quality monitoring networks for regular measurement is considered a very important task in many countries. However, operating air quality monitoring networks requires much time and money. Therefore, the representativeness of the locations of air quality monitoring networks is an important issue that has been studied by many groups worldwide. Most such studies are based on statistical analysis or the use of geographic information systems (GIS) applied to existing air quality monitoring network data. These methods are useful for identifying the representativeness of existing monitoring networks, but they cannot verify the need to add new monitoring stations. With the development of computer technology, numerical air quality models such as CMAQ have become increasingly important in analyzing and diagnosing air pollution. In this study, PM2.5 distributions in Busan were reproduced with 1-km grid spacing by the CMAQ model. The model results reflected actual PM2.5 changes relatively well. A cluster analysis, a statistical method that groups similar objects together, was then applied to the hourly PM2.5 concentrations for all grids in the model domain. Similarities and differences between objects can be measured in several ways. K-means clustering is a non-hierarchical cluster analysis method featuring an advantageously low calculation time for the fast processing of large amounts of data, and it has been widely used in existing studies that group air quality data with the same characteristics. As a result of the cluster analysis, PM2.5 pollution in Busan was successfully divided into groups with the same concentration-change characteristics. Finally, the redundancy of the monitoring stations and the need for additional sites were analyzed by comparing the clusters of PM2.5 with the locations of the air quality monitoring networks currently in operation.

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2017R1D1A3B03036152).
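The grouping step described above can be sketched as follows (the CMAQ modelling itself is outside the scope of the sketch): treat each model grid cell as one object whose features are its hourly PM2.5 concentrations, and apply K-means so that cells with similar concentration-change patterns fall into the same cluster. The file name, array shape, and choice of k are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical model output: hourly PM2.5 from the 1-km domain, shape (n_cells, n_hours).
pm25 = np.load("cmaq_pm25_hourly.npy")

# Standardise each cell's time series so clusters reflect the shape of the temporal
# variation rather than the absolute concentration level.
X = (pm25 - pm25.mean(axis=1, keepdims=True)) / pm25.std(axis=1, keepdims=True)

k = 6                                   # assumed number of concentration-change regimes
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# The labels can now be mapped back onto the model grid and compared with the
# locations of the existing monitoring stations.
print(np.bincount(labels))
```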


Author(s):  
Mehak Nigar Shumaila

Clustering, otherwise known as cluster analysis, is a learning problem that takes place without any human supervision. The technique has been used efficiently in data analysis to observe and identify interesting, useful, or desired patterns in the data. Clustering works by dividing the data into groups of similar objects based on the characteristics it identifies; each resulting group is called a cluster. A single cluster consists of objects that are similar to the other objects in the same cluster and differ from the objects assigned to other clusters. Clustering is significant in many aspects of data analysis because it determines and presents the intrinsic grouping of objects in a batch of unlabeled raw data, based on their attributes. There is no textbook or universally good criterion for cluster analysis, because the process is highly dependent on, and customizable to, each user's needs. There is no outright best clustering algorithm, as the choice depends heavily on the user's scenario and needs. This paper compares and studies two clustering algorithms, k-means and mean shift. The algorithms are compared according to the following factors: time complexity, training, prediction performance, and accuracy of the clustering algorithms.
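As an illustration of such a comparison, the sketch below fits both algorithms to synthetic data and reports fit time, prediction on held-out points, and agreement with the generating labels. The data set, bandwidth estimate, and scoring choice are assumptions made for the sketch, not the paper's experimental setup.

```python
import time
from sklearn.cluster import KMeans, MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=2000, centers=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "k-means": KMeans(n_clusters=4, n_init=10, random_state=0),
    "mean shift": MeanShift(bandwidth=estimate_bandwidth(X_train, quantile=0.2)),
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train)
    fit_time = time.time() - start
    test_labels = model.predict(X_test)   # both estimators assign new points to learned clusters
    score = adjusted_rand_score(y_test, test_labels)
    print(f"{name}: fit {fit_time:.3f}s, adjusted Rand index vs. true labels {score:.3f}")
```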


2020 ◽  
Vol 39 (2) ◽  
pp. 464-471
Author(s):  
J.A. Adeyiga ◽  
S.O. Olabiyisi ◽  
E.O. Omidiora

Several criminal profiling systems have been developed to assist Law Enforcement Agencies in solving crimes, but the techniques employed in most of these systems lack the ability to cluster criminals based on their behavioral characteristics. This paper reviews different clustering techniques used in criminal profiling and then selects one fuzzy clustering algorithm (Expectation Maximization) and two hard clustering algorithms (K-means and Hierarchical). The selected algorithms were then developed and tested on real-life data to produce "profiles" of criminal activity and the behavior of criminals. The algorithms were implemented using the WEKA software package. The performance of the algorithms was evaluated using cluster accuracy and time complexity. The results show that the Expectation Maximization algorithm gave 90.5% cluster accuracy in 8.5s, while K-Means had 62.6% in 0.09s and Hierarchical 51.9% in 0.11s. In conclusion, soft clustering algorithms perform better than hard clustering algorithms in analyzing criminal data.
Keywords: Clustering Algorithm, Profiling, Crime, Membership value
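A sketch of the evaluation protocol (cluster accuracy and run time) using scikit-learn stand-ins for the WEKA implementations is given below: EM via a Gaussian mixture, K-Means, and agglomerative (hierarchical) clustering. Cluster accuracy is computed by mapping each cluster to its majority class. The crime data set is not reproduced here, so labelled synthetic data are used instead.

```python
import time
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_classification

def cluster_accuracy(y_true, y_pred):
    """Fraction of objects covered by each cluster's majority class."""
    correct = 0
    for c in np.unique(y_pred):
        correct += np.bincount(y_true[y_pred == c]).max()
    return correct / len(y_true)

X, y = make_classification(n_samples=1500, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

models = {
    "EM (Gaussian mixture)": GaussianMixture(n_components=3, random_state=0),
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Hierarchical": AgglomerativeClustering(n_clusters=3),
}

for name, model in models.items():
    start = time.time()
    labels = model.fit_predict(X)
    print(f"{name}: accuracy {cluster_accuracy(y, labels):.1%}, time {time.time() - start:.2f}s")
```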


2017 ◽  
Vol 13 (2) ◽  
pp. 1-12 ◽  
Author(s):  
Jungmok Ma

One of the major obstacles in the application of the k-means clustering algorithm is the selection of the number of clusters k. The multi-attribute utility theory (MAUT)-based k-means clustering algorithm is proposed to tackle this problem by incorporating user preferences. Using MAUT, the decision maker's value structure for the number of clusters and other attributes can be quantitatively modeled and used as the objective function of the k-means algorithm. A target clustering problem for the military targeting process is used to demonstrate the MAUT-based k-means algorithm and provide a comparative study. The results show that the existing clustering algorithms do not necessarily reflect user preferences, while the MAUT-based k-means provides a systematic framework for preference modeling in cluster analysis.
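A hedged sketch of the idea: score each candidate k with an additive multi-attribute utility that combines a clustering-quality attribute (here, the silhouette) with a preference for fewer clusters, and keep the k with the highest utility. The attribute weights and single-attribute utility functions below are illustrative assumptions, not the article's MAUT model.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

def utility(k, labels, k_max=10, w_quality=0.7, w_parsimony=0.3):
    """Additive utility over two attributes: clustering quality and parsimony in k."""
    u_quality = (silhouette_score(X, labels) + 1) / 2    # rescale [-1, 1] to [0, 1]
    u_parsimony = 1 - (k - 2) / (k_max - 2)              # assumed preference for fewer clusters
    return w_quality * u_quality + w_parsimony * u_parsimony

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = utility(k, labels)

best_k = max(scores, key=scores.get)
print(f"k selected by the utility function: {best_k}")
```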


2017 ◽  
Vol 41 (8) ◽  
pp. 579-599 ◽  
Author(s):  
Yunxiao Chen ◽  
Xiaoou Li ◽  
Jingchen Liu ◽  
Gongjun Xu ◽  
Zhiliang Ying

Large-scale assessments are supported by a large item pool. An important task in test development is to assign items into scales that measure different characteristics of individuals, and a popular approach is cluster analysis of items. Classical methods in cluster analysis, such as hierarchical clustering, the K-means method, and latent-class analysis, often induce a high computational overhead and have difficulty handling missing data, especially in the presence of high-dimensional responses. In this article, the authors propose a spectral clustering algorithm for exploratory item cluster analysis. The method is computationally efficient, effective for data with missing or incomplete responses, easy to implement, and often outperforms traditional clustering algorithms in the context of high dimensionality. The spectral clustering algorithm is based on graph theory, a branch of mathematics that studies the properties of graphs. The algorithm first constructs a graph of items, characterizing the similarity structure among items. It then extracts item clusters based on the graphical structure, grouping similar items together. The proposed method is evaluated through simulations and an application to the revised Eysenck Personality Questionnaire.
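The two steps summarized above can be sketched on a generic item response matrix: build an item-by-item similarity graph, then extract item clusters from its spectrum. scikit-learn's SpectralClustering with a precomputed affinity is used as a stand-in; the correlation-based similarity and the handling of missing responses below are simplifications, not the authors' exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import SpectralClustering

# Hypothetical data: respondents in rows, items in columns, NaN = missing response.
responses = pd.read_csv("item_responses.csv")

# Step 1: an item similarity graph from pairwise correlations computed on the
# response pairs that are actually observed (pairwise deletion of missing data).
similarity = responses.corr(min_periods=10).abs().fillna(0.0).to_numpy()
np.fill_diagonal(similarity, 1.0)

# Step 2: spectral clustering on the precomputed affinity matrix.
n_scales = 3                                   # assumed number of item clusters
labels = SpectralClustering(n_clusters=n_scales, affinity="precomputed",
                            random_state=0).fit_predict(similarity)

for cluster in range(n_scales):
    print(f"Scale {cluster}:", list(responses.columns[labels == cluster]))
```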


2013 ◽  
Vol 10 (3) ◽  
pp. 359-365

Cluster analysis has been used widely as a tool for assessing eutrophication trends in coastal waters. The efficiency of clustering in discriminating between oligotrophic, mesotrophic and eutrophic sites depends on the variables used, the distance measure, and the clustering algorithm applied. In the present work, seven clustering algorithms were evaluated using sets of data from sampling sites of known water type. The results showed that only Ward's algorithm had high resolution in discriminating sampling sites of different trophic status. The remaining clustering algorithms did not show remarkable resolution in classifying different water types. The use of Ward's clustering algorithm is recommended in eutrophication studies where discrete clusters of oligotrophic, mesotrophic and eutrophic water types are under investigation.
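For reference, a brief sketch of the recommended approach: Ward's agglomerative clustering on standardized water-quality variables, cut into three groups intended to correspond to oligotrophic, mesotrophic and eutrophic sites. The input file and variable names are assumptions.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("coastal_sites.csv")            # assumed: one row per sampling site
X = StandardScaler().fit_transform(df[["chl_a", "total_P", "total_N", "secchi_depth"]])

Z = linkage(X, method="ward")                    # Ward's minimum-variance criterion
df["trophic_cluster"] = fcluster(Z, t=3, criterion="maxclust")

print(df.groupby("trophic_cluster").size())
```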


Author(s):  
Pierpaolo D’Urso ◽  
Livia De Giovanni ◽  
Marta Disegna ◽  
Riccardo Massari ◽  
Vincenzina Vitale

The popularity of cluster analysis in the tourism field has grown massively in the last decades. However, according to our review, researchers are often not aware of the characteristics and limitations of the clustering algorithms they adopt. An important gap that emerged from our review regards the adoption of an adequate clustering algorithm for mixed data. The main purpose of this article is to address this gap by describing, both theoretically and empirically, a suitable clustering algorithm for mixed data. Furthermore, this article contributes to the literature by presenting a method to include "Don't know" answers in the cluster analysis. In conclusion, the main issues related to cluster analysis are highlighted, offering some suggestions and recommendations for future analyses.
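The abstract does not spell out the chosen algorithm, so the sketch below shows only one generic way to cluster mixed numeric/categorical survey data: a Gower-type dissimilarity matrix (range-scaled numeric differences plus simple matching for categories) fed into hierarchical clustering. The column names and the treatment of "Don't know" as an ordinary category level are assumptions, not the article's method.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

df = pd.read_csv("tourist_survey.csv")                 # assumed data layout
numeric_cols = ["age", "nights", "expenditure"]
categorical_cols = ["origin", "accommodation", "satisfaction"]   # "Don't know" kept as a level

n = len(df)
D = np.zeros((n, n))
for col in numeric_cols:
    x = df[col].to_numpy(dtype=float)
    span = np.ptp(x) or 1.0
    D += np.abs(x[:, None] - x[None, :]) / span        # range-scaled absolute difference
for col in categorical_cols:
    x = df[col].to_numpy()
    D += (x[:, None] != x[None, :]).astype(float)      # simple matching dissimilarity
D /= len(numeric_cols) + len(categorical_cols)

Z = linkage(squareform(D, checks=False), method="average")
df["cluster"] = fcluster(Z, t=4, criterion="maxclust")
print(df["cluster"].value_counts())
```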

