A comparative analysis of selected clustering algorithms for criminal profiling

2020 ◽  
Vol 39 (2) ◽  
pp. 464-471
Author(s):  
J.A. Adeyiga ◽  
S.O. Olabiyisi ◽  
E.O. Omidiora

Several criminal profiling systems have been developed to assist law enforcement agencies in solving crimes, but the techniques employed in most of these systems lack the ability to cluster criminals based on their behavioral characteristics. This paper reviews different clustering techniques used in criminal profiling and then selects one fuzzy clustering algorithm (Expectation Maximization) and two hard clustering algorithms (K-means and hierarchical). The selected algorithms were developed and tested on real-life data to produce "profiles" of criminal activity and criminal behavior. The algorithms were implemented using the WEKA software package, and their performance was evaluated using cluster accuracy and time complexity. The results show that the Expectation Maximization algorithm achieved 90.5% cluster accuracy in 8.5 s, while K-means achieved 62.6% in 0.09 s and hierarchical clustering 51.9% in 0.11 s. In conclusion, soft clustering algorithms perform better than hard clustering algorithms in analyzing criminal data.
Keywords: Clustering Algorithm, Profiling, Crime, Membership value
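The soft-versus-hard distinction this abstract turns on can be sketched in a few lines. Below is a minimal pure-Python illustration (the toy 1-D data and the equal-variance Gaussian model are assumptions, not the paper's WEKA setup): K-means commits each point to exactly one cluster, while an EM-style model returns a membership value per cluster.

```python
import math
import random

random.seed(0)

# Hypothetical 1-D feature (e.g. a single behavioural score) drawn from two
# overlapping groups -- data and group parameters are illustrative only.
data = ([random.gauss(0.0, 1.0) for _ in range(50)] +
        [random.gauss(4.0, 1.0) for _ in range(50)])

def kmeans_1d(xs, k=2, iters=20):
    """Hard clustering: each point belongs to exactly one cluster."""
    centers = random.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: (x - centers[j]) ** 2)].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sorted(centers)

def soft_membership(x, centers, sigma=1.0):
    """Soft (EM-style) assignment: a membership value per cluster, summing to 1."""
    weights = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

centers = kmeans_1d(data)
membership = soft_membership(2.0, centers)  # a point between the two groups
```

A point halfway between the groups gets a hard label from K-means but a split membership from the soft model, which is the extra information the "Membership value" keyword refers to.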

2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Lopamudra Dey ◽  
Sanjay Chakraborty

The significance and applications of clustering span a variety of fields. Clustering is an unsupervised process in data mining, which is why proper evaluation of its results and measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indices are used to solve different types of problems, and index selection depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm, analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on a real-time air pollution database and on wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. The paper also describes the nature of the clusters and compares the performances of these clustering algorithms according to the validity assessment, identifying which algorithm is most desirable for forming properly compact clusters on these particular real-life datasets. It examines the behaviour of these clustering algorithms with respect to the validation indices and presents the evaluation results in mathematical and graphical forms.
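The intercluster/intracluster indices this paper analyses can be made concrete. The sketch below (hypothetical points; the paper's exact index definitions may differ) computes one common compactness measure and one separation measure: a good partition has small intracluster spread and large intercluster distance.

```python
import math

# Hypothetical pre-clustered 2-D points; labels and coordinates are illustrative.
clusters = {
    0: [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)],
    1: [(5.0, 5.0), (5.1, 4.9), (4.8, 5.2)],
}

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def intracluster_compactness(clusters):
    """Mean distance of points to their own centroid, averaged over clusters."""
    per_cluster = []
    for points in clusters.values():
        c = centroid(points)
        per_cluster.append(sum(math.dist(p, c) for p in points) / len(points))
    return sum(per_cluster) / len(per_cluster)

def intercluster_separation(clusters):
    """Minimum distance between any pair of cluster centroids."""
    cents = [centroid(points) for points in clusters.values()]
    return min(math.dist(a, b)
               for i, a in enumerate(cents) for b in cents[i + 1:])

intra = intracluster_compactness(clusters)  # small for compact clusters
inter = intercluster_separation(clusters)   # large for well-separated clusters
```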


2013 ◽  
Vol 411-414 ◽  
pp. 1884-1893
Author(s):  
Yong Chun Cao ◽  
Ya Bin Shao ◽  
Shuang Liang Tian ◽  
Zheng Qi Cai

Because many GA-based clustering algorithms suffer from degeneracy and easily fall into local optima, a novel dynamic genetic algorithm for clustering problems (DGA) is proposed. The algorithm adopts variable-length coding to represent individuals and performs the parallel crossover operation within subpopulations of individuals of the same length, which allows DGA to explore the search space more effectively and to automatically obtain the proper number of clusters and the proper partition from a given data set. The algorithm also uses a dynamic crossover probability and an adaptive mutation probability, which prevent the dynamic clustering algorithm from getting stuck at a local optimum. Clustering experiments on three artificial data sets and two real-life data sets show that DGA delivers better performance and higher accuracy on clustering problems.
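The variable-length encoding and the length-restricted crossover described above can be sketched as follows. This is not the paper's DGA: the points, the 0.5 length-penalty weight, and the fitness form are illustrative assumptions. The key idea shown is that an individual is a list of cluster centers whose length (the number of clusters) is itself evolved, while crossover stays within a same-length subpopulation so offspring remain valid.

```python
import math
import random

random.seed(1)

# Illustrative 2-D points with three natural groups of different tightness.
points = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (4.1, 3.9), (8.0, 0.0)]

def fitness(individual):
    """An individual is a variable-length list of cluster centers.
    Fitness = -(SSE + penalty * number of centers): it rewards both a good
    partition and a parsimonious cluster count (penalty 0.5 is arbitrary)."""
    sse = sum(min(math.dist(p, c) ** 2 for c in individual) for p in points)
    return -(sse + 0.5 * len(individual))

def crossover(a, b):
    """Parallel crossover is applied only inside a subpopulation whose
    individuals share the same length, so offspring lengths stay valid."""
    assert len(a) == len(b) and len(a) >= 2
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

two = [(0.0, 0.0), (4.0, 4.0)]
three = [(0.1, 0.05), (4.05, 3.95), (8.0, 0.0)]
child1, child2 = crossover(three, [(0.0, 0.0), (4.0, 4.0), (8.0, 0.0)])
```

On these points the three-center individual scores higher than the two-center one, so selection pressure alone can discover the proper number of clusters.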


2021 ◽  
Vol 4 ◽  
Author(s):  
Jie Yang ◽  
Yu-Kai Wang ◽  
Xin Yao ◽  
Chin-Teng Lin

The K-means algorithm is a widely used clustering algorithm that offers simplicity and efficiency. However, the traditional K-means algorithm chooses its initial cluster centers at random, which makes the clustering results prone to local optima and degrades clustering performance. In this research, we propose an adaptive initialization method for the K-means algorithm (AIMK), which adapts to the characteristics of different datasets and obtains better clustering performance with stable results. For larger or higher-dimensional datasets, we additionally leverage random sampling in AIMK (named AIMK-RS) to reduce the time complexity. Twenty-two real-world datasets were used for performance comparisons. The experimental results show that AIMK and AIMK-RS outperform current initialization methods and several well-known clustering algorithms; in particular, AIMK-RS reduces the time complexity to O(n). Moreover, we exploit AIMK to initialize K-medoids and spectral clustering, where better performance is also observed. These results demonstrate the superior performance and good scalability of AIMK and AIMK-RS. In the future, we would like to apply AIMK to more partition-based clustering algorithms to solve real-life practical problems.
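AIMK's own selection rule is not reproduced here; the sketch below instead shows the classic k-means++ D²-weighted seeding, which illustrates the general idea of distance-aware (rather than uniform-random) initialization that this line of work improves on.

```python
import math
import random

random.seed(2)

# Illustrative 2-D points; the actual AIMK rule differs from this sketch.
points = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0)]

def d2_seeding(points, k):
    """k-means++ style seeding: pick the first center uniformly, then pick
    each next center with probability proportional to its squared distance
    from the already-chosen set, spreading centers across the data."""
    centers = [random.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        r = random.uniform(0.0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
        else:  # guard against floating-point shortfall
            centers.append(points[-1])
    return centers

centers = d2_seeding(points, 3)
```

Points far from every chosen center get a large weight, so the seeds tend to land in distinct groups; a uniform-random pick would frequently place two seeds in the same dense group, which is the local-optimum failure mode the abstract describes.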


2018 ◽  
Vol 2 (1) ◽  
pp. 36-44
Author(s):  
Sitti Sufiah Atirah Rosly ◽  
Balkiah Moktar ◽  
Muhamad Hasbullah Mohd Razali

Air quality is one of the most prominent environmental problems in this era of globalization. Air pollution is poisonous air that comes from car emissions, smog, open burning, chemicals from factories, and other particles and gases. This harmful air can have adverse effects on human health and the environment. In order to provide information on which areas are better for residents in Malaysia, cluster analysis is used to determine which areas can be clustered together based on their air quality, as measured by several air quality substances. Monthly data from 37 monitoring stations in Peninsular Malaysia from 2013 to 2015 were used in this study. The K-Means (KM) clustering algorithm, Expectation Maximization (EM) clustering algorithm, and Density-Based (DB) clustering algorithm were chosen as the techniques for the cluster analysis, utilizing the Waikato Environment for Knowledge Analysis (WEKA) tools. Results show that the K-means clustering algorithm is the best method among these algorithms due to its simplicity and the time taken to build the model. The output of the K-means clustering algorithm shows that it can partition the area into two clusters, namely cluster 0 and cluster 1. Cluster 0 consists of 16 monitoring stations and cluster 1 consists of 36 monitoring stations in Peninsular Malaysia.


2020 ◽  
Vol 13 (2) ◽  
pp. 234-239
Author(s):  
Wang Meng ◽  
Dui Hongyan ◽  
Zhou Shiyuan ◽  
Dong Zhankui ◽  
Wu Zige

Background: Clustering is one of the most important data mining methods. The k-means (c-means) algorithm and its derivatives have been a hotspot of clustering research in recent years. Clustering methods can be divided into two categories according to how they handle uncertainty: hard clustering and soft clustering. Within k-means research, Hard C-Means clustering (HCM) belongs to hard clustering while Fuzzy C-Means clustering (FCM) belongs to soft clustering. Linearly non-separable problems are a big challenge for clustering and classification algorithms, and further improvement is required in the big data era. Objective: The RKM algorithm based on fuzzy roughness is also a hot topic in current research. Rough set theory and fuzzy theory are powerful tools for depicting uncertainty and are alike in essence, so RKM can be kernelized by the means used in KFCM. In this paper, we put forward a Kernel Rough K-Means algorithm (KRKM) to solve nonlinear problems for RKM. KRKM expands RKM's ability to process complex data and addresses the uncertainty of soft clustering. Methods: This paper presents the procedure of the Kernel Rough K-Means algorithm (KRKM). Clustering accuracy was then compared using data sets from the UCI repository. The experiments showed that KRKM improved clustering accuracy compared with the RKM algorithm. Results: The classification precision of both KFCM and KRKM improved. KRKM's precision was slightly higher than KFCM's, indicating that KRKM is an attractive alternative clustering algorithm with a good clustering effect on nonlinear problems. Conclusion: Comparison with the precision of the KFCM algorithm shows that KRKM has a slight advantage in clustering accuracy. KRKM is one of the effective clustering algorithms that can be selected for nonlinear clustering.
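The kernelization step the abstract relies on can be made concrete: the distance from a point to a cluster mean in feature space is computable from kernel values alone, without ever forming the feature map. The sketch below (the RBF kernel and the gamma value are illustrative choices, not the paper's KRKM) shows the standard kernel-trick expansion.

```python
import math

def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; gamma=0.5 is an illustrative choice."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kernel_dist2(x, cluster, kernel=rbf):
    """Squared distance from phi(x) to the cluster mean in feature space,
    via the kernel trick:
    K(x,x) - (2/n) * sum_i K(x, p_i) + (1/n^2) * sum_ij K(p_i, p_j)."""
    n = len(cluster)
    t1 = kernel(x, x)
    t2 = (2.0 / n) * sum(kernel(x, p) for p in cluster)
    t3 = sum(kernel(p, q) for p in cluster for q in cluster) / n ** 2
    return t1 - t2 + t3

# A point inside its own singleton cluster has zero feature-space distance;
# a distant point has a strictly larger one.
d_same = kernel_dist2((0.0, 0.0), [(0.0, 0.0)])
d_far = kernel_dist2((3.0, 3.0), [(0.0, 0.0), (0.1, 0.0)])
```

Replacing the Euclidean distance in a (rough) k-means assignment step with `kernel_dist2` is what lets such algorithms separate clusters that are not linearly separable in the input space.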


2013 ◽  
Vol 3 (4) ◽  
pp. 1-14 ◽  
Author(s):  
S. Sampath ◽  
B. Ramya

Cluster analysis is a branch of data mining that plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers identify natural subgroups in a data set. Different types of clustering algorithms are available in the literature, the most popular of which is k-means clustering. Although k-means clustering is widely used, its application requires knowledge of the number of clusters present in the given data set, and several solutions are available in the literature to overcome this limitation. The k-means method creates a disjoint and exhaustive partition of the data set; however, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that is capable of producing rough clusters automatically, without requiring the user to supply the number of clusters as input. The efficiency of the algorithm in detecting the number of clusters present in a data set has been studied with the help of some real-life data sets. Further, a nonparametric statistical analysis of the experimental results has been carried out to assess the efficiency of the proposed algorithm in automatically detecting the number of clusters, using a rough version of the Davies-Bouldin index.
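The rough version of the Davies-Bouldin index used by the paper is not specified here; the sketch below computes the classical crisp index, which is the quantity a model-selection loop would minimize when detecting the number of clusters automatically (the points are illustrative).

```python
import math

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def scatter(points):
    """Mean distance of a cluster's points to its own centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def davies_bouldin(partition):
    """Classical Davies-Bouldin index: average, over clusters, of the
    worst-case ratio of combined scatter to centroid separation.
    Lower values indicate a better partition."""
    cents = [centroid(pts) for pts in partition]
    scats = [scatter(pts) for pts in partition]
    k = len(partition)
    total = 0.0
    for i in range(k):
        total += max((scats[i] + scats[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

# Two candidate partitions of the same illustrative points: the natural
# 2-cluster split should score lower (better) than an arbitrary one.
good = [[(0.0, 0.0), (0.2, 0.1)], [(5.0, 5.0), (5.1, 4.9)]]
bad = [[(0.0, 0.0), (5.0, 5.0)], [(0.2, 0.1), (5.1, 4.9)]]
```

Scoring candidate partitions for a range of cluster counts and keeping the one with the lowest index is the standard way such an index drives automatic detection of the number of clusters.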


2020 ◽  
Vol 10 (8) ◽  
pp. 1815-1824
Author(s):  
S. Nithya Roopa ◽  
N. Nagarajan

The amount of data produced in health informatics is growing large, and analyzing this huge amount of data requires considerable knowledge. The basic aim of health informatics is to take real-world medical data from all levels of human existence and use it to improve our understanding of medicine and medical practice. Huge amounts of unlabeled data are available in many real-life data-mining tasks, e.g., uncategorized messages in an automatic email categorization system, genes of unknown function in gene function prediction, and so on. Labelled data is often limited and expensive to produce, since labelling typically requires human expertise. Consequently, semi-supervised learning has become a topic of significant recent interest. This research work proposes a new semi-supervised clustering approach, in which the performance of unsupervised clustering algorithms is enhanced with limited supervision in the form of labels on constraints or data. A previous system designed a Clustering Guided Hybrid support vector machine based Sparse Structural Learning (CGHSSL) method for feature selection; however, it does not produce satisfactory accuracy. In this research, a clustering-guided sparse structural learning algorithm based on a Convolutional Neural Network (CNN) is proposed. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is employed to learn cluster labels of input samples with higher accuracy, guiding feature selection at the same time. Concurrently, prediction of cluster labels is also performed by the CNN by means of a hidden structure shared by the various characteristics. The parameters of the CNN are then optimized by a Multi-objective Bee Colony (MBO) algorithm that can unravel feature correlations to render more consistent outcomes. Row-wise sparse designs are then balanced to yield a design suited to feature selection.
This semi-supervised algorithm is used to choose important characteristics from the Leukemia1 dataset more efficiently, so the dataset size is reduced significantly. The proposed Semi-Supervised Clustering-Guided Sparse Structural Learning (SSCGSSL) technique is also used to further improve clustering performance. The experimental results show that the proposed system achieves better performance than the existing system in terms of Accuracy, Entropy, Purity, Normalized Mutual Information (NMI), and F-measure.
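The DBSCAN component of the pipeline can be shown in isolation. This minimal crisp sketch is not the paper's full CNN/MBO system, and the points and parameters are illustrative; it shows how density-based clustering labels samples without a preset cluster count, flagging low-density points as noise.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a cluster id per point, or -1 for noise.
    A point is a core point if at least min_pts points (itself included)
    lie within eps of it."""
    labels = [None] * len(points)          # None = unvisited
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                 # provisionally noise
            continue
        cluster += 1                       # start a new cluster from core i
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:            # noise reached from a core: border
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:         # j is a core point: keep expanding
                seeds.extend(jn)
    return labels

# Two tight illustrative groups plus one far-away outlier.
pts = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
       (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
       (20.0, 20.0)]
labels = dbscan(pts, eps=0.5, min_pts=2)
```

In the proposed pipeline these density-derived labels play the role of the cluster labels that guide feature selection.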


Author(s):  
SHITONG WANG ◽  
HAIFENG JIANG ◽  
HONGJUN LU

Switching regression problems are attracting more and more attention in a variety of disciplines such as pattern recognition, economics, and databases, and many approaches to solving them have been investigated. In this paper, we present a new integrated clustering algorithm, GFC, that combines the gravity-based clustering algorithm GC with fuzzy clustering. GC, a new hard clustering algorithm presented here, is based on Newton's well-known law of gravity. Our theoretical analysis shows that GFC converges to a local minimum of the objective function. Our experimental results illustrate that GFC performs better on switching regression problems than standard fuzzy clustering algorithms, especially in terms of convergence speed. Hence GFC is a new, more efficient algorithm for switching regression problems.
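The gravity-based idea behind GC can be sketched as an agglomeration rule: treat cluster sizes as masses and merge the pair with the strongest Newton-style attraction. This is an illustrative reading of "gravity-based clustering", not the paper's exact GC procedure; the constant G and the data are assumptions.

```python
import math

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def attraction(c1, c2, G=1.0):
    """Newton-style attraction between two clusters: masses are the cluster
    sizes, distance is between centroids (F = G * m1 * m2 / d^2)."""
    m1, m2 = len(c1), len(c2)
    d = math.dist(centroid(c1), centroid(c2))
    return G * m1 * m2 / (d ** 2)

def merge_strongest(clusters):
    """One agglomeration step: merge the pair with the greatest attraction."""
    i, j = max(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: attraction(clusters[ij[0]], clusters[ij[1]]))
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

# Two nearby singletons and one distant one: the near pair attracts most.
clusters = [[(0.0, 0.0)], [(0.3, 0.0)], [(9.0, 9.0)]]
step = merge_strongest(clusters)
```

Because attraction grows with cluster mass as well as proximity, repeated merging pulls dense regions together first, which is the hard-clustering behaviour GC contributes before the fuzzy refinement.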


2020 ◽  
Vol 24 (5) ◽  
pp. 977-992
Author(s):  
Yue Liu ◽  
Bufang Li

Clustering algorithms are a foundation and an important technology in data mining. In the real world, data itself often has a hierarchical structure, and hierarchical clustering aims to construct a cluster tree that reveals the underlying modal structure of a complex density. Due to this inherent complexity, most existing hierarchical clustering algorithms are designed heuristically, without an explicit objective function, which limits their utilization and analysis. K-means clustering, a well-known, simple yet effective algorithm that can be expressed from the viewpoint of probability distributions, has an inherent connection to mixtures of Gaussians (MoG). This motivates us to combine Bayesian analysis with the K-means algorithm and to develop a hierarchical clustering based on K-means under a probability distribution framework, in contrast to existing hierarchical K-means algorithms that process data in a single pass with heuristic strategies. To this end, we propose an explicit objective function for hierarchical clustering, termed Bayesian hierarchical K-means (BHK-means). In our method, a cascaded clustering tree is constructed in which all layers interact with each other in a network-like manner: the clustering result of each layer is influenced by its parent and child nodes, and is therefore dynamically improved in accordance with the global hierarchical clustering objective function. The objective function is solved using the same algorithm as K-means, the Expectation-Maximization algorithm. Experimental results on both synthetic data and benchmark datasets demonstrate the effectiveness of our algorithm over existing related ones.
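The K-means/MoG connection the method builds on is easiest to see in a small EM loop. The sketch below (toy 1-D data and an equal-variance two-component model are assumptions) is a generic mixture fit, not BHK-means itself: as the shared variance shrinks toward zero, the soft E-step responsibilities harden into the nearest-center assignment of K-means.

```python
import math
import random

random.seed(3)

# Illustrative 1-D data from two Gaussians; all parameters are toy values.
data = ([random.gauss(0.0, 0.7) for _ in range(60)] +
        [random.gauss(5.0, 0.7) for _ in range(60)])

def em_two_gaussians(xs, iters=30):
    """EM for a 2-component, equal-variance mixture of Gaussians."""
    mu = [min(xs), max(xs)]                # crude but well-separated init
    var, pi = 1.0, [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            w = [pi[k] * math.exp(-((x - mu[k]) ** 2) / (2 * var))
                 for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate means, shared variance, and mixing weights.
        n = [sum(r[k] for r in resp) for k in range(2)]
        mu = [sum(r[k] * x for r, x in zip(resp, xs)) / n[k] for k in range(2)]
        var = sum(r[k] * (x - mu[k]) ** 2
                  for r, x in zip(resp, xs) for k in range(2)) / len(xs)
        pi = [n[k] / len(xs) for k in range(2)]
    return sorted(mu), var

mu, var = em_two_gaussians(data)
```

This same E-step/M-step alternation is the solver the abstract refers to; BHK-means applies it to a global hierarchical objective rather than to a single flat mixture.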

