Unsupervised Learning and Clustering Algorithms

1996 ◽  
pp. 99-121 ◽  
Author(s):  
Raúl Rojas
Author(s):  
Deepali Virmani ◽  
Nikita Jain ◽  
Ketan Parikh ◽  
Shefali Upadhyaya ◽  
Abhishek Srivastav

This article describes how data is relevant and whether it can be organized, linked with other data, and grouped into clusters. Clustering is the process of organizing a given set of objects into a set of disjoint groups called clusters. There are a number of clustering algorithms, such as k-means, k-medoids, and normalized k-means. The focus is on the efficiency and accuracy of these algorithms, on the time clustering takes, and on reducing overlap between clusters. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The k-means algorithm partitions data into K clusters starting from randomly chosen centroids, and its reliance on numeric values prohibits it from being used to cluster real-world data containing categorical values. Poor selection of initial centroids can result in poor clustering. This article proposes a variant of k-means that selects the initial centres and normalizes the data, resulting in better clustering, reduced overlap, and less time required for clustering.
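The abstract does not spell out the proposed variant, so the following is only a minimal sketch of the two ingredients it names: normalizing the data and choosing initial centres deliberately instead of at random. The min-max scaling and the farthest-point seeding rule are assumptions made for illustration, not the authors' method, and the function names and sample data are hypothetical.

```python
import math


def min_max_normalize(points):
    """Rescale every feature to [0, 1] so no single feature dominates distances."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    return [
        tuple((p[d] - lows[d]) / (highs[d] - lows[d]) if highs[d] > lows[d] else 0.0
              for d in range(dims))
        for p in points
    ]


def farthest_point_init(points, k):
    """Assumed seeding rule: start at the first point, then repeatedly add the
    point that is farthest from all centres chosen so far."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    return centres


def kmeans(points, k, max_iter=100):
    """Lloyd's iterations: assign to the nearest centre, recompute means, repeat."""
    centres = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points]
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centres[j])  # keep the centre of an empty cluster
        if new_centres == centres:  # assignments are stable, so stop
            break
        centres = new_centres
    return labels, centres


# Hypothetical data: the second feature is on a much larger scale than the first.
raw = [(1.0, 100.0), (1.2, 110.0), (0.8, 90.0), (9.0, 900.0), (9.5, 950.0), (8.7, 880.0)]
labels, centres = kmeans(min_max_normalize(raw), k=2)
```

Because the seeding here is deterministic, repeated runs on the same data give the same clusters, which is one way to avoid the poor results that an unlucky random initialization can produce.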


Author(s):  
SHI ZHONG ◽  
TAGHI M. KHOSHGOFTAAR ◽  
NAEEM SELIYA

Recently, data mining methods have gained importance in addressing network security issues, including network intrusion detection, a challenging task in network security. Intrusion detection systems aim to identify attacks with a high detection rate and a low false alarm rate. Classification-based data mining models for intrusion detection are often ineffective in dealing with dynamic changes in intrusion patterns and characteristics. Consequently, unsupervised learning methods have been given a closer look for network intrusion detection. We investigate multiple centroid-based unsupervised clustering algorithms for intrusion detection, and propose a simple yet effective self-labeling heuristic for detecting attack and normal clusters of network traffic audit data. The clustering algorithms investigated include k-means, Mixture-Of-Spherical Gaussians, Self-Organizing Map, and Neural-Gas. The network traffic datasets provided by the DARPA 1998 offline intrusion detection project are used in our empirical investigation, which demonstrates the feasibility and promise of unsupervised learning methods for network intrusion detection. In addition, a comparative analysis shows the advantage of clustering-based methods over supervised classification techniques in identifying new or unseen attack types.


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0246529
Author(s):  
Mikhail Kanevski

The paper deals with the analysis of the spatial distribution of the Swiss population using fractal concepts and unsupervised learning algorithms. The research methodology is based on the development of a high-dimensional feature space by calculating local growth curves, which are widely used in fractal dimension estimation, and on the application of clustering algorithms in order to reveal the patterns of spatial population distribution. The notion “unsupervised” also means that only some general criteria (density, dimensionality, homogeneity) are used to construct the input feature space, without adding any supervised or expert knowledge. The approach is very powerful and provides comprehensive local information about the density and homogeneity/fractality of spatially distributed point patterns.
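As a rough illustration of a local growth curve, the sketch below counts how many points fall within a series of growing radii around one location and estimates a local dimension from the log-log slope of that curve. This is an assumed, sandbox-counting style reading of the term; the paper's exact construction, radii, and features are not given in the abstract, and all names and data here are hypothetical.

```python
import math


def local_growth_curve(points, centre, radii):
    """Number of points within each radius of `centre`.
    If `centre` is itself one of the points, every count is at least 1."""
    return [sum(1 for p in points if math.dist(p, centre) <= r) for r in radii]


def log_log_slope(radii, counts):
    """Least-squares slope of log(count) against log(radius).
    Requires every count to be positive."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Hypothetical point pattern: unit-spaced points along a straight line.
line = [(float(i), 0.0) for i in range(-50, 51)]
radii = [2.0, 4.0, 8.0, 16.0, 32.0]
curve = local_growth_curve(line, (0.0, 0.0), radii)
slope = log_log_slope(radii, curve)  # close to 1 for a one-dimensional pattern
```

Stacking one such curve per location gives a feature vector per point, which a clustering algorithm could then group; that last step is left out here.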


2021 ◽  
Vol 15 ◽  
Author(s):  
Eshan Bajal ◽  
Vipin Katara ◽  
Madhulika Bhatia ◽  
Madhurima Hooda

Abstract: The two most widely used and easily implementable algorithms for clustering and classification-based analysis of data in the unsupervised learning domain are Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and K-means cluster analysis. These two techniques can handle most cases effectively when the data has a lot of randomness and no clear set to use as a parameter, as there would be with linear or logistic regression algorithms. However, few papers compare the two in a controlled environment to observe which one performs better and under which conditions. In this paper, a renal adenocarcinoma dataset is analyzed, both DBSCAN and K-means are applied to the dataset, and the results are examined. The efficacy of the two techniques in this study is compared, and the merits and demerits observed are enumerated. Further, the interaction of t-SNE with the generated clusters is explored.
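One structural difference between the two techniques is that DBSCAN can mark isolated points as noise, whereas k-means assigns every point to some cluster. The following is a minimal, unoptimized DBSCAN sketch that makes this visible. The parameter names `eps` and `min_pts` and the toy data are assumptions for illustration and have no connection to the paper's dataset or settings.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, NOISE (-1) for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (point i included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(j_neighbours)
    return labels


# Hypothetical data: two dense groups and one isolated point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
        (5.0, 30.0)]
labels = dbscan(data, eps=1.5, min_pts=3)
```

With these settings the isolated point receives the noise label, while a k-means run with two clusters would have to attach it to one of the two groups.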


Author(s):  
Kiruthika Ramanathan ◽  
Sheng Uei Guan

In this chapter we present a recursive approach to unsupervised learning. The algorithm proposed, while similar to ensemble clustering, does not need to execute several clustering algorithms and find consensus between them. Instead, grouping is done between two subsets of data at a time, thereby saving training time. Also, only two kinds of clustering algorithms are used in creating the recursive clustering ensemble, as opposed to the multitude of clusterers required by ensemble clusterers. A recursive clusterer is proposed for both single- and multi-order neural networks. Empirical results show as much as 50% improvement in clustering accuracy when compared to benchmark clustering algorithms.


2018 ◽  
Vol 17 (03) ◽  
pp. 841-856 ◽  
Author(s):  
Giyasettin Ozcan

In this study, we consider the problem of unsupervised learning from multi-dimensional datasets. In particular, we consider k-means clustering, which requires long execution times on multi-dimensional datasets. In order to speed up clustering while preserving accuracy, we introduce a new algorithm that we term Canopy[Formula: see text]. The algorithm utilizes canopies and statistical techniques. Its efficient initialization and normalization methodologies also contribute to the improvement. Furthermore, we consider early termination of the clustering computation, provided that an intermediate result of the computation is accurate enough. We compared our algorithm with four popular clustering algorithms. The results indicate that our algorithm speeds up the clustering computation by at least 2X. We also analyzed the contribution of early termination. The results show that a further 2X improvement can be obtained while incurring a 0.1% error rate. We also observe that our Canopy[Formula: see text] algorithm benefits from early termination, which introduces an extra 1.2X performance improvement.
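The abstract does not give the details of the proposed algorithm, so the sketch below only illustrates the two general ideas it mentions: a canopy pass that produces initial centres cheaply, and centroid iterations that terminate early once the centres barely move. It assumes a k-means-style update, the usual loose and tight canopy thresholds (`t1`, `t2`), and a movement tolerance `tol`; all of these names and values are hypothetical and are not taken from the paper.

```python
import math


def canopy_centres(points, t1, t2):
    """Canopy pass with loose threshold t1 and tight threshold t2 (t1 > t2 >= 0).
    Returns the mean of each canopy as a candidate initial centre."""
    remaining = list(points)
    centres = []
    while remaining:
        seed = remaining[0]
        canopy = [p for p in remaining if math.dist(p, seed) <= t1]
        centres.append(tuple(sum(c) / len(canopy) for c in zip(*canopy)))
        # Points tightly bound to the seed can no longer start a new canopy.
        remaining = [p for p in remaining if math.dist(p, seed) > t2]
    return centres


def kmeans_early_stop(points, centres, tol=1e-3, max_iter=100):
    """Centroid iterations that stop once no centre moves more than `tol`."""
    centres = list(centres)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centres)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centres, updated))
        centres = updated
        if shift <= tol:  # intermediate result is treated as accurate enough
            break
    return centres, iteration


# Hypothetical data: two well-separated groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
seeds = canopy_centres(data, t1=3.0, t2=2.0)
final_centres, iterations = kmeans_early_stop(data, seeds)
```

A larger `tol` trades accuracy for fewer iterations, which is the kind of trade-off the abstract reports in terms of speed-up versus error rate.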


2020 ◽  
Author(s):  
Mikhail Kanevski ◽  
Federico Amato ◽  
Fabian Guignard

The research deals with the application of advanced exploratory tools to study hourly spatio-temporal air pollution data collected by the NABEL monitoring network in Switzerland. The data analyzed consist of several pollutants, mainly NO2, O3, and PM2.5, measured during the last two years at 16 stations distributed over the country. The data are considered in two different ways: 1) as multivariate time series measured at the same station (different pollutants and environmental variables, such as temperature), and 2) as spatially distributed time series of the same pollutant. In the first case, it is interesting to study both univariate and multivariate time series and their complexity. In the second case, similarity between time series distributed in space can indicate similar underlying phenomena and environmental conditions giving rise to the pollution. An important aspect of the data is that they are collected at places with different land use classes (urban, suburban, rural, etc.), which helps in understanding and interpreting the results.

Nowadays, unsupervised learning algorithms are widely applied in intelligent exploratory data analysis. Well-known tasks of unsupervised learning include manifold learning, dimensionality reduction, and clustering. In the present research, intrinsic and fractal dimensions, measures characterizing the similarity and redundancy in data, and machine learning clustering algorithms were adapted and applied. The results obtained give new and important information on the spatio-temporal patterns of air pollution. The following results, among others, can be mentioned: 1) some measures of similarity (e.g., complexity-independent distance) are efficient in discriminating between time series; 2) the intrinsic dimension, characterizing the ensemble of monitoring data, is pollutant dependent; 3) the clustering of the observed time series can be interpreted using the available information on land use.

