Estimating the Number of Clusters in Multivariate Data by Self-Organizing Maps

1999 ◽  
Vol 09 (03) ◽  
pp. 195-202 ◽  
Author(s):  
JOSÉ ALFREDO FERREIRA COSTA ◽  
MÁRCIO LUIZ DE ANDRADE NETTO

Determining the structure of data without prior knowledge of the number of clusters or any information about their composition is a problem of interest in many fields, such as image analysis, astrophysics, and biology. A set of n patterns in a p-dimensional feature space must be partitioned so that patterns in a given cluster are more similar to each other than to those in other clusters. As there are approximately K^n/K! possible ways of partitioning the patterns among K clusters, finding the best solution is very hard when n is large. The search space grows further when the number of partitions is not known a priori. Although the self-organizing feature map (SOM) can be used to visualize clusters, automating knowledge discovery with the SOM is a difficult task. This paper proposes region-based image processing methods to post-process the U-matrix obtained after the unsupervised learning performed by the SOM. Mathematical morphology is applied to identify regions of similar neurons. The number of regions and their labels are found automatically and are related to the number of clusters in a multivariate data set. New data can be classified by labeling them according to the best-matching neuron. Simulations using data sets drawn from finite mixtures of p-variate normal densities are presented, along with the advantages and drawbacks of the method.
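As a rough illustration of this pipeline, the sketch below trains a small SOM, takes its U-matrix, and counts connected low-dissimilarity regions. It assumes the third-party MiniSom package, and it substitutes a simple threshold plus connected-component labeling for the paper's mathematical morphology operators.

```python
import numpy as np
from minisom import MiniSom   # assumption: third-party MiniSom package
from scipy import ndimage

# Toy data: three well-separated Gaussian blobs in 2D
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(m, 0.3, (200, 2)) for m in (0.0, 2.0, 4.0)])

# Unsupervised SOM learning, then the U-matrix of inter-neuron distances
som = MiniSom(15, 15, 2, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(data, 5000)
u_matrix = som.distance_map()

# Stand-in for the morphological segmentation: label connected regions
# of low dissimilarity; their count estimates the number of clusters.
regions, n_clusters = ndimage.label(u_matrix < u_matrix.mean())
print("estimated number of clusters:", n_clusters)
```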

2011 ◽  
Vol 16 (4) ◽  
pp. 488-504 ◽  
Author(s):  
Pavel Stefanovič ◽  
Olga Kurasova

In this article, additional visualizations of self-organizing maps (SOM) are investigated. The main objective of self-organizing maps is data clustering and its graphical presentation. The SOM visualization capabilities of four systems (NeNet, SOM-Toolbox, Databionic ESOM, and Viscovery SOMine) are examined; each system has its own additional tools for visualizing SOM. A comparative analysis is made on two data sets: Fisher's iris data set and the economic indices of the European Union countries. A new SOM system is also introduced and investigated. The system has a specific visualization tool that is missing in the other SOM systems: for the data items that fall into the same SOM cell, it shows the proportions belonging to the different classes.
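A minimal sketch of the idea behind that tool, assuming a trained SOM such as the MiniSom instance from the previous sketch: for each map cell, compute the class proportions of the data items whose best-matching unit is that cell.

```python
from collections import Counter

def cell_class_proportions(som, data, labels):
    """Per-cell class proportions of the items mapped to each SOM cell
    (illustrative only; not the authors' system)."""
    counts = {}
    for x, y in zip(data, labels):
        cell = som.winner(x)                     # best-matching unit
        counts.setdefault(cell, Counter())[y] += 1
    return {cell: {cls: n / sum(c.values()) for cls, n in c.items()}
            for cell, c in counts.items()}
```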


2006 ◽  
Vol 5 (2) ◽  
pp. 125-136 ◽  
Author(s):  
Jimmy Johansson ◽  
Patric Ljung ◽  
Mikael Jern ◽  
Matthew Cooper

Parallel coordinates is a well-known technique for visualizing multivariate data. As the size of the data set increases, the parallel coordinates display becomes far too cluttered to perceive any structure. We tackle this problem by constructing high-precision textures to represent the data. By using transfer functions that operate on the high-precision textures, it is possible to highlight different aspects of the entire data set or of clusters within the data. Our methods are implemented in both standard 2D parallel coordinates and 3D multi-relational parallel coordinates. Furthermore, when visualizing a larger number of clusters, a technique called 'feature animation' may be used as guidance by presenting various cluster statistics. A case study illustrates the analysis process when analysing large multivariate data sets with our proposed techniques.
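The sketch below is a simplified, single-machine take on the texture idea: it rasterizes every polyline segment between adjacent axes into a count texture and applies a log transfer function for display. Resolutions and the demo data are illustrative, not the authors' implementation.

```python
import numpy as np
import matplotlib.pyplot as plt

def pc_texture(data, width=256, height=256):
    """Accumulate all polyline segments between each pair of adjacent
    axes into count textures (one texture per axis pair)."""
    mins, maxs = data.min(0), data.max(0)
    norm = (data - mins) / np.where(maxs > mins, maxs - mins, 1)
    n, d = norm.shape
    textures = []
    for i in range(d - 1):
        tex = np.zeros((height, width))
        y0, y1 = norm[:, i], norm[:, i + 1]
        for col in range(width):
            t = col / (width - 1)
            rows = np.clip(((1 - t) * y0 + t * y1) * (height - 1),
                           0, height - 1).astype(int)
            np.add.at(tex, (rows, np.full(n, col)), 1)  # count overlapping lines
        textures.append(tex)
    return textures

# Transfer function: log scaling keeps dense and sparse structure visible.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(m, 0.5, (500, 4)) for m in (0, 3)])
plt.imshow(np.log1p(pc_texture(data)[0]), origin='lower', aspect='auto')
plt.show()
```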


Author(s):  
W. Karel ◽  
M. Doneus ◽  
C. Briese ◽  
G. Verhoeven ◽  
N. Pfeifer

We present a method for the automatic geo-referencing of archaeological photographs (UPs) captured aboard unmanned aerial vehicles (UAVs). We do so with the help of pre-existing ortho-photo maps (OPMs) and digital surface models (DSMs). Typically, these pre-existing data sets are based on data captured at a widely different point in time. This renders the detection (and hence the matching) of homologous feature points in the UPs and OPMs infeasible, mainly due to temporal variations of vegetation and illumination. Facing this difficulty, we opt for the normalized cross-correlation coefficient of perspectively transformed image patches as the measure of image similarity. Applying a threshold to this measure, we detect candidates for homologous image points, resulting in a distinctive but computationally intensive method. To lower computation times, we reduce the dimensionality and extent of the search space by making use of a priori knowledge of the data sets. By assigning terrain heights interpolated in the DSM to the image points found in the OPM, we generate control points. We introduce the respective observations into a bundle block, from which gross errors, i.e. false matches, are eliminated during its robust adjustment. A test of our approach on a UAV image data set demonstrates its potential and raises hope that large image archives can be processed successfully.
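For concreteness, here is a sketch of the similarity measure named above, with the perspective transformation of the patches omitted:

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation coefficient of two same-size patches."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

# Candidate homologous points are locations whose NCC exceeds a threshold,
# e.g. keep (row, col) wherever ncc(up_patch, opm_patch) > 0.7.
```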


2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are commonly used in the data mining field. In this paper, we introduce a weighted self-organizing map for clustering, analysis, and visualization of mixed data (continuous/binary). Weights and prototypes are learned simultaneously, ensuring an optimized clustering. The higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process over the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set, and three other mixed data sets. The results show good topological ordering and homogeneous clusters.
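A minimal sketch of the weighted-distance idea (not the paper's full learning rule, which updates weights and prototypes simultaneously): the best-matching prototype is found under a per-variable weighted metric, so higher-weight variables contribute more to the match.

```python
import numpy as np

def weighted_bmu(x, prototypes, weights):
    """Index of the best-matching prototype under a weighted squared
    Euclidean distance (illustrative sketch only)."""
    d = ((prototypes - x) ** 2 * weights).sum(axis=1)
    return int(np.argmin(d))
```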


2020 ◽  
Vol 34 (04) ◽  
pp. 5620-5627 ◽  
Author(s):  
Murat Sensoy ◽  
Lance Kaplan ◽  
Federico Cerutti ◽  
Maryam Saleki

Deep neural networks are often ignorant about what they do not know and overconfident when they make uninformed predictions. Some recent approaches quantify classification uncertainty directly by training the model to output high uncertainty for data samples close to class boundaries or from outside the training distribution. These approaches use an auxiliary data set during training to represent out-of-distribution samples. However, selecting or creating such an auxiliary data set is non-trivial, especially for high-dimensional data such as images. In this work, we develop a novel neural network model that expresses both aleatoric and epistemic uncertainty in order to distinguish decision-boundary and out-of-distribution regions of the feature space. To this end, variational autoencoders and generative adversarial networks are incorporated to automatically generate out-of-distribution exemplars for training. Through extensive analysis, we demonstrate that the proposed approach provides better uncertainty estimates for in-distribution samples, out-of-distribution samples, and adversarial examples on well-known data sets than state-of-the-art approaches, including recent Bayesian approaches for neural networks and anomaly detection methods.
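As a sketch of how epistemic uncertainty can be read off such a model, the snippet below follows the evidential (Dirichlet) formulation associated with this line of work; the generative OOD-exemplar training itself is not reproduced, and the ReLU mapping from logits to evidence is an assumption.

```python
import torch

def dirichlet_uncertainty(logits):
    # Non-negative evidence; Dirichlet parameters alpha = evidence + 1
    evidence = torch.relu(logits)
    alpha = evidence + 1.0
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength                    # expected class probabilities
    k = logits.shape[-1]
    epistemic = k / strength.squeeze(-1)        # uncertainty mass in (0, 1]
    return probs, epistemic
```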


2021 ◽  
Author(s):  
Kezia Lange ◽  
Andreas C. Meier ◽  
Michel Van Roozendael ◽  
Thomas Wagner ◽  
Thomas Ruhtz ◽  
...  

Airborne imaging DOAS and ground-based stationary and mobile DOAS measurements were conducted during the ESA-funded S5P-VAL-DE-Ruhr campaign in September 2020 in the Ruhr area. The Ruhr area, located in western Germany, is a pollution hotspot in Europe with an urban character as well as large industrial emitters. The measurements are used to validate data from the Sentinel-5P TROPOspheric Monitoring Instrument (TROPOMI), with a focus on the NO₂ tropospheric vertical column product.

Seven flights were performed with the airborne imaging DOAS instrument, AirMAP, providing continuous maps of NO₂ in the layers below the aircraft. These flights cover many S5P ground pixels within an area of about 40 km side length and were accompanied by ground-based stationary measurements and three mobile car DOAS instruments. Stationary measurements were conducted by two Pandora, two zenith-sky, and two MAX-DOAS instruments distributed over three target areas, partly as long-term measurements over a one-year period.

Airborne and ground-based measurements were compared to evaluate the representativeness of the measurements in time and space. With a resolution of about 100 × 30 m², the AirMAP data create a link between the ground-based measurements and the TROPOMI measurements, whose resolution is 3.5 × 5.5 km², and are therefore well suited to validate TROPOMI's tropospheric NO₂ vertical column.

The measurements on the seven flight days show strong variability depending on the target area, the day of the week, and meteorological conditions. We found an overall low bias of the TROPOMI operational NO₂ data for all three target areas, but with varying magnitude on different days. The campaign data set is compared to custom TROPOMI NO₂ products using different auxiliary data, such as albedo or a priori vertical profiles, to evaluate their influence on the TROPOMI data product. Analyzing and comparing the different data sets provides more insight into the high spatial and temporal heterogeneity of NO₂ and its impact on satellite observations and their validation.


2021 ◽  
Author(s):  
Magnus Dehli Vigeland ◽  
Thore Egeland

We address computational and statistical aspects of DNA-based identification of victims in the aftermath of disasters (disaster victim identification, DVI). Current methods and software for such identification typically consider each victim individually, leading to suboptimal identification power and potential inconsistencies in the statistical summary of the evidence. We resolve these problems by performing joint identification of all victims, using the complete genetic data set. Individual identification probabilities, conditional on all available information, are derived from the joint solution in the form of posterior pairing probabilities. A closed formula is obtained for the a priori number of possible joint solutions to a given DVI problem. This number increases quickly with the number of victims and missing persons, posing computational challenges for brute-force approaches. We address this complexity with a preparatory sequential step that reduces the search space. The examples show that realistic cases are handled efficiently. User-friendly implementations of all methods are provided in the R package dvir, freely available on all platforms.
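One plausible reading of that count, sketched in Python under the assumption that a joint solution pairs each victim with at most one distinct missing person (consult the paper or the dvir package for the exact closed formula):

```python
from math import comb, factorial

def n_joint_solutions(n_victims, n_missing):
    """Number of partial injective assignments of victims to missing
    persons: choose k victims, choose k missing persons, pair them."""
    return sum(comb(n_victims, k) * comb(n_missing, k) * factorial(k)
               for k in range(min(n_victims, n_missing) + 1))

print(n_joint_solutions(5, 5))   # grows quickly: 1546 joint solutions
```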


Kybernetes ◽  
2019 ◽  
Vol 48 (9) ◽  
pp. 2006-2029
Author(s):  
Hongshan Xiao ◽  
Yu Wang

Purpose: Feature space heterogeneity exists widely in various application fields of classification techniques, such as customs inspection decisions, credit scoring, and medical diagnosis. This paper aims to study the relationship between feature space heterogeneity and classification performance.

Design/methodology/approach: A measurement is first developed for measuring and identifying any significant heterogeneity in the feature space of a data set. The main idea of this measurement is derived from meta-analysis. For a data set with significant feature space heterogeneity, a classification algorithm based on factor analysis and clustering is proposed to learn the data patterns, which, in turn, are used for data classification.

Findings: The proposed approach has two main advantages over previous methods. The first lies in feature transformation using orthogonal factor analysis, which yields new features without redundancy or irrelevance. The second rests on partitioning samples to capture the feature space heterogeneity reflected by differences in factor scores. The validity and effectiveness of the proposed approach are verified on a number of benchmark data sets.

Research limitations/implications: Using the measurement to guide the heterogeneity elimination process is an interesting topic for future research. In addition, developing a classification algorithm that enables scalable and incremental learning for large data sets with significant feature space heterogeneity is also an important issue.

Practical implications: Measuring and eliminating the feature space heterogeneity possibly existing in the data is important for accurate classification. This study provides a systematic approach to feature space heterogeneity measurement and elimination for better classification performance, which is favorable for applications of classification techniques to real-world problems.

Originality/value: A measurement based on meta-analysis for measuring and identifying any significant feature space heterogeneity in a classification problem is developed, and an ensemble classification framework is proposed to deal with feature space heterogeneity and improve classification accuracy.
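A compact sketch of the two-stage idea (orthogonal factor scores, a partition over them, then one classifier per partition), using scikit-learn and a public data set as stand-ins; the paper's heterogeneity measurement and ensemble framework are not reproduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Stage 1: orthogonal factor scores remove redundancy among features
scores = FactorAnalysis(n_components=5, random_state=0).fit_transform(X)

# Stage 2: partition samples by factor scores to capture heterogeneity
parts = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)

# Stage 3: fit one classifier per (more homogeneous) partition
models = {}
for c in np.unique(parts):
    Xc, yc = X[parts == c], y[parts == c]
    if len(np.unique(yc)) > 1:                 # skip single-class partitions
        models[c] = LogisticRegression(max_iter=5000).fit(Xc, yc)
```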


Geophysics ◽  
2019 ◽  
Vol 84 (5) ◽  
pp. E293-E299
Author(s):  
Jorlivan L. Correa ◽  
Paulo T. L. Menezes

Synthetic data provided by geoelectric earth models are a powerful tool to evaluate, a priori, the effectiveness of a controlled-source electromagnetic (CSEM) workflow. Marlim R3D (MR3D) is an open-source, complex, and realistic geoelectric model for CSEM simulations of the postsalt turbiditic reservoirs at the Brazilian offshore margin. We have developed a 3D CSEM finite-difference time-domain forward study to generate the full-azimuth CSEM data set for the MR3D earth model. To that end, we designed a full-azimuth survey with 45 towlines striking in the north–south and east–west directions over a total of 500 receivers evenly spaced at 1 km intervals along the rugged seafloor of the MR3D model. To correctly represent the thin, disconnected, and complex geometries of the studied reservoirs, we built a finely discretized mesh of [Formula: see text] cells, leading to a large mesh with a total of approximately 90 million cells. We computed the six electromagnetic field components (Ex, Ey, Ez, Hx, Hy, and Hz) at six frequencies in the range of 0.125–1.25 Hz. To mimic the noise in real CSEM data, we added multiplicative noise with a 1% standard deviation to the data. Both CSEM data sets (noise-free and noise-added), with inline and broadside geometries, are distributed for research or commercial use, under the Creative Commons license, on the Zenodo platform.
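A one-line version of that noise model, under the assumption that "multiplicative noise with a 1% standard deviation" means scaling each sample by one plus zero-mean Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_multiplicative_noise(data, rel_std=0.01):
    # Each sample scaled by (1 + N(0, rel_std)); an assumed reading
    # of the noise model described in the abstract.
    return data * (1.0 + rng.normal(0.0, rel_std, size=data.shape))
```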


2020 ◽  
Vol 11 (3) ◽  
pp. 42-67
Author(s):  
Soumeya Zerabi ◽  
Souham Meshoul ◽  
Samia Chikhi Boucherkha

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based on computing pairwise distances, which results in quadratic complexity of the related algorithms. The existing CVIs cannot handle large data sets properly and need to be revisited to cope with the ever-increasing volume of data. Therefore, parallel and distributed implementations of these indexes are required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs, namely the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested on both evaluating clustering results and identifying the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.
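To illustrate why these indexes decompose naturally into map and reduce phases, here is a single-machine sketch of the Dunn index written in that style; MR_Dunn itself runs on Hadoop and is not reproduced here.

```python
import numpy as np
from itertools import combinations

def dunn_index(X, labels):
    """Dunn index arranged as map and reduce phases (needs >= 2 clusters)."""
    clusters = {c: X[labels == c] for c in np.unique(labels)}
    # "Map" phase: one diameter per cluster
    diameters = [
        max((np.linalg.norm(a - b) for a, b in combinations(pts, 2)), default=0.0)
        for pts in clusters.values()
    ]
    # "Map" phase: one minimal separation per cluster pair
    separations = [
        min(np.linalg.norm(a - b) for a in clusters[ci] for b in clusters[cj])
        for ci, cj in combinations(clusters, 2)
    ]
    # "Reduce" phase: global minimum separation over global maximum diameter
    return min(separations) / max(diameters)
```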

