Estimating and Controlling Overlap in Gaussian Mixtures for Clustering Methods Evaluation

Author(s):  
Radhwane Gherbaoui ◽  
Mohammed Ouali ◽  
Nacéra Benamrane

The ad hoc nature of clustering methods makes simulated data paramount in assessing their performance. Real datasets can also be used to evaluate clustering methods, but with the major drawback that many test scenarios go unassessed. In this paper, we propose a formal quantification of component overlap. This quantification is derived from a set of theorems that also yield an automatic method for artificial data generation. We also derive a method to estimate parameters of existing models and to evaluate the results of other approaches. Automatic estimation of the overlap rate can also be used as an unsupervised learning approach in data mining to determine the parameters of mixture models from actual observations.
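
As an illustration of what controlling overlap can look like in practice, the sketch below uses the Bhattacharyya coefficient between two univariate Gaussian components. This is a standard closed-form similarity measure swapped in for illustration; it is an assumption and not necessarily the overlap rate defined by the authors' theorems. The function names are hypothetical.

```python
import math


def bhattacharyya_overlap(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya coefficient between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Returns a value in (0, 1]; 1 means the two components coincide.
    Assumption: used here as a stand-in overlap measure.
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    scale = math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))


def separation_for_overlap(target, sigma):
    """Distance between means that yields `target` overlap.

    Valid only for two components sharing the same standard deviation,
    where the coefficient reduces to exp(-d^2 / (8 * sigma^2)).
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target overlap must lie in (0, 1]")
    return sigma * math.sqrt(-8.0 * math.log(target))


# Place the second component so that the overlap is 0.3.
d = separation_for_overlap(0.3, sigma=2.0)
overlap = bhattacharyya_overlap(0.0, 2.0, d, 2.0)
```

Inverting the measure in this way is what makes a generator controllable: the desired overlap is chosen first and the component parameters follow from it.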

2021 ◽  
Author(s):  
Helena L Crowell ◽  
Sarah X Morillo Leonardo ◽  
Charlotte Soneson ◽  
Mark D Robinson

With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyse aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant - on their own as well as in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task, and often use simulated data that provide a ground truth for evaluations. Thus, demanding a high quality standard for synthetically generated data is critical to make simulation study results credible and transferable to real data. Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch- and cluster-level. Secondly, we investigated the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which quality control summaries can capture reference-simulation similarity, and to what extent. Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects; they yield over-optimistic performance of integration, and potentially unreliable ranking of clustering methods; and it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1573
Author(s):  
Loris Nanni ◽  
Giovanni Minchio ◽  
Sheryl Brahnam ◽  
Gianluca Maguolo ◽  
Alessandra Lumini

Traditionally, classifiers are trained to predict patterns within a feature space. The image classification system presented here trains classifiers to predict patterns within a vector space by combining the dissimilarity spaces generated by a large set of Siamese Neural Networks (SNNs). A set of centroids from the patterns in the training data sets is calculated with supervised k-means clustering. The centroids are used to generate the dissimilarity space via the Siamese networks. The vector space descriptors are extracted by projecting patterns onto the similarity spaces, and SVMs classify an image by its dissimilarity vector. The versatility of the proposed approach in image classification is demonstrated by evaluating the system on different types of images across two domains: two medical data sets and two animal audio data sets with vocalizations represented as images (spectrograms). Results show that the proposed system is competitive with the best-performing methods in the literature, obtaining state-of-the-art performance on one of the medical data sets, and does so without ad hoc optimization of the clustering methods on the tested data sets.
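
The projection step can be sketched as follows: each pattern is described by its dissimilarity to every centroid, and the resulting vector is what a classifier would be trained on. In the abstract the dissimilarity is produced by trained Siamese networks; the Euclidean distance used here is only a placeholder assumption, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Placeholder dissimilarity; the paper uses a learned Siamese measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def to_dissimilarity_space(patterns, centroids, dissimilarity=euclidean):
    """Map each pattern to a vector of dissimilarities, one per centroid."""
    return [[dissimilarity(p, c) for c in centroids] for p in patterns]


# Two centroids give a two-dimensional dissimilarity vector per pattern.
centroids = [(0.0, 0.0), (3.0, 4.0)]
patterns = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
vectors = to_dissimilarity_space(patterns, centroids)
```

Passing the measure in as a parameter keeps the projection independent of how the dissimilarity is obtained, so a learned function could be substituted without changing the rest of the pipeline.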


2012 ◽  
Vol 20 (3) ◽  
pp. 387-399 ◽  
Author(s):  
Benjamin E. Lauderdale

Political scientists often study dollar-denominated outcomes that are zero for some observations. These zeros can arise because the data-generating process is granular: The observed outcome results from aggregation of a small number of discrete projects or grants, each of varying dollar size. This article describes the use of a compound distribution in which each observed outcome is the sum of a Poisson-distributed number of gamma-distributed quantities, a special case of the Tweedie distribution. Regression models based on this distribution estimate loglinear marginal effects without either the ad hoc treatment of zeros necessary to use a log-dependent variable regression or the change in quantity of interest necessary to use a tobit or selection model. The compound Poisson-gamma regression is compared with commonly applied approaches in an application to data on high-speed rail grants from the United States federal government to the states, and against simulated data from several data-generating processes.
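
The data-generating process described above can be simulated directly, which shows how exact zeros arise: an observation is zero whenever the Poisson count of grants is zero. The sketch below is illustrative only; the parameter values and function names are assumptions, and it simulates the distribution without fitting the regression model.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson count with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_compound_poisson_gamma(rng, lam, shape, scale):
    """Sum of a Poisson(lam) number of Gamma(shape, scale) quantities."""
    n = sample_poisson(rng, lam)
    return math.fsum(rng.gammavariate(shape, scale) for _ in range(n))


rng = random.Random(0)
lam, shape, scale = 1.5, 2.0, 3.0  # illustrative values only
draws = [sample_compound_poisson_gamma(rng, lam, shape, scale)
         for _ in range(20000)]

zero_fraction = sum(1 for d in draws if d == 0.0) / len(draws)
mean_outcome = math.fsum(draws) / len(draws)
```

The share of exact zeros should be close to exp(-lam) and the mean close to lam * shape * scale, which is the sense in which zeros and positive amounts come from one process rather than two.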


Author(s):  
A. Schlichting ◽  
C. Brenner

LiDAR sensors are proven sensors for accurate vehicle localization. Instead of detecting and matching features in the LiDAR data, we want to use the entire information provided by the scanners. As dynamic objects, like cars, pedestrians or even construction sites, could lead to wrong localization results, we use a change detection algorithm to detect these objects in the reference data. If an object occurs in a certain number of measurements at the same position, we mark it and every point it contains as static. In the next step, we merge the data of the single measurement epochs into one reference dataset, using only static points. Further, we also use a classification algorithm to detect trees.

For the online localization of the vehicle, we use simulated data of a vertically aligned automotive LiDAR sensor. As we only want to use static objects in this case as well, we use a random forest classifier to detect dynamic scan points online. Since the automotive data is derived from the LiDAR Mobile Mapping System, we are able to use the labelled objects from the reference data generation step to create the training data and further to detect dynamic objects online. The localization can then be done by a point-to-image correlation method using only static objects. We achieved a localization standard deviation of about 5 cm (position) and 0.06° (heading), and were able to successfully localize the vehicle in about 93 % of the cases along a trajectory of 13 km in Hannover, Germany.
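
The static-labelling rule for the reference data can be sketched as an occurrence count over measurement epochs. The abstract works with detected objects; the version below simplifies this to 2D grid cells, which is an assumption made for brevity, and the function names, cell size and threshold are hypothetical.

```python
import math
from collections import Counter


def cell_of(point, cell_size):
    """Grid cell index of a 2D point (simplification of an object position)."""
    x, y = point
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def static_cells(epochs, cell_size, min_epochs):
    """Cells occupied in at least `min_epochs` measurement epochs."""
    counts = Counter()
    for points in epochs:
        # Count each cell once per epoch, however many points fall in it.
        counts.update({cell_of(p, cell_size) for p in points})
    return {cell for cell, n in counts.items() if n >= min_epochs}


def merge_static_points(epochs, cell_size, min_epochs):
    """Merge all epochs into one reference set, keeping static points only."""
    static = static_cells(epochs, cell_size, min_epochs)
    return [p for points in epochs for p in points
            if cell_of(p, cell_size) in static]


epochs = [
    [(0.1, 0.1), (5.2, 5.2)],   # a parked car appears only in this epoch
    [(0.2, 0.3)],
    [(0.4, 0.1), (9.0, 9.0)],
]
reference = merge_static_points(epochs, cell_size=1.0, min_epochs=3)
```

Only the points near the origin survive, because that position is occupied in every epoch while the other two appear once each.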


2020 ◽  
Vol 11 ◽  
Author(s):  
Alejandro Abdala Asbun ◽  
Marc A. Besseling ◽  
Sergio Balzano ◽  
Judith D. L. van Bleijswijk ◽  
Harry J. Witte ◽  
...  

Marker gene sequencing of the rRNA operon (16S, 18S, ITS) or cytochrome c oxidase I (CO1) is a popular means to assess microbial communities of the environment, microbiomes associated with plants and animals, as well as communities of multicellular organisms via environmental DNA sequencing. Since this technique is based on sequencing a single gene, or even only parts of a single gene rather than the entire genome, the number of reads needed per sample to assess the microbial community structure is lower than that required for metagenome sequencing. This makes marker gene sequencing affordable to nearly any laboratory. Despite the relative ease and cost-efficiency of data generation, analyzing the resulting sequence data requires computational skills that may go beyond the standard repertoire of a current molecular biologist/ecologist. We have developed Cascabel, a scalable, flexible, and easy-to-use amplicon sequence data analysis pipeline, which uses Snakemake and a combination of existing and newly developed solutions for its computational steps. Cascabel takes the raw data as input and delivers a table of operational taxonomic units (OTUs) or Amplicon Sequence Variants (ASVs) in BIOM and text format, together with representative sequences. Cascabel is highly versatile software that allows users to customize several steps of the pipeline, such as selecting from a set of OTU clustering methods or performing ASV analysis. In addition, we designed Cascabel to run in any Linux/Unix computing environment, from desktop computers to computing servers, making use of parallel processing if possible. The analyses and results are fully reproducible and documented in an HTML and optional PDF report. Cascabel is freely available at GitHub: https://github.com/AlejandroAb/CASCABEL.


2017 ◽  
Vol 63 (3) ◽  
pp. 309-313 ◽  
Author(s):  
C. Suganthi Evangeline ◽  
S. Appu

Vehicular Ad-hoc Networks (VANETs) are a special type of Mobile Ad-hoc Networks (MANETs) with frequent topology changes and higher mobility. Clustering is applied in VANETs to divide the network into groups of mobile vehicles and to improve routing and data gathering. This work proposes a stable clustering scheme based on an adaptive multiple metric that combines features of static and dynamic clustering methods. Using a new multiple-metric method, a cluster head is selected among the cluster members based on mobility metrics, such as position, time to leave the road segment, and relative speed, and on Quality of Service (QoS) metrics, which include neighborhood degree, link quality of the RSU, and bandwidth. The adaptive multiple metric achieves higher QoS and cluster stability. Simulations in NS2 show that this technique provides a more stable cluster structure than the other methods.


2020 ◽  
Vol 10 (12) ◽  
pp. 4176 ◽  
Author(s):  
Loris Nanni ◽  
Andrea Rigo ◽  
Alessandra Lumini ◽  
Sheryl Brahnam

In this work, we combine a Siamese neural network and different clustering techniques to generate a dissimilarity space that is then used to train an SVM for automated animal audio classification. The animal audio datasets used are (i) birds and (ii) cat sounds, which are freely available. We exploit different clustering methods to reduce the spectrograms in the dataset to a number of centroids that are used to generate the dissimilarity space through the Siamese network. Once computed, we use the dissimilarity space to generate a vector space representation of each pattern, which is then fed into a support vector machine (SVM) to classify a spectrogram by its dissimilarity vector. Our study shows that the proposed approach based on dissimilarity space performs well on both classification problems without ad hoc optimization of the clustering methods. Moreover, results show that the fusion of CNN-based approaches applied to the animal audio classification problem works better than the stand-alone CNNs.


2009 ◽  
Vol 07 (01) ◽  
pp. 135-156 ◽  
Author(s):  
VINHTHUY PHAN ◽  
E. OLUSEGUN GEORGE ◽  
QUYNH T. TRAN ◽  
SHIRLEAN GOODWIN ◽  
SRIDEVI BODREDDIGARI ◽  
...  

Post hoc assignment of patterns determined by all pairwise comparisons in microarray experiments with multiple treatments has proven useful in assessing treatment effects. We propose the use of transitive directed acyclic graphs (tDAGs) to represent these patterns and show that such a representation can be useful in clustering treatment effects, annotating existing clustering methods, and analyzing sample sizes. Advantages of this approach include: (1) unique and descriptive meaning of each cluster in terms of how genes respond to all pairs of treatments; (2) insensitivity of the observed patterns to the number of genes analyzed; and (3) a combinatorial perspective to address the sample size problem by observing the rate of contractible tDAGs as the number of replicates increases. The advantages and overall utility of the method in elaborating drug structure-activity relationships are exemplified in a controlled study with real and simulated data.

