Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning

F1000Research ◽  
2021 ◽  
Vol 9 ◽  
pp. 1186
Author(s):  
Caitlin E. Coombes ◽  
Zachary B. Abrams ◽  
Samantha Nakayiza ◽  
Guy Brock ◽  
Kevin R. Coombes

The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, such data have posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses the challenges of simulating realistic clinical data by providing the user with a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed-type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate the operating characteristics of an algorithm in both supervised and unsupervised ML.
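Umpire 2.0 itself is an R package; purely as a conceptual illustration of the simulation recipe the abstract describes (subgroup-specific correlated features, additive noise, then discretization to mixed types with ground-truth labels retained), the following Python sketch shows the general idea. All function and parameter names here are illustrative assumptions, not the package's API.

```python
# Conceptual sketch (not the Umpire 2.0 R API): simulate mixed-type data with
# known subgroup labels by (1) drawing correlated continuous features per
# subgroup, (2) adding measurement noise, (3) discretizing some features.
import numpy as np

rng = np.random.default_rng(42)

def simulate_mixed(n_per_group=100, n_features=10, n_groups=3, noise_sd=0.5):
    X, labels = [], []
    for g in range(n_groups):
        mean = rng.normal(0, 2, size=n_features)           # subgroup-specific centers
        A = rng.normal(size=(n_features, n_features))
        cov = A @ A.T / n_features + np.eye(n_features)    # correlated features
        block = rng.multivariate_normal(mean, cov, size=n_per_group)
        X.append(block)
        labels.append(np.full(n_per_group, g))
    X = np.vstack(X) + rng.normal(0, noise_sd, size=(n_per_group * n_groups, n_features))
    y = np.concatenate(labels)

    # Discretize the first few features to binary / three-level categorical
    X_bin = (X[:, :3] > np.median(X[:, :3], axis=0)).astype(int)
    X_cat = np.digitize(X[:, 3:6], np.quantile(X[:, 3:6], [1 / 3, 2 / 3]))
    X_cont = X[:, 6:]
    return X_bin, X_cat, X_cont, y

X_bin, X_cat, X_cont, y = simulate_mixed()
print(X_bin.shape, X_cat.shape, X_cont.shape, np.bincount(y))
```

Because the subgroup labels y are known by construction, simulated data of this kind can be used to score clustering or classification methods against ground truth.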


Author(s):  
Jinchao Ji ◽  
Wei Pang ◽  
Yanlin Zheng ◽  
Zhe Wang ◽  
Zhiqiang Ma

Most initialization approaches are designed for partitional clustering algorithms that process only categorical or only numerical data. However, in real-world applications, data objects with both numeric and categorical features are ubiquitous. The coexistence of categorical and numerical attributes makes initialization methods designed for single-type data inapplicable to mixed-type data. Furthermore, to the best of our knowledge, the existing partitional clustering algorithms designed for mixed-type data determine their initial cluster centers randomly. In this paper, we propose a novel initialization method for mixed-data clustering. In the proposed method, distance and density are exploited jointly to determine the initial cluster centers. The performance of the proposed method is demonstrated by a series of experiments on three real-world datasets, in comparison with traditional initialization methods.
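The abstract does not spell out the exact formulation, but a hedged sketch of the general density-plus-distance idea looks like this: score each object by local density under a mixed-type dissimilarity, pick the densest object as the first center, then repeatedly pick objects that are both dense and far from the centers chosen so far. The dissimilarity and scoring rule below are illustrative assumptions, not the authors' method.

```python
# Hedged sketch of density-and-distance-based center initialization for
# mixed-type data (illustrative; not the paper's exact formulation).
import numpy as np

def mixed_dissimilarity(a, b, num_idx, cat_idx, gamma=1.0):
    """Squared Euclidean on numeric attributes + gamma * Hamming on categorical."""
    num = np.sum((a[num_idx] - b[num_idx]) ** 2)
    cat = np.sum(a[cat_idx] != b[cat_idx])
    return num + gamma * cat

def init_centers(X, k, num_idx, cat_idx):
    n = len(X)
    D = np.array([[mixed_dissimilarity(X[i], X[j], num_idx, cat_idx)
                   for j in range(n)] for i in range(n)])
    density = 1.0 / (1.0 + D.mean(axis=1))       # denser objects have closer neighbours
    centers = [int(np.argmax(density))]          # first center: densest object
    while len(centers) < k:
        d_to_centers = D[:, centers].min(axis=1)
        score = density * d_to_centers           # dense AND far from chosen centers
        score[centers] = -np.inf
        centers.append(int(np.argmax(score)))
    return X[centers]

# Toy example: two numeric attributes plus one categorical attribute (coded 0/1/2)
X = np.array([[0.1, 0.2, 0], [0.2, 0.1, 0], [5.0, 5.1, 1],
              [5.2, 4.9, 1], [9.9, 9.8, 2], [10.1, 10.0, 2]], dtype=float)
print(init_centers(X, k=3, num_idx=[0, 1], cat_idx=[2]))
```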


2021 ◽  
Author(s):  
Shahan Derkarabetian ◽  
James Starrett ◽  
Marshal Hedin

The diversity of biological and ecological characteristics of organisms, and the underlying genetic patterns and processes of speciation, makes the development of universally applicable genetic species delimitation methods challenging. Many approaches, like those incorporating the multispecies coalescent, sometimes delimit populations and overestimate species numbers. This issue is exacerbated in taxa with inherently high population structure due to low dispersal ability, and in cryptic species resulting from nonecological speciation. These taxa present a conundrum when delimiting species: analyses rely heavily, if not entirely, on genetic data, which tend to oversplit species, while other lines of evidence lump them together. We showcase this conundrum in the harvester Theromaster brunneus, a low-dispersal taxon with a wide geographic distribution and high potential for cryptic species. Integrating morphological, mitochondrial, and sub-genomic (double-digest RADSeq and ultraconserved elements) data, we find high discordance across analyses and data types in the number of inferred species, with further evidence that multispecies coalescent approaches oversplit. We demonstrate the power of a supervised machine learning approach in effectively delimiting cryptic species by creating a "custom" training dataset derived from a well-studied lineage with similar biological characteristics to Theromaster. This approach uses known taxa with particular biological characteristics to inform species delimitation in unknown taxa with similar characteristics, combining modern computational tools with the biology and natural history of the organisms to make more biologically informed delimitation decisions. In principle, the approach is universally applicable for species delimitation of any taxon with genetic data, particularly cryptic species.
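As a very rough, hedged illustration of the "custom training set" idea (the feature names, labels, and classifier below are assumptions, not the authors' pipeline), one can train a classifier on population pairs from a well-studied reference lineage whose species boundaries are known, then apply it to pairs from the focal taxon.

```python
# Hedged sketch: train on labeled pairs from a reference lineage, predict on
# pairs from the focal, low-dispersal taxon. All names are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical per-pair summary statistics (e.g. divergence, differentiation,
# shared variation) for the reference lineage, with known species boundaries.
n_ref = 200
ref_features = rng.normal(size=(n_ref, 3))
ref_labels = (ref_features[:, 0] + ref_features[:, 1] > 0).astype(int)  # 1 = distinct species

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(ref_features, ref_labels)

# Apply the reference-trained model to population pairs from the focal taxon
focal_pairs = rng.normal(size=(5, 3))
print(clf.predict_proba(focal_pairs)[:, 1])   # P(distinct species) per pair
```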


2021 ◽  
Vol 6 (65) ◽  
pp. 3634
Author(s):  
Mingze Huang ◽  
Christian Müller ◽  
Irina Gaynanova

Entropy ◽  
2020 ◽  
Vol 22 (12) ◽  
pp. 1391
Author(s):  
Ivan Lopez-Arevalo ◽  
Edwin Aldana-Bobadilla ◽  
Alejandro Molina-Villegas ◽  
Hiram Galeana-Zapién ◽  
Victor Muñiz-Sanchez ◽  
...  

The most common machine-learning methods solve supervised and unsupervised problems based on datasets whose features belong to a numerical space. However, many problems involve data in which numerical and categorical values coexist, which makes them challenging to handle. To transform categorical data into numeric form, preprocessing is required. Methods such as one-hot and feature-hashing have been the most widely used encoding approaches, at the expense of a significant increase in the dimensionality of the dataset. This effect introduces further challenges in dealing with the overabundance of variables and/or noisy data. In this paper we propose a novel encoding approach that maps mixed-type data into an information space, using Shannon's theory to model the amount of information contained in the original data. We evaluated our proposal on ten mixed-type datasets from the UCI repository and two datasets representing real-world problems, obtaining promising results. To demonstrate its performance, the approach was used to prepare these datasets for classification, regression, and clustering tasks. We show that our encoding is markedly superior to one-hot and feature-hashing encoding in terms of memory efficiency, while preserving the information conveyed by the original data.
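As a rough illustration of the contrast the abstract draws (the specific transform below is an assumption about the general idea, not the authors' method), one information-theoretic way to keep dimensionality fixed is to replace each categorical value with its self-information, -log2 p(value), whereas one-hot encoding adds one column per category level.

```python
# Rough illustration of an information-based encoding (assumed, not the
# paper's exact transform): replace each categorical value with its
# self-information -log2 p(value). Dimensionality stays fixed, unlike one-hot.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, 29, 42],
    "smoker": ["yes", "no", "no", "yes"],
    "blood_type": ["A", "O", "O", "AB"],
})

def self_information_encode(series):
    p = series.value_counts(normalize=True)          # empirical category probabilities
    return series.map(lambda v: -np.log2(p[v]))      # rarer categories carry more bits

encoded = df.copy()
for col in df.select_dtypes(include="object"):
    encoded[col] = self_information_encode(df[col])

print(encoded)                      # same 3 columns, now all numeric
print(pd.get_dummies(df).shape)     # one-hot: dimensionality grows with category levels
```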


2021 ◽  
Author(s):  
Rita T. Sousa ◽  
Sara Silva ◽  
Catia Pesquita

Semantic similarity between concepts in knowledge graphs is essential for several bioinformatics applications, including the prediction of protein-protein interactions and the discovery of associations between diseases and genes. Although knowledge graphs describe entities in terms of several perspectives (or semantic aspects), state-of-the-art semantic similarity measures are general-purpose. This can represent a challenge, since different use cases may need different similarity perspectives and ultimately depend on expert knowledge for manual fine-tuning. We present a new approach that uses supervised machine learning to tailor aspect-oriented semantic similarity measures to fit a particular view of biological similarity or relatedness. We implement and evaluate it using different combinations of representative semantic similarity measures and machine learning methods with four biological similarity views: protein-protein interaction, protein function similarity, protein sequence similarity, and phenotype-based gene similarity. The results demonstrate that our approach outperforms non-supervised methods, producing semantic similarity models that fit different biological perspectives significantly better than the commonly used manual combinations of semantic aspects. Moreover, although black-box machine learning models produce the best results, approaches such as genetic programming and linear regression still produce improved results while generating models that are interpretable.
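A minimal sketch of the supervised-combination idea (the aspect names, the proxy target, and the use of plain linear regression are assumptions for illustration, not the authors' implementation): each protein pair is described by its per-aspect semantic similarities, and a regression model learns how to weight those aspects to predict a chosen biological similarity view.

```python
# Hedged sketch: learn weights that combine aspect-oriented semantic
# similarities into a model of one biological similarity view.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Each row is a protein pair; columns are similarities computed on different
# semantic aspects of the knowledge graph (e.g. three GO sub-ontologies).
aspect_sims = rng.uniform(size=(500, 3))
# Proxy target for one similarity view (e.g. sequence similarity), synthetic here.
target = 0.6 * aspect_sims[:, 0] + 0.3 * aspect_sims[:, 2] + rng.normal(0, 0.05, 500)

model = LinearRegression().fit(aspect_sims, target)
print(model.coef_)   # interpretable per-aspect weights for this similarity view
```

An interpretable model like this exposes which semantic aspects drive a given biological view, which is the trade-off the abstract notes against black-box learners.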


Author(s):  
QingXiang Wu ◽  
Martin McGinnity ◽  
Girijesh Prasad ◽  
David Bell

Data mining and knowledge discovery aim at finding useful information in typically massive collections of data, and then extracting useful knowledge from that information. To date, a large number of approaches have been proposed to find useful information and discover useful knowledge; for example, decision trees, Bayesian belief networks, evidence theory, rough set theory, fuzzy set theory, k-nearest-neighbor (kNN) classifiers, neural networks, and support vector machines. However, these approaches are each based on a specific data type. In the real world, an intelligent system often encounters mixed data types, incomplete information (missing values), and imprecise information (fuzzy conditions). In the UCI (University of California, Irvine) Machine Learning Repository, many real-world data sets have missing values and mixed data types. Enabling machine learning or data mining approaches to deal with mixed data types is a challenge (Ching, 1995; Coppock, 2003) because it is difficult to define a measure of similarity between objects with mixed-type attributes. The problem of mixed data types is a long-standing issue in data mining. The emerging techniques targeted at this issue fall into three classes: (1) symbolic data mining approaches plus discretizers (e.g., Dougherty et al., 1995; Wu, 1996; Kurgan et al., 2004; Diday, 2004; Darmont et al., 2006; Wu et al., 2007) that transform continuous data into symbolic data; (2) numerical data mining approaches plus a transformation from symbolic data to numerical data (e.g., Kasabov, 2003; Darmont et al., 2006; Hadzic et al., 2007); (3) hybrids of symbolic and numerical data mining approaches (e.g., Tung, 2002; Kasabov, 2003; Leng et al., 2005; Wu et al., 2006). Since hybrid approaches have the potential to exploit the advantages of both symbolic and numerical data mining, this chapter, after discussing the merits and shortcomings of current approaches, focuses on applying the Self-Organizing Computing Network Model to construct a hybrid system for knowledge discovery from databases with a diversity of data types. Future trends for data mining on mixed-type data are then discussed, and a conclusion is presented.
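The core difficulty named above, defining similarity between objects with mixed-type attributes, is often illustrated with a Gower-style dissimilarity: range-normalized absolute differences for numeric attributes and simple mismatch for categorical ones, averaged over attributes. The sketch below is a generic textbook illustration of that idea, not the chapter's Self-Organizing Computing Network Model.

```python
# Gower-style dissimilarity for mixed attribute types (generic illustration).
def gower_dissimilarity(a, b, num_idx, cat_idx, ranges):
    # Numeric attributes: absolute difference scaled by the attribute's range.
    num_part = [abs(a[i] - b[i]) / ranges[i] for i in num_idx if ranges[i] > 0]
    # Categorical attributes: 0 if equal, 1 if different.
    cat_part = [0.0 if a[i] == b[i] else 1.0 for i in cat_idx]
    parts = num_part + cat_part
    return sum(parts) / len(parts)

# Records: (age, income, colour, owns_car)
data = [(25, 30_000, "red", "yes"),
        (47, 82_000, "blue", "no"),
        (31, 41_000, "red", "yes")]

num_idx, cat_idx = [0, 1], [2, 3]
ranges = {i: max(r[i] for r in data) - min(r[i] for r in data) for i in num_idx}

print(gower_dissimilarity(data[0], data[1], num_idx, cat_idx, ranges))  # dissimilar pair
print(gower_dissimilarity(data[0], data[2], num_idx, cat_idx, ranges))  # similar pair
```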


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ewan Carr ◽  
Mathieu Carrière ◽  
Bertrand Michel ◽  
Frédéric Chazal ◽  
Raquel Iniesta

Background: This paper exploits recent developments in topological data analysis to present a pipeline for clustering based on Mapper, an algorithm that reduces complex data into a one-dimensional graph. Results: We present a pipeline to identify and summarise clusters based on statistically significant topological features from a point cloud using Mapper. Conclusions: Key strengths of this pipeline include the integration of prior knowledge to inform the clustering process and the selection of optimal clusters; the use of the bootstrap to restrict the search to robust topological features; the use of machine learning to inspect clusters; and the ability to incorporate mixed data types. Our pipeline can be downloaded under the GNU GPLv3 license at https://github.com/kcl-bhi/mapper-pipeline.
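For readers unfamiliar with Mapper, a minimal sketch of the core algorithm (not the authors' pipeline at https://github.com/kcl-bhi/mapper-pipeline; the filter, cover parameters, and clusterer below are illustrative assumptions): project the data onto a one-dimensional filter, cover the filter range with overlapping intervals, cluster the points falling in each interval's preimage, and connect clusters that share points to form the graph.

```python
# Minimal Mapper sketch: 1-D filter, overlapping interval cover, per-interval
# clustering with DBSCAN, and edges between clusters that share data points.
import numpy as np
from sklearn.cluster import DBSCAN

def mapper_graph(X, filter_values, n_intervals=10, overlap=0.3, eps=0.5):
    lo, hi = filter_values.min(), filter_values.max()
    length = (hi - lo) / n_intervals
    nodes, edges = [], set()
    for i in range(n_intervals):
        a = lo + i * length - overlap * length          # widened interval (overlap)
        b = lo + (i + 1) * length + overlap * length
        idx = np.where((filter_values >= a) & (filter_values <= b))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X[idx])
        for lab in set(labels) - {-1}:                  # ignore DBSCAN noise points
            nodes.append(set(idx[labels == lab]))
    # Connect clusters (graph nodes) that share at least one data point.
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if nodes[i] & nodes[j]:
                edges.add((i, j))
    return nodes, edges

X = np.random.default_rng(2).normal(size=(300, 2))
nodes, edges = mapper_graph(X, filter_values=X[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```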

