Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning

F1000Research ◽  
2021 ◽  
Vol 9 ◽  
pp. 1186
Author(s):  
Caitlin E. Coombes ◽  
Zachary B. Abrams ◽  
Samantha Nakayiza ◽  
Guy Brock ◽  
Kevin R. Coombes

The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, such data have posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses the challenges of simulating realistic clinical data by providing the user with a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed-type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate the operating characteristics of an algorithm in both supervised and unsupervised ML.
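Umpire 2.0 itself is an R package; purely as a conceptual illustration of the simulation recipe the abstract describes (subgroup-specific correlated features, additive noise, then discretization to mixed types with ground-truth labels retained), the following Python sketch shows the general idea. All function and parameter names here are illustrative assumptions, not the package's API.

```python
# Conceptual sketch (not the Umpire 2.0 R API): simulate mixed-type data with
# known subgroup labels by (1) drawing correlated continuous features per
# subgroup, (2) adding measurement noise, (3) discretizing some features.
import numpy as np

rng = np.random.default_rng(42)

def simulate_mixed(n_per_group=100, n_features=10, n_groups=3, noise_sd=0.5):
    X, labels = [], []
    for g in range(n_groups):
        mean = rng.normal(0, 2, size=n_features)           # subgroup-specific centers
        A = rng.normal(size=(n_features, n_features))
        cov = A @ A.T / n_features + np.eye(n_features)    # correlated features
        block = rng.multivariate_normal(mean, cov, size=n_per_group)
        X.append(block)
        labels.append(np.full(n_per_group, g))
    X = np.vstack(X) + rng.normal(0, noise_sd, size=(n_per_group * n_groups, n_features))
    y = np.concatenate(labels)

    # Discretize the first few features to binary / three-level categorical
    X_bin = (X[:, :3] > np.median(X[:, :3], axis=0)).astype(int)
    X_cat = np.digitize(X[:, 3:6], np.quantile(X[:, 3:6], [1 / 3, 2 / 3]))
    X_cont = X[:, 6:]
    return X_bin, X_cat, X_cont, y

X_bin, X_cat, X_cont, y = simulate_mixed()
print(X_bin.shape, X_cat.shape, X_cont.shape, np.bincount(y))
```

Because the subgroup labels y are known by construction, simulated data of this kind can be used to score clustering or classification methods against ground truth.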


Author(s):  
Jinchao Ji ◽  
Wei Pang ◽  
Yanlin Zheng ◽  
Zhe Wang ◽  
Zhiqiang Ma

Most initialization approaches are designed for partitional clustering algorithms that process only categorical or only numerical data. However, in real-world applications, data objects with both numeric and categorical features are ubiquitous. The coexistence of categorical and numerical attributes makes initialization methods designed for single-type data inapplicable to mixed-type data. Furthermore, to the best of our knowledge, the existing partitional clustering algorithms designed for mixed-type data determine their initial cluster centers randomly. In this paper, we propose a novel initialization method for mixed-data clustering. In the proposed method, distance and density are exploited jointly to determine the initial cluster centers. The performance of the proposed method is demonstrated by a series of experiments on three real-world datasets, in comparison with traditional initialization methods.
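The abstract does not spell out the exact formulation, but a hedged sketch of the general density-plus-distance idea looks like this: score each object by local density under a mixed-type dissimilarity, pick the densest object as the first center, then repeatedly pick objects that are both dense and far from the centers chosen so far. The dissimilarity and scoring rule below are illustrative assumptions, not the authors' method.

```python
# Hedged sketch of density-and-distance-based center initialization for
# mixed-type data (illustrative; not the paper's exact formulation).
import numpy as np

def mixed_dissimilarity(a, b, num_idx, cat_idx, gamma=1.0):
    """Squared Euclidean on numeric attributes + gamma * Hamming on categorical."""
    num = np.sum((a[num_idx] - b[num_idx]) ** 2)
    cat = np.sum(a[cat_idx] != b[cat_idx])
    return num + gamma * cat

def init_centers(X, k, num_idx, cat_idx):
    n = len(X)
    D = np.array([[mixed_dissimilarity(X[i], X[j], num_idx, cat_idx)
                   for j in range(n)] for i in range(n)])
    density = 1.0 / (1.0 + D.mean(axis=1))       # denser objects have closer neighbours
    centers = [int(np.argmax(density))]          # first center: densest object
    while len(centers) < k:
        d_to_centers = D[:, centers].min(axis=1)
        score = density * d_to_centers           # dense AND far from chosen centers
        score[centers] = -np.inf
        centers.append(int(np.argmax(score)))
    return X[centers]

# Toy example: two numeric attributes plus one categorical attribute (coded 0/1/2)
X = np.array([[0.1, 0.2, 0], [0.2, 0.1, 0], [5.0, 5.1, 1],
              [5.2, 4.9, 1], [9.9, 9.8, 2], [10.1, 10.0, 2]], dtype=float)
print(init_centers(X, k=3, num_idx=[0, 1], cat_idx=[2]))
```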


2021 ◽  
Author(s):  
Shahan Derkarabetian ◽  
James Starrett ◽  
Marshal Hedin

The diversity of biological and ecological characteristics of organisms, and the underlying genetic patterns and processes of speciation, makes the development of universally applicable genetic species delimitation methods challenging. Many approaches, like those incorporating the multispecies coalescent, sometimes delimit populations and overestimate species numbers. This issue is exacerbated in taxa with inherently high population structure due to low dispersal ability, and in cryptic species resulting from nonecological speciation. These taxa present a conundrum when delimiting species: analyses rely heavily, if not entirely, on genetic data, which tend to oversplit species, while other lines of evidence lump them together. We showcase this conundrum in the harvester Theromaster brunneus, a low-dispersal taxon with a wide geographic distribution and high potential for cryptic species. Integrating morphological, mitochondrial, and sub-genomic (double-digest RADSeq and ultraconserved elements) data, we find high discordance across analyses and data types in the number of inferred species, with further evidence that multispecies coalescent approaches oversplit. We demonstrate the power of a supervised machine learning approach in effectively delimiting cryptic species by creating a "custom" training dataset derived from a well-studied lineage with similar biological characteristics to Theromaster. This approach uses known taxa with particular biological characteristics to inform species delimitation in unknown taxa with similar characteristics, combining modern computational tools with the biology and natural history of the organisms to make more biologically informed delimitation decisions. In principle, the approach is universally applicable for species delimitation of any taxon with genetic data, particularly cryptic species.
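As a very rough, hedged illustration of the "custom training set" idea (the feature names, labels, and classifier below are assumptions, not the authors' pipeline), one can train a classifier on population pairs from a well-studied reference lineage whose species boundaries are known, then apply it to pairs from the focal taxon.

```python
# Hedged sketch: train on labeled pairs from a reference lineage, predict on
# pairs from the focal, low-dispersal taxon. All names are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical per-pair summary statistics (e.g. divergence, differentiation,
# shared variation) for the reference lineage, with known species boundaries.
n_ref = 200
ref_features = rng.normal(size=(n_ref, 3))
ref_labels = (ref_features[:, 0] + ref_features[:, 1] > 0).astype(int)  # 1 = distinct species

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(ref_features, ref_labels)

# Apply the reference-trained model to population pairs from the focal taxon
focal_pairs = rng.normal(size=(5, 3))
print(clf.predict_proba(focal_pairs)[:, 1])   # P(distinct species) per pair
```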


2021 ◽  
Vol 6 (65) ◽  
pp. 3634
Author(s):  
Mingze Huang ◽  
Christian Müller ◽  
Irina Gaynanova

Entropy ◽  
2020 ◽  
Vol 22 (12) ◽  
pp. 1391
Author(s):  
Ivan Lopez-Arevalo ◽  
Edwin Aldana-Bobadilla ◽  
Alejandro Molina-Villegas ◽  
Hiram Galeana-Zapién ◽  
Victor Muñiz-Sanchez ◽  
...  

The most common machine-learning methods solve supervised and unsupervised problems based on datasets whose features belong to a numerical space. However, many problems involve data in which numerical and categorical values coexist, which makes them challenging to handle. To transform categorical data into numeric form, preprocessing is required. Methods such as one-hot and feature-hashing have been the most widely used encoding approaches, at the expense of a significant increase in the dimensionality of the dataset. This effect introduces further challenges in dealing with the overabundance of variables and/or noisy data. In this paper we propose a novel encoding approach that maps mixed-type data into an information space, using Shannon's theory to model the amount of information contained in the original data. We evaluated our proposal on ten mixed-type datasets from the UCI repository and two datasets representing real-world problems, obtaining promising results. To demonstrate its performance, the approach was used to prepare these datasets for classification, regression, and clustering tasks. We show that our encoding is markedly superior to one-hot and feature-hashing encoding in terms of memory efficiency, while preserving the information conveyed by the original data.
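As a rough illustration of the contrast the abstract draws (the specific transform below is an assumption about the general idea, not the authors' method), one information-theoretic way to keep dimensionality fixed is to replace each categorical value with its self-information, -log2 p(value), whereas one-hot encoding adds one column per category level.

```python
# Rough illustration of an information-based encoding (assumed, not the
# paper's exact transform): replace each categorical value with its
# self-information -log2 p(value). Dimensionality stays fixed, unlike one-hot.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, 29, 42],
    "smoker": ["yes", "no", "no", "yes"],
    "blood_type": ["A", "O", "O", "AB"],
})

def self_information_encode(series):
    p = series.value_counts(normalize=True)          # empirical category probabilities
    return series.map(lambda v: -np.log2(p[v]))      # rarer categories carry more bits

encoded = df.copy()
for col in df.select_dtypes(include="object"):
    encoded[col] = self_information_encode(df[col])

print(encoded)                      # same 3 columns, now all numeric
print(pd.get_dummies(df).shape)     # one-hot: dimensionality grows with category levels
```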


2021 ◽  
Author(s):  
Rita T. Sousa ◽  
Sara Silva ◽  
Catia Pesquita

Semantic similarity between concepts in knowledge graphs is essential for several bioinformatics applications, including the prediction of protein-protein interactions and the discovery of associations between diseases and genes. Although knowledge graphs describe entities in terms of several perspectives (or semantic aspects), state-of-the-art semantic similarity measures are general-purpose. This can represent a challenge, since different use cases may need different similarity perspectives and ultimately depend on expert knowledge for manual fine-tuning. We present a new approach that uses supervised machine learning to tailor aspect-oriented semantic similarity measures to fit a particular view of biological similarity or relatedness. We implement and evaluate it using different combinations of representative semantic similarity measures and machine learning methods with four biological similarity views: protein-protein interaction, protein function similarity, protein sequence similarity, and phenotype-based gene similarity. The results demonstrate that our approach outperforms non-supervised methods, producing semantic similarity models that fit different biological perspectives significantly better than the commonly used manual combinations of semantic aspects. Moreover, although black-box machine learning models produce the best results, approaches such as genetic programming and linear regression still produce improved results while generating models that are interpretable.
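A minimal sketch of the supervised-combination idea (the aspect names, the proxy target, and the use of plain linear regression are assumptions for illustration, not the authors' implementation): each protein pair is described by its per-aspect semantic similarities, and a regression model learns how to weight those aspects to predict a chosen biological similarity view.

```python
# Hedged sketch: learn weights that combine aspect-oriented semantic
# similarities into a model of one biological similarity view.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Each row is a protein pair; columns are similarities computed on different
# semantic aspects of the knowledge graph (e.g. three GO sub-ontologies).
aspect_sims = rng.uniform(size=(500, 3))
# Proxy target for one similarity view (e.g. sequence similarity), synthetic here.
target = 0.6 * aspect_sims[:, 0] + 0.3 * aspect_sims[:, 2] + rng.normal(0, 0.05, 500)

model = LinearRegression().fit(aspect_sims, target)
print(model.coef_)   # interpretable per-aspect weights for this similarity view
```

An interpretable model like this exposes which semantic aspects drive a given biological view, which is the trade-off the abstract notes against black-box learners.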


Author(s):  
QingXiang Wu ◽  
Martin McGinnity ◽  
Girijesh Prasad ◽  
David Bell

Data mining and knowledge discovery aim at finding useful information in typically massive collections of data, and then extracting useful knowledge from that information. To date, a large number of approaches have been proposed to find useful information and discover useful knowledge; for example, decision trees, Bayesian belief networks, evidence theory, rough set theory, fuzzy set theory, k-nearest-neighbor (kNN) classifiers, neural networks, and support vector machines. However, these approaches are each based on a specific data type. In the real world, an intelligent system often encounters mixed data types, incomplete information (missing values), and imprecise information (fuzzy conditions). In the UCI (University of California, Irvine) Machine Learning Repository, many real-world data sets have missing values and mixed data types. Enabling machine learning or data mining approaches to deal with mixed data types is a challenge (Ching, 1995; Coppock, 2003) because it is difficult to define a measure of similarity between objects with mixed-type attributes. The problem of mixed data types is a long-standing issue in data mining. The emerging techniques targeted at this issue fall into three classes: (1) symbolic data mining approaches plus discretizers (e.g., Dougherty et al., 1995; Wu, 1996; Kurgan et al., 2004; Diday, 2004; Darmont et al., 2006; Wu et al., 2007) that transform continuous data into symbolic data; (2) numerical data mining approaches plus a transformation from symbolic data to numerical data (e.g., Kasabov, 2003; Darmont et al., 2006; Hadzic et al., 2007); (3) hybrids of symbolic and numerical data mining approaches (e.g., Tung, 2002; Kasabov, 2003; Leng et al., 2005; Wu et al., 2006). Since hybrid approaches have the potential to exploit the advantages of both symbolic and numerical data mining, this chapter, after discussing the merits and shortcomings of current approaches, focuses on applying the Self-Organizing Computing Network Model to construct a hybrid system for knowledge discovery from databases with a diversity of data types. Future trends for data mining on mixed-type data are then discussed, and a conclusion is presented.
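The core difficulty named above, defining similarity between objects with mixed-type attributes, is often illustrated with a Gower-style dissimilarity: range-normalized absolute differences for numeric attributes and simple mismatch for categorical ones, averaged over attributes. The sketch below is a generic textbook illustration of that idea, not the chapter's Self-Organizing Computing Network Model.

```python
# Gower-style dissimilarity for mixed attribute types (generic illustration).
def gower_dissimilarity(a, b, num_idx, cat_idx, ranges):
    # Numeric attributes: absolute difference scaled by the attribute's range.
    num_part = [abs(a[i] - b[i]) / ranges[i] for i in num_idx if ranges[i] > 0]
    # Categorical attributes: 0 if equal, 1 if different.
    cat_part = [0.0 if a[i] == b[i] else 1.0 for i in cat_idx]
    parts = num_part + cat_part
    return sum(parts) / len(parts)

# Records: (age, income, colour, owns_car)
data = [(25, 30_000, "red", "yes"),
        (47, 82_000, "blue", "no"),
        (31, 41_000, "red", "yes")]

num_idx, cat_idx = [0, 1], [2, 3]
ranges = {i: max(r[i] for r in data) - min(r[i] for r in data) for i in num_idx}

print(gower_dissimilarity(data[0], data[1], num_idx, cat_idx, ranges))  # dissimilar pair
print(gower_dissimilarity(data[0], data[2], num_idx, cat_idx, ranges))  # similar pair
```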


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ewan Carr ◽  
Mathieu Carrière ◽  
Bertrand Michel ◽  
Frédéric Chazal ◽  
Raquel Iniesta

Background: This paper exploits recent developments in topological data analysis to present a pipeline for clustering based on Mapper, an algorithm that reduces complex data into a one-dimensional graph. Results: We present a pipeline to identify and summarise clusters based on statistically significant topological features from a point cloud using Mapper. Conclusions: Key strengths of this pipeline include the integration of prior knowledge to inform the clustering process and the selection of optimal clusters; the use of the bootstrap to restrict the search to robust topological features; the use of machine learning to inspect clusters; and the ability to incorporate mixed data types. Our pipeline can be downloaded under the GNU GPLv3 license at https://github.com/kcl-bhi/mapper-pipeline.
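For readers unfamiliar with Mapper, a minimal sketch of the core algorithm (not the authors' pipeline at https://github.com/kcl-bhi/mapper-pipeline; the filter, cover parameters, and clusterer below are illustrative assumptions): project the data onto a one-dimensional filter, cover the filter range with overlapping intervals, cluster the points falling in each interval's preimage, and connect clusters that share points to form the graph.

```python
# Minimal Mapper sketch: 1-D filter, overlapping interval cover, per-interval
# clustering with DBSCAN, and edges between clusters that share data points.
import numpy as np
from sklearn.cluster import DBSCAN

def mapper_graph(X, filter_values, n_intervals=10, overlap=0.3, eps=0.5):
    lo, hi = filter_values.min(), filter_values.max()
    length = (hi - lo) / n_intervals
    nodes, edges = [], set()
    for i in range(n_intervals):
        a = lo + i * length - overlap * length          # widened interval (overlap)
        b = lo + (i + 1) * length + overlap * length
        idx = np.where((filter_values >= a) & (filter_values <= b))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X[idx])
        for lab in set(labels) - {-1}:                  # ignore DBSCAN noise points
            nodes.append(set(idx[labels == lab]))
    # Connect clusters (graph nodes) that share at least one data point.
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if nodes[i] & nodes[j]:
                edges.add((i, j))
    return nodes, edges

X = np.random.default_rng(2).normal(size=(300, 2))
nodes, edges = mapper_graph(X, filter_values=X[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```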

