Opening the Random Forest Black Box of the Metabolome by the Application of Surrogate Minimal Depth

Metabolites ◽  
2021 ◽  
Vol 12 (1) ◽  
pp. 5
Author(s):  
Soeren Wenck ◽  
Marina Creydt ◽  
Jule Hansen ◽  
Florian Gärber ◽  
Markus Fischer ◽  
...  

For the untargeted analysis of the metabolome of biological samples with liquid chromatography–mass spectrometry (LC-MS), high-dimensional data sets containing many different metabolites are obtained. Since the utilization of these complex data is challenging, different machine learning approaches have been developed. These methods are usually applied as black-box classification tools, and detailed information about the class differences that result from the complex interplay of the metabolites is not obtained. Here, we demonstrate that this information is accessible through random forest (RF) approaches, especially surrogate minimal depth (SMD), which is applied here to metabolomics data for the first time. We show this by selecting important features and evaluating their mutual impact on the multi-level classification of white asparagus with regard to provenance and biological identity. SMD enables the identification of multiple features from the same metabolites and reveals meaningful biological relations, demonstrating its high potential for the comprehensive utilization of high-dimensional metabolomics data.
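
As a concrete starting point, the sketch below computes the classical minimal-depth importance that SMD extends, using a scikit-learn random forest on synthetic stand-in data for an LC-MS feature table. Scikit-learn does not expose the surrogate splits that SMD additionally evaluates, so this is only the non-surrogate baseline; the authors provide an R implementation of full SMD.

```python
# Classical minimal-depth importance over a scikit-learn random forest.
# NOTE: this is NOT surrogate minimal depth; SMD additionally evaluates
# surrogate splits, which scikit-learn's trees do not expose.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for an LC-MS feature table: 200 samples x 50 features.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

def minimal_depths(estimator, n_features):
    """Depth of the first split on each feature in one tree (inf if unused)."""
    t = estimator.tree_
    depths = np.full(n_features, np.inf)
    stack = [(0, 0)]                        # (node_id, depth)
    while stack:
        node, depth = stack.pop()
        if t.children_left[node] != -1:     # internal node
            f = t.feature[node]
            depths[f] = min(depths[f], depth)
            stack.append((t.children_left[node], depth + 1))
            stack.append((t.children_right[node], depth + 1))
    return depths

all_depths = np.array([minimal_depths(e, X.shape[1]) for e in forest.estimators_])
all_depths[np.isinf(all_depths)] = np.nan   # feature unused in that tree
mean_md = np.nanmean(all_depths, axis=0)    # small depth = important feature
print("top features by minimal depth:", np.argsort(mean_md)[:5])
```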

Metabolites ◽  
2020 ◽  
Vol 10 (12) ◽  
pp. 479
Author(s):  
Gayatri R. Iyer ◽  
Janis Wigginton ◽  
William Duren ◽  
Jennifer L. LaBarre ◽  
Marci Brandenburg ◽  
...  

Modern analytical methods allow for the simultaneous detection of hundreds of metabolites, generating increasingly large and complex data sets. The analysis of metabolomics data is a multi-step process that involves data processing and normalization, followed by statistical analysis. One of the biggest challenges in metabolomics is linking alterations in metabolite levels to specific biological processes that are disrupted, contributing to the development of disease or reflecting the disease state. A common approach to accomplishing this goal involves pathway mapping and enrichment analysis, which assesses the relative importance of predefined metabolic pathways or other biological categories. However, traditional knowledge-based enrichment analysis has limitations when it comes to the analysis of metabolomics and lipidomics data. We present a Java-based, user-friendly bioinformatics tool named Filigree that provides a primarily data-driven alternative to the existing knowledge-based enrichment analysis methods. Filigree is based on our previously published differential network enrichment analysis (DNEA) methodology. To demonstrate the utility of the tool, we applied it to previously published studies analyzing the metabolome in the context of metabolic disorders (type 1 and 2 diabetes) and the maternal and infant lipidome during pregnancy.
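
A toy numeric illustration of the differential-network idea behind DNEA follows: estimate an association network per condition and score a metabolite set by how much its internal connectivity changes, with significance from a permutation test. The published method is considerably more elaborate (it jointly estimates graphical models across conditions), so everything below, including the correlation threshold and module choice, is illustrative only.

```python
# Toy differential-network enrichment: compare thresholded correlation
# networks between two conditions and score a metabolite module.
import numpy as np

rng = np.random.default_rng(0)
ctrl = rng.normal(size=(80, 20))    # 80 samples x 20 metabolites, condition A
case = rng.normal(size=(80, 20))    # condition B
case[:, :5] += 0.8 * case[:, [0]]   # induce extra co-variation in a module

def network(data, thresh=0.3):
    """Thresholded absolute-correlation adjacency matrix."""
    c = np.abs(np.corrcoef(data, rowvar=False))
    np.fill_diagonal(c, 0.0)
    return np.where(c > thresh, c, 0.0)

diff = np.abs(network(case) - network(ctrl))    # differential edge weights
module = list(range(5))                         # candidate metabolite set
score = diff[np.ix_(module, module)].sum() / 2  # within-module connectivity change

# Permutation test: reshuffle sample labels to build a null for the score.
pooled = np.vstack([ctrl, case])
null = []
for _ in range(200):
    idx = rng.permutation(len(pooled))
    d = np.abs(network(pooled[idx[:80]]) - network(pooled[idx[80:]]))
    null.append(d[np.ix_(module, module)].sum() / 2)
p = (1 + sum(n >= score for n in null)) / (1 + len(null))
print(f"module score={score:.2f}, permutation p={p:.3f}")
```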


2018 ◽  
Vol 10 (2) ◽  
pp. 186-206 ◽  
Author(s):  
Kevin van Hecke ◽  
Guido de Croon ◽  
Laurens van der Maaten ◽  
Daniel Hennes ◽  
Dario Izzo

Self-supervised learning is a reliable learning mechanism in which a robot uses an original, trusted sensor cue for training to recognize an additional, complementary sensor cue. We study, for the first time in self-supervised learning, how a robot's learning behavior should be organized so that the robot can keep performing its task if the original cue becomes unavailable. We study this persistent form of self-supervised learning in the context of a flying robot that has to avoid obstacles based on distance estimates from the visual cue of stereo vision. Over time, it learns to also estimate distances from monocular appearance cues. A strategy is introduced that has the robot switch from flight based on stereo vision to flight based on monocular vision, with stereo vision used purely as "training wheels" to avoid imminent collisions. This strategy is shown to be an effective approach to the "feedback-induced data bias" problem, which is also experienced in learning from demonstration. Both simulations and real-world experiments with a stereo-vision-equipped ARDrone2 show the feasibility of this approach, with the robot successfully using monocular vision to avoid obstacles in a 5 × 5 m room. The experiments show the potential of persistent self-supervised learning as a robust learning approach to enhance the capabilities of robots. Moreover, the abundant training data from the robot's own sensors make it possible to gather the large data sets necessary for deep learning approaches.
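
The control logic can be summarized in a few lines. The sketch below is a deliberately simplified stand-in for the authors' system: a linear model trained online with stereo depth as the label, and a hand-picked error threshold governing the switch to monocular flight. The features, thresholds, and learner are all illustrative assumptions.

```python
# Toy persistent self-supervised learning loop: stereo depth supervises a
# monocular model; control switches to monocular once it is trustworthy.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
mono = SGDRegressor(learning_rate="constant", eta0=1e-3)
avg_err, use_monocular = 1.0, False

for step in range(2000):
    appearance = rng.normal(size=(1, 10))       # monocular appearance cues
    stereo_depth = appearance.sum() + rng.normal(scale=0.1)  # trusted stereo cue

    mono.partial_fit(appearance, [stereo_depth])  # stereo labels train monocular
    pred = mono.predict(appearance)[0]

    avg_err = 0.99 * avg_err + 0.01 * abs(pred - stereo_depth)
    if avg_err < 0.3:           # monocular estimates now consistently accurate
        use_monocular = True    # stereo kept only as collision "training wheels"
    depth_for_control = pred if use_monocular else stereo_depth
```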


Algorithms ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 109 ◽  
Author(s):  
Marian B. Gorzałczany ◽  
Filip Rudziński

In this paper, we briefly present several modifications and generalizations of the concept of self-organizing neural networks—usually referred to as self-organizing maps (SOMs)—to illustrate their advantages in applications that range from high-dimensional data visualization to complex data clustering. Starting from conventional SOMs, we discuss Growing SOMs (GSOMs), Growing Grid Networks (GGNs), the Incremental Grid Growing (IGG) approach, the Growing Neural Gas (GNG) method, as well as our two original solutions, i.e., Generalized SOMs with 1-Dimensional Neighborhood (GeSOMs with 1DN, also referred to as Dynamic SOMs (DSOMs)) and Generalized SOMs with Tree-Like Structures (GeSOMs with T-LSs). They are characterized in terms of (i) the modification mechanisms used, (ii) the range of network modifications introduced, (iii) the structure regularity, and (iv) the data-visualization/data-clustering effectiveness. The performance of the particular solutions is illustrated and compared on selected data sets. We also show that our original solutions, i.e., GeSOMs with 1DN (DSOMs) and GeSOMs with T-LSs, outperform alternative approaches in various complex clustering tasks, providing up to a 20% increase in clustering accuracy. The contribution of this work is threefold. First, original, algorithm-oriented computer implementations of the particular SOM generalizations are developed. Second, their detailed simulation results are presented and discussed. Third, the advantages of our aforementioned original solutions are demonstrated.
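
For reference, the sketch below implements the conventional SOM that all of the growing variants (GSOM, GGN, IGG, GNG, GeSOM) extend by inserting or removing neurons during training; the parameter schedules are typical textbook choices, not those of the compared methods.

```python
# Minimal conventional SOM: a fixed 2-D grid of prototypes trained with a
# decaying learning rate and a shrinking Gaussian neighborhood.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))             # toy input vectors
rows, cols = 10, 10
weights = rng.normal(size=(rows, cols, 3))    # one prototype per grid node
grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))

for t, x in enumerate(data):
    lr = 0.5 * np.exp(-t / 500)               # decaying learning rate
    sigma = 3.0 * np.exp(-t / 500)            # shrinking neighborhood radius
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(dists.argmin(), dists.shape)  # best-matching unit
    # Gaussian neighborhood on the grid pulls nearby prototypes toward x.
    g = np.exp(-np.linalg.norm(grid - np.array(bmu), axis=2) ** 2
               / (2 * sigma ** 2))
    weights += lr * g[..., None] * (x - weights)
```

The growing variants replace this fixed rows × cols grid with a structure that expands where quantization error accumulates, which is the behavior the paper's comparison evaluates.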


2017 ◽  
Vol 10 (13) ◽  
pp. 355 ◽  
Author(s):  
Reshma Remesh ◽  
Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analyzing high-dimensional data sets. The raw input data set may have many dimensions, and analysis may be time-consuming and lead to wrong predictions if unnecessary data attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensions of the input data toward accurate prediction at lower cost. In this paper, the different machine learning approaches used for dimensionality reduction, such as principal component analysis (PCA), singular value decomposition (SVD), linear discriminant analysis (LDA), kernel principal component analysis (KPCA), and artificial neural networks (ANNs), are studied.
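
Two of the surveyed techniques are available directly in scikit-learn; the snippet below projects a synthetic 100-dimensional data set to 10 dimensions with linear PCA and with kernel PCA (the data set and component counts are arbitrary illustrations).

```python
# Linear PCA vs. kernel PCA on a synthetic high-dimensional data set.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_classification(n_samples=300, n_features=100, random_state=0)

pca = PCA(n_components=10).fit(X)
X_pca = pca.transform(X)                       # shape (300, 10)
X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)  # (300, 10)
print("variance retained by PCA:", pca.explained_variance_ratio_.sum())
```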


Author(s):  
Jiajun Yang ◽  
Thomas Hermann

In this paper, we revisit, explore, and extend the Particle Trajectory Sonification (PTS) model, which supports cluster analysis of high-dimensional data by probing a model space with virtual particles that are "gravitationally" attracted to a mode of the data set's potential function. The progression of the particles' kinetic energy as a function of time is summed directly into a signal that constitutes the sonification. The exponential increase in computation power since the model's conception in 1999 now enables, for the first time, the investigation of real-time interactivity in such complex, interwoven dynamic sonification models. We sped up the computation of the PTS model with (i) data optimization via vector quantization and (ii) parallel computing via OpenCL. We investigated the performance of sonifying high-dimensional complex data under different approaches. The results show a substantial increase in speed when applying vector quantization and CPU parallelism. GPU parallelism provided a substantial speedup for very large numbers of particles compared with the CPU, but did not show enough benefit for low numbers of particles due to copying overhead. A hybrid OpenCL implementation is presented to maximize the benefits of both worlds.
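
The core dynamics are compact enough to sketch directly: particles accelerate down the gradient of a kernel-density-like potential built from the data, and their summed kinetic energy per time step is the audio signal. The masses, damping, and kernel bandwidth below are illustrative; the two hot spots (the inner loop over data points and the independent particles) are exactly where the paper's vector quantization and OpenCL parallelism apply.

```python
# Sketch of PTS dynamics: particles fall into modes of a data-defined
# potential; summed kinetic energy over time forms the sonification signal.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))            # (possibly vector-quantized) data
pos = rng.uniform(-3, 3, size=(16, 2))      # particle positions
vel = np.zeros_like(pos)
dt, damping, sigma = 0.01, 0.02, 0.5
signal = []

for _ in range(4000):                       # one audio sample per step
    # Gradient of a Gaussian-kernel potential centred on the data points.
    diff = data[None, :, :] - pos[:, None, :]        # (particles, data, 2)
    w = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))
    force = np.sum(w[..., None] * diff, axis=1)      # attraction toward data
    vel += dt * force - damping * vel
    pos += dt * vel
    signal.append(np.sum(vel ** 2))         # summed kinetic energy -> audio

signal = np.asarray(signal) - np.mean(signal)        # remove DC before playback
```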


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5199
Author(s):  
Wanli Zhang ◽  
Yanming Di

The accumulation of RNA sequencing (RNA-Seq) gene expression data in recent years has resulted in large and complex data sets of high dimensions. Exploratory analysis, including data mining and visualization, reveals hidden patterns and potential outliers in such data, but is often challenged by their high dimensionality. The scatterplot matrix is a commonly used tool for visualizing multivariate data and allows us to view multiple bivariate relationships simultaneously. However, the scatterplot matrix becomes less effective for high-dimensional data because the number of bivariate displays grows quadratically with the data dimensionality. In this study, we introduce a selection criterion for each bivariate scatterplot and design and implement an algorithm that automatically scans and ranks all possible scatterplots, with the goal of identifying the plots in which the separation between two predefined groups is maximized. By applying our method to a multi-experiment Arabidopsis RNA-Seq data set, we were able to successfully pinpoint the visualization angles from which genes from two biological pathways are the most separated, as well as identify potential outliers.
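
In outline, the approach scores every pair of dimensions and ranks the resulting scatterplots. The sketch below uses a simple centroid-separation score as a stand-in for the paper's selection criterion, which may differ; the data are synthetic, with the separation planted in two known dimensions.

```python
# Rank all bivariate scatterplots by how well they separate two groups.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
dims = 12                                   # e.g., 12 RNA-Seq experiments
g1 = rng.normal(0.0, 1, size=(40, dims))    # genes from pathway 1
g2 = rng.normal(0.5, 1, size=(40, dims))    # genes from pathway 2
g2[:, 3] += 2.0                             # plant the real separation
g2[:, 7] += 2.0                             # in dimensions 3 and 7

def separation(a, b):
    """Distance between group centroids in units of pooled spread."""
    d = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    s = np.sqrt(a.var(axis=0).mean() + b.var(axis=0).mean())
    return d / s

scores = {(i, j): separation(g1[:, [i, j]], g2[:, [i, j]])
          for i, j in combinations(range(dims), 2)}
best = max(scores, key=scores.get)
print("best scatterplot axes:", best)       # expect (3, 7)
```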


2021 ◽  
Author(s):  
Amy Bednar

Topological data analysis (TDA), a growing area of mathematics, uses fundamental concepts of topology to analyze complex, high-dimensional data. A topological network represents the data, and TDA uses this network to analyze the shape of the data and to identify features in the network that correspond to patterns in the data, from which knowledge can be extracted. TDA provides a framework to advance machine learning's ability to understand and analyze large, complex data. This paper provides background information about TDA, TDA applications for large data sets, and details related to the investigation and implementation of existing tools and environments.
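
One common way to build the "topological network" mentioned above is the Mapper construction, sketched below on a noisy circle: cover a one-dimensional projection of the data with overlapping slices, cluster within each slice, and connect clusters that share points. Production pipelines use dedicated TDA libraries; the parameters here (slice count, overlap, clustering settings) are illustrative.

```python
# Minimal Mapper-style network: nodes are per-slice clusters, edges connect
# clusters that share data points. A noisy circle should yield a loop.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(300, 2))

lens = X[:, 0]                                   # filter: project onto x-axis
cover = np.linspace(lens.min(), lens.max(), 6)   # 5 intervals over the filter
nodes = []
for lo, hi in zip(cover[:-1], cover[1:]):
    idx = np.where((lens >= lo - 0.15) & (lens <= hi + 0.15))[0]  # overlap
    labels = DBSCAN(eps=0.3).fit_predict(X[idx])
    for k in set(labels) - {-1}:                 # -1 marks DBSCAN noise
        nodes.append(set(idx[labels == k]))      # node = one cluster

edges = {(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
         if nodes[i] & nodes[j]}                 # shared points -> edge
print(f"{len(nodes)} nodes, {len(edges)} edges")
```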


10.2196/15791 ◽  
2020 ◽  
Vol 8 (6) ◽  
pp. e15791
Author(s):  
Shannon Wongvibulsin ◽  
Katherine C Wu ◽  
Scott L Zeger

Background Despite the promise of machine learning (ML) to inform individualized medical care, the clinical utility of ML in medicine has been limited by the minimal interpretability and black box nature of these algorithms. Objective The study aimed to demonstrate a general and simple framework for generating clinically relevant and interpretable visualizations of black box predictions to aid in the clinical translation of ML. Methods To obtain improved transparency of ML, simplified models and visual displays can be generated using common methods from clinical practice such as decision trees and effect plots. We illustrated the approach based on postprocessing of ML predictions, in this case random forest predictions, and applied the method to data from the Left Ventricular (LV) Structural Predictors of Sudden Cardiac Death (SCD) Registry for individualized risk prediction of SCD, a leading cause of death. Results With the LV Structural Predictors of SCD Registry data, SCD risk predictions are obtained from a random forest algorithm that identifies the most important predictors, nonlinearities, and interactions among a large number of variables while naturally accounting for missing data. The black box predictions are postprocessed using classification and regression trees into a clinically relevant and interpretable visualization. The method also quantifies the relative importance of an individual predictor or a combination of predictors. Several risk factors (heart failure hospitalization, cardiac magnetic resonance imaging indices, and serum markers of systemic inflammation) can be clearly visualized as branch points of a decision tree to discriminate between low-, intermediate-, and high-risk patients. Conclusions Through a clinically important example, we illustrate a general and simple approach to increase the clinical translation of ML through clinician-tailored visual displays of results from black box algorithms. We illustrate this general model-agnostic framework by applying it to SCD risk prediction. Although we illustrate the methods using SCD prediction with random forest, the methods presented are applicable more broadly to improving the clinical translation of ML, regardless of the specific ML algorithm or clinical application. As any trained predictive model can be summarized in this manner to a prespecified level of precision, we encourage the use of simplified visual displays as an adjunct to the complex predictive model. Overall, this framework can allow clinicians to peek inside the black box and develop a deeper understanding of the most important features from a model to gain trust in the predictions and confidence in applying them to clinical care.
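
The postprocessing step itself can be reproduced with standard tools. The sketch below fits a random forest, bins its predicted risk into low/intermediate/high, and then fits a small CART tree to those predictions so that the branch points summarize the black box; the data and cutoffs are synthetic stand-ins, not the registry's variables.

```python
# Surrogate decision tree summarizing black-box random forest predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

prob = forest.predict_proba(X)[:, 1]           # black-box risk predictions
risk = np.digitize(prob, [0.33, 0.66])         # 0=low, 1=intermediate, 2=high

# The surrogate tree is trained on the forest's OUTPUT, so its branch points
# explain the black box rather than re-fitting the raw labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, risk)
print(export_text(surrogate))                  # clinician-readable flowchart
print("fidelity to forest:", (surrogate.predict(X) == risk).mean())
```

The fidelity score reflects how precisely the simplified display matches the full model, echoing the paper's point that any trained predictive model can be summarized to a prespecified level of precision.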


2021 ◽  
Vol 2083 (2) ◽  
pp. 022041
Author(s):  
Caiyu Liu ◽  
Zuofeng Zhou ◽  
Qingquan Wu

As an important part of road maintenance, the detection of road sprinkles has attracted extensive attention from scholars. However, after years of research, some problems remain. First, the detection accuracy of traditional algorithms is deficient. Second, deep learning approaches face great limitations here because sprinkles come in many kinds, which makes it difficult to build a data set. In view of these problems, this paper proposes a road sprinkle detection method based on multi-feature fusion. Our method considers color, gradient, luminance, and neighborhood information. Compared with other traditional methods, it achieves higher detection accuracy. In addition, compared with deep learning-based methods, our approach does not require creating a complex data set, which reduces costs. The main contributions of this paper are as follows: I. For the first time, a density clustering algorithm is combined with sprinkle detection, which provides a new idea for this field. II. Multi-feature fusion improves the accuracy and robustness of the traditional method, making the algorithm usable in many real-world scenarios.
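
In outline, the pipeline can be approximated as below: fuse per-pixel color, gradient, and luminance cues into an anomaly score, then let density clustering group the flagged pixels into candidate regions. The synthetic image, feature weights, and DBSCAN settings are illustrative assumptions rather than the paper's exact design.

```python
# Multi-feature fusion + density clustering for candidate sprinkle regions.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
img = rng.uniform(0, 1, size=(120, 160, 3))      # stand-in for a road image
img[40:50, 60:80] += 0.8                         # bright "sprinkle" patch
img = img.clip(0, 1)

lum = img.mean(axis=2)                           # luminance
gy, gx = np.gradient(lum)
grad = np.hypot(gx, gy)                          # gradient magnitude
color_dev = np.abs(img - img.mean(axis=(0, 1))).sum(axis=2)  # color deviation

# Fuse the cues and keep the pixels that deviate most from the road surface.
score = 0.4 * color_dev + 0.3 * grad + 0.3 * np.abs(lum - np.median(lum))
ys, xs = np.where(score > np.percentile(score, 99))

# Density clustering turns scattered candidate pixels into coherent regions.
labels = DBSCAN(eps=3, min_samples=10).fit_predict(np.c_[ys, xs])
n_regions = len(set(labels) - {-1})              # -1 marks noise pixels
print(f"{n_regions} candidate sprinkle region(s)")
```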

