SCOUR: a stepwise machine learning framework for predicting metabolite-dependent regulatory interactions

Abstract Background The topology of metabolic networks is both well-studied and remarkably well-conserved across many species. The regulation of these networks, however, is much more poorly characterized, though it is known to be divergent across organisms—two characteristics that make it difficult to model metabolic networks accurately. While many computational methods have been built to unravel transcriptional regulation, there have been few approaches developed for systems-scale analysis and study of metabolic regulation. Here, we present a stepwise machine learning framework that applies established algorithms to identify regulatory interactions in metabolic systems based on metabolic data: stepwise classification of unknown regulation, or SCOUR. Results We evaluated our framework on both noiseless and noisy data, using several models of varying sizes and topologies to show that our approach is generalizable. We found that, when testing on data under the most realistic conditions (low sampling frequency and high noise), SCOUR could identify reaction fluxes controlled only by the concentration of a single metabolite (its primary substrate) with high accuracy. The positive predictive value (PPV) for identifying reactions controlled by the concentration of two metabolites ranged from 32 to 88% for noiseless data, 9.2 to 49% for either low sampling frequency/low noise or high sampling frequency/high noise data, and 6.6–27% for low sampling frequency/high noise data, with results typically sufficiently high for lab validation to be a practical endeavor. While the PPVs for reactions controlled by three metabolites were lower, they were still in most cases significantly better than random classification. Conclusions SCOUR uses a novel approach to synthetically generate the training data needed to identify regulators of reaction fluxes in a given metabolic system, enabling metabolomics and fluxomics data to be leveraged for regulatory structure inference. By identifying and triaging the most likely candidate regulatory interactions, SCOUR can drastically reduce the amount of time needed to identify and experimentally validate metabolic regulatory interactions. As high-throughput experimental methods for testing these interactions are further developed, SCOUR will provide critical impact in the development of predictive metabolic models in new organisms and pathways.

Download Full-text

SCOUR: A stepwise machine learning framework for predicting metabolite-dependent regulatory interactions

10.1101/2021.05.14.444159 ◽

2021 ◽

Author(s):

Justin Y Lee ◽

Britney Nguyen ◽

Carolos Orosco ◽

Mark P Styczynski

Keyword(s):

Machine Learning ◽

Metabolic Networks ◽

Sampling Frequency ◽

Low Noise ◽

Training Data ◽

High Noise ◽

Regulatory Interactions ◽

Learning Framework ◽

Metabolic Systems ◽

Noise Data

Background: The topology of metabolic networks is both well-studied and remarkably well-conserved across many species. The regulation of these networks, however, is much more poorly characterized, though it is known to be divergent across organisms -- two characteristics that make it difficult to model metabolic networks accurately. While many computational methods have been built to unravel transcriptional regulation, there have been few approaches developed for systems-scale analysis and study of metabolic regulation. Here, we present a stepwise machine learning framework that applies established algorithms to identify regulatory interactions in metabolic systems based on metabolic data: Stepwise Classification Of Unknown Regulation, or SCOUR. Results: We evaluated our framework on both noiseless and noisy data, using several models of varying sizes and topologies to show that our approach is generalizable. We found that, when testing on data under the most realistic conditions (low sampling frequency and high noise), SCOUR could identify reaction fluxes controlled only by the concentration of a single metabolite (its primary substrate) with high accuracy. The positive predictive value (PPV) for identifying reactions controlled by the concentration of two metabolites ranged from 32-88% for noiseless data, 9.2-49% for either low sampling frequency/low noise or high sampling frequency/high noise data, and 6.6-27% for low sampling frequency/high noise data, with results typically sufficiently high for lab validation to be a practical endeavor. While the PPVs for reactions controlled by three metabolites were lower, they were still in most cases significantly better than random classification. Conclusions: SCOUR uses a novel approach to synthetically generate the training data needed to identify regulators of reaction fluxes in a given metabolic system, enabling metabolomics and fluxomics data to be leveraged for regulatory structure inference. By identifying and triaging the most likely candidate regulatory interactions, SCOUR can drastically reduce the amount of time needed to identify and experimentally validate metabolic regulatory interactions. As high-throughput experimental methods for testing these interactions are further developed, SCOUR will provide critical impact in the development of predictive metabolic models in new organisms and pathways.

Download Full-text

A machine learning framework to determine geolocations from metagenomic profiling

Biology Direct ◽

10.1186/s13062-020-00278-z ◽

2020 ◽

Vol 15 (1) ◽

Cited By ~ 1

Author(s):

Lihong Huang ◽

Canqiang Xu ◽

Wenxian Yang ◽

Rongshan Yu

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Geographic Origin ◽

Training Data ◽

Metagenomic Data ◽

Training Dataset ◽

Kriging Interpolation ◽

Learning Framework ◽

Testing Data ◽

Microbial Samples

Abstract Background Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework to determine the geolocations from metagenomics profiling of microbial samples. Results Our method was applied to the multi-source microbiome data from MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints.First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. By using logistic regression with L2 normalization, the prediction accuracy of the model reaches 86%, averaged over 100 random splits of training and validation datasets.The testing data consists of samples from cities that do not occur in the training data. To predict the “mystery” cities that are not sampled before for the testing data, we first defined biological coordinates for sampled cities based on the similarity of microbial samples from them. Then we performed affine transform on the map such that the distance between cities measures their biological difference rather than geographical distance. After that, we derived the probabilities of a given testing sample from unsampled cities based on its predicted probabilities on sampled cities using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities-of-origin of testing samples. Conclusion Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset.

Download Full-text

Object Detection using Feature Mining in a Distributed Machine Learning Framework

10.51202/9783186855107 ◽

2017 ◽

Author(s):

Arne Ehlers

Keyword(s):

Machine Learning ◽

Object Detection ◽

Training Data ◽

Visual Object ◽

Ensemble Classifiers ◽

Adaptive Boosting ◽

Learning Framework ◽

Theory Of Evidence ◽

Feature Mining ◽

Distributed Machine Learning

This dissertation addresses the problem of visual object detection based on machine-learned classifiers. A distributed machine learning framework is developed to learn detectors for several object classes creating cascaded ensemble classifiers by the Adaptive Boosting algorithm. Methods are proposed that enhance several components of an object detection framework: At first, the thesis deals with augmenting the training data in order to improve the performance of object detectors learned from sparse training sets. Secondly, feature mining strategies are introduced to create feature sets that are customized to the object class to be detected. Furthermore, a novel class of fractal features is proposed that allows to represent a wide variety of shapes. Thirdly, a method is introduced that models and combines internal confidences and uncertainties of the cascaded detector using Dempster’s theory of evidence in order to increase the quality of the post-processing. ...

Download Full-text

Stochastic simulation and modelling of metabolic networks in a machine learning framework

Simulation Modelling Practice and Theory ◽

10.1016/j.simpat.2011.05.002 ◽

2011 ◽

Vol 19 (9) ◽

pp. 1957-1966 ◽

Cited By ~ 4

Author(s):

Marenglen Biba ◽

Fatos Xhafa ◽

Floriana Esposito ◽

Stefano Ferilli

Keyword(s):

Machine Learning ◽

Stochastic Simulation ◽

Metabolic Networks ◽

Learning Framework ◽

Simulation And Modelling

Download Full-text

On the Intrinsic Differential Privacy of Bagging

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/376 ◽

2021 ◽

Author(s):

Hongbin Liu ◽

Jinyuan Jia ◽

Neil Zhenqiang Gong

Keyword(s):

Machine Learning ◽

Differential Privacy ◽

Majority Vote ◽

Training Data ◽

Training Process ◽

Learning Methods ◽

Learning Framework ◽

Machine Learning Methods ◽

Additional Noise ◽

Base Learner

Differentially private machine learning trains models while protecting privacy of the sensitive training data. The key to obtain differentially private models is to introduce noise/randomness to the training process. In particular, existing differentially private machine learning methods add noise to the training data, the gradients, the loss function, and/or the model itself. Bagging, a popular ensemble learning framework, randomly creates some subsamples of the training data, trains a base model for each subsample using a base learner, and takes majority vote among the base models when making predictions. Bagging has intrinsic randomness in the training process as it randomly creates subsamples. Our major theoretical results show that such intrinsic randomness already makes Bagging differentially private without the needs of additional noise. Moreover, we prove that if no assumptions about the base learner are made, our derived privacy guarantees are tight. We empirically evaluate Bagging on MNIST and CIFAR10. Our experimental results demonstrate that Bagging achieves significantly higher accuracies than state-of-the-art differentially private machine learning methods with the same privacy budgets.

Download Full-text

Scalable Approach to High Coverages on Oxides via Iterative Training of a Machine-Learning Algorithm

10.26434/chemrxiv.10288514.v1 ◽

2019 ◽

Author(s):

Andrew Medford ◽

Shengchun Yang ◽

Fuzhu Liu

Keyword(s):

Machine Learning ◽

Chemical Potential ◽

Learning Algorithm ◽

Absolute Error ◽

Low Energy ◽

Training Data ◽

High Coverage ◽

Metal Compounds ◽

Adsorption Energies ◽

The Stability

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx and OHx species on the oxygen vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.

Download Full-text

Optimization of Diabetes Training DATA using Machine Learning Algorithms

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i2.283286 ◽

2018 ◽

Vol 6 (2) ◽

pp. 283-286

Author(s):

M. Samba Siva Rao ◽

◽

M.Yaswanth . ◽

K. Raghavendra Swamy ◽

◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Training Data

Download Full-text

Learning from metabolic networks: current trends and future directions for precision medicine.

Current Medicinal Chemistry ◽

10.2174/0929867328666201217103148 ◽

2020 ◽

Vol 28 ◽

Author(s):

Ilaria Granata ◽

Mario Manzo ◽

Ari Kusumastuti ◽

Mario R Guarracino

Keyword(s):

Systems Biology ◽

Precision Medicine ◽

Metabolic Networks ◽

Information Modeling ◽

Model Systems ◽

Specific Information ◽

Important Indicator ◽

Predictive Values ◽

Metabolic Systems ◽

Metabolic Models

Purpose: Systems biology and network modeling represent, nowadays, the hallmark approaches for the development of predictive and targeted-treatment based precision medicine. The study of health and disease as properties of the human body system allows the understanding of the genotype-phenotype relationship through the definition of molecular interactions and dependencies. In this scenario, metabolism plays a central role as its interactions are well characterized and it is considered an important indicator of the genotype-phenotype associations. In metabolic systems biology, the genome-scale metabolic models are the primary scaffolds to integrate multi-omics data as well as cell-, tissue-, condition-specific information. Modeling the metabolism has both investigative and predictive values. Several methods have been proposed to model systems, which involve steady-state or kinetic approaches, and to extract knowledge through machine and deep learning. Method: This review collects, analyzes, and compares the suitable data and computational approaches for the exploration of metabolic networks as tools for the development of precision medicine. To this extent, we organized it into three main sections: "Data and Databases", "Methods and Tools", and "Metabolic Networks for medicine". In the first one, we have collected the most used data and relative databases to build and annotate metabolic models. In the second section, we have reported the state-of-the-art methods and relative tools to reconstruct, simulate, and interpret metabolic systems. Finally, we have reported the most recent and innovative studies which exploited metabolic networks for the study of several pathological conditions, not only those directly related to the metabolism. Conclusion: We think that this review can be a guide to researchers of different disciplines, from computer science to biology and medicine, in exploring the power, challenges and future promises of the metabolism as predictor and target of the so-called P4 medicine (predictive, preventive, personalized and participatory).

Download Full-text

Comparative Analysis of Machine Learning Techniques Using Predictive Modeling

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999200904164539 ◽

2020 ◽

Vol 13 ◽

Author(s):

Ritu Khandelwal ◽

Hemlata Goyal ◽

Rajveer Singh Shekhawat

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Data Science ◽

Training Data ◽

Machine Learning Techniques ◽

Future Trends ◽

Data Set ◽

Learning Stage ◽

Learning Techniques ◽

Different Types

Introduction: Machine learning is an intelligent technology that works as a bridge between businesses and data science. With the involvement of data science, the business goal focuses on findings to get valuable insights on available data. The large part of Indian Cinema is Bollywood which is a multi-million dollar industry. This paper attempts to predict whether the upcoming Bollywood Movie would be Blockbuster, Superhit, Hit, Average or Flop. For this Machine Learning techniques (classification and prediction) will be applied. To make classifier or prediction model first step is the learning stage in which we need to give the training data set to train the model by applying some technique or algorithm and after that different rules are generated which helps to make a model and predict future trends in different types of organizations. Methods: All the techniques related to classification and Prediction such as Support Vector Machine(SVM), Random Forest, Decision Tree, Naïve Bayes, Logistic Regression, Adaboost, and KNN will be applied and try to find out efficient and effective results. All these functionalities can be applied with GUI Based workflows available with various categories such as data, Visualize, Model, and Evaluate. Result: To make classifier or prediction model first step is learning stage in which we need to give the training data set to train the model by applying some technique or algorithm and after that different rules are generated which helps to make a model and predict future trends in different types of organizations Conclusion: This paper focuses on Comparative Analysis that would be performed based on different parameters such as Accuracy, Confusion Matrix to identify the best possible model for predicting the movie Success. By using Advertisement Propaganda, they can plan for the best time to release the movie according to the predicted success rate to gain higher benefits. Discussion: Data Mining is the process of discovering different patterns from large data sets and from that various relationships are also discovered to solve various problems that come in business and helps to predict the forthcoming trends. This Prediction can help Production Houses for Advertisement Propaganda and also they can plan their costs and by assuring these factors they can make the movie more profitable.

Download Full-text

Multilayer Soil Moisture Mapping at a Regional Scale from Multisource Data via a Machine Learning Method

Remote Sensing ◽

10.3390/rs11030284 ◽

2019 ◽

Vol 11 (3) ◽

pp. 284 ◽

Cited By ~ 1

Author(s):

Linglin Zeng ◽

Shun Hu ◽

Daxiang Xiang ◽

Xiang Zhang ◽

Deren Li ◽

...

Keyword(s):

Machine Learning ◽

Soil Moisture ◽

Regional Scale ◽

Remotely Sensed ◽

Temporal Variations ◽

Training Data ◽

Estimation Accuracy ◽

Learning Approaches ◽

Remotely Sensed Data ◽

Deep Soil

Soil moisture mapping at a regional scale is commonplace since these data are required in many applications, such as hydrological and agricultural analyses. The use of remotely sensed data for the estimation of deep soil moisture at a regional scale has received far less emphasis. The objective of this study was to map the 500-m, 8-day average and daily soil moisture at different soil depths in Oklahoma from remotely sensed and ground-measured data using the random forest (RF) method, which is one of the machine-learning approaches. In order to investigate the estimation accuracy of the RF method at both a spatial and a temporal scale, two independent soil moisture estimation experiments were conducted using data from 2010 to 2014: a year-to-year experiment (with a root mean square error (RMSE) ranging from 0.038 to 0.050 m3/m3) and a station-to-station experiment (with an RMSE ranging from 0.044 to 0.057 m3/m3). Then, the data requirements, importance factors, and spatial and temporal variations in estimation accuracy were discussed based on the results using the training data selected by iterated random sampling. The highly accurate estimations of both the surface and the deep soil moisture for the study area reveal the potential of RF methods when mapping soil moisture at a regional scale, especially when considering the high heterogeneity of land-cover types and topography in the study area.

Download Full-text