Prediction of Whole-Cell Transcriptional Response with Machine Learning

Author(s):  
Mohammed Eslami ◽  
Amin Espah-Borujeni ◽  
Hamed Eramian ◽  
Mark Weston ◽  
George Zheng ◽  
...  

Abstract Motivation: Applications in synthetic and systems biology can benefit from measuring whole-cell response to biochemical perturbations. Executing experiments that cover all possible combinations of perturbations is infeasible. In this paper, we present the host response model (HRM), a machine learning approach that maps the response to single perturbations onto the transcriptional response to a combination of perturbations. Results: The HRM combines high-throughput sequencing with machine learning to infer links between experimental context, prior knowledge of cell regulatory networks, and RNASeq data to predict a gene’s dysregulation. We find that the HRM can predict the directionality of dysregulation in response to a combination of inducers with an accuracy of >90% using data from single inducers. We further find that the use of prior, known cell regulatory networks doubles the predictive performance of the HRM (R² increases from 0.3 to 0.65). The model was validated in two organisms, E. coli and B. subtilis, using new experiments conducted after training. Finally, while the HRM is trained on gene expression data, its direct prediction of differential expression also makes it possible to conduct enrichment analyses on its predictions. We show that the HRM can accurately classify >95% of the pathway regulations. The HRM reduces the number of RNASeq experiments needed, as responses can be tested in silico to focus experiments. Availability: The HRM software and tutorial are available at https://github.com/sd2e/CDM, and the configurable differential expression analysis tools and tutorials are available at https://github.com/SD2E/omics_tools. Supplementary information: Supplementary data are available at Bioinformatics online.
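The two headline metrics in this abstract, directional accuracy and R², can be illustrated with a minimal sketch. The function names and the log2 fold-change values below are made up for illustration and do not come from the HRM software; the sketch only shows how such metrics are commonly computed, not how the authors computed them.

```python
def directional_accuracy(observed, predicted):
    """Fraction of genes whose predicted fold change has the observed sign."""
    def sign(x):
        return (x > 0) - (x < 0)
    matches = sum(sign(o) == sign(p) for o, p in zip(observed, predicted))
    return matches / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical log2 fold changes for five genes (illustrative values only).
observed = [2.0, -1.0, 0.5, -2.0, 1.5]
predicted = [1.5, -0.5, -0.5, -1.5, 1.0]

print(directional_accuracy(observed, predicted))
print(round(r_squared(observed, predicted), 3))
```

Directional accuracy only asks whether up- or down-regulation was called correctly, so it can be high even when the magnitude-sensitive R² is modest.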

2021 ◽  
Author(s):  
Mohammed Eslami ◽  
Amin Espah Borujeni ◽  
Hamed Eramian ◽  
Hamid Doost Hosseini ◽  
Matthew Vaughn ◽  
...  

Applications in synthetic and systems biology can benefit from measuring whole-cell response to biochemical perturbations. Executing experiments that cover all possible combinations of perturbations is infeasible. In this paper, we present the host response model (HRM), a machine learning approach that takes the cell response to single perturbations as input and predicts the whole-cell transcriptional response to a combination of inducers. We find that the HRM can qualitatively predict the directionality of dysregulation in response to a combination of inducers with an accuracy of >90% using data from single inducers. We further find that the use of prior, known cell regulatory networks doubles the predictive performance of the HRM (R² increases from 0.3 to 0.65). This tool will significantly reduce the number of high-throughput sequencing experiments that need to be run to characterize the transcriptional impact of a combination of perturbations on the host.


2020 ◽  
Vol 36 (11) ◽  
pp. 3385-3392
Author(s):  
Zi-Lin Liu ◽  
Jing-Hao Hu ◽  
Fan Jiang ◽  
Yun-Dong Wu

Abstract Motivation: High-throughput sequencing has discovered many naturally occurring disulfide-rich peptides, or cystine-rich peptides (CRPs), with diverse bioactivities. However, their structural information, which is very important for peptide drug discovery, is still very limited. Results: We have developed a CRP-specific structure prediction method called Cystine-Rich peptide Structure Prediction (CRiSP), based on a customized template database with cystine-specific sequence alignment and three machine-learning predictors. The modeling accuracy is significantly better than that of several popular general-purpose structure modeling methods, and CRiSP can provide useful model quality estimates. Availability and implementation: The CRiSP server is freely available at http://wulab.com.cn/CRISP. Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.


Genes ◽  
2021 ◽  
Vol 12 (12) ◽  
pp. 1947
Author(s):  
Samarendra Das ◽  
Anil Rai ◽  
Michael L. Merchant ◽  
Matthew C. Cave ◽  
Shesh N. Rai

Single-cell RNA-sequencing (scRNA-seq) is a recent high-throughput sequencing technique for studying gene expression at the cell level. Differential expression (DE) analysis is a major downstream analysis of scRNA-seq data. DE analysis in the presence of noise from different sources remains a key challenge in scRNA-seq. Earlier practices for addressing this involved borrowing methods from bulk RNA-seq, which are based on non-zero differences in the average expression of genes across cell populations. Later, several methods specifically designed for scRNA-seq were developed. To provide guidance on choosing an appropriate tool or developing a new one, it is necessary to comprehensively study the performance of DE analysis methods. Here, we provide a review and classification of different DE approaches adapted from bulk RNA-seq practice as well as those specifically designed for scRNA-seq. We also evaluate the performance of 19 widely used methods in terms of 13 performance metrics on 11 real scRNA-seq datasets. Our findings suggest that some bulk RNA-seq methods are quite competitive with the single-cell methods and that their performance depends on the underlying models, DE test statistic(s), and data characteristics. Further, it is difficult to identify a method that performs best globally on the basis of any individual performance criterion. However, the multi-criteria and combined-data analysis indicates that DECENT and EBSeq are the best options for DE analysis. The results also reveal similarities among the tested methods in terms of detecting common DE genes. Our evaluation provides guidelines for selecting the tool that performs best under particular experimental settings in the context of scRNA-seq.


2015 ◽  
Author(s):  
Rahul Reddy

As RNA-Seq and other high-throughput sequencing methods grow in use and remain critical for gene expression studies, technical variability in count data impedes studies of differential expression, comparison of data across samples and experiments, and reproduction of results. Studies like Dillies et al. (2013) compare several between-lane normalization methods involving scaling factors, while Hansen et al. (2012) and Risso et al. (2014) propose methods that correct for sample-specific bias or use sets of control genes to isolate and remove technical variability. This paper evaluates four normalization methods in terms of reducing intra-group, technical variability and facilitating differential expression analysis or other research where the biological, inter-group variability is of interest. To this end, the four methods were evaluated in differential expression analysis between data from Pickrell et al. (2010) and Montgomery et al. (2010) and between simulated data modeled on these two datasets. Though the between-lane scaling factor methods perform worse on real data sets, they are much stronger on simulated data. We cannot reject the recommendation of Dillies et al. to use TMM and DESeq normalization, but further study of the power to detect effects of different sizes under each normalization method is merited.
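As a rough illustration of what a between-sample scaling factor does, the sketch below computes median-of-ratios factors, the idea behind DESeq-style normalization. It is a simplified toy version written for this listing, not the code of any package and not the procedure evaluated in the paper; the function name and the counts are hypothetical, and TMM uses a different, trimmed-mean scheme that is not shown.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """counts: list of genes, each a list of raw counts per sample.

    Returns one scaling factor per sample.
    """
    n_samples = len(counts[0])
    # Use only genes with non-zero counts in every sample, so that the
    # geometric mean across samples is defined and positive.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count to its cross-sample geometric mean.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy counts: the second sample was sequenced twice as deeply as the first.
counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # skipped when estimating factors: contains a zero
]
factors = median_of_ratios_size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
print(factors)
print(normalized[0])
```

Dividing each sample's counts by its factor puts the samples on a common scale, so that the remaining differences are more likely to reflect biology than sequencing depth.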


2021 ◽  
Author(s):  
Yu Hamaguchi ◽  
Chao Zeng ◽  
Michiaki Hamada

Abstract Background: Differential expression (DE) analysis of RNA-seq data typically depends on gene annotations. Different sets of gene annotations are available for the human genome and are continually updated, a process complicated by the development and application of high-throughput sequencing technologies. However, the impact of the complexity of gene annotations on DE analysis remains unclear. Results: Using “mappability”, a metric of the complexity of gene annotation, we compared three distinct human gene annotations, GENCODE, RefSeq, and NONCODE, and evaluated how mappability affected DE analysis. We found that mappability differed significantly among the human gene annotations. We also found that increasing mappability improved the performance of DE analysis, and that the impact of mappability was mainly evident in the quantification step and propagated systematically to the downstream steps of DE analysis. Conclusions: We assessed how the complexity of gene annotations affects DE analysis using mappability. Our findings indicate that the growth and complexity of gene annotations negatively impact the performance of DE analysis, suggesting that an approach that excludes unnecessary gene models from gene annotations improves the performance of DE analysis.


2019 ◽  
Vol 35 (22) ◽  
pp. 4834-4836
Author(s):  
Tim Jeske ◽  
Peter Huypens ◽  
Laura Stirm ◽  
Selina Höckele ◽  
Christine M Wurmser ◽  
...  

Abstract Summary: Despite the fundamental role of small RNAs in various biological processes, the analysis of small RNA sequencing data remains a challenging task. Major obstacles arise when short RNA sequences map to multiple locations in the genome, align to regions that are not annotated, or have undergone post-transcriptional changes that hamper accurate mapping. To tackle these issues, we present a novel profiling strategy that circumvents the need for read mapping to a reference genome by using the actual read sequences to determine expression intensities. After differential expression analysis of individual sequence counts, significant sequences are annotated against user-defined feature databases and clustered by sequence similarity. This strategy enables a more comprehensive and concise representation of small RNA populations without any data loss or data distortion. Availability and implementation: Code and documentation of our R package are available at http://ibis.helmholtz-muenchen.de/deus/. Supplementary information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 26 (12) ◽  
pp. 1427-1436
Author(s):  
Haley S Hunter-Zinck ◽  
Jordan S Peck ◽  
Tania D Strout ◽  
Stephan A Gaehde

Abstract Objective: Emergency departments (EDs) continue to pursue optimal patient flow without sacrificing quality of care. The speed with which a healthcare provider receives pertinent information, such as results from clinical orders, can impact flow. We seek to determine if clinical ordering behavior can be predicted at triage during an ED visit. Materials and Methods: Using data available during triage, we trained multilabel machine learning classifiers to predict clinical orders placed during an ED visit. We benchmarked 4 classifiers with 2 multilabel learning frameworks that predict orders independently (binary relevance) or simultaneously (random k-labelsets). We evaluated algorithm performance, calculated variable importance, and conducted a simple simulation study to examine the effects of algorithm implementation on length of stay and cost. Results: Aggregate performance across orders was highest when predicting orders independently with a multilayer perceptron (median F1 score = 0.56), but prediction frameworks that simultaneously predict orders for a visit enhanced predictive performance for correlated orders. Visit acuity was the most important predictor for most orders. Simulation results indicated that direct implementation of the model would increase ordering costs (from $21 to $45 per visit) but reduce length of stay (from 158 minutes to 151 minutes) over all visits. Discussion: Simulated implementations of the predictive algorithm decreased length of stay but increased ordering costs. Optimal implementation of these predictions to reduce patient length of stay without incurring additional costs requires more exploration. Conclusions: It is possible to predict common clinical orders placed during an ED visit with data available at triage.
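The difference between the two multilabel framings named in this abstract can be sketched by looking at how the training targets are built. The snippet below is an illustrative toy, with hypothetical order names and helper functions that are not taken from the paper: binary relevance creates one independent 0/1 target per order, while a labelset approach (the building block that random k-labelsets applies to random subsets of k labels) treats each combination of orders as a single class.

```python
def binary_relevance_targets(label_sets, labels):
    """One independent 0/1 target vector per label."""
    return {lab: [int(lab in s) for s in label_sets] for lab in labels}


def labelset_targets(label_sets, subset):
    """One multiclass target: which labels of `subset` occur together in each visit."""
    return [tuple(sorted(s & subset)) for s in label_sets]


# Hypothetical orders placed during four visits (illustrative only).
visits = [
    {"cbc", "troponin"},
    {"cbc"},
    {"xray"},
    {"cbc", "troponin", "xray"},
]
labels = ["cbc", "troponin", "xray"]

br = binary_relevance_targets(visits, labels)
ls = labelset_targets(visits, {"cbc", "troponin"})
print(br)
print(ls)
```

Because the labelset target keeps co-occurring orders together in one class, a classifier trained on it can exploit correlations between orders that independent per-label models cannot see.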


2019 ◽  
Vol 18 (03) ◽  
pp. 747-791
Author(s):  
Robin Gubela ◽  
Artem Bequé ◽  
Stefan Lessmann ◽  
Fabian Gebert

Uplift modeling combines machine learning and experimental strategies to estimate the differential effect of a treatment on individuals’ behavior. The paper considers uplift models in the scope of marketing campaign targeting. The literature on uplift modeling strategies is fragmented across academic disciplines and lacks an overarching empirical comparison. Using data from online retailers, we fill this gap and contribute to the literature by consolidating prior work on uplift modeling and systematically comparing the predictive performance and utility of available uplift modeling strategies. Our empirical study includes three experiments in which we examine the interaction between an uplift modeling strategy and the underlying machine learning algorithm used to implement the strategy, quantify model performance in terms of business value, and demonstrate the advantages of uplift models over response models, which are widely used in marketing. The results facilitate specific recommendations on how to deploy uplift models in e-commerce applications.
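To make the contrast between a response model and an uplift model concrete, here is a minimal sketch that estimates uplift per customer segment as the difference in conversion rates between treated and control customers. This is the simplest possible difference-in-rates estimate, chosen for illustration; it is not one of the strategies benchmarked in the paper, and the segment names and records are invented.

```python
def conversion_rate(outcomes):
    return sum(outcomes) / len(outcomes)


def uplift_by_segment(records):
    """records: iterable of (segment, treated, converted) tuples.

    Returns, per segment, P(convert | treated) - P(convert | control).
    Segments lacking either treated or control customers are skipped.
    """
    groups = {}
    for segment, treated, converted in records:
        group = groups.setdefault(segment, {True: [], False: []})
        group[bool(treated)].append(int(converted))
    return {
        seg: conversion_rate(g[True]) - conversion_rate(g[False])
        for seg, g in groups.items()
        if g[True] and g[False]
    }


# Hypothetical campaign data: (segment, treated, converted).
records = [
    ("new", 1, 1), ("new", 1, 1), ("new", 1, 0), ("new", 1, 1),
    ("new", 0, 0), ("new", 0, 1), ("new", 0, 0), ("new", 0, 0),
    ("loyal", 1, 1), ("loyal", 1, 1),
    ("loyal", 0, 1), ("loyal", 0, 1),
]
uplift = uplift_by_segment(records)
print(uplift)
```

In this toy data the "loyal" segment converts most often, so a response model would rank it first, yet its estimated uplift is zero because those customers convert with or without the campaign.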


2021 ◽  
Author(s):  
Massimiliano Greco ◽  
Giovanni Angelotti ◽  
Pier Francesco Caruso ◽  
Alberto Zanella ◽  
Niccolò Stomeo ◽  
...  

Abstract Introduction: SARS-CoV-2 infection was first identified at the end of 2019 in China and subsequently spread globally. COVID-19 frequently affects the lungs, leading to bilateral viral pneumonia that progresses in some cases to severe respiratory failure requiring ICU admission and mechanical ventilation. Risk stratification at ICU admission is fundamental for resource allocation and decision making, considering that baseline comorbidities, age, and patient condition at admission have been associated with poorer outcomes. Supervised machine learning techniques are increasingly widespread in clinical medicine and can predict mortality and test associations with high predictive performance. We assessed the performance of a machine learning approach to predict mortality in COVID-19 patients admitted to the ICU using data from the Lombardy ICU Network. Methods: This is a secondary analysis of prospectively collected data from the Lombardy ICU Network. To predict survival at 7, 14, and 28 days, we built two different models: model A included patient demographics, medications before admission, and comorbidities, while model B also included data from the first day after ICU admission. Ten-fold cross-validation was repeated 2500 times to ensure optimal hyperparameter choice. The only constraint imposed on model optimization was the choice of logistic regression as the final layer, to increase clinical interpretability. Different imputation and over-sampling techniques were employed in model training. Results: A total of 1503 patients were included, with 766 deaths (51%). Exploratory analysis and Kaplan-Meier curves demonstrated an association of mortality with age and gender. Models A and B reached their greatest predictive performance at 28 days (AUC 0.77 and 0.79), with lower performance at 14 days (AUC 0.72 and 0.74) and 7 days (AUC 0.68 and 0.71). Male gender, age, and number of comorbidities were strongly associated with mortality in both models. Among comorbidities, chronic kidney disease and chronic obstructive pulmonary disease demonstrated an association. Mode of ventilatory assistance at ICU admission and fraction of inspired oxygen were associated with mortality in model B. Conclusions: Supervised machine learning models demonstrated good performance in predicting 28-day mortality. The 7-day and 14-day predictions demonstrated lower performance. Machine learning techniques may be useful in emergency phases to reach higher predictive performance with reduced human supervision using complex data.
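The evaluation described in this abstract rests on two generic building blocks, k-fold splitting and the AUC. The sketch below shows one plain-Python way to write both; it is an illustration under simplified assumptions (contiguous folds, no stratification or shuffling, made-up scores), not a reconstruction of the authors' pipeline.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the rank-based definition
    of the area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = died) and predicted risks for six patients.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(round(auc(labels, scores), 3))

folds = list(k_fold_indices(10, 3))
print([len(test) for _, test in folds])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported values of 0.68 to 0.79.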

