A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation

2018 ◽  
Vol 9 ◽  
Author(s):  
Adam McDermaid ◽  
Xin Chen ◽  
Yiran Zhang ◽  
Cankun Wang ◽  
Shaopeng Gu ◽  
...  
2019 ◽  
Author(s):  
William A Figgett ◽  
Katherine Monaghan ◽  
Milica Ng ◽  
Monther Alhamdoosh ◽  
Eugene Maraskovsky ◽  
...  

Abstract Objective: Systemic lupus erythematosus (SLE) is a heterogeneous autoimmune disease that is difficult to treat. There is currently no optimal stratification of patients with SLE, and thus responses to available treatments are unpredictable. Here, we developed a new stratification scheme for patients with SLE, based on the whole-blood transcriptomes of patients with SLE. Methods: We applied machine learning approaches to RNA-sequencing (RNA-seq) datasets to stratify patients with SLE into four distinct clusters based on their gene expression profiles. A meta-analysis on two recently published whole-blood RNA-seq datasets was carried out and an additional similar dataset of 30 patients with SLE and 29 healthy donors was contributed in this research; 141 patients with SLE and 51 healthy donors were analysed in total. Results: Examination of SLE clusters, as opposed to unstratified SLE patients, revealed underappreciated differences in the pattern of expression of disease-related genes relative to clinical presentation. Moreover, gene signatures correlated to flare activity were successfully identified. Conclusion: Given that disease heterogeneity has confounded research studies and clinical trials, our approach addresses current unmet medical needs and provides a greater understanding of SLE heterogeneity in humans. Stratification of patients based on gene expression signatures may be a valuable strategy to harness disease heterogeneity and identify patient populations that may be at an increased risk of disease symptoms. Further, this approach can be used to understand the variability in responsiveness to therapeutics, thereby improving the design of clinical trials and advancing personalised therapy.


2015 ◽  
Author(s):  
Jeffrey A Thompson ◽  
Jie Tan ◽  
Casey S Greene

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1621 ◽  
Author(s):  
Jeffrey A. Thompson ◽  
Jie Tan ◽  
Casey S. Greene

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
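
Two of the baseline transformations named in this abstract can be illustrated with a small sketch. The Python below is not the TDM package (which the authors provide for R) and does not reproduce the TDM algorithm. Under simple assumptions, it shows a log2 transformation with a pseudocount and a rank-based quantile normalization of one sample onto a reference distribution. The function names, the pseudocount of 1, and the linear interpolation between reference quantiles are illustrative choices, not details taken from the paper.

```python
import math


def log2_transform(values, pseudocount=1.0):
    """Return log2(x + pseudocount) for each value (pseudocount is an assumption)."""
    return [math.log2(v + pseudocount) for v in values]


def quantile_normalize_to_reference(target, reference):
    """Map each target value onto the reference distribution by its rank.

    The smallest target value receives the smallest reference value, the
    largest receives the largest, and intermediate ranks are linearly
    interpolated between sorted reference values. Tied target values keep
    their input order, so they may receive slightly different outputs.
    """
    if not target or not reference:
        raise ValueError("target and reference must be non-empty")
    ref_sorted = sorted(reference)
    n, m = len(target), len(ref_sorted)
    # Indices of the target values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: target[i])
    result = [0.0] * n
    for rank, idx in enumerate(order):
        # Position of this rank on the reference's 0..m-1 index scale.
        pos = rank / (n - 1) * (m - 1) if n > 1 else (m - 1) / 2
        lo = int(math.floor(pos))
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        result[idx] = ref_sorted[lo] * (1 - frac) + ref_sorted[hi] * frac
    return result


# Hypothetical RNA-seq counts for three genes and a hypothetical
# microarray-like reference distribution.
rnaseq_counts = [100, 5, 50]
array_reference = [1.0, 2.0, 3.0]
normalized = quantile_normalize_to_reference(rnaseq_counts, array_reference)
logged = log2_transform([0, 1, 3])
```

In this sketch the ordering of genes within the sample is preserved while the values take on the shape of the reference distribution, which is the general idea behind making RNA-seq data resemble the data a legacy model was trained on.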


2022 ◽  
Author(s):  
Sofya Lipnitskaya ◽  
Yang Shen ◽  
Stefan Legewie ◽  
Holger Klein ◽  
Kolja Becker

Abstract Background: Recent transcriptomics studies performed at the single-cell and population levels reveal noticeable variability in gene expression measurements provided by different RNA sequencing technologies. Because single-cell RNA-Seq (scRNA-Seq) data are noisier and more complex than bulk data, there is a substantial number of variably expressed genes and so-called dropouts, which challenge the subsequent computational analysis and can lead to false positive discoveries. To investigate factors affecting technical variability between RNA sequencing experiments of different technologies, we performed a systematic assessment of single-cell and bulk RNA-Seq data that had undergone the same pre-processing and sample preparation procedures. Results: Our analysis indicates that variability between gene expression measurements, as well as dropout events, is not exclusively caused by biological variability, low expression levels, or random variation. Furthermore, we propose FAVSeq, a machine learning-assisted pipeline for detecting factors that contribute to gene expression variability in matched RNA-Seq data provided by two technologies. Based on the analysis of the matched bulk and single-cell dataset, we found the 3'-UTR and transcript lengths to be the most relevant effectors of the observed variation between RNA-Seq experiments, while the same factors together with cellular compartments were shown to be associated with dropouts. Conclusions: We investigated the sources of variation in RNA-Seq profiles of matched single-cell and bulk experiments. In addition, we proposed the FAVSeq pipeline for analyzing multimodal RNA sequencing data, which allowed us to identify factors affecting quantitative differences in gene expression measurements as well as the presence of dropouts. 
The derived knowledge can be used to improve the interpretation of RNA-Seq data and to identify genes that may be affected by assay-based deviations. Source code is available under the MIT license at https://github.com/slipnitskaya/FAVSeq.


2019 ◽  
Author(s):  
Shikha Roy ◽  
Rakesh Kumar ◽  
Vaibhav Mittal ◽  
Dinesh Gupta

Abstract: Early detection of breast cancer and correct determination of its stage are important for prognosis and for providing appropriate personalized clinical treatment to breast cancer patients. However, despite considerable efforts and progress, there is a need to identify the specific genomic factors responsible for, or accompanying, the progression stages of invasive ductal carcinoma (IDC), which can aid the determination of the correct cancer stage. We have developed two-class machine-learning classification models to differentiate the early and late stages of invasive ductal carcinoma. The prediction models are trained with RNA-seq gene expression profiles representing different IDC stages of 610 patients, obtained from The Cancer Genome Atlas (TCGA). Different supervised learning algorithms were trained and evaluated with an enriched model learning, facilitated by different feature selection methods. We also developed a machine-learning classifier trained on the same datasets, with the training sets reduced to data corresponding to IDC driver genes. Based on these two classifiers, we have developed a web server, Duct-BRCA-CSP, to distinguish early from late stages of IDC based on input RNA-seq gene expression profiles. Our analysis also enables deeper insights into the stage-dependent molecular events accompanying breast ductal carcinoma progression. The server is publicly available at http://bioinfo.icgeb.res.in/duct-BRCA-CSP.


2021 ◽  
Vol 11 ◽  
Author(s):  
Jaewoong Lee ◽  
Sungmin Cho ◽  
Seong-Eui Hong ◽  
Dain Kang ◽  
Hayoung Choi ◽  
...  

BCR-ABL1–positive acute leukemia can be classified into three disease categories: B-lymphoblastic leukemia (B-ALL), acute myeloid leukemia (AML), and mixed-phenotype acute leukemia (MPAL). We conducted an integrative analysis of RNA sequencing (RNA-seq) data obtained from 12 BCR-ABL1–positive B-ALL, AML, and MPAL samples to evaluate its diagnostic utility. RNA-seq facilitated the identification of all p190 BCR-ABL1 with accurate splicing sites and a new gene fusion involving MAP2K2. Most of the clinically significant mutations were also identified, including single-nucleotide variations, insertions, and deletions. In addition, RNA-seq yielded differential gene expression profiles according to the disease category. Therefore, we selected 368 genes differentially expressed between AML and B-ALL and developed two differential diagnosis models based on the gene expression data using 1) a scoring algorithm and 2) machine learning. Both models showed excellent diagnostic accuracy not only for our 12 BCR-ABL1–positive cases but also for 427 public gene expression datasets from acute leukemias regardless of specific genetic aberration. This is the first trial to develop models of differential diagnosis using RNA-seq, especially to evaluate the potential role of machine learning in identifying the disease category of acute leukemia. The integrative analysis of gene expression data by RNA-seq facilitates the accurate differential diagnosis of acute leukemia with successful detection of significant gene fusions and/or mutations, which warrants further investigation.


2017 ◽  
Author(s):  
Nathan T. Johnson ◽  
Andi Dhroso ◽  
Katelyn J. Hughes ◽  
Dmitry Korkin

Abstract: The extent to which genes are expressed in the cell can be simplistically defined as a function of one or more factors of the environment, lifestyle, and genetics. RNA sequencing (RNA-Seq) is becoming a prevalent approach to quantify gene expression and is expected to provide better insights into a number of biological and biomedical questions than DNA microarrays. Most importantly, RNA-Seq allows expression to be quantified at both the gene and the alternative splicing isoform levels. However, leveraging RNA-Seq data requires the development of new data mining and analytics methods. Supervised machine learning methods are commonly used for biological data analysis and have recently gained attention for their applications to RNA-Seq data. In this work, we assess the utility of supervised learning methods trained on RNA-Seq data for a diverse range of biological classification tasks. We hypothesize that isoform-level expression data are more informative for biological classification tasks than gene-level expression data. Our large-scale assessment uses multiple datasets, organisms, lab groups, and RNA-Seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-Seq datasets and include over 2,000 samples that come from multiple organisms, lab groups, and RNA-Seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and the pathological tumor stage for the samples from cancerous tissue. For each classification problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the isoform-based classifiers outperform or are comparable with gene expression-based methods. 
The top-performing supervised learning techniques reached near-perfect classification accuracy, demonstrating the utility of supervised learning for RNA-Seq-based data analysis.
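
The distinction between gene-level and isoform-level features can be made concrete with a toy example. The sketch below is not from the paper; it assumes the common convention that a gene's abundance is the sum of the abundances of its isoforms, and all identifiers and values are hypothetical. It shows one plausible reason the isoform-level hypothesis could hold: two samples with different dominant isoforms can collapse to identical gene-level profiles.

```python
from collections import defaultdict


def gene_level_from_isoforms(isoform_abundance, isoform_to_gene):
    """Sum isoform-level abundances (e.g. TPM) into gene-level abundances.

    Raises KeyError if an isoform has no entry in the mapping.
    """
    gene_abundance = defaultdict(float)
    for isoform_id, value in isoform_abundance.items():
        gene_abundance[isoform_to_gene[isoform_id]] += value
    return dict(gene_abundance)


# Hypothetical isoform-to-gene mapping and two hypothetical samples.
mapping = {"G1.iso1": "G1", "G1.iso2": "G1", "G2.iso1": "G2"}
sample_a = {"G1.iso1": 9.0, "G1.iso2": 1.0, "G2.iso1": 5.0}
sample_b = {"G1.iso1": 1.0, "G1.iso2": 9.0, "G2.iso1": 5.0}

genes_a = gene_level_from_isoforms(sample_a, mapping)
genes_b = gene_level_from_isoforms(sample_b, mapping)

# The isoform profiles differ, but the gene-level profiles are the same,
# so a classifier given only gene-level features could not separate them.
isoforms_differ = sample_a != sample_b
genes_identical = genes_a == genes_b
```

A classifier trained on `sample_a` and `sample_b` directly would see the isoform switch in G1, whereas one trained on `genes_a` and `genes_b` would not.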


2021 ◽  
Author(s):  
Yanzhou Zhang ◽  
Qing Zhu ◽  
Xiufeng Cao ◽  
Bin Ni

Abstract Background and objective: Esophageal cancer (ESCA) ranks eleventh in incidence and eighth in mortality among malignant tumors worldwide. Because effective early diagnostic approaches are lacking, many patients miss the optimal treatment window and are already at an advanced stage at first diagnosis. Continuing advances in high-throughput sequencing technologies and analytical techniques have provided novel concepts and approaches for the study of cancer biomarkers in esophageal cancer. The development of cancer is a complex biological process involving multiple genes, interacting factors, and multiple phases. This process includes mutations in proto-oncogenes, changes in transcript expression profiles, and abnormalities of protein structure, function, or expression levels. Studying the molecular mechanism of ESCA using high-throughput sequencing technology will lay a theoretical foundation for the early diagnosis and targeted therapy of ESCA. Materials and methods: A search was conducted in two commonly used public databases, UCSC XENA and GEO; one UCSC XENA RNA-seq dataset and two GEO datasets were included in this study. Differential expression analysis was performed using limma in R. Weighted gene co-expression network analysis (WGCNA) was used to analyze the transcriptome expression profiles of 181 ESCA tissues and 181 normal tissues as controls and to construct a topology network. We constructed gene modules and searched for modules closely related to ESCA, and gene ontology (GO) and KEGG pathway enrichment analyses were performed to probe the functions of the DEGs and the differentially expressed hub genes in key modules. By combining the results of the differential expression analysis with the WGCNA results (hub genes), we obtained 30 differentially expressed genes in a module closely related to ESCA. 
Next, we extracted the expression data of these genes from the normalized transcriptome expression data to construct an ESCA predictive model. Then, ten-fold cross-validation combined with machine learning algorithms was used to construct prediction models for ESCA. Finally, we verified the four screened biomarkers used to build the predictive model with the GEO datasets. Results: Differentially expressed genes were identified using the limma package and defined as |log2FC| > 1 and adj.P.Val < 0.01. According to the limma results, a total of 15814 genes were up-regulated and 6176 genes were down-regulated in ESCA. A total of 7 gene modules were identified by WGCNA, 2 of which were strongly correlated with ESCA (Brown module: R2=0.87, Lightcyan module: R2=-0.75, both P <0.001). The brown module is closely related to ESCA. Combining the WGCNA results with the differentially expressed genes revealed 4419 differentially expressed genes in the brown module that were closely related to ESCA. 
Thirty hub genes were screened as the top 30 by kWithin from the brown module, and all of them are differentially expressed. GO analysis of the differentially expressed genes from the brown module revealed that these genes belong to components such as immunoglobulin complex, “chromosome, centromeric region”, condensed chromosome, “immunoglobulin complex, circulating”, “condensed chromosome, centromeric region”, and other components; that they participate in functions such as antigen binding, immunoglobulin receptor binding, ATPase activity, cadherin binding, and DNA helicase activity; and that they are involved in biological processes such as adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains, mitotic nuclear division, lymphocyte mediated immunity, nuclear division, and DNA replication. KEGG pathway analysis shows that the differentially expressed genes of the brown module are mainly enriched in pathways such as cell cycle, pathogenic Escherichia coli infection, DNA replication, IL-17 signaling pathway, and human T-cell leukemia virus 1 infection. These findings shed new light on the molecular mechanisms of ESCA development. Twelve ESCA prediction models, constructed from the expression matrices of 30 genes from 362 subjects using 10-fold cross-validation combined with machine learning algorithms, showed good prediction performance in the validation dataset; models from the gbm, BoostGLM, and C5.0 algorithms showed higher accuracy than those from other algorithms. Although the transparent or semi-transparent models constructed by the JRip, PART, and Rpart algorithms have acceptable accuracy in the validation dataset, their sensitivity is lower. Overall, two black-box models, gbm and BoostGLM, were selected as the final models. 
This study successfully constructed ESCA prediction models with accuracies higher than 0.97. Finally, three of the four screened biomarkers were validated. Conclusions: In the current study, differential expression analysis and WGCNA of ESCA-related RNA-seq data available in public databases were used to screen DEGs and genes closely related to ESCA. Results from GO and KEGG analysis further revealed the underlying mechanisms of ESCA. Normalized gene expression data were fed to several machine learning techniques, and 10-fold cross-validation was used to construct high-accuracy ESCA predictive models. Ultimately, several ESCA predictive models with accuracy higher than 0.96 in the validation group were constructed. In addition, three biomarkers (G3BP1, CHEK1 and MOB1A) were screened and validated; in particular, G3BP1 may be a potential therapeutic target, as overall survival analysis showed it to be an adverse prognostic factor. The current study lays the basis for applying RNA-seq data in the early genetic diagnosis of ESCA and identifies a prognostic marker that might contribute to the treatment of ESCA.
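
Two steps described in this abstract lend themselves to a short sketch: the differential expression cutoff (|log2FC| > 1 and adj.P.Val < 0.01) and the 10-fold cross-validation split. The authors worked in R with limma and other packages; the Python below is only an illustration of the logic, not their code. The gene names and statistics are hypothetical, the function names are invented for this example, and the stratified round-robin split is an assumed way of forming folds (the abstract does not say whether the folds were stratified).

```python
import random


def classify_deg(log2fc, adj_p, lfc_cutoff=1.0, p_cutoff=0.01):
    """Label a gene 'up', 'down', or 'ns' (not significant).

    A gene is called differentially expressed when |log2FC| > lfc_cutoff
    and the adjusted p-value is below p_cutoff.
    """
    if adj_p < p_cutoff and abs(log2fc) > lfc_cutoff:
        return "up" if log2fc > 0 else "down"
    return "ns"


def stratified_k_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with class balance roughly preserved.

    Indices of each class are shuffled and dealt round-robin into the folds,
    continuing from where the previous class stopped.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


# Hypothetical limma-style rows: (gene, log2FC, adj.P.Val).
rows = [
    ("geneA", 2.3, 0.001),   # large positive change, significant
    ("geneB", -1.8, 0.004),  # large negative change, significant
    ("geneC", 0.4, 0.0001),  # significant but below the fold-change cutoff
    ("geneD", 3.0, 0.2),     # large change but not significant
]
calls = {gene: classify_deg(lfc, p) for gene, lfc, p in rows}

# 181 tumor and 181 normal samples, as in the abstract.
labels = ["tumor"] * 181 + ["normal"] * 181
folds = stratified_k_folds(labels, k=10)
```

In each round of cross-validation, one of the ten folds would serve as the held-out set and the remaining nine as training data, so every sample is used for validation exactly once.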

