voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data

PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3890 ◽  
Author(s):  
Gokmen Zararsiz ◽  
Dincer Goksuluk ◽  
Bernd Klaus ◽  
Selcuk Korkmaz ◽  
Vahap Eldem ◽  
...  

RNA-Seq is a recent and efficient technique that uses the capabilities of next-generation sequencing technology for characterizing and quantifying transcriptomes. One important task using gene-expression data is to identify a small subset of genes that can be used to build diagnostic classifiers, particularly for cancer. Microarray-based classifiers are not directly applicable to RNA-Seq data due to its discrete nature. Overdispersion is another problem that requires careful modeling of the mean-variance relationship of RNA-Seq data. In this study, we present voomDDA classifiers: variance modeling at the observational level (voom) extensions of the nearest shrunken centroids (NSC) and the diagonal discriminant classifiers. VoomNSC is one of these classifiers and brings the voom and NSC approaches together for the purpose of gene-expression based classification. For this purpose, we propose weighted statistics and put these weighted statistics into the NSC algorithm. VoomNSC is a sparse classifier that models the mean-variance relationship using the voom method and incorporates voom’s precision weights into the NSC classifier via weighted statistics. A comprehensive simulation study was designed, and four real datasets were used for performance assessment. The overall results indicate that voomNSC is the sparsest classifier. It also provides the most accurate results, together with power-transformed Poisson linear discriminant analysis, rlog-transformed support vector machines, and random forests algorithms. In addition to prediction purposes, the voomNSC classifier can be used to identify potential diagnostic biomarkers for a condition of interest. Through this work, statistical learning methods proposed for microarrays can be reused for RNA-Seq data. An interactive web application is freely available at http://www.biosoft.hacettepe.edu.tr/voomDDA/.
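The abstract above describes putting weighted statistics into a nearest shrunken centroid classifier. The following is a minimal sketch of that general idea, not the voomDDA implementation: it computes precision-weighted class centroids and soft-thresholds their offsets from the overall centroid, so genes whose offsets all shrink to zero drop out of the classifier. The function names, the toy data, and the weights are hypothetical, and the standardization by within-class standard deviation used in the full NSC algorithm is deliberately omitted to keep the example short.

```python
def weighted_mean(values, weights):
    """Weighted average; with equal weights this is the ordinary mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def soft_threshold(d, delta):
    """Shrink d toward zero by delta, returning exactly 0.0 if |d| <= delta."""
    shrunk = abs(d) - delta
    if shrunk <= 0:
        return 0.0
    return shrunk if d > 0 else -shrunk


def shrunken_offsets(expr, weights, labels, delta):
    """expr[g][s] is the log-expression of gene g in sample s, weights[g][s]
    is its precision weight (hypothetical values here), labels[s] is the class.
    Returns {class: [shrunken centroid offset per gene]}."""
    classes = sorted(set(labels))
    offsets = {c: [] for c in classes}
    for gene_values, gene_weights in zip(expr, weights):
        overall = weighted_mean(gene_values, gene_weights)
        for c in classes:
            idx = [s for s, lab in enumerate(labels) if lab == c]
            class_mean = weighted_mean(
                [gene_values[s] for s in idx], [gene_weights[s] for s in idx]
            )
            offsets[c].append(soft_threshold(class_mean - overall, delta))
    return offsets


def selected_genes(offsets):
    """Indices of genes with a non-zero offset in at least one class."""
    n_genes = len(next(iter(offsets.values())))
    return [g for g in range(n_genes) if any(offsets[c][g] != 0.0 for c in offsets)]


# Toy data: gene 0 separates the classes, gene 1 does not.
expr = [[5.0, 5.2, 8.0, 8.2], [3.0, 3.1, 3.0, 3.1]]
weights = [[2.0, 1.0, 1.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
labels = ["A", "A", "B", "B"]

offsets = shrunken_offsets(expr, weights, labels, delta=0.5)
print(selected_genes(offsets))
```

Raising `delta` removes more genes, which is how a shrunken-centroid classifier trades accuracy against sparsity.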

2005 ◽  
Vol 03 (02) ◽  
pp. 185-205 ◽  
Author(s):  
CHRIS DING ◽  
HANCHUAN PENG

Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expression among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. We propose a minimum redundancy-maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 6 gene expression data sets: NCI, Lymphoma, Lung, Child Leukemia, Leukemia, and Colon. Improvements are observed consistently among 4 classification methods: Naïve Bayes, Linear discriminant analysis, Logistic regression, and Support vector machines. Supplementary: The top 60 MRMR genes for each of the datasets are listed in . More information related to MRMR methods can be found at .
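A greedy selection loop along the lines of the MRMR idea can be sketched as follows. This is an illustration under assumptions rather than the authors' code: features are taken to be already discretized, mutual information is estimated from raw counts, and each candidate is scored by relevance minus mean redundancy (a difference-style criterion). The gene names and values are invented for the example.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def mrmr_select(features, target, k):
    """Greedily pick k feature names, maximizing relevance to the target
    minus mean redundancy with the features already selected."""
    relevance = {name: mutual_information(vals, target) for name, vals in features.items()}
    selected, remaining = [], list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical discretized expression values for eight samples.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "gene_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "gene_a_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # fully redundant with gene_a
    "gene_b": [0, 0, 1, 0, 0, 1, 1, 1],       # weaker, but adds new information
}

print(mrmr_select(features, target, k=2))
```

Ranking by relevance alone would pick `gene_a` and its duplicate; the redundancy penalty makes the second pick `gene_b` instead.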


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Li Tong ◽  
Po-Yen Wu ◽  
John H. Phan ◽  
Hamid R. Hassazadeh ◽  
...  

To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focused on the impact of the joint effects of RNA-seq pipelines on gene expression estimation as well as the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline’s performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer datasets (i.e., SEQC neuroblastoma dataset and the NIH/NCI TCGA lung adenocarcinoma dataset). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and this impact extended to the downstream prediction of these cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimation tended to perform better in the prediction of disease outcome. In the end, we provided scenarios as guidelines for users to use these three metrics to select sensible RNA-seq pipelines for improved accuracy, precision, and reliability of gene expression estimation, which leads to improved downstream gene expression-based prediction of disease outcome.


2020 ◽  
Author(s):  
Matthew N. Bernstein ◽  
Zijian Ni ◽  
Michael Collins ◽  
Mark E. Burkard ◽  
Christina Kendziorski ◽  
...  

Background: Single-cell RNA-seq (scRNA-seq) enables the profiling of genome-wide gene expression at the single-cell level and in so doing facilitates insight into and information about cellular heterogeneity within a tissue. Perhaps nowhere is this more important than in cancer, where tumor and tumor microenvironment heterogeneity directly impact development, maintenance, and progression of disease. While publicly available scRNA-seq cancer datasets offer unprecedented opportunity to better understand the mechanisms underlying tumor progression, metastasis, drug resistance, and immune evasion, much of the available information has been underutilized, in part, due to the lack of tools available for aggregating and analysing these data. Results: We present CHARacterizing Tumor Subpopulations (CHARTS), a computational pipeline and web application for analyzing, characterizing, and integrating publicly available scRNA-seq cancer datasets. CHARTS enables the exploration of individual gene expression, cell type, malignancy-status, differentially expressed genes, and gene set enrichment results in subpopulations of cells across multiple tumors and datasets. Conclusion: CHARTS is an easy to use, comprehensive platform for exploring single-cell subpopulations within tumors across the ever-growing collection of public scRNA-seq cancer datasets. CHARTS is freely available at charts.morgridge.org.


2021 ◽  
Vol 18 (17) ◽  
Author(s):  
Micheal Olaolu AROWOLO ◽  
Marion Olubunmi ADEBIYI ◽  
Chiebuka Timothy NNODIM ◽  
Sulaiman Olaniyi ABDULSALAM ◽  
Ayodele Ariyo ADEBIYI

As mosquito parasites breed across many parts of sub-Saharan Africa, infected cells embrace an unpredictable and erratic life period. Millions of individual parasites have gene expressions. Ribonucleic acid sequencing (RNA-seq) is a popular transcriptional technique that has improved the detection of major genetic probes. RNA-seq analysis generally requires computational improvements from machine learning techniques since it computes interpretations of gene expressions. For this study, an adaptive genetic algorithm (A-GA) with recursive feature elimination (RFE) (A-GA-RFE) feature selection algorithm was utilized to detect important information from a high-dimensional gene expression malaria vector RNA-seq dataset. Support Vector Machine (SVM) kernels were used as the classification algorithms to evaluate its predictive performance. The feasibility of this study was confirmed by using an RNA-seq dataset from the mosquito Anopheles gambiae. The technique achieved accuracy rates of 98.3 and 96.7 %, respectively. HIGHLIGHTS: Dimensionality reduction method based on feature selection; classification using a support vector machine; classification of a malaria vector dataset using an adaptive GA-RFE-SVM.


2006 ◽  
Vol 24 (18_suppl) ◽  
pp. 10049-10049
Author(s):  
U. Vogt ◽  
B. Brandt ◽  
U. Bosse ◽  
U. Bonk ◽  
H. Adigüzel ◽  
...  

10049 Background: Currently there are no tests to assist in selecting the optimal PST regimen for breast cancer patients. The primary study goals of this prospective, single-armed multicentric investigation are pathologically confirmed tumor response and the rate of breast conserving therapy (BCT). Secondary goals are to find the histopathologic and gene profiling patterns best correlating with tumor remission in a taxane-anthracycline based neoadjuvant setting, as well as to evaluate cytostatic toxicity and quality of life. Methods: In this phase II study of a total of 40 eligible patients with invasive breast cancer, Human Genome Survey Microarray (HGSM) expression profiling is performed on a jet-biopsy sample basis. The protocol was elaborated for the treatment of patients suffering from a primary tumor with 6 cycles of TEC (3-weekly) prior to surgical treatment. The selection of predictor genes was done with BRB-ArrayTools Version 3.3 using a model based on the Compound Covariate Predictor, Diagonal Linear Discriminant Analysis, Nearest Neighbor Classification, and Support Vector Machines with linear kernel. We estimated the prediction error of each model using leave-one-out cross-validation (LOOCV) as described by Simon R. 2000 random permutations were used. Clustering was done using Cluster 3.0 and Java TreeView 1.0.12. Results: Tumor response (pCR, pPR) of more than 70% can be achieved using the neoadjuvant TEC regimen. 22% pCR (ypT0; ypN0) and 90% BCT in this study are comparable with data of other published PST trials. Preliminary expression profiling results reveal a subset of 148 genes that classifies all patients with a complete remission (pCR) in one cluster with a very closely related gene expression pattern (n=5; PPV = 100%). Furthermore, 10 patients defined as responders due to selected MIB1-expression based criteria (expressing cells in the residual tumor ≤ 5% and a Δ MIB1-expression ≥ 20%) can be correctly classified in 9 of 10 cases. Comparable separation of the groups could not be achieved by established tumor factors. Conclusions: HGSM expression profiling shows promise for identifying genes that are related to chemotherapy response, especially in PST. No significant financial relationships to disclose.
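The leave-one-out cross-validation mentioned in the Methods can be illustrated with a short sketch. It is not the BRB-ArrayTools procedure: it uses a plain 1-nearest-neighbor rule on invented two-gene profiles, and it skips gene selection entirely. In a real analysis any gene selection step would normally be repeated inside each fold so that the held-out sample never influences it. All names and values below are hypothetical.

```python
from math import dist


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training sample closest to the query
    (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return train_y[best]


def loocv_error(samples, labels, predict):
    """Hold out each sample in turn, train on the rest, and return the
    fraction of held-out samples that are misclassified."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


# Invented expression profiles (two genes per patient).
samples = [
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # responders
    [3.0, 3.0], [3.1, 2.9], [1.5, 1.5],   # non-responders; the last one is atypical
]
labels = ["responder"] * 3 + ["non-responder"] * 3

print(loocv_error(samples, labels, nearest_neighbor_predict))
```

Only the atypical non-responder is misclassified when held out, so the estimated error here is one in six.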


2017 ◽  
Author(s):  
Alemu Takele Assefa ◽  
Katrijn De Paepe ◽  
Celine Everaert ◽  
Pieter Mestdagh ◽  
Olivier Thas ◽  
...  

Background: Protein-coding RNAs (mRNA) have been the primary target of most transcriptome studies in the past, but in recent years, attention has expanded to include long non-coding RNAs (lncRNA). lncRNAs are typically expressed at low levels, and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data along with their normalization methods is comprehensively evaluated, with a particular focus on lncRNAs and low abundant mRNAs. Results: Thirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures are used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNA and lncRNA separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and analysis of benchmark RNA-seq datasets. No single tool uniformly outperformed the others. Conclusion: Overall, the linear modeling with empirical Bayes moderation (limma) and the nonparametric approach (SAMSeq) showed the best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, to achieve a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in realistic clinical settings such as cancer research. About half of the methods showed a severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/
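The two headline quantities in this abstract, false discovery rate and sensitivity, can be computed in a simulation where the truly differentially expressed genes are known. The sketch below shows only those standard definitions. It does not reproduce the study's thirteen metrics, and the gene identifiers are made up.

```python
def fdr_and_sensitivity(called, truly_de):
    """Empirical false discovery proportion and sensitivity for one analysis.

    called:   genes a tool reports as differentially expressed
    truly_de: genes that are differentially expressed by construction
    """
    called, truly_de = set(called), set(truly_de)
    true_pos = len(called & truly_de)
    false_pos = len(called - truly_de)
    fdr = false_pos / len(called) if called else 0.0
    sensitivity = true_pos / len(truly_de) if truly_de else 0.0
    return fdr, sensitivity


# Hypothetical simulation result: four calls, two of them correct.
called = ["g1", "g2", "g3", "g4"]
truly_de = ["g1", "g2", "g5", "g6"]

print(fdr_and_sensitivity(called, truly_de))
```

In this toy case half of the calls are false discoveries and half of the true signals are recovered.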


2018 ◽  
Author(s):  
Mohamed K Gunady ◽  
Stephen M Mount ◽  
Héctor Corrada Bravo

Introduction: Analysis of differential alternative splicing from RNA-seq data is complicated by the fact that many RNA-seq reads map to multiple transcripts; in addition, the annotated transcripts are often a small subset of the possible transcripts of a gene. Here we describe Yanagi, a tool for segmenting a transcriptome to create a library of maximal L-disjoint segments from a complete transcriptome annotation. That segment library preserves all transcriptome substrings of length L and the structural relationships between transcripts while eliminating unnecessary sequence duplications. Contributions: In this paper, we formalize the concept of transcriptome segmentation and propose an efficient algorithm for generating segment libraries based on a length parameter dependent on specific RNA-Seq library construction. The resulting segment sequences can be used with pseudo-alignment tools to quantify expression at the segment level. We characterize the segment libraries for the reference transcriptomes of Drosophila melanogaster and Homo sapiens and provide gene-level visualization of the segments for better interpretability. Then we demonstrate the use of segment-level quantification in gene expression and alternative splicing analysis. The notion of transcript segmentation as introduced here and implemented in Yanagi opens the door for the application of lightweight, ultra-fast pseudo-alignment algorithms in a wide variety of RNA-seq analyses. Conclusion: Using a segment library rather than the standard transcriptome succeeds in significantly reducing ambiguous alignments where reads are multimapped to several sequences in the reference. That allowed avoiding the quantification step required by standard kmer-based pipelines for gene expression analysis. Moreover, using segment counts as statistics for alternative splicing analysis enables achieving comparable performance to counting-based approaches (e.g. rMATS) while using fast and lightweight pseudo-alignment.


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. 2029-2029 ◽  
Author(s):  
Estela Pineda ◽  
Anna Esteve-Codina ◽  
Maria Martinez-Garcia ◽  
Francesc Alameda ◽  
Cristina Carrato ◽  
...  

2029 Background: Glioblastoma (GBM) gene expression subtypes have been described in recent years, but data in homogeneously treated patients are lacking. Methods: Clinical, molecular and immunohistochemistry (IHC) analyses from patients with newly diagnosed GBM homogeneously treated with standard radiochemotherapy were studied. Samples were classified based on their expression profiles into three different subtypes (classical, mesenchymal, proneural) using the Support Vector Machine (SVM), K-nearest neighbor (K-NN) and single sample Gene Set Enrichment Analysis (ssGSEA) classification algorithms provided by the GlioVis web application. Results: The GLIOCAT Project recruited 432 patients from 6 Catalan institutions, all of whom received standard first-line treatment (2004-2015). The best paraffin tissue samples were selected for RNAseq and reliable data were obtained from 124. 82 cases (66%) were classified into the same subtype by all three classification algorithms. The SVM and ssGSEA algorithms obtained more similar results (87%). No differences in clinical variables were found between the 3 GBM subtypes. The proneural subtype was enriched with IDH1 mutated and G-CIMP positive tumors. The mesenchymal subtype (SVM) was enriched in unmethylated MGMT tumors (p = 0.008), and the classical subtype (SVM) in methylated MGMT tumors (p = 0.008). Long survivors (> 30 months) were rarely classified as mesenchymal (0-7.5%) and were more frequently classified as proneural (23.1-26.). Clinical (age, resection, KPS) and molecular (IDH1, MGMT) known prognostic factors were confirmed in this series. Overall, no differences in prognosis were observed between the 3 subtypes, but a trend to worse survival in mesenchymal was observed in K-NN (9.6 vs 15 ). The mesenchymal subtype presented less expression of Olig2 (p < 0.001) and SOX2 (p = 0.003) by IHC, but more YKL-40 expression (p = 0.023, SVM). On the other hand, the classical subtype expressed more Nestin (p = 0.004) compared to the other subtypes (K-NN). Conclusions: In our study we have not found a correlation between glioblastoma expression subtype and outcome. This large series provides reproducible data regarding the clinical-molecular-immunohistochemistry features of glioblastoma genetic subtypes.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Alberto Luiz P. Reyes ◽  
Tiago C. Silva ◽  
Simon G. Coetzee ◽  
Jasmine T. Plummer ◽  
Brian D. Davis ◽  
...  

Background: The development of next generation sequencing (NGS) methods led to a rapid rise in the generation of large genomic datasets, but the development of user-friendly tools to analyze and visualize these datasets has not kept the same pace. This presents a two-fold challenge to biologists: the expertise to select an appropriate data analysis pipeline, and the need for bioinformatics or programming skills to apply this pipeline. The development of graphical user interface (GUI) applications hosted on web-based servers such as Shiny can make complex workflows accessible across operating systems and internet browsers to those without programming knowledge. Results: We have developed GENAVi (Gene Expression Normalization Analysis and Visualization) to provide a user-friendly interface for normalization and differential expression analysis (DEA) of human or mouse feature count level RNA-Seq data. GENAVi is a GUI based tool that combines Bioconductor packages in a format for scientists without bioinformatics expertise. We provide a panel of 20 cell lines commonly used for the study of breast and ovarian cancer within GENAVi as a foundation for users to bring their own data to the application. Users can visualize expression across samples, cluster samples based on gene expression or correlation, calculate and plot the results of principal components analysis, perform DEA and gene set enrichment, and produce plots for each of these analyses. To allow scalability for large datasets we have provided local install via three methods. We improve on available tools by offering a range of normalization methods and a simple to use interface that provides clear and complete session reporting for reproducible analysis. Conclusion: The development of tools using a GUI makes them practical and accessible to scientists without bioinformatics expertise, or access to a data analyst with relevant skills. While several GUI based tools are currently available for RNA-Seq analysis, we improve on these existing tools. This user-friendly application provides a convenient platform for the normalization, analysis and visualization of gene expression data for scientists without bioinformatics expertise.


Author(s):  
MOHD SABERI MOHAMAD ◽  
SAFAAI DERIS ◽  
ROSLI MD ILLIAS

Constantly improving gene expression technology offers the ability to measure the expression levels of thousands of genes in parallel. Gene expression data is expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. A key issue that needs to be addressed is the selection of a small number of genes that contribute to a disease from the thousands of genes measured on microarrays, which are inherently noisy. This work deals with finding a small subset of informative genes from gene expression microarray data that maximises classification accuracy. This paper introduces a new hybrid Genetic Algorithm and Support Vector Machine algorithm for gene selection and classification. We show that the classification accuracy of the proposed algorithm is superior to a number of current state-of-the-art methods on two widely used benchmark datasets. The informative genes from the best subset are validated and verified by comparing them with biological results reported by biology and computer science researchers in order to explore their biological plausibility.

