A comprehensive simulation study on classification of RNA-Seq data

Author(s):  
Gokmen Zararsiz ◽  
Dinçer Göksülük ◽  
Selçuk Korkmaz ◽  
Vahap Eldem ◽  
Gözde Ertürk Zararsız ◽  
...  

RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging, powerful method for diagnosis, disease classification, and monitoring at the molecular level, as well as for providing potential markers of disease. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data, since doing so would violate both the data structure and the distributional assumptions. However, it is possible to apply these algorithms to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data hierarchically closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers, including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters, such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method, on model performance. A comprehensive simulation study was conducted, and the results were compared with those from two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size, differential-expression rate, and number of genes and decreasing the dispersion parameter and number of groups lead to an increase in classification accuracy. Similar to differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion.
We conclude that, as a count-based classifier, the power-transformed PLDA and, as microarray-based classifiers, vst- or rlog-transformed RF and SVM classifiers may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html .
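
To make the idea of a count-based classifier concrete, the following is a minimal Python sketch of a Poisson classifier in the spirit of PLDA. It is a simplification and not the method implemented in MLSeq: it assumes independent Poisson genes with class-specific means and omits the per-sample size factors, shrinkage, and power transformation. The function names and the toy count matrix are hypothetical and invented for illustration only.

```python
import math

def fit_poisson_classifier(counts, labels, pseudo=1.0):
    """Estimate per-class, per-gene Poisson means and log class priors."""
    n_genes = len(counts[0])
    model = {}
    for k in sorted(set(labels)):
        rows = [x for x, y in zip(counts, labels) if y == k]
        # The pseudo-count keeps every mean strictly positive so log() is defined.
        means = [(sum(r[j] for r in rows) + pseudo) / len(rows)
                 for j in range(n_genes)]
        model[k] = (math.log(len(rows) / len(counts)), means)
    return model

def predict_poisson(model, x):
    """Return the class with the largest Poisson log-likelihood plus log prior."""
    def score(k):
        log_prior, means = model[k]
        # Poisson log-likelihood up to a term (log x!) that is equal for all classes.
        return log_prior + sum(xj * math.log(m) - m for xj, m in zip(x, means))
    return max(model, key=score)

# Toy counts (rows = samples, columns = genes); values are made up.
train = [[10, 2, 5], [12, 1, 6], [9, 3, 4],
         [2, 11, 5], [1, 9, 6], [3, 12, 4]]
labels = ["A", "A", "A", "B", "B", "B"]

model = fit_poisson_classifier(train, labels)
print(predict_poisson(model, [11, 2, 5]))
print(predict_poisson(model, [2, 10, 5]))
```

The third gene has the same mean in both classes, so it contributes equally to both scores and does not influence the decision; only differentially expressed genes drive the classification.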

2017 ◽  
Author(s):  
Gokmen Zararsiz ◽  
Dinçer Göksülük ◽  
Selçuk Korkmaz ◽  
Vahap Eldem ◽  
Gözde Ertürk Zararsız ◽  
...  

Background: RNA sequencing (RNA-Seq) is a powerful technique for transcriptome profiling of organisms that uses the capabilities of next-generation sequencing (NGS) technologies. Recent advances in NGS make it possible to measure the expression levels of tens to thousands of transcripts simultaneously. Using such information, developing expression-based classification algorithms is an emerging, powerful method for diagnosis, disease classification, and monitoring at the molecular level, as well as for providing potential markers of disease. Microarray-based classifiers cannot be directly applied because of the discrete nature of RNA-Seq data. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to transform the data hierarchically closer to microarrays and apply microarray-based classifiers. In most studies, data overdispersion appears to be another challenge in modeling RNA-Seq data. In this study, we aimed to examine the effect of the dispersion parameter and of the classification algorithm on RNA-Seq classification. We also considered the effect of other parameters on classification performance: (i) sample size, (ii) number of genes, (iii) number of classes, (iv) differential expression (DE) rate, and (v) transformation method.
Methods: We designed a comprehensive simulation study and also used two miRNA and two mRNA experimental datasets. Simulated datasets were generated from the negative binomial distribution under different scenarios, and real datasets were obtained from publicly available resources. We compared the results of several classifiers, including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF).
Results: Results from the simulated and real datasets revealed that increasing the sample size, differential expression rate, and number of genes and decreasing the dispersion parameter and number of groups lead to an increase in classification accuracy. Overall, the power-transformed PLDA, RF, and SVM classifiers achieved the highest classification accuracies.
Discussion: Overdispersion appears to be an important challenge in RNA-Seq classification studies. Similar to differential expression studies, classification of RNA-Seq data requires careful attention to the handling of data overdispersion. We conclude that, as a count-based classifier, the power-transformed PLDA and, as microarray-based classifiers, vst- or rlog-transformed RF and SVM (bagSVM for data with large sample sizes) may be good choices for classification. However, there is still a need to develop novel classifiers or transformation approaches for the classification of RNA-Seq data. An R/Bioconductor package, MLSeq, with a vignette is freely available at http://www.bioconductor.org/packages/2.14/bioc/html/MLSeq.html .
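
The role of the dispersion parameter in negative binomial data can be illustrated with a short Python sketch. It assumes the common parameterization in which a count with mean mu and dispersion phi has variance mu + phi * mu^2, and it draws counts as a gamma-Poisson mixture. The function names, the seed, and the values of mu and phi are assumptions chosen for illustration; they are not the simulation scenarios used in the study.

```python
import random
import statistics

def sample_poisson(lam, rng):
    """Count the rate-1 exponential arrivals that fall inside [0, lam)."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def sample_negative_binomial(mu, phi, rng):
    """Gamma-Poisson mixture with mean mu and variance mu + phi * mu**2."""
    rate = rng.gammavariate(1.0 / phi, mu * phi)  # arguments: shape, scale
    return sample_poisson(rate, rng)

def simulate(mu, phi, n, seed):
    """Draw n negative binomial counts with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return [sample_negative_binomial(mu, phi, rng) for _ in range(n)]

mu = 20.0
for phi in (0.01, 0.5):
    draws = simulate(mu, phi, n=10000, seed=1)
    print("phi =", phi,
          "| sample mean:", round(statistics.mean(draws), 2),
          "| sample variance:", round(statistics.variance(draws), 2),
          "| theoretical variance:", mu + phi * mu ** 2)
```

With a small phi the counts behave almost like Poisson data (variance close to the mean), whereas a larger phi inflates the variance well beyond the mean, which is the overdispersion that the abstract identifies as a challenge for classification.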


Author(s):  
Ramia Z. Al Bakain ◽  
Yahya S. Al-Degs ◽  
James V. Cizdziel ◽  
Mahmoud A. Elsohly

Abstract: Fifty-four domestically produced cannabis samples obtained from different US states were quantitatively assayed by GC–FID to detect 22 active components: 15 terpenoids and 7 cannabinoids. The profiles of the selected compounds were used as inputs for grouping the samples by geographical origin and for building a geographical prediction model using linear discriminant analysis (LDA). The proposed sample extraction and chromatographic separation were satisfactory for the 22 selected active ingredients, with a wide analytical range between 5.0 and 1,000 µg/mL. Analysis of the GC profiles by principal component analysis (PCA) retained three significant variables for grouping (Δ9-THC, CBN, and CBC), and modest discrimination of the samples by geographical origin was observed. PCA was able to separate many of the Oregon and Vermont samples, while a mixed classification was observed for the remaining samples. Using LDA as a supervised classification method, excellent separation of the cannabis samples was attained, allowing the classification of new samples not included in the model. Using two principal components, LDA with the GC–FID profiles correctly predicted the geographical origin of 100% of the Washington samples, 86% of both the Oregon and Vermont samples, and 71% of the Ohio samples.
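
As a rough illustration of how LDA assigns a sample to a group from two principal-component scores, the sketch below fits a linear discriminant with a pooled 2 x 2 covariance matrix in plain Python. It is not the authors' workflow: the group labels (`origin_A`, `origin_B`), the score values, and the function names are hypothetical, and the numbers are invented rather than taken from the GC–FID data.

```python
import math

def column_means(rows):
    """Mean of each column for a list of equal-length tuples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_lda(groups):
    """groups maps a label to a list of (pc1, pc2) score pairs."""
    n_total = sum(len(rows) for rows in groups.values())
    means = {k: column_means(rows) for k, rows in groups.items()}
    # Pooled within-class covariance, shared by all classes (the LDA assumption).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for k, rows in groups.items():
        for x in rows:
            d = [x[0] - means[k][0], x[1] - means[k][1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    dof = n_total - len(groups)
    s = [[v / dof for v in row] for row in s]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    priors = {k: len(rows) / n_total for k, rows in groups.items()}
    return means, inv, priors

def predict_lda(model, x):
    """Pick the class with the largest linear discriminant score."""
    means, inv, priors = model
    def score(k):
        m = means[k]
        w = [inv[0][0] * m[0] + inv[0][1] * m[1],
             inv[1][0] * m[0] + inv[1][1] * m[1]]
        return (x[0] * w[0] + x[1] * w[1]
                - 0.5 * (m[0] * w[0] + m[1] * w[1])
                + math.log(priors[k]))
    return max(means, key=score)

# Hypothetical principal-component scores for two origins (made-up values).
scores = {
    "origin_A": [(1.0, 2.0), (1.5, 1.8), (0.8, 2.4), (1.2, 2.1)],
    "origin_B": [(4.0, 0.5), (4.4, 0.9), (3.7, 0.3), (4.1, 0.8)],
}
lda_model = fit_lda(scores)
print(predict_lda(lda_model, (1.1, 2.0)))
print(predict_lda(lda_model, (4.0, 0.6)))
```

Because LDA is supervised, the group labels shape the decision boundary, which is consistent with the abstract's observation that LDA separated the samples better than unsupervised PCA alone.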

