Bias of error rates in linear discriminant analysis caused by feature selection and sample size

RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging, powerful method for diagnosis, disease classification and monitoring at the molecular level, as well as for providing potential markers of diseases. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data, since RNA-Seq counts violate both the data-structure and the distributional assumptions. However, it is possible to apply these algorithms to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data hierarchically closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers, including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters, such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method, on model performance. A comprehensive simulation study was conducted, and the results were compared with those from two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size, differential-expression rate, and number of genes, and decreasing the dispersion parameter and number of groups, lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion.
We conclude that, as a count-based classifier, the power-transformed PLDA and, as microarray-based classifiers, vst- or rlog-transformed RF and SVM classifiers may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html .
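
The overdispersion mentioned in the abstract can be illustrated with a small simulation. The sketch below is not taken from the study or from MLSeq; it assumes a negative binomial model parameterized by a mean `mu` and a dispersion `phi` (variance `mu + phi * mu**2`), and the values `mu = 10`, `phi = 0.5` and the helper names are hypothetical choices made only for illustration. It compares the variance-to-mean ratio of Poisson counts (close to 1) with that of negative binomial counts (close to `1 + phi * mu`).

```python
import math
import random
from statistics import mean, variance


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method.

    Suitable for moderate lam only, because exp(-lam) underflows for very large lam.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mu, phi, rng):
    """Draw one count with mean mu and variance mu + phi * mu**2.

    Uses a gamma-Poisson mixture: the Poisson rate is gamma distributed
    with shape 1/phi and scale mu * phi, so its mean is mu.
    """
    lam = rng.gammavariate(1.0 / phi, mu * phi)
    return sample_poisson(lam, rng)


def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 for Poisson data, larger when overdispersed."""
    return variance(counts) / mean(counts)


# Hypothetical settings, chosen only for illustration.
rng = random.Random(0)
mu, phi, n = 10.0, 0.5, 5000

poisson_counts = [sample_poisson(mu, rng) for _ in range(n)]
nb_counts = [sample_negative_binomial(mu, phi, rng) for _ in range(n)]

print("Poisson variance/mean:          ", round(dispersion_index(poisson_counts), 2))
print("Negative binomial variance/mean:", round(dispersion_index(nb_counts), 2))
```

Under these assumptions, the theoretical ratio for the negative binomial counts is `1 + phi * mu = 6`, which is one way to see why a classifier that assumes Poisson-distributed counts may need a transformation or a dispersion-aware model on such data.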


Subspace Regularized Linear Discriminant Analysis for Small Sample Size Problems

Lecture Notes in Computer Science - PRICAI 2012: Trends in Artificial Intelligence ◽

10.1007/978-3-642-32695-0_58 ◽

2012 ◽

pp. 661-672

Author(s):

Zhidong Wang ◽

Wuyi Yang

Keyword(s):

Discriminant Analysis ◽

Sample Size ◽

Linear Discriminant Analysis ◽

Small Sample Size ◽

Small Sample ◽

Linear Discriminant ◽

Regularized Linear Discriminant Analysis


Regularized Complete Linear Discriminant Analysis for Small Sample Size Problems

Communications in Computer and Information Science - Emerging Intelligent Computing Technology and Applications ◽

10.1007/978-3-642-31837-5_10 ◽

2012 ◽

pp. 67-73

Author(s):

Wuyi Yang

Keyword(s):

Discriminant Analysis ◽

Sample Size ◽

Linear Discriminant Analysis ◽

Small Sample Size ◽

Small Sample ◽

Linear Discriminant


Robust and Efficient Linear Discriminant Analysis With L2,1-Norm for Feature Selection

IEEE Access ◽

10.1109/access.2020.2978287 ◽

2020 ◽

Vol 8 ◽

pp. 44100-44110 ◽

Cited By ~ 2

Author(s):

Libo Yang ◽

Xuemei Liu ◽

Feiping Nie ◽

Yang Liu

Keyword(s):

Feature Selection ◽

Discriminant Analysis ◽

Linear Discriminant Analysis ◽

Linear Discriminant


Evaluation of Classification Algorithms, Linear Discriminant Analysis and a New Hybrid Feature Selection Methodology for the Diagnosis of Coronary Artery Disease

2018 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2018.8622609 ◽

2018 ◽

Cited By ~ 5

Author(s):

Burak Kolukisa ◽

Hilal Hacilar ◽

Gokhan Goy ◽

Mustafa Kus ◽

Burcu Bakir-Gungor ◽

...

Keyword(s):

Coronary Artery Disease ◽

Feature Selection ◽

Coronary Artery ◽

Discriminant Analysis ◽

Linear Discriminant Analysis ◽

Classification Algorithms ◽

Linear Discriminant ◽

Artery Disease ◽

Selection Methodology


Discriminant Analysis for Biometric Recognition

Advances in Information and Communication Technology Education - Advanced Pattern Recognition Technologies with Applications to Biometrics ◽

10.4018/978-1-60566-200-8.ch002 ◽

2011 ◽

pp. 25-29

Author(s):

David Zhang ◽

Fengxi Song ◽

Yong Xu ◽

Zhizhen Liang

Keyword(s):

Pattern Recognition ◽

Feature Extraction ◽

Discriminant Analysis ◽

Sample Size ◽

Linear Discriminant Analysis ◽

Small Sample Size ◽

Small Sample ◽

Biometric Recognition ◽

Linear Discriminant

This chapter is a brief introduction to biometric discriminant analysis technologies, the subject of Section I of the book. Section 2.1 describes two kinds of linear discriminant analysis (LDA) approaches: classification-oriented LDA and feature extraction-oriented LDA. Section 2.2 discusses LDA for small sample size (SSS) pattern recognition problems. Section 2.3 outlines the organization of Section I.
