Finding Biomarkers from a High-Dimensional Imbalanced Dataset Using the Hybrid Method of Random Undersampling and Lasso

2020 ◽  
Vol 11 (2) ◽  
pp. 75-81
Author(s):  
Masithoh Yessi Rochayani ◽  
Umu Sa'adah ◽  
Ani Budi Astuti

The research conducted undersampling and gene selection as a starting point for cancer classification in high-dimensional gene expression datasets with imbalanced classes. It investigated whether applying undersampling before gene selection gave better results than gene selection alone. The undersampling method used was Random Undersampling (RUS), and the gene selection method was Lasso. The selected genes were then validated against theory. To explore the effectiveness of applying RUS before gene selection, the researchers used two gene expression datasets. Both datasets consisted of two classes, 1,545 observations, and 10,935 genes, but had different imbalance ratios. The results show that the proposed gene selection methods, namely Lasso and RUS + Lasso, can identify several important biomarkers, and the resulting model has high accuracy. However, the model is complicated since it involves too many genes. The research also finds that undersampling has little effect when it is applied to a less imbalanced dataset. Meanwhile, when the dataset is highly imbalanced, undersampling can remove a lot of information from the majority class. Nevertheless, the effectiveness of undersampling remains unclear. Simulation studies can be carried out in future research to investigate when undersampling should be implemented.
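The undersampling step described in this abstract can be illustrated with a minimal sketch. It covers only Random Undersampling, not the Lasso stage, and the function name `random_undersample`, the `seed` parameter, and the toy data are assumptions made for illustration rather than anything taken from the paper.

```python
import random
from collections import defaultdict


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class.

    X is a list of feature rows, y the matching list of class labels.
    (Hypothetical helper; the paper's own implementation is not shown.)
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    # The minority class size sets how many rows each class may keep.
    n_min = min(len(idxs) for idxs in by_class.values())

    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_min))  # sampling without replacement
    keep.sort()  # preserve the original row order

    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 8 majority-class rows (label 0) and 2 minority-class rows (label 1).
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_undersample(X, y, seed=42)
print(len(y_bal), y_bal.count(0), y_bal.count(1))
```

In this toy case six of the eight majority rows are discarded, which mirrors the abstract's caution that undersampling a highly imbalanced dataset removes much of the majority-class information.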

2021 ◽  
Vol 11 (1) ◽  
pp. 35-43
Author(s):  
Wen Xin Ng ◽  
Weng Howe Chan

In healthcare, biomarkers serve an important role in disease classification. Many existing works focus on identifying potential biomarkers from gene expression. The large number of redundant features in a high-dimensional dataset such as gene expression data can bias the classifier and reduce its performance. Embedded feature selection methods such as ranked guided iterative feature elimination have been widely adopted owing to their good performance in identifying informative features. However, methods like ranked guided iterative feature elimination do not consider the redundancy of the features. Thus, this paper proposes an improved ranked guided iterative feature elimination method by introducing an additional filter selection based on minimum redundancy maximum relevance to filter out redundant features and maintain the relevant feature subset to be ranked and used for classification. Experiments are done using two gene expression datasets for prostate cancer and the central nervous system. The performance of the classification is measured in terms of accuracy and compared with existing methods. Meanwhile, biological context verification of the identified features is done through available knowledge databases. Our method shows improved classification accuracy, and the selected genes were found to be related to the diseases.
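The minimum redundancy maximum relevance filter mentioned in this abstract can be sketched as a greedy selection. The version below is a simplified variant that uses absolute Pearson correlation as a stand-in for both relevance and redundancy; that choice, the helper names `pearson` and `mrmr_select`, and the toy genes are assumptions for illustration, and the paper's own criterion and its iterative elimination stage are not reproduced here.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def mrmr_select(columns, y, k):
    """Greedily pick k feature columns with high relevance and low redundancy.

    Relevance  = |corr(feature, label)|
    Redundancy = mean |corr(feature, already selected feature)|
    Score      = relevance - redundancy (the "difference" form of mRMR).
    """
    relevance = [abs(pearson(col, y)) for col in columns]
    selected = []
    remaining = list(range(len(columns)))

    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                abs(pearson(columns[j], columns[s])) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Toy data: six samples, three per class; each inner list is one gene.
labels = [0, 0, 0, 1, 1, 1]
genes = [
    [1, 2, 1, 5, 6, 5],       # gene 0: strongly class-related
    [2, 4, 2, 10, 12, 10],    # gene 1: a scaled copy of gene 0 (redundant)
    [3, 1, 2, 3, 2, 4],       # gene 2: moderately class-related
    [2, 2, 3, 2, 3, 2],       # gene 3: unrelated to the class
]

chosen = mrmr_select(genes, labels, k=2)
print(chosen)
```

On this toy data the redundancy term keeps the scaled copy from being chosen alongside the gene it duplicates, so the second pick is the moderately relevant gene 2.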


2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Suyan Tian ◽  
Chi Wang ◽  
Bing Wang

To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart in which no biological knowledge is considered. In this article, a pathway-based feature selection is firstly divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. With bilevel selection methods being regarded as a special case of pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail and the importance of penalization in such methods. Last, we point out the potential utilizations of pathway-guided gene selection in one active research avenue, namely, to analyze longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.


Author(s):  
Samarendra Das ◽  
Shesh N. Rai

Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most of the available gene selection methods are based on either relevancy or redundancy measures, which are usually adjudged through post-selection classification accuracy. Through these methods the ranking of genes was done on a single high-dimensional expression dataset, which leads to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach through combining Support Vector Machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, the genes are selected through statistical significance values computed using a non-parametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach with nine existing competitive methods was carried out on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification, biologically relevant criteria based on quantitative trait loci, and gene ontology. Our analytical results showed that the proposed approach selects genes that are more biologically relevant as compared to the existing methods. Moreover, the proposed approach was also found to be better with respect to the competitive existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.


Mathematics ◽  
2019 ◽  
Vol 7 (5) ◽  
pp. 457 ◽  
Author(s):  
Md Sarker ◽  
Michael Pokojovy ◽  
Sangjin Kim

In high-dimensional gene expression data analysis, the accuracy and reliability of cancer classification and selection of important genes play a crucial role. To identify these important genes and predict future outcomes (tumor vs. non-tumor), various methods have been proposed in the literature. However, only a few of them take into account correlation patterns and grouping effects among the genes. In this article, we propose a rank-based modification of the popular penalized logistic regression procedure based on a combination of ℓ1 and ℓ2 penalties capable of handling possible correlation among genes in different groups. While the ℓ1 penalty maintains sparsity, the ℓ2 penalty induces smoothness based on the information from the Laplacian matrix, which represents the correlation pattern among genes. We combined logistic regression with the BH-FDR (Benjamini and Hochberg false discovery rate) screening procedure and a newly developed rank-based selection method to come up with an optimal model retaining the important genes. Through simulation studies and real-world application to high-dimensional colon cancer gene expression data, we demonstrated that the proposed rank-based method outperforms such currently popular methods as lasso, adaptive lasso and elastic net when applied both to gene selection and classification.
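For orientation, a penalized logistic regression objective that combines an ℓ1 sparsity penalty with a Laplacian-based ℓ2 smoothness penalty is commonly written in the form below. This is a generic sketch and not necessarily the authors' exact formulation: the symbols β (coefficients), λ1 and λ2 (tuning parameters), L (the Laplacian matrix), and p_i (fitted probability for sample i) are notation assumed here.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]
  \;+\; \lambda_1 \lVert \beta \rVert_1
  \;+\; \lambda_2\, \beta^{\top} L\, \beta,
\qquad
p_i = \frac{1}{1 + \exp\!\bigl(-x_i^{\top}\beta\bigr)}
```

Under this form, setting L to the identity matrix reduces the quadratic term to an ordinary ridge penalty, which recovers the elastic net that the abstract uses as a comparison method.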


2012 ◽  
Vol 23 (02) ◽  
pp. 431-444 ◽  
Author(s):  
ALLANI ABDERRAHIM ◽  
EL-GHAZALI TALBI ◽  
MELLOULI KHALED

In this work, we hybridize the Genetic Quantum Algorithm with the Support Vector Machines classifier for gene selection and classification of high-dimensional microarray data. We named our algorithm GQA SVM. Its purpose is to identify a small subset of genes that could be used to separate two classes of samples with high accuracy. The approach was compared with different methods from the literature, in particular GA SVM and PSO SVM [2], on six different datasets derived from microarray experiments dealing with cancer (leukemia, breast, colon, ovarian, prostate, and lung) and available on the Web. The experiments demonstrated the very good performance of the method. The first contribution shows that the algorithm GQA SVM is able to find genes of interest and improve the classification in a meaningful way. The second important contribution consists in the actual discovery of new and challenging results on the datasets used.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11762
Author(s):  
Shirin Roohigohar ◽  
Anthony R. Clarke ◽  
Peter J. Prentis

Fruit production is negatively affected by a wide range of frugivorous insects, among which tephritid fruit flies are some of the most important. As a replacement for pesticide-based controls, enhancing natural fruit resistance through biotechnology approaches is a poorly researched but promising alternative. The use of quantitative reverse transcription PCR (RT-qPCR) is an approach to studying gene expression which has been widely used in studying plant resistance to pathogens and non-frugivorous insect herbivores, and offers a starting point for fruit fly studies. In this paper, we develop a gene selection pipeline for known induced-defense genes in tomato fruit, Solanum lycopersicum, and putative detoxification genes in Queensland fruit fly, Bactrocera tryoni, as a basis for future RT-qPCR research. The pipeline started with a literature review on plant/herbivore and plant/pathogen molecular interactions. With respect to the fly, this was then followed by the identification of gene families known to be associated with insect resistance to toxins, and then individual genes through reference to annotated B. tryoni transcriptomes and gene identity matching with related species. In contrast, for tomato, a much better studied species, individual defense genes could be identified directly through literature research. For B. tryoni, gene selection was then further refined through gene expression studies. Ultimately 28 putative detoxification genes from cytochrome P450 (P450), carboxylesterase (CarE), glutathione S-transferases (GST), and ATP binding cassette transporters (ABC) gene families were identified for B. tryoni, and 15 induced defense genes from receptor-like kinase (RLK), D-mannose/L-galactose, mitogen-activated protein kinase (MAPK), lipoxygenase (LOX), gamma-aminobutyric acid (GABA) pathways and polyphenol oxidase (PPO), proteinase inhibitors (PI) and resistance (R) gene families were identified from tomato fruit. The developed gene selection process for B. tryoni can be applied to other herbivorous and frugivorous insect pests so long as the minimum necessary genomic information, an annotated transcriptome, is available.


2016 ◽  
Vol 78 (5-10) ◽  
Author(s):  
Farzana Kabir Ahmad

Deoxyribonucleic acid (DNA) microarray technology is a recent invention that has provided enormous opportunities to measure gene expression on a large scale simultaneously. However, interpreting large-scale gene expression data remains a challenging issue due to its innate nature of “high dimensional low sample size”. Microarray data typically involve thousands of genes, n, in a very small sample, p, which complicates the data analysis process. For this reason, feature selection methods, also known as gene selection methods, have become necessary to select significant genes that present the maximum discriminative power between cancerous and normal tissues. Feature selection methods can be structured into three basic categories: a) filter methods; b) wrapper methods; and c) embedded methods. Among these methods, filter gene selection methods provide an easy way to identify the informative genes and can simply reduce large-scale microarray datasets. Although filter-based gene selection techniques have been commonly used in analyzing microarray datasets, these techniques have been tested separately in different studies. Therefore, this study aims to investigate and compare the effectiveness of four popular filter gene selection methods, namely Signal-to-Noise ratio (SNR), Fisher Criterion (FC), Information Gain (IG) and t-Test, in selecting informative genes that can distinguish cancer and normal tissues. In this experiment, a common classifier, Support Vector Machine (SVM), is trained on the selected genes. These gene selection methods are tested on three large-scale gene expression datasets, namely a breast cancer dataset, a colon dataset, and a lung dataset. This study has discovered that IG and SNR are more suitable to be used with SVM. Furthermore, this study has shown that SVM performance remained relatively unaffected unless a very small number of genes was selected.
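Three of the filter scores named in this abstract can be sketched per gene as follows. The formulas are the commonly used textbook forms (SNR as the mean difference over the sum of standard deviations, the Fisher criterion as the squared mean difference over the sum of variances, and Welch's form of the t statistic); the study's exact definitions may differ, Information Gain is omitted, and the function names and toy expression values are assumptions for illustration.

```python
import math
from statistics import mean, stdev


def snr_score(tumor, normal):
    """Signal-to-noise ratio: (mean1 - mean2) / (sd1 + sd2)."""
    return (mean(tumor) - mean(normal)) / (stdev(tumor) + stdev(normal))


def fisher_score(tumor, normal):
    """Fisher criterion: (mean1 - mean2)^2 / (var1 + var2)."""
    return (mean(tumor) - mean(normal)) ** 2 / (
        stdev(tumor) ** 2 + stdev(normal) ** 2
    )


def t_score(tumor, normal):
    """Welch t statistic: mean difference over its estimated standard error."""
    se = math.sqrt(
        stdev(tumor) ** 2 / len(tumor) + stdev(normal) ** 2 / len(normal)
    )
    return (mean(tumor) - mean(normal)) / se


# Toy expression values per gene: (tumor samples, normal samples).
# Note: genes with zero variance in both classes would divide by zero here.
expression = {
    "gene_a": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),  # well separated
    "gene_b": ([3.0, 4.0, 5.0], [2.0, 4.0, 6.0]),  # same mean in both classes
    "gene_c": ([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]),  # weakly separated
}

# Rank genes from most to least discriminative by absolute SNR.
ranking = sorted(
    expression, key=lambda g: abs(snr_score(*expression[g])), reverse=True
)
print(ranking)
```

A filter method of this kind scores each gene independently of the classifier, so the top-ranked genes can then be passed to any classifier such as the SVM used in the study.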


Entropy ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. 1205
Author(s):  
Samarendra Das ◽  
Shesh N. Rai

Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most of the available gene selection methods are based on either relevancy or redundancy measures, which are usually adjudged through post-selection classification accuracy. Through these methods the ranking of genes was conducted on a single high-dimensional expression dataset, which led to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach through combining a support vector machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, the genes were selected through statistical significance values computed using a nonparametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach with nine existing competitive methods was carried out on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification, biologically relevant criteria based on quantitative trait loci, and gene ontology. Our analytical results showed that the proposed approach selects genes which are more biologically relevant as compared to the existing methods. Moreover, the proposed approach was also found to be better with respect to the competitive existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.


2007 ◽  
Vol 3 ◽  
pp. 117693510700300 ◽  
Author(s):  
Simin Hu ◽  
J. Sunil Rao

In gene selection for cancer classification using microarray data, we define an eigenvalue-ratio statistic to measure a gene's contribution to the joint discriminability when this gene is included into a set of genes. Based on this eigenvalue-ratio statistic, we define a novel hypothesis testing for gene statistical redundancy and propose two gene selection methods. Simulation studies illustrate the agreement between statistical redundancy testing and gene selection methods. Real data examples show the proposed gene selection methods can select a compact gene subset which can not only be used to build high quality cancer classifiers but also show biological relevance.

