Stable bagging feature selection on medical data

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Salem Alelyani

Abstract: In the medical field, distinguishing genes that are relevant to a specific disease, such as colon cancer, is crucial to finding a cure and understanding its causes and subsequent complications. Medical datasets usually comprise a very large number of dimensions with a considerably small sample size. Thus, for domain experts such as biologists, the task of identifying these genes has become very challenging. Feature selection is a technique that aims to select these genes, called features in the machine learning field, with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Because of the large number of features and the small sample size, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and a relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The proposed technique shows a significant improvement in selection stability while at least maintaining classification accuracy. The stability improvement ranges from 20 to 50 percent in all cases, which implies that the likelihood of selecting the same features increased by 20 to 50 percent. This is accompanied by an increase in classification accuracy in most cases, which supports the stated stability results.
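The idea of stabilising selection through bagging can be illustrated with a minimal sketch. It is not the paper's implementation: the synthetic data, the toy filter criterion (absolute difference of class means), the score-averaging aggregation, and every function name below are assumptions made for illustration, and the Jaccard index on random subsamples is only one possible stability measure.

```python
import random
from itertools import combinations
from statistics import mean


def make_data(n=40, d=200, informative=5, seed=0):
    """Synthetic stand-in for a microarray dataset: few samples, many features."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for j in range(informative):
            row[j] += 1.5 * label  # only the first few features carry signal
        X.append(row)
        y.append(label)
    return X, y


def filter_scores(X, y):
    """Toy filter criterion (assumed): |mean of class 1 - mean of class 0| per feature."""
    d = len(X[0])
    sums = {0: [0.0] * d, 1: [0.0] * d}
    counts = {0: 0, 1: 0}
    for row, label in zip(X, y):
        counts[label] += 1
        acc = sums[label]
        for j, v in enumerate(row):
            acc[j] += v
    if counts[0] == 0 or counts[1] == 0:  # resample happened to contain one class only
        return [0.0] * d
    return [abs(sums[1][j] / counts[1] - sums[0][j] / counts[0]) for j in range(d)]


def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return set(sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k])


def single_select(X, y, k):
    """Baseline: run the filter once on the data as given."""
    return top_k(filter_scores(X, y), k)


def bagged_select(X, y, k, n_bags=25, seed=0):
    """Sum the filter scores over bootstrap resamples, then keep the top k."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    total = [0.0] * d
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        scores = filter_scores([X[i] for i in idx], [y[i] for i in idx])
        total = [t + s for t, s in zip(total, scores)]
    return top_k(total, k)


def jaccard(a, b):
    """Overlap of two feature subsets, between 0 and 1."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def stability(select, X, y, k, n_perturb=10, keep=0.9, seed=1):
    """Mean pairwise Jaccard index of subsets chosen on random subsamples."""
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(n_perturb):
        idx = rng.sample(range(n), int(keep * n))
        subsets.append(select([X[i] for i in idx], [y[i] for i in idx], k))
    return mean(jaccard(a, b) for a, b in combinations(subsets, 2))


if __name__ == "__main__":
    X, y = make_data()
    print("single:", round(stability(single_select, X, y, k=10), 3))
    print("bagged:", round(stability(bagged_select, X, y, k=10), 3))
```

Running the script prints one stability value per selector, so the two can be compared on the same perturbed subsamples. Whether the bagged variant scores higher depends on the data and the base selector. The sketch only shows how such a comparison could be set up.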

2020 ◽  
Author(s):  
Salem Alelyani

Abstract: In the medical field, distinguishing genes that are relevant to a specific disease, such as colon cancer, is crucial to finding a cure and understanding its causes and subsequent complications. Medical datasets usually comprise a very large number of dimensions with a considerably small sample size. Thus, for domain experts such as biologists, the task of identifying these genes has become very challenging. Feature selection is a technique that aims to select these genes, called features in the machine learning field, with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Because of the large number of features and the small sample size, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and a relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The results show that the bagging technique improves both selection stability and classification accuracy.


2019 ◽  
Vol 21 (9) ◽  
pp. 631-645 ◽  
Author(s):  
Saeed Ahmed ◽  
Muhammad Kabir ◽  
Zakir Ali ◽  
Muhammad Arif ◽  
Farman Ali ◽  
...  

Aim and Objective: Cancer is a dangerous disease worldwide, caused by somatic mutations in the genome. Diagnosing this deadly disease at an early stage is an exceptionally new clinical application of microarray data. In DNA microarray technology, gene expression data have a high dimension with a small sample size. It is therefore indispensable to develop efficient and robust feature selection methods that identify a small set of genes to achieve better classification performance. Materials and Methods: In this study, we developed a hybrid feature selection method that integrates correlation-based feature selection (CFS) and Multi-Objective Evolutionary Algorithm (MOEA) approaches to select highly informative genes. The hybrid model with a radial basis function neural network (RBFNN) classifier was evaluated on 11 benchmark gene expression datasets using a 10-fold cross-validation test. Results: The experimental results were compared with seven conventional feature selection methods and with other methods in the literature. The comparison shows that our approach has clear merits in terms of classification accuracy and the number of genes selected. Conclusion: Our proposed CFS-MOEA algorithm attained up to 100% classification accuracy for six out of eleven datasets with a minimal-sized predictive gene subset.


PEDIATRICS ◽  
1989 ◽  
Vol 83 (3) ◽  
pp. A72-A72
Author(s):  
Student

The believer in the law of small numbers practices science as follows: 1. He gambles his research hypotheses on small samples without realizing that the odds against him are unreasonably high. He overestimates power. 2. He has undue confidence in early trends (e.g., the data of the first few subjects) and in the stability of observed patterns (e.g., the number and identity of significant results). He overestimates significance. 3. In evaluating replications, his or others', he has unreasonably high expectations about the replicability of significant results. He underestimates the breadth of confidence intervals. 4. He rarely attributes a deviation of results from expectations to sampling variability, because he finds a causal "explanation" for any discrepancy. Thus, he has little opportunity to recognize sampling variation in action. His belief in the law of small numbers, therefore, will forever remain intact.


2019 ◽  
Vol 11 (6) ◽  
pp. 734 ◽  
Author(s):  
Xiufang Zhu ◽  
Nan Li ◽  
Yaozhong Pan

Group intelligence algorithms have been widely used in support vector machine (SVM) parameter optimization because of their strong parallel processing ability, fast optimization, and global optimization. However, few studies have compared the optimization performance of different group intelligence algorithms on SVMs, especially in terms of their application to hyperspectral remote sensing classification. In this paper, we compare the optimization performance of three group intelligence algorithms run on an SVM, using three hyperspectral images (Indian Pines, University of Pavia, and Salinas), in terms of five aspects: stability to parameter settings, convergence rate, feature selection ability, sample size, and classification accuracy. The three group intelligence algorithms are particle swarm optimization (PSO), genetic algorithms (GAs), and artificial bee colony (ABC) algorithms. Our results showed that the influence of these three optimization algorithms on the C-parameter optimization of the SVM was less than their influence on the σ-parameter. The convergence rate, the number of selected features, and the accuracy of the three group intelligence algorithms were significantly different at the p = 0.01 level. The GA could compress more than 70% of the original data and was the least affected by sample size. GA-SVM had the highest average overall accuracy (91.77%), followed by ABC-SVM (88.73%) and PSO-SVM (86.65%). In particular, in complex scenes (e.g., the Indian Pines image), GA-SVM showed the highest classification accuracy (87.34%, which was 8.23% higher than ABC-SVM and 16.42% higher than PSO-SVM) and the best stability (the standard deviation of its classification accuracy was 0.82%, which was 5.54% lower than ABC-SVM and 21.63% lower than PSO-SVM). Therefore, when compared with the ABC and PSO algorithms, the GA had more advantages in terms of feature band selection, small sample size classification, and classification accuracy.
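To make the parameter-search setting concrete, here is a minimal global-best PSO sketch over two SVM-style parameters in log space. It does not reproduce the study's setup: `surrogate_error` is a hypothetical stand-in for a cross-validated SVM error, no SVM is trained, and the search bounds, swarm size, and coefficients are illustrative assumptions.

```python
import random


def surrogate_error(log_c, log_sigma):
    """Hypothetical smooth stand-in for cross-validated SVM error (no SVM is trained)."""
    return (log_c - 3.0) ** 2 + 0.5 * (log_sigma + 1.0) ** 2


def pso(objective, bounds, n_particles=15, n_iter=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise objective(*position) inside box bounds with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # best position seen by each particle
    pbest_val = [objective(*p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # best position seen by the swarm
    history = [gbest_val]
    for _ in range(n_iter):
        for i in range(n_particles):
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull towards personal best + pull towards global best
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - pos[i][k])
                             + c2 * r2 * (gbest[k] - pos[i][k]))
                lo, hi = bounds[k]
                pos[i][k] = min(hi, max(lo, pos[i][k] + vel[i][k]))  # clamp to the box
            val = objective(*pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


if __name__ == "__main__":
    # Assumed search box for (log C, log sigma); not taken from the study.
    best, val, hist = pso(surrogate_error, [(-5.0, 15.0), (-10.0, 5.0)])
    print("best (log C, log sigma):", best, "error:", val)
```

In a real experiment the objective would be replaced by a cross-validated classifier error. The returned `history` gives the best error found so far after each iteration, which is one simple way to compare convergence rates between optimizers.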


2012 ◽  
Vol 108 (1) ◽  
pp. 138-150 ◽  
Author(s):  
Martin Macaš ◽  
Lenka Lhotská ◽  
Eduard Bakstein ◽  
Daniel Novák ◽  
Jiří Wild ◽  
...  

Evolution ◽  
1989 ◽  
Vol 43 (3) ◽  
pp. 678 ◽  
Author(s):  
James W. Archie ◽  
Chris Simon ◽  
Andrew Martin
