Stable Bagging Feature Selection on Medical Data

Mapping Intimacies ◽

10.21203/rs.3.rs-50237/v1 ◽

2020 ◽

Author(s):

Salem Alelyani

Keyword(s):

Feature Selection ◽

Sample Size ◽

Variance Reduction ◽

Small Sample Size ◽

Small Sample ◽

Domain Experts ◽

Complex Dimensions ◽

The Stability ◽

Stability And Accuracy ◽

Selection Algorithms

Abstract In the medical field, distinguishing genes that are relevant to a specific disease, such as colon cancer, is crucial to finding a cure and understanding its causes and subsequent complications. Medical datasets usually have very high dimensionality and a considerably small sample size. Thus, for domain experts such as biologists, identifying these genes has become a very challenging task. Feature selection is a technique that aims to select these genes (features, in machine learning terms) with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Because the number of features is large and the sample size is small, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and a relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The results show that the bagging technique improves both selection stability and accuracy.
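The idea in this abstract can be illustrated with a minimal sketch. The abstract does not name the five selection algorithms, the rule used to combine the bagged selections, or the stability measure, so everything concrete here is an assumption made for illustration: a simple mean-difference filter score, stratified bootstrap resampling, rank averaging across bags, and mean pairwise Jaccard similarity as the stability measure. The function names (`score_features`, `select_top_k`, `bagged_select_top_k`, `jaccard_stability`) and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def score_features(X, y):
    """Filter score per feature: |difference of class means| / pooled std.

    Illustrative stand-in only; the abstract does not name the selectors used.
    """
    X0, X1 = X[y == 0], X[y == 1]
    pooled = np.sqrt(0.5 * (X0.var(axis=0) + X1.var(axis=0))) + 1e-12
    return np.abs(X1.mean(axis=0) - X0.mean(axis=0)) / pooled


def select_top_k(X, y, k):
    """Plain (non-bagged) selection: indices of the k highest-scoring features."""
    order = np.argsort(-score_features(X, y))
    return set(order[:k].tolist())


def bagged_select_top_k(X, y, k, n_bags=30, seed=0):
    """Bagged selection: sum feature ranks over stratified bootstrap samples."""
    rng = np.random.default_rng(seed)
    class_rows = [np.flatnonzero(y == c) for c in (0, 1)]
    rank_sum = np.zeros(X.shape[1])
    for _ in range(n_bags):
        # Resample rows with replacement inside each class so both classes survive.
        idx = np.concatenate(
            [rng.choice(rows, size=rows.size, replace=True) for rows in class_rows]
        )
        scores = score_features(X[idx], y[idx])
        rank_sum += np.argsort(np.argsort(-scores))  # rank 0 = best feature
    # Features with the lowest summed rank were ranked best on average.
    return set(np.argsort(rank_sum)[:k].tolist())


def jaccard_stability(subsets):
    """Mean pairwise Jaccard similarity between selected subsets (1.0 = identical)."""
    pairs = [(a, b) for i, a in enumerate(subsets) for b in subsets[i + 1:]]
    return float(np.mean([len(a & b) / len(a | b) for a, b in pairs]))


if __name__ == "__main__":
    # Synthetic "many features, few samples" data (an assumption, not paper data).
    rng = np.random.default_rng(42)
    n, p, k = 40, 500, 10
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, p))
    X[:, :5] += 1.5 * y[:, None]  # five informative features

    plain, bagged = [], []
    for left_out in range(n):  # perturb the data by leaving one sample out
        keep = np.arange(n) != left_out
        plain.append(select_top_k(X[keep], y[keep], k))
        bagged.append(bagged_select_top_k(X[keep], y[keep], k))
    print("plain stability :", jaccard_stability(plain))
    print("bagged stability:", jaccard_stability(bagged))
```

Running the script prints one stability value for plain selection and one for bagged selection on leave-one-out perturbations of the synthetic data. It is meant only to make the comparison concrete and makes no claim to reproduce the paper's setup or results.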


Stable Bagging Feature Selection on Medical Data

10.21203/rs.3.rs-50237/v2 ◽

2020 ◽

Author(s):

Salem Alelyani

Keyword(s):

Feature Selection ◽

Sample Size ◽

Classification Accuracy ◽

Variance Reduction ◽

Small Sample Size ◽

Small Sample ◽

Domain Experts ◽

Complex Dimensions ◽

The Stability ◽

Selection Algorithms

Abstract In the medical field, distinguishing genes that are relevant to a specific disease, such as colon cancer, is crucial to finding a cure and understanding its causes and subsequent complications. Medical datasets usually have very high dimensionality and a considerably small sample size. Thus, for domain experts such as biologists, identifying these genes has become a very challenging task. Feature selection is a technique that aims to select these genes (features, in machine learning terms) with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Because the number of features is large and the sample size is small, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and a relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The proposed technique shows a significant improvement in selection stability while at least maintaining the classification accuracy. The stability improvement ranges from 20 to 50 percent in all cases. This implies that the likelihood of selecting the same features increased by 20 to 50 percent. This is accompanied by an increase in classification accuracy in most cases, which supports the stability results.


Stable bagging feature selection on medical data

Journal Of Big Data ◽

10.1186/s40537-020-00385-8 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Salem Alelyani

Keyword(s):

Feature Selection ◽

Sample Size ◽

Classification Accuracy ◽

Variance Reduction ◽

Small Sample Size ◽

Small Sample ◽

Domain Experts ◽

Complex Dimensions ◽

The Stability ◽

Selection Algorithms

Abstract In the medical field, distinguishing genes that are relevant to a specific disease, such as colon cancer, is crucial to finding a cure and understanding its causes and subsequent complications. Medical datasets usually have very high dimensionality and a considerably small sample size. Thus, for domain experts such as biologists, identifying these genes has become a very challenging task. Feature selection is a technique that aims to select these genes (features, in machine learning terms) with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Because the number of features is large and the sample size is small, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and a relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The proposed technique shows a significant improvement in selection stability while at least maintaining the classification accuracy. The stability improvement ranges from 20 to 50 percent in all cases. This implies that the likelihood of selecting the same features increased by 20 to 50 percent. This is accompanied by an increase in classification accuracy in most cases, which supports the stability results.


SMALL SAMPLE SIZE SCIENTIST

PEDIATRICS ◽

10.1542/peds.83.3.a72a ◽

1989 ◽

Vol 83 (3) ◽

pp. A72-A72

Author(s):

Student

Keyword(s):

Sample Size ◽

Confidence Intervals ◽

Causal Explanation ◽

Small Sample Size ◽

Small Sample ◽

Small Samples ◽

High Expectations ◽

Sampling Variation ◽

The Law ◽

The Stability

The believer in the law of small numbers practices science as follows: 1. He gambles his research hypotheses on small samples without realizing that the odds against him are unreasonably high. He overestimates power. 2. He has undue confidence in early trends (e.g., the data of the first few subjects) and in the stability of observed patterns (e.g., the number and identity of significant results). He overestimates significance. 3. In evaluating replications, his or others', he has unreasonably high expectations about the replicability of significant results. He underestimates the breadth of confidence intervals. 4. He rarely attributes a deviation of results from expectations to sampling variability, because he finds a causal "explanation" for any discrepancy. Thus, he has little opportunity to recognize sampling variation in action. His belief in the law of small numbers, therefore, will forever remain intact.
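The first and third points (overestimated power and overestimated replicability) can be made concrete with a small simulation. This is only a sketch under stated assumptions that are not in the passage: two groups of equal size drawn from normal distributions with known unit variance, a true standardized effect of 0.5, and a two-sided z-test at the conventional 1.96 threshold. The function name `significant_fraction` and all parameter values are hypothetical.

```python
import numpy as np


def significant_fraction(effect, n, n_sims=2000, z_crit=1.96, seed=0):
    """Fraction of simulated two-group studies that reach significance.

    Each simulated study draws n observations per group from unit-variance
    normals whose means differ by `effect`, then applies a two-sided z-test
    with known variance. The returned fraction is the empirical power, which
    is also the chance that an exact replication comes out significant.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n_sims, n))
    b = rng.normal(effect, 1.0, size=(n_sims, n))
    # Standard error of the difference of two means with unit variance.
    z = (b.mean(axis=1) - a.mean(axis=1)) / np.sqrt(2.0 / n)
    return float(np.mean(np.abs(z) > z_crit))


if __name__ == "__main__":
    for n in (10, 30, 100):
        print(f"n per group = {n:3d}, fraction significant = "
              f"{significant_fraction(0.5, n):.2f}")
```

Under these assumptions, a real effect studied with ten subjects per group reaches significance in only a minority of simulated studies, so a failed replication at that sample size is the expected outcome of sampling variation and needs no causal explanation.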
