Correcting for selection bias via cross-validation in the classification of microarray data

Author(s):  
G. J. McLachlan ◽  
J. Chevelu ◽  
J. Zhu
2006 ◽  
Vol 2 ◽  
pp. 117693510600200 ◽  
Author(s):  
Michael Lecocke ◽  
Kenneth Hess

Background: We consider both univariate- and multivariate-based feature selection for the problem of binary classification with microarray data. The idea is to determine whether the more sophisticated multivariate approach leads to better misclassification error rates because of the potential to consider jointly significant subsets of genes (but without overfitting the data). Methods: We present an empirical study in which 10-fold cross-validation is applied externally to both a univariate-based and two multivariate- (genetic algorithm (GA)-) based feature selection processes. These procedures are applied with respect to three supervised learning algorithms and six published two-class microarray datasets. Results: Across all datasets and learning algorithms, the average 10-fold external cross-validation error rates for the univariate-, single-stage GA-, and two-stage GA-based processes are 14.2%, 14.6%, and 14.2%, respectively. We also find that the optimism bias estimates from the GA analyses were half those of the univariate approach, but the selection bias estimates from the GA analyses were 2.5 times those of the univariate results. Conclusions: We find that the 10-fold external cross-validation misclassification error rates were comparable across the three processes. Further, we find that a two-stage GA approach did not demonstrate a significant advantage over a single-stage approach. We also find that the univariate approach had higher optimism bias and lower selection bias compared to both GA approaches.
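The difference between applying cross-validation externally to the feature selection and applying it only after the genes have been chosen can be illustrated with a small sketch. Everything in it is an assumption made for illustration and not the authors' method: the data are pure noise, the univariate filter ranks genes by the absolute difference in class means, the classifier is a nearest-centroid rule, and names such as `cv_error` and `select_genes` are hypothetical.

```python
import random
from statistics import mean

def make_noise_data(n_samples=40, n_genes=200, seed=0):
    """Synthetic two-class 'microarray' of pure noise (no gene is informative)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def select_genes(X, y, k):
    """Univariate filter: keep the k genes with the largest |mean1 - mean0|."""
    scores = []
    for g in range(len(X[0])):
        m0 = mean(row[g] for row, label in zip(X, y) if label == 0)
        m1 = mean(row[g] for row, label in zip(X, y) if label == 1)
        scores.append((abs(m1 - m0), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]

def fit_centroids(X, y, genes):
    """Per-class mean profile over the selected genes."""
    cents = {}
    for c in (0, 1):
        rows = [row for row, label in zip(X, y) if label == c]
        cents[c] = [mean(r[g] for r in rows) for g in genes]
    return cents

def predict(cents, row, genes):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((row[g] - m) ** 2 for g, m in zip(genes, cents[c]))
    return min((0, 1), key=dist)

def cv_error(X, y, k_genes=10, n_folds=10, external=True, seed=1):
    """10-fold CV error. external=True reselects genes inside every training fold;
    external=False selects genes once on ALL samples (the biased shortcut)."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    genes_all = select_genes(X, y, k_genes)  # only used when external=False
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        genes = select_genes(X_train, y_train, k_genes) if external else genes_all
        cents = fit_centroids(X_train, y_train, genes)
        errors += sum(predict(cents, X[i], genes) != y[i] for i in test)
    return errors / len(y)

if __name__ == "__main__":
    X, y = make_noise_data()
    print("selection outside CV:", cv_error(X, y, external=False))
    print("selection inside CV: ", cv_error(X, y, external=True))
```

On noise data the gap between the two printed estimates is one rough way to see selection bias, although the size of the gap depends on the seed and on the assumed filter and classifier.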


Processes ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. 196
Author(s):  
Araz Soltani Nazarloo ◽  
Vali Rasooli Sharabiani ◽  
Yousef Abbaspour Gilandeh ◽  
Ebrahim Taghinezhad ◽  
Mariusz Szymanek ◽  
...  

The purpose of this work was to investigate the detection of pesticide residue (profenofos) in tomatoes by using visible/near-infrared (VIS/NIR) spectroscopy. The experiments were performed on 180 tomato samples with different percentages of profenofos pesticide (higher and lower values than the maximum residue limit (MRL)) as compared to the control (no pesticide). VIS/NIR spectral data in the range of 400 to 1050 nm were recorded by a spectrometer from pesticide solution, from non-pesticide tomato samples (used as control treatment), and from samples impregnated with different concentrations of pesticide. To classify tomatoes with pesticide content below and above the MRL as healthy and unhealthy samples, respectively, we used different spectral pre-processing methods with partial least squares discriminant analysis (PLS-DA) models. The Smoothing Moving Average pre-processing method, with a standard error of cross validation (SECV) of 4.2767, was selected as the best model for this study. In addition, in the calibration and prediction sets, the percentages of total correctly classified samples were 90% and 91.66%, respectively. Therefore, it can be concluded that reflective spectroscopy (VIS/NIR) can be used as a non-destructive, low-cost, and rapid technique to control the health of tomatoes impregnated with profenofos pesticide.
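As a rough illustration of the moving-average smoothing step mentioned above, the following sketch smooths a single spectrum with a centered window. The window length and the edge handling (a shrunken window at both ends) are assumptions, since the abstract does not state them, and `moving_average` is a hypothetical helper name.

```python
def moving_average(spectrum, window=5):
    """Centered moving average over a list of reflectance values.

    Assumption: points near the edges are averaged over the part of the
    window that fits inside the spectrum, so the output keeps its length.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(spectrum)):
        lo = max(0, i - half)
        hi = min(len(spectrum), i + half + 1)
        smoothed.append(sum(spectrum[lo:hi]) / (hi - lo))
    return smoothed

if __name__ == "__main__":
    noisy = [0.40, 0.46, 0.41, 0.47, 0.43, 0.49, 0.44]
    print(moving_average(noisy, window=3))
```

A wider window removes more noise but also flattens narrow absorption features, so the window length would normally be tuned together with the downstream model.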


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 3044-3044
Author(s):  
David Haan ◽  
Anna Bergamaschi ◽  
Yuhong Ning ◽  
William Gibb ◽  
Michael Kesling ◽  
...  

Background: Epigenomic assays have recently become popular tools for the identification of molecular biomarkers, both in tissue and in plasma. In particular, 5-hydroxymethyl-cytosine (5hmC) profiling has been shown to reflect the epigenomic regulation of gene expression and subsequent gene activity, with different patterns across several tumor and normal tissue types. In this study we show that 5hmC profiles enable discrete classification of tumor and normal tissue for breast, colorectal, lung, ovary and pancreas. Such classification was also recapitulated in cfDNA from patients with breast, colorectal, lung, ovarian and pancreatic cancers. Methods: DNA was isolated from 176 fresh frozen tissues from breast, colorectal, lung, ovary and pancreas (44 per tumor per tissue type and up to 11 tumor tissues for each stage (I-IV)) and up to 10 normal tissues per tissue type. cfDNA was isolated from plasma from 783 non-cancer individuals and 569 cancer patients. Plasma-isolated cfDNA and tumor genomic DNA were enriched for the 5hmC fraction using chemical labelling, sequenced, and aligned to a reference genome to construct feature sets of 5hmC patterns. Results: Multinomial logistic regression analysis of 5hmC was employed across tumor and normal tissues and identified a set of specific and discrete tumor and normal tissue gene-based features. This indicates that we can classify samples regardless of source, with a high degree of accuracy, based on tissue of origin and also distinguish between normal and tumor status. Next, we applied a stacked ensemble machine learning algorithm, combining multiple logistic regression models across diverse feature sets, to the cfDNA dataset composed of 783 non-cancers and 569 cancers comprising 67 breast, 118 colorectal, 210 lung, 71 ovarian and 100 pancreatic cancers. We identified a genomic signature that enables the classification of non-cancer versus cancer with an outer-fold cross-validation sensitivity of 49% (CI 45%-53%) at 99% specificity. Further, individual cancer outer-fold cross-validation sensitivity at 99% specificity was measured as follows: breast 30% (CI 19%-42%); colorectal 41% (CI 32%-50%); lung 49% (CI 42%-56%); ovarian 72% (CI 60%-82%); pancreatic 56% (CI 46%-66%). Conclusions: This study demonstrates that 5hmC profiles can distinguish cancer and normal tissues based on their origin. Further, 5hmC changes in cfDNA enable detection of several cancer types: breast, colorectal, lung, ovarian and pancreatic cancers. Our technology provides a non-invasive tool for cancer detection with low-risk sample collection, enabling better compliance than current screening methods. Among other utilities, we believe our technology could be applied to asymptomatic high-risk individuals, thus enabling enrichment for those subjects who most need a diagnostic imaging follow-up.
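Reporting sensitivity at a fixed specificity amounts to choosing a score cutoff from the non-cancer samples and then counting the cancers that exceed it. The sketch below shows one way to do this, together with a Wilson score interval for the resulting proportion. It is an assumption-laden illustration: the abstract does not say how the cutoff or the confidence intervals were computed, and all function names are hypothetical.

```python
import math

def threshold_for_specificity(non_cancer_scores, specificity=0.99):
    """Smallest observed cutoff such that at least `specificity` of the
    non-cancer scores fall at or below it (higher score = more cancer-like)."""
    ranked = sorted(non_cancer_scores)
    # Small tolerance guards against floating-point overshoot before rounding up.
    k = max(1, math.ceil(specificity * len(ranked) - 1e-9))
    return ranked[k - 1]

def sensitivity_at(cancer_scores, cutoff):
    """Fraction of cancer samples scoring strictly above the cutoff."""
    return sum(s > cutoff for s in cancer_scores) / len(cancer_scores)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

if __name__ == "__main__":
    non_cancer = [i / 100 for i in range(100)]   # toy scores 0.00 .. 0.99
    cancer = [0.5, 0.985, 0.99, 1.0]             # toy scores
    cutoff = threshold_for_specificity(non_cancer, 0.99)
    print("cutoff:", cutoff)
    print("sensitivity:", sensitivity_at(cancer, cutoff))
    print("Wilson 95% CI for 20/67:", wilson_interval(20, 67))
```

In a cross-validated setting the cutoff would be derived from held-out predictions, so that the reported specificity is not tuned on the same samples used to fit the model.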


2018 ◽  
Vol 32 (7) ◽  
pp. 2397-2404
Author(s):  
Mausami Mondal ◽  
Rahul Semwal ◽  
Utkarsh Raj ◽  
Imlimaong Aier ◽  
Pritish Kumar Varadwaj

2020 ◽  
Author(s):  
Eleonora De Filippi ◽  
Mara Wolter ◽  
Bruno Melo ◽  
Carlos J. Tierra-Criollo ◽  
Tiago Bortolini ◽  
...  

During the last decades, neurofeedback training for emotional self-regulation has received significant attention from both the scientific and clinical communities. However, most studies have focused on broader emotional states such as “negative vs. positive”, primarily due to our poor understanding of the functional anatomy of more complex emotions at the electrophysiological level. Our proof-of-concept study aims at investigating the feasibility of classifying two complex emotions that have been implicated in mental health, namely tenderness and anguish, using features extracted from the electroencephalogram (EEG) signal in healthy participants. Electrophysiological data were recorded from fourteen participants during a block-designed experiment consisting of emotional self-induction trials combined with a multimodal virtual scenario. For the within-subject classification, the linear Support Vector Machine was trained with two sets of samples: 1) random cross-validation of the sliding windows of all trials; and 2) strategic cross-validation, assigning all the windows of one trial to the same fold. Spectral features, together with the frontal-alpha asymmetry, were extracted using Complex Morlet Wavelet analysis. Classification results with these features showed an accuracy of 79.3% on average when doing random cross-validation, and 73.3% when applying strategic cross-validation. We extracted a second set of features from the amplitude time-series correlation analysis, which significantly enhanced random cross-validation accuracy while showing similar performance to spectral features when doing strategic cross-validation. These results suggest that complex emotions show distinct electrophysiological correlates, which paves the way for future EEG-based, real-time neurofeedback training of complex emotional states. Significance statement: There is still little understanding about the correlates of high-order emotions (i.e., anguish and tenderness) in the physiological signals recorded with the EEG. Most studies have investigated emotions using functional magnetic resonance imaging (fMRI), including the real-time application in neurofeedback training. However, concerning the therapeutic application, EEG is a more suitable tool with regard to costs and practicability. Therefore, our proof-of-concept study aims at establishing a method for classifying complex emotions that can be later used for EEG-based neurofeedback on emotion regulation. We recorded EEG signals during a multimodal, near-immersive emotion-elicitation experiment. Results demonstrate that intraindividual classification of discrete emotions with features extracted from the EEG is feasible and may be implemented in real-time to enable neurofeedback.
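The two cross-validation schemes described above differ only in how sliding windows are assigned to folds. The following sketch contrasts them under stated assumptions: each window carries the identifier of the trial it came from, trials are distributed round-robin over folds after shuffling, and the function names are hypothetical and not taken from the study.

```python
import random

def random_folds(n_windows, n_folds, seed=0):
    """Random CV: shuffle individual windows, ignoring their trial of origin."""
    idx = list(range(n_windows))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::n_folds]) for i in range(n_folds)]

def trialwise_folds(trial_ids, n_folds, seed=0):
    """'Strategic' CV: all windows of one trial are placed in the same fold."""
    trials = sorted(set(trial_ids))
    random.Random(seed).shuffle(trials)
    fold_of_trial = {t: i % n_folds for i, t in enumerate(trials)}
    folds = [[] for _ in range(n_folds)]
    for window, trial in enumerate(trial_ids):
        folds[fold_of_trial[trial]].append(window)
    return folds

def trials_split_across_folds(folds, trial_ids):
    """Number of trials whose windows ended up in more than one fold."""
    seen = {}
    for f, fold in enumerate(folds):
        for window in fold:
            seen.setdefault(trial_ids[window], set()).add(f)
    return sum(len(fold_set) > 1 for fold_set in seen.values())

if __name__ == "__main__":
    # Toy layout: 8 trials with 5 sliding windows each.
    trial_ids = [t for t in range(8) for _ in range(5)]
    print("random:   ", trials_split_across_folds(
        random_folds(len(trial_ids), 4), trial_ids))
    print("trialwise:", trials_split_across_folds(
        trialwise_folds(trial_ids, 4), trial_ids))
```

Overlapping windows from the same trial are highly correlated, so letting them fall on both sides of a train/test split can inflate accuracy; this is one plausible reading of the gap between the 79.3% and 73.3% figures.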


2009 ◽  
Vol 2009 ◽  
pp. 1-10 ◽  
Author(s):  
Nicoletta Dessì ◽  
Barbara Pes

The classification of cancers from gene expression profiles is a challenging research area in bioinformatics, since the high dimensionality of microarray data results in irrelevant and redundant information that affects the performance of classification. This paper proposes using an evolutionary algorithm to select relevant gene subsets for use in the classification task. This is achieved by combining valuable results from different feature ranking methods into feature pools whose dimensionality is reduced by a wrapper approach involving a genetic algorithm (GA) and an SVM classifier. Specifically, the GA explores the space defined by each feature pool, looking for solutions that balance the size of the feature subsets and their classification accuracy. Experiments demonstrate that the proposed method provides good results in comparison to different state-of-the-art methods for the classification of microarray data.
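One common way for a GA wrapper to balance subset size against accuracy is a weighted fitness function. The sketch below is only an assumed form: the abstract does not give the actual fitness definition, and the weight `alpha`, the bit-mask encoding, and the name `subset_fitness` are hypothetical.

```python
def subset_fitness(mask, accuracy, alpha=0.9):
    """Weighted trade-off between classifier accuracy and subset compactness.

    mask     -- list of 0/1 flags, one per gene in the feature pool
    accuracy -- cross-validated accuracy of the classifier on the selected genes
    alpha    -- assumed weight on accuracy; (1 - alpha) rewards small subsets
    """
    n_selected = sum(mask)
    if n_selected == 0:
        return 0.0  # an empty subset cannot be classified on
    compactness = 1.0 - n_selected / len(mask)
    return alpha * accuracy + (1.0 - alpha) * compactness

if __name__ == "__main__":
    small = [1, 0, 0, 0]
    large = [1, 1, 1, 0]
    # With equal accuracy, the smaller subset scores higher.
    print(subset_fitness(small, 0.8), subset_fitness(large, 0.8))
```

In a full wrapper the `accuracy` argument would come from training and validating the SVM on the genes flagged in `mask`, which is the expensive part of each fitness evaluation.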

