Combining Nearest Neighbor Classifiers Versus Cross-Validation Selection

2004 ◽  
Vol 3 (1) ◽  
pp. 1-19 ◽  
Author(s):  
Minhui Paik ◽  
Yuhong Yang

Various discriminant methods have been applied for classification of tumors based on gene expression profiles, among which the nearest neighbor (NN) method has been reported to perform relatively well. Usually cross-validation (CV) is used to select the neighbor size as well as the number of variables for the NN method. However, CV can perform poorly when there is considerable uncertainty in choosing the best candidate classifier. As an alternative to selecting a single “winner,” we propose a weighting method to combine the multiple NN rules. Four gene expression data sets are used to compare its performance with CV methods. The results show that when the CV selection is unstable, the combined classifier performs much better.
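The contrast between picking one neighbor size by CV and combining several NN rules can be illustrated with a small sketch. The abstract does not state the weighting formula, so the exponential weighting on leave-one-out error counts below is an assumption made purely for illustration, as are all function names and the `temperature` parameter.

```python
import math

def knn_vote(train_X, train_y, x, k):
    """Fraction of the k nearest training labels (0/1) that equal 1."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return sum(train_y[i] for i in order[:k]) / k

def loo_errors(X, y, ks):
    """Leave-one-out misclassification count for each candidate neighbor size."""
    errors = {k: 0 for k in ks}
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        for k in ks:
            predicted = knn_vote(rest_X, rest_y, X[i], k) > 0.5
            errors[k] += int(predicted != bool(y[i]))
    return errors

def combine_weights(errors, temperature=1.0):
    """Hypothetical weighting: rules with more LOO errors get exponentially less weight."""
    raw = {k: math.exp(-e / temperature) for k, e in errors.items()}
    total = sum(raw.values())
    return {k: w / total for k, w in raw.items()}

def combined_predict(X, y, x, weights):
    """Weighted average of the k-NN votes, thresholded at 0.5."""
    score = sum(w * knn_vote(X, y, x, k) for k, w in weights.items())
    return int(score > 0.5)

def cv_select_predict(X, y, x, errors):
    """CV selection: use only the single k with the fewest LOO errors."""
    best_k = min(errors, key=lambda k: (errors[k], k))
    return int(knn_vote(X, y, x, best_k) > 0.5)
```

When several values of k have nearly equal CV error, `cv_select_predict` switches abruptly between them as the data change, whereas `combined_predict` spreads the weight across them. Using odd neighbor sizes avoids tied votes in the two-class case.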

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Kyu-Sang Lim ◽  
Qian Dong ◽  
Pamela Moll ◽  
Jana Vitkovska ◽  
Gregor Wiktorin ◽  
...  

Abstract Background Gene expression profiling in blood is a potential source of biomarkers to evaluate or predict phenotypic differences between pigs but is expensive and inefficient because of the high abundance of globin mRNA in porcine blood. These limitations can be overcome by the use of QuantSeq 3’mRNA sequencing (QuantSeq) combined with a method to deplete or block the processing of globin mRNA prior to or during library construction. Here, we validated the effectiveness of QuantSeq using a novel specific globin blocker (GB) that is included in the library preparation step of QuantSeq. Results In data set 1, four concentrations of the GB were applied to RNA samples from two pigs. The GB significantly reduced the proportion of globin reads compared to non-GB (NGB) samples (P = 0.005) and increased the number of detectable non-globin genes. The highest evaluated concentration (C1) of the GB resulted in the largest reduction of globin reads compared to the NGB (from 56.4 to 10.1%). The second highest concentration C2, which showed very similar globin depletion rates (12%) as C1 but a better correlation of the expression of non-globin genes between NGB and GB (r = 0.98), allowed the expression of an additional 1295 non-globin genes to be detected, although 40 genes that were detected in the NGB sample (at a low level) were not present in the GB library. Concentration C2 was applied in the rest of the study. In data set 2, the distribution of the percentage of globin reads for NGB (n = 184) and GB (n = 189) samples clearly showed the effects of the GB on reducing globin reads, in particular for HBB, similar to results from data set 1. Data set 3 (n = 84) revealed that the proportion of globin reads that remained in GB samples was significantly and positively correlated with the reticulocyte count in the original blood sample (P < 0.001). 
Conclusions The effect of the GB on reducing the proportion of globin reads in porcine blood QuantSeq was demonstrated in three data sets. In addition to increasing the efficiency of sequencing non-globin mRNA, the GB for QuantSeq has the advantage that it does not require an additional step prior to or during library creation. Therefore, the GB is a useful tool in the quantification of whole gene expression profiles in porcine blood.


2012 ◽  
Vol 11 ◽  
pp. CIN.S10375 ◽  
Author(s):  
Mark Burton ◽  
Mads Thomassen ◽  
Qihua Tan ◽  
Torben A. Kruse

Background In cancer research, the popularity of a large number of microarray applications has led to the development of predictive or prognostic gene expression profiles. However, the diversity of microarray platforms has made full validation of such profiles and their related gene lists across studies difficult, and their classification accuracies have rarely been validated in multiple independent datasets. Frequently, while the individual genes between such lists may not match, genes with the same function are included across such gene lists. Development of such lists does not take into account the fact that genes can be grouped together as metagenes (MGs) based on common characteristics such as pathways, regulation, or genomic location. Such MGs might be used as features in building a predictive model applicable for classifying independent data. It is, therefore, necessary to systematically compare independent validation of gene lists or classifiers based on metagene or individual gene (SG) features. Methods In this study we compared the performance of either metagene- or single gene-based feature sets and classifiers using random forest and two support vector machines for classifier building. The performance within the same dataset, feature set validation performance, and validation performance of entire classifiers in strictly independent datasets were assessed by 10 times repeated 10-fold cross validation, leave-one-out cross validation, and one-fold validation, respectively. To test the significance of the performance difference between MG- and SG-features/classifiers, we used a repeated down-sampled binomial test approach. Results MG- and SG-feature sets are transferable and perform well for training and testing prediction of metastasis outcome in strictly independent data sets, both between different and within similar microarray platforms, while classifiers had a poorer performance when validated in strictly independent datasets.
The study showed that MG- and SG-feature sets perform equally well in classifying independent data. Furthermore, SG-classifiers significantly outperformed MG-classifiers when validation was conducted between datasets using similar platforms, while no significant performance difference was found when validation was performed between different platforms. Conclusion Prediction of metastasis outcome in lymph node–negative patients by MG- and SG-classifiers showed that SG-classifiers performed significantly better than MG-classifiers when validated in independent data based on the same microarray platform as used for developing the classifier. However, the MG- and SG-classifiers had similar performance when conducting classifier validation in independent data based on a different microarray platform. The latter was also true when only validating sets of MG- and SG-features in independent datasets, both between and within similar and different platforms.
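The two resampling schemes named in the Methods (10 times repeated 10-fold cross validation and leave-one-out cross validation) can be sketched as index generators. This is only an illustration of the general schemes, not the authors' code; the function names, the seed handling, and the unstratified fold assignment are assumptions.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        # Reshuffle the sample order for every repeat so the folds differ.
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield repeat, f, train_idx, test_idx

def leave_one_out(n_samples):
    """Yield (train_idx, test_idx) with exactly one held-out sample per split."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]
```

Averaging a performance metric over all splits from `repeated_kfold` reduces the dependence of the estimate on any single random partition, which is the usual motivation for repeating the procedure.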


Blood ◽  
2007 ◽  
Vol 110 (11) ◽  
pp. 2606-2606
Author(s):  
N.A. Johnson ◽  
T. Nayar ◽  
S.S. Dave ◽  
G. Wright ◽  
A. Rosenwald ◽  
...  

Abstract Background: FL is a common NHL that has a broad spectrum of clinical outcomes. Over time some pts will transform to an aggressive histology (Tly) associated with inferior survival. In 2004, the LLMPP constructed a model that was predictive of overall survival (OS) based on the gene expression profiles (GEP) of 191 specimens taken from pts with untreated FL. The genes associated with survival were derived from the non-neoplastic immune response (IR) cells. However the risk of developing Tly was not addressed in this study. Thus we re-analyzed the GEP with updated clinical data. Our goal was to validate our previous model with extended follow-up and to create a model that would predict the risk of developing TLy. Methods: 170 of 191 previously untreated FL pts had updated clinical information but only 142 had transformation outcome. Transformation was defined as biopsy proven DLBCL or clinically based on the presence of at least one of the following: hypercalcemia, a sudden rise in LDH > twice baseline, unusual extranodal growth or rapid discordant nodal growth. Raw CEL files from Affymetrix U133A arrays were pre-processed and normalized using Bioconductor’s GCRMA package. Models were developed using SignS package (http://signs/bioinfo.cnio.es/), with 10 times cross-validation. All gene lists produced in these analyses were then re-tested for association with outcome using Bioconductor’s Globaltest package. Over Representation Analysis of signature components was performed using Dchip. Results: The median OS of these patients was 8 yrs. A new 7-component survival model (85 genes) was developed that was significantly associated with survival (p = 2.9×10^−13). In Globaltest, these gene lists were associated with survival at a level of (p = 2.6×10^−5). The previous model using IR-1 and IR-2 signatures was associated with survival at a level of p = 2.6×10^−4.
Although there is little overlap between the 2 models, the new model confirms the importance of IR genes and extracellular matrix genes as being prognostically important. Interestingly, one component containing 10 genes on chromosome 6q was associated with a superior survival (p < 1×10^−7). 27% developed Tly over a median follow-up time of 11.2 yrs (69% biopsy proven). Our transformation model included 53 genes divided into 3 components (p = 0.001). The Globaltest analysis for association of these genes with transformation was significant (p = 0.018). 54 genes overlapped between the survival genes and transformation genes that were present in >1 cross validation run. These were significantly enriched in genes important in immune response like T cell and macrophage activation. Conclusion: Our survival model is stable and confirms the importance of key genes involved in the immune response and lymph node remodeling. It also introduces new genes that are potentially important for survival. Our transformation model may shed light on the mechanisms involved in the progression of FL to DLBCL but it is less stable and less reliable than our survival model at predicting outcome.


2005 ◽  
Vol 21 (1) ◽  
pp. 43-58 ◽  
Author(s):  
Jiang Li ◽  
Maria L. Spletter ◽  
Jeffrey A. Johnson

This paper compares the gene expression profiles identified by short (Affymetrix U95AV2) or long (Agilent Hu1A) oligonucleotide arrays on a model for upregulation of a cluster of antioxidant responsive element-driven genes by treatment with tert-butylhydroquinone. MAS 5.0, dCHIP, and RMA were applied to normalize the Affymetrix data, and Lowess regression was considered for Agilent data. SAM was used to identify the differential gene expression. A set of biological markers and housekeeping genes were chosen to evaluate the performance of multiple normalization approaches. Both arrays illustrated a definite set of overlapping genes between the data sets regardless of data mining tools used. However, unique gene expression profiles based on the platform used were also revealed and confirmed by quantitative RT-PCR. Further analysis of the data revealed by alternative approaches suggested that alternative splicing, multiple vs. single probe(s) measurement, and use or nonuse of mismatch probes may account for the discrepant data. Therefore, these two microarray technologies offer relatively reliable data. Integration of the gene expression profiles from different array platforms may not only help for cross-validation but also provide a more complete view of the transcriptional scenario.


2019 ◽  
Author(s):  
Necla Koçhan ◽  
Gözde Yazgı Tütüncü ◽  
Göknur Giner

Abstract Background and Objective Recent developments in next-generation sequencing (NGS) based on RNA-sequencing (RNA-Seq) allow researchers to measure the expression levels of thousands of genes for multiple samples simultaneously. In order to analyze these kinds of data sets, many classification models have been proposed in the literature. Most of the existing classifiers assume that genes are independent; however, this is not a realistic assumption for real RNA-Seq classification problems. For this reason, some other classification methods, which incorporate the dependence structure between genes into a model, have been proposed. qtQDA, proposed by Koçhan et al. [1], is one of those classifiers; it estimates the covariance matrix by the Maximum Likelihood Estimator. Methods In this study, we use another approach based on the local dependence function to estimate the covariance matrix to be used in the qtQDA classification model. We investigate the impact of different covariance estimates on RNA-Seq data classification. Results The performances of the qtQDA classifier based on two different covariance matrix estimates are compared over two real RNA-Seq data sets, in terms of classification error rates. The results show that using the local dependence function approach yields a better estimate of the covariance matrix and increases the performance of the qtQDA classifier. Conclusion Incorporating an accurate covariance matrix into the classification model is a crucial step, particularly for cancer prediction. The local covariance matrix estimate allows researchers to classify cancer patients based on gene expression profiles more accurately. R code for the local dependence function is available at https://github.com/Necla/LocalDependence.


2021 ◽  
Author(s):  
Katie Mika ◽  
Camilla M. Whittington ◽  
Bronwyn M. McAllan ◽  
Vincent J Lynch

Structural and physiological changes in the female reproductive system underlie the origins of pregnancy in multiple vertebrate lineages. In mammals, for example, the glandular portion of the lower reproductive tract has transformed into a structure specialized for supporting fetal development. These specializations range from relatively simple maternal provisioning in egg-laying monotremes to an elaborate suite of traits that support intimate maternal-fetal interactions in Eutherians. Among these traits are the maternal decidua and fetal component of the placenta, but there is considerable uncertainty about how these structures evolved. We identified the origins of pregnancy utilizing ancestral transcriptome reconstruction to infer functional evolution of the maternal-fetal interface. Remarkably, we found that maternal gene expression profiles are correlated with degree of placental invasion. These results indicate that an epitheliochorial-like placenta evolved early in the mammalian stem-lineage and that the ancestor of Eutherians had a hemochorial placenta, and suggest maternal control of placental invasiveness. Collectively, these data resolve major transitions in the evolution of pregnancy and indicate that ancestral transcriptome reconstruction can be used to study the function of ancestral cell, tissue, and organ systems.


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Liying Yang ◽  
Zhimin Liu ◽  
Xiguo Yuan ◽  
Jianhua Wei ◽  
Junying Zhang

Background. Precisely predicting cancer is crucial for cancer treatment. Gene expression profiles make it possible to analyze patterns between genes and cancers on the genome-wide scale. Gene expression data analysis, however, is confronted with enormous challenges owing to its characteristics, such as high dimensionality, small sample size, and low Signal-to-Noise Ratio. Results. This paper proposes a method, termed RS_SVM, to predict gene expression profiles via aggregating SVM trained on random subspaces. After choosing gene features through statistical analysis, RS_SVM randomly selects feature subsets to yield random subspaces, trains SVM classifiers accordingly, and then aggregates the SVM classifiers to capture the advantage of ensemble learning. Experiments on eight real gene expression datasets are performed to validate the RS_SVM method. Experimental results show that RS_SVM achieved better classification accuracy and generalization performance in contrast with single SVM, K-nearest neighbor, decision tree, Bagging, AdaBoost, and the state-of-the-art methods. Experiments also explored the effect of subspace size on prediction performance. Conclusions. The proposed RS_SVM method yielded superior performance in analyzing gene expression profiles, which demonstrates that RS_SVM provides a good channel for such biological data.
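The random-subspace idea described above (train one SVM per random feature subset, then aggregate) can be sketched as follows. This is not the RS_SVM implementation: the abstract gives neither the SVM variant nor the aggregation rule, so the linear SVM trained by subgradient descent, the majority vote, the omission of the statistical feature pre-selection step, and all names and default parameters are assumptions.

```python
import random

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Full-batch subgradient descent on the L2-regularized hinge loss; labels in {-1, +1}."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge subgradient
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def train_random_subspace_svm(X, y, n_models=5, subspace_size=2, seed=0):
    """Train one linear SVM per randomly drawn feature subset (random subspace)."""
    rng = random.Random(seed)
    d = len(X[0])
    models = []
    for _ in range(n_models):
        features = sorted(rng.sample(range(d), subspace_size))
        X_sub = [[row[j] for j in features] for row in X]
        models.append((features, train_linear_svm(X_sub, y)))
    return models

def predict(models, x):
    """Aggregate the base classifiers by majority vote (assumed aggregation rule)."""
    votes = 0
    for features, (w, b) in models:
        score = sum(wj * x[j] for wj, j in zip(w, features)) + b
        votes += 1 if score > 0 else -1
    return 1 if votes > 0 else -1
```

An odd `n_models` avoids tied votes. The `subspace_size` parameter corresponds to the subspace size whose effect on prediction performance the abstract says was explored.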


2014 ◽  
Vol 281 (1797) ◽  
pp. 20141868 ◽  
Author(s):  
Zhengzheng S. Liang ◽  
Heather R. Mattila ◽  
Sandra L. Rodriguez-Zas ◽  
Bruce R. Southey ◽  
Thomas D. Seeley ◽  
...  

Individual differences in behaviour are often consistent across time and contexts, but it is not clear whether such consistency is reflected at the molecular level. We explored this issue by studying scouting in honeybees in two different behavioural and ecological contexts: finding new sources of floral food resources and finding a new nest site. Brain gene expression profiles in food-source and nest-site scouts showed a significant overlap, despite large expression differences associated with the two different contexts. Class prediction and ‘leave-one-out’ cross-validation analyses revealed that a bee's role as a scout in either context could be predicted with 92.5% success using 89 genes at minimum. We also found that genes related to four neurotransmitter systems were part of a shared brain molecular signature in both types of scouts, and the two types of scouts were more similar for genes related to glutamate and GABA than catecholamine or acetylcholine signalling. These results indicate that consistent behavioural tendencies across different ecological contexts involve a mixture of similarities and differences in brain gene expression.


2019 ◽  
Author(s):  
Kyu-Sang Lim ◽  
Qian Dong ◽  
Pamela Renate Moll ◽  
Jana Vitkovska ◽  
Gregor Wiktorin ◽  
...  

Abstract Background Gene expression profiling in blood is a potential source of biomarkers to evaluate or predict phenotypic differences between pigs but is expensive and inefficient because of the high abundance of hemoglobin (HB) mRNA in porcine blood. These limitations can be overcome by the use of QuantSeq 3’mRNA sequencing (QuantSeq) combined with a method to deplete or block the processing of HB mRNA prior to or during library construction. Here, we validated the effectiveness of QuantSeq using a novel specific globin blocker (GB) that is included in the library preparation step of QuantSeq. Results In data set 1, four concentrations of the GB were applied to RNA samples from two pigs. The GB significantly reduced the proportion of HB reads compared to non-GB (NGB) samples (P = 0.005) and increased the number of detectable non-HB genes. The second highest concentration C2, which showed very similar globin depletion rates (from 56.4 to 12%) as C1 but a better correlation of the expression of non-HB genes between NGB and GB (r = 0.98), allowed the expression of an additional 1,295 non-HB genes to be detected, although 40 genes that were detected in the NGB sample (at a low level) were not present in the GB library. Concentration C2 was applied in the rest of the study. In data set 2, the distribution of the percentage of HB reads for NGB (n = 184) and GB (n = 189) samples clearly showed the effects of the GB on reducing HB reads. Data set 3 (n = 84) revealed that the proportion of HB reads that remained in GB samples was significantly and positively correlated with the reticulocyte count in the original blood sample (P < 0.001). Conclusions The effect of the GB on reducing the proportion of HB reads in porcine blood QuantSeq was demonstrated in three data sets. In addition to increasing the efficiency of sequencing non-HB mRNA, the GB for QuantSeq has the advantage that it does not require an additional step prior to or during library creation.
Therefore, the GB is a useful tool in the quantification of whole gene expression profiles in porcine blood.


2010 ◽  
Vol 298 (5) ◽  
pp. G582-G589 ◽  
Author(s):  
Robert S. Chapkin ◽  
Chen Zhao ◽  
Ivan Ivanov ◽  
Laurie A. Davidson ◽  
Jennifer S. Goldsby ◽  
...  

We have developed a novel molecular methodology that utilizes stool samples containing intact sloughed epithelial cells to quantify intestinal gene expression profiles in the developing human neonate. Since nutrition exerts a major role in regulating neonatal intestinal development and function, our goal was to identify gene sets (combinations) that are differentially regulated in response to infant feeding. For this purpose, fecal mRNA was isolated from exclusively breast-fed (n = 12) and formula-fed (n = 10) infants at 3 mo of age. Linear discriminant analysis was successfully used to identify the single genes and the two- to three-gene combinations that best distinguish the feeding groups. In addition, putative “master” regulatory genes were identified using coefficient of determination analysis. These results support our premise that mRNA isolated from stool has value in terms of characterizing the epigenetic mechanisms underlying the developmentally regulated transcriptional activation/repression of genes known to modulate gastrointestinal function. As larger data sets become available, this methodology can be extended to validation and, ultimately, identification of the main nutritional components that modulate intestinal maturation and function.

