Cautionary Note on Using Cross-Validation for Molecular Classification

2016 ◽  
Vol 34 (32) ◽  
pp. 3931-3938 ◽  
Author(s):  
Li-Xuan Qin ◽  
Huei-Chung Huang ◽  
Colin B. Begg

Purpose: Reproducibility of scientific experimentation has become a major concern because of the perception that many published biomedical studies cannot be replicated. In this article, we draw attention to the connection between inflated overoptimistic findings and the use of cross-validation for error estimation in molecular classification studies. We show that, in the absence of careful design to prevent artifacts caused by systematic differences in the processing of specimens, established tools such as cross-validation can lead to a spurious estimate of the error rate in the overoptimistic direction, regardless of the use of data normalization as an effort to remove these artifacts.

Methods: We demonstrated this important yet overlooked complication of cross-validation using a unique pair of data sets on the same set of tumor samples. One data set was collected with uniform handling to prevent handling effects; the other was collected without uniform handling and exhibited handling effects. The paired data sets were used to estimate the biologic effects of the samples and the handling effects of the arrays in the latter data set, which were then used to simulate data using virtual rehybridization following various array-to-sample assignment schemes.

Results: Our study showed that (1) cross-validation tended to underestimate the error rate when the data possessed confounding handling effects; (2) depending on the relative amount of handling effects, normalization may further worsen the underestimation of the error rate; and (3) balanced assignment of arrays to comparison groups allowed cross-validation to provide an unbiased error estimate.

Conclusion: Our study demonstrates the benefits of balanced array assignment for reproducible molecular classification and calls for caution on the routine use of data normalization and cross-validation in such analysis.
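The mechanism is easy to reproduce in a few lines. Below is a hedged sketch (not the authors' virtual-rehybridization simulation; it assumes scikit-learn and pure-noise features) showing how a handling effect confounded with class labels makes cross-validation report spuriously low error, while a balanced array-to-group assignment does not.

```python
# Illustrative sketch only: a batch (handling) effect confounded with class
# labels inflates cross-validated accuracy even when the features carry no
# biological signal at all.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 100, 200
y = np.repeat([0, 1], n // 2)          # two comparison groups

X = rng.normal(size=(n, p))            # pure noise: no true class signal

# Confounded design: every class-1 sample is processed in batch B,
# which shifts a subset of features (a handling effect, not biology).
X_conf = X.copy()
X_conf[y == 1, :20] += 1.0

# Balanced design: batch membership is independent of class.
batch = rng.permutation(n) < n // 2
X_bal = X.copy()
X_bal[batch, :20] += 1.0

clf = LogisticRegression(max_iter=1000)
print("confounded CV accuracy:", cross_val_score(clf, X_conf, y, cv=5).mean())
print("balanced   CV accuracy:", cross_val_score(clf, X_bal, y, cv=5).mean())
# Expected: near 1.0 for the confounded design (spuriously optimistic),
# near 0.5 for the balanced design (correct, since there is no signal).
```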

1997 ◽  
Vol 1997 ◽  
pp. 143-143 ◽
Author(s):  
B.L. Nielsen ◽  
R.F. Veerkamp ◽  
J.E. Pryce ◽  
G. Simm ◽  
J.D. Oldham

High-producing dairy cows have been found to be more susceptible to disease (Jones et al., 1994; Gröhn et al., 1995), raising concerns about the welfare of the modern dairy cow. Genotype and number of lactations may affect various health problems differently, and their relative importance may vary. The categorical nature and low incidence of health events necessitate large data-sets, but the use of data collected across herds may introduce unwanted variation. Analysis of a comprehensive data-set from a single herd was carried out to investigate the effects of genetic line and lactation number on the incidence of various health and reproductive problems.


Entropy ◽  
2019 ◽  
Vol 21 (11) ◽  
pp. 1051 ◽
Author(s):  
Jerzy W. Grzymala-Busse ◽  
Zdzislaw S. Hippe ◽  
Teresa Mroczek

Results of experiments on numerical data sets discretized using two methods, global versions of Equal Frequency per Interval and Equal Interval Width, are presented. Globalization of both methods is based on entropy. For the discretized data sets, left and right reducts were computed. For each discretized data set and the two data sets based, respectively, on left and right reducts, we applied ten-fold cross-validation using the C4.5 decision tree generation system. Our main objective was to compare the quality of all three types of data sets in terms of the error rate. Additionally, we compared the complexity of the generated decision trees. We show that reduction of data sets may only increase the error rate and that the decision trees generated from reduced data sets are not simpler than the decision trees generated from non-reduced data sets.
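As a rough illustration of this pipeline, the sketch below uses scikit-learn's KBinsDiscretizer for equal-width ("uniform") and equal-frequency ("quantile") discretization, with a CART decision tree standing in for C4.5; the paper's entropy-based globalization and reduct computation are not reproduced here.

```python
# Hedged sketch: discretize, then estimate the error rate by ten-fold CV.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# "uniform" = Equal Interval Width, "quantile" = Equal Frequency per Interval
for strategy in ("uniform", "quantile"):
    pipe = make_pipeline(
        KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy),
        DecisionTreeClassifier(random_state=0),
    )
    acc = cross_val_score(pipe, X, y, cv=10).mean()
    print(f"{strategy}: ten-fold CV error rate = {1 - acc:.3f}")
```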


2018 ◽  
Vol 7 (2.15) ◽  
pp. 136 ◽  
Author(s):  
Rosaida Rosly ◽  
Mokhairi Makhtar ◽  
Mohd Khalid Awang ◽  
Mohd Isa Awang ◽  
Mohd Nordin Abdul Rahman

This paper analyses the performance of classification models using single classifiers and combinations of ensemble methods, with the Breast Cancer Wisconsin and Hepatitis data sets as training data. It presents a comparison of different classifiers based on 10-fold cross-validation using a data mining tool. In this experiment, various classifiers are implemented, including three popular ensemble methods for the combinations: boosting, bagging, and stacking. The result shows that for the classification of the Breast Cancer Wisconsin data set, the single Naïve Bayes (NB) classifier and the bagging+NB combination displayed the highest accuracy at the same percentage (97.51%) compared to other combinations of ensemble classifiers. For the classification of the Hepatitis data set, the combination of stacking+Multi-Layer Perceptron (MLP) achieved the highest accuracy at 86.25%. Using ensemble classifiers may improve the results. In future work, a multi-classifier approach will be proposed by introducing fusion at the classification level between these classifiers to obtain classifications with higher accuracies.
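A minimal sketch of this kind of comparison follows, assuming scikit-learn in place of the paper's data mining tool and the UCI Breast Cancer Wisconsin data set; the base/meta-learner pairings here are illustrative, not the paper's exact configurations.

```python
# Hedged sketch: compare a single classifier against bagging and stacking
# combinations under 10-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "NB": GaussianNB(),
    "bagging+NB": BaggingClassifier(GaussianNB(), n_estimators=10,
                                    random_state=0),
    "stacking+MLP": StackingClassifier(        # MLP as the meta-learner
        estimators=[("nb", GaussianNB()),
                    ("tree", DecisionTreeClassifier(random_state=0))],
        final_estimator=MLPClassifier(max_iter=2000, random_state=0),
    ),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: 10-fold CV accuracy = {acc:.4f}")
```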


2018 ◽  
Vol 11 (2) ◽  
pp. 1207-1231 ◽  
Author(s):  
Taku Umezawa ◽  
Carl A. M. Brenninkmeijer ◽  
Thomas Röckmann ◽  
Carina van der Veen ◽  
Stanley C. Tyler ◽  
...  

Abstract. We report results from a worldwide interlaboratory comparison of samples among laboratories that measure (or measured) stable carbon and hydrogen isotope ratios of atmospheric CH4 (δ13C-CH4 and δD-CH4). The offsets among the laboratories are larger than the measurement reproducibility of individual laboratories. To disentangle plausible measurement offsets, we evaluated and critically assessed a large number of intercomparison results, some of which have been documented previously in the literature. The results indicate significant offsets of δ13C-CH4 and δD-CH4 measurements among data sets reported from different laboratories; at the modern atmospheric CH4 level, the differences among laboratories spread over ranges of 0.5 ‰ for δ13C-CH4 and 13 ‰ for δD-CH4. The intercomparison results summarized in this study may be of help in future attempts to harmonize δ13C-CH4 and δD-CH4 data sets from different laboratories in order to jointly incorporate them into modelling studies. However, establishing a merged data set that includes δ13C-CH4 and δD-CH4 data from multiple laboratories with the desired compatibility is still challenging due to differences among laboratories in instrument settings, correction methods, traceability to reference materials, and long-term data management. Further efforts are needed to identify the causes of the interlaboratory measurement offsets and to reduce them, moving towards the best use of the available δ13C-CH4 and δD-CH4 data sets.
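For illustration only, the snippet below shows the simplest form such harmonization can take: estimating an additive offset between two laboratories from co-measured samples and re-expressing one record on the other's scale. The values are invented, and real harmonization must also handle scale differences, drift, and traceability.

```python
# Illustrative arithmetic only: an additive offset correction between two
# laboratories' delta13C-CH4 records, estimated from shared air samples.
import numpy as np

lab_a = np.array([-47.30, -47.28, -47.35])   # per mil, laboratory A
lab_b = np.array([-47.18, -47.15, -47.22])   # per mil, same samples at lab B

offset = np.mean(lab_b - lab_a)              # mean B-minus-A difference
lab_b_on_a_scale = lab_b - offset            # express B's record on A's scale
print(f"estimated offset: {offset:+.2f} per mil")
```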


2012 ◽  
Vol 433-440 ◽  
pp. 3959-3963 ◽  
Author(s):  
Bayram Akdemir ◽  
Nurettin Çetinkaya

In distribution systems, load forecasting is one of the major management problems for maintaining energy flow, protecting the system, and operating it economically. In order to manage the system, the next step of the load characteristic must be inferred from historical data sets. For forecasting, not only historical parameters are used; external parameters such as weather conditions, seasons, and population are also important for predicting the next behavior of the load characteristic. Holidays and weekdays affect energy consumption differently in any country. In this study, the target is to forecast the peak energy level for the next hour and to compare the effects of weekdays and holidays on peak energy needs. Energy consumption data sets have nonlinear characteristics, and it is not easy to fit any curve to them due to their nonlinearity and large number of parameters. In order to forecast the peak energy level, an adaptive neuro-fuzzy inference system (ANFIS) is used, and the hourly effects of holidays and weekdays on the peak energy level are examined. The outputs of the artificial intelligence model are evaluated using two-fold cross-validation and the mean absolute percentage error (MAPE). The obtained two-fold cross-validation error, as mean absolute percentage error, is 3.51, and the data set including holidays gives higher accuracy than the data set without holidays. Total success increased by 2.4%.
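The evaluation protocol can be sketched as follows, assuming synthetic hourly load data and scikit-learn's MLPRegressor as a stand-in for the ANFIS model actually used in the study.

```python
# Hedged sketch: two-fold cross-validation scored by MAPE on synthetic
# peak-load data with an hour-of-day and a holiday feature.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
hours = rng.uniform(0, 24, size=(500, 1))
is_holiday = rng.integers(0, 2, size=(500, 1))
X = np.hstack([hours, is_holiday])
# Synthetic peak load: daily cycle, lower demand on holidays, plus noise.
y = (50 + 20 * np.sin(hours[:, 0] * np.pi / 12)
     - 5 * is_holiday[:, 0] + rng.normal(scale=2, size=500))

mape_folds = []
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000,
                         random_state=0).fit(X[train], y[train])
    pred = model.predict(X[test])
    mape_folds.append(np.mean(np.abs((y[test] - pred) / y[test])) * 100)

print(f"two-fold CV MAPE: {np.mean(mape_folds):.2f}%")
```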


2021 ◽  
Vol 11 ◽  
Author(s):  
Zhiyuan Zhang ◽  
Meiling Ji ◽  
Jie Li ◽  
Qi Wu ◽  
Yuanjian Huang ◽  
...  

The molecular classification of patients with colon cancer is inconclusive. Gene set enrichment analysis (GSEA) of genes dysregulated between normal and tumor tissues indicated that the cell cycle plays a crucial role in colon cancer. We performed univariate Cox regression analysis to identify prognosis-related genes; these genes were then intersected with cell cycle-associated genes and recognized as prognostic, cell cycle-associated genes. Unsupervised non-negative matrix factorization (NMF) clustering was performed based on these cell cycle-associated genes. Two subgroups were identified with different overall survival, clinical features, cell cycle enrichment profiles, and mutation profiles. Through nearest template prediction (NTP), the molecular classification could be effectively repeated in the original data set and validated in several independent data sets, indicating that the classification is highly repeatable. Furthermore, we constructed two prognostic signatures in the two subgroups, respectively. Our molecular classification based on the cell cycle may provide novel insight into the treatment and prognosis of colon cancer.
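A minimal sketch of the NMF subgrouping step is shown below, on a synthetic expression matrix with two planted subgroups; the study's survival analysis, mutation profiling, and NTP validation are not reproduced here.

```python
# Hedged sketch: unsupervised NMF clustering of samples into two subgroups.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
# 60 samples x 100 cell cycle-associated genes (non-negative expression).
expr = rng.gamma(2.0, 1.0, size=(60, 100))
expr[:30, :50] += 2.0                          # planted subgroup-1 signature
expr[30:, 50:] += 2.0                          # planted subgroup-2 signature

W = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(expr)
subgroup = W.argmax(axis=1)                    # assign each sample to the
print(subgroup)                                # component it loads on most
```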


2019 ◽  
Author(s):  
Martin Papenberg ◽  
Gunnar W. Klau

Numerous applications in psychological research require that a pool of elements be partitioned into multiple parts. While many applications seek groups that are well separated, i.e., dissimilar from each other, others require the different groups to be as similar as possible. Examples include the assignment of students to parallel courses, assembling stimulus sets in experimental psychology, splitting achievement tests into parts of equal difficulty, and dividing a data set for cross-validation. We present anticlust, an easy-to-use and free software package for solving these problems quickly and in an automated manner. The package anticlust is an open-source extension to the R programming language and implements the methodology of anticlustering. Anticlustering divides elements into similar parts, ensuring similarity between groups by enforcing heterogeneity within groups. Thus, anticlustering is the direct reversal of cluster analysis, which aims to maximize homogeneity within groups and dissimilarity between groups. Our package anticlust implements two anticlustering criteria, reversing the clustering methods k-means and cluster editing, respectively. In a simulation study, we show that anticlustering returns excellent results and outperforms alternative approaches like random assignment and matching. In three example applications, we illustrate how to apply anticlust to real data sets. We demonstrate how to assign experimental stimuli to equivalent sets based on norming data, how to divide a large data set for cross-validation, and how to split a test into parts of equal item difficulty and discrimination.
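anticlust itself is an R package; purely to illustrate the idea, here is a small Python sketch of anticlustering as the reversal of k-means: a random exchange heuristic that maximizes, rather than minimizes, the within-group variance, so that equal-sized groups end up mutually similar.

```python
# Hedged sketch of the anticlustering idea (not the anticlust R API).
import numpy as np

def anticluster(X, k, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.arange(len(X)) % k             # equal-sized groups
    rng.shuffle(labels)

    def objective(lab):
        # Total within-group variance. Anticlustering MAXIMIZES this,
        # the reverse of k-means, so group centroids stay close together.
        return sum(((X[lab == g] - X[lab == g].mean(axis=0)) ** 2).sum()
                   for g in range(k))

    best = objective(labels)
    for _ in range(n_iter):
        i, j = rng.choice(len(X), size=2, replace=False)
        if labels[i] == labels[j]:
            continue
        labels[i], labels[j] = labels[j], labels[i]   # try an exchange
        new = objective(labels)
        if new > best:
            best = new                                # keep improvements
        else:
            labels[i], labels[j] = labels[j], labels[i]  # undo otherwise
    return labels

X = np.random.default_rng(3).normal(size=(40, 2))
print(anticluster(X, k=2))                     # two mutually similar halves
```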


Author(s):  
Sarina Sulaiman ◽  
Nor Amalina Abdul Rahim ◽  
Andri Pranolo

The emergence and growth of internet usage has accumulated an extensive amount of data. These data contain a wealth of undiscovered, valuable information, and incomplete data sets may lead to observation errors. This research explored a technique for analyzing data that transforms meaningless data into meaningful information. The work focused on Rough Set (RS) theory to deal with incomplete data and rule derivation. Rules with high and low left-hand-side (LHS) support values generated by RS were used as query statements to form clusters of data. The model was tested on an AIDS blog data set consisting of 146 bloggers and an E-Learning@UTM (EL) log data set comprising 23105 URLs. 5-fold and 10-fold cross-validation were used to split the data. The Naïve algorithm and the Boolean algorithm as discretization techniques, and Johnson's algorithm (Johnson) and the Genetic algorithm (GA) as reduction techniques, were employed to compare the results. 5-fold cross-validation tended to suit the AIDS data well, while 10-fold cross-validation was best for the EL data set. Johnson and GA yielded the same number of rules for both data sets. These findings are significant as evidence, in terms of the accuracy achieved, for the proposed model.
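Rough set tooling is not reproduced here; the sketch below illustrates only the fold-count comparison, assuming scikit-learn and a decision tree as a stand-in for the RS rule model.

```python
# Hedged sketch: compare 5-fold and 10-fold cross-validation estimates.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)
for folds in (5, 10):
    acc = cross_val_score(clf, X, y, cv=folds).mean()
    print(f"{folds}-fold CV accuracy: {acc:.4f}")
```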


Author(s):  
Kristin Vanderbilt ◽  
David Blankman

Science has become a data-intensive enterprise. Data sets are commonly being stored in public data repositories and are thus available for others to use in new, often unexpected ways. Such re-use of data sets can take the form of reproducing the original analysis, analyzing the data in new ways, or combining multiple data sets into new data sets that are analyzed still further. A scientist who re-uses a data set collected by another must be able to assess its trustworthiness. This chapter reviews the types of errors that are found in metadata referring to data collected manually, data collected by instruments (sensors), and data recovered from specimens in museum collections. It also summarizes methods used to screen these types of data for errors. It stresses the importance of ensuring that metadata associated with a data set thoroughly document the error prevention, detection, and correction methods applied to the data set prior to publication.
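As a concrete and entirely illustrative example of the kind of screening the chapter surveys, the sketch below applies two common sensor-data checks, a physical range check and a spike (rate-of-change) check; the thresholds are invented for illustration.

```python
# Hedged sketch: flag sensor readings that fail simple quality checks.
import numpy as np

temps = np.array([21.3, 21.5, 21.4, 85.0, 21.6, 21.7, -99.9, 21.5])

range_flag = (temps < -40) | (temps > 60)      # outside plausible range
spike_flag = np.zeros_like(range_flag)
spike_flag[1:] = np.abs(np.diff(temps)) > 10   # implausible jump per step

for i, t in enumerate(temps):
    if range_flag[i] or spike_flag[i]:
        print(f"index {i}: value {t} flagged for review")
```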


2007 ◽  
Vol 15 (4) ◽  
pp. 365-386 ◽  
Author(s):  
Yoshiko M. Herrera ◽  
Devesh Kapur

This paper examines the construction and use of data sets in political science. We focus on three interrelated questions: How might we assess data quality? What factors shape data quality? And how can these factors be addressed to improve data quality? We first outline some problems with existing data set quality, including issues of validity, coverage, and accuracy, and we discuss some ways of identifying problems as well as some consequences of data quality problems. The core of the paper addresses the second question by analyzing the incentives and capabilities facing four key actors in a data supply chain: respondents, data collection agencies (including state bureaucracies and private organizations), international organizations, and finally, academic scholars. We conclude by making some suggestions for improving the use and construction of data sets.

"It is a capital mistake, Watson, to theorise before you have all the evidence. It biases the judgment." (Sherlock Holmes in "A Study in Scarlet")

"Statistics make officials, and officials make statistics." (Chinese proverb)

