Do We Train on Test Data? Purging CIFAR of Near-Duplicates

2020 ◽ Vol 6 (6) ◽ pp. 41
Author(s): Björn Barz, Joachim Denzler

The CIFAR-10 and CIFAR-100 datasets are two of the most heavily benchmarked datasets in computer vision and are often used to evaluate novel methods and model architectures in the field of deep learning. However, we find that 3.3% and 10% of the images from the test sets of these datasets have duplicates in the training set. These duplicates are easily recognizable by memorization and may, hence, bias the comparison of image recognition techniques regarding their generalization capability. To eliminate this bias, we provide the “fair CIFAR” (ciFAIR) dataset, where we replaced all duplicates in the test sets with new images sampled from the same domain. The training set remains unchanged, in order not to invalidate pre-trained models. We then re-evaluate the classification performance of various popular state-of-the-art CNN architectures on these new test sets to investigate whether recent research has overfitted to memorizing data instead of learning abstract concepts. We find a significant drop in classification accuracy of between 9% and 14% relative to the original performance on the duplicate-free test set. We make both the ciFAIR dataset and pre-trained models publicly available and furthermore maintain a leaderboard for tracking the state of the art.
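
As a rough illustration of the underlying issue (not the authors' actual de-duplication pipeline), the sketch below flags test images whose nearest training-set neighbour lies unusually close in some feature space and reports the accuracy gap between the full and duplicate-free test sets; the feature arrays, classifier outcomes, and distance threshold are all synthetic placeholders.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Placeholder embeddings; the real CIFAR splits have 50,000 train / 10,000 test images.
train_feats = rng.normal(size=(5_000, 64))    # feature vectors of training images
test_feats = rng.normal(size=(1_000, 64))     # feature vectors of test images

nn = NearestNeighbors(n_neighbors=1).fit(train_feats)
dist, _ = nn.kneighbors(test_feats)           # distance of each test image to its closest training image
is_duplicate = dist[:, 0] < 1.0               # threshold; in practice chosen with manual inspection

correct = rng.random(len(test_feats)) < 0.94  # stand-in for per-image correctness of some classifier
acc_all = correct.mean()
acc_clean = correct[~is_duplicate].mean()     # accuracy restricted to the duplicate-free subset
print(f"full test set accuracy: {acc_all:.4f}")
print(f"duplicate-free subset:  {acc_clean:.4f}")
print(f"relative drop:          {(acc_all - acc_clean) / acc_all:.2%}")
```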

2019 ◽ Vol 11 (16) ◽ pp. 1933
Author(s): Yangyang Li, Ruoting Xing, Licheng Jiao, Yanqiao Chen, Yingte Chai, ...

Polarimetric synthetic aperture radar (PolSAR) image classification is a recent technology with great practical value in the field of remote sensing. However, because data collection is time-consuming and labor-intensive, few labeled datasets are available. Furthermore, most available state-of-the-art classification methods suffer heavily from speckle noise. To solve these problems, this paper proposes a novel semi-supervised algorithm based on self-training and superpixels. First, the Pauli-RGB image is over-segmented into superpixels to obtain a large number of homogeneous areas. Then, features that can mitigate the effects of speckle noise are obtained by spatial weighting within each superpixel. Next, the training set is expanded iteratively using a semi-supervised unlabeled-sample selection strategy that carefully exploits the spatial relations provided by superpixels. In addition, a stacked sparse auto-encoder is self-trained on the expanded training set to obtain classification results. Experiments on two typical PolSAR datasets verified the method's ability to suppress speckle noise and showed excellent classification performance with limited labeled data.
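
A minimal sketch of the two core ideas, superpixel-level spatial averaging of features and a confidence-thresholded self-training loop, is given below. It is not the paper's implementation: a logistic-regression classifier stands in for the stacked sparse auto-encoder, and all arrays are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_pixels, n_feats, n_classes = 5_000, 9, 4
feats = rng.normal(size=(n_pixels, n_feats))          # per-pixel polarimetric features (placeholder)
superpixel_id = rng.integers(0, 400, size=n_pixels)   # superpixel membership of each pixel

# Spatial weighting within a superpixel, here simplified to a plain mean over member pixels.
for sp in np.unique(superpixel_id):
    member = superpixel_id == sp
    feats[member] = feats[member].mean(axis=0)

labels = np.full(n_pixels, -1)                        # -1 marks unlabeled pixels
seed = rng.choice(n_pixels, size=100, replace=False)
labels[seed] = rng.integers(0, n_classes, size=100)   # very few labeled samples, as in the paper

for _ in range(5):                                    # iterative expansion of the training set
    known = labels != -1
    clf = LogisticRegression(max_iter=500).fit(feats[known], labels[known])
    proba = clf.predict_proba(feats[~known])
    confident = proba.max(axis=1) > 0.9               # keep only high-confidence unlabeled pixels
    idx = np.flatnonzero(~known)[confident]
    labels[idx] = clf.classes_[proba[confident].argmax(axis=1)]

print("labeled pixels after self-training:", int((labels != -1).sum()))
```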


2021 ◽ Vol 11 (5) ◽ pp. 2039
Author(s): Hyunseok Shin, Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Traditionally, random sampling is used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how the training and test sets are divided. Therefore, in this study, we proposed an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate train/test splits using the R-value-based sampling method. We then evaluated how similar the distribution of each candidate is to that of the whole dataset, and the candidate with the smallest distribution difference was selected as the final train/test split. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more appropriate training and test sets than previous sampling methods, including random and non-random sampling.
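
The selection idea can be sketched as follows, with plain random splitting standing in for the R-value-based candidate generation and a simple per-feature histogram distance as the similarity criterion; the data, bin count, and number of candidates are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1_000, 5))                  # placeholder dataset

def histogram_distance(subset, whole, bins=20):
    """Sum of per-feature L1 distances between normalised histograms."""
    total = 0.0
    for j in range(whole.shape[1]):
        lo, hi = whole[:, j].min(), whole[:, j].max()
        h_sub, _ = np.histogram(subset[:, j], bins=bins, range=(lo, hi), density=True)
        h_all, _ = np.histogram(whole[:, j], bins=bins, range=(lo, hi), density=True)
        total += np.abs(h_sub - h_all).sum()
    return total

best_split, best_score = None, np.inf
for seed in range(50):                           # candidate train/test splits
    tr, te = train_test_split(X, test_size=0.3, random_state=seed)
    score = histogram_distance(te, X)            # similarity of the test split to the whole dataset
    if score < best_score:
        best_split, best_score = (tr, te), score

print("best histogram distance:", round(best_score, 3))
```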


2021 ◽ Vol 13 (10) ◽ pp. 1950
Author(s): Cuiping Shi, Xin Zhao, Liguo Wang

In recent years, with the rapid development of computer vision, increasing attention has been paid to remote sensing image scene classification. To improve classification performance, many studies have increased the depth of convolutional neural networks (CNNs) and expanded the width of the network to extract deeper features, thereby increasing the complexity of the model. To solve this problem, in this paper we propose a lightweight convolutional neural network based on attention-oriented multi-branch feature fusion (AMB-CNN) for remote sensing image scene classification. First, we propose two convolution combination modules for feature extraction, through which deep image features can be fully extracted by the cooperation of multiple convolutions. Then, feature weights are calculated, and the extracted deep features are passed to an attention mechanism for further feature extraction. Next, all of the extracted features are fused by multiple branches. Finally, depthwise separable convolution and asymmetric convolution are employed to greatly reduce the number of parameters. The experimental results show that, compared with some state-of-the-art methods, the proposed method still has a great advantage in classification accuracy while using very few parameters.
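
The parameter saving from the last step can be illustrated with the following sketch (assuming PyTorch), which compares a standard 3x3 convolution with depthwise-separable and asymmetric factorisations; the channel sizes are arbitrary and the layers are not taken from the AMB-CNN definition.

```python
import torch.nn as nn

c_in, c_out = 128, 128
standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
separable = nn.Sequential(                                  # depthwise followed by pointwise
    nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),
    nn.Conv2d(c_in, c_out, kernel_size=1),
)
asymmetric = nn.Sequential(                                 # 3x3 factorised into 1x3 and 3x1
    nn.Conv2d(c_in, c_out, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(c_out, c_out, kernel_size=(3, 1), padding=(1, 0)),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print("standard 3x3:        ", count(standard))
print("depthwise separable: ", count(separable))
print("asymmetric 1x3 + 3x1:", count(asymmetric))
```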


Author(s): André Maletzke, Waqar Hassan, Denis dos Reis, Gustavo Batista

Quantification is a task similar to classification in the sense that it learns from a labeled training set. However, quantification is not concerned with predicting the class of each observation, but rather with measuring the class distribution in the test set. The community has developed performance measures and experimental setups tailored to quantification tasks. Nonetheless, we argue that a critical variable, the size of the test sets, remains ignored. Such disregard has three main detrimental effects. First, it implicitly assumes that quantifiers will perform equally well for different test set sizes. Second, it increases the risk of cherry-picking by selecting a test set size for which a particular proposal performs best. Finally, it disregards the importance of designing methods that are suitable for different test set sizes. We discuss these issues with the support of one of the broadest experimental evaluations ever performed, with three main outcomes. (i) We empirically demonstrate the importance of the test set size when assessing quantifiers. (ii) We show that current quantifiers generally have mediocre performance on the smallest test sets. (iii) We propose a metalearning scheme that selects the best quantifier based on the test set size and show that it can outperform the best single quantification method.
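
The evaluation issue can be illustrated with the sketch below, where a plain classify-and-count quantifier (a standard baseline, not any specific method from the paper) is scored by absolute prevalence error on repeatedly drawn test sets of different sizes; the data and classifier are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(20_000, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=20_000) > 0).astype(int)
X_tr, y_tr, X_pool, y_pool = X[:5_000], y[:5_000], X[5_000:], y[5_000:]

clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

for size in (10, 50, 100, 1_000, 5_000):              # the variable the paper argues is ignored
    errs = []
    for _ in range(200):                               # repeatedly drawn test sets of this size
        idx = rng.choice(len(X_pool), size=size, replace=False)
        true_prev = y_pool[idx].mean()                 # true class prevalence in this test set
        est_prev = clf.predict(X_pool[idx]).mean()     # classify-and-count estimate
        errs.append(abs(est_prev - true_prev))
    print(f"test size {size:>5}: mean absolute error {np.mean(errs):.3f}")
```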


Sensors ◽ 2020 ◽ Vol 20 (17) ◽ pp. 4723
Author(s): Patrícia Bota, Chen Wang, Ana Fred, Hugo Silva

Emotion recognition based on physiological data classification has been a topic of growing interest for more than a decade. However, there is a lack of systematic analysis in the literature regarding the selection of classifiers, sensor modalities, features, and the range of expected accuracy, to name a few limitations. In this work, we evaluate emotion in terms of low/high arousal and valence classification through Supervised Learning (SL), Decision Fusion (DF) and Feature Fusion (FF) techniques using multimodal physiological data, namely Electrocardiography (ECG), Electrodermal Activity (EDA), Respiration (RESP), and Blood Volume Pulse (BVP). The main contribution of our work is a systematic study across five public datasets commonly used in the Emotion Recognition (ER) state of the art, namely: (1) classification performance analysis of ER benchmarking datasets in the arousal/valence space; (2) summarising the ranges of classification accuracy reported in the existing literature; (3) characterising the results for diverse classifiers, sensor modalities and feature set combinations for ER using accuracy and F1-score; (4) exploration of an extended feature set for each modality; (5) systematic analysis of multimodal classification in DF and FF approaches. The experimental results showed that FF is the most competitive technique in terms of classification accuracy and computational complexity. We obtain results superior or comparable to those reported in the state of the art for the selected datasets.
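
A minimal sketch of the two fusion strategies is given below: Feature Fusion concatenates the per-modality feature vectors before a single classifier, while Decision Fusion averages the class probabilities of per-modality classifiers. The synthetic features merely stand in for ECG/EDA/RESP/BVP descriptors, and the random-forest classifiers are illustrative choices, not those used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 2_000
y = rng.integers(0, 2, size=n)                          # low/high arousal (or valence) labels
modalities = {m: rng.normal(size=(n, 12)) + y[:, None] * 0.3
              for m in ("ECG", "EDA", "RESP", "BVP")}   # placeholder per-modality feature sets

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3, random_state=0)

# Feature Fusion: one classifier over the concatenated feature vector.
X_ff = np.hstack(list(modalities.values()))
ff = RandomForestClassifier(random_state=0).fit(X_ff[idx_tr], y[idx_tr])
acc_ff = ff.score(X_ff[idx_te], y[idx_te])

# Decision Fusion: average the class probabilities of per-modality classifiers.
probas = []
for X_m in modalities.values():
    clf = RandomForestClassifier(random_state=0).fit(X_m[idx_tr], y[idx_tr])
    probas.append(clf.predict_proba(X_m[idx_te]))
acc_df = (np.mean(probas, axis=0).argmax(axis=1) == y[idx_te]).mean()

print(f"feature fusion accuracy:  {acc_ff:.3f}")
print(f"decision fusion accuracy: {acc_df:.3f}")
```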


2020 ◽ Vol 34 (05) ◽ pp. 9523-9530
Author(s): Junlang Zhan, Hai Zhao

Open Information Extraction (Open IE) is a challenging task, especially due to its brittle data basis. Most Open IE systems have to be trained on automatically built corpora and evaluated on inaccurate test sets. In this work, we first alleviate this difficulty from both the training and test sides. For the former, we propose an improved model design that exploits the training dataset more fully. For the latter, we present an accurately re-annotated benchmark test set (Re-OIE2016) based on a series of linguistic observations and analyses. Then, we introduce a span model in place of the previously adopted sequence labeling formulation for n-ary Open IE. Our newly introduced model achieves new state-of-the-art performance on both benchmark evaluation datasets.
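
To make the contrast with sequence labeling concrete, the sketch below enumerates candidate token spans, scores them, and greedily keeps non-overlapping high-scoring spans; the random scorer is only a placeholder for the paper's neural span representation.

```python
import numpy as np

tokens = "the quick brown fox jumped over the lazy dog".split()
rng = np.random.default_rng(5)

def enumerate_spans(n_tokens, max_len=4):
    """All token spans (i, j) with length up to max_len."""
    return [(i, j) for i in range(n_tokens)
            for j in range(i + 1, min(i + 1 + max_len, n_tokens + 1))]

spans = enumerate_spans(len(tokens))
scores = rng.random(len(spans))                 # placeholder span scores

selected, covered = [], set()
for k in np.argsort(-scores):                   # greedy selection of non-overlapping spans
    i, j = spans[k]
    if covered.isdisjoint(range(i, j)):
        selected.append((i, j))
        covered.update(range(i, j))

print([" ".join(tokens[i:j]) for i, j in sorted(selected)])
```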


Blood ◽ 2012 ◽ Vol 120 (21) ◽ pp. 197-197
Author(s): Ricky D Edmondson, Shweta S. Chavan, Christoph Heuck, Bart Barlogie

Abstract 197. We and others have used gene expression profiling (GEP) to classify multiple myeloma into high- and low-risk groups; here, we report the first combined GEP and proteomics study of a large number of baseline samples (n=85) of highly enriched tumor cells from patients with newly diagnosed myeloma. Peptide expression levels from MS data on CD138-selected plasma cells from a discovery set of 85 patients with newly diagnosed myeloma were used to identify proteins linked to short survival (OS < 3 years vs OS ≥ 3 years). The proteomics dataset consisted of intensity values for 11,006 peptides (representing 2,155 proteins), where intensity is the quantitative measure of peptide abundance; peptide intensities were normalized by Z-score transformation, and significance analysis of microarrays (SAM) was applied, resulting in the identification of 24 peptides as differentially expressed between the two groups (OS < 3 years vs OS ≥ 3 years), with fold change ≥1.5 and FDR <5%. The 24 peptides mapped to 19 unique proteins, and all were present at higher levels in the group with shorter overall survival than in the group with longer overall survival. An independent SAM analysis with parameters identical to the proteomics analysis (fold change ≥1.5; FDR <5%) was performed on the Affymetrix U133Plus2 microarray-based expression data. This analysis identified 151 probe sets that were differentially expressed between the two groups; 144 probe sets were present at higher levels and seven at lower levels in the group with shorter overall survival. Comparing the SAM analyses of proteomics and GEP data, we identified nine probe sets, corresponding to seven genes, with increased levels of both protein and mRNA in the short-lived group.

To validate these findings from the discovery experiment, we used GEP data from a randomized subset of the Total Therapy 3 (TT3) patient population as a training set for determining the optimal cut-points for each of the nine probe sets. The TT3 population was randomized into a training set (two-thirds of the population; n=294) and a test set (one-third of the population; n=147); the Total Therapy 2 (TT2) patient population was used as an additional test set (n=441). A running log-rank test was performed on the training set for each of the nine probe sets to determine its optimal gene expression cut-point. The cut-points derived from the training set were then applied to the TT3 and TT2 test sets to investigate survival differences between the groups separated by the optimal cut-point for each probe set. The overall survival of the groups was visualized using the method of Kaplan and Meier, and a P-value was calculated (based on the log-rank test) to determine whether there was a statistically significant difference in survival between the two groups (P ≤ 0.05).

We performed univariate regression analysis using a Cox proportional hazards model with the nine probe sets as variables on the TT3 test set. To identify which of the genes corresponding to these nine probe sets had independent prognostic value, we performed a multivariate stepwise Cox regression analysis, wherein CACYBP, FABP5, and IQGAP2 retained significance after competing with the remaining probe sets; CACYBP had the highest hazard ratio (HR 2.70, P = 0.01). We then performed the univariate and multivariate analyses on the TT2 test set, where CACYBP, CORO1A, ENO1, and STMN1 were selected by the multivariate analysis, and CACYBP again had the highest hazard ratio (HR 1.93, P = 0.004). CACYBP was the only gene selected by the multivariate analyses of both test sets. Disclosures: No relevant conflicts of interest to declare.
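
A minimal sketch of the running log-rank cut-point search (assuming the lifelines package) is shown below; the expression values, survival times, and censoring indicators are synthetic placeholders rather than patient data, and the procedure is simplified relative to the full analysis.

```python
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(6)
n = 294                                        # size of the training subset in the study
expr = rng.normal(size=n)                      # expression of one probe set (placeholder)
time = rng.exponential(scale=5 + 3 * (expr < 0), size=n)   # survival time in years (placeholder)
event = rng.random(n) < 0.7                    # True = death observed, False = censored

best_cut, best_p = None, 1.0
for cut in np.quantile(expr, np.linspace(0.1, 0.9, 33)):   # candidate expression cut-points
    lo, hi = expr <= cut, expr > cut
    res = logrank_test(time[lo], time[hi],
                       event_observed_A=event[lo], event_observed_B=event[hi])
    if res.p_value < best_p:
        best_cut, best_p = cut, res.p_value

print(f"optimal cut-point: {best_cut:.3f} (log-rank p = {best_p:.4f})")
```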


Author(s): Jiaxin Shi, Chen Liang, Lei Hou, Juanzi Li, Zhiyuan Liu, ...

We propose DeepChannel, a robust, data-efficient, and interpretable neural model for extractive document summarization. Given any document-summary pair, we estimate a salience score, modeled by an attention-based deep neural network, to represent how salient the summary is for yielding the document. We devise a contrastive training strategy to learn the salience estimation network, and then use the learned salience score as a guide to iteratively extract the most salient sentences from the document as the generated summary. In experiments, our model not only achieves state-of-the-art ROUGE scores on the CNN/Daily Mail dataset, but also shows strong robustness in the out-of-domain test on the DUC2007 test set. Moreover, our model reaches a ROUGE-1 F1 score of 39.41 on the CNN/Daily Mail test set with merely 1/100 of the training set, demonstrating tremendous data efficiency.
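
The iterative extraction step can be sketched as follows, with TF-IDF cosine similarity standing in for the attention-based salience network and a toy document in place of CNN/Daily Mail articles.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The model estimates a salience score for each candidate sentence.",
    "Training uses a contrastive strategy over document-summary pairs.",
    "The most salient sentences are extracted iteratively as the summary.",
    "Experiments report strong out-of-domain robustness on DUC2007.",
]
document = " ".join(sentences)

vec = TfidfVectorizer().fit(sentences + [document])
S, d = vec.transform(sentences), vec.transform([document])

remaining, summary = list(range(len(sentences))), []
for _ in range(2):                                        # extract a 2-sentence summary
    scores = cosine_similarity(S[remaining], d).ravel()   # stand-in salience scores
    best = remaining[int(scores.argmax())]
    summary.append(sentences[best])
    remaining.remove(best)

print(summary)
```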


2021 ◽ pp. 1-12
Author(s): Rafael Gallardo García, Beatriz Beltrán Martínez, Carlos Hernández Gracidas, Darnes Vilariño Ayala

Current state-of-the-art image captioning systems that can read text and integrate it into the generated descriptions require high processing power and memory, which limits the sustainability and usability of the models, as they need expensive and very specialized hardware. The present work introduces two alternative versions (L-M4C and L-CNMT) of top architectures on the TextCaps challenge, adapted to achieve near-state-of-the-art performance while being lighter in memory than the original architectures. This is mainly achieved by using distilled or smaller pre-trained models in the text-and-OCR embedding modules. On the one hand, a distilled version of BERT was used to reduce the size of the text-embedding module (the distilled model has 59% fewer parameters); on the other hand, the OCR context processor in both architectures was replaced by Global Vectors (GloVe) instead of FastText pre-trained vectors, which can reduce the memory used by the OCR-embedding module by up to 94%. Two of the three models presented in this work surpassed the baseline (M4C-Captioner) of the challenge on the evaluation and test sets. In addition, our best lighter architecture reached a CIDEr score of 88.24 on the test set, which is 7.25 points above the baseline model.
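
As a rough way to inspect the kind of saving that distillation brings (assuming the Hugging Face transformers package and the standard public checkpoints), the sketch below compares the parameter counts of BERT-base and DistilBERT; it does not reproduce the paper's 59% figure, which refers specifically to the text-embedding module inside the captioning architectures.

```python
from transformers import AutoModel

def n_params(name):
    """Total parameter count of a pre-trained checkpoint."""
    model = AutoModel.from_pretrained(name)
    return sum(p.numel() for p in model.parameters())

bert = n_params("bert-base-uncased")
distil = n_params("distilbert-base-uncased")
print(f"BERT-base parameters:  {bert / 1e6:.1f}M")
print(f"DistilBERT parameters: {distil / 1e6:.1f}M")
print(f"reduction: {(bert - distil) / bert:.1%}")
```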


2020 ◽ pp. 18-28
Author(s): Andrei Kliuev, Roman Klestov, Valerii Stolbov

The paper investigates the algorithmic stability of training a deep neural network for the recognition of material microstructures. It is shown that at an 8% quantitative deviation in the basic test set, the trained network loses stability. This means that with such a quantitative or qualitative deviation in the training or test sets, the results obtained with the trained network can hardly be trusted. Although the results of this study apply to a particular case, i.e. microstructure recognition using ResNet-152, the authors propose a cheaper method for studying stability based on analysing the test set rather than the training set.
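
The stability probe can be illustrated with the following sketch, where a small random-forest classifier on synthetic data stands in for ResNet-152 on microstructure images and increasing fractions of the test labels are perturbed; the 8% threshold reported in the paper is not expected to reproduce here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(3_000, 20))
y = (X[:, :3].sum(axis=1) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
baseline = clf.score(X_te, y_te)

for frac in (0.02, 0.04, 0.08, 0.16):            # fraction of the test set perturbed
    y_perturbed = y_te.copy()
    flip = rng.choice(len(y_te), size=int(frac * len(y_te)), replace=False)
    y_perturbed[flip] = 1 - y_perturbed[flip]    # label noise as one kind of deviation
    acc = clf.score(X_te, y_perturbed)
    print(f"{frac:.0%} of test labels perturbed: accuracy {acc:.3f} (baseline {baseline:.3f})")
```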

