On the Scale Invariance in State of the Art CNNs Trained on ImageNet

2021 ◽  
Vol 3 (2) ◽  
pp. 374-391
Author(s):  
Mara Graziani ◽  
Thomas Lompech ◽  
Henning Müller ◽  
Adrien Depeursinge ◽  
Vincent Andrearczyk

The widespread practice of pre-training Convolutional Neural Networks (CNNs) on large natural image datasets such as ImageNet causes the automatic learning of invariance to object scale variations. This, however, can be detrimental in medical imaging, where pixel spacing has a known physical correspondence and size is crucial to the diagnosis, for example, the size of lesions, tumors or cell nuclei. In this paper, we use deep learning interpretability to identify at what intermediate layers such invariance is learned. We train and evaluate different regression models on the PASCAL-VOC (Pattern Analysis, Statistical modeling and ComputAtional Learning-Visual Object Classes) annotated data to (i) separate the effects of the closely related yet different notions of image size and object scale, (ii) quantify the presence of scale information in the CNN in terms of the layer-wise correlation between input scale and feature maps in InceptionV3 and ResNet50, and (iii) develop a pruning strategy that reduces the invariance to object scale of the learned features. Results indicate that scale information peaks at central CNN layers and drops close to the softmax, where the invariance is reached. Our pruning strategy exploits this finding to obtain features that preserve scale information. We show that the pruning significantly improves performance on medical tasks where scale is a relevant factor, for example the regression of breast histology image magnification. These results show that the presence of scale information at intermediate layers legitimizes transfer learning in applications that require scale covariance rather than invariance, and that the performance on these tasks can be improved by pruning off the layers where the invariance is learned. All experiments are performed on publicly available data and the code is available on GitHub.
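As an illustration of the layer-wise probing described above, the following is a minimal sketch (not the authors' released code; the choice of ResNet50's layer3, the pooling, and the ridge regression are assumptions) of how scale information at an intermediate layer can be quantified by regressing annotated object scale from pooled feature maps:

```python
# Hedged sketch: probe how much object-scale information survives at an
# intermediate ResNet50 layer by regressing annotated scale from pooled
# feature maps. Layer choice and regression model are illustrative only.
import torch
import torchvision.models as models
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
features = {}

def capture(name):
    def _hook(module, inputs, output):
        # Global-average-pool the feature map to a fixed-length descriptor.
        features[name] = output.mean(dim=(2, 3)).detach()
    return _hook

model.layer3.register_forward_hook(capture("layer3"))  # a central layer

def pooled_descriptors(images):
    with torch.no_grad():
        model(images)
    return features["layer3"].cpu().numpy()

# With PASCAL-VOC crops and their annotated object scales (hypothetical
# arrays images_train/scales_train and images_test/scales_test):
# reg = Ridge().fit(pooled_descriptors(images_train), scales_train)
# print("scale R^2 at layer3:",
#       r2_score(scales_test, reg.predict(pooled_descriptors(images_test))))
```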

2014 ◽  
Vol 1044-1045 ◽  
pp. 1049-1052 ◽  
Author(s):  
Chin Chen Chang ◽  
I Ta Lee ◽  
Tsung Ta Ke ◽  
Wen Kai Tai

Common methods for reducing image size include scaling and cropping. However, both approaches can degrade the quality of the reduced images. In this paper, we propose an image-reduction algorithm that separates the main objects from the background. First, we extract two feature maps, namely an enhanced visual saliency map and an improved gradient map, from the input image. We then integrate these two feature maps into an importance map. Finally, we generate the target image using the importance map. The proposed approach obtains the desired results for a wide range of images.
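A minimal sketch of the fusion step described above, assuming a grayscale input, Sobel gradients standing in for the improved gradient map, and a precomputed saliency map:

```python
# Hedged sketch: blend a visual-saliency map with a gradient-magnitude map
# into a single importance map. The weighting scheme is an assumption.
import numpy as np
from scipy import ndimage

def importance_map(image, saliency, alpha=0.5):
    gx = ndimage.sobel(image.astype(float), axis=1)
    gy = ndimage.sobel(image.astype(float), axis=0)
    grad = np.hypot(gx, gy)
    grad /= grad.max() + 1e-8                 # normalise to [0, 1]
    sal = saliency / (saliency.max() + 1e-8)
    return alpha * sal + (1.0 - alpha) * grad
```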


Electronics ◽  
2020 ◽  
Vol 9 (8) ◽  
pp. 1235
Author(s):  
Yang Yang ◽  
Hongmin Deng

In order to make the classification and regression of single-stage detectors more accurate, this paper proposes an object detection algorithm named Global Context You-Only-Look-Once v3 (GC-YOLOv3), based on You-Only-Look-Once (YOLO). Firstly, a better cascading model with learnable semantic fusion between the feature extraction network and the feature pyramid network is designed to improve detection accuracy using a global context block. Secondly, the information to be retained is screened by combining three feature maps of different scales. Finally, a global self-attention mechanism is used to highlight the useful information in the feature maps while suppressing irrelevant information. Experiments show that GC-YOLOv3 reaches a maximum mean Average Precision (mAP)@0.5 of 55.5 for object detection on Common Objects in Context (COCO) 2017 test-dev and that its mAP is 5.1% higher than that of the YOLOv3 algorithm on the Pascal Visual Object Classes (PASCAL VOC) 2007 test set. The experiments therefore indicate that the proposed GC-YOLOv3 model exhibits optimal performance on the PASCAL VOC and COCO datasets.
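The global context block mentioned above is not specified in detail here; the following is a hedged PyTorch sketch of a typical block of this kind (the reduction ratio and normalization are assumptions, not the authors' exact configuration):

```python
# Hedged sketch of a global context block: softmax attention pooling over all
# spatial positions, a bottleneck channel transform, and a residual addition.
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)   # per-pixel weight
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.LayerNorm([channels // reduction, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        weights = self.attn(x).view(b, 1, h * w).softmax(dim=-1)
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2))
        return x + self.transform(context.view(b, c, 1, 1))  # broadcast over HxW

# x = torch.randn(2, 256, 13, 13); GlobalContextBlock(256)(x).shape == x.shape
```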


Author(s):  
Yingying Zhang ◽  
Qiaoyong Zhong ◽  
Liang Ma ◽  
Di Xie ◽  
Shiliang Pu

Person re-identification (ReID) aims to match people across multiple non-overlapping video cameras deployed at different locations. To address this challenging problem, many metric learning approaches have been proposed, among which the triplet loss is one of the state-of-the-art choices. In this work, we explore the margin between positive and negative pairs of triplets and show that a large margin is beneficial. In particular, we propose a novel multi-stage training strategy that learns an incremental triplet margin and improves the triplet loss effectively. Multiple levels of feature maps are exploited to make the learned features more discriminative. In addition, we introduce a global hard identity searching method to sample hard identities when generating a training batch. Extensive experiments on Market-1501, CUHK03, and DukeMTMC-reID show that our approach yields a performance boost and outperforms most existing state-of-the-art methods.
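A minimal sketch of the incremental-margin idea (the stage boundaries and margin values below are illustrative, not the authors' schedule):

```python
# Hedged sketch: enlarge the triplet margin stage by stage during training,
# using PyTorch's built-in triplet loss.
import torch.nn as nn

margins = [0.3, 0.5, 0.7]          # assumed per-stage margins
epochs_per_stage = 40              # assumed stage length

def margin_for_epoch(epoch):
    stage = min(epoch // epochs_per_stage, len(margins) - 1)
    return margins[stage]

def triplet_step(model, anchors, positives, negatives, epoch):
    criterion = nn.TripletMarginLoss(margin=margin_for_epoch(epoch))
    return criterion(model(anchors), model(positives), model(negatives))
```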


2021 ◽  
Vol 14 (1) ◽  
pp. 102
Author(s):  
Xin Li ◽  
Tao Li ◽  
Ziqi Chen ◽  
Kaiwen Zhang ◽  
Runliang Xia

Semantic segmentation is a fundamental task in interpreting remote sensing imagery (RSI) for various downstream applications. Due to high intra-class variance and inter-class similarity, inflexibly transferring networks designed for natural images to RSI is inadvisable. To enhance the distinguishability of the learnt representations, attention modules have been developed and applied to RSI, with satisfactory improvements. However, these designs capture contextual information by treating all pixels equally, regardless of whether they lie near edges. Blurry boundaries are therefore generated, raising high uncertainty when classifying the many adjacent pixels. We hereby propose an edge distribution attention module (EDA) to highlight the edge distributions of learnt feature maps in a self-attentive fashion. In this module, we first formulate and model column-wise and row-wise edge attention maps based on covariance matrix analysis. Furthermore, a hybrid attention module (HAM) that emphasizes both edge distributions and position-wise dependencies is devised by combining EDA with a non-local block. Consequently, a conceptually end-to-end neural network, termed EDENet, is proposed to integrate HAM hierarchically for the detailed strengthening of multi-level representations. EDENet implicitly learns representative and discriminative features, providing available and reasonable cues for dense prediction. Experimental results on the ISPRS Vaihingen, Potsdam and DeepGlobe datasets show its efficacy and superiority over state-of-the-art methods in terms of overall accuracy (OA) and mean intersection over union (mIoU). In addition, an ablation study further validates the effects of EDA.
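The exact formulation of EDA is not reproduced here; as a loose, heavily hedged interpretation only, a column-wise attention map derived from covariance statistics could look like the following (this is an illustration, not the paper's module):

```python
# Hedged sketch: build a width-by-width attention map from the covariance of
# column descriptors and apply it along the width dimension.
import torch
import torch.nn as nn

class ColumnCovarianceAttention(nn.Module):
    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        cols = x.mean(dim=2)                     # (B, C, W) column descriptors
        cols = cols - cols.mean(dim=-1, keepdim=True)
        cov = torch.bmm(cols.transpose(1, 2), cols) / (c - 1)   # (B, W, W)
        attn = cov.softmax(dim=-1)
        out = torch.einsum("bchw,bwv->bchv", x, attn)
        return x + out                           # residual connection
```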


Filomat ◽  
2020 ◽  
Vol 34 (15) ◽  
pp. 5139-5148
Author(s):  
Yan Zhou ◽  
Hongwei Guo ◽  
Dongli Wang ◽  
Chunjiang Liao

The efficient convolution operator (ECO) tracker has demonstrated outstanding results in visual object tracking. However, in the pursuit of performance improvements, the computational burden of the tracker has become heavy, and the importance of different feature layers is not considered. In this paper, we propose a self-adaptive mechanism for regulating the training process in the first frame. To overcome over-fitting during tracking, we adopt a fuzzy model update strategy. Moreover, we weight the different feature maps to enhance tracker performance. Comprehensive experiments have been conducted on the OTB-2013 dataset. When our ideas are used to adjust the tracker, the self-adaptive mechanism avoids unnecessary training iterations, and the fuzzy update strategy reduces the tracking computation by one fifth compared to ECO. With this reduced computation, the tracker based on our ideas incurs less than a 1% loss in AUC (area under curve).
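A small sketch of the feature-map weighting idea (the layers and weights are assumptions):

```python
# Hedged sketch: fuse correlation responses from several feature layers with
# per-layer weights before locating the target.
import numpy as np

def fuse_responses(responses, weights):
    """responses: list of HxW correlation maps; weights: per-layer scalars."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * r for w, r in zip(weights, responses))

# fused = fuse_responses([resp_shallow, resp_deep], weights=[0.4, 0.6])
```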


Author(s):  
Dan Liu ◽  
Dawei Du ◽  
Libo Zhang ◽  
Tiejian Luo ◽  
Yanjun Wu ◽  
...  

Existing hand detection methods usually follow a pipeline of multiple stages with high computational cost, i.e., feature extraction, region proposal, bounding box regression, and additional layers for rotated region detection. In this paper, we propose a new Scale Invariant Fully Convolutional Network (SIFCN), trained in an end-to-end fashion, to detect hands efficiently. Specifically, we merge the feature maps from high to low layers in an iterative way, which handles different scales of hands better with less time overhead than simply concatenating them. Moreover, we develop the Complementary Weighted Fusion (CWF) block to make full use of the distinctive features among multiple layers to achieve scale invariance. To deal with rotated hand detection, we present the rotation map to get rid of complex rotation and derotation layers. Besides, we design a multi-scale loss scheme to accelerate the training process significantly by adding supervision to the intermediate layers of the network. Compared with the state-of-the-art methods, our algorithm shows comparable accuracy and runs 4.23 times faster on the VIVA dataset, and achieves better average precision on the Oxford hand detection dataset at a speed of 62.5 fps.
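A hedged PyTorch sketch of merging feature maps iteratively from high (deep) to low (shallow) layers by upsampling and adding, as opposed to simple concatenation (channel sizes and the 1x1 lateral convolutions are assumptions):

```python
# Hedged sketch: iterative top-down fusion of multi-level feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, feats):                  # feats ordered shallow -> deep
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        merged = laterals[-1]
        outputs = [merged]
        for feat in reversed(laterals[:-1]):   # walk from deep to shallow
            merged = feat + F.interpolate(merged, size=feat.shape[-2:],
                                          mode="nearest")
            outputs.append(merged)
        return outputs[::-1]                   # back to shallow -> deep order
```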


Sensors ◽  
2021 ◽  
Vol 21 (9) ◽  
pp. 3265
Author(s):  
Shuyu Wang ◽  
Mingxin Zhao ◽  
Runjiang Dou ◽  
Shuangming Yu ◽  
Liyuan Liu ◽  
...  

Image demosaicking is an essential and challenging problem and one of the most crucial steps of the image processing pipeline behind image sensors. Due to the rapid development of intelligent processors based on deep learning, several demosaicking methods based on convolutional neural networks (CNNs) have been proposed. However, with their large numbers of model parameters, it is difficult for these networks to run in real time on edge computing devices. This paper presents a compact demosaicking neural network based on the UNet++ structure. The network inserts densely connected layer blocks and adopts Gaussian smoothing layers instead of down-sampling operations before the backbone network. The densely connected blocks extract mosaic image features efficiently by utilizing the correlation between feature maps. Furthermore, these blocks adopt depthwise separable convolutions to reduce the model parameters, while the Gaussian smoothing layer expands the receptive field without down-sampling the image or discarding image information. The size constraints on the input and output images can also be relaxed, and the quality of the demosaicked images is improved. Experimental results show that the proposed network improves the running speed by 42% compared with the fastest CNN-based method and achieves reconstruction quality comparable to it on four mainstream datasets. Moreover, when inference is carried out on the demosaicked images with typical deep CNNs, MobileNet v1 and SSD, the accuracy reaches 85.83% (top-5) and 75.44% (mAP) respectively, comparable to existing methods. The proposed network has the highest computing efficiency and the lowest parameter count among all methods, demonstrating that it is well suited to applications on modern edge computing devices.
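Two of the building blocks described above, sketched under assumptions (kernel sizes and sigma are illustrative): a depthwise separable convolution that cuts parameters, and a fixed Gaussian smoothing layer that enlarges the receptive field without reducing spatial resolution:

```python
# Hedged sketch of a depthwise separable convolution and a Gaussian smoothing
# layer that keeps the spatial size (smoothing instead of down-sampling).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class GaussianSmoothing(nn.Module):
    def __init__(self, channels, kernel_size=5, sigma=1.0):
        super().__init__()
        ax = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2.0
        g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
        kernel = torch.outer(g, g)
        kernel /= kernel.sum()
        kernel = kernel.view(1, 1, kernel_size, kernel_size).repeat(channels, 1, 1, 1)
        self.register_buffer("kernel", kernel)
        self.groups = channels
        self.padding = kernel_size // 2

    def forward(self, x):
        # Same-size output: the map is blurred, not down-sampled.
        return nn.functional.conv2d(x, self.kernel, padding=self.padding,
                                    groups=self.groups)
```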


Author(s):  
K. Chen ◽  
M. Weinmann ◽  
X. Sun ◽  
M. Yan ◽  
S. Hinz ◽  
...  

In this paper, we address the semantic segmentation of aerial imagery based on the use of multi-modal data given in the form of true orthophotos and the corresponding Digital Surface Models (DSMs). We present the Deeply-supervised Shuffling Convolutional Neural Network (DSCNN), a multi-scale extension of the Shuffling Convolutional Neural Network (SCNN) with deep supervision. We thereby take advantage of the SCNN, which involves the shuffling operator to effectively upsample feature maps, and then fuse multi-scale features derived from the intermediate layers of the SCNN, resulting in the Multi-scale Shuffling Convolutional Neural Network (MSCNN). Based on the MSCNN, we derive the DSCNN by introducing additional losses into the intermediate layers of the MSCNN. In addition, we investigate the impact of using different sets of hand-crafted radiometric and geometric features derived from the true orthophotos and the DSMs on the semantic segmentation task. For performance evaluation, we use a commonly used benchmark dataset. The achieved results reveal that both multi-scale fusion and deep supervision contribute to an improvement in performance. Furthermore, the use of a diversity of hand-crafted radiometric and geometric features as input for the DSCNN does not provide the best numerical results, but smoother and improved detections for several objects.
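The shuffling operator used for upsampling corresponds to a pixel-shuffle rearrangement; a brief sketch follows (channel counts and the upscale factor are assumptions):

```python
# Hedged sketch: a convolution predicts r*r times as many channels, which
# PixelShuffle rearranges into an r-times larger feature map.
import torch
import torch.nn as nn

class ShufflingUpsample(nn.Module):
    def __init__(self, in_ch, out_ch, upscale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * upscale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# x = torch.randn(1, 64, 32, 32)
# ShufflingUpsample(64, 16)(x).shape -> torch.Size([1, 16, 64, 64])
```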


2018 ◽  
Vol 10 (10) ◽  
pp. 1636 ◽  
Author(s):  
Diogo Duarte ◽  
Francesco Nex ◽  
Norman Kerle ◽  
George Vosselman

Remote sensing images have long been preferred for performing building damage assessments. Recently proposed methods to extract damaged regions from remote sensing imagery rely on convolutional neural networks (CNNs). The common approach is to train a CNN independently for each of the different resolution levels (satellite, aerial, and terrestrial) in a binary classification setting. In this regard, an ever-growing amount of multi-resolution imagery is being collected, but current approaches use a single resolution as their input. The use of up/down-sampled images for training has been reported as beneficial for image classification accuracy in both the computer vision and remote sensing domains. However, it is still unclear whether such multi-resolution information can also be captured from images with different spatial resolutions, such as imagery at the satellite and airborne (from both manned and unmanned platforms) resolutions. In this paper, three multi-resolution CNN feature fusion approaches are proposed and tested against two baseline (mono-resolution) methods for the image classification of building damage. Overall, the results show better accuracy and localization capabilities when fusing multi-resolution feature maps, specifically when these feature maps are merged and include feature information from the intermediate layers of each resolution-level network. Nonetheless, these multi-resolution feature fusion approaches behaved differently at each resolution level. In the satellite and aerial (unmanned) cases, the improvements in accuracy reached 2%, while the accuracy improvements for the airborne (manned) case were marginal. The results were further confirmed by testing the approach for geographical transferability, in which the improvements between the baseline and multi-resolution experiments were overall maintained.
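An illustrative sketch (the backbones, the fused layer, and channel sizes are assumptions, and both branches are assumed to yield feature maps of the same spatial size) of fusing intermediate feature maps from two resolution-specific branches before a shared classification head:

```python
# Hedged sketch: concatenate intermediate feature maps from two
# resolution-specific branches and classify damage from the fused features.
import torch
import torch.nn as nn

class MultiResolutionFusion(nn.Module):
    def __init__(self, branch_a, branch_b, feat_ch, num_classes=2):
        super().__init__()
        self.branch_a = branch_a          # e.g. a satellite-resolution branch
        self.branch_b = branch_b          # e.g. an aerial-resolution branch
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(feat_ch, num_classes),
        )

    def forward(self, x_a, x_b):
        # Both branches must produce feature maps of matching spatial size.
        fused = torch.cat([self.branch_a(x_a), self.branch_b(x_b)], dim=1)
        return self.head(fused)
```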


Electronics ◽  
2019 ◽  
Vol 8 (10) ◽  
pp. 1084 ◽  
Author(s):  
Dong-Hyun Lee

The visual object tracking problem seeks to track an arbitrary object in a video, and many deep convolutional neural network-based algorithms have achieved significant performance improvements in recent years. However, most of them do not guarantee real-time operation due to the large computation overhead for deep feature extraction. This paper presents a single-crop visual object tracking algorithm based on a fully convolutional Siamese network (SiamFC). The proposed algorithm significantly reduces the computation burden by extracting multiple scale feature maps from a single image crop. Experimental results show that the proposed algorithm demonstrates superior speed performance in comparison with that of SiamFC.
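A rough, heavily hedged sketch of the single-crop idea (the scales, feature shapes and the correlation step are assumptions, not the paper's implementation): the search image is embedded once, and multi-scale responses are obtained by resizing that single feature map rather than re-running the backbone on several crops:

```python
# Hedged sketch: one forward pass on the search crop, multi-scale responses by
# rescaling its feature map and cross-correlating with the exemplar features.
import torch
import torch.nn.functional as F

def multi_scale_responses(embed, exemplar_feat, search_img,
                          scales=(0.96, 1.0, 1.04)):
    # exemplar_feat: (1, C, h, w) template features used as correlation kernel.
    search_feat = embed(search_img)                  # single backbone pass
    responses = []
    for s in scales:
        size = [max(1, int(round(d * s))) for d in search_feat.shape[-2:]]
        scaled = F.interpolate(search_feat, size=size, mode="bilinear",
                               align_corners=False)
        responses.append(F.conv2d(scaled, exemplar_feat))
    return responses
```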

