Learning Visual Words for Weakly-Supervised Semantic Segmentation

Author(s):  
Lixiang Ru ◽  
Bo Du ◽  
Chen Wu

Current weakly-supervised semantic segmentation (WSSS) methods with image-level labels mainly adopt class activation maps (CAM) to generate the initial pseudo labels. However, CAM usually identifies only the most discriminative object regions, because the network does not need to discover the integral object to recognize the image-level labels. In this work, to tackle this problem, we propose to simultaneously learn image-level labels and local visual word labels. Specifically, in each forward propagation, the feature maps of the input image are encoded into visual words with a learnable codebook. By enforcing the network to classify the encoded fine-grained visual words, the generated CAM can cover more semantic regions. In addition, we propose a hybrid spatial pyramid pooling module that preserves the local maximum and global average values of the feature maps, so that more object details and less background are considered. Based on the proposed methods, we conducted experiments on the PASCAL VOC 2012 dataset. Our method achieved 67.2% mIoU on the val set and 67.3% mIoU on the test set, outperforming recent state-of-the-art methods.
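The hybrid pooling idea, keeping local maxima alongside a global average so that both salient details and overall context survive pooling, can be sketched with numpy. The function name, grid sizes, and cell layout below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def hybrid_spatial_pyramid_pooling(fmap, grid_sizes=(1, 2)):
    """Pool a C x H x W feature map: take the local maximum inside each
    cell of several pyramid grids, then append the global average, so
    both salient local peaks and overall context are preserved."""
    c, h, w = fmap.shape
    pooled = []
    for g in grid_sizes:
        for i in range(g):
            for j in range(g):
                cell = fmap[:, i * h // g:(i + 1) * h // g,
                               j * w // g:(j + 1) * w // g]
                pooled.append(cell.max(axis=(1, 2)))   # local maximum per cell
    pooled.append(fmap.mean(axis=(1, 2)))              # global average
    return np.concatenate(pooled)                      # length C * (sum g^2 + 1)

fmap = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
vec = hybrid_spatial_pyramid_pooling(fmap)
```

With grids of 1 and 2 the output stacks 5 max-pooled vectors plus the global-average vector, one block of C values each.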

2020 ◽  
Vol 2020 (10) ◽  
pp. 27-1-27-7
Author(s):  
Congcong Wang ◽  
Faouzi Alaya Cheikh ◽  
Azeddine Beghdadi ◽  
Ole Jakob Elle

The object sizes in images are diverse; therefore, capturing multi-scale context information is essential for semantic segmentation. Existing context aggregation methods such as the pyramid pooling module (PPM) and atrous spatial pyramid pooling (ASPP) employ different pooling sizes or atrous rates so that multi-scale information is captured. However, these pooling sizes and atrous rates are chosen empirically. Rethinking ASPP leads to our observation that learnable sampling locations of the convolution operation can endow the network with a learnable field-of-view, and thus the ability to capture object context information adaptively. Following this observation, in this paper we propose an adaptive context encoding (ACE) module based on the deformable convolution operation, in which the sampling locations of the convolution are learnable. Our ACE module can be easily embedded into other Convolutional Neural Networks (CNNs) for context aggregation. The effectiveness of the proposed module is demonstrated on the Pascal-Context and ADE20K datasets. Although our proposed ACE consists of only three deformable convolution blocks, it outperforms PPM and ASPP in terms of mean Intersection over Union (mIoU) on both datasets. All the experimental studies confirm that our proposed module is effective compared to the state-of-the-art methods.
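The core mechanism of deformable convolution, shifting each kernel tap by a learned fractional offset and reading the feature map with bilinear interpolation, can be sketched as follows. This is a single-location, single-channel toy, not the ACE module itself; the function names are ours:

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2-D feature map at a fractional location;
    interpolation is what makes the sampling offsets differentiable."""
    h, w = fmap.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    y0, x0 = max(y0, 0), max(x0, 0)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * fmap[y0, x0] + (1 - wy) * wx * fmap[y0, x1]
            + wy * (1 - wx) * fmap[y1, x0] + wy * wx * fmap[y1, x1])

def deformable_conv_at(fmap, weights, cy, cx, offsets):
    """3x3 deformable convolution response at one output location:
    each of the 9 taps is displaced by its own learned (dy, dx) offset,
    so the effective field-of-view adapts to the content."""
    base = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return sum(w * bilinear_sample(fmap, cy + dy + oy, cx + dx + ox)
               for w, (dy, dx), (oy, ox) in zip(weights, base, offsets))
```

With all offsets at zero this reduces to an ordinary 3x3 convolution, which is a useful sanity check.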


Author(s):  
Xiawu Zheng ◽  
Rongrong Ji ◽  
Xiaoshuai Sun ◽  
Yongjian Wu ◽  
Feiyue Huang ◽  
...  

Fine-grained object retrieval has attracted extensive research focus recently. Its state-of-the-art schemes are typically based upon convolutional neural network (CNN) features. Despite the extensive progress, two issues remain open. On one hand, deep features are coarsely extracted at the image level rather than precisely at the object level, and are therefore corrupted by background clutter. On the other hand, training CNN features with a standard triplet loss is time consuming and incapable of learning discriminative features. In this paper, we present a novel fine-grained object retrieval scheme that conquers these issues in a unified framework. Firstly, we introduce a novel centralized ranking loss (CRL), which achieves very efficient (a 1,000× training speedup compared to the triplet loss) and discriminative feature learning by a “centralized” global pooling. Secondly, a weakly supervised attractive feature extraction is proposed, which segments object contours with top-down saliency. Consequently, the contours are integrated into the CNN response map to precisely extract features “within” the target object. Interestingly, we have discovered that the combination of CRL and weakly supervised learning can reinforce each other. We evaluate the performance of the proposed scheme on widely used benchmarks including CUB200-2011 and CARS196, and report significant gains over the state-of-the-art schemes, e.g., 5.4% over SCDA [Wei et al., 2017] on CARS196 and 3.7% on CUB200-2011.
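The efficiency argument behind a centralized ranking loss is that each feature is ranked against per-class centers rather than against all O(n³) triplets. A hedged numpy sketch of that idea, not the paper's exact formulation (margin value, distance choice, and normalization are our assumptions):

```python
import numpy as np

def centralized_ranking_loss(feats, labels, margin=1.0):
    """Rank each feature against class centers instead of enumerating
    triplets: pull it toward its own class center and require it to be
    at least `margin` closer to it than to every other center.
    Cost is O(n * classes) per pass rather than O(n^3) triplets."""
    classes = np.unique(labels)
    centers = {c: feats[labels == c].mean(axis=0) for c in classes}
    loss = 0.0
    for f, y in zip(feats, labels):
        d_pos = np.linalg.norm(f - centers[y])        # distance to own center
        for c in classes:
            if c != y:
                d_neg = np.linalg.norm(f - centers[c])
                loss += max(0.0, margin + d_pos - d_neg)  # hinge ranking term
    return loss / len(feats)
```

Well-separated clusters drive the hinge terms to zero, while collapsed classes pay the full margin.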


2019 ◽  
Vol 9 (13) ◽  
pp. 2686 ◽  
Author(s):  
Jianming Zhang ◽  
Chaoquan Lu ◽  
Jin Wang ◽  
Lei Wang ◽  
Xiao-Guang Yue

In civil engineering, the stability of concrete is of great significance to the safety of people’s lives and property, so it is necessary to detect concrete damage effectively. In this paper, we treat crack detection on concrete surfaces as a semantic segmentation task that distinguishes background from crack at the pixel level. Inspired by Fully Convolutional Networks (FCN), we propose a fully convolutional network based on dilated convolution for concrete crack detection, which consists of an encoder and a decoder. Specifically, we first use a residual network to extract the feature maps of the input image, design dilated convolutions with different dilation rates to extract feature maps of different receptive fields, and fuse the extracted features from multiple branches. Then, we apply stacked deconvolutions to upsample the fused feature maps. Finally, we use the softmax function to classify the feature maps at the pixel level. To verify the validity of the model, we adopt the commonly used evaluation metrics for semantic segmentation: Pixel Accuracy (PA), Mean Pixel Accuracy (MPA), Mean Intersection over Union (MIoU), and Frequency Weighted Intersection over Union (FWIoU). The experimental results show that, by introducing dilated convolutions with different dilation rates and a multi-branch fusion strategy, the proposed model converges faster and generalizes better on the test set. Our model achieves a PA of 96.84%, MPA of 92.55%, MIoU of 86.05% and FWIoU of 94.22% on the test set, which is superior to the other models compared.
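The four metrics named above all derive from one confusion matrix, and their standard definitions can be sketched directly (assuming every class appears in the ground truth, so no division-by-zero guard is included):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Compute PA, MPA, MIoU and FWIoU from a confusion matrix, following
    the standard semantic-segmentation definitions."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1                                 # rows: ground truth
    tp = np.diag(cm).astype(float)
    gt_count = cm.sum(axis=1).astype(float)           # pixels per true class
    union = gt_count + cm.sum(axis=0) - tp            # |pred ∪ gt| per class
    pa = tp.sum() / cm.sum()                          # overall pixel accuracy
    mpa = np.mean(tp / gt_count)                      # mean per-class accuracy
    iou = tp / union
    miou = np.mean(iou)                               # mean IoU over classes
    fwiou = ((gt_count / cm.sum()) * iou).sum()       # frequency-weighted IoU
    return pa, mpa, miou, fwiou
```

For a two-class example where one background pixel is mislabeled as crack, PA and MPA both come out at 0.75 while MIoU and FWIoU equal 7/12.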


2021 ◽  
Author(s):  
Anthony Bilodeau ◽  
Constantin V.L. Delmas ◽  
Martin Parent ◽  
Paul De Koninck ◽  
Audrey Durand ◽  
...  

High-throughput quantitative analysis of microscopy images presents a challenge due to the complexity of the image content and the difficulty of retrieving precisely annotated datasets. In this paper we introduce a weakly-supervised MICRoscopy Analysis neural network (MICRA-Net) that can be trained on a simple main classification task using image-level annotations to solve the more complex auxiliary task of semantic segmentation, along with other associated tasks such as detection and enumeration. MICRA-Net relies on the latent information embedded within a trained model to achieve performance similar to state-of-the-art fully-supervised learning. This learnt information is extracted from the network using gradient class activation maps, which are combined to generate detailed feature maps of the biological structures of interest. We demonstrate how MICRA-Net significantly alleviates the expert annotation process on various microscopy datasets and can be used for high-throughput quantitative analysis of microscopy images.
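The gradient class activation map step, weighting each feature map by the spatial mean of its gradient, summing, and keeping positive evidence, can be sketched in numpy given precomputed feature maps and gradients (obtaining the gradients themselves requires autodiff, which is omitted here; this is the generic Grad-CAM-style combination, not MICRA-Net's full pipeline):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Combine K feature maps of shape (K, H, W) into one localization map:
    weight each map by the spatial mean of its gradient, sum the weighted
    maps, and ReLU to keep only positive evidence for the class."""
    alphas = gradients.mean(axis=(1, 2))              # one weight per map
    cam = np.tensordot(alphas, feature_maps, axes=1)  # weighted sum -> (H, W)
    return np.maximum(cam, 0.0)                       # drop negative evidence
```

Maps whose gradients point away from the class receive negative weights and are suppressed by the ReLU.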


Author(s):  
Wei Ji ◽  
Xi Li ◽  
Yueting Zhuang ◽  
Omar El Farouk Bourahla ◽  
Yixin Ji ◽  
...  

Clothing segmentation is a challenging vision problem typically implemented within a fine-grained semantic segmentation framework. Different from conventional segmentation, clothing segmentation has some domain-specific properties, such as texture richness, diverse appearance variations, non-rigid geometry deformations, and small-sample learning. To address these properties, we propose a semantic locality-aware segmentation model, which adaptively pairs an original clothing image with a semantically similar (e.g., in appearance or pose) auxiliary exemplar retrieved by search. By considering the interactions between the clothing image and its exemplar, more intrinsic knowledge about the locality manifold structure of clothing images is discovered, making the learning process under the small-sample problem more stable and tractable. Furthermore, we present a CNN model based on deformable convolutions to extract non-rigid geometry-aware features for clothing images. Experimental results demonstrate the effectiveness of the proposed model against state-of-the-art approaches.
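The exemplar-by-search step can be illustrated as a nearest-neighbor lookup over image features; cosine similarity is our assumed metric here, and the abstract does not specify the actual similarity used:

```python
import numpy as np

def retrieve_exemplar(query_feat, gallery_feats):
    """Return the index of the gallery image whose feature vector is most
    similar to the query under cosine similarity - a minimal stand-in for
    retrieving a semantically similar auxiliary exemplar by search."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return int(np.argmax(g @ q))                      # best-matching index
```

In practice the gallery features would come from the same CNN that segments the images, so "similar" tracks appearance and pose.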


Sensors ◽  
2021 ◽  
Vol 22 (1) ◽  
pp. 207
Author(s):  
Qi Chen ◽  
Yuanyi Zhang ◽  
Xinyuan Li ◽  
Pengjie Tao

Deep learning techniques such as convolutional neural networks have largely improved the performance of building segmentation from remote sensing images. However, the images for building segmentation are often traditional orthophotos, in which relief displacement causes non-negligible misalignment between the roof outline and the footprint of a building; such misalignment poses considerable challenges for extracting accurate building footprints, especially for high-rise buildings. To alleviate this problem, a new workflow is proposed for generating rectified building footprints from traditional orthophotos. We first use facade labels, which can be prepared efficiently at low cost, along with roof labels to train a semantic segmentation network. Then, the trained network, which employs a state-of-the-art version of EfficientNet as its backbone, extracts the roof segments and the facade segments of buildings from the input image. Finally, after clustering the classified pixels into instance-level building objects and tracing out the roof outlines, an energy function is proposed to drive each roof outline to maximally align with the building footprint, from which the rectified footprints are generated. Experiments on aerial orthophotos covering a high-density residential area in Shanghai demonstrate that the proposed workflow generates noticeably more accurate building footprints than the baseline methods, especially for high-rise buildings.


2021 ◽  
Vol 1 (1) ◽  
pp. 29-31
Author(s):  
Mahmood Haithami ◽  
Amr Ahmed ◽  
Iman Yi Liao ◽  
Hamid Jalab

In this paper, we aim to enhance the segmentation capabilities of DeepLabV3 by employing a Gated Recurrent Unit (GRU). We replace a 1-by-1 convolution in DeepLabV3, located after the Atrous Spatial Pyramid Pooling (ASPP) layer, with a GRU to combine the input feature maps. The convolution and the GRU have shareable parameters; the latter, however, has gates that can enable or disable the contribution of each input feature map. Experiments on unseen test sets demonstrate that employing the GRU instead of the convolution produces better segmentation results. The datasets used are public datasets provided by the MedAI competition.
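The gating argument can be made concrete with a toy element-wise GRU that folds a sequence of ASPP branch maps into one map. The scalar gate weights here are a deliberate simplification of the full GRU parameterization, and this is an illustration of the gating idea rather than the authors' layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_combine(branch_maps, wz=0.0, wr=0.0, wh=1.0):
    """Fold a (K, H, W) stack of branch maps into one (H, W) map with an
    element-wise GRU: the update gate z decides, per pixel and per branch,
    how much of that branch enters the running state - unlike a 1x1
    convolution, whose mixing weights are fixed for every input."""
    h = np.zeros_like(branch_maps[0])
    for x in branch_maps:
        z = sigmoid(wz * x + h)              # update gate
        r = sigmoid(wr * x + h)              # reset gate
        h_tilde = np.tanh(wh * x + r * h)    # candidate state
        h = (1 - z) * h + z * h_tilde        # gated blend of old and new
    return h
```

Because each step blends the previous state with a tanh candidate, the combined map stays bounded in (-1, 1) regardless of how many branches are folded in.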


Sensors ◽  
2019 ◽  
Vol 19 (8) ◽  
pp. 1795 ◽  
Author(s):  
Xiao Lin ◽  
Dalila Sánchez-Escobedo ◽  
Josep R. Casas ◽  
Montse Pardàs

Semantic segmentation and depth estimation are two important tasks in computer vision, and many methods have been developed to tackle them. Commonly, these two tasks are addressed independently, but recently the idea of merging them into a single framework has been studied, under the assumption that integrating two highly correlated tasks may allow each to benefit the other and improve estimation accuracy. In this paper, depth estimation and semantic segmentation are jointly addressed from a single RGB input image with a unified convolutional neural network. We analyze two different architectures to evaluate which features are more relevant when shared by the two tasks and which should be kept separate to achieve mutual improvement. Likewise, our approaches are evaluated under two different scenarios designed to compare our results against single-task and multi-task methods. Qualitative and quantitative experiments demonstrate that our methodology outperforms the state of the art among single-task approaches, while obtaining competitive results compared with other multi-task methods.
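The shared-encoder, two-head layout that such multi-task networks rest on can be sketched with plain matrix operations. All sizes and the linear "layers" below are illustrative assumptions standing in for a real CNN:

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared feature extractor feeds both a per-pixel classification head
# (segmentation) and a per-pixel regression head (depth).
W_enc = rng.normal(size=(3, 8))      # 3 input channels -> 8 shared features
W_seg = rng.normal(size=(8, 5))      # shared features -> 5 class scores
W_depth = rng.normal(size=(8, 1))    # shared features -> 1 depth value

def forward(rgb):                    # rgb: (H*W, 3) flattened pixels
    shared = np.maximum(rgb @ W_enc, 0.0)   # shared representation (ReLU)
    seg_logits = shared @ W_seg             # the two heads branch here
    depth = shared @ W_depth
    return seg_logits, depth

seg, depth = forward(rng.normal(size=(16, 3)))
```

Which layers sit before the branch point versus inside each head is exactly the design question the two architectures in the paper probe.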


Sensors ◽  
2019 ◽  
Vol 19 (24) ◽  
pp. 5361 ◽  
Author(s):  
Bruno Artacho ◽  
Andreas Savakis

We propose a new efficient architecture for semantic segmentation based on a “Waterfall” Atrous Spatial Pooling architecture, which achieves a considerable increase in accuracy while decreasing the number of network parameters and the memory footprint. The proposed Waterfall architecture leverages the efficiency of progressive filtering in a cascade architecture while maintaining multiscale fields-of-view comparable to spatial pyramid configurations. Additionally, our method does not rely on a postprocessing stage with Conditional Random Fields, which further reduces complexity and required training time. We demonstrate that the Waterfall approach with a ResNet backbone is a robust and efficient architecture for semantic segmentation, obtaining state-of-the-art results on the Pascal VOC and Cityscapes datasets with a significant reduction in the number of parameters.
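The cascade-versus-pyramid distinction can be sketched in 1-D: each dilated branch filters the previous branch's output instead of the raw input, and every branch output is kept, so fields-of-view compound. The function names, kernel, and dilation rates are illustrative, not the paper's configuration:

```python
import numpy as np

def dilated_conv1d(signal, kernel, rate):
    """'Same'-padded 1-D convolution with the kernel taps spread
    `rate` samples apart (atrous convolution)."""
    k = len(kernel)
    pad = (k // 2) * rate
    x = np.pad(signal, pad)
    return np.array([sum(kernel[j] * x[i + j * rate] for j in range(k))
                     for i in range(len(signal))])

def waterfall_module(signal, kernel, rates=(1, 2, 4)):
    """Waterfall layout: each dilated branch filters the PREVIOUS branch's
    output (a cascade), and all branch outputs are retained, so effective
    fields-of-view grow like a spatial pyramid at lower cost."""
    outputs, x = [], signal
    for r in rates:
        x = dilated_conv1d(x, kernel, r)   # cascade: branches chain together
        outputs.append(x)
    return np.stack(outputs)               # one row per branch, as if concatenated
```

With an identity kernel every branch passes the signal through unchanged, which makes the chaining easy to verify.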

