A Novel Upsampling and Context Convolution for Image Semantic Segmentation

Semantic segmentation, which refers to pixel-wise classification of an image, is a fundamental topic in computer vision owing to its growing importance in the robot vision and autonomous driving sectors. It provides rich information about objects in the scene such as object boundary, category, and location. Recent methods for semantic segmentation often employ an encoder-decoder structure using deep convolutional neural networks. The encoder part extracts features of the image using several filters and pooling operations, whereas the decoder part gradually recovers the low-resolution feature maps of the encoder into a full input resolution feature map for pixel-wise prediction. However, the encoder-decoder variants for semantic segmentation suffer from severe spatial information loss, caused by pooling operations or stepwise convolutions, and does not consider the context in the scene. In this paper, we propose a novel dense upsampling convolution method based on a guided filter to effectively preserve the spatial information of the image in the network. We further propose a novel local context convolution method that not only covers larger-scale objects in the scene but covers them densely for precise object boundary delineation. Theoretical analyses and experimental results on several benchmark datasets verify the effectiveness of our method. Qualitatively, our approach delineates object boundaries at a level of accuracy that is beyond the current excellent methods. Quantitatively, we report a new record of 82.86% and 81.62% of pixel accuracy on ADE20K and Pascal-Context benchmark datasets, respectively. In comparison with the state-of-the-art methods, the proposed method offers promising improvements.

Download Full-text

Learnable Gated Convolutional Neural Network for Semantic Segmentation in Remote-Sensing Images

Remote Sensing ◽

10.3390/rs11161922 ◽

2019 ◽

Vol 11 (16) ◽

pp. 1922 ◽

Cited By ~ 7

Author(s):

Shichen Guo ◽

Qizhao Jin ◽

Hongzhen Wang ◽

Xuezhi Wang ◽

Yangang Wang ◽

...

Keyword(s):

Remote Sensing ◽

High Resolution ◽

Urban Areas ◽

Information Aggregation ◽

Semantic Segmentation ◽

Local Context ◽

Context Information ◽

Feature Maps ◽

Deep Convolutional Neural Networks ◽

Proposed Model

Semantic segmentation in high-resolution remote-sensing (RS) images is a fundamental task for RS-based urban understanding and planning. However, various types of artificial objects in urban areas make this task quite challenging. Recently, the use of Deep Convolutional Neural Networks (DCNNs) with multiscale information fusion has demonstrated great potential in enhancing performance. Technically, however, existing fusions are usually implemented by summing or concatenating feature maps in a straightforward way. Seldom do works consider the spatial importance for global-to-local context-information aggregation. This paper proposes a Learnable-Gated CNN (L-GCNN) to address this issue. Methodologically, the Taylor expression of the information-entropy function is first parameterized to design the gate function, which is employed to generate pixelwise weights for coarse-to-fine refinement in the L-GCNN. Accordingly, a Parameterized Gate Module (PGM) was designed to achieve this goal. Then, the single PGM and its densely connected extension were embedded into different levels of the encoder in the L-GCNN to help identify the discriminative feature maps at different scales. With the above designs, the L-GCNN is finally organized as a self-cascaded end-to-end architecture that is able to sequentially aggregate context information for fine segmentation. The proposed model was evaluated on two public challenging benchmarks, the ISPRS 2Dsemantic segmentation challenge Potsdam dataset and the Massachusetts building dataset. The experiment results demonstrate that the proposed method exhibited significant improvement compared with several related segmentation networks, including the FCN, SegNet, RefineNet, PSPNet, DeepLab and GSN.For example, on the Potsdam dataset, our method achieved a 93.65% F 1 score and 88.06% I o U score for the segmentation of tiny cars in high-resolution RS images. As a conclusion, the proposed model showed potential for object segmentation from the RS images of buildings, impervious surfaces, low vegetation, trees and cars in urban settings, which largely varies in size and have confusing appearances.

Download Full-text

Learning Fully Dense Neural Networks for Image Semantic Segmentation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33019283 ◽

2019 ◽

Vol 33 ◽

pp. 9283-9290 ◽

Cited By ~ 3

Author(s):

Mingmin Zhen ◽

Jinglu Wang ◽

Lei Zhou ◽

Tian Fang ◽

Long Quan

Keyword(s):

Neural Network ◽

Neural Networks ◽

Loss Function ◽

Spatial Information ◽

Semantic Segmentation ◽

The Other ◽

Feature Maps ◽

Spatial Reconstruction ◽

Benchmark Datasets ◽

The One

Semantic segmentation is pixel-wise classification which retains critical spatial information. The “feature map reuse” has been commonly adopted in CNN based approaches to take advantage of feature maps in the early layers for the later spatial reconstruction. Along this direction, we go a step further by proposing a fully dense neural network with an encoderdecoder structure that we abbreviate as FDNet. For each stage in the decoder module, feature maps of all the previous blocks are adaptively aggregated to feedforward as input. On the one hand, it reconstructs the spatial boundaries accurately. On the other hand, it learns more efficiently with the more efficient gradient backpropagation. In addition, we propose the boundary-aware loss function to focus more attention on the pixels near the boundary, which boosts the “hard examples” labeling. We have demonstrated the best performance of the FDNet on the two benchmark datasets: PASCAL VOC 2012, NYUDv2 over previous works when not considering training on other datasets.

Download Full-text

Semantic Image Segmentation with Deep Convolutional Neural Networks and Quick Shift

Symmetry ◽

10.3390/sym12030427 ◽

2020 ◽

Vol 12 (3) ◽

pp. 427 ◽

Cited By ~ 1

Author(s):

Sanxing Zhang ◽

Zhenhuan Ma ◽

Gang Zhang ◽

Tao Lei ◽

Rui Zhang ◽

...

Keyword(s):

Neural Networks ◽

Image Segmentation ◽

Convolutional Neural Networks ◽

Semantic Segmentation ◽

Autonomous Driving ◽

Input Image ◽

Feature Representation ◽

Segmentation Algorithm ◽

Deep Convolutional Neural Networks ◽

Semantic Image Segmentation

Semantic image segmentation, as one of the most popular tasks in computer vision, has been widely used in autonomous driving, robotics and other fields. Currently, deep convolutional neural networks (DCNNs) are driving major advances in semantic segmentation due to their powerful feature representation. However, DCNNs extract high-level feature representations by strided convolution, which makes it impossible to segment foreground objects precisely, especially when locating object boundaries. This paper presents a novel semantic segmentation algorithm with DeepLab v3+ and super-pixel segmentation algorithm-quick shift. DeepLab v3+ is employed to generate a class-indexed score map for the input image. Quick shift is applied to segment the input image into superpixels. Outputs of them are then fed into a class voting module to refine the semantic segmentation results. Extensive experiments on proposed semantic image segmentation are performed over PASCAL VOC 2012 dataset, and results that the proposed method can provide a more efficient solution.

Download Full-text

Novel Method of Semantic Segmentation Applicable to Augmented Reality

Sensors ◽

10.3390/s20061737 ◽

2020 ◽

Vol 20 (6) ◽

pp. 1737 ◽

Cited By ~ 1

Author(s):

Tae-young Ko ◽

Seung-ho Lee

Keyword(s):

Augmented Reality ◽

Spatial Information ◽

Ground Truth ◽

Semantic Segmentation ◽

Frame Rate ◽

Feature Maps ◽

Residual Network ◽

Feature Map ◽

Pascal Voc ◽

Novel Method

This paper proposes a novel method of semantic segmentation, consisting of modified dilated residual network, atrous pyramid pooling module, and backpropagation, that is applicable to augmented reality (AR). In the proposed method, the modified dilated residual network extracts a feature map from the original images and maintains spatial information. The atrous pyramid pooling module places convolutions in parallel and layers feature maps in a pyramid shape to extract objects occupying small areas in the image; these are converted into one channel using a 1 × 1 convolution. Backpropagation compares the semantic segmentation obtained through convolution from the final feature map with the ground truth provided by a database. Losses can be reduced by applying backpropagation to the modified dilated residual network to change the weighting. The proposed method was compared with other methods on the Cityscapes and PASCAL VOC 2012 databases. The proposed method achieved accuracies of 82.8 and 89.8 mean intersection over union (mIOU) and frame rates of 61 and 64.3 frames per second (fps) for the Cityscapes and PASCAL VOC 2012 databases, respectively. These results prove the applicability of the proposed method for implementing natural AR applications at actual speeds because the frame rate is greater than 60 fps.

Download Full-text

Semantic Segmentation of Underwater Images Based on Improved Deeplab

Journal of Marine Science and Engineering ◽

10.3390/jmse8030188 ◽

2020 ◽

Vol 8 (3) ◽

pp. 188

Author(s):

Fangfang Liu ◽

Ming Fang

Keyword(s):

Semantic Segmentation ◽

Autonomous Driving ◽

Correction Method ◽

Target Object ◽

Original Method ◽

Indoor Navigation ◽

Fine Tuning ◽

Object Boundary ◽

Current State ◽

Segmentation Accuracy

Image semantic segmentation technology has been increasingly applied in many fields, for example, autonomous driving, indoor navigation, virtual reality and augmented reality. However, underwater scenes, where there is a huge amount of marine biological resources and irreplaceable biological gene banks that need to be researched and exploited, are limited. In this paper, image semantic segmentation technology is exploited to study underwater scenes. We extend the current state-of-the-art semantic segmentation network DeepLabv3 + and employ it as the basic framework. First, the unsupervised color correction method (UCM) module is introduced to the encoder structure of the framework to improve the quality of the image. Moreover, two up-sampling layers are added to the decoder structure to retain more target features and object boundary information. The model is trained by fine-tuning and optimizing relevant parameters. Experimental results indicate that the image obtained by our method demonstrates better performance in improving the appearance of the segmented target object and avoiding its pixels from mingling with other class’s pixels, enhancing the segmentation accuracy of the target boundaries and retaining more feature information. Compared with the original method, our method improves the segmentation accuracy by 3%.

Download Full-text

Class-Wise Fully Convolutional Network for Semantic Segmentation of Remote Sensing Images

Remote Sensing ◽

10.3390/rs13163211 ◽

2021 ◽

Vol 13 (16) ◽

pp. 3211

Author(s):

Tian Tian ◽

Zhengquan Chu ◽

Qian Hu ◽

Li Ma

Keyword(s):

Remote Sensing ◽

Image Interpretation ◽

Semantic Segmentation ◽

Remote Sensing Images ◽

Feature Maps ◽

Convolutional Network ◽

Fully Convolutional Network ◽

Semantic Labeling ◽

Benchmark Datasets ◽

Semantic Label

Semantic segmentation is a fundamental task in remote sensing image interpretation, which aims to assign a semantic label for every pixel in the given image. Accurate semantic segmentation is still challenging due to the complex distributions of various ground objects. With the development of deep learning, a series of segmentation networks represented by fully convolutional network (FCN) has made remarkable progress on this problem, but the segmentation accuracy is still far from expectations. This paper focuses on the importance of class-specific features of different land cover objects, and presents a novel end-to-end class-wise processing framework for segmentation. The proposed class-wise FCN (C-FCN) is shaped in the form of an encoder-decoder structure with skip-connections, in which the encoder is shared to produce general features for all categories and the decoder is class-wise to process class-specific features. To be detailed, class-wise transition (CT), class-wise up-sampling (CU), class-wise supervision (CS), and class-wise classification (CC) modules are designed to achieve the class-wise transfer, recover the resolution of class-wise feature maps, bridge the encoder and modified decoder, and implement class-wise classifications, respectively. Class-wise and group convolutions are adopted in the architecture with regard to the control of parameter numbers. The method is tested on the public ISPRS 2D semantic labeling benchmark datasets. Experimental results show that the proposed C-FCN significantly improves the segmentation performances compared with many state-of-the-art FCN-based networks, revealing its potentials on accurate segmentation of complex remote sensing images.

Download Full-text

Cascaded Residual Attention Enhanced Road Extraction from Remote Sensing Images

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi11010009 ◽

2021 ◽

Vol 11 (1) ◽

pp. 9

Author(s):

Shengfu Li ◽

Cheng Liao ◽

Yulin Ding ◽

Han Hu ◽

Yang Jia ◽

...

Keyword(s):

Remote Sensing ◽

Spatial Information ◽

Semantic Segmentation ◽

Road Extraction ◽

Remote Sensing Images ◽

Long Distance ◽

Features Fusion ◽

Multi Scale ◽

Boundary Recognition ◽

Benchmark Datasets

Efficient and accurate road extraction from remote sensing imagery is important for applications related to navigation and Geographic Information System updating. Existing data-driven methods based on semantic segmentation recognize roads from images pixel by pixel, which generally uses only local spatial information and causes issues of discontinuous extraction and jagged boundary recognition. To address these problems, we propose a cascaded attention-enhanced architecture to extract boundary-refined roads from remote sensing images. Our proposed architecture uses spatial attention residual blocks on multi-scale features to capture long-distance relations and introduce channel attention layers to optimize the multi-scale features fusion. Furthermore, a lightweight encoder-decoder network is connected to adaptively optimize the boundaries of the extracted roads. Our experiments showed that the proposed method outperformed existing methods and achieved state-of-the-art results on the Massachusetts dataset. In addition, our method achieved competitive results on more recent benchmark datasets, e.g., the DeepGlobe and the Huawei Cloud road extraction challenge.

Download Full-text

Semantic Segmentation of Large-Scale Outdoor Point Clouds by Encoder–Decoder Shared MLPs with Multiple Losses

Remote Sensing ◽

10.3390/rs13163121 ◽

2021 ◽

Vol 13 (16) ◽

pp. 3121

Author(s):

Beanbonyka Rim ◽

Ahyoung Lee ◽

Min Hong

Keyword(s):

Large Scale ◽

Semantic Segmentation ◽

Point Clouds ◽

Autonomous Driving ◽

Trade Off ◽

Efficiency And Effectiveness ◽

Benchmark Datasets ◽

Scale Characteristics ◽

3D Lidar ◽

Geometry Mapping

Semantic segmentation of large-scale outdoor 3D LiDAR point clouds becomes essential to understand the scene environment in various applications, such as geometry mapping, autonomous driving, and more. With an advantage of being a 3D metric space, 3D LiDAR point clouds, on the other hand, pose a challenge for a deep learning approach, due to their unstructured, unorder, irregular, and large-scale characteristics. Therefore, this paper presents an encoder–decoder shared multi-layer perceptron (MLP) with multiple losses, to address an issue of this semantic segmentation. The challenge rises a trade-off between efficiency and effectiveness in performance. To balance this trade-off, we proposed common mechanisms, which is simple and yet effective, by defining a random point sampling layer, an attention-based pooling layer, and a summation of multiple losses integrated with the encoder–decoder shared MLPs method for the large-scale outdoor point clouds semantic segmentation. We conducted our experiments on the following two large-scale benchmark datasets: Toronto-3D and DALES dataset. Our experimental results achieved an overall accuracy (OA) and a mean intersection over union (mIoU) of both the Toronto-3D dataset, with 83.60% and 71.03%, and the DALES dataset, with 76.43% and 59.52%, respectively. Additionally, our proposed method performed a few numbers of parameters of the model, and faster than PointNet++ by about three times during inferencing.

Download Full-text

A Weakly Supervised Semantic Segmentation Network by Aggregating Seed Cues: The Multi-Object Proposal Generation Perspective

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3419842 ◽

2021 ◽

Vol 17 (1s) ◽

pp. 1-19

Author(s):

Junsheng Xiao ◽

Huahu Xu ◽

Honghao Gao ◽

Minjie Bian ◽

Yang Li

Keyword(s):

Image Classification ◽

Real World ◽

Semantic Segmentation ◽

Feature Maps ◽

High Confidence ◽

Deep Convolutional Neural Networks ◽

Object Proposal ◽

Initial Location ◽

Weakly Supervised ◽

High Level

Weakly supervised semantic segmentation under image-level annotations is effectiveness for real-world applications. The small and sparse discriminative regions obtained from an image classification network that are typically used as the important initial location of semantic segmentation also form the bottleneck. Although deep convolutional neural networks (DCNNs) have exhibited promising performances for single-label image classification tasks, images of the real-world usually contain multiple categories, which is still an open problem. So, the problem of obtaining high-confidence discriminative regions from multi-label classification networks remains unsolved. To solve this problem, this article proposes an innovative three-step framework within the perspective of multi-object proposal generation. First, an image is divided into candidate boxes using the object proposal method. The candidate boxes are sent to a single-classification network to obtain the discriminative regions. Second, the discriminative regions are aggregated to obtain a high-confidence seed map. Third, the seed cues grow on the feature maps of high-level semantics produced by a backbone segmentation network. Experiments are carried out on the PASCAL VOC 2012 dataset to verify the effectiveness of our approach, which is shown to outperform other baseline image segmentation methods.

Download Full-text

Residual Augmented Attentional U-Shaped Network for Spectral Reconstruction from RGB Images

Remote Sensing ◽

10.3390/rs13010115 ◽

2020 ◽

Vol 13 (1) ◽

pp. 115

Author(s):

Jiaojiao Li ◽

Chaoxiong Wu ◽

Rui Song ◽

Yunsong Li ◽

Weiying Xie

Keyword(s):

Superior Performance ◽

Feature Maps ◽

Deep Convolutional Neural Networks ◽

Second Order Statistics ◽

Feature Representations ◽

Quantitative Measurements ◽

Spectral Reconstruction ◽

Perceptual Comparison ◽

Benchmark Datasets ◽

Rgb Images

Deep convolutional neural networks (CNNs) have been successfully applied to spectral reconstruction (SR) and acquired superior performance. Nevertheless, the existing CNN-based SR approaches integrate hierarchical features from different layers indiscriminately, lacking an investigation of the relationships of intermediate feature maps, which limits the learning power of CNNs. To tackle this problem, we propose a deep residual augmented attentional u-shape network (RA2UN) with several double improved residual blocks (DIRB) instead of paired plain convolutional units. Specifically, a trainable spatial augmented attention (SAA) module is developed to bridge the encoder and decoder to emphasize the features in the informative regions. Furthermore, we present a novel channel augmented attention (CAA) module embedded in the DIRB to rescale adaptively and enhance residual learning by using first-order and second-order statistics for stronger feature representations. Finally, a boundary-aware constraint is employed to focus on the salient edge information and recover more accurate high-frequency details. Experimental results on four benchmark datasets demonstrate that the proposed RA2UN network outperforms the state-of-the-art SR methods under quantitative measurements and perceptual comparison.

Download Full-text