Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation

Vision-based navigation of autonomous vehicles primarily depends on deep neural network (DNN)-based systems, in which the controller obtains input from sensors/detectors, such as cameras, and produces a vehicle control output, such as a steering wheel angle, to navigate the vehicle safely in a roadway traffic environment. Typically, these DNN-based systems are trained through supervised learning; however, recent studies show that a trained DNN-based system can be compromised by perturbation or adverse inputs. Similarly, such perturbation can be introduced into the DNN-based systems of autonomous vehicles by unexpected roadway hazards, such as debris or roadblocks. In this study, we first introduce a hazardous roadway environment that can compromise the DNN-based navigational system of an autonomous vehicle and produce an incorrect steering wheel angle, which could cause crashes resulting in fatality or injury. Then, we develop a DNN-based autonomous vehicle driving system that uses object detection and semantic segmentation to mitigate the adverse effect of this type of hazard, helping the autonomous vehicle navigate safely around it. We find that our DNN-based autonomous vehicle driving system, including hazardous object detection and semantic segmentation, improves the ability of an autonomous vehicle to avoid a potential hazard by 21% compared with the traditional DNN-based autonomous vehicle driving system.

SIGGRAPH Asia 2014 Indoor Scene Understanding Where Graphics Meets Vision on - SIGGRAPH ASIA '14

10.1145/2670291 ◽

2014 ◽

Keyword(s):

Scene Understanding ◽

Indoor Scene

ECRU: An Encoder-Decoder Based Convolution Neural Network (CNN) for Road-Scene Understanding

Journal of Imaging ◽

10.3390/jimaging4100116 ◽

2018 ◽

Vol 4 (10) ◽

pp. 116 ◽

Cited By ~ 2

Author(s):

Robail Yasrab

Keyword(s):

Neural Network ◽

Visual Recognition ◽

Substantial Reduction ◽

Scene Understanding ◽

Semantic Segmentation ◽

Research Area ◽

Smart Systems ◽

Proposed Model ◽

Driving Assistance ◽

Flexible Architecture

This research presents a novel fully convolutional neural network (CNN)-based model for probabilistic pixel-wise segmentation, titled Encoder-decoder-based CNN for Road-Scene Understanding (ECRU). Scene understanding has recently become an evolving research area, and semantic segmentation is the most recent method for visual recognition. Among vision-based smart systems, the driving assistance system has become a much preferred research topic. The proposed model is an encoder-decoder that performs pixel-wise class predictions. The encoder network is composed of a VGG-19 layer model, while the decoder network uses 16 upsampling and deconvolution units. The encoder has a very flexible architecture that can be altered and trained for any size and resolution of images. The decoder network upsamples and maps the encoder's low-resolution features. Consequently, there is a substantial reduction in the trainable parameters, as the network recycles the encoder's pooling indices for pixel-wise classification and segmentation. The proposed model is intended to offer a simplified CNN model with less overhead and higher performance. The network is trained and tested on the well-known road-scene dataset CamVid and offers outstanding outcomes in comparison to similar early approaches like FCN and VGG16 in terms of performance vs. trainable parameters.
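The reuse of pooling indices mentioned in the abstract can be illustrated with a minimal sketch. This is not the ECRU implementation: the function names `max_pool_2x2` and `max_unpool_2x2` are hypothetical, the input is assumed to be a single-channel grid with even height and width, and the code only shows the general idea that a decoder can place values back at the positions the encoder recorded, without learning any upsampling weights for that step.

```python
from typing import List, Tuple

Grid = List[List[float]]
IndexGrid = List[List[Tuple[int, int]]]


def max_pool_2x2(x: Grid) -> Tuple[Grid, IndexGrid]:
    """2x2 max pooling with stride 2 that also records where each maximum was found.

    Assumes the grid has even height and width (an assumption of this sketch).
    """
    pooled: Grid = []
    indices: IndexGrid = []
    for r in range(0, len(x), 2):
        row_vals, row_idx = [], []
        for c in range(0, len(x[0]), 2):
            # Candidate positions inside the current 2x2 window.
            window = [(r + dr, c + dc) for dr in (0, 1) for dc in (0, 1)]
            best = max(window, key=lambda p: x[p[0]][p[1]])
            row_vals.append(x[best[0]][best[1]])
            row_idx.append(best)
        pooled.append(row_vals)
        indices.append(row_idx)
    return pooled, indices


def max_unpool_2x2(pooled: Grid, indices: IndexGrid, height: int, width: int) -> Grid:
    """Sparse upsampling: put each pooled value back at its recorded position."""
    out = [[0.0] * width for _ in range(height)]
    for vals, idxs in zip(pooled, indices):
        for value, (r, c) in zip(vals, idxs):
            out[r][c] = value
    return out


feature_map = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 1.0],
]
pooled, indices = max_pool_2x2(feature_map)
restored = max_unpool_2x2(pooled, indices, 4, 4)
```

In this sketch the unpooling step has no trainable parameters at all, which is one way to read the abstract's claim about a reduced parameter count; a real decoder would typically follow it with learned convolutions to densify the sparse map.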

MSGC: A New Bottom-Up Model for Salient Object Detection

2018 IEEE International Conference on Multimedia and Expo (ICME) ◽

10.1109/icme.2018.8486442 ◽

2018 ◽

Cited By ~ 2

Author(s):

Zhi-Jie Wang ◽

Lizhuang Ma ◽

Xiao Lin ◽

Xiabao Wu

Keyword(s):

Object Detection ◽

Salient Object Detection ◽

Salient Object ◽

Bottom Up

Visual Scene Understanding for Autonomous Driving Using Semantic Segmentation

Explainable AI: Interpreting, Explaining and Visualizing Deep Learning - Lecture Notes in Computer Science ◽

10.1007/978-3-030-28954-6_15 ◽

2019 ◽

pp. 285-296 ◽

Cited By ~ 1

Author(s):

Markus Hofmarcher ◽

Thomas Unterthiner ◽

José Arjona-Medina ◽

Günter Klambauer ◽

Sepp Hochreiter ◽

...

Keyword(s):

Scene Understanding ◽

Semantic Segmentation ◽

Autonomous Driving ◽

Visual Scene ◽

Visual Scene Understanding

Visual attention strategies for target object detection

10.26686/wgtn.17067635 ◽

2021 ◽

Author(s):

Ibrahim Mohammad Hussain Rahman

Keyword(s):

Visual Attention ◽

Object Detection ◽

Target Object ◽

Detection Accuracy ◽

Estimation Model ◽

Top Down ◽

Bottom Up ◽

Feature Map ◽

Low Level ◽

Visual Tasks

The human visual attention (HVA) system encompasses a set of interconnected neurological modules that analyze visual stimuli by attending to salient regions. Two contrasting biological mechanisms exist in the HVA system: bottom-up, data-driven attention and top-down, task-driven attention. The former is mostly responsible for low-level instinctive behaviors, while the latter is responsible for performing complex visual tasks such as target object detection. Very few computational models have been proposed for top-down attention, mainly for three reasons. First, the top-down process involves many influential factors. Second, top-down responses vary from task to task. Finally, many biological aspects of the top-down process are not yet well understood. For these reasons, it is difficult to devise a generalized top-down model that could be applied to all high-level visual tasks. Instead, this thesis addresses some outstanding issues in modelling top-down attention for one particular task, target object detection. Target object detection is an essential step in analyzing images before performing more complex visual tasks. It has not been investigated thoroughly in the modelling of top-down saliency and hence constitutes the main application domain for this thesis. The thesis investigates methods to model top-down attention through various high-level data acquired from images. Furthermore, it investigates different strategies for dynamically combining bottom-up and top-down processes to improve the detection accuracy, as well as the computational efficiency, of existing and new visual attention models. The following techniques and approaches are proposed to address the outstanding issues in modelling top-down saliency:

1. A top-down saliency model that weights low-level attentional features through contextual knowledge of a scene. The proposed model assigns weights to the features of a novel image by extracting a contextual descriptor of the image. The contextual descriptor tunes the weighting of low-level features to maximize detection accuracy. Incorporating context into the feature weighting mechanism improves the quality of the weights assigned to these features.

2. Two modules of target features combined with contextual weighting to improve the detection accuracy of the target object. In this proposed model, two sets of attentional feature weights are learned, one through context and the other through target features. When both sources of knowledge are used to model top-down attention, a drastic increase in detection accuracy is achieved in images with complex backgrounds and a variety of target objects.

3. A top-down and bottom-up attention combination model based on feature interaction. This model provides a dynamic way of combining both processes by formulating the problem as feature selection. The feature selection exploits the interaction between these features, yielding a robust set of features that maximizes both the detection accuracy and the overall efficiency of the system.

4. A feature map quality score estimation model that accurately predicts the detection accuracy score of any previously unseen feature map without the need for ground-truth data. The model extracts various local, global, geometrical and statistical characteristics from a feature map. These characteristics guide a regression model that estimates the quality of a novel map.

5. A dynamic feature integration framework for combining bottom-up and top-down saliencies at runtime. If the estimation model can accurately predict the quality score of any novel feature map, then dynamic feature map integration can be performed based on the estimated value. We propose two frameworks for feature map integration using the estimation model. The proposed integration framework achieves higher human fixation prediction accuracy with a minimal number of feature maps than is achieved by combining all feature maps.

The works proposed in this thesis provide new directions in modelling top-down saliency for target object detection. In addition, the dynamic approaches for top-down and bottom-up combination show considerable improvements over existing approaches in both efficiency and accuracy.
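As a rough illustration of the dynamic integration idea described in the abstract, the sketch below fuses only those feature maps whose estimated quality passes a threshold, weighting each by its score. It is not the thesis's framework: the function `integrate_maps`, the `min_quality` threshold, the map names and the quality values are all hypothetical, and the scores are simply given as inputs here rather than predicted by a regression model.

```python
from typing import Dict, List

Grid = List[List[float]]


def integrate_maps(maps: Dict[str, Grid], quality: Dict[str, float], min_quality: float) -> Grid:
    """Fuse feature maps whose estimated quality is at least min_quality.

    Each selected map is weighted by its quality score, normalized so the
    weights sum to one. All maps are assumed to share the same size.
    """
    selected = [name for name in maps if quality[name] >= min_quality]
    if not selected:
        raise ValueError("no feature map meets the quality threshold")
    total = sum(quality[name] for name in selected)
    height = len(maps[selected[0]])
    width = len(maps[selected[0]][0])
    fused = [[0.0] * width for _ in range(height)]
    for name in selected:
        weight = quality[name] / total
        for r in range(height):
            for c in range(width):
                fused[r][c] += weight * maps[name][r][c]
    return fused


# Hypothetical feature maps and quality estimates, for illustration only.
maps = {
    "color": [[1.0, 0.0], [0.0, 0.0]],
    "orientation": [[0.0, 1.0], [0.0, 0.0]],
    "noise": [[0.5, 0.5], [0.5, 0.5]],
}
quality = {"color": 0.75, "orientation": 0.25, "noise": 0.1}
fused = integrate_maps(maps, quality, min_quality=0.2)
```

With these made-up inputs the low-scoring `noise` map is excluded, so the fused map is built from two maps instead of three, mirroring the abstract's point that fewer, better maps can outperform combining everything.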

Bottom-up visual attention model for still image: a preliminary study

International Journal of Advances in Intelligent Informatics ◽

10.26555/ijain.v6i1.469 ◽

2020 ◽

Vol 6 (1) ◽

pp. 82

Author(s):

Adhi Prahara ◽

Murinto Murinto ◽

Dewi Pramudi Ismi

Keyword(s):

Visual Attention ◽

Object Detection ◽

Video Compression ◽

Saliency Map ◽

Bottom Up ◽

Attention Model ◽

Intrinsic Cues ◽

Preliminary Study ◽

Segmentation Image ◽

Human Visual Attention

The philosophy of human visual attention is scientifically explained in the fields of cognitive psychology and neuroscience and then computationally modeled in the fields of computer science and engineering. Visual attention models have been applied in computer vision systems for tasks such as object detection, object recognition, image segmentation, image and video compression, action recognition, and visual tracking. This work studies bottom-up visual attention, namely human fixation prediction and salient object detection models. The preliminary study briefly covers the topic from the biological perspective of visual attention, including the visual pathway and the theory of visual attention, through to the computational model of bottom-up visual attention that generates a saliency map. The study compares some models at each stage and observes whether the stage is inspired by the biological architecture, concept, or behavior of human visual attention. According to the study, low-level features, the center-surround mechanism, sparse representation, and higher-level guidance with intrinsic cues dominate bottom-up visual attention approaches. The study also highlights the correlation between bottom-up visual attention and curiosity.
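The center-surround mechanism named in the abstract can be sketched in a few lines. This is a simplified illustration, not any of the surveyed models: the function `center_surround` is hypothetical, it works on a single-channel intensity grid at one scale, and it takes the surround to be the mean of the immediate neighbours, whereas published models usually compare responses across multiple scales.

```python
from typing import List

Grid = List[List[float]]


def center_surround(image: Grid) -> Grid:
    """Saliency as the absolute difference between a pixel and the mean of its neighbours."""
    height, width = len(image), len(image[0])
    saliency = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Surround: up to 8 neighbours, clipped at the image border.
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
                if (rr, cc) != (r, c)
            ]
            # A 1x1 image has no surround; treat the contrast as zero.
            surround = sum(neighbours) / len(neighbours) if neighbours else image[r][c]
            saliency[r][c] = abs(image[r][c] - surround)
    return saliency


# A single bright pixel on a dark background stands out most.
image = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
saliency = center_surround(image)
```

Here the bright center pixel receives the highest saliency because it differs most from its surround, which is the intuition behind using local contrast as a bottom-up cue.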
