KDA3D: Key-Point Densification and Multi-Attention Guidance for 3D Object Detection

In this paper, we propose KDA3D, a novel 3D object detector that achieves high-precision and robust classification, segmentation, and localization with the help of key-point densification and multi-attention guidance. The proposed end-to-end neural network architecture takes LiDAR point clouds as its main input, optionally complemented by RGB images. It consists of three parts: part-1 segments 3D foreground points and generates reliable proposals; part-2 (optional) enhances point cloud density and reconstructs a more compact full-point feature map; part-3 refines 3D bounding boxes and adds semantic segmentation as extra supervision. Our lightweight point-wise and channel-wise attention modules adaptively strengthen the “skeleton” and “distinctiveness” point features to help the feature learning networks capture more representative or finer patterns. The proposed key-point densification component generates pseudo-point clouds containing target information from monocular images through a distance preference strategy and K-means clustering, so as to balance the density distribution and enrich sparse features. Extensive experiments on the KITTI and nuScenes 3D object detection benchmarks show that KDA3D produces state-of-the-art results while running in near real-time with a low memory footprint.
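The abstract does not say how K-means is applied, so the following is only a rough sketch under an assumption: that clustering is used to pick a small set of representative key points from a dense pseudo-point cloud. The function names `kmeans` and `key_points`, the evenly spaced initialisation, and the "nearest real point to each centroid" rule are illustrative choices, not details taken from the paper.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means on 3D points; returns (centroids, labels)."""
    # Deterministic initialisation (an assumption): k evenly spaced input points.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels


def key_points(points, k):
    """Return, for each cluster, the real input point closest to its centroid."""
    centroids, _ = kmeans(points, k)
    return [min(points, key=lambda p: math.dist(p, c)) for c in centroids]


# Hypothetical pseudo-point cloud with two well-separated groups.
cloud = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (10.0, 10.0, 10.0), (10.1, 10.0, 10.0), (10.0, 10.1, 10.0),
]
selected = key_points(cloud, k=2)
```

Returning real points rather than centroids keeps the selected key points on the observed surface, which is one plausible reason to cluster before densifying.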

3D Object Detection Using Scale Invariant and Feature Reweighting Networks

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33019267 ◽

2019 ◽

Vol 33 ◽

pp. 9267-9274 ◽

Cited By ~ 6

Author(s):

Xin Zhao ◽

Zhe Liu ◽

Ruolan Hu ◽

Kaiqi Huang

Keyword(s):

Object Detection ◽

Network Architecture ◽

Point Clouds ◽

Scale Invariant ◽

3D Object ◽

Outdoor Scenes ◽

Indoor Scenes ◽

Bounding Boxes ◽

The One ◽

3D Object Detection

3D object detection plays an important role in a large number of real-world applications. It requires estimating the locations and orientations of 3D objects in real scenes. In this paper, we present a new network architecture that uses front-view images and frustum point clouds to generate 3D detection results. On the one hand, a PointSIFT module is used to improve the performance of 3D segmentation; it captures information from different orientations in space and is robust to shapes of different scales. On the other hand, our network uses a SENet module to retain useful features and suppress less informative ones. This module reweights channel features so that the 3D bounding boxes are estimated more effectively. Our method is evaluated on the KITTI dataset for outdoor scenes and the SUN-RGBD dataset for indoor scenes. The experimental results show that our method outperforms state-of-the-art methods, especially when point clouds are highly sparse.
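As a rough illustration of the channel reweighting mentioned above, the sketch below follows the generic squeeze-and-excitation pattern: global average pooling per channel, a small bottleneck of two fully connected layers, and a sigmoid gate that scales each channel. The function name `se_reweight`, the omission of bias terms, and the toy weights are assumptions for illustration; the paper's actual layer sizes and placement are not given in the abstract.

```python
import math


def se_reweight(features, w1, w2):
    """Squeeze-and-excitation style channel gating.

    features: list of C channels, each a list of N per-point values.
    w1: bottleneck weights, shape (C/r) x C.  w2: expansion weights, shape C x (C/r).
    Returns (reweighted_features, gates). Bias terms are omitted for brevity.
    """
    # Squeeze: global average pooling over the points of each channel.
    z = [sum(ch) / len(ch) for ch in features]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, features)]
    return scaled, gates


# Toy example with 2 channels, 2 points, and a bottleneck of width 1.
feats = [[1.0, 3.0], [0.0, 0.0]]
out, gates = se_reweight(feats, w1=[[1.0, 0.0]], w2=[[1.0], [-1.0]])
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, which is the "keep useful, suppress uninformative" behaviour the abstract describes.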

3D object detection combining semantic and geometric features from point clouds

Cobot ◽

10.12688/cobot.17433.1 ◽

2022 ◽

Vol 1 ◽

pp. 2

Author(s):

Hao Peng ◽

Guofeng Tong ◽

Zheng Li ◽

Yaqi Wang ◽

Yuyuan Shao

Keyword(s):

Object Detection ◽

Feature Learning ◽

Point Clouds ◽

Difficulty Level ◽

Geometric Feature ◽

Geometric Features ◽

3D Object ◽

Point Module ◽

Institute Of Technology ◽

3D Object Detection

Background: 3D object detection based on point clouds in road scenes has attracted much attention recently. Voxel-based methods voxelize the scene into regular grids, which can be processed by advanced convolutional feature learning frameworks for semantic feature learning. Point-based methods can extract the geometric features of points because point coordinates are preserved. Combining the two is effective for 3D object detection. However, current methods use a voxel-based detection head with anchors for classification and localization. Although the preset anchors cover the entire scene, they are not suitable for detection tasks with larger scenes and multiple object categories because of the limitation of the voxel size. Additionally, the misalignment between the predicted confidence and the proposals in Region of Interest (RoI) selection hinders 3D object detection.
Methods: We investigate the combination of voxel-based and point-based methods for 3D object detection. Additionally, a voxel-to-point module that captures semantic and geometric features is proposed. The voxel-to-point module is conducive to the detection of small objects and avoids preset anchors in the inference stage. Moreover, a confidence adjustment module with center-boundary-aware confidence attention is proposed to resolve the misalignment between the predicted confidence and the proposals in RoI selection.
Results: The proposed method achieved state-of-the-art results for 3D object detection on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) object detection dataset. As of September 19, 2021, our method ranked 1st in the 3D and Bird’s Eye View (BEV) detection of cyclists tagged with difficulty level ‘easy’, and ranked 2nd in the 3D detection of cyclists tagged with ‘moderate’.
Conclusions: We propose an end-to-end two-stage 3D object detector with a voxel-to-point module and a confidence adjustment module.

A Novel Regional Fusion Network for 3D Object Detection based on RGB Images and Point Clouds

10.5121/csit.2021.111812 ◽

2021 ◽

Author(s):

Hung-Hao Chen ◽

Chia-Hung Wang ◽

Hsueh-Wei Chen ◽

Pei-Yung Hsiao ◽

Li-Chen Fu ◽

...

Keyword(s):

Object Detection ◽

Receptive Fields ◽

Point Clouds ◽

Detection Methods ◽

Lidar Data ◽

3D Object ◽

Multi Scale ◽

Interest Level ◽

Rgb Images ◽

3D Object Detection

Current fusion-based methods transform LiDAR data into bird’s eye view (BEV) representations or 3D voxels, leading to information loss and the heavy computation cost of 3D convolution. In contrast, we directly consume raw point clouds and perform fusion between the two modalities. We employ the concept of a region proposal network to generate proposals from the two streams separately. So that the two sensors compensate for each other’s weaknesses, we use the calibration parameters to project proposals from one stream onto the other. With the proposed multi-scale feature aggregation module, we combine the extracted region-of-interest-level (RoI-level) features of the RGB stream from different receptive fields, which enriches the features. Experiments on the KITTI dataset show that our proposed network outperforms other fusion-based methods, with meaningful improvements over 3D object detection methods under challenging settings.

A Two-Phase Cross-Modality Fusion Network for Robust 3D Object Detection

Sensors ◽

10.3390/s20216043 ◽

2020 ◽

Vol 20 (21) ◽

pp. 6043

Author(s):

Yujun Jiao ◽

Zhishuai Yin

Keyword(s):

Object Detection ◽

Point Cloud ◽

Point Clouds ◽

Second Phase ◽

Two Phase ◽

3D Object ◽

Rgb Images ◽

Fusion Scheme ◽

3D Object Detection ◽

Level Fusion

A two-phase cross-modality fusion detector is proposed in this study for robust and high-precision 3D object detection with RGB images and LiDAR point clouds. First, a two-stream fusion network is built into the framework of Faster RCNN to perform accurate and robust 2D detection. The visible stream takes the RGB images as inputs, while the intensity stream is fed with the intensity maps which are generated by projecting the reflection intensity of point clouds to the front view. A multi-layer feature-level fusion scheme is designed to merge multi-modal features across multiple layers in order to enhance the expressiveness and robustness of the produced features upon which region proposals are generated. Second, a decision-level fusion is implemented by projecting 2D proposals to the space of the point cloud to generate 3D frustums, on the basis of which the second-phase 3D detector is built to accomplish instance segmentation and 3D-box regression on the filtered point cloud. The results on the KITTI benchmark show that features extracted from RGB images and intensity maps complement each other, and our proposed detector achieves state-of-the-art performance on 3D object detection with a substantially lower running time as compared to available competitors.
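The decision-level step, lifting a 2D proposal to a 3D frustum, can be pictured as keeping only the LiDAR points whose image projection falls inside the 2D box. The sketch below assumes an undistorted pinhole camera and points that have already been transformed into the camera frame with the calibration extrinsics; the function name `points_in_frustum` and the intrinsics used in the example are hypothetical.

```python
def points_in_frustum(points_cam, box2d, fx, fy, cx, cy):
    """Keep the points whose pinhole projection lies inside a 2D box.

    points_cam: iterable of (x, y, z) in the camera frame, z pointing forward.
    box2d: (u_min, v_min, u_max, v_max) in pixels.
    fx, fy, cx, cy: pinhole intrinsics (focal lengths and principal point).
    """
    u_min, v_min, u_max, v_max = box2d
    kept = []
    for x, y, z in points_cam:
        if z <= 0:
            # Points behind the camera cannot belong to the frustum.
            continue
        # Perspective projection onto the image plane.
        u = fx * x / z + cx
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept


# Hypothetical intrinsics and a 20 x 20 pixel proposal around the image centre.
pts = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 0.0, -5.0), (0.5, 0.0, 10.0)]
frustum_pts = points_in_frustum(pts, (40, 40, 60, 60), fx=100, fy=100, cx=50, cy=50)
```

The filtered subset is what a second-phase network would then segment and regress a box on; near and far depth limits, which a full frustum would also have, are left out here.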

ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6945 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12557-12564 ◽

Cited By ~ 4

Author(s):

Zhenbo Xu ◽

Wei Zhang ◽

Xiaoqing Ye ◽

Xiao Tan ◽

Wei Yang ◽

...

Keyword(s):

Object Detection ◽

Point Clouds ◽

Autonomous Driving ◽

Disparity Estimation ◽

3D Object ◽

Detection Model ◽

Occluded Objects ◽

Bounding Boxes ◽

Detection Quality ◽

3D Object Detection

3D object detection is an essential task in autonomous driving and robotics. Though great progress has been made, challenges remain in estimating 3D pose for distant and occluded objects. In this paper, we present a novel framework named ZoomNet for stereo imagery-based 3D detection. The pipeline of ZoomNet begins with an ordinary 2D object detection model, which is used to obtain pairs of left-right bounding boxes. To further exploit the abundant texture cues in RGB images for more accurate disparity estimation, we introduce a conceptually straightforward module, adaptive zooming, which simultaneously resizes 2D instance bounding boxes to a unified resolution and adjusts the camera intrinsic parameters accordingly. In this way, we are able to estimate higher-quality disparity maps from the resized box images and then construct dense point clouds for both nearby and distant objects. Moreover, we propose learning part locations as complementary features to improve robustness to occlusion, and put forward the 3D fitting score to better estimate the 3D detection quality. Extensive experiments on the popular KITTI 3D detection dataset indicate that ZoomNet surpasses all previous state-of-the-art methods by large margins (improved by 9.4% on AP_bv (IoU=0.7) over pseudo-LiDAR). An ablation study also demonstrates that our adaptive zooming strategy brings an improvement of over 10% on AP_3d (IoU=0.7). In addition, since the official KITTI benchmark lacks fine-grained annotations such as pixel-wise part locations, we also present our KFG dataset, which augments KITTI with detailed instance-wise annotations including pixel-wise part location, pixel-wise disparity, etc. Both the KFG dataset and our code will be publicly available at https://github.com/detectRecog/ZoomNet.
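The intrinsics adjustment that accompanies adaptive zooming follows from standard pinhole geometry: cropping shifts the principal point, and resizing scales both the focal lengths and the principal point. The sketch below shows that relationship only; the function name `zoom_intrinsics`, the example numbers, and the neglect of pixel-centre offsets are assumptions rather than details from the paper.

```python
def zoom_intrinsics(fx, fy, cx, cy, box, out_w, out_h):
    """Intrinsics of the image obtained by cropping `box` and resizing it.

    box: (x0, y0, x1, y1) crop in the original image, in pixels.
    out_w, out_h: unified output resolution of the resized crop.
    Returns (fx', fy', cx', cy') for the zoomed image.
    """
    x0, y0, x1, y1 = box
    # Per-axis zoom factors from crop size to output size.
    sx = out_w / (x1 - x0)
    sy = out_h / (y1 - y0)
    # Cropping translates the principal point; resizing scales everything.
    return fx * sx, fy * sy, (cx - x0) * sx, (cy - y0) * sy


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to pixels (u, v)."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy


# Hypothetical camera and a 100 x 50 box zoomed to 200 x 100.
K = (700.0, 700.0, 600.0, 180.0)
box = (500.0, 150.0, 600.0, 200.0)
K_zoom = zoom_intrinsics(*K, box, out_w=200, out_h=100)

# Projecting with the adjusted intrinsics matches crop-then-resize of the pixel.
u, v = project((1.0, 0.5, 10.0), *K)
u_zoom, v_zoom = project((1.0, 0.5, 10.0), *K_zoom)
```

Because the focal length grows with the zoom factor, disparity measured in the resized crop grows by the same factor, which is why a small distant object can yield a finer depth estimate after zooming.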

Multi-View Fusion-Based 3D Object Detection for Robot Indoor Scene Perception

Sensors ◽

10.3390/s19194092 ◽

2019 ◽

Vol 19 (19) ◽

pp. 4092 ◽

Cited By ~ 2

Author(s):

Li Wang ◽

Ruifeng Li ◽

Jingwen Sun ◽

Xingxing Liu ◽

Lijun Zhao ◽

...

Keyword(s):

Object Detection ◽

Scene Perception ◽

Semantic Segmentation ◽

Point Clouds ◽

Service Robot ◽

Multiple Views ◽

Object Point ◽

3D Object ◽

Bounding Box ◽

3D Object Detection

To autonomously move and manipulate objects in cluttered indoor environments, a service robot requires 3D scene perception. Though 3D object detection can provide an object-level environmental description to fill this gap, a robot always encounters incomplete object observations, recurring detections of the same object, detection errors, or intersections between objects when conducting detection continuously in a cluttered room. To solve these problems, we propose a two-stage 3D object detection algorithm that fuses multiple views of 3D object point clouds in the first stage and eliminates unreasonable and intersecting detections in the second stage. For each view, the robot performs 2D object semantic segmentation and obtains 3D object point clouds. Then, an unsupervised segmentation method called Locally Convex Connected Patches (LCCP) is used to segment the object accurately from the background. Subsequently, Manhattan Frame estimation is used to calculate the main orientation of the object, from which the 3D object bounding box is obtained. To deal with objects detected in multiple views, we construct an object database and propose an object fusion criterion to maintain it automatically. Thus, observations of the same object from multiple views are fused together and a more accurate bounding box can be calculated. Finally, we propose an object filtering approach based on prior knowledge to remove incorrect and intersecting objects from the object database. Experiments are carried out on both the SceneNN dataset and a real indoor environment to verify the stability and accuracy of 3D semantic segmentation and bounding box detection with multi-view fusion.

P2V-RCNN: Point to Voxel Feature Learning for 3D Object Detection from Point Clouds

IEEE Access ◽

10.1109/access.2021.3094562 ◽

2021 ◽

pp. 1-1

Author(s):

Jiale Li ◽

Yu Sun ◽

Shujie Luo ◽

Ziqi Zhu ◽

Hang Dai ◽

...

Keyword(s):

Object Detection ◽

Feature Learning ◽

Point Clouds ◽

3D Object ◽

3D Object Detection

CrossFusion net: Deep 3D object detection based on RGB images and point clouds in autonomous driving

Image and Vision Computing ◽

10.1016/j.imavis.2020.103955 ◽

2020 ◽

Vol 100 ◽

pp. 103955

Author(s):

Dza-Shiang Hong ◽

Hung-Hao Chen ◽

Pei-Yung Hsiao ◽

Li-Chen Fu ◽

Siang-Min Siao

Keyword(s):

Object Detection ◽

Point Clouds ◽

Autonomous Driving ◽

3D Object ◽

Rgb Images ◽

3D Object Detection

Cascaded Cross-Modality Fusion Network for 3D Object Detection

Sensors ◽

10.3390/s20247243 ◽

2020 ◽

Vol 20 (24) ◽

pp. 7243

Author(s):

Zhiyu Chen ◽

Qiong Lin ◽

Jing Sun ◽

Yujian Feng ◽

Shangdong Liu ◽

...

Keyword(s):

Object Detection ◽

Back Propagation ◽

Point Clouds ◽

Semantic Features ◽

3D Object ◽

Multi Scale ◽

Scale Point ◽

The Difference ◽

Bounding Boxes ◽

3D Object Detection

In this paper, we explore LiDAR-RGB fusion-based 3D object detection. This task remains challenging in two respects: (1) differences in data formats and sensor positions cause misalignment between the semantic features of images and the geometric features of point clouds; (2) optimizing the traditional IoU is not equivalent to optimizing the regression loss of bounding boxes, resulting in biased back-propagation for non-overlapping cases. In this work, we propose a cascaded cross-modality fusion network (CCFNet), which includes a cascaded multi-scale fusion module (CMF) and a novel center 3D IoU loss to resolve these two issues. Our CMF module is developed to reinforce the discriminative representation of objects by reasoning about the relation between the corresponding LiDAR geometric capability and RGB semantic capability of the object across the two modalities. Specifically, CMF is added in a cascaded way between the RGB and LiDAR streams; it selects salient points and transmits multi-scale point cloud features to each stage of the RGB stream. Moreover, our center 3D IoU loss incorporates the distance between anchor centers to avoid oversimplified optimization for non-overlapping bounding boxes. Extensive experiments on the KITTI benchmark demonstrate that our proposed approach performs better than the compared methods.
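The abstract does not give the formula of the center 3D IoU loss. To show why a center-distance term helps in non-overlapping cases, the sketch below substitutes a well-known Distance-IoU style loss for axis-aligned 3D boxes: 1 - IoU plus the squared center distance normalised by the squared diagonal of the smallest enclosing box. This is a stand-in under that assumption, not the loss defined in the paper, and the function name `center_iou_loss_3d` is hypothetical.

```python
def center_iou_loss_3d(a, b):
    """Distance-IoU style loss for axis-aligned 3D boxes.

    a, b: boxes as (xmin, ymin, zmin, xmax, ymax, zmax).
    Returns 1 - IoU + d^2 / c^2, where d is the distance between box centers
    and c is the diagonal of the smallest box enclosing both.
    """
    # Intersection volume (zero if the boxes do not overlap on any axis).
    inter = 1.0
    for i in range(3):
        inter *= max(0.0, min(a[i + 3], b[i + 3]) - max(a[i], b[i]))

    def volume(t):
        return (t[3] - t[0]) * (t[4] - t[1]) * (t[5] - t[2])

    iou = inter / (volume(a) + volume(b) - inter)
    # Squared distance between the two box centers.
    d2 = sum(((a[i] + a[i + 3]) / 2 - (b[i] + b[i + 3]) / 2) ** 2 for i in range(3))
    # Squared diagonal of the smallest enclosing axis-aligned box.
    c2 = sum((max(a[i + 3], b[i + 3]) - min(a[i], b[i])) ** 2 for i in range(3))
    return 1.0 - iou + d2 / c2


target = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
near_miss = (2.0, 0.0, 0.0, 3.0, 1.0, 1.0)
far_miss = (5.0, 0.0, 0.0, 6.0, 1.0, 1.0)
loss_near = center_iou_loss_3d(target, near_miss)
loss_far = center_iou_loss_3d(target, far_miss)
```

Both misses have zero IoU, so a plain 1 - IoU loss would score them identically and give no signal about which is closer; the center term makes the far miss cost more than the near miss.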

Strong-Weak Feature Alignment for 3D Object Detection

Electronics ◽

10.3390/electronics10101205 ◽

2021 ◽

Vol 10 (10) ◽

pp. 1205

Author(s):

Zhiyu Wang ◽

Li Wang ◽

Bin Dai

Keyword(s):

Object Detection ◽

Point Clouds ◽

Autonomous Driving ◽

Feature Representation ◽

Alignment Algorithm ◽

3D Object ◽

3D Point Clouds ◽

Object Feature ◽

3D Object Detection ◽

Feature Alignment

Object detection in 3D point clouds is still a challenging task in autonomous driving. Due to the inherent occlusion and density changes of the point cloud, the data distribution of the same object can change dramatically. In particular, incomplete data caused by sparsity or occlusion cannot represent the complete characteristics of the object. In this paper, we propose a novel strong–weak feature alignment algorithm between complete and incomplete objects for 3D object detection, which explores the correlations within the data. It is an end-to-end adaptive network that does not require additional data and can be easily applied to other object detection networks. Through a complete object feature extractor, we obtain a robust feature representation of the object. It serves as a guiding feature that helps the incomplete object feature generator produce effective features. The strong–weak feature alignment algorithm reduces the gap between different states of the same object and enhances the ability to represent the incomplete object. The proposed adaptation framework is validated on the KITTI object benchmark and achieves about a 6% improvement in 3D detection average precision at moderate difficulty compared to the base model. The results show that our adaptation method improves the detection performance on incomplete 3D objects.
