SynPo-Net—Accurate and Fast CNN-Based 6DoF Object Pose Estimation Using Synthetic Training

Estimation and tracking of 6DoF poses of objects in images is a challenging problem of great importance for robotic interaction and augmented reality. Recent approaches applying deep neural networks for pose estimation have shown encouraging results. However, most of them rely on training with real images of objects, which brings severe limitations concerning ground truth pose acquisition, full coverage of possible poses, training dataset scaling, and generalization capability. This paper presents a novel approach using a Convolutional Neural Network (CNN) trained exclusively on single-channel Synthetic images of objects to regress 6DoF object Poses directly (SynPo-Net). The proposed SynPo-Net combines a network architecture specifically designed for pose regression with a domain adaptation scheme that transforms real and synthetic images into an intermediate domain better suited to establishing correspondences. The extensive evaluation shows that our approach significantly outperforms the state of the art among methods using synthetic training in terms of both accuracy and speed. Our system can be used to estimate the 6DoF pose from a single frame, or be integrated into a tracking system to provide the initial pose.

Voxel-Based Scene Representation for Camera Pose Estimation of a Single RGB Image

Applied Sciences ◽

10.3390/app10248866 ◽

2020 ◽

Vol 10 (24) ◽

pp. 8866

Author(s):

Sangyoon Lee ◽

Hyunki Hong ◽

Changkyoung Eem

Keyword(s):

Pose Estimation ◽

Estimation Method ◽

Ground Truth ◽

Training Dataset ◽

Interest Points ◽

Camera Pose Estimation ◽

Polygonal Region ◽

Camera Pose ◽

End To End ◽

Image Pairs

Deep learning has been utilized in end-to-end camera pose estimation. To improve the performance, we introduce a camera pose estimation method based on a 2D–3D matching scheme with two convolutional neural networks (CNNs). The scene is divided into voxels, whose size and number are computed according to the scene volume and the number of 3D points. We extract inlier points from the 3D point set in a voxel using random sample consensus (RANSAC)-based plane fitting to obtain a set of interest points lying on a major plane. These points are subsequently reprojected onto the image using the ground truth camera pose, following which a polygonal region is identified in each voxel using the convex hull. We designed a training dataset for 2D–3D matching, consisting of inlier 3D points, correspondences across image pairs, and the voxel regions in the image. We trained a hierarchical learning structure with two CNNs on this dataset to detect the voxel regions and obtain the location/description of the interest points. Following successful 2D–3D matching, the camera pose was estimated using an n-point pose solver within RANSAC. The experimental results show that our method can estimate the camera pose more precisely than previous end-to-end estimators.
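
The RANSAC-based plane fitting step mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the distance threshold, the iteration count, and the toy point set are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import random


def plane_from_points(p, q, r):
    """Return (unit_normal, d) for the plane through three points, or None if they are collinear."""
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    # Cross product of the two edge vectors gives the plane normal.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (nx * nx + ny * ny + nz * nz) ** 0.5
    if norm < 1e-12:
        return None
    n = (nx / norm, ny / norm, nz / norm)
    d = -(n[0] * p[0] + n[1] * p[1] + n[2] * p[2])
    return n, d


def ransac_plane(points, threshold=0.01, iterations=200, seed=0):
    """Return the largest set of points lying within `threshold` of a sampled plane."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sketch reproducible
    best = []
    for _ in range(iterations):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        inliers = [
            pt for pt in points
            if abs(n[0] * pt[0] + n[1] * pt[1] + n[2] * pt[2] + d) <= threshold
        ]
        if len(inliers) > len(best):
            best = inliers
    return best


# Toy voxel content: a 5x5 grid on the plane z = 0 plus five off-plane points.
grid = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
outliers = [(0.5, 0.5, 1.0), (1.5, 2.5, -2.0), (3.2, 0.1, 0.7), (2.2, 3.3, 1.5), (0.1, 3.9, -1.1)]
points = grid + outliers
inliers = ransac_plane(points)
```

In the pipeline described above, the inliers of each voxel would then be reprojected into the image and enclosed by a convex hull; those steps are omitted here.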

Semantic Text Segmentation from Synthetic Images of Full-Text Documents

SPIIRAS Proceedings ◽

10.15622/sp.2019.18.6.1381-1406 ◽

2019 ◽

Vol 18 (6) ◽

pp. 1381-1406 ◽

Cited By ~ 2

Author(s):

Lukáš Bureš ◽

Ivan Gruber ◽

Petr Neduchal ◽

Miroslav Hlaváč ◽

Marek Hrúz

Keyword(s):

Full Text ◽

Network Architecture ◽

Character Recognition ◽

Optical Character Recognition ◽

Recognition Rate ◽

Semantic Segmentation ◽

Text Documents ◽

Text Corpora ◽

Novel Approach ◽

Synthetic Images

An algorithm (divided into multiple modules) for generating images of full-text documents is presented. These images can be used to train, test, and evaluate models for Optical Character Recognition (OCR). The algorithm is modular: individual parts can be changed and tweaked to generate the desired images. A method for obtaining background images of paper from already digitized documents is described. For this, a novel approach based on a Variational AutoEncoder (VAE) was used to train a generative model, which enables on-the-fly generation of background images similar to the training ones. The module for printing the text uses large text corpora, a font, and suitable positional and brightness character noise to obtain believable results (for natural-looking aged documents). A few types of page layouts are supported. The system generates a detailed, structured annotation of the synthesized image. Tesseract OCR is used to compare the real-world images to the generated images. The recognition rate is very similar, indicating that the synthetic images have a proper appearance. Moreover, the errors made by the OCR system in both cases are very similar. A fully convolutional encoder-decoder neural network architecture for semantic segmentation of individual characters was trained on the generated images. With this architecture, a recognition accuracy of 99.28% is reached on a test set of synthetic documents.

Relative Camera Pose Estimation using Synthetic Data with Domain Adaptation via Cycle-Consistent Adversarial Networks

Journal of Intelligent & Robotic Systems ◽

10.1007/s10846-021-01439-6 ◽

2021 ◽

Vol 102 (4) ◽

Author(s):

Chenhao Yang ◽

Yuyi Liu ◽

Andreas Zell

Keyword(s):

Pose Estimation ◽

Autonomous Navigation ◽

Domain Adaptation ◽

Synthetic Data ◽

Estimation Methods ◽

Visual Localization ◽

Style Transfer ◽

Camera Pose Estimation ◽

Camera Pose ◽

Synthetic Images

Learning-based visual localization has become promising over the past decades. Since ground truth pose labels are difficult to obtain, recent methods try to learn pose estimation networks using pixel-perfect synthetic data. However, this also introduces the problem of domain bias. In this paper, we first build a Tuebingen Buildings dataset of RGB images collected by a drone in urban scenes and create a 3D model for each scene. A large number of synthetic images are generated based on these 3D models. We take advantage of image style transfer and cycle-consistent adversarial training to predict the relative camera poses of image pairs based on training over synthetic environment data. We propose a relative camera pose estimation approach to solve the continuous localization problem for autonomous navigation of unmanned systems. Unlike existing learning-based camera pose estimation methods that train and test in a single scene, our approach successfully estimates the relative camera poses of multiple city locations with a single trained model. We use the Tuebingen Buildings and the Cambridge Landmarks datasets to evaluate the performance of our approach in a single scene and across scenes. For each dataset, we compare the performance of models trained on real images with that of models trained on synthetic images. We also test our model on the indoor dataset 7Scenes to demonstrate its generalization ability.

Frequency-Domain Fusing Convolutional Neural Network: A Unified Architecture Improving Effect of Domain Adaptation for Fault Diagnosis

Sensors ◽

10.3390/s21020450 ◽

2021 ◽

Vol 21 (2) ◽

pp. 450

Author(s):

Xudong Li ◽

Jianhua Zheng ◽

Mingtao Li ◽

Wenzhen Ma ◽

Yang Hu

Keyword(s):

Neural Network ◽

Fault Diagnosis ◽

Convolutional Neural Network ◽

Frequency Domain ◽

Transfer Learning ◽

Network Architecture ◽

Domain Adaptation ◽

Training Dataset ◽

Testing Dataset ◽

Feature Extractor

In recent years, transfer learning has been widely applied in fault diagnosis to address the inconsistent distribution of the original training dataset and the testing dataset collected online. In particular, the domain adaptation method can solve the problem of the unlabeled testing dataset in transfer learning. Moreover, the Convolutional Neural Network (CNN) is the most widely used network among existing domain adaptation approaches due to its powerful feature extraction capability. However, network design is too empirical, and there is no network design principle based on the frequency domain. In this paper, we propose a unified convolutional neural network architecture for domain adaptation from a frequency-domain perspective, named Frequency-domain Fusing Convolutional Neural Network (FFCNN). FFCNN contains two parts: a frequency-domain fusing layer and a feature extractor. The frequency-domain fusing layer uses convolution operations to filter signals at different frequency bands and combines them into new input signals. These signals are input to the feature extractor to extract features and perform domain adaptation. We apply FFCNN to three domain adaptation methods, and the diagnosis accuracy is improved compared to the typical CNN.

Algorithm combining virtual chromoendoscopy features for colorectal polyp classification

Endoscopy International Open ◽

10.1055/a-1512-5175 ◽

2021 ◽

Vol 09 (10) ◽

pp. E1497-E1503

Author(s):

Ramon-Michel Schreuder ◽

Qurine E.W. van der Zander ◽

Roger Fonollà ◽

Lennard P.L. Gilissen ◽

Arnold Stronkhorst ◽

...

Keyword(s):

Predictive Value ◽

Gold Standard ◽

Network Architecture ◽

Colorectal Polyp ◽

Colorectal Polyps ◽

Training Dataset ◽

High Definition ◽

Still Images ◽

Novel Approach ◽

Incidence And Mortality

Background and study aims: Colonoscopy is considered the gold standard for decreasing colorectal cancer incidence and mortality. Optical diagnosis of colorectal polyps (CRPs) is an ongoing challenge in clinical colonoscopy and its accuracy among endoscopists varies widely. Computer-aided diagnosis (CAD) for CRP characterization may help to improve this accuracy. In this study, we investigated the diagnostic accuracy of a novel algorithm for polyp malignancy classification by exploiting the complementary information revealed by three specific modalities.

Methods: We developed a CAD algorithm for CRP characterization based on high-definition, non-magnified white light (HDWL), blue light imaging (BLI) and linked color imaging (LCI) still images from routine exams. All CRPs were collected prospectively and classified into benign or premalignant using histopathology as gold standard. Images and data were used to train the CAD algorithm using triplet network architecture. Our training dataset was validated using a threefold cross validation.

Results: In total 609 colonoscopy images of 203 CRPs of 154 consecutive patients were collected. A total of 174 CRPs were found to be premalignant and 29 were benign. Combining the triplet network features with all three image enhancement modalities resulted in an accuracy of 90.6 %, 89.7 % sensitivity, 96.6 % specificity, a positive predictive value of 99.4 %, and a negative predictive value of 60.9 % for CRP malignancy classification. The classification time for our CAD algorithm was approximately 90 ms per image.

Conclusions: Our novel approach and algorithm for CRP classification differentiates accurately between benign and premalignant polyps in non-magnified endoscopic images. This is the first algorithm combining three optical modalities (HDWL/BLI/LCI) exploiting the triplet network approach.
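
As a worked check of how the reported figures relate to one another, the sketch below computes the five metrics from a confusion matrix. The abstract does not state the confusion counts. The values used here (156 true positives, 18 false negatives, 28 true negatives, 1 false positive, with premalignant taken as the positive class) are an inference: they are the integers that are consistent with 174 premalignant and 29 benign CRPs and that reproduce the reported percentages after rounding. The function name is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard binary diagnostic metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Inferred counts (an assumption, not taken from the abstract):
# 174 premalignant = 156 TP + 18 FN, 29 benign = 28 TN + 1 FP.
metrics = diagnostic_metrics(tp=156, fn=18, tn=28, fp=1)
percentages = {name: round(value * 100, 1) for name, value in metrics.items()}
```

With these counts the low negative predictive value follows directly from the class imbalance: 18 missed premalignant polyps weigh heavily against only 28 correctly identified benign ones.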

Non-Blind Image Deconvolution Based on “Ringing” Removal Using Convolutional Neural Network

Electronic Imaging ◽

10.2352/issn.2470-1173.2020.10.ipas-180 ◽

2020 ◽

Vol 2020 (10) ◽

pp. 181-1-181-7

Author(s):

Takahiro Kudo ◽

Takanori Fujisawa ◽

Takuro Yamaguchi ◽

Masaaki Ikehara

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Network Architecture ◽

Large Scale ◽

Blind Deconvolution ◽

Training Dataset ◽

Image Deconvolution ◽

Classic Problem ◽

Key Points ◽

Blind Image

Image deconvolution has been an important issue recently. There are two kinds of approaches: non-blind and blind. Non-blind deconvolution is a classic problem of image deblurring, which assumes that the PSF is known and spatially invariant. Recently, Convolutional Neural Networks (CNNs) have been used for non-blind deconvolution. Though CNNs can deal with complex changes for unknown images, some conventional CNN-based methods can only handle small PSFs and do not consider the use of large PSFs in the real world. In this paper we propose a non-blind deconvolution framework based on a CNN that can remove large-scale ringing in a deblurred image. Our method has three key points. The first is that our network architecture is able to preserve both large and small features in the image. The second is that the training dataset is created to preserve the details. The third is that we extend the images to minimize the effects of large ringing on the image borders. In our experiments, we used three kinds of large PSFs and were able to observe high-precision results from our method both quantitatively and qualitatively.

A New Multi-Person Pose Estimation Method Using the Partitioned CenterPose Network

Applied Sciences ◽

10.3390/app11094241 ◽

2021 ◽

Vol 11 (9) ◽

pp. 4241

Author(s):

Jiahua Wu ◽

Hyo Jong Lee

Keyword(s):

Pose Estimation ◽

Human Body ◽

State Of The Art ◽

Estimation Method ◽

Bottom Up ◽

Center Point ◽

Novel Approach ◽

Body Joints

In bottom-up multi-person pose estimation, grouping joint candidates into the appropriately structured corresponding instance of a person is challenging. In this paper, a new bottom-up method, the Partitioned CenterPose (PCP) Network, is proposed to better cluster the detected joints. To achieve this goal, we propose a novel approach called Partition Pose Representation (PPR) which integrates the instance of a person and its body joints based on joint offset. PPR leverages information about the center of the human body and the offsets between that center point and the positions of the body’s joints to encode human poses accurately. To enhance the relationships between body joints, we divide the human body into five parts, and then, we generate a sub-PPR for each part. Based on this PPR, the PCP Network can detect people and their body joints simultaneously, then group all body joints according to joint offset. Moreover, an improved l1 loss is designed to more accurately measure joint offset. Using the COCO keypoints and CrowdPose datasets for testing, it was found that the performance of the proposed method is on par with that of existing state-of-the-art bottom-up methods in terms of accuracy and speed.
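
The idea of encoding a pose as a body center plus per-joint offsets, and then grouping detected joints by those offsets, can be sketched as follows. This is a simplified illustration and not the PCP Network itself: taking the center as the mean of the joints, the nearest-neighbour grouping rule, and all function names are assumptions, and the five-part sub-PPR partition is not modeled.

```python
import math


def encode_pose(joints):
    """Encode 2D joints as (center, offsets); the center is assumed to be the joint mean."""
    cx = sum(x for x, _ in joints) / len(joints)
    cy = sum(y for _, y in joints) / len(joints)
    return (cx, cy), [(x - cx, y - cy) for x, y in joints]


def decode_pose(center, offsets):
    """Recover joint positions by adding each offset back to the center."""
    cx, cy = center
    return [(cx + dx, cy + dy) for dx, dy in offsets]


def assign_to_person(candidate, expected_positions):
    """Return the index of the person whose expected joint position is nearest to the candidate."""
    return min(
        range(len(expected_positions)),
        key=lambda i: math.dist(candidate, expected_positions[i]),
    )


joints = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
center, offsets = encode_pose(joints)
restored = decode_pose(center, offsets)

# A detected joint candidate is grouped with the person whose center + offset
# prediction for that joint type lies closest to it.
owner = assign_to_person((9.5, 1.0), [restored[1], (10.0, 1.0)])
```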

A Study of Features and Deep Neural Network Architectures and Hyper-Parameters for Domestic Audio Classification

Applied Sciences ◽

10.3390/app11114880 ◽

2021 ◽

Vol 11 (11) ◽

pp. 4880

Author(s):

Abigail Copiaco ◽

Christian Ritz ◽

Nidhal Abdulaziz ◽

Stefano Fasciani

Keyword(s):

Network Architecture ◽

Single Channel ◽

Classification Performance ◽

Network Size ◽

Directed Acyclic Graphs ◽

Spectral Features ◽

Audio Classification ◽

Resource Requirements ◽

Efficient Alternative ◽

Computational Resources

Recent methodologies for audio classification frequently involve cepstral and spectral features, applied to single-channel recordings of acoustic scenes and events. Further, the concept of transfer learning has been widely used over the years, and has proven to provide an efficient alternative to training neural networks from scratch. The lower time and resource requirements when using pre-trained models allow for more versatility in developing system classification approaches. However, information on classification performance when using different features for multi-channel recordings is often limited. Furthermore, pre-trained networks are initially trained on bigger databases and are often unnecessarily large. This poses a challenge when developing systems for devices with limited computational resources, such as mobile or embedded devices. This paper presents a detailed study of the most apparent and widely used cepstral and spectral features for multi-channel audio applications. Accordingly, we propose the use of spectro-temporal features. Additionally, the paper details the development of a compact version of the AlexNet model for computationally limited platforms through studies of performance against various architectural and parameter modifications of the original network. The aim is to minimize the network size while maintaining the series network architecture and preserving the classification accuracy. Considering that other state-of-the-art compact networks present complex directed acyclic graphs, a series architecture offers an advantage in customizability. Experimentation was carried out in Matlab, using a database that we have generated for this task, which is composed of four-channel synthetic recordings of both sound events and scenes. The top-performing methodology resulted in a weighted F1-score of 87.92% for scalogram features classified via the modified AlexNet-33 network, which has a size of 14.33 MB. The AlexNet network returned 86.24% at a size of 222.71 MB.

RobotP: A Benchmark Dataset for 6D Object Pose Estimation

Sensors ◽

10.3390/s21041299 ◽

2021 ◽

Vol 21 (4) ◽

pp. 1299

Author(s):

Honglin Yuan ◽

Tim Hoogenkamp ◽

Remco C. Veltkamp

Keyword(s):

Pose Estimation ◽

Ground Truth ◽

3D Models ◽

Depth Image ◽

Great Success ◽

Estimation Algorithms ◽

Depth Images ◽

Object Pose Estimation ◽

Image Pairs ◽

Bounding Boxes

Deep learning has achieved great success on robotic vision tasks. However, when compared with other vision-based tasks, it is difficult to collect a representative and sufficiently large training set for six-dimensional (6D) object pose estimation, due to the inherent difficulty of data collection. In this paper, we propose the RobotP dataset consisting of commonly used objects for benchmarking in 6D object pose estimation. To create the dataset, we apply a 3D reconstruction pipeline to produce high-quality depth images, ground truth poses, and 3D models for well-selected objects. Subsequently, based on the generated data, we produce object segmentation masks and two-dimensional (2D) bounding boxes automatically. To further enrich the data, we synthesize a large number of photo-realistic color-and-depth image pairs with ground truth 6D poses. Our dataset is freely distributed to research groups by the Shape Retrieval Challenge benchmark on 6D pose estimation. Based on our benchmark, different learning-based approaches are trained and tested by the unified dataset. The evaluation results indicate that there is considerable room for improvement in 6D object pose estimation, particularly for objects with dark colors, and photo-realistic images are helpful in increasing the performance of pose estimation algorithms.

Experimental Evaluation of Computer Vision and Machine Learning-Based UAV Detection and Ranging

Drones ◽

10.3390/drones5020037 ◽

2021 ◽

Vol 5 (2) ◽

pp. 37

Author(s):

Bingsheng Wei ◽

Martin Barczyk

Keyword(s):

Machine Learning ◽

Mean Squared Error ◽

Tracking System ◽

Ground Truth ◽

White Background ◽

Cascade Classifier ◽

Detection Algorithms ◽

Squared Error ◽

Test Conditions ◽

Video Feed

We consider the problem of vision-based detection and ranging of a target UAV using the video feed from a monocular camera onboard a pursuer UAV. Our previously published work in this area employed a cascade classifier algorithm to locate the target UAV, which was found to perform poorly in complex background scenes. We thus study the replacement of the cascade classifier algorithm with newer machine learning-based object detection algorithms. Five candidate algorithms are implemented and quantitatively tested in terms of their efficiency (measured as frames per second processing rate), accuracy (measured as the root mean squared error between ground truth and detected location), and consistency (measured as mean average precision) in a variety of flight patterns, backgrounds, and test conditions. Assigning relative weights of 20%, 40% and 40% to these three criteria, we find that when flying over a white background, the top three performers are YOLO v2 (76.73 out of 100), Faster RCNN v2 (63.65 out of 100), and Tiny YOLO (59.50 out of 100), while over a realistic background, the top three performers are Faster RCNN v2 (54.35 out of 100), SSD MobileNet v1 (51.68 out of 100), and SSD Inception v2 (50.72 out of 100), leading us to recommend Faster RCNN v2. We then provide a roadmap for further work in integrating the object detector into our vision-based UAV tracking system.
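
The 20%/40%/40% weighting described in this abstract amounts to a weighted sum of three per-criterion scores. The sketch below shows that arithmetic only. The abstract does not say how each criterion is normalized to a 0 to 100 scale, and it does not report per-criterion scores, so the sub-scores below are hypothetical values chosen purely to illustrate the calculation.

```python
# Relative weights from the abstract, kept as integer percentages to avoid
# floating-point rounding in the weighted sum.
WEIGHTS_PERCENT = {"efficiency": 20, "accuracy": 40, "consistency": 40}


def overall_score(sub_scores):
    """Combine per-criterion scores (each assumed to be on a 0-100 scale) into one score."""
    return sum(WEIGHTS_PERCENT[name] * sub_scores[name] for name in WEIGHTS_PERCENT) / 100


# Hypothetical sub-scores for one detector (not taken from the paper).
example = {"efficiency": 50, "accuracy": 75, "consistency": 100}
score = overall_score(example)  # (20*50 + 40*75 + 40*100) / 100
```

Because accuracy and consistency together carry 80% of the weight, a fast but imprecise detector cannot rank highly under this scheme, which fits the recommendation of Faster RCNN v2 for realistic backgrounds.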
