DASOT: A Unified Framework Integrating Data Association and Single Object Tracking for Online Multi-Object Tracking

2020 ◽  
Vol 34 (07) ◽  
pp. 10672-10679
Author(s):  
Qi Chu ◽  
Wanli Ouyang ◽  
Bin Liu ◽  
Feng Zhu ◽  
Nenghai Yu

In this paper, we propose an online multi-object tracking (MOT) approach that integrates data association and single object tracking (SOT) in a unified convolutional network (ConvNet), named DASOTNet. The intuition behind integrating data association and SOT is that they can complement each other. Following the Siamese network architecture, DASOTNet consists of a shared feature ConvNet, a data association branch and an SOT branch. Data association is treated as a special re-identification task and solved by learning discriminative features for different targets in the data association branch. To handle the problem that the computational cost of SOT grows intolerably as the number of tracked objects increases, we propose an efficient two-stage tracking method in the SOT branch, which exploits the merits of correlation features and can track all existing targets simultaneously within one forward propagation. Through feature sharing and the interaction between them, the data association branch and the SOT branch learn to better complement each other. Using a multi-task objective, the whole network can be trained end-to-end. Compared with state-of-the-art online MOT methods, our method is much faster while maintaining comparable performance.
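As a hedged illustration of the architecture this abstract describes, the PyTorch sketch below wires a shared feature ConvNet into a data-association (re-identification) head and an SOT regression head, trained with a multi-task objective. The backbone, layer sizes, embedding dimension, and equal loss weighting are assumptions for illustration, not DASOTNet's actual configuration.

```python
# Minimal sketch: shared ConvNet + re-ID branch + SOT branch, multi-task loss.
import torch
import torch.nn as nn

class DASOTNetSketch(nn.Module):
    def __init__(self, num_ids=100, embed_dim=128):
        super().__init__()
        # Shared feature ConvNet (stand-in backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Data association branch: discriminative embeddings per identity.
        self.assoc_head = nn.Linear(64, embed_dim)
        self.id_classifier = nn.Linear(embed_dim, num_ids)
        # SOT branch: regress a box offset for each tracked target.
        self.sot_head = nn.Linear(64, 4)

    def forward(self, crops):
        feat = self.backbone(crops)        # shared features
        embed = self.assoc_head(feat)      # re-ID embeddings
        id_logits = self.id_classifier(embed)
        box_offsets = self.sot_head(feat)  # per-target box regression
        return id_logits, box_offsets

model = DASOTNetSketch()
crops = torch.randn(8, 3, 64, 64)          # 8 target crops in one batch
id_logits, boxes = model(crops)
ids = torch.randint(0, 100, (8,))
gt_boxes = torch.randn(8, 4)
# Multi-task objective: identity classification + box regression.
loss = nn.CrossEntropyLoss()(id_logits, ids) + nn.SmoothL1Loss()(boxes, gt_boxes)
loss.backward()
```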

Author(s):  
Wei Huang ◽  
Xiaoshu Zhou ◽  
Mingchao Dong ◽  
Huaiyu Xu

Abstract. Robust, high-performance visual multi-object tracking is a major challenge in computer vision, especially in drone scenarios. In this paper, an online Multi-Object Tracking (MOT) approach for UAV systems is proposed to handle the challenges of small-target detection and class imbalance, integrating the merits of a deep high-resolution representation network and a data association method in a unified framework. Specifically, while applying a tracking-by-detection architecture to our tracking framework, a Hierarchical Deep High-resolution network (HDHNet) is proposed, which encourages the model to handle different types and scales of targets and to extract more effective and comprehensive features during online learning. The extracted features are then fed into different prediction networks to recognize targets of interest. In addition, an adjustable fusion loss function combining focal loss and GIoU loss is proposed to address class imbalance and hard samples. During tracking, the detection results in each frame are passed to an improved DeepSORT MOT algorithm, which makes full use of target appearance features for one-by-one matching. Experimental results on the VisDrone2019 MOT benchmark show that the proposed UAV MOT system achieves the highest accuracy and the best robustness compared with state-of-the-art methods.
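The adjustable fusion loss is the most self-contained piece to sketch. Below is a hedged PyTorch rendering of the idea: focal loss to counter class imbalance, a GIoU term for box regression, and a tunable weight between them. The hyperparameters and the simple weighted sum are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Binary focal loss: down-weights easy examples so rare/hard
    # positives dominate the gradient.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def giou_loss(pred, gt):
    # Boxes as (x1, y1, x2, y2). GIoU augments IoU with a penalty for
    # the empty area of the smallest enclosing box.
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter
    iou = inter / union.clamp(min=1e-7)
    ex1, ey1 = torch.min(pred[:, 0], gt[:, 0]), torch.min(pred[:, 1], gt[:, 1])
    ex2, ey2 = torch.max(pred[:, 2], gt[:, 2]), torch.max(pred[:, 3], gt[:, 3])
    enclose = ((ex2 - ex1) * (ey2 - ey1)).clamp(min=1e-7)
    giou = iou - (enclose - union) / enclose
    return (1 - giou).mean()

def fusion_loss(cls_logits, cls_targets, pred_boxes, gt_boxes, lam=1.0):
    # 'lam' is the adjustable trade-off between the two terms.
    return focal_loss(cls_logits, cls_targets) + lam * giou_loss(pred_boxes, gt_boxes)

logits, targets = torch.randn(16), torch.randint(0, 2, (16,)).float()
boxes = torch.tensor([[0., 0., 10., 10.]])
gts = torch.tensor([[1., 1., 11., 11.]])
print(fusion_loss(logits, targets, boxes, gts, lam=2.0))
```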


2018 ◽  
Vol 10 (9) ◽  
pp. 1347 ◽  
Author(s):  
Ting Chen ◽  
Andrea Pennisi ◽  
Zhi Li ◽  
Yanning Zhang ◽  
Hichem Sahli

Multi-Object Tracking (MOT) in airborne videos is a challenging problem due to uncertain airborne vehicle motion, vibrations of the mounted camera, unreliable detections, changes in the size, appearance and motion of the moving objects, and occlusions caused by the interaction between moving and static objects in the scene. To deal with these problems, this work proposes a four-stage hierarchical association framework for multi-object tracking in airborne videos. The proposed framework combines Data Association-based Tracking (DAT) methods with target tracking based on a compressive tracking approach, to robustly track objects in complex airborne surveillance scenes. In each association stage, different sets of tracklets and detections are associated to efficiently handle local tracklet generation, local trajectory construction, global drifting-tracklet correction and global fragmented-tracklet linking. Experiments with challenging airborne videos show significant tracking improvement compared to existing state-of-the-art methods.
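A hedged sketch of the staged association logic, in Python with SciPy's Hungarian solver: each stage matches the remaining tracklets and detections under its own affinity threshold, then hands the leftovers to the next, looser stage. The toy 1D affinity and thresholds are invented stand-ins for the appearance, motion and compressive-tracking cues a real stage would use.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracklets, detections, affinity, threshold):
    """One association stage: Hungarian matching on an affinity matrix,
    keeping only pairs whose affinity exceeds the stage threshold."""
    if not tracklets or not detections:
        return [], list(range(len(tracklets))), list(range(len(detections)))
    cost = np.array([[-affinity(t, d) for d in detections] for t in tracklets])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if -cost[r, c] >= threshold]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    return (matches,
            [i for i in range(len(tracklets)) if i not in matched_t],
            [j for j in range(len(detections)) if j not in matched_d])

# Toy affinity: 1D distance turned into a similarity score.
aff = lambda t, d: 1.0 / (1.0 + abs(t - d))
stages = [0.8, 0.5, 0.2]          # strict local stages first, looser global last
tracklets, detections = [1.0, 5.0, 9.0], [1.2, 5.5, 20.0]
remaining_t = list(range(len(tracklets)))
remaining_d = list(range(len(detections)))
for thr in stages:
    t_sub = [tracklets[i] for i in remaining_t]
    d_sub = [detections[j] for j in remaining_d]
    matches, u_t, u_d = associate(t_sub, d_sub, aff, thr)
    print(f"threshold {thr}: matched "
          f"{[(remaining_t[r], remaining_d[c]) for r, c in matches]}")
    remaining_t = [remaining_t[i] for i in u_t]   # leftovers go to next stage
    remaining_d = [remaining_d[j] for j in u_d]
```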


Sensors ◽  
2020 ◽  
Vol 20 (8) ◽  
pp. 2381
Author(s):  
Dan Li ◽  
Kaifeng Zhang ◽  
Zhenbo Li ◽  
Yifei Chen

Statistics on different kinds of pig behaviors can reflect the animals' health status. Traditionally, however, such statistics were obtained by humans watching and annotating videos. To reduce labor and time consumption, this paper proposes a pig behavior recognition network, a spatiotemporal convolutional network based on the SlowFast architecture, for five-category behavior classification. First, a pig behavior video dataset (PBVD-5) was built by cutting short clips from three months of non-stop video recordings, covering five categories of pig behavior: feeding, lying, motoring, scratching and mounting. Subsequently, a SlowFast-based spatiotemporal convolutional network for pig multi-behavior recognition (PMB-SCN) was proposed. Variant architectures of the PMB-SCN were implemented, and the optimal one was compared with state-of-the-art single-stream 3D convolutional networks on our dataset. Our 3D pig behavior recognition network achieved a top-1 accuracy of 97.63% and a views accuracy of 96.35% on the PBVD test set, and a top-1 accuracy of 91.87% and a views accuracy of 84.47% on a new test set collected from a completely different pigsty. The experimental results show that the network generalizes remarkably well and opens the possibility of subsequent simultaneous pig detection and behavior recognition.
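For concreteness, here is a minimal PyTorch sketch of the SlowFast idea underlying PMB-SCN: a Slow pathway sees few frames with many channels, a Fast pathway sees all frames with few channels, and their pooled features are fused for the five-way behavior classification. Channel counts, kernel sizes, and the late-fusion choice are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SlowFastSketch(nn.Module):
    def __init__(self, num_classes=5, alpha=8):
        super().__init__()
        self.alpha = alpha  # Fast pathway samples alpha x more frames
        self.slow = nn.Sequential(   # few frames, many channels
            nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2),
                      padding=(0, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fast = nn.Sequential(   # all frames, few channels
            nn.Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64 + 8, num_classes)

    def forward(self, clip):
        # clip: (batch, 3, T, H, W); Slow pathway subsamples time by alpha.
        slow_feat = self.slow(clip[:, :, ::self.alpha]).flatten(1)
        fast_feat = self.fast(clip).flatten(1)
        return self.fc(torch.cat([slow_feat, fast_feat], dim=1))

model = SlowFastSketch()
clip = torch.randn(2, 3, 32, 112, 112)   # 2 clips, 32 frames each
logits = model(clip)                      # (2, 5) behavior scores
```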


2019 ◽  
Author(s):  
Ben Moseley ◽  
Tarje Nissen-Meyer ◽  
Andrew Markham

Abstract. The simulation of seismic waves is a core task in many geophysical applications. Numerical methods such as Finite Difference (FD) modelling and Spectral Element Methods (SEM) are the most popular techniques for simulating seismic waves in complex media, but for many tasks their computational cost is prohibitively expensive. In this work we present two types of deep neural networks as fast alternatives for simulating seismic waves in horizontally layered and faulted 2D acoustic media. In contrast to the classical methods, both networks are able to simulate the seismic response at multiple locations within the media in a single inference step, without needing to iteratively model the seismic wavefield through time, resulting in an order-of-magnitude reduction in simulation time. This speed improvement could pave the way to real-time seismic simulation and benefit seismic inversion algorithms based on forward modelling, such as full waveform inversion. Our first network simulates seismic waves in horizontally layered media. We use a WaveNet architecture and show that it is more accurate than a standard convolutional network design. Furthermore, we show that seismic inversion can be carried out by retraining the network with its inputs and outputs reversed, offering a fast alternative to existing inversion techniques. Our second network is significantly more general than the first: it simulates seismic waves in faulted media with arbitrary layers, fault properties and an arbitrary location of the seismic source on the surface of the media. It uses a convolutional autoencoder design and is conditioned on the input source location. We investigate the sensitivity of simulation accuracy to different network designs and training hyperparameters, and we compare and contrast this network with the first. To train both networks we introduce a time-dependent gain in the loss function, which improves convergence. We discuss the relative merits of our approach compared with FD modelling, and how it could be generalised to simulate more complex Earth models.
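The time-dependent gain in the loss function is simple to sketch. Below is a hedged PyTorch version of the idea: since later arrivals in a seismic trace are weaker, the per-sample residual is scaled by a weight that grows with time before averaging. The power-law gain profile here is an assumption for illustration; the abstract does not specify the exact gain.

```python
import torch

def gained_mse(pred, target, power=1.0):
    # pred, target: (batch, time) seismic traces. Later samples get a
    # larger weight so weak late arrivals still shape the gradient.
    t = torch.arange(1, pred.shape[-1] + 1, dtype=pred.dtype)
    gain = t ** power
    gain = gain / gain.mean()      # keep the overall loss scale stable
    return (gain * (pred - target) ** 2).mean()

pred = torch.randn(4, 600, requires_grad=True)   # network output traces
target = torch.randn(4, 600)                     # simulated ground truth
loss = gained_mse(pred, target, power=2.0)
loss.backward()
```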


Author(s):  
Daniel Groos ◽  
Heri Ramampiaro ◽  
Espen AF Ihlen

Abstract. Single-person human pose estimation facilitates markerless movement analysis in sports, as well as in clinical applications. Still, state-of-the-art models for human pose estimation generally do not meet the requirements of real-life applications. The proliferation of deep learning techniques has resulted in the development of many advanced approaches. However, with progress in the field, more complex and inefficient models have also been introduced, causing tremendous increases in computational demands. To cope with these complexity and inefficiency challenges, we propose a novel convolutional neural network architecture, called EfficientPose, which exploits recently proposed EfficientNets to deliver efficient and scalable single-person pose estimation. EfficientPose is a family of models harnessing an effective multi-scale feature extractor and computationally efficient detection blocks built on mobile inverted bottleneck convolutions, while still improving the precision of the pose configurations. Due to its low complexity and efficiency, EfficientPose enables real-world applications on edge devices by limiting the memory footprint and computational cost. The results from our experiments, using the challenging MPII single-person benchmark, show that the proposed EfficientPose models substantially outperform the widely used OpenPose model in terms of both accuracy and computational efficiency. In particular, our top-performing model achieves state-of-the-art accuracy on single-person MPII with low-complexity ConvNets.
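As a hedged illustration, the sketch below implements a mobile inverted bottleneck convolution (MBConv), the EfficientNet building block the abstract refers to: expand channels pointwise, filter with a cheap depthwise convolution, and project back down, with a residual connection when shapes match. The expansion factor and the omission of squeeze-and-excitation are simplifications.

```python
import torch
import torch.nn as nn

class MBConvSketch(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),        # expand
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),              # depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),        # project
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)   # residual: input and output shapes match

x = torch.randn(1, 32, 64, 64)
y = MBConvSketch(32)(x)            # same shape, far fewer FLOPs than dense conv
```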


Author(s):  
Zhizhong Han ◽  
Mingyang Shang ◽  
Xiyang Wang ◽  
Yu-Shen Liu ◽  
Matthias Zwicker

Jointly learning representations of 3D shapes and text is crucial to support tasks such as cross-modal retrieval or shape captioning. A recent method employs 3D voxels to represent 3D shapes, but this limits the approach to low resolutions due to the computational cost caused by the cubic complexity of 3D voxels. Hence the method suffers from a lack of detailed geometry. To resolve this issue, we propose Y2Seq2Seq, a view-based model, to learn cross-modal representations by joint reconstruction and prediction of view and word sequences. Specifically, the network architecture of Y2Seq2Seq bridges the semantic meaning embedded in the two modalities by two coupled "Y"-like sequence-to-sequence (Seq2Seq) structures. In addition, our novel hierarchical constraints further increase the discriminability of the cross-modal representations by employing more detailed discriminative information. Experimental results on cross-modal retrieval and 3D shape captioning show that Y2Seq2Seq outperforms the state-of-the-art methods.
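A loose PyTorch sketch of the cross-modal core: a view-sequence encoder and a word-sequence encoder map a 3D shape and its caption into one embedding space, trained so matching pairs score highest. The coupled "Y"-shaped reconstruction and prediction decoders and the hierarchical constraints of Y2Seq2Seq are omitted; all sizes and the contrastive-style objective are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalSketch(nn.Module):
    def __init__(self, vocab=1000, view_feat=512, dim=256):
        super().__init__()
        self.view_rnn = nn.GRU(view_feat, dim, batch_first=True)  # view sequence
        self.embed = nn.Embedding(vocab, dim)
        self.word_rnn = nn.GRU(dim, dim, batch_first=True)        # word sequence

    def forward(self, views, words):
        _, shape_h = self.view_rnn(views)            # (1, batch, dim)
        _, text_h = self.word_rnn(self.embed(words))
        return shape_h.squeeze(0), text_h.squeeze(0)

model = CrossModalSketch()
views = torch.randn(4, 12, 512)                      # 12 rendered views per shape
words = torch.randint(0, 1000, (4, 16))              # 16-token captions
s, t = model(views, words)
# Contrastive-style objective: matching shape/text pairs on the diagonal.
sim = s @ t.T
loss = nn.CrossEntropyLoss()(sim, torch.arange(4))
loss.backward()
```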


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 452
Author(s):  
Wenjie Yang ◽  
Jianlin Zhang ◽  
Jingju Cai ◽  
Zhiyong Xu

Graph convolutional networks (GCNs) have brought considerable improvement to the skeleton-based action recognition task. Existing GCN-based methods usually use a fixed spatial graph size across all layers, which severely limits the model's ability to exploit global and semantic discriminative information due to restricted receptive fields. Furthermore, the fixed graph size causes many redundancies in the representation of actions, which is inefficient and can hinder the model from focusing on beneficial features. To address these issues, we propose a plug-and-play channel adaptive merging module (CAMM), specific to the human skeleton graph, that merges vertices from the same part of the skeleton graph adaptively and efficiently. The merge weights differ across channels, so each channel has the flexibility to integrate the joints in its own way. We then build a novel shallow graph convolutional network (SGCN) based on this module, which achieves state-of-the-art performance at lower computational cost. Experimental results on NTU-RGB+D and Kinetics-Skeleton illustrate the superiority of our methods.
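A hedged sketch of the channel-adaptive merging idea: joints belonging to the same body part are pooled into a single vertex with learnable weights that differ per channel. The part grouping, the softmax normalization, and the tensor layout below are assumptions for illustration, not the exact CAMM formulation.

```python
import torch
import torch.nn as nn

class ChannelAdaptiveMerge(nn.Module):
    def __init__(self, parts, channels):
        # parts: list of joint-index lists, e.g. [[0, 1, 2], [3, 4], ...]
        super().__init__()
        self.parts = parts
        # One learnable merge-weight vector per part, per channel.
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.ones(channels, len(p))) for p in parts])

    def forward(self, x):
        # x: (batch, channels, time, joints) skeleton feature map.
        merged = []
        for p, w in zip(self.parts, self.weights):
            w = torch.softmax(w, dim=-1)                 # per-channel weights
            part = x[..., p]                             # (B, C, T, |p|)
            merged.append((part * w[None, :, None, :]).sum(-1))
        return torch.stack(merged, dim=-1)               # (B, C, T, num_parts)

parts = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # toy grouping
merge = ChannelAdaptiveMerge(parts, channels=64)
x = torch.randn(2, 64, 30, 13)          # 13 joints over 30 frames
y = merge(x)                            # graph shrinks from 13 to 4 vertices
```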


2021 ◽  
Author(s):  
Amandeep Kaur ◽  
Vinayak Singh ◽  
Gargi Chakraverty

With advances in technology and computational capability, identifying retinal damage with state-of-the-art CNN architectures enables speedy and precise diagnosis, thus inhibiting further disease development. In this study, we focus on classifying retinal damage by detecting choroidal neovascularization (CNV), diabetic macular edema (DME), DRUSEN, and NORMAL in optical coherence tomography (OCT) images. The emphasis of our experiments is to investigate the role of depth in the neural network architecture. We introduce a shallow convolutional neural network, LightOCT, which outperforms the other, deeper model configurations, with the lowest LVCEL value and the highest accuracy (above 98% in each class). Next, we experimented to find the best-fit optimizer for LightOCT; the results showed that the combination of LightOCT and Adam gave the most optimal results. Finally, we compare our approach with transfer learning models: LightOCT outperforms the state-of-the-art models in computational cost and training time, and gives comparable accuracy. In future work, we will aim to improve the accuracy of shallow models further, reducing the trade-off between training time and accuracy.
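As a hedged illustration of the shallow-network theme, the sketch below builds a two-block convolutional classifier for the four OCT classes and trains it with Adam, the optimizer the study found most effective. Depths, widths, and input size are invented for illustration, and the paper's LVCEL criterion is replaced by ordinary cross-entropy.

```python
import torch
import torch.nn as nn

# Shallow classifier: two conv blocks, then a small linear head.
shallow_oct = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 4),                      # CNV / DME / DRUSEN / NORMAL
)

images = torch.randn(8, 1, 224, 224)       # grayscale OCT scans
labels = torch.randint(0, 4, (8,))
optimizer = torch.optim.Adam(shallow_oct.parameters(), lr=1e-3)
loss = nn.CrossEntropyLoss()(shallow_oct(images), labels)
loss.backward()
optimizer.step()
```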


2021 ◽  
Author(s):  
Venkatesh Parvathala ◽  
Sri Rama Murty Kodukula ◽  
Siva Ganesh Andhavarapu

In this paper, we demonstrate the significance of restoring the harmonics of the fundamental frequency (pitch) in deep neural network (DNN) based speech enhancement. We propose a sliding-window attention network to regress the spectral magnitude mask (SMM) from the noisy speech signal. Even though the network parameters can be estimated by minimizing the mask loss, this does not restore the pitch harmonics, especially at higher frequencies. We therefore propose to restore the pitch harmonics in the spectral domain by minimizing a cepstral loss around the pitch peak. The network parameters are estimated using a combination of the mask loss and the cepstral loss. The proposed network architecture functions like an adaptive comb filter on voiced segments, emphasizing the pitch harmonics in the speech spectrum. The proposed approach achieves performance comparable to state-of-the-art methods with much lower computational complexity.
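A hedged PyTorch sketch of the combined objective: a mean-squared mask loss on the SMM plus a cepstral loss restricted to a window around the pitch peak, which is what pushes the network to restore pitch harmonics. The window size, loss weighting, and the use of a real cepstrum via the inverse FFT of the log magnitude are assumptions for illustration.

```python
import torch

def cepstrum(mag, eps=1e-8):
    # Real cepstrum of a one-sided magnitude spectrum: IFFT of log magnitude.
    return torch.fft.irfft(torch.log(mag + eps), dim=-1)

def combined_loss(pred_mask, gt_mask, noisy_mag, clean_mag,
                  pitch_bin, window=5, lam=0.5):
    mask_loss = torch.mean((pred_mask - gt_mask) ** 2)
    enhanced = pred_mask * noisy_mag             # apply SMM to noisy spectrum
    c_pred = cepstrum(enhanced)
    c_clean = cepstrum(clean_mag)
    # Penalize the cepstral mismatch only near the pitch quefrency.
    lo, hi = pitch_bin - window, pitch_bin + window + 1
    cep_loss = torch.mean((c_pred[..., lo:hi] - c_clean[..., lo:hi]) ** 2)
    return mask_loss + lam * cep_loss

pred_mask = torch.rand(1, 257, requires_grad=True)   # 257 = one-sided bins
gt_mask = torch.rand(1, 257)
noisy_mag = torch.rand(1, 257) + 0.1
clean_mag = torch.rand(1, 257) + 0.1
loss = combined_loss(pred_mask, gt_mask, noisy_mag, clean_mag, pitch_bin=50)
loss.backward()
```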


Author(s):  
Kuncheng Fang ◽  
Lian Zhou ◽  
Cheng Jin ◽  
Yuejie Zhang ◽  
Kangnian Weng ◽  
...  

Automatically generating natural language descriptions for video is an extremely complicated and challenging task. To overcome the limitations of traditional LSTM-based models for video captioning, we propose a novel architecture to generate optimal descriptions for videos, which focuses on constructing a new network structure that generates sentences superior to those of the basic LSTM model, and on establishing special attention mechanisms that provide more useful visual information for caption generation. The scheme discards the traditional LSTM and instead exploits a fully convolutional network with coarse-to-fine and inherited attention, designed according to the characteristics of the fully convolutional structure. Our model not only outperforms the basic LSTM-based model, but also achieves performance comparable to state-of-the-art methods.
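A loose sketch of an LSTM-free caption decoder in PyTorch: stacked causal 1D convolutions over word embeddings, conditioned on a pooled video feature. The coarse-to-fine and inherited attention mechanisms of the paper are not reproduced; the sizes and the additive conditioning are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvCaptionerSketch(nn.Module):
    def __init__(self, vocab=5000, dim=256, video_feat=1024, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.video_proj = nn.Linear(video_feat, dim)
        # Causal convs: inputs are left-padded so step t never sees t+1.
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3) for _ in range(layers)])
        self.out = nn.Linear(dim, vocab)

    def forward(self, words, video):
        x = self.embed(words) + self.video_proj(video).unsqueeze(1)
        x = x.transpose(1, 2)                         # (B, dim, T)
        for conv in self.convs:
            x = torch.relu(conv(F.pad(x, (2, 0))))    # left-pad keeps causality
        return self.out(x.transpose(1, 2))            # (B, T, vocab) logits

model = ConvCaptionerSketch()
words = torch.randint(0, 5000, (2, 12))               # previous tokens
video = torch.randn(2, 1024)                          # pooled video feature
logits = model(words, video)                          # next-word distributions
```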

