ViOC-optical Alphanumeric Character Extraction from Video Frames

We have developed a generalized alphanumeric character extraction algorithm that can efficiently and accurately locate and extract characters from complex scene images. A scene image may be complex due to the following reasons: (1) the characters are embedded in an image with other objects, such as structural bars, company logos and smears; (2) the characters may be painted or printed in any color including white, and the background color may differ only slightly from that of the characters; (3) the font, size and format of the characters may be different; and (4) the lighting may be uneven. The main contribution of this research is that it permits the quick and accurate extraction of characters in a complex scene. A coarse search technique is used to locate potential characters, and then a fine grouping technique is used to extract characters accurately. Several additional techniques in the postprocessing phase eliminate spurious as well as overlapping characters. Experimental results of segmenting characters written on cargo container surfaces show that the system is feasible under real-life constraints. The program has been installed as part of a vision system which verifies container codes on vehicles passing through the Port of Singapore.

Download Full-text

A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing

Data ◽

10.3390/data6080087 ◽

2021 ◽

Vol 6 (8) ◽

pp. 87

Author(s):

Sara Ferreira ◽

Mário Antunes ◽

Manuel E. Correia

Keyword(s):

Machine Learning ◽

Digital Forensics ◽

State Of The Art ◽

Forensic Analysis ◽

Third Party ◽

Support Vector ◽

Multimedia Content ◽

Digital Forensic ◽

Video Frames ◽

Forensic Tools

Deepfake and manipulated digital photos and videos are being increasingly used in a myriad of cybercrimes. Ransomware, the dissemination of fake news, and digital kidnapping-related crimes are the most recurrent, in which tampered multimedia content has been the primordial disseminating vehicle. Digital forensic analysis tools are being widely used by criminal investigations to automate the identification of digital evidence in seized electronic equipment. The number of files to be processed and the complexity of the crimes under analysis have highlighted the need to employ efficient digital forensics techniques grounded on state-of-the-art technologies. Machine Learning (ML) researchers have been challenged to apply techniques and methods to improve the automatic detection of manipulated multimedia content. However, the implementation of such methods have not yet been massively incorporated into digital forensic tools, mostly due to the lack of realistic and well-structured datasets of photos and videos. The diversity and richness of the datasets are crucial to benchmark the ML models and to evaluate their appropriateness to be applied in real-world digital forensics applications. An example is the development of third-party modules for the widely used Autopsy digital forensic application. This paper presents a dataset obtained by extracting a set of simple features from genuine and manipulated photos and videos, which are part of state-of-the-art existing datasets. The resulting dataset is balanced, and each entry comprises a label and a vector of numeric values corresponding to the features extracted through a Discrete Fourier Transform (DFT). The dataset is available in a GitHub repository, and the total amount of photos and video frames is 40,588 and 12,400, respectively. The dataset was validated and benchmarked with deep learning Convolutional Neural Networks (CNN) and Support Vector Machines (SVM) methods; however, a plethora of other existing ones can be applied. Generically, the results show a better F1-score for CNN when comparing with SVM, both for photos and videos processing. CNN achieved an F1-score of 0.9968 and 0.8415 for photos and videos, respectively. Regarding SVM, the results obtained with 5-fold cross-validation are 0.9953 and 0.7955, respectively, for photos and videos processing. A set of methods written in Python is available for the researchers, namely to preprocess and extract the features from the original photos and videos files and to build the training and testing sets. Additional methods are also available to convert the original PKL files into CSV and TXT, which gives more flexibility for the ML researchers to use the dataset on existing ML frameworks and tools.

Download Full-text

GPU-Enabled Serverless Workflows for Efficient Multimedia Processing

Applied Sciences ◽

10.3390/app11041438 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1438

Author(s):

Sebastián Risco ◽

Germán Moltó

Keyword(s):

Video Processing ◽

Cost Effective ◽

Multimedia Processing ◽

Computing Services ◽

Video Frames ◽

Open Source Framework ◽

Cloud Infrastructures ◽

Aws Lambda ◽

Runtime Environments

Serverless computing has introduced scalable event-driven processing in Cloud infrastructures. However, it is not trivial for multimedia processing to benefit from the elastic capabilities featured by serverless applications. To this aim, this paper introduces the evolution of a framework to support the execution of customized runtime environments in AWS Lambda in order to accommodate workloads that do not satisfy its strict computational requirements: increased execution times and the ability to use GPU-based resources. This has been achieved through the integration of AWS Batch, a managed service to deploy virtual elastic clusters for the execution of containerized jobs. In addition, a Functions Definition Language (FDL) is introduced for the description of data-driven workflows of functions. These workflows can simultaneously leverage both AWS Lambda for the highly-scalable execution of short jobs and AWS Batch, for the execution of compute-intensive jobs that can profit from GPU-based computing. To assess the developed open-source framework, we executed a case study for efficient serverless video processing. The workflow automatically generates subtitles based on the audio and applies GPU-based object recognition to the video frames, thus simultaneously harnessing different computing services. This allows for the creation of cost-effective highly-parallel scale-to-zero serverless workflows in AWS.

Download Full-text

A Systematic Deep Learning Based Overhead Tracking and Counting System Using RGB-D Remote Cameras

Applied Sciences ◽

10.3390/app11125503 ◽

2021 ◽

Vol 11 (12) ◽

pp. 5503

Author(s):

Munkhjargal Gochoo ◽

Syeda Amna Rizwan ◽

Yazeed Yasin Ghadi ◽

Ahmad Jalal ◽

Kibum Kim

Keyword(s):

Deep Learning ◽

Human Head ◽

Point Clouds ◽

Head Tracking ◽

Complex Environments ◽

People Tracking ◽

Practical Applications ◽

Remote Cameras ◽

Video Frames ◽

Benchmark Datasets

Automatic head tracking and counting using depth imagery has various practical applications in security, logistics, queue management, space utilization and visitor counting. However, no currently available system can clearly distinguish between a human head and other objects in order to track and count people accurately. For this reason, we propose a novel system that can track people by monitoring their heads and shoulders in complex environments and also count the number of people entering and exiting the scene. Our system is split into six phases; at first, preprocessing is done by converting videos of a scene into frames and removing the background from the video frames. Second, heads are detected using Hough Circular Gradient Transform, and shoulders are detected by HOG based symmetry methods. Third, three robust features, namely, fused joint HOG-LBP, Energy based Point clouds and Fused intra-inter trajectories are extracted. Fourth, the Apriori-Association is implemented to select the best features. Fifth, deep learning is used for accurate people tracking. Finally, heads are counted using Cross-line judgment. The system was tested on three benchmark datasets: the PCDS dataset, the MICC people counting dataset and the GOTPD dataset and counting accuracy of 98.40%, 98%, and 99% respectively was achieved. Our system obtained remarkable results.

Download Full-text

Performance Evaluation of Background Subtraction Techniques for Video Frames

2021 International Conference on Artificial Intelligence (ICAI) ◽

10.1109/icai52203.2021.9445253 ◽

2021 ◽

Author(s):

Salman Qasim ◽

Kaleem Nawaz Khan ◽

Miao Yu ◽

Muhammad Salman Khan

Keyword(s):

Performance Evaluation ◽

Background Subtraction ◽

Video Frames

Download Full-text

Two Stage Continuous Gesture Recognition Based on Deep Learning

Electronics ◽

10.3390/electronics10050534 ◽

2021 ◽

Vol 10 (5) ◽

pp. 534

Author(s):

Huogen Wang

Keyword(s):

Gesture Recognition ◽

Large Scale ◽

Short Term Memory ◽

Short Term ◽

Hand Motion ◽

Spatiotemporal Features ◽

Spatiotemporal Information ◽

Video Frames ◽

Depth Sequences

The paper proposes an effective continuous gesture recognition method, which includes two modules: segmentation and recognition. In the segmentation module, the video frames are divided into gesture frames and transitional frames by using the information of hand motion and appearance, and continuous gesture sequences are segmented into isolated sequences. In the recognition module, our method exploits the spatiotemporal information embedded in RGB and depth sequences. For the RGB modality, our method adopts Convolutional Long Short-Term Memory Networks to learn long-term spatiotemporal features from short-term spatiotemporal features obtained from a 3D convolutional neural network. For the depth modality, our method converts a sequence into Dynamic Images and Motion Dynamic Images through weighted rank pooling and feed them into Convolutional Neural Networks, respectively. Our method has been evaluated on both ChaLearn LAP Large-scale Continuous Gesture Dataset and Montalbano Gesture Dataset and achieved state-of-the-art performance.

Download Full-text

Video Frame Interpolation via Deformable Separable Convolution

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6634 ◽

2020 ◽

Vol 34 (07) ◽

pp. 10607-10614 ◽

Cited By ~ 2

Author(s):

Xianhang Cheng ◽

Zhenzhong Chen

Keyword(s):

State Of The Art ◽

Video Frame ◽

Kernel Size ◽

Frame Interpolation ◽

Interpolation Methods ◽

Video Frames ◽

Convolution Process ◽

Strong Performance ◽

Existing Frames ◽

Better Than

Learning to synthesize non-existing frames from the original consecutive video frames is a challenging task. Recent kernel-based interpolation methods predict pixels with a single convolution process to replace the dependency of optical flow. However, when scene motion is larger than the pre-defined kernel size, these methods yield poor results even though they take thousands of neighboring pixels into account. To solve this problem in this paper, we propose to use deformable separable convolution (DSepConv) to adaptively estimate kernels, offsets and masks to allow the network to obtain information with much fewer but more relevant pixels. In addition, we show that the kernel-based methods and conventional flow-based methods are specific instances of the proposed DSepConv. Experimental results demonstrate that our method significantly outperforms the other kernel-based interpolation methods and shows strong performance on par or even better than the state-of-the-art algorithms both qualitatively and quantitatively.

Download Full-text

Detecting Toe-Off Events Utilizing a Vision-Based Method

Entropy ◽

10.3390/e21040329 ◽

2019 ◽

Vol 21 (4) ◽

pp. 329 ◽

Cited By ~ 4

Author(s):

Yunqi Tang ◽

Zhuorong Li ◽

Huawei Tian ◽

Jianwei Ding ◽

Bingxian Lin

Keyword(s):

Wearable Sensors ◽

Gait Pattern ◽

Video Data ◽

Detection Methods ◽

Detection Accuracy ◽

Public Database ◽

Video Frames ◽

Different Types ◽

Events Detection ◽

Good Detection

Detecting gait events from video data accurately would be a challenging problem. However, most detection methods for gait events are currently based on wearable sensors, which need high cooperation from users and power consumption restriction. This study presents a novel algorithm for achieving accurate detection of toe-off events using a single 2D vision camera without the cooperation of participants. First, a set of novel feature, namely consecutive silhouettes difference maps (CSD-maps), is proposed to represent gait pattern. A CSD-map can encode several consecutive pedestrian silhouettes extracted from video frames into a map. And different number of consecutive pedestrian silhouettes will result in different types of CSD-maps, which can provide significant features for toe-off events detection. Convolutional neural network is then employed to reduce feature dimensions and classify toe-off events. Experiments on a public database demonstrate that the proposed method achieves good detection accuracy.

Download Full-text