Natural Language Description of Videos for Smart Surveillance

Since the September 11 attacks, security and surveillance measures have changed across the globe. Surveillance cameras are now installed almost everywhere, and they produce video in massive volumes. The major challenge faced by security agencies is the effort of analyzing the surveillance video collected and generated daily. The problems posed by these videos are twofold: (1) understanding the contents of video streams, and (2) converting the video contents to condensed formats, such as textual interpretations and summaries, to save storage space. In this paper, we propose a video description framework for surveillance video. The framework is based on multitask learning of high-level features (HLFs) using a convolutional neural network (CNN) and natural language generation (NLG) through bidirectional recurrent networks. For each specific task, a parallel pipeline is derived from the base Visual Geometry Group (VGG)-16 model. The tasks include scene recognition, action recognition, object recognition, and human face-specific feature recognition. Experimental results on the TRECViD, UET Video Surveillance (UETVS), and AGRIINTRUSION datasets show that the model outperforms state-of-the-art methods, with METEOR (Metric for Evaluation of Translation with Explicit ORdering) scores of 33.9%, 34.3%, and 31.2%, respectively. Our results show that the framework has distinct advantages over traditional rule-based models for the recognition and generation of natural language descriptions.
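To make the reported metric concrete, here is a minimal sketch of how a METEOR-style score can be computed for one generated description against one reference. It is not the authors' evaluation code. It assumes exact word matches only (no stemming or synonym matching), a greedy left-to-right alignment instead of the crossing-minimizing alignment, and the originally published parameters (recall weighted 9:1 over precision, fragmentation penalty of 0.5 times the cubed chunk ratio). The function name `simple_meteor` and the example sentences are hypothetical.

```python
def simple_meteor(hypothesis, reference):
    """Simplified METEOR: exact unigram matches, greedy alignment (assumption)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()

    # Greedy alignment: each hypothesis word takes the first unused
    # identical reference word. Real METEOR searches for the alignment
    # with the fewest crossings; this is a simplification.
    used = set()
    alignment = []  # list of (hypothesis index, reference index)
    for i, word in enumerate(hyp):
        for j, ref_word in enumerate(ref):
            if j not in used and word == ref_word:
                used.add(j)
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Harmonic mean weighted heavily toward recall.
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # A chunk is a maximal run of matches that are adjacent in both
    # the hypothesis and the reference.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1

    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)


score = simple_meteor("a man walks across the road",
                      "a man is walking across the road")
```

A perfect copy of the reference does not score exactly 1 under this formulation, because a single chunk still incurs a small penalty that shrinks as the sentence gets longer.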

Systematic Study of Video Mining with Its Applications

10.3233/apc210232 ◽

2021 ◽

Author(s):

Mallappa G. Mendagudli ◽

K.G. Kharade ◽

T. Nadana Ravishankar ◽

K. Vengatesan

Keyword(s):

Data Mining ◽

Content Analysis ◽

Large Datasets ◽

Video Data ◽

Broad Application ◽

Medicine Research ◽

Video Footage ◽

Security Education ◽

Video Mining ◽

High Level

Effective methods for video indexing will become more valuable as digital video data continues to grow. It has been years since we’ve seen this level of new multimedia research. Content analysis aims to create high-level descriptions and annotations by treating language and facts as data. Data mining is a technique that seeks out previously unknown facts and patterns in large datasets. A video can include several different kinds of data, such as images, visuals, audio, text, and additional metadata. Because video is applied broadly across disciplines such as security, education, medicine, research, sports, and entertainment, it is often used differently in each. Data mining aims to discover and articulate interesting patterns that are hidden in large amounts of video footage. While video mining is still in its infancy, data mining is more mature. A considerable amount of research must still be done to turn mined video into usable content.

Recognition of classroom student state features based on deep learning algorithms and machine learning

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189232 ◽

2020 ◽

pp. 1-12

Author(s):

Hu Jingchao ◽

Haiying Zhang

Keyword(s):

Deep Learning ◽

Feature Recognition ◽

Subjective Evaluation ◽

Recognition Algorithm ◽

Video Data ◽

State Recognition ◽

Detection Model ◽

State Classification ◽

Intelligent Models ◽

Model Recognition

The difficulty in classroom student state recognition lies in how to make feature judgments from students' facial expressions and movement states. At present, some intelligent models recognize classroom student states inaccurately. To improve recognition, this study builds a two-level state detection framework based on deep learning and an HMM feature recognition algorithm, and expands it into a multi-level detection model through a reasonable state classification method. In addition, this study selects continuous HMM or deep learning to reflect the dynamic generation characteristics of fatigue, and designs random human fatigue recognition experiments to collect and preprocess EEG data, facial video data, and subjective evaluation data from classroom students. The study also discretizes the feature indicators and builds a student state recognition model. Finally, the performance of the proposed algorithm is analyzed through experiments. The results show that the proposed algorithm has certain advantages over the traditional algorithm in recognizing classroom student state features.
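As an illustration of how an HMM can recover a hidden student state from discretized feature indicators, the following is a minimal Viterbi decoding sketch. It is not the model from the paper: the two states, the two observation symbols, and every probability below are hypothetical values chosen only to make the example run, and the observation sequence is assumed to be non-empty.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for a discrete HMM."""
    # best[t][s] is the probability of the best path ending in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s] is the predecessor of s on that best path

    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace the best path backwards from the most likely final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model with one discretized facial indicator.
STATES = ("alert", "fatigued")
START_P = {"alert": 0.8, "fatigued": 0.2}
TRANS_P = {
    "alert": {"alert": 0.7, "fatigued": 0.3},
    "fatigued": {"alert": 0.2, "fatigued": 0.8},
}
EMIT_P = {
    "alert": {"eyes_open": 0.9, "eyes_closed": 0.1},
    "fatigued": {"eyes_open": 0.3, "eyes_closed": 0.7},
}

decoded = viterbi(
    ["eyes_open", "eyes_open", "eyes_closed", "eyes_closed"],
    STATES, START_P, TRANS_P, EMIT_P,
)
# decoded == ["alert", "alert", "fatigued", "fatigued"]
```

For long sequences the products underflow, so a practical version would sum log-probabilities instead of multiplying raw probabilities.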

Deep-Framework: A Distributed, Scalable, and Edge-Oriented Framework for Real-Time Analysis of Video Streams

Sensors ◽

10.3390/s21124045 ◽

2021 ◽

Vol 21 (12) ◽

pp. 4045

Author(s):

Alessandro Sassu ◽

Jose Francisco Saenz-Cogollo ◽

Maurizio Agelli

Keyword(s):

Deep Learning ◽

Real Time ◽

Video Data ◽

Video Analytics ◽

Web Based ◽

Real Time Analysis ◽

Open Source Framework ◽

Cluster Configuration ◽

Time Requirements ◽

High Level

Edge computing is the best approach for meeting the exponential demand and the real-time requirements of many video analytics applications. Since most of the recent advances in extracting information from images and video rely on computation-heavy deep learning algorithms, there is a growing need for solutions that allow new models to be deployed and used on scalable and flexible edge architectures. In this work, we present Deep-Framework, a novel open source framework for developing edge-oriented real-time video analytics applications based on deep learning. Deep-Framework has a scalable multi-stream architecture based on Docker and hides from the user the complexity of cluster configuration, service orchestration, and GPU resource allocation. It provides Python interfaces for integrating deep learning models developed with the most popular frameworks, and it also provides high-level APIs based on standard HTTP and WebRTC interfaces for consuming the extracted video data on clients running in browsers or on any other web-based platform.

On The Difference Between Natural Language And High Level Query Languages

Proceedings of the 1978 annual conference on - ACM 78 ◽

10.1145/800127.804064 ◽

1978 ◽

Cited By ~ 5

Author(s):

S. Jerrold Kaplan

Keyword(s):

Natural Language ◽

Query Languages ◽

The Difference ◽

High Level

Automated detection of grade-crossing-trespassing near misses based on computer vision analysis of surveillance video data

Safety Science ◽

10.1016/j.ssci.2017.11.023 ◽

2018 ◽

Vol 110 ◽

pp. 276-285 ◽

Cited By ~ 4

Author(s):

Zhipeng Zhang ◽

Chintan Trivedi ◽

Xiang Liu

Keyword(s):

Computer Vision ◽

Automated Detection ◽

Video Data ◽

Surveillance Video ◽

Near Misses ◽

Grade Crossing

Low-Rank Representation with Contextual Regularization for Moving Object Detection in Big Surveillance Video Data

2017 IEEE Third International Conference on Multimedia Big Data (BigMM) ◽

10.1109/bigmm.2017.37 ◽

2017 ◽

Cited By ~ 1

Author(s):

Bo-Hao Chen ◽

Ling-Feng Shi ◽

Xiao Ke

Keyword(s):

Object Detection ◽

Moving Object Detection ◽

Video Data ◽

Moving Object ◽

Low Rank ◽

Surveillance Video ◽

Low Rank Representation

Knowledge Integration Networks for Action Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6983 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12862-12869

Author(s):

Shiwen Zhang ◽

Sheng Guo ◽

Limin Wang ◽

Weilin Huang ◽

Matthew Scott

Keyword(s):

Action Recognition ◽

Large Scale ◽

Knowledge Integration ◽

Scene Recognition ◽

Teacher Networks ◽

Medium Level ◽

Meaningful Context ◽

Context Knowledge ◽

High Level ◽

Context Features

In this work, we propose Knowledge Integration Networks (referred to as KINet) for video action recognition. KINet is capable of aggregating meaningful context features that are of great importance to identifying an action, such as human information and scene context. We design a three-branch architecture consisting of a main branch for action recognition and two auxiliary branches for human parsing and scene recognition, which allow the model to encode knowledge of humans and scenes for action recognition. We explore two pre-trained models as teacher networks to distill the knowledge of humans and scenes for training the auxiliary tasks of KINet. Furthermore, we propose a two-level knowledge encoding mechanism that contains a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into medium-level convolutional features, and an Action Knowledge Graph (AKG) for effectively fusing high-level context information. This results in an end-to-end trainable framework in which the three tasks can be trained collaboratively, allowing the model to compute strong context knowledge efficiently. The proposed KINet achieves state-of-the-art performance on the large-scale action recognition benchmark Kinetics-400, with a top-1 accuracy of 77.8%. We further demonstrate the strong capability of KINet by transferring the Kinetics-trained model to UCF-101, where it obtains 97.8% top-1 accuracy.

EYE-C: Eye-Contact Robust Detection and Analysis during Unconstrained Child-Therapist Interactions in the Clinical Setting of Autism Spectrum Disorders

Brain Sciences ◽

10.3390/brainsci11121555 ◽

2021 ◽

Vol 11 (12) ◽

pp. 1555

Author(s):

Gianpaolo Alvari ◽

Luca Coviello ◽

Cesare Furlanello

Keyword(s):

Clinical Sample ◽

Eye Contact ◽

Video Camera ◽

Autism Spectrum ◽

Individual Characteristics ◽

Video Data ◽

Contact Dynamics ◽

Treatment Programs ◽

Robust Detection ◽

High Level

The high level of heterogeneity in Autism Spectrum Disorder (ASD) and the lack of systematic measurements complicate predicting outcomes of early intervention and the identification of better-tailored treatment programs. Computational phenotyping may assist therapists in monitoring child behavior through quantitative measures and personalizing the intervention based on individual characteristics; still, real-world behavioral analysis is an ongoing challenge. For this purpose, we designed EYE-C, a system based on OpenPose and Gaze360 for fine-grained analysis of eye-contact episodes in unconstrained therapist-child interactions via a single video camera. The model was validated on video data varying in resolution and setting, achieving promising performance. We further tested EYE-C on a clinical sample of 62 preschoolers with ASD for spectrum stratification based on eye-contact features and age. By unsupervised clustering, three distinct sub-groups were identified, differentiated by eye-contact dynamics and a specific clinical phenotype. Overall, this study highlights the potential of Artificial Intelligence in categorizing atypical behavior and providing translational solutions that might assist clinical practice.

Solving Probability Problems in Natural Language

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/556 ◽

2017 ◽

Cited By ~ 3

Author(s):

Anton Dries ◽

Angelika Kimmig ◽

Jesse Davis ◽

Vaishak Belle ◽

Luc de Raedt

Keyword(s):

Natural Language ◽

Discrete Mathematics ◽

Second Step ◽

Programming System ◽

Mathematics Textbooks ◽

Intellectual Skill ◽

Correct Model ◽

Level Model ◽

End To End ◽

High Level

The ability to solve probability word problems, such as those found in introductory discrete mathematics textbooks, is an important cognitive and intellectual skill. In this paper, we develop a two-step, end-to-end, fully automated approach that provides answers to exercises about probability formulated in natural language. In the first step, a question formulated in natural language is analysed and transformed into a high-level model specified in a declarative language. In the second step, a solution to the high-level model is computed using a probabilistic programming system. On a dataset of 2160 probability problems, our solver is able to correctly answer 97.5% of the questions given a correct model. On the end-to-end evaluation, we are able to answer 12.5% of the questions (or 31.1% if we exclude examples not supported by design).
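The second step can be pictured with a small sketch. It assumes that the first step has already produced a declarative model, represented here as a plain dictionary. That representation, the example question, and the `solve` function are all hypothetical, and exhaustive enumeration of equally likely outcomes stands in for the probabilistic programming system used in the paper.

```python
from fractions import Fraction
from itertools import product

# Hypothetical output of step 1 for the question:
# "Two fair six-sided dice are rolled. What is the probability
#  that the sum of the two dice is 7?"
model = {
    # Each variable is uniform over its domain and independent of the others.
    "variables": {"die1": range(1, 7), "die2": range(1, 7)},
    # The query is a predicate over one joint outcome ("world").
    "query": lambda world: world["die1"] + world["die2"] == 7,
}


def solve(model):
    """Step 2 (sketch): exact probability by enumerating all outcomes."""
    names = list(model["variables"])
    domains = [model["variables"][name] for name in names]
    worlds = [dict(zip(names, values)) for values in product(*domains)]
    favourable = sum(1 for world in worlds if model["query"](world))
    return Fraction(favourable, len(worlds))


answer = solve(model)  # Fraction(1, 6)
```

Keeping the model separate from the solver mirrors the split in the reported results: the solver can be scored on its own given a correct model, while the end-to-end score also depends on how often the first step produces that model.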

Unmasking the conversation on masks: Natural language processing for topical sentiment analysis of COVID-19 Twitter discourse

10.1101/2020.08.28.20183863 ◽

2020 ◽

Cited By ~ 1

Author(s):

Abraham Sanders ◽

Rachael White ◽

Lauren Severson ◽

Rufeng Ma ◽

Richard McQueen ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Sentiment Analysis ◽

Language Processing ◽

Public Attitudes ◽

Online Activity ◽

Health Crisis ◽

Public Response ◽

Health Community ◽

High Level

In this exploratory study, we scrutinize a database of over 1 million tweets collected across the first five months of 2020 to draw conclusions about public attitudes towards the preventative measure of mask usage during the COVID-19 pandemic. In recent months, a body of literature has emerged to suggest the robustness of trends in online activity as proxies for the epidemiological and sociological impact of COVID-19. We employ natural language processing, clustering, and sentiment analysis techniques to organize tweets relating to mask-wearing into high-level themes, then relay narratives for individual clusters through automatic text summarization. We find that topic clustering and visualization based on mask-related Twitter data offer revealing insights into societal perceptions of COVID-19 and techniques for its prevention. We observe that the volume and polarity of mask-related tweets have greatly increased. Importantly, the analysis pipeline presented can be leveraged by the health community to assess public response to health interventions in the ongoing global health crisis.
