Deep Insights into Convolutional Networks for Video Recognition

2019
Vol 128 (2)
pp. 420-437
Author(s):
Christoph Feichtenhofer
Axel Pinz
Richard P. Wildes
Andrew Zisserman

Abstract As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing the internal representation of models that have been trained to recognize actions in video. We visualize multiple two-stream architectures to show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.
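To make the cross-stream fusion idea concrete, the following is a minimal PyTorch sketch (not the authors' code) of one common fusion form: channel-wise concatenation of appearance and motion feature maps followed by a 1x1x1 convolution that mixes the two streams into a single spatiotemporal representation. The tensor shapes and the `ConvFusion` module name are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): fusing appearance and motion
# feature maps from a two-stream network by channel-wise concatenation
# followed by a 1x1x1 convolution, one common form of cross-stream fusion.
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # A pointwise 3D conv mixes the two streams into one spatiotemporal feature map.
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, channels, time, height, width)
        x = torch.cat([appearance, motion], dim=1)
        return self.fuse(x)

# Random feature maps stand in for the outputs of the RGB and optical-flow streams.
rgb_feat = torch.randn(2, 64, 8, 14, 14)
flow_feat = torch.randn(2, 64, 8, 14, 14)
fused = ConvFusion(64)(rgb_feat, flow_feat)
print(fused.shape)  # torch.Size([2, 64, 8, 14, 14])
```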

2018
pp. 2083-2101
Author(s):
Masaki Takahashi
Masahide Naemura
Mahito Fujii
James J. Little

A feature-representation method for recognizing actions in sports videos on the basis of the relationship between human actions and camera motions is proposed. The method involves the following steps: First, keypoint trajectories are extracted as motion features in spatio-temporal sub-regions called “spatio-temporal multiscale bags” (STMBs). Global representations and local representations from one sub-region in the STMBs are then combined to create a “glocal pairwise representation” (GPR). The GPR considers the co-occurrence of camera motions and human actions. Finally, two-stage SVM classifiers are trained with STMB-based GPRs, and specified human actions in video sequences are identified. An experimental evaluation of the recognition accuracy of the proposed method (by using the public OSUPEL basketball video dataset and broadcast videos) demonstrated that the method can robustly detect specific human actions in both public and broadcast basketball video sequences.
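As a rough illustration of the pairing idea, the sketch below (scikit-learn, on synthetic data) concatenates a global camera-motion descriptor with a local sub-region descriptor into one "glocal" vector and feeds it to a two-stage cascade of linear SVMs: a coarse action-versus-background classifier followed by a per-action classifier. All feature dimensions, labels, and cascade details are assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch only: pairing a global (camera-motion) descriptor with a
# local (sub-region) descriptor and classifying the pair with linear SVMs.
# The two-stage cascade is approximated by a coarse "action vs. background"
# SVM followed by a fine per-action SVM; this is not the authors' pipeline.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, d_global, d_local = 200, 32, 64

global_desc = rng.normal(size=(n, d_global))     # e.g. camera-motion histogram
local_desc = rng.normal(size=(n, d_local))       # e.g. trajectory features in one STMB
pairwise = np.hstack([global_desc, local_desc])  # "glocal pairwise representation"

is_action = rng.integers(0, 2, size=n)           # stage 1 labels (placeholder)
action_class = rng.integers(0, 4, size=n)        # stage 2 labels (placeholder)

stage1 = LinearSVC().fit(pairwise, is_action)
mask = is_action == 1
stage2 = LinearSVC().fit(pairwise[mask], action_class[mask])

# At test time, stage 2 is only consulted when stage 1 fires.
probe = pairwise[:5]
coarse = stage1.predict(probe)
fine = np.where(coarse == 1, stage2.predict(probe), -1)
print(coarse, fine)
```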


2020
Vol 34 (05)
pp. 8074-8081
Author(s):
Pavan Kapanipathi
Veronika Thost
Siva Sankalp Patel
Spencer Whitehead
Ibrahim Abdelaziz
...  

Textual entailment is a fundamental task in natural language processing. Most approaches for solving this problem use only the textual content present in training data. A few approaches have shown that information from external knowledge sources like knowledge graphs (KGs) can add value, in addition to the textual content, by providing background knowledge that may be critical for a task. However, the proposed models do not fully exploit the information in the usually large and noisy KGs, and it is not clear how it can be effectively encoded to be useful for entailment. We present an approach that complements text-based entailment models with information from KGs by (1) using Personalized PageRank to generate contextual subgraphs with reduced noise and (2) encoding these subgraphs using graph convolutional networks to capture the structural and semantic information in KGs. We evaluate our approach on multiple textual entailment datasets and show that the use of external knowledge helps the model to be robust and improves prediction accuracy. This is particularly evident in the challenging BreakingNLI dataset, where we see an absolute improvement of 5-20% over multiple text-based entailment models.
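The subgraph-extraction step can be illustrated with a toy example: seed Personalized PageRank with the concepts mentioned in the premise/hypothesis pair and keep only the top-scoring nodes. The sketch below uses networkx on a hand-made graph; the node names, teleport weights, and cutoff are illustrative assumptions rather than the paper's configuration.

```python
# Sketch under assumptions: extracting a contextual subgraph from a toy
# knowledge graph with Personalized PageRank, seeded by concepts mentioned
# in the premise/hypothesis pair. Node names and the cutoff are illustrative.
import networkx as nx

kg = nx.Graph()
kg.add_edges_from([
    ("dog", "animal"), ("cat", "animal"), ("animal", "living_thing"),
    ("dog", "pet"), ("cat", "pet"), ("car", "vehicle"), ("vehicle", "machine"),
])

seeds = {"dog": 0.5, "pet": 0.5}  # concepts found in the sentence pair
scores = nx.pagerank(kg, alpha=0.85, personalization=seeds)

# Keep only the highest-scoring nodes to suppress noisy, unrelated parts of the KG.
top_nodes = [n for n, s in sorted(scores.items(), key=lambda kv: -kv[1])[:4]]
contextual_subgraph = kg.subgraph(top_nodes)
print(sorted(contextual_subgraph.nodes()))
```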


2020
Vol 34 (07)
pp. 10542-10550
Author(s):
Jingjing Chen
Liangming Pan
Zhipeng Wei
Xiang Wang
Chong-Wah Ngo
...  

Recognizing ingredients for a given dish image is at the core of automatic dietary assessment, attracting increasing attention from both industry and academia. Nevertheless, the task is challenging due to the difficulty of collecting and labeling sufficient training data. On one hand, there are hundreds of thousands of food ingredients in the world, ranging from the common to the rare, and collecting training samples for all of the ingredient categories is difficult. On the other hand, as ingredient appearances exhibit huge visual variance during food preparation, robust recognition requires training samples collected under different cooking and cutting methods. Since obtaining sufficient fully annotated training data is not easy, a more practical way of scaling up recognition is to develop models that are capable of recognizing unseen ingredients. Therefore, in this paper, we target the problem of ingredient recognition with zero training samples. More specifically, we introduce a multi-relational GCN (graph convolutional network) that integrates ingredient hierarchy, attributes, and co-occurrence for zero-shot ingredient recognition. Extensive experiments on both Chinese and Japanese food datasets demonstrate the superior performance of the multi-relational GCN and shed light on zero-shot ingredient recognition.
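A generic multi-relational graph convolution of the kind the abstract describes can be sketched as follows: neighbour features are aggregated separately for each relation type (e.g. hierarchy, attribute, co-occurrence) with relation-specific weights and then summed, R-GCN style. This PyTorch sketch is an assumption about the layer form, not necessarily the paper's exact formulation.

```python
# Minimal sketch (assumptions marked): one multi-relational graph convolution
# layer that aggregates neighbours separately for each relation type
# (e.g. hierarchy, attribute, co-occurrence) and sums the results.
# Follows the generic R-GCN pattern, not necessarily the paper's exact layer.
import torch
import torch.nn as nn

class MultiRelationalGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_relations: int):
        super().__init__()
        self.rel_linears = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(num_relations))
        self.self_linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adjs: list) -> torch.Tensor:
        # x: (num_nodes, in_dim); adjs[r]: row-normalised (num_nodes, num_nodes)
        # adjacency matrix for relation r.
        out = self.self_linear(x)
        for adj, lin in zip(adjs, self.rel_linears):
            out = out + adj @ lin(x)
        return torch.relu(out)

# Toy ingredient graph: 5 nodes, 3 relations (hierarchy, attribute, co-occurrence).
x = torch.randn(5, 16)
adjs = [torch.eye(5) for _ in range(3)]
layer = MultiRelationalGCNLayer(16, 8, num_relations=3)
print(layer(x, adjs).shape)  # torch.Size([5, 8])
```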


Author(s):  
Kezhen Chen
Irina Rabkina
Matthew D. McLure
Kenneth D. Forbus

Deep learning systems can perform well on some image recognition tasks. However, they have serious limitations, including requiring far more training data than humans do and being fooled by adversarial examples. By contrast, analogical learning over relational representations tends to be far more data-efficient, requiring only human-like amounts of training data. This paper introduces an approach that combines automatically constructed qualitative visual representations with analogical learning to tackle a hard computer vision problem, object recognition from sketches. Results from the MNIST dataset and a novel dataset, the Coloring Book Objects dataset, are provided. Comparison to existing approaches indicates that analogical generalization can be used to identify sketched objects from these datasets with several orders of magnitude fewer examples than deep learning systems require.


Sensors
2019
Vol 19 (18)
pp. 3873
Author(s):
Jong Taek Lee
Eunhee Park
Tae-Du Jung

Videofluoroscopic swallowing study (VFSS) is a standard diagnostic tool for dysphagia. To detect the presence of aspiration during a swallow, a manual search is commonly used to mark the time intervals of the pharyngeal phase on the corresponding VFSS image. In this study, we present a novel approach that uses 3D convolutional networks to detect the pharyngeal phase in raw VFSS videos without manual annotations. For efficient collection of training data, we propose a cascade framework that requires neither the time intervals of the swallowing process nor the manual marking of anatomical positions for detection. For video classification, we applied the inflated 3D convolutional network (I3D), one of the state-of-the-art networks for action classification, as a baseline architecture. We also present a modified 3D convolutional network architecture derived from the baseline I3D architecture. The classification and detection performance of these two architectures were evaluated for comparison. The experimental results show that the proposed model outperformed the baseline I3D model when both models were trained with random weights. We conclude that the proposed method greatly reduces the examination time of VFSS images with a low miss rate.
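For readers unfamiliar with 3D convolutional video classifiers, the sketch below shows the general pattern the abstract builds on: stacked 3D convolutions and pooling over a short clip of grayscale frames, followed by a linear classifier that labels the clip as pharyngeal phase or not. It is a toy stand-in, not the baseline I3D nor the authors' modified architecture; all layer sizes are assumptions.

```python
# Rough sketch, not the authors' modified I3D: a tiny 3D-convolutional classifier
# that labels a short clip of grayscale VFSS frames as pharyngeal phase or not.
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # global pooling over time and space
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 1, frames, height, width)
        feats = self.features(clip).flatten(1)
        return self.classifier(feats)

clip = torch.randn(2, 1, 16, 112, 112)  # two 16-frame grayscale clips
logits = Tiny3DConvNet()(clip)
print(logits.shape)  # torch.Size([2, 2])
```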


2020
Vol 30 (Supplement_5)
Author(s):
M Dedicatoria
S Klaus
R Case
S Na
E Ludwick
...  

Abstract Background Rapid identification of pathogens is critical to outbreak detection and sentinel surveillance; however, most diagnoses are made in laboratory settings. Advancements in artificial intelligence (AI) and computer vision offer unprecedented opportunities to facilitate detection and reduce response time in field settings. An initial step is the creation of analysis algorithms for offline mobile computing applications. Methods AI models that identify objects using computer vision are typically “trained” on previously labeled images. The scarcity of labeled image libraries creates a bottleneck, requiring thousands of labor hours to annotate images by hand to create “training data.” We describe the applicability of Generative Adversarial Network (GAN) methods to amass sufficient training data with minimal manual input. Results Our AI models achieve a performance score of 0.84-0.93 for M. tuberculosis, a measure of accuracy based on precision and recall. Our results demonstrate that our GAN pipeline boosts model robustness and learnability on sparse open-source data. Conclusions The labeled training data for identifying M. tuberculosis, developed using our GAN pipeline, demonstrates the potential for rapid identification of known pathogens in field settings. Our work paves the way for the development of offline mobile computing applications that identify pathogens outside of a laboratory setting, and further development of these capabilities can significantly improve time-to-detection and outbreak response. Key messages Rapidly deploy AI detectors to aid in disease outbreak detection and surveillance. Our concept aligns with deploying responsive alerting capabilities to address dynamic threats in low-resource, offline computing environments.
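The GAN component can be illustrated, under assumptions, by a single training step of a minimal generator/discriminator pair used to synthesize additional training images when labeled data is scarce. The layer sizes, 28x28 image shape, and optimizer settings below are placeholders, not the authors' pipeline.

```python
# Illustrative only: one training step of a minimal GAN that could be used to
# synthesise extra training images when labelled image data is scarce.
# Layer sizes and the 28x28 image shape are assumptions, not the authors' pipeline.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 28 * 28)   # stand-in for a batch of real labelled images
noise = torch.randn(32, 64)
fake = G(noise)

# Discriminator step: push real images toward label 1, generated images toward 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator into predicting 1 for generated images.
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
print(float(d_loss), float(g_loss))
```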


Author(s):  
M. Sester
Y. Feng
F. Thiemann

Abstract. Cartographic generalization is a problem which poses interesting challenges to automation. Whereas plenty of algorithms have been developed for the different sub-problems of generalization (e.g. simplification, displacement, aggregation), there are still cases which are not generalized adequately or in a satisfactory way. The main problem is the interplay between different operators; in those cases the benchmark is the human operator, who is able to design an aesthetic and correct representation of the physical reality. Deep Learning methods have shown tremendous success for interpretation problems for which algorithmic methods have deficits. A prominent example is the classification and interpretation of images, where deep learning approaches outperform traditional computer vision methods. In both domains (computer vision and cartography) humans are able to produce a solution; a prerequisite for this is the possibility to generate many training examples for the different cases. Thus, the idea in this paper is to employ Deep Learning for cartographic generalization tasks, especially for the task of building generalization. An advantage of this task is the fact that many training data sets are available from given map series. The approach is a first attempt using an existing network. In the paper, the details of the implementation are reported, together with an in-depth analysis of the results, and an outlook on future work is given.
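Since the abstract does not specify the network used, the following is a purely speculative PyTorch sketch of how such a setup could look: a small convolutional encoder-decoder trained to map a rasterised detailed building tile to the corresponding generalised tile taken from a smaller-scale map series. Architecture, tile size, and loss are assumptions.

```python
# Speculative sketch only (the paper does not disclose its network): a small
# fully-convolutional model trained to map a rasterised detailed building tile
# to the corresponding generalised tile from a smaller-scale map series.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),   # per-pixel building mask
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

detailed = torch.rand(8, 1, 128, 128)                         # large-scale (detailed) tiles
generalised = (torch.rand(8, 1, 128, 128) > 0.5).float()      # target tiles (placeholder)

pred = model(detailed)
loss = loss_fn(pred, generalised)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```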


Author(s):  
Alessandro Betti
Marco Gori
Stefano Melacci

The puzzle of computer vision might find new challenging solutions when we realize that most successful methods work at the image level, which is remarkably more difficult than directly processing visual streams, just as happens in nature. In this paper, we claim that the processing of a stream of frames naturally leads to formulating the motion invariance principle, which enables the construction of a new theory of visual learning based on convolutional features. The theory addresses a number of intriguing questions that arise in natural vision and offers a well-posed computational scheme for the discovery of convolutional filters over the retina. They are driven by the Euler-Lagrange differential equations derived from the principle of least cognitive action, which parallels the laws of mechanics. Unlike traditional convolutional networks, which need massive supervision, the proposed theory offers a truly new scenario in which feature learning takes place by unsupervised processing of video signals. An experimental report of the theory is presented in which we show that features extracted under motion invariance yield an improvement that can be assessed by measuring information-based indexes.
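For reference, the variational machinery the abstract invokes has the standard form below; the specific "cognitive action" Lagrangian of the paper is not reproduced here, and q(t) is a placeholder for the filter parameters evolving over time.

```latex
% General variational form only; L is a placeholder Lagrangian, not the paper's.
% q(t) stands for the convolutional filter parameters as a function of time.
A[q] \;=\; \int_{0}^{T} L\big(q(t),\,\dot{q}(t),\,t\big)\,\mathrm{d}t,
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,\frac{\partial L}{\partial \dot{q}}
\;-\;
\frac{\partial L}{\partial q} \;=\; 0 .
```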


2022
Vol 951 (1)
pp. 012097
Author(s):
A Maghfirah
I S Nasution

Abstract Coffee is one of the most important commodities in the trading industry. Determination of coffee quality is still done manually, so good-quality coffee beans cannot be reliably separated from bad-quality ones. This research developed a visual-based intelligent system using computer vision to classify the quality of green coffee beans based on the Indonesian National Standard (SNI). The models used in the study are the K-Nearest Neighbour (K-NN) method and the Support Vector Machine (SVM) method, with 13 parameters: area, contrast, energy, correlation, homogeneity, circularity, perimeter, and the colour indices R (red), G (green), B (blue), L*, a*, and b*. A total of 1200 Arabica green coffee beans were captured using a Kinect V2 camera, with training data of 1000 samples and testing data of 200 samples.
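As a rough sketch of the classification step, the scikit-learn snippet below trains K-NN and SVM classifiers on a placeholder 1200 x 13 feature matrix standing in for the Kinect-derived shape, texture, and colour measurements, using the 1000/200 train/test split reported above. Labels and features are random stand-ins, so the printed accuracies are meaningless; only the workflow is illustrated.

```python
# Sketch with placeholder data: classifying coffee-bean quality from the 13
# handcrafted shape/texture/colour features with K-NN and an SVM, as the
# abstract describes. The random matrix stands in for the Kinect-derived features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 13))     # area, contrast, energy, ..., L*, a*, b*
y = rng.integers(0, 2, size=1200)   # good vs. defective bean (placeholder labels)

# 1000 training / 200 testing samples, matching the split reported in the abstract.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
svm = SVC(kernel="rbf").fit(X_train, y_train)
print("K-NN accuracy:", knn.score(X_test, y_test))
print("SVM accuracy:", svm.score(X_test, y_test))
```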

