Convolutional Neural Networks for Visual Information Analysis with Limited Computing Resources

Author(s):  
Paraskevi Nousi ◽  
Emmanouil Patsiouras ◽  
Anastasios Tefas ◽  
Ioannis Pitas


Sensors ◽  
2021 ◽  
Vol 22 (1) ◽  
pp. 72
Author(s):  
Sanghun Jeon ◽  
Ahmed Elsharkawy ◽  
Mun Sang Kim

In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
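For concreteness, the following is a minimal PyTorch sketch of this kind of pipeline: a 3D convolutional front-end feeding a two-layer bi-directional GRU, trained with connectionist temporal classification (CTC). The single 3D CNN front-end (standing in for the paper's three combined CNNs), all layer sizes, and the character-set size are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical sketch: 3D CNN front-end -> 2-layer BiGRU -> CTC loss.
import torch
import torch.nn as nn

class LipreadingNet(nn.Module):
    def __init__(self, num_chars=28, hidden=256):
        super().__init__()
        # 3D convolutions over (time, height, width) of the mouth region
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # pool space, keep time
        )
        self.gru = nn.GRU(64 * 4 * 4, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_chars)  # chars + blank

    def forward(self, x):                         # x: (batch, 1, T, H, W)
        f = self.frontend(x)                      # (batch, C, T, 4, 4)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)   # (batch, T, C*4*4)
        out, _ = self.gru(f)
        return self.classifier(out).log_softmax(-1)  # CTC wants log-probs

# CTC training step on dummy data (75 frames per clip)
model = LipreadingNet()
ctc = nn.CTCLoss(blank=0)
video = torch.randn(2, 1, 75, 64, 64)
logp = model(video).permute(1, 0, 2)              # (T, batch, classes)
targets = torch.randint(1, 28, (2, 20))           # dummy transcriptions
loss = ctc(logp, targets,
           input_lengths=torch.full((2,), 75),
           target_lengths=torch.full((2,), 20))
loss.backward()
```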


2021 ◽  
pp. 1-19
Author(s):  
Michelle Torres ◽  
Francisco Cantú

Abstract We provide an introduction to the functioning, implementation, and challenges of convolutional neural networks (CNNs) for classifying visual information in the social sciences. This tool can help scholars make the tedious task of classifying images and extracting information from them more efficient. We illustrate the implementation and impact of this methodology by coding handwritten information from vote tallies. Our paper not only demonstrates the contributions of CNNs for both scholars and policy practitioners, but also presents the practical challenges and limitations of the method, providing advice on how to deal with these issues.
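As a concrete illustration of the kind of classifier the authors describe, here is a minimal, hypothetical PyTorch CNN for coding handwritten digits from scanned tallies; the architecture, input size, and class count are assumptions for illustration only, not the paper's model.

```python
# Hypothetical minimal CNN for handwritten-digit coding from tallies.
import torch
import torch.nn as nn

class TallyDigitCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # two conv/pool stages halve the 28x28 input twice -> 7x7 maps
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):                    # x: (batch, 1, 28, 28)
        return self.classifier(self.features(x).flatten(1))

logits = TallyDigitCNN()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```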


2021 ◽  
Vol 15 ◽  
Author(s):  
Yufeng Zheng ◽  
Jun Huang ◽  
Tianwen Chen ◽  
Yang Ou ◽  
Wu Zhou

Convolutional neural networks (CNNs) are a powerful tool for image classification that has been widely adopted in applications of automated scene segmentation and identification. However, the mechanisms underlying CNN image classification remain to be elucidated. In this study, we developed a new approach to address this issue by investigating transfer of learning in representative CNNs (AlexNet, VGG, ResNet-101, and Inception-ResNet-v2) on classifying geometric shapes based on local/global features or invariants. While the local features are based on simple components, such as the orientation of a line segment or whether two lines are parallel, the global features are based on the whole object, such as whether an object has a hole or whether an object is inside another object. Six experiments were conducted to test two hypotheses on CNN shape classification. The first hypothesis is that transfer of learning based on local features is higher than transfer of learning based on global features. The second hypothesis is that CNNs with more layers and advanced architectures exhibit higher transfer of learning based on global features. The first two experiments examined how the CNNs transferred learning of discriminating local features (square, rectangle, trapezoid, and parallelogram). The other four experiments examined how the CNNs transferred learning of discriminating global features (presence of a hole, connectivity, and the inside/outside relationship). While the CNNs exhibited robust learning on classifying shapes, transfer of learning varied from task to task and from model to model. The results rejected both hypotheses. First, some CNNs exhibited lower transfer of learning based on local features than on global features. Second, the advanced CNNs exhibited lower transfer of learning on global features than the earlier models did. Among the tested geometric features, we found that learning to discriminate the inside/outside relationship was the most difficult to transfer, indicating an effective benchmark for developing future CNNs. In contrast to the “ImageNet” approach, which employs natural images to train and analyze CNNs, these results show proof of concept for a “ShapeNet” approach that employs well-defined geometric shapes to elucidate the strengths and limitations of the computation in CNN image classification. This “ShapeNet” approach will also provide insights into understanding visual information processing in the primate visual system.
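The transfer-of-learning measurements described above follow a standard recipe that can be sketched as follows: train a CNN on one set of shape classes, then freeze its convolutional features and refit only a new classifier head on a second set of classes, using the frozen-feature accuracy as a proxy for transfer. The backbone, class counts, and training details below are illustrative assumptions, not the study's exact protocol.

```python
# Hedged sketch of measuring transfer between two shape tasks.
import torch
import torch.nn as nn
from torchvision import models

# Stage 1: learn a source task, e.g., square vs. rectangle vs.
# trapezoid vs. parallelogram (local features).
model = models.resnet18(weights=None, num_classes=4)
# ... train on the source shape dataset ...

# Stage 2: transfer to a target task, e.g., hole vs. no hole (global
# feature), by freezing all convolutional layers and relearning only
# the classifier head.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)   # fresh head, trainable
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model(x), y)       # only the head learns
loss.backward()
optimizer.step()
```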


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Yaoda Xu ◽  
Maryam Vaziri-Pashkam

Abstract Convolutional neural networks (CNNs) are increasingly used to model human vision due to their high object categorization capabilities and general correspondence with human brain responses. Here we evaluate the performance of 14 different CNNs compared with human fMRI responses to natural and artificial images using representational similarity analysis. Despite the presence of some CNN-brain correspondence and CNNs’ impressive ability to fully capture lower-level visual representations of real-world objects, we show that CNNs do not fully capture higher-level visual representations of real-world objects, nor those of artificial objects, at either lower or higher levels of visual representation. The latter is particularly critical, as the processing of both real-world and artificial visual stimuli engages the same neural circuits. We report similar results regardless of differences in CNN architecture, training, or the presence of recurrent processing. This indicates some fundamental differences exist in how the brain and CNNs represent visual information.
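Representational similarity analysis, as used in this comparison, reduces to a short computation: build a representational dissimilarity matrix (RDM) over the same image set from a CNN layer's responses and from fMRI voxel patterns, then correlate the two RDMs. The sketch below uses made-up data shapes to show the core steps.

```python
# Minimal RSA sketch with random stand-in data.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

n_images = 20
cnn_features = np.random.randn(n_images, 4096)   # one CNN layer per image
fmri_patterns = np.random.randn(n_images, 500)   # voxel responses per image

# 1 - Pearson correlation distance between every pair of images
# (pdist returns the condensed upper triangle of the RDM)
rdm_cnn = pdist(cnn_features, metric="correlation")
rdm_brain = pdist(fmri_patterns, metric="correlation")

# CNN-brain correspondence: rank correlation of the two RDMs
rho, _ = spearmanr(rdm_cnn, rdm_brain)
print(f"RDM correlation: {rho:.3f}")
```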


2021 ◽  
Vol 40 ◽  
pp. 03048
Author(s):  
Vaibhav Narawade ◽  
Aneesh Potnis ◽  
Vishwaroop Ray ◽  
Pratik Rathor

Our project intends to classify movies into the three most probable genres they belong to, from a predefined set of 25 genres, based on only one image: the movie poster. We use convolutional neural networks (CNNs) to realize this project, as we believe they are well suited to extracting features and visual information from the image. Rather than a multi-class classification problem, in which the input is assigned to exactly one class, this project is more correctly described as a multi-label classification problem, as a movie can belong to more than one genre. We present a comparative study of different architectures and tune them to yield the best result based on the metric of accuracy. We applied various techniques, such as data augmentation and L2 regularization, to deduce which of the tested models performs best.
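A minimal sketch of this multi-label setup, under assumed details not given above: a CNN with 25 per-genre logits trained with binary cross-entropy (so each genre is an independent yes/no decision), L2 regularization applied as weight decay, and simple augmentation transforms.

```python
# Hypothetical multi-label genre classifier for movie posters.
import torch
import torch.nn as nn
from torchvision import models, transforms

augment = transforms.Compose([            # data augmentation for posters
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2),
])

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 25)     # one logit per genre
criterion = nn.BCEWithLogitsLoss()                 # multi-label, not softmax
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4, weight_decay=1e-4)  # L2 regularization

posters = augment(torch.rand(4, 3, 224, 224))      # dummy poster batch
labels = torch.zeros(4, 25)
labels[:, [2, 7]] = 1.0                   # a movie can have several genres
loss = criterion(model(posters), labels)
loss.backward()
optimizer.step()

# At inference, keep the three most probable genres per poster:
probs = torch.sigmoid(model(posters))
top3 = probs.topk(3, dim=1).indices
```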


PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0245230
Author(s):  
Marco Seeland ◽  
Patrick Mäder

Humans’ decision-making process often relies on utilizing visual information from different views or perspectives. However, in machine-learning-based image classification, we typically infer an object’s class from just a single image of the object. Especially for challenging classification problems, the visual information conveyed by a single image may be insufficient for an accurate decision. We propose a classification scheme that relies on fusing visual information captured in images depicting the same object from multiple perspectives. Convolutional neural networks are used to extract and encode visual features from the multiple views, and we propose strategies for fusing this information. More specifically, we investigate the following three strategies: (1) fusing convolutional feature maps at differing network depths; (2) fusing bottleneck latent representations prior to classification; and (3) score fusion. We systematically evaluate these strategies on three datasets from different domains. Our findings emphasize the benefit of integrating information fusion into the network rather than performing it by post-processing of classification scores. Furthermore, we demonstrate through a case study that already-trained networks can easily be extended by the best fusion strategy, outperforming other approaches by a large margin.
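Two of the three strategies can be sketched compactly: fusing bottleneck latent representations of the views before a shared classifier (strategy 2) versus averaging per-view classification scores (strategy 3). The backbone and dimensions below are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of bottleneck fusion vs. score fusion for two views.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()               # expose 512-d bottleneck features

view_a = torch.randn(4, 3, 224, 224)      # same objects, two perspectives
view_b = torch.randn(4, 3, 224, 224)

# Strategy 2: concatenate bottleneck representations, then classify
feat = torch.cat([backbone(view_a), backbone(view_b)], dim=1)  # (4, 1024)
fused_head = nn.Linear(1024, 10)
logits_fused = fused_head(feat)

# Strategy 3: score fusion, i.e., average per-view class scores
head = nn.Linear(512, 10)
logits_scored = (head(backbone(view_a)) + head(backbone(view_b))) / 2
```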

