An Analysis of the Use of Feed-Forward Sub-Modules for Transformer-Based Image Captioning Tasks

2021, Vol 11 (24), pp. 11635
Author(s): Raymond Ian Osolo, Zhan Yang, Jun Long

In the quest to make deep learning systems more capable, a number of more complex, more computationally expensive, and more memory-intensive algorithms have been proposed. This shift glosses over the capability of many simpler systems, or modules within them, to adequately address current and future problems, and it has made some deep learning research inaccessible to researchers who do not possess top-of-the-line hardware. The use of simple feed-forward networks has not been explicitly explored in the current transformer-based vision-language field. In this paper, we use a series of feed-forward layers to encode image features and caption embeddings, alleviating some of the computational complexity that accompanies the self-attention mechanism and limits its application in long-sequence task scenarios. We demonstrate that a decoder does not require masking for conditional short-sequence generation where the task depends not only on the previously generated sequence but also on another input, such as image features. We perform an empirical and qualitative analysis of the use of linear transforms in place of self-attention layers in vision-language models, and we obtain competitive results on the MSCOCO dataset. Our best feed-forward model obtains average scores of over 90% of those of the current state-of-the-art pre-trained Oscar model on the conventional image captioning metrics. We also demonstrate that the proposed models take less time to train and use less memory at larger batch sizes and longer sequence lengths.
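To make the idea concrete, the following is a minimal PyTorch sketch of an encoder block in this spirit, with the self-attention sub-layer replaced by a linear token-mixing layer plus a position-wise feed-forward network; the layer sizes and exact structure are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    """Transformer-style block with linear mixing instead of self-attention."""
    def __init__(self, seq_len: int, d_model: int, d_ff: int):
        super().__init__()
        self.token_mix = nn.Linear(seq_len, seq_len)  # mixes across positions
        self.ff = nn.Sequential(                      # mixes across channels
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        y = self.token_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + y
        return x + self.ff(self.norm2(x))

# Example: encode 36 image-region features of dimension 512.
feats = torch.randn(8, 36, 512)
out = FeedForwardBlock(seq_len=36, d_model=512, d_ff=2048)(feats)

Because the token-mixing layer is a plain matrix multiply over the sequence axis, its cost avoids the quadratic-in-sequence-length attention maps, which is the property exploited above at larger batch sizes and longer sequences.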

2021, Vol 11 (18), pp. 8354
Author(s): Raymond Ian Osolo, Zhan Yang, Jun Long

Many vision–language models that output natural language, such as image-captioning models, use image features merely to ground the captions; most of their good performance can be attributed to the language model, which does the heavy lifting. This phenomenon has persisted even with the emergence of transformer-based architectures as the preferred base of recent state-of-the-art vision–language models. In this paper, we make the images matter more by using fast Fourier transforms to further break down the input features and extract more of their intrinsic salient information, resulting in more detailed yet concise captions. This is achieved by performing a 1D Fourier transformation on the image features, first in the hidden dimension and then in the sequence dimension. These extracted features, alongside the region-proposal image features, yield a richer image representation that can then be queried to produce the associated captions, which showcase a deeper understanding of image–object–location relationships than similar models. Extensive experiments on the MSCOCO benchmark dataset demonstrate CIDEr-D, BLEU-1, and BLEU-4 scores of 130, 80.5, and 39, respectively.
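The mixing step can be illustrated with a short PyTorch sketch: a 1D FFT over the hidden dimension followed by one over the sequence dimension. Keeping only the real part (as FNet does) and concatenating the result with the original region features are illustrative assumptions, not necessarily the paper's exact design.

import torch

def fourier_mix(x):
    """x: (batch, seq_len, hidden) image-region features."""
    x = torch.fft.fft(x, dim=-1)  # FFT over the hidden dimension
    x = torch.fft.fft(x, dim=-2)  # FFT over the sequence dimension
    return x.real                 # return to a real-valued tensor

regions = torch.randn(4, 36, 512)                           # e.g. 36 region proposals
fused = torch.cat([regions, fourier_mix(regions)], dim=-1)  # richer representation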


Author(s): Jwalin Bhatt, Khurram Azeem Hashmi, Muhammad Zeshan Afzal, Didier Stricker

In any document, graphical elements like tables, figures, and formulas contain essential information. The processing and interpretation of such information require specialized algorithms, and off-the-shelf OCR components cannot process it reliably. An essential step in document analysis pipelines is therefore to detect these graphical components; doing so yields a high-level conceptual understanding of the documents that makes their digitization viable. Since the advent of deep learning, the performance of deep learning-based object detection has improved manyfold. In this work, we outline and summarize the deep learning approaches for detecting graphical page objects in document images, discussing the most relevant approaches and the state of the art. This work provides a comprehensive understanding of the current state of the art and the related challenges. Furthermore, we discuss the leading datasets along with their quantitative evaluation, and we briefly discuss promising directions for further improvement.
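As a concrete example of the kind of pipeline the surveyed works build on, the sketch below runs torchvision's generic Faster R-CNN configured for page-object classes; the class list, image size, and score threshold are illustrative assumptions, and a real system would first fine-tune the detector on an annotated document dataset.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

CLASSES = ["background", "table", "figure", "formula"]  # assumed label set
model = fasterrcnn_resnet50_fpn(num_classes=len(CLASSES))
model.eval()  # in practice, fine-tune on annotated document pages first

page = torch.rand(3, 1000, 750)  # one document page as an RGB tensor
with torch.no_grad():
    pred = model([page])[0]      # dict with boxes, labels, scores
for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.5:
        print(CLASSES[label], box.tolist(), float(score))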


2021, Vol 2021, pp. 1-17
Author(s): Juan F. Ramirez Rochac, Nian Zhang, Lara A. Thompson, Tolessa Deksissa

Hyperspectral imaging is an area of active research with many applications in remote sensing, mineral exploration, and environmental monitoring. Deep learning approaches, and in particular convolution-based approaches, are the current state-of-the-art classification models. However, in the presence of noisy hyperspectral datasets, these deep convolutional neural networks underperform. In this paper, we propose a feature augmentation approach to increase noise resistance in imbalanced hyperspectral classification. Our method calculates context-based features and uses a deep convolutional network (DCN). We tested our proposed approach on the Pavia datasets and compared three models, a DCN, PCA + DCN, and our context-based DCN, using the original datasets and the same datasets with added noise. Our experimental results show that the DCN and PCA + DCN perform well on the original datasets but not on the noisy ones. Our robust context-based DCN outperformed the others in the presence of noise and maintained a comparable classification accuracy on clean hyperspectral images.
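The paper's exact context-based features are not reproduced here; the sketch below uses local neighbourhood statistics as one plausible stand-in to show the augmentation pattern, with Pavia-like dimensions assumed.

import numpy as np
from scipy.ndimage import uniform_filter

def augment_with_context(cube, window=5):
    """cube: (H, W, bands). Returns (H, W, 2*bands): raw + local mean."""
    context = uniform_filter(cube, size=(window, window, 1))
    return np.concatenate([cube, context], axis=-1)

cube = np.random.rand(64, 64, 103).astype(np.float32)  # Pavia-like cube
noisy = cube + np.random.normal(0, 0.1, cube.shape)    # simulate sensor noise
features = augment_with_context(noisy)                 # input to the DCN
print(features.shape)  # (64, 64, 206)

Because the appended channels average over a spatial window, per-pixel noise is partly cancelled before classification, which is one way such augmentation can buy noise robustness.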


2021
Author(s): Roshan Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, et al.

Unsupervised protein language models trained across millions of diverse sequences learn the structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work, we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
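A minimal PyTorch sketch of the interleaved row and column attention pattern follows; all dimensions are illustrative, and the block omits the feed-forward sub-layers, layer normalization, and masked-language-modeling head of the full model.

import torch
import torch.nn as nn

class RowColumnBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, msa):  # msa: (rows, columns, d_model)
        x = msa
        x = x + self.row_attn(x, x, x)[0]  # each row attends over positions
        x = x.transpose(0, 1)
        x = x + self.col_attn(x, x, x)[0]  # each column attends over sequences
        return x.transpose(0, 1)

msa = torch.randn(16, 128, 64)  # 16 aligned sequences, 128 positions
out = RowColumnBlock()(msa)

Attending down the columns lets the model read covariation across the alignment, the same signal that classical family-specific models exploit.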


2020, Vol 34 (05), pp. 8082-8090
Author(s): Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, Ashish Sabharwal

Composing knowledge from multiple pieces of text is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging, as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotations for the supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges, and we use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved, as this model still lags human performance by 20%.
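As a toy illustration of the two-step idea, where the second retrieval query is seeded by the new terms the first fact introduces, consider the sketch below; the corpus and the plain word-overlap scoring are illustrative assumptions, far simpler than the paper's actual retrieval system.

import re

def words(s):
    return set(re.findall(r"[a-z]+", s.lower()))

def overlap(a, b):
    return len(words(a) & words(b))

def two_step_retrieve(question, corpus):
    # Hop 1: the fact that best matches the question itself.
    first = max(corpus, key=lambda f: overlap(f, question))
    # Hop 2: favour facts sharing the terms that the first fact
    # introduces but the question never mentions.
    new_terms = " ".join(words(first) - words(question))
    second = max((f for f in corpus if f != first),
                 key=lambda f: overlap(f, new_terms))
    return first, second

corpus = [
    "Differential heating of air produces wind.",
    "The sun is the main source of heating for the air.",
    "Plants produce oxygen through photosynthesis.",
]
print(two_step_retrieve("What causes wind?", corpus))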


Methods, 2018, Vol 151, pp. 41-54
Author(s): Nicholas Cummins, Alice Baird, Björn W. Schuller

2021
Author(s): Oscar Nils Erik Kjell, H. Andrew Schwartz, Salvatore Giorgi

The language that individuals use to express themselves contains rich psychological information. Recent significant advances in Natural Language Processing (NLP) and Deep Learning (DL), namely transformers, have resulted in large performance gains on tasks related to understanding natural language, such as machine translation. However, these state-of-the-art methods have not yet been made easily accessible to psychology researchers, nor designed to be optimal for human-level analyses. This tutorial introduces text (www.r-text.org), a new R package for analyzing and visualizing human language using transformers, the latest techniques from NLP and DL. The text package is both a modular solution for accessing state-of-the-art language models and an end-to-end solution catered to human-level analyses. Hence, text provides user-friendly functions tailored to testing hypotheses in the social sciences on both relatively small and large datasets. This tutorial describes useful methods for analyzing text, providing functions with reliable defaults that can be used off the shelf, as well as a framework that advanced users can build on for novel techniques and analysis pipelines. The reader learns about six methods: 1) textEmbed, to transform text into traditional or modern transformer-based word embeddings (i.e., numeric representations of words); 2) textTrain, to examine the relationships between text and numeric/categorical variables; 3) textSimilarity and 4) textSimilarityTest, to compute semantic similarity scores between texts and to test the significance of the difference in meaning between two sets of texts; and 5) textProjection and 6) textProjectionPlot, to examine and visualize text within the embedding space according to latent or specified construct dimensions (e.g., low to high rating-scale scores).


Author(s): Alex Dexter, Spencer A. Thomas, Rory T. Steven, Kenneth N. Robinson, Adam J. Taylor, et al.

High-dimensionality omics and hyperspectral imaging datasets present difficult challenges for feature extraction and data mining, owing to huge numbers of features that cannot be examined simultaneously. The sample numbers and variables of these methods are constantly growing as new technologies are developed, and computational analysis needs to evolve to keep up with the growing demand. Current state-of-the-art algorithms can handle some routine datasets but struggle when datasets grow above a certain size. We present an approach that trains deep neural networks to perform non-linear dimensionality reduction, in particular t-distributed stochastic neighbour embedding (t-SNE), to overcome prior limitations of these methods. One-sentence summary: Analysis of prohibitively large datasets by combining deep learning via neural networks with non-linear dimensionality reduction.
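One way to read this recipe is sketched below under stated assumptions: exact t-SNE is fit on a tractable subset, and a regression network then learns that mapping so it can be applied to the full, otherwise prohibitively large dataset. All sizes are illustrative.

import numpy as np
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
full = rng.random((20000, 100))   # "large" high-dimensional dataset
subset = full[:2000]              # small enough for exact t-SNE

embedding = TSNE(n_components=2, init="pca").fit_transform(subset)
net = MLPRegressor(hidden_layer_sizes=(256, 128), max_iter=500)
net.fit(subset, embedding)        # learn the t-SNE mapping

full_embedding = net.predict(full)  # generalise to every sample
print(full_embedding.shape)         # (20000, 2)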


Author(s): Chengxi Li, Brent Harrison

In this paper, we build a multi-style generative model for stylish image captioning which uses multi-modality image features: ResNeXt features and text features generated by DenseCap. We propose the 3M model, a Multi-UPDOWN caption model that encodes multi-modality features and decodes them into captions. We demonstrate the effectiveness of our model at generating human-like captions by examining its performance on two datasets: the PERSONALITY-CAPTIONS dataset and the FlickrStyle10K dataset. We compare against a variety of state-of-the-art baselines on various automatic NLP metrics such as BLEU, ROUGE-L, CIDEr, and SPICE (code will be available at https://github.com/cici-ai-club/3M). A qualitative study has also been conducted to verify that our 3M model can generate captions in different styles.
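A minimal sketch of the multi-modality encoding idea follows, with assumed feature dimensions for the ResNeXt-style visual features and the DenseCap-style text features; it illustrates the fusion pattern only and is not the 3M architecture itself.

import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    def __init__(self, d_img=2048, d_txt=300, d_model=512):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_model)  # ResNeXt-style features
        self.txt_proj = nn.Linear(d_txt, d_model)  # DenseCap-style features

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, regions, d_img); txt_feats: (B, tokens, d_txt)
        fused = torch.cat([self.img_proj(img_feats),
                           self.txt_proj(txt_feats)], dim=1)
        return fused  # (B, regions + tokens, d_model), queried by the decoder

enc = MultiModalEncoder()
fused = enc(torch.randn(2, 36, 2048), torch.randn(2, 10, 300))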


2021, Vol 13 (22), pp. 4599
Author(s): Félix Quinton, Loic Landrieu

While annual crop rotations play a crucial role in agricultural optimization, they have largely been ignored for automated crop-type mapping. In this paper, we take advantage of the increasing quantity of annotated satellite data to model simultaneously the inter- and intra-annual agricultural dynamics of yearly parcel classification with a deep learning approach. Along with simple training adjustments, our model provides an improvement of over 6.3% mIoU over the current state of the art in crop classification and a reduction of over 21% in the error rate. Furthermore, we release the first large-scale multi-year agricultural dataset, with over 300,000 annotated parcels.
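One plausible way to realise this, sketched below with illustrative shapes rather than the authors' exact architecture, is to nest two recurrent encoders: one over the acquisition dates within each year (intra-annual dynamics) and one over the years themselves (inter-annual rotation).

import torch
import torch.nn as nn

class RotationAwareClassifier(nn.Module):
    def __init__(self, n_bands=10, d_hidden=128, n_classes=20):
        super().__init__()
        self.within_year = nn.GRU(n_bands, d_hidden, batch_first=True)
        self.across_years = nn.GRU(d_hidden, d_hidden, batch_first=True)
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x):  # x: (parcels, years, dates, bands)
        b, y, t, c = x.shape
        _, h = self.within_year(x.reshape(b * y, t, c))
        yearly = h[-1].reshape(b, y, -1)    # one summary vector per parcel-year
        out, _ = self.across_years(yearly)  # carries rotation history forward
        return self.head(out)               # a crop-class logit per year

model = RotationAwareClassifier()
logits = model(torch.randn(4, 3, 30, 10))  # 4 parcels, 3 years of series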

