What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations

2021 ◽  
Vol 4 ◽  
Author(s):  
Nikolai Ilinykh ◽  
Simon Dobnik

Neural networks have proven very successful at automatically capturing the composition of language and different structures across a range of multi-modal tasks. An important question, therefore, is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) on their respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers, in which linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their interplay, and the task's effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from its visual stream. Our results indicate that information about relations between objects in the visual stream is hierarchical and ranges from a local to a global, object-level understanding of the image. In particular, while visual representations in the first layers encode relations between semantically similar object detections, which often constitute neighbouring objects, deeper layers expand their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked with entities described in image descriptions, indicating a critical finding: the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question that we investigate is whether insights from cognitive science echo the structure of representations that the current neural architecture learns.
The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve such problems as pre-training of large-scale multi-modal architectures, multi-modal information fusion and probing of attention weights. In general, we contribute to explainable multi-modal natural language processing and to the currently shallow understanding of how input representations and the structure of the multi-modal transformer affect visual representations.
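One way to quantify the local-to-global shift described above is to weight pairwise object distances by a layer's attention matrix. The sketch below is purely illustrative: the attention matrices are synthetic (peaked for "early" layers, diffuse for "deep" ones), standing in for weights that a real analysis would extract from a trained captioning transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
num_objects = 6

# (x, y) centres of hypothetical detected objects in a unit-square image
centres = rng.uniform(0, 1, size=(num_objects, 2))
# pairwise Euclidean distances between object centres
dists = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)

def mean_attention_distance(attn, dists):
    """Average spatial distance between a query object and the objects it
    attends to, weighted by the row-stochastic attention matrix."""
    return float((attn * dists).sum(axis=-1).mean())

# Fake per-layer attention: low temperature = peaked on nearby objects
# (early layers), high temperature = spread across the image (deep layers).
md_per_layer = []
for temperature in [0.05, 0.1, 0.5, 2.0]:
    attn = np.exp(-dists / temperature)          # prefer nearby objects
    attn /= attn.sum(axis=-1, keepdims=True)     # normalise rows
    md_per_layer.append(mean_attention_distance(attn, dists))

for layer, md in enumerate(md_per_layer):
    print(f"layer {layer}: mean attended distance = {md:.3f}")
```

Under this construction the mean attended distance grows with depth, mirroring the paper's qualitative finding of increasingly global attention.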

2019 ◽  
Vol 25 (4) ◽  
pp. 543-557 ◽  
Author(s):  
Afra Alishahi ◽  
Grzegorz Chrupała ◽  
Tal Linzen

The Empirical Methods in Natural Language Processing (EMNLP) 2018 workshop BlackboxNLP was dedicated to resources and techniques specifically developed for analyzing and understanding the inner workings and representations acquired by neural models of language. Approaches included: systematic manipulation of input to neural networks and investigating the impact on their performance; testing whether interpretable knowledge can be decoded from intermediate representations acquired by neural networks; proposing modifications to neural network architectures to make their knowledge state or generated output more explainable; and examining the performance of networks on simplified or formal languages. Here we review a number of representative studies in each category.


Author(s):  
NGOC TAN LE ◽  
Fatiha Sadat

With the emergence of neural network-based approaches, research on information extraction has benefited from large-scale raw texts, leveraging pre-trained embeddings and other data augmentation techniques to address challenges in Natural Language Processing tasks. In this paper, we propose an approach using sequence-to-sequence neural network models for term extraction in low-resource domains. Our empirical experiments, evaluated on the multilingual ACTER dataset provided in the LREC-TermEval 2020 shared task on automatic term extraction, demonstrated the efficiency of the deep learning approach for automatic term extraction in low-data settings.


Author(s):  
Qiqing Wang ◽  
Cunbin Li

The surge of renewable energy systems can lead to increasing incidents that negatively impact economics and society, rendering incident detection paramount to understanding the mechanism and range of those impacts. In this paper, a deep learning framework is proposed to detect renewable energy incidents from news articles describing accidents in various renewable energy systems. Pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) and word2vec are utilized to represent the textual inputs, which are then fed into Text Convolutional Neural Networks (TCNNs) and Text Recurrent Neural Networks. Two types of classifiers for incident detection are trained and tested in this paper: one a binary classifier for detecting the existence of an incident, the other a multi-label classifier for identifying incident attributes such as causal effects and consequences. The proposed incident detection framework is implemented on a hand-annotated dataset with 5,190 records. The results show that the proposed framework performs well on both the incident existence detection task (F1-score 91.4%) and the incident attributes identification task (micro F1-score 81.7%). It is also shown that the BERT-based TCNNs are effective and robust in detecting renewable energy incidents from large-scale textual materials.
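The binary incident-existence classifier can be sketched as a text CNN over pre-trained token embeddings. The following is a minimal forward pass with random placeholder weights and embeddings (standing in for BERT or word2vec outputs), not the authors' trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, emb_dim, num_filters, kernel = 12, 16, 8, 3

tokens = rng.normal(size=(seq_len, emb_dim))            # embedded news article
conv_w = rng.normal(size=(num_filters, kernel, emb_dim)) * 0.1
fc_w = rng.normal(size=num_filters) * 0.1

def text_cnn_score(tokens):
    # slide a width-3 window over time, producing (time, kernel, emb_dim)
    windows = np.stack([tokens[i:i + kernel]
                        for i in range(seq_len - kernel + 1)])
    # 1D convolution over time: one response per window per filter
    feats = np.einsum('twd,fwd->tf', windows, conv_w)   # (time, filters)
    pooled = np.maximum(feats, 0).max(axis=0)           # ReLU + global max-pool
    logit = pooled @ fc_w                               # linear output layer
    return 1.0 / (1.0 + np.exp(-logit))                 # sigmoid: P(incident)

p = text_cnn_score(tokens)
print(f"P(incident present) = {p:.3f}")
```

The multi-label attribute classifier would replace the single sigmoid output with one sigmoid per attribute.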


Sensors ◽  
2021 ◽  
Vol 21 (10) ◽  
pp. 3464
Author(s):  
Huabin Diao ◽  
Yuexing Hao ◽  
Shaoyun Xu ◽  
Gongyan Li

Convolutional neural networks (CNNs) have achieved significant breakthroughs in various domains, such as natural language processing (NLP) and computer vision. However, performance improvements are often accompanied by large model sizes and computation costs, making such models unsuitable for resource-constrained devices. Consequently, there is an urgent need to compress CNNs so as to reduce model size and computation costs. This paper proposes a layer-wise differentiable compression (LWDC) algorithm for compressing CNNs structurally. A differentiable selection operator OS is embedded in the model so that compression and training proceed simultaneously by gradient descent in one pass. In contrast to most existing methods, which prune parameters from redundant operators, our method directly replaces the original bulky operators with more lightweight ones; it only requires specifying the set of lightweight operators and the regularization factor in advance, rather than a compression rate for each layer. The compressed model produced by our method is generic and does not need any special hardware/software support. Experimental results on CIFAR-10, CIFAR-100 and ImageNet demonstrate the effectiveness of our method. LWDC obtains more significant compression than state-of-the-art methods in most cases, while incurring lower performance degradation. The impact of the lightweight operators and the regularization factor on compression rate and accuracy is also evaluated.
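The core idea of a differentiable operator selection can be illustrated with a softmax gate over candidate operators: during training every candidate runs and gradients flow through the mixing weights; afterwards the soft choice is hardened to the argmax. The operators and logits below are invented stand-ins, not the paper's exact OS operator.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 4))
W = rng.normal(size=(4, 4)) * 0.5

def bulky_op(x):
    # stand-in for an expensive operator, e.g. a full convolution
    return x @ W

def light_op(x):
    # stand-in for a cheap replacement, e.g. a depthwise operation
    return 1.5 * x

alpha = np.array([0.2, 1.8])                 # learnable selection logits
gate = np.exp(alpha) / np.exp(alpha).sum()   # differentiable softmax gate

# During training, both branches execute and gradients reach `alpha`
# through `gate`; a regulariser can bias `alpha` towards the cheap branch.
out = gate[0] * bulky_op(x) + gate[1] * light_op(x)

# After training, harden the soft selection into a discrete operator choice.
chosen = ("bulky", "light")[int(alpha.argmax())]
print("selected operator:", chosen, "| output shape:", out.shape)
```

Because the selection is made per layer, no global compression rate needs to be specified, matching the abstract's description.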


2017 ◽  
Author(s):  
Matthew Lowder ◽  
Wonil Choi ◽  
Fernanda Ferreira ◽  
John Henderson

What are the effects of word-by-word predictability on sentence processing times during the natural reading of a text? Although information-complexity metrics such as surprisal and entropy reduction have been useful in addressing this question, these metrics tend to be estimated using computational language models, which require some degree of commitment to a particular theory of language processing. Taking a different approach, the current study implemented a large-scale cumulative cloze task to collect word-by-word predictability data for 40 passages and compute surprisal and entropy reduction values in a theory-neutral manner. A separate group of participants read the same texts while their eye movements were recorded. Results showed that increases in surprisal and entropy reduction were both associated with increases in reading times. Further, these effects did not depend on the global difficulty of the text. The findings suggest that surprisal and entropy reduction independently contribute to variation in reading times, as these metrics seem to capture different aspects of lexical predictability.
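Both metrics can be computed directly from cloze-response counts, which is what makes the approach theory-neutral. The sketch below uses made-up counts: surprisal is the negative log-probability of the word actually read, and entropy reduction is the drop in cloze-distribution entropy from one word position to the next (the previous-position entropy here is a hypothetical placeholder).

```python
import math
from collections import Counter

# 100 hypothetical cloze responses for one word position
cloze = Counter({'dog': 60, 'cat': 25, 'fox': 10, 'man': 5})
total = sum(cloze.values())
probs = {w: c / total for w, c in cloze.items()}

# surprisal of the word that actually appeared in the text
surprisal = -math.log2(probs['dog'])

# Shannon entropy of the cloze distribution at this position
entropy = -sum(p * math.log2(p) for p in probs.values())

prev_entropy = 2.3   # hypothetical entropy at the previous word position
entropy_reduction = max(0.0, prev_entropy - entropy)

print(f"surprisal(dog) = {surprisal:.3f} bits")
print(f"entropy = {entropy:.3f} bits, reduction = {entropy_reduction:.3f} bits")
```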


2020 ◽  
Author(s):  
Charlotte Caucheteux ◽  
Jean-Rémi King

Deep learning has recently allowed substantial progress in language tasks such as translation and completion. Do such models process language similarly to humans, and is this similarity driven by systematic structural, functional and learning principles? To address these issues, we tested whether the activations of 7,400 artificial neural networks trained on image, word and sentence processing linearly map onto the hierarchy of human brain responses elicited during a reading task, using source-localized magneto-encephalography (MEG) recordings of 104 subjects. Our results confirm that visual, word and language models sequentially correlate with distinct areas of the left-lateralized cortical hierarchy of reading. However, only specific subsets of these models converge towards brain-like representations during their training. Specifically, when the algorithms are trained on language modeling, their middle layers become increasingly similar to the late responses of the language network in the brain. By contrast, input and output word embedding layers often diverge away from brain activity during training. These differences are primarily rooted in the sustained and bilateral responses of the temporal and frontal cortices. Together, these results suggest that the compositional, but not the lexical, representations of modern language models converge to a brain-like solution.
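The "linear mapping" in such studies is typically a regularized linear regression from network activations to brain responses, scored on held-out data. Here is a toy version with ridge regression on entirely synthetic data (the true weights and noise level are invented), not the study's actual MEG pipeline.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_features = 200, 20

X = rng.normal(size=(n_samples, n_features))        # network activations
true_w = rng.normal(size=n_features)
y = X @ true_w + 0.5 * rng.normal(size=n_samples)   # simulated brain response

# split into train and held-out test sets
Xtr, Xte, ytr, yte = X[:150], X[150:], y[:150], y[150:]

# closed-form ridge regression: w = (X'X + lam I)^-1 X'y
lam = 1.0
w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(n_features), Xtr.T @ ytr)

# "brain score": Pearson correlation between predicted and held-out responses
pred = Xte @ w
r = np.corrcoef(pred, yte)[0, 1]
print(f"held-out brain score (Pearson r) = {r:.3f}")
```

In the study, such scores are computed per brain source and per network layer, which is what allows comparing middle layers against embedding layers.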


Description of images plays an important role in image mining. An image description provides insight into a location, its surroundings and other related information. Different procedures for describing images exist in the literature; however, producing a well-trained description of images remains a tedious task, and several researchers have proposed solutions to this problem using various techniques. Herein, the concept of LSTM is used to generate trained descriptions of images. The process is achieved through encoders and decoders: encoders use max-pooling and convolution, while decoders use recurrent neural networks. The combined encoder-decoder architecture results in trained classifiers that enable reliable description of images. The approach has been implemented on a sample image. It has been found that shortcomings with regard to accuracy, naturalness, missing concepts, insufficient semantics and incomplete descriptions still exist. Hence, it can be inferred that, with a reasonable amount of enhancement to the technique and the use of natural language processing methods, more accurate image descriptions could be achieved.
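The encoder-decoder pipeline described above can be sketched schematically: a convolution plus max-pooling encoder condenses the image into a feature, and a recurrent decoder emits words greedily. Everything below (the vocabulary, the "image", all weights, and the use of a plain RNN cell in place of an LSTM) is an invented placeholder for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ["<start>", "a", "dog", "on", "grass", "<end>"]

img = rng.normal(size=(8, 8))                  # toy single-channel image
kern = rng.normal(size=(3, 3)) * 0.2

# encoder: valid 3x3 convolution followed by global max-pooling
conv = np.array([[(img[i:i + 3, j:j + 3] * kern).sum()
                  for j in range(6)] for i in range(6)])
feat = conv.max()                              # crude global image feature

# decoder: a minimal recurrent cell conditioned on the image feature
hid_dim = 4
Wh = rng.normal(size=(hid_dim, hid_dim)) * 0.3
Wx = rng.normal(size=(hid_dim, len(vocab))) * 0.3
Wo = rng.normal(size=(len(vocab), hid_dim)) * 0.3

h = np.tanh(np.full(hid_dim, feat))            # init state from image feature
word = "<start>"
caption = []
for _ in range(5):
    x = np.eye(len(vocab))[vocab.index(word)]  # one-hot previous word
    h = np.tanh(Wh @ h + Wx @ x)               # recurrent state update
    word = vocab[int((Wo @ h).argmax())]       # greedy word choice
    if word == "<end>":
        break
    caption.append(word)
print("caption:", " ".join(caption))
```

A trained model would use learned weights and an LSTM cell, but the data flow is the same.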


2020 ◽  
Author(s):  
Jose Juan Almagro Armenteros ◽  
Alexander Rosenberg Johansen ◽  
Ole Winther ◽  
Henrik Nielsen

Motivation: Language modelling (LM) on biological sequences is an emergent topic in the field of bioinformatics. Current research has shown that language modelling of proteins can create context-dependent representations that improve performance on different protein prediction tasks. However, little effort has been directed towards analyzing the properties of the datasets used to train language models, and only the performance on cherry-picked downstream tasks is used to assess the capacity of LMs. Results: We analyze the entire UniProt database and investigate the different properties that can bias or hinder the performance of LMs, such as homology, domain of origin, quality of the data, and completeness of the sequence. We evaluate n-gram and Recurrent Neural Network (RNN) LMs to assess the impact of these properties on performance. To our knowledge, this is the first protein dataset compiled with an emphasis on language modelling. Our inclusion of properties specific to proteins gives a detailed analysis of how well natural language processing methods work on biological sequences. We find that organism domain and quality of data affect performance, while the completeness of the proteins has little influence. The RNN-based LM can learn to model Bacteria, Eukarya, and Archaea, but struggles with Viruses. Using the LM, we can also generate novel proteins that are shown to be similar to real proteins. Availability and implementation: https://github.com/alrojo/UniLanguage
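The n-gram baseline in such work treats amino-acid letters like word tokens. Below is a toy bigram model with add-one smoothing evaluated by perplexity; the sequences are invented, not UniProt data.

```python
import math
from collections import defaultdict

train = ["MKTAYIAKQR", "MKLVTAYIAK", "MATKQRLVIA"]   # toy "proteins"
test = "MKTAYI"

alphabet = sorted(set("".join(train)))               # amino-acid letters seen
bigrams, unigrams = defaultdict(int), defaultdict(int)
for seq in train:
    for a, b in zip(seq, seq[1:]):
        bigrams[(a, b)] += 1     # count of transition a -> b
        unigrams[a] += 1         # count of transitions out of a

def logprob(seq):
    lp = 0.0
    for a, b in zip(seq, seq[1:]):
        # add-one (Laplace) smoothing over the amino-acid alphabet
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + len(alphabet))
        lp += math.log2(p)
    return lp

# perplexity = 2^(-average log2 probability per transition)
perplexity = 2 ** (-logprob(test) / (len(test) - 1))
print(f"bigram perplexity on test sequence: {perplexity:.2f}")
```

Dataset properties such as homology between train and test sequences directly inflate or deflate this number, which is why the paper audits them.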


2020 ◽  
Author(s):  
Shoya Wada ◽  
Toshihiro Takeda ◽  
Shiro Manabe ◽  
Shozo Konishi ◽  
Jun Kamohara ◽  
...  

Abstract Background: Pre-training large-scale neural language models on raw texts has been shown to make a significant contribution to transfer learning strategies in natural language processing (NLP). With the introduction of transformer-based language models, such as Bidirectional Encoder Representations from Transformers (BERT), the performance of information extraction from free text by NLP has significantly improved in both the general and the medical domain; however, for languages with few publicly available medical databases of high quality and large size, it is difficult to train medical BERT models that perform well. Method: We introduce a method for training a BERT model on a small medical corpus, both in English and in Japanese. Our proposed method consists of two interventions: simultaneous pre-training, which encourages masked language modeling and next-sentence prediction on the small medical corpus, and amplified vocabulary, which helps the byte-pair-encoded vocabulary of the customized corpus suit the small medical corpus. Moreover, using whole PubMed abstracts, we developed a high-performance English BERT model, Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT), via our method. We then evaluated the performance of our BERT models against publicly available baselines. Results: We confirmed that our Japanese medical BERT outperforms conventional baselines and the other BERT models on a medical-document classification task, and that our English BERT, pre-trained on both general- and medical-domain corpora, performs sufficiently well for practical use on the Biomedical Language Understanding Evaluation (BLUE) benchmark. Moreover, ouBioBERT's total BLUE score is 1.1 points above that of BioBERT and 0.3 points above that of an ablation model trained without our proposed method. Conclusions: Our proposed method makes it feasible to construct practical medical BERT models in both Japanese and English, and it has the potential to produce higher-performing models for biomedical shared tasks.
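The vocabulary side of such methods rests on byte-pair encoding: repeatedly merging the most frequent adjacent symbol pair so that frequent in-domain terms become single vocabulary units. The sketch below shows that core merge loop on a toy medical-looking corpus; the corpus and merge count are placeholders, not the paper's actual amplified-vocabulary procedure.

```python
from collections import Counter

corpus = ["carcinoma", "carcinogen", "cardiology", "cardiac"]
# start from character-level symbol sequences
words = [list(w) for w in corpus]

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the top pair."""
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

merges = []
for _ in range(6):                       # learn 6 merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merged = "".join(pair)
    merges.append(merged)
    for w in words:                      # apply the merge in place
        i = 0
        while i < len(w) - 1:
            if (w[i], w[i + 1]) == pair:
                w[i:i + 2] = [merged]
            else:
                i += 1

print("learned merges:", merges)
```

Running BPE on an in-domain corpus is what lets frequent medical subwords enter the vocabulary despite the corpus being small.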


Author(s):  
Mario Fernando Jojoa Acosta ◽  
Begonya Garcia-Zapirain ◽  
Marino J. Gonzalez ◽  
Bernardo Perez-Villa ◽  
Elena Urizar ◽  
...  

A review of previous works shows this study is the first attempt to analyse the lockdown effect using natural language processing techniques, particularly sentiment analysis methods applied at large scale. It is also the first of its kind to analyse the impact of COVID-19 on the university community, covering both staff and students and taking a multi-country perspective. The main overall findings of this work are that the most frequently related words were family, anxiety, house and life. It has also been shown that staff have a slightly less negative perception of the consequences of COVID-19 in their daily life. We used artificial intelligence models, namely Swivel embeddings and a multilayer perceptron, as classification algorithms; the accuracies reached are 88.8% and 88.5% for students and staff, respectively. The main conclusion of our study is that higher education institutions and policymakers around the world may benefit from these findings when formulating policy recommendations and strategies to support students during this and any future pandemic.
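The classification step can be pictured as a small multilayer perceptron over fixed sentence embeddings. The forward pass below uses random placeholder weights and a random vector standing in for a Swivel sentence embedding; it illustrates the architecture only, not the study's trained model.

```python
import numpy as np

rng = np.random.default_rng(4)
emb_dim, hidden = 16, 8

sentence_emb = rng.normal(size=emb_dim)   # e.g. averaged Swivel word vectors
W1 = rng.normal(size=(hidden, emb_dim)) * 0.3
b1 = np.zeros(hidden)
W2 = rng.normal(size=hidden) * 0.3

h = np.maximum(W1 @ sentence_emb + b1, 0)       # ReLU hidden layer
p_negative = 1 / (1 + np.exp(-(W2 @ h)))        # sigmoid output
label = "negative" if p_negative > 0.5 else "non-negative"
print(f"P(negative sentiment) = {p_negative:.3f} -> {label}")
```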

