Unsupervised Bilingual Lexicon Induction from Mono-Lingual Multimodal Data

Author(s):  
Shizhe Chen ◽  
Qin Jin ◽  
Alexander Hauptmann

Bilingual lexicon induction, translating words from the source language to the target language, is a long-standing natural language processing task. Recent endeavors prove that it is promising to employ images as a pivot to learn the lexicon induction without reliance on parallel corpora. However, these vision-based approaches simply associate words with entire images, which constrains them to translating concrete words and requires object-centered images. Humans understand words better when they appear within a sentence with context. Therefore, in this paper, we propose to utilize images and their associated captions to address the limitations of previous approaches. We propose a multi-lingual caption model trained with different mono-lingual multimodal data to map words in different languages into joint spaces. Two types of word representations are induced from the multi-lingual caption model: linguistic features and localized visual features. The linguistic feature is learned from sentence contexts with visual semantic constraints, which is beneficial for learning translations of words that are less visually relevant. The localized visual feature attends to the region of the image that correlates with the word, which relaxes the requirement for object-centered images to obtain salient visual representations. The two types of features are complementary for word translation. Experimental results on multiple language pairs demonstrate the effectiveness of our proposed method, which substantially outperforms previous vision-based approaches without using any parallel sentences or supervision of seed word pairs.
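
A minimal sketch of how the induced representations could be used for the final lexicon-induction step, assuming the word vectors of both languages have already been mapped into the joint space by the caption model; the function, array names, and toy data are illustrative, not the authors' code.

```python
# Hedged sketch: nearest-neighbour word translation in a joint space.
# src_vecs / tgt_vecs are assumed outputs of the multi-lingual caption
# model (linguistic or visual features, or a weighted sum of both).
import numpy as np

def induce_lexicon(src_vecs, tgt_vecs, tgt_words, k=5):
    """Return the k nearest target words (by cosine similarity)
    for each source word vector in the joint embedding space."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                        # cosine similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]  # best k targets per source word
    return [[tgt_words[j] for j in row] for row in top_k]

# Toy usage: 2 source words, 3 target words, 2-d joint space.
src = np.array([[1.0, 0.1], [0.1, 1.0]])
tgt = np.array([[0.9, 0.2], [0.2, 0.9], [0.7, 0.7]])
print(induce_lexicon(src, tgt, ["chien", "chat", "maison"], k=1))
```

Since the abstract reports that the linguistic and visual features are complementary, `sims` could in practice be a weighted sum of two such similarity matrices, one per feature type.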

Author(s):  
Vittoria Cuteri ◽  
Giulia Minori ◽  
Gloria Gagliardi ◽  
Fabio Tamburini ◽  
Elisabetta Malaspina ◽  
...  

Abstract Purpose Attention has recently been paid to Clinical Linguistics for the detection and support of clinical conditions. Many works have been published on the “linguistic profile” of various clinical populations, but very few papers have been devoted to linguistic changes in patients with eating disorders. Patients with Anorexia Nervosa (AN) share similar psychological features such as disturbances in self-perceived body image, inflexible and obsessive thinking and anxious or depressive traits. We hypothesize that these characteristics can result in altered linguistic patterns and be detected using Natural Language Processing tools. Methods We enrolled 51 young participants from December 2019 to February 2020 (age range: 14–18): 17 girls with a clinical diagnosis of AN, and 34 normal-weighted peers, matched by gender, age and educational level. Participants in each group were asked to produce three written texts (around 10–15 lines long). A rich set of linguistic features was extracted from the text samples and the statistical significance in pinpointing the pathological process was measured. Results Comparison between the two groups showed several linguistic indexes as statistically significant, with syntactic reduction as the most relevant trait of AN productions. In particular, the following features emerge as statistically significant in distinguishing AN girls and their normal-weighted peers: the length of the sentences, the complexity of the noun phrase, and the global syntactic complexity. This peculiar pattern of linguistic erosion may be due to the severe metabolic impairment also affecting the central nervous system in AN. Conclusion These preliminary data showed the existence of linguistic parameters as probable linguistic markers of AN. However, the analysis of a bigger cohort, still ongoing, is needed to consolidate this assumption. Level of evidence III Evidence obtained from case–control analytic studies.
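
As a rough illustration of the kind of analysis described (not the authors' actual pipeline), one could extract a reported feature such as mean sentence length and test the group difference with a non-parametric test; the sample texts below are placeholders.

```python
# Illustrative sketch: one linguistic feature (mean sentence length)
# compared between groups with a Mann-Whitney U test. Tokenisation is
# deliberately naive; the sample texts are placeholders, not real data.
import re
from scipy.stats import mannwhitneyu

def mean_sentence_length(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

an_texts = ["I ate. I slept.", "It was fine. I went home."]   # placeholder AN samples
ctl_texts = ["Yesterday we walked to the old market and bought fresh bread for dinner.",
             "My best friend and I have been planning a long trip across Italy this summer."]

an_scores = [mean_sentence_length(t) for t in an_texts]
ctl_scores = [mean_sentence_length(t) for t in ctl_texts]
stat, p = mannwhitneyu(an_scores, ctl_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```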


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Michael Adjeisah ◽  
Guohua Liu ◽  
Douglas Omwenga Nyabuga ◽  
Richard Nuetey Nortey ◽  
Jinling Song

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains challenging. This research contributes to the domain with a low-resource English-Twi translation study based on filtered synthetic-parallel corpora. It is often difficult to determine what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus of the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on the available parallel corpora demonstrate that injecting a pseudo-parallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach yields substantial gains in BLEU and TER scores.
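
A hedged sketch of the Mahalanobis-distance filter described above, assuming each synthetic sentence pair has already been embedded into a shared vector space; thresholding by percentile is one plausible choice, not necessarily the authors'.

```python
# Sketch: score each synthetic pair by the squared Mahalanobis distance
# of its source/target embedding difference vector, then keep the pairs
# whose distance falls below a percentile threshold.
import numpy as np

def mahalanobis_filter(src_emb, tgt_emb, keep_pct=80):
    diffs = src_emb - tgt_emb                     # (n_pairs, dim)
    mu = diffs.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(diffs, rowvar=False))
    centred = diffs - mu
    # squared Mahalanobis distance per pair
    d2 = np.einsum("ij,jk,ik->i", centred, cov_inv, centred)
    return d2 <= np.percentile(d2, keep_pct)      # boolean mask over pairs
```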


2020 ◽  
Vol 34 (10) ◽  
pp. 13917-13918
Author(s):  
Dean L. Slack ◽  
Mariann Hardey ◽  
Noura Al Moubayed

Contextual word embeddings produced by neural language models, such as BERT or ELMo, have seen widespread application and performance gains across many Natural Language Processing tasks, suggesting that rich linguistic features are encoded in their representations. This work investigates to what extent linguistic hierarchical information is encoded into a single contextual embedding. Using labelled constituency trees, we train simple linear classifiers on top of single contextualised word representations for ancestor sentiment analysis tasks at multiple constituency levels of a sentence. To assess the presence of hierarchical information throughout the networks, the linear classifiers are trained using representations produced by each intermediate layer of BERT and ELMo variants. We show that, with no fine-tuning, a single contextualised representation encodes enough syntactic and semantic sentence-level information to significantly outperform a non-contextual baseline for classifying 5-class sentiment of its ancestor constituents at multiple levels of the constituency tree. Additionally, we show that both LSTM and transformer architectures trained on similarly sized datasets achieve similar levels of performance on these tasks. Future work will expand the analysis to a wider range of NLP tasks and contextualisers.
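
A sketch of this probing setup under stated assumptions: a frozen pre-trained BERT, a single token's contextual embedding taken from one intermediate layer, and a linear classifier on top. The two-sentence dataset, token positions, and labels are toy placeholders.

```python
# Sketch of a linear probe over frozen contextual embeddings.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def token_embedding(sentence, token_index, layer):
    """Contextual embedding of one token from one intermediate layer."""
    with torch.no_grad():
        out = model(**tok(sentence, return_tensors="pt"))
    # out.hidden_states: tuple of (1, seq_len, dim); index 0 is the embedding layer
    return out.hidden_states[layer][0, token_index].numpy()

# Toy data: position 4 is the sentiment-bearing wordpiece after [CLS].
sentences = ["the film was wonderful", "the film was dreadful"]
token_indices = [4, 4]
labels = [4, 0]                                    # toy 5-class sentiment labels

X = [token_embedding(s, i, layer=8) for s, i in zip(sentences, token_indices)]
probe = LogisticRegression(max_iter=1000).fit(X, labels)   # no fine-tuning
```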


2018 ◽  
Author(s):  
Vijay Lingam ◽  
Simran Bhuria ◽  
Mayukh Nair ◽  
Divij Gurpreetsingh ◽  
Anjali Goyal ◽  
...  

Background. Automatic contradiction detection, or conflicting-statement detection, consists of identifying discrepancy, inconsistency and defiance in text, and has several real-world applications in question answering systems, multi-document summarization, dispute detection in news, and detection of contradictions in opinions and sentiments on social media. Automatic contradiction detection is a technically challenging natural language processing problem. Contradiction detection between sources of text or between sentence pairs can be framed as a classification problem. Methods. We propose an approach for detecting three different types of contradiction: negation, antonyms and numeric mismatch. We derive several linguistic features from text and use them in a classification framework for detecting contradictions. The novelty of our approach relative to existing work is the application of artificial neural networks and deep learning. Our approach uses techniques such as Long Short-Term Memory (LSTM) networks and Global Vectors for Word Representation (GloVe). We conduct a series of experiments on three publicly available datasets for contradiction detection: the Stanford, SemEval and PHEME datasets. In addition to the existing datasets, we also create a new dataset and make it publicly available. We measure the performance of our proposed approach using confusion and error matrices and accuracy. Results. We evaluate three feature combinations on our datasets: manual features, LSTM-based features, and the combination of manual and LSTM-based features. The accuracy of our classifier based on both LSTM and manual features on the SemEval dataset is 91.2%; the classifier correctly classified 3204 out of 3513 instances. The accuracy of the same classifier on the Stanford dataset is 71.9%; it correctly classified 855 out of 1189 instances. The accuracy on the PHEME dataset is the highest across all datasets: for the contradiction class it reaches 96.85%. Discussion. Experimental analysis demonstrates encouraging results, supporting our hypothesis that deep learning with LSTM-based features can be used to identify contradictions in text. Our results show an accuracy improvement over manual features after adding LSTM-based features. Accuracy varies across datasets, and we observe different accuracies across the types of contradiction. Feature analysis shows that the discriminatory power of the five features varies.
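
A minimal PyTorch sketch of the kind of model the abstract describes: an LSTM encoder over pre-trained (e.g. GloVe) embeddings, with the two sentence encodings combined for classification. The architecture details and hyperparameters are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ContradictionLSTM(nn.Module):
    """Sentence-pair classifier over LSTM encodings of each sentence."""
    def __init__(self, vocab_size, emb_dim=300, hidden=128, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # GloVe weights can be loaded here
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden * 2, n_classes)   # concatenated pair encoding

    def encode(self, ids):
        _, (h, _) = self.lstm(self.emb(ids))
        return h[-1]                                  # final hidden state per sentence

    def forward(self, sent1_ids, sent2_ids):
        pair = torch.cat([self.encode(sent1_ids), self.encode(sent2_ids)], dim=1)
        return self.out(pair)                         # logits over contradiction types

# Toy usage: batch of 2 sentence pairs, each 12 token ids long.
model = ContradictionLSTM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 12)), torch.randint(0, 10000, (2, 12)))
```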


2021 ◽  
Author(s):  
Vittoria Cuteri ◽  
Giulia Minori ◽  
Gloria Gagliardi ◽  
Fabio Tamburini ◽  
Elisabetta Malaspina ◽  
...  

Abstract Purpose: Attention has recently been paid to Clinical Linguistics for the detection and support of clinical conditions. Many works have been published on the “linguistic profile” of various clinical populations, but very few papers have been devoted to linguistic changes in patients with eating disorders. Patients with Anorexia Nervosa (AN) share similar psychological features such as disturbances in self-perceived body image, inflexible and obsessive thinking and anxious or depressive traits. We hypothesize that these characteristics can result in altered linguistic patterns and be detected using Natural Language Processing tools. Methods: We enrolled 51 young participants from December 2019 to February 2020 (age range: 14-18): 17 girls with a clinical diagnosis of AN, and 34 normal-weighted peers, matched by gender, age and educational level. Participants in each group were asked to produce three written texts (around 10-15 lines long). A rich set of linguistic features was extracted from the text samples and the statistical significance in pinpointing the pathological process was measured. Results: Comparison between the two groups showed several linguistic indexes as statistically significant, with syntactic reduction as the most relevant trait of AN productions. Conclusion: These preliminary data showed the existence of linguistic parameters as probable linguistic markers of AN. However, the analysis of a bigger cohort, still ongoing, is needed to consolidate this assumption. Level of evidence III: Evidence obtained from case-control analytic studies.


2013 ◽  
Vol 1 (1) ◽  
pp. 35-43
Author(s):  
Sandaruwan Prabath Kumara Ranatunga

Authorship verification is the task of determining whether or not a given document was written by a particular author. Its main focus is analyzing the document itself with respect to variations in the author's writing style and identifying the author's own idiolect. Detection performance depends mainly on the feature set used for clustering the documents. Linguistic and stylistic features have been utilized to identify an author according to his or her writing style. Disclosing shallow changes in an author's writing style is the major problem to be addressed in the domain of authorship verification. This motivates computer science researchers to work on authorship verification in the field of computer forensics, and this research also focuses on this problem. The contributions of the research are twofold: the first is a new feature extraction method based on Natural Language Processing (NLP), and the second is a new, more efficient linguistic feature set for verifying the author of a given document. Experiments were carried out on a corpus composed of freely downloadable genuine 19th-century English books, and Self-Organizing Maps were used as the classifier to cluster the documents. Proper word segmentation is also introduced in this work, which helps to demonstrate that the proposed strategy can produce promising results. Finally, we find that the proposed strategy with the extracted linguistic feature set generates more accurate classifications.
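
For illustration, the clustering step could look like the following, using the third-party `minisom` package as a stand-in for the paper's Self-Organizing Map classifier (the authors' implementation is not specified); `doc_features` is a placeholder matrix of extracted linguistic features.

```python
import numpy as np
from minisom import MiniSom

doc_features = np.random.rand(100, 20)        # placeholder (n_docs, n_features) matrix

# A 5x5 map over the linguistic feature space.
som = MiniSom(5, 5, doc_features.shape[1], sigma=1.0, learning_rate=0.5)
som.random_weights_init(doc_features)
som.train_random(doc_features, num_iteration=1000)

# Assign each document to its best-matching SOM node (its cluster).
clusters = [som.winner(x) for x in doc_features]
```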


2019 ◽  
Vol 35 (2) ◽  
pp. 471-484
Author(s):  
Chunlin Wang ◽  
Irene Castellón ◽  
Elisabet Comelles

Abstract Semantic Textual Similarity (STS), which measures the equivalence of meanings between two textual segments, is an important and useful task in Natural Language Processing. In this article, we have analyzed the datasets provided by the Semantic Evaluation (SemEval) 2012–2014 campaigns for this task in order to find appropriate linguistic features for each dataset, taking into account the influence that linguistic features at different levels (e.g. syntactic constituents and lexical semantics) might have on sentence similarity. Results indicate that a linguistic feature may have a different effect on different corpora due to the great differences in sentence structure and vocabulary between datasets. Thus, we conclude that selecting linguistic features according to the genre of the text might be a good strategy for obtaining better results in the STS task. This analysis could be a useful reference for building similarity measurement systems and for tuning linguistic features.
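
As a toy illustration of the per-dataset analysis (an assumption about the workflow, not the authors' code), one could compute a single lexical feature for each sentence pair and measure its correlation with gold similarity scores:

```python
from scipy.stats import pearsonr

def lexical_overlap(s1, s2):
    """Jaccard overlap of the two sentences' token sets."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

# Toy sentence pairs with invented gold scores on the 0-5 SemEval scale.
pairs = [("a man is playing a guitar", "a man plays the guitar"),
         ("a woman is cooking", "a woman is preparing food"),
         ("a dog runs in the park", "the stock market fell sharply")]
gold = [4.5, 4.0, 0.2]

feature = [lexical_overlap(s1, s2) for s1, s2 in pairs]
r, _ = pearsonr(feature, gold)
print(f"Pearson r = {r:.2f}")
```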


2003 ◽  
Vol 139-140 ◽  
pp. 129-152
Author(s):  
Paul Bogaards ◽  
Elisabeth Van Der Linden ◽  
Lydius Nienhuis

The research to be reported on in this paper was originally motivated by the finding that about 70% of the mistakes made by university students when translating from their mother tongue (Dutch) into their foreign language (French) were lexical in nature (NIENHUIS et al. 1989). This was partially confirmed in the investigation described in NIENHUIS et al. (1993). A closer look at the individual errors suggested that many problems were caused by words with more than one meaning which each require different translations in the target language. In the research reported on in this paper, we checked our findings in the light of what is known about the structure of the bilingual lexicon and about the ways bilinguals have access to the elements of their two languages. On the basis of the model of the bilingual lexicon presented by KROLL & SHOLL (1992), an adapted model is proposed for the processing of lexical ambiguity. This leads to a tentative schema of the mental activities that language learners have to perform when they are translating from their mother tongue into a foreign language. The second part of the paper describes two experiments we have carried out in order to find empirical support for such a schema. The last section of the paper contains a discussion of the results obtained as well as the conclusions that can be drawn.


Author(s):  
Una Stojnić

On the received view, the resolution of context-sensitivity is at least partly determined by non-linguistic features of utterance situation. If I say ‘He’s happy’, what ‘he’ picks out is underspecified by its linguistic meaning, and is only fixed through extra-linguistic supplementation: the speaker’s intention, and/or some objective, non-linguistic feature of the utterance situation. This underspecification is exhibited by most context-sensitive expressions, with the exception of pure indexicals, like ‘I.’ While this received view is prima facie appealing, I argue it is deeply mistaken. I defend an account according to which context-sensitivity resolution is governed by linguistic mechanisms determining prominence of candidate resolutions of context-sensitive items. Thus, on this account, the linguistic meaning of a context-sensitive expression fully specifies its resolution in a context, automatically selecting the resolution antecedently set by the prominence-governing linguistic mechanisms.

