Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language

Informatics ◽  
2019 ◽  
Vol 6 (2) ◽  
pp. 19 ◽  
Author(s):  
Rajat Pandit ◽  
Saptarshi Sengupta ◽  
Sudip Kumar Naskar ◽  
Niladri Sekhar Dash ◽  
Mohini Mohan Sardar

Semantic similarity is a long-standing problem in natural language processing (NLP). It is a topic of great interest, as understanding it can provide insight into how human beings comprehend meaning and make associations between words. However, when this problem is viewed from the perspective of machine understanding, particularly for under-resourced languages, it poses a different problem altogether. In this paper, semantic similarity is explored in Bangla, a less-resourced language. To ameliorate the situation in such languages, the most rudimentary method (path-based) and the latest state-of-the-art method (Word2Vec) for semantic similarity calculation were augmented using cross-lingual resources in English, and the results obtained are striking. Two semantic similarity approaches were explored in Bangla, namely the path-based and distributional models, and their cross-lingual counterparts were synthesized in light of the English WordNet and corpora. The proposed methods were evaluated on a dataset comprising 162 Bangla word pairs annotated by five expert raters. The correlation scores obtained between the four metrics and the human evaluation scores demonstrate the marked improvement that the cross-lingual approach brings to semantic similarity calculation for Bangla.
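The path-based measure mentioned above scores a word pair by the length of the shortest path between the corresponding nodes in a taxonomy such as WordNet. A minimal sketch follows; the tiny hypernym graph is hypothetical, standing in for the actual WordNet used in the paper.

```python
from collections import deque

# Toy hypernym graph standing in for WordNet (hypothetical entries);
# each word maps to its direct hypernyms.
EDGES = {
    "dog": ["canine"], "canine": ["mammal"],
    "cat": ["feline"], "feline": ["mammal"],
    "mammal": ["animal"], "animal": [],
}

def shortest_path(a, b):
    """Breadth-first search over the undirected taxonomy graph."""
    graph = {}
    for node, parents in EDGES.items():
        graph.setdefault(node, set()).update(parents)
        for p in parents:
            graph.setdefault(p, set()).add(node)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path: words are unrelated in this taxonomy

def path_similarity(a, b):
    """Classic path measure: 1 / (1 + length of shortest path)."""
    d = shortest_path(a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)
```

With this graph, "dog" and "cat" are four edges apart (dog–canine–mammal–feline–cat), so their path similarity is 1/5.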

PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0246751
Author(s):  
Ponrudee Netisopakul ◽  
Gerhard Wohlgenannt ◽  
Aleksei Pulich ◽  
Zar Zar Hlaing

Research into semantic similarity has a long history in lexical semantics, and it has applications in many natural language processing (NLP) tasks such as word sense disambiguation and machine translation. The task of calculating semantic similarity is usually presented in the form of datasets that contain word pairs and a human-assigned similarity score. Algorithms are then evaluated by their ability to approximate the gold-standard similarity scores. Many such datasets, with different characteristics, have been created for the English language. Recently, four of those were transformed into Thai-language versions, namely WordSim-353, SimLex-999, SemEval-2017-500, and R&G-65. Given those four datasets, in this work we aim to improve on the previous baseline evaluations for Thai semantic similarity and to address the challenges of unsegmented Asian languages (particularly the high fraction of out-of-vocabulary (OOV) dataset terms). To this end we apply and integrate different strategies to compute similarity, including traditional word-level embeddings, subword-unit embeddings, and ontological or hybrid sources such as WordNet and ConceptNet. With our best model, which combines self-trained fastText subword embeddings with ConceptNet Numberbatch, we managed to raise the state of the art, measured by the harmonic mean of Pearson and Spearman ρ, by a large margin: from 0.356 to 0.688 for TH-WordSim-353, from 0.286 to 0.769 for TH-SemEval-500, from 0.397 to 0.717 for TH-SimLex-999, and from 0.505 to 0.901 for TWS-65.
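The evaluation score used above, the harmonic mean of Pearson and Spearman ρ between model scores and the human gold standard, can be reproduced in plain Python; the helpers below assume distinct scores (no tie handling in the ranking).

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman rho = Pearson correlation over ranks (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

def harmonic_mean(a, b):
    """Combined score: 2ab / (a + b)."""
    return 2 * a * b / (a + b)
```

A perfectly linear relation yields Pearson 1.0, while any monotone relation yields Spearman 1.0.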


2020 ◽  
Vol 34 (05) ◽  
pp. 9725-9732
Author(s):  
Xiaorui Zhou ◽  
Senlin Luo ◽  
Yunfang Wu

In reading comprehension, generating sentence-level distractors is a significant task that requires a deep understanding of the article and question. Traditional entity-centered methods can generate only word-level or phrase-level distractors. Although recently proposed neural methods such as the sequence-to-sequence (Seq2Seq) model show great potential for generating creative text, previous neural methods for distractor generation ignore two important aspects. First, they did not model the interactions between the article and the question, so the generated distractors tend to be too general or not relevant to the question context. Second, they did not emphasize the relationship between the distractor and the article, so the generated distractors are not semantically relevant to the article and thus fail to form a set of meaningful options. To solve the first problem, we propose a co-attention enhanced hierarchical architecture that better captures the interactions between the article and the question, thus guiding the decoder to generate more coherent distractors. To alleviate the second problem, we add a semantic similarity loss that pushes the generated distractors to be more relevant to the article. Experimental results show that our model outperforms several strong baselines on automatic metrics, achieving state-of-the-art performance. Further human evaluation indicates that our generated distractors are more coherent and more educative than those generated by the baselines.
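One plausible reading of the additional semantic similarity loss is a penalty that grows as the distractor embedding drifts away from the article embedding. The sketch below uses cosine similarity and a hypothetical weight `lam`; it is an illustration of the idea, not the authors' exact formulation.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def total_loss(gen_loss, distractor_vec, article_vec, lam=0.5):
    """Generation loss plus a penalty that grows as the distractor
    embedding drifts away from the article embedding."""
    return gen_loss + lam * (1.0 - cosine(distractor_vec, article_vec))
```

When the two embeddings coincide the penalty vanishes and only the generation loss remains; orthogonal embeddings incur the full `lam` penalty.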


2018 ◽  
Vol 18 (1) ◽  
pp. 18-24
Author(s):  
Sri Reski Anita Muhsini

The implementation of semantic similarity measurement plays a very important role in several areas of Natural Language Processing (NLP), where its results often serve as the basis for further NLP tasks. One application is measuring cross-lingual semantic similarity between words. This measurement is motivated by the fact that many information retrieval systems today must deal with multilingual texts or documents. A pair of words is said to be semantically similar if the pair shares the same meaning or concept. In this study, semantic similarity between words was computed across two different languages, English and Spanish. The corpus used was the Europarl Parallel Corpus in English and Spanish. Word contexts were drawn from the Swadesh list, and the resulting semantic similarity scores were compared against the SemEval 2017 Crosslingual Semantic Similarity gold-standard dataset to measure their correlation. The results show that the PMI method achieved a Pearson correlation of 0.5781 and a Spearman correlation of 0.5762. It can be concluded that the implementation of cross-lingual semantic similarity measurement using the Pointwise Mutual Information (PMI) method produced the best correlation. The researchers recommend that future work use other datasets to assess how effective the Pointwise Mutual Information (PMI) method is at measuring cross-lingual semantic similarity between words.
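A minimal sketch of sentence-level PMI over an aligned English-Spanish corpus, in the spirit of the method described above; the three-sentence corpus is a hypothetical stand-in for Europarl.

```python
import math

# Hypothetical three-sentence aligned corpus standing in for Europarl.
PAIRS = [
    ("the dog barks", "el perro ladra"),
    ("the dog runs", "el perro corre"),
    ("the cat sleeps", "el gato duerme"),
]

def pmi(en_word, es_word, pairs):
    """PMI from sentence-level co-occurrence in an aligned corpus:
    log2( p(en, es) / (p(en) * p(es)) )."""
    n = len(pairs)
    c_en = sum(en_word in en.split() for en, _ in pairs)
    c_es = sum(es_word in es.split() for _, es in pairs)
    c_both = sum(en_word in en.split() and es_word in es.split()
                 for en, es in pairs)
    if 0 in (c_en, c_es, c_both):
        return float("-inf")  # never co-occur: no evidence of relatedness
    return math.log2((c_both / n) / ((c_en / n) * (c_es / n)))
```

Here "dog" and "perro" always co-occur, giving PMI log2(3/2) ≈ 0.585, while "dog" and "gato" never do.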


Author(s):  
Toluwase Victor Asubiaro ◽  
Ebelechukwu Gloria Igwe

African languages, including those native to Nigeria, are low-resource languages because they lack basic computing resources such as language-dependent hardware keyboards. Speakers of these low-resource languages are therefore unfairly deprived of information access on the internet. There is no information about the level of progress that has been made on the computation of Nigerian languages. Hence, this chapter presents a state-of-the-art review of Nigerian-language natural language processing. The review reveals that only four Nigerian languages, Hausa, Ibibio, Igbo, and Yoruba, have been significantly studied in published NLP papers. Creating alternatives to the hardware keyboard is one of the most popular research areas, and means such as automatic diacritics restoration, virtual keyboards, and optical character recognition have been explored. There was also an inclination towards speech processing and computational morphological analysis. Resource development and knowledge-representation modeling of the languages using rapid resource development and cross-lingual methods are recommended.


2020 ◽  
Vol 2020 ◽  
pp. 1-15
Author(s):  
Mingfu Xue ◽  
Chengxiang Yuan ◽  
Jian Wang ◽  
Weiqiang Liu

Recently, natural language processing (NLP)-based intelligent question and answer (Q&A) robots have come into ubiquitous use. However, the robustness and security of current Q&A robots are still unsatisfactory; e.g., a slight typo in the user's question may leave the Q&A robot unable to return the correct answer. In this paper, we propose a fast and automatic test-dataset generation method for evaluating the robustness and security of current Q&A robots, which works in black-box scenarios and thus can be applied to a variety of different Q&A robots. Specifically, we propose a dependency parse-based adversarial examples generation (DPAEG) method for Q&A robots. DPAEG first uses the proposed dependency parse-based keyword extraction algorithm to extract keywords from a question. Then, the proposed algorithm generates adversarial words from the extracted keywords, including typos and words spelled similarly to the keywords. Finally, these adversarial words are used to generate a large number of adversarial questions. The generated adversarial questions, which are similar to the original questions, do not affect human understanding, but the Q&A robots cannot answer them correctly. Moreover, the proposed method works in a black-box scenario, meaning it does not require knowledge of the target Q&A robots. Experimental results show that the generated adversarial examples have a high success rate against two state-of-the-art Q&A robots, DrQA and Google Assistant. In addition, the generated adversarial examples affect not only the correct (top-1) answer returned by DrQA but also the top-k candidate answers: they make the top-k candidates contain fewer correct answers and make the correct answers rank lower among them.
The human evaluation results show that participants of different genders, ages, and mother tongues can understand the meaning of most of the generated adversarial examples, which means that the generated adversarial examples do not affect human understanding.
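The typo-style adversarial words described above can be illustrated with simple character-level perturbations. This is a generic sketch, not the authors' DPAEG algorithm; it assumes the keywords have already been extracted.

```python
def typo_variants(word):
    """Character-level perturbations: single deletions and adjacent swaps."""
    deletions = [word[:i] + word[i + 1:] for i in range(len(word))]
    swaps = [word[:i] + word[i + 1] + word[i] + word[i + 2:]
             for i in range(len(word) - 1)]
    return sorted(set(deletions + swaps) - {word})

def adversarial_questions(question, keywords):
    """Substitute each keyword with each of its typo variants in turn."""
    return [question.replace(kw, variant)
            for kw in keywords for variant in typo_variants(kw)]
```

For example, perturbing the keyword "capital" yields variants such as "captal" and "acpital", each producing one adversarial question that a human still reads correctly.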


2020 ◽  
Vol 34 (05) ◽  
pp. 7935-7943
Author(s):  
Wataru Hirota ◽  
Yoshihiko Suhara ◽  
Behzad Golshan ◽  
Wang-Chiew Tan

We present Emu, a system that semantically enhances multilingual sentence embeddings. Our framework fine-tunes pre-trained multilingual sentence embeddings using two main components: a semantic classifier and a language discriminator. The semantic classifier improves the semantic similarity of related sentences, whereas the language discriminator enhances the multilinguality of the embeddings via multilingual adversarial training. Our experimental results based on several language pairs show that our specialized embeddings outperform the state-of-the-art multilingual sentence embedding model on the task of cross-lingual intent classification using only monolingual labeled data.


2018 ◽  
Vol 15 (3) ◽  
pp. 487-499 ◽  
Author(s):  
Hai-Tao Zheng ◽  
Jinxin Han ◽  
Jinyuan Chen ◽  
Arun Sangaiah

Automatic question generation from text or paragraphs is a challenging task that attracts broad attention in natural language processing. Because of verbose texts and fragile ranking methods, the quality of the top generated questions is poor. In this paper, we present a novel framework, Automatic Chinese Question Generation (ACQG), to generate questions from text or paragraphs. In ACQG, we use an adapted TextRank to extract key sentences and a template-based method to construct questions from those key sentences. A multi-feature neural network model is then built to rank the constructed questions and obtain the top ones. The automatic evaluation results reveal that the proposed framework outperforms state-of-the-art systems in terms of perplexity. In human evaluation, questions generated by ACQG received higher scores.
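The key-sentence extraction step rests on TextRank. A minimal power-iteration sketch over a precomputed sentence-similarity matrix (hypothetical weights, the usual damping factor d = 0.85) might look like this:

```python
def textrank(sim, iters=50, d=0.85):
    """Power iteration over a precomputed sentence-similarity matrix;
    sim[j][i] is the similarity weight of the edge from j to i."""
    n = len(sim)
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - d) / n + d * sum(
                sim[j][i] / max(sum(sim[j]), 1e-9) * scores[j]
                for j in range(n) if j != i)
            for i in range(n)
        ]
    return scores  # higher score = more central sentence
```

On a toy matrix where sentence 0 is similar to both others while sentences 1 and 2 are similar only to sentence 0, sentence 0 receives the highest score and would be selected as a key sentence.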


2021 ◽  
pp. 1-23
Author(s):  
Yerai Doval ◽  
Jose Camacho-Collados ◽  
Luis Espinosa-Anke ◽  
Steven Schockaert

Word embeddings have become a standard resource in the toolset of any Natural Language Processing practitioner. While monolingual word embeddings encode information about words in the context of a particular language, cross-lingual embeddings define a multilingual space in which word embeddings from two or more languages are integrated together. Current state-of-the-art approaches learn these embeddings by aligning two disjoint monolingual vector spaces through an orthogonal transformation that preserves the structure of the monolingual counterparts. In this work, we propose to apply an additional transformation after this initial alignment step, which aims to bring the vector representations of a given word and its translations closer to their average. Since this additional transformation is non-orthogonal, it also affects the structure of the monolingual spaces. We show that our approach improves both the integration of the monolingual spaces and the quality of those spaces themselves. Furthermore, because our transformation can be applied to an arbitrary number of languages, we are able to effectively obtain a truly multilingual space. The resulting (monolingual and multilingual) spaces show consistent gains over the current state of the art in standard intrinsic tasks, namely dictionary induction and word similarity, as well as in extrinsic tasks such as cross-lingual hypernym discovery and cross-lingual natural language inference.
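The target of the additional transformation, pulling a word and its translation toward their average, can be illustrated directly. The sketch below simply replaces both vectors with their midpoint; the paper instead learns a non-orthogonal map toward these averages, which this toy version does not attempt.

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5
                  * sum(b * b for b in v) ** 0.5)

def meet_in_the_middle(u, v):
    """Replace a word vector and its translation with their average,
    the target toward which the non-orthogonal map is trained."""
    mid = [(a + b) / 2 for a, b in zip(u, v)]
    return mid, mid
```

Two orthogonal vectors (cosine 0) collapse onto the same midpoint (cosine 1), showing how the averaging target maximally tightens a translation pair.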


Information ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 409
Author(s):  
Panagiotis Kasnesis ◽  
Lazaros Toumanidis ◽  
Charalampos Z. Patrikakis

The widespread use of social networks has brought to the foreground a very important issue: the veracity of the information circulating within them. Many natural language processing methods have been proposed in the past to assess a post's content with respect to its reliability; however, end-to-end approaches cannot yet match human ability. To overcome this, in this paper we propose a more modular approach that produces indicators of a post's subjectivity and of the stance expressed by the replies it has received to date, letting users decide for themselves whether to trust the provided information. To this end, we fine-tuned state-of-the-art transformer-based language models and compared their performance with previous related work on stance detection and subjectivity analysis. Finally, we discuss the obtained results.


2021 ◽  
Vol 7 ◽  
pp. e664
Author(s):  
Md. Mushfiqur Rahman ◽  
Thasin Abedin ◽  
Khondokar S.S. Prottoy ◽  
Ayana Moshruba ◽  
Fazlul Hasan Siddiqui

Video captioning, i.e., the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. The task of generating a semantically accurate description of a video is quite complex. Considering the complexity of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers: one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches, "stacked attention" and "spatial hard pull". As there are no metrics designed exclusively for evaluating video captioning models, we emphasize both quantitative and qualitative analysis of our model. Hence, we have used the BLEU scoring metric for quantitative analysis and have proposed a human evaluation metric for qualitative analysis, namely the Semantic Sensibility (SS) scoring metric, which overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.

