English–Welsh Cross-Lingual Embeddings

2021, Vol 11 (14), pp. 6541
Author(s):  
Luis Espinosa-Anke, Geraint Palmer, Padraig Corcoran, Maxim Filimonov, Irena Spasić, ...

Cross-lingual embeddings are vector space representations where word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, in which a word vector space is first learned independently on a monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn the monolingual embeddings: word2vec and fastText. Three cross-language alignment strategies were explored: cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can be increased by approximately 20 percentage points.
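
To make the alignment step concrete, the following is a minimal numpy sketch of one common supervised recipe consistent with the description above: a least-squares (orthogonal Procrustes) mapping fitted on seed-dictionary pairs, followed by CSLS-based retrieval. The toy matrices, dimensionality and neighbourhood size are placeholders, not the paper's actual corpora or settings.

import numpy as np

def procrustes_alignment(X_src, Y_tgt):
    # Orthogonal least-squares mapping W such that X_src @ W approximates Y_tgt
    # (rows are the vectors of dictionary-aligned word pairs).
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

def csls_scores(mapped_src, tgt_emb, k=10):
    # Cross-domain similarity local scaling between every mapped source vector
    # and every target vector (all rows assumed L2-normalised).
    sims = mapped_src @ tgt_emb.T                        # plain cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # mean sim. of k nearest targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # mean sim. of k nearest sources
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

# toy data: 100 dictionary pairs of 50-dimensional normalised vectors
rng = np.random.default_rng(0)
en = rng.normal(size=(100, 50)); cy = rng.normal(size=(100, 50))
en /= np.linalg.norm(en, axis=1, keepdims=True)
cy /= np.linalg.norm(cy, axis=1, keepdims=True)

W = procrustes_alignment(en, cy)                 # fit the mapping on the seed dictionary
best = csls_scores(en @ W, cy).argmax(axis=1)    # CSLS-best Welsh candidate per English word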

Author(s):  
М.А. Дударенко

A multilingual probabilistic topic model based on additive regularization of topic models (ARTM) is proposed that simultaneously exploits a bilingual dictionary and the links between documents of a parallel or comparable collection. Two approaches to incorporating the bilingual dictionary are discussed: the first takes into account only the fact of a connection between word translations, whereas the second learns the translation probabilities within each topic. The quality of the multilingual models is measured on a cross-language search task, in which the query is a document in one language and the search is performed among documents in the other language. It is shown that jointly exploiting word translations from a bilingual dictionary and linked documents improves cross-lingual search compared with models that use only one type of information. A comparison of the different methods for incorporating bilingual dictionaries shows that learning translation probabilities not only improves model quality but also identifies the topical context (a set of topics) in which each word–translation pair is appropriate.
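
As a concrete illustration of the cross-language search evaluation described above, the sketch below ranks documents of one language against a query document of the other by comparing their topic distributions. It assumes the document–topic matrices have already been inferred by the multilingual model; the sizes and synthetic values are placeholders.

import numpy as np

def rank_by_topic_similarity(theta_query, theta_collection):
    # Rank collection documents by cosine similarity of their topic distributions
    # to the topic distribution of the query document.
    q = theta_query / np.linalg.norm(theta_query)
    coll = theta_collection / np.linalg.norm(theta_collection, axis=1, keepdims=True)
    return np.argsort(-(coll @ q))

rng = np.random.default_rng(1)
theta_ru = rng.dirichlet(np.ones(20), size=50)   # 50 Russian documents over 20 shared topics
theta_en = rng.dirichlet(np.ones(20), size=50)   # 50 English documents over the same topics
ranking = rank_by_topic_similarity(theta_ru[0], theta_en)   # English documents ranked for query 0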


Electronics, 2021, Vol 10 (12), pp. 1372
Author(s):  
Sanjanasri JP, Vijay Krishna Menon, Soman KP, Rajendran S, Agnieszka Wolk

Linguists have long focused on qualitative comparison of the semantics of different languages. Evaluating semantic interpretation across a disparate language pair such as English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has opened up an opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generated embeddings to assess their effectiveness, using the original embeddings as ground truths. The transferability of the proposed model to other target languages was assessed using pre-trained Word2Vec embeddings for Hindi and Chinese. We empirically show that with a bilingual dictionary of a thousand words and a small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of the generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.
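
A minimal sketch of the transfer-function idea, assuming pre-computed source and target vectors for the seed dictionary: the mapping is learned as multi-output regression from English vectors to Tamil vectors. The regressor, hidden-layer size and 300-dimensional random vectors are illustrative stand-ins, not the architecture used in the paper.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
en_vecs = rng.normal(size=(1000, 300))   # English vectors for the seed-dictionary words
ta_vecs = rng.normal(size=(1000, 300))   # corresponding Tamil vectors

# learn the English -> Tamil transfer function as multi-output regression
transfer = MLPRegressor(hidden_layer_sizes=(512,), max_iter=300, random_state=0)
transfer.fit(en_vecs, ta_vecs)

# project an unseen English vector into the Tamil embedding space
projected = transfer.predict(rng.normal(size=(1, 300)))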


2020, pp. 1-21
Author(s):  
Clément Dalloux, Vincent Claveau, Natalia Grabar, Lucas Emanuel Silva Oliveira, Claudia Maria Cabral Moro, ...

Automatic detection of negated content is often a prerequisite for information extraction systems in various domains. This task is particularly important in the biomedical domain, where negation plays a central role. In this work, two main contributions are proposed. First, we work with languages that have been poorly addressed up to now: Brazilian Portuguese and French. We therefore developed new corpora for these two languages, manually annotated with negation cues and their scope. Second, we propose automatic methods based on supervised machine learning for detecting negation cues and their scopes. The methods prove to be robust in both languages (Brazilian Portuguese and French) and in cross-domain (general and biomedical language) contexts. The approach is also validated on English data from the state of the art: it yields very good results and outperforms other existing approaches. In addition, the application is accessible and usable online. We expect that these contributions (new annotated corpora, an application accessible online, and cross-domain robustness) will improve the reproducibility of the results and the robustness of NLP applications.
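
One common way to operationalize scope detection is BIO sequence labelling over tokens. The sketch below uses a linear-chain CRF (sklearn-crfsuite) with a tiny hand-made sample, purely to illustrate the task formulation; the feature set, cue list and learner are assumptions, not the authors' exact model.

import sklearn_crfsuite

def token_features(tokens, i):
    # Very small feature template: surface form, cue membership and neighbours.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_cue": tok.lower() in {"no", "not", "without", "sans", "sem"},
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

sentences = [["patient", "without", "fever", "or", "cough"],
             ["no", "evidence", "of", "fracture"]]
labels = [["O", "O", "B-SCOPE", "I-SCOPE", "I-SCOPE"],
          ["O", "B-SCOPE", "I-SCOPE", "I-SCOPE"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)                 # train on the toy annotated sample
print(crf.predict(X))              # predicted BIO scope tags per token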


2016, Vol 68 (4), pp. 448-477
Author(s):  
Dong Zhou, Séamus Lawless, Xuan Wu, Wenyu Zhao, Jianxun Liu

Purpose – With the increase in the amount of multilingual content on the World Wide Web, users often strive to access information provided in a language of which they are not native speakers. The purpose of this paper is to present a comprehensive study of user profile representation techniques and to investigate their use in personalized cross-language information retrieval (CLIR) systems by means of personalized query expansion.
Design/methodology/approach – The user profiles consist of weighted terms computed using frequency-based methods such as tf-idf and BM25, as well as various latent semantic models trained on monolingual documents and cross-lingual comparable documents. The paper also proposes an automatic evaluation method for comparing various user profile generation techniques and query expansion methods.
Findings – Experimental results suggest that latent semantic-weighted user profile representations are superior to frequency-based methods and are particularly suitable for users with a sufficient amount of historical data. The study also confirmed that user profiles represented by latent semantic models trained at the cross-lingual level perform better than models trained at the monolingual level.
Originality/value – Previous studies on personalized information retrieval systems have primarily investigated user profiles and personalization strategies at the monolingual level. The effect of utilizing such monolingual profiles for personalized CLIR remained unclear. The current study fills this gap with a comprehensive study of user profile representation for personalized CLIR and a novel personalized CLIR evaluation methodology that ensures repeatable and controlled experiments can be conducted.
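
The frequency-based profile baseline can be sketched as follows: a tf-idf profile is built from the user's past documents and its top-weighted terms are appended to a new query. The documents, the query and the number of expansion terms are illustrative placeholders rather than the paper's data or settings.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

history = ["cross language information retrieval survey",
           "query expansion with user interaction logs",
           "personalised search over multilingual web documents"]

vectoriser = TfidfVectorizer()
weights = np.asarray(vectoriser.fit_transform(history).mean(axis=0)).ravel()
profile = dict(zip(vectoriser.get_feature_names_out(), weights))   # weighted term profile

def expand_query(query, profile, n_terms=3):
    # Append the n highest-weighted profile terms not already present in the query.
    extra = [t for t, _ in sorted(profile.items(), key=lambda kv: -kv[1])
             if t not in query.split()][:n_terms]
    return query + " " + " ".join(extra)

print(expand_query("multilingual retrieval", profile))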


Author(s):  
Mokhtar Al-Suhaiqi, Muneer A. S. Hazaa, Mohammed Albared

Due to the rapid growth of research articles in various languages, the cross-lingual plagiarism detection problem has received increasing interest in recent years. Cross-lingual plagiarism detection is a more challenging task than monolingual plagiarism detection. This paper addresses the problem of cross-lingual plagiarism detection (CLPD) by proposing a method that combines keyphrase extraction, monolingual detection methods and a machine learning approach. The research methodology used in this study facilitated accomplishing the objectives of designing, developing, and implementing an efficient Arabic–English cross-lingual plagiarism detection system. The paper empirically evaluates five monolingual plagiarism detection methods, namely (i) n-gram similarity, (ii) longest common subsequence, (iii) Dice coefficient, (iv) fingerprint-based Jaccard similarity and (v) fingerprint-based containment similarity. In addition, three machine learning approaches, namely (i) naïve Bayes, (ii) support vector machine (SVM), and (iii) linear logistic regression classifiers, are used for Arabic–English cross-language plagiarism detection. Several experiments were conducted to evaluate the performance of the keyphrase extraction methods, and several more to determine which machine learning technique works best for Arabic–English cross-language plagiarism detection. In these experiments, the highest result was obtained using the SVM classifier, with an F-measure of 92%. In addition, all classifiers achieved their highest results when most of the monolingual plagiarism detection methods were used.
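
The monolingual overlap measures listed above are straightforward to state. The sketch below computes Jaccard, Dice and containment similarity over word n-gram fingerprints, the kind of scores that could then be fed to a classifier such as an SVM; the sentence pair is invented for illustration.

def ngrams(text, n=3):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def containment(a, b):
    # share of the suspicious document's fingerprints found in the source document
    return len(a & b) / len(a) if a else 0.0

suspicious = "the economy grew rapidly during the last decade"
source_doc = "the economy grew rapidly over the last ten years"
a, b = ngrams(suspicious), ngrams(source_doc)
features = [jaccard(a, b), dice(a, b), containment(a, b)]   # e.g. feature vector for an SVM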


Author(s):  
Elena Tribushinina, Mila Irmawati, Pim Mak

There is no agreement regarding the relationship between narrative abilities in the two languages of a bilingual child. In this paper, we test the hypothesis that such cross-language relationships depend on age and language exposure by studying the narrative skills of 32 Indonesian-Dutch bilinguals (mean age: 8;5, range: 5;0–11;9). The narratives were elicited by means of the Multilingual Assessment Instrument for Narratives (MAIN) and analysed for story structure, episodic complexity and use of internal state terms (ISTs) in the home language (Indonesian) and majority language (Dutch). The results demonstrate that story structure scores in the home language (but not in the majority language) were positively related to age. Exposure measures (current Dutch/Indonesian input, current richness of Dutch/Indonesian input, and length of exposure to Dutch) did not predict the macrostructure scores. There was a significant positive cross-language relationship in story structure and episodic complexity, and this relationship became stronger as a function of length of exposure to Dutch. There was also a positive cross-lingual relation in IST use, but it became weaker with age. The results support the idea that narrative skills are transferable between languages and suggest that cross-language relationships may interact with age and exposure factors in differential ways.


Author(s):  
Shu Jiang, Zuchao Li, Hai Zhao, Bao-Liang Lu, Rui Wang

In recent years, research on dependency parsing has focused on improving accuracy on domain-specific (in-domain) test datasets and has made remarkable progress. However, there are innumerable scenarios in the real world that are not covered by such datasets, namely, out-of-domain data. As a result, parsers that perform well on in-domain data usually suffer significant performance degradation on out-of-domain data. Therefore, to adapt existing high-performing in-domain parsers to a new domain, cross-domain transfer learning methods are essential for solving the domain problem in parsing. This paper examines two scenarios for cross-domain transfer learning: semi-supervised and unsupervised. Specifically, we adopt a pre-trained language model, BERT, for training on the source-domain (in-domain) data at the subword level and introduce self-training methods derived from tri-training for both scenarios. The evaluation results on the NLPCC-2019 shared task and the universal dependency parsing task indicate the effectiveness of the adopted approaches for cross-domain transfer learning and show the potential of self-training for cross-lingual transfer learning.
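
The self-training idea can be reduced to a generic pseudo-labelling loop. The sketch below uses a scikit-learn classifier on synthetic data as a stand-in for the BERT-based parser, so it illustrates the loop only, not tri-training or the parsing model itself.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_src, y_src = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)   # labelled source domain
X_tgt = rng.normal(size=(300, 10))                                   # unlabelled target domain

model = LogisticRegression(max_iter=1000).fit(X_src, y_src)          # initial in-domain model
for _ in range(3):                                    # a few self-training rounds
    proba = model.predict_proba(X_tgt)
    keep = proba.max(axis=1) > 0.9                    # retain only confident pseudo-labels
    X_train = np.vstack([X_src, X_tgt[keep]])
    y_train = np.concatenate([y_src, proba[keep].argmax(axis=1)])
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # retrain on the union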


2019, Vol 9 (16), pp. 3318
Author(s):  
Azmat Anwar, Xiao Li, Yating Yang, Yajuan Wang

Although considerable effort has been devoted to building commonsense knowledge bases (CKBs), they are still not available for many low-resource languages such as Uyghur because of the expensive construction cost. Focusing on this issue, we propose a cross-lingual knowledge-projection method to construct an Uyghur CKB by projecting ConceptNet’s Chinese facts into Uyghur. We used a Chinese–Uyghur bilingual dictionary to obtain high-quality entity translations for the facts and employed a back-translation method to eliminate entity-translation ambiguity. Moreover, to tackle the relation ambiguity within translated facts, we used hand-crafted rules to convert the structured facts into natural-language phrases and matched the Chinese–Uyghur phrase pairs according to the scores of a bilingual semantic similarity model. Experimental results show that the accuracy of our semantic similarity scoring model reached 94.75% on this task, and that we successfully projected 55,872 Chinese facts into Uyghur, obtaining 67,375 Uyghur facts within a very short period.
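
A minimal sketch of the back-translation filter for entity-translation ambiguity: a candidate Uyghur translation is kept only if translating it back into Chinese recovers the source entity. The toy dictionaries below are placeholders for the real Chinese–Uyghur bilingual dictionary or translation system.

# Toy dictionaries standing in for the real Chinese–Uyghur resources.
zh_to_ug = {"苹果": ["alma"], "书": ["kitab", "depter"]}
ug_to_zh = {"alma": "苹果", "kitab": "书", "depter": "笔记本"}

def project_entity(zh_entity):
    # Keep only Uyghur candidates whose back-translation recovers the source entity.
    return [ug for ug in zh_to_ug.get(zh_entity, [])
            if ug_to_zh.get(ug) == zh_entity]

print(project_entity("书"))   # -> ['kitab']; 'depter' is filtered out by back-translation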


Author(s):  
Alejandro Moreo Fernández, Andrea Esuli, Fabrizio Sebastiani

Domain Adaptation (DA) techniques aim at enabling machine learning methods to learn effective classifiers for a “target” domain when the only available training data belongs to a different “source” domain. In this extended abstract, we briefly describe our new DA method called Distributional Correspondence Indexing (DCI) for sentiment classification. DCI derives term representations in a vector space common to both domains, where each dimension reflects the term's distributional correspondence to a pivot, i.e., to a highly predictive term that behaves similarly across domains. The experiments we have conducted show that DCI obtains better performance than current state-of-the-art techniques for cross-lingual and cross-domain sentiment classification.
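
The core of DCI can be sketched in a few lines: each term is re-represented by its correspondence, here plain cosine over term-document counts, to a handful of pivot terms. The corpus, the pivots and the correspondence function are simplified assumptions rather than the authors' exact formulation.

import numpy as np

docs = ["great plot great acting", "boring plot bad acting",
        "great battery great screen", "bad battery poor screen"]
vocab = sorted({w for d in docs for w in d.split()})
counts = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

def dci_profile(term, pivots):
    # Vector of cosine correspondences between a term and each pivot term.
    t = counts[vocab.index(term)]
    return np.array([t @ counts[vocab.index(p)] /
                     (np.linalg.norm(t) * np.linalg.norm(counts[vocab.index(p)]))
                     for p in pivots])

pivots = ["great", "bad"]                    # highly predictive, domain-general pivot terms
print(dci_profile("acting", pivots))         # a domain-specific term re-expressed in pivot space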


Author(s):  
E. Yu. Vakhromova, I. V. Beketova, A. A. Gerasimenko, V. I. Goremychkin, V. P. Krivoshlyapov

An algorithm for cross-language fuzzy search based on hash vectors is proposed for the automatic matching of personal names. Given an input query consisting of names in Latin spelling and a threshold value for the similarity measure, the algorithm determines the set of matching Cyrillic names contained in the database of the information retrieval system. The principal feature of the proposed algorithm is that it avoids direct translation of personal names. Instead, names are hashed and then mapped into the same hidden vector space in which the computational procedures of the decision-making system operate. In the course of the research, a number of intermediate tasks were solved. In particular, the algorithms for decomposing the explored database and for generating and clustering the dictionary of basic morphemes are of independent value for the problem of automatically rendering names from a foreign language whose translation rules are unknown, the so-called generalized transcription. After names are mapped into a vector space, the matching operation reduces to assessing the similarity between vectors. Several quantities were considered as similarity measures in the study; the most convenient is the cosine similarity, whose critical value was obtained by plotting the FMR (False Match Rate) and FNMR (False Non-Match Rate) graphs. The developed algorithm is universal with respect to the languages used, that is, it does not depend on a specific alphabet. In the practical implementation of the algorithm, a series of experimental studies was carried out using a database containing more than 2.5 million names.
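
The hash-vector machinery can be sketched as follows: a name is folded into a fixed-length vector of hashed character trigrams, and two names are declared a match when the cosine similarity of their vectors exceeds a threshold. In the paper the vectors live in a learned shared space and the threshold is derived from the FMR/FNMR curves; the dimensionality, the trigram hashing and the threshold below are simplified placeholders.

import hashlib
import numpy as np

DIM = 256

def hash_vector(name):
    # Fold hashed character trigrams of the (padded, lower-cased) name into a fixed-length vector.
    v = np.zeros(DIM)
    padded = f"^{name.lower()}$"
    for i in range(len(padded) - 2):
        idx = int(hashlib.md5(padded[i:i + 3].encode("utf-8")).hexdigest(), 16) % DIM
        v[idx] += 1.0
    return v / np.linalg.norm(v)

def is_match(name_a, name_b, threshold=0.6):
    # Declare a match when the cosine similarity of the hash vectors exceeds the threshold.
    return float(hash_vector(name_a) @ hash_vector(name_b)) >= threshold

print(is_match("ivanov", "ivanoff"))   # True: the spellings share most trigrams
print(is_match("ivanov", "petrov"))    # False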

