Regularization of multilingual topic models

Author(s):  
М.А. Дударенко

A multilingual probabilistic topic model is proposed that simultaneously takes into account a bilingual dictionary and the links between documents of a parallel or comparable collection. The two kinds of information are combined using additive regularization of topic models (ARTM). Two ways of using the bilingual dictionary are proposed: the first takes into account only the fact that two words are translations of each other, whereas the second learns translation probabilities within each topic. The quality of the multilingual models is measured on a cross-language search task, in which the query is a document in one language and the search is performed among documents in another language. It is shown that jointly using word translations from a bilingual dictionary and linked documents improves cross-language search quality compared to models that use only one type of information. A comparison of the methods for incorporating bilingual dictionaries shows that estimating translation probabilities not only improves the quality of the model but also makes it possible to determine, for each word-translation pair, a context (a set of topics) in which the translation is appropriate.
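The cross-language search task described above can be illustrated with a minimal sketch. It assumes that every document, whatever its language, has already been mapped to a topic distribution by some multilingual topic model, and that candidates are ranked by cosine similarity to the query. The similarity measure, the function names, and the toy vectors are assumptions made for illustration; the abstract does not state which measure the author used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query_theta, candidate_thetas):
    """Return candidate ids ordered from most to least similar to the query.

    query_theta: topic distribution of the query document (language A).
    candidate_thetas: dict mapping document id -> topic distribution
        of a document in language B (hypothetical data layout).
    """
    scored = [(cosine(query_theta, theta), doc_id)
              for doc_id, theta in candidate_thetas.items()]
    # Sort by decreasing similarity; break ties by id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Toy topic distributions over three shared topics (made-up numbers).
query = [0.7, 0.2, 0.1]
candidates = {
    "ru_1": [0.1, 0.1, 0.8],
    "ru_2": [0.6, 0.3, 0.1],
    "ru_3": [0.3, 0.4, 0.3],
}
ranking = rank_candidates(query, candidates)
```

In this toy setup the candidate whose topic profile is closest to the query is ranked first; a retrieval quality score could then be derived from the rank at which the true linked document appears.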

Literator ◽  
2016 ◽  
Vol 37 (1) ◽  
Author(s):  
Ketiwe Ndhlovu

The development of African languages into languages of science and technology is dependent on action being taken to promote the use of these languages in specialised fields such as technology, commerce, administration, media, law, science and education among others. One possible way of developing African languages is the compilation of specialised dictionaries (Chabata 2013). This article explores how parallel corpora can be interrogated using a bilingual concordancer (ParaConc) to extract bilingual terminology that can be used to create specialised bilingual dictionaries. An English–Ndebele Parallel Corpus was used as a resource and through ParaConc, an alphabetic list was compiled from which headwords and possible translations were sought. These translations provided possible terms for entry in a bilingual dictionary. The frequency feature and ‘hot words’ tool in ParaConc were used to determine the suitability of terms for inclusion in the dictionary and for identifying possible synonyms, respectively. Since parallel corpora are aligned and data are presented in context (Key Word in Context), it was possible to draw examples showing how headwords are used. Using this approach produced results quickly and accurately, whilst minimising the process of translating terms manually. It was noted that the quality of the dictionary is dependent on the quality of the corpus, hence the need for creating a representative and clean corpus needs to be emphasised. Although technology has multiple benefits in dictionary making, the research underscores the importance of collaboration between lexicographers, translators, subject experts and target communities so that representative dictionaries are created.


Babel ◽  
1999 ◽  
Vol 45 (2) ◽  
pp. 107-126 ◽  
Author(s):  
Dionysis Goutsos

Abstract Greek bilingual dictionaries have long been marked by lack of naturalness and inadequate semantic and stylistic discrimination between the various equivalents suggested in translation. Although this is a general problem of bilingual dictionaries, which necessarily deal with decontextualized instances of language in the construction of the lemma, translationese is common in English-Greek dictionaries as a result of the idiosyncratic history of Greek applied linguistic practice. The paper discusses issues of translation equivalence that came into view in the editing of the new Collins English-Greek Dictionary (1997). Specific problems relating to the translation from English to Greek are pointed out, with reference to the areas of lexical, grammatical and discourse equivalence. In particular, the occurrence of 'false friends' and register couplets, the categories of definiteness, countability and verb aspect and the varying Theme-Rheme structures constitute points of divergence between the two languages. The word-for-word translation of these linguistic aspects is mainly responsible for the lack of naturalness. Dictionary editing involves a multitude of detailed decisions along these parameters, which shape the lemmas and influence the quality of the final text. The help from both English and Greek corpora has been indispensable in defining the parameters of naturalness for each lemma and in solving problems specific to Greek bilingual lexicography.


2021 ◽  
Author(s):  
Yue Niu ◽  
Hongjie Zhang

With the growth of the internet, short texts such as tweets from Twitter, news titles from RSS feeds, or comments from Amazon have become very prevalent. Many tasks need to retrieve information hidden in the content of short texts, so ontology learning methods have been proposed for retrieving structured information. A topic hierarchy is a typical ontology that consists of concepts and taxonomy relations between concepts. Current hierarchical topic models are not specially designed for short texts. These methods use word co-occurrence to construct concepts and general-special word relations to construct taxonomy topics. But in short texts, word co-occurrence is sparse and general-special word relations are lacking. To overcome these two problems and provide an interpretable result, we designed a hierarchical topic model that aggregates short texts into long documents and constructs topics and relations. Because long documents add additional semantic information, our model can avoid the sparsity of word co-occurrence. In experiments, we measured the quality of concepts by a topic coherence metric on four real-world short-text corpora. The results showed that our topic hierarchy is more interpretable than those of other methods.


2012 ◽  
Vol 43 ◽  
pp. 135-171 ◽  
Author(s):  
T. Flati ◽  
R. Navigli

Bilingual machine-readable dictionaries are knowledge resources useful in many automatic tasks. However, compared to monolingual computational lexicons like WordNet, bilingual dictionaries typically provide a lower amount of structured information, such as lexical and semantic relations, and often do not cover the entire range of possible translations for a word of interest. In this paper we present Cycles and Quasi-Cycles (CQC), a novel algorithm for the automated disambiguation of ambiguous translations in the lexical entries of a bilingual machine-readable dictionary. The dictionary is represented as a graph, and cyclic patterns are sought in the graph to assign an appropriate sense tag to each translation in a lexical entry. Further, we use the algorithm's output to improve the quality of the dictionary itself, by suggesting accurate solutions to structural problems such as misalignments, partial alignments and missing entries. Finally, we successfully apply CQC to the task of synonym extraction.


Author(s):  
Wesam Elshamy ◽  
William H. Hsu

Topic models are probabilistic models for discovering topical themes in collections of documents. These models provide us with the means of organizing what would otherwise be unstructured collections. The first wave of topic models developed was able to discover the prevailing topics in a big collection of documents spanning a period of time. These time-invariant models were not capable of modeling 1) the time-varying number of topics they discover and 2) the time-changing structure of these topics. Few models were developed to address these two deficiencies. The online hierarchical Dirichlet process models the documents with a time-varying number of topics, and the continuous-time dynamic topic model evolves topic structure in continuous time. In this chapter, the authors present the continuous-time infinite dynamic topic model that combines the advantages of these two models. It is a probabilistic topic model that changes the number of topics and topic structure over continuous time.


2021 ◽  
Vol 11 (14) ◽  
pp. 6541
Author(s):  
Luis Espinosa-Anke ◽  
Geraint Palmer ◽  
Padraig Corcoran ◽  
Maxim Filimonov ◽  
Irena Spasić ◽  
...  

Cross-lingual embeddings are vector space representations where word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, where a word vector space is first learned independently on a monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings, namely word2vec and fastText. Three cross-language alignment strategies were explored, namely cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points.
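As a rough illustration of the CSLS criterion named above, the sketch below uses its commonly cited form, CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the mean cosine similarity of a mapped source vector x to its k nearest target vectors and r_S(y) is the analogous quantity for a target vector y. It assumes the source vectors have already been mapped into the target space; the function names, the value of k, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_topk_similarity(vec, others, k):
    """Mean cosine similarity of vec to its k most similar vectors in others."""
    sims = sorted((cosine(vec, other) for other in others), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def csls_scores(source_vecs, target_vecs, k=2):
    """CSLS score matrix: rows are source words, columns are target words."""
    r_src = [mean_topk_similarity(x, target_vecs, k) for x in source_vecs]
    r_tgt = [mean_topk_similarity(y, source_vecs, k) for y in target_vecs]
    return [
        [2 * cosine(x, y) - r_src[i] - r_tgt[j]
         for j, y in enumerate(target_vecs)]
        for i, x in enumerate(source_vecs)
    ]


def induce_translations(source_vecs, target_vecs, k=2):
    """For each source word, return the index of the best-scoring target word."""
    scores = csls_scores(source_vecs, target_vecs, k)
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy 2-D vectors (made up). The third target vector is a "hub" that is
# moderately close to both source vectors.
mapped_source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
best = induce_translations(mapped_source, target, k=2)
```

The two subtracted terms penalize vectors that are close to many others, which is why the hub-like third target vector receives a lower score than its raw cosine similarity alone would suggest.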


2021 ◽  
Vol 13 (2) ◽  
pp. 763
Author(s):  
Simona Fiandrino ◽  
Alberto Tonelli

The recent Review of the Non-Financial Reporting Directive (NFRD) aims to enhance adequate non-financial information (NFI) disclosure and improve accountability for stakeholders. This study focuses on this regulatory intervention and has a twofold objective: First, it aims to understand the main underlying issues at stake; second, it suggests areas of possible amendment considering the current debates on sustainability accounting and accounting for stakeholders. In keeping with these aims, the research analyzes the documents annexed to the contribution on the Review of the NFRD by conducting a text-mining analysis with latent Dirichlet allocation (LDA) probabilistic topic model (PTM). Our findings highlight four main topics at the core of the current debate: quality of NFI, standardization, materiality, and assurance. The research suggests ways of improving managerial policies to achieve more comparable, relevant, and reliable information by bringing value creation for stakeholders into accounting. It further addresses an integrated logic of accounting for stakeholders that contributes to sustainable development.


2017 ◽  
Author(s):  
Redhouane Abdellaoui ◽  
Pierre Foulquié ◽  
Nathalie Texier ◽  
Carole Faviez ◽  
Anita Burgun ◽  
...  

BACKGROUND Medication nonadherence is a major impediment to the management of many health conditions. A better understanding of the factors underlying noncompliance to treatment may help health professionals to address it. Patients use peer-to-peer virtual communities and social media to share their experiences regarding their treatments and diseases. Using topic models makes it possible to model themes present in a collection of posts, and thus to identify cases of noncompliance. OBJECTIVE The aim of this study was to detect messages describing patients’ noncompliant behaviors associated with a drug of interest. Thus, the objective was the clustering of posts featuring a homogeneous vocabulary related to nonadherent attitudes. METHODS We focused on escitalopram and aripiprazole used to treat depression and psychotic conditions, respectively. We implemented a probabilistic topic model to identify the topics that occurred in a corpus of messages mentioning these drugs, posted from 2004 to 2013 on three of the most popular French forums. Data were collected using a Web crawler designed by Kappa Santé as part of the Detec’t project to analyze social media for drug safety. Several topics were related to noncompliance to treatment. RESULTS Starting from a corpus of 3650 posts related to an antidepressant drug (escitalopram) and 2164 posts related to an antipsychotic drug (aripiprazole), the use of latent Dirichlet allocation allowed us to model several themes, including interruptions of treatment and changes in dosage. The topic model approach detected cases of noncompliant behaviors with a recall of 98.5% (272/276) and a precision of 32.6% (272/844). CONCLUSIONS Topic models enabled us to explore patients’ discussions on community websites and to identify posts related to noncompliant behaviors. After a manual review of the messages in the noncompliance topics, we found that noncompliance to treatment was present in 6.17% (276/4469) of the posts.


2021 ◽  
pp. 1-15
Author(s):  
R.M. Noorullah ◽  
Moulana Mohammed

Topic models have been widely used for building clusters of documents for more than a decade, yet problems still occur in choosing the optimal number of topics. The main problem is the lack of a stable metric of the quality of topics obtained during the construction of topic models. From an analysis of previous works, the authors found that most of the models used to determine the number of topics are non-parametric, with topic quality determined by perplexity and coherence measures, and concluded that these are not applicable to solving this problem. In this paper, we use a parametric method, an extension of the traditional topic model with visual access tendency for visualizing the number of topics (clusters), to complement clustering and to choose the optimal number of topics based on the results of cluster validity indices. The developed hybrid topic models are demonstrated on different Twitter datasets on various topics, both in obtaining the optimal number of topics and in measuring the quality of clusters. The experimental results showed that the Visual Non-negative Matrix Factorization (VNMF) topic model performs well in determining the optimal number of topics with interactive visualization and in measuring the quality of clusters with validity indices.


2015 ◽  
Vol 21 (5) ◽  
pp. 743-772 ◽  
Author(s):  
ANDRES DUQUE ◽  
LOURDES ARAUJO ◽  
JUAN MARTINEZ-ROMO

Abstract In this paper, we present a new method based on co-occurrence graphs for performing Cross-Lingual Word Sense Disambiguation (CLWSD). The proposed approach comprises the automatic generation of bilingual dictionaries, and a new technique for the construction of a co-occurrence graph used to select the most suitable translations from the dictionary. Different algorithms that combine both the dictionary and the co-occurrence graph are then used for performing this selection of the final translations: techniques based on sub-graphs (communities) containing clusters of words with related meanings, based on distances between nodes representing words, and based on the relative importance of each node in the whole graph. The initial output of the system is enhanced with translation probabilities, provided by a statistical bilingual dictionary. The system is evaluated using datasets from two competitions: task 3 of SemEval 2010, and task 10 of SemEval 2013. Results obtained by the different disambiguation techniques are analysed and compared to those obtained by the systems participating in the competitions. Our system offers the best results in comparison with other unsupervised systems in most of the experiments, and even outperforms supervised systems in some cases.

