Combining Bilingual Lexicons Extracted from Comparable Corpora: The Complementary Approach Between Word Embedding and Text Mining

Author(s): Sourour Belhaj Rhouma ◽ Chiraz Latiri ◽ Catherine Berrut
2020 ◽ Vol 46 (2) ◽ pp. 603-618

Author(s): Radovan Garabík

The Aranea Project offers a set of comparable corpora for two dozen (mostly European) languages, providing a convenient dataset for NLP applications that require training on large amounts of data. The article presents word embedding models trained on the Aranea corpora and an online interface to query the models and visualize the results. The implementation is aimed at lexicographic use but can also be useful in other fields of linguistic study, since the vector space is a plausible model of the semantic space of word meanings. Three different models are available: one for a combination of part of speech and lemma, one for raw word forms, and one based on the fastText algorithm, which uses subword vectors and is therefore not limited to whole or known words when finding semantic relations. The article describes the interface and the major modes of its functionality; it does not attempt a detailed linguistic analysis of the presented examples.
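The nearest-neighbour queries such an interface answers reduce to cosine similarity in the vector space. A minimal sketch of that lookup, using made-up three-dimensional vectors rather than real Aranea embeddings (all words and values here are illustrative assumptions):

```python
import math

# Toy embedding table; real models use hundreds of dimensions.
vectors = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.75, 0.20],
    "apple":  [0.10, 0.90, 0.90],
    "banana": [0.15, 0.85, 0.95],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(word, k=2):
    """Return the k words most similar to `word`, best match first."""
    q = vectors[word]
    others = ((w, cosine(q, v)) for w, v in vectors.items() if w != word)
    return sorted(others, key=lambda t: t[1], reverse=True)[:k]

print(nearest("king"))
```

A subword model such as fastText would additionally compose the query vector from character n-grams, which is why it can handle word forms absent from the training vocabulary.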


2017
Author(s): Gabriel Rosenfeld ◽ Dawei Lin

Abstract
While the impact of biomedical research has traditionally been measured using bibliographic metrics such as citation counts or journal impact factor, the data themselves are an output that can be measured directly to provide additional context about a publication's impact. Data are a resource that can be repurposed and reused, providing dividends on the original investment that supported the primary work. Moreover, data are the cornerstone upon which a tested hypothesis is rejected or accepted and specific scientific conclusions are reached. Understanding how and where data are being produced enhances the transparency and reproducibility of the biomedical research enterprise. Most biomedical data are not deposited directly in data repositories; they are instead found within a publication's figures or attachments, making them hard to measure. We attempted to address this challenge by using recent advances in word embedding to identify the technical and methodological features of terms used in the free text of articles' methods sections. We created term usage signatures for five types of biomedical research data, which were used in univariate clustering to correctly identify a large fraction of positive-control articles and a set of manually annotated articles where generation of the data types could be validated. The approach was then used to estimate the fraction of PLOS articles generating each biomedical data type over time. Of all PLOS articles analyzed (n = 129,918), ~7%, 19%, 12%, 18%, and 6% generated flow cytometry, immunoassay, genomic microarray, microscopy, and high-throughput sequencing data, respectively. The estimate portends a vast amount of biomedical data being produced: if other publishers generated a similar amount of data, then in 2016 roughly 40,000 NIH-funded research articles would have produced ~56,000 datasets of the five data types we analyzed.

One Sentence Summary
Application of a word-embedding model trained on the methods sections of research articles allows estimation of the production of diverse biomedical data types using text mining.
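The per-type fractions reported in the abstract translate into expected dataset counts by simple scaling. A minimal sketch of that arithmetic, using only the figures quoted above (the per-type fractions and the PLOS article count; note that one article can generate several data types, so summing over types estimates datasets, not distinct articles):

```python
# Fractions of PLOS articles generating each data type (from the abstract).
fractions = {
    "flow cytometry": 0.07,
    "immunoassay": 0.19,
    "genomic microarray": 0.12,
    "microscopy": 0.18,
    "high-throughput sequencing": 0.06,
}

n_articles = 129_918  # PLOS articles analyzed

# Expected number of articles producing each data type.
expected = {dtype: round(frac * n_articles) for dtype, frac in fractions.items()}

# Summing over possibly overlapping types gives a dataset estimate.
total_datasets = sum(expected.values())

for dtype, count in expected.items():
    print(f"{dtype}: ~{count}")
print(f"estimated datasets across the five types: ~{total_datasets}")
```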


2013
Author(s): Ronald N. Kostoff ◽ Henry A. Buchtel ◽ John Andrews ◽ Kirstin M. Pfiel

2020 ◽ Vol 42 (5) ◽ pp. 279-307
Author(s): Yonglim Joe

2019 ◽ Vol 19 (2) ◽ pp. 29-38
Author(s): Young-Hee Kim ◽ Taek-Hyun Lee ◽ Jong-Myoung Kim ◽ Won-Hyung Park ◽ ...
