Using machine learning to disentangle homonyms in large text corpora

2018 ◽  
Vol 32 (3) ◽  
pp. 716-724 ◽  
Author(s):  
Uri Roll ◽  
Ricardo A. Correia ◽  
Oded Berger‐Tal
Author(s):  
Jan Rörden ◽  
Doris Gruber ◽  
Martin Krickl ◽  
Bernhard Haslhofer

2021 ◽  
Author(s):  
Anika Frericks-Zipper ◽  
Markus Stepath ◽  
Karin Schork ◽  
Katrin Marcus ◽  
Michael Turewicz ◽  
...  

Biomarkers have been the focus of research for more than 30 years [REF1]. Paone et al. were among the first scientists to use the term biomarker in the course of a comparative study dealing with breast carcinoma [REF2]. In recent years, in addition to proteins and genes, miRNAs (microRNAs), which play an essential role in gene expression, have gained increased interest as valuable biomarkers. As a result, more and more information on miRNA biomarkers can be extracted via text mining approaches from the increasing amount of scientific literature. In the late 1990s, the recognition of specific terms in biomedical texts became a focus of bioinformatic research, with the aim of automatically extracting knowledge from the increasing number of publications. For this, amongst other methods, machine learning algorithms are applied. However, the capability of machine learning or rule-based algorithms to recognize (classify) terms depends on their correct and reproducible training and development. In the case of machine learning-based algorithms, the quality of the available training and test data is crucial. The algorithms have to be trained and tested with curated and trustworthy data sets, the so-called gold or silver standards. Gold standards are text corpora annotated by experts, whereas silver standards are curated automatically by other algorithms. Training and calibration of neural networks is based on such corpora. In the literature there are some silver standards with approx. 500,000 tokens [REF3]. There are also already published gold standards for species, genes, proteins or diseases. However, there is no corpus that has been generated specifically for miRNA. To close this gap, we have generated GoMi, a novel and manually curated gold standard corpus for miRNA. GoMi can be used directly to train ML methods and to calibrate or test algorithms based on rule-based or dictionary-based approaches.
The GoMi gold standard corpus was created using publicly available PubMed abstracts. GoMi can be downloaded here: https://github.com/mpc-bioinformatics/mirnaGS---GoMi.
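As a rough illustration of how a gold standard can be used to test a rule-based recognizer, the following Python sketch scores regex matches against hand-annotated character spans using exact-match precision, recall and F1. The pattern, the example sentence and its annotations are all hypothetical; they are not taken from GoMi, and the sketch does not read the GoMi file format.

```python
import re

# Hypothetical pattern for miRNA mentions such as "miR-21", "hsa-miR-155" or "let-7a".
# It is an illustrative assumption, not the rule set used to build or evaluate GoMi.
MIRNA_PATTERN = re.compile(r"\b(?:[a-z]{3}-)?(?:miR|let)-\d+[a-z]?\b")


def tag_mirnas(text):
    """Return the set of (start, end) character spans matched by the pattern."""
    return {match.span() for match in MIRNA_PATTERN.finditer(text)}


def precision_recall_f1(predicted, gold):
    """Score predicted spans against gold spans, counting exact matches only."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy sentence with made-up gold annotations (start inclusive, end exclusive).
text = "Levels of hsa-miR-155 and let-7a rose, while microRNA 21 was unchanged."
gold = {(10, 21), (26, 32), (45, 56)}

predicted = tag_mirnas(text)
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the pattern finds the two abbreviated mentions but misses the spelled-out "microRNA 21", which is the kind of gap a manually curated corpus makes visible.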


2006 ◽  
Vol 15 (05) ◽  
pp. 823-838 ◽  
Author(s):  
EFSTATHIOS STAMATATOS

Authorship attribution can assist the criminal investigation procedure as well as cybercrime analysis. This task can be viewed as a single-label multi-class text categorization problem. Given that the style of a text can be represented as mere word frequencies selected by a language-independent method, suitable machine learning techniques able to deal with high-dimensional feature spaces and sparse data can be directly applied to solve this problem. This paper focuses on classifier ensembles based on feature set subspacing. It is shown that an effective ensemble can be constructed using exhaustive disjoint subspacing, a simple method producing many poor but diverse base classifiers. The simple model can be enhanced by a variation of the technique of cross-validated committees applied to the feature set. Experiments on two benchmark text corpora demonstrate the effectiveness of the presented method, which improves on previously reported results, and compare it to support vector machines, an alternative machine learning approach suitable for authorship attribution.
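The idea of an ensemble built on disjoint feature subspaces can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract names neither the base learner nor the combination rule, so a nearest-centroid classifier and majority voting are used here as stand-ins, features are partitioned in consecutive chunks rather than in whatever order the paper uses, and the word-frequency vectors are made up.

```python
from collections import Counter


def disjoint_subspaces(n_features, size):
    """Split feature indices 0..n_features-1 into consecutive disjoint chunks."""
    return [list(range(i, min(i + size, n_features)))
            for i in range(0, n_features, size)]


def fit_centroids(X, y, features):
    """Base learner (an assumption): per-class mean vector over `features` only."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(centroids, row, features):
    """Return the class whose centroid is nearest in squared Euclidean distance."""
    def distance(label):
        return sum((row[f] - c) ** 2 for f, c in zip(features, centroids[label]))
    return min(sorted(centroids), key=distance)


def fit_ensemble(X, y, size):
    """Train one base classifier per disjoint feature subset; every feature is used once."""
    return [(features, fit_centroids(X, y, features))
            for features in disjoint_subspaces(len(X[0]), size)]


def predict_ensemble(ensemble, row):
    """Combine base predictions by majority vote (an assumed combination rule)."""
    votes = Counter(predict_centroid(c, row, f) for f, c in ensemble)
    return votes.most_common(1)[0][0]


# Made-up word-frequency vectors (six features) for two hypothetical authors.
X = [[9, 8, 1, 0, 2, 1], [8, 9, 0, 1, 1, 2],
     [1, 0, 9, 8, 7, 9], [0, 1, 8, 9, 9, 8]]
y = ["A", "A", "B", "B"]

ensemble = fit_ensemble(X, y, size=2)  # three base classifiers, two features each
print(predict_ensemble(ensemble, [7, 9, 1, 1, 8, 9]))
```

Each base classifier sees only two features and is therefore weak on its own; for the query above, one of the three disagrees with the other two, and the vote settles the prediction.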


2013 ◽  
Vol 6 (3) ◽  
pp. 221-242 ◽  
Author(s):  
Laurent El Ghaoui ◽  
Vu Pham ◽  
Guan-Cheng Li ◽  
Viet-An Duong ◽  
Ashok Srivastava ◽  
...  

2021 ◽  
Author(s):  
Sriram Yarlagadda ◽  
David J. Scroggins ◽  
Fang Cao ◽  
Yeshwanth Devabhaktuni ◽  
Franklin Buitron ◽  
...  

2020 ◽  
Vol 8 (2) ◽  
pp. 326-339 ◽  
Author(s):  
Sebastian Haunss ◽  
Jonas Kuhn ◽  
Sebastian Padó ◽  
Andre Blessing ◽  
Nico Blokker ◽  
...  

This article investigates the integration of machine learning in the political claim annotation workflow with the goal of partially automating the annotation and analysis of large text corpora. It introduces the MARDY annotation environment and presents results from an experiment in which the annotation quality of annotators with and without machine learning-based annotation support is compared. The design and setting aim to measure and evaluate: a) annotation speed; b) annotation quality; and c) applicability to the use case of discourse network generation. While the results indicate only slight increases in annotation speed, the authors find a moderate boost in annotation quality. Additionally, with the help of manual annotation of the actors and filtering out of the false positives, the machine learning-based annotation suggestions allow the authors to fully recover the core network of the discourse as extracted from the articles annotated during the experiment. This is due to the redundancy which is naturally present in the annotated texts. Thus, assuming a research focus not on the complete network but on the network core, an AI-based annotation can provide reliable information about discourse networks with much less human intervention than the traditional manual approach.


SOIL ◽  
2020 ◽  
Vol 6 (1) ◽  
pp. 35-52 ◽  
Author(s):  
José Padarian ◽  
Budiman Minasny ◽  
Alex B. McBratney

Abstract. The application of machine learning (ML) techniques in various fields of science has increased rapidly, especially in the last 10 years. The increasing availability of soil data that can be efficiently acquired remotely and proximally, and freely available open-source algorithms, have led to an accelerated adoption of ML techniques to analyse soil data. Given the large number of publications, it is an impossible task to manually review all papers on the application of ML in soil science without narrowing down a narrative of ML application in a specific research question. This paper aims to provide a comprehensive review of the application of ML techniques in soil science aided by an ML algorithm (latent Dirichlet allocation) to find patterns in a large collection of text corpora. The objective is to gain insight into publications of ML applications in soil science and to discuss the research gaps in this topic. We found that (a) there is an increasing usage of ML methods in soil sciences, mostly concentrated in developed countries, (b) the reviewed publications can be grouped into 12 topics, namely remote sensing, soil organic carbon, water, contamination, methods (ensembles), erosion and parent material, methods (NN, neural networks, SVM, support vector machines), spectroscopy, modelling (classes), crops, physical, and modelling (continuous), and (c) advanced ML methods usually perform better than simpler approaches thanks to their capability to capture non-linear relationships. From these findings, we identified research gaps, in particular about the precautions that should be taken (parsimony) to avoid overfitting, and about the interpretability of ML models, which is an important aspect to consider when applying advanced ML methods in order to improve our knowledge and understanding of soil. We foresee that a large number of studies will focus on the latter topic.


2021 ◽  
Vol 21 (9) ◽  
pp. 2738
Author(s):  
Marius Catalin Iordan ◽  
Tyler Giallanza ◽  
Cameron T. Ellis ◽  
Nicole M. Beckage ◽  
Jonathan D. Cohen

2021 ◽  
Author(s):  
Tiago Barbosa de Lima ◽  
André C. A. Nascimento ◽  
Pericles Miranda ◽  
Rafael Ferreira Mello

In Brazil, several minority languages are at serious risk of extinction. The appropriate documentation of such languages is a fundamental step to avoid that. However, for some of those languages, only a small amount of text is digitally accessible. Meanwhile, there are many open issues related to the identification of indigenous languages, which may help to identify key similarities among them, as well as to connect related languages and dialects. Therefore, this paper proposes to study and automatically classify 26 neglected Brazilian native languages, considering a small amount of training data, under supervised and unsupervised settings. Our findings indicate that the use of machine learning models for the analysis of Brazilian Indigenous corpora is very promising, and we hope this work encourages more research on this topic in the coming years.

