Using machine learning to disentangle homonyms in large text corpora

Uri Roll; Ricardo A. Correia; Oded Berger‐Tal

doi:10.1111/cobi.13044

Identifying Historical Travelogues in Large Text Corpora Using Machine Learning

Sustainable Digital Communities - Lecture Notes in Computer Science ◽

10.1007/978-3-030-43687-2_67 ◽

2020 ◽

pp. 801-815

Author(s):

Jan Rörden ◽

Doris Gruber ◽

Martin Krickl ◽

Bernhard Haslhofer

Keyword(s):

Machine Learning ◽

Text Corpora


GoMi - A new gold standard corpus for miRNA Named Entity Recognition to test dictionary, rule-based and machine-learning approaches.

10.1101/2021.10.18.464801 ◽

2021 ◽

Author(s):

Anika Frericks-Zipper ◽

Markus Stepath ◽

Karin Schork ◽

Katrin Marcus ◽

Michael Turewicz ◽

...

Keyword(s):

Machine Learning ◽

Gold Standard ◽

Machine Learning Algorithms ◽

Entity Recognition ◽

Learning Approaches ◽

Rule Based ◽

Text Corpora ◽

MicroRNAs ◽

Gold Standard Corpus ◽

Gold Standards

Biomarkers have been a focus of research for more than 30 years [REF1]. Paone et al. were among the first scientists to use the term biomarker, in a comparative study of breast carcinoma [REF2]. In recent years, in addition to proteins and genes, microRNAs (miRNAs), which play an essential role in gene expression, have attracted increasing interest as valuable biomarkers. As a result, more and more information on miRNA biomarkers can be extracted from the growing body of scientific literature via text mining. Since the late 1990s, the recognition of specific terms in biomedical texts has been a focus of bioinformatics research, with the aim of automatically extracting knowledge from the increasing number of publications. Machine learning algorithms are among the methods applied for this purpose. However, how well machine learning or rule-based algorithms recognize (classify) terms depends on their correct and reproducible training and development. For machine learning-based algorithms, the quality of the available training and test data is crucial. The algorithms have to be trained and tested with curated and trustworthy data sets, the so-called gold or silver standards. Gold standards are text corpora annotated by experts, whereas silver standards are curated automatically by other algorithms. Training and calibration of neural networks is based on such corpora. The literature contains some silver standards with approx. 500,000 tokens [REF3], and gold standards have already been published for species, genes, proteins, and diseases. However, no corpus has been generated specifically for miRNA. To close this gap, we have generated GoMi, a novel, manually curated gold standard corpus for miRNA. GoMi can be used directly to train ML methods and to calibrate or test algorithms based on rule-based or dictionary-based approaches.
The GoMi gold standard corpus was created using publicly available PubMed abstracts. GoMi can be downloaded here: https://github.com/mpc-bioinformatics/mirnaGS---GoMi.
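
To illustrate how a gold standard of this kind can be used to test a rule-based recognizer, the following minimal Python sketch scores a regular-expression tagger against hand-labelled mentions. The pattern, the two example sentences, and their annotations are invented for illustration and are not taken from GoMi; the actual annotation format of the corpus may differ.

```python
import re

# Assumed pattern for common miRNA name forms (e.g. "miR-21", "hsa-miR-155-5p",
# "let-7a"). Illustrative only; this is not the rule set used by the GoMi authors.
MIRNA_PATTERN = re.compile(
    r"\b(?:[a-z]{3}-)?(?:miR|mir|let)-\d+[a-z]?(?:-\d+)?(?:-[35]p)?\b"
)

def tag_mirnas(text):
    """Return the set of (start, end) character spans matched by the rule."""
    return {m.span() for m in MIRNA_PATTERN.finditer(text)}

def evaluate(predicted, gold):
    """Exact-span precision, recall and F1 against gold annotations."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical mini "gold standard": sentences with hand-labelled miRNA mentions.
gold_corpus = [
    ("Serum miR-21 and hsa-miR-155-5p were elevated in patients.",
     ["miR-21", "hsa-miR-155-5p"]),
    ("Expression of let-7a and microRNA-34 was reduced.",
     ["let-7a", "microRNA-34"]),
]

predicted_spans, gold_spans = set(), set()
for doc_id, (sentence, mentions) in enumerate(gold_corpus):
    for start, end in tag_mirnas(sentence):
        predicted_spans.add((doc_id, start, end))
    for mention in mentions:
        start = sentence.index(mention)
        gold_spans.add((doc_id, start, start + len(mention)))

precision, recall, f1 = evaluate(predicted_spans, gold_spans)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy setup the rule misses the spelled-out form "microRNA-34", so recall falls below precision. Surfacing that kind of gap is what an expert-annotated corpus is for.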


Machine Learning Methodologies and Large Data Text Corpora

The International Journal of Communication and Linguistic Studies ◽

10.18848/2327-7882/cgp/v14i01/43661 ◽

2015 ◽

Vol 13 (1) ◽

pp. 1-15

Author(s):

Luke Barnesmoore ◽

Jeffery Huang

Keyword(s):

Machine Learning ◽

Large Data ◽

Text Corpora


Authorship Attribution Based on Feature Set Subspacing Ensembles

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213006002965 ◽

2006 ◽

Vol 15 (05) ◽

pp. 823-838 ◽

Cited By ~ 29

Author(s):

Efstathios Stamatatos

Keyword(s):

Machine Learning ◽

Text Categorization ◽

Machine Learning Techniques ◽

Authorship Attribution ◽

Support Vector ◽

Classifier Ensembles ◽

Simple Method ◽

Text Corpora ◽

Learning Techniques ◽

Feature Spaces

Authorship attribution can assist the criminal investigation procedure as well as cybercrime analysis. This task can be viewed as a single-label multi-class text categorization problem. Given that the style of a text can be represented as mere word frequencies selected in a language-independent manner, suitable machine learning techniques able to deal with high-dimensional feature spaces and sparse data can be directly applied to solve this problem. This paper focuses on classifier ensembles based on feature set subspacing. It is shown that an effective ensemble can be constructed using exhaustive disjoint subspacing, a simple method producing many poor but diverse base classifiers. The simple model can be enhanced by a variation of the technique of cross-validated committees applied to the feature set. Experiments on two benchmark text corpora demonstrate the effectiveness of the presented method, which improves on previously reported results, and compare it to support vector machines, an alternative machine learning approach suitable for authorship attribution.
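
The idea of building an ensemble from disjoint feature subsets can be sketched as follows. This is a minimal Python illustration under stated assumptions: "exhaustive disjoint subspacing" is read here as randomly partitioning the full feature set into non-overlapping subsets that together cover every feature, the nearest-centroid base learner and majority voting are stand-ins chosen for brevity (the abstract does not name the base classifier or the combination rule), and the word-frequency vectors are toy data.

```python
import random
from collections import Counter

def disjoint_subspaces(n_features, subspace_size, seed=0):
    """Shuffle feature indices and cut them into disjoint, exhaustive chunks."""
    indices = list(range(n_features))
    random.Random(seed).shuffle(indices)
    return [indices[i:i + subspace_size]
            for i in range(0, n_features, subspace_size)]

class NearestCentroid:
    """Deliberately weak base classifier that only sees one feature subspace."""

    def __init__(self, feature_ids):
        self.feature_ids = feature_ids
        self.centroids = {}

    def fit(self, X, y):
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [
                sum(row[f] for row in rows) / len(rows)
                for f in self.feature_ids
            ]
        return self

    def predict(self, x):
        def sq_dist(centroid):
            return sum((x[f] - c) ** 2
                       for f, c in zip(self.feature_ids, centroid))
        return min(self.centroids, key=lambda lab: sq_dist(self.centroids[lab]))

def fit_ensemble(X, y, subspace_size, seed=0):
    """Train one base classifier per disjoint feature subset."""
    subspaces = disjoint_subspaces(len(X[0]), subspace_size, seed)
    return [NearestCentroid(ids).fit(X, y) for ids in subspaces]

def predict_ensemble(ensemble, x):
    """Combine base predictions by majority vote (an assumed combination rule)."""
    votes = Counter(model.predict(x) for model in ensemble)
    return votes.most_common(1)[0][0]

# Toy word-frequency vectors (6 hypothetical function words) for two authors.
X = [
    [9, 1, 8, 2, 7, 1],
    [8, 2, 9, 1, 8, 2],
    [1, 9, 2, 8, 1, 7],
    [2, 8, 1, 9, 2, 8],
]
y = ["A", "A", "B", "B"]

ensemble = fit_ensemble(X, y, subspace_size=2)
print(len(ensemble), "base classifiers")
print(predict_ensemble(ensemble, [7, 2, 8, 1, 9, 1]))
```

Each base classifier sees only two of the six features, so it is weak on its own. The ensemble relies on the subsets being disjoint to keep the base classifiers diverse.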


Understanding large text corpora via sparse machine learning

Statistical Analysis and Data Mining: The ASA Data Science Journal ◽

10.1002/sam.11187 ◽

2013 ◽

Vol 6 (3) ◽

pp. 221-242 ◽

Cited By ~ 8

Author(s):

Laurent El Ghaoui ◽

Vu Pham ◽

Guan-Cheng Li ◽

Viet-An Duong ◽

Ashok Srivastava ◽

...

Keyword(s):

Machine Learning ◽

Text Corpora


DocTable: Table-Oriented Interactive Machine Learning for Text Corpora

10.1109/mlui54255.2021.00006 ◽

2021 ◽

Author(s):

Sriram Yarlagadda ◽

David J. Scroggins ◽

Fang Cao ◽

Yeshwanth Devabhaktuni ◽

Franklin Buitron ◽

...

Keyword(s):

Machine Learning ◽

Interactive Machine Learning ◽

Text Corpora


Integrating Manual and Automatic Annotation for the Creation of Discourse Network Data Sets

Politics and Governance ◽

10.17645/pag.v8i2.2591 ◽

2020 ◽

Vol 8 (2) ◽

pp. 326-339 ◽

Cited By ~ 2

Author(s):

Sebastian Haunss ◽

Jonas Kuhn ◽

Sebastian Padó ◽

Andre Blessing ◽

Nico Blokker ◽

...

Keyword(s):

Machine Learning ◽

Data Sets ◽

Use Case ◽

Automatic Annotation ◽

Core Network ◽

Research Focus ◽

The Core ◽

Text Corpora ◽

Annotation Quality

This article investigates the integration of machine learning in the political claim annotation workflow with the goal of partially automating the annotation and analysis of large text corpora. It introduces the MARDY annotation environment and presents results from an experiment in which the annotation quality of annotators with and without machine-learning-based annotation support is compared. The design and setting aim to measure and evaluate: a) annotation speed; b) annotation quality; and c) applicability to the use case of discourse network generation. While the results indicate only slight increases in annotation speed, the authors find a moderate boost in annotation quality. Additionally, with the help of manual annotation of the actors and filtering out of the false positives, the machine-learning-based annotation suggestions allow the authors to fully recover the core network of the discourse as extracted from the articles annotated during the experiment. This is due to the redundancy which is naturally present in the annotated texts. Thus, assuming a research focus not on the complete network but on the network core, an AI-based annotation can provide reliable information about discourse networks with much less human intervention than the traditional manual approach.


Machine learning and soil sciences: a review aided by machine learning tools

SOIL ◽

10.5194/soil-6-35-2020 ◽

2020 ◽

Vol 6 (1) ◽

pp. 35-52 ◽

Cited By ~ 12

Author(s):

José Padarian ◽

Budiman Minasny ◽

Alex B. McBratney

Keyword(s):

Machine Learning ◽

Latent Dirichlet Allocation ◽

Research Question ◽

Developed Countries ◽

Parent Material ◽

Soil Science ◽

Support Vector ◽

Research Gaps ◽

Text Corpora ◽

Soil Sciences

Abstract. The application of machine learning (ML) techniques in various fields of science has increased rapidly, especially in the last 10 years. The increasing availability of soil data that can be efficiently acquired remotely and proximally, and freely available open-source algorithms, have led to an accelerated adoption of ML techniques to analyse soil data. Given the large number of publications, it is an impossible task to manually review all papers on the application of ML in soil science without narrowing the narrative to ML applications for a specific research question. This paper aims to provide a comprehensive review of the application of ML techniques in soil science aided by an ML algorithm (latent Dirichlet allocation) to find patterns in a large collection of text corpora. The objective is to gain insight into publications of ML applications in soil science and to discuss the research gaps in this topic. We found that (a) there is an increasing usage of ML methods in soil sciences, mostly concentrated in developed countries, (b) the reviewed publications can be grouped into 12 topics, namely remote sensing, soil organic carbon, water, contamination, methods (ensembles), erosion and parent material, methods (NN, neural networks, SVM, support vector machines), spectroscopy, modelling (classes), crops, physical, and modelling (continuous), and (c) advanced ML methods usually perform better than simpler approaches thanks to their capability to capture non-linear relationships. From these findings, we identified research gaps, in particular regarding the precautions that should be taken (parsimony) to avoid overfitting, and we note that the interpretability of the ML models is an important aspect to consider when applying advanced ML methods in order to improve our knowledge and understanding of soil. We foresee that a large number of studies will focus on the latter topic.
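
For readers unfamiliar with latent Dirichlet allocation, the following minimal Python sketch fits a topic model with collapsed Gibbs sampling. It is an illustration only: the abstract does not describe the authors' preprocessing or implementation, and the token lists, the number of topics, and the hyperparameters `alpha` and `beta` below are assumptions chosen for a toy example.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns (theta, topic_word, vocab): per-document topic proportions,
    per-topic word counts, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens of word v in topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token from the counts, then resample its topic.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab

# Hypothetical tokenized documents, loosely themed on two of the listed topics.
docs = [
    ["soil", "carbon", "organic", "carbon", "stock"],
    ["remote", "sensing", "satellite", "soil", "moisture"],
    ["organic", "carbon", "soil", "stock", "depth"],
    ["satellite", "remote", "sensing", "image", "moisture"],
]

theta, topic_word, vocab = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(zip(counts, vocab), reverse=True)[:3]
    print(f"topic {k}:", [word for _, word in top])
```

Each row of `theta` gives the estimated topic mixture of one document, and the highest-count words per topic are what a reviewer would inspect to assign labels such as those listed in the abstract.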


Context Matters: Recovering Human Visual and Semantic Knowledge from Machine Learning Analysis of Large-Scale Text Corpora

Journal of Vision ◽

10.1167/jov.21.9.2738 ◽

2021 ◽

Vol 21 (9) ◽

pp. 2738

Author(s):

Marius Catalin Iordan ◽

Tyler Giallanza ◽

Cameron T. Ellis ◽

Nicole M. Beckage ◽

Jonathan D. Cohen

Keyword(s):

Machine Learning ◽

Large Scale ◽

Semantic Knowledge ◽

Text Corpora ◽

Learning Analysis


Analysis of a Brazilian Indigenous corpus using machine learning methods

10.5753/eniac.2021.18246 ◽

2021 ◽

Author(s):

Tiago Barbosa de Lima ◽

André C. A. Nascimento ◽

Pericles Miranda ◽

Rafael Ferreira Mello

Keyword(s):

Machine Learning ◽

Indigenous Languages ◽

Training Data ◽

Minority Languages ◽

Learning Models ◽

Native Languages ◽

Machine Learning Methods ◽

Text Corpora ◽

Risk Of Extinction ◽

Machine Learning Models

In Brazil, several minority languages are at serious risk of extinction. The appropriate documentation of such languages is a fundamental step to avoid that. However, for some of those languages, only small text corpora are digitally accessible. Meanwhile, many issues remain in the identification of indigenous languages; addressing them may help to identify key similarities among these languages, as well as to connect related languages and dialects. Therefore, this paper proposes to study and automatically classify 26 neglected Brazilian native languages, considering a small amount of training data, under supervised and unsupervised settings. Our findings indicate that the use of machine learning models for the analysis of Brazilian Indigenous corpora is very promising, and we hope this work encourages more research on this topic in the coming years.
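
As a rough illustration of supervised language classification from very little training text, the sketch below compares character trigram profiles by cosine similarity. The abstract does not say which features or models the authors used, so this common baseline is an assumption. The English and Portuguese sentences are placeholders only and do not come from the Indigenous-language corpus.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_profiles(labelled_texts, n=3):
    """Sum n-gram counts per language label."""
    profiles = {}
    for label, text in labelled_texts:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles

def classify(text, profiles, n=3):
    """Return the label whose profile is most similar to the text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))

# Placeholder training data: two sentences per language.
training = [
    ("english", "the quick brown fox jumps over the lazy dog"),
    ("english", "this is a small amount of training data"),
    ("portuguese", "o rato roeu a roupa do rei de roma"),
    ("portuguese", "esta é uma pequena quantidade de dados de treinamento"),
]

profiles = train_profiles(training)
print(classify("the dog is over there", profiles))
print(classify("uma pequena roupa do rei", profiles))
```

Character n-grams need no tokenizer or dictionary, which is one reason they are often tried first when only a few sentences per language are available.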
