gold standard corpus
Recently Published Documents

TOTAL DOCUMENTS: 23 (five years: 9)
H-INDEX: 6 (five years: 1)

2021 ◽  
Author(s):  
Anika Frericks-Zipper ◽  
Markus Stepath ◽  
Karin Schork ◽  
Katrin Marcus ◽  
Michael Turewicz ◽  
...  

Biomarkers have been a focus of research for more than 30 years [REF1]. Paone et al. were among the first scientists to use the term biomarker, in a comparative study of breast carcinoma [REF2]. In recent years, in addition to proteins and genes, miRNAs (micro RNAs), which play an essential role in gene expression, have gained increasing interest as valuable biomarkers. As a result, more and more information on miRNA biomarkers can be extracted from the growing body of scientific literature via text mining approaches. In the late 1990s, the recognition of specific terms in biomedical texts became a focus of bioinformatic research, aiming to automatically extract knowledge from the increasing number of publications. Among other methods, machine learning algorithms are applied for this task. However, the recognition (classification) capability of machine learning or rule-based algorithms depends on their correct and reproducible training and development. For machine learning-based algorithms, the quality of the available training and test data is crucial: the algorithms have to be trained and tested on curated, trustworthy data sets, the so-called gold or silver standards. Gold standards are text corpora annotated by experts, whereas silver standards are curated automatically by other algorithms. Training and calibration of neural networks is based on such corpora. The literature contains some silver standards with approx. 500,000 tokens [REF3], and gold standards have already been published for species, genes, proteins and diseases. However, no corpus has been generated specifically for miRNA. To close this gap, we have generated GoMi, a novel, manually curated gold standard corpus for miRNA. GoMi can be used directly to train ML methods, or to calibrate and test rule-based or dictionary-based algorithms.
The GoMi gold standard corpus was created using publicly available PubMed abstracts. GoMi can be downloaded here: https://github.com/mpc-bioinformatics/mirnaGS---GoMi.
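A dictionary- or rule-based baseline of the kind a corpus like GoMi could calibrate might be as simple as a regular expression over common miRNA naming patterns. The sketch below is illustrative only: the pattern is an approximation of typical names such as "miR-21", "hsa-miR-155-5p" or "let-7a", not the annotation guideline used by the corpus authors.

```python
import re

# Approximate pattern for miRNA mentions: optional species prefix ("hsa-"),
# a family token ("miR", "let", "microRNA", "miRNA"), a number, an optional
# letter variant, and an optional arm suffix ("-3p"/"-5p").
MIRNA_PATTERN = re.compile(
    r"\b(?:[a-z]{3}-)?(?:miR|let|microRNA|miRNA)-\d+[a-z]?(?:-[35]p)?\b",
    re.IGNORECASE,
)

def find_mirna_mentions(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, surface form) for each miRNA mention in text."""
    return [(m.start(), m.end(), m.group()) for m in MIRNA_PATTERN.finditer(text)]
```

Evaluating such a tagger against GoMi's manual annotations would expose exactly the naming variability that a hand-written pattern misses.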


2021 ◽  
Author(s):  
Nicolas Le Guillarme ◽  
Wilfried Thuiller

1. Given the biodiversity crisis, we need more than ever to access information on multiple taxa (e.g. distribution, traits, diet) in the scientific literature to understand, map and predict all-inclusive biodiversity. Tools are needed to automatically extract useful information from the ever-growing corpus of ecological texts and feed this information to open data repositories. A prerequisite is the ability to recognise mentions of taxa in text, a special case of named entity recognition (NER). In recent years, deep learning-based NER systems have become ubiquitous, yielding state-of-the-art results in the general and biomedical domains. However, no such tool is available to ecologists wishing to extract information from the biodiversity literature.
2. We propose a new tool called TaxoNERD that provides two deep neural network (DNN) models to recognise taxon mentions in ecological documents. To achieve high performance, DNN-based NER models usually need to be trained on a large corpus of manually annotated text. Creating such a gold standard corpus (GSC) is a laborious and costly process, with the result that GSCs in the ecological domain tend to be too small to learn an accurate DNN model from scratch. To address this issue, we leverage existing DNN models pretrained on large biomedical corpora using transfer learning. The performance of our models is evaluated on four GSCs and compared to the most popular taxonomic NER tools.
3. Our experiments suggest that existing taxonomic NER tools are not suited to the extraction of ecological information from text, as they performed poorly on ecologically-oriented corpora: either they do not take into account the variability of taxon naming practices, or they do not generalise well to the ecological domain. Conversely, a domain-specific DNN-based tool like TaxoNERD outperformed the other approaches on an ecological information extraction task.
4. Efforts are needed to raise ecological information extraction to the same level of performance as its biomedical counterpart. One promising direction is to leverage the huge corpus of unlabelled ecological texts to learn a language representation model that could benefit downstream tasks. These efforts could be highly beneficial to ecologists in the long term.
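Comparisons of NER tools on gold standard corpora, like the one described above, are typically scored with strict span-level precision, recall and F1. A minimal sketch of that metric (not TaxoNERD's own evaluation code), with entities represented as (start, end, label) tuples:

```python
def ner_prf(gold: set, predicted: set) -> tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans.

    A prediction counts as a true positive only if both its boundaries
    and its label exactly match a gold annotation.
    """
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Relaxed variants (boundary overlap instead of exact match) are also common when taxon name boundaries are ambiguous.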


2021 ◽  
pp. 000276422110216
Author(s):  
Erdem Yörük ◽  
Ali Hürriyetoğlu ◽  
Fırat Duruşan ◽  
Çağrı Yoltar

What is the optimal way of creating a gold standard corpus for training a machine learning system designed to automatically collect protest information in a cross-country context? We show that creating a gold standard corpus for training and testing machine learning models from randomly chosen news articles in news archives yields better performance than selecting news articles by keyword filtering, which is the most prevalent method currently used in automated event coding. We advance this new bottom-up approach to ensure generalizability and reliability in cross-country comparative protest event collection from international and local news in different countries, languages, sources and time periods, which entails a large variety of event types, actors, and targets. We present the results of comparing our random-sample approach with keyword filtering, showing that machine learning algorithms, and particularly state-of-the-art deep learning tools, perform much better when trained with a gold standard corpus built from a randomly selected set of news articles from China, India, and South Africa. Finally, we present our approach to overcoming the major ethical issues that are intrinsic to protest event coding.
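The two sampling strategies being compared can be sketched in a few lines. Everything here is illustrative: `archive` is a hypothetical list of article texts, and the keyword list is invented, not the one used by the authors.

```python
import random

# Invented keyword list for illustration only.
PROTEST_KEYWORDS = ("protest", "strike", "riot", "demonstration")

def keyword_sample(archive: list[str], n: int) -> list[str]:
    """Top-down: keep only articles matching a protest keyword."""
    hits = [doc for doc in archive
            if any(k in doc.lower() for k in PROTEST_KEYWORDS)]
    return hits[:n]

def random_sample(archive: list[str], n: int, seed: int = 0) -> list[str]:
    """Bottom-up: draw articles uniformly at random, ignoring keywords."""
    rng = random.Random(seed)
    return rng.sample(archive, min(n, len(archive)))
```

The random sample is more expensive to annotate (most drawn articles are not protest-related) but, as the abstract argues, yields training data whose negative examples match the real distribution the deployed classifier will see.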


2021 ◽  
Vol 7 ◽  
pp. e510
Author(s):  
Latifah Almuqren ◽  
Alexandra Cristea

Compared to other languages, Arabic lacks large corpora for Natural Language Processing (Assiri, Emam & Al-Dossari, 2018; Gamal et al., 2019), and a number of scholars have depended on translation from one language to another to construct their corpora (Rushdi-Saleh et al., 2011). This paper presents how we constructed, cleaned, pre-processed, and annotated our 20,000-tweet Gold Standard Corpus (GSC) AraCust, the first Telecom GSC for Arabic Sentiment Analysis (ASA) of Dialectal Arabic (DA). AraCust contains Saudi-dialect tweets, processed from a self-collected Arabic tweet dataset, and has been manually labelled for sentiment analysis (k = 0.60). In addition, we illustrate AraCust's power by performing an exploratory data analysis of the features that arise from the nature of our corpus, to assist in choosing the right ASA methods for it. To evaluate our Gold Standard Corpus AraCust, we first applied a simple experiment using a supervised classifier, to offer benchmark outcomes for forthcoming works. We then applied the same supervised classifier to a publicly available Arabic Twitter dataset, ASTD (Nabil, Aly & Atiya, 2015). The results show that our AraCust dataset outperforms the ASTD result, with 91% accuracy and an 89% average F1 score. The AraCust corpus will be released, together with code useful for its exploration, via GitHub as part of this submission.
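The agreement figure quoted above (k = 0.60) is Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. A minimal sketch of the statistic; the sentiment labels in the usage example are invented:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Values around 0.6 are conventionally read as "moderate to substantial" agreement, which is typical for subjective tasks like dialectal sentiment labelling.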


2021 ◽  
pp. 1-28
Author(s):  
Ali Hürriyetoğlu ◽  
Erdem Yörük ◽  
Osman Mutlu ◽  
Fırat Duruşan ◽  
Çağrı Yoltar ◽  
...  

Abstract We describe a gold standard corpus of protest events compiled from various local and international English-language sources from several countries. The corpus contains document-, sentence-, and token-level annotations. It facilitates creating machine learning models that automatically classify news articles and extract protest event-related information, and constructing knowledge bases that enable comparative social and political science studies. For each news source, annotation starts with a random sample of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results establish a common foundation for automated protest event collection studies, which is currently lacking in the literature.
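The batch-selection step described above (random seed batch, then batches drawn by active learning) is often implemented as uncertainty sampling. A hedged sketch, not the authors' actual pipeline: `unlabelled` maps a document id to a stand-in classifier probability that the document reports a protest event.

```python
def next_batch(unlabelled: dict[str, float], batch_size: int) -> list[str]:
    """Uncertainty sampling: pick documents whose current model score
    is closest to the 0.5 decision boundary, i.e. the documents the
    model is least sure about and annotation will help most."""
    return sorted(unlabelled,
                  key=lambda doc_id: abs(unlabelled[doc_id] - 0.5))[:batch_size]
```

After each batch is annotated and adjudicated, the classifier is retrained and the scores refreshed before the next draw.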


2019 ◽  
Vol 15 (3) ◽  
pp. 359-382 ◽  
Author(s):  
Nassim Abdeldjallal Otmani ◽  
Malik Si-Mohammed ◽  
Catherine Comparot ◽  
Pierre-Jean Charrel

Purpose: The purpose of this study is to propose a framework for extracting medical information from the Web using domain ontologies. Patient–Doctor conversations have become prevalent on the Web. For instance, solutions like HealthTap or AskTheDoctors allow patients to ask doctors health-related questions. However, most online health-care consumers still struggle to express their questions efficiently, due mainly to the expert/layman language and knowledge discrepancy. Extracting information from these layman descriptions, which typically lack expert terminology, is challenging. This hinders the efficiency of the underlying applications such as information retrieval. Herein, an ontology-driven approach is proposed, which aims at extracting information from such sparse descriptions using a meta-model.
Design/methodology/approach: A meta-model is designed to bridge the gap between the vocabulary of the medical experts and the consumers of the health services. The meta-model is mapped with SNOMED-CT to access the comprehensive medical vocabulary, as well as with WordNet to improve the coverage of layman terms during information extraction. To assess the potential of the approach, an information extraction prototype based on syntactical patterns is implemented.
Findings: The evaluation of the approach on the gold standard corpus defined in Task 1 of ShARe CLEF 2013 showed promising results: an F-score of 0.79 for recognizing medical concepts in real-life medical documents.
Originality/value: The originality of the proposed approach lies in the way information is extracted. The context defined through a meta-model proved to be efficient for the task of information extraction, especially from layman descriptions.
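The pattern-based extraction step can be pictured as lexical patterns that pull symptom phrases out of a layman question, followed by a dictionary lookup that maps them to expert terms. The sketch below is purely illustrative: the patterns and the toy layman-to-expert dictionary are invented stand-ins for the SNOMED-CT and WordNet mappings described above.

```python
import re

# Invented toy mapping; in the described framework this role is played by
# SNOMED-CT concepts reached via the meta-model and WordNet synonyms.
LAYMAN_TO_EXPERT = {"belly ache": "abdominal pain", "throwing up": "vomiting"}

# Invented lexical patterns for phrases like "I have X" / "I keep X".
PATTERN = re.compile(r"(?:i have|suffering from|i keep)\s+([a-z ]+?)(?:[.,]|$)")

def extract_concepts(question: str) -> list[str]:
    """Extract symptom phrases and normalise them to expert terminology."""
    phrases = PATTERN.findall(question.lower())
    return [LAYMAN_TO_EXPERT.get(p.strip(), p.strip()) for p in phrases]
```

Unmapped phrases fall through unchanged, which is where the WordNet expansion in the real system would improve layman-term coverage.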


2019 ◽  
Vol 7 ◽  
Author(s):  
Nhung Nguyen ◽  
Roselyn Gabud ◽  
Sophia Ananiadou

Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrences from biodiversity literature. To alleviate this issue, we have constructed the COPIOUS corpus—a gold standard corpus that covers a wide range of biodiversity entities.
Results: Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library, with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and Global Names Recognition and Discovery. More than 1,600 binary relations of Taxon–Habitat, Taxon–Person, Taxon–Geographical location and Taxon–Temporal expression were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.
Conclusion: The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in literature—a useful task for monitoring species distribution and preserving biodiversity.
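A simple form of the relation extraction step described above pairs entities that co-occur in the same sentence, producing candidate Taxon–Habitat relations for later filtering. This is a generic sketch, not the COPIOUS pattern system itself; the example entities are invented. Entities are (sentence_index, label, text) triples.

```python
def cooccurrence_relations(entities, head="TAXON", tail="HABITAT"):
    """Pair head/tail entities found in the same sentence."""
    relations = []
    for sent_i, label_a, text_a in entities:
        if label_a != head:
            continue
        for sent_j, label_b, text_b in entities:
            if sent_i == sent_j and label_b == tail:
                relations.append((text_a, text_b))
    return relations
```

Real pattern-based systems additionally require a connecting lexical or syntactic pattern (e.g. "found in", "grows in") between the two mentions, which cuts down the false positives that raw co-occurrence produces.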


2019 ◽  
Author(s):  
Maria Mitrofan ◽  
Verginica Barbu Mititelu ◽  
Grigorina Mitrofan
