NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition

2021 ◽  
pp. 103779
Author(s):  
Rezarta Islamaj ◽  
Chih-Hsuan Wei ◽  
David Cissel ◽  
Nicholas Miliaras ◽  
Olga Printseva ◽  
...  
2015 ◽  
Vol 21 (5) ◽  
pp. 699-724 ◽  
Author(s):  
LILI KOTLERMAN ◽  
IDO DAGAN ◽  
BERNARDO MAGNINI ◽  
LUISA BENTIVOGLI

Abstract In this work, we present a novel type of graph for natural language processing (NLP): textual entailment graphs (TEGs). We describe the complete methodology we developed for constructing such graphs and provide baselines for this task by evaluating relevant state-of-the-art technology. We situate our research in the context of text exploration, since it was motivated by joint work with industrial partners in the text analytics area. Accordingly, we present our motivating scenario and the first gold-standard dataset of TEGs. While our own motivation and the dataset focus on the text exploration setting, we suggest that TEGs can serve other purposes as well, and that the automatic creation of such graphs is an interesting task for the community.


Author(s):  
Varuni Sarwal ◽  
Sebastian Niehus ◽  
Ram Ayyala ◽  
Sei Chang ◽  
Angela Lu ◽  
...  

Abstract Advances in whole-genome sequencing promise to enable accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole-genome sequencing (WGS) data presents substantial challenges, and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence that investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive, PCR-confirmed gold standard set of SVs. In contrast to previous benchmarking studies, our gold standard dataset included a complete set of SVs, allowing us to report both the precision and the sensitivity of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV-detection performance, as SV-detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we determined the SV callers best suited for low- and ultra-low-pass sequencing data.
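The precision and sensitivity reported in benchmarks like this one come from matching each caller's output against the gold-standard set. A minimal sketch, using a simplified 50% reciprocal-overlap matching criterion (real SV benchmarks use more elaborate matching rules; intervals and names here are illustrative):

```python
# Precision and sensitivity of an SV caller against a complete gold
# standard of deletions. Intervals are (chrom, start, end); a call
# matches a gold deletion at >= 50% reciprocal overlap (a common,
# simplified criterion).

def reciprocal_overlap(a, b, threshold=0.5):
    if a[0] != b[0]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return (overlap / (a[2] - a[1]) >= threshold
            and overlap / (b[2] - b[1]) >= threshold)

def precision_sensitivity(calls, gold):
    # True positives: calls that match some gold deletion.
    tp = sum(any(reciprocal_overlap(c, g) for g in gold) for c in calls)
    # Gold deletions recovered by at least one call.
    found = sum(any(reciprocal_overlap(g, c) for c in calls) for g in gold)
    precision = tp / len(calls) if calls else 0.0
    sensitivity = found / len(gold) if gold else 0.0
    return precision, sensitivity

gold = [("chr1", 100, 500), ("chr1", 1000, 1400), ("chr2", 200, 900)]
calls = [("chr1", 120, 480), ("chr2", 250, 850), ("chr3", 10, 50)]
p, s = precision_sensitivity(calls, gold)
```

Because the gold standard is complete, the denominator of sensitivity is the full set of true deletions, which is what distinguishes this setup from benchmarks built on partial truth sets.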


Author(s):  
Sanja Stajner ◽  
Simone Paolo Ponzetto ◽  
Heiner Stuckenschmidt

Lexically and syntactically simpler sentences result in shorter reading times and better understanding for many people. However, no reliable systems for the automatic assessment of absolute sentence complexity have been proposed so far. Instead, the assessment is usually done manually, requiring expert human annotators. To address this problem, we first define sentence complexity assessment as a five-level classification task and build a ‘gold standard’ dataset. Next, we propose robust systems for sentence complexity assessment that use a novel set of features based on leveraging the lexical properties of freely available corpora, and we investigate the impact of feature type and corpus size on classification performance.
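Features "leveraging lexical properties of freely available corpora" typically reduce to word-frequency statistics. A minimal sketch of that idea (the toy frequency table and feature names are hypothetical; the paper derives its features from large corpora):

```python
# Corpus-frequency features for sentence complexity: rare words in a
# reference corpus signal harder sentences.

import math
from statistics import mean

# Toy frequency table standing in for counts from a large corpus.
corpus_freq = {"the": 1000, "cat": 120, "sat": 80, "on": 900,
               "mat": 60, "perfunctory": 2, "analysis": 45}

def lexical_features(sentence, freq, default=1):
    words = sentence.lower().split()
    counts = [freq.get(w, default) for w in words]
    return {
        "mean_log_freq": mean(math.log(c) for c in counts),  # lower -> harder
        "min_freq": min(counts),        # frequency of the rarest word
        "sentence_length": len(words),  # crude syntactic proxy
    }

simple = lexical_features("the cat sat on the mat", corpus_freq)
hard = lexical_features("perfunctory analysis", corpus_freq)
```

Feature vectors like these would then feed a standard five-class classifier; the study's question of corpus size amounts to how reliable the frequency estimates in `corpus_freq` are.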


Data ◽  
2021 ◽  
Vol 6 (8) ◽  
pp. 84
Author(s):  
Jenny Heddes ◽  
Pim Meerdink ◽  
Miguel Pieters ◽  
Maarten Marx

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that the available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, roughly half of them containing one or more named datasets. A distinguishing feature of this set is the many sentences using enumerations, conjunctions, and ellipses, resulting in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well, better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the (open-access) articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub.
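The "long BI+ tag sequences" refer to BIO-style token labels where each entity is a B tag followed by a run of I tags; enumerations of dataset names produce many such runs in one sentence. A small illustration (the example sentence and its labels are constructed for illustration, not drawn from the annotated corpus):

```python
# BIO tagging of a sentence enumerating several named datasets, and
# recovery of the entity spans from the tag sequence.

tokens = ["We", "evaluate", "on", "SQuAD", ",", "Natural", "Questions",
          "and", "MS", "MARCO", "."]
tags   = ["O",  "O",        "O",  "B",     "O", "B",       "I",
          "O",  "B",  "I",      "O"]

def extract_spans(tokens, tags):
    """Collect each B tag and its following run of I tags into one span."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

A tagger is scored on whether it recovers exactly these spans; sentences with conjunctions and ellipses make the B/I boundaries genuinely hard to place.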


2021 ◽  
Author(s):  
Katherine James ◽  
Aoesha Alsobhe ◽  
Simon Joseph Cockell ◽  
Anil Wipat ◽  
Matthew Pocock

Background: Probabilistic functional integrated networks (PFINs) are designed to aid our understanding of cellular biology and can be used to generate testable hypotheses about protein function. PFINs are generally created by scoring the quality of interaction datasets against a Gold Standard dataset, usually chosen from a separate high-quality data source, prior to their integration. Use of an external Gold Standard has several drawbacks, including data redundancy, data loss, and the need for identifier mapping, which can complicate the network build and impact PFIN performance.

Results: We describe the development of an integration technique, ssNet, that scores and integrates both high-throughput and low-throughput data from a single source database in a consistent manner, without the need for an external Gold Standard dataset. Using data from Saccharomyces cerevisiae, we show that ssNet is easier and faster, overcoming the challenges of data redundancy, Gold Standard bias, and ID mapping, while producing comparable performance. In addition, ssNet results in less loss of data and produces a more complete network.

Conclusions: The ssNet method allows PFINs to be built successfully from a single database, while producing network performance comparable to networks scored using an external Gold Standard source.

Keywords: Network integration; Bioinformatics; Gold Standards; Probabilistic functional integrated networks; Protein function prediction; Interactome.
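Scoring an interaction dataset against a Gold Standard is commonly done with a log-likelihood-style score: how much more often the dataset's pairs fall among Gold Standard positives than the prior odds would predict. A simplified sketch of that general scheme (this is a generic illustration, not the ssNet scoring method; pair names and sets are toy data):

```python
# Simplified log-likelihood score (LLS) of an interaction dataset
# against Gold Standard positive and negative pair sets.

import math

def log_likelihood_score(dataset, gold_pos, gold_neg):
    """ln of the dataset's positive/negative odds relative to the
    prior odds in the Gold Standard."""
    pos = len(dataset & gold_pos)   # dataset pairs confirmed positive
    neg = len(dataset & gold_neg)   # dataset pairs confirmed negative
    if pos == 0 or neg == 0:
        return 0.0                  # too little overlap to score
    prior = len(gold_pos) / len(gold_neg)
    return math.log((pos / neg) / prior)

# Toy Gold Standard; protein pairs are assumed canonically ordered.
gold_pos = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
gold_neg = {("a", "c"), ("b", "d"), ("e", "g"), ("f", "h")}
score = log_likelihood_score({("a", "b"), ("c", "d"), ("a", "c")},
                             gold_pos, gold_neg)
```

The drawbacks the abstract lists are visible even in this sketch: the Gold Standard pairs must share identifiers with the dataset (ID mapping), and any dataset pair absent from both gold sets contributes nothing (data loss).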


2009 ◽  
Vol 28 (10) ◽  
pp. 1148-1155 ◽  
Author(s):  
György T. Balogh ◽  
Benjámin Gyarmati ◽  
Balázs Nagy ◽  
László Molnár ◽  
György M. Keserű

2021 ◽  
Author(s):  
Fabio D’Isidoro ◽  
Christophe Chênes ◽  
Stephen J. Ferguson ◽  
Jérôme Schmid

2010 ◽  
Author(s):  
Supriyanto Pawiro ◽  
Primoz Markelj ◽  
Christelle Gendrin ◽  
Michael Figl ◽  
Markus Stock ◽  
...  

10.2196/16757 ◽  
2020 ◽  
Vol 22 (6) ◽  
pp. e16757 ◽  
Author(s):  
Long Nguyen ◽  
Mark Stoové ◽  
Douglas Boyle ◽  
Denton Callander ◽  
Hamish McManus ◽  
...  

Background: The Australian Collaboration for Coordinated Enhanced Sentinel Surveillance (ACCESS) was established to monitor national testing and test outcomes for blood-borne viruses (BBVs) and sexually transmissible infections (STIs) in key populations. ACCESS extracts deidentified data from sentinel health services, including general practice, sexual health, and infectious disease clinics, as well as public and private laboratories that conduct a large volume of BBV/STI testing. An important attribute of ACCESS is the ability to accurately link individual-level records within and between the participating sites, as this enables the system to produce reliable epidemiological measures.

Objective: The aim of this study was to evaluate the use of GRHANITE software in ACCESS to extract and link deidentified data from participating clinics and laboratories. GRHANITE generates irreversible hashed linkage keys based on patient-identifying data captured in the patient electronic medical records (EMRs) at the site. The algorithms that produce the data linkage keys use probabilistic linkage principles to account for the variability and completeness of the underlying patient identifiers, producing up to four linkage key types per EMR. Errors in the linkage process can arise from imperfect or missing identifiers, impacting the system’s integrity. Therefore, it is important to evaluate the quality of the linkages created and the outcome of the linkage for ongoing public health surveillance.

Methods: Although ACCESS data are deidentified, we created two gold-standard datasets in which the true match status could be confirmed, in order to compare against record linkage results arising from different approaches of the GRHANITE Linkage Tool. We report sensitivity, specificity, and positive and negative predictive values where possible, and estimated specificity by comparing a history of HIV and hepatitis C antibody results for linked EMRs.

Results: Sensitivity ranged from 96% to 100%, and specificity was 100%, when applying the GRHANITE Linkage Tool to a small gold-standard dataset of 3700 clinical medical records. Medical records in this dataset had a very high level of data completeness, with name, date of birth, postcode, and Medicare number available for use in record linkage. In a larger gold-standard dataset containing 86,538 medical records across clinics and pathology services, with a lower level of data completeness, sensitivity ranged from 94% to 95% and estimated specificity ranged from 91% to 99% in 4 of the 6 record linkage approaches.

Conclusions: This study’s findings suggest that the GRHANITE Linkage Tool can link deidentified patient records accurately and can be used with confidence for public health surveillance in systems such as ACCESS.
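The idea behind irreversible hashed linkage keys is that each site normalises patient identifiers, combines them into several key types, and hashes the result, so that records can be matched across sites without any raw identifiers leaving the site. A minimal sketch of that idea (this is NOT GRHANITE's actual algorithm; the field combinations and normalisation rules shown are assumptions for illustration):

```python
# Illustrative irreversible hashed linkage keys: normalise identifiers,
# build several key types, hash each so raw identifiers are never shared.

import hashlib

def normalise(value):
    """Lowercase and strip everything except letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def linkage_keys(name, dob, postcode, medicare=None):
    """Derive multiple key types so records can still link when one
    field is missing or recorded differently between sites."""
    combos = {
        "name_dob_postcode": (name, dob, postcode),
        "name_dob": (name, dob),
    }
    if medicare:
        combos["medicare_dob"] = (medicare, dob)
    return {
        key_type: hashlib.sha256(
            "|".join(normalise(p) for p in parts).encode("utf-8")
        ).hexdigest()
        for key_type, parts in combos.items()
    }

a = linkage_keys("Jane  Doe", "1980-01-02", "3000")
b = linkage_keys("jane doe", "1980-01-02", "3000")
```

Normalisation absorbs superficial variation (case, spacing, punctuation), while the multiple key types give the probabilistic matching stage several chances to link records with imperfect identifiers, which is exactly the trade-off the sensitivity/specificity evaluation above measures.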


Author(s):  
Hossein Fani ◽  
Mahtab Tamannaee ◽  
Fattane Zarrinkalam ◽  
Jamil Samouh ◽  
Samad Paydar ◽  
...  
