PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology

Bioinformatics ◽

10.1093/bioinformatics/btab019 ◽

2021 ◽

Author(s):

Ling Luo ◽

Shankai Yan ◽

Po-Ting Lai ◽

Daniel Veltri ◽

Andrew Oler ◽

...

Keyword(s):

Machine Learning ◽

Hybrid Method ◽

Human Phenotype Ontology ◽

Training Data ◽

Supplementary Information ◽

Training Dataset ◽

Biomedical Text ◽

Phenotype Ontology ◽

Concept Recognition ◽

Human Phenotype

Abstract Motivation: Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts; these can recognize more unseen concept synonyms through automatic feature learning. However, most such methods require large corpora of manually annotated data for model training, which are difficult to obtain due to the high cost of human annotation. Results: In this article, we propose PhenoTagger, a hybrid method that combines dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (an n-gram from the input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger, without requiring manually annotated training data, achieves performance competitive with state-of-the-art supervised methods. Availability and implementation: The source code, API information and data for PhenoTagger are freely available at https://github.com/ncbi-nlp/PhenoTagger. Supplementary information: Supplementary data are available at Bioinformatics online.
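
The dictionary side of the pipeline described in this abstract can be illustrated with a minimal sketch. It enumerates candidate n-grams from a sentence and looks each one up in a small dictionary. The function names, the whitespace tokenizer, the `max_n` limit and the two-entry toy dictionary are assumptions made for illustration; this is not PhenoTagger's actual code, and the machine learning classifier is not shown.

```python
from typing import Dict, List, Tuple

# Toy dictionary: lowercase surface form -> HPO concept ID.
# Only two entries, chosen for illustration; a real dictionary would hold
# every HPO concept name and synonym.
TOY_DICTIONARY: Dict[str, str] = {
    "short stature": "HP:0004322",
    "seizure": "HP:0001250",
}


def candidate_ngrams(tokens: List[str], max_n: int = 3) -> List[Tuple[int, int, str]]:
    """Return (start, end, phrase) for every n-gram of at most max_n tokens."""
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_n, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans


def dictionary_tag(sentence: str, dictionary: Dict[str, str],
                   max_n: int = 3) -> List[Tuple[int, int, str, str]]:
    """Tag candidate phrases that exactly match a dictionary entry."""
    tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
    return [
        (start, end, phrase, dictionary[phrase])
        for start, end, phrase in candidate_ngrams(tokens, max_n)
        if phrase in dictionary
    ]
```

In a hybrid setting, the same candidate spans would also be scored by a trained classifier, and the two sets of predictions would then be merged.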

DiNGO: standalone application for Gene Ontology and Human Phenotype Ontology term enrichment analysis

Bioinformatics ◽

10.1093/bioinformatics/btz836 ◽

2019 ◽

Author(s):

Radoslav Davidović ◽

Vladimir Perovic ◽

Branislava Gemovic ◽

Nevena Veljkovic

Keyword(s):

Gene Ontology ◽

Source Code ◽

Enrichment Analysis ◽

Human Phenotype Ontology ◽

Supplementary Information ◽

Phenotype Ontology ◽

Ontology Term ◽

Human Phenotype ◽

Term Enrichment Analysis ◽

Term Enrichment

Abstract Summary: Although various tools for Gene Ontology (GO) term enrichment analysis are available, there is still room for improvement. Hence, we present DiNGO, a standalone application based on open-source code from BiNGO, a widely used application for assessing the overrepresentation of GO categories. Besides facilitating GO term enrichment analyses, DiNGO has been developed to allow for convenient Human Phenotype Ontology (HPO) term overrepresentation investigation. This is an important contribution considering the increasing interest in HPO in scientific research and its potential in clinical settings. DiNGO supports gene/protein identifier conversion and automatic updating of GO and HPO annotation resources. Finally, DiNGO can rapidly process a large amount of data due to its multithreaded design. Availability and implementation: DiNGO is implemented in Java, and its source code, example datasets and instructions are available on GitHub: https://github.com/radoslav180/DiNGO. A pre-compiled jar file is available at: https://www.vin.bg.ac.rs/180/tools/DiNGO.php. Supplementary information: Supplementary data are available at Bioinformatics online.
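
Term overrepresentation is commonly assessed with a hypergeometric upper-tail test. The abstract does not state which statistical test DiNGO applies, so the sketch below is only an illustration of the general idea, with hypothetical function and parameter names.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """Return P(X >= k) for a hypergeometric draw.

    N: genes in the background set
    K: background genes annotated with the term
    n: genes in the study set
    k: study-set genes annotated with the term
    """
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible outcomes contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total
```

A small p-value suggests that the term appears in the study set more often than expected by chance. In practice the p-values for many terms would also need a multiple-testing correction.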

Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora

Database ◽

10.1093/database/bav005 ◽

2015 ◽

Vol 2015 (0) ◽

pp. bav005-bav005 ◽

Cited By ~ 31

Author(s):

T. Groza ◽

S. Kohler ◽

S. Doelken ◽

N. Collier ◽

A. Oellrich ◽

...

Keyword(s):

Human Phenotype Ontology ◽

Test Suite ◽

Phenotype Ontology ◽

Concept Recognition ◽

Human Phenotype

Predicting genes from phenotypes using Human Phenotype Ontology (HPO) terms

Molecular Genetics and Metabolism ◽

10.1016/s1096-7192(21)00318-8 ◽

2021 ◽

Vol 132 ◽

pp. S149

Author(s):

Anne Slavotinek ◽

Hannah Prasad ◽

Hannah Hoban ◽

Tiffany Yip ◽

Shannon Rego ◽

...

Keyword(s):

Human Phenotype Ontology ◽

Phenotype Ontology ◽

Human Phenotype

Accurate estimation of isoelectric point of protein and peptide based on amino acid sequences

Bioinformatics ◽

10.1093/bioinformatics/btv674 ◽

2015 ◽

Vol 32 (6) ◽

pp. 821-827 ◽

Cited By ~ 19

Author(s):

Enrique Audain ◽

Yassel Ramos ◽

Henning Hermjakob ◽

Darren R. Flower ◽

Yasset Perez-Riverol

Keyword(s):

Machine Learning ◽

Isoelectric Point ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Basis Set ◽

Superior Performance ◽

Supplementary Information ◽

Training Dataset ◽

Accurate Estimation ◽

Prediction Methods

Abstract Motivation: In any macromolecular polyprotic system (for example protein, DNA or RNA), the isoelectric point, commonly referred to as the pI, can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge, and thus the electrophoretic mobility, of the ampholyte sums to zero. Many modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset, and their resulting performance will strongly depend on the quality of that data. In contrast with iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction.
Contact: [email protected] Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
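
The iterative style of pI calculation mentioned in this abstract can be sketched as a bisection search for the pH at which the Henderson-Hasselbalch net charge is zero. The pKa values below are illustrative textbook-style numbers and are an assumption; the abstract itself notes that results are highly sensitive to this choice. The function names are hypothetical and are not taken from the pIR package.

```python
from typing import Dict

# Illustrative pKa set (assumption); published sets differ noticeably.
POSITIVE_PKA: Dict[str, float] = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA: Dict[str, float] = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    positive = [N_TERM_PKA] + [POSITIVE_PKA[a] for a in sequence if a in POSITIVE_PKA]
    negative = [C_TERM_PKA] + [NEGATIVE_PKA[a] for a in sequence if a in NEGATIVE_PKA]
    charge = sum(1.0 / (1.0 + 10 ** (ph - pka)) for pka in positive)
    charge -= sum(1.0 / (1.0 + 10 ** (pka - ph)) for pka in negative)
    return charge


def isoelectric_point(sequence: str, low: float = 0.0, high: float = 14.0,
                      tolerance: float = 1e-4) -> float:
    """Bisection on pH: net charge decreases monotonically as pH rises."""
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged, so the pI lies above mid
        else:
            high = mid  # zero or negative charge, so the pI lies at or below mid
    return (low + high) / 2.0
```

Swapping in a different pKa table changes the predicted pI, which is one way to reproduce the sensitivity to the basis set that the abstract reports.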

Unique insights from ClinicalTrials.gov by mining protein mutations and RSids in addition to applying the Human Phenotype Ontology v1 (protocols.io.bfacjiaw)

protocols.io ◽

10.17504/protocols.io.bfacjiaw ◽

2020 ◽

Author(s):

Shray Alag

Keyword(s):

Human Phenotype Ontology ◽

Phenotype Ontology ◽

Human Phenotype ◽

Protein Mutations

Biological and Medical Ontologies: Human Phenotype Ontology (HPO)

Encyclopedia of Bioinformatics and Computational Biology ◽

10.1016/b978-0-12-809633-8.20398-1 ◽

2019 ◽

pp. 848-857

Author(s):

Anna Bernasconi ◽

Marco Masseroli

Keyword(s):

Human Phenotype Ontology ◽

Phenotype Ontology ◽

Human Phenotype

Curator: A No-Code Self-Supervised Learning and Active Labeling Tool to Create Labeled Image Datasets from Petabyte-Scale Imagery

10.5194/egusphere-egu21-6853 ◽

2021 ◽

Author(s):

Rudy Venguswamy ◽

Mike Levy ◽

Anirudh Koul ◽

Satyarth Praveen ◽

Tarun Narayanan ◽

...

Keyword(s):

Machine Learning ◽

Active Learning ◽

Open Source ◽

Forest Fires ◽

Seed Set ◽

Training Data ◽

Training Dataset ◽

Reference Image ◽

Query Image ◽

Real World Datasets

Machine learning modeling for Earth events at NASA is often limited by the availability of labeled examples. For example, training classifiers for forest fires or oil spills from satellite imagery requires curating a massive and diverse dataset of example forest fires, a tedious multi-month effort requiring careful review of over 196.9 million square miles of data per day for 20 years. While such images might exist in abundance within 40 petabytes of unlabeled satellite data, finding these positive examples to include in a training dataset for a machine learning model is extremely time-consuming and requires researchers to "hunt" for positive examples, like finding a needle in a haystack.
We present a no-code open-source tool, Curator, whose goal is to minimize the amount of manual image labeling needed to achieve a state-of-the-art classifier. The pipeline, purpose-built to take advantage of the massive amount of unlabeled images, consists of (1) self-supervised training to convert unlabeled images into meaningful representations, (2) search-by-example to collect a seed set of images, and (3) human-in-the-loop active learning to iteratively ask for labels on uncertain examples and train on them.
In step 1, a model capable of representing unlabeled images meaningfully is trained with a self-supervised algorithm (like SimCLR) on a random subset of the dataset (one that conforms to the researchers’ specified “training budget”). Since real-world datasets are often imbalanced, leading to suboptimal models, the initial model is used to generate embeddings on the entire dataset. Then, images with equidistant embeddings are sampled. This iterative training and resampling strategy improves both the balance of the training data and the models with every iteration.
In step 2, researchers supply an example image of interest, and the output embeddings generated from this image are used to find other images with embeddings near the reference image’s embedding in Euclidean space (hence images that look similar to the query image). These proposed candidate images contain a higher density of positive examples and are annotated manually as a seed set. In step 3, the seed labels are used to train a classifier to identify more candidate images for human inspection with active learning. In each classification training loop, candidate images for labeling are sampled from the larger unlabeled dataset based on the images that the model is most uncertain about (p ≈ 0.5).
Curator is released as an open-source package built on PyTorch-Lightning. The pipeline uses GPU-based transforms from the NVIDIA-Dali package for augmentation, leading to a 5-10x speedup in self-supervised training, and is run from the command line.
By iteratively training a self-supervised model and a classifier in tandem with human manual annotation, this pipeline is able to unearth more positive examples from severely imbalanced datasets that were previously untrainable with self-supervision algorithms. In applications such as detecting wildfires, atmospheric dust, or turning outward with telescopic surveys, increasing the number of positive candidates presented to humans for manual inspection increases the efficacy of classifiers and multiplies the efficiency of researchers’ data curation efforts.
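
Steps 2 and 3 of the pipeline described above can be sketched in a few lines: a nearest-neighbour search in Euclidean embedding space, followed by uncertainty sampling around p ≈ 0.5. The function names and the dictionary-based data layout are assumptions for illustration and are not Curator's actual interface. A real system at this scale would use an approximate nearest-neighbour index rather than a full sort.

```python
import math
from typing import Dict, List, Sequence


def nearest_images(query: Sequence[float],
                   embeddings: Dict[str, Sequence[float]],
                   k: int) -> List[str]:
    """Step 2 (search-by-example): the k images closest to the query embedding."""
    return sorted(embeddings, key=lambda name: math.dist(query, embeddings[name]))[:k]


def most_uncertain(probabilities: Dict[str, float], k: int) -> List[str]:
    """Step 3 (active learning): the k images whose predicted probability is nearest 0.5."""
    return sorted(probabilities, key=lambda name: abs(probabilities[name] - 0.5))[:k]
```

The images returned by `nearest_images` would be labeled by hand to form the seed set. Those returned by `most_uncertain` would be sent to a human annotator in each active-learning round.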

Unique insights from ClinicalTrials.gov by mining protein mutations and RSids in addition to applying the Human Phenotype Ontology

PLoS ONE ◽

10.1371/journal.pone.0233438 ◽

2020 ◽

Vol 15 (5) ◽

pp. e0233438

Author(s):

Shray Alag

Keyword(s):

Human Phenotype Ontology ◽

Phenotype Ontology ◽

Human Phenotype ◽

Protein Mutations

A machine learning framework to determine geolocations from metagenomic profiling

Biology Direct ◽

10.1186/s13062-020-00278-z ◽

2020 ◽

Vol 15 (1) ◽

Cited By ~ 1

Author(s):

Lihong Huang ◽

Canqiang Xu ◽

Wenxian Yang ◽

Rongshan Yu

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Geographic Origin ◽

Training Data ◽

Metagenomic Data ◽

Training Dataset ◽

Kriging Interpolation ◽

Learning Framework ◽

Testing Data ◽

Microbial Samples

Abstract Background: Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework to determine geolocations from metagenomic profiling of microbial samples. Results: Our method was applied to the multi-source microbiome data from the MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints. First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. By using logistic regression with L2 normalization, the prediction accuracy of the model reaches 86%, averaged over 100 random splits of training and validation datasets. The testing data consist of samples from cities that do not occur in the training data. To predict the “mystery” cities of the testing data, which were not sampled before, we first defined biological coordinates for sampled cities based on the similarity of their microbial samples. Then we performed an affine transform on the map such that the distance between cities measures their biological difference rather than geographical distance. After that, we derived the probabilities that a given testing sample comes from unsampled cities, based on its predicted probabilities for sampled cities, using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities of origin of testing samples.
Conclusion: Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset.
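
The evaluation protocol described in this abstract, accuracy averaged over repeated random training/validation splits, can be sketched as follows. To keep the example dependency-free, a nearest-centroid classifier stands in for the logistic regression model used by the authors. The function names, the 20% validation fraction and the fixed seed are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Classifier = Callable[[List[Vector], List[str], List[Vector]], List[str]]


def nearest_centroid(train_x: List[Vector], train_y: List[str],
                     test_x: List[Vector]) -> List[str]:
    """Stand-in classifier (not the paper's model): predict the closest class centroid."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x: Vector) -> str:
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return [predict(x) for x in test_x]


def repeated_holdout_accuracy(xs: List[Vector], ys: List[str], classifier: Classifier,
                              n_splits: int = 100, val_fraction: float = 0.2,
                              seed: int = 0) -> float:
    """Mean validation accuracy over n_splits random train/validation splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_val = max(1, int(len(xs) * val_fraction))
    accuracies = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        val, train = indices[:n_val], indices[n_val:]
        predictions = classifier([xs[i] for i in train], [ys[i] for i in train],
                                 [xs[i] for i in val])
        correct = sum(p == ys[i] for p, i in zip(predictions, val))
        accuracies.append(correct / n_val)
    return sum(accuracies) / n_splits
```

Averaging over many splits reduces the dependence of the reported accuracy on any single random partition of the data.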

TADA: phylogenetic augmentation of microbiome samples enhances phenotype classification

Bioinformatics ◽

10.1093/bioinformatics/btz394 ◽

2019 ◽

Vol 35 (14) ◽

pp. i31-i40 ◽

Cited By ~ 1

Author(s):

Erfan Sayyari ◽

Ban Kawas ◽

Siavash Mirarab

Keyword(s):

Machine Learning ◽

Sample Size ◽

Data Augmentation ◽

Training Data ◽

Supplementary Information ◽

High Dimensional ◽

Learning Methods ◽

Machine Learning Methods ◽

Phenotype Classification ◽

Microbiome Data

Abstract Motivation: Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data are high-dimensional and not abundant, leading to a high-dimensional, low-sample-size, under-determined system. Moreover, microbiome data are often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data, with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from the microbiome. Data augmentation, which consists of building synthetic samples and adding them to the training data, is a technique that has proved helpful for many machine learning tasks. Results: In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low sample size. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes. Availability and implementation: TADA is available at https://github.com/tada-alg/TADA. Supplementary information: Supplementary data are available at Bioinformatics online.
