Development and Validation of Phenotype Classifiers across Multiple Sites in the Observational Health Data Sciences and Informatics (OHDSI) Network


2020 ◽  
Vol 27 (6) ◽  
pp. 877-883 ◽  
Author(s):  
Mehr Kashyap ◽  
Martin Seneviratne ◽  
Juan M Banda ◽  
Thomas Falconer ◽  
Borim Ryu ◽  
...  

Abstract Objective Accurate electronic phenotyping is essential to support collaborative observational research. Supervised machine learning methods can be used to train phenotype classifiers in a high-throughput manner using imperfectly labeled data. We developed 10 phenotype classifiers using this approach and evaluated performance across multiple sites within the Observational Health Data Sciences and Informatics (OHDSI) network. Materials and Methods We constructed classifiers using the Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation (APHRODITE) R package, an open-source framework for learning phenotype classifiers using datasets in the Observational Medical Outcomes Partnership Common Data Model. We labeled training data based on the presence of multiple mentions of disease-specific codes. Performance was evaluated on cohorts derived using rule-based definitions and real-world disease prevalence. Classifiers were developed and evaluated across 3 medical centers, including 1 international site. Results Compared to the multiple-mentions labeling heuristic, classifiers showed a mean recall boost of 0.43 with a mean precision loss of 0.17. Performance decreased slightly when classifiers were shared across medical centers, with mean recall and precision decreasing by 0.08 and 0.01, respectively, at a site within the USA, and by 0.18 and 0.10, respectively, at an international site. Discussion and Conclusion We demonstrate a high-throughput pipeline for constructing and sharing phenotype classifiers across sites within the OHDSI network using APHRODITE. Classifiers exhibit good portability between sites within the USA but limited portability internationally, indicating that classifier generalizability may have geographic limitations; consequently, sharing the classifier-building recipe, rather than the pretrained classifiers, may be more useful for facilitating collaborative observational research.
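The "multiple mentions" labeling heuristic the abstract describes is simple enough to sketch. The function and the example condition codes below are illustrative assumptions, not part of APHRODITE's actual API:

```python
def label_by_multiple_mentions(patient_codes, target_codes, min_mentions=2):
    """Noisy labeling heuristic: a patient becomes a positive training
    example if disease-specific codes appear at least min_mentions times."""
    mentions = sum(1 for code in patient_codes if code in target_codes)
    return mentions >= min_mentions

# Hypothetical ICD-10 codes for type 2 diabetes (illustrative only).
t2dm = {"E11.9", "E11.65"}

print(label_by_multiple_mentions(["E11.9", "I10", "E11.9"], t2dm))  # True
print(label_by_multiple_mentions(["E11.9", "I10"], t2dm))           # False
```

Labels produced this way are imperfect (a single coding error flips a patient's label), which is why the paper evaluates the trained classifiers against rule-based cohort definitions rather than against the heuristic itself.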


2011 ◽  
Vol 9 (2) ◽  
pp. 99
Author(s):  
Alex J Auseon ◽  
Albert J Kolibash ◽  

Background: Educating trainees during cardiology fellowship is a process in constant evolution, with program directors regularly adapting to increasing demands and regulations as they strive to prepare graduates for practice in today's healthcare environment. Methods and Results: In a 10-year follow-up to a previous manuscript regarding fellowship education, we reviewed the literature regarding the most topical issues facing training programs in 2010, describing our approach at The Ohio State University. Conclusion: In the midst of challenges posed by the increasing complexity of training requirements and documentation, work hour restrictions, and the new definitions of quality and safety, we propose methods of curricula revision and collaboration that may serve as an example to other medical centers.


Healthcare ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. 441
Author(s):  
Md. Mohaimenul Islam ◽  
Tahmina Nasrin Poly ◽  
Belal Alsinglawi ◽  
Li-Fong Lin ◽  
Shuo-Chen Chien ◽  
...  

The application of artificial intelligence (AI) to health has increased, including to COVID-19. This study aimed to provide a clear overview of COVID-19-related AI publication trends using longitudinal bibliometric analysis. A systematic literature search was conducted on the Web of Science for English-language peer-reviewed articles related to the application of AI to COVID-19. A search strategy was developed to collect relevant articles and extract bibliographic information (e.g., country, research area, sources, and author). VOSviewer (Leiden University) and Bibliometrix (R package) were used to visualize the co-occurrence networks of authors, sources, countries, institutions, global collaborations, citations, co-citations, and keywords. We included 729 research articles on the application of AI to COVID-19 published between 2020 and 2021. PLOS One (33/729, 4.52%), Chaos, Solitons & Fractals (29/729, 3.97%), and the Journal of Medical Internet Research (29/729, 3.97%) were the most common journals publishing these articles. China (190/729, 26.06%), the USA (173/729, 23.73%), and India (92/729, 12.62%) were the most prolific countries of origin. The Huazhong University of Science and Technology, Wuhan University, and the Chinese Academy of Sciences were the most productive institutions. This is the first study to show a comprehensive picture of the global efforts to address COVID-19 using AI. The findings of this study also provide insights and research directions for academic researchers, policymakers, and healthcare practitioners who wish to collaborate in these domains in the future.
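The co-occurrence networks that tools like VOSviewer and Bibliometrix visualize rest on a simple counting step: how often two keywords appear together on the same article. A minimal sketch of that step, with made-up example keywords:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(keyword_lists):
    """Count how often each pair of keywords appears together in one article."""
    pairs = Counter()
    for keywords in keyword_lists:
        # sort so ("a", "b") and ("b", "a") land in the same counter slot
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical keyword records for three articles (illustrative only).
articles = [
    ["covid-19", "deep learning", "ct"],
    ["covid-19", "deep learning", "x-ray"],
    ["covid-19", "forecasting"],
]
links = cooccurrence(articles)
print(links[("covid-19", "deep learning")])  # 2 — appear together in 2 articles
```

The resulting pair counts become edge weights in the keyword map; node placement and clustering are then handled by the visualization tool.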


Author(s):  
Wenbin Ye ◽  
Tao Liu ◽  
Hongjuan Fu ◽  
Congting Ye ◽  
Guoli Ji ◽  
...  

Abstract Motivation Alternative polyadenylation (APA) has been widely recognized as a widespread, dynamically modulated mechanism. Studies based on 3′ end sequencing and/or RNA-seq have profiled poly(A) sites in various species with diverse pipelines, yet no unified and easy-to-use toolkit is available for comprehensive APA analyses. Results We developed an R package called movAPA for modeling and visualization of the dynamics of alternative polyadenylation across biological samples. movAPA incorporates rich functions for preprocessing, annotation, and statistical analysis of poly(A) sites, identification of poly(A) signals, profiling of APA dynamics, and visualization. In particular, seven metrics are provided for measuring the tissue specificity or usage of APA sites across samples. Three methods are used for identifying 3′ UTR shortening/lengthening events between conditions. APA site switching involving non-3′ UTR polyadenylation can also be explored. Using poly(A) site data from rice and mouse sperm cells, we demonstrated the high scalability and flexibility of movAPA in profiling APA dynamics across tissues and single cells. Availability and implementation https://github.com/BMILAB/movAPA. Supplementary information Supplementary data are available at Bioinformatics online.
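As a rough illustration of what a tissue-specificity metric for poly(A) site usage can look like, here is an entropy-based score, a common choice in expression analysis. It is an assumed example, not necessarily one of movAPA's seven metrics:

```python
import math

def shannon_specificity(usage_by_tissue):
    """Entropy of a poly(A) site's usage distribution across tissues:
    0 bits  -> used in a single tissue (maximally tissue-specific);
    log2(n) -> used uniformly across all n tissues (constitutive)."""
    total = sum(usage_by_tissue)
    probs = [u / total for u in usage_by_tissue if u > 0]
    return -sum(p * math.log2(p) for p in probs)

# Usage counts of one poly(A) site across four hypothetical tissues.
print(round(shannon_specificity([10, 0, 0, 0]), 2))  # 0.0 (tissue-specific)
print(round(shannon_specificity([5, 5, 5, 5]), 2))   # 2.0 (uniform over 4 tissues)
```

Low entropy flags sites worth examining for condition-specific 3′ UTR shortening or lengthening; high entropy suggests constitutive usage.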


2021 ◽  
Vol 13 (3) ◽  
pp. 368
Author(s):  
Christopher A. Ramezan ◽  
Timothy A. Warner ◽  
Aaron E. Maxwell ◽  
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes. NEU, however, required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variation in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
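The experimental design above — train on progressively smaller subsets and evaluate on a fixed test set — can be sketched with a toy classifier. The nearest-centroid model and the synthetic two-class data below are stand-ins for illustration, not the paper's GEOBIA pipeline or its six algorithms:

```python
import math
import random

def nearest_centroid_fit(X, y):
    """Per-class feature centroids — a deliberately simple stand-in classifier."""
    cents = {}
    for label in set(y):
        pts = [x for x, lab in zip(X, y) if lab == label]
        cents[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return cents

def predict(cents, x):
    return min(cents, key=lambda lab: math.dist(cents[lab], x))

def sample(n):
    """Synthetic two-class 2-D data: class 0 near (0, 0), class 1 near (3, 3)."""
    X, y = [], []
    for _ in range(n):
        lab = random.randint(0, 1)
        X.append((random.gauss(3 * lab, 1.0), random.gauss(3 * lab, 1.0)))
        y.append(lab)
    return X, y

random.seed(0)
X_test, y_test = sample(500)      # fixed evaluation set
accs = {}
for n_train in (40, 1000):        # very small vs large training set
    X_tr, y_tr = sample(n_train)
    model = nearest_centroid_fit(X_tr, y_tr)
    accs[n_train] = sum(predict(model, x) == t
                        for x, t in zip(X_test, y_test)) / len(y_test)
    print(n_train, round(accs[n_train], 3))
```

Holding the test set fixed while varying only the training sample, as here, is what lets the paper attribute accuracy differences to training set size rather than to evaluation noise.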


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xue Lin ◽  
Yingying Hua ◽  
Shuanglin Gu ◽  
Li Lv ◽  
Xingyu Li ◽  
...  

Abstract Background Genomic localized hypermutation regions have been found in cancers and reported to be related to cancer prognosis. This localized hypermutation is quite different from usual somatic mutation in its frequency of occurrence and genomic density: it is like a “violent storm” of mutations, which is just what the Greek word “kataegis” means. Results There is a need for a lightweight and simple-to-use toolkit to identify and visualize localized hypermutation regions in the genome, so we developed the R package “kataegis” to meet it. The package uses only three steps to identify genomic hypermutation regions: i) read in the variation files in standard formats; ii) calculate the inter-mutational distances; iii) identify the hypermutation regions with appropriate parameters; and finally one step to visualize the nucleotide contents and spectra of both the foci and flanking regions, and the genomic landscape of these regions. Conclusions The kataegis package is available on Bioconductor/GitHub (https://github.com/flosalbizziae/kataegis) and provides a lightweight and simple-to-use toolkit for quickly identifying and visualizing genomic hypermutation regions.
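The inter-mutational-distance step the package describes can be sketched as follows: sort mutation positions, then flag runs of closely spaced mutations. The thresholds below are illustrative, not the package defaults:

```python
def hypermutation_regions(positions, max_imd=1000, min_mut=6):
    """Flag runs of >= min_mut consecutive mutations whose inter-mutational
    distances (IMDs) are all <= max_imd — a common kataegis-style heuristic."""
    positions = sorted(positions)
    regions, start = [], 0
    for i in range(1, len(positions) + 1):
        # a large gap (or the end of the list) closes the current run
        if i == len(positions) or positions[i] - positions[i - 1] > max_imd:
            if i - start >= min_mut:
                regions.append((positions[start], positions[i - 1]))
            start = i
    return regions

# Hypothetical mutation positions on one chromosome: a dense cluster of six,
# then two isolated mutations far downstream.
muts = [100, 400, 900, 1300, 1900, 2400, 50_000, 51_000, 300_000]
print(hypermutation_regions(muts))  # [(100, 2400)]
```

Everything downstream in the package (nucleotide content, mutation spectra, genomic landscape) is visualization of the regions this step emits.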


2017 ◽  
Vol 58 (7) ◽  
pp. 508-521 ◽  
Author(s):  
Todd A. Townsend ◽  
Marcus C. Parrish ◽  
Bevin P. Engelward ◽  
Mugimane G. Manjanatha

Land ◽  
2018 ◽  
Vol 7 (4) ◽  
pp. 118 ◽  
Author(s):  
Myroslava Lesiv ◽  
Linda See ◽  
Juan Laso Bayas ◽  
Tobias Sturn ◽  
Dmitry Schepaschenko ◽  
...  

Very high resolution (VHR) satellite imagery from Google Earth and Microsoft Bing Maps is increasingly being used in a variety of applications from computer sciences to arts and humanities. In the field of remote sensing, one use of this imagery is to create reference data sets through visual interpretation, e.g., to complement existing training data or to aid in the validation of land-cover products. Through new applications such as Collect Earth, this imagery is also being used for monitoring purposes in the form of statistical surveys obtained through visual interpretation. However, little is known about where VHR satellite imagery exists globally or the dates of the imagery. Here we present a global overview of the spatial and temporal distribution of VHR satellite imagery in Google Earth and Microsoft Bing Maps. The results show an uneven availability globally, with biases in certain areas such as the USA, Europe and India, and with clear discontinuities at political borders. We also show that the availability of VHR imagery is currently not adequate for monitoring protected areas and deforestation, but is better suited for monitoring changes in cropland or urban areas using visual interpretation.


2019 ◽  
Author(s):  
Elena Nabieva ◽  
Satyarth Mishra Sharma ◽  
Yermek Kapushev ◽  
Sofya K. Garushyants ◽  
Anna V. Fedotova ◽  
...  

Abstract High-throughput sequencing of fetal DNA is a promising and increasingly common method for the discovery of all (or all coding) genetic variants in the fetus, either as part of prenatal screening or diagnosis, or for genetic diagnosis of spontaneous abortions. In many cases, the fetal DNA (from chorionic villi, amniotic fluid, or abortive tissue) can be contaminated with maternal cells, resulting in a mixture of fetal and maternal DNA. This maternal cell contamination (MCC) undermines the assumption, made by traditional variant callers, that each allele in a heterozygous site is covered, on average, by 50% of the reads, and therefore can lead to erroneous genotype calls. We present a panel of methods for reducing the genotyping error in the presence of MCC. All methods start with the output of GATK HaplotypeCaller on the sequencing data for the (contaminated) fetal sample and both of its parents, and additionally rely on information about the MCC fraction (which itself is readily estimated from the high-throughput sequencing data). The first of these methods uses a Bayesian probabilistic model to correct the fetal genotype calls produced by the MCC-unaware HaplotypeCaller. The other two methods “learn” the genotype-correction model from examples. We use simulated contaminated fetal data to train and test the models. Using the test sets, we show that all three methods lead to substantially improved accuracy when compared with the original MCC-unaware HaplotypeCaller calls. We then apply the best-performing method to three chorionic villus samples from spontaneously terminated pregnancies. Code and training data availability: https://github.com/bazykinlab/ML-maternal-cell-contamination
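The way MCC distorts allele fractions, and a toy Bayesian correction in the spirit of the first method, can be sketched as follows. This is a simplified binomial model under a flat prior, not the authors' implementation:

```python
from math import comb

def expected_alt_fraction(g_fetal, g_maternal, c):
    """Expected alt-read fraction at a biallelic site when a fraction c of
    the DNA is maternal (genotypes coded as alt-allele dosage 0, 1, or 2)."""
    return (1 - c) * g_fetal / 2 + c * g_maternal / 2

def fetal_genotype_posterior(alt, ref, g_maternal, c, prior=(1/3, 1/3, 1/3)):
    """Posterior over the fetal genotype from alt/ref read counts, using a
    binomial likelihood centered on the contamination-adjusted alt fraction."""
    n = alt + ref
    eps = 1e-3  # sequencing-error floor so probabilities stay inside (0, 1)
    likes = []
    for g in (0, 1, 2):
        p = min(1 - eps, max(eps, expected_alt_fraction(g, g_maternal, c)))
        likes.append(prior[g] * comb(n, alt) * p**alt * (1 - p)**(n - alt))
    z = sum(likes)
    return [like / z for like in likes]

# 30% contamination with a hom-ref mother: a fetal het site is expected to
# show only 0.5 * (1 - 0.3) = 35% alt reads, which an MCC-unaware caller
# may miscall. The contamination-aware posterior recovers the het call.
post = fetal_genotype_posterior(alt=35, ref=65, g_maternal=0, c=0.3)
print(max(range(3), key=post.__getitem__))  # 1 (het), despite 35% alt reads
```

The key point the abstract makes is visible here: the 50%-alt-reads assumption fails under contamination, but shifting the expected allele fraction by the (estimated) MCC fraction restores correct genotyping.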

