Development and validation of phenotype classifiers across multiple sites in the Observational Health Data Sciences and Informatics network

2020 ◽  
Vol 27 (6) ◽  
pp. 877-883 ◽  
Author(s):  
Mehr Kashyap ◽  
Martin Seneviratne ◽  
Juan M Banda ◽  
Thomas Falconer ◽  
Borim Ryu ◽  
...  

Abstract Objective: Accurate electronic phenotyping is essential to support collaborative observational research. Supervised machine learning methods can be used to train phenotype classifiers in a high-throughput manner using imperfectly labeled data. We developed 10 phenotype classifiers using this approach and evaluated performance across multiple sites within the Observational Health Data Sciences and Informatics (OHDSI) network. Materials and Methods: We constructed classifiers using the Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation (APHRODITE) R package, an open-source framework for learning phenotype classifiers from datasets in the Observational Medical Outcomes Partnership Common Data Model. We labeled training data based on the presence of multiple mentions of disease-specific codes. Performance was evaluated on cohorts derived using rule-based definitions and real-world disease prevalence. Classifiers were developed and evaluated across 3 medical centers, including 1 international site. Results: Compared to the multiple-mentions labeling heuristic, classifiers showed a mean recall boost of 0.43 with a mean precision loss of 0.17. Performance decreased slightly when classifiers were shared across medical centers, with mean recall and precision decreasing by 0.08 and 0.01, respectively, at a site within the USA, and by 0.18 and 0.10, respectively, at an international site. Discussion and Conclusion: We demonstrate a high-throughput pipeline for constructing and sharing phenotype classifiers across sites within the OHDSI network using APHRODITE. Classifiers exhibit good portability between sites within the USA but limited portability internationally, indicating that classifier generalizability may have geographic limitations; consequently, sharing the classifier-building recipe, rather than the pretrained classifiers, may be more useful for facilitating collaborative observational research.
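As a rough illustration of the weak-supervision idea described above, the sketch below labels a toy cohort with the "multiple mentions" heuristic and fits a plain logistic-regression classifier on other features. It is not the APHRODITE API; the variable names and simulated counts are assumptions for the example.

```r
## Hedged sketch of "multiple mentions" noisy labeling plus classifier training
## (illustrative only; not the APHRODITE R package's API).

set.seed(42)
n <- 500                                  # toy cohort size

# Simulated per-patient counts: mentions of a disease-specific code plus
# two other record types used as classifier features
dx_code_mentions <- rpois(n, lambda = 1.2)
n_drug_records   <- rpois(n, lambda = 3)
n_lab_records    <- rpois(n, lambda = 5)

# Noisy label: flag a patient as a probable case if the disease-specific
# code appears at least twice (the "multiple mentions" heuristic)
y <- as.integer(dx_code_mentions >= 2)

# Train a simple logistic-regression phenotype classifier on the other features
dat <- data.frame(y, n_drug_records, n_lab_records)
fit <- glm(y ~ n_drug_records + n_lab_records, data = dat, family = binomial())

# Predicted case probabilities, which could then be evaluated against a
# rule-based cohort definition as in the study
head(predict(fit, type = "response"))
```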


2021 ◽  
Author(s):  
Kadri Kunnapuu ◽  
Solomon Ioannou ◽  
Kadri Ligi ◽  
Raivo Kolde ◽  
Sven Laur ◽  
...  

Objective: To develop a framework for identifying prominent clinical event trajectories from OMOP-formatted observational healthcare data. Methods: A four-step framework based on significant temporal event pair detection is described and implemented as an open-source R package. It is applied to a population-based Estonian dataset, first to replicate a large Danish population-based study and second to conduct an example disease-trajectory detection study of Type 2 Diabetes patients in the Estonian and Dutch databases. Results: As a proof of concept, we apply the methods to the Estonian database and provide a detailed breakdown of our findings. All Estonian population-based event pairs are shown. We compare the event pairs identified in Estonia with the Danish and Dutch data and discuss the causes of the differences. Conclusions: For the first time, a complete software package is available for detecting disease trajectories in health data.
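A minimal sketch of the core "significant temporal event pair" step is shown below, using toy data and a binomial directionality test; it is illustrative only and does not reproduce the package's actual four-step implementation.

```r
## Hedged sketch: count how often event A precedes event B across patients
## and test directionality with a binomial test (toy data; not the package API).

events <- data.frame(
  person_id = c(1, 1, 2, 2, 3, 3, 4),
  concept   = c("E11", "I10", "E11", "I10", "I10", "E11", "E11"),
  date      = as.Date(c("2010-01-01", "2012-05-01", "2011-03-01",
                        "2013-07-01", "2010-06-01", "2014-02-01", "2012-01-01"))
)

count_direction <- function(df, a, b) {
  # number of patients in whom the first occurrence of a precedes that of b
  sum(sapply(split(df, df$person_id), function(p) {
    ta <- p$date[p$concept == a]
    tb <- p$date[p$concept == b]
    length(ta) > 0 && length(tb) > 0 && min(ta) < min(tb)
  }))
}

n_ab <- count_direction(events, "E11", "I10")   # E11 before I10
n_ba <- count_direction(events, "I10", "E11")   # I10 before E11

# Is the E11 -> I10 direction significantly more common than the reverse?
binom.test(n_ab, n_ab + n_ba, p = 0.5)
```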


2011 ◽  
Vol 9 (2) ◽  
pp. 99
Author(s):  
Alex J Auseon ◽  
Albert J Kolibash ◽  

Background: Educating trainees during cardiology fellowship is a process in constant evolution, with program directors regularly adapting to increasing demands and regulations as they strive to prepare graduates for practice in today’s healthcare environment. Methods and Results: In a 10-year follow-up to a previous manuscript regarding fellowship education, we reviewed the literature regarding the most topical issues facing training programs in 2010, describing our approach at The Ohio State University. Conclusion: In the midst of challenges posed by the increasing complexity of training requirements and documentation, work hour restrictions, and the new definitions of quality and safety, we propose methods of curricula revision and collaboration that may serve as an example to other medical centers.


Healthcare ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. 441
Author(s):  
Md. Mohaimenul Islam ◽  
Tahmina Nasrin Poly ◽  
Belal Alsinglawi ◽  
Li-Fong Lin ◽  
Shuo-Chen Chien ◽  
...  

The application of artificial intelligence (AI) to health has increased, including to COVID-19. This study aimed to provide a clear overview of COVID-19-related AI publication trends using longitudinal bibliometric analysis. A systematic literature search was conducted on the Web of Science for English-language peer-reviewed articles related to the application of AI to COVID-19. A search strategy was developed to collect relevant articles and to extract bibliographic information (e.g., country, research area, sources, and author). VOSviewer (Leiden University) and Bibliometrix (R package) were used to visualize the co-occurrence networks of authors, sources, countries, institutions, global collaborations, citations, co-citations, and keywords. We included 729 research articles on the application of AI to COVID-19 published between 2020 and 2021. PLOS One (33/729, 4.52%), Chaos, Solitons & Fractals (29/729, 3.97%), and the Journal of Medical Internet Research (29/729, 3.97%) were the most common journals publishing these articles. China (190/729, 26.06%), the USA (173/729, 23.73%), and India (92/729, 12.62%) were the most prolific countries of origin. The Huazhong University of Science and Technology, Wuhan University, and the Chinese Academy of Sciences were the most productive institutions. This is the first study to show a comprehensive picture of the global efforts to address COVID-19 using AI. The findings also provide insights and research directions for academic researchers, policymakers, and healthcare practitioners who wish to collaborate in these domains in the future.
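The keyword co-occurrence networks mentioned above boil down to a symmetric term-by-term count matrix; the sketch below builds one from toy keyword lists in base R. It is illustrative only and is not the VOSviewer or Bibliometrix workflow.

```r
## Hedged sketch: build a keyword co-occurrence matrix of the kind visualized
## by VOSviewer/Bibliometrix (toy keyword lists; not either tool's API).

keywords <- list(
  c("covid-19", "deep learning", "ct imaging"),
  c("covid-19", "machine learning"),
  c("machine learning", "deep learning", "covid-19")
)

terms <- sort(unique(unlist(keywords)))
cooc  <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))

for (kw in keywords) {
  idx <- match(kw, terms)
  cooc[idx, idx] <- cooc[idx, idx] + 1   # increment all pairwise counts
}
diag(cooc) <- 0                          # drop self-co-occurrence
cooc                                     # this matrix feeds a network plot
```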


Author(s):  
Wenbin Ye ◽  
Tao Liu ◽  
Hongjuan Fu ◽  
Congting Ye ◽  
Guoli Ji ◽  
...  

Abstract Motivation: Alternative polyadenylation (APA) has been widely recognized as a widespread, dynamically modulated mechanism. Studies based on 3′ end sequencing and/or RNA-seq have profiled poly(A) sites in various species with diverse pipelines, yet no unified and easy-to-use toolkit is available for comprehensive APA analyses. Results: We developed an R package called movAPA for modeling and visualization of the dynamics of alternative polyadenylation across biological samples. movAPA incorporates rich functions for preprocessing, annotation, and statistical analysis of poly(A) sites, identification of poly(A) signals, profiling of APA dynamics, and visualization. In particular, seven metrics are provided for measuring the tissue-specificity or usage of APA sites across samples. Three methods are used for identifying 3′ UTR shortening/lengthening events between conditions. APA site switching involving non-3′ UTR polyadenylation can also be explored. Using poly(A) site data from rice and mouse sperm cells, we demonstrate the high scalability and flexibility of movAPA in profiling APA dynamics across tissues and single cells. Availability and implementation: https://github.com/BMILAB/movAPA. Supplementary information: Supplementary data are available at Bioinformatics online.
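As one concrete example of a usage/tissue-specificity metric of the kind described above, the sketch below computes a Shannon-entropy index of a poly(A) site's usage across samples. The metric choice and the toy counts are assumptions for illustration, not movAPA's implementation or its exact metric definitions.

```r
## Hedged sketch: Shannon entropy of a poly(A) site's usage across samples.
## Low entropy = usage concentrated in few tissues; high entropy = broad usage.
## (Illustrative only; not movAPA's API.)

usage <- c(root = 120, leaf = 5, seed = 3, embryo = 2)   # toy read counts

shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))                      # entropy in bits
}

shannon_entropy(usage)                   # near 0 => highly tissue-specific
```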


2020 ◽  
Vol 79 (Suppl 1) ◽  
pp. 177.2-178
Author(s):  
E. Burn ◽  
L. Kearsley-Fleet ◽  
K. Hyrich ◽  
M. Schaefer ◽  
D. Huschek ◽  
...  

Background: The Observational Medical Outcomes Partnership (OMOP) common data model (CDM) provides a framework for standardising health data. Objectives: To map national biologic registry data collected from different European countries to the OMOP CDM. Methods: Five biologic registries are currently being mapped to the OMOP CDM: 1) the Czech biologics register (ATTRA), 2) Registro Español de Acontecimientos Adversos de Terapias Biológicas en Enfermedades Reumáticas (BIOBADASER), 3) the British Society for Rheumatology Biologics Register for Rheumatoid Arthritis (BSRBR-RA), 4) the German biologics register ‘Rheumatoid arthritis observation of biologic therapy’ (RABBIT), and 5) the Swiss register ‘Swiss Clinical Quality Management in Rheumatic Diseases’ (SCQM). Data collected at baseline are being mapped first. Details that uniquely identify individuals are mapped to the person table, with the observation_period table defining the time during which a person may have had clinical events recorded. Baseline comorbidities are mapped to the condition_occurrence CDM table, while baseline medications are mapped to the drug_exposure CDM table. This mapping is summarised in Figure 1 (overview of the initial mapping). Results: A total of 64,901 individuals are included in the 5 registries being mapped to the OMOP CDM (see Table 1). The number of unique baseline conditions being mapped ranges from 17 in BSRBR-RA to 108 in RABBIT, while the number of baseline medications ranges from 26 in ATTRA to 802 in BSRBR-RA. The registries that captured more comorbidities or medications generally allowed these to be entered as free text.

Table 1. Summary of initial code mapping
Registry     Individuals   Mapped baseline conditions   Mapped baseline medications
ATTRA        5,326         26                           26
BIOBADASER   6,496         30                           51
BSRBR-RA     21,695        17                           802
RABBIT       13,062        108                          78
SCQM         18,322        26                           33

Conclusion: Due to differences in study design and data capture, the baseline information captured on comorbidities and drugs varies greatly across registries. However, these data have been mapped, and mapping biologic registry data to the OMOP CDM is feasible. The adoption of the OMOP CDM will facilitate collaboration across registries and allow for multi-database studies that include data from both biologic registries and other sources of health data mapped to the CDM. Disclosure of Interests: Edward Burn: None declared, Lianne Kearsley-Fleet: None declared, Kimme Hyrich Grant/research support from: Pfizer, UCB, BMS, Speakers bureau: Abbvie, Martin Schaefer: None declared, Doreen Huschek: None declared, Anja Strangfeld Speakers bureau: AbbVie, BMS, Pfizer, Roche, Sanofi-Aventis, Jakub Zavada Speakers bureau: Abbvie, UCB, Sanofi, Elli-Lilly, Novartis, Zentiva, Accord, Markéta Lagová: None declared, Delphine Courvoisier: None declared, Christoph Tellenbach: None declared, Kim Lauper: None declared, Carlos Sánchez-Piedra: None declared, Nuria Montero: None declared, Jesús-Tomás Sanchez-Costa: None declared, Daniel Prieto-Alhambra Grant/research support from: Professor Prieto-Alhambra has received research grants from AMGEN, UCB Biopharma and Les Laboratoires Servier, Consultant of: DPA’s department has received fees for consultancy services from UCB Biopharma, Speakers bureau: DPA’s department has received fees for speaker and advisory board membership services from Amgen
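To make the mapping step concrete, here is a hedged sketch of how a registry's baseline comorbidity records might be reshaped into OMOP condition_occurrence rows. The source column names, person IDs, and the type concept ID are assumptions for illustration; only the standard concept IDs for hypertension and type 2 diabetes are real OMOP concepts, and a real ETL would follow the site's own schema and vocabulary mapping.

```r
## Hedged sketch: map registry baseline comorbidities to OMOP condition_occurrence.
## Source column names are hypothetical; not any registry's actual ETL.

baseline <- data.frame(
  patient_code = c("A01", "A02"),
  comorbidity  = c("hypertension", "type 2 diabetes"),
  visit_date   = as.Date(c("2015-03-01", "2016-07-15"))
)

# Lookup from source terms to standard OMOP concept IDs
# (320128 = essential hypertension, 201826 = type 2 diabetes mellitus)
concept_map <- c("hypertension" = 320128L, "type 2 diabetes" = 201826L)

condition_occurrence <- data.frame(
  person_id                 = seq_len(nrow(baseline)),          # toy person IDs
  condition_concept_id      = unname(concept_map[baseline$comorbidity]),
  condition_start_date      = baseline$visit_date,
  condition_type_concept_id = 32879L    # assumed "Registry" type concept; verify
)
condition_occurrence
```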


2021 ◽  
Vol 13 (3) ◽  
pp. 368
Author(s):  
Christopher A. Ramezan ◽  
Timothy A. Warner ◽  
Aaron E. Maxwell ◽  
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes. NEU, however, required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variations in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
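The core experimental design, retraining a classifier on progressively smaller training samples and recording overall accuracy, can be sketched in a few lines. The example below uses k-NN on R's built-in iris data purely as a stand-in for the remote-sensing classifiers, image-object features, and training sets described above.

```r
## Hedged sketch of the sample-size experiment: shrink the training set and
## track overall accuracy (k-NN on iris as a stand-in for the GEOBIA setup).

library(class)   # provides knn(); ships with standard R installations
set.seed(1)

test_idx <- sample(nrow(iris), 50)        # fixed hold-out set
test     <- iris[test_idx, ]
pool     <- iris[-test_idx, ]

for (n in c(100, 60, 30, 15)) {
  tr   <- pool[sample(nrow(pool), n), ]
  pred <- knn(train = tr[, 1:4], test = test[, 1:4], cl = tr$Species, k = 3)
  cat("training size", n, "-> overall accuracy",
      round(mean(pred == test$Species), 2), "\n")
}
```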


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xue Lin ◽  
Yingying Hua ◽  
Shuanglin Gu ◽  
Li Lv ◽  
Xingyu Li ◽  
...  

Abstract Background: Localized hypermutation regions have been found in cancer genomes and reported to be related to the prognosis of cancers. Such localized hypermutation differs from the usual somatic mutations in frequency of occurrence and genomic density; it resembles a “violent storm” of mutations, which is just what the Greek word “kataegis” means. Results: There is a need for a lightweight and simple-to-use toolkit to identify and visualize localized hypermutation regions in the genome. We therefore developed the R package “kataegis” to meet these needs. The package uses only three steps to identify genomic hypermutation regions: i) read in the variation files in standard formats; ii) calculate the inter-mutational distances; and iii) identify the hypermutation regions with appropriate parameters. A final step visualizes the nucleotide contents and spectra of both the foci and flanking regions, and the genomic landscape of these regions. Conclusions: The kataegis package is available on Bioconductor/GitHub (https://github.com/flosalbizziae/kataegis), providing a lightweight and simple-to-use toolkit for quickly identifying and visualizing genomic hypermutation regions.
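The inter-mutational-distance idea at the heart of this workflow can be illustrated directly. The sketch below is not the kataegis package's API; the 1 kb gap cutoff, the run-length threshold, and the simulated positions are arbitrary example parameters.

```r
## Hedged sketch: compute inter-mutational distances on one chromosome and flag
## runs of closely spaced mutations as candidate hypermutation foci.
## (Illustrative only; thresholds and data are invented, not package defaults.)

set.seed(7)
pos <- sort(c(sample(1:5e7, 200),                     # toy background mutations
              12e6 + cumsum(sample(50:300, 15))))     # one dense cluster

imd   <- diff(pos)                                    # inter-mutational distances
close <- imd < 1000                                   # gaps below a 1 kb cutoff

runs   <- rle(close)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
foci   <- which(runs$values & runs$lengths >= 5)      # >= 5 close gaps in a row

for (f in foci)
  cat("candidate focus:", pos[starts[f]], "-", pos[ends[f] + 1],
      "spanning", runs$lengths[f] + 1, "mutations\n")
```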


Author(s):  
George Hripcsak ◽  
Martijn J. Schuemie ◽  
David Madigan ◽  
Patrick B. Ryan ◽  
Marc A. Suchard

Summary Objective: The current observational research literature shows extensive publication bias and contradiction. The Observational Health Data Sciences and Informatics (OHDSI) initiative seeks to improve research reproducibility through open science. Methods: OHDSI has created an international federated data source of electronic health records and administrative claims that covers nearly 10% of the world’s population. Using a common data model with a practical schema and extensive vocabulary mappings, data from around the world follow the same format. OHDSI’s research methods emphasize reproducibility, with a large-scale approach to addressing confounding using propensity score adjustment with extensive diagnostics; negative and positive control hypotheses to test for residual systematic error; a variety of data sources to assess consistency and generalizability; a completely open approach including protocol, software, models, parameters, and raw results so that studies can be externally verified; and the study of many hypotheses in parallel so that the operating characteristics of the methods can be assessed. Results: OHDSI has already produced findings in areas like hypertension treatment that are being incorporated into practice, and it has produced rigorous studies of COVID-19 that have aided government agencies in their treatment decisions, that have characterized the disease extensively, that have estimated the comparative effects of treatments, and that predict the likelihood of advancing to serious complications. Conclusions: OHDSI practices open science and incorporates a series of methods to address reproducibility. It has produced important results in several areas, including hypertension therapy and COVID-19 research.
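As a toy illustration of the propensity-score adjustment mentioned above, the sketch below estimates scores with plain logistic regression and stratifies on quintiles. OHDSI's actual tooling (e.g., large-scale regularized propensity models in its CohortMethod package) is considerably more elaborate, so this is a conceptual sketch on simulated data only.

```r
## Hedged sketch of propensity-score stratification (toy simulated data;
## not OHDSI's CohortMethod implementation).

set.seed(3)
n  <- 2000
x1 <- rnorm(n)                                   # continuous confounder
x2 <- rbinom(n, 1, 0.3)                          # binary confounder
treat <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))
y     <- rbinom(n, 1, plogis(-2.0 + 0.4 * treat + 0.7 * x1))

# Propensity score: probability of treatment given the confounders
ps <- glm(treat ~ x1 + x2, family = binomial())$fitted.values

# Stratify on propensity-score quintiles and estimate the adjusted effect
strata <- cut(ps, quantile(ps, 0:5 / 5), include.lowest = TRUE)
fit    <- glm(y ~ treat + strata, family = binomial())
summary(fit)$coefficients["treat", ]             # adjusted log-odds ratio
```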


Land ◽  
2018 ◽  
Vol 7 (4) ◽  
pp. 118 ◽  
Author(s):  
Myroslava Lesiv ◽  
Linda See ◽  
Juan Laso Bayas ◽  
Tobias Sturn ◽  
Dmitry Schepaschenko ◽  
...  

Very high resolution (VHR) satellite imagery from Google Earth and Microsoft Bing Maps is increasingly being used in a variety of applications from computer sciences to arts and humanities. In the field of remote sensing, one use of this imagery is to create reference data sets through visual interpretation, e.g., to complement existing training data or to aid in the validation of land-cover products. Through new applications such as Collect Earth, this imagery is also being used for monitoring purposes in the form of statistical surveys obtained through visual interpretation. However, little is known about where VHR satellite imagery exists globally or the dates of the imagery. Here we present a global overview of the spatial and temporal distribution of VHR satellite imagery in Google Earth and Microsoft Bing Maps. The results show an uneven availability globally, with biases in certain areas such as the USA, Europe and India, and with clear discontinuities at political borders. We also show that the availability of VHR imagery is currently not adequate for monitoring protected areas and deforestation, but is better suited for monitoring changes in cropland or urban areas using visual interpretation.

