pmparser and PMDB: resources for large-scale, open studies of the biomedical literature

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11071
Author(s):  
Joshua L. Schoenbachler ◽  
Jacob J. Hughey

PubMed is an invaluable resource for the biomedical community. Although PubMed is freely available, the existing API is not designed for large-scale analyses and the XML structure of the underlying data is inconvenient for complex queries. We developed an R package called pmparser to convert the data in PubMed to a relational database. Our implementation of the database, called PMDB, currently contains data on over 31 million PubMed Identifiers (PMIDs) and is updated regularly. Together, pmparser and PMDB can enable large-scale, reproducible, and transparent analyses of the biomedical literature. pmparser is licensed under GPL-2 and available at https://pmparser.hugheylab.org. PMDB is available in both PostgreSQL (DOI 10.5281/zenodo.4008109) and Google BigQuery (https://console.cloud.google.com/bigquery?project=pmdb-bq&d=pmdb).
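
A minimal sketch of how PMDB might be queried once the PostgreSQL dump has been restored, using the DBI and RPostgres packages; the database, table and column names below (pmdb, article, pub_year) are placeholders for illustration, not the documented PMDB schema.

```r
# Sketch: query a local PostgreSQL restore of PMDB via DBI/RPostgres.
# Table/column names here are assumptions; consult the PMDB schema for the real ones.
library(DBI)

con <- dbConnect(
  RPostgres::Postgres(),
  dbname   = "pmdb",                      # assumed name of the restored database
  host     = "localhost",
  user     = Sys.getenv("PGUSER"),
  password = Sys.getenv("PGPASSWORD"))

# Count PMIDs per publication year (hypothetical table and column names)
yearly <- dbGetQuery(con, "
  SELECT pub_year, COUNT(*) AS n_pmids
  FROM article
  GROUP BY pub_year
  ORDER BY pub_year")

dbDisconnect(con)
```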

2020 ◽  
Author(s):  
Joshua L. Schoenbachler ◽  
Jacob J. Hughey

Abstract PubMed is an invaluable resource for the biomedical community. Although PubMed is freely available, the existing API is not designed for large-scale analyses and the XML structure of the underlying data is inconvenient for complex queries. We developed an R package called pmparser to convert the data in PubMed to a relational database. Our implementation of the database, called PMDB, currently contains data on over 31 million PubMed Identifiers (PMIDs) and is updated regularly. Together, pmparser and PMDB can enable large-scale, reproducible, and transparent analyses of the biomedical literature. pmparser is licensed under GPL-2 and available at https://pmparser.hugheylab.org. PMDB is stored in PostgreSQL and compressed dumps are available on Zenodo (https://doi.org/10.5281/zenodo.4008109).


2015 ◽  
Vol 23 (3) ◽  
pp. 617-626 ◽  
Author(s):  
Nophar Geifman ◽  
Sanchita Bhattacharya ◽  
Atul J Butte

Abstract Objective Cytokines play a central role in both health and disease, modulating immune responses and acting as diagnostic markers and therapeutic targets. This work takes a systems-level approach to integrating and examining immune patterns, such as cytokine gene expression, together with information from the biomedical literature, and applies it in the context of disease, with the objective of identifying potentially useful relationships and areas for future research. Results We present herein the integration and analysis of immune-related knowledge, namely information derived from the biomedical literature and from gene expression arrays. Cytokine-disease associations were captured from over 2.4 million PubMed records, in the form of Medical Subject Headings (MeSH) descriptor co-occurrences, as well as from gene expression arrays. Clustering of cytokine-disease co-occurrences from the biomedical literature is shown to reflect current medical knowledge as well as potentially novel relationships between diseases. A correlation analysis of cytokine gene expression in a variety of diseases revealed compelling relationships. Finally, a novel analysis comparing cytokine gene expression in different diseases with the parallel associations captured from the biomedical literature was used to examine which associations warrant further investigation. Discussion We demonstrate the usefulness of capturing Medical Subject Headings descriptor co-occurrences from biomedical publications in the generation of valid and potentially useful hypotheses. Furthermore, integrating and comparing descriptor co-occurrences with gene expression data was shown to be useful in detecting new, potentially fruitful, and unaddressed areas of research. Conclusion Using integrated large-scale data captured from the scientific literature and experimental data, a better understanding of the immune mechanisms underlying disease can be achieved and applied to research.
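
The descriptor co-occurrence idea can be conveyed with a short, hypothetical sketch (not the authors' code): given per-PMID MeSH annotations, count how often cytokine descriptors and disease descriptors are assigned to the same record. The data frame `mesh` and the example term lists are assumptions.

```r
# Illustrative sketch: count cytokine-disease MeSH co-occurrences per PMID.
# `mesh` is assumed to be a data frame with columns pmid and descriptor.
library(dplyr)

cytokines <- c("Interleukin-6", "Tumor Necrosis Factor-alpha")   # example terms
diseases  <- c("Arthritis, Rheumatoid", "Crohn Disease")         # example terms

cooc <- mesh %>%
  filter(descriptor %in% cytokines) %>%
  inner_join(filter(mesh, descriptor %in% diseases),
             by = "pmid", suffix = c("_cytokine", "_disease")) %>%
  count(descriptor_cytokine, descriptor_disease, name = "n_pmids")
```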


2020 ◽  
Author(s):  
Jenna Marie Reps ◽  
Ross Williams ◽  
Seng Chan You ◽  
Thomas Falconer ◽  
Evan Minty ◽  
...  

Abstract Objective: To demonstrate how the Observational Health Data Sciences and Informatics (OHDSI) collaborative network and data standardization can be used to scale up external validation of patient-level prediction models by enabling validation across a large number of heterogeneous observational healthcare datasets. Materials & Methods: Five previously published prognostic models (ATRIA, CHADS2, CHA2DS2-VASc, Q-Stroke and Framingham) that predict future risk of stroke in patients with atrial fibrillation were replicated using the OHDSI frameworks. A network study was run that enabled the five models to be externally validated across nine observational healthcare datasets spanning three countries and five independent sites. Results: The five existing models were integrated into the OHDSI patient-level prediction framework and obtained mean c-statistics ranging from 0.57 to 0.63 across the six databases with sufficient data to predict stroke within 1 year of initial atrial fibrillation diagnosis in females with atrial fibrillation, comparable with existing validation studies. The validation network study was run across the nine datasets within 60 days once the models were replicated. An R package for the study was published at https://github.com/OHDSI/StudyProtocolSandbox/tree/master/ExistingStrokeRiskExternalValidation. Discussion: This study demonstrates the ability to scale up external validation of patient-level prediction models using a collaboration of researchers and a data standardization that enables models to be readily shared across data sites. External validation is necessary to understand the transportability or reproducibility of a prediction model, but without collaborative approaches it can take three or more years for a model to be validated by a single independent researcher. Conclusion: In this paper we show it is possible to both scale up and speed up external validation by demonstrating how validation can be done across multiple databases in less than 2 months. We recommend that researchers developing new prediction models use the OHDSI network to externally validate their models.
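
As a rough illustration of what externally validating one of these scores involves (not the OHDSI PatientLevelPrediction code used in the study), the sketch below computes CHADS2 on an assumed local cohort data frame and measures discrimination as a c-statistic (AUC) with the pROC package; all column names are placeholders.

```r
# Sketch: external validation of an existing score (CHADS2) on a local cohort.
# `cohort` is an assumed data frame with binary covariates, an age column and a
# 1-year stroke outcome; the real study ran against OMOP CDM databases.
library(pROC)

cohort$chads2 <- with(cohort,
  chf + hypertension + (age >= 75) + diabetes + 2 * prior_stroke_tia)

# Discrimination: the c-statistic is the AUC of the score against the outcome
roc_obj <- roc(cohort$stroke_1yr, cohort$chads2)
auc(roc_obj)
```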


2021 ◽  
Author(s):  
Gastón Mauro Díaz

1) Hemispherical photography (HP) is a long-standing tool for forest canopy characterization. Low-cost fisheye lenses can now convert smartphones into highly portable HP equipment; however, they cannot be used under all conditions because HP is sensitive to illumination, and obtaining sound results outside diffuse-light conditions would require a deep-learning-based system that has yet to be developed. A ready-to-use alternative is the multiscale color-based binarization algorithm, but it provides moderate-quality results only for open forests. To overcome this limitation, I propose coupling it with the model-based local thresholding algorithm; I call this coupling the MBCB approach. 2) The methods presented here are part of the R package CAnopy IMage ANalysis (caiman), which I am developing. The accuracy of the new MBCB approach was assessed with data from a pine plantation and a broadleaf native forest. 3) The coefficient of determination (R^2) was greater than 0.7 and the root mean square error (RMSE) was lower than 20% for plant area index calculation. 4) Results suggest that the new MBCB approach allows the calculation of unbiased canopy metrics from smartphone-based HP acquired under sunlight conditions, even for closed canopies. This facilitates large-scale and opportunistic sampling with hemispherical photography.
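
As a simplified illustration of the kind of calculation involved (not the caiman/MBCB implementation), the sketch below inverts the Beer-Lambert gap-fraction model near the 57.5° zenith ring to obtain an effective plant area index from a binarized image; `bin` and `zenith` are assumed inputs.

```r
# Simplified illustration: effective plant area index from a binarized
# hemispherical photo. `bin` is assumed to be a logical matrix (TRUE = sky) and
# `zenith` a matrix of per-pixel zenith angles in degrees.
ring <- zenith >= 55 & zenith <= 60          # ring centred near 57.5 degrees
gap_fraction <- mean(bin[ring])

theta <- 57.5 * pi / 180
G <- 0.5                                     # assumed projection coefficient
pai_effective <- -cos(theta) * log(gap_fraction) / G
pai_effective
```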


2019 ◽  
Author(s):  
Alvin Vista

Cheating detection is an important issue in standardized testing, especially in large-scale settings. Statistical approaches are often computationally intensive and require specialised software to conduct. We present a two-stage approach that quickly filters suspected groups using statistical testing on an IRT-based answer-copying index. We also present an approach to mitigate data contamination and improve the performance of the index. The computation of the index was implemented through a modified version of an open-source R package, thus enabling wider access to the method. Using data from PIRLS 2011 (N=64,232), we conducted a simulation to demonstrate our approach. Type I error was well controlled and no control group was falsely flagged for cheating, while 16 (combined n=12,569) of the 18 (combined n=14,149) simulated groups were detected. Implications for system-level cheating detection and further improvements of the approach are discussed.
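
The flavour of an answer-copying test can be conveyed with a deliberately crude sketch (not the IRT-based index used in the study): count how often a suspected copier's incorrect responses match another examinee's responses and compare that count against a simple binomial null; all inputs are assumed.

```r
# Toy answer-copying check (not the paper's IRT-based index): does a pair match
# on more of the copier's incorrect items than chance would suggest?
# `resp_a`, `resp_b` are assumed response vectors; `key` is the answer key.
copying_pvalue <- function(resp_a, resp_b, key, n_options = 4) {
  wrong_a <- which(resp_a != key)                    # items examinee A got wrong
  matches <- sum(resp_a[wrong_a] == resp_b[wrong_a]) # identical responses on those items
  p_match <- 1 / (n_options - 1)                     # crude uniform-distractor null
  binom.test(matches, length(wrong_a), p = p_match,
             alternative = "greater")$p.value
}
```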


2020 ◽  
Author(s):  
Atilio O. Rausch ◽  
Maria I. Freiberger ◽  
Cesar O. Leonetti ◽  
Diego M. Luna ◽  
Leandro G. Radusky ◽  
...  

Once folded, natural protein molecules have few energetic conflicts within their polypeptide chains. Many protein structures do, however, contain regions where energetic conflicts remain after folding, i.e. they have highly frustrated regions. These regions, kept in place over evolutionary and physiological timescales, are related to several functional aspects of natural proteins such as protein-protein interactions, small-ligand recognition, catalytic sites and allostery. Here we present FrustratometeR, an R package that easily computes local energetic frustration on a personal computer or a cluster. This package facilitates large-scale analysis of local frustration, point mutants and MD trajectories, allowing straightforward integration of local frustration analysis into pipelines for protein structural analysis. Availability and implementation: https://github.com/proteinphysiologylab/frustratometeR
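
A hedged sketch of how the package might be called on a single structure; the entry point and argument names below (calculate_frustration, PdbFile, Mode, ResultsDir) are recalled from the package documentation and may not match the current API exactly.

```r
# Sketch (verify against the FrustratometeR docs): compute configurational
# frustration for one PDB file and write results to a directory.
library(frustratometeR)

res <- calculate_frustration(
  PdbFile    = "1n0r.pdb",             # example structure file
  Mode       = "configurational",      # one of the available frustration indices
  ResultsDir = "frustration_results/")
```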


Author(s):  
Jenna Marie Reps ◽  
Ross D Williams ◽  
Seng Chan You ◽  
Thomas Falconer ◽  
Evan Minty ◽  
...  

Abstract Background: To demonstrate how the Observational Health Data Sciences and Informatics (OHDSI) collaborative network and data standardization can be used to scale up external validation of patient-level prediction models by enabling validation across a large number of heterogeneous observational healthcare datasets. Methods: Five previously published prognostic models (ATRIA, CHADS2, CHA2DS2-VASc, Q-Stroke and Framingham) that predict future risk of stroke in patients with atrial fibrillation were replicated using the OHDSI frameworks. A network study was run that enabled the five models to be externally validated across nine observational healthcare datasets spanning three countries and five independent sites. Results: The five existing models were integrated into the OHDSI patient-level prediction framework and obtained mean c-statistics ranging from 0.57 to 0.63 across the six databases with sufficient data to predict stroke within 1 year of initial atrial fibrillation diagnosis in females with atrial fibrillation, comparable with existing validation studies. The validation network study was run across the nine datasets within 60 days once the models were replicated. An R package for the study was published at https://github.com/OHDSI/StudyProtocolSandbox/tree/master/ExistingStrokeRiskExternalValidation. Conclusion: This study demonstrates the ability to scale up external validation of patient-level prediction models using a collaboration of researchers and a data standardization that enables models to be readily shared across data sites. External validation is necessary to understand the transportability or reproducibility of a prediction model, but without collaborative approaches it can take three or more years for a model to be validated by a single independent researcher. In this paper we show it is possible to both scale up and speed up external validation by demonstrating how validation can be done across multiple databases in less than 2 months. We recommend that researchers developing new prediction models use the OHDSI network to externally validate their models.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
B. V. Binoy ◽  
M. A. Naseer ◽  
P. P. Anil Kumar ◽  
Nina Lazar

Purpose Real estate valuation studies gained popularity with the availability of large-scale property transaction data in the latter part of the twentieth century. Hedonic price modeling (HPM) was the most popular method in the early years until it was overtaken by advanced modeling methods in the twenty-first century. Although a few literature reviews exist on this topic, no comprehensive bibliometric analysis has been conducted in this area. To gain a better understanding of the dynamics of property valuation studies, this paper aims to conduct a bibliometric analysis. Design/methodology/approach A comprehensive search of the Scopus database, followed by detailed screening, resulted in 1,400 articles. The identified research articles, spanning over five decades (1964–2019), are analyzed using the open-source R package “bibliometrix.” Findings The study found the USA to be the most productive country in various respects, such as the number of publications, the number of authors and publication hotspots. The findings also present assessments of publication trends, journals, citations, keywords, and co-citation and collaboration networks. An upsurge in the number of publications was observed after the year 2000, owing to improved data availability and better modeling techniques. Research limitations/implications This study is significant for understanding the major research areas and modeling techniques used in property valuation. Future studies can incorporate multiple database sources and include more articles. Originality/value The current study is one of the first bibliometric studies on property valuation. Previous studies have not explored the possibilities of geographic information systems in bibliometric research. Spatial mapping and analysis of publications provide a geographical perspective on valuation research.
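
A minimal bibliometrix workflow of the kind described above might look like the sketch below; the Scopus export file name is a placeholder.

```r
# Sketch: load a Scopus export and run the standard bibliometrix summaries.
library(bibliometrix)

M <- convert2df("scopus_export.bib", dbsource = "scopus", format = "bibtex")
results <- biblioAnalysis(M)
summary(results, k = 10)     # top sources, authors, countries, keywords
plot(results, k = 10)
```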


Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Tao Chen ◽  
Mingfen Wu ◽  
Hexi Li

Abstract The automatic extraction of meaningful relations from biomedical literature or clinical records is crucial in various biomedical applications. Most current deep learning approaches for medical relation extraction require large-scale training data to prevent overfitting of the training model. We propose using a pre-trained model and a fine-tuning technique to improve these approaches without additional time-consuming human labeling. First, we describe the architecture of Bidirectional Encoder Representations from Transformers (BERT), an approach for pre-training a model on large-scale unstructured text. We then combine BERT with a one-dimensional convolutional neural network (1d-CNN) to fine-tune the pre-trained model for relation extraction. Extensive experiments on three datasets, namely the BioCreative V chemical disease relation corpus, the traditional Chinese medicine literature corpus and the i2b2 2012 temporal relation challenge corpus, show that the proposed approach achieves state-of-the-art results (relative improvements in F1 score of 22.2%, 7.77% and 38.5%, respectively, over a traditional 1d-CNN classifier). The source code is available at https://github.com/chentao1999/MedicalRelationExtraction.
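
To make the architecture concrete, here is a hedged sketch (in R with keras, not the authors' Python code) of a 1d-CNN classification head applied to pre-computed BERT token embeddings; the sequence length, hidden size and number of relation classes are assumptions.

```r
# Sketch: 1d-CNN head over BERT token embeddings (shape: 128 tokens x 768 dims).
library(keras)

n_relation_classes <- 5                      # placeholder number of relation types

model <- keras_model_sequential() %>%
  layer_conv_1d(filters = 128, kernel_size = 3, activation = "relu",
                input_shape = c(128, 768)) %>%
  layer_global_max_pooling_1d() %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = n_relation_classes, activation = "softmax")

model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")
```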

