pmparser and PMDB: resources for large-scale, open studies of the biomedical literature

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11071
Author(s):  
Joshua L. Schoenbachler ◽  
Jacob J. Hughey

PubMed is an invaluable resource for the biomedical community. Although PubMed is freely available, the existing API is not designed for large-scale analyses and the XML structure of the underlying data is inconvenient for complex queries. We developed an R package called pmparser to convert the data in PubMed to a relational database. Our implementation of the database, called PMDB, currently contains data on over 31 million PubMed Identifiers (PMIDs) and is updated regularly. Together, pmparser and PMDB can enable large-scale, reproducible, and transparent analyses of the biomedical literature. pmparser is licensed under GPL-2 and available at https://pmparser.hugheylab.org. PMDB is available in both PostgreSQL (DOI 10.5281/zenodo.4008109) and Google BigQuery (https://console.cloud.google.com/bigquery?project=pmdb-bq&d=pmdb).
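
A minimal sketch of how PMDB might be queried once the PostgreSQL dump has been restored, using the DBI and RPostgres packages; the database, table and column names below (pmdb, article, pub_year) are placeholders for illustration, not the documented PMDB schema.

```r
# Sketch: query a local PostgreSQL restore of PMDB via DBI/RPostgres.
# Table/column names here are assumptions; consult the PMDB schema for the real ones.
library(DBI)

con <- dbConnect(
  RPostgres::Postgres(),
  dbname   = "pmdb",                      # assumed name of the restored database
  host     = "localhost",
  user     = Sys.getenv("PGUSER"),
  password = Sys.getenv("PGPASSWORD"))

# Count PMIDs per publication year (hypothetical table and column names)
yearly <- dbGetQuery(con, "
  SELECT pub_year, COUNT(*) AS n_pmids
  FROM article
  GROUP BY pub_year
  ORDER BY pub_year")

dbDisconnect(con)
```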

2020 ◽  
Author(s):  
Joshua L. Schoenbachler ◽  
Jacob J. Hughey

Abstract PubMed is an invaluable resource for the biomedical community. Although PubMed is freely available, the existing API is not designed for large-scale analyses and the XML structure of the underlying data is inconvenient for complex queries. We developed an R package called pmparser to convert the data in PubMed to a relational database. Our implementation of the database, called PMDB, currently contains data on over 31 million PubMed Identifiers (PMIDs) and is updated regularly. Together, pmparser and PMDB can enable large-scale, reproducible, and transparent analyses of the biomedical literature. pmparser is licensed under GPL-2 and available at https://pmparser.hugheylab.org. PMDB is stored in PostgreSQL and compressed dumps are available on Zenodo (https://doi.org/10.5281/zenodo.4008109).


2015 ◽  
Vol 23 (3) ◽  
pp. 617-626 ◽  
Author(s):  
Nophar Geifman ◽  
Sanchita Bhattacharya ◽  
Atul J Butte

Abstract Objective Cytokines play a central role in both health and disease, modulating immune responses and acting as diagnostic markers and therapeutic targets. This work takes a systems-level approach to integrating and examining immune patterns, such as cytokine gene expression, together with information from the biomedical literature, and applies it in the context of disease, with the objective of identifying potentially useful relationships and areas for future research. Results We present herein the integration and analysis of immune-related knowledge, namely information derived from the biomedical literature and from gene expression arrays. Cytokine-disease associations were captured from over 2.4 million PubMed records, in the form of Medical Subject Headings (MeSH) descriptor co-occurrences, as well as from gene expression arrays. Clustering of cytokine-disease co-occurrences from the biomedical literature is shown to reflect current medical knowledge as well as potentially novel relationships between diseases. A correlation analysis of cytokine gene expression in a variety of diseases revealed compelling relationships. Finally, a novel analysis comparing cytokine gene expression in different diseases with the parallel associations captured from the biomedical literature was used to examine which associations warrant further investigation. Discussion We demonstrate the usefulness of capturing Medical Subject Headings descriptor co-occurrences from biomedical publications in the generation of valid and potentially useful hypotheses. Furthermore, integrating and comparing descriptor co-occurrences with gene expression data was shown to be useful in detecting new, potentially fruitful, and unaddressed areas of research. Conclusion Using integrated large-scale data captured from the scientific literature and experimental data, a better understanding of the immune mechanisms underlying disease can be achieved and applied to research.
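
The descriptor co-occurrence idea can be conveyed with a short, hypothetical sketch (not the authors' code): given per-PMID MeSH annotations, count how often cytokine descriptors and disease descriptors are assigned to the same record. The data frame `mesh` and the example term lists are assumptions.

```r
# Illustrative sketch: count cytokine-disease MeSH co-occurrences per PMID.
# `mesh` is assumed to be a data frame with columns pmid and descriptor.
library(dplyr)

cytokines <- c("Interleukin-6", "Tumor Necrosis Factor-alpha")   # example terms
diseases  <- c("Arthritis, Rheumatoid", "Crohn Disease")         # example terms

cooc <- mesh %>%
  filter(descriptor %in% cytokines) %>%
  inner_join(filter(mesh, descriptor %in% diseases),
             by = "pmid", suffix = c("_cytokine", "_disease")) %>%
  count(descriptor_cytokine, descriptor_disease, name = "n_pmids")
```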


2020 ◽  
Author(s):  
Jenna Marie Reps ◽  
Ross Williams ◽  
Seng Chan You ◽  
Thomas Falconer ◽  
Evan Minty ◽  
...  

Abstract Objective: To demonstrate how the Observational Health Data Sciences and Informatics (OHDSI) collaborative network and data standardization can be used to scale up external validation of patient-level prediction models by enabling validation across a large number of heterogeneous observational healthcare datasets. Materials & Methods: Five previously published prognostic models (ATRIA, CHADS2, CHA2DS2-VASc, Q-Stroke and Framingham) that predict future risk of stroke in patients with atrial fibrillation were replicated using the OHDSI frameworks. A network study was run that enabled the five models to be externally validated across nine observational healthcare datasets spanning three countries and five independent sites. Results: The five existing models were integrated into the OHDSI patient-level prediction framework and obtained mean c-statistics ranging from 0.57 to 0.63 across the six databases with sufficient data to predict stroke within 1 year of initial atrial fibrillation diagnosis in females with atrial fibrillation, comparable with existing validation studies. The validation network study was run across the nine datasets within 60 days once the models were replicated. An R package for the study was published at https://github.com/OHDSI/StudyProtocolSandbox/tree/master/ExistingStrokeRiskExternalValidation. Discussion: This study demonstrates the ability to scale up external validation of patient-level prediction models using a collaboration of researchers and a data standardization that enables models to be readily shared across data sites. External validation is necessary to understand the transportability or reproducibility of a prediction model, but without collaborative approaches it can take three or more years for a model to be validated by a single independent researcher. Conclusion: In this paper we show it is possible to both scale up and speed up external validation by demonstrating how validation can be done across multiple databases in less than 2 months. We recommend that researchers developing new prediction models use the OHDSI network to externally validate their models.
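
As a rough illustration of what externally validating one of these scores involves (not the OHDSI PatientLevelPrediction code used in the study), the sketch below computes CHADS2 on an assumed local cohort data frame and measures discrimination as a c-statistic (AUC) with the pROC package; all column names are placeholders.

```r
# Sketch: external validation of an existing score (CHADS2) on a local cohort.
# `cohort` is an assumed data frame with binary covariates, an age column and a
# 1-year stroke outcome; the real study ran against OMOP CDM databases.
library(pROC)

cohort$chads2 <- with(cohort,
  chf + hypertension + (age >= 75) + diabetes + 2 * prior_stroke_tia)

# Discrimination: the c-statistic is the AUC of the score against the outcome
roc_obj <- roc(cohort$stroke_1yr, cohort$chads2)
auc(roc_obj)
```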


2021 ◽  
Author(s):  
Gastón Mauro Díaz

1) Hemispherical photography (HP) is a long-standing tool for forest canopy characterization. Low-cost fisheye lenses can now convert smartphones into highly portable HP equipment; however, they cannot be used under all conditions because HP is sensitive to illumination, and obtaining sound results outside diffuse-light conditions would require a deep-learning-based system that has yet to be developed. A ready-to-use alternative is the multiscale color-based binarization algorithm, but it provides moderate-quality results only for open forests. To overcome this limitation, I propose coupling it with the model-based local thresholding algorithm; I call this coupling the MBCB approach. 2) The methods presented here are part of the R package CAnopy IMage ANalysis (caiman), which I am developing. The accuracy of the new MBCB approach was assessed with data from a pine plantation and a broadleaf native forest. 3) The coefficient of determination (R^2) was greater than 0.7 and the root mean square error (RMSE) was lower than 20% for plant area index calculation. 4) Results suggest that the new MBCB approach allows the calculation of unbiased canopy metrics from smartphone-based HP acquired under sunlight conditions, even for closed canopies. This facilitates large-scale and opportunistic sampling with hemispherical photography.
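
As a simplified illustration of the kind of calculation involved (not the caiman/MBCB implementation), the sketch below inverts the Beer-Lambert gap-fraction model near the 57.5° zenith ring to obtain an effective plant area index from a binarized image; `bin` and `zenith` are assumed inputs.

```r
# Simplified illustration: effective plant area index from a binarized
# hemispherical photo. `bin` is assumed to be a logical matrix (TRUE = sky) and
# `zenith` a matrix of per-pixel zenith angles in degrees.
ring <- zenith >= 55 & zenith <= 60          # ring centred near 57.5 degrees
gap_fraction <- mean(bin[ring])

theta <- 57.5 * pi / 180
G <- 0.5                                     # assumed projection coefficient
pai_effective <- -cos(theta) * log(gap_fraction) / G
pai_effective
```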


2019 ◽  
Author(s):  
Alvin Vista

Cheating detection is an important issue in standardized testing, especially in large-scale settings. Statistical approaches are often computationally intensive and require specialised software to conduct. We present a two-stage approach that quickly filters suspected groups using statistical testing on an IRT-based answer-copying index. We also present an approach to mitigate data contamination and improve the performance of the index. The computation of the index was implemented through a modified version of an open-source R package, thus enabling wider access to the method. Using data from PIRLS 2011 (N=64,232), we conducted a simulation to demonstrate our approach. Type I error was well controlled and no control group was falsely flagged for cheating, while 16 (combined n=12,569) of the 18 (combined n=14,149) simulated groups were detected. Implications for system-level cheating detection and further improvements of the approach are discussed.
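
The flavour of an answer-copying test can be conveyed with a deliberately crude sketch (not the IRT-based index used in the study): count how often a suspected copier's incorrect responses match another examinee's responses and compare that count against a simple binomial null; all inputs are assumed.

```r
# Toy answer-copying check (not the paper's IRT-based index): does a pair match
# on more of the copier's incorrect items than chance would suggest?
# `resp_a`, `resp_b` are assumed response vectors; `key` is the answer key.
copying_pvalue <- function(resp_a, resp_b, key, n_options = 4) {
  wrong_a <- which(resp_a != key)                    # items examinee A got wrong
  matches <- sum(resp_a[wrong_a] == resp_b[wrong_a]) # identical responses on those items
  p_match <- 1 / (n_options - 1)                     # crude uniform-distractor null
  binom.test(matches, length(wrong_a), p = p_match,
             alternative = "greater")$p.value
}
```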


2020 ◽  
Author(s):  
Atilio O. Rausch ◽  
Maria I. Freiberger ◽  
Cesar O. Leonetti ◽  
Diego M. Luna ◽  
Leandro G. Radusky ◽  
...  

Once folded, natural protein molecules have few energetic conflicts within their polypeptide chains. Many protein structures do, however, contain regions where energetic conflicts remain after folding, i.e. they have highly frustrated regions. These regions, kept in place over evolutionary and physiological timescales, are related to several functional aspects of natural proteins such as protein-protein interactions, small-ligand recognition, catalytic sites and allostery. Here we present FrustratometeR, an R package that easily computes local energetic frustration on a personal computer or a cluster. This package facilitates large-scale analysis of local frustration, point mutants and MD trajectories, allowing straightforward integration of local frustration analysis into pipelines for protein structural analysis. Availability and implementation: https://github.com/proteinphysiologylab/frustratometeR
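
A hedged sketch of how the package might be called on a single structure; the entry point and argument names below (calculate_frustration, PdbFile, Mode, ResultsDir) are recalled from the package documentation and may not match the current API exactly.

```r
# Sketch (verify against the FrustratometeR docs): compute configurational
# frustration for one PDB file and write results to a directory.
library(frustratometeR)

res <- calculate_frustration(
  PdbFile    = "1n0r.pdb",             # example structure file
  Mode       = "configurational",      # one of the available frustration indices
  ResultsDir = "frustration_results/")
```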


Author(s):  
Jenna Marie Reps ◽  
Ross D Williams ◽  
Seng Chan You ◽  
Thomas Falconer ◽  
Evan Minty ◽  
...  

Abstract Background: To demonstrate how the Observational Health Data Sciences and Informatics (OHDSI) collaborative network and data standardization can be used to scale up external validation of patient-level prediction models by enabling validation across a large number of heterogeneous observational healthcare datasets. Methods: Five previously published prognostic models (ATRIA, CHADS2, CHA2DS2-VASc, Q-Stroke and Framingham) that predict future risk of stroke in patients with atrial fibrillation were replicated using the OHDSI frameworks. A network study was run that enabled the five models to be externally validated across nine observational healthcare datasets spanning three countries and five independent sites. Results: The five existing models were integrated into the OHDSI patient-level prediction framework and obtained mean c-statistics ranging from 0.57 to 0.63 across the six databases with sufficient data to predict stroke within 1 year of initial atrial fibrillation diagnosis in females with atrial fibrillation, comparable with existing validation studies. The validation network study was run across the nine datasets within 60 days once the models were replicated. An R package for the study was published at https://github.com/OHDSI/StudyProtocolSandbox/tree/master/ExistingStrokeRiskExternalValidation. Conclusion: This study demonstrates the ability to scale up external validation of patient-level prediction models using a collaboration of researchers and a data standardization that enables models to be readily shared across data sites. External validation is necessary to understand the transportability or reproducibility of a prediction model, but without collaborative approaches it can take three or more years for a model to be validated by a single independent researcher. In this paper we show it is possible to both scale up and speed up external validation by demonstrating how validation can be done across multiple databases in less than 2 months. We recommend that researchers developing new prediction models use the OHDSI network to externally validate their models.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
B. V. Binoy ◽  
M. A. Naseer ◽  
P. P. Anil Kumar ◽  
Nina Lazar

Purpose Real estate valuation studies gained popularity with the availability of large-scale property transaction data in the latter part of the twentieth century. Hedonic price modeling (HPM) was the most popular method in the early years until it was overtaken by advanced modeling methods in the twenty-first century. Although a few literature reviews exist on this topic, no comprehensive bibliometric analysis has been conducted in this area. To gain a better understanding of the dynamics of property valuation studies, this paper aims to conduct a bibliometric analysis. Design/methodology/approach A comprehensive search of the Scopus database, followed by detailed screening, resulted in 1,400 articles. The identified research articles, spanning over five decades (1964–2019), are analyzed using the open-source R package “bibliometrix.” Findings The study found the USA to be the most productive country in various respects, such as the number of publications, the number of authors and publication hotspots. The findings also present assessments of publication trends, journals, citations, keywords, and co-citation and collaboration networks. An upsurge in the number of publications was observed after the year 2000, owing to improved data availability and better modeling techniques. Research limitations/implications This study is significant for understanding the major research areas and modeling techniques used in property valuation. Future studies can incorporate multiple database sources and include more articles. Originality/value The current study is one of the first bibliometric studies on property valuation. Previous studies have not explored the possibilities of geographic information systems in bibliometric research. Spatial mapping and analysis of publications provide a geographical perspective on valuation research.
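
A minimal bibliometrix workflow of the kind described above might look like the sketch below; the Scopus export file name is a placeholder.

```r
# Sketch: load a Scopus export and run the standard bibliometrix summaries.
library(bibliometrix)

M <- convert2df("scopus_export.bib", dbsource = "scopus", format = "bibtex")
results <- biblioAnalysis(M)
summary(results, k = 10)     # top sources, authors, countries, keywords
plot(results, k = 10)
```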


Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Tao Chen ◽  
Mingfen Wu ◽  
Hexi Li

Abstract The automatic extraction of meaningful relations from biomedical literature or clinical records is crucial in various biomedical applications. Most current deep learning approaches for medical relation extraction require large-scale training data to prevent overfitting of the training model. We propose using a pre-trained model and a fine-tuning technique to improve these approaches without additional time-consuming human labeling. First, we describe the architecture of Bidirectional Encoder Representations from Transformers (BERT), an approach for pre-training a model on large-scale unstructured text. We then combine BERT with a one-dimensional convolutional neural network (1d-CNN) to fine-tune the pre-trained model for relation extraction. Extensive experiments on three datasets, namely the BioCreative V chemical disease relation corpus, the traditional Chinese medicine literature corpus and the i2b2 2012 temporal relation challenge corpus, show that the proposed approach achieves state-of-the-art results (relative improvements in F1 score of 22.2%, 7.77% and 38.5%, respectively, over a traditional 1d-CNN classifier). The source code is available at https://github.com/chentao1999/MedicalRelationExtraction.
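
To make the architecture concrete, here is a hedged sketch (in R with keras, not the authors' Python code) of a 1d-CNN classification head applied to pre-computed BERT token embeddings; the sequence length, hidden size and number of relation classes are assumptions.

```r
# Sketch: 1d-CNN head over BERT token embeddings (shape: 128 tokens x 768 dims).
library(keras)

n_relation_classes <- 5                      # placeholder number of relation types

model <- keras_model_sequential() %>%
  layer_conv_1d(filters = 128, kernel_size = 3, activation = "relu",
                input_shape = c(128, 768)) %>%
  layer_global_max_pooling_1d() %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = n_relation_classes, activation = "softmax")

model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")
```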

