bioPDFX: preparing PDF scientific articles for biomedical text mining

2017 ◽  
Author(s):  
Shitij Bhargava ◽  
Tsung-Ting Kuo ◽  
Ankit Goyal ◽  
Vincent Kuri ◽  
Gordon Lin ◽  
...  

Background. A huge amount of full-text biomedical literature is available in public repositories such as PubMed Central (PMC). However, a substantial number of the papers are in Portable Document Format (PDF) and do not provide a plain-text format ready for text mining and natural language processing (NLP). Although many PDF-to-text converters exist, they still face several challenges when processing biomedical PDFs, such as correctly transcribing titles and abstracts, segmenting references and acknowledgements, handling special characters, avoiding jumbling errors (wrong text order), and detecting word boundaries. Methods. In this paper, we present bioPDFX, a novel tool that offsets the weaknesses of multiple state-of-the-art methods with their combined strengths and then applies machine learning methods to address all of the issues above. Results. Experimental results on publications of Genome Wide Association Studies (GWAS) demonstrated that bioPDFX significantly improved the quality of XML compared with a state-of-the-art PDF-to-XML converter, leading to a biomedical database more suitable for text mining. Discussion. Overall, the pipeline developed in this paper makes published literature in the form of PDF files much better suited for text mining tasks, while also slightly improving the overall text quality. The service is freely accessible at http://textmining.ucsd.edu:9000 . A list of PubMed Central IDs of the 941 articles used in this study (see Supplemental File 1) is available for download at the same URL. Instructions for running the service with a PubMed ID are described in Supplemental File 2.


Author(s):  
Guia Guffanti ◽  
Milissa L. Kaufman ◽  
Lauren A. M. Lebois ◽  
Kerry J. Ressler

Post-traumatic stress disorder (PTSD) is a debilitating psychiatric disorder with an estimated genetic component accounting for 30%–40% of the variance contributing to risk for the disease. This chapter starts with a review of the biological hypotheses and related genetic mechanisms currently proposed to be associated with PTSD and trauma-related disorders. It then describes the state of the art in methodologies and their application to mapping genetic loci and identifying biomarkers associated with PTSD. Finally, it reviews the latest results from genome-wide association studies of genetic variants as well as those derived from the emerging fields of epigenetics and gene expression.


2020 ◽  
Author(s):  
Matteo Sesia ◽  
Stephen Bates ◽  
Emmanuel Candès ◽  
Jonathan Marchini ◽  
Chiara Sabatti

This paper proposes a novel statistical method to address population structure in genome-wide association studies while controlling the false discovery rate, which overcomes some limitations of existing approaches. Our solution accounts for linkage disequilibrium and diverse ancestries by combining conditional testing via knockoffs with hidden Markov models from state-of-the-art phasing methods. Furthermore, we account for familial relatedness by describing the joint distribution of haplotypes sharing long identical-by-descent segments with a generalized hidden Markov model. Extensive simulations affirm the validity of this method, while applications to UK Biobank phenotypes yield many more discoveries compared to BOLT-LMM, most of which are confirmed by the Japan Biobank and FinnGen data.


2014 ◽  
Author(s):  
Sune Pletscher-Frankild ◽  
Albert Pallejà ◽  
Kalliopi Tsafou ◽  
Janos X Binder ◽  
Lars Juhl Jensen

Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease–gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease–gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a user-friendly web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download.
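
The abstract above describes dictionary-based tagging combined with a score that counts co-occurrences within and between sentences, but it does not give the scoring formula. The Python sketch below only illustrates that general idea and is not the DISEASES implementation: the toy dictionaries, the two weights, and the function names `tag` and `score_pairs` are all assumptions made for this example.

```python
import re
from collections import defaultdict
from itertools import product

# Toy dictionaries (assumed); a real tagger would use large curated name lists.
GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "Li-Fraumeni syndrome"}

# Hypothetical weights: sharing a sentence counts more than sharing an abstract.
WEIGHT_SAME_SENTENCE = 1.0
WEIGHT_SAME_ABSTRACT = 0.5


def tag(text, dictionary):
    """Return the dictionary terms that occur in text as whole words."""
    return {
        term
        for term in dictionary
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
    }


def score_pairs(abstracts):
    """Accumulate a weighted co-occurrence score for each (gene, disease) pair."""
    scores = defaultdict(float)
    for abstract in abstracts:
        # Naive sentence split on terminal punctuation followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", abstract)
        tagged = [(tag(s, GENES), tag(s, DISEASES)) for s in sentences]
        same_sentence = {
            pair for genes, diseases in tagged for pair in product(genes, diseases)
        }
        all_genes = set().union(*(genes for genes, _ in tagged))
        all_diseases = set().union(*(diseases for _, diseases in tagged))
        for pair in product(all_genes, all_diseases):
            if pair in same_sentence:
                scores[pair] += WEIGHT_SAME_SENTENCE
            else:
                scores[pair] += WEIGHT_SAME_ABSTRACT
    return dict(scores)


example_abstracts = [
    "BRCA1 mutations raise the risk of breast cancer. TP53 was also sequenced.",
    "TP53 is mutated in Li-Fraumeni syndrome.",
]
scores = score_pairs(example_abstracts)
for pair, value in sorted(scores.items()):
    print(pair, value)
```

In this toy input, BRCA1 and breast cancer share a sentence and receive the full weight, while TP53 and breast cancer only share an abstract and receive the smaller one. A production system would also need to normalize scores across the corpus, which this sketch leaves out.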


2021 ◽  
Author(s):  
Yan Hu ◽  
Shujian Sun ◽  
Thomas Rowlands ◽  
Tim Beck ◽  
Joram Matthias Posma

Motivation: The availability of improved natural language processing (NLP) algorithms and models enables researchers to analyze larger corpora using open source tools. Text mining of biomedical literature is one area in which NLP has been used in recent years and which still has large untapped potential. However, to generate corpora that can be analyzed using machine learning NLP algorithms, these corpora need to be standardized. Summarizing data from literature to be stored in databases typically requires manual curation, especially for extracting data from result tables. Results: We present here an automated pipeline that cleans HTML files from biomedical literature. The output is a single JSON file that contains the text for each section, table data in machine-readable format and lists of phenotypes and abbreviations found in the article. We analyzed a total of 2,441 Open Access articles from PubMed Central, from both Genome-Wide and Metabolome-Wide Association Studies, and developed a model to standardize the section headers based on the Information Artifact Ontology. Extraction of table data was developed on PubMed articles and fine-tuned using the equivalent publisher versions. Availability: The Auto-CORPus package is freely available with detailed instructions from GitHub at https://github.com/jmp111/AutoCORPus/.


Circulation ◽  
2008 ◽  
Vol 118 (suppl_18) ◽  
Author(s):  
Joshua C Denny ◽  
Marylyn D Ritchie ◽  
Dana C Crawford ◽  
Andrea Havens ◽  
Justin Weiner ◽  
...  

Background: Genome-wide association studies, largely in research populations, have identified susceptibility single-nucleotide polymorphisms (SNPs) for a broad range of human diseases, including variants at 4q25 associated with atrial fibrillation (AF). However, no studies have evaluated the applicability of these data to practice-based settings. Methods: This study was conducted in the Vanderbilt DNA Databank, a repository that accrues 500–900 new samples/week from routine outpatient blood draws, and included 37,335 samples as of June 2, 2008. The Databank is linked to a de-identified derivative of the electronic medical record (EMR), which includes data for the last 15 years on 1.4 million subjects. We used natural language processing techniques and billing code queries to extract AF cases and controls without AF from the first 10,000 subjects entering the Databank. Cases had AF recorded in the cardiologist report of an electrocardiogram (ECG). Controls had at least one ECG and no AF, other abnormal atrial rhythms, or atrioventricular nodal ablation in any portion of the EMR, including text documents, billing codes, and ECGs. We excluded subjects with heart transplants and those of non-Caucasian ethnicity. Subjects were genotyped at rs2200733 and rs10033464, both located at 4q25, previously associated with AF with odds ratios (ORs) of 1.75 and 1.42, respectively. Results: We identified 168 cases with AF and 1695 controls. The electronic algorithms had an accuracy of 98% for identifying cases and 100% for controls over a random sample of 100 subjects each. The minor allele frequencies (MAFs) for rs2200733 were 0.1419 for cases and 0.1032 for controls; the MAFs for rs10033464 were 0.1019 for cases and 0.0908 for controls. rs2200733 was significantly associated with AF (OR [95% confidence interval], 1.44 [1.01–2.03], p=0.04). The effect of rs10033464 on AF was not significant (OR, 1.14 [0.78–1.67], p=0.52); however, power calculations indicate that 993 cases with AF would be needed to replicate this effect. Conclusion: This practice-based study replicated an association identified in research datasets between a 4q25 SNP and AF. These findings support the utility of Electronic Medical Records coupled to DNA collections as resources for genomic research.
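
As a rough consistency check on the reported effect size, an allelic odds ratio can be approximated from the case and control minor allele frequencies alone. The Python sketch below is not the authors' analysis, since the abstract does not state which model produced the reported OR; the function name `allelic_odds_ratio` is an assumption made for this example, and only the two rs2200733 frequencies are taken from the abstract.

```python
def allelic_odds_ratio(maf_cases: float, maf_controls: float) -> float:
    """Odds of the minor allele in cases divided by the odds in controls."""
    odds_cases = maf_cases / (1.0 - maf_cases)
    odds_controls = maf_controls / (1.0 - maf_controls)
    return odds_cases / odds_controls


# Minor allele frequencies for rs2200733 as reported in the abstract.
or_rs2200733 = allelic_odds_ratio(0.1419, 0.1032)
print(round(or_rs2200733, 2))  # prints 1.44
```

The result agrees with the reported OR of 1.44 for rs2200733. This shortcut ignores genotype counts and any covariates, so it cannot reproduce the confidence interval or the p-value.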


2014 ◽  
Vol 33 (3) ◽  
pp. 5 ◽  
Author(s):  
Leslie A. Williams ◽  
Lynne M Fox ◽  
Christophe Roeder ◽  
Lawrence Hunter

This case study examines strategies used to leverage the library’s existing journal licenses to obtain a large collection of full-text journal articles in extensible markup language (XML) format; the right to text mine the collection; and the right to use the collection and the data mined from it for grant-funded research to develop biomedical natural language processing (BNLP) tools. Researchers attempted to obtain content directly from PubMed Central (PMC). This attempt failed due to limits on use of content in PMC. Next, researchers and their library liaison attempted to obtain content from contacts in the technical divisions of the publishing industry. This resulted in an incomplete research data set. Then, researchers, the library liaison, and the acquisitions librarian collaborated with the sales and technical staff of a major science, technology, engineering, and medical (STEM) publisher to successfully create a method for obtaining XML content as an extension of the library’s typical acquisition process for electronic resources. Our experience led us to realize that text mining rights of full-text articles in XML format should routinely be included in the negotiation of the library’s licenses.


2021 ◽  
Author(s):  
David Froelicher ◽  
Juan R. Troncoso-Pastoriza ◽  
Jean Louis Raisaro ◽  
Michel A. Cuendet ◽  
Joao Sa Sousa ◽  
...  

In biomedical research, real-world evidence, which is emerging as an indispensable complement of clinical trials, relies on access to large quantities of patient data that typically reside at separate healthcare institutions. Conventional approaches for centralizing those data are often not feasible due to privacy and security requirements. As a result, more privacy-friendly solutions based on federated analytics are emerging. They enable the simultaneous analysis of medical data distributed across a group of connected institutions. However, these techniques do not inherently protect patients’ privacy as they require institutions to share intermediate results that can reveal patient-level information. To address this issue, state-of-the-art solutions use additional privacy-preserving measures based on data obfuscation, which often introduce noise into the computation and can make the final result too inaccurate for precision medicine use cases. We propose FAMHE, a modular system based on multiparty homomorphic encryption, that enables the privacy-preserving execution of federated analytics workflows yielding exact results and without leaking any intermediate information. To demonstrate the maturity of our approach, we reproduce the results of two published state-of-the-art centralized biomedical studies, and we demonstrate that FAMHE enables the efficient, privacy-preserving and decentralized execution of analyses that range from low computational complexity, such as Kaplan-Meier overall survival curves used in oncology, to high computational complexity, such as genome-wide association studies on millions of variants.


Author(s):  
Antonio Capalbo ◽  
Maurizio Poli ◽  
Antoni Riera-Escamilla ◽  
Vallari Shukla ◽  
Miya Kudo Høffding ◽  
...  

BACKGROUND: Our genetic code is now readable, writable and hackable. The recent escalation of genome-wide sequencing (GS) applications in population diagnostics will not only enable the assessment of risks of transmitting well-defined monogenic disorders at preconceptional stages (i.e. carrier screening), but also facilitate identification of multifactorial genetic predispositions to sub-lethal pathologies, including those affecting reproductive fitness. Through GS, the acquisition and curation of reproductive-related findings will warrant the expansion of genetic assessment to new areas of genomic prediction of reproductive phenotypes, pharmacogenomics and molecular embryology, further boosting our knowledge and therapeutic tools for treating infertility and improving women’s health.
OBJECTIVE AND RATIONALE: In this article, we review current knowledge and potential development of preconception genome analysis aimed at detecting reproductive and individual health risks (recessive genetic disease and medically actionable secondary findings) as well as anticipating specific reproductive outcomes, particularly in the context of IVF. The extension of reproductive genetic risk assessment to the general population and IVF couples will lead to the identification of couples who carry recessive mutations, as well as sub-lethal conditions prior to conception. This approach will provide increased reproductive autonomy to couples, particularly in those cases where preimplantation genetic testing is an available option to avoid the transmission of undesirable conditions. In addition, GS on prospective infertility patients will enable genome-wide association studies specific for infertility phenotypes such as predisposition to premature ovarian failure, increased risk of aneuploidies, complete oocyte immaturity or blastocyst development failure, thus empowering the development of true reproductive precision medicine.
SEARCH METHODS: Searches of the literature on PubMed Central included combinations of the following MeSH terms: human, genetics, genomics, variants, male, female, fertility, next generation sequencing, genome exome sequencing, expanded carrier screening, secondary findings, pharmacogenomics, controlled ovarian stimulation, preconception, genetics, genome-wide association studies, GWAS.
OUTCOMES: Through PubMed Central queries, we identified a total of 1409 articles. The full list of articles was assessed for date of publication, limiting the search to studies published within the last 15 years (2004 onwards due to escalating research output of next-generation sequencing studies from that date). The remaining articles’ titles were assessed for pertinence to the topic, leaving a total of 644 articles. The use of preconception GS has the potential to identify inheritable genetic conditions concealed in the genome of around 4% of couples looking to conceive. Genomic information during reproductive age will also be useful to anticipate late-onset medically actionable conditions with strong genetic background in around 2–4% of all individuals. Genetic variants correlated with differential response to pharmaceutical treatment in IVF, and clear genotype–phenotype associations are found for aberrant sperm types, oocyte maturation, fertilization or pre- and post-implantation embryonic development. All currently known capabilities of GS at the preconception stage are reviewed along with persisting and forthcoming barriers for the implementation of precise reproductive medicine.
WIDER IMPLICATIONS: The expansion of sequencing analysis to additional monogenic and polygenic traits may enable the development of cost-effective preconception tests capable of identifying underlying genetic causes of infertility, which have been defined as ‘unexplained’ until now, thus leading to the development of a true personalized genomic medicine framework in reproductive health.

