Auto-CORPus: Automated and Consistent Outputs from Research Publications

2021 ◽  
Author(s):  
Yan Hu ◽  
Shujian Sun ◽  
Thomas Rowlands ◽  
Tim Beck ◽  
Joram Matthias Posma

Motivation: The availability of improved natural language processing (NLP) algorithms and models enables researchers to analyse larger corpora using open-source tools. Text mining of biomedical literature is one area in which NLP has been applied in recent years, with large untapped potential. However, corpora must be standardized before they can be analysed with machine learning NLP algorithms. Summarizing data from the literature for storage in databases typically requires manual curation, especially for extracting data from result tables. Results: We present here an automated pipeline that cleans HTML files from the biomedical literature. The output is a single JSON file that contains the text for each section, table data in machine-readable format, and lists of phenotypes and abbreviations found in the article. We analysed a total of 2,441 Open Access articles from PubMed Central, from both Genome-Wide and Metabolome-Wide Association Studies, and developed a model to standardize the section headers based on the Information Artifact Ontology. Extraction of table data was developed on PubMed articles and fine-tuned using the equivalent publisher versions. Availability: The Auto-CORPus package is freely available with detailed instructions from GitHub at https://github.com/jmp111/AutoCORPus/.
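A minimal sketch of consuming such a single-JSON output is shown below; the key names (`sections`, `tables`, `abbreviations`) are illustrative assumptions, not the actual Auto-CORPus schema.

```python
import json

# Hypothetical Auto-CORPus-style output: one JSON object per article with
# section text, machine-readable tables, and detected abbreviations.
# The key names below are assumptions for illustration, not the real schema.
article = {
    "sections": [
        {"heading": "Introduction", "text": "Genome-wide association studies ..."},
        {"heading": "Results", "text": "We identified 12 loci ..."},
    ],
    "tables": [
        {"caption": "Top hits", "rows": [["rs123", "5e-9"], ["rs456", "2e-8"]]},
    ],
    "abbreviations": {"GWAS": "genome-wide association study"},
}

def section_text(doc, heading):
    """Return the text of the first section with the given heading, or None."""
    for sec in doc.get("sections", []):
        if sec["heading"] == heading:
            return sec["text"]
    return None

doc = json.loads(json.dumps(article))  # round-trip as if read from disk
print(section_text(doc, "Results"))
```

Because every article collapses to one predictable JSON document, downstream NLP code can iterate over a corpus without per-publisher HTML handling.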


2017 ◽  
Author(s):  
Shitij Bhargava ◽  
Tsung-Ting Kuo ◽  
Ankit Goyal ◽  
Vincent Kuri ◽  
Gordon Lin ◽  
...  

Background. There is a huge amount of full-text biomedical literature available in public repositories such as PubMed Central (PMC). However, a substantial number of the papers are in Portable Document Format (PDF) and do not provide a plain-text format ready for text mining and natural language processing (NLP). Although many PDF-to-text converters exist, they still face several challenges when processing biomedical PDFs, such as the correct transcription of titles/abstracts, segmenting references/acknowledgements, special characters, jumbling errors (the wrong order of the text), and word boundaries. Methods. In this paper, we present bioPDFX, a novel tool that complements the weaknesses of multiple state-of-the-art methods with their strengths and then applies machine learning to address all the issues above. Results. Experiments on publications of Genome-Wide Association Studies (GWAS) demonstrated that bioPDFX significantly improved the quality of XML compared to a state-of-the-art PDF-to-XML converter, leading to a biomedical database more suitable for text mining. Discussion. Overall, the pipeline developed in this paper makes literature published as PDF files much better suited for text-mining tasks, while slightly improving the overall text quality as well. The service is freely accessible at http://textmining.ucsd.edu:9000. A list of PubMed Central IDs of the 941 articles (see Supplemental File 1) used in this study is available for download at the same URL. Instructions for running the service with a PubMed ID are given in Supplemental File 2.
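The word-boundary issue the authors list can be illustrated with a naive rule-based fix for hyphenated line breaks. The sketch below is a hypothetical simplification, not part of bioPDFX; its failure on legitimate hyphens (e.g. "genome-wide") is exactly the kind of ambiguity that motivates a learned approach.

```python
import re

def dehyphenate(text):
    """Rejoin words split across line breaks by a trailing hyphen,
    a common word-boundary artifact in raw PDF-to-text output.

    Note: this naive rule also strips legitimate hyphens that happen
    to fall at a line break, e.g. "genome-\nwide" -> "genomewide".
    """
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

raw = "associ-\nation studies"
print(dehyphenate(raw))  # -> "association studies"
```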


2019 ◽  
Author(s):  
Charles Tapley Hoyt ◽  
Daniel Domingo-Fernández ◽  
Rana Aldisi ◽  
Lingling Xu ◽  
Kristian Kolpeja ◽  
...  

Abstract The rapid accumulation of new biomedical literature not only causes curated knowledge graphs to become outdated and incomplete, but also makes manual curation an impractical and unsustainable solution. Automated or semi-automated workflows are necessary to assist in prioritizing and curating the literature to update and enrich knowledge graphs. We have developed two workflows: one for re-curating a given knowledge graph to assure its syntactic and semantic quality, and another for rationally enriching it by manually revising automatically extracted relations for nodes with low information density. We applied these workflows to the knowledge graphs encoded in Biological Expression Language from the NeuroMMSig database, using content pre-extracted from MEDLINE abstracts and PubMed Central full-text articles with text mining output integrated by INDRA. We have made this workflow freely available at https://github.com/bel-enrichment/bel-enrichment.
Database URL: https://github.com/bel-enrichment/results


Author(s):  
Antonio Capalbo ◽  
Maurizio Poli ◽  
Antoni Riera-Escamilla ◽  
Vallari Shukla ◽  
Miya Kudo Høffding ◽  
...  

Abstract BACKGROUND Our genetic code is now readable, writable and hackable. The recent escalation of genome-wide sequencing (GS) applications in population diagnostics will not only enable the assessment of risks of transmitting well-defined monogenic disorders at preconceptional stages (i.e. carrier screening), but also facilitate identification of multifactorial genetic predispositions to sub-lethal pathologies, including those affecting reproductive fitness. Through GS, the acquisition and curation of reproductive-related findings will warrant the expansion of genetic assessment to new areas of genomic prediction of reproductive phenotypes, pharmacogenomics and molecular embryology, further boosting our knowledge and therapeutic tools for treating infertility and improving women’s health. OBJECTIVE AND RATIONALE In this article, we review current knowledge and potential development of preconception genome analysis aimed at detecting reproductive and individual health risks (recessive genetic disease and medically actionable secondary findings) as well as anticipating specific reproductive outcomes, particularly in the context of IVF. The extension of reproductive genetic risk assessment to the general population and IVF couples will lead to the identification of couples who carry recessive mutations, as well as sub-lethal conditions prior to conception. This approach will provide increased reproductive autonomy to couples, particularly in those cases where preimplantation genetic testing is an available option to avoid the transmission of undesirable conditions. In addition, GS on prospective infertility patients will enable genome-wide association studies specific for infertility phenotypes such as predisposition to premature ovarian failure, increased risk of aneuploidies, complete oocyte immaturity or blastocyst development failure, thus empowering the development of true reproductive precision medicine. 
SEARCH METHODS Searches of the literature on PubMed Central included combinations of the following MeSH terms: human, genetics, genomics, variants, male, female, fertility, next generation sequencing, genome exome sequencing, expanded carrier screening, secondary findings, pharmacogenomics, controlled ovarian stimulation, preconception, genetics, genome-wide association studies, GWAS. OUTCOMES Through PubMed Central queries, we identified a total of 1409 articles. The full list of articles was assessed for date of publication, limiting the search to studies published within the last 15 years (2004 onwards due to escalating research output of next-generation sequencing studies from that date). The remaining articles’ titles were assessed for pertinence to the topic, leaving a total of 644 articles. The use of preconception GS has the potential to identify inheritable genetic conditions concealed in the genome of around 4% of couples looking to conceive. Genomic information during reproductive age will also be useful to anticipate late-onset medically actionable conditions with strong genetic background in around 2–4% of all individuals. Genetic variants correlated with differential response to pharmaceutical treatment in IVF, and clear genotype–phenotype associations are found for aberrant sperm types, oocyte maturation, fertilization or pre- and post-implantation embryonic development. All currently known capabilities of GS at the preconception stage are reviewed along with persisting and forthcoming barriers for the implementation of precise reproductive medicine. WIDER IMPLICATIONS The expansion of sequencing analysis to additional monogenic and polygenic traits may enable the development of cost-effective preconception tests capable of identifying underlying genetic causes of infertility, which have been defined as ‘unexplained’ until now, thus leading to the development of a true personalized genomic medicine framework in reproductive health.


2021 ◽  
Author(s):  
Dhouha Grissa ◽  
Alexander Junge ◽  
Tudor I Oprea ◽  
Lars Juhl Jensen

The scientific knowledge about which genes are involved in which diseases grows rapidly, which makes it difficult to keep up with new publications and genetics datasets. The DISEASES database aims to provide a comprehensive overview by systematically integrating and assigning confidence scores to evidence for disease–gene associations from curated databases, genome-wide association studies (GWAS), and automatic text mining of the biomedical literature. Here, we present a major update to this resource, which greatly increases the number of associations from all these sources. This is especially true for the text-mined associations, which have increased by at least 9-fold at all confidence cutoffs. We show that this dramatic increase is primarily due to adding full-text articles to the text corpus, secondarily due to improvements to both the disease and gene dictionaries used for named entity recognition, and only to a very small extent due to the growth in number of PubMed abstracts. DISEASES now also makes use of a new GWAS database, TIGA, which considerably increased the number of GWAS-derived disease–gene associations. DISEASES itself is also integrated into several other databases and resources, including GeneCards/MalaCards, Pharos/TCRD, and the Cytoscape stringApp. All data in DISEASES is updated on a weekly basis and is available via a web interface at https://diseases.jensenlab.org, from where it can also be downloaded under open licenses.
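Since the DISEASES download provides confidence-scored disease-gene associations, a typical consumer filters rows by a confidence cutoff. The sketch below is illustrative only; the tab-separated column layout shown is an assumption, not the actual DISEASES file format.

```python
import csv
import io

# Hypothetical excerpt of a DISEASES-style download: tab-separated
# disease-gene associations with a confidence score in the last column.
tsv = """\
ENSG00000139618\tBRCA2\tDOID:1612\tbreast cancer\t4.8
ENSG00000141510\tTP53\tDOID:162\tcancer\t5.0
ENSG00000134982\tAPC\tDOID:0050861\tcolorectal adenoma\t2.1
"""

def filter_by_confidence(stream, cutoff):
    """Yield (gene_symbol, disease_name, score) rows at or above a cutoff."""
    for gene_id, symbol, doid, disease, score in csv.reader(stream, delimiter="\t"):
        if float(score) >= cutoff:
            yield symbol, disease, float(score)

high = list(filter_by_confidence(io.StringIO(tsv), 4.0))
print(high)
```

Applying such a cutoff is how the "9-fold increase at all confidence cutoffs" claim would be measured in practice: count the associations that survive each threshold.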


2018 ◽  
Author(s):  
Lucas Beasley ◽  
Prashanti Manda

Manual curation of scientific literature for ontology-based knowledge representation has proven infeasible and unscalable given the large and growing volume of scientific literature. Automated annotation solutions that leverage text mining and natural language processing (NLP) have been developed to ameliorate the problem of literature curation. These NLP approaches use parsing and syntactic and lexical analysis of text to recognize and annotate pieces of text with ontology concepts. Here, we conduct a comparison of four state-of-the-art NLP tools at the task of recognizing Gene Ontology concepts in biomedical literature, using the Colorado Richly Annotated Full-Text (CRAFT) corpus as a gold-standard reference. We demonstrate the use of semantic similarity metrics to compare NLP tool annotations to the gold standard.
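Set-overlap metrics such as Jaccard similarity are the simplest way to score a tool's annotations against a gold standard; the sketch below is illustrative and deliberately simpler than the ontology-aware semantic similarity metrics used in the comparison itself.

```python
def jaccard(predicted, gold):
    """Jaccard similarity between two sets of ontology concept IDs:
    |intersection| / |union|, ranging from 0 (disjoint) to 1 (identical)."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

# Hypothetical GO annotations for one CRAFT document (IDs are illustrative).
gold = {"GO:0008150", "GO:0003674", "GO:0005575"}
tool = {"GO:0008150", "GO:0003674", "GO:0016020"}
print(jaccard(tool, gold))  # 2 shared / 4 total = 0.5
```

Exact-match overlap penalizes a near-miss (a parent or child GO term) as harshly as a wrong branch of the ontology, which is why semantic similarity metrics that weight by ontological distance are preferred for this evaluation.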


2021 ◽  
Vol 12 (1) ◽  
pp. 154
Author(s):  
Ziheng Zhang ◽  
Feng Han ◽  
Hongjian Zhang ◽  
Tomohiro Aoki ◽  
Katsuhiko Ogasawara

Biomedical terms extracted using Word2vec, the most popular word embedding model in recent years, serve as the foundation for various natural language processing (NLP) applications, such as biomedical information retrieval, relation extraction, and recommendation systems. The objective of this study is to examine how changes in the ratio of the biomedical domain to general domain data in the corpus affect the extraction of similar biomedical terms using Word2vec. We downloaded abstracts of 214,892 articles from PubMed Central (PMC) and the 3.9 GB Billion Word (BW) benchmark corpus from the computer science community. The datasets were preprocessed and grouped into 11 corpora based on the ratio of BW to PMC, ranging from 0:10 to 10:0, and then Word2vec models were trained on these corpora. The cosine similarities between the biomedical terms obtained from the Word2vec models were then compared in each model. The results indicated that the models trained with both BW and PMC data outperformed the model trained only with medical data. The similarity between the biomedical terms extracted by the Word2vec model increased when the ratio of the biomedical domain to general domain data was 3:7 to 5:5. This study allows NLP researchers to apply Word2vec based on more information and increase the similarity of extracted biomedical terms to improve their effectiveness in NLP applications, such as biomedical information extraction.
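The comparison across the 11 corpus ratios hinges on cosine similarity between term vectors. A dependency-free sketch of the measure (not the Word2vec training pipeline itself, and with made-up 4-dimensional vectors standing in for learned embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings for two biomedical terms, standing in for
# the vectors a trained Word2vec model would produce.
aspirin = [0.8, 0.1, 0.3, 0.4]
ibuprofen = [0.7, 0.2, 0.35, 0.5]
print(round(cosine(aspirin, ibuprofen), 3))
```

In the study's setup, this score would be computed for fixed pairs of biomedical terms under each of the 11 models, so that only the BW:PMC training ratio varies between comparisons.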


10.2196/22976 ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. e22976
Author(s):  
Eduardo Rosado ◽  
Miguel Garcia-Remesal ◽  
Sergio Paraiso-Medina ◽  
Alejandro Pazos ◽  
Victor Maojo

Background Currently, existing biomedical literature repositories do not commonly provide users with specific means to locate and remotely access biomedical databases. Objective To address this issue, we developed the Biomedical Database Inventory (BiDI), a repository linking to biomedical databases automatically extracted from the scientific literature. BiDI provides an index of data resources and a path to access them seamlessly. Methods We designed an ensemble of deep learning methods to extract database mentions. To train the system, we annotated a set of 1242 articles that included mentions of database publications. Such a data set was used along with transfer learning techniques to train an ensemble of deep learning natural language processing models targeted at database publication detection. Results The system obtained an F1 score of 0.929 on database detection, showing high precision and recall values. When applying this model to the PubMed and PubMed Central databases, we identified over 10,000 unique databases. The ensemble model also extracted the weblinks to the reported databases and discarded irrelevant links. For the extraction of weblinks, the model achieved a cross-validated F1 score of 0.908. We show two use cases: one related to “omics” and the other related to the COVID-19 pandemic. Conclusions BiDI enables access to biomedical resources over the internet and facilitates data-driven research and other scientific initiatives. The repository is openly available online and will be regularly updated with an automatic text processing pipeline. The approach can be reused to create repositories of different types (ie, biomedical and others).
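Weblink extraction can be sketched with a simple pattern match. This regex baseline is a hypothetical illustration only, far simpler than BiDI's deep learning ensemble, which additionally classifies whether a mention refers to a database publication and discards irrelevant links.

```python
import re

# Match http(s) URLs up to whitespace, a closing paren, or a quote.
URL_RE = re.compile(r"""https?://[^\s)"']+""")

def extract_links(text):
    """Return unique URLs mentioned in an article passage, in order."""
    seen, links = set(), []
    for url in URL_RE.findall(text):
        url = url.rstrip(".,;")  # strip trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

passage = ("The resource is available at https://diseases.jensenlab.org. "
           "Source code: https://github.com/example/tool.")
print(extract_links(passage))
```

A rule like this over-collects (documentation links, publisher pages), which is the gap the reported 0.908 cross-validated F1 link-extraction model is meant to close.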

