Text-mining clinically relevant cancer biomarkers for curation into the CIViC database

2019 ◽  
Vol 11 (1) ◽  
Author(s):  
Jake Lever ◽  
Martin R. Jones ◽  
Arpad M. Danos ◽  
Kilannin Krysiak ◽  
Melika Bonakdar ◽  
...  

Abstract. Background: Precision oncology involves analysis of individual cancer samples to understand the genes and pathways involved in the development and progression of a cancer. To improve patient care, knowledge of diagnostic, prognostic, predisposing, and drug response markers is essential. Several knowledgebases have been created by different groups to collate evidence for these associations. These include the open-access Clinical Interpretation of Variants in Cancer (CIViC) knowledgebase. These databases rely on time-consuming manual curation by skilled experts who read and interpret the relevant biomedical literature. Methods: To aid in this curation and provide the greatest coverage for these databases, particularly CIViC, we propose the use of text mining approaches to extract these clinically relevant biomarkers from all available published literature. To this end, a group of cancer genomics experts annotated sentences that discussed biomarkers with their clinical associations and achieved good inter-annotator agreement. We then used a supervised learning approach to construct the CIViCmine knowledgebase. Results: We extracted 121,589 relevant sentences from PubMed abstracts and PubMed Central Open Access full-text papers. CIViCmine contains over 87,412 biomarkers associated with 8035 genes, 337 drugs, and 572 cancer types, representing 25,818 abstracts and 39,795 full-text publications. Conclusions: Through integration with CIViC, we provide a prioritized list of curatable clinically relevant cancer biomarkers as well as a resource that is valuable to other knowledgebases and precision cancer analysts in general. All data are publicly available and distributed with a Creative Commons Zero license. The CIViCmine knowledgebase is available at http://bionlp.bcgsc.ca/civicmine/.

2018 ◽  
Author(s):  
Jake Lever ◽  
Martin R Jones ◽  
Arpad M Danos ◽  
Kilannin Krysiak ◽  
Melika Bonakdar ◽  
...  

Precision oncology involves analysis of individual cancer samples to understand the genes and pathways involved in the development and progression of a cancer. To improve patient care, knowledge of diagnostic, prognostic, predisposing and drug response markers is essential. Several knowledgebases have been created by different groups to collate evidence for these associations. These include the open-access Clinical Interpretation of Variants in Cancer (CIViC) knowledgebase. These databases rely on time-consuming manual curation by skilled experts who read and interpret the relevant biomedical literature. To aid in this curation and provide the greatest coverage for these databases, particularly CIViC, we propose the use of text mining approaches to extract these clinically relevant biomarkers from all available published literature. To this end, a group of cancer genomics experts annotated biomarkers and their clinical associations discussed in 800 sentences and achieved good inter-annotator agreement. We then used a supervised learning approach to construct the CIViCmine knowledgebase (http://bionlp.bcgsc.ca/civicmine/), extracting 128,857 relevant sentences from PubMed abstracts and PubMed Central Open Access full-text papers. CIViCmine contains over 90,992 biomarkers associated with 7,866 genes, 402 drugs and 557 cancer types, representing 29,153 abstracts and 40,551 full-text publications. Through integration with CIViC, we provide a prioritised list of curatable biomarkers as well as a resource that is valuable to other knowledgebases and precision cancer analysts in general.


2019 ◽  
Author(s):  
Charles Tapley Hoyt ◽  
Daniel Domingo-Fernández ◽  
Rana Aldisi ◽  
Lingling Xu ◽  
Kristian Kolpeja ◽  
...  

Abstract. The rapid accumulation of new biomedical literature not only causes curated knowledge graphs to become outdated and incomplete, but also makes manual curation an impractical and unsustainable solution. Automated or semi-automated workflows are necessary to assist in prioritizing and curating the literature to update and enrich knowledge graphs. We have developed two workflows: one for re-curating a given knowledge graph to assure its syntactic and semantic quality and another for rationally enriching it by manually revising automatically extracted relations for nodes with low information density. We applied these workflows to the knowledge graphs encoded in Biological Expression Language from the NeuroMMSig database using content that was pre-extracted from MEDLINE abstracts and PubMed Central full-text articles using text mining output integrated by INDRA. We have made this workflow freely available at https://github.com/bel-enrichment/bel-enrichment. Database URL: https://github.com/bel-enrichment/results


2019 ◽  
Author(s):  
Morteza Pourreza Shahri ◽  
Indika Kahanda

Identifying protein-phenotype relations is of paramount importance for applications such as uncovering rare and complex diseases. One of the best resources that captures protein-phenotype relationships is the biomedical literature. In this work, we introduce ProPheno, a comprehensive online dataset composed of human protein/phenotype mentions extracted from the complete corpora of Medline and PubMed Central Open Access. Moreover, it includes co-occurrences of protein-phenotype pairs within different spans of text such as sentences and paragraphs. We use ProPheno to characterize the human protein-phenotype landscape in the biomedical literature. ProPheno, the reported findings and the gained insight have implications for (1) biocurators, for expediting their curation efforts, (2) researchers, for quickly finding relevant articles, and (3) text mining tool developers, for training their predictive models. The RESTful API of ProPheno is freely available at http://propheno.cs.montana.edu.
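The idea of counting protein-phenotype co-occurrences within sentences or paragraphs can be illustrated with a minimal sketch. It assumes naive, case-insensitive substring matching against two tiny hypothetical lexicons (`PROTEINS`, `PHENOTYPES`); the abstract does not say how ProPheno recognizes mentions, so this is not its actual method or API.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy lexicons, for illustration only.
PROTEINS = {"BRCA1", "TP53"}
PHENOTYPES = {"breast carcinoma", "seizures"}


def mentions(span, lexicon):
    """Return lexicon terms found in the span (naive substring match)."""
    lowered = span.lower()
    return {term for term in lexicon if term.lower() in lowered}


def cooccurrences(text, level="sentence"):
    """Count protein-phenotype pairs that share a sentence or a paragraph."""
    if level == "sentence":
        # Split after sentence-ending punctuation followed by whitespace.
        spans = re.split(r"(?<=[.!?])\s+", text)
    else:
        # Paragraphs are assumed to be separated by a blank line.
        spans = re.split(r"\n\s*\n", text)
    counts = Counter()
    for span in spans:
        pairs = product(mentions(span, PROTEINS), mentions(span, PHENOTYPES))
        for pair in pairs:
            counts[pair] += 1
    return counts


text = (
    "BRCA1 variants raise the risk of breast carcinoma. "
    "TP53 was also sequenced.\n\n"
    "Seizures were reported in one patient."
)
sentence_counts = cooccurrences(text, level="sentence")
paragraph_counts = cooccurrences(text, level="paragraph")
```

Widening the span from sentence to paragraph admits more pairs (here TP53 with breast carcinoma) at the cost of weaker evidence that the two mentions are actually related.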


2021 ◽  
Vol 8 ◽  
Author(s):  
Paola Turina ◽  
Piero Fariselli ◽  
Emidio Capriotti

In recent years, the increasing number of DNA sequencing and protein mutagenesis studies has generated a large amount of variation data published in the biomedical literature. The collection of such data has been essential for the development and assessment of tools predicting the impact of protein variants at functional and structural levels. Nevertheless, the collection of manually curated data from the literature is a highly time-consuming and costly process that requires domain experts. In particular, the development of methods for predicting the effect of amino acid variants on protein stability relies on thermodynamic data extracted from the literature. In the past, such data were deposited in the ProTherm database, which, however, has not been maintained since 2013. To facilitate the collection of protein thermodynamic data from the literature, we developed the semi-automatic tool ThermoScan. ThermoScan is a text mining approach for the identification of relevant thermodynamic data on protein stability from full-text articles. The method relies on a regular expression searching for groups of words, including the most common conceptual words appearing in experimental studies on protein stability, several thermodynamic variables, and their units of measure. ThermoScan analyzes full-text articles from the PubMed Central Open Access subset and calculates an empirical score that allows the identification of manuscripts reporting thermodynamic data on protein stability. The method was optimized on a set of publications included in the ProTherm database, and tested on a new curated set of articles, manually selected for the presence of thermodynamic data. The results show that ThermoScan returns accurate predictions and outperforms recently developed text-mining algorithms based on the analysis of publication abstracts. Availability: The ThermoScan server is freely accessible online at https://folding.biofold.org/thermoscan.
The ThermoScan Python code and the Google Chrome extension for submitting visualized PMC web pages to the ThermoScan server are available at https://github.com/biofold/ThermoScan.
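A regular-expression search over groups of words, combined into a score, might look like the sketch below. The term lists and the scoring rule (the fraction of groups with at least one hit) are assumptions made for illustration; the abstract does not give ThermoScan's actual patterns or its empirical score.

```python
import re

# Illustrative term groups only; not ThermoScan's real word lists.
PATTERNS = {
    "concept": re.compile(
        r"\b(?:unfolding|denaturation|thermostability|protein stability)\b",
        re.IGNORECASE,
    ),
    "variable": re.compile(r"(?:ΔΔG|ΔG|\bTm\b|ΔH)"),
    "unit": re.compile(r"(?:kcal/mol|kJ/mol|°C)"),
}


def group_hits(text):
    """Count non-overlapping regex matches for each term group."""
    return {name: len(p.findall(text)) for name, p in PATTERNS.items()}


def empirical_score(text):
    """Hypothetical score: fraction of term groups with at least one hit."""
    hits = group_hits(text)
    return sum(1 for count in hits.values() if count > 0) / len(PATTERNS)


sample = "Thermal denaturation gave a ΔG of 5.2 kcal/mol and a Tm of 61 °C."
sample_hits = group_hits(sample)
sample_score = empirical_score(sample)
```

Requiring hits from several groups at once, rather than counting any single keyword, is one simple way to favour texts that report measured values over texts that merely mention stability.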


2021 ◽  
Vol 31 (1) ◽  
pp. 1-22
Author(s):  
Ronald Snijder

Open access platforms and retail websites are both trying to present the most relevant offerings to their patrons. Retail websites deploy recommender systems that collect data about their customers. These systems are successful but intrude on privacy. As an alternative, this paper presents an algorithm that uses text mining techniques to find the most important themes of an open access book or chapter. By locating other publications that share one or more of these themes, it is possible to recommend closely related books or chapters. The algorithm splits the full text into trigrams. It removes all trigrams containing words that are commonly used in everyday language and in (open access) book publishing. The most frequently occurring remaining trigrams are distinctive to the publication and indicate the themes of the book. The next step is finding publications that share one or more of the trigrams. The strength of the connection can be measured by counting and ranking the number of shared trigrams. The algorithm was used to find connections between 10,997 titles: 67% in English, 29% in German and 6% in Dutch or a combination of languages. The algorithm is able to find connected books across languages. It is possible to use the algorithm for several use cases, not just recommender systems. Creating benchmarks for publishers or creating a collection of connected titles for libraries are other possibilities. Apart from the OAPEN Library, the algorithm can be applied to other collections of open access books or even open access journal articles. Combining the results across multiple collections will enhance its effectiveness.
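The steps described above (split into trigrams, drop trigrams containing common words, keep the most frequent ones, then rank other titles by shared trigrams) can be sketched as follows. The stop-word list, the cut-off `n`, and the function names are hypothetical; the paper's actual word lists and thresholds are not stated in this abstract.

```python
import re
from collections import Counter

# Hypothetical stop list standing in for "everyday" and publishing words.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "open", "access"}


def top_trigrams(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent word trigrams free of stop words."""
    words = re.findall(r"\w+", text.lower())
    trigrams = zip(words, words[1:], words[2:])
    kept = [t for t in trigrams if not any(w in stop_words for w in t)]
    return [trigram for trigram, _ in Counter(kept).most_common(n)]


def shared_trigram_count(text_a, text_b, n=5):
    """Connection strength: number of top trigrams two texts share."""
    return len(set(top_trigrams(text_a, n)) & set(top_trigrams(text_b, n)))


def rank_related(target, candidates, n=5):
    """Rank candidate titles by shared trigrams, strongest first."""
    scores = {
        title: shared_trigram_count(target, text, n)
        for title, text in candidates.items()
    }
    connected = [(title, s) for title, s in scores.items() if s > 0]
    return sorted(connected, key=lambda item: (-item[1], item[0]))


book = "deep sea coral reefs host fish. deep sea coral reefs grow slowly."
library = {
    "B": "deep sea coral reefs are fragile",
    "C": "mountain goats climb steep cliffs",
}
themes = top_trigrams(book, n=2)
related = rank_related(book, library)
```

Because the comparison uses only the text of the publications, no reader data is collected, which matches the privacy motivation given in the abstract.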


2019 ◽  
Author(s):  
Гульдар Фанисовна Ибрагимова ◽  
Ольга Алексеевна Ковалевич ◽  
Раиса Николаевна Афонина ◽  
Елена Алексеевна Лесных ◽  
Яна Игоревна Ряполова ◽  
...  

Conference paper.
Covered by Leading Indexing Databases: Open European Academy of Public Sciences aims to have all of its journals covered by the Science Citation Index Expanded (SCIE) and the Scopus and Web of Science indexing systems. Several journals have already been covered by SCIE for several years and have received official Impact Factors. Some life science-related journals are also covered by PubMed/MEDLINE and archived through PubMed Central (PMC). All of our journals are archived with the Spanish and German National Libraries.
All Content is Open Access and Free for Readers: Journals published by Open European Academy of Public Sciences are fully open access: research articles, reviews or any other content on this platform is available to everyone free of charge. To be able to provide open access journals, we finance publication through article processing charges (APC); these are usually covered by the authors’ institutes or research funding bodies. We offer access to science and the latest research to readers for free. All of our content is published in open access and distributed under a Creative Commons License, which means published articles can be freely shared and the content reused, upon proper attribution.
Open European Academy of Public Sciences Publication Ethics Statement: Open European Academy of Public Sciences is a member of the Committee on Publication Ethics (COPE). Open European Academy of Public Sciences takes responsibility for enforcing rigorous peer review together with strict ethical policies and standards to ensure that high-quality scientific works are added to the field of scholarly publication. Unfortunately, cases of plagiarism, data falsification, inappropriate authorship credit, and the like, do arise. Open European Academy of Public Sciences takes such publishing ethics issues very seriously and our editors are trained to proceed in such cases with a zero-tolerance policy. To verify the originality of content submitted to our journals, we use iThenticate to check submissions against previous publications.
Mission and Values: As a pioneer of academic open access publishing, we have served the scientific community since 2009. Our aim is to foster scientific exchange in all forms, across all disciplines. Sustainability is at the root of Open European Academy of Public Sciences and a key theme in our journals; we support it by ensuring the long-term preservation of published papers, and the future of science through partnerships, sponsorships and awards.


Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Carlos-Francisco Méndez-Cruz ◽  
Antonio Blanchet ◽  
Alan Godínez ◽  
Ignacio Arroyo-Fernández ◽  
Socorro Gama-Castro ◽  
...  

Abstract. Transcription factors (TFs) play a major role in transcriptional regulation in bacteria, as they regulate transcription of the genetic information encoded in DNA. Thus, the curation of the properties of these regulatory proteins is essential for a better understanding of transcriptional regulation. However, traditional manual curation of article collections to compile descriptions of TF properties takes significant time and effort due to the overwhelming amount of biomedical literature, which increases every day. The development of automatic approaches for knowledge extraction to assist curation is therefore critical. Here, we show an effective approach for knowledge extraction to assist curation of summaries describing bacterial TF properties based on an automatic text summarization strategy. We were able to automatically recover a median of 77% of the knowledge contained in manual summaries describing properties of 177 TFs of Escherichia coli K-12 by processing 5961 scientific articles. For 71% of the TFs, our approach extracted new knowledge that can be used to expand manual descriptions. Furthermore, as we trained our predictive model with manual summaries of E. coli, we also generated summaries for 185 TFs of Salmonella enterica serovar Typhimurium from 3498 articles. According to the manual curation of 10 of these Salmonella typhimurium summaries, 96% of their sentences contained relevant knowledge. Our results demonstrate the feasibility of assisting manual curation, both by expanding manual summaries with automatically extracted new knowledge and by creating new summaries for bacteria for which such curation efforts do not exist. Database URL: The automatic summaries of the TFs of E. coli and Salmonella and the automatic summarizer are available on GitHub (https://github.com/laigen-unam/tf-properties-summarizer.git).


Author(s):  
Zhi-mei Li ◽  
Li-xia Chen ◽  
Hua Li

The article “Voltage-gated Sodium Channels and Blockers: An Overview and Where Will They Go?”, written by Zhi-mei LI, Li-xia CHEN, Hua LI, was originally published electronically on the publisher’s internet portal in December 2019 without open access. With the author(s)’ decision to opt for Open Choice, the copyright of the article is changed to © The Author(s) 2020 and the article is forthwith distributed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The original article has been corrected. Corresponding authors: Li-xia CHEN, Hua LI

