Text Mining of Full Text Articles and Creation of a Knowledge Base for Analysis of Microarray Data

Author(s):  
Eric G. Bremer ◽  
Jeyakumar Natarajan ◽  
Yonghong Zhang ◽  
Catherine DeSesa ◽  
Catherine J. Hack ◽  
...  
Author(s):  
Jeyakumar Natarajan ◽  
Niranjan Mulay ◽  
Catherine DeSesa ◽  
Catherine J. Hack ◽  
Werner Dubitzky ◽  
...  

2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Chris Bauer ◽  
Ralf Herwig ◽  
Matthias Lienhard ◽  
Paul Prasse ◽  
Tobias Scheffer ◽  
...  

Abstract Background: There is a huge body of scientific literature describing the relation between tumor types and anti-cancer drugs. The vast amount of scientific literature makes it impossible for researchers and physicians to extract all relevant information manually. Methods: In order to cope with the large amount of literature we applied an automated text mining approach to assess the relations between the 30 most frequent cancer types and 270 anti-cancer drugs. We applied two different approaches: a classical text mining approach based on named entity recognition and an AI-based approach employing word embeddings. The consistency of literature mining results was validated with 3 independent methods: first, using data from FDA approvals; second, using experimentally measured IC-50 cell line data; and third, using clinical patient survival data. Results: We demonstrated that the automated text mining was able to successfully assess the relation between cancer types and anti-cancer drugs. All validation methods showed a good correspondence between the results from literature mining and independent confirmatory approaches. The relations between the most frequent cancer types and the drugs employed for their treatment were visualized in a large heatmap. All results are accessible in an interactive web-based knowledge base using the following link: https://knowledgebase.microdiscovery.de/heatmap. Conclusions: Our approach is able to assess the relations between compounds and cancer types in an automated manner. Both cancer types and compounds could be grouped into different clusters. Researchers can use the interactive knowledge base to inspect the presented results and follow their own research questions, for example the identification of novel indication areas for known drugs.
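The dictionary-based side of such an approach can be illustrated with a minimal sketch that counts how often a cancer term and a drug term occur in the same sentence. The term lists, the sample sentences and the function name `cooccurrence_counts` are invented for illustration, and plain word matching stands in for real named entity recognition; this is not the authors' pipeline.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-dictionaries; a real system would use curated vocabularies.
CANCER_TERMS = {"melanoma", "glioblastoma"}
DRUG_TERMS = {"vemurafenib", "temozolomide"}


def cooccurrence_counts(sentences):
    """Count sentences in which a cancer term and a drug term both occur."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and tokenize; hyphenated words stay together.
        tokens = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
        cancers = tokens & CANCER_TERMS
        drugs = tokens & DRUG_TERMS
        # Every (cancer, drug) pair found in the sentence counts once.
        for pair in product(sorted(cancers), sorted(drugs)):
            counts[pair] += 1
    return counts


# Invented example sentences, not taken from the literature.
sentences = [
    "Vemurafenib improved survival in BRAF-mutant melanoma.",
    "Temozolomide is widely used in glioblastoma.",
    "Melanoma patients received vemurafenib after surgery.",
]
counts = cooccurrence_counts(sentences)
for (cancer, drug), n in sorted(counts.items()):
    print(f"{cancer}\t{drug}\t{n}")
```

A matrix of such counts, one row per cancer type and one column per drug, is the kind of table that could then be normalized and drawn as a heatmap.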


2017 ◽  
Author(s):  
Morgan N. Price ◽  
Adam P. Arkin

Abstract Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources that link protein sequences to scientific articles (Swiss-Prot, GeneRIF, and EcoCyc). PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.


Author(s):  
Rafal Rzepka ◽  
Kenji Araki

This chapter introduces an approach and methods for creating a system that refers to human experiences and thoughts about these experiences in order to ethically evaluate other parties' and, in the long run, its own actions. It is shown how applying text mining techniques can enrich a machine's knowledge about the real world and how this knowledge could be helpful in the difficult realm of moral relativity. Possibilities of simulating empathy and applying the proposed methods to various approaches are introduced, together with a discussion of the possibility of applying a growing knowledge base to artificial agents for particular purposes, from simple housework robots to moral advisors, which could refer to millions of different experiences had by people in various cultures. The experimental results show efficiency improvements over previous research, and the problems with fair evaluation of moral and immoral acts are also discussed.


2019 ◽  
Vol 47 (W1) ◽  
pp. W587-W593 ◽  
Author(s):  
Chih-Hsuan Wei ◽  
Alexis Allot ◽  
Robert Leaman ◽  
Zhiyong Lu

Abstract PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles. PubTator Central (PTC) provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download. PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full text articles). The new PTC web interface allows users to build full text document collections and visualize concept annotations in each document. Annotations are downloadable in multiple formats (XML, JSON and tab delimited) via the online interface, a RESTful web service and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster. PTC is synchronized with PubMed and PubMed Central, with new articles added daily. The original PubTator service has served annotated abstracts for ∼300 million requests, enabling third-party research in use cases such as biocuration support, gene prioritization, genetic disease analysis, and literature-based knowledge discovery. We demonstrate the full text results in PTC significantly increase biomedical concept coverage and anticipate this expansion will both enhance existing downstream applications and enable new use cases.
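As an illustration of working with the tab-delimited download, the sketch below parses a small record into passages and annotations. It assumes the commonly seen PubTator-style layout (`PMID|t|title`, `PMID|a|abstract`, then one tab-separated annotation per line with start, end, mention, type and identifier). The sample record, the `Annotation` class and the function name `parse_pubtator` are invented for illustration, and the exact columns of a real download should be checked against the service's documentation.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    pmid: str
    start: int
    end: int
    mention: str
    concept_type: str
    identifier: str


def parse_pubtator(text):
    """Split a PubTator-style record into passages and annotations."""
    passages = {}
    annotations = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) >= 6:
            # Annotation line: pmid, start, end, mention, type, identifier.
            pmid, start, end, mention, ctype, ident = fields[:6]
            annotations.append(
                Annotation(pmid, int(start), int(end), mention, ctype, ident)
            )
        else:
            # Passage line: "pmid|t|..." (title) or "pmid|a|..." (abstract).
            parts = line.split("|", 2)
            if len(parts) == 3 and parts[1] in ("t", "a"):
                passages[parts[1]] = parts[2]
    return passages, annotations


# Invented sample record; the PMID is a placeholder.
SAMPLE = (
    "00000000|t|BRCA1 mutations in breast cancer\n"
    "00000000|a|An invented abstract used only for this example.\n"
    "00000000\t0\t5\tBRCA1\tGene\t672\n"
    "00000000\t19\t32\tbreast cancer\tDisease\tMESH:D001943\n"
)

passages, annotations = parse_pubtator(SAMPLE)
for a in annotations:
    # Both sample annotations fall inside the title, so the offsets
    # can be checked directly against the title string.
    print(a.concept_type, a.identifier, repr(passages["t"][a.start:a.end]))
```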


2021 ◽  
Vol 8 (2) ◽  
pp. 180-185
Author(s):  
Anna Tolwinska

This article aims to explain the key metadata elements listed in Participation Reports, why it’s important to check them regularly, and how Crossref members can improve their scores. Crossref members register a lot of metadata in Crossref. That metadata is machine-readable, standardized, and then shared across discovery services and author tools. This is important because richer metadata makes content more discoverable and useful to the scholarly community. It’s not always easy to know what metadata Crossref members register in Crossref. This is why Crossref created an easy-to-use tool called Participation Reports to show editors and researchers the key metadata elements Crossref members register to make their content more useful. The key metadata elements include references and whether they are set to open, ORCID iDs, funding information, Crossmark metadata, licenses, full-text URLs for text-mining, and Similarity Check indexing, as well as abstracts. ROR IDs (Research Organization Registry Identifiers), which identify institutions, will be added in the future. This data was always available through Crossref’s REST API (Representational State Transfer Application Programming Interface) but is now visualized in Participation Reports. To improve scores, editors should encourage authors to submit ORCID iDs in their manuscripts and publishers should register as much metadata as possible to help drive research further.
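To make the idea of a score concrete, the sketch below assumes that a score is the percentage of a member's registered items that carry a given metadata element. The record layout, the field names and the function name `coverage` are hypothetical and do not reproduce Crossref's schema or its exact calculation.

```python
def coverage(records, elements):
    """Return, per metadata element, the percentage of records that have it."""
    total = len(records)
    if total == 0:
        return {element: 0.0 for element in elements}
    return {
        element: 100.0 * sum(1 for r in records if r.get(element)) / total
        for element in elements
    }


# Hypothetical records: each dict stands for one registered item.
records = [
    {"abstract": "text", "orcid": "present", "license": "cc-by"},
    {"abstract": "text", "license": "cc-by"},
    {"abstract": "text", "license": "cc-by"},
    {"license": "cc-by"},
]

scores = coverage(records, ["abstract", "orcid", "license"])
for element, pct in scores.items():
    print(f"{element}: {pct:.0f}%")
```

Under this assumption, adding ORCID iDs to the three records that lack them would raise that element's score from 25% to 100%.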


2008 ◽  
Author(s):  
Wendy C Robertson ◽  
Paul A Soderdahl

Link resolvers, including Ex Libris’ SFX, use OpenURL to provide library patrons with context-sensitive links, such as the ability to move quickly from a citation in an abstracting and indexing database to the full text. In SFX, information for determining the appropriate links is maintained in the knowledge base, which contains details about a library’s electronic holdings and other information about electronic information resources. This article describes SFX functionality, what the service looks like from a patron’s point of view, and how it can be of particular assistance to a serials librarian.
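A context-sensitive link of this kind is, in essence, a URL that carries citation metadata to the resolver. The sketch below builds one using the key/encoded-value form of OpenURL 1.0 (Z39.88-2004). The resolver address, the citation values and the function name `build_openurl` are hypothetical; a real SFX instance has its own base URL and may expect additional keys.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical resolver address, for illustration only.
RESOLVER_BASE = "https://resolver.example.edu/sfx"


def build_openurl(base, citation):
    """Build a journal-article OpenURL from a dict of citation fields."""
    params = {
        "url_ver": "Z39.88-2004",
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
    }
    # Citation fields describe the referent and take the "rft." prefix.
    for key, value in citation.items():
        params[f"rft.{key}"] = value
    return f"{base}?{urlencode(params)}"


# Placeholder citation values, not a real article.
citation = {
    "issn": "0000-0000",
    "volume": "13",
    "issue": "1",
    "spage": "172",
    "date": "2012",
}
link = build_openurl(RESOLVER_BASE, citation)
print(link)
```

The resolver compares these values with the holdings recorded in its knowledge base and decides which full-text targets to offer the patron.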


2012 ◽  
Vol 13 (1) ◽  
pp. 172 ◽  
Author(s):  
Jan Czarnecki ◽  
Irene Nobeli ◽  
Adrian M Smith ◽  
Adrian J Shepherd

2020 ◽  
Vol 9 (1) ◽  
Author(s):  
E. Popoff ◽  
M. Besada ◽  
J. P. Jansen ◽  
S. Cope ◽  
S. Kanters

Abstract Background: Despite existing research on text mining and machine learning for title and abstract screening, the role of machine learning within systematic literature reviews (SLRs) for health technology assessment (HTA) remains unclear given the lack of extensive testing and of guidance from HTA agencies. We sought to address two knowledge gaps: to extend ML algorithms to provide a reason for exclusion (to align with current practices) and to determine optimal parameter settings for feature-set generation and ML algorithms. Methods: We used abstract and full-text selection data from five large SLRs (n = 3089 to 12,769 abstracts) across a variety of disease areas. Each SLR was split into training and test sets. We developed a multi-step algorithm to categorize each citation into the following categories: included; excluded for each PICOS criterion; or unclassified. We used a bag-of-words approach for feature-set generation and compared machine learning algorithms using support vector machines (SVMs), naïve Bayes (NB), and bagged classification and regression trees (CART) for classification. We also compared alternative training set strategies: using full data versus downsampling (i.e., reducing excludes to balance includes/excludes because machine learning algorithms perform better with balanced data), and using inclusion/exclusion decisions from abstract versus full-text screening. Performance comparisons were in terms of specificity, sensitivity, accuracy, and matching the reason for exclusion. Results: The best-fitting model (optimized sensitivity and specificity) was based on the SVM algorithm using training data based on full-text decisions, downsampling, and excluding words occurring fewer than five times. The sensitivity and specificity of this model ranged from 94 to 100%, and 54 to 89%, respectively, across the five SLRs. On average, 75% of excluded citations were excluded with a reason and 83% of these citations matched the reviewers’ original reason for exclusion. Sensitivity significantly improved when both downsampling and abstract decisions were used. Conclusions: ML algorithms can improve the efficiency of the SLR process and the proposed algorithms could reduce the workload of a second reviewer by identifying exclusions with a relevant PICOS reason, thus aligning with HTA guidance. Downsampling can be used to improve study selection, and improvements using full-text exclusions have implications for a learn-as-you-go approach.
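Three of the ingredients named in this abstract can be sketched with the standard library alone: dropping words that occur fewer than five times, downsampling excludes to match includes, and computing sensitivity and specificity. The function names, the toy data and the choice to downsample to an exact 1:1 ratio are assumptions made for illustration; the classifiers themselves (SVM, NB, CART) are omitted and this is not the authors' implementation.

```python
import random
from collections import Counter


def build_vocabulary(docs, min_count=5):
    """Keep only words that occur at least min_count times across all docs."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_count)


def downsample(labels, seed=0):
    """Return indices of all includes (1) plus an equal number of excludes (0)."""
    rng = random.Random(seed)
    includes = [i for i, y in enumerate(labels) if y == 1]
    excludes = [i for i, y in enumerate(labels) if y == 0]
    kept = rng.sample(excludes, min(len(includes), len(excludes)))
    return sorted(includes + kept)


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy data, invented for this example.
vocab = build_vocabulary(["trial trial trial", "trial trial cohort"])
balanced = downsample([1, 1, 0, 0, 0, 0, 0, 0])
sens, spec = sensitivity_specificity(
    [1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 1, 1]
)
print(vocab, len(balanced), round(sens, 2), round(spec, 2))
```

In a screening setting, sensitivity is usually the quantity to protect, since a missed include is costlier than an extra abstract passed on to a human reviewer.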

