Pangaea: A modular and extensible collection of tools for mining context-dependent gene relationships from the biomedical literature
Abstract

Motivation
Pangaea is a scalable and extensible command-line interface (CLI) tool that integrates gene-relationship detection features to extract structured, context-dependent gene-gene and gene-term relationships from the biomedical literature. It provides computational methods for identifying biological relationships within a collection of genes, and it can be used to search for and extract different types of contextual relationships among genes.

Results
We implemented CLI-based software that downloads PubMed articles and extracts gene relationships from their abstracts using natural language processing methods. For scalability, the software was designed to retrieve and process millions of articles while minimising memory requirements and parallelising work across multiple CPU cores. For extensibility, the tool allows custom, context-specific models to be used in the text-processing steps, and the output is serialised as JSON objects to support flexible post-processing workflows.

Availability
The software is available online at: https://github.com/ss-lab-cancerunit/pangaea
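
The design goals stated in the Results (lazy processing to limit memory use, parallelism across CPU cores, JSON output) can be illustrated with a minimal sketch. This is not Pangaea's implementation: the gene lexicon `GENES`, the tab-separated input layout, the sentence-level co-occurrence rule, and the function names `stream_records`, `extract_relations`, `batches`, and `process` are all assumptions made for illustration, and co-occurrence stands in for the natural language processing models the tool actually uses.

```python
import itertools
import json
import re
from multiprocessing import Pool

# Hypothetical gene lexicon; a real run would load a curated symbol list.
GENES = {"TP53", "MDM2", "BRCA1", "EGFR"}
TOKEN = re.compile(r"[A-Za-z0-9]+")


def extract_relations(record):
    """Return one JSON string listing gene pairs that share a sentence."""
    pmid, abstract = record
    pairs = set()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        found = sorted({t for t in TOKEN.findall(sentence) if t in GENES})
        pairs.update(itertools.combinations(found, 2))
    return json.dumps({"pmid": pmid, "relations": sorted(pairs)})


def stream_records(lines):
    """Lazily yield (pmid, abstract) from 'pmid<TAB>abstract' lines."""
    for line in lines:
        pmid, _, abstract = line.rstrip("\n").partition("\t")
        if abstract:  # skip lines without an abstract field
            yield pmid, abstract


def batches(iterable, size):
    """Yield lists of at most `size` items, so only one batch is in memory."""
    it = iter(iterable)
    while batch := list(itertools.islice(it, size)):
        yield batch


def process(lines, workers=1, batch_size=1000):
    """Yield one JSON line per record, optionally using several processes."""
    records = stream_records(lines)
    if workers <= 1:
        yield from map(extract_relations, records)
        return
    with Pool(workers) as pool:
        for batch in batches(records, batch_size):
            yield from pool.map(extract_relations, batch)


if __name__ == "__main__":
    sample = [
        "1\tMDM2 binds TP53 and inhibits it. EGFR was not studied.\n",
        "2\tBRCA1 and TP53 are both mutated in this cohort.\n",
    ]
    for json_line in process(sample, workers=2):
        print(json_line)
```

Because `process` is a generator that accepts any iterable of lines, an open file handle can be passed directly and the output written line by line, so neither the input nor the results need to be held in memory in full. Emitting one JSON object per record keeps downstream filtering and aggregation independent of the extraction step.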