ssbio: A Python Framework for Structural Systems Biology

ssbio: a Python framework for structural systems biology

Bioinformatics ◽

10.1093/bioinformatics/bty077 ◽

2018 ◽

Vol 34 (12) ◽

pp. 2155-2157 ◽

Cited By ~ 15

Author(s):

Nathan Mih ◽

Elizabeth Brunk ◽

Ke Chen ◽

Edward Catoiu ◽

Anand Sastry ◽

...

Keyword(s):

Structural Information ◽

Protein Structures ◽

Structural Data ◽

Third Party ◽

Supplementary Information ◽

Scale Models ◽

Protein Properties ◽

Scale Network ◽

Structural Systems Biology ◽

Genome Scale

Abstract. Summary: Working with protein structures at the genome scale has been challenging in a variety of ways. Here, we present ssbio, a Python package that provides a framework to easily work with structural information in the context of genome-scale network reconstructions, which can contain thousands of individual proteins. The ssbio package provides an automated pipeline to construct high-quality genome-scale models with protein structures (GEM-PROs), wrappers to popular third-party programs to compute associated protein properties, and methods to visualize and annotate structures directly in Jupyter notebooks, thus lowering the barrier of linking 3D structural data with established systems workflows. Availability and implementation: ssbio is implemented in Python and available to download under the MIT license at http://github.com/SBRG/ssbio. Documentation and Jupyter notebook tutorials are available at http://ssbio.readthedocs.io/en/latest/. Interactive notebooks can be launched using Binder at https://mybinder.org/v2/gh/SBRG/ssbio/master?filepath=Binder.ipynb. Supplementary information: Supplementary data are available at Bioinformatics online.

Streamlined use of protein structures in variant analysis

10.1101/2021.09.10.459756 ◽

2021 ◽

Author(s):

Sandeep Kaur ◽

Neblina Sikta ◽

Andrea Schafferhans ◽

Nicola Bordin ◽

Mark J. Cowley ◽

...

Keyword(s):

Protein Function ◽

Molecular Mechanisms ◽

Structural Information ◽

Protein Structures ◽

Structural Data ◽

Supplementary Information ◽

3D Structures ◽

Link Type ◽

Variant Analysis ◽

Many Sources

Abstract. Motivation: Variant analysis is a core task in bioinformatics that requires integrating data from many sources. This process can be helped by using 3D structures of proteins, which provide a spatial context that can give insight into how variants affect function. Many available tools can help with mapping variants onto structures, but each has specific restrictions, with the result that many researchers fail to benefit from valuable insights that could be gained from structural data. Results: To address this, we have created a streamlined system for incorporating 3D structures into variant analysis. Variants can be specified via URLs that are easy to read and write, and that use the notation recommended by the Human Genome Variation Society (HGVS). For example, ‘https://aquaria.app/SARS-CoV-2/S/?N501Y’ specifies the N501Y variant of SARS-CoV-2 S protein. In addition to mapping variants onto structures, our system provides summary information from multiple external resources, including COSMIC, CATH-FunVar, and PredictProtein. Furthermore, our system identifies and summarizes structures containing the variant, as well as the variant position. Our system supports essentially any mutation for any well-studied protein, and uses all available structural data (including models inferred via very remote homology), integrated into a system that is fast and simple to use. By giving researchers easy, streamlined access to a wealth of structural information during variant analysis, our system will help in revealing novel insights into the molecular mechanisms underlying protein function in health and disease. Availability: Our resource is freely available at the project home page (https://aquaria.app). After peer review, the code will be openly available via a GPL version 2 license at https://github.com/ODonoghueLab/Aquaria. PSSH2, the database of sequence-to-structure alignments, is also freely available for download at https://zenodo.org/record/[email protected] Supplementary information: None.
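
The URL convention described in this abstract can be illustrated with a small sketch. The helper name, the validation pattern, and the path layout (organism, then protein, then the variant as the query string) are assumptions inferred from the single example in the abstract; they are not taken from the Aquaria code base, and only simple one-letter substitutions are handled.

```python
import re
from urllib.parse import quote

# Base address taken from the abstract; the path layout
# <organism>/<protein>/?<variant> is inferred from its one example.
AQUARIA_BASE = "https://aquaria.app"

# One-letter amino acid substitution such as N501Y.
# Illustrative assumption: other HGVS variant types are not covered.
SUBSTITUTION = re.compile(r"^[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]$")


def aquaria_variant_url(organism: str, protein: str, variant: str) -> str:
    """Hypothetical helper: build a URL shaped like the abstract's example."""
    if not SUBSTITUTION.match(variant):
        raise ValueError(f"not a one-letter substitution: {variant!r}")
    # Percent-encode the path segments; the variant is already validated.
    return f"{AQUARIA_BASE}/{quote(organism)}/{quote(protein)}/?{variant}"


print(aquaria_variant_url("SARS-CoV-2", "S", "N501Y"))
# prints https://aquaria.app/SARS-CoV-2/S/?N501Y
```

Validating the variant before building the URL keeps malformed input from silently producing a link that resolves to nothing useful.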

Structural Systems Biology Evaluation of Metabolic Thermotolerance in Escherichia coli

Science ◽

10.1126/science.1234012 ◽

2013 ◽

Vol 340 (6137) ◽

pp. 1220-1223 ◽

Cited By ~ 80

Author(s):

Roger L. Chang ◽

Kathleen Andrews ◽

Donghyuk Kim ◽

Zhanwen Li ◽

Adam Godzik ◽

...

Keyword(s):

Escherichia Coli ◽

Systems Biology ◽

Structural Information ◽

Protein Structures ◽

Limiting Factors ◽

Scale Model ◽

Structural Systems ◽

A Genome ◽

Structural Systems Biology ◽

Genome Scale

Genome-scale network reconstruction has enabled predictive modeling of metabolism for many systems. Traditionally, protein structural information has not been represented in such reconstructions. Expansion of a genome-scale model of Escherichia coli metabolism by including experimental and predicted protein structures enabled the analysis of protein thermostability in a network context. This analysis allowed the prediction of protein activities that limit network function at superoptimal temperatures and mechanistic interpretations of mutations found in strains adapted to heat. Predicted growth-limiting factors for thermotolerance were validated through nutrient supplementation experiments and defined metabolic sensitivities to heat stress, providing evidence that metabolic enzyme thermostability is rate-limiting at superoptimal temperatures. Inclusion of structural information expanded the content and predictive capability of genome-scale metabolic networks that enable structural systems biology of metabolism.

Protein Structure Determination in Living Cells

International Journal of Molecular Sciences ◽

10.3390/ijms20102442 ◽

2019 ◽

Vol 20 (10) ◽

pp. 2442 ◽

Cited By ~ 2

Author(s):

Teppei Ikeya ◽

Peter Güntert ◽

Yutaka Ito

Keyword(s):

Protein Structure ◽

Structure Determination ◽

Structure Prediction ◽

Structural Information ◽

Nuclear Overhauser Effect ◽

Protein Structures ◽

Three Dimensional ◽

Structural Data ◽

Sample Tube ◽

In Cells

To date, in-cell NMR has elucidated various aspects of protein behaviour by associating structures in physiological conditions. Meanwhile, current studies of this method mostly have deduced protein states in cells exclusively based on ‘indirect’ structural information from peak patterns and chemical shift changes but not ‘direct’ data explicitly including interatomic distances and angles. To fully understand the functions and physical properties of proteins inside cells, it is indispensable to obtain explicit structural data or determine three-dimensional (3D) structures of proteins in cells. Whilst the short lifetime of cells in a sample tube, low sample concentrations, and massive background signals make it difficult to observe NMR signals from proteins inside cells, several methodological advances help to overcome the problems. Paramagnetic effects have an outstanding potential for in-cell structural analysis. The combination of a limited amount of experimental in-cell data with software for ab initio protein structure prediction opens an avenue to visualise 3D protein structures inside cells. Conventional nuclear Overhauser effect spectroscopy (NOESY)-based structure determination is advantageous to elucidate the conformations of side-chain atoms of proteins as well as global structures. In this article, we review current progress for the structure analysis of proteins in living systems and discuss the feasibility of its future works.

VarMap: a web tool for mapping genomic coordinates to protein sequence and structure and retrieving protein structural annotations

Bioinformatics ◽

10.1093/bioinformatics/btz482 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4854-4856 ◽

Cited By ~ 8

Author(s):

James D Stephenson ◽

Roman A Laskowski ◽

Andrew Nightingale ◽

Matthew E Hurles ◽

Janet M Thornton

Keyword(s):

Protein Sequence ◽

Structural Information ◽

Protein Structures ◽

Supplementary Information ◽

Supplementary Data ◽

Web Tool ◽

Genomic Variants ◽

Structural Context ◽

Pathogenic Variants ◽

Transcript Evidence

Abstract. Motivation: Understanding the protein structural context and patterning on proteins of genomic variants can help to separate benign from pathogenic variants and reveal molecular consequences. However, mapping genomic coordinates to protein structures is non-trivial, complicated by alternative splicing and transcript evidence. Results: Here we present VarMap, a web tool for mapping a list of chromosome coordinates to canonical UniProt sequences and associated protein 3D structures, including validation checks, and annotating them with structural information. Availability and implementation: https://www.ebi.ac.uk/thornton-srv/databases/VarMap. Supplementary information: Supplementary data are available at Bioinformatics online.

SphereCon—a method for precise estimation of residue relative solvent accessible area from limited structural information

Bioinformatics ◽

10.1093/bioinformatics/btaa159 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3372-3378

Author(s):

Alexander Gress ◽

Olga V Kalinina

Keyword(s):

Protein Function ◽

Structural Information ◽

Solvent Accessibility ◽

Three Dimensional ◽

Structural Data ◽

Supplementary Information ◽

Dimensional Structure ◽

Relative Solvent Accessibility ◽

Precise Measure ◽

The Impact

Abstract. Motivation: In proteins, solvent accessibility of individual residues is a factor contributing to their importance for protein function and stability. Hence one might wish to calculate solvent accessibility in order to predict the impact of mutations, their pathogenicity and for other biomedical applications. A direct computation of solvent accessibility is only possible if all atoms of a protein three-dimensional structure are reliably resolved. Results: We present SphereCon, a new precise measure that can estimate residue relative solvent accessibility (RSA) from limited data. The measure is based on calculating the volume of intersection of a sphere with a cone cut out in the direction opposite of the residue with surrounding atoms. We propose a method for estimating the position and volume of residue atoms in cases when they are not known from the structure, or when the structural data are unreliable or missing. We show that in cases of reliable input structures, SphereCon correlates almost perfectly with the directly computed RSA, and outperforms other previously suggested indirect methods. Moreover, SphereCon is the only measure that yields accurate results when the identities of amino acids are unknown. A significant novel feature of SphereCon is that it can estimate RSA from inter-residue distance and contact matrices, without any information about the actual atom coordinates. Availability and implementation: https://github.com/kalininalab/spherecon. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
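
The abstract does not spell out the geometry of the sphere and cone. As a rough illustration only, the sketch below computes the volume of a sphere that lies inside a cone under the assumption that the cone's apex sits at the sphere's centre (a spherical sector). Both that assumption and the function name are hypothetical; the actual SphereCon measure may be defined differently.

```python
import math


def spherical_sector_volume(radius: float, half_angle: float) -> float:
    """Volume of the part of a sphere inside a cone with apex at the centre.

    half_angle is the cone's half-opening angle in radians (0 to pi).
    Assumption: apex at the sphere's centre, which the abstract does not state.
    """
    if radius < 0 or not 0.0 <= half_angle <= math.pi:
        raise ValueError("radius must be >= 0 and half_angle within [0, pi]")
    # Standard spherical-sector formula: V = (2*pi/3) * r^3 * (1 - cos(theta))
    return (2.0 * math.pi / 3.0) * radius ** 3 * (1.0 - math.cos(half_angle))


# A half-angle of pi/2 gives a hemisphere; a half-angle of pi gives the full sphere.
hemisphere = spherical_sector_volume(1.0, math.pi / 2)
full_sphere = spherical_sector_volume(1.0, math.pi)
print(hemisphere, full_sphere)
```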

Analysis of several key factors influencing deep learning-based inter-residue contact prediction

Bioinformatics ◽

10.1093/bioinformatics/btz679 ◽

2019 ◽

Cited By ~ 1

Author(s):

Tianqi Wu ◽

Jie Hou ◽

Badri Adhikari ◽

Jianlin Cheng

Keyword(s):

Deep Learning ◽

Structural Information ◽

Protein Structures ◽

Supplementary Information ◽

Prediction Methods ◽

Key Factors ◽

Sequence Alignments ◽

Multiple Sequence ◽

Contact Prediction ◽

Ab Initio Approach

Abstract. Motivation: Deep learning has become the dominant technology for protein contact prediction. However, the factors that affect the performance of deep learning in contact prediction have not been systematically investigated. Results: We analyzed the results of our three deep learning-based contact prediction methods (MULTICOM-CLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced the contact prediction accuracy. We compared our convolutional neural network (CNN)-based contact prediction methods with three coevolution-based methods on 75 CASP13 targets consisting of 108 domains. We demonstrated that the CNN-based multi-distance approach was able to leverage global coevolutionary coupling patterns comprised of multiple correlated contacts for more accurate contact prediction than the local coevolution-based methods, leading to a substantial increase of precision by 19.2 percentage points. We also tested different alignment methods and domain-based contact prediction with the deep learning contact predictors. The comparison of the three methods showed deeper sequence alignments and the integration of domain-based contact prediction with the full-length contact prediction improved the performance of contact prediction. Moreover, we demonstrated that the domain-based contact prediction based on a novel ab initio approach of parsing domains from MSAs alone without using known protein structures was a simple, fast approach to improve contact prediction. Finally, we showed that predicting the distribution of inter-residue distances in multiple distance intervals could capture more structural information and improve binary contact prediction. Availability and implementation: https://github.com/multicom-toolbox/DNCON2/. Supplementary information: Supplementary data are available at Bioinformatics online.
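
One generic way to connect a predicted distance distribution to a binary contact call is to sum the probability of the distance bins below the contact cutoff, and then score the ranked predictions by precision. The sketch below shows that idea only; it is not the MULTICOM or DNCON2 implementation. The bin edges, example values, and function names are hypothetical, and the 8 Å cutoff is the conventional contact threshold rather than a value stated in the abstract.

```python
from typing import Dict, Sequence, Set, Tuple

CONTACT_CUTOFF = 8.0  # angstroms; conventional threshold, assumed here

Pair = Tuple[int, int]


def contact_probability(bin_upper_edges: Sequence[float],
                        bin_probs: Sequence[float],
                        cutoff: float = CONTACT_CUTOFF) -> float:
    """Sum the probability mass of bins whose upper edge is within the cutoff."""
    if len(bin_upper_edges) != len(bin_probs):
        raise ValueError("edges and probabilities must have the same length")
    return sum(p for edge, p in zip(bin_upper_edges, bin_probs) if edge <= cutoff)


def top_k_precision(scores: Dict[Pair, float], truth: Set[Pair], k: int) -> float:
    """Fraction of the k highest-scoring residue pairs that are true contacts."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for pair in ranked if pair in truth) / k


# Hypothetical distribution over bins ending at 4, 6, 8, 10, 12 A and beyond.
edges = [4.0, 6.0, 8.0, 10.0, 12.0, float("inf")]
probs = [0.10, 0.20, 0.30, 0.25, 0.10, 0.05]
print(contact_probability(edges, probs))  # mass of the three bins up to 8 A

scores = {(1, 10): 0.9, (2, 20): 0.8, (3, 30): 0.4}
print(top_k_precision(scores, {(1, 10), (3, 30)}, k=2))  # prints 0.5
```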

Dynamical important residue network (DIRN): network inference via conformational change

Bioinformatics ◽

10.1093/bioinformatics/btz298 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4664-4670 ◽

Cited By ~ 2

Author(s):

Quan Li ◽

Ray Luo ◽

Hai-Feng Chen

Keyword(s):

Network Inference ◽

Protein Structures ◽

Interaction Network ◽

Structural Data ◽

Supplementary Information ◽

Residue Interaction ◽

Protein Functions ◽

Important Residue ◽

Dynamics Simulations ◽

Dynamical Information

Abstract. Motivation: Protein residue interaction networks have emerged as a useful strategy to understand the complex relationship between protein structures and functions and how functions are regulated. In a residue interaction network, every residue is used to define a network node, adding noise in network post-analysis and increasing the computational burden. In addition, dynamical information is often necessary in deciphering biological functions. Results: We developed a robust and efficient protein residue interaction network method, termed dynamical important residue network, by combining both structural and dynamical information. A major departure from previous approaches is our attempt to identify the residues most important for functional regulation before a network is constructed, leading to a much simpler network with the important residues as its nodes. The important residues are identified by monitoring structural data from ensemble molecular dynamics simulations of proteins in different functional states. Our tests show that the new method performs well with overall higher sensitivity than existing approaches in identifying important residues and interactions in tested proteins, so it can be used in studies of protein functions to provide useful hypotheses in identifying key residues and interactions. Supplementary information: Supplementary data are available at Bioinformatics online.

SAAMBE-SEQ: a sequence-based method for predicting mutation effect on protein–protein binding affinity

Bioinformatics ◽

10.1093/bioinformatics/btaa761 ◽

2020 ◽

Author(s):

Gen Li ◽

Swagata Pahari ◽

Adithya Krishna Murthy ◽

Siqi Liang ◽

Robert Fragoza ◽

...

Keyword(s):

Free Energy ◽

Protein Binding ◽

Binding Affinity ◽

Protein Interactions ◽

Structural Information ◽

Binding Free Energy ◽

Supplementary Information ◽

Sequence Information ◽

Protein Protein Interactions ◽

Genome Scale

Abstract. Motivation: The vast majority of human genetic disorders are associated with mutations that affect protein–protein interactions by altering wild-type binding affinity. Therefore, it is extremely important to assess the effect of mutations on protein–protein binding free energy to assist the development of therapeutic solutions. Currently, the most popular approaches use structural information to deliver the predictions, which precludes their application to genome-scale investigations. Indeed, with the progress of genomic sequencing, researchers frequently need to assess the effect of mutations for which there is no structure available. Results: Here, we report a Gradient Boosting Decision Tree machine learning algorithm, SAAMBE-SEQ, which is completely sequence-based and does not require structural information at all. SAAMBE-SEQ utilizes 80 features representing evolutionary information, sequence-based features and change of physical properties upon mutation at the mutation site. The approach is shown to achieve a Pearson correlation coefficient (PCC) of 0.83 in 5-fold cross validation in a benchmarking test against experimentally determined binding free energy change (ΔΔG). Further, a blind test (no-STRUC) is compiled by collecting experimental ΔΔG upon mutation for protein complexes for which no structure is available, and is used to benchmark SAAMBE-SEQ, resulting in a PCC in the range of 0.37–0.46. The accuracy of the SAAMBE-SEQ method is found to be either better than or comparable to the most advanced structure-based methods. SAAMBE-SEQ is very fast, available as a webserver and stand-alone code, and utilizes only sequence information, and thus it is applicable to genome-scale investigations of the effect of mutations on protein–protein interactions. Availability and implementation: SAAMBE-SEQ is available at http://compbio.clemson.edu/saambe_webserver/indexSEQ.php#started. Supplementary information: Supplementary data are available at Bioinformatics online.
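
The benchmark figures in this abstract are Pearson correlation coefficients between predicted and experimental ΔΔG values, reported under 5-fold cross validation. A minimal sketch of both pieces follows. The function names and the interleaved fold assignment are illustrative assumptions; the abstract does not describe how SAAMBE-SEQ assigns folds.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def k_fold_indices(n_samples: int, k: int = 5) -> List[List[int]]:
    """Split sample indices into k interleaved folds (assumed scheme)."""
    return [list(range(start, n_samples, k)) for start in range(k)]


# Each fold serves once as the held-out test set.
print(k_fold_indices(10))
# prints [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]

# Hypothetical predicted versus experimental values, perfectly correlated.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```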

multiPhATE: bioinformatics pipeline for functional annotation of phage isolates

10.1101/551010 ◽

2019 ◽

Author(s):

Carol L. Ecale Zhou ◽

Stephanie Malfatti ◽

Jeffrey Kimbrel ◽

Casandra Philipson ◽

Katelyn McNair ◽

...

Keyword(s):

De Novo ◽

Third Party ◽

Supplementary Information ◽

Modular Construction ◽

Bioinformatics Pipeline ◽

Annotation Pipeline ◽

Phage Gene ◽

Software Documentation ◽

Link Type ◽

Multiple Processors

Abstract. Summary: To address the need for improved phage annotation tools that scale, we created an automated throughput annotation pipeline: multiple-genome Phage Annotation Toolkit and Evaluator (multiPhATE). multiPhATE is a throughput pipeline driver that invokes an annotation pipeline (PhATE) across a user-specified set of phage genomes. This tool incorporates a de novo phage gene-calling algorithm and assigns putative functions to gene calls using protein-, virus-, and phage-centric databases. multiPhATE’s modular construction allows the user to implement all or any portion of the analyses by acquiring local instances of the desired databases and specifying the desired analyses in a configuration file. We demonstrate multiPhATE by annotating two newly sequenced Yersinia pestis phage genomes. Within multiPhATE, the PhATE processing pipeline can be readily implemented across multiple processors, making it adaptable for throughput sequencing projects. Software documentation assists the user in configuring the system. Availability and implementation: multiPhATE was implemented in Python 3.7, and runs as a command-line code under Linux or Unix. multiPhATE is freely available under an open-source BSD3 license from https://github.com/carolzhou/multiPhATE. Instructions for acquiring the databases and third-party codes used by multiPhATE are included in the distribution README file. Users may report bugs by submitting to the GitHub issues page associated with the multiPhATE [email protected] or [email protected] Supplementary information: Data generated during the current study are included as supplementary files available for download at https://github.com/carolzhou/PhATE_docs.
