CRiSP: accurate structure prediction of disulfide-rich peptides with cystine-specific sequence alignment and machine learning

Zi-Lin Liu; Jing-Hao Hu; Fan Jiang; Yun-Dong Wu

doi:10.1093/bioinformatics/btaa193


Bioinformatics ◽

10.1093/bioinformatics/btaa193 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3385-3392

Author(s):

Zi-Lin Liu ◽

Jing-Hao Hu ◽

Fan Jiang ◽

Yun-Dong Wu

Keyword(s):

Machine Learning ◽

Sequence Alignment ◽

Structure Prediction ◽

High Throughput Sequencing ◽

Prediction Method ◽

General Purpose ◽

Supplementary Information ◽

Model Quality ◽

Specific Sequence ◽

Structure Information

Abstract. Motivation: High-throughput sequencing discovers many naturally occurring disulfide-rich peptides, or cystine-rich peptides (CRPs), with diversified bioactivities. However, their structure information, which is very important to peptide drug discovery, is still very limited. Results: We have developed a CRP-specific structure prediction method called Cystine-Rich peptide Structure Prediction (CRiSP), based on a customized template database with cystine-specific sequence alignment and three machine-learning predictors. The modeling accuracy is significantly better than that of several popular general-purpose structure modeling methods, and CRiSP can provide useful model quality estimations. Availability and implementation: The CRiSP server is freely available at http://wulab.com.cn/CRISP. Supplementary information: Supplementary data are available at Bioinformatics online.


Sequence alignment using machine learning for accurate template-based protein structure prediction

Bioinformatics ◽

10.1093/bioinformatics/btz483 ◽

2019 ◽

Vol 36 (1) ◽

pp. 104-111

Author(s):

Shuichiro Makigaki ◽

Takashi Ishida

Keyword(s):

Machine Learning ◽

Structure Prediction ◽

Tertiary Structure ◽

Structural Alignment ◽

Protein Structures ◽

Substitution Matrix ◽

Detection Methods ◽

Supplementary Information ◽

Homology Detection ◽

Sequence Alignments

Abstract. Motivation: Template-based modeling, the process of predicting the tertiary structure of a protein by using homologous protein structures, is useful if good templates can be found. Although modern homology detection methods can find remote homologs with high sensitivity, the accuracy of template-based models generated from homology-detection-based alignments is often lower than that from ideal alignments. Results: In this study, we propose a new method that generates pairwise sequence alignments for more accurate template-based modeling. The proposed method trains a machine learning model using the structural alignment of known homologs. It is difficult to directly predict sequence alignments using machine learning. Thus, when calculating sequence alignments, instead of a fixed substitution matrix, this method dynamically predicts a substitution score from the trained model. We evaluate our method by carefully splitting the training and test datasets and comparing the predicted structure’s accuracy with that of state-of-the-art methods. Our method generates more accurate tertiary structure models than those produced from alignments obtained by other methods. Availability and implementation: https://github.com/shuichiro-makigaki/exmachina. Supplementary information: Supplementary data are available at Bioinformatics online.
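As an illustration of replacing a fixed substitution matrix with a position-dependent score, the following is a minimal sketch of global (Needleman-Wunsch) alignment that takes a scoring callback. It is not the authors' implementation: `global_align`, `toy_scorer`, the linear gap penalty and all numeric values are assumptions made for this example, and `toy_scorer` is only a stand-in for a trained model.

```python
from typing import Callable, List, Tuple


def global_align(
    a: str,
    b: str,
    score: Callable[[int, int], float],
    gap: float = -1.0,
) -> Tuple[float, str, str]:
    """Needleman-Wunsch alignment where score(i, j) rates pairing a[i] with b[j].

    Because the callback receives positions rather than residue letters, it can
    be backed by any predictor that looks at sequence context (hypothetical here).
    """
    n, m = len(a), len(b)
    dp: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Accumulate boundary gap costs so the traceback comparisons are exact.
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(i - 1, j - 1),  # pair a[i-1] with b[j-1]
                dp[i - 1][j] + gap,                      # gap in b
                dp[i][j - 1] + gap,                      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    top: List[str] = []
    bottom: List[str] = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(i - 1, j - 1):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def toy_scorer(a: str, b: str) -> Callable[[int, int], float]:
    """Placeholder for a trained model: +2 for identical residues, -1 otherwise."""
    return lambda i, j: 2.0 if a[i] == b[j] else -1.0


total, top_row, bottom_row = global_align("ACGT", "AGT", toy_scorer("ACGT", "AGT"))
print(total, top_row, bottom_row)
```

Swapping `toy_scorer` for a function that queries a learned model would leave the dynamic programming unchanged, which is the design point the abstract describes.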


SPRING: a next-generation compressor for FASTQ data

Bioinformatics ◽

10.1093/bioinformatics/bty1015 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2674-2676 ◽

Cited By ~ 18

Author(s):

Shubham Chandak ◽

Kedar Tatwawadi ◽

Idoia Ochoa ◽

Mikel Hernaez ◽

Tsachy Weissman

Keyword(s):

High Throughput Sequencing ◽

Random Access ◽

Lossless Compression ◽

General Purpose ◽

Supplementary Information ◽

High Coverage ◽

Sequencing Technologies ◽

Long Read ◽

Previous State ◽

Computational Resources

Abstract. Motivation: High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. Results: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information: Supplementary data are available at Bioinformatics online.


2P071 Generality of the protein structure prediction method based on molecular dynamics simulation with secondary structure information (The 48th Annual Meeting of the Biophysical Society of Japan)

Seibutsu Butsuri ◽

10.2142/biophys.50.s94_4 ◽

2010 ◽

Vol 50 (supplement2) ◽

pp. S94

Author(s):

Tohru Terada ◽

Asahiko Nishimura ◽

Kentaro Shimizu

Keyword(s):

Molecular Dynamics ◽

Molecular Dynamics Simulation ◽

Secondary Structure ◽

Structure Prediction ◽

Prediction Method ◽

Dynamics Simulation ◽

Secondary Structure Information ◽

Structure Information ◽

Biophysical Society ◽

Structure Prediction Method


Large-scale structure prediction by improved contact predictions and model quality assessment

10.1101/128231 ◽

2017 ◽

Cited By ~ 2

Author(s):

Mirco Michel ◽

David Menéndez Hurtado ◽

Karolis Uziela ◽

Arne Elofsson

Keyword(s):

Structure Prediction ◽

Large Scale ◽

Supplementary Information ◽

Model Quality ◽

Contact Maps ◽

Folding Algorithm ◽

Unknown Structure ◽

Supplementary Material ◽

Direct Coupling Analysis ◽

Contact Predictions

Abstract. Motivation: Accurate contact predictions can be used for predicting the structure of proteins. Until recently these methods were limited to very big protein families, decreasing their utility. However, recent progress by combining direct coupling analysis with machine learning methods has made it possible to predict accurate contact maps for smaller families. To what extent these predictions can be used to produce accurate models of the families is not known. Results: We present the PconsFold2 pipeline that uses contact predictions from PconsC3, the CONFOLD folding algorithm and model quality estimations to predict the structure of a protein. We show that the model quality estimation significantly increases the number of models that can reliably be identified. Finally, we apply PconsFold2 to 6379 Pfam families of unknown structure and find that PconsFold2 can, with an estimated 90% specificity, predict the structure of up to 558 Pfam families of unknown structure. Of these, 415 have not been reported before. Availability: Datasets as well as models of all the 558 Pfam families are available at http://c3.pcons.net/. All programs used here are freely available. Supplementary information: No supplementary data.


Prediction of Whole-Cell Transcriptional Response with Machine Learning

Bioinformatics ◽

10.1093/bioinformatics/btab676 ◽

2021 ◽

Author(s):

Mohammed Eslami ◽

Amin Espah-Borujeni ◽

Hamed Eramian ◽

Mark Weston ◽

George Zheng ◽

...

Keyword(s):

Machine Learning ◽

Differential Expression ◽

Regulatory Networks ◽

High Throughput Sequencing ◽

Differential Expression Analysis ◽

Transcriptional Response ◽

Predictive Performance ◽

Supplementary Information ◽

Whole Cell ◽

Using Data

Abstract. Motivation: Applications in synthetic and systems biology can benefit from measuring whole-cell response to biochemical perturbations. Execution of experiments to cover all possible combinations of perturbations is infeasible. In this paper, we present the host response model (HRM), a machine learning approach that maps response of single perturbations to transcriptional response of the combination of perturbations. Results: The HRM combines high-throughput sequencing with machine learning to infer links between experimental context, prior knowledge of cell regulatory networks, and RNASeq data to predict a gene’s dysregulation. We find that the HRM can predict the directionality of dysregulation to a combination of inducers with an accuracy of >90% using data from single inducers. We further find that the use of prior, known cell regulatory networks doubles the predictive performance of the HRM (an R² from 0.3 to 0.65). The model was validated in two organisms, E. coli and B. subtilis, using new experiments conducted post training. Finally, while the HRM is trained on gene expression data, the direct prediction of differential expression makes it possible to also conduct enrichment analyses using its predictions. We show that the HRM can accurately classify >95% of the pathway regulations. The HRM reduces the number of RNASeq experiments needed as responses can be tested in silico to focus experiments. Availability: The HRM software and tutorial are available at https://github.com/sd2e/CDM and the configurable differential expression analysis tools and tutorials are available at https://github.com/SD2E/omics_tools. Supplementary information: Supplementary data are available at Bioinformatics online.


Prediction of mRNA subcellular localization using deep recurrent neural networks

Bioinformatics ◽

10.1093/bioinformatics/btz337 ◽

2019 ◽

Vol 35 (14) ◽

pp. i333-i342 ◽

Cited By ~ 12

Author(s):

Zichao Yan ◽

Eric Lécuyer ◽

Mathieu Blanchette

Keyword(s):

Subcellular Localization ◽

Messenger RNA ◽

RNA Binding ◽

RNA Binding Proteins ◽

Regulatory Elements ◽

Supplementary Information ◽

Specific Sequence ◽

Structure Information ◽

Subcellular Compartments ◽

Sequence Elements

Abstract. Motivation: Messenger RNA subcellular localization mechanisms play a crucial role in post-transcriptional gene regulation. This trafficking is mediated by trans-acting RNA-binding proteins interacting with cis-regulatory elements called zipcodes. While new sequencing-based technologies allow the high-throughput identification of RNAs localized to specific subcellular compartments, the precise mechanisms at play, and their dependency on specific sequence elements, remain poorly understood. Results: We introduce RNATracker, a novel deep neural network built to predict, from their sequence alone, the distributions of mRNA transcripts over a predefined set of subcellular compartments. RNATracker integrates several state-of-the-art deep learning techniques (e.g. CNN, LSTM and attention layers) and can make use of both sequence and secondary structure information. We report on a variety of evaluations showing RNATracker’s strong predictive power, which is significantly superior to a variety of baseline predictors. Despite its complexity, several aspects of the model can be isolated to yield valuable, testable mechanistic hypotheses, and to locate candidate zipcode sequences within transcripts. Availability and implementation: Code and data can be accessed at https://www.github.com/HarveyYan/RNATracker. Supplementary information: Supplementary data are available at Bioinformatics online.


DNN-Dom: predicting protein domain boundary from sequence alone by deep neural network

Bioinformatics ◽

10.1093/bioinformatics/btz464 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5128-5136 ◽

Cited By ~ 3

Author(s):

Qiang Shi ◽

Weiya Chen ◽

Siqi Huang ◽

Fanglin Jin ◽

Yinghao Dong ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Structure Prediction ◽

Domain Boundary ◽

Protein Domain ◽

Supplementary Information ◽

High Dimensions ◽

Long Range Interactions ◽

Domain Boundary Prediction ◽

And Function

Abstract. Motivation: Accurate delineation of protein domain boundaries plays an important role in protein engineering and structure prediction. Although machine-learning methods are widely used to predict domain boundaries, these approaches often ignore long-range interactions among residues, which have been proven to improve the prediction performance. However, how to simultaneously model the local and global interactions to further improve domain boundary prediction is still a challenging problem. Results: This article employs a hybrid deep learning method that combines convolutional neural network and gated recurrent unit models for domain boundary prediction. It not only captures the local and non-local interactions, but also fuses these features for prediction. Additionally, we adopt balanced Random Forest for classification to deal with the high imbalance of samples and the high dimensionality of deep features. Experimental results show that our proposed approach (DNN-Dom) outperforms existing machine-learning-based methods for boundary prediction. We expect that DNN-Dom can be useful for assisting protein structure and function prediction. Availability and implementation: The method is available as DNN-Dom Server at http://isyslab.info/DNN-Dom/. Supplementary information: Supplementary data are available at Bioinformatics online.


Improved estimation of model quality using predicted inter-residue distance

Bioinformatics ◽

10.1093/bioinformatics/btab632 ◽

2021 ◽

Author(s):

Lisha Ye ◽

Peikun Wu ◽

Zhenling Peng ◽

Jianzhao Gao ◽

Jian Liu ◽

...

Keyword(s):

Protein Structure ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Superior Performance ◽

Supplementary Information ◽

Prediction Algorithm ◽

Structure Model ◽

Single Model ◽

Model Quality ◽

Reference Models

Abstract. Motivation: Protein model quality assessment (QA) is an essential component in protein structure prediction, which aims to estimate the quality of a structure model and/or select the most accurate model from a pool of structure models, without knowing the native structure. QA remains a challenging task in protein structure prediction. Results: Based on the inter-residue distance predicted by the recent deep learning-based structure prediction algorithm trRosetta, we developed QDistance, a new approach to the estimation of both global and local qualities. QDistance works for both single-model and multi-model inputs. We designed several distance-based features to assess the agreement between the predicted and model-derived inter-residue distances. Together with a few widely used features, they are fed into a simple yet powerful linear regression model to infer the global QA scores. The local QA scores for each structure model are predicted based on a comparative analysis with a set of selected reference models. For multi-model input, the reference models are selected from the input based on the predicted global QA scores. For single-model input, the reference models are predicted by trRosetta. With the informative distance-based features, QDistance can predict the global quality with satisfactory accuracy. Benchmark tests on the CASP13 and the CAMEO structure models suggested that QDistance was competitive with other methods. Blind tests in the CASP14 experiments showed that QDistance was robust and ranked among the top predictors. In particular, QDistance was a top-three local QA method and made the most accurate local QA predictions for unreliable local regions. Analysis showed that this superior performance can be attributed to the inclusion of the predicted inter-residue distance. Availability and implementation: http://yanglab.nankai.edu.cn/QDistance. Supplementary information: Supplementary data are available at Bioinformatics online.
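To make the idea of a distance-agreement feature concrete, here is a minimal sketch that compares a set of predicted inter-residue distances with the distances measured in a candidate model. It is not the QDistance feature set, which the abstract does not specify: the function names, the use of one coordinate per residue, and the 1.0 Å tolerance are all assumptions chosen for illustration.

```python
import math
from typing import Dict, Sequence, Tuple

Coord = Tuple[float, float, float]
Pair = Tuple[int, int]


def model_distances(coords: Sequence[Coord]) -> Dict[Pair, float]:
    """Euclidean distance for every residue pair (i < j) in a structure model."""
    return {
        (i, j): math.dist(coords[i], coords[j])
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
    }


def distance_agreement(
    predicted: Dict[Pair, float],
    coords: Sequence[Coord],
    tolerance: float = 1.0,  # hypothetical cutoff in angstroms
) -> Tuple[float, float]:
    """Return (mean absolute error, fraction of pairs within tolerance)."""
    observed = model_distances(coords)
    errors = [abs(observed[p] - d) for p, d in predicted.items() if p in observed]
    if not errors:
        raise ValueError("no residue pairs shared by prediction and model")
    mae = sum(errors) / len(errors)
    within = sum(e <= tolerance for e in errors) / len(errors)
    return mae, within


# Three residues forming a 3-4-5 right triangle, with made-up predicted distances.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
predicted = {(0, 1): 3.0, (0, 2): 7.0, (1, 2): 4.5}
print(distance_agreement(predicted, coords))
```

Features of this kind could then serve as inputs to a regression model for a global score; the regression step itself is omitted here.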


Deriving topology and sequence alignment for the helix skeleton in low-resolution protein density maps

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720008003357 ◽

2008 ◽

Vol 06 (01) ◽

pp. 183-201 ◽

Cited By ~ 13

Author(s):

Yonggang Lu ◽

Jing He ◽

Charlie E. M. Strauss

Keyword(s):

Sequence Alignment ◽

Structure Prediction ◽

Protein Complexes ◽

3D Structure ◽

Three Dimensional ◽

Prediction Method ◽

Density Maps ◽

Ab Initio Structure ◽

Geometrical Information ◽

Protein Density

Cryoelectron microscopy (cryoEM) is an experimental technique to determine the three-dimensional (3D) structure of large protein complexes. Currently, this technique is able to generate protein density maps at 6–9 Å resolution, at which the skeleton of the structure (which is composed of α-helices and β-sheets) can be visualized. As a step towards predicting the entire backbone of the protein from the protein density map, we developed a method to predict the topology and sequence alignment for the skeleton helices. Our method combines the geometrical information of the skeleton helices with the Rosetta ab initio structure prediction method to derive a consensus topology and sequence alignment for the skeleton helices. We tested the method with 60 proteins. For 45 proteins, the majority of the skeleton helices were assigned a correct topology from one of our top ten predictions. The offsets of the alignment for most of the assigned helices were within ±2 amino acids in the sequence. We also analyzed the use of the skeleton helices as a clustering tool for the decoy structures generated by Rosetta. Our comparison suggests that the topology clustering is a better method than a general overlap clustering method to enrich the ranking of decoys, particularly when the decoy pool is small.


QDeep: distance-based protein model quality estimation by residue-level ensemble error classifications using stacked deep residual neural networks

Bioinformatics ◽

10.1093/bioinformatics/btaa455 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i285-i291 ◽

Cited By ~ 1

Author(s):

Md Hossain Shuvo ◽

Sutanu Bhattacharya ◽

Debswapna Bhattacharya

Keyword(s):

Neural Networks ◽

Protein Structure ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Residue Level ◽

Supplementary Information ◽

Model Quality ◽

Quality Estimation ◽

Distance Information ◽

Protein Model

Abstract. Motivation: Protein model quality estimation, in many ways, informs protein structure prediction. Despite their tight coupling, existing model quality estimation methods do not leverage inter-residue distance information or the latest technological breakthrough in deep learning that has recently revolutionized protein structure prediction. Results: We present a new distance-based single-model quality estimation method called QDeep by harnessing the power of stacked deep residual neural networks (ResNets). Our method first employs stacked deep ResNets to perform residue-level ensemble error classifications at multiple predefined error thresholds, and then combines the predictions from the individual error classifiers for estimating the quality of a protein structural model. Experimental results show that our method consistently outperforms existing state-of-the-art methods including ProQ2, ProQ3, ProQ3D, ProQ4, 3DCNN, MESHI, and VoroMQA in multiple independent test datasets across a wide range of accuracy measures; and that predicted distance information significantly contributes to the improved performance of QDeep. Availability and implementation: https://github.com/Bhattacharya-Lab/QDeep. Supplementary information: Supplementary data are available at Bioinformatics online.
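The combination step described above can be pictured with a small sketch: each classifier outputs, per residue, the probability that the residue's error falls below one threshold, and the outputs are merged into per-residue and global scores. The abstract does not state how QDeep combines them, so the unweighted averaging below, the function name `combine_classifiers`, and the threshold values used as dictionary keys are assumptions for illustration only.

```python
from statistics import fmean
from typing import Dict, List, Tuple


def combine_classifiers(
    probs_by_threshold: Dict[float, List[float]],
) -> Tuple[List[float], float]:
    """Average per-threshold probabilities into per-residue and global scores.

    probs_by_threshold maps an error threshold (in angstroms) to one
    probability per residue that the residue's error is below that threshold.
    """
    lengths = {len(p) for p in probs_by_threshold.values()}
    if len(lengths) != 1:
        raise ValueError("every classifier must score the same residues")
    n_residues = lengths.pop()
    # Per-residue score: mean over the individual threshold classifiers.
    per_residue = [
        fmean(p[i] for p in probs_by_threshold.values()) for i in range(n_residues)
    ]
    # Global score: mean of the per-residue scores.
    return per_residue, fmean(per_residue)


# Made-up outputs for a two-residue model at four illustrative thresholds.
example = {
    1.0: [0.2, 0.8],
    2.0: [0.4, 1.0],
    4.0: [0.6, 1.0],
    8.0: [1.0, 1.0],
}
print(combine_classifiers(example))
```

Averaging over several thresholds rewards residues that are accurate at tight cutoffs while still giving partial credit at loose ones; a weighted scheme would be an equally plausible choice.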
