Comprehensive Study on Enhancing Low-Quality Position-Specific Scoring Matrix with Deep Learning for Accurate Protein Structure Property Prediction: Using Bagging Multiple Sequence Alignment Learning

2021 ◽  
Vol 28 (4) ◽  
pp. 346-361
Author(s):  
Yuzhi Guo ◽  
Jiaxiang Wu ◽  
Hehuan Ma ◽  
Sheng Wang ◽  
Junzhou Huang
2020 ◽  
Author(s):  
Fusong Ju ◽  
Jianwei Zhu ◽  
Bin Shao ◽  
Lupeng Kong ◽  
Tie-Yan Liu ◽  
...  

Protein functions are largely determined by the final details of their tertiary structures, and the structures can be accurately reconstructed from inter-residue distances. Residue co-evolution has become the primary principle for estimating inter-residue distances, since residues in close spatial proximity tend to co-evolve. The widely used approaches infer residue co-evolution with an indirect strategy: they first extract handcrafted features, such as a covariance matrix, from the multiple sequence alignment (MSA) of the query protein, and then infer residue co-evolution from these features rather than from the raw information carried by the MSA. This indirect strategy always leads to considerable information loss and inaccurate estimation of inter-residue distances. Here, we report a deep neural network framework (called CopulaNet) to learn residue co-evolution directly from the MSA without any handcrafted features. CopulaNet consists of two key elements: i) an encoder to model context-specific mutation for each residue, and ii) an aggregator to model correlations among residues and thereafter infer residue co-evolution. Using the CASP13 (the 13th Critical Assessment of Protein Structure Prediction) target proteins as representatives, we demonstrated the successful application of CopulaNet for estimating inter-residue distances and further predicting protein tertiary structure with improved accuracy and efficiency. Head-to-head comparison suggested that for 24 out of the 31 free modeling CASP13 domains, ProFOLD outperformed AlphaFold, one of the state-of-the-art prediction approaches.
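To make the "handcrafted feature" concrete, the following is a minimal sketch of the kind of covariance statistic the indirect strategy extracts from an MSA. It is an illustration only, not CopulaNet and not the authors' code: the function name `column_covariance` and the toy alignments are hypothetical, and the formula is the standard one-hot covariance f_ij(a,b) - f_i(a) f_j(b) between two alignment columns.

```python
from collections import Counter


def column_covariance(msa, i, j):
    """Covariance between residue indicators in columns i and j of an MSA.

    For each residue a seen in column i and b seen in column j, returns
    f_ij(a, b) - f_i(a) * f_j(b), where f denotes observed frequencies.
    `msa` is a list of equal-length aligned sequences (hypothetical input).
    """
    n = len(msa)
    fi = Counter(seq[i] for seq in msa)              # single-column counts
    fj = Counter(seq[j] for seq in msa)
    fij = Counter((seq[i], seq[j]) for seq in msa)   # joint counts
    return {
        (a, b): fij.get((a, b), 0) / n - (fi[a] / n) * (fj[b] / n)
        for a in fi
        for b in fj
    }


# Two perfectly coupled columns versus two independent columns.
coupled = ["AK", "AK", "DE", "DE"]
independent = ["AK", "AE", "DK", "DE"]
cov_coupled = column_covariance(coupled, 0, 1)
cov_independent = column_covariance(independent, 0, 1)
```

A summary of this kind keeps only pairwise frequencies, so any context beyond the two columns is discarded; this is the information loss the abstract attributes to the indirect strategy.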


2012 ◽  
Vol 2012 ◽  
pp. 1-9 ◽  
Author(s):  
Jian-Jun Shu ◽  
Kian Yan Yong ◽  
Weng Kong Chan

Multiple sequence alignment is typically performed by maximizing the information content computed from a weight matrix, but two or more alignments can share the same highest score, which makes the choice of the best alignment ambiguous. This paper addresses the issue by introducing the concept of a joint weight matrix to eliminate the randomness in selecting the best multiple sequence alignment. Alignments with equal scores are iteratively rescored with joint weight matrices of increasing level (nucleotide pairs, triplets, and so on) until a single best alignment is found. This method for resolving ambiguity in multiple sequence alignment can be easily implemented using the improved scoring matrix.
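The iterative tie-breaking described above can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the scoring formula (relative entropy against a uniform background over windows of adjacent columns), the function names `information_content` and `pick_best_alignment`, and the toy alignments are all assumptions made for the example.

```python
import math
from collections import Counter


def information_content(alignment, level=1, alphabet_size=4):
    """Information content (bits) summed over windows of `level` adjacent columns.

    Level 1 uses single nucleotides, level 2 nucleotide pairs, and so on.
    A uniform background distribution is assumed.
    """
    n_seqs = len(alignment)
    width = len(alignment[0])
    background = (1.0 / alphabet_size) ** level
    total = 0.0
    for start in range(width - level + 1):
        counts = Counter(seq[start:start + level] for seq in alignment)
        for count in counts.values():
            freq = count / n_seqs
            total += freq * math.log2(freq / background)
    return total


def pick_best_alignment(candidates, tol=1e-9):
    """Keep the top-scoring alignments, raising the level while ties remain."""
    width = len(candidates[0][0])
    remaining = list(candidates)
    for level in range(1, width + 1):
        scores = [information_content(a, level) for a in remaining]
        best = max(scores)
        remaining = [a for a, s in zip(remaining, scores) if best - s <= tol]
        if len(remaining) == 1:
            break
    return remaining  # more than one entry means the tie could not be resolved


# Both candidates score 2.0 bits at level 1; level 2 separates them.
align_a = ["AC", "AC", "GT", "GT"]
align_b = ["AC", "AT", "GC", "GT"]
winner = pick_best_alignment([align_b, align_a])
```

In this toy case the single-column frequencies are identical, so only the pair-level statistics reveal that the columns of `align_a` vary together.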


2021 ◽  
Author(s):  
Liang Hong ◽  
Siqi Sun ◽  
Liangzhen Zheng ◽  
Qingxiong Tan ◽  
Yu Li

Evolutionarily related sequences provide information about protein structure and function. Multiple sequence alignment, which includes homolog searching from large databases and sequence alignment, is an effective way to extract this information and assist protein structure and function prediction, as demonstrated by AlphaFold. Despite the existing tools for multiple sequence alignment, searching for homologs across the entire UniProt is still time-consuming. Considering the success of AlphaFold, large-scale multiple sequence alignment against massive databases will foreseeably become a trend in the field, so it is very desirable to accelerate this step. Here, we propose a novel method, fastMSA, to improve the speed significantly. Our idea is orthogonal to all previous acceleration methods. Taking advantage of a protein language model based on BERT, we propose a novel dual encoder architecture that embeds protein sequences into a low-dimensional space and efficiently filters out unrelated sequences before running BLAST. Extensive experimental results suggest that we can recall most of the homologs with a 34-fold speed-up. Moreover, our method is compatible with downstream tasks, such as structure prediction using AlphaFold. Using multiple sequence alignments generated by our method, we observe little loss in protein structure prediction performance with much less running time. fastMSA will effectively assist protein sequence, structure, and function analysis based on homologs and multiple sequence alignment.
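The embed-then-filter idea can be sketched in a few lines. This is not fastMSA: the learned BERT-based dual encoder is replaced here by a toy amino-acid composition vector, and the names `embed`, `prefilter`, and the example sequences are hypothetical. The sketch only shows the general shape of the pipeline, in which cheap vector similarity selects a short list of candidates before an expensive aligner is run on them.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Toy stand-in for a learned encoder: unit-length composition vector."""
    counts = Counter(sequence)
    vec = [counts.get(aa, 0) for aa in AMINO_ACIDS]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard against empty input
    return [v / norm for v in vec]


def prefilter(query, database, top_k):
    """Return the names of the top_k database sequences by cosine similarity."""
    q = embed(query)
    scored = []
    for name, seq in database.items():
        similarity = sum(a * b for a, b in zip(q, embed(seq)))
        scored.append((similarity, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


database = {
    "hom": "MKVLAAGIV",
    "near": "MKVLAAGIVG",
    "far": "WWWWPPPP",
}
candidates = prefilter("MKVLAAGIV", database, top_k=2)
# Only `candidates` would be handed to the alignment step.
```

In a real system the database embeddings would be computed once and indexed, so that only the query needs to be embedded at search time.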


1995 ◽  
Vol 11 (1) ◽  
pp. 13-18 ◽  
Author(s):  
Makoto Hirosawa ◽  
Yasushi Totoki ◽  
Masaki Hoshida ◽  
Masato Ishikawa

2020 ◽  
Author(s):  
Rishav Dasgupta ◽  
Arpit Kumar Pradhan ◽  
Shyamasree Ghosh

Abstract

Mycobacterium is a genus of Actinobacteria known to be responsible for several deadly diseases in both humans and animals, including tuberculosis. Luciferase is the primary protein in Mycobacteria that plays a role in bioluminescence. In some bacteria it also serves as a source of energy transfer, as in the case of lumazine proteins. Although different bacterial luciferases have been studied, there have been hardly any structural studies on the luciferase expressed in Mycobacterium sp. EPa45. Therefore, in this paper we study the luciferase expressed in Mycobacterium sp. EPa45 by in silico analysis of its structure from its protein sequence. We report the differences observed relative to luciferases from other strains of mycobacteria and from pathogenic and non-pathogenic bacteria in terms of their (i) physicochemical characteristics, (ii) protein structure, (iii) multiple sequence alignment and (iv) phylogenetic relationships. We report for the first time the relationship of the luciferase from this specific strain to those of mycobacteria and of bacteria at large.

Highlights

Mycobacterium sp. EPa45 shows similar characteristics to pathogenic mycobacterium
Analysis of Luciferase sequence and protein qualities provides insight to pathogenicity
The deadly nature of infectious mycobacterium, especially with luciferase sequences similar to Mycobacterium sp. EPa45, is analyzed

