scholarly journals QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution

2021 ◽  
Author(s):  
Bui Quang Minh ◽  
Cuong Cao Dang ◽  
Le Sy Vinh ◽  
Robert Lanfear

Abstract Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this article, we propose QMaker, a new ML method to estimate a general time-reversible $Q$ matrix from a large protein data set consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.[Amino acid replacement matrices; amino acid substitution models; maximum likelihood estimation; phylogenetic inferences.]

Author(s):  
Bui Quang Minh ◽  
Cuong Cao Dang ◽  
Le Sy Vinh ◽  
Robert Lanfear

AbstractAmino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models, however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.


2021 ◽  
Author(s):  
Cuong Cao Dang ◽  
Bui Quang Minh ◽  
Hanon McShea ◽  
Joanna Masel ◽  
Jennifer Eleanor James ◽  
...  

Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this paper, we introduce a maximum likelihood approach nQMaker, an extension of the recently published QMaker method, that allows the estimation of time non-reversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the non-reversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of datasets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the dataset. Notably, for the recently published plant and bird trees, these non-reversible models correctly recovered the commonly known root placements with very high statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software (http://www.iqtree.org), allowing users to estimate non-reversible models and rooted phylogenies from their own protein datasets.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Juan C. Muñoz-Escalante ◽  
Andreu Comas-García ◽  
Sofía Bernal-Silva ◽  
Daniel E. Noyola

AbstractRespiratory syncytial virus (RSV) is a major cause of respiratory infections and is classified in two main groups, RSV-A and RSV-B, with multiple genotypes within each of them. For RSV-B, more than 30 genotypes have been described, without consensus on their definition. The lack of genotype assignation criteria has a direct impact on viral evolution understanding, development of viral detection methods as well as vaccines design. Here we analyzed the totality of complete RSV-B G gene ectodomain sequences published in GenBank until September 2018 (n = 2190) including 478 complete genome sequences using maximum likelihood and Bayesian phylogenetic analyses, as well as intergenotypic and intragenotypic distance matrices, in order to generate a systematic genotype assignation. Individual RSV-B genes were also assessed using maximum likelihood phylogenetic analyses and multiple sequence alignments were used to identify molecular markers associated to specific genotypes. Analyses of the complete G gene ectodomain region, sequences clustering patterns, and the presence of molecular markers of each individual gene indicate that the 37 previously described genotypes can be classified into fifteen distinct genotypes: BA, BA-C, BA-CC, CB1-THB, GB1-GB4, GB6, JAB1-NZB2, SAB1, SAB2, SAB4, URU2 and a novel early circulating genotype characterized in the present study and designated GB0.


Author(s):  
Renganayaki G. ◽  
Achuthsankar S. Nair

Sequence alignment algorithms and  database search methods use BLOSUM and PAM substitution matrices constructed from general proteins. These de facto matrices are not optimal to align sequences accurately, for the proteins with markedly different compositional bias in the amino acid.   In this work, a new amino acid substitution matrix is calculated for the disorder and low complexity rich region of Hub proteins, based on residue characteristics. Insights into the amino acid background frequencies and the substitution scores obtained from the Hubsm unveils the  residue substitution patterns which differs from commonly used scoring matrices .When comparing the Hub protein sequences for detecting homologs,  the use of this Hubsm matrix yields better results than PAM and BLOSUM matrices. Usage of Hubsm matrix can be optimal in database search and for the construction of more accurate sequence alignments of Hub proteins.


2021 ◽  
Author(s):  
Zaynab Shaik ◽  
Nicola Georgina Bergh ◽  
Bengt Oxelman ◽  
Anthony George Verboom

We applied species delimitation methods based on the Multi-Species Coalescent (MSC) model to 500+ loci derived from genotyping-by-sequencing on the South African Seriphium plumosum (Asteraceae) species complex. The loci were represented either as multiple sequence alignments or single nucleotide polymorphisms (SNPs), and analysed by the STACEY and Bayes Factor Delimitation (BFD)/SNAPP methods, respectively. Both methods supported species taxonomies where virtually all of the 32 sampled individuals, each representing its own geographical population, were identified as separate species. Computational efforts required to achieve adequate mixing of MCMC chains were considerable, and the species/minimal cluster trees identified similar strongly supported clades in replicate runs. The resolution was, however, higher in the STACEY trees than in the SNAPP trees, which is consistent with the higher information content of full sequences. The computational efficiency, measured as effective sample sizes of likelihood and posterior estimates per time unit, was consistently higher for STACEY. A random subset of 56 alignments had similar resolution to the 524-locus SNP data set. The STRUCTURE-like sparse Non-negative Matrix Factorisation (sNMF) method was applied to six individuals from each of 48 geographical populations and 28023 SNPs. Significantly fewer (13) clusters were identified as optimal by this analysis compared to the MSC methods. The sNMF clusters correspond closely to clades consistently supported by MSC methods, and showed evidence of admixture, especially in the western Cape Floristic Region. We discuss the significance of these findings, and conclude that it is important to a priori consider the kind of species one wants to identify when using genome-scale data, the assumptions behind the parametric models applied, and the potential consequences of model violations may have.


Zootaxa ◽  
2021 ◽  
Vol 4995 (2) ◽  
pp. 334-344
Author(s):  
QIAN ZHOU ◽  
FAHUI TANG ◽  
YUANJUN ZHAO

During a survey of parasitic ciliates in Chongqing, China, Trichodina matsu Basson & Van As, 1994 was isolated from gills of Tachysurus fulvidraco. Furthermore, the 18S rRNA gene and ITS-5.8S rRNA region of T. matsu were sequenced for the first time and applied for the species identification and comparison with similar species in the present study. Based on the morphological and molecular comparisons, the results indicate that T. matsu is an ectoparasite specific for the Siluriformes catfish. Based on the analyses of genetic distance, multiple sequence alignments, and phylogenetic analyses, no obvious differentiation within populations of T. matsu was found. In addition, the ‘Trichodina hyperparasitis’ (KX904933) in GenBank is a misidentification and appears to be conspecific with T. matsu according to the comparison of morphological and molecular data.  


2004 ◽  
Vol 02 (04) ◽  
pp. 719-745 ◽  
Author(s):  
ARUN SIDDHARTH KONAGURTHU ◽  
JAMES WHISSTOCK ◽  
PETER J. STUCKEY

In this paper we demonstrate a practical approach to construct progressive multiple alignments using sequence triplet optimizations rather than a conventional pairwise approach. Using the sequence triplet alignments progressively provides a scope for the synthesis of a three-residue exchange amino acid substitution matrix. We develop such a 20×20×20 matrix for the first time and demonstrate how its use in optimal sequence triplet alignments increases the sensitivity of building multiple alignments. Various comparisons were made between alignments generated using the progressive triplet methods and the conventional progressive pairwise procedure. The assessment of these data reveal that, in general, the triplet based approaches generate more accurate sequence alignments than the traditional pairwise based procedures, especially between more divergent sets of sequences.


2004 ◽  
Vol 85 (8) ◽  
pp. 2191-2197 ◽  
Author(s):  
Tomoko Ogawa ◽  
Yoshimi Tomita ◽  
Mineyuki Okada ◽  
Kuniko Shinozaki ◽  
Hiroko Kubonoya ◽  
...  

To investigate the prevalence of bovine papillomavirus (BPV) in bovine papilloma and healthy skin, DNA extracted from teat papillomas and healthy teat skin swabs was analysed by PCR using the primer pairs FAP59/FAP64 and MY09/MY11. Papillomavirus (PV) DNA was detected in all 15 papilloma specimens using FAP59/FAP64 and in 8 of the 15 papilloma specimens using MY09/MY11. In swab samples, 21 and 8 of the 122 samples were PV DNA positive using FAP59/FAP64 and MY09/MY11, respectively. Four BPV types (BPV-1, -3, -5 and -6), two previously identified putative BPV types (BAA1 and -5) and 11 putative new PV types (designated BAPV1 to -10 and BAPV11MY) were found in the 39 PV DNA-positive samples. Amino acid sequence alignments of the putative new PV types with reported BPVs and phylogenetic analyses of the putative new PV types with human and animal PV types showed that BAPV1 to -10 and BAPV11MY are putative new BPV types. These results also showed the genomic diversity and extent of subclinical infection of BPV.


2015 ◽  
Author(s):  
Xiaolong Wang ◽  
Chao Yang

Multiple sequence alignment (MSA) is widely used to reveal structural and functional changes leading to genetic differences among species, and to reconstruct evolutionary histories of related genes, proteins and genomes. Traditionally, proteins and their coding sequences (CDSs) are aligned and analyzed separately, but often drastically different conclusions were drawn on a same set of data. Here we present a new alignment strategy, Codon and Amino Acid Unified Sequence Alignment (CAUSA) 2.0, which aligns proteins and their coding sequences simultaneously. CAUSA 2.0 optimizes the alignment of CDSs at both codon and amino acid level efficiently. Theoretical analysis showed that CAUSA 2.0 enhances the entropy information content of MSA. Empirical data analysis demonstrated that CAUSA 2.0 is more accurate and consistent than nucleotide, protein or codon level alignments. CAUSA 2.0 locates in-frame indels more accurately, makes the alignment of coding sequences biologically more significant, and reveals several novel mutation mechanisms that relate to some genetic diseases. CAUSA 2.0 is available in website www.DNAPlusPro.com .


Sign in / Sign up

Export Citation Format

Share Document