MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping

2021 ◽  
Author(s):  
Robert C Edgar

Phylogenetic tree confidence is often estimated from a multiple sequence alignment (MSA) using the Felsenstein bootstrap heuristic. However, this does not account for systematic errors in the MSA, which may cause substantial bias to the inferred phylogeny. Here, I describe the MSA ensemble bootstrap, a new procedure which generates a set of replicate MSAs by varying parameters such as gap penalties and substitution scores. Such an ensemble is called diagnostic if the typical distance between MSAs is comparable to the error rate. Confidence in a prediction derived from an MSA, e.g. a monophyletic clade, is expressed as the fraction of the ensemble where the prediction is reproduced. This approach is implemented in MUSCLE by modifying the Probcons algorithm, which is based on a hidden Markov model (HMM). An ensemble is generated by perturbing HMM parameters and permuting the guide tree. Ensembles generated by this method are shown to be diagnostic on the Balibase benchmark. To enable scaling to large datasets, divide-and-conquer heuristics are introduced. A new benchmark (Balifam) is described with 36 sets of 10000+ proteins. On Balifam, ensembles generated by MUSCLE are shown to align an average of 59% of columns correctly, 13% better than Clustal-omega (52% correct) and 26% better than MAFFT (47% correct). The ensemble bootstrap is applied to a previously published tree of RNA viruses, showing that the high reported Felsenstein bootstrap confidence of Ribovirus phylum branching order is an artifact of systematic MSA errors.
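The confidence measure described in this abstract (the fraction of the ensemble in which a prediction is reproduced) can be sketched in a few lines. This is a minimal illustration, not MUSCLE's implementation: it assumes each replicate tree has already been reduced to a set of clades, and the names `ensemble_confidence` and `ensemble`, along with the toy taxa, are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set

Clade = FrozenSet[str]


def ensemble_confidence(clade: Iterable[str], replicate_clades: List[Set[Clade]]) -> float:
    """Fraction of ensemble replicates whose tree contains `clade`."""
    if not replicate_clades:
        raise ValueError("ensemble must contain at least one replicate")
    target = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if target in clades)
    return hits / len(replicate_clades)


# Hypothetical ensemble: four replicate trees, each given as its set of clades.
ensemble = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
    {frozenset({"A", "C"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "D"})},
]

# The clade {A, B} appears in 3 of the 4 replicates.
print(ensemble_confidence({"A", "B"}, ensemble))  # 0.75
```

In practice each replicate's clade set would come from a tree inferred on one replicate MSA; that step is outside this sketch.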

1998 ◽  
Vol 11 (1) ◽  
pp. 551-551
Author(s):  
N. Zacharias ◽  
M.I. Zacharias ◽  
C. de Vegt ◽  
C.A. Murray

The Second Cape Photographic Catalog (CPC2) contains 276,131 stars covering the entire Southern Hemisphere in a 4-fold overlap pattern. Its mean epoch is 1968, which makes it a key catalog for proper motions. A new reduction of the 5687 plates using on average 40 Hipparcos stars per plate has resulted in a vastly improved catalog with a positional accuracy of about 40 mas (median value) per coordinate, which comes very close to the measuring precision. In particular, for the first time systematic errors depending on magnitude and color can be solved unambiguously and have been removed from the catalog. In combination with the Tycho Catalogue (mean epoch 1991.25) and the upcoming U.S. Naval Observatory CCD Astrograph Catalog (UCAC) project, proper motions better than 2 mas/yr can be obtained. This will lead to a vastly improved reference star catalog in the Southern Hemisphere for the final Astrographic Catalogue (AC) reductions, which will then provide proper motions for millions of stars when combined with new epoch data. These data will then allow an uncompromised reduction of the southern Schmidt surveys on the International Celestial Reference System (ICRS).
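The link between positional accuracy, epoch difference, and proper-motion precision can be illustrated with a short calculation. This is a rough sketch that assumes independent positional errors at two epochs and ignores systematic errors; the function name `proper_motion_error` is hypothetical, and the second-epoch error is set to zero only to obtain a lower bound, since the abstract does not state it.

```python
import math


def proper_motion_error(sigma_old_mas: float, sigma_new_mas: float,
                        epoch_old: float, epoch_new: float) -> float:
    """Formal error (mas/yr, per coordinate) of a two-epoch proper motion.

    Assumes the two positional errors are independent and adds them in
    quadrature before dividing by the epoch difference.
    """
    baseline = epoch_new - epoch_old
    if baseline <= 0:
        raise ValueError("epoch_new must be later than epoch_old")
    return math.hypot(sigma_old_mas, sigma_new_mas) / baseline


# Figures from the abstract: about 40 mas at mean epoch 1968, second epoch 1991.25.
# A second-epoch error of 0 mas is an assumption that yields a lower bound only.
floor = proper_motion_error(40.0, 0.0, 1968.0, 1991.25)
print(round(floor, 2))  # 1.72
```

Any nonzero second-epoch error raises this figure, so the 2 mas/yr target depends on the accuracy of the later catalog.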


Solid Earth ◽  
2016 ◽  
Vol 7 (4) ◽  
pp. 1157-1169 ◽  
Author(s):  
Paul W. J. Glover

Abstract. When scientists apply Archie's first law they often include an extra parameter a, which was introduced by Winsauer et al. (1952) about 10 years after the equation's first publication, and which is sometimes called the “tortuosity” or “lithology” parameter. This parameter is not, however, theoretically justified. Paradoxically, the Winsauer et al. (1952) form of Archie's law often performs better than the original, more theoretically correct version. The difference in the cementation exponent calculated from these two forms of Archie's law is important, and can lead to a misestimation of reserves by at least 20 % for typical reservoir parameter values. We have examined the apparent paradox, and conclude that while the theoretical form of the law is correct, the data that we have been analysing with Archie's law have been in error. There are at least three types of systematic error that are present in most measurements: (i) a porosity error, (ii) a pore fluid salinity error, and (iii) a temperature error. Each of these systematic errors is sufficient to ensure that a non-unity value of the parameter a is required in order to fit the electrical data well. Fortunately, the inclusion of this parameter in the fit has compensated for the presence of the systematic errors in the electrical and porosity data, leading to a value of cementation exponent that is correct. The exceptions are those cementation exponents that have been calculated for individual core plugs. We make a number of recommendations for reducing the systematic errors that contribute to the problem and suggest that the value of the parameter a may now be used as an indication of data quality.
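The two forms of the law discussed here differ only in the prefactor: the original form relates formation factor F to porosity as F = porosity^(-m), while the Winsauer et al. form is F = a * porosity^(-m). The sketch below shows how the cementation exponent m inferred for a single sample shifts with the assumed a. The function names and sample values are hypothetical and are not taken from the paper.

```python
import math


def formation_factor(porosity: float, m: float, a: float = 1.0) -> float:
    """F = a * porosity**(-m); a = 1 gives the original form of Archie's first law."""
    return a * porosity ** (-m)


def cementation_exponent(F: float, porosity: float, a: float = 1.0) -> float:
    """Invert F = a * porosity**(-m) for the cementation exponent m."""
    if not 0.0 < porosity < 1.0:
        raise ValueError("porosity must lie strictly between 0 and 1")
    if F <= 0.0 or a <= 0.0:
        raise ValueError("F and a must be positive")
    return math.log(a / F) / math.log(porosity)


# Hypothetical single-plug measurement: porosity 0.2, formation factor 25.
porosity, F = 0.2, 25.0
m_archie = cementation_exponent(F, porosity)            # a fixed at 1
m_winsauer = cementation_exponent(F, porosity, a=0.8)   # illustrative non-unity a
print(m_archie, m_winsauer)
```

For one sample, a and m cannot be separated, which is consistent with the abstract's caveat about exponents calculated for individual core plugs; fitting both parameters requires a set of samples spanning a range of porosities.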


2021 ◽  
Author(s):  
Céline Marquet ◽  
Michael Heinzinger ◽  
Tobias Olenyi ◽  
Christian Dallago ◽  
Michael Bernhofer ◽  
...  

Abstract. The emergence of SARS-CoV-2 variants underscored the demand for tools that can interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (LMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or marked amino acids from the context of entire sequence regions. Here, we explored how to benefit from learned protein LM representations (embeddings) to predict SAV effects. Although we have failed so far to predict SAV effects directly from embeddings, this input alone predicted residue conservation almost as accurately from single sequences as using multiple sequence alignments (MSAs), with a two-state per-residue accuracy (conserved/not) of Q2=80% (embeddings) vs. 81% (ConSeq). Considering all SAVs at all residue positions predicted as conserved to affect function reached Q2=68.6% (effect/neutral; for PMD) without optimization, compared to Q2=69.8% for an expert solution such as SNAP2. Combining predicted conservation with BLOSUM62 to obtain variant-specific binary predictions, DMS experiments of four human proteins were predicted better than by SNAP2, and better than by applying the same simplistic approach to conservation taken from ConSeq. Thus, embedding methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. This allowed prediction of SAV effects for the entire human proteome (~20k proteins) within 17 minutes on a single GPU.
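One way to picture the combination of predicted conservation with a substitution matrix, and the Q2 measure used to score it, is sketched below. The decision rule (conserved residue and substitution score below a threshold) is an assumption made for illustration, since the abstract does not spell out the exact rule. The function names are hypothetical, and the substitution score is expected to be looked up by the caller, for example from BLOSUM62.

```python
from typing import Sequence


def predict_sav_effect(conserved: bool, substitution_score: int, threshold: int = 0) -> bool:
    """Binary call for one variant: 'effect' (True) or 'neutral' (False).

    Assumed rule: a variant has an effect if its position is predicted as
    conserved AND the substitution scores below `threshold`.
    """
    return conserved and substitution_score < threshold


def q2(predicted: Sequence[bool], observed: Sequence[bool]) -> float:
    """Two-state accuracy: fraction of items whose predicted label matches the observed one."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(predicted)


# Illustrative inputs only: (conserved?, substitution score) for four variants.
variants = [(True, -4), (True, 2), (False, -3), (True, -1)]
calls = [predict_sav_effect(c, s) for c, s in variants]
print(calls)  # [True, False, False, True]
```

The same `q2` function applies to the conservation task (conserved/not) and to the variant task (effect/neutral), since both are two-state comparisons.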


2018 ◽  
Author(s):  
Ludovica Liguori ◽  
Valentina Citro ◽  
Bruno Hay-Mele ◽  
Giuseppina Andreotti ◽  
Maria Vittoria Cubellis

Bioinformatics has pervaded all fields of biology and has become an indispensable tool for almost all research projects. Hence the demand for graduates well-trained in bioinformatics has grown. Teaching bioinformatics has been incorporated in all traditional life science curricula. Rather than teaching bioinformatics as a stand-alone subject, it would be useful to stress its multidisciplinary and problem-solving aspects. Since bioinformatics relies heavily on the use of computers, e-learning is particularly convenient, but few examples have been produced so far. We present a tutorial that starts from a practical problem: finding novel enzymes from marine environments. First, we introduce the idea of metagenomics, a recent approach that extends biotechnology to non-culturable microbes. We then lead the students through databases such as BRENDA, and programs such as BLAST and Clustal Omega. Lastly, we let the students query these databases about molecules found in marine environments. At the end of the experience, students will have acquired practical knowledge of bioinformatics fundamentals.


Author(s):  
U. G. Adebo ◽  
J. O. Matthew

Multiple sequence analysis is one of the most widely used models for estimating similarity among genotypes. In a bid to access useful information for the utilization of bush mango genetic resources, nucleotide sequences of eight bush mango (Irvingia gabonensis) cultivars were retrieved from the NCBI database and evaluated for diversity and similarity using a computational biology approach. The highest alignment score (26.18), depicting the highest similarity, was shared by two sequence pairs, BM07:BM58 and BM12:BM69, while the lowest score (19.43) was between BM01:BM13. The phylogenetic tree broadly divided the cultivars into four distinct groups: BM07 and BM58 (cluster 1), BM01 (cluster 2), BM15, BM13 and BM35 (cluster 3), and BM12 and BM69 (cluster 4). The sequences obtained from the analysis revealed only a few fully conserved regions, with the single nucleotides A and T, which were consistent throughout the evolution. Results obtained from this study indicate that the bush mango cultivars are divergent and can be useful genetic resources for bush mango improvement through breeding.


2001 ◽  
Vol 7 (S2) ◽  
pp. 368-369
Author(s):  
B. Jiang ◽  
J. Friis ◽  
J.C.H. Spence

An accuracy of better than 1% is needed to measure the changes in charge density due to bonding. Here we report an accuracy of up to 0.025% (random error) in measurements of rutile crystal structure factors by QCBED. This error is the standard deviation in the mean value obtained from ten data sets. Systematic errors may be present. Figure 1 gives an example of the (200) refinement results. Table 1 lists several low order structure factor refinement results. The accuracy of the measured electron structure factors was 0.1-0.2%, but after conversion to x-ray structure factors, the accuracy for low orders improved due to the Mott formula [1]. For the (110) and (101) reflections, the accuracy in x-ray structure factors became 0.025% and 0.048% respectively. This accuracy is equivalent to that of the X-ray single crystal Pendellösung method on silicon crystals [2]. The experiments were done on a Leo 912 Omega TEM.
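The random error quoted here is the standard deviation of the mean over repeated refinements. The sketch below computes that statistic as a percentage of the mean. The function name and the ten input values are hypothetical placeholders, not measurements from the paper.

```python
import math
import statistics
from typing import Sequence


def standard_error_percent(values: Sequence[float]) -> float:
    """Standard deviation of the mean, expressed as a percentage of the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data sets")
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("mean is zero; relative error is undefined")
    sem = statistics.stdev(values) / math.sqrt(n)  # sample std dev / sqrt(n)
    return 100.0 * sem / abs(mean)


# Placeholder values standing in for ten refinements of one structure factor.
refinements = [2.001, 1.999, 2.002, 1.998, 2.000, 2.001, 1.999, 2.000, 2.002, 1.998]
print(standard_error_percent(refinements))
```

This captures random scatter only; as the abstract notes, systematic errors may still be present and are not reduced by averaging.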


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Vinod Kumar ◽  
Gopal Singh ◽  
Punesh Sangwan ◽  
A. K. Verma ◽  
Sanjeev Agrawal

β-Propeller phytases (BPPhy) are widely distributed in nature and play a major role in phytate-phosphorus cycling. In the present study, a BPPhy gene from a Bacillus licheniformis strain was expressed in E. coli with a phytase activity of 1.15 U/mL and a specific activity of 0.92 U/mg protein. The expressed enzyme represented a full length ORF “PhyPB13” of 381 amino acid residues and differed by 3 residues from the most similar existing BPPhy sequences. The PhyPB13 sequence was characterized in silico using various bioinformatic tools to better understand structural, functional, and evolutionary aspects of the BPPhy class by multiple sequence alignment and homology search, phylogenetic tree construction, variation in biochemical features, and distribution of motifs and superfamilies. In all sequences, conserved sites were observed toward their N-terminus and C-terminus. Cysteine was not present in the sequence. Overall, three major clusters were observed in the phylogenetic tree, with variation in biophysical characteristics. A total of 10 motifs were reported, with motif “1” observed in all 44 protein sequences; this motif might be used for diversity and expression analysis of BPPhy enzymes. This study revealed important sequence features of BPPhy and paves the way for determining the catalytic mechanism and selecting phytases with desirable characteristics.


2011 ◽  
Vol 7 (1) ◽  
pp. 539 ◽  
Author(s):  
Fabian Sievers ◽  
Andreas Wilm ◽  
David Dineen ◽  
Toby J Gibson ◽  
Kevin Karplus ◽  
...  
