Transcriptome Ortholog Alignment Sequence Tools (TOAST) for Phylogenomic Dataset Assembly

Abstract Background Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools that let phylogeneticists efficiently identify orthologous sequences currently hinders effective use of this resource. Results We introduce TOAST, an open source R software package that can use ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments. It also presents a series of outputs from which users can explore missing data patterns across their alignments and reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software enables easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.
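The idea of reassembling alignments under a user-defined missing data level can be illustrated with a minimal Python sketch. This is not TOAST's interface (TOAST is an R package, and its functions are not shown in the abstract): the function names, the data layout (locus name mapped to taxon name mapped to aligned sequence), and the set of characters counted as missing are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Characters treated as missing data in an aligned sequence (an assumption).
MISSING_CHARS = set("-?Nn")

Alignment = Dict[str, str]  # taxon name -> aligned sequence


def taxon_is_missing(sequence: Optional[str]) -> bool:
    """A taxon counts as missing if it is absent or has only missing characters."""
    return sequence is None or all(ch in MISSING_CHARS for ch in sequence)


def missing_fraction(alignment: Alignment, taxa: List[str]) -> float:
    """Fraction of the expected taxa with no usable sequence at this locus."""
    missing = sum(1 for taxon in taxa if taxon_is_missing(alignment.get(taxon)))
    return missing / len(taxa)


def filter_loci(loci: Dict[str, Alignment], taxa: List[str],
                max_missing: float) -> Dict[str, Alignment]:
    """Keep only loci whose missing-taxon fraction is at most max_missing."""
    return {name: alignment for name, alignment in loci.items()
            if missing_fraction(alignment, taxa) <= max_missing}


def concatenate(loci: Dict[str, Alignment], taxa: List[str]) -> Dict[str, str]:
    """Concatenate loci in sorted name order, padding absent taxa with gaps."""
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for name in sorted(loci):
        alignment = loci[name]
        if not alignment:
            continue  # nothing to concatenate for an empty locus
        length = len(next(iter(alignment.values())))
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, "-" * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


if __name__ == "__main__":
    # Hypothetical toy data: three loci, four expected taxa.
    example_taxa = ["A", "B", "C", "D"]
    example_loci = {
        "locus1": {"A": "ATGC", "B": "ATGA", "C": "ATGC", "D": "ATGT"},
        "locus2": {"A": "GGCC", "B": "----"},
        "locus3": {"A": "TTAA", "B": "TTAC", "C": "TTAA"},
    }
    kept = filter_loci(example_loci, example_taxa, max_missing=0.25)
    print(sorted(kept))
    print(concatenate(kept, example_taxa))
```

Raising or lowering `max_missing` and comparing the resulting concatenated matrices is one simple way to test how sensitive a downstream analysis is to missing data.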


Transcriptome Ortholog Alignment Sequence Tools (TOAST) for Phylogenomic Dataset Assembly

10.21203/rs.2.16269/v4 ◽

2020 ◽

Author(s):

Dustin J. Wcisel ◽

J. Thomas Howard ◽

Jeffrey A. Yoder ◽

Alex Dornburg

Keyword(s):

Missing Data ◽

Open Source ◽

Research Question ◽

Single Copy ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Sequencing Technologies ◽

Phylogenomic Analyses ◽

The Impact



Transcriptome Ortholog Alignment Sequence Tools (TOAST) for Phylogenomic Dataset Assembly

10.21203/rs.2.16269/v3 ◽

2020 ◽

Author(s):

Dustin J. Wcisel ◽

J. Thomas Howard ◽

Jeffrey A. Yoder ◽

Alex Dornburg

Keyword(s):

Missing Data ◽

Open Source ◽

Research Question ◽

Single Copy ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Sequencing Technologies ◽

Phylogenomic Analyses ◽

The Impact



Transcriptome Ortholog Alignment Sequence Tools (TOAST) for Phylogenomic Dataset Assembly

10.21203/rs.2.16269/v2 ◽

2020 ◽

Author(s):

Alex Dornburg ◽

Dustin J. Wcisel ◽

J. Thomas Howard ◽

Jeffrey A. Yoder

Keyword(s):

Missing Data ◽

Open Source ◽

Research Question ◽

Single Copy ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequencing Technologies ◽

Data Files ◽

Phylogenomic Analyses ◽

The Impact

Abstract Background: Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools that let phylogeneticists efficiently identify orthologous sequences currently hinders effective use of this resource. Results: We introduce TOAST, an open source R software package that can use ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments. It also presents a series of outputs from which users can explore missing data patterns across their alignments and reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions: TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software enables easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference. Software, a detailed manual, and example data files are available through GitHub at carolinafishes.github.io.


The impact of single substitutions on multiple sequence alignments

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2008.0140 ◽

2008 ◽

Vol 363 (1512) ◽

pp. 4041-4047 ◽

Cited By ~ 5

Author(s):

Steffen Klaere ◽

Tanja Gesell ◽

Arndt von Haeseler

Keyword(s):

Stochastic Matrix ◽

Branch Length ◽

Sequence Evolution ◽

Single Mutation ◽

Sequence Alignments ◽

Multiple Sequence ◽

Alignment Column ◽

Multiple Sequence Alignments ◽

Posterior Probability Distribution ◽

The Impact

We introduce another view of sequence evolution. Contrary to other approaches, we model the substitution process in two steps. First we assume (arbitrary) scaled branch lengths on a given phylogenetic tree. Second we allocate a Poisson distributed number of substitutions on the branches. The probability to place a mutation on a branch is proportional to its relative branch length. More importantly, the action of a single mutation on an alignment column is described by a doubly stochastic matrix, the so-called one-step mutation matrix. This matrix leads to analytical formulae for the posterior probability distribution of the number of substitutions for an alignment column.
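The two-step process described above can be sketched in Python as follows. This is an illustrative simplification, not the authors' implementation: it assumes a two-state character, assumes that a mutation on a branch flips the state at every leaf below that branch, and uses a hypothetical three-leaf rooted tree. Under those assumptions each branch acts on the set of column patterns as a permutation, so the one-step matrix is a weighted average of permutation matrices and is therefore doubly stochastic.

```python
import math
import random
from typing import Dict, List, Set, Tuple

# Hypothetical rooted tree ((0,1),2): branch name -> (length, leaves below it).
Branches = Dict[str, Tuple[float, Set[int]]]
TREE: Branches = {
    "leaf0": (0.1, {0}),
    "leaf1": (0.2, {1}),
    "clade01": (0.3, {0, 1}),
    "leaf2": (0.4, {2}),
}


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Draw a Poisson(rate) count with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def allocate_substitutions(branches: Branches, rate: float,
                           rng: random.Random) -> List[str]:
    """Step 2: place each substitution on a branch with probability
    proportional to that branch's relative length."""
    n_substitutions = sample_poisson(rate, rng)
    names = list(branches)
    weights = [branches[name][0] for name in names]
    return rng.choices(names, weights=weights, k=n_substitutions)


def one_step_matrix(branches: Branches, n_leaves: int) -> List[List[float]]:
    """Transition matrix over binary column patterns for a single mutation.
    Patterns are encoded as bitmasks; bit i holds the state of leaf i."""
    total_length = sum(length for length, _ in branches.values())
    size = 2 ** n_leaves
    matrix = [[0.0] * size for _ in range(size)]
    for length, leaves_below in branches.values():
        mask = sum(1 << leaf for leaf in leaves_below)
        for pattern in range(size):
            matrix[pattern][pattern ^ mask] += length / total_length
    return matrix


def is_doubly_stochastic(matrix: List[List[float]], tol: float = 1e-9) -> bool:
    """Check non-negativity and that every row and column sums to one."""
    size = len(matrix)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in matrix)
    cols_ok = all(abs(sum(matrix[i][j] for i in range(size)) - 1.0) < tol
                  for j in range(size))
    nonnegative = all(value >= 0.0 for row in matrix for value in row)
    return rows_ok and cols_ok and nonnegative


if __name__ == "__main__":
    rng = random.Random(1)
    print(allocate_substitutions(TREE, rate=2.0, rng=rng))
    print(is_doubly_stochastic(one_step_matrix(TREE, n_leaves=3)))
```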


Size and structure of the sequence space of repeat proteins

10.1101/635581 ◽

2019 ◽

Author(s):

Jacopo Marchi ◽

Ezequiel A. Galpern ◽

Rocio Espada ◽

Diego U. Ferreiro ◽

Aleksandra M. Walczak ◽

...

Keyword(s):

Amino Acid ◽

Protein Design ◽

Amino Acid Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Repeat Proteins ◽

The Impact ◽

New Strategies ◽

Amino Acid Conservation

Abstract The coding space of protein sequences is shaped by evolutionary constraints set by requirements of function and stability. We show that the coding space of a given protein family (the total number of sequences in that family) can be estimated using models of maximum entropy trained on multiple sequence alignments of naturally occurring amino acid sequences. We analyzed and calculated the size of three abundant repeat protein families, whose members are large proteins made of many repetitions of conserved portions of ∼30 amino acids. While amino acid conservation at each position of the alignment explains most of the reduction of diversity relative to completely random sequences, we found that correlations between amino acid usage at different positions significantly impact that diversity. We quantified the impact of different types of correlations, functional and evolutionary, on sequence diversity. Analysis of the detailed structure of the coding space of the families revealed a rugged landscape, with many local energy minima of varying sizes with a hierarchical structure, reminiscent of frustrated energy landscapes of spin glass in physics. This clustered structure indicates a multiplicity of subtypes within each family, and suggests new strategies for protein design.
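The contribution of per-position conservation alone can be sketched with an independent-site model, the simplest maximum entropy model that matches only single-column frequencies. Its entropy is the sum of the column entropies, and the exponential of that sum gives an effective number of sequences. The Python sketch below is a deliberate simplification and not the authors' method, which also accounts for correlations between positions; the function names are hypothetical, and gap characters are treated like any other symbol.

```python
import math
from collections import Counter
from typing import List, Sequence


def site_entropy(column: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the symbol frequencies in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def independent_site_log_size(alignment: List[str]) -> float:
    """Sum of column entropies: the log of the effective number of sequences
    under a model that ignores correlations between positions."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("all sequences must have the same aligned length")
    return sum(site_entropy([sequence[i] for sequence in alignment])
               for i in range(length))


if __name__ == "__main__":
    # Hypothetical toy alignment: first column conserved, second variable.
    toy = ["AA", "AC", "AG", "AT"]
    print(math.exp(independent_site_log_size(toy)))
```

Because correlations between positions can only lower the entropy relative to this independent-site value, the estimate acts as an upper bound on the family size for a model fitted to the same column frequencies.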


Treerecs: an integrated phylogenetic tool, from sequences to reconciliations

Bioinformatics ◽

10.1093/bioinformatics/btaa615 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4822-4824 ◽

Cited By ~ 1

Author(s):

Nicolas Comte ◽

Benoit Morel ◽

Damir Hasić ◽

Laurent Guéguen ◽

Bastien Boussau ◽

...

Keyword(s):

Open Source ◽

Source Code ◽

Phylogenetic Inference ◽

Species Tree ◽

Gene Trees ◽

Sequence Alignments ◽

Multiple Sequence ◽

Tree Reconciliation ◽

Multiple Sequence Alignments ◽

Multiple Alignments

Abstract Motivation Gene and species tree reconciliation methods are used to interpret gene trees, root them and correct uncertainties that are due to scarcity of signal in multiple sequence alignments. So far, reconciliation tools have not been integrated in standard phylogenetic software and they either lack performance on certain functions, or usability for biologists. Results We present Treerecs, a phylogenetic software based on duplication-loss reconciliation. Treerecs is simple to install and to use. It is fast and versatile, has a graphic output, and can be used along with methods for phylogenetic inference on multiple alignments like PLL and Seaview. Availability and implementation Treerecs is open-source. Its source code (C++, AGPLv3) and manuals are available from https://project.inria.fr/treerecs/.


LMAP_S: Lightweight Multigene Alignment and Phylogeny eStimation

BMC Bioinformatics ◽

10.1186/s12859-019-3292-5 ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Emanuel Maldonado ◽

Agostinho Antunes

Keyword(s):

Open Source ◽

High Throughput ◽

Phylogenetic Trees ◽

High Throughput Sequencing ◽

Phenotypic Diversity ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequencing Technologies ◽

Multiple Datasets ◽

Phylogeny Estimation

Abstract Background Recent advances in genome sequencing technologies and the falling cost of high-throughput sequencing continue to give rise to a deluge of data available for downstream analyses. Among others, evolutionary biologists often make use of genomic data to uncover phenotypic diversity and adaptive evolution in protein-coding genes. Therefore, multiple sequence alignments (MSA) and phylogenetic trees (PT) need to be estimated with optimal results. However, the preparation of an initial dataset of multiple sequence file(s) (MSF) and the steps involved can be challenging when considering an extensive amount of data. Thus, it becomes necessary to develop a tool that removes potential sources of error and automates the time-consuming steps of a typical workflow with high-throughput and optimal MSA and PT estimations. Results We introduce LMAP_S (Lightweight Multigene Alignment and Phylogeny eStimation), a user-friendly command-line and interactive package designed to handle an improved alignment and phylogeny estimation workflow: MSF preparation, MSA estimation, outlier detection, refinement, consensus, phylogeny estimation, comparison and editing. File and directory organization, execution, and manipulation of information are automated, with minimal manual user intervention. LMAP_S was developed for the workstation multi-core environment and provides a unique advantage for processing multiple datasets. Our software proved to be efficient throughout the workflow, including the (unlimited) handling of more than 20 datasets. Conclusions We have developed a simple and versatile LMAP_S package enabling researchers to effectively estimate MSAs and PTs for multiple datasets in a high-throughput fashion. LMAP_S integrates more than 25 software packages, providing more than 65 algorithm choices overall, distributed across five stages. At minimum, one FASTA file is required within a single input directory. To our knowledge, no other software combines MSA and phylogeny estimation with as many alternatives and provides means to find optimal MSAs and phylogenies. Moreover, we used a case study comparing methodologies that highlighted the usefulness of our software. LMAP_S has been developed as an open-source package, allowing its integration into more complex open-source bioinformatics pipelines. The LMAP_S package is released under the GPLv3 license and is freely available at https://lmap-s.sourceforge.io/.


Tailor-made multiple sequence alignments using the PRALINE 2 alignment toolkit

Bioinformatics ◽

10.1093/bioinformatics/btz572 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5315-5317 ◽

Cited By ~ 1

Author(s):

Maurits J J Dijkstra ◽

Atze J van der Ploeg ◽

K Anton Feenstra ◽

Wan J Fokkink ◽

Sanne Abeln ◽

...

Keyword(s):

Secondary Structure ◽

Open Source ◽

Sequence Alignment ◽

Open Source Software ◽

Multiple Sequence Alignment ◽

Multiple Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Dna Motifs ◽

Multiple Sequence Alignments

Abstract Summary PRALINE 2 is a toolkit for custom multiple sequence alignment workflows. It can be used to incorporate sequence annotations, such as secondary structure or (DNA) motifs, into the alignment scoring, as well as to customize many other aspects of a progressive multiple alignment workflow. Availability and implementation PRALINE 2 is implemented in Python and available as open source software on GitHub: https://github.com/ibivu/PRALINE/.


Inferring an Original Sequence from Erroneous Copies: Two Approaches

Asia-Pacific Biotech News ◽

10.1142/s0219030303000284 ◽

2003 ◽

Vol 07 (03) ◽

pp. 107-114 ◽

Cited By ~ 3

Author(s):

Jonathan M. Keith ◽

Peter Adams ◽

Darryn Bryant ◽

Keith R. Mitchelson ◽

Duncan A. E. Cochran ◽

...

Keyword(s):

Sequence Alignment ◽

Dna Sequences ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Original Sequence ◽

Multiple Sequence Alignments ◽

Sequencing Errors ◽

The Cost ◽

New Algorithms

This paper considers the problem of inferring an original sequence from a number of erroneous copies. The problem arises in DNA sequencing, particularly in the context of emerging technologies that provide high throughput or other advantages at the cost of an increased number of errors. We describe and compare two approaches that have recently been developed by the authors. The first approach searches for a sequence known as a Steiner string; the second searches for the most probable original sequence with respect to a simple Bayesian model of sequencing errors. We present the results of extensive tests in which erroneous copies of real DNA sequences were simulated and the algorithms were used to infer the original sequences. The results are used to compare the two approaches to each other and to a third, more conventional, approach based on multiple sequence alignment. We find that the Bayesian approach is superior to the Steiner approach, which in turn is superior to the alignment approach. The two new algorithms can also be used to construct multiple sequence alignments. We show that the two methods produce alignments of approximately equal quality, and conclude that the Steiner approach is better for this purpose because it is faster. Both methods produce better alignments than a well-known multiple sequence alignment package, for the cases tested.
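A Steiner string is a sequence that minimizes the total edit distance to all of the erroneous copies. Finding one exactly is computationally hard in general, so the Python sketch below substitutes a much simpler stand-in: it restricts the candidates to the observed copies and returns the one with the smallest total edit distance (sometimes called the set median). This is not the authors' algorithm, and the function names are hypothetical; it only illustrates the objective being minimized.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # match or substitution
            ))
        previous = current
    return previous[-1]


def total_distance(candidate: str, copies: List[str]) -> int:
    """The Steiner objective: summed edit distance from candidate to all copies."""
    return sum(edit_distance(candidate, copy) for copy in copies)


def set_median(copies: List[str]) -> str:
    """Best candidate among the copies themselves (not a true Steiner string)."""
    if not copies:
        raise ValueError("at least one copy is required")
    return min(copies, key=lambda candidate: total_distance(candidate, copies))


if __name__ == "__main__":
    # Hypothetical erroneous reads of the same short DNA fragment.
    reads = ["ACGT", "ACGA", "ACGT", "TCGT"]
    print(set_median(reads), total_distance(set_median(reads), reads))
```

A true Steiner string may not appear among the copies at all, so the set median can have a larger total distance than the optimum; it is nevertheless a common cheap starting point.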


Treerecs: an integrated phylogenetic tool, from sequences to reconciliations

10.1101/782946 ◽

2019 ◽

Cited By ~ 2

Author(s):

Nicolas Comte ◽

Benoit Morel ◽

Damir Hasic ◽

Laurent Guéguen ◽

Bastien Boussau ◽

...

Keyword(s):

Open Source ◽

Source Code ◽

Phylogenetic Inference ◽

Species Tree ◽

Gene Trees ◽

Sequence Alignments ◽

Multiple Sequence ◽

Tree Reconciliation ◽

Multiple Sequence Alignments ◽

Multiple Alignments

Abstract Motivation Gene and species tree reconciliation methods are used to interpret gene trees, root them and correct uncertainties that are due to scarcity of signal in multiple sequence alignments. So far, reconciliation tools have not been integrated in standard phylogenetic software and they either lack performance on certain functions, or usability for biologists. Results We present Treerecs, a phylogenetic software based on duplication-loss reconciliation. Treerecs is simple to install and to use. It is fast and versatile, has a graphic output, and can be used along with methods for phylogenetic inference on multiple alignments like PLL and Seaview. Availability Treerecs is open-source. Its source code (C++, AGPLv3) and manuals are available from https://project.inria.fr/treerecs/.
