SeqDistK: a Novel Tool for Alignment-free Phylogenetic Analysis

2021 ◽  
Author(s):  
Xuemei Liu ◽  
Wen Li ◽  
Guanda Huang ◽  
Tianlai Huang ◽  
Qingang Xiong ◽  
...  

Algorithms for constructing phylogenetic trees are fundamental to studying the evolution of viruses, bacteria, and other microbes. Established multiple alignment-based algorithms are inefficient for large-scale metagenomic sequence data because they require high inter-sequence correlation and have high computational complexity. In this paper, we present SeqDistK, a novel tool for alignment-free phylogenetic analysis. SeqDistK computes the dissimilarity matrix for phylogenetic analysis, incorporating seven k-mer based dissimilarity measures, namely d2, d2S, d2star, Euclidean, Manhattan, CVTree, and Chebyshev. Based on these dissimilarities, SeqDistK constructs a phylogenetic tree using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm. Using a gold standard dataset of 16S rRNA and its associated phylogenetic tree, we compared SeqDistK to Muscle, a multiple sequence aligner. We found that SeqDistK was not only 38 times faster than Muscle but also more accurate. SeqDistK achieved the smallest symmetric difference between the inferred and ground truth trees, ranging from 13 to 18, while that of Muscle was 62. When the measures d2, d2star, d2S, and Euclidean were used with k-mer size k=5, SeqDistK consistently inferred a phylogenetic tree almost identical to the ground truth tree. We also performed clustering of 16S rRNA sequences using SeqDistK and found that the clustering was highly consistent with known biological taxonomy. Among all the measures, d2S (k=5, M=2) showed the best accuracy, as it correctly clustered and classified all sample sequences. In summary, SeqDistK is a novel, fast, and accurate alignment-free tool for large-scale phylogenetic analysis. SeqDistK software is freely available at https://github.com/htczero/SeqDistK.
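The idea of a k-mer based dissimilarity matrix can be illustrated with a minimal Python sketch. It covers only the three measures whose definitions are unambiguous (Euclidean, Manhattan, Chebyshev) and assumes they are applied to relative k-mer frequencies; whether SeqDistK normalizes counts this way is an assumption, and the function names here are hypothetical, not SeqDistK's interface. The measures d2S, d2star, and CVTree need a background sequence model and are left out.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_freqs(seq, k):
    """Relative frequencies of all 4**k DNA k-mers in seq, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing symbols other than A, C, G, T are ignored
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def dissimilarity_matrix(seqs, k, measure):
    """Pairwise dissimilarities between sequences as a square list of lists."""
    vecs = [kmer_freqs(s, k) for s in seqs]
    n = len(vecs)
    return [[measure(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy input; real 16S rRNA sequences are far longer.
toy = ["ACGTACGT", "ACGTACGT", "TTTTTTTT"]
matrix = dissimilarity_matrix(toy, 2, manhattan)
```

A matrix of this kind is the input that a hierarchical clustering method such as UPGMA would take to build the tree.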

1980 ◽  
Vol 187 (1) ◽  
pp. 65-74 ◽  
Author(s):  
D Penny ◽  
M D Hendy ◽  
L R Foulds

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
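To make "shortest tree" concrete, the following Python sketch scores the length of a fixed tree as the minimum number of character changes it requires. The abstract does not say how length is computed, so the sketch assumes Fitch's small-parsimony algorithm on a rooted binary tree. It is an illustration of the scoring idea, not the authors' construction method, and the taxon names and sequences are hypothetical.

```python
def fitch_length(tree, states):
    """Minimum number of state changes on a rooted binary tree for one site.

    tree: a leaf name (str) or a 2-tuple (left, right) of subtrees.
    states: dict mapping each leaf name to its observed character.
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # Children can agree on a state: no extra change needed here
            return common, left_cost + right_cost
        # Children disagree: one change is required on this pair of branches
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def tree_length(tree, alignment):
    """Sum of per-site Fitch lengths over an alignment {name: sequence}."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_length(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical four-taxon nucleotide alignment
alignment = {"a": "AAG", "b": "AAG", "c": "ATC", "d": "ATC"}
short_tree = (("a", "b"), ("c", "d"))
long_tree = (("a", "c"), ("b", "d"))
```

Comparing `tree_length(short_tree, alignment)` with `tree_length(long_tree, alignment)` shows how one topology can explain the same data with fewer changes; a minimal tree is one for which no other topology gives a smaller value.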


2015 ◽  
Vol 65 (Pt_2) ◽  
pp. 723-731 ◽  
Author(s):  
Ronel Roberts ◽  
Emma T. Steenkamp ◽  
Gerhard Pietersen

Greening disease of citrus in South Africa is associated with ‘Candidatus Liberibacter africanus’ (Laf), a phloem-limited bacterium vectored by the sap-sucking insect Trioza erytreae (Triozidae). Despite the implementation of control strategies, this disease remains problematic, suggesting the existence of reservoir hosts to Laf. The current study aimed to identify such hosts. Samples from 234 trees of Clausena anisata, 289 trees of Vepris lanceolata and 231 trees of Zanthoxylum capense were collected throughout the natural distribution of these trees in South Africa. Total DNA was extracted from samples and tested for the presence of liberibacters by a generic Liberibacter TaqMan real-time PCR assay. Liberibacters present in positive samples were characterized by amplifying and sequencing rplJ, omp and 16S rRNA gene regions. The identity of tree host species from which liberibacter sequences were obtained was verified by sequencing host rbcL genes. Of the trees tested, 33 specimens of Clausena, 17 specimens of Vepris and 10 specimens of Zanthoxylum tested positive for liberibacter. None of the samples contained typical citrus-infecting Laf sequences. Phylogenetic analysis of 16S rRNA gene sequences indicated that the liberibacters obtained from Vepris and Clausena had 16S rRNA gene sequences identical to that of ‘Candidatus Liberibacter africanus subsp. capensis’ (LafC), whereas those from Zanthoxylum species grouped separately. Phylogenetic analysis of the rplJ and omp gene regions revealed unique clusters for liberibacters associated with each tree species. We propose the following names for these novel liberibacters: ‘Candidatus Liberibacter africanus subsp. clausenae’ (LafCl), ‘Candidatus Liberibacter africanus subsp. vepridis’ (LafV) and ‘Candidatus Liberibacter africanus subsp. zanthoxyli’ (LafZ). This study did not find any natural hosts of Laf associated with greening of citrus. 
While native citrus relatives were shown to be infected with Laf-related liberibacters, nucleotide sequence data suggest that these are not alternative sources of Laf to citrus orchards, per se.


2019 ◽  
Vol 76 (2) ◽  
pp. 197-220
Author(s):  
K. Tremetsberger ◽  
S. Hameister ◽  
D. A. Simpson ◽  
K.-G. Bernhardt

To date, there are very few sequence data for Cyperaceae from mainland Southeast Asia. The aim of the present study was to contribute nuclear ribosomal internal transcribed spacer (ITS) sequences of selected species of Cambodian Cyperaceae to the overall phylogeny of the family. We generated ITS sequences of 38 accessions representing 26 species from Cambodia and used these sequences for phylogenetic analysis together with similar sequences from the National Center for Biotechnology Information GenBank. Our results corroborate recent phylogenetic work in the family and largely confirm established tribal relationships. The backbone of the phylogenetic tree of species-rich genera that have undergone rapid radiations is often weakly resolved (e.g. in Fimbristylis and in the C4 clade of Cyperus). Cryptic variation was revealed in the taxonomically difficult group of Fimbristylis dichotoma, with samples of this taxon appearing in two distinct clades within Fimbristylis. Further addition of geographically spread accessions of taxa will improve our understanding of the complex biogeographical history of the genera in the family. Eleocharis koyamae Tremetsb. & D.A.Simpson is proposed as a new name for E. macrorrhiza T. Koyama.


2002 ◽  
Vol 184 (1) ◽  
pp. 278-289 ◽  
Author(s):  
Michael W. Friedrich

Lateral gene transfer affects the evolutionary path of key genes involved in ancient metabolic traits, such as sulfate respiration, even more than previously expected. In this study, the phylogeny of the adenosine-5′-phosphosulfate (APS) reductase was analyzed. APS reductase is a key enzyme in sulfate respiration present in all sulfate-respiring prokaryotes. A newly developed PCR assay was used to amplify and sequence a fragment (∼900 bp) of the APS reductase gene, apsA, from a taxonomically wide range of sulfate-reducing prokaryotes (n = 60). Comparative phylogenetic analysis of all obtained and available ApsA sequences indicated a high degree of sequence conservation in the region analyzed. However, a comparison of ApsA- and 16S rRNA-based phylogenetic trees revealed topological incongruences affecting seven members of the Syntrophobacteraceae and three members of the Nitrospinaceae, which were clearly monophyletic with gram-positive sulfate-reducing bacteria (SRB). In addition, Thermodesulfovibrio islandicus and Thermodesulfobacterium thermophilum, Thermodesulfobacterium commune, and Thermodesulfobacterium hveragerdense clearly branched off between the radiation of the δ-proteobacterial gram-negative SRB and the gram-positive SRB and not close to the root of the tree as expected from 16S rRNA phylogeny. The most parsimonious explanation for these discrepancies in tree topologies is lateral transfer of apsA genes across bacterial divisions. Similar patterns of insertions and deletions in ApsA sequences of donor and recipient lineages provide additional evidence for lateral gene transfer. From a subset of reference strains (n = 25), a fragment of the dissimilatory sulfite reductase genes (dsrAB), which have recently been proposed to have undergone multiple lateral gene transfers (M. Klein et al., J. Bacteriol. 183:6028–6035, 2001), was also amplified and sequenced. 
Phylogenetic comparison of DsrAB- and ApsA-based trees suggests a frequent involvement of gram-positive and thermophilic SRB in lateral gene transfer events among SRB.


Paleobiology ◽  
1997 ◽  
Vol 23 (1) ◽  
pp. 1-19 ◽  
Author(s):  
William C. Clyde ◽  
Daniel C. Fisher

Stratigraphic data are compared to morphologic data in terms of their fit to phylogenetic hypotheses for 29 data sets taken from the literature. Stratigraphic fit is measured using MacClade's stratigraphic character, which tracks the number of independent discrepancies between observed order and the order of occurrence that would be expected on the basis of a given phylogenetic hypothesis. Acceptance of a phylogenetic hypothesis despite such discrepancies requires ad hoc hypotheses concerning differential probabilities of preservation and recovery. These stratigraphic ad hoc hypotheses are treated as logically equivalent to morphologic ad hoc hypotheses of homoplasy. The retention index is used to compare the number of stratigraphic and morphologic ad hoc hypotheses required by given phylogenetic hypotheses. Each data set is subjected to five analyses, varying in the constraints imposed on the structure of the phylogenetic tree against which fit is measured. Analyses 1–4 compare the stratigraphic and morphologic retention indices using phylogenetic trees consistent with the morphologically most-parsimonious cladogram reported in the original study. Analysis 5 compares retention indices using the overall (stratigraphically and morphologically) most-parsimonious phylogenetic tree, which may be, but is not necessarily, consistent with the reported cladogram. Proceeding from Analysis 1 to Analysis 5, stratigraphic data are allowed greater influence in determining the structure of phylogenetic trees, with the trees in Analysis 1 derived without reference to the stratigraphic character and the trees in Analysis 5 derived from full interaction of stratigraphic and morphologic characters. Morphologic and stratigraphic retention indices for these 29 studies cannot be statistically distinguished in comparisons 3–5, suggesting very similar degrees of fit. 
The values of these retention indices are high, indicating a generally high level of congruence under these phylogenetic hypotheses. Significant gains (49%) in stratigraphic fit can be realized without significant loss (4%) in morphologic fit as the stratigraphic and morphologic evidence are both allowed to participate in constraining the structure of phylogenetic hypotheses. These results suggest that arguments based on alleged “noisiness” of stratigraphic data offer inadequate grounds for ignoring stratigraphic order in phylogenetic analysis. In terms of congruence, stratigraphic and morphologic data perform about equally well.
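For readers unfamiliar with the retention index, the sketch below computes it from step counts using the standard definition, RI = (max − observed) / (max − min). How MacClade's stratigraphic character counts steps is not reproduced here; the function name and the example numbers are hypothetical.

```python
def retention_index(min_steps, observed_steps, max_steps):
    """Retention index for a character or data set on a given tree.

    min_steps: fewest steps possible on any tree
    observed_steps: steps required on the tree being evaluated
    max_steps: most steps possible on any tree
    Returns 1.0 for a perfect fit and 0.0 for the worst possible fit.
    """
    if max_steps == min_steps:
        raise ValueError("retention index is undefined when max_steps equals min_steps")
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Hypothetical step counts, not taken from the 29 data sets
example_ri = retention_index(min_steps=10, observed_steps=12, max_steps=20)
```

Because the index is scaled by the range between the best and worst possible fit, stratigraphic and morphologic values can be compared on the same 0 to 1 scale even though their raw step counts differ.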


2020 ◽  
Author(s):  
Yang Young Lu ◽  
Jiaxing Bai ◽  
Yiwen Wang ◽  
Ying Wang ◽  
Fengzhu Sun

Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.


2015 ◽  
Author(s):  
Jennifer Fouquier ◽  
Jai R Rideout ◽  
Evan Bolyen ◽  
John H Chase ◽  
Arron Shiffer ◽  
...  

Ghost-tree is a bioinformatics tool that integrates sequence data from two genetic markers into a single phylogenetic tree that can be used for diversity analyses. Our approach uses one genetic marker whose sequences can be aligned across organisms spanning divergent taxonomic groups (e.g., fungal families) as a “foundation” phylogeny. A second, more rapidly evolving genetic marker is then used to build “extension” phylogenies for more closely related organisms (e.g., fungal species or strains) that are then grafted on to the foundation tree by mapping taxonomic names. We apply ghost-tree to graft fungal extension phylogenies derived from ITS sequences onto a foundation phylogeny derived from fungal 18S sequences. The result is a phylogenetic tree, compatible with the commonly used UNITE fungal database, that supports phylogenetic diversity analysis (e.g., UniFrac) of fungal communities profiled using ITS markers. Availability: ghost-tree is pip-installable. All source code, documentation, and test code are available under the BSD license at https://github.com/JTFouquier/ghost-tree.


2016 ◽  
Author(s):  
Shea N Gardner ◽  
Sasha K Ames ◽  
Maya B Gokhale ◽  
Tom R Slezak ◽  
Jonathan Allen

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library) which can be run with 24 GB of DRAM memory, an amount available on many clusters, or with 16 GB DRAM plus a 24 GB low cost commodity flash drive (NVRAM), a cost effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.


2022 ◽  
Vol 335 ◽  
pp. 00014
Author(s):  
R. Misrianti ◽  
S.H. Wijaya ◽  
C. Sumantri ◽  
J. Jakaria

Mitochondrial DNA (mtDNA), as a source of genetic information based on the maternal genome, can provide important information for phylogenetic analysis and evolutionary biology. The objective of this study was to analyze the phylogenetic tree of Bali cattle with seven gene bank references (Bos indicus, Bos taurus, Bos frontalis, and Bos grunniens) based on partial 16S rRNA mitochondrial DNA sequences. The Bayesian phylogenetic tree was constructed using BEAST 2.4 and visualized in FigTree 1.4.4 (tree.bio.ed.ac.uk/software/figtree/). The best model of evolution was selected using jModelTest 2.1.7. The optimal evolutionary model was GTR + I + G, with p-inv (I) 0.1990 and gamma shape 0.1960. The main result indicated that the Bali cattle were grouped into Bos javanicus. Phylogenetic analysis also successfully classified Bos javanicus, Bos indicus, Bos taurus, Bos frontalis, and Bos grunniens. These results will complement existing information about Bali cattle and will be useful for the preservation and conservation strategies of Indonesian animal genetic resources.


2007 ◽  
Vol 20 (4) ◽  
pp. 287 ◽  
Author(s):  
Michael J. Sanderson

Broad availability of molecular sequence data allows construction of phylogenetic trees with 1000s or even 10 000s of taxa. This paper reviews methodological, technological and empirical issues raised in phylogenetic inference at this scale. Numerous algorithmic and computational challenges have been identified surrounding the core problem of reconstructing large trees accurately from sequence data, but many other obstacles, both upstream and downstream of this step, are less well understood. Before phylogenetic analysis, data must be generated de novo or extracted from existing databases, compiled into blocks of homologous data with controlled properties, aligned, examined for the presence of gene duplications or other kinds of complicating factors, and finally, combined with other evidence via supermatrix or supertree approaches. After phylogenetic analysis, confidence assessments are usually reported, along with other kinds of annotations, such as clade names, or annotations requiring additional inference procedures, such as trait evolution or divergence time estimates. Prospects for partial automation of large-tree construction are also discussed, as well as risks associated with ‘outsourcing’ phylogenetic inference beyond the systematics community.

