A new (old) approach to genotype-based phylogenomic inference within species, with an example from the saguaro cactus (Carnegiea gigantea)

2020 ◽  
Author(s):  
Michael J. Sanderson ◽  
Alberto Búrquez ◽  
Dario Copetti ◽  
Michelle M. McMahon ◽  
Yichao Zeng ◽  
...  

Abstract: Genome sequence data are routinely being used to infer phylogenetic history within and between closely related diploid species, but few tree inference methods are specifically tailored to diploid genotype data. Here we re-examine the method of “polymorphism parsimony” (Inger 1967; Farris 1978; Felsenstein 1979), originally introduced to study morphological characters and chromosome inversion polymorphisms, to evaluate its utility for unphased diploid genotype data in large-scale phylogenomic data sets. We show that, assuming an infinite sites model, it is equivalent to inferring species trees by minimizing deep coalescences. Two potential advantages of this approach are scalability and estimation of a rooted tree. As with some other single nucleotide polymorphism (SNP) based methods, it requires thinning of data sets to statistically independent sites, and we describe a genotype-based test for phylogenetic independence. To evaluate this approach in genome-scale data, we construct intraspecific phylogenies for 10 populations of the saguaro cactus using 200 Gbp of resequencing data, and then use these methods to test whether the population with highest genetic diversity corresponds to the root of the genotype trees. Results were highly congruent with the (unrooted) trees obtained using SVDquartets, a scalable alternative method of phylogenomic inference.
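The thinning step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a simple greedy rule based on minimum physical spacing between retained SNPs; this is a common heuristic and not the genotype-based independence test the authors describe. The function name `thin_snps` and the `min_spacing` parameter are hypothetical.

```python
from typing import Dict, List, Tuple


def thin_snps(positions: List[Tuple[str, int]], min_spacing: int) -> List[Tuple[str, int]]:
    """Greedy thinning by physical distance (an assumed heuristic, not the paper's test).

    A SNP is kept only if it lies at least `min_spacing` base pairs after the
    last SNP kept on the same chromosome.
    """
    kept: List[Tuple[str, int]] = []
    last_kept: Dict[str, int] = {}  # chromosome -> position of the last kept SNP
    for chrom, pos in sorted(positions):
        prev = last_kept.get(chrom)
        if prev is None or pos - prev >= min_spacing:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept


if __name__ == "__main__":
    snps = [("chr1", 100), ("chr1", 150), ("chr1", 1200), ("chr2", 50)]
    print(thin_snps(snps, min_spacing=1000))
```

Spacing-based thinning only approximates independence; a test based on the genotypes themselves, as the abstract proposes, would replace the distance check.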

2018 ◽  
Author(s):  
Lucas Czech ◽  
Alexandros Stamatakis

Abstract. Motivation: In most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or Evolutionary) Placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do, however, face certain limitations: the manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; the number of taxa in the reference phylogeny should be small enough to allow for visually inspecting the results. Results: We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence data sets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results. Implementation: Freely available under GPLv3 at http://github.com/lczech/[email protected] Supplementary Information: Supplementary data are available at Bioinformatics online.


Zootaxa ◽  
2004 ◽  
Vol 629 (1) ◽  
pp. 1 ◽  
Author(s):  
MARIAM LEKVEISHVILI ◽  
HANS KLOMPEN

Phylogenetic relationships among the families in the infraorder Sejina and the position of Sejina relative to other infraorders of Mesostigmata are re-examined based on molecular and morphological data. Data sets included DNA sequence data for complete 18S, EF-1, and partial CO1 genes, and 69 morphological characters. The two families of Heterozerconina consistently group within Sejina, and we propose to synonymize Heterozerconina with Sejina (Sejina s.l.). Microgyniina is not the closest relative of Sejina. Rather, Sejina s.l. most often groups with Gamasina. Uropodellidae and Ichthyostomatogasteridae are sister groups and this lineage forms the sister group to Discozerconidae plus Heterozerconidae. Overall, we recognize 5 families within Sejina: Uropodellidae, Ichthyostomatogasteridae, Sejidae, Discozerconidae, and Heterozerconidae.


2021 ◽  
Author(s):  
Robert G Reynolds ◽  
Aryeh H Miller ◽  
Liam J Revell ◽  
Alejandro Rios-Franceschi ◽  
Clair A. Huffine ◽  
...  

The genus Sphaerodactylus is a very species-rich assemblage of sphaerodactylid lizards that has undergone a level of speciation in parallel to that of the well-known Anolis lizards. Nevertheless, molecular phylogenetic research on this group consists of a handful of smaller studies of regional focus (e.g., western Puerto Rico, the Lesser Antilles) or large-scale analyses based on relatively limited sequence data. Few medium-scale multi-locus studies exist; for example, studies that encompass an entire radiation on an island group. Building upon previous work done in Puerto Rican Sphaerodactylus, we performed multi-locus sampling of Sphaerodactylus geckos from across the Puerto Rico Bank. We then used these data for phylogeny estimation with near-complete taxon sampling. We focused on sampling the widespread nominal species S. macrolepis and in so doing, we uncovered a highly divergent and morphologically distinct lineage of Sphaerodactylus macrolepis from Puerto Rico, Culebra, and Vieques islands, which we re-describe herein as S. grandisquamis (Stejneger, 1904) comb. nov. on the basis of molecular and morphological characters. S. grandisquamis comb. nov. co-occurs with S. macrolepis only on Culebra Island but is highly genetically differentiated and morphologically distinct. Sphaerodactylus macrolepis is now restricted to the eastern Puerto Rico Bank, from Culebra east through the Virgin Islands and including the topotypic population on St. Croix. We include additional discussion of the evolutionary history and historical biogeography of the Sphaerodactylus of the Puerto Rican Bank in the context of these new discoveries.


2018 ◽  
Author(s):  
Huw A. Ogilvie ◽  
Timothy G. Vaughan ◽  
Nicholas J. Matzke ◽  
Graham J. Slater ◽  
Tanja Stadler ◽  
...  

Abstract: Bayesian methods can be used to accurately estimate species tree topologies, times and other parameters, but only when the models of evolution which are available and utilized sufficiently account for the underlying evolutionary processes. Multispecies coalescent (MSC) models have been shown to accurately account for the evolution of genes within species in the absence of strong gene flow between lineages, and fossilized birth-death (FBD) models have been shown to estimate divergence times from fossil data in good agreement with expert opinion. Until now dating analyses using the MSC have been based on a fixed clock or informally derived node priors instead of the FBD. On the other hand, dating analyses using an FBD process have concatenated all gene sequences and ignored coalescence processes. To address these mirror-image deficiencies in evolutionary models, we have developed an integrative model of evolution which combines both the FBD and MSC models. By applying concatenation and the MSC (without employing the FBD process) to an exemplar data set consisting of molecular sequence data and morphological characters from the dog and fox subfamily Caninae, we show that concatenation causes predictable biases in estimated branch lengths. We then applied concatenation using the FBD process and the combined FBD-MSC model to show that the same biases are still observed when the FBD process is employed. These biases can be avoided by using the FBD-MSC model, which coherently models fossilization and gene evolution, and does not require an a priori substitution rate estimate to calibrate the molecular clock. We have implemented the FBD-MSC in a new version of StarBEAST2, a package developed for the BEAST2 phylogenetic software.


2013 ◽  
Vol 2013 ◽  
pp. 1-14 ◽  
Author(s):  
Jurate Daugelaite ◽  
Aisling O'Driscoll ◽  
Roy D. Sleator

Multiple sequence alignment (MSA) of DNA, RNA, and protein sequences is one of the most essential techniques in the fields of molecular biology, computational biology, and bioinformatics. Next-generation sequencing technologies are changing the biology landscape, flooding the databases with massive amounts of raw sequence data. MSA of ever-increasing sequence data sets is becoming a significant bottleneck. In order to realise the promise of MSA for large-scale sequence data sets, it is necessary for existing MSA algorithms to be run in a parallelised fashion with the sequence data distributed over a computing cluster or server farm. Combining MSA algorithms with cloud computing technologies is therefore likely to improve the speed, quality, and capability for MSA to handle large numbers of sequences. In this review, multiple sequence alignments are discussed, with a specific focus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and the next generation of cloud-based MSA algorithms is introduced.


2017 ◽  
Vol 15 (03) ◽  
pp. 1740002 ◽  
Author(s):  
Jucheol Moon ◽  
Oliver Eulenstein

Supertree problems are a standard tool for synthesizing large-scale species trees from a given collection of gene trees under some problem-specific objective. Unfortunately, these problems are typically NP-hard, and often remain so when their instances are restricted to rooted gene trees sampled from the same species. While a class of restricted supertree problems has been effectively addressed by the parameterized strict consensus approach, in practice, most gene trees are unrooted and sampled from different species. Here, we overcome this stringent limitation by describing efficient algorithms that adopt the strict consensus approach to also handle unrestricted supertree problems. Finally, we demonstrate the performance of our algorithms in a comparative study with classic supertree heuristics using simulated and empirical data sets.


2015 ◽  
Vol 112 (7) ◽  
pp. 2058-2063 ◽  
Author(s):  
Marc Hellmuth ◽  
Nicolas Wieseke ◽  
Marcus Lechner ◽  
Hans-Peter Lenhof ◽  
Martin Middendorf ◽  
...  

Phylogenomics heavily relies on well-curated sequence data sets that comprise, for each gene, exclusively 1:1 orthologs. Paralogs are treated as a dangerous nuisance that has to be detected and removed. We show here that this severe restriction of the data sets is not necessary. Building upon recent advances in mathematical phylogenetics, we demonstrate that gene duplications convey meaningful phylogenetic information and allow the inference of plausible phylogenetic trees, provided orthologs and paralogs can be distinguished with a degree of certainty. Starting from tree-free estimates of orthology, cograph editing can sufficiently reduce the noise to find correct event-annotated gene trees. The information of gene trees can then directly be translated into constraints on the species trees. Although the resolution is very poor for individual gene families, we show that genome-wide data sets are sufficient to generate fully resolved phylogenetic trees, even in the presence of horizontal gene transfer.


2020 ◽  
Author(s):  
Mezzalina Vankan ◽  
Simon Y.W. Ho ◽  
Carolina Pardo-Diaz ◽  
David A. Duchêne

Abstract: The phylogenetic information contained in sequence data is partly determined by the overall rate of nucleotide substitution in the genomic region in question. However, phylogenetic signal is affected by various other factors, such as heterogeneity in substitution rates across lineages. These factors might be able to predict the phylogenetic accuracy of any given gene in a data set. We examined the association between the accuracy of phylogenetic inference across genes and several characteristics of branch lengths in phylogenomic data. In a large number of published data sets, we found that the accuracy of phylogenetic inference from genes was consistently associated with their mean statistical branch support and variation in their gene tree root-to-tip distances, but not with tree length and stemminess. Therefore, a signal of constant evolutionary rates across lineages appears to be beneficial for phylogenetic inference. Identifying the causes of variation in root-to-tip lengths in gene trees also offers a potential way forward to increase congruence in the signal across genes and improve estimates of species trees from phylogenomic data sets.
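Variation in root-to-tip distances can be made concrete with a small sketch. It assumes a rooted tree encoded as nested `(label, branch_length, children)` tuples and summarizes variation as the coefficient of variation; both the encoding and the choice of summary statistic are assumptions for illustration, not necessarily what the authors used.

```python
from statistics import mean, pstdev


def root_to_tip(node, dist=0.0):
    """Return {tip_label: distance from root} for a (label, length, children) tree."""
    label, length, children = node
    dist += length
    if not children:  # a tip: record its accumulated distance
        return {label: dist}
    distances = {}
    for child in children:
        distances.update(root_to_tip(child, dist))
    return distances


def root_to_tip_cv(tree):
    """Coefficient of variation (population SD / mean) of root-to-tip distances."""
    values = list(root_to_tip(tree).values())
    return pstdev(values) / mean(values)


if __name__ == "__main__":
    # Hypothetical three-tip gene tree; tip C sits on a longer branch.
    tree = ("root", 0.0, [
        ("A", 1.0, []),
        ("X", 0.5, [("B", 0.5, []), ("C", 1.5, [])]),
    ])
    print(root_to_tip(tree))
    print(round(root_to_tip_cv(tree), 4))
```

A value near zero indicates nearly clock-like evolution across lineages, which is the kind of signal the abstract reports as beneficial for inference.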


Author(s):  
Ying Zhou ◽  
Sharon R. Browning ◽  
Brian L. Browning

Abstract: Segments of identity by descent (IBD) are used in many genetic analyses. We present a method for detecting identical-by-descent haplotype segments that is optimized for large-scale genotype data. Our method, called hap-IBD, combines a compressed representation of genotype data, the positional Burrows-Wheeler transform, and multi-threaded execution to produce very fast analysis times. An attractive feature of hap-IBD is its simplicity: the input parameters clearly and precisely define the IBD segments that are reported, so that program correctness can be confirmed by users. We evaluate hap-IBD and four state-of-the-art IBD segment detection methods (GERMLINE, iLASH, RaPID, and TRUFFLE) using UK Biobank chromosome 20 data and simulated sequence data. We show that hap-IBD detects IBD segments faster and more accurately than competing methods, and that hap-IBD is the only method that can rapidly and accurately detect short 2-4 cM IBD segments in the full UK Biobank data. Analysis of 485,346 UK Biobank samples using hap-IBD with 12 computational threads detects 231.5 billion autosomal IBD segments with length ≥2 cM in 24.4 hours.
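To illustrate what a length-thresholded shared haplotype segment looks like, here is a naive pairwise sketch. It is not the hap-IBD algorithm: it scans two phased haplotypes marker by marker rather than using the positional Burrows-Wheeler transform, and it allows no mismatches. The function name `shared_segments` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def shared_segments(
    hap_a: Sequence[int],
    hap_b: Sequence[int],
    cm_positions: Sequence[float],
    min_cm: float = 2.0,
) -> List[Tuple[float, float]]:
    """Return (start_cM, end_cM) for maximal runs of identical alleles >= min_cm long.

    Naive O(markers) scan over one haplotype pair; assumes markers are sorted
    by genetic position and the three inputs have equal length.
    """
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first marker in the current matching run
    n = len(cm_positions)
    for i in range(n):
        if hap_a[i] == hap_b[i]:
            if start is None:
                start = i
        elif start is not None:
            # Run ended at marker i - 1; keep it if it is long enough.
            if cm_positions[i - 1] - cm_positions[start] >= min_cm:
                segments.append((cm_positions[start], cm_positions[i - 1]))
            start = None
    # Close a run that extends to the last marker.
    if start is not None and cm_positions[n - 1] - cm_positions[start] >= min_cm:
        segments.append((cm_positions[start], cm_positions[n - 1]))
    return segments


if __name__ == "__main__":
    a = [0, 1, 1, 0, 1, 0]
    b = [0, 1, 1, 0, 0, 0]
    cm = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    print(shared_segments(a, b, cm, min_cm=2.0))
```

Comparing every pair of haplotypes this way scales quadratically in sample size, which is the cost that the indexing approach named in the abstract is designed to avoid.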

