Phylo SI: a new genome-wide approach for prokaryotic phylogeny

2013 ◽  
Vol 42 (4) ◽  
pp. 2391-2404 ◽  
Author(s):  
Anton Shifman ◽  
Noga Ninyo ◽  
Uri Gophna ◽  
Sagi Snir

Abstract The evolutionary history of all life forms is usually represented as a vertical tree-like process. In prokaryotes, however, the vertical signal is partly obscured by the massive influence of horizontal gene transfer (HGT). HGT creates widespread discordance between evolutionary histories of different genes as genomes become mosaics of gene histories. Thus, the Tree of Life (TOL) has been questioned as an appropriate representation of the evolution of prokaryotes. Nevertheless, a common hypothesis is that prokaryotic evolution is primarily tree-like, and a routine effort is made to place new isolates in their appropriate location in the TOL. Moreover, it appears desirable to exploit non–tree-like evolutionary processes for the task of microbial classification. In this work, we present a novel technique that builds on the straightforward observation that gene order conservation (‘synteny’) decreases over time as a result of gene mobility. This is particularly true in prokaryotes, mainly due to HGT. Using a ‘synteny index’ (SI) that measures the average synteny between a pair of genomes, we developed the phylogenetic reconstruction tool ‘Phylo SI’. Phylo SI offers several attractive properties such as easy bootstrapping, high sensitivity in cases where phylogenetic signal is weak and computational efficiency. Phylo SI was tested both on simulated data and on two bacterial data sets and compared with two well-established phylogenetic methods. Phylo SI is particularly efficient on short evolutionary distances where synteny footprints remain detectable, whereas the nucleotide substitution signal is too weak for reliable sequence-based phylogenetic reconstruction. The method is publicly available at http://research.haifa.ac.il/ssagi/software/PhyloSI.zip.
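
The abstract describes the synteny index only as a measure of average synteny between two genomes. The Python sketch below shows one plausible reading of that idea and is not the Phylo SI implementation. It assumes that each genome is a circular list of unique gene identifiers, that the neighborhood of a gene is the k genes on either side of it, and that the genome is longer than 2k genes. The names `neighborhood`, `synteny_index`, and the parameter `k` are hypothetical.

```python
from typing import Sequence, Set


def neighborhood(genome: Sequence[str], idx: int, k: int) -> Set[str]:
    """Genes within k positions of idx on a circular chromosome, excluding idx.

    Assumption: the genome is longer than 2k genes and gene IDs are unique.
    """
    n = len(genome)
    return {genome[(idx + off) % n] for off in range(-k, k + 1) if off != 0}


def synteny_index(a: Sequence[str], b: Sequence[str], k: int = 2) -> float:
    """Average fraction of shared neighbors over all genes present in both genomes.

    Returns a value in [0, 1]; 1 means every shared gene has the same
    2k neighbors in both genomes. This is an illustrative definition only.
    """
    pos_a = {gene: i for i, gene in enumerate(a)}
    pos_b = {gene: i for i, gene in enumerate(b)}
    shared = pos_a.keys() & pos_b.keys()
    if not shared:
        return 0.0
    total = 0.0
    for gene in shared:
        near_a = neighborhood(a, pos_a[gene], k)
        near_b = neighborhood(b, pos_b[gene], k)
        total += len(near_a & near_b) / (2 * k)
    return total / len(shared)


if __name__ == "__main__":
    genome_a = list("ABCDEFGH")
    genome_b = list("AEBFCGDH")  # same genes, shuffled order
    print(synteny_index(genome_a, genome_a, k=2))
    print(synteny_index(genome_a, genome_b, k=1))
```

Under these assumptions, a pairwise distance such as `1 - synteny_index(a, b)` could be passed to any standard distance-based tree builder; that step is likewise an illustration rather than a description of the published tool.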

2015 ◽  
Author(s):  
Galina Glazko ◽  
Michael Gensheimer ◽  
Arcady Mushegian

Abstract Background: Complete genome sequences provide many new characters suitable for studying phylogenetic relationships. The limitations of the single sequence-based phylogenetic reconstruction prompted the efforts to build trees based on genome-wide properties, such as the fraction of shared orthologous genes or conservation of adjoining gene pairs. Gene content-based phylogenies, however, have their own biases: most notably, differential losses and horizontal transfers of genes interfere with phylogenetic signal, each in their own way, and special measures need to be taken to eliminate these types of noise. Results: We expand the repertoire of genome-wide traits available for phylogeny building, by developing a practical approach for measuring local gene conservation in two genomes. We counted the number of orthologous genes shared by chromosomal neighborhoods (“bins”), and built the phylogeny of 63 prokaryotic genomes on this basis. The tree correctly resolved all well-established clades, and also suggested the monophyly of firmicutes, which tend to be split in other genome-based trees. Conclusions: Our measure of local gene order conservation extracts strong phylogenetic signal. This new measure appears to be substantially resistant to the observed instances of gene loss and horizontal transfer, two evolutionary forces which can cause systematic biases in the genome-based phylogenies.
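
As a rough illustration of counting orthologs shared by chromosomal bins, the Python sketch below splits each genome into fixed-size bins and scores how well each bin of one genome is matched by its best counterpart in the other. The fixed bin size, the best-match summary, and the names `split_into_bins` and `bin_conservation` are all assumptions made for this sketch. The abstract does not specify these details, so this should not be read as the authors' procedure.

```python
from typing import List, Sequence, Set


def split_into_bins(genome: Sequence[str], bin_size: int) -> List[Set[str]]:
    """Cut an ordered gene list into consecutive bins of bin_size genes."""
    return [set(genome[i:i + bin_size]) for i in range(0, len(genome), bin_size)]


def bin_conservation(a: Sequence[str], b: Sequence[str], bin_size: int = 3) -> float:
    """Mean fraction of each bin in `a` recovered by its best-matching bin in `b`.

    Genes are assumed to be labeled by ortholog group, so equal labels
    mean orthologous genes. The score is asymmetric by construction.
    """
    bins_a = split_into_bins(a, bin_size)
    bins_b = split_into_bins(b, bin_size)
    if not bins_a or not bins_b:
        return 0.0
    best = [max(len(x & y) for y in bins_b) / len(x) for x in bins_a]
    return sum(best) / len(best)


if __name__ == "__main__":
    print(bin_conservation(list("ABCDEF"), list("ABCDEF")))
    print(bin_conservation(list("ABCDEF"), list("ADBECF")))
```

A symmetric score could be obtained by averaging `bin_conservation(a, b)` and `bin_conservation(b, a)`; whether that matches the published measure is not stated in the abstract.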


2013 ◽  
Vol 2013 ◽  
pp. 1-14 ◽  
Author(s):  
Francesc López-Giráldez ◽  
Andrew H. Moeller ◽  
Jeffrey P. Townsend

Phylogenetic research is often stymied by selection of a marker that leads to poor phylogenetic resolution despite considerable cost and effort. Profiles of phylogenetic informativeness provide a quantitative measure for prioritizing gene sampling to resolve branching order in a particular epoch. To evaluate the utility of these profiles, we analyzed phylogenomic data sets from metazoans, fungi, and mammals, thus encompassing diverse time scales and taxonomic groups. We also evaluated the utility of profiles created based on simulated data sets. We found that genes selected via their informativeness dramatically outperformed haphazard sampling of markers. Furthermore, our analyses demonstrate that the original phylogenetic informativeness method can be extended to trees with more than four taxa. Thus, although the method currently predicts phylogenetic signal without specifically accounting for the misleading effects of stochastic noise, it is robust to the effects of homoplasy. The phylogenetic informativeness rankings obtained will allow other researchers to select advantageous genes for future studies within these clades, maximizing return on effort and investment. Genes identified might also yield efficient experimental designs for phylogenetic inference for many sister clades and outgroup taxa that are closely related to the diverse groups of organisms analyzed.


2020 ◽  
Vol 37 (9) ◽  
pp. 2747-2762 ◽  
Author(s):  
Guénola Drillon ◽  
Raphaël Champeimont ◽  
Francesco Oteri ◽  
Gilles Fischer ◽  
Alessandra Carbone

Abstract Gene order can be used as an informative character to reconstruct phylogenetic relationships between species independently from the local information present in gene/protein sequences. PhyChro is a reconstruction method based on chromosomal rearrangements, applicable to a wide range of eukaryotic genomes with different gene contents and levels of synteny conservation. For each synteny breakpoint issued from pairwise genome comparisons, the algorithm defines two disjoint sets of genomes, named partial splits, respectively, supporting the two block adjacencies defining the breakpoint. Considering all partial splits issued from all pairwise comparisons, a distance between two genomes is computed from the number of partial splits separating them. Tree reconstruction is achieved through a bottom-up approach by iteratively grouping sister genomes minimizing genome distances. PhyChro estimates branch lengths based on the number of synteny breakpoints and provides confidence scores for the branches. PhyChro performance is evaluated on two data sets of 13 vertebrates and 21 yeast genomes by using up to 130,000 and 179,000 breakpoints, respectively, a scale of genomic markers that has been out of reach until now. PhyChro reconstructs very accurate tree topologies even at known problematic branching positions. Its robustness has been benchmarked for different synteny block reconstruction methods. On simulated data PhyChro reconstructs phylogenies perfectly in almost all cases, and shows the highest accuracy compared with other existing tools. PhyChro is very fast, reconstructing the vertebrate and yeast phylogenies in <15 min.


2020 ◽  
Vol 37 (11) ◽  
pp. 3380-3388
Author(s):  
Stephen A Smith ◽  
Nathanael Walker-Hale ◽  
Joseph F Walker

Abstract Most phylogenetic analyses assume that a single evolutionary history underlies one gene. However, both biological processes and errors can cause intragenic conflict. The extent to which this conflict is present in empirical data sets is not well documented, but if common, could have far-reaching implications for phylogenetic analyses. We examined several large phylogenomic data sets from diverse taxa using a fast and simple method to identify well-supported intragenic conflict. We found conflict to be highly variable between data sets, from 1% to >92% of genes investigated. We analyzed four exemplar genes in detail and analyzed simulated data under several scenarios. Our results suggest that alignment error may be one major source of conflict, but other conflicts remain unexplained and may represent biological signal or other errors. Whether as part of data analysis pipelines or to explore biological processes, analyses of within-gene phylogenetic signal should become common.


2016 ◽  
Author(s):  
John P Didion ◽  
Francis S Collins

A key step in the transformation of raw sequencing reads into biological insights is the trimming of adapter sequences and low-quality bases. Read trimming has been shown to increase the quality and reliability while decreasing the computational requirements of downstream analyses. Many read trimming software tools are available; however, no tool simultaneously provides the accuracy, computational efficiency, and feature set required to handle the types and volumes of data generated in modern sequencing-based experiments. Here we introduce Atropos and show that it trims reads with high sensitivity and specificity while maintaining leading-edge speed. Compared to other state-of-the-art read trimming tools, Atropos achieves a four-fold increase in trimming accuracy and a decrease in execution time of ~50% (using 16 parallel execution threads). Furthermore, Atropos maintains high accuracy even when trimming simulated data with a high rate of error. The accuracy, high performance, and broad feature set offered by Atropos make it an appropriate choice for the pre-processing of most current-generation sequencing data sets. Atropos is open source and free software written in Python and available at https://github.com/jdidion/atropos.
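
To make the notion of adapter trimming concrete, here is a deliberately naive Python sketch that removes a 3' adapter by exact string matching. It is not how Atropos works and does not use the Atropos API. The function name `trim_adapter`, the `min_overlap` parameter, and the example sequences are hypothetical, and real trimmers also tolerate sequencing errors, which this sketch does not.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter from a read using exact matching only.

    First look for the complete adapter anywhere in the read; if absent,
    look for the longest adapter prefix (at least min_overlap bases)
    that ends the read, since adapters are often only partly sequenced.
    """
    if not adapter:
        return read
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    shortest = max(min_overlap, 1)  # never test a zero-length overlap
    longest = min(len(adapter), len(read)) - 1
    for length in range(longest, shortest - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:-length]
    return read


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example sequence chosen for illustration
    print(trim_adapter("ACGTAGATCGGAAGAGCTTT", adapter))  # full adapter present
    print(trim_adapter("ACGTACGTAGATCGGAAG", adapter))    # partial adapter at 3' end
    print(trim_adapter("ACGTACGT", adapter))              # no adapter
```

The `min_overlap` threshold trades sensitivity against the risk of clipping genuine read bases that happen to match the start of the adapter.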


2017 ◽  
Author(s):  
B. Schaeffer ◽  
V. Nicolas ◽  
F. Austerlitz ◽  
C. Larédo

Abstract Several classes of methods have been proposed for inferring the history of populations from genetic polymorphism data. As connectivity is a key factor to explain the structure of populations, several graph-based methods have been developed to this aim, using population genetics data. Here we propose an original method based on graphical models that uses DNA sequences to provide relationships between populations. We tested our method on various simulated data sets, describing typical demographic scenarios, for different parameter values. We found that our method behaved noticeably well for realistic demographic evolutionary processes and suitably recovered the migration processes. Our method thus provides a complementary tool for investigating population history based on genetic material.


Author(s):  
Haige Han ◽  
Kenneth Bryan ◽  
Wunierfu Shiraigol ◽  
Dongyi Bai ◽  
Yiping Zhao ◽  
...  

Abstract The Mongolian horse is one of the oldest extant horse populations and although domesticated, most animals are free-ranging and experience minimal human intervention. As an ancient population originating in one of the key domestication centers, the Mongolian horse may play a key role in understanding the origins and recent evolutionary history of horses. Here we describe an analysis of high-density genome-wide single-nucleotide polymorphism (SNP) data in 40 globally dispersed horse populations (n = 895). In particular, we have focused on new results from Chinese Mongolian horses (n = 100) that represent 5 distinct populations. These animals were genotyped for 670K SNPs and the data were analyzed in conjunction with 35K SNP data for 35 distinct breeds. Analyses of these integrated SNP data sets demonstrated that the Chinese Mongolian populations were genetically distinct from other modern horse populations. In addition, compared to other domestic horse breeds, the Chinese Mongolian horse populations exhibited relatively high genomic diversity. These results suggest that, in genetic terms, extant Chinese Mongolian horses may be the most similar modern populations to the animals originally domesticated in this region of Asia. Chinese Mongolian horse populations may therefore retain ancestral genetic variants from the earliest domesticates. Further genomic characterization of these populations in conjunction with archaeogenetic sequence data should be prioritized for understanding recent horse evolution and the domestication process that has led to the wealth of diversity observed in modern global horse breeds.


Author(s):  
Siddharth Kulkarni ◽  
Robert J Kallal ◽  
Hannah Wood ◽  
Dimitar Dimitrov ◽  
Gonzalo Giribet ◽  
...  

Abstract Genome-scale data sets are converging on robust, stable phylogenetic hypotheses for many lineages; however, some nodes have shown disagreement across classes of data. We use spiders (Araneae) as a system to identify the causes of incongruence in phylogenetic signal between three classes of data: exons (as in phylotranscriptomics), noncoding regions (included in ultraconserved elements [UCE] analyses), and a combination of both (as in UCE analyses). Gene orthologs, coded as amino acids and nucleotides (with and without third codon positions), were generated by querying published transcriptomes for UCEs, recovering 1,931 UCE loci (codingUCEs). We expected that congeners represented in the codingUCE and UCEs data would form clades in the presence of phylogenetic signal. Noncoding regions derived from UCE sequences were recovered to test the stability of relationships. Phylogenetic relationships resulting from all analyses were largely congruent. All nucleotide data sets from transcriptomes, UCEs, or a combination of both recovered similar topologies in contrast with results from transcriptomes analyzed as amino acids. Most relationships inferred from low-occupancy data sets, containing several hundreds of loci, were congruent across Araneae, as opposed to high-occupancy data matrices with fewer loci, which showed more variation. Furthermore, we found that low-occupancy data sets analyzed as nucleotides (as is typical of UCE data sets) can result in more congruent relationships than high-occupancy data sets analyzed as amino acids (as in phylotranscriptomics). Thus, omitting data, through amino acid translation or via retention of only high-occupancy loci, may have a deleterious effect in phylogenetic reconstruction.


2005 ◽  
Vol 62 (1) ◽  
pp. 215-223 ◽  
Author(s):  
Nathan G Taylor ◽  
Carl J Walters ◽  
Steven J.D. Martell

Gear selectivity and the cumulative effects of size-selective fishing produce bias in the length-at-age samples used to estimate the von Bertalanffy growth parameters. In fished populations, fast-growing young fish and slow-growing old fish are overrepresented in size–age samples. To account for such effects, we treated size-at-age observations as multinomial samples, with expected catches in each size–age category dependent on growth parameters, growth variation, size selectivity, abundance at age, and the history of exploitation. Using simulated data sets, estimated growth parameters using the multinomial likelihood were unbiased when fishing mortality was not too high and the shape of the vulnerability function was correct. In contrast, estimated growth parameters using a least squares approach overestimated the metabolic growth coefficient (K) and underestimated mean asymptotic length (L∞). Models that do not explicitly account for the effects of fishing and size selectivity underestimated L∞ and overestimated K. We estimate growth parameters for northern pikeminnow (Ptychocheilus oregonensis) as an example of the method and document a stunted "pigmy" population with an L∞ of 175-mm fork length, attributing its small size to effects of high density and (or) a short growing season.
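
For readers unfamiliar with the growth parameters named above, the Python sketch below evaluates the standard von Bertalanffy growth curve, L(a) = L∞ (1 − exp(−K (a − t0))). The abstract names only K and L∞; the age offset `t0`, the function name, and the value K = 0.3 are assumptions added for illustration. The 175 mm asymptote is the L∞ reported in the abstract. The sketch does not reproduce the paper's multinomial likelihood.

```python
import math


def von_bertalanffy_length(age: float, l_inf: float, k: float, t0: float = 0.0) -> float:
    """Expected length at a given age under the von Bertalanffy growth curve.

    l_inf: mean asymptotic length (L-infinity), in the same units as the result.
    k:     growth coefficient (per unit of age).
    t0:    hypothetical age at which expected length is zero (assumed 0 here).
    """
    return l_inf * (1.0 - math.exp(-k * (age - t0)))


if __name__ == "__main__":
    # L-infinity of 175 mm is taken from the abstract; K = 0.3 is made up.
    for age in (1, 2, 5, 10, 20):
        length = von_bertalanffy_length(age, l_inf=175.0, k=0.3)
        print(f"age {age:>2}: {length:6.1f} mm")
```

Because L∞ and K enter the curve together, a biased sample of lengths at age can push the two estimates in opposite directions, which is consistent with the paired over- and underestimation the abstract reports.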


2020 ◽  
Author(s):  
Li Liu ◽  
Richard J Caselli

Abstract Excess of heterozygosity (H) is a widely used measure of genetic diversity of a population. As high-throughput sequencing and genotyping data become readily available, it has been applied to investigating the associations of genome-wide genetic diversity with human diseases and traits. However, these studies often report contradictory results. In this paper, we present a meta-analysis of five whole-exome studies to examine the association of H scores with Alzheimer’s disease. We show that the mean H score of a group is not associated with the disease status, but is associated with the sample size. Across all five studies, the group with more samples has a significantly lower H score than the group with fewer samples. To remove potential confounders in empirical data sets, we perform computer simulations to create artificial genomes controlled for the number of polymorphic loci, the sample size and the allele frequency. Analyses of these simulated data confirm the negative correlation between the sample size and the H score. Furthermore, we find that genomes with a large number of rare variants also have inflated H scores. These biases altogether can lead to spurious associations between genetic diversity and the phenotype of interest. Based on these findings, we advocate that studies should balance the sample sizes when using genome-wide H scores to assess genetic diversities of different populations, which helps improve the reproducibility of future research.

