Pervasive selection biases inferences of the species tree

2020 ◽  
Author(s):  
Rui Borges ◽  
Bastien Boussau ◽  
Gergely Szollosi ◽  
Carolin Kosiol

Despite the importance of natural selection in species' evolutionary history, phylogenetic methods that take into account population-level processes ignore selection. Assuming neutrality is often based on the idea that selection occurs at a minority of loci in the genome and is unlikely to significantly compromise phylogenetic inferences. However, selection might act more pervasively, as is the case for nearly neutral mutations. Genome-wide processes like GC-bias and some of the variation segregating in coding regions are known to evolve in the nearly neutral range. As we are now using genome-wide data to estimate species trees, it is natural to ask whether weak, but pervasive, selection is likely to blur species tree inferences. Here, we employed a polymorphism-aware phylogenetic model, specially tailored for measuring signatures of nucleotide usage biases, to test the impact of near neutrality on the substitution process. Analyses with simulated data indicate that while the inferred relationships among species are not significantly compromised, the genetic distances are systematically underestimated, with the deeper nodes suffering more than the younger ones. Such biases have implications for molecular dating. We found signatures of GC-bias considerably affecting the estimated divergence times (up to 21%) of worldwide fruit fly populations. Our findings call for the need to account for nearly neutral forces (or any other form of pervasive selection) when quantifying divergence or dating species evolution.

2019 ◽  
Author(s):  
Rui Borges ◽  
Carolin Kosiol

Abstract Polymorphism-aware phylogenetic models (PoMo) constitute an alternative approach for species tree estimation from genome-wide data. PoMo builds on the standard substitution models of DNA evolution but expands the classic alphabet of the four nucleotide bases to include polymorphic states. By doing so, PoMo accounts for ancestral and current intra-population variation, while also accommodating population-level processes ruling the substitution process (e.g. genetic drift, mutations, allelic selection). PoMo has been shown to be a valuable tool in several phylogenetic applications, but a proof of statistical consistency (and identifiability, a necessary condition for consistency) is lacking. Here, we prove that PoMo is identifiable and, using this result, we further show that the maximum a posteriori (MAP) tree estimator of PoMo is a consistent estimator of the species tree. We complement our theoretical results with a simulated data set mimicking the diversity observed in natural populations exhibiting incomplete lineage sorting. We implemented PoMo in a Bayesian framework and show that the MAP tree easily recovers the true tree for typical numbers of sites that are sampled in genome-wide analyses.
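The expanded alphabet described in this abstract can be illustrated with a small sketch. It assumes a virtual population of size `n` in which at most two alleles segregate at a site; the abstract states neither the population size nor this restriction, and the function name `pomo_states` and the tuple encoding are hypothetical choices made only for illustration.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def pomo_states(n):
    """Enumerate the states of a PoMo-style alphabet.

    Assumption (not from the abstract): a virtual population of size n,
    with at most two alleles present at any site.
    Each state is a tuple of (allele, count) pairs whose counts sum to n.
    """
    if n < 2:
        raise ValueError("n must be at least 2 to allow polymorphic states")
    # Four fixed (monomorphic) states: every individual carries the same base.
    fixed = [((base, n),) for base in NUCLEOTIDES]
    # Polymorphic states: i copies of allele a and n - i copies of allele b.
    polymorphic = [
        ((a, i), (b, n - i))
        for a, b in combinations(NUCLEOTIDES, 2)
        for i in range(1, n)
    ]
    return fixed + polymorphic


if __name__ == "__main__":
    states = pomo_states(10)
    print(len(states))  # 4 fixed + 6 * (10 - 1) polymorphic states
```

Under these assumptions the alphabet grows from 4 states to 4 + 6(n - 1), which is why the choice of virtual population size trades resolution of allele frequencies against computational cost.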


PLoS ONE ◽  
2018 ◽  
Vol 13 (9) ◽  
pp. e0202890 ◽  
Author(s):  
Zsolt Bánfai ◽  
Valerián Ádám ◽  
Etelka Pöstyéni ◽  
Gergely Büki ◽  
Márta Czakó ◽  
...  

2012 ◽  
Vol 367 (1590) ◽  
pp. 793-799 ◽  
Author(s):  
Mark A. Jobling

The historical record tells us stories of migrations, population expansions and colonization events in the last few thousand years, but what was their demographic impact? Genetics can throw light on this issue, and has mostly done so through the maternally inherited mitochondrial DNA (mtDNA) and the male-specific Y chromosome. However, there are a number of problems, including marker ascertainment bias, possible influences of natural selection, and the obscuring layers of the palimpsest of historical and prehistorical events. Y-chromosomal lineages are particularly affected by genetic drift, which can be accentuated by recent social selection. A diversity of approaches to expansions in Europe is yielding insights into the histories of Phoenicians, Roma, Anglo-Saxons and Vikings, and new methods for producing and analysing genome-wide data hold much promise. The field would benefit from more consensus on appropriate methods, and better communication between geneticists and experts in other disciplines, such as history, archaeology and linguistics.


Author(s):  
Stefano Amente ◽  
Giovanni Scala ◽  
Barbara Majello ◽  
Somaiyeh Azmoun ◽  
Helen G. Tempest ◽  
...  

Abstract Exposures from the external and internal environments lead to the modification of genomic DNA, which is implicated in the cause of numerous diseases, including cancer, cardiovascular, pulmonary and neurodegenerative diseases, together with ageing. However, the precise mechanism(s) linking the presence of damage to its impact on cellular function and pathogenesis are far from clear. The genomic location of specific forms of damage is likely to be highly informative in understanding this process, as the impact of downstream events (e.g. mutation, microsatellite instability, altered methylation and gene expression) on cellular function will be positional: events at key locations will have the greatest impact. However, until recently, methods for assessing DNA damage determined the totality of damage across the genome, with no positional information. The technique of “mapping DNA adductomics” describes the molecular approaches that map a variety of forms of DNA damage to specific locations across the nuclear and mitochondrial genomes. We propose that integrated comparison of this information with other genome-wide data, such as mutational hotspots for specific genotoxins, tumour-specific mutation patterns and chromatin organisation and transcriptional activity in non-cancerous lesions (such as nevi), pre-cancerous conditions (such as polyps) and tumours, will improve our understanding of how environmental toxins lead to cancer. Adopting an analogous approach for non-cancer diseases, including the development of genome-wide assays for other cellular outcomes of DNA damage, will improve our understanding of the role of DNA damage in pathogenesis more generally.


2020 ◽  
Author(s):  
Patrícia Santos ◽  
Gloria Gonzalez-Fortes ◽  
Emiliano Trucchi ◽  
Andrea Ceolin ◽  
Guido Cordoni ◽  
...  

Abstract To reconstruct aspects of human demographic history, linguistics and genetics complement each other, reciprocally suggesting testable hypotheses on population relationships and interactions. Relying on a linguistic comparative method exclusively based on syntactic data, here we focus on the complex relation of genes and languages among Finno-Ugric (FU) speakers, in comparison to their Indo-European (IE) and Altaic (AL) neighbors. Syntactic analysis supports three distinct clusters corresponding to these three Eurasian families; yet, the outliers of the FU group show linguistic convergence with their geographical neighbors. By analyzing genome-wide data in both ancient and contemporary populations, we uncovered remarkably matching patterns, with north-western FU speakers linguistically and genetically closer in parallel degrees to their IE-speaking neighbors, and eastern FU speakers to AL-speakers. Therefore, our study indicates plausible secondary convergence in the syntax of languages of different families, providing evidence that such interference effects were accompanied, and possibly caused, by recognizable processes at the population level. In particular, based on the comparison of modern and ancient genomes, our analysis identified the Pontic-Caspian steppes as the possible origin of the demographic processes that led to the expansion of the FU into Europe.


2020 ◽  
Author(s):  
Nimrod Rappoport ◽  
Roy Safra ◽  
Ron Shamir

Abstract Recent advances in experimental biology allow creation of datasets where several genome-wide data types (called omics) are measured per sample. Integrative analysis of multi-omic datasets in general, and clustering of samples in such datasets specifically, can improve our understanding of biological processes and discover different disease subtypes. In this work we present Monet (Multi Omic clustering by Non-Exhaustive Types), which presents a unique approach to multi-omic clustering. Monet discovers modules of similar samples, such that each module is allowed to have a clustering structure for only a subset of the omics. This approach differs from most extant multi-omic clustering algorithms, which assume a common structure across all omics, and from several recent algorithms that model distinct cluster structures using Bayesian statistics. We tested Monet extensively on simulated data, on an image dataset, and on ten multi-omic cancer datasets from TCGA. Our analysis shows that Monet compares favorably with other multi-omic clustering methods. We demonstrate Monet’s biological and clinical relevance by analyzing its results for Ovarian Serous Cystadenocarcinoma. We also show that Monet is robust to missing data, can cluster genes in multi-omic datasets, and can reveal modules of cell types in single-cell multi-omic data. Our work shows that Monet is a valuable tool that can provide complementary results to those provided by extant algorithms for multi-omic analysis.


2019 ◽  
Author(s):  
Johan Pensar ◽  
Santeri Puranen ◽  
Neil MacAlasdair ◽  
Juri Kuronen ◽  
Gerry Tonkin-Hill ◽  
...  

Abstract Discovery of polymorphisms under co-selective pressure or epistasis has received considerable recent attention in population genomics. Both statistical modeling of the population level co-variation of alleles across the chromosome and model-free testing of dependencies between pairs of polymorphisms have been shown to successfully uncover patterns of selection in bacterial populations. Here we introduce a model-free method, SpydrPick, whose computational efficiency enables analysis at the scale of pan-genomes of many bacteria. SpydrPick incorporates an efficient correction for population structure, which is demonstrated to maintain a very low rate of false positive findings among those SNP pairs highlighted to deviate significantly from the null hypothesis of neutral co-evolution in simulated data. We also introduce a new type of visualization of the results similar to the Manhattan plots used in genome-wide association studies, which enables rapid exploration of the identified signals of co-evolution. Application of the method to large population genomic data sets of two major human pathogens, Streptococcus pneumoniae and Neisseria meningitidis, revealed both previously identified and novel putative targets of co-selection related to virulence and antibiotic resistance, highlighting the potential of this approach to drive molecular discoveries, even in the absence of phenotypic data.
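To make "model-free testing of dependencies between pairs of polymorphisms" concrete, here is a minimal sketch that scores every pair of alignment columns by their empirical mutual information. Mutual information is an assumed choice of dependency measure; the abstract does not name the statistic used. The sketch also omits the population-structure correction and significance testing that the abstract describes, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two equal-length columns."""
    if len(x) != len(y) or not x:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def rank_pairs(alignment):
    """Score all column pairs of an alignment (list of equal-length strings).

    Returns (score, i, j) tuples sorted from strongest to weakest dependency.
    Assumption: no correction for population structure is applied here.
    """
    columns = list(zip(*alignment))
    scores = [
        (mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(len(columns)), 2)
    ]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Toy alignment: columns 0 and 1 always co-vary, column 2 is independent.
    toy = ["AAC", "AAT", "GGC", "GGT"]
    for score, i, j in rank_pairs(toy):
        print(i, j, round(score, 3))
```

In real bacterial data, clonal population structure inflates such scores for many unlinked pairs, which is the reason a structure correction of the kind mentioned in the abstract is needed before interpreting the top-ranked pairs.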


2018 ◽  
Author(s):  
Rui Borges ◽  
Gergely Szöllősi ◽  
Carolin Kosiol

Abstract As multi-individual population-scale data are becoming available, more complex modeling strategies are needed to quantify the genome-wide patterns of nucleotide usage and associated mechanisms of evolution. Recently, the multivariate neutral Moran model was proposed. However, it was shown to be insufficient to explain the distribution of alleles in great apes. Here, we propose a new model that includes allelic selection. Our theoretical results constitute the basis of a new Bayesian framework to estimate mutation rates and selection coefficients from population data. We apply the new framework to a great ape dataset and find patterns of allelic selection that match those of genome-wide GC-biased gene conversion (gBGC). In particular, we show that great apes have patterns of allelic selection that vary in intensity, a feature that we correlated with the great apes’ distinct demographies. We also demonstrate that the AT/GC toggling effect decreases the probability of a substitution, promoting more polymorphisms in the base composition of great ape genomes. We further assess the impact of GC-bias in molecular analysis and we find that mutation rates and genetic distances are estimated with bias when gBGC is not properly accounted for. Our results contribute to the discussion on the tempo and mode of gBGC evolution, while stressing the need for gBGC-aware models in population genetics and phylogenetics.
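The effect of adding allelic selection to a Moran model can be sketched in a simplified setting. The example below is not the multivariate (four-allele) model with mutation proposed in the abstract; it assumes only two alleles, no mutation, and a favoured allele with relative fitness 1 + s, and it compares a simulation against the textbook fixation probability for that simplified process. All names and parameter values are illustrative.

```python
import random


def fixation_probability(i, n, s):
    """Fixation probability of an allele with relative fitness 1 + s,
    starting from i copies in a two-allele Moran population of size n."""
    if s == 0:
        return i / n
    r = 1.0 + s
    return (1.0 - r ** -i) / (1.0 - r ** -n)


def simulate_fixation(i, n, s, replicates, rng):
    """Estimate the same probability by simulating birth-death steps."""
    r = 1.0 + s
    fixed = 0
    for _ in range(replicates):
        count = i
        while 0 < count < n:
            # Parent is chosen in proportion to fitness, the individual
            # that dies is chosen uniformly at random.
            birth_is_a = rng.random() < r * count / (r * count + (n - count))
            death_is_a = rng.random() < count / n
            if birth_is_a and not death_is_a:
                count += 1
            elif death_is_a and not birth_is_a:
                count -= 1
        fixed += count == n
    return fixed / replicates


if __name__ == "__main__":
    rng = random.Random(1)
    print(fixation_probability(1, 10, 0.0))   # neutral expectation, 1/n
    print(fixation_probability(1, 10, 0.5))   # favoured allele
    print(simulate_fixation(1, 10, 0.5, 2000, rng))
```

Even in this reduced setting, a positive s raises the fixation probability above the neutral 1/n. A persistent fixation bias toward G and C alleles is therefore a plausible route by which gBGC-like selection distorts estimates made under neutrality.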


2020 ◽  
Vol 37 (11) ◽  
pp. 3292-3307
Author(s):  
Chao Zhang ◽  
Celine Scornavacca ◽  
Erin K Molloy ◽  
Siavash Mirarab

Abstract Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programming. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.
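For readers unfamiliar with quartet-based scoring, the sketch below counts the quartets on which two single-copy trees agree. This is the standard single-copy quartet score, not the orthology- and paralogy-aware measure that the abstract proposes, and it uses brute-force enumeration rather than dynamic programming. Trees are assumed to be nested tuples of leaf-name strings; all function names are hypothetical.

```python
from itertools import combinations


def collect_clusters(tree):
    """Return (leaf set, list of leaf sets below each edge) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = collect_clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
        found.append(child_leaves)
    return leaves, found


def quartet_topology(four, found):
    """Return the split {{a, b}, {c, d}} the tree induces on four leaves,
    or None if the quartet is unresolved."""
    a, b, c, d = four
    chosen = frozenset(four)
    for left, right in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        for cluster in found:
            inside = cluster & chosen
            # An edge separates left from right if one side holds exactly one pair.
            if inside == frozenset(left) or inside == frozenset(right):
                return frozenset([frozenset(left), frozenset(right)])
    return None


def quartet_similarity(tree_1, tree_2):
    """Count quartets resolved identically in both trees: (shared, total)."""
    leaves_1, found_1 = collect_clusters(tree_1)
    leaves_2, found_2 = collect_clusters(tree_2)
    if leaves_1 != leaves_2:
        raise ValueError("trees must share the same leaf set")
    shared = total = 0
    for four in combinations(sorted(leaves_1), 4):
        total += 1
        topology = quartet_topology(four, found_1)
        if topology is not None and topology == quartet_topology(four, found_2):
            shared += 1
    return shared, total


if __name__ == "__main__":
    gene_tree = ((("A", "B"), "C"), ("D", "E"))
    species_tree = ((("A", "C"), "B"), ("D", "E"))
    print(quartet_similarity(gene_tree, species_tree))
```

The enumeration visits every set of four leaves, so it is only practical for small trees. It also breaks down when a species appears more than once in a gene tree, which is the multicopy case the abstract's measure is designed to handle.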

