A Coalescent-Based Method for Detecting and Estimating Recombination From Gene Sequences

Genetics ◽  
2002 ◽  
Vol 160 (3) ◽  
pp. 1231-1241 ◽  
Author(s):  
Gil McVean ◽  
Philip Awadalla ◽  
Paul Fearnhead

Abstract Determining the amount of recombination in the genealogical history of a sample of genes is important to both evolutionary biology and medical population genetics. However, recurrent mutation can produce patterns of genetic diversity similar to those generated by recombination and can bias estimates of the population recombination rate. Hudson (2001) has suggested an approximate-likelihood method based on coalescent theory to estimate the population recombination rate, 4Ner, under an infinite-sites model of sequence evolution. Here we extend the method to the estimation of the recombination rate in genomes, such as those of many viruses and bacteria, where the rate of recurrent mutation is high. In addition, we develop a powerful permutation-based method for detecting recombination that is both more powerful than other permutation-based methods and robust to misspecification of the model of sequence evolution. We apply the method to sequence data from viruses, bacteria, and human mitochondrial DNA. The extremely high level of recombination detected in both HIV1 and HIV2 sequences demonstrates that recombination cannot be ignored in the analysis of viral population genetic data.
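As a rough illustration of what a permutation-based test for recombination can look like, the sketch below is a minimal Python example and not the authors' implementation. It assumes biallelic sites coded 0/1 and uses the correlation between pairwise linkage disequilibrium (r^2) and physical distance as the test statistic. The choice of statistic, the function names, and the toy data are all assumptions made for illustration. The reasoning is that without recombination, linkage disequilibrium should not depend on distance, so shuffling site positions gives a null distribution for the statistic.

```python
import random
from itertools import combinations


def r_squared(col_a, col_b):
    """Squared correlation (r^2) between two biallelic site columns coded 0/1."""
    n = len(col_a)
    p_a = sum(col_a) / n
    p_b = sum(col_b) / n
    p_ab = sum(a & b for a, b in zip(col_a, col_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # monomorphic site: LD is undefined, treat as zero
    d = p_ab - p_a * p_b
    return d * d / denom


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def pairwise_distances(positions, pairs):
    """Physical distance between the two sites of every pair."""
    return [abs(positions[i] - positions[j]) for i, j in pairs]


def permutation_test(columns, positions, n_perm=999, seed=0):
    """One-sided test: is LD more negatively correlated with distance than chance?"""
    pairs = list(combinations(range(len(columns)), 2))
    ld = [r_squared(columns[i], columns[j]) for i, j in pairs]  # computed once
    observed = pearson(ld, pairwise_distances(positions, pairs))
    rng = random.Random(seed)
    shuffled = list(positions)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign positions to sites at random
        if pearson(ld, pairwise_distances(shuffled, pairs)) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: two pairs of tightly linked sites, far apart.
toy_columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
toy_positions = [0, 1, 100, 101]
statistic, p_value = permutation_test(toy_columns, toy_positions)
```

With only four sites the permutation distribution is very coarse, so the toy data show the mechanics rather than a meaningful p-value.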

2019 ◽  
Vol 37 (5) ◽  
pp. 1495-1507 ◽  
Author(s):  
Zhengting Zou ◽  
Hongjiu Zhang ◽  
Yuanfang Guan ◽  
Jianzhi Zhang

Abstract Phylogenetic inference is of fundamental importance to evolutionary as well as other fields of biology, and molecular sequences have emerged as the primary data for this task. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods could fail due to model misspecification or insufficiency, especially in the face of heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees using deep residual neural networks, a machine learning approach needing no explicit modeling of the subject system and having a record of success in solving complex nonlinear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods such as the maximum likelihood method on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. Furthermore, when combined with the quartet puzzling algorithm, residual network predictors can be used to reconstruct trees with more than four taxa. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl; last accessed January 3, 2020).


2019 ◽  
Author(s):  
Zhengting Zou ◽  
Hongjiu Zhang ◽  
Yuanfang Guan ◽  
Jianzhi Zhang

Abstract Phylogenetic inference is of fundamental importance to evolutionary as well as other fields of biology, and molecular sequences have emerged as the primary data for this task. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods could fail due to model misspecification and insufficiency, especially in the face of heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees using deep residual neural networks, a machine learning approach needing no explicit modeling of the subject system and having a record of success in solving complex non-linear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods such as the maximum likelihood method on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl).


Genetics ◽  
2000 ◽  
Vol 155 (3) ◽  
pp. 1429-1437
Author(s):  
Oliver G Pybus ◽  
Andrew Rambaut ◽  
Paul H Harvey

Abstract We describe a unified set of methods for the inference of demographic history using genealogies reconstructed from gene sequence data. We introduce the skyline plot, a graphical, nonparametric estimate of demographic history. We discuss both maximum-likelihood parameter estimation and demographic hypothesis testing. Simulations are carried out to investigate the statistical properties of maximum-likelihood estimates of demographic parameters. The simulations reveal that (i) the performance of exponential growth model estimates is determined by a simple function of the true parameter values and (ii) under some conditions, estimates from reconstructed trees perform as well as estimates from perfect trees. We apply our methods to HIV-1 sequence data and find strong evidence that subtypes A and B have different demographic histories. We also provide the first (albeit tentative) genetic evidence for a recent decrease in the growth rate of subtype B.
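The skyline plot mentioned above can be illustrated with a short Python sketch. It assumes the classic form of the estimator, in which an interval of duration w during which k lineages are present yields the estimate w * k(k-1)/2, and it assumes all tips are sampled at the same time (height 0). The function name and the input format (a list of internal-node heights) are hypothetical choices for this example.

```python
def classic_skyline(coalescent_times):
    """Piecewise-constant population size estimates from a genealogy.

    coalescent_times: heights of the n-1 internal nodes of a tree with n
    contemporaneous tips, measured backwards from the tips at time 0.
    Returns a list of (interval_start, interval_end, estimate) tuples, in
    the same time units as the input.
    """
    times = sorted(coalescent_times)
    lineages = len(times) + 1  # n tips means n lineages at time 0
    previous = 0.0
    estimates = []
    for t in times:
        duration = t - previous
        # k lineages give k(k-1)/2 pairs that could coalesce in this interval
        estimates.append((previous, t, duration * lineages * (lineages - 1) / 2))
        previous = t
        lineages -= 1  # each coalescent event merges two lineages into one
    return estimates


# Hypothetical four-tip genealogy with internal nodes at these heights.
for start, end, size in classic_skyline([0.1, 0.3, 0.9]):
    print(f"{start:.2f}-{end:.2f}: {size:.3f}")
```

Plotting the estimates as a step function against time gives the skyline. If node heights are in substitutions per site, the estimates are scaled by the mutation rate accordingly.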


2021 ◽  
Author(s):  
Nicola De Maio ◽  
Lukas Weilguny ◽  
Conor R. Walker ◽  
Yatish Turakhia ◽  
Russell Corbett-Detig ◽  
...  

Abstract Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including new ones that we developed to more realistically model SARS-CoV-2 genome evolution.
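To make the Gillespie idea concrete, here is a minimal Python sketch of simulating substitutions along a single branch. It is not the phastSim code or its API. It assumes a Jukes-Cantor-style model in which every site mutates at rate 1 and each of the three other nucleotides is equally likely, and it picks the mutating site uniformly instead of using the multi-layered search tree described in the abstract. The function and variable names are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"


def evolve_branch(sequence, branch_length, rng):
    """Gillespie-style simulation of substitutions along one branch.

    Rather than visiting every site, draw the waiting time to the next
    substitution anywhere in the genome, then choose which site changes.
    On short branches only a few events occur, so most sites are never touched.
    """
    seq = list(sequence)
    events = []
    if not seq:
        return "", events
    total_rate = float(len(seq))  # assumed: rate 1 per site, equal for all sites
    t = rng.expovariate(total_rate)
    while t < branch_length:
        site = rng.randrange(len(seq))  # uniform because all site rates are equal
        old = seq[site]
        new = rng.choice([b for b in NUCLEOTIDES if b != old])
        events.append((t, site, old, new))
        seq[site] = new
        t += rng.expovariate(total_rate)
    return "".join(seq), events


rng = random.Random(42)
ancestor = "ACGT" * 250  # hypothetical 1,000-site genome
descendant, substitutions = evolve_branch(ancestor, 0.002, rng)
```

With unequal site rates, the site would instead be chosen in proportion to its rate. That selection step is the one a search tree over the genome can speed up.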


2020 ◽  
Author(s):  
Aaron J. Stern ◽  
Leo Speidel ◽  
Noah A. Zaitlen ◽  
Rasmus Nielsen

Abstract We present a full-likelihood method to estimate and quantify polygenic adaptation from contemporary DNA sequence data. The method combines population genetic DNA sequence data and GWAS summary statistics from up to thousands of nucleotide sites in a joint likelihood function to estimate the strength of transient directional selection acting on a polygenic trait. Through population genetic simulations of polygenic trait architectures and GWAS, we show that the method substantially improves power over current methods. We examine the robustness of the method under uncorrected GWAS stratification, uncertainty and ascertainment bias in the GWAS estimates of SNP effects, uncertainty in the identification of causal SNPs, allelic heterogeneity, negative selection, and low GWAS sample size. The method can quantify selection acting on correlated traits, fully controlling for pleiotropy even among traits with strong genetic correlation (|rg| = 80%; cf. schizophrenia and bipolar disorder) while retaining high power to attribute selection to the causal trait. We apply the method to study 56 human polygenic traits for signs of recent adaptation. We find signals of directional selection on pigmentation (tanning, sunburn, hair, P=5.5e-15, 1.1e-11, 2.2e-6, respectively), life history traits (age at first birth, EduYears, P=2.5e-4, 2.6e-4, respectively), glycated hemoglobin (HbA1c, P=1.2e-3), bone mineral density (P=1.1e-3), and neuroticism (P=5.5e-3). We also conduct joint testing of 137 pairs of genetically correlated traits. We find evidence of widespread correlated response acting on these traits (2.6-fold enrichment over the null expectation, P=1.5e-7). We find that for several traits previously reported as adaptive, such as educational attainment and hair color, a significant proportion of the signal of selection on these traits can be attributed to correlated response rather than direct selection (P=2.9e-6, 1.7e-4, respectively). Lastly, our joint test uncovers antagonistic selection that has acted to increase type 2 diabetes (T2D) risk and decrease HbA1c (P=1.5e-5).


2021 ◽  
Author(s):  
Jiayi Ji ◽  
Donavan J. Jackson ◽  
Adam D. Leaché ◽  
Ziheng Yang

In the past two decades genomic data have been widely used to detect historical gene flow between species in a variety of plants and animals. The Tamias quadrivittatus group of North American chipmunks, which originated through a series of rapid speciation events, is known to undergo massive amounts of mitochondrial introgression. Yet in a recent analysis of targeted nuclear loci from the group, no evidence for cross-species introgression was detected, indicating widespread cytonuclear discordance. The study used heuristic methods that analyze summaries of the multilocus sequence data to detect gene flow, which may suffer from low power. Here we use the full likelihood method implemented in the Bayesian program BPP to reanalyze these data. We take a stepwise approach to constructing an introgression model by adding introgression events onto a well-supported binary species tree. The analysis detected robust evidence for multiple ancient introgression events affecting the nuclear genome, with introgression probabilities reaching 65%. We estimate population parameters and highlight the fact that species divergence times may be seriously underestimated if ancient cross-species gene flow is ignored in the analysis. Our analyses highlight the importance of using adequate statistical methods to reach reliable biological conclusions concerning cross-species gene flow.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12129
Author(s):  
Paul E. Oluniyi ◽  
Fehintola Ajogbasile ◽  
Judith Oguzie ◽  
Jessica Uwanibe ◽  
Adeyemi Kayode ◽  
...  

Next generation sequencing (NGS)-based studies have vastly increased our understanding of viral diversity. Viral sequence data obtained from NGS experiments are a rich source of information: these data can be used to study viral epidemiology, evolution, and transmission patterns, and can also inform drug and vaccine design. Viral genomes, however, represent a great challenge to bioinformatics due to their high mutation rates and their tendency to form quasispecies within the same infected host, bringing about the need to implement advanced bioinformatics tools to assemble consensus genomes well-representative of the viral population circulating in individual patients. Many tools have been developed to preprocess sequencing reads, carry out de novo or reference-assisted assembly of viral genomes and assess the quality of the genomes obtained. Most of these tools however exist as standalone workflows and usually require huge computational resources. Here we present VGEA (Viral Genomes Easily Analyzed), a Snakemake workflow for analyzing RNA viral genomes. VGEA enables users to map sequencing reads to the human genome to remove human contaminants, split bam files into forward and reverse reads, carry out de novo assembly of forward and reverse reads to generate contigs, pre-process reads for quality and contamination, map reads to a reference tailored to the sample using corrected contigs supplemented by the user’s choice of reference sequences and evaluate/compare genome assemblies. We designed a project with the aim of creating a flexible, easy-to-use and all-in-one pipeline from existing/stand-alone bioinformatics tools for viral genome analysis that can be deployed on a personal computer.
VGEA was built on the Snakemake workflow management system and utilizes existing tools for each step: fastp (Chen et al., 2018) for read trimming and read-level quality control, BWA (Li & Durbin, 2009) for mapping sequencing reads to the human reference genome, SAMtools (Li et al., 2009) for extracting unmapped reads and also for splitting bam files into fastq files, IVA (Hunt et al., 2015) for de novo assembly to generate contigs, shiver (Wymant et al., 2018) to pre-process reads for quality and contamination, then map to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences, SeqKit (Shen et al., 2016) for cleaning shiver assembly for QUAST, QUAST (Gurevich et al., 2013) to evaluate/assess the quality of genome assemblies and MultiQC (Ewels et al., 2016) for aggregation of the results from fastp, BWA and QUAST. Our pipeline was successfully tested and validated with SARS-CoV-2 (n = 20), HIV-1 (n = 20) and Lassa Virus (n = 20) datasets all of which have been made publicly available. VGEA is freely available on GitHub at: https://github.com/pauloluniyi/VGEA under the GNU General Public License.


Phytotaxa ◽  
2014 ◽  
Vol 189 (1) ◽  
pp. 186 ◽  
Author(s):  
JOEL A. MERCADO-DÍAZ ◽  
ROBERT LÜCKING ◽  
SITTIPORN PARNMEN

Two new genera and twelve new species of Graphidaceae are described from Puerto Rico. The two new genera, Borinquenotrema and Paratopeliopsis, are based on a combination of molecular sequence data and phenotype characters. Borinquenotrema, with the single new species B. soredicarpum, features rounded ascomata developing beneath and persistently covered with soralia and with an internal anatomy reminiscent of Carbacanthographis; it is close to the tribe Ocellularieae. Paratopeliopsis, including the single new species P. caraibica, resembles a miniature Topeliopsis but differs in the distinctly farinose thallus and the small, brown ascospores; it is not closely related to the latter genus but belongs in tribe Thelotremateae. The other ten new species belong in the genera Acanthotrema, Clandestinotrema, Compositrema, Fissurina, Ocellularia, and Thalloloma. Acanthotrema alboisidiatum is closely related to A. brasilianum but differs in the short, white isidia resembling insect eggs. Clandestinotrema portoricense has a unique ascospore type with a longitudinal septum only in the proximal cell. Compositrema borinquense resembles a species of Stegobolus but belongs in Compositrema based on sequence data, and is characterized by ascomata with a unique columella composed of thick, irregularly radiating strands. The second new species in this genus, C. isidiofarinosum, differs by its ecorticate, farinose thallus with scattered, corticate isidia and by its small ascomata with inconspicuous columella. The three new species of Fissurina all have 3-septate ascospores and are otherwise characterized by an isidiate thallus and stellate, orange-yellow lirellae (F. aurantiacostellata), a verrucose thallus strongly encrusted with calcium oxalate crystals and white, irregularly branched lirellae (F. crystallifera), and myriotremoid ascomata arranged in short lines (F. monilifera). Ocellularia portoricensis belongs in the core group of Ocellularia and differs from O. cavata in the white medulla and the larger ascospores becoming brown, whereas O. vulcanisorediata produces prominent soralia and immersed ascomata with apically carbonized excipulum and columella and small, transversely septate, hyaline ascospores; it is closely related to O. conformalis. Finally, Thalloloma rubromarginatum resembles T. haemographum in the brownish lirellae with bright red margin but differs from that and other species in the corticate thallus and the norstictic acid chemistry. The new combination Ampliotrema rimosum (Hale) Mercado-Díaz, Lücking & Parnmen is also proposed. Considering the current biodiversity knowledge on this family, the high level of endemism observed in other groups of organisms in the island, and the relatively high number of Graphidaceae described, it is highly likely that at least some of these new taxa are endemic to the island. This view is further supported by the unique features of several of the new species, representing novel characters in the corresponding genera.


1988 ◽  
Vol 8 (10) ◽  
pp. 4243-4249
Author(s):  
J Filmus ◽  
J G Church ◽  
R N Buick

We report the isolation of a cDNA clone corresponding to a transcript that is accumulated differentially in rat intestine during development. Clone OCI-5 was selected from the rat intestinal cell line IEC-18, which represents primitive intestinal epithelial crypt cells. Expression was high in rat fetal intestine between 15 and 19 days of development and thereafter was progressively down regulated, becoming undetectable after weaning. Clone OCI-5 detected homologous sequences in human and murine cells. In particular, a high level of expression was detected in CaCo-2, a human colon carcinoma cell line, which is known to express molecules characteristic of fetal small intestinal cells. Expression of a homologous gene was also detected in F9 murine teratocarcinoma cells when they were induced to differentiate into parietal or visceral endodermlike cells. When IEC-18 cells were transformed by activated H-ras or v-src genes, expression of clone OCI-5 was suppressed; the degree of down-regulation correlated with the extent of morphological change induced in the transformed IEC-18 cells. The sequence of clone OCI-5 showed an open reading frame that was capable of encoding a protein of 597 amino acids, but no strong homology was found with any of the proteins registered in the protein sequence data base.


Author(s):  
Christopher Wills

No field of science has cast more light on both the past and the future of our species than evolutionary biology. Recently, the pace of new discoveries about how we have evolved has increased (Culotta and Pennisi, 2005). It is now clear that we are less unique than we used to think. Genetic and palaeontological evidence is now accumulating that hominids with a high level of intelligence, tool-making ability, and probably communication skills have evolved independently more than once. They evolved in Africa (our own ancestors), in Europe (the ancestors of the Neanderthals) and in Southeast Asia (the remarkable ‘hobbits’, who may be miniaturized and highly acculturated Homo erectus). It is also becoming clear that the genes that contribute to the characteristics of our species can be found and that the histories of these genes can be understood. Comparisons of entire genomes have shown that genes involved in brain function have evolved more quickly in hominids than in more distantly related primates. The genetic differences among human groups can now be investigated. Characters that we tend to think of as extremely important markers enabling us to distinguish among different human groups now turn out to be understandable at the genetic level, and their genetic history can be traced. Recently a single allelic difference between Europeans and Africans has been found (Lamason et al., 2005). This functional allelic difference accounts for about a third of the differences in skin pigmentation in these groups. Skin colour differences, in spite of the great importance they have assumed in human societies, are the result of natural selection acting on a small number of genes that are likely to have no effects beyond their influence on skin colour itself. How do these and other recent findings from fields ranging from palaeontology to molecular biology fit into present-day evolution theory, and what light do they cast on how our species is likely to evolve in the future? 
I will introduce this question by examining briefly how evolutionary change takes place.

