Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

While aligning sequences has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods have much appeal in terms of simplifying the process of inference, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for some emerging forms of data such as genome skims, which cannot be assembled. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is that they typically rely on simplified models of sequence evolution such as Jukes-Cantor. It is possible to compute pairwise distances under more complex models by computing frequencies of base substitutions, provided that these quantities can be estimated in the alignment-free setting. A particular limitation is that for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the strand of DNA sequences is unknown. Under such conditions, the so-called no-strand-bias models are the most complex models that can be used. Here, we show how to calculate distances under a no-strand-bias restriction of the General Time Reversible (GTR) model called TK4 without relying on alignments. The method relies on replacing letters in the input sequences and then computing Jaccard indices between k-mer sets. For the method to work on large genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance. We show in simulation that these alignment-free distances can be highly accurate when genomes evolve under the assumed models, and we examine the effectiveness of the method on real genomic data.
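The idea of replacing letters and then comparing k-mer sets can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the purine/pyrimidine replacement table, and the choice to pool k-mers from both strands (to cope with unknown strand) are assumptions made for illustration, and the sketch does not include the correction for chance k-mer matches or the conversion from Jaccard indices to TK4 distances.

```python
# Illustrative sketch only; names and the replacement table are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical replacement: collapse purines (A, G) to R and pyrimidines (C, T) to Y.
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_set(seq, k, replacement=None):
    """Collect k-mers from both strands, optionally after letter replacement."""
    table = str.maketrans(replacement) if replacement else None
    kmers = set()
    for strand in (seq, reverse_complement(seq)):
        if table is not None:
            strand = strand.translate(table)
        for i in range(len(strand) - k + 1):
            kmers.add(strand[i:i + k])
    return kmers


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


plain = jaccard(kmer_set("AAAA", 2), kmer_set("GGGG", 2))
merged = jaccard(
    kmer_set("AAAA", 2, PURINE_PYRIMIDINE),
    kmer_set("GGGG", 2, PURINE_PYRIMIDINE),
)
print(plain, merged)
```

In this toy input the two sequences share no k-mers before replacement, but become indistinguishable once A and G are merged, which is the kind of contrast between replaced and unreplaced Jaccard indices that carries information about specific substitution types.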

Are Nonsynonymous Transversions Generally More Deleterious than Nonsynonymous Transitions?

Molecular Biology and Evolution ◽

10.1093/molbev/msaa200 ◽

2020 ◽

Vol 38 (1) ◽

pp. 181-191

Author(s):

Zhengting Zou ◽

Jianzhi Zhang

Keyword(s):

Amino Acid ◽

Dna Sequences ◽

Sequence Evolution ◽

Codon Model ◽

Protein Coding ◽

Fitness Effects ◽

Genome Wide ◽

Species Pairs ◽

Species Specific ◽

Evolutionary Lineages

Abstract It has been suggested that, due to the structure of the genetic code, nonsynonymous transitions are less likely than transversions to cause radical changes in amino acid physicochemical properties and so are on average less deleterious. This view was supported by some but not all mutagenesis experiments. Because laboratory measures of fitness effects have limited sensitivities and relative frequencies of different mutations in mutagenesis studies may not match those in nature, we here revisit this issue using comparative genomics. We extend the standard codon model of sequence evolution by adding the parameter η that quantifies the ratio of the fixation probability of transitional nonsynonymous mutations to that of transversional nonsynonymous mutations. We then estimate η from the concatenated alignment of all protein-coding DNA sequences of two closely related genomes. Surprisingly, η ranges from 0.13 to 2.0 across 90 species pairs sampled from the tree of life, with 51 incidences of η < 1 and 30 incidences of η > 1 that are statistically significant. Hence, whether nonsynonymous transversions are overall more deleterious than nonsynonymous transitions is species-dependent. Because the corresponding groups of amino acid replacements differ between nonsynonymous transitions and transversions, η is influenced by the relative exchangeabilities of amino acid pairs. Indeed, an extensive search reveals that the large variation in η is primarily explainable by the recently reported among-species disparity in amino acid exchangeabilities. These findings demonstrate that genome-wide nucleotide substitution patterns in coding sequences have species-specific features and are more variable among evolutionary lineages than is currently thought.
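The four mutation classes that η contrasts can be made concrete with a short Python sketch. It only classifies a single-nucleotide codon change under the standard genetic code; it does not fit a codon model or estimate η, and the function names are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and ((a in PURINES) == (b in PURINES))


def classify(codon_from, codon_to):
    """Classify a single-nucleotide codon change (hypothetical helper)."""
    diffs = [(x, y) for x, y in zip(codon_from, codon_to) if x != y]
    if len(diffs) != 1:
        raise ValueError("codons must differ at exactly one position")
    x, y = diffs[0]
    kind = "transition" if is_transition(x, y) else "transversion"
    same_aa = CODON_TABLE[codon_from] == CODON_TABLE[codon_to]
    return kind, "synonymous" if same_aa else "nonsynonymous"


print(classify("AAA", "AGA"))  # Lys -> Arg via A->G
print(classify("GAT", "GAA"))  # Asp -> Glu via T->A
print(classify("GAT", "GAC"))  # Asp -> Asp via T->C
```

Only changes labelled nonsynonymous enter the comparison that η summarizes: the parameter is the ratio of fixation probabilities between the nonsynonymous transitions and the nonsynonymous transversions.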

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Bioinformatics ◽

10.1093/bioinformatics/btab083 ◽

2021 ◽

Author(s):

Yanrong Ji ◽

Zhihan Zhou ◽

Han Liu ◽

Ramana V Davuluri

Keyword(s):

Dna Sequences ◽

Regulatory Elements ◽

Ease Of Use ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Motifs ◽

Semantic Relationship ◽

Accurate Identification ◽

Conserved Sequence ◽

Genome Wide

Abstract Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation: The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.

Genetic analysis of amyotrophic lateral sclerosis identifies contributing pathways and cell types

Science Advances ◽

10.1126/sciadv.abd9036 ◽

2021 ◽

Vol 7 (3) ◽

pp. eabd9036

Author(s):

Sara Saez-Atienzar ◽

Sara Bandres-Ciga ◽

Rebekah G. Langston ◽

Jonggeol J. Kim ◽

Shing Wan Choi ◽

...

Keyword(s):

Amyotrophic Lateral Sclerosis ◽

Membrane Trafficking ◽

Molecular Mechanisms ◽

Cell Types ◽

Polygenic Risk Score ◽

Genome Wide ◽

Genome Wide Data ◽

Data Driven Approach ◽

Single Nucleus ◽

Lateral Sclerosis

Despite the considerable progress in unraveling the genetic causes of amyotrophic lateral sclerosis (ALS), we do not fully understand the molecular mechanisms underlying the disease. We analyzed genome-wide data involving 78,500 individuals using a polygenic risk score approach to identify the biological pathways and cell types involved in ALS. This data-driven approach identified multiple aspects of the biology underlying the disease that resolved into broader themes, namely, neuron projection morphogenesis, membrane trafficking, and signal transduction mediated by ribonucleotides. We also found that genomic risk in ALS maps consistently to GABAergic interneurons and oligodendrocytes, as confirmed in human single-nucleus RNA-seq data. Using two-sample Mendelian randomization, we nominated six differentially expressed genes (ATG16L2, ACSL5, MAP1LC3A, MAPKAPK3, PLXNB2, and SCFD1) within the significant pathways as relevant to ALS. We conclude that the disparate genetic etiologies of this fatal neurological disease converge on a smaller number of final common pathways and cell types.
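For readers unfamiliar with polygenic risk scores, the core quantity is a weighted sum of risk-allele counts. The Python sketch below shows only that sum. The function name and the numbers are hypothetical, and it is not the pathway-level pipeline used in the study, which involves further steps such as variant selection.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum of allele dosage (0, 1 or 2) times per-variant effect size.

    Hypothetical helper: effect sizes would normally come from GWAS
    summary statistics (for example, log odds ratios).
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Made-up values for one individual across three variants.
score = polygenic_risk_score([0, 1, 2], [0.5, -0.25, 1.0])
print(score)
```

Restricting the variants in the sum to those in a given gene set yields a pathway-specific score, which is the general idea behind asking which pathways carry genetic risk.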

Distinct regulation of hippocampal neuroplasticity and ciliary genes by corticosteroid receptors

Nature Communications ◽

10.1038/s41467-021-24967-z ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Karen R. Mifsud ◽

Clare L. M. Kennedy ◽

Silvia Salatino ◽

Eshita Sharma ◽

Emily M. Price ◽

...

Keyword(s):

Dna Sequences ◽

Glucocorticoid Receptors ◽

Acute Stress ◽

Circadian Variation ◽

Rna Seq ◽

Physiological Regulation ◽

Behavioural Adaptation ◽

Neuronal Progenitor ◽

Genome Wide ◽

Transcriptional Changes

Abstract Glucocorticoid hormones (GCs), acting through hippocampal mineralocorticoid receptors (MRs) and glucocorticoid receptors (GRs), are critical to physiological regulation and behavioural adaptation. We conducted genome-wide MR and GR ChIP-seq and Ribo-Zero RNA-seq studies on rat hippocampus to elucidate MR- and GR-regulated genes under circadian variation or acute stress. In a subset of genes, these physiological conditions resulted in enhanced MR and/or GR binding to DNA sequences and associated transcriptional changes. Binding of MR at a substantial number of sites, however, remained unchanged. MR and GR binding occur at overlapping as well as distinct loci. Moreover, although the GC response element (GRE) was the predominant motif, the transcription factor recognition site composition within MR and GR binding peaks shows marked differences. Pathway analysis uncovered that MR and GR regulate a substantial number of genes involved in synaptic/neuro-plasticity, cell morphology and development, behavior, and neuropsychiatric disorders. We find that MR, not GR, is the predominant receptor binding to >50 ciliary genes, and that MR function is linked to neuronal differentiation and ciliogenesis in human fetal neuronal progenitor cells. These results show that hippocampal MRs and GRs constitutively and dynamically regulate genomic activities underpinning neuronal plasticity and behavioral adaptation to changing environments.

Ancient genomic time transect from the Central Asian Steppe unravels the history of the Scythians

Science Advances ◽

10.1126/sciadv.abe4414 ◽

2021 ◽

Vol 7 (13) ◽

pp. eabe4414

Author(s):

Guido Alberto Gnecchi-Ruscone ◽

Elmira Khussainova ◽

Nurzhibek Kahbatkyzy ◽

Lyazzat Musralina ◽

Maria A. Spyrou ◽

...

Keyword(s):

Bronze Age ◽

Iron Age ◽

Gene Pools ◽

Social Rules ◽

Eurasian Steppe ◽

Central Asian ◽

Genome Wide ◽

Genome Wide Data ◽

History Of ◽

First Millennium

The Scythians were a multitude of horse-warrior nomad cultures dwelling in the Eurasian steppe during the first millennium BCE. Because of the lack of first-hand written records, little is known about the origins and relations among the different cultures. To address these questions, we produced genome-wide data for 111 ancient individuals retrieved from 39 archaeological sites from the first millennia BCE and CE across the Central Asian Steppe. We uncovered major admixture events in the Late Bronze Age forming the genetic substratum for two main Iron Age gene-pools emerging around the Altai and the Urals respectively. Their demise was mirrored by new genetic turnovers, linked to the spread of the eastern nomad empires in the first centuries CE. Compared to the high genetic heterogeneity of the past, the homogenization of the present-day Kazakhs gene pool is notable, likely a result of 400 years of strict exogamous social rules.

Genome diversity in Ukraine

GigaScience ◽

10.1093/gigascience/giaa159 ◽

2021 ◽

Vol 10 (1) ◽

Author(s):

Taras K Oleksyk ◽

Walter W Wolfsberger ◽

Alexandra M Weber ◽

Khrystyna Shchubelka ◽

Olga T Oleksyk ◽

...

Keyword(s):

Sequence Data ◽

Copy Number Variations ◽

Genomic Variation ◽

High Coverage ◽

Genome Data ◽

New Information ◽

Genome Wide ◽

Public Data ◽

Genome Wide Data ◽

Multiple Samples

Abstract Background: The main goal of this collaborative effort is to provide genome-wide data for the previously underrepresented population in Eastern Europe, and to provide cross-validation of the data from genome sequences and genotypes of the same individuals acquired by different technologies. We collected 97 genome-grade DNA samples from individuals representing major regions of Ukraine who consented to public data release. BGISEQ-500 sequence data and genotypes from an Illumina GWAS chip were cross-validated on multiple samples and additionally referenced to 1 sample that has been resequenced by Illumina NovaSeq6000 S4 at high coverage. Results: The genome data have been searched for genomic variation represented in this population, and a number of variants have been reported: large structural variants, indels, copy number variations, single-nucleotide polymorphisms, and microsatellites. To our knowledge, this study provides the largest to-date survey of genetic variation in Ukraine, creating a public reference resource aiming to provide data for medical research in a large understudied population. Conclusions: Our results indicate that the genetic diversity of the Ukrainian population is uniquely shaped by evolutionary and demographic forces and cannot be ignored in future genetic and biomedical studies. These data will contribute a wealth of new information, bringing forth many novel, endemic and medically related alleles.

Initial Upper Palaeolithic humans in Europe had recent Neanderthal ancestry

Nature ◽

10.1038/s41586-021-03335-3 ◽

2021 ◽

Vol 592 (7853) ◽

pp. 253-257 ◽

Cited By ~ 3

Author(s):

Mateja Hajdinjak ◽

Fabrizio Mafessoni ◽

Laurits Skov ◽

Benjamin Vernot ◽

Alexander Hübner ◽

...

Keyword(s):

Family History ◽

East Asia ◽

Late Pleistocene ◽

Modern Human ◽

Human Migration ◽

Upper Palaeolithic ◽

Modern Humans ◽

Genome Wide ◽

Genome Wide Data

Abstract Modern humans appeared in Europe by at least 45,000 years ago [1–5], but the extent of their interactions with Neanderthals, who disappeared by about 40,000 years ago [6], and their relationship to the broader expansion of modern humans outside Africa are poorly understood. Here we present genome-wide data from three individuals dated to between 45,930 and 42,580 years ago from Bacho Kiro Cave, Bulgaria [1,2]. They are the earliest Late Pleistocene modern humans known to have been recovered in Europe so far, and were found in association with an Initial Upper Palaeolithic artefact assemblage. Unlike two previously studied individuals of similar ages from Romania [7] and Siberia [8] who did not contribute detectably to later populations, these individuals are more closely related to present-day and ancient populations in East Asia and the Americas than to later west Eurasian populations. This indicates that they belonged to a modern human migration into Europe that was not previously known from the genetic record, and provides evidence that there was at least some continuity between the earliest modern humans in Europe and later people in Eurasia. Moreover, we find that all three individuals had Neanderthal ancestors a few generations back in their family history, confirming that the first European modern humans mixed with Neanderthals and suggesting that such mixing could have been common.

A curated dataset of modern and ancient high-coverage shotgun human genomes

Scientific Data ◽

10.1038/s41597-021-00980-1 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Pierpaolo Maisano Delser ◽

Eppie R. Jones ◽

Anahit Hovhannisyan ◽

Lara Cassidy ◽

Ron Pinhasi ◽

...

Keyword(s):

Sequence Data ◽

Whole Genome ◽

Reference Dataset ◽

High Coverage ◽

Sample Distribution ◽

Human Samples ◽

Human Genomes ◽

Genome Wide ◽

Genome Wide Data ◽

Computationally Intensive

Abstract Over the last few years, genome-wide data for a large number of ancient human samples have been collected. Whilst datasets of captured SNPs have been collated, high coverage shotgun genomes (which are relatively few but allow certain types of analyses not possible with ascertained captured SNPs) have to be reprocessed by individual groups from raw reads. This task is computationally intensive. Here, we release a dataset including 35 whole-genome sequenced samples, previously published and distributed worldwide, together with the genetic pipeline used to process them. The dataset contains 72,041,355 sites called across 19 ancient and 16 modern individuals and includes sequence data from four previously published ancient samples which we sequenced to higher coverage (10–18x). Such a resource will allow researchers to analyse their new samples with the same genetic pipeline and directly compare them to the reference dataset without re-processing published samples. Moreover, this dataset can be easily expanded to increase the sample distribution both across time and space.

Revealing the impact of the Caucasus region on the genetic legacy of Romani people from genome-wide data

PLoS ONE ◽

10.1371/journal.pone.0202890 ◽

2018 ◽

Vol 13 (9) ◽

pp. e0202890 ◽

Cited By ~ 1

Author(s):

Zsolt Bánfai ◽

Valerián Ádám ◽

Etelka Pöstyéni ◽

Gergely Büki ◽

Márta Czakó ◽

...

Keyword(s):

The Caucasus ◽

Caucasus Region ◽

Genome Wide ◽

Genome Wide Data ◽

The Impact

Genetic legacy of state centralization in the Kuba Kingdom of the Democratic Republic of the Congo

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1811211115 ◽

2018 ◽

Vol 116 (2) ◽

pp. 593-598 ◽

Cited By ~ 4

Author(s):

Lucy van Dorp ◽

Sara Lowes ◽

Jonathan L. Weigel ◽

Naser Ansari-Pour ◽

Saioa López ◽

...

Keyword(s):

Large Scale ◽

Democratic Republic ◽

Central Province ◽

Genome Wide ◽

Genetic Patterns ◽

Genetic Impact ◽

Genome Wide Data ◽

Republic Of The Congo ◽

History Of ◽

State Centralization

Few phenomena have had as profound or long-lasting consequences in human history as the emergence of large-scale centralized states in the place of smaller scale and more local societies. This study examines a fundamental, and yet unexplored, consequence of state formation: its genetic legacy. We studied the genetic impact of state centralization during the formation of the eminent precolonial Kuba Kingdom of the Democratic Republic of the Congo (DRC) in the 17th century. We analyzed genome-wide data from over 690 individuals sampled from 27 different ethnic groups from the Kasai Central Province of the DRC. By comparing genetic patterns in the present-day Kuba, whose ancestors were part of the Kuba Kingdom, with those in neighboring non-Kuba groups, we show that the Kuba today are more genetically diverse and more similar to other groups in the region than expected, consistent with the historical unification of distinct subgroups during state centralization. We also found evidence of genetic mixing dating to the time of the Kingdom at its most prominent. Using this unique dataset, we characterize the genetic history of the Kasai Central Province and describe the historic late wave of migrations into the region that contributed to a Bantu-like ancestry component found across large parts of Africa today. Taken together, we show the power of genetics to evidence events of sociopolitical importance and highlight how DNA can be used to better understand the behaviors of both people and institutions in the past.
