Genome diversity in Ukraine

Taras K Oleksyk; Walter W Wolfsberger; Alexandra M Weber; Khrystyna Shchubelka; Olga T Oleksyk; Olga Levchuk; Alla Patrus; Nelya Lazar; Stephanie O Castro-Marquez; Yaroslava Hasynets; Patricia Boldyzhar; Mikhailo Neymet; Alina Urbanovych; Viktoriya Stakhovska; Kateryna Malyar; Svitlana Chervyakova; Olena Podoroha; Natalia Kovalchuk; Juan L Rodriguez-Flores; Weichen Zhou; Sarah Medley; Fabia Battistuzzi; Ryan Liu; Yong Hou; Siru Chen; Huanming Yang; Meredith Yeager; Michael Dean; Ryan E Mills; Volodymyr Smolanka

doi:10.1093/gigascience/giaa159

Genome diversity in Ukraine

GigaScience, 2021, Vol 10 (1)
doi:10.1093/gigascience/giaa159

Author(s): Taras K Oleksyk; Walter W Wolfsberger; Alexandra M Weber; Khrystyna Shchubelka; Olga T Oleksyk; ...

Keyword(s): Sequence Data; Copy Number Variations; Genomic Variation; High Coverage; Genome Data; New Information; Genome Wide; Public Data; Genome Wide Data; Multiple Samples

Abstract Background: The main goal of this collaborative effort is to provide genome-wide data for a previously underrepresented population in Eastern Europe, and to cross-validate data from genome sequences and genotypes of the same individuals acquired by different technologies. We collected 97 genome-grade DNA samples from individuals representing major regions of Ukraine who consented to public data release. BGISEQ-500 sequence data and genotypes from an Illumina GWAS chip were cross-validated on multiple samples and additionally referenced to 1 sample that was resequenced on an Illumina NovaSeq6000 S4 at high coverage. Results: The genome data were searched for genomic variation represented in this population, and a number of variants are reported: large structural variants, indels, copy number variations, single-nucleotide polymorphisms, and microsatellites. To our knowledge, this study provides the largest survey to date of genetic variation in Ukraine, creating a public reference resource that aims to provide data for medical research in a large understudied population. Conclusions: Our results indicate that the genetic diversity of the Ukrainian population is uniquely shaped by evolutionary and demographic forces and cannot be ignored in future genetic and biomedical studies. These data will contribute a wealth of new information, bringing forth novel, endemic, and medically related alleles.

A curated dataset of modern and ancient high-coverage shotgun human genomes

Scientific Data, 2021, Vol 8 (1)
doi:10.1038/s41597-021-00980-1

Author(s): Pierpaolo Maisano Delser; Eppie R. Jones; Anahit Hovhannisyan; Lara Cassidy; Ron Pinhasi; ...

Keyword(s): Sequence Data; Whole Genome; Reference Dataset; High Coverage; Sample Distribution; Human Samples; Human Genomes; Genome Wide; Genome Wide Data; Computationally Intensive

Abstract Over the last few years, genome-wide data for a large number of ancient human samples have been collected. Whilst datasets of captured SNPs have been collated, high coverage shotgun genomes (which are relatively few but allow certain types of analyses not possible with ascertained captured SNPs) have to be reprocessed by individual groups from raw reads. This task is computationally intensive. Here, we release a dataset including 35 whole-genome sequenced samples, previously published and distributed worldwide, together with the genetic pipeline used to process them. The dataset contains 72,041,355 sites called across 19 ancient and 16 modern individuals and includes sequence data from four previously published ancient samples which we sequenced to higher coverage (10–18x). Such a resource will allow researchers to analyse their new samples with the same genetic pipeline and directly compare them to the reference dataset without re-processing published samples. Moreover, this dataset can be easily expanded to increase the sample distribution both across time and space.

A curated dataset of modern and ancient high-coverage shotgun human genomes

2020
doi:10.1101/2020.10.27.351692

Author(s): Pierpaolo Maisano Delser; Eppie R. Jones; Anahit Hovhannisyan; Lara Cassidy; Ron Pinhasi; ...

Keyword(s): Sequence Data; Whole Genome; Reference Dataset; High Coverage; Sample Distribution; Human Samples; Human Genomes; Genome Wide; Genome Wide Data; Computationally Intensive

Abstract Over the last few years, genome-wide data for a large number of ancient human samples have been collected. Whilst datasets of captured SNPs have been collated, high coverage shotgun genomes (which are relatively few but allow certain types of analyses not possible with ascertained captured SNPs) have to be reprocessed by individual groups from raw reads. This task is computationally intensive. Here, we release a dataset including 34 whole-genome sequenced samples, previously published and distributed worldwide, together with the genetic pipeline used to process them. The dataset contains 73,435,604 sites called across 18 ancient and 16 modern individuals and includes sequence data from four previously published ancient samples which we sequenced to higher coverage (10-18x). Such a resource will allow researchers to analyse their new samples with the same genetic pipeline and directly compare them to the reference dataset without re-processing published samples. Moreover, this dataset can be easily expanded to increase the sample distribution both across time and space.

Genome Diversity in Ukraine

2020
doi:10.1101/2020.08.07.238329

Author(s): Taras K. Oleksyk; Walter W. Wolfsberger; Alexandra Weber; Khrystyna Shchubelka; Olga T. Oleksyk; ...

Keyword(s): Eastern Europe; Association Studies; Population Diversity; Genomic Variation; Population Variation; Genome Wide Association Studies; Structural Variants; High Coverage; Genome Wide; Public Data

Abstract The main goal of this collaborative effort is to provide genome-wide data for a previously underrepresented population in Eastern Europe, and to cross-validate data from genome sequences and genotypes of the same individuals acquired by different technologies. We collected 97 genome-grade DNA samples from individuals representing major regions of Ukraine who consented to public data release. DNBSEQ-G50 sequences and genotypes from an Illumina GWAS chip were cross-validated on multiple samples, and additionally referenced to one sample that was resequenced on an Illumina NovaSeq6000 S4 at high coverage. The genome data were searched for genomic variation represented in this population, and a number of variants are reported: large structural variants, indels, CNVs, SNPs, and microsatellites. This study provides the largest survey to date of genetic variation in Ukraine, creating a public reference resource that aims to provide data for historic and medical research in a large understudied population. While most of the common variation is shared with other European populations, this survey of population variation contributes a number of novel SNPs and structural variants that have not been reported in the gnomAD/1KG databases representing the global distribution of genomic variation. These endemic variants will become a valuable resource for designing future population and clinical studies, help address questions about ancestry and admixture, and fill a missing place in the puzzle characterizing human population diversity in Eastern Europe. Our results indicate that the genetic diversity of the Ukrainian population is uniquely shaped by evolutionary and demographic forces and cannot be ignored in future genetic and biomedical studies. These data will contribute a wealth of new information, bringing forth different risk and/or protective alleles. The newly discovered low-frequency and local variants can be added to current genotyping arrays for genome-wide association studies, clinical trials, and genome assessment of proliferating cancer cells.

Genome-wide copy number variations in a large cohort of bantu African children

BMC Medical Genomics, 2021, Vol 14 (1)
doi:10.1186/s12920-021-00978-z

Author(s): Feyza Yilmaz; Megan Null; David Astling; Hung-Chun Yu; Joanne Cole; ...

Keyword(s): Copy Number; Developmental Disorders; African Ancestry; 22Q11.2 Deletion Syndrome; Copy Number Variations; Genomic Variation; Deletion Syndrome; Genome Wide; A Genome; Genomic Regions

Abstract Background: Copy number variations (CNVs) account for a substantial proportion of inter-individual genomic variation. However, a majority of genomic variation studies have focused on single-nucleotide variations (SNVs), with limited genome-wide analysis of CNVs in large cohorts, especially in populations that are under-represented in genetic studies, including people of African descent. Methods: We carried out a genome-wide copy number analysis in >3400 healthy Bantu Africans from Tanzania. Signal intensity data from high-density (>2.5 million probes) genotyping arrays were used for CNV calling with three algorithms: PennCNV, DNAcopy, and VanillaICE. Stringent quality metrics and filtering criteria were applied to obtain high-confidence CNVs. Results: We identified over 400,000 CNVs larger than 1 kilobase (kb), for an average of 120 CNVs (SE = 2.57) per individual. We detected 866 large CNVs (≥300 kb), some of which overlapped genomic regions previously associated with multiple congenital anomaly syndromes, including Prader-Willi/Angelman syndrome (Type 1) and 22q11.2 deletion syndrome. Furthermore, several of the common CNVs seen in our cohort (≥5%) overlap genes previously associated with developmental disorders. Conclusions: These findings may help refine the phenotypic outcomes and penetrance of variations affecting genes and genomic regions previously implicated in diseases. Our study provides one of the largest datasets of CNVs from individuals of African ancestry, enabling improved clinical evaluation and disease association of CNVs observed in research and clinical studies in African populations.

Range and niche expansion through multiple interspecific hybridization - a Genotyping by Sequencing analysis of Cherleria (Caryophyllaceae)

2020
doi:10.21203/rs.3.rs-22788/v1

Author(s): Abigail Jane Moore; Jennifer A. Messick; Joachim W. Kadereit

Keyword(s): Evolutionary History; Sequence Data; Genotyping By Sequencing; Balkan Peninsula; Sequencing Analysis; High Mountains; Genome Wide; Full Picture; Genome Wide Data; Phylogenetic Resolution

Abstract Background: Cherleria (Caryophyllaceae) is a circumboreal genus that also occurs in the high mountains of the northern hemisphere. In this study, we focus on a clade that diversified in the European High Mountains, which was identified using nuclear ribosomal (nrDNA) sequence data in a previous study. With the nrDNA data, all but one species were monophyletic, with little sequence variation within most species. Here, we use genotyping by sequencing (GBS) data to determine whether the nrDNA data showed the full picture of the evolution in the genomes of these species. Results: The overall relationships found with the GBS data were congruent with those from the nrDNA study. Most of the species were still monophyletic and many of the same subclades were recovered, including a clade of three narrow endemic species from Greece and a clade of largely calcifuge species. The GBS data provided additional resolution within the two species with the best sampling, C. langii and C. laricifolia, with structure that was congruent with geography. In addition, the GBS data showed significant hybridization between several species, including species whose ranges do not currently overlap. Conclusions: The hybridization led us to hypothesize that lineages came into contact on the Balkan Peninsula after they diverged, even when those lineages are no longer present on the Balkan Peninsula. Hybridization may also have helped lineages expand their niches to colonize new substrates and different areas. Not only do genome-wide data provide increased phylogenetic resolution of difficult nodes, they also give evidence for a more complex evolutionary history than can be depicted by a simple, branching phylogeny.

The International Genome Sample Resource (IGSR) collection of open human genomic variation resources

Nucleic Acids Research, 2019, Vol 48 (D1), pp. D941-D947
doi:10.1093/nar/gkz836
Cited By ~ 20

Author(s): Susan Fairley; Ernesto Lowy-Gallego; Emily Perry; Paul Flicek

Keyword(s): Sequence Data; Genomic Variation; 1000 Genomes Project; High Coverage; Web Based; 1000 Genomes; Open Consent; Unified View; Human Genomic; Project Data

Abstract To sustain and develop the largest fully open human genomic resources the International Genome Sample Resource (IGSR) (https://www.internationalgenome.org) was established. It is built on the foundation of the 1000 Genomes Project, which created the largest openly accessible catalogue of human genomic variation developed from samples spanning five continents. IGSR (i) maintains access to 1000 Genomes Project resources, (ii) updates 1000 Genomes Project resources to the GRCh38 human reference assembly, (iii) adds new data generated on 1000 Genomes Project cell lines, (iv) shares data from samples with a similarly open consent to increase the number of samples and populations represented in the resources and (v) provides support to users of these resources. Among recent updates are the release of variation calls from 1000 Genomes Project data calculated directly on GRCh38 and the addition of high coverage sequence data for the 2504 samples in the 1000 Genomes Project phase three panel. The data portal, which facilitates web-based exploration of the IGSR resources, has been updated to include samples which were not part of the 1000 Genomes Project and now presents a unified view of data and samples across almost 5000 samples from multiple studies. All data is fully open and publicly accessible.

The counteracting effects of demography on functional genomic variation: the Roma paradigm

Molecular Biology and Evolution, 2021
doi:10.1093/molbev/msab070

Author(s): Neus Font-Porterias; Rocio Caro-Consuegra; Marcel Lucas-Sánchez; Marie Lopez; Aaron Giménez; ...

Keyword(s): Gene Flow; Rare Variants; Demographic History; Genomic Variation; Human Populations; History Plays; High Coverage; Roma Population; Genome Wide; Whole Exome

Abstract Demographic history plays a major role in shaping the distribution of genomic variation. Yet the interaction between different demographic forces and their effects in the genomes is not fully resolved in human populations. Here we focus on the Roma population, the largest transnational ethnic minority in Europe. They have a South Asian origin and their demographic history is characterized by recent dispersals, multiple founder events and extensive gene flow from non-Roma groups. Through the analyses of new high-coverage whole exome sequences and genome-wide array data for 89 Iberian Roma individuals together with forward simulations, we show that founder effects have reduced their genetic diversity and proportion of rare variants, gene flow has counteracted the increase in mutational load, runs of homozygosity show ancestry-specific patterns of accumulation of deleterious homozygotes, and selection signals primarily derive from pre-admixture adaptation in the Roma population sources. The present study shows how two demographic forces, bottlenecks and admixture, act in opposite directions and have long-term balancing effects on the Roma genomes. Understanding how demography and gene flow shape the genome of an admixed population provides an opportunity to elucidate how genomic variation is modelled in human populations.

The SEQC2 Epigenomics Quality Control (EpiQC) Study: Comprehensive Characterization of Epigenetic Methods, Reproducibility, and Quantification

2020
doi:10.1101/2020.12.14.421529

Author(s): Jonathan Foox; Jessica Nordlund; Claudia Lalancette; Ting Gong; Michelle Lacey; ...

Keyword(s): Quality Control; Bisulfite Sequencing; Sequence Data; Basic Research; High Coverage; Genome Wide; Wide Range; Genome Bisulfite Sequencing; Mammalian Genomes; Primary Focus

Abstract Detection of DNA cytosine modifications such as 5-methylcytosine (5mC) and 5-hydroxy-methylcytosine (5hmC) is essential for understanding the epigenetic changes that guide development, cellular lineage specification, and disease. The wide variety of approaches available to interrogate these modifications has created a need for harmonized materials, methods, and rigorous benchmarking to improve genome-wide methylome sequencing applications in clinical and basic research. We present a multi-platform assessment and a global resource for epigenetics research from the FDA’s Epigenomics Quality Control (EpiQC) Group. The study design leverages seven human cell lines that are publicly available from the National Institute of Standards and Technology (NIST) and Genome in a Bottle (GIAB) consortium. These genomes were subject to a variety of genome-wide methylation interrogation approaches across six independent laboratories. Our primary focus was on cytosine modifications found in mammalian genomes (5mC, 5hmC). Each sample was processed in two or more technical replicates by three whole-genome bisulfite sequencing (WGBS) protocols (TruSeq DNA methylation, Accel-NGS, SPLAT), oxidative bisulfite sequencing (oxBS), Enzymatic Methyl-seq (EM-seq), Illumina EPIC targeted-methylation sequencing, and ATAC-seq. Each library was sequenced to high coverage on an Illumina NovaSeq 6000. The data were subject to rigorous quality assessment and subsequently compared to Illumina EPIC methylation microarrays. We provide a wide range of sequence data for commonly used genomics reference materials, as well as best practices for epigenomics research. These findings can serve as a guide for researchers to enable epigenomic analysis of cellular identity in development, health, and disease.

A performance assessment of relatedness inference methods using genome-wide data from thousands of relatives

2017
doi:10.1101/106013

Author(s): Monica D. Ramstetter; Thomas D. Dyer; Donna M. Lehman; Joanne E. Curran; Ravindranath Duggirala; ...

Keyword(s): State Of The Art; Association Studies; Genetic Association Studies; Real Data; New Methods; Genome Wide; Genome Wide Data; Inference Methods; Multiple Samples; Combining Information

Abstract Inferring relatedness from genomic data is an essential component of genetic association studies, population genetics, forensics, and genealogy. While numerous methods exist for inferring relatedness, thorough evaluation of these approaches in real data has been lacking. Here, we report an assessment of 12 state-of-the-art pairwise relatedness inference methods using a dataset with 2,485 individuals contained in several large pedigrees that span up to six generations. We find that all methods have high accuracy (~92%–99%) when detecting first and second degree relationships, but their accuracy dwindles to less than 43% for seventh degree relationships. However, most IBD segment-based methods inferred seventh degree relatives correctly to within one relatedness degree for more than 76% of relative pairs. Overall, the most accurate methods are ERSA and approaches that compute total IBD sharing using the output from GERMLINE and Refined IBD to infer relatedness. Combining information from the most accurate methods provides little accuracy improvement, indicating that novel approaches (such as new methods that leverage relatedness signals from multiple samples) are needed to achieve a sizeable jump in performance.

Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus

2020
doi:10.1101/2020.05.05.079061
Cited By ~ 3

Author(s): Georg Hahn; Sanghun Lee; Scott T. Weiss; Christoph Lange

Keyword(s): Vaccine Development; Sequence Data; Principal Component; Nucleotide Sequencing; Published Data; Ongoing Research; Genome Data; Model Free; Genome Wide; A Genome

Abstract Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open access Virological website (http://virological.org/) early on January 11. We utilize the published data on the single-stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among many important research questions which are currently being investigated, one aspect pertains to the genetic characterization/classification of the virus. We analyze data on the nucleotide sequencing of the virus and geographic information of a subset of 7,640 SARS-CoV-2 patients without missing entries that are available in the GISAID database. Instead of modelling the mutation rate, applying phylogenetic tree approaches, etc., we here utilize a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index (Jaccard, 1901; Tan et al., 2005; Prokopenko et al., 2016; Schlauch et al., 2017). Our analysis of the SARS-CoV-2 genome data illustrates the geographic and chronological progression of the virus, starting from the first cases that were observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. It is the subject of ongoing research to examine whether the genetic subgroup could be related to disease outcome and its potential implications for vaccine development.
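
The clustering step described in this abstract (a Jaccard similarity matrix over all pairs of sequences, followed by principal component analysis) can be sketched roughly as below. This is a minimal illustration and not the authors' pipeline. The encoding of each genome as a 0/1 vector of mismatches against a reference, the toy matrix `X`, the use of double-centring before the eigendecomposition, and the function names `jaccard_matrix` and `top_components` are all assumptions made for this example.

```python
import numpy as np

def jaccard_matrix(X):
    """Pairwise Jaccard similarity between the rows of a 0/1 matrix."""
    X = np.asarray(X, dtype=np.int64)
    inter = X @ X.T                        # |A ∩ B| for every pair of rows
    counts = X.sum(axis=1)                 # |A| for every row
    union = counts[:, None] + counts[None, :] - inter
    # Two all-zero rows have an empty union; treat them as identical.
    safe_union = np.where(union == 0, 1, union)
    return np.where(union == 0, 1.0, inter / safe_union)

def top_components(S, k=2):
    """Coordinates of each sample on the top-k eigenvectors of the
    double-centred similarity matrix (one common way to run PCA on a
    similarity matrix; assumed here, not taken from the paper)."""
    n = S.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centring matrix
    C = J @ S @ J
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Hypothetical toy data: 6 genomes x 8 loci, 1 = differs from the reference.
# Rows 0-2 share variants at loci 0-3; rows 3-5 share variants at loci 4-7.
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 0],
])

S = jaccard_matrix(X)
pcs = top_components(S, k=2)
print(np.round(S, 2))
print(np.round(pcs, 2))
```

In this toy case the two groups of genomes share no variants, so the first component is expected to place them on opposite sides of zero. The sign of each component is arbitrary, so only the separation is meaningful, not which group is positive.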
