The quiescent X, the replicative Y and the Autosomes

Mapping Intimacies ◽

10.1101/351288 ◽

2018 ◽

Cited By ~ 2

Author(s):

Guillaume Achaz ◽

Serge Gangloff ◽

Benoit Arcangioli

Keyword(s):

De Novo ◽

Yeast Cells ◽

Simple Pattern ◽

Maternal Lineage ◽

1000 Genomes Project ◽

Phase 3 ◽

X Chromosomes ◽

1000 Genomes ◽

Y Chromosomes ◽

Human Genomes

Abstract: From the analysis of the mutation spectrum in the 2,504 sequenced human genomes from the 1000 Genomes Project (phase 3), we show that the sex chromosomes (X and Y) exhibit a different proportion of indel mutations than the autosomes (A), ranking X>A>Y. We further show that X chromosomes exhibit a higher deletion/insertion ratio than autosomes. This simple pattern shows that the recent report that non-dividing quiescent yeast cells accumulate relatively more indels (and particularly deletions) than replicating ones also applies to metazoan cells, including humans. Indeed, the X chromosomes display more indels than the autosomes, having spent more time in quiescent oocytes, whereas the Y chromosomes are solely present in the replicating spermatocytes. From the proportion of indels, we have inferred that de novo mutations arising in the maternal lineage are twice as likely to be indels as mutations from the paternal lineage. Our observation, consistent with a recent trio analysis of the spectrum of mutations inherited from the maternal lineage, is likely a major component in our understanding of the origin of anisogamy.
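
The two quantities compared in this abstract, the proportion of indels among all mutations and the deletion/insertion ratio, can be illustrated with a minimal sketch. It assumes variants are given as VCF-style (REF, ALT) allele pairs and classifies them by allele length; the function names and the toy variant list are hypothetical and are not taken from the authors' analysis.

```python
def classify_variant(ref, alt):
    """Classify a biallelic variant from its REF and ALT allele strings."""
    if len(ref) == 1 and len(alt) == 1:
        return "snv"
    if len(ref) > len(alt):
        return "deletion"
    if len(ref) < len(alt):
        return "insertion"
    return "other"  # equal-length multi-base substitution


def indel_summary(variants):
    """Return the indel proportion and deletion/insertion ratio for a variant list."""
    counts = {"snv": 0, "deletion": 0, "insertion": 0, "other": 0}
    for ref, alt in variants:
        counts[classify_variant(ref, alt)] += 1
    total = sum(counts.values())
    indels = counts["deletion"] + counts["insertion"]
    return {
        "indel_proportion": indels / total if total else float("nan"),
        "del_ins_ratio": (counts["deletion"] / counts["insertion"]
                          if counts["insertion"] else float("nan")),
    }


# Toy data (hypothetical): 6 SNVs, 3 deletions, 1 insertion.
toy_variants = [("A", "G")] * 6 + [("AT", "A")] * 3 + [("C", "CGG")]
summary = indel_summary(toy_variants)
print(summary)
```

Running the same summary separately on X, autosomal, and Y variants would give the three values whose ordering the abstract reports.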

High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios

10.1101/2021.02.06.430068 ◽

2021 ◽

Cited By ~ 4

Author(s):

Marta Byrska-Bishop ◽

Uday S. Evani ◽

Xuefang Zhao ◽

Anna O. Basile ◽

Haley J. Abel ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome ◽

1000 Genomes Project ◽

Phase 3 ◽

High Coverage ◽

Entire Cohort ◽

1000 Genomes ◽

Low Coverage

Abstract: The 1000 Genomes Project (1kGP), launched in 2008, is the largest fully open resource of whole genome sequencing (WGS) data consented for public distribution of raw sequence data without access or use restrictions. The final (phase 3) 2015 release of 1kGP included 2,504 unrelated samples from 26 populations, representing five continental regions of the world and was based on a combination of technologies including low coverage WGS (mean depth 7.4X), high coverage whole exome sequencing (mean depth 65.7X), and microarray genotyping. Here, we present a new, high coverage WGS resource encompassing the original 2,504 1kGP samples, as well as an additional 698 related samples that result in 602 complete trios in the 1kGP cohort. We sequenced this expanded 1kGP cohort of 3,202 samples to a targeted depth of 30X using Illumina NovaSeq 6000 instruments. We performed SNV/INDEL calling against the GRCh38 reference using GATK’s HaplotypeCaller, and generated a comprehensive set of SVs by integrating multiple analytic methods through a sophisticated machine learning model, upgrading the 1kGP dataset to current state-of-the-art standards. Using this strategy, we defined over 111 million SNVs, 14 million INDELs, and ∼170 thousand SVs across the entire cohort of 3,202 samples with estimated false discovery rate (FDR) of 0.3%, 1.0%, and 1.8%, respectively. By comparison to the low-coverage phase 3 callset, we observed substantial improvements in variant discovery and estimated FDR that were facilitated by high coverage re-sequencing and expansion of the cohort. Specifically, we called 7% more SNVs, 59% more INDELs, and 170% more SVs per genome than the phase 3 callset. Moreover, we leveraged the presence of families in the cohort to achieve superior haplotype phasing accuracy and we demonstrate improvements that the high coverage panel brings especially for INDEL imputation. We make all the data generated as part of this project publicly available and we envision this updated version of the 1kGP callset to become the new de facto public resource for the worldwide scientific community working on genomics and genetics.

Open-Access STRS Database Of Populations From The 1000 Genomes Project Using High Coverage Phase 3 Data

10.1101/2021.09.06.459168 ◽

2021 ◽

Author(s):

Tamara Soledad Frontanilla ◽

Guilherme Valle Silva ◽

Jesus Ayala ◽

Celso Teixeira Mendes

Keyword(s):

Open Access ◽

Forensic Genetics ◽

Allele Frequencies ◽

Equilibrium Analysis ◽

1000 Genomes Project ◽

Phase 3 ◽

Principal Coordinates Analysis ◽

High Coverage ◽

1000 Genomes ◽

Principal Coordinates

Accurate STR genotyping from next-generation sequencing (NGS) data has been challenging. Haplotype inference and phasing for STRs (HipSTR) was specifically developed to deal with genotyping errors and obtain reliable STR genotypes from whole-genome sequencing datasets. The objective of this investigation was to perform a comprehensive genotyping analysis of a set of STRs of broad forensic interest from the 1000 Genomes populations and release a reliable open-access STR database to the forensic genetics community. A set of 22 STR markers was analyzed using the CRAM files of the 1000 Genomes Project Phase 3 high-coverage (30x) dataset generated by the New York Genome Center (NYGC). HipSTR was used to call genotypes from 2,504 samples from 26 populations organized into five groups: African, East Asian, European, South Asian, and admixed American. The D21S11 marker could not be detected in the present study. Moreover, the Hardy-Weinberg equilibrium analysis, coupled with a comprehensive analysis of allele frequencies, revealed that HipSTR could not identify longer Penta E alleles (and, to a lesser extent, longer Penta D alleles). This issue is probably due to the limited length of sequencing reads available for genotype calling, resulting in heterozygote deficiency. Notwithstanding that, AMOVA, a clustering analysis using STRUCTURE, and a Principal Coordinates Analysis revealed a clear-cut separation between the four major ancestries sampled by the 1000 Genomes Consortium (AFR, EUR, EAS, SAS). Meanwhile, the AMOVA results corroborated previous reports that most of the variance (97.12%) is observed within populations. This set of analyses revealed that, except for larger Penta D and Penta E alleles, allele frequencies and genotypes defined by HipSTR from the 1000 Genomes Project phase 3 data and offered as an open-access database are consistent and highly reliable.
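
The heterozygote deficiency mentioned above can be made concrete with a minimal sketch: if long alleles drop out during calling, true heterozygotes are recorded as homozygotes, and the observed heterozygosity falls below the Hardy-Weinberg expectation computed from the allele frequencies. The function name and the toy genotypes are hypothetical; this is not the authors' pipeline, only the textbook comparison.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (observed, expected) heterozygosity for one locus.

    genotypes: list of (allele_a, allele_b) tuples, e.g. STR repeat counts.
    Expected heterozygosity under Hardy-Weinberg is 1 - sum(p_i^2).
    """
    n = len(genotypes)
    allele_counts = Counter(allele for pair in genotypes for allele in pair)
    total_alleles = 2 * n
    expected = 1.0 - sum((c / total_alleles) ** 2 for c in allele_counts.values())
    observed = sum(1 for a, b in genotypes if a != b) / n
    return observed, expected


# Toy locus (hypothetical): many apparent homozygotes, few heterozygotes.
toy_genotypes = [(12, 12)] * 40 + [(12, 17)] * 20 + [(17, 17)] * 40
observed_het, expected_het = heterozygosity(toy_genotypes)

# Fixation-index style summary: positive values indicate heterozygote deficiency.
deficiency = 1.0 - observed_het / expected_het
print(observed_het, expected_het, deficiency)
```

In this toy case both alleles have frequency 0.5, so half of the genotypes would be expected to be heterozygous, yet only a fifth are observed as such.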

Amplified Fragments of an Autosome-Borne Gene Constitute a Significant Component of the W Sex Chromosome of Eremias velox (Reptilia, Lacertidae)

Genes ◽

10.3390/genes12050779 ◽

2021 ◽

Vol 12 (5) ◽

pp. 779

Author(s):

Artem Lisachov ◽

Daria Andreyushkova ◽

Guzel Davletshina ◽

Dmitry Prokopov ◽

Svetlana Romanenko ◽

...

Keyword(s):

Sex Chromosome ◽

De Novo ◽

Unknown Origin ◽

Protein Product ◽

W Chromosome ◽

Protein Coding ◽

X Chromosomes ◽

Y Chromosomes ◽

Total Female ◽

Evolutionary Trajectories

Heteromorphic W and Y sex chromosomes often experience gene loss and heterochromatinization, which is frequently viewed as their “degeneration”. However, the evolutionary trajectories of the heterochromosomes are in fact more complex since they may not only lose but also acquire new sequences. Previously, we found that the heterochromatic W chromosome of a lizard Eremias velox (Lacertidae) is decondensed and thus transcriptionally active during the lampbrush stage. To determine possible sources of this transcription, we sequenced DNA from a microdissected W chromosome sample and a total female DNA sample and analyzed the results of reference-based and de novo assembly. We found a new repetitive sequence, consisting of fragments of an autosomal protein-coding gene ATF7IP2, several SINE elements, and sequences of unknown origin. This repetitive element is distributed across the whole length of the W chromosome, except the centromeric region. Since it retained only 3 out of 10 original ATF7IP2 exons, it remains unclear whether it is able to produce a protein product. Subsequent studies are required to test the presence of this element in other species of Lacertidae and possible functionality. Our results provide further evidence for the view of W and Y chromosomes as not just “degraded” copies of Z and X chromosomes but independent genomic segments in which novel genetic elements may arise.

Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project

Wellcome Open Research ◽

10.12688/wellcomeopenres.15126.2 ◽

2019 ◽

Vol 4 ◽

pp. 50 ◽

Cited By ~ 7

Author(s):

Ernesto Lowy-Gallego ◽

Susan Fairley ◽

Xiangqun Zheng-Bradley ◽

Magali Ruffier ◽

Laura Clarke ◽

...

Keyword(s):

De Novo ◽

Variant Calling ◽

Final Phase ◽

1000 Genomes Project ◽

Data Set ◽

1000 Genomes ◽

Project Data

We present a set of biallelic SNVs and INDELs, from 2,548 samples spanning 26 populations from the 1000 Genomes Project, called de novo on GRCh38. We believe this will be a useful reference resource for those using GRCh38. It represents an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date by encompassing all of the GRCh38 primary assembly autosomes and pseudo-autosomal regions, including novel, medically relevant loci. Here, we describe how the data set was created and benchmark our call set against that produced by the final phase of the 1000 Genomes Project on GRCh37 and the lift-over of that data to GRCh38.

Inference of candidate germline mutator loci in humans from genome-wide haplotype data

10.1101/089623 ◽

2016 ◽

Author(s):

Cathal Seoighe ◽

Aylwyn Scally

Keyword(s):

Dna Damage ◽

Dna Repair ◽

Mutation Rate ◽

Germline Mutation ◽

De Novo ◽

1000 Genomes Project ◽

1000 Genomes ◽

A Genome ◽

Haplotype Data ◽

Candidate Loci

Abstract: The rate of germline mutation varies widely between species but little is known about the extent of variation in the germline mutation rate between individuals of the same species. Here we demonstrate that an allele that increases the rate of germline mutation can result in a distinctive signature in the genomic region linked to the affected locus, characterized by a number of haplotypes with a locally high proportion of derived alleles, against a background of haplotypes carrying a typical proportion of derived alleles. We searched for this signature in human haplotype data from phase 3 of the 1000 Genomes Project and report a number of candidate mutator loci, several of which are located close to or within genes involved in DNA repair or the DNA damage response. To investigate whether mutator alleles remained active at any of these loci, we used de novo mutation counts from human parent-offspring trios in the 1000 Genomes and Genome of the Netherlands cohorts, looking for an elevated number of de novo mutations in the offspring of parents carrying a candidate mutator haplotype at each of these loci. We found some support for two of the candidate loci, including one locus just upstream of the BRSK2 gene, which is expressed in the testis and has been reported to be involved in the response to DNA damage.

Author Summary: Each time a genome is replicated there is the possibility of error resulting in the incorporation of an incorrect base or bases in the genome sequence. When these errors occur in cells that lead to the production of gametes they can be incorporated into the germline. Such germline mutations are the basis of evolutionary change; however, to date there has been little attempt to quantify the extent of genetic variation in human populations in the rate at which they occur. This is particularly important because new spontaneous mutations are thought to make an important contribution to many human diseases. Here we present a new way to identify genetic loci that may be associated with an elevated rate of germline mutation and report the application of this method to data from a large number of human genomes, generated by the 1000 Genomes Project. Several of the candidate loci we report are in or near genes involved in DNA repair and some were supported by direct measurement of the mutation rate obtained from parent-offspring trios.

Detecting shared independent selection

10.1101/2020.04.21.053959 ◽

2020 ◽

Author(s):

Nathan S. Harris ◽

Alan R. Rogers

Keyword(s):

1000 Genomes Project ◽

Phase 3 ◽

1000 Genomes ◽

Different Populations ◽

Genomic Regions ◽

Two Populations

Abstract: Signals of selection are not often shared between populations. When a mutual signal is detected, it is often not known if selection occurred before or after populations split. Here we develop a method to detect genomic regions at which selection has favored different haplotypes in two populations. This method is verified through simulations and tested on small regions of the genome. This method was then expanded to scan the phase 3 genomes of the 1000 Genomes Project populations for regions in which the evidence for independent selection is strongest. We identify several genes which likely underwent selection independently in different populations.

Fine Mapping Analysis of the MHC Region to Identify Variants Associated With Chinese Vitiligo and SLE and Association Across These Diseases

Frontiers in Immunology ◽

10.3389/fimmu.2021.758652 ◽

2022 ◽

Vol 12 ◽

Author(s):

Lu Cao ◽

Ruixue Zhang ◽

Yirui Wang ◽

Xia Hu ◽

Liang Yong ◽

...

Keyword(s):

Autoimmune Diseases ◽

Fine Mapping ◽

Epidemiological Studies ◽

1000 Genomes Project ◽

Phase 3 ◽

1000 Genomes ◽

Mapping Analysis ◽

Healthy Control ◽

New Perspective

The important role of MHC in the pathogenesis of vitiligo and SLE has been confirmed in various populations. To map the most significant MHC variants associated with the risk of vitiligo and SLE, we conducted fine mapping analysis using 1117 vitiligo cases, 1046 SLE cases and 1693 healthy control subjects in the Han-MHC reference panel and 1000 Genomes Project phase 3. rs113465897 (P=1.03×10⁻¹³, OR=1.64, 95%CI=1.44–1.87) and rs3129898 (P=4.21×10⁻¹⁷, OR=1.93, 95%CI=1.66–2.25) were identified as being most strongly associated with vitiligo and SLE, respectively. Stepwise conditional analysis revealed additional independent signals at rs3130969 (P=1.48×10⁻⁷, OR=0.69, 95%CI=0.60–0.79) and HLA-DPB1*03:01 (P=1.07×10⁻⁶, OR=1.94, 95%CI=1.49–2.53) linked to vitiligo, and at HLA-DQB1*0301 (P=4.53×10⁻⁷, OR=0.62, 95%CI=0.52–0.75) linked to SLE. Considering that epidemiological studies have confirmed comorbidities of vitiligo and SLE, we used the GCTA tool to analyse the genetic correlation between these two diseases in the HLA region; the correlation coefficient was 0.79 (P=5.99×10⁻¹⁰, SE=0.07), confirming their similar genetic backgrounds. Our findings highlight the value of the MHC region in vitiligo and SLE and provide a new perspective for comorbidities among autoimmune diseases.

Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men

10.1101/088716 ◽

2016 ◽

Cited By ~ 45

Author(s):

G. David Poznik

Keyword(s):

Missing Data ◽

Y Chromosome ◽

Software Package ◽

1000 Genomes Project ◽

Phase 3 ◽

Array Data ◽

Large Samples ◽

1000 Genomes ◽

Reasonable Number

Abstract: We have developed an algorithm to rapidly and accurately identify the Y-chromosome haplogroup of each male in a sample of one to millions. The algorithm, implemented in the yHaplo software package, does not rely on any particular genotyping modality or platform. Full sequences yield the most granular haplogroup classifications, but genotyping arrays can yield reliable calls, provided a reasonable number of phylogenetically informative variants has been assayed. The algorithm is robust to missing data, genotype errors, mutation recurrence, and other complications. We have tested the software on full sequences from phase 3 of the 1000 Genomes Project and on subsets thereof constructed by downsampling to SNPs present on each of four genotyping arrays. We have also run the software on array data from more than 600,000 males.

Major sex differences in allele frequencies for X chromosome variants in the 1000 Genomes Project data

10.1101/2021.10.27.466015 ◽

2021 ◽

Author(s):

Zhong Wang ◽

Lei Sun ◽

Andrew D Paterson

Keyword(s):

Sex Differences ◽

X Chromosome ◽

Association Studies ◽

Allele Frequencies ◽

Whole Genome Sequence ◽

P Value ◽

1000 Genomes Project ◽

Phase 3 ◽

High Coverage ◽

1000 Genomes

An unexpectedly high proportion of SNPs on the X chromosome in the 1000 Genomes Project phase 3 data were identified with significant sex differences in minor allele frequencies (sdMAF). sdMAF persisted for many of these SNPs in the recently released high coverage whole genome sequence, and it was consistent across the five super-populations. Among the 245,825 common biallelic SNPs in phase 3 data presumed to be high quality, 2,039 have genome-wide significant sdMAF (p-value <5e-8). sdMAF varied by location: 0.83% of SNPs in the non-pseudo-autosomal region (NPR), 0.29% in pseudo-autosomal region 1 (PAR1), 13.1% in PAR2, and 0.85% in PAR3 had sdMAF, and these SNPs were clustered at the NPR-PAR boundaries, among other locations. sdMAF at the NPR-PAR boundaries is biologically expected due to sex-linkage, but has generally been ignored in association studies. For comparison, similar analyses found only 6, 1 and 0 SNPs with significant sdMAF on chromosomes 1, 7 and 22, respectively. Future X chromosome analyses need to take sdMAF into account.
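
As an illustration of what a sex difference in allele frequency means at a single SNP, the sketch below applies a textbook two-proportion z-test to allele counts in females and males and compares the p-value with the genome-wide threshold of 5e-8 quoted above. This is an assumption-laden simplification and not the sdMAF test used by the authors: it treats alleles as independent draws, and in the non-pseudo-autosomal region each male contributes one X allele while each female contributes two. The function name and counts are hypothetical.

```python
import math


def sex_difference_z_test(female_alt, female_alleles, male_alt, male_alleles):
    """Two-proportion z-test on alternate-allele frequencies in females vs males.

    Returns (frequency difference, z statistic, two-sided p-value).
    """
    p_female = female_alt / female_alleles
    p_male = male_alt / male_alleles
    pooled = (female_alt + male_alt) / (female_alleles + male_alleles)
    variance = pooled * (1.0 - pooled) * (1.0 / female_alleles + 1.0 / male_alleles)
    if variance == 0.0:
        return p_female - p_male, 0.0, 1.0  # monomorphic site: no evidence
    z = (p_female - p_male) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return p_female - p_male, z, p_value


# Toy counts (hypothetical): 1,000 females give 2,000 X alleles,
# 1,000 males give 1,000 X alleles in the non-pseudo-autosomal region.
diff, z_stat, p_val = sex_difference_z_test(600, 2000, 150, 1000)
genome_wide_significant = p_val < 5e-8
print(diff, z_stat, p_val, genome_wide_significant)
```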

Using reference-free compressed data structures to analyse sequencing reads from thousands of human genomes

10.1101/060186 ◽

2016 ◽

Cited By ~ 1

Author(s):

Dirk D. Dolle ◽

Zhicheng Liu ◽

Matthew Cotten ◽

Jared T. Simpson ◽

Zamin Iqbal ◽

...

Keyword(s):

Data Structures ◽

De Novo ◽

Sequencing Data ◽

T Lymphotropic Virus ◽

Viral Genomes ◽

1000 Genomes ◽

Base Position ◽

Human Genomes ◽

Compressed Data Structures ◽

Burrows Wheeler Transform

Abstract: We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a full text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2,705 samples from the 1000 Genomes Project. A key feature is that as more genomes are added, identical read sequences are increasingly observed and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, identifying that 3.2 Mbp with population support was lost in the transition from GRCh37, while 13.7 Mbp was added to GRCh38. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out non-reference queries to search for the presence of all known viral genomes, and discover human T-lymphotropic virus 1 integrations in six samples in a recognised epidemiological distribution.
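
The BWT and FM-index mentioned above can be sketched on a single short string to show how a k-mer query is answered without scanning the text. This is a naive, minimal illustration (sorted rotations, full occurrence tables) under the assumption of one text terminated by a `$` sentinel; it is not the population BWT implementation described in the abstract, which indexes reads from thousands of samples. All function names are hypothetical.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; '$' marks the end of text."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_string):
    """Build the two FM-index tables from a BWT string.

    first_col_start[c]: number of characters in the text smaller than c.
    occ[c][i]: number of occurrences of c in bwt_string[:i].
    """
    counts = Counter(bwt_string)
    first_col_start = {}
    running_total = 0
    for ch in sorted(counts):
        first_col_start[ch] = running_total
        running_total += counts[ch]
    occ = {ch: [0] * (len(bwt_string) + 1) for ch in counts}
    for i, ch in enumerate(bwt_string):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first_col_start, occ


def count_occurrences(pattern, first_col_start, occ, n):
    """Count occurrences of pattern by backward search over a BWT of length n."""
    lo, hi = 0, n  # half-open interval of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first_col_start:
            return 0
        lo = first_col_start[ch] + occ[ch][lo]
        hi = first_col_start[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


# Toy text (hypothetical); a real index would hold sequencing reads.
toy_bwt = bwt("GATTACAGATTACA")
toy_start, toy_occ = build_fm_index(toy_bwt)
print(count_occurrences("ATTA", toy_start, toy_occ, len(toy_bwt)))
```

Backward search processes the query from its last character to its first, narrowing an interval of sorted rotations at each step, so the cost depends on the query length rather than on the size of the indexed text.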