Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit

Abstract We introduce Giraffe, a pangenome short read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe, part of the variation graph toolkit (vg), maps reads to thousands of human genomes at around the same speed BWA-MEM2 maps reads to a single reference genome, while maintaining comparable accuracy to VG-MAP, vg’s original mapper. We have developed efficient genotyping pipelines using Giraffe. We demonstrate improvements in genotyping for single nucleotide variations (SNVs), insertions and deletions (indels) and structural variations (SVs) genome-wide. We use Giraffe to genotype and phase 167 thousand structural variations ascertained from long read studies in 5,202 human genomes sequenced with short reads, including the complete 1000 Genomes Project dataset, at an average cost of $1.50 per sample. We determine the frequency of these variations in diverse human populations, characterize their complex allelic variations and identify thousands of expression quantitative trait loci (eQTLs) driven by these variations.

A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome

Experimental & Molecular Medicine ◽

10.1038/s12276-021-00586-y ◽

2021 ◽

Author(s):

Seyoung Mun ◽

Songmi Kim ◽

Wooseok Lee ◽

Keunsoo Kang ◽

Thomas J. Meyer ◽

...

Keyword(s):

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Personal Genome ◽

Human Populations ◽

Whole Genome ◽

Structural Variations ◽

Insert Size ◽

Human Genomes ◽

Next Generation Sequencing (NGS)

Abstract Advances in next-generation sequencing (NGS) technology have made personal genome sequencing possible, and indeed, many individual human genomes have now been sequenced. Comparisons of these individual genomes have revealed substantial genomic differences between human populations as well as between individuals from closely related ethnic groups. Transposable elements (TEs) are known to be one of the major sources of these variations and act through various mechanisms, including de novo insertion, insertion-mediated deletion, and TE–TE recombination-mediated deletion. In this study, we carried out de novo whole-genome sequencing of one Korean individual (KPGP9) via multiple insert-size libraries. The de novo whole-genome assembly resulted in 31,305 scaffolds with a scaffold N50 size of 13.23 Mb. Furthermore, through computational data analysis and experimental verification, we revealed that 182 TE-associated structural variation (TASV) insertions and 89 TASV deletions contributed 64,232 bp in sequence gain and 82,772 bp in sequence loss, respectively, in the KPGP9 genome relative to the hg19 reference genome. We also verified structural differences associated with TASVs by comparative analysis with TASVs in recent genomes (AK1 and TCGA genomes) and reported their details. Here, we constructed a new Korean de novo whole-genome assembly and provide the first study, to our knowledge, focused on the identification of TASVs in an individual Korean genome. Our findings again highlight the role of TEs as a major driver of structural variations in human individual genomes.
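The scaffold N50 quoted above is a standard assembly contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of that definition follows; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the KPGP9 study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Sort lengths from longest to shortest and accumulate them; the N50 is
    the length at which the running total first reaches half of the
    total assembly size.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running to total to avoid fractional arithmetic.
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not real scaffolds).
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

The same definition applies to the phased-block N50 reported by phasing tools, with block lengths in place of scaffold lengths.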

Interaction between M. tuberculosis Lineage and Human Genetic Variants Reveals Novel Pathway Associations with Severity of TB

Pathogens ◽

10.3390/pathogens10111487 ◽

2021 ◽

Vol 10 (11) ◽

pp. 1487

Author(s):

Michael L. McHenry ◽

Eddie M. Wampande ◽

Moses L. Joloba ◽

LaShaunda L. Malone ◽

Harriet Mayanja-Kizza ◽

...

Keyword(s):

Genetic Variation ◽

Clinical Presentation ◽

Sub Saharan Africa ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Human Genomes ◽

Genome Wide ◽

Sub Saharan ◽

Cellular Replication

Tuberculosis (TB) remains a major public health threat globally, especially in sub-Saharan Africa. Both human and Mycobacterium tuberculosis (MTBC) genetic variation affect TB outcomes, but few studies have examined if and how the two genomes interact to affect disease. We hypothesize that long-term coexistence between human genomes and MTBC lineages modulates disease to affect its severity. We examined this hypothesis in our TB household contact study in Kampala, Uganda, in which we identified three MTBC lineages, of which one, L4.6-Uganda, is clearly derived and hence recent. We quantified TB severity using the Bandim TBscore and examined the interaction between MTBC lineage and human single-nucleotide polymorphisms (SNPs) genome-wide, in two independent cohorts of TB cases (n = 149 and n = 127). We found a significant interaction between an SNP in PPIAP2 and the Uganda lineage (combined p = 4 × 10⁻⁸). PPIAP2 is a pseudogene that is highly expressed in immune cells. Pathway and eQTL analyses indicated potential roles between coevolving SNPs and cellular replication and metabolism as well as platelet aggregation and coagulation. This finding provides further evidence that host–pathogen interactions affect clinical presentation differently than host and pathogen genetic variation independently, and that human–MTBC coevolution is likely to explain patterns of disease severity.

LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants

10.1101/2021.09.09.459623 ◽

2021 ◽

Author(s):

Jyun-Hong Lin ◽

Liang-Chi Chen ◽

Shu-Qi Yu ◽

Yao-Ting Huang

Keyword(s):

Variant Calling ◽

Cost Effective ◽

Nucleotide Polymorphisms ◽

Structural Variations ◽

Single Nucleotide ◽

Chromosome Conformation ◽

Long Reads ◽

Cost Effective Approach ◽

Long Read ◽

Microbial Strains

Abstract Long-read phasing has been used for reconstructing diploid genomes, improving variant calling, and resolving microbial strains in metagenomics. However, the phasing blocks of existing methods are broken by large Structural Variations (SVs), and the efficiency is unsatisfactory for population-scale phasing. This paper presents an ultra-fast algorithm, LongPhase, which can simultaneously phase single nucleotide polymorphisms (SNPs) and SVs of a human genome in ∼10-20 minutes, 10x faster than the state-of-the-art WhatsHap and Margin. In particular, LongPhase produces much larger phased blocks at almost chromosome level with only long reads (N50=26Mbp). We demonstrate that LongPhase combined with Nanopore is a cost-effective approach for providing chromosome-scale phasing without the need for additional trios, chromosome-conformation, and single-cell strand-seq data.

Evaluating nanopore sequencing data processing pipelines for structural variation identification

Genome Biology ◽

10.1186/s13059-019-1858-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 9

Author(s):

Anbo Zhou ◽

Timothy Lin ◽

Jinchuan Xing

Keyword(s):

Detection Accuracy ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Sequencing Technology ◽

Structural Variations ◽

Human Genomes ◽

Data Assessment ◽

Machine Learning Approach ◽

Long Read ◽

The Impact

Abstract Background: Structural variations (SVs) account for about 1% of the differences among human genomes and play a significant role in phenotypic variation and disease susceptibility. The emerging nanopore sequencing technology can generate long sequence reads and can potentially provide accurate SV identification. However, the tools for aligning long-read data and detecting SVs have not been thoroughly evaluated. Results: Using four nanopore datasets, including both empirical and simulated reads, we evaluate four alignment tools and three SV detection tools. We also evaluate the impact of sequencing depth on SV detection. Finally, we develop a machine learning approach to integrate call sets from multiple pipelines. Overall SV callers’ performance varies depending on the SV types. For an initial data assessment, we recommend using aligner minimap2 in combination with SV caller Sniffles because of their speed and relatively balanced performance. For detailed analysis, we recommend incorporating information from multiple call sets to improve the SV call performance. Conclusions: We present a workflow for evaluating aligners and SV callers for nanopore sequencing data and approaches for integrating multiple call sets. Our results indicate that additional optimizations are needed to improve SV detection accuracy and sensitivity, and an integrated call set can provide enhanced performance. The nanopore technology is improving, and the sequencing community is likely to grow accordingly. In turn, better benchmark call sets will be available to more accurately assess the performance of available tools and facilitate further tool development.

Genome-wide identifying of G-quadruplex structures directly by whole-genome resequencing

10.21203/rs.3.rs-149290/v1 ◽

2021 ◽

Author(s):

Jing Tu ◽

Mengqin Duan ◽

Wenli Liu ◽

Na Lu ◽

Xiao Sun ◽

...

Keyword(s):

Individual Differences ◽

Whole Genome ◽

Genome Resequencing ◽

Single Nucleotide ◽

Genome Wide ◽

G Quadruplex ◽

Sequencing Quality ◽

Single Nucleotide Variations ◽

Whole Genome Resequencing ◽

The Given

Abstract We present a convenient genome-wide DNA G-quadruplex (G4) profiling method that identifies G4 structures from ordinary whole-genome resequencing data by exploiting slight fluctuations in sequencing quality. We identified 736,689 G4 structures within the human genome, which cover 44.9% of all predicted canonical G4-forming sequences. We observed that some single nucleotide variations (SNVs), both homozygous and heterozygous, influenced the formation of G4 structures. Because SNVs differ between individuals, the approach can be used to identify and characterize G4s genome-wide for specific individuals.

Mining SNPs From EST Databases

Genome Research ◽

10.1101/gr.9.2.167 ◽

1999 ◽

Vol 9 (2) ◽

pp. 167-174 ◽

Cited By ~ 12

Author(s):

Leslie Picoult-Newberg ◽

Trey E. Ideker ◽

Mark G. Pohl ◽

Scott L. Taylor ◽

Miriam A. Donaldson ◽

...

Keyword(s):

Expressed Sequence Tag ◽

De Novo ◽

cDNA Libraries ◽

Human Populations ◽

Data Sets ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Genome Wide ◽

Using Data

There is considerable interest in the discovery and characterization of single nucleotide polymorphisms (SNPs) to enable the analysis of the potential relationships between human genotype and phenotype. Here we present a strategy that permits the rapid discovery of SNPs from publicly available expressed sequence tag (EST) databases. From a set of ESTs derived from 19 different cDNA libraries, we assembled 300,000 distinct sequences and identified 850 mismatches from contiguous EST data sets (candidate SNP sites), without de novo sequencing. Through a polymerase-mediated, single-base, primer extension technique, Genetic Bit Analysis (GBA), we confirmed the presence of a subset of these candidate SNP sites and have estimated the allele frequencies in three human populations with different ethnic origins. Altogether, our approach provides a basis for rapid and efficient regional and genome-wide SNP discovery using data assembled from sequences from different libraries of cDNAs. [The SNPs identified in this study can be found in the National Center of Biotechnology (NCBI) SNP database under submitter handles ORCHID (SNPS-981210-A) and debnick (SNPS-981209-A and SNPS-981209-B).]

De novo diploid genome assembly for genome-wide structural variant detection

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqz018 ◽

2019 ◽

Vol 2 (1) ◽

Author(s):

Lu Zhang ◽

Xin Zhou ◽

Ziming Weng ◽

Arend Sidow

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Pairwise Alignment ◽

Cost Effective ◽

Difficult Problem ◽

Ancestral State ◽

Fundamental Limitations ◽

Human Genomes ◽

Genome Wide ◽

Long Read

Abstract Detection of structural variants (SVs) on the basis of read alignment to a reference genome remains a difficult problem. De novo assembly, traditionally used to generate reference genomes, offers an alternative for SV detection. However, it has not been applied broadly to human genomes because of fundamental limitations of short-fragment approaches and high cost of long-read technologies. We here show that 10× linked-read sequencing supports accurate SV detection. We examined variants in six de novo 10× assemblies with diverse experimental parameters from two commonly used human cell lines: NA12878 and NA24385. The assemblies are effective for detecting mid-size SVs, which were discovered by simple pairwise alignment of the assemblies’ contigs to the reference (hg38). Our study also shows that the base-pair level SV breakpoint accuracy is high, with a majority of SVs having precisely correct sizes and breakpoints. Setting the ancestral state of SV loci by comparing to ape orthologs allows inference of the actual molecular mechanism (insertion or deletion) causing the mutation. In about half of cases, the mechanism is the opposite of the reference-based call. We uncover 214 SVs that may have been maintained as polymorphisms in the human lineage since before our divergence from chimp. Overall, we show that de novo assembly of 10× linked-read data can achieve cost-effective SV detection for personal genomes.

Population-Specific Genetic and Expression Differentiation in Europeans

Genome Biology and Evolution ◽

10.1093/gbe/evaa021 ◽

2020 ◽

Vol 12 (4) ◽

pp. 358-369

Author(s):

Xueyuan Jiang ◽

Raquel Assis

Keyword(s):

Vitamin D ◽

Copy Number ◽

Copy Number Variations ◽

Future Research ◽

Human Populations ◽

RNA Seq ◽

Nucleotide Polymorphism ◽

Single Nucleotide ◽

Genome Wide ◽

The World

Abstract Much of the enormous phenotypic variation observed across human populations is thought to have arisen from events experienced as our ancestors peopled different regions of the world. However, little is known about the genes involved in these population-specific adaptations. Here, we explore this problem by simultaneously examining population-specific genetic and expression differentiation in four human populations. In particular, we derive a branch-based estimator of population-specific differentiation in four populations, and apply this statistic to single-nucleotide polymorphism and RNA-seq data from Italian, British, Finnish, and Yoruban populations. As expected, genome-wide estimates of genetic and expression differentiation each independently recapitulate the known relationships among these four human populations, highlighting the utility of our statistic for identifying putative targets of population-specific adaptations. Moreover, genes with large copy number variations display elevated levels of population-specific genetic and expression differentiation, consistent with the hypothesis that gene duplication and deletion events are key reservoirs of adaptive variation. Further, many top-scoring genes are well-known targets of adaptation in Europeans, including those involved in lactase persistence and vitamin D absorption, and a handful of novel candidates represent promising avenues for future research. Together, these analyses reveal that our statistic can aid in uncovering genes involved in population-specific genetic and expression differentiation, and that such genes often play important roles in a diversity of adaptive and disease-related phenotypes in humans.

G-quadruplex structural variations in human genome associated with single-nucleotide variations and their impact on gene activity

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2013230118 ◽

2021 ◽

Vol 118 (21) ◽

pp. e2013230118

Author(s):

Jia-yuan Gong ◽

Cui-jiao Wen ◽

Ming-liang Tang ◽

Rui-fang Duan ◽

Juan-nan Chen ◽

...

Keyword(s):

Genetic Variation ◽

Human Genome ◽

Regulatory Elements ◽

Gene Activity ◽

Gene Expressions ◽

Single Nucleotide ◽

Average Load ◽

Genome Wide ◽

Single Nucleotide Variations ◽

The Impact

G-quadruplexes (G4s) formed by guanine-rich nucleic acids play a role in essential biological processes such as transcription and replication. Besides the >1.5 million putative G4-forming sequences (PQSs), the human genome features >640 million single-nucleotide variations (SNVs), the most common type of genetic variation among people or populations. An SNV may alter a G4 structure when it falls within a PQS motif. To date, genome-wide PQS–SNV interactions and their impact have not been investigated. Herein, we present a study on the PQS–SNV interactions and the impact they can bring to G4 structures and, subsequently, gene expressions. Based on build 154 of the Single Nucleotide Polymorphism Database (dbSNP), we identified 5 million gains/losses or structural conversions of G4s that can be caused by the SNVs. Of these G4 variations (G4Vs), 3.4 million are within genes, resulting in an average load of >120 G4Vs per gene, preferentially enriched near the transcription start site. Moreover, >80% of the G4Vs overlap with transcription factor–binding sites and >14% with enhancers, giving an average load of 3 and 7.5 for the two regulatory elements, respectively. Our experiments show that such G4Vs can significantly influence the expression of their host genes. These results reveal genome-wide G4Vs and their impact on gene activity, emphasizing an understanding of genetic variation, from a structural perspective, of their physiological function and pathological implications. The G4Vs may also provide a unique category of drug targets for individualized therapeutics, health risk assessment, and drug development.
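To illustrate how a single SNV inside a PQS motif can remove a putative G4, the sketch below scans a sequence with the widely used canonical pattern of four runs of at least three guanines separated by loops of 1 to 7 bases. This pattern is an assumption chosen for illustration; the abstract does not state which motif definition the authors used, and the sequence, position, and helper names (`has_pqs`, `apply_snv`) are hypothetical. Only the G-rich strand is scanned.

```python
import re

# Canonical PQS pattern (assumed here): G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+
PQS_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def has_pqs(seq):
    """Return True if seq contains at least one canonical PQS match."""
    return PQS_PATTERN.search(seq.upper()) is not None


def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


# Toy reference sequence containing exactly four GGG runs.
ref = "TTGGGAGGGTGGGAGGGTT"
# A G>A substitution in the middle of the first G-run breaks that run.
alt = apply_snv(ref, 3, "A")

print(has_pqs(ref), has_pqs(alt))  # True False
```

The reverse case, an SNV that creates a new G-run and so gains a PQS, can be checked with the same two helpers by swapping which sequence is treated as the reference.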

Segmental duplications and their variation in a complete human genome

10.1101/2021.05.26.445678 ◽

2021 ◽

Author(s):

Mitchell R. Vollger ◽

Xavi Guitart ◽

Philip C. Dishuck ◽

Ludovica Mercuri ◽

William T. Harvey ◽

...

Keyword(s):

Human Genome ◽

Haplotype Diversity ◽

Duplicate Gene ◽

Segmental Duplications ◽

Protein Coding ◽

Human Genomes ◽

Genome Wide ◽

Structural Heterozygosity ◽

Human Frontal Cortex ◽

Long Read

Despite their importance in disease and evolution, highly identical segmental duplications (SDs) have been among the last regions of the human reference genome (GRCh38) to be finished. Based on a complete telomere-to-telomere human genome (T2T CHM13), we present the first comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence increasing the genome-wide estimate from 5.4% to 7.0% (218 Mbp). An analysis of 266 human genomes shows that 91% of the new T2T CHM13 SD sequence (68.3 Mbp) better represents human copy number. We find that SDs show increased single-nucleotide variation diversity when compared to unique regions; we characterize methylation signatures that correlate with duplicate gene transcription and predict 182 novel protein-coding gene candidates. We find that 63% (35.11/55.7 Mbp) of acrocentric chromosomes consist of SDs distinct from rDNA and satellite sequences. Acrocentric SDs are 1.75-fold longer (p=0.00034) than other SDs, are frequently shared with autosomal pericentromeric regions, and are heteromorphic among human chromosomes. Comparing long-read assemblies from other human (n=12) and nonhuman primate (n=5) genomes, we use the T2T CHM13 genome to systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant (LPA, SMN) and duplicated genes (TBC1D3, SRGAP2C, ARHGAP11B) important in the expansion of the human frontal cortex. The analysis reveals unprecedented patterns of structural heterozygosity and massive evolutionary differences in SD organization between humans and their closest living relatives.
