Structural variants in Chinese population and their impact on phenotypes, diseases and population adaptation

2021 ◽  
Author(s):  
Zhikun Wu ◽  
Zehang Jiang ◽  
Tong Li ◽  
Chuanbo Xie ◽  
Liansheng Zhao ◽  
...  

Summary: A complete characterization of genetic variation is a fundamental goal of human genome research. Long-read sequencing (LRS) improves the sensitivity for structural variant (SV) discovery and facilitates a better understanding of the SV spectrum in human genomes. Here, we conduct the first LRS-based SV analysis in the Chinese population. We perform whole-genome LRS for 405 unrelated Chinese individuals, with 68 phenotypic and clinical measurements. We discover a complex landscape of 132,312 non-redundant SVs, of which 53.3% are novel. The identified SVs are of high quality, as validated by PacBio high-fidelity sequencing and PCR experiments. The total length of SVs represents approximately 13.2% of the human reference genome. We annotate 1,929 loss-of-function SVs affecting the coding sequences of 1,681 genes. We discover new associations of SVs with phenotypes and diseases, such as rare deletions in HBA1/HBA2/HBB associated with anemia and common deletions in GHR associated with body height. Furthermore, we identify SV candidates related to human immunity that differentiate Chinese sub-populations. Our study reveals the complex landscape of human SVs in unprecedented detail and provides new insights into their roles contributing to phenotypes, diseases and evolution. The genotypic and phenotypic resource is freely available to the scientific community.

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Zhikun Wu ◽  
Zehang Jiang ◽  
Tong Li ◽  
Chuanbo Xie ◽  
Liansheng Zhao ◽  
...  

Abstract: A complete characterization of genetic variation is a fundamental goal of human genome research. Long-read sequencing has improved the sensitivity of structural variant discovery. Here, we conduct a long-read sequencing-based structural variant analysis for 405 unrelated Chinese individuals, with 68 phenotypic and clinical measurements. We discover a landscape of 132,312 nonredundant structural variants, of which 45.2% are novel. The identified structural variants are of high quality, with an estimated false discovery rate of 3.2%. The concatenated length of all the structural variants is approximately 13.2% of the human reference genome. We annotate 1,929 loss-of-function structural variants affecting the coding sequence of 1,681 genes. We discover rare deletions in HBA1/HBA2/HBB associated with anemia. Furthermore, we identify structural variants related to immunity which differentiate the northern and southern Chinese populations. Our study describes the landscape of structural variants in the Chinese population and their contribution to phenotypes and disease.


2021 ◽  
Author(s):  
Songbo Wang ◽  
Jiadong Lin ◽  
Xiaofei Yang ◽  
Zihang Li ◽  
Tun Xu ◽  
...  

Integration of hepatitis B virus (HBV) into the human genome disrupts genetic structures and cellular functions. Here, we conducted multiplatform long-read sequencing on two cell lines and five clinical samples of HBV-induced hepatocellular carcinomas (HCC). We resolved two types of complex viral integration-induced genome rearrangements and established a Time-phased Integration and Rearrangement Model (TIRM) to depict their formation process by differentiating inserted HBV copies with HiFi long reads. We showed that the two complex types were initialized from focal replacements and that the fragile virus-human junctions triggered subsequent rearrangements. We further revealed that these rearrangements promoted a prevalent loss-of-heterozygosity at chr4q, accounting for 19.5% of HCC samples in the ICGC cohort and contributing to immune and metabolic dysfunction. Overall, our long-read-based analysis reveals a novel sequential rearrangement process driven by HBV integration, hinting at its structural and functional implications for human genomes.


2021 ◽  
Author(s):  
Chiann-Ling C Yeh ◽  
Andreas Tsouris ◽  
Joseph Schacherer ◽  
Maitreya J. Dunham

How natural variation affects phenotype is difficult to determine given our incomplete ability to deduce the functional impact of the polymorphisms detected in a population. Although current computational and experimental tools can predict and measure allele function, there has previously been no assay that does so in a high-throughput manner while also representing haplotypes derived from wild populations. Here, we present such an assay that measures the fitness of hundreds of natural alleles of a given gene without site-directed mutagenesis or DNA synthesis. With a large collection of diverse Saccharomyces cerevisiae natural isolates, we piloted this technique using the gene SUL1, which encodes a high-affinity sulfate permease that, at increased copy number, can improve the fitness of cells grown in sulfate-limited media. We cloned and barcoded all alleles from a collection of over 1000 natural isolates en masse and matched barcodes with their respective variants using PacBio long-read sequencing and a novel error-correction algorithm. We then transformed the reference S288C strain with this library and used barcode sequencing to track growth ability in sulfate limitation of lineages carrying each allele. We show that this approach allows us to measure the fitness conferred by each allele and stratify functional and nonfunctional alleles. Additionally, we pinpoint which polymorphisms in both coding and noncoding regions are detrimental to fitness or are of small effect and result in intermediate phenotypes. Integrating these results with a phylogenetic tree, we observe how often loss-of-function occurs and whether or not there is an evolutionary pattern to our observable phenotypic results. This approach is easily applicable to other genes. 
Our results complement classic genotype-phenotype mapping strategies and demonstrate a high-throughput approach for understanding the effects of polymorphisms across an entire species which can greatly propel future investigations into quantitative traits.


Author(s):  
Jouni Sirén ◽  
Jean Monlong ◽  
Xian Chang ◽  
Adam M. Novak ◽  
Jordan M. Eizenga ◽  
...  

Abstract: We introduce Giraffe, a pangenome short read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe, part of the variation graph toolkit (vg)1, maps reads to thousands of human genomes at around the same speed BWA-MEM2 maps reads to a single reference genome, while maintaining comparable accuracy to VG-MAP, vg’s original mapper. We have developed efficient genotyping pipelines using Giraffe. We demonstrate improvements in genotyping for single nucleotide variations (SNVs), insertions and deletions (indels) and structural variations (SVs) genome-wide. We use Giraffe to genotype and phase 167 thousand structural variations ascertained from long read studies in 5,202 human genomes sequenced with short reads, including the complete 1000 Genomes Project dataset, at an average cost of $1.50 per sample. We determine the frequency of these variations in diverse human populations, characterize their complex allelic variations and identify thousands of expression quantitative trait loci (eQTLs) driven by these variations.


2020 ◽  
Author(s):  
Peng Zhang ◽  
Huaxia Luo ◽  
Yanyan Li ◽  
You Wang ◽  
Jiajia Wang ◽  
...  

Abstract: The lack of a Chinese population-specific haplotype reference panel and of whole-genome sequencing resources has greatly hindered genetics studies in the world’s largest population. Here we present the NyuWa genome resource of 71.1M SNPs and 8.2M indels based on deep (26.2X) sequencing of 2,999 Chinese individuals, and construct the NyuWa reference panel of 5,804 haplotypes and 19.3M variants, which is the first publicly available Chinese population-specific reference panel with thousands of samples. There were 25.0M novel variants in the NyuWa genome resource, and 3.2M specific variants in the NyuWa reference panel. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by 30% to 51%. Population structure and imputation simulation tests supported the applicability of one integrated reference panel for both northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes were identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetics research in Chinese and Asian populations.


2016 ◽  
Author(s):  
Li Fang ◽  
Jiang Hu ◽  
Depeng Wang ◽  
Kai Wang

Abstract. Background: Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is needed and how to optimally use the aligners and SV callers. Results: In this study, we developed NextSV, a meta-caller to perform SV calling from low coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. We evaluated SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, NextSV stringent call set had higher precision and balanced accuracy (F1 score) while NextSV sensitive call set had a higher recall. At 10X coverage, the recall of NextSV sensitive call set was 93.5% to 94.1% for deletions and 87.9% to 93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between the sequencing costs and the recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset. Conclusions: Our results provide useful guidelines for SV detection from low coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data.
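
The precision, recall and F1 comparison described in the abstract above can be illustrated with a minimal sketch. The function name and the true-positive, false-positive and false-negative counts below are hypothetical and chosen only to show the trade-off; they are not NextSV results, and this is not NextSV's evaluation code.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) for one SV call set scored against a truth set."""
    called = true_positives + false_positives   # all calls made by the caller
    truth = true_positives + false_negatives    # all SVs present in the truth set
    precision = true_positives / called if called else 0.0
    recall = true_positives / truth if truth else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (assumptions for illustration only)
sensitive = precision_recall_f1(940, 300, 60)    # more calls: higher recall
stringent = precision_recall_f1(850, 50, 150)    # fewer calls: higher precision

for name, (p, r, f1) in (("sensitive", sensitive), ("stringent", stringent)):
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With these made-up counts the stringent set wins on precision and F1 while the sensitive set wins on recall, which is the qualitative pattern the abstract reports.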


2019 ◽  
Author(s):  
Kishwar Shafin ◽  
Trevor Pesout ◽  
Ryan Lorig-Roach ◽  
Marina Haukness ◽  
Hugh E. Olsen ◽  
...  

Abstract: Present workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
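
The abstract above equates 99.9% identity with QV30. A minimal sketch of that conversion follows, assuming the standard Phred-scaled definition QV = -10 * log10(error rate); the function names are illustrative and do not come from Shasta, MarginPolish or HELEN.

```python
import math


def identity_to_qv(identity):
    """Convert a fractional sequence identity (e.g. 0.999) to a Phred-scaled QV."""
    if not 0.0 <= identity < 1.0:
        raise ValueError("identity must be in [0, 1); QV is unbounded at 1.0")
    error_rate = 1.0 - identity
    return -10.0 * math.log10(error_rate)


def qv_to_identity(qv):
    """Inverse conversion: Phred-scaled QV back to fractional identity."""
    return 1.0 - 10.0 ** (-qv / 10.0)


print(round(identity_to_qv(0.999), 2))  # polished assembly identity from the abstract
print(round(identity_to_qv(0.90), 2))   # median read identity from the abstract
```

On this scale every additional 10 QV points corresponds to a tenfold drop in error rate, so the reported 90% read identity is about QV10 and the 99.9% polished assembly is about QV30.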


Author(s):  
Samuel J. Modlin ◽  
Tyler Marbach ◽  
Jim Werngren ◽  
Mikael Mansjö ◽  
Sven E. Hoffner ◽  
...  

Pyrazinamide (PZA) is a widely used antitubercular chemotherapeutic. Typically, PZA resistance (PZA-R) emerges in M. tuberculosis strains with existing resistance to isoniazid and rifampicin (MDR) and is conferred by loss-of-function pncA mutations that inhibit conversion to its active form, pyrazinoic acid (POA). PZA-R departing from this canonical scenario is poorly understood. Here, we genotype pncA and purported alternative PZA-R genes (panD, rpsA, and clpC1) with long-read sequencing of nineteen phenotypically PZA mono-resistant isolates collected in Sweden and compare their phylogenetic and genomic characteristics to a large set of MDR PZA-R (MDRPZA-R) isolates. We report the first association of ClpC1 mutations with PZA-R in clinical isolates, in the ClpC1 promoter (clpC1p-138) and N-terminus (ClpC1Val63Ala). Mutations have emerged in both these regions under POA selection in vitro, and the ClpC1 N-terminus has been implicated further through its POA-dependent efficacy in PanD proteolysis. ClpC1Val63Ala mutants spanned 4 Indo-oceanic sublineages. Indo-oceanic isolates invariably harbored ClpC1Val63Ala and were starkly overrepresented (OR=22.2, p < 0.00001) among PZA mono-resistant isolates (11/19) compared to MDRPZA-R isolates (5/80). The genetic basis of Indo-oceanic isolates’ overrepresentation in PZA mono-resistant TB remains undetermined, but substantial circumstantial evidence suggests ClpC1Val63Ala confers low-level PZA resistance. Our findings highlight ClpC1 as potentially clinically relevant for PZA-R and reinforce the importance of genetic background in the trajectory of resistance development.


2019 ◽  
Author(s):  
Mitchell R. Vollger ◽  
Glennis A. Logsdon ◽  
Peter A. Audano ◽  
Arvis Sulovari ◽  
David Porubsky ◽  
...  

Abstract: The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.
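
The NG50 comparison in the abstract above can be made concrete with a small sketch. It assumes the usual definition of NG50 (the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of an assumed genome or region size). The function name and the toy contig lengths are hypothetical; only the two NG50 values in the fold-change line come from the abstract.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the cumulative sum of contig lengths
    (longest first) first reaches half of the assumed genome size.
    Returns 0 if the contigs cover less than half of the genome size."""
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


# Toy contig lengths (hypothetical), against an assumed genome size of 200
toy_ng50 = ng50([50, 40, 30, 20, 10], genome_size=200)

# Fold change between the NG50 values reported in the abstract (kbp)
hifi_ng50_kbp, clr_ng50_kbp = 480.6, 191.5
fold_increase = hifi_ng50_kbp / clr_ng50_kbp

print(toy_ng50, round(fold_increase, 1))
```

Unlike N50, which normalizes by the total assembled length, NG50 normalizes by a fixed target size, so two assemblies of the same region can be compared directly even when they recover different amounts of sequence.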


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Anbo Zhou ◽  
Timothy Lin ◽  
Jinchuan Xing

Abstract. Background: Structural variations (SVs) account for about 1% of the differences among human genomes and play a significant role in phenotypic variation and disease susceptibility. The emerging nanopore sequencing technology can generate long sequence reads and can potentially provide accurate SV identification. However, the tools for aligning long-read data and detecting SVs have not been thoroughly evaluated. Results: Using four nanopore datasets, including both empirical and simulated reads, we evaluate four alignment tools and three SV detection tools. We also evaluate the impact of sequencing depth on SV detection. Finally, we develop a machine learning approach to integrate call sets from multiple pipelines. Overall, SV callers’ performance varies depending on the SV type. For an initial data assessment, we recommend using the aligner minimap2 in combination with the SV caller Sniffles because of their speed and relatively balanced performance. For detailed analysis, we recommend incorporating information from multiple call sets to improve the SV call performance. Conclusions: We present a workflow for evaluating aligners and SV callers for nanopore sequencing data and approaches for integrating multiple call sets. Our results indicate that additional optimizations are needed to improve SV detection accuracy and sensitivity, and an integrated call set can provide enhanced performance. The nanopore technology is improving, and the sequencing community is likely to grow accordingly. In turn, better benchmark call sets will be available to more accurately assess the performance of available tools and facilitate further tool development.

