Identifying structural variants using linked-read sequencing data

2017 ◽  
Author(s):  
Rebecca Elyanow ◽  
Hsin-Ta Wu ◽  
Benjamin J. Raphael

Abstract Structural variation, including large deletions, duplications, inversions, translocations, and other rearrangements, is common in human and cancer genomes. A number of methods have been developed to identify structural variants from Illumina short-read sequencing data. However, reliable identification of structural variants remains challenging because many variants have breakpoints in repetitive regions of the genome and thus are difficult to identify with short reads. The recently developed linked-read sequencing technology from 10X Genomics combines a novel barcoding strategy with Illumina sequencing. This technology labels all reads that originate from a small number (~5-10) of DNA molecules ~50 kbp in length with the same molecular barcode. These barcoded reads contain long-range sequence information that is advantageous for identification of structural variants. We present Novel Adjacency Identification with Barcoded Reads (NAIBR), an algorithm to identify structural variants in linked-read sequencing data. NAIBR predicts novel adjacencies in an individual genome resulting from structural variants using a probabilistic model that combines multiple signals in barcoded reads. We show that NAIBR outperforms several existing methods for structural variant identification – including two recent methods that also analyze linked-reads – on simulated sequencing data and 10X whole-genome sequencing data from the NA12878 human genome and the HCC1954 breast cancer cell line. Several of the novel somatic structural variants identified in HCC1954 overlap known cancer genes.
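One of the signals that linked reads offer can be illustrated with a small sketch: two loci that are far apart in the reference but share many barcodes may be adjacent in the sample genome. The code below is not NAIBR's probabilistic model; it only counts shared barcodes between fixed-size windows. The `(chrom, pos, barcode)` read layout, the 10 kbp window size, and all function names are assumptions made for this example.

```python
from collections import defaultdict


def barcodes_by_window(reads, window=10_000):
    """Map (chrom, window_index) -> set of barcodes seen in that window.

    `reads` is an iterable of (chrom, pos, barcode) tuples; this layout is
    a hypothetical input format chosen for the example.
    """
    index = defaultdict(set)
    for chrom, pos, barcode in reads:
        index[(chrom, pos // window)].add(barcode)
    return index


def shared_barcode_count(index, win_a, win_b):
    """Count barcodes observed in both windows."""
    return len(index.get(win_a, set()) & index.get(win_b, set()))


# Toy data: two windows 5 Mbp apart on chr1 share barcodes BC01 and BC02.
reads = [
    ("chr1", 1_200, "BC01"), ("chr1", 3_400, "BC02"), ("chr1", 8_900, "BC03"),
    ("chr1", 5_001_000, "BC01"), ("chr1", 5_004_500, "BC02"),
    ("chr1", 5_007_000, "BC09"),
    ("chr2", 2_500, "BC07"),
]
index = barcodes_by_window(reads)
shared_far = shared_barcode_count(index, ("chr1", 0), ("chr1", 500))
shared_other = shared_barcode_count(index, ("chr1", 0), ("chr2", 0))
print(shared_far, shared_other)
```

In real data, distant windows can share a barcode by chance because each barcode labels several molecules, so a raw count like this would need to be compared against a background expectation before it suggests a candidate adjacency.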

2021 ◽  
Vol 15 (1) ◽  
Author(s):  
Shaza Malik ◽  
Roan Zaied ◽  
Najeeb Syed ◽  
Puthen Jithesh ◽  
Mashael Al-Shafai

Abstract Background: Glucose-6-phosphate dehydrogenase deficiency (G6PDD) is the most common red cell enzymopathy in the world. In Qatar, the incidence of G6PDD is estimated at around 5%; however, no study has yet investigated the genetic basis of G6PDD in the Qatari population. Methods: In this study, we analyzed whole-genome sequencing data generated by the Qatar Genome Programme for 6045 Qatar Biobank participants, to identify G6PDD variants in the Qatari population. In addition, we assessed the impact of the novel variants identified on protein function both in silico and by measuring G6PD enzymatic activity in the subjects carrying them. Results: We identified 375 variants in or near the G6PD gene, of which 20 were high-impact and 16 were moderate-impact variants. Of these, 14 were known G6PDD-causing variants. The most frequent G6PDD-causing variants found in the Qatari population were p.Ser188Phe (G6PD Mediterranean), p.Asn126Asp (G6PD A+), p.Val68Met (G6PD Asahi), p.Ala335Thr (G6PD Chatham), and p.Ile48Thr (G6PD Aures) with allele frequencies of 0.0563, 0.0194, 0.00785, 0.0050, and 0.00380, respectively. Furthermore, we have identified seven novel G6PD variants, all of which were confirmed as G6PDD-causing variants and classified as class III variants based on the World Health Organization’s classification scheme. Conclusions: This is the first study investigating the molecular basis of G6PDD in Qatar, and it provides novel insights about G6PDD pathogenesis and highlights the importance of studying such understudied populations.


2017 ◽  
Author(s):  
Jeremiah Wala ◽  
Pratiti Bandopadhayay ◽  
Noah Greenwald ◽  
Ryan O’Rourke ◽  
Ted Sharpe ◽  
...  

Abstract Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at-scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA’s performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs, and substantially improved detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (< 1,000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types, and found that templated-sequence insertions occur in ~4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized SVs.


2019 ◽  
Author(s):  
Sergey Aganezov ◽  
Benjamin J. Raphael

Abstract Many cancer genomes are extensively rearranged with highly aberrant chromosomal karyotypes. These genome rearrangements, or structural variants, can be detected in tumor DNA sequencing data by abnormal mapping of sequence reads to the reference genome. However, nearly all cancer sequencing to date is of bulk tumor samples which consist of a heterogeneous mixture of normal cells and subpopulations of cancer cells, or clones, that harbor distinct somatic structural variants. We introduce a novel algorithm, Reconstructing Cancer Karyotypes (RCK), to reconstruct haplotype-specific karyotypes of one or more rearranged cancer genomes, or clones, that best explain the read alignments from a bulk tumor sample. RCK leverages specific evolutionary constraints on the somatic mutation process in cancer to reduce ambiguity in the deconvolution of admixed DNA sequence data into multiple haplotype-specific cancer karyotypes. In particular, RCK relies on generalizations of the infinite sites assumption that a genome rearrangement is highly unlikely to occur at the same nucleotide position more than once during somatic evolution. RCK’s comprehensive model allows us to incorporate information from both short- and long-read sequencing technologies and is applicable to bulk tumor samples containing a mixture of an arbitrary number of derived genomes. We compared RCK to the state-of-the-art method ReMixT on a dataset of 17 primary and metastatic prostate cancer samples. We demonstrate that ReMixT’s limited support for heterogeneity and lack of evolutionary constraints lead to reconstruction of implausible karyotypes. In contrast, RCK infers cancer karyotypes that better explain read alignments from bulk tumor samples and are consistent with a reasonable evolutionary model. RCK’s reconstructions of clone- and haplotype-specific karyotypes will aid further studies of the role of intra-tumor heterogeneity in cancer development and response to treatment.
RCK is available at https://github.com/raphael-group/RCK.
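The infinite sites assumption mentioned in the abstract can be made concrete with a small check: if no rearrangement hits the same nucleotide position twice, then no breakpoint should take part in more than one novel adjacency. The sketch below tests only this plain form of the assumption, not the generalizations that RCK relies on, and it is not taken from the RCK code base. The `((chrom, pos), (chrom, pos))` adjacency layout and the function name are assumptions made for this example.

```python
from collections import Counter


def reused_breakpoints(adjacencies):
    """Return breakpoint positions that appear in more than one adjacency.

    Each adjacency is a pair of breakpoints ((chrom, pos), (chrom, pos));
    this layout is hypothetical and chosen for illustration only.
    """
    counts = Counter()
    for left, right in adjacencies:
        # Use a set so an adjacency joining a position to itself counts once.
        counts.update({left, right})
    return sorted(bp for bp, n in counts.items() if n > 1)


# Toy data: position chr1:5000 is used by the first and third adjacency.
adjacencies = [
    (("chr1", 1000), ("chr1", 5000)),
    (("chr2", 300), ("chr8", 9000)),
    (("chr1", 5000), ("chr3", 42)),
]
reused = reused_breakpoints(adjacencies)
print(reused)
```

An empty result means the adjacency set is compatible with the plain infinite sites assumption; a non-empty result lists the positions that would need a repeated breakage to explain.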


2020 ◽  
Vol 295 (42) ◽  
pp. 14510-14521 ◽  
Author(s):  
Mark F. Fisher ◽  
Colton D. Payne ◽  
Thaveshini Chetty ◽  
Darren Crayn ◽  
Oliver Berkowitz ◽  
...  

Cyclic peptides are reported to have antibacterial, antifungal, and other bioactivities. Orbitides are a class of cyclic peptides that are small, head-to-tail cyclized, composed of proteinogenic amino acids and lack disulfide bonds; they are also known in several genera of the plant family Rutaceae. Melicope xanthoxyloides is the Australian rain forest tree of the Rutaceae family in which evolidine, the first plant cyclic peptide, was discovered. Evolidine (cyclo-SFLPVNL) has subsequently been all but forgotten in the academic literature, so to redress this we used tandem MS and de novo transcriptomics to rediscover evolidine and decipher its biosynthetic origin from a short precursor just 48 residues in length. We also identified another six M. xanthoxyloides orbitides using the same techniques. These peptides have atypically diverse C termini consisting of residues not recognized by either of the known proteases plants use to macrocyclize peptides, suggesting new cyclizing enzymes await discovery. We examined the structure of two of the novel orbitides by NMR, finding one had a definable structure, whereas the other did not. Mining RNA-seq and whole genome sequencing data from other species of the Rutaceae family revealed that a large and diverse family of peptides is encoded by similar sequences across the family and demonstrates how powerful de novo transcriptomics can be at accelerating the discovery of new peptide families.


2020 ◽  
Author(s):  
Xiao Chen ◽  
Fei Shen ◽  
Nina Gonzaludo ◽  
Alka Malhotra ◽  
Cande Rogert ◽  
...  

Abstract Responsible for the metabolism of 25% of clinically used drugs, CYP2D6 is a critical component of personalized medicine initiatives. Genotyping CYP2D6 is challenging due to sequence similarity with its pseudogene paralog CYP2D7 and a high number and variety of common structural variants (SVs). Here we describe a novel bioinformatics method, Cyrius, that accurately genotypes CYP2D6 using whole-genome sequencing (WGS) data. We show that Cyrius has superior performance (96.5% concordance with truth genotypes) compared to existing methods (84-86.8%). After implementing the improvements identified from the comparison against the truth data, Cyrius’s accuracy has since been improved to 99.3%. Using Cyrius, we built a haplotype frequency database from 2504 ethnically diverse samples and estimate that SV-containing star alleles are more frequent than previously reported. Cyrius will be an important tool to incorporate pharmacogenomics in WGS-based precision medicine initiatives.


2019 ◽  
Author(s):  
Clement Goubert ◽  
Jainy Thomas ◽  
Lindsay M. Payer ◽  
Jeffrey M. Kidd ◽  
Julie Feusier ◽  
...  

ABSTRACT Alu retrotransposons account for more than 10% of the human genome, and insertions of these elements create structural variants segregating in human populations. Such polymorphic Alu are powerful markers to understand population structure, and they represent variants that can greatly impact genome function, including gene expression. Accurate genotyping of Alu and other mobile elements has been challenging. Indeed, we found that Alu genotypes previously called for the 1000 Genomes Project are sometimes erroneous, which poses significant problems for phasing these insertions with other variants that comprise the haplotype. To ameliorate this issue, we introduce a new pipeline -- TypeTE -- which genotypes Alu insertions from whole-genome sequencing data. Starting from a list of polymorphic Alus, TypeTE identifies the hallmarks (poly-A tail and target site duplication) and orientation of Alu insertions using local re-assembly to reconstruct presence and absence alleles. Genotype likelihoods are then computed after re-mapping sequencing reads to the reconstructed alleles. Using a ‘gold standard’ set of PCR-based genotyping of >200 loci, we show that TypeTE improves genotype accuracy from 83% to 92% in the 1000 Genomes dataset. TypeTE can be readily adapted to other retrotransposon families and brings a valuable toolbox addition for population genomics.


2020 ◽  
Author(s):  
Mark F. Fisher ◽  
Colton Payne ◽  
Thaveshini Chetty ◽  
Darren Crayn ◽  
Oliver Berkowitz ◽  
...  

Abstract Cyclic peptides are reported to have antibacterial, antifungal and other bioactivities. Several genera of the Rutaceae family are known to produce orbitides, which are small head-to-tail cyclic peptides composed of proteinogenic amino acids and lacking disulfide bonds. Melicope xanthoxyloides is an Australian rain forest tree of the Rutaceae family in which evolidine - the first plant cyclic peptide - was discovered. Evolidine (cyclo-SFLPVNL) has subsequently been all but forgotten in the academic literature, but here we use tandem mass spectrometry to rediscover evolidine and using de novo transcriptomics we show its biosynthetic origin to be from a short precursor just 48 residues in length. In all, seven M. xanthoxyloides orbitides were found and they had atypically diverse C-termini consisting of residues not recognized by either of the known proteases plants use to macrocyclize peptides. Two of the novel orbitides were studied by nuclear magnetic resonance spectroscopy and although one had definable structure, the other did not. By mining RNA-seq and whole genome sequencing data from other species, it was apparent that a large and diverse family of peptides is encoded by sequences like these across the Rutaceae.


2019 ◽  
Author(s):  
Daniel L. Cameron ◽  
Jonathan Baber ◽  
Charles Shale ◽  
Anthony T. Papenfuss ◽  
Jose Espejo Valle-Inclan ◽  
...  

Abstract We have developed a novel, integrated and comprehensive purity, ploidy, structural variant and copy number somatic analysis toolkit for whole genome sequencing data of paired tumor/normal samples. We show that the combination of using GRIDSS for somatic structural variant calling and PURPLE for somatic copy number alteration calling allows highly sensitive, precise and consistent copy number and structural variant determination, as well as providing novel insights for short structural variants and regions of complex local topology. LINX, an interpretation tool, leverages the integrated structural variant and copy number calling to cluster individual structural variants into higher order events and chains them together to predict local derivative chromosome structure. LINX classifies and extensively annotates genomic rearrangements including simple and reciprocal breaks, LINE, viral and pseudogene insertions, and complex events such as chromothripsis. LINX also comprehensively calls genic fusions including chained fusions. Finally, our toolkit provides novel visualisation methods providing insight into complex genomic rearrangements.


2018 ◽  
Vol 56 (6) ◽  
Author(s):  
Stefan Bletz ◽  
Sandra Janezic ◽  
Dag Harmsen ◽  
Maja Rupnik ◽  
Alexander Mellmann

ABSTRACT Clostridium difficile, recently renamed Clostridioides difficile, is the most common cause of antibiotic-associated nosocomial gastrointestinal infections worldwide. To differentiate endogenous infections and transmission events, highly discriminatory subtyping is necessary. Today, methods based on whole-genome sequencing data are increasingly used to subtype bacterial pathogens; however, a standardized methodology and typing nomenclature are frequently missing. Here we report a core genome multilocus sequence typing (cgMLST) approach developed for C. difficile. Initially, we determined the breadth of the C. difficile population based on all available MLST sequence types with Bayesian inference (BAPS). The resulting BAPS partitions were used in combination with C. difficile clade information to select representative isolates that were subsequently used to define cgMLST target genes. Finally, we evaluated the novel cgMLST scheme with genomes from 3,025 isolates. BAPS grouping (n = 6 groups) together with the clade information led to a total of 11 representative isolates that were included for cgMLST definition and resulted in 2,270 cgMLST genes that were present in all isolates. Overall, 2,184 to 2,268 cgMLST targets were detected in the genome sequences of 70 outbreak-associated and reference strains, and on average 99.3% of cgMLST targets (1,116 to 2,270 targets) were present in 2,954 genomes downloaded from the NCBI database, underlining the representativeness of the cgMLST scheme. Moreover, reanalyses of different cluster scenarios with cgMLST were concordant with published single nucleotide variant analyses. In conclusion, the novel cgMLST is representative of the whole C. difficile population, is highly discriminatory in outbreak situations, and provides a unique nomenclature facilitating interlaboratory exchange.
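To show how a cgMLST scheme is typically used for subtyping, the sketch below compares isolates by counting the target genes at which their allele numbers differ. The abstract does not state how distances or clusters were computed, so the pairwise-ignore handling of missing targets, the locus names, and the function name are all assumptions made for this example.

```python
def allelic_distance(profile_a, profile_b):
    """Count loci whose allele numbers differ between two isolates.

    Profiles map a locus name to an allele number, or to None when the
    target was not detected. Loci missing in either isolate are skipped,
    which is one common convention and an assumption here.
    """
    differing = 0
    for locus in profile_a.keys() & profile_b.keys():
        a, b = profile_a[locus], profile_b[locus]
        if a is not None and b is not None and a != b:
            differing += 1
    return differing


# Toy profiles over four hypothetical cgMLST targets.
iso_a = {"locus_1": 4, "locus_2": 7, "locus_3": 1, "locus_4": None}
iso_b = {"locus_1": 4, "locus_2": 9, "locus_3": 1, "locus_4": 3}
iso_c = {"locus_1": 5, "locus_2": 9, "locus_3": 2, "locus_4": 3}

d_ab = allelic_distance(iso_a, iso_b)
d_bc = allelic_distance(iso_b, iso_c)
d_ac = allelic_distance(iso_a, iso_c)
print(d_ab, d_bc, d_ac)
```

With 2,270 targets per genome, small distances of this kind are what would group isolates into a putative transmission cluster; the threshold for doing so is a separate choice that this sketch does not address.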


2020 ◽  
Vol 48 (6) ◽  
pp. e36-e36 ◽  
Author(s):  
Clément Goubert ◽  
Jainy Thomas ◽  
Lindsay M Payer ◽  
Jeffrey M Kidd ◽  
Julie Feusier ◽  
...  

Abstract Alu retrotransposons account for more than 10% of the human genome, and insertions of these elements create structural variants segregating in human populations. Such polymorphic Alus are powerful markers to understand population structure, and they represent variants that can greatly impact genome function, including gene expression. Accurate genotyping of Alus and other mobile elements has been challenging. Indeed, we found that Alu genotypes previously called for the 1000 Genomes Project are sometimes erroneous, which poses significant problems for phasing these insertions with other variants that comprise the haplotype. To ameliorate this issue, we introduce a new pipeline – TypeTE – which genotypes Alu insertions from whole-genome sequencing data. Starting from a list of polymorphic Alus, TypeTE identifies the hallmarks (poly-A tail and target site duplication) and orientation of Alu insertions using local re-assembly to reconstruct presence and absence alleles. Genotype likelihoods are then computed after re-mapping sequencing reads to the reconstructed alleles. Using a high-quality set of PCR-based genotyping of >200 loci, we show that TypeTE improves genotype accuracy from 83% to 92% in the 1000 Genomes dataset. TypeTE can be readily adapted to other retrotransposon families and brings a valuable toolbox addition for population genomics.
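The genotype likelihood step can be sketched with a generic binomial model: once reads are re-mapped to the presence and absence alleles, the counts supporting each allele are scored under the three possible genotypes. This is a simplified illustration and not TypeTE's actual likelihood model; the 1% error rate, the genotype labels, and the function name are assumptions made for this example.

```python
from math import comb


def genotype_likelihoods(n_presence, n_absence, error=0.01):
    """Binomial likelihood of the read counts under each genotype.

    Genotypes are labeled by insertion copies: "0/0" (no insertion),
    "0/1" (heterozygous), "1/1" (homozygous insertion). `error` is an
    assumed per-read chance of supporting the wrong allele.
    """
    n = n_presence + n_absence
    # Probability that a read supports the presence allele, per genotype.
    p_presence = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    return {
        gt: comb(n, n_presence) * p ** n_presence * (1 - p) ** n_absence
        for gt, p in p_presence.items()
    }


# Toy counts: 6 reads support the insertion, 5 support the empty site.
likelihoods = genotype_likelihoods(n_presence=6, n_absence=5)
best = max(likelihoods, key=likelihoods.get)
print(best)
```

Balanced counts favor the heterozygous genotype, while counts that support almost only one allele favor the matching homozygous call.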

