GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Daniel L. Cameron ◽  
Jonathan Baber ◽  
Charles Shale ◽  
Jose Espejo Valle-Inclan ◽  
Nicolle Besselink ◽  
...  

Abstract. GRIDSS2 is the first structural variant caller to explicitly report single breakends: breakpoints in which only one side can be unambiguously determined. By treating single breakends as a fundamental genomic rearrangement signal on par with breakpoints, GRIDSS2 can explain 47% of somatic centromere copy number changes using single breakends to non-centromere sequence. On a cohort of 3782 deeply sequenced metastatic cancers, GRIDSS2 achieves an unprecedented 3.1% false negative rate and 3.3% false discovery rate and identifies a novel 32–100 bp duplication signature. GRIDSS2 simplifies complex rearrangement interpretation through phasing of structural variants, with 16% of somatic calls phasable using paired-end sequencing.

Author(s):  
Daniel L. Cameron ◽  
Jonathan Baber ◽  
Charles Shale ◽  
Jose Espejo Valle-Inclan ◽  
Nicolle Besselink ◽  
...  

Abstract. Here we present GRIDSS2, a general purpose structural variant caller optimised for tumour/normal somatic calling. Using cell line, patient sample validation and cohort-level comparisons, we show GRIDSS2 outperforms recent state-of-the-art tools. We demonstrate GRIDSS2 retains high sensitivity and precision even for small events by identifying a small (32-100 bp) duplication signature strongly associated with colorectal cancer using 3,782 metastatic cancers that have been deeply sequenced by the Hartwig Medical Foundation. Essential to the high precision achieved by GRIDSS2 is the novel reporting of single breakend variants: structural variants in which only one side can be unambiguously determined. We show that the inclusion of single breakends reduces the false negative rate from 10.4% to 3.4%. Demonstrating the power single breakend calling has in genomic regions traditionally considered inaccessible to short read callers, we find that 47% of somatic centromeric breaks are repaired to non-centromeric sequence, with chromosome 1 exhibiting a unique centromeric rearrangement signature. Finally, we show that somatic structural variants are highly clustered, with GRIDSS2 able to phase 16% of somatic structural variants in the Hartwig cohort from short read sequencing alone.


2010 ◽  
Vol 15 (9) ◽  
pp. 1116-1122 ◽  
Author(s):  
Xiaohua Douglas Zhang

In most genome-scale RNA interference (RNAi) screens, the ultimate goal is to select siRNAs with a large inhibition or activation effect. The selection of hits typically requires statistical control of two errors: false positives and false negatives. Traditional methods of controlling false positives and false negatives do not take into account an important feature of RNAi screens: many small-interfering RNAs (siRNAs) may have very small but real nonzero average effects on the measured response. As a result, these methods cannot effectively control false positives and false negatives. To address the deficiencies of traditional approaches in RNAi screening, the author proposes a new method for controlling false positives and false negatives in RNAi high-throughput screens. The false negatives are statistically controlled through a false-negative rate (FNR) or false nondiscovery rate (FNDR). FNR is the proportion of false negatives among all siRNAs examined, whereas FNDR is the proportion of false negatives among declared nonhits. The author also proposes new concepts, q*-value and p*-value, to control FNR and FNDR, respectively. The proposed method should have broad utility for hit selection in which one needs to control both false discovery and false nondiscovery rates in genome-scale RNAi screens in a robust manner.
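
The two rates defined in this abstract differ only in their denominators, which a small calculation makes concrete. The sketch below is illustrative only: the class name, field names and counts are hypothetical and do not come from the paper, and it does not implement the q*-value or p*-value procedures.

```python
from dataclasses import dataclass


@dataclass
class ScreenCounts:
    """Hypothetical outcome counts for one screen (names are illustrative)."""
    true_positives: int    # real hits declared as hits
    false_positives: int   # non-hits declared as hits
    true_negatives: int    # non-hits declared as nonhits
    false_negatives: int   # real hits declared as nonhits


def false_negative_rate(c: ScreenCounts) -> float:
    """FNR as defined in the abstract: false negatives among ALL siRNAs examined."""
    examined = (c.true_positives + c.false_positives
                + c.true_negatives + c.false_negatives)
    return c.false_negatives / examined


def false_nondiscovery_rate(c: ScreenCounts) -> float:
    """FNDR as defined in the abstract: false negatives among declared nonhits."""
    declared_nonhits = c.true_negatives + c.false_negatives
    return c.false_negatives / declared_nonhits


# Made-up counts for a screen of 1000 siRNAs.
counts = ScreenCounts(true_positives=80, false_positives=20,
                      true_negatives=880, false_negatives=20)
fnr = false_negative_rate(counts)       # 20 / 1000
fndr = false_nondiscovery_rate(counts)  # 20 / 900
```

Because the declared nonhits are a subset of all siRNAs examined, FNDR is never smaller than FNR under these definitions.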


2020 ◽  
Author(s):  
Christos Saragiotis ◽  
Ivan Kitov

Two principal performance measures of the International Monitoring System (IMS) stations' detection capability are the rate of automatic detections associated with events in the Reviewed Event Bulletin (REB) and the rate of detections manually added to the REB. These two metrics roughly correspond, respectively, to the precision (the complement of the false-discovery rate) and the miss rate (or false-negative rate) of a binary classification test. The false-discovery and miss rates are strongly influenced by the number of phases detected by the detection algorithm, which in turn depends on prespecified slowness-, frequency- and azimuth-dependent threshold values used in the short-term average over long-term average ratio detection scheme of the IMS stations. In particular, the lower the threshold, the more detections and therefore the lower the miss rate but the higher the false-discovery rate; the higher the threshold, the fewer detections and therefore the higher the miss rate but the lower the false-discovery rate. In that sense, decreasing the false-discovery rate and decreasing the miss rate are conflicting goals that need to be balanced. On the one hand, it is essential that the miss rate is as low as possible since no nuclear explosion should go unnoticed by the IMS. On the other hand, a high false-discovery rate compromises the quality of the automatically generated event lists and adds heavy and unnecessary workload to the seismic analysts during the interactive processing stage.

A previous study concluded that a way to decrease both the miss and false-discovery rates as well as the analyst workload is to increase the retiming interval, i.e., the maximum time by which an analyst may move an arrival pick without having to declare a new arrival. Indeed, when a detection needs to be moved by an interval larger than the retiming interval, not only is this a much more time-consuming task for the analyst than just retiming it, but it also negatively affects both the associated rate (the automatic detection is deleted and therefore not associated with an event) and the added rate (a new arrival has to be added to the arrival list). Since October 2018, the International Data Centre has increased the retiming interval from 4 s to 10 s. We show how this change affected the associated-detections and added-detections rates and how the values of these metrics can be further improved by tuning the detection threshold levels.
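
The threshold trade-off described in this abstract can be illustrated with a toy detector. This is a minimal sketch on made-up scores, not the IMS detection scheme: the function name, the scores and the labels are all hypothetical, and a single scalar threshold stands in for the slowness-, frequency- and azimuth-dependent thresholds.

```python
def detection_rates(scores, is_real, threshold):
    """Return (false_discovery_rate, miss_rate) for a simple threshold detector.

    scores    -- detection statistic per candidate (e.g. an STA/LTA-like ratio)
    is_real   -- True where the candidate is a genuine arrival
    threshold -- candidates with score >= threshold are declared detections
    """
    detected = [s >= threshold for s in scores]
    tp = sum(d and r for d, r in zip(detected, is_real))
    fp = sum(d and not r for d, r in zip(detected, is_real))
    fn = sum((not d) and r for d, r in zip(detected, is_real))
    fdr = fp / (tp + fp) if (tp + fp) else 0.0    # 1 - precision
    miss = fn / (tp + fn) if (tp + fn) else 0.0   # false-negative rate
    return fdr, miss


# Made-up candidates: noise tends to score low, genuine arrivals high.
scores = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0]
is_real = [False, False, True, False, True, True, True, True]

low_fdr, low_miss = detection_rates(scores, is_real, threshold=2.0)
high_fdr, high_miss = detection_rates(scores, is_real, threshold=3.2)
```

With these made-up values, the lower threshold misses nothing but admits two false detections, while the higher threshold admits none but misses one genuine arrival.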


2005 ◽  
Vol 45 (8) ◽  
pp. 859 ◽  
Author(s):  
G. J. McLachlan ◽  
R. W. Bean ◽  
L. Ben-Tovim Jones ◽  
J. X. Zhu

An important and common problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. As this problem concerns the selection of significant genes from a large pool of candidate genes, it needs to be carried out within the framework of multiple hypothesis testing. In this paper, we focus on the use of mixture models to handle the multiplicity issue. With this approach, a measure of the local false discovery rate is provided for each gene, and it can be implemented so that the implied global false discovery rate is bounded as with the Benjamini-Hochberg methodology based on tail areas. The latter procedure is too conservative, unless it is modified according to the prior probability that a gene is not differentially expressed. An attractive feature of the mixture model approach is that it provides a framework for the estimation of this probability and its subsequent use in forming a decision rule. The rule can also be formed to take the false negative rate into account.
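
For a two-component mixture, the local false discovery rate of a gene is the posterior probability that it belongs to the null component. The sketch below assumes a standard normal null and a normal alternative for a per-gene z-score; these distributional choices, the parameter values and the function names are assumptions for illustration, not details taken from the abstract.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z: float, pi0: float, alt_mean: float, alt_sd: float) -> float:
    """Posterior probability that a gene with score z is NOT differentially expressed.

    pi0      -- prior probability that a gene is not differentially expressed
    alt_mean -- assumed mean of z-scores for differentially expressed genes
    alt_sd   -- assumed standard deviation for differentially expressed genes
    """
    null_part = pi0 * normal_pdf(z, 0.0, 1.0)                 # assumed N(0, 1) null
    alt_part = (1.0 - pi0) * normal_pdf(z, alt_mean, alt_sd)  # assumed alternative
    return null_part / (null_part + alt_part)


# Illustrative parameters: 90% of genes null, alternative centred at z = 3.
near_zero = local_fdr(0.0, pi0=0.9, alt_mean=3.0, alt_sd=1.0)
far_out = local_fdr(4.0, pi0=0.9, alt_mean=3.0, alt_sd=1.0)
```

A decision rule can then declare a gene differentially expressed when its local FDR falls below a chosen cutoff; in practice the mixture parameters, including `pi0`, would be estimated from the data rather than fixed by hand.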


2006 ◽  
Vol 16 (05) ◽  
pp. 353-362 ◽  
Author(s):  
LIAT BEN-TOVIM JONES ◽  
RICHARD BEAN ◽  
GEOFFREY J. MCLACHLAN ◽  
JUSTIN XI ZHU

An important and common problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. As this problem concerns the selection of significant genes from a large pool of candidate genes, it needs to be carried out within the framework of multiple hypothesis testing. In this paper, we focus on the use of mixture models to handle the multiplicity issue. With this approach, a measure of the local FDR (false discovery rate) is provided for each gene. An attractive feature of the mixture model approach is that it provides a framework for the estimation of the prior probability that a gene is not differentially expressed, and this probability can subsequently be used in forming a decision rule. The rule can also be formed to take the false negative rate into account. We apply this approach to a well-known publicly available data set on breast cancer, and discuss our findings with reference to other approaches.


2019 ◽  
Author(s):  
Daniel L. Cameron ◽  
Jonathan Baber ◽  
Charles Shale ◽  
Anthony T. Papenfuss ◽  
Jose Espejo Valle-Inclan ◽  
...  

Abstract. We have developed a novel, integrated and comprehensive purity, ploidy, structural variant and copy number somatic analysis toolkit for whole genome sequencing data of paired tumor/normal samples. We show that the combination of using GRIDSS for somatic structural variant calling and PURPLE for somatic copy number alteration calling allows highly sensitive, precise and consistent copy number and structural variant determination, as well as providing novel insights for short structural variants and regions of complex local topology. LINX, an interpretation tool, leverages the integrated structural variant and copy number calling to cluster individual structural variants into higher order events and chains them together to predict local derivative chromosome structure. LINX classifies and extensively annotates genomic rearrangements including simple and reciprocal breaks, LINE, viral and pseudogene insertions, and complex events such as chromothripsis. LINX also comprehensively calls genic fusions including chained fusions. Finally, our toolkit provides novel visualisation methods providing insight into complex genomic rearrangements.


GigaScience ◽  
2019 ◽  
Vol 8 (9) ◽  
Author(s):  
Varuna Chander ◽  
Richard A Gibbs ◽  
Fritz J Sedlazeck

Abstract. Background: Structural variation (SV) plays a pivotal role in genetic disease. The discovery of SVs based on short DNA sequence reads from next-generation DNA sequence methods is error-prone, with low sensitivity and high false discovery rates. These shortcomings can be partially overcome with extensive orthogonal validation methods or use of long reads, but the current cost precludes their application for routine clinical diagnostics. In contrast, SV genotyping of known sites of SV occurrence is relatively robust and therefore offers a cost-effective clinical diagnostic tool with potentially few false-positive and false-negative results, even when applied to short-read DNA sequence data. Results: We assess 5 state-of-the-art SV genotyping software methods, applied to short-read sequence data. The methods are characterized on the basis of their ability to genotype different SV types, spanning different size ranges. Furthermore, we analyze their ability to parse different VCF file subformats and assess their reliance on specific metadata. We compare the SV genotyping methods across a range of simulated and real data including SVs that were not found with Illumina data alone. We assess sensitivity and the ability to filter initial false discovery calls. We determined the impact of SV type and size on the performance for each SV genotyper. Overall, STIX performed the best on both simulated and GiaB-based SV calls, demonstrating a good balance between sensitivity and specificity. Conclusion: Our results indicate that, although SV genotyping software methods have superior performance to SV callers, there are limitations that suggest the need for further innovation.


2020 ◽  
Author(s):  
Charles Shale ◽  
Jonathan Baber ◽  
Daniel L. Cameron ◽  
Marie Wong ◽  
Mark J. Cowley ◽  
...  

Abstract. Complex somatic genomic rearrangement and copy number alterations (CNA) are hallmarks of nearly all cancers. Whilst whole genome sequencing (WGS) in principle allows comprehensive profiling of these events, biological and clinical interpretation remains challenging. We have developed LINX, a novel algorithm which allows interpretation of short-read paired-end WGS derived structural variant and CNA data by clustering raw structural variant calls into distinct events, predicting their impact on the local structure of the derivative chromosome, and annotating their functional impact on affected genes. Novel visualisations facilitate further investigation of complex genomic rearrangements. We show that LINX provides insights into a diverse range of structural variation events including single and double break-junction events, mobile element insertions, complex shattering and high amplification events. We demonstrate that LINX can reliably detect a wide range of pathogenic rearrangements including gene fusions, immunoglobulin enhancer rearrangements, intragenic deletions and duplications. Uniquely, LINX also predicts chained fusions which we demonstrate account for 13% of clinically relevant oncogenic fusions. LINX also reports a class of inactivation events we term homozygous disruptions which may be a driver mutation in up to 8.8% of tumors including frequently affecting PTEN, TP53 and RB1, and are likely missed by many standard WGS analysis pipelines.


2017 ◽  
Author(s):  
Yilong Li ◽  
Nicola D Roberts ◽  
Joachim Weischenfeldt ◽  
Jeremiah A Wala ◽  
Ofer Shapira ◽  
...  

Abstract. A key mutational process in cancer is structural variation, in which rearrangements delete, amplify or reorder genomic segments ranging in size from kilobases to whole chromosomes. We developed methods to group, classify and describe structural variants, applied to >2,500 cancer genomes. Nine signatures of structural variation emerged. Deletions have a trimodal size distribution; assort unevenly across tumour types and patients; enrich in late-replicating regions; and correlate with inversions. Tandem duplications also have a trimodal size distribution, but enrich in early-replicating regions, as do unbalanced translocations. Replication-based mechanisms of rearrangement generate varied chromosomal structures with low-level copy number gains and frequent inverted rearrangements. One prominent structure consists of 1-7 templates copied from distinct regions of the genome strung together within one locus. Such ‘cycles of templated insertions’ correlate with tandem duplications, frequently activating the telomerase gene, TERT, in liver cancer. Cancers access many rearrangement processes, flexibly sculpting the genome to maximise oncogenic potential.


2020 ◽  
Author(s):  
Lauris Kaplinski ◽  
Märt Möls ◽  
Tarmo Puurand ◽  
Fanny-Dhelia Pajuste ◽  
Maido Remm

Abstract. Motivation: KATK is a fast and accurate software tool for calling variants directly from raw NGS reads. It uses predefined k-mers to retrieve only the reads of interest from the FASTQ file and calls genotypes by aligning retrieved reads locally. KATK does not use data about known polymorphisms and has NC (No Call) as default genotype. The reference or variant allele is called only if there is sufficient evidence for its presence in the data. Thus it is not biased against rare variants or de novo mutations. Results: With simulated datasets, we achieved a false negative rate of 0.23% (sensitivity 99.77%) and a false discovery rate of 0.19%. Calling all human exonic regions with KATK requires 1-2 h, depending on sequencing coverage. Availability: KATK is distributed under the terms of GNU GPL v3. The k-mer databases are distributed under the Creative Commons CC BY-NC-SA license. The source code is available at GitHub as part of Genometester4 package (https://github.com/bioinfo-ut/GenomeTester4/). The binaries of KATK package and k-mer databases described in the current paper are available on http://bioinfo.ut.ee/KATK/.

