Benchmarking germline CNV calling tools from exome sequencing data

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Veronika Gordeeva ◽  
Elena Sharova ◽  
Konstantin Babalyan ◽  
Rinat Sultanov ◽  
Vadim M. Govorun ◽  
...  

Abstract: Whole-exome sequencing is an attractive alternative to microarray analysis because of its low cost and potential ability to detect copy number variations (CNV) of various sizes (from 1–2 exons to several Mb). Previous comparisons of the most popular CNV calling tools showed a high proportion of false-positive calls. Moreover, because a gold-standard CNV set is lacking, the results are limited and incomparable. Here, we aimed to perform a comprehensive analysis of the tools currently capable of germline CNV calling, using a single CNV standard and reference sample set. Compiling variants from previous studies with a Bayesian estimation approach, we constructed an internal standard for the NA12878 sample (pilot National Institute of Standards and Technology Reference Material) comprising 110,050 CNV or non-CNV exons. The standard was used to evaluate the performance of 16 germline CNV calling tools on the NA12878 sample and 10 correlated exomes as a reference set, with respect to length distribution, concordance, and efficiency. Each algorithm had a certain range of detected lengths and showed low concordance with the other tools. Most tools focus on detecting a limited number of CNVs one to seven exons long, with a false-positive rate below 50%. EXCAVATOR2, exomeCopy, and FishingCNV targeted a wide range of variations but showed low precision. Upon unified comparison, the tools were not equivalent. The analysis performed allows one to choose the algorithms or ensembles of algorithms best suited to a specific goal, e.g., population studies or medical genetics.
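For intuition, here is a minimal sketch of the kind of Bayesian consensus estimation described above: per-exon CNV labels from several prior call sets are combined into a posterior probability of a true CNV, given assumed per-source sensitivities and specificities. All numbers, the naive-independence assumption, and the function itself are illustrative, not taken from the paper.

```python
# Illustrative Bayesian consensus for a single exon. Each prior call set
# contributes a CNV / non-CNV label; assumed per-source sensitivity and
# specificity turn those labels into a posterior probability of a true CNV.
def posterior_cnv(labels, sens, spec, prior=0.01):
    """labels: 1 if a source called the exon CNV, 0 otherwise."""
    p_cnv, p_ref = prior, 1.0 - prior
    for y, se, sp in zip(labels, sens, spec):
        # likelihood of this source's label under CNV vs. non-CNV truth
        p_cnv *= se if y else (1.0 - se)
        p_ref *= (1.0 - sp) if y else sp
    return p_cnv / (p_cnv + p_ref)

# three sources call the exon, one misses it (all rates hypothetical)
print(posterior_cnv(labels=[1, 1, 1, 0],
                    sens=[0.90, 0.80, 0.85, 0.70],
                    spec=[0.99, 0.98, 0.99, 0.97]))
```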

Genes ◽  
2021 ◽  
Vol 12 (7) ◽  
pp. 1001
Author(s):  
Jiyoon Han ◽  
Joonhong Park

A simultaneous analysis of nucleotide changes and copy number variations (CNVs) from exome sequencing data has been demonstrated as a potential new first-tier diagnostic strategy for rare neuropsychiatric disorders. In this report, using depth-of-coverage analysis of exome sequencing data, we describe variable phenotypes of epilepsy, intellectual disability (ID), and schizophrenia caused by a 12p13.33–p13.32 terminal microdeletion in a Korean family. We hypothesized that, of the six candidate genes located in this region, CACNA1C and KDM5A were the best candidates to explain the epilepsy, ID, and schizophrenia, and may be responsible for the clinical features reported in cases with monosomy of the 12p13.33 subtelomeric region. Against the background of a microdeletion syndrome described in clinical cases with mild, moderate, and severe neurodevelopmental manifestations and impairments, the clinician may have to judge whether the patient will end up with a more severe or milder end-phenotype, which in turn determines disease prognosis. In our case, the 12p13.33–p13.32 terminal microdeletion may explain the variable expressivity within the same family. However, further comprehensive studies of larger cohorts, with careful phenotyping across the lifespan, are required to elucidate the possible contributions of genetic modifiers and environmental influences to the expressivity of the 12p13.33 microdeletion and its associated characteristics.
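As background for the depth-of-coverage analysis mentioned above, here is a hedged sketch of the general technique: each exon's normalized read depth in the proband is compared against a reference panel, and exons with sharply reduced depth ratios are flagged as candidate deletions. The function, thresholds, and simulated data are assumptions for illustration, not the authors' pipeline.

```python
# Minimal depth-ratio screen for exon-level deletions from exome coverage.
import numpy as np

def deletion_candidates(proband_depth, panel_depths, ratio_cutoff=0.75):
    """proband_depth: per-exon depths (1D); panel_depths: samples x exons."""
    # normalize out library-size differences
    proband = proband_depth / proband_depth.mean()
    panel = panel_depths / panel_depths.mean(axis=1, keepdims=True)
    ratio = proband / panel.mean(axis=0)      # per-exon depth ratio
    return np.where(ratio < ratio_cutoff)[0]  # heterozygous-deletion-like

rng = np.random.default_rng(0)
panel = rng.poisson(100, size=(10, 50)).astype(float)
proband = rng.poisson(100, size=50).astype(float)
proband[5:9] *= 0.5                           # simulate a 4-exon deletion
print(deletion_candidates(proband, panel))
```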


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12564
Author(s):  
Taifu Wang ◽  
Jinghua Sun ◽  
Xiuqing Zhang ◽  
Wen-Jing Wang ◽  
Qing Zhou

Background: Copy-number variants (CNVs) are recognized as one of the major causes of genetic disorders, and reliable detection of CNVs from genome sequencing data is in strong demand for disease research. However, current CNV-detection software suffers from high false-positive rates, which need further improvement. Methods: Here, we propose CNV-P, a novel post-processing approach for CNV prediction: a machine-learning framework that efficiently removes false-positive fragments from the results of CNV-detection tools. A series of CNV signals around the putative CNV fragments, such as read depth (RD), split reads (SR), and read pairs (RP), are used as features to train a classifier. Results: Predictions on several real biological datasets show that our models classify CNVs at over 90% precision and 85% recall, which greatly improves on state-of-the-art algorithms. Furthermore, our results indicate that CNV-P is robust to different CNV sizes and sequencing platforms. Conclusions: Our framework for classifying high-confidence CNVs could improve both basic research and the clinical diagnosis of genetic diseases.
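A minimal sketch of this classification idea, assuming toy RD/SR/RP features and a random forest (the concrete feature distributions and model choice are illustrative, not necessarily those of CNV-P):

```python
# Train a classifier to separate true CNV calls from false positives using
# simple per-call features: read-depth ratio, split-read count, read-pair
# count, and call length. All data below is synthetic for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
n = 2000
# true deletions: depth drops, supporting SR/RP evidence present
X_true = np.column_stack([rng.normal(0.5, 0.1, n), rng.poisson(8, n),
                          rng.poisson(10, n), rng.lognormal(8, 1, n)])
# false calls: near-normal depth, little SR/RP support
X_false = np.column_stack([rng.normal(0.95, 0.1, n), rng.poisson(1, n),
                           rng.poisson(2, n), rng.lognormal(8, 1, n)])
X = np.vstack([X_true, X_false])
y = np.r_[np.ones(n), np.zeros(n)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(precision_score(y_te, pred), recall_score(y_te, pred))
```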


2013 ◽  
Vol 2013 ◽  
pp. 1-7 ◽  
Author(s):  
Yan Guo ◽  
Quanghu Sheng ◽  
David C. Samuels ◽  
Brian Lehmann ◽  
Joshua A. Bauer ◽  
...  

Exome sequencing using next-generation sequencing technologies is a cost-efficient approach to selectively sequencing the coding regions of the human genome for the detection of disease variants. One of the lesser known yet important applications of exome sequencing data is the identification of copy number variation (CNV). Many exome CNV tools have been developed over the last few years, but their performance and accuracy have not been thoroughly evaluated. In this study, we systematically compared four popular exome CNV tools (CoNIFER, cn.MOPS, exomeCopy, and ExomeDepth) and evaluated their effectiveness against array comparative genome hybridization (array CGH) platforms. We found that exome CNV tools are capable of identifying CNVs, but compared with array CGH they can suffer from problems such as high false-positive rates, low sensitivity, and duplication bias. While exome CNV tools do serve their purpose for data mining, careful evaluation and additional validation are highly recommended. Based on these results, we recommend CoNIFER and cn.MOPS for non-paired exome CNV detection over the other two tools because of their low false-positive rates, although none of the four exome CNV tools performed at an outstanding level when compared with array CGH.
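For concreteness, here is a small sketch of one common way to score exome CNV calls against an array CGH truth set: a call counts as a true positive if it has at least 50% reciprocal overlap with a CGH interval. The 50% rule is a widely used convention and an assumption here, not necessarily this study's exact criterion.

```python
# Precision/recall of CNV calls against truth intervals via reciprocal overlap.
def reciprocal_overlap(a, b, frac=0.5):
    start, end = max(a[0], b[0]), min(a[1], b[1])
    ov = max(0, end - start)
    return ov >= frac * (a[1] - a[0]) and ov >= frac * (b[1] - b[0])

def precision_recall(calls, truth):
    tp = sum(any(reciprocal_overlap(c, t) for t in truth) for c in calls)
    found = sum(any(reciprocal_overlap(c, t) for c in calls) for t in truth)
    return tp / len(calls), found / len(truth)

calls = [(100, 500), (800, 1200), (5000, 5100)]   # hypothetical exome calls
truth = [(120, 480), (900, 1150)]                 # hypothetical CGH intervals
print(precision_recall(calls, truth))             # (precision, recall)
```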


2015 ◽  
Vol 43 (W1) ◽  
pp. W289-W294 ◽  
Author(s):  
Yuanwei Zhang ◽  
Zhenhua Yu ◽  
Rongjun Ban ◽  
Huan Zhang ◽  
Furhan Iqbal ◽  
...  

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Wenhan Chen ◽  
Yang Wu ◽  
Zhili Zheng ◽  
Ting Qi ◽  
Peter M. Visscher ◽  
...  

Abstract: Summary statistics from genome-wide association studies (GWAS) have facilitated the development of various summary-data-based methods, which typically require a reference sample for linkage disequilibrium (LD) estimation. Analyses using these methods may be biased by errors in the GWAS summary data or the LD reference, or by heterogeneity between the GWAS and LD reference. Here we propose a quality control method, DENTIST, that leverages LD among genetic variants to detect and eliminate errors in the GWAS or LD reference and heterogeneity between the two. Through simulations, we demonstrate that DENTIST substantially reduces the false-positive rate in detecting secondary signals in summary-data-based conditional and joint association analysis, especially for imputed rare variants (false-positive rate reduced from >28% to <2% in the presence of heterogeneity between GWAS and LD reference). We further show that DENTIST can improve other summary-data-based analyses such as fine-mapping analysis.
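To illustrate the core idea of LD-based quality control, consider a simplified pairwise version of the discrepancy test: under a shared underlying sample, the z-score of one variant should be predictable from a correlated variant's z-score via their LD correlation r, and a large standardized discrepancy flags a likely error or reference mismatch. This pairwise statistic is an illustrative assumption, not DENTIST's exact multi-variant formulation.

```python
# Pairwise LD-discrepancy check for GWAS summary z-scores.
from scipy.stats import chi2

def discrepancy_stat(z_i, z_j, r):
    """Compare observed z_i with its LD-based prediction r * z_j."""
    t = (z_i - r * z_j) ** 2 / (1.0 - r ** 2)  # ~ chi-square(1) under H0
    return t, chi2.sf(t, df=1)                 # statistic and p-value

print(discrepancy_stat(z_i=1.1, z_j=1.0, r=0.9))  # consistent pair
print(discrepancy_stat(z_i=5.0, z_j=1.0, r=0.9))  # likely erroneous z_i
```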


Author(s):  
Pamela Reinagel

Abstract: After an experiment has been completed and analyzed, a trend may be observed that is "not quite significant". Sometimes in this situation, researchers incrementally grow their sample size N in an effort to achieve statistical significance. This is especially tempting when samples are very costly or time-consuming to collect, such that collecting an entirely new sample larger than N (the statistically sanctioned alternative) would be prohibitive. Such post-hoc sampling, or "N-hacking", is condemned, however, because it leads to an excess of false-positive results. Here, Monte Carlo simulations are used to show why and how incremental sampling causes false positives, but also to challenge the claim that it necessarily produces alarmingly high false-positive rates. In a parameter regime representative of practice in many research fields, the simulations show that the inflation of the false-positive rate is modest and easily bounded. But the effect on the false-positive rate is only half the story. What many researchers really want to know is the effect N-hacking would have on the likelihood that a positive result is a real, replicable effect: the positive predictive value (PPV). This question has not been considered in the reproducibility literature. The answer depends on the effect size and the prior probability of an effect. Although in practice these values are not known, simulations show that for a wide range of values, the PPV of results obtained by N-hacking is in fact higher than that of non-incremented experiments of the same sample size and statistical power. This is because the increase in false positives is more than offset by the increase in true positives. Therefore, in many situations, adding a few samples to shore up a nearly significant result is in fact statistically beneficial. In conclusion, if samples are added after an initial hypothesis test, this should be disclosed, and if a p value is reported, it should be corrected. But, contrary to widespread belief, collecting additional samples to resolve a borderline p value is not invalid, and can confer previously unappreciated advantages for efficiency and positive predictive value.
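A minimal Monte Carlo sketch of the scenario described above, assuming a two-sample t-test under a true null; the group size, increment, cap, and the "not quite significant" window are all illustrative assumptions, not the paper's exact parameters:

```python
# Simulate N-hacking: if p lands just above alpha, add samples and retest
# (up to a cap), then measure the realized false positive rate.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

def n_hacked_test(n0=12, add=4, n_max=24, alpha=0.05, window=(0.05, 0.10)):
    a, b = rng.normal(size=n0), rng.normal(size=n0)  # true null: no effect
    while True:
        p = ttest_ind(a, b).pvalue
        if not (window[0] <= p < window[1]) or len(a) >= n_max:
            return p < alpha
        # "not quite significant": incrementally grow the sample
        a = np.r_[a, rng.normal(size=add)]
        b = np.r_[b, rng.normal(size=add)]

trials = 20000
fpr = np.mean([n_hacked_test() for _ in range(trials)])
print(f"false positive rate with N-hacking: {fpr:.3f} (nominal 0.05)")
```

Consistent with the abstract's claim, the realized rate exceeds the nominal 0.05 only modestly, because incremental sampling is triggered just for the small fraction of experiments landing in the borderline window.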


2020 ◽  
Author(s):  
Furkan Özden ◽  
Can Alkan ◽  
A. Ercüment Çiçek

Abstract: Accurate and efficient detection of copy number variants (CNVs) is of critical importance due to their significant association with complex genetic diseases. Although algorithms working on whole genome sequencing (WGS) data provide stable results with mostly valid statistical assumptions, copy number detection on whole exome sequencing (WES) data has mostly been a losing game, with extremely high false discovery rates. This is unfortunate, as WES data is cost-efficient, compact, and relatively ubiquitous. The bottleneck is primarily the non-contiguous nature of the targeted capture: biases in targeted genomic hybridization, GC content, targeting probes, and sample batching during sequencing. Here, we present a novel deep learning model, DECoNT, which uses matched WES and WGS data and learns to correct the copy number variations reported by any off-the-shelf WES-based germline CNV caller. We train DECoNT on the 1000 Genomes Project data and show that we can efficiently triple the duplication call precision and double the deletion call precision of state-of-the-art algorithms. We also show that the model consistently improves performance independently of (i) sequencing technology, (ii) exome capture kit, and (iii) CNV caller. Using DECoNT as a universal exome CNV call polisher has the potential to improve the reliability of germline CNV detection on WES data sets and expand its application. The code and the models are available at https://github.com/ciceklab/DECoNT.
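As a hedged sketch of the call-polishing idea, the toy model below learns to correct noisy WES-based CNV labels using labels derived from matched WGS calls on the same samples. The features, the tiny MLP, and the three-class setup (no event / deletion / duplication) are illustrative assumptions, not DECoNT's actual architecture or data.

```python
# Polish noisy WES CNV calls with a classifier trained on WGS-derived labels.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
n = 3000
depth = rng.normal(1.0, 0.25, (n, 16))     # depth ratios over 16 bins
wgs_label = rng.integers(0, 3, n)          # 0 none, 1 deletion, 2 duplication
depth[wgs_label == 1] -= 0.4               # deletions lower the depth
depth[wgs_label == 2] += 0.4               # duplications raise it
wes_call = np.where(rng.random(n) < 0.6, wgs_label,
                    rng.integers(0, 3, n)) # noisy WES caller output

X = np.column_stack([depth, wes_call])     # features: depth bins + raw call
polisher = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                         random_state=0).fit(X[:2000], wgs_label[:2000])
print("polished accuracy:", polisher.score(X[2000:], wgs_label[2000:]),
      "raw caller accuracy:", np.mean(wes_call[2000:] == wgs_label[2000:]))
```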


2020 ◽  
Vol 2020 (1) ◽  
pp. 235-255 ◽  
Author(s):  
Tobias Pulls ◽  
Rasmus Dahlberg

Abstract: Website Fingerprinting (WF) attacks are a subset of traffic analysis attacks in which a local passive attacker attempts to infer which websites a target victim is visiting over an encrypted tunnel, such as the anonymity network Tor. We introduce the security notion of a Website Oracle (WO), which gives a WF attacker the capability to determine whether a particular monitored website was among the websites visited by Tor clients at the time of a victim's trace. Our simulations show that combining a WO with a WF attack, which we refer to as a WF+WO attack, significantly reduces false positives for about half of all website visits and for the vast majority of websites visited over Tor. The measured false-positive rate is on the order of one false positive per million classified website traces for websites around Alexa rank 10,000. Less popular monitored websites show orders of magnitude lower false-positive rates. We argue that WOs are inherent to the setting of anonymity networks and should be an assumed capability of attackers when assessing WF attacks and defenses. Sources of WOs are abundant and available to a wide range of realistic attackers, e.g., due to the use of DNS, OCSP, and real-time bidding for online advertisement on the Internet, as well as the abundance of middleboxes and access logs. Access to a WO indicates that the evaluation of WF defenses in the open world should focus on the highest possible recall an attacker can achieve. Our simulations show that augmenting the Deep Fingerprinting WF attack by Sirinam et al. [60] with access to a WO significantly improves the attack against five state-of-the-art WF defenses, rendering some of them largely ineffective in this new WF+WO setting.
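A back-of-the-envelope calculation shows why an oracle suppresses false positives: a WF false match only survives if the oracle also reports the monitored site as visited in the same window, so the two error rates multiply. The rates below are illustrative assumptions, not measurements from the paper.

```python
# Composing WF classifier errors with Website Oracle hit rates.
wf_fpr = 1e-3   # assumed WF classifier false positive rate
wo_hit = 1e-3   # assumed chance an unpopular monitored site shows up in the
                # oracle's window when the victim did NOT visit it

fp_wf_only = wf_fpr
fp_wf_wo = wf_fpr * wo_hit  # both must fire to yield a false positive
print(f"WF alone: {fp_wf_only:.0e}  WF+WO: {fp_wf_wo:.0e} false positives")
```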


2018 ◽  
Author(s):  
Bas Tolhuis ◽  
Hans Karten

Abstract: DNA copy number variations (CNVs) are an important source of genetic diversity and pathogenic variants. Next-generation sequencing (NGS) methods have become increasingly popular for CNV detection, but the associated data analysis is a growing bottleneck. Genalice CNV is a novel tool for the detection of CNVs that addresses the turnaround-time, scalability, and cost issues associated with NGS computational analysis. Here, we validate Genalice CNV with MLPA-verified exon CNVs and genes with normal copy numbers. Genalice CNV detects 61 out of 62 exon CNVs, and its false-positive rate is less than 1%. It analyzes 96 samples from a targeted NGS assay in less than 45 minutes, including read alignment and CNV detection, using a single node. Furthermore, we describe data quality measures to minimize false discoveries. In conclusion, Genalice CNV is highly sensitive and specific, as well as extremely fast, which will be beneficial for the clinical detection of CNVs.
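A quick check of the reported sensitivity figure (the <1% false-positive rate is quoted directly from the abstract; nothing else is assumed):

```python
# Sensitivity implied by detecting 61 of 62 MLPA-verified exon CNVs.
detected, total = 61, 62
print(f"sensitivity = {detected}/{total} = {detected / total:.1%}")  # ~98.4%
```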

