CNV-PG: a machine-learning framework for accurate copy number variation predicting and genotyping

2020
Author(s):
Taifu Wang
Jinghua Sun
Xiuqing Zhang
Wen-Jing Wang
Qing Zhou

Abstract. Motivation: Copy-number variants (CNVs) are one of the major causes of genetic disorders. However, current methods for CNV calling have high false-positive rates and low concordance, and few of them can accurately genotype CNVs. Results: Here we propose CNV-PG (CNV Predicting and Genotyping), a machine-learning framework for accurately predicting and genotyping CNVs from paired-end sequencing data. CNV-PG can efficiently remove false-positive CNVs from existing CNV discovery algorithms, and integrate CNVs from multiple CNV callers into a unified call set with high genotyping accuracy. Availability: CNV-PG is available at https://github.com/wonderful1/CNV-PG

PeerJ
2021
Vol 9
pp. e12564
Author(s):
Taifu Wang
Jinghua Sun
Xiuqing Zhang
Wen-Jing Wang
Qing Zhou

Background: Copy-number variants (CNVs) have been recognized as one of the major causes of genetic disorders. Reliable detection of CNVs from genome sequencing data is in strong demand for disease research. However, current software for detecting CNVs has high false-positive rates and needs further improvement. Methods: Here, we propose a novel post-processing approach for CNV prediction (CNV-P), a machine-learning framework that can efficiently remove false-positive fragments from the results of CNV detection tools. A series of CNV signals, such as read depth (RD), split reads (SR) and read pair (RP) around the putative CNV fragments, were defined as features to train a classifier. Results: Predictions on several real biological datasets showed that our models could accurately classify CNVs at over 90% precision and 85% recall, which greatly improves on the performance of state-of-the-art algorithms. Furthermore, our results indicate that CNV-P is robust to different CNV sizes and sequencing platforms. Conclusions: Our framework for classifying high-confidence CNVs could improve both basic research and clinical diagnosis of genetic diseases.
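To make the reported precision and recall figures concrete, the following minimal sketch computes both metrics for a filtered call set. It is not the CNV-P implementation; the function name `precision_recall` and the toy labels are hypothetical and serve only to illustrate the definitions.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for two equal-length lists of booleans.

    predicted[i] is True when the classifier keeps candidate CNV i;
    truth[i] is True when candidate CNV i is a validated CNV.
    """
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: six candidate CNVs from an upstream caller.
predicted = [True, True, True, True, False, False]
truth = [True, True, True, False, True, True]
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision measures how many retained calls are real; recall measures how many real CNVs survive the filter. A post-processing classifier of this kind trades one against the other through its decision threshold.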


2018
Author(s):
Vijay Kumar Pounraja
Gopal Jayakar
Matthew Jensen
Neil Kelkar
Santhosh Girirajan

Abstract. Copy-number variants (CNVs) are a major cause of several genetic disorders, making their detection an essential component of genetic analysis pipelines. Current methods for detecting CNVs from exome sequencing data are limited by high false positive rates and low concordance due to the inherent biases of individual algorithms. To overcome these issues, calls generated by two or more algorithms are often intersected using Venn-diagram approaches to identify “high-confidence” CNVs. However, this approach is inadequate, as it misses potentially true calls that do not have consensus from multiple callers. Here, we present CN-Learn, a machine-learning framework (https://github.com/girirajanlab/CN_Learn) that integrates calls from multiple CNV detection algorithms and learns to accurately identify true CNVs using caller-specific and genomic features from a small subset of validated CNVs. Using CNVs predicted by four exome-based CNV callers (CANOES, CODEX, XHMM and CLAMMS) from 503 samples, we demonstrate that CN-Learn identifies true CNVs at higher precision (~90%) and recall (~85%) rates while maintaining robust performance even when trained with minimal data (~30 samples). CN-Learn recovers twice as many CNVs compared to individual callers or Venn diagram-based approaches, with features such as exome capture probe count, caller concordance and GC content providing the most discriminatory power. In fact, about 58% of all true CNVs recovered by CN-Learn were either singletons or calls that lacked support from at least one caller. Our study underscores the limitations of current approaches for CNV identification and provides an effective method that yields high-quality CNVs for application in clinical diagnostics.
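The Venn-diagram approach criticised in this abstract can be sketched as follows. This is an illustrative toy, not CN-Learn code: the 50% reciprocal-overlap rule, the function names and the caller names `caller_A` and `caller_B` are all assumptions made for the example.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two calls given as (chrom, start, end) tuples."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    # Fraction of the longer call that is covered, i.e. the smaller fraction.
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def supporting_callers(call, calls_by_caller, min_ro=0.5):
    """Names of the callers with at least one call matching `call`."""
    return {
        caller
        for caller, calls in calls_by_caller.items()
        if any(reciprocal_overlap(call, other) >= min_ro for other in calls)
    }


# Hypothetical call sets from two callers.
calls_by_caller = {
    "caller_A": [("chr1", 1000, 5000), ("chr2", 100, 900)],
    "caller_B": [("chr1", 1200, 5200)],
}

for call in calls_by_caller["caller_A"]:
    support = supporting_callers(call, calls_by_caller)
    status = "kept" if len(support) == len(calls_by_caller) else "dropped"
    print(call, sorted(support), status)
```

A strict intersection drops the `chr2` call because only one caller reports it, even if it is real. Treating the number of supporting callers as one feature among several, rather than as a hard filter, is the general idea the abstract describes.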


2020
Author(s):
Andre E Minoche
Ben Lundie
Greg B Peters
Thomas Ohnesorg
Mark Pinese
...  

Abstract. Whole genome sequencing (WGS) has the potential to outperform clinical microarrays for the detection of structural variants (SV) including copy number variants (CNVs), but has been challenged by high false positive rates. Here we present ClinSV, a WGS based SV integration, annotation, prioritisation and visualisation method, which identified 99.8% of pathogenic ClinVar CNVs >10kb and 11/11 pathogenic variants from matched microarrays. The false positive rate was low (1.5–4.5%) and reproducibility high (95–99%). In clinical practice, ClinSV identified reportable variants in 22 of 485 patients (4.7%) of which 35–63% were not detectable by current clinical microarray designs.


2021
Vol 13 (1)
Author(s):
Andre E. Minoche
Ben Lundie
Greg B. Peters
Thomas Ohnesorg
Mark Pinese
...  

Abstract. Whole genome sequencing (WGS) has the potential to outperform clinical microarrays for the detection of structural variants (SV) including copy number variants (CNVs), but has been challenged by high false positive rates. Here we present ClinSV, a WGS based SV integration, annotation, prioritization, and visualization framework, which identified 99.8% of simulated pathogenic ClinVar CNVs > 10 kb and 11/11 pathogenic variants from matched microarrays. The false positive rate was low (1.5–4.5%) and reproducibility high (95–99%). In clinical practice, ClinSV identified reportable variants in 22 of 485 patients (4.7%) of which 35–63% were not detectable by current clinical microarray designs. ClinSV is available at https://github.com/KCCG/ClinSV.


2019
Vol 105 (4)
pp. 384-389
Author(s):
Adam Jackson
Heather Ward
Rebecca Louise Bromley
Charulata Deshpande
Pradeep Vasudevan
...  

Introduction: Fetal anticonvulsant syndrome (FACS) describes the pattern of physical and developmental problems seen in those children exposed to certain antiepileptic drugs (AEDs) in utero. The diagnosis of FACS is a clinical one and so excluding alternative diagnoses such as genetic disorders is essential. Methods: We reviewed the pathogenicity of reported variants identified on exome sequencing in the Deciphering Developmental Disorders (DDD) Study in 42 children exposed to AEDs in utero, but where a diagnosis other than FACS was suspected. In addition, we analysed chromosome microarray data from 10 patients with FACS seen in a Regional Genetics Service. Results: Seven children (17%) from the DDD Study had a copy number variant or pathogenic variant in a developmental disorder gene which was considered to explain or partially explain their phenotype. Across the AED exposure types, variants were found in 2/15 (13%) valproate exposed cases and 3/14 (21%) carbamazepine exposed cases. No pathogenic copy number variants were identified in our local sample (n=10). Conclusions: This study is the first of its kind to analyse the exomes of children with developmental disorders who were exposed to AEDs in utero. Though we acknowledge that the results are subject to bias, a significant number of children were identified with alternate diagnoses which had an impact on counselling and management. We suggest that consideration is given to performing whole exome sequencing as part of the diagnostic work-up for children exposed to AEDs in utero.


2016
Vol 36 (6)
pp. 584-586
Author(s):
Cheryl A. Mather
Zhongxia Qi
Arun P. Wiita

2021
Vol 12
Author(s):
Jinghang Zhou
Liyuan Liu
Thomas J. Lopdell
Dorian J. Garrick
Yuangang Shi

Detection of CNVs (copy number variants) and ROH (runs of homozygosity) from SNP (single nucleotide polymorphism) genotyping data is often required in genomic studies. The post-analysis of CNV and ROH generally involves many steps, potentially across multiple computing platforms, which requires the researchers to be familiar with many different tools. In order to get around this problem and improve research efficiency, we present an R package that integrates the summarization, annotation, map conversion, comparison and visualization functions involved in studies of CNV and ROH. This one-stop post-analysis system is standardized, comprehensive, reproducible, timesaving, and user-friendly for researchers in humans and most diploid livestock species.


2020
Author(s):
Christopher W. Whelan
Robert E. Handsaker
Giulio Genovese
Seva Kashin
Monkol Lek
...  

Abstract. Two intriguing forms of genome structural variation (SV) – dispersed duplications, and de novo rearrangements of complex, multi-allelic loci – have long escaped genomic analysis. We describe a new way to find and characterize such variation by utilizing identity-by-descent (IBD) relationships between siblings together with high-precision measurements of segmental copy number. Analyzing whole-genome sequence data from 706 families, we find hundreds of “IBD-discordant” (IBDD) CNVs: loci at which siblings’ CNV measurements and IBD states are mathematically inconsistent. We found that commonly-IBDD CNVs identify dispersed duplications; we mapped 95 of these common dispersed duplications to their true genomic locations through family-based linkage and population linkage disequilibrium (LD), and found several to be in strong LD with genome-wide association (GWAS) signals for common diseases or gene expression variation at their revealed genomic locations. Other CNVs that were IBDD in a single family appear to involve de novo mutations in complex and multi-allelic loci; we identified 26 de novo structural mutations that had not been previously detected in earlier analyses of the same families by diverse SV analysis methods. These included a de novo mutation of the amylase gene locus and multiple de novo mutations at chromosome 15q14. Combining these complex mutations with more-conventional CNVs, we estimate that segmental mutations larger than 1kb arise in about one per 22 human meioses. These methods are complementary to previous techniques in that they interrogate genomic regions that are home to segmental duplication, high CNV allele frequencies, and multi-allelic CNVs.
Author Summary. Copy number variation is an important form of genetic variation in which individuals differ in the number of copies of segments of their genomes. Certain aspects of copy number variation have traditionally been difficult to study using short-read sequencing data. For example, standard analyses often cannot tell whether the duplicated copies of a segment are located near the original copy or are dispersed to other regions of the genome. Another aspect of copy number variation that has been difficult to study is the detection of mutations in the copy number of DNA segments passed down from parents to their children, particularly when the mutations affect genome segments which already display common copy number variation in the population. We develop an analytical approach to solving these problems when sequencing data is available for all members of families with at least two children. This method is based on determining the number of parental haplotypes the two siblings share at each location in their genome, and using that information to determine the possible inheritance patterns that might explain the copy numbers we observe in each family member. We show that dispersed duplications and mutations can be identified by looking for copy number variants that do not follow these expected inheritance patterns. We use this approach to determine the location of 95 common duplications which are dispersed to distant regions of the genome, and demonstrate that these duplications are linked to genetic variants that affect disease risk or gene expression levels. We also identify a set of copy number mutations not detected by previous analyses of sequencing data from a large cohort of families, and show that repetitive and complex regions of the genome undergo frequent mutations in copy number.
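The inheritance-pattern check described in the author summary can be illustrated with a deliberately simplified sketch. It is not the authors' method: it assumes exact integer copy numbers for both parents and both siblings, a known IBD state (0, 1 or 2 shared parental haplotypes) at the locus, and no measurement noise. The function name `is_ibd_consistent` is hypothetical.

```python
from itertools import product


def is_ibd_consistent(father_cn, mother_cn, sib1_cn, sib2_cn, ibd_state):
    """True if some inheritance pattern explains the observed copy numbers.

    Each parent's total copy number is split in every possible way across
    two haplotypes. Each sibling inherits one haplotype per parent, and the
    siblings must share exactly `ibd_state` parental haplotypes.
    """
    for f_first in range(father_cn + 1):
        f_haps = (f_first, father_cn - f_first)
        for m_first in range(mother_cn + 1):
            m_haps = (m_first, mother_cn - m_first)
            # i, j: haplotypes inherited by sibling 1; k, l: by sibling 2.
            for i, j, k, l in product((0, 1), repeat=4):
                if (i == k) + (j == l) != ibd_state:
                    continue
                if (f_haps[i] + m_haps[j] == sib1_cn
                        and f_haps[k] + m_haps[l] == sib2_cn):
                    return True
    return False


# Siblings sharing both parental haplotypes (IBD2) must have equal copy
# numbers at the locus, so 2 versus 3 copies is IBD-discordant.
print(is_ibd_consistent(2, 2, 2, 3, ibd_state=2))  # False
# With a 3-copy father and IBD1, 2 versus 3 copies has a valid explanation.
print(is_ibd_consistent(3, 2, 2, 3, ibd_state=1))  # True
```

Loci for which this function returns `False` under the toy model correspond to the "IBD-discordant" CNVs of the abstract: the observed copy numbers cannot be explained by ordinary inheritance at that position, which points to a dispersed duplication or a de novo mutation.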

