Comprehensive characterization of copy number variation (CNV) called from array, long- and short-read data

BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ksenia Lavrichenko ◽  
Stefan Johansson ◽  
Inge Jonassen

Abstract Background: SNP arrays and short- and long-read genome sequencing are genome-wide high-throughput technologies that may be used to assay copy number variants (CNVs) in a personal genome. Each of these technologies comes with its own limitations and biases, many of which are well known, but not all of them are thoroughly quantified. Results: We assembled an ensemble of public datasets of published CNV calls and raw data for the well-studied Genome in a Bottle individual NA12878. This assembly represents a variety of methods and pipelines used for CNV calling from array, short- and long-read technologies. We then performed cross-technology comparisons regarding their ability to call CNVs. Unlike other studies, we refrained from using a gold standard. Instead, we attempted to validate the CNV calls against the raw data of each technology. Conclusions: Our study confirms that long-read platforms enable recalling CNVs in genomic regions inaccessible to arrays or short reads. We also found that the reproducibility of a CNV by different pipelines within each technology is strongly linked to other CNV evidence measures. Importantly, the three technologies show distinct public database frequency profiles, which differ depending on what technology the database was built on.

BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Joseph T. Glessner ◽  
Xiao Chang ◽  
Yichuan Liu ◽  
Jin Li ◽  
Munir Khan ◽  
...  

Abstract Background: Not all cells in a given individual are identical in their genomic makeup. Mosaicism describes the phenomenon in which a mixture of genotypic states in certain genomic segments exists within the same individual. Mosaicism is a prevalent and impactful class of non-integer state copy number variation (CNV). Mosaicism implies that certain cell types or subsets of cells contain a CNV in a segment of the genome while other cells in the same individual do not. Several studies have investigated the impact of mosaicism in single patients or small cohorts, but no comprehensive scan of mosaic CNVs has been undertaken to accurately detect such variants and interpret their impact on human health and disease. Results: We developed a tool called Montage to improve the accuracy of detection of mosaic copy number variants in a high-throughput fashion. Montage directly interfaces with the ParseCNV2 algorithm to establish disease phenotype genome-wide association and determine which genomic ranges had a higher or lower than expected frequency of mosaic events. We screened for mosaic events in over 350,000 samples using 1% allele frequency as the detection limit. We also uncovered disease associations of multiple phenotypes with mosaic CNVs at several genomic loci. We further investigated the allele imbalance observations genome-wide to define non-diploid and non-integer copy number states. Conclusions: Our novel algorithm presents an efficient tool with fast computational runtime and high levels of accuracy of mosaic CNV detection. A curated mosaic CNV callset of 3716 events in 2269 samples is presented with comparability to previous reports and disease phenotype associations. The new algorithm can be freely accessed via: https://github.com/CAG-CNV/MONTAGE.


2013 ◽  
Vol 45 (1) ◽  
pp. 1-16 ◽  
Author(s):  
Wenli Li ◽  
Michael Olivier

Copy number variation (CNV), generated through duplication or deletion events that affect one or more loci, is widespread in human genomes and is often associated with functional consequences that may include changes in gene expression levels or fusion of genes. Genome-wide association studies indicate that some disease phenotypes and physiological pathways might be impacted by CNV in a small number of characterized genomic regions. However, the pervasiveness and full impact of such variation remain unclear. Suitable analytic methods are needed to thoroughly mine human genomes for genomic structural variation, and to explore the interplay between observed CNV and disease phenotypes, but many medical researchers are unfamiliar with the features and nuances of recently developed technologies for detecting CNV. In this article, we evaluate a suite of commonly used and recently developed approaches to uncovering genome-wide CNVs and discuss the relative merits of each.


Author(s):  
Justin Wagner ◽  
Nathan D Olson ◽  
Lindsay Harris ◽  
Ziad Khan ◽  
Jesse Farek ◽  
...  

Abstract Genome in a Bottle (GIAB) benchmarks have been widely used to help validate clinical sequencing pipelines and develop new variant calling and sequencing methods. Here we use accurate long and linked reads to expand the prior benchmark to include difficult-to-map regions and segmental duplications that are not readily accessible to short reads. Our new benchmark adds more than 300,000 SNVs, 50,000 indels, and 16% new exonic variants, many in challenging, clinically relevant genes not previously covered (e.g., PMS2). We increase coverage of the autosomal GRCh38 assembly from 85% to 92%, while excluding problematic regions for benchmarking small variants (e.g., copy number variants and assembly errors) that should not have been in the previous version. Our new benchmark reliably identifies both false positives and false negatives across multiple short-, linked-, and long-read based variant calling methods. As an example of its utility, this benchmark identifies eight times more false negatives in a short read variant call set relative to our previous benchmark, mostly in difficult-to-map regions. To enable robust small variant benchmarking, we still exclude 3.6% of GRCh37 and 5.0% of GRCh38 in (1) highly repetitive regions such as large, highly similar segmental duplications and the centromere not accessible to our data and (2) regions where our sample is highly divergent from the reference due to large indels, structural variation, copy number variation, and/or errors in the reference (e.g., some KIR genes that have duplications in HG002). We have demonstrated the utility of this benchmark to assess performance in more challenging regions, which enables benchmarking in more difficult genes and continued technology and bioinformatics development. The v4.2.1 benchmarks are available under ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/.


2021 ◽  
Vol 15 (1) ◽  
Author(s):  
Nadia Dehghani ◽  
Gamze Guven ◽  
Celia Kun-Rodrigues ◽  
Catarina Gouveia ◽  
Kalina Foster ◽  
...  

Abstract Background: Copy number variants (CNVs) include deletions or multiplications spanning genomic regions. These regions vary in size and may span genes known to play a role in human diseases. As examples, duplications and triplications of SNCA have been shown to cause forms of Parkinson’s disease, while duplications of APP cause early onset Alzheimer’s disease (AD). Results: Here, we performed a systematic analysis of CNVs in a Turkish dementia cohort in order to further characterize the genetic causes of dementia in this population. One hundred twenty-four Turkish individuals, either at risk of dementia due to family history or diagnosed with mild cognitive impairment, AD, or frontotemporal dementia, were whole-genome genotyped and CNVs were detected. We integrated family analysis with a comprehensive assessment of potentially disease-associated CNVs in this Turkish dementia cohort. We also utilized both dementia and non-dementia individuals from the UK Biobank in order to further elucidate the potential role of the identified CNVs in neurodegenerative diseases. We report CNVs overlapping the previously implicated genes ZNF804A, SNORA70B, USP34, XPO1, and a locus on chromosome 9 which includes a cluster of olfactory receptors and ABCA1. We also describe novel CNVs potentially associated with dementia, overlapping the genes AFG1L, SNX3, VWDE, and BC039545. Conclusions: Genotyping data from understudied populations can be utilized to identify copy number variation which may contribute to dementia.


2017 ◽  
Author(s):  
Ruth B. McCole ◽  
Wren Saylor ◽  
Claire Redin ◽  
Chamith Y. Fonseka ◽  
Harrison Brand ◽  
...  

Abstract The development of the human brain and nervous system can be affected by genetic or environmental factors. Here we focus on characterizing the genetic perturbations that accompany and may contribute to neurodevelopmental phenotypes. Specifically, we examine two types of structural variants, namely, copy number variation and balanced chromosome rearrangements, discovered in subjects with neurodevelopmental disorders and related phenotypes. We find that a feature uniting these types of genetic aberrations is a proximity to ultraconserved elements (UCEs), which are sequences that are perfectly conserved between the reference genomes of distantly related species. In particular, while UCEs are generally depleted from copy number variant regions in healthy individuals, they are, on the whole, enriched in genomic regions disrupted by copy number variants or breakpoints of balanced rearrangements in affected individuals. Additionally, while genes associated with neurodevelopmental disorders are enriched in UCEs, this does not account for the excess of UCEs either in copy number variants or close to the breakpoints of balanced rearrangements in affected individuals. Indeed, our data are consistent with some manifestations of neurodevelopmental disorders resulting from a disruption of genome integrity in the vicinity of UCEs.


2016 ◽  
Vol 22 (3) ◽  
pp. 505-515 ◽  
Author(s):  
Isabelle Cleynen ◽  
Peter Konings ◽  
Caroline Robberecht ◽  
Debby Laukens ◽  
Leila Amininejad ◽  
...  

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
PingHsun Hsieh ◽  
Vy Dang ◽  
Mitchell R. Vollger ◽  
Yafei Mao ◽  
Tzu-Hsueh Huang ◽  
...  

Abstract TRP channel-associated factor 1/2 (TCAF1/TCAF2) proteins antagonistically regulate the cold-sensor protein TRPM8 in multiple human tissues. Understanding their significance has been complicated given the locus spans a gap-ridden region with complex segmental duplications in GRCh38. Using long-read sequencing, we sequence-resolve the locus, annotate full-length TCAF models in primate genomes, and show substantial human-specific TCAF copy number variation. We identify two human super haplogroups, H4 and H5, and establish that TCAF duplications originated ~1.7 million years ago but diversified only in Homo sapiens by recurrent structural mutations. Conversely, in all archaic-hominin samples the fixation for a specific H4 haplotype without duplication is likely due to positive selection. Here, our results of TCAF copy number expansion, selection signals in hominins, and differential TCAF2 expression between haplogroups and high TCAF2 and TRPM8 expression in liver and prostate in modern-day humans imply TCAF diversification among hominins potentially in response to cold or dietary adaptations.


2010 ◽  
Vol 20 (12) ◽  
pp. 1719-1729 ◽  
Author(s):  
M. D. Robinson ◽  
C. Stirzaker ◽  
A. L. Statham ◽  
M. W. Coolen ◽  
J. Z. Song ◽  
...  
