Evaluation of tools for identifying large copy number variations from ultra-low-coverage whole-genome sequencing data

Abstract: Background: Detection of copy number variations (CNVs) from high-throughput next-generation whole-genome sequencing (WGS) data has become a widely used research method in recent years. However, little is known about the applicability of the developed algorithms to ultra-low-coverage (0.0005–0.8×) data, which are used in various research and clinical applications such as digital karyotyping and single-cell CNV detection. Results: Here, the performance of six popular read-depth-based CNV detection algorithms (BIC-seq2, Canvas, CNVnator, FREEC, HMMcopy, and QDNAseq) was studied using ultra-low-coverage WGS data. Real-world array-based and karyotyping kit-based validations were used as benchmarks in the evaluation. Additionally, ultra-low-coverage WGS data were simulated to investigate the ability of the algorithms to identify CNVs in the sex chromosomes and to determine the theoretical minimum coverage at which these tools can function accurately. Our results suggest that while all the methods were able to detect large CNVs, many were prone to producing false positives when detecting smaller CNVs (< 2 Mbp). There was also significant variability in their ability to identify CNVs in the sex chromosomes. Overall, BIC-seq2 was found to be the best method in terms of statistical performance. However, its significant drawback was its runtime, by far the slowest among the methods (> 3 h), compared with FREEC (~ 3 min), which we considered the second-best method. Conclusions: Our comparative analysis demonstrates that CNV detection from ultra-low-coverage WGS data can be highly accurate for large copy number variations whose length is in the millions of base pairs. These findings facilitate applications that utilize ultra-low-coverage CNV detection.

CNVpytor: a tool for CNV/CNA detection and analysis from read depth and allele imbalance in whole genome sequencing

10.1101/2021.01.27.428472 ◽ 2021

Author(s): Milovan Suvakov ◽ Arijit Panda ◽ Colin Diesh ◽ Ian Holmes ◽ Alexej Abyzov

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Copy Number ◽ Read Depth ◽ Copy Number Variations ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Modular Architecture ◽ Small Indels

Abstract: Detecting copy number variations (CNVs) and copy number alterations (CNAs) based on whole genome sequencing data is important for personalized genomics and treatment. CNVnator is one of the most popular tools for CNV/CNA discovery and analysis based on read depth (RD). Herein, we present CNVpytor, an extension of CNVnator developed in Python. CNVpytor inherits the reimplemented core engine of its predecessor and extends visualization, modularization, performance, and functionality. Additionally, CNVpytor uses B-allele frequency (BAF) likelihood information from single nucleotide polymorphism and small indel data as additional evidence for CNVs/CNAs and as primary information for copy-number-neutral losses of heterozygosity. CNVpytor is significantly faster than CNVnator, particularly for parsing alignment files (2 to 20 times faster), and its intermediate files are 20 to 50 times smaller. CNV calls can be filtered using several criteria and annotated. Its modular architecture allows it to be used in shared and cloud environments such as Google Colab and Jupyter notebooks. Data can be exported into JBrowse, while a lightweight plugin version of CNVpytor for JBrowse enables nearly instant and GUI-assisted analysis of CNVs by any user. The CNVpytor release and source code are available on GitHub at https://github.com/abyzovlab/CNVpytor under the MIT license.

CNVpytor: a tool for copy number variation detection and analysis from read depth and allele imbalance in whole-genome sequencing

GigaScience ◽ 10.1093/gigascience/giab074 ◽ 2021 ◽ Vol 10 (11) ◽ Cited By ~ 1

Author(s): Milovan Suvakov ◽ Arijit Panda ◽ Colin Diesh ◽ Ian Holmes ◽ Alexej Abyzov

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Copy Number ◽ Read Depth ◽ Copy Number Variations ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Nucleotide Polymorphisms ◽ Sequencing Data ◽ Modular Architecture

Abstract: Background: Detecting copy number variations (CNVs) and copy number alterations (CNAs) based on whole-genome sequencing data is important for personalized genomics and treatment. CNVnator is one of the most popular tools for CNV/CNA discovery and analysis based on read depth. Findings: Herein, we present CNVpytor, an extension of CNVnator developed in Python. CNVpytor inherits the reimplemented core engine of its predecessor and extends visualization, modularization, performance, and functionality. Additionally, CNVpytor uses B-allele frequency likelihood information from single-nucleotide polymorphisms and small indel data as additional evidence for CNVs/CNAs and as primary information for copy-number-neutral losses of heterozygosity. Conclusions: CNVpytor is significantly faster than CNVnator, particularly for parsing alignment files (2–20 times faster), and its intermediate files are 20–50 times smaller. CNV calls can be filtered using several criteria, annotated, and merged over multiple samples. Its modular architecture allows it to be used in shared and cloud environments such as Google Colab and Jupyter notebooks. Data can be exported into JBrowse, while a lightweight plugin version of CNVpytor for JBrowse enables nearly instant and GUI-assisted analysis of CNVs by any user. The CNVpytor release and source code are available on GitHub at https://github.com/abyzovlab/CNVpytor under the MIT license.
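
To illustrate why B-allele frequency adds evidence beyond read depth, the following is a minimal sketch and not CNVpytor's actual likelihood function. It assumes a simple binomial read-sampling model: in a region with total copy number n, a site carrying b copies of the B allele has an expected BAF of b/n. The function names `expected_bafs` and `baf_likelihood` are hypothetical.

```python
from math import comb


def expected_bafs(copy_number):
    """Expected B-allele fractions b/n for b = 0..n copies of the B allele."""
    return [b / copy_number for b in range(copy_number + 1)]


def baf_likelihood(alt_reads, total_reads, baf):
    """Binomial probability of observing alt_reads B-allele reads (assumed model)."""
    return (
        comb(total_reads, alt_reads)
        * baf ** alt_reads
        * (1 - baf) ** (total_reads - alt_reads)
    )


# A site with 10 B-allele reads out of 30: compare a balanced diploid state
# (BAF 1/2) with a three-copy state carrying one B allele (BAF 1/3).
diploid = baf_likelihood(10, 30, 1 / 2)
three_copy = baf_likelihood(10, 30, 1 / 3)
print(three_copy > diploid)
```

Under this toy model, a copy-number-neutral loss of heterozygosity leaves read depth unchanged but removes the 0.5 state from the observed BAF values, which is why allele imbalance can serve as the primary signal there.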

Copy Number Variant Detection with Low-Coverage Whole-Genome Sequencing Represents a Viable Alternative to the Conventional Array-CGH

Diagnostics ◽ 10.3390/diagnostics11040708 ◽ 2021 ◽ Vol 11 (4) ◽ pp. 708

Author(s): Marcel Kucharík ◽ Jaroslav Budiš ◽ Michaela Hýblová ◽ Gabriel Minárik ◽ Tomáš Szemes

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ In Silico ◽ Copy Number ◽ Normal Population ◽ Copy Number Variations ◽ Whole Genome ◽ Real Patient ◽ Low Coverage ◽ CNV Detection

Copy number variations (CNVs) represent a type of structural variant involving alterations in the number of copies of specific regions of DNA, which can be either deleted or duplicated. CNVs contribute substantially to normal population variability; however, abnormal CNVs cause numerous genetic disorders. At present, several methods for CNV detection are applied, ranging from conventional cytogenetic analysis, through microarray-based methods (aCGH), to next-generation sequencing (NGS). In this paper, we present GenomeScreen, an NGS-based CNV detection method for low-coverage whole-genome sequencing. We determined the theoretical limits of its accuracy and obtained confirmation in an extensive in silico study and in real patient samples with known genotypes. In theory, at least 6 M uniquely mapped reads are required to detect a CNV with a length of 100 kilobases (kb) or more with high confidence (Z-score > 7). In practice, the in silico analysis required at least 8 M reads to obtain >99% accuracy (for 100 kb deviations). We compared GenomeScreen with one of the aCGH methods currently used in diagnostic laboratories, which has a mean resolution of 200 kb. GenomeScreen and aCGH both detected 59 deviations, while GenomeScreen additionally detected 134 other (usually smaller) variations. When compared to aCGH, the overall performance of the proposed GenomeScreen tool is comparable or superior in terms of accuracy, turn-around time, and cost-effectiveness, thus providing reasonable benefits, particularly in a prenatal diagnosis setting.
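
The relationship between read count, CNV length, and Z-score can be approximated with a back-of-the-envelope calculation. This is a sketch under stated assumptions, not the authors' derivation: reads are taken to fall uniformly on a genome of about 3.0 Gbp (an assumed round figure), counts in a region are treated as Poisson, and the CNV is a one-copy gain or loss in a diploid genome, so the expected count shifts by 50%. The names `cnv_z_score` and `min_reads_for_z` are hypothetical.

```python
from math import ceil, sqrt

GENOME_LENGTH = 3.0e9  # assumed approximate human genome length in bp


def cnv_z_score(n_reads, cnv_length, copy_change=0.5, genome_length=GENOME_LENGTH):
    """Z-score of the read-count shift in a region, assuming Poisson counts."""
    expected = n_reads * cnv_length / genome_length  # reads expected at 2 copies
    return copy_change * expected / sqrt(expected)   # shift divided by Poisson SD


def min_reads_for_z(z, cnv_length, copy_change=0.5, genome_length=GENOME_LENGTH):
    """Smallest read count whose expected Z-score reaches z (same assumptions)."""
    return ceil((z / copy_change) ** 2 * genome_length / cnv_length)


print(round(cnv_z_score(6_000_000, 100_000), 2))
print(min_reads_for_z(7, 100_000))
```

With these assumptions, roughly 6 M reads yield a Z-score of about 7 for a 100 kb one-copy change, which is in line with the theoretical figure quoted in the abstract. Real data carry additional noise (GC bias, mappability), which may explain why the in silico analysis needed more reads.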

Batch effects in population genomic studies with low‐coverage whole genome sequencing data: causes, detection, and mitigation

Molecular Ecology Resources ◽ 10.1111/1755-0998.13559 ◽ 2021

Author(s): Runyang Nicolas Lou ◽ Nina Overgaard Therkildsen

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Batch Effects ◽ Sequencing Data ◽ Population Genomic ◽ Genomic Studies ◽ Low Coverage

SEG - A Software Program for Finding Somatic Copy Number Alterations in Whole Genome Sequencing Data of Cancer

Computational and Structural Biotechnology Journal ◽ 10.1016/j.csbj.2018.09.001 ◽ 2018 ◽ Vol 16 ◽ pp. 335-341 ◽ Cited By ~ 2

Author(s): Mucheng Zhang ◽ Deli Liu ◽ Jie Tang ◽ Yuan Feng ◽ Tianfang Wang ◽ ...

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Copy Number ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Copy Number Alterations ◽ Sequencing Data ◽ Software Program ◽ Somatic Copy Number Alterations

dpGMM: A Dirichlet Process Gaussian Mixture Model for Copy Number Variation Detection in Low-Coverage Whole-Genome Sequencing Data

IEEE Access ◽ 10.1109/access.2020.2971863 ◽ 2020 ◽ Vol 8 ◽ pp. 27973-27985

Author(s): Yaoyao Li ◽ Junying Zhang ◽ Xiguo Yuan ◽ Junping Li

Keyword(s): Genome Sequencing ◽ Dirichlet Process ◽ Copy Number ◽ Gaussian Mixture ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Number Variation ◽ Low Coverage ◽ Copy Number Variation Detection

Application of risk score analysis to low-coverage whole genome sequencing data for the noninvasive detection of trisomy 21, trisomy 18, and trisomy 13

Prenatal Diagnosis ◽ 10.1002/pd.4712 ◽ 2015 ◽ Vol 36 (1) ◽ pp. 56-62 ◽ Cited By ~ 8

Author(s): J. A. Tynan ◽ S. K. Kim ◽ A. R. Mazloom ◽ C. Zhao ◽ G. McLennan ◽ ...

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Trisomy 21 ◽ Trisomy 13 ◽ Noninvasive Detection ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Score Analysis ◽ Low Coverage

Identification of Medium-Sized Copy Number Alterations in Whole-Genome Sequencing

Cancer Informatics ◽ 10.4137/cin.s14023 ◽ 2014 ◽ Vol 13s3 ◽ pp. CIN.S14023

Author(s): Hatice Gulcin Ozer ◽ Aisulu Usubalieva ◽ Adrienne Dorrance ◽ Ayse Selen Yilmaz ◽ Michael Caligiuri ◽ ...

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Copy Number ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Copy Number Alterations ◽ Sequencing Data ◽ Entire Genome ◽ Multiple Challenges ◽ Cost Efficient

Genome-wide discoveries, such as the detection of copy number alterations (CNAs) from high-throughput whole-genome sequencing data, have enabled new developments in personalized medicine. CNAs have been reported to be associated with various diseases and cancers, including acute myeloid leukemia. However, there are multiple challenges to the use of current CNA detection tools that lead to high false-positive rates and thus impede the widespread use of such tools in cancer research. In this paper, we discuss these issues and propose possible solutions. First, the entire genome cannot be mapped because some regions lack sequence uniqueness, and current methods cannot be appropriately adjusted to handle these regions in the analyses. Thus, detection of medium-sized CNAs is also directly affected by these mappability problems. The requirement for matching control samples is another important limitation, because acquiring matching controls might not be possible or might not be cost efficient. Here we present an approach that addresses these issues and detects medium-sized CNAs in cancer genomes by (1) masking unmappable regions during the initial CNA detection phase, (2) using a pool of a few normal samples as the control, and (3) employing median filtering to adjust CNA ratios to their surrounding coverage and eliminate false positives.
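
The three steps named above can be sketched on binned read depths. This is a minimal illustration, not the authors' implementation: it assumes depths are already normalized for library size, pools the normals by taking a per-bin median (the abstract does not say how the pool is combined), and uses a plain sliding-window median as the filter. All function names and the window size are hypothetical.

```python
from statistics import median


def pooled_control(normals):
    """Per-bin median depth across a small pool of normal samples (assumed pooling)."""
    return [median(bin_depths) for bin_depths in zip(*normals)]


def masked_ratios(tumor, control, mappable):
    """Tumor/control ratio per bin; None where the bin is masked or has no control depth."""
    return [
        t / c if ok and c > 0 else None
        for t, c, ok in zip(tumor, control, mappable)
    ]


def median_filter(ratios, half_window=1):
    """Replace each unmasked ratio with the median of unmasked ratios in its window."""
    smoothed = []
    for i, ratio in enumerate(ratios):
        if ratio is None:
            smoothed.append(None)  # masked bins stay masked
            continue
        window = ratios[max(0, i - half_window): i + half_window + 1]
        smoothed.append(median(x for x in window if x is not None))
    return smoothed


normals = [[10, 10, 10, 10, 10], [12, 12, 12, 12, 12], [11, 11, 11, 11, 11]]
tumor = [11, 11, 33, 11, 11]  # a single-bin spike, likely an artifact
control = pooled_control(normals)
ratios = masked_ratios(tumor, control, [True] * 5)
print(median_filter(ratios))
```

In this toy input, the isolated one-bin spike is smoothed back to the surrounding ratio, while a change spanning more bins than the window would survive the filter.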

A Bioinformatics Pipeline for Estimating Mitochondria DNA Copy Number and Heteroplasmy Levels from Whole Genome Sequencing Data

10.1101/2021.12.28.21268452 ◽ 2021

Author(s): Stephanie L Battle ◽ Daniela Puiu ◽ Eric Boerwinkle ◽ Kent Taylor ◽ Jerome Rotter ◽ ...

Keyword(s): Mitochondrial Genome ◽ Whole Genome Sequencing ◽ Genome Sequencing ◽ Copy Number ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ DNA Molecules ◽ Sequencing Data ◽ Bioinformatics Pipeline ◽ Accurate Identification

Mitochondrial diseases are a heterogeneous group of disorders that can be caused by mutations in the nuclear or mitochondrial genome. Mitochondrial DNA variants may exist in a state of heteroplasmy, where a percentage of DNA molecules harbor a variant, or homoplasmy, where all DNA molecules have a variant. The relative quantity of mtDNA in a cell, or copy number (mtDNA-CN), is associated with mitochondrial function, human disease, and mortality. To facilitate accurate identification of heteroplasmy and to quantify mtDNA-CN, we built a bioinformatics pipeline that takes whole genome sequencing data and outputs mitochondrial variants and mtDNA-CN. We incorporate variant annotations to facilitate determination of variant significance. Our pipeline yields uniform coverage by remapping to a circularized chrM and recovering reads falsely mapped to nuclear-encoded mitochondrial sequences. Notably, we construct a consensus chrM sequence for each sample and recall heteroplasmy against the sample's unique mitochondrial genome. We observe an approximately 3-fold increased association with age for heteroplasmic variants in non-homopolymer regions and are better able to capture genetic variation in the D-loop of chrM compared to existing software. Our bioinformatics pipeline captures features of mitochondrial genetics that are important in understanding how mitochondrial dysfunction contributes to disease more accurately than existing pipelines.
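
The two quantities discussed here can be illustrated with a short sketch. The abstract does not give the pipeline's formulas, so this uses a commonly used coverage-ratio estimator as an assumption: mtDNA-CN is taken as the nuclear ploidy times the ratio of mean chrM coverage to mean autosomal coverage, and heteroplasmy as the fraction of reads supporting the variant at a site. The function names are hypothetical.

```python
def mtdna_copy_number(mito_coverage, autosomal_coverage, ploidy=2):
    """Estimate mtDNA copies per cell from mean coverages (assumed estimator)."""
    if autosomal_coverage <= 0:
        raise ValueError("autosomal coverage must be positive")
    return ploidy * mito_coverage / autosomal_coverage


def heteroplasmy_fraction(alt_reads, total_reads):
    """Fraction of reads at a chrM site that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total read count must be positive")
    return alt_reads / total_reads


# Example: 3000x mean coverage on chrM against 30x on the autosomes.
print(mtdna_copy_number(3000, 30))
# Example: 150 of 3000 reads carry the variant, i.e. 5% heteroplasmy.
print(heteroplasmy_fraction(150, 3000))
```

A homoplasmic variant would give a fraction close to 1.0 under this definition, subject to sequencing error.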

Comprehensive analysis of chromothripsis in 2,658 human cancers using whole-genome sequencing

10.1101/333617 ◽ 2018 ◽ Cited By ~ 11

Author(s): Isidro Cortés-Ciriano ◽ June-Koo Lee ◽ Ruibin Xi ◽ Dhawal Jain ◽ Youngsook L. Jung ◽ ...

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Copy Number ◽ Human Cancer ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ End Joining ◽ Cancer Types ◽ Non Homologous End Joining

Summary: Chromothripsis is a newly discovered mutational phenomenon involving massive, clustered genomic rearrangements that occurs in cancer and other diseases. Recent studies in cancer suggest that chromothripsis may be far more common than initially inferred from low-resolution DNA copy number data. Here, we analyze the patterns of chromothripsis across 2,658 tumors spanning 39 cancer types using whole-genome sequencing data. We find that chromothripsis events are pervasive across cancers, with a frequency of >50% in several cancer types. Whereas canonical chromothripsis profiles display oscillations between two copy number states, a considerable fraction of the events involve multiple chromosomes as well as additional structural alterations. In addition to non-homologous end-joining, we detect signatures of replicative processes and templated insertions. Chromothripsis contributes to oncogene amplification as well as to the inactivation of genes such as mismatch-repair-related genes. These findings show that chromothripsis is a major process driving genome evolution in human cancer.
