Identifying tumor clones in sparse single-cell mutation data

Matthew A Myers; Simone Zaccaria; Benjamin J Raphael

doi:10.1093/bioinformatics/btaa449

Bioinformatics ◽

10.1093/bioinformatics/btaa449 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i186-i193

Author(s):

Matthew A Myers ◽

Simone Zaccaria ◽

Benjamin J Raphael

Keyword(s):

Single Cell ◽

Genome Sequencing ◽

Whole Genome ◽

Sequencing Data ◽

Single Nucleotide ◽

Sequencing Coverage ◽

Sequencing Technologies ◽

Low Coverage ◽

Clonal Composition ◽

Cancer Studies

Abstract. Motivation: Recent single-cell DNA sequencing technologies enable whole-genome sequencing of hundreds to thousands of individual cells. However, these technologies have ultra-low sequencing coverage (<0.5× per cell), which has limited their use to the analysis of large copy-number aberrations (CNAs) in individual cells. While CNAs are useful markers in cancer studies, single-nucleotide mutations are equally important, both in cancer studies and in other applications. However, ultra-low coverage sequencing yields single-nucleotide mutation data that are too sparse for current single-cell analysis methods. Results: We introduce SBMClone, a method to infer clusters of cells, or clones, that share groups of somatic single-nucleotide mutations. SBMClone uses a stochastic block model to overcome sparsity in ultra-low coverage single-cell sequencing data, and we show that SBMClone accurately infers the true clonal composition on simulated datasets with coverage as low as 0.2×. We applied SBMClone to single-cell whole-genome sequencing data from two breast cancer patients obtained using two different sequencing technologies. On the first patient, sequenced using the 10X Genomics CNV solution with sequencing coverage ≈0.03×, SBMClone recovers the major clonal composition when incorporating a small amount of additional information. On the second patient, where pre- and post-treatment tumor samples were sequenced using DOP-PCR with sequencing coverage ≈0.5×, SBMClone shows that tumor cells are present in the post-treatment sample, contrary to published analysis of this dataset. Availability and implementation: SBMClone is available on the GitHub repository https://github.com/raphael-group/SBMClone. Supplementary information: Supplementary data are available at Bioinformatics online.
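
To give a rough sense of why coverage this low yields sparse mutation data, the following is a minimal sketch. It assumes that read depth at a locus follows a Poisson distribution with mean equal to the per-cell coverage; this is a common simplification and is not stated in the abstract. The function name `fraction_covered` is hypothetical and is not part of SBMClone.

```python
import math


def fraction_covered(coverage: float) -> float:
    """Expected fraction of loci with at least one read in a single cell.

    Assumption (not from the abstract): per-locus read depth is Poisson
    with mean `coverage`, so P(depth >= 1) = 1 - exp(-coverage).
    """
    return 1.0 - math.exp(-coverage)


# Coverage values mentioned in the abstract above.
for c in (0.03, 0.2, 0.5):
    print(f"{c:.2f}x coverage -> fraction of loci covered: {fraction_covered(c):.3f}")
```

Under this assumption, roughly 3% of loci in a given cell have any read at 0.03× coverage, so most entries of a cell-by-mutation matrix would be unobserved.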

SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing

10.1101/2021.11.08.467510 ◽

2021 ◽

Author(s):

Hana Rozhoñová ◽

Daniel Danciu ◽

Stefan Stark ◽

Gunnar Rätsch ◽

André Kahles ◽

...

Keyword(s):

Dna Sequencing ◽

Single Cell ◽

Variant Calling ◽

Bayesian Filtering ◽

Sequencing Data ◽

Single Nucleotide ◽

Sequencing Technologies ◽

The Cost ◽

Low Coverage ◽

Clonal Composition

Recently developed single-cell DNA sequencing technologies enable whole-genome, amplification-free sequencing of thousands of cells at the cost of ultra-low coverage of the sequenced data (<0.05x per cell), which mostly limits their usage to the identification of copy number alterations (CNAs) in multi-megabase segments. Aside from CNA-based subclone detection, single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Here we present Single Cell Data Tumor Clusterer (SECEDO, lat. 'to separate'), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. The core aspects of the method are an efficient Bayesian filtering of relevant loci and the exploitation of read overlaps and phasing information. We applied SECEDO to a synthetic dataset simulating 7,250 cells and eight tumor subclones from a single patient, and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to four real single-cell sequencing datasets from a breast cancer patient, SECEDO was able to recover the major clonal composition in each dataset at the original sequencing depth of 0.03x per cell, an 8-fold improvement relative to the state of the art. Variant calling on the resulting clusters recovered more than twice as many SNVs with double the allelic ratio compared to calling on all cells together, demonstrating the utility of SECEDO. SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo.

Risk prediction and marker selection in nonsynonymous single nucleotide polymorphisms using whole genome sequencing data

Animal Cells and Systems ◽

10.1080/19768354.2020.1860125 ◽

2020 ◽

Vol 24 (6) ◽

pp. 321-328

Author(s):

Young-Sup Lee ◽

KyeongHye Won ◽

Donghyun Shin ◽

Jae-Don Oh

Keyword(s):

Single Nucleotide Polymorphisms ◽

Whole Genome Sequencing ◽

Risk Prediction ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Single Nucleotide ◽

Marker Selection

Batch effects in population genomic studies with low‐coverage whole genome sequencing data: causes, detection, and mitigation

Molecular Ecology Resources ◽

10.1111/1755-0998.13559 ◽

2021 ◽

Author(s):

Runyang Nicolas Lou ◽

Nina Overgaard Therkildsen

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Batch Effects ◽

Sequencing Data ◽

Population Genomic ◽

Genomic Studies ◽

Low Coverage

dpGMM: A Dirichlet Process Gaussian Mixture Model for Copy Number Variation Detection in Low-Coverage Whole-Genome Sequencing Data

IEEE Access ◽

10.1109/access.2020.2971863 ◽

2020 ◽

Vol 8 ◽

pp. 27973-27985

Author(s):

Yaoyao Li ◽

Junying Zhang ◽

Xiguo Yuan ◽

Junping Li

Keyword(s):

Genome Sequencing ◽

Dirichlet Process ◽

Copy Number ◽

Gaussian Mixture ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Number Variation ◽

Low Coverage ◽

Copy Number Variation Detection

Application of risk score analysis to low-coverage whole genome sequencing data for the noninvasive detection of trisomy 21, trisomy 18, and trisomy 13

Prenatal Diagnosis ◽

10.1002/pd.4712 ◽

2015 ◽

Vol 36 (1) ◽

pp. 56-62 ◽

Cited By ~ 8

Author(s):

J. A. Tynan ◽

S. K. Kim ◽

A. R. Mazloom ◽

C. Zhao ◽

G. McLennan ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Trisomy 21 ◽

Trisomy 13 ◽

Noninvasive Detection ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Score Analysis ◽

Low Coverage

PaSD-qc: Quality control for single cell whole-genome sequencing data using power spectral density estimation

10.1101/166637 ◽

2017 ◽

Cited By ~ 2

Author(s):

Maxwell A. Sherman ◽

Alison R. Barton ◽

Michael Lodato ◽

Carl Vitzthum ◽

Michael E. Coulter ◽

...

Keyword(s):

Spectral Density ◽

Whole Genome Sequencing ◽

Single Cell ◽

Power Spectral Density ◽

Genome Sequencing ◽

Dna Amplification ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Power Spectral

Single cell whole-genome sequencing (scWGS) is providing novel insights into the nature of genetic heterogeneity in normal and diseased cells. However, scWGS introduces DNA amplification-related biases that can confound downstream analysis. Here we present a statistical method, with an accompanying package PaSD-qc (Power Spectral Density-qc), that evaluates the quality of single cell libraries. It uses a modified power spectral density to assess amplification uniformity, amplicon size distribution, autocovariance, and inter-sample consistency, and to identify aberrantly amplified chromosomes. We demonstrate the usefulness of this tool in evaluating scWGS protocols and in selecting high-quality libraries from low-coverage data for deep sequencing.

Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq

10.1101/2021.02.26.433126 ◽

2021 ◽

Author(s):

Arya R. Massarat ◽

Arko Sen ◽

Jeff Jaureguy ◽

Sélène T. Tyndale ◽

Yi Fu ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Single Cell ◽

Genome Sequencing ◽

Superior Performance ◽

Whole Genome Sequencing Data ◽

Regulatory Sequences ◽

Whole Genome ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Regulatory Regions

Genetic variants and de novo mutations in regulatory regions of the genome are typically discovered by whole-genome sequencing (WGS); however, WGS is expensive and most WGS reads come from non-regulatory regions. The Assay for Transposase-Accessible Chromatin (ATAC-seq) generates reads from regulatory sequences and could potentially be used as a low-cost ‘capture’ method for regulatory variant discovery, but its use for this purpose has not been systematically evaluated. Here we apply seven variant callers to bulk and single-cell ATAC-seq data and evaluate their ability to identify single nucleotide variants (SNVs) and insertions/deletions (indels). In addition, we develop an ensemble classifier, VarCA, which combines features from individual variant callers to predict variants. The Genome Analysis Toolkit (GATK) is the best-performing individual caller with precision/recall on a bulk ATAC test dataset of 0.92/0.97 for SNVs and 0.87/0.82 for indels. On bulk ATAC-seq reads, VarCA achieves superior performance with precision/recall of 0.99/0.95 for SNVs and 0.93/0.80 for indels. On single-cell ATAC-seq reads, VarCA attains precision/recall of 0.98/0.94 for SNVs and 0.82/0.82 for indels. In summary, ATAC-seq reads can be used to accurately discover non-coding regulatory variants in the absence of whole-genome sequencing data, and our ensemble method, VarCA, has the best overall performance.

Batch effects in population genomic studies with low-coverage whole genome sequencing data: causes, detection, and mitigation

10.22541/au.162791857.78788821/v1 ◽

2021 ◽

Author(s):

Runyang Nicolas Lou ◽

Nina Overgaard Therkildsen

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Read Length ◽

Whole Genome ◽

Batch Effects ◽

Sequencing Data ◽

Population Genomic ◽

Spatial Coverage ◽

Genomic Studies ◽

Low Coverage

Over the past few decades, the rapid democratization of high-throughput sequencing and the growing emphasis on open science practices have resulted in an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining datasets to achieve unprecedented sample sizes, spatial coverage, or temporal replication in population genomic studies. However, a common concern is that non-biological differences between datasets may generate batch effects that can confound real biological patterns. Despite general awareness about the risk of batch effects, few studies have examined empirically how they manifest in real datasets, and it remains unclear what factors cause batch effects and how to best detect and mitigate their impact bioinformatically. In this paper, we compare two batches of low-coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch-effect-naive” bioinformatic pipeline, batch effects severely biased our genetic diversity estimates, population structure inference, and selection scan. We then demonstrate that these batch effects resulted from multiple technical differences between our datasets, including the sequencing instrument model/chemistry, read type, read length, DNA degradation level, and sequencing depth, but their impact can be detected and substantially mitigated with simple bioinformatic approaches. We conclude that combining datasets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.

Lep-MAP3: robust linkage mapping even for low-coverage whole genome sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btx494 ◽

2017 ◽

Vol 33 (23) ◽

pp. 3726-3732 ◽

Cited By ~ 85

Author(s):

Pasi Rastas

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Linkage Mapping ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Low Coverage

Improvement of genomic prediction by integrating additional single nucleotide polymorphisms selected from imputed whole genome sequencing data

Heredity ◽

10.1038/s41437-019-0246-7 ◽

2019 ◽

Vol 124 (1) ◽

pp. 37-49 ◽

Cited By ~ 9

Author(s):

Aoxing Liu ◽

Mogens Sandø Lund ◽

Didier Boichard ◽

Emre Karaman ◽

Sebastien Fritz ◽

...

Keyword(s):

Single Nucleotide Polymorphisms ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genomic Prediction ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Single Nucleotide
