Identifying tumor clones in sparse single-cell mutation data

2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i186-i193
Author(s):  
Matthew A Myers ◽  
Simone Zaccaria ◽  
Benjamin J Raphael

Abstract. Motivation: Recent single-cell DNA sequencing technologies enable whole-genome sequencing of hundreds to thousands of individual cells. However, these technologies have ultra-low sequencing coverage (<0.5× per cell), which has limited their use to the analysis of large copy-number aberrations (CNAs) in individual cells. While CNAs are useful markers in cancer studies, single-nucleotide mutations are equally important, both in cancer studies and in other applications. However, ultra-low coverage sequencing yields single-nucleotide mutation data that are too sparse for current single-cell analysis methods. Results: We introduce SBMClone, a method to infer clusters of cells, or clones, that share groups of somatic single-nucleotide mutations. SBMClone uses a stochastic block model to overcome sparsity in ultra-low coverage single-cell sequencing data, and we show that SBMClone accurately infers the true clonal composition on simulated datasets with coverage as low as 0.2×. We applied SBMClone to single-cell whole-genome sequencing data from two breast cancer patients obtained using two different sequencing technologies. On the first patient, sequenced using the 10X Genomics CNV solution with sequencing coverage ≈0.03×, SBMClone recovers the major clonal composition when incorporating a small amount of additional information. On the second patient, where pre- and post-treatment tumor samples were sequenced using DOP-PCR with sequencing coverage ≈0.5×, SBMClone shows that tumor cells are present in the post-treatment sample, contrary to published analysis of this dataset. Availability and implementation: SBMClone is available on the GitHub repository https://github.com/raphael-group/SBMClone. Supplementary information: Supplementary data are available at Bioinformatics online.
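To illustrate the block-model idea in the abstract above, here is a minimal sketch, not SBMClone's implementation. It assumes a binary cell-by-mutation matrix (1 = a variant read was observed) and given block labels for cells and mutations, and computes the fraction of 1-entries in each block pair, which is the kind of quantity a Bernoulli stochastic block model reasons about. The function name `block_densities` and the toy data are hypothetical; inferring the labels themselves is not shown.

```python
from collections import defaultdict

def block_densities(matrix, cell_blocks, mutation_blocks):
    """Return the fraction of 1-entries for each (cell block, mutation block) pair."""
    ones = defaultdict(int)
    total = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (cell_blocks[i], mutation_blocks[j])
            ones[key] += value
            total[key] += 1
    return {key: ones[key] / total[key] for key in total}

# Toy data (hypothetical): 4 cells x 4 mutations, very sparse.
matrix = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
cell_blocks = [0, 0, 1, 1]      # assumed clone label per cell
mutation_blocks = [0, 0, 1, 1]  # assumed group label per mutation

densities = block_densities(matrix, cell_blocks, mutation_blocks)
```

In this toy matrix no two cells share an observed mutation, so comparing cells pairwise reveals nothing, yet the block densities still separate the diagonal blocks from the off-diagonal ones. That is the intuition for why aggregating over blocks can help when individual entries are sparse.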

2021 ◽  
Author(s):  
Hana Rozhoňová ◽  
Daniel Danciu ◽  
Stefan Stark ◽  
Gunnar Rätsch ◽  
André Kahles ◽  
...  

Recently developed single-cell DNA sequencing technologies enable whole-genome, amplification-free sequencing of thousands of cells at the cost of ultra-low coverage of the sequenced data (<0.05x per cell), which mostly limits their usage to the identification of copy number alterations (CNAs) in multi-megabase segments. Aside from CNA-based subclone detection, single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Here we present Single Cell Data Tumor Clusterer (SECEDO, lat. 'to separate'), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. The core aspects of the method are an efficient Bayesian filtering of relevant loci and the exploitation of read overlaps and phasing information. We applied SECEDO to a synthetic dataset simulating 7,250 cells and eight tumor subclones from a single patient, and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to four real single-cell sequencing datasets from a breast cancer patient, SECEDO was able to recover the major clonal composition in each dataset at the original sequencing depth of 0.03x per cell, an 8-fold improvement relative to the state of the art. Variant calling on the resulting clusters recovered more than twice as many SNVs with double the allelic ratio compared to calling on all cells together, demonstrating the utility of SECEDO. SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo.
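The remark above about superimposing hundreds of cells can be made concrete with a little arithmetic. The sketch below assumes reads land uniformly at random, so that the number of reads covering a locus is roughly Poisson with mean equal to the coverage. This model, the function names, and the group size of 500 cells are illustrative assumptions, not part of SECEDO.

```python
import math

def prob_locus_uncovered(coverage):
    """P(no read covers a given locus), assuming Poisson-distributed read depth."""
    return math.exp(-coverage)

def pooled_coverage(per_cell_coverage, n_cells):
    """Expected coverage after superimposing n_cells cells of equal coverage."""
    return per_cell_coverage * n_cells

# At 0.03x, a given locus is unobserved in roughly 97% of individual cells.
single = prob_locus_uncovered(0.03)

# Pooling a hypothetical group of 500 similar cells gives about 15x coverage.
pooled = pooled_coverage(0.03, 500)
```

Under these assumptions, a single cell almost never covers a given locus, while a pooled group of a few hundred cells reaches a depth at which variant calling becomes feasible.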


2017 ◽  
Author(s):  
Maxwell A. Sherman ◽  
Alison R. Barton ◽  
Michael Lodato ◽  
Carl Vitzthum ◽  
Michael E. Coulter ◽  
...  

Abstract. Single-cell whole-genome sequencing (scWGS) is providing novel insights into the nature of genetic heterogeneity in normal and diseased cells. However, scWGS introduces DNA amplification-related biases that can confound downstream analysis. Here we present a statistical method, with an accompanying package PaSD-qc (Power Spectral Density-qc), that evaluates the quality of single-cell libraries. It uses a modified power spectral density to assess amplification uniformity, amplicon size distribution, autocovariance, and inter-sample consistency, and to identify aberrantly amplified chromosomes. We demonstrate the usefulness of this tool in evaluating scWGS protocols and in selecting high-quality libraries from low-coverage data for deep sequencing.
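As a rough illustration of how a power spectral density can expose non-uniform amplification, the sketch below computes a plain periodogram of a binned read-depth signal with a naive discrete Fourier transform. It is not the modified estimator used by PaSD-qc; the function name `periodogram` and the toy depth values are assumptions made for this example.

```python
import cmath

def periodogram(signal):
    """Plain periodogram of a mean-centered signal for frequencies 0..n//2."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    power = []
    for k in range(n // 2 + 1):
        # Naive DFT coefficient at frequency index k.
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power.append(abs(coeff) ** 2 / n)
    return power

# Hypothetical binned read depths: perfectly uniform vs. strongly alternating.
uniform_depth = [2, 2, 2, 2, 2, 2, 2, 2]
uneven_depth = [1, 3, 1, 3, 1, 3, 1, 3]

uniform_power = periodogram(uniform_depth)
uneven_power = periodogram(uneven_depth)
peak_index = max(range(len(uneven_power)), key=lambda k: uneven_power[k])
```

A uniform signal carries no power at any frequency, whereas the alternating signal concentrates its power at the highest resolvable frequency. Real depth profiles fall between these extremes, and the shape of the spectrum indicates the length scale of the unevenness.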


2021 ◽  
Author(s):  
Arya R. Massarat ◽  
Arko Sen ◽  
Jeff Jaureguy ◽  
Sélène T. Tyndale ◽  
Yi Fu ◽  
...  

ABSTRACT. Genetic variants and de novo mutations in regulatory regions of the genome are typically discovered by whole-genome sequencing (WGS); however, WGS is expensive and most WGS reads come from non-regulatory regions. The Assay for Transposase-Accessible Chromatin (ATAC-seq) generates reads from regulatory sequences and could potentially be used as a low-cost ‘capture’ method for regulatory variant discovery, but its use for this purpose has not been systematically evaluated. Here we apply seven variant callers to bulk and single-cell ATAC-seq data and evaluate their ability to identify single nucleotide variants (SNVs) and insertions/deletions (indels). In addition, we develop an ensemble classifier, VarCA, which combines features from individual variant callers to predict variants. The Genome Analysis Toolkit (GATK) is the best-performing individual caller with precision/recall on a bulk ATAC test dataset of 0.92/0.97 for SNVs and 0.87/0.82 for indels. On bulk ATAC-seq reads, VarCA achieves superior performance with precision/recall of 0.99/0.95 for SNVs and 0.93/0.80 for indels. On single-cell ATAC-seq reads, VarCA attains precision/recall of 0.98/0.94 for SNVs and 0.82/0.82 for indels. In summary, ATAC-seq reads can be used to accurately discover non-coding regulatory variants in the absence of whole-genome sequencing data and our ensemble method, VarCA, has the best overall performance.
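The precision/recall figures quoted above follow the standard definitions, which can be sketched as below. This is a generic illustration, not the evaluation code used for VarCA; representing a variant as a `(chromosome, position, ref, alt)` tuple and the example calls are assumptions.

```python
def precision_recall(called, truth):
    """Precision and recall of a call set against a truth set of variants."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical variants as (chromosome, position, ref, alt).
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "A"), ("chr2", 300, "T", "C")]
called = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 75, "G", "A"), ("chr3", 10, "A", "C")]

precision, recall = precision_recall(called, truth)
```

Here three of the four calls are correct and three of the four true variants are found, so both values are 0.75. A caller can trade one metric for the other, which is why the abstract reports the two together.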


Author(s):  
Runyang Nicolas Lou ◽  
Nina Overgaard Therkildsen

Over the past few decades, the rapid democratization of high-throughput sequencing and the growing emphasis on open science practices have resulted in an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining datasets to achieve unprecedented sample sizes, spatial coverage, or temporal replication in population genomic studies. However, a common concern is that non-biological differences between datasets may generate batch effects that can confound real biological patterns. Despite general awareness about the risk of batch effects, few studies have examined empirically how they manifest in real datasets, and it remains unclear what factors cause batch effects and how to best detect and mitigate their impact bioinformatically. In this paper, we compare two batches of low-coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch-effect-naive” bioinformatic pipeline, batch effects severely biased our genetic diversity estimates, population structure inference, and selection scan. We then demonstrate that these batch effects resulted from multiple technical differences between our datasets, including the sequencing instrument model/chemistry, read type, read length, DNA degradation level, and sequencing depth, but their impact can be detected and substantially mitigated with simple bioinformatic approaches. We conclude that combining datasets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.

