Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq

2021 ◽  
Author(s):  
Arya R. Massarat ◽  
Arko Sen ◽  
Jeff Jaureguy ◽  
Sélène T. Tyndale ◽  
Yi Fu ◽  
...  

Abstract: Genetic variants and de novo mutations in regulatory regions of the genome are typically discovered by whole-genome sequencing (WGS); however, WGS is expensive and most WGS reads come from non-regulatory regions. The Assay for Transposase-Accessible Chromatin (ATAC-seq) generates reads from regulatory sequences and could potentially be used as a low-cost ‘capture’ method for regulatory variant discovery, but its use for this purpose has not been systematically evaluated. Here we apply seven variant callers to bulk and single-cell ATAC-seq data and evaluate their ability to identify single nucleotide variants (SNVs) and insertions/deletions (indels). In addition, we develop an ensemble classifier, VarCA, which combines features from individual variant callers to predict variants. The Genome Analysis Toolkit (GATK) is the best-performing individual caller, with precision/recall on a bulk ATAC test dataset of 0.92/0.97 for SNVs and 0.87/0.82 for indels. On bulk ATAC-seq reads, VarCA achieves superior performance, with precision/recall of 0.99/0.95 for SNVs and 0.93/0.80 for indels. On single-cell ATAC-seq reads, VarCA attains precision/recall of 0.98/0.94 for SNVs and 0.82/0.82 for indels. In summary, ATAC-seq reads can be used to accurately discover non-coding regulatory variants in the absence of whole-genome sequencing data, and our ensemble method, VarCA, has the best overall performance.
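The precision/recall figures quoted above compare a caller's output against a truth set. As a minimal sketch of how such figures are commonly computed (this is not VarCA's evaluation code; the function name `precision_recall` and the `(chrom, pos, ref, alt)` variant keys are assumptions made for illustration):

```python
def precision_recall(called, truth):
    """Return (precision, recall) for a set of called variants against a truth set.

    Variants are compared by exact match of a hashable key; the
    (chrom, pos, ref, alt) tuples used below are a hypothetical choice.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Precision: fraction of calls that are real variants.
    precision = true_positives / len(called) if called else 0.0
    # Recall: fraction of real variants that were called.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy data, invented for this example only.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr2", 300, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr3", 10, "A", "C"),  # false positive
}

p, r = precision_recall(called, truth)
print(f"precision={p:.2f} recall={r:.2f}")
```

Exact-key matching is the simplest possible comparison; indels in particular can be represented in more than one equivalent way, so a real evaluation would likely normalize variants before comparing them.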

2018 ◽  
Author(s):  
Maxime Garcia ◽  
Szilveszter Juhos ◽  
Malin Larsson ◽  
Pall I. Olason ◽  
Marcel Martin ◽  
...  

Abstract: Summary: Whole-genome sequencing (WGS) is a cornerstone of precision medicine, but portable and reproducible open-source workflows for WGS analyses of germline and somatic variants are lacking. We present Sarek, a modular, comprehensive, and easy-to-install workflow, combining a range of software for the identification and annotation of single-nucleotide variants (SNVs), insertion and deletion variants (indels), structural variants, tumor sample heterogeneity, and karyotyping from germline or paired tumor/normal samples. Sarek is implemented in a bioinformatics workflow language (Nextflow) with Docker- and Singularity-compatible containers, ensuring easy deployment and full reproducibility at any Linux-based compute cluster or cloud computing environment. Sarek supports the human reference genomes GRCh37 and GRCh38, and can readily be used both as a core production workflow at sequencing facilities and as a powerful stand-alone tool for individual research groups. Availability: Source code and instructions for local installation are available at GitHub (https://github.com/SciLifeLab/Sarek) under the MIT open-source license, and we invite the research community to contribute additional functionality as a collaborative open-source development project.


2017 ◽  
Author(s):  
Maxwell A. Sherman ◽  
Alison R. Barton ◽  
Michael Lodato ◽  
Carl Vitzthum ◽  
Michael E. Coulter ◽  
...  

Abstract: Single-cell whole-genome sequencing (scWGS) is providing novel insights into the nature of genetic heterogeneity in normal and diseased cells. However, scWGS introduces DNA amplification-related biases that can confound downstream analysis. Here we present a statistical method, with an accompanying package PaSD-qc (Power Spectral Density-qc), that evaluates the quality of single-cell libraries. It uses a modified power spectral density to assess amplification uniformity, amplicon size distribution, autocovariance, and inter-sample consistency, and to identify aberrantly amplified chromosomes. We demonstrate the usefulness of this tool in evaluating scWGS protocols and in selecting high-quality libraries from low-coverage data for deep sequencing.


2021 ◽  
Author(s):  
Einar Gabbasov ◽  
Miguel Moreno-Molina ◽  
Iñaki Comas ◽  
Maxwell Libbrecht ◽  
Leonid Chindelevitch

Abstract: The occurrence of multiple strains of a bacterial pathogen such as M. tuberculosis or C. difficile within a single human host, referred to as a mixed infection, has important implications for both healthcare and public health. However, methods for detecting it, and especially for determining the proportion and identities of the underlying strains, from WGS (whole-genome sequencing) data have been limited. In this paper we introduce SplitStrains, a novel method for addressing these challenges. Grounded in a rigorous statistical model, SplitStrains not only demonstrates superior performance in proportion estimation to other existing methods on both simulated and real M. tuberculosis data, but also successfully determines the identity of the underlying strains. We conclude that SplitStrains is a powerful addition to the existing toolkit of analytical methods for data coming from bacterial pathogens, and holds the promise of enabling previously inaccessible conclusions to be drawn in the realm of public health microbiology.

Author summary: When multiple strains of a pathogenic organism are present in a patient, it may be necessary to not only detect this, but also to identify the individual strains. However, this problem has not yet been solved for bacterial pathogens processed via whole-genome sequencing. In this paper, we propose the SplitStrains algorithm for detecting multiple strains in a sample, identifying their proportions, and inferring their sequences, in the case of Mycobacterium tuberculosis. We test it on both simulated and real data, with encouraging results. We believe that our work opens new horizons in public health microbiology by allowing a more precise detection, identification and quantification of multiple infecting strains within a sample.


BMC Genetics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Lucy Crooks ◽  
Johnathan Cooper-Knock ◽  
Paul R. Heath ◽  
Ahmed Bouhouche ◽  
Mostafa Elfahime ◽  
...  

Abstract: Background: Large-scale human sequencing projects have described around a hundred million single nucleotide variants (SNVs). These studies have predominantly involved individuals with European ancestry, despite the fact that genetic diversity is expected to be highest in Africa, where Homo sapiens evolved and has maintained a large population for the longest time. The African Genome Variation Project examined several African populations, but these were all located south of the Sahara. Morocco is on the northwest coast of Africa and mostly lies north of the Sahara, which makes it very attractive for studying genetic diversity. The ancestry of present-day Moroccans is unknown and may be substantially different from that of Africans found south of the Sahara desert. Recent genomic data of Taforalt individuals in Eastern Morocco revealed 15,000-year-old modern humans and suggested that North African individuals may be genetically distinct from previously studied African populations. Results: We present SNVs discovered by whole genome sequencing (WGS) of three Moroccans. From a total of 5.9 million SNVs detected, over 200,000 were not identified by 1000G and were not in the extensive gnomAD database. We summarise the SNVs by genomic position, type of sequence gene context and effect on proteins encoded by the sequence. Comparison of the overall genomic information of the Moroccan individuals with that of individuals from 1000G supports the Moroccan population being distinct from both sub-Saharan African and European populations. Conclusions: We conclude that Moroccan samples are genetically distinct and lie in the middle of the previously observed cline between populations of European and African ancestry. WGS of Moroccan individuals can identify a large number of novel SNVs and aid in functional characterisation of the genome.


2020 ◽  
Author(s):  
Xiao Chen ◽  
Fei Shen ◽  
Nina Gonzaludo ◽  
Alka Malhotra ◽  
Cande Rogert ◽  
...  

Abstract: Responsible for the metabolism of 25% of clinically used drugs, CYP2D6 is a critical component of personalized medicine initiatives. Genotyping CYP2D6 is challenging due to sequence similarity with its pseudogene paralog CYP2D7 and a high number and variety of common structural variants (SVs). Here we describe a novel bioinformatics method, Cyrius, that accurately genotypes CYP2D6 using whole-genome sequencing (WGS) data. We show that Cyrius has superior performance (96.5% concordance with truth genotypes) compared to existing methods (84-86.8%). After implementing the improvements identified from the comparison against the truth data, Cyrius’s accuracy has since been improved to 99.3%. Using Cyrius, we built a haplotype frequency database from 2504 ethnically diverse samples and estimate that SV-containing star alleles are more frequent than previously reported. Cyrius will be an important tool to incorporate pharmacogenomics in WGS-based precision medicine initiatives.


2017 ◽  
Author(s):  
Zhiting Wei ◽  
Funan He ◽  
Guohui Chuai ◽  
Hanhui Ma ◽  
Zhixi Su ◽  
...  

To the Editor: Schaefer et al. [1] (referred to as Study_1) recently presented the provocative conclusion that CRISPR-Cas9 nuclease can induce many unexpected off-target mutations across the genome that arise from sites with poor homology to the gRNA. As Wilson et al. [2] pointed out, however, the selection of a co-housed mouse as the control is insufficient to attribute the observed mutation differences between the CRISPR-treated mice and control mice. Therefore, the causes of these mutations need to be further investigated. In 2015, Iyer et al. [3] (referred to as Study_2) used Cas9 and a pair of sgRNAs to mutate the Ar gene in vivo, and off-target mutations were investigated by comparing the control mice with the offspring of the modified mice. After analyzing the whole genome sequencing (WGS) data of the offspring and the control mice, they claimed that off-target mutations are rare from CRISPR-Cas9 engineering. Notably, their study only focused on indel off-target mutations. We re-analyzed the WGS data of these two studies and detected both single nucleotide variants (SNVs) and indel mutations.


2020 ◽  
Author(s):  
Lucy Crooks ◽  
Johnathan Cooper-Knock ◽  
Paul R. Heath ◽  
Ahmed Bouhouche ◽  
Elmostafa El Fahime ◽  
...  

Abstract: Background: Large-scale human sequencing projects have described around a hundred million single nucleotide variants (SNVs), but have predominantly focused on individuals with European ancestry, despite the fact that genetic diversity is expected to be highest in Africa, where Homo sapiens evolved and has maintained a large population for the longest time. The more recent African Genome Variation Project examined several African populations, but these were all located south of the Sahara. Morocco is on the northwest coast of Africa and mostly lies north of the Sahara, which makes it very attractive for studying genetic diversity. Recent genomic data of Taforalt individuals in Eastern Morocco revealed 15,000-year-old modern humans and showed that North African individuals are expected to show genetic differences from previously studied African populations. Results: We present single nucleotide variant (SNV) results from whole genome sequencing (WGS) of three Moroccans. From a total of 5.9 million SNVs detected, over 200,000 were not identified by 1000G. We provide a summary of the SNVs by genomic position, gene context and effect on protein coding. Comparison of genome-wide information of the Moroccan individuals to individuals from 1000G by principal component analysis revealed a substantial genomic distinction between the Moroccan population and sub-Saharan African populations. Conclusions: We conclude that Moroccan samples lie in the middle of the previously observed cline between populations of European and African ancestry. WGS of Moroccan individuals can identify a large number of new SNVs and aid in functional characterisation of the genome.


2020 ◽  
Author(s):  
Christian Rödelsperger

Abstract: Nematodes are attractive model systems to understand the genetic basis of various biological processes ranging from development to complex behaviors. In particular, mutagenesis experiments combined with whole-genome sequencing have proven to be one of the most effective methods to identify core players of multiple biological pathways. To enable experimentalists to apply such integrative genetic and bioinformatic analysis in the case of the satellite model organism Pristionchus pacificus, I present a simplified workflow for the analysis of whole-genome data from mutant lines and corresponding mapping panels. Individual components are based on well-maintained and widely used software packages and are extended by 50 lines of code for the analysis and visualization of allele frequencies. The effectiveness of this workflow is demonstrated by an application to recently generated data of a P. pacificus mutant line, where it reduced the number of candidate mutations from an initial set of 3,500 single nucleotide variants to ten.
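The allele-frequency analysis mentioned above can be illustrated with a small sketch. This is not the author's code: the record layout (`pos`, `ref`, `alt` read counts), the function names, and the `min_depth` and `min_freq` thresholds are all assumptions chosen for the example. The idea shown is only that variants whose alternate allele is near fixation in a sequenced mutant pool are retained as candidates.

```python
def alt_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele, or None at zero depth."""
    depth = ref_reads + alt_reads
    return alt_reads / depth if depth else None


def candidate_variants(variants, min_depth=10, min_freq=0.9):
    """Keep variants with sufficient depth and a high alternate allele frequency.

    The thresholds are hypothetical defaults, not values from the abstract.
    """
    kept = []
    for v in variants:
        depth = v["ref"] + v["alt"]
        freq = alt_allele_frequency(v["ref"], v["alt"])
        if depth >= min_depth and freq is not None and freq >= min_freq:
            kept.append((v["pos"], freq))
    return kept


# Toy read counts, invented for this example only.
variants = [
    {"pos": 1200, "ref": 0, "alt": 20},    # fixed for the alternate allele
    {"pos": 5400, "ref": 10, "alt": 10},   # segregating, filtered out
    {"pos": 9100, "ref": 1, "alt": 3},     # too little coverage, filtered out
    {"pos": 15000, "ref": 1, "alt": 19},   # near fixation
]

candidates = candidate_variants(variants)
for pos, freq in candidates:
    print(pos, f"{freq:.2f}")
```

In practice the read counts would come from the caller's output rather than hand-written dictionaries, and the cutoffs would need tuning to the sequencing depth and the design of the mapping panel.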

