Customized de novo mutation detection for any variant calling pipeline: SynthDNM

Bioinformatics ◽

10.1093/bioinformatics/btab225 ◽

2021 ◽

Author(s):

Aojie Lian ◽

James Guevara ◽

Kun Xia ◽

Jonathan Sebat

Keyword(s):

Mutation Detection ◽

De Novo ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

De Novo Mutation ◽

Flexible Approach ◽

Sequencing Technologies ◽

Training Examples ◽

Simulated Training

Abstract. Motivation. As sequencing technologies and analysis pipelines evolve, de novo mutation (DNM) calling tools must be adapted. Therefore, a flexible approach is needed that can accurately identify DNMs from genome or exome sequences from a variety of datasets and variant calling pipelines. Results. Here, we describe SynthDNM, a random-forest-based classifier that can be readily adapted to new sequencing or variant-calling pipelines by applying a flexible approach to constructing simulated training examples from real data. The optimized SynthDNM classifiers predict de novo SNPs and indels with robust accuracy across multiple methods of variant calling. Availability and implementation. SynthDNM is freely available on GitHub (https://github.com/james-guevara/synthdnm). Supplementary information. Supplementary data are available at Bioinformatics online.

Customized de novo mutation detection for any variant calling pipeline: SynthDNM

10.1101/2021.02.10.427198 ◽

2021 ◽

Cited By ~ 1

Author(s):

Aojie Lian ◽

James Guevara ◽

Kun Xia ◽

Jonathan Sebat

Keyword(s):

De Novo ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

De Novo Mutation ◽

De Novo Mutations ◽

Flexible Approach ◽

Sequencing Technologies ◽

Training Examples ◽

Simulated Training

Abstract. Motivation. As sequencing technologies and analysis pipelines evolve, DNM calling tools must be adapted. Therefore, a flexible approach is needed that can accurately identify de novo mutations from genome or exome sequences from a variety of datasets and variant calling pipelines. Results. Here, we describe SynthDNM, a random-forest-based classifier that can be readily adapted to new sequencing or variant-calling pipelines by applying a flexible approach to constructing simulated training examples from real data. The optimized SynthDNM classifiers predict de novo SNPs and indels with robust accuracy across multiple methods of variant calling. Availability. SynthDNM is freely available on GitHub (https://github.com/james-guevara/synthdnm). Contact. [email protected] Supplementary information. Supplementary data are available at Bioinformatics online.

A Likelihood-Based Framework for Variant Calling and De Novo Mutation Detection in Families

PLoS Genetics ◽

10.1371/journal.pgen.1002944 ◽

2012 ◽

Vol 8 (10) ◽

pp. e1002944 ◽

Cited By ~ 54

Author(s):

Bingshan Li ◽

Wei Chen ◽

Xiaowei Zhan ◽

Fabio Busonero ◽

Serena Sanna ◽

...

Keyword(s):

Mutation Detection ◽

De Novo ◽

Variant Calling ◽

De Novo Mutation

MUM&Co: accurate detection of all SV types through whole-genome alignment

Bioinformatics ◽

10.1093/bioinformatics/btaa115 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3242-3243 ◽

Cited By ~ 2

Author(s):

Samuel O’Donnell ◽

Gilles Fischer

Keyword(s):

De Novo ◽

Supplementary Information ◽

Genome Alignment ◽

Whole Genome ◽

Structural Variations ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Human Genomes ◽

Whole Genome Alignment ◽

Primary Output

Abstract. Summary. MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de novo assemblies generated by third generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co stands out in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details. Availability and implementation. https://github.com/SAMtoBAM/MUMandCo. Supplementary information. Supplementary data are available at Bioinformatics online.

RECORD: Reference-Assisted Genome Assembly for Closely Related Genomes

International Journal of Genomics ◽

10.1155/2015/563482 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Krisztian Buza ◽

Bartek Wilczynski ◽

Norbert Dojer

Keyword(s):

Reference Genome ◽

De Novo ◽

Real Data ◽

Reference Sequence ◽

Individual Genome ◽

Single Experiment ◽

Sequencing Technologies ◽

Sequencing Cost ◽

The Individual ◽

Assembly Software

Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its implementation, called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.

CAMISIM: Simulating metagenomes and microbial communities

10.1101/300970 ◽

2018 ◽

Cited By ~ 4

Author(s):

Adrian Fritz ◽

Peter Hofmann ◽

Stephan Majda ◽

Eik Dahms ◽

Johannes Dröge ◽

...

Keyword(s):

Microbial Communities ◽

De Novo ◽

Real Data ◽

Small Data ◽

Data Sets ◽

Sequencing Data ◽

Taxonomic Profiling ◽

Benchmark Data ◽

Sequencing Technologies ◽

Wide Range

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second and third generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM

De Novo Mutation Detection from ctDNA Correlates with Variants Detected on Metastasis of Patients with Any Kind of Refractory Cancer from the SHIVA Trial

Annals of Oncology ◽

10.1093/annonc/mdu358.2 ◽

2014 ◽

Vol 25 ◽

pp. iv546

Author(s):

J. Pierga ◽

C. Decraene ◽

V. Bernard ◽

M. Kamal ◽

A. Blin ◽

...

Keyword(s):

Mutation Detection ◽

De Novo ◽

De Novo Mutation

Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Bioinformatics ◽

10.1093/bioinformatics/btaa915 ◽

2020 ◽

Author(s):

Yuansheng Liu ◽

Xiaocai Zhang ◽

Quan Zou ◽

Xiangxiang Zeng

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Supplementary Data ◽

Complementary Strand ◽

Short Reads ◽

Sequencing Technologies ◽

Computational Resources

Abstract. Summary. Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources needed in downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. Availability and implementation. https://github.com/yuansliu/minirmd. Supplementary information. Supplementary data are available at Bioinformatics online.
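The idea of grouping reads by a minimizer before comparing them can be illustrated with a small Python sketch. This is not minirmd's implementation: the function names, the single clustering round, the canonical (strand-independent) minimizer, and the Hamming-distance threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def minimizer(seq, k):
    """Lexicographically smallest k-mer over the read and its reverse complement."""
    rc = reverse_complement(seq)
    return min(s[i:i + k] for s in (seq, rc) for i in range(len(s) - k + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def deduplicate(reads, k=4, max_mismatches=1):
    """Keep the first read of each near-duplicate group (hypothetical helper).

    Reads are bucketed by minimizer, so each read is compared only with
    kept reads that share its minimizer, on either strand.
    """
    buckets = defaultdict(list)  # minimizer -> kept reads in that bucket
    kept = []
    for read in reads:
        key = minimizer(read, k)
        rc = reverse_complement(read)
        duplicate = any(
            len(other) == len(read)
            and min(hamming(read, other), hamming(rc, other)) <= max_mismatches
            for other in buckets[key]
        )
        if not duplicate:
            buckets[key].append(read)
            kept.append(read)
    return kept


reads = ["ACGTACGTAA", "ACGTACGTAA", "TTACGTACGT", "ACGTACGTAC", "GGGGGCCCCC"]
print(deduplicate(reads))
```

A mismatch that falls inside the minimizer itself sends two near-duplicates to different buckets, which is one plausible reason to repeat the clustering with several minimizer lengths, as the abstract describes.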

Whisper: Read sorting allows robust mapping of sequencing data

10.1101/240358 ◽

2017 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Debudaj-Grabysz ◽

Adam Gudyś ◽

Szymon Grabowski

Keyword(s):

Reference Genome ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Suffix Arrays ◽

Link Type ◽

Mapping Tool ◽

Reverse Complement ◽

Comparable Accuracy

Abstract. Motivation. Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. Mistakes made at this computationally challenging stage cannot be recovered easily. Results. We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known Bowtie2 and BWA-MEM tools at a comparable accuracy (validated in a variant calling pipeline). Availability. Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. Contact. [email protected] Supplementary information. Supplementary data are available at the publisher's website.
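As a toy illustration of looking up sorted reads in suffix arrays of a reference and its reverse complement, the following Python sketch performs exact matching only. It is not Whisper's algorithm or API: the function names are hypothetical, the suffix-array construction is naive, and real mappers also handle mismatches, indels, and paired reads.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, sa, pattern):
    """Binary-search the suffix array for all exact occurrences of pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def map_reads(reference, reads):
    """Map each read to (forward hits, reverse-complement hits).

    Reverse hits are coordinates on the reverse-complemented reference.
    Reads are processed in sorted order, so neighbouring reads query
    nearby regions of the suffix arrays.
    """
    rc_reference = reverse_complement(reference)
    sa_fwd = build_suffix_array(reference)
    sa_rev = build_suffix_array(rc_reference)
    return {
        read: (find_exact(reference, sa_fwd, read),
               find_exact(rc_reference, sa_rev, read))
        for read in sorted(reads)
    }


hits = map_reads("ACGTACGGACGT", ["CGG", "ACG", "CCG", "TTT"])
for read, (forward, reverse) in hits.items():
    print(read, forward, reverse)
```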

LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract. Motivation. Linked-Reads technologies combine both the high quality and low cost of short-read sequencing and long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results. We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities, for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve their performance. Availability and implementation. LRez is implemented in C++, supported on Unix-based platforms, and available under the AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information. Supplementary data are available at Bioinformatics Advances online.

Extensive sequencing of seven human genomes to characterize benchmark reference materials

10.1101/026468 ◽

2015 ◽

Cited By ~ 9

Author(s):

Justin M Zook ◽

David Catoe ◽

Jennifer McDaniel ◽

Lindsay Vang ◽

Noah Spies ◽

...

Keyword(s):

Human Genome ◽

Reference Materials ◽

De Novo ◽

Variant Calling ◽

Genome Project ◽

Genome Comparison ◽

Personal Genome ◽

Sequencing Data ◽

Sequencing Technologies ◽

Human Genomes

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.
