Whisper: Read sorting allows robust mapping of sequencing data

AbstractMotivationMapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. Mistakes made at this computationally challenging stage cannot be recovered easily.ResultsWe present Whisper, an accurate and high-performant mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known Bowtie2 and BWA-MEM tools at a comparable accuracy (validated in variant calling pipeline).AvailabilityWhisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/[email protected] informationSupplementary data are available at publisher Web site.

Download Full-text

NGSEP3: accurate variant calling across species and sequencing protocols

Bioinformatics ◽

10.1093/bioinformatics/btz275 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4716-4723 ◽

Cited By ~ 7

Author(s):

Daniel Tello ◽

Juanita Gil ◽

Cristian D Loaiza ◽

John J Riascos ◽

Nicolás Cardozo ◽

...

Keyword(s):

Short Tandem Repeats ◽

Tandem Repeats ◽

High Throughput Sequencing ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Comparative Accuracy ◽

Downstream Analysis ◽

Short Tandem

Abstract Motivation Accurate detection, genotyping and downstream analysis of genomic variants from high-throughput sequencing data are fundamental features in modern production pipelines for genetic-based diagnosis in medicine or genomic selection in plant and animal breeding. Our research group maintains the Next-Generation Sequencing Experience Platform (NGSEP) as a precise, efficient and easy-to-use software solution for these features. Results Understanding that incorrect alignments around short tandem repeats are an important source of genotyping errors, we implemented in NGSEP new algorithms for realignment and haplotype clustering of reads spanning indels and short tandem repeats. We performed extensive benchmark experiments comparing NGSEP to state-of-the-art software using real data from three sequencing protocols and four species with different distributions of repetitive elements. NGSEP consistently shows comparative accuracy and better efficiency compared to the existing solutions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for modern applications in medicine and agriculture. Availability and implementation NGSEP is available as open source software at http://ngsep.sf.net. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

AStrap: identification of alternative splicing from transcript sequences without a reference genome

Bioinformatics ◽

10.1093/bioinformatics/bty1008 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2654-2656 ◽

Cited By ~ 5

Author(s):

Guoli Ji ◽

Wenbin Ye ◽

Yaru Su ◽

Moliang Chen ◽

Guangzao Huang ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Supplementary Information ◽

Model Organisms ◽

Sequencing Data ◽

Extensive Evaluation ◽

Reference Genomes

Abstract Summary Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity, however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify much more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

hypeR: An R Package for Geneset Enrichment Workflows

10.1101/656637 ◽

2019 ◽

Cited By ~ 1

Author(s):

Anthony Federico ◽

Stefano Monti

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Wide Audience ◽

Popular Method ◽

Link Type ◽

High Throughput Sequencing Data ◽

One Stop ◽

Recent Version

ABSTRACTSummaryGeneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution to performing geneset enrichment for a wide audience and range of use cases.Availability and implementationThe most recent version of the package is available at https://github.com/montilab/hypeR.Supplementary informationComprehensive documentation and tutorials, are available at https://montilab.github.io/hypeR-docs.

Download Full-text

Numt identification and removal with RtN!

Bioinformatics ◽

10.1093/bioinformatics/btaa642 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5115-5116 ◽

Cited By ~ 2

Author(s):

August E Woerner ◽

Jennifer Churchill Cihlar ◽

Utpal Smart ◽

Bruce Budowle

Keyword(s):

Mitochondrial Genome ◽

Massively Parallel Sequencing ◽

Sequence Similarity ◽

Variant Calling ◽

Supplementary Information ◽

Mitochondrial Genomes ◽

Sequencing Data ◽

Read Mapping ◽

Genome Data ◽

Mitochondrial Sequences

Abstract Motivation Assays in mitochondrial genomics rely on accurate read mapping and variant calling. However, there are known and unknown nuclear paralogs that have fundamentally different genetic properties than that of the mitochondrial genome. Such paralogs complicate the interpretation of mitochondrial genome data and confound variant calling. Results Remove the Numts! (RtN!) was developed to categorize reads from massively parallel sequencing data not based on the expected properties and sequence identities of paralogous nuclear encoded mitochondrial sequences, but instead using sequence similarity to a large database of publicly available mitochondrial genomes. RtN! removes low-level sequencing noise and mitochondrial paralogs while not impacting variant calling, while competing methods were shown to remove true variants from mitochondrial mixtures. Availability and implementation https://github.com/Ahhgust/RtN Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Pisces: An Accurate and Versatile Variant Caller for Somatic and Germline Next-Generation Sequencing Data

10.1101/291641 ◽

2018 ◽

Cited By ~ 1

Author(s):

Tamsen Dunn ◽

Gwenn Berry ◽

Dorothea Emig-Agius ◽

Yu Jiang ◽

Serena Lei ◽

...

Keyword(s):

Next Generation Sequencing ◽

Gene Mutations ◽

Variant Calling ◽

Amplicon Sequencing ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Ras Gene ◽

Generation Sequencing

AbstractMotivationNext-Generation Sequencing (NGS) technology is transitioning quickly from research labs to clinical settings. The diagnosis and treatment selection for many acquired and autosomal conditions necessitate a method for accurately detecting somatic and germline variants, suitable for the clinic.ResultsWe have developed Pisces, a rapid, versatile and accurate small variant calling suite designed for somatic and germline amplicon sequencing applications. Pisces accuracy is achieved by four distinct modules, the Pisces Read Stitcher, Pisces Variant Caller, the Pisces Variant Quality Recalibrator, and the Pisces Variant Phaser. Each module incorporates a number of novel algorithmic strategies aimed at reducing noise or increasing the likelihood of detecting a true variant.AvailabilityPisces is distributed under an open source license and can be downloaded from https://github.com/Illumina/Pisces. Pisces is available on the BaseSpace™ SequenceHub as part of the TruSeq Amplicon workflow and the Illumina Ampliseq Workflow. Pisces is distributed on Illumina sequencing platforms such as the MiSeq™, and is included in the Praxis™ Extended RAS Panel test which was recently approved by the FDA for the detection of multiple RAS gene [email protected] informationSupplementary data are available online.

Download Full-text

Customized de novo mutation detection for any variant calling pipeline: SynthDNM

Bioinformatics ◽

10.1093/bioinformatics/btab225 ◽

2021 ◽

Author(s):

Aojie Lian ◽

James Guevara ◽

Kun Xia ◽

Jonathan Sebat

Keyword(s):

Mutation Detection ◽

De Novo ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

De Novo Mutation ◽

Flexible Approach ◽

Sequencing Technologies ◽

Training Examples ◽

Simulated Training

Abstract Motivation As sequencing technologies and analysis pipelines evolve, de novo mutation (DNM) calling tools must be adapted. Therefore, a flexible approach is needed that can accurately identify DNMs from genome or exome sequences from a variety of datasets and variant calling pipelines. Results Here, we describe SynthDNM, a random-forest based classifier that can be readily adapted to new sequencing or variant-calling pipelines by applying a flexible approach to constructing simulated training examples from real data. The optimized SynthDNM classifiers predict de novo SNPs and indels with robust accuracy across multiple methods of variant calling. Availabilityand implementation SynthDNM is freely available on Github (https://github.com/james-guevara/synthdnm). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

sangeranalyseR: simple and interactive analysis of Sanger sequencing data in R

10.1101/2020.05.18.102459 ◽

2020 ◽

Author(s):

Kuan-Hao Chao ◽

Kirston Barton ◽

Sarah Palmer ◽

Robert Lanfear

Keyword(s):

Sanger Sequencing ◽

Reference Sequence ◽

Supplementary Information ◽

File Format ◽

Bioconductor Package ◽

Sequencing Data ◽

Interactive Analysis ◽

Link Type ◽

Online Documentation ◽

Wide Range

AbstractSummarysangeranalyseR is an interactive R/Bioconductor package and two associated Shiny applications designed for analysing Sanger sequencing from data from the ABIF file format in R. It allows users to go from loading reads to saving aligned contigs in a few lines of R code. sangeranalyseR provides a wide range of options for a number of commonly-performed actions including read trimming, detecting secondary peaks, viewing chromatograms, and detecting indels using a reference sequence. All parameters can be adjusted interactively either in R or in the associated Shiny applications. sangeranalyseR comes with extensive online documentation, and outputs detailed interactive HTML reports.Availability and implementationsangeranalyseR is implemented in R and released under an MIT license. It is available for all platforms on Bioconductor (https://bioconductor.org/packages/sangeranalyseR) and on Github (https://github.com/roblanf/sangeranalyseR)[email protected] informationDocumentation at https://sangeranalyser.readthedocs.io/.

Download Full-text

ipDMR: identification of differentially methylated regions with interval P-values

Bioinformatics ◽

10.1093/bioinformatics/btaa732 ◽

2020 ◽

Author(s):

Zongli Xu ◽

Changchun Xie ◽

Jack A Taylor ◽

Liang Niu

Keyword(s):

Software Tool ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Differentially Methylated Regions ◽

R Software ◽

False Discovery Rates ◽

P Values ◽

False Discovery ◽

Bisulfite Sequencing Data

Abstract Summary ipDMR is an R software tool for identification of differentially methylated regions (DMRs) using auto-correlated P-values for individual CpGs from epigenome-wide association analysis using array or bisulfite sequencing data. It summarizes P-values for adjacent CpGs, identifies association peaks and then extends peaks to find boundaries of DMRs. ipDMR uses BED format files as input and is easy to use. Simulations guided by real data found that ipDMR outperformed current available methods and provided slightly higher true positive rates and much lower false discovery rates. Availability and implementation ipDMR is available at https://bioconductor.org/packages/release/bioc/html/ENmix.html. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

scBatch: batch-effect correction of RNA-seq data through sample distance matrix adjustment

Bioinformatics ◽

10.1093/bioinformatics/btaa097 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3115-3123 ◽

Cited By ~ 3

Author(s):

Teng Fei ◽

Tianwei Yu

Keyword(s):

Single Cell ◽

Differential Expression Analysis ◽

Distance Matrix ◽

Real Data ◽

R Package ◽

Batch Effect ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Gene Differential Expression

Abstract Motivation Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

xAtlas: Scalable small variant calling across heterogeneous next-generation sequencing experiments

10.1101/295071 ◽

2018 ◽

Cited By ~ 7

Author(s):

Jesse Farek ◽

Daniel Hughes ◽

Adam Mansfield ◽

Olga Krasheninina ◽

Waleed Nasser ◽

...

Keyword(s):

Next Generation Sequencing ◽

Rapid Development ◽

Variant Calling ◽

Supplementary Information ◽

Data Generation ◽

Next Generation ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Homogeneous Sample ◽

Generation Sequencing

AbstractMotivationThe rapid development of next-generation sequencing (NGS) technologies has lowered the barriers to genomic data generation, resulting in millions of samples sequenced across diverse experimental designs. The growing volume and heterogeneity of these sequencing data complicate the further optimization of methods for identifying DNA variation, especially considering that curated highconfidence variant call sets commonly used to evaluate these methods are generally developed by reference to results from the analysis of comparatively small and homogeneous sample sets.ResultsWe have developed xAtlas, an application for the identification of single nucleotide variants (SNV) and small insertions and deletions (indels) in NGS data. xAtlas is easily scalable and enables execution and retraining with rapid development cycles. Generation of variant calls in VCF or gVCF format from BAM or CRAM alignments is accomplished in less than one CPU-hour per 30× short-read human whole-genome. The retraining capabilities of xAtlas allow its core variant evaluation models to be optimized on new sample data and user-defined truth sets. Obtaining SNV and indels calls from xAtlas can be achieved more than 40 times faster than established methods while retaining the same accuracy.AvailabilityFreely available under a BSD 3-clause license at https://github.com/jfarek/[email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text