Hardware accelerator architecture for simultaneous short-read DNA sequences alignment with enhanced traceback phase

2012 ◽  
Vol 36 (2) ◽  
pp. 96-109 ◽  
Author(s):  
Nuno Sebastião ◽  
Nuno Roma ◽  
Paulo Flores
mSystems ◽  
2018 ◽  
Vol 3 (3) ◽  
Author(s):  
Gabriel A. Al-Ghalith ◽  
Benjamin Hillmann ◽  
Kaiwei Ang ◽  
Robin Shields-Cutler ◽  
Dan Knights

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or the type of adaptors used, to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control.
IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.
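
The abstract does not say how SHI7 decides whether a read pair is stitchable or how it estimates the overlap. As a rough illustration of the idea only, the Python sketch below looks for the longest suffix of the forward read that matches a prefix of the reverse-complemented reverse read. The function names, the `min_overlap` default, and the `max_mismatch_frac` tolerance are all hypothetical and are not taken from SHI7.

```python
# Illustrative sketch only: NOT SHI7's implementation.
# All names and default thresholds here are assumptions.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def best_overlap(fwd: str, rev: str, min_overlap: int = 4,
                 max_mismatch_frac: float = 0.1) -> int:
    """Length of the longest suffix of `fwd` that matches a prefix of the
    reverse-complemented `rev` within the mismatch tolerance, or 0."""
    rc = reverse_complement(rev)
    for length in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(fwd[-length:], rc[:length]))
        if mismatches <= max_mismatch_frac * length:
            return length
    return 0


def stitch(fwd: str, rev: str, **kwargs):
    """Merge a read pair into one sequence, or return None if no
    acceptable overlap is found."""
    overlap = best_overlap(fwd, rev, **kwargs)
    if overlap == 0:
        return None
    return fwd + reverse_complement(rev)[overlap:]
```

Running `best_overlap` over a sample of pairs and looking at the distribution of returned lengths is one simple way to guess an expected overlap for a data set; real pipelines would also weigh base quality scores, which this sketch ignores.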


2021 ◽  
Author(s):  
Thomas K. F. Wong ◽  
Teng Li ◽  
Louis Ranjard ◽  
Steven Wu ◽  
Jeet Sukumaran ◽  
...  

ABSTRACT A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian model of inference to estimate the phylogeny of the haplotypes and their relative frequencies, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.
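
To make the notion of SNP patterns along an alignment more concrete, here is a minimal Python sketch that tallies base frequencies per reference position and reports the positions where a non-reference base is reasonably common. It covers only this first counting step, not the Bayesian inference, and it is not AFPhyloMix's code: the `(start, sequence)` read representation, the gap-free alignment assumption, and the `min_freq` threshold are all hypothetical simplifications.

```python
# Illustrative sketch only: NOT AFPhyloMix's implementation.
# Reads are assumed to be gap-free and given as (start, sequence) pairs.
from collections import Counter


def snp_sites(reference: str, reads, min_freq: float = 0.05) -> dict:
    """Return {position: {base: frequency}} for every reference position
    where some non-reference base reaches `min_freq` among covering reads."""
    counts = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1

    sites = {}
    for pos, column in enumerate(counts):
        depth = sum(column.values())
        if depth == 0:
            continue  # no read covers this position
        freqs = {base: n / depth for base, n in column.items()}
        if any(base != reference[pos] and f >= min_freq
               for base, f in freqs.items()):
            sites[pos] = freqs
    return sites
```

In a mixture, the frequency of each variant base at a site reflects the combined abundance of the haplotypes carrying it, which is why per-site frequencies are a natural input for estimating haplotype proportions.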


2016 ◽  
Author(s):  
Diego D. Cambuy ◽  
Felipe H. Coutinho ◽  
Bas E. Dutilh

ABSTRACT In modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related to the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from GitHub.


2017 ◽  
Author(s):  
Richard Wilton ◽  
Xin Li ◽  
Andrew P. Feinberg ◽  
Alexander S. Szalay

ABSTRACT The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can potentially be addressed by appropriate software-engineering and algorithmic improvements. One strategy is to integrate this additional programming logic into the read-alignment implementation in a way that the software becomes amenable to optimizations that lead to both higher speed and greater sensitivity than can be achieved without this integration. We have evaluated this approach using Arioc, a short-read aligner that uses GPU (general-purpose graphics processing unit) hardware to accelerate computationally-expensive programming logic. We integrated the BS-seq computational logic into both GPU and CPU code throughout the Arioc implementation. We then carried out a read-by-read comparison of Arioc's reported alignments with the alignments reported by the most widely used BS-seq read aligners. With simulated reads, Arioc's accuracy is equal to or better than the other read aligners we evaluated. With human sequencing reads, Arioc's throughput is at least 10 times higher than that of existing BS-seq aligners across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.
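
The abstract does not describe Arioc's BS-seq logic in detail. For orientation, the Python sketch below shows the general in silico C-to-T conversion idea that bisulfite read matching commonly relies on: unmethylated cytosines are read as thymines, so read and reference are compared after converting every C to T, and methylation is then called by looking back at the unconverted read. The function names and the call symbols are hypothetical, only one strand is handled, and nothing here is taken from the Arioc code base.

```python
# Illustrative sketch only: NOT Arioc's implementation.
# Handles a single strand and an ungapped read/reference window.


def c_to_t(seq: str) -> str:
    """Convert every C to T, mimicking full bisulfite conversion."""
    return seq.upper().replace("C", "T")


def matches_after_conversion(read: str, ref_window: str) -> bool:
    """True if the read and reference window agree once both are
    reduced to the three-letter (A, G, T) alphabet."""
    return c_to_t(read) == c_to_t(ref_window)


def call_methylation(read: str, ref_window: str) -> str:
    """Per-position call string: 'M' where a reference C is still C in
    the read (protected, hence methylated), 'u' where it reads as T
    (converted, hence unmethylated), '?' for any other base at a
    reference C, and '.' at non-C reference positions."""
    calls = []
    for r, g in zip(read.upper(), ref_window.upper()):
        if g != "C":
            calls.append(".")
        elif r == "C":
            calls.append("M")
        elif r == "T":
            calls.append("u")
        else:
            calls.append("?")
    return "".join(calls)
```

The extra conversion and bookkeeping on every read and on the reference is one source of the additional computational burden mentioned in the abstract.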


2018 ◽  
Vol 19 (S14) ◽  
Author(s):  
Jesús Pérez-Serrano ◽  
Edans Sandes ◽  
Alba Cristina Magalhaes Alves de Melo ◽  
Manuel Ujaldón

2017 ◽  
Vol 18 (2) ◽  
pp. 296-305 ◽  
Author(s):  
Diana S. Baetscher ◽  
Anthony J. Clemento ◽  
Thomas C. Ng ◽  
Eric C. Anderson ◽  
John C. Garza

2020 ◽  
Author(s):  
Gabriel Al-Ghalith ◽  
Dan Knights

ABSTRACT One of the fundamental tasks in analyzing next-generation sequencing data is genome database search, in which DNA sequences are compared to known reference genomes for identification or annotation. Although algorithms exist for optimal database search with perfect sensitivity and specificity, these have largely been abandoned for next-generation sequencing (NGS) data in favor of faster heuristic algorithms that sacrifice alignment quality. Virtually all DNA alignment tools that are commonly used in genomic and metagenomic database search use approximate methods that sometimes report the wrong match, and sometimes fail to find a valid match when present. Here we introduce BURST, a high-throughput DNA short-read aligner that uses several new synergistic optimizations to enable provably optimal alignment in NGS datasets. BURST finds all equally good matches in the database above a specified identity threshold and can either report all of them, pick the most likely among tied matches, or provide lowest-common-ancestor taxonomic annotation among tied matches. BURST can align, disambiguate, and assign taxonomy at a rate of 1,000,000 query sequences per minute against the RefSeq v82 representative prokaryotic genome database (5,500 microbial genomes, 19 GB) at 98% identity on a 32-core computer, representing a speedup of up to 20,000-fold over current optimal gapped alignment techniques. This may have broader implications for clinical applications, strain tracking, and other situations where fast, exact, extremely sensitive alignment is desired.
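
Lowest-common-ancestor annotation among tied matches can be pictured as taking the longest shared prefix of the matches' taxonomic lineages. The Python sketch below illustrates that idea under the assumption that each tied hit comes with a root-to-leaf list of taxon names. The function name and the lineage representation are hypothetical and do not reflect BURST's internal data structures or options.

```python
# Illustrative sketch only: NOT BURST's implementation.
# Each lineage is assumed to be a root-to-leaf list of taxon names.


def lowest_common_ancestor(lineages):
    """Return the longest shared root-to-leaf prefix of the lineages."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared
```

When all tied matches belong to the same species the full lineage is returned; when they span several genera, the annotation falls back to the deepest rank they still share.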

