short read alignment Latest Research Papers

Fuzzy set intersection based paired-end short-read alignment

10.1101/2021.11.23.469039 ◽ 2021

Author(s): William J Bolosky ◽ Arun Subramaniyan ◽ Matei Zaharia ◽ Ravi Pandya ◽ Taylor Sittler ◽ ...

Keyword(s): Fuzzy Set ◽ Genetic Material ◽ Genomic Data ◽ Short Read ◽ Read Alignment ◽ Short Read Alignment ◽ Set Intersection

Much genomic data comes in the form of paired-end reads: two reads that represent genetic material with a small gap between them. We present a new algorithm that aligns both reads in a pair simultaneously by fuzzily intersecting the sets of candidate alignment locations for each read. This algorithm is often much faster, and its alignments yield variant calls with roughly the same concordance as those of the best competing aligners.
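
The idea of fuzzily intersecting two candidate sets can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes each read already has a sorted list of candidate positions, and it treats two candidates as matching when they lie within a hypothetical `max_gap` of each other. The function name, parameter, and example positions are all made up for illustration.

```python
from typing import List, Tuple


def fuzzy_intersect(
    locs_a: List[int], locs_b: List[int], max_gap: int
) -> List[Tuple[int, int]]:
    """Return every (a, b) pair with |a - b| <= max_gap.

    Both inputs must be sorted in ascending order.
    """
    pairs = []
    start = 0  # first index in locs_b that can still be within range
    for a in locs_a:
        # Skip candidates in locs_b that are too far to the left of a.
        while start < len(locs_b) and locs_b[start] < a - max_gap:
            start += 1
        # Collect candidates until they are too far to the right of a.
        j = start
        while j < len(locs_b) and locs_b[j] <= a + max_gap:
            pairs.append((a, locs_b[j]))
            j += 1
    return pairs


# Hypothetical candidate positions for the two reads of one pair.
read1_candidates = [100, 5_000, 90_000]
read2_candidates = [350, 42_000, 90_400]
result = fuzzy_intersect(read1_candidates, read2_candidates, max_gap=500)
print(result)  # [(100, 350), (90000, 90400)]
```

Because `start` only moves forward, the scan skips out-of-range candidates once rather than once per position, which is the usual reason to prefer a merge-style pass over comparing all pairs.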

Polypolish: short-read polishing of long-read bacterial genome assemblies

10.1101/2021.10.14.464465 ◽ 2021

Author(s): Ryan R Wick ◽ Kathryn E Holt

Keyword(s): Bacterial Genome ◽ Short Read ◽ Read Alignment ◽ Short Reads ◽ Repeat Sequences ◽ Short Read Alignment ◽ Long Read ◽ Genome Assemblies ◽ Residual Errors

Long-read-only bacterial genome assemblies usually contain residual errors, most commonly homopolymer-length errors. Short-read polishing tools can use short reads to fix these errors, but most rely on short-read alignment, which is unreliable in repeat regions. Errors in such regions are therefore challenging to fix and often remain after short-read polishing. Here we introduce Polypolish, a new short-read polisher that uses all-per-read alignments to repair errors in repeat sequences that other polishers cannot. In benchmarking tests using both simulated and real reads, we find that Polypolish performs well and that the best results are achieved by using Polypolish in combination with other short-read polishers.

BWA-MEME: BWA-MEM emulated with a machine learning approach

10.1101/2021.09.01.457579 ◽ 2021

Author(s): Youngmok Jung ◽ Dongsu Han

Keyword(s): Search Algorithm ◽ Search Problem ◽ Learning Approach ◽ Exact Match ◽ Short Read ◽ Read Alignment ◽ Short Read Alignment ◽ Machine Learning Approach ◽ Memory Accesses ◽ Generation Sequencing

The growing use of next-generation sequencing and increased sequencing throughput require efficient short-read alignment, where seeding is one of the major performance bottlenecks. The key challenge in the seeding phase is searching for exact matches of substrings of short reads in the reference DNA sequence. Existing algorithms, however, are limited in performance by their frequent memory accesses. This paper presents BWA-MEME, the first full-fledged short-read alignment software that leverages learned indices to solve the exact match search problem for efficient seeding. BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search, which is used extensively in the seeding phase. Our evaluation shows that BWA-MEME achieves up to 3.45x speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60x, memory accesses by 8.77x, and LLC misses by 2.21x, while producing SAM output identical to that of BWA-MEM2.
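
As background for the exact match search problem, the following minimal sketch shows plain binary search over a suffix array. It is not BWA-MEME's method: there is no learned index and no SMEM logic, the suffix array is built naively, and the function names and toy reference string are assumptions made for illustration only.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Naive construction by sorting suffixes; fine for a toy reference only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact_matches(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return sorted start positions where pattern occurs in text."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


reference = "ACGTACGTGACG"  # toy reference, not real data
sa = build_suffix_array(reference)
print(find_exact_matches(reference, sa, "ACG"))  # [0, 4, 9]
```

Each step of the binary search jumps to an unrelated part of the array, which gives a feel for why frequent memory accesses become the bottleneck that the abstract describes.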

Fast and SNP-aware short read alignment with SALT

BMC Bioinformatics ◽ 10.1186/s12859-021-04088-6 ◽ 2021 ◽ Vol 22 (S9)

Author(s): Wei Quan ◽ Bo Liu ◽ Yadong Wang

Keyword(s): Sequence Alignment ◽ Genetic Variants ◽ High Throughput Sequencing ◽ Reference Genome ◽ Graph Model ◽ Sequence Alignments ◽ Short Read ◽ Read Alignment ◽ Short Read Alignment ◽ Alignment Tool

Background: DNA sequence alignment is a common first step in most applications of high-throughput sequencing technologies. The accuracy of sequence alignments directly affects the accuracy of downstream analyses, such as variant calling and quantitative analysis of transcriptome; therefore, rapidly and accurately mapping reads to a reference genome is a significant topic in bioinformatics. Conventional DNA read aligners map reads to a linear reference genome (such as the GRCh38 primary assembly). However, such a linear reference genome represents the genome of only one or a few individuals and thus lacks information on variations in the population. This limitation can introduce bias and impact the sensitivity and accuracy of mapping. Recently, a number of aligners have begun to map reads to populations of genomes, which can be represented by a reference genome and a large number of genetic variants. However, compared to linear reference aligners, an aligner that can store and index all genetic variants has a high cost in memory (RAM) space and leads to extremely long run time. Aligning reads to a graph-model-based index that includes all types of variants is ultimately an NP-hard problem in theory. By contrast, considering only single nucleotide polymorphism (SNP) information will reduce the complexity of the index and improve the speed of sequence alignment. Results: The SNP-aware alignment tool (SALT) is a fast, memory-efficient, and SNP-aware short read alignment tool. SALT uses 5.8 GB of RAM to index a human reference genome (GRCh38) and incorporates 12.8M UCSC common SNPs. Compared with a state-of-the-art aligner, SALT has a similar speed but higher accuracy. Conclusions: Herein, we present an SNP-aware alignment tool (SALT) that aligns reads to a reference genome that incorporates an SNP database. We benchmarked SALT using simulated and real datasets. The results demonstrate that SALT can efficiently map reads to the reference genome with significantly improved accuracy. Incorporating SNP information can improve the accuracy of read alignment and can reveal novel variants. The source code is freely available at https://github.com/weiquan/SALT.
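
To make the notion of SNP-aware matching concrete, here is a toy sketch that checks a read against a linear reference while accepting known alternative alleles. It says nothing about how SALT builds or queries its index; the function, the 0-based position convention, and the example SNP table are hypothetical.

```python
from typing import Dict, Set


def matches_with_snps(
    reference: str, read: str, pos: int, snps: Dict[int, Set[str]]
) -> bool:
    """True if read matches reference at pos, accepting known SNP alleles.

    snps maps a 0-based reference position to its alternative bases.
    """
    if pos < 0 or pos + len(read) > len(reference):
        return False
    for offset, base in enumerate(read):
        ref_pos = pos + offset
        if base == reference[ref_pos]:
            continue  # matches the reference allele
        if base in snps.get(ref_pos, set()):
            continue  # matches a known alternative allele
        return False
    return True


reference = "ACGTACGT"      # toy reference
known_snps = {3: {"C"}}     # hypothetical T>C SNP at position 3
print(matches_with_snps(reference, "GCAC", 2, known_snps))  # True
print(matches_with_snps(reference, "GCAC", 2, {}))          # False
```

A purely linear aligner would count the second case as a mismatch, which is the reference bias the abstract refers to.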

SRAMM: Short Read Alignment Mapping Metrics

International Journal on Bioinformatics & Biosciences ◽ 10.5121/ijbb.2021.11201 ◽ 2021 ◽ Vol 11 (02) ◽ pp. 01-07

Author(s): Alvin Chon ◽ Xiaoqiu Huang

Keyword(s): Third Party ◽ Read Mapping ◽ Short Read ◽ Read Alignment ◽ Short Read Mapping ◽ Short Read Alignment ◽ Command Line Tool ◽ Short Read Aligner ◽ Quality Programs ◽ Computationally Intensive

Short Read Alignment Mapping Metrics (SRAMM) is an efficient and versatile command line tool providing additional short read mapping metrics, filtering, and graphs. Short read aligners report MAPping Quality (MAPQ), but the methods used to compute it are generally neither standardized nor well described in the literature or software manuals. Additionally, third party mapping quality programs are typically computationally intensive or designed for specific applications. SRAMM efficiently generates multiple different concept-based mapping scores to provide an informative post-alignment examination and filtering process of aligned short reads for various downstream applications. SRAMM is compatible with Python 2.6+ and Python 3.6+ on all operating systems. It works with any short read aligner that generates SAM/BAM/CRAM file outputs and reports 'AS' tags. It is freely available under the MIT license at http://github.com/achon/sramm.
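
For readers unfamiliar with where MAPQ and the 'AS' tag live, the sketch below pulls both out of one SAM text record. It is not SRAMM code and does not reproduce any SRAMM score. It relies only on the SAM layout (tab-separated fields, MAPQ in column 5, optional `TAG:TYPE:VALUE` fields from column 12 onward); the function name and the example record are invented for illustration.

```python
from typing import Optional, Tuple


def mapq_and_alignment_score(sam_line: str) -> Tuple[int, Optional[int]]:
    """Extract MAPQ (column 5) and the optional AS:i tag from a SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment record has 11 mandatory fields")
    mapq = int(fields[4])
    alignment_score = None
    # Optional fields follow the 11 mandatory columns.
    for tag in fields[11:]:
        if tag.startswith("AS:i:"):
            alignment_score = int(tag[len("AS:i:"):])
            break
    return mapq, alignment_score


# Made-up record for illustration only.
record = "\t".join([
    "read1", "0", "chr1", "100", "60", "10M", "*", "0", "0",
    "ACGTACGTAC", "FFFFFFFFFF", "AS:i:10", "NM:i:0",
])
print(mapq_and_alignment_score(record))  # (60, 10)
```

Returning `None` when the tag is absent keeps the caller aware that an aligner did not report 'AS', which the abstract lists as a requirement.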

FPGA Acceleration of Short Read Alignment

Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies ◽ 10.1145/3468044.3468057 ◽ 2021

Author(s): Konstantina Koliogeorgi ◽ Sotirios Xydis ◽ Dimitrios Soudris

Keyword(s): Short Read ◽ Read Alignment ◽ Short Read Alignment ◽ FPGA Acceleration

Genomic sequence characteristics and the empiric accuracy of short-read sequencing

10.1101/2021.04.08.438862 ◽ 2021

Author(s): Maximillian G Marin ◽ Roger Vargas ◽ Michael Harris ◽ Brendan Jeffrey ◽ L. Elaine Epperson ◽ ...

Keyword(s): Genomic Sequence ◽ Repetitive Sequences ◽ GC Content ◽ Variant Calling ◽ Basic Research ◽ Confidence Regions ◽ Surveillance Systems ◽ Short Read ◽ Short Read Alignment ◽ Long Reads

Background: Short-read whole genome sequencing (WGS) is a vital tool for clinical applications and basic research. Genetic divergence from the reference genome, repetitive sequences, and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. For the clonal pathogen Mycobacterium tuberculosis (Mtb), researchers frequently exclude 10.7% of the genome believed to be repetitive and prone to erroneous variant calls. To benchmark short-read variant calling, we used 36 diverse clinical Mtb isolates dually sequenced with Illumina short reads and PacBio long reads. We systematically study the short-read variant calling accuracy and the influence of sequence uniqueness, reference bias, and GC content. Results: Reference-based Illumina variant calling had a recall ≥89.0% and precision ≥98.5% across parameters evaluated. The best balance between precision and recall was achieved by tuning the mapping quality (MQ) threshold, i.e. the confidence of the read mapping (recall 85.8%, precision 99.1% at MQ ≥ 40). Masking repetitive sequence content is an alternative conservative approach to variant calling that maintains high precision (recall 70.2%, precision 99.6% at MQ ≥ 40). Of the genomic positions typically excluded for Mtb, 68% are accurately called using Illumina WGS including 52 of the 168 PE/PPE genes (34.5%). We present a refined list of low confidence regions and examine the largest sources of variant calling error. Conclusions: Our improved approach to variant calling has broad implications for the use of WGS in the study of Mtb biology, inference of transmission in public health surveillance systems, and more generally for WGS applications in other organisms.
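
The trade-off between precision and recall under an MQ threshold can be shown with a small sketch. It uses the standard definitions (precision = TP / (TP + FP), recall = TP / number of true variants) and entirely hypothetical calls; it does not reproduce the study's pipeline or its numbers, and the function name and data layout are assumptions.

```python
from typing import List, Tuple


def precision_recall_at_mq(
    calls: List[Tuple[int, bool]], n_truth: int, min_mq: int
) -> Tuple[float, float]:
    """Precision and recall of the calls kept at a mapping-quality cutoff.

    calls holds (mapping quality, is a true variant) for each call;
    n_truth is the number of variants in the truth set.
    """
    kept = [is_true for mq, is_true in calls if mq >= min_mq]
    tp = sum(kept)  # True counts as 1, False as 0
    precision = tp / len(kept) if kept else 0.0
    recall = tp / n_truth if n_truth else 0.0
    return precision, recall


# Hypothetical calls: four true variants found, two false positives,
# and one true variant (out of five) never called at all.
toy_calls = [(60, True), (60, True), (45, True), (30, True),
             (20, False), (10, False)]
loose = precision_recall_at_mq(toy_calls, n_truth=5, min_mq=0)
strict = precision_recall_at_mq(toy_calls, n_truth=5, min_mq=40)
print(loose, strict)
```

On this toy data, raising the cutoff from 0 to 40 removes both false positives (precision rises to 1.0) but also drops one true call (recall falls from 0.8 to 0.6), the same direction of trade-off the abstract reports.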

Acceleration of Short Read Alignment with Runtime Reconfiguration

2020 International Conference on Field-Programmable Technology (ICFPT) ◽ 10.1109/icfpt51103.2020.00044 ◽ 2020

Author(s): Ho-Cheung Ng ◽ Shuanglong Liu ◽ Izaak Coleman ◽ Ringo S.W. Chu ◽ Man-Chung Yue ◽ ...

Keyword(s): Short Read ◽ Read Alignment ◽ Short Read Alignment ◽ Runtime Reconfiguration

Arioc: High-concurrency short-read alignment on multiple GPUs

PLoS Computational Biology ◽ 10.1371/journal.pcbi.1008383 ◽ 2020 ◽ Vol 16 (11) ◽ pp. e1008383

Author(s): Richard Wilton ◽ Alexander S. Szalay

Keyword(s): DNA Sequence ◽ Data Storage ◽ High Speed ◽ Graphics Processing Unit ◽ General Purpose ◽ Processing Unit ◽ Short Read ◽ Multiple GPUs ◽ Short Read Alignment ◽ Memory Accesses

In large DNA sequence repositories, archival data storage is often coupled with computers that provide 40 or more CPU threads and multiple GPU (general-purpose graphics processing unit) devices. This presents an opportunity for DNA sequence alignment software to exploit high-concurrency hardware to generate short-read alignments at high speed. Arioc, a GPU-accelerated short-read aligner, can compute WGS (whole-genome sequencing) alignments ten times faster than comparable CPU-only alignment software. When two or more GPUs are available, Arioc's speed increases proportionately because the software executes concurrently on each available GPU device. We have adapted Arioc to recent multi-GPU hardware architectures that support high-bandwidth peer-to-peer memory accesses among multiple GPUs. By modifying Arioc's implementation to exploit this GPU memory architecture we obtained a further 1.8x-2.9x increase in overall alignment speeds. With this additional acceleration, Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run–over 500 million 150nt paired-end reads–in less than 15 minutes. As WGS data accumulates exponentially and high-concurrency computational resources become widespread, Arioc addresses a growing need for timely computation in the short-read data analysis toolchain.
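
The throughput and run-time figures quoted above can be cross-checked with back-of-envelope arithmetic. The sketch assumes a constant rate and ignores I/O and startup costs. Because the abstract does not say whether "500 million" counts individual reads or read pairs, both readings are computed; since it says "over 500 million", the results are lower bounds.

```python
BUDGET_SECONDS = 15 * 60   # "less than 15 minutes"
RATE = 2_000_000           # "two million short-read alignments per second"


def alignment_seconds(n_reads: int, reads_per_second: float) -> float:
    """Time to align n_reads at a constant throughput."""
    return n_reads / reads_per_second


as_reads = alignment_seconds(500_000_000, RATE)      # 500 M individual reads
as_pairs = alignment_seconds(2 * 500_000_000, RATE)  # 500 M pairs = 1 G reads
print(as_reads, as_pairs)  # 250.0 500.0
print(as_reads < BUDGET_SECONDS, as_pairs < BUDGET_SECONDS)  # True True
```

Under either reading the alignment step alone takes roughly 4 to 8.5 minutes at the stated rate, which is consistent with the reported total of under 15 minutes.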
