BWTaligner: a genome short-read aligner

Short-read genome alignment is a fundamental computational step used in many bioinformatic analyses. It is therefore desirable to align such data as fast as possible. Most alignment algorithms consider a seed-and-extend approach. Several popular programs perform the seeding step based on the Burrows-Wheeler Transform with a low memory footprint, but they are relatively slow compared to more recent approaches that use a minimizer-based seeding-and-chaining strategy. Recently, syncmers and strobemers were proposed for sequence comparison. Both protocols were designed for improved conservation of matches between sequences under mutations. Syncmers is a thinning protocol proposed as an alternative to minimizers, while strobemers is a linking protocol for gapped sequences and was proposed as an alternative to k-mers. The main contribution in this work is a new seeding approach that combines syncmers and strobemers. We use a strobemer protocol (randstrobes) to link together syncmers (i.e., in syncmer-space) instead of over the original sequence. Our protocol allows us to create longer seeds while preserving mapping accuracy. A longer seed length reduces the number of candidate regions which allows faster mapping and alignment. We also contribute the insight that speed-wise, this protocol is particularly effective when syncmers are canonical. Canonical syncmers can be created for specific parameter combinations and reduce the computational burden of computing the non-canonical randstrobes in reverse complement. We implement our idea in a proof-of-concept short-read aligner strobealign that aligns short reads 3-4x faster than minimap2 and 15-23x faster than BWA and Bowtie2. Many implementation versions of, e.g., BWA, achieve high speed on specific hardware. Our contribution is algorithmic and requires no hardware architecture or system-specific instructions. Strobealign is available at https://github.com/ksahlin/StrobeAlign.
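The seeding idea in this abstract, thinning the sequence to syncmers and then linking syncmers with a randstrobe-style rule, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the function names, the BLAKE2 hash, the open-syncmer offset `t`, and the window parameters `w_min` and `w_max` are hypothetical and are not taken from strobealign's implementation, which also handles canonical syncmers and reverse complements.

```python
import hashlib


def h(s: str) -> int:
    """Deterministic 64-bit hash of a string (illustrative choice of hash)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def open_syncmers(seq: str, k: int, s: int, t: int):
    """Return (position, k-mer) for each k-mer whose smallest s-mer (by hash)
    starts at offset t inside the k-mer. Selection depends only on the k-mer."""
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smer_hashes = [h(kmer[j:j + s]) for j in range(k - s + 1)]
        if smer_hashes.index(min(smer_hashes)) == t:
            selected.append((i, kmer))
    return selected


def randstrobes_over_syncmers(syncmers, w_min: int, w_max: int):
    """Link each syncmer to one downstream syncmer chosen in syncmer-space.

    The window is counted in syncmers, not nucleotides: candidates are the
    syncmers w_min..w_max places after the current one, and the candidate
    minimizing a hash of the pair is chosen (an assumed linking rule).
    """
    seeds = []
    for i, (p1, m1) in enumerate(syncmers):
        window = syncmers[i + w_min:i + w_max + 1]
        if not window:
            continue  # too close to the end of the sequence to form a seed
        p2, m2 = min(window, key=lambda cand: h(m1 + cand[1]))
        seeds.append((p1, p2, m1 + m2))
    return seeds
```

Because a k-mer is selected based only on its own s-mers, a read and the reference pick the same syncmers wherever they share k-mers, which is what makes linking in syncmer-space reproducible between the two.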

GPU-accelerated alignment of bisulfite-treated short-read sequences

10.1101/175729 ◽

2017 ◽

Author(s):

Richard Wilton ◽

Xin Li ◽

Andrew P. Feinberg ◽

Alexander S. Szalay

Keyword(s):

Dna Sequences ◽

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Short Read ◽

Wide Range ◽

Programming Logic ◽

Short Read Aligner ◽

Graphics Processing ◽

Better Than

Abstract: The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can potentially be addressed by appropriate software-engineering and algorithmic improvements. One strategy is to integrate this additional programming logic into the read-alignment implementation in a way that the software becomes amenable to optimizations that lead to both higher speed and greater sensitivity than can be achieved without this integration. We have evaluated this approach using Arioc, a short-read aligner that uses GPU (general-purpose graphics processing unit) hardware to accelerate computationally expensive programming logic. We integrated the BS-seq computational logic into both GPU and CPU code throughout the Arioc implementation. We then carried out a read-by-read comparison of Arioc's reported alignments with the alignments reported by the most widely used BS-seq read aligners. With simulated reads, Arioc's accuracy is equal to or better than that of the other read aligners we evaluated. With human sequencing reads, Arioc's throughput is at least 10 times higher than that of existing BS-seq aligners across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.

High conservation combined with high plasticity: genomics and evolution of Borrelia bavariensis

10.21203/rs.3.rs-29892/v1 ◽

2020 ◽

Author(s):

Noémie S Becker ◽

Robert Ethan Rollins ◽

Kateryna Nosenko ◽

Alexander Paulus ◽

Samantha Martin ◽

...

Keyword(s):

Lyme Borreliosis ◽

Next Generation Sequencing Data ◽

Borrelia Burgdorferi Sensu Lato ◽

High Plasticity ◽

Sequencing Data ◽

Short Read ◽

Genome Reconstruction ◽

A Genome ◽

Long Read ◽

Genomic Studies

Abstract: Background: Borrelia bavariensis is one of the agents of Lyme Borreliosis (or Lyme disease) in Eurasia. The genome of the Borrelia burgdorferi sensu lato species complex, which includes B. bavariensis, is known to be very complex and fragmented, making the assembly of whole genomes with next-generation sequencing data a challenge. Results: We present a genome reconstruction for 33 B. bavariensis isolates from Eurasia based on long-read (Pacific Bioscience, for three isolates) and short-read (Illumina) data. We show that the combination of both sequencing techniques allows proper genome reconstruction of all plasmids in most cases but use of a very close reference is necessary when only short-read sequencing data is available. B. bavariensis genomes combine a high degree of genetic conservation with high plasticity: all isolates share the main chromosome and five plasmids, but the repertoire of other plasmids is highly variable. In addition to plasmid losses and gains through horizontal transfer, we also observe several fusions between plasmids. Although European isolates of B. bavariensis have little diversity in genome content, there is some geographic structure to this variation. In contrast, each Asian isolate has a unique plasmid repertoire and we observe no geographically based differences between Japanese and Russian isolates. Comparing the genomes of Asian and European populations of B. bavariensis suggests that some genes which are markedly different between the two populations may be good candidates for adaptation to the tick vector (Ixodes ricinus in Europe and I. persulcatus in Asia). Conclusions: We present the characterization of genomes of a large sample of B. bavariensis isolates and show that their plasmid content is highly variable. This study opens the way for genomic studies seeking to understand host and vector adaptation as well as human pathogenicity in Eurasian Lyme borreliosis agents.

Umap and Bismap: quantifying genome and methylome mappability

10.1101/095463 ◽

2016 ◽

Cited By ~ 5

Author(s):

Mehran Karimzadeh ◽

Carl Ernst ◽

Anshul Kundaje ◽

Michael M. Hoffman

Keyword(s):

Genetic Variation ◽

Bisulfite Sequencing ◽

Cpg Islands ◽

Read Length ◽

Methylation Array ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Errors ◽

Link Type ◽

A Genome

Abstract: Motivation: Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding, and chemical modifications. Every region in a genome assembly has a property called mappability, which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. At best, sequencing assays will produce misleadingly low numbers of reads in these regions. At worst, these regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. While many tools consider mappability during the read mapping process, subsequent analysis often loses this information. Both to correct assumptions of uniformity in downstream analysis, and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. Results: We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. With a read length of 24 bp, 18.7% of the unmodified genome and 33.5% of the bisulfite-converted genome is not uniquely mappable. This complicates interpretation of functional genomics experiments using short-read sequencing, especially in regulatory regions. For example, 81% of human CpG islands overlap with regions that are not uniquely mappable. Similarly, in some ENCODE ChIP-seq datasets, up to 50% of peaks overlap with regions that are not uniquely mappable. We also explored differentially methylated regions from a case-control study and identified regions that were not uniquely mappable. In the widely used 450K methylation array, 4,230 probes are not uniquely mappable. Genome mappability is higher with longer sequencing reads, but most publicly available ChIP-seq and reduced representation bisulfite sequencing datasets have shorter reads. Therefore, uneven and low mappability remains a concern in a majority of existing data. Availability: A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at http://bismap.hoffmanlab.org for use with the UCSC and Ensembl genome browsers. We have deposited in Zenodo the current version of our software (https://doi.org/10.5281/zenodo.800648) and the mappability data used in this project (https://doi.org/10.5281/zenodo.800645). In addition, the software (https://bitbucket.org/hoffmanlab/umap) is freely available under the GNU General Public License, version 3 (GPLv3).
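The notion of mappability described in this abstract can be made concrete with a toy Python sketch. It is a deliberate simplification and not Umap's or Bismap's algorithm: it treats a read start as uniquely mappable when its k-mer occurs exactly once on the forward strand, ignores reverse complements and mismatches, and models bisulfite conversion as a full C-to-T replacement. The function names are hypothetical.

```python
from collections import Counter


def unique_kmer_starts(genome: str, k: int):
    """For each start position, report whether the k-mer there occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


def bisulfite_convert(genome: str) -> str:
    """Model complete bisulfite conversion on the forward strand: every C reads as T."""
    return genome.replace("C", "T")


def fraction_not_unique(genome: str, k: int) -> float:
    """Fraction of k-mer start positions that are not uniquely mappable."""
    flags = unique_kmer_starts(genome, k)
    return 1 - sum(flags) / len(flags)


# Toy genome: every 4-mer is unique before conversion, but collapsing C into T
# makes "ACGT" and "ATGT" indistinguishable, so uniqueness drops afterwards.
genome = "ACGTATGTAA"
before = fraction_not_unique(genome, 4)
after = fraction_not_unique(bisulfite_convert(genome), 4)
```

Reducing the alphabet can only merge k-mers that were previously distinct, which is consistent with the abstract's observation that the bisulfite-converted genome is less uniquely mappable than the unmodified one.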

Ultra-low input single tube linked-read library method enables short-read NGS systems to generate highly accurate and economical long-range sequencing information for de novo genome assembly and haplotype phasing

10.1101/852947 ◽

2019 ◽

Cited By ~ 3

Author(s):

Zhoutao Chen ◽

Long Pham ◽

Tsai-Chin Wu ◽

Guoya Mo ◽

Yu Xia ◽

...

Keyword(s):

Long Range ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

De Novo Genome Assembly ◽

Short Read ◽

Single Tube ◽

Haplotype Phasing ◽

A Genome ◽

Long Read

Abstract: Long-range sequencing information is required for haplotype phasing, de novo assembly and structural variation detection. Current long-read sequencing technologies can provide valuable long-range information but at a high cost with low accuracy and high DNA input requirement. We have developed a single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-Seq™) technology, which enables a low-cost, high-accuracy and high-throughput short-read next generation sequencer to routinely generate over 100 kb long-range sequencing information with as little as 0.1 ng input material. In a PCR tube, millions of clonally barcoded beads are used to uniquely barcode long DNA molecules in an open bulk reaction without dilution and compartmentation. The barcode linked reads are used to successfully assemble genomes ranging from microbes to human. These linked-reads also generate mega-base-long phased blocks and provide a cost-effective tool for detecting structural variants in a genome, which are important to identify compound heterozygosity in recessive Mendelian diseases and discover genetic drivers and diagnostic biomarkers in cancers.

Resolving the Full Spectrum of Human Genome Variation using Linked-Reads

10.1101/230946 ◽

2017 ◽

Cited By ~ 8

Author(s):

Patrick Marks ◽

Sarah Garcia ◽

Alvaro Martinez Barrio ◽

Kamila Belhocine ◽

Jorge Bernate ◽

...

Keyword(s):

Human Genome ◽

Large Scale ◽

De Novo ◽

Simultaneous Detection ◽

Whole Genome ◽

Structural Variations ◽

Full Spectrum ◽

Short Read ◽

Short Reads ◽

A Genome

Abstract: Large-scale population-based analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short-read whole genome sequencing. However, standard short-read approaches, used primarily due to accuracy, throughput and costs, fail to give a complete picture of a genome. They struggle to identify large, balanced structural events, cannot access repetitive regions of the genome and fail to resolve the human genome into its two haplotypes. Here we describe an approach that retains long-range information while harnessing the advantages of short reads. Starting from only ∼1 ng of DNA, we produce barcoded short-read libraries. The use of novel informatic approaches allows for the barcoded short reads to be associated with the long molecules of origin, producing a novel datatype known as ‘Linked-Reads’. This approach allows for simultaneous detection of small and large variants from a single Linked-Read library. We have previously demonstrated the utility of whole genome Linked-Reads (lrWGS) for performing diploid, de novo assembly of individual genomes (Weisenfeld et al. 2017). In this manuscript, we show the advantages of Linked-Reads over standard short-read approaches for reference-based analysis. We demonstrate the ability of Linked-Reads to reconstruct megabase-scale haplotypes and to recover parts of the genome that are typically inaccessible to short reads, including phenotypically important genes such as STRC, SMN1 and SMN2. We demonstrate the ability of both lrWGS and Linked-Read Whole Exome Sequencing (lrWES) to identify complex structural variations, including balanced events, single exon deletions, and single exon duplications. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.

Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2

Genes ◽

10.3390/genes11020141 ◽

2020 ◽

Vol 11 (2) ◽

pp. 141 ◽

Cited By ~ 5

Author(s):

Feichen Shen ◽

Jeffrey M. Kidd

Keyword(s):

Copy Number Variation ◽

Copy Number ◽

Sequence Data ◽

Data Sets ◽

Short Read ◽

Major Mechanism ◽

Rapid Construction ◽

A Genome ◽

Number Variation ◽

Short Read Sequence

Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two, 92 genes where the majority of samples have a copy number other than two, and describe rare copy-number variation affecting multiple genes at the APOBEC3 locus.
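The mapping-free idea of tabulating unique k-mers in reads can be sketched in Python as follows. This is a toy illustration under stated assumptions and not QuicK-mer2's actual pipeline: the normalization against a control region of known copy number, the use of medians, the forward-strand-only counting, and all function and parameter names are hypothetical choices made for the example.

```python
from collections import Counter
from statistics import median


def kmer_depths(reads, kmers, k: int):
    """Count how often each k-mer of interest appears across all reads
    (forward strand only; reverse complements are ignored in this sketch)."""
    wanted = set(kmers)
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in wanted:
                counts[kmer] += 1
    return [counts[kmer] for kmer in kmers]


def estimate_copy_number(region_kmers, control_kmers, reads, k: int,
                         control_copies: int = 2) -> float:
    """Scale the median depth of a region's unique k-mers by the median depth
    of k-mers from a control region assumed to have `control_copies` copies.
    Raises ZeroDivisionError if the control k-mers are absent from the reads."""
    region_depth = median(kmer_depths(reads, region_kmers, k))
    control_depth = median(kmer_depths(reads, control_kmers, k))
    return control_copies * region_depth / control_depth
```

No read is ever aligned: depth comes purely from k-mer counts, so k-mers specific to one paralog yield a copy-number estimate for that paralog alone.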

CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows–Wheeler transform

Bioinformatics ◽

10.1093/bioinformatics/bts276 ◽

2012 ◽

Vol 28 (14) ◽

pp. 1830-1837 ◽

Cited By ~ 100

Author(s):

Yongchao Liu ◽

Bertil Schmidt ◽

Douglas L. Maskell

Keyword(s):

Short Read ◽

Short Read Aligner ◽

Burrows Wheeler Transform ◽

Large Genomes

Highly accurate and sensitive short read aligner

TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES ◽

10.3906/elk-1703-251 ◽

2018 ◽

Vol 26 (2) ◽

pp. 721-731

Author(s):

MEHMET YAĞMUR GÖK ◽

SEZER GÖREN UĞURDAĞ ◽

CEM ÜNSALAN ◽

MAHMUT ŞAMİL SAĞIROĞLU

Keyword(s):

Short Read ◽

Short Read Aligner

High conservation combined with high plasticity: genomics and evolution of Borrelia bavariensis

BMC Genomics ◽

10.1186/s12864-020-07054-3 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Noémie S. Becker ◽

Robert E. Rollins ◽

Kateryna Nosenko ◽

Alexander Paulus ◽

Samantha Martin ◽

...

Keyword(s):

Lyme Borreliosis ◽

Next Generation Sequencing Data ◽

High Plasticity ◽

Sequencing Data ◽

Short Read ◽

Genome Reconstruction ◽

A Genome ◽

Long Read ◽

Genomic Studies ◽

High Conservation

Abstract: Background: Borrelia bavariensis is one of the agents of Lyme Borreliosis (or Lyme disease) in Eurasia. The genome of the Borrelia burgdorferi sensu lato species complex, which includes B. bavariensis, is known to be very complex and fragmented, making the assembly of whole genomes with next-generation sequencing data a challenge. Results: We present a genome reconstruction for 33 B. bavariensis isolates from Eurasia based on long-read (Pacific Bioscience, for three isolates) and short-read (Illumina) data. We show that the combination of both sequencing techniques allows proper genome reconstruction of all plasmids in most cases but use of a very close reference is necessary when only short-read sequencing data is available. B. bavariensis genomes combine a high degree of genetic conservation with high plasticity: all isolates share the main chromosome and five plasmids, but the repertoire of other plasmids is highly variable. In addition to plasmid losses and gains through horizontal transfer, we also observe several fusions between plasmids. Although European isolates of B. bavariensis have little diversity in genome content, there is some geographic structure to this variation. In contrast, each Asian isolate has a unique plasmid repertoire and we observe no geographically based differences between Japanese and Russian isolates. Comparing the genomes of Asian and European populations of B. bavariensis suggests that some genes which are markedly different between the two populations may be good candidates for adaptation to the tick vector (Ixodes ricinus in Europe and I. persulcatus in Asia). Conclusions: We present the characterization of genomes of a large sample of B. bavariensis isolates and show that their plasmid content is highly variable. This study opens the way for genomic studies seeking to understand host and vector adaptation as well as human pathogenicity in Eurasian Lyme Borreliosis agents.
