read mapping Latest Research Papers

Low guanine content and biased nucleotide distribution in vertebrate mtDNA can cause overestimation of non-CpG methylation

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab119 ◽

2022 ◽

Vol 4 (1) ◽

Author(s):

Takashi Okada ◽

Xin Sun ◽

Stephen McIlfatrick ◽

Justin C St. John

Keyword(s):

Sus Scrofa ◽

Bisulfite Sequencing ◽

Amplicon Sequencing ◽

CpG Methylation ◽

Data Sets ◽

mtDNA Sequences ◽

Read Mapping ◽

Specific Sequence ◽

D Loop ◽

Over 40 Years

Mitochondrial DNA (mtDNA) methylation in vertebrates has been hotly debated for over 40 years. Most contrasting results have been reported following bisulfite sequencing (BS-seq) analyses. We addressed whether BS-seq experimental and analysis conditions influenced the estimation of the levels of methylation in specific mtDNA sequences. We found false positive non-CpG methylation in the CHH context (fpCHH) using unmethylated Sus scrofa mtDNA BS-seq data. fpCHH methylation was detected on the top/plus strand of mtDNA within low guanine content regions. These top/plus strand sequences of fpCHH regions would become extremely AT-rich sequences after BS-conversion, whilst bottom/minus strand sequences remained almost unchanged. These unique sequences caused BS-seq aligners to falsely assign the origin of each strand in fpCHH regions, resulting in false methylation calls. fpCHH methylation detection was enhanced by short sequence reads, short library inserts, skewed top/bottom read ratios and non-directional read mapping modes. We confirmed no detectable CHH methylation in fpCHH regions by BS-amplicon sequencing. The fpCHH peaks were located in the D-loop, ATP6, ND2, ND4L, ND5 and ND6 regions and identified in our S. scrofa ovary and oocyte data and human BS-seq data sets. We conclude that non-CpG methylation could potentially be overestimated in specific sequence regions by BS-seq analysis.
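The strand asymmetry described in this abstract can be illustrated with an in silico conversion. The following is a minimal sketch, not the authors' pipeline: the fragment is a made-up low-guanine sequence, the function names are hypothetical, and full conversion of unmethylated cytosines is assumed.

```python
def convert_top_strand(seq: str) -> str:
    """Unmethylated top/plus strand after bisulfite treatment: C reads as T."""
    return seq.upper().replace("C", "T")


def convert_bottom_strand(seq: str) -> str:
    """Unmethylated bottom/minus strand, written in top-strand coordinates:
    a converted C on the bottom strand appears as G -> A on the top strand."""
    return seq.upper().replace("G", "A")


def at_fraction(seq: str) -> float:
    """Fraction of bases that are A or T."""
    return sum(base in "AT" for base in seq) / len(seq)


def changed_fraction(before: str, after: str) -> float:
    """Fraction of positions altered by the conversion."""
    return sum(a != b for a, b in zip(before, after)) / len(before)


# Hypothetical low-guanine fragment (invented for illustration only).
fragment = "ACCCTACTAACCCCATTACCACACTCAACCTAGCCCTA"

top = convert_top_strand(fragment)
bottom = convert_bottom_strand(fragment)

print("top strand AT fraction after conversion:", round(at_fraction(top), 3))
print("top strand positions changed:", round(changed_fraction(fragment, top), 3))
print("bottom strand positions changed:", round(changed_fraction(fragment, bottom), 3))
```

With only one guanine in the fragment, the converted top strand is almost entirely A and T, while the bottom-strand version differs from the original at a single position. That mirrors the situation in which an aligner could plausibly confuse the strand of origin.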

Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers

10.1101/2022.01.11.475838 ◽

2022 ◽

Author(s):

Luiz Carlos Irber ◽

Phillip T Brooks ◽

Taylor E Reiter ◽

N Tessa Pierce-Ward ◽

Mahmudur Rahman Hera ◽

...

Keyword(s):

Large Scale ◽

Compositional Analysis ◽

Set Cover ◽

Sequencing Data ◽

Taxonomic Assignment ◽

Read Mapping ◽

Set Cover Problem ◽

Metagenome Sequencing ◽

Reference Genomes ◽

Selection Of

The identification of reference genomes and taxonomic labels from metagenome data underlies many microbiome studies. Here we describe two algorithms for compositional analysis of metagenome sequencing data. We first investigate the FracMinHash sketching technique, a derivative of modulo hash that supports Jaccard containment estimation between sets of different sizes. We implement FracMinHash in the sourmash software, evaluate its accuracy, and demonstrate large-scale containment searches of metagenomes using 700,000 microbial reference genomes. We next frame shotgun metagenome compositional analysis as the problem of finding a minimum collection of reference genomes that "cover" the known k-mers in a metagenome, a minimum set cover problem. We implement a greedy approximate solution using FracMinHash sketches, and evaluate its accuracy for taxonomic assignment using a CAMI community benchmark. Finally, we show that the minimum metagenome cover can be used to guide the selection of reference genomes for read mapping. sourmash is available as open source software under the BSD 3-Clause license at github.com/dib-lab/sourmash/.
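The two ideas in this abstract, FracMinHash sketching and a greedy minimum set cover, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not sourmash's implementation: the hash function (BLAKE2b truncated to 64 bits), the tiny k-mer size, and all function names are choices made here for the example.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space assumed in this sketch


def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer: str) -> int:
    """64-bit hash of a k-mer (BLAKE2b is an assumption of this example)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq: str, k: int = 5, scaled: int = 2) -> set:
    """Keep every hash below MAX_HASH / scaled, i.e. roughly 1/scaled of them."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def containment(query: set, reference: set) -> float:
    """Estimated fraction of the query's k-mers that occur in the reference."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


def greedy_cover(metagenome: set, references: dict):
    """Greedy approximation to minimum set cover over sketch hashes.

    Repeatedly pick the reference that explains the most still-uncovered
    hashes; stop when nothing more can be explained.
    """
    remaining = set(metagenome)
    chosen = []
    while remaining:
        best = max(references, key=lambda name: len(remaining & references[name]))
        gained = remaining & references[best]
        if not gained:
            break  # leftover hashes match no reference
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Toy data: two made-up "genomes" and a metagenome sketch built from both.
genome_a = frac_minhash("ACGTACGTTGCAAGCTTAGC", scaled=1)
genome_b = frac_minhash("TTGACCATGGCATCAGGTAC", scaled=1)
metagenome = genome_a | genome_b
print(greedy_cover(metagenome, {"genome_a": genome_a, "genome_b": genome_b}))
```

Because the retained hashes depend only on a fixed threshold, sketches of sets with very different sizes remain comparable, which is what makes the containment estimate meaningful.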

Using syncmers improves long-read mapping

10.1101/2022.01.10.475696 ◽

2022 ◽

Author(s):

David Pellow ◽

Abhinav Dutta ◽

Ron Shamir

Keyword(s):

Read Mapping ◽

Mapping Algorithms ◽

Sequencing Errors ◽

Sequence Identity ◽

Memory Efficiency ◽

Long Reads ◽

Long Read ◽

Cancer Tumor ◽

Generation Sequencing ◽

Unmapped Reads

As sequencing datasets keep growing larger, time and memory efficiency of read mapping are becoming more critical. Many clever algorithms and data structures were used to develop mapping tools for next generation sequencing, and in the last few years also for third generation long reads. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. Here we introduce parameterized syncmer schemes, and provide a theoretical analysis for multi-parameter schemes. By combining these schemes with downsampling or minimizers we can achieve any desired compression and window guarantee. We introduced syncmer schemes into the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long read data from a variety of genomes, the syncmer-based algorithms reduced unmapped reads by 20-60% at high compression while using less memory. The advantage of syncmer-based mapping was even more pronounced at lower sequence identity. At sequence identity of 65-75% and medium compression, syncmer mappers had 50-60% fewer unmapped reads, and ∼10% fewer of the reads that did map were incorrectly mapped. We conclude that syncmer schemes improve mapping under higher error and mutation rates. This situation happens, for example, when the high error rate of long reads is compounded by a high mutation rate in a cancer tumor, or due to differences between strains of viruses or bacteria.
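To make the sketching idea concrete, here is a minimal sketch of syncmer selection. It assumes the usual definition, in which a k-mer is selected when its smallest s-mer falls at one of a set of allowed offsets, and it orders s-mers lexicographically for readability (real mappers typically order by a hash). The function names are hypothetical and this is not the code added to minimap2 or Winnowmap2.

```python
def min_smer_position(kmer: str, s: int) -> int:
    """Offset of the leftmost smallest s-mer inside a k-mer."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers))


def is_syncmer(kmer: str, s: int, positions: set) -> bool:
    """A k-mer is selected if its minimal s-mer sits at an allowed offset."""
    return min_smer_position(kmer, s) in positions


def select_syncmers(seq: str, k: int, s: int, positions: set) -> list:
    """Return (start, k-mer) pairs for every selected k-mer in seq."""
    return [
        (i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if is_syncmer(seq[i:i + k], s, positions)
    ]


k, s = 5, 2
open_positions = {0}            # open syncmers: minimal s-mer at the start
closed_positions = {0, k - s}   # closed syncmers: at the start or the end

seq = "TTGACGTTACGT"
print(select_syncmers(seq, k, s, open_positions))
print(select_syncmers(seq, k, s, closed_positions))
```

Unlike a minimizer, whether a k-mer is selected here depends only on the k-mer itself and not on its neighbours in a window, so a mutation elsewhere in the read cannot change the decision for an unaffected k-mer.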

Cluster-specific gene markers enhance Shigella and enteroinvasive Escherichia coli in silico serotyping

Microbial Genomics ◽

10.1099/mgen.0.000704 ◽

2021 ◽

Vol 7 (12) ◽

Author(s):

Xiaomei Zhang ◽

Michael Payne ◽

Thanh Nguyen ◽

Sandeep Kaur ◽

Ruiting Lan

Keyword(s):

Escherichia Coli ◽

In Silico ◽

Type Species ◽

Specific Gene ◽

Read Mapping ◽

Gene Markers ◽

Genetic Characteristics ◽

Content Type ◽

Link Type ◽

Invasion Mechanisms

Shigella and enteroinvasive Escherichia coli (EIEC) cause human bacillary dysentery with similar invasion mechanisms and share similar physiological, biochemical and genetic characteristics. Differentiation of Shigella from EIEC is important for clinical diagnostic and epidemiological investigations. However, phylogenetically, Shigella and EIEC strains are composed of multiple clusters and are different forms of E. coli, making it difficult to find genetic markers to discriminate between Shigella and EIEC. In this study, we identified 10 Shigella clusters, seven EIEC clusters and 53 sporadic types of EIEC by examining over 17,000 publicly available Shigella and EIEC genomes. We compared Shigella and EIEC accessory genomes to identify cluster-specific gene markers for the 17 clusters and 53 sporadic types. The cluster-specific gene markers showed 99.64% accuracy and more than 97.02% specificity. In addition, we developed a freely available in silico serotyping pipeline named Shigella EIEC Cluster Enhanced Serotype Finder (ShigEiFinder) by incorporating the cluster-specific gene markers and established Shigella and EIEC serotype-specific O antigen genes and modification genes into typing. ShigEiFinder can process either paired-end Illumina sequencing reads or assembled genomes and almost perfectly differentiated Shigella from EIEC, with 99.70% and 99.74% cluster assignment accuracy for the assembled genomes and read mapping, respectively. ShigEiFinder was able to serotype over 59 Shigella serotypes and 22 EIEC serotypes and provided a high serotyping specificity of 99.40% for assembled genomes and 99.38% for read mapping. The cluster-specific gene markers and our new serotyping tool, ShigEiFinder (installable package: https://github.com/LanLab/ShigEiFinder, online tool: https://mgtdb.unsw.edu.au/ShigEiFinder/), will be useful for epidemiological and diagnostic investigations.

Fast alignment of reads to a variation graph with application to SNP detection

Journal of Integrative Bioinformatics ◽

10.1515/jib-2021-0032 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Maurilio Monsu ◽

Matteo Comin

Keyword(s):

Low Cost ◽

Variant Calling ◽

Read Mapping ◽

Base Level ◽

Sequencing Technologies ◽

Alignment Tool ◽

Sequencing Studies ◽

Major Bottleneck ◽

High Base ◽

Similar Accuracy

Sequencing technologies have provided the basis of most modern genome sequencing studies due to their high base-level accuracy and relatively low cost. One of the most demanding steps is mapping reads to the human reference genome. The reliance on a single reference human genome could introduce substantial biases in downstream analyses. Pangenomic graph reference representations offer an attractive approach for storing genetic variations. Moreover, it is possible to include known variants in the reference in order to make read mapping, variant calling, and genotyping variant-aware. Only recently has a framework for variation graphs, vg [Garrison E, Adam MN, Siren J, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875–9], improved variation-aware alignment and variant calling in general. The major bottleneck of vg is the high cost of mapping reads to a variation graph. In this paper we study the problem of SNP calling on a variation graph and present a fast read alignment tool named VG SNP-Aware. VG SNP-Aware is able to align reads exactly to a variation graph and detect SNPs based on these aligned reads. The results show that VG SNP-Aware can efficiently map reads to a variation graph with a speedup of 40× with respect to vg and similar accuracy on SNP detection.

Fast and memory-efficient mapping of short bisulfite sequencing reads using a two-letter alphabet

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab115 ◽

2021 ◽

Vol 3 (4) ◽

Author(s):

Guilherme de Sena Brandine ◽

Andrew D Smith

Keyword(s):

Cytosine Methylation ◽

Bisulfite Sequencing ◽

Software Tool ◽

Read Mapping ◽

Mapping Algorithm ◽

Letter Alphabet ◽

Mapping Software ◽

Wide Range ◽

Range Of Functions ◽

Similar Accuracy

DNA cytosine methylation is an important epigenomic mark with a wide range of functions in many organisms. Whole genome bisulfite sequencing is the gold standard to interrogate cytosine methylation genome-wide. Algorithms used to map bisulfite-converted reads often encode the four-base DNA alphabet with three letters by reducing two bases to a common letter. This encoding substantially reduces the entropy of nucleotide frequencies in the resulting reference genome. Within the paradigm of read mapping by first filtering possible candidate alignments, reduced entropy in the sequence space can increase the required computing effort. We introduce another bisulfite mapping algorithm (abismal), based on the idea of encoding a four-letter DNA sequence as only two letters, one for purines and one for pyrimidines. We show that this encoding can lead to greater specificity compared to existing encodings used to map bisulfite sequencing reads. Through the two-letter encoding, the abismal software tool maps reads in less time and using less memory than most bisulfite sequencing read mapping software tools, while attaining similar accuracy. This allows in silico methylation analysis to be performed in a wider range of computing machines with limited hardware settings.
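The purine/pyrimidine encoding described in this abstract can be shown with a small example. This is a sketch of the encoding idea only, not abismal's indexing or alignment code; the letters R and Y and the function names are choices made here for illustration.

```python
# Purines (A, G) map to one letter, pyrimidines (C, T) to the other.
TWO_LETTER = str.maketrans("AGCT", "RRYY")


def encode_two_letter(seq: str) -> str:
    """Collapse a DNA sequence onto a purine/pyrimidine alphabet."""
    return seq.upper().translate(TWO_LETTER)


def bisulfite_c_to_t(seq: str) -> str:
    """Reads from the original strand: unmethylated C is read as T."""
    return seq.upper().replace("C", "T")


def bisulfite_g_to_a(seq: str) -> str:
    """Reads from the complementary strand: G appears as A."""
    return seq.upper().replace("G", "A")


reference = "ACGTTCGGA"
print(encode_two_letter(reference))
print(encode_two_letter(bisulfite_c_to_t(reference)))
print(encode_two_letter(bisulfite_g_to_a(reference)))
```

All three printed strings are identical: C and T are both pyrimidines and A and G are both purines, so neither type of bisulfite conversion changes the encoded sequence. A single two-letter index can therefore serve reads of either conversion type in the filtering step.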

Struo2: efficient metagenome profiling database construction for ever-expanding microbial genome datasets

PeerJ ◽

10.7717/peerj.12198 ◽

2021 ◽

Vol 9 ◽

pp. e12198

Author(s):

Nicholas D. Youngblut ◽

Ruth E. Ley

Keyword(s):

Functional Diversity ◽

Genomic Data ◽

Metagenomic Data ◽

Gene Sequences ◽

Read Mapping ◽

Microbial Genomes ◽

Individual Gene ◽

Database Construction ◽

Microbiome Diversity ◽

Reference Databases

Mapping metagenome reads to reference databases is the standard approach for assessing microbial taxonomic and functional diversity from metagenomic data. However, public reference databases often lack recently generated genomic data such as metagenome-assembled genomes (MAGs), which can limit the sensitivity of read-mapping approaches. We previously developed the Struo pipeline in order to provide a straightforward method for constructing custom databases; however, the pipeline does not scale well enough to cope with the ever-increasing number of publicly available microbial genomes. Moreover, the pipeline does not allow for efficient database updating as new data are generated. To address these issues, we developed Struo2, which is >3.5-fold faster than Struo at database generation and can also efficiently update existing databases. We also provide custom Kraken2, Bracken, and HUMAnN3 databases that can be easily updated with new genomes and/or individual gene sequences. Efficient database updating, coupled with our pre-generated databases, enables “assembly-enhanced” profiling, which increases database comprehensiveness via inclusion of native genomic content. Inclusion of newly generated genomic content can greatly increase database comprehensiveness, especially for understudied biomes, which will enable more accurate assessments of microbiome diversity.

Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment

GigaScience ◽

10.1093/gigascience/giab063 ◽

2021 ◽

Vol 10 (9) ◽

Author(s):

Yilei Fu ◽

Medhat Mahmoud ◽

Viginesh Vaibhav Muraliraman ◽

Fritz J Sedlazeck ◽

Todd J Treangen

Keyword(s):

Human Genome ◽

Variant Calling ◽

Dual Mode ◽

Read Mapping ◽

Structural Variant ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

High Level ◽

Improved Accuracy

Background: Long-read sequencing has enabled unprecedented surveys of structural variation across the entire human genome. To maximize the potential of long-read sequencing in this context, novel mapping methods have emerged that have primarily focused on either speed or accuracy. Various heuristics and scoring schemas have been implemented in widely used read mappers (minimap2 and NGMLR) to optimize for speed or accuracy, which have variable performance across different genomic regions and for specific structural variants. Our hypothesis is that constraining read mapping to the use of a single gap penalty across distinct mutational hot spots reduces read alignment accuracy and impedes structural variant detection. Findings: We tested our hypothesis by implementing a read-mapping pipeline called Vulcan that uses two distinct gap penalty modes, which we refer to as dual-mode alignment. The high-level idea is that Vulcan leverages the computed normalized edit distance of the mapped reads via minimap2 to identify poorly aligned reads and realigns them using the more accurate yet computationally more expensive long-read mapper (NGMLR). In support of our hypothesis, we show that Vulcan improves the alignments for Oxford Nanopore Technology long reads for both simulated and real datasets. These improvements, in turn, lead to improved accuracy for structural variant calling performance on human genome datasets compared to either of the read-mapping methods alone. Conclusions: Vulcan is the first long-read mapping framework that combines two distinct gap penalty modes for improved structural variant recall and precision. Vulcan is open-source and available under the MIT License at https://gitlab.com/treangenlab/vulcan.
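The read-selection step described under Findings can be sketched as follows. This is a hedged illustration, not Vulcan's code: it assumes the edit distance is normalized by aligned length, that aligned lengths are positive, and that reads are split by a rank-based cutoff. The `keep_fraction` parameter, the `MappedRead` type, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MappedRead:
    name: str
    edit_distance: int    # e.g. taken from an alignment's edit-distance field
    aligned_length: int   # number of aligned bases, assumed > 0


def normalized_edit_distance(read: MappedRead) -> float:
    """Edit distance per aligned base (normalization is an assumption here)."""
    return read.edit_distance / read.aligned_length


def split_for_realignment(reads: list, keep_fraction: float = 0.9):
    """Keep the best-aligned fraction of reads; return the rest for realignment.

    Reads are ranked by normalized edit distance, lowest first. The worst
    (1 - keep_fraction) would be handed to the slower, more accurate mapper.
    """
    ranked = sorted(reads, key=normalized_edit_distance)
    n_keep = int(len(ranked) * keep_fraction)
    return ranked[:n_keep], ranked[n_keep:]


reads = [
    MappedRead("r1", 5, 100),
    MappedRead("r2", 30, 100),
    MappedRead("r3", 10, 200),
    MappedRead("r4", 50, 200),
]
kept, to_realign = split_for_realignment(reads, keep_fraction=0.5)
print([r.name for r in kept], [r.name for r in to_realign])
```

Normalizing by length matters for long reads: without it, a long, well-aligned read would look worse than a short, poorly aligned one simply because it has more bases in which to accumulate errors.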

hafeZ: Active prophage identification through read mapping

10.1101/2021.07.21.453177 ◽

2021 ◽

Author(s):

Christopher J. R. Turkington ◽

Neda Nezam Abadi ◽

Robert A. Edwards ◽

Juris A. Grasis

Keyword(s):

Open Source ◽

Bacterial Communities ◽

Infected Host ◽

Host Bacterium ◽

Sequencing Data ◽

Bacterial Genomes ◽

Read Mapping ◽

Surrounding Environment ◽

Bacterial Chromosomes

Bacteriophages that have integrated their genomes into bacterial chromosomes, termed prophages, are widespread across bacteria. Prophages are key components of bacterial genomes, with their integration often contributing novel, beneficial, characteristics to the infected host. Likewise, their induction—through the production and release of progeny virions into the surrounding environment—can have considerable ramifications on bacterial communities. Yet, not all prophages can excise following integration, due to genetic degradation by their host bacterium. Here, we present hafeZ, a tool able to identify 'active' prophages (i.e. those undergoing induction) within bacterial genomes through genomic read mapping. We demonstrate its use by applying hafeZ to publicly available sequencing data from bacterial genomes known to contain active prophages and show that hafeZ can accurately identify their presence and location in the host chromosomes. Availability and Implementation: hafeZ is implemented in Python 3.7 and freely available under an open-source GPL-3.0 license from https://github.com/Chrisjrt/hafeZ. Bugs and issues may be reported by submitting them via the hafeZ github issues page.

Merfin: improved variant filtering and polishing via k-mer validation

10.1101/2021.07.16.452324 ◽

2021 ◽

Author(s):

Giulio Formenti ◽

Arang Rhie ◽

Brian P Walenz ◽

Francoise Thibaud-Nissen ◽

Kishwar Shafin ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Read Mapping ◽

Mapping Algorithm ◽

Copy Numbers ◽

Long Reads ◽

Variant Filtering ◽

Long Read ◽

Finishing Tool

Read mapping and variant calling approaches have been widely used for accurate genotyping and improving consensus quality assembled from noisy long reads. Variant calling accuracy relies heavily on the read quality, the precision of the read mapping algorithm and variant caller, and the criteria adopted to filter the calls. However, it is impossible to define a single set of optimal parameters, as they vary depending on the quality of the read set, the variant caller of choice, and the quality of the unpolished assembly. To overcome this issue, we have devised a new tool called Merfin (k-mer based finishing tool), a k-mer based variant filtering algorithm for improved genotyping and polishing. Merfin evaluates the accuracy of a call based on expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller internal score. Moreover, we introduce novel assembly quality and completeness metrics that account for the expected genomic copy numbers. Merfin significantly increased the precision of a variant call and reduced frameshift errors when applied to PacBio HiFi, PacBio CLR, or Nanopore long read based assemblies. We demonstrate the utility while polishing the first complete human genome, a fully phased human genome, and non-human high-quality genomes.
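The idea of judging a call by the k-mers it creates, rather than by alignment or caller scores, can be sketched in simplified form. This is a stand-in for the concept and not Merfin's algorithm: it handles only single-base substitutions, treats any k-mer absent from the reads as evidence against a sequence, and ignores expected copy number. All names and the toy data are hypothetical.

```python
from collections import Counter


def count_kmers(sequences: list, k: int) -> Counter:
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def missing_kmers(seq: str, read_counts: Counter, k: int) -> int:
    """Number of k-mers in seq that never occur in the reads."""
    return sum(read_counts[seq[i:i + k]] == 0 for i in range(len(seq) - k + 1))


def apply_substitution(seq: str, pos: int, alt: str) -> str:
    """Replace the base at pos with alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def keep_variant(assembly: str, pos: int, alt: str, read_counts: Counter, k: int) -> bool:
    """Accept a substitution only if it reduces read-unsupported k-mers.

    Only the k-mers overlapping the edited position are compared.
    """
    lo, hi = max(0, pos - k + 1), min(len(assembly), pos + k)
    ref_window = assembly[lo:hi]
    alt_window = apply_substitution(assembly, pos, alt)[lo:hi]
    return missing_kmers(alt_window, read_counts, k) < missing_kmers(ref_window, read_counts, k)


k = 4
reads = ["ACGTACGG", "TACGGTCA"]      # toy error-free reads
read_counts = count_kmers(reads, k)
draft = "ACGTATGGTCA"                 # toy assembly with an error at index 5

print(keep_variant(draft, 5, "C", read_counts, k))          # corrective edit
print(keep_variant("ACGTACGGTCA", 5, "T", read_counts, k))  # harmful edit
```

The decision uses nothing but k-mer presence in the reads, which is what makes this kind of filter independent of the mapper and variant caller that proposed the edit.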
