URMAP, an ultra-fast read mapper

Mapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA with comparable accuracy on several validation tests. On a Genome in a Bottle (GIAB) variant calling test with 30× coverage 2×150 reads, URMAP achieves high accuracy (precision 0.998, sensitivity 0.982 and F-measure 0.990) with the strelka2 caller. However, GIAB reference variants are shown to be biased against repetitive regions which are difficult to map and may therefore pose an unrealistically easy challenge to read mappers and variant callers.
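The reported F-measure can be checked against the quoted precision and sensitivity. The sketch below assumes the F-measure is the usual harmonic mean of the two (F1); the abstract does not state the formula, so that definition is an assumption, and the function name is illustrative only.

```python
def f_measure(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (F1).

    Assumed definition; the abstract does not spell out the formula.
    """
    return 2 * precision * sensitivity / (precision + sensitivity)


# Values quoted in the abstract for URMAP with the strelka2 caller.
f1 = f_measure(0.998, 0.982)
print(f"{f1:.3f}")
```

Under that assumption the result rounds to the 0.990 given in the abstract.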

URMAP, an ultra-fast read mapper

10.1101/2020.01.12.903351 ◽

2020 ◽

Cited By ~ 1

Author(s):

Robert C. Edgar

Keyword(s):

Read Mapping ◽

Mapping Algorithm ◽

Sequencing Technologies ◽

Large Size ◽

Mapping Software ◽

Biological Studies ◽

Wide Range ◽

Order Of Magnitude ◽

Comparable Accuracy ◽

Generation Sequencing

Abstract Mapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA and Bowtie2 with comparable accuracy on a benchmark test using simulated paired 150nt reads of a well-studied human genome. Software is freely available at https://drive5.com/urmap.

Nebula: ultra-efficient mapping-free structural variant genotyper

Nucleic Acids Research ◽

10.1093/nar/gkab025 ◽

2021 ◽

Author(s):

Parsoa Khorsand ◽

Fereydoun Hormozdiari

Keyword(s):

Large Scale ◽

Structural Variants ◽

Sequencing Technologies ◽

Generic Framework ◽

Common Genetic Variants ◽

Order Of Magnitude ◽

Complex Events ◽

Comparable Accuracy ◽

Using Data ◽

Computational Resources

Abstract Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, genotyping these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, uses changes in the counts of k-mers to predict the genotype of structural variants. We show that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants but also has accuracy comparable to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
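The idea of genotyping from k-mer counts rather than from alignments can be illustrated with a toy sketch. It is not Nebula's actual method: the function names, the thresholds, and the assumption that each allele has its own set of distinguishing k-mers are all hypothetical, and reverse complements and sequencing errors are ignored.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def toy_genotype(counts, ref_kmers, alt_kmers, low=0.2, high=0.8):
    """Call a genotype from the share of allele-specific k-mers that
    support the alternative allele. Thresholds are illustrative."""
    ref = sum(counts[kmer] for kmer in ref_kmers)
    alt = sum(counts[kmer] for kmer in alt_kmers)
    total = ref + alt
    if total == 0:
        return "./."  # no informative k-mers observed
    alt_fraction = alt / total
    if alt_fraction < low:
        return "0/0"
    if alt_fraction > high:
        return "1/1"
    return "0/1"


# Hypothetical reads: two support the reference allele, one the alternative.
reads = ["ACGTAC", "ACGTAC", "ACGGAC"]
counts = count_kmers(reads, k=4)
print(toy_genotype(counts, ref_kmers={"CGTA"}, alt_kmers={"CGGA"}))
```

Because no read is aligned, the cost is dominated by a single pass over the reads, which is the intuition behind the speed advantage described in the abstract.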

A Nearly Complete Genome of Ciona intestinalis Type A (C. robusta) Reveals the Contribution of Inversion to Chromosomal Evolution in the Genus Ciona

Genome Biology and Evolution ◽

10.1093/gbe/evz228 ◽

2019 ◽

Vol 11 (11) ◽

pp. 3144-3157 ◽

Cited By ~ 12

Author(s):

Yutaka Satou ◽

Ryohei Nakamura ◽

Deli Yu ◽

Reiko Yoshida ◽

Mayuko Hamada ◽

...

Keyword(s):

Genome Size ◽

Inbred Line ◽

Genome Assembly ◽

Ciona Intestinalis ◽

Type A ◽

Chromosomal Evolution ◽

Sequence Information ◽

A Genome ◽

Biological Studies ◽

Wide Range

Abstract Since its initial publication in 2002, the genome of Ciona intestinalis type A (Ciona robusta), the first genome sequence of an invertebrate chordate, has provided a valuable resource for a wide range of biological studies, including developmental biology, evolutionary biology, and neuroscience. The genome assembly was updated in 2008, and it included 68% of the sequence information in 14 pairs of chromosomes. However, a more contiguous genome is required for analyses of higher order genomic structure and of chromosomal evolution. Here, we provide a new genome assembly for an inbred line of this animal, constructed with short and long sequencing reads and Hi-C data. In this latest assembly, over 95% of the 123 Mb of sequence data was included in the chromosomes. Short sequencing reads predicted a genome size of 114–120 Mb; therefore, it is likely that the current assembly contains almost the entire genome, although this estimate of genome size was smaller than previous estimates. Remapping of the Hi-C data onto the new assembly revealed a large inversion in the genome of the inbred line. Moreover, a comparison of this genome assembly with that of Ciona savignyi, a different species in the same genus, revealed many chromosomal inversions between these two Ciona species, suggesting that such inversions have occurred frequently and have contributed to chromosomal evolution of Ciona species. Thus, the present assembly greatly improves an essential resource for genome-wide studies of ascidians.

Fast and memory-efficient mapping of short bisulfite sequencing reads using a two-letter alphabet

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab115 ◽

2021 ◽

Vol 3 (4) ◽

Author(s):

Guilherme de Sena Brandine ◽

Andrew D Smith

Keyword(s):

Cytosine Methylation ◽

Bisulfite Sequencing ◽

Software Tool ◽

Read Mapping ◽

Mapping Algorithm ◽

Letter Alphabet ◽

Mapping Software ◽

Wide Range ◽

Range Of Functions ◽

Similar Accuracy

Abstract DNA cytosine methylation is an important epigenomic mark with a wide range of functions in many organisms. Whole genome bisulfite sequencing is the gold standard to interrogate cytosine methylation genome-wide. Algorithms used to map bisulfite-converted reads often encode the four-base DNA alphabet with three letters by reducing two bases to a common letter. This encoding substantially reduces the entropy of nucleotide frequencies in the resulting reference genome. Within the paradigm of read mapping by first filtering possible candidate alignments, reduced entropy in the sequence space can increase the required computing effort. We introduce another bisulfite mapping algorithm (abismal), based on the idea of encoding a four-letter DNA sequence as only two letters, one for purines and one for pyrimidines. We show that this encoding can lead to greater specificity compared to existing encodings used to map bisulfite sequencing reads. Through the two-letter encoding, the abismal software tool maps reads in less time and using less memory than most bisulfite sequencing read mapping software tools, while attaining similar accuracy. This allows in silico methylation analysis to be performed in a wider range of computing machines with limited hardware settings.
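A minimal sketch of the purine/pyrimidine encoding follows. It is not the abismal implementation: the letters R and Y (the IUPAC codes for purine and pyrimidine), the function names, and the toy full-conversion model are assumptions made for illustration.

```python
# Map purines (A, G) to "R" and pyrimidines (C, T) to "Y".
TWO_LETTER = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})


def encode_two_letter(seq: str) -> str:
    """Encode a DNA sequence in the two-letter purine/pyrimidine alphabet."""
    return seq.upper().translate(TWO_LETTER)


def bisulfite_convert(seq: str) -> str:
    """Toy model of complete conversion: every C is read as T
    (assumes no methylation and considers the forward strand only)."""
    return seq.upper().replace("C", "T")


genome = "ACGTCCGA"
read = bisulfite_convert(genome)

print(encode_two_letter(genome))
print(encode_two_letter(read))
```

Both calls print the same string, because a C-to-T change stays within the pyrimidines. A converted read can therefore be compared with the encoded reference without treating converted bases as mismatches. The complementary G-to-A change stays within the purines in the same way.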

LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract Motivation: Linked-Reads technologies combine the high quality and low cost of short-read sequencing with long-range information, through the use of barcodes tagging reads that originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results: We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, to improve its performance. Availability and implementation: LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information: Supplementary data are available at Bioinformatics Advances.
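One of the operations listed above, counting the barcodes shared by two genomic regions, amounts to a set intersection. The sketch below is plain Python and does not use or reproduce the LRez API (which is C++); the function name and the barcode strings are hypothetical.

```python
def common_barcodes(region_a, region_b):
    """Return the number of distinct barcodes seen in both regions."""
    return len(set(region_a) & set(region_b))


# Hypothetical barcodes attached to reads aligned in two regions.
region_1 = ["AAAC", "AACG", "ACGT", "AACG"]
region_2 = ["ACGT", "AACG", "TTTT"]

print(common_barcodes(region_1, region_2))
```

A high shared-barcode count suggests that reads in the two regions came from the same long DNA molecules, which is the kind of long-range signal that the scaffolding and structural variant applications mentioned in the abstract rely on.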

MOSGA: Modular Open-Source Genome Annotator

Bioinformatics ◽

10.1093/bioinformatics/btaa1003 ◽

2020 ◽

Author(s):

Roman Martin ◽

Thomas Hackl ◽

Georges Hattab ◽

Matthias G Fischer ◽

Dominik Heider

Keyword(s):

Open Source ◽

Source Code ◽

Supplementary Information ◽

Web Interface ◽

Fully Integrated ◽

Sequencing Technologies ◽

A Genome ◽

Wide Range ◽

User Friendly ◽

Eukaryotic Genomes

Abstract Motivation: The generation of high-quality assemblies, even for large eukaryotic genomes, has become a routine task for many biologists thanks to recent advances in sequencing technologies. However, the annotation of these assemblies, a crucial step toward unlocking the biology of the organism of interest, has remained a complex challenge that often requires advanced bioinformatics expertise. Results: Here, we present MOSGA (Modular Open-Source Genome Annotator), a genome annotation framework for eukaryotic genomes with a user-friendly web interface that generates and integrates annotations from various tools. The aggregated results can be analyzed with a fully integrated genome browser and are provided in a format ready for submission to NCBI. MOSGA is built on a portable, customizable and easily extendible Snakemake backend, and thus can be tailored to a wide range of users and projects. Availability and implementation: We provide MOSGA as a web service at https://mosga.mathematik.uni-marburg.de and as a docker container at registry.gitlab.com/mosga/mosga:latest. Source code can be found at https://gitlab.com/mosga/mosga. Supplementary information: Supplementary data are available at Bioinformatics online.

Developing informative microsatellite markers for non-model species using reference mapping against a model species’ genome

Scientific Reports ◽

10.1038/srep23087 ◽

2016 ◽

Vol 6 (1) ◽

Cited By ~ 5

Author(s):

Chih-Ming Hung ◽

Ai-Yun Yu ◽

Yu-Ting Lai ◽

Pei-Jen L. Shaner

Keyword(s):

Microsatellite Markers ◽

Rodent Species ◽

Background Information ◽

Model Species ◽

Breeding Programs ◽

Sequencing Technologies ◽

A Genome ◽

Wide Range ◽

Genomic Locations ◽

Traditional Approaches

Abstract Microsatellites have a wide range of applications, from behavioral biology and evolution to agriculture-based breeding programs. Recent progress in next-generation sequencing technologies and the rapidly increasing number of published genomes may greatly enhance the current applications of microsatellites by turning them from anonymous to informative markers. Here we developed an approach to anchor microsatellite markers of any target species in the genome of a related model species, through which the genomic locations of the markers, along with any functional genes potentially linked to them, can be revealed. We mapped the shotgun sequence reads of a non-model rodent species, Apodemus semotus, against the genome of a model species, Mus musculus, and present 24 polymorphic microsatellite markers with detailed background information for A. semotus in this study. The developed markers can be used in other rodent species, especially those that are closely related to A. semotus or M. musculus. Compared to the traditional approaches based on DNA cloning, our approach is likely to yield more loci for the same cost. This study is a timely demonstration of how a research team can efficiently generate informative (neutral or function-associated) microsatellite markers for their study species and unique biological questions.

Nebula: Ultra-efficient mapping-free structural variant genotyper

10.1101/566620 ◽

2019 ◽

Cited By ~ 2

Author(s):

Parsoa Khorsand ◽

Fereydoun Hormozdiari

Keyword(s):

Large Scale ◽

Mobile Element ◽

Structural Variations ◽

Sequencing Technologies ◽

Generic Framework ◽

Common Genetic Variants ◽

Order Of Magnitude ◽

Comparable Accuracy ◽

Using Data ◽

Computational Resources

Abstract Motivation: Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, genotyping these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to errors and ambiguities when genotyping events in repeat regions. We therefore propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Results: Our method, Nebula, uses changes in the counts of k-mers to predict the genotype of common structural variations. We show that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping deletions and mobile-element insertions but also has accuracy comparable to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Availability: Nebula is publicly available at https://github.com/Parsoa/NebulousSerendipity

MALVA: genotyping by Mapping-free ALlele detection of known VAriants

10.1101/575126 ◽

2019 ◽

Cited By ~ 1

Author(s):

Giulia Bernardini ◽

Paola Bonizzoni ◽

Luca Denti ◽

Marco Previtali ◽

Alexander Schönhuth

Keyword(s):

Variant Calling ◽

Genome Project ◽

Human Populations ◽

Genome Data ◽

Variant Discovery ◽

Sequencing Technologies ◽

Order Of Magnitude ◽

Speed Up ◽

Similar Accuracy ◽

Genomic Regions

Abstract The amount of genetic variation discovered and characterized in human populations is huge, and is growing rapidly with the widespread availability of modern sequencing technologies. This wealth of variation data, which accounts for human diversity, leads to various challenging computational tasks, including variant calling and genotyping of newly sequenced individuals. The standard pipelines for addressing these problems include read mapping, which is a computationally expensive procedure. A few mapping-free tools have been proposed in recent years to speed up the genotyping process. While such tools have highly efficient run-times, they focus on isolated, bi-allelic SNPs, providing limited support for multi-allelic SNPs, indels, and genomic regions with high variant density. To address these issues, we introduce MALVA, a fast and lightweight mapping-free method to genotype an individual directly from a sample of reads. MALVA is the first mapping-free tool that is able to genotype multi-allelic SNPs and indels, even in high-density genomic regions, and to effectively handle a huge number of variants such as those provided by the 1000 Genomes Project. An experimental evaluation on whole-genome data shows that MALVA requires one order of magnitude less time to genotype a donor than alignment-based pipelines, providing similar accuracy. Remarkably, on indels, MALVA provides even better results than the most widely adopted variant discovery tools.

New insights on Laminaria digitata ultrastructure through combined conventional chemical fixation and cryofixation

Botanica Marina ◽

10.1515/bot-2021-0005 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Christos Katsaros ◽

Sophie Le Panse ◽

Gillian Milne ◽

Carl J. Carrano ◽

Frithjof Christian Küpper

Keyword(s):

Cell Wall ◽

General Structure ◽

Raw Material ◽

Rocky Shores ◽

Chemical Fixation ◽

Laminaria Digitata ◽

Vegetative Cells ◽

Biological Studies ◽

New Findings ◽

Wide Range

Abstract The objective of the present study is to examine the fine structure of vegetative cells of Laminaria digitata using both chemical fixation and cryofixation. Laminaria digitata was chosen due to its importance as a model organism in a wide range of biological studies, as a keystone species on rocky shores of the North Atlantic, its use of iodide as a unique inorganic antioxidant, and its significance as a raw material for the production of alginate. Details of the fine structural features of vegetative cells are described, with particular emphasis on the differences between the two methods used, i.e. conventional chemical fixation and freeze-fixation. The general structure of the cells was similar to that already described, with minor differences between the different cell types. An intense activity of the Golgi system was found associated with the thick external cell wall, with large dictyosomes from which numerous vesicles and cisternae are released. An interesting type of cisternae was found in the cryofixed material, which was not visible with the chemical fixation. These are elongated structures, in sections appearing tubule-like, close to the external cell wall or to young internal walls. An increased number of these structures was observed near the plasmodesmata of the pit fields. They are similar to the “flat cisternae” found associated with the forming cytokinetic diaphragm of brown algae. Their possible role is discussed. The new findings of this work underline the importance of such combined studies which reveal new data not known until now using the old conventional methods. The main conclusion of the present study is that cryofixation is the method of choice for studying Laminaria cytology by transmission electron microscopy.
