URMAP, an ultra-fast read mapper

Author(s):  
Robert C. Edgar

Abstract: Mapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA and Bowtie2 with comparable accuracy on a benchmark test using simulated paired 150nt reads of a well-studied human genome. Software is freely available at https://drive5.com/urmap.

PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e9338
Author(s):  
Robert Edgar

Mapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA with comparable accuracy on several validation tests. On a Genome in a Bottle (GIAB) variant calling test with 30× coverage 2×150 reads, URMAP achieves high accuracy (precision 0.998, sensitivity 0.982 and F-measure 0.990) with the strelka2 caller. However, GIAB reference variants are shown to be biased against repetitive regions which are difficult to map and may therefore pose an unrealistically easy challenge to read mappers and variant callers.


2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Guilherme de Sena Brandine ◽  
Andrew D Smith

Abstract: DNA cytosine methylation is an important epigenomic mark with a wide range of functions in many organisms. Whole genome bisulfite sequencing is the gold standard to interrogate cytosine methylation genome-wide. Algorithms used to map bisulfite-converted reads often encode the four-base DNA alphabet with three letters by reducing two bases to a common letter. This encoding substantially reduces the entropy of nucleotide frequencies in the resulting reference genome. Within the paradigm of read mapping by first filtering possible candidate alignments, reduced entropy in the sequence space can increase the required computing effort. We introduce another bisulfite mapping algorithm (abismal), based on the idea of encoding a four-letter DNA sequence as only two letters, one for purines and one for pyrimidines. We show that this encoding can lead to greater specificity compared to existing encodings used to map bisulfite sequencing reads. Through the two-letter encoding, the abismal software tool maps reads in less time and using less memory than most bisulfite sequencing read mapping software tools, while attaining similar accuracy. This allows in silico methylation analysis to be performed on a wider range of computing hardware, including machines with limited resources.
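The two-letter purine/pyrimidine encoding the abstract describes can be sketched as follows. The function name and translation table are illustrative, not part of the abismal API; the point is that bisulfite conversion (unmethylated C read as T) leaves the encoded sequence unchanged.

```python
# Purines (A, G) -> 'R'; pyrimidines (C, T) -> 'Y'.
# Hypothetical sketch, not abismal's actual implementation.
TWO_LETTER = str.maketrans("ACGT", "RYRY")

def encode_two_letter(seq: str) -> str:
    """Collapse a DNA sequence onto a purine/pyrimidine alphabet."""
    return seq.upper().translate(TWO_LETTER)

# A bisulfite-converted read (C -> T) encodes identically to its reference,
# since C and T both map to 'Y':
print(encode_two_letter("ACGT"))  # RYRY
print(encode_two_letter("ATGT"))  # RYRY (converted form of ACGT)
```

Because only two symbols remain, the encoded genome has near-maximal entropy per position, which is what makes candidate filtering more specific than with the three-letter encodings.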


2021 ◽  
Author(s):  
Parsoa Khorsand ◽  
Fereydoun Hormozdiari

Abstract: Large scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second and third generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to only specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, utilizes the changes in the count of k-mers to predict the genotype of structural variants. We have shown that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants, but also comparable in accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
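A minimal sketch of the k-mer-count idea: k-mers that span a variant's breakpoints (or lie inside a deleted interval) change in count depending on the genotype, so comparing their observed counts against sequencing depth suggests 0/0, 0/1 or 1/1. The thresholds and function names below are hypothetical, not Nebula's actual model.

```python
from collections import Counter

def kmer_counts(seq: str, k: int) -> Counter:
    """Count all k-mers of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def genotype_deletion(read_kmers: Counter, inner_kmers: list, depth: float) -> str:
    """Classify a deletion from the mean count of k-mers inside the
    deleted interval, relative to sequencing depth (illustrative cutoffs)."""
    mean = sum(read_kmers[km] for km in inner_kmers) / len(inner_kmers)
    if mean > 0.75 * depth:
        return "0/0"  # k-mers at full depth: interval present on both haplotypes
    if mean > 0.25 * depth:
        return "0/1"  # roughly half depth: heterozygous deletion
    return "1/1"      # k-mers absent: homozygous deletion
```

Because only exact k-mer counting is needed, no read alignment is performed, which is where the order-of-magnitude speedup over mapping-based genotypers comes from.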


2017 ◽  
Author(s):  
Krešimir Križanović ◽  
Ivan Sović ◽  
Ivan Krpelnik ◽  
Mile Šikić

Abstract: Next generation sequencing technologies have made RNA sequencing widely accessible and applicable in many areas of research. In recent years, third generation sequencing technologies have matured and are slowly replacing NGS for DNA sequencing. This paper presents a novel tool for RNA mapping guided by gene annotations. The tool is an adapted version of a previously developed DNA mapper, GraphMap, tailored for third generation sequencing data, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies devices. It uses gene annotations to generate a transcriptome, uses a DNA mapping algorithm to map reads to the transcriptome, and finally transforms the mappings back to genome coordinates. The modified version of GraphMap is compared on several synthetic datasets to state-of-the-art RNA-seq mappers that support third generation sequencing data. The results show that our tool outperforms other tools in general mapping quality.
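The final step, lifting transcriptome alignments back to the genome, amounts to a coordinate transform over the annotated exons. A minimal sketch (forward strand only; the function name is illustrative, not from the tool):

```python
def transcript_to_genome(tx_pos: int, exons: list) -> int:
    """Map a 0-based transcript coordinate back to a genome coordinate,
    given the transcript's exons as (genome_start, genome_end) half-open
    intervals in transcription order. Forward strand assumed."""
    offset = tx_pos
    for start, end in exons:
        length = end - start
        if offset < length:
            return start + offset  # position falls inside this exon
        offset -= length           # skip past this exon
    raise ValueError("position beyond transcript length")

# Two exons: genome 100-110 and 200-205.
exons = [(100, 110), (200, 205)]
print(transcript_to_genome(3, exons))   # 103 (inside first exon)
print(transcript_to_genome(12, exons))  # 202 (2 bases into second exon)
```

An alignment that spans an exon boundary in transcript space splits into two genome blocks separated by the intron, which is how spliced mappings are recovered without a splice-aware aligner.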


2016 ◽  
Author(s):  
Po-E Li ◽  
Chien-Chi Lo ◽  
Joseph J. Anderson ◽  
Karen W. Davenport ◽  
Kimberly A. Bishop-Lilly ◽  
...  

Continued advancements in sequencing technologies have fueled the development of new sequencing applications and promise to flood current databases with raw data. A number of factors prevent the seamless and easy use of these data, including the breadth of project goals, the wide array of tools that individually perform fractions of any given analysis, the large number of associated software/hardware dependencies, and the detailed expertise required to perform these analyses. To address these issues, we have developed an intuitive web-based environment with a wide assortment of integrated and cutting-edge bioinformatics tools. These preconfigured workflows provide even novice next-generation sequencing users with the ability to perform many complex analyses with only a few mouse clicks, and, within the context of the same environment, to visualize and further interrogate their results. This bioinformatics platform is an initial attempt at Empowering the Development of Genomics Expertise (EDGE) in a wide range of applications.


2015 ◽  
Vol 2015 ◽  
pp. 1-8 ◽  
Author(s):  
Kuo-Wang Tsai ◽  
Bill Chang ◽  
Cheng-Tsung Pan ◽  
Wei-Chen Lin ◽  
Ting-Wen Chen ◽  
...  

Next-generation sequencing (NGS) has become a powerful sequencing tool, applied in a wide range of biological studies. However, the traditional sample preparation protocol for NGS is non-strand-specific (NSS), leading to biased estimates of expression for transcripts overlapped at the antisense strand. Strand-specific (SS) protocols have recently been developed. In this study, we prepared the same RNA sample by using the SS and NSS protocols, followed by sequencing with Illumina HiSeq platform. Using real-time quantitative PCR as a standard, we first proved that the SS protocol more precisely estimates gene expressions compared with the NSS protocol, particularly for those overlapped at the antisense strand. In addition, we also showed that the sequence reads from the SS protocol are comparable with those from conventional NSS protocols in many aspects. Finally, we also mapped a fraction of sequence reads back to the antisense strand of the known genes, originally without annotated genes located. Using sequence assembly and PCR validation, we succeeded in identifying and characterizing the novel antisense genes. Our results show that the SS protocol performs more accurately than the traditional NSS protocol and can be applied in future studies.


2020 ◽  
Vol 30 (4) ◽  
pp. 3-20
Author(s):  
A. G. Borodinov ◽  
V. V. Manoilov ◽  
I. V. Zarutsky ◽  
A. I. Petrov ◽  
...  

Several decades have passed since the development of the revolutionary DNA sequencing method by Frederick Sanger and his colleagues. After the Human Genome Project, the interval between successive sequencing technologies began to shrink, while the volume of scientific knowledge continued to grow exponentially. Following Sanger sequencing, considered the first generation, new generations of DNA sequencing were successively introduced into practice. Advances in next generation sequencing (NGS) technologies have contributed significantly to this trend by reducing costs and generating massive sequencing data. To date, there are three generations of sequencing technologies. Second generation sequencing, which is currently the most commonly used NGS technology, consists of library preparation, amplification and sequencing steps, while in third generation sequencing, individual nucleic acids are sequenced directly to avoid bias and achieve higher throughput. The development of new generations of sequencing has made it possible to overcome the limitations of traditional DNA sequencing methods and has found application in a wide range of projects in molecular biology. On the other hand, with the development of next generation technologies, many technical problems arise that need to be deeply analyzed and solved. Each generation and sequencing platform, due to its methodological approach, has specific advantages and disadvantages that determine suitability for certain applications. Thus, the assessment of these characteristics, limitations and potential applications helps to shape the directions for further research on sequencing technologies.


2010 ◽  
Vol 2010 ◽  
pp. 1-12 ◽  
Author(s):  
Laurent Noé ◽  
Marta Gîrdea ◽  
Gregory Kucherov

The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies, on the one hand, on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the other hand, on a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach by several seed designs and demonstrate their efficiency.
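To illustrate the general spaced-seed idea behind such designs (this is not the authors' actual seed set, and it ignores SOLiD color-space specifics): a seed pattern marks positions that must match ('1') and wildcard positions ('0') that tolerate differences such as SNPs or reading errors.

```python
def seed_hits(read: str, ref: str, seed: str = "11011") -> list:
    """Positions in ref where the first len(seed) bases of the read match
    under a spaced seed: '1' positions must match, '0' are wildcards.
    Assumes len(read) >= len(seed). Illustrative sketch only."""
    k = len(seed)
    hits = []
    for i in range(len(ref) - k + 1):
        if all(seed[j] == "0" or ref[i + j] == read[j] for j in range(k)):
            hits.append(i)
    return hits

# The wildcard at position 2 lets a mismatch there still seed an alignment:
print(seed_hits("ACGTA", "ACCTAACGTA"))  # [0, 5]
```

Seed design then consists of choosing the '1'/'0' pattern (or a family of patterns) to maximize sensitivity under a probabilistic model of where mismatches occur.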


2019 ◽  
Author(s):  
Parsoa Khorsand ◽  
Fereydoun Hormozdiari

Abstract
Motivation: Large scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second and third generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to only specific types of variants and are generally prone to various errors and ambiguities when genotyping events in repeat regions. We therefore propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches.
Results: Our method, Nebula, utilizes the changes in the count of k-mers to predict the genotype of common structural variations. We have shown that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping deletions and mobile-element insertions, but also comparable in accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event.
Availability: Nebula is publicly available at https://github.com/Parsoa/NebulousSerendipity


2019 ◽  
Vol 14 (2) ◽  
pp. 157-163
Author(s):  
Majid Hajibaba ◽  
Mohsen Sharifi ◽  
Saeid Gorgin

Background: One of the pivotal challenges in today's genomic research is the fast processing of voluminous data such as those generated by high-throughput Next-Generation Sequencing technologies. On the other hand, BLAST (Basic Local Alignment Search Tool), a long-established and renowned tool in Bioinformatics, has been shown to be incredibly slow in this regard. Objective: To improve the performance of BLAST in the processing of voluminous data, we have applied a novel memory-aware technique to BLAST for faster parallel processing. Method: We have used a master-worker model alongside a memory-aware technique in which the master partitions the whole data into equal chunks, one chunk for each worker, and each worker then further splits and formats its allocated chunk according to the size of its memory. Each worker searches every split piece one by one through a list of queries. Results: We chose a list of queries with different lengths to run intensive searches in a huge database, UniProtKB/TrEMBL. Our experiments show a 20 percent improvement in performance when workers used our proposed memory-aware technique compared to when they were not memory-aware. Experiments show an even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST. Conclusion: We have shown that memory-awareness in formatting a bulky database, when running BLAST, can improve performance significantly, while preventing unexpected crashes in low-memory environments. Even though distributed computing attempts to mitigate search time by partitioning and distributing database portions, our memory-aware technique alleviates the negative effects of page faults on performance.
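The two-level split the Method paragraph describes can be sketched as follows. Function names and the record-count units are illustrative, not from the authors' implementation; the key property is that no worker ever formats or searches a piece larger than its own memory budget, avoiding page faults.

```python
def partition(total_records: int, n_workers: int) -> list:
    """Master: split the database into near-equal chunks, one per worker."""
    base, extra = divmod(total_records, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]

def split_for_memory(chunk_size: int, mem_capacity: int) -> list:
    """Worker: subdivide its chunk into pieces that each fit in memory."""
    pieces, remaining = [], chunk_size
    while remaining > 0:
        piece = min(remaining, mem_capacity)
        pieces.append(piece)
        remaining -= piece
    return pieces

# Master hands 3 workers near-equal chunks; a worker with a budget of 4
# units splits its chunk of 10 into memory-sized pieces:
print(partition(10, 3))          # [4, 3, 3]
print(split_for_memory(10, 4))   # [4, 4, 2]
```

The master's partitioning balances load across workers, while the per-worker split adapts to heterogeneous memory capacities, which is what makes the scheme "memory-aware" rather than a fixed uniform split.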

