Algorithms for efficiently collapsing reads with Unique Molecular Identifiers

2019 ◽  
Author(s):  
Daniel Liu

Abstract. Background: Unique Molecular Identifiers (UMIs) are used in many experiments to find and remove PCR duplicates. Many tools deduplicate reads by finding reads with the same alignment coordinates and UMIs, but many of them either cannot handle substitution errors or require expensive pairwise UMI comparisons that do not scale efficiently to larger datasets. Results: We formulate the problem of deduplicating UMIs in a manner that enables optimizations and the use of more efficient data structures. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 seconds. Conclusions: We present a new formulation of the UMI deduplication problem and show that it can be solved faster with more sophisticated data structures.

PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e8275
Author(s):  
Daniel Liu

Background: Unique Molecular Identifiers (UMIs) are used in many experiments to find and remove PCR duplicates. Many tools deduplicate reads by finding reads with the same alignment coordinates and UMIs. However, many of them either cannot handle substitution errors or require expensive pairwise UMI comparisons that do not scale efficiently to larger datasets. Results: We reformulate the problem of deduplicating UMIs in a manner that enables optimizations and the use of more efficient data structures. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 s, using only a single thread and much less than 10 GB of memory. Conclusions: We present a new formulation of the UMI deduplication problem and show that it can be solved faster with more sophisticated data structures.
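
To make the cost of the pairwise comparisons mentioned in the abstract concrete, the following is a minimal sketch of a naive baseline, not UMICollapse's algorithm or data structures. It assumes all UMIs come from reads at a single alignment position, have equal length, and differ only by substitutions. The names `hamming`, `naive_collapse`, and `max_edits` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length UMIs differ."""
    return sum(x != y for x, y in zip(a, b))


def naive_collapse(umis, max_edits=1):
    """Greedy pairwise grouping of UMIs seen at one alignment position.

    Hypothetical baseline for illustration: visit unique UMIs from most
    to least frequent and absorb every unassigned UMI that lies within
    `max_edits` substitutions of the current one.
    Returns a dict mapping each unique UMI to its group representative.
    """
    counts = Counter(umis)
    # Most frequent first; ties broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    assigned = {}
    for i, rep in enumerate(ordered):
        if rep in assigned:
            continue  # already absorbed by a more frequent UMI
        assigned[rep] = rep
        # Inner scan over the remaining UMIs makes this quadratic.
        for other in ordered[i + 1:]:
            if other not in assigned and hamming(rep, other) <= max_edits:
                assigned[other] = rep
    return assigned


reads = ["AAAA"] * 5 + ["AAAT"] + ["CCCC"] * 3 + ["CCGC"]
groups = naive_collapse(reads)
```

In this toy input, `AAAT` is merged into `AAAA` and `CCGC` into `CCCC`, leaving two groups. Because every representative is compared against all remaining unique UMIs, the work grows quadratically with the number of unique UMIs, which is the scaling problem the abstract describes.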


2013 ◽  
Vol 756-759 ◽  
pp. 1387-1391
Author(s):  
Xiao Dong Wang ◽  
Jun Tian

Building an efficient data structure for range selection problems is considered. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform. The computation model used in this paper is the RAM model with word-size . Our data structure is a practical linear space data structure that supports range selection queries in time with preprocessing time.


2008 ◽  
Vol 179 (5) ◽  
pp. 330-338
Author(s):  
Artur Signell ◽  
Francisco Ogando ◽  
Mats Aspnäs ◽  
Jan Westerholm

Author(s):  
Peter Gjøl Jensen ◽  
Kim Guldstrand Larsen ◽  
Jiří Srba ◽  
Mathias Grund Sørensen ◽  
Jakob Haar Taankvist
