Algorithms for efficiently collapsing reads with Unique Molecular Identifiers

Mapping Intimacies ◽

10.1101/648683 ◽

2019 ◽

Author(s):

Daniel Liu

Keyword(s):

Data Structures ◽

Alignment Position ◽

Pcr Duplicates ◽

Efficient Data ◽

New Formulation ◽

Efficient Data Structures

Abstract. Background: Unique Molecular Identifiers (UMIs) are used in many experiments to find and remove PCR duplicates. Although there are many tools for deduplicating reads by finding reads with the same alignment coordinates and UMIs, many of them either cannot handle substitution errors or require expensive pairwise UMI comparisons that do not scale efficiently to larger datasets. Results: We formulate the problem of deduplicating UMIs in a manner that enables optimizations and the use of more efficient data structures. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 seconds. Conclusions: We present a new formulation of the UMI deduplication problem and show that it can be solved faster with more sophisticated data structures.

Algorithms for efficiently collapsing reads with Unique Molecular Identifiers

PeerJ ◽

10.7717/peerj.8275 ◽

2019 ◽

Vol 7 ◽

pp. e8275

Author(s):

Daniel Liu

Keyword(s):

Data Structures ◽

Alignment Position ◽

Single Thread ◽

Pcr Duplicates ◽

Efficient Data ◽

New Formulation ◽

Efficient Data Structures

Background: Unique Molecular Identifiers (UMIs) are used in many experiments to find and remove PCR duplicates. There are many tools for deduplicating reads by finding reads with the same alignment coordinates and UMIs. However, many of them either cannot handle substitution errors or require expensive pairwise UMI comparisons that do not scale efficiently to larger datasets. Results: We reformulate the problem of deduplicating UMIs in a manner that enables optimizations and the use of more efficient data structures. We implement our data structures and optimizations in a tool called UMICollapse, which is able to deduplicate over one million unique UMIs of length 9 at a single alignment position in around 26 s, using only a single thread and much less than 10 GB of memory. Conclusions: We present a new formulation of the UMI deduplication problem and show that it can be solved faster with more sophisticated data structures.
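To make the scaling concern in this abstract concrete, the following is a minimal sketch of a naive baseline for collapsing UMIs at a single alignment position while tolerating substitution errors. It is not the method used by UMICollapse; the function names `hamming` and `collapse_umis`, the `max_mismatches` parameter, and the greedy "most frequent UMI first" rule are all assumptions made for illustration. Each cluster center is compared against every remaining UMI, which is the kind of pairwise comparison the abstract describes as expensive.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length UMIs differ."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umis, max_mismatches=1):
    """Greedy baseline (illustrative assumption, not UMICollapse's algorithm).

    Returns a dict mapping every distinct UMI to a representative UMI
    that is within `max_mismatches` substitutions of it.
    """
    counts = Counter(umis)
    # Visit the most frequent UMIs first; break ties alphabetically
    # so the result is deterministic.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    representative = {}
    for center in ordered:
        if center in representative:
            continue  # already absorbed by an earlier, more frequent UMI
        representative[center] = center
        # Pairwise scan over all UMIs: quadratic in the worst case.
        for other in ordered:
            if other not in representative and hamming(center, other) <= max_mismatches:
                representative[other] = center
    return representative


if __name__ == "__main__":
    reads = ["ACGT"] * 5 + ["ACGA"] + ["TTTT"] * 2 + ["TTTA"]
    groups = collapse_umis(reads, max_mismatches=1)
    print(len(set(groups.values())), "deduplicated UMIs")
```

With one million unique UMIs, the inner scan alone can approach on the order of 10^12 Hamming comparisons in the worst case, which is why a formulation that avoids all-pairs comparison matters.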

An algorithm and efficient data structures for the binary knapsack problem

European Journal of Operational Research ◽

10.1016/0377-2217(78)90137-6 ◽

1978 ◽

Vol 2 (6) ◽

pp. 420-428 ◽

Cited By ~ 7

Author(s):

Uwe Suhl

Keyword(s):

Data Structures ◽

Knapsack Problem ◽

Efficient Data ◽

Efficient Data Structures

Sisnet 3 — A simulator of switching networks with efficient data structures

Annual Review in Automatic Programming ◽

10.1016/0066-4138(85)90337-4 ◽

1985 ◽

Vol 12 ◽

pp. 101-104

Author(s):

Janusz Kleban ◽

Jerzy Kubasik

Keyword(s):

Data Structures ◽

Switching Networks ◽

Efficient Data ◽

Efficient Data Structures

Efficient Data Structures for Bursty Access Patterns

Packet Forwarding Technologies ◽

10.1201/9780849380587-6 ◽

2007 ◽

pp. 205-236

Author(s):

Weidong Wu

Keyword(s):

Data Structures ◽

Efficient Data ◽

Access Patterns ◽

Efficient Data Structures

EFFICIENT DATA STRUCTURES FOR LOCAL INCONSISTENCY DETECTION IN FIREWALL ACL UPDATES

Proceedings of the 11th International Conference on Enterprise Information ◽

10.5220/0001996001760181 ◽

2009 ◽

Author(s):

S. Pozo ◽

R. M. Gasca ◽

F. de la Rosa T.

Keyword(s):

Data Structures ◽

Inconsistency Detection ◽

Efficient Data ◽

Efficient Data Structures

Efficient Data Structures for the Factor Periodicity Problem

String Processing and Information Retrieval - Lecture Notes in Computer Science ◽

10.1007/978-3-642-34109-0_30 ◽

2012 ◽

pp. 284-294 ◽

Cited By ~ 10

Author(s):

Tomasz Kociumaka ◽

Jakub Radoszewski ◽

Wojciech Rytter ◽

Tomasz Waleń

Keyword(s):

Data Structures ◽

Efficient Data ◽

Efficient Data Structures

Efficient data structures for model-based 3-D object recognition and localization from range images

IEEE Transactions on Pattern Analysis and Machine Intelligence ◽

10.1109/34.159905 ◽

1992 ◽

Vol 14 (10) ◽

pp. 1035-1045 ◽

Cited By ~ 7

Author(s):

W. Wang ◽

S.S. Iyengar

Keyword(s):

Object Recognition ◽

Data Structures ◽

Range Images ◽

Model Based ◽

Efficient Data ◽

Efficient Data Structures

Efficient Data Structures for Range Selections Problem

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.1387 ◽

2013 ◽

Vol 756-759 ◽

pp. 1387-1391

Author(s):

Xiao Dong Wang ◽

Jun Tian

Keyword(s):

Data Structure ◽

Linear Space ◽

Data Structures ◽

Computation Model ◽

Word Size ◽

Selection Problems ◽

Efficient Data ◽

Space Data ◽

Efficient Data Structures ◽

Range Selection

We consider building an efficient data structure for range selection problems. While there are several theoretical solutions to the problem, only a few have been tried in practice, and little is known about how the others would perform. The computation model used in this paper is the RAM model with word-size . Our data structure is a practical linear-space data structure that supports range selection queries in time with preprocessing time.

Scalable plasma simulation with ELMFIRE using efficient data structures for process communication

Computer Physics Communications ◽

10.1016/j.cpc.2008.03.009 ◽

2008 ◽

Vol 179 (5) ◽

pp. 330-338

Author(s):

Artur Signell ◽

Francisco Ogando ◽

Mats Aspnäs ◽

Jan Westerholm

Keyword(s):

Data Structures ◽

Plasma Simulation ◽

Process Communication ◽

Efficient Data ◽

Efficient Data Structures

Memory Efficient Data Structures for Explicit Verification of Timed Systems

Lecture Notes in Computer Science - NASA Formal Methods ◽

10.1007/978-3-319-06200-6_26 ◽

2014 ◽

pp. 307-312 ◽

Cited By ~ 6

Author(s):

Peter Gjøl Jensen ◽

Kim Guldstrand Larsen ◽

Jiří Srba ◽

Mathias Grund Sørensen ◽

Jakob Haar Taankvist

Keyword(s):

Data Structures ◽

Timed Systems ◽

Efficient Data ◽

Efficient Data Structures ◽

Memory Efficient