Constructing smaller genome graphs via string compression

2021 ◽  
Author(s):  
Yutong Qiu ◽  
Carl Kingsford

Abstract: The size of a genome graph (the space required to store the nodes, their labels and edges) affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. The size of the graph also affects the size of the graph index that is used to speed up the alignment. This raises the need for approaches to construct space-efficient genome graphs. We point out similarities in the string encoding approaches of genome graphs and the external pointer macro (EPM) compression model. Supported by these similarities, we present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. We show that the algorithms result in an upper bound on the size of the genome graph constructed based on an optimal EPM compression. In addition to the transformation, we show that equivalent choices made by EPM compression algorithms may result in different sizes of genome graphs. To further optimize the size of the genome graph, we propose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv EPM compression algorithm. We show that using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored de Bruijn graphs constructed by Bifrost under the default settings. The RLZ-Graph software is available at https://github.com/Kingsford-Group/rlzgraph
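
The relative Lempel-Ziv idea named in this abstract can be illustrated with a minimal sketch, which is not the authors' implementation: a target string is parsed greedily into phrases that point into a reference string. The function names `rlz_factorize` and `rlz_decode`, the naive quadratic search, and the toy strings are assumptions made for illustration only.

```python
def rlz_factorize(reference, target):
    """Greedy relative Lempel-Ziv parse of `target` against `reference`.

    Each phrase is either a (position, length) pair pointing into the
    reference, or a single literal character when no match exists.
    The search is naive and quadratic; it is meant for illustration.
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for p in range(len(reference)):
            length = 0
            while (p + length < len(reference)
                   and i + length < len(target)
                   and reference[p + length] == target[i + length]):
                length += 1
            # Strict comparison: ties keep the leftmost source position.
            if length > best_len:
                best_pos, best_len = p, length
        if best_len == 0:
            phrases.append(target[i])  # literal character, no source
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the target string from the reference and its phrases."""
    parts = []
    for phrase in phrases:
        if isinstance(phrase, str):
            parts.append(phrase)
        else:
            pos, length = phrase
            parts.append(reference[pos:pos + length])
    return "".join(parts)


reference = "ACGTACGT"
target = "ACGTTACG"
phrases = rlz_factorize(reference, target)
print(phrases)
print(rlz_decode(reference, phrases) == target)
```

In this toy input the first phrase could point at position 0 or position 4 of the reference, since `ACGT` occurs twice. A tie of this kind is one plausible instance of the "equivalent choices" the abstract mentions; the sketch simply keeps the leftmost source.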

2021 ◽  
Vol 17 (5) ◽  
pp. e1008928
Author(s):  
Paul Medvedev ◽  
Mihai Pop

Many students are taught about genome assembly using the dichotomy between the complexity of finding Eulerian and Hamiltonian cycles (easy versus hard, respectively). This dichotomy is sometimes used to motivate the use of de Bruijn graphs in practice. In this paper, we explain that while de Bruijn graphs have indeed been very useful, the reason has nothing to do with the complexity of the Hamiltonian and Eulerian cycle problems. We give two arguments. The first is that a genome reconstruction is never unique and hence an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. The second is that even if an arbitrary genome reconstruction were desired, one could do so in linear time in both the Eulerian and Hamiltonian paradigms.
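
For readers unfamiliar with the "easy" half of the dichotomy, the following is a minimal sketch of Hierholzer's algorithm, a standard linear-time method for finding one Eulerian cycle in a directed graph. It is not taken from the paper; the function name `eulerian_cycle` and the toy edge list are assumptions, and the input is assumed to be connected with equal in-degree and out-degree at every node.

```python
from collections import defaultdict


def eulerian_cycle(edges):
    """Return one Eulerian cycle as a list of nodes (Hierholzer's algorithm).

    Assumes the directed graph given by `edges` actually has an Eulerian
    cycle. Each edge is pushed and popped once, so the running time is
    linear in the number of edges.
    """
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)

    stack = [edges[0][0]]  # start from the tail of the first edge
    cycle = []
    while stack:
        node = stack[-1]
        if adjacency[node]:
            # Follow any unused outgoing edge.
            stack.append(adjacency[node].pop())
        else:
            # Dead end: this node is final in the cycle, backtrack.
            cycle.append(stack.pop())
    return cycle[::-1]


toy_edges = [("A", "B"), ("B", "C"), ("C", "A")]
print(eulerian_cycle(toy_edges))
```

The cycle returned depends on the arbitrary order in which outgoing edges are consumed, which is in line with the abstract's point that a reconstruction obtained this way is only one of many.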


2020 ◽  
Author(s):  
Daniel Danciu ◽  
Mikhail Karasikov ◽  
Harun Mustafa ◽  
André Kahles ◽  
Gunnar Rätsch

Abstract: Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of nodes adjacent in the graph. RowDiff can be constructed in linear time relative to the number of nodes and labels in the graph, and the construction can be efficiently parallelized and distributed, significantly reducing construction time. RowDiff can be viewed as an intermediary sparsification step of the initial annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrix representation. Our experiments on the Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST, the previously known smallest annotation representation. In addition, experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST.
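
The idea of exploiting similar annotations on adjacent nodes can be sketched as follows. This is a simplified illustration, not the RowDiff implementation: it assumes each node has one designated successor, that a non-anchor node stores only the symmetric difference between its label set and its successor's, and that designated anchor nodes store their full label set. The names `rowdiff_encode`, `rowdiff_decode`, `successor` and `anchors` are hypothetical.

```python
def rowdiff_encode(rows, successor, anchors):
    """Replace each non-anchor row by its difference to the successor's row.

    rows:      dict mapping node -> set of labels (the annotation row)
    successor: dict mapping each non-anchor node -> its chosen successor
    anchors:   set of nodes whose rows are stored in full
    """
    encoded = {}
    for node, labels in rows.items():
        if node in anchors:
            encoded[node] = set(labels)
        else:
            # Symmetric difference is small when neighbours are similar.
            encoded[node] = labels ^ rows[successor[node]]
    return encoded


def rowdiff_decode(node, encoded, successor, anchors):
    """Recover a full row by accumulating differences up to an anchor."""
    result = set()
    while True:
        result ^= encoded[node]
        if node in anchors:
            return result
        node = successor[node]


# Toy path 0 -> 1 -> 2 -> 3 with node 3 as the anchor.
rows = {0: {"a", "b"}, 1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"a"}}
successor = {0: 1, 1: 2, 2: 3}
anchors = {3}

encoded = rowdiff_encode(rows, successor, anchors)
stored = sum(len(labels) for labels in encoded.values())
original = sum(len(labels) for labels in rows.values())
print(stored, original)
```

Decoding walks the successor chain until it reaches an anchor, so in a scheme like this the spacing of anchors trades storage against lookup time.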


Author(s):  
A. G. Jackson ◽  
M. Rowe

Diffraction intensities from intermetallic compounds are, in the kinematic approximation, proportional to the scattering amplitude from the element doing the scattering. More detailed calculations have shown that site symmetry and occupation by various atom species also affect the intensity in a diffracted beam. [1] Hence, by measuring the intensities of beams, or their ratios, the occupancy can be estimated. Measurement of the intensity values also allows structure calculations to be made to determine the spatial distribution of the potentials doing the scattering. Thermal effects are also present as a background contribution. Inelastic effects such as loss or absorption/excitation complicate the intensity behavior, and dynamical theory is required to estimate the intensity value. The dynamic range of currents in diffracted beams can be 10^4 or 10^5:1. Hence, detection of such information requires a means for collecting the intensity over a signal-to-noise range beyond that obtainable with a single film plate, which has a S/N of about 10^3:1. Although such a collection system is not available currently, a simple system consisting of instrumentation on an existing STEM can be used as a proof of concept which has a S/N of about 255:1, limited by the 8 bit pixel attributes used in the electronics. Use of 24 bit pixel attributes would easily allow the desired noise range to be attained in the processing instrumentation. The S/N of the scintillator used by the photoelectron sensor is about 10^6 to 1, well beyond the S/N goal. The trade-off that must be made is the time for acquiring the signal, since the pattern can be obtained in seconds using film plates, compared to 10 to 20 minutes for a pattern to be acquired using the digital scan. Parallel acquisition would, of course, speed up this process immensely.
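
The bit-depth figures in this abstract follow from simple arithmetic, sketched below under the assumption that the S/N ceiling of an n-bit pixel is taken as its largest nonzero value, 2^n - 1. That reading is consistent with the quoted 255:1 for 8 bits but is not stated explicitly in the abstract; the helper names `max_ratio` and `bits_needed` are hypothetical.

```python
def max_ratio(bits):
    """Largest value representable in `bits` bits, read as a ratio to 1."""
    return 2 ** bits - 1


def bits_needed(ratio):
    """Smallest bit depth whose maximum value reaches an integer `ratio`."""
    return ratio.bit_length()


print(max_ratio(8))           # ceiling of the 8 bit proof-of-concept system
print(max_ratio(24))          # ceiling with 24 bit pixel attributes
print(bits_needed(10 ** 5))   # depth needed for a 10^5:1 dynamic range
```

Under this assumption a 10^5:1 range needs at least 17 bits per pixel, so the 24 bit option mentioned in the abstract leaves considerable headroom.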


Author(s):  
Holger Gruen ◽  
Carsten Benthin ◽  
Sven Woop

We propose an easy and simple-to-integrate approach to accelerate ray tracing of alpha-tested transparent geometry with a focus on Microsoft® DirectX® or Vulkan® ray tracing extensions. Pre-computed bit masks are used to quickly determine fully transparent and fully opaque regions of triangles thereby skipping the more expensive alpha-test operation. These bit masks allow us to skip up to 86% of all transparency tests, yielding up to 40% speed up in a proof-of-concept DirectX® software only implementation.


Author(s):  
Yuya Higashikawa ◽  
Naoki Katoh ◽  
Junichi Teruyama ◽  
Koji Watase

Author(s):  
Andrea Belleri ◽  
Simone Labò

Abstract: The seismic performance of precast portal frames typical of the industrial and commercial sector could generally be improved by providing additional mechanical devices at the beam-to-column joint. Such devices could provide an additional degree of fixity and energy dissipation in a joint generally characterized by a dry hinged connection, adopted to speed up the construction phase. Another advantage of placing additional devices at the beam-to-column joint is that they can act as a fuse, concentrating the seismic damage on a few sacrificial and replaceable elements. A procedure to design precast portal frames adopting additional devices is provided herein. The procedure starts from the Displacement-Based Design methodology proposed by M.J.N. Priestley, and it is applicable to both the design of new structures and the retrofit of existing ones. After the derivation of the required analytical formulations, the procedure is applied to select the additional devices for a new and an existing structural system. Validation through non-linear time history analyses highlights the advantages and drawbacks of the considered devices and demonstrates the effectiveness of the proposed design procedure.


Algorithmica ◽  
2013 ◽  
Vol 71 (2) ◽  
pp. 471-495 ◽  
Author(s):  
Maw-Shang Chang ◽  
Ming-Tat Ko ◽  
Hsueh-I Lu

1996 ◽  
Vol 06 (01) ◽  
pp. 127-136 ◽  
Author(s):  
QIAN-PING GU ◽  
SHIETUNG PENG

In this paper, we give two linear-time algorithms for the node-to-node fault-tolerant routing problem in n-dimensional hypercubes Hn and star graphs Gn. The first algorithm, given at most n−1 arbitrary faulty nodes and two non-faulty nodes s and t in Hn, finds a fault-free path s→t of length at most [Formula: see text] in O(n) time, where d(s, t) is the distance between s and t. Our second algorithm, given at most n−2 faulty nodes and two non-faulty nodes s and t in Gn, finds a fault-free path s→t of length at most d(Gn)+3 in O(n) time, where [Formula: see text] is the diameter of Gn. When the time efficiency of finding the routing path is more important than the length of the path, the algorithms in this paper are better than the previous ones.

