shortest common superstring
Recently Published Documents


TOTAL DOCUMENTS

40
(FIVE YEARS 6)

H-INDEX

7
(FIVE YEARS 1)

2020 ◽  
Author(s):  
Tomasz Kowalski ◽  
Szymon Grabowski

Abstract Motivation FASTQ remains among the widely used formats for high-throughput sequencing data. Despite advances in specialized FASTQ compressors, they are still imperfect in terms of practical performance tradeoffs. Results We present a multi-threaded version of Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. The current version, v1.2, practically preserves the compression ratio and decompression speed of the previous one, reducing the compression time by a factor of about 4–5 on a 6-core/12-thread machine. Availability PgRC 1.2 can be downloaded from https://github.com/kowallus/[email protected]


2020 ◽  
Vol 7 (2) ◽  
pp. 407
Author(s):  
Wisnu Ananta Kusuma ◽  
Albert Adrianus

<p><em>De novo</em> DNA (deoxyribonucleic acid) sequence assembly is a critical step in DNA sequence analysis. It reassembles the DNA fragments (reads) produced by Next Generation Sequencing into a complete genome. The assembly problem can be represented as the Shortest Common Superstring (SCS) problem. Assembly software must detect the regions shared between reads (overlaps), construct an overlap graph, and then find the shortest path through that graph. This method is known as Overlap Layout Consensus (OLC). The most important step in OLC is detecting the overlaps among reads. This study develops a technique for constructing a bidirected overlap graph. A suffix array is used to determine the overlapping part of each read by indexing every suffix of the reads. DNA sequence assembly is computationally intensive. To make the process more efficient, each suffix and prefix is converted into a single unique value, and overlaps are found by comparing the numbers that represent each read. This is more efficient than detecting overlaps by string matching: the comparison shows that the proposed method (value comparison) executes in much less time than string matching. For sets of 2000 and 5000 reads, the proposed technique produces a 100% accurate overlap graph, in which every read is represented as a node and every overlap as an edge.</p>
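The value-comparison idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it builds a plain directed overlap graph rather than a bidirected one, uses no suffix array, and ignores reverse complements. The names `encode`, `overlap_edges`, and `min_len` are hypothetical, and the 2-bit base encoding is an assumption chosen so that two substrings of the same length have equal values exactly when they are equal strings.

```python
from collections import defaultdict

# Assumed encoding: 2 bits per base, so equal-length strings map to distinct integers.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(s: str) -> int:
    """Interpret a DNA string as a base-4 integer."""
    value = 0
    for ch in s:
        value = value * 4 + CODE[ch]
    return value


def overlap_edges(reads, min_len):
    """Map (i, j) to the longest overlap (>= min_len) between a suffix of
    reads[i] and a prefix of reads[j], found by comparing integer values."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")

    # Index every prefix by (length, value); the length is part of the key
    # because strings of different lengths can share a value ("AC" and "C").
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        for n in range(min_len, len(read)):
            prefix_index[(n, encode(read[:n]))].append(j)

    # Look up every suffix value in the prefix index.
    edges = {}
    for i, read in enumerate(reads):
        for n in range(min_len, len(read)):
            for j in prefix_index.get((n, encode(read[-n:])), []):
                if i != j:
                    edges[(i, j)] = max(n, edges.get((i, j), 0))
    return edges


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGA", "CGATTT"]
    print(overlap_edges(reads, 3))
```

Each read is a node and each dictionary entry is a weighted edge. The sketch re-encodes every prefix and suffix from scratch for clarity; a practical version would compute the values incrementally.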


2019 ◽  
Vol 36 (7) ◽  
pp. 2082-2089 ◽  
Author(s):  
Tomasz M Kowalski ◽  
Szymon Grabowski

Abstract Motivation The amount of sequencing data from high-throughput sequencing technologies grows at a pace exceeding the one predicted by Moore’s law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. Results We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 15 and 20% on average, respectively, while being comparably fast in decompression. Availability and implementation PgRC can be downloaded from https://github.com/kowallus/PgRC. Supplementary information Supplementary data are available at Bioinformatics online.
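To make "an approximation of the shortest common superstring" concrete, here is a minimal sketch of the classic greedy merge heuristic. The abstract does not state which approximation PgRC uses, so this is a textbook technique swapped in for illustration only; `overlap` and `greedy_superstring` are hypothetical names, and the quadratic pair search is far too slow for real read sets.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy SCS heuristic: repeatedly merge the pair with the largest overlap."""
    # Drop duplicates, then reads that are substrings of another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]

    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, keeping the overlapping part only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)

    return reads[0] if reads else ""


if __name__ == "__main__":
    print(greedy_superstring(["ACGTAC", "GTACGA", "CGATTT"]))
```

The result contains every input read as a substring. A compressor built on this idea could store the superstring once and then record each read as a position within it, which is the intuition behind a pseudogenome.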


2019 ◽  
Author(s):  
Tomasz Kowalski ◽  
Szymon Grabowski

Abstract Motivation The amount of sequencing data from High-Throughput Sequencing technologies grows at a pace exceeding the one predicted by Moore’s law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. Results We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 18 and 21 percent on average, respectively, while being at least comparably fast in decompression. Availability PgRC can be downloaded from https://github.com/kowallus/[email protected]


Author(s):  
Tristan Braquelaire ◽  
Marie Gasparoux ◽  
Mathieu Raffinot ◽  
Raluca Uricaru

2016 ◽  
Vol 116 (3) ◽  
pp. 245-251 ◽  
Author(s):  
Gabriele Fici ◽  
Tomasz Kociumaka ◽  
Jakub Radoszewski ◽  
Wojciech Rytter ◽  
Tomasz Waleń
