Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

Kirill Kryukov; Mahoko Takahashi Ueda; So Nakagawa; Tadashi Imanishi

doi:10.1093/bioinformatics/btz144

Bioinformatics ◽

10.1093/bioinformatics/btz144 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3826-3828 ◽

Cited By ~ 9

Author(s):

Kirill Kryukov ◽

Mahoko Takahashi Ueda ◽

So Nakagawa ◽

Tadashi Imanishi

Keyword(s):

Open Source ◽

DNA Sequence ◽

Compression Ratio ◽

DNA Sequences ◽

General Purpose ◽

Supplementary Information ◽

File Format ◽

Storage Space ◽

Supplementary Data ◽

Network Transmission

Abstract

Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. The compression ratio of NAF is comparable to that of the best DNA compressors, while its decompression is dramatically faster. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli and zstd.

Availability and implementation: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.

Supplementary information: Supplementary data are available at Bioinformatics online.
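To make the notion of compression ratio concrete, the sketch below measures it for three of the general-purpose compressors named in the abstract, using only the Python standard library. It is an illustration under stated assumptions, not the authors' benchmark: the input is a synthetic random sequence rather than real genomic data, zlib stands in for gzip and lzma for xz, and neither NAF nor the specialized DNA compressors are involved.

```python
import bz2
import lzma
import random
import zlib


def compression_ratios(data: bytes) -> dict:
    """Return original_size / compressed_size for three stdlib codecs."""
    compressors = {
        "gzip (zlib)": lambda d: zlib.compress(d, 9),
        "bzip2": lambda d: bz2.compress(d, 9),
        "xz (lzma)": lambda d: lzma.compress(d),
    }
    return {name: len(data) / len(fn(data)) for name, fn in compressors.items()}


# Assumption: a synthetic, uniformly random sequence in a one-record FASTA file.
rng = random.Random(0)
sequence = "".join(rng.choice("ACGT") for _ in range(100_000))
fasta = (">synthetic_record\n" + sequence + "\n").encode("ascii")

ratios = compression_ratios(fasta)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")

# Lossless means the round trip reproduces the input byte for byte.
assert zlib.decompress(zlib.compress(fasta, 9)) == fasta
```

Real genomes contain repeats and skewed base composition, so ratios measured on actual data will differ from those obtained on this random input.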

Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

10.1101/501130 ◽

2018 ◽

Author(s):

Kirill Kryukov ◽

Mahoko Takahashi Ueda ◽

So Nakagawa ◽

Tadashi Imanishi

Keyword(s):

Open Source ◽

DNA Sequence ◽

Compression Ratio ◽

DNA Sequences ◽

Public Domain ◽

General Purpose ◽

Nucleotide Sequences ◽

File Format ◽

Storage Space ◽

Network Transmission

Abstract

Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. The compression ratio of NAF is comparable to that of the best DNA compressors, while its decompression is dramatically faster. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli and zstd.

Availability: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.

DNA Chisel, a versatile sequence optimizer

Bioinformatics ◽

10.1093/bioinformatics/btaa558 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4508-4509 ◽

Cited By ~ 3

Author(s):

Valentin Zulkower ◽

Susan Rosser

Keyword(s):

Open Source ◽

DNA Sequence ◽

Web Application ◽

Optimization Problems ◽

Supplementary Information ◽

Supplementary Data ◽

Sequence Optimization ◽

Sequence Design ◽

Optimization Framework ◽

DNA Sequence Design

Abstract

Motivation: Accounting for biological and practical requirements in DNA sequence design often results in challenging optimization problems. Current software solutions are problem-specific and hard to combine.

Results: DNA Chisel is an easy-to-use, easy-to-extend sequence optimization framework that lets users freely define and combine optimization specifications via Python scripts or Genbank annotations.

Availability and implementation: The framework is available as a web application (https://cuba.genomefoundry.org/sculpt_a_sequence) or as an open-source Python library (code and documentation at https://github.com/Edinburgh-Genome-Foundry/DNAChisel).

Supplementary information: Supplementary data are available at Bioinformatics online.

ccNetViz: a WebGL-based JavaScript library for visualization of large networks

Bioinformatics ◽

10.1093/bioinformatics/btaa559 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4527-4529

Author(s):

Ales Saska ◽

David Tichy ◽

Robert Moore ◽

Achilles Rasquinha ◽

Caner Akdas ◽

...

Keyword(s):

Systems Biology ◽

Complex Networks ◽

Open Source ◽

High Speed ◽

A Priori ◽

Supplementary Information ◽

Network Visualization ◽

Supplementary Data ◽

Web Based ◽

Flow Of Information

Abstract

Summary: Visualizing a network provides a concise and practical understanding of the information it represents. Open-source web-based libraries help accelerate the creation of biologically based networks and their use. ccNetViz is an open-source, high-speed and lightweight JavaScript library for visualization of large and complex networks. It implements customization and analytical features for easy network interpretation. These features include node statistics as well as edge and node animations, which illustrate the flow of information through a network. Properties can be defined a priori or dynamically imported from models and simulations. ccNetViz is thus a network visualization library particularly suited for systems biology.

Availability and implementation: The ccNetViz library, demos and documentation are freely available at http://helikarlab.github.io/ccNetViz/.

Supplementary information: Supplementary data are available at Bioinformatics online.

The DNA walk and its demonstration of deterministic chaos—relevance to genomic alterations in lung cancer

Bioinformatics ◽

10.1093/bioinformatics/bty1021 ◽

2019 ◽

Vol 35 (16) ◽

pp. 2738-2748 ◽

Cited By ~ 1

Author(s):

Blake Hewelt ◽

Haiqing Li ◽

Mohit Kumar Jolly ◽

Prakash Kulkarni ◽

Isa Mambetsariev ◽

...

Keyword(s):

Lung Cancer ◽

Open Source ◽

Fractal Analysis ◽

DNA Sequences ◽

Chaotic Behavior ◽

Supplementary Information ◽

Wild Type ◽

Genomic Alterations ◽

Turtle Graphics ◽

DNA Walk

Abstract

Motivation: Advancements in cancer genetics have facilitated the development of therapies with actionable mutations. Although mutated genes have been studied extensively, their chaotic behavior has not been appreciated. Thus, in contrast to naïve DNA, mutated DNA sequences can display characteristics of unpredictability and sensitivity to the initial conditions that may be dictated by the environment, expression patterns and presence of other genomic alterations. Employing a DNA walk as a form of 2D analysis of the nucleotide sequence, we demonstrate that chaotic behavior in the sequence of a mutated gene can be predicted.

Results: Using fractal analysis for these DNA walks, we have determined the complexity and nucleotide variance of commonly observed mutated genes in non-small cell lung cancer, and their wild-type counterparts. DNA walks for wild-type genes demonstrate varying levels of chaos, with BRAF, NTRK1 and MET exhibiting greater levels of chaos than KRAS, paxillin and EGFR. Analyzing changes in chaotic properties, such as changes in periodicity and linearity, reveals that while deletion mutations indicate a notable disruption in fractal ‘self-similarity’, fusion mutations demonstrate bifurcations between the two genes. Our results suggest that the fractals generated by DNA walks can yield important insights into potential consequences of these mutated genes.

Availability and implementation: Introduction to Turtle graphics in Python is an open-source article on learning to develop a script for Turtle graphics in Python, freely available on the web at https://docs.python.org/2/library/turtle.html. cDNA sequences were obtained from the NCBI RefSeq database, an open-source database that contains information on a large array of genes, such as their nucleotide and amino acid sequences, freely available at https://www.ncbi.nlm.nih.gov/refseq/. The FracLac plugin for ImageJ is an open-source plugin that performs fractal analysis, free to download at https://imagej.nih.gov/ij/plugins/fraclac/FLHelp/Introduction.html.

Supplementary information: Supplementary data are available at Bioinformatics online.
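The idea of a 2D DNA walk can be sketched in a few lines of Python. The abstract does not state which direction each nucleotide maps to, so the mapping in `STEPS` below is an assumption chosen purely for illustration, and the function name `dna_walk` is hypothetical. The sketch computes the coordinates only; drawing them (for example with Turtle graphics) and the subsequent fractal analysis are outside its scope.

```python
# Assumed mapping from base to unit step; the paper's actual mapping is not given here.
STEPS = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}


def dna_walk(sequence: str) -> list:
    """Return the (x, y) points visited by the walk, starting at the origin."""
    x = y = 0
    path = [(x, y)]
    for base in sequence.upper():
        # Bases outside A/C/G/T (e.g. N) leave the position unchanged.
        dx, dy = STEPS.get(base, (0, 0))
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


walk = dna_walk("ATGGCA")
print(walk[-1])  # final position: (1, 1)
```

Each point in the returned list could be passed to a plotting routine to obtain the kind of trace that the fractal analysis is then applied to.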

DrawGlycan-SNFG and gpAnnotate: rendering glycans and annotating glycopeptide mass spectra

Bioinformatics ◽

10.1093/bioinformatics/btz819 ◽

2019 ◽

Cited By ~ 4

Author(s):

Kai Cheng ◽

Gabrielle Pawlowski ◽

Xinheng Yu ◽

Yusen Zhou ◽

Sriram Neelamegham

Keyword(s):

Mass Spectrometry ◽

Open Source ◽

Mass Spectra ◽

Supplementary Information ◽

Supplementary Data ◽

International Union ◽

Open Source Program ◽

Source Program ◽

Wide Range ◽

Peptide Modifications

Abstract

Summary: This manuscript describes an open-source program, DrawGlycan-SNFG (version 2), that accepts IUPAC (International Union of Pure and Applied Chemistry)-condensed inputs to render Symbol Nomenclature For Glycans (SNFG) drawings. A wide range of local and global options enable display of various glycan/peptide modifications including bond breakages, adducts, repeat structures, ambiguous identifications, etc. These facilities make DrawGlycan-SNFG ideal for integration into various glycoinformatics software, including glycomics and glycoproteomics mass spectrometry (MS) applications. As a demonstration of such usage, we incorporated DrawGlycan-SNFG into gpAnnotate, a standalone application to score and annotate individual MS/MS glycopeptide spectra in different fragmentation modes.

Availability and implementation: DrawGlycan-SNFG and gpAnnotate are platform independent. While originally coded using MATLAB, compiled packages are also provided to enable DrawGlycan-SNFG implementation in Python and Java. All programs are available from https://virtualglycome.org/drawglycan and https://virtualglycome.org/gpAnnotate.

Supplementary information: Supplementary data are available at Bioinformatics online.

GlycanFormatConverter: a conversion tool for translating the complexities of glycans

Bioinformatics ◽

10.1093/bioinformatics/bty990 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2434-2440 ◽

Cited By ~ 7

Author(s):

Shinichiro Tsuchiya ◽

Issaku Yamada ◽

Kiyoko F Aoki-Kinoshita

Keyword(s):

Open Source ◽

Source Code ◽

Supplementary Information ◽

Biological Processes ◽

Supplementary Data ◽

Unique Representation ◽

Open Source Tool ◽

Living Organisms ◽

Conversion Tool ◽

Complex Glycan

Abstract

Motivation: Glycans are biomolecules that play an important role in the biological processes of living organisms. They form diverse, complicated structures such as branched and cyclic forms. Web3 Unique Representation of Carbohydrate Structures (WURCS) was proposed as a new linear notation for uniquely representing glycans during the GlyTouCan project. WURCS defines rules for complex glycan structures that other text formats did not support, and so it can represent a wide variety of glycans. However, WURCS uses a complicated nomenclature, so it is not human-readable. Therefore, we aimed to support the interpretation of WURCS by converting WURCS to IUPAC, the most basic and widely used format.

Results: In this study, we developed GlycanFormatConverter and succeeded in converting WURCS to the three kinds of IUPAC formats (IUPAC-Extended, IUPAC-Condensed and IUPAC-Short). Furthermore, we have implemented functionality to import IUPAC-Extended, KEGG Chemical Function (KCF) and LinearCode formats and to export WURCS. We have thoroughly tested GlycanFormatConverter and were able to show that it could convert all the glycans registered in the GlyTouCan repository, with exceptions owing only to the limitations of the original format. The source code for this conversion tool has been released as an open-source tool.

Availability and implementation: https://github.com/glycoinfo/GlycanFormatConverter.git

Supplementary information: Supplementary data are available at Bioinformatics online.

GPress: a framework for querying general feature format (GFF) files and expression files in a compressed form

Bioinformatics ◽

10.1093/bioinformatics/btaa604 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4810-4812

Author(s):

Qingxi Meng ◽

Idoia Ochoa ◽

Mikel Hernaez

Keyword(s):

Single Cell ◽

Data Streams ◽

General Feature ◽

Supplementary Information ◽

Storage Space ◽

Supplementary Data ◽

RNA-Seq ◽

Sequencing Data ◽

General Feature Format ◽

Original File

Abstract

Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, and significantly increase it in some cases. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data, which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval.

Results: We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run on a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average.

Availability and implementation: GPress is freely available at https://github.com/qm2/gpress.

Supplementary information: Supplementary data are available at Bioinformatics online.
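The block-plus-index idea mentioned in the abstract can be illustrated with a small Python sketch. It is not GPress itself: zlib stands in for the BSC compressor, the records are toy tab-separated lines rather than real GFF, the identifier is assumed to sit in the first column, and the names `compress_in_blocks` and `query` are hypothetical.

```python
import zlib


def compress_in_blocks(lines, block_size=2):
    """Compress records block by block and index identifiers to block numbers."""
    blocks, index = [], {}
    for start in range(0, len(lines), block_size):
        chunk = lines[start:start + block_size]
        block_no = len(blocks)
        blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))
        for line in chunk:
            feature_id = line.split("\t")[0]  # assumption: identifier is column 1
            index.setdefault(feature_id, set()).add(block_no)
    return blocks, index


def query(blocks, index, feature_id):
    """Decompress only the blocks that contain the requested identifier."""
    hits = []
    for block_no in sorted(index.get(feature_id, ())):
        text = zlib.decompress(blocks[block_no]).decode("utf-8")
        hits.extend(l for l in text.split("\n") if l.split("\t")[0] == feature_id)
    return hits


records = [
    "geneA\t100\t200",
    "geneB\t300\t400",
    "geneA\t500\t600",
    "geneC\t700\t800",
]
blocks, index = compress_in_blocks(records, block_size=2)
print(query(blocks, index, "geneA"))
```

The trade-off in such a design is block size: larger blocks usually compress better, while smaller blocks mean less data has to be decompressed per query.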

FQSqueezer: k-mer-based compression of sequencing data

10.1101/559807 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz

Keyword(s):

Data Compression ◽

State Of The Art ◽

Genomic Data ◽

General Purpose ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Partial Matching ◽

Supplementary Material ◽

Better Than

Abstract

Motivation: The amount of genomic data that needs to be stored is huge. Therefore, it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives.

Results: We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the well-known prediction by partial matching and dynamic Markov coder algorithms from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by the state-of-the-art tools.

Availability and implementation: https://github.com/refresh-bio/

Supplementary information: Supplementary data are available at publisher’s Web site.
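Prediction by partial matching rests on the observation that the symbols seen so far (the context) help predict the next one. The Python sketch below shows only that context-counting idea with a fixed order k; it is not FQSqueezer's algorithm, it performs no actual entropy coding, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_context_model(sequence: str, k: int):
    """Count which symbol follows each length-k context in the sequence."""
    model = defaultdict(Counter)
    for i in range(k, len(sequence)):
        model[sequence[i - k:i]][sequence[i]] += 1
    return model


def predict_next(model, context: str):
    """Return the most frequent follower of the context, or None if unseen."""
    counts = model.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


model = build_context_model("ACGTACGTACGA", k=3)
print(predict_next(model, "ACG"))  # "T": seen twice after ACG, versus "A" once
print(predict_next(model, "GGG"))  # None: this context never occurred
```

A compressor built on such a model would feed the per-context counts to an arithmetic coder, so that well-predicted symbols cost fewer bits; a full PPM scheme additionally falls back to shorter contexts when the current one has not been seen.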

DNA Chisel, a versatile sequence optimizer

10.1101/2019.12.16.877480 ◽

2019 ◽

Author(s):

Valentin Zulkower ◽

Susan Rosser

Keyword(s):

Open Source ◽

DNA Sequence ◽

Web Application ◽

Optimization Problems ◽

Supplementary Information ◽

Sequence Optimization ◽

Sequence Design ◽

Optimization Framework ◽

Link Type ◽

DNA Sequence Design

Abstract

Motivation: Accounting for biological and practical requirements in DNA sequence design often results in challenging optimization problems. Current software solutions are problem-specific and hard to combine.

Results: DNA Chisel is an easy-to-use, easy-to-extend sequence optimization framework that lets users freely define and combine optimization specifications via Python scripts or Genbank annotations.

Availability: As a web application (https://cuba.genomefoundry.org/sculpt_a_sequence) or open-source Python library (code and documentation at https://github.com/Edinburgh-Genome-Foundry/DNAChisel).

Supplementary information: Attached.

CNN-PepPred: An open-source tool to create convolutional NN models for the discovery of patterns in peptide sets. Application to peptide-MHC class II binding prediction

Bioinformatics ◽

10.1093/bioinformatics/btab687 ◽

2021 ◽

Author(s):

Valentin Junet ◽

Xavier Daura

Keyword(s):

Neural Networks ◽

Open Source ◽

MHC Class II ◽

Convolutional Neural Networks ◽

Operating Systems ◽

Class II ◽

Supplementary Information ◽

Supplementary Data ◽

Binding Prediction ◽

Open Source Tool

Abstract

Summary: The ability to unveil binding patterns in peptide sets has important applications in several biomedical areas, including the development of vaccines. We present an open-source tool, CNN-PepPred, that uses convolutional neural networks to discover such patterns, along with its application to peptide-HLA class II binding prediction. The tool can be used locally on different operating systems, with CPUs or GPUs, to train, evaluate, apply and visualize models.

Availability and implementation: CNN-PepPred is freely available as a Python tool with a detailed User’s Guide at https://github.com/ComputBiol-IBB/CNN-PepPred.

Supplementary information: Supplementary data are available at Bioinformatics online.
