A New Lossless DNA Compression Algorithm Based on A Single-Block Encoding Scheme

Algorithms, 2020, Vol. 13 (4), pp. 99
Author(s): Deloula Mansouri, Xiaohui Yuan, Abdeldjalil Saidani

With the rapid evolution of DNA sequencing technology, a massive amount of genomic data, mainly DNA sequences, is produced every day, demanding ever more storage and bandwidth. Managing, analyzing and, in particular, storing these large amounts of data has become a major scientific challenge in bioinformatics, and compression has therefore become necessary. In this paper, we describe a new reference-free DNA compressor, abbreviated DNAC-SBE. DNAC-SBE is a lossless hybrid compressor that consists of three phases. First, starting from the most frequent base (Bi), the positions of each Bi are replaced with ones and the positions of the remaining bases, whose frequencies are smaller than that of Bi, are replaced with zeros. Second, to encode the generated streams, we propose a new single-block encoding scheme (SBE) that exploits the positions of neighboring bits within each block using two different techniques. Finally, the algorithm dynamically assigns the shorter of the two resulting codes to each block. Results show that DNAC-SBE outperforms state-of-the-art compressors in terms of storage space and data transfer rate, regardless of file format or data size, while satisfying the special conditions imposed on the compressed data.
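To make the first phase concrete, the following minimal Python sketch (our own illustration under stated assumptions, not the authors' implementation) shows how a DNA sequence can be turned into binary position streams, processing bases in decreasing order of frequency and removing each base's positions before the next base is handled:

from collections import Counter

def base_position_streams(sequence):
    """Illustrative sketch (not the authors' code) of DNAC-SBE phase 1:
    bases are processed in decreasing order of frequency; for each base Bi,
    a binary stream marks its positions with 1 and the positions of the
    remaining, less frequent bases with 0. Positions of Bi are then removed
    before the next, less frequent base is processed."""
    counts = Counter(sequence)
    order = [b for b, _ in counts.most_common()]
    streams = {}
    remaining = sequence
    for base in order[:-1]:  # the last base's positions are implied
        streams[base] = ''.join('1' if c == base else '0' for c in remaining)
        remaining = ''.join(c for c in remaining if c != base)
    return order, streams

if __name__ == "__main__":
    order, streams = base_position_streams("ATTGCAATACCGA")
    print("base order:", order)
    for base, bits in streams.items():
        print(base, bits)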

2019, Vol. 15 (01), pp. 1-8
Author(s): Ashish C Patel, C G Joshi

Current data storage technologies cannot keep pace with the exponentially growing amount of data generated by the extensive use of social networking, photos, media and so on. The "digital world", at 4.4 zettabytes in 2013, is predicted to reach 44 zettabytes by 2020. For the past 30 years, scientists and researchers have been trying to develop a robust way of storing data on a medium that is dense and long-lasting, and have found DNA to be the most promising storage medium. Unlike existing storage devices, DNA requires no maintenance, except that it must be kept in a cool, dark place. DNA is small and extremely dense: just 1 gram of dry DNA can store about 455 exabytes of data. DNA stores information using four bases, viz. A, T, G and C, whereas CDs, hard disks and other devices store information as 0s and 1s on spiral tracks. In DNA-based storage, after a digital file is converted to binary code, encoding and decoding are the key steps. Once the digital file is encoded, the next step is to synthesize arbitrary single-stranded DNA sequences, which can be stored in a deep freeze until use. When the information needs to be recovered, this can be done by DNA sequencing. Next-generation sequencing (NGS) can produce sequences at very high throughput and at a much lower cost than the first sequencing technologies, at less than about 0.1 USD per MB of data. Post-sequencing processing includes alignment of all reads using multiple sequence alignment (MSA) algorithms to obtain consensus sequences. The consensus sequence is decoded by reversing the encoding process. Most prior DNA data storage efforts sequenced and decoded the entire amount of stored digital information with no random access, but it has now become possible to extract selected files (e.g., retrieving only a required image from a collection) from a DNA pool using PCR-based random access. Scientists have reported storing up to 110 zettabytes of data in one gram of DNA. In the future, with efficient encoding, error correction, and cheaper DNA synthesis and sequencing, DNA-based storage will become a practical solution for storing exponentially growing digital data.
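As a simple illustration of the encoding and decoding steps described above, the Python sketch below maps binary data to bases with a fixed 2-bits-per-nucleotide code. This is a toy scheme of our own for illustration only; practical systems add addressing, GC-balance and homopolymer constraints, and error-correcting codes.

# Toy illustration of DNA data storage encoding/decoding: 2 bits per base.
# Real systems add addressing, sequence constraints and error correction.
BIT2BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE2BIT = {b: k for k, b in BIT2BASE.items()}

def encode(data: bytes) -> str:
    bits = ''.join(f"{byte:08b}" for byte in data)
    return ''.join(BIT2BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(dna: str) -> bytes:
    bits = ''.join(BASE2BIT[b] for b in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

if __name__ == "__main__":
    payload = b"DNA storage"
    strand = encode(payload)
    assert decode(strand) == payload
    print(strand)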


Electronics, 2021, Vol. 10 (15), pp. 1807
Author(s): Sascha Grollmisch, Estefanía Cano

Including unlabeled data in the training process of neural networks using Semi-Supervised Learning (SSL) has shown impressive results in the image domain, where state-of-the-art results were obtained with only a fraction of the labeled data. Recent SSL methods have in common a strong reliance on the augmentation of unannotated data, which remains largely unexplored for audio data. In this work, SSL using the state-of-the-art FixMatch approach is evaluated on three audio classification tasks, including music, industrial sounds, and acoustic scenes. The performance of FixMatch is compared to Convolutional Neural Networks (CNN) trained from scratch, Transfer Learning, and SSL using the Mean Teacher approach. Additionally, a simple yet effective approach for selecting suitable augmentation methods for FixMatch is introduced. FixMatch with the proposed modifications always outperformed Mean Teacher and the CNNs trained from scratch. For the industrial sounds and music datasets, the CNN baseline performance using the full dataset was reached with less than 5% of the initial training data, demonstrating the potential of recent SSL methods for audio data. Transfer Learning outperformed FixMatch only on the most challenging dataset, acoustic scene classification, showing that there is still room for improvement.
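For readers unfamiliar with FixMatch, the following framework-agnostic Python/NumPy sketch is our own simplification, not the authors' code; the model, the augmentation functions and the confidence threshold are placeholders. It shows only the core pseudo-labeling rule: confident predictions on weakly augmented unlabeled examples become hard targets for their strongly augmented versions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fixmatch_unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug, threshold=0.95):
    """Core FixMatch rule (simplified): pseudo-label confident predictions on
    weakly augmented inputs and train the strongly augmented inputs towards them."""
    probs_weak = softmax(model(weak_aug(x_unlabeled)))
    pseudo_labels = probs_weak.argmax(axis=1)
    mask = probs_weak.max(axis=1) >= threshold           # keep only confident samples
    probs_strong = softmax(model(strong_aug(x_unlabeled)))
    ce = -np.log(probs_strong[np.arange(len(x_unlabeled)), pseudo_labels] + 1e-12)
    return (ce * mask).mean()                            # unconfident samples contribute 0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 10))
    model = lambda x: x @ W                               # placeholder linear "model"
    weak = lambda x: x + 0.01 * rng.normal(size=x.shape)  # placeholder augmentations
    strong = lambda x: x + 0.10 * rng.normal(size=x.shape)
    x_u = rng.normal(size=(32, 64))
    print(fixmatch_unlabeled_loss(model, x_u, weak, strong))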


Biomimetics, 2021, Vol. 6 (2), pp. 32
Author(s): Tomasz Blachowicz, Jacek Grzybowski, Pawel Steblinski, Andrea Ehrmann

Computers nowadays have separate components for data storage and data processing, making data transfer between these units a bottleneck for computing speed. So-called cognitive (or neuromorphic) computing approaches therefore try to combine both tasks, as is done in the human brain, to make computing faster and less energy-consuming. One possible route to new hardware for neuromorphic computing is nanofiber networks, which can be prepared by diverse methods, from lithography to electrospinning. Here, we show results of micromagnetic simulations of three coupled semicircle fibers in which domain walls are excited by rotating magnetic fields (inputs), leading to different output signals that can be used for stochastic data processing, mimicking biological synaptic activity and thus being suitable as artificial synapses in artificial neural networks.


2002, Vol. 41 (Part 1, No. 3B), pp. 1804-1807
Author(s): Gakuji Hashimoto, Hiroki Shima, Kenji Yamamoto, Tsutomu Maruyama, Takashi Nakao, ...

mSystems, 2018, Vol. 3 (3)
Author(s): Gabriel A. Al-Ghalith, Benjamin Hillmann, Kaiwei Ang, Robin Shields-Cutler, Dan Knights

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads into longer or higher-quality contigs. Many tools exist for each step, but choosing appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or the type of adaptors used, and so cannot make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced "shizen"), which aims to simplify quality control of short-read data for the end user by predicting the presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control.

IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to select and execute numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.
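To illustrate one of the QC steps mentioned above, the Python sketch below performs a generic 3'-end quality trim based on Phred scores. This is a deliberately simple example of our own, with an assumed threshold parameter; it is not SHI7's actual trimming algorithm.

def trim_3prime(seq: str, quals: list, min_q: int = 20):
    """Generic 3'-end quality trimming (an illustration of one QC step,
    not shi7's actual algorithm): cut the read after the last position
    whose Phred score is still >= min_q."""
    cut = len(quals)
    while cut > 0 and quals[cut - 1] < min_q:
        cut -= 1
    return seq[:cut], quals[:cut]

if __name__ == "__main__":
    read = "ACGTACGTAC"
    phred = [38, 37, 36, 35, 30, 28, 22, 15, 9, 3]
    print(trim_3prime(read, phred))   # -> ('ACGTACG', [38, 37, 36, 35, 30, 28, 22])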


Author(s): Ivan Mozghovyi, Anatoliy Sergiyenko, Roman Yershov

Increasing requirements for data transfer and storage are among the crucial questions today. There are several ways of achieving high-speed data transmission, but each meets only the limited requirements of its narrowly focused target. Data compression offers a solution to the problems of high-speed transfer and low-volume data storage. This paper is devoted to the compression of GIF images using a modified LZW algorithm with a tree-based dictionary. This modification decreases lookup time and increases the speed of data compression, and in turn allows a hardware compression accelerator to be developed in future research.
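For orientation, the following minimal Python sketch shows LZW compression with the dictionary held as a trie of (parent code, next byte) transitions, which is the general idea behind a tree-based dictionary; it is our own simplified illustration and does not reproduce the paper's modification or its hardware-oriented design.

def lzw_compress(data: bytes):
    """Minimal LZW sketch with the dictionary stored as a trie:
    each entry maps (parent code, next byte) -> new code, so a lookup
    is a single dictionary probe instead of a string comparison."""
    trie = {}                      # (parent_code, byte) -> code
    next_code = 256                # codes 0..255 are the single-byte roots
    codes = []
    current = data[0]              # code of the current trie node
    for byte in data[1:]:
        key = (current, byte)
        if key in trie:
            current = trie[key]    # extend the current phrase
        else:
            codes.append(current)  # emit the longest known phrase
            trie[key] = next_code
            next_code += 1
            current = byte         # restart from the single-byte node
    codes.append(current)
    return codes

if __name__ == "__main__":
    print(lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT"))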


Author(s): Yongyi Tang, Lin Ma, Lianqiang Zhou

Appearance and motion are two key components used to depict and characterize video content. Currently, two-stream models achieve state-of-the-art performance on video classification. However, extracting motion information, specifically in the form of optical flow features, is extremely computationally expensive, especially for large-scale video classification. In this paper, we propose a motion hallucination network, MoNet, that imagines optical flow features from appearance features without relying on optical flow computation. Specifically, MoNet models the temporal relationships of the appearance features and exploits the contextual relationships of the optical flow features with concurrent connections. Extensive experimental results demonstrate that the proposed MoNet can effectively and efficiently hallucinate optical flow features, which, together with the appearance features, consistently improve video classification performance. Moreover, MoNet can cut the computational and data-storage burden of two-stream video classification almost in half. Our code is available at: https://github.com/YongyiTang92/MoNet-Features
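The general idea, pictured very loosely in the NumPy sketch below, is that a small learned mapping can stand in for the flow branch: appearance features are projected to surrogate "flow" features and fused with the appearance stream for classification. This is our own toy illustration with invented dimensions and random weights, not MoNet's actual architecture or training procedure.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Toy illustration of the two-stream idea behind flow hallucination
# (not MoNet's architecture): an MLP maps appearance features to surrogate
# "flow" features so the costly optical-flow branch can be skipped.
D_APP, D_FLOW, N_CLASSES = 128, 64, 10
W_hall = rng.normal(scale=0.1, size=(D_APP, D_FLOW))      # hallucination weights
W_cls = rng.normal(scale=0.1, size=(D_APP + D_FLOW, N_CLASSES))

def classify(appearance_feats):
    """appearance_feats: (num_frames, D_APP) per-frame appearance features."""
    hallucinated_flow = relu(appearance_feats @ W_hall)    # stands in for real flow features
    fused = np.concatenate([appearance_feats, hallucinated_flow], axis=1)
    logits = fused.mean(axis=0) @ W_cls                    # average-pool frames, then classify
    return logits.argmax()

print(classify(rng.normal(size=(16, D_APP))))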


2021, Vol. 102, pp. 04013
Author(s): Md. Atiqur Rahman, Mohamed Hamada

Modern daily life activities produce a great deal of information, thanks to advances in telecommunication. Storing this information on digital devices or transmitting it over the Internet is a challenging issue, which leads to the necessity of data compression; research on data compression has thus become a topic of great interest. Since compressed data is generally smaller than the original, data compression saves storage and increases transmission speed. In this article, we propose a text compression technique using the GPT-2 language model and Huffman coding. In the proposed method, the Burrows-Wheeler transform and a list of keys are used to reduce the length of the original text file; the GPT-2 language model and then Huffman coding are applied for encoding. The proposed method is compared with state-of-the-art text compression techniques and shows a gain in compression ratio over them.
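For reference, the Python sketch below builds a Huffman code table, i.e. only the final stage of the pipeline described above; the Burrows-Wheeler transform and GPT-2 stages are not reproduced here, and the sample text is our own.

import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Minimal Huffman coding sketch: repeatedly merge the two least
    frequent subtrees, prefixing their codewords with 0 and 1."""
    heap = [[freq, i, {sym: ""}] for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, [f1 + f2, counter, merged])
        counter += 1
    return heap[0][2]

if __name__ == "__main__":
    sample = "text compression with huffman coding"
    codes = huffman_codes(sample)
    encoded = "".join(codes[ch] for ch in sample)
    print(f"{len(sample) * 8} bits -> {len(encoded)} bits")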


2020, Vol. 14 (7), pp. 635-641
Author(s): Dolly Sharma, Ranjit Kumar, Mayuri Gupta, Tanisha Saxena

2020, Vol. 25 (1), pp. 57-64
Author(s): Pisarenko V., Pisarenko U., Koval A., Varava I.A., ...

A feature of the agro-industrial sphere is that production or research sites are often distributed across areas remote from one another, while the center that collects and processes the information is, as a rule, concentrated in one compact place. For research institutions, this often becomes a rather urgent problem that requires new, innovative approaches. The paper proposes elements of a concept and technological solutions for the operational transfer of field research data from agricultural areas to a remote database for storage, with the possibility of feedback. As an example, the procedure of qualification examination of plant varieties was chosen, with determination of the criteria of "distinctness, uniformity and stability" and "suitability for propagation of the variety in Ukraine".

