Super-robust data storage in DNA by de Bruijn graph-based decoding

AbstractHigh density and long-term features make DNA data storage a potential media. However, DNA data channel is a unique channel with unavoidable ‘data reputations’ in the forms of multiple error-rich strand copies. This multi-copy feature cannot be well harnessed by available codec systems optimized for single-copy media. Furthermore, lacking an effective mechanism to handle base shift issues, these systems perform poorly with indels. Here, we report the efficient reconstruction of DNA strands from multiple error-rich sequences directly, utilizing a De Bruijn Graph-based Greedy Path Search (DBG-GPS) algorithm. DBG-GPS can take advantage of the multi-copy feature for efficient correction of indels as well as substitutions. As high as 10% of errors can be accurately corrected with a high coding rate of 96.8%. Accurate data recovery with low quality, deep error-prone PCR products proved the high robustness of DBG-GPS (314Kb, 12K oligos). Furthermore, DBG-GPS shows 50 times faster than the clustering and multiple alignment-based methods reported. The revealed linear decoding complexity makes DBG-GPS a suitable solution for large-scale data storage. DBG-GPS’s capacity with large data was verified by large-scale simulations (300 MB). A Python implementation of DBG-GPS is available at https://switch-codes.coding.net/public/switch-codes/DNA-Fountain-De-Bruijn-Decoding/git/files.

Download Full-text

Robust data storage in DNA by de Bruijn graph-based decoding

10.21203/rs.3.rs-382900/v1 ◽

2021 ◽

Author(s):

Lifu Song ◽

Feng Geng ◽

Ziyi Song ◽

Bing-Zhi Li ◽

Ying-Jin Yuan

Keyword(s):

Data Storage ◽

Large Scale ◽

Search Algorithm ◽

De Bruijn Graph ◽

Large Scale Data ◽

Dna Strands ◽

Pcr Products ◽

Path Search ◽

De Bruijn ◽

Linear Decoding

Abstract Data storage in DNA, which store information in polymers, is a potential technology with high density and long-term features. However, the indels, strand rearrangements, and strand breaks that emerged during synthesis, amplification, sequencing, and storage of DNA molecules need to be handled. Here, we report a de Bruijn graph-based, greedy path search algorithm (DBG-GPS), which can efficiently handle all these issues by efficient reconstruction of the DNA strands. DBG-GPS achieves accurate data recovery with low-quality, deep error-prone PCR products, and accelerated aged DNA samples (solution, 70℃ for two weeks). The robustness of DBG-GPS was verified with 100 times of multiple retrievals using PCR products with massive unspecific amplifications. Moreover, DBG-GPS shows linear decoding complexity and more than 100 times faster than the multiple alignment-based methods, indicating a suitable solution for large-scale data storage.

Download Full-text

SOME EXPERIMENTS ON THE CONSTRUCTION AND ANALYSIS OF THE DE BRUIJN GRAPH

System Informatics ◽

10.31144/si.2307-6410.2020.n16.p47-56 ◽

2020 ◽

Author(s):

Alexander G. Marchuk ◽

◽

Sergey Nikolaevich Troshkov ◽

Keyword(s):

Data Storage ◽

Parallel Computations ◽

De Bruijn Graph ◽

Distributed Data ◽

Distributed Data Storage ◽

De Bruijn

This paper describes the experience of solving the problem of finding chains in the De Bruijn graph using parallel computations and distributed data storage.

Download Full-text

Data Storage, Retrieval and Management

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Large-Scale Distributed Computing and Applications ◽

10.4018/978-1-61520-703-9.ch006 ◽

2010 ◽

pp. 111-140

Author(s):

Valentin Cristea ◽

Ciprian Dobre ◽

Corina Stratan ◽

Florin Pop

Keyword(s):

Data Storage ◽

Resource Sharing ◽

High Performance ◽

Large Scale ◽

Workflow Management ◽

Large Data ◽

Data Retrieval ◽

Distributed Data Storage ◽

Processing Power ◽

Data Transfers

The latest advances in network and distributedsystem technologies now allow integration of a vast variety of services with almost unlimited processing power, using large amounts of data. Sharing of resources is often viewed as the key goal for distributed systems, and in this context the sharing of stored data appears as the most important aspect of distributed resource sharing. Scientific applications are the first to take advantage of such environments as the requirements of current and future high performance computing experiments are pressing, in terms of even higher volumes of issued data to be stored and managed. While these new environments reveal huge opportunities for large-scale distributed data storage and management, they also raise important technical challenges, which need to be addressed. The ability to support persistent storage of data on behalf of users, the consistent distribution of up-to-date data, the reliable replication of fast changing datasets or the efficient management of large data transfers are just some of these new challenges. In this chapter we discuss how the existing distributed computing infrastructure is adequate for supporting the required data storage and management functionalities. We highlight the issues raised from storing data over large distributed environments and discuss the recent research efforts dealing with challenges of data retrieval, replication and fast data transfers. Interaction of data management with other data sensitive, emerging technologies as the workflow management is also addressed.

Download Full-text

A Review: Map Reduce Framework for Cloud Computing

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.6.20224 ◽

2018 ◽

Vol 7 (4.6) ◽

pp. 13

Author(s):

Mekala Sandhya ◽

Ashish Ladda ◽

Dr. Uma N Dulhare ◽

. . ◽

. .

Keyword(s):

Data Mining ◽

Cloud Computing ◽

Distributed Computing ◽

Data Storage ◽

High Performance ◽

Large Scale ◽

Distributed Storage ◽

Large Data ◽

Mass Data ◽

Internet Information

In this generation of Internet, information and data are growing continuously. Even though various Internet services and applications. The amount of information is increasing rapidly. Hundred billions even trillions of web indexes exist. Such large data brings people a mass of information and more difficulty discovering useful knowledge in these huge amounts of data at the same time. Cloud computing can provide infrastructure for large data. Cloud computing has two significant characteristics of distributed computing i.e. scalability, high availability. The scalability can seamlessly extend to large-scale clusters. Availability says that cloud computing can bear node errors. Node failures will not affect the program to run correctly. Cloud computing with data mining does significant data processing through high-performance machine. Mass data storage and distributed computing provide a new method for mass data mining and become an effective solution to the distributed storage and efficient computing in data mining.

Download Full-text

Resolving Clinicians’ Queries Across a Grid’s Infrastructure

Methods of Information in Medicine ◽

10.1055/s-0038-1633936 ◽

2005 ◽

Vol 44 (02) ◽

pp. 149-153 ◽

Cited By ~ 2

Author(s):

F. Estrella ◽

C. del Frate ◽

T. Hauer ◽

M. Odeh ◽

D. Rogulin ◽

...

Keyword(s):

Image Analysis ◽

Data Storage ◽

Medical Image ◽

Large Scale ◽

Medical Image Analysis ◽

Large Data ◽

Computing Power ◽

European Database ◽

Order Of Magnitude ◽

Network Speed

Summary Objectives: The past decade has witnessed order of magnitude increases in computing power, data storage capacity and network speed, giving birth to applications which may handle large data volumes of increased complexity, distributed over the internet. Methods: Medical image analysis is one of the areas for which this unique opportunity likely brings revolutionary advances both for the scientist’s research study and the clinician’s everyday work. Grids [1] computing promises to resolve many of the difficulties in facilitating medical image analysis to allow radiologists to collaborate without having to co-locate. Results: The EU-funded MammoGrid project [2] aims to investigate the feasibility of developing a Grid-enabled European database of mammograms and provide an information infrastructure which federates multiple mammogram databases. This will enable clinicians to develop new common, collaborative and co-operative approaches to the analysis of mammographic data. Conclusion: This paper focuses on one of the key requirements for large-scale distributed mammogram analysis: resolving queries across a grid-connected federation of images.

Download Full-text

Experimental Demonstration of Hopfield Neural Network using DNA molecules

MRS Proceedings ◽

10.1557/opl.2011.868 ◽

2011 ◽

Vol 1346 ◽

Author(s):

Hayri E. Akin ◽

Dundar Karabay ◽

Allen P. Mills ◽

Cengiz S. Ozkan ◽

Mihrimah Ozkan

Keyword(s):

Neural Network ◽

Data Storage ◽

Large Scale ◽

Fault Tolerant ◽

Hamiltonian Path ◽

Hopfield Neural Network ◽

Dna Molecules ◽

Processing Power ◽

Dna Strands ◽

Molecular Reactions

ABSTRACTDNA Computing is a rapidly-developing interdisciplinary area which could benefit from more experimental results to solve problems with the current biological tools. In this study, we have integrated microelectronics and molecular biology techniques for showing the feasibility of Hopfield Neural Network using DNA molecules. Adleman’s seminal paper in 1994 showed that DNA strands using specific molecular reactions can be used to solve the Hamiltonian Path Problem. This accomplishment opened the way for possibilities of massively parallel processing power, remarkable energy efficiency and compact data storage ability with DNA. However, in various studies, small departures from the ideal selectivity of DNA hybridization lead to significant undesired pairings of strands and that leads to difficulties in schemes for implementing large Boolean functions using DNA. Therefore, these error prone reactions in the Boolean architecture of the first DNA computers will benefit from fault tolerance or error correction methods and these methods would be essential for large scale applications. In this study, we demonstrate the operation of six dimensional Hopfield associative memory storing various memories as an archetype fault tolerant neural network implemented using DNA molecular reactions. The response of the network suggests that the protocols could be scaled to a network of significantly larger dimensions. In addition the results are read on a Silicon CMOS platform exploiting the semiconductor processing knowledge for fast and accurate hybridization rates.

Download Full-text

A Survey on Big IoT Data Indexing: Potential Solutions, Recent Advancements, and Open Issues

Future Internet ◽

10.3390/fi14010019 ◽

2021 ◽

Vol 14 (1) ◽

pp. 19

Author(s):

Zineddine Kouahla ◽

Ala-Eddine Benrazek ◽

Mohamed Amine Ferrag ◽

Brahim Farou ◽

Hamid Seridi ◽

...

Keyword(s):

Data Storage ◽

Large Scale ◽

Search Time ◽

Large Data ◽

Open Problems ◽

Large Scale Data ◽

Indexing Techniques ◽

Efficient Retrieval ◽

Data Collections ◽

Scale Data

The past decade has been characterized by the growing volumes of data due to the widespread use of the Internet of Things (IoT) applications, which introduced many challenges for efficient data storage and management. Thus, the efficient indexing and searching of large data collections is a very topical and urgent issue. Such solutions can provide users with valuable information about IoT data. However, efficient retrieval and management of such information in terms of index size and search time require optimization of indexing schemes which is rather difficult to implement. The purpose of this paper is to examine and review existing indexing techniques for large-scale data. A taxonomy of indexing techniques is proposed to enable researchers to understand and select the techniques that will serve as a basis for designing a new indexing scheme. The real-world applications of the existing indexing techniques in different areas, such as health, business, scientific experiments, and social networks, are presented. Open problems and research challenges, e.g., privacy and large-scale data mining, are also discussed.

Download Full-text

Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

10.1101/2020.10.21.349605 ◽

2020 ◽

Author(s):

Jamshed Khan ◽

Rob Patro

Keyword(s):

Large Scale ◽

De Bruijn Graph ◽

Comparative Genomic ◽

De Bruijn Graphs ◽

Long Reads ◽

Genomic Analyses ◽

Finite State ◽

De Bruijn ◽

Colored De Bruijn Graph ◽

Memory Compaction

AbstractMotivationThe construction of the compacted de Bruijn graph from a large collection of reference genomes is a task of increasing interest in genomic analyses. For example, compacted colored reference de Bruijn graphs are increasingly used as sequence indices for the purposes of alignment of short and long reads. Also, as we sequence and assemble a greater diversity of individual genomes, the compacted colored de Bruijn graph can be used as the basis for methods aiming to perform comparative genomic analyses on these genomes. While algorithms have been developed to construct the compacted colored de Bruijn graph from reference sequences, there is still room for improvement, especially in the memory and the runtime performance as the number and the scale of the genomes over which the de Bruijn graph is built grow.ResultsWe introduce a new algorithm, implemented in the tool Cuttlefish, to construct the colored compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel modeling scheme of the de Bruijn graph vertices as finite-state automata, and constrains the state-space for the automata to enable tracking of their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that the algorithm scales much better than existing approaches, especially as the number and scale of the input references grow. For example, on a typical shared-memory machine, Cuttlefish constructed the compacted graph for 100 human genomes in less than 7 hours, using ~29 GB of memory; no other tested tool successfully completed this task on the testing hardware. We also applied Cuttlefish on 11 diverse conifer plant genomes, and the compacted graph was constructed in under 11 hours, using ~84 GB of memory, while the only other tested tool able to complete this compaction on our hardware took more than 16 hours and ~289 GB of memory.AvailabilityCuttlefish is written in C++14, and is available under an open source license at https://github.com/COMBINE-lab/[email protected]

Download Full-text