Scaling up DNA data storage and random access retrieval

2017 ◽  
Author(s):  
Lee Organick ◽  
Siena Dumas Ang ◽  
Yuan-Jyue Chen ◽  
Randolph Lopez ◽  
Sergey Yekhanin ◽  
...  

Current storage technologies can no longer keep pace with exponentially growing amounts of data [1]. Synthetic DNA offers an attractive alternative due to its potential information density of ~10^18 B/mm^3, 10^7 times denser than magnetic tape, and potential durability of thousands of years [2]. Recent advances in DNA data storage have highlighted technical challenges, in particular coding and random access, but have stored only modest amounts of data in synthetic DNA [3,4,5]. This paper demonstrates an end-to-end approach toward the viability of DNA data storage with large-scale random access. We encoded and stored 35 distinct files, totaling 200 MB of data, in more than 13 million DNA oligonucleotides (about 2 billion nucleotides in total) and fully recovered the data with no bit errors, representing an advance of almost an order of magnitude compared to prior work [6]. Our data curation focused on technologically advanced data types and historical relevance, including the Universal Declaration of Human Rights in over 100 languages [7], a high-definition music video of the band OK Go [8], and a CropTrust database of the seeds stored in the Svalbard Global Seed Vault [9]. We developed a random access methodology based on selective amplification, for which we designed and validated a large library of primers, and successfully retrieved arbitrarily chosen items from a subset of our pool containing 10.3 million DNA sequences. Moreover, we developed a novel coding scheme that dramatically reduces the physical redundancy (sequencing read coverage) required for error-free decoding to a median of 5x, while maintaining levels of logical redundancy comparable to the best prior codes. We further stress-tested our coding approach by successfully decoding a file using the more error-prone nanopore-based sequencing.
We provide a detailed analysis of errors in the process of writing, storing, and reading data from synthetic DNA at a large scale, which helps characterize DNA as a storage medium and justify our coding approach. Thus, we have demonstrated a significant improvement in data volume, random access, and encoding/decoding schemes that contribute to a whole-system vision for DNA data storage.
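The selective-amplification random access described above can be sketched in miniature: each file is assigned a primer address, and retrieval keeps only strands carrying that address. All names, the 20-nt primer length, and the toy payloads below are illustrative assumptions; the paper's actual primer library design involves far more (cross-hybridization screening, GC balance, and so on).

```python
# Illustrative sketch of PCR-style random access by primer address.
# Primer length, file names, and payloads are assumptions for illustration.
import random

BASES = "ACGT"

def random_primer(length=20, rng=None):
    """Generate a random primer sequence (stand-in for a validated library)."""
    rng = rng or random
    return "".join(rng.choice(BASES) for _ in range(length))

def encode_file(file_id, payloads, primer_map):
    """Prefix every payload strand of a file with that file's primer address."""
    primer = primer_map[file_id]
    return [primer + p for p in payloads]

def selective_amplify(pool, primer):
    """Model selective amplification: only strands whose 5' end matches
    the chosen primer are 'amplified' (returned)."""
    return [s for s in pool if s.startswith(primer)]

rng = random.Random(0)
primer_map = {"file_a": random_primer(rng=rng), "file_b": random_primer(rng=rng)}

pool = []
pool += encode_file("file_a", ["ACGTACGT", "TTGGCCAA"], primer_map)
pool += encode_file("file_b", ["GGGGCCCC"], primer_map)

retrieved = selective_amplify(pool, primer_map["file_a"])
payloads = [s[len(primer_map["file_a"]):] for s in retrieved]
print(sorted(payloads))  # the two payload strands of file_a
```

In the real system, the "filter" is physical: PCR with the file's primer pair exponentially amplifies only that file's strands before sequencing.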


2019 ◽  
Author(s):  
Lee Organick ◽  
Yuan-Jyue Chen ◽  
Siena Dumas Ang ◽  
Randolph Lopez ◽  
Karin Strauss ◽  
...  

Abstract: Synthetic DNA has been gaining momentum as a potential medium for archival data storage [1–9]. Digital information is translated into sequences of nucleotides, and the resulting synthetic DNA strands are then stored for later individual file retrieval via PCR (Fig. 1a) [7–9]. Using a previously presented encoding scheme [9] and new experiments, we demonstrate reliable file recovery when as few as 10 copies per sequence are stored, on average. This results in a density of about 17 exabytes/g, nearly two orders of magnitude greater than prior work has shown [6]. Further, no prior work has experimentally demonstrated access to specific files in a pool more complex than approximately 10^6 unique DNA sequences [9], leaving the issue of accurate file retrieval at high data density and complexity unexamined. Here, we demonstrate successful PCR random access using three files of varying sizes in a complex pool of over 10^10 unique sequences, with no evidence that we have begun to approach complexity limits. We further investigate the role of file size in successful data recovery, the effect of increasing sequencing coverage to aid file recovery, and whether DNA strands drop out of solution in a systematic manner. These findings substantiate the robustness of PCR as a random access mechanism in complex settings and show that the number of copies needed for data retrieval does not compromise density significantly.
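As a rough plausibility check on the reported ~17 exabytes/g figure, a back-of-envelope calculation with assumed strand parameters lands in the same range. The specific numbers below (150-nt strands, ~110 payload nucleotides, 10 physical copies per sequence) are illustrative assumptions, not the paper's exact values.

```python
# Back-of-envelope density estimate for DNA storage, with assumed parameters.
AVOGADRO = 6.022e23          # molecules per mole
NT_MASS_G_PER_MOL = 330.0    # approx. average mass of one ssDNA nucleotide

strand_len_nt = 150          # total strand length (address + payload + overhead)
payload_nt = 110             # nucleotides actually carrying data
bits_per_nt = 2              # A/C/G/T encode 2 bits before coding overhead
copies_per_sequence = 10     # physical redundancy, as in the experiment

strand_mass_g = strand_len_nt * NT_MASS_G_PER_MOL / AVOGADRO
payload_bytes = payload_nt * bits_per_nt / 8
density_bytes_per_g = payload_bytes / (copies_per_sequence * strand_mass_g)

print(f"{density_bytes_per_g:.2e} bytes/g")  # on the order of 10^19, i.e. tens of EB/g
```

The result is within a small factor of the reported 17 EB/g; the gap is absorbed by addressing and error-correction overhead in the real encoding.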



2013 ◽  
Vol 76 (2) ◽  
pp. 283-294 ◽  
Author(s):  
PAJAU VANGAY ◽  
ERIC B. FUGETT ◽  
QI SUN ◽  
MARTIN WIEDMANN

Large amounts of molecular subtyping information are generated by the private sector, academia, and government agencies. However, use of subtype data is limited by a lack of effective data storage and sharing mechanisms that allow comparison of subtype data from multiple sources. Currently available subtype databases are generally limited in scope to a few data types (e.g., MLST.net) or are not publicly available (e.g., PulseNet). We describe the development and initial implementation of Food Microbe Tracker, a public Web-based database that allows archiving and exchange of a variety of molecular subtype data that can be cross-referenced with isolate source data, genetic data, and phenotypic characteristics. Data can be queried with a variety of search criteria, including DNA sequences and banding pattern data (e.g., ribotype or pulsed-field gel electrophoresis type). Food Microbe Tracker allows the deposition of data on any bacterial genus and species, bacteriophages, and other viruses. The bacterial genera and species that currently have the most entries in this database are Listeria monocytogenes, Salmonella, Streptococcus spp., Pseudomonas spp., Bacillus spp., and Paenibacillus spp., with over 40,000 isolates. The combination of pathogen and spoilage microorganism data in the database will facilitate source tracking and outbreak detection, improve discovery of emerging subtypes, and increase our understanding of transmission and ecology of these microbes. Continued addition of subtyping, genetic or phenotypic data for a variety of microbial species will broaden the database and facilitate large-scale studies on the diversity of food-associated microbes.



2020 ◽  
Author(s):  
Callista Bee ◽  
Yuan-Jyue Chen ◽  
David Ward ◽  
Xiaomeng Liu ◽  
Georg Seelig ◽  
...  

Abstract: Synthetic DNA has the potential to store the world's continuously growing amount of data in an extremely dense and durable medium. Current proposals for DNA-based digital storage systems include the ability to retrieve individual files by their unique identifier, but not by their content. Here, we demonstrate content-based retrieval from a DNA database by learning a mapping from images to DNA sequences such that an encoded query image will retrieve visually similar images from the database via DNA hybridization. We encoded and synthesized a database of 1.6 million images and queried it with a variety of images, showing that each query retrieves a sample of the database in which visually similar images appear at a rate much greater than chance. We compare our results with several algorithms for similarity search in electronic systems and demonstrate that our molecular approach is competitive with state-of-the-art electronics.

One-Sentence Summary: Learned encodings enable content-based image similarity search from a database of 1.6 million images encoded in synthetic DNA.
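The paper learns its image-to-sequence mapping with a neural network; as a much simpler illustration of the same principle (similar inputs should yield similar, co-hybridizing sequences), the sketch below uses random-hyperplane hashing to map feature vectors to DNA strings. Everything here (dimensions, the toy vectors, the hash itself) is an assumption for illustration, not the paper's method.

```python
# Stand-in for content-based DNA retrieval: map feature vectors to DNA so
# that similar vectors yield similar (hence hybridizing) sequences, using
# random-hyperplane hashing (LSH) instead of a learned encoder.
import random

def lsh_bits(vec, planes):
    """One bit per random hyperplane: which side of the plane the vector lies on."""
    return [1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
            for plane in planes]

def bits_to_dna(bits):
    """Pack bit pairs into bases; similar bit vectors give similar sequences."""
    table = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}
    return "".join(table[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(42)
dim, n_planes = 8, 32
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

query   = [1.0, 0.9, 0.1, 0.0, 0.5, 0.4, 0.2, 0.1]
similar = [0.9, 1.0, 0.2, 0.1, 0.4, 0.5, 0.1, 0.2]    # close to the query
distant = [-1.0, 0.1, 0.9, -0.5, -0.4, 0.8, -0.9, 0.3]

seq_q = bits_to_dna(lsh_bits(query, planes))
seq_s = bits_to_dna(lsh_bits(similar, planes))
seq_d = bits_to_dna(lsh_bits(distant, planes))

print(hamming(seq_q, seq_s), hamming(seq_q, seq_d))
```

In the molecular setting, the analogue of a small Hamming distance is strong hybridization between the query strand and the stored strand.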



2019 ◽  
Author(s):  
S Kasra Tabatabaei ◽  
Boya Wang ◽  
Nagendra Bala Murali Athreya ◽  
Behnam Enghiad ◽  
Alvaro Gonzalo Hernandez ◽  
...  

Abstract: Synthetic DNA-based data storage systems have received significant attention due to the promise of ultrahigh storage density and long-term stability. However, all platforms proposed so far suffer from high cost, read–write latency, and error rates that render them noncompetitive with modern optical and magnetic storage devices. One means to avoid synthesizing DNA and to reduce system error rates is to use readily available native DNA. As the symbol/nucleotide content of native DNA is fixed, one may adopt an alternative recording strategy that modifies the DNA topology to encode the desired information. Here, we report the first macromolecular storage paradigm in which data is written in the form of "nicks (punches)" at predetermined positions on the sugar-phosphate backbone of native dsDNA. The platform accommodates parallel nicking on multiple "orthogonal" genomic DNA fragments, as well as paired nicking and disassociation for creating "toehold" regions that enable single-bit random access and in-memory strand-displacement computations. As a proof of concept, we used the programmable restriction enzyme Pyrococcus furiosus Argonaute to punch two files into the PCR products of Escherichia coli genomic DNA. The encoded data is accurately reconstructed through high-throughput sequencing and read alignment.
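The nick-based recording above can be modeled logically as the presence or absence of a nick at each predetermined backbone position. The sketch below captures only this bit-to-position mapping; the positions are hypothetical, and the actual platform performs the nicking enzymatically with PfAgo and reads it back by sequencing.

```python
# Minimal logical model of topology-based recording: each predetermined
# backbone position either carries a nick (bit 1) or is left intact (bit 0).
POSITIONS = [10, 25, 40, 55, 70, 85, 100, 115]  # hypothetical nicking sites

def write_nicks(bits):
    """Return the set of positions to nick for the given bit string."""
    assert len(bits) == len(POSITIONS)
    return {pos for pos, bit in zip(POSITIONS, bits) if bit == "1"}

def read_nicks(nicked_positions):
    """Recover the bit string from the observed set of nicked positions."""
    return "".join("1" if pos in nicked_positions else "0" for pos in POSITIONS)

data = "01101001"  # one byte
nicks = write_nicks(data)
assert read_nicks(nicks) == data
print(sorted(nicks))  # → [25, 40, 70, 115]
```

Because the nucleotide content never changes, writing costs no synthesis; only the set of nicked positions carries information.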



2019 ◽  
Vol 15 (01) ◽  
pp. 1-8
Author(s):  
Ashish C Patel ◽  
C G Joshi

Current data storage technologies cannot keep pace with the exponentially growing amount of data generated through the extensive use of social networking, photos, media, and so on. The "digital world," at 4.4 zettabytes in 2013, is predicted to reach 44 zettabytes by 2020. For the past 30 years, scientists and researchers have been trying to develop a robust way of storing data on a medium that is dense and long-lasting, and they have found DNA to be the most promising storage medium. Unlike existing storage devices, DNA requires no maintenance, except storage in a cool, dark place. DNA is extremely compact with high density; just 1 gram of dry DNA can store about 455 exabytes of data. DNA stores information using four bases, viz., A, T, G, and C, whereas CDs, hard disks, and other devices store information using 0s and 1s on spiral tracks. In DNA-based storage, after binarization of the digital file into binary codes, encoding and decoding are the key steps. Once the digital file is encoded, the next step is to synthesize the corresponding single-stranded DNA sequences, which can be stored in a deep freeze until use. When the information needs to be recovered, it can be read back using DNA sequencing. Next-generation sequencing (NGS) is capable of producing sequences with very high throughput at a much lower cost (less than about 0.1 USD per MB of data) than the first sequencing technologies. Post-sequencing processing includes alignment of all reads using multiple sequence alignment (MSA) algorithms to obtain consensus sequences. The consensus sequence is then decoded by reversing the encoding process. Most prior DNA data storage efforts sequenced and decoded the entire amount of stored digital information with no random access, but it has now become possible to extract selected files (e.g., retrieving only a required image from a collection) from a DNA pool using PCR-based random access.
Various scientists have successfully stored up to 110 zettabytes of data in one gram of DNA. In the future, with efficient encoding, error correction, and cheaper DNA synthesis and sequencing, DNA-based storage will become a practical solution for storing exponentially growing digital data.



2020 ◽  
Author(s):  
Filip Bošković ◽  
Alexander Ohmann ◽  
Ulrich F. Keyser ◽  
Kaikai Chen

Abstract: Three-dimensional (3D) DNA nanostructures built via DNA self-assembly have found recent applications in multiplexed biosensing and in storing digital information. However, a key challenge is that 3D DNA structures are not easily copied, which is of vital importance for their large-scale production and for access to desired molecules by target-specific amplification. Here, we build 3D DNA structural barcodes and demonstrate the copying and random access of the barcodes from a library of molecules using a modified polymerase chain reaction (PCR). The 3D barcodes were assembled by annealing a single-stranded DNA scaffold with complementary short oligonucleotides containing 3D protrusions at defined locations. DNA nicks in these structures are ligated to facilitate barcode copying using PCR. To randomly access a target from a library of barcodes, we employ a non-complementary end in the DNA construct that serves as a barcode-specific primer template. Readout of the 3D DNA structural barcodes was performed with nanopore measurements. Our study provides a roadmap for the convenient production of large quantities of self-assembled 3D DNA nanostructures. In addition, this strategy offers access to specific targets, a crucial capability for multiplexed single-molecule sensing and for DNA data storage.



2018 ◽  
Vol 36 (7) ◽  
pp. 660-660 ◽  
Author(s):  
Lee Organick ◽  
Siena Dumas Ang ◽  
Yuan-Jyue Chen ◽  
Randolph Lopez ◽  
Sergey Yekhanin ◽  
...  


2005 ◽  
Vol 44 (02) ◽  
pp. 149-153 ◽  
Author(s):  
F. Estrella ◽  
C. del Frate ◽  
T. Hauer ◽  
M. Odeh ◽  
D. Rogulin ◽  
...  

Summary Objectives: The past decade has witnessed order-of-magnitude increases in computing power, data storage capacity, and network speed, giving birth to applications that can handle large data volumes of increased complexity, distributed over the internet. Methods: Medical image analysis is one of the areas to which this unique opportunity is likely to bring revolutionary advances, both in the scientist's research and in the clinician's everyday work. Grid computing [1] promises to resolve many of the difficulties in facilitating medical image analysis, allowing radiologists to collaborate without having to co-locate. Results: The EU-funded MammoGrid project [2] aims to investigate the feasibility of developing a Grid-enabled European database of mammograms and to provide an information infrastructure that federates multiple mammogram databases. This will enable clinicians to develop new common, collaborative, and co-operative approaches to the analysis of mammographic data. Conclusion: This paper focuses on one of the key requirements for large-scale distributed mammogram analysis: resolving queries across a grid-connected federation of images.



2021 ◽  
Author(s):  
Zhi Ping ◽  
Shihong Chen ◽  
Guangyu Zhou ◽  
Xiaoluo Huang ◽  
Sha Joe Zhu ◽  
...  

Abstract: DNA is a promising data storage medium due to its remarkable durability and space-efficient storage. Early bit-to-base transcoding schemes have primarily pursued information density, at the expense, however, of introducing biocompatibility challenges or risking decoding failure. Here, we propose a robust transcoding algorithm named the "Yin-Yang Codec" (YYC), which uses two rules to encode two binary bits into one nucleotide, generating DNA sequences highly compatible with synthesis and sequencing technologies. We encoded two representative file formats and stored them in vitro as 200-nt oligo pools and in vivo as an ~54-kb DNA fragment in yeast cells. Sequencing results show that YYC exhibits high robustness and reliability for a wide variety of data types, with an average recovery rate of 99.94% at 10^4 molecule copies and an achieved recovery rate of 87.53% at 100 copies. In addition, the in vivo storage demonstration achieved for the first time an experimentally measured physical information density of 198.8 EB per gram of DNA (44% of the theoretical maximum for DNA).
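As a toy illustration of the two-bits-per-nucleotide idea, the sketch below fuses one bit from each of two binary segments into a single base: a "yang" bit chooses purine vs. pyrimidine, and a "yin" bit chooses the base within that class. This is not the published YYC rule set (which also conditions on the preceding base to control sequence composition); it shows only the combination principle.

```python
# Toy two-rule transcoder: two bit segments fuse into one DNA sequence,
# so each nucleotide carries two bits. Not the published YYC rules.
YANG = {0: "AG", 1: "CT"}   # yang bit 0 -> purines, 1 -> pyrimidines

def encode_pair(yang_bits, yin_bits):
    """Fuse two equal-length bit segments into one DNA sequence."""
    return "".join(YANG[a][b] for a, b in zip(yang_bits, yin_bits))

def decode_pair(seq):
    """Split a sequence back into its yang and yin bit segments."""
    yang = [0 if base in "AG" else 1 for base in seq]
    yin = [YANG[a].index(base) for a, base in zip(yang, seq)]
    return yang, yin

seg1 = [0, 1, 1, 0]
seg2 = [1, 0, 1, 1]
seq = encode_pair(seg1, seg2)
print(seq)  # 4 nucleotides carry all 8 bits
assert decode_pair(seq) == (seg1, seg2)
```

Pairing segments this way doubles density relative to one bit per base while leaving room, in the real codec, to steer GC content and avoid long homopolymers.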



Author(s):  
Yanmin Gao ◽  
Xin Chen ◽  
Jianye Hao ◽  
Chengwei Zhang ◽  
Hongyan Qiao ◽  
...  

Abstract: In DNA data storage, massive sequence complexity creates challenges for repeatable and efficient information readout. Here, our study clearly demonstrates that canonical polymerase chain reaction (PCR) creates significant DNA amplification biases, which greatly hinder fast and stable data retrieval from a hundred thousand synthetic DNA sequences encoding over 2.85 megabytes (MB) of digital data. To mitigate the amplification bias, we adapted isothermal DNA amplification for low-bias amplification of DNA pools with massive sequence complexity, and named the new method isothermal DNA reading (iDR). Using iDR, we were able to robustly and repeatedly retrieve the data stored in DNA strands attached to magnetic beads with significantly fewer sequencing reads than the PCR method requires. We therefore believe that the low-bias iDR method provides an ideal platform for robust DNA data storage and for fast, reliable data readout.


