Content-Based Similarity Search in Large-Scale DNA Data Storage Systems

2020 ◽  
Author(s):  
Callista Bee ◽  
Yuan-Jyue Chen ◽  
David Ward ◽  
Xiaomeng Liu ◽  
Georg Seelig ◽  
...  

Abstract: Synthetic DNA has the potential to store the world’s continuously growing amount of data in an extremely dense and durable medium. Current proposals for DNA-based digital storage systems include the ability to retrieve individual files by their unique identifier, but not by their content. Here, we demonstrate content-based retrieval from a DNA database by learning a mapping from images to DNA sequences such that an encoded query image will retrieve visually similar images from the database via DNA hybridization. We encoded and synthesized a database of 1.6 million images and queried it with a variety of images, showing that each query retrieves a sample of the database in which visually similar images appear at a rate much greater than chance. We compare our results with several algorithms for similarity search in electronic systems, and demonstrate that our molecular approach is competitive with state-of-the-art electronics.
One Sentence Summary: Learned encodings enable content-based image similarity search from a database of 1.6 million images encoded in synthetic DNA.
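The core idea — mapping images to DNA sequences so that similar codes imply hybridization affinity — can be illustrated with a toy sketch. Everything below (the feature vectors, the 2-bits-per-base packing, and the helper names) is hypothetical and not the authors' implementation; real hybridization yield depends on thermodynamics, not simple base matching.

```python
# Toy sketch: encode binary image features as DNA and rank database
# entries by base-wise match fraction, a crude stand-in for
# hybridization yield. Mapping and names are illustrative only.
BASES = {"00": "A", "01": "C", "10": "G", "11": "T"}

def features_to_dna(bits):
    """Pack a binary feature string (even length) into a DNA sequence."""
    assert len(bits) % 2 == 0
    return "".join(BASES[bits[i:i + 2]] for i in range(0, len(bits), 2))

def match_fraction(a, b):
    """Fraction of aligned bases that agree (hybridization proxy)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def query(db, probe, top_k=2):
    """Return the top_k database ids whose sequences best match the probe."""
    ranked = sorted(db, key=lambda item: match_fraction(item[1], probe),
                    reverse=True)
    return [item_id for item_id, _ in ranked[:top_k]]

db = [("cat1", features_to_dna("00011011")),
      ("cat2", features_to_dna("00011010")),
      ("car1", features_to_dna("11100100"))]
probe = features_to_dna("00011011")
print(query(db, probe))  # the two cat-like codes rank above the car
```

The learned encoder's job, in this framing, is to produce feature bits such that visually similar images land on sequences with high match fraction to each other.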

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Callista Bee ◽  
Yuan-Jyue Chen ◽  
Melissa Queen ◽  
David Ward ◽  
Xiaomeng Liu ◽  
...  

Abstract: As global demand for digital storage capacity grows, storage technologies based on synthetic DNA have emerged as a dense and durable alternative to traditional media. Existing approaches leverage robust error correcting codes and precise molecular mechanisms to reliably retrieve specific files from large databases. Typically, files are retrieved using a pre-specified key, analogous to a filename. However, these approaches lack the ability to perform more complex computations over the stored data, such as similarity search: e.g., finding images that look similar to an image of interest without prior knowledge of their file names. Here we demonstrate a technique for executing similarity search over a DNA-based database of 1.6 million images. Queries are implemented as hybridization probes, and a key step in our approach was to learn an image-to-sequence encoding ensuring that queries preferentially bind to targets representing visually similar images. Experimental results show that our molecular implementation performs comparably to state-of-the-art in silico algorithms for similarity search.
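A claim like "retrieved at a rate much greater than chance" can be made concrete with an enrichment metric: how much more often truly similar items appear in the retrieved sample than uniform random sampling would produce. The sketch below is illustrative; the function name and example numbers are assumptions, not figures from the paper.

```python
def enrichment(retrieved, similar, database_size):
    """Fold enrichment of similar items in the retrieved sample over
    what uniform random sampling from the database would yield."""
    similar = set(similar)
    precision = sum(1 for x in retrieved if x in similar) / len(retrieved)
    chance = len(similar) / database_size
    return precision / chance

# Hypothetical numbers: 10 of 20 retrieved items are truly similar, in a
# database of 1.6M items of which only 100 are similar to the query.
retrieved = list(range(10)) + list(range(1000, 1010))
print(enrichment(retrieved, range(100), 1_600_000))  # ~8000x over chance
```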


2017 ◽  
Author(s):  
Lee Organick ◽  
Siena Dumas Ang ◽  
Yuan-Jyue Chen ◽  
Randolph Lopez ◽  
Sergey Yekhanin ◽  
...  

Current storage technologies can no longer keep pace with exponentially growing amounts of data [1]. Synthetic DNA offers an attractive alternative due to its potential information density of ~10^18 B/mm³, 10^7 times denser than magnetic tape, and potential durability of thousands of years [2]. Recent advances in DNA data storage have highlighted technical challenges, in particular coding and random access, but have stored only modest amounts of data in synthetic DNA [3,4,5]. This paper demonstrates an end-to-end approach toward the viability of DNA data storage with large-scale random access. We encoded and stored 35 distinct files, totaling 200 MB of data, in more than 13 million DNA oligonucleotides (about 2 billion nucleotides in total) and fully recovered the data with no bit errors, representing an advance of almost an order of magnitude compared to prior work [6]. Our data curation focused on technologically advanced data types and historical relevance, including the Universal Declaration of Human Rights in over 100 languages [7], a high-definition music video of the band OK Go [8], and a CropTrust database of the seeds stored in the Svalbard Global Seed Vault [9]. We developed a random access methodology based on selective amplification, for which we designed and validated a large library of primers, and successfully retrieved arbitrarily chosen items from a subset of our pool containing 10.3 million DNA sequences. Moreover, we developed a novel coding scheme that dramatically reduces the physical redundancy (sequencing read coverage) required for error-free decoding to a median of 5x, while maintaining levels of logical redundancy comparable to the best prior codes. We further stress-tested our coding approach by successfully decoding a file using the more error-prone nanopore-based sequencing.
We provide a detailed analysis of errors in the process of writing, storing, and reading data from synthetic DNA at a large scale, which helps characterize DNA as a storage medium and justify our coding approach. Thus, we have demonstrated a significant improvement in data volume, random access, and encoding/decoding schemes that contribute to a whole-system vision for DNA data storage.
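Decoding at a median coverage of 5x implies combining a handful of noisy reads per oligo before outer error correction. A common building block for that step is a per-position consensus over clustered reads; the sketch below shows only this consensus idea and is not the paper's actual codec, which also uses an outer error-correcting code.

```python
from collections import Counter

def consensus(reads):
    """Per-position majority vote across noisy reads of the same oligo.
    Illustrative only: real pipelines also cluster reads by oligo and
    apply an outer error-correcting code afterward."""
    length = max(len(r) for r in reads)
    out = []
    for i in range(length):
        column = [r[i] for r in reads if i < len(r)]  # tolerate short reads
        out.append(Counter(column).most_common(1)[0][0])
    return "".join(out)

# Five reads at ~5x coverage, two of which carry a substitution error:
reads = ["ACGTACGT", "ACGAACGT", "ACGTACGT", "ACGTTCGT", "ACGTACGT"]
print(consensus(reads))  # ACGTACGT
```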


2019 ◽  
Author(s):  
Lee Organick ◽  
Yuan-Jyue Chen ◽  
Siena Dumas Ang ◽  
Randolph Lopez ◽  
Karin Strauss ◽  
...  

Abstract: Synthetic DNA has been gaining momentum as a potential storage medium for archival data storage [1–9]. Digital information is translated into sequences of nucleotides, and the resulting synthetic DNA strands are then stored for later individual file retrieval via PCR [7–9] (Fig. 1a). Using a previously presented encoding scheme [9] and new experiments, we demonstrate reliable file recovery when as few as 10 copies per sequence are stored, on average. This results in a density of about 17 exabytes/g, nearly two orders of magnitude greater than prior work has shown [6]. Further, no prior work has experimentally demonstrated access to specific files in a pool more complex than approximately 10^6 unique DNA sequences [9], leaving the issue of accurate file retrieval at high data density and complexity unexamined. Here, we demonstrate successful PCR random access using three files of varying sizes in a complex pool of over 10^10 unique sequences, with no evidence that we have begun to approach complexity limits. We further investigate the role of file size in successful data recovery, the effect of increasing sequencing coverage to aid file recovery, and whether DNA strands drop out of solution in a systematic manner. These findings substantiate the robustness of PCR as a random access mechanism in complex settings and show that the number of copies needed for data retrieval does not compromise density significantly.
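A back-of-envelope calculation shows how 10 copies per sequence yields a density on the order of the reported 17 exabytes/g. The constants below (net bits per nucleotide after coding overhead, average single-stranded nucleotide mass) are assumed round numbers, not values from the paper, so the result only matches in order of magnitude.

```python
# Order-of-magnitude density estimate for DNA storage at 10 copies/sequence.
# NT_MOLAR_MASS and BITS_PER_NT are assumptions for illustration.
AVOGADRO = 6.022e23
NT_MOLAR_MASS = 330.0      # g/mol per single-stranded nucleotide (assumed)
BITS_PER_NT = 1.0          # net information per nucleotide after overhead (assumed)
COPIES_PER_SEQUENCE = 10   # physical redundancy reported in the abstract

grams_per_nt = NT_MOLAR_MASS / AVOGADRO
bytes_per_gram = BITS_PER_NT / 8 / (COPIES_PER_SEQUENCE * grams_per_nt)
print(f"{bytes_per_gram:.2e} bytes/g")  # ~2e19, i.e. tens of exabytes per gram
```

The estimate lands in the low tens of exabytes per gram, consistent with the abstract's ~17 EB/g once real coding overheads are accounted for.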


2021 ◽  
Vol 11 (2) ◽  
pp. 728
Author(s):  
Thien An Nguyen ◽  
Jaejin Lee

The ever-increasing demand for data in recent times has led to the emergence of big data and cloud data. The growth in these fields has necessitated that data be centrally stored in data centers. To meet the need for large-scale storage systems at data centers, innovative technology such as bit-patterned media recording (BPMR) has been developed. With BPMR technology, significant improvements in the areal density (AD) of magnetic data storage systems can be achieved. However, two-dimensional (2D) interference is a common issue at high AD: intersymbol interference and intertrack interference (ITI) occur when the distance between islands decreases in the down-track and cross-track directions, respectively. 2D interference adversely affects the performance of BPMR. In this paper, we propose an improved modified Viterbi algorithm (MVA) exploiting feedback and a new 2D three-way form of a generalized partial response (GPR) target. The proposed MVA with feedback is superior to the previous MVA because it eliminates ITI more effectively. With the three-way GPR target, the proposed algorithm achieves more stable performance than previous detection algorithms under the track misregistration effect.
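To make the partial-response detection idea concrete, here is a minimal one-dimensional Viterbi detector for a two-tap target. This is a sketch only: the paper's detector is 2D with ITI feedback and a three-way GPR target, and the target taps (1, 0.5) below are an assumption for illustration.

```python
def viterbi_pr(received, target=(1.0, 0.5)):
    """Minimal 1D Viterbi detector for a two-tap partial-response target
    y[k] = target[0]*x[k] + target[1]*x[k-1], with symbols x in {-1, +1}.
    Illustrative only; not the paper's 2D MVA with feedback."""
    INF = float("inf")
    symbols = (-1.0, 1.0)
    metrics = [0.0, 0.0]      # state = index of previous symbol
    paths = [[], []]
    for y in received:
        new_metrics = [INF, INF]
        new_paths = [None, None]
        for prev in (0, 1):
            for cur in (0, 1):
                expected = target[0] * symbols[cur] + target[1] * symbols[prev]
                m = metrics[prev] + (y - expected) ** 2  # branch metric
                if m < new_metrics[cur]:
                    new_metrics[cur] = m
                    new_paths[cur] = paths[prev] + [symbols[cur]]
        metrics, paths = new_metrics, new_paths
    return paths[0] if metrics[0] <= metrics[1] else paths[1]

x = [1, -1, -1, 1, 1]
# Noiseless channel output, assuming initial state x[-1] = -1:
y = [1.0 * x[k] + 0.5 * (x[k - 1] if k else -1) for k in range(len(x))]
print(viterbi_pr(y))  # recovers [1, -1, -1, 1, 1]
```

The 2D extensions in the paper add a cross-track term from neighboring tracks to `expected`, which the feedback step estimates and cancels.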


2017 ◽  
Author(s):  
Sanjoy Dasgupta ◽  
Charles F. Stevens ◽  
Saket Navlakha

Similarity search, such as identifying similar images in a database or similar documents on the Web, is a fundamental computing problem faced by many large-scale information retrieval systems. We discovered that the fly’s olfactory circuit solves this problem using a novel variant of a traditional computer science algorithm (called locality-sensitive hashing). The fly’s circuit assigns similar neural activity patterns to similar input stimuli (odors), so that behaviors learned from one odor can be applied when a similar odor is experienced. The fly’s algorithm, however, uses three new computational ingredients that depart from traditional approaches. We show that these ingredients can be translated to improve the performance of similarity search compared to traditional algorithms when evaluated on several benchmark datasets. Overall, this perspective helps illuminate the logic supporting an important sensory function (olfaction), and it provides a conceptually new algorithm for solving a fundamental computational problem.
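Two of the fly circuit's ingredients — a sparse random expansion to many "Kenyon cells" followed by a winner-take-all step that keeps only the most active ones — can be sketched compactly. The parameters below (200 cells, fan-in 5, top 10) are illustrative choices, not values from the paper.

```python
import random

def fly_hash(x, num_kcs=200, fan_in=5, top_k=10, seed=0):
    """Fly-inspired LSH sketch: sparse random expansion, then a
    winner-take-all step keeping the indices of the top_k most active
    Kenyon cells. Parameter values are illustrative."""
    rng = random.Random(seed)  # fixed seed -> same random projection every call
    activations = []
    for _ in range(num_kcs):
        inputs = rng.sample(range(len(x)), fan_in)  # each KC samples a few inputs
        activations.append(sum(x[i] for i in inputs))
    ranked = sorted(range(num_kcs), key=lambda i: activations[i], reverse=True)
    return frozenset(ranked[:top_k])

rng_in = random.Random(1)
x = [rng_in.random() for _ in range(20)]
h = fly_hash(x)
print(len(h), h == fly_hash(x))  # 10 True  (deterministic for a fixed seed)
```

Similar inputs activate overlapping sets of Kenyon cells, so the overlap between two hashes serves as the similarity estimate; candidates sharing many active cells are retrieved together.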


2013 ◽  
Vol 76 (2) ◽  
pp. 283-294 ◽  
Author(s):  
PAJAU VANGAY ◽  
ERIC B. FUGETT ◽  
QI SUN ◽  
MARTIN WIEDMANN

Large amounts of molecular subtyping information are generated by the private sector, academia, and government agencies. However, use of subtype data is limited by a lack of effective data storage and sharing mechanisms that allow comparison of subtype data from multiple sources. Currently available subtype databases are generally limited in scope to a few data types (e.g., MLST.net) or are not publicly available (e.g., PulseNet). We describe the development and initial implementation of Food Microbe Tracker, a public Web-based database that allows archiving and exchange of a variety of molecular subtype data that can be cross-referenced with isolate source data, genetic data, and phenotypic characteristics. Data can be queried with a variety of search criteria, including DNA sequences and banding pattern data (e.g., ribotype or pulsed-field gel electrophoresis type). Food Microbe Tracker allows the deposition of data on any bacterial genus and species, bacteriophages, and other viruses. The bacterial genera and species that currently have the most entries in this database are Listeria monocytogenes, Salmonella, Streptococcus spp., Pseudomonas spp., Bacillus spp., and Paenibacillus spp., with over 40,000 isolates. The combination of pathogen and spoilage microorganism data in the database will facilitate source tracking and outbreak detection, improve discovery of emerging subtypes, and increase our understanding of transmission and ecology of these microbes. Continued addition of subtyping, genetic or phenotypic data for a variety of microbial species will broaden the database and facilitate large-scale studies on the diversity of food-associated microbes.


2021 ◽  
Author(s):  
Claris Winston ◽  
Lee Organick ◽  
Luis Ceze ◽  
Karin Strauss ◽  
Yuan-Jyue Chen

Abstract: With the rapidly decreasing cost of array-based oligo synthesis, large-scale oligo pools offer significant benefits for advanced applications, including gene synthesis, CRISPR-based gene editing, and DNA data storage. Selectively retrieving specific oligos from these complex pools traditionally uses Polymerase Chain Reaction (PCR), in which any selected oligos are exponentially amplified to quickly outnumber non-selected ones. In this case, the number of orthogonal PCR primers is limited due to interactions between them. This lack of specificity presents a serious challenge, particularly for DNA data storage, where the size of an oligo pool (i.e., a DNA database) is orders of magnitude larger than it is for other applications. Although a nested file address system was recently developed to increase the number of accessible files for DNA storage, it requires a more complicated lab protocol and more expensive reagents to achieve high specificity. Instead, we developed a new combinatorial PCR method that outperforms prior work without compromising the fidelity of retrieved material or complicating wet lab processes. Our method quadratically increases the number of accessible oligos while maintaining high specificity. In experiments, we accessed three arbitrarily chosen files from a DNA prototype database that contained 81 different files. Initially comprising only 1% of the original database, the selected files were enriched to over 99.9% using our combinatorial primer method. Our method thus provides a viable path for scaling up DNA data storage systems and has broader utility whenever scientists need access to a specific target oligo and can design their own primer regions.
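The quadratic scaling comes from addressing each file with a *pair* of primers rather than one: n forward and n reverse primers give n×n distinct addresses. A minimal sketch of such an address table (primer names are placeholders, not validated sequences):

```python
from itertools import product

def address_table(fwd_primers, rev_primers):
    """Map file ids to (forward, reverse) primer pairs. With n primers in
    each set, n*n files become addressable instead of n."""
    return {i: pair for i, pair in enumerate(product(fwd_primers, rev_primers))}

# 9 forward + 9 reverse primers address 81 files, matching the
# 81-file prototype database in the abstract.
fwd = [f"F{i}" for i in range(9)]
rev = [f"R{i}" for i in range(9)]
table = address_table(fwd, rev)
print(len(table))  # 81
```

A file is retrieved by running PCR with its assigned forward/reverse pair; only oligos flanked by both primer regions amplify exponentially.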


2020 ◽  
Vol 245 ◽  
pp. 04038 ◽  
Author(s):  
Luca Mascetti ◽  
Maria Arsuaga Rios ◽  
Enrico Bocchi ◽  
Joao Calado Vicente ◽  
Belinda Chan Kwok Cheong ◽  
...  

The CERN IT Storage group operates multiple distributed storage systems to support all CERN data storage requirements: the physics data generated by LHC and non-LHC experiments; object and file storage for infrastructure services; block storage for the CERN cloud system; filesystems for general use and specialized HPC clusters; content distribution filesystem for software distribution and condition databases; and sync&share cloud storage for end-user files. The total integrated capacity of these systems exceeds 0.6 exabytes. Large-scale experiment data taking has been supported by EOS and CASTOR for the last 10+ years. Particular highlights for 2018 include the special HeavyIon run, which was the last part of the LHC Run2 Programme: the IT storage systems sustained over 10 GB/s to flawlessly collect and archive more than 13 PB of data in a single month. While the tape archival continues to be handled by CASTOR, the effort to migrate the current experiment workflows to the new CERN Tape Archive system (CTA) is underway. Ceph infrastructure has operated for more than 5 years to provide block storage to the CERN IT private OpenStack cloud, a shared filesystem (CephFS) to HPC clusters, and NFS storage to replace commercial Filers. An S3 service was introduced in 2018, following increased user requirements for S3-compatible object storage from physics experiments and IT use-cases. Since its introduction in 2014, CERNBox has become a ubiquitous cloud storage interface for all CERN user groups: physicists, engineers, and administration. CERNBox provides easy access to multi-petabyte data stores from a multitude of mobile and desktop devices and all mainstream, modern operating systems (Linux, Windows, macOS, Android, iOS). CERNBox provides synchronized storage for end-users’ devices as well as easy sharing for individual users and e-groups.
CERNBox has also become a storage platform to host online applications to process the data, such as SWAN (Service for Web-based Analysis), as well as file editors such as Collabora Online, Only Office, Draw.IO and more. An increasing number of online applications in the Windows infrastructure use CIFS/SMB access to CERNBox files. CVMFS provides software repositories for all experiments across the WLCG infrastructure and has recently been optimized to efficiently handle nightly builds. While AFS continues to provide a general-purpose filesystem for internal CERN users, especially as the $HOME login area on central computing infrastructure, the migration of project and web spaces has significantly advanced. In this paper, we report on the experiences from the last year of LHC Run2 data taking and the evolution of our services over the past year. We highlight upcoming changes, future improvements, and challenges.


Author(s):  
Yanmin Gao ◽  
Xin Chen ◽  
Jianye Hao ◽  
Chengwei Zhang ◽  
Hongyan Qiao ◽  
...  

Abstract: In DNA data storage, massive sequence complexity creates challenges for repeatable and efficient information readout. Here, our study clearly demonstrated that canonical polymerase chain reaction (PCR) creates significant DNA amplification biases, which greatly hinder fast and stable data retrieval from hundreds of thousands of synthetic DNA sequences encoding over 2.85 megabytes (MB) of digital data. To mitigate the amplification bias, we adapted an isothermal DNA amplification method for low-bias amplification of a DNA pool with massive sequence complexity, and named the new method isothermal DNA reading (iDR). Using iDR, we were able to robustly and repeatedly retrieve the data stored in DNA strands attached to magnetic beads with significantly fewer sequencing reads than the PCR method requires. We therefore believe that the low-bias iDR method provides an ideal platform for robust DNA data storage and fast, reliable data readout.
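Why PCR bias compounds so badly can be seen with a toy model: small per-cycle efficiency differences are raised to the power of the cycle count, while a linear (non-exponential) amplification model keeps abundances close. The efficiency values and the linear model below are illustrative assumptions, not measurements from the paper.

```python
def pcr_copies(efficiency, cycles):
    """Copies of one sequence after `cycles` of PCR, where efficiency in
    [0, 1] is the fraction duplicated per cycle (1.0 = perfect doubling)."""
    return (1.0 + efficiency) ** cycles

def isothermal_copies(rate, time):
    """Copies under an assumed linear amplification model."""
    return 1.0 + rate * time

# Two sequences whose per-cycle efficiencies differ by 20%: after 30 PCR
# cycles the abundance ratio explodes, while the linear model stays near
# the rate ratio.
pcr_ratio = pcr_copies(1.0, 30) / pcr_copies(0.8, 30)
iso_ratio = isothermal_copies(1.0, 30) / isothermal_copies(0.8, 30)
print(round(pcr_ratio, 1), round(iso_ratio, 2))  # ratio >20x vs ~1.2x
```

Skewed abundances mean rare sequences need far deeper sequencing to be observed at all, which is why lower-bias amplification reduces the reads required for full data recovery.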

