New DNA Data Storage

Researchers from North Carolina State University have turned a longstanding challenge in DNA data storage into a tool, using it to offer users previews of stored data files – such as thumbnail versions of image files.

DNA data storage is an attractive technology because it has the potential to store a tremendous amount of data in a small package, it can store that data for a long time, and it does so in an energy-efficient way. However, until now, it wasn’t possible to preview the data in a file stored as DNA – if you wanted to know what a file was, you had to “open” the entire file.

“The advantage to our technique is that it is more efficient in terms of time and money,” says Kyle Tomek, lead author of a paper on the work and a Ph.D. student at NC State. “If you are not sure which file has the data you want, you don’t have to sequence all of the DNA in all of the potential files. Instead, you can sequence much smaller portions of the DNA files to serve as previews.”

Here’s a quick overview of how this works.

Users “name” their data files by attaching sequences of DNA called primer-binding sequences to the ends of DNA strands that are storing information. To identify and extract a given file, most systems use polymerase chain reaction (PCR). Specifically, they use a small DNA primer that matches the corresponding primer-binding sequence to identify the DNA strands containing the file you want. The system then uses PCR to make lots of copies of the relevant DNA strands, then sequences the entire sample. Because the process makes numerous copies of the targeted DNA strands, the signal of the targeted strands is stronger than the rest of the sample, making it possible to identify the targeted DNA sequence and read the file.
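The retrieval step described above can be sketched as a toy model. This is only an illustration under stated assumptions: the 8-letter addresses, the pool contents, and the function name pcr_select are hypothetical, a primer is treated as matching when it equals the address string (real primers bind by complementarity), and each PCR cycle is assumed to double every matching strand exactly.

```python
from collections import Counter


def pcr_select(pool, primer, cycles=10):
    """Toy PCR: strands whose address equals the primer double every cycle.

    pool is a list of (address, payload) pairs. Non-matching strands are
    not amplified and stay at a single copy.
    """
    counts = Counter()
    for address, payload in pool:
        copies = 2 ** cycles if address == primer else 1
        counts[(address, payload)] += copies
    return counts


# Hypothetical pool: two strands of file A and one strand of file B.
pool = [
    ("ACGTACGT", "file A, chunk 0"),
    ("ACGTACGT", "file A, chunk 1"),
    ("TTGGCCAA", "file B, chunk 0"),
]

counts = pcr_select(pool, primer="ACGTACGT")
total = sum(counts.values())
target = sum(n for (addr, _), n in counts.items() if addr == "ACGTACGT")
target_fraction = target / total
print(f"targeted strands make up {target_fraction:.2%} of the sample")
```

After amplification the targeted file dominates the sample, which is why sequencing the whole sample still yields a readable file.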

However, one challenge that DNA data storage researchers have grappled with is that if two or more files have similar file names, the PCR will inadvertently copy pieces of multiple data files. As a result, users have to give files very distinct names to avoid getting messy data.
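One way to picture the "similar file names" problem is to count the positions at which two address sequences differ (the Hamming distance) and flag pairs that are too close. The sketch below is not from the paper: the file names, the 10-letter addresses, and the min_distance threshold are made up for illustration, and real primer binding depends on thermodynamics, not just a mismatch count.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def risky_pairs(addresses, min_distance):
    """Return (name1, name2, distance) for address pairs closer than min_distance."""
    clashes = []
    for (n1, s1), (n2, s2) in combinations(addresses.items(), 2):
        d = hamming(s1, s2)
        if d < min_distance:
            clashes.append((n1, n2, d))
    return clashes


# Hypothetical addresses: the first two differ at only one position.
addresses = {
    "cat.jpg": "ACGTACGTAC",
    "car.jpg": "ACGTACGTAG",
    "dog.jpg": "TTGCAAGCTT",
}

clashes = risky_pairs(addresses, min_distance=3)
print(clashes)
```

A conventional system would reject the first two addresses as too similar; the technique described in this article deliberately keeps such near-matches and exploits them.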

“At some point it occurred to us that we might be able to use these non-specific interactions as a tool, rather than viewing it as a problem,” says Albert Keung, co-corresponding author of a paper on the work and an assistant professor of chemical and biomolecular engineering at NC State.

Specifically, the researchers developed a technique that makes use of similar file names to let them open either an entire file or a specific subset of that file. This works by applying a specific naming convention to the file and to a given subset of it. They can then choose whether to open the entire file, or just the “preview” version, by adjusting several parameters of the PCR process: the temperature, the concentration of DNA in the sample, and the types and concentrations of reagents in the sample.
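The switch between preview and full file can be sketched by reducing the PCR conditions to a single number: how many address mismatches the primer tolerates. Everything here is an assumption made for illustration. The article does not say which strands carry the exact-match address, so the sketch assumes the preview strands do and the remaining strands carry near-match addresses. The max_mismatches parameter is a hypothetical stand-in for temperature and concentrations, and the sequences are invented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def accessible(pool, primer, max_mismatches):
    """Payloads of strands the primer would amplify at a given stringency.

    max_mismatches=0 models stringent conditions (exact matches only);
    larger values model more permissive, "promiscuous" conditions.
    """
    return [
        payload
        for address, payload in pool
        if hamming(address, primer) <= max_mismatches
    ]


# Hypothetical pool: thumbnail strands match the primer exactly, detail
# strands differ at one or two positions, and an unrelated file differs
# at many positions.
pool = [
    ("ACGTACGTAC", "thumbnail chunk 0"),
    ("ACGTACGTAC", "thumbnail chunk 1"),
    ("ACGTACGTAG", "detail chunk 0"),
    ("ACGTACCTAG", "detail chunk 1"),
    ("TTGCAAGCTT", "other file chunk 0"),
]

primer = "ACGTACGTAC"
preview = accessible(pool, primer, max_mismatches=0)
full = accessible(pool, primer, max_mismatches=2)
print("preview:", preview)
print("full file:", full)
```

In this toy model the same primer returns two strands under stringent conditions and four under permissive ones, while the unrelated file is excluded in both cases.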

“Our technique makes the system more complex,” says James Tuck, co-corresponding author of the paper and a professor of computer engineering at NC State. “This means that we have to be even more careful in managing both the file-naming conventions and the conditions of PCR. However, this makes the system both more data-efficient and substantially more user friendly.”

The researchers demonstrated their technique by saving four large JPEG image files in DNA data storage and retrieving thumbnails of each file, as well as the full, high-resolution files in their entirety.

Nature Communications – Promiscuous molecules for smarter file operations in DNA-based data storage

DNA holds significant promise as a data storage medium due to its density, longevity, and resource and energy conservation. These advantages arise from the inherent biomolecular structure of DNA which differentiates it from conventional storage media. The unique molecular architecture of DNA storage also prompts important discussions on how data should be organized, accessed, and manipulated and what practical functionalities may be possible. Here we leverage thermodynamic tuning of biomolecular interactions to implement useful data access and organizational features. Specific sets of environmental conditions including distinct DNA concentrations and temperatures were screened for their ability to switchably access either all DNA strands encoding full image files from a GB-sized background database or subsets of those strands encoding low resolution, File Preview, versions. We demonstrate File Preview with four JPEG images and provide an argument for the substantial and practical economic benefit of this generalizable strategy to organize data.

Information is being generated at an accelerating pace while our means to store it are facing fundamental material, energy, environment, and space limits. DNA has clear potential as a data storage medium due to its extreme density, durability, and efficient resource conservation. Accordingly, DNA-based data storage systems up to 1 GB have been developed by harnessing the advances in DNA synthesis and sequencing, and support the plausibility of commercially viable systems in the not too distant future. However, in addition to continuing to drive down the costs of DNA synthesis and sequencing, there are many important questions that must be addressed. Foremost among them are how data should be organized, accessed, and searched.

Organizing, accessing, and finding information constitutes a complex class of challenges. This complexity arises from how information is commonly stored in DNA-based systems: as many distinct and disordered DNA molecules free-floating in dense mutual proximity. This has two major implications. First, an addressing system is needed that can function in a complex and information-dense molecular mixture. While the use of a physical scaffold to array the DNA would ostensibly solve this challenge, analogous to how data are addressed on conventional tape drives, this would abrogate the density advantage of DNA as the scaffold itself would occupy a disproportionate amount of space. Second, while the inclusion of metadata in the strands of DNA could facilitate search, ultimately there will be many situations in which multiple candidate files contain very similar information. For example, one might wish to retrieve a specific image of the Wright brothers and their first flight, but it would be difficult to include enough metadata to distinguish the multiple images of the Wright brothers as they all fit very similar search criteria. In addition, data stored using DNA could be maintained for generations with future users only having access to a limited amount of metadata and cultural memory or knowledge. Given the costs associated with DNA retrieval and sequencing, a method to preview low-resolution versions of multiple files without needing to fully access or download all of them would be advantageous.

The File Preview function is practical in that it reduces the number of strands that need to be sequenced when searching for the desired file. This will reduce the latency and cost of DNA sequencing and decoding. Consequently, one will be able to search a database of files much more rapidly and cost-effectively using Preview than if each file needed to be fully sequenced. Beyond the Preview function, this inducible promiscuity technology could be used for many other data or computing applications. It may have broad application to how data is managed or organized in a file system. For example, files could be differentially encoded to make it cheaper and easier to access frequently versus infrequently used data. Another interesting use case is support for deduplication of data, a ubiquitous need in large and small data sets in which replicated blocks of data are detected and optimized. Rather than storing many copies of duplicated data, a single copy could be shared amongst files by taking advantage of the promiscuous binding.
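The savings argument can be made concrete with back-of-the-envelope arithmetic. The numbers below (20 candidate files, 10,000 strands per file, a preview covering 5% of each file) are hypothetical and not taken from the paper. The sketch also assumes the worst case for the baseline, in which every candidate must be fully sequenced before the right file is found.

```python
def strands_to_sequence(candidates, strands_per_file, preview_fraction):
    """Compare strands sequenced with and without previews.

    Without previews, every candidate file is sequenced in full.
    With previews, each candidate's preview is sequenced first, and then
    only the chosen file is sequenced in full.
    """
    full_search = candidates * strands_per_file
    preview_strands = round(strands_per_file * preview_fraction)
    preview_search = candidates * preview_strands + strands_per_file
    return full_search, preview_search


# Hypothetical workload, for illustration only.
full_search, preview_search = strands_to_sequence(
    candidates=20, strands_per_file=10_000, preview_fraction=0.05
)
savings = 1 - preview_search / full_search
print(f"without previews: {full_search} strands")
print(f"with previews:    {preview_search} strands")
print(f"reduction:        {savings:.0%}")
```

Under these assumptions the preview route sequences a tenth as many strands. The benefit grows with the number of candidate files and shrinks as the preview takes up a larger share of each file.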

While previous DNA-based storage systems have drawn inspiration from conventional storage media and have had success, shifting the design paradigms to naturally leverage the intrinsic structural and biophysical properties of DNA holds significant promise and could transform the functionality, practicality, and economics of DNA storage. This work provides an archetype for a biochemically driven and enhanced data storage system.

SOURCES - North Carolina State University, Nature Communications
Written by Brian Wang

3 thoughts on “New DNA Data Storage”

  1. Hey Brian, will you do an article on the recent discovery of unknown mass in human chromosomes?

  2. But then what do you do with DNA encoded data?

    One obvious choice is to encode it into a coronavirus and release it to infect the world. Then, when the world is infected, you point out that the DNA encodes an entire series of blasphemies against [whichever religion would react in the most amusing way].

    "Yes, every time you cough you are distributing a million copies of The Life of Brian. April Fool!"
