2 megabytes encoded in DNA at a density of 215 petabytes per gram, but scaling up to 215 petabytes would cost roughly $900 trillion

Researchers report that they’ve come up with a new way to encode digital data in DNA to create the highest-density large-scale data storage scheme ever invented. Capable of storing 215 petabytes (215 million gigabytes) in a single gram of DNA, the system could, in principle, store every bit of data ever recorded by humans in a container about the size and weight of a couple of pickup trucks. But whether the technology takes off may depend on its cost.

DNA has many advantages for storing digital data. It’s ultracompact, and it can last hundreds of thousands of years if kept in a cool, dry place. And as long as human societies are reading and writing DNA, they will be able to decode it. “DNA won’t degrade over time like cassette tapes and CDs, and it won’t become obsolete,” says Yaniv Erlich, a computer scientist at Columbia University. And unlike other high-density approaches, such as manipulating individual atoms on a surface, new technologies can write and read large amounts of DNA at a time, allowing it to be scaled up.

Scientists have been storing digital data in DNA since 2012. That was when Harvard University geneticists George Church, Sri Kosuri, and colleagues encoded a 52,000-word book in thousands of snippets of DNA, using strands of DNA’s four-letter alphabet of A, G, T, and C to encode the 0s and 1s of the digitized file. Their particular encoding scheme was relatively inefficient, however, and could store only 1.28 petabytes per gram of DNA. Other approaches have done better. But none has been able to store more than half of what researchers think DNA can actually handle, about 1.8 bits of data per nucleotide of DNA. (The number isn’t 2 bits because of rare, but inevitable, DNA writing and reading errors.)
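To see roughly how bits map onto bases, consider a deliberately naive scheme in which every base carries two bits. The Python sketch below is purely illustrative and is not the encoding Church’s team (or any later group) actually used; published schemes give up some of that density to avoid error-prone stretches such as long runs of a single base.

```python
# Illustrative only: a naive 2-bits-per-base mapping between a byte stream and DNA.
# Real encoding schemes trade some of this density for robustness, e.g. to avoid
# homopolymer runs and skewed GC content that are hard to synthesize and sequence.

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def bytes_to_dna(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(seq: str) -> bytes:
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

assert dna_to_bytes(bytes_to_dna(b"hello")) == b"hello"
```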

Erlich thought he could get closer to that limit. So he and Dina Zielinski, an associate scientist at the New York Genome Center, looked at the algorithms that were being used to encode and decode the data. They started with six files, including a full computer operating system, a computer virus, an 1895 French film called Arrival of a Train at La Ciotat, and a 1948 study by information theorist Claude Shannon. They first converted the files into binary strings of 1s and 0s, compressed them into one master file, and then split the data into short strings of binary code. They devised an algorithm called DNA Fountain, which randomly packaged the strings into so-called droplets, adding extra tags to help reassemble them in the proper order later. In all, the researchers generated a digital list of 72,000 DNA strands, each 200 bases long.
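The sketch below shows the fountain-code idea behind that packaging step, with hypothetical helper names (make_droplet, segments): each droplet is the XOR of a randomly chosen subset of data segments, prefixed with the random seed that later tells the decoder which segments were mixed. The published DNA Fountain additionally screens candidate droplets for biochemical constraints such as GC content and homopolymer runs and adds error-correcting bytes; this toy version leaves all of that out.

```python
# A minimal Luby-transform-style "droplet" generator in the spirit of DNA Fountain.
# Each droplet mixes a random subset of equal-length data segments by XOR and is
# tagged with the seed that lets a decoder reconstruct which segments were mixed.

import random

def make_droplet(segments: list[bytes], seed: int) -> bytes:
    rng = random.Random(seed)
    # Degree: how many segments to mix; real fountain codes draw this from a
    # soliton distribution rather than uniformly.
    degree = rng.randint(1, len(segments))
    chosen = rng.sample(range(len(segments)), degree)
    payload = bytearray(len(segments[0]))
    for idx in chosen:
        payload = bytearray(a ^ b for a, b in zip(payload, segments[idx]))
    # The 4-byte seed acts as the "tag": the decoder reruns the same RNG to learn `chosen`.
    return seed.to_bytes(4, "big") + bytes(payload)

segments = [bytes([i]) * 32 for i in range(16)]     # sixteen 32-byte data segments
droplets = [make_droplet(segments, seed) for seed in range(100)]
```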

They sent these as text files to Twist Bioscience, a San Francisco, California–based startup, which then synthesized the DNA strands. Two weeks later, Erlich and Zielinski received in the mail a vial with a speck of DNA encoding their files. To decode them, the pair used modern DNA sequencing technology. The sequences were fed into a computer, which translated the genetic code back into binary and used the tags to reassemble the six original files. The approach worked so well that the new files contained no errors, they report today in Science. They were also able to make a virtually unlimited number of error-free copies of their files through polymerase chain reaction, a standard DNA copying technique. What’s more, Erlich says, they were able to encode 1.6 bits of data per nucleotide, 60% better than any group had done before and 85% of the theoretical limit.
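Continuing the toy encoder sketch above (it reuses make_droplet’s droplets and segments), decoding works by “peeling”: replay each droplet’s seed to learn which segments it mixes, XOR out any segments already recovered, and harvest droplets that are left with a single unknown segment. This is only a schematic of the general fountain-decoding technique; the real pipeline must first cope with synthesis and sequencing errors.

```python
# Toy peeling decoder for the droplets generated above. Purely schematic: it
# assumes every droplet arrives error-free.

import random

def decode(droplets: list[bytes], num_segments: int) -> list:
    recovered = [None] * num_segments
    pending = []
    for d in droplets:
        seed, payload = int.from_bytes(d[:4], "big"), bytearray(d[4:])
        rng = random.Random(seed)
        degree = rng.randint(1, num_segments)            # must mirror make_droplet
        pending.append((set(rng.sample(range(num_segments), degree)), payload))
    progress = True
    while progress:
        progress = False
        for chosen, payload in pending:
            # XOR out every segment we already know, shrinking the droplet's degree.
            for idx in [i for i in chosen if recovered[i] is not None]:
                payload[:] = bytes(a ^ b for a, b in zip(payload, recovered[idx]))
                chosen.discard(idx)
            if len(chosen) == 1:                         # one unknown left: it is revealed
                idx = chosen.pop()
                if recovered[idx] is None:
                    recovered[idx] = bytes(payload)
                    progress = True
    return recovered

out = decode(droplets, num_segments=len(segments))
print(sum(seg is not None for seg in out), "of", len(segments), "segments recovered")
```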

The new approach isn’t ready for large-scale use yet. It cost $7000 to synthesize the 2 megabytes of data in the files, and another $2000 to read it. The cost is likely to come down over time, but it still has a long way to go, Erlich says.
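That demonstration cost is where the headline figure comes from: writing plus reading came to about $4,500 per megabyte, and extrapolating that (unrealistically flat) price to a full 215 petabytes lands in the high hundreds of trillions of dollars, consistent with the roughly $900 trillion quoted above. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope extrapolation of the demo's write/read cost, assuming the
# per-megabyte price stays flat (it will not, but this is where the headline
# "$900 trillion" order of magnitude comes from).

write_cost_usd = 7_000      # synthesis cost for the 2 MB demo
read_cost_usd = 2_000       # sequencing cost for the 2 MB demo
demo_size_mb = 2

cost_per_mb = (write_cost_usd + read_cost_usd) / demo_size_mb   # $4,500 per MB
mb_per_petabyte = 1_000_000_000                                 # decimal units: 10^9 MB
total_usd = cost_per_mb * 215 * mb_per_petabyte

print(f"~${total_usd:,.0f}")   # ~$967,500,000,000,000 -- on the order of $10^15
```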

SOURCE – Science magazine