DNA data

Integrated information storage technology for writing large amounts of digital information in DNA using an enzyme-driven, sustainable, low-cost approach

The genetic material DNA has garnered considerable interest as a medium for digital information storage because its density and durability are superior to those of existing silicon-based storage media. For example, DNA is at least 1000-fold more dense than the most compact solid-state hard drive and at least 300-fold more durable than the most stable magnetic tapes. In addition, DNA's four-letter nucleotide code offers a suitable coding environment that can be leveraged like the binary digital code used by computers and other electronic devices to represent any letter, digit, or other character.

Despite these advantages, DNA has not yet become a widespread information storage medium because the cost of chemically synthesizing DNA is still prohibitively high, at $3,500 per megabyte of information. To help overcome this limitation, research at the Wyss Institute spearheaded by Henry Hung-Yi Lee, Ph.D., in a collaborative project led by Core Faculty member George Church, Ph.D., and Founding Director Donald Ingber, M.D., Ph.D., has developed new, enzyme-based approaches that can write DNA more simply and quickly than traditional chemical techniques. These approaches could also produce much longer strands of DNA while being less toxic to the environment. Importantly, they are projected to reduce the future cost of DNA synthesis by many orders of magnitude.

Biochemical and Biophysical Research Communications

Volume 68, Issue 2, 26 January 1976, Pages 329-335

On the native structure of the histone H3–H4 complex

Peter N. Lewis

Abstract

Electrophoretic and sedimentation velocity studies on the histone H3–H4 complex show that provided the H3 cysteine residues remain reduced the complex reforms quantitatively when removed from a variety of denaturing conditions. If histone H3 is allowed to become intramolecularly oxidized while denatured only monomer and large aggregates are formed on return to native conditions. At pH 7 ionic strength 0.1 the complex remains with reduced sulfhydryl groups indefinitely suggesting a vital role for the sequence 96–110 in histone H3 in the tertiary structure of the complex.

This article discusses how DNA might be used to store data. It is argued that, at present, DNA would be best employed as a long-term repository (thousands or millions of years). How data-containing DNA might be packaged and how the data might be encrypted, with particular attention to the encryption of written information, is also discussed. Various encryption issues are touched on, such as how data-containing DNA might be differentiated from genetic material, error detection, data compression and reading frame location. Finally, this article broaches the difficulty of constructing very large pieces of DNA in the laboratory and highlights some complications that might arise when attempting to transmit DNA-encrypted data to recipients who are a long period of time in the future.

DNA barcoding is a method of species identification using a short section of DNA from a specific gene or genes. The premise of DNA barcoding is that, by comparison with a reference library of such DNA sections (also called "sequences"), an individual sequence can be used to uniquely identify an organism to species, in the same way that a supermarket scanner uses the familiar black stripes of the UPC barcode to identify an item in its stock against its reference database.[1] These "barcodes" are sometimes used in an effort to identify unknown species, parts of an organism, or simply to catalog as many taxa as possible, or to compare with traditional taxonomy in an effort to determine species
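The comparison against a reference library can be illustrated with a toy sketch. Everything here is a hypothetical placeholder: the sequences, the species names, and the simple percent-identity score over pre-aligned, equal-length sequences. Real barcoding workflows rely on curated reference libraries and proper sequence alignment.

```python
# Toy illustration of barcode matching: compare a query sequence with a
# small reference library and report the closest entry.
# All sequences and species names below are invented for illustration.

def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def best_match(query: str, library: dict[str, str]) -> tuple[str, float]:
    """Return the (species, identity) pair with the highest identity."""
    return max(
        ((species, identity(query, ref)) for species, ref in library.items()),
        key=lambda pair: pair[1],
    )


REFERENCE_LIBRARY = {
    "species_A": "ACGTACGTACGTACGTACGT",
    "species_B": "ACGTTCGTACGAACGTACGA",
    "species_C": "TTGTACGTACCTACGTGCGT",
}

query = "ACGTACGTACGTACGTACGA"  # differs from species_A at one position
species, score = best_match(query, REFERENCE_LIBRARY)
print(species, round(score, 2))
```

In this sketch the query is assigned to the library entry with the fewest mismatches; a practical tool would also apply a minimum-identity threshold before calling a species.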

plural form of datum: pieces of information

A collection of object-units that are distinct from one another

information

With fresh material, taxonomic conclusions are leavened by recognition that the material examined reflects the site it occupied; a herbarium packet gives one only a small fraction of the data desirable for sound conclusions. Herbarium material does not, indeed, allow one to extrapolate safely: what you see is what you get.

"A formalized representation of facts or concepts suitable for communication, interpretation, or processing by people or automated means " The term "data" is often used to refer to the information stored in the computer

factual information, especially results of an experiment or clinical trial

Refers to information in numerical form that can be digitally transmitted or processed

a term that describes the bits, bytes, etc., that a computer stores and manipulates. Data processed in a useful way becomes information

Information that is collected, stored or processed systematically

information, facts

Digital information or just information, depending on the context

Social science data are the raw material out of which social and economic statistics are produced. Social science data originate from social research methodologies or administrative records, while statistics are produced from data. Data are the information collected and stored at the level at which the unit of analysis was observed. Summaries of these data are usually statistics. Data must be processed to be of practical use. This compilation is accomplished with statistical software, which reads the raw data from a computer file

Factual information used as a basis for reasoning, discussion, or calculation; a collection of numerical facts

the information and evidence gathered during the assessment process for use in determining the level of teaching performance. See Evidence, DNA Data, Information

Programs, files, and other information stored in, communicated, or processed by a computer

Digital information that is input to, output from, or processed and stored in a computer

Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research...

In 1862, Gregor Mendel bred pea plants to study inheritance. Fast forward 100 years to 1962: James Watson, Francis Crick, and Maurice Wilkins were awarded a Nobel Prize for discovering the structure of DNA. Today, advances in this field are spilling over into the most unlikely places.

As we enter the century of biotechnology, our ability to read, write, and edit DNA is disrupting everything from human health to manufacturing. The next disruption to take place could be in the world of data storage.

Tech giants including Facebook and Amazon and their millions of users generate petabytes of data on the Internet every second. Microsoft has been quietly working in the background to store this information in As, Ts, Cs, and Gs, instead of 0s and 1s.

Thermo Scientific Phire Tissue Direct PCR Master Mix has been developed for amplification of DNA directly from a wide variety of tissues obtained from mice, human, fish, birds and insects. The Master Mix containing Phire Hot Start II DNA Polymerase is specially formulated to perform PCR in the presence of different animal tissue-derived inhibitors such as collagen, melanin and eumelanin (hair, skin) or myoglobin (muscle). The kit also includes Thermo Scientific DNARelease Additive, which can be used to improve the release of DNA from difficult tissues.

Talanta

Volume 223, Part 2, 1 February 2021, 121766

Plasma treated graphene FET sensor for the DNA hybridization detection

Plasma-treated graphene is used in a DNA-FET sensor for the first time.

Electrical properties and LOD are all higher than untreated GFET.

A detection limit of 10 aM is obtained for the treated graphene-FET, which is an order of magnitude higher than that of untreated graphene-FET.

Abstract

A room-temperature plasma-treated graphene-based FET is proposed for the first time for DNA hybridization detection. The affinity and electrical properties of the graphene-based DNA-FET sensor were studied and found to benefit from the surface modification. The facile room-temperature Ar plasma treatment readily removes residues from the graphene surface and changes the hydrophilic properties of graphene, which is important for our solution-gated DNA-FET sensor. A limit of detection below 10 aM is obtained in our experiment. In particular, the DNA concentration (CDNA)/net drain current (ΔI) response and the negative shift in the VCNP value of the GFET sensor treated with plasma for 30 s are all improved compared with the untreated sensor. This shows that a simple plasma treatment of the graphene surface can be used for solution-gated FET sensors.

DNA, or deoxyribonucleic acid, is the hereditary material in humans and almost all other organisms. Nearly every cell in a person's body has the same DNA. Most DNA is located in the cell nucleus (where it is called nuclear DNA), but a small amount of DNA can also be found in the mitochondria (where it is called mitochondrial DNA or mtDNA). Mitochondria are structures within cells that convert the energy from food into a form that cells can use.

The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). Human DNA consists of about 3 billion bases, and more than 99 percent of those bases are the same in all people. The order, or sequence, of these bases determines the information available for building and maintaining an organism, similar to the way in which letters of the alphabet appear in a certain order to form words and sentences.

DNA bases pair up with each other, A with T and C with G, to form units called base pairs. Each base is also attached to a sugar molecule and a phosphate molecule. Together, a base, sugar, and phosphate are called a nucleotide. Nucleotides are arranged in two long strands that form a spiral called a double helix. The structure of the double helix is somewhat like a ladder, with the base pairs forming the ladder's rungs and the sugar and phosphate molecules forming the vertical sidepieces of the ladder.

An important property of DNA is that it can replicate, or make copies of itself. Each strand of DNA in the double helix can serve as a pattern for duplicating the sequence of bases. This is critical when cells divide because each new cell needs to have an exact copy of the DNA present in the old cell.

Graphical abstract

Plasma treatment of CVD graphene was used for the first time in a solution-gated DNA-FET sensor. Electrical properties of the graphene-FET and the affinity between the DNA and graphene are analyzed. A detection limit of 10 aM is obtained for the treated graphene-FET, which is an order of magnitude higher than that of untreated graphene-FET.

Direct-to-consumer DNA testing has provided genetic information to more than 12 million individuals, traditionally for exploring ancestry. While such testing does not violate ethical guidelines, other uses of consumer DNA testing may cross the line. Over the past few years, many of these DNA testing companies have branched out into the realm of precision health, treading into ethically dangerous territories.

For example, 23andMe, with US Food and Drug Administration (FDA) permission, now reveals to consumers whether they possess a whole suite of genetic mutations, including those associated with Lynch syndrome and breast cancer, under the assumption that awareness will likely improve the health of its consumers. Other companies advertise that their DNA testing will better educate customers on what type of diet or lifestyle they should incorporate to lose weight.

The major problems with these tests are two-fold. First, many of the tests lack scientific validity to support the genetic outcomes revealed to their customers. Not all of the 25 major companies engaged in direct-to-consumer DNA testing have been Clinical Laboratory Improvement Amendments certified. Second, there is no professional counseling required before and after the consumer receives her results.

DNA-FET sensor

DNA storage offers substantial information density [1-7] and exceptional half-life [3]. We devised a 'DNA-of-things' (DoT) storage architecture to produce materials with immutable memory. In a DoT framework, DNA molecules record the data, and these molecules are then encapsulated in nanometer silica beads [8], which are fused into various materials that are used to print or cast objects in any shape. First, we applied DoT to three-dimensionally print a Stanford Bunny [9] that contained a 45 kB digital DNA blueprint for its synthesis. We synthesized five generations of the bunny, each from the memory of the previous generation without additional DNA synthesis or degradation of information. To test the scalability of DoT, we stored a 1.4 MB video in DNA in plexiglass spectacle lenses and retrieved it by excising a tiny piece of the plexiglass and sequencing the embedded DNA. DoT could be applied to store electronic health records in medical implants, to hide data in everyday objects (steganography) and to manufacture objects containing their own blueprint. It may also facilitate the development of self-replicating machines.

In the late 1980s, scientists at Osaka University in Japan noticed unusual repeated DNA sequences next to a gene they were studying in a common bacterium. They mentioned them in the final paragraph of a paper: "The biological significance of these sequences is not known."

Now their significance is known, and it has set off a scientific frenzy.

The sequences, it turns out, are part of a sophisticated immune system that bacteria use to fight viruses. And that system, whose very existence was unknown until about seven years ago, may provide scientists with unprecedented power to rewrite the code of life.

In the past year or so, researchers have discovered that the bacterial system can be harnessed to make precise changes to the DNA of humans, as well as other animals and plants.

Biological sensor

Genome: "Mysterious Password"

Genome sequencing involves revealing the order of bases present in the entire genome of an organism. Genome sequencing is backed by automated DNA sequencing methods and computer software to assemble the enormous sequence data. It can be divided into four stages: (1) preparation of clones comprising the entire genome of an organism; (2) collection of DNA sequences of clones; (3) generation of contig assembly; and (4) database development. In this chapter, two popular genome sequencing methods – whole genome shotgun sequencing and the clone-by-clone method – are discussed in detail.

Elizabeth A. Normand, Ignatia B. Van den Veyver, in Human Reproductive and Prenatal Genetics, 2019

Genome Sequencing

Genome sequencing is the most unbiased method to sequence the genome as it does not include a capture of specific targeted regions to prepare the library for sequencing (Fig. 29.2C). The resulting sequence data will include coding and noncoding regions, such as introns, promoters, and regulatory sequences. Thus, the amount of information that can be obtained is vastly larger than from exome sequencing, but challenges of classifying and interpreting variants are also amplified. It can often be difficult to prove by bioinformatics interpretation and complementary functional assays whether detected variants in the noncoding fraction of the genome are pathogenic. The sequencing depth in genome sequencing is lower but more uniform, which facilitates the detection of CNVs and has been shown to improve the detection of coding variants by 3% over exome sequencing. The cost of genome sequencing is currently still significantly higher than for exome sequencing, and it is not yet routinely included in clinical NGS-based diagnostic testing. However, as cost continues to fall, technology continues to advance, and our understanding of the noncoding genome improves, this is beginning to change.

Every minute in 2018, Google conducted 3.88 million searches, and people watched 4.33 million videos on YouTube, sent 159,362,760 e-mails, tweeted 473,000 times and posted 49,000 photos on Instagram, according to software company Domo. By 2020 an estimated 1.7 megabytes of data will be created per second per person globally, which translates to about 418 zettabytes in a single year (418 billion one-terabyte hard drives' worth of information), assuming a world population of 7.8 billion. The magnetic or optical data-storage systems that currently hold this volume of 0s and 1s typically cannot last for more than a century, if that. Further, running data centers takes huge amounts of energy. In short, we are about to have a serious data-storage problem that will only become more severe over time.
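The yearly figure quoted above can be checked with a short calculation. This is only a back-of-the-envelope sketch; it assumes decimal units (1 MB = 10^6 bytes, 1 ZB = 10^21 bytes) and a 365-day year, and the function name is illustrative.

```python
# Back-of-the-envelope check of the yearly data volume quoted in the text.
BYTES_PER_MB = 10**6                    # decimal megabyte (assumption)
BYTES_PER_ZB = 10**21                   # decimal zettabyte (assumption)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 365-day year (assumption)


def yearly_zettabytes(mb_per_second_per_person: float, population: float) -> float:
    """Total data created in one year, in zettabytes."""
    total_bytes = (
        mb_per_second_per_person * BYTES_PER_MB * population * SECONDS_PER_YEAR
    )
    return total_bytes / BYTES_PER_ZB


estimate = yearly_zettabytes(1.7, 7.8e9)
print(f"{estimate:.0f} ZB per year")  # about 418 ZB
```

Because one zettabyte is a billion terabytes, the same number also gives the count of one-terabyte drives in billions.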

An alternative to hard drives is progressing: DNA-based data storage. DNA—which consists of long chains of the nucleotides A, T, C and G—is life's information-storage material. Data can be stored in the sequence of these letters, turning DNA into a new form of information technology. It is already routinely sequenced (read), synthesized (written to) and accurately copied with ease. DNA is also incredibly stable, as has been demonstrated by the complete genome sequencing of a fossil horse that lived more than 500,000 years ago. And storing it does not require much energy.

But it is the storage capacity that shines. DNA can accurately stow massive amounts of data at a density far exceeding that of electronic devices. The simple bacterium Escherichia coli, for instance, has a storage density of about 10^19 bits per cubic centimeter, according to calculations published in 2016 in Nature Materials by George Church of Harvard University and his colleagues. At that density, all the world's current storage needs for a year could be well met by a cube of DNA measuring about one meter on a side.
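As a rough sanity check of the one-meter-cube claim, the sketch below converts the quoted density into the capacity of a cubic meter. It assumes 8 bits per byte, decimal zettabytes, and takes the 418 ZB yearly figure from the surrounding article as the demand; the variable names are illustrative.

```python
# Capacity of a one-meter cube of DNA at the quoted density.
BITS_PER_CM3 = 1e19          # density quoted in the text
CM3_PER_M3 = 100**3          # 1 m = 100 cm, so 1 m^3 = 10^6 cm^3
BYTES_PER_ZB = 1e21          # decimal zettabyte (assumption)

capacity_bytes = BITS_PER_CM3 * CM3_PER_M3 / 8
capacity_zb = capacity_bytes / BYTES_PER_ZB

yearly_demand_zb = 418       # yearly estimate quoted earlier in the article
print(f"capacity: {capacity_zb:.0f} ZB")
print(f"years of demand covered: {capacity_zb / yearly_demand_zb:.1f}")
```

Under these assumptions the cube holds on the order of a thousand zettabytes, comfortably more than one year of the estimated demand.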

The prospect of DNA data storage is not merely theoretical. In 2017, for instance, Church's group at Harvard adopted CRISPR DNA-editing technology to record images of a human hand into the genome of E. coli, which were read out with higher than 90 percent accuracy. And researchers at the University of Washington and Microsoft Research have developed a fully automated system for writing, storing and reading data encoded in DNA. A number of companies, including Microsoft and Twist Bioscience, are working to advance DNA-storage technology.

Meanwhile DNA is already being used to manage data in a different way, by researchers who grapple with making sense of tremendous volumes of data. Recent advancements in next-generation sequencing techniques allow for billions of DNA sequences to be read easily and simultaneously. With this ability, investigators can employ bar coding—use of DNA sequences as molecular identification "tags"—to keep track of experimental results. DNA bar coding is now being used to dramatically accelerate the pace of research in fields such as chemical engineering, materials science and nanotechnology. At the Georgia Institute of Technology, for example, James E. Dahlman's laboratory is rapidly identifying safer gene therapies; others are figuring out how to combat drug resistance and prevent cancer metastasis.

Among the challenges to making DNA data storage commonplace are the costs and speed of reading and writing DNA, which need to drop even further if the approach is to compete with electronic storage. Even if DNA does not become a ubiquitous storage material, it will almost certainly be used for generating information at entirely new scales and preserving certain types of data over the long term.

ABOUT THE AUTHOR(S)

Sang Yup Lee

Sang Yup Lee, a co-chair of the World Economic Forum's Global Future Council on Biotechnology, is Distinguished Professor of chemical and biomolecular engineering at the Korea Advanced Institute of Science and Technology. He holds more than 700 patents.

Health Science Journal

DNA and the Digital Data Storage

Lichun Sun1,2*, Jun He3, Jing Luo4 and David H Coy1

1Department of Medicine, School of Medicine, Tulane University Health Sciences Center, New Orleans, LA70112, USA

2Shenzhen Academy of Peptide Targeting Technology at Pingshan and Shenzhen Tyercan Bio-pharm Co., Ltd., Shenzhen, Guangdong, China

3Sino-US Innovative Bio-Medical Center and Hunan Beautide Pharmaceuticals, Xiangtan, Hunan, China

4Department of Health Information and Technology, EXCELth Primary Health Care, New Orleans, LA70112, USA

*Corresponding Author:Lichun Sun

Department of Medicine, School of Medicine

Tulane University Health Sciences Center

New Orleans, LA70112, USA

Tel: 504-988-1179

E-mail: lsun@tulane.edu; peptide612@gmail.com

Received date: 10 May 2019; Accepted date: 19 June 2019; Published date: 26 June 2019

Citation: Sun L, He J, Luo J, Coy DH (2019) DNA and the Digital Data Storage. Health Sci J Vol.13.No.3:659. DOI: 10.36648/1791-809X.1000659

Copyright: © 2019 Sun L, et al. This is an open-access article distributed under the terms of the creative commons attribution license, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.

Abstract

Most current digital data are stored on magnetic and optical media. In this era of explosive data growth, digital data are generated every day and increase at an exponential rate, and these traditional media cannot meet the urgent demand for large-scale data storage. With advantages such as high density, high replication efficiency, and long-term durability and stability, deoxyribonucleic acid (DNA) is expected to be a novel and promising data storage medium. In DNA data storage, files or any readable data are converted to binary and then encoded as DNA sequences consisting of adenine (A), cytosine (C), guanine (G), and thymine (T). The data-carrying DNA sequences are synthesized and stored until the data are retrieved. Upon retrieval, the unique data-carrying DNA fragments are amplified, sequenced and analyzed. The DNA-based information is then decoded into binary and finally converted back to readable information. Currently, the application of DNA data storage is limited by disadvantages such as high cost, long processing times, and the lack of random-access ability, and a series of tough challenges remains. However, recent advances in DNA sequencing technology brighten the future of DNA digital data storage.
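The encode and decode steps described in the abstract can be sketched in a few lines. The mapping of two bits to one base used here (00 to A, 01 to C, 10 to G, 11 to T) is an assumption chosen for illustration, not the scheme of any particular study; practical schemes add error correction and avoid problematic motifs such as long runs of one base.

```python
# Minimal sketch: bytes -> binary -> DNA letters, and back again.
# The 2-bits-per-base mapping below is a hypothetical choice for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert bytes to a DNA sequence, two bits per nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(sequence: str) -> bytes:
    """Convert a DNA sequence produced by encode() back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = "DNA".encode("ascii")
strand = encode(message)
print(strand)                      # CACACATGCAAC
assert decode(strand) == message   # round trip recovers the original bytes
```

With this mapping each byte becomes four nucleotides, so the three-letter message yields a twelve-base strand.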

Keywords

Digital data storage; Deoxyribonucleic acid (DNA); Binary; Encoding; Decoding; Sequencing

Introduction

We live in an era of explosive growth in big data. Big data are found almost everywhere, from grocery stores to banks, from offline to online, from academia to industry, from hospitals to communities, and from organizations to governments. The storage and management of big data is becoming a serious concern. Currently, most data worldwide are stored on magnetic and optical media such as hard disk drives (HDDs), disks, CDs, tapes, DVDs, portable hard drives and USB flash drives [1-4]. However, the volume of archival data is increasing at an exponential rate, and the limited capacity of these traditional media cannot keep pace with it. Durability is another major challenge: these media last only for a limited time [5]. For instance, disks can last for several years and tapes for several decades. Other electronic storage can be kept in good condition for several decades. Storage capacity is a further problem for big digital data. A CD may store several hundred megabytes (MB) of data, and a large hard drive may store a couple of terabytes (TB). However, these capacities fall far short of what the explosive growth of data requires [4].

According to Patrizio, there were 33 zettabytes (ZB) of data worldwide in 2018 (https://www.networkworld.com), equal to 33 trillion gigabytes (GB). Therefore, a novel storage technology and innovative systems are needed to meet the requirements of the modern era. Deoxyribonucleic acid (DNA), owing to its unique advantages, is expected to be an ideal medium for digital data storage [6-8]. Storing digital data in DNA is not a new idea. It was described by the Soviet physicist Mikhail Neiman in the 1960s (https://www.geneticsdigest.com). However, it was not until 1988 that DNA was first demonstrated to store digital data [8]. Here, we first introduce the applications of DNA as a new medium for digital data storage and then discuss this field in more detail.

Review of Previous Studies

The binary numeric system

Computers and other digital electronic devices store data and operate using the binary numeral system, which uses only two digits, 0 and 1 [9]. Text is converted to binary within the computer system; in turn, computers operate and calculate in binary and eventually convert the information back to readable text. One byte contains eight bits, each either 0 or 1, giving 2^8 (256) possible values (from 0 to 255), and stores a single letter (Figure 1 and Table 1) [9,10]. The ASCII conversion table (Table 1) shows the twenty-six letters, in upper and lower case, as letter, binary and hexadecimal. Storing a large file or document needs much more memory. A regular song may need dozens of megabytes, a movie a couple of gigabytes, and the books in a large library several terabytes. Table 2 lists the units of measurement and memory for the binary system from the smallest unit, the byte, to the largest: byte (B), kilobyte (KB), megabyte (MB), gigabyte (GB), terabyte (TB), petabyte (PB), exabyte (EB), zettabyte (ZB), yottabyte (YB), brontobyte (BB), geopbyte (GPB) and so on (https://www.geeksforgeeks.org & https://whatsabyte.com). Units such as the brontobyte (BB) and geopbyte (GPB) are unimaginably large values that may never be used in the real world (Table 2).

Figure 1: The text string "DNA digital data storage" was converted as binary bits.
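A conversion like the one in Figure 1 can be reproduced with a minimal sketch. It assumes ASCII text and pads each character to a full eight-bit byte; the function names are illustrative.

```python
# Convert an ASCII text string to space-separated 8-bit groups and back.

def text_to_bits(text: str) -> str:
    """Return one 8-bit binary group per character, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in text.encode("ascii"))


def bits_to_text(bits: str) -> str:
    """Inverse of text_to_bits()."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")


bits = text_to_bits("DNA digital data storage")
print(bits)
print(bits_to_text(bits))
```

Each character, including the spaces between words, occupies exactly one byte in this encoding.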

Table 1: The conversion ASCII table of the twenty-six letters with the upper and lower cases among letter, binary and hexadecimal.

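| Letter | Binary | Hexadecimal | Letter | Binary | Hexadecimal |
|---|---|---|---|---|---|
| A | 1000001 | 41 | a | 1100001 | 61 |
| B | 1000010 | 42 | b | 1100010 | 62 |
| C | 1000011 | 43 | c | 1100011 | 63 |
| D | 1000100 | 44 | d | 1100100 | 64 |
| E | 1000101 | 45 | e | 1100101 | 65 |
| F | 1000110 | 46 | f | 1100110 | 66 |
| G | 1000111 | 47 | g | 1100111 | 67 |
| H | 1001000 | 48 | h | 1101000 | 68 |
| I | 1001001 | 49 | i | 1101001 | 69 |
| J | 1001010 | 4A | j | 1101010 | 6A |
| K | 1001011 | 4B | k | 1101011 | 6B |
| L | 1001100 | 4C | l | 1101100 | 6C |
| M | 1001101 | 4D | m | 1101101 | 6D |
| N | 1001110 | 4E | n | 1101110 | 6E |
| O | 1001111 | 4F | o | 1101111 | 6F |
| P | 1010000 | 50 | p | 1110000 | 70 |
| Q | 1010001 | 51 | q | 1110001 | 71 |
| R | 1010010 | 52 | r | 1110010 | 72 |
| S | 1010011 | 53 | s | 1110011 | 73 |
| T | 1010100 | 54 | t | 1110100 | 74 |
| U | 1010101 | 55 | u | 1110101 | 75 |
| V | 1010110 | 56 | v | 1110110 | 76 |
| W | 1010111 | 57 | w | 1110111 | 77 |
| X | 1011000 | 58 | x | 1111000 | 78 |
| Y | 1011001 | 59 | y | 1111001 | 79 |
| Z | 1011010 | 5A | z | 1111010 | 7A |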

*Note: ASCII (American Standard Code for Information Interchange): a series of digital codes that represent letters, numerals, and other symbols and serve as a standard format in computer systems.
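The rows of Table 1 follow directly from each letter's ASCII code point, as this small sketch shows. It uses Python's built-in `ord` and `format`; the helper name `ascii_row` is illustrative.

```python
import string

# Derive the letter / binary / hexadecimal columns of an ASCII table.

def ascii_row(letter: str) -> tuple[str, str, str]:
    """Return (letter, binary, hexadecimal) for a single ASCII character."""
    code = ord(letter)
    return letter, format(code, "b"), format(code, "X")


for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase):
    print(*ascii_row(upper), *ascii_row(lower))
```

The binary column has seven digits because the leading zero of the eight-bit byte is not printed; `format(code, "08b")` would show the full byte.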

Table 2: The sizes of measurement and memory.

| Sizes | Byte magnitude | Units | Storage* |
|---|---|---|---|
| 1 B | 10^0 | Byte | A character "A", "1", "$" |
| 10 B | 10^1 | | |
| 100 B | 10^2 | | |
| 1 KB | 10^3 | Kilo byte | The size for graphics of small websites ranges between 5 and 100 KB |
| 10 KB | 10^4 | | |
| 100 KB | 10^5 | | |
| 1 MB | 10^6 | Mega byte (1 MB: 1 million) | The size for a high resolution JPEG image is about 1-5 MB |
| 10 MB | 10^7 | | The size for a 3-minute song is about 30 MB |
| 100 MB | 10^8 | | |
| 1 GB | 10^9 | Giga byte (1 GB: 1 billion) | The size for a standard DVD drive is about 5 GB |
| 10 GB | 10^10 | | |
| 100 GB | 10^11 | | |
| 1 TB | 10^12 | Tera byte (1 TB: 1 trillion) | The size for a typical internal HDD is about 2 TB |
| 10 TB | 10^13 | | |
| 100 TB | 10^14 | | |
| 1 PB | 10^15 | Peta byte (1 PB: 1 quadrillion) | Google store over 100 PB of all data in their drivers. |
| 10 PB | 10^16 | | |
| 100 PB | 10^17 | | |
| 1 EB | 10^18 | Exa byte (1 EB: 1 quintillion) | Several hundred EBs of data are transferred over global internet per year. Facebook built an entire data center to store 1 EB of data in 2013 |
| 10 EB | 10^19 | | |
| 100 EB | 10^20 | | |
| 1 ZB | 10^21 | Zetta byte (1 ZB: 1 sextillion) | 33 ZBs of global data in 2018. 160-180 ZBs of data is predicted in 2025. |
| 10 ZB | 10^22 | | |
| 100 ZB | 10^23 | | |
| 1 YB | 10^24 | Yotta byte (1 YB: 1 septillion) | 1 YB = 1 million EBs. 1 YB = Size of the entire World Wide Web |
| 10 YB | 10^25 | | |
| 100 YB | 10^26 | | |
| 1 BB | 10^27 | Bronto byte (1 BB: 1 octillion) | 1 BB equals to 1 million ZBs. The only thing there is to say about a Brontobyte is that it is a 1 followed by 27 zeros! |
| 10 BB | 10^28 | | |
| 100 BB | 10^29 | | |
| 1 GPB | 10^30 | Geop byte (1 GPB: 1 nonillion) | No one knows why this term was created. It is highly doubtful that anyone alive today will EVER see a Geopbyte hard drive. |

*Note: cited from the website GeeksforGeeks (https://www.geeksforgeeks.org) and the website WhatsaByte (https://whatsabyte.com)
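Conversions between the units in Table 2 reduce to powers of 1000, as the sketch below illustrates. It assumes the decimal prefixes used in the table (each unit is 10^3 times the previous one) and covers only the standard units up to the yottabyte; the function names are illustrative.

```python
# Convert between decimal storage units, each 1000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value: float, unit: str) -> float:
    """Express a quantity given in `unit` as a number of bytes."""
    return value * 10 ** (3 * UNITS.index(unit))


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert `value` from one unit in UNITS to another."""
    return to_bytes(value, from_unit) / 10 ** (3 * UNITS.index(to_unit))


print(convert(1, "YB", "EB"))    # 1 YB expressed in EB
print(convert(1, "EB", "PB"))    # 1 EB expressed in PB
print(convert(33, "ZB", "GB"))   # 33 ZB expressed in GB
```

For example, 1 YB comes out as one million EB, matching the table, and 33 ZB comes out as 33 trillion GB.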

The digital data storage

Digital Data Storage (DDS) was introduced and developed in the 1980s. It is a computer-based data storage technology based on the Digital Audio Tape (DAT) format. These digital data were stored on silicon-based chips. Silicon is the primary material of most semiconductor and microelectronic chips. Pure memory-grade silicon is rarely found in nature, and the world's supply of microchip-grade silicon is expected to run out in the near future. Also, Moore's Law (the number of transistors accommodated on an integrated circuit roughly doubles every two years, so chips with more transistors run faster) is coming to an end [5]. Thus, chips will not be able to accommodate additional transistors and will reach the limit of their capacity.

Meanwhile, most current digital data are stored on traditional magnetic, optical and other media such as hard disk drives (HDDs) and CDs. Besides their limited storage capacity, these media can be kept only for a very limited time [3]. They are sensitive to the environment and storage conditions. Any change in these conditions, such as magnetic exposure, high moisture, high temperature or mechanical damage, can damage the media or cause data loss. Frequent use can also lead to damage or data loss. In addition, storing such large amounts of data and keeping up with their explosive growth requires a large number of media such as disks, CDs, DVDs and hard drives [3], which is costly and time-consuming.

At the same time, the volume of digital data and the demand for data storage are growing at an exponential rate. IBM built a large center with a storage capacity of 120 PB in 2011. Facebook built a larger center with the capacity to store 1 EB (1000 PB) of data in 2013. All digital users worldwide produced over 44 exabytes (EB) (44,000 PB) of data per day. A total of 1 zettabyte (ZB; 1000 EB or 1 million PB) of data was produced globally in 2010 and 33 ZB in 2018, with 150-200 ZB predicted for 2025 (cited from the Datanami website, https://www.datanami.com, and the Network World website, https://www.networkworld.com) [5,11]. Storing these data would require hundreds of thousands of huge data centers. In 2018, Facebook had a total of 15 data center locations, with more new centers being announced. It will build four additional data centers in Nebraska, consisting of six large buildings with over 2.6 million square feet of data-storing space. Regardless, data-storing space can never catch up with the exponential increase of data, and current storage media cannot satisfy the storage requirement. There is an urgent need to develop a new generation of data storage technology to replace current silicon-based storage. With its unique characteristics and potential advantages, deoxyribonucleic acid (DNA) is taking center stage as a possible digital data storage medium.
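The growth figures quoted above can be turned into an average yearly growth factor with a short calculation. This is a rough sketch that assumes smooth compound growth between the quoted years and uses 175 ZB, the midpoint of the predicted 150-200 ZB range, as the 2025 value; the function name is illustrative.

```python
# Average yearly growth factor implied by two data-volume figures,
# assuming smooth compound (exponential) growth between them.

def annual_growth_factor(start: float, end: float, years: int) -> float:
    """Factor g such that start * g**years == end."""
    return (end / start) ** (1 / years)


observed = annual_growth_factor(1, 33, 2018 - 2010)      # 1 ZB -> 33 ZB
projected = annual_growth_factor(33, 175, 2025 - 2018)   # 33 ZB -> 175 ZB (midpoint, assumption)

print(f"2010-2018: x{observed:.2f} per year")
print(f"2018-2025: x{projected:.2f} per year")
```

Under these assumptions the quoted figures imply growth of roughly 55% per year for 2010-2018 and a slower, though still compounding, rate for the 2025 prediction.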

The basic information of deoxyribonucleic acid (DNA)

In 1953, Watson and Crick disclosed that a DNA molecule has two strands that coil around each other to form a double helix [12]. The genetic material of most natural organisms is double-stranded helical DNA, although some organisms carry single-stranded DNA and others single- or double-stranded RNA. Nucleotides, the components of DNA, consist of a nitrogenous base, a phosphate group and a deoxyribose group. The latter two form the backbone of each DNA strand, while each pair of bases, one from each strand, is connected by hydrogen bonds. DNA nucleotides contain four types of bases: adenine (A), cytosine (C), guanine (G), and thymine (T) (Figure 2) [12,13]. Ribonucleic acid (RNA) also has four types of bases, with uracil (U) in place of thymine (T). Adenine (A) and guanine (G) are purines, whereas cytosine (C), thymine (T), and uracil (U) are pyrimidines [13]. In DNA molecules, the base-pairing rule is that A pairs with T, and G pairs with C (Figures 3 and 4) [2,12,13].

Figure 2: The four types of nucleotides that make up natural deoxyribonucleic acid (DNA): adenine (A), cytosine (C), guanine (G), and thymine (T). (Adapted from the NIH PubChem website: https://pubchem.ncbi.nlm.nih.gov).

Figure 3: The schematic structure of deoxyribonucleic acid (DNA). Each of the four types of nucleotides is composed of a deoxyribose, a phosphate group and one of the four nucleobases (A, T, G, C). Double-stranded DNA forms a double helix by pairing A with T and C with G, with hydrogen bonds connecting the paired bases. Cited from Genetics Generation (website: https://knowgenetics.org).

Figure 4: The schematic process of DNA replication. A DNA molecule has two complementary strands. During semi-conservative replication, the two strands of the DNA molecule are separated. Each parental strand serves as a template for its complementary strand (the daughter strand). Each base added to the new strand is the complement of the base on the parental strand (A with T, and C with G). Each new double-stranded DNA molecule has one parental strand and one daughter strand. These DNA molecules are highly conserved. Cited from Slide Share (website: https://www.slideshare.net/quaninaquan/dnareplication-slide-11981512).

During replication, a double-stranded DNA molecule unwinds, and each of the two separated parental strands acts as a template for the synthesis of a new daughter strand. Complementary nucleotides are added to the daughter strand: their phosphates and deoxyriboses form the backbone of the new strand, and their bases pair with the opposite bases on the parental strand following the base-pairing rule (A with T, and G with C) and are held in place by hydrogen bonds [14]. Eventually, each new double-stranded DNA molecule has one parental strand and one daughter strand. By replicating in this semi-conservative manner, DNA molecules keep the genetic information conserved and constant as it passes from one generation to the next (Figure 4) [14].
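The base-pairing rule and the semi-conservative model can be illustrated with a minimal Python sketch. It is a simplification rather than a biochemical model: strand direction (5' to 3') is ignored, and the function names `complement` and `replicate` are hypothetical names chosen for this illustration.

```python
# Watson-Crick base-pairing rule: A pairs with T, G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a strand (direction ignored)."""
    return "".join(PAIRING[base] for base in strand)


def replicate(parent_a: str, parent_b: str):
    """Semi-conservative replication of one duplex into two duplexes.

    Each parental strand serves as the template for a new daughter strand,
    so every resulting duplex holds one parental and one daughter strand.
    """
    daughter_of_a = complement(parent_a)
    daughter_of_b = complement(parent_b)
    return (parent_a, daughter_of_a), (parent_b, daughter_of_b)


# Illustrative template sequence (an assumption, not taken from the text).
template = "ATGCCGTA"
partner = complement(template)
duplex_1, duplex_2 = replicate(template, partner)
```

Both resulting duplexes carry the same pair of sequences as the original molecule, which is why replication preserves the stored information.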

Figure 5: The schematic process of storing digital data in DNA, including encoding (a binary string to an oligo DNA) and decoding (DNA sequences to binary data). For encoding, a text string of 24 bytes ("DNA digital data storage") was converted to binary bits that were subsequently encoded into an oligo DNA at one bit per base, with purines (A, G) assigned as 1s and pyrimidines (C, T) as 0s. The oligo DNA was synthesized and the text was saved as oligo DNA fragments. For decoding, the DNA fragments were amplified by PCR, sequenced and decoded to binary bits, and eventually converted to readable information.

The process of DNA digital data storage

The process of DNA digital data storage consists of encoding binary data into synthesized DNA strands and decoding it back. Text, numbers, images and other readable or visible content are first converted to binary code consisting of 0s and 1s, which is then encoded as DNA nucleotide sequences using the four bases (A, C, G, T) in place of 0 and 1 [4,15]. For instance, the upper-case letter "D" is "01000100" in binary, the lower-case letter "d" is "01100100", and a blank " " is "00100000" (Figure 1). In Figure 5, the sentence "DNA digital data storage" is converted to 24 bytes of binary code. The binary bits are then encoded into DNA. Each of the four bases (A, C, G and T) must be assigned to either 1 or 0. For example, the purines (A, G) can be assigned as 1s and the pyrimidines (C, T) as 0s; alternatively, G and T can be assigned as 1s and A and C as 0s. As shown in Figure 5, for data encoding, the 24-byte text string is converted to binary bits that are subsequently encoded into an oligo DNA at one bit per base, with purines (A, G) as 1s and pyrimidines (C, T) as 0s. The oligo DNA is chemically synthesized, and the text is thereby saved as oligo DNA fragments for long-term storage. For data decoding, when the data need to be retrieved, the DNA fragments are amplified by PCR, sequenced and decoded to binary bits, which are then converted back to readable data. In short, reading data from a DNA sequence library means sequencing the unique DNA molecules and converting the sequencing information back into the original digital data as required (Figure 5) [2,3,16].
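The one-bit-per-base scheme described above can be sketched in Python as follows. The purine/pyrimidine assignment comes from the text. The rule for choosing which purine or which pyrimidine to write is not specified there, so this sketch assumes a simple one: alternate between the two candidates by position, which also keeps neighbouring bases from being identical. All function names are hypothetical.

```python
PURINES = "AG"      # either purine encodes bit 1
PYRIMIDINES = "CT"  # either pyrimidine encodes bit 0


def text_to_bits(text: str) -> str:
    """Convert ASCII text to a string of 0s and 1s, 8 bits per character."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def bits_to_dna(bits: str) -> str:
    """Encode one bit per base: 1 -> purine, 0 -> pyrimidine."""
    bases = []
    for position, bit in enumerate(bits):
        candidates = PURINES if bit == "1" else PYRIMIDINES
        # Assumption: pick the first candidate at even positions and the
        # second at odd positions, so adjacent bases always differ.
        bases.append(candidates[position % 2])
    return "".join(bases)


def dna_to_bits(dna: str) -> str:
    """Decode a sequenced strand back to bits using the same assignment."""
    return "".join("1" if base in PURINES else "0" for base in dna)


def bits_to_text(bits: str) -> str:
    """Group bits into bytes and convert them back to ASCII text."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("ascii")


message = "DNA digital data storage"   # the 24-byte example from the text
oligo = bits_to_dna(text_to_bits(message))
recovered = bits_to_text(dna_to_bits(oligo))
```

At one bit per base, the 24-byte message becomes a 192-nucleotide sequence, and decoding the sequence returns the original text.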

The advantages of DNA data storage medium

As mentioned above, global data are increasing at an exponential rate, and traditional media cannot adequately meet the resulting storage requirement [4]. DNA may serve as a medium for digital data storage, with potential advantages such as high density, high replication efficiency, and long-term durability and stability (https://www.scientificworldinfo.com) [2,8,15-17]. At its theoretical maximum capacity, DNA can encode about two bits per nucleotide [2]. An entire data center built by IBM in 2011 has about 100 petabytes (PBs) of data-storing capacity. Because of its high density, however, DNA can store a large amount of data in a very small volume. A single gram of DNA at its theoretical maximum can store about 200 PBs of data, roughly double the capacity of the entire IBM data center [2,7,11,15]. In other words, all the information recorded worldwide could be stored in several kilograms of DNA, about the size of one shoebox, whereas traditional media would require millions of large data storage centers [4,16].
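The figure of two bits per nucleotide follows from the four-letter alphabet: each base is one of four symbols, so it can carry at most log2(4) = 2 bits. A minimal Python sketch of such a maximum-density code is given below. The specific mapping (00 to A, 01 to C, 10 to G, 11 to T) is an assumption made for illustration; practical schemes usually give up some density to avoid problematic sequences and to add redundancy.

```python
# Assumed mapping of 2-bit groups to bases (illustrative only).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_2bit(data: bytes) -> str:
    """Encode bytes at two bits per nucleotide (four bases per byte)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_2bit(dna: str) -> bytes:
    """Decode a two-bits-per-nucleotide sequence back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


payload = b"DNA digital data storage"
dense_oligo = encode_2bit(payload)
```

At this density the 24-byte example needs 96 nucleotides, half the length required by a one-bit-per-base code.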

A data-encoded DNA medium is capable of long-term storage because of its high durability [4,15]. DNA can last for thousands of years in cold, dry and dark places. Even under less favorable conditions, the half-life of DNA is up to a hundred years [3,17]. DNA remains stable across a wide range of temperatures, from -80°C to 80°C [2,5,15]. DNA media can also keep data more secure than traditional digital data media [8,16]. Although new data are increasing at an exponential rate, most of them are saved in archives for long-term storage [4,15]. These cold data are not retrieved immediately or used frequently, so storing them in DNA media is simple, convenient and inexpensive to maintain. Another advantage is that DNA is highly conserved. Natural DNA replicates itself accurately and efficiently, always following the base-pairing rule (A with T, C with G) (Figure 3) [3,16,17]. Thus, a DNA medium can preserve data fidelity for a long time.

The challenges for DNA data storage medium

Given its unique characteristics and its advantages over traditional media, DNA is a promising medium for digital data storage [5,15]. However, there is still a long way to go before DNA storage can be applied commercially. The challenges include high cost, low throughput, limited access to stored data, the short length of synthetic oligo DNA fragments, and error rates in synthesis and sequencing [7,16,18,19].

Using DNA for data storage is much more expensive than using traditional media such as tape, disk, and hard disk drives (HDDs) (https://www.scientificworldinfo.com) [3,5]. Currently, encoding and decoding data costs almost $15,000 per megabyte (MB). Current DNA synthesis technology is also limited to short oligo DNA sequences, with a maximum length of several hundred nucleotides per fragment [11,20]. Thus, storing a single archived file, particularly a large one, may require hundreds of thousands of oligo DNAs. Writing data into and retrieving data from oligo DNAs is also time-consuming because it involves multiple steps: converting data to binary, encoding binary into oligo DNA, synthesizing and storing the DNA sequences, retrieving the unique sequences from the DNA storage library, sequencing and decoding them, and finally converting the binary back into readable data.

Traditional media such as disk and tape carry logical addressing information, whereas oligo DNAs do not, so it is very difficult to locate the particular encoded DNA sequence that is wanted [16,18]. Random access is important for DNA-based data storage, yet oligo DNAs have no inherent random access ability [7,16,19]. With current approaches, only bulk access is available: the entire DNA-based data store must be sorted, sequenced and decoded even when only a single byte needs to be read [16]. Therefore, a primer that selectively retrieves the desired DNA sequence is required. Such a primer provides random access during DNA sequencing and data retrieval, because sequencing with a unique primer reads only the required oligo DNA rather than the entire DNA library [16,17].

Finally, DNA synthesis and sequencing are not yet perfect. Insertions, deletions, substitutions and other errors can occur during both processes, at an error rate of about 1% per nucleotide [5,16]. At present, neither the technology nor the cost of DNA synthesis and sequencing is suitable for routine data storage [17-21].
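Three of the constraints above (short oligos, a per-nucleotide error rate, and primer-based selective retrieval) can be made concrete with a rough Python sketch. It rests on stated assumptions rather than on any published scheme: a payload of 100 nucleotides per oligo at two bits per nucleotide, independent errors at 1% per nucleotide, and a primer modelled as a simple sequence prefix. Real retrieval relies on PCR amplification with primer pairs, and all names and sequences below are hypothetical.

```python
def oligos_needed(file_bytes: int, payload_nt: int, bits_per_nt: int = 2) -> int:
    """Oligos required for a file, ignoring address and redundancy overhead."""
    total_bits = file_bytes * 8
    bits_per_oligo = payload_nt * bits_per_nt
    return -(-total_bits // bits_per_oligo)  # ceiling division


def error_free_probability(length_nt: int, per_nt_error: float = 0.01) -> float:
    """Chance that a strand has no error, assuming independent errors."""
    return (1.0 - per_nt_error) ** length_nt


def select_by_primer(pool, primer):
    """Return only the oligos tagged with the given primer (prefix match)."""
    return [oligo for oligo in pool if oligo.startswith(primer)]


# Hypothetical primers acting as file addresses in a mixed pool.
PRIMER_FILE_A = "ACGTAC"
PRIMER_FILE_B = "TGCATG"
pool = [
    PRIMER_FILE_A + "CACACGTT",
    PRIMER_FILE_B + "GGATCCAA",
    PRIMER_FILE_A + "TTGACCGA",
]
file_a_oligos = select_by_primer(pool, PRIMER_FILE_A)
```

Under these assumptions a 1 MB file (1,000,000 bytes) needs 40,000 oligos, and only about 22% of 150-nucleotide strands would be read without any error, which is why redundancy and error correction are needed in practice.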

Perspectives for DNA data storage medium

Because of the exponential increase of global data, the lack of sufficient storage space and the need for innovative storage approaches, DNA as a potential new medium is becoming a hot topic in the field of big data storage. With its high density, high replication efficiency, and long-term durability and stability, DNA has clear advantages over traditional data storage media [17]. At the same time, the applications of DNA digital data storage have been limited by its high cost, its lack of random access ability, and the time needed for data encoding and decoding. Fortunately, DNA technology is progressing quickly. For instance, to complete the sequencing of the first human genome, scientists around the world collaborated for about 13 years, at a total cost of about $3 billion by the project's completion in 2003 (Human Genome Project (HGP) website: https://www.genome.gov/human-genome-project).

Conclusion

Nowadays, scientists need only several thousand dollars and a couple of weeks to sequence one entire human genome. In the near future, sequencing one human genome is expected to cost a hundred dollars or less and to take only several hours. Thus, the cost can be expected to become affordable. For random access and addressing, scientists have tackled the challenge by designing unique primers that selectively address and retrieve the required information. To reduce the impact of errors, error-correction metadata are encoded in the oligo DNA fragments. Meanwhile, single-molecule DNA sequencers have been invented and are now available. They are handy and portable, and they can further reduce the cost of DNA sequencing and simplify the retrieval of DNA information. Thus, as the technologies of DNA data storage advance, DNA as a data storage medium represents a golden opportunity in this era of big data.

References

Mayer C, McInroy GR, Murat P, Delft PV, Balasubramanian S (2016) An epigenetics-inspired DNA-based data storage system. Angew Chem Int Ed Engl 55: 11144-11148.

Swati A, Mathuria F, Bhavani S, Malathy E, Mahadevan R (2017) A review on various encoding schemes used in digital DNA data storage. Int J Civil Eng Technol 8: 7-10.

Appuswamy RLK, Barbry P, Antonini M, Madderson O, Freemont P (2019) Archive: Using DNA in the DBMS storage hierarchy. CIDR 2019, Biennial Conference on Innovative Data Systems Research, California, USA.

De Silva PY, Ganegoda GU (2016) New trends of digital data storage in DNA. Biomed Res Int pp: 8072463-8072472.

Panda DM, Baig KA, Swain MJ, Behera A, Dash D (2018) DNA as a digital information storage device: hope or hype? Biotech 8: 9-15.

Chen K, Kong J, Zhu J, Ermann N, Predki P, et al. (2019) Digital data storage using DNA nanostructures and solid-state Nanopores. Nano Lett 19: 1210-1215.

Yazdi S, Gabrys R, Milenkovic O (2017) Portable and error-free DNA-based data storage. Sci Rep 7: 5011-5013.

Church GM, Gao Y, Kosuri S (2012) Next-generation digital information storage in DNA. Science 337: 1628-1630.

Kuang SY, Zhu G, Wang ZL (2018) Triboelectrification-Enabled Self-Powered Data Storage. Adv Sci (Weinh) 5: 1700658.

Block FE (1987) Analog and digital computer theory. Int J Clin Monit Comput 4: 47-51.

O'Driscoll A, Sleator RD (2013) Synthetic DNA: the next generation of big data storage. Bioengineered 4: 123-125.

Portin P (2014) The birth and development of the DNA theory of inheritance: sixty years since the discovery of the structure of DNA. J Genet 93: 293-302.

Leu K, Obermayer B, Rajamani S, Gerland U, Chen IA (2011) The prebiotic evolutionary advantage of transferring genetic information from RNA to DNA. Nucleic Acids Res 39: 8135-8147.

Burgers PMJ, Kunkel TA (2017) Eukaryotic DNA replication fork. Annu Rev Biochem 86: 417-438.

Akram F, Haq I, Ali H, Laghari AT (2018) Trends to store digital data in DNA: an overview. Mol Biol Rep 45: 1479-1490.

Organick L, Ang SD, Chen YJ, Lopez R, Yekhanin S, et al. (2018) Random access in large-scale DNA data storage. Nat Biotechnol 36: 242-248.

Bornholt J, Lopez R, Carmean DM, Ceze L, Seelig G, et al. (2016) A DNA-based archival storage system. ASPLOS 201 (21st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Atlanta, GA).
