In the current era of data, the exponential growth of digital information demands a reliable storage medium capable of holding vast amounts of data. The capacity of traditional storage media, however, cannot keep pace with this growth. Conventional magnetic and optical storage devices also suffer from limited storage longevity. As a result, alternative methods of molecular and atomic data storage have been actively investigated in recent years. Since DNA was first used for information storage in 1988, it has emerged as a promising medium owing to its high data density and long-term stability.

DNA data storage is divided into four main components: coding, writing, storage, and fetching. DNA coding converts the information to be stored, represented as a binary stream, into a DNA base sequence using a particular algorithm, which makes it possible to translate between data and DNA sequences. Writing is the synthesis of the specified DNA sequences, achieved through various DNA synthesis techniques. Storage can be carried out in vivo or in vitro, using specific media or methods to preserve the DNA and allow large-scale, long-term preservation. Fetching begins with sequencing a particular DNA sample, followed by error correction and deduplication of the sequenced information. Finally, decoding rules are applied to reverse the coding process and restore the processed DNA sequence to the original information.

「DNA synthesis and data storage」

Some techniques and methods involved in DNA storage are described below.

Data encoding mode

Based on the structure and composition of DNA, there are three basic DNA storage coding models: binary, ternary, and quaternary.
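The coding and decoding steps described above, converting a binary stream into a base sequence and reversing the process on retrieval, can be sketched in a few lines. The example below is a minimal illustration only: it assumes a quaternary mapping of two bits per base with the assignment A=00, C=01, G=10, T=11, and the names `encode`, `decode`, `BITS_TO_BASE`, and `BASE_TO_BITS` are hypothetical. It ignores constraints that a practical scheme would also have to consider.

```python
# Illustrative two-bits-per-base mapping. The assignment A=00, C=01, G=10, T=11
# is an assumption made for this sketch, not a rule taken from the article.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert bytes into a DNA base sequence, two bits per base."""
    bitstream = "".join(f"{byte:08b}" for byte in data)
    return "".join(
        BITS_TO_BASE[bitstream[i:i + 2]] for i in range(0, len(bitstream), 2)
    )


def decode(sequence: str) -> bytes:
    """Reverse encode(): map each base back to two bits and regroup into bytes."""
    bitstream = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(
        int(bitstream[i:i + 8], 2) for i in range(0, len(bitstream), 8)
    )


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)          # base sequence for the two input bytes
    print(decode(strand))  # original bytes restored
```

Each byte becomes four bases under this mapping, so decoding simply regroups the recovered bits into blocks of eight.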
In the binary model, the binary digits 0 and 1 are mapped onto the bases A, T, C, and G: two bases are defined as 0 and the other two as 1, which gives six possible assignments. The ternary model encodes the digits 0, 1, and 2 and converts them into the corresponding nucleotides according to specific rules. The quaternary model uses the bases A, T, C, and G to represent the four quaternary digits 0, 1, 2, and 3, respectively, so a DNA sequence can be viewed as a natural quaternary code. Currently, the quaternary model is the most widely used in DNA storage encoding because it offers the highest storage density, with each base carrying up to two bits.

Error correction coding technology

Error correction in DNA storage relies primarily on coding techniques. When DNA is used to store, encode, and decode data, the accuracy of synthesis and reading is limited, resulting in an error rate of around 0.1%. Longer DNA sequences are more susceptible to errors. To mitigate this issue, redundant information is incorporated during DNA synthesis. A specific algorithm establishes relationships between the data and the redundant information, so that the error pattern can be derived from the redundancy and any erroneous data can be recovered, which gives the system its fault tolerance. Commonly used error correction codes include Hamming codes, CRC codes, BCH codes, and RS (Reed-Solomon) codes, among others.

DNA synthesis method

Currently, DNA synthesis methods can be classified into two main categories: chemical and biological. Chemical methods comprise column-based synthesis and in situ chemical synthesis on microarrays. Of these, solid-phase chemical synthesis on DNA synthesis columns has the lower throughput, because the synthesis column instrument has a maximum capacity of only 1536 distinct DNA sequences.
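As a concrete illustration of the redundancy principle described under error correction coding technology, the sketch below implements a Hamming(7,4) code, the simplest of the codes named there. It is a minimal example under stated assumptions: it works on bits rather than bases, the function names `hamming74_encode` and `hamming74_decode` are hypothetical, and it corrects only a single flipped bit per 7-bit block. It does not handle insertions or deletions.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] as 7 bits [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    # Each syndrome bit re-checks one parity group.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # Read as a binary number, the syndrome is the 1-based position of the
    # flipped bit; 0 means no error was detected.
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    original = [1, 0, 1, 1]
    stored = hamming74_encode(original)
    corrupted = list(stored)
    corrupted[4] ^= 1  # simulate a single read error
    print(hamming74_decode(corrupted) == original)
```

Three redundant bits per four data bits are enough here because the three parity checks together identify which of the seven positions, if any, was altered.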
In situ synthesis on DNA microarrays can be further divided, according to how the bases are allocated, into in situ photolithographic synthesis, photoresist-layer synthesis, photoacid-induced synthesis, inkjet printing synthesis, soft lithography synthesis, electro-acid-induced synthesis, and other methods derived from these technologies. However, current chemical methods are limited in the length of DNA sequences that can be synthesized (less than 200 nucleotides), and they also incur relatively high costs compared with biological methods. As a result, de novo DNA synthesis using biological methods based on enzymatic synthesis has emerged as a prominent research focus.

Different DNA storage methods

DNA storage can be categorized into in vivo storage and in vitro storage. In in vivo storage, the coded information is assembled and synthesized within living cells, using the cells' DNA assembly capabilities or genetic engineering techniques. Organisms such as E. coli are commonly employed for in vivo DNA storage. This approach offers several advantages, including lower data storage costs, high fidelity of DNA data replication within cells, and longer storage durations. In vitro storage preserves the DNA medium outside living organisms through methods such as dehydration, freeze-drying, and the use of additives or protective materials. This approach offers the advantage of a large storage capacity. Currently, several preservation methods are available, including liquid preservation, dry powder preservation, and embedding preservation.

「Summary and outlook」

Since the 20th century, DNA has been recognized as a potential data storage medium and has attracted significant attention owing to its excellent chemical stability and the precision with which it transmits data.
DNA possesses inherent advantages for data storage and is expected to supplant traditional silicon-based storage media. Continued advances in DNA synthesis, sequencing, and retrieval technology have laid the foundation for DNA as a data storage medium. However, further research is needed to refine data coding methods, improve error correction technology, and enable large-scale DNA storage.