Abstract: The imbalance of a binary word refers to the absolute difference between the number of ones and the number of zeros in it. A word is balanced if its imbalance is at most one and a code is balanced if all its codewords are balanced. In 1986, motivated by applications in optical disks, Knuth [1] proposed a simple and efficient scheme that transforms binary messages into balanced codes. In recent years, advances in synthesis and sequencing technologies have made DNA macromolecules an attractive medium for digital information storage. Besides being biochemically robust, DNA strands offer ultrahigh storage densities of 10^15 - 10^20 bytes per gram of DNA, as demonstrated in recent experiments (see [2, Table 1]). However, this non-traditional media presents new challenges to the coding community (see [3] for a survey). One particular challenge has rekindled interest in balanced codes. Specifically, a DNA string comprises four bases or letters: A, C, T and G, and a string is GC-rich (or GC-poor) if a high (or low) proportion of the bases corresponds to either G or C. Since GC-rich or GC-poor DNA strings are prone to both synthesis and sequencing errors, we aim to reduce the difference with the number of G and C and the number of A and T on every DNA codeword. This requirement is equivalent to reducing the imbalance of a related binary word. In this talk, we revisit Knuth's balancing method and look at techniques that adapt the method to address coding problems in DNA-based data storage.In the first part, we look at constructions of address sequences that are critical in DNA-based data storage with random-access capabilities. Specifically, Yazdi et al. [2] developed an architecture that allows selective access to encoded DNA strands through the process of hybridization. The technique involves prepending information-carrying DNA strands with specially chosen address sequences called primers. By combining Knuth's balancing method with cyclic codes, we provide efficient and explicit constructions of such primer sets [4].In the second part, in addition to the GC-balanced constraints, we look at codes that correct a single insertion or deletion or substitution and whose codewords obey the homopolymer runlength constraints. Besides the code constructions, we also propose linear-time encoders for our codebooks [5].In the third part, we apply Knuth's balancing method to a beautiful and important class of codes, the polar codes. Invented by Arıkan [6], polar codes achieve capacity for many channels with low encoding and decoding complexities. We study a generalization of Knuth's balancing method, specifically, a technique of Mazumdar, Roth, and Vontobel [7], and provide means of transforming messages into balanced polar codewords while retaining the low complexities of the polar encoding and decoding algorithms.
References:
[1] D. Knuth, "Efficient balanced codes," IEEE Transactions on Information Theory, vol. 32, no. 1, pp. 51–53, 1986.
[2] S. H. T. Yazdi, R. Gabrys, and O. Milenkovic, "Portable and error-free DNA-based data storage," Scientific reports, vol. 7, no. 1, p. 5011, 2017.
[3] S. Yazdi, H. M. Kiah, E. R. Garcia, J. Ma, H. Zhao, and O. Milenkovic, "DNA-based storage: Trends and methods," IEEE Trans. Molecular, Biological, Multi-Scale Communications, vol. 1, no. 3, pp. 230–248, 2015.
[4] Y. M. Chee, H. M. Kiah, and H. Wei, "Efficient and explicit balanced primer codes," IEEE Transactions on Information Theory, vol. 66, no. 9, pp. 5344--5357, doi:10.1109/TIT.2020.2977915, 2020.
[5] K. Cai, Y. M. Chee, R. Gabrys, H. M. Kiah, T. T. Nguyen, "Optimal Codes Correcting a Single Indel / Edit for DNA-Based Data Storage", preprint arXiv:1910.06501 (see also doi:10.1109/ISIT.2019.8849643, doi:10.1109/ICASSP40776.2020.9053256), 2019.
[6] E. Arıkan. "Channel polarization: A method for constructing capacity achieving codes for symmetric binary-input memoryless channels", IEEE Transactions on Information Theory, vol. 55, no. 7, pp. 3051–3073, 2009.
[7] A. Mazumdar, R. M. Roth, and P. O. Vontobel, "On linear balancing sets," Advances in Mathematics of Communications, vol. 4, no. 3,pp. 345–361, 2010.
[8] U. Gupta, H. M. Kiah, A. Vardy, H. Yao, “Polar Codes with Balanced Codewords,” in 2020 IEEE International Symposium on Information Theory (ISIT), doi:10.1109/ISIT44484.2020.9174042, 2020.
Bio: Han Mao Kiah received his Ph.D. degree in mathematics from Nanyang Technological University (NTU), Singapore, in 2014. From 2014 to 2015, he was a Postdoctoral Research Associate at the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign. From 2015 to 2018, he was a Lecturer at the School of Physical and Mathematical Sciences (SPMS), NTU, Singapore. Currently he is an Assistant Professor at SPMS, NTU, Singapore. His research interests include DNA-based data storage, coding theory, enumerative combinatorics and combinatorial design theory. |