Skip to main content
Top

Open Access 10-05-2025

On the coding capacity of reverse-complement and palindromic duplication-correcting codes

Authors: Lev Yohananov, Moshe Schwartz

Published in: Designs, Codes and Cryptography

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The article explores the intricate world of DNA storage, focusing on the coding capacity of reverse-complement and palindromic duplication-correcting codes. It begins by highlighting the potential of DNA as a storage medium, offering unprecedented density compared to traditional electronic storage. The text then delves into the unique challenges posed by duplication errors in DNA sequences, which can occur during faulty replication. It introduces various types of duplication errors, including tandem, palindromic, and reverse-complement duplications, and discusses their implications for data integrity. The article presents a thorough analysis of the coding capacity for these duplication types, providing a complete solution for all duplication lengths and alphabet sizes. It constructs optimal codes for specific cases and demonstrates the vanishing coding capacity for others, contradicting previous flawed code constructions. The study also touches on the open questions and future directions in the field, inviting further research into the trade-offs between coding capacity and the number of correctable errors. The detailed examples and rigorous mathematical proofs make this article a valuable resource for anyone seeking to understand the complexities of DNA storage and error correction.
Notes
Communicated by M. Buratti.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Storing information in DNA molecules promises a medium that is six orders of magnitude denser than the densest electronic storage medium of today. The feasibility of this technology was first demonstrated in [3, 8] (for in-vitro DNA storage), and then for storage in the DNA of living organisms, in [16] (for in-vivo DNA storage). Apart from plain storage, the latter also allows watermarking of genetically-modified organisms (GMOs), labeling organisms for biological studies, and running synthetic-biology algorithms.
As with any storage or communication system, the transmitted information is corrupted by the channel and received with errors. In the case of DNA storage, some of the observed error mechanisms have already been studied in the literature in other contexts, mainly electronic communications. Among these we can see substitution errors, insertions and deletions. However, in-vivo DNA storage exhibits a new kind of errors—duplication errors—which may be caused during faulty DNA replication.
A DNA strand is a long string over the alphabet \(\{\textsf{A},\textsf{C},\textsf{G},\textsf{T}\}\), standing for the four chemical bases: adenine, cytosine, guanine, and thymine. The bases are partitioned into complement pairs, where \(\textsf{A}\) and \(\textsf{T}\) are complement to each other, as are \(\textsf{C}\) and \(\textsf{G}\). In a nutshell, a duplication error takes a substring of the DNA string, and inserts a copy of it (perhaps altered) back into the string. Possible alterations vary, depending on the biological process creating them. Common examples we shall be referring to throughout the paper, include tandem duplication, where the copy is unaltered and placed next to its original location, e.g.,
$$\begin{aligned} \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\textsf{C}\textsf{C}\textsf{C}\textsf{T}\Longrightarrow \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\underline{\textsf{G}\textsf{G}\textsf{A}\textsf{T}}\textsf{C}\textsf{C}\textsf{C}\textsf{T}, \end{aligned}$$
where a duplication of length 4 occurred, inserting a duplicate copy of \(\textsf{G}\textsf{G}\textsf{A}\textsf{T}\) (underlined on the right) immediately following its original location. Another type of duplication is palindromic duplication (sometimes called reverse duplication), which alters the inserted copy by reversing it, e.g.,
$$\begin{aligned} \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\textsf{C}\textsf{C}\textsf{C}\textsf{T}\Longrightarrow \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\underline{\textsf{T}\textsf{A}\textsf{G}\textsf{G}}\textsf{C}\textsf{C}\textsf{C}\textsf{T}. \end{aligned}$$
Finally, we mention reverse-complement duplication, in which the altered copy is both reversed and complemented, e.g.,
$$\begin{aligned} \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\textsf{C}\textsf{C}\textsf{C}\textsf{T}\Longrightarrow \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\underline{\textsf{A}\textsf{T}\textsf{C}\textsf{C}}\textsf{C}\textsf{C}\textsf{C}\textsf{T}. \end{aligned}$$
The string-duplication channel was introduced in [6], and the first duplication-correcting codes were developed in [11]. These generalize the setting to allow for arbitrary alphabets, parametric duplication length, and various duplication types. Codes correcting any number of tandem duplications were studied in [11, 2123] and any number of palindromic duplications in [23]. Codes that correct only a fixed number of tandem duplications (sometimes just one) were studied in [9, 12, 13], palindromic duplications in [14], and reverse-complement duplications in [1]. Additionally, codes that correct mixtures of duplications, substitutions, as well as insertions and deletions, have been the focus of [1719]. We also mention related work on duplication errors, though not providing error-correcting codes, but rather studying the possible outcomes from multiple errors [1, 47, 10].
In this paper we focus on the asymptotic rate of optimal codes (the coding capacity) that are capable of correcting any number of duplication errors. The coding capacity for tandem-duplication errors is already known [11]. Thus, we study reverse-complement duplications, and palindromic duplications. We provide a complete solution, for all duplication lengths k, and all possible alphabet sizes q. We show that, except for the case of duplication length \(k=1\), the coding capacity is 0. When \(k=1\), the coding capacity depends on the alphabet size q, and in this case we construct optimal codes. A summary of the coding-capacity results is provided in Table 1. In addition, the vanishing coding capacity for the case of palindromic duplication of length \(k\geqslant 2\), contradict the code constructions presented in [23] that have positive asymptotic rate. We provide counter examples to these constructions that show the constructions are unfortunately flawed.
The paper is organized as follows. In Sect. 2 we give all the necessary notation and definitions. In Sect. 3 we study reverse-complement duplication with duplication length \(k=1\), whereas in Sect. 4 we focus on longer duplication lengths. Section 5 provides the results for palindromic duplication. We conclude in Sect. 6 with a short summary of the results and some open questions.
Table 1
A summary of the coding-capacity results, where k denotes the duplication length, q denotes the alphabet size, and \(R_q(*)_k\) denotes the coding capacity
Type
k
q
\(R_q(*)_k\)
Location
Reverse complement
\(=1\)
\(=2\)
0
Corollary 3
  
\(\geqslant 4\)
\(\log _q(q-2)\)
Corollary 7
 
\(\geqslant 2\)
\(\geqslant 2\)
0
Corollary 13
Palindromic
\(=1\)
\(\geqslant 2\)
\(\log _q(q-1)\)
[11], equivalent to tandem duplication
 
\(\geqslant 2\)
\(\geqslant 2\)
0
Corollary 16
Note that for reverse-complement duplication, q must be even

2 Preliminaries

Throughout this paper, we shall be considering strings over a finite alphabet. Without loss of generality, we shall assume that alphabet is \(\mathbb {Z}_q\), the ring of integers modulo q, where \(q\in \mathbb {N}\), \(q\geqslant 2\). A string is a sequence of letters from the alphabet \(\mathbb {Z}_q\). The set of strings of length \(n\in \mathbb {N}\) is denoted by \(\mathbb {Z}_q^n\). Assume \(x\in \mathbb {Z}_q^n\) is such a string, then the letters of x are denoted by adding a subscript i, where indices start from 0. Thus, \(x=x_0 x_1 \dots x_{n-1}\), with \(x_i\in \mathbb {Z}_q\) for all i. We denote the length of x by \(|x|=n\). If \(y\in \mathbb {Z}_q^m\) is also a string, we denote by xy the concatenation of x and y. If \(\ell \geqslant 0\) is an integer, we denote by \(x^\ell \) the concatenation of \(\ell \) copies of x. Since subscripts are used to denote individual letters in a string, and superscripts denote concatenation of copies of a string, when we require several strings with the same name, we shall use superscripts in parentheses to distinguish between them, e.g., \(x^{(0)}, x^{(1)}, x^{(2)},\dots \).
The set of all finite-length strings over \(\mathbb {Z}_q\) is denoted by \(\mathbb {Z}_q^*\). We use \(\varepsilon \) to write the unique empty string of length 0. Obviously \(\varepsilon \in \mathbb {Z}_q^*\). If we are interested only in strings with positive length we write \(\mathbb {Z}_q^+\triangleq \mathbb {Z}_q^*\setminus \{\varepsilon \}\).
Assume \(x=uvw\), with \(u,v,w\in \mathbb {Z}_q^*\), \(|v|=k\) for some \(k\in \mathbb {N}\). Then we say that v is a k-factor of x. If \(u=\varepsilon \), then v is said to be a k-prefix of x, and if \(w=\varepsilon \) then v is said to be a k-suffix of x. If the length of x is not of importance, we just say it is a factor, or prefix, or suffix of x.
The notion of complement will be important. We assume the letters of the alphabet \(\mathbb {Z}_q\) may be partitioned into pairs of complements. If \(a\in \mathbb {Z}_q\) is a letter, we denote its complement by \(\overline{a}\), where necessarily \(\overline{a}\ne a\). Additionally, \(\overline{\overline{a}}=a\). It follows that if a complement is defined, then q is even. We extend the complement notation in the natural way to strings, namely, if \(x=x_0 x_1 \dots x_{n-1}\in \mathbb {Z}_q^n\), then \(\overline{x}=\overline{x_0} \,\overline{x_1} \dots \overline{x_{n-1}}\).
Another important notation we introduce is the reverse of a string. If \(x=x_0 x_1 \dots x_{n-1}\in \mathbb {Z}_q^n\), then \(x^R \triangleq x_{n-1} \dots x_1 x_0\). Obviously, the complement and reverse operations commute, and so \(\overline{x^R} = \overline{x}^R\).
The two major concepts we shall study in this paper are string-duplication channels, and codes that correct duplication errors introduced by the channel. We shall start by describing the former. We shall generally follow the notation set by [11].
In a string-duplication channel, strings from \(\mathbb {Z}_q^*\) are transmitted. However, the channel introduces errors by applying a (finite) number of duplication rules. Generally speaking, a duplication rule is just a function \(T:\mathbb {Z}_q^*\rightarrow \mathbb {Z}_q^*\). The set of all duplication rules the channel may apply is denoted by \(\mathcal {T}\subseteq \mathbb {Z}_q^*{\,}^{\mathbb {Z}_q^*}\). The choice of rules to contain in this set is motivated by the biological processes present in our in-vivo DNA storage system. We shall focus on two such sets which we define later (for more examples, see [11]).
Assume \(u,v\in \mathbb {Z}_q^*\) are two strings. We say v is a descendant of u if there exist \(t\geqslant 0\) and \(T_1,T_2,\dots ,T_t\in \mathcal {T}\) such that
$$\begin{aligned} v= T_t(T_{t-1}( \dots T_1(u)\dots )). \end{aligned}$$
If the exact sequence of rules is immaterial, we just write
$$\begin{aligned} u\Longrightarrow ^t v, \end{aligned}$$
to signify this relation between u and v. If \(t=1\) we may omit it and write \(u\Longrightarrow v\). If t is unknown or irrelevant, we simply write \(u\Longrightarrow ^*v\). For convenience, we always have \(u\Longrightarrow ^0 u\). The set of all descendants of u is called the descendant cone of u, and is denoted by
$$\begin{aligned} D^*(u) \triangleq \big \{ v\in \mathbb {Z}_q^* ~:~ u\Longrightarrow ^*v \big \}. \end{aligned}$$
Thus, \(D^*(u)\) is the reflexive transitive closure of \(\mathcal {T}\) acting on u.
Motivated by biological processes, in this paper we focus on two sets of duplication rules: the reverse-complement and the palindromic. For the first, we define reverse-complement duplication rules of the following form:
$$\begin{aligned} T^{\textrm{rc}}_{i,k}(x)\triangleq uv \overline{v}^R w \qquad \text {if x=uvw, |u|=i, |v|=k.} \end{aligned}$$
Thus, \(T^{\textrm{rc}}_{i,k}\) takes a k-factor v which follows an i-prefix u, and inserts a reverse-complement copy of v following its original position. Note that the rule is not defined on strings of length shorter than \(i+k\). We then define the set of reverse-complement duplication rules as
$$\begin{aligned} \mathcal {T}^{\textrm{rc}}_k \triangleq \big \{T^{\textrm{rc}}_{i,k} ~:~ i\geqslant 0\big \}. \end{aligned}$$
We denote derivation by \(\Longrightarrow _k\), and the descendant cone by \(D_k\), where the purpose of the subscript is solely to emphasize the fixed length of the duplication k. We may omit this subscript if it is clear from context.
Example 1
Assume \(q=4\), \(k=2\), and in \(\mathbb {Z}_q\) we have \(\overline{0}=3\) and \(\overline{1}=2\). In the reverse-complement duplication setting,
$$\begin{aligned} 01103203 \Longrightarrow _2 011032\underline{10}03 \Longrightarrow _2 011\underline{22}0321003\Longrightarrow _2 0112\underline{12}20321003, \end{aligned}$$
where the underlined factor is the inserted reverse-complement duplication. Thus, we may write
$$\begin{aligned} 01103203 \Longrightarrow ^3_2 01121220321003, \end{aligned}$$
as the right-hand side may be derived from the left-hand side by a sequence of 3 reverse-complement duplications of length 2 each.
Along the same lines, we define the palindromic-duplication rules of the following form:
$$\begin{aligned} T^{\textrm{pal}}_{i,k}(x)\triangleq uv v^R w \qquad \text {if x=uvw, |u|=i, |v|=k,} \end{aligned}$$
and the set of all palindromic duplication rules as
$$\begin{aligned} \mathcal {T}^{\textrm{pal}}_k \triangleq \big \{T^{\textrm{pal}}_{i,k} ~:~ i\geqslant 0\big \}. \end{aligned}$$
The only difference between the palindromic and the reverse-complement duplication rules is that the latter also complement the duplicated symbols. We once again use \(\Longrightarrow _k\) for derivation and \(D_k\) for the descendant cone. To avoid cumbersome notation we infer from the context whether we use palindromic or reverse-complement duplication rules. In particular, Sect. 3 and Sect. 4 use reverse-complement duplication rules, whereas only Sect. 5 uses palindromic duplication rules.
Example 2
Assume \(q=4\) and \(k=2\). In the palindromic duplication setting,
$$\begin{aligned} 01103203 \Longrightarrow _2 011032\underline{23}03, \end{aligned}$$
where the underlined factor is the inserted palindromic duplication.
We now reach the second major concept—error-correcting codes.
Definition 1
An (nMt) duplication-correcting code, with respect to a set of duplication rules \(\mathcal {T}\), is a set \(C\subseteq \mathbb {Z}_q^n\), of size \(|C|=M\), such that for any two distinct codewords, \(c,c'\in C\),
$$\begin{aligned} D^t(c)\cap D^t(c') = \emptyset . \end{aligned}$$
Here, t denotes the number of duplication errors the code can correct, where \(t\in \mathbb {N}\cup \{*\}\), and where \(t=*\) denotes any number of errors.
An important figure of merit associated with a code’s efficiency, is its rate. If C is an (nMt) duplication-correcting code over \(\mathbb {Z}_q\), then its rate is defined as
$$\begin{aligned} R(C) \triangleq \frac{1}{n}\log _q M. \end{aligned}$$
The largest size of such a code will be denoted by \(A_q(n;t)\), and codes attaining this upper bound will be called optimal. Finally, the coding capacity is defined as
$$\begin{aligned} R_q(t) \triangleq \limsup _{n\rightarrow \infty } \frac{1}{n}\log _q A_q(n;t). \end{aligned}$$
In the remainder of the paper, we shall be focusing on the reverse-complement duplication channel and the palindromic duplication channel, both with duplication length k, over \(\mathbb {Z}_q\), and with an unbounded number of errors, i.e., \(t=*\). We shall therefore try to determine or bound \(A_q(n;*)^{\textrm{rc}}_k\) and \(R_q(*)^{\textrm{rc}}_k\) for the reverse-complement duplication channel, and \(A_q(n;*)^{\textrm{pal}}_k\) and \(R_q(*)^{\textrm{pal}}_k\) for the palindromic duplication channel.

3 Reverse-complement duplication of length \(k=1\)

In this section we shall completely determine \(A_q(n;*)_1^{\textrm{rc}}\), the maximal size of a q-ary code of length n that is capable of correcting any number of reverse-complement duplications of length 1. Thus, we shall also find the coding capacity \(R_q(*)_1^{\textrm{rc}}\). Our method is constructive, and we will present codes of optimal size. The results depend on the alphabet size, and in particular, we shall divide our discussion to the case of binary alphabet, \(q=2\), and the case of larger alphabets of even size, \(q\geqslant 4\).

3.1 The binary case, \(q=2\)

We determine an exact condition for two binary strings to have a common descendant. As we shall show, this depends only on the first bit of the string. We start with a simple technical lemma.
Lemma 1
Let \(a\in \mathbb {Z}_2\) be a bit. Then
$$\begin{aligned} a \Longrightarrow ^*_1 a\overline{a}w \end{aligned}$$
for any \(w\in \mathbb {Z}_2^*\).
Proof
We can write
$$\begin{aligned} a\overline{a}w = a \overline{a}^{m_1} a^{m_2} \overline{a}^{m_3} a^{m_4} \dots b^{m_\ell }, \end{aligned}$$
where b is either a or \(\overline{a}\), and where \(m_1\geqslant 1\) for all i. We then start with a, and duplicate that a \(m_1\) times,
$$\begin{aligned} a \Longrightarrow ^{m_1}_1 a \overline{a}^{m_1}. \end{aligned}$$
We then duplicate the last copy of \(\overline{a}\) a total of \(m_2\) times, so
$$\begin{aligned} a \Longrightarrow ^{m_1}_1 a \overline{a}^{m_1} \Longrightarrow ^{m_2}_1 a \overline{a}^{m_1} a^{m_2}. \end{aligned}$$
Repeating this process of duplicating the last bit to create a new run of a or \(\overline{a}\) results in the desired derivation
$$\begin{aligned} a \Longrightarrow ^*_1 a\overline{a}w. \end{aligned}$$
\(\square \)
We can now give our main characterization of strings with a common descendant.
Theorem 2
Let \(x,y\in \mathbb {Z}_2^+\). Then
$$\begin{aligned} D_1^*(x)\cap D_1^*(y)\ne \emptyset \end{aligned}$$
if and only if \(x_0=y_0\).
Proof
For the first direction, assume \(x_0\ne y_0\). Since no duplication can change the first bit of the string, it follows that all the strings in \(D_1^*(x)\) start with \(x_0\), whereas all those in \(D_1^*(y)\) start with \(y_0\). Hence, \(D_1^*(x)\cap D_1^*(y)=\emptyset \).
For the second direction, assume \(x_0=y_0=a\). Let us therefore write
$$\begin{aligned} x&= a x',&y&= a y'. \end{aligned}$$
We distinguish between two cases. For the first case, assume \(y'=\varepsilon \), namely, \(y=a\). In that case we have,
$$\begin{aligned} x=a x' \Longrightarrow _1 a \overline{a}x', \end{aligned}$$
and by Lemma 1 we also have
$$\begin{aligned} y = a \Longrightarrow _1^* a \overline{a}x'. \end{aligned}$$
Hence, \(D_1^*(x)\cap D_1^*(y)\ne \emptyset \).
For the second case, assume \(y'\ne \varepsilon \), and write \(y'=y''b\), with \(b\in \mathbb {Z}_2\). We now observe the derivations
$$\begin{aligned} x&=ax' \Longrightarrow _1^* a \overline{a}y'' b x' \Longrightarrow _1 a \overline{a}y'' b \overline{b}x', \\ y&=ay''b \Longrightarrow _1 a \overline{a}y'' b \Longrightarrow _1^* a \overline{a}y'' b \overline{b}x', \end{aligned}$$
where both of the \(\Longrightarrow _1^*\) derivations follow from Lemma 1. \(\square \)
We thus immediately deduce the following:
[Style2 Style1]Corollary 3
For all \(n\in \mathbb {N}\)
$$\begin{aligned} A_2(n;*)_1^{\textrm{rc}} = 2, \end{aligned}$$
and therefore,
$$\begin{aligned} R_2(*)_1^{\textrm{rc}} = 0. \end{aligned}$$
Proof
Since in a binary \((n,M;*)_1^{\textrm{rc}}\) code we need to have pair-wise disjoint descendant cones for distinct codewords, by Theorem 2, we can have at most two codewords: one starting with a 0, and one starting with a 1. This is also easily achievable, e.g., the repetition code \(C=\{0^n,1^n\}\). Hence, the maximal size of a binary code in this case is \(A_2(n;*)_1^{\textrm{rc}}=2\). The coding capacity follows by definition. \(\square \)

3.2 The non-binary case, \(q\geqslant 4\)

In this section we assume a general non-binary alphabet, namely, \(\mathbb {Z}_q\), with \(q\geqslant 4\) even. We shall, once again, characterize the exact condition for two strings to have a joint descendant, thereby allowing us to find optimal codes as well as the coding capacity.
We introduce some new notation. Let \(a\in \mathbb {Z}_q\) be a letter. We define
$$\begin{aligned} a^\oplus \triangleq a\big \{a,\overline{a}\big \}^*, \end{aligned}$$
namely, \(a^\oplus \) denotes the set of all strings starting with the letter a, and followed by any finite length string composed of a and \(\overline{a}\) only.
If \(x\in \mathbb {Z}_q^+\) is a non-empty string, we can always decompose it into
$$\begin{aligned} x \in a_0^\oplus a_1^\oplus \dots a_{\ell -1}^\oplus , \end{aligned}$$
(1)
where \(a_i\in \mathbb {Z}_q\) for all i. We say this is a minimal decomposition if \(\ell \) is the smallest possible, thus necessarily, \(a_{i+1}\not \in \{a_i,\overline{a_i}\}\) for all i. If (1) is a minimal decomposition of x, we define the signature of x to be
$$\begin{aligned} \sigma (x) \triangleq a_0 a_1 \dots a_{\ell -1}. \end{aligned}$$
Example 3
Assume \(q=4\), \(k=2\), and in \(\mathbb {Z}_q\) we have \(\overline{0}=3\) and \(\overline{1}=2\). Then
$$\begin{aligned} \sigma (0110300203) = 01020. \end{aligned}$$
This is because the minimal decomposition of 0110300203 is given by 0|11|0300|2|03, where the vertical lines show the relevant factors in the decomposition. Notice how
$$\begin{aligned} 0\in 0^\oplus , \quad 11\in 1^\oplus , \qquad 0300\in 0^\oplus , \qquad 2\in 2^\oplus , \qquad 03\in 0^\oplus . \end{aligned}$$
We first observe that all strings in a descendant cone share the same signature.
Lemma 4
Let \(x\in \mathbb {Z}_q^+\) be a string. Then for all \(x'\in D^*_1(x)\),
$$\begin{aligned} \sigma (x)=\sigma (x'). \end{aligned}$$
Proof
Let \(x\in \mathbb {Z}_q^+\) be any string, and let its signature be \(\sigma (x)=a_0 a_1 \dots a_{\ell -1}\), with \(a_i\in \mathbb {Z}_q\) for all i. Furthermore, assume its minimal decomposition is
$$\begin{aligned} x = x^{(0)} x^{(1)} \dots x^{(\ell -1)}, \end{aligned}$$
where \(x^{(i)}\in a_i^\oplus \) for all i.
Now let \(x''\) be obtained from x using a single reverse-complement duplication of length 1, namely, \(x\Longrightarrow _1 x''\). If the letter being duplicated resides in \(x^{(i)}\), then the effect of the duplication is merely extending \(x^{(i)}\) by one letter, without changing its first one. Thus,
$$\begin{aligned} \sigma (x'')=\sigma (x). \end{aligned}$$
By a simple induction this may be extended to any number of derivations, and so for any \(x'\in D_1^*(x)\) we have \(\sigma (x)=\sigma (x')\). \(\square \)
The signature can also determine whether two strings have a common descendant, as the following theorem shows.
Theorem 5
Let \(x,y\in \mathbb {Z}_q^+\). Then
$$\begin{aligned} D_1^*(x)\cap D_1^*(y)\ne \emptyset \end{aligned}$$
if and only if \(\sigma (x)=\sigma (y)\).
Proof
For the first direction, assume \(\sigma (x)=\sigma (y)=a_0 a_1 \dots a_{\ell -1}\). That means we can write
$$\begin{aligned} x&= x^{(0)} x^{(1)} \dots x^{(\ell -1)},\\ y&= y^{(0)} y^{(1)} \dots y^{(\ell -1)}, \end{aligned}$$
where \(x^{(i)},y^{(i)}\in a_i^\oplus \), for all i. We also note that \(x^{(i)}\) and \(y^{(i)}\) begin with the same letter, \(a_i\), and contain only letters from \(\{a_i,\overline{a_i}\}\). Hence, by Theorem 2, for each i, there exists \(z^{(i)}\in \mathbb {Z}_q^+\) such that
$$\begin{aligned} z^{(i)}\in D_1^* (x^{(i)})\cap D_1^* (y^{(i)}). \end{aligned}$$
Namely,
$$\begin{aligned} x^{(i)} \Longrightarrow _1^* z^{(i)}, \qquad \text {and}\qquad y^{(i)} \Longrightarrow _1^* z^{(i)}. \end{aligned}$$
It then follows that
$$\begin{aligned} x&= x^{(0)} x^{(1)} \dots x^{(\ell -1)} \Longrightarrow _1^* z^{(0)} z^{(1)} \dots z^{(\ell -1)},\\ y&= y^{(0)} y^{(1)} \dots y^{(\ell -1)} \Longrightarrow _1^* z^{(0)} z^{(1)} \dots z^{(\ell -1)},\\ \end{aligned}$$
and so
$$\begin{aligned} D_1^*(x)\cap D_1^*(y)\ne \emptyset . \end{aligned}$$
For the other direction, assume \(D_1^*(x)\cap D_1^*(y)\ne \emptyset \), and let \(z\in D_1^*(x)\cap D_1^*(y)\) be a common descendant of x and y. By Lemma 4,
$$\begin{aligned} \sigma (x) = \sigma (z) = \sigma (y), \end{aligned}$$
which completes the proof. \(\square \)
We note that Theorem 5 is a generalization of Theorem 2, since in the binary case, the signature of any non-empty string is just its first letter.
By knowing when descendant cones intersect, we can now construct q-ary \((n,M;*)^{\textrm{rc}}_1\) codes.
Construction 1
Let \(n\in \mathbb {N}\). We construct the following code:
$$\begin{aligned} C = \big \{a_0 a_1\dots a_{\ell -1}a_{\ell -1}^{n-\ell } ~:~ 1\leqslant \ell \leqslant n, \forall i, a_i\in \mathbb {Z}_q, a_{i+1}\notin \big \{a_i,\overline{a_i}\big \}\big \}. \end{aligned}$$
Theorem 6
The code C from Construction 1 is an optimal \((n, M; *)^{\textrm{rc}}_{1}\) code, with
$$\begin{aligned} M=q\cdot \frac{(q-2)^n-1}{q-3}. \end{aligned}$$
Proof
All the codewords of C are of length n. Consider a codeword \(c\in C\), with
$$\begin{aligned} c=a_0 a_1\dots a_{\ell -1}a_{\ell -1}^{n-\ell }, \end{aligned}$$
where for all i we have \(a_i\in \mathbb {Z}_q\) and \(a_{i+1}\not \in \{a_i,\overline{a_i}\}\). By definition, the signature of c is
$$\begin{aligned} \sigma (c) = a_0 a_1 \dots a_{\ell -1}. \end{aligned}$$
It follows that distinct codewords of C have distinct signatures, and so by Theorem 5, they have disjoint descendant cones. Additionally, each possible signature appears in C, and so it is optimal. It remains to compute its size, M. Assuming we want a signature of length \(\ell \), we have q options for the first letters, and each subsequent letter has \(q-2\) options. Thus,
$$\begin{aligned} M=\sum _{\ell =1}^{n} q(q-2)^{\ell -1}=q\cdot \frac{(q-2)^n-1}{q-3}. \end{aligned}$$
\(\square \)
We comment that decoding the code from Construction 1 is straightforward. Assume \(z\in \mathbb {Z}_q\) was received. We then compute its signature, which we denote by \(\sigma (z)=a_0 a_1 \dots a_{\ell -1}\), with \(a_i\in \mathbb {Z}_q\) for all i. Then the decoding function returns \(a_0 a_1 \dots a_{\ell -1} a_{\ell -1}^{n-\ell }\in C\). The entire process runs in \(O(|z|)\) time, i.e., linear in the length of the received string.
Finally, we obtain the following:
[Style2 Style1]Corollary 7
For all \(n\in \mathbb {N}\) and even \(q\in \mathbb {N}\), \(q\geqslant 4\),
$$\begin{aligned} A_q(n;*)^{\textrm{rc}}_1 = q\cdot \frac{(q-2)^n-1}{q-3}, \end{aligned}$$
and therefore
$$\begin{aligned} R_q(*)^{\textrm{rc}}_1 = \log _q (q-2). \end{aligned}$$
Proof
The claims follow directly from the size of the optimal code proved in Theorem 6. \(\square \)

4 Reverse-complement duplication of length \(k\geqslant 2\)

We shall now study codes that correct any number of reverse-complement duplication errors of length \(k\geqslant 2\). We will show that the number of codewords in such a code is upper bounded by a constant, therefore implying a vanishing coding capacity.
Our proof strategy consists of identifying a property of strings that ensures the existence of a common descendant. We call this the k-summary of the string. Since only a constant number of summaries are possible, the main claim will follow.
Let \(w\in \mathbb {Z}_q^*\) be a string. We say that \(a\in \mathbb {Z}_q\) is the i-th letter from the end in w if we can write
$$\begin{aligned} w = w' a w'' \end{aligned}$$
with \(w',w''\in \mathbb {Z}_q^*\) and \(|w''|=i\). In particular, the 0-th letter from the end of w is the last letter of w. We also require the following technical tool from [1]: (whose proof we bring for completeness only)
Lemma 8
( [1, Lemma 2]) Let \(k\geqslant 2\), \(x \in \mathbb {Z}_q^*, |x|\geqslant k + 1\), and let \(i \geqslant 2\) be an integer. If the i-th letter from the end of x is a, then there exists \(x'\in D^2_k(x)\), such that a is the \((i-2)\)-nd letter from the end of \(x'\).
Proof
We can write \(x=uvbcw\), where \(u,w\in \mathbb {Z}_q^*\), \(v\in \mathbb {Z}_q^{k-1}\), \(b,c\in \mathbb {Z}_q\), \(|w|\leqslant i-2\), and v contains the letter a that is ith from the end of s. We consider the following sequence of derivations:
$$\begin{aligned} x=uvbcw \Longrightarrow _k uvb\underline{\overline{b}\overline{v}^R} cw \Longrightarrow _k uvb\overline{b}\overline{v}^R c\underline{\overline{c}v}w=x',\end{aligned}$$
where the underline shows the duplicated parts. Observe that the letter a in v that was ith from the end in x, is \((i-2)\)nd from the end of \(x'\). \(\square \)
An important concept of subsequences is now defined.
Definition 2
Let \(k\geqslant 2\) be an integer, \(x\in \mathbb {Z}_q^k\) be a string of length k, and let \(y\in \mathbb {Z}_q^*\) be any string. We say x is properly spaced in y if both of the following hold:
1.
x is a subsequence of y, namely, there exist \(y^{(0)}, y^{(1)}, y^{(2)},\dots ,y^{(k)}\in \mathbb {Z}_q^*\) such that
$$\begin{aligned} y = y^{(0)} x_0 y^{(1)} x_1 y^{(2)} \dots y^{(k-1)} x_{k-1} y^{(k)}. \end{aligned}$$
 
2.
If k is even, then
$$\begin{aligned} \left|y^{(1)}\right| \equiv \left|y^{(2)}\right| \equiv \dots \equiv \left|y^{(k)}\right| \equiv 0 \pmod {2}. \end{aligned}$$
 
Example 4
Consider the string \(y=0110300203\). The string 323 is properly spaced in y, as shown by the underlined locations \(0110\underline{3}00\underline{2}0\underline{3}\) (note that \(k=|323|=3\) is odd in this case). The string 32 is also properly spaced in y, as we see in \(0110\underline{3}00\underline{2}03\) (this time \(k=|32|=2\) is even). However, 23 is not properly spaced in y, since the length of the factor between 2 and 3 in y is odd, and \(k=|23|=2\) is even.
We observe that the property of being properly-spaced persists in all descendants.
Lemma 9
Let \(k\geqslant 2\), \(x\in \mathbb {Z}_q^k\), and \(y\in \mathbb {Z}_q^*\), such that x is properly spaced in y. If \(y'\in D^*_k(y)\), then x is also properly spaced in \(y'\).
Proof
Since \(y'\) is obtained from y by insertions only, the first requirement of Definition 2 certainly holds for x in \(y'\). Additionally, if k is even, since duplications insert k-factors, it follows that the parities of all \(y^{(i)}\) from Definition 2 remain unchanged. \(\square \)
When a string is properly spaced, we can extract it as a suffix in one of the descendants, as the following lemma shows.
Lemma 10
Let \(k\geqslant 2\), \(x\in \mathbb {Z}_q^k\), and \(y\in \mathbb {Z}_q^*\), such that x is properly spaced in y. Then there exists \(v\in \mathbb {Z}_q^*\) such that
$$\begin{aligned} y \Longrightarrow ^*_k vx. \end{aligned}$$
Proof
By definition, we necessarily have \(|y|\geqslant k\). We consider two cases, depending on the parity of k.
If k is even, the last letter of x, namely, \(x_{k-1}\), is at an even distance from the end. If it is at the end, we duplicate the k-suffix twice (this does not change the last letter, but ensures at least one duplication was employed). Otherwise, we can use Lemma 8 repeatedly to get \(x_{k-1}\) to the end, with each application the distance of \(x_{k-1}\) to the end diminishes by 2. Thus, we have
$$\begin{aligned} y \Longrightarrow ^*_k v^{(k-1)} x_{k-1}, \end{aligned}$$
for some \(v^{(k-1)}\in \mathbb {Z}_q^*\). Since at least one duplication was used, we have
$$\begin{aligned} \left|v^{(k-1)}x_{k-1}\right| \geqslant |y|+k \geqslant 2k\geqslant k+2. \end{aligned}$$
By Lemma 9, x is also properly spaced in \(v^{(k-1)}x_{k-1}\). That means \(x_{k-2}\) is at an even distance from the end of \(v^{(k-1)}\). We ignore the last letter, \(x_{k-1}\), and repeat the process above to push the letter \(x_{k-2}\):
$$\begin{aligned} v^{(k-1)}x_{k-1} \Longrightarrow ^*_k v^{(k-2)}x_{k-2}x_{k-1}. \end{aligned}$$
Continuing in this manner for all the letters of x we finally obtain
$$\begin{aligned} y \Longrightarrow ^*_k vx \end{aligned}$$
for some \(v\in \mathbb {Z}_q^*\).
If k is odd, we repeat the same proof as in the case of k even. The only difference is that we are not guaranteed that the letter we are trying to push, \(x_i\), is at an even distance from the end. If that is not the case, and the distance is odd, we simply perform a duplication on the suffix, thus making the distance even. \(\square \)
Another technical helpful tool allows us to synchronize derivations in certain cases, and produce a common descendant.
Lemma 11
Let \(k\geqslant 2\), \(x\in \mathbb {Z}_q^k\), and \(y\in \mathbb {Z}_q^*\), such that x is properly spaced in y. Then there exists \(v\in \mathbb {Z}_q^*\) such that
$$\begin{aligned} y&\Longrightarrow ^*_k vx,&yx \Longrightarrow ^*_k vx. \end{aligned}$$
Proof
By Lemma 10, there exists \(u\in \mathbb {Z}_q^*\) such that
$$\begin{aligned} y \Longrightarrow ^*_k ux. \end{aligned}$$
(2)
Continuing this derivation we get
$$\begin{aligned} y \Longrightarrow ^*_k ux \Longrightarrow ^2_k ux\overline{x}^R x = vx, \end{aligned}$$
where we define \(v\triangleq ux\overline{x}^R\).
Now, starting with the same derivation as in (2), we get
$$\begin{aligned} yx \Longrightarrow ^*_k uxx \Longrightarrow _k ux\overline{x}^R x = vx, \end{aligned}$$
which completes the proof. \(\square \)
The following definition provides the property used by our main theorem in this section.
Definition 3
Let \(k\in \mathbb {N}\), \(k\geqslant 2\), let \(\ell \geqslant 0\) be an integer, and let \(y\in \mathbb {Z}_q^{\ell k}\) be a string. Partition y into k-factors,
$$\begin{aligned} y = y^{(0)} y^{(1)} \dots y^{(\ell -1)}, \end{aligned}$$
with \(y^{(i)}\in \mathbb {Z}_q^k\) for all i. Define
$$\begin{aligned} \phi (y^{(i)}) \triangleq {\left\{ \begin{array}{ll} \varepsilon & \text {if there exists } i'<i \text { such that } y^{(i')}=y^{(i)},\\ y^{(i)} & \text {otherwise.} \end{array}\right. } \end{aligned}$$
Then the summary of y, denoted \(\Phi (y)\) is defined as
$$\begin{aligned} \Phi (y) \triangleq \phi (y^{(0)}) \phi (y^{(1)}) \dots \phi (y^{(\ell -1)}). \end{aligned}$$
Intuitively, to compute the summary of string, we scan its k-factors from left to right. Each k-factor that appears for the first time is appended to the summary, whereas k-factors that have already appeared before are removed. It follows that the summary of a string contains a partition into distinct k-factors, and whose total length is upper bounded by \(kq^k\).
Example 5
Assume \(q=4\) and \(k=2\). Let \(y=011011013030023003\). Its partition into 2-factors is 01|10|11|01|30|30|02|30|03. If we underline repeated 2-factors in the partition we get \(01|10|11|\underline{01}|30|\underline{30}|02|\underline{30}|03\). By removing the repeated 2-factors from the partition, we obtain the summary of y,
$$\begin{aligned} \Phi (y)=011011300203. \end{aligned}$$
We observe that while 01 appears twice as a 2-factor in the summary, \(\underline{01}1\underline{01}1300203\), the second appearance is not removed since it does not appear in the partition.
In the main theorem for this section, we show that strings which have the same summary (up to a short prefix perhaps) also have a common descendant.
Theorem 12
Let \(k\geqslant 2\), and \(x,y\in \mathbb {Z}_q^*\). Write
$$\begin{aligned} x&= x' x''&y&= y' y'', \end{aligned}$$
where
$$\begin{aligned} \left|x'\right|&= \left|x\right| \bmod k,&\left|y'\right|=\left|y\right| \bmod k, \end{aligned}$$
i.e., \(0\leqslant |x'|,|y'|\leqslant k-1\). If \(x'=y'\) and \(\Phi (x'')=\Phi (y'')\), then
$$\begin{aligned} D^*_k(x)\cap D^*_k(y)\ne \emptyset . \end{aligned}$$
Proof
We shall actually prove that \(x''\) and \(y''\) have a common descendant, namely,
$$\begin{aligned} D^*_k(x'')\cap D^*_k(y'') \ne \emptyset . \end{aligned}$$
Then, by prepending the prefix \(x'=y'\), we shall trivially deduce the desired claim,
$$\begin{aligned} D^*_k(x'x'')\cap D^*_k(y'y'') \ne \emptyset . \end{aligned}$$
Thus, we shall forget about the prefix \(x'=y'\) from now on. Other simple cases we easily dismiss are that of \(x=y\), as well as the case of \(\Phi (x'')=\Phi (y'')=\varepsilon \), since then, again, \(x=y\).
Let us now partition \(x''\) and \(y''\) into k-factors,
$$\begin{aligned} x''&= x^{(0)} x^{(1)} \dots x^{(\ell _x-1)},&y''&= y^{(0)} y^{(1)} \dots y^{(\ell _y-1)}, \end{aligned}$$
where \(x^{(i)}, y^{(j)}\in \mathbb {Z}_q^k\) for all i and j. The remainder of the proof shall proceed in iterations. We initialize two indices, \(i_x=i_y=0\). We shall, at each iteration, make sure that
$$\begin{aligned} \Phi (x^{(0)} \dots x^{(i_x-1)}) = \Phi (y^{(0)} \dots y^{(i_y-1)}), \end{aligned}$$
(3)
and that there exists some \(u\in \mathbb {Z}_q^*\) such that
$$\begin{aligned} x^{(0)} \dots x^{(i_x-1)}&\Longrightarrow ^*_k u,&y^{(0)} \dots y^{(i_y-1)}&\Longrightarrow ^*_k u. \end{aligned}$$
(4)
Obviously, before we begin, when \(i_x=i_y=0\), both (3) and (4) hold, since for (3) we have \(\Phi (\varepsilon )=\varepsilon \), and for (4) choosing \(u=\varepsilon \) gives the trivial \(\varepsilon \Longrightarrow ^*_k \varepsilon \). We distinguish between the following cases:
Case 1 \(i_x<\ell _x\), \(i_y<\ell _y\) and \(x^{(i_x)}=y^{(i_y)}\). Since (3) holds, we also have,
$$\begin{aligned} \Phi (x^{(0)} \dots x^{(i_x)}) = \Phi (y^{(0)} \dots y^{(i_y)}). \end{aligned}$$
Additionally, since (4) holds, we obviously have
$$\begin{aligned} x^{(0)} \dots x^{(i_x-1)} x^{(i_x)}&\Longrightarrow ^*_k u x^{(i_x)},&y^{(0)} \dots y^{(i_y-1)} y^{(i_y)}&\Longrightarrow ^*_k u y^{(i_y)}. \end{aligned}$$
Thus, we can increase both \(i_x\) and \(i_y\) by 1, while maintaining (3) and (4).
Case 2 \(i_x<\ell _x\), \(i_y<\ell _y\) and \(x^{(i_x)}\ne y^{(i_y)}\). Since (3) holds, and also \(\Phi (x'')=\Phi (y'')\), there exists \(i'_x<i_x\) such that \(x^{(i'_x)}=x^{(i_x)}\), or there exists \(i'_y<i_y\) such that \(y^{(i'_y)}=y^{(i_y)}\). Let us assume the former (if the latter holds, the proof is symmetric). We then have
$$\begin{aligned} \Phi (x^{(0)} \dots x^{(i_x)}) = \Phi (x^{(0)} \dots x^{(i_x-1)}) = \Phi (y^{(0)} \dots y^{(i_y-1)}). \end{aligned}$$
By definition, we observe that \(x^{(i_x)}=x^{(i'_x)}\) is properly spaced in \(x^{(0)}\dots x^{(i_x-1)}\). By Lemma 9, this is true also in any descendant of \(x^{(0)}\dots x^{(i_x-1)}\), in particular, u. By Lemma 11, there exists \(u'\in \mathbb {Z}_q^*\) such that
$$\begin{aligned} u&\Longrightarrow ^*_k u',&u x^{(i_x)}&\Longrightarrow ^*_k u'. \end{aligned}$$
We therefore have
$$\begin{aligned} x^{(0)} \dots x^{(i_x-1)} x^{(i_x)}&\Longrightarrow ^*_k u x^{(i_x)}\Longrightarrow ^*_k u',&y^{(0)} \dots y^{(i_y-1)}&\Longrightarrow ^*_k u \Longrightarrow ^*_k u'. \end{aligned}$$
It follows that we can increase \(i_x\) by 1, while maintaining (3) and (4).
Case 3 \(i_x < \ell _x\) and \(i_y = \ell _y\). In this case we have \(y^{(0)}\dots y^{(i_y-1)}=y''\). Since \(\Phi (y'')=\Phi (x'')\), there exists \(i'_x<i_x\) such that \(x^{(i'_x)}=x^{(i_x)}\). The remainder of this case is the same as Case 2.
Case 4 \(i_x = \ell _x\) and \(i_y < \ell _y\). This case is symmetric to Case 3.
Case 5 \(i_x=\ell _x\) and \(i_y=\ell _y\). When we reach this case, u from (4) is the desired common descendant \(x''\) and \(y''\).
Finally, we observe that we shall always reach Case 5, since in each of the other cases, \(i_x\) or \(i_y\) is increased by 1. This completes the proof. \(\square \)
With the help of Theorem 12 we obtain the following corollary.
[Style2 Style1]Corollary 13
For any \(k\in \mathbb {N}\), \(k\geqslant 2\), any \(n\in \mathbb {N}\) and any even \(q\in \mathbb {N}\),
$$\begin{aligned} A_q(n;*)^{\textrm{rc}}_k \leqslant q^{k-1}\cdot \frac{q^{kq^k+k}-1}{q^k-1}, \end{aligned}$$
and therefore
$$\begin{aligned} R_q(*)^{\textrm{rc}}_k = 0. \end{aligned}$$
Proof
By Theorem 12, a code cannot contain two codewords with the same summary and the same prefix of length at most \(k-1\). The number of prefixes is upper bounded by \(q^{k-1}\). As for the summaries, each summary is a string of length \(\ell k\), for some \(0\leqslant \ell \leqslant q^k\). Thus,
$$\begin{aligned} A_q(n;*)^{\textrm{rc}}_k \leqslant q^{k-1} \sum _{\ell =0}^{q^k} q^{k\ell } = q^{k-1}\cdot \frac{q^{kq^k+k}-1}{q^k-1}. \end{aligned}$$
Substituting this into the definition of coding capacity we get
$$\begin{aligned} R_q(*)^{\textrm{rc}}_k = 0, \end{aligned}$$
as claimed. \(\square \)

5 Palindromic duplication

Having determined the coding capacity of the reverse-complement duplication channel in the previous sections, we now turn to the simpler case of palindromic duplication. In this case, factors are duplicated in reverse, but without applying the complement.
We first look at the case of duplication of length \(k=1\). In this case, reversing does nothing, and so this duplication is the same as tandem duplication of length \(k=1\), which was already studied in [11].
[Style2 Style1]Corollary 14
For any \(n,q\in \mathbb {N}\), \(q\geqslant 2\),
$$\begin{aligned} A_q(n;*)^{\textrm{pal}}_1 = q\sum _{i=0}^{n-1}(q-1)^i = {\left\{ \begin{array}{ll} q \cdot \frac{(q-1)^n-1}{q-2} & q\geqslant 3, \\ qn & q=2, \end{array}\right. } \end{aligned}$$
and therefore
$$\begin{aligned} R_q(*)^{\textrm{pal}}_1 = \log _q (q-1). \end{aligned}$$
Proof
Since a palindromic duplication of length 1 is the same as a tandem duplication of length 1, we look at the known results for tandem duplication. According to [11, Theorems 15,16], the optimal codes in these cases contain all the strings whose adjacent positions contain distinct letters, except perhaps in the suffix. Choosing the first letter has q options, followed by \(0\leqslant i\leqslant n-1\) letters distinct from their predecessors in the string for a total of \((q-1)^i\) options, finally followed by repeating the last letter chosen to get a string of length n. \(\square \)
We note that the results are quite similar to Corollaries 3 and 7. In the binary case the coding capacity vanishes in both types of duplication, whereas in the non-binary case, it is slightly higher in palindromic duplication, compared with reverse-complement duplication.
Moving on to longer duplication lengths, \(k\geqslant 2\), we observe that the complement operation plays no role whatsoever in Sect. 4. We do need to replace Lemma 8, which we do easily by adapting [1, Lemma 2].
Lemma 15
Let \(k\geqslant 2\), \(x \in \mathbb {Z}_q^*, |x|\geqslant k + 1\), and let \(i \geqslant 2\) be an integer. If the i-th letter from the end of x is a, then there exists \(x'\in D^{2}_k(x)\), such that a is the \((i-2)\)-nd letter from the end of \(x'\), and where \(D(\cdot )\) denotes descendants with respect to palindromic duplication.
Proof
Let us write \(x=uvbcw\), where \(u,v,w\in \mathbb {Z}_q^*\), \(|v|=k-1\), \(b,c\in \mathbb {Z}_q\), and where a, the i-th letter from the end of x, occurs in v. Then,
$$\begin{aligned} x=uvbcw \Longrightarrow _k uvbbv^R cw \Longrightarrow _k uvbbv^R ccvw \triangleq x', \end{aligned}$$
where \(\Longrightarrow \) denotes derivation with respect to palindromic duplication. Thus, \(x'\) contains a as the \((i-2)\)-nd letter from the end. \(\square \)
We now obtain the palindromic analogue of Corollary 13.
[Style2 Style1]Corollary 16
For any \(k\in \mathbb {N}\), \(k\geqslant 2\), any \(n\in \mathbb {N}\) and any even \(q\in \mathbb {N}\),
$$\begin{aligned} A_q(n;*)^{\textrm{pal}}_k \leqslant q^{k-1}\cdot \frac{q^{kq^k+k}-1}{q^k-1}, \end{aligned}$$
and therefore
$$\begin{aligned} R_q(*)^{\textrm{pal}}_k = 0. \end{aligned}$$
Proof
Scanning Sect. 4, we replace Lemma 8 with Lemma 15, and in Lemma 11 we simply remove the complement. The rest remains the same, giving us the same conclusion in the palindromic case as in the reverse-complement case. \(\square \)
The vanishing capacity for palindromic duplication with length \(k\geqslant 2\), contradicts the code constructions given in [23]. All of these constructions have positive asymptotic rates (provided the alphabet size is not too small). To show that these constructions are unfortunately flawed, we present counter examples based on our proof technique. The counter examples take on the simple form of showing two codewords that have a common descendant, thus, contradicting the definition of an error-correcting code. In all of the cases, these two codewords have the same summary, and hence, by the palindromic version of Theorem 12, have a common descendant. We give a full description in the first example, including the exact derivation of the common descendant, while for the rest of the examples we only provide the two codewords. All of the examples are not necessarily the smallest counter examples.
Example 6
Codes correcting palindromic duplications of length 2: The code presented in [23, Theorem 2] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 2, nor a tandem duplication of length at most 2. Take \(q=4\) and \(n=8\), then 01020301 and 01020302 both satisfy the conditions, and so are codewords. However, they both have the same 2-summary, 010203, and therefore, must have a common descendant with respect to palindromic duplication of length 2. In this case, a common descendant for both is:
$$\begin{aligned} 01001221120020021121120023323320023323320021121120020022011011022002. \end{aligned}$$
This contradicts the claim that the code can correct such duplications. The derivation of the common descendant from each of these codewords is presented in detail in Appendix A.
The remaining counter examples all share the same flavor, and we bring only a short description of them without the full derivation of the common descendants.
Example 7
Codes correcting palindromic duplications of length 3: The code presented in [23, Theorem 4] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 3, nor a tandem duplication of length at most 3. Take \(q=5\) and \(n=12\), then 012013014012 and 012013014013 both satisfy the conditions, and so are codewords. However, they both have the same 3-summary, 012013014, and therefore, must have a common descendant with respect to palindromic duplication of length 3.
Example 8
Codes correcting any combination of palindromic duplications of length 2 and tandem duplications of length 2: The code presented in [23, Section IV.A] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 2, nor a tandem duplication of length at most 2, and nor a factor of the form abcb, with \(a,b,c\in \mathbb {Z}_q\) being distinct letters. Take \(q=6\) and \(n=14\), then 01450245034501 and 01450245034502 both satisfy the conditions, and so are codewords. However, they both have the same 2-summary, 01450203, and therefore, must have a common descendant with respect to palindromic duplication of length 2. Note that we do not even use any tandem duplication in the process of deriving the common descendant.
Example 9
Codes correcting any combination of palindromic duplications of length 3 and tandem duplications of length 3: The code presented in [23, Section IV.B] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 3, nor a tandem duplication of length at most 3, and nor a factor of the form aba or abca, with \(a,b,c\in \mathbb {Z}_q\) being distinct letters. Take \(q=8\) and \(n=21\), then 012567013567014567012 and 012567013567014567013 both satisfy the conditions, and so are codewords. However, they both have the same 3-summary, 012567013014, and therefore, must have a common descendant with respect to palindromic duplication of length 3. Note that we do not even use any tandem duplication in the process of deriving the common descendant.
Example 10
Codes correcting any combination of palindromic duplications of length \(\leqslant 2\) and tandem duplications of length \(\leqslant 2\): The code presented in [23, Theorem 5] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 2, nor a tandem duplication of length at most 2, and nor a factor of the form aba or abca, with \(a,b,c\in \mathbb {Z}_q\) being distinct letters. The same codewords as in Example 8 provide a counter example. Here, it is sufficient to use palindromic duplication of length 2 to derive the common descendant.
Example 11
Codes correcting any combination of palindromic duplications of length \(\leqslant 3\) and tandem duplications of length \(\leqslant 3\): The code presented in [23, Theorem 7] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 3, nor a tandem duplication of length at most 3, and nor a factor of the form aba or abca or abcda or abcdea, with \(a,b,c,d,e\in \mathbb {Z}_q\) being distinct letters. The same codewords as in Example 9 provide a counter example. Here, it is sufficient to use palindromic duplication of length 3 to derive the common descendant.

6 Conclusion

In this work, we studied the parameters of duplication-correcting codes, both in the reverse-complement and the palindromic settings. We determined \(A_q(n;*)^{\textrm{rc}}_1\) and constructed optimal codes. We then showed that the coding capacity, \(R_q(n;*)^{\textrm{rc}}_k\), \(k\geqslant 2\), is vanishing. A similar result for \(k\geqslant 2\) and palindromic duplication was also proved.
While having a vanishing coding capacity is disappointing, when limiting ourselves to a single duplication of length k, the coding capacity is 1 (e.g., by treating this duplication as a single burst insertion and using the codes from [2, 15, 20]). Thus, finding the trade-off between coding capacity and the number of correctable errors remains an open question beyond the scope of this paper. Additionally, from a mathematical perspective, it is interesting to note that, in contrast with reverse-complement and palindromic duplications, for tandem duplication (i.e., the duplicated part is neither reversed, nor complemented) the coding capacity is positive [11].
Other open questions remain. We do not yet know how to construct \((n,M;t)^{\textrm{rc}}_k\) and \((n,M;t)^{\textrm{pal}}_k\) for a finite t. If \(t=1\), namely, a single duplication error is to be corrected, then [1] constructed reverse-complement duplication-correcting codes when k is odd. More generally, we can use a burst-insertion-correcting code, e.g., [2, 15, 20]. However, when \(t\geqslant 2\) no solution is known except using a tk-insertion correcting code, which is, most likely, far from optimal. We leave finding such codes, determining the optimal code size as well as the coding capacity, for future work.

Acknowledgements

This work was supported in part by the US National Science Foundation (NSF), Grant CCF-5237372, and by the Zhejiang Lab BioBit Program (Grant No. 2022YFB507). The author M. Schwartz is currently on a leave of absence from Ben-Gurion University of the Negev.

Declarations

Conflict of interest

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://​creativecommons.​org/​licenses/​by/​4.​0/​.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix

Full derivation for example 6

Derivation from 01020301 (underline denotes the duplicated part):
$$\begin{aligned}&01020301 \Longrightarrow \\&010\underline{01}20301 \Longrightarrow \\&010012\underline{21}0301 \Longrightarrow \\&010012210\underline{01}301 \Longrightarrow \\&010012210013\underline{31}01 \Longrightarrow \\&01001221001\underline{10}33101 \Longrightarrow \\&01001221001103\underline{30}3101 \Longrightarrow \\&01001221001103303\underline{30}101 \Longrightarrow \\&01001221001103303301\underline{10}01 \Longrightarrow \\&01001221\underline{12}0011033033011001 \Longrightarrow \\&01001221120\underline{02}011033033011001 \Longrightarrow \\&01001221120020\underline{02}11033033011001 \Longrightarrow \\&01001221120020021\underline{12}1033033011001 \Longrightarrow \\&01001221120020021121\underline{12}033033011001 \Longrightarrow \\&01001221120020021121120\underline{02}33033011001 \Longrightarrow \\&01001221120020021121120023\underline{32}3033011001 \Longrightarrow \\&01001221120020021121120023323\underline{32}033011001 \Longrightarrow \\&01001221120020021121120023323320\underline{02}33011001 \Longrightarrow \\&01001221120020021121120023323320023\underline{32}3011001 \Longrightarrow \\&01001221120020021121120023323320023323\underline{32}011001 \Longrightarrow \\&01001221120020021121120023323320023323320\underline{02}11001 \Longrightarrow \\&01001221120020021121120023323320023323320021\underline{12}1001 \Longrightarrow \\&01001221120020021121120023323320023323320021121\underline{12}001 \Longrightarrow \\&01001221120020021121120023323320023323320021121120\underline{02}01 \Longrightarrow \\&01001221120020021121120023323320023323320021121120020\underline{02}1 \Longrightarrow \\&01001221120020021121120023323320023323320021121120020021\underline{12} \Longrightarrow \\&0100122112002002112112002332332002332332002112112002002\underline{20}112 \Longrightarrow \\&0100122112002002112112002332332002332332002112112002002201\underline{10}12 \Longrightarrow \\&0100122112002002112112002332332002332332002112112002002201101\underline{10}2 \Longrightarrow \\&0100122112002002112112002332332002332332002112112002002201101102\underline{20} \Longrightarrow \\&010012211200200211211200233233200233233200211211200200220110110220\underline{02} \end{aligned}$$
Derivation from 01020302 (underline denotes the duplicated part):
$$\begin{aligned}&01020302 \Longrightarrow \\&010\underline{01}20302 \Longrightarrow \\&010012\underline{21}0302 \Longrightarrow \\&010012210\underline{01}302 \Longrightarrow \\&010012210013\underline{31}02 \Longrightarrow \\&01001221001\underline{10}33102 \Longrightarrow \\&01001221001103\underline{30}3102 \Longrightarrow \\&01001221001103303\underline{30}102 \Longrightarrow \\&01001221001103303301\underline{10}02 \Longrightarrow \\&0100122100110330330110\underline{01}02 \Longrightarrow \\&01001221\underline{12}001103303301100102 \Longrightarrow \\&01001221120\underline{02}01103303301100102 \Longrightarrow \\&01001221120020\underline{02}1103303301100102 \Longrightarrow \\&01001221120020021\underline{12}103303301100102 \Longrightarrow \\&01001221120020021121\underline{12}03303301100102 \Longrightarrow \\&01001221120020021121120\underline{02}3303301100102 \Longrightarrow \\&01001221120020021121120023\underline{32}303301100102 \Longrightarrow \\&01001221120020021121120023323\underline{32}03301100102 \Longrightarrow \\&01001221120020021121120023323320\underline{02}3301100102 \Longrightarrow \\&01001221120020021121120023323320023\underline{32}301100102 \Longrightarrow \\&01001221120020021121120023323320023323\underline{32}01100102 \Longrightarrow \\&01001221120020021121120023323320023323320\underline{02}1100102 \Longrightarrow \\&01001221120020021121120023323320023323320021\underline{12}100102 \Longrightarrow \\&01001221120020021121120023323320023323320021121\underline{12}00102 \Longrightarrow \\&01001221120020021121120023323320023323320021121120\underline{02}0102 \Longrightarrow \\&01001221120020021121120023323320023323320021121120020\underline{02}102 \Longrightarrow \\&01001221120020021121120023323320023323320021121120020021\underline{12}02 \Longrightarrow \\&0100122112002002112112002332332002332332002112112002002\underline{20}11202 \Longrightarrow \\&0100122112002002112112002332332002332332002112112002002201\underline{10}1202 \Longrightarrow \\&0100122112002002112112002332332002332332002112112002002201101\underline{10}202 \Longrightarrow \\&0100122112002002112112002332332002332332002112112002002201101102\underline{20}02 \end{aligned}$$
Literature
1.
go back to reference Ben-Tolila E., Schwartz M.: On the reverse-complement string-duplication system. IEEE Trans. Inf. Theory 68(11), 7184–7197 (2022).MathSciNetCrossRef Ben-Tolila E., Schwartz M.: On the reverse-complement string-duplication system. IEEE Trans. Inf. Theory 68(11), 7184–7197 (2022).MathSciNetCrossRef
2.
go back to reference Bitar R., Hanna S.K., Polyanskii N., Vorobyev I.: Optimal codes correcting localized deletions. In: Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT2021), Melbourne, Victoria, Australia, pp. 1991–1996 (2021). Bitar R., Hanna S.K., Polyanskii N., Vorobyev I.: Optimal codes correcting localized deletions. In: Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT2021), Melbourne, Victoria, Australia, pp. 1991–1996 (2021).
3.
go back to reference Church G.M., Gao Y., Kosuri S.: Next-generation digital information storage in DNA. Science 337, 1628 (2012).CrossRef Church G.M., Gao Y., Kosuri S.: Next-generation digital information storage in DNA. Science 337, 1628 (2012).CrossRef
5.
go back to reference Elishco O., Farnoud F., Schwartz M., Bruck J.: The entropy rate of some Pólya string models. IEEE Trans. Inf. Theory 65(12), 8180–8193 (2019).CrossRef Elishco O., Farnoud F., Schwartz M., Bruck J.: The entropy rate of some Pólya string models. IEEE Trans. Inf. Theory 65(12), 8180–8193 (2019).CrossRef
6.
go back to reference Farnoud F., Schwartz M., Bruck J.: The capacity of string-duplication systems. IEEE Trans. Inf. Theory 62(2), 811–824 (2016).MathSciNetCrossRef Farnoud F., Schwartz M., Bruck J.: The capacity of string-duplication systems. IEEE Trans. Inf. Theory 62(2), 811–824 (2016).MathSciNetCrossRef
7.
go back to reference Farnoud F., Schwartz M., Bruck J.: Estimation of duplication history under a stochastic model for tandem repeats. BMC Bioinform. 20(64), 1–11 (2019). Farnoud F., Schwartz M., Bruck J.: Estimation of duplication history under a stochastic model for tandem repeats. BMC Bioinform. 20(64), 1–11 (2019).
8.
go back to reference Goldman N., Bertone P., Chen S., Dessimoz C., LeProust E.M., Sipos B., Birney E.: Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494(7435), 77–80 (2013).CrossRef Goldman N., Bertone P., Chen S., Dessimoz C., LeProust E.M., Sipos B., Birney E.: Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494(7435), 77–80 (2013).CrossRef
9.
go back to reference Goshkoder D., Polyanskii N., Vorobyev I.: Codes correcting a single long duplication error. In: Proceedings of the 2023 IEEE International Symposium on Information Theory (ISIT2023), Taipei, Taiwan, pp. 2708–2713 (2023). Goshkoder D., Polyanskii N., Vorobyev I.: Codes correcting a single long duplication error. In: Proceedings of the 2023 IEEE International Symposium on Information Theory (ISIT2023), Taipei, Taiwan, pp. 2708–2713 (2023).
10.
go back to reference Jain S., Farnoud F., Bruck J.: Capacity and expressiveness of genomic tandem duplication. IEEE Trans. Inf. Theory 63(10), 6129–6138 (2017).MathSciNetCrossRef Jain S., Farnoud F., Bruck J.: Capacity and expressiveness of genomic tandem duplication. IEEE Trans. Inf. Theory 63(10), 6129–6138 (2017).MathSciNetCrossRef
11.
go back to reference Jain S., Farnoud F., Schwartz M., Bruck J.: Duplication-correcting codes for data storage in the DNA of living organisms. IEEE Trans. Inf. Theory 63(8), 4996–5010 (2017).MathSciNetCrossRef Jain S., Farnoud F., Schwartz M., Bruck J.: Duplication-correcting codes for data storage in the DNA of living organisms. IEEE Trans. Inf. Theory 63(8), 4996–5010 (2017).MathSciNetCrossRef
12.
go back to reference Kovačević M.: Zero-error capacity of duplication channels. IEEE Trans. Commun. 67(10), 6735–6742 (2019).CrossRef Kovačević M.: Zero-error capacity of duplication channels. IEEE Trans. Commun. 67(10), 6735–6742 (2019).CrossRef
13.
14.
go back to reference Nguyen T.T., Cai K., Song W., Immink K.A.S.: Optimal single chromosome-inversion correcting codes for data storage in live DNA. In: Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT2022), Espoo, Finland, pp. 1791–1796 (2022). Nguyen T.T., Cai K., Song W., Immink K.A.S.: Optimal single chromosome-inversion correcting codes for data storage in live DNA. In: Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT2022), Espoo, Finland, pp. 1791–1796 (2022).
15.
go back to reference Schoeny C., Wachter-Zeh A., Gabrys R., Yaakobi E.: Codes correcting a burst of deletions or insertions. IEEE Trans. Inf. Theory 63(4), 1971–1985 (2017).MathSciNetCrossRef Schoeny C., Wachter-Zeh A., Gabrys R., Yaakobi E.: Codes correcting a burst of deletions or insertions. IEEE Trans. Inf. Theory 63(4), 1971–1985 (2017).MathSciNetCrossRef
16.
go back to reference Shipman S.L., Nivala J., Macklis J.D., Church G.M.: CRISPR-Cas encoding of digital movie into the genomes of a population of living bacteria. Nature 547, 345–349 (2017).CrossRef Shipman S.L., Nivala J., Macklis J.D., Church G.M.: CRISPR-Cas encoding of digital movie into the genomes of a population of living bacteria. Nature 547, 345–349 (2017).CrossRef
17.
go back to reference Tang Y., Farnoud F.: Error-correcting codes for short tandem duplication and edit errors. IEEE Trans. Inf. Theory 68(2), 871–880 (2021).MathSciNetCrossRef Tang Y., Farnoud F.: Error-correcting codes for short tandem duplication and edit errors. IEEE Trans. Inf. Theory 68(2), 871–880 (2021).MathSciNetCrossRef
18.
go back to reference Tang Y., Yehezkeally Y., Schwartz M., Farnoud F.: Single-error detection and correction for duplication and substitution channels. IEEE Trans. Inf. Theory 66(11), 6908–6919 (2020).MathSciNetCrossRef Tang Y., Yehezkeally Y., Schwartz M., Farnoud F.: Single-error detection and correction for duplication and substitution channels. IEEE Trans. Inf. Theory 66(11), 6908–6919 (2020).MathSciNetCrossRef
19.
go back to reference Tang Y., Wang S., Lou H., Gabrys R., Farnoud F.: Low-redundancy codes for correcting multiple short-duplication and edit errors. IEEE Trans. Inform. Theory 69(5), 2940–2954 (2023).MathSciNetCrossRef Tang Y., Wang S., Lou H., Gabrys R., Farnoud F.: Low-redundancy codes for correcting multiple short-duplication and edit errors. IEEE Trans. Inform. Theory 69(5), 2940–2954 (2023).MathSciNetCrossRef
20.
go back to reference Wang S., Tang Y., Sima J., Gabrys R., Farnoud F.: Non-binary codes for correcting a burst of at most \(t\) deletions. IEEE Trans. Inf. Theory 70(2), 964–979 (2004).MathSciNetCrossRef Wang S., Tang Y., Sima J., Gabrys R., Farnoud F.: Non-binary codes for correcting a burst of at most \(t\) deletions. IEEE Trans. Inf. Theory 70(2), 964–979 (2004).MathSciNetCrossRef
21.
go back to reference Yu W., Schwartz M.: On duplication-free codes for disjoint or equal-length errors. Des. Codes Cryptog. 92(10), 2845–2861 (2024).MathSciNetCrossRef Yu W., Schwartz M.: On duplication-free codes for disjoint or equal-length errors. Des. Codes Cryptog. 92(10), 2845–2861 (2024).MathSciNetCrossRef
22.
go back to reference Zeraatpisheh M., Esmaeili M., Gulliver T.A.: Construction of tandem duplication correcting codes. IET Commun. 13(15), 2217–2225 (2019).CrossRef Zeraatpisheh M., Esmaeili M., Gulliver T.A.: Construction of tandem duplication correcting codes. IET Commun. 13(15), 2217–2225 (2019).CrossRef
23.
go back to reference Zeraatpisheh M., Esmaeili M., Gulliver T.A.: Construction of duplication correcting codes. IEEE Access 8, 96150–96161 (2020).CrossRef Zeraatpisheh M., Esmaeili M., Gulliver T.A.: Construction of duplication correcting codes. IEEE Access 8, 96150–96161 (2020).CrossRef
Metadata
Title
On the coding capacity of reverse-complement and palindromic duplication-correcting codes
Authors
Lev Yohananov
Moshe Schwartz
Publication date
10-05-2025
Publisher
Springer US
Published in
Designs, Codes and Cryptography
Print ISSN: 0925-1022
Electronic ISSN: 1573-7586
DOI
https://doi.org/10.1007/s10623-025-01627-7

Premium Partner