1 Introduction
Storing information in DNA molecules promises a medium that is six orders of magnitude denser than the densest electronic storage medium of today. The feasibility of this technology was first demonstrated in [3, 8] (for in-vitro DNA storage), and then for storage in the DNA of living organisms, in [16] (for in-vivo DNA storage). Apart from plain storage, the latter also allows watermarking of genetically-modified organisms (GMOs), labeling organisms for biological studies, and running synthetic-biology algorithms.
As with any storage or communication system, the transmitted information is corrupted by the channel and received with errors. In the case of DNA storage, some of the observed error mechanisms have already been studied in the literature in other contexts, mainly electronic communications. Among these we can see substitution errors, insertions and deletions. However, in-vivo DNA storage exhibits a new kind of errors—duplication errors—which may be caused during faulty DNA replication.
Advertisement
A DNA strand is a long string over the alphabet \(\{\textsf{A},\textsf{C},\textsf{G},\textsf{T}\}\), standing for the four chemical bases: adenine, cytosine, guanine, and thymine. The bases are partitioned into complement pairs, where \(\textsf{A}\) and \(\textsf{T}\) are complement to each other, as are \(\textsf{C}\) and \(\textsf{G}\). In a nutshell, a duplication error takes a substring of the DNA string, and inserts a copy of it (perhaps altered) back into the string. Possible alterations vary, depending on the biological process creating them. Common examples we shall be referring to throughout the paper, include tandem duplication, where the copy is unaltered and placed next to its original location, e.g.,where a duplication of length 4 occurred, inserting a duplicate copy of \(\textsf{G}\textsf{G}\textsf{A}\textsf{T}\) (underlined on the right) immediately following its original location. Another type of duplication is palindromic duplication (sometimes called reverse duplication), which alters the inserted copy by reversing it, e.g.,Finally, we mention reverse-complement duplication, in which the altered copy is both reversed and complemented, e.g.,The string-duplication channel was introduced in [6], and the first duplication-correcting codes were developed in [11]. These generalize the setting to allow for arbitrary alphabets, parametric duplication length, and various duplication types. Codes correcting any number of tandem duplications were studied in [11, 21‐23] and any number of palindromic duplications in [23]. Codes that correct only a fixed number of tandem duplications (sometimes just one) were studied in [9, 12, 13], palindromic duplications in [14], and reverse-complement duplications in [1]. Additionally, codes that correct mixtures of duplications, substitutions, as well as insertions and deletions, have been the focus of [17‐19]. We also mention related work on duplication errors, though not providing error-correcting codes, but rather studying the possible outcomes from multiple errors [1, 4‐7, 10].
$$\begin{aligned} \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\textsf{C}\textsf{C}\textsf{C}\textsf{T}\Longrightarrow \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\underline{\textsf{G}\textsf{G}\textsf{A}\textsf{T}}\textsf{C}\textsf{C}\textsf{C}\textsf{T}, \end{aligned}$$
$$\begin{aligned} \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\textsf{C}\textsf{C}\textsf{C}\textsf{T}\Longrightarrow \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\underline{\textsf{T}\textsf{A}\textsf{G}\textsf{G}}\textsf{C}\textsf{C}\textsf{C}\textsf{T}. \end{aligned}$$
$$\begin{aligned} \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\textsf{C}\textsf{C}\textsf{C}\textsf{T}\Longrightarrow \textsf{A}\textsf{A}\textsf{C}\textsf{T}\textsf{G}\textsf{G}\textsf{A}\textsf{T}\underline{\textsf{A}\textsf{T}\textsf{C}\textsf{C}}\textsf{C}\textsf{C}\textsf{C}\textsf{T}. \end{aligned}$$
In this paper we focus on the asymptotic rate of optimal codes (the coding capacity) that are capable of correcting any number of duplication errors. The coding capacity for tandem-duplication errors is already known [11]. Thus, we study reverse-complement duplications, and palindromic duplications. We provide a complete solution, for all duplication lengths k, and all possible alphabet sizes q. We show that, except for the case of duplication length \(k=1\), the coding capacity is 0. When \(k=1\), the coding capacity depends on the alphabet size q, and in this case we construct optimal codes. A summary of the coding-capacity results is provided in Table 1. In addition, the vanishing coding capacity for the case of palindromic duplication of length \(k\geqslant 2\), contradict the code constructions presented in [23] that have positive asymptotic rate. We provide counter examples to these constructions that show the constructions are unfortunately flawed.
The paper is organized as follows. In Sect. 2 we give all the necessary notation and definitions. In Sect. 3 we study reverse-complement duplication with duplication length \(k=1\), whereas in Sect. 4 we focus on longer duplication lengths. Section 5 provides the results for palindromic duplication. We conclude in Sect. 6 with a short summary of the results and some open questions.
Table 1
A summary of the coding-capacity results, where k denotes the duplication length, q denotes the alphabet size, and \(R_q(*)_k\) denotes the coding capacity
Type | k | q | \(R_q(*)_k\) | Location |
---|---|---|---|---|
Reverse complement | \(=1\) | \(=2\) | 0 | Corollary 3 |
\(\geqslant 4\) | \(\log _q(q-2)\) | Corollary 7 | ||
\(\geqslant 2\) | \(\geqslant 2\) | 0 | Corollary 13 | |
Palindromic | \(=1\) | \(\geqslant 2\) | \(\log _q(q-1)\) | [11], equivalent to tandem duplication |
\(\geqslant 2\) | \(\geqslant 2\) | 0 | Corollary 16 |
2 Preliminaries
Throughout this paper, we shall be considering strings over a finite alphabet. Without loss of generality, we shall assume that alphabet is \(\mathbb {Z}_q\), the ring of integers modulo q, where \(q\in \mathbb {N}\), \(q\geqslant 2\). A string is a sequence of letters from the alphabet \(\mathbb {Z}_q\). The set of strings of length \(n\in \mathbb {N}\) is denoted by \(\mathbb {Z}_q^n\). Assume \(x\in \mathbb {Z}_q^n\) is such a string, then the letters of x are denoted by adding a subscript i, where indices start from 0. Thus, \(x=x_0 x_1 \dots x_{n-1}\), with \(x_i\in \mathbb {Z}_q\) for all i. We denote the length of x by \(|x|=n\). If \(y\in \mathbb {Z}_q^m\) is also a string, we denote by xy the concatenation of x and y. If \(\ell \geqslant 0\) is an integer, we denote by \(x^\ell \) the concatenation of \(\ell \) copies of x. Since subscripts are used to denote individual letters in a string, and superscripts denote concatenation of copies of a string, when we require several strings with the same name, we shall use superscripts in parentheses to distinguish between them, e.g., \(x^{(0)}, x^{(1)}, x^{(2)},\dots \).
Advertisement
The set of all finite-length strings over \(\mathbb {Z}_q\) is denoted by \(\mathbb {Z}_q^*\). We use \(\varepsilon \) to write the unique empty string of length 0. Obviously \(\varepsilon \in \mathbb {Z}_q^*\). If we are interested only in strings with positive length we write \(\mathbb {Z}_q^+\triangleq \mathbb {Z}_q^*\setminus \{\varepsilon \}\).
Assume \(x=uvw\), with \(u,v,w\in \mathbb {Z}_q^*\), \(|v|=k\) for some \(k\in \mathbb {N}\). Then we say that v is a k-factor of x. If \(u=\varepsilon \), then v is said to be a k-prefix of x, and if \(w=\varepsilon \) then v is said to be a k-suffix of x. If the length of x is not of importance, we just say it is a factor, or prefix, or suffix of x.
The notion of complement will be important. We assume the letters of the alphabet \(\mathbb {Z}_q\) may be partitioned into pairs of complements. If \(a\in \mathbb {Z}_q\) is a letter, we denote its complement by \(\overline{a}\), where necessarily \(\overline{a}\ne a\). Additionally, \(\overline{\overline{a}}=a\). It follows that if a complement is defined, then q is even. We extend the complement notation in the natural way to strings, namely, if \(x=x_0 x_1 \dots x_{n-1}\in \mathbb {Z}_q^n\), then \(\overline{x}=\overline{x_0} \,\overline{x_1} \dots \overline{x_{n-1}}\).
Another important notation we introduce is the reverse of a string. If \(x=x_0 x_1 \dots x_{n-1}\in \mathbb {Z}_q^n\), then \(x^R \triangleq x_{n-1} \dots x_1 x_0\). Obviously, the complement and reverse operations commute, and so \(\overline{x^R} = \overline{x}^R\).
The two major concepts we shall study in this paper are string-duplication channels, and codes that correct duplication errors introduced by the channel. We shall start by describing the former. We shall generally follow the notation set by [11].
In a string-duplication channel, strings from \(\mathbb {Z}_q^*\) are transmitted. However, the channel introduces errors by applying a (finite) number of duplication rules. Generally speaking, a duplication rule is just a function \(T:\mathbb {Z}_q^*\rightarrow \mathbb {Z}_q^*\). The set of all duplication rules the channel may apply is denoted by \(\mathcal {T}\subseteq \mathbb {Z}_q^*{\,}^{\mathbb {Z}_q^*}\). The choice of rules to contain in this set is motivated by the biological processes present in our in-vivo DNA storage system. We shall focus on two such sets which we define later (for more examples, see [11]).
Assume \(u,v\in \mathbb {Z}_q^*\) are two strings. We say v is a descendant of u if there exist \(t\geqslant 0\) and \(T_1,T_2,\dots ,T_t\in \mathcal {T}\) such thatIf the exact sequence of rules is immaterial, we just writeto signify this relation between u and v. If \(t=1\) we may omit it and write \(u\Longrightarrow v\). If t is unknown or irrelevant, we simply write \(u\Longrightarrow ^*v\). For convenience, we always have \(u\Longrightarrow ^0 u\). The set of all descendants of u is called the descendant cone of u, and is denoted byThus, \(D^*(u)\) is the reflexive transitive closure of \(\mathcal {T}\) acting on u.
$$\begin{aligned} v= T_t(T_{t-1}( \dots T_1(u)\dots )). \end{aligned}$$
$$\begin{aligned} u\Longrightarrow ^t v, \end{aligned}$$
$$\begin{aligned} D^*(u) \triangleq \big \{ v\in \mathbb {Z}_q^* ~:~ u\Longrightarrow ^*v \big \}. \end{aligned}$$
Motivated by biological processes, in this paper we focus on two sets of duplication rules: the reverse-complement and the palindromic. For the first, we define reverse-complement duplication rules of the following form:Thus, \(T^{\textrm{rc}}_{i,k}\) takes a k-factor v which follows an i-prefix u, and inserts a reverse-complement copy of v following its original position. Note that the rule is not defined on strings of length shorter than \(i+k\). We then define the set of reverse-complement duplication rules asWe denote derivation by \(\Longrightarrow _k\), and the descendant cone by \(D_k\), where the purpose of the subscript is solely to emphasize the fixed length of the duplication k. We may omit this subscript if it is clear from context.
$$\begin{aligned} T^{\textrm{rc}}_{i,k}(x)\triangleq uv \overline{v}^R w \qquad \text {if x=uvw, |u|=i, |v|=k.} \end{aligned}$$
$$\begin{aligned} \mathcal {T}^{\textrm{rc}}_k \triangleq \big \{T^{\textrm{rc}}_{i,k} ~:~ i\geqslant 0\big \}. \end{aligned}$$
Example 1
Assume \(q=4\), \(k=2\), and in \(\mathbb {Z}_q\) we have \(\overline{0}=3\) and \(\overline{1}=2\). In the reverse-complement duplication setting,where the underlined factor is the inserted reverse-complement duplication. Thus, we may writeas the right-hand side may be derived from the left-hand side by a sequence of 3 reverse-complement duplications of length 2 each.
$$\begin{aligned} 01103203 \Longrightarrow _2 011032\underline{10}03 \Longrightarrow _2 011\underline{22}0321003\Longrightarrow _2 0112\underline{12}20321003, \end{aligned}$$
$$\begin{aligned} 01103203 \Longrightarrow ^3_2 01121220321003, \end{aligned}$$
Along the same lines, we define the palindromic-duplication rules of the following form:and the set of all palindromic duplication rules asThe only difference between the palindromic and the reverse-complement duplication rules is that the latter also complement the duplicated symbols. We once again use \(\Longrightarrow _k\) for derivation and \(D_k\) for the descendant cone. To avoid cumbersome notation we infer from the context whether we use palindromic or reverse-complement duplication rules. In particular, Sect. 3 and Sect. 4 use reverse-complement duplication rules, whereas only Sect. 5 uses palindromic duplication rules.
$$\begin{aligned} T^{\textrm{pal}}_{i,k}(x)\triangleq uv v^R w \qquad \text {if x=uvw, |u|=i, |v|=k,} \end{aligned}$$
$$\begin{aligned} \mathcal {T}^{\textrm{pal}}_k \triangleq \big \{T^{\textrm{pal}}_{i,k} ~:~ i\geqslant 0\big \}. \end{aligned}$$
Example 2
Assume \(q=4\) and \(k=2\). In the palindromic duplication setting,where the underlined factor is the inserted palindromic duplication.
$$\begin{aligned} 01103203 \Longrightarrow _2 011032\underline{23}03, \end{aligned}$$
We now reach the second major concept—error-correcting codes.
Definition 1
An (n, M; t) duplication-correcting code, with respect to a set of duplication rules \(\mathcal {T}\), is a set \(C\subseteq \mathbb {Z}_q^n\), of size \(|C|=M\), such that for any two distinct codewords, \(c,c'\in C\),Here, t denotes the number of duplication errors the code can correct, where \(t\in \mathbb {N}\cup \{*\}\), and where \(t=*\) denotes any number of errors.
$$\begin{aligned} D^t(c)\cap D^t(c') = \emptyset . \end{aligned}$$
An important figure of merit associated with a code’s efficiency, is its rate. If C is an (n, M; t) duplication-correcting code over \(\mathbb {Z}_q\), then its rate is defined asThe largest size of such a code will be denoted by \(A_q(n;t)\), and codes attaining this upper bound will be called optimal. Finally, the coding capacity is defined asIn the remainder of the paper, we shall be focusing on the reverse-complement duplication channel and the palindromic duplication channel, both with duplication length k, over \(\mathbb {Z}_q\), and with an unbounded number of errors, i.e., \(t=*\). We shall therefore try to determine or bound \(A_q(n;*)^{\textrm{rc}}_k\) and \(R_q(*)^{\textrm{rc}}_k\) for the reverse-complement duplication channel, and \(A_q(n;*)^{\textrm{pal}}_k\) and \(R_q(*)^{\textrm{pal}}_k\) for the palindromic duplication channel.
$$\begin{aligned} R(C) \triangleq \frac{1}{n}\log _q M. \end{aligned}$$
$$\begin{aligned} R_q(t) \triangleq \limsup _{n\rightarrow \infty } \frac{1}{n}\log _q A_q(n;t). \end{aligned}$$
3 Reverse-complement duplication of length \(k=1\)
In this section we shall completely determine \(A_q(n;*)_1^{\textrm{rc}}\), the maximal size of a q-ary code of length n that is capable of correcting any number of reverse-complement duplications of length 1. Thus, we shall also find the coding capacity \(R_q(*)_1^{\textrm{rc}}\). Our method is constructive, and we will present codes of optimal size. The results depend on the alphabet size, and in particular, we shall divide our discussion to the case of binary alphabet, \(q=2\), and the case of larger alphabets of even size, \(q\geqslant 4\).
3.1 The binary case, \(q=2\)
We determine an exact condition for two binary strings to have a common descendant. As we shall show, this depends only on the first bit of the string. We start with a simple technical lemma.
Lemma 1
Let \(a\in \mathbb {Z}_2\) be a bit. Thenfor any \(w\in \mathbb {Z}_2^*\).
$$\begin{aligned} a \Longrightarrow ^*_1 a\overline{a}w \end{aligned}$$
Proof
We can writewhere b is either a or \(\overline{a}\), and where \(m_1\geqslant 1\) for all i. We then start with a, and duplicate that a \(m_1\) times,We then duplicate the last copy of \(\overline{a}\) a total of \(m_2\) times, soRepeating this process of duplicating the last bit to create a new run of a or \(\overline{a}\) results in the desired derivation\(\square \)
$$\begin{aligned} a\overline{a}w = a \overline{a}^{m_1} a^{m_2} \overline{a}^{m_3} a^{m_4} \dots b^{m_\ell }, \end{aligned}$$
$$\begin{aligned} a \Longrightarrow ^{m_1}_1 a \overline{a}^{m_1}. \end{aligned}$$
$$\begin{aligned} a \Longrightarrow ^{m_1}_1 a \overline{a}^{m_1} \Longrightarrow ^{m_2}_1 a \overline{a}^{m_1} a^{m_2}. \end{aligned}$$
$$\begin{aligned} a \Longrightarrow ^*_1 a\overline{a}w. \end{aligned}$$
We can now give our main characterization of strings with a common descendant.
Theorem 2
Let \(x,y\in \mathbb {Z}_2^+\). Thenif and only if \(x_0=y_0\).
$$\begin{aligned} D_1^*(x)\cap D_1^*(y)\ne \emptyset \end{aligned}$$
Proof
For the first direction, assume \(x_0\ne y_0\). Since no duplication can change the first bit of the string, it follows that all the strings in \(D_1^*(x)\) start with \(x_0\), whereas all those in \(D_1^*(y)\) start with \(y_0\). Hence, \(D_1^*(x)\cap D_1^*(y)=\emptyset \).
For the second direction, assume \(x_0=y_0=a\). Let us therefore writeWe distinguish between two cases. For the first case, assume \(y'=\varepsilon \), namely, \(y=a\). In that case we have,and by Lemma 1 we also haveHence, \(D_1^*(x)\cap D_1^*(y)\ne \emptyset \).
$$\begin{aligned} x&= a x',&y&= a y'. \end{aligned}$$
$$\begin{aligned} x=a x' \Longrightarrow _1 a \overline{a}x', \end{aligned}$$
$$\begin{aligned} y = a \Longrightarrow _1^* a \overline{a}x'. \end{aligned}$$
For the second case, assume \(y'\ne \varepsilon \), and write \(y'=y''b\), with \(b\in \mathbb {Z}_2\). We now observe the derivationswhere both of the \(\Longrightarrow _1^*\) derivations follow from Lemma 1. \(\square \)
$$\begin{aligned} x&=ax' \Longrightarrow _1^* a \overline{a}y'' b x' \Longrightarrow _1 a \overline{a}y'' b \overline{b}x', \\ y&=ay''b \Longrightarrow _1 a \overline{a}y'' b \Longrightarrow _1^* a \overline{a}y'' b \overline{b}x', \end{aligned}$$
We thus immediately deduce the following:
[Style2 Style1]Corollary 3
For all \(n\in \mathbb {N}\)and therefore,
$$\begin{aligned} A_2(n;*)_1^{\textrm{rc}} = 2, \end{aligned}$$
$$\begin{aligned} R_2(*)_1^{\textrm{rc}} = 0. \end{aligned}$$
Proof
Since in a binary \((n,M;*)_1^{\textrm{rc}}\) code we need to have pair-wise disjoint descendant cones for distinct codewords, by Theorem 2, we can have at most two codewords: one starting with a 0, and one starting with a 1. This is also easily achievable, e.g., the repetition code \(C=\{0^n,1^n\}\). Hence, the maximal size of a binary code in this case is \(A_2(n;*)_1^{\textrm{rc}}=2\). The coding capacity follows by definition. \(\square \)
3.2 The non-binary case, \(q\geqslant 4\)
In this section we assume a general non-binary alphabet, namely, \(\mathbb {Z}_q\), with \(q\geqslant 4\) even. We shall, once again, characterize the exact condition for two strings to have a joint descendant, thereby allowing us to find optimal codes as well as the coding capacity.
We introduce some new notation. Let \(a\in \mathbb {Z}_q\) be a letter. We definenamely, \(a^\oplus \) denotes the set of all strings starting with the letter a, and followed by any finite length string composed of a and \(\overline{a}\) only.
$$\begin{aligned} a^\oplus \triangleq a\big \{a,\overline{a}\big \}^*, \end{aligned}$$
If \(x\in \mathbb {Z}_q^+\) is a non-empty string, we can always decompose it intowhere \(a_i\in \mathbb {Z}_q\) for all i. We say this is a minimal decomposition if \(\ell \) is the smallest possible, thus necessarily, \(a_{i+1}\not \in \{a_i,\overline{a_i}\}\) for all i. If (1) is a minimal decomposition of x, we define the signature of x to be
$$\begin{aligned} x \in a_0^\oplus a_1^\oplus \dots a_{\ell -1}^\oplus , \end{aligned}$$
(1)
$$\begin{aligned} \sigma (x) \triangleq a_0 a_1 \dots a_{\ell -1}. \end{aligned}$$
Example 3
Assume \(q=4\), \(k=2\), and in \(\mathbb {Z}_q\) we have \(\overline{0}=3\) and \(\overline{1}=2\). ThenThis is because the minimal decomposition of 0110300203 is given by 0|11|0300|2|03, where the vertical lines show the relevant factors in the decomposition. Notice how
$$\begin{aligned} \sigma (0110300203) = 01020. \end{aligned}$$
$$\begin{aligned} 0\in 0^\oplus , \quad 11\in 1^\oplus , \qquad 0300\in 0^\oplus , \qquad 2\in 2^\oplus , \qquad 03\in 0^\oplus . \end{aligned}$$
We first observe that all strings in a descendant cone share the same signature.
Lemma 4
Let \(x\in \mathbb {Z}_q^+\) be a string. Then for all \(x'\in D^*_1(x)\),
$$\begin{aligned} \sigma (x)=\sigma (x'). \end{aligned}$$
Proof
Let \(x\in \mathbb {Z}_q^+\) be any string, and let its signature be \(\sigma (x)=a_0 a_1 \dots a_{\ell -1}\), with \(a_i\in \mathbb {Z}_q\) for all i. Furthermore, assume its minimal decomposition iswhere \(x^{(i)}\in a_i^\oplus \) for all i.
$$\begin{aligned} x = x^{(0)} x^{(1)} \dots x^{(\ell -1)}, \end{aligned}$$
Now let \(x''\) be obtained from x using a single reverse-complement duplication of length 1, namely, \(x\Longrightarrow _1 x''\). If the letter being duplicated resides in \(x^{(i)}\), then the effect of the duplication is merely extending \(x^{(i)}\) by one letter, without changing its first one. Thus,By a simple induction this may be extended to any number of derivations, and so for any \(x'\in D_1^*(x)\) we have \(\sigma (x)=\sigma (x')\). \(\square \)
$$\begin{aligned} \sigma (x'')=\sigma (x). \end{aligned}$$
The signature can also determine whether two strings have a common descendant, as the following theorem shows.
Theorem 5
Let \(x,y\in \mathbb {Z}_q^+\). Thenif and only if \(\sigma (x)=\sigma (y)\).
$$\begin{aligned} D_1^*(x)\cap D_1^*(y)\ne \emptyset \end{aligned}$$
Proof
For the first direction, assume \(\sigma (x)=\sigma (y)=a_0 a_1 \dots a_{\ell -1}\). That means we can writewhere \(x^{(i)},y^{(i)}\in a_i^\oplus \), for all i. We also note that \(x^{(i)}\) and \(y^{(i)}\) begin with the same letter, \(a_i\), and contain only letters from \(\{a_i,\overline{a_i}\}\). Hence, by Theorem 2, for each i, there exists \(z^{(i)}\in \mathbb {Z}_q^+\) such thatNamely,It then follows thatand soFor the other direction, assume \(D_1^*(x)\cap D_1^*(y)\ne \emptyset \), and let \(z\in D_1^*(x)\cap D_1^*(y)\) be a common descendant of x and y. By Lemma 4,which completes the proof. \(\square \)
$$\begin{aligned} x&= x^{(0)} x^{(1)} \dots x^{(\ell -1)},\\ y&= y^{(0)} y^{(1)} \dots y^{(\ell -1)}, \end{aligned}$$
$$\begin{aligned} z^{(i)}\in D_1^* (x^{(i)})\cap D_1^* (y^{(i)}). \end{aligned}$$
$$\begin{aligned} x^{(i)} \Longrightarrow _1^* z^{(i)}, \qquad \text {and}\qquad y^{(i)} \Longrightarrow _1^* z^{(i)}. \end{aligned}$$
$$\begin{aligned} x&= x^{(0)} x^{(1)} \dots x^{(\ell -1)} \Longrightarrow _1^* z^{(0)} z^{(1)} \dots z^{(\ell -1)},\\ y&= y^{(0)} y^{(1)} \dots y^{(\ell -1)} \Longrightarrow _1^* z^{(0)} z^{(1)} \dots z^{(\ell -1)},\\ \end{aligned}$$
$$\begin{aligned} D_1^*(x)\cap D_1^*(y)\ne \emptyset . \end{aligned}$$
$$\begin{aligned} \sigma (x) = \sigma (z) = \sigma (y), \end{aligned}$$
We note that Theorem 5 is a generalization of Theorem 2, since in the binary case, the signature of any non-empty string is just its first letter.
By knowing when descendant cones intersect, we can now construct q-ary \((n,M;*)^{\textrm{rc}}_1\) codes.
Construction 1
Let \(n\in \mathbb {N}\). We construct the following code:
$$\begin{aligned} C = \big \{a_0 a_1\dots a_{\ell -1}a_{\ell -1}^{n-\ell } ~:~ 1\leqslant \ell \leqslant n, \forall i, a_i\in \mathbb {Z}_q, a_{i+1}\notin \big \{a_i,\overline{a_i}\big \}\big \}. \end{aligned}$$
Theorem 6
The code C from Construction 1 is an optimal \((n, M; *)^{\textrm{rc}}_{1}\) code, with
$$\begin{aligned} M=q\cdot \frac{(q-2)^n-1}{q-3}. \end{aligned}$$
Proof
All the codewords of C are of length n. Consider a codeword \(c\in C\), withwhere for all i we have \(a_i\in \mathbb {Z}_q\) and \(a_{i+1}\not \in \{a_i,\overline{a_i}\}\). By definition, the signature of c isIt follows that distinct codewords of C have distinct signatures, and so by Theorem 5, they have disjoint descendant cones. Additionally, each possible signature appears in C, and so it is optimal. It remains to compute its size, M. Assuming we want a signature of length \(\ell \), we have q options for the first letters, and each subsequent letter has \(q-2\) options. Thus,\(\square \)
$$\begin{aligned} c=a_0 a_1\dots a_{\ell -1}a_{\ell -1}^{n-\ell }, \end{aligned}$$
$$\begin{aligned} \sigma (c) = a_0 a_1 \dots a_{\ell -1}. \end{aligned}$$
$$\begin{aligned} M=\sum _{\ell =1}^{n} q(q-2)^{\ell -1}=q\cdot \frac{(q-2)^n-1}{q-3}. \end{aligned}$$
We comment that decoding the code from Construction 1 is straightforward. Assume \(z\in \mathbb {Z}_q\) was received. We then compute its signature, which we denote by \(\sigma (z)=a_0 a_1 \dots a_{\ell -1}\), with \(a_i\in \mathbb {Z}_q\) for all i. Then the decoding function returns \(a_0 a_1 \dots a_{\ell -1} a_{\ell -1}^{n-\ell }\in C\). The entire process runs in \(O(|z|)\) time, i.e., linear in the length of the received string.
Finally, we obtain the following:
[Style2 Style1]Corollary 7
For all \(n\in \mathbb {N}\) and even \(q\in \mathbb {N}\), \(q\geqslant 4\),and therefore
$$\begin{aligned} A_q(n;*)^{\textrm{rc}}_1 = q\cdot \frac{(q-2)^n-1}{q-3}, \end{aligned}$$
$$\begin{aligned} R_q(*)^{\textrm{rc}}_1 = \log _q (q-2). \end{aligned}$$
Proof
The claims follow directly from the size of the optimal code proved in Theorem 6. \(\square \)
4 Reverse-complement duplication of length \(k\geqslant 2\)
We shall now study codes that correct any number of reverse-complement duplication errors of length \(k\geqslant 2\). We will show that the number of codewords in such a code is upper bounded by a constant, therefore implying a vanishing coding capacity.
Our proof strategy consists of identifying a property of strings that ensures the existence of a common descendant. We call this the k-summary of the string. Since only a constant number of summaries are possible, the main claim will follow.
Let \(w\in \mathbb {Z}_q^*\) be a string. We say that \(a\in \mathbb {Z}_q\) is the i-th letter from the end in w if we can writewith \(w',w''\in \mathbb {Z}_q^*\) and \(|w''|=i\). In particular, the 0-th letter from the end of w is the last letter of w. We also require the following technical tool from [1]: (whose proof we bring for completeness only)
$$\begin{aligned} w = w' a w'' \end{aligned}$$
Lemma 8
( [1, Lemma 2]) Let \(k\geqslant 2\), \(x \in \mathbb {Z}_q^*, |x|\geqslant k + 1\), and let \(i \geqslant 2\) be an integer. If the i-th letter from the end of x is a, then there exists \(x'\in D^2_k(x)\), such that a is the \((i-2)\)-nd letter from the end of \(x'\).
Proof
We can write \(x=uvbcw\), where \(u,w\in \mathbb {Z}_q^*\), \(v\in \mathbb {Z}_q^{k-1}\), \(b,c\in \mathbb {Z}_q\), \(|w|\leqslant i-2\), and v contains the letter a that is ith from the end of s. We consider the following sequence of derivations:where the underline shows the duplicated parts. Observe that the letter a in v that was ith from the end in x, is \((i-2)\)nd from the end of \(x'\). \(\square \)
$$\begin{aligned} x=uvbcw \Longrightarrow _k uvb\underline{\overline{b}\overline{v}^R} cw \Longrightarrow _k uvb\overline{b}\overline{v}^R c\underline{\overline{c}v}w=x',\end{aligned}$$
An important concept of subsequences is now defined.
Definition 2
Let \(k\geqslant 2\) be an integer, \(x\in \mathbb {Z}_q^k\) be a string of length k, and let \(y\in \mathbb {Z}_q^*\) be any string. We say x is properly spaced in y if both of the following hold:
1.
x is a subsequence of y, namely, there exist \(y^{(0)}, y^{(1)}, y^{(2)},\dots ,y^{(k)}\in \mathbb {Z}_q^*\) such that
$$\begin{aligned} y = y^{(0)} x_0 y^{(1)} x_1 y^{(2)} \dots y^{(k-1)} x_{k-1} y^{(k)}. \end{aligned}$$
2.
If k is even, then
$$\begin{aligned} \left|y^{(1)}\right| \equiv \left|y^{(2)}\right| \equiv \dots \equiv \left|y^{(k)}\right| \equiv 0 \pmod {2}. \end{aligned}$$
Example 4
Consider the string \(y=0110300203\). The string 323 is properly spaced in y, as shown by the underlined locations \(0110\underline{3}00\underline{2}0\underline{3}\) (note that \(k=|323|=3\) is odd in this case). The string 32 is also properly spaced in y, as we see in \(0110\underline{3}00\underline{2}03\) (this time \(k=|32|=2\) is even). However, 23 is not properly spaced in y, since the length of the factor between 2 and 3 in y is odd, and \(k=|23|=2\) is even.
We observe that the property of being properly-spaced persists in all descendants.
Lemma 9
Let \(k\geqslant 2\), \(x\in \mathbb {Z}_q^k\), and \(y\in \mathbb {Z}_q^*\), such that x is properly spaced in y. If \(y'\in D^*_k(y)\), then x is also properly spaced in \(y'\).
Proof
When a string is properly spaced, we can extract it as a suffix in one of the descendants, as the following lemma shows.
Lemma 10
Let \(k\geqslant 2\), \(x\in \mathbb {Z}_q^k\), and \(y\in \mathbb {Z}_q^*\), such that x is properly spaced in y. Then there exists \(v\in \mathbb {Z}_q^*\) such that
$$\begin{aligned} y \Longrightarrow ^*_k vx. \end{aligned}$$
Proof
By definition, we necessarily have \(|y|\geqslant k\). We consider two cases, depending on the parity of k.
If k is even, the last letter of x, namely, \(x_{k-1}\), is at an even distance from the end. If it is at the end, we duplicate the k-suffix twice (this does not change the last letter, but ensures at least one duplication was employed). Otherwise, we can use Lemma 8 repeatedly to get \(x_{k-1}\) to the end, with each application the distance of \(x_{k-1}\) to the end diminishes by 2. Thus, we havefor some \(v^{(k-1)}\in \mathbb {Z}_q^*\). Since at least one duplication was used, we haveBy Lemma 9, x is also properly spaced in \(v^{(k-1)}x_{k-1}\). That means \(x_{k-2}\) is at an even distance from the end of \(v^{(k-1)}\). We ignore the last letter, \(x_{k-1}\), and repeat the process above to push the letter \(x_{k-2}\):Continuing in this manner for all the letters of x we finally obtainfor some \(v\in \mathbb {Z}_q^*\).
$$\begin{aligned} y \Longrightarrow ^*_k v^{(k-1)} x_{k-1}, \end{aligned}$$
$$\begin{aligned} \left|v^{(k-1)}x_{k-1}\right| \geqslant |y|+k \geqslant 2k\geqslant k+2. \end{aligned}$$
$$\begin{aligned} v^{(k-1)}x_{k-1} \Longrightarrow ^*_k v^{(k-2)}x_{k-2}x_{k-1}. \end{aligned}$$
$$\begin{aligned} y \Longrightarrow ^*_k vx \end{aligned}$$
If k is odd, we repeat the same proof as in the case of k even. The only difference is that we are not guaranteed that the letter we are trying to push, \(x_i\), is at an even distance from the end. If that is not the case, and the distance is odd, we simply perform a duplication on the suffix, thus making the distance even. \(\square \)
Another technical helpful tool allows us to synchronize derivations in certain cases, and produce a common descendant.
Lemma 11
Let \(k\geqslant 2\), \(x\in \mathbb {Z}_q^k\), and \(y\in \mathbb {Z}_q^*\), such that x is properly spaced in y. Then there exists \(v\in \mathbb {Z}_q^*\) such that
$$\begin{aligned} y&\Longrightarrow ^*_k vx,&yx \Longrightarrow ^*_k vx. \end{aligned}$$
Proof
By Lemma 10, there exists \(u\in \mathbb {Z}_q^*\) such thatContinuing this derivation we getwhere we define \(v\triangleq ux\overline{x}^R\).
$$\begin{aligned} y \Longrightarrow ^*_k ux. \end{aligned}$$
(2)
$$\begin{aligned} y \Longrightarrow ^*_k ux \Longrightarrow ^2_k ux\overline{x}^R x = vx, \end{aligned}$$
Now, starting with the same derivation as in (2), we getwhich completes the proof. \(\square \)
$$\begin{aligned} yx \Longrightarrow ^*_k uxx \Longrightarrow _k ux\overline{x}^R x = vx, \end{aligned}$$
The following definition provides the property used by our main theorem in this section.
Definition 3
Let \(k\in \mathbb {N}\), \(k\geqslant 2\), let \(\ell \geqslant 0\) be an integer, and let \(y\in \mathbb {Z}_q^{\ell k}\) be a string. Partition y into k-factors,with \(y^{(i)}\in \mathbb {Z}_q^k\) for all i. DefineThen the summary of y, denoted \(\Phi (y)\) is defined as
$$\begin{aligned} y = y^{(0)} y^{(1)} \dots y^{(\ell -1)}, \end{aligned}$$
$$\begin{aligned} \phi (y^{(i)}) \triangleq {\left\{ \begin{array}{ll} \varepsilon & \text {if there exists } i'<i \text { such that } y^{(i')}=y^{(i)},\\ y^{(i)} & \text {otherwise.} \end{array}\right. } \end{aligned}$$
$$\begin{aligned} \Phi (y) \triangleq \phi (y^{(0)}) \phi (y^{(1)}) \dots \phi (y^{(\ell -1)}). \end{aligned}$$
Intuitively, to compute the summary of string, we scan its k-factors from left to right. Each k-factor that appears for the first time is appended to the summary, whereas k-factors that have already appeared before are removed. It follows that the summary of a string contains a partition into distinct k-factors, and whose total length is upper bounded by \(kq^k\).
Example 5
Assume \(q=4\) and \(k=2\). Let \(y=011011013030023003\). Its partition into 2-factors is 01|10|11|01|30|30|02|30|03. If we underline repeated 2-factors in the partition we get \(01|10|11|\underline{01}|30|\underline{30}|02|\underline{30}|03\). By removing the repeated 2-factors from the partition, we obtain the summary of y,We observe that while 01 appears twice as a 2-factor in the summary, \(\underline{01}1\underline{01}1300203\), the second appearance is not removed since it does not appear in the partition.
$$\begin{aligned} \Phi (y)=011011300203. \end{aligned}$$
In the main theorem for this section, we show that strings which have the same summary (up to a short prefix perhaps) also have a common descendant.
Theorem 12
Let \(k\geqslant 2\), and \(x,y\in \mathbb {Z}_q^*\). Writewherei.e., \(0\leqslant |x'|,|y'|\leqslant k-1\). If \(x'=y'\) and \(\Phi (x'')=\Phi (y'')\), then
$$\begin{aligned} x&= x' x''&y&= y' y'', \end{aligned}$$
$$\begin{aligned} \left|x'\right|&= \left|x\right| \bmod k,&\left|y'\right|=\left|y\right| \bmod k, \end{aligned}$$
$$\begin{aligned} D^*_k(x)\cap D^*_k(y)\ne \emptyset . \end{aligned}$$
Proof
We shall actually prove that \(x''\) and \(y''\) have a common descendant, namely,Then, by prepending the prefix \(x'=y'\), we shall trivially deduce the desired claim,Thus, we shall forget about the prefix \(x'=y'\) from now on. Other simple cases we easily dismiss are that of \(x=y\), as well as the case of \(\Phi (x'')=\Phi (y'')=\varepsilon \), since then, again, \(x=y\).
$$\begin{aligned} D^*_k(x'')\cap D^*_k(y'') \ne \emptyset . \end{aligned}$$
$$\begin{aligned} D^*_k(x'x'')\cap D^*_k(y'y'') \ne \emptyset . \end{aligned}$$
Let us now partition \(x''\) and \(y''\) into k-factors,where \(x^{(i)}, y^{(j)}\in \mathbb {Z}_q^k\) for all i and j. The remainder of the proof shall proceed in iterations. We initialize two indices, \(i_x=i_y=0\). We shall, at each iteration, make sure thatand that there exists some \(u\in \mathbb {Z}_q^*\) such thatObviously, before we begin, when \(i_x=i_y=0\), both (3) and (4) hold, since for (3) we have \(\Phi (\varepsilon )=\varepsilon \), and for (4) choosing \(u=\varepsilon \) gives the trivial \(\varepsilon \Longrightarrow ^*_k \varepsilon \). We distinguish between the following cases:
$$\begin{aligned} x''&= x^{(0)} x^{(1)} \dots x^{(\ell _x-1)},&y''&= y^{(0)} y^{(1)} \dots y^{(\ell _y-1)}, \end{aligned}$$
$$\begin{aligned} \Phi (x^{(0)} \dots x^{(i_x-1)}) = \Phi (y^{(0)} \dots y^{(i_y-1)}), \end{aligned}$$
(3)
$$\begin{aligned} x^{(0)} \dots x^{(i_x-1)}&\Longrightarrow ^*_k u,&y^{(0)} \dots y^{(i_y-1)}&\Longrightarrow ^*_k u. \end{aligned}$$
(4)
Case 1 \(i_x<\ell _x\), \(i_y<\ell _y\) and \(x^{(i_x)}=y^{(i_y)}\). Since (3) holds, we also have,Additionally, since (4) holds, we obviously haveThus, we can increase both \(i_x\) and \(i_y\) by 1, while maintaining (3) and (4).
$$\begin{aligned} \Phi (x^{(0)} \dots x^{(i_x)}) = \Phi (y^{(0)} \dots y^{(i_y)}). \end{aligned}$$
$$\begin{aligned} x^{(0)} \dots x^{(i_x-1)} x^{(i_x)}&\Longrightarrow ^*_k u x^{(i_x)},&y^{(0)} \dots y^{(i_y-1)} y^{(i_y)}&\Longrightarrow ^*_k u y^{(i_y)}. \end{aligned}$$
Case 2 \(i_x<\ell _x\), \(i_y<\ell _y\) and \(x^{(i_x)}\ne y^{(i_y)}\). Since (3) holds, and also \(\Phi (x'')=\Phi (y'')\), there exists \(i'_x<i_x\) such that \(x^{(i'_x)}=x^{(i_x)}\), or there exists \(i'_y<i_y\) such that \(y^{(i'_y)}=y^{(i_y)}\). Let us assume the former (if the latter holds, the proof is symmetric). We then haveBy definition, we observe that \(x^{(i_x)}=x^{(i'_x)}\) is properly spaced in \(x^{(0)}\dots x^{(i_x-1)}\). By Lemma 9, this is true also in any descendant of \(x^{(0)}\dots x^{(i_x-1)}\), in particular, u. By Lemma 11, there exists \(u'\in \mathbb {Z}_q^*\) such thatWe therefore haveIt follows that we can increase \(i_x\) by 1, while maintaining (3) and (4).
$$\begin{aligned} \Phi (x^{(0)} \dots x^{(i_x)}) = \Phi (x^{(0)} \dots x^{(i_x-1)}) = \Phi (y^{(0)} \dots y^{(i_y-1)}). \end{aligned}$$
$$\begin{aligned} u&\Longrightarrow ^*_k u',&u x^{(i_x)}&\Longrightarrow ^*_k u'. \end{aligned}$$
$$\begin{aligned} x^{(0)} \dots x^{(i_x-1)} x^{(i_x)}&\Longrightarrow ^*_k u x^{(i_x)}\Longrightarrow ^*_k u',&y^{(0)} \dots y^{(i_y-1)}&\Longrightarrow ^*_k u \Longrightarrow ^*_k u'. \end{aligned}$$
Case 3 \(i_x < \ell _x\) and \(i_y = \ell _y\). In this case we have \(y^{(0)}\dots y^{(i_y-1)}=y''\). Since \(\Phi (y'')=\Phi (x'')\), there exists \(i'_x<i_x\) such that \(x^{(i'_x)}=x^{(i_x)}\). The remainder of this case is the same as Case 2.
Case 4 \(i_x = \ell _x\) and \(i_y < \ell _y\). This case is symmetric to Case 3.
Case 5 \(i_x=\ell _x\) and \(i_y=\ell _y\). When we reach this case, u from (4) is the desired common descendant \(x''\) and \(y''\).
Finally, we observe that we shall always reach Case 5, since in each of the other cases, \(i_x\) or \(i_y\) is increased by 1. This completes the proof. \(\square \)
With the help of Theorem 12 we obtain the following corollary.
[Style2 Style1]Corollary 13
For any \(k\in \mathbb {N}\), \(k\geqslant 2\), any \(n\in \mathbb {N}\) and any even \(q\in \mathbb {N}\),and therefore
$$\begin{aligned} A_q(n;*)^{\textrm{rc}}_k \leqslant q^{k-1}\cdot \frac{q^{kq^k+k}-1}{q^k-1}, \end{aligned}$$
$$\begin{aligned} R_q(*)^{\textrm{rc}}_k = 0. \end{aligned}$$
Proof
By Theorem 12, a code cannot contain two codewords with the same summary and the same prefix of length at most \(k-1\). The number of prefixes is upper bounded by \(q^{k-1}\). As for the summaries, each summary is a string of length \(\ell k\), for some \(0\leqslant \ell \leqslant q^k\). Thus,Substituting this into the definition of coding capacity we getas claimed. \(\square \)
$$\begin{aligned} A_q(n;*)^{\textrm{rc}}_k \leqslant q^{k-1} \sum _{\ell =0}^{q^k} q^{k\ell } = q^{k-1}\cdot \frac{q^{kq^k+k}-1}{q^k-1}. \end{aligned}$$
$$\begin{aligned} R_q(*)^{\textrm{rc}}_k = 0, \end{aligned}$$
5 Palindromic duplication
Having determined the coding capacity of the reverse-complement duplication channel in the previous sections, we now turn to the simpler case of palindromic duplication. In this case, factors are duplicated in reverse, but without applying the complement.
We first look at the case of duplication of length \(k=1\). In this case, reversing does nothing, and so this duplication is the same as tandem duplication of length \(k=1\), which was already studied in [11].
[Style2 Style1]Corollary 14
For any \(n,q\in \mathbb {N}\), \(q\geqslant 2\),and therefore
$$\begin{aligned} A_q(n;*)^{\textrm{pal}}_1 = q\sum _{i=0}^{n-1}(q-1)^i = {\left\{ \begin{array}{ll} q \cdot \frac{(q-1)^n-1}{q-2} & q\geqslant 3, \\ qn & q=2, \end{array}\right. } \end{aligned}$$
$$\begin{aligned} R_q(*)^{\textrm{pal}}_1 = \log _q (q-1). \end{aligned}$$
Proof
Since a palindromic duplication of length 1 is the same as a tandem duplication of length 1, we look at the known results for tandem duplication. According to [11, Theorems 15,16], the optimal codes in these cases contain all the strings whose adjacent positions contain distinct letters, except perhaps in the suffix. Choosing the first letter has q options, followed by \(0\leqslant i\leqslant n-1\) letters distinct from their predecessors in the string for a total of \((q-1)^i\) options, finally followed by repeating the last letter chosen to get a string of length n. \(\square \)
We note that the results are quite similar to Corollaries 3 and 7. In the binary case the coding capacity vanishes in both types of duplication, whereas in the non-binary case, it is slightly higher in palindromic duplication, compared with reverse-complement duplication.
Moving on to longer duplication lengths, \(k\geqslant 2\), we observe that the complement operation plays no role whatsoever in Sect. 4. We do need to replace Lemma 8, which we do easily by adapting [1, Lemma 2].
Lemma 15
Let \(k\geqslant 2\), \(x \in \mathbb {Z}_q^*, |x|\geqslant k + 1\), and let \(i \geqslant 2\) be an integer. If the i-th letter from the end of x is a, then there exists \(x'\in D^{2}_k(x)\), such that a is the \((i-2)\)-nd letter from the end of \(x'\), and where \(D(\cdot )\) denotes descendants with respect to palindromic duplication.
Proof
Let us write \(x=uvbcw\), where \(u,v,w\in \mathbb {Z}_q^*\), \(|v|=k-1\), \(b,c\in \mathbb {Z}_q\), and where a, the i-th letter from the end of x, occurs in v. Then,where \(\Longrightarrow \) denotes derivation with respect to palindromic duplication. Thus, \(x'\) contains a as the \((i-2)\)-nd letter from the end. \(\square \)
$$\begin{aligned} x=uvbcw \Longrightarrow _k uvbbv^R cw \Longrightarrow _k uvbbv^R ccvw \triangleq x', \end{aligned}$$
We now obtain the palindromic analogue of Corollary 13.
[Style2 Style1]Corollary 16
For any \(k\in \mathbb {N}\), \(k\geqslant 2\), any \(n\in \mathbb {N}\) and any even \(q\in \mathbb {N}\),and therefore
$$\begin{aligned} A_q(n;*)^{\textrm{pal}}_k \leqslant q^{k-1}\cdot \frac{q^{kq^k+k}-1}{q^k-1}, \end{aligned}$$
$$\begin{aligned} R_q(*)^{\textrm{pal}}_k = 0. \end{aligned}$$
Proof
The vanishing capacity for palindromic duplication with length \(k\geqslant 2\), contradicts the code constructions given in [23]. All of these constructions have positive asymptotic rates (provided the alphabet size is not too small). To show that these constructions are unfortunately flawed, we present counter examples based on our proof technique. The counter examples take on the simple form of showing two codewords that have a common descendant, thus, contradicting the definition of an error-correcting code. In all of the cases, these two codewords have the same summary, and hence, by the palindromic version of Theorem 12, have a common descendant. We give a full description in the first example, including the exact derivation of the common descendant, while for the rest of the examples we only provide the two codewords. All of the examples are not necessarily the smallest counter examples.
Example 6
Codes correcting palindromic duplications of length 2: The code presented in [23, Theorem 2] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 2, nor a tandem duplication of length at most 2. Take \(q=4\) and \(n=8\), then 01020301 and 01020302 both satisfy the conditions, and so are codewords. However, they both have the same 2-summary, 010203, and therefore, must have a common descendant with respect to palindromic duplication of length 2. In this case, a common descendant for both is:This contradicts the claim that the code can correct such duplications. The derivation of the common descendant from each of these codewords is presented in detail in Appendix A.
$$\begin{aligned} 01001221120020021121120023323320023323320021121120020022011011022002. \end{aligned}$$
The remaining counter examples all share the same flavor, and we bring only a short description of them without the full derivation of the common descendants.
Example 7
Codes correcting palindromic duplications of length 3: The code presented in [23, Theorem 4] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 3, nor a tandem duplication of length at most 3. Take \(q=5\) and \(n=12\), then 012013014012 and 012013014013 both satisfy the conditions, and so are codewords. However, they both have the same 3-summary, 012013014, and therefore, must have a common descendant with respect to palindromic duplication of length 3.
Example 8
Codes correcting any combination of palindromic duplications of length 2 and tandem duplications of length 2: The code presented in [23, Section IV.A] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 2, nor a tandem duplication of length at most 2, and nor a factor of the form abcb, with \(a,b,c\in \mathbb {Z}_q\) being distinct letters. Take \(q=6\) and \(n=14\), then 01450245034501 and 01450245034502 both satisfy the conditions, and so are codewords. However, they both have the same 2-summary, 01450203, and therefore, must have a common descendant with respect to palindromic duplication of length 2. Note that we do not even use any tandem duplication in the process of deriving the common descendant.
Example 9
Codes correcting any combination of palindromic duplications of length 3 and tandem duplications of length 3: The code presented in [23, Section IV.B] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 3, nor a tandem duplication of length at most 3, and nor a factor of the form aba or abca, with \(a,b,c\in \mathbb {Z}_q\) being distinct letters. Take \(q=8\) and \(n=21\), then 012567013567014567012 and 012567013567014567013 both satisfy the conditions, and so are codewords. However, they both have the same 3-summary, 012567013014, and therefore, must have a common descendant with respect to palindromic duplication of length 3. Note that we do not even use any tandem duplication in the process of deriving the common descendant.
Example 10
Codes correcting any combination of palindromic duplications of length \(\leqslant 2\) and tandem duplications of length \(\leqslant 2\): The code presented in [23, Theorem 5] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 2, nor a tandem duplication of length at most 2, and nor a factor of the form aba or abca, with \(a,b,c\in \mathbb {Z}_q\) being distinct letters. The same codewords as in Example 8 provide a counter example. Here, it is sufficient to use palindromic duplication of length 2 to derive the common descendant.
Example 11
Codes correcting any combination of palindromic duplications of length \(\leqslant 3\) and tandem duplications of length \(\leqslant 3\): The code presented in [23, Theorem 7] contains, among other things, all the strings from \(\mathbb {Z}_q^n\) that do not contain a palindromic duplication of length 3, nor a tandem duplication of length at most 3, and nor a factor of the form aba or abca or abcda or abcdea, with \(a,b,c,d,e\in \mathbb {Z}_q\) being distinct letters. The same codewords as in Example 9 provide a counter example. Here, it is sufficient to use palindromic duplication of length 3 to derive the common descendant.
6 Conclusion
In this work, we studied the parameters of duplication-correcting codes, both in the reverse-complement and the palindromic settings. We determined \(A_q(n;*)^{\textrm{rc}}_1\) and constructed optimal codes. We then showed that the coding capacity, \(R_q(n;*)^{\textrm{rc}}_k\), \(k\geqslant 2\), is vanishing. A similar result for \(k\geqslant 2\) and palindromic duplication was also proved.
While having a vanishing coding capacity is disappointing, when limiting ourselves to a single duplication of length k, the coding capacity is 1 (e.g., by treating this duplication as a single burst insertion and using the codes from [2, 15, 20]). Thus, finding the trade-off between coding capacity and the number of correctable errors remains an open question beyond the scope of this paper. Additionally, from a mathematical perspective, it is interesting to note that, in contrast with reverse-complement and palindromic duplications, for tandem duplication (i.e., the duplicated part is neither reversed, nor complemented) the coding capacity is positive [11].
Other open questions remain. We do not yet know how to construct \((n,M;t)^{\textrm{rc}}_k\) and \((n,M;t)^{\textrm{pal}}_k\) for a finite t. If \(t=1\), namely, a single duplication error is to be corrected, then [1] constructed reverse-complement duplication-correcting codes when k is odd. More generally, we can use a burst-insertion-correcting code, e.g., [2, 15, 20]. However, when \(t\geqslant 2\) no solution is known except using a tk-insertion correcting code, which is, most likely, far from optimal. We leave finding such codes, determining the optimal code size as well as the coding capacity, for future work.
Acknowledgements
This work was supported in part by the US National Science Foundation (NSF), Grant CCF-5237372, and by the Zhejiang Lab BioBit Program (Grant No. 2022YFB507). The author M. Schwartz is currently on a leave of absence from Ben-Gurion University of the Negev.
Declarations
Conflict of interest
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.