The block compression is very similar in spirit to pair compression, the only additional difficulty is the fact that blocks can be long, i.e. up to exponential. In case of letters without a crossing block, the compression can be done as in the case of noncrossing pairs, i.e. by replacing explicit blocks in G and adding some transitions to N. The case of a letter with a crossing block is reduced to the simpler case of letter without such a block: similarly to the compression of crossing pairs, we need to ‘pop’ letters from the beginning and the end of a nonterminal. However, this time popping one letter is not enough, we need to remove the whole a-prefix and a-suffix.
4.3.2 Removing Crossing Blocks of a Letter
It was already mentioned that a has a crossing block if and only if aa is a crossing pair. We know a method transforming a crossing pair to a noncrossing pair, see Lemma 5, however, it essentially assumes that the pair consists of two different letter: in short, when aX
i
appears in the rule and we left-pop a letter a from X
i
, then the pair aa is still crossing, whenever \(\operatorname {val}(X_{i})\) still begins with a. This is fixed in the most straightforward manner: we keep left-popping the letters from X
i
, until the first letter of \(\operatorname {val}(X_{i})\) is different from a. In other words, it is enough to remove each nonterminal’s a-prefix (and a-suffix). To be more precise: fix i and let \(\operatorname {val}(X_{i}) = a^{\ell_{i}} u a^{r_{i}}\), where u does not start nor end with a. Then our goal is to modify G so that \(\operatorname {val}(X_{i}') = u\). (If \(\operatorname {val}(X_{i})\) is a power of a, we simply give u=ϵ and r
i
=0.) This can be done in a bottom-up fashion, by two subprocedures, one of them removes the prefix, the other the suffix; these subprocedures work similarly as LeftPop and RightPop. We need to modify the NFA accordingly: it is enough to replace the transition labelled with X
i
by path consisting of three transitions, labelled with \(a^{\ell_{i}}\), \(X_{i}'\) and \(a^{r_{i}}\).
The removed a-prefixes and a-suffixes can be exponentially long, and so we store them in the rules in a succinct way, i.e. a
ℓ
is represented as (a,ℓ); the size of representation of ℓ is \(\mathcal {O}(\log \ell)\), i.e. \(\mathcal {O}(n)\), see Remark 1. We say that such a grammar is in an a-succinct form. The NFA N might have transitions labelled with a
ℓ
, which are stored in succinct way as well. We say that N satisfies a-relaxed (Aut 1), if its transitions are labelled by nonterminals, a single letter or by a
ℓ
, where ℓ≤2
n
. The semantics of the new automaton should be clear: it can traverse the transition labelled with a
ℓ
when consuming the a
ℓ
from the beginning of the input word. Similarly, this automaton is deterministic, if for any two transitions, labelled with α and β, originating from the same state, the first letters of \(\operatorname {val}(\alpha)\) and \(\operatorname {val}(\beta)\) are different.
Before stating the appropriate algorithms, we show that even with such a succinct representation, the number of different lengths of maximal blocks of a is also linearly bounded, similarly as in Lemma 3.
Proof
We first explain, how to calculate the a-prefix \(a^{\ell_{i}}\) of \(\operatorname {val}(X_{i})\): since G is in a-succinct form, this might be non-obvious. It is enough to scan the explicit strings stored in the productions’ right-hand sides, summing the lengths of the consecutive a’s appearances. This clearly works in polytime also for G stored in an a-succinct form, as the powers of a may have length at most 2
n
, see Remark 1, and so the length of their representation is linear in n (the correctness of this approach is shown later).
The running time of CutPref(a) is in polytime, as the loop has n iterations and all lines can be performed in polytime.
Concerning the preservation of the invariants: as in each rule there are at most two nonterminals and each nonterminal introduces at most one a’s block to the rule (in line 5 of CutPref) in each rule of the grammar at most 2 maximal blocks of a are introduced. They may be long, however, in compressed form we treat them as singular symbols. In this way rules of G are stored in an a-succinct form, which was explicitly allowed. Then the a-prefix is removed and if X
i
defines ϵ, it is removed from the right-hand sides of the productions. This does not affect the (SLP 1)–(SLP 2) (recall that a is not $, neither #). Since the NFA is also changed, we inspect the invariants regarding N: introducing new states p
1 and replacing transition δ
N
(p,X
i
,q) by two transitions \(\delta_{N'}(p,a^{\ell_{i}},p_{1})\), δ
N′(p
1,X,q
1) preserves the (Aut 1)–(Aut 2), with the exception that it a-relaxes (Aut 1); the same holds in the case, when α=ϵ and the transition δ(p,X
i
,q) is replaced with \(\delta(p,a^{\ell}_{i},q)\).
Notice that if N is deterministic, so is N′: as already mentioned the only change done to N is the replacement of transition by X
i
by a path of two transitions \(a^{\ell_{i}}\) and \(X_{i}'\), such that the first letter of \(\operatorname {val}(X_{i})\) is a and the state in the middle have exactly one incoming and outgoing transition. This preserves determinism of the automaton: Let X
i
label a transition from p to q and let p
1 be the new state in the middle. Then p
1 has one outgoing transition, furthermore, the first letter of the word defined by the label of the transition from p′ to p
1 is a, so the same as it used to be for X
i
which led from p to q. As N was deterministic, no other transition from p had a as the first letter, and so also no such transition, except the one to p
1, exists in N′.
To show the correctness, we prove by induction on
i the two main claims of the lemma:
-
CutPref correctly calculates the length of the a-prefix of \(\operatorname {val}(X_{i})\), i.e. ℓ
i
,
-
\(\operatorname {val}(X_{i}) = a^{\ell_{i}} \operatorname {val}(X_{i}')\).
For i=1 notice that the whole production for X
1 is stored explicitly, and so CutPref correctly calculates the a-prefix of \(\operatorname {val}(X_{1})\) and after its removal, \(\operatorname {val}(X_{1}) = a^{\ell_{1}} \operatorname {val}(X_{1}')\). Furthermore, as each X
1 in the rules was replaced with \(a^{\ell_{1}} \operatorname {val}(X_{1}')\), the \(\operatorname {val}(X_{j})\) for j>i is not changed.
For the induction step, let
X
i
→
uX
j
vX
k
w. Then by the induction assumption:
There are cases to consider, depending on whether
\(\operatorname {val}(X_{j}') = \epsilon\) or not and whether
\(\operatorname {val}(X_{k}') = \epsilon\) or not. We describe the one with
\(\operatorname {val}(X_{j}') = \epsilon\) and
\(\operatorname {val}(X_{k}') \neq \epsilon\), other cases are treated in a similar way. Then the rule is rewritten as
\(u a^{\ell_{j}} v a^{\ell_{k}} \operatorname {val}(X_{k}') w\). Since by the inductive assumption,
a is not the first letter of
\(\operatorname {val}(X_{k}')\), the
a prefix of
\(\operatorname {val}(X_{i}')\) is the
a-prefix of
\(u a^{\ell_{j}} v a^{\ell_{k}}\). Thus it is correctly calculated by
CutPref. As the last action,
CutPref removes the
a-prefix from the rule, which shows that
\(a^{\ell_{i}}\operatorname {val}(X_{i}')= \operatorname {val}(X_{i})\).
It is left to show that N accepts \(\operatorname {val}(X_{n})\) if and only if N′ accepts \(\operatorname {val}(X_{n}')\). To this end, notice that the only modification to N is the replacement of the transition of the form δ
N
(p,X
i
,q) by a path labelled with \(a^{\ell_{i}},X_{i}'\) (or by \(a^{\ell_{i}}\) alone). Furthermore, the vertex inside this path has only one incoming and one outgoing transition. The path labelled with \(a^{\ell_{i}},X_{i}'\) defines the string \(a^{\ell_{i}}\operatorname {val}(X_{i}')\), which was already shown to be \(\operatorname {val}(X_{i})\). It is left to observe that the newly introduced state in the middle of the path is not accepting, nor starting. Hence the starting (accepting) states of N and N′ coincide, and so each string is accepted by N if and only if it is accepted by N′. □
A symmetric algorithm CutSuff, which removes the a-suffix, is also defined; it has similar properties as CutPref.
Accordingly to the intuition provided at the beginning of this subsection, and similarly as in the case of the crossing pairs and procedures LeftPop, RightPop, it can be easily shown that after applying CutPref(a) and CutSuff(a) the letter a no longer has crossing blocks.
Since after CutPref and CutSuff letter a no longer has crossing blocks, we may compress its maximal blocks using BlockCompNcr. Some small twitches are needed to accommodate the a-succinct form of G and the fact that N is a-relaxed: the non-trivial part of BlockCompNcr was the application of Lemma 1, which works for such large powers of a in npolytime, see Lemma 1. Other actions of BlockCompNcr generalise in a simple way.
Proof
By Lemma 13, there are only |
G|+4
n possible lengths of
a’s maximal blocks and they can be calculated in
polytime; hence line 1 of
BlockCompNcr takes
polytime even in this extended setting. All the loops in
BlockCompNcr have only polynomially many iterations. All operations listed in
BlockCompNcr are elementary and can be clearly performed in
polytime, except replacing each
a
ℓ
by
a
ℓ
in a rule in line 4 and for the verification in line 6. For the former operation, it is enough to read the rules of the grammar: recall that
a
ℓ
is represented as a pair (
a,
ℓ). Since
ℓ≤2
n
, addition of the lengths can be performed in
polytime. The verification is more involved, we outline how to perform it: For given two states
p,
q and a string
a
ℓ
we want to verify, whether there is a path from
p to
q for a string
a
ℓ
. The transitions of the NFA are labelled either with single letters, powers of
a (not larger then 2
n
) or by nonterminals; however, from Lemma 16 it follows that for each nonterminal
X
i
nor the first, nor the last letter of
\(\operatorname {val}(X_{i})\) is
a. Thus, a path for
a
ℓ
can use only transitions labelled with powers of
a (or a single letter
a). Hence, our problem can be rephrased as a fully compressed membership problem for NFA over a unary alphabet: it is enough to
-
restrict NFA N to transitions by powers of a,
-
make p the unique starting state,
-
make q the unique accepting state.
Since all considered powers
a
r
appear in strings defined by
G, and so
r≤2
n
, see Remark 1. So each such
a
r
can be represented by an SLP of
\(\mathcal {O}(n)\) size. In particular, Lemma 1 is applicable here and so we can verify in
npolytime whether there is a path from
p to
q for a word
a
ℓ
.
Concerning the preservation of invariants: we first show that the G′ is not in the a-succinct form, nor N′ is a-relaxed. When BlockCompNcr finishes its work, all maximal blocks of a are replaced, in particular, there are no succinct representations of a powers inside the grammar. Notice that there should be no transition labelled with a
ℓ
in the NFA. To this end a new line at the end of BlockCompNcr should be added, so that all transitions by powers of a are removed.
Now we can return to showing the preservation of the invariants: since the only change to the productions consists of replacing maximal blocks of a by a single letter, (SLP 1)–(SLP 2) are preserved. Also, the only modifications to the NFA is the addition of new letter transitions. Thus, (Aut 1) holds. To see that also (Aut 2) holds, notice that if p receives new incoming (outgoing) transition in N′, this transition is of the form a
ℓ
and p had an incoming (outgoing, respectively) transition by a
ℓ
in N. In particular, the starting and accepting state remain unaffected and no transition by $ and # are introduced. Thus, also (Aut 2) holds for N′.
Notice that if N is deterministic, so is N′: suppose that there are two different transitions in N′ whose strings start with d. If d is not one of the new letters, then the same transitions were present in N, contradiction. So suppose it is one of the new letters, say a
ℓ
. Observe that by Lemma 16 none of the strings \(\operatorname {val}(X_{i})\) started with a, and so after the block compression none of them starts with a
ℓ
. So this means that there are two letter transitions starting from state p and labelled with a
ℓ
. Thus, in N there are two paths for the string a
ℓ
starting from p, which is a contradiction.
Before showing the main property of BlockCompNcr, we briefly comment on the last claim of the lemma: if BlockCompNcr is applied to a letter a and b≠a
ℓ
had no crossing blocks before this application, then it also does not have after the application. This should be obvious: application of BlockCompNcr(a) introduces some new transitions to N, by letters other than b, changes transitions by a
ℓ
into fresh letters (other than b) and changes a’s blocks in G into fresh letters (again other than b); so all these operations do not influence, whether b has crossing blocks or not.
We proceed to the proof of the main property of BlockCompNcr: N accepts \(\operatorname {val}(X_{n})\) if and only if for some nondeterministic choices N′ accepts \(\operatorname {val}(X_{n}')\). To this end we first show, how application of BlockCompNcr affects the words defined by G and the NFA N.
The proofs are analogous as the proofs of Claims 1–2 in Lemma 5 and are thus omitted. Notice that the properties stated in Claims 3–4 do not depend on the non-deterministic choices of BlockCompNcr.
It is left to show the main claims of the lemma: N recognises \(\operatorname {val}(X_{n})\) if and only if the NFA N′ obtained for some non-deterministic choices recognises \(\operatorname {val}(X_{n}')\).
Suppose first that the
N′ accepts
\(\operatorname {val}(X_{n}')\), using the path
\(\mathcal{P}'\). Clearly
\(\operatorname {val}(\mathcal{P}') = \operatorname {val}(X_{n}')\). Let the list of labels on
\(\mathcal{P}'\) be
$$ u_{i_1}'X_{i_2}'u_{i_3}'X_{i_4}' \cdots X_{i_{m-1}}'u_{i_m}'. $$
Let
\(u_{i_{j}}\) be obtained from
\(u_{i_{j}}'\) be replacing each
\(a_{\ell_{k}}\) with
\(a^{\ell_{k}}\). We shall construct a path
\(\mathcal{P}\) in
N, which has the same starting and ending as
\(\mathcal{P}'\) and induces a list of labels
$$ u_{i_1}X_{i_2}u_{i_3}X_{i_4}\cdots X_{i_{m-1}}u_{i_m}. $$
Notice that
-
if there is a transition \(\delta_{N'}(p,X_{i}',q)\) in N′ then there is a transition δ
N
(p,X
i
,q) in N;
-
if there is a transition δ
N′(p,b,q) for \(b \neq a_{\ell_{k}}\) in N′, then there is a transition δ
N
(p,b,q) in N;
-
if there is a transition \(\delta_{N}(p,a_{\ell_{k}},q)\) in N′ for some \(a^{\ell_{k}}\) that is a maximal block in one of \(\operatorname {val}(X_{1}), \ldots, \operatorname {val}(X_{n})\), then there is a path from p to q for a string \(a^{\ell_{k}}\) in N.
Therefore, by an easy induction,
\(\mathcal{P}\) is a valid path in
N, moreover, since
\(\mathcal{P}'\) is accepting, so is
\(\mathcal{P}\). It is left to demonstrate that
\(\mathcal{P}\) defines
\(\operatorname {val}(X_{n})\): since
BC
a
is a one-to-one function on strings not containing letters of the form
a
ℓ
and both
\(\operatorname {val}(\mathcal{P})\) and
\(\operatorname {val}(X_{n})\) do not contain such letters, it is enough to show that
\(BC_{a}(\operatorname {val}(\mathcal{P})) = BC_{a}(\operatorname {val}(X_{n}))\). By (
7) it holds that
\(BC_{a}(\operatorname {val}(X_{n})) = \operatorname {val}(X_{n}')\). The value of
\(BC_{a}(\operatorname {val}(\mathcal{P}))\) is already known from (
8), and so it is enough to show that
$$ BC_{a}(u_{i_1})\operatorname {val}\bigl(X_{i_2}' \bigr) BC_{a}(u_{i_3})\operatorname {val}\bigl(X_{i_4}' \bigr)\cdots \operatorname {val}\bigl(X_{i_{m-1}}'\bigr)BC_{a}(u_{i_m}) = \operatorname {val}\bigl(X_n'\bigr). $$
But this is simply the fact that path
\(\mathcal{P}'\) defines
\(\operatorname {val}(X_{n}')\), which holds by the assumption.
Suppose now that
N accepts
\(\operatorname {val}(X_{n})\). Consider the case in which
BlockCompNcr always made a correct non-deterministic choice, i.e. that each time it correctly guessed in line 6.
Let the accepting path
\(\mathcal{P}\) in
N has a list of labels
$$ u_{i_1}X_{i_2}u_{i_3}X_{i_4}\cdots X_{i_{m-1}}u_{i_m}, $$
where, similarly as in Claim 4, each
\(u_{i_{j}}\) is a string representing the consecutive letter labels (
\(u_{i_{j}}\) may be empty) and
\(X_{i_{j}}\) represents a transition by a nonterminal transition. By the definition,
$$ \operatorname {val}(X_n) = u_{i_1}\operatorname {val}(X_{i_2}) u_{i_3}\operatorname {val}(X_{i_4})\cdots \operatorname {val}(X_{i_{m-1}})u_{i_m}. $$
Now consider the
a’s block compression applied to both side of this equality. By (
7) and (
8)
$$ \operatorname {val}\bigl(X_n'\bigr) = BC_{a}(u_{i_1}) \operatorname {val}\bigl(X_{i_2}'\bigr) BC_{a}(u_{i_3}) \operatorname {val}\bigl(X_{i_4}'\bigr)\cdots \operatorname {val}\bigl(X_{i_{m-1}}' \bigr)BC_{a}(u_{i_m}). $$
We will construct an accepting path
\(\mathcal{P}'\) with a list of labels
$$ BC_{a}(u_{i_1})X_{i_2}' BC_{a}(u_{i_3})X_{i_4}'\cdots X_{i_{m-1}}'BC_{a}(u_{i_m}). $$
Notice that
\(\operatorname {val}(\mathcal{P}') = \operatorname {val}(X_{n}')\), and so construction of such path
\(\mathcal{P}'\) will conclude the proof. We iteratively transform
\(\mathcal{P}\) into
\(\mathcal{P}'\). Notice that
-
if there is a transition δ
N
(p,X
i
,q) in N then there is a transition \(\delta_{N'}(p,X_{i}',q)\) in N′;
-
if there is a transition δ
N
(p,b,q) for b≠a in N, then there is a transition δ
N′(p,b,q) in N′;
-
if there is a path in N from p to q for string a
ℓ
that has maximal block in \(\operatorname {val}(X_{n})\), then there is a transition δ
N′(p,a
ℓ
,q) in N′ (by the assumption that BlockCompNcr guessed correctly).
And by an easy induction
\(\mathcal{P}'\) is a valid path in
N′ and has the same starting and ending state as
\(\mathcal{P}\). Since
\(\mathcal{P}\) is accepting in
N and the starting and accepting states in
N and
N′ coincide,
\(\mathcal{P}'\) is an accepting path in
N′. □
The CutPref, CutSuff and BlockCompNcr can be now used to implement the blocks’ compression for an arbitrary letter, with crossing blocks or not.