Zum Inhalt

Maximum alternating clean balanced cycle decomposition and applications in rearrangement distance problems

  • Open Access
  • 01.03.2026
Erschienen in:

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Dieser Artikel geht auf das Problem der maximalen alternativen, sauberen und ausgewogenen Zyklusabbaurate (MAX-ACBCD) und seine Anwendung bei Problemen der Genom-Umlagerung ein. Die Studie stellt eine neue Variante des Problems vor, die MAX-ACBCD, die intergene Regionen in das Modell einbezieht. Der Artikel stellt Algorithmen mit konstanten Approximationsfaktoren für das MAX-ACBCD-Problem und andere Probleme der Genom-Umlagerung vor, einschließlich des Problems der Sortierung nach Umkehrungen und intergener Indels (SbRII). Die Studie liefert auch experimentelle Ergebnisse, die die praktische Anwendbarkeit der vorgeschlagenen Algorithmen belegen. Der Artikel schließt mit einer Diskussion über die theoretischen und praktischen Implikationen der Ergebnisse, wobei die Relevanz des MAX-ACBCD-Problems für kombinatorische Optimierungs- und Genom-Umlagerungsstudien hervorgehoben wird.
A preliminary version of this work appeared in the Proceedings of the 21st International Conference on Comparative Genomics (RECOMB-CG 2024).

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Genome rearrangement problems have been investigated since the 1920s. Nowadays, there are different ways to represent a genome (Oliveira et al. 2024). However, using a string of characters to represent genetic information is one of the most used methods. In this case, the i-th gene of a genome is mapped into an element \(\pi _i\) of a string \(\pi \). If the orientation of the genes is known, a sign “+” or “−” is associated with each element \(\pi _i\), indicating the gene orientation. Otherwise, the sign is omitted. The signed case and unsigned case are terms used to indicate whether the representation has information regarding the orientation of the genes. One common assumption, when using this representation, is that each gene appears at most once on each genome. We are considering this assumption for the whole work, and in some cases we will also assume that there are no genes present in only one of the genomes.
When we are using the representation of genomes with strings, we can estimate the evolutionary distance by calculating the minimum number of rearrangement events (mutations that may affect large stretches of DNA) capable of transforming a source genome (represented by a string \(\pi \)) into the target genome (represented by a string \(\iota \), which is normally set to the identity permutation \((1~2~3~\ldots )\)). The reversal, which inverts a segment of the genome and, in the signed case, also flips the sign of the affected genes, is one of the most studied events in the literature that focuses on mathematical models for genome rearrangement problems (Caprara 1999b; Hannenhalli and Pevzner 1999; Berman et al. 2002). The Sorting by Reversals problem (SbR) seeks a minimum-length sequence of reversals that transforms one genome into another. The signed case of the SbR problem admits an exact polynomial-time algorithm (Hannenhalli and Pevzner 1999). However, Caprara (Caprara 1999b) showed that the unsigned case of the problem is NP-hard, and today the best-known algorithm has an approximation factor of 1.375 (Berman et al. 2002). Another widely studied rearrangement event is the transposition, which exchanges two consecutive segments of the genome. The Sorting by Transpositions problem seeks a minimum-length sequence of transpositions that transforms one genome into another. This problem is only defined for the unsigned case, which is shown to be NP-hard (Bulteau et al. 2012), and today the best-known algorithm has an approximation factor of 1.375 (Elias and Hartman 2006). We also have problems that seek to minimize a sequence formed by distinct rearrangement events. If we combine both events of reversals and transpositions, we have the Sorting by Reversals and Transpositions (SbRT) problem, which is NP-hard for both signed and unsigned cases (Oliveira et al. 2019). For the signed case there is a 2-approximation algorithm to it (Walter et al. 1998), and for the unsigned case there is a 2k-approximation to it (Rahman et al. 2008), where k is the approximation to the following decomposition problem.
One of the structures used to derive bounds for the SbR and SbRT problem is the Breakpoint Graph (Bafna and Pevzner 1996). Bafna and Pevzner proved that the number of breakpoints (b) minus the number of cycles from a maximum alternating cycle decomposition of the breakpoint graph (c) is a lower bound for the SbR problem (Bafna and Pevzner 1996). A breakpoint is a pair of consecutive elements in the source genome that are not consecutive in the target genome. The number of breakpoints is easily calculated, given the source and target genomes. Finding a decomposition of a breakpoint graph into alternating cycles that maximizes the number of cycles leads us to the Maximum Alternating Cycle Decomposition problem (MAX-ACD), which is NP-hard (Caprara 1999b) in the unsigned case.
Besides, Caprara (1999a) presented a probabilistic analysis and showed that the number of reversals required to transform \(\pi \) (source genome representation) into \(\iota \) (target genome representation) is \(b - c\) with a probability of \(1-\varTheta (\frac{1}{n^5})\) for a random signed permutation with n elements. His analysis characterizes the existence of the structures called hurdles and fortresses, because the presence of these structures in \(\pi \) ensures that more than \(b - c\) reversals are needed to transform \(\pi \) into \(\iota \) (such characterization was later simplified by Swenson et al. (2008)). The results for the MAX-ACD problem also have applications in other genome rearrangement problems (Chen 2013; Pinheiro et al. 2020). Particularly, there is an event called double-cut-and-join (DCJ), which cuts the genomes at two points and connects the generated pieces. The sorting by DCJ (\(\textsc {SbDCJ}{}\)) problem seeks a minimum-length sequence of DCJs that transforms one genome into another, and the size of this sequence is exactly \(b - c\). So there is a correspondence between the MAX-ACD and the sorting by DCJ problem. This event generalizes the other ones (one DCJ can simulate one reversal, and two DCJs can simulate a transposition) with the possibility of affecting more than one chromosome. With this operation, the genomes can be represented by sets of strings, one for each chromosome. These relations between the value c and rearrangement problems motivate searching for algorithms to approximate the value of c (Caprara and Rizzi 2002; Lin and Jiang 2004; Chen 2013). The best-known approximation factor for that problem, in the unsigned case, is \(\frac{17}{12}+\epsilon \) (Chen 2013).
A genome has a lot of information, but research in the genome rearrangement field traditionally uses only the order and orientation of genes. Studies have pointed out that incorporating other genetic features into the models can make the results more realistic (Bulteau et al. 2016; Biller et al. 2016). Intergenic regions are DNA sequences between each pair of consecutive genes and at the ends of a linear genome. The number of nucleotides present in it gives the size of each intergenic region. Considering this new structure, studies have emerged representing a genome using two elements: i) a string that stores the information of gene order and orientation (when available), and ii) a list of non-negative integers storing the information regarding the size of each intergenic region. The genome rearrangement problems using this new genome representation remain with the same goal. However, the rearrangement events have the potential to affect both genes and intergenic regions.
Considering the exclusive use of the reversal event, we have a new variation of the Sorting by Reversals problem, which is NP-hard in both the signed and unsigned cases (Oliveira et al. 2021; Brito et al. 2020). Including the intergenic indel event (which inserts or deletes nucleotides from intergenic regions), we have the Sorting by Reversals and Intergenic Indels problem (SbRII). The unsigned case of SbRII is NP-hard, while the complexity of the signed case remains unknown. The best-known algorithms for the SbR and SbRII problems have an approximation factor of 2 for the signed case (Oliveira et al. 2021), and 4 for the unsigned case (Brito et al. 2020).
Considering that indels may affect both genes and intergenic regions, we have the Reversals and Indels Distance problem. There is a 2.5-approximation algorithm for the signed case, but the problem complexity remains unknown (Alexandrino et al. 2021, 2022). For the unsigned case, the problem is NP-hard and has a 4-approximation algorithm (Alexandrino et al. 2021). If we also allow transposition in combination with reversals and indels, we have the Reversal, Transposition, and Indel Distance problem (RTID). The unsigned case is NP-hard and there is a known 6-approximation algorithm for it (Alexandrino et al. 2021). The signed case is also NP-hard and there is a known 4-approximation algorithm for it (Alexandrino et al. 2023).
The sorting by DCJ problem with intergenic regions is also NP-hard and admits a 4/3-approximation algorithm (Fertin et al. 2017). With intergenic indels, we have the Sorting by DCJ and Intergenic Indels (SbDCJII) problem, which is polynomial (Bulteau et al. 2016).
We introduce a variation of the MAX-ACD problem to deal with intergenic regions, called the Maximum Alternating Clean Balanced Cycle Decomposition problem (MAX-ACBCD), and present an algorithm with a constant approximation factor. Furthermore, we design an improved algorithm for the signed case of the SbRII problem that guarantees an approximation factor of \(\frac{3}{2}\). For the unsigned case, we develop approximation algorithms for the SbR, SbRII, RTID, SbDCJ, and SbDCJII problems with factors of 2k, \(\frac{3k}{2}\), 4k, 2k, and k, respectively, where \(k=\frac{31}{21}+\epsilon \).
This paper is organized as follows. Section 2 introduces concepts and definitions used to obtain the results. Section 3 presents the theoretical results. Next, section 4 shows some experimental results. Lastly, Section 5 concludes the manuscript.

2 Definitions

In this section, we introduce definitions, concepts, and graph structures used in the next section.
There are different ways to represent a genome, and depending on its characteristics, some approaches may have more advantages than others. This work assumes that each gene has no duplication. Besides, we consider two types of representations. The first uses information restricted to the gene order, while the second also takes into account the information regarding the intergenic regions. There is an intergenic region between each pair of consecutive genes and at the extremities of a genome. Each intergenic region has a specific number of nucleotides, named size. Next, we describe the structures used in these two representations.
Given a genome \(\mathcal {G} = (\mathcal {G}_1,\mathcal {G}_2,\dots ,\mathcal {G}_n)\), with n genes, we use a representation with a string \(\pi =(\pi _1~\pi _2~\dots ~\pi _n)\), such that \(\pi _i\), with \(1 \le i \le n\), is an integer number that represents the gene \(\mathcal {G}_i\). If the gene orientation of \(\mathcal {G}\) is known, each element \(\pi _i\) receives a “+" or “−" sign indicating its orientation. Otherwise, the sign is omitted. We call this form the classic genome representation. For any string \(\pi \), we use \(\varSigma _{\pi }\) to denote the set of characters from \(\pi \). For this set, character signs are ignored if they are present in the string.
Given a genome \(\mathcal {G} = (\mathcal {R}_1,\mathcal {G}_1,\mathcal {R}_2,\mathcal {G}_2,\dots ,\mathcal {R}_n,\mathcal {G}_n,\mathcal {R}_{n+1},)\), with n genes and \({n+1}\) intergenic regions, we use a representation with two elements: (i) a string \(\pi =(\pi _1~\pi _2~\dots ~,\pi _n)\), such that \(\pi _i\), with \(1 \le i \le n\), is an integer number that represents the gene \(\mathcal {G}_i\); (ii) a list of non-negative integers \(\breve{\pi }=(\breve{\pi }_1,\breve{\pi }_2,\dots ,\breve{\pi }_{n+1})\), such that \(\breve{\pi }_i\), with \(1 \le i \le n+1\), represents the size of the intergenic region \(\mathcal {R}_i\). If the gene orientation of \(\mathcal {G}\) is known, each element \(\pi _i\) receives a “+" or “−" sign indicating its orientation; the sign is omitted otherwise. We call this form the intergenic genome representation.
In both representations, when we are comparing two genomes with a distinct set of genes, which require indels to transform one genome into the other, we assume that we only insert or delete from \(\pi \) genes that only appear in \(\iota \) or \(\pi \), respectively. We use a special character \(\alpha \) to denote genes that will be deleted from \(\pi \). We will use \(\bar{\pi }\) to denote the string \(\pi \) without the characters \(\alpha \), the other characters are in the same order as in \(\pi \). Note that we can use \(\varSigma _{\bar{\pi }}\) to refer to the characters of \(\varSigma _{\pi }\) excluding \(\alpha \).
The extended form of a classic genome \(\mathcal {G} = \pi \) is obtained by adding two new elements \(\pi _0 = 0\) and \(\pi _{n+1} = n+1\). Similarly, given an intergenic genome \(\mathcal {G} = (\pi ,\breve{\pi })\), we obtain its extended form by adding two new elements \(\pi _0 = 0\) and \(\pi _{n+1} = n+1\). We hereafter assume that both genomes are in their extended form. In the following, we show how reversal, transposition, and indel events affect an intergenic genome \(\mathcal {G} = (\pi ,\breve{\pi })\).
Definition 1
Given an intergenic genome \(\mathcal {G} = (\pi ,\breve{\pi })\), a reversal \(\rho ^{(i,j)}_{(x,y)}\), with \(1 \le i \le j \le n\), \(0 \le x \le \breve{\pi }_i\), and \(0 \le y \le \breve{\pi }_{j+1}\), splits the intergenic regions \(\breve{\pi }_i\) and \(\breve{\pi }_{j+1}\), respectively, into \((x,x^{\prime })\) and \((y,y^{\prime })\), such that \(x^{\prime } = \breve{\pi }_i - x\) and \(y^{\prime } = \breve{\pi }_{j+1} - y\). The segment \((x^{\prime },\pi _i,\breve{\pi }_{i+1},\dots ,\breve{\pi }_{j},\pi _j,y)\) is inverted and, if the orientation of the genes is known, the signs of the elements from \(\pi _i\) to \(\pi _j\) are flipped. Lastly, the segments are reassembled. The reversal \(\rho _{(x,y)}^{(i,j)}\) applied on \(\mathcal {G} = (\pi ,\breve{\pi })\), denoted by \((\pi ,\breve{\pi }) \cdot \rho _{(x,y)}^{(i,j)}\), results in a new genome \(G^{\prime } = (\pi ^{\prime },\breve{\pi }^{\prime })\), such that:
$$\begin{aligned} \pi ^{\prime }= & ({\pi _0}~{\pi _1}\dots {\pi _{i-1}}~\underline{{-\pi _j}~{-\pi _{j-1}}\dots {-\pi _{i+1}}~{-\pi _{i}}}~{\pi _{j+1}}~{\pi _{j+2}}\dots {\pi _{n}}~{\pi _{n+1}}),\\ \breve{\pi }^{\prime }= & (\breve{\pi }_1,\breve{\pi }_2,\dots ,\breve{\pi }_{i-1},\underline{\breve{\pi }^{\prime }_i,\breve{\pi }_{j},\dots ,\breve{\pi }_{i+1},\breve{\pi }^{\prime }_{j+1}},\breve{\pi }_{j+2},\dots ,\breve{\pi }_{n},\breve{\pi }_{n+1}), \end{aligned}$$
if the orientation of the genes is known and,
$$\begin{aligned} \pi ^{\prime }= & ({\pi _0}~{\pi _1}\dots {\pi _{i-1}}~\underline{{\pi _j}~{\pi _{j-1}}\dots {\pi _{i+1}}~{\pi _{i}}}~{\pi _{j+1}}~{\pi _{j+2}}\dots {\pi _{n}}~{\pi _{n+1}}),\\ \breve{\pi }^{\prime }= & (\breve{\pi }_1,\breve{\pi }_2,\dots ,\breve{\pi }_{i-1},\underline{\breve{\pi }^{\prime }_i,\breve{\pi }_{j},\dots ,\breve{\pi }_{i+1},\breve{\pi }^{\prime }_{j+1}},\breve{\pi }_{j+2},\dots ,\breve{\pi }_{n},\breve{\pi }_{n+1}), \end{aligned}$$
otherwise. In both cases \(\breve{\pi }^{\prime }_i = x+y\) and \(\breve{\pi }^{\prime }_{j+1} = x^{\prime } + y^{\prime }\).
In genomes with a single chromosome, the double-cut-and-join (DCJ) is a generalization of reversal, which can create intermediary states with separate pieces of the genome.
Definition 2
Given an intergenic genome \(\mathcal {G} = (\pi ,\breve{\pi })\), a DCJ \(\gamma ^{(i,j,r)}_{(x,y)}\), with \(1 \le i \le j \le n\), \(r \in \{1,2\}\), \(0 \le x \le \breve{\pi }_i\), and \(0 \le y \le \breve{\pi }_{j+1}\), splits the intergenic regions \(\breve{\pi }_i\) and \(\breve{\pi }_{j+1}\), respectively, into \((x,x^{\prime })\) and \((y,y^{\prime })\), such that \(x^{\prime } = \breve{\pi }_i - x\) and \(y^{\prime } = \breve{\pi }_{j+1} - y\). Next the DCJ recombines the pieces, connecting \(\pi _{i-1}\) with \(\pi _{j}\) and \(\pi _{i}\) with \(\pi _{j+1}\) (if \(r=1\)), or connecting \(\pi _{i-1}\) with \(\pi _{j+1}\) and \(\pi _{i}\) with \(\pi _{j}\) (if \(r=2\)). If the orientation of the genes is known, the signs of the elements from \(\pi _i\) to \(\pi _j\) are flipped for \(r=1\).
Note that the DCJ can, for instance, produce a piece of circular genome \(\pi _i~\pi _{i+1}~\ldots ~\pi _{j}\), where \(\pi _i\) and \(\pi _j\) are connected. Besides, if there are already some pieces created from a previous operation, then \((\pi _{i-1},\pi _{i})\) and \((\pi _j,\pi _{j+1})\) can be in separate pieces that can be recombined into a single piece.
Definition 3
Given an intergenic genome \(\mathcal {G} = (\pi ,\breve{\pi })\), a transposition \(\rho ^{(i,j,k)}_{(x,y,z)}\), with \(1 \le i< j < k \le n + 1\), \(0 \le x \le \breve{\pi }_i\), \(0 \le y \le \breve{\pi }_{j}\), and \(0 \le z \le \breve{\pi }_{k}\), splits the intergenic regions \(\breve{\pi }_i\), \(\breve{\pi }_{j}\), and \(\breve{\pi }_{k}\), respectively, into \((x,x^{\prime })\), \((y,y^{\prime })\), and \((z,z^{\prime })\), such that \(x^{\prime } = \breve{\pi }_i - x\), \(y^{\prime } = \breve{\pi }_{j} - y\), and \(z^{\prime } = \breve{\pi }_{k} - z\). The segment \((x^{\prime },\pi _i,\breve{\pi }_{i+1},\dots ,\breve{\pi }_{j-1},\pi _{j-1},y)\) is exchanged with the segment \((y^{\prime },\pi _j,\breve{\pi }_{j+1},\dots ,\breve{\pi }_{k-1},\pi _{k-1},z)\). The transposition \(\tau _{(x,y,z)}^{(i,j,k)}\) applied on \(\mathcal {G} = (\pi ,\breve{\pi })\), denoted by \((\pi ,\breve{\pi }) \cdot \tau _{(x,y,z)}^{(i,j,k)}\), results in a new genome \(G^{\prime } = (\pi ^{\prime },\breve{\pi }^{\prime })\), such that:
$$\begin{aligned} \pi ^{\prime }= & ({\pi _0}~{\pi _1}\dots {\pi _{i-1}}~\underline{{\pi _j}~{\pi _{j+1}}\dots {\pi _{k-1}}}~\underline{{\pi _i}~{\pi _{i+1}}\dots {\pi _{j-1}}}~{\pi _{k}}\dots {\pi _{n+1}}),\\ \breve{\pi }^{\prime }= & (\breve{\pi }_1,\breve{\pi }_2,\dots ,\breve{\pi }_{i-1},\breve{\pi }^{\prime }_i,\underline{\breve{\pi }_{j+1},\dots ,\breve{\pi }_{k-1}},\breve{\pi }^{\prime }_{\ell },\underline{\breve{\pi }_{i+1},\dots ,\breve{\pi }_{j-1}},\breve{\pi }^{\prime }_{k},\dots ,\breve{\pi }_{n+1}), \end{aligned}$$
with \(\breve{\pi }^{\prime }_i = x+y^{\prime }\), \(\breve{\pi }^{\prime }_{\ell } = z + x^{\prime }\), and \(\breve{\pi }^{\prime }_{k} = y + z^{\prime }\).
Definition 4
Given an intergenic genome \(\mathcal {G} = (\pi ,\breve{\pi })\), an intergenic indel \(\delta i^{(i)}_{(x)}\), with \(1 \le i \le n+1\) and \(x \ge -\breve{\pi }_i\), changes the size of the intergenic region \(\breve{\pi }_i\) to \(\breve{\pi }_i + x\). The intergenic indel \(\delta i^{(i)}_{(x)}\) applied on \(\mathcal {G} = (\pi ,\breve{\pi })\), denoted by \((\pi ,\breve{\pi }) \cdot \delta i^{(i)}_{(x)}\), results in a new genome \(G^{\prime } = (\pi ^{\prime },\breve{\pi }^{\prime })\), such that:
$$\begin{aligned} \pi ^{\prime }= & ({\pi _0}~{\pi _1}\dots {\pi _{i-1}}~{\pi _i}~{\pi _{i+1}}\dots {\pi _{j-1}}~{\pi _{j}}~{\pi _{j+1}}~{\pi _{j+2}}\dots {\pi _{n}}~{\pi _{n+1}}),\\ \breve{\pi }^{\prime }= & (\breve{\pi }_1,\breve{\pi }_2,\dots ,\breve{\pi }_{i-1},\underline{\breve{\pi }^{\prime }_{i}},\breve{\pi }_{i+1},\dots ,\breve{\pi }_{j},\breve{\pi }_{j+1},\breve{\pi }_{j+2},\dots ,\breve{\pi }_{n},\breve{\pi }_{n+1}), \end{aligned}$$
with \(\breve{\pi }^{\prime }_i = \breve{\pi }_i + x\).
Note that the intergenic indel event is a single notation to represent the non-conservative events of insertion and deletion in intergenic regions. Observe that when x is negative (resp. positive), the intergenic indel deletes (resp. inserts) nucleotides in the intergenic region.
Definition 5
Given an intergenic genome \(\mathcal {G} = (\pi ,\breve{\pi })\), an insertion \(\phi ^{(i,S,\breve{S})}_{(x)}\), with \(1 \le i \le n+1\) and \(0 \le x \le \breve{\pi }_i\), inserts a sequence of genes \(S = (S_1~\ldots ~S_{|S|})\) and a sequence of intergenic region sizes \(\breve{S} = (\breve{S}_1~\ldots ~\breve{S}_{|S|+1})\), after the x-th nucleotide of the intergenic region \(\breve{\pi }_{i}\), which results in a new genome \(G^{\prime } = (\pi ^{\prime },\breve{\pi }^{\prime })\), such that:
$$\begin{aligned} \pi ^{\prime }= & ({\pi _0}~{\pi _1}\dots {\pi _{i-1}}~\underline{{S_1}~\ldots ~{S_|S|}}~{\pi _i}~{\pi _{i+1}}\dots {\pi _{n}}~{\pi _{n+1}}),\\ \breve{\pi }^{\prime }= & (\breve{\pi }_1,\breve{\pi }_2,\dots ,\breve{\pi }_{i-1},\underline{x + \breve{S}_1,\ldots ,\breve{S}_{|S|+1} + x^{\prime }},\breve{\pi }_{i+1},\dots ,\breve{\pi }_{n},\breve{\pi }_{n+1}), \end{aligned}$$
with \(x^{\prime } = \breve{\pi }_i - x\).
Definition 6
Given an intergenic genome \(\mathcal {G} = (\pi ,\breve{\pi })\), a deletion \(\psi ^{(i,j)}_{(x,y)}\), with \(1 \le i \le j \le n+1\), \(\pi _k = \alpha \) for \(i \le k < j\), \(0 \le x \le \breve{\pi }_i\), and \(0 \le y \le \breve{\pi }_j\), deletes the segment that starts after the x-th nucleotide of the intergenic region \(\breve{\pi }_{i}\) and ends after the y-th nucleotide of the intergenic region \(\breve{\pi }_{j}\), which results in a new genome \(G^{\prime } = (\pi ^{\prime },\breve{\pi }^{\prime })\), such that:
$$\begin{aligned} \pi ^{\prime }= & ({\pi _0}~{\pi _1}\dots {\pi _{i-1}}~{\pi _{j}}~\dots {\pi _{n}}~{\pi _{n+1}}),\\ \breve{\pi }^{\prime }= & (\breve{\pi }_1,\breve{\pi }_2,\dots ,\breve{\pi }_{i-1},x+y^{\prime },\breve{\pi }_{j+1},\dots ,\breve{\pi }_{n},\breve{\pi }_{n+1}), \end{aligned}$$
with \(y^{\prime } = \breve{\pi }_j - y\). If \(i = j\), we must also have \(0 \le x \le y \le \breve{\pi }_i\).
We use a \(\delta \) to represent a rearrangement that can be either an insertion or a deletion. We call such a rearrangement an indel. Note that an intergenic indel is a special case of indel that affects only intergenic regions.
Given an intergenic genome \(\mathcal {G} = (\pi ,\breve{\pi })\) and a sequence \(S=(\beta _1,\beta _2,\dots ,\beta _n)\) of genome rearrangement events, \((\pi ,\breve{\pi }) \cdot S = (\pi ,\breve{\pi }) \cdot \beta _1 \cdot \beta _2 \cdot \dots \cdot \beta _n\) denotes the sequence S being applied on \(\mathcal {G}\), which results in a new genome. We also use the notation, \(\bar{\pi } \cdot S\) to denote \(\bar{\pi }'\), where \((\pi ',\breve{\pi }' ) = (\pi ,\breve{\pi }) \cdot S\).
We say that two genomes are related if they have the same set of genes. If we are comparing two related genomes of size n, after the extension, \(\pi \) is a permutation of the numbers from 0 to \(n+1\). Note that, in that case, \(\bar{\pi } = \pi \). Two genomes are balanced if they are related and the sum of the intergenic regions of one is equal to the sum of the intergenic regions of the other.
A classic instance \(\mathcal {I}=(\pi ,\iota )\) is composed of a pair of classic genomes, the source genome \(\pi \) and target genome \(\iota \). If the orientation of genes in both genomes is known, then the classic instance \(\mathcal {I}\) is called signed. Otherwise, \(\mathcal {I}\) is called unsigned.
Given a classic instance \(\mathcal {I}=(\pi ,\iota )\), we have \(next(x) = min(y \in \varSigma _{\bar{\pi }} \cap \varSigma _{\iota } | y > x)\), for all \(x \in \varSigma _{\bar{\pi }}\).
Definition 7
Given an unsigned classic instance \(\mathcal {I}=(\pi ,\iota )\), a pair of elements \((\bar{\pi }_i,\bar{\pi }_{i+1})\), such that \(0 \le i \le |\bar{\pi }|\), is a breakpoint if \(|\bar{\pi }_{i+1} - \bar{\pi }_i| \ne 1\) or there is at least one \(\alpha \) between \(\bar{\pi }_i\) and \(\bar{\pi }_{i+1}\) in \(\pi \).
Definition 8
Given an unsigned classic instance \(\mathcal {I}=(\pi ,\iota )\), a pair (ab), such that \(a,b \in \varSigma _{\bar{\pi }}\), is an adjacency if \(a = next(b)\) or \(b = next(a)\) and a is right before or after b in \(\pi \).
An intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }),(\iota ,\breve{\iota }))\) is composed of a pair of intergenic genomes, the source genome \((\pi ,\breve{\pi })\) and target genome \((\iota ,\breve{\iota })\). Similarly to the classic instance, if the orientation of genes in both genomes is known, then the intergenic instance \(\mathcal {I}\) is called signed. Otherwise, \(\mathcal {I}\) is called unsigned. In an intergenic instance, we can use the same definition of next as in the classic instance.
Given an intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }),(\iota ,\breve{\iota }))\), we have the following distances:
  • the reversal distance \(d_{\textsc {R}}(\mathcal {I})\) is the minimum number of reversals necessary to turn \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\);
  • the reversal and intergenic indel distance \(d_{\textsc {RII}}(\mathcal {I})\) is the minimum number of reversals and intergenic indels necessary to turn \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\).
  • the DCJ distance \(d_{\textsc {DCJ}}(\mathcal {I})\) is the minimum number of DCJs necessary to turn \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\);
  • the DCJ and intergenic indel distance \(d_{\textsc {DCJII}}(\mathcal {I})\) is the minimum number of DCJs and intergenic indels necessary to turn \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\);
  • the reversal, transposition, and indel distance \(d_{\textsc {RTI}}(\mathcal {I})\) is the minimum number of reversals, transpositions, and indels necessary to turn \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\).
Note that for distances without indels, the genomes must be balanced; for distances with only intergenic indels, the genomes must be related.
Next, we generalize the breakpoint and adjacency definition to consider the size of the intergenic regions in an unsigned intergenic instance.
Definition 9
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }),(\iota ,\breve{\iota }))\), a pair of elements \((\bar{\pi }_i,\bar{\pi }_{i+1})\), such that \(0 \le i \le |\bar{\pi }|\), is an intergenic breakpoint if one of the following cases occur: (i) \(|\bar{\pi }_{i+1} - \bar{\pi }_i| \ne 1\); (ii) there is at least one \(\alpha \) between \(\bar{\pi }_i\) and \(\bar{\pi }_{i+1}\) in \(\pi \); (iii) none of the previous cases occur and the intergenic region between \(\bar{\pi }_i\) and \(\bar{\pi }_{i+1}\) is not equal to \(\breve{\iota }_{x}\), such that \(x = \max (\bar{\pi }_i,\bar{\pi }_{i+1})\).
Given an unsigned intergenic instance \(\mathcal {I}\), \(ib(\mathcal {I})\) denotes the number of intergenic breakpoints in \(\mathcal {I}\).
Definition 10
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }),(\iota ,\breve{\iota }))\), a pair (ab), such that \(a,b \in \varSigma _{\bar{\pi }}\), is an intergenic adjacency if the following conditions are fulfilled: (i) \(a = next(b)\) or \(b = next(a)\); (ii) a is right before or after b in \(\pi \); (iii) the intergenic region between a and b in \(\pi \) is equal to the intergenic region between a and b in \(\iota \).
In the following, we present the breakpoint graph structure and its extended form created to deal with intergenic instances, called weighted labeled breakpoint graph.

2.1 Breakpoint graph

Given a classic unsigned instance \(\mathcal {I}=(\pi ,\iota )\) constructed from related genomes, the graph \(G_{\mathcal {B}}(\mathcal {I})= (V,E,w)\) is an edge-colored graph composed of the set of vertices \(V=\{\pi _0,\pi _1,\dots ,\pi _n,\pi _{n+1}\}\) and the set of edges \(E=E_b \cup E_g\), which is divided into reality edges (\(E_b\)) and desire edges (\(E_g\)). For all \(0 \le i \le n\), there exists a reality edge \((\pi _i, \pi _{i+1})\) if the pair of elements \((\pi _i, \pi _{i+1})\) is a breakpoint in \(\mathcal {I}\). For all \(0 \le i \le n\), there exists a desire edge \((i, i + 1)\) if the pair \((i, i+1)\) is not an adjacency in \(\mathcal {I}\).
Note that there is no cycle in \(G_{\mathcal {B}}\) with edges of one color exclusively. A cycle C such that every pair of consecutive edges has distinct colors is called an alternating cycle. An alternating cycle C with k reality edges is called a k-cycle. Note that, by construction, there are no 1-cycles in \(G_{\mathcal {B}}\) since when the pair of elements \((\pi _i, \pi _{i+1})\) is not a breakpoint it implies that the pair \((\pi _i, \pi _{i+1})\) is an adjacency. Figure 1 shows two possibilities of decomposition of the breakpoint graph \(G_{\mathcal {B}}(\mathcal {I})\) into edge-disjoint alternating cycles using the classic unsigned instance \(\mathcal {I}=(\pi =(0~5~1~2~4~7~3~6~8),\iota =(0~1~2~3~4~5~6~7~8))\).
Fig. 1
The breakpoint graph \(G_{\mathcal {B}}(\mathcal {I})\) for \(\mathcal {I}=(\pi =(0~5~1~2~4~7~3~6~8),\iota =(0~1~2~3~4~5~6~7~8))\). The reality edges are represented using a solid line and placed horizontally, while desire edges use a dashed line forming arcs over the vertices. Figures (a) and (b) show two different decompositions of \(G_{\mathcal {B}}(\mathcal {I})\) into edge-disjoint alternating cycles, represented by red and blue colors
Bild vergrößern

2.2 Weighted labeled breakpoint graph on unsigned intergenic instances

The weighted labeled breakpoint graph is an extension of the breakpoint graph created to deal with additional information from an intergenic instance and the presence of genes in only one of the genomes. The weighted labeled breakpoint graph differs by considering the intergenic breakpoint and intergenic adjacency definitions to construct the set of edges. Besides, each edge of the graph has a weight associated with the size of a specific intergenic region from the instance and a label recording genes that only appear in one of the genomes. In the following, we formally define the weighted labeled breakpoint graph structure.
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }),(\iota ,\breve{\iota }))\), the weighted labeled breakpoint graph \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})= (V,E,w,\ell )\) is composed of the set of vertices \(V=\{\bar{\pi }_0,\bar{\pi }_1,\dots ,\bar{\pi }_{|\bar{\pi }|}\}\) and the set of edges \(E=E_b \cup E_g\), which is divided into reality edges (\(E_b\)) and desire edges (\(E_g\)). The weighted function \(w: E \rightarrow \mathbb {N}_0\) relates the size of the intergenic regions from the source and target genomes with weights of the edges in E. The edge labeling function \(\ell : E \rightarrow (\varSigma _{\iota } \setminus \varSigma _{\pi }) \cup \{\alpha \}\) relates the exclusive genes of each genome with the edges between genes around those exclusive genes. For all \(0 \le i < |\bar{\pi }|\), if the pair \((\bar{\pi }_i, \bar{\pi }_{i+1})\) is an intergenic breakpoint in \(\mathcal {I}\) then there exists a reality edge \((\bar{\pi }_i, \bar{\pi }_{i+1})\) with \(w((\bar{\pi }_i, \bar{\pi }_{i+1}))\) equal to the sum of intergenic regions between \(\bar{\pi }_i\) and \(\bar{\pi }_{i+1}\) in \(\breve{\pi }\). The label \(\ell ((\bar{\pi }_i, \bar{\pi }_{i+1}))\) is equal to \(\alpha \), if there exists at least one \(\alpha \) between \(\bar{\pi }_i\) and \(\bar{\pi }_{i+1}\), and it is empty otherwise (denoted by \(\emptyset \)). For all \(x \in \varSigma _{\bar{\pi }} \setminus \{n+1\}\), if the pair (xnext(x)) is not an intergenic adjacency in \(\mathcal {I}\) then there exists a desire edge (xnext(x)) with w((xnext(x))) equal to the sum of intergenic regions between x and next(x) in \(\breve{\iota }\). The label \(\ell ((x,next(x)))\) is equal to \(x + 1\), if \(next(x) \ne x+1\), and it is empty otherwise.
The alternating cycle and the k-cycle definitions are exactly as described in Section 2.1. An alternating cycle C from \(G_{\mathcal{I}\mathcal{B}}\) is balanced if the sum of the weights of its reality edges minus the sum of the weights of its desire edges is equal to zero. Otherwise, C is unbalanced. An alternating cycle C from \(G_{\mathcal{I}\mathcal{B}}\) is labeled if one of its edges has a nonempty label. Otherwise, C is clean. Note that there may be 1-cycles in \(G_{\mathcal{I}\mathcal{B}}\), since the intergenic region between two adjacent elements in \(\pi \) and \(\iota \) may have different sizes, and there may be genes that appear in only one of the genomes. However, all 1-cycles in \(G_{\mathcal{I}\mathcal{B}}\) are unbalanced or labeled by construction.
Figure 2 shows two decompositions of the weighted labeled breakpoint graph \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) into edge-disjoint alternating cycles for \(\mathcal {I}=((\pi =(0~\alpha ~4~3~6~8~2~5~\alpha ~9~10),\breve{\pi }=(1,1,5,1,2,3,0,1,0,3)),(\iota =(0~1~2~3~4~5~6~7~8~9~10),\breve{\iota }=(1,1,4,5,0,2,0,0,1,3)))\). Note that the decomposition in Figure 2(a) has one labeled balanced cycle and one clean unbalanced cycle, while the decomposition in Figure 2(b) has three balanced cycles, such that one of them is clean.
Fig. 2
The weighted labeled breakpoint graph \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) for \(\mathcal {I}=((\pi =(0~\alpha ~4~3~6~8~2~5\) \(\alpha ~9~10), \breve{\pi }=(1,1,5,1,2,3,0,1,0,3)),(\iota =(0~1~2~3~4~5~6~7~8~9~10), \breve{\iota }=(1,1,4,5,0,2,0,\) 0, 1, 3))). The reality edges are represented using a solid line and placed horizontally, while desire edges use a dashed line forming arcs over the vertices. The weights are above the edges and the labels are below the edges. Figures (a) and (b) show two decompositions of \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) into edge-disjoint alternating cycles, represented by black, red, and blue colors
Bild vergrößern
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }),(\iota ,\breve{\iota }))\), we use \(c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\) to denote the maximum number of balanced cycles obtained from all possible decompositions of \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) into edge-disjoint alternating cycles and \(c_b^{clean}(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\) to denote the maximum number of balanced and clean cycles obtained from all possible decompositions of \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) into edge-disjoint alternating cycles. Considering a sequence of rearrangements S, we use the value \(\varDelta c^{clean}_b(\mathcal {I}, S) = c_b^{clean}(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})) -c_b^{clean}(G_{\mathcal{I}\mathcal{B}}(\mathcal {I \cdot S}))\) to track the changes in \(c_b^{clean}\).

2.3 Weighted labeled breakpoint graph on signed intergenic instances

The weighted labeled breakpoint graph on signed intergenic instances uses the information regarding the orientation of the genes to create a structure where the decomposition into edge-disjoint alternating cycles is unique.
Given a signed intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }),(\iota ,\breve{\iota }))\), the graph \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})= (V,E,w,\ell )\) is composed of the set of vertices \(V=\{+\bar{\pi }_0,-\bar{\pi }_1,+\bar{\pi }_1,-\bar{\pi }_2,+\bar{\pi }_2,\dots ,-\bar{\pi }_{|\bar{\pi }|}\}\) and the set of edges \(E=E_b \cup E_g\), which is divided into reality edges (\(E_b\)) and desire edges (\(E_g\)). The weighted function \(w: E \rightarrow \mathbb {N}_0\) relates the size of the intergenic regions from the source and target genomes with weights of the edges in E. The edge labeling function \(\ell : E \rightarrow (\varSigma _{\iota } \setminus \varSigma _{\pi }) \cup \{\alpha ,\emptyset \}\) relates the exclusive genes of each genome with the edges between genes around those exclusive genes. For all \(0 \le i < |\bar{\pi }|\), there is a reality edge \((+\bar{\pi }_i, -\bar{\pi }_{i+1})\) with \(w((+\bar{\pi }_i, -\bar{\pi }_{i+1}))\) equal to the sum of intergenic regions between \(\bar{\pi }_i\) and \(\bar{\pi }_{i+1}\) in \(\breve{\pi }\). The label \(\ell ((+\bar{\pi }_i, -\bar{\pi }_{i+1}))\) is equal to \(\alpha \), if there exists at least one \(\alpha \) between \(\bar{\pi }_i\) and \(\bar{\pi }_{i+1}\), and it is empty otherwise (denoted by \(\emptyset \)). For all \(x \in \varSigma _{\bar{\pi }} \setminus \{n+1\}\), there exists a desire edge \((+x, -next(x))\) with \(w((+x, -next(x)))\) equal to the sum of intergenic regions between x and next(x) in \(\breve{\iota }\). The label \(\ell ((+x,-next(x)))\) is equal to \(x+1\), if \(next(x) \ne x+1\), and it is empty otherwise.
The alternating cycle and the k-cycle definitions are exactly as described in Section 2.1. Besides, the balanced, unbalanced, labeled, and clean cycle definitions are the same as described in Section 2.2. Note that each vertex in \(G_{\mathcal{I}\mathcal{B}}^{+}\) has two incident edges (one reality and one desire). For this reason, the decomposition of \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) into edge-disjoint alternating cycles is unique. There are different ways to draw the weighted labeled breakpoint graph. However, we use a representation called standard to obtain a unique representation of the cycles in the graph. The standard representation places the vertices in a horizontal line from the left to the right following the order: \({+0},{-\bar{\pi }_1},{+\bar{\pi }_1},\) \({-\bar{\pi }_2},{+\bar{\pi }_2}, \dots ,{-(n+1)}\). Next, the reality edges are placed horizontally while the desire edges form an arc over the vertices.
Figure 3 shows a weighted labeled breakpoint graph created using the signed intergenic instance \(\mathcal {I}=((\pi =({+0}~{+\alpha }~{-5}~{-2}~{-1}~{+\alpha }~{-3}~{-4}~{+7}), \breve{\pi }=(2,3,2,2,3,\) \(0,2,1)),(\iota =({+0}~{+1}~{+2}~{+3}~{+4}~{+5}~{+6}~{+7}), \breve{\iota }=(2,2,5,3,3,0,0)))\).
Fig. 3
The weighted breakpoint graph \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) of the signed intergenic instance \(\mathcal {I}=((\pi =({+0}~{+\alpha }~{-5}~{-2}~{-1}~{+\alpha }~{-3}~{-4}~{+7}), \breve{\pi }=(2,3,2,2,3,0,2,1)),(\iota =({+0}~{+1}{+2}~{+3}~{+4}~{+5}~{+6}~{+7}), \breve{\iota }=(2,2,5,3,3,0,0)))\). The reality edges are represented using a solid line and placed horizontally, while desire edges use a dashed line forming arcs over the vertices. The weights are above the edges and the labels are below the edges
Bild vergrößern
An unbalanced cycle C in \(G_{\mathcal{I}\mathcal{B}}^{+}\) is positive if the sum of the weights in its desire edges minus the sum of the weights in its reality edges is greater than zero. In the opposite way, an unbalanced cycle C in \(G_{\mathcal{I}\mathcal{B}}^{+}\) is negative if the sum of the weights in its desire edges minus the sum of the weights in its reality edges is less than zero. A k-cycle is called non-trivial if \(k \ge 2\) and trivial otherwise. Each cycle in \(G_{\mathcal{I}\mathcal{B}}^{+}\) is identified by the list of its reality edges, such that the first reality edge is the rightmost considering the standard representation, and is traversed from right to left. A reality edge that is traversed from right to left is called convergent, and it is called divergent otherwise. A non-trivial cycle such that all reality edges are convergent is called convergent, and divergent otherwise. Considering the standard drawing of the graph, a convergent cycle C is called non-oriented if starting from the first reality edge the reality edges can be traversed, following the edges of C, according to the order in which they appear in the drawing (from left to right). Otherwise, C is called oriented.
Figure 3 shows a weighted labeled breakpoint graph with three cycles \(C_1=(({-7},{-4}),({+3},{-1}),({+0},{+5}))\), \(C_2=(({+4},{-3}),({+2},{-5}))\), and \(C_3=(({+1},{-2}))\). Note that \(C_3\) is a trivial clean balanced cycle, while \(C_1\) and \(C_2\) are labeled negative divergent and clean positive convergent non-oriented cycles, respectively.
Given a non-trivial cycle C from \(G_{\mathcal{I}\mathcal{B}}^{+}\), we say that a desire edge of C is an open gate if it does not intersect with any other desire edge from C. For instance, in Figure 3, the desire edges \(({-3},{+2})\) and \(({-5},{+4})\) from the cycle \(C_2\) are open gates. An open gate is closed if a desire edge from another cycle intersects with it. In Figure 3, the desire edges \(({-4},{+3})\) and \(({-1},{+0})\) close the open gates \(({-3},{+2})\) and \(({-5},{+4})\). It is important to note that given any signed intergenic instance \(\mathcal {I}\), all the open gates from \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) are closed (Bafna and Pevzner 1996).
Given a signed intergenic instance \(\mathcal {I}\), \(c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\) denotes the number of balanced cycles in \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\). Besides, \(c_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\) denotes the number of clean balanced cycles in \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\), while \(tc_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\) and \(ntc_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\) denote the number of trivial clean balanced and non-trivial clean balanced cycles in \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\), respectively. Note that \(c_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})) = tc_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})) + ntc_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\). Considering a sequence of rearrangements S, we use the value \(\varDelta c^{clean}_b(\mathcal {I}, S) = (|\bar{\pi }| - 1 - c_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))) -(|\bar{\pi } \cdot S| - 1 - c_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I \cdot S}))) \) to track the changes in clean balanced cycles combined with changes on the gene content of the genomes.

2.4 Breakpoint graphs with multiple chromosomes

The graphs presented in this section and the results of this paper are described for the case where both genomes have a single chromosome. However, the same graphs can be used to represent genomes with multiple circular and linear chromosomes as long as the genomes are co-tailed, that is, they share the same set of chromosome extremities. Note that only linear chromosomes have extremities.
In this section, we present how to adapt the breakpoint graph on unsigned instances to represent two co-tailed genomes with multiple chromosomes. Similar adaptations can be applied to the weighted labeled breakpoint graph for unsigned and signed intergenic instances.
Now, we represent a genome by a set of strings \(\varPi \), and an instance is a pair \(\mathcal {I} = (\varPi ,\varGamma )\) of such sets. Each string represents a chromosome and may be circular, that is, the first and last characters may be adjacent. We denote by \(\bar{\varPi }\) the set created by removing the characters \(\alpha \) from the strings of \(\varPi \) and \(\varSigma _{\bar{\varPi }}\) is the set of characters from all strings of \(\bar{\varPi }\). For a in a string S of \(\varGamma \), we denote by Next(a) the next character in S after skipping all characters that only appear in \(\varGamma \). We pick an arbitrary orientation to define Next(a) in circular strings.
As we are assuming two co-tailed genomes with a single copy of each gene, the graph \(G_{\mathcal {B}}(\mathcal {I})= (V,E,w)\) is defined as before, with a slight change in the notions of breakpoints and adjacencies to consider the multiple chromosomes.
Definition 11
Given an unsigned instance with multiple chromosomes \(\mathcal {I}=(\varPi ,\varGamma )\), a pair of consecutive elements (ab) in the same string of \(\bar{\varPi }\) is a breakpoint if they are not consecutive in \(\varGamma \) or there is at least one \(\alpha \) between a and b in \(\varPi \).
Definition 12
Given an unsigned instance with multiple chromosomes \(\mathcal {I}=(\varPi ,\varGamma )\), a pair (ab), such that \(a,b \in \varSigma _{\bar{\varPi }}\), is an adjacency if \(a = Next(b)\) or \(b = Next(a)\), and a is right before or after b in a string of \(\varPi \) (See Fig. 4).
Fig. 4
The breakpoint graph \(G_{\mathcal {B}}(\mathcal {I})\) for \(\mathcal {I}=(\varPi =\{(0~6~2~4),(5~3~1~7),(8~9)\},\varGamma =\{(0~1~2~3~4),(5~6~7),(8~9)\})\). The reality edges are represented using a solid line and placed horizontally, while desire edges use a dashed line forming arcs over the vertices. Figures (a) and (b) show two different decompositions of \(G_{\mathcal {B}}(\mathcal {I})\) into edge-disjoint alternating cycles, represented by red, blue, and black colors
Bild vergrößern

3 Theoretical results

In this section, we formally present the Maximum Alternating Clean Balanced Cycle Decomposition (MAX-ACBCD) problem. We also prove the NP-hardness of the MAX-ACBCD problem, on unsigned instances, and of the RTID problem, on signed instances. Finally, we also describe approximation algorithms for SbR, SbRII, and RTID on signed and unsigned instances.

3.1 Complexity of the MAX-ACBCD problem

In this section, we prove the NP-hardness of the Maximum Alternating Clean Balanced Cycle Decomposition (MAX-ACBCD) problem. First, we describe the decision version of the MAX-ACD problem.
Bild vergrößern
Caprara (1999b) showed that the Maximum Alternating Cycle Decomposition (MAX-ACD) problem is NP-hard. In the following, we describe the decision version of the MAX-ACBCD problem.
Bild vergrößern
Note that the MAX-ACBCD is a generalization of the MAX-ACD problem, which leads us to the following lemma.
Lemma 1
The MAX-ACBCD problem is NP-hard.
Proof
Given a classic unsigned instance \(\mathcal {I} = (\pi ,\iota )\) of the MAX-ACD problem and a natural number \(s'\), we create an intergenic unsigned instance \(\mathcal {I'} = ((\pi ',\breve{\pi }'),(\iota ',\breve{\iota }'))\) for the MAX-ACBCD problem and define a natural number \(s''\) as follows: (i) \(\pi ' = \pi \); (ii) \(\iota ' = \iota \); (iii) \(\breve{\pi }' = \breve{\iota }' = (0,0,\dots ,0)\); (iv) \(s'' = s'\). Note that all the edges (reality or desire) in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I'})\) have weight zero and there is no genes present in only one of the genomes, so it follows that any cycle in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I'})\) is balanced and clean. Note that maximizing the number of balanced clean cycles in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I'})\) implies maximizing the number of cycles in \(G_{\mathcal {B}}(\mathcal {I})\). \(\square \)

3.2 An approximation algorithm for the MAX-ACBCD problem

In this section, we adapted the approach presented by Caprara and Rizzi (2002) and Lin and Jiang (2004) to design an approximation algorithm for the MAX-ACD problem. Following their approach, we used known solutions to two sub-problems (Berman and Fürer 1994; Hurkens and Schrijver 1989). These sub-problems are the origin of the parameter \(\epsilon \) in the approximation, and the original papers have more details on how the complexity of the algorithm is affected by this parameter.
For the results in this section, we consider an arbitrary optimal decomposition of \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) into edge-disjoint alternating cycles. The goal is to show that the algorithm finds a solution with a constant factor of \(c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\), which is the cost of the optimal decomposition. For \(k \ge 2\), \(c^{clean}_{bk}\) denotes the maximum number of clean balanced k-cycles obtained from the fixed decomposition. Note that \(c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})) = \sum _{k \ge 2} c^{clean}_{bk}\).
The main idea is to obtain the best solution considering only cycles with two reality edges and considering cycles with up to three reality edges. It is important to note that, for any \(k \ge 2\), finding the largest collection of edge-disjoint cycles with at most k reality edges in \(G_{\mathcal {B}}\) is NP-hard (Berman and Karpinski 1999). Consequently, this result is also valid considering the \(G_{\mathcal{I}\mathcal{B}}\) graph.
Using the \(G_{\mathcal{I}\mathcal{B}}\) graph, create another graph \(G^*\) as follows: (i) for each clean balanced 2-cycle in \(G_{\mathcal{I}\mathcal{B}}\) add a vertex in \(G^*\); (ii) add an edge between two vertices in \(G^*\) if and only if the corresponding 2-cycles in \(G_{\mathcal{I}\mathcal{B}}\) share at least one edge.
Lemma 2
For any graph \(G^*\) created from any graph \(G_{\mathcal{I}\mathcal{B}}\), we have that \(\varDelta (G^*) \le 6\).
Proof
Caprara and Rizzi (2002) showed that \(G^*\) created from the \(G_{b}\) graph has a maximum degree of six. The proof is based on the maximum number of edges shared by two different 2-cycles and the maximum number of 2-cycle that an edge can belong to. Since the cycles structures in \(G_{b}\) and \(G_{\mathcal{I}\mathcal{B}}\) are the same, the result holds considering \(G^*\) created from \(G_{\mathcal{I}\mathcal{B}}\). \(\square \)
Given a graph \(G=(V, E)\), an independent set S is a set of vertices of G, such that each edge in G has at most one endpoint in S. In other words, for every pair of vertices in S, there is no edge connecting them in G. The Maximal Independent Set (MIS) problem is an NP-hard problem (Karp 1972), in which the goal is to find the maximum independent set in an input graph \(G=(V, E)\). The MIS problem does not admit an algorithm with a constant approximation factor (Garey and Johnson n.d.). However, when the maximum degree of a graph is upper bounded by B, with \(B > 2\), the problem admits algorithms with a constant approximation factor (Papadimitriou and Yannakakis 1991). In that case, B is called the graph degree.
Lemma 3
(Berman and Fürer (1994)) Let \(B > 2\) be a graph degree and \(\epsilon > 0\). The MIS problem in bounded degree graphs has an approximation algorithm with a factor of \(\frac{5}{B+3}-\epsilon \) if B is even and \(\frac{5}{B+3.25}-\epsilon \) if B is odd.
Finding the largest collection of edge-disjoint clean balanced 2-cycles in a weighted labeled breakpoint graph \(G_{\mathcal{I}\mathcal{B}}\) is equivalent to the MIS problem in the corresponding graph \(G^{*}\), which leads us to the following corollary.
Corollary 1
Finding the largest collection of edge-disjoint clean balanced 2-cycles in \(G_{\mathcal{I}\mathcal{B}}\) can be approximated in polynomial time with a factor of \(\frac{5}{9}-\epsilon \) for any positive \(\epsilon \).
It is important to mention that the theoretical approximation factor of \(\frac{5}{9}-\epsilon \) is achieved considering the worst scenario, which can happen only in the case when \(G^{*}\) has a degree of six. Let \(c^{clean*}_{b2}\) denote the maximum number of edge-disjoint clean balanced cycles with two reality edges in a given weighted labeled breakpoint graph \(G_{\mathcal{I}\mathcal{B}}\). Note that \(c_{b2}^{clean*} \ge c^{clean}_{b2}\).
Given a set U and a collection \(C=\{C_1,C_2,\dots ,C_m\}\) of subsets of U, the Set Packing (SP) problem seeks the largest sub-collection of pairwise disjoint subsets. The SP problem belongs to the NP-hard class (Karp 1972). If every subset in C has a size at most k, we obtain the k-Set Packing (K-SP) problem. For \(k > 2\) the K-SP problem admits an algorithm with a constant approximation factor, which leads us to the following lemma.
Lemma 4
(Hurkens and Schrijver (1989)) There is an approximation algorithm for the K-SP problem with a factor of \(\frac{2}{k}-\epsilon \) for any positive \(\epsilon \).
Given a weighted labeled breakpoint graph \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})= (V,E,w)\), we create an instance \(\mathcal {I}_{KSP}\) of the K-SP problem as follows: (i) \(U = E\); (ii) for each clean balanced alternating cycle with at most three reality edges add the cycle edges as a subset in C. Observe that each subset in C has a size of at most six since we are dealing with clean balanced alternating cycles with at most three reality edges (and also three desire edges).
Note that finding the largest collection of edge-disjoint clean balanced alternating cycles with at most three reality edges in weighted labeled breakpoint graph \(G_{\mathcal{I}\mathcal{B}}\) is equivalent to the K-SP problem in the corresponding instance \(\mathcal {I}_{KSP}\), which leads us to the following corollary.
Corollary 2
Finding the largest collection of edge-disjoint clean balanced alternating cycles with at most three reality edges in \(G_{\mathcal{I}\mathcal{B}}\) can be approximated in polynomial time with a factor of \(\frac{1}{3}-\epsilon \) for any positive \(\epsilon \).
Let \(c^{clean*}_{b3}\) denote the maximum number of edge-disjoint clean balanced cycles with up to three reality edges in a given weighted labeled breakpoint graph \(G_{\mathcal{I}\mathcal{B}}\). Note that \(c_{b3}^{clean*} \ge c^{clean}_{b2} + c^{clean}_{b3}\). Let \(z_2\) and \(z_3\) denote the number of edge-disjoint balanced cycles obtained from corollaries 1 and 2, respectively. Observe that \(z_2 \ge (\frac{5}{9}-\epsilon )c_{b2}^{clean*} \ge (\frac{5}{9}-\epsilon )c^{clean}_{b2}\) and \(z_3 \ge (\frac{1}{3}-\epsilon )c_{b3}^{clean*} \ge (\frac{1}{3}-\epsilon )(c^{clean}_{b2}+c^{clean}_{b3})\). Now we derive the following theorem.
Theorem 1
Given an intergenic instance \(\mathcal {I} = ((\pi ,\breve{\pi }),(\iota ,\breve{\iota }))\), the parameter \(ib(\mathcal {I}) - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\) can be approximated within \(\frac{31}{21}+\epsilon \), for any positive \(\epsilon \), in polynomial time.
Proof
We will use a proof similar to that presented by Lin and Jiang (2004) considering the clean balanced cycles in the weighted labeled breakpoint graph. Let \(c_{b}^{clean*}\) denote the largest collection of clean balanced cycles between \(z_2\) and \(z_3\). Adopting \(\epsilon = \frac{5}{3}\eta \) in Corollary 1 and \(\epsilon =\eta \) in Corollary 2, we obtain that:
$$\begin{aligned} c_{b}^{clean*} \ge \max \Bigg \{\Big (\frac{5}{9}-\frac{5}{3}\eta \Big )c^{clean}_{b2}, \Big (\frac{1}{3}-\eta \Big )\Big (c^{clean}_{b2} + c^{clean}_{b3}\Big )\Bigg \}. \end{aligned}$$
Since each intergenic breakpoint in \(\mathcal {I}\) implies in one reality edge in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\), we have that \(c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})) \le c^{clean}_{b2} + c^{clean}_{b3} + \frac{ib(\mathcal {I}) - 2c^{clean}_{b2} - 3c^{clean}_{b3}}{4}\). Note that \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) has \(ib(\mathcal {I})\) reality edges and removing the reality edges used by the clean balanced cycles with two and three reality edges in an optimal decomposition remains \(ib(\mathcal {I}) - 2c^{clean}_{b2} - 3c^{clean}_{b3}\) reality edges. In the best scenario, a collection of edge-disjoint clean balanced cycles with four reality edges each can be obtained. Therefore, we have that:
$$\begin{aligned} \frac{ib(\mathcal {I}) - c^{clean*}_{b}}{ib(\mathcal {I}) - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))} \le \frac{ib(\mathcal {I}) - c_{b}^{clean*}}{ib(\mathcal {I}) - c^{clean}_{b2} - c^{clean}_{b3} - \frac{ib(\mathcal {I}) - 2c^{clean}_{b2} - 3c^{clean}_{b3}}{4}}. \end{aligned}$$
Now, we will consider the maximum and the minimum value that \(ib(\mathcal {I})\) can assume and the ratio achieved. Observe that \(2c^{clean}_{b2} + 3c^{clean}_{b3}\le ib(\mathcal {I}) \le \infty \). Setting \(ib(\mathcal {I})\) to be \(\infty \) the ratio \(\frac{4}{3}\) is achieved. Considering the value of \(ib(\mathcal {I})\) as \(2c^{clean}_{b2} + 3c^{clean}_{b3}\), we obtain the ratio of \(\frac{2c^{clean}_{b2} + 3c^{clean}_{b3} - c_{b}^{clean*}}{c^{clean}_{b2} + 2c^{clean}_{b3}}\). Now, we analyze the two following cases:
  • Case 1: \(\left( \frac{5}{9}-\frac{5}{3}\eta \right) c^{clean}_{b2} \ge \left( \frac{1}{3}-\eta \right) (c^{clean}_{b2} + c^{clean}_{b3})\). In this case, we have that \(c^{clean}_{b3} \le \frac{\frac{2 - 6\eta }{9}}{\frac{1 - 3\eta }{3}}c^{clean}_{b2} = \frac{2}{3}c^{clean}_{b2}\). Thus,
    $$\begin{aligned}&\frac{2c^{clean}_{b2} + 3c^{clean}_{b3} - c_{b}^{clean*}}{c^{clean}_{b2} + 2c^{clean}_{b3}} \\ =&\frac{(\frac{13}{9} + \frac{5}{3}\eta )c^{clean}_{b2} + 3c^{clean}_{b3}}{c^{clean}_{b2} + 2c^{clean}_{b3}} \\ =&\frac{\frac{13c^{clean}_{b2} + 15\eta c^{clean}_{b2} + 27c^{clean}_{b3}}{9}}{\frac{c^{clean}_{b2} + 2c^{clean}_{b3}}{1}} \\ =&\frac{13c^{clean}_{b2} + 15\eta c^{clean}_{b2} + 27c^{clean}_{b3}}{9c^{clean}_{b2} + 18c^{clean}_{b3}} \\ \le&\frac{13c^{clean}_{b2} + 15\eta c^{clean}_{b2} + 18c^{clean}_{b2}}{9c^{clean}_{b2} + 12c^{clean}_{b2}} \\ =&\frac{31c^{clean}_{b2} + 15\eta c^{clean}_{b2}}{21c^{clean}_{b2}} \\ =&\frac{31}{21} + \frac{5}{7}\eta . \end{aligned}$$
  • Case 2: \(\left( \frac{5}{9}-\frac{5}{3}\eta \right) c^{clean}_{b2} \le \left( \frac{1}{3}-\eta )(c^{clean}_{b2} + c^{clean}_{b3}\right) \). In this case, we have that \(c^{clean}_{b2} \le \frac{\frac{1 - 3\eta }{3}}{\frac{2 - 6\eta }{9}}c^{clean}_{b3} = \frac{3}{2}c^{clean}_{b3}\). Thus,
    $$\begin{aligned}&\frac{2c^{clean}_{b2} + 3c^{clean}_{b3} - c_{b}^{clean*}}{c^{clean}_{b2} + 2c^{clean}_{b3}} \\ =&\frac{\left( \frac{5}{3} + \eta \right) c^{clean}_{b2} + \left( \frac{8}{3} + \eta \right) c^{clean}_{b3}}{c^{clean}_{b2} + 2c^{clean}_{b3}} \\ =&\frac{\frac{5c^{clean}_{b2} + 3\eta c^{clean}_{b2} + 8c^{clean}_{b3} + 3\eta c^{clean}_{b3}}{3}}{\frac{c^{clean}_{b2} + 2c^{clean}_{b3}}{1}} \\ =&\frac{5c^{clean}_{b2} + 3\eta c^{clean}_{b2} + 8c^{clean}_{b3} + 3\eta c^{clean}_{b3}}{3c^{clean}_{b2} + 6c^{clean}_{b3}} \\ \le&\frac{\frac{15c^{clean}_{b3}}{2} + \frac{9c^{clean}_{b3}}{2}\eta + 8c^{clean}_{b3} + 3\eta c^{clean}_{b3}}{\frac{9c^{clean}_{b3}}{2} + 6c^{clean}_{b3}} \\ =&\frac{31c^{clean}_{b3} + 15\eta c^{clean}_{b3}}{21c^{clean}_{b3}} \\ =&\frac{31}{21} + \frac{5}{7}\eta . \end{aligned}$$
Adopting \(\eta =\frac{7}{5}\epsilon \), we have that \(\frac{ib(\mathcal {I}) - c_{b}^{clean*}}{ib(\mathcal {I}) - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))} \le \frac{31}{21} + \epsilon \), and the theorem follows. \(\square \)

3.3 Improved approximation algorithm for the \(S_BRII\) problem on signed intergenic instances

In this section, we present a \(\frac{3}{2}\)-approximation algorithm for the SbRII problem on signed intergenic instances based on the weighted labeled breakpoint graph structure. For this problem, we assume that the compared genomes are related. Consequently, the Weighted Labeled Breakpoint Graph does not have any labeled edge. In the following, we present lemmas that are used by the algorithm steps.
Lemma 5
(Lemma 7.2 from Oliveira et al. (2021)) Given a signed intergenic instance \(\mathcal {I}\), \(d_{\textsc {RII}}(\mathcal {I}) \ge {n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\).
Lemma 6
Let C be a trivial negative cycle in \(G_{\mathcal{I}\mathcal{B}}^{+}\), then there is an intergenic indel that increases the number of balanced cycles by one.
Proof
Note that an intergenic indel changes the weight of only one reality edge in \(G_{\mathcal{I}\mathcal{B}}^{+}\). Besides, a trivial cycle has only two edges (one reality and one desire). In this case, the intergenic indel decreases the weight of the reality edge to be equal to the weight present in the desire edge of the cycle, and the lemma follows. \(\square \)
Lemma 7
Let C be a positive cycle in \(G_{\mathcal{I}\mathcal{B}}^{+}\), then there is an intergenic indel that increases the number of balanced cycles by one.
Proof
In this case, just apply an intergenic indel on the first reality edge of the cycle to increase its weight in order to turn C into a balanced cycle. \(\square \)
Lemma 8
(Lemma 5.1 from Oliveira et al. (2021)) Let C be a non-trivial divergent cycle in \(G_{\mathcal{I}\mathcal{B}}^{+}\) that is either balanced or negative, then there is a reversal that increases the number of balanced cycles by one.
Next, we present two important remarks that are fundamental to proving the next lemma.
Remark 1
Given a signed intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), if all the non-trivial cycles in \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) are convergent, then each element \(\pi _i\) of \(\pi \) has a positive sign.
Remark 2
Given a signed intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), if any element \(\pi _i\) of \(\pi \) has a negative sign, then \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) has at least one divergent cycle.
Now, we show how to deal with a weighted labeled breakpoint graph in which each non-trivial cycle is either balanced convergent or negative convergent.
Lemma 9
Given a signed intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\) and let \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) be a graph in which each non-trivial cycle is either balanced convergent or negative convergent. Then there is a sequence of three reversals that increases the number of balanced cycles by two.
Proof
Oliveira et al. (2021) showed that given a negative or balanced convergent cycle C from \(G_{\mathcal{I}\mathcal{B}}^{+}\) in which all negative and balanced cycles are convergent, then it is possible to increase the number of balanced cycles using two reversals. Since each non-trivial cycle in \(G_{\mathcal{I}\mathcal{B}}^{+}\) is either balanced convergent or negative convergent, the sequence of two reversals described by Oliveira et al. (2021) can be applied. Besides, we will show that there exists a third reversal that increases the number of balanced cycles by one.
Before applying any reversal note that, by Remark 1, each element \(\pi _i\) of \(\pi \) has a positive sign. The first reversal is used to create a divergent cycle C in \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) without splitting the cycles. This can be done using only one reversal (see Lemma 5.2 from Oliveira et al. (2021)).
After creating a divergent cycle, a reversal from Lemma 8 can be applied on cycle C to increase the number of balanced cycles by one unit. Observe that the second reversal splits C generating a new balanced cycle, then each non-trivial cycle in \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) remains balanced or negative.
Now, we show that there is at least one divergent cycle in \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) and Lemma 8 can be applied again. Note that the first reversal creates a segment in the permutation \(\pi \), such that the elements have a negative sign. After the second reversal, at least one element from \(\pi \) remains with a negative sign. Otherwise, the second reversal would have undone the action of the first, which implied that after applying these two reversals the resulting permutation would be equal to the original permutation and, therefore, the number of cycles would not increase after these reversals. Since at least one element \(\pi _i\) of \(\pi \) has a negative sign, by Remark 2, we know that \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})\) has at least one divergent (balanced or negative) cycle and Lemma 8 can be applied. Thus, a sequence of three reversals increases the number of balanced cycles in two units and the lemma follows. \(\square \)
Now consider Algorithm 1, which is divided into four steps. Step I: turn trivial negative cycles into balanced. Step II: turn positive cycles into balanced. Step III: deal with divergent cycles. Step IV: deal with convergent cycles.
Algorithm 1
The input is a signed intergenic instance \(\mathcal {I} = ((\pi ,\breve{\pi }),(\iota ,\breve{\iota }))\) and the output is a sequence S of reversal and intergenic indel operations such that \((\pi , \breve{\pi }) \cdot S = (\iota ,\breve{\iota })\).
Bild vergrößern
Lemma 10
Given a signed intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), Algorithm 1 turns \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) using at most \(\frac{3({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})))}{2}\) reversals and intergenic indels.
Proof
Observe that if Algorithm 1 reaches Step III, then steps I and II ensure that any trivial cycle is balanced and any non-trivial cycle is either negative or balanced. Note that both steps I and II increase the number of balanced cycles by one unit each. If exists any divergent cycle, then Step III also increases the number of balanced cycles by one unit. Otherwise, each non-trivial cycle is either balanced convergent or negative convergent and Step IV increases the number of balanced cycles in two units. Since each iteration applies one of the four steps and the number of balanced cycles increases by at least one unit, then eventually \((\pi ,\breve{\pi })\) will be turned into \((\iota ,\breve{\iota })\) and the algorithm stops. Note that to turn \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) it is necessary to increase the number of balanced cycles in \({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\) units. The worst case of Algorithm 1 occurs in Step IV, which has a cost of \(\frac{3}{2}\) operations per balanced cycle. Thus, no more than \(\frac{3({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})))}{2}\) reversals and intergenic indels are used to turn \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\), and the lemma follows. \(\square \)
Note that in Algorithm 1 steps I, II, and III require a linear time to be performed, while Step IV requires a quadratic time. Since the number of iterations is \(\mathcal {O}(n)\), then the time complexity of Algorithm 1 is \(\mathcal {O}(n^3)\).
Theorem 2
Given a signed intergenic instance \(\mathcal {I}\), Algorithm 1 is an approximation with factor \(\frac{3}{2}\) for the SbRII problem.
Proof
Directly by lemmas 5 and 10. \(\square \)

3.4 Algorithms for SbDCJ and SbDCJII

This section recalls properties of the signed SbDCJ approximation algorithm by Fertin et al. (2017) and shows how to adapt it to unsigned instances. We also handle unbalanced related genomes by allowing intergenic indels. In that case, the procedure yields an exact solution for SbDCJII.
For the SbDCJ problem, Fertin et al. (2017) used the weighted breakpoint graph, which is the weighted labeled breakpoint graph without the labels, to approximate the distance. They showed that the distance is equal to \(e-c+2m\), where e is the number of reality edges in the graph (in our case \(n+1\)), c is the number of cycles in the graph, and m is the minimum number of cycles merges, which are DCJ operations that can combine two cycles into one, necessary to turn all cycles of the graph balanced. Their algorithm finds a number \(m'\) of merges that approximates the value of m and returns an approximated distance of \(n+1-c+2m'\).
Lemma 11
Given a signed intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), the algorithm from Fertin et al. (2017) uses at most \(2({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})))\) DCJs to turn \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\).
Proof
The genomes are balanced, so if we have u unbalanced cycles, in the worst case, we can use \(u - 1\) DCJs to combine all of them into a balanced cycle. Consequently, \(m' \le u - 1\) and \(n+1-c+2m' < n + 1 - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})) - u + 2u = n + 1 - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})) + u\). As u is at most \(n+1\), we have \(n + 1 - c_b(G_{\mathcal{I}\mathcal{B}}^{+}) + u \le 2({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})))\) and the lemma follows. \(\square \)
If we include intergenic indels, we can find a stronger result. Note that any cycle of \(G_{\mathcal{I}\mathcal{B}}^{+}\) can be turned into a balanced cycle by adding weight to one of its reality edges or to one of its desire edges. Adding weight to one of the reality edges corresponds to an intergenic insertion. On the other hand, adding weight to one of the desire edges can be seen as an intergenic deletion at the end of the sequence of operations, because at the end we will have a balanced trivial cycle with this edge and we can remove the weights from both the reality and desire edge of this cycle to recover the original weight of the desire edge. Now consider Algorithm 2, which uses one intergenic indel to balance each cycle and finds the remaining DCJ operations from the algorithm from Fertin et al. (2017).
Lemma 12
Given a signed intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), Algorithm 2 turns \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) using at most \({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\) DCJs and intergenic indels.
Proof
Assuming that we have u unbalanced cycles, we are using u indels to balance the cycles. After that, we no longer require merges. Hence, the algorithm from Fertin et al. (2017) requires \(n+1-c=n+1-c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))-u\) operations. In total we have \({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\) and the lemma follows. \(\square \)
Additionally, we can use the following lower bound to show that this algorithm is exact.
Lemma 13
Considering a signed intergenic instance \(\mathcal {I}\) for the SbDCJII problem, we have that \(d_{\textsc {DCJII}}(\mathcal {I}) \ge {n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\).
Proof
After the application of all DCJ operations that turn one genome into another, we have \(n+1\) balanced cycles. A DCJ can only increase the number of balanced cycles by combining two unbalanced cycles into one balanced cycle, by breaking an unbalanced cycle into an unbalanced and a balanced cycle, or by breaking a balanced cycle into two balanced cycles. An indel can only increase the number of balanced cycles by changing the weight of one reality edge and increasing the number of balanced cycles by one. In all cases, the number of balanced cycles only increases by 1. Consequently, we need at least \({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}))\) DCJs and intergenic indels to turn one genome into another. \(\square \)
Theorem 3
Algorithm 2 finds the exact solution for the SbDCJII problem.
Proof
Direct from lemmas 12 and 13. \(\square \)

3.5 Improved approximation algorithms for rearrangement problems on unsigned intergenic instances

In this section, we present approximation algorithms for the SbR, SbRII, and RTID problems, on unsigned intergenic instances, with factors of 2k, \(\frac{3k}{2}\), and 4k, respectively. The value of k in both algorithms is \(\frac{31}{21} + \epsilon < 1.48 + \epsilon \). Note that the strategies used in this section can be adapted to convert other approximations for signed intergenic rearrangement problems into approximations for unsigned intergenic rearrangement problems. In the following, we present a lower bound for both problems and lemmas that are used by the algorithms to ensure the approximation ratio.
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\) and an event \(\sigma \), \(\varDelta ib(I, \sigma ) = ib(\mathcal {I}') - ib(\mathcal {I})\), such that \(\mathcal {I}' = ((\pi ,\breve{\pi })\cdot \sigma , (\iota ,\breve{\iota }))\), denotes the variation in the number of intergenic breakpoints after the application of \(\sigma \) on the source genome from \(\mathcal {I}\). Next, we present lemmas used to derive a lower bound for the problems.
Lemma 14
For every unsigned intergenic instance \(\mathcal {I}\) and reversal \(\rho \), we have that \(\varDelta c^{clean}_b(I, \rho ) - \varDelta ib(I, \rho ) \le 1\).
Proof
Note that a reversal \(\rho \) removes the reality edges \((\bar{\pi }_{i-1},\bar{\pi }_i)\) and \((\bar{\pi }_{j},\bar{\pi }_{j+1})\) from \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) and may add the reality edges \((\bar{\pi }_{i-1},\bar{\pi }_j)\) and \((\bar{\pi }_{i},\bar{\pi }_{j+1})\). Besides, observe that each reality edge in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) represents an intergenic breakpoint.
Suppose that the number of reality edges after applying the reversal \(\rho \) does not change, it implies that \(\varDelta ib(I, \rho ) = 0\). However, in the best scenario, the number of balanced clean cycles may increase at most by one unit, either by merging two unbalanced or labeled cycles generating a new balanced clean cycle or splitting a cycle into two new cycles. In this case, we have that \(\varDelta c^{clean}_b(I, \rho ) - \varDelta ib(I, \rho ) \le 1\).
Now, suppose that the number of reality edges after applying the reversal \(\rho \) decreases by one unit, it implies that \(\varDelta ib(I, \rho ) = -1\). Note that to remove one reality edge from \(G_{\mathcal{I}\mathcal{B}}\) the reversal \(\rho \) affected a k-cycle generating a \(k-1\)-cycle and one intergenic adjacency, but the number of balanced clean cycles remains the same (\(\varDelta c^{clean}_b(I, \rho ) = 0\)), which lead us to \(\varDelta c^{clean}_b(I, \rho ) - \varDelta ib(I, \rho ) = 1\).
Lastly, suppose that the number of reality edges after applying the reversal \(\rho \) decreases by two units, it implies that \(\varDelta ib(I, \rho ) = -2\). Note that to remove two reality edges from \(G_{\mathcal{I}\mathcal{B}}\) the reversal \(\rho \) must affect a balanced clean 2-cycle generating two intergenic adjacencies. Consequently, we have that \(\varDelta c^{clean}_b(I, \rho ) = -1\) and \(\varDelta c^{clean}_b(I, \rho ) - \varDelta ib(I, \rho ) = 1\).
Considering the three possibilities, we have that \(\varDelta c^{clean}_b(I, \rho ) - \varDelta ib(I, \rho ) \le 1\), and the lemma follows. \(\square \)
Lemma 15
For every unsigned intergenic instance \(\mathcal {I}\) and indel \(\delta \), we have that \(\varDelta c^{clean}_b(I, \delta ) - \varDelta ib(I, \delta ) \le 1\). That result is also valid for intergenic indels.
Proof
Note that a deletion \(\psi \) or intergenic indel \(\delta i\) only affects a reality edge \((\bar{\pi }_{i-1},\bar{\pi }_i)\) in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) changing its weight or label. We have three possibilities:
  • The intergenic indel affects one reality edge of a k-cycle, such that \(k > 1\), turning it into a balanced clean cycle. In this case, we have that \(\varDelta c^{clean}_b(I, \delta i) = 1\), \(\varDelta ib(I, \delta i) = 0\), and \(\varDelta c^{clean}_b(I, \delta i) - \varDelta ib(I, \delta i) = 1\).
  • The reality edge affected by the intergenic indel belongs to a 1-cycle (recall that every clean 1-cycle in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) is labeled or unbalanced) and generates a new intergenic adjacency. In this case, we have that \(\varDelta c^{clean}_b(I, \delta i) = 0\), \(\varDelta ib(I, \delta i) = -1\), and \(\varDelta c^{clean}_b(I, \delta i) - \varDelta ib(I, \delta i) = 1\).
  • The intergenic indel affects one reality edge of a k-cycle, turning it into an unbalanced or labeled cycle or keeping it unbalanced or labeled. In this case, we have that \(\varDelta c^{clean}_b(I, \delta i) < 1\), \(\varDelta ib(I, \delta i) = 0\), and \(\varDelta c^{clean}_b(I, \delta i) - \varDelta ib(I, \delta i) < 1\).
Considering the three possibilities, we have that \(\varDelta c^{clean}_b(I, \delta i) - \varDelta ib(I, \delta i) \le 1\).
An insertion \(\phi \) that inserts k genes will add k vertices to the graph in the place of one reality edge \((\bar{\pi }_{i-1},\bar{\pi }_i)\) of a cycle C. Considering \(\bar{\pi }_{i-1}\), \(\bar{\pi }_i\), and the k new vertices, each new cycle that is created will also require the creation of at least one reality edge. Besides, the cycle C may be transformed into a clean balanced cycle. Consequently, if C is made clean and balanced and t new clean balanced cycles are created then \(\varDelta c^{clean}_b(I, \delta i) = 1 + b\), \(\varDelta ib(I, \delta i) \ge b\), and \(\varDelta c^{clean}_b(I, \delta i) - \varDelta ib(I, \delta i) \le 1\). \(\square \)
Lemma 16
For every unsigned intergenic instance \(\mathcal {I}\) and DCJ \(\gamma \), we have that \(\varDelta c^{clean}_b(I, \gamma ) - \varDelta ib(I, \gamma ) \le 1\).
Proof
Note that a DCJ \(\gamma \) removes the reality edges \((\bar{\pi }_{i-1},\bar{\pi }_i)\) and \((\bar{\pi }_{j},\bar{\pi }_{j+1})\) from \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) and may add the reality edges \((\bar{\pi }_{i-1},\bar{\pi }_j)\) and \((\bar{\pi }_{i},\bar{\pi }_{j+1})\) or \((\bar{\pi }_{i-1},\bar{\pi }_{j+1})\) and \((\bar{\pi }_{i},\bar{\pi }_{j})\). Consequently, we can use the same arguments as in Lemma 14 to ensure that \(\varDelta c^{clean}_b(I, \gamma ) - \varDelta ib(I, \gamma ) \le 1\). \(\square \)
Lemma 17
For every unsigned intergenic instance \(\mathcal {I}\) and transposition \(\tau \), we have that \(\varDelta c^{clean}_b(I, \tau ) - \varDelta ib(I, \tau ) \le 2\).
Proof
A transposition \(\tau \) removes the reality edges \((\bar{\pi }_{i-1},\bar{\pi }_i)\), \((\bar{\pi }_{j-1},\bar{\pi }_{j})\), and \((\bar{\pi }_{k-1},\bar{\pi }_{k})\) from \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) and may add the reality edges \((\bar{\pi }_{i-1},\bar{\pi }_j)\), \((\bar{\pi }_{k-1},\bar{\pi }_{i})\), and \((\bar{\pi }_{j-1},\bar{\pi }_{k})\). Besides, observe that each reality edge in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) represents an intergenic breakpoint.
Suppose that the number of reality edges after applying the transposition \(\tau \) does not change, which implies that \(\varDelta ib(I, \tau ) = 0\). However, in the best scenario, the number of balanced clean cycles may increase at most by two units, either by affecting two unbalanced or labeled cycles, generating two new balanced clean cycles, or splitting a cycle into three new cycles. In this case, we have that \(\varDelta c^{clean}_b(I, \tau ) - \varDelta ib(I, \tau ) \le 2\).
Now, suppose that the number of reality edges after applying the transposition \(\tau \) decreases by one unit, which implies that \(\varDelta ib(I, \tau ) = -1\). Note that to remove one reality edge from \(G_{\mathcal{I}\mathcal{B}}\) the transposition \(\tau \) affects two cycles, generating a non-trivial cycle and one intergenic adjacency, or splits a cycle creating one intergenic adjacency and two other cycles. At best, we have \(\varDelta c^{clean}_b(I, \tau ) = 2\), which leads to \(\varDelta c^{clean}_b(I, \tau ) - \varDelta ib(I, \tau ) = 1\).
Lastly, suppose that the number of reality edges after applying the transposition \(\tau \) decreases in \(k \in \{2,3\}\) units, which implies that \(\varDelta ib(I, \tau ) = -k\). Note that to remove two reality edges from \(G_{\mathcal{I}\mathcal{B}}\) the transposition \(\tau \) must affect a balanced clean cycle generating k intergenic adjacencies. Consequently, we have that \(\varDelta c^{clean}_b(I, \tau ) = 0\), for \(k=2\), or \(\varDelta c^{clean}_b(I, \tau ) = -1\), for \(k=3\), and \(\varDelta c^{clean}_b(I, \tau ) - \varDelta ib(I, \tau ) = 2\), for both cases.
Considering the three possibilities, we have that \(\varDelta c^{clean}_b(I, \tau ) - \varDelta ib(I, \tau ) \le 2\), and the lemma follows. \(\square \)
Lemma 18
Given an unsigned intergenic instance \(\mathcal {I}\), we have that \(d_{\textsc {R}}(\mathcal {I}) \ge ib(\mathcal {I}) - c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\), \(d_{\textsc {RII}}(\mathcal {I}) \ge ib(\mathcal {I}) - c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\), \(d_{\textsc {RTI}}(\mathcal {I}) \ge \frac{ib(\mathcal {I}) - c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))}{2}\), \(d_{\textsc {DCJ}}(\mathcal {I}) \ge ib(\mathcal {I}) - c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\), and \(d_{\textsc {DCJII}}(\mathcal {I}) \ge ib(\mathcal {I}) - c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\).
Proof
The proof is similar to that presented in Theorem 2 from Bafna and Pevzner (1996) considering that \(\varDelta c_b(I, \sigma ) - \varDelta ib(I, \sigma ) \le 1\) for any \(\sigma \in \{\rho ,\delta i, \gamma \}\) and \(\varDelta c_b(I, \sigma ) - \varDelta ib(I, \sigma ) \le 2\) for \(\sigma = \tau \) (lemmas 141516 and 17). \(\square \)
Given an unsigned intergenic instance \(\mathcal {I}\) and a decomposition D of \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) into edge-disjoint alternating cycles, \(c_b^{clean,D}(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\) denotes the number of clean balanced cycles obtained in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) from the D decomposition.
Next, we present important characteristics considering a signed intergenic instance induced from a decomposition of \(G_{\mathcal{I}\mathcal{B}}\) into edge-disjoint alternating cycles.
Lemma 19
Let \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\) be an unsigned intergenic instance and D a decomposition of \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) into edge-disjoint alternating cycles, then we have that:
$$\begin{aligned} ib(\mathcal {I}) - c_b^{clean,D}(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})) = |\bar{\pi }|-1 - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}')), \end{aligned}$$
such that \(\mathcal {I}'\) is a signed intergenic instance induced by D.
Proof
Note that \(c^{clean}_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}')) = tc_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}')) + ntc_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}'))\). Besides, the maximum number that \(ib(\mathcal {I})\) can reach is \(|\bar{\pi }| - 1\), and each trivial balanced cycle in \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}')\) represents an intergenic adjacency in \(\mathcal {I}\). Thus, we have that \(ib(\mathcal {I}) = |\bar{\pi }|-1 - tc_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}'))\). Since \(\mathcal {I}'\) was induced by the decomposition D of \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\), then for each non-trivial balanced cycle in \(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}')\) we have a corresponding balanced cycle with the same number of reality edges in \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\). Therefore, \(c^{clean}_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}')) = ntc_b^{clean}(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}'))\) and the lemma follows. \(\square \)
Remark 3
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\) and a D decomposition of \(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})\) into edge-disjoint alternating cycles. Let \(\mathcal {I}'=((\pi ',\breve{\pi }'), (\iota ',\breve{\iota }'))\) be a signed intergenic instance induced by D, then any sequence of operations that transforms \((\pi ',\breve{\pi }')\) into \((\iota ',\breve{\iota }')\) also transforms \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\).
Note that the results for the SbR problem on signed intergenic instances is important to derive results on unsigned intergenic instances, which leads us to the following lemma.
Lemma 20
(Algorithm 1 from Oliveira et al. (2021)) Given a signed intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), created from related genomes, there is an algorithm that turns \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) using at most \(2({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})))\) reversals.
Let \(ALG_D\) an algorithm for the MAX-ACBCD problem that follows the steps described in Theorem 1 and ensures an approximation ratio of \(k=\frac{31}{21}+\epsilon \) for the parameter \(ib(\mathcal {I}) - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))\). Now consider the Algorithm 3 for the SbR problem on unsigned intergenic instances.
Bild vergrößern
Lemma 21
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), Algorithm 3 turns \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) using at most \(2k(ib(\mathcal {I}) - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})))\) reversals, such that \(k=\frac{31}{21} + \epsilon \).
Proof
Note that, by Theorem 1, we have that \(ib(\mathcal {I}) - c_b^{clean,D}(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})) \le k(ib(\mathcal {I}) - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})))\). Since \(\mathcal {I}'\) was induced by D, then the sequence S of reversals provided by the algorithm from Oliveira et al. (2021) also transforms \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) (Remark 3). By Lemma 20, the sequence S has no more than \(2({n+1} - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}')))\) reversals. By Lemma 19, \(ib(\mathcal {I}) - c_b^{clean,D}(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})) = n+1 - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I}'))\), so the sequence of events S has no more than \(2(ib(\mathcal {I}) - c_b^{clean,D}(G_{\mathcal{I}\mathcal{B}}(\mathcal {I}))) \le 2k(ib(\mathcal {I}) - c^{clean}_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})))\) reversals, and the lemma follows. \(\square \)
Theorem 4
Considering an unsigned intergenic instance \(\mathcal {I}\), Algorithm 3 is a 2k-approximation for the SbR problem, such that \(k=\frac{31}{21}+\epsilon \).
Proof
Straightforward from lemmas 18 and 21. \(\square \)
Now consider new algorithms, which are similar to Algorithm 3, but using other algorithms for the signed problems as follows:
  • Algorithm 4 uses the algorithm from Section 3.3 for the SbRII problem.
  • Algorithm 5 uses Algorithm 2 from Alexandrino et al. (2023) for the RTID problem.
  • Algorithm 6 uses the algorithm from Fertin et al. (2017) for the SbDCJ problem.
  • Algorithm 7 uses Algorithm 2 from Section 3.4 for the SbDCJII problem.
Lemma 22
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), Algorithm 4 turns \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) using at most \(\frac{3k}{2}(ib(\mathcal {I}) - c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})))\) reversals and intergenic indels, such that \(k=\frac{31}{21} + \epsilon \).
Proof
Similar to the proof of Lemma 21, but using Lemma 10 instead of Lemma 20. \(\square \)
Theorem 5
Considering an unsigned intergenic instance \(\mathcal {I}\), Algorithm 4 is a \(\frac{3k}{2}\)-approximation for the SbRII problem, such that \(k=\frac{31}{21}+\epsilon \).
Proof
Straightforward from lemmas 18 and 22. \(\square \)
Lemma 23
(Algorithm 2 from Alexandrino et al. (2023)) Given a signed intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), there is an algorithm that turns \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) using at most \(4({n+1} - c_b(G_{\mathcal{I}\mathcal{B}}^{+}(\mathcal {I})))\) reversals, transpositions, and indels.
Lemma 24
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), Algorithm 5 turns \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) using at most \(4k(ib(\mathcal {I}) - c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})))\) reversals, transpositions, and indels, such that \(k=\frac{31}{21} + \epsilon \).
Proof
Similar to the proof of Lemma 21, but using Lemma 23 instead of Lemma 20. \(\square \)
Theorem 6
Considering an unsigned intergenic instance \(\mathcal {I}\), Algorithm 5 is a 4k-approximation for the SbRII problem, such that \(k=\frac{31}{21}+\epsilon \).
Proof
Straightforward from lemmas 18 and 24. \(\square \)
Lemma 25
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), Algorithm 6 turns \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) using at most \(2k(ib(\mathcal {I}) - c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})))\) DCJs such that \(k=\frac{31}{21} + \epsilon \).
Proof
Similar to the proof of Lemma 21, but using Lemma 11 instead of Lemma 20. \(\square \)
Theorem 7
Considering an unsigned intergenic instance \(\mathcal {I}\), Algorithm 7 is a 2k-approximation for the SbDCJ problem, such that \(k=\frac{31}{21}+\epsilon \).
Proof
Straightforward from lemmas 18 and 25. \(\square \)
Lemma 26
Given an unsigned intergenic instance \(\mathcal {I}=((\pi ,\breve{\pi }), (\iota ,\breve{\iota }))\), Algorithm 7 turns \((\pi ,\breve{\pi })\) into \((\iota ,\breve{\iota })\) using at most \(k(ib(\mathcal {I}) - c_b(G_{\mathcal{I}\mathcal{B}}(\mathcal {I})))\) DCJ and intergenic indels, such that \(k=\frac{31}{21} + \epsilon \).
Proof
Similar to the proof of Lemma 21, but using Lemma 12 instead of Lemma 20. \(\square \)
Theorem 8
Considering an unsigned intergenic instance \(\mathcal {I}\), Algorithm 7 is a k-approximation for the SbDCJII problem, such that \(k=\frac{31}{21}+\epsilon \).
Proof
Straightforward from lemmas 18 and 26. \(\square \)

4 Experimental results

We created a database of synthetic genomes to test in practice the use of the MAX-ACBCD problem to calculate distances for the SbDCJII problem. We chose to perform the tests with this distance because it generalizes well the other distances described in this work, and it has a direct relation with the graphs.
The original algorithms for the subproblems of Section 3.2, although of theoretical interest, have a high complexity to be implemented in practice, so we opted to use greedy approaches as heuristics for the subproblems. Consequently, the algorithm used in these tests does not ensure an approximation. For the MIS problem, we select one of the vertices with the highest degree and remove it from the graph together with its neighbors. We repeat this process until there are no vertices in the graph. For the K-SP problem, we select one of the largest sets and remove it together with all other sets intersecting it. We repeat this process until there are no more sets.
For the database, we created genomes with 500 genes (resulting in 502 genes after the extension). The genomes were created in pairs, such that the first was randomly generated and the second was produced by applying operations of reversal and intergenic indel. We produce multiple sets of genomes with 100 pairs of genomes each. The sets are defined by the values \(r \in \{50,100,150,200\}\) and \(d \in \{10,30,50,70,90\}\), corresponding to the number of reversals and intergenic indels applied, respectively. Each pair of genomes \((\iota ,\breve{\iota })\) and \((\pi ,\breve{\pi })\) was created by the following process (all random choices, using a uniform distribution):
  • For \(\iota \) we use the sequence from 1 to 500 and randomly select 501 numbers between 0 and 100 for \(\breve{\iota }\).
  • For \((\pi ,\breve{\pi })\) we apply r reversals, followed by d intergenic indels, each with parameter x between \(-\breve{\pi }_i\) and 100. The parameters of all operations were randomly chosen.
Table 1 shows the results of the experiments. In our tests, we calculated the distances between pairs of genomes using our algorithm, which is shown in the columns under Decomposition Heuristic, and using a random sign assignment for the genomes to produce a decomposition, which is shown in the columns under Random Signs. For each case, we report the mean and sample standard deviation (std.) of the results for the 100 pairs from each set.
Ideally, we would prefer to have an exact solution to compare against our heuristic to find the exact gap between the results, but the space of possible decompositions of the graphs is too large for the genomes of our database. Consequently, we opt to compare our heuristic with a baseline using random sign assignment to produce a decomposition. We also compare the results against the number of operations used to create the database.
Table 1
Mean and sample standard deviation (std.) for the DCJ and intergenic indel distance considering the generated sets of pairs of genomes and using the decompositions from random sign assignment and from the heuristic
Reversals
Indels
Random Signs
Decomposition Heuristic
mean
std.
mean
std.
50
10
290.89
29.79
68.29
2.78
50
30
299.83
28.52
88.63
3.42
50
50
306.66
27.49
107.80
3.67
50
70
314.61
26.86
125.75
4.42
50
90
322.62
25.33
141.74
5.02
100
10
323.62
21.01
136.96
4.49
100
30
331.10
20.34
156.31
4.38
100
50
337.42
19.82
172.91
4.92
100
70
344.32
18.34
188.17
5.25
100
90
350.41
18.26
202.94
5.75
150
10
358.35
17.33
204.77
5.35
150
30
364.00
16.67
219.75
5.75
150
50
369.66
16.36
232.77
5.83
150
70
374.83
15.57
245.66
6.12
150
90
379.97
15.34
257.52
5.93
200
10
383.19
14.42
260.14
6.95
200
30
387.33
14.49
270.74
6.66
200
50
392.60
14.15
281.51
7.19
200
70
396.89
13.07
292.24
6.72
200
90
400.71
12.83
299.77
7.22
Looking at these results, we can see that our heuristic finds distances considerably shorter on average than the ones found by the random sign assignment. The lower the number of reversals applied, the bigger the gap between the two methods. We also note that the heuristic has lower variability (standard deviation) than the random sign assignment.

5 Conclusion

In this work, we introduced a generalization of the Maximum Alternating Cycle Decomposition problem, named the Maximum Alternating Clean Balanced Cycle Decomposition problem. The exploration of this generalization by itself is of significant interest in combinatorial optimization, as it incorporates additional information into the graph model and reuses the same reduction toolkit as in the original problem, with suitable adaptations. We also designed an enhanced algorithm for the Sorting by Reversals and Intergenic Indels on signed intergenic instances, which guarantees an approximation factor of \(\frac{3}{2}\). Combined with an algorithm for the Maximum Alternating Clean Balanced Cycle Decomposition problem, we showed how to obtain better approximation algorithms for multiple rearrangement distance problems on unsigned intergenic instances.
Our experimental results indicated that a heuristic approach inspired by our algorithm for the decomposition problem finds shorter DCJ and intergenic indel distances between the genomes if compared to the random sign assignment for the genes.
It is important to mention that the use of unsigned genomes is less applicable to a real scenario than the use of signed genomes, but it is still relevant from the theoretical point of view, because the complexity of the problem arises from that scenario. Besides, this problem can still be used when we are not considering gene orientation and can give us insights on the signed variant of the problems.
In future works, practical experiments can be performed to verify the practical applicability of the proposed algorithms, and it is possible to investigate how to achieve better results for intergenic models in unsigned scenarios using the Maximum Alternating Clean Balanced Cycle Decomposition problem. Another direction for future work is to consider problems with more than one copy of each gene.

Acknowledgements

This work was supported by the São Paulo Research Foundation, FAPESP (grant 2021/13824-8).

Declarations

Conflicts of Interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Download
Titel
Maximum alternating clean balanced cycle decomposition and applications in rearrangement distance problems
Verfasst von
Gabriel Siqueira
Klairton Lima Brito
Alexsandro Oliveira Alexandrino
Andre Rodrigues Oliveira
Ulisses Dias
Zanoni Dias
Publikationsdatum
01.03.2026
Verlag
Springer US
Erschienen in
Journal of Combinatorial Optimization / Ausgabe 2/2026
Print ISSN: 1382-6905
Elektronische ISSN: 1573-2886
DOI
https://doi.org/10.1007/s10878-026-01405-8
Zurück zum Zitat Alexandrino AO, Brito KL, Oliveira AR, Dias U, Dias Z (2021) Reversal Distance on Genomes with Different Gene Content and Intergenic Regions Information. In: Algorithms for Computational Biology, vol. 12715, pp 121–133. Springer International Publishing
Zurück zum Zitat Alexandrino AO, Brito KL, Oliveira AR, Dias U, Dias Z (2022) Reversal and indel distance with intergenic region information. IEEE/ACM Transactions on Computational Biology and Bioinformatics pp 1–13
Zurück zum Zitat Alexandrino AO, Oliveira AR, Dias U, Dias Z (2021) Incorporating intergenic regions into reversal and transposition distances with indels. J Bioinform Comput Biol 19(06):2140011CrossRef
Zurück zum Zitat Alexandrino AO, Oliveira AR, Jean G, Fertin G, Dias U, Dias Z (2023) Reversal and transposition distance on unbalanced genomes using intergenic information. J Comput Biol 30(8):861–876MathSciNetCrossRef
Zurück zum Zitat Bafna V, Pevzner PA (1996) Genome rearrangements and sorting by reversals. SIAM J Comput 25(2):272–289MathSciNetCrossRef
Zurück zum Zitat Berman P, Fürer M (1994) Approximating Maximum Independent Set in Bounded Degree Graphs. In: SODA’1994: Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp 365–371. Society for Industrial and Applied Mathematics
Zurück zum Zitat Berman P, Hannenhalli S, Karpinski M (2002) 1.375-Approximation Algorithm for Sorting by Reversals. In: Proceedings of the 10th Annual European Symposium on Algorithms (ESA’2002), Lecture Notes in Computer Science, vol. 2461, pp 200–210. Springer-Verlag, Berlin Heidelberg New York, Berlin/Heidelberg, Germany
Zurück zum Zitat Berman P, Karpinski M (1999) On Some Tighter Inapproximability Results (Extended Abstract). In: Proceedings of the 26th International Colloquium on Automata, Languages and Programming (ICAL’1999), Lecture Notes in Computer Science, vol. 1644, pp. 200–209. Springer-Verlag, London, UK
Zurück zum Zitat Biller P, Knibbe C, Beslon G, Tannier E (2016) Comparative Genomics on Artificial Life. In: Pursuit of the Universal, pp. 35–44. Springer International Publishing
Zurück zum Zitat Brito KL, Jean G, Fertin G, Oliveira AR, Dias U, Dias Z (2020) Sorting by genome rearrangements on both gene order and intergenic sizes. J Comput Biol 27(2):156–174MathSciNetCrossRef
Zurück zum Zitat Bulteau L, Fertin G, Komusiewicz C (2016) (Prefix) reversal distance for (signed) strings with few blocks or small alphabets. J Discrete Algorithms 37:44–55MathSciNetCrossRef
Zurück zum Zitat Bulteau L, Fertin G, Rusu I (2012) Sorting by transpositions is difficult. SIAM J Discret Math 26(3):1148–1180MathSciNetCrossRef
Zurück zum Zitat Bulteau L, Fertin G, Tannier E (2016) Genome rearrangements with indels in intergenes restrict the scenario space. BMC Bioinformatics 17(14):426CrossRef
Zurück zum Zitat Caprara A (1999) On the tightness of the alternating-cycle lower bound for sorting by reversals. J Comb Optim 3(2):149–182MathSciNetCrossRef
Zurück zum Zitat Caprara A (1999) Sorting permutations by reversals and Eulerian cycle decompositions. SIAM J Discret Math 12(1):91–110MathSciNetCrossRef
Zurück zum Zitat Caprara A, Rizzi R (2002) Improved approximation for breakpoint graph decomposition and sorting by reversals. J Comb Optim 6(2):157–182MathSciNetCrossRef
Zurück zum Zitat Chen X (2013) On sorting unsigned permutations by double-cut-and-joins. J Comb Optim 25(3):339–351MathSciNetCrossRef
Zurück zum Zitat Elias I, Hartman T (2006) A 1.375-approximation algorithm for sorting by transpositions. IEEE/ACM Trans Comput Biol Bioinf 3(4):369–379CrossRef
Zurück zum Zitat Fertin G, Jean G, Tannier E (2017) Algorithms for computing the double cut and join distance on both gene order and intergenic sizes. Algorithms Mol Biol 12(1):16CrossRef
Zurück zum Zitat Garey MR, Johnson DS (1979) Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA
Zurück zum Zitat Hannenhalli S, Pevzner PA (1999) Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM 46(1):1–27MathSciNetCrossRef
Zurück zum Zitat Hurkens CAJ, Schrijver A (1989) On the size of systems of sets every t of which have an sdr, with an application to the worst-case ratio of heuristics for packing problems. SIAM J Discret Math 2(1):68–72MathSciNetCrossRef
Zurück zum Zitat Karp RM (1972) Reducibility among combinatorial problems. In: Complexity of computer computations, pp. 85–103. Springer
Zurück zum Zitat Lin G, Jiang T (2004) A further improved approximation algorithm for breakpoint graph decomposition. J Comb Optim 8(2):183–194MathSciNetCrossRef
Zurück zum Zitat Oliveira AR, Brito KL, Alexandrino AO, Siqueira G, Dias U, Dias Z (2024) Rearrangement distance problems: an updated survey. ACM Comput Surv 56(8):1CrossRef
Zurück zum Zitat Oliveira AR, Brito KL, Dias U, Dias Z (2019) On the complexity of sorting by reversals and transpositions problems. J Comput Biol 26:1223–1229. https://doi.org/10.1089/cmb.2019.0078MathSciNetCrossRef
Zurück zum Zitat Oliveira AR, Jean G, Fertin G, Brito KL, Bulteau L, Dias U, Dias Z (2021) Sorting signed permutations by intergenic reversals. IEEE/ACM Trans Comput Biol Bioinf 18(6):2870–2876CrossRef
Zurück zum Zitat Papadimitriou CH, Yannakakis M (1991) Optimization, approximation, and complexity classes. J Comput Syst Sci 43:425–440MathSciNetCrossRef
Zurück zum Zitat Pinheiro PO, Alexandrino AO, Oliveira AR, de Souza CC, Dias Z (2020) Heuristics for Breakpoint Graph Decomposition with Applications in Genome Rearrangement Problems. In: Proceedings of the 13th Brazilian Symposium on Bioinformatics (BSB’2020), pp. 129–140. Springer International Publishing
Zurück zum Zitat Rahman A, Shatabda S, Hasan M (2008) An approximation algorithm for sorting by reversals and transpositions. J Discrete Algorithms 6(3):449–457MathSciNetCrossRef
Zurück zum Zitat Swenson KM, Lin Y, Rajan V, Moret BME (2008) Hurdles hardly have to be heeded. In: Nelson CE, Vialette S (eds) Comparative Genomics. Springer, Berlin Heidelberg, Berlin, Heidelberg, pp 241–251CrossRef
Zurück zum Zitat Walter MEMT, Dias Z, Meidanis J (1998) Reversal and Transposition Distance of Linear Chromosomes. In: Proceedings of the 5th International Symposium on String Processing and Information Retrieval (SPIRE’1998), pp. 96–102. IEEE Computer Society, Los Alamitos, CA, USA
Bildnachweise
AvePoint Deutschland GmbH/© AvePoint Deutschland GmbH, ams.solutions GmbH/© ams.solutions GmbH, Wildix/© Wildix, arvato Systems GmbH/© arvato Systems GmbH, Ninox Software GmbH/© Ninox Software GmbH, Nagarro GmbH/© Nagarro GmbH, GWS mbH/© GWS mbH, CELONIS Labs GmbH, USU GmbH/© USU GmbH, G Data CyberDefense/© G Data CyberDefense, Vendosoft/© Vendosoft, Deutsche Telekom MMS GmbH/© Vendosoft, Noriis Network AG/© Noriis Network AG, Asseco Solutions AG/© Asseco Solutions AG, AFB Gemeinnützige GmbH/© AFB Gemeinnützige GmbH, Ferrari electronic AG/© Ferrari electronic AG, Doxee AT GmbH/© Doxee AT GmbH , Haufe Group SE/© Haufe Group SE, NTT Data/© NTT Data, Videocast 1: Standbild/© Springer Fachmedien Wiesbaden, IT-Director und IT-Mittelstand: Ihre Webinar-Matineen /© da-kuk / Getty Images / iStock