Exact algorithm and heuristic for the Closest String Problem
Introduction
For many problems in computational biology, there is a need to compare and find common features in sequences. In this paper, we study one such problem, known as the Closest String Problem (CSP). The CSP has many important applications not only in computational biology, but also in other areas such as coding theory. In coding theory, the objective is to find sequences of characters which are closest to some given set of strings. This is useful in determining the best way of encoding a set of messages [1].
Let be a finite set of characters, from which strings can be constructed. Let Am denote the set of all strings over A with length m. Then the CSP can be defined as follows: given a finite set of strings , each of which is in Am, find a center string t of length m, such that for every string si in S, . By we mean the Hamming distance between si and t.
The CSP has been widely studied in recent years. It has been proved that CSP is NP-hard. However, if the distance d is fixed, the exact solution to the problem can be found in polynomial time [2]. For the general case where d is variable, there have proposed some approximation algorithms; for example, Gasieniec et al. [3] and Lanctot et al. [4] develop independently a 4/3-approximation algorithm, and Li et al. [5] presents a polynomial time approximation scheme (PTAS). These algorithms all need to solve an integer linear programming, so solving large-size instances of the CSP may take a long time. Meneses et al. [6], [7] have shown empirically the practical use of integer programming techniques to solve moderate-size instances of the CSP. By moderate-size instances we mean instances with 10–30 strings each of which is 300–800 characters long. However, real-world instances are much larger and more difficult. Gomes et al. [8] propose a heuristic (Algorithm 1 in [8]) and implement the sequential and parallel versions of the algorithm. Later Gomes et al. [9] propose a slightly different version of the heuristic in [8] (Algorithm 1 in [9]) and run it on a parallel machine with 28 nodes for large-scale instances of the CSP.
In this paper, we present an exact algorithm called Distance First Algorithm (DFA) for the special case of CSP with n=3 strings and alphabet size |A|=2. Also, we design a polynomial heuristic that is a combination of our proposed approximation algorithm LDDA [10] and local search strategies [6]. The computational experiments show that the proposed heuristic is effective in solving small and large-scale instances of the CSP.
This paper is organized as follows. In Section 2, we introduce the integer programming formulation of CSP. Section 3 describes the exact algorithm for the special case of CSP with n=3 strings and alphabet size |A|=2 and gives its proof. Section 4 describes the proposed heuristic for solving small and large-scale instances of the CSP. Section 5 reports the numerical results of the proposed heuristic. Finally, some concluding remarks are given in Section 6.
Section snippets
Mathematical model of CSP
An instance of the CSP consists of such that for . The objective of the CSP is to find a string such that is minimized.
Here we introduce the third IP formulation in [6] which is thought as the strongest. A unique natural number can be assigned to each character , then each string si is transformed into a point . For example, to the alphabet , we can map it to . Given a string , we have
Distance First Algorithm
Although the CSP for a constant number of strings is solvable in linear time using linear programming, the linear algorithm leads to huge running time even for moderate number of variables [11]. Hence, an efficient linear time algorithm “3-Strings” for the special case of CSP with n=3 strings is proposed in [11]. In the 3-Strings algorithm, preprocessing is required to transform the instance into a normalized one and split it into “blocks”, and the rest of operation is a little more
Polynomial heuristic for CSP
In [10], we proposed an algorithm, named LDDA for short, for the general CSP problem. However, unfortunately, this algorithm easily runs into a local optimum. In this section, we will improve the efficiency of LDDA by combining it with local search strategies.
The local search strategies that we adapted are slightly different versions of the one presented in [6], [8], [9]. The difference from the improve_solution algorithm in [6], [8] and Algorithm 2 in [9] is the selection of the initial value
Computational experiments
We now present the computational experiments with the proposed heuristic (LDDA_LSS for short) described in Section 4 to solve some test instances. First, we describe the set of instances used in our experiments. Then, in Section 5.2 we report the results obtained by LDDA_LSS and other algorithms. Section 5.3 will discuss the impact of b_rep on the performance of LDDA_LSS.
Concluding remarks
In this paper, we proposed an exact algorithm for the special case of CSP with n=3 strings and alphabet size |A|=2 and gave the corresponding theoretical analysis. In addition, we also presented a polynomial heuristic for the general CSP. Numerical experiments and comparisons with the exact method and the algorithm in [9] on real and simulated biological data showed that our proposed algorithm was not only effective in solving small-size instances of the CSP, but also very effective in solving
Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive comments. We are very grateful to Professor Jinchao Chen for providing the McClure dataset in this paper. We also thank Dr. Shaohua Pan for helpful discussions. The work is supported by Project of Guangdong Science and Technology (2008B080701005) and the National Natural Science Foundation of China (61070033).
References (14)
- et al.
A parallel multistart algorithm for the Closest String Problem
Computers & Operations Research
(2008) - Roman S. Coding and information theory. In: Graduate texts in mathematics, vol. 134. Heidelberg, Germany:...
- Ben-Dor A, Lancia G, Perone J, Ravi R. Banishing bias from consensus sequences. In: Apostolico A, Hein J, editors....
- Gasieniec L, Jansson J, Lingas A. Efficient approximation algorithms for the hamming center problem. In: Proceedings of...
- et al.
Distinguishing string selection problems
Information and Computation
(2003) - et al.
On the closest string and substring problems
Journal of the ACM
(2002) - et al.
Optimal solutions for the Closest String Problem via integer programming
INFORMS Journal on Computing
(2004)
Cited by (14)
An improved integer linear programming formulation for the closest 0-1 string problem
2017, Computers and Operations ResearchCitation Excerpt :Due to its importance, the problem has recently attracted extensive research, see e.g. [5,8–10].
The Selective Fixing Algorithm for the closest string problem
2014, Computers and Operations ResearchImproved LP-based algorithms for the closest string problem
2012, Computers and Operations ResearchEncoding Hard String Problems with Answer Set Programming
2023, Leibniz International Proceedings in Informatics, LIPIcsApplication of Negative Learning Ant Colony Optimization to the Far from Most String Problem
2023, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)