Exact algorithm and heuristic for the Closest String Problem

https://doi.org/10.1016/j.cor.2011.01.009Get rights and content

Abstract

The Closest String Problem (CSP) is an NP-hard problem, which arises in computational molecular biology and coding theory. This class of problems is to find a string that minimizes the maximum Hamming distance to a given set of strings. In this paper, we present an exact algorithm called Distance First Algorithm (DFA) for three strings of CSP with alphabet size two. For the general CSP, we design a polynomial heuristic which is a combination of our proposed approximation algorithm LDDA ([10] Liu Xiaolan, Fu Keqiang, Shao Renxiang. Largest distance decreasing algorithm for the Closest String Problem. Journal of Information & Computational Science 2004; 1(2): 287–92) and local search strategies. Numerical results show that the proposed heuristic may obtain a nearly optimal value in a reasonable time for small and large-scale instances of the CSP.

Introduction

For many problems in computational biology, there is a need to compare and find common features in sequences. In this paper, we study one such problem, known as the Closest String Problem (CSP). The CSP has many important applications not only in computational biology, but also in other areas such as coding theory. In coding theory, the objective is to find sequences of characters which are closest to some given set of strings. This is useful in determining the best way of encoding a set of messages [1].

Let A={a1,,a|A|} be a finite set of characters, from which strings can be constructed. Let Am denote the set of all strings over A with length m. Then the CSP can be defined as follows: given a finite set of strings S={s1,s2,,sn}, each of which is in Am, find a center string t of length m, such that for every string si in S, dH(si,t)d. By dH(si,t) we mean the Hamming distance between si and t.

The CSP has been widely studied in recent years. It has been proved that CSP is NP-hard. However, if the distance d is fixed, the exact solution to the problem can be found in polynomial time [2]. For the general case where d is variable, there have proposed some approximation algorithms; for example, Gasieniec et al. [3] and Lanctot et al. [4] develop independently a 4/3-approximation algorithm, and Li et al. [5] presents a polynomial time approximation scheme (PTAS). These algorithms all need to solve an integer linear programming, so solving large-size instances of the CSP may take a long time. Meneses et al. [6], [7] have shown empirically the practical use of integer programming techniques to solve moderate-size instances of the CSP. By moderate-size instances we mean instances with 10–30 strings each of which is 300–800 characters long. However, real-world instances are much larger and more difficult. Gomes et al. [8] propose a heuristic (Algorithm 1 in [8]) and implement the sequential and parallel versions of the algorithm. Later Gomes et al. [9] propose a slightly different version of the heuristic in [8] (Algorithm 1 in [9]) and run it on a parallel machine with 28 nodes for large-scale instances of the CSP.

In this paper, we present an exact algorithm called Distance First Algorithm (DFA) for the special case of CSP with n=3 strings and alphabet size |A|=2. Also, we design a polynomial heuristic that is a combination of our proposed approximation algorithm LDDA [10] and local search strategies [6]. The computational experiments show that the proposed heuristic is effective in solving small and large-scale instances of the CSP.

This paper is organized as follows. In Section 2, we introduce the integer programming formulation of CSP. Section 3 describes the exact algorithm for the special case of CSP with n=3 strings and alphabet size |A|=2 and gives its proof. Section 4 describes the proposed heuristic for solving small and large-scale instances of the CSP. Section 5 reports the numerical results of the proposed heuristic. Finally, some concluding remarks are given in Section 6.

Section snippets

Mathematical model of CSP

An instance of the CSP consists of S={s1,s2,,sn} such that si=s1is2i,,smiAm for i=1,,n. The objective of the CSP is to find a string tAm such that maxidH(si,t) is minimized.

Here we introduce the third IP formulation in [6] which is thought as the strongest. A unique natural number can be assigned to each character ski, then each string si is transformed into a point xiZm. For example, to the alphabet {a,b,,z}, we can map it to {1,2,,26}. Given a string s=differ, we have x=(4,9,6,6,5,18)

Distance First Algorithm

Although the CSP for a constant number of strings is solvable in linear time using linear programming, the linear algorithm leads to huge running time even for moderate number of variables [11]. Hence, an efficient linear time algorithm “3-Strings” for the special case of CSP with n=3 strings is proposed in [11]. In the 3-Strings algorithm, preprocessing is required to transform the instance into a normalized one and split it into “blocks”, and the rest of operation is a little more

Polynomial heuristic for CSP

In [10], we proposed an algorithm, named LDDA for short, for the general CSP problem. However, unfortunately, this algorithm easily runs into a local optimum. In this section, we will improve the efficiency of LDDA by combining it with local search strategies.

The local search strategies that we adapted are slightly different versions of the one presented in [6], [8], [9]. The difference from the improve_solution algorithm in [6], [8] and Algorithm 2 in [9] is the selection of the initial value

Computational experiments

We now present the computational experiments with the proposed heuristic (LDDA_LSS for short) described in Section 4 to solve some test instances. First, we describe the set of instances used in our experiments. Then, in Section 5.2 we report the results obtained by LDDA_LSS and other algorithms. Section 5.3 will discuss the impact of b_rep on the performance of LDDA_LSS.

Concluding remarks

In this paper, we proposed an exact algorithm for the special case of CSP with n=3 strings and alphabet size |A|=2 and gave the corresponding theoretical analysis. In addition, we also presented a polynomial heuristic for the general CSP. Numerical experiments and comparisons with the exact method and the algorithm in [9] on real and simulated biological data showed that our proposed algorithm was not only effective in solving small-size instances of the CSP, but also very effective in solving

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments. We are very grateful to Professor Jinchao Chen for providing the McClure dataset in this paper. We also thank Dr. Shaohua Pan for helpful discussions. The work is supported by Project of Guangdong Science and Technology (2008B080701005) and the National Natural Science Foundation of China (61070033).

References (14)

  • F.C. Gomes et al.

    A parallel multistart algorithm for the Closest String Problem

    Computers & Operations Research

    (2008)
  • Roman S. Coding and information theory. In: Graduate texts in mathematics, vol. 134. Heidelberg, Germany:...
  • Ben-Dor A, Lancia G, Perone J, Ravi R. Banishing bias from consensus sequences. In: Apostolico A, Hein J, editors....
  • Gasieniec L, Jansson J, Lingas A. Efficient approximation algorithms for the hamming center problem. In: Proceedings of...
  • K. Lanctot et al.

    Distinguishing string selection problems

    Information and Computation

    (2003)
  • M. Li et al.

    On the closest string and substring problems

    Journal of the ACM

    (2002)
  • C.N. Meneses et al.

    Optimal solutions for the Closest String Problem via integer programming

    INFORMS Journal on Computing

    (2004)
There are more references available in the full text version of this article.

Cited by (14)

  • An improved integer linear programming formulation for the closest 0-1 string problem

    2017, Computers and Operations Research
    Citation Excerpt :

    Due to its importance, the problem has recently attracted extensive research, see e.g. [5,8–10].

  • Encoding Hard String Problems with Answer Set Programming

    2023, Leibniz International Proceedings in Informatics, LIPIcs
  • Application of Negative Learning Ant Colony Optimization to the Far from Most String Problem

    2023, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
View all citing articles on Scopus
View full text