nach oben

2008 | Buch

The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching

verfasst von: Donald Adjeroh, Tim Bell, Amar Mukherjee

Verlag: Springer US

Enthalten in: Springer Professional "Wirtschaft+Technik" , Springer Professional "Technik" , Springer Professional "Wirtschaft"

Einloggen, um Zugang zu erhalten

Über dieses Buch

The Burrows-Wheeler Transform is one of the best lossless compression me- ods available. It is an intriguing — even puzzling — approach to squeezing redundancy out of data, it has an interesting history, and it has applications well beyond its original purpose as a compression method. It is a relatively late addition to the compression canon, and hence our motivation to write this book, looking at the method in detail, bringing together the threads that led to its discovery and development, and speculating on what future ideas might grow out of it. The book is aimed at a wide audience, ranging from those interested in learning a little more than the short descriptions of the BWT given in st- dard texts, through to those whose research is building on what we know about compression and pattern matching. The ?rst few chapters are a careful description suitable for readers with an elementary computer science ba- ground (and these chapters have been used in undergraduate courses), but later chapters collect a wide range of detailed developments, some of which are built on advanced concepts from a range of computer science topics (for example, some of the advanced material has been used in a graduate c- puter science course in string algorithms). Some of the later explanations require some mathematical sophistication, but most should be accessible to those with a broad background in computer science.

Inhaltsverzeichnis

Frontmatter

1. Introduction

This puzzle is an example of the Burrows-Wheeler Transform (BWT), which uses the intriguing idea of muddling (we prefer to call it permuting) the letters in a document to make it easier to find a compact representation and to perform other kinds of processing. What is amazing about the BWT is that although there are 2,615,348,735,999 different ways to unmuddle the above characters into possible anagrams, the Burrows-Wheeler Transform makes it very easy to find the unique correct permutation very quickly.

2. How the Burrows-Wheeler Transform works

This chapter will look in detail at how the Burrows-Wheeler Transform is implemented in practice. The examples given in Chapter 1 overlooked some important practical details — to transform a text of n characters the encoder was sorting an array of n strings, each n characters long, and the decoder performed n sorts to reverse the transform. This complexity is not necessary for the BWT, and in this chapter we will see how to perform the encoding and decoding in O(n) space, and O(n log n) time. In fact, using a few tricks, the time can be reduced to O(n).

We will also look at various auxiliary data structures that are used for decoding the Burrows-Wheeler Transform, as some of them, while not essential for decoding, are useful if the transformed text is to be searched. These extra structures can still be constructed in O(n) time so in principle they add little to the decoding cost.

This chapter considers only the transform; in the next chapter we will look at how a compression system can take advantage of the transformed text to reduce its size; we refer to this second phase as the “Local to Global Transform”, or LGT.

We will present the Burrows-Wheeler Transform for coding a string T of n characters, T[1 … n], over an alphabet Σ of |Σ| characters. Note that there is a summary of all the main notation in Appendix A on page 309.

3. Coders for the Burrows-Wheeler Transform

Like most transforms, the Burrows-Wheeler Transform does not change the size of the file that has been transformed, but merely rearranges it so that it will be easier to represent it compactly. It then needs to be coded using a second phase which we will refer to as the “Local to Global Transform” (LGT). Figure 3.1 shows a section of the transformed text for Shakespeare's “Hamlet”, which reveals the kind of regularities that the BWT exposes. These characters are ones that appear before the context nd; initially the nd is followed by a space, and hence a is very common, but then the character is followed by ndeed, where the i becomes common, and the last few characters precede nder.

Clearly the text in Figure 3.1 contains a lot of patterns, and therefore will be easy to compress. Many sophisticated techniques have been proposed to exploit the regularities of the BWT transformed text, and yet it has emerged that one of the simplest approaches (RleAc, based on run-length encoding and an order-zero arithmetic coder) gives the best compression and is also very fast compared with more complicated methods. We will begin this section by looking at this simple coder, but later we will also review various other approaches that have been proposed, including Burrows and Wheeler's original “Move to Front” (MTF) list, inversion frequencies, distance coding, frequency counting methods, wavelet trees, and alternative permutations. We will also consider the effect of the block size on compression performance.

4. Suffix trees and suffix arrays

The Burrows-Wheeler Transform has a very close relationship with suffix trees and suffix arrays — the array of indexes to the sorted array of substrings generated during the transform is essentially a suffix array, which in turn is a representation of the information in a suffix tree. As pointed out by Burrows and Wheeler in their original work (Burrows and Wheeler, 1994), the problem of sorting the rotated matrices is the major bottleneck in performing the transformation, and this is essentially an exercise in suffix sorting. This relationship between the BWT and suffix arrays and suffix trees also has important implications in the applications of the BWT, and in its relationship with other compression schemes, such as PPM. Analyzing the performance of the BWT is greatly simplified by an understanding of the construction and complexity of suffix trees and suffix arrays.

In this chapter we study suffix trees and suffix arrays in more detail. While this is motivated by their relationship with the BWT, suffix trees and suffix arrays have become important data structures in their own right, especially for problems in pattern matching, full-text indexing, compression, and other applications.

5. Analysis of the Burrows-Wheeler Transform

The Burrows-Wheeler Transform performs a permutation of the input string, and thus provides no compression on its own. Rather, the BWT essentially reorganizes the input sequence so that symbols with similar contexts are clustered together in the output stream. In this sense the BWT can be seen as a preprocessing scheme to expose potential redundancies in a given input, and hence enhance later compression, using existing compression algorithms. Thus, the BWT can be viewed as a “compression booster”, since it makes it possible for relatively simple compression algorithms to perform better on most input sequences (Ferragina et al., 2005a). Interestingly, some of the better compression methods don't do so well on the BWT output because they are being too “clever” looking for patterns when the patterns are very simple. This ability of the BWT to reorganize (sort) the input sequence based on contexts is central to its relationship with other compression algorithms. It is also the key to its use in various other fields and applications, different from data compression. Not surprisingly, this context sorting stage of the BWT is also the major bottleneck in performing the transformation on a given input sequence.

In this chapter, we analyze the theoretical performance of the Burrows- Wheeler Transform.We consider its performance in terms of its computational complexity (both space and time complexity). We also consider its theoretical performance in data compression, in terms of how close or how fast it could approach the theoretically optimal performance for a given source. We will examine how the theoretical performance of the BWT, be it computational complexity or compression ability, is related to the sorted contexts used by the transform. We will then relate it to other compression methods, especially those based on using contexts, as the BWT approach turns out to effectively be partitioning the input according to contexts in which the characters occur.

6. Variants of the Burrows-Wheeler Transform

With the huge excitement that was generated by the publication of the original paper on the Burrows-Wheeler Transform in 1994, followed by a more detailed empirical study by Fenwick between 1995 and 1996 (Fenwick, 1995b,c, 1996a,b), it did not take long before researchers started considering different variations, extensions and generalizations of the transform. There were many questions to ask; for instance, given the sorted BWT rotation matrix, is the array of last characters (the last column L of the matrix As) selected by the BWT as its output the only possible choice? And if other choices are possible, might they give better compression? The first column (F) would be an attractive choice if it were possible to recover the original text from this column, since it can be represented very efficiently. It would seem that there is insufficient information to recover the text T from F, and we know how to recover it from L, but what of the columns between them?

Another debate was about the transformation itself. Do we need a complete sorting of all the cyclic rotations of the original text, or can we make do with a limited-length key comparison — for instance, sorting based on the klength prefix of each row, for an arbitrary k? Can we recover the original text without error from such limited-order sorting? Given that sorting is the major bottleneck in BWT-based analysis, if this simplified sort were possible it could have a significant advantage with respect to computational complexity; but what will be the impact on compression? Other questions included whether the BWT can be applied to a word-based alphabet, especially given that the original paper that proposed the MTF algorithm for compression (Bentley et al., 1986) used word-based alphabets. In this chapter we address the above issues and more by considering various published extensions and generalizations of the BWT. Where possible, we include empirical performance of the BWT variant or generalization.

7. Exact and approximate pattern matching

A fundamental operation with strings is determining whether a pattern of characters or symbols occurs as a substring in a larger string called the text, or as an approximate subsequence in the text. This problem has been investigated since the early 1960s, not only for its theoretical importance in computer science but because it has many applications in information processing and biological sciences. In computer science, string pattern matching algorithms are used in database search and retrieval, text processing and editing, lexical analysis of computer programs, data compression, cryptography and other applications. In recent years, string matching algorithms have been used as powerful tools in the study of genomics and proteomics, in finding genes and regulatory motifs, and in comparative genomics, gene expression analysis and molecular evolutionary theory.

In this chapter we will begin in Section 7.1 by looking at classic exact pattern matching algorithms for uncompressed text, as some of the compresseddomain methods build on these. The section also looks at algorithms for pattern matching with “don't-care” characters, which allow for uncertainty in the search. We then discuss the compressed domain pattern matching problem in Section 7.2, with the main emphasis on the use of the Burrows-Wheeler Transform to aid compressed-domain pattern search. We will then move on to approximate pattern matching algorithms (Section 7.4), including “k-mismatch algorithms” which allow a fixed number of characters to be different between a pattern and the text it matches. Some uncompressed-domain approximate matching methods are introduced, and then implementations of BWT domain approximate pattern matching are described. It is also possible to design hardware algorithms to accelerate pattern matching, and these will be discussed briefly in Section 7.5.

8. Other applications of the Burrows-Wheeler Transform

Traditionally the major application of the Burrows-Wheeler Transform has been for data compression and efficient storage, and earlier chapters in this book have provided a detailed consideration of the BWT from this viewpoint, analyzing its performance for data compression. However, recent research on the Burrows-Wheeler Transform has shown the versatility of the BWT, and hence efforts are shifting from its traditional application in data compression to other areas of study. In Chapter 7 we showed how the BWT provides an effective mechanism for rapid pattern matching. In this chapter we expand more on its many uses, focusing on new and emerging applications. Some of the applications discussed in this chapter also relate to new data compression applications; given that the original purpose of the Burrows-Wheeler Transform was for text compression, it is no surprise that the method could be used for compression of other types of data. The data compression applications discussed in this chapter exploit the specific nature of the data under consideration in combination with relevant properties of the BWT to provide effective compaction of the data. We discuss the applicability of the BWT to the compression of specialized data sets, including compression of test patterns used in automatic testing in chip manufacturing, image compression, and compression of biological sequences.

Other applications described in this chapter are somewhat different from data compression, but make use of some of the special characteristics of the BWT, such as its clustering property. Examples here include its use in shape analysis in computer vision, and in machine translation.We also discuss recent reports of the application of the BWT in bioinformatics and computational biology, full-text compressed indexes, prediction and entropy estimation, and recent approaches in joint-source channel coding.

The major objective in this chapter is to show the myriad virtues of the BWT, and hopefully motivate others to think of how the BWT could be used to effectively address some important problems that at first glance may seem to be unrelated to the transform.

9. Conclusion

Through this book we have covered many of the directions that the BWT has led people in: new algorithms and data structures for performing the transform, fine-tuning the coding that can be done with the transformed file, new ways to sort the strings for the transform, a better understanding of other compression methods by relating them to the BWT, using the BWT data structures to aid searching and pattern matching, and applying its algorithms and data structures in contexts ranging from image analysis to computational biology. Yet there is a strong sense that we are only just beginning to understand the transform and its potential, as new variants and applications continue to be published regularly. The BWT is a powerful idea, and in the process of decoding generates a collection of useful arrays (R, V, W and so on) which can be used to provide a variety of indexes and views of the original text. The transform process thus provides us with data structures that have opened up novel possibilities, and may yet hold more opportunities for future applications.

Backmatter

Titel: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching
verfasst von: Donald Adjeroh
Tim Bell
Amar Mukherjee
Verlag: Springer US
Electronic ISBN: 978-0-387-78909-5
Print ISBN: 978-0-387-78908-8
DOI: https://doi.org/10.1007/978-0-387-78909-5

Springer Professional