Abstract
This paper studies in a probabilistic framework some topics concerning the way words (strings) can overlap, and relationship of this to the height of digital trees associated with this set of words. A word is defined as a random sequence of (possibly infinite) symbols over a finite alphabet. A key notion of analignment matrix {C ij } n i,j=1 is introduced whereC ij is the length of the longest string that is a prefix of theith and thejth word. It is proved that the height of an associated digital tree is simply related to the alignment matrix through some order statistics. In particular, using this observation and proving some inequalities for order statistics, we establish that the height of adigital trie under anindependent model (i.e., all words are statistically independent) is asymptotically equal to 2 logα n wheren is the number of words stored in the trie and α is a parameter of the probabilistic model. This result is generalized in three directions, namely we considerb-tries,Markovian model (i.e., dependency among letters in a word), and adependent model (i.e., dependency among words). In particular, when consecutive letters in a word are Markov dependent (Markovian model), then we demonstrate that the height converges in probability to 2 logθ n where θ is a parameter of the underlying Markov chain. On the other hand, for suffix trees which fall into the dependent model, we show that the height does not exceed 2 logκ n, where κ is a parameter of the probabilistic model. These results find plenty of applications in the analysis of data structures built over digital words.
Similar content being viewed by others
References
A. V. Aho, J. E. Hopcroft, and J. D. Ullman,The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA (1974).
D. Knuth,The Art of Computer Programming. Sorting and Searching, vol. III, Addison-Wesley, Reading, MA (1973).
A. Apostolico, The Myriad Virtues of Suffix Trees,Combinatorial Algorithms on Words, pp. 85–96, Springer-Verlag, New York (1985).
R. Fagin, J. Nievergelt, N. Pippenger, and H. Strong, Extendible Hashing: A Fast Access Method for Dynamic Files,ACM TODS,4, 315–344 (1979).
P. Flajolet, On the Performance Evaluation of Extendible Hashing and Trie Searching,Acta Inform.,20, 345–369 (1983).
R. Gallager,Information Theory and Reliable Communications, Wiley, New York (1968).
J. Capetanakis, Tree Algorithms for Packet Broadcast Channels,IEEE Trans. Inform. Theory,25, 505–525 (1979).
IEEE Transaction Inform. Theory,31, 2 (1985).
Ph. Jacquet and M. Regnier,Trie Partitioning Process: Limiting Distributions, Lecture Notes in Computer Science, vol. 214, pp. 196–210, Springer-Verlag, Berlin, 1986.
L. Devroye, A Probabilistic Analysis of the Height of Tries and of the Complexity of Trie Sort,Acta Inform.,21, 229–232 (1984).
B. Pittel, Asymptotic Growth of a Class of Random Trees,Ann. Probab.,13, 414–427 (1985).
B. Pittel, Path in a Random Digital Tree: Limiting Distributions,Adv. in Appl. Probab.,18, 139–155 (1986).
M. Regnier, On the Average Height of Trees in Digital Searching and Dynamic Hashing,Inform. Process. Lett.,13, 64–66 (1981).
A. Yao, A Note on the Analysis of Extendible Hashing,Inform. Process. Lett.,11, 84–86 (1980).
W. Szpankowski, On the Analysis of the Average Height of a Digital Trie: Another Approach, CSD TR-646, Purdue University (1986).
A. Apostolico and W. Szpankowski, Self-Alignments in Words and Their Applications, CSD TR-732, Purdue University (1987), submitted to a journal.
W. Szpankowski, Some Results onV-ary Asymmetric Tries,J. Algorithms,9, 224–244 (1988).
P. Kirschenhofer, H. Prodinger, and W. Szpankowski, On the Variance of the External Path Length in a Symmetric Digital Trie,Discrete Appl. Math.,25, 129–143 (1989).
H. David,Order Statistics, Wiley, New York (1980).
J. Galambos,The Asymptotic Theory of Extreme Order Statistics, Wiley, New York (1978).
T. Lai and H. Robbins, A Class of Dependent Random Variables and Their Maxima,Z. Wahrsch. Vern. Gebiete,42, 89–111 (1978).
P. Billingsley,Probability and Measures, Wiley, New York (1986).
W. Szpankowski, (Probably) Optimal Solutions to Some Problems NOT Only on Graphs, CSD TR-780, Purdue University (1988); revision TR-872 (1989).
D. Knuth,Art of Computer Programming. Fundamental Algorithms, vol. I, Addison-Wesley, Reading, MA (1973).
D. Sankoff and J. Kruskal (eds.),Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparisons, Addison-Wesley, Reading, MA (1983).
Bull. Math. Biol,46, No. 4 (1984) (a special commemorative issue honoring Margonet O. Dayhoff).
D. Aldous,Probability Approximations via the Poisson Clumping Heristic, Springer-Verlag, New York (1989).
N. Kamarkar, R. M. Karp, G. S. Lueker, and A. D. Odlyzko, Probabilistic Analysis of Optimum Partitioning,J. Appl. Probab.,23, 626–645 (1986).
S. Karlin and F. Ost, Counts of Long Aligned Word Matches Among Random Letter Sequences,Adv. Appl. Probab.,19, 293–351 (1987).
J. F. C. Kingman, Subadditive Processes, inEcole d'Eté de Probabilités de Saint-Flour V—1975, Lecture Notes in Mathematics, vol. 539, Springer-Verlag, Berlin, 1976.
L. Devroye, W. Szpankowski, and B. Rais, A Note on the Height of Suffix Trees, CSD TR-905, Purdue University (1989).
Author information
Authors and Affiliations
Additional information
Communicated by Philippe Flajolet.
This research was supported by NSF Grants NCR-8702115 and CCR-8900305, in part by Grant R01 LM05118 from the National Library of Medicine, and by AFOSR Grant 90-0107.
Rights and permissions
About this article
Cite this article
Szpankowski, W. On the height of digital trees and related problems. Algorithmica 6, 256–277 (1991). https://doi.org/10.1007/BF01759045
Received:
Revised:
Issue Date:
DOI: https://doi.org/10.1007/BF01759045