skip to main content
research-article

Optimal distance bounds for fast search on compressed time-series query logs

Published:29 April 2010Publication History
Skip Abstract Section

Abstract

Consider a database of time-series, where each datapoint in the series records the total number of users who asked for a specific query at an internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data organization perspective, because query Weblogs capture important trends and statistics, they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, Weblog data can provide an important polling mechanism for the microeconomic aspects of a search engine, since they can facilitate and promote the advertising facet of the search engine (understand what users request and when they request it).

Due to the sheer amount of time-series Weblogs, manipulation of the logs in a compressed form is an impeding necessity for fast data processing and compact storage requirements. Here, we explicate how to compute the lower and upper distance bounds on the time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, leading to better candidate selection/elimination and ultimately faster search performance. Our derivation of the optimal distance bounds is based on the careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage of the proposed method, compared to previous compression/search techniques. The presented method results in a 10--30% improvement on distance estimations, which in turn leads to 25--80% improvement on the search performance.

Skip Supplemental Material Section

Supplemental Material

References

  1. Adar, E., Weld, D., Bershad, B., and Gribble, S. 2007. Why we search: Visualizing and predicting user behavior. In Proceedings of the International World Wide Web Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Agrawal, R., Faloutsos, C., and Swami, A. 1993. Efficient similarity search in sequence databases. In Proceedings of the International Conference on Foundations of Data Organization and Algorithms (FODO). 69--84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Assent, I., Krieger, R., Afschari, F., and Seidl, T. 2008. The TS-tree: Efficient time series search and retrieval. In Proceedings of the International Conference on Extending Database Technology (EDBT). 252--263. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Beitzel, S., Jensen, E., Chowdhury, A., Frieder, O., and Grossman, D. 2007. Temporal analysis of a very large topically categorized Web query log. In J. Amer. Soc. Inform. Sci. Tech. 58, 2, 166--178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bertsekas, D. P. 2000. Nonlinear Programming. Athena Scientific.Google ScholarGoogle Scholar
  6. Bollegala, D., Matsuo, Y., and Ishizuka, M. 2007. Measuring semantic similarity between words using Web search engines. In Proceedings of the International World Wide Web Conference. 757--766. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bozkaya, T. and Özsoyoglu, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 357--368. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Cai, Y. and Ng, R. 2004. Indexing spatio-temporal trajectories with chebyshev polynomials. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 599--610. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Chee Fu, A. W., Chan, P. M., Cheung, Y.-L., and Moon, Y. 2000. Dynamic VP-Tree indexing for N-nearest neighbor search given pair-wise distances. J. VLDB, 154--173. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chen, Q., Li, M., and Zhou, M. 2007. Improving query spelling correction using Web search results. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language. 181--189.Google ScholarGoogle Scholar
  11. Chien, S. and Immorlica, N. 2005. Semantic similarity between search engine queries using temporal correlation. In Proceedings of the International World Wide Web Conference. 2--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Cover, T. M. and Thomas, J. A. 1991. Elements of Information Theory. John Wiley and Sons. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Dubinko, M., Kumar, R., Magnani, J., Novak, J., Raghavan, P., and Tomkins, A. 2006. Visualizing tags over time. In Proceedings of the International World Wide Web Conference. 193--202. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 47--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Hall, E. A. M. C. K. 2009. Gazpacho and summer rash: Lexical relationships from temporal patterns of web search queries. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1046--1055. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Kage, T. and Sumiya, K. 2006. A Web search method based on the temporal relation of query keywords. In Proceedings of the Web Information Systems. 4--15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Kahveci, T. and Singh, A. K. 2001. Variable length queries for time series data. In Proceedings of the IEEE International Conference on Data Engineering. 273--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Keogh, E. 2002. Exact indexing of dynamic time warping. In Proceedings of the International Conference on Very Large Databases. 406--417. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Keogh, E., Chakrabarti, K., Mehrotra, S., and Pazzani, M. 2001. Locally adaptive dimensionality reduction for indexing large time series databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 151--162. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Keogh, E. J. and Kasetty, S. 2003. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Min. Knowl. Discov. 7, 4, 349--371. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Kleinberg, J. 2002. Bursty and hierarchical structure in streams. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 91--101. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Lie, B., Jones, R., and Klinkner, K. 2006. Measuring the meaning in time series clustering of text search queries. In Proceedings of the International Conference on Information and Knowledge Management. 836--837. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Liu, N., Nong, S., Yan, J., Zhang, B., Chen, Z., and Li, Y. 2006. Similarity of temporal query logs based on ARIMA model. In Proceedings of the IEEE International Conference on Data Mining. 975--979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Oppenheim, A., Willsky, A., and Nawab, S. 1997. Signals and Systems, 2nd Ed. Prentice Hall. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Otsuka, S. and Kitsuregawa, M. 2006. Clustering of search engine keywords using access logs. In Proceedings of the International Conference on Database and Expert Systems Applications. 842--852. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Otsuka, S., Toyoda, M., Hirai, J., and Kitsuregawa, M. 2004. Extracting user behavior by Web communities technology on global Web logs. In Proceedings of the International Conference on Database and Expert Systems Applications. 957--988.Google ScholarGoogle Scholar
  27. Rafiei, D. and Mendelzon, A. 1998. Efficient retrieval of similar time sequences using DFT. In Proceedings of the International Conference on Foundations of Data Organization and Algorithms. 249--257.Google ScholarGoogle Scholar
  28. Rattenbury, T., Good, N., and Naaman, M. 2007. Towards extracting flickr tag semantics. Proceedings of the International World Wide Web Conference. 1287--1288. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Richardson, M. 2008. Learning about the world through long-term query logs. ACM Trans. Web, 2, 4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Rode, H. and Hiemstra, D. 2006. Using query profiles for clarification. In Proceedings of the European Conference on Information Retrieval. 205--216. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Sanderson, M. and Dumais, S. 2007. Examining repetition in user search behavior. In Proceedings of the European Conference on Information Retrieval. 597--604. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Sun, Y., Liu, N., Xie, K., Yan, S., Zhang, B., and Chen, Z. 2007. Causal relation of queries from temporal logs. In Proceedings of the International World Wide Web Conference. 1141--142. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Vlachos, M., Meek, C., Vagena, Z., and Gunopulos, D. 2004. Identification of similarities, periodicities & bursts for online search queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 131--142. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Vlachos, M., Yu, P., and Castelli, V. 2005. On Periodicity detection and structural periodic similarity. In Proceedings of the SIAM International Conference on Data Mining.Google ScholarGoogle Scholar
  35. Wang, C. and Wang, X. S. 2000. Multilevel filtering for high dimensional nearest neighbor search. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.Google ScholarGoogle Scholar
  36. Wang, X., Zhai, C., Hu, X., and Sproat, R. 2007. Mining correlated bursty topic patterns from coordinated text streams. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 784--793. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Weber, R., Schek, H.-J., and Blott, S. 1998. A quantitative analysis and performance study for similarity search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Database. 194--205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Yi, B. and Faloutsos, C. 2000. Fast time sequence indexing for arbitrary Lp norms. In Proceedings of the International Conference on Very Large Database. 385--394. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Yianilos, P. 1992. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithm. 311--321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Zhao, Q., Hoi, S., Liu, T.-Y., Bhowmick, S. S., Lyu, M. R., and Ma, W.-Y. 2006. Time-dependent semantic similarity measure of queries using historical click-through data. In Proceedings of the International World Wide Web Conference. 543--552. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Ziegler, C.-N., Simon, K., and Lausen, G. 2006. Automatic computation of semantic proximity using taxonomic knowledge. In Proceedings of the International Conference on Information and Knowledge Management. 465--474. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Optimal distance bounds for fast search on compressed time-series query logs

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on the Web
        ACM Transactions on the Web  Volume 4, Issue 2
        April 2010
        136 pages
        ISSN:1559-1131
        EISSN:1559-114X
        DOI:10.1145/1734200
        Issue’s Table of Contents

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 29 April 2010
        • Accepted: 1 December 2009
        • Revised: 1 October 2009
        • Received: 1 December 2007
        Published in tweb Volume 4, Issue 2

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader