research-article

Optimal distance bounds for fast search on compressed time-series query logs

Authors:
Michail Vlachos

IBM Zürich Research Lab, Switzerland

IBM Zürich Research Lab, Switzerland
View Profile

,
Suleyman S. Kozat

Koç University, Istanbul, Turkey

Koç University, Istanbul, Turkey
View Profile

,
Philip S. Yu

University of Illinois at Chicago, Chicago, IL

University of Illinois at Chicago, Chicago, IL
View Profile

Authors Info & Claims

ACM Transactions on the Web Volume 4 Issue 2Article No.: 6pp 1–28https://doi.org/10.1145/1734200.1734203

Published:29 April 2010Publication History

ACM Transactions on the Web

Abstract

Consider a database of time-series, where each datapoint in the series records the total number of users who asked for a specific query at an internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data organization perspective, because query Weblogs capture important trends and statistics, they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, Weblog data can provide an important polling mechanism for the microeconomic aspects of a search engine, since they can facilitate and promote the advertising facet of the search engine (understand what users request and when they request it).

Due to the sheer amount of time-series Weblogs, manipulation of the logs in a compressed form is an impeding necessity for fast data processing and compact storage requirements. Here, we explicate how to compute the lower and upper distance bounds on the time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, leading to better candidate selection/elimination and ultimately faster search performance. Our derivation of the optimal distance bounds is based on the careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage of the proposed method, compared to previous compression/search techniques. The presented method results in a 10--30% improvement on distance estimations, which in turn leads to 25--80% improvement on the search performance.

Supplemental Material

Available for Download

pdf

a6-vlachos-apndx.pdf (49.4 KB)

Online appendix to Optimal distance bounds for fast search on compressed time-series query logs on article 6.

References

Adar, E., Weld, D., Bershad, B., and Gribble, S. 2007. Why we search: Visualizing and predicting user behavior. In Proceedings of the International World Wide Web Conference. Google ScholarDigital Library
Agrawal, R., Faloutsos, C., and Swami, A. 1993. Efficient similarity search in sequence databases. In Proceedings of the International Conference on Foundations of Data Organization and Algorithms (FODO). 69--84. Google ScholarDigital Library
Assent, I., Krieger, R., Afschari, F., and Seidl, T. 2008. The TS-tree: Efficient time series search and retrieval. In Proceedings of the International Conference on Extending Database Technology (EDBT). 252--263. Google ScholarDigital Library
Beitzel, S., Jensen, E., Chowdhury, A., Frieder, O., and Grossman, D. 2007. Temporal analysis of a very large topically categorized Web query log. In J. Amer. Soc. Inform. Sci. Tech. 58, 2, 166--178. Google ScholarDigital Library
Bertsekas, D. P. 2000. Nonlinear Programming. Athena Scientific.Google Scholar
Bollegala, D., Matsuo, Y., and Ishizuka, M. 2007. Measuring semantic similarity between words using Web search engines. In Proceedings of the International World Wide Web Conference. 757--766. Google ScholarDigital Library
Bozkaya, T. and Özsoyoglu, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 357--368. Google ScholarDigital Library
Cai, Y. and Ng, R. 2004. Indexing spatio-temporal trajectories with chebyshev polynomials. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 599--610. Google ScholarDigital Library
Chee Fu, A. W., Chan, P. M., Cheung, Y.-L., and Moon, Y. 2000. Dynamic VP-Tree indexing for N-nearest neighbor search given pair-wise distances. J. VLDB, 154--173. Google ScholarDigital Library
Chen, Q., Li, M., and Zhou, M. 2007. Improving query spelling correction using Web search results. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language. 181--189.Google Scholar
Chien, S. and Immorlica, N. 2005. Semantic similarity between search engine queries using temporal correlation. In Proceedings of the International World Wide Web Conference. 2--11. Google ScholarDigital Library
Cover, T. M. and Thomas, J. A. 1991. Elements of Information Theory. John Wiley and Sons. Google ScholarDigital Library
Dubinko, M., Kumar, R., Magnani, J., Novak, J., Raghavan, P., and Tomkins, A. 2006. Visualizing tags over time. In Proceedings of the International World Wide Web Conference. 193--202. Google ScholarDigital Library
Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 47--57. Google ScholarDigital Library
Hall, E. A. M. C. K. 2009. Gazpacho and summer rash: Lexical relationships from temporal patterns of web search queries. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1046--1055. Google ScholarDigital Library
Kage, T. and Sumiya, K. 2006. A Web search method based on the temporal relation of query keywords. In Proceedings of the Web Information Systems. 4--15. Google ScholarDigital Library
Kahveci, T. and Singh, A. K. 2001. Variable length queries for time series data. In Proceedings of the IEEE International Conference on Data Engineering. 273--282. Google ScholarDigital Library
Keogh, E. 2002. Exact indexing of dynamic time warping. In Proceedings of the International Conference on Very Large Databases. 406--417. Google ScholarDigital Library
Keogh, E., Chakrabarti, K., Mehrotra, S., and Pazzani, M. 2001. Locally adaptive dimensionality reduction for indexing large time series databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 151--162. Google ScholarDigital Library
Keogh, E. J. and Kasetty, S. 2003. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Min. Knowl. Discov. 7, 4, 349--371. Google ScholarDigital Library
Kleinberg, J. 2002. Bursty and hierarchical structure in streams. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 91--101. Google ScholarDigital Library
Lie, B., Jones, R., and Klinkner, K. 2006. Measuring the meaning in time series clustering of text search queries. In Proceedings of the International Conference on Information and Knowledge Management. 836--837. Google ScholarDigital Library
Liu, N., Nong, S., Yan, J., Zhang, B., Chen, Z., and Li, Y. 2006. Similarity of temporal query logs based on ARIMA model. In Proceedings of the IEEE International Conference on Data Mining. 975--979. Google ScholarDigital Library
Oppenheim, A., Willsky, A., and Nawab, S. 1997. Signals and Systems, 2nd Ed. Prentice Hall. Google ScholarDigital Library
Otsuka, S. and Kitsuregawa, M. 2006. Clustering of search engine keywords using access logs. In Proceedings of the International Conference on Database and Expert Systems Applications. 842--852. Google ScholarDigital Library
Otsuka, S., Toyoda, M., Hirai, J., and Kitsuregawa, M. 2004. Extracting user behavior by Web communities technology on global Web logs. In Proceedings of the International Conference on Database and Expert Systems Applications. 957--988.Google Scholar
Rafiei, D. and Mendelzon, A. 1998. Efficient retrieval of similar time sequences using DFT. In Proceedings of the International Conference on Foundations of Data Organization and Algorithms. 249--257.Google Scholar
Rattenbury, T., Good, N., and Naaman, M. 2007. Towards extracting flickr tag semantics. Proceedings of the International World Wide Web Conference. 1287--1288. Google ScholarDigital Library
Richardson, M. 2008. Learning about the world through long-term query logs. ACM Trans. Web, 2, 4. Google ScholarDigital Library
Rode, H. and Hiemstra, D. 2006. Using query profiles for clarification. In Proceedings of the European Conference on Information Retrieval. 205--216. Google ScholarDigital Library
Sanderson, M. and Dumais, S. 2007. Examining repetition in user search behavior. In Proceedings of the European Conference on Information Retrieval. 597--604. Google ScholarDigital Library
Sun, Y., Liu, N., Xie, K., Yan, S., Zhang, B., and Chen, Z. 2007. Causal relation of queries from temporal logs. In Proceedings of the International World Wide Web Conference. 1141--142. Google ScholarDigital Library
Vlachos, M., Meek, C., Vagena, Z., and Gunopulos, D. 2004. Identification of similarities, periodicities & bursts for online search queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 131--142. Google ScholarDigital Library
Vlachos, M., Yu, P., and Castelli, V. 2005. On Periodicity detection and structural periodic similarity. In Proceedings of the SIAM International Conference on Data Mining.Google Scholar
Wang, C. and Wang, X. S. 2000. Multilevel filtering for high dimensional nearest neighbor search. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.Google Scholar
Wang, X., Zhai, C., Hu, X., and Sproat, R. 2007. Mining correlated bursty topic patterns from coordinated text streams. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 784--793. Google ScholarDigital Library
Weber, R., Schek, H.-J., and Blott, S. 1998. A quantitative analysis and performance study for similarity search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Database. 194--205. Google ScholarDigital Library
Yi, B. and Faloutsos, C. 2000. Fast time sequence indexing for arbitrary Lp norms. In Proceedings of the International Conference on Very Large Database. 385--394. Google ScholarDigital Library
Yianilos, P. 1992. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithm. 311--321. Google ScholarDigital Library
Zhao, Q., Hoi, S., Liu, T.-Y., Bhowmick, S. S., Lyu, M. R., and Ma, W.-Y. 2006. Time-dependent semantic similarity measure of queries using historical click-through data. In Proceedings of the International World Wide Web Conference. 543--552. Google ScholarDigital Library
Ziegler, C.-N., Simon, K., and Lausen, G. 2006. Automatic computation of semantic proximity using taxonomic knowledge. In Proceedings of the International Conference on Information and Knowledge Management. 465--474. Google ScholarDigital Library

Index Terms

Optimal distance bounds for fast search on compressed time-series query logs
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Data compression
  2. Information systems applications
    1. Data mining

Recommendations

Re-ranking search results using query logs
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

This work addresses two common problems in search, frequently occurring with underspecified user queries: the top-ranked results for such queries may not contain documents relevant to the user's search intent, and fresh and relevant pages may not get ...
Read More
Mining search engine query logs via suggestion sampling

Many search engines and other web applications suggest auto-completions as the user types in a query. The suggestions are generated from hidden underlying databases, such as query logs, directories, and lexicons. These databases consist of interesting ...
Read More
Predicting Next Search Actions with Search Engine Query Logs
WI-IAT '11: Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01

Capturing users' future search actions has many potential applications such as query recommendation, web page re-ranking, advertisement arrangement, and so on. This paper predicts users' future queries and URL clicks based on their current access ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on the Web Volume 4, Issue 2
April 2010
136 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/1734200
Issue’s Table of Contents

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 29 April 2010
- Accepted: 1 December 2009
- Revised: 1 October 2009
- Received: 1 December 2007
Published in tweb Volume 4, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 3
  Total Citations
  View Citations
- 451
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Optimal distance bounds for fast search on compressed time-series query logs

ACM Transactions on the Web

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

Re-ranking search results using query logs

Mining search engine query logs via suggestion sampling

Predicting Next Search Actions with Search Engine Query Logs

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Optimal distance bounds for fast search on compressed time-series query logs

ACM Transactions on the Web

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

Re-ranking search results using query logs

Mining search engine query logs via suggestion sampling

Predicting Next Search Actions with Search Engine Query Logs

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media