Abstract
Consider a database of time-series, where each datapoint in the series records the total number of users who asked for a specific query at an internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data organization perspective, because query Weblogs capture important trends and statistics, they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, Weblog data can provide an important polling mechanism for the microeconomic aspects of a search engine, since they can facilitate and promote the advertising facet of the search engine (understand what users request and when they request it).
Due to the sheer amount of time-series Weblogs, manipulation of the logs in a compressed form is an impeding necessity for fast data processing and compact storage requirements. Here, we explicate how to compute the lower and upper distance bounds on the time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, leading to better candidate selection/elimination and ultimately faster search performance. Our derivation of the optimal distance bounds is based on the careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage of the proposed method, compared to previous compression/search techniques. The presented method results in a 10--30% improvement on distance estimations, which in turn leads to 25--80% improvement on the search performance.
Supplemental Material
Available for Download
Online appendix to Optimal distance bounds for fast search on compressed time-series query logs on article 6.
- Adar, E., Weld, D., Bershad, B., and Gribble, S. 2007. Why we search: Visualizing and predicting user behavior. In Proceedings of the International World Wide Web Conference. Google ScholarDigital Library
- Agrawal, R., Faloutsos, C., and Swami, A. 1993. Efficient similarity search in sequence databases. In Proceedings of the International Conference on Foundations of Data Organization and Algorithms (FODO). 69--84. Google ScholarDigital Library
- Assent, I., Krieger, R., Afschari, F., and Seidl, T. 2008. The TS-tree: Efficient time series search and retrieval. In Proceedings of the International Conference on Extending Database Technology (EDBT). 252--263. Google ScholarDigital Library
- Beitzel, S., Jensen, E., Chowdhury, A., Frieder, O., and Grossman, D. 2007. Temporal analysis of a very large topically categorized Web query log. In J. Amer. Soc. Inform. Sci. Tech. 58, 2, 166--178. Google ScholarDigital Library
- Bertsekas, D. P. 2000. Nonlinear Programming. Athena Scientific.Google Scholar
- Bollegala, D., Matsuo, Y., and Ishizuka, M. 2007. Measuring semantic similarity between words using Web search engines. In Proceedings of the International World Wide Web Conference. 757--766. Google ScholarDigital Library
- Bozkaya, T. and Özsoyoglu, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 357--368. Google ScholarDigital Library
- Cai, Y. and Ng, R. 2004. Indexing spatio-temporal trajectories with chebyshev polynomials. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 599--610. Google ScholarDigital Library
- Chee Fu, A. W., Chan, P. M., Cheung, Y.-L., and Moon, Y. 2000. Dynamic VP-Tree indexing for N-nearest neighbor search given pair-wise distances. J. VLDB, 154--173. Google ScholarDigital Library
- Chen, Q., Li, M., and Zhou, M. 2007. Improving query spelling correction using Web search results. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language. 181--189.Google Scholar
- Chien, S. and Immorlica, N. 2005. Semantic similarity between search engine queries using temporal correlation. In Proceedings of the International World Wide Web Conference. 2--11. Google ScholarDigital Library
- Cover, T. M. and Thomas, J. A. 1991. Elements of Information Theory. John Wiley and Sons. Google ScholarDigital Library
- Dubinko, M., Kumar, R., Magnani, J., Novak, J., Raghavan, P., and Tomkins, A. 2006. Visualizing tags over time. In Proceedings of the International World Wide Web Conference. 193--202. Google ScholarDigital Library
- Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 47--57. Google ScholarDigital Library
- Hall, E. A. M. C. K. 2009. Gazpacho and summer rash: Lexical relationships from temporal patterns of web search queries. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1046--1055. Google ScholarDigital Library
- Kage, T. and Sumiya, K. 2006. A Web search method based on the temporal relation of query keywords. In Proceedings of the Web Information Systems. 4--15. Google ScholarDigital Library
- Kahveci, T. and Singh, A. K. 2001. Variable length queries for time series data. In Proceedings of the IEEE International Conference on Data Engineering. 273--282. Google ScholarDigital Library
- Keogh, E. 2002. Exact indexing of dynamic time warping. In Proceedings of the International Conference on Very Large Databases. 406--417. Google ScholarDigital Library
- Keogh, E., Chakrabarti, K., Mehrotra, S., and Pazzani, M. 2001. Locally adaptive dimensionality reduction for indexing large time series databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 151--162. Google ScholarDigital Library
- Keogh, E. J. and Kasetty, S. 2003. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Min. Knowl. Discov. 7, 4, 349--371. Google ScholarDigital Library
- Kleinberg, J. 2002. Bursty and hierarchical structure in streams. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 91--101. Google ScholarDigital Library
- Lie, B., Jones, R., and Klinkner, K. 2006. Measuring the meaning in time series clustering of text search queries. In Proceedings of the International Conference on Information and Knowledge Management. 836--837. Google ScholarDigital Library
- Liu, N., Nong, S., Yan, J., Zhang, B., Chen, Z., and Li, Y. 2006. Similarity of temporal query logs based on ARIMA model. In Proceedings of the IEEE International Conference on Data Mining. 975--979. Google ScholarDigital Library
- Oppenheim, A., Willsky, A., and Nawab, S. 1997. Signals and Systems, 2nd Ed. Prentice Hall. Google ScholarDigital Library
- Otsuka, S. and Kitsuregawa, M. 2006. Clustering of search engine keywords using access logs. In Proceedings of the International Conference on Database and Expert Systems Applications. 842--852. Google ScholarDigital Library
- Otsuka, S., Toyoda, M., Hirai, J., and Kitsuregawa, M. 2004. Extracting user behavior by Web communities technology on global Web logs. In Proceedings of the International Conference on Database and Expert Systems Applications. 957--988.Google Scholar
- Rafiei, D. and Mendelzon, A. 1998. Efficient retrieval of similar time sequences using DFT. In Proceedings of the International Conference on Foundations of Data Organization and Algorithms. 249--257.Google Scholar
- Rattenbury, T., Good, N., and Naaman, M. 2007. Towards extracting flickr tag semantics. Proceedings of the International World Wide Web Conference. 1287--1288. Google ScholarDigital Library
- Richardson, M. 2008. Learning about the world through long-term query logs. ACM Trans. Web, 2, 4. Google ScholarDigital Library
- Rode, H. and Hiemstra, D. 2006. Using query profiles for clarification. In Proceedings of the European Conference on Information Retrieval. 205--216. Google ScholarDigital Library
- Sanderson, M. and Dumais, S. 2007. Examining repetition in user search behavior. In Proceedings of the European Conference on Information Retrieval. 597--604. Google ScholarDigital Library
- Sun, Y., Liu, N., Xie, K., Yan, S., Zhang, B., and Chen, Z. 2007. Causal relation of queries from temporal logs. In Proceedings of the International World Wide Web Conference. 1141--142. Google ScholarDigital Library
- Vlachos, M., Meek, C., Vagena, Z., and Gunopulos, D. 2004. Identification of similarities, periodicities & bursts for online search queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 131--142. Google ScholarDigital Library
- Vlachos, M., Yu, P., and Castelli, V. 2005. On Periodicity detection and structural periodic similarity. In Proceedings of the SIAM International Conference on Data Mining.Google Scholar
- Wang, C. and Wang, X. S. 2000. Multilevel filtering for high dimensional nearest neighbor search. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.Google Scholar
- Wang, X., Zhai, C., Hu, X., and Sproat, R. 2007. Mining correlated bursty topic patterns from coordinated text streams. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 784--793. Google ScholarDigital Library
- Weber, R., Schek, H.-J., and Blott, S. 1998. A quantitative analysis and performance study for similarity search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Database. 194--205. Google ScholarDigital Library
- Yi, B. and Faloutsos, C. 2000. Fast time sequence indexing for arbitrary Lp norms. In Proceedings of the International Conference on Very Large Database. 385--394. Google ScholarDigital Library
- Yianilos, P. 1992. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithm. 311--321. Google ScholarDigital Library
- Zhao, Q., Hoi, S., Liu, T.-Y., Bhowmick, S. S., Lyu, M. R., and Ma, W.-Y. 2006. Time-dependent semantic similarity measure of queries using historical click-through data. In Proceedings of the International World Wide Web Conference. 543--552. Google ScholarDigital Library
- Ziegler, C.-N., Simon, K., and Lausen, G. 2006. Automatic computation of semantic proximity using taxonomic knowledge. In Proceedings of the International Conference on Information and Knowledge Management. 465--474. Google ScholarDigital Library
Index Terms
- Optimal distance bounds for fast search on compressed time-series query logs
Recommendations
Re-ranking search results using query logs
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge managementThis work addresses two common problems in search, frequently occurring with underspecified user queries: the top-ranked results for such queries may not contain documents relevant to the user's search intent, and fresh and relevant pages may not get ...
Mining search engine query logs via suggestion sampling
Many search engines and other web applications suggest auto-completions as the user types in a query. The suggestions are generated from hidden underlying databases, such as query logs, directories, and lexicons. These databases consist of interesting ...
Predicting Next Search Actions with Search Engine Query Logs
WI-IAT '11: Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01Capturing users' future search actions has many potential applications such as query recommendation, web page re-ranking, advertisement arrangement, and so on. This paper predicts users' future queries and URL clicks based on their current access ...
Comments