Published in: Discover Computing 3/2011

01.06.2011 | Web Mining for Search

A unified representation of web logs for mining applications

Authors: Michelangelo Diligenti, Marco Gori, Marco Maggini

Abstract

The collective feedback of the users of an Information Retrieval (IR) system has been shown to provide semantic information that, while hard to extract using standard IR techniques, can be useful in Web mining tasks. In the last few years, several approaches have been proposed to process the logs stored by Internet Service Providers (ISPs), Intranet proxies or Web search engines. However, the solutions proposed in the literature only partially represent the information available in the Web logs. In this paper, we propose to use a richer data structure, which is able to preserve most of the information available in the Web logs. This data structure consists of three groups of entities: users, documents and queries, which are connected in a network of relations. Query refinements correspond to separate transitions between the corresponding query nodes in the graph, while users are linked to the queries they have issued and to the documents they have selected. The classical query/document transitions, which connect a query to the documents selected by the users in the returned result page, are also considered. The resulting data structure is a complete representation of the collective search activity performed by the users of a search engine or of an Intranet. The experimental results show that this more powerful representation can be successfully used in several Web mining tasks, such as discovering semantically relevant query suggestions and categorizing Web pages by topic.
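
As an illustration of the data structure described in the abstract, the following is a minimal Python sketch of a graph over users, queries and documents. It is not the authors' implementation: the `LogEntry` record, the `build_log_graph` function, the use of raw counts as edge weights, and the rule that any two consecutive distinct queries by the same user count as a refinement (with no session segmentation) are all assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    # Hypothetical log record; real Web logs carry different fields.
    user: str
    query: str
    clicked_doc: Optional[str]  # None when no result was selected


def build_log_graph(entries):
    """Return a dict mapping (source_node, target_node) to an occurrence count.

    Nodes are (kind, name) tuples, with kind in {"user", "query", "doc"},
    so names from different groups never collide. Entries are assumed to be
    in chronological order.
    """
    edges = defaultdict(int)
    last_query = {}  # user name -> most recent query text by that user

    for entry in entries:
        user_node = ("user", entry.user)
        query_node = ("query", entry.query)
        previous = last_query.get(entry.user)

        if previous != entry.query:
            # The user issued a new query: user -> query link.
            edges[(user_node, query_node)] += 1
            if previous is not None:
                # Assumed refinement: previous query -> new query transition.
                edges[(("query", previous), query_node)] += 1
        last_query[entry.user] = entry.query

        if entry.clicked_doc is not None:
            doc_node = ("doc", entry.clicked_doc)
            edges[(query_node, doc_node)] += 1  # classical query/document link
            edges[(user_node, doc_node)] += 1   # user -> selected document link

    return dict(edges)


# Small made-up log used only to exercise the sketch.
sample_log = [
    LogEntry("alice", "jaguar", None),
    LogEntry("alice", "jaguar car", "cars.example/jaguar"),
    LogEntry("bob", "jaguar car", "cars.example/jaguar"),
    LogEntry("bob", "jaguar car", "dealer.example"),
]
graph = build_log_graph(sample_log)
```

In this sketch all four relation types live in one edge dictionary, so a mining algorithm can either walk the whole graph or filter edges by the node kinds at each end.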


Metadata
Title
A unified representation of web logs for mining applications
Authors
Michelangelo Diligenti
Marco Gori
Marco Maggini
Publication date
01.06.2011
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 3/2011
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-010-9160-6
