Skip to main content
Erschienen in: Discover Computing 5/2011

01.10.2011

Page importance computation based on Markov processes

verfasst von: Bin Gao, Tie-Yan Liu, Yuting Liu, Taifeng Wang, Zhi-Ming Ma, Hang Li

Erschienen in: Discover Computing | Ausgabe 5/2011

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

This paper is concerned with Markov processes for computing page importance. Page importance is a key factor in Web search. Many algorithms such as PageRank and its variations have been proposed for computing the quantity in different scenarios, using different data sources, and with different assumptions. Then a question arises, as to whether these algorithms can be explained in a unified way, and whether there is a general guideline to design new algorithms for new scenarios. In order to answer these questions, we introduce a General Markov Framework in this paper. Under the framework, a Web Markov Skeleton Process is used to model the random walk conducted by the web surfer on a given graph. Page importance is then defined as the product of two factors: page reachability, the average possibility that the surfer arrives at the page, and page utility, the average value that the page gives to the surfer in a single visit. These two factors can be computed as the stationary probability distribution of the corresponding embedded Markov chain and the mean staying time on each page of the Web Markov Skeleton Process respectively. We show that this general framework can cover many existing algorithms including PageRank, TrustRank, and BrowseRank as its special cases. We also show that the framework can help us design new algorithms to handle more complex problems, by constructing graphs from new data sources, employing new family members of the Web Markov Skeleton Process, and using new methods to estimate these two factors. In particular, we demonstrate the use of the framework with the exploitation of a new process, named Mirror Semi-Markov Process. In the new process, the staying time on a page, as a random variable, is assumed to be dependent on both the current page and its inlink pages. Our experimental results on both the user browsing graph and the mobile web graph validate that the Mirror Semi-Markov Process is more effective than previous models in several tasks, even when there are web spams and when the assumption on preferential attachment does not hold.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
According to recent study, US companies paid a record $14.2 billion for paid keyword-driven contextual ads in 2009. Meanwhile, click fraud rates rose to 17.4 to 29.4% in the first 3 months of 2010. That is, we will observe a great number of transitions from the content pages (e.g., blogs and forums) to the landing pages of the click fraud ads. cf. http://​lastwatchdog.​com/​botnet-driven-click-fraud-steals-millions-advertisers/​
 
2
It can also be called as the skeleton process of the Web Markov Skeleton Process.
 
3
In some algorithms, the random surfer sometimes does not follow the edges but performs random resets. In such case, we regard the graph as containing virtual edges corresponding to the resets.
 
4
Note that we briefly discussed BrowseRank Plus in a short paper (Gao et al. 2009).
 
5
Lattice distribution is a discrete distribution of a random variable such that every possible value can be represented in the form a + bn, where a, b ≠ 0 and n is an integer.
 
6
Note that we briefly discussed MobileRank in a short paper (Gao et al. 2009).
 
7
For privacy consideration, we do not list the name of this website in the paper.
 
Literatur
Zurück zum Zitat Berberich, K., Vazirgiannis, M., & Weikum, G. (2004). Time-aware authority ranking. In Algorithms and Models for the Web-Graph: Third International Workshop, WAW’04 (pp. 131–141). Springer. Berberich, K., Vazirgiannis, M., & Weikum, G. (2004). Time-aware authority ranking. In Algorithms and Models for the Web-Graph: Third International Workshop, WAW’04 (pp. 131–141). Springer.
Zurück zum Zitat Bianchini, M., Gori, M., & Scarselli, F. (2005). Inside pagerank. ACM Transactions on Interet Technology, 5(1), 92–128.CrossRef Bianchini, M., Gori, M., & Scarselli, F. (2005). Inside pagerank. ACM Transactions on Interet Technology, 5(1), 92–128.CrossRef
Zurück zum Zitat Boldi, P., Santini, M., & Vigna, S. (2005). Pagerank as a function of the damping factor. In WWW ’05. ACM. Boldi, P., Santini, M., & Vigna, S. (2005). Pagerank as a function of the damping factor. In WWW ’05. ACM.
Zurück zum Zitat Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.
Zurück zum Zitat Chen, Z., Tao, L., Wang, J., Liu, W., & Ma, W. (2002). A unified framework for web link analysis. In WISE’02. Chen, Z., Tao, L., Wang, J., Liu, W., & Ma, W. (2002). A unified framework for web link analysis. In WISE’02.
Zurück zum Zitat Ding, C., He, X., Husbands, P., Zha, H., & Simon, H. (2001). PageRank, HITS, and a unified framework link analysis. LBNL Tech Report 49372, Nov 2001 (updated September 2002). Ding, C., He, X., Husbands, P., Zha, H., & Simon, H. (2001). PageRank, HITS, and a unified framework link analysis. LBNL Tech Report 49372, Nov 2001 (updated September 2002).
Zurück zum Zitat Golub, G. H., & Loan, C. F. V. (1996). Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press.MATH Golub, G. H., & Loan, C. F. V. (1996). Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press.MATH
Zurück zum Zitat Gao, B., Liu, T., Ma, Z., Wang, T., & Li, H. (2009). A general markov framework for page importance computation. In the Proceedings of the 18th ACM conference on information and knowledge management (CIKM 2009) (pp. 1835–1838). Gao, B., Liu, T., Ma, Z., Wang, T., & Li, H. (2009). A general markov framework for page importance computation. In the Proceedings of the 18th ACM conference on information and knowledge management (CIKM 2009) (pp. 1835–1838).
Zurück zum Zitat Gyongyi, Z., & Garcia-Molina, H. (2004). Web spam Taxonomy. Technical report, Stanford Digital Library Technologies Project. Gyongyi, Z., & Garcia-Molina, H. (2004). Web spam Taxonomy. Technical report, Stanford Digital Library Technologies Project.
Zurück zum Zitat Gyongyi, Z., Garcia-Molina, H., & Pedersen, J. (2004). Combating web spam with trustrank. In VLDB ’04 (pp. 576–587). VLDB Endowment. Gyongyi, Z., Garcia-Molina, H., & Pedersen, J. (2004). Combating web spam with trustrank. In VLDB ’04 (pp. 576–587). VLDB Endowment.
Zurück zum Zitat Haveliwala, T. (1999). Efficient computation of pageRank. Technical Report 1999–31. Haveliwala, T. (1999). Efficient computation of pageRank. Technical Report 1999–31.
Zurück zum Zitat Haveliwala, T., & Kamvar, S. (2003). The second eigenvalue of the google matrix. Haveliwala, T., & Kamvar, S. (2003). The second eigenvalue of the google matrix.
Zurück zum Zitat Haveliwala, T., Kamvar, S., & Jeh, G. (2003). An analytical comparison of approaches to personalizing pagerank. Haveliwala, T., Kamvar, S., & Jeh, G. (2003). An analytical comparison of approaches to personalizing pagerank.
Zurück zum Zitat Haveliwala, T. H. (May 2002). Topic-sensitive pagerank. In WWW ’02, Honolulu, Hawaii. Haveliwala, T. H. (May 2002). Topic-sensitive pagerank. In WWW ’02, Honolulu, Hawaii.
Zurück zum Zitat Hou, Z., & Liu, G. (2005). Markov Skeleton processes and their applications. USA: Science Press and International Press Hou, Z., & Liu, G. (2005). Markov Skeleton processes and their applications. USA: Science Press and International Press
Zurück zum Zitat Hou, Z., Liu, Z., & Zou, J. (June 1998). Markov Skeleton Processes. Chinese Science Bulletin, 43(11), 881–889. Hou, Z., Liu, Z., & Zou, J. (June 1998). Markov Skeleton Processes. Chinese Science Bulletin, 43(11), 881–889.
Zurück zum Zitat Jeh, G., Widom, J. (2002). SimRank: A measure of structural-context similarity. In KDD ’02. Jeh, G., Widom, J. (2002). SimRank: A measure of structural-context similarity. In KDD ’02.
Zurück zum Zitat Jindal, A., Crutchfield, C., Goel, S., Kolluri, R., & Jain, R. (2008). The mobile web is structurally different. In the Proceedings of the 11th IEEE global internet symposium. Jindal, A., Crutchfield, C., Goel, S., Kolluri, R., & Jain, R. (2008). The mobile web is structurally different. In the Proceedings of the 11th IEEE global internet symposium.
Zurück zum Zitat Kleinberg, J. M. (1998). Authoritative sources in a hyperlinked environment. In SODA ’98 (pp. 668–677). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. Kleinberg, J. M. (1998). Authoritative sources in a hyperlinked environment. In SODA ’98 (pp. 668–677). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.
Zurück zum Zitat Liu, Y., Gao, B., Liu, T., Zhang, Y., Ma, Z., He, S., et al. (2008). BrowseRank: Letting users vote for page importance. In SIGIR ’08 (pp. 451–458). Liu, Y., Gao, B., Liu, T., Zhang, Y., Ma, Z., He, S., et al. (2008). BrowseRank: Letting users vote for page importance. In SIGIR ’08 (pp. 451–458).
Zurück zum Zitat McSherry, F. (2005). A uniform approach to accelerated pagerank computation. In WWW ’05 (pp. 575–582). New York, NY, USA: ACM. McSherry, F. (2005). A uniform approach to accelerated pagerank computation. In WWW ’05 (pp. 575–582). New York, NY, USA: ACM.
Zurück zum Zitat Nie, Z., Zhang, Y., Wen, J., & Ma, W. (2005). Object-level ranking: Bringing order to web objects. In WWW’05. Nie, Z., Zhang, Y., Wen, J., & Ma, W. (2005). Object-level ranking: Bringing order to web objects. In WWW’05.
Zurück zum Zitat Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project. Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project.
Zurück zum Zitat Papoulis, A., & Pillai, S. U. (2001). Probability, random variables and stochastic processes. New York: McGraw-Hill Science/Engineering/Math. Papoulis, A., & Pillai, S. U. (2001). Probability, random variables and stochastic processes. New York: McGraw-Hill Science/Engineering/Math.
Zurück zum Zitat Poblete, B., Castillo, C., & Gionis, A. (2008). Dr. Searcher and Mr. Browser: A unified hyperlink-click graph. In CIKM ’08: Proceeding of the 17th ACM conference on information and knowledge mining (pp. 1123–1132). Poblete, B., Castillo, C., & Gionis, A. (2008). Dr. Searcher and Mr. Browser: A unified hyperlink-click graph. In CIKM ’08: Proceeding of the 17th ACM conference on information and knowledge mining (pp. 1123–1132).
Zurück zum Zitat Richardson, M., & Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Advances in neural information processing systems 14. Cambridge: MIT Press. Richardson, M., & Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Advances in neural information processing systems 14. Cambridge: MIT Press.
Zurück zum Zitat Yu, P. S., Li, X., & Liu, B. (2005). Adding the temporal dimension to search—A case study in publication search. Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence. Yu, P. S., Li, X., & Liu, B. (2005). Adding the temporal dimension to search—A case study in publication search. Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence.
Metadaten
Titel
Page importance computation based on Markov processes
verfasst von
Bin Gao
Tie-Yan Liu
Yuting Liu
Taifeng Wang
Zhi-Ming Ma
Hang Li
Publikationsdatum
01.10.2011
Verlag
Springer Netherlands
Erschienen in
Discover Computing / Ausgabe 5/2011
Print ISSN: 2948-2984
Elektronische ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-011-9164-x