Skip to main content
Top
Published in: Discover Computing 5/2011

01-10-2011

Page importance computation based on Markov processes

Authors: Bin Gao, Tie-Yan Liu, Yuting Liu, Taifeng Wang, Zhi-Ming Ma, Hang Li

Published in: Discover Computing | Issue 5/2011

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

This paper is concerned with Markov processes for computing page importance. Page importance is a key factor in Web search. Many algorithms such as PageRank and its variations have been proposed for computing the quantity in different scenarios, using different data sources, and with different assumptions. Then a question arises, as to whether these algorithms can be explained in a unified way, and whether there is a general guideline to design new algorithms for new scenarios. In order to answer these questions, we introduce a General Markov Framework in this paper. Under the framework, a Web Markov Skeleton Process is used to model the random walk conducted by the web surfer on a given graph. Page importance is then defined as the product of two factors: page reachability, the average possibility that the surfer arrives at the page, and page utility, the average value that the page gives to the surfer in a single visit. These two factors can be computed as the stationary probability distribution of the corresponding embedded Markov chain and the mean staying time on each page of the Web Markov Skeleton Process respectively. We show that this general framework can cover many existing algorithms including PageRank, TrustRank, and BrowseRank as its special cases. We also show that the framework can help us design new algorithms to handle more complex problems, by constructing graphs from new data sources, employing new family members of the Web Markov Skeleton Process, and using new methods to estimate these two factors. In particular, we demonstrate the use of the framework with the exploitation of a new process, named Mirror Semi-Markov Process. In the new process, the staying time on a page, as a random variable, is assumed to be dependent on both the current page and its inlink pages. Our experimental results on both the user browsing graph and the mobile web graph validate that the Mirror Semi-Markov Process is more effective than previous models in several tasks, even when there are web spams and when the assumption on preferential attachment does not hold.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
According to recent study, US companies paid a record $14.2 billion for paid keyword-driven contextual ads in 2009. Meanwhile, click fraud rates rose to 17.4 to 29.4% in the first 3 months of 2010. That is, we will observe a great number of transitions from the content pages (e.g., blogs and forums) to the landing pages of the click fraud ads. cf. http://​lastwatchdog.​com/​botnet-driven-click-fraud-steals-millions-advertisers/​
 
2
It can also be called as the skeleton process of the Web Markov Skeleton Process.
 
3
In some algorithms, the random surfer sometimes does not follow the edges but performs random resets. In such case, we regard the graph as containing virtual edges corresponding to the resets.
 
4
Note that we briefly discussed BrowseRank Plus in a short paper (Gao et al. 2009).
 
5
Lattice distribution is a discrete distribution of a random variable such that every possible value can be represented in the form a + bn, where a, b ≠ 0 and n is an integer.
 
6
Note that we briefly discussed MobileRank in a short paper (Gao et al. 2009).
 
7
For privacy consideration, we do not list the name of this website in the paper.
 
Literature
go back to reference Berberich, K., Vazirgiannis, M., & Weikum, G. (2004). Time-aware authority ranking. In Algorithms and Models for the Web-Graph: Third International Workshop, WAW’04 (pp. 131–141). Springer. Berberich, K., Vazirgiannis, M., & Weikum, G. (2004). Time-aware authority ranking. In Algorithms and Models for the Web-Graph: Third International Workshop, WAW’04 (pp. 131–141). Springer.
go back to reference Bianchini, M., Gori, M., & Scarselli, F. (2005). Inside pagerank. ACM Transactions on Interet Technology, 5(1), 92–128.CrossRef Bianchini, M., Gori, M., & Scarselli, F. (2005). Inside pagerank. ACM Transactions on Interet Technology, 5(1), 92–128.CrossRef
go back to reference Boldi, P., Santini, M., & Vigna, S. (2005). Pagerank as a function of the damping factor. In WWW ’05. ACM. Boldi, P., Santini, M., & Vigna, S. (2005). Pagerank as a function of the damping factor. In WWW ’05. ACM.
go back to reference Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.
go back to reference Chen, Z., Tao, L., Wang, J., Liu, W., & Ma, W. (2002). A unified framework for web link analysis. In WISE’02. Chen, Z., Tao, L., Wang, J., Liu, W., & Ma, W. (2002). A unified framework for web link analysis. In WISE’02.
go back to reference Ding, C., He, X., Husbands, P., Zha, H., & Simon, H. (2001). PageRank, HITS, and a unified framework link analysis. LBNL Tech Report 49372, Nov 2001 (updated September 2002). Ding, C., He, X., Husbands, P., Zha, H., & Simon, H. (2001). PageRank, HITS, and a unified framework link analysis. LBNL Tech Report 49372, Nov 2001 (updated September 2002).
go back to reference Golub, G. H., & Loan, C. F. V. (1996). Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press.MATH Golub, G. H., & Loan, C. F. V. (1996). Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press.MATH
go back to reference Gao, B., Liu, T., Ma, Z., Wang, T., & Li, H. (2009). A general markov framework for page importance computation. In the Proceedings of the 18th ACM conference on information and knowledge management (CIKM 2009) (pp. 1835–1838). Gao, B., Liu, T., Ma, Z., Wang, T., & Li, H. (2009). A general markov framework for page importance computation. In the Proceedings of the 18th ACM conference on information and knowledge management (CIKM 2009) (pp. 1835–1838).
go back to reference Gyongyi, Z., & Garcia-Molina, H. (2004). Web spam Taxonomy. Technical report, Stanford Digital Library Technologies Project. Gyongyi, Z., & Garcia-Molina, H. (2004). Web spam Taxonomy. Technical report, Stanford Digital Library Technologies Project.
go back to reference Gyongyi, Z., Garcia-Molina, H., & Pedersen, J. (2004). Combating web spam with trustrank. In VLDB ’04 (pp. 576–587). VLDB Endowment. Gyongyi, Z., Garcia-Molina, H., & Pedersen, J. (2004). Combating web spam with trustrank. In VLDB ’04 (pp. 576–587). VLDB Endowment.
go back to reference Haveliwala, T. (1999). Efficient computation of pageRank. Technical Report 1999–31. Haveliwala, T. (1999). Efficient computation of pageRank. Technical Report 1999–31.
go back to reference Haveliwala, T., & Kamvar, S. (2003). The second eigenvalue of the google matrix. Haveliwala, T., & Kamvar, S. (2003). The second eigenvalue of the google matrix.
go back to reference Haveliwala, T., Kamvar, S., & Jeh, G. (2003). An analytical comparison of approaches to personalizing pagerank. Haveliwala, T., Kamvar, S., & Jeh, G. (2003). An analytical comparison of approaches to personalizing pagerank.
go back to reference Haveliwala, T. H. (May 2002). Topic-sensitive pagerank. In WWW ’02, Honolulu, Hawaii. Haveliwala, T. H. (May 2002). Topic-sensitive pagerank. In WWW ’02, Honolulu, Hawaii.
go back to reference Hou, Z., & Liu, G. (2005). Markov Skeleton processes and their applications. USA: Science Press and International Press Hou, Z., & Liu, G. (2005). Markov Skeleton processes and their applications. USA: Science Press and International Press
go back to reference Hou, Z., Liu, Z., & Zou, J. (June 1998). Markov Skeleton Processes. Chinese Science Bulletin, 43(11), 881–889. Hou, Z., Liu, Z., & Zou, J. (June 1998). Markov Skeleton Processes. Chinese Science Bulletin, 43(11), 881–889.
go back to reference Jeh, G., Widom, J. (2002). SimRank: A measure of structural-context similarity. In KDD ’02. Jeh, G., Widom, J. (2002). SimRank: A measure of structural-context similarity. In KDD ’02.
go back to reference Jindal, A., Crutchfield, C., Goel, S., Kolluri, R., & Jain, R. (2008). The mobile web is structurally different. In the Proceedings of the 11th IEEE global internet symposium. Jindal, A., Crutchfield, C., Goel, S., Kolluri, R., & Jain, R. (2008). The mobile web is structurally different. In the Proceedings of the 11th IEEE global internet symposium.
go back to reference Kleinberg, J. M. (1998). Authoritative sources in a hyperlinked environment. In SODA ’98 (pp. 668–677). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. Kleinberg, J. M. (1998). Authoritative sources in a hyperlinked environment. In SODA ’98 (pp. 668–677). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.
go back to reference Liu, Y., Gao, B., Liu, T., Zhang, Y., Ma, Z., He, S., et al. (2008). BrowseRank: Letting users vote for page importance. In SIGIR ’08 (pp. 451–458). Liu, Y., Gao, B., Liu, T., Zhang, Y., Ma, Z., He, S., et al. (2008). BrowseRank: Letting users vote for page importance. In SIGIR ’08 (pp. 451–458).
go back to reference McSherry, F. (2005). A uniform approach to accelerated pagerank computation. In WWW ’05 (pp. 575–582). New York, NY, USA: ACM. McSherry, F. (2005). A uniform approach to accelerated pagerank computation. In WWW ’05 (pp. 575–582). New York, NY, USA: ACM.
go back to reference Nie, Z., Zhang, Y., Wen, J., & Ma, W. (2005). Object-level ranking: Bringing order to web objects. In WWW’05. Nie, Z., Zhang, Y., Wen, J., & Ma, W. (2005). Object-level ranking: Bringing order to web objects. In WWW’05.
go back to reference Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project. Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project.
go back to reference Papoulis, A., & Pillai, S. U. (2001). Probability, random variables and stochastic processes. New York: McGraw-Hill Science/Engineering/Math. Papoulis, A., & Pillai, S. U. (2001). Probability, random variables and stochastic processes. New York: McGraw-Hill Science/Engineering/Math.
go back to reference Poblete, B., Castillo, C., & Gionis, A. (2008). Dr. Searcher and Mr. Browser: A unified hyperlink-click graph. In CIKM ’08: Proceeding of the 17th ACM conference on information and knowledge mining (pp. 1123–1132). Poblete, B., Castillo, C., & Gionis, A. (2008). Dr. Searcher and Mr. Browser: A unified hyperlink-click graph. In CIKM ’08: Proceeding of the 17th ACM conference on information and knowledge mining (pp. 1123–1132).
go back to reference Richardson, M., & Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Advances in neural information processing systems 14. Cambridge: MIT Press. Richardson, M., & Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Advances in neural information processing systems 14. Cambridge: MIT Press.
go back to reference Yu, P. S., Li, X., & Liu, B. (2005). Adding the temporal dimension to search—A case study in publication search. Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence. Yu, P. S., Li, X., & Liu, B. (2005). Adding the temporal dimension to search—A case study in publication search. Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence.
Metadata
Title
Page importance computation based on Markov processes
Authors
Bin Gao
Tie-Yan Liu
Yuting Liu
Taifeng Wang
Zhi-Ming Ma
Hang Li
Publication date
01-10-2011
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 5/2011
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-011-9164-x

Premium Partner