
2009 | Book

Algorithms and Models for the Web-Graph

6th International Workshop, WAW 2009, Barcelona, Spain, February 12-13, 2009. Proceedings

Edited by: Konstantin Avrachenkov, Debora Donato, Nelly Litvak

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 6th International Workshop on Algorithms and Models for the Web-Graph, WAW 2009, held in Barcelona, Spain, in February 2009, co-located with WSDM 2009, the Second ACM International Conference on Web Search and Data Mining. The 14 revised full papers presented were carefully reviewed and selected from numerous submissions. The papers address a wide variety of topics related to the study of the Web-graph, such as theoretical and empirical analysis of the Web graph and Web 2.0 graphs, random walks on the Web and Web 2.0 graphs and their applications, and the design and performance evaluation of algorithms for social networks. The papers are organized in three topical sections: graph models for complex networks, PageRank and the Web graph, and social networks and search.

Table of contents

Frontmatter

Graph Models for Complex Networks

Information Theoretic Comparison of Stochastic Graph Models: Some Experiments
Abstract
The Modularity-Q measure of community structure is known to falsely ascribe community structure to random graphs, at least when it is naively applied. Although Q is motivated by a simple kind of comparison of stochastic graph models, it has been suggested that a more careful comparison in an information-theoretic framework might avoid problems like this one. Most earlier papers exploring this idea have ignored the issue of skewed degree distributions and have only done experiments on a few small graphs. By means of a large-scale experiment on over 100 large complex networks, we have found that modeling the degree distribution is essential. Once this is done, the resulting information-theoretic clustering measure does indeed avoid Q’s bad property of seeing cluster structure in random graphs.
Kevin J. Lang
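For orientation, the Modularity-Q measure discussed above is, in its standard Newman-Girvan form (the paper may use a variant):

\[ Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \frac{d_i d_j}{2m}\right)\delta(c_i, c_j) \]

where \(A\) is the adjacency matrix, \(d_i\) the degree of node i, m the number of edges, and \(\delta(c_i, c_j)\) indicates that i and j are assigned to the same cluster. The term \(d_i d_j/2m\) is the expected number of edges between i and j in a random graph with the same degree sequence, which is why modeling the degree distribution is central to any fair null model.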
Approximating the Number of Network Motifs
Abstract
The World Wide Web, the Internet, coupled biological and chemical systems, neural networks, and interacting social species are only a few examples of systems composed of a large number of highly interconnected dynamical units. These networks contain characteristic patterns, termed network motifs, which occur far more often than in randomized networks with the same degree sequence. Several algorithms have been suggested for counting or detecting the number of induced or non-induced occurrences of network motifs in the form of trees and bounded treewidth subgraphs of size O(log n), and of size at most 7 for some motifs.
In addition, counting the number of motifs a node is part of was recently suggested as a method to classify nodes in the network. The promise is that the distribution of motifs a node participates in is an indication of its function in the network. Counting the number of network motifs a node is part of therefore poses a major challenge; however, no practical algorithm for this exists.
We present several algorithms with time complexity \(O\left(e^{2k} k \cdot n \cdot |E| \cdot \log\frac{1}{\delta}/\epsilon^2\right)\) that, for the first time, approximate for every vertex the number of non-induced occurrences of the motif the vertex is part of, for k-length cycles, k-length cycles with a chord, and (k − 1)-length paths, where k = O(log n), and for all motifs of size at most four. In addition, we show algorithms that approximate the total number of non-induced occurrences of these network motifs when no efficient algorithm exists. Some of our algorithms use the color coding technique.
Mira Gonen, Yuval Shavitt
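The color coding technique mentioned in the abstract can be illustrated compactly. The sketch below shows the general technique, not the authors' specific algorithms: it estimates, for every vertex, the number of simple paths on k vertices ending at it.

    import random
    from collections import defaultdict

    def colorful_path_counts(adj, k, trials=10, seed=0):
        """Color coding sketch: randomly k-color the vertices, count
        'colorful' paths (all k colors distinct) by dynamic programming
        over color subsets, then rescale. A colorful path is necessarily
        simple, and a fixed simple path on k vertices becomes colorful
        with probability k!/k^k, so dividing by that probability gives
        an unbiased per-vertex estimate."""
        rng = random.Random(seed)
        n = len(adj)
        k_fact = 1
        for i in range(2, k + 1):
            k_fact *= i
        hit_prob = k_fact / k ** k
        totals = [0.0] * n
        for _ in range(trials):
            color = [rng.randrange(k) for _ in range(n)]
            # count[v][S] = number of colorful paths ending at v whose
            # vertices use exactly the colors in frozenset S
            count = [defaultdict(int) for _ in range(n)]
            for v in range(n):
                count[v][frozenset([color[v]])] = 1
            for _ in range(k - 1):  # extend all paths by one vertex
                nxt = [defaultdict(int) for _ in range(n)]
                for v in range(n):
                    for S, c in count[v].items():
                        for u in adj[v]:
                            if color[u] not in S:
                                nxt[u][S | {color[u]}] += c
                count = nxt
            for v in range(n):
                totals[v] += sum(count[v].values()) / hit_prob
        return [t / trials for t in totals]

The exponential-in-k factors in the paper's stated time bound stem from this kind of subset enumeration and from the number of random colorings needed for concentration.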
Finding Dense Subgraphs with Size Bounds
Abstract
We consider the problem of finding dense subgraphs with specified upper or lower bounds on the number of vertices. We introduce two optimization problems: the densest at-least-k-subgraph problem (dalks), which is to find an induced subgraph of highest average degree among all subgraphs with at least k vertices, and the densest at-most-k-subgraph problem (damks), which is defined similarly. These problems are relaxed versions of the well-known densest k-subgraph problem (dks), which is to find the densest subgraph with exactly k vertices. Our main result is that dalks can be approximated efficiently, even for web-scale graphs. We give a (1/3)-approximation algorithm for dalks that is based on the core decomposition of a graph, and that runs in time O(m + n), where n is the number of nodes and m is the number of edges. In contrast, we show that damks is nearly as hard to approximate as the densest k-subgraph problem, for which no good approximation algorithm is known. In particular, we show that if there exists a polynomial time approximation algorithm for damks with approximation ratio γ, then there is a polynomial time approximation algorithm for dks with approximation ratio γ²/8. In the experimental section, we test the algorithm for dalks on large publicly available web graphs. We observe that, in addition to producing near-optimal solutions for dalks, the algorithm also produces near-optimal solutions for dks for nearly all values of k.
Reid Andersen, Kumar Chellapilla
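The core decomposition behind the (1/3)-approximation can be obtained by the classic greedy peeling order: repeatedly delete a minimum-degree vertex. Below is a minimal sketch of that general approach (my paraphrase, not the authors' code) that remembers the densest intermediate subgraph with at least k vertices.

    import heapq

    def densest_at_least_k(adj, k):
        """Greedy peeling: repeatedly delete a minimum-degree vertex
        (the core-decomposition order) and return the intermediate
        subgraph of highest average degree among those with >= k nodes."""
        n = len(adj)
        deg = [len(adj[v]) for v in range(n)]
        m = sum(deg) // 2
        alive = [True] * n
        heap = [(deg[v], v) for v in range(n)]
        heapq.heapify(heap)
        deleted = []
        best_density, best_size = -1.0, n
        size = n
        while size >= k:
            density = 2.0 * m / size  # average degree of remaining graph
            if density > best_density:
                best_density, best_size = density, size
            while True:  # pop the minimum-degree live vertex (lazy deletion)
                d, v = heapq.heappop(heap)
                if alive[v] and d == deg[v]:
                    break
            alive[v] = False
            deleted.append(v)
            m -= deg[v]
            for u in adj[v]:
                if alive[u]:
                    deg[u] -= 1
                    heapq.heappush(heap, (deg[u], u))
            size -= 1
        survivors = set(range(n)) - set(deleted[:n - best_size])
        return survivors, best_density

The whole sweep costs O(m log n) with a heap, and O(m + n) with bucketed degree lists, matching the bound in the abstract.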
The Giant Component in a Random Subgraph of a Given Graph
Abstract
We consider a random subgraph \(G_p\) of a host graph G formed by retaining each edge of G with probability p. We address the question of determining the critical value of p (as a function of G) for which a giant component emerges. Suppose G satisfies some (mild) conditions depending on its spectral gap and higher moments of its degree sequence. We define the second-order average degree \(\tilde{d}\) to be \(\tilde{d}=\sum_v d_v^2/(\sum_v d_v)\), where \(d_v\) denotes the degree of v. We prove that for any ε > 0, if \(p > (1+\epsilon)/\tilde{d}\) then asymptotically almost surely the percolated subgraph \(G_p\) has a giant component. In the other direction, if \(p < (1-\epsilon)/\tilde{d}\) then almost surely the percolated subgraph \(G_p\) contains no giant component.
Fan Chung, Paul Horn, Linyuan Lu
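As a quick numeric companion to the theorem (illustration only), the threshold is simply the reciprocal of the second-order average degree:

    def percolation_threshold(degrees):
        """Computes d~ = sum(d_v^2) / sum(d_v) and the critical
        edge-retention probability 1/d~ from the result above."""
        d_tilde = sum(d * d for d in degrees) / sum(degrees)
        return d_tilde, 1.0 / d_tilde

For a d-regular host graph this gives \(\tilde{d} = d\) and threshold 1/d; skewed degree sequences push \(\tilde{d}\) well above the average degree and thus lower the threshold.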
Quantifying the Impact of Information Aggregation on Complex Networks: A Temporal Perspective
Abstract
Complex networks are a popular and frequently used tool for modeling a variety of entities and their relationships. Understanding these relationships and selecting which data will be used in their analysis is key to a proper characterization. Most current approaches consider all available information for analysis, aggregating it over time. In this work, we study the impact of such aggregation when characterizing complex networks. We model four real complex networks using an extended graph model that enables us to quantify the impact of information aggregation over time. We conclude that data aggregation may distort the characteristics of the underlying real-world network and must be performed carefully.
Fernando Mourão, Leonardo Rocha, Lucas Miranda, Virgílio Almeida, Wagner Meira Jr.

PageRank and Web Graph

A Local Graph Partitioning Algorithm Using Heat Kernel Pagerank
Abstract
We give an improved local partitioning algorithm using heat kernel pagerank, a modified version of PageRank. For a subset S with Cheeger ratio (or conductance) h, we show that at least a quarter of the vertices in S can serve as seeds for heat kernel pagerank leading to local cuts with Cheeger ratio at most \(O(\sqrt{h})\), improving the previous bound by a factor of \(\sqrt{\log|S|}\).
Fan Chung
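For context, heat kernel pagerank with seed distribution f and temperature t is, in Chung's formulation, the exponential analogue of PageRank's geometric sum:

\[ \rho_{t,f} = e^{-t}\sum_{k=0}^{\infty}\frac{t^k}{k!}\, f P^k \]

where P is the random-walk transition matrix; ordinary personalized PageRank instead weights walk lengths geometrically, as \((1-\alpha)\sum_{k\ge 0} \alpha^k f P^k\).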
Choose the Damping, Choose the Ranking?
Abstract
To what extent can changes in PageRank’s damping factor affect node ranking? We prove that, at least on some graphs, the top k nodes assume all possible k! orderings as the damping factor varies, even if it varies within an arbitrarily small interval (e.g. [0.84999, 0.85001]). Thus, the rank of a node for a given (finite set of discrete) damping factor(s) provides very little information about the rank of that node as the damping factor varies over a continuous interval.
We bypass this problem by introducing lineage analysis and proving that there is a simple condition, with a "natural" interpretation independent of PageRank, that allows one to verify "in one shot" whether a node outperforms another simultaneously for all damping factors and all damping variables (informally, time-variant damping factors). The novel notions of strong rank and weak rank of a node provide a measure of the fuzziness of the rank of that node, of the objective orderability of a graph's nodes, and of the quality of results returned by different ranking algorithms based on the random surfer model.
We deploy our analytical tools on a 41M-node snapshot of the .it Web domain and on a 0.7M-node snapshot of the CiteSeer citation graph. Among other findings, we show that rank is indeed relatively stable in both graphs; that "classic" PageRank (d = 0.85) marginally outperforms Weighted In-degree (d→0), mainly due to its ability to ferret out "niche" items; and that, for both the Web and CiteSeer, the ideal damping factor appears to be 0.8–0.9 to obtain those items of high importance to at least one (model of randomly surfing) user, but only 0.5–0.6 to obtain those items important to every (model of randomly surfing) user.
Marco Bressan, Enoch Peserico
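Readers wishing to reproduce such damping-factor sweeps can start from a minimal random-surfer PageRank. The sketch below is the textbook formulation, not the authors' code:

    import numpy as np

    def pagerank(adj, d=0.85, tol=1e-12, max_iter=1000):
        """Power iteration for PageRank with damping factor d.
        adj is an n x n 0/1 array with adj[i, j] = 1 for an edge i -> j;
        dangling nodes are treated as linking uniformly to all nodes."""
        n = adj.shape[0]
        out = adj.sum(axis=1)
        P = np.where(out[:, None] > 0,
                     adj / np.maximum(out, 1)[:, None],
                     1.0 / n)
        r = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            r_next = d * (r @ P) + (1.0 - d) / n
            if np.abs(r_next - r).sum() < tol:
                break
            r = r_next
        return r

Sweeping d over, say, np.linspace(0.5, 0.95, 10) and comparing np.argsort(-pagerank(adj, d)) across values makes the rank instability described above directly visible.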
Characterization of Tail Dependence for In-Degree and PageRank
Abstract
The dependencies between power law parameters such as in-degree and PageRank can be characterized by the so-called angular measure, a notion used in extreme value theory to describe the dependency between very large values of coordinates of a random vector. Based on an analytical stochastic model, we argue that the angular measure for in-degree and personalized PageRank is concentrated in two points. This corresponds to the two main factors for high ranking: large in-degree and a high rank of one of the ancestors. Furthermore, we can formally establish the relative importance of these two factors.
Nelly Litvak, Werner Scheinhardt, Yana Volkovich, Bert Zwart
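For orientation (standard extreme value theory notation, which may differ from the paper's), the angular measure S of a regularly varying random vector (X, Y) captures the limiting direction of its extremes:

\[ \Pr\left(\frac{(X,Y)}{\|(X,Y)\|}\in B \;\middle|\; \|(X,Y)\| > t\right) \longrightarrow S(B) \quad \text{as } t \to \infty. \]

Concentration of S in two points for (in-degree, PageRank) means that the very largest observations line up along two directions, one for each of the two ranking factors named above.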
Web Page Rank Prediction with PCA and EM Clustering
Abstract
In this paper we describe learning algorithms for Web page rank prediction. We consider linear regression models and combinations of regression with probabilistic clustering and Principal Components Analysis (PCA). These models are learned from time-series data sets and can predict the ranking of a set of Web pages at some future time. The first algorithm uses separate linear regression models. This is further extended by applying probabilistic clustering based on the EM algorithm. Clustering allows Web pages to be grouped together by fitting a mixture of regression models. A different method combines linear regression with PCA so that dependencies between different Web pages can be exploited. All the methods are evaluated using real data sets obtained from the Internet Archive, Wikipedia and Yahoo! ranking lists. We also study the temporal robustness of the prediction framework. Overall, the system constitutes a set of tools for high-accuracy PageRank prediction which can be used for efficient resource management by search engines.
Polyxeni Zacharouli, Michalis Titsias, Michalis Vazirgiannis
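The simplest of the paper's ingredients, a per-page linear regressor on the rank time series, can be sketched in a few lines (illustrative only; the paper layers EM clustering and PCA on top of this):

    import numpy as np

    def predict_next_rank(ranks):
        """Fit a least-squares linear trend to one page's rank history
        and extrapolate it one step ahead."""
        t = np.arange(len(ranks), dtype=float)
        slope, intercept = np.polyfit(t, np.asarray(ranks, dtype=float), 1)
        return slope * len(ranks) + intercept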
Permuting Web Graphs
Abstract
Since the first investigations on web graph compression, it has been clear that the ordering of the nodes of the graph has a fundamental influence on the compression rate (usually expressed as the number of bits per link). The author of the LINK database [1], for instance, investigated three different approaches: an extrinsic ordering (URL ordering) and two intrinsic (or coordinate-free) orderings based on the rows of the adjacency matrix (lexicographic and Gray code), and concluded that URL ordering has many advantages in spite of a small penalty in compression. In this paper we approach this issue in a more systematic way, testing some old orderings and proposing some new ones. Our experiments are made in the WebGraph framework [2], and show that the compression technique and the structure of the graph can produce significantly different results. In particular, we show that for the transposed web graph URL ordering is significantly less effective, and that some new orderings combining host information and Gray/lexicographic orderings outperform all previous methods; in some large transposed graphs they yield the quite incredible compression rate of 1 bit per link.
Paolo Boldi, Massimo Santini, Sebastiano Vigna
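To make the intrinsic orderings concrete, the sketch below (illustrative, not the WebGraph implementation) permutes node ids so that adjacency rows, read as bit strings, appear in reflected Gray code order:

    def gray_rank(bits):
        """Position of a bit pattern in the reflected Gray code sequence
        (the inverse of the usual gray(b) = b ^ (b >> 1) mapping)."""
        rank = 0
        while bits:
            rank ^= bits
            bits >>= 1
        return rank

    def gray_order(rows, n):
        """rows[v] is the sorted successor list of node v in a graph on
        n nodes; returns node ids sorted by the Gray code rank of the
        corresponding adjacency-matrix row."""
        def row_bits(row):
            b = 0
            for j in row:
                b |= 1 << (n - 1 - j)  # leftmost column = highest bit
            return b
        return sorted(range(len(rows)),
                      key=lambda v: gray_rank(row_bits(rows[v])))

The intent is that rows adjacent in this order share many bits, which gap- and copy-based encoders can exploit.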

Social Networks and Search

A Dynamic Model for On-Line Social Networks
Abstract
We present a deterministic model for on-line social networks based on transitivity and local knowledge in social interactions. In the Iterated Local Transitivity (ILT) model, at each time-step and for every existing node x, a new node appears which joins to the closed neighbour set of x. The ILT model provably satisfies a number of both local and global properties observed in real-world on-line social and other complex networks, such as a densification power law, decreasing average distance, and higher clustering than in random graphs with the same average degree. Experimental studies of social networks demonstrate poor expansion properties as a consequence of the existence of communities with a low number of inter-community links. A spectral gap for both the adjacency and normalized Laplacian matrices is proved for graphs arising from the ILT model, thereby simulating such bad expansion properties.
Anthony Bonato, Noor Hadi, Paul Horn, Paweł Prałat, Changping Wang
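One ILT step is easy to state in code; here is a minimal sketch of the model as described above:

    def ilt_step(adj):
        """One step of Iterated Local Transitivity: for every existing
        node x, add a clone joined to x and to all of x's neighbours
        (i.e. the closed neighbourhood of x in the pre-step graph)."""
        n = len(adj)
        new_adj = [set(nbrs) for nbrs in adj]
        for x in range(n):
            clone = n + x
            new_adj.append(set())
            for y in list(adj[x]) + [x]:
                new_adj[clone].add(y)
                new_adj[y].add(clone)
        return new_adj

    # Starting from a single edge, each step doubles the node count:
    # g = [{1}, {0}]; g = ilt_step(g)  # 4 nodes, then 8, 16, ...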
TC-SocialRank: Ranking the Social Web
Abstract
Web search is extensively adopted for accessing information on the web, and recent developments in personalized search, social bookmarking and folksonomy systems have tremendously eased the user's path towards the desired information.
In this paper, we discuss TC-SocialRank, a novel link-based algorithm which extends state-of-the-art algorithms for folksonomy systems. The algorithm leverages the importance of users in the social community, the importance of the bookmarks/resources they share, and additional temporal and click information. Temporal information is of primary importance in social bookmarking search, since users continuously post new and fresh information. The importance of this information may decay over time if it is no longer tagged or clicked.
As a case study for testing the effectiveness of TC-SocialRank, we discuss JammingSearch, a novel folksonomy system that unifies web search and social bookmarking by transparently leveraging a Wiki-based collaborative editing system. When an interesting search result is found, a user can share it with the JammingSearch community by simply clicking a button. This information is implicitly tagged with the query submitted to any commodity search engine. Later on, additional tags can be added by the user community. Currently, our system interacts with Ask.com, Google, Microsoft Live, Yahoo!, and AOL.
Antonio Gulli, Stefano Cataudella, Luca Foschini
Exploiting Positive and Negative Graded Relevance Assessments for Content Recommendation
Abstract
Social media allow users to give their opinion about the available content by assigning a rating. Collaborative filtering approaches to predict recommendations based on these graded relevance assessments are hampered by the sparseness of the data. This sparseness problem can be overcome with graph-based models, but current methods are not able to deal with negative relevance assessments.
We propose a new graph-based model that exploits both positive and negative preference data. To this end, we combine into a single content ranking the results from two graphs, one based on positive and the other on negative preference information. The resulting ranking contains fewer false positives than a ranking based on positive information alone. Low ratings, however, appear to have predictive value for relevant content. Discounting the negative information therefore not only removes the irrelevant content from the top of the ranking, but also reduces the recall of relevant documents.
Maarten Clements, Arjen P. de Vries, Marcel J. T. Reinders
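One simple way to realize the "discounting" idea, sketched here as an assumption about the general shape of such models rather than the authors' exact method: score each item on the positive and the negative graph separately and rank by the penalized difference.

    def combined_ranking(pos_score, neg_score, beta=1.0):
        """Rank items by positive-graph relevance discounted by
        negative-graph relevance; beta tunes how strongly low ratings
        push an item down the list. Purely illustrative."""
        items = set(pos_score) | set(neg_score)
        key = lambda i: pos_score.get(i, 0.0) - beta * neg_score.get(i, 0.0)
        return sorted(items, key=key, reverse=True)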
Cluster Based Personalized Search
Abstract
We study personalized web ranking algorithms based on the existence of document clusterings. Motivated by the topic-sensitive PageRank of Haveliwala [20], we develop and implement an efficient "local-cluster" algorithm by extending the web search algorithm of Achlioptas et al. [10]. We propose some formal criteria for evaluating such personalized ranking algorithms and provide some preliminary experiments in support of our analysis. Both theoretically and experimentally, our algorithm differs significantly from Topic-Sensitive PageRank.
Hyun Chul Lee, Allan Borodin
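For reference, the Haveliwala-style scheme motivating this line of work combines per-topic PageRank vectors linearly at query time; in the clustering setting the analogous score (notation mine) is

\[ \mathrm{score}_u(v) = \sum_{c=1}^{C} \Pr(c \mid u)\,\mathrm{PR}_c(v), \]

where \(\mathrm{PR}_c\) is PageRank personalized on the pages of cluster c and \(\Pr(c \mid u)\) is the user's (or query's) affinity to cluster c.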
Backmatter
Metadata
Title
Algorithms and Models for the Web-Graph
Edited by
Konstantin Avrachenkov
Debora Donato
Nelly Litvak
Copyright year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-95995-3
Print ISBN
978-3-540-95994-6
DOI
https://doi.org/10.1007/978-3-540-95995-3
