Skip to main content

2007 | Buch

Algorithms and Models for the Web-Graph

5th International Workshop, WAW 2007, San Diego, CA, USA, December 11-12, 2007. Proceedings

herausgegeben von: Anthony Bonato, Fan R. K. Chung

Verlag: Springer Berlin Heidelberg

Buchreihe : Lecture Notes in Computer Science

insite
SUCHEN

Inhaltsverzeichnis

Frontmatter
Bias Reduction in Traceroute Sampling – Towards a More Accurate Map of the Internet
Abstract
Traceroute sampling is an important technique in exploring the internet router graph and the autonomous system graph. Although it is one of the primary techniques used in calculating statistics about the internet, it can introduce bias that corrupts these estimates. This paper reports on a theoretical and experimental investigation of a new technique to reduce the bias of traceroute sampling when estimating the degree distribution. We develop a new estimator for the degree of a node in a traceroute-sampled graph; validate the estimator theoretically in ER graphs and, through computer experiments, for a wider range of graphs; and apply it to produce a new picture of the degree distribution of the autonomous system graph.
Abraham D. Flaxman, Juan Vera
Distribution of PageRank Mass Among Principle Components of the Web
Abstract
We study the PageRank mass of principal components in a bow-tie Web Graph, as a function of the damping factor c. Using a singular perturbation approach, we show that the PageRank share of IN and SCC components remains high even for very large values of the damping factor, in spite of the fact that it drops to zero when c→1. However, a detailed study of the OUT component reveals the presence of “dead-ends” (small groups of pages linking only to each other) that receive an unfairly high ranking when c is close to one. We argue that this problem can be mitigated by choosing c as small as 1/2.
Konstantin Avrachenkov, Nelly Litvak, Kim Son Pham
Finding a Dense-Core in Jellyfish Graphs
Abstract
The connectivity of the Internet crucially depends on the relationships between thousands of Autonomous Systems (ASes) that exchange routing information using the Border Gateway Protocol (BGP). These relationships can be modeled as a graph, called the AS-graph, in which the vertices model the ASes, and the edges model the peering arrangements between the ASes. Based on topological studies, it is widely believed that the Internet graph contains a central dense-core: Informally, this is a small set of high-degree, tightly interconnected ASes that participate in a large fraction of end-to-end routes. Finding this dense-core is a very important practical task when analyzing the Internet’s topology.
In this work we introduce a randomized sublinear algorithm that finds a dense-core of the AS-graph. We mathematically prove the correctness of our algorithm, bound the density of the core it returns, and analyze its running time. We also implemented our algorithm and tested it on real AS-graph data. Our results show that the core discovered by our algorithm is nearly identical to the cores found by existing algorithms - at a fraction of the running time.
Mira Gonen, Dana Ron, Udi Weinsberg, Avishai Wool
A Geometric Preferential Attachment Model of Networks II
Abstract
A detailed understanding of expansion in complex networks can greatly aid in the design and analysis of algorithms for a variety of important network tasks, including routing messages, ranking nodes, and compressing graphs. This has motivated several recent investigations of expansion properties in real-world graphs and also in random models of real-world graphs, like the preferential attachment graph. The results point to a gap between real-world observations and theoretical models. Some real-world graphs are expanders and others are not, but a graph generated by the preferential attachment model is an expander whp .
We study a random graph G n that combines certain aspects of geometric random graphs and preferential attachment graphs. This model yields a graph with power-law degree distribution where the expansion property depends on a tunable parameter of the model.
The vertices of G n are n sequentially generated points x 1,x 2,...,x n chosen uniformly at random from the unit sphere in https://static-content.springer.com/image/chp%3A10.1007%2F978-3-540-77004-6_4/MediaObjects/978-3-540-77004-6_4_IEq1_HTML.png . After generating x t , we randomly connect it to m points from those points in x 1,x 2,...,x t − 1 ....
Abraham D. Flaxman, Alan M. Frieze, Juan Vera
Clustering Social Networks
Abstract
Social networks are ubiquitous. The discovery of close-knit clusters in these networks is of fundamental and practical interest. Existing clustering criteria are limited in that clusters typically do not overlap, all vertices are clustered and/or external sparsity is ignored. We introduce a new criterion that overcomes these limitations by combining internal density with external sparsity in a natural way. An algorithm is given for provably finding the clusters, provided there is a sufficiently large gap between internal density and external sparsity. Experiments on real social networks illustrate the effectiveness of the algorithm.
Nina Mishra, Robert Schreiber, Isabelle Stanton, Robert E. Tarjan
Manipulation-Resistant Reputations Using Hitting Time
Abstract
Popular reputation systems for linked networks can be manipulated by spammers who strategically place links. The reputation of node v is interpreted as the world’s opinion of v’s importance. In PageRank [4], v’s own opinion can be seen to have considerable influence on her reputation, where v expresses a high opinion of herself by participating in short directed cycles. In contrast, we show that expected hitting time — the time to reach v in a random walk — measures essentially the same quantity as PageRank, but excludes v’s opinion. We make these notions precise, and show that a reputation system based on hitting time resists tampering by individuals or groups who strategically place outlinks. We also present an algorithm to efficiently compute hitting time for all nodes in a massive graph; conventional algorithms do not scale adequately.
John Hopcroft, Daniel Sheldon
Using Polynomial Chaos to Compute the Influence of Multiple Random Surfers in the PageRank Model
Abstract
The PageRank equation computes the importance of pages in a web graph relative to a single random surfer with a constant teleportation coefficient. To be globally relevant, the teleportation coefficient should account for the influence of all users. Therefore, we correct the PageRank formulation by modeling the teleportation coefficient as a random variable distributed according to user behavior. With this correction, the PageRank values themselves become random. We present two methods to quantify the uncertainty in the random PageRank: a Monte Carlo sampling algorithm and an algorithm based the truncated polynomial chaos expansion of the random quantities. With each of these methods, we compute the expectation and standard deviation of the PageRanks. Our statistical analysis shows that the standard deviation of the PageRanks are uncorrelated with the PageRank vector.
Paul G. Constantine, David F. Gleich
A Spatial Web Graph Model with Local Influence Regions
Abstract
The web graph may be considered as embedded in a topic space, with a metric that expresses the extent to which web pages are related to each other. Using this assumption, we present a new model for the web and other complex networks, based on a spatial embedding of the nodes, called the Spatial Preferred Attachment (SPA) model. In the SPA model, nodes have influence regions of varying size, and new nodes may only link to a node if they fall within its influence region. We prove that our model gives a power law in-degree distribution, with exponent in [2, ∞ ) depending on the parameters, and with concentration for a wide range of in-degree values. We also show that the model allows for edges that span a large distance in the underlying space, modelling a feature often observed in real-world complex networks.
W. Aiello, A. Bonato, C. Cooper, J. Janssen, P. Prałat
Determining Factors Behind the PageRank Log-Log Plot
Abstract
We study the relation between PageRank and other parameters of information networks such as in-degree, out-degree, and the fraction of dangling nodes. We model this relation through a stochastic equation inspired by the original definition of PageRank. Further, we use the theory of regular variation to prove that PageRank and in-degree follow power laws with the same exponent. The difference between these two power laws is in a multiplicative constant, which depends mainly on the fraction of dangling nodes, average in-degree, the power law exponent, and the damping factor. The out-degree distribution has a minor effect, which we explicitly quantify. Finally, we propose a ranking scheme which does not depend on out-degrees.
Yana Volkovich, Nelly Litvak, Debora Donato
Approximating Betweenness Centrality
Abstract
Betweenness is a centrality measure based on shortest paths, widely used in complex network analysis. It is computationally-expensive to exactly determine betweenness; currently the fastest-known algorithm by Brandes requires O(nm) time for unweighted graphs and O(nm + n 2logn) time for weighted graphs, where n is the number of vertices and m is the number of edges in the network. These are also the worst-case time bounds for computing the betweenness score of a single vertex. In this paper, we present a novel approximation algorithm for computing betweenness centrality of a given vertex, for both weighted and unweighted graphs. Our approximation algorithm is based on an adaptive sampling technique that significantly reduces the number of single-source shortest path computations for vertices with high centrality. We conduct an extensive experimental study on real-world graph instances, and observe that our random sampling algorithm gives very good betweenness approximations for biological networks, road networks and web crawls.
David A. Bader, Shiva Kintali, Kamesh Madduri, Milena Mihail
Random Dot Product Graph Models for Social Networks
Abstract
Inspired by the recent interest in combining geometry with random graph models, we explore in this paper two generalizations of the random dot product graph model proposed by Kraetzl, Nickel and Scheinerman, and Tucker [1,2]. In particular we consider the properties of clustering, diameter and degree distribution with respect to these models. Additionally we explore the conductance of these models and show that in a geometric sense, the conductance is constant.
Stephen J. Young, Edward R. Scheinerman
Local Computation of PageRank Contributions
Abstract
Motivated by the problem of detecting link-spam, we consider the following graph-theoretic primitive: Given a webgraph G, a vertex v in G, and a parameter δ ∈ (0,1), compute the set of all vertices that contribute to v at least a δ fraction of v’s PageRank. We call this set the δ-contributing set of v. To this end, we define the contribution vector of v to be the vector whose entries measure the contributions of every vertex to the PageRank of v. A local algorithm is one that produces a solution by adaptively examining only a small portion of the input graph near a specified vertex. We give an efficient local algorithm that computes an ε-approximation of the contribution vector for a given vertex by adaptively examining O(1/ε) vertices. Using this algorithm, we give a local approximation algorithm for the primitive defined above. Specifically, we give an algorithm that returns a set containing the δ-contributing set of v and at most O(1/δ) vertices from the δ/2-contributing set of v, and which does so by examining at most O(1/δ) vertices. We also give a local algorithm for solving the following problem: If there exist k vertices that contribute a ρ-fraction to the PageRank of v, find a set of k vertices that contribute at least a (ρ − ε)-fraction to the PageRank of v. In this case, we prove that our algorithm examines at most O(k/ε) vertices.
Reid Andersen, Christian Borgs, Jennifer Chayes, John Hopcraft, Vahab S. Mirrokni, Shang-Hua Teng
Local Partitioning for Directed Graphs Using PageRank
Abstract
A local partitioning algorithm finds a set with small conductance near a specified seed vertex. In this paper, we present a generalization of a local partitioning algorithm for undirected graphs to strongly connected directed graphs. In particular, we prove that by computing a personalized PageRank vector in a directed graph, starting from a single seed vertex within a set S that has conductance at most α, and by performing a sweep over that vector, we can obtain a set of vertices S′ with conductance \(\Phi_{M}(S')= O(\sqrt{\alpha \log |S|})\). Here, the conductance function Φ M is defined in terms of the stationary distribution of a random walk in the directed graph. In addition, we describe how this algorithm may be applied to the PageRank Markov chain of an arbitrary directed graph, which provides a way to partition directed graphs that are not strongly connected.
Reid Andersen, Fan Chung, Kevin Lang
Stochastic Kronecker Graphs
Abstract
A random graph model based on Kronecker products of probability matrices has been recently proposed as a generative model for large-scale real-world networks such as the web. This model simultaneously captures several well-known properties of real-world networks; in particular, it gives rise to a heavy-tailed degree distribution, has a low diameter, and obeys the densification power law. Most properties of Kronecker products of graphs (such as connectivity and diameter) are only rigorously analyzed in the deterministic case. In this paper, we study the basic properties of stochastic Kronecker products based on an initiator matrix of size two (which is the case that is shown to provide the best fit to many real-world networks). We will show a phase transition for the emergence of the giant component and another phase transition for connectivity, and prove that such graphs have constant diameters beyond the connectivity threshold, but are not searchable using a decentralized algorithm.
Mohammad Mahdian, Ying Xu
Deterministic Decentralized Search in Random Graphs
Abstract
We study a general framework for decentralized search in random graphs. Our main focus is on deterministic memoryless search algorithms that use only local information to reach their destination in a bounded number of steps in expectation. This class includes (with small modifications) the search algorithms used in Kleinberg’s pioneering work on long-range percolation graphs and hierarchical network models. We give a characterization of searchable graphs in this model, and use this characterization to prove a monotonicity property for searchability.
Esteban Arcaute, Ning Chen, Ravi Kumar, David Liben-Nowell, Mohammad Mahdian, Hamid Nazerzadeh, Ying Xu
Using Bloom Filters to Speed Up HITS-Like Ranking Algorithms
Abstract
This paper describes a technique for reducing the query-time cost of HITS-like ranking algorithm. The basic idea is to compute for each node in the web graph a summary of its immediate neighborhood (which is a query-independent operation and thus can be done off-line), and to approximate the neighborhood graph of a result set at query-time by combining the summaries of the result set nodes. This approximation of the query-specific neighborhood graph can then be used to perform query-dependent link-based ranking algorithms such as HITS and SALSA. We have evaluated our technique on a large web graph and a substantial set of queries with partially judged results, and found that its effectiveness (retrieval performance) is comparable to the original SALSA algorithm, while its efficiency (query-time speed) is substantially higher.
Sreenivas Gollapudi, Marc Najork, Rina Panigrahy
Parallelizing the Computation of PageRank
Abstract
This paper presents a technique we call ParaSolve that exploits the sparsity structure of the web graph matrix to improve on the degree of parallelism in a number of distributed approaches for computating PageRank. Specifically, a typical algorithm (such as power iteration or GMRES) for solving the linear system corresponding to PageRank, call it LinearSolve, may be converted to a distributed algorithm, Distrib(LinearSolve), by partitioning the problem and applying a standard technique (i.e., Distrib). By reducing the number of inter-partition multiplications, we may greatly increase the degree of parallelism, while achieving a similar degree of accuracy. This should lead to increasingly better performance as we utilize more processors. For example, using GeoSolve (a variant of Jacobi iteration) as our linear solver and the 2001 web graph from Stanford’s WebBase project, on 12 processors ParaSolve(GeoSolve) outperforms Distrib(GeoSolve) by a factor of 1.4, while on 32 processors the performance ratio improves to 2.8.
John Wicks, Amy Greenwald
Giant Component and Connectivity in Geographical Threshold Graphs
Abstract
The geographical threshold graph model is a random graph model with nodes distributed in a Euclidean space and edges assigned through a function of distance and node weights. We study this model and give conditions for the absence and existence of the giant component, as well as for connectivity.
Milan Bradonjić, Aric Hagberg, Allon G. Percus
Backmatter
Metadaten
Titel
Algorithms and Models for the Web-Graph
herausgegeben von
Anthony Bonato
Fan R. K. Chung
Copyright-Jahr
2007
Verlag
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-77004-6
Print ISBN
978-3-540-77003-9
DOI
https://doi.org/10.1007/978-3-540-77004-6