Lognormal and Pareto distributions in the Internet

https://doi.org/10.1016/j.comcom.2004.11.001

Abstract

Numerous studies have reported long-tailed distributions for various network metrics, including file sizes, transfer times, and burst lengths. We review techniques for identifying long-tailed distributions based on a sample, propose a new technique, and apply these methods to datasets used in previous reports. We find that the evidence for long tails is inconsistent, and that lognormal and other non-long-tailed models are usually sufficient to characterize network metrics. We discuss the implications of this result for current explanations of self-similarity in network traffic.

Introduction

Researchers have reported traffic patterns in the Internet that show characteristics of self-similarity (see [1] for a survey). Many proposed explanations of this phenomenon assume that the distribution of transfer times in the network is long-tailed [2], [3], [4], [5]. That assumption, in turn, rests on the premise that the distribution of file sizes is long-tailed [6], [7].

We examine these assumptions, looking at data from a variety of systems, including many of the datasets originally presented as evidence of long-tailed distributions.

Section 2 evaluates existing methods for identifying long-tailed distributions, and proposes a new statistical method for classifying distributions. Section 3 applies this methodology to empirical distributions of file sizes from a variety of systems. We find that the distribution of file sizes tends to be lognormal, in local file systems and in the World Wide Web. This tendency is strongest in large datasets that aggregate many file systems.

Section 4 discusses the implications of this result for current explanations of self-similarity, and presents alternative explanations. The remaining sections evaluate these alternatives by examining the distributions of interarrival times (Section 5), transfer times (Section 6) and burst durations (Section 7).

We find that there is little evidence that the distribution of interarrival times is long-tailed. Similarly, there is only ambiguous support for long-tailed transfer times. On the other hand, there is some evidence that bursts of file transfers in both ftp and HTTP are long-tailed. We investigate this possibility and its causes.

Section snippets

Methodology

A fundamental problem in this area of inquiry is the lack of methodology for identifying a long-tailed distribution based on a sample. For explanatory models of self-similarity, the relevant definition of ‘long-tailed’ is a distribution with polynomial tail behavior; that is, $P\{X > x\} \sim c x^{-\alpha}$ as $x \to \infty$, where X is a random variable, c is a location parameter, and α is a shape parameter. When α is less than 2, the distribution has infinite variance, which is also required for these models to produce

File sizes

In this section we survey prior studies that have looked at measured file sizes and presented evidence that the distribution is long-tailed.

Self-similar network traffic

Many current explanations of self-similarity in the Internet are based on the assumption that some network metric—either transfer times, interarrival times, or burst sizes—is long-tailed.

One of these explanatory models is an M/G/∞ queue in which network transfers are customers and the network is an infinite-server system [2], [3]. If the distribution of service times is long-tailed, then the number of customers in the system is an asymptotically self-similar process.
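As a rough illustration of the M/G/∞ model described above, the following Python sketch simulates Poisson arrivals with Pareto-distributed service times and counts the transfers in progress at each integer time. The function names, the Pareto scale `xm = 1`, the sampling at integer times, and the parameter values in the usage note are assumptions made for illustration; they are not taken from [2] or [3].

```python
import math
import random

def pareto_sample(rng, alpha, xm=1.0):
    """Draw from a Pareto distribution with P{X > x} = (xm / x)**alpha for x >= xm."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so the division below is safe
    return xm / u ** (1.0 / alpha)

def simulate_mg_inf(rate, alpha, horizon, seed=0):
    """Return the number of active transfers at each integer time in [0, horizon).

    Arrivals form a Poisson process with the given rate; every arrival is
    served immediately (infinite servers) for a Pareto-distributed duration.
    The system starts empty, so the early part of the series is a transient.
    """
    rng = random.Random(seed)
    counts = [0] * horizon
    t = rng.expovariate(rate)
    while t < horizon:
        end = t + pareto_sample(rng, alpha)
        # The transfer is active at integer times k with t <= k < end.
        first = math.ceil(t)
        last = min(horizon, math.ceil(end))
        for k in range(first, last):
            counts[k] += 1
        t += rng.expovariate(rate)
    return counts
```

For example, `simulate_mg_inf(rate=2.0, alpha=1.5, horizon=1000)` produces an occupancy series for a long-tailed service distribution with infinite variance (α < 2); rerunning with a larger α gives a series to compare against.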

Willinger et al. [4] propose

Interarrival times

Paxson and Floyd [2] measure the distribution of interarrival times for packets within Telnet connections, and report that “the main body of the observed distribution fits very well to a Pareto distribution… with shape parameter 0.9, and the upper 3% tail to a Pareto distribution with [shape parameter] 0.95.” They do not show the ccdf or explain how they chose these parameters.
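To make the role of the ccdf concrete, here is a minimal Python sketch that computes an empirical ccdf from a sample and estimates the slope of its upper tail on log-log axes. For a Pareto distribution with shape parameter α, that slope is approximately −α. This is an illustration only: the function names, the least-squares fit, and the default 3% tail cutoff are assumptions, and it is not the procedure used by Paxson and Floyd.

```python
import math

def empirical_ccdf(samples):
    """Return (x, p) pairs where p is the fraction of samples strictly greater than x."""
    xs = sorted(samples)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        # Keep only the last copy of a repeated value so each x appears once.
        if i + 1 < n and xs[i + 1] == x:
            continue
        points.append((x, (n - i - 1) / n))
    return points

def loglog_tail_slope(points, tail_fraction=0.03):
    """Least-squares slope of log10(ccdf) against log10(x) over the upper tail.

    Only points with 0 < p <= tail_fraction and x > 0 are used, because
    neither zero probabilities nor non-positive values have a logarithm.
    """
    tail = [(math.log10(x), math.log10(p))
            for x, p in points if x > 0 and 0 < p <= tail_fraction]
    if len(tail) < 2:
        raise ValueError("not enough tail points to fit a slope")
    mean_x = sum(lx for lx, _ in tail) / len(tail)
    mean_y = sum(ly for _, ly in tail) / len(tail)
    sxx = sum((lx - mean_x) ** 2 for lx, _ in tail)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in tail)
    return sxy / sxx
```

A straight-line fit to the extreme tail rests on few observations and is sensitive to the cutoff, so the estimated slope is better read alongside a plot of the full ccdf than on its own.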

This claim is based on traces collected at Lawrence Berkeley Labs during 1-h intervals in December 1993 and January

Transfer times

Even if file sizes are not long-tailed, transfer times might be. The performance of wide-area networks is highly variable in time; it is possible that this variability causes long-tailed transfer times. In this section, we investigate the relationship between file sizes and transfer times for HTTP and ftp transfers.

Transfer bursts

The motivation for investigating the sizes of transfer bursts is that ON periods in the ON/OFF model might correspond not to individual file transfers, but to periods of network activity interrupted only by network delays and short intervals between files. From the network's point of view, there is no difference between a delay caused by a TCP timeout and a delay with the same duration caused by user activity or processing delays.
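One way to picture this notion of a burst is to merge transfers that are separated by no more than a chosen gap. The Python sketch below does that for a list of (start, end) times. The function name and the `max_gap` threshold are hypothetical; the text above does not specify how short an interval must be to count as part of the same burst.

```python
def group_into_bursts(transfers, max_gap):
    """Merge (start, end) transfers into bursts.

    Two consecutive transfers belong to the same burst when the idle time
    between them is at most max_gap; overlapping transfers always merge.
    Returns a list of (burst_start, burst_end) tuples in time order.
    """
    bursts = []
    for start, end in sorted(transfers):
        if bursts and start - bursts[-1][1] <= max_gap:
            # Extend the current burst; keep the later of the two end times.
            bursts[-1][1] = max(bursts[-1][1], end)
        else:
            bursts.append([start, end])
    return [(s, e) for s, e in bursts]
```

With `transfers = [(0, 2), (2.5, 4), (10, 11), (10.5, 12), (20, 21)]` and `max_gap = 1`, the result is `[(0, 4), (10, 12), (20, 21)]`, so the burst durations are 4, 2 and 1. The resulting durations depend directly on the threshold chosen.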

Conclusions

We have reviewed techniques for identifying long-tailed distributions, and applied them to datasets that have been reported as long-tailed. Unfortunately, no single test is sufficient to provide convincing evidence of a long-tailed distribution. Looking at previous claims for long-tailed distributions, we find that some are not well supported by the evidence. In other cases, the evidence is ambiguous.

  • In our review of published observations, we did not find compelling evidence that the

Acknowledgements

Thanks to Mark Crovella (Boston University), Vern Paxson (ICIR) and Carey Williamson (University of Saskatchewan) for making their datasets available on the Web; Martin Arlitt (Hewlett-Packard) for providing processed data from the datasets he collected; Gordon Irlam for his survey of file sizes, and John Douceur (Microsoft Research) for sending me the Microsoft dataset. Also, many thanks to Joachim Charzinski (Siemens AG) for providing the burst lengths from his traces.

Thanks to Kim Claffy and

References (24)

  • J. Charzinski, HTTP/TCP connection and flow characteristics, Performance Evaluation (2000)
  • K. Park et al., Self-similar Network Traffic: An Overview (2000)
  • V. Paxson et al., Wide-area traffic: the failure of Poisson modeling, IEEE/ACM Transactions on Networking (1995)
  • M. Parulekar, A.M. Makowski, M/G/∞ input process: a versatile class of models for network traffic, Tech. Rep. T.R....
  • W. Willinger et al., Self-similarity through high-variability: statistical analysis of Ethernet LAN traffic at the source level (1995)
  • A. Feldmann et al., Dynamics of IP traffic: a study of the role of variability and the impact of control (1999)
  • K. Park, G. Kim, M.E. Crovella, On the relationship between file sizes, transport protocols, and self-similar network...
  • M.E. Crovella et al., Heavy-tailed probability distributions in the World Wide Web (1998)
  • M.E. Crovella et al., Estimating the heavy tail index from scaling properties, Methodology and Computing in Applied Probability (1999)
  • M.F. Arlitt et al., Web server workload characterization: the search for invariants (1996)
  • M. Arlitt, T. Jin, Workload characterization of the 1998 World Cup Web site, Tech. Rep. HPL-1999-35R1, Hewlett-Packard,...
  • M. Arlitt, R. Friedrich, T. Jin, Workload characterization of a Web proxy in a cable modem environment, ACM Sigmetrics...