ABSTRACT
Skewed distributions appear very often in practice. Unfortunately, the traditional Zipf distribution often fails to model them well. In this paper, we propose a new probability distribution, the Discrete Gaussian Exponential (DGX), to achieve excellent fits in a wide variety of settings; our new distribution includes the Zipf distribution as a special case. We present a statistically sound method for estimating the DGX parameters based on maximum likelihood estimation (MLE). We applied DGX to a wide variety of real world data sets, such as sales data from a large retailer chain, us-age data from AT&T, and Internet clickstream data; in all cases, DGX fits these distributions very well, with almost a 99% correlation coefficient in quantile-quantile plots. Our algorithm also scales very well because it requires only a single pass over the data. Finally, we illustrate the power of DGX as a new tool for data mining tasks, such as outlier detection.
- 1.L. Adamic. Zipf, power-laws, and pareto - a ranking tutorial, http://www.parc.xerox-com/istl/groups/iea /papers/ranking/ranking_html.Google Scholar
- 2.B. Berry and W. Garrison, Alternate explanations of urban rank size relations" Annals of the Association of American Geographers, 48:83-91, 1958.Google Scholar
- 3.L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and zipf-like distributions: Evidence and implications. In IEEE Infocom '9g, New York, NY, Mar. 1999.Google Scholar
- 4.A. B. C. Cunha and M. CroveUa. Characteristics of www client-based traces. Tr-95-010, Boston University, April 1995. http://www.cs.bu.edu/groups /oceans/papers/Home.ht ml. Google ScholarDigital Library
- 5.C. Faloutsos and H. Jagadish. On B-tree indices for skewed distributions. In 18th VLDB Conference, pages 363-374, Vancouver, British Columbia, Aug. 23-27 1992. Google ScholarDigital Library
- 6.W. Fox and W. Lasker. The distribution of surname frequencies. International Statistical Review, 51:81-87, 1983.Google ScholarCross Ref
- 7.Galton. The geometric mean in vital and social statistics. Proceedings of the Royal Society of London, 29:365-367, 1879.Google ScholarCross Ref
- 8.S, Glassman. A caching relay for the world wide web. In Proc, of First International Confer, nee on the World Wide Web, CERN, Geneva, Switzerland, May 1994. http://wwwl.cern,ch/WWW94/P reUmProcs .html. Google ScholarDigital Library
- 9.P. Halmos. Random Mms. Annals of Mathematical Statistics, 15:182-189, 1944.Google ScholarCross Ref
- 10.G. Herdan. Small Particle Statistics. Butterworth's, London, 2 edition, 1060.Google Scholar
- 11.B. Hill. The rank-frequency form of zipf's law. Journal of the American Statistical Association, 69(348):1017--1026, 1974.Google ScholarCross Ref
- 12.The MathWorks Inc. Matlab user's guide.Google Scholar
- 13.J. Laherr6re. "parabolic fraetal" distributions in nature, ht tp://www.hubber t peak.com/laherrere /fractal.htm.Google Scholar
- 14.B. Laherrere and D. Sornette. Stretched exponential distributions in nature and economy: "fat tails" with characteristic scales. European Physical Journal, B(2):525-539, 1998.Google Scholar
- 15.S. K. N.I. Johnson and N. Baiakrishnan. Continuous Univaviate Distribution, Volume 1. John Wiley & Sons, Inc., U.S.A, 1994.Google Scholar
- 16.V. Pareto. Cours d'Economie Politique. Rouge and Cie, Lausanne and Paris, 1897.Google Scholar
- 17.H. Simon. On a class of skew distribution functions. Biometrika, 42:425-440, 1955.Google ScholarCross Ref
- 18.M. C. V. Almeida, A. Bestavros and A. de Oliveira. Characterizing reference locality in the www. In Proc. of IBBE International Conference in Parallel and Distributed Information Systems, Miami Beach, Florida, U.S.A., December 1996. http: //www.es.bu.edu /groups /oeeans /papers /Home,html. Google ScholarDigital Library
- 19.G. Yule. A mathematical theory of evolution, based on conclusions of dr. i.e. willis, Lr.s. PhilosophieaI dimnsaetions of the Royal Society of London, 213:21-87, 1923.Google Scholar
- 20.G. Zipf. Human Behavior and Principle of Least Effort: An Introduction to Human Ecology. Addison Wesley, Cambridge, Massachusetts, 1949.Google Scholar
Index Terms
- The "DGX" distribution for mining massive, skewed data
Recommendations
Towards finding the best-fit distribution for OSN data
AbstractCurrently, all online social networks (OSNs) are considered to follow a power-law distribution. In this paper, the degree distribution for multiple OSNs has been studied. It is seen that the degree distributions of OSNs differ moderately from a ...
A skewed truncated t distribution
Skewed symmetric distributions have attracted a great deal of attention in the last few years. One of them, the skewed t distribution suffers from limited applicability because of the lack of finite moments. This note proposes an alternative to the ...
Estimation of parameters for the MarshallOlkin generalized exponential distribution based on complete data
This paper derives some properties of the MarshallOlkin generalized exponential distribution and shows that this distribution is more flexible than the exponentiated exponential distribution. Then we discuss estimation of the distribution parameters by ...
Comments