skip to main content
10.1145/502512.502521acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

The "DGX" distribution for mining massive, skewed data

Published:26 August 2001Publication History

ABSTRACT

Skewed distributions appear very often in practice. Unfortunately, the traditional Zipf distribution often fails to model them well. In this paper, we propose a new probability distribution, the Discrete Gaussian Exponential (DGX), to achieve excellent fits in a wide variety of settings; our new distribution includes the Zipf distribution as a special case. We present a statistically sound method for estimating the DGX parameters based on maximum likelihood estimation (MLE). We applied DGX to a wide variety of real world data sets, such as sales data from a large retailer chain, us-age data from AT&T, and Internet clickstream data; in all cases, DGX fits these distributions very well, with almost a 99% correlation coefficient in quantile-quantile plots. Our algorithm also scales very well because it requires only a single pass over the data. Finally, we illustrate the power of DGX as a new tool for data mining tasks, such as outlier detection.

References

  1. 1.L. Adamic. Zipf, power-laws, and pareto - a ranking tutorial, http://www.parc.xerox-com/istl/groups/iea /papers/ranking/ranking_html.Google ScholarGoogle Scholar
  2. 2.B. Berry and W. Garrison, Alternate explanations of urban rank size relations" Annals of the Association of American Geographers, 48:83-91, 1958.Google ScholarGoogle Scholar
  3. 3.L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and zipf-like distributions: Evidence and implications. In IEEE Infocom '9g, New York, NY, Mar. 1999.Google ScholarGoogle Scholar
  4. 4.A. B. C. Cunha and M. CroveUa. Characteristics of www client-based traces. Tr-95-010, Boston University, April 1995. http://www.cs.bu.edu/groups /oceans/papers/Home.ht ml. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. 5.C. Faloutsos and H. Jagadish. On B-tree indices for skewed distributions. In 18th VLDB Conference, pages 363-374, Vancouver, British Columbia, Aug. 23-27 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. 6.W. Fox and W. Lasker. The distribution of surname frequencies. International Statistical Review, 51:81-87, 1983.Google ScholarGoogle ScholarCross RefCross Ref
  7. 7.Galton. The geometric mean in vital and social statistics. Proceedings of the Royal Society of London, 29:365-367, 1879.Google ScholarGoogle ScholarCross RefCross Ref
  8. 8.S, Glassman. A caching relay for the world wide web. In Proc, of First International Confer, nee on the World Wide Web, CERN, Geneva, Switzerland, May 1994. http://wwwl.cern,ch/WWW94/P reUmProcs .html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9.P. Halmos. Random Mms. Annals of Mathematical Statistics, 15:182-189, 1944.Google ScholarGoogle ScholarCross RefCross Ref
  10. 10.G. Herdan. Small Particle Statistics. Butterworth's, London, 2 edition, 1060.Google ScholarGoogle Scholar
  11. 11.B. Hill. The rank-frequency form of zipf's law. Journal of the American Statistical Association, 69(348):1017--1026, 1974.Google ScholarGoogle ScholarCross RefCross Ref
  12. 12.The MathWorks Inc. Matlab user's guide.Google ScholarGoogle Scholar
  13. 13.J. Laherr6re. "parabolic fraetal" distributions in nature, ht tp://www.hubber t peak.com/laherrere /fractal.htm.Google ScholarGoogle Scholar
  14. 14.B. Laherrere and D. Sornette. Stretched exponential distributions in nature and economy: "fat tails" with characteristic scales. European Physical Journal, B(2):525-539, 1998.Google ScholarGoogle Scholar
  15. 15.S. K. N.I. Johnson and N. Baiakrishnan. Continuous Univaviate Distribution, Volume 1. John Wiley & Sons, Inc., U.S.A, 1994.Google ScholarGoogle Scholar
  16. 16.V. Pareto. Cours d'Economie Politique. Rouge and Cie, Lausanne and Paris, 1897.Google ScholarGoogle Scholar
  17. 17.H. Simon. On a class of skew distribution functions. Biometrika, 42:425-440, 1955.Google ScholarGoogle ScholarCross RefCross Ref
  18. 18.M. C. V. Almeida, A. Bestavros and A. de Oliveira. Characterizing reference locality in the www. In Proc. of IBBE International Conference in Parallel and Distributed Information Systems, Miami Beach, Florida, U.S.A., December 1996. http: //www.es.bu.edu /groups /oeeans /papers /Home,html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. 19.G. Yule. A mathematical theory of evolution, based on conclusions of dr. i.e. willis, Lr.s. PhilosophieaI dimnsaetions of the Royal Society of London, 213:21-87, 1923.Google ScholarGoogle Scholar
  20. 20.G. Zipf. Human Behavior and Principle of Least Effort: An Introduction to Human Ecology. Addison Wesley, Cambridge, Massachusetts, 1949.Google ScholarGoogle Scholar

Index Terms

  1. The "DGX" distribution for mining massive, skewed data

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
            August 2001
            493 pages
            ISBN:158113391X
            DOI:10.1145/502512

            Copyright © 2001 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 26 August 2001

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            KDD '01 Paper Acceptance Rate31of237submissions,13%Overall Acceptance Rate1,133of8,635submissions,13%

            Upcoming Conference

            KDD '24

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader