skip to main content
10.1145/342009.335437acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article
Free Access

Efficient algorithms for mining outliers from large data sets

Authors Info & Claims
Published:16 May 2000Publication History

ABSTRACT

In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.

References

  1. AAR96.A. Arning, Rakesh Agrawal, and R Raghavan. A linear method for deviation detection in large databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Portland, Oregon, August 1996.Google ScholarGoogle Scholar
  2. AMS+95.Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules, chapter 14. 1995.Google ScholarGoogle Scholar
  3. BKNS00.Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jorg Sander. Lof:indetifying density-based local outliers. In Proc. of the ACM SIGMOD Conference on Management of Data, May 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. BKSS90.N. Beckmann, H.-R Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. of ACM SIGMOD, pages 322-331, Atlantic City, NJ, May 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. BL94.V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley and Sons, New York, 1994.Google ScholarGoogle Scholar
  6. EKX95.Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. A database interface for clustering in large spatial databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995.Google ScholarGoogle Scholar
  7. GRS98.Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Cure: An efficient clustering algorithm for large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, June 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. JD88.Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. KN98.Edwin Knorr and Raymond Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. of the VLDB Conference, pages 392-403, New York, USA, September 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. KN99.Edwin Knorr and Raymond Ng. Finding intensional knowledge of distance-based outliers. In Proc. of the VLDB Conference, pages 211-222, Edinburgh, UK, September 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. NH94.Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proc. of the VLDB Conference, Santiago, Chile, September 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. RKV95.N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. of ACM SIGMOD, pages 71-79, San Jose, CA, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. RRS98.Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. Technical report, Bell Laboratories, Murray Hill, 1998.Google ScholarGoogle Scholar
  14. RS98.Rajeev Rastogi and Kyuseok Shim. Public: A decision tree classifier that integrates building and pruning. In Proc. of the Int'l Conf. on Vet7 Large Data Bases, New York, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Sam89.H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. SAM98.S. Sarawagi, R. Agrawal, and N. Megiddo. Discoverydriven exploration of olap data cubes. In Proc. of the Sixth Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, March 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. ZRL96.Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 103-114, Montreal, Canada, June 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient algorithms for mining outliers from large data sets

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data
          May 2000
          604 pages
          ISBN:1581132174
          DOI:10.1145/342009

          Copyright © 2000 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 16 May 2000

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          SIGMOD '00 Paper Acceptance Rate42of248submissions,17%Overall Acceptance Rate785of4,003submissions,20%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader