Efficient algorithms for mining outliers from large data sets

Authors:
Sridhar Ramaswamy, Epiphany Inc., Palo Alto, CA
Rajeev Rastogi, Bell Laboratories, Murray Hill, NJ
Kyuseok Shim, Korea Advanced Institute of Science and Technology and Advanced Information Technology Research Center at KAIST, Taejon, Korea

SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data, May 2000, Pages 427–438
https://doi.org/10.1145/342009.335437

Published: 16 May 2000

ABSTRACT

In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its k-th nearest neighbor. We rank each point by this distance and declare the top n points in the ranking to be outliers. In addition to developing relatively straightforward solutions for finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it determines that they cannot contain outliers, which yields substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.
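The ranking described in the abstract can be illustrated with a minimal sketch. It follows the simple nested-loop idea only (every point is compared against every other point) and does not attempt the partition-based pruning. The function names `kth_nn_distance` and `top_n_outliers`, the use of Euclidean distance, and the sample points are assumptions made for illustration; they are not taken from the paper.

```python
import heapq
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (itself excluded)."""
    # Assumption: Euclidean distance; the abstract does not fix a metric.
    dists = [math.dist(points[i], q) for j, q in enumerate(points) if j != i]
    # The k smallest distances, ascending; the last one is the k-th nearest.
    return heapq.nsmallest(k, dists)[-1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th-nearest-neighbor distance."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    # Nested-loop scoring: O(N^2) distance computations in total.
    scored = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    # Rank by score, largest first, and keep the top n indices.
    return [i for _, i in heapq.nlargest(n, scored)]


if __name__ == "__main__":
    # Hypothetical data: four clustered points and one far-away point.
    sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
    print(top_n_outliers(sample, k=2, n=1))
```

On the sample data the isolated point at index 4 has by far the largest distance to its second-nearest neighbor, so it is the one reported. The quadratic cost of this baseline is the kind of expense that the partition-based algorithm is designed to reduce by discarding whole partitions early.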

Index Terms

Information systems

Published in
SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data
May 2000, 604 pages
ISBN: 1581132174
DOI: 10.1145/342009
Chairmen: Maggie Dunham (Southern Methodist Univ.), Jeffrey F. Naughton (Univ. of Wisconsin-Madison), Weidong Chen (Southern Methodist Univ.), Nick Koudas (AT&T Labs)

ACM SIGMOD Record, Volume 29, Issue 2
June 2000, 609 pages
ISSN: 0163-5808
DOI: 10.1145/335191
Editors: Weidong Chen (Southern Methodist Univ., Dallas, TX), Jeffrey Naughton (Univ. of Wisconsin-Madison, Madison), Philip A. Bernstein (Microsoft)

Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States

Acceptance Rates
SIGMOD '00 Paper Acceptance Rate: 42 of 248 submissions, 17%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%