ABSTRACT
In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.
- AAR96.A. Arning, Rakesh Agrawal, and R Raghavan. A linear method for deviation detection in large databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Portland, Oregon, August 1996.Google Scholar
- AMS+95.Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules, chapter 14. 1995.Google Scholar
- BKNS00.Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jorg Sander. Lof:indetifying density-based local outliers. In Proc. of the ACM SIGMOD Conference on Management of Data, May 2000. Google ScholarDigital Library
- BKSS90.N. Beckmann, H.-R Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. of ACM SIGMOD, pages 322-331, Atlantic City, NJ, May 1990. Google ScholarDigital Library
- BL94.V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley and Sons, New York, 1994.Google Scholar
- EKX95.Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. A database interface for clustering in large spatial databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995.Google Scholar
- GRS98.Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Cure: An efficient clustering algorithm for large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, June 1998. Google ScholarDigital Library
- JD88.Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988. Google ScholarDigital Library
- KN98.Edwin Knorr and Raymond Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. of the VLDB Conference, pages 392-403, New York, USA, September 1998. Google ScholarDigital Library
- KN99.Edwin Knorr and Raymond Ng. Finding intensional knowledge of distance-based outliers. In Proc. of the VLDB Conference, pages 211-222, Edinburgh, UK, September 1999. Google ScholarDigital Library
- NH94.Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proc. of the VLDB Conference, Santiago, Chile, September 1994. Google ScholarDigital Library
- RKV95.N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. of ACM SIGMOD, pages 71-79, San Jose, CA, 1995. Google ScholarDigital Library
- RRS98.Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. Technical report, Bell Laboratories, Murray Hill, 1998.Google Scholar
- RS98.Rajeev Rastogi and Kyuseok Shim. Public: A decision tree classifier that integrates building and pruning. In Proc. of the Int'l Conf. on Vet7 Large Data Bases, New York, 1998. Google ScholarDigital Library
- Sam89.H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989. Google ScholarDigital Library
- SAM98.S. Sarawagi, R. Agrawal, and N. Megiddo. Discoverydriven exploration of olap data cubes. In Proc. of the Sixth Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, March 1998. Google ScholarDigital Library
- ZRL96.Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 103-114, Montreal, Canada, June 1996. Google ScholarDigital Library
Index Terms
- Efficient algorithms for mining outliers from large data sets
Recommendations
Efficient algorithms for mining outliers from large data sets
In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in ...
Double-local rough sets for efficient data mining
Highlights- An efficient rough set model called double-local rough sets is proposed.
- The ...
AbstractAs an important extension of classical rough sets, local rough set model is effective to handle large data sets with small amounts of labeled data, which has an obvious advantage in improving computational performance. However, the ...
Mining top-n local outliers in large databases
KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data miningOutlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. A recent work on outlier detection has introduced a novel notion of local outlier in which the degree to ...
Comments