Abstract
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
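To make the grid-based idea concrete, the following is a minimal sketch of the kind of bottom-up dense-unit search that underlies subspace clustering of this flavor. It is not the authors' implementation: the parameter names `xi` (number of intervals per dimension) and `tau` (density threshold as a fraction of the points) are stand-ins, and the candidate generation shown is a simplified Apriori-style join.

```python
from itertools import combinations

def dense_units(points, xi=4, tau=0.1):
    """Bottom-up search for dense grid units in subspaces (illustrative sketch).

    A unit is a tuple of (dimension, interval) pairs; a unit is dense if the
    fraction of points falling in it is at least tau. Dense units of dimension
    k+1 are generated only from dense units of dimension k (monotonicity).
    """
    n, d = len(points), len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]

    def interval(p, dim):
        # Map a coordinate onto one of xi equal-width intervals in this dimension.
        if hi[dim] == lo[dim]:
            return 0
        k = int((p[dim] - lo[dim]) / (hi[dim] - lo[dim]) * xi)
        return min(k, xi - 1)

    def count(unit):
        # Number of points falling inside every interval of the unit.
        return sum(all(interval(p, dim) == iv for dim, iv in unit) for p in points)

    # Level 1: dense one-dimensional units.
    dense = {1: {((dim, iv),) for dim in range(d) for iv in range(xi)
                 if count(((dim, iv),)) >= tau * n}}
    k = 1
    while dense[k]:
        # Join dense k-dimensional units whose union spans k+1 distinct dimensions.
        cand = set()
        for a, b in combinations(dense[k], 2):
            merged = tuple(sorted(set(a) | set(b)))
            dims = [dim for dim, _ in merged]
            if len(merged) == k + 1 and len(set(dims)) == k + 1:
                cand.add(merged)
        dense[k + 1] = {u for u in cand if count(u) >= tau * n}
        k += 1
    del dense[k]  # drop the final empty level
    return dense
```

For instance, for points that cluster tightly in the first two dimensions, the search reports a dense two-dimensional unit covering the cluster and stops, since no three-dimensional unit can be dense.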
Index Terms
- Automatic subspace clustering of high dimensional data for data mining applications