skip to main content
article
Free Access

Automatic subspace clustering of high dimensional data for data mining applications

Authors Info & Claims
Published:01 June 1998Publication History
Skip Abstract Section

Abstract

Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate cluster in large high dimensional datasets.

References

  1. 1 R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast Discovery of Association Rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307-328. AAAI/MIT Press, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. 2 A. Aho, 2. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Welsley, 1974.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. 3 P. Arabic and L. J. Hubert. An overview of combinatorial data analyis, in P. Arabic, L. Hubert, and G. D. Sorts, editors, Clustering and Classification, pages 5-63. World Scientific Pub., New Jersey, 1996.]]Google ScholarGoogle ScholarCross RefCross Ref
  4. 4 Arbor Software Corporation. Application Manager User's Guide, Essbase Version 4.0 edition.]]Google ScholarGoogle Scholar
  5. 5 R. Bayardo. Efficiently mining long patterns from databases. of Data, Seattle, Washington, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. 6 S. Berchtold, C. Bohm, D. Keim, and H.-P. Kriegel. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings o} the 16th Symposium on Principles of Database Systems (PODS), pages 78-86, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. 7 M. Berger and I. Regoutsos. An algorithm for point clustering and grid generation. IEEE Transactions on Systems, Man and Cybernetics, 21(5):1278-86, 1991.]]Google ScholarGoogle ScholarCross RefCross Ref
  8. 8 S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the A CM SIGMOD Conference on Management of Da~a, May 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9 P. Cheeseman and J. Stuff.. Bayesian classification (autoclass): Theory and results. In U. M. Fayyad, O. Piatetsky- Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discove~'y and Data Mining, chapter 6, pages 153-180. AAAI/MIT Press, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. 10 R. Chhikara and D. Register. A numerical classification method for partitioning of a large multidimensional mixed data set. Technometrics, 21:531-537, 1979.]]Google ScholarGoogle ScholarCross RefCross Ref
  11. 11 R. O. Duds and P. E. Hart. Pattern Classification and Scene Analysis. john Wiley and Sons, 1973.]]Google ScholarGoogle Scholar
  12. 12 R. J. Earle. Method and apparatus for storing and retrieving multi-dimensional data in computer memory. U.S. Patent No. 5359724, October 1994.]]Google ScholarGoogle Scholar
  13. 13 M. Ester, H.-P. Kriegel, 2. Sander, and X. Xu. A densitybased algorithm for discovering clusters in large spatial databases with noise, in Proc. of the ~nd Int'l Cort/erenee on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.]]Google ScholarGoogle Scholar
  14. 14 M. Ester, H.-P. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In Proc. of the I st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. 16 U. Feige. A threshold of In n for approximating set cover. In Proceedings of the Twenty-Eighth Annual A CM Symposium on Theory of Computing, pages 314-318, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. 17 D. Franzblau. Performance guarantees on a sweep-line heuristic for covering rectilinear polygons with rectangles. SIAM Y. Disc. Math, 2:307-321, 3 (1989).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. 18 J. Friedman. Optimizing a noisy function of many variables with application to data mining. In UW/MSR Summer Research Institute in Data Mining, July 1997.]]Google ScholarGoogle Scholar
  19. 19 K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. 20 D. Gunopulos, R. Khardon, H. Mannila, and S. Saluja. Data mining, hypergraph transversals, and machine learning. In P~oc. of the 16th ACId Syrup. on Principles of Database Systems, pages 209-216, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. 21 C.-T. Ho, Ft. Agrawal, N. Megiddo, and R. Srikant. Range queries in OLAP data cubes. In Proc. o~ ~h,e A UM SIGMOD Conference on Management o~ Data, Tucson, Arizona, May 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. 22 S. J. Hong. MINI: A heuristic algorithm for two-level logic minimization. In R. Newton, editor, Selected Papers on Logic Synthesis/or Integrated Circuit Design. IEEE Press, 1987.]]Google ScholarGoogle Scholar
  23. 23 Internationl Business Machines. IBM Intelligent Miner User's Guide, Version 1 Release 1, SH12-6213-00 edition, July 1996.]]Google ScholarGoogle Scholar
  24. 24 A.K. Jain and R. C. Dubes. Algorithms lot Clustering Data. Prentice Hall, 1988.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. 25 L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.]]Google ScholarGoogle Scholar
  26. 26 D.-I. Lin and Z. M. Kedem. Pincer search: A new algorithm for discovering the maximum frequent sets. In Pr0c. o} the 6th Int'I Conference on Eztending Database Technology (EDBT), Valencia, Spain, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. 27 L. Lovasz. On the ratio of the optimal integral and fractional covers. Discrete Mathematics, 13:383-390, 1975.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. 28 C. Lund nnd M. Yannakakis. On the hardness of approximating minimization problems. In Proceedings o} the A CM Symposium on Theory o} Computing, pages 286-293, 1993.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. 29 W. Masek. Some NP-comptete set covering problems. M.S. Thesis, MIT, 1978.]]Google ScholarGoogle Scholar
  30. 30 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. o} the Filth Int'l Conference or, Extending Database Technology (EDBT), Avignon, France, March 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. 31 R. S. Michalski and It. E. Stepp. Learning from observation: Conceptual clustering. In It. S. Michalski, 3. G. Carbonell, nnd T. M. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, volume I, pages 331-363. Morgan Kaufmann, 1983.]]Google ScholarGoogle Scholar
  32. 32 R. Miller and Y. Yang. Association rules over interval data. In Proc. A CM SIGMOD international Con}. on Management of Data, pages 452-461, 1997'.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. 33 R.T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. o} the VLDB Conference, Santiago, Chile, September 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. 34 It. A. Reckhow and J. Culberson. Covering simple orthogohal polygon with a minimum number of orthogonally convex polygons. In Proc. o} the ACId 3rd Annual Computational Geometry Conference, pages 268-277, 1987.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. 35 J. Rissanen. Stochastic Complexly in Statistical Inquiry. World Scientific Publ. Co., 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. 36 P. Schroeter and J. Bigun. Hierarchical image segmentation by multi-dimensional clustering and orientation-adaptive boundary refinement. Pattern Recognition, 25(5):695-709, May 1995.]]Google ScholarGoogle ScholarCross RefCross Ref
  37. 37 J. Sharer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the 2$nd Int'l Conference on Very Large Databases, Bombay, India, September 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. 38 A. Shoshani. Personal communication. 1997.]]Google ScholarGoogle Scholar
  39. 39 P. Sneath and R. Sokal. Numerical Tazonomy. Freeman, 1973.]]Google ScholarGoogle Scholar
  40. 40 Ft. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational Tables. In Proc. of the A C'M SIGMOD Conference on Management o} Data, Montreal, Canada, june 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. 41 H. Toivonen. Sampling large databases for association rules. In Proc. of the $$nd Int't Conference on Very Large Databases, pages 134-145, Mumbai (Bombay), India, September 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. 42 S. Wharton. A generalized histogram clustering for multidimensional image data. Pattern Recognition, 16(2):193-199, 1983.]]Google ScholarGoogle ScholarCross RefCross Ref
  43. 43 M. Zait and H. Messatfa. A comparative study of clustering methods. Future Generation Computer Systems, 13(2- 3):149-159, November 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. 44 D. Zhang and A. Bowyer. CSG set-theoretic solid modelling and NC machining of blend surfaces. In Proceedings o} the Second Annual A CM Symposium on Computational Geometry, pages 314-318, 1986.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. 45 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clusterinK method for very large datnbales. In Proc. of the A CM SIGMOD ConJerence on Management o} Da~a, Montreal, Canada, June 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Automatic subspace clustering of high dimensional data for data mining applications

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM SIGMOD Record
            ACM SIGMOD Record  Volume 27, Issue 2
            June 1998
            595 pages
            ISSN:0163-5808
            DOI:10.1145/276305
            Issue’s Table of Contents
            • cover image ACM Conferences
              SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
              June 1998
              599 pages
              ISBN:0897919955
              DOI:10.1145/276304

            Copyright © 1998 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 June 1998

            Check for updates

            Qualifiers

            • article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader