Abstract
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
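To make the grid-based idea concrete, the following is a minimal sketch of the kind of bottom-up dense-unit search that underlies subspace clustering of this flavor. It is not the authors' implementation: the parameter names `xi` (number of intervals per dimension) and `tau` (density threshold as a fraction of the points) are stand-ins, and the candidate generation shown is a simplified Apriori-style join.

```python
from itertools import combinations

def dense_units(points, xi=4, tau=0.1):
    """Bottom-up search for dense grid units in subspaces (illustrative sketch).

    A unit is a tuple of (dimension, interval) pairs; a unit is dense if the
    fraction of points falling in it is at least tau. Dense units of dimension
    k+1 are generated only from dense units of dimension k (monotonicity).
    """
    n, d = len(points), len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]

    def interval(p, dim):
        # Map a coordinate onto one of xi equal-width intervals in this dimension.
        if hi[dim] == lo[dim]:
            return 0
        k = int((p[dim] - lo[dim]) / (hi[dim] - lo[dim]) * xi)
        return min(k, xi - 1)

    def count(unit):
        # Number of points falling inside every interval of the unit.
        return sum(all(interval(p, dim) == iv for dim, iv in unit) for p in points)

    # Level 1: dense one-dimensional units.
    dense = {1: {((dim, iv),) for dim in range(d) for iv in range(xi)
                 if count(((dim, iv),)) >= tau * n}}
    k = 1
    while dense[k]:
        # Join dense k-dimensional units whose union spans k+1 distinct dimensions.
        cand = set()
        for a, b in combinations(dense[k], 2):
            merged = tuple(sorted(set(a) | set(b)))
            dims = [dim for dim, _ in merged]
            if len(merged) == k + 1 and len(set(dims)) == k + 1:
                cand.add(merged)
        dense[k + 1] = {u for u in cand if count(u) >= tau * n}
        k += 1
    del dense[k]  # drop the final empty level
    return dense
```

For instance, for points that cluster tightly in the first two dimensions, the search reports a dense two-dimensional unit covering the cluster and stops, since no three-dimensional unit can be dense.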
Index Terms
- Automatic subspace clustering of high dimensional data for data mining applications