Abstract
We introduce the problem of cluster-grouping and show that it can be considered a subtask in several important data mining tasks, such as subgroup discovery, mining correlated patterns, clustering and classification. We then introduce the CG algorithm for solving cluster-grouping problems and incorporate it as a component in several existing and novel algorithms for subgroup discovery, clustering and classification. The resulting systems are empirically compared to state-of-the-art systems such as CN2, CBA, Ripper, Autoclass and CobWeb. The results indicate that the CG algorithm can be useful as a generic local pattern mining component in a wide variety of data mining and machine learning algorithms.
Cite this article
Zimmermann, A., De Raedt, L. Cluster-grouping: from subgroup discovery to clustering. Mach Learn 77, 125–159 (2009). https://doi.org/10.1007/s10994-009-5121-y
DOI: https://doi.org/10.1007/s10994-009-5121-y