Abstract
Data mining can be viewed, in many instances, as the task of computing a representation of a theory of a model or a database, in particular by finding a set of maximally specific sentences satisfying some property. We prove some hardness results that rule out simple approaches to solving the problem.The a priori algorithm is an algorithm that has been successfully applied to many instances of the problem. We analyze this algorithm, and prove that is optimal when the maximally specific sentences are "small". We also point out its limitations.We then present a new algorithm, the Dualize and Advance algorithm, and prove worst-case complexity bounds that are favorable in the general case. Our results use the concept of hypergraph transversals. Our analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case. On the other hand, using results for the general case of the hypergraph transversal enumeration problem, we can show that the Dualize and Advance algorithm has worst-case running time that is sub-exponential to the output size (i.e., the number of maximally specific sentences).We further show that the problem of finding maximally specific sentences is closely related to the problem of exact learning with membership queries studied in computational learning theory.
- Agrawal, R. C., Aggarwal, C. C., and Prasad, V. V. V. 2000. Depth first generation of long patterns. In Knowledge Discovery and Data Mining, pp. 108--118.]] Google ScholarDigital Library
- Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of ACM SIGMOD Conference on Management of Data (SIGMOD'93) (May). ACM, New York, pp. 207--216.]] Google ScholarDigital Library
- Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI Press, Menlo Park, Calif., pp. 307--328.]] Google ScholarDigital Library
- Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94) (Sept.). pp. 487--499.]] Google ScholarDigital Library
- Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE'95) (Taipei, Taiwan, Mar.). pp. 3--14.]] Google ScholarDigital Library
- Angluin, D. 1988. Queries and concept learning. Mach. Learn. 2, 4, (Apr.), 319--342.]] Google ScholarDigital Library
- Bayardo, R. J. 1998. Efficiently mining long patterns from databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York.]] Google ScholarDigital Library
- Bell, S. 2003. Deciding distinctness of query results by discovered constraints. Manuscript.]]Google Scholar
- Bell, S. and Brockhausen, P. 1995. Discovery of data dependencies in relational databases. Tech. Rep. LS-8 14. Universität Dortmund, Fachbereich Informatik, Lehrstuhl VIII, Künstliche Intelligenz.]]Google Scholar
- Berge, C. 1973. Hypergraphs. Combinatorics of Finite Sets, 3rd ed. North-Holland, Amsterdam, The Netherlands.]]Google Scholar
- Bioch, J. and Ibaraki, T. 1995. Complexity of identification and dualization of positive Boolean functions. Inf. Comput. 123, 1, 50--63.]] Google ScholarDigital Library
- Bshouty, N. H. 1996. The monotone theory for the pac-model. In Proceedings of the ACM Symposium on Theory of Computing (STOC). ACM, New York.]]Google Scholar
- Bshouty, N. H., Cleve, R., Gavalda, R., Kannan, S., and Tamon, C. 1996. Oracles and queries that are sufficient for exact learning. J. Comput. Syst. Sci. 52, 421--433.]] Google ScholarDigital Library
- Burdick, D., Calimlim, M., and Gehrke, J. 2001. Mafia: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the International Conference on Data Engineering.]] Google ScholarDigital Library
- Eiter, T. and Gottlob, G. 1995. Identifying the minimal transversals of a hypergraph and related problems. SIAM J. Comput. 24, 6 (Dec.), 1278--1304.]] Google ScholarDigital Library
- Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. 1996. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, Calif.]] Google ScholarDigital Library
- Fredman, M. L. and Khachiyan, L. 1996. On the complexity of dualization of monotone disjunctive normal forms. J. Algorithms 21, 3 (Nov.), 618--628.]] Google ScholarDigital Library
- Garey, M. and Johnson, D. 1979. Computers and Intractability---A Guide to the Theory of NP-Completeness. W. H. Freeman, New York.]] Google ScholarDigital Library
- Gouda, K. and Zaki, M. J. 2001. Efficiently mining maximal frequent itemsets. In ICDM, pp. 163--170.]] Google ScholarDigital Library
- Gunopulos, D., Khardon, R., Mannila, H., and Toivonen, H. 1997a. Data mining, hypergraph transversals, and machine learning. In Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'97). ACM, New York.]] Google ScholarDigital Library
- Gunopulos, D., Mannila, H., and Saluja, S. 1997b. Discovering all most specific sentences by randomized algorithms. In Proceedings of the International Conference on Database Theory (ICDT'97).]] Google ScholarDigital Library
- Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, New York, pp. 1--12.]] Google ScholarDigital Library
- Kavvadias, D. and Stavropoulos, E. C. 1999. Evaluation of an algorithm for the transversal hypergraph problem. In Algorithm Engineering, pp. 72--84.]] Google ScholarDigital Library
- Khardon, R. 1995. Translating between Horn representations and their characteristic models. J. Artif. Intel. Res. 3, 349--372.]]Google ScholarDigital Library
- Knobbe, A. J. and Adriaans, P. W. 1995. Discovering foreign key relations in relational databases. In Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases (Heraklion, Crete, Greece, Apr.). pp. 94--99.]]Google Scholar
- Langley, P. 1995. Elements of Machine Learning. Morgan-Kaufmann, San Mateo, Calif.]] Google ScholarDigital Library
- Lin, D.-I. and Kedem, Z. M. 1998. Pincer search: A new algorithm for discovering the maximum frequent set. In Extending Database Technology, pp. 105--119.]] Google ScholarDigital Library
- Mannila, H. 1995. Aspects of data mining. In Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases (Heraklion, Crete, Greece, Apr.). pp. 1--6.]]Google Scholar
- Mannila, H. 1996. Data mining: Machine learning, statistics, and databases. In Proceedings of the 8th International Conference on Scientific and Statistical Database Management (Stockholm, Sweden). pp, 2--9.]] Google ScholarCross Ref
- Mannila, H. and Räihä, K.-J. 1986. Design by example: An application of Armstrong relations. J. Comput. Syst. Sci. 33, 2, 126--141.]] Google ScholarDigital Library
- Mannila, H. and Räihä, K.-J. 1992. Design of Relational Databases. Addison-Wesley, Wokingham, U.K.]] Google ScholarDigital Library
- Mannila, H. and Räihä, K.-J. 1994. Algorithms for inferring functional dependencies. Data Knowl. Eng. 12, 1 (Feb.), 83--99.]] Google ScholarDigital Library
- Mannila, H. and Toivonen, H. 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining Knowl. Disc. 1, 3, 241--258.]] Google ScholarDigital Library
- Mannila, H., Toivonen, H., and Verkamo, A. I. 1994. Efficient algorithms for discovering association rules. In Knowledge Discovery in Databases, Papers from the 1994 AAAI Workshop (KDD'94) (Seattle, Wash., July), pp. 181--192.]]Google Scholar
- Mannila, H., Toivonen, H., and Verkamo, A. I. 1995. Discovering frequent episodes in sequences. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD'95) (Montreal, Ont., Canada, Aug.). AAAI Press, pp. 210--215.]]Google Scholar
- Mishra, N. and Pitt, L. 1997. Generating all maximal independent sets of bounded-degree hypergraphs. In Proceedings of the Conference on Computational Learning Theory (COLT). pp. 211--217.]] Google ScholarDigital Library
- Mitchell, T. M. 1982. Generalization as search. Artif. Intel. 18, 203--226.]]Google ScholarCross Ref
- Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning.]]Google Scholar
- Schlimmer, J. 1993. Using learned dependencies to automatically construct sufficient and sensible editing views. In Knowledge Discovery in Databases, Papers from the 1993 AAAI Workshop (KDD'93) (Washington, D.C.). AAAI Press, pp. 186--196.]]Google Scholar
- Ullman, J. D. 1988. Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press, Rockville, Md.]] Google ScholarDigital Library
- Valiant, L. G. 1979. The complexity of enumeration and reliability problems. SIAM J. Comput. 8, 3, 410--421.]]Google ScholarDigital Library
- Zheng, Z., Kohavi, R., and Mason, L. 2001. Real world performance of association rule algorithms. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York.]] Google ScholarDigital Library
Index Terms
- Discovering all most specific sentences
Recommendations
An Efficient Hash-Based Method for Discovering the Maximal Frequent Set
COMPSAC '01: Proceedings of the 25th International Computer Software and Applications Conference on Invigorating Software DevelopmentThe association rule mining can be divided into two steps. The first step is to find out all frequent itemsets, whose occurrences are greater than or equal to the user-specified threshold. The second step is to generate reliable association rules based ...
Efficiently mining Maximal Frequent Sets in dense databases for discovering association rules
We present, MaxDomino, an algorithm for mining Maximal Frequent Sets (MFS) for discovering association rules in dense databases. The algorithm uses novel concepts of dominancy factor and collapsibility of transaction for efficiently mining MFS. Unlike ...
Mining fuzzy specific rare itemsets for education data
Association rule mining is an important data analysis method for the discovery of associations within data. There have been many studies focused on finding fuzzy association rules from transaction databases. Unfortunately, in the real world, one may ...
Comments