skip to main content
article

Discovering all most specific sentences

Authors Info & Claims
Published:01 June 2003Publication History
Skip Abstract Section

Abstract

Data mining can be viewed, in many instances, as the task of computing a representation of a theory of a model or a database, in particular by finding a set of maximally specific sentences satisfying some property. We prove some hardness results that rule out simple approaches to solving the problem.The a priori algorithm is an algorithm that has been successfully applied to many instances of the problem. We analyze this algorithm, and prove that is optimal when the maximally specific sentences are "small". We also point out its limitations.We then present a new algorithm, the Dualize and Advance algorithm, and prove worst-case complexity bounds that are favorable in the general case. Our results use the concept of hypergraph transversals. Our analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case. On the other hand, using results for the general case of the hypergraph transversal enumeration problem, we can show that the Dualize and Advance algorithm has worst-case running time that is sub-exponential to the output size (i.e., the number of maximally specific sentences).We further show that the problem of finding maximally specific sentences is closely related to the problem of exact learning with membership queries studied in computational learning theory.

References

  1. Agrawal, R. C., Aggarwal, C. C., and Prasad, V. V. V. 2000. Depth first generation of long patterns. In Knowledge Discovery and Data Mining, pp. 108--118.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of ACM SIGMOD Conference on Management of Data (SIGMOD'93) (May). ACM, New York, pp. 207--216.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI Press, Menlo Park, Calif., pp. 307--328.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94) (Sept.). pp. 487--499.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE'95) (Taipei, Taiwan, Mar.). pp. 3--14.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Angluin, D. 1988. Queries and concept learning. Mach. Learn. 2, 4, (Apr.), 319--342.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bayardo, R. J. 1998. Efficiently mining long patterns from databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Bell, S. 2003. Deciding distinctness of query results by discovered constraints. Manuscript.]]Google ScholarGoogle Scholar
  9. Bell, S. and Brockhausen, P. 1995. Discovery of data dependencies in relational databases. Tech. Rep. LS-8 14. Universität Dortmund, Fachbereich Informatik, Lehrstuhl VIII, Künstliche Intelligenz.]]Google ScholarGoogle Scholar
  10. Berge, C. 1973. Hypergraphs. Combinatorics of Finite Sets, 3rd ed. North-Holland, Amsterdam, The Netherlands.]]Google ScholarGoogle Scholar
  11. Bioch, J. and Ibaraki, T. 1995. Complexity of identification and dualization of positive Boolean functions. Inf. Comput. 123, 1, 50--63.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Bshouty, N. H. 1996. The monotone theory for the pac-model. In Proceedings of the ACM Symposium on Theory of Computing (STOC). ACM, New York.]]Google ScholarGoogle Scholar
  13. Bshouty, N. H., Cleve, R., Gavalda, R., Kannan, S., and Tamon, C. 1996. Oracles and queries that are sufficient for exact learning. J. Comput. Syst. Sci. 52, 421--433.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Burdick, D., Calimlim, M., and Gehrke, J. 2001. Mafia: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the International Conference on Data Engineering.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Eiter, T. and Gottlob, G. 1995. Identifying the minimal transversals of a hypergraph and related problems. SIAM J. Comput. 24, 6 (Dec.), 1278--1304.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. 1996. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, Calif.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Fredman, M. L. and Khachiyan, L. 1996. On the complexity of dualization of monotone disjunctive normal forms. J. Algorithms 21, 3 (Nov.), 618--628.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Garey, M. and Johnson, D. 1979. Computers and Intractability---A Guide to the Theory of NP-Completeness. W. H. Freeman, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Gouda, K. and Zaki, M. J. 2001. Efficiently mining maximal frequent itemsets. In ICDM, pp. 163--170.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Gunopulos, D., Khardon, R., Mannila, H., and Toivonen, H. 1997a. Data mining, hypergraph transversals, and machine learning. In Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'97). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Gunopulos, D., Mannila, H., and Saluja, S. 1997b. Discovering all most specific sentences by randomized algorithms. In Proceedings of the International Conference on Database Theory (ICDT'97).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, New York, pp. 1--12.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Kavvadias, D. and Stavropoulos, E. C. 1999. Evaluation of an algorithm for the transversal hypergraph problem. In Algorithm Engineering, pp. 72--84.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Khardon, R. 1995. Translating between Horn representations and their characteristic models. J. Artif. Intel. Res. 3, 349--372.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Knobbe, A. J. and Adriaans, P. W. 1995. Discovering foreign key relations in relational databases. In Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases (Heraklion, Crete, Greece, Apr.). pp. 94--99.]]Google ScholarGoogle Scholar
  26. Langley, P. 1995. Elements of Machine Learning. Morgan-Kaufmann, San Mateo, Calif.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Lin, D.-I. and Kedem, Z. M. 1998. Pincer search: A new algorithm for discovering the maximum frequent set. In Extending Database Technology, pp. 105--119.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Mannila, H. 1995. Aspects of data mining. In Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases (Heraklion, Crete, Greece, Apr.). pp. 1--6.]]Google ScholarGoogle Scholar
  29. Mannila, H. 1996. Data mining: Machine learning, statistics, and databases. In Proceedings of the 8th International Conference on Scientific and Statistical Database Management (Stockholm, Sweden). pp, 2--9.]] Google ScholarGoogle ScholarCross RefCross Ref
  30. Mannila, H. and Räihä, K.-J. 1986. Design by example: An application of Armstrong relations. J. Comput. Syst. Sci. 33, 2, 126--141.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Mannila, H. and Räihä, K.-J. 1992. Design of Relational Databases. Addison-Wesley, Wokingham, U.K.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Mannila, H. and Räihä, K.-J. 1994. Algorithms for inferring functional dependencies. Data Knowl. Eng. 12, 1 (Feb.), 83--99.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Mannila, H. and Toivonen, H. 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining Knowl. Disc. 1, 3, 241--258.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Mannila, H., Toivonen, H., and Verkamo, A. I. 1994. Efficient algorithms for discovering association rules. In Knowledge Discovery in Databases, Papers from the 1994 AAAI Workshop (KDD'94) (Seattle, Wash., July), pp. 181--192.]]Google ScholarGoogle Scholar
  35. Mannila, H., Toivonen, H., and Verkamo, A. I. 1995. Discovering frequent episodes in sequences. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD'95) (Montreal, Ont., Canada, Aug.). AAAI Press, pp. 210--215.]]Google ScholarGoogle Scholar
  36. Mishra, N. and Pitt, L. 1997. Generating all maximal independent sets of bounded-degree hypergraphs. In Proceedings of the Conference on Computational Learning Theory (COLT). pp. 211--217.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Mitchell, T. M. 1982. Generalization as search. Artif. Intel. 18, 203--226.]]Google ScholarGoogle ScholarCross RefCross Ref
  38. Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning.]]Google ScholarGoogle Scholar
  39. Schlimmer, J. 1993. Using learned dependencies to automatically construct sufficient and sensible editing views. In Knowledge Discovery in Databases, Papers from the 1993 AAAI Workshop (KDD'93) (Washington, D.C.). AAAI Press, pp. 186--196.]]Google ScholarGoogle Scholar
  40. Ullman, J. D. 1988. Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press, Rockville, Md.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Valiant, L. G. 1979. The complexity of enumeration and reliability problems. SIAM J. Comput. 8, 3, 410--421.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Zheng, Z., Kohavi, R., and Mason, L. 2001. Real world performance of association rule algorithms. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Discovering all most specific sentences

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader