article

Discovering all most specific sentences

Authors:
Dimitrios Gunopulos

Computer Science and Engineering Department, University of California, Riverside, CA

Computer Science and Engineering Department, University of California, Riverside, CA
View Profile

,
Roni Khardon

EECS Department, Tufts University, Medford, MA

EECS Department, Tufts University, Medford, MA
View Profile

,
Heikki Mannila

Department of Computer Science, University of Helsinki, Helsinki, Finland

Department of Computer Science, University of Helsinki, Helsinki, Finland
View Profile

,
Sanjeev Saluja

LSI Logic, Milpitas, CA

LSI Logic, Milpitas, CA
View Profile

,
Hannu Toivonen

Department of Computer Science, University of Helsinki, Helsinki, Finland

Department of Computer Science, University of Helsinki, Helsinki, Finland
View Profile

,
Ram Sewak Sharma

Computer Science and Engineering Department, University of California, Riverside, CA

Computer Science and Engineering Department, University of California, Riverside, CA
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 28 Issue 2pp 140–174https://doi.org/10.1145/777943.777945

Published:01 June 2003Publication History

ACM Transactions on Database Systems

Abstract

Data mining can be viewed, in many instances, as the task of computing a representation of a theory of a model or a database, in particular by finding a set of maximally specific sentences satisfying some property. We prove some hardness results that rule out simple approaches to solving the problem.The a priori algorithm is an algorithm that has been successfully applied to many instances of the problem. We analyze this algorithm, and prove that is optimal when the maximally specific sentences are "small". We also point out its limitations.We then present a new algorithm, the Dualize and Advance algorithm, and prove worst-case complexity bounds that are favorable in the general case. Our results use the concept of hypergraph transversals. Our analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case. On the other hand, using results for the general case of the hypergraph transversal enumeration problem, we can show that the Dualize and Advance algorithm has worst-case running time that is sub-exponential to the output size (i.e., the number of maximally specific sentences).We further show that the problem of finding maximally specific sentences is closely related to the problem of exact learning with membership queries studied in computational learning theory.

References

Agrawal, R. C., Aggarwal, C. C., and Prasad, V. V. V. 2000. Depth first generation of long patterns. In Knowledge Discovery and Data Mining, pp. 108--118.]] Google ScholarDigital Library
Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of ACM SIGMOD Conference on Management of Data (SIGMOD'93) (May). ACM, New York, pp. 207--216.]] Google ScholarDigital Library
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI Press, Menlo Park, Calif., pp. 307--328.]] Google ScholarDigital Library
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94) (Sept.). pp. 487--499.]] Google ScholarDigital Library
Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE'95) (Taipei, Taiwan, Mar.). pp. 3--14.]] Google ScholarDigital Library
Angluin, D. 1988. Queries and concept learning. Mach. Learn. 2, 4, (Apr.), 319--342.]] Google ScholarDigital Library
Bayardo, R. J. 1998. Efficiently mining long patterns from databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York.]] Google ScholarDigital Library
Bell, S. 2003. Deciding distinctness of query results by discovered constraints. Manuscript.]]Google Scholar
Bell, S. and Brockhausen, P. 1995. Discovery of data dependencies in relational databases. Tech. Rep. LS-8 14. Universität Dortmund, Fachbereich Informatik, Lehrstuhl VIII, Künstliche Intelligenz.]]Google Scholar
Berge, C. 1973. Hypergraphs. Combinatorics of Finite Sets, 3rd ed. North-Holland, Amsterdam, The Netherlands.]]Google Scholar
Bioch, J. and Ibaraki, T. 1995. Complexity of identification and dualization of positive Boolean functions. Inf. Comput. 123, 1, 50--63.]] Google ScholarDigital Library
Bshouty, N. H. 1996. The monotone theory for the pac-model. In Proceedings of the ACM Symposium on Theory of Computing (STOC). ACM, New York.]]Google Scholar
Bshouty, N. H., Cleve, R., Gavalda, R., Kannan, S., and Tamon, C. 1996. Oracles and queries that are sufficient for exact learning. J. Comput. Syst. Sci. 52, 421--433.]] Google ScholarDigital Library
Burdick, D., Calimlim, M., and Gehrke, J. 2001. Mafia: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the International Conference on Data Engineering.]] Google ScholarDigital Library
Eiter, T. and Gottlob, G. 1995. Identifying the minimal transversals of a hypergraph and related problems. SIAM J. Comput. 24, 6 (Dec.), 1278--1304.]] Google ScholarDigital Library
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. 1996. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, Calif.]] Google ScholarDigital Library
Fredman, M. L. and Khachiyan, L. 1996. On the complexity of dualization of monotone disjunctive normal forms. J. Algorithms 21, 3 (Nov.), 618--628.]] Google ScholarDigital Library
Garey, M. and Johnson, D. 1979. Computers and Intractability---A Guide to the Theory of NP-Completeness. W. H. Freeman, New York.]] Google ScholarDigital Library
Gouda, K. and Zaki, M. J. 2001. Efficiently mining maximal frequent itemsets. In ICDM, pp. 163--170.]] Google ScholarDigital Library
Gunopulos, D., Khardon, R., Mannila, H., and Toivonen, H. 1997a. Data mining, hypergraph transversals, and machine learning. In Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'97). ACM, New York.]] Google ScholarDigital Library
Gunopulos, D., Mannila, H., and Saluja, S. 1997b. Discovering all most specific sentences by randomized algorithms. In Proceedings of the International Conference on Database Theory (ICDT'97).]] Google ScholarDigital Library
Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, New York, pp. 1--12.]] Google ScholarDigital Library
Kavvadias, D. and Stavropoulos, E. C. 1999. Evaluation of an algorithm for the transversal hypergraph problem. In Algorithm Engineering, pp. 72--84.]] Google ScholarDigital Library
Khardon, R. 1995. Translating between Horn representations and their characteristic models. J. Artif. Intel. Res. 3, 349--372.]]Google ScholarDigital Library
Knobbe, A. J. and Adriaans, P. W. 1995. Discovering foreign key relations in relational databases. In Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases (Heraklion, Crete, Greece, Apr.). pp. 94--99.]]Google Scholar
Langley, P. 1995. Elements of Machine Learning. Morgan-Kaufmann, San Mateo, Calif.]] Google ScholarDigital Library
Lin, D.-I. and Kedem, Z. M. 1998. Pincer search: A new algorithm for discovering the maximum frequent set. In Extending Database Technology, pp. 105--119.]] Google ScholarDigital Library
Mannila, H. 1995. Aspects of data mining. In Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases (Heraklion, Crete, Greece, Apr.). pp. 1--6.]]Google Scholar
Mannila, H. 1996. Data mining: Machine learning, statistics, and databases. In Proceedings of the 8th International Conference on Scientific and Statistical Database Management (Stockholm, Sweden). pp, 2--9.]] Google ScholarCross Ref
Mannila, H. and Räihä, K.-J. 1986. Design by example: An application of Armstrong relations. J. Comput. Syst. Sci. 33, 2, 126--141.]] Google ScholarDigital Library
Mannila, H. and Räihä, K.-J. 1992. Design of Relational Databases. Addison-Wesley, Wokingham, U.K.]] Google ScholarDigital Library
Mannila, H. and Räihä, K.-J. 1994. Algorithms for inferring functional dependencies. Data Knowl. Eng. 12, 1 (Feb.), 83--99.]] Google ScholarDigital Library
Mannila, H. and Toivonen, H. 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining Knowl. Disc. 1, 3, 241--258.]] Google ScholarDigital Library
Mannila, H., Toivonen, H., and Verkamo, A. I. 1994. Efficient algorithms for discovering association rules. In Knowledge Discovery in Databases, Papers from the 1994 AAAI Workshop (KDD'94) (Seattle, Wash., July), pp. 181--192.]]Google Scholar
Mannila, H., Toivonen, H., and Verkamo, A. I. 1995. Discovering frequent episodes in sequences. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD'95) (Montreal, Ont., Canada, Aug.). AAAI Press, pp. 210--215.]]Google Scholar
Mishra, N. and Pitt, L. 1997. Generating all maximal independent sets of bounded-degree hypergraphs. In Proceedings of the Conference on Computational Learning Theory (COLT). pp. 211--217.]] Google ScholarDigital Library
Mitchell, T. M. 1982. Generalization as search. Artif. Intel. 18, 203--226.]]Google ScholarCross Ref
Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning.]]Google Scholar
Schlimmer, J. 1993. Using learned dependencies to automatically construct sufficient and sensible editing views. In Knowledge Discovery in Databases, Papers from the 1993 AAAI Workshop (KDD'93) (Washington, D.C.). AAAI Press, pp. 186--196.]]Google Scholar
Ullman, J. D. 1988. Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press, Rockville, Md.]] Google ScholarDigital Library
Valiant, L. G. 1979. The complexity of enumeration and reliability problems. SIAM J. Comput. 8, 3, 410--421.]]Google ScholarDigital Library
Zheng, Z., Kohavi, R., and Mason, L. 2001. Real world performance of association rule algorithms. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York.]] Google ScholarDigital Library

Index Terms

Discovering all most specific sentences
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Recommendations

An Efficient Hash-Based Method for Discovering the Maximal Frequent Set
COMPSAC '01: Proceedings of the 25th International Computer Software and Applications Conference on Invigorating Software Development

The association rule mining can be divided into two steps. The first step is to find out all frequent itemsets, whose occurrences are greater than or equal to the user-specified threshold. The second step is to generate reliable association rules based ...
Read More
Efficiently mining Maximal Frequent Sets in dense databases for discovering association rules

We present, MaxDomino, an algorithm for mining Maximal Frequent Sets (MFS) for discovering association rules in dense databases. The algorithm uses novel concepts of dominancy factor and collapsibility of transaction for efficiently mining MFS. Unlike ...
Read More
Mining fuzzy specific rare itemsets for education data

Association rule mining is an important data analysis method for the discovery of associations within data. There have been many studies focused on finding fuzzy association rules from transaction databases. Unfortunately, in the real world, one may ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 28, Issue 2
June 2003
108 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/777943
Issue’s Table of Contents

Copyright © 2003 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 2003
Published in tods Volume 28, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Data mining
association rules
learning with membership queries
maximal frequent sets
minimal keys
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 165
  Total Citations
  View Citations
- 1,482
  Total Downloads
- Downloads (Last 12 months)20
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Discovering all most specific sentences

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

An Efficient Hash-Based Method for Discovering the Maximal Frequent Set

Efficiently mining Maximal Frequent Sets in dense databases for discovering association rules

Mining fuzzy specific rare itemsets for education data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Discovering all most specific sentences

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

An Efficient Hash-Based Method for Discovering the Maximal Frequent Set

Efficiently mining Maximal Frequent Sets in dense databases for discovering association rules

Mining fuzzy specific rare itemsets for education data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media