Article

Free Access

The power of sampling in knowledge discovery

Authors:
Jyrki Kivinen

University of California, Santa Cruz

University of California, Santa Cruz
View Profile

,
Heikki Mannila

University of Helsinki

University of Helsinki
View Profile

PODS '94: Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systemsMay 1994Pages 77–85https://doi.org/10.1145/182591.182601

Published:24 May 1994Publication History

PODS '94: Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems

Pages 77–85

ABSTRACT

We consider the problem of approximately verifying the truth of sentences of tuple relational calculus in a given relation M by considering only a random sample of M. We define two different measures for the error of a universal sentence in a relation. For a set of n universal sentences each with at most k universal quantifiers, we give upper and lower bounds for the sample sizes required for having a high probability that all the sentences with error at least ε can be detected as false by considering the sample. The sample sizes are O((log n)/ε) or O((|M|^1–1/k)log n/ε), depending on the error measure used. We also consider universal-existential sentences.

References

Coh93.William W. Cohen. PAC-learning a restricted class of recursive logic programs. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 86-92, July 1993. AAAI Press, Menlo Park, CA.Google ScholarDigital Library
Fag82.Ronald Fagin. Horn clauses and database dependencies. Journal of the A CM, 29(4):952-985, October 1982. Google ScholarDigital Library
GHK92.Adam J. Grove, Joseph Y. Halpern and Daphne Koller. Asymptotic conditional probabilities for first-order logic. In Proceedings of the Twenty Fourth Symposium on the Theory of Computin9 (STOC 'gP), pages 294-305, May 1992. ACM. Google ScholarDigital Library
HKM94.Lauri Hells, Jyrki Kivinen and Heikki Mannila. Error measures and generalized quantifiers. In preparation, 1994.Google Scholar
HO91.Weh-Chi Hou and Gultekin Ozsoyoglu. Statistical estimators for aggregate relational algebra queries. A CM Transactions on Database Systems, 16(4):600-654, December 1991. Google ScholarDigital Library
HS92.Peter J. Haas and Arun N. Swami. Sequential sampling procedures for query size optimization. In Michael Stonebraker, editor, Proceedings of the 199P A GM SIGMOD Conference on Management of Data, pages 341-350, June 1992. ACM. Google ScholarDigital Library
KI91.Ravi Krishnamurthy and Tomasz imielinski. Research directions in knowledge discovery. SIG- MOD Record, 20(3):76-78, September 1991.Google Scholar
KM92.Jyrki Kivinen and Heikki Mannila. Approximate dependency inference from relations. In Joachim Biskup and Richard Hull, editors, Proceedings of the Fourth International Conference on Database Theory (ICDT '9P), pages 86-98, October 1992. Springer-Verlag. Google ScholarDigital Library
LNS90.Richard J. Lipton, Jeffrey F. Naughton and D. A. Schneider. Practical selectivity estimation through adaptive sampling. In Hector Garcia- Molina and H. V. Jagadish, editors, Proceedings of the 1990 A GM SIGMOD Conference on Management o/Data, pages 1-11, May 1990. ACM. Google ScholarDigital Library
Mug92.Steven Muggleton, editor. Inductive Logic Programmin9. Academic Press, 1992.Google Scholar
NS90.Jeffrey F. Naughton and S. Seshadri. On estimating the size of projections. In Serge Abiteboul and Paris C. Kanellalds, editors, Proceedings of the Third International Conference on Database Theory (IGDT '90), pages 499-513, December 1990. Springer-Verlag. Google ScholarDigital Library
Nii87.Ilkka Niiniluoto. Truthlikeness. D. Reidel Pubfishing Company, Dordrecht, Holland, 1987.Google Scholar
Odd86.Graham Oddie. Likeness to Truth. D. Reidel Publishing Company, Dordrecht, Holland, 1986.Google ScholarCross Ref
OSW89.Daniel N. Osherson, Michael Stob and Scott Weinstein. On approximate truth. In Ronald Rivest, David Haussler and Manfred K. Warmuth, editors, Proceedings of the Second Workshop on Computational Learning Theory, pages 88-101, July 1989. Morgan Kaufmann, San Mateo, CA. Google ScholarDigital Library
PS91.Gregory Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In Gregory Piatetsky-Shapiro and William J. Frawley, editors, Knowledge Discovery in Databases, pages 229-248, 1991. AAAI Press, Menlo Park, CA.Google Scholar
PSF91.Gregory Piatetsky-Shapiro and William J. Frawley, editors. Knowledge Discovery in Databases. AAAI Press, Menlo Park, CA, 1991.Google ScholarDigital Library
Roy91.Shaibal Roy. Semantic complexity of classes of relational queries and query independent data partitioning. In Proceedings of the Tenth Symposium on Principles of Database Systems (PODS '90), pages 259-267, May 1991. ACM. Google ScholarDigital Library
Ull88.Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems, Volume L Computer Science Press, 1988. Google ScholarDigital Library
Val84.Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, November 1984. Google ScholarDigital Library
ZZ93.Robert Zembowicz and Jan M. Zytkov. Testing the existence of functional relationship in data. In Proceedings of the AAAI-93 Workshop on Knowledge Discovery in Databases, 1993.Google Scholar

Index Terms

The power of sampling in knowledge discovery
1. Information systems
  1. Data management systems
    1. Database design and models
    2. Database management system engines
      1. Database query processing
2. Theory of computation

Recommendations

Stratified random sampling for power estimation

In this paper, we present new statistical sampling techniques for performing power estimation at the circuit level. These techniques first transform the power estimation problem to a survey sampling problem, and then apply stratified random sampling to ...
Read More
Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that allow to use large amounts of data. One possible approach for dealing with huge amounts of data is to take a ...
Read More
AND/OR importance sampling
UAI'08: Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence

The paper introduces AND/OR importance sampling for probabilistic graphical models. In contrast to importance sampling, AND/OR importance sampling caches samples in the AND/OR space and then extracts a new sample mean from the stored samples. We prove ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PODS '94: Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
May 1994
313 pages
ISBN:0897916425
DOI:10.1145/182591
Chairman:
Victor Vianu
Univ. of California at San Diego, La Jolla
Copyright © 1994 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 May 1994
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
PODS '94 Paper Acceptance Rate28of117submissions,24%Overall Acceptance Rate642of2,707submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 59
  Total Citations
  View Citations
- 503
  Total Downloads
- Downloads (Last 12 months)39
- Downloads (Last 6 weeks)11
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

The power of sampling in knowledge discovery

PODS '94: Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Stratified random sampling for power estimation

Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

AND/OR importance sampling

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

The power of sampling in knowledge discovery

PODS '94: Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Stratified random sampling for power estimation

Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

AND/OR importance sampling

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media