skip to main content
article
Free Access

Statistical profile estimation in database systems

Published:01 September 1988Publication History
Skip Abstract Section

Abstract

A statistical profile summarizes the instances of a database. It describes aspects such as the number of tuples, the number of values, the distribution of values, the correlation between value sets, and the distribution of tuples among secondary storage units. Estimation of database profiles is critical in the problems of query optimization, physical database design, and database performance prediction. This paper describes a model of a database of profile, relates this model to estimating the cost of database operations, and surveys methods of estimating profiles. The operators and objects in the model include build profile, estimate profile, and update profile. The estimate operator is classified by the relational algebra operator (select, project, join), the property to be estimated (cardinality, distribution of values, and other parameters), and the underlying method (parametric, nonparametric, and ad-hoc). The accuracy, overhead, and assumptions of methods are discussed in detail. Relevant research in both the database and the statistics disciplines is incorporated in the detailed discussion.

References

  1. APERS, P. M. G., HEV~ER, A. R., ^NO Y^O, S. B. 1983. Optimization algorithms for distributed queries. IEEE Trans. Softw. Eng. SE-9, 1, 57-68.]]Google ScholarGoogle Scholar
  2. ASTRAHAN, M., SCHKOLNICK, M., AND WHANG, K. 1985. Counting unique values of an attribute without sorting. Tech. Rep. RJ 4960, IBM Research Division.]]Google ScholarGoogle Scholar
  3. BATORY, D., AND MANNINO, M. 1986. Panel on extensible database systems. In Proceedings of the ACM SIGMOD Conference (Washington, D.C., May). ACM, New York, pp. 187-190.]] Google ScholarGoogle Scholar
  4. BERNSTEIN, P., GOODMAN, N., WONG, E., REEVE, C., AND ROTHNIE, J. 1981. Query processing in SDD-1: A system for distributed databases. ACM Trans. Database Syst. 6, 4 (Dec.), 602-625.]] Google ScholarGoogle Scholar
  5. BREIMAN, L., MEISEL, W., AND PURCELL, E. 1977. Variable kernel estimates of multivariate densities. Technometrics 19, 135-144.]]Google ScholarGoogle Scholar
  6. CACOULLOS, V., 1966. Estimation of a multivariate density. Ann. Inst. Stat. Math. 18, 178-189.]]Google ScholarGoogle Scholar
  7. CARDENAS, A. 1975. Analysis and performance of inverted data-base structures. Commun. A CM 18, 5 (May), 253-263.]] Google ScholarGoogle Scholar
  8. CERI, S., AND PELAGATTI, G. 1984. Distributed Databases: Principles & Systems. McGraw-Hill, New York.]] Google ScholarGoogle Scholar
  9. CHAMBERLIN, D., ASTRAHAN, M., KING, W., LORiE, R., MEHL, J., PRICE, T., SCHKOLNICK, M., GRIFFITHS SELINGER, P., SLUTZ, D., WADE, B., AND YOST, R. 1981. Support for repetitive transactions and ad hoc queries in system R. ACM Trans. Database Syst. 6, i (Mar.), 70-94.]] Google ScholarGoogle Scholar
  10. CHEUNG, T. 1982. Estimating block accesses and number of records in file management. Commun. ACM 25, 7 (July), 484-487.]] Google ScholarGoogle Scholar
  11. CHRISTODOULAKIS, S. 1981. Estimating selectivities in data bases. Ph.D. dissertation, CSRG-136, Computer Systems Research Group, Univ. of Toronto.]] Google ScholarGoogle Scholar
  12. CHRISTODOULAKIS, S. 1983a. Estimating record selectivities. Inf. Syst. 8, 2 105-115.]]Google ScholarGoogle Scholar
  13. CHRISTODOULAKIS, $. 1983b. Estimating block transfers and join sizes. In Proceedings of the ACM SIGMOD Conference (May). ACM, New York, pp. 40-54.]] Google ScholarGoogle Scholar
  14. CHRISTODOULAKIS, $. 1984a. Estimating block selectivities. Inf. Syst. 9, 1, 69-79.]]Google ScholarGoogle Scholar
  15. CHRISTODOULAKIS, $. 1984b. Implications of certain assumptions in database performance evaluation. ACM Trans. Database Syst. 9, 2 (June), 163-186.]] Google ScholarGoogle Scholar
  16. CLARK, C., AND SCHKADE, L. 1983. StatisticalAnatysis for Administrative Decisions. South-Western, Cincinnati.]]Google ScholarGoogle Scholar
  17. COPELAND, G., KHOSHAFIAN, $., SMITH, M., AND VALDURIEZ, P. 1986. Buffering schemes for permanent data. In Proceedings of the Conference on Data Engineering (COMPDEC) (Los Angeles, Calif., Feb.). IEEE, New York, pp. 214-221.]] Google ScholarGoogle Scholar
  18. DEVROYE, L. 1985. A note on the L~ consistency of variable kernel estimates. Ann. Star. 13, 1041-1049.]]Google ScholarGoogle Scholar
  19. DEWITT, D., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D. 1984. Implementation techniques for main memory database systems. In Proceedings of the A CM SIG- MOD Conference (Boston, Mass. June). ACM, New York, pp. 1-8.]] Google ScholarGoogle Scholar
  20. FEDOROWICZ, J. 1984. Database evaluation using multiple regression techniques. In Proceedings of the ACM SIGMOD Conference (Boston, Mass., June). ACM, New York, pp. 70-76.]] Google ScholarGoogle Scholar
  21. FEDOROWICZ, J. 1987. Database performance evaluation in an indexed file environment. A CM Trans. Database Syst. 12, i (Mar.), 85-110.]] Google ScholarGoogle Scholar
  22. FINKELSTEIN, $., SCHKOLNICK, M., AND TIBERIO, P. 1988. Physical database design for relational databases. A CM Trans. Database Syst. 13, 1 (Mar.), 91-128.]] Google ScholarGoogle Scholar
  23. FRASER, D. 1957. Nonparametric Methods in Statistics. Wiley, New York.]]Google ScholarGoogle Scholar
  24. GELENSE, E., AND GARDY, D. 1982. The size of projections of relations satisfying a functional dependency. In Proceedings of the 8th International Conference on Very Large Data Bases (Mexico City). Very Large Data Base Endowment, Saratoga, Calif., pp. 325-333.]] Google ScholarGoogle Scholar
  25. HEVNER, A., AND YAO, S. B. 1979. Query processing in distributed database systems. IEEE Trans. Softw. Eng. SE-3, 3.]]Google ScholarGoogle Scholar
  26. IJBEMA, A., ANO BLANKEN, H. 1986. Estimating bucket accesses: A practical approach. In Proceedings of the Conference on Data Engineering (COMPDEC) (Los Angeles, Calif., Feb.), pp. 30-37.]] Google ScholarGoogle Scholar
  27. JARKE, M., ANO KOCH, J. 1984. Query optimization in database systems. A CM Comput. Surv. 16, 2 (June), 111-152.]] Google ScholarGoogle Scholar
  28. JOHNSON, N., AND KOTZ, S. 1970. Distributions in Statistics: Continuous Univariate Distributions, vols. I and 2. Houghton Mifflin, Boston.]]Google ScholarGoogle Scholar
  29. KAMEL, N., AND KING, R. 1985. A model of data distribution based on texture analysis. In Proceedings of the ACM SIGMOD Conference (Austin, Tex., May). ACM, New York, pp. 319-325.]] Google ScholarGoogle Scholar
  30. KERSCHBERG, L. TING, P. D., AND YAO, S. B. 1982. Query optimization in star computer networks. A CM Trans. Database Syst. 7, 4 (Dec.), 678-711.]] Google ScholarGoogle Scholar
  31. KOOI, R. 1980. The optimization of queries in relational databases. Ph.D. dissertation, Case Western Reserve Univ., Cleveland, Ohio.]] Google ScholarGoogle Scholar
  32. KOTZ, S., AND JOHNSON, N. 1977. Urn Models and Their Application. Wiley, New York.]]Google ScholarGoogle Scholar
  33. KUMAR, A., AND STONEBRAKER, M. 1987. The effect of join selectivities on optimal nesting order. SIGMOD Rec. 16, i (Mar.), 28-41.]] Google ScholarGoogle Scholar
  34. LOHMAN, G., MOHAN, C., HAAS, L., LINDSAY, B., SELINGER, P., WILMS, P., AND DANIELS, D. 1985. Query processing in R*. In Query Processing in Database Systems, W. Kim, D. Batory, and D. Reiner, Eds. Springer-Verlag, New York, pp. 31-47.]]Google ScholarGoogle Scholar
  35. LUK, W. 1983. On estimating block accesses in database organizations. Commun. A CM 26, 11 (Nov.), 945-947.]] Google ScholarGoogle Scholar
  36. MACKERT, L., AND LOHMAN, G. 1985. Index scans using a finite LRU buffer: A validated I/O model. IBM Research Rep. RJ4836, Almaden Research Center, San Jose, Calif.]]Google ScholarGoogle Scholar
  37. MACKERT, L., AND LOHMAN, G. 1986a. R* optimizer validation and performance evaluation for local queries. In Proceedings of the A CM SIGMOD Conference (Washington, D.C., May). ACM, New York, pp. 84-95.]] Google ScholarGoogle Scholar
  38. MACKERT, L., AND LOHMAN, G. 1986b. R* Optimizer validation and performance evaluation for distributed queries. In Proceedings of the 12th International Conference on Very Large Databases (Kyoto, Japan, Aug.). Morgan Kaufmann Publishers, Inc. (Also IBM Research Report RJ5050, Alamaden Research Center, San Jose, Calif.)]] Google ScholarGoogle Scholar
  39. MAHALANOBIS, P. 1936. On the generalized distance in statistics. In Proceedings of the National Institute of Sciences of India 12, pp. 35-49.]]Google ScholarGoogle Scholar
  40. MANNINO, M. 1986. Selectivity estimation in unify SQL. Tech. Rep. Dept. of Management Science and Information Systems, Univ. of Texas, Austin.]]Google ScholarGoogle Scholar
  41. MANNINO, M., AND RIVERA, A. 1988. An extensible model of selectivity estimation. Inf. Sci.-An International Journal. Special issue on database systems to appear Spring 1988.]] Google ScholarGoogle Scholar
  42. MERRETT, T. H., AND OTOO, E. 1979. Distribution models of relations. In Proceedings of the 5th International Conference on Very Large Data Bases (Rio de Janeiro, Brazil, Oct.). ACM, New York, pp. 418-425.]]Google ScholarGoogle Scholar
  43. MONTGOMERY, A., D'SOuzA, D., AND LEE, S. 1983. The cost of relational algebraic operations on skewed data: Estimates and experiments. In Information Processing Letters 83. Elsevier North-Holland, New York, pp. 235-241.]]Google ScholarGoogle Scholar
  44. MOORE, D., AND YACKEL, J. 1977. Consistency properties of nearest neighbor density function estimates. Ann. Stat. 5, 143-154.]]Google ScholarGoogle Scholar
  45. MURALIKRISHNA, M., AND DEWlTT, D. 1988. Equidepth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of the ACM SIGMOD Conference (Chicago, Ill., June). ACM, New York, pp. 28-36.]] Google ScholarGoogle Scholar
  46. MUTHUSWAMY, B., AND KERSCHBERG, G. 1985. A DDSM for relational query optimization. Tech. Rep., Univ. of South Carolina, Columbia. Also in Proceedings of the ACM Annual Conference (Denver, Colo., Oct.). ACM, New York.]] Google ScholarGoogle Scholar
  47. PIATETSKY-SHAPIRO, G., AND CONNELL, G. 1984. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of the ACM SIGMOD Conference (Boston, Mass., June). ACM, New York, pp. 256-276.]] Google ScholarGoogle Scholar
  48. PIATETSKY-SHAPIRO, G. 1985. Estimating the number of distinct attribute values by using sampling. Submitted for publication.]]Google ScholarGoogle Scholar
  49. ROWE, N. 1985. Antisampling for estimation: An overview. IEEE Trans. Softw. Eng. 11, 10 (Oct.)., 1081-1091.]] Google ScholarGoogle Scholar
  50. SAMSON, W., AND BENDELL, A. 1983. Rank order distributions and secondary key indexing (extended abstract). In Proceedings of the 2nd International Conference on Databases, (Cambridge, England).]]Google ScholarGoogle Scholar
  51. SELINGER, P., ASTRAHAN, M., CHAMBERLIN, D., LORIE, R., ANO PRICE, T. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD Conference (San Jose, Calif.). ACM, New York, pp. 23-34.]] Google ScholarGoogle Scholar
  52. SCHKOLNICK, M., AND TIBERIO, P. 1979. Considerations in developing a design tool for a relational DBMS. In Proceedings of the IEEE COMPSAC Conference (Chicago, Ill., Nov.). IEEE Press, New York, pp. 228-235.]]Google ScholarGoogle Scholar
  53. STONEBRAKER, M., RUBENSTEIN, B., AND GUTMANN, A. 1983. Application of abstract data types and abstract indices to CAD databases. In Proceedings of Database Week: Engineering Design Applications (San Jose, Calif., May). pp. 107-113.]]Google ScholarGoogle Scholar
  54. STUROES, H. 1926. The choice of class interval. J. Am. Stat. Assoc. 65-66.]]Google ScholarGoogle Scholar
  55. TAPIA, R., AND THOMPSON, J. 1978. Nonparametric Probability Density Estimation. John Hopkins University Press, Baltimore, Md.]]Google ScholarGoogle Scholar
  56. TEOREY, T. ANO FRY, J. 1982. Design of Database Structures. Prentice-Hall, Englewood, Cliffs, N.J.]] Google ScholarGoogle Scholar
  57. UNIFY CORPORATION. 1985. Unix Relational Database Management System--Reference Manual. Release 3.2. Portland, Oreg.]]Google ScholarGoogle Scholar
  58. VANDER ZANDER, B., TAYLOR, H., AND BITON, D. 1986. Estimating block accesses when attributes are correlated. In Proceedings of the 12th International Conference on Very Large Databases (Kyoto, Japan, Aug.). Morgan Kaufmann Publishers, Inc., pp. 119-127.]] Google ScholarGoogle Scholar
  59. WEOMAN, E. 1983. Density estimation. In Encyclopedia of Statistical Sciences, Vol. 2, S. Kotz and N. Johnson, Eds. Wiley, New York.]]Google ScholarGoogle Scholar
  60. WERTZ, W. 1978. Statistical Density Estimation: A Survey. Vandenhoeck and Ruprecht, Gottingen.]]Google ScholarGoogle Scholar
  61. WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. 1983. Estimating block accesses in database organizations: A closed noniterative formula. Commun. ACM 26, 11 (Nov.), 940-944.]] Google ScholarGoogle Scholar
  62. YAO, S. 1977. Approximating block accesses in database organizations. Commun. A CM 20, 4 (Apr.), 260-261.]] Google ScholarGoogle Scholar
  63. ZAHORJAN, J., BELL. B., AND SEVCIK, K. 1983. Estimating block transfers when record access probabilities are non-uniform. Inf. Process. Lett. 16, 5 (June), 249-252.]]Google ScholarGoogle Scholar
  64. ZIPF, G. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, Mass.]]Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image ACM Computing Surveys
    ACM Computing Surveys  Volume 20, Issue 3
    Sept. 1988
    68 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/62061
    Issue’s Table of Contents

    Copyright © 1988 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 1 September 1988
    Published in csur Volume 20, Issue 3

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader