Abstract
A statistical profile summarizes the instances of a database. It describes aspects such as the number of tuples, the number of values, the distribution of values, the correlation between value sets, and the distribution of tuples among secondary storage units. Estimation of database profiles is critical in the problems of query optimization, physical database design, and database performance prediction. This paper describes a model of a database of profile, relates this model to estimating the cost of database operations, and surveys methods of estimating profiles. The operators and objects in the model include build profile, estimate profile, and update profile. The estimate operator is classified by the relational algebra operator (select, project, join), the property to be estimated (cardinality, distribution of values, and other parameters), and the underlying method (parametric, nonparametric, and ad-hoc). The accuracy, overhead, and assumptions of methods are discussed in detail. Relevant research in both the database and the statistics disciplines is incorporated in the detailed discussion.
- APERS, P. M. G., HEV~ER, A. R., ^NO Y^O, S. B. 1983. Optimization algorithms for distributed queries. IEEE Trans. Softw. Eng. SE-9, 1, 57-68.]]Google Scholar
- ASTRAHAN, M., SCHKOLNICK, M., AND WHANG, K. 1985. Counting unique values of an attribute without sorting. Tech. Rep. RJ 4960, IBM Research Division.]]Google Scholar
- BATORY, D., AND MANNINO, M. 1986. Panel on extensible database systems. In Proceedings of the ACM SIGMOD Conference (Washington, D.C., May). ACM, New York, pp. 187-190.]] Google Scholar
- BERNSTEIN, P., GOODMAN, N., WONG, E., REEVE, C., AND ROTHNIE, J. 1981. Query processing in SDD-1: A system for distributed databases. ACM Trans. Database Syst. 6, 4 (Dec.), 602-625.]] Google Scholar
- BREIMAN, L., MEISEL, W., AND PURCELL, E. 1977. Variable kernel estimates of multivariate densities. Technometrics 19, 135-144.]]Google Scholar
- CACOULLOS, V., 1966. Estimation of a multivariate density. Ann. Inst. Stat. Math. 18, 178-189.]]Google Scholar
- CARDENAS, A. 1975. Analysis and performance of inverted data-base structures. Commun. A CM 18, 5 (May), 253-263.]] Google Scholar
- CERI, S., AND PELAGATTI, G. 1984. Distributed Databases: Principles & Systems. McGraw-Hill, New York.]] Google Scholar
- CHAMBERLIN, D., ASTRAHAN, M., KING, W., LORiE, R., MEHL, J., PRICE, T., SCHKOLNICK, M., GRIFFITHS SELINGER, P., SLUTZ, D., WADE, B., AND YOST, R. 1981. Support for repetitive transactions and ad hoc queries in system R. ACM Trans. Database Syst. 6, i (Mar.), 70-94.]] Google Scholar
- CHEUNG, T. 1982. Estimating block accesses and number of records in file management. Commun. ACM 25, 7 (July), 484-487.]] Google Scholar
- CHRISTODOULAKIS, S. 1981. Estimating selectivities in data bases. Ph.D. dissertation, CSRG-136, Computer Systems Research Group, Univ. of Toronto.]] Google Scholar
- CHRISTODOULAKIS, S. 1983a. Estimating record selectivities. Inf. Syst. 8, 2 105-115.]]Google Scholar
- CHRISTODOULAKIS, $. 1983b. Estimating block transfers and join sizes. In Proceedings of the ACM SIGMOD Conference (May). ACM, New York, pp. 40-54.]] Google Scholar
- CHRISTODOULAKIS, $. 1984a. Estimating block selectivities. Inf. Syst. 9, 1, 69-79.]]Google Scholar
- CHRISTODOULAKIS, $. 1984b. Implications of certain assumptions in database performance evaluation. ACM Trans. Database Syst. 9, 2 (June), 163-186.]] Google Scholar
- CLARK, C., AND SCHKADE, L. 1983. StatisticalAnatysis for Administrative Decisions. South-Western, Cincinnati.]]Google Scholar
- COPELAND, G., KHOSHAFIAN, $., SMITH, M., AND VALDURIEZ, P. 1986. Buffering schemes for permanent data. In Proceedings of the Conference on Data Engineering (COMPDEC) (Los Angeles, Calif., Feb.). IEEE, New York, pp. 214-221.]] Google Scholar
- DEVROYE, L. 1985. A note on the L~ consistency of variable kernel estimates. Ann. Star. 13, 1041-1049.]]Google Scholar
- DEWITT, D., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D. 1984. Implementation techniques for main memory database systems. In Proceedings of the A CM SIG- MOD Conference (Boston, Mass. June). ACM, New York, pp. 1-8.]] Google Scholar
- FEDOROWICZ, J. 1984. Database evaluation using multiple regression techniques. In Proceedings of the ACM SIGMOD Conference (Boston, Mass., June). ACM, New York, pp. 70-76.]] Google Scholar
- FEDOROWICZ, J. 1987. Database performance evaluation in an indexed file environment. A CM Trans. Database Syst. 12, i (Mar.), 85-110.]] Google Scholar
- FINKELSTEIN, $., SCHKOLNICK, M., AND TIBERIO, P. 1988. Physical database design for relational databases. A CM Trans. Database Syst. 13, 1 (Mar.), 91-128.]] Google Scholar
- FRASER, D. 1957. Nonparametric Methods in Statistics. Wiley, New York.]]Google Scholar
- GELENSE, E., AND GARDY, D. 1982. The size of projections of relations satisfying a functional dependency. In Proceedings of the 8th International Conference on Very Large Data Bases (Mexico City). Very Large Data Base Endowment, Saratoga, Calif., pp. 325-333.]] Google Scholar
- HEVNER, A., AND YAO, S. B. 1979. Query processing in distributed database systems. IEEE Trans. Softw. Eng. SE-3, 3.]]Google Scholar
- IJBEMA, A., ANO BLANKEN, H. 1986. Estimating bucket accesses: A practical approach. In Proceedings of the Conference on Data Engineering (COMPDEC) (Los Angeles, Calif., Feb.), pp. 30-37.]] Google Scholar
- JARKE, M., ANO KOCH, J. 1984. Query optimization in database systems. A CM Comput. Surv. 16, 2 (June), 111-152.]] Google Scholar
- JOHNSON, N., AND KOTZ, S. 1970. Distributions in Statistics: Continuous Univariate Distributions, vols. I and 2. Houghton Mifflin, Boston.]]Google Scholar
- KAMEL, N., AND KING, R. 1985. A model of data distribution based on texture analysis. In Proceedings of the ACM SIGMOD Conference (Austin, Tex., May). ACM, New York, pp. 319-325.]] Google Scholar
- KERSCHBERG, L. TING, P. D., AND YAO, S. B. 1982. Query optimization in star computer networks. A CM Trans. Database Syst. 7, 4 (Dec.), 678-711.]] Google Scholar
- KOOI, R. 1980. The optimization of queries in relational databases. Ph.D. dissertation, Case Western Reserve Univ., Cleveland, Ohio.]] Google Scholar
- KOTZ, S., AND JOHNSON, N. 1977. Urn Models and Their Application. Wiley, New York.]]Google Scholar
- KUMAR, A., AND STONEBRAKER, M. 1987. The effect of join selectivities on optimal nesting order. SIGMOD Rec. 16, i (Mar.), 28-41.]] Google Scholar
- LOHMAN, G., MOHAN, C., HAAS, L., LINDSAY, B., SELINGER, P., WILMS, P., AND DANIELS, D. 1985. Query processing in R*. In Query Processing in Database Systems, W. Kim, D. Batory, and D. Reiner, Eds. Springer-Verlag, New York, pp. 31-47.]]Google Scholar
- LUK, W. 1983. On estimating block accesses in database organizations. Commun. A CM 26, 11 (Nov.), 945-947.]] Google Scholar
- MACKERT, L., AND LOHMAN, G. 1985. Index scans using a finite LRU buffer: A validated I/O model. IBM Research Rep. RJ4836, Almaden Research Center, San Jose, Calif.]]Google Scholar
- MACKERT, L., AND LOHMAN, G. 1986a. R* optimizer validation and performance evaluation for local queries. In Proceedings of the A CM SIGMOD Conference (Washington, D.C., May). ACM, New York, pp. 84-95.]] Google Scholar
- MACKERT, L., AND LOHMAN, G. 1986b. R* Optimizer validation and performance evaluation for distributed queries. In Proceedings of the 12th International Conference on Very Large Databases (Kyoto, Japan, Aug.). Morgan Kaufmann Publishers, Inc. (Also IBM Research Report RJ5050, Alamaden Research Center, San Jose, Calif.)]] Google Scholar
- MAHALANOBIS, P. 1936. On the generalized distance in statistics. In Proceedings of the National Institute of Sciences of India 12, pp. 35-49.]]Google Scholar
- MANNINO, M. 1986. Selectivity estimation in unify SQL. Tech. Rep. Dept. of Management Science and Information Systems, Univ. of Texas, Austin.]]Google Scholar
- MANNINO, M., AND RIVERA, A. 1988. An extensible model of selectivity estimation. Inf. Sci.-An International Journal. Special issue on database systems to appear Spring 1988.]] Google Scholar
- MERRETT, T. H., AND OTOO, E. 1979. Distribution models of relations. In Proceedings of the 5th International Conference on Very Large Data Bases (Rio de Janeiro, Brazil, Oct.). ACM, New York, pp. 418-425.]]Google Scholar
- MONTGOMERY, A., D'SOuzA, D., AND LEE, S. 1983. The cost of relational algebraic operations on skewed data: Estimates and experiments. In Information Processing Letters 83. Elsevier North-Holland, New York, pp. 235-241.]]Google Scholar
- MOORE, D., AND YACKEL, J. 1977. Consistency properties of nearest neighbor density function estimates. Ann. Stat. 5, 143-154.]]Google Scholar
- MURALIKRISHNA, M., AND DEWlTT, D. 1988. Equidepth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of the ACM SIGMOD Conference (Chicago, Ill., June). ACM, New York, pp. 28-36.]] Google Scholar
- MUTHUSWAMY, B., AND KERSCHBERG, G. 1985. A DDSM for relational query optimization. Tech. Rep., Univ. of South Carolina, Columbia. Also in Proceedings of the ACM Annual Conference (Denver, Colo., Oct.). ACM, New York.]] Google Scholar
- PIATETSKY-SHAPIRO, G., AND CONNELL, G. 1984. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of the ACM SIGMOD Conference (Boston, Mass., June). ACM, New York, pp. 256-276.]] Google Scholar
- PIATETSKY-SHAPIRO, G. 1985. Estimating the number of distinct attribute values by using sampling. Submitted for publication.]]Google Scholar
- ROWE, N. 1985. Antisampling for estimation: An overview. IEEE Trans. Softw. Eng. 11, 10 (Oct.)., 1081-1091.]] Google Scholar
- SAMSON, W., AND BENDELL, A. 1983. Rank order distributions and secondary key indexing (extended abstract). In Proceedings of the 2nd International Conference on Databases, (Cambridge, England).]]Google Scholar
- SELINGER, P., ASTRAHAN, M., CHAMBERLIN, D., LORIE, R., ANO PRICE, T. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD Conference (San Jose, Calif.). ACM, New York, pp. 23-34.]] Google Scholar
- SCHKOLNICK, M., AND TIBERIO, P. 1979. Considerations in developing a design tool for a relational DBMS. In Proceedings of the IEEE COMPSAC Conference (Chicago, Ill., Nov.). IEEE Press, New York, pp. 228-235.]]Google Scholar
- STONEBRAKER, M., RUBENSTEIN, B., AND GUTMANN, A. 1983. Application of abstract data types and abstract indices to CAD databases. In Proceedings of Database Week: Engineering Design Applications (San Jose, Calif., May). pp. 107-113.]]Google Scholar
- STUROES, H. 1926. The choice of class interval. J. Am. Stat. Assoc. 65-66.]]Google Scholar
- TAPIA, R., AND THOMPSON, J. 1978. Nonparametric Probability Density Estimation. John Hopkins University Press, Baltimore, Md.]]Google Scholar
- TEOREY, T. ANO FRY, J. 1982. Design of Database Structures. Prentice-Hall, Englewood, Cliffs, N.J.]] Google Scholar
- UNIFY CORPORATION. 1985. Unix Relational Database Management System--Reference Manual. Release 3.2. Portland, Oreg.]]Google Scholar
- VANDER ZANDER, B., TAYLOR, H., AND BITON, D. 1986. Estimating block accesses when attributes are correlated. In Proceedings of the 12th International Conference on Very Large Databases (Kyoto, Japan, Aug.). Morgan Kaufmann Publishers, Inc., pp. 119-127.]] Google Scholar
- WEOMAN, E. 1983. Density estimation. In Encyclopedia of Statistical Sciences, Vol. 2, S. Kotz and N. Johnson, Eds. Wiley, New York.]]Google Scholar
- WERTZ, W. 1978. Statistical Density Estimation: A Survey. Vandenhoeck and Ruprecht, Gottingen.]]Google Scholar
- WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. 1983. Estimating block accesses in database organizations: A closed noniterative formula. Commun. ACM 26, 11 (Nov.), 940-944.]] Google Scholar
- YAO, S. 1977. Approximating block accesses in database organizations. Commun. A CM 20, 4 (Apr.), 260-261.]] Google Scholar
- ZAHORJAN, J., BELL. B., AND SEVCIK, K. 1983. Estimating block transfers when record access probabilities are non-uniform. Inf. Process. Lett. 16, 5 (June), 249-252.]]Google Scholar
- ZIPF, G. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, Mass.]]Google Scholar
Recommendations
Statistical relational tables for statistical database management
This paper extends Codd's relational view to represent statistical data and to achieve the efficient analysis of statistical data. It discusses why the relational calculus has not been popular with statisticians. A new view called a statistical ...
Statistical Database Query Languages
Databases that are mainly used for statistical analysis are called statistical databases (SDB). A statistical database management system (SDBMS) may be defined as a database management system that provides capabilities 1) to model, store, and manipulate ...
Comments