Abstract
We study the problem of economical representation of subsets of structured sets, which are sets equipped with a set cover or a family of preorders. Given a structured set U, and a language L whose expressions define subsets of U, the problem of minimum description length in L (L-MDL) is: “given a subset V of U, find a shortest string in L that defines V.” Depending on the structure and the language, the MDL-problem is in general intractable. We study the complexity of the MDL-problem for various structures and show that certain specializations are tractable. The families of focus are hierarchy, linear order, and their multidimensional extensions; these are found in the context of statistical and OLAP databases. In the case of general OLAP databases, data organization is a mixture of multidimensionality, hierarchy, and ordering, which can also be viewed naturally as a cover-structured ordered set. Efficient algorithms are provided for the MDL-problem for hierarchical and linearly ordered structures, and we prove that the multidimensional extensions are NP-complete. Finally, we illustrate the application of the theory to summarization of large result sets and (multi) query optimization for ROLAP queries.
- Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of SIGMOD 1998. ACM Press, New York, NY, 94--105.]] Google ScholarDigital Library
- Agrawal, R., Gupta, A., and Sarawagi, S. 1997. Modeling multidimensional databases. In Proceedings of ICDE 1997. 232--243.]] Google ScholarDigital Library
- Batchelor, B. 1980. Hierarchical shape description based on convex hulls of concavities. J. Cybernet. 10, 205--210.]]Google ScholarCross Ref
- Cabibbo, L. and Torlone, R. 1997. Querying multidimensional databases. In Proceedings of the 6th DBPL Workshop. 319--335.]] Google ScholarDigital Library
- Cabibbo, L. and Torlone, R. 1998. A logical approach to multidimensional databases. In Proceedings of EDBT 1998. 183--197.]] Google ScholarDigital Library
- Edmonds, J., Gryz, J., Liang, D., and Miller, R. J. 2001. Mining for empty rectangles in large data sets. In Proceedings of ICDT 2001. 174--188.]] Google ScholarDigital Library
- Garey, M. R. and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York, NY.]] Google ScholarDigital Library
- Gyssens, M. and Lakshmanan, L. 1997. A foundation for multi-dimensional databases. In Proceedings of VLDB 1997. 106--115.]] Google ScholarDigital Library
- Hansen, M. H. and Yu, B. 2001. Model selection and the principle of minimum description length. J. Amer. Statist. Assoc. 96, 454, 746--774.]]Google ScholarCross Ref
- Hurtado, C. and Mendelzon, A. 2002. OLAP dimensional constraints. In Proceedings of PODS 2002. 169--179.]] Google ScholarDigital Library
- Kalnis, P. and Papadias, D. 2001. Optimization Algorithms for Simultaneous Multidimensional Queries in OLAP Environments. Lecture Notes in Computer Science, vol. 2114. Springer-Verlag, Berlin, Germany, 264--273.]] Google ScholarDigital Library
- Keil, J. 1999. Polygon decomposition. In Handbook of Computational Geometry. Elsevier Sciences, Amsterdem, The Netherlands, Chap. 11, 491--518.]]Google Scholar
- Kimball, R. 1996. The Data Warehouse Toolkit. Wiley, New York, NY.]]Google Scholar
- Kumar, V. S. A. and Ramesh, H. 1999. Covering rectilinear polygons with axis-parallel rectangles. In Proceedings of the ACM Symposium on Theory of Computing 1999. ACM Press, New York, NY, 445--454.]] Google ScholarDigital Library
- Lakshmanan, L., Ng, R. T., Wang, C. X., Zhou, X., and Johnson, T. J. 1999. The generalized MDL approach for summarization. In Proceedings of VLDB 1999. 445--454.]]Google Scholar
- Lam, W. and Bacchus, F. 1994. Learning Bayesian belief networks: An approach based on the MDL principle. Comput. Intel. 10, 269--293.]]Google ScholarCross Ref
- Levcopoulos, C. and Gudmundsson, J. 1997. Approximation algorithms for covering polygons with squares and similar problems. In Proceedings of the International Workshop on Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science, vol. 1269. Springer, Berlin, Germany, 27--41.]] Google ScholarDigital Library
- Liang, W. and Orlowska, M. 2000. Optimizing multiple dimensional queries simultaneously in multidimensional databases. VLDB J. 8, 319--338.]] Google ScholarDigital Library
- Lodi, E., Luccio, F., Mugnai, C., and Pagli, L. 1979. On two-dimensional data organization I. Fundam. Inform. 2, 211--226.]]Google Scholar
- Park, C., Kim, M., and Lee, Y. 2001. Rewriting OLAP queries using materialized views and dimension hierarchies in data warehouses. In Proceedings of ICDE 2001. 515--523.]] Google ScholarDigital Library
- Tor, S. and Middleditch, A. 1984. Convex decomposition of simple polygons. ACM Trans. Graph. 3, 244--265.]] Google ScholarDigital Library
- Zhao, Y., Deshpande, P., Naughton, J., and Shukla, A. 1998. Simultaneous optimization and evaluation of multiple dimensional queries. In Proceedings of SIGMOD 1998. 271--282.]] Google ScholarDigital Library
Index Terms
- Concise descriptions of subsets of structured sets
Recommendations
Concise descriptions of subsets of structured sets
PODS '03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsWe study the problem of economical representation of subsets of structured sets, that is, sets equipped with a set cover. Given a structured set U, and a language L whose expressions define subsets of U, the problem of Minimum Description Length in L (L-...
Characterizations and Existence of Easy Sets without Hard Subsets
Theory that Counts: To Oscar Ibarra on His 70th BirthdayThis paper introduces and studies two notions of easy sets without hard subsets: i) 𝒞-hollow sets are defined to be sets in P that have no 𝒞 - P subsets for (presumably) superclasses 𝒞 of P such as NP, PSPACE, E, NE, RE, etc.; and ii) 𝒞-scant sets ...
Dense Subsets of Pseudorandom Sets
FOCS '08: Proceedings of the 2008 49th Annual IEEE Symposium on Foundations of Computer ScienceA theorem of Green, Tao, and Ziegler can be stated (roughly) as follows: if$R$ is a pseudorandom set, and $D$ is a dense subset of $R$, then $D$ maybe modeled by a set $M$ that is dense in the entire domain such that $D$and $M$ are indistinguishable. (...
Comments