Abstract
We establish a principled schema design framework for data with missing values. The framework is based on the new notion of an embedded functional dependency, which is independent of the interpretation of missing values, able to express completeness and integrity requirements on application data, and capable of capturing redundant data value occurrences that may cause problems with processing data that meets the requirements. We establish axiomatic, algorithmic, and logical foundations for reasoning about embedded functional dependencies. These foundations enable us to introduce generalizations of Boyce-Codd and Third normal forms that avoid processing difficulties of any application data, or minimize these difficulties across dependency-preserving decompositions, respectively. We show how to transform any given schema into application schemata that meet given completeness and integrity requirements, and the conditions of the generalized normal forms. Data over those application schemata are therefore fit for purpose by design. Extensive experiments with benchmark schemata and data illustrate the effectiveness of our framework for the acquisition of the constraints, the schema design process, and the performance of the schema designs in terms of updates and join queries.
- Serge Abiteboul, Marcelo Arenas, Pablo Barceló, Meghyn Bienvenu, Diego Calvanese, Claire David, Richard Hull, Eyke Hüllermeier, Benny Kimelfeld, Leonid Libkin, Wim Martens, Tova Milo, Filip Murlak, Frank Neven, Magdalena Ortiz, Thomas Schwentick, Julia Stoyanovich, Jianwen Su, Dan Suciu, Victor Vianu, and Ke Yi. 2018. Research directions for principles of data management. Dagstuhl Manifestos 7, 1 (2018), 1–29.Google Scholar
- Marcelo Arenas. 2006. Normalization theory for XML. SIGMOD Rec. 35, 4 (2006), 57–64.Google ScholarDigital Library
- William Armstrong. 1974. Dependency structures of data base relationships. In Proceedings of the 6th IFIP Congress on Information Processing, Jack L. Rosenfeld (Ed.). North-Holland, 580–583.Google Scholar
- Paolo Atzeni and Nicola M. Morfuni. 1986. Functional dependencies and constraints on null values in database relations. Info. Control 70, 1 (1986), 1–31.Google ScholarDigital Library
- Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. Methodologies for data quality assessment and improvement. ACM Comput. Surv. 41, 3 (2009), 16:1–16:52.Google Scholar
- Carlo Batini and Andrea Maurino. 2018. Design for data quality. In Encyclopedia of Database Systems, 2nd ed. Springer.Google Scholar
- Catriel Beeri and Philip A. Bernstein. 1979. Computational problems related to the design of normal form relational schemas. ACM Trans. Database Syst. 4, 1 (1979), 30–59.Google ScholarDigital Library
- Philip A. Bernstein. 1976. Synthesizing third normal form relations from functional dependencies. ACM Trans. Database Syst. 1, 4 (1976), 277–298.Google ScholarDigital Library
- Joachim Biskup. 1998. Achievements of relational database schema design theory revisited. In Semantics in Databases (Lecture Notes in Computer Science), Leonid Libkin and Bernhard Thalheim (Eds.), Vol. 1358. Springer, 29–54.Google Scholar
- Joachim Biskup, Umeshwar Dayal, and Philip A. Bernstein. 1979. Synthesizing independent database schemas. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Philip A. Bernstein (Ed.). 143–151.Google Scholar
- Loreto Bravo, Wenfei Fan, Floris Geerts, and Shuai Ma. 2008. Increasing the expressivity of conditional functional dependencies without extra complexity. In Proceedings of the 24th International Conference on Data Engineering (ICDE’08). 516–525.Google ScholarDigital Library
- Peter P. Chen. 1976. The entity-relationship model—Toward a unified view of data. ACM Trans. Database Syst. 1, 1 (1976), 9–36.Google ScholarDigital Library
- Edgar F. Codd. 1972. Further normalization of the database relational model. In Proceedings of the Courant Computer Science Symposia 6: Data Base Systems. 33–64.Google Scholar
- C. J. Date and Ronald Fagin. 1992. Simple conditions for guaranteeing higher normal forms in relational databases. ACM Trans. Database Syst. 17, 3 (1992), 465–476.Google ScholarDigital Library
- William F. Dowling and Jean H. Gallier. 1984. Linear-time algorithms for testing the satisfiability of propositional Horn formulae. J. Log. Program. 1, 3 (1984), 267–284.Google ScholarCross Ref
- N. A. Emran. 2015. Data completeness measures. In Pattern Analysis, Intelligent Security and the Internet of Things (Advances in Intelligent Systems and Computing), Vol. 355. Springer, 117–130.Google Scholar
- Ronald Fagin. 1977. The decomposition versus synthetic approach to relational database design. In Proceedings of the Third International Conference on Very Large Data Bases. IEEE, 441–446.Google ScholarDigital Library
- Ronald Fagin. 1977. Multivalued dependencies and a new normal form for relational databases. ACM Trans. Database Syst. 2, 3 (1977), 262–278.Google ScholarDigital Library
- Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. 2008. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33, 2 (2008).Google ScholarDigital Library
- Patrick C. Fischer, Lawrence V. Saxton, Stan J. Thomas, and Dirk Van Gucht. 1985. Interactions between dependencies and nested relational structures. J. Comput. Syst. Sci. 31, 3 (1985), 343–354.Google ScholarDigital Library
- Sergio Greco, Cristian Molinaro, and Francesca Spezzano. 2012. Incomplete Data and Data Dependencies in Relational Databases. Morgan & Claypool Publishers.Google ScholarDigital Library
- Paolo Guagliardo and Leonid Libkin. 2017. Correctness of SQL queries on databases with nulls. SIGMOD Rec. 46, 3 (2017), 5–16.Google ScholarDigital Library
- Sven Hartmann and Sebastian Link. 2012. The implication problem of data dependencies over SQL table definitions: Axiomatic, algorithmic and logical characterizations. ACM Trans. Database Syst. 37, 2 (2012), 13:1–13:40.Google ScholarDigital Library
- I. J. Heath. 1971. Unacceptable file operations in a relational data base. In Proceedings of the ACM-SIGFIDET Workshop on Data Description, Access and Control, E. F. Codd and A. L. Dean (Eds.). ACM, 19–33.Google ScholarDigital Library
- Tomasz Imielinski and Witold Lipski Jr.1983. Incomplete information and dependencies in relational databases. In Proceedings of Annual Meeting of the Special Interest Group on Management of Data (SIGMOD’83). 178–184.Google Scholar
- Christian S. Jensen, Richard T. Snodgrass, and Michael D. Soo. 1996. Extending existing dependency theory to temporal databases. IEEE Trans. Knowl. Data Eng. 8, 4 (1996), 563–582.Google ScholarDigital Library
- Neil D. Jones and William T. Laaser. 1976. Complete problems for deterministic polynomial time. Theor. Comput. Sci. 3, 1 (1976), 105–117.Google ScholarCross Ref
- Henning Köhler and Sebastian Link. 2016. SQL schema design: Foundations, normal forms, and normalization. In Proceedings of the International Conference on Management of Data (SIGMOD’16), Fatma Özcan, Georgia Koutrika, and Sam Madden (Eds.). ACM, 267–279.Google ScholarDigital Library
- Henning Köhler and Sebastian Link. 2018. SQL schema design: Foundations, normal forms, and normalization. Info. Syst. 76 (2018), 88–113.Google Scholar
- S. Kolahi. 2007. Dependency-preserving normalization of relational and XML data. J. Comput. Syst. Sci. 73, 4 (2007), 636–647.Google ScholarDigital Library
- Solmaz Kolahi and Leonid Libkin. 2010. An information-theoretic analysis of worst-case redundancy in database design. ACM Trans. Database Syst. 35, 1 (2010), 5:1–5:32.Google ScholarDigital Library
- Mark Levene and George Loizou. 1998. Axiomatisation of functional dependencies in incomplete relations. Theor. Comput. Sci. 206, 1–2 (1998), 283–300.Google ScholarDigital Library
- Mark Levene and George Loizou. 1999. A Guided Tour of Relational Databases and Beyond. Springer.Google Scholar
- Mark Levene and Millist W. Vincent. 2000. Justification for inclusion dependency normal form. IEEE Trans. Knowl. Data Eng. 12, 2 (2000), 281–291.Google ScholarDigital Library
- Y. Edmund Lien. 1979. Multivalued dependencies with null values in relational data bases. In Proceedings of the 5th International Conference on Very Large Data Bases. 61–66.Google ScholarCross Ref
- Y. Edmund Lien. 1982. On the equivalence of database models. J. ACM 29, 2 (1982), 333–362.Google ScholarDigital Library
- Sebastian Link and Henri Prade. 2019. Relational database schema design for uncertain data. Info. Syst. 84 (2019), 88–110.Google Scholar
- David Maier. 1983. The Theory of Relational Databases. Computer Science Press.Google ScholarDigital Library
- Wai Yin Mok. 2016. Utilizing nested normal form to design redundancy free JSON schemas. Int. J. Recent Contributions Eng. Sci. IT 4, 4 (2016), 21–25.Google ScholarCross Ref
- Wai Yin Mok, Yiu-Kai Ng, and David W. Embley. 1996. A normal form for precisely characterizing redundancy in nested relations. ACM Trans. Database Syst. 21, 1 (1996), 77–106.Google ScholarDigital Library
- Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In Proceedings of the International Conference on Management of Data (SIGMOD’16). 821–833.Google ScholarDigital Library
- Thorsten Papenbrock and Felix Naumann. 2017. Data-driven schema normalization. In Proceedings of the 20th International Conference on Extending Database Technology (EDBT’17), V. Markl, S. Orlando, B. Mitschang, P. Andritsos, K.-U. Sattler, and S. Breß (Eds.). OpenProceedings.org, 342–353.Google Scholar
- Raghu Ramakrishnan and Johannes Gehrke. 2003. Database Management Systems. McGraw-Hill.Google ScholarDigital Library
- Jorma Rissanen. 1977. Independent components of relations. ACM Trans. Database Syst. 2, 4 (1977), 317–325.Google ScholarDigital Library
- Mark A. Roth and Henry F. Korth. 1987. The design of 1NF relational databases into nested normal form. In Proceedings of the International Conference on Management of Data (SIGMOD’87). 143–159.Google Scholar
- Shazia W. Sadiq (Ed.). 2013. Handbook of Data Quality, Research and Practice. Springer.Google Scholar
- Zahir Tari, John Stokes, and Stefano Spaccapietra. 1997. Object normal forms and dependency constraints for object-oriented schemata. ACM Trans. Database Syst. 22, 4 (1997), 513–569.Google ScholarDigital Library
- Millist W. Vincent. 1997. A corrected 5NF definition for relational database design. Theor. Comput. Sci. 185, 2 (1997), 379–391.Google ScholarDigital Library
- Millist W. Vincent. 1999. Semantic foundations of 4NF in relational database design. Acta Inf. 36, 3 (1999), 173–213.Google ScholarCross Ref
- Millist W. Vincent and Mark Levene. 2000. Restructuring partitioned normal form relations without information loss. SIAM J. Comput. 29, 5 (2000), 1550–1567.Google ScholarDigital Library
- Ziheng Wei, Sven Hartmann, and Sebastian Link. 2020. Discovery algorithms for embedded functional dependencies. In Proceedings of the International Conference on Management of Data (SIGMOD’20). 833–843.Google ScholarDigital Library
- Ziheng Wei, Uwe Leck, and Sebastian Link. 2019. Discovery and ranking of embedded uniqueness constraints. Proc. Very Large Data Base 12, 13 (2019), 2339–2352.Google Scholar
- Ziheng Wei, Uwe Leck, and Sebastian Link. 2019. Entity integrity, referential integrity, and query optimization with embedded uniqueness constraints. In Proceedings of the 35th IEEE International Conference on Data Engineering (ICDE’19),. 1694–1697.Google ScholarCross Ref
- Ziheng Wei and Sebastian Link. 2018. Embedded cardinality constraints. In Proceedings of the 30th International Conference on Advanced Information Systems Engineering (CAiSE’18) (Lecture Notes in Computer Science), John Krogstie and Hajo A. Reijers (Eds.), Vol. 10816. Springer, 523–538.Google ScholarDigital Library
- Ziheng Wei and Sebastian Link. 2019. Embedded functional dependencies and data-completeness tailored database design. Proc. Very Large Data Base 12, 11 (2019), 1458–1470.Google Scholar
- Ziheng Wei, Sebastian Link, and Jiamou Liu. 2017. Contextual keys. In Proceedings of the 36th International Conference on Conceptual Modeling (ER’17) (Lecture Notes in Computer Science), Heinrich C. Mayr, Giancarlo Guizzardi, Hui Ma, and Oscar Pastor (Eds.), Vol. 10650. Springer, 266–279.Google ScholarCross Ref
- C. Zaniolo. 1984. Database relations with null values. J. Comput. Syst. Sci. 28, 1 (1984), 142–166.Google ScholarCross Ref
Index Terms
- Embedded Functional Dependencies and Data-completeness Tailored Database Design
Recommendations
Embedded functional dependencies and data-completeness tailored database design
We establish a robust schema design framework for data with missing values. The framework is based on the new notion of an embedded functional dependency, which is independent of the interpretation of missing values, able to express completeness and ...
Database design with equality-generating dependencies
DASFAA'05: Proceedings of the 10th international conference on Database Systems for Advanced ApplicationsIn relational database systems, traditional normalization techniques (eg, BCNF, 4NF) remove data redundancies from a single relation, but can not detect and remove redundancies across multiple relations. However, redundancies among multiple relations ...
Multivalued dependencies and a new normal form for relational databases
A new type of dependency, which includes the well-known functional dependencies as a special case, is defined for relational databases. By using this concept, a new (“fourth”) normal form for relation schemata is defined. This fourth normal form is ...
Comments