ABSTRACT
Data integration is the problem of combining data residing at different sources and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real-world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents an overview of the material to be covered in a tutorial on data integration. The tutorial focuses on some of the theoretical issues that are relevant for data integration. Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.