ABSTRACT
This paper proposes a novel approach to integrating heterogeneous XML DTDs. With this approach, an information agent can be easily extended to integrate heterogeneous XML-based content and perform federated search. Based on a tree grammar inference technique, the approach derives an integrated view of XML DTDs within an information integration framework. The derivation takes advantage of naming and structural similarities among DTDs in similar domains. The complete approach consists of three main steps. (1) DTD clustering groups DTDs from similar domains into classes. (2) Schema learning applies a tree grammar inference technique to generate a set of tree grammar rules from the DTDs in each class produced by the previous step. (3) Minimization optimizes the rules generated in the previous step and transforms them into an integrated view. We have implemented the proposed approach in a system called DEEP and tested the system on artificial and real domains. The experimental results reveal that this system can effectively and efficiently integrate radically different DTDs.
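As a rough illustration of step (1) only, the sketch below groups DTDs whose declared element names overlap. It is not the method used in DEEP: the abstract does not specify the similarity measure or the clustering algorithm, so the Jaccard measure, the greedy grouping pass, the 0.5 threshold, and the names `element_names`, `jaccard`, `cluster_dtds`, and `DTDS` are all assumptions made for this example. Structural similarity, which the abstract also mentions, is ignored here.

```python
import re

def element_names(dtd_text):
    """Return the set of element names declared in a DTD string."""
    return set(re.findall(r"<!ELEMENT\s+([\w.:-]+)", dtd_text))

def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_dtds(dtds, threshold=0.5):
    """Greedy pass (an assumed stand-in for the paper's clustering step):
    each DTD joins the first class that has a member sharing enough
    element names with it, otherwise it starts a new class."""
    names = {key: element_names(text) for key, text in dtds.items()}
    classes = []
    for key in dtds:
        for cls in classes:
            if any(jaccard(names[key], names[m]) >= threshold for m in cls):
                cls.append(key)
                break
        else:
            classes.append([key])
    return classes

# Hypothetical toy DTDs: two bookstore schemas and one weather schema.
DTDS = {
    "book_a": """
        <!ELEMENT book (title, author, price)>
        <!ELEMENT title (#PCDATA)>
        <!ELEMENT author (#PCDATA)>
        <!ELEMENT price (#PCDATA)>
    """,
    "book_b": """
        <!ELEMENT book (title, author, publisher)>
        <!ELEMENT title (#PCDATA)>
        <!ELEMENT author (#PCDATA)>
        <!ELEMENT publisher (#PCDATA)>
    """,
    "weather": """
        <!ELEMENT report (city, temperature)>
        <!ELEMENT city (#PCDATA)>
        <!ELEMENT temperature (#PCDATA)>
    """,
}

if __name__ == "__main__":
    print(cluster_dtds(DTDS))
```

In this toy input the two bookstore DTDs share three of five distinct element names, so they fall into one class, while the weather DTD shares none and forms its own. Each resulting class would then be the input to the schema learning step.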
Index Terms
- Induction of integrated view for XML data with heterogeneous DTDs