Abstract
Querying data from presentation formats like HTML, for purposes such as information extraction, requires considering both the tree structure and the spatial relationships between laid-out elements. The rationale is that the tree structure behind a rendered page is often intricate and changes more frequently than the resulting layout. In this paper we therefore present Spatial XPath (SXPath), an extension of XPath 1.0 that adds spatial navigation primitives to the language, resulting in conceptually simpler queries on Web documents. The SXPath language is based on a combination of a spatial algebra with formal descriptions of XPath navigation, and maintains polynomial-time combined complexity. Practical experiments demonstrate the usability of SXPath.
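To make the idea of spatial navigation concrete, the following is a minimal sketch of a spatial relation evaluated over the bounding rectangles of laid-out elements. It is an illustration only and is not SXPath syntax or the paper's algebra: the `Box` class, the `is_west_of` predicate, its exact definition, and the sample coordinates are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding rectangle of a rendered element (hypothetical model)."""
    name: str
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge


def is_west_of(a: Box, b: Box) -> bool:
    """Assumed definition: a ends before b starts horizontally,
    and their vertical extents overlap."""
    return a.x2 <= b.x1 and a.y1 < b.y2 and b.y1 < a.y2


def west_neighbours(target: Box, boxes: list[Box]) -> list[str]:
    """Names of all boxes lying west of the target, in input order."""
    return [b.name for b in boxes if b is not target and is_west_of(b, target)]


# Hypothetical layout: a label next to a form field, and a footer below both.
boxes = [
    Box("label", 0, 0, 50, 20),
    Box("field", 60, 0, 160, 20),
    Box("footer", 0, 100, 160, 120),
]

print(west_neighbours(boxes[1], boxes))
```

A query of this kind depends only on where elements end up on the page, so it would keep working if the label and the field were moved into different branches of the document tree, which is the motivation given above.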
Index Terms
- SXPath: extending XPath towards spatial querying on web documents