DOI: 10.1145/2361354.2361374
research-article

A first approach to the automatic recognition of structural patterns in XML documents

Published: 4 September 2012

ABSTRACT

XML is among the preferred formats for storing the structure of documents such as scientific articles, manuals, documentation, literary works, etc. Sometimes publishers adopt established and well-known vocabularies such as DocBook and TEI; other times they create partially or entirely new ones that better address the particular requirements of their documents. The (explicit and implicit) requirements of use in these vocabularies often follow well-established patterns, creating meta-structures (the block, the container, the inline element, etc.) that persist across vocabularies and authors and that describe a truer and more general conceptualization of the documents' building blocks. Addressing such meta-structures not only gives better insight into what documents are really composed of, but also provides abstract and more general mechanisms to work on documents regardless of the availability of specific schemas, tools and presentation stylesheets. In this paper we introduce a schema-independent theory based on eleven structural patterns. We define these patterns and show how they synthesize characteristics emerging from real markup documents. Additionally, we propose an algorithm that identifies the pattern of each element in a set of homogeneous markup documents.
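
To make the idea of schema-independent classification concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It does not implement the eleven patterns, which the abstract does not enumerate. It only assigns each element name one of four coarse content categories (empty, text-only, element-only, mixed) by observing how the element is used across a set of documents, without consulting any schema. The function name `content_categories` and the category labels are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def content_categories(xml_strings):
    """Map each element name to a coarse content category.

    Hypothetical helper: the categories are an illustration only,
    not the structural patterns defined in the paper.
    """
    has_text = defaultdict(bool)
    has_children = defaultdict(bool)

    for source in xml_strings:
        root = ET.fromstring(source)
        for elem in root.iter():
            children = list(elem)
            # Text directly inside the element: its leading text
            # plus the tail text that follows each child.
            chunks = [elem.text] + [child.tail for child in children]
            text_here = any(chunk and chunk.strip() for chunk in chunks)
            # Accumulate evidence over every occurrence of this name.
            has_text[elem.tag] |= text_here
            has_children[elem.tag] |= bool(children)

    labels = {
        (False, False): "empty",
        (True, False): "text-only",
        (False, True): "element-only",
        (True, True): "mixed",
    }
    return {tag: labels[(has_text[tag], has_children[tag])] for tag in has_text}


# Usage on a tiny, made-up document set.
docs = ["<doc><sec><p>Some <i>inline</i> text.</p><br/></sec></doc>"]
for name, category in sorted(content_categories(docs).items()):
    print(name, category)
```

Aggregating over all occurrences of a name, rather than classifying each occurrence on its own, reflects the abstract's point that patterns are identified for elements across a set of homogeneous documents. A finer classification would need more evidence than these two boolean signals.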


        • Published in

          DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering
          September 2012
          256 pages
          ISBN: 9781450311168
          DOI: 10.1145/2361354

          Copyright © 2012 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 4 September 2012


          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate: 178 of 537 submissions, 33%
