A first approach to the automatic recognition of structural patterns in XML documents

Authors:
Angelo Di Iorio (University of Bologna, Bologna, Italy)
Silvio Peroni (University of Bologna, Bologna, Italy)
Francesco Poggi (University of Bologna, Bologna, Italy)
Fabio Vitali (University of Bologna, Bologna, Italy)

DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering, September 2012, Pages 85–94
https://doi.org/10.1145/2361354.2361374

Published: 4 September 2012

ABSTRACT

XML is among the preferred formats for storing the structure of documents such as scientific articles, manuals, documentation, and literary works. Sometimes publishers adopt established, well-known vocabularies such as DocBook and TEI; at other times they create partially or entirely new ones that better address the particular requirements of their documents. The (explicit and implicit) requirements of use in these vocabularies often follow well-established patterns, creating meta-structures (the block, the container, the inline element, etc.) that persist across vocabularies and authors and that describe a truer and more general conceptualization of the documents' building blocks. Addressing such meta-structures not only gives better insight into what documents are really composed of, but also provides abstract and more general mechanisms for working on documents regardless of the availability of specific schemas, tools and presentation stylesheets. In this paper we introduce a schema-independent theory based on eleven structural patterns. We define these patterns and show how they synthesize characteristics emerging from real markup documents. Additionally, we propose an algorithm that identifies the pattern of each element in a set of homogeneous markup documents.
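To make the idea of schema-independent meta-structures concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a much coarser classification than the eleven patterns: each element name is labelled only by whether its instances ever hold text and whether they ever hold child elements, aggregated over a set of documents. The function names (`content_flags`, `classify_elements`) and the four labels are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def content_flags(elem):
    """Return (has_text, has_children) for a single element instance."""
    has_children = len(elem) > 0
    # Text directly inside an element is its own .text plus the .tail
    # of each of its children (text following a child's closing tag).
    texts = [elem.text] + [child.tail for child in elem]
    has_text = any(t and t.strip() for t in texts)
    return has_text, has_children


def classify_elements(xml_sources):
    """Aggregate the flags per element name over all documents, then label."""
    seen = defaultdict(lambda: [False, False])
    for source in xml_sources:
        root = ET.fromstring(source)
        for elem in root.iter():
            has_text, has_children = content_flags(elem)
            seen[elem.tag][0] |= has_text
            seen[elem.tag][1] |= has_children
    # Illustrative labels only; these are not the paper's pattern names.
    labels = {
        (False, False): "empty",
        (True, False): "text-only",
        (False, True): "elements-only",
        (True, True): "mixed",
    }
    return {tag: labels[tuple(flags)] for tag, flags in seen.items()}


# Two small hypothetical documents sharing one vocabulary.
docs = [
    "<article><sec><p>Some <em>inline</em> text.</p><br/></sec></article>",
    "<article><sec><p>Plain text.</p></sec></article>",
]
result = classify_elements(docs)
for tag, label in sorted(result.items()):
    print(tag, label)
```

Aggregating over the whole document set, rather than judging each instance in isolation, is what lets `p` be treated as mixed content even though one of its instances contains only text. No schema is consulted at any point.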

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document analysis
    2. Document preparation
      1. Markup languages
2. Information systems
  1. World Wide Web
    1. Web data description languages
      1. Markup languages

Published in
DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering
September 2012
256 pages
ISBN: 9781450311168
DOI: 10.1145/2361354
General Chair: Cyril Concolato (Telecom ParisTech, France)
Program Chair: Patrick Schmitz (University of California, Berkeley, USA)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
XML
descriptive markup
document visualisation
pattern recognition
structural patterns
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 178 of 537 submissions, 33%