research-article

X-Class: Associative Classification of XML Documents by Structure

Authors:
Gianni Costa

ICAR-CNR

ICAR-CNR
View Profile

,
Riccardo Ortale

ICAR-CNR

ICAR-CNR
View Profile

,
Ettore Ritacco

ICAR-CNR

ICAR-CNR
View Profile

Authors Info & Claims

ACM Transactions on Information Systems Volume 31 Issue 1Article No.: 3pp 1–40https://doi.org/10.1145/2414782.2414785

Published:01 January 2013Publication History

ACM Transactions on Information Systems

Abstract

The supervised classification of XML documents by structure involves learning predictive models in which certain structural regularities discriminate the individual document classes. Hitherto, research has focused on the adoption of prespecified substructures. This is detrimental for classification effectiveness, since the a priori chosen substructures may not accord with the structural properties of the XML documents. Therein, an unexplored question is how to choose the type of structural regularity that best adapts to the structures of the available XML documents.

We tackle this problem through X-Class, an approach that handles all types of tree-like substructures and allows for choosing the most discriminatory one. Algorithms are designed to learn compact rule-based classifiers in which the chosen substructures discriminate the classes of XML documents.

X-Class is studied across various domains and types of substructures. Its classification performance is compared against several rule-based and SVM-based competitors. Empirical evidence reveals that the classifiers induced by X-Class are compact, scalable, and at least as effective as the established competitors. In particular, certain substructures allow the induction of very compact classifiers that generally outperform the rule-based competitors in terms of effectiveness over all chosen corpora of XML data. Furthermore, such classifiers are substantially as effective as the SVM-based competitor, with the additional advantage of a high-degree of interpretability.

References

Aggarwal, C., Ta, N., Wang, J., Feng, J., and Zaki, M. 2007. Xproj: A framework for projected structural clustering of xml documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 46--55. Google ScholarDigital Library
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proceedings of the International Conference on Very Large Data Bases. 487--499. Google ScholarDigital Library
Aiolli, F., Da San Martino, G., and Sperduti, A. 2009. Route kernels for trees. In Proceedings of the International Conference on Machine Learning. 17--24. Google ScholarDigital Library
Amer-Yahia, S. and Lalmas, M. 2006. Xml search: Languages, inex and scoring. ACM SIGMOD Record 35, 4, 16--23. Google ScholarDigital Library
Antonie, M.-L. and Zaïane, O. 2002. Text document categorization by term association. In Proceedings of the IEEE International Conference on Data Mining. 19--26. Google ScholarDigital Library
Antonie, M.-L. and Zaïane, O. 2004. An associative classifier based on positive and negative rules. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. 64--69. Google ScholarDigital Library
Arunasalam, B. and Chawla, S. 2006. CCCS: A top-down association classifier for imbalanced class distribution. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 517--522. Google ScholarDigital Library
Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, Boston, MA. Google ScholarDigital Library
Baeza-Yates, R., Fuhr, N., and Andamaarek, Y. 2006. Special issue on xml retrieval. ACM Trans. Inform. Syst. 24, 4. Google ScholarDigital Library
Baker, L. and McCallum, A. 1998. Distributional clustering of words for text classification. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 96--103. Google ScholarDigital Library
Bennet, K. and Campbell, C. 2000. Support vector machines: Hype or hallelujah. SIGKDD Explo. 2, 2, 1--13. Google ScholarDigital Library
Bratko, A. and Filipic̆, B. 2006. Exploiting structural information for semi-structured document categorization. Inf. Process. Manage. 42, 3, 679--694. Google ScholarDigital Library
Brutlag, J. and Meek, C. 2000. Challenges of the email domain for text classification. In Proceedings of the International Conference on Machine Learning. 103--110. Google ScholarDigital Library
Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discovery 2, 2, 121--167. Google ScholarDigital Library
Candillier, L., Tellier, I., and Torre, F. 2006. Transforming xml trees for efficient classification and clustering. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 469--480. Google ScholarDigital Library
Candillier, L., Tellier, I., and Torre, F. 2007. Xml document mining using graph neural network. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 458--472.Google Scholar
Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Google ScholarDigital Library
Cheng, H., Yan, X., Han, J., and Hsu, C.-W. 2007. Discriminative frequent pattern analysis for effective classification. In Proceedings of the International Conference on Data Engineering. 716--725.Google Scholar
Coenen, F. 2004. LUCS KDD implementations of CBA and CPAR. Department of Computer Science, The University of Liverpool. www.csc.liv.ac.uk/~frans/KDD/Software/.Google Scholar
Collins, M. and Duffy, N. 2001. Convolution kernels for natural language. In Proceedings of the Conference on Neural Information Processing Systems. 625--632.Google Scholar
Collins, M. and Duffy, N. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the Annual Meeting on Association for Computational Linguistics. 263--270. Google ScholarDigital Library
Cong, G., Xu, X., Pan, F., Tung, A., and Yang, J. 2004. FARMER: Finding interesting rule groups in microarray datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 123--126. Google ScholarDigital Library
Costa, G. and Ortale, R. 2012. On effective xml clustering by path commonality: An efficient and scalable algorithm. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 389--396. Google ScholarDigital Library
Costa, G., Manco, G., Ortale, R., and Tagarelli, A. 2004. A tree-based approach to clustering xml documents by structure. In Proceedings of the International Conference on Principles and Practice of Knowledge Discovery in Databases. 137--148. Google ScholarDigital Library
Costa, G., Ortale, R., and Ritacco, E. 2011a. Effective xml classification using content and structural information via rule learning. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 102--109. Google ScholarDigital Library
Costa, G., Ortale, R., and Ritacco, E. 2011b. A transactional approach to associative xml classification by content and structure. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. 104--113.Google Scholar
Crescenzi, V., Mecca, G., and Merialdo, P. 2001. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the International Conference on Very Large Databases. 109--118. Google ScholarDigital Library
Cristianini, N. and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press. Google ScholarDigital Library
Da San Martino, G. 2009. Kernel methods for tree structured data. Ph.D. dissertation, University of Bologna, Padova.Google Scholar
Dalamagas, T., Cheng, T., Winkel, K.-J., and Sellis, T. 2006. A methodology for clustering xml documents by structure. Inf. Syst. 31, 3, 187--228. Google ScholarDigital Library
de Campos, L., Fernández-Luna, J., Huete, J., and Romero, A. 2008. Probabilistic methods for structured document classification at inex’07. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 195--206. Google ScholarDigital Library
Denoyer, L. and Gallinari, P. 2007. Report on the xml mining track at inex 2005 and inex 2006: Categorization and clustering of xml documents. ACM SIGIR Forum 41, 1, 79--90. Google ScholarDigital Library
Denoyer, L. and Gallinari, P. 2008. Report on the xml mining track at inex 2007: Categorization and clustering of xml documents. ACM SIGIR Forum 42, 1, 22--28. Google ScholarDigital Library
Flesca, S., Manco, G., Masciari, E., Pontieri, L., and Pugliese, A. 2005. Fast detection of xml structural similarity. IEEE Trans. Knowl. Data Eng. 17, 2, 160--175. Google ScholarDigital Library
Fox, C. 1992. Lexical Analysis and Stoplists. Prentice Hall, Upper Saddle River, NJ.Google Scholar
Garboni, C., Masseglia, F., and Trousse, B. 2006. Sequential pattern mining for structure-based xml document classification. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 458--468. Google ScholarDigital Library
Han, J. and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1--12. Google ScholarDigital Library
Haussler, D. 1999. Convolution kernels on discrete structures. Tech. rep., University of California at Santa Cruz.Google Scholar
Hearst, M. 1998. Trends & controversies: Support vector machines. IEEE Intell. Syst. 13, 4, 18--28. Google ScholarDigital Library
Helmer, S. 2007. Measuring the structural similarity of semistructured documents using entropy. In Proceedings of the International Conference on Very Large Data Base. 1022--1032. Google ScholarDigital Library
Hofmann, T., Schölkopf, B., and Smola, A. 2008. Kernel methods in machine learning. Ann. Stat. 36, 3, 1171--1220.Google ScholarCross Ref
Ikeda, K. 2006. Effects of kernel function on support vector machines in extreme cases. IEEE Trans. Neural Netw. 17, 1, 1--9. Google ScholarDigital Library
Jaakkola, T. and Haussler, D. 1999. Exploiting generative models in discriminative classifiers. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 487--493. Google ScholarDigital Library
Kashima, H. and Koyanagi, T. 2002. Kernels for semi-structured data. In Proceedings of the International Conference on Machine Learning. 291--298. Google ScholarDigital Library
Knijf, J. D. 2007. Fat-cat: Frequent attributes tree based classification. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 485--496.Google Scholar
Kuboyama, T., Hirata, K., Kashima, H., Aoki-Kinoshita, K., and Yasuda, H. 2007. Information and media technologies. Ann. Stat. 2, 1, 292--299.Google Scholar
Kurt, A. and Tozal, E. 2006. Classification of xslt-generated Web documents with support vector machines. In Proceedings of the International Workshop on Knowledge Discovery from XML Documents. 33--42. Google ScholarDigital Library
Li, W., Han, J., and Pei, J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of the IEEE International Conference on Data Mining. 369--376. Google ScholarDigital Library
Lian, W., Cheung, D., Mamoulis, N., and Yiu, S.-M. 2004. An efficient and scalable algorithm for clustering xml documents by structure. IEEE Trans. Knowl. Data Eng. 16, 1, 82--96. Google ScholarDigital Library
Liu, B., Hsu, W., and Ma, Y. 1998. Integrating classification and association rule mining. In Proceedings of the ACM SIGKDD International Conference on Kwnoledge Discovery and Data Mining. 80--86.Google Scholar
Liu, B., Ma, Y., and Wong, C. 2000. Improving an association rule based classifier. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery. 504--509. Google ScholarDigital Library
Manning, C., Raghavan, P., and Schütze., H. 2008. Introduction to Information Retrieval. Cambridge University Press. Google ScholarDigital Library
Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York, NY. Google ScholarDigital Library
Murugeshan, M., Lakshmi, K., and Mukherjee, S. 2007. A categorization approach for Wikipedia collection based on negative category information and initial descriptions. In Proceedings of the INitiative for the Evaluation of XML Retrieval (INEX’07). 212--214.Google Scholar
Ning, P., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley, Boston, MA.Google Scholar
Pearl, J. 1998. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan and Kaufmann, Burlington, MA. Google ScholarDigital Library
Porter, M. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.Google ScholarDigital Library
Quinlan, J. and Cameron-Jones, R. 1993. FOIL: A midterm report. In Proceedings of the European Conference on Machine Learning. 3--20. Google ScholarDigital Library
Schölkopf, B. and Smola, A. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA. Google ScholarDigital Library
Steinwart, I. and Christmann, A. 2008. Support Vector Machines. Springer, Berlin. Google ScholarDigital Library
Suzuki, J. and Isozaki, H. 2006. Sequence and tree kernels with statistical feature mining. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 1321--1328.Google Scholar
Thabtah, F. 2007. A review of associative classification mining. Knowl. Eng. Rev. 22, 1, 37--65. Google ScholarDigital Library
Theobald, M., Schenkel, R., and Weikum, G. 2003. Exploiting structure, annotation, and ontological knowledge for automatic classification of xml data. In Proceedings of the WebDB Workshop. 1--6.Google Scholar
Vapnik, V. 1998. Statistical Learning Theory. John Wiley & Sons, Hoboken, NJ.Google Scholar
Vishwanathan, S. and Smola, A. 2002. Fast kernels for string and tree matching. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 569--576.Google Scholar
Walpole, R., Myers, R., Myers, S., and Ye, K. 2011. Probability & Statistics forEngineers & Scientists. Prentice Hall, Upper Saddle River, NJ.Google Scholar
Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W., and Shi, B. 2004. Efficient pattern-growth methods for frequent tree pattern mining. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. 441--451.Google Scholar
Wang, J. and Karypis, G. 2005. HARMONY: Efficiently mining the best rules for classification. In Proceedings of the SIAM International Conference on Data Mining. 205--216.Google Scholar
Wolpert, D. 1992. Staked generalization. Neural Netw. 5, 241--259. Google ScholarDigital Library
Wu, J. and Tang, J. 2008. A bottom-up approach for xml documents classification. In Proceedings of the International Symposium on Database Engineering and Applications. 131--137. Google ScholarDigital Library
Xing, G., Guo, J., and Xia, Z. 2007. Classifying xml documents based on structure/content similarity. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 444--457.Google Scholar
Yang, J. and Chen, X. 2002. A semi-structured document model for text mining. J. Comput. Sci. Techol. 17, 5, 603--610. Google ScholarDigital Library
Yang, J. and Wang, S. 2010. Extended vsm for xml document classification using frequent subtrees. In Focused Retrieval and Evaluation, Proceedings of the 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, S. Geva, J. Kamps, and A. Trotman Eds., Springer, 441--448. Google ScholarDigital Library
Yang, J. and Zhang, F. 2008. Xml document classification using extended vsm. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 234--244. Google ScholarDigital Library
Yi, J. and Sundaresan, N. 2000. A classifier for semi-structured documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 340--344. Google ScholarDigital Library
Yin, X. and Han, J. 2003. CPAR: Classification based on predictive association rules. In Proceedings of the SIAM International Conference on Data Mining. 331--335.Google Scholar
Zaki, M. 2005. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Trans. Knowl. Data Eng. 17, 8, 1021--1035. Google ScholarDigital Library
Zaki, M. and Aggarwal, C. 2006. Xrules: An effective algorithm for structural classification of xml data. Mach. Learn. 62, 1--2, 137--170. Google ScholarDigital Library

Index Terms

X-Class: Associative Classification of XML Documents by Structure
1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Markup languages
2. Information systems

Recommendations

A weighted common structure based clustering technique for XML documents

XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, because of its varied applicability in numerous applications. Therefore, XML documents constitute an important data ...
Read More
Semantic clustering of XML documents

Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically ...
Read More
A Novel Top-Down Algorithm of Frequent XML Query Pattern Mining
ICCEA '10: Proceedings of the 2010 Second International Conference on Computer Engineering and Applications - Volume 02

Providing efficient query to XML documents is crucial, as XML has become the most important technique to exchange data in WWW. Due to its tree-structure paradigm, XML is superior for its capability of storing and querying complex data. Therefore, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Information Systems Volume 31, Issue 1
January 2013
163 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2414782
Issue’s Table of Contents

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 January 2013
- Accepted: 1 October 2012
- Revised: 1 August 2012
- Received: 1 January 2011
Published in tois Volume 31, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Structural XML classification
XML mining
XML transactional modeling
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 21
  Total Citations
  View Citations
- 525
  Total Downloads
- Downloads (Last 12 months)10
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

X-Class: Associative Classification of XML Documents by Structure

ACM Transactions on Information Systems

Abstract

References

Cited By

Index Terms

Recommendations

A weighted common structure based clustering technique for XML documents

Semantic clustering of XML documents

A Novel Top-Down Algorithm of Frequent XML Query Pattern Mining

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

X-Class: Associative Classification of XML Documents by Structure

ACM Transactions on Information Systems

Abstract

References

Cited By

Index Terms

Recommendations

A weighted common structure based clustering technique for XML documents

Semantic clustering of XML documents

A Novel Top-Down Algorithm of Frequent XML Query Pattern Mining

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media