skip to main content
research-article

X-Class: Associative Classification of XML Documents by Structure

Published:01 January 2013Publication History
Skip Abstract Section

Abstract

The supervised classification of XML documents by structure involves learning predictive models in which certain structural regularities discriminate the individual document classes. Hitherto, research has focused on the adoption of prespecified substructures. This is detrimental for classification effectiveness, since the a priori chosen substructures may not accord with the structural properties of the XML documents. Therein, an unexplored question is how to choose the type of structural regularity that best adapts to the structures of the available XML documents.

We tackle this problem through X-Class, an approach that handles all types of tree-like substructures and allows for choosing the most discriminatory one. Algorithms are designed to learn compact rule-based classifiers in which the chosen substructures discriminate the classes of XML documents.

X-Class is studied across various domains and types of substructures. Its classification performance is compared against several rule-based and SVM-based competitors. Empirical evidence reveals that the classifiers induced by X-Class are compact, scalable, and at least as effective as the established competitors. In particular, certain substructures allow the induction of very compact classifiers that generally outperform the rule-based competitors in terms of effectiveness over all chosen corpora of XML data. Furthermore, such classifiers are substantially as effective as the SVM-based competitor, with the additional advantage of a high-degree of interpretability.

References

  1. Aggarwal, C., Ta, N., Wang, J., Feng, J., and Zaki, M. 2007. Xproj: A framework for projected structural clustering of xml documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 46--55. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proceedings of the International Conference on Very Large Data Bases. 487--499. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Aiolli, F., Da San Martino, G., and Sperduti, A. 2009. Route kernels for trees. In Proceedings of the International Conference on Machine Learning. 17--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Amer-Yahia, S. and Lalmas, M. 2006. Xml search: Languages, inex and scoring. ACM SIGMOD Record 35, 4, 16--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Antonie, M.-L. and Zaïane, O. 2002. Text document categorization by term association. In Proceedings of the IEEE International Conference on Data Mining. 19--26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Antonie, M.-L. and Zaïane, O. 2004. An associative classifier based on positive and negative rules. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. 64--69. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Arunasalam, B. and Chawla, S. 2006. CCCS: A top-down association classifier for imbalanced class distribution. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 517--522. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, Boston, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Baeza-Yates, R., Fuhr, N., and Andamaarek, Y. 2006. Special issue on xml retrieval. ACM Trans. Inform. Syst. 24, 4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Baker, L. and McCallum, A. 1998. Distributional clustering of words for text classification. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 96--103. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Bennet, K. and Campbell, C. 2000. Support vector machines: Hype or hallelujah. SIGKDD Explo. 2, 2, 1--13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Bratko, A. and Filipic̆, B. 2006. Exploiting structural information for semi-structured document categorization. Inf. Process. Manage. 42, 3, 679--694. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Brutlag, J. and Meek, C. 2000. Challenges of the email domain for text classification. In Proceedings of the International Conference on Machine Learning. 103--110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discovery 2, 2, 121--167. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Candillier, L., Tellier, I., and Torre, F. 2006. Transforming xml trees for efficient classification and clustering. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 469--480. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Candillier, L., Tellier, I., and Torre, F. 2007. Xml document mining using graph neural network. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 458--472.Google ScholarGoogle Scholar
  17. Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Cheng, H., Yan, X., Han, J., and Hsu, C.-W. 2007. Discriminative frequent pattern analysis for effective classification. In Proceedings of the International Conference on Data Engineering. 716--725.Google ScholarGoogle Scholar
  19. Coenen, F. 2004. LUCS KDD implementations of CBA and CPAR. Department of Computer Science, The University of Liverpool. www.csc.liv.ac.uk/~frans/KDD/Software/.Google ScholarGoogle Scholar
  20. Collins, M. and Duffy, N. 2001. Convolution kernels for natural language. In Proceedings of the Conference on Neural Information Processing Systems. 625--632.Google ScholarGoogle Scholar
  21. Collins, M. and Duffy, N. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the Annual Meeting on Association for Computational Linguistics. 263--270. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Cong, G., Xu, X., Pan, F., Tung, A., and Yang, J. 2004. FARMER: Finding interesting rule groups in microarray datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 123--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Costa, G. and Ortale, R. 2012. On effective xml clustering by path commonality: An efficient and scalable algorithm. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 389--396. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Costa, G., Manco, G., Ortale, R., and Tagarelli, A. 2004. A tree-based approach to clustering xml documents by structure. In Proceedings of the International Conference on Principles and Practice of Knowledge Discovery in Databases. 137--148. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Costa, G., Ortale, R., and Ritacco, E. 2011a. Effective xml classification using content and structural information via rule learning. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 102--109. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Costa, G., Ortale, R., and Ritacco, E. 2011b. A transactional approach to associative xml classification by content and structure. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. 104--113.Google ScholarGoogle Scholar
  27. Crescenzi, V., Mecca, G., and Merialdo, P. 2001. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the International Conference on Very Large Databases. 109--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Cristianini, N. and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Da San Martino, G. 2009. Kernel methods for tree structured data. Ph.D. dissertation, University of Bologna, Padova.Google ScholarGoogle Scholar
  30. Dalamagas, T., Cheng, T., Winkel, K.-J., and Sellis, T. 2006. A methodology for clustering xml documents by structure. Inf. Syst. 31, 3, 187--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. de Campos, L., Fernández-Luna, J., Huete, J., and Romero, A. 2008. Probabilistic methods for structured document classification at inex’07. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 195--206. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Denoyer, L. and Gallinari, P. 2007. Report on the xml mining track at inex 2005 and inex 2006: Categorization and clustering of xml documents. ACM SIGIR Forum 41, 1, 79--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Denoyer, L. and Gallinari, P. 2008. Report on the xml mining track at inex 2007: Categorization and clustering of xml documents. ACM SIGIR Forum 42, 1, 22--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Flesca, S., Manco, G., Masciari, E., Pontieri, L., and Pugliese, A. 2005. Fast detection of xml structural similarity. IEEE Trans. Knowl. Data Eng. 17, 2, 160--175. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Fox, C. 1992. Lexical Analysis and Stoplists. Prentice Hall, Upper Saddle River, NJ.Google ScholarGoogle Scholar
  36. Garboni, C., Masseglia, F., and Trousse, B. 2006. Sequential pattern mining for structure-based xml document classification. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 458--468. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Han, J. and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Haussler, D. 1999. Convolution kernels on discrete structures. Tech. rep., University of California at Santa Cruz.Google ScholarGoogle Scholar
  39. Hearst, M. 1998. Trends & controversies: Support vector machines. IEEE Intell. Syst. 13, 4, 18--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Helmer, S. 2007. Measuring the structural similarity of semistructured documents using entropy. In Proceedings of the International Conference on Very Large Data Base. 1022--1032. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Hofmann, T., Schölkopf, B., and Smola, A. 2008. Kernel methods in machine learning. Ann. Stat. 36, 3, 1171--1220.Google ScholarGoogle ScholarCross RefCross Ref
  42. Ikeda, K. 2006. Effects of kernel function on support vector machines in extreme cases. IEEE Trans. Neural Netw. 17, 1, 1--9. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Jaakkola, T. and Haussler, D. 1999. Exploiting generative models in discriminative classifiers. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 487--493. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Kashima, H. and Koyanagi, T. 2002. Kernels for semi-structured data. In Proceedings of the International Conference on Machine Learning. 291--298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Knijf, J. D. 2007. Fat-cat: Frequent attributes tree based classification. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 485--496.Google ScholarGoogle Scholar
  46. Kuboyama, T., Hirata, K., Kashima, H., Aoki-Kinoshita, K., and Yasuda, H. 2007. Information and media technologies. Ann. Stat. 2, 1, 292--299.Google ScholarGoogle Scholar
  47. Kurt, A. and Tozal, E. 2006. Classification of xslt-generated Web documents with support vector machines. In Proceedings of the International Workshop on Knowledge Discovery from XML Documents. 33--42. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Li, W., Han, J., and Pei, J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of the IEEE International Conference on Data Mining. 369--376. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Lian, W., Cheung, D., Mamoulis, N., and Yiu, S.-M. 2004. An efficient and scalable algorithm for clustering xml documents by structure. IEEE Trans. Knowl. Data Eng. 16, 1, 82--96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Liu, B., Hsu, W., and Ma, Y. 1998. Integrating classification and association rule mining. In Proceedings of the ACM SIGKDD International Conference on Kwnoledge Discovery and Data Mining. 80--86.Google ScholarGoogle Scholar
  51. Liu, B., Ma, Y., and Wong, C. 2000. Improving an association rule based classifier. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery. 504--509. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Manning, C., Raghavan, P., and Schütze., H. 2008. Introduction to Information Retrieval. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Murugeshan, M., Lakshmi, K., and Mukherjee, S. 2007. A categorization approach for Wikipedia collection based on negative category information and initial descriptions. In Proceedings of the INitiative for the Evaluation of XML Retrieval (INEX’07). 212--214.Google ScholarGoogle Scholar
  55. Ning, P., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley, Boston, MA.Google ScholarGoogle Scholar
  56. Pearl, J. 1998. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan and Kaufmann, Burlington, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Porter, M. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Quinlan, J. and Cameron-Jones, R. 1993. FOIL: A midterm report. In Proceedings of the European Conference on Machine Learning. 3--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Schölkopf, B. and Smola, A. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Steinwart, I. and Christmann, A. 2008. Support Vector Machines. Springer, Berlin. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Suzuki, J. and Isozaki, H. 2006. Sequence and tree kernels with statistical feature mining. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 1321--1328.Google ScholarGoogle Scholar
  62. Thabtah, F. 2007. A review of associative classification mining. Knowl. Eng. Rev. 22, 1, 37--65. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Theobald, M., Schenkel, R., and Weikum, G. 2003. Exploiting structure, annotation, and ontological knowledge for automatic classification of xml data. In Proceedings of the WebDB Workshop. 1--6.Google ScholarGoogle Scholar
  64. Vapnik, V. 1998. Statistical Learning Theory. John Wiley & Sons, Hoboken, NJ.Google ScholarGoogle Scholar
  65. Vishwanathan, S. and Smola, A. 2002. Fast kernels for string and tree matching. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 569--576.Google ScholarGoogle Scholar
  66. Walpole, R., Myers, R., Myers, S., and Ye, K. 2011. Probability & Statistics forEngineers & Scientists. Prentice Hall, Upper Saddle River, NJ.Google ScholarGoogle Scholar
  67. Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W., and Shi, B. 2004. Efficient pattern-growth methods for frequent tree pattern mining. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. 441--451.Google ScholarGoogle Scholar
  68. Wang, J. and Karypis, G. 2005. HARMONY: Efficiently mining the best rules for classification. In Proceedings of the SIAM International Conference on Data Mining. 205--216.Google ScholarGoogle Scholar
  69. Wolpert, D. 1992. Staked generalization. Neural Netw. 5, 241--259. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Wu, J. and Tang, J. 2008. A bottom-up approach for xml documents classification. In Proceedings of the International Symposium on Database Engineering and Applications. 131--137. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. Xing, G., Guo, J., and Xia, Z. 2007. Classifying xml documents based on structure/content similarity. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 444--457.Google ScholarGoogle Scholar
  72. Yang, J. and Chen, X. 2002. A semi-structured document model for text mining. J. Comput. Sci. Techol. 17, 5, 603--610. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Yang, J. and Wang, S. 2010. Extended vsm for xml document classification using frequent subtrees. In Focused Retrieval and Evaluation, Proceedings of the 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, S. Geva, J. Kamps, and A. Trotman Eds., Springer, 441--448. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Yang, J. and Zhang, F. 2008. Xml document classification using extended vsm. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 234--244. Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. Yi, J. and Sundaresan, N. 2000. A classifier for semi-structured documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 340--344. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Yin, X. and Han, J. 2003. CPAR: Classification based on predictive association rules. In Proceedings of the SIAM International Conference on Data Mining. 331--335.Google ScholarGoogle Scholar
  77. Zaki, M. 2005. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Trans. Knowl. Data Eng. 17, 8, 1021--1035. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Zaki, M. and Aggarwal, C. 2006. Xrules: An effective algorithm for structural classification of xml data. Mach. Learn. 62, 1--2, 137--170. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. X-Class: Associative Classification of XML Documents by Structure

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM Transactions on Information Systems
            ACM Transactions on Information Systems  Volume 31, Issue 1
            January 2013
            163 pages
            ISSN:1046-8188
            EISSN:1558-2868
            DOI:10.1145/2414782
            Issue’s Table of Contents

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 January 2013
            • Accepted: 1 October 2012
            • Revised: 1 August 2012
            • Received: 1 January 2011
            Published in tois Volume 31, Issue 1

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article
            • Research
            • Refereed

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader