Abstract
The supervised classification of XML documents by structure involves learning predictive models in which certain structural regularities discriminate the individual document classes. Hitherto, research has focused on the adoption of prespecified substructures. This is detrimental for classification effectiveness, since the a priori chosen substructures may not accord with the structural properties of the XML documents. Therein, an unexplored question is how to choose the type of structural regularity that best adapts to the structures of the available XML documents.
We tackle this problem through X-Class, an approach that handles all types of tree-like substructures and allows for choosing the most discriminatory one. Algorithms are designed to learn compact rule-based classifiers in which the chosen substructures discriminate the classes of XML documents.
X-Class is studied across various domains and types of substructures. Its classification performance is compared against several rule-based and SVM-based competitors. Empirical evidence reveals that the classifiers induced by X-Class are compact, scalable, and at least as effective as the established competitors. In particular, certain substructures allow the induction of very compact classifiers that generally outperform the rule-based competitors in terms of effectiveness over all chosen corpora of XML data. Furthermore, such classifiers are substantially as effective as the SVM-based competitor, with the additional advantage of a high-degree of interpretability.
- Aggarwal, C., Ta, N., Wang, J., Feng, J., and Zaki, M. 2007. Xproj: A framework for projected structural clustering of xml documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 46--55. Google ScholarDigital Library
- Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proceedings of the International Conference on Very Large Data Bases. 487--499. Google ScholarDigital Library
- Aiolli, F., Da San Martino, G., and Sperduti, A. 2009. Route kernels for trees. In Proceedings of the International Conference on Machine Learning. 17--24. Google ScholarDigital Library
- Amer-Yahia, S. and Lalmas, M. 2006. Xml search: Languages, inex and scoring. ACM SIGMOD Record 35, 4, 16--23. Google ScholarDigital Library
- Antonie, M.-L. and Zaïane, O. 2002. Text document categorization by term association. In Proceedings of the IEEE International Conference on Data Mining. 19--26. Google ScholarDigital Library
- Antonie, M.-L. and Zaïane, O. 2004. An associative classifier based on positive and negative rules. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. 64--69. Google ScholarDigital Library
- Arunasalam, B. and Chawla, S. 2006. CCCS: A top-down association classifier for imbalanced class distribution. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 517--522. Google ScholarDigital Library
- Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, Boston, MA. Google ScholarDigital Library
- Baeza-Yates, R., Fuhr, N., and Andamaarek, Y. 2006. Special issue on xml retrieval. ACM Trans. Inform. Syst. 24, 4. Google ScholarDigital Library
- Baker, L. and McCallum, A. 1998. Distributional clustering of words for text classification. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval. 96--103. Google ScholarDigital Library
- Bennet, K. and Campbell, C. 2000. Support vector machines: Hype or hallelujah. SIGKDD Explo. 2, 2, 1--13. Google ScholarDigital Library
- Bratko, A. and Filipic̆, B. 2006. Exploiting structural information for semi-structured document categorization. Inf. Process. Manage. 42, 3, 679--694. Google ScholarDigital Library
- Brutlag, J. and Meek, C. 2000. Challenges of the email domain for text classification. In Proceedings of the International Conference on Machine Learning. 103--110. Google ScholarDigital Library
- Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discovery 2, 2, 121--167. Google ScholarDigital Library
- Candillier, L., Tellier, I., and Torre, F. 2006. Transforming xml trees for efficient classification and clustering. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 469--480. Google ScholarDigital Library
- Candillier, L., Tellier, I., and Torre, F. 2007. Xml document mining using graph neural network. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 458--472.Google Scholar
- Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Google ScholarDigital Library
- Cheng, H., Yan, X., Han, J., and Hsu, C.-W. 2007. Discriminative frequent pattern analysis for effective classification. In Proceedings of the International Conference on Data Engineering. 716--725.Google Scholar
- Coenen, F. 2004. LUCS KDD implementations of CBA and CPAR. Department of Computer Science, The University of Liverpool. www.csc.liv.ac.uk/~frans/KDD/Software/.Google Scholar
- Collins, M. and Duffy, N. 2001. Convolution kernels for natural language. In Proceedings of the Conference on Neural Information Processing Systems. 625--632.Google Scholar
- Collins, M. and Duffy, N. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the Annual Meeting on Association for Computational Linguistics. 263--270. Google ScholarDigital Library
- Cong, G., Xu, X., Pan, F., Tung, A., and Yang, J. 2004. FARMER: Finding interesting rule groups in microarray datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 123--126. Google ScholarDigital Library
- Costa, G. and Ortale, R. 2012. On effective xml clustering by path commonality: An efficient and scalable algorithm. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 389--396. Google ScholarDigital Library
- Costa, G., Manco, G., Ortale, R., and Tagarelli, A. 2004. A tree-based approach to clustering xml documents by structure. In Proceedings of the International Conference on Principles and Practice of Knowledge Discovery in Databases. 137--148. Google ScholarDigital Library
- Costa, G., Ortale, R., and Ritacco, E. 2011a. Effective xml classification using content and structural information via rule learning. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence. 102--109. Google ScholarDigital Library
- Costa, G., Ortale, R., and Ritacco, E. 2011b. A transactional approach to associative xml classification by content and structure. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. 104--113.Google Scholar
- Crescenzi, V., Mecca, G., and Merialdo, P. 2001. Roadrunner: Towards automatic data extraction from large web sites. In Proceedings of the International Conference on Very Large Databases. 109--118. Google ScholarDigital Library
- Cristianini, N. and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press. Google ScholarDigital Library
- Da San Martino, G. 2009. Kernel methods for tree structured data. Ph.D. dissertation, University of Bologna, Padova.Google Scholar
- Dalamagas, T., Cheng, T., Winkel, K.-J., and Sellis, T. 2006. A methodology for clustering xml documents by structure. Inf. Syst. 31, 3, 187--228. Google ScholarDigital Library
- de Campos, L., Fernández-Luna, J., Huete, J., and Romero, A. 2008. Probabilistic methods for structured document classification at inex’07. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 195--206. Google ScholarDigital Library
- Denoyer, L. and Gallinari, P. 2007. Report on the xml mining track at inex 2005 and inex 2006: Categorization and clustering of xml documents. ACM SIGIR Forum 41, 1, 79--90. Google ScholarDigital Library
- Denoyer, L. and Gallinari, P. 2008. Report on the xml mining track at inex 2007: Categorization and clustering of xml documents. ACM SIGIR Forum 42, 1, 22--28. Google ScholarDigital Library
- Flesca, S., Manco, G., Masciari, E., Pontieri, L., and Pugliese, A. 2005. Fast detection of xml structural similarity. IEEE Trans. Knowl. Data Eng. 17, 2, 160--175. Google ScholarDigital Library
- Fox, C. 1992. Lexical Analysis and Stoplists. Prentice Hall, Upper Saddle River, NJ.Google Scholar
- Garboni, C., Masseglia, F., and Trousse, B. 2006. Sequential pattern mining for structure-based xml document classification. In Advances in XML Information Retrieval and Evaluation, Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, S. Malik, and G. Kazai Eds., Springer, 458--468. Google ScholarDigital Library
- Han, J. and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1--12. Google ScholarDigital Library
- Haussler, D. 1999. Convolution kernels on discrete structures. Tech. rep., University of California at Santa Cruz.Google Scholar
- Hearst, M. 1998. Trends & controversies: Support vector machines. IEEE Intell. Syst. 13, 4, 18--28. Google ScholarDigital Library
- Helmer, S. 2007. Measuring the structural similarity of semistructured documents using entropy. In Proceedings of the International Conference on Very Large Data Base. 1022--1032. Google ScholarDigital Library
- Hofmann, T., Schölkopf, B., and Smola, A. 2008. Kernel methods in machine learning. Ann. Stat. 36, 3, 1171--1220.Google ScholarCross Ref
- Ikeda, K. 2006. Effects of kernel function on support vector machines in extreme cases. IEEE Trans. Neural Netw. 17, 1, 1--9. Google ScholarDigital Library
- Jaakkola, T. and Haussler, D. 1999. Exploiting generative models in discriminative classifiers. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 487--493. Google ScholarDigital Library
- Kashima, H. and Koyanagi, T. 2002. Kernels for semi-structured data. In Proceedings of the International Conference on Machine Learning. 291--298. Google ScholarDigital Library
- Knijf, J. D. 2007. Fat-cat: Frequent attributes tree based classification. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 485--496.Google Scholar
- Kuboyama, T., Hirata, K., Kashima, H., Aoki-Kinoshita, K., and Yasuda, H. 2007. Information and media technologies. Ann. Stat. 2, 1, 292--299.Google Scholar
- Kurt, A. and Tozal, E. 2006. Classification of xslt-generated Web documents with support vector machines. In Proceedings of the International Workshop on Knowledge Discovery from XML Documents. 33--42. Google ScholarDigital Library
- Li, W., Han, J., and Pei, J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of the IEEE International Conference on Data Mining. 369--376. Google ScholarDigital Library
- Lian, W., Cheung, D., Mamoulis, N., and Yiu, S.-M. 2004. An efficient and scalable algorithm for clustering xml documents by structure. IEEE Trans. Knowl. Data Eng. 16, 1, 82--96. Google ScholarDigital Library
- Liu, B., Hsu, W., and Ma, Y. 1998. Integrating classification and association rule mining. In Proceedings of the ACM SIGKDD International Conference on Kwnoledge Discovery and Data Mining. 80--86.Google Scholar
- Liu, B., Ma, Y., and Wong, C. 2000. Improving an association rule based classifier. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery. 504--509. Google ScholarDigital Library
- Manning, C., Raghavan, P., and Schütze., H. 2008. Introduction to Information Retrieval. Cambridge University Press. Google ScholarDigital Library
- Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York, NY. Google ScholarDigital Library
- Murugeshan, M., Lakshmi, K., and Mukherjee, S. 2007. A categorization approach for Wikipedia collection based on negative category information and initial descriptions. In Proceedings of the INitiative for the Evaluation of XML Retrieval (INEX’07). 212--214.Google Scholar
- Ning, P., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley, Boston, MA.Google Scholar
- Pearl, J. 1998. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan and Kaufmann, Burlington, MA. Google ScholarDigital Library
- Porter, M. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.Google ScholarDigital Library
- Quinlan, J. and Cameron-Jones, R. 1993. FOIL: A midterm report. In Proceedings of the European Conference on Machine Learning. 3--20. Google ScholarDigital Library
- Schölkopf, B. and Smola, A. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA. Google ScholarDigital Library
- Steinwart, I. and Christmann, A. 2008. Support Vector Machines. Springer, Berlin. Google ScholarDigital Library
- Suzuki, J. and Isozaki, H. 2006. Sequence and tree kernels with statistical feature mining. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 1321--1328.Google Scholar
- Thabtah, F. 2007. A review of associative classification mining. Knowl. Eng. Rev. 22, 1, 37--65. Google ScholarDigital Library
- Theobald, M., Schenkel, R., and Weikum, G. 2003. Exploiting structure, annotation, and ontological knowledge for automatic classification of xml data. In Proceedings of the WebDB Workshop. 1--6.Google Scholar
- Vapnik, V. 1998. Statistical Learning Theory. John Wiley & Sons, Hoboken, NJ.Google Scholar
- Vishwanathan, S. and Smola, A. 2002. Fast kernels for string and tree matching. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 569--576.Google Scholar
- Walpole, R., Myers, R., Myers, S., and Ye, K. 2011. Probability & Statistics forEngineers & Scientists. Prentice Hall, Upper Saddle River, NJ.Google Scholar
- Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W., and Shi, B. 2004. Efficient pattern-growth methods for frequent tree pattern mining. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. 441--451.Google Scholar
- Wang, J. and Karypis, G. 2005. HARMONY: Efficiently mining the best rules for classification. In Proceedings of the SIAM International Conference on Data Mining. 205--216.Google Scholar
- Wolpert, D. 1992. Staked generalization. Neural Netw. 5, 241--259. Google ScholarDigital Library
- Wu, J. and Tang, J. 2008. A bottom-up approach for xml documents classification. In Proceedings of the International Symposium on Database Engineering and Applications. 131--137. Google ScholarDigital Library
- Xing, G., Guo, J., and Xia, Z. 2007. Classifying xml documents based on structure/content similarity. In Comparative Evaluation of XML Information Retrieval Systems, Proceedings of the 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, M. Lalmas, and A. Trotman Eds., Springer, 444--457.Google Scholar
- Yang, J. and Chen, X. 2002. A semi-structured document model for text mining. J. Comput. Sci. Techol. 17, 5, 603--610. Google ScholarDigital Library
- Yang, J. and Wang, S. 2010. Extended vsm for xml document classification using frequent subtrees. In Focused Retrieval and Evaluation, Proceedings of the 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, S. Geva, J. Kamps, and A. Trotman Eds., Springer, 441--448. Google ScholarDigital Library
- Yang, J. and Zhang, F. 2008. Xml document classification using extended vsm. In Focused Access to XML Documents, Proceedings of the 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman Eds., Springer, 234--244. Google ScholarDigital Library
- Yi, J. and Sundaresan, N. 2000. A classifier for semi-structured documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovey and Data Mining. 340--344. Google ScholarDigital Library
- Yin, X. and Han, J. 2003. CPAR: Classification based on predictive association rules. In Proceedings of the SIAM International Conference on Data Mining. 331--335.Google Scholar
- Zaki, M. 2005. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Trans. Knowl. Data Eng. 17, 8, 1021--1035. Google ScholarDigital Library
- Zaki, M. and Aggarwal, C. 2006. Xrules: An effective algorithm for structural classification of xml data. Mach. Learn. 62, 1--2, 137--170. Google ScholarDigital Library
Index Terms
- X-Class: Associative Classification of XML Documents by Structure
Recommendations
A weighted common structure based clustering technique for XML documents
XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, because of its varied applicability in numerous applications. Therefore, XML documents constitute an important data ...
Semantic clustering of XML documents
Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically ...
A Novel Top-Down Algorithm of Frequent XML Query Pattern Mining
ICCEA '10: Proceedings of the 2010 Second International Conference on Computer Engineering and Applications - Volume 02Providing efficient query to XML documents is crucial, as XML has become the most important technique to exchange data in WWW. Due to its tree-structure paradigm, XML is superior for its capability of storing and querying complex data. Therefore, ...
Comments