Abstract
This paper introduces a novel method for learning a wrapper that extracts information from web pages, based on (k,l)-contextual tree languages. It also introduces a method for learning good values of k and l from a few positive and negative examples. Finally, it describes how the algorithm can be integrated into a tool for information extraction.
Additional information
Editor: Dan Roth.
About this article
Cite this article
Raeymaekers, S., Bruynooghe, M. & Van den Bussche, J. Learning (k,l)-contextual tree languages for information extraction from web pages. Mach Learn 71, 155–183 (2008). https://doi.org/10.1007/s10994-008-5049-7
DOI: https://doi.org/10.1007/s10994-008-5049-7