Skip to main content
Top
Published in: Knowledge and Information Systems 1/2016

01-10-2016 | Regular Paper

Roller: a novel approach to Web information extraction

Authors: Patricia Jiménez, Rafael Corchuelo

Published in: Knowledge and Information Systems | Issue 1/2016

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The research regarding Web information extraction focuses on learning rules to extract some selected information from Web documents. Many proposals are ad hoc and cannot benefit from the advances in machine learning; furthermore, they are likely to fade away as the Web evolves, and their intrinsic assumptions are not satisfied. Some authors have explored transforming Web documents into relational data and then using techniques that got inspiration from inductive logic programming. In theory, such proposals should be easier to adapt as the Web evolves because they build on catalogues of features that can be adapted without changing the proposals themselves. Unfortunately, they are difficult to scale as the number of documents or features increases. In the general field of machine learning, there are propositio-relational proposals that attempt to provide effective and efficient means to learn from relational data using propositional techniques, but they have seldom been explored regarding Web information extraction. In this article, we present a new proposal called Roller: it relies on a search procedure that uses a dynamic flattening technique to explore the context of the nodes that provide the information to be extracted; it is configured with an open catalogue of features, so that it can adapt to the evolution of the Web; it also requires a base learner and a rule scorer, which helps it benefit from the continuous advances in machine learning. Our experiments confirm that it outperforms other state-of-the-art proposals in terms of effectiveness and that it is very competitive in terms of efficiency; we have also confirmed that our conclusions are solid from a statistical point of view.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Appendix
Available only for authorised users
Footnotes
1
Cantelli’s inequality states that given a random variable X, if \(\mu \) denotes its mean and \(\sigma \) denotes is standard deviation, then the probability that \(X \ge \mu + k \, \sigma \) or that \(X \le \mu - k \, \sigma \) is not greater than \(1 / (1 + k ^ 2)\). If we wish to discard, say, 5 % of the data as lower or upper outliers, we then have to set \(1 / (1 + k ^ 2 ) = 0.05\), that is \(k = 4.36\).
 
Literature
1.
go back to reference Álvarez M, Pan A, Raposo J, Bellas F, Cacheda F (2008) Extracting lists of data records from semi-structured web pages. Data Knowl Eng 64(2):491–509CrossRef Álvarez M, Pan A, Raposo J, Bellas F, Cacheda F (2008) Extracting lists of data records from semi-structured web pages. Data Knowl Eng 64(2):491–509CrossRef
2.
go back to reference Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: SIGMOD conference, pp 337–348 Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: SIGMOD conference, pp 337–348
3.
go back to reference Atramentov A, Leiva H, Honavar V (2003) A multi-relational decision tree learning algorithm. In: ILP, pp 38–56 Atramentov A, Leiva H, Honavar V (2003) A multi-relational decision tree learning algorithm. In: ILP, pp 38–56
4.
go back to reference Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29CrossRef Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29CrossRef
5.
go back to reference Bernstein PA, Haas LM (2008) Information integration in the enterprise. Commun ACM 51(9):72–79CrossRef Bernstein PA, Haas LM (2008) Information integration in the enterprise. Commun ACM 51(9):72–79CrossRef
7.
go back to reference Blockeel H, Raedt LD, Jacobs N, Demoen B (1999) Scaling up inductive logic programming by learning from interpretations. Data Min Knowl Discov 3(1):59–93CrossRef Blockeel H, Raedt LD, Jacobs N, Demoen B (1999) Scaling up inductive logic programming by learning from interpretations. Data Min Knowl Discov 3(1):59–93CrossRef
8.
go back to reference Bădică C, Bădică A, Popescu E, Abraham A (2007) L-wrappers: concepts, properties and construction. Soft Comput 11(8):753–772CrossRef Bădică C, Bădică A, Popescu E, Abraham A (2007) L-wrappers: concepts, properties and construction. Soft Comput 11(8):753–772CrossRef
9.
go back to reference Califf ME, Mooney RJ (2003) Bottom-up relational learning of pattern matching rules for information extraction. J Mach Learn Res 4:177–210MathSciNetMATH Califf ME, Mooney RJ (2003) Bottom-up relational learning of pattern matching rules for information extraction. J Mach Learn Res 4:177–210MathSciNetMATH
10.
go back to reference Chang C-H, Kuo S-C (2004) OLERA: Semisupervised web-data extraction with visual support. IEEE Intell Syst 19(6):56–64CrossRef Chang C-H, Kuo S-C (2004) OLERA: Semisupervised web-data extraction with visual support. IEEE Intell Syst 19(6):56–64CrossRef
11.
go back to reference Chang C-H, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428CrossRef Chang C-H, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428CrossRef
12.
go back to reference Chidlovskii B (2001) Wrapping web information providers by transducer induction. In: ECML, pp 61–72 Chidlovskii B (2001) Wrapping web information providers by transducer induction. In: ECML, pp 61–72
14.
go back to reference Crescenzi V, Merialdo P (2008) Wrapper inference for ambiguous web pages. Appl Artif Intell 22(1&2):21–52CrossRef Crescenzi V, Merialdo P (2008) Wrapper inference for ambiguous web pages. Appl Artif Intell 22(1&2):21–52CrossRef
15.
go back to reference Cumby CM, Roth D (2003) On kernel methods for relational learning. In: ICML, pp 107–114 Cumby CM, Roth D (2003) On kernel methods for relational learning. In: ICML, pp 107–114
16.
go back to reference Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MathSciNetMATH Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MathSciNetMATH
17.
go back to reference Džeroski S, Lavrač N (1993) Inductive learning in deductive databases. IEEE Trans Knowl Data Eng 5(6):939–949CrossRef Džeroski S, Lavrač N (1993) Inductive learning in deductive databases. IEEE Trans Knowl Data Eng 5(6):939–949CrossRef
18.
go back to reference Emde W and Wettschereck D (1996) Relational instance-based learning. In ICML, pp 122–130 Emde W and Wettschereck D (1996) Relational instance-based learning. In ICML, pp 122–130
19.
go back to reference Esposito F, Ferilli S, Fanizzi N, Basile TMA, Mauro ND (2003) Incremental multistrategy learning for document processing. Appl Artif Intell 17(8–9):859–883CrossRef Esposito F, Ferilli S, Fanizzi N, Basile TMA, Mauro ND (2003) Incremental multistrategy learning for document processing. Appl Artif Intell 17(8–9):859–883CrossRef
20.
go back to reference Fernández-Villamor JI, Iglesias CÁ, Garijo M (2012) First-order logic rule induction for information extraction in web resources. Int J Artif Intell Tools 21(6):1–20CrossRef Fernández-Villamor JI, Iglesias CÁ, Garijo M (2012) First-order logic rule induction for information extraction in web resources. Int J Artif Intell Tools 21(6):1–20CrossRef
21.
go back to reference Flach PA, Lachiche N (2004) Naive bayesian classification of structured data. Mach Learn 57(3):233–269CrossRefMATH Flach PA, Lachiche N (2004) Naive bayesian classification of structured data. Mach Learn 57(3):233–269CrossRefMATH
22.
go back to reference Frank E, Hall MA, Holmes G, Kirkby R, Pfahringer B, Witten IH, Trigg L (2010) Weka-a machine learning workbench for data mining. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Berlin, pp 1269–1277 Frank E, Hall MA, Holmes G, Kirkby R, Pfahringer B, Witten IH, Trigg L (2010) Weka-a machine learning workbench for data mining. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Berlin, pp 1269–1277
23.
go back to reference Freitag D (1998) Information extraction from HTML: application of a general machine learning approach. In: AAAI/IAAI, pp 517–523 Freitag D (1998) Information extraction from HTML: application of a general machine learning approach. In: AAAI/IAAI, pp 517–523
24.
go back to reference Freitag D (2000) Machine learning for information extraction in informal domains. Mach Learn 39(2/3):169–202CrossRefMATH Freitag D (2000) Machine learning for information extraction in informal domains. Mach Learn 39(2/3):169–202CrossRefMATH
25.
go back to reference García S, Herrera F (2008) An extension on ‘statistical comparisons of classifiers over multiple data sets’ for all pair-wise comparisons. J Mach Learn Res 9:2677–2694MATH García S, Herrera F (2008) An extension on ‘statistical comparisons of classifiers over multiple data sets’ for all pair-wise comparisons. J Mach Learn Res 9:2677–2694MATH
26.
go back to reference Gärtner T, Lloyd JW, Flach PA (2004) Kernels and distances for structured data. Mach Learn 57(3):205–232CrossRefMATH Gärtner T, Lloyd JW, Flach PA (2004) Kernels and distances for structured data. Mach Learn 57(3):205–232CrossRefMATH
27.
go back to reference Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3):1–32 Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3):1–32
28.
go back to reference Getoor L, Friedman N, Koller D, Taskar B (2001) Learning probabilistic models of relational structure. In: ICML, pp 170–177 Getoor L, Friedman N, Koller D, Taskar B (2001) Learning probabilistic models of relational structure. In: ICML, pp 170–177
29.
go back to reference Guo H, Viktor HL (2008) Multirelational classification: a multiple view approach. Knowl Inf Syst 17(3):287–312CrossRef Guo H, Viktor HL (2008) Multirelational classification: a multiple view approach. Knowl Inf Syst 17(3):287–312CrossRef
30.
go back to reference He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284CrossRef He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284CrossRef
31.
go back to reference Hickson I, Berjon R, Faulkner S, Leithead T, Navara ED, O’Connor E, Pfeiffer S (2014) HTML 5: a vocabulary and associated APIs for HTML and XHTML. Technical report W3C Hickson I, Berjon R, Faulkner S, Leithead T, Navara ED, O’Connor E, Pfeiffer S (2014) HTML 5: a vocabulary and associated APIs for HTML and XHTML. Technical report W3C
32.
go back to reference Hogue AW, Karger DR (2005) Thresher: automating the unwrapping of semantic content from the world wide web. In: WWW, pp 86–95 Hogue AW, Karger DR (2005) Thresher: automating the unwrapping of semantic content from the world wide web. In: WWW, pp 86–95
33.
go back to reference Horváth T, Wrobel S, Bohnebeck U (2001) Relational instance-based learning with lists and terms. Mach Learn 43(1/2):53–80CrossRefMATH Horváth T, Wrobel S, Bohnebeck U (2001) Relational instance-based learning with lists and terms. Mach Learn 43(1/2):53–80CrossRefMATH
34.
go back to reference Hsu C-N, Dung M-T (1998) Generating finite-state transducers for semi-structured data extraction from the web. Inf Syst 23(8):521–538CrossRef Hsu C-N, Dung M-T (1998) Generating finite-state transducers for semi-structured data extraction from the web. Inf Syst 23(8):521–538CrossRef
35.
go back to reference Irmak U, Suel T (2006) Interactive wrapper generation with minimal user effort. In: WWW, pp 553–563 Irmak U, Suel T (2006) Interactive wrapper generation with minimal user effort. In: WWW, pp 553–563
36.
go back to reference Jaeger M (2008) Probabilistic-logic models: reasoning and learning with relational structures. In: SCAI, pp 197–200 Jaeger M (2008) Probabilistic-logic models: reasoning and learning with relational structures. In: SCAI, pp 197–200
37.
go back to reference Kavurucu Y, Senkul P, Toroslu IH (2011) A comparative study on ILP-based concept discovery systems. Expert Syst Appl 38(9):11598–11607CrossRef Kavurucu Y, Senkul P, Toroslu IH (2011) A comparative study on ILP-based concept discovery systems. Expert Syst Appl 38(9):11598–11607CrossRef
38.
go back to reference Kayed M, Chang C-H (2010) FiVaTech: page-level web data extraction from template pages. IEEE Trans Knowl Data Eng 22(2):249–263CrossRef Kayed M, Chang C-H (2010) FiVaTech: page-level web data extraction from template pages. IEEE Trans Knowl Data Eng 22(2):249–263CrossRef
39.
go back to reference Knobbe AJ, de Haas M, Siebes A (2001) Propositionalisation and aggregates. In: PKDD, pp 277–288 Knobbe AJ, de Haas M, Siebes A (2001) Propositionalisation and aggregates. In: PKDD, pp 277–288
40.
go back to reference Kramer S, Lavrač N, Flach P (2001a) Propositionalization approaches to relational data mining. In: Džeroski S, Lavrač N (eds) Relational data mining. Springer, Berlin, pp 262–291CrossRef Kramer S, Lavrač N, Flach P (2001a) Propositionalization approaches to relational data mining. In: Džeroski S, Lavrač N (eds) Relational data mining. Springer, Berlin, pp 262–291CrossRef
41.
go back to reference Kramer S, Widmer G, Pfahringer B, de Groeve M (2001b) Prediction of ordinal classes using regression trees. Fundam Inform 47(1–2):1–13MathSciNetMATH Kramer S, Widmer G, Pfahringer B, de Groeve M (2001b) Prediction of ordinal classes using regression trees. Fundam Inform 47(1–2):1–13MathSciNetMATH
42.
go back to reference Krogel MA (2005) On propositionalization for knowledge discovery in relational databases. PhD thesis, Otto von Guericke Universität Magdeburg Krogel MA (2005) On propositionalization for knowledge discovery in relational databases. PhD thesis, Otto von Guericke Universität Magdeburg
43.
go back to reference Krogel M-A, Rawles S, Zelezný F, Flach PA, Lavrač N, Wrobel S (2003) Comparative evaluation of approaches to propositionalization. In: ILP, pp 197–214 Krogel M-A, Rawles S, Zelezný F, Flach PA, Lavrač N, Wrobel S (2003) Comparative evaluation of approaches to propositionalization. In: ILP, pp 197–214
44.
go back to reference Kushmerick N, Weld DS, Doorenbos RB (1997) Wrapper induction for information extraction. IJCAI 1:729–737 Kushmerick N, Weld DS, Doorenbos RB (1997) Wrapper induction for information extraction. IJCAI 1:729–737
45.
go back to reference Lavrač N, Džeroski S (1994) Inductive logic programming: techniques and applications. Ellis Horwood, ChichesterMATH Lavrač N, Džeroski S (1994) Inductive logic programming: techniques and applications. Ellis Horwood, ChichesterMATH
46.
go back to reference Montoto P, Pan A, Raposo J, Losada J, Bellas F, Carneiro V (2008) A workflow language for web automation. J UCS 14(11):1838–1856 Montoto P, Pan A, Raposo J, Losada J, Bellas F, Carneiro V (2008) A workflow language for web automation. J UCS 14(11):1838–1856
47.
go back to reference Muggleton S (2000) Learning stochastic logic programs. Electron Trans Artif Intell 4(B):141–153MathSciNet Muggleton S (2000) Learning stochastic logic programs. Electron Trans Artif Intell 4(B):141–153MathSciNet
48.
go back to reference Muggleton S, Raedt LD, Poole D, Bratko I, Flach PA, Inoue K, Srinivasan A (2012) ILP turns 20: biography and future challenges. Mach Learn 86(1):3–23MathSciNetCrossRefMATH Muggleton S, Raedt LD, Poole D, Bratko I, Flach PA, Inoue K, Srinivasan A (2012) ILP turns 20: biography and future challenges. Mach Learn 86(1):3–23MathSciNetCrossRefMATH
49.
go back to reference Muslea I, Minton S, Knoblock CA (2001) Hierarchical wrapper induction for semistructured information sources. Auton Agents Multi-Agent Syst 4(1/2):93–114CrossRef Muslea I, Minton S, Knoblock CA (2001) Hierarchical wrapper induction for semistructured information sources. Auton Agents Multi-Agent Syst 4(1/2):93–114CrossRef
50.
go back to reference Park J, Barbosa D (2007) Adaptive record extraction from web pages. In: WWW, pp 1335–1336 Park J, Barbosa D (2007) Adaptive record extraction from web pages. In: WWW, pp 1335–1336
51.
go back to reference Quinlan JR, Cameron-Jones RM (1995) Induction of logic programs: FOIL and related systems. New Gener Comput 13(3&4):287–312CrossRef Quinlan JR, Cameron-Jones RM (1995) Induction of logic programs: FOIL and related systems. New Gener Comput 13(3&4):287–312CrossRef
52.
53.
go back to reference Shen YK, Karger DR (2007) U-REST: an unsupervised record extraction system. In: WWW, pp 1347–1348 Shen YK, Karger DR (2007) U-REST: an unsupervised record extraction system. In: WWW, pp 1347–1348
54.
go back to reference Sheskin DJ (2012) Handbook of parametric and nonparametric statistical procedures, 5th edn. Chapman and Hall/CRC, Boca Raton/LondonMATH Sheskin DJ (2012) Handbook of parametric and nonparametric statistical procedures, 5th edn. Chapman and Hall/CRC, Boca Raton/LondonMATH
55.
go back to reference Sleiman HA, Corchuelo R (2013a) TEX: an efficient and effective unsupervised web information extractor. Knowl Based Syst 39:109–123CrossRef Sleiman HA, Corchuelo R (2013a) TEX: an efficient and effective unsupervised web information extractor. Knowl Based Syst 39:109–123CrossRef
56.
go back to reference Sleiman HA, Corchuelo R (2013b) A survey on region extractors from web documents. IEEE Trans Knowl Data Eng 25(9):1960–1981CrossRef Sleiman HA, Corchuelo R (2013b) A survey on region extractors from web documents. IEEE Trans Knowl Data Eng 25(9):1960–1981CrossRef
57.
go back to reference Sleiman HA, Corchuelo R (2014a) A class of neural-network-based transducers for web information extraction. Neurocomputing 135:61–68CrossRef Sleiman HA, Corchuelo R (2014a) A class of neural-network-based transducers for web information extraction. Neurocomputing 135:61–68CrossRef
58.
go back to reference Sleiman HA, Corchuelo R (2014b) Trinity: on using trinary trees for unsupervised web data extraction. IEEE Trans Knowl Data Eng 26(6):1544–1556CrossRef Sleiman HA, Corchuelo R (2014b) Trinity: on using trinary trees for unsupervised web data extraction. IEEE Trans Knowl Data Eng 26(6):1544–1556CrossRef
59.
go back to reference Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34(1–3):233–272CrossRefMATH Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34(1–3):233–272CrossRefMATH
60.
go back to reference Srinivasan A (2004) The Aleph manual. Technical report, University of Oxford Srinivasan A (2004) The Aleph manual. Technical report, University of Oxford
61.
go back to reference Su W, Wang J, Lochovsky FH (2009) ODE: ontology-assisted data extraction. ACM Trans. Database Syst. 34(2):12.1–12.35 Su W, Wang J, Lochovsky FH (2009) ODE: ontology-assisted data extraction. ACM Trans. Database Syst. 34(2):12.1–12.35
62.
go back to reference Turmo J, Ageno A, Català N (2006) Adaptive information extraction. ACM Comput Surv 38(2):1–47 Turmo J, Ageno A, Català N (2006) Adaptive information extraction. ACM Comput Surv 38(2):1–47
63.
go back to reference van Kesteren A, Gregor A, Russell A, Berjon R (2014) Document object model 4. Technical report W3C van Kesteren A, Gregor A, Russell A, Berjon R (2014) Document object model 4. Technical report W3C
64.
go back to reference Yin X, Han J, Yang J, Yu PS (2006) Efficient classification across multiple database relations: a crossmine approach. IEEE Trans Knowl Data Eng 18(6):770–783CrossRefMATH Yin X, Han J, Yang J, Yu PS (2006) Efficient classification across multiple database relations: a crossmine approach. IEEE Trans Knowl Data Eng 18(6):770–783CrossRefMATH
65.
go back to reference Zhang H, Su J (2004) Conditional independence trees. In: ECML, pp 513–524 Zhang H, Su J (2004) Conditional independence trees. In: ECML, pp 513–524
Metadata
Title
Roller: a novel approach to Web information extraction
Authors
Patricia Jiménez
Rafael Corchuelo
Publication date
01-10-2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 1/2016
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-016-0921-4

Other articles of this Issue 1/2016

Knowledge and Information Systems 1/2016 Go to the issue

Premium Partner