Published in: The VLDB Journal 1/2013

01.02.2013 | Special Issue Paper

OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Authors: Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Sellers


Abstract

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.


Footnotes
1
However, classical results [41] on rewriting reverse axes such as ancestor in XPath do not extend to OXPath.
 
2
Thus, \((\textit{path})^{*}[qp] = \left(\bigcup_{i=0}^{\infty} \textit{path}^{i}\right)[qp]\) always holds, but \((\textit{path})^{*}[qp] = \bigcup_{i=0}^{\infty} \textit{path}^{i}[qp]\) does not necessarily hold, since there \([qp]\) is applied to each \(i\)-th copy of \(\textit{path}\) separately.
 
3
Simple OXPath is the restriction of OXPath to simple OXPath expressions, but we allow a doc() action at the start of the expression to set the document to be queried.
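
The distinction in footnote 2 can be illustrated with a toy model. The sketch below is not OXPath and does not reproduce its semantics: the four-node tree, the choice of child::* as the repeated path, and the reading of [qp] as the positional filter [1] are all assumptions made for illustration. It only shows that filtering the union of all iterations can differ from taking the union of per-iteration filters.

```python
# Toy document tree (hypothetical): each node maps to its children in document order.
CHILDREN = {"r": ["a", "b"], "a": ["c"], "b": [], "c": []}
DOC_ORDER = ["r", "a", "c", "b"]  # pre-order traversal of the tree above


def step(nodes):
    """One application of the toy path: child::* from every node in `nodes`."""
    return [child for node in nodes for child in CHILDREN[node]]


def iterations(start):
    """Return [path^0, path^1, ...] from `start` until no further nodes are reached."""
    levels, current = [], [start]
    while current:
        levels.append(current)
        current = step(current)
    return levels


def first(nodes):
    """Assumed stand-in for [qp]: keep only the first node in document order."""
    return sorted(set(nodes), key=DOC_ORDER.index)[:1]


levels = iterations("r")

# Filter applied once to the union of all iterations.
filter_of_union = first([n for level in levels for n in level])

# Filter applied to each iteration separately, results then united.
union_of_filters = sorted(
    {n for level in levels for n in first(level)}, key=DOC_ORDER.index
)

print(filter_of_union)   # ['r']
print(union_of_filters)  # ['r', 'a', 'c']
```

In this toy setting the two results differ because a positional filter does not distribute over union, which is the kind of mismatch the footnote warns about.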
 
References
6.
Alba, A., Bhagwan, V., Grandison, T.: Accessing the deep web: when good ideas go bad. In: OOPSLA (2008)
7.
Anton, T.: XPath-wrapper induction by generalizing tree traversal patterns. In: LWA (2005)
8.
Anupam, V., Freire, J., Kumar, B., Lieuwen, D.: Automating web navigation with the webvcr. In: WWW (2000)
9.
Arocena, G.O., Mendelzon, A.O.: Weboql: Restructuring documents, databases, and webs. In: ICDE (1998)
10.
Badica, C., Badica, A., Popescu, E., Abraham, A.: L-wrappers: concepts, properties and construction: A declarative approach to data extraction from web sources. Soft Comput. 11(8), 753–772 (2007)
11.
Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the Web. In: IJCAI (2007)
12.
Baumgartner, R., Flesca, S., Gottlob, G.: Visual web information extraction with Lixto. In: VLDB (2001)
13.
Benedikt, M., Koch, C.: XPath leashed. CSUR 41, 3:1–3:54 (2009)
14.
Bergman, M.K.: The deep web: Surfacing hidden value. J. Electron. Publ. 7(1), 1–17 (2001)
15.
Bigham, J.P., Cavender, A.C., Kaminsky, R.S., Prince, C.M., Robison, T.S.: Transcendence: enabling a personal view of the deep web. In: IUI (2008)
16.
Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: a scalable fully distributed web crawler. Softw. Practice Experience 34, 711–726 (2004)
17.
Bolin, M., Webber, M., Rha, P., Wilson, T., Miller, R.C.: Automation and customization of rendered web pages. In: UIST (2005)
18.
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
19.
Cafarella, M.J., Halevy, A.Y., Wang, D.Z., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1(1), 538–549 (2008)
20.
Centeno, V.L., Kloos, C.D., Fernández, L.S., García, N.F.: Intelligent automated navigation through the deep web. In: Advances in Web Intelligence (2004)
21.
Chang, C.-H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. TKDE 18(10), 1411–1428 (2006)
22.
Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: automatic data extraction from data-intensive web sites. In: SIGMOD (2002)
23.
Cafarella, M.J., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the Web: an experimental study. Artif. Intell. 165(1), 91–134 (2005)
24.
Furche, T., Gottlob, G., Grasso, G., Gunes, O., Guo, X., Kravchenko, A., Orsi, G., Schallhart, C., Sellers, A., Wang, C.: DIADEM: Domain-centric, intelligent, automated data extraction methodology. In: WWW (2012)
25.
Furche, T., Gottlob, G., Grasso, G., Schallhart, C., Sellers, A.: OXPath: A language for scalable, memory-efficient data extraction from web applications. PVLDB 4(11), 1016–1027 (2011)
26.
Gottlob, G., Koch, C., Pichler, R.: Efficient algorithms for processing XPath queries. In: TODS (2005)
27.
Gruhl, D., Chavet, L., Gibson, D., Meyer, J., Pattanayak, P., Tomkins, A., Zien, J.: How to build a webfountain: an architecture for very large-scale text analytics. IBM Syst. J. 43, 64–77 (2004)
28.
He, B., Patel, M., Zhang, Z., Chang, K.C.-C.: Accessing the deep web. Commun. ACM 50(5), 94–101 (2007)
29.
Heydon, A., Najork, M.: Mercator: a scalable, extensible web crawler. World Wide Web 2(4), 219–229 (1999)
30.
Kranzdorf, J., Sellers, A., Grasso, G., Schallhart, C., Furche, T.: Spotting the tracks on the OXPath. In: WWW (2012)
31.
Leshed, G., Haber, E.M., Matthews, T., Lau, T.: Coscripter: automating & sharing how-to knowledge in the enterprise. In: CHI (2008)
32.
Lin, J., Wong, J., Nichols, J., Cypher, A., Lau, T.A.: End-user programming of mashups with vegemite. In: IUI (2009)
33.
Liu, L., Pu, C., Han, W.: Xwrap: an xml-enabled wrapper construction system for web information sources. In: ICDE (2000)
34.
Liu, M., Ling, T.W.: A rule-based query language for html. In: DASFAA (2001)
35.
Marx, M.: Conditional XPath. ACM Trans. Database Syst. 30(4), 929–959 (2005)
36.
Marx, M., de Rijke, M.: Semantic characterizations of navigational XPath. ACM SIGMOD Rec. 34(2), 41–46 (2005)
37.
Mendelzon, A.O., Mihaila, G.A., Milo, T.: Querying the world wide web. Int. J. Digit. Libr. 1(1), 54–67 (1997)
38.
Mir, S., Staab, S., Rojas, I.: Web-prospector: an automatic, site-wide wrapper induction approach for scientific deep-web databases. In: BTW (2009)
39.
Montoto, P., Pan, A., Raposo, J., Bellas, F., López, J.: Automating navigation sequences in ajax websites. In: ICWE (2009)
40.
Myllymaki, J.: Effective web data extraction with standard xml technologies. Comput. Netw. 39(5), 635–644 (2002)
41.
Olteanu, D., Meuss, H., Furche, T., Bry, F.: XPath: looking Forward. In: EDBT-XML-Based Data Management, LNCS 2490 (2002)
42.
Raposo, J., Pan, A., Álvarez, M., Hidalgo, J., Viña, A.: The wargo system: semi-automatic wrapper generation in presence of complex data access modes. In: DEXA (2002)
43.
Safonov, A.: Web macros by example: users managing the www of applications. In: CHI, pp. 71–72. ACM (1999)
44.
Sahuguet, A., Azavant, F.: Building light-weight wrappers for legacy web data-sources using w4f. In: VLDB, pp. 738–741 (1999)
45.
Sawa, N., Morishima, A., Sugimoto, S., Kitagawa, H.: Wraplet: Wrapping your web contents with a lightweight language. In: SITIS, pp. 387–394 (2007)
46.
Shen, W., Doan, A., Naughton, J.F., Ramakrishnan, R.: Declarative information extraction using datalog with embedded extraction predicates. In: VLDB (2007)
47.
Su, J.-Y., Sun, D.-J., Wu, I.-C., Chen, L.-P.: On design of browser-oriented data extraction system and plug-ins. J. Mar. Sci. Technol. 18(2), 189–200 (2010)
48.
Wang, Y., Hornung, T.: Deep web navigation by example. Scalable Comput. Practice Experience 9, 281–292 (2008)
Metadata
Title
OXPath: A language for scalable data extraction, automation, and crawling on the deep web
Authors
Tim Furche
Georg Gottlob
Giovanni Grasso
Christian Schallhart
Andrew Sellers
Publication date
01.02.2013
Publisher
Springer-Verlag
Published in
The VLDB Journal / Issue 1/2013
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-012-0286-6
