Published in: The VLDB Journal 5/2013

1 October 2013 | Special Issue Paper

The ontological key: automatically understanding and integrating forms to access the deep Web

Authors: Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart


Abstract

Forms are our gates to the Web. They enable us to access the deep content of Web sites. Automatic form understanding provides applications, ranging from crawlers and meta-search engines to service integrators, with a key to this content. Yet it has received little attention other than as a component of specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data. In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems, OPAL advances the state of the art. For form labeling, it combines features from the text, structure, and visual rendering of a Web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern Web forms, OPAL outperforms previous approaches to form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect form interpretations (>97 % accuracy in the evaluation domains). Yet the effort to produce a domain schema is very low, as we provide a datalog-based template language that eases the specification of such schemata and a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of OPAL's form interpretations through a light-weight form integration system that successfully translates and distributes master queries to hundreds of forms with no error, yet is implemented with only a handful of translation rules.
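
To make the idea of distributing a master query concrete, the following is a minimal sketch and not the system described in the paper. It assumes that form interpretation has already produced, for each form, a mapping from domain-schema concepts to that form's field names. The class `InterpretedForm`, the function `translate`, the site names, and all concept and field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class InterpretedForm:
    """Hypothetical result of form interpretation for one site."""
    site: str
    fields: Dict[str, str]  # domain concept -> concrete form field name


def translate(master_query: Dict[str, object], form: InterpretedForm) -> Dict[str, object]:
    """Rewrite a master query (concept -> value) into one form's field names.

    Concepts that the form does not offer are skipped.
    """
    return {
        form.fields[concept]: value
        for concept, value in master_query.items()
        if concept in form.fields
    }


# Two made-up forms from a made-up real-estate domain.
forms = [
    InterpretedForm("site-a.example", {"price_min": "minPrice", "location": "loc"}),
    InterpretedForm(
        "site-b.example",
        {"price_min": "from", "location": "town", "bedrooms": "beds"},
    ),
]

master = {"location": "Oxford", "price_min": 200000, "bedrooms": 2}

# One translated field assignment per form, keyed by site.
filled = {form.site: translate(master, form) for form in forms}
```

The point of the sketch is that once every form is described in terms of the same domain concepts, a single generic rule can serve many forms. How OPAL derives those descriptions and what its translation rules look like is not specified in the abstract.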


Metadata
Title
The ontological key: automatically understanding and integrating forms to access the deep Web
Authors
Tim Furche
Georg Gottlob
Giovanni Grasso
Xiaonan Guo
Giorgio Orsi
Christian Schallhart
Publication date
1 October 2013
Publisher
Springer Berlin Heidelberg
Published in
The VLDB Journal / Issue 5/2013
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-013-0323-0
