2015 | OriginalPaper | Book chapter

On Extracting Information from Semi-structured Deep Web Documents

Authors: Patricia Jiménez, Rafael Corchuelo

Published in: Business Information Systems

Publisher: Springer International Publishing

Abstract

Some software agents need information that web sites provide, which is difficult to obtain when those sites lack a query API. Information extractors are intended to extract the information of interest automatically and offer it in a structured format. Unfortunately, most of them rely on ad hoc techniques, so they become obsolete as the Web evolves. In this paper, we present a proposal that relies on an open catalogue of features, which allows it to be adapted easily; we have also devised an optimisation that makes it very efficient. Our experimental results show that our proposal outperforms other state-of-the-art proposals.
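
The abstract does not say how the catalogue of features is organised. Purely as an illustration of what "open" could mean in practice, the following minimal Python sketch keeps features in a registry, so a new one can be added without touching the code that computes feature vectors. The `Node` class, the `feature` decorator, and the three example features are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Node:
    """Hypothetical stand-in for a node of a web document (not from the paper)."""
    tag: str
    text: str
    depth: int


# The catalogue maps a feature name to the function that computes it.
CATALOGUE: Dict[str, Callable[[Node], float]] = {}


def feature(name: str):
    """Register a feature function in the catalogue under the given name."""
    def decorator(fn: Callable[[Node], float]) -> Callable[[Node], float]:
        CATALOGUE[name] = fn
        return fn
    return decorator


@feature("depth")
def depth(node: Node) -> float:
    # Depth of the node in the document tree.
    return float(node.depth)


@feature("text_length")
def text_length(node: Node) -> float:
    # Number of characters in the node's text.
    return float(len(node.text))


@feature("digit_ratio")
def digit_ratio(node: Node) -> float:
    # Fraction of characters that are digits; 0.0 for empty text.
    if not node.text:
        return 0.0
    return sum(c.isdigit() for c in node.text) / len(node.text)


def featurize(node: Node) -> Dict[str, float]:
    """Compute every registered feature for a node."""
    return {name: fn(node) for name, fn in CATALOGUE.items()}


# Example: a node that might hold a price.
price = Node(tag="span", text="19.99", depth=4)
vector = featurize(price)
```

In this sketch, adapting the extractor to a changed web site would amount to registering another decorated function; `featurize` picks it up automatically.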

Metadata
Title
On Extracting Information from Semi-structured Deep Web Documents
Authors
Patricia Jiménez
Rafael Corchuelo
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-19027-3_12