2016 | Original Paper | Book chapter

Revisiting Web Data Extraction Using In-Browser Structural Analysis and Visual Cues in Modern Web Designs

Authors: Alfonso Murolo, Moira C. Norrie

Published in: Web Engineering

Publisher: Springer International Publishing

Abstract

Recent trends in website design affect the methods used for web data extraction. Many existing methods rely on structural analysis of web pages, but with the introduction of CSS, table-based layouts are no longer used. Responsive design, in turn, makes layout and presentation dependent on the browsing context, which complicates the use of visual cues. We present DeepDesign, a system that semi-automatically extracts data records from web pages based on a combination of structural and visual features. It runs in a general-purpose browser, taking advantage of direct access to the complete CSS3 spectrum and the ability to trigger and execute JavaScript in the page. The user sees record matching in real time and can adapt the process dynamically if required. We present the details of the matching algorithms and evaluate them on the top ten Alexa websites.
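The abstract does not spell out the matching algorithms, so the following is only an illustrative sketch of what "a combination of structural and visual features" could look like, not DeepDesign's actual method. It assumes that each candidate record is described by a flat sequence of tag names and by a dictionary of computed style properties. The function names, the use of edit distance over tag sequences, and the `weight` parameter are all hypothetical choices made for this example.

```python
from typing import Dict, List


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two sequences of tag names."""
    prev = list(range(len(b) + 1))
    for i, tag_a in enumerate(a, start=1):
        curr = [i]
        for j, tag_b in enumerate(b, start=1):
            cost = 0 if tag_a == tag_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def structural_similarity(a: List[str], b: List[str]) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical tag sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def visual_similarity(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Fraction of style properties (union of both sides) with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    same = sum(1 for k in keys if a.get(k) == b.get(k))
    return same / len(keys)


def record_similarity(tags_a: List[str], style_a: Dict[str, str],
                      tags_b: List[str], style_b: Dict[str, str],
                      weight: float = 0.5) -> float:
    """Weighted mix of both scores; `weight` is a hypothetical tuning knob."""
    return (weight * structural_similarity(tags_a, tags_b)
            + (1.0 - weight) * visual_similarity(style_a, style_b))
```

In a browser setting, the tag sequences would come from the DOM and the style dictionaries from the computed CSS of each element. Here they are plain Python values so that the scoring logic can be read and tried in isolation.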

Metadata
Title
Revisiting Web Data Extraction Using In-Browser Structural Analysis and Visual Cues in Modern Web Designs
Authors
Alfonso Murolo
Moira C. Norrie
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-38791-8_7