
2016 | OriginalPaper | Chapter

Revisiting Web Data Extraction Using In-Browser Structural Analysis and Visual Cues in Modern Web Designs

Authors: Alfonso Murolo, Moira C. Norrie

Published in: Web Engineering

Publisher: Springer International Publishing


Abstract

Recent trends in website design affect the methods used for web data extraction. Many existing methods rely on structural analysis of web pages, but since the introduction of CSS, table-based layouts are no longer used. Responsive design, in turn, makes layout and presentation depend on the browsing context, which complicates the use of visual cues. We present DeepDesign, a system that semi-automatically extracts data records from web pages based on a combination of structural and visual features. It runs in a general-purpose browser, taking advantage of direct access to the complete CSS3 spectrum and the capability to trigger and execute JavaScript in the page. The user sees record matching in real time and dynamically adapts the process if required. We present the details of the matching algorithms and evaluate them on the top ten Alexa websites.
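To make the idea of combining structural and visual features concrete, here is a minimal sketch in Python. It is not the DeepDesign matching algorithm, which the abstract does not specify. It assumes that each candidate element has already been reduced, in the browser, to a sequence of tag names and a small dictionary of computed style values. The names `Element`, `record_similarity` and `structure_weight`, the use of edit distance for the structural part, and the equal default weighting are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class Element:
    """Hypothetical features of one candidate record, collected in the browser."""
    tags: Tuple[str, ...]   # tag names of the element's subtree, in document order
    style: Dict[str, str]   # a few computed CSS properties, e.g. font-size


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def structural_similarity(a: Element, b: Element) -> float:
    """1.0 for identical tag sequences, 0.0 when nothing lines up."""
    longest = max(len(a.tags), len(b.tags))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.tags, b.tags) / longest


def visual_similarity(a: Element, b: Element) -> float:
    """Fraction of style properties on which both elements agree."""
    keys = set(a.style) | set(b.style)
    if not keys:
        return 1.0
    same = sum(1 for k in keys if a.style.get(k) == b.style.get(k))
    return same / len(keys)


def record_similarity(a: Element, b: Element, structure_weight: float = 0.5) -> float:
    """Weighted mix of the two scores; the weight is an assumed tuning knob."""
    return (structure_weight * structural_similarity(a, b)
            + (1.0 - structure_weight) * visual_similarity(a, b))
```

In such a setup, a user-selected example record would be compared against every candidate element, and candidates scoring above some threshold would be proposed as matching records. Because the weight is adjustable, the user could shift the emphasis toward visual features on pages where the markup varies between records.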


Metadata
Title: Revisiting Web Data Extraction Using In-Browser Structural Analysis and Visual Cues in Modern Web Designs
Authors: Alfonso Murolo, Moira C. Norrie
Copyright Year: 2016
DOI: https://doi.org/10.1007/978-3-319-38791-8_7
