Top

Published in:

2016 | OriginalPaper | Chapter

Revisiting Web Data Extraction Using In-Browser Structural Analysis and Visual Cues in Modern Web Designs

Authors : Alfonso Murolo, Moira C. Norrie

Published in: Web Engineering

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Recent trends in website design have an impact on methods used for web data extraction. Many existing methods rely on structural analysis of web pages and, with the introduction of CSS, table-based layouts are no longer used, while responsive design means that layout and presentation are dependent on browsing context which also makes the use of visual clues more complex. We present DeepDesign, a system that semi-automatically extracts data records from web pages based on a combination of structural and visual features. It runs in a general-purpose browser, taking advantage of direct access to the complete CSS3 spectrum and the capability to trigger and execute JavaScript in the page. The user sees record matching in real-time and dynamically adapts the process if required. We present the details of the matching algorithms and provide an evaluation of them based on the top ten Alexa websites.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

previous chapter CTAT: Tilt-and-Tap Across Devices

next chapter Clustering-Aided Page Object Generation for Web Testing

http://www.alexa.com.

https://wordpress.org/.

http://www.schema.org.

http://www.alexa.com/topsites - 15.10.2015.

http://dev.globis.ethz.ch/deepdesign/DDpreliminaryeval.zip.

https://github.com/hoonto/jqgram - 15.10.2015.

Zheng, S., Song, R., Wen, J., Giles, C.L.: Efficient record-level wrapper induction. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM (2009)

Liu, W., Meng, X., Meng, W.: Vision-based web data records extraction. In: Proceedings 9th International Workshop on the Web and Databases (2006)

Manabe, T., Tajima, K.: Extracting logical hierarchical structure of HTML documents based on headings. Proc. VLDB Endowment 8(12), 1606–1617 (2015)CrossRef

Pembe, F., Canan, F., Güngör, T.: A tree learning approach to web document sectional hierarchy extraction. In: Proceedings of the 2nd International Conference on Agents and Artificial Intelligence (2010)

Geel, M., Church, T., Norrie, M.C.: Sift: an end-user tool for gathering web content on the go. In: Proceedings of the 2012 ACM Symposium on Document Engineering, pp. 181–190. ACM (2012)

Murolo, A., Norrie, M.C.: Deriving custom post types from digital mockups. In: Cimiano, P., Frasincar, F., Houben, G.-J., Schwabe, D. (eds.) ICWE 2015. LNCS, vol. 9114, pp. 71–80. Springer, Heidelberg (2015)CrossRef

Chang, C., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng. 18(10), 1411–1428 (2006)CrossRef

Adelberg, B.: NoDoSE a tool for semi-automatically extracting structured and semistructured data from text documents. In: Proceedings of the 9th ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM (1998)

Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann (2001)

10.

Chang, C., Lui, S.: IEPAD: information extraction based on pattern discovery. In: Proceedings of the 10th International Conference on World Wide Web (WWW). ACM (2001)

11.

Wang, J., Lochovsky, F.H.: Data extraction and label assignment for web databases. In: Proceedings of the 12th International Conference on World Wide Web (WWW). ACM (2003)

12.

Zhai, Y., Liu, B.: Structured data extraction from the web based on partial tree alignment. IEEE Trans. Knowl. Data Eng. 18(12), 1614–1628 (2006)CrossRef

13.

Lu, Y., He, H., Zhao, H., Meng, W., Yu, C.: Annotating search results from web databases. IEEE Trans. Knowl. Data Eng. 25(3), 514–527 (2013)CrossRef

14.

Zhao, H., Meng, W., Wu, Z., Raghavan, V., Yu, C.: Fully automatic wrapper generation for search engines. In: Proceedings of the 14th International Conference on World Wide Web. ACM (2005)

15.

Hong, J.L., Siew, E., Egerton, S.: ViWER-Data extraction for search engine results pages using visual cue and dom tree. In: Proceedings of the 1st International Conference on Information Retrieval & Knowledge Management (CAMP). IEEE (2010)

16.

Liu, W., Meng, X., Meng, W.: Vide: a vision-based approach for deep web data extraction. IEEE Trans. Knowl. Data Eng. 22(3), 447–460 (2010)CrossRef

17.

Laender, A.H., Ribeiro-Neto, B., da Silva, A.S.: DEByE - data extraction by example. Data Knowl. Eng. 40(2), 121–154 (2002)CrossRefMATH

18.

Chang, C., Kuo, S.: OLERA: semisupervised web-data extraction with visual support. IEEE Intell. Syst. 19(6), 56–64 (2004)CrossRef

19.

Hogue, A., Karger, D.: Thresher: automating the unwrapping of semantic content from the world wide web. In: Proceedings of the 14th International Conference on World Wide Web. ACM (2005)

20.

Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10, 707–710 (1966)MathSciNetMATH

21.

Augsten, N., Böhlen, M., Gamper, J.: Approximate matching of hierarchical data using Pq-Grams. In: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), VLDB Endowment (2005)

22.

Sakai, S., Togasaki, M., Yamazaki, K.: A note on greedy algorithms for the maximum weighted independent set problem. Discrete Appl. Math. 126(2), 313–322 (2003)MathSciNetCrossRefMATH

23.

Demange, M.: A note on the approximation of a minimum-weight maximal independent set. Comput. Optim. Appl. 14(1), 157–169 (1999)MathSciNetCrossRefMATH

Title: Revisiting Web Data Extraction Using In-Browser Structural Analysis and Visual Cues in Modern Web Designs
Authors: Alfonso Murolo
Moira C. Norrie
Publisher: Springer International Publishing
Book: Web Engineering
Print ISBN: 978-3-319-38790-1

Electronic ISBN: 978-3-319-38791-8

Copyright Year: 2016
DOI: https://doi.org/10.1007/978-3-319-38791-8_7

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner