2017 | OriginalPaper | Chapter

A Vision-Based Approach for Deep Web Form Extraction

Authors: Jiachen Pu, Jin Liu, Jin Wang

Published in: Advanced Multimedia and Ubiquitous Engineering

Publisher: Springer Singapore

Abstract

The World Wide Web is a large source of information whose data resides in either the Surface Web or the Deep Web. Compared with the Surface Web, the Deep Web contains a greater amount of structured, higher-quality data, but that data is difficult to use directly. Existing methods for Deep Web form extraction fall into several categories, including HTML-based, vision-based, ontology-based, ML-based, and NLP-based approaches. This paper combines the DOM tree with a convolutional neural network to locate the form in a Web page. It proposes a vision-based method, VBF, which identifies the form by acquiring the HTML code and screenshots of Web pages, building the DOM tree, running the neural network, and then performing form recognition, matching, and generation.
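
The abstract does not give implementation details, so the following is only a minimal sketch of the DOM-tree step, not the authors' VBF method. It assumes the HTML source has already been fetched, builds a simplified DOM tree with Python's standard `html.parser`, and lists the fields under each `<form>` element. The names `Node`, `DomBuilder`, `iter_nodes`, and `find_form_candidates` are hypothetical, and the screenshot and convolutional-network stages are deliberately left out.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper type)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML source (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self._stack[-1])
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray closers.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break


def iter_nodes(node):
    """Depth-first traversal of the tree, starting at `node`."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_form_candidates(html):
    """Return, for each <form>, its action and the names of its fields."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    forms = []
    for node in iter_nodes(builder.root):
        if node.tag == "form":
            fields = [n.attrs.get("name") for n in iter_nodes(node)
                      if n.tag in {"input", "select", "textarea"}]
            forms.append({"action": node.attrs.get("action"), "fields": fields})
    return forms
```

In a vision-based pipeline of the kind the abstract outlines, candidates like these would presumably be matched against regions detected in the page screenshot; how VBF performs that matching is not stated in the abstract.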

Title: A Vision-Based Approach for Deep Web Form Extraction
Authors: Jiachen Pu, Jin Liu, Jin Wang
Publisher: Springer Singapore
Book: Advanced Multimedia and Ubiquitous Engineering
Print ISBN: 978-981-10-5040-4

Electronic ISBN: 978-981-10-5041-1

Copyright Year: 2017
DOI: https://doi.org/10.1007/978-981-10-5041-1_111
