
2017 | OriginalPaper | Chapter

A Vision-Based Approach for Deep Web Form Extraction

Authors: Jiachen Pu, Jin Liu, Jin Wang

Published in: Advanced Multimedia and Ubiquitous Engineering

Publisher: Springer Singapore

Abstract

The World Wide Web is a large source of information whose data resides in either the Surface Web or the Deep Web. Compared with the Surface Web, the Deep Web contains a greater amount of structured, higher-quality data, but this data is difficult to use directly. Prior studies have proposed several methods for Deep Web form extraction, which fall into categories such as HTML-based, vision-based, ontology-based, ML-based, and NLP-based approaches. This paper combines the DOM tree with a convolutional neural network to locate the form in a Web page. It proposes a vision-based method, VBF, which identifies the form by acquiring the HTML code and a screenshot of the Web page, building the DOM tree, running the neural network, and then performing form recognition, matching, and generation.
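As a rough illustration of the DOM-tree stage only, the sketch below parses HTML into a simple tree and collects the named fields under each `<form>` element. It is not the VBF implementation: the names `Node`, `DomBuilder`, and `find_form_candidates` are hypothetical, and the screenshot and neural-network stages are omitted because the abstract does not specify them.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}
FIELD_TAGS = {"input", "select", "textarea"}


class Node:
    """Minimal DOM node (hypothetical helper, not taken from the paper)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a tree of Node objects from the start/end tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <input/>: attach the node, do not descend.
        self.current.children.append(Node(tag, attrs, self.current))

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def iter_nodes(node):
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_form_candidates(html):
    """Return one dict per <form>, listing its action and named fields."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    candidates = []
    for node in iter_nodes(builder.root):
        if node.tag == "form":
            fields = [n.attrs["name"] for n in iter_nodes(node)
                      if n.tag in FIELD_TAGS and n.attrs.get("name")]
            candidates.append({"action": node.attrs.get("action"), "fields": fields})
    return candidates
```

In a pipeline like the one the abstract outlines, candidates of this kind would presumably still need to be matched against regions detected in the page screenshot; how that matching is done is not stated here.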

Metadata
Title
A Vision-Based Approach for Deep Web Form Extraction
Authors
Jiachen Pu
Jin Liu
Jin Wang
Copyright Year
2017
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-10-5041-1_111