World Wide Web

Published: 05-09-2018

A novel approach for Web page modeling in personal information extraction

Authors: Wei Yuliang, Zhou Qi, Lv Fang, Han Xixian, Xin Guodong, Wang Bailing

Published in: World Wide Web | Issue 2/2019

Abstract

The goal of personal information extraction (PIE) is to extract content associated with a name from Web pages. Available Web page models, which are also widely used in content extraction and automatic wrapper algorithms, include the text model, the document object model, and the vision-based page segmentation model. Because existing models focus on Web structure rather than semantic relevance, they are difficult to apply directly to PIE. To address this problem, we introduce the sequence block model (SBM), which makes it easy to determine the relevance of each page block to the retrieval name. We then define PIE based on the SBM. Building on the sequence correlation of the SBM, we design a 4-layer seq2seq deep learning network for PIE. Experimental results show that our new model extracts twice as much data as content extraction algorithms, and that the recall rate of the network is 7% higher than that of the traditional model with a classification algorithm.
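To make the idea of treating a page as an ordered sequence of blocks and judging each block's relevance to a name more concrete, here is a minimal Python sketch. It is not the authors' SBM or their seq2seq network, since the abstract does not specify either. The blank-line segmentation rule, the `Block` class, the `window` parameter, and the substring-matching heuristic are all assumptions made for illustration. The `recall` helper shows the standard recall definition, which is the kind of metric the abstract reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One text block of a page, kept in document order (hypothetical structure)."""
    index: int
    text: str


def page_to_blocks(page_text: str) -> List[Block]:
    """Split plain page text into an ordered list of non-empty blocks.

    Assumption: blocks are separated by blank lines. The segmentation
    rule actually used by the paper is not given in the abstract.
    """
    kept = [chunk.strip() for chunk in page_text.split("\n\n") if chunk.strip()]
    return [Block(i, text) for i, text in enumerate(kept)]


def label_blocks(blocks: List[Block], name: str, window: int = 1) -> List[int]:
    """Return one 0/1 relevance label per block.

    Naive stand-in for a learned sequence model: a block is labelled 1
    if it, or a neighbour within `window` positions, mentions the name.
    """
    hits = [name.lower() in block.text.lower() for block in blocks]
    labels = []
    for i in range(len(blocks)):
        start = max(0, i - window)
        stop = min(len(blocks), i + window + 1)
        labels.append(1 if any(hits[start:stop]) else 0)
    return labels


def recall(predicted: List[int], gold: List[int]) -> float:
    """Recall = true positives / all gold positives (0.0 if there are none)."""
    true_positives = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    positives = sum(gold)
    return true_positives / positives if positives else 0.0


if __name__ == "__main__":
    page = (
        "Site navigation\n\n"
        "Wei Yuliang is a researcher.\n\n"
        "Email: wei@example.org\n\n"
        "Copyright notice\n\n"
        "Advertisement"
    )
    blocks = page_to_blocks(page)
    predicted = label_blocks(blocks, "Wei Yuliang")
    print(predicted)
    print(recall(predicted, [0, 1, 1, 0, 0]))
```

The neighbour window is a crude way to capture the fact that blocks near a name mention, such as a contact line that omits the name, are often about the same person. A learned sequence model would replace this heuristic.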

Title: A novel approach for Web page modeling in personal information extraction
Authors: Wei Yuliang, Zhou Qi, Lv Fang, Han Xixian, Xin Guodong, Wang Bailing
Publication date: 05-09-2018
Publisher: Springer US
Published in: World Wide Web / Issue 2/2019
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI: https://doi.org/10.1007/s11280-018-0631-9
