Skip to main content
Top
Published in: World Wide Web 2/2019

05-09-2018

A novel approach for Web page modeling in personal information extraction

Authors: Wei Yuliang, Zhou Qi, Lv Fang, Han Xixian, Xin Guodong, Wang Bailing

Published in: World Wide Web | Issue 2/2019

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The target of personal information extraction (PIE) is to extract content associated with a name form Web pages. Available Web page models, which are also used widely in content extraction and automatic wrapper algorithms, include text model, document object model, and vision-based page segmentation model. Because of existing models focus on Web structure rather than semantic relevance, they are difficult to be directly used for PIE. To deal with this problem, we introduce the sequence block model (SBM), by which is easy to determine the relevance of each page block to the retrieval name. Then, we give the definition of PIE based on the SBM. Depending on the sequence correlation of SBM, we design a 4-layer seq2seq deep learning network for PIE. Experiment result shows that our new model extracts twice as much data as content extraction algorithms. And the recall rate of the network is 7% higher than the traditional model with classification algorithm.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Banu, A., Chitra, M.: Dwde-ir: an efficient deep Web data extraction for information retrieval on Web mining. J. Emerg. Technol. Web Intell. 6(1), 133–141 (2014) Banu, A., Chitra, M.: Dwde-ir: an efficient deep Web data extraction for information retrieval on Web mining. J. Emerg. Technol. Web Intell. 6(1), 133–141 (2014)
2.
go back to reference Bartoli, A., De Lorenzo, A., Medvet, E., Tarlao, F.: Inference of regular expressions for text extraction from examples. IEEE Trans. Knowl. Data Eng. 28(5), 1217–1230 (2016)CrossRef Bartoli, A., De Lorenzo, A., Medvet, E., Tarlao, F.: Inference of regular expressions for text extraction from examples. IEEE Trans. Knowl. Data Eng. 28(5), 1217–1230 (2016)CrossRef
3.
go back to reference Bu, Z., Zhang, C., Xia, Z., Wang, J.: An far-sw based approach for Webpage information extraction. Inf. Syst. Front. 16(5), 771–785 (2014)CrossRef Bu, Z., Zhang, C., Xia, Z., Wang, J.: An far-sw based approach for Webpage information extraction. Inf. Syst. Front. 16(5), 771–785 (2014)CrossRef
4.
go back to reference Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: VIPS: vision-based page segmentation algorithm. Microsoft Technical Report, MSR-TR-2003-79 (2003) Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: VIPS: vision-based page segmentation algorithm. Microsoft Technical Report, MSR-TR-2003-79 (2003)
5.
go back to reference Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of Web information extraction systems. IEEE Trans. Knowl. Data Eng. 18(10), 1411–1428 (2006)CrossRef Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of Web information extraction systems. IEEE Trans. Knowl. Data Eng. 18(10), 1411–1428 (2006)CrossRef
8.
go back to reference Dalvi, N., Kumar, R., Soliman, M.: Automatic wrappers for large scale Web extraction. Proc. VLDB Endow. 4(4), 219–230 (2011)CrossRef Dalvi, N., Kumar, R., Soliman, M.: Automatic wrappers for large scale Web extraction. Proc. VLDB Endow. 4(4), 219–230 (2011)CrossRef
9.
go back to reference Doddington, G.R., Mitchell, A., Przybocki, M.A., Ramshaw, L.A., Strassel, S., Weischedel, R.M.: The automatic content extraction (ace) program-tasks, data, and evaluation. In: LREC, vol. 2, pp. 837–840 (2004) Doddington, G.R., Mitchell, A., Przybocki, M.A., Ramshaw, L.A., Strassel, S., Weischedel, R.M.: The automatic content extraction (ace) program-tasks, data, and evaluation. In: LREC, vol. 2, pp. 837–840 (2004)
10.
go back to reference Ferrara, E., De Meo, P., Fiumara, G., Baumgartner, R.: Web data extraction, applications and techniques: a survey. Knowl.-Based Syst. 70, 301–323 (2014)CrossRef Ferrara, E., De Meo, P., Fiumara, G., Baumgartner, R.: Web data extraction, applications and techniques: a survey. Knowl.-Based Syst. 70, 301–323 (2014)CrossRef
11.
go back to reference Flesca, S., Manco, G., Masciari, E., Rende, E., Tagarelli, A.: Web wrapper induction: a brief survey. AI Commun. 17(2), 57–61 (2004) Flesca, S., Manco, G., Masciari, E., Rende, E., Tagarelli, A.: Web wrapper induction: a brief survey. AI Commun. 17(2), 57–61 (2004)
12.
go back to reference Furche, T., Gottlob, G., Grasso, G., Schallhart, C., Sellers, A.: Oxpath: a language for scalable data extraction, automation, and crawling on the deep Web. VLDB J. 22(1), 47–72 (2013)CrossRef Furche, T., Gottlob, G., Grasso, G., Schallhart, C., Sellers, A.: Oxpath: a language for scalable data extraction, automation, and crawling on the deep Web. VLDB J. 22(1), 47–72 (2013)CrossRef
13.
go back to reference Gao, L., Guo, Z., Zhang, H., Xu, X., Shen, H.T.: Video captioning with attention-based lstm and semantic consistency. IEEE Trans. Multimedia 19(9), 2045–2055 (2017)CrossRef Gao, L., Guo, Z., Zhang, H., Xu, X., Shen, H.T.: Video captioning with attention-based lstm and semantic consistency. IEEE Trans. Multimedia 19(9), 2045–2055 (2017)CrossRef
14.
go back to reference Gogar, T., Hubacek, O., Sedivy, J.: Deep neural networks for Web page information extraction. In: IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 154–163. Springer, Berlin (2016) Gogar, T., Hubacek, O., Sedivy, J.: Deep neural networks for Web page information extraction. In: IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 154–163. Springer, Berlin (2016)
15.
go back to reference Grigalis, T., Radvilavičius, L., Čenys, A., Gordevičius, J.: Clustering visually similar Web page elements for structured Web data extraction. In: Web Engineering, pp. 435–438 (2012) Grigalis, T., Radvilavičius, L., Čenys, A., Gordevičius, J.: Clustering visually similar Web page elements for structured Web data extraction. In: Web Engineering, pp. 435–438 (2012)
16.
go back to reference Hadnagy, C.: Social Engineering: the Art of Human Hacking. Wiley, New York (2010) Hadnagy, C.: Social Engineering: the Art of Human Hacking. Wiley, New York (2010)
20.
go back to reference Krishna, S.S., Dattatraya, J.S.: Schema inference and data extraction from templatized Web pages. In: 2015 International Conference on Pervasive Computing (ICPC), pp. 1–6. IEEE (2015) Krishna, S.S., Dattatraya, J.S.: Schema inference and data extraction from templatized Web pages. In: 2015 International Conference on Pervasive Computing (ICPC), pp. 1–6. IEEE (2015)
21.
go back to reference Kushmerick, N.: Finite-state approaches to Web information extraction. In: Lecture Notes in Computer Science, pp. 77–91 (2003) Kushmerick, N.: Finite-state approaches to Web information extraction. In: Lecture Notes in Computer Science, pp. 77–91 (2003)
22.
go back to reference Li, Y., Krishnamurthy, R., Raghavan, S., Vaithyanathan, S., Jagadish, H.: Regular expression learning for information extraction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 21–30. Association for Computational Linguistics (2008) Li, Y., Krishnamurthy, R., Raghavan, S., Vaithyanathan, S., Jagadish, H.: Regular expression learning for information extraction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 21–30. Association for Computational Linguistics (2008)
23.
go back to reference Li, J.Q., Zhao, Y., Garcia-Molina, H.: A path-based approach for Web page retrieval. World Wide Web 15(3), 257–283 (2012)CrossRef Li, J.Q., Zhao, Y., Garcia-Molina, H.: A path-based approach for Web page retrieval. World Wide Web 15(3), 257–283 (2012)CrossRef
24.
go back to reference Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003-Volume 4, pp. 33–40. Association for Computational Linguistics (2003) Mann, G.S., Yarowsky, D.: Unsupervised personal name disambiguation. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003-Volume 4, pp. 33–40. Association for Computational Linguistics (2003)
25.
go back to reference Saleh, A.I., Al Rahmawy, M.F., Abulwafa, A.E.: A semantic based Web page classification strategy using multi-layered domain ontology. World Wide Web 20(5), 939–993 (2017)CrossRef Saleh, A.I., Al Rahmawy, M.F., Abulwafa, A.E.: A semantic based Web page classification strategy using multi-layered domain ontology. World Wide Web 20(5), 939–993 (2017)CrossRef
26.
go back to reference Sanoja, A., Gancarski, S.: Block-o-matic: a Web page segmentation framework. In: 2014 International Conference on Multimedia Computing and Systems (ICMCS), pp. 595–600. IEEE (2014) Sanoja, A., Gancarski, S.: Block-o-matic: a Web page segmentation framework. In: 2014 International Conference on Multimedia Computing and Systems (ICMCS), pp. 595–600. IEEE (2014)
27.
go back to reference Sleiman, H.A., Corchuelo, R.: Tex: an efficient and effective unsupervised Web information extractor. Knowl.-Based Syst. 39, 109–123 (2013)CrossRef Sleiman, H.A., Corchuelo, R.: Tex: an efficient and effective unsupervised Web information extractor. Knowl.-Based Syst. 39, 109–123 (2013)CrossRef
28.
go back to reference Song, D., Sun, F., Liao, L.: A hybrid approach for content extraction with text density and visual importance of dom nodes. Knowl. Inf. Syst. 42(1), 75–96 (2015)CrossRef Song, D., Sun, F., Liao, L.: A hybrid approach for content extraction with text density and visual importance of dom nodes. Knowl. Inf. Syst. 42(1), 75–96 (2015)CrossRef
29.
go back to reference Song, J., Zhang, H., Li, X., Gao, L., Wang, M., Hong, R.: Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Trans. Image Process. 27 (7), 3210–3221 (2018)MathSciNetCrossRefMATH Song, J., Zhang, H., Li, X., Gao, L., Wang, M., Hong, R.: Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Trans. Image Process. 27 (7), 3210–3221 (2018)MathSciNetCrossRefMATH
30.
go back to reference Thamviset, W., Wongthanavasu, S.: Information extraction for deep Web using repetitive subject pattern. World Wide Web 17(5), 1109–1139 (2014)CrossRef Thamviset, W., Wongthanavasu, S.: Information extraction for deep Web using repetitive subject pattern. World Wide Web 17(5), 1109–1139 (2014)CrossRef
31.
go back to reference Vijendran, A.S., Deepa, C.: LBDA: a novel framework for extracting content from Web pages. In: 2013 International Conference on Advanced Computing & Communication Systems (ICACCS), pp. 1–7. IEEE (2013) Vijendran, A.S., Deepa, C.: LBDA: a novel framework for extracting content from Web pages. In: 2013 International Conference on Advanced Computing & Communication Systems (ICACCS), pp. 1–7. IEEE (2013)
32.
go back to reference Wei, Y., Wang, B., Liu, Y., Lv, F.: Research on Webpage similarity computing technology based on visual blocks. In: Chinese National Conference on Social Media Processing, pp. 187–197 (2014) Wei, Y., Wang, B., Liu, Y., Lv, F.: Research on Webpage similarity computing technology based on visual blocks. In: Chinese National Conference on Social Media Processing, pp. 187–197 (2014)
33.
go back to reference Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1(2), 270–280 (1989)CrossRef Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1(2), 270–280 (1989)CrossRef
34.
go back to reference Wu, G., Li, L., Hu, X., Wu, X.: Web news extraction via path ratios. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2059–2068. ACM (2013) Wu, G., Li, L., Hu, X., Wu, X.: Web news extraction via path ratios. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2059–2068. ACM (2013)
35.
go back to reference Xu, X., Shen, F., Yang, Y., Shen, H.T., Li, X.: Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Trans. Image Process. 26(5), 2494–2507 (2017)MathSciNetCrossRefMATH Xu, X., Shen, F., Yang, Y., Shen, H.T., Li, X.: Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Trans. Image Process. 26(5), 2494–2507 (2017)MathSciNetCrossRefMATH
36.
go back to reference Xu, X., He, L., Lu, H., Gao, L., Ji, Y.: Deep adversarial metric learning for cross-modal retrieval. World Wide Web, pp. 1–16 (2018) Xu, X., He, L., Lu, H., Gao, L., Ji, Y.: Deep adversarial metric learning for cross-modal retrieval. World Wide Web, pp. 1–16 (2018)
37.
go back to reference Zhang, C., Liu, C., Zhang, X., Almpanidis, G.: An up-to-date comparison of state-of-the-art classification algorithms. Expert. Syst. Appl. 82, 128–150 (2017)CrossRef Zhang, C., Liu, C., Zhang, X., Almpanidis, G.: An up-to-date comparison of state-of-the-art classification algorithms. Expert. Syst. Appl. 82, 128–150 (2017)CrossRef
Metadata
Title
A novel approach for Web page modeling in personal information extraction
Authors
Wei Yuliang
Zhou Qi
Lv Fang
Han Xixian
Xin Guodong
Wang Bailing
Publication date
05-09-2018
Publisher
Springer US
Published in
World Wide Web / Issue 2/2019
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-018-0631-9

Other articles of this Issue 2/2019

World Wide Web 2/2019 Go to the issue

Premium Partner