Published in: International Journal of Machine Learning and Cybernetics 9/2018

28-03-2017 | Original Article

A learning framework for information block search based on probabilistic graphical models and Fisher Kernel

Authors: Tak-Lam Wong, Haoran Xie, Wai Lam, Fu Lee Wang

Abstract

Unlike traditional Web information retrieval methods, which return only a ranked list of Web pages and accept only search terms in the query, we have developed a novel learning framework for retrieving precise information blocks from Web pages given a query, which may contain search terms as well as prior information such as the layout format of the data. The problem involves two challenging sub-tasks. The first is information block detection, in which a Web page is automatically segmented into blocks. The second is finding the information blocks relevant to the query. Existing page segmentation methods, which use only visual layout information or only content information, do not consider the query, so the resulting segmentation can conflict with the information need the query expresses. Our framework models the query and the block features with a probabilistic graphical model that captures both keyword information and prior information. A Fisher Kernel, which can effectively incorporate the graphical model, is then employed to accomplish the two sub-tasks in a unified manner, directly optimizing the final goal of block retrieval performance. We have conducted experiments on benchmark datasets and real-world data, and have compared our framework with existing methods to evaluate its effectiveness.
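To make the role of the Fisher Kernel more concrete, the following is a minimal sketch of the general idea, not the authors' model. A Fisher Kernel compares two items by the gradients of their log-likelihood under a generative model (their Fisher scores), K(x, y) = U_x^T I^{-1} U_y, where I is the Fisher information. The abstract does not specify the graphical model, so this sketch assumes a stand-in unigram (multinomial) model over a hypothetical four-term vocabulary and approximates I by the identity matrix, a common simplification. All names and toy counts are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert unnormalized logits into term probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fisher_score(counts, logits):
    """Gradient of log p(counts | logits) with respect to the logits.

    For a multinomial model with softmax parameters, the gradient for
    term j is counts[j] - n * p[j], where n is the total term count.
    """
    probs = softmax(logits)
    n = sum(counts)
    return [c - n * p for c, p in zip(counts, probs)]


def fisher_kernel(counts_a, counts_b, logits):
    """Fisher Kernel with the Fisher information approximated by the
    identity (an assumption made here for simplicity)."""
    score_a = fisher_score(counts_a, logits)
    score_b = fisher_score(counts_b, logits)
    return sum(a * b for a, b in zip(score_a, score_b))


# Hypothetical model parameters and toy term counts (assumptions).
logits = [0.0, 0.0, 0.0, 0.0]        # uniform unigram model
query = [2, 0, 0, 0]                 # query uses term 0 only
block_relevant = [3, 1, 0, 0]        # block dominated by term 0
block_other = [0, 0, 2, 2]           # block about unrelated terms

rel = fisher_kernel(query, block_relevant, logits)
other = fisher_kernel(query, block_other, logits)
print(rel, other)
```

In this toy setting the block that shares the query's dominant term receives the higher kernel value. A framework like the one described would derive the scores from a richer graphical model that also encodes layout and other prior information, and would typically feed the kernel into a discriminative learner.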

Metadata
Title
A learning framework for information block search based on probabilistic graphical models and Fisher Kernel
Authors
Tak-Lam Wong
Haoran Xie
Wai Lam
Fu Lee Wang
Publication date
28-03-2017
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 9/2018
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-017-0657-9
