Published in: International Journal of Machine Learning and Cybernetics 9/2018

28 March 2017 | Original Article

A learning framework for information block search based on probabilistic graphical models and Fisher Kernel

Authors: Tak-Lam Wong, Haoran Xie, Wai Lam, Fu Lee Wang

Abstract

Traditional Web information retrieval methods return only a ranked list of Web pages and accept only search terms in the query. In contrast, we have developed a novel learning framework for retrieving precise information blocks from Web pages given a query, which may contain search terms as well as prior information such as the layout format of the data. This problem involves two challenging sub-tasks. The first is information block detection, in which a Web page is automatically segmented into blocks. The second is finding the information blocks relevant to the query. Existing page segmentation methods use only visual layout information or only content information and do not consider the query, so the resulting segmentation can conflict with the information need expressed by the query. Our framework models the query and the block features with a probabilistic graphical model that captures both keyword information and prior information. The Fisher kernel, which can effectively incorporate the graphical model, is then employed to accomplish the two sub-tasks in a unified manner, optimizing the final goal of block retrieval performance. We have conducted experiments on benchmark datasets and real-world data, and compared our framework with existing methods to evaluate its effectiveness.
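To make the Fisher kernel idea more concrete, the following is a minimal sketch and not the authors' model. It assumes a simple smoothed unigram model in place of the paper's probabilistic graphical model, parameterizes word probabilities by softmax logits, and approximates the Fisher information matrix by the identity, a common practical simplification. The function names, the toy vocabulary, and the example blocks are all hypothetical.

```python
from collections import Counter


def fit_unigram(blocks, vocab, alpha=1.0):
    """Fit a Laplace-smoothed unigram model (a stand-in generative model)."""
    counts = Counter(tok for block in blocks for tok in block if tok in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def fisher_score(tokens, theta):
    """Gradient of log p(tokens | theta) with respect to softmax logits.

    For a unigram model this is c_w - N * theta_w for each word w, where
    c_w is the count of w in the tokens and N is the number of tokens.
    """
    counts = Counter(t for t in tokens if t in theta)
    n = sum(counts.values())
    return [counts[w] - n * theta[w] for w in sorted(theta)]


def fisher_kernel(x_tokens, y_tokens, theta):
    """Inner product of Fisher scores (Fisher information assumed identity)."""
    ux = fisher_score(x_tokens, theta)
    uy = fisher_score(y_tokens, theta)
    return sum(a * b for a, b in zip(ux, uy))


# Hypothetical toy data: two candidate blocks and one keyword query.
blocks = [["camera", "price", "zoom"], ["login", "register", "help"]]
vocab = {tok for block in blocks for tok in block}
theta = fit_unigram(blocks, vocab)

query = ["price", "camera"]
# Rank blocks by their kernel similarity to the query, highest first.
ranked = sorted(blocks, key=lambda b: fisher_kernel(query, b, theta), reverse=True)
```

In this sketch the kernel compares a query and a block by how each would shift the parameters of the shared generative model, rather than by raw term overlap. A richer model, such as one that also encodes layout features, would plug into the same interface by supplying its own score function.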

Metadata
Title
A learning framework for information block search based on probabilistic graphical models and Fisher Kernel
Authors
Tak-Lam Wong
Haoran Xie
Wai Lam
Fu Lee Wang
Publication date
28 March 2017
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 9/2018
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-017-0657-9
