Skip to main content
Top
Published in: Knowledge and Information Systems 3/2019

20-02-2018 | Regular Paper

A novel page clipping search engine based on page discussion topics

Author: Lin-Chih Chen

Published in: Knowledge and Information Systems | Issue 3/2019

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

In this paper, we propose a page clipping search engine based on page discussion topics. Compared to other search engines, our search engine uses the page discussion topic instead of the search engine results page as the main result. After the user selects the topic of interest, our search engine will clip the relevant pages according to the selected topic and produce an integrated page result. The advantage of this topic-based integration page result is that the user can reduce the time it takes to decide whether the page content is relevant. Our results consist of two parts: the query-related discussion topics and the clipping results for relevant pages. We first use an adjusted N-gram language model and a hash method to produce discussion topics. At the same time, we use the idea of binary coding and mathematical set to organize related topics into a hierarchical topic tree with parent–child relationship. Next, we use a cost-effective genetic algorithm to produce the relevant page clipping results. This study has the following three advantages. The first is that we can find multiple clustering relationships, that is, a child topic can appear simultaneously in multiple parent topics. The second is that we propose a good topic generation method, that is, we cannot only produce better quality topics, but also produce the topic tree in a linear time. The third is that we propose a good clipping generation method, that is, we cannot only produce better quality clippings, but also produce a cost-effective solution.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Abu Arqub O, Abo-Hammour Z, Momani S (2014) Application of continuous genetic algorithm for nonlinear system of second-order boundary value problems. Appl Math Inf Sci 8(1):235–248MATHCrossRef Abu Arqub O, Abo-Hammour Z, Momani S (2014) Application of continuous genetic algorithm for nonlinear system of second-order boundary value problems. Appl Math Inf Sci 8(1):235–248MATHCrossRef
2.
go back to reference Al Jadaan O, Rajamani L, Rao C (2008) Improved selection operator for GA. J Theor Appl Inf Technol 4(4):269–277 Al Jadaan O, Rajamani L, Rao C (2008) Improved selection operator for GA. J Theor Appl Inf Technol 4(4):269–277
3.
go back to reference Banu WA, Kader PSA (2010) A hybrid context based approach for web information retrieval. Int J Comput Appl 10(7):25–28 Banu WA, Kader PSA (2010) A hybrid context based approach for web information retrieval. Int J Comput Appl 10(7):25–28
4.
go back to reference Bhunia AK, Sahoo L, Roy D (2010) Reliability stochastic optimization for a series system with interval component reliability via genetic algorithm. Appl Math Comput 216(3):929–939MathSciNetMATH Bhunia AK, Sahoo L, Roy D (2010) Reliability stochastic optimization for a series system with interval component reliability via genetic algorithm. Appl Math Comput 216(3):929–939MathSciNetMATH
5.
go back to reference Carpineto C, Osinski S, Romano G, Weiss D (2009) A survey of web clustering engines. ACM Comput Surv 41(3):17:11–17:38CrossRef Carpineto C, Osinski S, Romano G, Weiss D (2009) A survey of web clustering engines. ACM Comput Surv 41(3):17:11–17:38CrossRef
6.
go back to reference Chen L-C (2011) Building a web-snippet clustering system based on a mixed clustering method. Online Inf Rev 35(4):611–635CrossRef Chen L-C (2011) Building a web-snippet clustering system based on a mixed clustering method. Online Inf Rev 35(4):611–635CrossRef
7.
go back to reference Chen L-C, Luh C-J (2005) Web page prediction from metasearch results. Internet Res 15(4):421–446CrossRef Chen L-C, Luh C-J (2005) Web page prediction from metasearch results. Internet Res 15(4):421–446CrossRef
8.
go back to reference Chen L-C, Luh C-J, Jou C (2005) Generating page clippings from web search results using a dynamically terminated genetic algorithm. Inf Syst 30(4):299–316CrossRef Chen L-C, Luh C-J, Jou C (2005) Generating page clippings from web search results using a dynamically terminated genetic algorithm. Inf Syst 30(4):299–316CrossRef
9.
go back to reference Cilibrasi RL, Vitanyi PMB (2007) The google similarity distance. IEEE Trans Knowl Data Eng 19(3):370–383CrossRef Cilibrasi RL, Vitanyi PMB (2007) The google similarity distance. IEEE Trans Knowl Data Eng 19(3):370–383CrossRef
10.
go back to reference Croft B, Lafferty J (2013) Language modeling for information retrieval. Springer, New YorkMATH Croft B, Lafferty J (2013) Language modeling for information retrieval. Springer, New YorkMATH
11.
go back to reference Croft B, Metzler D, Strohman T (2009) Search engines: information retrieval in practice. Pearson Press, Pearson Croft B, Metzler D, Strohman T (2009) Search engines: information retrieval in practice. Pearson Press, Pearson
12.
go back to reference Ferragina P, Guli A (2008) A personalized search engine based on web-snippet hierarchical clustering. Softw Pract Exp 38(2):189–225CrossRef Ferragina P, Guli A (2008) A personalized search engine based on web-snippet hierarchical clustering. Softw Pract Exp 38(2):189–225CrossRef
13.
go back to reference Fox C (1989) A stop list for general text. ACM SIGIR Forum 24(1–2):19–35CrossRef Fox C (1989) A stop list for general text. ACM SIGIR Forum 24(1–2):19–35CrossRef
14.
go back to reference Hammache A, Boughanem M, Ahmed-Ouamer R (2014) Combining compound and single terms under language model framework. Knowl Inf Syst 39(2):329–349CrossRef Hammache A, Boughanem M, Ahmed-Ouamer R (2014) Combining compound and single terms under language model framework. Knowl Inf Syst 39(2):329–349CrossRef
15.
go back to reference Hinow M, Mevissen M (2011) Substation maintenance strategy adaptation for life-cycle cost reduction using genetic algorithm. IEEE Trans Power Deliv 26(1):197–204CrossRef Hinow M, Mevissen M (2011) Substation maintenance strategy adaptation for life-cycle cost reduction using genetic algorithm. IEEE Trans Power Deliv 26(1):197–204CrossRef
16.
go back to reference Ho W, Ho GT, Ji P, Lau HC (2008) A hybrid genetic algorithm for the multi-depot vehicle routing problem. Eng Appl Artif Intell 21(4):548–557CrossRef Ho W, Ho GT, Ji P, Lau HC (2008) A hybrid genetic algorithm for the multi-depot vehicle routing problem. Eng Appl Artif Intell 21(4):548–557CrossRef
17.
go back to reference Huang C-L, Wang C-J (2006) A GA-based feature selection and parameters optimization for support vector machines. Expert Syst Appl 31(2):231–240CrossRef Huang C-L, Wang C-J (2006) A GA-based feature selection and parameters optimization for support vector machines. Expert Syst Appl 31(2):231–240CrossRef
18.
go back to reference Indira SU, Ramesh AC (2011) Image segmentation using artificial neural network and genetic algorithm: a comparative analysis. In: Proceedings of the 2011 international conference on process automation, control and computing, pp 1–6 Indira SU, Ramesh AC (2011) Image segmentation using artificial neural network and genetic algorithm: a comparative analysis. In: Proceedings of the 2011 international conference on process automation, control and computing, pp 1–6
19.
go back to reference Ivanov V, Palyukh B, Sotnikov A (2016) Efficiency of genetic algorithm for subject search queries. Lobachevskii J Math 37(3):244–254MathSciNetMATHCrossRef Ivanov V, Palyukh B, Sotnikov A (2016) Efficiency of genetic algorithm for subject search queries. Lobachevskii J Math 37(3):244–254MathSciNetMATHCrossRef
20.
go back to reference Jinarat S, Haruechaiyasak C, Rungsawang A (2015) Graph-based concept clustering for web search results. Int J Electr Comput Eng 5(6):1536–1544 Jinarat S, Haruechaiyasak C, Rungsawang A (2015) Graph-based concept clustering for web search results. Int J Electr Comput Eng 5(6):1536–1544
21.
go back to reference Kaur M, Kaur P, Singh M (2015) Rank aggregation using multi objective genetic algorithm. In: Proceedings of the 2015 1st international conference on next generation computing technologies (NGCT), pp 836–840 Kaur M, Kaur P, Singh M (2015) Rank aggregation using multi objective genetic algorithm. In: Proceedings of the 2015 1st international conference on next generation computing technologies (NGCT), pp 836–840
22.
go back to reference Lau JH, Cook P, Baldwin T (2013) Topic modelling-based word sense induction for web snippet clustering. In: Proceedings of the 7th international workshop on semantic evaluation, pp 217–221 Lau JH, Cook P, Baldwin T (2013) Topic modelling-based word sense induction for web snippet clustering. In: Proceedings of the 7th international workshop on semantic evaluation, pp 217–221
23.
go back to reference Lindsey R, Veksler VD, Grintsvayg A, Gray WD (2007) Be wary of what your computer reads: the effects of corpus selection on measuring semantic relatedness. In: Proceedings of the 8th international conference on cognitive modeling. Taylor & Francis Press, Ann Arbor, Michigan, pp 279–284 Lindsey R, Veksler VD, Grintsvayg A, Gray WD (2007) Be wary of what your computer reads: the effects of corpus selection on measuring semantic relatedness. In: Proceedings of the 8th international conference on cognitive modeling. Taylor & Francis Press, Ann Arbor, Michigan, pp 279–284
24.
go back to reference Martín P, Sierra A (2016) Improving power system static security margins by means of a real coded genetic algorithm. IEEE Trans Power Syst 31(3):1915–1924CrossRef Martín P, Sierra A (2016) Improving power system static security margins by means of a real coded genetic algorithm. IEEE Trans Power Syst 31(3):1915–1924CrossRef
25.
go back to reference Meng W, Wang W, Sun H, Yu C (2002) Concept hierarchy-based text database categorization. Knowl Inf Syst 4(2):132–150CrossRef Meng W, Wang W, Sun H, Yu C (2002) Concept hierarchy-based text database categorization. Knowl Inf Syst 4(2):132–150CrossRef
26.
go back to reference Nirkhi S, Hande K (2008) A survey on clustering algorithms for web applications. In: Proceedings of the 2008 international conference on semantic web and web services. CSREA Press, Las Vegas, Nevada, July 14–17, 2008 Nirkhi S, Hande K (2008) A survey on clustering algorithms for web applications. In: Proceedings of the 2008 international conference on semantic web and web services. CSREA Press, Las Vegas, Nevada, July 14–17, 2008
27.
go back to reference Özel SA (2011) A web page classification system based on a genetic algorithm using tagged-terms as features. Expert Syst Appl 38(4):3407–3415CrossRef Özel SA (2011) A web page classification system based on a genetic algorithm using tagged-terms as features. Expert Syst Appl 38(4):3407–3415CrossRef
28.
go back to reference Prakash BR, Hanumanthappa M (2012) Web snippet clustering and labeling using lingo algorithm. Int J Adv Res Comput Sci 3(2):262–265 Prakash BR, Hanumanthappa M (2012) Web snippet clustering and labeling using lingo algorithm. Int J Adv Res Comput Sci 3(2):262–265
29.
go back to reference Prakash S, Vidyarthi D (2011) Load balancing in computational grid using genetic algorithm. Adv Comput 1(1):8–17CrossRef Prakash S, Vidyarthi D (2011) Load balancing in computational grid using genetic algorithm. Adv Comput 1(1):8–17CrossRef
30.
go back to reference Quan X, Liu G, Lu Z, Ni X, Wenyin L (2010) Short text similarity based on probabilistic topics. Knowl Inf Syst 25(3):473–491CrossRef Quan X, Liu G, Lu Z, Ni X, Wenyin L (2010) Short text similarity based on probabilistic topics. Knowl Inf Syst 25(3):473–491CrossRef
31.
go back to reference Recchia G, Jones MN (2009) More data trumps smarter algorithms: comparing pointwise mutual information with latent semantic analysis. Behav Res Methods 41(3):647–656CrossRef Recchia G, Jones MN (2009) More data trumps smarter algorithms: comparing pointwise mutual information with latent semantic analysis. Behav Res Methods 41(3):647–656CrossRef
32.
go back to reference Sadaf K, Alam M (2012) Web search result clustering—a review. Int J Comput Sci Eng Surv 3(4):85–92CrossRef Sadaf K, Alam M (2012) Web search result clustering—a review. Int J Comput Sci Eng Surv 3(4):85–92CrossRef
33.
go back to reference Scaiella U, Ferragina P, Marino A, Ciaramita M (2012) Topical clustering of search results. In: Proceedings of the 5th ACM international conference on web search and data mining, pp 223–232 Scaiella U, Ferragina P, Marino A, Ciaramita M (2012) Topical clustering of search results. In: Proceedings of the 5th ACM international conference on web search and data mining, pp 223–232
34.
go back to reference Spink A, Wolfram D, Jansen MBJ, Saracevic T (2001) Searching the web: the public and their queries. J Am Soc Inform Sci Technol 52(3):226–234CrossRef Spink A, Wolfram D, Jansen MBJ, Saracevic T (2001) Searching the web: the public and their queries. J Am Soc Inform Sci Technol 52(3):226–234CrossRef
35.
go back to reference Sun X, Gong D, Jin Y, Chen S (2013) A new surrogate-assisted interactive genetic algorithm with weighted semisupervised learning. IEEE Trans Cybern 43(2):685–698CrossRef Sun X, Gong D, Jin Y, Chen S (2013) A new surrogate-assisted interactive genetic algorithm with weighted semisupervised learning. IEEE Trans Cybern 43(2):685–698CrossRef
36.
go back to reference Tomašev N, Mladenić D (2014) Hubness-aware shared neighbor distances for high-dimensional K-nearest neighbor classification. Knowl Inf Syst 39(1):89–122CrossRef Tomašev N, Mladenić D (2014) Hubness-aware shared neighbor distances for high-dimensional K-nearest neighbor classification. Knowl Inf Syst 39(1):89–122CrossRef
37.
go back to reference Uğuz H (2011) A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowl Based Syst 24(7):1024–1032CrossRef Uğuz H (2011) A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowl Based Syst 24(7):1024–1032CrossRef
38.
go back to reference Voorhees EM (1999) The TREC-8 question answering track report. In: Proceedings of the 8th text retrieval conference, pp 77–82 Voorhees EM (1999) The TREC-8 question answering track report. In: Proceedings of the 8th text retrieval conference, pp 77–82
39.
go back to reference Wang Q, Qian Y, Song R, Dou Z, Zhang F, Sakai T, Zheng Q (2013) Mining subtopics from text fragments for a web query. Inf Retr 16(4):484–503CrossRef Wang Q, Qian Y, Song R, Dou Z, Zhang F, Sakai T, Zheng Q (2013) Mining subtopics from text fragments for a web query. Inf Retr 16(4):484–503CrossRef
40.
go back to reference Wang Y, Chen W, Tellambura C (2012) Genetic algorithm based nearly optimal peak reduction tone set selection for adaptive amplitude clipping PAPR reduction. IEEE Trans Broadcast 58(3):462–471CrossRef Wang Y, Chen W, Tellambura C (2012) Genetic algorithm based nearly optimal peak reduction tone set selection for adaptive amplitude clipping PAPR reduction. IEEE Trans Broadcast 58(3):462–471CrossRef
41.
go back to reference Zamir O, Etzioni O (1999) Grouper: a dynamic clustering interface to web search results. Comput Netw 31(11–16):1361–1374CrossRef Zamir O, Etzioni O (1999) Grouper: a dynamic clustering interface to web search results. Comput Netw 31(11–16):1361–1374CrossRef
42.
go back to reference Zhao H, Qi Z (2010) Hierarchical agglomerative clustering with ordering constraints. In: Proceedings of the 2010 3rd international conference on knowledge discovery and data mining, Phuket, 9–10 January 2010, pp 195–199 Zhao H, Qi Z (2010) Hierarchical agglomerative clustering with ordering constraints. In: Proceedings of the 2010 3rd international conference on knowledge discovery and data mining, Phuket, 9–10 January 2010, pp 195–199
43.
go back to reference Zhou F, Liu X (2005) An improved genetic algorithm of suited web-based negotiation support system. Comput Eng 23:061 Zhou F, Liu X (2005) An improved genetic algorithm of suited web-based negotiation support system. Comput Eng 23:061
44.
go back to reference Zhu X, Lu P (2009) A two-phase scheduling strategy for real-time applications with security requirements on heterogeneous clusters. Comput Electr Eng 35(6):980–993MATHCrossRef Zhu X, Lu P (2009) A two-phase scheduling strategy for real-time applications with security requirements on heterogeneous clusters. Comput Electr Eng 35(6):980–993MATHCrossRef
Metadata
Title
A novel page clipping search engine based on page discussion topics
Author
Lin-Chih Chen
Publication date
20-02-2018
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 3/2019
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-018-1173-2

Other articles of this Issue 3/2019

Knowledge and Information Systems 3/2019 Go to the issue

Premium Partner