Published in: World Wide Web 4/2016

01-07-2016

Focused crawling for the hidden web

Authors: Panagiotis Liakos, Alexandros Ntoulas, Alexandros Labrinidis, Alex Delis

Abstract

A constantly growing amount of high-quality information resides in databases and is guarded behind forms that users fill out and submit. The Hidden Web comprises all these information sources that conventional web crawlers are incapable of discovering. In order to excavate and make available meaningful data from the Hidden Web, previous work has focused on developing query generation techniques that aim at downloading all the content of a given Hidden Web site at minimum cost. However, there are circumstances where only a specific part of such a site might be of interest. For example, a politics portal should not have to waste bandwidth or processing power to retrieve sports articles just because they reside in databases that also contain documents relevant to politics. In cases like this one, we need to make the best use of our resources by downloading only the portion of the Hidden Web site that we are interested in. We investigate how we can build a focused Hidden Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset that is related to the corresponding area. In this regard, we present an approach that progresses iteratively and analyzes the returned results in order to extract terms that capture the essence of the topic we are interested in. We propose a number of different crawling policies and we experimentally evaluate them with data from four popular sites. In all cases, our approach is able to download most of the content sought, using a significantly smaller number of queries than existing approaches.
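The iterative idea in the abstract (issue a query, analyze the returned results, extract terms that capture the topic, pick the next query) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the crawling policies evaluated in the paper: the in-memory DOCS collection, the search function standing in for a site's form, the TOPIC_WORDS list, the two-word relevance test, and the frequency-based term selection are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical stand-in for a Hidden Web site: documents reachable only via keyword search.
DOCS = {
    1: "senate election reform",
    2: "senate election campaign",
    3: "senate budget law",
    4: "parliament budget law",
    5: "football league final",
    6: "football budget league",
}

# Hypothetical description of the topic of interest (politics).
TOPIC_WORDS = {"senate", "election", "parliament", "campaign", "reform", "law"}


def search(term):
    """Simulate submitting one keyword query through the site's search form."""
    return {doc_id for doc_id, text in DOCS.items() if term in text.split()}


def is_relevant(text):
    """Crude relevance test (an assumption): at least two topic words occur."""
    return len(TOPIC_WORDS & set(text.split())) >= 2


def focused_crawl(seed, max_queries):
    """Issue up to max_queries queries, choosing each next term from relevant results."""
    issued, downloaded = [], set()
    term_counts = Counter()  # how many relevant documents contain each word
    term = seed
    while term is not None and len(issued) < max_queries:
        issued.append(term)
        for doc_id in sorted(search(term) - downloaded):
            downloaded.add(doc_id)
            text = DOCS[doc_id]
            if is_relevant(text):
                # Only relevant documents contribute candidate query terms.
                term_counts.update(set(text.split()))
        candidates = [w for w in term_counts if w not in issued]
        # Pick the most frequent unissued term; ties are broken alphabetically.
        term = min(candidates, key=lambda w: (-term_counts[w], w)) if candidates else None
    relevant = {d for d in downloaded if is_relevant(DOCS[d])}
    return issued, downloaded, relevant


if __name__ == "__main__":
    issued, downloaded, relevant = focused_crawl("election", max_queries=3)
    print(issued, sorted(downloaded), sorted(relevant))
```

Starting from the seed term "election" with a budget of three queries, this toy run reaches all four politics documents and never downloads document 5, although it still fetches one off-topic document (6) that shares a query term. A real crawler would need a stronger relevance classifier and a term-selection policy that accounts for query cost.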

Metadata
Title: Focused crawling for the hidden web
Authors: Panagiotis Liakos, Alexandros Ntoulas, Alexandros Labrinidis, Alex Delis
Publication date: 01-07-2016
Publisher: Springer US
Published in: World Wide Web, Issue 4/2016
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI: https://doi.org/10.1007/s11280-015-0349-x
