Published in: The Journal of Supercomputing 10/2020

20.02.2019

An effective approach to enhancing a focused crawler using Google

Authors: Jae-Gil Lee, Donghwan Bae, Sansung Kim, Jungeun Kim, Mun Yong Yi


Abstract

In this paper, we share our experience in augmenting the focused crawler of our vertical search engine, which is designed to work with academic slides. The goal of the focused crawler was to collect Microsoft PowerPoint files from academic institutions. A previous approach based on a general web crawler can fail to collect a sufficient number of files, mainly because of the robots exclusion protocol and missing hyperlinks. As a remedy for these problems, we propose a combinatory approach in which the indexing information maintained by a general web search engine such as Google is used to generate a target URL list through our query generator, which is then complemented by our URL extractor and file downloader. Because Google has already crawled billions of web pages, it is more cost-efficient and potentially more effective to systematically retrieve the desired information from Google than to redo the crawling from scratch ourselves. Our focused crawler, which we call SlideCrawler, has been used for our vertical search engine CourseShare since the fall of 2011. The capability of SlideCrawler was verified on the top-500 universities worldwide, from which it collected about one million files. Further, the study results show that SlideCrawler outperforms Nutch, collecting 3.7 times more slide files.
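
The query-generator and URL-extractor stages named in the abstract might look roughly like the following minimal sketch. The abstract does not give the actual query format or code, so everything here is an assumption for illustration: the function names generate_queries and extract_slide_urls, the choice of the .ppt and .pptx extensions, and the use of Google's site: and filetype: search operators. The step that submits queries to the search engine and the file downloader are deliberately left out.

```python
from urllib.parse import urlparse

# Assumed slide extensions; the abstract only says "Microsoft PowerPoint files".
SLIDE_EXTENSIONS = (".ppt", ".pptx")


def generate_queries(domains, extensions=SLIDE_EXTENSIONS):
    """Build one search query string per (domain, extension) pair.

    `site:` and `filetype:` are standard Google search operators; whether
    SlideCrawler issues queries of exactly this form is an assumption.
    """
    return [
        f"site:{domain} filetype:{ext.lstrip('.')}"
        for domain in domains
        for ext in extensions
    ]


def extract_slide_urls(result_urls, extensions=SLIDE_EXTENSIONS):
    """Keep result URLs whose path ends in a slide extension.

    Exact duplicate URLs are dropped; the same file hosted at different
    URLs is not detected here.
    """
    suffixes = tuple(extensions)
    seen = set()
    targets = []
    for url in result_urls:
        path = urlparse(url).path.lower()
        if path.endswith(suffixes) and url not in seen:
            seen.add(url)
            targets.append(url)
    return targets
```

For a hypothetical domain such as example.edu, generate_queries(["example.edu"]) yields one query per extension, and extract_slide_urls would then be applied to whatever result URLs those queries return before the files are downloaded.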


Footnotes
1
Apache Nutch is an open-source web-search software project; its homepage is http://nutch.apache.org/. Its crawler was written from scratch specifically for this project.
 
3
We understand that the same slide file can be located at many different URLs; this type of duplication will have to be removed at indexing time, after crawling.
 
Metadata
Title
An effective approach to enhancing a focused crawler using Google
Authors
Jae-Gil Lee
Donghwan Bae
Sansung Kim
Jungeun Kim
Mun Yong Yi
Publication date
20.02.2019
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 10/2020
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-019-02787-9
