Published in: World Wide Web 1/2016

01.01.2016

AcT: Accuracy-aware crawling techniques for cloud-crawler

Authors: Kanik Gupta, Vishal Mittal, Bazir Bishnoi, Siddharth Maheshwari, Dhaval Patel

Abstract

News aggregation websites collect news from various online sources using crawling techniques and present a unified view to millions of users. Because news sources update their content frequently, aggregators must recrawl them periodically to archive the news content durably. Most recrawling techniques assume unlimited resources and zero operating cost. In reality, resources and budget are limited, and it is impossible to crawl every news source at every point in time. To the best of our knowledge, no existing technique addresses a crawling strategy that retrieves the maximum amount of information in a resource- or budget-constrained environment. In this paper, we present AcT, a framework that supports two accuracy-aware personalized crawling techniques for attaining the optimal accuracy of information retrieval. Given the crawling frequency as a resource constraint, the first scheme finds the optimal schedule that maximizes accuracy. The second scheme optimizes the crawling frequency and the corresponding crawling schedule for a given accuracy level. We propose a supervised technique that monitors each news source for a particular time period and collects its news update patterns. These patterns are then analyzed using mixed integer programming to discover the optimal crawling schedule for the first scheme, whereas a greedy strategy is proposed to discover the optimal crawling frequency and crawling schedule for the second scheme. We developed a crawler for 87 news sources and performed a series of experiments to demonstrate the quality and efficiency of the proposed techniques against benchmark strategies.
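
The two schemes can be illustrated with a toy model. The sketch below is not the paper's formulation: the abstract does not define the accuracy measure or the integer program, so the per-slot `updates` counts, the item `lifetime`, and all function names are assumptions made purely for illustration. Accuracy is taken here to be the fraction of posted items that at least one crawl sees, and exhaustive search stands in for mixed integer programming in the first scheme.

```python
from itertools import combinations


def captured(updates, schedule, lifetime):
    """Count items seen by at least one crawl.

    Assumption (not from the paper): an item posted in slot t stays visible
    for `lifetime` slots, so a crawl at slot c sees items posted in slots
    c - lifetime + 1 .. c.
    """
    seen = set()
    for c in schedule:
        seen.update(range(max(0, c - lifetime + 1), c + 1))
    return sum(updates[t] for t in seen)


def accuracy(updates, schedule, lifetime):
    """Fraction of all posted items that the schedule captures."""
    total = sum(updates)
    return captured(updates, schedule, lifetime) / total if total else 1.0


def best_schedule(updates, k, lifetime):
    """Scheme 1 (toy): fixed crawl budget k, maximize accuracy.

    Exhaustive search over all k-slot schedules; only feasible for small
    inputs, and used here as a stand-in for an integer program.
    """
    slots = range(len(updates))
    return max(combinations(slots, k),
               key=lambda s: captured(updates, s, lifetime))


def greedy_schedule(updates, target, lifetime):
    """Scheme 2 (toy): fixed accuracy target, keep the crawl count small.

    Repeatedly adds the crawl slot with the largest marginal gain until the
    target accuracy is reached.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must lie in [0, 1]")
    schedule = []
    while accuracy(updates, schedule, lifetime) < target:
        remaining = [t for t in range(len(updates)) if t not in schedule]
        best = max(remaining,
                   key=lambda t: captured(updates, schedule + [t], lifetime))
        schedule.append(best)
    return sorted(schedule)


# Hypothetical update pattern observed over 8 monitoring slots.
updates = [0, 3, 1, 0, 4, 2, 0, 1]
print(best_schedule(updates, k=2, lifetime=2))         # (2, 5)
print(greedy_schedule(updates, target=0.9, lifetime=2))  # [2, 5]
```

With this pattern, two crawls at slots 2 and 5 capture 10 of the 11 items, and the greedy variant reaches the 0.9 target with the same two crawls. In general a greedy choice of this kind carries no optimality guarantee in the toy model.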

Metadata
Title
AcT: Accuracy-aware crawling techniques for cloud-crawler
Authors
Kanik Gupta
Vishal Mittal
Bazir Bishnoi
Siddharth Maheshwari
Dhaval Patel
Publication date
01.01.2016
Publisher
Springer US
Published in
World Wide Web / Issue 1/2016
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-015-0328-2
