01.01.2016

AcT: Accuracy-aware crawling techniques for cloud-crawler

Authors: Kanik Gupta, Vishal Mittal, Bazir Bishnoi, Siddharth Maheshwari, Dhaval Patel

Published in: World Wide Web | Issue 1/2016

Abstract

News aggregation websites collect news from various online sources using crawling techniques and provide a unified view to millions of users. Since news sources update information frequently, aggregators have to recrawl them from time to time in order to archive the news content durably. The majority of recrawling techniques assume the availability of unlimited resources and zero operating cost. In reality, however, resources and budget are limited, and it is impossible to crawl every news source at every point in time. To the best of our knowledge, none of the existing techniques discusses a crawling strategy that can retrieve the maximum amount of information in a resource- or budget-constrained environment. In this paper, we present a framework, AcT, that supports two different accuracy-aware personalized crawling techniques to attain the optimal accuracy level of information retrieval. Given the crawling frequency as a resource constraint, the first scheme aims to find the optimal schedule that maximizes the accuracy. In the second scheme, we optimize the crawling frequency and the corresponding crawling schedule for a given accuracy level. We propose a supervised technique that monitors each news source for a particular time period and collects the news update patterns. The news update patterns are later analyzed using mixed integer programming to discover the optimal crawling schedule for the first scheme, whereas a greedy strategy is proposed to discover the optimal crawling frequency and crawling schedule for the second scheme. We developed a crawler for 87 news sources and performed a series of experiments to demonstrate the quality and efficiency of our proposed techniques against benchmark strategies.
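The two schemes in the abstract can be illustrated with a small Python sketch. It is not the authors' formulation: the abstract does not define accuracy, so the sketch assumes that time is split into discrete slots and that an update counts as captured if some crawl happens within `window` slots of its publication. The function names, the `window` parameter, and the sample update pattern are all hypothetical. For the first scheme, an exhaustive search over schedules stands in for the mixed integer program, which is only workable for tiny instances. The second scheme is sketched as a simple greedy loop.

```python
from itertools import combinations


def captured(updates, crawls, window):
    """Count updates seen by at least one crawl within `window` slots (assumed accuracy model)."""
    return sum(any(u <= c < u + window for c in crawls) for u in updates)


def best_schedule(updates, slots, k, window):
    """Scheme 1 stand-in: try every schedule with k crawls and keep the most accurate one.

    The paper uses mixed integer programming; exhaustive search is used here
    only because it is short and exact on small inputs.
    """
    best = max(combinations(range(slots), k),
               key=lambda s: captured(updates, s, window))
    return best, captured(updates, best, window) / len(updates)


def greedy_schedule(updates, slots, target, window):
    """Scheme 2 stand-in: add the slot with the largest gain until accuracy >= target."""
    chosen, remaining, total = [], list(updates), len(updates)
    while remaining and (total - len(remaining)) / total < target:
        # Pick the slot that captures the most not-yet-captured updates.
        slot = max(range(slots),
                   key=lambda c: sum(u <= c < u + window for u in remaining))
        still_missed = [u for u in remaining if not (u <= slot < u + window)]
        if len(still_missed) == len(remaining):
            break  # no slot captures anything more; target is unreachable
        chosen.append(slot)
        remaining = still_missed
    return sorted(chosen), (total - len(remaining)) / total


# Hypothetical update pattern for one source over 24 hourly slots.
updates = [1, 2, 3, 9, 10, 17, 18, 19, 20]
print(best_schedule(updates, slots=24, k=2, window=3))
print(greedy_schedule(updates, slots=24, target=0.8, window=3))
```

In this toy instance, a budget of two crawls cannot capture more than six of the nine updates, while the greedy loop needs three crawls to pass an 80% target. A greedy choice of this kind is not guaranteed to find the minimum number of crawls in general.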

Title: AcT: Accuracy-aware crawling techniques for cloud-crawler
Authors: Kanik Gupta
Vishal Mittal
Bazir Bishnoi
Siddharth Maheshwari
Dhaval Patel
Publication date: 01.01.2016
Publisher: Springer US
Published in: World Wide Web / Issue 1/2016
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI: https://doi.org/10.1007/s11280-015-0328-2
