2015 | OriginalPaper | Chapter

Crawling Ranked Deep Web Data Sources

Authors: Yan Wang, Yaxin Li, Nannan Pi, Jianguo Lu

Published in: Web Information Systems Engineering – WISE 2015

Publisher: Springer International Publishing

Abstract

In the era of big data, the vast majority of data do not come from the surface web, the web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep web, the web that is hidden behind query interfaces. Because data in the deep web are often of high value, a line of research over the past decade has addressed crawling deep web data sources. However, most existing crawling methods assume that all matched documents are returned. In practice, many data sources rank the matched documents and return only the top k matches. When conventional methods are applied to such ranked data sources, popular queries that match more than k documents cause large redundancy. This paper proposes a document frequency (df) based algorithm that exploits queries whose document frequencies fall within a specified range. The algorithm is extensively tested on a variety of datasets and compared with two existing algorithms. We demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
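The idea of restricting queries to a document frequency range can be illustrated with a small sketch. This is not the authors' algorithm, only a toy illustration under stated assumptions: the corpus, the thresholds `df_min` and `df_max`, the helper names, and the ranking stand-in (lowest document id first) are all hypothetical and chosen for readability.

```python
from collections import defaultdict


def document_frequencies(docs):
    """Map each term to the set of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings[term].add(doc_id)
    return postings


def select_queries(postings, df_min, df_max):
    """Keep the terms whose document frequency lies in [df_min, df_max]."""
    return sorted(t for t, ids in postings.items() if df_min <= len(ids) <= df_max)


def crawl(postings, queries, k):
    """Simulate a ranked source that returns at most k matches per query."""
    harvested = set()
    returned = 0
    for q in queries:
        # Assumed ranking for illustration only: lowest document id first.
        top_k = sorted(postings[q])[:k]
        returned += len(top_k)
        harvested.update(top_k)
    return harvested, returned


# Hypothetical toy corpus standing in for a deep web data source.
docs = [
    "deep web crawling",
    "deep web query",
    "deep web ranking",
    "deep learning",
    "set covering query",
    "ranking bias",
]
postings = document_frequencies(docs)
k = 2  # the source returns only the top k matches
queries = select_queries(postings, df_min=2, df_max=k)
harvested, returned = crawl(postings, queries, k)
```

In this toy setting the popular term "deep" matches four documents, so a top-2 source would never return two of them however often the query is issued. Capping the document frequency at k avoids such queries, and the lower bound skips terms that retrieve only a single document.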

In this paper, we use the words ‘term’ and ‘query’ interchangeably; the only difference is that a query is a term that has been issued.

Print ISBN: 978-3-319-26189-8

Electronic ISBN: 978-3-319-26190-4

Copyright Year: 2015
DOI: https://doi.org/10.1007/978-3-319-26190-4_26
