
2015 | OriginalPaper | Chapter

Crawling Ranked Deep Web Data Sources

Authors: Yan Wang, Yaxin Li, Nannan Pi, Jianguo Lu

Published in: Web Information Systems Engineering – WISE 2015

Publisher: Springer International Publishing


Abstract

In the era of big data, the vast majority of data does not come from the surface web, the web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep web, the web that is hidden behind query interfaces. Since the data in the deep web are often of high value, a line of research on crawling deep web data sources has developed over the past decade. However, most existing crawling methods assume that all the matched documents are returned. In practice, many data sources rank the matched documents and return only the top k matches. When conventional methods are applied to such ranked data sources, popular queries that match more than k documents cause large redundancy. This paper proposes a document frequency (df) based algorithm that exploits the queries whose document frequencies are within a specified range. The algorithm is extensively tested on a variety of datasets and compared with two existing algorithms. We demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
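The effect of a top-k return limit on query selection can be illustrated with a small sketch. This is not the algorithm from the paper: the abstract does not say how the df range is chosen or how queries are ordered, so the corpus `DOCS`, the function names, the ranking by document id, and the range bounds below are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
DOCS = {
    1: "deep web crawling",
    2: "deep web data",
    3: "deep learning data",
    4: "web search engine",
    5: "deep query interface",
}


def document_frequencies(docs):
    """Map each term to the set of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings[term].add(doc_id)
    return postings


def select_queries(postings, df_min, df_max):
    """Keep the terms whose document frequency lies in [df_min, df_max]."""
    return sorted(
        term for term, ids in postings.items() if df_min <= len(ids) <= df_max
    )


def top_k_search(postings, term, k):
    """Simulate a ranked source that returns at most k matches.

    Ranking by ascending document id is an assumption; a real source
    uses its own, usually unknown, ranking function.
    """
    return sorted(postings.get(term, ()))[:k]


def crawl(postings, queries, k):
    """Issue the queries; return (distinct documents, total documents returned)."""
    harvested, returned = set(), 0
    for query in queries:
        hits = top_k_search(postings, query, k)
        returned += len(hits)
        harvested.update(hits)
    return harvested, returned


POSTINGS = document_frequencies(DOCS)

# Popular terms match more than k = 2 documents, so results are truncated
# and both queries return the same top-ranked documents.
popular = crawl(POSTINGS, ["deep", "web"], k=2)

# Terms whose df is at most k are never truncated by the source.
in_range = crawl(POSTINGS, select_queries(POSTINGS, 1, 2), k=2)
```

In this toy setting the two popular queries return four documents but only two distinct ones, while the queries with df in [1, 2] reach all five documents. The in-range queries still overlap with each other, so restricting the df range reduces redundancy caused by truncation rather than eliminating redundancy altogether.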


Footnotes
1
In this paper, we use the words ‘term’ and ‘query’ interchangeably; the only minor difference is that a query is a term that has been issued.
 
Metadata
Title
Crawling Ranked Deep Web Data Sources
Authors
Yan Wang
Yaxin Li
Nannan Pi
Jianguo Lu
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-26190-4_26
