
2018 | OriginalPaper | Book chapter

1. Intelligent Rule-Based Deep Web Crawler

Authors: S. G. Shaila, A. Vadivel

Published in: Textual and Visual Information Retrieval using Query Refinement and Pattern Analysis

Publisher: Springer Singapore


Abstract

This chapter discusses the architecture specification of a deep web crawler. The crawler includes an indexer and can fetch a large number of documents from both the surface web and the deep web. Documents from the deep web are fetched based on rules in which the core and allied fields of forms play an important role. The core and allied fields are identified from the domain, the nature of the FORM in the HTML pages, and the functional dependencies between the fields. An SVM classifier classifies each rule as most preferable, least preferable, or mutually exclusive. Documents are fetched using the most preferable fields in the FORM. The fetched documents are indexed, and the same architecture is scaled to support distributed functionality with the help of web services. The architecture processes a large number of documents with an encouraging coverage rate and a lower fetching time. The retrieval performance of the crawler is compared with that of the Google retrieval system, and the proposed architecture is found to achieve similar retrieval precision.
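The rule-driven fetching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Rule` class, the `build_queries` function, the field names, and the candidate values are all hypothetical. The rule labels, which the chapter obtains from an SVM classifier, are assigned by hand here so that the example stays self-contained.

```python
from dataclasses import dataclass
from itertools import product

# The three rule classes named in the abstract.
MOST_PREFERABLE = "most preferable"
LEAST_PREFERABLE = "least preferable"
MUTUALLY_EXCLUSIVE = "mutually exclusive"


@dataclass(frozen=True)
class Rule:
    """A candidate way to fill a form: one core field plus allied fields."""
    core: str
    allied: tuple
    label: str  # assumption: hand-assigned here; the chapter uses an SVM


def build_queries(rules, values):
    """Return form submissions for the most preferable rules only.

    `values` maps each field name to its candidate values (made-up data).
    """
    queries = []
    for rule in rules:
        if rule.label != MOST_PREFERABLE:
            continue  # skip least preferable and mutually exclusive rules
        fields = (rule.core,) + rule.allied
        # One submission per combination of candidate values.
        for combo in product(*(values[f] for f in fields)):
            queries.append(dict(zip(fields, combo)))
    return queries


# Hypothetical book-search form; field names and labels are illustrative.
rules = [
    Rule("title", ("language",), MOST_PREFERABLE),
    Rule("isbn", ("title",), MUTUALLY_EXCLUSIVE),
    Rule("publisher", (), LEAST_PREFERABLE),
]
values = {
    "title": ["web mining", "crawling"],
    "language": ["en"],
    "isbn": ["0000000000"],
    "publisher": ["any"],
}
queries = build_queries(rules, values)
```

In this sketch only the `title` plus `language` rule produces submissions, one per candidate title. A real crawler would send each dictionary as a form request and pass the returned documents to the indexer.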


Metadata
Title
Intelligent Rule-Based Deep Web Crawler
Authors
S. G. Shaila
A. Vadivel
Copyright year
2018
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-13-2559-5_1