
2018 | Original Paper | Book Chapter

1. Intelligent Rule-Based Deep Web Crawler

Authors: S. G. Shaila, A. Vadivel

Published in: Textual and Visual Information Retrieval using Query Refinement and Pattern Analysis

Publisher: Springer Singapore


Abstract

This chapter discusses the architecture specification of a deep web crawler. The crawler includes an indexer and can fetch a huge number of documents from both the surface web and the deep web. Documents from the deep web are fetched based on rules, in which the core and allied fields of the forms play an important role. Based on the domain and the nature of the FORM in HTML pages, the functional dependencies between the fields are determined and the core and allied fields are identified. An SVM classifier is used to classify each rule as most preferable, least preferable, or mutually exclusive. The documents are fetched using the most preferable fields in the FORM. The fetched documents are indexed, and the same architecture is scaled to support distributed functionality with the help of web services. This architecture processes a huge number of documents with an encouraging coverage rate and lower fetching time. The retrieval performance of the crawler is compared with that of the Google retrieval system, and the proposed architecture is found to achieve similar retrieval precision.
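The rule-driven form submission described in the abstract might be sketched as follows. This is only an illustration under stated assumptions, not the chapter's implementation: the `FieldRule` type, the `build_query` helper, the example field names, and the `https://example.org/search` URL are all hypothetical, and the labels are supplied by hand here rather than produced by an SVM classifier.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlencode

# The three rule labels named in the abstract. In the chapter they come from an
# SVM classifier; here they are assigned by hand (assumption for illustration).
MOST_PREFERABLE = "most_preferable"
LEAST_PREFERABLE = "least_preferable"
MUTUALLY_EXCLUSIVE = "mutually_exclusive"


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical rule: one FORM field, its role, its label, and a value."""
    field: str  # name attribute of the FORM field
    kind: str   # "core" or "allied" (illustrative tagging)
    label: str  # one of the three labels above
    value: str  # value to submit for this field


def build_query(action_url: str, rules: list[FieldRule]) -> str:
    """Build a GET submission URL from the most preferable rules only."""
    chosen = [r for r in rules if r.label == MOST_PREFERABLE]
    if not chosen:
        return action_url
    # Core fields first, then allied fields; sorted() is stable within groups.
    ordered = sorted(chosen, key=lambda r: 0 if r.kind == "core" else 1)
    params = [(r.field, r.value) for r in ordered]
    return f"{action_url}?{urlencode(params)}"


rules = [
    FieldRule("year", "allied", MOST_PREFERABLE, "2018"),
    FieldRule("title", "core", MOST_PREFERABLE, "deep web"),
    FieldRule("isbn", "core", MUTUALLY_EXCLUSIVE, "978-981-13-2558-8"),
    FieldRule("publisher", "allied", LEAST_PREFERABLE, "Springer"),
]
url = build_query("https://example.org/search", rules)
print(url)
```

Only the fields labelled most preferable reach the query string; the least preferable and mutually exclusive fields are left out of the submission in this sketch.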



Title: Intelligent Rule-Based Deep Web Crawler
Authors: S. G. Shaila, A. Vadivel
Publisher: Springer Singapore
Book: Textual and Visual Information Retrieval using Query Refinement and Pattern Analysis
Print ISBN: 978-981-13-2558-8
Electronic ISBN: 978-981-13-2559-5
Copyright year: 2018
DOI: https://doi.org/10.1007/978-981-13-2559-5_1
