Information Systems Frontiers

07-06-2018

Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules

Author: Chichang Jou

Published in: Information Systems Frontiers | Issue 1/2019

Abstract

With the growth of the World Wide Web, the volume of data held in web databases has increased tremendously. This deep web content, hidden behind query interfaces, is of much higher quality than content on the surface web. To obtain deep web data, users must fill in query conditions in an HTML query interface and click the submit button. Many applications that rely on deep web content, such as named entity attribute collection, topic-focused crawling, and heterogeneous data integration, depend on understanding the schema of these query interfaces. The schema needs to cover the mappings between input elements and labels, the data types of valid input values, and the range constraints on those values. In addition, to extract the hidden data, the schema needs to include form submission information, such as cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. Elements, candidate labels, and new lines in the query interface are then streamlined to produce its Interface Expression (IEXP). By combining the user’s view and the designer’s view, with the aid of semantic information, we build heuristic rules to extract the schema from the IEXP of query interfaces in the ICQ dataset. These rules use (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships among labels and elements. Supplemented with form submission information, the extracted schemas are stored in XML format so that they can be used in further applications, such as schema matching and merging for federated query interface integration. Experimental results on the TEL-8 dataset show that HSE performs effectively.
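The pipeline outlined in the abstract (collect surrounding texts as candidate labels, flatten the form into an interface expression, then apply label rules) can be illustrated with a minimal Python sketch. It is not the HSE implementation. The abstract does not specify the IEXP notation, the similarity function, or the rules, so everything here is an assumption for illustration: the token codes `t` (text), `e` (element), and `|` (line break), the helper names `tokenize`, `interface_expression`, `label_pairs`, and `similarity`, the "nearest text to the left in the same row" rule, and the use of `difflib.SequenceMatcher` as a stand-in for the proposed string similarity function.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

FORM_ELEMENTS = {"input", "select", "textarea"}
LINE_BREAKS = {"br", "tr", "p", "div"}  # assumed row delimiters

SAMPLE_FORM = """
<form action="/search" method="get">
  Title: <input type="text" name="title"><br>
  Author <input type="text" name="author"><br>
  Price from <input type="text" name="pmin"> to <input type="text" name="pmax"><br>
  <input type="submit" value="Search">
</form>
"""


class FormTokenizer(HTMLParser):
    """Flatten a form into (kind, value) tokens: t=text, e=element, |=line break."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._in_select = False  # option texts are values, not candidate labels

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in FORM_ELEMENTS:
            if attrs.get("type") in {"hidden", "submit"}:
                return  # not user-facing query conditions
            self.tokens.append(("e", attrs.get("name", "")))
            if tag == "select":
                self._in_select = True
        elif tag in LINE_BREAKS:
            self.tokens.append(("|", ""))

    def handle_endtag(self, tag):
        if tag == "select":
            self._in_select = False

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._in_select:
            self.tokens.append(("t", text))


def tokenize(html):
    parser = FormTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def interface_expression(tokens):
    """Join token kinds into a string, collapsing repeated row delimiters."""
    expr = "".join(kind for kind, _ in tokens)
    while "||" in expr:
        expr = expr.replace("||", "|")
    return expr.strip("|")


def label_pairs(tokens):
    """Hypothetical spatial rule: label = nearest text to the left in the same row."""
    pairs, last_text = {}, None
    for kind, value in tokens:
        if kind == "|":
            last_text = None
        elif kind == "t":
            last_text = value
        elif kind == "e" and last_text is not None:
            pairs[value] = last_text
    return pairs


def similarity(a, b):
    """Stand-in similarity on lowercased alphanumeric characters (not HSE's function)."""
    def norm(s):
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


tokens = tokenize(SAMPLE_FORM)
print(interface_expression(tokens))
print(label_pairs(tokens))
```

A real system would need further rules, for example to recognize that `pmin` and `pmax` in this sample form one range constraint on a single attribute, which is the kind of group and range relationship the abstract mentions.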

Title: Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules
Author: Chichang Jou
Publication date: 07-06-2018
Publisher: Springer US
Published in: Information Systems Frontiers / Issue 1/2019
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI: https://doi.org/10.1007/s10796-018-9863-6
