Journal of Intelligent Information Systems

Published: 01.02.2013

Automatic discovery of Web Query Interfaces using machine learning techniques

Authors: Heidy M. Marin-Castro, Victor J. Sosa-Sosa, Jose F. Martinez-Trinidad, Ivan Lopez-Arevalo

Published in: Journal of Intelligent Information Systems | Issue 1/2013

Abstract

The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and dynamically generated by querying back-end (relational) databases through Web Query Interfaces (WQIs), a special type of HTML form. Accessing Deep Web information is a great challenge because it is usually not indexed by general-purpose search engines. Therefore, efficient mechanisms are needed to access, extract and integrate the information contained in the Deep Web. Since WQIs are the only means of accessing the Deep Web, their automatic identification plays an important role: it helps traditional search engines increase their coverage and reach interesting information that is not available on the indexable Web. The accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. The proposal selects HTML elements extracted from HTML forms and uses them in a set of heuristic rules that help to identify WQIs. The strategy uses machine learning algorithms to classify HTML forms as searchable (WQI) or non-searchable (non-WQI), together with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of Web Query Interfaces was analyzed in order to identify only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a new dataset, including both advanced and simple query interfaces, that we created using a generic crawler supported by human experts. The experimental results show that the proposed strategy outperforms previously reported works.
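
As a rough illustration of the kind of pipeline the abstract describes, the following minimal sketch counts a few HTML elements inside each form and applies one heuristic rule to separate searchable from non-searchable forms. The feature set, the function names (`extract_form_features`, `looks_searchable`) and the rule itself are assumptions made for this example; the abstract does not specify the elements or rules the authors actually use.

```python
from html.parser import HTMLParser

# Input types counted as features (an assumed, illustrative selection).
INPUT_TYPES = ("text", "password", "checkbox", "radio", "submit")


class FormFeatureExtractor(HTMLParser):
    """Counts selected HTML elements inside each <form> of a page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one feature dict per completed form
        self._current = None  # features of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = {name: 0 for name in INPUT_TYPES}
            self._current["select"] = 0
            return
        if self._current is None:
            return  # element is outside any form
        if tag == "select":
            self._current["select"] += 1
        elif tag == "input":
            # An <input> without a type attribute defaults to a text field.
            kind = (dict(attrs).get("type") or "text").lower()
            if kind in INPUT_TYPES:
                self._current[kind] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Returns a list of feature dicts, one per form found in the HTML."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_searchable(features):
    """Hypothetical rule: no password field and at least one query field."""
    if features["password"] > 0:
        return False  # login and registration forms are not query interfaces
    return features["text"] + features["select"] >= 1
```

In a fuller setup, feature dictionaries like these would serve as training instances for a classifier, with a prototype selection step pruning redundant instances beforehand; that part is not sketched here.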

Title: Automatic discovery of Web Query Interfaces using machine learning techniques
Authors: Heidy M. Marin-Castro
Victor J. Sosa-Sosa
Jose F. Martinez-Trinidad
Ivan Lopez-Arevalo
Publication date: 01.02.2013
Publisher: Springer US
Published in: Journal of Intelligent Information Systems / Issue 1/2013
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI: https://doi.org/10.1007/s10844-012-0217-4
