Published in: Journal of Intelligent Information Systems 1/2013

01.02.2013

Automatic discovery of Web Query Interfaces using machine learning techniques

Written by: Heidy M. Marin-Castro, Victor J. Sosa-Sosa, Jose F. Martinez-Trinidad, Ivan Lopez-Arevalo


Abstract

The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and dynamically generated by querying back-end (relational) databases through Web Query Interfaces (WQIs), a special type of HTML form. Accessing Deep Web information is a great challenge because it is usually not indexed by general-purpose search engines. Therefore, efficient mechanisms are needed to access, extract and integrate the information contained in the Deep Web. Since WQIs are the only means of accessing the Deep Web, their automatic identification plays an important role: it helps traditional search engines increase their coverage and reach interesting information that is not available on the indexable Web. The accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. The proposal selects HTML elements extracted from HTML forms and uses them in a set of heuristic rules that help to identify WQIs. The strategy uses machine learning algorithms to classify HTML forms as searchable (WQIs) or non-searchable (non-WQIs), together with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of Web Query Interfaces was analyzed in order to identify only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a new dataset, created using a generic crawler supported by human experts, that includes advanced and simple query interfaces. The experimental results show that the proposed strategy outperforms previously reported works.
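
To make the idea of describing an HTML form by its elements more concrete, the following is a minimal sketch and not the authors' implementation. It counts a few form elements with Python's standard `html.parser` module and applies a single hand-written rule. The feature names (`input_text`, `input_password`, and so on), the helper `extract_form_features` and the rule in `looks_searchable` are all assumptions made for illustration; the abstract does not list the actual features or rules, and in the paper's strategy a trained classifier decides between searchable and non-searchable forms.

```python
from collections import Counter
from html.parser import HTMLParser


class FormFeatureExtractor(HTMLParser):
    """Collects per-form counts of HTML elements (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one Counter per <form> encountered
        self._current = None  # Counter for the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = Counter()
            # Record the submission method; GET is the HTML default.
            method = (attrs.get("method") or "get").lower()
            self._current["method_" + method] += 1
        elif self._current is not None:
            if tag == "input":
                # Count inputs by type, e.g. input_text, input_password.
                input_type = (attrs.get("type") or "text").lower()
                self._current["input_" + input_type] += 1
            elif tag in ("select", "textarea", "button"):
                self._current[tag] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Return a list of element-count Counters, one per form in the page."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_searchable(features):
    """Illustrative rule only (an assumption, not a rule from the paper)."""
    if features["input_password"] > 0:
        return False  # login and registration forms are treated as non-WQIs
    query_fields = (features["input_text"]
                    + features["input_search"]
                    + features["select"])
    return query_fields > 0
```

In a learning-based setting, the counters returned by `extract_form_features` would be turned into feature vectors for a classifier, and `looks_searchable` would be replaced by the trained model's prediction.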


Metadata
Title
Automatic discovery of Web Query Interfaces using machine learning techniques
Written by
Heidy M. Marin-Castro
Victor J. Sosa-Sosa
Jose F. Martinez-Trinidad
Ivan Lopez-Arevalo
Publication date
01.02.2013
Publisher
Springer US
Published in
Journal of Intelligent Information Systems / Issue 1/2013
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-012-0217-4
