DOI: 10.1145/1806338.1806426
short-paper

Information extraction from web tables

Published: 14 December 2009

ABSTRACT

Many users now rely on web search engines to find and gather information, and they face a growing number of diverse web page information sources. Correlating, integrating, and presenting related information to users has therefore become an important issue. When a user queries a search engine such as Yahoo or Google for specific information, the results indicate not only whether the desired information is available but also which other pages mention it. Extracting information from web pages has also become very important because of the massive and increasing number of diverse web page information sources available to users on the Internet, and the variety of those pages makes information extraction from the web a challenging problem. This paper proposes an approach for extracting information from web tables based on standard classifications. The proposed approach consists of four main phases: (i) pre-processing, (ii) extraction, (iii) classification, and (iv) simplification. The approach is evaluated through experiments on a number of web pages from the Nokia products domain because, to the best of our knowledge, this is the only product that has complete and complex standard classifiers.
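
The abstract names the four phases but does not describe how each one works, so the following is only an illustrative sketch of how such a pipeline could be wired together, not the authors' implementation. Everything in it is an assumption made for illustration: the `STANDARD_CLASSES` lookup table, the restriction to two-cell (label, value) rows, and the function names `preprocess`, `extract`, `classify`, `simplify`, and `run_pipeline`. Nested tables are not handled.

```python
from html.parser import HTMLParser
import re

# Hypothetical stand-in for a "standard classification": maps attribute
# labels found on a page to canonical class names. Not taken from the paper.
STANDARD_CLASSES = {
    "weight": "Size",
    "dimensions": "Size",
    "talk time": "Battery",
    "standby time": "Battery",
}


class TableRows(HTMLParser):
    """Collects the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def preprocess(html):
    # Phase (i), assumed: drop script/style blocks and HTML comments.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    return re.sub(r"(?s)<!--.*?-->", "", html)


def extract(html):
    # Phase (ii), assumed: keep only rows that look like (label, value) pairs.
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [tuple(row) for row in parser.rows if len(row) == 2]


def classify(pairs):
    # Phase (iii), assumed: look each label up in the classification table.
    triples = []
    for label, value in pairs:
        key = label.lower().rstrip(":")
        triples.append((STANDARD_CLASSES.get(key, "Unclassified"), label, value))
    return triples


def simplify(triples):
    # Phase (iv), assumed: group values by class and tidy the labels.
    grouped = {}
    for cls, label, value in triples:
        grouped.setdefault(cls, {})[label.rstrip(":")] = value
    return grouped


def run_pipeline(html):
    return simplify(classify(extract(preprocess(html))))
```

Calling `run_pipeline` on a page's HTML returns a dictionary keyed by class name, with unmatched labels collected under `"Unclassified"`. A fixed lookup table is the simplest possible classifier; the paper's standard classifiers are presumably richer than this.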

Published in

iiWAS '09: Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services
December 2009
763 pages
ISBN: 9781605586601
DOI: 10.1145/1806338

Copyright © 2009 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
