short-paper

Information extraction from web tables

Authors:
Mahmoud Shaker, Universiti Putra Malaysia, Serdang, Malaysia
Hamidah Ibrahim, Universiti Putra Malaysia, Serdang, Malaysia
Aida Mustapha, Universiti Putra Malaysia, Serdang, Malaysia
Lili Nurliyana Abdullah, Universiti Putra Malaysia, Serdang, Malaysia

iiWAS '09: Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services
December 2009, Pages 470–476
https://doi.org/10.1145/1806338.1806426

Published: 14 December 2009

ABSTRACT

Many users rely on web search engines to find and gather information, and they face a growing number of diverse web information sources. Correlating, integrating, and presenting related information to users has therefore become an important problem. When a user queries a search engine such as Yahoo or Google for a specific piece of information, the results indicate not only where the desired information is available but also other pages on which it is mentioned. Extracting information from web pages has likewise become very important, because the number of diverse web sources available to users on the Internet is massive and still increasing, and the variety of web pages makes information extraction from the web a challenging problem. This paper proposes an approach for extracting information from web tables based on standard classifications. The approach consists of four main phases: (i) pre-processing, (ii) extraction, (iii) classification, and (iv) simplification. It is evaluated through experiments on a number of web pages from the Nokia products domain because, to the best of our knowledge, this is the only product domain with complete and complex standard classifiers.
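The abstract names the four phases but does not describe how each one works. The following is a minimal sketch of how such a pipeline could be chained, not the authors' implementation. Every detail is an assumption made for illustration: the function names, the `STANDARD_CLASSES` mapping, the restriction to two-cell attribute/value rows, and the invented sample table. Nested tables are not handled.

```python
import re
from html.parser import HTMLParser

# Hypothetical standard classification: attribute label -> category.
# The paper's actual classifiers are not given in the abstract.
STANDARD_CLASSES = {
    "weight": "Size",
    "dimensions": "Size",
    "display": "Display",
    "talk time": "Battery",
}


def preprocess(html):
    """Phase (i): drop script/style blocks and HTML comments."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    return re.sub(r"(?s)<!--.*?-->", "", html)


class TableRows(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract(html):
    """Phase (ii): return (attribute, value) pairs from two-cell rows."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [(row[0], row[1]) for row in parser.rows if len(row) == 2]


def classify(pairs):
    """Phase (iii): attach a standard category to each pair, or None."""
    return [
        (STANDARD_CLASSES.get(attr.lower().rstrip(":")), attr, value)
        for attr, value in pairs
    ]


def simplify(classified):
    """Phase (iv): keep classified, non-empty values, grouped by category."""
    result = {}
    for category, attr, value in classified:
        if category and value:
            result.setdefault(category, {})[attr.rstrip(":")] = value
    return result


def run_pipeline(html):
    return simplify(classify(extract(preprocess(html))))


# Invented sample table, for illustration only.
SAMPLE = """
<table>
  <tr><th>Weight:</th><td>  113 g </td></tr>
  <tr><td>Display</td><td>2.4 inch <b>TFT</b></td></tr>
  <tr><td>Colour</td><td>Black</td></tr>
  <tr><td colspan="2">Specifications<script>var x = 1;</script></td></tr>
</table>
"""

if __name__ == "__main__":
    print(run_pipeline(SAMPLE))
```

In this sketch, rows whose attribute is missing from the assumed classification (here "Colour") and rows that are not attribute/value pairs (the single-cell heading row) are dropped during simplification.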

Index Terms

Information extraction from web tables
1. Information systems
  1. Information retrieval

Published in
iiWAS '09: Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services
December 2009
763 pages
ISBN: 9781605586601
DOI: 10.1145/1806338
General Chair: Gabriele Kotsis, Johannes Kepler University Linz, Austria
Program Chairs: David Taniar, Monash University, Australia; Eric Pardede, La Trobe University, Australia
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information extraction
web tables