ABSTRACT
Many users rely on web search engines to find and gather information, and they face a growing number of diverse web information sources. Correlating, integrating, and presenting related information to users has therefore become an important problem. When a user queries a search engine such as Yahoo or Google for a specific piece of information, the results indicate not only whether the desired information is available but also which other pages mention it. Extracting information from web pages is equally important, because the volume of web pages available on the Internet is massive and growing, and their variety makes information extraction a challenging problem. This paper proposes an approach for extracting information from web tables based on standard classifications. The approach consists of four main phases: (i) pre-processing, (ii) extraction, (iii) classification, and (iv) simplification. It is evaluated through experiments on a number of web pages from the Nokia products domain because, to the best of our knowledge, this is the only product that has complete and complex standard classifiers.
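The abstract names the four phases but does not describe how each one works, so the following is only a minimal sketch of how such a pipeline could be wired together, not the authors' implementation. Everything in it is an assumption made for illustration: the `STANDARD_CLASSES` mapping, the function names, the restriction to two-cell attribute/value rows, and the choice to group unmatched attributes under "Unclassified". It uses only the Python standard library and does not handle nested tables.

```python
import re
from html.parser import HTMLParser

# Hypothetical standard classification (attribute name -> class label).
# The real classification used in the paper is not given in the abstract.
STANDARD_CLASSES = {
    "weight": "Size",
    "dimensions": "Size",
    "talk time": "Battery",
}


class TableRowParser(HTMLParser):
    """Collects the text of <td>/<th> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def preprocess(html):
    """Phase (i), assumed: strip <script> and <style> blocks."""
    return re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)


def extract(html):
    """Phase (ii), assumed: keep two-cell rows as (attribute, value) pairs."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [(row[0], row[1]) for row in parser.rows if len(row) == 2]


def classify(pairs):
    """Phase (iii), assumed: look each attribute up in STANDARD_CLASSES."""
    return [
        (STANDARD_CLASSES.get(name.lower(), "Unclassified"), name, value)
        for name, value in pairs
    ]


def simplify(classified):
    """Phase (iv), assumed: group attribute/value pairs by class label."""
    grouped = {}
    for label, name, value in classified:
        grouped.setdefault(label, {})[name] = value
    return grouped


def run_pipeline(html):
    return simplify(classify(extract(preprocess(html))))
```

Calling `run_pipeline` on the HTML of a product specification page returns a dictionary keyed by class label, with one inner dictionary of attribute/value pairs per class. Rows that do not have exactly two cells, such as layout spacers, are skipped in this sketch.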
Index Terms
- Information extraction from web tables