ABSTRACT
In this paper we propose a new knowledge management task which aims to map Web pages to their corresponding records in a structured database. For example, the DBLP database contains records for many computer scientists, and most of these people have public Web pages; if we can map a database record to the appropriate Web page, the information on that page can be used to further describe the person's database record. To accomplish this goal we employ link paths, which collect the anchor texts along multiple paths through the Web that end at the Web page in question. We hypothesize that the information from these link paths can be used to generate an accurate mapping from Web pages to database records. Experiments on two large, real-world data sets (DBLP and IMDB for the structured data, with computer science faculty members' Web pages and official movie homepages for the Web page data) show that our method does provide an accurate mapping. Finally, we conclude by issuing a call for further research on this promising new task.
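As a rough illustration of the link-path idea described in the abstract, the following minimal sketch gathers the anchor texts along loop-free paths that end at a target page and picks the database record whose name overlaps most with those anchors. It is not the paper's method: the toy graph, the function names (`link_paths`, `score`, `best_record`), the path-length cap, and the token-overlap scoring are all assumptions made for this example.

```python
from collections import Counter, deque

def link_paths(graph, start, target, max_len=3):
    """Collect anchor-text sequences along loop-free link paths from
    `start` to `target`, using at most `max_len` links per path.
    `graph` maps a page id to a list of (anchor_text, destination) pairs."""
    paths = []
    queue = deque([(start, [], {start})])
    while queue:
        page, anchors, seen = queue.popleft()
        if page == target and anchors:
            paths.append(anchors)
            continue
        if len(anchors) == max_len:
            continue
        for anchor, dest in graph.get(page, []):
            if dest not in seen:  # keep paths loop-free
                queue.append((dest, anchors + [anchor], seen | {dest}))
    return paths

def tokens(text):
    """Lowercase whitespace tokenization (an assumption for this sketch)."""
    return text.lower().split()

def score(record_name, paths):
    """Fraction of the record name's tokens found in any anchor text."""
    bag = Counter(t for path in paths for anchor in path for t in tokens(anchor))
    name_tokens = tokens(record_name)
    if not name_tokens:
        return 0.0
    return sum(1 for t in name_tokens if bag[t] > 0) / len(name_tokens)

def best_record(records, paths):
    """Return the record name with the highest anchor-text overlap."""
    return max(records, key=lambda r: score(r, paths))

# Hypothetical toy Web graph: a department page linking to faculty pages.
graph = {
    "dept": [("Faculty", "faculty"), ("Jane Doe", "jane")],
    "faculty": [("Prof. Jane Doe", "jane"), ("John Roe", "john")],
}
paths = link_paths(graph, "dept", "jane")
match = best_record(["John Roe", "Jane Doe"], paths)
```

In this toy graph two link paths reach the page `jane`, and both contribute the tokens "jane" and "doe", so the record "Jane Doe" scores highest. A real system would need a more robust scoring function than plain token overlap.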
Index Terms
- Mapping web pages to database records via link paths