ABSTRACT
Anchor text has long been used to enrich document representations for a variety of tasks on the World Wide Web. Anchor texts are useful because they resemble typical Web queries and because they express the document's context. It is therefore common practice for Web search engines to incorporate incoming anchor text into the document's standard textual representation. However, this approach does not suffice for documents with very few inlinks, and it does not capture the document's full context. To remedy these problems, we employ link paths, which contain the anchor texts found along paths through the Web that end at the document in question. We propose and study several different ways to aggregate anchor text from link paths, and we show that the information from link paths can be used to (1) improve known-item search in site-specific search, and (2) map Web pages to database records. We rigorously evaluate our proposed approach on several real-world test collections. We find that our approach significantly improves performance over baseline and existing techniques in both tasks.
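As an illustration only, the idea of collecting anchor text along link paths that end at a target page can be sketched as follows. This is a minimal sketch under stated assumptions, not the aggregation methods studied in the paper: the toy site graph `LINKS`, the function names `link_paths` and `aggregate`, the path-length cap `max_len`, and the distance-based `decay` weighting are all hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical toy site graph: source page -> list of (anchor text, destination page).
LINKS = {
    "home": [("Faculty", "faculty"), ("Research", "research")],
    "faculty": [("Jane Doe", "jane")],
    "research": [("Data Mining Group", "dm_group")],
    "dm_group": [("Jane Doe, group lead", "jane")],
}


def link_paths(links, start, target, max_len=4):
    """Return the anchor-text sequence of every cycle-free path from start to target."""
    paths = []

    def walk(page, anchors, visited):
        if page == target and anchors:
            paths.append(list(anchors))
            return
        if len(anchors) >= max_len:
            return  # assumed cap on path length
        for anchor, dest in links.get(page, []):
            if dest not in visited:  # do not revisit pages already on this path
                walk(dest, anchors + [anchor], visited | {dest})

    walk(start, [], {start})
    return paths


def aggregate(paths, decay=0.5):
    """Build a weighted bag of terms; anchors farther from the target count less.

    The exponential decay is an assumption made for this sketch.
    """
    weights = Counter()
    for anchors in paths:
        for hops_from_target, anchor in enumerate(reversed(anchors)):
            for term in anchor.lower().replace(",", " ").split():
                weights[term] += decay ** hops_from_target
    return weights


paths = link_paths(LINKS, "home", "jane")
weights = aggregate(paths)
```

In this toy graph the page `jane` has only two inlinks, yet the aggregated terms also include context such as "faculty", "research", and "mining" from pages earlier on the paths, which is the kind of additional context the abstract attributes to link paths.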
Index Terms
- Building enriched web page representations using link paths