DOI: 10.1145/2309996.2310006
research-article

Building enriched web page representations using link paths

Published: 25 June 2012

ABSTRACT

Anchor text has a history of enriching documents for a variety of tasks within the World Wide Web. Anchor texts are useful because they are similar to typical Web queries, and because they express the document's context. Therefore, it is a common practice for Web search engines to incorporate incoming anchor text into the document's standard textual representation. However, this approach will not suffice for documents with very few inlinks, and it does not incorporate the document's full context. To mitigate these problems, we employ link paths, which contain anchor texts from paths through the Web ending at the document in question. We propose and study several different ways to aggregate anchor text from link paths, and we show that the information from link paths can be used to (1) improve known-item search in site-specific search, and (2) map Web pages to database records. We rigorously evaluate our proposed approach on several real-world test collections. We find that our approach significantly improves performance over baseline and existing techniques in both tasks.
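
The notion of a link path can be illustrated with a small sketch. The abstract does not specify the aggregation methods studied in the paper, so the code below is not the authors' approach: the toy site graph, the function names `link_paths` and `aggregate`, and the distance-decay weighting are all assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy site graph: (source_page, target_page) -> anchor text.
# Every page name and anchor text here is invented for illustration.
LINKS = {
    ("home", "people"): "People",
    ("home", "research"): "Research",
    ("people", "faculty"): "Faculty",
    ("faculty", "jsmith"): "Jane Smith",
    ("research", "jsmith"): "Data mining group lead",
}


def link_paths(target, max_len=3):
    """Return anchor-text sequences for link paths ending at `target`.

    Paths are found by walking backwards over inlinks. A path stops when
    it reaches `max_len` links or a page with no unvisited inlinks.
    """
    paths = []

    def walk(page, anchors, visited):
        inlinks = [(src, text) for (src, dst), text in LINKS.items()
                   if dst == page and src not in visited]
        if not inlinks or len(anchors) == max_len:
            if anchors:
                paths.append(anchors)
            return
        for src, text in inlinks:
            # Prepend so each path reads from the farthest page to the target.
            walk(src, [text] + anchors, visited | {src})

    walk(target, [], {target})
    return paths


def aggregate(paths, decay=0.5):
    """Assumed weighting: a term counts decay**distance, where distance 0
    is the link pointing directly at the target page. A link shared by
    several paths is counted once per path."""
    weights = Counter()
    for path in paths:
        for distance, text in enumerate(reversed(path)):
            for term in text.lower().split():
                weights[term] += decay ** distance
    return weights


paths = link_paths("jsmith")
weights = aggregate(paths)
```

In this toy graph the page `jsmith` has only two inlinks, but its link paths also contribute terms such as "faculty" and "research" from pages further back, which is the kind of additional context the abstract describes for documents with very few inlinks.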

      • Published in

        HT '12: Proceedings of the 23rd ACM conference on Hypertext and social media
        June 2012
        340 pages
        ISBN: 9781450313353
        DOI: 10.1145/2309996

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        HT '12 Paper Acceptance Rate: 33 of 120 submissions, 28%
        Overall Acceptance Rate: 378 of 1,158 submissions, 33%
