Building enriched web page representations using link paths

Authors:
Tim Weninger, University of Illinois Urbana-Champaign, Urbana, IL, USA
ChengXiang Zhai, University of Illinois Urbana-Champaign, Urbana, IL, USA
Jiawei Han, University of Illinois Urbana-Champaign, Urbana, IL, USA

HT '12: Proceedings of the 23rd ACM conference on Hypertext and social media, June 2012, Pages 53–62, https://doi.org/10.1145/2309996.2310006

Published: 25 June 2012

ABSTRACT

Anchor text has a history of enriching documents for a variety of tasks within the World Wide Web. Anchor texts are useful because they are similar to typical Web queries, and because they express the document's context. Therefore, it is a common practice for Web search engines to incorporate incoming anchor text into the document's standard textual representation. However, this approach will not suffice for documents with very few inlinks, and it does not incorporate the document's full context. To remedy these problems, we employ link paths, which contain anchor texts from paths through the Web ending at the document in question. We propose and study several different ways to aggregate anchor text from link paths, and we show that the information from link paths can be used to (1) improve known-item search in site-specific search, and (2) map Web pages to database records. We rigorously evaluate our proposed approach on several real-world test collections. We find that our approach significantly improves performance over baseline and existing techniques in both tasks.
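The idea of a link path can be illustrated with a minimal sketch. It is not the aggregation method evaluated in the paper, which the abstract does not specify. The toy link graph, the function names `link_paths` and `aggregate`, the `max_len` cutoff, and the distance-based `decay` weighting are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical toy link graph: (source_page, target_page, anchor_text).
LINKS = [
    ("home", "people", "People"),
    ("people", "faculty", "Faculty"),
    ("faculty", "jhan", "Jiawei Han"),
    ("home", "jhan", "Han"),
]


def link_paths(links, target, max_len=3):
    """Enumerate anchor-text sequences along cycle-free paths ending at target.

    Each returned list holds the anchors of one path, ordered from the
    farthest link to the link that points directly at the target.
    """
    incoming = defaultdict(list)
    for src, dst, anchor in links:
        incoming[dst].append((src, anchor))

    paths = []

    def walk(page, anchors, visited):
        for src, anchor in incoming[page]:
            if src in visited:  # skip cycles
                continue
            path = [anchor] + anchors
            paths.append(path)
            if len(path) < max_len:
                walk(src, path, visited | {src})

    walk(target, [], {target})
    return paths


def aggregate(paths, decay=0.5):
    """Build a weighted bag of anchor terms (assumed weighting scheme).

    Every suffix of a path is itself in `paths`, so only the farthest
    anchor of each path is counted here, weighted by decay**distance,
    where distance is the number of links between it and the target.
    """
    weights = Counter()
    for path in paths:
        distance = len(path) - 1
        for term in path[0].lower().split():
            weights[term] += decay ** distance
    return weights


paths = link_paths(LINKS, "jhan")
weights = aggregate(paths)
```

In this toy graph the page `jhan` has only two inlinks, but the path through `people` and `faculty` contributes the additional context terms "faculty" and "people" at reduced weight, which is the kind of enrichment the abstract describes for pages with few inlinks.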

Index Terms

1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Published in
HT '12: Proceedings of the 23rd ACM conference on Hypertext and social media
June 2012
340 pages
ISBN: 9781450313353
DOI: 10.1145/2309996
General Chair: Ethan Munson, University of Wisconsin - Milwaukee, USA
Program Chair: Markus Strohmaier, Graz University of Technology, Austria
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
anchor text
document indexing
link paths
record linkage
web
Qualifiers
- research-article
Acceptance Rates
HT '12 Paper Acceptance Rate: 33 of 120 submissions, 28%
Overall Acceptance Rate: 378 of 1,158 submissions, 33%