Article

Page-level template detection via isotonic smoothing

Authors:
Deepayan Chakrabarti, Yahoo! Research, Sunnyvale, CA
Ravi Kumar, Yahoo! Research, Sunnyvale, CA
Kunal Punera, University of Texas at Austin, Austin, TX

WWW '07: Proceedings of the 16th international conference on World Wide Web, May 2007, Pages 61–70, https://doi.org/10.1145/1242572.1242582

Published: 08 May 2007

ABSTRACT

We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
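
To make the smoothing idea concrete, here is a minimal sketch of isotonic regression on a single root-to-leaf path of a DOM tree, using the classic Pool Adjacent Violators (PAV) procedure. It is not the method of the paper: the abstract describes a regularized problem over the whole page, whereas this sketch assumes an unregularized least-squares cost, a single chain of nodes, and the constraint that scores do not decrease from root to leaf. The function name `pav_nondecreasing` and the sample scores are hypothetical.

```python
def pav_nondecreasing(scores):
    """Return the non-decreasing sequence closest to `scores` in squared error.

    Assumption for illustration: `scores` are raw classifier outputs for the
    nodes on one root-to-leaf path, listed from root to leaf.
    """
    blocks = []  # each block is [sum of scores, number of nodes pooled]
    for s in scores:
        blocks.append([s, 1])
        # Pool adjacent blocks while the earlier block's mean exceeds the
        # later block's mean (compared by cross-multiplying the counts).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Every node in a pooled block receives the block mean.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Hypothetical raw scores along one path: the second and third nodes
# violate the ordering and are pooled to their average.
raw_path_scores = [0.2, 0.6, 0.4, 0.9]
smoothed = pav_nondecreasing(raw_path_scores)
print(smoothed)
```

In this toy run the out-of-order pair 0.6, 0.4 is replaced by its mean, while scores that already respect the ordering are left unchanged. Extending the idea to a full tree with a regularization term requires a different solver than this chain-only sketch.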

Index Terms

Page-level template detection via isotonic smoothing
1. Information systems
  1. Information systems applications

Published in
WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN:9781595936547
DOI:10.1145/1242572
General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)
Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 May 2007
Author Tags
isotonic regression
template detection
webpage sectioning
webpage segmentation
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%