DOI: 10.1145/1242572.1242582

Page-level template detection via isotonic smoothing

Published: 8 May 2007

ABSTRACT

We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
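
To make the smoothing idea concrete, here is a minimal sketch of plain isotonic regression using the classic pool-adjacent-violators procedure. It is not the paper's algorithm: the abstract specifies neither the order constraint nor the regularizer, so this sketch assumes that scores should be non-decreasing along a single root-to-leaf path, uses unweighted squared error, and omits regularization. The function name `isotonic_fit` and the sample scores are hypothetical.

```python
from typing import List


def isotonic_fit(scores: List[float]) -> List[float]:
    """Least-squares non-decreasing fit via pool-adjacent-violators.

    Assumption (not from the abstract): the input is a chain of
    per-node scores that should not decrease from first to last.
    """
    # Each block stores [sum of values, count]; its fitted value is the mean.
    blocks: List[List[float]] = []
    for s in scores:
        blocks.append([s, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand each block back to one fitted value per original position.
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


# Hypothetical raw classifier scores along one root-to-leaf DOM path.
raw = [0.1, 0.6, 0.4, 0.9]
smoothed = isotonic_fit(raw)
```

In this toy input the second and third scores violate the assumed ordering, so the procedure pools them into their mean and leaves the others unchanged. A tree-structured, regularized variant like the one the abstract refers to would need a different solver.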

    • Published in

      WWW '07: Proceedings of the 16th international conference on World Wide Web
      May 2007
      1382 pages
      ISBN: 9781595936547
      DOI: 10.1145/1242572

      Copyright © 2007 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
