DOI: 10.1145/988672.988732
Article

Automatic detection of fragments in dynamically generated web pages

Published: 17 May 2004

ABSTRACT

Dividing web pages into fragments has been shown to provide significant benefits for both content generation and caching. In order for a web site to use fragment-based content generation, however, good methods are needed for dividing web pages into fragments. Manual fragmentation of web pages is expensive, error prone, and unscalable. This paper proposes a novel scheme to automatically detect and flag fragments that are cost-effective cache units in web sites serving dynamic content. We consider the fragments to be interesting if they are shared among multiple documents or they have different lifetime or personalization characteristics. Our approach has three unique features. First, we propose a hierarchical and fragment-aware model of the dynamic web pages and a data structure that is compact and effective for fragment detection. Second, we present an efficient algorithm to detect maximal fragments that are shared among multiple documents. Third, we develop a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics. We evaluate the proposed scheme through a series of experiments, showing the benefits and costs of the algorithms. We also study the impact of adopting the fragments detected by our system on disk space utilization and network bandwidth consumption.
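The idea of detecting maximal fragments shared among multiple documents can be illustrated with a minimal sketch. It is not the paper's algorithm or data structure: it assumes pages are already parsed into a simple tree, uses exact subtree hashing as a stand-in for whatever similarity measure the paper employs, and the names `Node`, `hash_tree`, and `shared_fragments` are hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified page-tree node (hypothetical stand-in for a DOM node)."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def hash_tree(node, cache):
    """Hash a subtree bottom-up and remember the digest of every node."""
    child_hashes = [hash_tree(child, cache) for child in node.children]
    payload = repr((node.tag, node.text, child_hashes)).encode("utf-8")
    digest = hashlib.sha1(payload).hexdigest()
    cache[id(node)] = digest
    return digest


def shared_fragments(docs, min_docs=2):
    """Return {hash: (node, doc_ids)} for the topmost subtrees that occur
    in at least `min_docs` documents (assumption: exact-match sharing)."""
    cache = {}
    owners = defaultdict(set)  # subtree hash -> ids of documents containing it

    # Pass 1: hash every subtree and record which documents contain it.
    for doc_id, root in docs.items():
        hash_tree(root, cache)
        stack = [root]
        while stack:
            node = stack.pop()
            owners[cache[id(node)]].add(doc_id)
            stack.extend(node.children)

    # Pass 2: walk top-down and stop at the first shared subtree on each
    # path, so fragments nested inside a reported fragment are skipped.
    result = {}
    for root in docs.values():
        stack = [root]
        while stack:
            node = stack.pop()
            digest = cache[id(node)]
            if len(owners[digest]) >= min_docs:
                result[digest] = (node, owners[digest])
            else:
                stack.extend(node.children)
    return result


def nav():
    """Build a navigation bar that both example pages include."""
    return Node("div", children=[Node("a", "Home"), Node("a", "News")])


page1 = Node("body", children=[nav(), Node("p", "Story one")])
page2 = Node("body", children=[nav(), Node("p", "Story two")])
frags = shared_fragments({"p1": page1, "p2": page2})
```

In this example the navigation `div` is the only fragment reported, because the two `body` trees differ and the links inside the `div` are covered by it. The top-down walk is a simplification: it does not handle a nested subtree that is shared by more documents than its enclosing fragment, and it says nothing about lifetime or personalization characteristics.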


Published in

WWW '04: Proceedings of the 13th international conference on World Wide Web
May 2004
754 pages
ISBN: 158113844X
DOI: 10.1145/988672

Copyright © 2004 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions, 23%
