ABSTRACT
Dividing web pages into fragments has been shown to provide significant benefits for both content generation and caching. For a web site to use fragment-based content generation, however, it needs good methods for dividing its pages into fragments. Manual fragmentation of web pages is expensive, error-prone, and does not scale. This paper proposes a novel scheme to automatically detect and flag fragments that are cost-effective cache units in web sites serving dynamic content. We consider a fragment interesting if it is shared among multiple documents or if it has distinct lifetime or personalization characteristics. Our approach has three unique features. First, we propose a hierarchical, fragment-aware model of dynamic web pages and a data structure that is compact and effective for fragment detection. Second, we present an efficient algorithm to detect maximal fragments that are shared among multiple documents. Third, we develop a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics. We evaluate the proposed scheme through a series of experiments that show the benefits and costs of the algorithms. We also study how adopting the fragments detected by our system affects disk space utilization and network bandwidth consumption.
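To make the notion of a maximal shared fragment concrete, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes each page is already parsed into a nested `(tag, children)` tree with plain strings as text leaves, and it matches subtrees by exact SHA-1 digest rather than by any similarity measure. The names `subtree_digest` and `shared_maximal_fragments` and the `min_docs` parameter are hypothetical.

```python
import hashlib
from collections import defaultdict

# Assumed page model: a node is (tag, [children]); a text leaf is a str.

def subtree_digest(node, records, parent_index=None):
    """Digest `node` bottom-up, appending (digest, parent_index, node) to records."""
    index = len(records)
    records.append(None)  # reserve this node's slot so children can point to it
    if isinstance(node, str):
        payload = "text:" + node
    else:
        tag, children = node
        child_digests = [subtree_digest(c, records, index) for c in children]
        payload = "tag:" + tag + "(" + ",".join(child_digests) + ")"
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    records[index] = (digest, parent_index, node)
    return digest

def shared_maximal_fragments(docs, min_docs=2):
    """Map digest -> (subtree, owning documents) for maximal shared subtrees.

    A shared subtree is kept only if its parent is not shared by exactly
    the same set of documents (otherwise the parent subsumes it).
    """
    owners = defaultdict(set)  # digest -> names of documents containing it
    per_doc = {}
    for name, root in docs.items():
        records = []
        subtree_digest(root, records)
        per_doc[name] = records
        for digest, _, _ in records:
            owners[digest].add(name)

    result = {}
    for records in per_doc.values():
        for digest, parent_index, node in records:
            if len(owners[digest]) < min_docs:
                continue  # not shared widely enough
            if parent_index is not None:
                parent_digest = records[parent_index][0]
                if owners[parent_digest] == owners[digest]:
                    continue  # parent is shared by the same documents
            result[digest] = (node, owners[digest])
    return result

if __name__ == "__main__":
    header = ("div", ["Site header"])
    pages = {
        "a.html": ("body", [header, ("p", ["Story A"])]),
        "b.html": ("body", [header, ("p", ["Story B"])]),
    }
    for subtree, names in shared_maximal_fragments(pages).values():
        print(sorted(names), subtree)
```

Exact hashing only finds byte-identical subtrees, so a shared block whose content differs slightly between pages would be missed. Detecting fragments by lifetime or personalization would additionally require comparing multiple versions of the same page over time, which this sketch does not attempt.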