ABSTRACT
We formulate the template detection problem and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles that characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violations of these principles. As a consequence, basic "pure" implementations of simple search algorithms, when coupled with template detection and elimination, show surprising increases in precision at all levels of recall.
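The idea of detecting templates by counting frequent item sets can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes each page is reduced to the list of links it contains, treats any fixed-size set of links that co-occurs on at least `min_support` pages as template material, and removes those links. The names `frequent_link_sets`, `strip_template_links`, `min_support`, and `set_size`, as well as the sample URLs, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_link_sets(pages, min_support, set_size=2):
    """Return link sets of size `set_size` that occur on >= min_support pages.

    `pages` maps a page URL to the list of links found on that page
    (an assumed representation, chosen only for this illustration).
    """
    counts = Counter()
    for links in pages.values():
        # Deduplicate and sort so each combination has one canonical form.
        for combo in combinations(sorted(set(links)), set_size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


def strip_template_links(pages, min_support, set_size=2):
    """Drop every link that belongs to at least one frequent link set."""
    frequent = frequent_link_sets(pages, min_support, set_size)
    template_links = {link for combo in frequent for link in combo}
    return {
        url: [link for link in links if link not in template_links]
        for url, links in pages.items()
    }


# Hypothetical site: three pages share a navigation bar, one does not.
pages = {
    "a.html": ["/home", "/about", "/contact", "/paper1"],
    "b.html": ["/home", "/about", "/contact", "/paper2"],
    "c.html": ["/home", "/about", "/contact", "/paper3"],
    "d.html": ["/paper1", "/paper2"],
}

frequent = frequent_link_sets(pages, min_support=3)
cleaned = strip_template_links(pages, min_support=3)
print(sorted(frequent))
print(cleaned)
```

With a support threshold of 3, only the pairs drawn from the shared navigation links are frequent, so the content links on each page survive. A link-based ranking algorithm run on `cleaned` would then no longer count the navigation bar as an endorsement.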