Article

Template detection via data mining and its applications

Authors:
Ziv Bar-Yossef

University of California at Berkeley, Berkeley, CA

University of California at Berkeley, Berkeley, CA
View Profile

,
Sridhar Rajagopalan

IBM Almaden Research Center, San Jose, CA

IBM Almaden Research Center, San Jose, CA
View Profile

WWW '02: Proceedings of the 11th international conference on World Wide WebMay 2002Pages 580–591https://doi.org/10.1145/511446.511522

Published:07 May 2002Publication History

WWW '02: Proceedings of the 11th international conference on World Wide Web

Pages 580–591

ABSTRACT

We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall.

References

R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the Twentieth International Conference on Very Large Databases, pages 487--499, Santiago, Chile, 1994. Google ScholarDigital Library
K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104--111, 1998. Google ScholarDigital Library
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference (WWW7), pages 107--117, 1998. Google ScholarDigital Library
A. Z. Broder, S. C. Glassman, and M. S. Manasse. Syntactic clustering of the web. In Proceedings of the 6th International World Wide Web Conference (WWW6), pages 1157--1166, 1997. Google ScholarDigital Library
V. Bush. As we may think. The Atlantic Monthly, 176(1):101--108, July 1945.Google Scholar
S. Chakrabarti. Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction. In Proceedings of the 10th International World Wide Web Conference (WWW2001), pages 211--220, 2001. Google ScholarDigital Library
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. In Proceedings of the 7th International World Wide Web Conference (WWW7), pages 65--74, 1998. Google ScholarDigital Library
S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Topic distillation and spectral filtering. Artificial Intelligence Review, 13(5-6):409--435, 1999. Google ScholarDigital Library
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, pages 307--318, 1998. Google ScholarDigital Library
S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the 7th International World Wide Web Conference (WWW7), pages 65--74, 1998. Google ScholarDigital Library
S. Chakrabarti, M. Joshi, and V. Tawde. Enhanced topic distillation using text, markup tags, and hyperlinks. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001. Google ScholarDigital Library
S. Chakrabarti, M. van den Berg, and B. Dom. Distributed hypertext resource discovery through examples. In Proceedings of the 25th International Conference on Very Large Databases (VLDB), pages 375--386, 1999. Google ScholarDigital Library
S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the 8th International World Wide Web Conference (WWW8), pages 1623--1640, 1999. Google ScholarDigital Library
B. D. Davison. Recognizing nepotistic links on the web. In Proceedings of the AAAI-2000 Workshop on Artificial Intelligence for Web Search, pages 23--28, 2000.Google Scholar
J. Dean and M. Henzinger. Finding related pages in the world wide web. In Proceedings of the 8th International World Wide Web Conference (WWW8), pages 1467--1479, 1999. Google ScholarDigital Library
E. Garfield. "Citation Analysis as a Tool in Journal Evaluation". Science, 178:471--479, 1972.Google ScholarCross Ref
Google. http://www.google.com.Google Scholar
M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14:10--25, 1963.Google ScholarCross Ref
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, pages 604--632, 1999. Google ScholarDigital Library
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. In Proceedings of the 8th International World Wide Web Conference (WWW8), pages 1481--1493, 1999. Google ScholarDigital Library
R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks (Amsterdam, Netherlands: 1999), 33(1--6):387--401, June 2000. Google ScholarDigital Library
Y. Maarek, D. Berry, and G. Kaiser. An information retrieval approach for automatically constructing software libraries. Transactions on Software Engineering, 17(8):800--813, 1991. Google ScholarDigital Library
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Computer Science Department, Stanford University, 1998.Google Scholar
G. Pinski and F. Narin. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Inf. Proc. and Management, 12, 1976.Google Scholar
P. Pirolli, J. E. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures from the Web. In Conference Proceedings on Human Factors and Computing (CHI), pages 118--125, 1996. Google ScholarDigital Library
H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24:265--269, 1973.Google ScholarCross Ref

Index Terms

Template detection via data mining and its applications
1. Information systems
  1. Information retrieval

Recommendations

Mining fuzzy specific rare itemsets for education data

Association rule mining is an important data analysis method for the discovery of associations within data. There have been many studies focused on finding fuzzy association rules from transaction databases. Unfortunately, in the real world, one may ...
Read More
Mining uncertain data for constrained frequent sets
IDEAS '09: Proceedings of the 2009 International Database Engineering & Applications Symposium

Data mining aims to search for implicit, previously unknown, and potentially useful pieces of information---such as sets of items that are frequently co-occurring together---that are embedded in data. The mined frequent sets can be used in the discovery ...
Read More
Big Data Mining Applications and Services
BigDAS '15: Proceedings of the 2015 International Conference on Big Data Applications and Services

Data mining and analytics aims to analyze valuable data and extract implicit, previously unknown, and potentially useful information from the data. Due to advances in technology, high volumes of valuable data are generated at a high velocity in high ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '02: Proceedings of the 11th international conference on World Wide Web
May 2002
754 pages
ISBN:1581134495
DOI:10.1145/511446
Conference Chairs:
David Lassner
University of Hawaii
,
Dave De Roure
University of Southampton
,
Arun Iyengar
IBM T.J. Watson Research Center
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 7 May 2002
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
data mining
hypertext
information retrieval
web searching
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 143
  Total Citations
  View Citations
- 1,548
  Total Downloads
- Downloads (Last 12 months)9
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Template detection via data mining and its applications

WWW '02: Proceedings of the 11th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Mining fuzzy specific rare itemsets for education data

Mining uncertain data for constrained frequent sets

Big Data Mining Applications and Services

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Template detection via data mining and its applications

WWW '02: Proceedings of the 11th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Mining fuzzy specific rare itemsets for education data

Mining uncertain data for constrained frequent sets

Big Data Mining Applications and Services

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media