A probabilistic description-oriented approach for categorizing web documents

Authors:
Norbert Gövert, University of Dortmund
Mounia Lalmas, Department of Computer Science, Queen Mary & Westfield College, University of London and University of Dortmund
Norbert Fuhr, University of Dortmund
CIKM '99: Proceedings of the eighth international conference on Information and knowledge management, November 1999, Pages 475–482. https://doi.org/10.1145/319950.320053

Published: 01 November 1999

ABSTRACT

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. This poses a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers.

Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier.

Experimental results show that (1) an enhanced representation of web documents is crucial for categorising them effectively, and (2) a theoretical interpretation of the k-nearest neighbour classifier improves on the standard k-nearest neighbour classifier.
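For readers unfamiliar with the baseline, the following is a minimal sketch of a standard k-nearest neighbour text categoriser with similarity-weighted voting. It is not the probabilistic formulation developed in the paper, which the abstract does not spell out. The sparse dictionary representation, the cosine similarity measure, the normalisation of scores, and the function names `cosine` and `knn_scores` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_scores(doc, training, k=3):
    """Score categories by similarity-weighted votes of the k nearest neighbours.

    `training` is a list of (term_vector, set_of_categories) pairs.
    Scores are divided by the total similarity of the k neighbours, so each
    lies in [0, 1]; this normalisation is an illustrative assumption, not
    the estimate derived in the paper.
    """
    ranked = sorted(
        ((cosine(doc, vec), cats) for vec, cats in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = ranked[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0.0:
        return {}
    scores = {}
    for sim, cats in neighbours:
        for cat in cats:
            scores[cat] = scores.get(cat, 0.0) + sim
    return {cat: s / total for cat, s in scores.items()}
```

In this sketch the free parameters are the value of k, the similarity function, and the way votes are weighted and normalised. These are the kinds of parameters for which the abstract says the probabilistic interpretation supplies a justification.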


Published in
CIKM '99: Proceedings of the eighth international conference on Information and knowledge management
November 1999
564 pages
ISBN: 1581131461
DOI: 10.1145/319950
Editor: Susan Gauch
Copyright © 1999 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
