DOI: 10.1145/319950.320053

A probabilistic description-oriented approach for categorizing web documents

Published: 01 November 1999

ABSTRACT

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. This poses a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers.

Our categorisation approach is based on a probabilistic description-oriented representation of web documents, and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier.

Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) a theoretical interpretation of the k-nearest neighbour classifier improves on the standard k-nearest neighbour classifier.
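
For readers unfamiliar with the k-nearest neighbour classifier mentioned above, the following is a minimal sketch of a generic similarity-weighted variant whose per-category scores fall in [0, 1] and can be read as rough probability estimates. It is not the formulation from the paper, which the abstract does not spell out. The cosine similarity, the sparse term-weight dictionaries, the normalisation by total neighbour similarity, and the names `cosine` and `knn_category_scores` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_category_scores(query, training, k=3):
    """Score each category for `query` using its k most similar documents.

    `training` is a list of (term_weights, set_of_categories) pairs.
    Each neighbour votes for its categories with a weight equal to its
    share of the total similarity among the k neighbours (an assumed
    normalisation, chosen so that scores lie in [0, 1]).
    """
    ranked = sorted(
        ((cosine(query, vec), cats) for vec, cats in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in ranked)
    if total == 0.0:
        return {}  # no neighbour shares any term with the query
    scores = defaultdict(float)
    for sim, cats in ranked:
        for category in cats:
            scores[category] += sim / total
    return dict(scores)


# Hypothetical toy data: three documents with hand-made term weights.
training = [
    ({"a": 1.0}, {"x"}),
    ({"a": 1.0, "b": 1.0}, {"x", "y"}),
    ({"c": 1.0}, {"z"}),
]
scores = knn_category_scores({"a": 1.0}, training, k=2)
```

In this toy run both neighbours carry category `x`, so it receives the full score, while `y` receives only the share contributed by the second, less similar neighbour. The choice of `k` and of the similarity function are the kind of parameters the abstract says the probabilistic interpretation is meant to justify.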


Published in

CIKM '99: Proceedings of the Eighth International Conference on Information and Knowledge Management
November 1999
564 pages
ISBN: 1581131461
DOI: 10.1145/319950

Copyright © 1999 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 1,861 of 8,427 submissions, 22%
