ABSTRACT
The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. This poses a new challenge, because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers.
Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier.
Experimental results show that (1) an enhanced representation of web documents is crucial for their effective categorisation, and (2) a theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
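As an illustration of what a probabilistic reading of the k-nearest neighbour classifier can look like, the following minimal sketch scores each category by the similarity-weighted share of the k nearest training documents that carry it. It is a generic textbook-style formulation written under our own assumptions, not the formulation developed in the paper: the function names `cosine` and `knn_category_scores`, the sparse term-weight dictionaries, the cosine similarity measure, and the normalisation by the sum of similarities are all hypothetical choices made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_category_scores(query, training, k=3):
    """Score categories for `query` from its k most similar training documents.

    `training` is a list of (term_weights, categories) pairs. Each neighbour
    contributes its similarity, divided by the total similarity of all k
    neighbours, to every category it is labelled with. The resulting scores
    lie in [0, 1] and can be read as rough estimates of P(category | query).
    This normalisation is an assumption of the sketch.
    """
    ranked = sorted(
        ((cosine(query, vec), cats) for vec, cats in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = ranked[:k]
    total = sum(sim for sim, _ in neighbours)
    scores = {}
    if total == 0.0:
        return scores  # no neighbour shares a term with the query
    for sim, cats in neighbours:
        for cat in cats:
            scores[cat] = scores.get(cat, 0.0) + sim / total
    return scores


if __name__ == "__main__":
    # Toy training set with made-up term weights and category labels.
    training = [
        ({"football": 1.0, "goal": 1.0}, {"sport"}),
        ({"football": 1.0, "league": 1.0}, {"sport"}),
        ({"election": 1.0, "vote": 1.0}, {"politics"}),
    ]
    print(knn_category_scores({"football": 1.0, "goal": 1.0}, training, k=2))
```

In this reading, k and the similarity function are the parameters that a probabilistic interpretation would need to justify; a richer document representation would only change how the term-weight dictionaries are built.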
Index Terms
- A probabilistic description-oriented approach for categorizing web documents