skip to main content
10.1145/1148170.1148254acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

Graph-based text classification: learn from your neighbors

Authors Info & Claims
Published:06 August 2006Publication History

ABSTRACT

Automatic classification of data items, based on training samples, can be boosted by considering the neighborhood of data items in a graph structure (e.g., neighboring documents in a hyperlink environment or co-authors and their publications for bibliographic data entries). This paper presents a new method for graph-based classification, with particular emphasis on hyperlinked text documents but broader applicability. Our approach is based on iterative relaxation labeling and can be combined with either Bayesian or SVM classifiers on the feature spaces of the given data items. The graph neighborhood is taken into consideration to exploit locality patterns while at the same time avoiding overfitting. In contrast to prior work along these lines, our approach employs a number of novel techniques: dynamically inferring the link/class pattern in the graph in the run of the iterative relaxation labeling, judicious pruning of edges from the neighborhood graph based on node dissimilarities and node degrees, weighting the influence of edges based on a distance metric between the classification labels of interest and weighting edges by content similarity measures. Our techniques considerably improve the robustness and accuracy of the classification outcome, as shown in systematic experimental comparisons with previously published methods on three different real-world datasets.

References

  1. S. Chakrabarti. Mining the Web: Discovering Know- ledge from Hypertext Data. Morgan-Kauffman, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Chakrabarti. Breaking through the syntax barrier: Searching with entities and relations. In ECML, 9--16, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In SIGMOD'98, 307--318 Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. C.-C. Chang, C.-J. Lin. Libsvm: a Library for Support Vector Machines. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. W. W. Cohen, S. Sarawagi. Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods. In KDD, 89--98, 2004 Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In NIPS 13, 2001.Google ScholarGoogle Scholar
  7. P. Domingos and M. Richardson. Mining the network value of customers. In KDD, 57--66, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Eppstein. Finding the k shortest paths. In IEEE Symp. on Foundations of Computer Science, 154--165, 1994.Google ScholarGoogle Scholar
  9. R. Feldman and H. Shatkay. Link analysis for bioinformatics: Current state of the art, PSB Tutorial, 2003.Google ScholarGoogle Scholar
  10. L. Getoor. Link mining: a new data mining challenge. SIGKDD Explor. Newsl., 5(1):84--89, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, September 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Kleinberg, E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and markov random fields. In FOCS, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, 282--289, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Z. Li. Markov random field modeling in image analysis. Springer-Verlag New York, Inc., 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Q. Lu, L. Getoor. Link-based classification. ICML, 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. H.-J. Oh, S. H. Myaeng, and M.-H. Lee. A practical hypertext catergorization method using links and incrementally available class information. SIGIR, 264--271, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. L. Pelkowitz. A continuous relaxation labeling algorithm for markov random fields. ACM Press, 20:709--715, 1990.Google ScholarGoogle Scholar
  18. T. Washio and H. Motoda. State of the art of graph - based data mining. SIGKDD Explor. Newsl., 5(1):59--68, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. J. of Machine Learning Research, 5:975--1005, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Graph-based text classification: learn from your neighbors

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
          August 2006
          768 pages
          ISBN:1595933697
          DOI:10.1145/1148170

          Copyright © 2006 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 August 2006

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate792of3,983submissions,20%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader