Published in: International Journal on Digital Libraries 2/2015

01-06-2015

Information-theoretic term weighting schemes for document clustering and classification

Author: Weimao Ke


Abstract

We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB), which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. The two quantities are computed based on terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines the two, consistently outperforms TF*IDF on multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
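The abstract contrasts LIB and LIF with classic TF*IDF but does not reproduce the formulas themselves. For orientation, the TF*IDF baseline can be sketched minimally as below. This is an illustrative sketch only: the function name and toy collection are hypothetical, and production implementations typically add smoothing and document-length normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Classic TF*IDF weighting, the baseline the paper compares against.

    docs: a list of tokenized documents (lists of terms).
    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    df = Counter()        # document frequency: in how many documents each term occurs
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

# Toy collection of three tokenized documents
docs = [["least", "information", "theory"],
        ["information", "retrieval"],
        ["text", "clustering"]]
weights = tf_idf(docs)
```

As expected, a term occurring in only one document (e.g., "least") receives a higher IDF component than one shared across documents (e.g., "information"); the LIB/LIF scheme likewise weights terms by comparing their collection-wide (prior) and within-document (posterior) distributions.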


Footnotes
1
The literature uses a variety of names for KL divergence. While Kullback preferred discrimination information, as in the principle of minimum discrimination information (MDI) [21], it is often called divergence information or relative entropy.
 
2
Inference probabilities are never perfectly independent of one another, given the degrees of freedom involved. To simplify the discussion and formulation, however, we adopt the independence assumption.
 
3
The term opposite does not mean true vs. false information. Opposite information acts to increase vs. to decrease the probability of an inference, e.g., good news vs. bad news about a candidate that may influence the outcome of an election.
 
Literature
3. Amati, G., van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4), 357–389 (2002)
4. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SIAM’07, pp. 1027–1035 (2007)
6. Baierlein, R.: Atoms and Information Theory: An Introduction to Statistical Mechanics. W.H. Freeman and Company, New York (1971)
7. Berry, M.W.: Survey of Text Mining: Clustering, Classification, and Retrieval. Springer, New York (2004)
8. Clinchant, S., Gaussier, E.: Information-based models for Ad Hoc IR. In: SIGIR’11, pp. 234–241 (2011)
9. Cover, T.M., Thomas, J.A.: Entropy, Relative Entropy and Mutual Information. Wiley, New York, pp. 12–49 (1991)
12. Fast, J.D.: Entropy: The Significance of the Concept of Entropy and Its Applications in Science and Technology. McGraw-Hill, New York (1962)
19. Ke, W., Mostafa, J., Fu, Y.: Collaborative classifier agents: studying the impact of learning in distributed document classification. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’07, pp. 428–437. ACM, New York (2007). doi:10.1145/1255175.1255263
23. Lang, K.: Newsweeder: learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning, pp. 331–339 (1995)
26. Liu, T., Liu, S., Cheng, Z., Ma, W.Y.: An evaluation on feature selection for text clustering. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 488–495. AAAI Press, Washington, DC (2003)
27. Lovins, B.: Development of a stemming algorithm. Mech. Transl. Comput. Linguist. 11, 22–31 (1968)
28. MacKay, D.M.: Information, Mechanism and Meaning. The M.I.T. Press, Cambridge (1969)
29. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
30. Rapoport, A.: What is information? ETC Rev. Gen. Semant. 10(4), 5–12 (1953)
31. Robertson, S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 60, 503–520 (2004)
32. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends® Inf. Retr. 3(4), 333–389 (2009). doi:10.1561/1500000019
33. Sandhaus, E.: The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia (2008)
35. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423; 623–656 (1948)
36. Siegler, M., Witbrock, M.: Improving the suitability of imperfect transcriptions for information retrieval from spoken documents. In: ICASSP’99, pp. 505–508. IEEE Press (1999)
38. Spärck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 60, 493–502 (2004)
39. Witten, I.H., Frank, E., Hall, M.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, San Francisco (2011)
Metadata
Title
Information-theoretic term weighting schemes for document clustering and classification
Author
Weimao Ke
Publication date
01-06-2015
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 2/2015
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-014-0121-3
