Published in: International Journal on Digital Libraries 2/2015

01-06-2015

Information-theoretic term weighting schemes for document clustering and classification

Author: Weimao Ke


Abstract

We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB), which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. The two quantities are computed based on terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines the two, consistently outperforms TF*IDF on multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
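The abstract contrasts LIB and LIF with classic TF*IDF but does not reproduce the formulas themselves. For orientation, the TF*IDF baseline can be sketched minimally as below. This is an illustrative sketch only: the function name and toy collection are hypothetical, and production implementations typically add smoothing and document-length normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Classic TF*IDF weighting, the baseline the paper compares against.

    docs: a list of tokenized documents (lists of terms).
    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    df = Counter()        # document frequency: in how many documents each term occurs
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

# Toy collection of three tokenized documents
docs = [["least", "information", "theory"],
        ["information", "retrieval"],
        ["text", "clustering"]]
weights = tf_idf(docs)
```

As expected, a term occurring in only one document (e.g., "least") receives a higher IDF component than one shared across documents (e.g., "information"); the LIB/LIF scheme likewise weights terms by comparing their collection-wide (prior) and within-document (posterior) distributions.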


Footnotes
1
The literature uses a variety of names for KL divergence. While Kullback preferred discrimination information, as in the principle of minimum discrimination information (MDI) [21], it is often called divergence information or relative entropy.
 
2
Inference probabilities are never perfectly independent of one another, given the degrees of freedom involved. To simplify the discussion and formulation, however, we adopt the independence assumption.
 
3
The term opposite does not mean true vs. false information. Opposite information acts to increase vs. to decrease the probability of an inference, e.g., good news vs. bad news about a candidate that may influence the outcome of an election.
 
Literature
3. Amati, G., van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4), 357–389 (2002)
4. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SIAM’07, pp. 1027–1035 (2007)
6. Baierlein, R.: Atoms and Information Theory: An Introduction to Statistical Mechanics. W.H. Freeman and Company, New York (1971)
7. Berry, M.W.: Survey of Text Mining: Clustering, Classification, and Retrieval. Springer, New York (2004)
8. Clinchant, S., Gaussier, E.: Information-based models for Ad Hoc IR. In: SIGIR’11, pp. 234–241 (2011)
9. Cover, T.M., Thomas, J.A.: Entropy, Relative Entropy and Mutual Information. Wiley, New York, pp. 12–49 (1991)
12. Fast, J.D.: Entropy: The Significance of the Concept of Entropy and Its Applications in Science and Technology. McGraw-Hill, New York (1962)
19. Ke, W., Mostafa, J., Fu, Y.: Collaborative classifier agents: studying the impact of learning in distributed document classification. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’07, pp. 428–437. ACM, New York (2007). doi:10.1145/1255175.1255263
23. Lang, K.: Newsweeder: learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning, pp. 331–339 (1995)
26. Liu, T., Liu, S., Cheng, Z., Ma, W.Y.: An evaluation on feature selection for text clustering. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 488–495. AAAI Press, Washington, DC (2003)
27. Lovins, B.: Development of a stemming algorithm. Mech. Transl. Comput. Linguist. 11, 22–31 (1968)
28. MacKay, D.M.: Information, Mechanism and Meaning. The M.I.T. Press, Cambridge (1969)
29. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
30. Rapoport, A.: What is information? ETC Rev. Gen. Semant. 10(4), 5–12 (1953)
31. Robertson, S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 60, 503–520 (2004)
32. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends® Inf. Retr. 3(4), 333–389 (2009). doi:10.1561/1500000019
33. Sandhaus, E.: The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia (2008)
35. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423; 623–656 (1948)
36. Siegler, M., Witbrock, M.: Improving the suitability of imperfect transcriptions for information retrieval from spoken documents. In: ICASSP’99, pp. 505–508. IEEE Press (1999)
38. Spärck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 60, 493–502 (2004)
39. Witten, I.H., Frank, E., Hall, M.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, San Francisco (2011)
Metadata
Title
Information-theoretic term weighting schemes for document clustering and classification
Author
Weimao Ke
Publication date
01-06-2015
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 2/2015
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-014-0121-3
