
2015 | Original Paper | Book Chapter

Beyond tf-idf and Cosine Distance in Documents Dissimilarity Measure

Written by: Sunil Aryal, Kai Ming Ting, Gholamreza Haffari, Takashi Washio

Published in: Information Retrieval Technology

Publisher: Springer International Publishing


Abstract

In the vector space model, different types of term weighting schemes are used to adjust bag-of-words document vectors in order to improve the performance of the most widely used measure, cosine distance. Even though cosine distance with some term weighting schemes results in a more reliable (dis)similarity measure on some data sets, it may not perform well on others because of the underlying assumptions of the term weighting schemes. In this paper, we argue that explicit adjustment of bag-of-words document vectors using term weighting is not required if a data-dependent dissimilarity measure called \(m_p\)-dissimilarity is used. Our empirical results on a document retrieval task reveal that \(m_p\) with the simplest binary bag-of-words representation is either better than or competitive with cosine distance using the best-performing state-of-the-art term weighting scheme on four widely used benchmark document collections.
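For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of cosine distance over tf-idf weighted bag-of-words vectors. It assumes one common tf-idf variant (raw term frequency times log(n/df)); the abstract does not say which weighting schemes the paper evaluates, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (tf * idf)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(n / df); other variants exist.
    idf = {term: math.log(n / count) for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_u * norm_v)

# Toy collection (hypothetical data, for illustration only).
docs = [
    "apple banana apple".split(),
    "apple cherry".split(),
    "banana cherry cherry".split(),
    "durian".split(),
]
vecs = tfidf_vectors(docs)
d01 = cosine_distance(vecs[0], vecs[1])  # documents sharing "apple"
d03 = cosine_distance(vecs[0], vecs[3])  # documents sharing no term
```

Documents that share no term end up at distance 1, while shared terms pull the distance toward 0 in proportion to their tf-idf weights. That dependence on the weighting scheme is what the abstract argues can be avoided.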


Footnotes
1
The parameter p in \(m_p\) plays the same role as in the traditional \(\ell _p\)-norm. The performance of \(m_p\) may change slightly with different values of p on some data sets. Empirically, we observed that \(p=0.1\) is a reasonably good setting.
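This page does not state the formula for \(m_p\). As a rough illustration of where p enters, the sketch below assumes the mass-based form associated with Aryal et al. (ICDM 2014): for each dimension, count the fraction of documents whose value lies between the two compared values, then combine the fractions with a power mean of order p. The formula, the function name, and the toy data are assumptions for illustration and should be checked against the original paper.

```python
def mp_dissimilarity_binary(x, y, data, p=0.1):
    """Assumed form: ((1/d) * sum_i (mass_i / n) ** p) ** (1/p).

    x, y : binary bag-of-words vectors (lists of 0/1) of equal length d
    data : the collection of n such vectors, used to estimate data mass
    p    : power-mean order, same role as in an l_p norm (must be > 0)
    """
    n = len(data)
    d = len(x)
    total = 0.0
    for i in range(d):
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        # Data mass: number of documents whose i-th value lies in [lo, hi].
        mass = sum(1 for doc in data if lo <= doc[i] <= hi)
        total += (mass / n) ** p
    return (total / d) ** (1.0 / p)

# Toy binary collection (hypothetical): term 0 is common, term 1 is rare.
data = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
]
m_default = mp_dissimilarity_binary(data[0], data[1], data)    # p = 0.1
m_linear = mp_dissimilarity_binary(data[0], data[1], data, 1)  # p = 1
```

Under this assumed form, a dimension where the two documents disagree always contributes the full mass n, while agreement on a rare term contributes a small mass. Data dependence therefore enters through the collection itself rather than through an explicit term weight.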
 
Metadata
Title
Beyond tf-idf and Cosine Distance in Documents Dissimilarity Measure
Authors
Sunil Aryal
Kai Ming Ting
Gholamreza Haffari
Takashi Washio
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-28940-3_33