
2016 | OriginalPaper | Book chapter

A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures

Authors: Valentina Franzoni, Alfredo Milani

Published in: Computational Science and Its Applications – ICCSA 2016

Publisher: Springer International Publishing


Abstract

The explosion of the Internet and the massive diffusion of mobile devices have led to the creation of a worldwide collaborative system, used daily by millions of users through search engines and application interfaces. New paradigms make it possible to calculate the similarity of terms using only the statistical information returned by a query, or additional features; established algorithms and measures have also been applied to new domains and scopes, to efficiently find word clusters from the Web. The problem of evaluating such techniques and algorithms in new domains therefore emerges, and highlights a still open field of experimentation.
In this paper, preliminary tests were carried out on different semantic proximity measures (average confidence, NGD, PMI, \(\chi ^{2}\), PMING Distance), and several of the clustering algorithms most used in the literature (e.g. k-means, Expectation-Maximization, spectral clustering) were compared for evaluating such measures. The suitability of the considered measures and methods for calculating semantic proximity was verified with respect to the state of the art, and problems were identified by comparing the results of the measurements with a ground truth provided by models of contextualized knowledge, clustering and human perception of semantic relations, whose data have already been studied in the literature.
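To make concrete what "using only the statistical information returned by a query" can mean, the following minimal sketch computes two of the listed measures, NGD and PMI, from search-engine hit counts. It uses the commonly cited textbook definitions of these measures and is not the authors' implementation; the function names, the parameter `n_pages` (an assumed index size) and the hit counts are hypothetical.

```python
import math


def ngd(f_x, f_y, f_xy, n_pages):
    """Normalized Google Distance from raw hit counts.

    f_x, f_y: hits for each term alone; f_xy: hits for both terms together;
    n_pages: assumed number of pages indexed by the search engine.
    """
    if f_xy == 0:
        return math.inf  # the terms never co-occur
    log_x, log_y = math.log(f_x), math.log(f_y)
    return (max(log_x, log_y) - math.log(f_xy)) / (
        math.log(n_pages) - min(log_x, log_y)
    )


def pmi(f_x, f_y, f_xy, n_pages):
    """Pointwise mutual information in bits, with P estimated as hits / n_pages."""
    if f_xy == 0:
        return -math.inf  # no co-occurrence at all
    p_x, p_y, p_xy = f_x / n_pages, f_y / n_pages, f_xy / n_pages
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical hit counts, not taken from the paper.
print(ngd(100, 1000, 10, 10**6))
print(pmi(1000, 1000, 1000, 10**6))
```

NGD is a distance (smaller means more related), whereas PMI is a similarity (larger means more related), so the two cannot be compared without a normalization step.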


Footnotes
1
The local normalization of a measure \(\eta \) in an evaluation context W consists in computing the maximum \(\eta _{\max }\) of the values \(\eta (w_{i})\) for \(w_{i}\in W\) and replacing each value with \(\eta (w_{i})/\eta _{\max }\).
 
2
The complementation to 1 of the locally normalized PMI consists in replacing each locally normalized value with \((1-\eta (w_{i})/\eta _{\max })\) and, where needed, forcing null values on the diagonal of the adjacency matrix.
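The two steps described in footnotes 1 and 2 can be sketched as follows. This is an illustrative sketch only: it assumes the measure is stored as a square matrix of non-negative values over the terms of W, and the function names and sample values are hypothetical.

```python
def local_normalize(matrix):
    """Divide every value by the maximum value found in the evaluation context."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        # Nothing to scale: every value is already zero.
        return [[0.0 for _ in row] for row in matrix]
    return [[value / peak for value in row] for row in matrix]


def complement_to_one(normalized):
    """Turn normalized similarities into distances, forcing zeros on the diagonal."""
    n = len(normalized)
    return [
        [0.0 if i == j else 1.0 - normalized[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical similarity values for three terms (assumed non-negative).
similarity = [
    [4, 2, 1],
    [2, 4, 0],
    [1, 0, 4],
]
normalized = local_normalize(similarity)
distance = complement_to_one(normalized)
print(distance)
```

Forcing the diagonal to zero makes each term be at distance 0 from itself, even if the raw self-similarity is not the maximum of the context.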
 
3
The Kullback-Leibler divergence, also known as information gain or information divergence [24], measures the difference between two probability distributions P and Q as the expected number of extra bits required to encode samples from P when a code based on Q is used. Typically, P represents the "true" distribution of the observations, while Q represents a theoretical distribution that approximates P.
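As a small illustration of this definition, the sketch below computes the divergence in bits for two discrete distributions given as sequences of probabilities. The function name and the sample distributions are hypothetical and do not come from the paper.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in bits for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("P and Q must be defined over the same outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # outcomes that never occur under P contribute nothing
        if q_i == 0:
            return math.inf  # P allows an outcome that Q rules out
        total += p_i * math.log2(p_i / q_i)
    return total


# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```

The two printed values differ, which shows that the divergence is not symmetric and therefore is not a distance in the metric sense.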
 
4
The Laplacian matrix, or Kirchhoff matrix [28], is used to calculate the number of spanning trees of a connected undirected graph. A spanning tree is a selection of the edges of the graph that reaches every vertex, so that every vertex belongs to the tree, and that contains no cycles. In spectral graph theory, the Laplacian is used to approximate the sparsest cut of the graph through its second smallest eigenvalue.
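The spanning-tree count mentioned here follows from Kirchhoff's matrix-tree theorem: the number of spanning trees equals the determinant of the Laplacian with one row and the matching column removed. The sketch below is a minimal pure-Python illustration, assuming a simple undirected graph with vertices numbered from 0; all function names are hypothetical.

```python
from fractions import Fraction


def laplacian(n, edges):
    """Laplacian L = D - A of a simple undirected graph on vertices 0..n-1."""
    lap = [[0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    return lap


def determinant(matrix):
    """Exact determinant by Gaussian elimination over rational numbers."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def count_spanning_trees(n, edges):
    """Matrix-tree theorem: delete row 0 and column 0, then take the determinant."""
    lap = laplacian(n, edges)
    minor = [row[1:] for row in lap[1:]]
    return int(determinant(minor))


# A triangle has three spanning trees (drop any one of its three edges).
print(count_spanning_trees(3, [(0, 1), (1, 2), (0, 2)]))
```

A disconnected graph yields 0, which matches the requirement that the graph be connected for a spanning tree to exist.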
 
References
1.
Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An On-line Lexical Database (1993)
2.
Budanitsky, A., Hirst, G.: Semantic distance in wordnet: an experimental, application-oriented evaluation of five measures. In: Proceedings of Workshop on WordNet and Other Lexical Resources, p. 641. North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, USA (2001)
3.
Franzoni, V., Milani, A.: PMING distance: a collaborative semantic proximity measure. WI-IAT 2, 442–449 (2012). IEEE/WIC/ACM
5.
Franzoni, V., Milani, A.: Heuristic semantic walk. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 643–656. Springer, Heidelberg (2013)
6.
Leung, C.H.C., Li, Y., Milani, A., Franzoni, V.: Collective evolutionary concept distance based query expansion for effective web document retrieval. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 657–672. Springer, Heidelberg (2013)
7.
Cilibrasi, R., Vitanyi, P.: The google similarity distance. ArXiv.org (2004)
8.
Franzoni, V., Milani, A.: Semantic Context extraction from collaborative networks. In: IEEE International Conference on Computer Supported Cooperative Work in Design CSCWD, 2015, Italy (2015)
9.
Franzoni, V., Milani, A.: Context extraction by multi-path traces in semantic networks, CEUR-WS. In: Proceedings of RR2015 Doctoral Consortium, Berlin (2015)
10.
Franzoni, V., Leung, C.H.C., Li, Y., Mengoni, P., Milani, A.: Set similarity measures for images based on collective knowledge. In: Gervasi, O., Murgante, B., Misra, S., Gavrilova, M.L., Rocha, A.M.A.C., Torre, C., Taniar, D., Apduhan, B.O. (eds.) ICCSA 2015. LNCS, vol. 9155, pp. 408–417. Springer, Heidelberg (2015)
11.
Chiancone, A., Niyogi, R., et al.: Improving link ranking quality by quasi-common neighbourhood. In: 2015 IEEE CPS International Conference on Computational Science and Its Applications (2015)
12.
Chiancone, A., Madotto, A., et al.: Multistrain bacterial model for link prediction. In: 2015 Proceedings of 11th International Conference on Natural Computation IEEE ICNC. CFP15CNC-CDR, Observation of strains. Infect Dis Ther. 3(1), 35–43 (2011). ISBN: 978-1-4673-7678-5
18.
Franzoni, V., Milani, A., Pallottelli, S.: Multi-path traces in semantic graphs for latent knowledge elicitation. In: 2015 IEEE ICNC Proceedings of 11th International Conference on Natural Computation, CFP15CNC-CDR (2015). ISBN: 978-1-4673-7678-5
19.
Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography. In: Proceedings of ACL, vol. 27 (1989)
20.
Turney, P.D.: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Flach, P.A., Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, p. 491. Springer, Heidelberg (2001)
21.
Newman, M.E.J.: Fast algorithm for detecting community structure in networks. University of Michigan, MI (2003)
22.
Matsuo, Y., Sakaki, T., Uchiyama, K., Ishizuka, M.: Graph-based word clustering using a web search engine. University of Tokyo (2006)
23.
Dellaert, F.: The Expectation-Maximization Algorithm. Elsevier, New York (2002)
25.
Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. J. Roy. Stat. Soc. 28, 100–108 (1979)
26.
Joyce, J.M.: Kullback-Leibler Divergence: International Encyclopedia of Statistical Science. Springer, Heidelberg (2011)
27.
Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD (2004)
28.
Pozrikidis, C.: Node degree distribution in spanning trees. J. Phys. A: Math. Theoret. 49(12) (2016)
30.
Wu, L., Hua, X.S., Yu, N., Ma, W.Y., Li, S.: Flickr Distance. In: Proceedings of Microsoft Research Asia (2008)
31.
Manning, D., Schutze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, London (2002)
32.
Franzoni, V., Poggioni, V., Zollo, F.: Automated book classification according to the emotional tags of the social network Zazie. In: CEUR-WS on ESSEM, AI*IA, vol. 1096, pp. 83–94 (2013)
33.
Franzoni, V., Leung, C.H.C., Li, Y., Milani, A., Pallottelli, S.: Context-based image semantic similarity. In: 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, pp. 1280–1284 (2015). doi:10.1109/FSKD.2015.7382127
34.
Chiancone, A., Franzoni, V., Li, Y., Markov, K., Milani, A.: Leveraging zero tail in neighbourhood based link prediction. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3, pp. 135–139 (2015). doi:10.1109/WI-IAT.2015.129
35.
Franzoni, V., Milani, A.: A Pheromone-like model for semantic context extraction from collaborative networks. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Singapore, pp. 540–547 (2015). doi:10.1109/WI-IAT.2015.21
36.
Di Iorio, A., Schaerf, M.: Identification semantics for an organization establishing a digital library system. In: Risse, T., Predoiu, L., Nürnberger, A., Ross, S. (eds.) Proceedings of the 4th International Workshop on Semantic Digital Archives (SDA 2014) Co-located with the International Digital Libraries Conference (DL 2014), vol. 1306, London, UK, September 12, 2014, pp. 16–27. CEUR-WS.org (2014). http://ceur-ws.org/Vol-1306
Metadata
Title
A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures
Authors
Valentina Franzoni
Alfredo Milani
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-42092-9_34
