
2016 | OriginalPaper | Chapter

A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures

Authors: Valentina Franzoni, Alfredo Milani

Published in: Computational Science and Its Applications – ICCSA 2016

Publisher: Springer International Publishing


Abstract

The explosion of the Internet and the massive diffusion of mobile devices have led to a worldwide collaborative system, used daily by millions of users through search engines and application interfaces. New paradigms make it possible to calculate the similarity of terms using only the statistical information returned by a query, or additional features; in addition, established algorithms and measures have been applied to new domains and scopes to find word clusters on the Web efficiently. This raises the problem of evaluating such techniques and algorithms in new domains, which remains an open field of experimentation.
In this paper, preliminary tests are carried out on several semantic proximity measures (average confidence, NGD, PMI, \(\chi ^{2}\), PMING Distance), and some of the clustering algorithms most used in the literature (e.g. k-means, Expectation-Maximization, spectral clustering) are compared for evaluating such measures. The suitability of the considered measures and methods for calculating semantic proximity was verified against the state of the art, and problems were identified by comparing the measurement results with a ground truth provided by models of contextualized knowledge, clustering and human perception of semantic relations, whose data have already been studied in the literature.
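
The abstract names the measures without defining them. As a rough illustration only, the sketch below computes two of them, PMI and NGD, from search-engine page counts using their commonly cited definitions. The function names, the argument names (`f_x`, `f_y`, `f_xy`, `n`) and all counts are assumptions made for this example and are not taken from the chapter.

```python
import math


def pmi(f_x, f_y, f_xy, n):
    """Pointwise mutual information (base 2) estimated from page counts.

    f_x, f_y: pages containing each term; f_xy: pages containing both;
    n: assumed total number of indexed pages.
    """
    p_x, p_y, p_xy = f_x / n, f_y / n, f_xy / n
    return math.log2(p_xy / (p_x * p_y))


def ngd(f_x, f_y, f_xy, n):
    """Normalized Google Distance estimated from the same page counts."""
    lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical counts for two terms (not real search results).
counts = {"f_x": 1000, "f_y": 100, "f_xy": 10, "n": 100000}
print("PMI:", pmi(**counts))
print("NGD:", ngd(**counts))
```

PMI grows with association while NGD shrinks, which is one reason the two cannot be compared or clustered on directly without a normalization step.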


Footnotes
1
The local normalization of a measure \(\eta \) in an evaluation context W consists in computing the maximum \(\eta _{\max }\) of the values \(\eta (w_{i})\) for \(w_{i}\in W\) and replacing each value with \(\eta (w_{i})/\eta _{\max }\).
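
A minimal sketch of this local normalization, assuming the measure values for a context are held in a plain dictionary. The helper name `local_normalize` and the sample terms and values are hypothetical.

```python
def local_normalize(values):
    """Divide every measure value by the maximum found in the context."""
    peak = max(values.values())
    if peak == 0:
        raise ValueError("cannot normalize: the maximum value is zero")
    return {term: value / peak for term, value in values.items()}


# Hypothetical measure values for three terms of an evaluation context W.
eta = {"cat": 2.0, "dog": 4.0, "car": 1.0}
normalized = local_normalize(eta)
print(normalized)
```

After this step the largest value in the context is exactly 1, so values from different contexts share a common upper bound.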
 
2
The complementation to 1 of the locally normalized PMI consists in replacing each locally normalized value with \((1-\eta (w_{i})/\eta _{\max })\), where \(\eta _{\max }\) is the maximum value in the evaluation context, and, if necessary, setting the diagonal of the adjacency matrix to zero.
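
The complementation can be sketched as follows for a small symmetric matrix of proximity values. Taking the maximum over the off-diagonal entries only is an assumption of this example, as are the function name and the sample matrix.

```python
def complement_to_distance(sim):
    """Map a square proximity matrix to 1 - value/max, with a zero diagonal."""
    n = len(sim)
    # Assumption: the normalizing maximum ignores the diagonal entries.
    peak = max(sim[i][j] for i in range(n) for j in range(n) if i != j)
    if peak == 0:
        raise ValueError("cannot normalize: the maximum value is zero")
    return [
        [0.0 if i == j else 1.0 - sim[i][j] / peak for j in range(n)]
        for i in range(n)
    ]


# Hypothetical proximity values between three terms.
proximity = [
    [0.0, 4.0, 2.0],
    [4.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
distance = complement_to_distance(proximity)
for row in distance:
    print(row)
```

The result behaves like a dissimilarity matrix: the most strongly associated pair gets 0 and weaker pairs approach 1.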
 
3
The Kullback-Leibler divergence, also known as information gain or information divergence [24], measures the difference between two probability distributions P and Q as the expected number of extra bits required to encode samples from P when a code based on Q is used (P represents the "true" distribution of the observations, and Q a theoretical distribution that approximates P).
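
For discrete distributions this quantity is \(\sum _{i} P(i)\log _{2}(P(i)/Q(i))\). A minimal sketch, assuming both distributions are given as lists of probabilities over the same outcomes; the function name and the sample distributions are illustrative only.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("P and Q must cover the same outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # terms with P(i) = 0 contribute nothing
        if q_i == 0:
            return math.inf  # P puts mass where Q puts none
        total += p_i * math.log2(p_i / q_i)
    return total


# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]
print("D(P||Q):", kl_divergence(p, q))
print("D(Q||P):", kl_divergence(q, p))
```

The two printed values differ, which shows that the divergence is not symmetric and therefore not a distance in the metric sense.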
 
4
The Laplacian matrix, or Kirchhoff matrix [28], is used to calculate the number of spanning trees of a graph, i.e. the number of trees that can be built from an undirected connected graph using all of its vertices and some (or all) of its edges; in other words, the number of edge selections that reach every vertex without forming cycles. In spectral graph theory, the second-smallest eigenvalue of the Laplacian approximates the sparsest cut of the graph.
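
The spanning-tree count mentioned here follows from Kirchhoff's matrix-tree theorem: it equals the determinant of the Laplacian with one row and the matching column removed. The sketch below is a small pure-Python illustration under that theorem; the function names and the example graphs are assumptions of this example, and it does not cover the eigenvalue computation used in spectral clustering.

```python
from fractions import Fraction


def laplacian(n, edges):
    """Laplacian (degree minus adjacency) of an undirected simple graph."""
    lap = [[0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    return lap


def determinant(matrix):
    """Exact determinant by Gaussian elimination over fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def spanning_tree_count(n, edges):
    """Number of spanning trees: determinant of the Laplacian minus row and column 0."""
    lap = laplacian(n, edges)
    minor = [row[1:] for row in lap[1:]]
    return int(determinant(minor))


# Hypothetical example graphs: a triangle and the complete graph on 4 vertices.
triangle = [(0, 1), (1, 2), (0, 2)]
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(spanning_tree_count(3, triangle))
print(spanning_tree_count(4, k4))
```

A disconnected graph yields a count of zero, which matches the requirement in the footnote that the graph be connected.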
 
Literature
1.
Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An On-line Lexical Database (1993)
2.
Budanitsky, A., Hirst, G.: Semantic distance in wordnet: an experimental, application-oriented evaluation of five measures. In: Proceedings of Workshop on WordNet and Other Lexical Resources, p. 641. North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, USA (2001)
3.
Franzoni, V., Milani, A.: PMING distance: a collaborative semantic proximity measure. WI-IAT 2, 442–449 (2012). IEEE/WIC/ACM
5.
Franzoni, V., Milani, A.: Heuristic semantic walk. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 643–656. Springer, Heidelberg (2013)
6.
Leung, C.H.C., Li, Y., Milani, A., Franzoni, V.: Collective evolutionary concept distance based query expansion for effective web document retrieval. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part IV. LNCS, vol. 7974, pp. 657–672. Springer, Heidelberg (2013)
7.
Cilibrasi, R., Vitanyi, P.: The google similarity distance. ArXiv.org (2004)
8.
Franzoni, V., Milani, A.: Semantic Context extraction from collaborative networks. In: IEEE International Conference on Computer Supported Cooperative Work in Design CSCWD, 2015, Italy (2015)
9.
Franzoni, V., Milani, A.: Context extraction by multi-path traces in semantic networks, CEUR-WS. In: Proceedings of RR2015 Doctoral Consortium, Berlin (2015)
10.
Franzoni, V., Leung, C.H.C., Li, Y., Mengoni, P., Milani, A.: Set similarity measures for images based on collective knowledge. In: Gervasi, O., Murgante, B., Misra, S., Gavrilova, M.L., Rocha, A.M.A.C., Torre, C., Taniar, D., Apduhan, B.O. (eds.) ICCSA 2015. LNCS, vol. 9155, pp. 408–417. Springer, Heidelberg (2015)
11.
Chiancone, A., Niyogi, R., et al.: Improving link ranking quality by quasi-common neighbourhood. In: 2015 IEEE CPS International Conference on Computational Science and Its Applications (2015)
12.
Chiancone, A., Madotto, A., et al.: Multistrain bacterial model for link prediction. In: 2015 Proceedings of 11th International Conference on Natural Computation IEEE ICNC. CFP15CNC-CDR, Observation of strains. Infect Dis Ther. 3(1), 35–43 (2011). ISBN: 978-1-4673-7678-5
18.
Franzoni, V., Milani, A., Pallottelli, S.: Multi-path traces in semantic graphs for latent knowledge elicitation. In: 2015 IEEE ICNC Proceedings of 11th International Conference on Natural Computation, CFP15CNC-CDR (2015). ISBN: 978-1-4673-7678-5
19.
Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography. In: Proceedings of ACL, vol. 27 (1989)
20.
Turney, P.D.: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Flach, P.A., Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, p. 491. Springer, Heidelberg (2001)
21.
Newman, M.E.J.: Fast algorithm for detecting community structure in networks. University of Michigan, MI (2003)
22.
Matsuo, Y., Sakaki, T., Uchiyama, K., Ishizuka, M.: Graph-based word clustering using a web search engine. University of Tokyo (2006)
23.
Dellaert, F.: The Expectation-Maximization Algorithm. Elsevier, New York (2002)
25.
Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. J. Roy. Stat. Soc. 28, 100–108 (1979)
26.
Joyce, J.M.: Kullback-Leibler Divergence: International Encyclopedia of Statistical Science. Springer, Heidelberg (2011)
27.
Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD (2004)
28.
Pozrikidis, C.: Node degree distribution in spanning trees. J. Phys. A: Math. Theoret. 49(12) (2016)
30.
Wu, L., Hua, X.S., Yu, N., Ma, W.Y., Li, S.: Flickr Distance. In: Proceedings of Microsoft Research Asia (2008)
31.
Manning, D., Schutze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, London (2002)
32.
Franzoni, V., Poggioni, V., Zollo, F.: Automated book classification according to the emotional tags of the social network Zazie. In: CEUR-WS on ESSEM, AI*IA, vol. 1096, pp. 83–94 (2013)
33.
Franzoni, V., Leung, C.H.C., Li, Y., Milani, A., Pallottelli, S.: Context-based image semantic similarity. In: 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, pp. 1280–1284 (2015). doi:10.1109/FSKD.2015.7382127
34.
Chiancone, A., Franzoni, V., Li, Y., Markov, K., Milani, A.: Leveraging zero tail in neighbourhood based link prediction. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3, pp. 135–139 (2015). doi:10.1109/WI-IAT.2015.129
35.
Franzoni, V., Milani, A.: A Pheromone-like model for semantic context extraction from collaborative networks. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Singapore, pp. 540–547 (2015). doi:10.1109/WI-IAT.2015.21
36.
Di Iorio, A., Schaerf, M.: Identification semantics for an organization establishing a digital library system. In: Risse, T., Predoiu, L., Nürnberger, A., Ross, S. (eds.) Proceedings of the 4th International Workshop on Semantic Digital Archives (SDA 2014) Co-located with the International Digital Libraries Conference (DL 2014), vol. 1306, London, UK, September 12, 2014, pp. 16–27. CEUR-WS.org (2014). http://ceur-ws.org/Vol-1306
Metadata
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-42092-9_34
