ABSTRACT
Most existing similarity measures assume that different words are semantically independent, which is often not true in practice, and they offer no convenient way to exploit the semantic relatedness between words. We propose a novel similarity measure based on the earth mover's distance (EMD). In the proposed measure, the semantic distances between words are computed from the electronic lexical database WordNet, and the EMD is then used to calculate document similarity through a many-to-many matching between words. Experimental results demonstrate the effectiveness of the proposed similarity measure.
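The following is a minimal sketch of how an EMD between two documents could be computed once word-to-word distances are available. It is not the authors' implementation: the function name `emd`, the term weights, and the ground-distance matrix are all hypothetical, and the distances are hard-coded placeholders rather than values derived from WordNet. The transportation problem follows the standard EMD formulation and is solved with SciPy's general-purpose `linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def emd(weights_a, weights_b, ground_dist):
    """Earth mover's distance between two weighted word sets.

    weights_a[i] is the weight of word i in document A, weights_b[j] the
    weight of word j in document B, and ground_dist[i][j] the semantic
    distance between them (hypothetical values in this sketch).
    """
    a = np.asarray(weights_a, dtype=float)
    b = np.asarray(weights_b, dtype=float)
    d = np.asarray(ground_dist, dtype=float)
    m, n = d.shape

    # Flow variables f[i, j] are flattened row-major into a vector of size m*n.
    a_ub = np.zeros((m + n, m * n))
    for i in range(m):
        a_ub[i, i * n:(i + 1) * n] = 1.0   # word i in A ships at most a[i]
    for j in range(n):
        a_ub[m + j, j::n] = 1.0            # word j in B receives at most b[j]
    b_ub = np.concatenate([a, b])

    # The total flow must equal the smaller of the two total weights.
    total = min(a.sum(), b.sum())
    a_eq = np.ones((1, m * n))
    b_eq = [total]

    # linprog's default variable bounds are (0, None), so flows are non-negative.
    res = linprog(d.ravel(), A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq)
    if not res.success:
        raise RuntimeError(res.message)
    # Normalize the transport cost by the total flow.
    return res.fun / total


# Hypothetical documents: A = {car, road}, B = {automobile, street}.
doc_a_weights = [0.5, 0.5]
doc_b_weights = [0.5, 0.5]
# Placeholder semantic distances (rows: car, road; columns: automobile, street).
distances = [[0.0, 0.8],
             [0.8, 0.1]]

print(emd(doc_a_weights, doc_b_weights, distances))
```

With these placeholder values the cheapest flow pairs car with automobile and road with street, so the result should be 0.5 × 0.0 + 0.5 × 0.1 = 0.05. Because the flow is many-to-many, a single word in one document can also split its weight across several related words in the other.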
Index Terms
- The earth mover's distance as a semantic measure for document similarity