Discover Computing

Published: 30-11-2017

Clustering small-sized collections of short texts

Authors: Lili Kotlerman, Ido Dagan, Oren Kurland

Published in: Discover Computing | Issue 4/2018

Abstract

The need to cluster small text corpora composed of a few hundred short texts arises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging because of the vocabulary mismatch between short texts and because the small corpus size yields insufficient corpus-based statistics (e.g., term co-occurrence statistics). We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach compared with applying clustering algorithms directly to the text corpus and with using state-of-the-art co-clustering and topic modeling methods.
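The projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that term clusters are already given as sets of terms and that a text is assigned to the term cluster(s) with which it shares the most terms; the function name, the toy clusters, and the overlap-count rule are illustrative assumptions, not the projection methods evaluated in the paper.

```python
from collections import Counter


def project_texts(texts, term_clusters):
    """Assign each tokenized text to the term cluster(s) it overlaps most.

    Illustrative assumption: overlap is the number of distinct shared terms.
    """
    # Invert the term clusters: term -> ids of the clusters containing it.
    term_to_clusters = {}
    for cluster_id, terms in term_clusters.items():
        for term in terms:
            term_to_clusters.setdefault(term, set()).add(cluster_id)

    assignments = []
    for tokens in texts:
        votes = Counter()
        for token in set(tokens):
            for cluster_id in term_to_clusters.get(token, ()):
                votes[cluster_id] += 1
        if not votes:
            # No term of this text appears in any term cluster.
            assignments.append(set())
            continue
        best = max(votes.values())
        # Ties are kept, so a text may belong to several clusters.
        assignments.append({c for c, v in votes.items() if v == best})
    return assignments


# Hypothetical toy data, not taken from the paper's datasets.
term_clusters = {
    "delays": {"late", "delay", "delayed", "wait"},
    "price": {"price", "expensive", "fare", "cost"},
}
texts = [
    ["train", "was", "late", "again"],
    ["the", "fare", "is", "too", "expensive"],
    ["long", "wait", "and", "high", "cost"],
    ["nice", "staff"],
]
assignments = project_texts(texts, term_clusters)
```

In this sketch the third text ties between both clusters and is assigned to both, while the fourth text shares no term with any cluster and stays unassigned.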

We note that, in our setting, our approach of associating texts with term clusters is shown below to outperform a method commonly used in co-clustering algorithms.

Implementation available at https://marketplace.gephi.org/plugin/chinese-whispers-clustering/

The datasets are available at http://u.cs.biu.ac.il/~davidol/.

To account for multi-cluster assignment, we used the adaptation suggested by Rosenberg and Binkowski (2004), according to which each text has partial membership in each of its multiple clusters.
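One possible reading of partial membership is sketched here, assuming that a text assigned to k clusters contributes an equal weight of 1/k to each of them. The equal split, the function name, and the toy labels are assumptions made for illustration; the exact weighting used in the evaluation follows Rosenberg and Binkowski (2004) and is not spelled out in this note.

```python
from collections import defaultdict


def fractional_contingency(gold, system):
    """Build a gold-class by system-cluster table with fractional counts.

    Assumption: a text with several labels splits its unit mass equally
    among all (gold label, system label) pairs it participates in.
    """
    table = defaultdict(float)
    for gold_labels, system_labels in zip(gold, system):
        if not gold_labels or not system_labels:
            continue  # unlabeled or unassigned texts contribute nothing
        weight = 1.0 / (len(gold_labels) * len(system_labels))
        for g in gold_labels:
            for s in system_labels:
                table[(g, s)] += weight
    return dict(table)


# Hypothetical toy example: the second text belongs to two gold classes
# and was placed in two system clusters.
gold = [{"A"}, {"A", "B"}, {"B"}]
system = [{1}, {1, 2}, {2}]
table = fractional_contingency(gold, system)
```

Because every text still contributes a total mass of one, the table sums to the number of texts, so measures computed from contingency counts can be applied to it unchanged.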

We publish our list of stopwords along with the datasets.

Available for download from https://github.com/hltfbk/EOP-1.2.0/wiki/English-Knowledge-Resources.

We used the following tools: LingPipe (http://alias-i.com/lingpipe/index.html) for Complete Link; the R implementation of the PAM algorithm (https://cran.r-project.org/web/packages/cluster/cluster.pdf); http://www.cs.utexas.edu/users/dml/Software/cocluster.html for co-clustering; and MALLET (http://mallet.cs.umass.edu) for LDA. For the KMY baseline, we applied the algorithm in Fig. 6 to text clustering with \(\theta = 0.7\), as suggested in the original report (Ye and Young 2006). We also experimented with additional thresholds (0 through 1 in steps of 0.1) and found 0.7 to be one of the best for our setting. The algorithm is not very sensitive to threshold values around 0.7, although thresholds much higher or lower within [0, 1] degrade performance.

We used the word2vec software accompanying Mikolov et al. (2013) with a context size of 5, the negative-sampling approach with 15 negative samples (NEG-15), and sub-sampling of frequent words with a parameter of \(10^{-5}\). The parameter settings follow Mikolov et al. (2013).
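For readers who want to reproduce these settings, the following sketch assembles a command line for the original word2vec tool. The binary location and both file paths are hypothetical placeholders, and options not stated in this note (for example, vector size and model architecture) are deliberately left at the tool's defaults rather than guessed.

```python
def word2vec_args(corpus_path, vectors_path, binary="./word2vec"):
    """Return an argument list for the word2vec command-line tool.

    Only the three settings stated in the note are set explicitly;
    `binary`, `corpus_path`, and `vectors_path` are placeholders.
    """
    return [
        binary,
        "-train", corpus_path,    # plain-text training corpus
        "-output", vectors_path,  # file to write the learned vectors to
        "-window", "5",           # context size of 5
        "-negative", "15",        # 15 negative samples (NEG-15)
        "-sample", "1e-5",        # sub-sampling threshold for frequent words
    ]


args = word2vec_args("corpus.txt", "vectors.txt")
print(" ".join(args))
```

The list can be passed to `subprocess.run` once the tool has been compiled locally.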

We also experimented with K randomly selected initial medoids, but using each text as an initial medoid gave better results. Since our text collections are small, this is not computationally expensive.

Aggarwal, C. C., & Zhai, C. (2012). A survey of text clustering algorithms. In Mining text data (pp. 77–128). Springer.

Allan, J., Papka, R., & Lavrenko, V. (1998). On-line new event detection and tracking. In Proceedings of SIGIR (pp. 37–45).

Aslam, J. A., Ekstrand-Abueg, M., Pavlu, V., Diaz, F., McCreadie, R., & Sakai, T. (2014). TREC 2014 temporal summarization track overview. In Proceedings of TREC.

Baker, L. D., & McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 96–103). ACM.

Barak, L., Dagan, I., & Shnarch, E. (2009). Text categorization from category name via lexical reference. In Proceedings of human language technologies: the 2009 annual conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short ’09 (pp. 33–36), Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1620853.1620864

Berger, A. L., & Lafferty, J. D. (1999). Information retrieval as statistical translation. In Proceedings of SIGIR (pp. 222–229).

Biemann, C. (2006). Chinese whispers: An efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the first workshop on graph based methods for natural language processing (pp. 73–80). Association for Computational Linguistics.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Boros, E., Kantor, P. B., & Neu, D. J. (2001). A clustering based approach to creating multi-document summaries. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval.

De Boom, C., Van Canneyt, S., Demeester, T., & Dhoedt, B. (2016). Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, 80, 150–156.

Denkowski, M., & Lavie, A. (2010). Meteor-next and the meteor paraphrase tables: Improved evaluation support for five target languages. In Proceedings of the joint fifth workshop on statistical machine translation and MetricsMATR (pp. 339–342). Association for Computational Linguistics.

Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 89–98). ACM.

Di Marco, A., & Navigli, R. (2013). Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics, 39(3), 709–754.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1923.

Erkan, G., & Radev, D. R. (2004). LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22, 457–479.

Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.

Ferraresi, A., Zanchetta, E., Baroni, M., & Bernardini, S. (2008). Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th web as corpus workshop (WAC-4): Can we beat Google? (pp. 47–54).

Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. Proceedings of AAAI, 6, 1301–1306.

Ganitkevitch, J., Van Durme, B., & Callison-Burch, C. (2013). PPDB: The paraphrase database. In HLT-NAACL (pp. 758–764).

Glickman, O., Shnarch, E., & Dagan, I. (2006). Lexical reference: A semantic matching subtask. In Proceedings of the 2006 conference on empirical methods in natural language processing, EMNLP ’06 (pp. 172–179). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1610075.1610103

Green, S. J. (1999). Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering, 11(5), 713–730.

Habash, N., & Dorr, B. (2003). Catvar: A database of categorial variations for English. In Proceedings of the MT summit (pp. 471–474).

Hearst, M. A., Karger, D. R., & Pedersen, J. O. (1995). Scatter/gather as a tool for the navigation of retrieval results. In Working Notes of the 1995 AAAI fall symposium on AI applications in knowledge navigation and retrieval.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of SIGIR (pp. 50–57).

Hotho, A., Staab, S., & Stumme, G. (2003). Ontologies improve text document clustering. In Third IEEE international conference on data mining, 2003. ICDM 2003 (pp. 541–544). IEEE.

Hu, J., Fang, L., Cao, Y., Zeng, H. J., Li, H., et al. (2008). Enhancing text clustering by leveraging Wikipedia semantics. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 179–186). ACM.

Hu, X., Zhang, X., Lu, C., Park, E. K., & Zhou, X. (2009). Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 389–396). ACM.

Karimzadehgan, M., & Zhai, C. (2010). Estimation of statistical translation models based on mutual information for ad hoc information retrieval. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval (pp. 323–330). ACM.

Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: an introduction to cluster analysis. Applied probability and statistics section (EUA): Wiley series in probability and mathematical statistics.

Kenter, T., & De Rijke, M. (2015). Short text similarity with word embeddings. In Proceedings of the 24th ACM international on conference on information and knowledge management (pp. 1411–1420). ACM.

Kotlerman, L., Dagan, I., Gorodetsky, M., & Daya, E. (2012a). Sentence clustering via projection over term clusters. In *SEM 2012: The first joint conference on lexical and computational semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the sixth international workshop on semantic evaluation (SemEval 2012) (pp. 38–43). Association for Computational Linguistics, Montréal, Canada. http://www.aclweb.org/anthology/S12-1005

Kotlerman, L., Dagan, I., Magnini, B., & Bentivogli, L. (2015b). Textual entailment graphs. Natural Language Engineering, 21(5), 699–724.

Kotlerman, L., Dagan, I., Szpektor, I., & Zhitomirsky-Geffet, M. (2010). Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4), 359–389. http://dblp.uni-trier.de/db/journals/nle/nle16.html#KotlermanDSZ10

Kurland, O. (2009). Re-ranking search results using language models of query-specific clusters. Information Retrieval, 12(4), 437–460.

Levy, O., & Goldberg, Y. (2014). Dependency-based word embeddings. In Proceedings of the 52nd annual meeting of the association for computational linguistics (Vol. 2, pp. 302–308).

Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of ICML (pp. 577–584).

Liebeskind, C., Kotlerman, L., & Dagan, I. (2015). Text categorization from category name in an industry-motivated scenario. Language Resources and Evaluation (pp. 1–35).

Lin, D. (1998). Automatic retrieval and clustering of similar words. In COLING-ACL (pp. 768–774).

Liu, X., & Croft, W. B. (2004). Cluster-based retrieval using language models. In Proceedings of SIGIR (pp. 186–193).

Metzler, D., Dumais, S., & Meek, C. (2007). Similarity measures for short segments of text. Berlin: Springer.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

Naughton, M., Kushmerick, N., & Carthy, J. (2006). Clustering sentences for discovering events in news articles. In Advances in information retrieval (pp. 535–538). Springer.

Nomoto, T., & Matsumoto, Y. (2001). A new approach to unsupervised text summarization. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 26–34). ACM.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. EMNLP, 14, 1532–1543.

Phan, X. H., Nguyen, L. M., & Horiguchi, S. (2008). Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on world wide web (pp. 91–100). ACM.

Raiber, F., Kurland, O., Radlinski, F., & Shokouhi, M. (2015). Learning asymmetric co-relevance. In Proceedings of ICTIR (pp. 281–290).

Rose, T., Stevenson, M., & Whitehead, M. (2002). The Reuters corpus volume 1: From yesterday’s news to tomorrow’s language resources. LREC, 2, 827–832.

Rosenberg, A., & Binkowski, E. (2004). Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In Proceedings of HLT-NAACL 2004: short papers (pp. 77–80). Association for Computational Linguistics.

Sahami, M., & Heilman, T. D. (2006). A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of WWW (pp. 377–386).

Sedding, J., & Kazakov, D. (2004). WordNet-based text document clustering. In Proceedings of the 3rd workshop on robust methods in analysis of natural language data (pp. 104–113). Association for Computational Linguistics.

Severyn, A., & Moschitti, A. (2015). Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 373–382). ACM.

Shehata, S. (2009). A WordNet-based semantic model for enhancing text clustering. In IEEE international conference on data mining workshops, 2009. ICDMW’09 (pp. 477–482). IEEE.

Shnarch, E., Barak, L., & Dagan, I. (2009). Extracting lexical reference rules from Wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP: Volume 1, ACL ’09 (pp. 450–458). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1687878.1687942

Steinbach, M., Karypis, G., Kumar, V., et al. (2000). A comparison of document clustering techniques. In KDD workshop on text mining (Vol. 400, pp. 525–526). Boston.

Tan, B., Velivelli, A., Fang, H., & Zhai, C. (2007). Term feedback for information retrieval with language models. In SIGIR 2007: proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 263–270), Amsterdam, The Netherlands, July 23–27, 2007. https://doi.org/10.1145/1277741.1277788

Tsur, O., Littman, A., & Rappoport, A. (2013). Efficient clustering of short messages into general domains. In Proceedings of ICWSM.

Udupa, R., Bhole, A., & Bhattacharyya, P. (2009). “A term is known by the company it keeps”: On selecting a good expansion set in pseudo-relevance feedback. In Proceedings of second international conference on the theory of information retrieval, advances in information retrieval theory, ICTIR 2009 (pp. 104–115), Cambridge, UK, September 10–12, 2009. https://doi.org/10.1007/978-3-642-04417-5

Whissell, J. S., & Clarke, C. L. (2011). Improving document clustering using Okapi BM25 feature weighting. Information Retrieval, 14(5), 466–487.

Ye, H., & Young, S. (2006). A clustering approach to semantic decoding. In Ninth international conference on spoken language processing.

Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In Proceedings of SIGIR (pp. 46–54).

Title: Clustering small-sized collections of short texts
Authors: Lili Kotlerman, Ido Dagan, Oren Kurland
Publication date: 30-11-2017
Publisher: Springer Netherlands
Published in: Discover Computing / Issue 4/2018
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-017-9324-8
