Skip to main content
Top
Published in: Discover Computing 4/2018

30-11-2017

Clustering small-sized collections of short texts

Authors: Lili Kotlerman, Ido Dagan, Oren Kurland

Published in: Discover Computing | Issue 4/2018

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The need to cluster small text corpora composed of a few hundreds of short texts rises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging due to the vocabulary mismatch between short texts and the insufficient corpus-based statistics (e.g., term co-occurrence statistics) due to the corpus size. We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach with respect to applying clustering algorithms directly on the text corpus, and using state-of-the-art co-clustering and topic modeling methods.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
We note that our approach of associating texts with term clusters is shown below to outperform in our setting a method which is commonly used in co-clustering algorithms.
 
4
To account for multi-cluster assignment, we used the adaptation suggested by Rosenberg and Binkowski (2004), according to which each text has partial membership in each of its multiple clusters.
 
5
We publish our list of stopwords along with the datasets.
 
7
We used the following tools: LingPipe http://​alias-i.​com/​lingpipe/​index.​html for Complete Link, R implementation of the PAM algorithm https://​cran.​r-project.​org/​web/​packages/​cluster/​cluster.​pdf, http://​www.​cs.​utexas.​edu/​users/​dml/​Software/​cocluster.​html for co-clustering and http://​mallet.​cs.​umass.​edu for LDA. For the KMY baseline we applied for text clustering the algorithm in Fig. 6 with \(\theta =0.7\) as suggested in the original report (Ye and Young 2006). We experimented with additional thresholds (0 through 1 with the step of 0.1) and found the threshold of 0.7 to be one of the best for our setting. The algorithm is not very sensitive to threshold values around 0.7, although much higher and lower thresholds from the range of [0, 1] result in degraded performance.
 
8
We used the word2vec software accompanying (Mikolov et al. 2013) with context size of 5, the negative-training approach with 15 negative samples (NEG-15), and sub-sampling of frequent words with a parameter of \(10^{-5}\). The parameter settings follow Mikolov et al. (2013).
 
9
We experimented also with K randomly selected initial medoids, but having each text as an initial medoid showed better results. Since our text collections are small this is not computationally expensive.
 
Literature
go back to reference Aggarwal, C.C., & Zhai, C. (2012). A survey of text clustering algorithms. In Mining text data (pp. 77–128). Springer. Aggarwal, C.C., & Zhai, C. (2012). A survey of text clustering algorithms. In Mining text data (pp. 77–128). Springer.
go back to reference Allan, J., Papka, R., & Lavrenko, V. (1998). On-line new event detection and tracking. In Proceedings of SIGIR (pp. 37–45). Allan, J., Papka, R., & Lavrenko, V. (1998). On-line new event detection and tracking. In Proceedings of SIGIR (pp. 37–45).
go back to reference Aslam, J. A., Ekstrand-Abueg, M., Pavlu, V., Diaz, F., McCreadie, R., & Sakai, T. (2014). TREC 2014 temporal summarization track overview. In Proceedings of TREC. Aslam, J. A., Ekstrand-Abueg, M., Pavlu, V., Diaz, F., McCreadie, R., & Sakai, T. (2014). TREC 2014 temporal summarization track overview. In Proceedings of TREC.
go back to reference Baker, L. D., & McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 96–103). ACM (1998) Baker, L. D., & McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 96–103). ACM (1998)
go back to reference Barak, L., Dagan, I., & Shnarch, E. (2009). Text categorization from category name via lexical reference. In Proceedings of human language technologies: the 2009 annual conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short ’09 (pp. 33–36), Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1620853.1620864 Barak, L., Dagan, I., & Shnarch, E. (2009). Text categorization from category name via lexical reference. In Proceedings of human language technologies: the 2009 annual conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short ’09 (pp. 33–36), Association for Computational Linguistics, Stroudsburg, PA, USA. http://​dl.​acm.​org/​citation.​cfm?​id=​1620853.​1620864
go back to reference Berger, A.L., & Lafferty, J.D. (1999). Information retrieval as statistical translation. In Proceedings of SIGIR (pp. 222–229). Berger, A.L., & Lafferty, J.D. (1999). Information retrieval as statistical translation. In Proceedings of SIGIR (pp. 222–229).
go back to reference Biemann, C. (2006). Chinese whispers: An efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the first workshop on graph based methods for natural language processing (pp. 73–80). Association for Computational Linguistics. Biemann, C. (2006). Chinese whispers: An efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the first workshop on graph based methods for natural language processing (pp. 73–80). Association for Computational Linguistics.
go back to reference Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. the. Journal of machine Learning research, 3, 993–1022.MATH Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. the. Journal of machine Learning research, 3, 993–1022.MATH
go back to reference Boros, E., Kantor, P. B., & Neu, D. J. (2001). A clustering based approach to creating multi-document summaries. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval. Boros, E., Kantor, P. B., & Neu, D. J. (2001). A clustering based approach to creating multi-document summaries. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval.
go back to reference De Boom, C., Van Canneyt, S., Demeester, T., & Dhoedt, B. (2016). Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, 80, 150–156.CrossRef De Boom, C., Van Canneyt, S., Demeester, T., & Dhoedt, B. (2016). Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, 80, 150–156.CrossRef
go back to reference Denkowski, M., & Lavie, A. (2010). Meteor-next and the meteor paraphrase tables: Improved evaluation support for five target languages. In Proceedings of the joint fifth workshop on statistical machine translation and MetricsMATR, (pp. 339–342). Association for Computational Linguistics. Denkowski, M., & Lavie, A. (2010). Meteor-next and the meteor paraphrase tables: Improved evaluation support for five target languages. In Proceedings of the joint fifth workshop on statistical machine translation and MetricsMATR, (pp. 339–342). Association for Computational Linguistics.
go back to reference Dhillon, I.S., Mallela, S., & Modha, D.S. (2003). Information-theoretic co-clustering. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 89–98). ACM. Dhillon, I.S., Mallela, S., & Modha, D.S. (2003). Information-theoretic co-clustering. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 89–98). ACM.
go back to reference Di Marco, A., & Navigli, R. (2013). Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics, 39(3), 709–754.CrossRef Di Marco, A., & Navigli, R. (2013). Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics, 39(3), 709–754.CrossRef
go back to reference Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7), 1895–1923.CrossRef Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7), 1895–1923.CrossRef
go back to reference Erkan, G., & Radev, D. R. (2004). Lexrank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22, 457–479. Erkan, G., & Radev, D. R. (2004). Lexrank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22, 457–479.
go back to reference Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.MATH Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.MATH
go back to reference Ferraresi, A., Zanchetta, E., Baroni, M., & Bernardini, S. (2008). Introducing and evaluating ukwac, a very large web-derived corpus of english. In Proceedings of the 4th web as corpus workshop (WAC-4) can we beat Google (pp. 47–54). Ferraresi, A., Zanchetta, E., Baroni, M., & Bernardini, S. (2008). Introducing and evaluating ukwac, a very large web-derived corpus of english. In Proceedings of the 4th web as corpus workshop (WAC-4) can we beat Google (pp. 47–54).
go back to reference Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. Proceedings of AAAI, 6, 1301–1306. Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. Proceedings of AAAI, 6, 1301–1306.
go back to reference Ganitkevitch, J., Van Durme, B., & Callison-Burch, C. (2013). Ppdb: The paraphrase database. In HLT-NAACL (pp. 758–764). Ganitkevitch, J., Van Durme, B., & Callison-Burch, C. (2013). Ppdb: The paraphrase database. In HLT-NAACL (pp. 758–764).
go back to reference Green, S. J. (1999). Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering, 11(5), 713–730.CrossRef Green, S. J. (1999). Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering, 11(5), 713–730.CrossRef
go back to reference Habash, N., & Dorr, B. (2003). Catvar: A database of categorial variations for english. In Proceedings of the MT summit (pp. 471–474). Habash, N., & Dorr, B. (2003). Catvar: A database of categorial variations for english. In Proceedings of the MT summit (pp. 471–474).
go back to reference Hearst, M.A., Karger, D.R., & Pedersen, J.O. (1995). Scatter/gather as a tool for the navigation of retrieval results. In Working Notes of the 1995 AAAI fall symposium on AI applications in knowledge navigation and retrieval. Hearst, M.A., Karger, D.R., & Pedersen, J.O. (1995). Scatter/gather as a tool for the navigation of retrieval results. In Working Notes of the 1995 AAAI fall symposium on AI applications in knowledge navigation and retrieval.
go back to reference Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of SIGIR (pp. 50–57). Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of SIGIR (pp. 50–57).
go back to reference Hotho, A., Staab, S., Stumme, G. (2003). Ontologies improve text document clustering. In Third IEEE international conference on data mining, 2003. ICDM 2003 (pp. 541–544). IEEE. Hotho, A., Staab, S., Stumme, G. (2003). Ontologies improve text document clustering. In Third IEEE international conference on data mining, 2003. ICDM 2003 (pp. 541–544). IEEE.
go back to reference Hu, J., Fang, L., Cao, Y., Zeng, H.J., Li, H., et al. (2008). Enhancing text clustering by leveraging wikipedia semantics. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 179–186). ACM. Hu, J., Fang, L., Cao, Y., Zeng, H.J., Li, H., et al. (2008). Enhancing text clustering by leveraging wikipedia semantics. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 179–186). ACM.
go back to reference Hu, X., Zhang, X., Lu, C., Park, E.K., & Zhou, X. (2009). Exploiting wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 389–396). ACM. Hu, X., Zhang, X., Lu, C., Park, E.K., & Zhou, X. (2009). Exploiting wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 389–396). ACM.
go back to reference Karimzadehgan, M., & Zhai, C. (2010). Estimation of statistical translation models based on mutual information for ad hoc information retrieval. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval (pp. 323–330). ACM. Karimzadehgan, M., & Zhai, C. (2010). Estimation of statistical translation models based on mutual information for ad hoc information retrieval. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval (pp. 323–330). ACM.
go back to reference Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: an introduction to cluster analysis. Applied probability and statistics section (EUA): Wiley series in probability and mathematical statistics.CrossRefMATH Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: an introduction to cluster analysis. Applied probability and statistics section (EUA): Wiley series in probability and mathematical statistics.CrossRefMATH
go back to reference Kenter, T., & De Rijke, M. (2015). Short text similarity with word embeddings. In Proceedings of the 24th ACM international on conference on information and knowledge management (pp. 1411–1420). ACM. Kenter, T., & De Rijke, M. (2015). Short text similarity with word embeddings. In Proceedings of the 24th ACM international on conference on information and knowledge management (pp. 1411–1420). ACM.
go back to reference Kotlerman, L., Dagan, I., Gorodetsky, M., & Daya, E. (2012a). Sentence clustering via projection over term clusters. In: *SEM 2012: The first joint conference on lexical and computational semantics—Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the sixth international workshop on semantic evaluation (SemEval 2012) (pp. 38–43). Association for Computational Linguistics, Montréal, Canada (2012). http://www.aclweb.org/anthology/S12-1005 Kotlerman, L., Dagan, I., Gorodetsky, M., & Daya, E. (2012a). Sentence clustering via projection over term clusters. In: *SEM 2012: The first joint conference on lexical and computational semantics—Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the sixth international workshop on semantic evaluation (SemEval 2012) (pp. 38–43). Association for Computational Linguistics, Montréal, Canada (2012). http://​www.​aclweb.​org/​anthology/​S12-1005
go back to reference Kotlerman, L., Dagan, I., Magnini, B., & Bentivogli, L. (2015b). Textual entailment graphs. Natural Language Engineering, 21(5), 699–724.CrossRef Kotlerman, L., Dagan, I., Magnini, B., & Bentivogli, L. (2015b). Textual entailment graphs. Natural Language Engineering, 21(5), 699–724.CrossRef
go back to reference Kurland, O. (2009). Re-ranking search results using language models of query-specific clusters. Information Retrieval, 12(4), 437–460.CrossRef Kurland, O. (2009). Re-ranking search results using language models of query-specific clusters. Information Retrieval, 12(4), 437–460.CrossRef
go back to reference Levy, O., & Goldberg, Y. (2014). Dependencybased word embeddings. In Proceedings of the 52nd annual meeting of the association for computational linguistics (Vol. 2, pp. 302–308). Levy, O., & Goldberg, Y. (2014). Dependencybased word embeddings. In Proceedings of the 52nd annual meeting of the association for computational linguistics (Vol. 2, pp. 302–308).
go back to reference Li, W., & McCallum, A. (2006). Pachinko allocation: Dag-structured mixture models of topic correlations. In Proceedings of ICML (pp. 577–584). Li, W., & McCallum, A. (2006). Pachinko allocation: Dag-structured mixture models of topic correlations. In Proceedings of ICML (pp. 577–584).
go back to reference Liebeskind, C., Kotlerman, L., Dagan, I. (2015). Text categorization from category name in an industry-motivated scenario. Language resources and evaluation (pp. 1–35). Liebeskind, C., Kotlerman, L., Dagan, I. (2015). Text categorization from category name in an industry-motivated scenario. Language resources and evaluation (pp. 1–35).
go back to reference Lin, D. (1998). Automatic retrieval and clustering of similar words. In COLING-ACL (pp. 768–774). Lin, D. (1998). Automatic retrieval and clustering of similar words. In COLING-ACL (pp. 768–774).
go back to reference Liu, X., & Croft, W.B. (2004). Cluster-based retrieval using language models. In Proceedings of SIGIR (pp. 186–193). Liu, X., & Croft, W.B. (2004). Cluster-based retrieval using language models. In Proceedings of SIGIR (pp. 186–193).
go back to reference Metzler, D., Dumais, S., & Meek, C. (2007). Similarity measures for short segments of text. Berlin: Springer.CrossRef Metzler, D., Dumais, S., & Meek, C. (2007). Similarity measures for short segments of text. Berlin: Springer.CrossRef
go back to reference Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119). Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).
go back to reference Naughton, M., Kushmerick, N., & Carthy, J. (2006). Clustering sentences for discovering events in news articles. In Advances in information retrieval (pp. 535–538). Springer. Naughton, M., Kushmerick, N., & Carthy, J. (2006). Clustering sentences for discovering events in news articles. In Advances in information retrieval (pp. 535–538). Springer.
go back to reference Nomoto, T., & Matsumoto, Y. (2001). A new approach to unsupervised text summarization. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 26–34). ACM. Nomoto, T., & Matsumoto, Y. (2001). A new approach to unsupervised text summarization. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 26–34). ACM.
go back to reference Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. EMNLP, 14, 1532–1543. Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. EMNLP, 14, 1532–1543.
go back to reference Phan, X.H., Nguyen, L.M., & Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on world wide web, (pp. 91–100). ACM. Phan, X.H., Nguyen, L.M., & Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on world wide web, (pp. 91–100). ACM.
go back to reference Raiber, F., Kurland, O., Radlinski, F., & Shokouhi, M. (2015). Learning asymmetric co-relevance. In Proceedings of ICTIR (pp. 281–290). Raiber, F., Kurland, O., Radlinski, F., & Shokouhi, M. (2015). Learning asymmetric co-relevance. In Proceedings of ICTIR (pp. 281–290).
go back to reference Rose, T., Stevenson, M., & Whitehead, M. (2002). The reuters corpus volume 1-from yesterday’s news to tomorrow’s language resources. LREC, 2, 827–832. Rose, T., Stevenson, M., & Whitehead, M. (2002). The reuters corpus volume 1-from yesterday’s news to tomorrow’s language resources. LREC, 2, 827–832.
go back to reference Rosenberg, A., & Binkowski, E. (2004). Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In: Proceedings of HLT-NAACL 2004: short papers (pp. 77–80). Association for Computational Linguistics. Rosenberg, A., & Binkowski, E. (2004). Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In: Proceedings of HLT-NAACL 2004: short papers (pp. 77–80). Association for Computational Linguistics.
go back to reference Sahami, M., Heilman, T.D.: A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of WWW (pp. 377–386). Sahami, M., Heilman, T.D.: A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of WWW (pp. 377–386).
go back to reference Sedding, J., & Kazakov, D. (2004). Wordnet-based text document clustering. In Proceedings of the 3rd workshop on robust methods in analysis of natural language data (pp. 104–113). Association for Computational Linguistics. Sedding, J., & Kazakov, D. (2004). Wordnet-based text document clustering. In Proceedings of the 3rd workshop on robust methods in analysis of natural language data (pp. 104–113). Association for Computational Linguistics.
go back to reference Severyn, A., & Moschitti, A. (2015). Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 373–382). ACM. Severyn, A., & Moschitti, A. (2015). Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 373–382). ACM.
go back to reference Shehata, S. (2009). A wordnet-based semantic model for enhancing text clustering. In IEEE international conference on data mining workshops, 2009. ICDMW’09, (pp. 477–482). IEEE. Shehata, S. (2009). A wordnet-based semantic model for enhancing text clustering. In IEEE international conference on data mining workshops, 2009. ICDMW’09, (pp. 477–482). IEEE.
go back to reference Shnarch, E., Barak, L., & Dagan, I. Extracting lexical reference rules from wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP: Volume 1-Volume 1, ACL ’09 (pp. 450–458). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1687878.1687942 Shnarch, E., Barak, L., & Dagan, I. Extracting lexical reference rules from wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP: Volume 1-Volume 1, ACL ’09 (pp. 450–458). Association for Computational Linguistics, Stroudsburg, PA, USA. http://​dl.​acm.​org/​citation.​cfm?​id=​1687878.​1687942
go back to reference Steinbach, M., Karypis, G., Kumar, V., et al. (2000). A comparison of document clustering techniques. In KDD workshop on text mining (Vol. 400, pp. 525–526). Boston Steinbach, M., Karypis, G., Kumar, V., et al. (2000). A comparison of document clustering techniques. In KDD workshop on text mining (Vol. 400, pp. 525–526). Boston
go back to reference Tan, B., Velivelli, A., Fang, H., & Zhai, C. (2007). Term feedback for information retrieval with language models. In SIGIR 2007: proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 263–270), Amsterdam, The Netherlands, July 23–27, 2007. https://doi.org/10.1145/1277741.1277788. Tan, B., Velivelli, A., Fang, H., & Zhai, C. (2007). Term feedback for information retrieval with language models. In SIGIR 2007: proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 263–270), Amsterdam, The Netherlands, July 23–27, 2007. https://​doi.​org/​10.​1145/​1277741.​1277788.
go back to reference Tsur, O., Littman, A., & Rappoport, A. (2013). Efficient clustering of short messages into general domains. In Proceedings of ICWSM. Tsur, O., Littman, A., & Rappoport, A. (2013). Efficient clustering of short messages into general domains. In Proceedings of ICWSM.
go back to reference Udupa, R., Bhole, A., & Bhattacharyya, P. (2009). “A term is known by the company it keeps”: On selecting a good expansion set in pseudo-relevance feedback. In Proceedings of second international conference on the theory of information retrieval, advances in information retrieval theory, ICTIR 2009 (pp. 104–115), Cambridge, UK, September 10–12, 2009. https://doi.org/10.1007/978-3-642-04417-5 Udupa, R., Bhole, A., & Bhattacharyya, P. (2009). “A term is known by the company it keeps”: On selecting a good expansion set in pseudo-relevance feedback. In Proceedings of second international conference on the theory of information retrieval, advances in information retrieval theory, ICTIR 2009 (pp. 104–115), Cambridge, UK, September 10–12, 2009. https://​doi.​org/​10.​1007/​978-3-642-04417-5
go back to reference Whissell, J. S., & Clarke, C. L. (2011). Improving document clustering using okapi bm25 feature weighting. Information retrieval, 14(5), 466–487.CrossRef Whissell, J. S., & Clarke, C. L. (2011). Improving document clustering using okapi bm25 feature weighting. Information retrieval, 14(5), 466–487.CrossRef
go back to reference Ye, H., & Young, S. (2006). A clustering approach to semantic decoding. In: Ninth international conference on spoken language processing. Ye, H., & Young, S. (2006). A clustering approach to semantic decoding. In: Ninth international conference on spoken language processing.
go back to reference Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In Proceedings of SIGIR (pp. 46–54). Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In Proceedings of SIGIR (pp. 46–54).
Metadata
Title
Clustering small-sized collections of short texts
Authors
Lili Kotlerman
Ido Dagan
Oren Kurland
Publication date
30-11-2017
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 4/2018
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-017-9324-8

Premium Partner