Published in: Cluster Computing 3/2016

1 September 2016

A comparison study of clustering algorithms for microblog posts

Authors: Lin Li, Jingjing Ye, Fang Deng, Shengwu Xiong, Luo Zhong



Abstract

Clustering is a popular unsupervised learning approach for topic analysis in text mining. In this paper, we present a comparison study of clustering algorithms for microblog posts, covering weighting and the programming model. Our experimental data were crawled from Sina Weibo in China and comprise 74,662 microblog posts on 14 topics related to Internet technology. We first preprocess these posts. We then propose a manual-sampling-based dynamic incremental clustering algorithm (MS-DICA) to extract topic threads from the crawled posts. We evaluate the proposed algorithm from four aspects and compare it experimentally with the traditional k-means algorithm in terms of accuracy and efficiency. The results show that MS-DICA is effective for topic thread extraction: its accuracy is close to that of traditional k-means, and it runs more than five times faster. In addition, the MapReduce programming model on the Hadoop distributed computing platform can run the k-means algorithm in parallel to speed up clustering.
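
The abstract does not spell out how k-means maps onto MapReduce, so the following is only a minimal single-process sketch of the usual decomposition, not the authors' Hadoop implementation. It assumes posts have already been turned into numeric vectors; the function names (`assign`, `recompute`, `kmeans_iteration`, `kmeans`) are hypothetical and chosen for illustration. The map step assigns each point to its nearest centroid, and the reduce step averages the points collected under each centroid index.

```python
from collections import defaultdict
from math import dist


def assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def recompute(points):
    """Reduce step: component-wise mean of the points sharing one key."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce round; returns the updated centroids."""
    groups = defaultdict(list)
    for p in points:  # in a real cluster this loop is split across mappers
        key, value = assign(p, centroids)
        groups[key].append(value)  # shuffle: group values by centroid index
    # A centroid that attracted no points keeps its previous position.
    return [
        recompute(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=100):
    """Repeat rounds until the centroids stop changing or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Because each `assign` call depends only on one point and the current centroids, the map step can be distributed freely; only the small list of centroids has to be shared between rounds.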


Metadata
Title
A comparison study of clustering algorithms for microblog posts
Authors
Lin Li
Jingjing Ye
Fang Deng
Shengwu Xiong
Luo Zhong
Publication date
1 September 2016
Publisher
Springer US
Published in
Cluster Computing / Issue 3/2016
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-016-0589-2
