2016 | OriginalPaper | Book chapter

Distributed Algorithm for Text Documents Clustering Based on k-Means Approach

Abstract

This paper describes the design and implementation of a distributed k-means clustering algorithm for the analysis of text documents. The motivation is to propose a distributed approach built on current in-memory distributed computing technologies. We used our Jbowl Java text mining library together with GridGain as the framework for distributed computing. Using these technologies, we designed and implemented two modifications of the distributed k-means clustering algorithm and ran experiments on standard text data collections. The experiments were conducted in two testing environments: a distributed computing infrastructure and a multi-core server.
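The abstract does not spell out the two modifications, so the following is only a generic sketch of how one data-parallel k-means iteration is commonly organised, not the authors' implementation. It assumes documents are already dense numeric vectors split into partitions, uses squared Euclidean distance, and lets a local thread pool stand in for compute nodes. No Jbowl or GridGain API is used, all function names are hypothetical, and Python is chosen for brevity even though the original work is Java-based.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    return dists.index(min(dists))


def partial_stats(partition, centroids):
    """Map step on one worker: per-cluster vector sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for vec in partition:
        j = nearest(vec, centroids)
        counts[j] += 1
        for d, value in enumerate(vec):
            sums[j][d] += value
    return sums, counts


def kmeans_step(partitions, centroids, pool):
    """One iteration: fan out to workers, then reduce into new centroids."""
    results = list(pool.map(lambda p: partial_stats(p, centroids), partitions))
    new_centroids = []
    for j, old in enumerate(centroids):
        total = sum(counts[j] for _, counts in results)
        if total == 0:
            new_centroids.append(list(old))  # keep an empty cluster's centroid
            continue
        dim = len(old)
        merged = [sum(sums[j][d] for sums, _ in results) / total for d in range(dim)]
        new_centroids.append(merged)
    return new_centroids


def distributed_kmeans(partitions, centroids, max_iter=20, workers=2):
    """Iterate until the centroids stop changing or max_iter is reached."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_iter):
            updated = kmeans_step(partitions, centroids, pool)
            if updated == centroids:
                break
            centroids = updated
    return centroids
```

As a usage example, calling `distributed_kmeans([[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0]]], [[0.0, 0.0], [10.0, 10.0]])` treats each inner list as one worker's partition and returns `[[0.0, 1.0], [10.0, 11.0]]`. Only the per-cluster sums and counts travel back to the coordinator, which is the usual reason this scheme suits distributed settings: the documents themselves never leave their partition.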

Metadata
Title
Distributed Algorithm for Text Documents Clustering Based on k-Means Approach
Authors
Martin Sarnovsky
Noema Carnoka
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-28561-0_13