Skip to main content
Erschienen in: Cluster Computing 2/2015

01.06.2015

A parallel text document clustering algorithm based on neighbors

verfasst von: Yanjun Li, Congnan Luo, Soon M. Chung

Erschienen in: Cluster Computing | Ausgabe 2/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In this paper, we propose a new parallel algorithm for text document clustering based on the concept of neighbor (Guha et al. in Inf Syst 25(5):345–366, 2000). If two documents are similar enough, they are considered as neighbors of each other. The new algorithm is named parallel k-means based on neighbors (PKBN), and it is a parallel version of sequential k-means based on neighbors (SKBN) that we proposed in Luo et al. (Data Knowl Eng 68(11):1271–1288, 2009). PKBN fully exploits the data-parallelism of SKBN and adopts a new parallel pair-generating method to build the neighbor matrix. Our new parallel pair-generating method causes less communication overhead between processors than existing methods. PKBN is designed for message-passing multiprocessor systems and is implemented on a cluster of Linux workstations to analyze its performance. Our experimental results on real-life data sets demonstrate that PKBN is very efficient and has good scalability with respect to the number of processors and the size of data set.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Aboutabl, A.E., Elsayed, M.N.: A novel parallel algorithm for clustering documents based on the hierarchical agglomerative approach. Int. J. Comput. Sci. Inf. Technol. (IJCSIT). 3(2), 152–163 (2011) Aboutabl, A.E., Elsayed, M.N.: A novel parallel algorithm for clustering documents based on the hierarchical agglomerative approach. Int. J. Comput. Sci. Inf. Technol. (IJCSIT). 3(2), 152–163 (2011)
2.
Zurück zum Zitat Bobda, C., Steenbock, N.: Singular value decomposition on distributed reconfigurable systems. In: Proceedings of the 12th International Workshop on Rapid System Prototyping, pp. 38–43 (2001) Bobda, C., Steenbock, N.: Singular value decomposition on distributed reconfigurable systems. In: Proceedings of the 12th International Workshop on Rapid System Prototyping, pp. 38–43 (2001)
3.
Zurück zum Zitat Brent, R.P., Luk, F.T.: The solution of singular-value and symmetric eigen-value problems on multiprocessor arrays. SIAM J. Sci. Stat. Comput. 6, 69–84 (1985)CrossRefMATHMathSciNet Brent, R.P., Luk, F.T.: The solution of singular-value and symmetric eigen-value problems on multiprocessor arrays. SIAM J. Sci. Stat. Comput. 6, 69–84 (1985)CrossRefMATHMathSciNet
4.
Zurück zum Zitat Cao, Z., Zhou, Y. : Parallel text clustering based on MapReduce. In: Proceedings of the 2nd International Conference on Cloud and Green Computing, pp. 226–229 (2012) Cao, Z., Zhou, Y. : Parallel text clustering based on MapReduce. In: Proceedings of the 2nd International Conference on Cloud and Green Computing, pp. 226–229 (2012)
5.
Zurück zum Zitat Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329 (1992) Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329 (1992)
6.
Zurück zum Zitat Dhillon, I.S., Modha, D.S.: A data-clustering algorithm on distributed memory multiprocessors. In: Large-Scale Parallel Data Mining, LNCS, vol. 1759, pp. 245–260. Springer, Heidelberg (2000) Dhillon, I.S., Modha, D.S.: A data-clustering algorithm on distributed memory multiprocessors. In: Large-Scale Parallel Data Mining, LNCS, vol. 1759, pp. 245–260. Springer, Heidelberg (2000)
7.
Zurück zum Zitat Forman, G., Zhang, B.: Distributed data clustering can be efficient and exact. SIGKDD Explor. Newsl. 2(2), 34–38 (2000)CrossRef Forman, G., Zhang, B.: Distributed data clustering can be efficient and exact. SIGKDD Explor. Newsl. 2(2), 34–38 (2000)CrossRef
8.
Zurück zum Zitat Garey, M.R., Johnson, D.S., Witsenhausen, H.S.: Complexity of the generalized Lloyd–Max problem. IEEE Trans. Inf. Theory 28(2), 256–257 (1982)CrossRefMathSciNet Garey, M.R., Johnson, D.S., Witsenhausen, H.S.: Complexity of the generalized Lloyd–Max problem. IEEE Trans. Inf. Theory 28(2), 256–257 (1982)CrossRefMathSciNet
9.
Zurück zum Zitat Garg, A., Mangla, A., Gupta, N., Bhatnagar, V.: PBIRCH: a scalable parallel clustering algorithm for incremental data. In: Proceedings of the International Database Engineering and Applications Symposium, pp. 315–316 (2006) Garg, A., Mangla, A., Gupta, N., Bhatnagar, V.: PBIRCH: a scalable parallel clustering algorithm for incremental data. In: Proceedings of the International Database Engineering and Applications Symposium, pp. 315–316 (2006)
10.
Zurück zum Zitat Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)CrossRef Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)CrossRef
11.
Zurück zum Zitat Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)MATH Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)MATH
12.
Zurück zum Zitat Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)MATH Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)MATH
13.
Zurück zum Zitat Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)CrossRef Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)CrossRef
14.
Zurück zum Zitat Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–22 (1999) Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–22 (1999)
17.
Zurück zum Zitat Li, Y., Chung, S.M.: Parallel bisecting K-means with prediction clustering algorithm. J. Supercomput. 39(1), 19–37 (2007)CrossRefMATH Li, Y., Chung, S.M.: Parallel bisecting K-means with prediction clustering algorithm. J. Supercomput. 39(1), 19–37 (2007)CrossRefMATH
18.
Zurück zum Zitat Li, Y., Chung, S.M., Holt, J.D.: Text document clustering based on frequent word meaning sequences. Data Knowl. Eng. 64(1), 381–404 (2008) Li, Y., Chung, S.M., Holt, J.D.: Text document clustering based on frequent word meaning sequences. Data Knowl. Eng. 64(1), 381–404 (2008)
19.
Zurück zum Zitat Li, Y., Luo, C., Chung, S.M.: Text clustering with feature selection by using statistical data. IEEE Trans Knowl. Data Eng. 20(5), 641–652 (2008)CrossRef Li, Y., Luo, C., Chung, S.M.: Text clustering with feature selection by using statistical data. IEEE Trans Knowl. Data Eng. 20(5), 641–652 (2008)CrossRef
20.
Zurück zum Zitat Liu, G., Wang, Y., Zhao, T., Li, D.: Research on the parallel text clustering algorithm based on the semantic tree. In: Proceedings of the 6th International Conference on Computer Sciences and Convergence Information Technology, pp. 400–403 (2011) Liu, G., Wang, Y., Zhao, T., Li, D.: Research on the parallel text clustering algorithm based on the semantic tree. In: Proceedings of the 6th International Conference on Computer Sciences and Convergence Information Technology, pp. 400–403 (2011)
21.
Zurück zum Zitat Luo, C., Li, Y., Chung, S.M.: Text document clustering based on neighbors. Data Knowl. Eng. 68(11), 1271–1288 (2009)CrossRef Luo, C., Li, Y., Chung, S.M.: Text document clustering based on neighbors. Data Knowl. Eng. 68(11), 1271–1288 (2009)CrossRef
22.
Zurück zum Zitat Mogill, J.A., Haglin, D.J.: Toward parallel document clustering. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS) Workshops and PhD Forum, pp. 1700–1709 (2011) Mogill, J.A., Haglin, D.J.: Toward parallel document clustering. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS) Workshops and PhD Forum, pp. 1700–1709 (2011)
23.
Zurück zum Zitat Moore, A.W.: The anchors hierarchy: using the triangle inequality to survive high dimensional data. In: Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pp. 397–405 (2000) Moore, A.W.: The anchors hierarchy: using the triangle inequality to survive high dimensional data. In: Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pp. 397–405 (2000)
25.
Zurück zum Zitat Ordonez, C., Omiecinski, E.: Efficient disk-based K-means clustering for relational databases. IEEE Trans. Knowl. Data Eng. 16(8), 909–921 (2004)CrossRef Ordonez, C., Omiecinski, E.: Efficient disk-based K-means clustering for relational databases. IEEE Trans. Knowl. Data Eng. 16(8), 909–921 (2004)CrossRef
26.
Zurück zum Zitat Ranka, S., Sahni, S.: Clustering on a hypercube multicomputer. IEEE Trans. Parallel Distrib. Syst. 2(2), 129–137 (1991)CrossRef Ranka, S., Sahni, S.: Clustering on a hypercube multicomputer. IEEE Trans. Parallel Distrib. Syst. 2(2), 129–137 (1991)CrossRef
27.
Zurück zum Zitat van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Buttersworth, London (1979) van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Buttersworth, London (1979)
28.
Zurück zum Zitat Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. KDD Workshop on Text Mining (2000) Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. KDD Workshop on Text Mining (2000)
29.
Zurück zum Zitat Zhang, Y., Sun, J., Zhang, Y., Zhang, X.: Parallel implementation of CLARANS using PVM. In: Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1646–1649 (2004) Zhang, Y., Sun, J., Zhang, Y., Zhang, X.: Parallel implementation of CLARANS using PVM. In: Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1646–1649 (2004)
30.
Zurück zum Zitat Zhao, Y., Karypis, G.: Comparison of agglomerative and partitional document clustering algorithms. Technical Report# TR 02-014, Department of Computer Science, University of Minnesota, Minneapolis (2002) Zhao, Y., Karypis, G.: Comparison of agglomerative and partitional document clustering algorithms. Technical Report# TR 02-014, Department of Computer Science, University of Minnesota, Minneapolis (2002)
Metadaten
Titel
A parallel text document clustering algorithm based on neighbors
verfasst von
Yanjun Li
Congnan Luo
Soon M. Chung
Publikationsdatum
01.06.2015
Verlag
Springer US
Erschienen in
Cluster Computing / Ausgabe 2/2015
Print ISSN: 1386-7857
Elektronische ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-015-0450-z

Weitere Artikel der Ausgabe 2/2015

Cluster Computing 2/2015 Zur Ausgabe