Skip to main content

2020 | OriginalPaper | Buchkapitel

Decentralized and Adaptive K-Means Clustering for Non-IID Data Using HyperLogLog Counters

verfasst von : Amira Soliman, Sarunas Girdzijauskas, Mohamed-Rafik Bouguelia, Sepideh Pashami, Slawomir Nowaczyk

Erschienen in: Advances in Knowledge Discovery and Data Mining

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The data shared over the Internet tends to originate from ubiquitous and autonomous sources such as mobile phones, fitness trackers, and IoT devices. Centralized and federated machine learning solutions represent the predominant way of providing smart services for users. However, moving data to central location for analysis causes not only many privacy concerns, but also communication overhead. Therefore, in certain situations machine learning models need to be trained in a collaborative and decentralized manner, similar to the way the data is originally generated without requiring any central authority for data or model aggregation. This paper presents a decentralized and adaptive k-means algorithm that clusters data from multiple sources organized in peer-to-peer networks. Our algorithm allows peers to reach an approximation of the global model without sharing any raw data. Most importantly, we address the challenge of decentralized clustering with skewed non-IID data and asynchronous computations by integrating HyperLogLog counters with k-means algorithm. Furthermore, our clustering algorithm allows nodes to individually determine the number of clusters that fits their local data. Results using synthetic and real-world datasets show that our algorithm outperforms state-of-the-art decentralized k-means algorithms achieving accuracy gain that is up-to 36%.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Balcan, M.F., Ehrlich, S., Liang, Y.: Distributed \( k \)-means and \( k \)-median clustering on general topologies. In: Advances in Neural Information Processing Systems, pp. 1995–2003 (2013) Balcan, M.F., Ehrlich, S., Liang, Y.: Distributed \( k \)-means and \( k \)-median clustering on general topologies. In: Advances in Neural Information Processing Systems, pp. 1995–2003 (2013)
2.
Zurück zum Zitat Berta, Á., Hegedűs, I., Ormándi, R.: Lightning fast asynchronous distributed k-means clustering (2014) Berta, Á., Hegedűs, I., Ormándi, R.: Lightning fast asynchronous distributed k-means clustering (2014)
3.
Zurück zum Zitat Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp. 21–29. IEEE (1997) Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp. 21–29. IEEE (1997)
4.
Zurück zum Zitat Drineas, P., Frieze, A., Kannan, R., Vempala, S., Vinay, V.: Clustering large graphs via the singular value decomposition. Mach. Learn. 56(1–3), 9–33 (2004)CrossRef Drineas, P., Frieze, A., Kannan, R., Vempala, S., Vinay, V.: Clustering large graphs via the singular value decomposition. Mach. Learn. 56(1–3), 9–33 (2004)CrossRef
5.
Zurück zum Zitat Fellus, J, Picard, D., Gosselin, P.-H.: Decentralized k-means using randomized gossip protocols for clustering large datasets. In: 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 599–606. IEEE (2013) Fellus, J, Picard, D., Gosselin, P.-H.: Decentralized k-means using randomized gossip protocols for clustering large datasets. In: 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 599–606. IEEE (2013)
6.
Zurück zum Zitat Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Discrete Mathematics and Theoretical Computer Science, pp. 137–156 (2007) Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Discrete Mathematics and Theoretical Computer Science, pp. 137–156 (2007)
7.
Zurück zum Zitat Forman, G., Zhang, B.: Distributed data clustering can be efficient and exact. SIGKDD Explor. 2(2), 34–38 (2000)CrossRef Forman, G., Zhang, B.: Distributed data clustering can be efficient and exact. SIGKDD Explor. 2(2), 34–38 (2000)CrossRef
8.
Zurück zum Zitat Giaretta, L., Girdzijauskas, Š.: Gossip learning: off the beaten path. In: IEEE International Conference on Big Data (IEEE Big Data 2019), Los Angeles, CA, USA, 9–12 December 2019, p. 2019 (2019) Giaretta, L., Girdzijauskas, Š.: Gossip learning: off the beaten path. In: IEEE International Conference on Big Data (IEEE Big Data 2019), Los Angeles, CA, USA, 9–12 December 2019, p. 2019 (2019)
9.
Zurück zum Zitat Heule, S., Nunkesser, M., Hall, A.: Hyperloglog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology, pp. 683–692. ACM (2013) Heule, S., Nunkesser, M., Hall, A.: Hyperloglog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In: Proceedings of the 16th International Conference on Extending Database Technology, pp. 683–692. ACM (2013)
10.
Zurück zum Zitat Januzaj, E., Kriegel, H.-P., Pfeifle, M.: Towards effective and efficient distributed clustering. In: Workshop on Clustering Large Data Sets (ICDM 2003) (2003) Januzaj, E., Kriegel, H.-P., Pfeifle, M.: Towards effective and efficient distributed clustering. In: Workshop on Clustering Large Data Sets (ICDM 2003) (2003)
11.
Zurück zum Zitat Kargupta, H., Huang, W., Sivakumar, K., Johnson, E.: Distributed clustering using collective principal component analysis. Knowl. Inf. Syst. 3(4), 422–448 (2001)CrossRef Kargupta, H., Huang, W., Sivakumar, K., Johnson, E.: Distributed clustering using collective principal component analysis. Knowl. Inf. Syst. 3(4), 422–448 (2001)CrossRef
13.
Zurück zum Zitat Konečnỳ, J., McMahan, H.B., Ramage, D., Richtárik, P.: Federated optimization: distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527 (2016) Konečnỳ, J., McMahan, H.B., Ramage, D., Richtárik, P.: Federated optimization: distributed machine learning for on-device intelligence. arXiv preprint arXiv:​1610.​02527 (2016)
15.
Zurück zum Zitat MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, pp. 281–297 (1967) MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, pp. 281–297 (1967)
16.
Zurück zum Zitat Mashayekhi, H., Habibi, J., Khalafbeigi, T., Voulgaris, S., Van Steen, M.: GDCluster: a general decentralized clustering algorithm. IEEE Trans. Knowl. Data Eng. 27(7), 1892–1905 (2015)CrossRef Mashayekhi, H., Habibi, J., Khalafbeigi, T., Voulgaris, S., Van Steen, M.: GDCluster: a general decentralized clustering algorithm. IEEE Trans. Knowl. Data Eng. 27(7), 1892–1905 (2015)CrossRef
17.
Zurück zum Zitat Nguyen, G.H., Bouzerdoum, A., Phung, S.L.: Learning pattern classification tasks with imbalanced data sets. In: Yin, P.-Y. (ed.) Pattern Recognition. IntechOpen, Rijeka (2009) Nguyen, G.H., Bouzerdoum, A., Phung, S.L.: Learning pattern classification tasks with imbalanced data sets. In: Yin, P.-Y. (ed.) Pattern Recognition. IntechOpen, Rijeka (2009)
18.
Zurück zum Zitat Ormándi, R., Hegedűs, I., Jelasity, M.: Gossip learning with linear models on fully distributed data. Concurrency Comput. Pract. Experience 25(4), 556–571 (2013)CrossRef Ormándi, R., Hegedűs, I., Jelasity, M.: Gossip learning with linear models on fully distributed data. Concurrency Comput. Pract. Experience 25(4), 556–571 (2013)CrossRef
19.
Zurück zum Zitat Rajaraman, A., David Ullman, J.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2011)CrossRef Rajaraman, A., David Ullman, J.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2011)CrossRef
20.
Zurück zum Zitat Soliman, A., Bahri, L., Carminati, B., Ferrari, E., Girdzijauskas, S.: DIVa: decentralized identity validation for social networks. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pp. 383–391. ACM (2015) Soliman, A., Bahri, L., Carminati, B., Ferrari, E., Girdzijauskas, S.: DIVa: decentralized identity validation for social networks. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pp. 383–391. ACM (2015)
21.
Zurück zum Zitat Soliman, A., Bahri, L., Girdzijauskas, S., Carminati, B., Ferrari, E.: CADIVa: cooperative and adaptive decentralized identity validation model for social networks. Soc. Network Anal. Min. 6(1), 36 (2016)CrossRef Soliman, A., Bahri, L., Girdzijauskas, S., Carminati, B., Ferrari, E.: CADIVa: cooperative and adaptive decentralized identity validation model for social networks. Soc. Network Anal. Min. 6(1), 36 (2016)CrossRef
22.
Zurück zum Zitat Soliman, A., Girdzijauskas, S.: DLSAS: distributed large-scale anti-spam framework for decentralized online social networks. In: 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC), pp. 363–372. IEEE (2016) Soliman, A., Girdzijauskas, S.: DLSAS: distributed large-scale anti-spam framework for decentralized online social networks. In: 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC), pp. 363–372. IEEE (2016)
23.
Zurück zum Zitat Tasoulis, D.K., Vrahatis, M.N.: Unsupervised distributed clustering. In: Parallel and Distributed Computing and Networks, pp. 347–351 (2004) Tasoulis, D.K., Vrahatis, M.N.: Unsupervised distributed clustering. In: Parallel and Distributed Computing and Networks, pp. 347–351 (2004)
Metadaten
Titel
Decentralized and Adaptive K-Means Clustering for Non-IID Data Using HyperLogLog Counters
verfasst von
Amira Soliman
Sarunas Girdzijauskas
Mohamed-Rafik Bouguelia
Sepideh Pashami
Slawomir Nowaczyk
Copyright-Jahr
2020
DOI
https://doi.org/10.1007/978-3-030-47426-3_27