Skip to main content
Log in

Outlier-eliminated k-means clustering algorithm based on differential privacy preservation

  • Published:
Applied Intelligence Aims and scope Submit manuscript

Abstract

Individual privacy may be compromised during the process of mining for valuable information, and the potential for data mining is hindered by the need to preserve privacy. It is well known that k-means clustering algorithms based on differential privacy require preserving privacy while maintaining the availability of clustering. However, it is difficult to balance both aspects in traditional algorithms. In this paper, an outlier-eliminated differential privacy (OEDP) k-means algorithm is proposed that both preserves privacy and improves clustering efficiency. The proposed approach selects the initial centre points in accordance with the distribution density of data points, and adds Laplacian noise to the original data for privacy preservation. Both a theoretical analysis and comparative experiments were conducted. The theoretical analysis shows that the proposed algorithm satisfies ε-differential privacy. Furthermore, the experimental results show that, compared to other methods, the proposed algorithm effectively preserves data privacy and improves the clustering results in terms of accuracy, stability, and availability.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

References

  1. Acs G., Castelluccia C., Chen R. (2012) Differentially private histogram publishing through lossy compression. In: proceedings of IEEE 12th International Conference on Data Mining, ICDM, pp 1–10

  2. Agrawal R, Srikant R (2000) Privacy-preserving data mining. ACM Sigmod Record 29(2):439–450

    Article  Google Scholar 

  3. Angiulli F, Fassetti F (2009) DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Transactions on Knowledge Discovery from Data(TKDD) 3(1):4:1–57

    Google Scholar 

  4. Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: proceedings of the 6th European Conference on the Principles of Data Mining and Knowledge Discovery, pp 15–27

  5. Bhaskar R, Laxman S, Smith A, Thakurta A (2010) Discovering frequent patterns in sensitive data. In: proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD10). Washington, USA , pp 503–512

  6. Blum A, Dwork C, Mcsherry F, Nissim K (2005) Practical privacy: the SuLQ framework. In: proceedings of the 24th ACM SIGMOD International Conference on Management of Data / Principles of Database Systems. New Yorks:ACM Press, pp 128–138

  7. Chaudhuri K, Monteleoni C (2008) Privacy-preserving logistic regression. In: proceedings of the 22nd Annual Conference on Neural Information Processing Systems. Vancouver, Canada, pp 289–296

  8. Chen R, Acs G, Castelluccia C (2012) Differentially private sequential data publication via variable-length n-grams. In: proceedings of the 2012 ACM Conference on Computer and Communications Security, pp 638–649

  9. Dwork C (2011) A firm foundation for private data analysis. Commun ACM 54(1):86–95

    Article  Google Scholar 

  10. Dwork C, McSherry F, Nissim K, Smith A (2006) Calibrating noise to sensitivity in private data analysis. In: proceedings of the 3rd Conference on Theory of Cryptography. New York, USA, pp 265–284

  11. Dwork C (2006) Differential privacy. In: proceedings of the 33rd International Colloquium on Automata,languages and Programming. Springer, Berlin, pp 1–12

  12. Dwork C (2010) Differential privacy in new settings. In: proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), pp 174–183

  13. Dwork C (2008) Differential privacy: a survey of results. In: proceedings of the 5th International Conference on Theory and Application of Models of Computation. Berlin Heidelberg, pp 1–19

  14. Dwork C (2009) The differential privacy frontier(extended abstract). In: proceedings of the 6th Theory of Cryptography Conference(TCC09). Springer, Berlin, pp 496–502

  15. Friedman A, Schuster A (2010) Data mining with differential privacy. In: proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, USA, pp 493–502

  16. Ganta SR, Kasiviswanathan S, Smith A (2008) Composition attacks and auxiliary information in data privacy. In: proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA, pp 265–273

  17. Hautamäki V, Cherednichenko S, Kärkkäinen I, Kinnunen T, Fränti P (2005) Improving k-means by outlier removal. Lect Notes Comput Sci 3540:978–987

    Article  Google Scholar 

  18. Hawkins S, He H, Williams G, Baxter R (2002) Outlier detection using replicator neural networks. In: proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery. Springer, Berlin Heidelberg, pp 170–180

  19. Huang M, Ni W, Wang J, Sun F, Chong Z (2012) A logarithmic spiral based data perturbation method for clustering. Chin J Comput 35(11):2275–2282

    Article  Google Scholar 

  20. Jiang F, Chen Y-M (2015) Outlier detection based on granular computing and rough set theory. Appl Intell 42(2):303–322

    Article  Google Scholar 

  21. Jiang H, Yi S, Li J, Yang F, Hu X (2010) Ant clustering algorithm with k-harmonic means clustering. Expert Systems With Applications 37(12):8679–8684

    Article  Google Scholar 

  22. Kasiviswanathan SP, Lee HK, Nissim K, Raskhodnikova S, Smith A (2009) What can we learn privately?. Foundations of Computer Science Annual Symposium on 40(3):531–540

    MathSciNet  MATH  Google Scholar 

  23. Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: algorithms and applications. The VLDB Journal-The International Journal on Very Large Data Bases 8(3-4):237–253

    Article  Google Scholar 

  24. Li N, Qardaji W, Su D, Cao J (2012) PrivBasis: frequent itemset mining with differential privacy. In: proceedings of the 38th International Conference on Very Large Data Bases(VLDB12). New Yorks:ACM, pp 1340–1351

  25. Li X-B, Sarkar S (2010) Data clustering and micro-perturbation for privacy-preserving data sharing and analysis. In: proceedings of the International Conference on Information Systems (ICIS). Yamagata, Japan, pp 58–73

  26. Li Y, Hao Z, Wen W, Xie G (2013) Research on differential privacy preserving k-means clustering. Comput Sci 40(3):287–290

    Google Scholar 

  27. Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) l-diversity: privacy beyond k-anonymity. ACM Trans Knowl Discov Data 1(1):24

    Article  Google Scholar 

  28. Nguyen HH, Kim J, Kim Y (2013) Differential privacy in practice. Int J Comput Sci Eng 7(3):177–186

    Google Scholar 

  29. Nissim K, Raskhodnikova S, Smith A (2007) Smooth sensitivity and sampling in private data analysis. In: proceedings of the 39th Annual ACM symposium on Theory of Computing – STOC 07, pp 75–84

  30. Oliveira SRM, Zaïane OR (2004) Achieving privacy preservation when sharing data for clustering. In: proceedings of the International Workshop on Secure Data Management in a Connnected World. Toronto, Canada, pp 67–82

  31. Parameswaran R, Blough DM (2008) Privacy preserving data obfuscation for inherently clustered data. Int J Inf Comput Secur 2(1):4–26

    Google Scholar 

  32. Sweeney L (2002) K-anonymity: a model for protecting privacy. Int J Uncertainty Fuzziness Knowledge Based Syst 10(5):557–570

    Article  MathSciNet  MATH  Google Scholar 

  33. Tung AKH, Xu X, Ooi BC (2005) CURLER: finding and visualizing nonlinear correlation clusters. In: proceedings of the International Conference on Management of Data, pp 467–478

  34. Visalakshi NK, Thangavel K (2009) Impact of normalization in distributed k-means clustering. Int J Soft Comput 4(4):168–172

    Google Scholar 

  35. Xiong P, Zhu T, Wang X (2014) A survey on differential privacy and applications. Chin J Comput 37 (1):101–122

    Google Scholar 

  36. Zhang X, Wang M, Meng X (2014) An accurate method for mining top-k frequent pattern under differential privacy. Journal of Computer Research and Development 51(1):104–114

    Google Scholar 

Download references

Acknowledgments

The authors would like to thank the reviewers for their useful comments and suggestions for this paper. This work was supported by the National Natural Science Foundation of China (61370050) and the Natural Science Foundation of Anhui Province (1508085QF134).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yonglong Luo.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yu, Q., Luo, Y., Chen, C. et al. Outlier-eliminated k-means clustering algorithm based on differential privacy preservation. Appl Intell 45, 1179–1191 (2016). https://doi.org/10.1007/s10489-016-0813-z

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10489-016-0813-z

Keywords

Navigation