Skip to main content
Erschienen in:
Buchtitelbild

2015 | OriginalPaper | Buchkapitel

BPMiner: Algorithms for Large-Scale Private Analysis

verfasst von : Quach Vinh Thanh, Anwitaman Datta

Erschienen in: Transactions on Large-Scale Data- and Knowledge-Centered Systems XXII

Verlag: Springer Berlin Heidelberg

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

An abundance of data generated from a multitude of sources, and intelligence derived by analyzing the same, has become an important asset across many walks of life. Simultaneously, it raises serious concerns about privacy. Differential privacy has become a popular way to reason about the amount of information about individual entries of a dataset that is divulged upon giving out a perturbed result for a query on a given data-set. However, current differentially-private algorithms are computationally inefficient, and do not explicitly exploit the abundance of data, thus wearing out the privacy budget irrespective of the volume of data. In this paper, we propose BPMiner, a solution that is both private and accurate, while simultaneously addressing the computation and budget challenges of very big datasets. The main idea is a non-trivial combination between differential privacy, sample-and-aggregation, and a classical statistical methodology called sequential estimation. Rigorous proof regarding the privacy and asymptotic accuracy of our solution are provided. Furthermore, experimental results over multiple datasets demonstrate that BPMiner outperforms current private algorithms in terms of computational and budget efficiencies, while achieving comparable accuracy. Overall, BPMiner is a practical solution based on strong theoretical foundations for privacy-preserving analysis on big datasets.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
In this paper, an ‘individual’ refers to an entry in a statistical database, which may correspond to information about a real-world entity, e.g., a patient’s record, a financial transaction, etc.
 
2
While some works regard ‘privacy budget’ as \(\epsilon \) (the privacy parameter), we consider it a fixed budget that is reduced per analysis. Such interpretation is seen in [19, 28, 29].
 
3
This name reflects the fact that we use sequential estimation w.r.t. data blocks. It is not to be confused with moving blocks boostrap in the field of bootstrap/subsampling.
 
Literatur
1.
Zurück zum Zitat Agarwal, A., Chapelle, O., Dudík, M., Langford, J.: A reliable effective terascale linear learning system. J. Mach. Learn. Res. 15(1), 1111–1133 (2014)MathSciNetMATH Agarwal, A., Chapelle, O., Dudík, M., Langford, J.: A reliable effective terascale linear learning system. J. Mach. Learn. Res. 15(1), 1111–1133 (2014)MathSciNetMATH
2.
Zurück zum Zitat Agarwal, S., Milner, H., Kleiner, A., Talwalkar, A., Jordan, M.I., Madden, S., Mozafari, B., Stoica, I.: Knowing when you’re wrong: building fast and reliable approximate query processing systems. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 481–492 (2014) Agarwal, S., Milner, H., Kleiner, A., Talwalkar, A., Jordan, M.I., Madden, S., Mozafari, B., Stoica, I.: Knowing when you’re wrong: building fast and reliable approximate query processing systems. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 481–492 (2014)
3.
5.
Zurück zum Zitat Bache, K., Lichman, M.: UCI Machine Learning Repository. IOS Press, Amsterdam (2013) Bache, K., Lichman, M.: UCI Machine Learning Repository. IOS Press, Amsterdam (2013)
6.
Zurück zum Zitat Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: Proceedings of the 12th International Society for Music Information Retrieval Conference ISMIR 2011, Miami, Florida, USA, 24–28 October 2011, pp. 591–596 (2011) Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: Proceedings of the 12th International Society for Music Information Retrieval Conference ISMIR 2011, Miami, Florida, USA, 24–28 October 2011, pp. 591–596 (2011)
8.
Zurück zum Zitat Blum, A., Dwork, C., McSherry, F., Nissim, F.: Practical privacy: the sulq framework. In: Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Baltimore, Maryland, USA, 13–15 June 2005, pp. 128–138 (2005) Blum, A., Dwork, C., McSherry, F., Nissim, F.: Practical privacy: the sulq framework. In: Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Baltimore, Maryland, USA, 13–15 June 2005, pp. 128–138 (2005)
9.
Zurück zum Zitat Cai, Z., Gao, Z.J., Luo, S., Perez, L.L., Vagena, Z., Jermaine, C.M.: A comparison of platforms for implementing and running very large scale machine learning algorithms. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1371–1382 (2014) Cai, Z., Gao, Z.J., Luo, S., Perez, L.L., Vagena, Z., Jermaine, C.M.: A comparison of platforms for implementing and running very large scale machine learning algorithms. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1371–1382 (2014)
10.
Zurück zum Zitat Chen, J., Chen, X.: A new method for adaptive sequential sampling for learning and parameter estimation. In: Kryszkiewicz, M., Rybinski, H., Skowron, A., Raś, Z.W. (eds.) ISMIS 2011. LNCS, vol. 6804, pp. 220–229. Springer, Heidelberg (2011) CrossRef Chen, J., Chen, X.: A new method for adaptive sequential sampling for learning and parameter estimation. In: Kryszkiewicz, M., Rybinski, H., Skowron, A., Raś, Z.W. (eds.) ISMIS 2011. LNCS, vol. 6804, pp. 220–229. Springer, Heidelberg (2011) CrossRef
11.
Zurück zum Zitat Chow, Y.S., Robbins, H.: On the asymptotic theory of fixed-width sequential confidence intervals for the mean. Ann. Math. Stat. 36(2), 457–462 (1965)MathSciNetCrossRefMATH Chow, Y.S., Robbins, H.: On the asymptotic theory of fixed-width sequential confidence intervals for the mean. Ann. Math. Stat. 36(2), 457–462 (1965)MathSciNetCrossRefMATH
12.
Zurück zum Zitat Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning for big data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 939–942 (2013) Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning for big data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 939–942 (2013)
13.
Zurück zum Zitat Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning on big data. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia 8–12 April 2013, pp. 1242–1244 (2013) Condie, T., Mineiro, P., Polyzotis, N., Weimer, M.: Machine learning on big data. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia 8–12 April 2013, pp. 1242–1244 (2013)
14.
Zurück zum Zitat Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006) CrossRef Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006) CrossRef
15.
Zurück zum Zitat Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D.-Z., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008) CrossRef Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D.-Z., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008) CrossRef
16.
Zurück zum Zitat Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)CrossRef Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)CrossRef
17.
Zurück zum Zitat Dwork, C., Rothblum, G.N., Vadhan, S.P.: Boosting and differential privacy. In: 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, Las Vegas, Neveda, USA, 23–26 October 2010, pp. 51–60 (2010) Dwork, C., Rothblum, G.N., Vadhan, S.P.: Boosting and differential privacy. In: 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, Las Vegas, Neveda, USA, 23–26 October 2010, pp. 51–60 (2010)
18.
Zurück zum Zitat Dwork, C., Smith, A.: Differential privacy for statistics: What we know and what we want to learn. J. Priv. Confidentiality 1(2), 135–154 (2009) Dwork, C., Smith, A.: Differential privacy for statistics: What we know and what we want to learn. J. Priv. Confidentiality 1(2), 135–154 (2009)
19.
Zurück zum Zitat Haeberlen, A., Pierce, B.C., Narayan, A.: Differential privacy under fire. In: 20th USENIX Security Symposium, San Francisco, CA, USA, 8–12 August 2011, Proceedings, pp. 33–33 (2011) Haeberlen, A., Pierce, B.C., Narayan, A.: Differential privacy under fire. In: 20th USENIX Security Symposium, San Francisco, CA, USA, 8–12 August 2011, Proceedings, pp. 33–33 (2011)
20.
Zurück zum Zitat Ho, C.-H., Lin, C.-J.: Large-scale linear support vector regression. J. Mach. Learn. Res. 13, 3323–3348 (2012)MathSciNetMATH Ho, C.-H., Lin, C.-J.: Large-scale linear support vector regression. J. Mach. Learn. Res. 13, 3323–3348 (2012)MathSciNetMATH
21.
Zurück zum Zitat Jordan, M.I.: Divide-and-conquer and statistical inference for big data. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2012, Beijing, China, 12–16 August 2012, p. 4 (2012) Jordan, M.I.: Divide-and-conquer and statistical inference for big data. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2012, Beijing, China, 12–16 August 2012, p. 4 (2012)
22.
Zurück zum Zitat Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I.: The big data bootstrap. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK June 26 - July 1 2012, pp. 1759–1766 (2012) Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I.: The big data bootstrap. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK June 26 - July 1 2012, pp. 1759–1766 (2012)
23.
Zurück zum Zitat Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: Mlbase: A distributed machine-learning system. In: CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 6–9 January 2013, Online Proceedings (2013) Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: Mlbase: A distributed machine-learning system. In: CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 6–9 January 2013, Online Proceedings (2013)
24.
Zurück zum Zitat Laptev, N., Zeng, K., Zaniolo, C.: Early accurate results for advanced analytics on mapreduce. Proc. VLDB Endow. PVLDB 5(10), 1028–1039 (2012)CrossRef Laptev, N., Zeng, K., Zaniolo, C.: Early accurate results for advanced analytics on mapreduce. Proc. VLDB Endow. PVLDB 5(10), 1028–1039 (2012)CrossRef
25.
Zurück zum Zitat Laptev, N., Zeng, K., Zaniolo, C.: Very fast estimation for result and accuracy of big data analytics: the EARL system. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp. 1296–1299 (2013) Laptev, N., Zeng, K., Zaniolo, C.: Very fast estimation for result and accuracy of big data analytics: the EARL system. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp. 1296–1299 (2013)
26.
Zurück zum Zitat Lin, J., Kolcz, A.: Large-scale machine learning at twitter. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, AZ, USA,Scottsdale, 20–24 May 2012, pp. 793–804 (2012) Lin, J., Kolcz, A.: Large-scale machine learning at twitter. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, AZ, USA,Scottsdale, 20–24 May 2012, pp. 793–804 (2012)
27.
Zurück zum Zitat Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed graphlab: a framework for machine learning in the cloud. Proc. VLDB Endow. PVLDB 5(8), 716–727 (2012)CrossRef Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed graphlab: a framework for machine learning in the cloud. Proc. VLDB Endow. PVLDB 5(8), 716–727 (2012)CrossRef
28.
Zurück zum Zitat McSherry, F.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pp. 19–30 (2009) McSherry, F.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pp. 19–30 (2009)
29.
Zurück zum Zitat Mohan, P., Thakurta, A., Shi, E., Song, D., Culler, D.E.: Gupt: privacy preserving data analysis made easy. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 349–360 (2012) Mohan, P., Thakurta, A., Shi, E., Song, D., Culler, D.E.: Gupt: privacy preserving data analysis made easy. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 349–360 (2012)
30.
Zurück zum Zitat Mukhopadhyay, N.: A consistent and asymptotically efficient two-stage procedure to construct fixed width confidence intervals for the mean. Metrika 27(1), 281–284 (1980)MathSciNetCrossRefMATH Mukhopadhyay, N.: A consistent and asymptotically efficient two-stage procedure to construct fixed width confidence intervals for the mean. Metrika 27(1), 281–284 (1980)MathSciNetCrossRefMATH
31.
Zurück zum Zitat Mukhopadhyay, Nitis, de Silva, Basil M.: Sequential Methods and Their Applications. Chapman and Hall/CRC, Boca Raton (2008)CrossRefMATH Mukhopadhyay, Nitis, de Silva, Basil M.: Sequential Methods and Their Applications. Chapman and Hall/CRC, Boca Raton (2008)CrossRefMATH
32.
Zurück zum Zitat Nadas, A.: An extension of a theorem of chow and robbins on sequential confidence intervals for the mean. Ann. Math. Stat. 40(2), 667–671 (1969)MathSciNetCrossRefMATH Nadas, A.: An extension of a theorem of chow and robbins on sequential confidence intervals for the mean. Ann. Math. Stat. 40(2), 667–671 (1969)MathSciNetCrossRefMATH
33.
Zurück zum Zitat Nissim, K., Raskhodnikova, S., Smith, A.: Smooth sensitivity and sampling in private data analysis. In: Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, 11–13 June 2007, pp. 75–84 (2007) Nissim, K., Raskhodnikova, S., Smith, A.: Smooth sensitivity and sampling in private data analysis. In: Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, 11–13 June 2007, pp. 75–84 (2007)
34.
Zurück zum Zitat Committee on the Analysis of Massive Data: Committee on Applied & Theoretical Statistics, Board on Mathematical Sciences & Their Applications, Division on Engineering & Physical Sciences, and National Research Council. The National Academies Press, Frontiers in Massive Data Analysis (2013) Committee on the Analysis of Massive Data: Committee on Applied & Theoretical Statistics, Board on Mathematical Sciences & Their Applications, Division on Engineering & Physical Sciences, and National Research Council. The National Academies Press, Frontiers in Massive Data Analysis (2013)
35.
Zurück zum Zitat Sandmann, W.: Sequential estimation for prescribed statistical accuracy in stochastic simulation of biological systems. Math. Biosci. 221(1), 43–53 (2009)MathSciNetCrossRefMATH Sandmann, W.: Sequential estimation for prescribed statistical accuracy in stochastic simulation of biological systems. Math. Biosci. 221(1), 43–53 (2009)MathSciNetCrossRefMATH
37.
Zurück zum Zitat Smith, A.: Asymptotically optimal and private statistical estimation. In: Garay, J.A., Miyaji, A., Otsuka, A. (eds.) CANS 2009. LNCS, vol. 5888, pp. 53–57. Springer, Heidelberg (2009) CrossRef Smith, A.: Asymptotically optimal and private statistical estimation. In: Garay, J.A., Miyaji, A., Otsuka, A. (eds.) CANS 2009. LNCS, vol. 5888, pp. 53–57. Springer, Heidelberg (2009) CrossRef
38.
Zurück zum Zitat Smith, A.: Privacy-preserving statistical estimation with optimal convergence rates. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, June 6–8 2011, pp. 813–822 (2011) Smith, A.: Privacy-preserving statistical estimation with optimal convergence rates. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, June 6–8 2011, pp. 813–822 (2011)
39.
Zurück zum Zitat Wasserman, L.: Minimaxity, statistical thinking and differential privacy. J. Priv. Confidentiality 4(1), 51–63 (2012) Wasserman, L.: Minimaxity, statistical thinking and differential privacy. J. Priv. Confidentiality 4(1), 51–63 (2012)
40.
Zurück zum Zitat Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I.: Shark: SQL and rich analytics at scale. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 13–24 (2013) Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I.: Shark: SQL and rich analytics at scale. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 13–24 (2013)
41.
Zurück zum Zitat Yui, M., Kojima, I.: A database-hadoop hybrid approach to scalable machine learning. In: IEEE International Congress on Big Data, BigData Congress 2013, June 27 2013-July 2, 2013, pp. 1–8 (2013) Yui, M., Kojima, I.: A database-hadoop hybrid approach to scalable machine learning. In: IEEE International Congress on Big Data, BigData Congress 2013, June 27 2013-July 2, 2013, pp. 1–8 (2013)
42.
Zurück zum Zitat Zeng, K., Gao, S., Gu, J., Mozafari, B., Zaniolo, C.: ABS: a system for scalable approximate queries with accuracy guarantees. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1067–1070 (2014) Zeng, K., Gao, S., Gu, J., Mozafari, B., Zaniolo, C.: ABS: a system for scalable approximate queries with accuracy guarantees. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 1067–1070 (2014)
43.
Zurück zum Zitat Zeng, K., Gao, S., Mozafari, B., Zaniolo, C.: The analytical bootstrap: a new method for fast error estimation in approximate query processing. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 277–288 (2014) Zeng, K., Gao, S., Mozafari, B., Zaniolo, C.: The analytical bootstrap: a new method for fast error estimation in approximate query processing. In: International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, 22–27 June 2014, pp. 277–288 (2014)
Metadaten
Titel
BPMiner: Algorithms for Large-Scale Private Analysis
verfasst von
Quach Vinh Thanh
Anwitaman Datta
Copyright-Jahr
2015
Verlag
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-662-48567-5_1

Neuer Inhalt