Skip to main content
Top

2016 | OriginalPaper | Chapter

DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets

Authors : Minne Li, Dongsheng Li, Siqi Shen, Zhaoning Zhang, Xicheng Lu

Published in: Network and Parallel Computing

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Statistical analysis of aggregated records is widely used in various domains such as market research, sociological investigation and network analysis, etc. Stratified sampling (SS), which samples the population divided into distinct groups separately, is preferred in the practice for its high effectiveness and accuracy. In this paper, we propose a scalable and efficient algorithm named DSS, for SS to process large datasets. DSS executes all the sampling operations in parallel by calculating the exact subsample size for each partition according to the data distribution. We implement DSS on Spark, a big-data processing system, and we show through large-scale experiments that it can achieve lower data-transmission cost and higher efficiency than state-of-the-art methods with high sample representativeness.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Boldi, P., Vigna, S.: The webgraph framework I: compression techniques. In: Proceedings of the 13th International Conference on World Wide Web, pp. 595–602. ACM (2004) Boldi, P., Vigna, S.: The webgraph framework I: compression techniques. In: Proceedings of the 13th International Conference on World Wide Web, pp. 595–602. ACM (2004)
2.
go back to reference Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends\({\textregistered }\) Mach. Learn. 3(1), 1–122 (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends\({\textregistered }\) Mach. Learn. 3(1), 1–122 (2011)
4.
go back to reference Cooper, D.R., Schindler, P.S., Sun, J.: Business Research Methods (2006) Cooper, D.R., Schindler, P.S., Sun, J.: Business Research Methods (2006)
5.
go back to reference Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef
6.
go back to reference Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet topology. ACM SIGCOMM Comput. Commun. Rev. 29, 251–262 (1999). ACMCrossRefMATH Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet topology. ACM SIGCOMM Comput. Commun. Rev. 29, 251–262 (1999). ACMCrossRefMATH
7.
go back to reference Gjoka, M., Butts, C.T., Kurant, M., Markopoulou, A.: Multigraph sampling of online social networks. IEEE J. Sel. Areas Commun. 29(9), 1893–1905 (2011)CrossRef Gjoka, M., Butts, C.T., Kurant, M., Markopoulou, A.: Multigraph sampling of online social networks. IEEE J. Sel. Areas Commun. 29(9), 1893–1905 (2011)CrossRef
8.
go back to reference Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: A walk in facebook: uniform sampling of users in online social networks. arXiv preprint arXiv:0906.0060 (2009) Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: A walk in facebook: uniform sampling of users in online social networks. arXiv preprint arXiv:​0906.​0060 (2009)
9.
go back to reference Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: Powergraph: distributed graph-parallel computation on natural graphs. Presented as Ppart of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2012), pp. 17–30 (2012) Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: Powergraph: distributed graph-parallel computation on natural graphs. Presented as Ppart of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2012), pp. 17–30 (2012)
11.
go back to reference Kolmogorov, A.N.: Sulla determinazione empirica di una legge di distribuzione. na (1933) Kolmogorov, A.N.: Sulla determinazione empirica di una legge di distribuzione. na (1933)
12.
go back to reference Kurant, M., Gjoka, M., Butts, C.T., Markopoulou, A.: Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 281–292. ACM (2011) Kurant, M., Gjoka, M., Butts, C.T., Markopoulou, A.: Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 281–292. ACM (2011)
14.
go back to reference Levin, R., Kanza, Y.: Stratified-sampling over social networks using mapreduce. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 863–874. ACM (2014) Levin, R., Kanza, Y.: Stratified-sampling over social networks using mapreduce. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 863–874. ACM (2014)
15.
go back to reference Lu, X., Wang, H., Wang, J., Xu, J., Li, D.: Internet-based virtual computing environment: beyond the data center as a computer. Future Gener. Comput. Syst. 29(1), 309–322 (2013)CrossRef Lu, X., Wang, H., Wang, J., Xu, J., Li, D.: Internet-based virtual computing environment: beyond the data center as a computer. Future Gener. Comput. Syst. 29(1), 309–322 (2013)CrossRef
16.
go back to reference Meng, X.: Scalable simple random sampling and stratified sampling. In: Proceedings of the 30th International Conference on Machine Learning (ICML-2013), pp. 531–539 (2013) Meng, X.: Scalable simple random sampling and stratified sampling. In: Proceedings of the 30th International Conference on Machine Learning (ICML-2013), pp. 531–539 (2013)
17.
go back to reference Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action. Manning Publications Co., Greenwich (2011) Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action. Manning Publications Co., Greenwich (2011)
18.
go back to reference Papagelis, M., Das, G., Koudas, N.: Sampling online social networks. IEEE Trans. Knowl. Data Eng. 25(3), 662–676 (2013)CrossRef Papagelis, M., Das, G., Koudas, N.: Sampling online social networks. IEEE Trans. Knowl. Data Eng. 25(3), 662–676 (2013)CrossRef
19.
22.
go back to reference Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2. USENIX Association (2012) Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2. USENIX Association (2012)
23.
go back to reference Zhang, Z., Li, D., Wu, K.: Large-scale virtual machines provisioning in clouds: challenges and approaches. Front. Comput. Sci. 10(1), 2–18 (2016)CrossRef Zhang, Z., Li, D., Wu, K.: Large-scale virtual machines provisioning in clouds: challenges and approaches. Front. Comput. Sci. 10(1), 2–18 (2016)CrossRef
Metadata
Title
DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets
Authors
Minne Li
Dongsheng Li
Siqi Shen
Zhaoning Zhang
Xicheng Lu
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-47099-3_11

Premium Partner