
2018 | Original Paper | Book Chapter

A Two-Stage Data Processing Algorithm to Generate Random Sample Partitions for Big Data Analysis

Authors: Chenghao Wei, Salman Salloum, Tamer Z. Emara, Xiaoliang Zhang, Joshua Zhexue Huang, Yulin He

Published in: Cloud Computing – CLOUD 2018

Publisher: Springer International Publishing


Abstract

To enable the individual data block files of a distributed big data set to be used as random samples for big data analysis, this paper proposes a two-stage data processing (TSDP) algorithm that converts a big data set into a random sample partition (RSP) representation. The RSP ensures that each individual data block is a random sample of the big data and can therefore be used to estimate the statistical properties of the big data. The first stage sequentially chunks the big data set into non-overlapping subsets and distributes these subsets as data block files to the nodes of a cluster. The second stage takes a random sample without replacement from each subset; together these samples form a new subset, which is saved as an RSP data block file. This sampling step is repeated until all data records in all subsets are used up and a new set of RSP data block files has been created to form an RSP of the big data. It is formally proved that the expectation of the sample distribution function (s.d.f.) of each RSP data block equals the s.d.f. of the big data set; therefore, each RSP data block is a random sample of the big data set. An implementation of the TSDP algorithm on Apache Spark and HDFS is presented. Performance evaluations on terabyte data sets show the efficiency of this algorithm in converting HDFS big data files into HDFS RSP big data files. We also show an example that uses only a small number of RSP data blocks to build ensemble models that perform better than the single model built from the entire data set.
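The two stages described in the abstract can be illustrated with a small in-memory sketch. This is not the paper's Spark/HDFS implementation: the function name `tsdp` and the parameters `num_chunks`, `num_rsp_blocks`, and `seed` are hypothetical, and shuffling each chunk before slicing it is used here as one simple way to draw repeated samples without replacement.

```python
import random


def tsdp(records, num_chunks, num_rsp_blocks, seed=0):
    """Toy sketch of the two-stage idea; all names here are illustrative."""
    rng = random.Random(seed)
    n = len(records)

    # Stage 1: sequentially cut the data into non-overlapping chunks
    # (at most num_chunks of them, using ceiling division for the size).
    chunk_size = -(-n // num_chunks)
    chunks = [records[i:i + chunk_size] for i in range(0, n, chunk_size)]

    # Stage 2: shuffle a copy of each chunk, then take every
    # num_rsp_blocks-th record starting at offset j as the sample for
    # RSP block j. The slices are disjoint, so this amounts to sampling
    # without replacement until the chunk is used up.
    rsp_blocks = [[] for _ in range(num_rsp_blocks)]
    for chunk in chunks:
        shuffled = chunk[:]
        rng.shuffle(shuffled)
        for j in range(num_rsp_blocks):
            rsp_blocks[j].extend(shuffled[j::num_rsp_blocks])
    return rsp_blocks


if __name__ == "__main__":
    # Sorted input: a sequential chunk alone would be a biased sample,
    # whereas each RSP block draws records from every chunk.
    data = list(range(1000))
    blocks = tsdp(data, num_chunks=10, num_rsp_blocks=5)
    overall_mean = sum(data) / len(data)
    for k, block in enumerate(blocks):
        print(k, len(block), sum(block) / len(block), overall_mean)
```

In this sketch every record lands in exactly one RSP block, and each block receives an equal share of every chunk when the sizes divide evenly, which is why a block's summary statistics tend to track those of the full data set even when the input is ordered.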

Footnotes
2
Note: In Spark's terminology, an RDD corresponds to a partition of the big data set, and a partition corresponds to a data block of the big data set. In this section, we use "partition" to mean a data block of the big data set loaded into a Spark RDD, in order to be consistent with Spark's terminology in this Spark implementation.
 
Metadata
Title
A Two-Stage Data Processing Algorithm to Generate Random Sample Partitions for Big Data Analysis
Authors
Chenghao Wei
Salman Salloum
Tamer Z. Emara
Xiaoliang Zhang
Joshua Zhexue Huang
Yulin He
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-94295-7_24
