Skip to main content
Erschienen in: The Journal of Supercomputing 9/2019

03.04.2019

SAIR: significance-aware approach to improve QoR of big data processing in case of budget constraint

verfasst von: Hossein Ahmadvand, Maziar Goudarzi

Erschienen in: The Journal of Supercomputing | Ausgabe 9/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Nowadays, a wide range of enterprises are faced with big data processing in different domains such as transaction operations, business calculations and analytical computations. Large-scale computing is an approach for big data processing. Due to the cost of large-scale computing and limitations of enterprise budgets, it is hardly possible to process all the input data and therefore the Quality of Result (QoR) may be affected. SAIR is an approach to improve QoR of big data processing for aggregative usages based on significance variety when there is a budget constraint. In this paper, the most significant data portions have been assigned to the most efficient resources in terms of time and cost. If the budget is still available, other data portions have been assigned to remaining resources. In this approach, statistical methods and a sampling technique with a 95% of the confidence interval and 5% of error margin are used to identify the most and least significant data portions. By using this method, the users are able to improve QoR with respect to budget constraint and preferred finishing time. In the evaluation phase, applications from different domains such as document and text, transaction data and system logs are used. Our results indicate that SAIR improves QoR while meeting budget constraint for considered usages. This approach improves the QoR up to 15%, compared with the state of the art.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Barroso LA, Clidaras J, Hölzle U (2013) The datacenter as a computer: an introduction to the design of warehouse-scale machines, vol 8.3, 2nd edn. Morgan & Claypool, San Rafael, pp 1–154 Barroso LA, Clidaras J, Hölzle U (2013) The datacenter as a computer: an introduction to the design of warehouse-scale machines, vol 8.3, 2nd edn. Morgan & Claypool, San Rafael, pp 1–154
2.
Zurück zum Zitat Gantz J, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView IDC Anal Future 2007:1–16 Gantz J, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView IDC Anal Future 2007:1–16
3.
Zurück zum Zitat Ahmadvand H, Goudarzi M (2017) Using data variety for efficient progressive big data processing in warehouse-scale computers. IEEE Comput Archit Lett 16(2):166–169CrossRef Ahmadvand H, Goudarzi M (2017) Using data variety for efficient progressive big data processing in warehouse-scale computers. IEEE Comput Archit Lett 16(2):166–169CrossRef
4.
Zurück zum Zitat Fekete J-D, Primet R (2016) Progressive analytics: a computation paradigm for exploratory data analysis. arXiv preprint arXiv, vol. 1607.05162 Fekete J-D, Primet R (2016) Progressive analytics: a computation paradigm for exploratory data analysis. arXiv preprint arXiv, vol. 1607.05162
5.
Zurück zum Zitat Mittal S (2016) A survey of techniques for approximate computing. ACM CSUR 48:62 Mittal S (2016) A survey of techniques for approximate computing. ACM CSUR 48:62
6.
Zurück zum Zitat Parasyris K, Vassiliadis V, Antonopoulos CD, Lalis S, Bellas N (2017) Significance-aware program execution on unreliable hardware. ACM TACO 14(2):12 Parasyris K, Vassiliadis V, Antonopoulos CD, Lalis S, Bellas N (2017) Significance-aware program execution on unreliable hardware. ACM TACO 14(2):12
7.
Zurück zum Zitat Zhao Y, Calheiros RN, Gange G, Ramamohanarao K, Buyya R (2015) SLA-based resource scheduling for big data analytics as a service in cloud computing environments. In: 2015 44th International Conference on Parallel Processing (ICPP) Zhao Y, Calheiros RN, Gange G, Ramamohanarao K, Buyya R (2015) SLA-based resource scheduling for big data analytics as a service in cloud computing environments. In: 2015 44th International Conference on Parallel Processing (ICPP)
8.
Zurück zum Zitat Honjo T, Oikawa K (2013) Hardware acceleration of hadoop mapreduce. In: 2013 IEEE International Conference on in Big Data Honjo T, Oikawa K (2013) Hardware acceleration of hadoop mapreduce. In: 2013 IEEE International Conference on in Big Data
9.
Zurück zum Zitat Shan Y, Wang B, Yan J, Wang Y, Xu N, Yang H (2010) FPMR: MapReduce framework on FPGA. In: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays Shan Y, Wang B, Yan J, Wang Y, Xu N, Yang H (2010) FPMR: MapReduce framework on FPGA. In: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays
10.
Zurück zum Zitat Polato I, Ré R, Goldman A, Kon F (2014) A comprehensive view of Hadoop research—a systematic literature review. J Netw Comput Appl 46:1–25CrossRef Polato I, Ré R, Goldman A, Kon F (2014) A comprehensive view of Hadoop research—a systematic literature review. J Netw Comput Appl 46:1–25CrossRef
11.
Zurück zum Zitat Mashayekhy L, Movahed Nejad M, Grosu D, Zhang Q, Shi W (2015) Energy-aware scheduling of mapreduce jobs for big data applications. IEEE Trans Parallel Distrib Syst 26(10):2720–2733CrossRef Mashayekhy L, Movahed Nejad M, Grosu D, Zhang Q, Shi W (2015) Energy-aware scheduling of mapreduce jobs for big data applications. IEEE Trans Parallel Distrib Syst 26(10):2720–2733CrossRef
12.
Zurück zum Zitat Chandramouli B, Goldstein J, Quamar A (2013) Scalable progressive analytics on big data in the cloud. Proc VLDB Endow 6:1726–1737CrossRef Chandramouli B, Goldstein J, Quamar A (2013) Scalable progressive analytics on big data in the cloud. Proc VLDB Endow 6:1726–1737CrossRef
13.
Zurück zum Zitat Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) MapReduce online. In Nsdi Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) MapReduce online. In Nsdi
14.
Zurück zum Zitat Wang Y, Shi W (2013) On optimal budget-driven scheduling algorithms for MapReduce jobs in the hetereogeneous cloud. Technical report TR-13–02, Carleton University Wang Y, Shi W (2013) On optimal budget-driven scheduling algorithms for MapReduce jobs in the hetereogeneous cloud. Technical report TR-13–02, Carleton University
15.
Zurück zum Zitat Goiri I, Bianchini R, Nagarakatte S, Nguyen TD (2015) Approxhadoop: bringing approximations to mapreduce frameworks. ACM SIGARCH Comput Archit News 43:383–397CrossRef Goiri I, Bianchini R, Nagarakatte S, Nguyen TD (2015) Approxhadoop: bringing approximations to mapreduce frameworks. ACM SIGARCH Comput Archit News 43:383–397CrossRef
16.
Zurück zum Zitat Ahmadvand H, Goudarzi M, Foroutan F (2019) Gapprox: using Gallup approach for approximation in big data processing. J Big Data 6(1):20CrossRef Ahmadvand H, Goudarzi M, Foroutan F (2019) Gapprox: using Gallup approach for approximation in big data processing. J Big Data 6(1):20CrossRef
17.
Zurück zum Zitat Vassiliadis V, Riehme J, Deussen J, Parasyris K, Antonopoulos CD, Bellas N, Lalis S, Naumann U (2016) Towards automatic significance analysis for approximate computing. In: 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Vassiliadis V, Riehme J, Deussen J, Parasyris K, Antonopoulos CD, Bellas N, Lalis S, Naumann U (2016) Towards automatic significance analysis for approximate computing. In: 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
18.
Zurück zum Zitat Chen Y, An A (2016) Approximate parallel high utility itemset mining. Big Data Res 6:26–42CrossRef Chen Y, An A (2016) Approximate parallel high utility itemset mining. Big Data Res 6:26–42CrossRef
19.
Zurück zum Zitat Zamani AR, AbdelBaky M, Balouek-Thomert D, Rodero I, Parashar M (2017) Supporting data-driven workflows enabled by large scale observatories. In: IEEE 13th International Conference on e-Science (e-Science), Auckland, New Zealand Zamani AR, AbdelBaky M, Balouek-Thomert D, Rodero I, Parashar M (2017) Supporting data-driven workflows enabled by large scale observatories. In: IEEE 13th International Conference on e-Science (e-Science), Auckland, New Zealand
20.
Zurück zum Zitat Zhang X, Wang J, Yin J (2016) Sapprox: enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling. Proc VLDB Endow 10(3):109–120CrossRef Zhang X, Wang J, Yin J (2016) Sapprox: enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling. Proc VLDB Endow 10(3):109–120CrossRef
21.
Zurück zum Zitat Li K, Li G (2018) Approximate query processing: what is new and where to go? Data Sci Eng 3(4):379–397CrossRef Li K, Li G (2018) Approximate query processing: what is new and where to go? Data Sci Eng 3(4):379–397CrossRef
22.
Zurück zum Zitat Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of the European Conference on Computer Systems (EuroSys) Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of the European Conference on Computer Systems (EuroSys)
23.
Zurück zum Zitat Zheng C, Zhan J, Jia Z, Zhang L (2013) Characterizing os behavior of scale-out data center workloads. In: The Seventh Annual Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA 2013) Zheng C, Zhan J, Jia Z, Zhang L (2013) Characterizing os behavior of scale-out data center workloads. In: The Seventh Annual Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA 2013)
24.
Zurück zum Zitat Lee Y, Lee Y (2011) Detecting ddos attacks with hadoop. In: Proceedings of The ACM CoNEXT Student Workshop Lee Y, Lee Y (2011) Detecting ddos attacks with hadoop. In: Proceedings of The ACM CoNEXT Student Workshop
25.
Zurück zum Zitat Thusoo A, Shao Z, Anthony S, Borthakur D, Jain N, Sarma JS, Murthy R, Liu H (2010) Data warehousing and analytics infrastructure at Facebook. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data Thusoo A, Shao Z, Anthony S, Borthakur D, Jain N, Sarma JS, Murthy R, Liu H (2010) Data warehousing and analytics infrastructure at Facebook. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data
26.
Zurück zum Zitat Kaur N, Sood SK (2017) Efficient resource management system based on 4Vs of big data streams. Big Data Research Kaur N, Sood SK (2017) Efficient resource management system based on 4Vs of big data streams. Big Data Research
27.
Zurück zum Zitat Jiang Y, Huang Z, Tsang DHK (2018) Towards max–min fair resource allocation for stream big data analytics in shared clouds. IEEE Trans Big Data 4(1):130–137CrossRef Jiang Y, Huang Z, Tsang DHK (2018) Towards max–min fair resource allocation for stream big data analytics in shared clouds. IEEE Trans Big Data 4(1):130–137CrossRef
28.
Zurück zum Zitat Kelley J, Stewart C, Morris N, Tiwari D, He Y, Elnikety S (2017) Obtaining and managing answer quality for online data-intensive services. ACM TOMPECS 2(2):11 Kelley J, Stewart C, Morris N, Tiwari D, He Y, Elnikety S (2017) Obtaining and managing answer quality for online data-intensive services. ACM TOMPECS 2(2):11
29.
Zurück zum Zitat Li C, Zhu L, Liu Y, Luo Y (2017) Resource scheduling approach for multimedia cloud content management. J Supercomput 73(12):5150–5172CrossRef Li C, Zhu L, Liu Y, Luo Y (2017) Resource scheduling approach for multimedia cloud content management. J Supercomput 73(12):5150–5172CrossRef
30.
Zurück zum Zitat Wang J, Zhang X, Yin J, Wang R, Wu H, Han D (2018) Speed up big data analytics by unveiling the storage distribution of sub-datasets. IEEE Trans Big Data 4(2):231–244CrossRef Wang J, Zhang X, Yin J, Wang R, Wu H, Han D (2018) Speed up big data analytics by unveiling the storage distribution of sub-datasets. IEEE Trans Big Data 4(2):231–244CrossRef
31.
Zurück zum Zitat Papadias D, Tao Y, Fu G, Seeger B (2005) Progressive skyline computation in database systems. ACM TODS 30:41–82CrossRef Papadias D, Tao Y, Fu G, Seeger B (2005) Progressive skyline computation in database systems. ACM TODS 30:41–82CrossRef
32.
Zurück zum Zitat Tan K-L, Eng P-K, Ooi BC (2001) Efficient progressive skyline computation. VLDB 1:301–310 Tan K-L, Eng P-K, Ooi BC (2001) Efficient progressive skyline computation. VLDB 1:301–310
33.
Zurück zum Zitat Zhang D, Du Y, Xia T, Tao Y (2006) Progressive computation of the min-dist optimal-location query. In: Proceedings of the 32nd International Conference on Very Large Data Bases Zhang D, Du Y, Xia T, Tao Y (2006) Progressive computation of the min-dist optimal-location query. In: Proceedings of the 32nd International Conference on Very Large Data Bases
34.
Zurück zum Zitat Krishnan DR, Quoc DL, Bhatotia P, Fetzer C, Rodrigues R (2016) IncApprox: a data analytics system for incremental approximate computing. In: Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee Krishnan DR, Quoc DL, Bhatotia P, Fetzer C, Rodrigues R (2016) IncApprox: a data analytics system for incremental approximate computing. In: Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee
35.
Zurück zum Zitat Conejero J, Corella S, Badia RM, Labarta J (2018) Task-based programming in COMPSs to converge from HPC to big data. Int J High Perform Comput Appl 32(1):45–60CrossRef Conejero J, Corella S, Badia RM, Labarta J (2018) Task-based programming in COMPSs to converge from HPC to big data. Int J High Perform Comput Appl 32(1):45–60CrossRef
37.
Zurück zum Zitat Mian R, Martin P, Vazquez-Poletti JL (2012) Provisioning data analytic workloads in a cloud. Future Gen Comput Syst 29(6):1452–1458CrossRef Mian R, Martin P, Vazquez-Poletti JL (2012) Provisioning data analytic workloads in a cloud. Future Gen Comput Syst 29(6):1452–1458CrossRef
38.
Zurück zum Zitat Malekimajd M, Ardagna D, Ciavotta M, Gianniti E, Passacantando M, Rizzi AM (2018) An optimization framework for the capacity allocation. J Supercomput 74(10):5314–5348CrossRef Malekimajd M, Ardagna D, Ciavotta M, Gianniti E, Passacantando M, Rizzi AM (2018) An optimization framework for the capacity allocation. J Supercomput 74(10):5314–5348CrossRef
40.
Zurück zum Zitat Cochran WG (2007) Sampling techniques. Wiley, HobokenMATH Cochran WG (2007) Sampling techniques. Wiley, HobokenMATH
41.
Zurück zum Zitat Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef
45.
Zurück zum Zitat Wang L, Zhan J, Luo C, Zhu Y, Yang Q, He Y, Gao W, Jia Z, Shi Y, Zhang S, Zheng C, Lu G, Zhan K, Li X, Qiu B (2014) Bigdatabench: a big data benchmark suite from internet services. In: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA) Wang L, Zhan J, Luo C, Zhu Y, Yang Q, He Y, Gao W, Jia Z, Shi Y, Zhang S, Zheng C, Lu G, Zhan K, Li X, Qiu B (2014) Bigdatabench: a big data benchmark suite from internet services. In: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
48.
Zurück zum Zitat Efron B, Tibshirani R (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1(1):54–75MathSciNetCrossRef Efron B, Tibshirani R (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1(1):54–75MathSciNetCrossRef
50.
Zurück zum Zitat Lohr SL (2009) Sampling: design and analysis. Cengage Learning, BostonMATH Lohr SL (2009) Sampling: design and analysis. Cengage Learning, BostonMATH
Metadaten
Titel
SAIR: significance-aware approach to improve QoR of big data processing in case of budget constraint
verfasst von
Hossein Ahmadvand
Maziar Goudarzi
Publikationsdatum
03.04.2019
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 9/2019
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-019-02797-7

Weitere Artikel der Ausgabe 9/2019

The Journal of Supercomputing 9/2019 Zur Ausgabe

Premium Partner