Published in: The Journal of Supercomputing 1/2020

07-11-2019

Use case-based evaluation of workflow optimization strategy in real-time computation system

Authors: Saima Gulzar Ahmad, Hikmat Ullah Khan, Samia Ijaz, Ehsan Ullah Munir

Abstract

With the start of the big data era, data stream computing has emerged as a well-known approach to optimizing data-intensive workflows. Apache STORM is an open-source, real-time distributed computation system for processing data streams and has been adopted by well-known organizations such as Twitter, Yahoo, Alibaba, Baidu, and Groupon. Workflows are implemented as topologies in STORM. The main factor that controls the execution performance of a workflow in STORM is the strategy used to schedule the topology components (spouts and bolts). In this paper, we evaluate and analyze the performance of our algorithm, the Partition-based Data-intensive Workflow optimization Algorithm (PDWA), in Apache STORM using a use case workflow, EURExpressII. This real-world application workflow builds a transcriptome-wide atlas of gene expression for the developing mouse embryo, established by ribonucleic acid (RNA) in situ hybridization. PDWA partitions the application task graph so that data movement between partitions is minimized. Each partition is then mapped onto one machine, which executes the tasks of that partition; this mapping minimizes the execution time of that particular partition. The algorithm also includes partial task duplication, which further enhances performance. A STORM-based computing cluster, deployed in an OpenStack cloud, serves as the computing environment. The performance of the PDWA-based optimizer is evaluated with data sets of different sizes. The results show that PDWA improves average execution time by 21% across different data set sizes and varying numbers of execution nodes. In addition, the comparative results show that, on average, the efficiency of PDWA is 20.4% higher than that of the STORM default scheduler (SDS).
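
The partitioning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' PDWA: it is a simple greedy heuristic, written here only to show what "minimizing data movement between partitions" means on a task graph. The task names, the edge data volumes, and the functions `cut_volume` and `greedy_partition` are all hypothetical, and the sketch ignores load balancing, heterogeneous machines, and partial task duplication.

```python
def cut_volume(edges, assignment):
    """Sum the data volume on edges whose endpoints lie in different partitions."""
    return sum(vol for (a, b), vol in edges.items()
               if assignment[a] != assignment[b])


def greedy_partition(tasks, edges, k):
    """Merge tasks along the heaviest edges until at most k groups remain.

    Illustrative heuristic only (assumption, not PDWA). If the graph has
    more than k disconnected components, more than k groups are returned.
    """
    parent = {t: t for t in tasks}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    groups = len(tasks)
    # Visit edges from the largest data volume to the smallest.
    for (a, b), _vol in sorted(edges.items(), key=lambda e: -e[1]):
        if groups <= k:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # keep the heavy edge inside one partition
            groups -= 1
    return {t: find(t) for t in tasks}


# Hypothetical four-task topology: one spout feeding a chain of three bolts.
tasks = ["spout", "bolt1", "bolt2", "bolt3"]
edges = {("spout", "bolt1"): 100, ("bolt1", "bolt2"): 10, ("bolt2", "bolt3"): 80}

assignment = greedy_partition(tasks, edges, k=2)
print(assignment)
print("inter-partition data volume:", cut_volume(edges, assignment))
```

In this toy graph the two heaviest edges are kept inside partitions, so only the lightest edge crosses a partition boundary. Each resulting group would then be mapped onto one machine, as the abstract describes.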

Metadata
Title
Use case-based evaluation of workflow optimization strategy in real-time computation system
Authors
Saima Gulzar Ahmad
Hikmat Ullah Khan
Samia Ijaz
Ehsan Ullah Munir
Publication date
07-11-2019
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 1/2020
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-019-03060-9
