Skip to main content
Top
Published in: Knowledge and Information Systems 4/2021

05-02-2021 | Regular Paper

AQUA+: Query Optimization for Hybrid Database-MapReduce System

Published in: Knowledge and Information Systems | Issue 4/2021

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

MapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. However, there are many existing applications maintaining their data in a distributed database. It is costly to export those data into the storage system of MapReduce (normally a distributed file system). Moreover, compared to MapReduce, database is equipped with many state-of-the-art techniques, such as index and optimizer. Therefore, a hybrid Database-MapReduce system inheriting the advantages of both systems is preferred. In this paper, we propose AQUA+, a query optimizer tailored for the hybrid system. AQUA+ is an extension work of our previous system AQUA. It generates a plan that adaptively assigns the operators to the database engine and MapReduce engine to optimize the performance. The intuition is to exploit the index, co-partition and other features provided by the database as much as possible and reduce the data volume processed by the MapReduce. Due to the complexity of query optimization, in AQUA+, we introduce a novel tuning technique, learning to optimize. In particular, two neural networks are trained to predict cost and refine query plan, respectively. We train them based on our log of real query processing. Experiments carried out on our in-house cluster confirm the effectiveness of our query optimizer.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literature
2.
go back to reference Abouzied A, Bajda-Pawlikowski K, Huang J, Abadi DJ, Silberschatz A (2010) Hadoopdb in action: building real world applications, in ‘SIGMOD Conference’, pp. 1111–1114 Abouzied A, Bajda-Pawlikowski K, Huang J, Abadi DJ, Silberschatz A (2010) Hadoopdb in action: building real world applications, in ‘SIGMOD Conference’, pp. 1111–1114
3.
go back to reference Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment, In Proceedings of the 13th International Conference on Extending Database Technology (pp. 99-110) Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment, In Proceedings of the 13th International Conference on Extending Database Technology (pp. 99-110)
4.
go back to reference Bernstein PA, Goodman N, Wong E, Reeve CL, Rothnie JB Jr (1981) Query processing in a system for distributed databases (sdd-1). ACM Trans. Database Syst. 6(4):602–625CrossRef Bernstein PA, Goodman N, Wong E, Reeve CL, Rothnie JB Jr (1981) Query processing in a system for distributed databases (sdd-1). ACM Trans. Database Syst. 6(4):602–625CrossRef
5.
go back to reference Bittorf M, Bobrovytsky T, Erickson C C A C J, Hecht M GD, Kuff M J I JL, Leblang D KA, Robinson N L I PH, Rus D RS, Wanderman J R D TS, Yoder MM (2015) Impala: A modern, open-source sql engine for hadoop, in Proceedings of the 7th Biennial Conference on Innovative Data Systems Research Bittorf M, Bobrovytsky T, Erickson C C A C J, Hecht M GD, Kuff M J I JL, Leblang D KA, Robinson N L I PH, Rus D RS, Wanderman J R D TS, Yoder MM (2015) Impala: A modern, open-source sql engine for hadoop, in Proceedings of the 7th Biennial Conference on Innovative Data Systems Research
6.
go back to reference Camacho-Rodríguez J, Chauhan A, Gates A, Koifman E, O’Malley O, Garg V, Haindrich Z, Shelukhin S, Jayachandran P, Seth S, Jaiswal D, Bouguerra S, Bangarwa N, Hariappan S, Agarwal A, Dere J, Dai D, Nair T, Dembla N, Vijayaraghavan G, Hagleitner G (2019) Apache hive: From mapreduce to enterprise-grade big data warehousing, in P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande and T. Kraska, eds, ‘Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019’, ACM, pp. 1773–1786. https://doi.org/10.1145/3299869.3314045 Camacho-Rodríguez J, Chauhan A, Gates A, Koifman E, O’Malley O, Garg V, Haindrich Z, Shelukhin S, Jayachandran P, Seth S, Jaiswal D, Bouguerra S, Bangarwa N, Hariappan S, Agarwal A, Dere J, Dai D, Nair T, Dembla N, Vijayaraghavan G, Hagleitner G (2019) Apache hive: From mapreduce to enterprise-grade big data warehousing, in P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande and T. Kraska, eds, ‘Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019’, ACM, pp. 1773–1786. https://​doi.​org/​10.​1145/​3299869.​3314045
7.
go back to reference Chaudhuri S (1998) An overview of query optimization in relational systems, in PODS, pp. 34–43 Chaudhuri S (1998) An overview of query optimization in relational systems, in PODS, pp. 34–43
8.
go back to reference Chen M-S, Yu PS, Wu K-L (1996) Optimization of parallel execution for multi-join queries. IEEE Trans on Knowl and Data Eng 8(3):416–428CrossRef Chen M-S, Yu PS, Wu K-L (1996) Optimization of parallel execution for multi-join queries. IEEE Trans on Knowl and Data Eng 8(3):416–428CrossRef
9.
go back to reference Chen S (2010) Cheetah: a high performance, custom data warehouse on top of mapreduce. Proc. VLDB Endow. 3(1–2):1459–1468CrossRef Chen S (2010) Cheetah: a high performance, custom data warehouse on top of mapreduce. Proc. VLDB Endow. 3(1–2):1459–1468CrossRef
10.
go back to reference Chen Z, Gehrke J, Korn F (2001) Query optimization in compressed database systems, in Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21-24, 2001, pp. 271–282. https://doi.org/10.1145/375663.375692 Chen Z, Gehrke J, Korn F (2001) Query optimization in compressed database systems, in Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21-24, 2001, pp. 271–282. https://​doi.​org/​10.​1145/​375663.​375692
11.
go back to reference Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) Mapreduce online. In NSDI 10:21–29 Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) Mapreduce online. In NSDI 10:21–29
12.
go back to reference Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters, in OSDI, 137–150 Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters, in OSDI, 137–150
13.
go back to reference Franklin MJ, Jónsson BT, Kossmann D (1996) Performance tradeoffs for client-server query processing. SIGMOD Rec. 25(2):149–160CrossRef Franklin MJ, Jónsson BT, Kossmann D (1996) Performance tradeoffs for client-server query processing. SIGMOD Rec. 25(2):149–160CrossRef
14.
go back to reference Friedman E, Pawlowski PM, Cieslewicz J (2009) Sql/mapreduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB 2(2):1402–1413 Friedman E, Pawlowski PM, Cieslewicz J (2009) Sql/mapreduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB 2(2):1402–1413
15.
go back to reference Ganguly S, Hasan W, Krishnamurthy R (1992) Query optimization for parallel execution. SIGMOD Rec. 21(2): Ganguly S, Hasan W, Krishnamurthy R (1992) Query optimization for parallel execution. SIGMOD Rec. 21(2):
16.
go back to reference Hausenblas M, Nadeau J (2013) Apache drill: interactive ad-hoc analysis at scale. Big Data 1(2):100–104CrossRef Hausenblas M, Nadeau J (2013) Apache drill: interactive ad-hoc analysis at scale. Big Data 1(2):100–104CrossRef
20.
go back to reference Lin Y, Agrawal D, Chen C, Ooi BC, Wu S (2011) Llama: leveraging columnar storage for scalable join processing in the mapreduce framework, in SIGMOD Conference, 961–972 Lin Y, Agrawal D, Chen C, Ooi BC, Wu S (2011) Llama: leveraging columnar storage for scalable join processing in the mapreduce framework, in SIGMOD Conference, 961–972
21.
go back to reference Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing, in SIGMOD Conference, 1099–1110 Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing, in SIGMOD Conference, 1099–1110
22.
go back to reference Ozsu MT (2007) Principles of distributed database systems, 3rd edn. Prentice Hall Press, NJ, USA Ozsu MT (2007) Principles of distributed database systems, 3rd edn. Prentice Hall Press, NJ, USA
23.
go back to reference Poosala V, Haas PJ, Ioannidis YE, Shekita EJ (1996) Improved histograms for selectivity estimation of range predicates, in SIGMOD Conference, 294–305 Poosala V, Haas PJ, Ioannidis YE, Shekita EJ (1996) Improved histograms for selectivity estimation of range predicates, in SIGMOD Conference, 294–305
24.
go back to reference Stewart RJ, Trinder PW, Loidl H-W (2011) Comparing high level mapreduce query languages, in APPT, 58–72 Stewart RJ, Trinder PW, Loidl H-W (2011) Comparing high level mapreduce query languages, in APPT, 58–72
25.
go back to reference Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, 1556–1566. https://www.aclweb.org/anthology/P15-1150/ Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, 1556–1566. https://​www.​aclweb.​org/​anthology/​P15-1150/​
26.
go back to reference Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive - a warehousing solution over a map-reduce framework. PVLDB 2(2):1626–1629 Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive - a warehousing solution over a map-reduce framework. PVLDB 2(2):1626–1629
27.
go back to reference bibitemhiveicde Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Anthony S, Liu H, Murthy R (2010) Hive - a petabyte scale data warehouse using hadoop, In ICDE, 996–1005 bibitemhiveicde Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Anthony S, Liu H, Murthy R (2010) Hive - a petabyte scale data warehouse using hadoop, In ICDE, 996–1005
28.
go back to reference Traverso M (2013) Presto: Interacting with petabytes of data at facebook. Retrieved February 4, 2014 Traverso M (2013) Presto: Interacting with petabytes of data at facebook. Retrieved February 4, 2014
29.
go back to reference Wu S, Li F, Mehrotra S, Ooi BC (2011) Query optimization for massively parallel data processing, In SoCC , 12 Wu S, Li F, Mehrotra S, Ooi BC (2011) Query optimization for massively parallel data processing, In SoCC , 12
Metadata
Title
AQUA+: Query Optimization for Hybrid Database-MapReduce System
Publication date
05-02-2021
Published in
Knowledge and Information Systems / Issue 4/2021
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-020-01542-4

Other articles of this Issue 4/2021

Knowledge and Information Systems 4/2021 Go to the issue

Premium Partner