Published in: GeoInformatica 1/2019

22.10.2018

Spatial data management in apache spark: the GeoSpark perspective and beyond

Authors: Jia Yu, Zongsi Zhang, Mohamed Sarwat

Abstract

This paper presents the design and development of GeoSpark, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. It analyzes in detail the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-Tree, Quad-Tree, and KDB-Tree. It also shows how building a local spatial index, e.g., an R-Tree or Quad-Tree, on each Spark data partition can speed up local computation and hence decrease the overall runtime of a spatial analytics program. Furthermore, the paper presents a comprehensive experimental analysis that surveys and evaluates the performance of running de facto spatial operations, such as spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries, in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GeoSpark runs up to two orders of magnitude faster than existing Hadoop-based systems and up to an order of magnitude faster than Spark-based systems.
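
The idea of pruning partitions with a uniform grid before running a spatial range query can be illustrated with a small, self-contained Python sketch. It is not GeoSpark's API or implementation: the function names (grid_cell, partition_points, range_query), the square n x n grid, and the in-memory dictionary standing in for Spark data partitions are all assumptions made for illustration.

```python
from collections import defaultdict


def grid_cell(x, y, bounds, n):
    """Map a point to the (col, row) cell of an n x n uniform grid.

    bounds is (min_x, min_y, max_x, max_y); coordinates outside the
    bounds are clamped to the nearest border cell.
    """
    min_x, min_y, max_x, max_y = bounds
    col = int((x - min_x) / (max_x - min_x) * n)
    row = int((y - min_y) / (max_y - min_y) * n)
    return max(0, min(col, n - 1)), max(0, min(row, n - 1))


def partition_points(points, bounds, n):
    """Group points by grid cell; each cell plays the role of one partition."""
    parts = defaultdict(list)
    for x, y in points:
        parts[grid_cell(x, y, bounds, n)].append((x, y))
    return parts


def range_query(parts, bounds, n, window):
    """Return the points inside window = (qx1, qy1, qx2, qy2).

    Only cells that overlap the window are scanned; all other
    partitions are skipped without inspecting their points.
    """
    qx1, qy1, qx2, qy2 = window
    c1, r1 = grid_cell(qx1, qy1, bounds, n)
    c2, r2 = grid_cell(qx2, qy2, bounds, n)
    hits = []
    for col in range(c1, c2 + 1):
        for row in range(r1, r2 + 1):
            for x, y in parts.get((col, row), []):
                if qx1 <= x <= qx2 and qy1 <= y <= qy2:
                    hits.append((x, y))
    return hits


if __name__ == "__main__":
    bounds = (0.0, 0.0, 10.0, 10.0)  # hypothetical data extent
    points = [(1, 1), (2, 8), (5, 5), (9, 9), (7, 3)]  # made-up sample points
    parts = partition_points(points, bounds, n=4)
    print(sorted(range_query(parts, bounds, 4, (4, 2, 8, 6))))
```

In this sketch each overlapping cell is scanned linearly. A local index of the kind the abstract mentions (R-Tree or Quad-Tree) would replace that inner scan within each partition; a uniform grid is only the simplest of the four partitioning techniques listed.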

Metadata
Title
Spatial data management in apache spark: the GeoSpark perspective and beyond
Authors
Jia Yu
Zongsi Zhang
Mohamed Sarwat
Publication date
22.10.2018
Publisher
Springer US
Published in
GeoInformatica / Issue 1/2019
Print ISSN: 1384-6175
Electronic ISSN: 1573-7624
DOI
https://doi.org/10.1007/s10707-018-0330-9