An efficient algorithm for bulk-loading xBR+-trees

https://doi.org/10.1016/j.csi.2017.05.003Get rights and content

Highlights

  • An algorithm (PBL+) for bulk-loading a spatial index, the xBR+-tree, is presented.

  • PBL+ is an improved version of PBL, the first algorithm for bulk-loading xBR+-trees.

  • An experimental comparison of PBL+, PBL and STR (an R-trees algorithm) is presented.

  • PBL+ is faster than PBL and STR and creates better xBR+-trees than PBL.

  • xBR+-trees of PBL+ process big data queries more efficiently than R-trees of STR.

Abstract

A major part of the interface to a database is made up of the queries that can be addressed to this database and answered (processed) in an efficient way, contributing to the quality of the developed software. Efficiently processed spatial queries constitute a fundamental part of the interface to spatial databases due to the wide area of applications that may address such queries, like geographical information systems (GIS), location-based services, computer visualization, automated mapping, facilities management, etc. Another important capability of the interface to a spatial database is to offer the creation of efficient index structures to speed up spatial query processing. The xBR+-tree is a balanced disk-resident quadtree-based index structure for point data, which is very efficient for processing such queries. Bulk-loading refers to the process of creating an index from scratch, when the dataset to be indexed is available beforehand, instead of creating the index gradually (and more slowly), when the dataset elements are inserted one-by-one. In this paper, we present an algorithm for bulk-loading xBR+-trees for big datasets residing on disk, using a limited amount of main memory. The resulting tree is not only built fast, but exhibits high performance in processing a broad range of spatial queries, where one or two datasets are involved. To justify these characteristics, using real and artificial datasets of various cardinalities, first, we present an experimental comparison of this algorithm vs. a previous version of the same algorithm and STR, a popular algorithm of bulk-loading R-trees, regarding tree creation time and the characteristics of the trees created, and second, we experimentally compare the query efficiency of bulk-loaded xBR+-trees vs. bulk-loaded R-trees, regarding I/O and execution time. Thus, this paper contributes to the implementation of spatial database interfaces and the efficient storage organization for big spatial data management.

Introduction

The various types of queries that can be addressed to a database and answered (processed) in an efficient way make up a key part of the interface to this database. Spatial queries that can be efficiently processed constitute a fundamental part of the interface to spatial databases due to the wide area of applications that may address such queries, like geographical information systems (GIS), location-based services, computer visualization, automated mapping, facilities management, etc. One of the most important characteristics of the interface to a spatial database is to provide the creation of spatial indexes that can be used for the execution of efficient algorithms for spatial queries [1], [2] on big data. In many cases, the efficiency of these index structures depends on whether the execution time of spatial queries without such indexes is less than the total time to execute them when the time to create such indexes is added. For example, these spatial indexes can be created from non-indexed datasets or from intermediate results of the query executions, and such a creation process must not be too time-consuming. To carry out this creation task, a bulk-loading algorithm of a spatial index is needed. A bulk-loading algorithm creates an index from scratch, when all data are known a priori, and therefore it is based on more information than the standard insertion algorithm, where all data are inserted one-by-one. By using this additional knowledge, the bulk-loading algorithm can produce a spatial index (1) in smaller creation time, (2) with better storage utilization, and (3) with better performance of spatial query processing. These three issues should be taken into account in the design of an efficient bulk-loading algorithm for a particular spatial index structure.

If the dataset for which a spatial index is needed is static (i.e. when insertions and deletions are rarely executed or even they are not performed at all), then we can devote efforts on building the spatial index in a way that permits the efficient execution of queries. However, it is also desirable that the creation time of the index is as small as possible. Both these goals become even more important when this dataset is big. The fast construction of an optimized spatial index structure regarding certain index characteristics (e.g. storage overhead minimization, storage utilization maximization, etc.) is an interesting challenge, since it is anticipated that, due to these characteristics, spatial query processing performance will be improved. The creation of indexes by inserting elements one-by-one is less efficient than bulk-loading algorithms that can be executed for creating the indexes with the same complexity as external sorting [3].

Bulk-loading refers to the process of creating an index from scratch for a given dataset [3], and here we use this term to characterize the process of building a disk-based spatial index for an entire dataset. Bulk-loading is especially necessary and useful when an index has to be built up for the first time, knowing the data in advance and the datasets are large enough [4]. The bulk-loading methods can be classified into three categories: sort-based, buffer-based and sampled-based methods [3]. The sorted-based methods are the ones most studied in the literature and they are widely used in popular spatial DBMS [5], [6], [7], because they roughly consist of sorting the data and create the tree in a bottom-up fashion.

In this research work, we study the efficiency of building quadtree-based index structures. In particular, we focus on the xBR+-tree [8], a balanced disk-based index structure for point data that belongs to the Quadtree family, and hierarchically decomposes the space in a regular manner. The xBR+-tree improves the xBR-tree [9], [10] regarding nodes structure and the splitting process of nodes. Moreover, it outperforms xBR-trees and R*-trees [11] with respect to several well-known spatial queries, such as Point Location, Window Query, K-Nearest Neighbor, etc.

Material that contributes to the foundation of this research work appeared in [12]. The algorithm proposed in that paper (called PBL) was the first algorithm for bulk-loading xBR+-trees for large datasets residing on disk, using a limited amount of main memory. This algorithm belongs to sort-based bulk-loading methods, since it sorts, according to z-order, groups of data items that either fit in the available internal memory, or in a disk page, depending on the phase of the algorithm. This sorting is performed in parallel to tree building. This algorithm improves the index structure creation process with respect to the xBR+-tree algorithm of loading the elements one-by-one, and allowed better spatial query processing of single dataset spatial queries (Point Location - PLQ, Window Query - WQ, Distance Range Query - DRQ, K-Nearest Neighbors Query - KNNQ, Constrained K-Nearest Neighbors Query - CKNNQ) and of dual dataset spatial queries (K-Closest Pairs Query - KCPQ and Distance Join Query - εDJQ).

In this paper, we present PBL+, an improvement of the PBL bulk-loading algorithm for bulk loading xBR+-trees [12]. The new algorithm, like its predecessor, consists of four phases, however, in the new algorithm, all the phases of [12] have been improved so that the xBR+-tree is created by making better use of the available main memory, faster and with improved structural characteristics and reduced size that lead to increased performance in processing queries. Moreover, using real and artificial datasets of various (small and big) cardinalities, first, we present an experimental comparison of the new algorithm (PBL+) vs. the algorithm of [12] (PBL) and Sort-Tile-Recursive (STR) [13], a popular algorithm of bulk-loading R-trees, regarding tree creation time and the characteristics of the trees created, and second, we experimentally compare the query efficiency of xBR+-trees created by the new algorithm vs. R-trees created by STR, regarding I/O and execution time. Thus, through improving spatial storage structures and the efficiency of processing spatial queries, this paper contributes to the implementation of spatial database interfaces and the quality of software for big spatial data management.

This paper is organized as follows. In Section 2, we review related work on bulk-loading techniques, the PBL algorithm, its key differences to the PBL+ algorithm and the most important characteristics of xBR+-trees, from the implementation point of view. Section 3 presents our new bulk-loading method for the xBR+-tree. In Section 4, we present and discuss the results of our experiments. And finally, Section 5 provides the conclusions arising from our work and discusses related future work directions.

Section snippets

Related work and background

This section reviews previous bulk-loading methods in general, whose main target is to reduce the creation time, the storage utilization and the query cost of the resulting index structure, or all of them. In [3] the bulk-loading methods are roughly classified into three categories: sort-based, buffer-based and sampled-based methods.

  • The sort-based bulk-loading methods are characterized by the following two steps: first, the dataset is sorted and second, the tree is built in a bottom-up

Bulk-loading xBR+-tree (the PBL+ algorithm)

In this section, we present the method, PBL+, that we developed for bulk-loading xBR+-trees. This method consists of four phases.

Experimental results

We designed and run a large set of experiments to compare the Process of Bulk-Loading xBR+-trees (PBL) of [12], the Process of Bulk-Loading xBR+-trees as presented in Section 3 (PBL+) and the process of Bulk-Loading R-trees (STR) as published in [13]. The STR algorithm is a very popular and fast, frequently used as base line, algorithm for Bulk-Loading comparisons. We used 6 real spatial datasets of North America, representing cultural landmarks (NAclN with 9203 points) and populated places

Conclusions, discussion and future work

In this paper, we presented an improvement of the algorithm of [12] (PBL), which was the first algorithm for bulk-loading xBR+-trees for large datasets residing on disk, using a limited amount of RAM. The new algorithm (PBL+), improves all the four phases of its predecessor (it reduces movements of data items, makes better use of internal memory and achieves better partitioning of space).

Using real and artificial datasets of various cardinalities, we presented an extensive experimental

Acknowledgments

Work of Antonio Corral, Michael Vassilakopoulos and Yannis Manolopoulos funded by the MINECO research project (TIN2013-41576-R).

References (42)

  • R.H. Güting

    An introduction to spatial database systems

    VLDB J.

    (1994)
  • S. Shekhar et al.

    Spatial Databases - A Tour

    (2003)
  • J.V. den Bercken et al.

    An evaluation of generic bulk loading techniques

    VLDB Conference

    (2001)
  • H. Tan et al.

    On packing very large r-trees

    MDM Conference

    (2012)
  • N. An et al.

    Improving performance with bulk-inserts in oracle r-trees

    VLDB Conference

    (2003)
  • M.Y. Eltabakh et al.

    Space-partitioning trees in postgresql: realization and performance

    ICDE Conference

    (2006)
  • Y. Fang et al.

    Spatial indexing in microsoft SQL server 2008

    SIGMOD Conference

    (2008)
  • G. Roumelis et al.

    The xBR+-tree: an efficient access method for points

    DEXA Conference

    (2015)
  • G. Roumelis et al.

    Performance comparison of xbr-trees and r*-trees for single dataset spatial queries

    ADBIS Conference

    (2011)
  • M. Vassilakopoulos et al.

    External balanced regular (x-br) trees: new structures for very large spatial databases

    Advances in Informatics: Selected papers of the 7th Panhellenic Conf. on Informatics

    (2000)
  • N. Beckmann et al.

    The r*-tree: an efficient and robust access method for points and rectangles

    SIGMOD Conference

    (1990)
  • G. Roumelis et al.

    Bulk-loading xBR+-trees

    MEDI Conference

    (2016)
  • S.T. Leutenegger et al.

    Str: a simple and efficient algorithm for r-tree packing

    ICDE Conference

    (1997)
  • L. Arge et al.

    Efficient bulk operations on dynamic r-trees

    Algorithmica

    (2002)
  • D. Achakeev et al.

    Sort-based query-adaptive loading of r-trees

    CIKM Conference

    (2012)
  • P. Ciaccia et al.

    Bulk loading the m-tree

    ADC Conference

    (1998)
  • C.A. Lang et al.

    Modeling high-dimensional index structures using sampling

    SIGMOD Conference

    (2001)
  • N. Roussopoulos et al.

    Direct spatial search on pictorial databases using packed r-trees

    SIGMOD Conference

    (1985)
  • I. Kamel et al.

    On packing r-trees

    CIKM Conference

    (1993)
  • D.J. DeWitt et al.

    Client-server paradise

    VLDB Conference

    (1994)
  • D. Achakeev et al.

    Sort-based parallel loading of r-trees

    BigSpatial Workshop

    (2012)
  • Cited by (4)

    View full text