Neurocomputing

Volume 277, 14 February 2018, Pages 21-28

Probabilistic group nearest neighbor query optimization based on classification using ELM

https://doi.org/10.1016/j.neucom.2017.05.095

Abstract

The probabilistic group nearest neighbor (PGNN) query, which returns all the uncertain objects whose probabilities of being the group nearest neighbor (GNN) result exceed a user-specified threshold, is widely used over uncertain databases. Most existing work for answering PGNN queries adopts a general framework consisting of three phases: spatial pruning, probabilistic pruning, and refinement. In the probabilistic pruning phase, a common method is to divide the uncertain regions into many partitions in order to derive tighter probability bounds. However, there is a tradeoff, controlled by the granularity of the partitions, between the computational cost of the probabilistic pruning phase and that of the refinement phase. In this paper, we study the problem of setting the optimal partition granularity for uncertain objects, and propose a new framework for PGNN queries based on granularity classification using ELM such that the overall cost is minimized. In addition, to improve the classification accuracy and make the classifier applicable to dynamic environments, a plurality voting method and a dynamic classification strategy are proposed, respectively. Extensive experiments show that, compared with the default partition granularities, the granularities chosen by the ELM classifiers are more appropriate, which further improves the performance of the PGNN query algorithm. In addition, ELM outperforms SVM with regard to both response time and classification accuracy.

Introduction

The group nearest neighbor (GNN) query, which returns the object that minimizes the sum of distances to multiple query points, is one of the important variants of the nearest neighbor (NN) query. This type of query has many applications in clustering [1], outlier detection [2] and facilities management [3], and has been studied extensively on certain data [3], [4], [5], [6].
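
As a point of reference for the GNN semantics on certain (precise) data, the following minimal Python sketch (ours, not taken from the cited works) returns the data point that minimizes the sum of Euclidean distances to a group of query points:

```python
import math

def gnn(objects, query_points):
    """Return the object minimizing the aggregate (sum of) Euclidean
    distances to all query points, i.e., the GNN of the query group."""
    def agg_dist(p):
        return sum(math.dist(p, q) for q in query_points)
    return min(objects, key=agg_dist)

# Example: three 2-D data points and two query points.
data = [(1.0, 1.0), (4.0, 4.0), (9.0, 2.0)]
queries = [(2.0, 2.0), (5.0, 3.0)]
print(gnn(data, queries))   # -> (4.0, 4.0)
```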

However, uncertain data are inherent in numerous emerging applications due to limitations of measuring equipment, delayed data updates, or privacy protection. The probabilistic GNN (PGNN) query over uncertain data, which finds all the uncertain objects whose probabilities of being the GNN result exceed a user-specified threshold, was proposed by Lian et al. [7]. Their algorithm bounds all the query points with an MBR to reduce the distance computation, and pre-computes several (1-β)-hyperspheres for each object in a preprocessing step. The idea is that an object can be pruned with probability at least (1-β) if its corresponding hypersphere is pruned.
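
To make the PGNN semantics concrete, the sketch below estimates each object's probability of being the GNN under a discrete instance model by sampling possible worlds. It only illustrates what the returned probabilities mean; it is not the hypersphere-based pruning algorithm of [7].

```python
import math, random

def pgnn_monte_carlo(uncertain_objects, query_points, threshold, trials=10_000):
    """Monte Carlo illustration of PGNN semantics: each uncertain object is a
    list of (instance, probability) pairs over points; an object is returned
    if its estimated probability of being the GNN exceeds the threshold."""
    def agg(p):
        # aggregate (sum) distance from point p to the whole query group
        return sum(math.dist(p, q) for q in query_points)

    wins = [0] * len(uncertain_objects)
    for _ in range(trials):
        # sample one possible world: draw one instance from every object
        world = [random.choices([inst for inst, _ in obj],
                                weights=[w for _, w in obj])[0]
                 for obj in uncertain_objects]
        # the object whose sampled instance has the smallest aggregate
        # distance is the GNN in this world
        wins[min(range(len(world)), key=lambda k: agg(world[k]))] += 1
    return [k for k, w in enumerate(wins) if w / trials > threshold]
```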

The algorithm above is not efficient if the objects have arbitrarily shaped uncertain regions or the query points are sparsely distributed. To overcome these limitations, two pruning algorithms were proposed in our previous work (Li et al. [8]). The spatial pruning algorithm utilizes the centroid point to efficiently filter out objects based on their spatial locations. Then, the probabilistic pruning algorithm derives tighter bounds by partitioning the uncertain objects and filters out objects by comparing these bounds with the threshold; a space partition structure called the BA-Quadtree is proposed to facilitate the partitioning process. Finally, the exact probabilities of the remaining objects being GNN results are calculated in the refinement phase.
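
The flavor of partition-based probabilistic pruning can be illustrated as follows. The helper `agg_min`, the `competitor_max_agg` value, and the single-competitor bound are simplifications of ours; the actual bounds derived in [8] are tighter.

```python
def prob_upper_bound(partitions, competitor_max_agg, agg_min):
    """Simplified partition-based upper bound on P(U_i is the GNN).
    `partitions` is a list of (region, prob_mass) pairs covering U_i;
    `agg_min(region)` lower-bounds the aggregate distance from any point of
    `region` to the query group; `competitor_max_agg` upper-bounds the
    aggregate distance of some other object. A partition whose minimum
    aggregate distance already exceeds the competitor's maximum can never
    host the GNN, so only the remaining probability mass contributes."""
    ub = sum(mass for region, mass in partitions
             if agg_min(region) <= competitor_max_agg)
    return ub        # U_i can be pruned if ub < threshold
```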

There is a tradeoff, controlled by the granularity of the partitions, between the computational cost of the probabilistic pruning phase and that of the refinement phase (these two phases are far more time-consuming than the spatial pruning phase). A finer granularity means more partitions, which strengthens the probabilistic pruning power and decreases the computational cost of the refinement phase, but more partitions require more computation in the probabilistic pruning phase; in the extreme case, each partition contains only one instance. Conversely, a coarser granularity means weaker probabilistic pruning power, so more uncertain objects must be further verified in the refinement phase; in the extreme case, each partition equals the whole uncertain region. The local index BA-Quadtree stores different levels of partitions with different granularities, and the depth of the quadtree is used to represent the partition granularity. The key to setting depths is to find a good tradeoff between pruning cost and refinement cost such that the overall cost is minimized. In the experiments of our previous work, the default depth of each object was set to half the height of its BA-Quadtree. In this paper, we further improve this algorithm by focusing on the problem of setting the depths of different objects dynamically according to the attributes of the objects and the queries.
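
Conceptually, choosing the depth amounts to minimizing the sum of two opposing costs. The toy model below (notation ours) makes the tradeoff explicit; in practice the two cost terms are not known in advance and depend on the query and the object, which is what motivates the learning approach of this paper.

```python
def choose_depth(candidate_depths, pruning_cost, refinement_cost):
    """Toy cost model for the granularity tradeoff: increasing the depth
    (finer partitions) raises the probabilistic pruning cost but lowers the
    refinement cost, so the best depth minimizes their sum. Both cost
    functions are placeholders supplied by the caller."""
    return min(candidate_depths,
               key=lambda d: pruning_cost(d) + refinement_cost(d))
```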

The optimal depth optd_i of an uncertain object U_i is the depth for which either a smaller or a larger value would increase the total query time of the probabilistic pruning phase and the refinement phase. Finding the optimal depth for each uncertain object poses several challenges, because both the computational cost and the probabilistic pruning power affect the overall cost, and both are in turn affected by many factors. First, the computational cost depends on the number and distribution of the uncertain objects, the number of partitions, the number and distribution of the instances, and so on. Second, setting the depth of an uncertain object requires considering not only whether its own partitions can be pruned, but also whether its partitions can prune the partitions of other objects, since both cases reflect the probabilistic pruning power. Third, whether a partition can be pruned is affected by many factors, such as the user-specified threshold, the probability distribution of the instances, the spatial relationship between two objects or two partitions, and the number and distribution of the query points.
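
Using the cost notation sketched above (ours, not the paper's), the optimal depth of U_i over a BA-Quadtree of height H_i can be written as

$$\mathrm{optd}_i \;=\; \operatorname*{arg\,min}_{d \,\in\, \{1,\dots,H_i\}} \Big( C_{\mathrm{prune}}(U_i, d) + C_{\mathrm{refine}}(U_i, d) \Big),$$

where $C_{\mathrm{prune}}(U_i, d)$ and $C_{\mathrm{refine}}(U_i, d)$ denote the time spent on $U_i$ in the probabilistic pruning and refinement phases when its partitions are taken at depth $d$.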

To address these challenges, we first analyse the factors that may affect the optimal depth, and then utilize a classifier to classify each uncertain region so that the optimal depth can be set according to the classification result. Since each uncertain object needs to be classified in real time according to the online query points, the classifier must offer fast learning and classification speed. To suit our needs, the Extreme Learning Machine (ELM) [9], [10] is adopted for classifying uncertain objects into their respective depth classes, because ELM has a faster learning speed and better generalization performance than neural networks and Support Vector Machines (SVMs) [11].
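
For completeness, a generic single-hidden-layer ELM can be sketched in a few lines of Python. Random hidden-layer parameters and a pseudo-inverse solve for the output weights follow the standard formulation of [9], [10]; the hidden-layer size and activation function used in our experiments may differ.

```python
import numpy as np

class ELM:
    """Minimal single-hidden-layer ELM classifier (generic sketch, not the
    exact configuration of the paper). Input weights and biases are random;
    only the output weights are learned, via the Moore-Penrose pseudo-inverse."""
    def __init__(self, n_hidden=64, seed=None):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # sigmoid activation of the random hidden layer
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)                       # e.g. depth classes
        T = (y[:, None] == self.classes_[None, :]).astype(float)  # one-hot targets
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        self.beta = np.linalg.pinv(self._hidden(X)) @ T    # output weights
        return self

    def predict(self, X):
        scores = self._hidden(np.asarray(X, float)) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]
```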

The main contributions of this paper are: (1) A new framework based on depth classification is proposed for the PGNN query (Section 3), where the optimal depths of the candidate objects are selected automatically for different queries. (2) Several useful features are selected for the PGNN query (Section 4.1). (3) A depth classification algorithm (DCA) using ELM is proposed, in which a plurality voting mechanism is utilized and a dynamic classification strategy is introduced (Section 4.2). (4) Extensive experiments are devised to study the performance of the classification and query algorithms (Section 5).

Section snippets

Preliminaries

In this section, the uncertain data model and the local index for uncertain objects, the BA-Quadtree, both proposed in our previous work, are first reviewed. Then the definitions of the GNN query, the PGNN query and the optimal depth are presented. Finally, the classifier used in this work, the extreme learning machine, is briefly introduced.

Framework for probabilistic group nearest neighbor (PGNN) query

The general framework for PGNN queries on uncertain data usually consists of three phases: spatial pruning, probabilistic pruning and refinement, and is widely adopted in processing probabilistic GNN queries [7], [8]. In this paper, a new framework is proposed, which contains an additional phase, named the setting phase, before the probabilistic pruning phase.
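
At a high level, the extended framework can be summarized by the following glue-code sketch. The four phases are passed in as callables because their internals are what Sections 3 and 4 describe; none of the names are literal APIs from the paper.

```python
def pgnn_query(query_points, objects, threshold,
               spatial_pruning, extract_features, depth_classifier,
               probabilistic_pruning, refinement):
    """Sketch of the proposed pipeline: spatial pruning shrinks the candidate
    set, the new setting phase lets the classifier pick a partition depth per
    candidate, probabilistic pruning runs at that depth, and the survivors
    are verified exactly in the refinement phase."""
    candidates = spatial_pruning(objects, query_points)
    depths = {obj: depth_classifier(extract_features(obj, query_points))
              for obj in candidates}
    survivors = probabilistic_pruning(candidates, depths, query_points, threshold)
    return refinement(survivors, query_points, threshold)
```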

Fig. 2 shows the details of the new framework for the PGNN query based on depth classification. The gray parts are newly proposed in this paper, and the remaining parts

Probabilistic group nearest neighbor (PGNN) query optimization

In this section, the details of the optimization algorithm for the PGNN query are described. First, the features used to classify the depth for the PGNN query are introduced. Then, the ELM-based depth classification algorithm (DCA) is proposed, in which the plurality voting method and the dynamic strategy are applied, respectively.
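
As an illustration of the plurality voting idea (a simplified sketch, not the exact DCA of Section 4.2), several independently trained ELMs each predict a depth class for the same feature vector and the most frequent prediction wins:

```python
from collections import Counter
import numpy as np

def plurality_vote(elm_ensemble, feature_vector):
    """Plurality voting over an ensemble of ELM classifiers exposing a
    scikit-learn-style predict() (as in the ELM sketch above): each member
    votes for a depth class and the most common vote is returned."""
    x = np.asarray(feature_vector, float).reshape(1, -1)
    votes = [int(elm.predict(x)[0]) for elm in elm_ensemble]
    return Counter(votes).most_common(1)[0][0]
```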

Experiment setup

In this section, we compare our ELM-classification-based algorithm with other PGNN query algorithms using both real and synthetic data sets. In particular, we use two real data sets, CAR and TCB, which contain the MBRs of 2,249,727 streets (polylines) of California and of 556,696 census blocks (polygons) of Iowa, Kansas, Missouri and Nebraska, respectively. The data space is normalized to the range [0, 1000] on each dimension, and these MBRs are treated as
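
The per-dimension normalization can be reproduced with a simple min-max rescaling; the snippet below is an assumption about the preprocessing, shown only to fix the [0, 1000] convention.

```python
import numpy as np

def normalize_space(coords, low=0.0, high=1000.0):
    """Min-max rescale every dimension of a coordinate array to [low, high].
    The exact preprocessing of the CAR/TCB MBR corners may differ."""
    coords = np.asarray(coords, dtype=float)
    mn, mx = coords.min(axis=0), coords.max(axis=0)
    return low + (coords - mn) / (mx - mn) * (high - low)
```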

Conclusion

In this paper, we study the problem of setting the optimal partition granularity for the probabilistic pruning algorithm of the PGNN query. A new framework with a setting phase, based on a granularity (depth) classification method, is proposed. In particular, a depth classification algorithm using ELM is proposed to set the optimal depths of the candidate objects. Additionally, in order to adapt to changes in the environment, a dynamic classification strategy is proposed to retrain the ELM classifier

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China under Grant nos. 61502317 and 61502316, the Natural Science Foundation of Liaoning Province of China under Grant nos. 201602559 and 201602568, and the Doctor Startup Fund of Shenyang Aerospace University of China under Grant no. 15YB04.

References (22)

  • X. Lian et al., Probabilistic group nearest neighbor queries in uncertain databases, IEEE Trans. Knowl. Data Eng. (TKDE), 2008.

LI Jiajia was born in 1987. She received the B.S. degree in software engineering from Northeast Normal University in 2008, and the M.S. and Ph.D. degrees in computer science from Northeastern University in 2010 and 2014, respectively. She is now a lecturer and master supervisor at the College of Computer Science, Shenyang Aerospace University, and leads one National Natural Science Foundation of China project and one Natural Science Foundation of Liaoning Province of China project. Her research interests include spatial-temporal databases, uncertain data management and machine learning.

XIA Xiufeng was born in 1964. He received his B.S. degree in computer application from Shenyang Institute of Aeronautical Engineering in 1985, and the M.S. and Ph.D. degrees in computer science from Northeastern University in 1991 and 2005, respectively. He is currently a professor and master supervisor at the College of Computer Science, Shenyang Aerospace University. His research interests include data warehouses, NoSQL and product data management.

LIU Xiangyu was born in 1981. He received his B.Sc., M.Sc. and Ph.D. degrees from Northeastern University in 2004, 2007 and 2014, respectively, all in computer science. He is currently a lecturer and master supervisor at the College of Computer Science, Shenyang Aerospace University. His research interests include social networks, privacy protection and machine learning.

WANG Botao was born in 1968. He received the Ph.D. degree in computer science from Kyushu University in 2000. He is currently a professor at the Department of Information Science and Engineering, Northeastern University. His research interests include machine learning, cloud computing, spatial-temporal databases, publish/subscribe systems, data streams and mobile data management.

ZHOU Dahai was born in 1979. He received his B.S. and M.S. degrees in computer application from Shenyang Institute of Aeronautical Engineering in 2001 and 2008, respectively. He is currently a lecturer at the College of Computer Science, Shenyang Aerospace University. His research interests include data warehouses, product data management and big data management.

AN Yunzhe was born in 1978. He received the B.S. degree in computer application from Changchun University of Science and Technology in 2002 and the M.S. degree in computer software and theory from Northeastern University in 2008. He is currently a lecturer at the College of Computer Science, Shenyang Aerospace University. His research interests include data integration, Oracle database management and machine learning.
