Elsevier

Econometrics and Statistics

Volume 5, January 2018, Pages 148-170

Density estimation over spatio-temporal data streams

https://doi.org/10.1016/j.ecosta.2017.08.005

Abstract

In recent years, data have become extremely easy to collect in many scientific research fields, thanks to technological advances that have made online monitoring possible. In such situations, if real-time or online estimates are expected, the usual nonparametric techniques quickly become too time-consuming to compute and are therefore useless in practice. Adaptive counterparts of the classical kernel density estimators, which can be updated extremely easily when a new set of observations becomes available, are investigated for spatio-temporal processes with weak dependence structures. Mean square convergence, uniform almost sure convergence and a central limit result are obtained under general and easily verifiable conditions. The efficiency of the considered estimators is evaluated through simulations and a real data application. The results show that the proposed method works well within the framework of a spatio-temporal data stream.

Introduction

Density estimation is a common and useful technique in statistics. It is a fundamental problem in numerical analysis, data mining and many scientific research fields, and it is also necessary for some nonparametric prediction models. There are many circumstances under which it is essential to know the density function of a specific distribution, given a sequence of random variables identically drawn from it. For instance, by knowing the density of univariate or multivariate sample data, we can get an idea of the distribution of the sample and, consequently, calculate the mean, median and other essential quantities. For an extensive overview of the use of density estimation in statistical applications, we refer the reader to the recent work of Hwang and Shin (2016), which considers nonparametric kernel-type estimation for modes that maximize nonparametric kernel-type density estimators.

In this work, we are interested in estimating the density function of a multivariate spatio-temporal random process. Spatio-temporal data naturally arise in many fields, such as environmental sciences, geophysics, oceanography, soil science, econometrics, epidemiology, forestry, image processing and many others in which the phenomena of interest are continuous in space and time and the data are collected across time as well as space. A plethora of processes, such as atmospheric pollutant concentrations, precipitation fields and surface winds, are characterized by spatial and temporal variability. For some background in parametric spatial statistical modeling, refer to Ripley (1981), Cressie (1992), Guyon (1995), Anselin and Florax (2012) and the references therein. Nonparametric methods for spatial data have also been developed by many authors in past decades. For instance, kernel density estimators for spatial data have been discussed in Tran (1990) and abundantly studied in Hallin et al. (2001, 2004), Biau and Cadre (2004), Fazekas and Chuprunov (2006), Carbon et al. (2007), Dabo-Niang et al. (2011) and El Machkouri (2014). Recently, Dabo-Niang et al. (2014) proposed spatial density estimators for multivariate data that depend on two kernels, one of which controls the distance between observations and the other the spatial dependence structure. Additionally, Lu and Tjøstheim (2014) proposed nonparametric kernel estimators for density functions in the so-called “expanding-domain infill asymptotics” framework, which is discussed in Section 2.2. Nonparametric and semiparametric regression models have been considered by several authors, for instance Hallin et al. (2004, 2009), Gao et al. (2006), Lu et al. (2009), Robinson (2011), Jenish (2012) and Al-Sulami et al. (2017).

Complex issues arise in spatial analysis, many of which are neither clearly defined nor completely resolved, but form the basis of current research. Among the practical considerations that influence the techniques available for spatio-temporal data modeling is data dependency. In fact, spatial data are often dependent, and a spatial model must be able to handle this aspect. Note that linear models for spatio-temporal data only capture global linear relationships between spatial locations. However, in many circumstances the spatial dependency is not linear. A classical example is the spatial pattern of extreme events, as encountered in the economic analysis of poverty and in environmental sciences. In such situations, it is more appropriate to use a nonlinear spatial dependence measure, for example, the concept of strong mixing coefficients (Tran, 1990). To the best of our knowledge, the literature on nonparametric estimation techniques that incorporate nonlinear spatio-temporal dependency is not extensive compared to that on linear dependency. However, there has been substantial interest in spatial and spatio-temporal nonparametric techniques during the past decade. Here, we are interested in the asymptotic properties of nonparametric recursive density estimation for spatio-temporal processes.

The literature on spatio-temporal models is relatively abundant; see, for example, the books of Christakos (2000) and Cressie and Wikle (2015). Nonparametric models have recently been developed to study spatio-temporal data. Wang and Wang (2009) and Wang et al. (2012) proposed, respectively, a trend estimation model and a prediction model in a spatio-temporal context using a kernel weight function that accounts for the distance between sites. Al-Sulami et al. (2017) proposed a semiparametric nonlinear regression for irregularly located spatial time-series data.

In recent years, data collection from sensor networks has been facilitated by modern technology. Researchers are currently able to collect a large volume of data arriving continuously and at very high speed. In such cases, it is either unnecessary or impractical to save all the data on a disk. Such data are commonly referred to as streaming data or data streams. As an example, let us consider a sensor that sends a reading of the surface height of the ocean to a base station every tenth of a second. The data produced by this sensor are a stream of real numbers. Moreover, to learn something about ocean behavior, one sensor might not be sufficient, and we might want to deploy sensor networks, with each member sending a stream of data to the central node at a rate of ten data points per second. Processing and analyzing these data streams effectively and efficiently is an active challenge in computational statistics. Because decisions should be made as soon as data are received in many applications, traditional nonparametric techniques that require a lot of time are useless in practice if real-time forecasts are expected. That is why we consider, in this work, the density estimation problem in the context of spatio-temporal (sequential with respect to time) data streams. More precisely, we address recursive kernel estimators, where recursive means that the estimator calculated from the first $n+k$ observations, say $f_{n+k}$, is a function of only $f_n$ (i.e., the estimator based on the first $n$ observations) and the new data received by the user. Such a recursive property is very helpful in practice within the framework of streaming data, because it enables us to update the estimates as additional observations are received. From a practical point of view, this arrangement provides important savings in computational time and memory because updating the estimate does not require the history of the stream. This is not the case for the basic kernel estimator, which must be computed again using the whole history of the stream.
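To make the recursive property concrete, the following minimal sketch updates a univariate estimate on a fixed grid, one observation at a time, in the spirit of the Wagner and Wolverton (1969) recursion. It is not the spatio-temporal estimator studied in this paper: the Gaussian kernel, the bandwidth rule h_i = c i^(-1/5), and the names RecursiveKDE and gaussian_kernel are assumptions made purely for illustration.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard univariate Gaussian kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


class RecursiveKDE:
    """Recursive kernel density estimate on a fixed grid (illustrative sketch).

    Assumptions (not taken from the paper): univariate data, Gaussian kernel,
    and bandwidth h_i = c * i**(-1/5) attached to the i-th observation.
    """

    def __init__(self, grid, c=1.0):
        self.grid = np.asarray(grid, dtype=float)
        self.c = c
        self.n = 0                          # number of observations seen so far
        self.f = np.zeros_like(self.grid)   # current estimate f_n on the grid

    def update(self, batch):
        """Fold a new batch into the estimate without revisiting old data."""
        for x in np.atleast_1d(batch):
            self.n += 1
            h = self.c * self.n ** (-0.2)
            contrib = gaussian_kernel((self.grid - x) / h) / h
            # f_n = (1 - 1/n) * f_{n-1} + (1/n) * K((x - X_n) / h_n) / h_n
            self.f += (contrib - self.f) / self.n
        return self.f.copy()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = np.linspace(-6.0, 6.0, 241)
    est = RecursiveKDE(grid)
    for _ in range(5):                      # five batches arriving over time
        f_hat = est.update(rng.normal(size=100))
    print(est.n, float(f_hat.max()))
```

Only the current grid values and the counter n are stored, so the cost of an update depends on the size of the new batch and of the grid, not on the length of the stream seen so far.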

In the temporal case, the asymptotic results of the recursive estimators are highly competitive with the non-recursive results. Huang et al. (2014) studied the asymptotic properties of the recursive kernel density and regression estimators for a general class of stationary processes. The recursive density estimator for i.i.d. random variables has been studied by many authors, including Deheuvels (1973), Wegman and Davies (1979) and Wagner and Wolverton (1969), who established the quadratic mean convergence and strong consistency. Amiri (2009) generalized these studies to a general family of recursive density estimators indexed by a parameter ℓ ∈ [0, 1], which regulates the quality of the estimator with respect to its variance and estimation error. The mean square convergence and the asymptotic normality were studied by Masry (1986) under strong mixing, for the fixed values ℓ=1 and ℓ=1/2. Tran (1989) obtained the uniform convergence of this recursive estimator for ℓ=1 under α-mixing. The asymptotic normality under negative association was considered by Liang and Baek (2004) for ℓ=1. The above results were generalized to any ℓ ∈ [0, 1] under strong mixing with additional assumptions by Amiri (2009). Based on the work of Amiri (2009), Mezhoud et al. (2014) provided the variance and mean squared error of the recursive estimator under η-weak dependence. Amiri et al. (2016) studied a recursive density estimator in the spatial case with increasing domain asymptotics. Recent advances in the topic include Zhou et al. (2003), Cao et al. (2012), Xu et al. (2014) and Amiri et al. (2017), who proposed kernel density estimation methods over data streams.
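For orientation, the family indexed by ℓ is commonly written in the temporal setting as shown below, together with the one-step update it implies. The notation is an assumption made here for illustration (K a kernel on R^d, (h_i) a positive bandwidth sequence, X_1, ..., X_n the observations) and may differ from the notation used in the cited works and in the rest of this paper. Taking ℓ=1 recovers the update f_n = (1 - 1/n) f_{n-1} + (n h_n^d)^{-1} K((x - X_n)/h_n).

```latex
% Assumed notation: K kernel on R^d, (h_i) bandwidths, \ell \in [0,1].
f_n^{\ell}(x)
  = \frac{1}{\sum_{i=1}^{n} h_i^{d(1-\ell)}}
    \sum_{i=1}^{n} \frac{1}{h_i^{d\ell}}
    K\!\left(\frac{x - X_i}{h_i}\right),
\qquad
S_n := \sum_{i=1}^{n} h_i^{d(1-\ell)},

% One-step recursion obtained from S_n f_n^{\ell} = S_{n-1} f_{n-1}^{\ell} + h_n^{-d\ell} K(\cdot):
f_n^{\ell}(x)
  = \left(1 - \frac{h_n^{d(1-\ell)}}{S_n}\right) f_{n-1}^{\ell}(x)
    + \frac{1}{S_n\, h_n^{d\ell}}
      K\!\left(\frac{x - X_n}{h_n}\right).
```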

The present work extends the previous results by addressing nonparametric estimation of the probability density function of dependent spatio-temporal data using a recursive kernel approach. We derive a recursive version of the classic spatio-temporal kernel density estimator, study the asymptotic results of the estimator and present some numerical results.

The rest of this paper is organized as follows. Section 2 introduces the spatio-temporal data stream model and defines the recursive kernel density estimator in this context. Section 3 presents simulation results to evaluate the accuracy of the proposed estimator, and in Section 4 we demonstrate the usefulness of the proposed methodology in practice using real datasets. In Section 5, we present the asymptotic results for the recursive density estimator. The last section is devoted to the proofs and some auxiliary results.

Section snippets

Spatio-temporal data stream model

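Let $\mathbb{N}^N$ ($N \ge 1$) denote the real lattice points in $N$-dimensional Euclidean space. For any distinct locations (referred to as sites) $\mathbf{i}=(i_1,\dots,i_N)$, $\mathbf{j}=(j_1,\dots,j_N)\in\mathbb{N}^N$, the uniform distance between sites $\mathbf{i}$ and $\mathbf{j}$ is defined as $\|\mathbf{i}-\mathbf{j}\| := \max_{1\le k\le N}|i_k-j_k|$. We denote $\hat{\mathbf{i}} = \#\{\mathbf{j}\in\mathbb{N}^N : \mathbf{j}\le\mathbf{i}\}$.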
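As a small worked example, the uniform distance between two sites is the largest absolute difference between their coordinates. The helper name uniform_distance below is hypothetical and serves only to illustrate the definition.

```python
def uniform_distance(i, j):
    """Largest absolute coordinate difference between two lattice sites."""
    if len(i) != len(j):
        raise ValueError("sites must have the same number of coordinates")
    return max(abs(a - b) for a, b in zip(i, j))


# Coordinate differences are |1-3| = 2, |4-1| = 3 and |2-2| = 0, so the distance is 3.
print(uniform_distance((1, 4, 2), (3, 1, 2)))  # prints 3
```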

Let X be a d-dimensional random variable with d ≥ 1. Consider a spatio-temporal data stream for which the input data can be partitioned into a sequence of observed arrays of the form $W_{(\mathbf{s},t)} := \{X_{(\mathbf{s},t)}^{1},\dots,X_{(\mathbf{s},t)}^{k_{(\mathbf{s},t)}}\}$, $\mathbf{s}\in\mathbb{N}^N$, $t\in\mathbb{Z}$, with

Simulations

In this section, we present a simulation study to compare our recursive kernel density estimator with its natural competitor, introduced by Wang and Wang (2009). We discuss the case of spatio-temporal data streams, in which the data are continually captured over time at n fixed observation sites and real-time updates of the estimates are required. This scenario is provided by the stream package available in the R software

A real data example: Application to the Intel Lab dataset

In this section, our main objective is to demonstrate the usefulness of the proposed methodology in practice. To this end, we analyze the publicly available Intel Lab dataset, which contains data collected from 54 sensors deployed in the Intel Berkeley Research Laboratory. The sensor nodes are identified by numbers ranging from 1 to 54, and Fig. 5 shows the locations of the sensors in the laboratory.

The data consist of humidity, temperature, light and voltage measurements recorded every 31 s

Consistency results

In this paper, we are interested in the context where the sequence of data satisfies a weak dependence condition that is more general than mixing. Weak dependence is more widely applicable than many existing dependence measures, such as mixing, since it covers a large class of processes. For instance, mixing is considered to be useful for characterizing the dependence between time series data since it is fulfilled for many classes of processes and since it enables derivation of the same

Proofs

Throughout the proofs, C denotes a constant whose value is unimportant and may vary from line to line. Before we come to the proof of the main results, we state an auxiliary lemma that is a consequence of a result proved in Robison (1926) and is a generalization of Toeplitz’s lemma to the double-series framework.

Lemma 6.1

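Let $(w_{n,T})_{n \ge 1,\, T \ge 1}$ be a bounded sequence with finite limit $w$. If Assumption A5.3 holds, then $\frac{1}{nT}\sum_{k=1}^{n}\sum_{t=1}^{T}\left(\frac{h_{(k,t)}}{h_{(n,T)}}\right)^{r} w_{k,t} \to \beta_r w$ as $(n,T)\to\infty$.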
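The statement can be checked numerically under an assumed bandwidth rule. Assumption A5.3 is not reproduced in this excerpt, so the sketch below takes the hypothetical choice h_(k,t) = (kt)^(-a) with ar < 1. For this choice the weighted double average of the bandwidth ratios converges to (1 - ar)^(-2), which then plays the role of the limiting constant. The function name and the sequence w_(k,t) = w + 1/(k+t) are likewise illustrative assumptions.

```python
import numpy as np


def toeplitz_double_average(n, T, a=0.2, r=1.0, w_limit=2.0):
    """Weighted double average appearing in the lemma, for assumed bandwidths.

    Assumptions (hypothetical, for illustration only):
      h_(k,t) = (k*t)**(-a) and w_(k,t) = w_limit + 1/(k+t).
    """
    k = np.arange(1, n + 1, dtype=float)[:, None]
    t = np.arange(1, T + 1, dtype=float)[None, :]
    h = (k * t) ** (-a)             # bandwidth attached to index (k, t)
    h_nT = (n * T) ** (-a)          # bandwidth at the current index (n, T)
    w = w_limit + 1.0 / (k + t)     # bounded sequence converging to w_limit
    return float(np.mean((h / h_nT) ** r * w))


a, r, w_limit = 0.2, 1.0, 2.0
limit = w_limit / (1.0 - a * r) ** 2    # limiting constant times w for this choice
for size in (50, 400, 1600):
    print(size, toeplitz_double_average(size, size, a, r, w_limit), limit)
```

The printed averages should approach the value in the last column as n and T grow, which is the behaviour the lemma describes for this particular bandwidth sequence.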

Proof

If we set: ak,t(n,T)={1nT(h(k,t)h

Acknowledgment

We wish to express our appreciation to the Associate Editor and the referees for their helpful remarks and suggestions, which led to a substantially improved version of the paper.

References (53)

  • G.M. Robison

    Divergent double sequences and series

    Trans. Am. Math. Soc.

    (1926)
  • T. Wagner et al.

    Recursive estimates of probability density

    IEEE Trans. Syst. Sci. Cybern.

    (1969)
  • A. Amiri et al.

    On the estimation of the density of a directional data stream

    Scand. J. Stat.

    (2017)
  • L. Anselin et al.

    New directions in spatial econometrics

    (2012)
  • G. Biau et al.

    Nonparametric spatial prediction

    Stat. Inference Stoch. Process.

    (2004)
  • Y. Cao et al.

    SOMKE: Kernel density estimation over data streams by sequences of self-organizing maps

    IEEE Trans. Neural Netw. Learn. Syst.

    (2012)
  • G. Christakos

    Modern spatiotemporal geostatistics

    (2000)
  • N. Cressie

    Statistics for spatial data

    Terra Nova

    (1992)
  • N. Cressie et al.

    Statistics for spatio-temporal data

    (2015)
  • S. Dabo-Niang et al.

    A kernel spatial density estimation allowing for the analysis of spatial clustering. Application to Monsoon Asia Drought Atlas data

    Stoch. Environ. Res. Risk Assess.

    (2014)
  • S. Dabo-Niang et al.

    Kernel regression estimation for spatial functional random variables

    Far East J. Theor. Stat.

    (2011)
  • T.M. Davies et al.

    Adaptive kernel estimation of spatial relative risk

    Statist. Med.

    (2010)
  • J. Dedecker et al.

    Weak dependence: with examples and applications

    (2007)
  • J. Dedecker et al.

    Coupling for τ-dependent sequences and applications

    J. Theor. Probab.

    (2004)
  • J. Dedecker et al.

    New dependence coefficients. Examples and applications to statistics

    Probab. Theory Relat. Fields

    (2005)
  • P. Deheuvels

    Sur une famille d’estimateurs de la densité d’une variable aléatoire

    C. R. Acad. Sci. Paris Sér. A-B

    (1973)