Density estimation over spatio-temporal data streams
Introduction
Density estimation is a common and useful technique in statistics. It is a fundamental problem in numerical analysis, data mining and many scientific research fields, and it is also necessary for some nonparametric prediction models. There are many circumstances under which it is essential to know the density function of a specific distribution, given a sequence of random variables identically drawn from it. For instance, by knowing the density of univariate or multivariate sample data, we can characterize the distribution of the sample and, consequently, calculate the mean, median and other essential quantities. For an extensive overview of the use of density estimation in statistical applications, we refer the reader to the recent work of Hwang and Shin (2016), which considers nonparametric estimation of modes defined as maximizers of nonparametric kernel-type density estimators.
In this work, we are interested in estimating the density function of a multivariate spatio-temporal random process. Spatio-temporal data naturally arise in many fields, such as environmental sciences, geophysics, oceanography, soil science, econometrics, epidemiology, forestry, image processing and many others in which the phenomena of interest are continuous in space and time and the data are collected across time as well as space. A plethora of processes, such as atmospheric pollutant concentrations, precipitation fields and surface winds, are characterized by spatial and temporal variability. For some background on parametric spatial statistical modeling, refer to Ripley (1981), Cressie (1992), Anselin and Florax (2012), Guyon (1995) and the references therein. Nonparametric methods for spatial data have also been developed by many authors in past decades. For instance, kernel density estimators for spatial data have been discussed in Tran (1990) and extensively studied in Hallin et al. (2001, 2004), Biau and Cadre (2004), Fazekas and Chuprunov (2006), Carbon et al. (2007), Dabo-Niang et al. (2011) and El Machkouri (2014). Recently, Dabo-Niang et al. (2014) proposed spatial density estimators for multivariate data that depend on two kernels, one of which controls the distance between observations while the other controls the spatial dependence structure. Additionally, Lu and Tjøstheim (2014) proposed nonparametric kernel estimators for density functions in the so-called “expanding-domain infill asymptotics” framework, which is discussed in Section 2.2. Nonparametric and semiparametric regression models have been considered by several authors, for instance Hallin et al. (2004), Hallin et al. (2009), Gao et al. (2006), Robinson (2011), Jenish (2012), Lu et al. (2009) and Al-Sulami et al. (2017).
Complex issues arise in spatial analysis, many of which are neither clearly defined nor completely resolved, but form the basis of current research. Among the practical considerations that influence the techniques available for spatio-temporal data modeling is data dependency. In fact, spatial data are often dependent, and a spatial model must be able to handle this aspect. Note that linear models for spatio-temporal data only capture global linear relationships between spatial locations. However, in many circumstances the spatial dependency is not linear. A classical example is the spatial pattern of extreme events, as in the economic analysis of poverty and in environmental sciences. In such situations, it is more appropriate to use a nonlinear spatial dependence measure, for example, the concept of strong mixing coefficients (Tran, 1990). To the best of our knowledge, the literature on nonparametric estimation techniques that incorporate nonlinear spatio-temporal dependency is not extensive compared to that on linear dependency. However, there has been substantial interest in spatial and spatio-temporal nonparametric techniques during the past decade. Here, we are interested in the asymptotic properties of nonparametric recursive density estimation for spatio-temporal processes.
The literature on spatio-temporal models is relatively abundant; see, for example, the books of Christakos (2000) and Cressie and Wikle (2015). Nonparametric models have recently been developed to study spatio-temporal data. Wang and Wang (2009) and Wang et al. (2012) proposed, respectively, a trend estimation model and a prediction model in a spatio-temporal context using a kernel weight function that accounts for the distance between sites. Al-Sulami et al. (2017) proposed a semiparametric nonlinear regression for irregularly located spatial time-series data.
In recent years, data collection from sensor networks has been facilitated by modern technology. Researchers are now able to collect large volumes of data arriving continuously and at very high speed. In such cases, it is either unnecessary or impractical to save all the data on a disk. Such data are commonly referred to as streaming data or data streams. As an example, consider a sensor that sends a reading of the surface height of the ocean to a base station every tenth of a second. The data produced by this sensor are a stream of real numbers. Moreover, one sensor might not be sufficient to learn something about ocean behavior, and we might want to deploy a sensor network, with each member sending a stream of data to the central node at a rate of ten data points per second. Processing and analyzing these data streams effectively and efficiently is an active challenge in computational statistics. Because in many applications decisions should be made as soon as data are received, traditional nonparametric techniques that require substantial computing time are of little practical use when real-time forecasts are expected. That is why we consider, in this work, the density estimation problem in the context of spatio-temporal (sequential with respect to time) data streams. More precisely, we address recursive kernel estimators, where recursive means that the estimator calculated from the first n + 1 observations, say f_{n+1}, is a function of only f_n (i.e., the estimator based on the first n observations) and the new data received by the user. Such a recursive property is very helpful in practice within the framework of streaming data, since it enables us to update the estimates as additional observations are received. From a practical point of view, this arrangement provides important savings in computational time and memory because updating the estimate does not require the history of the stream. This is not the case for the basic kernel estimator, which must be computed again using the whole history of the stream.
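The recursive property described above can be illustrated with a minimal sketch. It is not the spatio-temporal estimator proposed in this paper: it assumes a univariate stream, a Gaussian kernel, and the classical running-average update f_n(x) = (1 - 1/n) f_{n-1}(x) + K((x - X_n)/h_n) / (n h_n). The class name `RecursiveKDE`, the evaluation grid and the bandwidth rule h_n = h0 · n^(-gamma) are illustrative assumptions.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


class RecursiveKDE:
    """Univariate recursive kernel density estimate on a fixed grid.

    Hypothetical helper for illustration only: the bandwidth sequence
    h_n = h0 * n**(-gamma) is an assumed choice, not taken from the paper.
    """

    def __init__(self, grid, h0=1.0, gamma=0.2):
        self.grid = list(grid)
        self.h0 = h0
        self.gamma = gamma
        self.n = 0  # number of observations seen so far
        self.estimate = [0.0] * len(self.grid)

    def bandwidth(self, n):
        """Bandwidth used for the n-th observation."""
        return self.h0 * n ** (-self.gamma)

    def update(self, x):
        """Fold one new observation into the estimate in O(len(grid)) time.

        Only the previous estimate and the new point are needed, so the
        stream itself never has to be stored.
        """
        self.n += 1
        h = self.bandwidth(self.n)
        for k, g in enumerate(self.grid):
            new_term = gaussian_kernel((g - x) / h) / h
            # Running-average form of f_n = (1 - 1/n) f_{n-1} + new_term / n
            self.estimate[k] += (new_term - self.estimate[k]) / self.n
        return self.estimate
```

Each call to `update` costs time proportional to the grid size and keeps only the current estimate in memory, whereas a non-recursive kernel estimator with a single bandwidth depending on n would have to revisit every past observation whenever that bandwidth changes.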
In the temporal case, the asymptotic results of the recursive estimators are highly competitive with the non-recursive results. Huang et al. (2014) studied the asymptotic properties of the recursive kernel density and regression estimators for a general class of stationary processes. The recursive density estimator for i.i.d. random variables has been studied by many authors, including Deheuvels (1973), Wegman and Davies (1979) and Wagner and Wolverton (1969), who studied the quadratic mean convergence and strong consistency. Amiri (2009) generalized these studies to a general family of recursive density estimators indexed by a parameter ℓ ∈ [0, 1], which regulates the quality of the estimator with respect to its variance and estimation error. The mean square convergence and the asymptotic normality were studied by Masry (1986) under strong mixing for two fixed values of ℓ. Tran (1989) obtained the uniform convergence of this recursive estimator for a fixed value of ℓ under α-mixing. The asymptotic normality under negative association was considered by Liang and Baek (2004), also for a fixed value of ℓ. The above results were generalized to any ℓ ∈ [0, 1] under strong mixing with additional assumptions by Amiri (2009). Based on the work of Amiri (2009), Mezhoud et al. (2014) provided the variance and mean squared error of the recursive estimator under η-weak dependence. Amiri et al. (2016) studied a recursive density estimator in the spatial case with increasing domain asymptotics. Recent advances in the topic include Zhou et al. (2003), Cao et al. (2012), Xu et al. (2014) and Amiri et al. (2017), who proposed kernel density estimation methods over data streams.
The present work extends the previous results by addressing nonparametric estimation of the probability density function of dependent spatio-temporal data using a recursive kernel approach. We derive a recursive version of the classic spatio-temporal kernel density estimator, study the asymptotic results of the estimator and present some numerical results.
The rest of this paper is organized as follows. Section 2 introduces the spatio-temporal data stream model and defines the recursive kernel density estimator in this context. Section 3 presents simulation results to evaluate the accuracy of the proposed estimator, and in Section 4 we demonstrate the usefulness of the proposed methodology in practice using real datasets. In Section 5, we present the asymptotic results for the recursive density estimator. The last section is devoted to the proofs and some auxiliary results.
Spatio-temporal data stream model
Let denotes the real lattice points in N-dimensional Euclidean space. For any distinct locations (referred to as sites) the uniform distance between sites i and j is defined as We denote
Let X be a d-dimensional random variable with d ≥ 1. Consider a spatio-temporal data stream for which the input data can be partitioned into a sequence of observed arrays of the form with
Simulations
In this section, we present a simulation study to compare our recursive kernel density estimator with its natural competitor introduced by Wang and Wang (2009). We discuss the case of spatio-temporal data streams, in which the data are continually captured over time at n fixed observation sites and real-time updates of the estimates are required. This scenario is provided by the stream package available in the R software
A real data example: Application to the Intel Lab dataset
In this section, our main objective is to demonstrate the usefulness of the proposed methodology in practice. To this end, we analyze the publicly available Intel Lab dataset, which contains data collected from 54 sensors deployed in the Intel Berkeley Research Laboratory. The sensor nodes are identified by numbers ranging from 1 to 54, and Fig. 5 shows the locations of the sensors in the laboratory.
The data consist of humidity, temperature, light and voltage measurements recorded every 31 s
Consistency results
In this paper, we are interested in the context where the sequence of data satisfies a weak dependence condition that is more general than mixing. Weak dependence is more widely applicable than many existing dependence measures, such as mixing, since it covers a large class of processes. For instance, mixing is considered to be useful for characterizing the dependence between time series data since it is fulfilled for many classes of processes and since it enables derivation of the same
Proofs
Throughout the proofs, C denotes a constant whose value is unimportant and may vary from line to line. Before we come to the proof of the main results, we state an auxiliary lemma that is a consequence of a result proved in Robison (1926) and is a generalization of Toeplitz’s lemma within the double-series framework.
Lemma 6.1 Let (w_{n,T})_{n ≥ 1, T ≥ 1} be a bounded sequence with finite limit w. If Assumption A5.3 holds, then
Proof If we set:
Acknowledgment
We wish to express our appreciation to the Associate Editor and the referees for their helpful remarks and suggestions, which led to a substantially improved version of the paper.
References
- et al. Estimation for semiparametric nonlinear regression of irregularly located spatial time-series data. Econom. Stat. (2017)
- Sur une famille paramétrique d’estimateurs séquentiels de la densité pour un processus fortement mélangeant. Comptes Rendus Math. (2009)
- et al. Nonparametric recursive density estimation for spatial data. Comptes Rendus Math. (2016)
- et al. Kernel regression estimation for random fields. J. Stat. Plan. Inference (2007)
- et al. Kernel density estimation for random fields (density estimation for random fields). Stat. Probab. Lett. (1997)
- GMM estimation with cross sectional dependence. J. Econom. (1999)
- et al. A new weak dependence condition and applications to moment inequalities. Stoch. Process. Appl. (1999)
- et al. Kernel density estimation for spatial processes: the L1 theory. J. Multivar. Anal. (2004)
- Nonparametric spatial regression under near-epoch dependence. J. Econom. (2012)
- et al. Recursive kernel estimation of the density under η-weak dependence. J. Korean Stat. Soc. (2014)
- Divergent double sequences and series. Trans. Am. Math. Soc.
- Recursive estimates of probability density. IEEE Trans. Syst. Sci. Cybern.
- On the estimation of the density of a directional data stream. Scand. J. Stat.
- New directions in spatial econometrics
- Nonparametric spatial prediction. Stat. Inference Stoch. Process.
- SOMKE: Kernel density estimation over data streams by sequences of self-organizing maps. IEEE Trans. Neural Netw. Learn. Syst.
- Modern spatiotemporal geostatistics
- Statistics for spatial data. Terra Nova
- Statistics for spatio-temporal data
- A kernel spatial density estimation allowing for the analysis of spatial clustering. Application to Monsoon Asia Drought Atlas data. Stoch. Environ. Res. Risk Assess.
- Kernel regression estimation for spatial functional random variables. Far East J. Theor. Stat.
- Adaptive kernel estimation of spatial relative risk. Statist. Med.
- Weak dependence: with examples and applications
- Coupling for τ-dependent sequences and applications. J. Theor. Probab.
- New dependence coefficients. Examples and applications to statistics. Probab. Theory Relat. Fields
- Sur une famille d’estimateurs de la densité d’une variable aléatoire. C. R. Acad. Sci. Paris Sér. A-B