Data Mining and Knowledge Discovery

Published: 1 September 2014

Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering

Authors: Annalisa Appice, Donato Malerba

Published in: Data Mining and Knowledge Discovery | Issue 5-6/2014


Abstract

Ubiquitous sensor stations are now deployed worldwide to measure geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are generally measured over large zones and long (potentially unbounded) periods of time, stations cannot cover every spatial location. Moreover, because of its huge volume, the data produced cannot be recorded in full for future analysis. In this scenario, summarization, i.e. the computation of aggregates of data, can reduce the amount of data stored on disk, while interpolation, i.e. the estimation of unknown data at each location of interest, can supplement station records. We illustrate a novel data mining solution, named interpolative clustering, that addresses both tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model to summarize geophysical data, and computes a weighted linear combination of cluster prototypes to predict data. Clustering accounts for the local presence of spatial autocorrelation in the geophysical data. The weights of the linear combination reflect the inverse distance of the unseen data to each cluster geometry, which is represented through shape-dependent sampling of the geographic coordinates of the clustered stations. Experiments performed with several data collections investigate the trade-off between the summarization capability and the predictive accuracy of the presented interpolative clustering algorithm.


The predictive clustering framework was originally defined in Blockeel et al. (1998) to combine clustering with classification and regression. Predictive inference distinguishes between target variables and explanatory variables. Target variables are considered when evaluating the similarity between training data, so that training examples with similar target values are grouped in the same cluster, while training examples with dissimilar target values are grouped in separate clusters. Explanatory variables are used to generate a symbolic description of the clusters. Although the algorithm presented in Blockeel et al. (1998) can, in principle, be run with the same set of variables in both the explanatory and the target role, this case is not investigated in the original study.

Inverse distance weighting is a common interpolation algorithm. Several advantages account for its widespread use in geostatistics (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011): simplicity of implementation, lack of tunable parameters, and the ability to interpolate scattered data and work on any grid without suffering from multicollinearity.
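As a minimal sketch of the idea, assuming planar coordinates and the conventional exponent of 2 (neither is stated in the text above), inverse distance weighting estimates a value at an unseen location as a weighted mean of known values, with weights that shrink as distance grows. The function name `idw_interpolate` and the data layout are illustrative assumptions, not the authors' implementation.

```python
import math


def idw_interpolate(known, query, power=2.0):
    """Estimate the value at `query` from `known` samples.

    known: list of ((x, y), value) pairs (hypothetical layout).
    query: (x, y) location to estimate.
    power: distance exponent; 2 is an assumed, conventional choice.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in known:
        distance = math.hypot(x - query[0], y - query[1])
        if distance == 0.0:
            # The query coincides with a sample: return it unchanged.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one known sample is required")
    return weighted_sum / weight_total
```

For example, a query point midway between two samples receives equal weights and therefore the plain average of the two values.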

This representation of a sensor network can be extended to a multi-dimensional representation of space. In the multi-dimensional case, multiple variables identify the location of a station, and all of them are taken into account when computing the distance between sensors.

In the on-line learning phase, missing observations of a variable are interpolated in the data snapshot by using the inverse distance weighted sum of nearby known data in the row.

The spherical law of cosines is used to approximate the geographical distance between two sensors from their geographic coordinates (e.g. latitude and longitude).
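The distance computation can be sketched as follows. The function name, the use of degrees as input, and the mean Earth radius of 6371 km are assumptions made for illustration; the text does not specify them.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def spherical_cosine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return radius * math.acos(cos_angle)
```

The clamp matters mostly for nearly coincident sensors, where floating-point rounding can make the cosine marginally exceed 1.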

The time cost of computing the local indicators of spatial autocorrelation can be made subquadratic by using a spatial data structure that maintains, for each sensor in the network, the sphere of its neighbours. The structure is updated only when a sensor is switched on or switched off in the network.

The quadtree decomposition of a cluster recursively divides a cluster quadrant into four subquadrants until the final quadrants are determined. As we plan to compute \(Np^{\%}\) final quadrants, the number of levels of this quadtree decomposition is about \(\log _4(Np^{\%})\).
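A short sketch makes the level count concrete. It reads \(Np^{\%}\) as the number of stations \(N\) multiplied by a sampling percentage \(p\), which is an interpretation of the notation rather than something this excerpt defines; the function name is hypothetical. Since each level multiplies the number of quadrants by four, the depth is the smallest integer \(k\) with \(4^k\) at least the target.

```python
def quadtree_levels(n_stations, sample_percent):
    """Smallest depth k such that 4**k >= N * p / 100 final quadrants."""
    target = max(1, round(n_stations * sample_percent / 100))
    levels = 0
    # Integer arithmetic avoids floating-point error in a logarithm.
    while 4 ** levels < target:
        levels += 1
    return levels
```

With 640 stations and a 10 % sample, the target is 64 final quadrants, which a quadtree reaches after three levels of subdivision.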

http://www.di.uniba.it/~appice/software/ICT_TICT/index.htm.

We compute the RRMSE to scale the error of a target variable by the domain size of the variable.
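The excerpt does not give the formula, so the following is only one plausible reading: the root mean squared error divided by the range (maximum minus minimum) of the observed values. Both this normalisation and the function name `rrmse` are assumptions.

```python
import math


def rrmse(observed, predicted):
    """RMSE scaled by the observed range (an assumed normalisation)."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    domain_size = max(observed) - min(observed)
    if domain_size == 0:
        raise ValueError("observed values have zero range")
    return math.sqrt(mse) / domain_size
```

Scaling by the domain size makes errors comparable across variables measured in different units, such as temperature and humidity.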

We note that the local indicators, which are computed by accounting for the pairwise comparison between neighbor stations, can be precomputed before building the tree. Thus, only the variance reduction of local indicators is evaluated over each node. On the other hand, MoranVar computes the global indicator of the spatial autocorrelation over each node. This requires the computation of the pairwise comparison between the neighbor stations that fall in the present node. As neighbors may change in number throughout the tree, the global measure has to be recomputed at each node.
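To illustrate why a local indicator can be precomputed once per station, here is a sketch of the local Moran statistic of Anselin (1995) with row-standardised weights. Whether this is the exact indicator used by the authors is not stated in this excerpt, and the neighbour dictionary layout is a hypothetical choice.

```python
def local_moran(values, neighbours):
    """Local Moran's I for each station.

    values: list of measurements, one per station.
    neighbours: dict mapping a station index to a list of neighbour
        indices (hypothetical layout; weights are row-standardised).
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    second_moment = sum(d * d for d in deviations) / n
    if second_moment == 0:
        raise ValueError("values have zero variance")
    indicators = []
    for i in range(n):
        nearby = neighbours.get(i, [])
        # Spatial lag: mean deviation over the neighbours of station i.
        lag = sum(deviations[j] for j in nearby) / len(nearby) if nearby else 0.0
        indicators.append(deviations[i] * lag / second_moment)
    return indicators
```

Positive values mark stations whose neighbours deviate from the mean in the same direction; negative values mark stations that differ from their surroundings. Each value depends only on the fixed neighbourhood of its station, not on the tree node it falls in.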

Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of 29th international conference on very large data bases (VLDB 2003), pp 81–92

Aggarwal CC, Han J, Wang J, Yu PS (2007) On clustering massive data streams: a summarization paradigm. In: Advances in database systems: data streams models and algorithms (book chapter), vol 31. Springer-US, pp 9–38

Aho T, Zenko B, Dzeroski S, Elomaa T (2012) Multi-target regression with rule ensembles. J Mach Learn Res 2(13):2367–2407

Angin P, Neville J (2008) A shrinkage approach for modeling non-stationary relational autocorrelation. In: Proceedings of the 8th IEEE international conference on data mining, IEEE Computer Society, pp 707–712

Anselin L (1995) Local indicators of spatial association: LISA. Geogr Anal 27(2):93–115

Appice A, Ceci M, Malerba D, Lanza A (2012) Learning and transferring geographically weighted regression trees across time. In: Proceedings of MSM/MUSE 2012, LNCS, vol 7472. Springer, Berlin, pp 97–117

Appice A, Ciampi A, Malerba D (2013a) Summarizing numeric spatial data streams by trend cluster discovery. Data Mining Knowl Discov. doi:10.1007/s10618-013-0337-7

Appice A, Ciampi A, Malerba D, Guccione P (2013b) Using trend clusters for spatiotemporal interpolation of missing data in a sensor network. J Spatial Inf Sci 6(1):119–153

Appice A, Pravilovic S, Malerba D, Lanza A (2013c) Enhancing regression models with spatio-temporal indicator additions. In: Baldoni M, Baroglio C, Boella G, Micalizio R (eds) Proceedings of AI*IA 2013: Advances in Artificial Intelligence—XIIIth international conference of the Italian Association for Artificial Intelligence, Lecture Notes in Computer Science, vol 8249. Springer, Berlin, pp 433–444

Bailey T, Krzanowski W (2012) An overview of approaches to the analysis and modelling of multivariate geostatistical data. Math Geosci 44(4):381–393. doi:10.1007/s11004-011-9360-7

Blanchet FG, Legendre P, Borcard D (2008) Modelling directional spatial processes in ecological data. Ecol Model 215(4):325–336. doi:10.1016/j.ecolmodel.2008.04.001. http://www.sciencedirect.com/science/article/pii/S0304380008001798

Blockeel H, De Raedt L, Ramon J (1998) Top–down induction of clustering trees. In: Proceedings of ICML. Morgan Kaufmann, pp 55–63

Boots B (2002) Local measures of spatial association. Ecoscience 9(2):168–176

Burrough P, McDonnell R (1998) Principles of geographical information systems. Oxford University Press, Oxford

Chen Z, Yang S, Li L, Xie Z (2010) A clustering approximation mechanism based on data spatial correlation in wireless sensor networks. In: Proceedings of the 9th conference on wireless telecommunications symposium, WTS 2010. IEEE Press, pp 208–214

Chiky R, Hébrail G (2008) Summarizing distributed data streams for storage in data warehouses. In: Proceedings of the 10th international conference on data warehousing and knowledge discovery (DaWaK 2008), LNCS, vol 5182. Springer, Berlin, pp 65–74

Cressie N (1990) The origins of kriging. Math Geol 22(3):239–252. doi:10.1007/BF00889887

Cressie N (1993) Statistics for spatial data. Wiley, New York. doi:10.1111/j.1365-3121.1992.tb00605.x

Debeljak M, Trajanov A, Stojanova D, Leprince F, Džeroski S (2012) Using relational decision trees to model out-crossing rates in a multi-field setting. Ecol Model 245:75–83

Demšar D, Debeljak M, Lavigne C, Džeroski S (2005) Modelling pollen dispersal of genetically modified oilseed rape within the field. In: Abstracts of the 90th ESA annual meeting, The Ecological Society of America, p 152

Dray S, Jombart T (2011) Revisiting Guerry’s data: introducing spatial constraints in multivariate analysis. Ann Appl Stat 5(4):2278–2299

Dray S, Legendre P, Peres-Neto PR (2006) Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (pcnm). Ecol Model 196(34):483–493. doi:10.1016/j.ecolmodel.2006.02.015. http://www.sciencedirect.com/science/article/pii/S0304380006000925

European Environment Agency (2006) Corine land cover 2006. http://sia.eionet.europa.eu/CLC2006

Gama J (2010) Knowledge discovery from data streams, 1st edn. Chapman & Hall/CRC, Boca Raton

Getis A (2008) A history of the concept of spatial autocorrelation: a geographer’s perspective. Geogr Anal 40(3):297–309

Getis A, Ord JK (1992) The analysis of spatial association by use of distance statistics. Geogr Anal 24(3):189–206

Goodchild M (1986) Spatial autocorrelation. Geo Books

Goovaerts P (1997) Geostatistics for natural resources evaluation. Oxford University Press, Oxford

Gora G, Wojna A (2002) RIONA: a classifier combining rule induction and k-NN method with automated selection of optimal neighbourhood. In: Proceedings of ECML 2002. Springer, Berlin, pp 111–123

Holden ZA, Evans JS (2010) Using fuzzy c-means and local autocorrelation to cluster satellite-inferred burn severity classes. Int J Wildland Fire 19(7):853–860

Ikonomovska E, Gama J, Dzeroski S (2011) Incremental multi-target model trees for data streams. In: Chu WC, Wong WE, Palakal MJ, Hung CC (eds) Proceedings of the 2011 ACM symposium on applied computing (SAC). ACM, pp 988–993

Ingelrest F, Barrenetxea G, Schaefer G, Vetterli M, Couach O, Parlange M (2010) Sensorscope: application-specific sensor network for environmental monitoring. ACM Trans Sens Netw 17(1–17):32

Isaaks EH, Srivastava RM (1989) An introduction to applied geostatistics. Oxford University Press, Oxford

Karydas C, Gitas I, Koutsogiannaki E, Lydakis-Simantiris N, Silleos G (2009) Evaluation of spatial interpolation techniques for mapping agricultural topsoil properties in Crete. In: Proceedings of EARSeL 2009, vol 8, pp 26–39

Kelley P, Barry R (1999) Sparse spatial autoregressions. Stat Probab Lett 33:291–297

Kim B, Tsiotras P (2009) Image segmentation on cell-center sampled quadtree and octree grids. pp 72, 480L–72, pp. 480L–9. doi:10.1117/12.810965

Kistler R, Kalnay E, Collins W, Saha S, White G, Woollen J, Chelliah M, Ebisuzaki W, Kanamitsu M, Kousky V, van den Dool H, Jenne R, Fiorino M (2001) The NCEP/NCAR 50-year reanalysis. Bull Am Meteorol Soc 82(2):247–267

Krige DG (1951) A statistical approach to some mine valuation and allied problems on the Witwatersrand. Master’s thesis

Lam N (1983) Spatial interpolation methods: a review. Am Cartogr 10:129–149. doi:10.1559/152304083783914958

Legendre P (1993) Spatial autocorrelation: trouble or new paradigm? Ecology 74:1659–1673

LeSage JH, Pace K (2001) Spatial dependence in data mining. In: Data mining for scientific and engineering applications. Kluwer, Dordrecht, pp 439–460

Li J, Heap A (2008) A review of spatial interpolation methods for environmental scientists. Geoscience Australia, Record 2008/23

Li L, Revesz P (2002) A comparison of spatio-temporal interpolation methods. GIScience, LNCS 2478. Springer, Berlin, pp 145–160

Li L, Zhang X, Holt J, Tian J, Piltner R (2011) Spatiotemporal interpolation methods for air pollution exposure. In: Proceedings of SARA 2011, AAAI

Lin G, Chen L (2004) A spatial interpolation method based on radial basis function networks incorporating a semivariogram model. J Hydrol 288:288–298

Lu GY, Wong DW (2008) An adaptive inverse-distance weighting spatial interpolation technique. J Comput Geosci 34:1044–1055. doi:10.1016/j.cageo.2007.07.010

Michalski RS, Stepp RE (1983) Learning from observation: conceptual clustering. In: Michalski RS, Carbonell JG, Mitchell TM (eds) Machine learning: an artificial intelligence approach. Tioga, pp 331–364

Nassar S, Sander J (2007) Effective summarization of multi-dimensional data streams for historical stream mining. In: Proceedings of the 19th international conference on scientific and statistical database management, SSDBM 2007. IEEE Computer Society, p 30

NOAACoastWatch (2013a) Ndbc standard meteorological buoy data. http://coastwatch.pfeg.noaa.gov/erddap/tabledap/cwwcNDBCMet.html

NOAACoastWatch (2013b) Wind diffusivity current, metop ascat, global, near real time (1 day composite). http://coastwatch.pfeg.noaa.gov/erddap/griddap/erdQAekm1day.html

NOAACoastWatch (2013c) Wind stress, metop ascat, global, near real time (1 day composite). http://coastwatch.pfeg.noaa.gov/erddap/griddap/erdQAstress1day.html

NOAANODC (2009) World ocean atlas 2009, seasonal climatology, 5 degree, temperature, salinity, oxygen. http://coastwatch.pfeg.noaa.gov/erddap/griddap/nodcWoa09sea5t.html

Ohashi O, Torgo L (2012) Spatial interpolation using multiple regression. In: Zaki MJ, Siebes A, Yu JX, Goethals B, Webb GI, Wu X (eds) 12th IEEE international conference on data mining, ICDM 2012. IEEE Computer Society, pp 1044–1049

Orkin M, Drogin R (1990) Vital statistics. McGraw Hill, New York

Pace P, Barry R (1997) Quick computation of regression with a spatially autoregressive dependent variable. Geogr Anal 29(3):232–247

Price M (2012) Arcgis 10: importing data from excel spreadsheets. http://www.esri.com/news/arcuser/0312/importing-data-from-excel-spreadsheets.html

Rodrigues PP, Gama J, Lopes LMB (2008) Clustering distributed sensor data streams. In: Proceedings of the European conference on machine learning and knowledge discovery in databases, LNCS 5212. Springer, Berlin, pp 282–297

Sampson PD, Guttorp P (1992) Nonparametric estimation of nonstationary spatial covariance structure. J Am Stat Assoc 87:108–119

Scrucca L (2005) Clustering multivariate spatial data based on local measures of spatial autocorrelation. Tech. Rep. 20, Quaderni del Dipartimento di Economia, Finanza e Statistica, Università di Perugia

Şen Z, Şahin AD (2001) Spatial interpolation and estimation of solar irradiation by cumulative semivariograms. Solar Energy 71(1):11–21. doi:10.1016/S0038-092X(01)00009-3

Shepard D (1968a) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM national conference, ACM ’68. ACM, New York, NY, USA, pp 517–524. doi:10.1145/800186.810616

Shepard D (1968b) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 ACM national conference, ACM, pp 517–524

Song YC, Meng HD (2010) The application of cluster analysis in geophysical data interpretation. Comput Geosci 14(2):263–271

Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R (2006) Monic: modeling and monitoring cluster transitions. In: Proceedings of the KDD 2006, ACM, pp 706–711

Stein ML (1999) Interpolation of spatial data: some theory for kriging (springer series in statistics), 1st edn. Springer, Berlin

Stojanova D (2009) Estimating forest properties from remotely sensed data by using machine learning. Master’s thesis, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

Stojanova D, Ceci M, Appice A, Dzeroski S (2012) Network regression with predictive clustering trees. Data Min Knowl Discov 25(2):378–413

Stojanova D, Ceci M, Appice A, Malerba D, Dzeroski S (2013) Dealing with spatial autocorrelation when learning predictive clustering trees. Ecol Inform 13:22–39

Teegavarapu RSV, Meskele T, Pathak CS (2012) Geo-spatial grid-based transformations of precipitation estimates using spatial interpolation methods. Comput Geosci 40:28–39. doi:10.1016/j.cageo.2011.07.004

Tobler W (1979) Cellular geography. Philos Geogr 20:379–386

Umer M, Kulik L, Tanin E (2010) Spatial interpolation in wireless sensor networks: localized algorithms for variogram modeling and Kriging. Geoinformatica 14(1):101–134. doi:10.1007/s10707-009-0078-3

Wang Y, Witten I (1997) Induction of model trees for predicting continuous classes. In: Proceedings of ECML 1997. Springer, Berlin, pp 128–137

Yong J, Xiao-ling Z, Jun S (2007) Unsupervised classification of polarimetric SAR Image by quad-tree segment and SVM. In: 1st Asian and Pacific conference on synthetic aperture radar, 2007 (APSAR 2007), pp 480–483. doi:10.1109/APSAR.2007.4418655

Title: Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering
Authors: Annalisa Appice, Donato Malerba
Publication date: 1 September 2014
Publisher: Springer US
Published in: Data Mining and Knowledge Discovery / Issue 5-6/2014
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-014-0372-z
