Global and local distance-based generalized linear models

  • Original Paper
  • Journal: TEST

Abstract

This paper introduces local distance-based generalized linear models. First, (weighted) distance-based linear models are extended to the generalized linear model framework. Then, a nonparametric version of these models is proposed by means of local fitting. Distances between individuals are the only predictor information needed to fit these models. They are therefore applicable, among other settings, to mixed (qualitative and quantitative) explanatory variables and to regressors of functional type. An implementation is provided by the R package dbstats, which also implements other distance-based prediction methods. Supplementary material that reproduces all the results of this article is available online.

References

  • Andrews DF, Herzberg AM (1985) Data. A collection of problems from many fields for the student and research worker. Springer, New York

  • Banks D, Carley K (1994) Metric inference for social networks. J Classif 11(1):121–149

  • Boj E, Caballé A, Delicado P, Fortiana J (2014) dbstats: distance-based statistics (dbstats). R package version 1:4

  • Boj E, Claramunt MM, Fortiana J (2007a) Selection of predictors in distance-based regression. Commun Stat A Theory 36:87–98

  • Boj E, Claramunt MM, Grané A, Fortiana J (2007b) Implementing PLS for distance-based regression: computational issues. Comput Stat 22:237–248

  • Boj E, Delicado P, Fortiana J (2010) Local linear functional regression based on weighted distance-based regression. Comput Stat Data Anal 54:429–437

  • Borg I, Groenen P (2005) Modern multidimensional scaling: theory and applications, 2nd edn. Springer, New York

  • Bowman A, Azzalini A (1997) Applied smoothing techniques for data analysis. Oxford University Press, Oxford

  • Brenchley JM, Hörchner U, Kalivas JH (1997) Wavelength selection characterization for NIR spectra. Appl Spectrosc 51:689–699

  • Brockman MJ, Wright TS (1992) Statistical motor rating: making effective use of your data. J Inst Actuar 119(3):457–543

  • Buja A, Swayne DF, Littman ML, Dean N, Hofmann H, Chen L (2008) Data visualization with multidimensional scaling. J Comput Graph Stat 17(2):444–472

  • Butts CT, Carley KM (2001) Multivariate methods for inter-structural analysis. CASOS working paper, Carnegie Mellon University

  • Butts CT, Carley KM (2005) Some simple algorithms for structural comparison. Comput Math Organ Theory 11(4):291–305

  • Crainiceanu C, Reiss P, Goldsmith J, Huang L, Huo L, Scheipl F (2014) refund: regression with functional data computing (refund). R package version 0.1-11

  • Cuadras C, Arenas C (1990) A distance-based regression model for prediction with mixed data. Commun Stat A Theory 19:2261–2279

  • Cuadras CM (1989) Distance analysis in discrimination and classification using both continuous and categorical variables. In: Dodge Y (ed) Statistical data analysis and inference. North-Holland, Amsterdam, pp 459–473

  • Cuadras CM, Arenas C, Fortiana J (1996) Some computational aspects of a distance-based model for prediction. Commun Stat B Simul 25:593–609

  • Esteve A, Boj E, Fortiana J (2009) Interaction terms in distance-based regression. Commun Stat A Theory 38:3498–3509

  • Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley, New York

  • Faraway J (2014a) faraway: functions and datasets for books by Julian Faraway. R package version 1.0.6

  • Faraway J (2014b) Regression for non-Euclidean data using distance matrices. J Appl Stat 41(11):2342–2357

  • Febrero-Bande M, González-Manteiga W (2013) Generalized additive models for functional data. Test 22(2):278–292

  • Febrero-Bande M, Oviedo M (2014) fda.usc: functional data analysis and utilities for statistical computing (fda.usc). R package version 1.2.1

  • Ferraty F, Vieu P (2006a) Nonparametric functional data analysis. Theory and practice. Springer, New York

  • Ferraty F, Vieu P (2006b) Reference manual for implementing nonparametric functional data analysis (NPFDA). Companion manual of the book: Nonparametric functional data analysis: theory and practice. Springer, New York

  • Gower JC (1968) Adding a point to vector diagrams in multivariate analysis. Biometrika 55:582–585

  • Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics 27:857–874

  • Green PJ (1984) Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J R Stat Soc B Meth 46(2):149–192

  • Haberman S, Renshaw AE (1996) Generalized linear models and actuarial science. J R Stat Soc D Stat 45(4):407–436

  • Hallin M, Ingenbleek JF (1983) The Swedish automobile portfolio in 1977. A statistical study. Scand Actuar J 83:49–64

  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Data mining, inference, and prediction, 2nd edn. Springer, New York

  • Kalivas JH (1997) Two data sets of near infrared spectra. Chemometr Intell Lab 37:255–259

  • Kaas R, Goovaerts M, Dhaene J, Denuit M (2008) Modern actuarial risk theory using R, 2nd edn. Springer, Berlin

  • Loader C (1999) Local regression and likelihood. Springer, New York

  • Maechler M (2015) cluster: cluster analysis extended Rousseeuw et al. R package version 2.0.1

  • Marx BD, Eilers PHC (1999) Generalized linear regression on sampled signals and curves: a P-spline approach. Technometrics 41(1):1–13

  • McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman and Hall, London

  • McLean MW, Hooker G, Staicu AM, Scheipl F, Ruppert D (2014) Functional generalized additive models. J Comput Graph Stat 23(1):249–269

  • Meyer D, Buchta C (2015) proxy: distance and similarity measures. R package version 0.4-14

  • R Development Core Team (2015) R: a language and environment for statistical computing. Vienna, Austria

  • Ramsay JO, Silverman BW (2005) Functional data analysis, 2nd edn. Springer, New York

  • Rao CR (1973) Linear statistical inference and its applications. Wiley, New York

  • Stewart GW (1993) On the early history of the singular value decomposition. SIAM Rev 35:551–566

  • Street JO, Carroll RJ, Ruppert D (1988) A note on computing robust regression estimates via iteratively reweighted least squares. Am Stat 42(2):152–154

  • Wasserman L (2004) All of statistics. A concise course in statistical inference. Springer, New York

  • Wasserman L (2006) All of nonparametric statistics. Springer, New York

  • Wood SN (2006) Generalized additive models: an introduction with R. Chapman & Hall, Boca Raton

Acknowledgments

We greatly appreciate the efforts of Philip K. Hopke (Clarkson University) to make the data sets described in Kalivas (1997) publicly available again. We are very grateful to John H. Kalivas for allowing us to include the protein.asc and whtspec.asc data files as supplementary material for this paper. This work was supported in part by the Spanish Ministerio de Educación y Ciencia and FEDER, grants MTM2010-17323, MTM2010-14887, MTM2013-43992-R and MTM2014-56535-R.

Corresponding author

Correspondence to Pedro Delicado.

Appendices

Appendix A: The dbstats package

The dbstats package (Boj et al. 2014) for R (R Development Core Team 2015) implements several distance-based prediction methods. The main functions of dbstats are: dblm for DB-LM, ldblm for local DB-LM, dbglm for DB-GLM, ldbglm for local DB-GLM and dbplsr for DB-PLSR.

Sect. A.1 describes the usage of function dbglm, and Sect. A.2 describes ldbglm. Two examples illustrating the usage of these functions from a user perspective are presented in Sects. 3.1 and 4.1, respectively. For details of dblm, ldblm and dbplsr, we refer to Boj et al. (2014).

A.1 Function dbglm

Function dbglm fits a DB-GLM. Distances can be provided directly to this function as an interdistances matrix (class dist, as in the stats package, or class dissimilarity, as in the cluster package), as a squared interdistances matrix (class D2), or as an inner-products matrix (class Gram). Classes D2 and Gram are implemented in the dbstats package. It is also possible to compute distances directly from observed explanatory variables (using an object of class formula in the call to dbglm).

The dbstats package does not provide specific methods for computing distances. Instead, it relies on other available functions and packages, such as dist in the stats package, daisy in the cluster package (Maechler 2015) or dist in the proxy package (Meyer and Buchta 2015). Utility functions such as as.D2, as.Gram, D2toDist, D2toG, distoD2 and GtoD2 let the user convert between these representations (see Boj et al. 2014 for details).
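
The relationship between the three representations (distances, squared distances and inner products) can be sketched outside R. The following Python snippet is a minimal illustration, not the dbstats code: the function names are hypothetical, and it assumes the usual unweighted double centering, \(G = -\tfrac{1}{2} J D^{(2)} J\) with \(J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\), whereas the package may apply observation weights.

```python
import math

def dist_to_d2(dist):
    """Square every entry of a distance matrix given as a list of lists."""
    return [[d * d for d in row] for row in dist]

def d2_to_gram(d2):
    """Double-centre squared distances (unweighted centering is an assumption)."""
    n = len(d2)
    row_means = [sum(row) / n for row in d2]   # equals column means by symmetry
    grand_mean = sum(row_means) / n
    return [[-0.5 * (d2[i][j] - row_means[i] - row_means[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def gram_to_d2(gram):
    """Recover squared distances: d2_ij = g_ii + g_jj - 2 * g_ij."""
    n = len(gram)
    return [[gram[i][i] + gram[j][j] - 2.0 * gram[i][j] for j in range(n)]
            for i in range(n)]

def d2_to_dist(d2):
    """Take square roots, clipping tiny negative rounding errors at zero."""
    return [[math.sqrt(max(v, 0.0)) for v in row] for row in d2]

# Toy data: three points on a line at positions 0, 3 and 4.
points = [0.0, 3.0, 4.0]
dist = [[abs(a - b) for b in points] for a in points]
gram = d2_to_gram(dist_to_d2(dist))
dist_back = d2_to_dist(gram_to_d2(gram))   # round trip returns the distances
```

The round trip shows that the three objects carry the same predictor information, which is why a fitting function can accept any of them.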

The response and the link function are specified as in the glm function of the stats package for ordinary GLMs.

The usage of dbglm is:

[Figure a: code listing with the call to dbglm]

where the argument distance is of class dist or dissimilarity. The same information can be provided by replacing distance with an object of class D2 or Gram.

When dbglm is called with an object of class formula, the first and second arguments in the previous call are replaced by three other arguments: formula (of the form \({\varvec{y}}\sim {\varvec{Z}}\)), data (a data frame containing the variables in the model: both the response \({\varvec{y}}\) and the explanatory variables \({\varvec{Z}}\)) and metric (which indicates how to compute distances between the rows of \({\varvec{Z}}\); it must be one of the strings “euclidean” (the default), “manhattan” or “gower”, and is passed to function daisy of the cluster package).
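
To make the “gower” option concrete for mixed data, here is a minimal Python sketch of Gower's (1971) dissimilarity in its simplest form. It assumes equal variable weights and no missing values; the function name and the toy data are hypothetical, and the daisy function covers more cases than this sketch does.

```python
def gower_dissimilarity(rows, is_numeric):
    """Pairwise Gower dissimilarities for rows of mixed-type variables.

    Numeric variables contribute |x_i - x_j| / range; categorical variables
    contribute 0 when the categories match and 1 otherwise. The result is
    the plain average of these contributions over the variables.
    """
    n, p = len(rows), len(is_numeric)
    ranges = []
    for k in range(p):
        if is_numeric[k]:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for k in range(p):
                if is_numeric[k]:
                    if ranges[k] > 0:   # a constant column contributes nothing
                        total += abs(rows[i][k] - rows[j][k]) / ranges[k]
                else:
                    total += 0.0 if rows[i][k] == rows[j][k] else 1.0
            dist[i][j] = dist[j][i] = total / p
    return dist

# Toy data: one quantitative and one qualitative explanatory variable.
rows = [(20.0, "urban"), (30.0, "rural"), (40.0, "urban")]
gower = gower_dissimilarity(rows, is_numeric=[True, False])
```

A matrix of this kind is all the predictor information that a distance-based model needs, regardless of the types of the original variables.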

In addition to the response y, the distance matrix distance (or equivalent information), and the arguments formula, data and metric, the following arguments of dbglm are worth mentioning:

  • family, weights, offset, mustart play the same role as in the glm function.

  • method sets the method used to decide the effective rank. There are five methods: “AIC”, “BIC”, “GCV” (default), “eff.rank” and “rel.gvar”. See Sect. 3 (before Sect. 3.1) for details on these criteria.

  • range.eff.rank a vector defining the range of values for the effective rank to be evaluated in the dblm iterations when method is “AIC”, “BIC” or “GCV”. It must lie within \(c(1,n-1)\).

  • full.search sets the optimization procedure used to minimize the modeling criterion specified in method when it is “AIC”, “BIC” or “GCV”. See the help pages of the dbstats package (Boj et al. 2014) for details.

  • rel.gvar relative geometric variability (a real number between 0 and 1; default is 0.95). More details can be found at the end of Sect. 3 (before Sect. 3.1).

  • eff.rank integer between 1 and \( n-1\). If specified, its value overrides rel.gvar. When eff.rank = NULL (default), calls to dblm are made with method = “rel.gvar”. More details can be found in Sect. 3 (before Sect. 3.1).

  • maxiter, eps1, eps2 are stopping criteria for the iterative algorithm that fits the DB-GLM.
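
The interplay between rel.gvar and the effective rank can be sketched as follows. This Python snippet is an illustration under stated assumptions, not the package's implementation: it assumes that the geometric variability captured by the first k latent dimensions is proportional to the sum of the k largest positive eigenvalues of the inner-products matrix, and the function name is hypothetical.

```python
def effective_rank(eigenvalues, rel_gvar=0.95):
    """Smallest k whose leading eigenvalues reach the share rel_gvar.

    eigenvalues: eigenvalues of the (doubly centred) inner-products matrix.
    Non-positive eigenvalues are ignored, which is an assumption here.
    """
    positive = sorted((v for v in eigenvalues if v > 1e-12), reverse=True)
    if not positive:
        raise ValueError("no positive eigenvalues")
    total = sum(positive)
    cumulative = 0.0
    for k, value in enumerate(positive, start=1):
        cumulative += value
        if cumulative / total >= rel_gvar:
            return k
    return len(positive)

# Toy spectrum: the cumulative shares are 0.6, 0.9 and 1.0.
spectrum = [6.0, 3.0, 1.0, 0.0]
rank_default = effective_rank(spectrum)        # needs all three dimensions
rank_loose = effective_rank(spectrum, 0.8)     # two dimensions suffice
```

Under this reading, specifying eff.rank directly simply bypasses the search over k.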

The function returns a list of class dbglm containing the following components:

  • Elements in common with the output of the glm function in R: residuals, fitted.values, family, deviance, aic.model, null.deviance, iter, prior.weights, weights, df.residual, df.null, y, call.

  • H hat matrix projector of the last dblm iteration.

  • convcrit convergence criterion. One of: “DevStat” (stopping criterion 1: when the relative decrement of deviance in one step is less than eps1), “muStat” (stopping criterion 2: when the relative change of the estimated expected values of the responses in one step is less than eps2), “maxiter” (maximum allowed number of iterations has been exceeded).

  • eff.rank, rel.gvar effective rank and relative geometric variability that have been finally used.

  • bic.model, gcv.model BIC and GCV criteria of the final DB-GLM.

  • dev.resids deviance residuals (the way they are computed depends on the specified family).

  • varmu vector of estimated variance of each observation (that depends on the estimated vector of expected values and on the specified family).
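
The three values of convcrit correspond to three ways in which the iterative fit can stop. The Python sketch below shows one plausible form of that check. The function name is hypothetical, and the exact definitions of the relative changes (absolute relative decrement of the deviance, Euclidean norm for the expected values) and the order of the checks are assumptions made for illustration.

```python
import math

def convergence_status(dev_old, dev_new, mu_old, mu_new,
                       iteration, maxiter, eps1, eps2):
    """Return 'DevStat', 'muStat', 'maxiter', or None to keep iterating."""
    tiny = 1e-300   # guards against division by zero
    rel_dev = abs(dev_old - dev_new) / max(abs(dev_old), tiny)
    change = math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_new, mu_old)))
    size = math.sqrt(sum(b * b for b in mu_old))
    rel_mu = change / max(size, tiny)
    if rel_dev < eps1:            # stopping criterion 1: deviance has stabilised
        return "DevStat"
    if rel_mu < eps2:             # stopping criterion 2: fitted means have stabilised
        return "muStat"
    if iteration >= maxiter:      # iteration budget exhausted
        return "maxiter"
    return None

status = convergence_status(10.0, 9.0, [1.0, 2.0], [1.5, 2.0],
                            iteration=5, maxiter=100, eps1=1e-10, eps2=1e-10)
```

In this example neither tolerance is met and the iteration budget is not exhausted, so the check returns None and fitting would continue.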

A.2 Function ldbglm

Function ldbglm fits a localized version of a DB-GLM. As in the global model dbglm, explanatory information is coded as distances between individuals, which can either be computed from observed explanatory variables or be provided directly to function ldbglm as a (possibly squared) interdistances matrix or as an inner-products matrix (a Gram matrix).

Recall that local DB-GLM involves two distance functions, \(\delta _1\) and \(\delta _2\), which play different roles. Accordingly, function ldbglm has two different arguments, dist1 and dist2, of class dist or dissimilarity, where distances \(\delta _1\) and \(\delta _2\) are specified: dist1 defines the neighborhood that determines which observations (and with what weight) are used when locally fitting a DB-GLM, whereas dist2 (which may coincide with dist1) is used specifically for fitting this weighted DB-GLM.
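
The division of labour between the two distances can be illustrated with a short Python sketch. It is not the ldbglm code: the function names are hypothetical, the Epanechnikov kernel is used only as an example, and normalising the weights to sum to one is an assumption.

```python
def epanechnikov(u):
    """Epanechnikov kernel, positive only for |u| < 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def local_problem(dist1, dist2, target, h):
    """Assemble the ingredients of the local fit around observation `target`.

    dist1 decides who takes part and with what weight;
    dist2 supplies the distances used in the weighted fit itself.
    """
    raw = [epanechnikov(d / h) for d in dist1[target]]
    total = sum(raw)
    keep = [i for i, w in enumerate(raw) if w > 0.0]
    weights = [raw[i] / total for i in keep]
    sub_dist2 = [[dist2[i][j] for j in keep] for i in keep]
    return keep, weights, sub_dist2

# Toy data: dist1 is built from x and dist2 from a different variable z.
x = [0.0, 1.0, 2.0, 5.0]
z = [10.0, 12.0, 11.0, 30.0]
dist1 = [[abs(a - b) for b in x] for a in x]
dist2 = [[abs(a - b) for b in z] for a in z]
keep, weights, sub_dist2 = local_problem(dist1, dist2, target=1, h=2.0)
```

Here the fourth observation lies outside the neighborhood of the second one, so the local fit would use only the first three observations, with the distances taken from dist2.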

The usage of ldbglm is:

[Figure b: code listing with the call to ldbglm]

As explained in the overview of function dbglm, the predictive information contained in distance matrices dist1 and dist2 can be provided to function ldbglm in three alternative ways: two squared distance matrices (D2.1 and D2.2), two inner-products matrices (G1 and G2), or a formula together with a data set and two metrics (formula, data, metric1 and metric2).

The following arguments of ldbglm are specific to this function because they control its local character:

  • kind.of.kernel integer between 1 and 6 that determines the user’s choice of smoothing kernel K (see Eq. 14): (1) Epanechnikov (default), (2) Biweight, (3) Triweight, (4) Normal, (5) Triangular, (6) Uniform.

  • method.h sets the method used to choose the bandwidth h in Eq. (14). There are four methods: AIC, BIC, GCV (default) and user.h. AIC, BIC and GCV take the bandwidth that minimizes the Akaike Information Criterion, the Bayesian Information Criterion or the generalized cross-validation criterion, respectively. When method.h is user.h, the bandwidth is set explicitly by the user through the otherwise optional parameter user.h, which in this case becomes mandatory.

  • user.h global bandwidth set by the user. The default value is the first quartile of all the distances \(d_{ij}\) in matrix dist1. It applies only if method.h = "user.h".

  • h.range a vector of length 2 giving the range for automatic bandwidth choice (default: quantiles 0.05 and 0.5 of the distances \(d_{ij}\) in matrix dist1). It applies when method.h != "user.h".

  • noh number of bandwidth values h within h.range for automatic bandwidth choice. It applies when method.h != "user.h".

  • k.knn minimum number of observations with positive weight in any local fit of a DB-GLM. Too small a bandwidth h could produce a neighborhood containing a single observation, which would cause a runtime error when fitting the local DB-GLM. Choosing k.knn > 1 prevents this problem. By default k.knn = 3.
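
The arguments above can be tied together in a small Python sketch. It is an illustration only, not the package's code, and all names are hypothetical. It assumes that the candidate bandwidths are evenly spaced between the two quantiles, that quantiles use linear interpolation, and that k.knn acts by enlarging the bandwidth locally until enough observations receive positive weight. Compact kernels are taken to be positive only for \(|u| < 1\).

```python
import math

# Kernel codes follow the kind.of.kernel list; u = distance / bandwidth.
KERNELS = {
    1: lambda u: 0.75 * (1 - u * u) if abs(u) < 1 else 0.0,            # Epanechnikov
    2: lambda u: 15 / 16 * (1 - u * u) ** 2 if abs(u) < 1 else 0.0,    # Biweight
    3: lambda u: 35 / 32 * (1 - u * u) ** 3 if abs(u) < 1 else 0.0,    # Triweight
    4: lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi),      # Normal
    5: lambda u: 1 - abs(u) if abs(u) < 1 else 0.0,                    # Triangular
    6: lambda u: 0.5 if abs(u) < 1 else 0.0,                           # Uniform
}

def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def bandwidth_grid(dist1, noh, q_low=0.05, q_high=0.5):
    """noh evenly spaced candidate bandwidths between two distance quantiles."""
    n = len(dist1)
    pairwise = sorted(dist1[i][j] for i in range(n) for j in range(i + 1, n))
    h_min, h_max = quantile(pairwise, q_low), quantile(pairwise, q_high)
    if noh == 1:
        return [h_min]
    step = (h_max - h_min) / (noh - 1)
    return [h_min + k * step for k in range(noh)]

def local_bandwidth(dist_row, h, k_knn=3):
    """Enlarge h if fewer than k_knn observations lie strictly within it."""
    if sum(1 for d in dist_row if d < h) >= k_knn:
        return h
    kth = sorted(dist_row)[k_knn - 1]
    return kth * (1 + 1e-8)   # just beyond the k_knn-th nearest observation

# Toy data: five equally spaced points on a line.
points = [0.0, 1.0, 2.0, 3.0, 4.0]
dist1 = [[abs(a - b) for b in points] for a in points]
grid = bandwidth_grid(dist1, noh=3)
h_local = local_bandwidth(dist1[0], h=0.5, k_knn=3)
weights = [KERNELS[1](d / h_local) for d in dist1[0]]
```

With h = 0.5 only the target observation itself would receive positive weight, so the bandwidth is enlarged until three observations do. An automatic method such as GCV would evaluate its criterion at each value in the grid and keep the minimizer.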

The function returns a list of class ldbglm containing the following components:

  • Elements in common with the output of dbglm: residuals, fitted.values, family, weights, y, call.

  • dist1, dist2 the distance matrices used to calculate the local weights of the observations and to locally fit the DB-GLMs, respectively.

  • h.opt the optimal bandwidth h used in the fitting process (if method.h != "user.h").

  • S the smoothing matrix in the last iteration of the IRWLS. See Boj et al. (2010) for details on the definition of the smoothing matrix.

Appendix B: Supplementary material

Code for fitting DB-GLM and local DB-GLM is available in the R package dbstats (http://CRAN.R-project.org/package=dbstats). Additional code for reproducing the computations and graphics in the paper is included in the R script ExamplesDB.R. The data files protein.asc and whtspec.asc are provided as supplementary material with the permission of John H. Kalivas, Idaho State University.

Cite this article

Boj, E., Caballé, A., Delicado, P. et al. Global and local distance-based generalized linear models. TEST 25, 170–195 (2016). https://doi.org/10.1007/s11749-015-0447-1
