Global and local distance-based generalized linear models

Boj, Eva; Caballé, Adrià; Delicado, Pedro; Esteve, Anna; Fortiana, Josep

doi:10.1007/s11749-015-0447-1


Original Paper
Published: 21 May 2015

TEST, Volume 25, pages 170–195 (2016)



Abstract

This paper introduces local distance-based generalized linear models. These models extend (weighted) distance-based linear models first to the generalized linear model framework. Then, a nonparametric version of these models is proposed by means of local fitting. Distances between individuals are the only predictor information needed to fit these models. Therefore, they are applicable, among others, to mixed (qualitative and quantitative) explanatory variables or when the regressor is of functional type. An implementation is provided by the R package dbstats, which also implements other distance-based prediction methods. Supplementary material for this article is available online, which reproduces all the results of this article.


References

Andrews DF, Herzberg AM (1985) Data. A collection of problems from many fields for the student and research worker. Springer, New York
Banks D, Carley K (1994) Metric inference for social networks. J Classif 11(1):121–149
Boj E, Caballé A, Delicado P, Fortiana J (2014) dbstats: distance-based statistics (dbstats). R package version 1:4
Boj E, Claramunt MM, Fortiana J (2007a) Selection of predictors in distance-based regression. Commun Stat A Theory 36:87–98
Boj E, Claramunt MM, Grané A, Fortiana J (2007b) Implementing PLS for distance-based regression: computational issues. Comput Stat 22:237–248
Boj E, Delicado P, Fortiana J (2010) Local linear functional regression based on weighted distance-based regression. Comput Stat Data Anal 54:429–437
Borg I, Groenen P (2005) Modern multidimensional scaling: theory and applications, 2nd edn. Springer, New York
Bowman A, Azzalini A (1997) Applied smoothing techniques for data analysis. Oxford University Press, Oxford
Brenchley JM, Hörchner U, Kalivas JH (1997) Wavelength selection characterization for NIR spectra. Appl Spectrosc 51:689–699
Brockman MJ, Wright TS (1992) Statistical motor rating: making effective use of your data. J Inst Actuar 119(3):457–543
Buja A, Swayne DF, Littman ML, Dean N, Hofmann H, Chen L (2008) Data visualization with multidimensional scaling. J Comput Graph Stat 17(2):444–472
Butts CT, Carley KM (2001) Multivariate methods for inter-structural analysis. Casos working paper, Carnegie Mellon University
Butts CT, Carley KM (2005) Some simple algorithms for structural comparison. Comput Math Organ Theory 11(4):291–305
Crainiceanu C, Reiss P, Goldsmith J, Huang L, Huo L, Scheipl F (2014) refund: regression with functional data computing (refund). R package version 0.1-11
Cuadras C, Arenas C (1990) A distance-based regression model for prediction with mixed data. Commun Stat A Theory 19:2261–2279
Cuadras CM, Dodge Y (1989) Distance analysis in discrimination and classification using both continuous and categorical variables. Statistical data analysis and inference. North-Holland, Amsterdam, pp 459–473
Cuadras CM, Arenas C, Fortiana J (1996) Some computational aspects of a distance-based model for prediction. Commun Stat B Simul 25:593–609
Esteve A, Boj E, Fortiana J (2009) Interaction terms in distance-based regression. Commun Stat A Theory 38:3498–3509
Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley, New York
Faraway J (2014a) faraway: functions and datasets for books by Julian Faraway. R package version 1.0.6
Faraway J (2014b) Regression for non-Euclidean data using distance matrices. J Appl Stat 41(11):2342–2357
Febrero-Bande M, González-Manteiga W (2013) Generalized additive models for functional data. Test 22(2):278–292
Febrero-Bande M, Oviedo M (2014) fda.usc: Functional data analysis and utilities for statistical computing (fda.usc). R package version 1.2.1
Ferraty F, Vieu P (2006a) Non parametric functional data analysis. Theory and practice. Springer, New York
Ferraty F, Vieu P (2006b) Reference manual for implementing nonparametric functional data analysis (NPFDA). Companion manual of the book: NonParametric Functional Data Analysis: Theory and Practice. Springer, New York
Gower JC (1968) Adding a point to vector diagrams in multivariate analysis. Biometrika 55:582–585
Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics 27:857–874
Green PJ (1984) Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J R Stat Soc B Meth 46(2):149–192
Haberman S, Renshaw AE (1996) Generalized linear models and actuarial science. J Roy Stat Soc D Stat 45(4):407–436
Hallin M, Ingenbleek JF (1983) The Swedish automobile portfolio in 1977. A statistical study. Scand Actuar J 83:49–64
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Data mining, inference, and prediction, 2nd edn. Springer, New York
Kalivas JH (1997) Two data sets of near infrared spectra. Chemometr Intell Lab 37:255–259
Kass R, Goovaerts M, Dhaene J, Denuit M (2008) Modern actuarial risk theory using R, 2nd edn. Springer, Berlin
Loader C (1999) Local regression and likelihood. Springer, New York
Maechler M (2015) cluster: cluster analysis extended Rousseeuw et al. R package version 2.0.1
Marx BD, Eilers PHC (1999) Generalized linear regression on sampled signals and curves: a P-spline approach. Technometrics 41(1):1–13
McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman and Hall, London
Mclean MW, Hooker G, Staicu AM, Scheipl F, Ruppert D (2014) Functional generalized additive models. J Comput Graph Stat 23(1):249–269
Meyer D, Buchta C (2015) proxy: distance and similarity measures. R package version 0.4-14
R Development Core Team (2015) R: a language and environment for statistical computing. Vienna, Austria
Ramsay JO, Silverman BW (2005) Functional data analysis, 2nd edn. Springer, New York
Rao CR (1973) Linear statistical inference and its applications. Wiley, New York
Stewart GW (1993) On the early history of the singular value decomposition. SIAM Rev 35:551–566
Street JO, Carroll RJ, Ruppert D (1988) A note on computing robust regression estimates via iteratively reweighted least squares. Am Stat 42(2):152–154
Wasserman L (2004) All of statistics. A concise course in statistical inference. Springer, New York
Wasserman L (2006) All of nonparametric statistics. Springer, New York
Wood SN (2006) Generalized additive models: an introduction with R. Chapman & Hall, Boca Raton

Acknowledgments

We greatly appreciate the efforts of Philip K. Hopke (Clarkson University) to make the data sets described in Kalivas (1997) publicly available again. We are very grateful to John H. Kalivas for allowing us to include the protein.asc and whtspec.asc data files as supplementary material to this paper. This work was supported in part by the Spanish Ministerio de Educación y Ciencia and FEDER, grants MTM2010-17323, MTM2010-14887, MTM2013-43992-R and MTM2014-56535-R.

Author information

Authors and Affiliations

Departament de Matemàtica Econòmica, Financera i Actuarial, Universitat de Barcelona, Barcelona, Spain
Eva Boj
University of Edinburgh, Edinburgh, UK
Adrià Caballé
Departament d’Estadística i I.O, Universitat Politècnica de Catalunya, Barcelona, Spain
Pedro Delicado
Centre d’Estudis Epidemiològics sobre les Infeccions de Transmissió Sexual i Sida de Catalunya, Agencia de Salut Pública de Catalunya, Badalona, Spain
Anna Esteve
Departament de Probabilitat, Lògica, i Estadística, Universitat de Barcelona, Barcelona, Spain
Josep Fortiana

Corresponding author

Correspondence to Pedro Delicado.

Appendices

Appendix A: The dbstats package

The dbstats package (Boj et al. 2014) for R (R Development Core Team 2015) implements several distance-based prediction methods. The main functions of dbstats are: dblm for DB-LM, ldblm for local DB-LM, dbglm for DB-GLM, ldbglm for local DB-GLM and dbplsr for DB-PLSR.

Sect. A.1 describes the usage of function dbglm, and Sect. A.2 describes ldbglm. Two examples illustrating the usage of these functions from a user perspective are presented in Sects. 3.1 and 4.1, respectively. For details of dblm, ldblm and dbplsr, we refer to Boj et al. (2014).

A.1 Function dbglm

Function dbglm fits a DB-GLM. Distances can be provided to this function directly as an interdistances matrix (class dist, as in the stats package, or class dissimilarity, as in the cluster package), as a squared interdistances matrix (class D2), or as an inner-products matrix (class Gram). Classes D2 and Gram are implemented in the dbstats package. It is also possible to compute distances directly from observed explanatory variables (using an object of class formula in the call to dbglm).

The dbstats package does not provide its own methods for computing distances. It relies instead on other available functions and packages, such as dist in the stats package, daisy in the cluster package (Maechler 2015) or dist in the proxy package (Meyer and Buchta 2015). The utility functions as.D2, as.Gram, D2toDist, D2toG, distoD2 and GtoD2 let the user convert between the distance, squared-distance and inner-products representations (see Boj et al. 2014 for details).
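The relation between a squared interdistances matrix and an inner-products (Gram) matrix can be illustrated with the classical double-centering identity from multidimensional scaling. The following is a minimal sketch in plain Python, not the dbstats code: the function names d2_to_gram and gram_to_d2 are hypothetical, and uniform observation weights are assumed.

```python
# Illustrative sketch only; not the dbstats implementation.
# Assumption: all observations carry the same weight 1/n.

def d2_to_gram(d2):
    """Double-centre a symmetric squared-distance matrix:
    G = -1/2 * J * D2 * J, with J = I - (1/n) * 1 1^T."""
    n = len(d2)
    row_mean = [sum(row) / n for row in d2]   # equals column means (symmetry)
    total_mean = sum(row_mean) / n
    return [
        [-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + total_mean)
         for j in range(n)]
        for i in range(n)
    ]

def gram_to_d2(g):
    """Recover squared distances: d2_ij = g_ii + g_jj - 2 * g_ij."""
    n = len(g)
    return [[g[i][i] + g[j][j] - 2.0 * g[i][j] for j in range(n)]
            for i in range(n)]

# Three points on a line at 0, 1 and 3.
points = [0.0, 1.0, 3.0]
d2 = [[(a - b) ** 2 for b in points] for a in points]
gram = d2_to_gram(d2)      # inner products of the centred coordinates
d2_back = gram_to_d2(gram) # round trip returns the squared distances
```

The round trip shows that, under these assumptions, the two representations carry the same information, which is why either can stand in for the distance matrix.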

The response and the link function are specified as in the glm function of the stats package for an ordinary GLM.

The usage of dbglm is:

where the argument distance is of class dist or dissimilarity. The same information can be provided by replacing distance with an object of class D2 or Gram.

When dbglm is called with an object of class formula, the first and second arguments in the previous call are replaced by three other arguments: formula (of the form \(\boldsymbol{y} \sim \boldsymbol{Z}\)), data (a data frame containing the variables in the model, both the response \(\boldsymbol{y}\) and the explanatory variables \(\boldsymbol{Z}\)) and metric (which indicates how to compute distances between the rows of \(\boldsymbol{Z}\); it must be one of the strings “euclidean” (the default), “manhattan” or “gower”, and is passed to function daisy of the cluster package).
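To make the “gower” option concrete for mixed data, here is a minimal sketch of a Gower-type dissimilarity (Gower 1971) in plain Python. It is an illustration under stated assumptions, not the code of daisy or dbstats: the function name gower_dissimilarity is hypothetical, numeric columns are scaled by their range, categorical columns use simple matching, all variables are weighted equally, and missing values are not handled.

```python
# Illustrative sketch only; not the daisy/dbstats implementation.

def gower_dissimilarity(rows, is_numeric):
    """Pairwise Gower-type dissimilarities for mixed data.

    rows: list of records of equal length.
    is_numeric: one boolean per column (True = numeric, False = categorical).
    """
    n, p = len(rows), len(is_numeric)
    # Range of each numeric column; None for categorical columns.
    ranges = []
    for k in range(p):
        if is_numeric[k]:
            column = [r[k] for r in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)

    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for k in range(p):
                if is_numeric[k]:
                    if ranges[k] > 0:  # a constant column contributes 0
                        total += abs(rows[i][k] - rows[j][k]) / ranges[k]
                elif rows[i][k] != rows[j][k]:
                    total += 1.0       # simple matching for categories
            d[i][j] = d[j][i] = total / p
    return d

# One numeric and one categorical explanatory variable.
records = [(1.0, "a"), (3.0, "a"), (5.0, "b")]
dist = gower_dissimilarity(records, is_numeric=[True, False])
```

Because each per-variable contribution lies in [0, 1], quantitative and qualitative predictors enter the distance on a comparable scale, which is the property that makes distance-based models applicable to mixed explanatory variables.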

In addition to the response y, the distance matrix distance (or equivalent information), and the arguments formula, data and metric, the following arguments of dbglm are worth mentioning:

family, weights, offset, mustart have the same role as in the glm function.
method sets the method used to decide the effective rank. There are five different methods: “AIC”, “BIC”, “GCV” (default), “eff.rank” and “rel.gvar”. See Sect. 3 (before Sect. 3.1) for details on these criteria.
range.eff.rank a vector defining the range of values of the effective rank to be evaluated in the dblm iterations when method is “AIC”, “BIC” or “GCV”. Its values must lie between 1 and \(n-1\).
full.search sets the optimization procedure used to minimize the modeling criterion specified in method when it is “AIC”, “BIC” or “GCV”. See the dbstats package help (Boj et al. 2014) for details.
rel.gvar relative geometric variability (a real number between 0 and 1; default is 0.95). More details can be found at the end of Sect. 3 (before Sect. 3.1).
eff.rank integer between 1 and \(n-1\). If specified, its value overrides rel.gvar. When eff.rank = NULL (default), calls to dblm are made with method = “rel.gvar”. More details can be found in Sect. 3 (before Sect. 3.1).
maxiter, eps1, eps2 are stopping criteria for the iterative algorithm that fits the DB-GLM.
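One common reading of the rel.gvar criterion is: keep the smallest number of latent dimensions whose eigenvalues account for at least the requested share of the total geometric variability. The sketch below, in plain Python, illustrates that reading only. The function name effective_rank is hypothetical, the eigenvalues of the inner-products matrix are assumed to be given, and discarding non-positive eigenvalues is an assumption, not a documented behavior of dbstats.

```python
# Illustrative sketch only; not the dbstats implementation.

def effective_rank(eigenvalues, rel_gvar=0.95):
    """Smallest k such that the k largest positive eigenvalues explain
    at least a fraction rel_gvar of the sum of positive eigenvalues."""
    if not 0.0 < rel_gvar <= 1.0:
        raise ValueError("rel_gvar must lie in (0, 1]")
    # Assumption: non-positive eigenvalues are dropped before accumulating.
    positive = sorted((v for v in eigenvalues if v > 0), reverse=True)
    if not positive:
        raise ValueError("no positive eigenvalues")
    total = sum(positive)
    running = 0.0
    for k, value in enumerate(positive, start=1):
        running += value
        if running / total >= rel_gvar:
            return k
    return len(positive)

# Hypothetical spectrum of an inner-products matrix.
spectrum = [6.0, 3.0, 0.9, 0.1, -0.2]
rank_default = effective_rank(spectrum)      # cumulative shares 0.60, 0.90, 0.99
rank_half = effective_rank(spectrum, 0.5)
```

Setting eff.rank explicitly bypasses this kind of rule, which is consistent with the statement above that eff.rank overrides rel.gvar.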

The function returns a list of class dbglm containing the following components:

Common elements with the output of glm function for R: residuals, fitted.values, family, deviance, aic.model, null.deviance, iter, prior.weights, weights, df.residual, df.null, y, call.
H hat matrix projector of the last dblm iteration.
convcrit convergence criterion. One of: “DevStat” (stopping criterion 1: when the relative decrement of deviance in one step is less than eps1), “muStat” (stopping criterion 2: when the relative change of the estimated expected values of the responses in one step is less than eps2), “maxiter” (maximum allowed number of iterations has been exceeded).
eff.rank, rel.gvar effective rank and relative geometric variability that have been finally used.
bic.model, gcv.model BIC and GCV criteria of the final DB-GLM.
dev.resids deviance residuals (the way they are computed depends on the specified family).
varmu vector of estimated variances of the observations (which depend on the estimated vector of expected values and on the specified family).
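The three values of convcrit correspond to three ways the iterative fit can stop. The following plain-Python sketch shows one way such a check could be organized. It is not the dbstats code: the function name convergence_status, the default tolerances, the use of a Euclidean norm for the relative change in the fitted means, and the order in which the criteria are tested are all assumptions made for illustration.

```python
# Illustrative sketch only; not the dbstats implementation.
# Default values of eps1, eps2 and maxiter below are placeholders.

def convergence_status(dev_old, dev_new, mu_old, mu_new, iteration,
                       eps1=1e-4, eps2=1e-4, maxiter=100):
    """Return "DevStat", "muStat", "maxiter", or None to keep iterating."""
    tiny = 1e-12  # guards against division by zero

    # Criterion 1: relative change of the deviance in one step.
    rel_dev = abs(dev_old - dev_new) / max(abs(dev_old), tiny)
    if rel_dev < eps1:
        return "DevStat"

    # Criterion 2: relative change of the estimated expected values.
    change = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new)) ** 0.5
    scale = sum(a ** 2 for a in mu_old) ** 0.5
    if change / max(scale, tiny) < eps2:
        return "muStat"

    # Criterion 3: iteration budget exhausted.
    if iteration >= maxiter:
        return "maxiter"
    return None

status_dev = convergence_status(10.0, 9.9999, [1.0, 2.0], [1.5, 2.5], 3)
status_mu = convergence_status(10.0, 8.0, [1.0, 2.0], [1.0, 2.0], 3)
status_max = convergence_status(10.0, 8.0, [1.0, 2.0], [2.0, 3.0], 100)
status_none = convergence_status(10.0, 8.0, [1.0, 2.0], [2.0, 3.0], 5)
```

A returned value of “maxiter” signals that neither tolerance was met, so the reported fit should be inspected with more care than one that stopped on “DevStat” or “muStat”.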

A.2 Function ldbglm

Function ldbglm fits a localized version of a DB-GLM. As in the global model dbglm, explanatory information is coded as distances between individuals, which can either be computed from observed explanatory variables or be provided directly to function ldbglm as a (possibly squared) interdistances matrix or as an inner-products matrix (a Gram matrix).

Recall that local DB-GLM involves two distance functions, \(\delta _1\) and \(\delta _2\), which play different roles. Accordingly, function ldbglm has two different arguments, dist1 and dist2, of class dist or dissimilarity, in which the distances \(\delta _1\) and \(\delta _2\) are specified: dist1 defines the neighborhood that determines which observations (and with what weight) are used when locally fitting a DB-GLM, whereas dist2 (which may coincide with dist1) is used specifically for fitting this weighted DB-GLM.

The usage of ldbglm is:

As explained in the overview of function dbglm, the predictive information contained in the distance matrices dist1 and dist2 can be provided to function ldbglm in three alternative ways: two squared distance matrices (D2.1 and D2.2), two inner-products matrices (G1 and G2), or a formula together with a data set and two metrics (formula, data, metric1 and metric2).

The following arguments are specific to ldbglm because they control its local character:

kind.of.kernel integer between 1 and 6 that determines the user’s choice of smoothing kernel K (see Eq. 14): (1) Epanechnikov (default), (2) Biweight, (3) Triweight, (4) Normal, (5) Triangular, (6) Uniform.
method.h sets the method used to choose the bandwidth h in Eq. (14). There are four different methods: AIC, BIC, GCV (default) and user.h. AIC, BIC and GCV take the bandwidth minimizing the Akaike Information Criterion, the Bayesian Information Criterion or the generalized cross-validation criterion, respectively. When method.h is user.h, the bandwidth is set explicitly by the user through the otherwise optional parameter user.h, which in this case becomes mandatory.
user.h global bandwidth set by the user. The default value is the first quartile of all the distances d(i, j) in matrix dist1. It applies only if method.h = "user.h".
h.range a vector of length 2 giving the range for automatic bandwidth choice (default: quantiles 0.05 and 0.5 of d(i, j) in matrix dist1). It applies when method.h != "user.h".
noh number of bandwidth values h within h.range for automatic bandwidth choice. It applies when method.h != "user.h".
k.knn minimum number of observations with positive weight in any local fit of a DB-GLM. Too small a bandwidth h could produce a neighborhood containing only one observation, causing a runtime error when the local DB-GLM is fitted. Choosing k.knn > 1 prevents this problem. By default k.knn = 3.
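The interplay between the kernel, the bandwidth and k.knn can be sketched in plain Python as follows. This is an illustration under assumptions, not the dbstats code: Eq. (14) is not reproduced here, so the weights are assumed to be K(d/h); the kernels use their standard textbook forms; the names KERNELS, local_weights, quantile and bandwidth_grid are hypothetical; widening h just past the k.knn-th nearest distance and using an equally spaced grid over h.range are guesses at reasonable behavior.

```python
# Illustrative sketch only; not the dbstats implementation.
import math

# Standard kernel shapes, keyed as in kind.of.kernel (1 to 6).
KERNELS = {
    1: lambda u: 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0,               # Epanechnikov
    2: lambda u: 15 / 16 * (1 - u * u) ** 2 if abs(u) <= 1 else 0.0,       # Biweight
    3: lambda u: 35 / 32 * (1 - u * u) ** 3 if abs(u) <= 1 else 0.0,       # Triweight
    4: lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi),          # Normal
    5: lambda u: 1 - abs(u) if abs(u) <= 1 else 0.0,                       # Triangular
    6: lambda u: 0.5 if abs(u) <= 1 else 0.0,                              # Uniform
}

def local_weights(distances, h, kind=1, k_knn=3):
    """Weights K(d/h) for the distances from a target point to the sample.
    Assumed safeguard: if fewer than k_knn weights are positive, widen h
    just past the k_knn-th smallest distance."""
    kernel = KERNELS[kind]
    weights = [kernel(d / h) for d in distances]
    if sum(w > 0 for w in weights) < k_knn:
        kth = sorted(distances)[min(k_knn, len(distances)) - 1]
        h = max(h, kth * 1.001)
        weights = [kernel(d / h) for d in distances]
    return weights, h

def quantile(values, p):
    """Linear-interpolation quantile of a non-empty list (0 <= p <= 1)."""
    x = sorted(values)
    pos = (len(x) - 1) * p
    lo = int(pos)
    hi = min(lo + 1, len(x) - 1)
    return x[lo] + (pos - lo) * (x[hi] - x[lo])

def bandwidth_grid(pairwise_distances, noh):
    """noh (>= 2) equally spaced candidates between quantiles 0.05 and 0.5."""
    lo, hi = quantile(pairwise_distances, 0.05), quantile(pairwise_distances, 0.5)
    return [lo + k * (hi - lo) / (noh - 1) for k in range(noh)]

# With h = 0.3 only two observations get positive weight, so h is widened.
weights, h_used = local_weights([0.0, 0.2, 0.5, 0.9, 1.4], h=0.3)
grid = bandwidth_grid([1.0, 2.0, 3.0, 4.0, 5.0], noh=3)
```

In this example the initial bandwidth leaves two neighbors, fewer than k_knn = 3, so the sketch enlarges h until a third observation receives a small positive weight.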

The function returns a list of class ldbglm containing the following components:

Common elements with the output of dbglm: residuals, fitted.values, family, weights, y, call.
dist1, dist2 the distance matrices used to calculate the local weights of the observations and to locally fit the DB-GLMs, respectively.
h.opt the optimal bandwidth h used in the fitting process (if method.h != "user.h").
S the smoothing matrix in the last iteration of the IRWLS. See Boj et al. (2010) for details on the definition of the smoothing matrix.

Appendix B: Supplementary material

Code for fitting DB-GLM and local DB-GLM is available in the R package dbstats (http://CRAN.R-project.org/package=dbstats). Additional code for reproducing the computations and graphics in the paper is included in the R script ExamplesDB.R. The data files protein.asc and whtspec.asc are provided as supplementary material with the permission of John H. Kalivas, Idaho State University.

About this article

Cite this article

Boj, E., Caballé, A., Delicado, P. et al. Global and local distance-based generalized linear models. TEST 25, 170–195 (2016). https://doi.org/10.1007/s11749-015-0447-1

Received: 21 January 2015
Accepted: 27 April 2015
Published: 21 May 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s11749-015-0447-1
