What are they? Why do we need them?
Generalized Linear Models (GLMs) were born out of a desire to bring a wide variety of regression models under one umbrella. These models span the spectrum from Classical Linear Regression models for real valued data, to models for counts based data such as Logit, Probit and Poisson, to models for survival analysis.
Models under the GLM umbrella
GLMs give you a common procedure for specifying and training the following classes of models:
- Classical Linear Regression (CLR) Models, colloquially referred to as Linear Regression models for real valued (and potentially negative valued) data sets.
- Analysis of Variance (ANOVA) models.
- Models for ratios of counts. For example, models which predict the odds of winning, the probability of machine failure etc. Some examples of this class are the Logit model (used in Logistic regression), Probit and Ordered Probit models, and the very powerful Binomial Regression model.
- Models used for explaining (and predicting) event counts. For example, models that predict the number of footfalls at the supermarket, in a mall, or in an emergency room. Examples of models of this class are the Poisson and Negative Binomial regression models, and the Hurdle model.
- Models for predicting time to next failure of parts, machines (and human beings). Models for estimating lifespans of living (and non-living) things.
When each of the above seemingly diverse regression models is expressed in the format of a Generalized Linear Model (and we’ll get to explaining what that format is shortly), the modeler gets the great benefit of applying a common training technique to all of them.
With Generalized Linear Models, one uses a common training technique for a diverse set of regression models.
Furthermore, GLMs allow the modeler to express the relationship between the regression variables (a.k.a. covariates, a.k.a. influencing variables, a.k.a. explanatory variables) X and the response variable (a.k.a. dependent variable) y in a linear and additive way, even though the underlying relationships may be neither linear nor additive.
Generalized Linear Models let you express the relation between covariates X and response y in a linear, additive manner.
Relationship with the Classical Linear Regression model
Speaking of linearity and additiveness, a Linear Regression model is a simple and powerful model that is successfully used for modeling linear, additive relationships such as the following:
A CLR model is often the ‘model of first choice’: something that a complex model should be carefully compared with, before choosing the complex model for one’s problem.
CLR models come with clear advantages:
Subject to certain conditions being met, they have a neat ‘closed-form’ solution, meaning they can be fitted, i.e. trained on the data, by simply solving a system of linear algebraic equations.
It is also easy to interpret the trained model’s coefficients. For example, if a trained CLR model is expressed by the following equation:
It is clear from this equation what the model has been able to find: that for each unit increase in the number of campers the number of fish caught increases by around 75%, while for each unit increase in the number of children in the camping group, the number of fish that the group manages to catch reduces by the same amount! And also that it takes a camping group size of at least 3 (=roundup(2.49)) before any fish can be caught.
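The ‘closed-form’ fit mentioned above can be sketched for the simplest case of one regression variable plus an intercept. This is only an illustration: the function name `fit_simple_ols` and the data values are made up for the example and do not come from the fish-catching model discussed above.

```python
def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x (one regressor)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of cross-products and squares around the means.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = y_mean - b1 * x_mean   # intercept
    return b0, b1


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
b0, b1 = fit_simple_ols(x, y)
print(f"intercept={b0:.2f}, slope={b1:.2f}")
```

No iteration is needed: the coefficients drop straight out of sums over the data. With several regression variables the same idea becomes a system of linear equations solved in one step.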
But Classical Linear Regression models also come with some strict requirements, namely:
- Additive relationships: Classical Linear models assume that the effects of the regression variables on the response are additive, i.e. each variable contributes its own separate term to y.
- Homoscedastic data: Classical Linear models assume that the data have constant variance, i.e. that the data are homoscedastic. In real life, data is often not homoscedastic: the variance is not constant, and sometimes it is a function of the mean. For example, the variance increases as the mean increases. This is common in monetary data sets.
- Normally distributed errors: Classical Linear models assume that the errors of regression, which are estimated by the model’s residuals, are normally distributed with mean zero. This condition is also difficult to meet in real life.
- Non-correlated variables: Finally, the regression variables are assumed to be non-correlated with each other, and preferably independent of each other.
Therefore, if your data set is non-linear and heteroscedastic and the residuals are not normally distributed, which is often the case in real world data sets, you need to apply a suitable transformation to both y and X so as to make the relationship linear and at the same time stabilize the variance and normalize the errors.
The square root and the logarithm transformations are commonly used for achieving these effects as follows:
Unfortunately, none of the available transforms are good at achieving all three effects at the same time, namely making the relation linear, minimizing heteroscedasticity and normalizing the error distribution.
There is another great problem with the transformation approach which is as follows:
Recollect that y is a random variable that follows some kind of a probability distribution. So for any given combination of x values in the data set, the real world is likely to present to you several random values of y and only some of these possible values will appear in your training samples. In the real world, these values of y will be randomly distributed around the conditional mean of y given the specific value of x. The conditional mean of y is denoted by E(y|x). This situation can be illustrated as follows:
In the variable transformation approach, we make the unrealistically strong assumption that every single value of y i.e. each one of the blue dots in the above plot, after transformation using log(), sqrt() etc., will end up having a linear relationship with X. This is obviously too much to expect.
What seems more realistic is that the conditional mean (a.k.a. expectation) of y, i.e. E(y|x) after a suitable transformation, ought to have a linear relationship with X.
In other words, taking log() as the transformation:
log(E(y|x)) = β0 + β1*x1 + β2*x2 + … + βp*xp
Generalized Linear Models make the above crucial assumption, namely that the transformed conditional expectation of y is a linear combination of regression variables X.
The transformation function is called a link function of the GLM and is denoted by g(.). In the above example, the log() is the link function, i.e. g(.) = log(.).
We illustrate the general action of g() as follows:
Thus, instead of transforming every single value of y for each x, GLMs transform only the conditional expectation of y for each x. So there is no need to assume that every single value of y is expressible as a linear combination of regression variables.
In Generalized Linear Models, one expresses the transformed conditional expectation of the dependent variable y as a linear combination of the regression variables X.
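The difference between transforming every y and transforming only the conditional mean can be seen with a few numbers. The sketch below uses made-up y values, imagined as all observed at one fixed x, and compares the log of their mean with the mean of their logs.

```python
import math

# Hypothetical y values observed at one fixed x (illustrative numbers only).
y_at_x = [2.0, 4.0, 8.0, 16.0, 32.0]

# GLM view: average first, then apply the link to the conditional mean.
mean_y = sum(y_at_x) / len(y_at_x)
log_of_mean = math.log(mean_y)

# Transformation view: apply log() to every y, then average.
mean_of_logs = sum(math.log(v) for v in y_at_x) / len(y_at_x)

print(f"log(E(y|x)) = {log_of_mean:.4f}")
print(f"E(log(y)|x) = {mean_of_logs:.4f}")
```

The two quantities differ, so a model fitted to log(y) is not a model of log(E(y|x)). A related practical point: a count of zero has no logarithm, whereas the mean of a set of counts that includes zeros usually does.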
The link function g(.) can take many forms and we get a different regression model based on what form g(.) takes. Here are a few popular forms and the corresponding regression models that they lead to:
The Linear Regression Model
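In Linear models, g(.) is the identity function: g(µ) = µ, where µ = E(y|x). The conditional mean itself is modeled as a linear combination of the regression variables.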
The Logistic (and in general, Binomial) Regression Models
In the Logistic regression model, g(.) is the Logit function, i.e. the log of the odds: g(µ) = log(µ / (1 − µ)), where µ is the probability of the event.
The Poisson Regression Model
The Poisson regression model uses the log-link function: g(µ) = log(µ).
There are many other variants of g(.). The probit model uses the inverse of the Cumulative Distribution Function of the standard Normal distribution as its link. The Negative Binomial regression model, which arises from a Poisson-Gamma mixture distribution for y, is commonly fitted with the same log link as the Poisson model.
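The link functions named above, together with their inverses, can be written down in a few lines. This is a minimal sketch using only Python’s standard library; the `LINKS` table and the `predict_mean` helper are names invented for this illustration, not part of any particular GLM package.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Normal: mean 0, standard deviation 1

# Each entry maps a link name to the pair (g, inverse of g).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
}


def predict_mean(link_name, eta):
    """Map a linear predictor value eta back to the mean: mu = g^-1(eta)."""
    _, g_inverse = LINKS[link_name]
    return g_inverse(eta)


# A linear predictor of 0 corresponds to a different mean under each link.
for name in LINKS:
    print(name, predict_mean(name, 0.0))
```

The linear combination of regression variables lives on the g(µ) scale; applying the inverse link brings a prediction back to the scale of y, for example to a probability between 0 and 1 for the logit and probit links.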
How to handle Heteroscedasticity in the data?
Finally, let’s look at how GLMs handle heteroscedastic data i.e. data in which the variance is not constant, and how GLMs handle potentially non-normal residual errors.
GLMs account for the possibility of a non-constant variance by assuming that the variance is some function V(µ) of the mean µ, or more accurately the conditional mean µ|X=x.
In each of the above mentioned models, we assume a suitable variance function V(µ|X=x).
In Generalized Linear Models, one expresses the variance in the data as a suitable function of the mean value.
In the Linear regression model, we assume V(µ) = some constant, i.e. the variance is constant. Why? Because Linear models assume that y is Normally distributed, and the variance of a Normal distribution is a separate parameter that does not depend on its mean.
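In the Logistic and Binomial Regression models, we assume V(µ) = µ − µ²/n, where n is the number of trials behind each observation, as required by a Binomially distributed y value. For Logistic regression on binary outcomes, n = 1 and this reduces to V(µ) = µ(1 − µ).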
In the Poisson Regression model, we assume V(µ) = µ. This is because the Poisson regression model assumes that y has a Poisson distribution, and in a Poisson distribution, variance = mean.
In the Negative Binomial regression model, we assume V(µ) = µ + α*µ², where α ≥ 0 is a dispersion parameter which allows us to deal with over-dispersed data, i.e. data whose variance exceeds its mean.
…and so on for other models.
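The variance functions listed above are simple enough to write out directly. In this sketch the function names are invented for the illustration, and for the Binomial case `n` is taken to be the number of trials behind each observation (an assumption of this sketch; `n = 1` gives the binary Logistic case).

```python
def var_normal(mu, sigma2=1.0):
    """Linear model: constant variance, whatever the mean is."""
    return sigma2


def var_binomial(mu, n=1):
    """Binomial with n trials and mean mu: mu - mu^2 / n."""
    return mu - mu ** 2 / n


def var_poisson(mu):
    """Poisson: variance equals the mean."""
    return mu


def var_negative_binomial(mu, alpha):
    """Negative Binomial: mu + alpha * mu^2, with dispersion alpha."""
    return mu + alpha * mu ** 2


# Compare the assumed variance at a mean of 4 under the two count models.
print(var_poisson(4.0), var_negative_binomial(4.0, alpha=0.5))
```

Setting α = 0 in the Negative Binomial variance function recovers the Poisson one, which is one way to see the Poisson model as a special case.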
In GLMs, it is possible to show that the model is not sensitive to the distributional form of the residual errors. In simple terms, the model doesn’t care whether its errors are normally distributed or distributed any other way, as long as the mean-variance relationship that you assume is actually satisfied by your data.
Generalized Linear Models do not care if the residual errors are normally distributed as long as the specified mean-variance relationship is satisfied by the data.
This makes GLMs a practical choice for many real world data sets that are nonlinear and heteroscedastic and in which we cannot assume that the model’s errors will always be normally distributed.
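To see how a link function and a variance function combine into a single fitting routine, here is a rough sketch of iteratively reweighted least squares (IRLS), one widely used way of fitting GLMs. The article does not say which training technique it has in mind, so treating IRLS as that technique is an assumption of this sketch. It handles only a Poisson model with a log link, one regression variable and an intercept; the function name and the data are invented for illustration.

```python
import math


def fit_poisson_irls(x, y, n_iter=25):
    """Fit log(E(y|x)) = b0 + b1 * x by iteratively reweighted least squares."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(n_iter):
        # Accumulate the sums needed for a weighted least-squares fit.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi          # linear predictor
            mu = math.exp(eta)          # inverse log link
            w = mu                      # weight 1 / (V(mu) * g'(mu)^2), V(mu) = mu
            z = eta + (yi - mu) / mu    # working response eta + (y - mu) * g'(mu)
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        # Solve the 2x2 weighted normal equations for the updated coefficients.
        det = s_w * s_wxx - s_wx ** 2
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1


# Hypothetical counts, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [2, 2, 3, 4, 5, 8]
b0, b1 = fit_poisson_irls(x, y)
print(f"log(E(y|x)) = {b0:.3f} + {b1:.3f} * x")
```

Only two lines are specific to the Poisson model: the weight and the working response, both of which follow from the chosen g(.) and V(.). Swapping in a different link and variance function gives the fit for a different member of the family, which is the sense in which one training procedure covers them all.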
Finally, a word of caution: Similar to Classical Linear Regression models, GLMs also assume that the observations are independent of each other, and that the regression variables are not strongly correlated with one another. Therefore a standard GLM is a poor fit for time series data, which typically contain a lot of auto-correlated observations.
Generalized Linear Models should not be used for modeling auto-correlated time series data.
Generalized Linear Models bring together under one estimation umbrella a wide range of different regression models, such as Classical Linear models, various models for count data, and survival models.
Here is a synopsis of things to remember about GLMs:
- GLMs deftly side-step several strong requirements of classical linear models such as additiveness of effects, homoscedasticity of data and normality of residual errors.
- GLMs impose a common functional form on all models in the GLM family which consists of a link function g(µ|X=x) that allows you to express the transformed conditional mean of the dependent variable y as a linear combination of the regression variables X.
- GLMs require the specification of a suitable variance function V(µ|X=x) for expressing the conditional variance in the data as a function of the conditional mean. What form V(.) takes depends on the probability distribution that you assume for the dependent variable y in your data set.
- GLMs do not care about the distributional form of the error term, thereby making them a practical choice for many real world data sets.
- GLMs do assume that the observations are independent of each other and that the regression variables X are not strongly correlated, thereby making standard GLMs unsuitable for modeling auto-correlated time series data.