Collinearity is the pebble that inevitably ends up in the shoe of any applied econometrician. The good news is that econometrics software takes that pebble out of your shoe for you. The bad news is that sometimes this pebble is informative and tells you that something is wrong with the model you are estimating.
If you know what collinearity is, how software deals with it, and how it affects your coefficients, you're fine! If you're not described by the previous sentence, then there are two cases when you encounter collinearity. Either you're lucky, and nothing happens, or just a few unimportant coefficients become meaningless. Or you're unlucky and… well, your main results are wrong and your paper should be retracted.
Without further ado, let's try to remove chance from the econometrics process and understand what collinearity is and how to deal with it.
What collinearity is
This document’s definitions
Definitions. We are in the context of an OLS estimation. We have one dependent variable represented by the $n$-vector $y$, and $K$ explanatory variables represented by the $n \times K$ matrix $X$. The term variable can be used interchangeably with the term column (referring to the columns of $X$).
Collinearity. There is presence of collinearity if at least one column of the matrix $X$ is a linear combination of the other columns. Alternatively, using a more sophisticated language, there is collinearity whenever the rank of $X$ is strictly lower than $K$.
Collinear set. The collinear set is the set of all columns that are a linear combination of the other columns.
Regular set. The regular set is the set of all columns that are not a linear combination of the other columns.
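To make these definitions concrete, here is a minimal sketch in R with made-up data and hypothetical variable names: the column x3 is a linear combination of x1 and x2, so the rank of the matrix is strictly lower than its number of columns.
# A small made-up example of collinearity
set.seed(1)
x1 = rnorm(10)
x2 = rnorm(10)
x3 = 2 * x1 - x2              # x3 is a linear combination of x1 and x2
X = cbind(x1, x2, x3)
# The rank (2) is strictly lower than the number of columns (3)
qr(X)$rank
ncol(X)
# Note that x1, x2 and x3 all belong to the collinear set: each one
# can be written as a linear combination of the other two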
Why collinearity can be a problem
Technical reason
The OLS estimator is:
$$\hat{\beta} = (X'X)^{-1} X'y$$
In the presence of collinearity the square matrix $X'X$ is not of full rank, and hence is not invertible. This is like dividing by 0: you can't, because it does not make sense. Here you cannot invert $X'X$, so $\hat{\beta}$ is not defined.
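As a minimal sketch (made-up data), here is what that failure looks like: with a column that is the sum of two others, $X'X$ cannot be inverted.
# A made-up design matrix where the last column is x1 + x2
set.seed(1)
x1 = rnorm(10)
x2 = rnorm(10)
X = cbind(constant = 1, x1, x2, x3 = x1 + x2)
XtX = crossprod(X)            # this is X'X
# Inverting X'X fails: the matrix is singular
try(solve(XtX))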
Substantial reason
A good theorem to understand how OLS works is the Frisch-Waugh-Lovell (FWL) theorem.
Assume your set of explanatory variables is split in two: the matrix $X_1$ with $K_1$ columns and the matrix $X_2$ with $K_2$ columns (we have $K_1 + K_2 = K$). Note that you can split the variables in any way you want. The model you want to estimate is:
$$y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon$$
Define $\tilde{y}$ as the residual of the regression of $y$ on $X_2$, and $\tilde{X}_1$ as the matrix where each column is the residual of the regression of the corresponding column of $X_1$ on $X_2$. Consider the following estimation:
$$\tilde{y} = \tilde{X}_1 \gamma_1 + \tilde{\varepsilon}$$
Then the theorem states that $\hat{\gamma}_1 = \hat{\beta}_1$ and that the residuals of the two regressions are the same. You can find an illustration below.
# Illustration of the FWL theorem's magic
# We use the `iris` data set
base = setNames(iris, c("y", "x", "z1", "z2", "species"))
library(fixest)
# The main estimation, we're only interested in `x`'s coefficient
est = feols(y ~ x + z1 + z2, base)
# We estimate both `y` and `x` on the other explanatory variables
# and get the matrix of residuals
resids = feols(c(y, x) ~ z1 + z2, base) |> resid()
# We estimate y's residuals on x's residuals
est_fwl = feols.fit(resids[, 1], resids[, 2])
# We compare the estimates: they are identical
# The standard errors are also the same, modulo a constant factor
etable(est, est_fwl, order = "x|resid")
#> est est_fwl
#> Dependent Var.: y resids[,1]
#>
#> x 0.6508*** (0.0667)
#> resids[,2] 0.6508*** (0.0660)
#> Constant 1.856*** (0.2508)
#> z1 0.7091*** (0.0567)
#> z2 -0.5565*** (0.1275)
#> _______________ ___________________ __________________
#> S.E. type IID IID
#> Observations 150 150
#> R2 0.85861 0.39104
#> Adj. R2 0.85571 0.39104
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now let us use the theorem to understand collinearity. Take one variable from the collinear set as the focal variable. By the definition of collinearity, the residual of the regression of this variable on all the other variables is an $n$-vector of 0s. Indeed, since it is a linear combination of the other variables in the collinear set, it can be perfectly explained by them and there is no residual left.
Using the FWL theorem, this means that the coefficient of this variable in the main regression is the same as the coefficient from regressing $\tilde{y}$ (the residualized dependent variable) on a variable constantly equal to 0! This makes no sense, right? And this is true for any variable in the collinear set!
But what does that mean for variables in the regular set? For any such variable, the residual on all other variables contains variation, and hence the coefficient is well defined.
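Here is a quick sketch of that logic, reusing the base data from the FWL illustration above and adding a manufactured collinear column (z3 is hypothetical, built as a combination of z1 and z2): its residual on the other explanatory variables is numerically zero, so there is nothing left to identify its coefficient.
# z3 is a linear combination of z1 and z2: it belongs to the collinear set
base$z3 = base$z1 + 2 * base$z2
# Residual of z3 on all the other explanatory variables
res_z3 = resid(feols(z3 ~ x + z1 + z2, base))
# The residuals are all 0 (up to numerical precision)
max(abs(res_z3))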
When is collinearity a problem?
Assume you have two sets of variables: focal variables for which you want to interpret the coefficients, and control variables for which the coefficients are not interpreted. These sets are represented by the matrices $X_F$ and $X_C$, with $K_F$ and $K_C$ columns respectively.
Now assume there is presence of collinearity and the model cannot be directly estimated. There are two cases.
1) The collinear set contains only control variables. In this case collinearity is not an issue since it affects only variables that are not interpreted. The fix is easy: iteratively drop variables from the collinear set until there is no remaining collinearity (reconstructing the collinear set at each step).
Note, importantly, that collinearity affecting only control variables does not automatically mean everything is fine. You should always be wary: when collinearity shows up in practice although it shouldn't in theory, this is often the sign of a coding mistake (either the raw data is wrong or the processing is flawed). Accept collinearity only when it makes sense in theory (we'll cover examples where collinearity is fine in theory later on).
2) The collinear set contains at least one focal variable. Now we have a problem. Here collinearity can have two origins: i) the model is misspecified, or ii) the focal variables have been wrongly coded. If you are sure that the model is well specified, then check the data and fix how the variables are constructed. Now let’s cover model misspecification.
The model is misspecified
Err… the model is misspecified, what does that mean? It means that you are trying to estimate something that doesn't make sense! It's like accepting the result of the following syllogism:
Start from $a = b$. Let's add $a$ to both sides:
$$2a = a + b$$
then let's subtract $2b$ on both sides:
$$2a - 2b = a - b$$
and finally let's divide both sides by the same number, $a - b$:
$$2 = 1$$
The above equations look right but they’re wrong (notice the last one). Misspecification in an econometric model is the same: it (often) can be estimated, it looks right, but it’s wrong!
The problem with a misspecified model is that there is no technical solution: something is wrong in the model and you need to fix it at the theoretical level. To illustrate this, let us go through a few examples, some obvious, others less so.
Example 1: Gender gap in publications
Let's take the following context: you study the publication process in science and want to quantify the gender gap in terms of publications. You are also interested in the effect of age on publications. You have a panel where scientists are indexed by $i$ and years are indexed by $t$. The dependent variable $\text{pub}_{it}$ is the number of publications, the focal explanatory variables are $\text{woman}_i$, taking the value 1 if the person is a woman and 0 otherwise, and $\text{age}_{it}$, equal to the age of scientist $i$ at time $t$. Finally you want to control for time-invariant characteristics specific to each scientist, and year-specific factors common to all scientists. That means: you include scientist and year fixed-effects.
You end up with the following specification:
$$\text{pub}_{it} = \beta \, \text{woman}_i + \gamma \, \text{age}_{it} + \alpha_i + \delta_t + \varepsilon_{it}$$
The above model is utterly misspecified. Given that the variable $\text{woman}_i$ is time-invariant (we'll allow variation and introduce another problem later), it can be fully explained by the scientist fixed-effects. Hence if you absolutely want to control for scientist fixed-effects: you cannot estimate the gender gap $\beta$. You absolutely can't, live with it. You really need to be aware of that, especially since econometrics software can still give you estimates for $\beta$! Ouch! The behavior of econometrics software is detailed later.
Is that all? Are we done with misspecification? Well, although the variable $\text{age}_{it}$ varies across time and scientists, it is also in the collinear set. We can decompose this variable as follows: $\text{age}_{it} = \text{year}_t - \text{birth}_i$. Now we can see that this variable is a linear combination of a scientist-invariant variable ($\text{year}_t$) and a time-invariant variable ($\text{birth}_i$). This means that the variable $\text{age}_{it}$ can be fully explained by the scientist and year fixed-effects. Hence you cannot estimate the effect of age in the presence of these fixed-effects.
Any solution? No. You need to change your model. In this case, the easy solution is to drop the scientist fixed-effects and accept the criticisms that may arise from not controlling for important time-invariant confounders. To reformulate: if any of your focal variables is in a collinear set, you need to make non-trivial changes to your model specification.
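To see this mechanically, here is a tiny simulated sketch (hypothetical data, not the real example): since age is, by construction, the year minus the birth year, it is fully absorbed by the scientist and year fixed-effects and feols refuses to estimate it.
# Tiny made-up panel: 3 scientists observed over 3 years
pan = expand.grid(id = 1:3, year = 2000:2002)
pan$birth = c(1970, 1980, 1990)[pan$id]
pan$age = pan$year - pan$birth      # age = year - birth year
pan$y = rnorm(nrow(pan))
# age is collinear with the id and year fixed-effects:
# the estimation cannot proceed
try(feols(y ~ age | id + year, pan))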
Example 2: Sneaky misspecification
We use the same context as above: the gender gap in publications. We introduce the following differences. First, the variable $\text{woman}_{it}$ now varies: some scientists can change from man to woman and vice versa. Second, we use the variable $\text{harassed}_{it}$ which is equal to 1 if scientist $i$ reports being harassed in year $t$, and 0 otherwise. This variable contains missing values – this is important.
The model we estimate is:
$$\text{pub}_{it} = \beta \, \text{woman}_{it} + \gamma \, \text{harassed}_{it} + \alpha_i + \delta_t + \varepsilon_{it}$$
This time the model is not misspecified since both $\text{woman}_{it}$ and $\text{harassed}_{it}$ present variation that cannot be explained by scientist and year fixed-effects alone (although the identification of the gender gap will hinge only on the scientists reporting a variation in $\text{woman}_{it}$).
Sneaky misspecification. One problem arises if the variable $\text{harassed}_{it}$ contains missing values and if, after removing the observations with missing values, there is no more within-scientist variation in the variable $\text{woman}_{it}$. In that case, the variable $\text{woman}_{it}$ ends up in the collinear set when the variable $\text{harassed}_{it}$ is included in the model. This is not a misspecified model per se, but purely a data problem: the data does not support the model specification.
In this specific case, one can apply variable imputation to end up with more non-missing values for the variable $\text{harassed}_{it}$ and be able to estimate the model. Or simply not use that variable at all.
The bottom line is that, even if the model is well specified, modifications of the sample, due to data availability for example, can ultimately lead to model misspecification. As mentioned before, the problem is that econometrics software can provide results even when the model is misspecified, so one really needs to be aware of the problem.
Now let us review how econometrics software deal with collinearity.
Econometrics software and collinearity
Most econometrics software, be it in R, Stata, Python, etc., automatically removes collinear variables to perform the estimation.
Importantly, all software routines assume that the user provides a well specified model. In other words, using the terminology defined above, they assume that the focal variables are not in the collinear set. It is the user’s responsibility to ensure that. If this condition is respected, automatic removal of collinear variables is a feature and not a bug.
In this section we first detail how most software removes collinear variables and how routines differ in doing so. We then detail why automatic removal is a feature. We end with a description of the pitfalls of automatic removal.
Detecting collinearity: Automatic variable removal algorithms
In general the process of collinearity fixing by variable removal is as follows:
- for each explanatory variable, from left to right, following the order of the variables given by the user (this is important):
  - test whether the current variable is collinear with the variables previously kept:
    - if collinear: remove it
    - otherwise: keep it and move on to the next variable
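To fix ideas, here is a toy sketch of that left-to-right logic (an illustration only, not the actual algorithm of any specific package): each column is tested, via the rank of a QR decomposition, against the columns kept so far, and dropped if it does not add a new dimension.
# Toy left-to-right collinearity removal (illustration only)
drop_collinear = function(X, tol = 1e-10){
  kept = NULL
  for(j in seq_len(ncol(X))){
    candidate = cbind(kept, X[, j, drop = FALSE])
    # keep the column only if it increases the rank
    if(qr(candidate, tol = tol)$rank == ncol(candidate)){
      kept = candidate
    }
  }
  kept
}
# Example: the third column is the sum of the first two, so it is dropped
set.seed(1)
a = rnorm(5) ; b = rnorm(5)
X = cbind(a = a, b = b, c = a + b)
colnames(drop_collinear(X))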
It is important to stress that we are not talking about mathematical collinearity but about numerical collinearity, which are two different beasts. Indeed, the results of collinearity detection and fixing depend on the routines' algorithms. This means that two different routines can lead to different results due to their diverging algorithms – although this should be the exception rather than the rule.
Let us give two examples. In R, the function lm uses a QR decomposition of the matrix $X$ to solve the OLS problem. Linear dependencies (collinearity) are resolved during the QR decomposition. Differently from lm, the function feols from fixest resolves collinearity during a Cholesky decomposition of the matrix $X'X$. While there are major advantages in terms of speed to this method, it is known to be inferior to the QR method for collinearity detection because using $X'X$ instead of $X$ reduces numerical precision. In the vast majority of cases, however, the two methods are equivalent and provide similar results.
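A way to get a feel for this precision loss (a minimal sketch with made-up data): the condition number of $X'X$ is roughly the square of that of $X$, so any ill-conditioning of the design matrix is amplified when the algorithm works with $X'X$.
# Conditioning of X versus X'X
set.seed(1)
x = rnorm(100)
X = cbind(1, x + 1000)      # a badly scaled design matrix
kappa(X)                    # condition number of X: already large
kappa(crossprod(X))         # roughly its square: much worse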
Collinearity is a numerical problem
To drive home the fact that collinearity fixing in econometrics software is a numerical problem, we give the following example. Consider the OLS estimation of this simple model:
$$y = \alpha + \beta x + \varepsilon$$
where $x$ is a vector equal to 1 for half of the observations and 0 for the others. We generate the data with $\beta = 1$ and $\varepsilon \sim \mathcal{N}(0, 1)$. You know from your econometrics courses that the estimated slope coefficient is invariant to translations of $x$ as long as there is a constant in the model. Let us check that out!
# We generate the data
n = 1e6
n_half = n / 2
df = data.frame(x = rep(0, n))
df$x[1:n_half] = 1
df$y = df$x + rnorm(n)
# we estimate y on x for various translations of x
all_trans = c(0, 10 ** (1:5))
all_results = list()
for(i in seq_along(all_trans)){
trans = all_trans[i]
all_results[[i]] = feols(y ~ I(x + trans), df)
}
# we display the results
etable(all_results)
#> model 1 model 2 model 3
#> Dependent Var.: y y y
#>
#> Constant 0.0013 (0.0014) -9.974*** (0.0210) -99.75*** (0.2009)
#> I(x+0) 0.9975*** (0.0020)
#> I(x+10) 0.9975*** (0.0020)
#> I(x+100) 0.9975*** (0.0020)
#> I(x+1000)
#> I(x+10000)
#> I(x+1e+05)
#> _______________ __________________ __________________ __________________
#> S.E. type IID IID IID
#> Observations 1,000,000 1,000,000 1,000,000
#> R2 0.19936 0.19936 0.19936
#> Adj. R2 0.19936 0.19936 0.19936
#>
#> model 4 model 5 model 6
#> Dependent Var.: y y y
#>
#> Constant -997.5*** (2.000) -9,974.9*** (19.99) -99,749.2*** (199.9)
#> I(x+0)
#> I(x+10)
#> I(x+100)
#> I(x+1000) 0.9975*** (0.0020)
#> I(x+10000) 0.9975*** (0.0020)
#> I(x+1e+05) 0.9975*** (0.0020)
#> _______________ __________________ ___________________ ____________________
#> S.E. type IID IID IID
#> Observations 1,000,000 1,000,000 1,000,000
#> R2 0.19936 0.19936 0.19936
#> Adj. R2 0.19936 0.19936 0.19936
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see, the prediction is true: translations do not change the estimate of $\beta$. But is that always true? What happens if we apply a bigger translation?
# we add 1,000,000 to x
feols(y ~ I(x + 1e6), df)
#> The variable 'I(x + 1e+06)' has been removed because of collinearity (see $collin.var).
#> OLS estimation, Dep. Var.: y
#> Observations: 1,000,000
#> Standard-errors: IID
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.500031 0.001117 447.653 < 2.2e-16 ***
#> ... 1 variable was removed because of collinearity (I(x + 1e+06))
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 1.11701
Ooopsie, our variable is now removed! This happens with the message: "The variable […] removed because of collinearity". What happened? The algorithm could not invert $X'X$, so it removed the variable to be able to carry on. This is purely a numerical problem: remember that numbers are stored with 64 bits, giving at most about 15 digits of precision (so don't be surprised that 1 + 1e16 == 1e16 is TRUE). If the numbers were stored with 128-bit precision, there would be no rounding error and no variable removal.
Now let's see lm's algorithm.
lm(y ~ I(x + 1e6), df) |> coef()
#> (Intercept) I(x + 1e+06)
#> -9.974923e+05 9.974923e-01
lm(y ~ I(x + 1e7), df) |> coef()
#> (Intercept) I(x + 1e+07)
#> 0.500031 NA
Contrary to feols, for a translation of 1,000,000 lm does not remove the variable and gives the right result. However, the behavior becomes similar to that of feols when we apply a translation of 10,000,000. A notable difference between the two routines is that there is no explicit message from lm (the removal is only visible in lm's summary).
Bottom line. Collinearity detection and fixing is a numerical problem, and as such is subject to the usual pitfalls of numerical methods, in particular precision problems. Always be wary when variables are removed: if there is no theoretical reason for the removal, then there might be a data problem. In general, ensuring that the variables have a reasonable numerical range is a good idea. Do not hesitate to re-scale (or re-center) variables with a large range.
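For instance, in the translation example above, simply re-centering the badly translated variable brings it back to a reasonable range and the estimate reappears (a minimal sketch reusing the simulated df data):
# The badly translated variable...
df$x_big = df$x + 1e6
# ...re-centered around 0
df$x_big_centered = df$x_big - mean(df$x_big)
# The variable is no longer removed and the slope estimate is back
feols(y ~ x_big_centered, df)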
When automatic removal is good
Consider the context defined in the model misspecification section, where we follow the output of a panel of scientists. Here we want to see the effect of age on performance, while removing individual-specific confounders and confounders linked to the institution of the scientist. Formally, we estimate the following model:
$$\text{pub}_{ist} = \beta \, \text{age}_{it} + \alpha_i + \mu_s + \varepsilon_{ist}$$
with $i$ indexing the scientist, $s$ the institution and $t$ the year. Our coefficient of interest is $\beta$ and we include scientist and institution fixed-effects. For this example we use a built-in data set of fixest: base_pub, which corresponds to a small panel following the output of researchers across time (this panel is constructed from Microsoft Academic Graph data). Although feols does have a specific way to handle fixed-effects, for illustration we include the two fixed-effects as dummies using the function i():
data(base_pub, package = "fixest")
## The model:
feols(nb_pub ~ age + i(author_id) + i(affil_id), base_pub)
#> The variables 'affil_id::6902469', 'affil_id::9217761', 'affil_id::27504731',
#> 'affil_id::39965400', 'affil_id::43522216', 'affil_id::47301684' and 45 others have been
#> removed because of collinearity (see $collin.var).
#> OLS estimation, Dep. Var.: nb_pub
#> Observations: 4,024
#> Standard-errors: IID
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -4.700489 2.396759 -1.961185 4.9934e-02 *
#> age 0.047252 0.006213 7.605218 3.6032e-14 ***
#> author_id::90561406 -1.458487 0.902767 -1.615574 1.0627e-01
#> author_id::94862465 -3.390346 1.862776 -1.820050 6.8834e-02 .
#> author_id::168896994 0.473991 2.447235 0.193684 8.4643e-01
#> author_id::217986139 -0.133319 1.734549 -0.076861 9.3874e-01
#> author_id::226108609 0.179560 2.021085 0.088843 9.2921e-01
#> author_id::231631639 2.799524 3.110143 0.900127 3.6811e-01
#> ... 397 coefficients remaining (display them with summary() or use argument n)
#> ... 51 variables were removed because of collinearity (affil_id::6902469,
#> affil_id::9217761 and 49 others [full set in $collin.var])
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 2.21108 Adj. R2: 0.685792
As the output clearly shows, 51 variables have been removed to fix collinearity. In this case, this is perfectly fine since our variable of interest, age, is not in the collinear set. Many variables are removed because there is a lot of overlap between the affiliations and the researchers: many researchers have only a few different affiliations across their career, and many affiliations have just one researcher (remember that we deal with a small random sample here).
Note that the recommended way to include fixed-effects in feols is by adding the fixed-effects variables after a pipe. That way the FWL theorem is leveraged and all variables are first residualized on the fixed-effects (see the FWL section). By handling the fixed-effects specifically, the output becomes much clearer and you are sure that the variable age is not in the collinear set (see the next section for why this is important):
feols(nb_pub ~ age | author_id + affil_id, base_pub, vcov = "iid")
#> OLS estimation, Dep. Var.: nb_pub
#> Observations: 4,024
#> Fixed-effects: author_id: 200, affil_id: 256
#> Standard-errors: IID
#> Estimate Std. Error t value Pr(>|t|)
#> age 0.047252 0.006257 7.55144 5.4359e-14 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 2.21108 Adj. R2: 0.681301
#> Within R2: 0.015731
Coming back to our first question: why is automatic variable removal a feature? In this example the scientist and affiliation fixed-effects are just controls and the researcher has no interest in interpreting them. Therefore collinearity is not an issue here and it does not matter which variable is removed to fix it.
Automatic variable removal is a critical feature, especially for routines that do not handle fixed-effects specifically, like lm. If there were no automatic variable removal in lm, you would not be able to estimate your model even though it is well specified!
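As a sketch of what this looks like with lm (which keeps all the terms of the formula but reports the collinear dummies with NA coefficients), the same well specified model can still be estimated, and the coefficient on age is unaffected by which dummies end up being dropped:
# Same well specified model with lm: collinear dummies get NA coefficients
est_lm = lm(nb_pub ~ age + factor(author_id) + factor(affil_id), base_pub)
# Number of coefficients set to NA because of collinearity
sum(is.na(coef(est_lm)))
# The coefficient of interest is well defined
coef(est_lm)["age"]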
When automatic removal bites
Automatic variable removal can be a problem when your model is misspecified and you haven’t noticed it.
We keep using the example on publications. Remember from the section discussing misspecification that the following model is misspecified:
$$\text{pub}_{it} = \beta \, \text{woman}_i + \gamma \, \text{age}_{it} + \alpha_i + \delta_t + \varepsilon_{it}$$
as both $\text{woman}_i$ and $\text{age}_{it}$ are in the collinear set.
We estimate this model anyway by including the scientist and year fixed-effects as dummy variables:
feols(nb_pub ~ is_woman + age + i(author_id) + i(year), base_pub)
#> The variables 'author_id::2747123765' and 'year::2000' have been removed because of
#> collinearity (see $collin.var).
#> OLS estimation, Dep. Var.: nb_pub
#> Observations: 4,024
#> Standard-errors: IID
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.224328 2.203459 1.463303 0.14347
#> is_woman -0.673406 1.624295 -0.414583 0.67847
#> age 0.046843 0.045423 1.031271 0.30248
#> author_id::90561406 -1.028373 1.093804 -0.940180 0.34719
#> author_id::94862465 -1.953734 0.985021 -1.983444 0.04739 *
#> author_id::168896994 -1.449938 0.914733 -1.585094 0.11303
#> author_id::217986139 -1.576761 0.923925 -1.706591 0.08798 .
#> author_id::226108609 -0.568410 1.171480 -0.485207 0.62756
#> ... 242 coefficients remaining (display them with summary() or use argument n)
#> ... 2 variables were removed because of collinearity (author_id::2747123765 and
#> year::2000)
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 2.96683 Adj. R2: 0.457524
It just worked! We have coefficients for is_woman and age! Good news, right? We thought our model was misspecified but in fact it's not: we can estimate the gender gap with fixed-effects! Or is it?
Well, of course this is just a dream. If the model is misspecified in theory, it remains misspecified in practice: the estimates for is_woman and age do exist but cannot be interpreted. To better see why, let's replicate the previous estimation while cosmetically modifying the fixed-effects:
# same estimation as above
est_num = feols(nb_pub ~ is_woman + age + i(author_id) + i(year), base_pub)
#> The variables 'author_id::2747123765' and 'year::2000' have been removed because of
#> collinearity (see $collin.var).
# we create `author_id_char`: same as `author_id` but in character form
base_pub$author_id_char = as.character(base_pub$author_id)
# replacing `author_id` with `author_id_char`: both variables contain the same information
est_char = feols(nb_pub ~ is_woman + age + i(author_id_char) + i(year), base_pub)
#> The variables 'author_id_char::731914895' and 'year::2000' have been removed because of
#> collinearity (see $collin.var).
etable(est_num, est_char, keep = "woman|age")
#> est_num est_char
#> Dependent Var.: nb_pub nb_pub
#>
#> is_woman -0.6734 (1.624) 1.729 (3.174)
#> age 0.0468 (0.0454) 0.0468 (0.0454)
#> _______________ _______________ _______________
#> S.E. type IID IID
#> Observations 4,024 4,024
#> R2 0.49110 0.49110
#> Adj. R2 0.45752 0.45752
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that although the two estimations are identical in principle, the coefficient associated with the variable is_woman varies widely. But the results should not depend on whether you store a categorical variable in character or numeric form!
The difference is due to the fact that the algorithm removes the variables using the order given by the user. Storing the variable author_id in numeric or character form leads to a different ordering of the categorical values, and hence a different variable is removed.
Shouldn't there be an error from the software? As seen in the previous section, automatic removal of collinear variables is useful. The fundamental problem is that the routines: i) can't compute the collinear sets (this is very costly), and ii) don't know which variables are the ones of interest. Hence they assume that the user is knowledgeable and provides a well specified model, for which collinearity is not a problem.
What can the user do? Here’s a trick. To check whether your variables of interest are in a collinear set, simply place them last:
est_last = feols(nb_pub ~ i(author_id) + i(year) + is_woman + age, base_pub)
#> The variables 'is_woman' and 'age' have been removed because of collinearity (see
#> $collin.var).
Now the two variables of interest are out! The sign of model misspecification couldn’t be clearer!
If you use routines that handle fixed-effects specifically, like feols, try to treat all your categorical variables as fixed-effects. This way you have the guarantee that the variables of interest are not in the same collinear set as the fixed-effects. Here is the same example with explicit fixed-effects:
feols(nb_pub ~ is_woman + age | author_id + year, base_pub)
#> Error: in feols(nb_pub ~ is_woman + age | author_id + year,...:
#> All variables, 'is_woman' and 'age', are collinear with the fixed effects. Without
#> doubt, your model is misspecified.
The routine detects that the variables 'is_woman' and 'age' are collinear with the fixed-effects and removes them from the estimation. Since no variable remains, the estimation makes no sense and an error is thrown: a clear suggestion that your model is misspecified.