User-level access to the internal demeaning algorithm of fixest.
Usage
demean(
  X,
  f,
  slope.vars,
  slope.flag,
  data,
  weights,
  nthreads = getFixest_nthreads(),
  notes = getFixest_notes(),
  iter = 2000,
  tol = 1e-06,
  fixef.reorder = TRUE,
  fixef.algo = NULL,
  na.rm = TRUE,
  as.matrix = is.atomic(X),
  im_confident = FALSE,
  ...
)
Arguments
- X
A matrix, vector, data.frame or list, OR a formula, OR a feols estimation. If a formula, then the argument data is required, and the formula must be of the form x1 + x2 ~ f1 + fe2, with the variables to be centered on the LHS and the factors used for centering on the RHS. Note that you can use variables with varying slopes with the syntax fe[v1, v2] (see details in feols). If a feols estimation, all variables (LHS + RHS) are demeaned and then returned (only if it was estimated with fixed-effects). Otherwise, it must represent the data to be centered, and its number of observations must match that of the factors used for centering (argument f).
- f
A matrix, vector, data.frame or list. The factors used to center the variables in argument X. Matrices will be coerced using as.data.frame.
- slope.vars
A vector, matrix or list representing the variables with varying slopes. Matrices will be coerced using as.data.frame. Note that if this argument is used, it MUST be used in conjunction with the argument slope.flag, which maps the varying slopes to the factors they are attached to. See examples.
- slope.flag
An integer vector of the same length as the number of variables in f (the factors used for centering). It indicates, for each factor, the number of variables with varying slopes associated with it. Positive values mean that the raw factor should also be included in the centering, negative values that it should be excluded. This is admittedly complicated; the examples should make it clearer.
- data
A data.frame containing all variables in the argument X. Only used if X is a formula, in which case data is mandatory.
- weights
Vector, can be missing or NULL. If present, it must contain the same number of observations as X.
- nthreads
Number of threads to be used. By default it is equal to getFixest_nthreads().
- notes
Logical, whether to display a message when NA values are removed. By default it is equal to getFixest_notes().
- iter
Number of iterations, default is 2000.
- tol
Stopping criterion of the algorithm. Default is 1e-6. The algorithm stops when the maximum absolute increase in the coefficient values is lower than tol.
- fixef.reorder
Logical, default is TRUE. Whether to reorder the fixed-effects by frequencies before feeding them into the algorithm. If FALSE, the original fixed-effects order provided by the user is maintained. In general, reordering leads to faster and more precise performance.
- fixef.algo
NULL (default) or an object of class demeaning_algo obtained with the function demeaning_algo. If NULL, it falls back to the defaults of demeaning_algo. This argument controls the settings of the demeaning algorithm. Only change it if convergence is slow: look at the slot $iterations, and if any value is over 50, it may be worth experimenting with it. Please read the documentation of the function demeaning_algo. Be aware that there is no clear guidance on how to change the settings; it is more a matter of trial and error.
- na.rm
Logical, default is TRUE. If TRUE and the input data contains any NA value, then any observation with NA will be discarded, leading to an output with fewer observations than the input. If FALSE and NAs are present, the output will also be filled with NAs for each NA observation in the input.
- as.matrix
Logical. If TRUE, a matrix is returned; if FALSE, a data.frame. The default depends on the input: if X is atomic, a matrix is returned.
- im_confident
Logical, default is FALSE. FOR EXPERT USERS ONLY! This argument allows skipping some of the preprocessing of the input arguments. If TRUE, then X MUST be a numeric vector/matrix/list (not a formula!), f MUST be a list, slope.vars MUST be a list, slope.vars MUST be consistent with slope.flag, and weights, if given, MUST be numeric (not integer!). Further, there MUST NOT be any NA value, and the number of observations of each element MUST be consistent. Non-compliance with these rules may simply crash your R session.
- ...
Not currently used.
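The roles of iter and tol can be illustrated with a minimal base R sketch of demeaning by alternating projections on two factors. This is an illustration only, not the fixest implementation: the helper demean_sketch and the simulated data are hypothetical, and the stopping rule is one plausible reading of the tol description above.

```r
# Hypothetical helper (NOT part of fixest): centers x on two factors by
# alternately removing group means until the updates fall below tol.
demean_sketch = function(x, f1, f2, iter = 2000, tol = 1e-6) {
  f1 = as.factor(f1)
  f2 = as.factor(f2)
  for (i in seq_len(iter)) {
    # Group means of the current residual: the increment to the f1 coefficients
    inc1 = ave(x, f1)
    x = x - inc1
    # Same step for the second factor
    inc2 = ave(x, f2)
    x = x - inc2
    # Stop once the largest absolute coefficient update is below tol
    if (max(abs(inc1), abs(inc2)) < tol) break
  }
  list(x = x, iterations = i)
}

# Simulated data (assumption, for illustration only)
set.seed(1)
n = 200
f1 = sample(5, n, replace = TRUE)
f2 = sample(8, n, replace = TRUE)
x = rnorm(n) + f1 - 0.5 * f2

res = demean_sketch(x, f1, f2)

# Reference: residuals from regressing x on both sets of dummies
ref = resid(lm(x ~ factor(f1) + factor(f2)))

# Largest discrepancy between the iterative and the exact solution
max(abs(res$x - ref))
res$iterations
```

A smaller tol brings the iterative result closer to the exact projection at the cost of more iterations, while iter caps the number of passes if convergence is slow.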
Value
It returns a data.frame with as many columns as there are variables to be centered.
If na.rm = TRUE (the default), the number of rows equals the number of rows in the input minus the number of observations containing NA values (in X, f, slope.vars or weights). If na.rm = FALSE, the output has the same number of observations as the input (filled with NAs where appropriate).
A matrix is returned instead if as.matrix = TRUE.
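As a hedged illustration of how na.rm affects the shape of the output, the sketch below injects two NA values into the iris data (an assumption made purely for demonstration) and compares the row counts of the two calls. Based on the argument description, the first call is expected to drop the two incomplete observations and the second to keep every row.

```r
library(fixest)

base = iris
names(base) = c("y", "x1", "x2", "x3", "species")

# Two artificial missing values in the first variable to center
base$y[c(3, 10)] = NA

X = base[, c("y", "x1")]

# na.rm = TRUE (default): observations containing an NA are discarded
dm_drop = demean(X = X, f = base$species, na.rm = TRUE)

# na.rm = FALSE: all rows are kept, with NAs where the input had them
dm_keep = demean(X = X, f = base$species, na.rm = FALSE)

nrow(dm_drop)
nrow(dm_keep)
```

Keeping na.rm = FALSE is convenient when the centered variables are assigned back into the original data.frame, since the row counts then line up.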
Varying slopes
You can add variables with varying slopes in the fixed-effect part of the formula. The syntax is as follows: fixef_var[var1, var2]. Here the variables var1 and var2 will have varying slopes (one slope per value in fixef_var) and the fixed-effect fixef_var will also be added.
To add only the variables with varying slopes and not the fixed-effect, use double square brackets: fixef_var[[var1, var2]].
In other words:
- fixef_var[var1, var2] is equivalent to fixef_var + fixef_var[[var1]] + fixef_var[[var2]]
- fixef_var[[var1, var2]] is equivalent to fixef_var[[var1]] + fixef_var[[var2]]
In general, for convergence reasons, it is recommended to always add the fixed-effect and avoid using only the variable with varying slope (i.e. use single square brackets).
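The difference between single and double square brackets can be sketched in base R by residualizing on the corresponding regressors with lm. This is an analogy under the assumption that centering amounts to an unweighted least-squares projection, not a description of how fixest computes it; the object names y_single and y_double are illustrative.

```r
base = iris
names(base) = c("y", "x1", "x2", "x3", "species")

# Analogue of species[x2]: species intercepts plus one x2 slope per species
y_single = resid(lm(y ~ species + species:x2, base))

# Analogue of species[[x2]]: only the species-specific x2 slopes, no intercepts
y_double = resid(lm(y ~ 0 + species:x2, base))

# With the intercepts, the centered variable averages to zero within each
# species; without them, it generally does not
tapply(y_single, base$species, mean)
tapply(y_double, base$species, mean)
```

In both cases the centered variable is orthogonal to x2 within each species; only the single-bracket version also removes the species means.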
Examples
# Illustration of the FWL theorem
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
                  f = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
est = feols(ln_euros_dm ~ ln_dist_dm, base)
est_fe = feols(ln_euros ~ ln_dist | Origin + Destination, base)
# The results are the same as if we used the two factors
# as fixed-effects
etable(est, est_fe, se = "st")
#>                                est             est_fe
#> Dependent Var.:        ln_euros_dm           ln_euros
#>
#> Constant        -6.89e-14 (0.0116)
#> ln_dist_dm      -2.072*** (0.0271)
#> ln_dist                            -2.072*** (0.0271)
#> Fixed-Effects:  ------------------ ------------------
#> Origin                          No                Yes
#> Destination                     No                Yes
#> _______________ __________________ __________________
#> S.E. type                      IID                IID
#> Observations                38,325             38,325
#> R2                         0.13218            0.50428
#> Within R2                       --            0.13218
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Variables with varying slopes
#
# You can center on factors but also on variables with varying slopes
# Let's have an illustration
base = iris
names(base) = c("y", "x1", "x2", "x3", "species")
#
# We center y and x1 on species and x2 * species
# using a formula
base_dm = demean(y + x1 ~ species[x2], data = base)
# using vectors
base_dm_bis = demean(X = base[, c("y", "x1")], f = base$species,
                     slope.vars = base$x2, slope.flag = 1)
# Let's look at the equivalences
res_vs_1 = feols(y ~ x1 + species + x2:species, base)
res_vs_2 = feols(y ~ x1, base_dm)
res_vs_3 = feols(y ~ x1, base_dm_bis)
# only the small sample adj. differ in the SEs
etable(res_vs_1, res_vs_2, res_vs_3, keep = "x1")
#>                           res_vs_1           res_vs_2           res_vs_3
#> Dependent Var.:                  y                  y                  y
#>
#> x1              0.4500*** (0.0806) 0.4500*** (0.0792) 0.4500*** (0.0792)
#> _______________ __________________ __________________ __________________
#> S.E. type                      IID                IID                IID
#> Observations                   150                150                150
#> R2                         0.86900            0.17894            0.17894
#> Adj. R2                    0.86351            0.17340            0.17340
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# center on x2 * species and on another FE
base$fe = rep(1:5, 10)
# using a formula => double square brackets!
base_dm = demean(y + x1 ~ fe + species[[x2]], data = base)
# using vectors => note slope.flag!
base_dm_bis = demean(X = base[, c("y", "x1")], f = base[, c("fe", "species")],
                     slope.vars = base$x2, slope.flag = c(0, -1))
# Explanations slope.flag = c(0, -1):
# - the first 0: the first factor (fe) is associated to no variable
# - the "-1":
#    * |-1| = 1: the second factor (species) is associated to ONE variable
#    * -1 < 0: the second factor should not be included as such
# Let's look at the equivalences
res_vs_1 = feols(y ~ x1 + i(fe) + x2:species, base)
res_vs_2 = feols(y ~ x1, base_dm)
res_vs_3 = feols(y ~ x1, base_dm_bis)
# only the small sample adj. differ in the SEs
etable(res_vs_1, res_vs_2, res_vs_3, keep = "x1")
#>                           res_vs_1           res_vs_2           res_vs_3
#> Dependent Var.:                  y                  y                  y
#>
#> x1              0.4915*** (0.0839) 0.4915*** (0.0819) 0.4915*** (0.0819)
#> _______________ __________________ __________________ __________________
#> S.E. type                      IID                IID                IID
#> Observations                   150                150                150
#> R2                         0.85514            0.19580            0.19580
#> Adj. R2                    0.84693            0.19037            0.19037
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1