Centers a set of variables around a set of factors

User-level access to the internal demeaning algorithm of fixest.

Usage

demean(
  X,
  f,
  slope.vars,
  slope.flag,
  data,
  weights,
  nthreads = getFixest_nthreads(),
  notes = getFixest_notes(),
  iter = 2000,
  tol = 1e-06,
  fixef.reorder = TRUE,
  fixef.algo = NULL,
  na.rm = TRUE,
  as.matrix = is.atomic(X),
  im_confident = FALSE,
  ...
)

Arguments

X: A matrix, vector, data.frame or list, OR a formula, OR a feols estimation. If a formula, then the argument data is required and the formula must be of the type x1 + x2 ~ f1 + f2, with the variables to be centered on the LHS and the factors used for centering on the RHS. Variables with varying slopes can be included with the syntax fe[v1, v2] (see details in feols). If a feols estimation, all variables (LHS + RHS) are demeaned and then returned (only if it was estimated with fixed-effects). Otherwise, X must contain the data to be centered, and its number of observations must be the same as that of the factors used for centering (argument f).
f: A matrix, vector, data.frame or list. The factors used to center the variables in argument X. Matrices will be coerced using as.data.frame.
slope.vars: A vector, matrix or list representing the variables with varying slopes. Matrices will be coerced using as.data.frame. Note that if this argument is used it MUST be in conjunction with the argument slope.flag that maps the factors to which the varying slopes are attached. See examples.
slope.flag: An integer vector with one element per variable in f (the factors used for centering). For each factor, the absolute value gives the number of variables with varying slopes associated with that factor. A positive value means that the raw factor is also included in the centering, a negative value that it is excluded. A value of 0 means that no variable with varying slopes is attached to the factor. See the examples for an illustration.
data: A data.frame containing all variables in the argument X. Only used if X is a formula, in which case data is mandatory.
weights: Vector, can be missing or NULL. If present, it must contain the same number of observations as in X.
nthreads: Number of threads to be used. By default it is equal to getFixest_nthreads().
notes: Logical, whether to display a message when NA values are removed. By default it is equal to getFixest_notes().
iter: Number of iterations, default is 2000.
tol: Stopping criterion of the algorithm. Default is 1e-6. The algorithm stops when the maximum absolute increase in the coefficient values is lower than tol.
fixef.reorder: Logical, default is TRUE. Whether to reorder the fixed-effects by frequencies before feeding them into the algorithm. If FALSE, the original fixed-effects order provided by the user is maintained. In general, reordering leads to faster and more precise performance.
fixef.algo: NULL (default) or an object of class demeaning_algo obtained with the function demeaning_algo. If NULL, it falls back to the defaults of demeaning_algo. This argument controls the settings of the demeaning algorithm. Only change it if the convergence is slow: look at the slot $iterations, and if any value is over 50, it may be worth experimenting with it. Please read the documentation of the function demeaning_algo. Be aware that there is no clear guidance on how to change the settings; it is more a matter of trial and error.
na.rm: Logical, default is TRUE. If TRUE and the input data contains any NA value, then any observation with an NA is discarded, leading to an output with fewer observations than the input. If FALSE and NAs are present, the output has the same number of observations as the input, with NAs for each observation that contains an NA in the input.
as.matrix: Logical, if TRUE a matrix is returned, if FALSE it will be a data.frame. The default depends on the input, if atomic then a matrix will be returned.
im_confident: Logical, default is FALSE. FOR EXPERT USERS ONLY! This argument allows skipping some of the preprocessing of the arguments given in input. If TRUE, then X MUST be a numeric vector/matrix/list (not a formula!), f MUST be a list, slope.vars MUST be a list, slope.vars MUST be consistent with slope.flag, and weights, if given, MUST be numeric (not integer!). Further, there MUST NOT be any NA value, and the number of observations of each element MUST be consistent. Non-compliance with these rules may crash your R session.
...: Not currently used.
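
As a sanity check on the arguments X, f and weights: with a single factor and no varying slopes, centering amounts to subtracting group means (weighted group means when weights is supplied). The sketch below compares demean with a base R computation on the iris data. It assumes the fixest package is installed; the names x, f, w, manual and manual_w are illustrative only.

```r
library(fixest)

x = iris$Sepal.Length   # variable to center
f = iris$Species        # single factor used for centering
w = iris$Petal.Width    # strictly positive numeric weights (illustrative choice)

# Unweighted: subtract the group mean of x within each level of f
dm = demean(X = x, f = f)
manual = x - ave(x, f)

# Weighted: subtract the weighted group mean of x within each level of f
dm_w = demean(X = x, f = f, weights = w)
manual_w = x - ave(x * w, f, FUN = sum) / ave(w, f, FUN = sum)

# X is atomic here, so a one-column matrix is returned by default;
# as.numeric() drops the dimensions before comparing
print(all.equal(as.numeric(dm), manual))
print(all.equal(as.numeric(dm_w), manual_w))
```

With two or more factors there is no such closed form, which is where the iterative algorithm (arguments iter and tol) comes into play.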

Value

It returns a data.frame with as many columns as there are variables to be centered.

If na.rm = TRUE (the default), the number of rows is equal to the number of rows in input minus the number of observations with NA values (contained in X, f, slope.vars or weights). If na.rm = FALSE, the output has the same number of observations as the input (filled with NAs where appropriate).

A matrix can be returned if as.matrix = TRUE.
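
The effect of na.rm and as.matrix on the shape of the output can be inspected directly. The following is a minimal sketch, assuming the fixest package is installed; the NA is injected by hand for illustration and the object names are hypothetical.

```r
library(fixest)

x = iris$Sepal.Length
x[3] = NA               # inject one missing value (illustration only)
f = iris$Species

# na.rm = TRUE (default): the observation with the NA is dropped
dm_drop = demean(X = x, f = f, notes = FALSE)

# na.rm = FALSE: the output keeps the input length, with an NA in row 3
dm_keep = demean(X = x, f = f, na.rm = FALSE, notes = FALSE)

# A data.frame input is not atomic, so the default as.matrix is FALSE
dm_df = demean(X = data.frame(x = x), f = f, na.rm = FALSE, notes = FALSE)

# Number of rows and class of each result
print(c(drop = NROW(dm_drop), keep = NROW(dm_keep), df = NROW(dm_df)))
print(c(is.matrix(dm_keep), is.data.frame(dm_df)))
```

Keeping na.rm = FALSE is convenient when the centered variables are assigned back to the original data set, since the rows stay aligned.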

Varying slopes

You can add variables with varying slopes in the fixed-effect part of the formula. The syntax is as follows: fixef_var[var1, var2]. Here the variables var1 and var2 will be with varying slopes (one slope per value in fixef_var) and the fixed-effect fixef_var will also be added.

To add only the variables with varying slopes and not the fixed-effect, use double square brackets: fixef_var[[var1, var2]].

In other words:

fixef_var[var1, var2] is equivalent to fixef_var + fixef_var[[var1]] + fixef_var[[var2]]
fixef_var[[var1, var2]] is equivalent to fixef_var[[var1]] + fixef_var[[var2]]

In general, for convergence reasons, it is recommended to always add the fixed-effect and avoid using only the variable with varying slope (i.e. use single square brackets).
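
To make the difference between single and double square brackets concrete, the sketch below compares the centered variable with residuals from an equivalent lm fit: species[x2] projects out the species dummies and the species-specific slopes of x2, while species[[x2]] projects out only the slopes. It assumes the fixest package is installed and reuses the renamed iris data from the examples; the lm comparisons are an illustration added here, not part of the package.

```r
library(fixest)

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))

# Single brackets: fixed-effect species + one slope of x2 per species
dm_fe_slope = demean(y ~ species[x2], data = base)
res_fe_slope = resid(lm(y ~ species + x2:species, data = base))

# Double brackets: only the slopes of x2 per species, no fixed-effect
dm_slope_only = demean(y ~ species[[x2]], data = base)
res_slope_only = resid(lm(y ~ 0 + x2:species, data = base))

# unname() drops the observation names carried by resid()
print(all.equal(dm_fe_slope$y, unname(res_fe_slope), tolerance = 1e-4))
print(all.equal(dm_slope_only$y, unname(res_slope_only), tolerance = 1e-4))
```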

Examples


# Illustration of the FWL theorem
data(trade)

base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)

# We center the two variables ln_dist and ln_euros
#  on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
                  f = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean

est = feols(ln_euros_dm ~ ln_dist_dm, base)
est_fe = feols(ln_euros ~ ln_dist | Origin + Destination, base)

# The results are the same as if we used the two factors
# as fixed-effects
etable(est, est_fe, se = "st")
#>                                est             est_fe
#> Dependent Var.:        ln_euros_dm           ln_euros
#>                                                      
#> Constant        -6.89e-14 (0.0116)                   
#> ln_dist_dm      -2.072*** (0.0271)                   
#> ln_dist                            -2.072*** (0.0271)
#> Fixed-Effects:  ------------------ ------------------
#> Origin                          No                Yes
#> Destination                     No                Yes
#> _______________ __________________ __________________
#> S.E. type                      IID                IID
#> Observations                38,325             38,325
#> R2                         0.13218            0.50428
#> Within R2                       --            0.13218
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#
# Variables with varying slopes
#

# You can center on factors but also on variables with varying slopes

# Let's have an illustration
base = iris
names(base) = c("y", "x1", "x2", "x3", "species")

#
# We center y and x1 on species and x2 * species

# using a formula
base_dm = demean(y + x1 ~ species[x2], data = base)

# using vectors
base_dm_bis = demean(X = base[, c("y", "x1")], f = base$species,
                     slope.vars = base$x2, slope.flag = 1)

# Let's look at the equivalences
res_vs_1 = feols(y ~ x1 + species + x2:species, base)
res_vs_2 = feols(y ~ x1, base_dm)
res_vs_3 = feols(y ~ x1, base_dm_bis)

# only the small sample adj. differs in the SEs
etable(res_vs_1, res_vs_2, res_vs_3, keep = "x1")
#>                           res_vs_1           res_vs_2           res_vs_3
#> Dependent Var.:                  y                  y                  y
#>                                                                         
#> x1              0.4500*** (0.0806) 0.4500*** (0.0792) 0.4500*** (0.0792)
#> _______________ __________________ __________________ __________________
#> S.E. type                      IID                IID                IID
#> Observations                   150                150                150
#> R2                         0.86900            0.17894            0.17894
#> Adj. R2                    0.86351            0.17340            0.17340
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#
# center on x2 * species and on another FE

# one value per row of iris (150 rows), written out instead of relying on recycling
base$fe = rep(1:5, 30)

# using a formula => double square brackets!
base_dm = demean(y + x1 ~ fe + species[[x2]], data = base)

# using vectors => note slope.flag!
base_dm_bis = demean(X = base[, c("y", "x1")], f = base[, c("fe", "species")],
                     slope.vars = base$x2, slope.flag = c(0, -1))

# Explanation of slope.flag = c(0, -1):
# - the first 0: the first factor (fe) is associated with no variable
# - the "-1":
#    * |-1| = 1: the second factor (species) is associated with ONE variable
#    *   -1 < 0: the second factor should not be included as such

# Let's look at the equivalences
res_vs_1 = feols(y ~ x1 + i(fe) + x2:species, base)
res_vs_2 = feols(y ~ x1, base_dm)
res_vs_3 = feols(y ~ x1, base_dm_bis)

# only the small sample adj. differs in the SEs
etable(res_vs_1, res_vs_2, res_vs_3, keep = "x1")
#>                           res_vs_1           res_vs_2           res_vs_3
#> Dependent Var.:                  y                  y                  y
#>                                                                         
#> x1              0.4915*** (0.0839) 0.4915*** (0.0819) 0.4915*** (0.0819)
#> _______________ __________________ __________________ __________________
#> S.E. type                      IID                IID                IID
#> Observations                   150                150                150
#> R2                         0.85514            0.19580            0.19580
#> Adj. R2                    0.84693            0.19037            0.19037
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
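
A further case not covered above is a single factor with several variables with varying slopes. The sketch below assumes, following the description of slope.flag, that slope.flag = 2 attaches both columns of slope.vars to the factor and keeps the factor itself; it then compares the result with the formula interface and with lm residuals. It assumes the fixest package is installed, and the object names are hypothetical.

```r
library(fixest)

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))

# Formula interface: fixed-effect species + slopes of x2 and x3 per species
dm_fml = demean(y + x1 ~ species[x2, x3], data = base)

# Vector interface: one factor, two slope variables attached to it,
# positive flag so that the factor itself is also included
dm_vec = demean(X = base[, c("y", "x1")], f = base$species,
                slope.vars = base[, c("x2", "x3")], slope.flag = 2)

# Reference: residuals from the equivalent linear projection
res_y = resid(lm(y ~ species + x2:species + x3:species, data = base))

print(all.equal(dm_fml$y, dm_vec$y, tolerance = 1e-4))
print(all.equal(dm_vec$y, unname(res_y), tolerance = 1e-4))
```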