Takes a variable of any type, transforms it into a factor, and modifies the values of the factor. Useful in estimations when you want to set some value of a vector as the reference.
Arguments
- x
A vector of any type (must be atomic though).
- ref
A vector or a list, or special binning values (explained later). If a vector, it must correspond to (partially matched) values of the vector x. The vector x will be transformed into a factor and these values will be placed first in the levels. That's the main usage of this function. You can also bin the values of x on the fly, using the same syntax as the function bin. To create a new value from old values, use ref = list("new_value"=old_values) with old_values a vector of existing values. You can use .() for list(). It accepts regular expressions, but they must start with an "@", like in ref="@Aug|Dec". It accepts one-sided formulas which must contain the variable x, e.g. ref=list("<2" = ~x < 2). The names of the list are the new names. If the new name is missing, the first value matched becomes the new name. In the name, adding "@d", with d a digit, will relocate the value in position d: useful to change the position of factors. If the vector x is numeric, you can use the special value "bin::digit" to group every digit element. For example if x represents years, using ref="bin::2" creates bins of two years. With any data, using "!bin::digit" groups every digit consecutive values starting from the first value. Using "!!bin::digit" is the same but starting from the last value. With numeric vectors you can: a) use "cut::n" to cut the vector into n equal parts, b) use "cut::a]b[" to create the following bins: [min, a], ]a, b[, [b, max]. The latter syntax is a sequence of number/quartile (q0 to q4)/percentile (p0 to p100) followed by an open or closed square bracket. You can add custom bin names by adding them in the character vector after 'cut::values'. See details and examples. Dot square bracket expansion (see dsb) is enabled.
Value
It returns a factor of the same length as x, where levels have been modified according to the argument ref.
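As a small illustration of the return value and of the "bin::digit" syntax described in the arguments, here is a minimal sketch. It assumes the fixest package is installed; the vector year and its values are made up for the example and are not part of the original documentation.

```r
library(fixest)

# Hypothetical data: ten consecutive years, each observed twice
year = rep(2000:2009, each = 2)

# Group the years into bins of two years
year_bin = ref(year, "bin::2")

# The result is a factor with one element per element of the input
is.factor(year_bin)
length(year_bin) == length(year)

# Inspect how many observations fall in each two-year bin
table(year_bin)
```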
"Cutting" a numeric vector
Numeric vectors can be cut easily into: a) equal parts, b) user-specified bins.
Use "cut::n" to cut the vector into n (roughly) equal parts. Percentiles are used to partition the data, hence some data distributions can yield fewer than n parts (for example if P0 is the same as P50).
The user can specify custom bins with the following syntax: "cut::a]b]c]". Here the numbers a, b, c, etc., are a sequence of increasing numbers, each followed by an open or closed square bracket. The numbers can be specified as either plain numbers (e.g. "cut::5]12[32["), quartiles (e.g. "cut::q1]q3["), or percentiles (e.g. "cut::p10]p15]p90]"). Values of different types can be mixed: "cut::5]q2[p80[" is valid provided the median (q2) is indeed greater than 5, otherwise an error is thrown.
The square bracket to the right of each number tells whether that number is included in or excluded from the current bin. For example, say x takes integer values from 0 to 100, then "cut::5]" will create two bins: one from 0 to 5 and a second from 6 to 100. With "cut::5[" the bins would have been 0-4 and 5-100.
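The effect of the bracket direction can be checked with a short sketch. It assumes the fixest package is installed; the vector x and the object names right_closed and right_open are hypothetical and chosen only for this illustration.

```r
library(fixest)

# Hypothetical data: the integers from 0 to 100
x = 0:100

# Closed bracket after 5: the value 5 belongs to the first bin
right_closed = ref(x, "cut::5]")

# Open bracket after 5: the value 5 belongs to the second bin
right_open = ref(x, "cut::5[")

# Compare the bin sizes: the value 5 moves from one bin to the other
table(right_closed)
table(right_open)
```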
A factor is always returned. The labels always report the min and max values in each bin.
To have user-specified bin labels, just add them in the character vector following 'cut::values'. You don't need to provide all of them, and NA values fall back to the default label. For example, bin = c("cut::4", "Q1", NA, "Q3") modifies only the first and third labels, which will be displayed as "Q1" and "Q3".
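A minimal sketch of custom labels, passed here through the ref argument. It assumes the fixest package is installed; the vector x is made up so that the four quartile bins are all non-empty.

```r
library(fixest)

# Hypothetical data: the integers from 1 to 100
x = 1:100

# Four (roughly) equal parts; only the first and third labels are customized,
# the NA and the omitted fourth label fall back to the default min/max label
x_cut = ref(x, c("cut::4", "Q1", NA, "Q3"))

# Inspect the resulting labels and the bin sizes
levels(x_cut)
table(x_cut)
```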
bin vs ref
The functions bin and ref are able to do the same thing, so why use one instead of the other? Here are the differences:
- ref always returns a factor. This is in contrast with bin which returns, when possible, a vector of the same type as the input vector.
- ref always places the modified values first in the factor levels. On the other hand, bin tries not to modify the ordering of the levels. It is possible to make bin mimic the behavior of ref by adding an "@" as the first element of the list in the argument bin.
- When a vector (and not a list) is given in input, ref will place each element of the vector first in the factor levels. The behavior of bin is totally different: bin will transform all the values in the vector into a single value in x (i.e. it's binning).
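The last difference can be seen by passing the same vector to both functions. This is a minimal sketch assuming the fixest package is installed; the vector month and the object names as_ref and as_bin are hypothetical.

```r
library(fixest)

# Hypothetical data: a character vector of month names
month = c("may", "june", "july", "august", "june", "july")

# ref: each element of the vector is moved to the front of the levels,
# the number of distinct values is unchanged
as_ref = ref(month, c("june", "july"))
levels(as_ref)

# bin: the values in the vector are merged into a single value
as_bin = bin(month, c("june", "july"))
unique(as_bin)
```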
See also
To bin the values of a vector: bin.
Examples
library(fixest)
data(airquality)
# A vector of months
month_num = airquality$Month
month_lab = c("may", "june", "july", "august", "september")
month_fact = factor(month_num, labels = month_lab)
table(month_num)
#> month_num
#> 5 6 7 8 9
#> 31 30 31 31 30
table(month_fact)
#> month_fact
#> may june july august september
#> 31 30 31 31 30
#
# Main use
#
# Without argument: equivalent to as.factor
ref(month_num)
#> [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
#> [38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7
#> [75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
#> [112] 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
#> [149] 9 9 9 9 9
#> Levels: 5 6 7 8 9
# Main usage: to set a level first:
# (Note that partial matching is enabled.)
table(ref(month_fact, "aug"))
#>
#> august may june july september
#> 31 31 30 31 30
# You can rename the level on-the-fly
# (Northern hemisphere specific!)
table(ref(month_fact, .("Hot month"="aug",
"Late summer" = "sept")))
#>
#> Hot month Late summer may june july
#> 31 30 31 30 31
# Main use is in estimations:
a = feols(Petal.Width ~ Petal.Length + Species, iris)
# We change the reference
b = feols(Petal.Width ~ Petal.Length + ref(Species, "vers"), iris)
etable(a, b)
#> a b
#> Dependent Var.: Petal.Width Petal.Width
#>
#> Constant -0.0908 (0.0564) 0.3445* (0.1489)
#> Petal.Length 0.2304*** (0.0344) 0.2304*** (0.0344)
#> Speciesversicolor 0.4354*** (0.1028)
#> Speciesvirginica 0.8377*** (0.1453)
#> ref(Species,"vers")setosa -0.4354*** (0.1028)
#> ref(Species,"vers")virginica 0.4023*** (0.0572)
#> ____________________________ __________________ ___________________
#> S.E. type IID IID
#> Observations 150 150
#> R2 0.94557 0.94557
#> Adj. R2 0.94446 0.94446
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Binning
#
# You can also bin factor values on the fly
# Using @ first means a regular expression will be used to match the values.
# Note that the value created is placed first.
# To avoid that behavior => use the function "bin"
table(ref(month_fact, .(summer = "@jul|aug|sep")))
#>
#> summer may june
#> 92 31 30
# Please refer to the examples in the bin help page for more examples.
# The syntax is the same.
#
# Precise relocation
#
# You can place a factor at the location you want
# by adding "@digit" in the name first:
table(ref(month_num, .("@5"=5)))
#>
#> 6 7 8 9 5
#> 30 31 31 30 31
# Same with renaming
table(ref(month_num, .("@5 five"=5)))
#>
#> 6 7 8 9 five
#> 30 31 31 30 31