Tool to easily group the values of a given variable.

`bin(x, bin)`

- x
A vector whose values have to be grouped. Can be of any type but must be atomic.

- bin
A list of values to be grouped, a vector, a formula, or the special values `"bin::digit"` or `"cut::values"`. To create a new value from old values, use `bin = list("new_value"=old_values)` with `old_values` a vector of existing values. You can use `.()` for `list()`. It accepts regular expressions, but they must start with an `"@"`, like in `bin="@Aug|Dec"`. It accepts one-sided formulas which must contain the variable `x`, e.g. `bin=list("<2" = ~x < 2)`. The names of the list are the new names. If the new name is missing, the first value matched becomes the new name. In the name, adding `"@d"`, with `d` a digit, will relocate the value in position `d`: useful to change the position of factors. Use `"@"` as first item to make subsequent items be located first in the factor.

Feeding in a vector is like using a list without name and only a single element. If the vector is numeric, you can use the special value `"bin::digit"` to group every `digit` element. For example if `x` represents years, using `bin="bin::2"` creates bins of two years. With any data, using `"!bin::digit"` groups every `digit` consecutive values starting from the first value. Using `"!!bin::digit"` is the same but starting from the last value.

With numeric vectors you can: a) use `"cut::n"` to cut the vector into `n` equal parts, b) use `"cut::a]b["` to create the following bins: `[min, a]`, `]a, b[`, `[b, max]`. The latter syntax is a sequence of number/quartile (q0 to q4)/percentile (p0 to p100) followed by an open or closed square bracket. You can add custom bin names by adding them in the character vector after `'cut::values'`. See details and examples. Dot square bracket expansion (see `dsb`) is enabled.
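The difference between `"bin::2"` and `"!bin::2"` can be mimicked in base R. This is only a sketch of the grouping logic as it shows up in the examples of this page, not the package's implementation; the variable names (`by_value`, `by_rank`, `pair_id`) are made up for illustration, and labelling each group by its smallest observed value is an assumption chosen to match those examples.

```r
x <- airquality$Month  # integer values from 5 to 9

# "bin::2"-like: values sharing the same x %/% 2 end up in the same group
by_value <- ave(x, x %/% 2, FUN = min)

# "!bin::2"-like: every 2 consecutive distinct values form a group,
# counting from the first (smallest) value
u <- sort(unique(x))
pair_id <- (seq_along(u) - 1) %/% 2
by_rank <- ave(x, pair_id[match(x, u)], FUN = min)

table(by_value)
table(by_rank)
```

The first grouping depends on the values themselves (6 and 7 go together), the second only on their rank among the distinct values (5 and 6 go together).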

It returns a vector of the same length as `x`.

Numeric vectors can be cut easily into: a) equal parts, b) user-specified bins.

Use `"cut::n"` to cut the vector into `n` (roughly) equal parts. Percentiles are used to partition the data, hence some data distributions can result in fewer than `n` parts (for example if P0 is the same as P50).
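To see why ties can reduce the number of parts, here is a base-R sketch using `quantile()`. It only illustrates the idea of percentile breakpoints collapsing; how the package computes its percentiles and handles this case internally is not shown here, and the data and variable names are invented for the illustration.

```r
x <- c(rep(1, 8), 2, 3)  # heavily tied data: 8 of the 10 values equal 1
n <- 4                   # number of parts requested

# Breakpoints at P0, P25, P50, P75 and P100
breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1))

# Identical breakpoints cannot delimit distinct bins
distinct_breaks <- unique(unname(breaks))
n_parts <- length(distinct_breaks) - 1

breaks
n_parts
```

Here P0, P25, P50 and P75 all equal 1, so fewer than 4 distinct intervals remain.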

The user can specify custom bins with the following syntax: `"cut::a]b]c]"`, etc. Here the numbers `a`, `b`, `c`, etc, are a sequence of increasing numbers, each followed by an open or closed square bracket. The numbers can be specified as either plain numbers (e.g. `"cut::5]12[32["`), quartiles (e.g. `"cut::q1]q3["`), or percentiles (e.g. `"cut::p10]p15]p90]"`). Values of different types can be mixed: `"cut::5]q2[p80["` is valid provided the median (`q2`) is indeed greater than `5`, otherwise an error is thrown.
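The "increasing sequence" requirement can be checked by hand before building a mixed specification. The sketch below resolves plain numbers, quartiles and percentiles to values with base R's `quantile()`. The helper `resolve_cut()` is hypothetical and the default quantile type is an assumption; the package may compute its cut points differently.

```r
# Hypothetical helper: turn "5", "q2" or "p80" into a numeric cut point
resolve_cut <- function(token, x) {
  kind <- substr(token, 1, 1)
  if (kind == "q") {
    return(unname(quantile(x, as.numeric(substring(token, 2)) / 4)))
  }
  if (kind == "p") {
    return(unname(quantile(x, as.numeric(substring(token, 2)) / 100)))
  }
  as.numeric(token)
}

plen <- iris$Petal.Length

# Cut points corresponding to a specification like "cut::2[q2]p90]"
cuts <- sapply(c("2", "q2", "p90"), resolve_cut, x = plen)
cuts

# TRUE when the cut points are strictly increasing
!is.unsorted(cuts, strictly = TRUE)
```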

The square bracket right of each number tells whether the numbers should be included or excluded from the current bin. For example, say `x` ranges from 0 to 100, then `"cut::5]"` will create two bins: one from 0 to 5 and a second from 6 to 100. With `"cut::5["` the bins would have been 0-4 and 5-100.
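The bracket rule boils down to the choice between `<=` and `<`. A minimal base-R sketch, assuming `x` holds the integers from 0 to 100 (the variable names are only illustrative):

```r
x <- 0:100

# "cut::5]"-like: the closed bracket keeps 5 in the first bin
first_closed <- x[x <= 5]

# "cut::5["-like: the open bracket pushes 5 into the second bin
first_open <- x[x < 5]

range(first_closed)
range(first_open)
```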

A factor is returned. The labels report the min and max values in each bin.

To have user-specified bin labels, just add them in the character vector following `'cut::values'`. You don't need to provide all of them, and `NA` values fall back to the default label. For example, `bin = c("cut::4", "Q1", NA, "Q3")` modifies only the first and third labels, which will be displayed as `"Q1"` and `"Q3"`.

`bin` vs `ref`

The functions `bin` and `ref` can do the same thing, so why use one rather than the other? Here are the differences:

- `ref` always returns a factor. This is in contrast with `bin` which returns, when possible, a vector of the same type as the vector in input.

- `ref` always places the values modified in the first place of the factor levels. On the other hand, `bin` tries to not modify the ordering of the levels. It is possible to make `bin` mimic the behavior of `ref` by adding an `"@"` as the first element of the list in the argument `bin`.

- When a vector (and not a list) is given in input, `ref` will place each element of the vector in the first place of the factor levels. The behavior of `bin` is totally different: `bin` will transform all the values in the vector into a single value in `x` (i.e. it's binning).
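The last difference (relocating versus merging) can be pictured with plain factor operations. This is a base-R analogy rather than the code of either function; `ref_like`, `bin_like` and `vals` are names invented for the sketch.

```r
f <- factor(c("may", "june", "july", "june"),
            levels = c("may", "june", "july"))
vals <- c("june", "july")

# ref-like: each value moves to the front of the levels and stays distinct
ref_like <- factor(f, levels = c(vals, setdiff(levels(f), vals)))

# bin-like: the values are merged into a single one, named after the
# first value matched; the position of the levels is left alone
bin_like <- f
levels(bin_like)[levels(bin_like) %in% vals] <- vals[1]

levels(ref_like)
levels(bin_like)
```

The first result keeps three levels in a new order, while the second has only two levels left.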

To re-factor variables: `ref`.

```
data(airquality)
month_num = airquality$Month
table(month_num)
#> month_num
#> 5 6 7 8 9
#> 31 30 31 31 30
# Grouping the first two values
table(bin(month_num, 5:6))
#>
#> 5 7 8 9
#> 61 31 31 30
# ... plus changing the name to '10'
table(bin(month_num, list("10" = 5:6)))
#>
#> 7 8 9 10
#> 31 31 30 61
# ... and grouping 7 to 9
table(bin(month_num, list("g1" = 5:6, "g2" = 7:9)))
#>
#> g1 g2
#> 61 92
# Grouping every two months
table(bin(month_num, "bin::2"))
#>
#> 5 6 8
#> 31 61 61
# ... every 2 consecutive elements
table(bin(month_num, "!bin::2"))
#>
#> 5 7 9
#> 61 62 30
# ... idem starting from the last one
table(bin(month_num, "!!bin::2"))
#>
#> 5 6 8
#> 31 61 61
# Using .() for list():
table(bin(month_num, .("g1" = 5:6)))
#>
#> g1 7 8 9
#> 61 31 31 30
#
# with non numeric data
#
month_lab = c("may", "june", "july", "august", "september")
month_fact = factor(month_num, labels = month_lab)
# Grouping the first two elements
table(bin(month_fact, c("may", "jun")))
#>
#> may july august september
#> 61 31 31 30
# ... using regex
table(bin(month_fact, "@may|jun"))
#>
#> may july august september
#> 61 31 31 30
# ...changing the name
table(bin(month_fact, list("spring" = "@may|jun")))
#>
#> spring july august september
#> 61 31 31 30
# Grouping every 2 consecutive months
table(bin(month_fact, "!bin::2"))
#>
#> may july september
#> 61 62 30
# ...idem but starting from the last
table(bin(month_fact, "!!bin::2"))
#>
#> may june august
#> 31 61 61
# Relocating the months using "@d" in the name
table(bin(month_fact, .("@5" = "may", "@1 summer" = "@aug|jul")))
#>
#> summer june september may
#> 62 30 30 31
# Putting "@" as first item means subsequent items will be placed first
table(bin(month_fact, .("@", "aug", "july")))
#>
#> august july may june september
#> 31 31 31 30 30
#
# "Cutting" numeric data
#
data(iris)
plen = iris$Petal.Length
# 3 parts of (roughly) equal size
table(bin(plen, "cut::3"))
#>
#> [1.0; 1.9] [3.0; 4.9] [5.0; 6.9]
#> 50 54 46
# Three custom bins
table(bin(plen, "cut::2]5]"))
#>
#> [1.0; 1.9] [3.0; 5.0] [5.1; 6.9]
#> 50 58 42
# .. same, excluding 5 in the 2nd bin
table(bin(plen, "cut::2]5["))
#>
#> [1.0; 1.9] [3.0; 4.9] [5.0; 6.9]
#> 50 54 46
# Using quartiles
table(bin(plen, "cut::q1]q2]q3]"))
#>
#> [1.0; 1.6] [1.7; 4.3] [4.4; 5.1] [5.2; 6.9]
#> 44 31 41 34
# Using percentiles
table(bin(plen, "cut::p20]p50]p70]p90]"))
#>
#> [1.0; 1.5] [1.6; 4.3] [4.4; 5.0] [5.1; 5.8] [5.9; 6.9]
#> 37 38 33 29 13
# Mixing all
table(bin(plen, "cut::2[q2]p90]"))
#>
#> [1.0; 1.9] [3.0; 4.3] [4.4; 5.8] [5.9; 6.9]
#> 50 25 62 13
# NOTA:
# -> the labels always contain the min/max values in each bin
# Custom labels can be provided, just give them in the char. vector
# NA values lead to the default label
table(bin(plen, c("cut::2[q2]p90]", "<2", "]2; Q2]", NA, ">90%")))
#>
#> <2 ]2; Q2] [4.4; 5.8] >90%
#> 50 25 62 13
#
# With a formula
#
data(iris)
plen = iris$Petal.Length
# We need to use "x"
table(bin(plen, list("< 2" = ~x < 2, ">= 2" = ~x >= 2)))
#>
#> < 2 >= 2
#> 50 100
```