Bins the values of a variable (typically a factor)

Tool to easily group the values of a given variable.

Usage

bin(x, bin)

Arguments

x: A vector whose values have to be grouped. Can be of any type but must be atomic.
bin: A list of values to be grouped, a vector, a formula, or the special values "bin::digit" or "cut::values".

To create a new value from old values, use bin = list("new_value" = old_values), with old_values a vector of existing values. You can use .() as a shorthand for list(). Regular expressions are accepted, but they must start with an "@", as in bin = "@Aug|Dec". One-sided formulas are accepted and must contain the variable x, e.g. bin = list("<2" = ~x < 2).

The names of the list are the new names. If the new name is missing, the first value matched becomes the new name. In the name, adding "@d", with d a digit, relocates the value to position d, which is useful to change the position of factor levels. Use "@" as the first item to place all subsequent items first in the factor. Feeding in a vector is like using a list without names and with only a single element.

If the vector is numeric, you can use the special value "bin::digit" to group every digit element. For example, if x represents years, using bin = "bin::2" creates bins of two years. With any data, using "!bin::digit" groups every digit consecutive values starting from the first value. Using "!!bin::digit" does the same but starts from the last value.

With numeric vectors you can: a) use "cut::n" to cut the vector into n equal parts, b) use "cut::a]b[" to create the following bins: [min, a], ]a, b[, [b, max]. The latter syntax is a sequence of numbers, quartiles (q0 to q4) or percentiles (p0 to p100), each followed by an open or closed square bracket. You can add custom bin names by adding them in the character vector after 'cut::values'. See details and examples.

Dot square bracket expansion (see dsb) is enabled.

Value

It returns a vector of the same length as x.

"Cutting" a numeric vector

Numeric vectors can be cut easily into: a) equal parts, b) user-specified bins.

Use "cut::n" to cut the vector into n (roughly) equal parts. Percentiles are used to partition the data, hence some data distributions can produce fewer than n parts (for example if P0 is the same as P50).
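The following base R sketch illustrates why percentile-based cutting can return fewer parts than requested. It is only an illustration of the idea, not the internal implementation of bin: the toy vector x and the use of quantile() with its default settings are assumptions made for this example.

```r
# Toy data (assumption): 60% of the values are identical,
# so several percentiles coincide
x <- c(rep(0, 60), 1:40)
n <- 4

# Percentile cut points for n parts: P0, P25, P50, P75, P100
cut_points <- quantile(x, probs = seq(0, 1, length.out = n + 1))
print(cut_points)

# Duplicated cut points collapse, so fewer distinct intervals remain
n_parts <- length(unique(cut_points)) - 1
print(n_parts)  # 2 instead of the 4 requested
```

Here P0, P25 and P50 are all equal to 0, so only two distinct intervals can be formed.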

The user can specify custom bins with the following syntax: "cut::a]b]c]". Here a, b, c, etc. form a sequence of increasing numbers, each followed by an open or closed square bracket. The numbers can be specified as either plain numbers (e.g. "cut::5]12[32["), quartiles (e.g. "cut::q1]q3["), or percentiles (e.g. "cut::p10]p15]p90]"). Values of different types can be mixed: "cut::5]q2[p80[" is valid provided the median (q2) is indeed greater than 5, otherwise an error is thrown.

The square bracket to the right of each number tells whether that number is included in or excluded from the current bin. For example, say x takes integer values from 0 to 100, then "cut::5]" will create two bins: one from 0 to 5 and a second from 6 to 100. With "cut::5[" the bins would have been 0-4 and 5-100.
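The bracket rule can be checked by hand with plain comparisons. The sketch below assumes an integer vector x from 0 to 100 and mimics the two splits in base R. The final call to bin is guarded and shown without expected output, since it requires the fixest package to be installed.

```r
x <- 0:100  # integer values, so the neighbours of 5 are 4 and 6

# "cut::5]": closed bracket, 5 is the upper bound of the first bin
first_closed <- x[x <= 5]
rest_closed  <- x[x > 5]
print(range(first_closed))  # 0 5
print(range(rest_closed))   # 6 100

# "cut::5[": open bracket, 5 is the lower bound of the second bin
first_open <- x[x < 5]
rest_open  <- x[x >= 5]
print(range(first_open))    # 0 4
print(range(rest_open))     # 5 100

# Same splits with bin itself (only runs if fixest is available)
if (requireNamespace("fixest", quietly = TRUE)) {
  print(table(fixest::bin(x, "cut::5]")))
  print(table(fixest::bin(x, "cut::5[")))
}
```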

A factor is always returned. The labels always report the min and max values in each bin.

To have user-specified bin labels, just add them in the character vector following 'cut::values'. You don't need to provide all of them, and NA values fall back to the default label. For example, bin = c("cut::4", "Q1", NA, "Q3") modifies only the first and third labels, which are displayed as "Q1" and "Q3".
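As a short illustration of the labelling rule, the sketch below applies the call described above to the iris petal lengths. The choice of data is an assumption made for this example, and the call is guarded so that the snippet still runs when fixest is not installed.

```r
plen <- iris$Petal.Length

if (requireNamespace("fixest", quietly = TRUE)) {
  # The labels follow the "cut::" instruction in the same character vector.
  # NA keeps the default label for that bin, and labels that are not
  # supplied (here the fourth one) also keep their default.
  res <- fixest::bin(plen, c("cut::4", "Q1", NA, "Q3"))
  print(table(res))
}
```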

`bin` vs `ref`

The functions bin and ref can do the same thing, so why use one rather than the other? Here are the differences:

ref always returns a factor. This is in contrast with bin which returns, when possible, a vector of the same type as the input vector.
ref always places the modified values first in the factor levels. On the other hand, bin tries not to modify the ordering of the levels. It is possible to make bin mimic the behavior of ref by adding an "@" as the first element of the list in the argument bin.
When a vector (and not a list) is given as input, ref places each element of the vector first in the factor levels. The behavior of bin is entirely different: bin transforms all the values in the vector into a single value in x (i.e. it's binning).
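The differences listed above can be explored side by side. The sketch below passes the same vector to both functions, using the airquality months as example data (an assumption made for this illustration). It prints the class and the table of each result rather than stating them, and it only runs if fixest is installed.

```r
month_num <- airquality$Month

if (requireNamespace("fixest", quietly = TRUE)) {
  # Same vector input, two different meanings:
  binned <- fixest::bin(month_num, 8:9)  # 8 and 9 are merged into one value
  reffed <- fixest::ref(month_num, 8:9)  # 8 and 9 are moved to the front of the levels

  # Compare the return types and the resulting groupings
  print(class(binned))
  print(class(reffed))
  print(table(binned))
  print(table(reffed))
}
```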

Author

Laurent Berge

Examples


data(airquality)
month_num = airquality$Month
table(month_num)
#> month_num
#>  5  6  7  8  9 
#> 31 30 31 31 30 

# Grouping the first two values
table(bin(month_num, 5:6))
#> 
#>  5  7  8  9 
#> 61 31 31 30 

# ... plus changing the name to '10'
table(bin(month_num, list("10" = 5:6)))
#> 
#>  7  8  9 10 
#> 31 31 30 61 

# ... and grouping 7 to 9
table(bin(month_num, list("g1" = 5:6, "g2" = 7:9)))
#> 
#> g1 g2 
#> 61 92 

# Grouping every two months
table(bin(month_num, "bin::2"))
#> 
#>  5  6  8 
#> 31 61 61 

# ... every 2 consecutive elements
table(bin(month_num, "!bin::2"))
#> 
#>  5  7  9 
#> 61 62 30 

# ... idem starting from the last one
table(bin(month_num, "!!bin::2"))
#> 
#>  5  6  8 
#> 31 61 61 

# Using .() for list():
table(bin(month_num, .("g1" = 5:6)))
#> 
#> g1  7  8  9 
#> 61 31 31 30 


#
# with non numeric data
#

month_lab = c("may", "june", "july", "august", "september")
month_fact = factor(month_num, labels = month_lab)

# Grouping the first two elements
table(bin(month_fact, c("may", "jun")))
#> 
#>       may      july    august september 
#>        61        31        31        30 

# ... using regex
table(bin(month_fact, "@may|jun"))
#> 
#>       may      july    august september 
#>        61        31        31        30 

# ...changing the name
table(bin(month_fact, list("spring" = "@may|jun")))
#> 
#>    spring      july    august september 
#>        61        31        31        30 

# Grouping every 2 consecutive months
table(bin(month_fact, "!bin::2"))
#> 
#>       may      july september 
#>        61        62        30 

# ...idem but starting from the last
table(bin(month_fact, "!!bin::2"))
#> 
#>    may   june august 
#>     31     61     61 

# Relocating the months using "@d" in the name
table(bin(month_fact, .("@5" = "may", "@1 summer" = "@aug|jul")))
#> 
#>    summer      june september       may 
#>        62        30        30        31 

# Putting "@" as first item means subsequent items will be placed first
table(bin(month_fact, .("@", "aug", "july")))
#> 
#>    august      july       may      june september 
#>        31        31        31        30        30 

#
# "Cutting" numeric data
#

data(iris)
plen = iris$Petal.Length

# 3 parts of (roughly) equal size
table(bin(plen, "cut::3"))
#> 
#> [1.0; 1.9] [3.0; 4.9] [5.0; 6.9] 
#>         50         54         46 

# Three custom bins
table(bin(plen, "cut::2]5]"))
#> 
#> [1.0; 1.9] [3.0; 5.0] [5.1; 6.9] 
#>         50         58         42 

# .. same, excluding 5 in the 2nd bin
table(bin(plen, "cut::2]5["))
#> 
#> [1.0; 1.9] [3.0; 4.9] [5.0; 6.9] 
#>         50         54         46 

# Using quartiles
table(bin(plen, "cut::q1]q2]q3]"))
#> 
#> [1.0; 1.6] [1.7; 4.3] [4.4; 5.1] [5.2; 6.9] 
#>         44         31         41         34 

# Using percentiles
table(bin(plen, "cut::p20]p50]p70]p90]"))
#> 
#> [1.0; 1.5] [1.6; 4.3] [4.4; 5.0] [5.1; 5.8] [5.9; 6.9] 
#>         37         38         33         29         13 

# Mixing all
table(bin(plen, "cut::2[q2]p90]"))
#> 
#> [1.0; 1.9] [3.0; 4.3] [4.4; 5.8] [5.9; 6.9] 
#>         50         25         62         13 

# NOTE:
# -> the labels always contain the min/max values in each bin

# Custom labels can be provided, just give them in the char. vector
# NA values lead to the default label
table(bin(plen, c("cut::2[q2]p90]", "<2", "]2; Q2]", NA, ">90%")))
#> 
#>         <2    ]2; Q2] [4.4; 5.8]       >90% 
#>         50         25         62         13 



#
# With a formula
#

data(iris)
plen = iris$Petal.Length

# The formula must refer to the variable as "x"
table(bin(plen, list("< 2" = ~x < 2, ">= 2" = ~x >= 2)))
#> 
#>  < 2 >= 2 
#>   50  100