Prints the number of unique elements in a data set

This utility displays the number of unique elements in one or several data.frames, as well as the number of NA values they contain.

Usage

n_unik(x)

# S3 method for class 'vec_n_unik'
print(x, ...)

# S3 method for class 'list_n_unik'
print(x, ...)

Arguments

x: A formula with data set names on the LHS and variables on the RHS, as in data1 + data2 ~ var1 + var2. The following special variables are accepted: "." for the default values, ".N" for the number of observations, ".U" for the number of unique rows, and ".NA" for the number of rows with at least one NA. Combine variables with "^", e.g. df ~ id^period; use id %^% period to also include the terms on both sides. Using : and * is equivalent to using ^ and %^%, respectively. Sub-select with id[cond]; id itself is then automatically included. Several conditions can be given at once, as in id[cond1, cond2]. In conditions, use NA(x, y) instead of is.na(x) | is.na(y), and use the !! operator to get both a condition and its opposite. To compare the keys of two data sets, use data1:data2. If x is not a formula, it can be: a vector (the number of unique values is displayed); a data.frame (the default values are displayed); or a "sum" of data sets, as in x = data1 + data2, which is equivalent to the formula data1 + data2 ~ . (with "." on the RHS).
...: Not currently used.

Value

Returns a vector containing the number of unique values per element. If several data sets are provided, a list with one element per data set is returned, each element being a vector of numbers of unique values.

Special values and functions

In the formula, you can use the following special values: ".", ".N", ".U", and ".NA".

".": Accesses the default values. With a single data set that is not a data.table, the default is to display the number of observations and the number of unique rows. If the data set is a data.table with keys, the number of unique items in the key(s) is displayed instead of the number of unique rows. With two or more data sets, the default is to display the unique items for: a) the variables common to all data sets, if there are fewer than 4 of them, and b) if a) shows no variable, the variables common to at least two data sets, provided there are fewer than 5 of them. If the data sets are data.tables, the keys are displayed in addition to the common variables. In all cases, the number of observations is displayed.
".N": Displays the number of observations.
".U": Displays the number of unique rows.
".NA": Displays the number of rows with at least one NA.

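To make these counts concrete, here is a minimal base R sketch that computes the corresponding quantities by hand. The data frame df and its columns are made up for illustration, and the code only mirrors the definitions above; it is not meant to describe how n_unik computes them internally.

```r
# Hypothetical toy data, for illustration only
df = data.frame(id = c(1, 1, 2, 2, 3),
                x  = c(10, 10, NA, 5, 7))

n_obs  = nrow(df)                  # the quantity ".N" refers to
n_uniq = nrow(unique(df))          # the quantity ".U" refers to
n_na   = sum(!complete.cases(df))  # the quantity ".NA" refers to

print(c(N = n_obs, U = n_uniq, NA_rows = n_na))
```

Here the first two rows are identical, so there are 5 observations but 4 unique rows, and exactly one row contains an NA.
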
The `NA` function

The special function NA is equivalent to is.na but can handle several variables. For instance, NA(x, y) is equivalent to is.na(x) | is.na(y). You can pass as many variables as you want. If no argument is provided, as in NA(), it is identical to passing all the variables of the data set as arguments.
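
The equivalence can be spelled out in base R. The sketch below uses a made-up data frame df (hypothetical names) and builds the logical conditions that NA(x, y) and NA() stand for according to the description above; the NA() shorthand itself is only available inside an n_unik formula.

```r
# Hypothetical toy data, for illustration only
df = data.frame(x = c(1, NA, 3, NA),
                y = c(NA, 2, 3, NA),
                z = c(1, 2, NA, 4))

# Condition that NA(x, y) stands for: x or y is missing
na_xy = is.na(df$x) | is.na(df$y)

# Condition that NA() stands for: any variable of the data set is missing
na_all = !complete.cases(df)

print(sum(na_xy))   # rows with a missing x or y
print(sum(na_all))  # rows with at least one missing value
```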

Combining variables

Use the "hat", "^", operator to combine several variables. For example id^period will display the number of unique values of id x period combinations.

Use the "super hat", "%^%", operator to also include the terms on both sides. For example, instead of writing id + period + id^period, you can simply write id%^%period.

Alternatively, you can use : for ^ and * for %^%.
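
As a rough illustration of what these terms count, the following base R sketch computes the three numbers that id %^% period would report, on a made-up data frame (df and its values are hypothetical). It mirrors the definitions only and is not the package's implementation.

```r
# Hypothetical toy data, for illustration only
df = data.frame(id     = c(1, 1, 2, 2, 3),
                period = c(1, 2, 1, 1, 1))

n_id        = length(unique(df$id))                 # term: id
n_period    = length(unique(df$period))             # term: period
n_id_period = nrow(unique(df[, c("id", "period")])) # term: id^period

print(c(id = n_id, period = n_period, "id^period" = n_id_period))
```

The pair (id = 2, period = 1) appears twice, so the 5 rows contain only 4 distinct id x period combinations.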

Sub-selections

To show the number of unique values for subsamples, simply use []. For example, id[x > 10] will display the number of unique id values for which x > 10.

Single square brackets include both the variable and its subset. For example, id[x > 10] is equivalent to id + id[x > 10]. To include only the sub-selection, use double square brackets, as in id[[x > 10]].

You can add multiple sub-selections at once: just separate them with commas. For example, id[x > 10, NA(y)] is equivalent to id[x > 10] + id[NA(y)].

Use the double negation operator, !!, to include both a condition and its opposite at once. For example, id[!!x > 10] is equivalent to id[x > 10, !x > 10]. Double negation operators can be combined, as in id[!!cond1 & !!cond2]; the Cartesian product of all double-negated conditions is then returned.
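
For intuition, the base R sketch below computes the counts that id[!!x > 10] would display, on a made-up data frame (df and its values are hypothetical, and contain no NA). It follows the description above rather than the package's internals. Note that in R, !x > 10 parses as !(x > 10), because ! has lower precedence than comparison operators.

```r
# Hypothetical toy data, for illustration only
df = data.frame(id = c(1, 1, 2, 3, 3, 4),
                x  = c(5, 20, 30, 2, 8, 11))

cond = df$x > 10

n_id     = length(unique(df$id))         # term: id
n_id_yes = length(unique(df$id[cond]))   # term: id[x > 10]
n_id_no  = length(unique(df$id[!cond]))  # term: id[!x > 10]

print(c(id = n_id, "id[x > 10]" = n_id_yes, "id[!x > 10]" = n_id_no))
```

The two sub-selection counts need not sum to the total: id 1 has one row on each side of the condition and is therefore counted in both.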

Author

Laurent Berge

Examples


library(fixest)
data(base_did)
data = base_did
data$x1.L1 = round(lag(x1~id+period, 1, data))

# By default: the formatted number of observations and of unique rows
n_unik(data)
#> ## # Observations: 1,080 
#> ##  # Unique rows: 1,080 

# Or the number of unique elements of a vector
n_unik(data$id)
#> ## id: 108 

# number of unique id values and id x period pairs
n_unik(data ~.N + id + id^period)
#> ## # Observations: 1,080 
#> ##             id:   108 
#> ##      id^period: 1,080 

# use the %^% operator to include the terms on the two sides at once
# => same as id*period
n_unik(data ~.N + id %^% period)
#> ## # Observations: 1,080 
#> ##             id:   108 
#> ##         period:    10 
#> ##      id^period: 1,080 

# using sub selection with []
n_unik(data ~.N + period[!NA(x1.L1)])
#> ##     # Observations: 1,080 
#> ##             period:    10 
#> ## period[!NA(x1.L1)]:     9 

# to show only the sub selection: [[]]
n_unik(data ~.N + period[[!NA(x1.L1)]])
#> ##     # Observations: 1,080 
#> ## period[!NA(x1.L1)]:     9 

# you can have multiple values in [],
# just separate them with a comma
n_unik(data ~.N + period[!NA(x1.L1), x1 > 7])
#> ##     # Observations: 1,080 
#> ##             period:    10 
#> ## period[!NA(x1.L1)]:     9 
#> ##     period[x1 > 7]:     7 

# to have both a condition and its opposite,
# use the !! operator
n_unik(data ~.N[!!NA(x1.L1)])
#> ##         # Observations: 1,080 
#> ##  # Obs. with NA(x1.L1):   108 
#> ## # Obs. with !NA(x1.L1):   972 

# the !! operator works within condition chains
n_unik(data ~.N[!!NA(x1.L1) & !!x1 > 7])
#> ##                   # Observations: 1,080 
#> ##   # Obs. with NA(x1.L1) & x1 > 7:     3 
#> ##  # Obs. with !NA(x1.L1) & x1 > 7:    10 
#> ##  # Obs. with NA(x1.L1) & !x1 > 7:   105 
#> ## # Obs. with !NA(x1.L1) & !x1 > 7:   962 

# Conditions can be distributed
n_unik(data ~ (id + period)[x1 > 7])
#> ##             id: 108 
#> ##     id[x1 > 7]:  13 
#> ##         period:  10 
#> ## period[x1 > 7]:   7 

#
# Several data sets
#

# Typical use case: merging
# Let's create two data sets and merge them

data(base_did)
base_main = base_did
base_extra = sample_df(base_main[, c("id", "period")], 100)
base_extra$id[1:10] = 111:120
base_extra$period[11:20] = 11:20
base_extra$z = rnorm(100)

# You can use db1:db2 to compare the common keys in two data sets
n_unik(base_main:base_extra)
#> ##                 base_main base_extra 
#> ## # Observations:     1,080        100 
#> ##             id:       108         66 
#> ## # Exclusive   |        52         10 # Common: 56
#> ##         period:        10         20 
#> ## # Exclusive   |         0         10 # Common: 10
#> ##      id^period:     1,080        100 
#> ## # Exclusive   |     1,000         20 # Common: 80

tmp = merge(base_main, base_extra, all.x = TRUE, by = c("id", "period"))

# You can show unique values for any variable, as before
n_unik(tmp + base_main + base_extra ~ id[!!NA(z)] + id^period)
#> ##               tmp base_main base_extra
#> ##         id:   108       108         66
#> ##  id[NA(z)]:   108        --          0
#> ## id[!NA(z)]:    51        --         66
#> ##  id^period: 1,080     1,080        100