This utility displays the number of unique elements in one or more data.frames, as well as their number of NA values.

## Arguments

- x
A formula, with data set names on the LHS and variables on the RHS, like `data1 + data2 ~ var1 + var2`. The following special variables are admitted: `"."` to get default values, `".N"` for the number of observations, `".U"` for the number of unique rows, `".NA"` for the number of rows with at least one NA. Variables can be combined with `"^"`, e.g. `df~id^period`; use `id%^%period` to also include the terms on both sides. Note that using `:` and `*` is equivalent to using `^` and `%^%`, respectively. Subselect with `id[cond]`; when doing so, `id` is automatically included. Conditions can be chained, as in `id[cond1, cond2]`. Use `NA(x, y)` in conditions instead of `is.na(x) | is.na(y)`. Use the `!!` operator to have both a condition and its opposite. To compare the keys in two data sets, use `data1:data2`. If not a formula, `x` can be: a vector (displays the number of unique values); a `data.frame` (default values are displayed); or a "sum" of data sets, as in `x = data1 + data2`, in which case it is equivalent to `data1 + data2 ~ .`.
- ...
Not currently used.
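
To make the idea of comparing keys across two data sets more concrete, here is a minimal base R sketch of the kind of counts such a comparison involves (common and exclusive key values). The data frames `main` and `extra` are hypothetical, and this only illustrates the concept; it is not how the function is implemented.

```r
# Hypothetical toy data, for illustration only
main  = data.frame(id = c(1, 2, 3, 4))
extra = data.frame(id = c(3, 4, 5))

ids_main  = unique(main$id)
ids_extra = unique(extra$id)

# Keys present in both data sets
n_common = length(intersect(ids_main, ids_extra))

# Keys present in only one of the two data sets
n_excl_main  = length(setdiff(ids_main, ids_extra))
n_excl_extra = length(setdiff(ids_extra, ids_main))
```

Here `n_common` is 2, `n_excl_main` is 2 and `n_excl_extra` is 1.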

## Value

It returns a vector containing the number of unique values per element. If several data sets are provided, a list is returned instead, with one element per data set, each element being a vector of numbers of unique values.

## Special values and functions

In the formula, you can use the following special values: `"."`, `".N"`, `".U"`, and `".NA"`.

`"."`

Accesses the default values. If there is only one data set and the data set is *not* a `data.table`, then the default is to display the number of observations and the number of unique rows. If the data is a `data.table`, the number of unique items in the key(s) is displayed instead of the number of unique rows (provided the table has keys). If there are two or more data sets, then the default is to display the unique items for: a) the variables common across all data sets, if there are fewer than 4, and b) if no variable is shown in a), the variables common across at least two data sets, provided there are fewer than 5. If the data sets are data tables, the keys are also displayed in addition to the common variables. In any case, the number of observations is always displayed.

`".N"`

Displays the number of observations.

`".U"`

Displays the number of unique rows.

`".NA"`

Displays the number of rows with at least one NA.
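
As a rough illustration of what the three counting specials measure, here is a base R sketch on a small made-up data frame. The data frame `df` is hypothetical, and these are only base R computations matching the descriptions above, not the function's actual implementation.

```r
# Hypothetical toy data, for illustration only
df = data.frame(id = c(1, 1, 2, 2, 3),
                x  = c(5, 5, NA, 7, 7))

# ".N": number of observations
n_obs = nrow(df)

# ".U": number of unique rows
n_unique_rows = nrow(unique(df))

# ".NA": number of rows with at least one NA
n_rows_with_na = sum(!complete.cases(df))
```

On this toy data there are 5 observations, 4 unique rows (the first two rows are duplicates) and 1 row containing an NA.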

## The `NA` function

The special function `NA` is an equivalent to `is.na` but can handle several variables.
For instance, `NA(x, y)` is equivalent to `is.na(x) | is.na(y)`. You can add as
many variables as you want as arguments. If no argument is provided, as in `NA()`,
it is identical to having all the variables of the data set as arguments.
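
The equivalence described above can be spelled out in base R. This is only a sketch on a hypothetical data frame `df`; using `complete.cases()` for the no-argument case is my own choice of base R counterpart, not something taken from the function's source.

```r
# Hypothetical toy data, for illustration only
df = data.frame(x = c(1, NA, 3, 4),
                y = c(NA, NA, 3, 4),
                z = c(1, 2, NA, 4))

# Base R counterpart of the condition NA(x, y)
na_xy = is.na(df$x) | is.na(df$y)

# Base R counterpart of NA(): at least one variable of the data set is missing
na_any = !complete.cases(df)
```

Here `na_xy` flags the first two rows only, while `na_any` also flags the third row, where `z` is missing.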

## Combining variables

Use the "hat", `"^"`, operator to combine several variables. For example `id^period` will display the number of unique values of id x period combinations.

Use the "super hat", `"%^%"`, operator to also include the terms on both sides.
For example, instead of writing `id + period + id^period`, you can simply write `id%^%period`.

Alternatively, you can use `:` for `^` and `*` for `%^%`.
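
For intuition, the counts involved when combining variables can be reproduced in base R. The sketch below uses a hypothetical data frame `df` and only illustrates the quantities being counted, not the function's internals.

```r
# Hypothetical toy data, for illustration only
df = data.frame(id     = c(1, 1, 2, 2, 2),
                period = c(1, 2, 1, 1, 2))

# Counts for the individual terms: id and period
n_id     = length(unique(df$id))
n_period = length(unique(df$period))

# Count for the combination id^period: unique (id, period) pairs
n_id_period = nrow(unique(df[, c("id", "period")]))
```

On this data, `n_id` and `n_period` are both 2, and `n_id_period` is 4 since the pair (2, 1) appears twice. In these terms, `id^period` corresponds to the last count alone, while `id%^%period` corresponds to all three.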

## Sub-selections

To show the number of unique values for subsamples, simply use `[]`.
For example, `id[x > 10]` will display the number of unique `id` for which `x > 10`.

Single square brackets lead to the inclusion of both the variable and its subset.
For example `id[x > 10]` is equivalent to `id + id[x > 10]`.
To include only the sub-selection, use double square brackets, as in `id[[x > 10]]`.

You can add multiple sub-selections at once: simply separate them with a comma.
For example `id[x > 10, NA(y)]` is equivalent to `id[x > 10] + id[NA(y)]`.

Use the double negative operator, i.e. `!!`, to include both a condition and
its opposite at once. For example `id[!!x > 10]` is equivalent to `id[x > 10, !x > 10]`.
Double negative operators can be chained, as in `id[!!cond1 & !!cond2]`, in which case the
Cartesian product of all double-negated conditions is returned.
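
The quantities reported by a sub-selection can be sketched in base R as follows. The data frame `df` is hypothetical and the code only mirrors the description above; it makes no claim about how the function computes these values.

```r
# Hypothetical toy data, for illustration only
df = data.frame(id = c(1, 1, 2, 3, 3),
                x  = c(5, 12, 20, 3, 8))

cond = df$x > 10

# Count for the full variable, as reported alongside id[x > 10]
n_id = length(unique(df$id))

# Count for the sub-selection id[x > 10]
n_id_cond = length(unique(df$id[cond]))

# Count for the opposite condition, as added by id[!!x > 10]
n_id_not_cond = length(unique(df$id[!cond]))
```

Here `n_id` is 3, while `n_id_cond` and `n_id_not_cond` are both 2. The two subsets overlap on `id` 1, which has one observation on each side of the threshold.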

## Examples

```
library(fixest)
data(base_did)
data = base_did
data$x1.L1 = round(lag(x1~id+period, 1, data))
# By default, just the formatted numbers of observations and unique rows
n_unik(data)
#> ## # Observations: 1,080
#> ## # Unique rows: 1,080
# Or the number of unique elements of a vector
n_unik(data$id)
#> ## id: 108
# number of unique id values and id x period pairs
n_unik(data ~.N + id + id^period)
#> ## # Observations: 1,080
#> ## id: 108
#> ## id^period: 1,080
# use the %^% operator to include the terms on the two sides at once
# => same as id*period
n_unik(data ~.N + id %^% period)
#> ## # Observations: 1,080
#> ## id: 108
#> ## period: 10
#> ## id^period: 1,080
# using sub selection with []
n_unik(data ~.N + period[!NA(x1.L1)])
#> ## # Observations: 1,080
#> ## period: 10
#> ## period[!NA(x1.L1)]: 9
# to show only the sub selection: [[]]
n_unik(data ~.N + period[[!NA(x1.L1)]])
#> ## # Observations: 1,080
#> ## period[!NA(x1.L1)]: 9
# you can have multiple values in [],
# just separate them with a comma
n_unik(data ~.N + period[!NA(x1.L1), x1 > 7])
#> ## # Observations: 1,080
#> ## period: 10
#> ## period[!NA(x1.L1)]: 9
#> ## period[x1 > 7]: 7
# to have both a condition and its opposite,
# use the !! operator
n_unik(data ~.N[!!NA(x1.L1)])
#> ## # Observations: 1,080
#> ## # Obs. with NA(x1.L1): 108
#> ## # Obs. with !NA(x1.L1): 972
# the !! operator works within condition chains
n_unik(data ~.N[!!NA(x1.L1) & !!x1 > 7])
#> ## # Observations: 1,080
#> ## # Obs. with NA(x1.L1) & x1 > 7: 3
#> ## # Obs. with !NA(x1.L1) & x1 > 7: 10
#> ## # Obs. with NA(x1.L1) & !x1 > 7: 105
#> ## # Obs. with !NA(x1.L1) & !x1 > 7: 962
# Conditions can be distributed
n_unik(data ~ (id + period)[x1 > 7])
#> ## id: 108
#> ## id[x1 > 7]: 13
#> ## period: 10
#> ## period[x1 > 7]: 7
#
# Several data sets
#
# Typical use case: merging
# Let's create two data sets and merge them
data(base_did)
base_main = base_did
base_extra = sample_df(base_main[, c("id", "period")], 100)
base_extra$id[1:10] = 111:120
base_extra$period[11:20] = 11:20
base_extra$z = rnorm(100)
# You can use db1:db2 to compare the common keys in two data sets
n_unik(base_main:base_extra)
#> ##                  base_main base_extra
#> ## # Observations:      1,080        100
#> ## id:                    108         73
#> ## # Exclusive |           45         10   # Common: 63
#> ## period:                 10         20
#> ## # Exclusive |            0         10   # Common: 10
#> ## id^period:           1,080        100
#> ## # Exclusive |        1,000         20   # Common: 80
tmp = merge(base_main, base_extra, all.x = TRUE, by = c("id", "period"))
# You can show unique values for any variable, as before
n_unik(tmp + base_main + base_extra ~ id[!!NA(z)] + id^period)
#> ##                tmp base_main base_extra
#> ## id:            108       108         73
#> ## id[NA(z)]:     108        --          0
#> ## id[!NA(z)]:     59        --         73
#> ## id^period:   1,080     1,080        100
```