Using CategoricalArrays

Basic usage

Suppose that you have data about four individuals, with three different age groups. Since this variable is clearly ordinal, we mark the array as such via the ordered argument.

julia> using CategoricalArrays

julia> x = CategoricalArray(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArray{String,1,UInt32}:
 "Old"
 "Young"
 "Middle"
 "Young"

By default, the levels are lexically sorted, which is clearly not correct in our case and would give incorrect results when testing for order. This is easily fixed using the levels! function to reorder levels:

julia> levels(x)
3-element Vector{String}:
 "Middle"
 "Old"
 "Young"

julia> levels!(x, ["Young", "Middle", "Old"])
4-element CategoricalArray{String,1,UInt32}:
 "Old"
 "Young"
 "Middle"
 "Young"

Thanks to this order, we can not only test for equality between two values, but also compare the ages of e.g. individuals 1 and 2:

julia> x[1]
CategoricalValue{String, UInt32} "Old" (3/3)

julia> x[2]
CategoricalValue{String, UInt32} "Young" (1/3)

julia> x[2] == x[4]
true

julia> x[1] > x[2]
true

Now let us imagine the first individual is actually in the "Young" group. Let's fix this (notice how the string "Young" is automatically converted to a CategoricalValue):

julia> x[1] = "Young"
"Young"

julia> x[1]
CategoricalValue{String, UInt32} "Young" (1/3)

The CategoricalArray still considers "Old" as a possible level even if it is unused now. This is necessary to allow efficiently accessing the levels and setting values of elements in the array: indeed, dropping unused levels requires iterating over every element in the array, which is expensive. This property can also be useful to keep track of possible levels, even if they do not occur in practice.

To get rid of the "Old" group, just call the droplevels! function:

julia> levels(x)
3-element Vector{String}:
 "Young"
 "Middle"
 "Old"

julia> droplevels!(x)
4-element CategoricalArray{String,1,UInt32}:
 "Young"
 "Young"
 "Middle"
 "Young"

julia> levels(x)
2-element Vector{String}:
 "Young"
 "Middle"

Another solution would have been to call levels!(x, ["Young", "Middle"]) manually. This command is safe too, since it will raise an error when trying to remove levels that are currently used:

julia> levels!(x, ["Young", "Midle"])
ERROR: ArgumentError: cannot remove level "Middle" as it is used at position 3. Change the array element type to Union{String, Missing} using convert if you want to transform some levels to missing values.
[...]

Note that entries in the x array cannot be treated as strings. Instead, they need to be converted to strings using String(x[i]):

julia> lowercase(String(x[3]))
"middle"

julia> replace(String(x[3]), 'M'=>'R')
"Riddle"

Note that the call to String does not reduce performance compared with working with a Vector{String} as it simply returns the string object which is stored by the pool.

Integer codes giving the index of each value in the levels can be obtained using the levelcode function:

julia> levelcode(x[1])
1

julia> levelcode.(x)
4-element Vector{Int64}:
 1
 1
 2
 1

Handling Missing Values

The examples above assumed that the data contained no missing values. This is generally not the case for real data. This is where CategoricalArray{Union{T, Missing}} comes into play. It is essentially the categorical-data equivalent of Array{Union{T, Missing}}. It behaves exactly as CategoricalArray{T}, except that when indexed it returns either a CategoricalValue{T} object or missing if the value is missing. See the Julia manual for more information on the Missing type.

Let's adapt the example developed above to support missing values. Since there are no missing values in the input vector, we need to specify that the array should be able to hold either a String or missing:

julia> y = CategoricalArray{Union{Missing, String}}(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "Old"
 "Young"
 "Middle"
 "Young"

Levels still need to be reordered manually:

julia> levels(y)
3-element Vector{String}:
 "Middle"
 "Old"
 "Young"

julia> levels!(y, ["Young", "Middle", "Old"])
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "Old"
 "Young"
 "Middle"
 "Young"

At this point, indexing into the array gives exactly the same result

julia> y[1]
CategoricalValue{String, UInt32} "Old" (3/3)

Missing values can be introduced either manually, or by restricting the set of possible levels. Let us imagine this time that we actually do not know the age of the first individual. We can set it to a missing value this way:

julia> y[1] = missing
missing

julia> y
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 missing
 "Young"
 "Middle"
 "Young"

julia> y[1]
missing

It is also possible to transform all values belonging to some levels into missing values, which gives the same result as above in the present case since we have only one individual in the "Old" group. Let's first restore the original value for the first element, and then set it to missing again using the allowmissing argument to levels!:

julia> y[1] = "Old"
"Old"

julia> y
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "Old"
 "Young"
 "Middle"
 "Young"

julia> levels!(y, ["Young", "Middle"]; allowmissing=true)
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 missing
 "Young"
 "Middle"
 "Young"

Conversely, all missing values can be turned into a "normal" value using replace (or recode, whose syntax is identical for this operation):

julia> replace(y, missing => "missing value")
4-element CategoricalArray{String,1,UInt32}:
 "missing value"
 "Young"
 "Middle"
 "Young"

Note that the returned array no longer allows for missing values (which is usually what is expected). This syntax works for array types other than CategoricalArray.

An in-place variant replace! (respectively recode!) is also provided. Note that y still allows for missing values (since the type of an object cannot be changed).

julia> replace!(y, missing => "missing value");

julia> y
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "missing value"
 "Young"
 "Middle"
 "Young"

Combining levels

Some operations imply combining levels of two categorical arrays: this is the case when concatenating arrays (vcat, hcat and cat) and when assigning a CategoricalValue from another categorical array.

For example, imagine we have two sets of observations, one with only the younger part of the population and one with the older part:

julia> x = categorical(["Middle", "Old", "Middle"], ordered=true);

julia> y = categorical(["Young", "Middle", "Middle"], ordered=true);

julia> levels!(y, ["Young", "Middle"]);

If we concatenate the two sets, the levels of the resulting categorical vector are chosen so that the relative orders of levels in x and y are preserved, if possible. In that case, comparisons with < and > are still valid, and resulting vector is marked as ordered:

julia> xy = vcat(x, y)
6-element CategoricalArray{String,1,UInt32}:
 "Middle"
 "Old"
 "Middle"
 "Young"
 "Middle"
 "Middle"

julia> levels(xy)
3-element Vector{String}:
 "Young"
 "Middle"
 "Old"

julia> isordered(xy)
true

Likewise, assigning a CategoricalValue from y to an entry in x expands the levels of x with all levels from y, respecting the ordering of levels of both vectors if possible:

julia> levels(x)
2-element Vector{String}:
 "Middle"
 "Old"

julia> x[1] = y[1]
CategoricalValue{String, UInt32} "Young" (1/2)

julia> levels(x)
3-element Vector{String}:
 "Young"
 "Middle"
 "Old"

julia> x[1]
CategoricalValue{String, UInt32} "Young" (1/3)

In cases where levels with incompatible orderings are combined, the ordering of the destination array wins and the destination array is marked as unordered. The same happens when concatenating arrays, and the ordering of the first array wins in case of conflict:

julia> a = categorical(["a", "b", "c"], ordered=true);

julia> b = categorical(["a", "b", "c"], ordered=true);

julia> ab = vcat(a, b)
6-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "a"
 "b"
 "c"

julia> levels(ab)
3-element Vector{String}:
 "a"
 "b"
 "c"

julia> isordered(ab)
true

julia> levels!(b, ["c", "b", "a"])
3-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"

julia> ab2 = vcat(a, b)
6-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "a"
 "b"
 "c"

julia> levels(ab2)
3-element Vector{String}:
 "a"
 "b"
 "c"

julia> isordered(ab2)
false

The resulting array is marked as ordered only if all the source array(s) are ordered, with the exception that unordered arrays with no levels do not prompt the result to be marked as unordered. In particular, this allows assignment of a CategoricalValue to an empty CategoricalArray via setindex! to copy the levels of the source value and to mark the result as ordered.

Do note that in some cases the two sets of levels may have compatible orderings, but it is not possible to determine in what order should levels appear in the merged set. This is the case for example with ["a, "b", "d"] and ["c", "d", "e"]: there is no way to detect that "c" should be inserted exactly after "b" (lexicographic ordering is not relevant here). In such cases, the resulting array is marked as unordered. This situation can only happen when working with data subsets selected based on non-contiguous subsets of levels.

Exported functions

categorical(A) - Construct a categorical array with values from A

compress(A) - Return a copy of categorical array A using the smallest possible reference type

cut(x) - Cut a numeric array into intervals and return an ordered CategoricalArray

decompress(A) - Return a copy of categorical array A using the default reference type

isordered(A) - Test whether entries in A can be compared using <, > and similar operators

ordered!(A, ordered) - Set whether entries in A can be compared using <, > and similar operators

recode(a[, default], pairs...) - Return a copy of a after replacing one or more values

recode!(a[, default], pairs...) - Replace one or more values in a in-place

unwrap(x) - Return the value contained in categorical value x; if x is Missing return missing

levelcode(x) - Return the code of categorical value x, i.e. its index in the set of possible values returned by levels(x).

See API Index for more details.