Using CategoricalArrays

Basic usage

Suppose that you have data about four individuals, with three different age groups. Since this variable is clearly ordinal, we mark the array as such via the ordered argument.

julia> using CategoricalArrays

julia> x = CategoricalArray(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 

By default, the levels are lexically sorted, which is clearly not correct in our case and would give incorrect results when testing for order. This is easily fixed using the levels! function to reorder levels:

julia> levels(x)
3-element Array{String,1}:
 "Middle"
 "Old"   
 "Young" 

julia> levels!(x, ["Young", "Middle", "Old"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 

Thanks to this order, we can not only test for equality between two values, but also compare the ages of e.g. individuals 1 and 2:

julia> x[1]
CategoricalArrays.CategoricalValue{String,UInt32} "Old" (3/3)

julia> x[2]
CategoricalArrays.CategoricalValue{String,UInt32} "Young" (1/3)

julia> x[2] == x[4]
true

julia> x[1] > x[2]
true
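One way to picture the comparison above is as a comparison of level positions rather than of the strings themselves. The following is a minimal sketch in plain Julia, not the package's implementation; the names `lvls`, `codes`, `label` and `older` are hypothetical and exist only for this illustration.

```julia
# Illustrative stand-in for an ordered categorical vector (hypothetical names).
lvls  = ["Young", "Middle", "Old"]   # levels in their intended order
codes = [3, 1, 2, 1]                 # position of each entry in `lvls`

# Recover the label of entry i.
label(i) = lvls[codes[i]]

# Comparing two entries amounts to comparing their level positions.
older(i, j) = codes[i] > codes[j]

println(label(1), " > ", label(2), " ? ", older(1, 2))  # prints: Old > Young ? true
println("Old" > "Young")                                 # plain string comparison prints: false
```

The last line shows why the lexical default is unsuitable here: as plain strings, "Old" sorts before "Young".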

Now let us imagine the first individual is actually in the "Young" group. Let's fix this (notice how the string "Young" is automatically converted to a CategoricalValue):

julia> x[1] = "Young"
"Young"

julia> x[1]
CategoricalArrays.CategoricalValue{String,UInt32} "Young" (1/3)

The CategoricalArray still considers "Old" as a possible level even if it is unused now. This is necessary to allow efficiently accessing the levels and setting values of elements in the array: indeed, dropping unused levels requires iterating over every element in the array, which is expensive. This property can also be useful to keep track of possible levels, even if they do not occur in practice.

To get rid of the "Old" group, just call the droplevels! function:

julia> levels(x)
3-element Array{String,1}:
 "Young" 
 "Middle"
 "Old"   

julia> droplevels!(x)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Young" 
 "Young" 
 "Middle"
 "Young" 

julia> levels(x)
2-element Array{String,1}:
 "Young" 
 "Middle"

Another solution would have been to call levels!(x, ["Young", "Middle"]) manually. This command is safe too, since it will raise an error when trying to remove levels that are currently used:

julia> levels!(x, ["Young", "Midle"]) # Note the typo in "Middle"
ERROR: ArgumentError: cannot remove level "Middle" as it is used at position 3. Change the array element type to Union{String, Null} using convert if you want to transform some levels to missing values.
[...]

Note that entries in the x array can be treated as strings even though they are CategoricalValue objects:

julia> x[3] = lowercase(x[3])
"middle"

julia> x[3]
CategoricalArrays.CategoricalValue{String,UInt32} "middle" (3/3)

julia> droplevels!(x)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Young" 
 "Young" 
 "middle"
 "Young" 

julia> x[3]
CategoricalArrays.CategoricalValue{String,UInt32} "middle" (2/2)

droplevels!(A::CategoricalArray)

Drop levels which do not appear in categorical array A (so that they will no longer be returned by levels).
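To see why dropping unused levels needs a full pass over the data, consider the following sketch in plain Julia. It assumes a simplified representation (a vector of levels plus a vector of integer positions); `drop_unused`, `lvls` and `codes` are hypothetical names, and this is not the package's implementation.

```julia
# Illustrative only: drop levels that no entry refers to.
function drop_unused(lvls::Vector{String}, codes::Vector{Int})
    used = falses(length(lvls))
    for c in codes                  # full pass over the data: the expensive part
        used[c] = true
    end
    kept = String[]
    newpos = zeros(Int, length(lvls))
    for (i, l) in enumerate(lvls)
        if used[i]
            push!(kept, l)
            newpos[i] = length(kept)   # new position of a level that is kept
        end
    end
    return kept, [newpos[c] for c in codes]
end

lvls  = ["Young", "Middle", "Old"]
codes = [1, 1, 2, 1]                # "Old" is no longer used
newlvls, newcodes = drop_unused(lvls, codes)
println(newlvls)                    # prints: ["Young", "Middle"]
```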

Nulls.levels - Function.
levels(x)

Return a vector of unique values which occur or could occur in collection x, omitting null even if present. Values are returned in the preferred order for the collection, with the result of sort as a default.

Contrary to unique, this function may return values which do not actually occur in the data, and does not preserve their order of appearance in x.

levels(A::CategoricalArray)

Return the levels of categorical array A. This may include levels which do not actually appear in the data (see droplevels!).

levels!(A::CategoricalArray, newlevels::Vector; nullok::Bool=false)

Set the levels of categorical array A. The order of appearance of levels will be respected by levels, which may affect display of results in some operations; if A is ordered (see isordered), it will also be used for order comparisons using <, > and similar operators. Reordering levels will never affect the values of entries in the array.

If A is nullable (i.e. eltype(A) >: Null) and nullok=true, entries corresponding to missing levels will be set to null. Else, newlevels must include all levels which appear in the data.
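The effect of restricting the levels with nullok=true can be sketched in plain Julia as follows. This is only an illustration under stated assumptions: entries are stored as integer positions into the levels, and position 0 is used here to stand for a null entry. `restrict_levels`, `lvls` and `codes` are hypothetical names, not part of the package.

```julia
# Illustrative only: code 0 stands for a null entry in this sketch.
function restrict_levels(lvls::Vector{String}, codes::Vector{Int}, newlvls::Vector{String})
    # New position of each old level, or 0 when the level is removed.
    remap = [something(findfirst(isequal(l), newlvls), 0) for l in lvls]
    return [c == 0 ? 0 : remap[c] for c in codes]
end

lvls  = ["Young", "Middle", "Old"]
codes = [3, 1, 2, 1]                 # "Old", "Young", "Middle", "Young"
newcodes = restrict_levels(lvls, codes, ["Young", "Middle"])
println(newcodes)                    # prints: [0, 1, 2, 1]
```

The first entry, which belonged to the removed "Old" level, ends up as null, while the other entries keep their values.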


Handling Missing Values

The examples above assumed that the data contained no missing values, which is rarely the case with real data. This is where CategoricalArray{Union{T, Null}} comes into play. It is essentially the categorical-data equivalent of Array{Union{T, Null}}: it behaves exactly like CategoricalArray{T}, except that indexing returns either a CategoricalValue{T} or null if the value is missing. See the Nulls package for more information on the Null type.

Let's adapt the example developed above to support missing values. Since there are no missing values in the input vector, we need to specify that the array should be able to hold either a String or null:

julia> y = CategoricalArray{Union{Null, String}}(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 

Levels still need to be reordered manually:

julia> levels(y)
3-element Array{String,1}:
 "Middle"
 "Old"   
 "Young" 

julia> levels!(y, ["Young", "Middle", "Old"])
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 
 

At this point, indexing into the array gives exactly the same result as before:

julia> y[1]
CategoricalArrays.CategoricalValue{String,UInt32} "Old" (3/3)

Missing values can be introduced either manually, or by restricting the set of possible levels. Let us imagine this time that we actually do not know the age of the first individual. We can set it to a missing value this way:

julia> y[1] = null
null

julia> y
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
 null    
 "Young" 
 "Middle"
 "Young" 

julia> y[1]
null

It is also possible to transform all values belonging to some levels into missing values, which gives the same result as above in the present case since we have only one individual in the "Old" group. Let's first restore the original value for the first element, and then set it to missing again using the nullok argument to levels!:

julia> y[1] = "Old"
"Old"

julia> y
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 

julia> levels!(y, ["Young", "Middle"]; nullok=true)
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
 null    
 "Young" 
 "Middle"
 "Young" 

Working with categorical arrays

categorical(A) - Construct a categorical array with values from A

compress(A) - Return a copy of categorical array A using the smallest possible reference type

cut(x) - Cut a numeric array into intervals and return an ordered CategoricalArray

decompress(A) - Return a copy of categorical array A using the default reference type

isordered(A) - Test whether entries in A can be compared using <, > and similar operators

ordered!(A) - Set whether entries in A can be compared using <, > and similar operators

recode(a[, default], pairs...) - Return a copy of a after replacing one or more values

recode!(a[, default], pairs...) - Replace one or more values in a in-place

categorical{T}(A::AbstractArray{T}[, compress::Bool]; ordered::Bool=false)

Construct a categorical array with the values from A.

If the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If compress is provided and set to true, the smallest reference type able to hold the number of unique values in A will be used. While this will reduce memory use, passing this parameter will also introduce a type instability which can affect performance inside the function where the call is made. Therefore, use this option with caution (the one-argument version does not suffer from this problem).

categorical{T}(A::CategoricalArray{T}[, compress::Bool]; ordered::Bool=isordered(A))

If A is already a CategoricalArray, its levels are preserved; the same applies to the ordered property, and to the reference type unless compress is passed.

compress(A::CategoricalArray)

Return a copy of categorical array A using the smallest reference type able to hold the number of levels of A.

While this will reduce memory use, this function is type-unstable, which can affect performance inside the function where the call is made. Therefore, use it with caution.
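The type instability mentioned above comes from choosing a type based on a runtime value. A minimal sketch in plain Julia, assuming the candidates are the standard unsigned integer types; `smallest_reftype` is a hypothetical helper, not a function of the package:

```julia
# Illustrative only: pick the smallest unsigned type able to index `nlevels` levels.
function smallest_reftype(nlevels::Integer)
    for T in (UInt8, UInt16, UInt32, UInt64)
        nlevels <= typemax(T) && return T
    end
    throw(ArgumentError("too many levels"))
end

println(smallest_reftype(3))      # prints: UInt8
println(smallest_reftype(300))    # prints: UInt16
```

Because the returned type depends on the number of levels, which is only known at run time, the compiler cannot infer the type of anything built from it at the call site.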

CategoricalArrays.cut - Function.
cut(x::AbstractArray, breaks::AbstractVector;
    extend::Bool=false, labels::AbstractVector=[], nullok::Bool=false)

Cut a numeric array into intervals and return an ordered CategoricalArray indicating the interval into which each entry falls. Intervals are of the form [lower, upper), i.e. the lower bound is included and the upper bound is excluded.

If x is nullable (i.e. eltype(x) >: Null), a nullable CategoricalArray is returned.

Arguments

  • extend::Bool=false: when false, an error is raised if some values in x fall outside of the breaks; when true, breaks are automatically added to include all values in x, and the upper bound is included in the last interval.

  • labels::AbstractVector=[]: a vector of strings giving the names to use for the intervals; if empty, default labels are used.

  • nullok::Bool=false: when true, values outside of breaks result in null values. Only supported when x is nullable.
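The [lower, upper) convention described above can be sketched in plain Julia with a sorted search. This is an illustration of the interval rule with extend=false only, not the package's implementation; `interval_index`, `breaks` and `ages` are hypothetical names.

```julia
# Illustrative only: find which interval [breaks[i], breaks[i+1]) contains v.
# `breaks` must be sorted in increasing order.
function interval_index(v::Real, breaks::AbstractVector{<:Real})
    i = searchsortedlast(breaks, v)       # index of the last break <= v
    # Valid intervals are numbered 1 to length(breaks) - 1.
    (1 <= i < length(breaks)) || throw(ArgumentError("value $v is outside of the breaks"))
    return i
end

breaks = [0, 18, 65, 120]
ages   = [3, 18, 64, 65]
groups = [interval_index(a, breaks) for a in ages]
println(groups)                           # prints: [1, 2, 2, 3]
```

A value equal to a lower bound (18 or 65 here) falls into the interval that starts there, while a value equal to the last break (120) is rejected, since the upper bound is excluded.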

cut(x::AbstractArray, ngroups::Integer;
    labels::AbstractVector=String[])

Cut a numeric array into ngroups quantiles, determined using quantile.

decompress(A::CategoricalArray)

Return a copy of categorical array A using the default reference type (UInt32). If A is using a small reference type (such as UInt8 or UInt16) the decompressed array will have room for more levels.

To avoid the need to call decompress, ensure compress is not called when creating the categorical array.

isordered(A::CategoricalArray)

Test whether entries in A can be compared using <, > and similar operators, using the ordering of levels.

ordered!(A::CategoricalArray, ordered::Bool)

Set whether entries in A can be compared using <, > and similar operators, using the ordering of levels. Return the modified A.

recode(a::AbstractArray[, default::Any], pairs::Pair...)

Return a copy of a, replacing elements matching a key of pairs with the corresponding value. The type of the array is chosen so that it can hold all recoded elements (but not necessarily original elements from a).

For each Pair in pairs, if the element is equal to (according to isequal) or in the key (first item of the pair), then the corresponding value (second item) is used. If the element matches no key and default is not provided or nothing, it is copied as-is; if default is specified, it is used in place of the original element. If an element matches more than one key, the first match is used.
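The matching rule just described can be sketched for a single element in plain Julia. This is a simplified illustration without the default argument and without null handling; `recode_one` is a hypothetical helper, not the package's implementation.

```julia
# Illustrative only: apply the first matching pair to one element.
function recode_one(x, pairs::Pair...)
    for (key, value) in pairs
        # A key matches if it is equal to x, or if it is an array or range containing x.
        if isequal(x, key) || (key isa AbstractArray && x in key)
            return value          # first match wins
        end
    end
    return x                      # no match and no default: keep the element as-is
end

# The last pair also matches 1, 2 and 3, but earlier pairs take precedence.
result = [recode_one(x, 1 => 100, 2:4 => 0, [5; 9:10] => -1, 1:3 => 7) for x in 1:10]
println(result)                   # prints: [100, 0, 0, 0, -1, 6, 7, 8, -1, -1]
```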

Examples

julia> using CategoricalArrays

julia> recode(1:10, 1=>100, 2:4=>0, [5; 9:10]=>-1)
10-element Array{Int64,1}:
 100
   0
   0
   0
  -1
   6
   7
   8
  -1
  -1

recode(a::AbstractArray{>:Null}[, default::Any], pairs::Pair...)

For a nullable array a, null values are never replaced with default: use null in a pair to recode them. If null values are not recoded this way, the returned array will be nullable.

Examples

julia> using CategoricalArrays, Nulls

julia> recode(1:10, 1=>100, 2:4=>0, [5; 9:10]=>-1, 6=>null)
10-element Array{Union{Int64, Nulls.Null},1}:
 100
   0
   0
   0
  -1
    null
   7
   8
  -1
  -1

recode!(dest::AbstractArray, src::AbstractArray[, default::Any], pairs::Pair...)

Fill dest with elements from src, replacing those matching a key of pairs with the corresponding value.

For each Pair in pairs, if the element is equal to (according to isequal) or in the key (first item of the pair), then the corresponding value (second item) is copied to dest. If the element matches no key and default is not provided or nothing, it is copied as-is; if default is specified, it is used in place of the original element. dest and src must be of the same length, but not necessarily of the same type. Elements of src as well as values from pairs will be converted when possible on assignment. If an element matches more than one key, the first match is used.

recode!(dest::AbstractArray, src::AbstractArray{>:Null}[, default::Any], pairs::Pair...)

For a nullable array src, null values are never replaced with default: use null in a pair to recode them. If null values are not recoded this way, the returned array will be nullable.
