Using CategoricalArrays
Basic usage
Suppose that you have data about four individuals, with three different age groups. Since this variable is clearly ordinal, we mark the array as such via the ordered argument.
julia> using CategoricalArrays
julia> x = CategoricalArray(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"Old"
"Young"
"Middle"
"Young"
By default, the levels are lexically sorted, which is clearly not correct in our case and would give incorrect results when testing for order. This is easily fixed using the levels! function to reorder levels:
julia> levels(x)
3-element Array{String,1}:
"Middle"
"Old"
"Young"
julia> levels!(x, ["Young", "Middle", "Old"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"Old"
"Young"
"Middle"
"Young"
Thanks to this order, we can not only test for equality between two values, but also compare the ages of e.g. individuals 1 and 2:
julia> x[1]
CategoricalArrays.CategoricalValue{String,UInt32} "Old" (3/3)
julia> x[2]
CategoricalArrays.CategoricalValue{String,UInt32} "Young" (1/3)
julia> x[2] == x[4]
true
julia> x[1] > x[2]
true
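As a minimal sketch of how the level order drives these comparisons (the variable name ages is illustrative only), the following contrasts categorical values with plain strings, which compare lexically:

```julia
using CategoricalArrays

# Same data as above, under a hypothetical name
ages = CategoricalArray(["Old", "Young", "Middle", "Young"], ordered=true)
levels!(ages, ["Young", "Middle", "Old"])

# Comparisons between entries follow the order of the levels
@assert ages[2] < ages[3]    # "Young" comes before "Middle"
@assert ages[3] < ages[1]    # "Middle" comes before "Old"

# Plain strings are compared lexically, which gives the opposite answer here
@assert "Young" > "Middle"
```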
Now let us imagine the first individual is actually in the "Young" group. Let's fix this (notice how the string "Young" is automatically converted to a CategoricalValue):
julia> x[1] = "Young"
"Young"
julia> x[1]
CategoricalArrays.CategoricalValue{String,UInt32} "Young" (1/3)
The CategoricalArray still considers "Old" as a possible level even if it is unused now. This is necessary to allow efficiently accessing the levels and setting values of elements in the array: indeed, dropping unused levels requires iterating over every element in the array, which is expensive. This property can also be useful to keep track of possible levels, even if they do not occur in practice.
To get rid of the "Old" group, just call the droplevels! function:
julia> levels(x)
3-element Array{String,1}:
"Young"
"Middle"
"Old"
julia> droplevels!(x)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"Young"
"Young"
"Middle"
"Young"
julia> levels(x)
2-element Array{String,1}:
"Young"
"Middle"
Another solution would have been to call levels!(x, ["Young", "Middle"]) manually. This command is safe too, since it raises an error when trying to remove a level that is currently used, as happens here when "Middle" is misspelled as "Midle":
julia> levels!(x, ["Young", "Midle"])
ERROR: ArgumentError: cannot remove level "Middle" as it is used at position 3. Change the array element type to Union{String, Null} using convert if you want to transform some levels to missing values.
[...]
Note that entries in the x array can be treated as strings even though they are CategoricalValue objects:
julia> x[3] = lowercase(x[3])
"middle"
julia> x[3]
CategoricalArrays.CategoricalValue{String,UInt32} "middle" (3/3)
julia> droplevels!(x)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"Young"
"Young"
"middle"
"Young"
julia> x[3]
CategoricalArrays.CategoricalValue{String,UInt32} "middle" (2/2)
CategoricalArrays.droplevels! - Function
droplevels!(A::CategoricalArray)
Drop levels which do not appear in categorical array A (so that they will no longer be returned by levels).
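A short sketch of the effect of droplevels!, using a hypothetical colors array: a level stays registered after its last occurrence is overwritten, until it is dropped explicitly.

```julia
using CategoricalArrays

# Hypothetical data; levels default to sorted order: "blue", "red"
colors = CategoricalArray(["red", "blue", "red"])

colors[2] = "red"                           # "blue" no longer occurs in the data
@assert levels(colors) == ["blue", "red"]   # but it is still a possible level

droplevels!(colors)                         # scans the array and removes unused levels
@assert levels(colors) == ["red"]
```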
Nulls.levels - Function
levels(x)
Return a vector of unique values which occur or could occur in collection x, omitting null even if present. Values are returned in the preferred order for the collection, with the result of sort as a default.
Contrary to unique, this function may return values which do not actually occur in the data, and does not preserve their order of appearance in x.
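The difference from unique can be sketched as follows. The names plain and cat are illustrative, and the example assumes that levels is available after loading CategoricalArrays, as in the sessions above.

```julia
using CategoricalArrays

# For a plain vector, unique keeps the order of appearance,
# while levels falls back to the sorted order
plain = ["b", "a", "b"]
@assert unique(plain) == ["b", "a"]
@assert levels(plain) == ["a", "b"]

# For a categorical array, levels may also list values that never occur
cat = CategoricalArray(["b", "a", "b"])
levels!(cat, ["a", "b", "c"])               # "c" is possible but unused
@assert levels(cat) == ["a", "b", "c"]
```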
levels(A::CategoricalArray)
Return the levels of categorical array A. This may include levels which do not actually appear in the data (see droplevels!).
CategoricalArrays.levels! - Function
levels!(A::CategoricalArray, newlevels::Vector; nullok::Bool=false)
Set the levels of categorical array A. The order of appearance of levels will be respected by levels, which may affect display of results in some operations; if A is ordered (see isordered), it will also be used for order comparisons using <, > and similar operators. Reordering levels will never affect the values of entries in the array.
If A is nullable (i.e. eltype(A) >: Null) and nullok=true, entries whose level is absent from newlevels will be set to null. Else, newlevels must include all levels which appear in the data.
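To illustrate that reordering changes comparisons but not the stored entries, here is a minimal sketch with a hypothetical sizes array:

```julia
using CategoricalArrays

# Hypothetical ordinal data; the default (sorted) levels are "large", "medium", "small"
sizes = CategoricalArray(["small", "large", "medium"], ordered=true)
@assert levels(sizes) == ["large", "medium", "small"]

# Reorder the levels into their natural order
levels!(sizes, ["small", "medium", "large"])

@assert sizes == ["small", "large", "medium"]   # the entries are untouched
@assert sizes[2] > sizes[3]                     # "large" now ranks above "medium"
```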
Handling Missing Values
The examples above assumed that the data contained no missing values. This is generally not the case in real data. This is where CategoricalArray{Union{T, Null}} comes into play. It is essentially the categorical-data equivalent of Array{Union{T, Null}}. It behaves exactly the same as CategoricalArray{T}, except that when indexed it returns either a CategoricalValue{T}, or null if a value is missing. See the Nulls package for more information on the Null type.
Let's adapt the example developed above to support missing values. Since there are no missing values in the input vector, we need to specify that the array should be able to hold either a String or null:
julia> y = CategoricalArray{Union{Null, String}}(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
"Old"
"Young"
"Middle"
"Young"
Levels still need to be reordered manually:
julia> levels(y)
3-element Array{String,1}:
"Middle"
"Old"
"Young"
julia> levels!(y, ["Young", "Middle", "Old"])
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
"Old"
"Young"
"Middle"
"Young"
At this point, indexing into the array gives exactly the same result as before:
julia> y[1]
CategoricalArrays.CategoricalValue{String,UInt32} "Old" (3/3)
Missing values can be introduced either manually, or by restricting the set of possible levels. Let us imagine this time that we actually do not know the age of the first individual. We can set it to a missing value this way:
julia> y[1] = null
null
julia> y
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
null
"Young"
"Middle"
"Young"
julia> y[1]
null
It is also possible to transform all values belonging to some levels into missing values, which gives the same result as above in the present case since we have only one individual in the "Old" group. Let's first restore the original value for the first element, and then set it to missing again using the nullok argument to levels!:
julia> y[1] = "Old"
"Old"
julia> y
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
"Old"
"Young"
"Middle"
"Young"
julia> levels!(y, ["Young", "Middle"]; nullok=true)
4-element CategoricalArrays.CategoricalArray{Union{Nulls.Null, String},1,UInt32}:
null
"Young"
"Middle"
"Young"
Working with categorical arrays
categorical(A) - Construct a categorical array with values from A
compress(A) - Return a copy of categorical array A using the smallest possible reference type
cut(x) - Cut a numeric array into intervals and return an ordered CategoricalArray
decompress(A) - Return a copy of categorical array A using the default reference type
isordered(A) - Test whether entries in A can be compared using <, > and similar operators
ordered!(A, ordered) - Set whether entries in A can be compared using <, > and similar operators
recode(a[, default], pairs...) - Return a copy of a after replacing one or more values
recode!(a[, default], pairs...) - Replace one or more values in a in-place
CategoricalArrays.categorical - Function
categorical{T}(A::AbstractArray{T}[, compress::Bool]; ordered::Bool=false)
Construct a categorical array with the values from A.
If the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).
If compress is provided and set to true, the smallest reference type able to hold the number of unique values in A will be used. While this will reduce memory use, passing this parameter will also introduce a type instability which can affect performance inside the function where the call is made. Therefore, use this option with caution (the one-argument version does not suffer from this problem).
categorical{T}(A::CategoricalArray{T}[, compress::Bool]; ordered::Bool=isordered(A))
If A is already a CategoricalArray, its levels are preserved; the same applies to the ordered property, and to the reference type unless compress is passed.
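A small sketch of both methods, with illustrative variable names: a plain vector gets sorted levels, while an existing categorical array keeps its levels and its ordered property unless ordered is passed explicitly.

```julia
using CategoricalArrays

# From a plain vector: levels are sorted, and the array is unordered by default
base = categorical(["b", "a", "b"])
@assert levels(base) == ["a", "b"]
@assert !isordered(base)

# From an existing categorical array: custom levels are preserved
levels!(base, ["b", "a"])
again = categorical(base)
@assert levels(again) == ["b", "a"]
@assert !isordered(again)

# The ordered keyword overrides the inherited property
ranked = categorical(base, ordered=true)
@assert isordered(ranked)
```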
CategoricalArrays.compress - Function
compress(A::CategoricalArray)
Return a copy of categorical array A using the smallest reference type able to hold the number of levels of A.
While this will reduce memory use, this function is type-unstable, which can affect performance inside the function where the call is made. Therefore, use it with caution.
CategoricalArrays.cut - Function
cut(x::AbstractArray, breaks::AbstractVector;
    extend::Bool=false, labels::AbstractVector=[], nullok::Bool=false)
Cut a numeric array into intervals and return an ordered CategoricalArray indicating the interval into which each entry falls. Intervals are of the form [lower, upper), i.e. the lower bound is included and the upper bound is excluded.
If x is nullable (i.e. eltype(x) >: Null), a nullable CategoricalArray is returned.
Arguments
- extend::Bool=false: when false, an error is raised if some values in x fall outside of the breaks; when true, breaks are automatically added to include all values in x, and the upper bound is included in the last interval.
- labels::AbstractVector=[]: a vector of strings giving the names to use for the intervals; if empty, default labels are used.
- nullok::Bool=false: when true, values outside of breaks result in null values. Only supported when x is nullable.
cut(x::AbstractArray, ngroups::Integer;
    labels::AbstractVector=String[])
Cut a numeric array into ngroups quantiles, determined using quantile.
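A minimal sketch of the breaks-based method, using hypothetical scores and labels. The default interval labels are not asserted here, since their exact text is not shown in this document.

```julia
using CategoricalArrays

scores = [1, 4, 5, 9]                       # hypothetical numeric data

# Two intervals: [0, 5) and [5, 10)
groups = cut(scores, [0, 5, 10])
@assert isordered(groups)
@assert length(levels(groups)) == 2
@assert groups[1] == groups[2]              # 1 and 4 both fall in [0, 5)
@assert groups[3] == groups[4]              # 5 sits on the included lower bound of [5, 10)
@assert groups[2] < groups[3]               # intervals are ordered

# Custom labels replace the default interval names
named = cut(scores, [0, 5, 10], labels=["low", "high"])
@assert levels(named) == ["low", "high"]
```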
CategoricalArrays.decompress - Function
decompress(A::CategoricalArray)
Return a copy of categorical array A using the default reference type (UInt32). If A is using a small reference type (such as UInt8 or UInt16) the decompressed array will have room for more levels.
To avoid the need to call decompress, ensure compress is not called when creating the categorical array.
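The round trip between compress and decompress can be sketched as below. This assumes, as the outputs earlier in this document suggest, that the reference type is the third type parameter of CategoricalArray and that two levels fit in UInt8; the variable names are illustrative.

```julia
using CategoricalArrays

wide = categorical(["x", "y", "x"])
@assert wide isa CategoricalArray{String,1,UInt32}     # default reference type

small = compress(wide)
@assert small isa CategoricalArray{String,1,UInt8}     # smallest type that fits 2 levels
@assert levels(small) == levels(wide)                  # levels are unchanged

restored = decompress(small)
@assert restored isa CategoricalArray{String,1,UInt32} # back to the default type
```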
CategoricalArrays.isordered - Function
isordered(A::CategoricalArray)
Test whether entries in A can be compared using <, > and similar operators, using the ordering of levels.
CategoricalArrays.ordered! - Function
ordered!(A::CategoricalArray, ordered::Bool)
Set whether entries in A can be compared using <, > and similar operators, using the ordering of levels. Return the modified A.
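As a brief sketch (the grades array is hypothetical), isordered and ordered! work together like this:

```julia
using CategoricalArrays

grades = categorical(["a", "b"])    # unordered by default
@assert !isordered(grades)

ordered!(grades, true)              # switch order comparisons on
@assert isordered(grades)
@assert grades[1] < grades[2]       # "a" precedes "b" in the level order

ordered!(grades, false)             # and off again
@assert !isordered(grades)
```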
CategoricalArrays.recode - Function
recode(a::AbstractArray[, default::Any], pairs::Pair...)
Return a copy of a, replacing elements matching a key of pairs with the corresponding value. The type of the array is chosen so that it can hold all recoded elements (but not necessarily original elements from a).
For each Pair in pairs, if the element is equal to the key (according to isequal) or contained in it (according to in), then the corresponding value (second item of the pair) is used. If the element matches no key and default is not provided or is nothing, it is copied as-is; if default is specified, it is used in place of the original element. If an element matches more than one key, the first match is used.
Examples
julia> using CategoricalArrays
julia> recode(1:10, 1=>100, 2:4=>0, [5; 9:10]=>-1)
10-element Array{Int64,1}:
100
0
0
0
-1
6
7
8
-1
-1
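Two further cases described above, the default argument and the first-match rule, can be sketched as follows (the input values are illustrative):

```julia
using CategoricalArrays

# With a default, every element that matches no key is replaced by it
with_default = recode(["a", "b", "c"], "other", "a" => "A")
@assert with_default == ["A", "other", "other"]

# When several keys match, the first pair wins: 1 matches both 1 and 1:2
first_match = recode(1:3, 1 => 10, 1:2 => 20)
@assert first_match == [10, 20, 3]
```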
recode(a::AbstractArray{>:Null}[, default::Any], pairs::Pair...)
For a nullable array a, null values are never replaced with default: use null in a pair to recode them. Unless null values are recoded this way, the returned array will be nullable.
Examples
julia> using CategoricalArrays, Nulls
julia> recode(1:10, 1=>100, 2:4=>0, [5; 9:10]=>-1, 6=>null)
10-element Array{Union{Int64, Nulls.Null},1}:
100
0
0
0
-1
null
7
8
-1
-1
CategoricalArrays.recode! - Function
recode!(dest::AbstractArray, src::AbstractArray[, default::Any], pairs::Pair...)
Fill dest with elements from src, replacing those matching a key of pairs with the corresponding value.
For each Pair in pairs, if the element is equal to the key (according to isequal) or contained in it (according to in), then the corresponding value (second item of the pair) is copied to dest. If the element matches no key and default is not provided or is nothing, it is copied as-is; if default is specified, it is used in place of the original element. dest and src must be of the same length, but not necessarily of the same type. Elements of src as well as values from pairs will be converted when possible on assignment. If an element matches more than one key, the first match is used.
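A minimal sketch of filling a separate destination array; src and dest here are illustrative, and similar is used only to allocate an array of matching size and type.

```julia
using CategoricalArrays

src = [1, 2, 3, 4]
dest = similar(src)                 # same length and element type as src

# Unmatched elements are copied as-is when no default is given
recode!(dest, src, 1 => 10, 2:3 => 0)
@assert dest == [10, 0, 0, 4]
@assert src == [1, 2, 3, 4]         # the source is left untouched

# With a default, unmatched elements are replaced by it
recode!(dest, src, -1, 1 => 10)
@assert dest == [10, -1, -1, -1]
```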
recode!(dest::AbstractArray, src::AbstractArray{>:Null}[, default::Any], pairs::Pair...)
For a nullable array src, null values are never replaced with default: use null in a pair to recode them. Unless null values are recoded this way, the returned array will be nullable.