Overview
This package provides a replacement for DataArrays.jl's PooledDataArray type. Unlike that type, it supports both non-nullable and nullable arrays (i.e. arrays allowing for missing data), using the Null type. Its design is also simpler because it supports only categorical data, which makes it possible to offer more specialized features (like ordering of categories). See the IndirectArrays.jl package for a simpler array type storing data with a small number of values.
The package provides the CategoricalArray type, designed to hold categorical data (either unordered/nominal or ordered/ordinal) efficiently and conveniently. CategoricalArray{T} holds values of type T. CategoricalArray{Union{Null, T}} holds either values of type T or missing values (represented as null, of the Null type). When indexed, CategoricalArray{T} returns values of type CategoricalValue{T} rather than of type T (or null for missing values).
CategoricalValue objects are simple wrappers around the actual categorical levels which allow for very efficient extraction and equality tests. Indeed, the main feature of categorical array types is that they store a pool of the levels which can appear in the variable. These levels are stored in a specific order: for unordered arrays, this order is only used for pretty printing (e.g. in cross tables or plots); for ordered arrays, it also allows comparing values using the < and > operators, in which case the comparison is based on the ordering of levels stored in the array. Whether an array is ordered can be defined either on construction via the ordered argument, or at any time via the ordered! function.
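As a minimal sketch of the ordering behavior described above (the data values are made up for illustration, and the isordered query function is not mentioned in this overview), an ordered array can be created and compared like this:

```julia
using CategoricalArrays

# Ordered on construction via the `ordered` keyword argument.
x = CategoricalArray(["low", "high", "medium", "low"], ordered=true)

# Set the level order explicitly so comparisons follow low < medium < high.
levels!(x, ["low", "medium", "high"])

x[1] < x[2]    # "low" < "high": true under the level order
x[2] > x[3]    # "high" > "medium": true under the level order

# An existing array can be switched to ordered later with ordered!.
y = CategoricalArray(["b", "a", "c"])
isordered(y)   # false
ordered!(y, true)
isordered(y)   # true
```

Both comparisons would be false for plain strings compared lexicographically, which shows that the result comes from the level order stored in the array.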
Use the levels function to access the levels of a categorical array, and the levels! function to set and order them. Levels are automatically created when setting an element to a previously unused level. On the other hand, they are never removed without manual intervention: use the droplevels! function for this.
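The life cycle of levels can be sketched as follows, using an unordered array with made-up values. The outputs in the comments assume that a newly created level is appended after the existing ones:

```julia
using CategoricalArrays

x = CategoricalArray(["a", "b", "a"])
levels(x)        # ["a", "b"]

# Assigning a previously unused value creates the level automatically.
x[2] = "c"
levels(x)        # ["a", "b", "c"]: "b" is kept although no element uses it

# Unused levels are only removed on request.
droplevels!(x)
levels(x)        # ["a", "c"]

# levels! sets the levels and their order in one call.
levels!(x, ["c", "a"])
levels(x)        # ["c", "a"]
```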
CategoricalArray{T} is designed to work with any underlying type T, but the most common use case is T = String. To make it more convenient to work with such arrays, CategoricalValue elements support all operations which work on strings, and are actually a special type of string (i.e. CategoricalValue <: AbstractString). The only difference from a standard String value is that comparisons like < and > are based on the ordering of levels rather than on the lexicographic ordering.