Implementation details

Implementation details

CategoricalArray is made of the two fields:

The CategoricalPool{V,R,C} type keeps track of the levels of type V and associates them with an integer reference code of type R (for internal use). It offers methods to add new levels, and efficiently get the integer index corresponding to a level and vice-versa. Whether the values of CategoricalArray are ordered or not is defined by an ordered field of the pool. Finally, CategoricalPool{V,R,C} keeps a valindex vector of value objects of type C == CategoricalValue{V, R}, so that getindex can return the existing object instead of allocating a new one.

Do note that CategoricalPool levels are semi-mutable: it is only allowed to add new levels, but never to remove or reorder existing ones. This ensures existing CategoricalValue objects remain valid and always point to the same level as when they were created. Therefore, CategoricalArrays create a new pool each time some of their levels are removed or reordered. This happens when calling levels!, but also when assigning a CategoricalValue via setindex!, push!, append!, copy! or copyto! (as new levels may be added to the front to preserve relative order of both source and destination levels). Doing so requires updating all reference codes to point to the new pool, and makes it impossible to compare existing ordered CategoricalValue objects with values from the array using < and >.

The type parameters of CategoricalArray{T, N, R <: Integer, V, C, U} are a bit complex:

Only T, N and R could be specified upon construction. The last three parameters are chosen automatically, but are needed for the definition of the type. In particular, U allows expressing that CategoricalArray{T, N} inherits from AbstractArray{Union{C, U}, N} (which is equivalent to AbstractArray{C, N} for arrays which do not support missing values, and to AbstractArray{Union{C, Missing}, N} for those which support them).

The CategoricalPool type is designed to limit the need to go over all elements of the vector, either for reading or for writing. This is why unused levels are not dropped automatically (this would force checking all elements on every modification or keeping a counts table), but only when droplevels! is called. levels is a (very fast) O(1) operation since it merely returns the (ordered) vector of levels without accessing the data at all.