Categorical Variable Encoding
Feature-engine, a Python library for feature engineering
Published Oct 3 2025
One Hot Encoder
This technique replaces the categorical variable with a set of binary variables, each taking the value 0 or 1, where each new binary variable corresponds to one label of the categorical variable.
If Colour has 3 possible values (red, blue and white), the encoder creates three new binary variables: Colour_red, Colour_blue and Colour_white. Each is given the value 0 or 1 to indicate which colour category the row belongs to.
One of these variables is redundant, because 2 binary variables are enough to define the colour category:
- Colour_red = 1 and Colour_blue = 0, meaning red
- Colour_red = 0 and Colour_blue = 1, meaning blue
- Colour_red = 0 and Colour_blue = 0, meaning white
You can pass the parameter drop_last=True to the encoder to drop the redundant binary variable while still recording the colour category. In this example it removes the third variable, Colour_white.
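The effect of dropping the last binary variable can be sketched in plain Python. This is only an illustration of the idea, not Feature-engine's implementation: the helper one_hot is hypothetical, and ordering categories by first appearance is an assumption made for the sketch.

```python
def one_hot(values, prefix, drop_last=False):
    """Return one dict of 0/1 indicator columns per input value (hypothetical helper)."""
    # Assumption for this sketch: categories are ordered by first appearance.
    categories = list(dict.fromkeys(values))
    if drop_last:
        # The dropped category is implied when every remaining indicator is 0.
        categories = categories[:-1]
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


colours = ["red", "blue", "white", "red"]

full = one_hot(colours, "Colour")
reduced = one_hot(colours, "Colour", drop_last=True)

print(full[2])     # indicators for "white" with all three columns
print(reduced[2])  # indicators for "white" once Colour_white is dropped
```

With the last column dropped, a white row is the one where Colour_red and Colour_blue are both 0, which matches the list above.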
Ordinal Encoder
This replaces categories with ordinal numbers, such as 0, 1, 2, 3 and so on. The encoding method can be set to 'ordered' or 'arbitrary'. When set to 'ordered', the categories are numbered in ascending order of the mean target value per category. When set to 'arbitrary', the categories are numbered arbitrarily. The ordered method only works when the machine learning task has a target, as in regression or classification; it will fail for tasks without one, such as clustering.
Fitting the encoder assigns a number to each category. If the fitted encoder is stored in a variable called output, then output.encoder_dict_ shows which number was assigned to which category, e.g. {'region': {'N': 0, 'S': 1, 'E': 2, 'W': 3}, 'sex': {'female': 0, 'male': 1}}
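To make the difference between the two methods concrete, here is a minimal plain-Python sketch of how such a mapping could be built. It is not Feature-engine's code: the function ordinal_mapping and the target values are made up for illustration, and "arbitrary" is modelled here as order of first appearance.

```python
from collections import defaultdict


def ordinal_mapping(categories, target=None):
    """Map each category to an integer (hypothetical helper).

    With a target: number categories by ascending target mean ("ordered").
    Without one: number them by first appearance ("arbitrary" in this sketch).
    """
    if target is None:
        seen = list(dict.fromkeys(categories))
        return {cat: i for i, cat in enumerate(seen)}

    totals = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, target):
        totals[cat] += y
        counts[cat] += 1
    means = {cat: totals[cat] / counts[cat] for cat in totals}
    # Lowest target mean gets 0, the next lowest gets 1, and so on.
    ranked = sorted(means, key=means.get)
    return {cat: i for i, cat in enumerate(ranked)}


region = ["N", "S", "E", "W", "N", "S"]
charges = [10.0, 30.0, 20.0, 40.0, 12.0, 28.0]  # made-up target values

arbitrary = ordinal_mapping(region)
ordered = ordinal_mapping(region, charges)

print(arbitrary)
print(ordered)
```

In the ordered case the numbers carry information about the target, which is why that method needs a target to be supplied when fitting.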
Rare Label Encoder
This encoder groups infrequent categories into a new category called 'Rare' (or another name that you define).
There are some parameters to consider:
- tol: The tolerance, i.e. the minimum proportion of observations a category must have to be kept on its own. For example, with 0.1, any category that appears in fewer than 10% of the observations is grouped into the Rare bucket.
- n_categories: The minimum number of distinct categories a variable must have before the encoder starts grouping anything into Rare. This is useful with big datasets when there is no time to examine every categorical variable individually, because it ensures that variables with low cardinality are not reduced any further.
- max_n_categories: Optionally, the maximum number of distinct categories to keep. Only the most frequent categories up to this number are kept, and all less frequent ones are grouped as Rare.
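The name of the grouped category can be changed with the replace_with parameter. For example, replace_with='Other' labels the rare categories as Other instead of Rare.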
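How these parameters interact can be sketched in plain Python. The function group_rare below is a hypothetical illustration rather than Feature-engine's implementation. Its parameter names mirror the ones described above, but the default values and the handling of boundary cases (a frequency exactly equal to tol, or a category count exactly equal to n_categories) are assumptions of the sketch.

```python
from collections import Counter


def group_rare(values, tol=0.05, n_categories=10, max_n_categories=None,
               replace_with="Rare"):
    """Replace infrequent categories with a single label (hypothetical helper)."""
    counts = Counter(values)
    # Variables with too few distinct categories are left untouched.
    if len(counts) < n_categories:
        return list(values)
    total = len(values)
    # most_common() lists categories from most to least frequent.
    frequent = [cat for cat, n in counts.most_common() if n / total >= tol]
    if max_n_categories is not None:
        # Keep only the top categories; the rest join the rare group.
        frequent = frequent[:max_n_categories]
    keep = set(frequent)
    return [v if v in keep else replace_with for v in values]


# Made-up data: 20 rows, 5 distinct cities.
cities = ["London"] * 10 + ["Paris"] * 5 + ["Rome"] * 3 + ["Oslo", "Bern"]

grouped = group_rare(cities, tol=0.1, n_categories=3)
top_two = group_rare(cities, tol=0.1, n_categories=3,
                     max_n_categories=2, replace_with="Other")
untouched = group_rare(cities, tol=0.1, n_categories=10)

print(Counter(grouped))
print(Counter(top_two))
print(untouched == cities)
```

In this sketch, Oslo and Bern each make up 5% of the rows, so they fall below the 10% tolerance and are grouped. Limiting the result to two categories also pushes Rome into the grouped label, and raising n_categories above the number of distinct cities leaves the variable unchanged.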