Categorical Variable Encoding

Feature-engine, a Python library for feature engineering


Published Oct 3 2025


One Hot Encoder

This technique replaces the categorical variable with a set of binary variables (each taking the value 0 or 1), where each new binary variable corresponds to one label of the categorical variable.

from feature_engine.encoding import OneHotEncoder
output = OneHotEncoder(variables=['Colour'])

If Colour has 3 possible values of red, blue and white, it will create three new binary variables: Colour_red, Colour_blue and Colour_white, which will be given the value 0 or 1 to define which colour category it is.

One of these variables is redundant, because two binary variables are enough to identify the colour category:

Colour_red = 1 and Colour_blue = 0, meaning red
Colour_red = 0 and Colour_blue = 1, meaning blue
Colour_red = 0 and Colour_blue = 0, meaning white

You can add the parameter drop_last=True to remove the redundant binary variable while still recording the colour category:

output = OneHotEncoder(variables=['Colour'], drop_last=True)

This drops the binary variable for the last category, which in this example is the redundant third variable, Colour_white.
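To make the redundancy concrete, here is a minimal plain-Python sketch of the idea. It is not Feature-engine's implementation: the helper name one_hot and the choice to order categories by first appearance are assumptions made purely for illustration.

```python
def one_hot(values, prefix, drop_last=False):
    """Return one dict of 0/1 indicators per input value (illustrative helper)."""
    # Collect the distinct categories in order of first appearance.
    categories = list(dict.fromkeys(values))
    if drop_last:
        # k categories only need k - 1 indicators: the dropped category is
        # identified by all remaining indicators being 0.
        categories = categories[:-1]
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


colours = ["red", "blue", "white", "red"]
full = one_hot(colours, "Colour")
reduced = one_hot(colours, "Colour", drop_last=True)

print(full[2])     # indicators for "white" with all three variables
print(reduced[2])  # indicators for "white" once the last variable is dropped
```

In the reduced output, a white row is the one where both Colour_red and Colour_blue are 0, matching the three combinations listed above.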

Ordinal Encoder

This replaces categories with ordinal numbers, like 0, 1, 2, 3 etc. The encoding method can be set to ordered or arbitrary. When set to ordered, the categories are numbered in ascending order of the mean target value per category. When set to arbitrary, the categories are numbered arbitrarily. The ordered method only works when your machine learning task has a target, as in regression or classification. It will fail on a task without a target, such as clustering.

from feature_engine.encoding import OrdinalEncoder
output = OrdinalEncoder(encoding_method='arbitrary', variables=['region', 'sex'])

This will assign a number to each category. Once the encoder has been fitted, output.encoder_dict_ shows which number has been assigned to which category, e.g. {'region': {'N': 0, 'S': 1, 'E': 2, 'W': 3}, 'sex': {'female': 0, 'male': 1}}
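The difference between the two encoding methods can be sketched in plain Python. This is only an illustration of the idea, not the library's code: the helper names, the price target and the sample values are all hypothetical, and numbering arbitrary categories by first appearance is an assumption of this sketch.

```python
from collections import defaultdict


def ordered_mapping(categories, target):
    """Number categories by ascending mean of the target (illustrative helper)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, target):
        totals[cat] += y
        counts[cat] += 1
    means = {cat: totals[cat] / counts[cat] for cat in totals}
    # The lowest target mean gets 0, the next lowest gets 1, and so on.
    ranked = sorted(means, key=means.get)
    return {cat: number for number, cat in enumerate(ranked)}


def arbitrary_mapping(categories):
    """Number categories in order of first appearance (an assumption here)."""
    return {cat: number for number, cat in enumerate(dict.fromkeys(categories))}


region = ["N", "S", "E", "S", "N", "E"]
price = [10.0, 30.0, 20.0, 50.0, 20.0, 20.0]  # hypothetical target

ordered = ordered_mapping(region, price)
arbitrary = arbitrary_mapping(region)
encoded = [ordered[r] for r in region]

print(ordered)
print(arbitrary)
print(encoded)
```

The ordered mapping needs the target to compute the per-category means, which is why that method cannot be used when no target exists.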

Rare Label Encoder

This encoder groups infrequent categories into a new category called 'Rare' (or another name that you define).

There are some parameters to consider:

tol: The tolerance, i.e. the minimum proportion of observations a category must have to be kept on its own. For example, 0.1 means that a category appearing in fewer than 10% of the observations is grouped into the Rare bucket.
n_categories: The minimum number of distinct categories a variable must have before the encoder starts grouping anything into Rare. This parameter is useful when we have big datasets and do not have time to examine all categorical variables individually. This way, we ensure that variables with low cardinality are not reduced any further.
max_n_categories: Optionally, the maximum number of categories to keep as they are. Only the most frequent categories up to this number are retained, and the rest are grouped as Rare.

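from feature_engine.encoding import RareLabelEncoder

encoder = RareLabelEncoder(
    tol=0.1,                          # group categories seen in under 10% of rows
    n_categories=2,                   # minimum number of categories before grouping applies
    variables=['region', 'country'],
    replace_with='Other',             # label for the grouped categories
)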

Because replace_with='Other' is set, the infrequent categories are labelled Other instead of Rare.
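How the three parameters interact can be sketched in plain Python. This is a rough illustration rather than Feature-engine's implementation: the helper group_rare, its default values and the exact boundary conditions (a frequency of at least tol counts as frequent, and grouping applies only when there are more than n_categories distinct values) are assumptions of this sketch.

```python
from collections import Counter


def group_rare(values, tol=0.05, n_categories=10, max_n_categories=None,
               replace_with="Rare"):
    """Replace infrequent categories with a single label (illustrative helper)."""
    counts = Counter(values)
    # Variables with too few distinct categories are left untouched.
    if len(counts) <= n_categories:
        return list(values)
    total = len(values)
    # most_common() sorts by descending count, so frequent labels come first.
    frequent = [cat for cat, n in counts.most_common() if n / total >= tol]
    if max_n_categories is not None:
        # Keep only the top categories; the rest join the rare group.
        frequent = frequent[:max_n_categories]
    keep = set(frequent)
    return [v if v in keep else replace_with for v in values]


region = ["N"] * 6 + ["S"] * 3 + ["E"]  # E appears in 10% of the rows

grouped = group_rare(region, tol=0.2, n_categories=2, replace_with="Other")
untouched = group_rare(region, tol=0.2, n_categories=3, replace_with="Other")
top_one = group_rare(region, tol=0.2, n_categories=2, max_n_categories=1,
                     replace_with="Other")

print(Counter(grouped))
print(Counter(untouched))
print(Counter(top_one))
```

With n_categories=3 the variable is returned unchanged, which shows how that parameter protects low-cardinality variables from being reduced further.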