Data Preprocessing and Feature Engineering
Raw data is rarely ready for machine learning. It often contains missing values, inconsistent scales, or categorical variables that need encoding.
Data preprocessing is the process of transforming raw inputs into a clean, numerical format suitable for training.
Feature engineering extends this idea: it means creating or modifying features to improve model performance.
Scikit-learn provides robust, consistent tools for both preprocessing and feature engineering — all following the same familiar API (fit, transform, fit_transform).
This section covers:
- Handling missing data
- Scaling and normalisation
- Encoding categorical variables
- Feature selection and dimensionality reduction
- Combining steps with Pipeline and ColumnTransformer
Why Preprocessing Matters
Machine learning algorithms assume data is numeric, clean, and comparable across features.
Without preprocessing:
- Some models fail outright - SVMs, for example, can't handle NaNs.
- Scale-sensitive algorithms may perform poorly.
- Categorical data cannot be directly interpreted by numerical models.
Preprocessing ensures:
- Numerical features are standardised
- Categorical features are encoded
- Missing data is handled consistently
- The same transformation is applied to training and test sets
Handling Missing Data
Missing values are common in real datasets and must be handled before training. Scikit-learn provides the SimpleImputer class for this.
Strategies:
- mean - replace missing values with the mean of the column
- median - replace with the median (robust to outliers)
- most_frequent - replace with the mode (useful for categorical)
- constant - replace with a specific value (e.g. “unknown”)
Example: Using SimpleImputer
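Here is a minimal sketch on a small made-up array (the values are chosen purely for illustration); the printed result appears under Output below:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: one missing value in each column
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan],
              [4.0, 6.0]])

# Learn the column means and fill in the NaNs
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```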
Output:
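```
[[1. 2.]
 [4. 4.]
 [7. 4.]
 [4. 6.]]
```

Both column means happen to be 4 here, so each missing entry is replaced with 4.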
Always fit the imputer on your training data and then transform both train and test sets.
Feature Scaling and Normalisation
Different features may have different numerical ranges, e.g. “income” (in thousands) vs “age” (in years).
Some models (like SVMs, KNN, and gradient descent–based algorithms) are sensitive to these differences.
Standardisation
Centres data to zero mean and unit variance:
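For example, a quick sketch with StandardScaler on a small made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Learn each column's mean and standard deviation, then rescale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```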
After scaling:
- Mean = 0
- Standard deviation = 1
Normalisation
Scales individual samples (rows) to unit length:
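A small sketch with Normalizer, again on made-up data:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Rescale each row to unit Euclidean (L2) length
normalizer = Normalizer(norm="l2")
print(normalizer.fit_transform(X))
# [[0.6 0.8]
#  [1.  0. ]]
```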
When to use:
- StandardScaler: For most models (e.g. logistic regression, SVM, PCA)
- MinMaxScaler: When values must remain within a specific range (e.g. [0,1])
- Normalizer: For sparse data or distance-based models (cosine similarity)
Encoding Categorical Features
Machine learning algorithms can’t directly interpret text labels. Encoding converts categories into numeric representations.
1. One-Hot Encoding
Creates binary (0/1) columns for each category.
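Here is a minimal sketch with OneHotEncoder (note: sparse_output=False assumes scikit-learn 1.2 or later; older versions use sparse=False instead). The result appears under Output below:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colours = np.array([["red"], ["green"], ["blue"], ["green"]])

# sparse_output=False returns a dense array instead of a sparse matrix
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(colours)

print(encoder.get_feature_names_out(["colour"]))
print(encoded)
```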
Output:
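```
['colour_blue' 'colour_green' 'colour_red']
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]
```

Categories are sorted alphabetically, so "blue" maps to the first column.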
2. Ordinal Encoding
Maps categories to integer values (use only for ordered categories):
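A small sketch with OrdinalEncoder, passing an explicit category order so the integers reflect the real ranking (the size labels are made up):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# Provide the order explicitly: small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(encoder.fit_transform(sizes).ravel())
# [0. 2. 1. 0.]
```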
Do not use ordinal encoding for unordered categories: it implies a false numerical relationship between them.
Feature Selection
Not all features are equally informative. Removing irrelevant or redundant features can improve model performance, interpretability, and training speed.
Scikit-learn offers several feature selection methods:
| Method | Description |
| --- | --- |
| VarianceThreshold | Removes features with low variance |
| SelectKBest | Keeps top K features by statistical test |
| RFE | Recursively eliminates weak features |
Example: SelectKBest with ANOVA F-test
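A sketch using the built-in iris dataset (used here only for illustration); the printed result appears under Output below:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the class labels and keep the best two
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)
print(selector.get_support())
```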
Output:
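```
(150, 4) -> (150, 2)
[False False  True  True]
```

For iris, the two petal measurements have the highest F-scores, so they are the features that are kept.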
Dimensionality Reduction (PCA)
Principal Component Analysis (PCA) is a popular method for reducing dimensionality while retaining as much variance as possible.
It projects features into a smaller number of orthogonal components.
Example:
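A short sketch, again on the iris dataset, reducing four features to two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Project the four original features onto two orthogonal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X.shape, "->", X_pca.shape)           # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_.sum())  # most of the variance is retained
```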
Visualising the first two principal components:
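A matplotlib sketch of that projection (the styling choices here are illustrative, not prescriptive):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)

# Scatter plot of the two components, coloured by class label
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Iris projected onto two principal components")
plt.show()
```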
PCA helps simplify visualisation and modelling, especially with high-dimensional data.
Combining Steps with ColumnTransformer
Real datasets often contain both numeric and categorical columns. You can preprocess each type separately using ColumnTransformer.
Example:
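Here is a sketch built around a small, hypothetical DataFrame with both column types (the column names and values are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type dataset
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [40000, 52000, 61000, None],
    "city": ["London", "Paris", "London", "Berlin"],
})

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numeric columns: impute missing values, then standardise
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring unseen categories at transform time
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

X_processed = preprocessor.fit_transform(df)
print(X_processed.shape)  # (4, 5): two scaled numeric columns + three one-hot columns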
This ensures all preprocessing steps are learned from the training data and applied consistently.
Avoiding Data Leakage
Data leakage happens when information from the test set is used, directly or indirectly, during training.
This leads to overly optimistic performance estimates.
Common causes:
- Scaling or imputing the entire dataset before splitting
- Performing feature selection before splitting
- Tuning hyperparameters using the test set
Prevention:
- Always split data before any transformations.
- Use Pipeline or ColumnTransformer to encapsulate preprocessing (see the sketch below).
- Keep test data completely unseen until the final evaluation.
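A minimal sketch of that workflow, using the built-in breast cancer dataset purely as an example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split first, so nothing about the test set influences preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The scaler inside the pipeline is fitted on the training data only
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```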
Best Practices
- Always standardise numeric features (especially for linear models or distance-based algorithms).
- Handle missing data systematically - avoid dropping rows unless absolutely necessary.
- Encode categorical variables carefully - use one-hot encoding for unordered categories.
- Combine steps into pipelines - it prevents leakage and simplifies your workflow.
- Document transformations - especially when sharing models for production or deployment.