Pipelines and Workflow Automation

Scikit-learn Basics

3 min read

Published Nov 17 2025, updated Nov 19 2025



Tags: Clustering, Feature Engineering, K-Means, Linear Regression, Logistic Regression, Machine Learning, NumPy, Python, Random Forests, scikit-learn, Supervised Learning, Unsupervised Learning

As machine learning projects grow, data preprocessing and modelling steps can quickly become complex and error-prone.
Each transformation (scaling, encoding, imputation) must be applied consistently, and in the correct order, across training, validation, and test sets.


Scikit-learn’s Pipeline and ColumnTransformer classes solve this problem elegantly.
They let you chain multiple steps (data cleaning, feature engineering, model fitting) into a single, reusable object with a uniform API.






Why Use Pipelines?

Without a pipeline, you often write code like this:

scaler = StandardScaler()
model = LogisticRegression()

scaler.fit(X_train)                        # fit the scaler on the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # apply the same fitted transform to the test data
model.fit(X_train_scaled, y_train)

This works but is easy to get wrong, especially when tuning or cross-validating models.


A Pipeline automates this process:

  • Keeps preprocessing and modelling steps together
  • Prevents data leakage (accidentally using test data during training)
  • Simplifies cross-validation and hyperparameter tuning (see the sketch below)
  • Makes deployment easier: one object does everything
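
As a minimal sketch of the cross-validation point (using the same wine dataset as the examples below): when a pipeline is passed to cross_val_score, the scaler is refitted on the training portion of every fold, so the held-out fold never influences the preprocessing.

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

leak_free = Pipeline([
    ('scaler', StandardScaler()),          # refit on the training part of each fold
    ('logreg', LogisticRegression(max_iter=200))
])

# Each fold gets its own scaler fit, so no information leaks from the held-out fold
scores = cross_val_score(leak_free, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())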





The Pipeline Object

A pipeline is a sequence of steps defined as a list of (name, transformer/estimator) pairs.
Every step except the last must be a transformer (it implements .fit() and .transform()); the final step can be any estimator (a model with .fit()).


Example

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

pipe.fit(X_train, y_train)
print("Accuracy:", pipe.score(X_test, y_test))

Output:

Accuracy: 1.0

Now scaling and model fitting happen together — and the same pipeline can be used directly on new data.
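
For example, the fitted pipeline can be used for prediction on raw, unscaled inputs, and make_pipeline offers a shorthand that generates step names automatically. A small sketch, continuing the example above:

# Predict on new (raw, unscaled) data -- scaling happens inside the pipeline
predictions = pipe.predict(X_test[:5])
print("Predictions:", predictions)

# Shorthand construction: step names are derived from the class names
from sklearn.pipeline import make_pipeline
pipe2 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe2.fit(X_train, y_train)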






Pipelines and Cross-Validation

Pipelines integrate seamlessly with cross-validation and hyperparameter tuning.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

param_grid = {
    # Prefix: step name + '__' + parameter name
    'logreg__C': [0.1, 1, 10]
}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)

Note:
In parameter names, use the double-underscore (__) to refer to parameters inside a specific step.

Example:

  • Step name: 'logreg'
  • Parameter: 'C'
  • Full name for tuning: 'logreg__C'
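
The same step__parameter syntax works outside grid search too. A brief sketch, assuming the pipe object defined above:

# Set the regularisation strength of the 'logreg' step directly
pipe.set_params(logreg__C=10)

# List all tunable parameter names, including the nested ones
print(sorted(pipe.get_params().keys()))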





Combining Different Data Types

Real-world data often contains a mix of numeric and categorical features.
You can preprocess each type separately using ColumnTransformer, then include it in a pipeline.


Example: Mixed-Type Preprocessing

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Sample dataset
data = pd.DataFrame({
    'age': [25, 30, None, 40],
    'income': [50000, 60000, 55000, None],
    'city': ['London', 'Paris', 'London', 'Berlin'],
    'purchased': [1, 0, 0, 1]
})

X = data[['age', 'income', 'city']]
y = data['purchased']

numeric_features = ['age', 'income']
categorical_features = ['city']

# Numeric preprocessing
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical preprocessing
categorical_transformer = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine them
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
clf = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf.fit(X_train, y_train)
print("Model accuracy:", clf.score(X_test, y_test))

Everything (imputation, scaling, encoding, and model fitting) now runs in one unified workflow.
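
Because the encoder was created with handle_unknown='ignore', the fitted pipeline can also predict on raw records containing categories it never saw during training. A small illustrative sketch (the new customer rows below are made up):

# New raw records -- 'Madrid' was never seen during training
new_customers = pd.DataFrame({
    'age': [35, None],
    'income': [52000, 61000],
    'city': ['Madrid', 'Paris']
})

# Imputation, scaling and encoding are all applied automatically
print("Predictions:", clf.predict(new_customers))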






Accessing Steps and Attributes

You can inspect or use intermediate steps in a pipeline:

# Access a step
scaler = pipe.named_steps['scaler']

# Retrieve learned attributes
print("Mean values:", scaler.mean_)

Or use the pipeline as a transformer:

# all steps except the final estimator
X_transformed = pipe[:-1].transform(X)





Feature Names After Transformation

After a ColumnTransformer, it’s often useful to see what the final feature names are (especially after one-hot encoding):

feature_names = clf.named_steps['preprocessor'].get_feature_names_out()
print("Transformed feature names:", feature_names)

This helps in model interpretation and debugging.
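
One common use, sketched here assuming clf has been fitted as above, is to pair these names with the transformed array so the preprocessed data can be inspected as a DataFrame (the combined output is dense for this small example):

# View the preprocessed training data with readable column names
X_train_transformed = clf.named_steps['preprocessor'].transform(X_train)
transformed_df = pd.DataFrame(X_train_transformed, columns=feature_names)
print(transformed_df.head())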






Tuning Pipelines

You can tune any parameter inside a pipeline or nested pipeline using the step__parameter syntax.

Example:

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'model__n_estimators': [50, 100, 200]
}

# Note: cross-validation needs enough samples per fold -- use a real dataset
# (larger than the 4-row demo above) for cv=5 to work as expected
search = GridSearchCV(clf, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)

Grid search and randomised search automatically handle pipelines just like any other estimator.
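
For larger grids, a randomised search samples a fixed number of candidates instead of trying every combination. A minimal sketch (the parameter values here are illustrative, and the same dataset-size caveat as above applies):

from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'model__n_estimators': [50, 100, 200, 400],
    'model__max_depth': [None, 5, 10]
}

# n_iter controls how many parameter combinations are sampled
rand_search = RandomizedSearchCV(clf, param_distributions, n_iter=5, cv=5, random_state=42)
rand_search.fit(X_train, y_train)
print("Best params:", rand_search.best_params_)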






Saving and Loading Pipelines

You can save a fully trained pipeline, including preprocessing and the model, using joblib.

import joblib

# Save pipeline
joblib.dump(clf, 'customer_model.pkl')

# Load pipeline
loaded_clf = joblib.load('customer_model.pkl')
print("Loaded model accuracy:", loaded_clf.score(X_test, y_test))

This ensures that the exact same transformations and model logic are reused later in production.






Pipelines with Feature Selection or Dimensionality Reduction

Pipelines can include any combination of preprocessing, feature selection, and modelling.


Example:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Re-use the numeric wine data from the first example
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pca_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('model', LogisticRegression(max_iter=200))
])

pca_pipe.fit(X_train, y_train)
print("Test accuracy:", pca_pipe.score(X_test, y_test))

By chaining these steps, you guarantee the same PCA transformation during both training and prediction.
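
Feature selection steps slot in the same way. A minimal sketch using SelectKBest (the choice of k=5 is illustrative):

from sklearn.feature_selection import SelectKBest, f_classif

select_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=5)),   # keep the 5 most informative features
    ('model', LogisticRegression(max_iter=200))
])

select_pipe.fit(X_train, y_train)
print("Test accuracy:", select_pipe.score(X_test, y_test))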






Nested Pipelines and Advanced Workflows

You can even include pipelines within ColumnTransformers or other pipelines.
This is useful when you need separate processing pipelines for multiple feature groups.


Example:

preprocessor = ColumnTransformer([
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features)
])

full_pipe = Pipeline([
    ('features', preprocessor),
    ('classifier', Pipeline([
        ('pca', PCA(n_components=3)),
        ('logreg', LogisticRegression(max_iter=200))
    ]))
])

This structure lets you modularise your workflow, making it easier to debug and maintain.
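
Parameters of nested steps are addressed by chaining the step names with double underscores. A brief sketch, assuming the full_pipe object defined above:

# Chain names through each level: outer step, inner step, then the parameter itself
full_pipe.set_params(classifier__pca__n_components=2,
                     classifier__logreg__C=0.5)

# The same double-underscore names work as keys in a GridSearchCV param_grid
nested_grid = {'classifier__pca__n_components': [2, 3]}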






Common Pitfalls

  • Fitting transformers separately: test data influences the preprocessing. Solution: use a Pipeline, or fit transformers on the training data only.
  • Forgetting to apply the same transform to test data: leads to inconsistent input. Solution: let the pipeline handle transformations automatically.
  • Using categorical strings directly: models cannot interpret raw text. Solution: encode with OneHotEncoder or OrdinalEncoder.
  • Manually tuning parameters inside pipelines: inconsistent and slow. Solution: use GridSearchCV or RandomizedSearchCV with the step__param syntax.






Best Practices

  1. Always encapsulate preprocessing and modelling together - Pipelines ensure consistency and prevent leakage.
  2. Use ColumnTransformer for mixed data types - Handle numeric, categorical, and text features separately and cleanly.
  3. Integrate tuning and validation directly - Combine pipelines with GridSearchCV for reproducible optimisation.
  4. Save pipelines after training - This guarantees identical preprocessing when deployed.
  5. Keep pipelines modular - Build smaller components (scalers, encoders, models) and combine them; this improves flexibility.
