Pipelines and Workflow Automation

Scikit-learn Basics

3 min read

Published Nov 17 2025, updated Nov 19 2025



Tags: Clustering, Feature Engineering, K-Means, Linear Regression, Logistic Regression, Machine Learning, NumPy, Python, Random Forests, scikit-learn, Supervised Learning, Unsupervised Learning

As machine learning projects grow, data preprocessing and modelling steps can quickly become complex and error-prone.
Each transformation (scaling, encoding, imputation) must be applied consistently, and in the correct order, across training, validation, and test sets.


Scikit-learn’s Pipeline and ColumnTransformer classes solve this problem elegantly.
They let you chain multiple steps (data cleaning, feature engineering, model fitting) into a single, reusable object with a uniform API.






Why Use Pipelines?

Without a pipeline, you often write code like this:

scaler = StandardScaler()
model = LogisticRegression()

scaler.fit(X_train)                        # fit the scaler on the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # apply the same fitted transform to the test data
model.fit(X_train_scaled, y_train)

This works but is easy to get wrong, especially when tuning or cross-validating models.


A Pipeline automates this process:

  • Keeps preprocessing and modelling steps together
  • Prevents data leakage (accidentally using test data during training)
  • Simplifies cross-validation and hyperparameter tuning (see the sketch below)
  • Makes deployment easier: one object does everything
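
As a minimal sketch of the cross-validation point (using the same wine dataset as the examples below): when a pipeline is passed to cross_val_score, the scaler is refitted on the training portion of every fold, so the held-out fold never influences the preprocessing.

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

leak_free = Pipeline([
    ('scaler', StandardScaler()),          # refit on the training part of each fold
    ('logreg', LogisticRegression(max_iter=200))
])

# Each fold gets its own scaler fit, so no information leaks from the held-out fold
scores = cross_val_score(leak_free, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())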





The Pipeline Object

A pipeline is a sequence of steps defined as a list of (name, transformer/estimator) pairs.
Every step except the last must be a transformer (it implements .fit() and .transform()); the final step can be any estimator (a model with .fit()).


Example

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

pipe.fit(X_train, y_train)
print("Accuracy:", pipe.score(X_test, y_test))

Output:

Accuracy: 1.0

Now scaling and model fitting happen together — and the same pipeline can be used directly on new data.
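
For example, the fitted pipeline can be used for prediction on raw, unscaled inputs, and make_pipeline offers a shorthand that generates step names automatically. A small sketch, continuing the example above:

# Predict on new (raw, unscaled) data -- scaling happens inside the pipeline
predictions = pipe.predict(X_test[:5])
print("Predictions:", predictions)

# Shorthand construction: step names are derived from the class names
from sklearn.pipeline import make_pipeline
pipe2 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe2.fit(X_train, y_train)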






Pipelines and Cross-Validation

Pipelines integrate seamlessly with cross-validation and hyperparameter tuning.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

param_grid = {
    # Prefix: step name + '__' + parameter name
    'logreg__C': [0.1, 1, 10]
}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)

Note:
In parameter names, use the double-underscore (__) to refer to parameters inside a specific step.

Example:

  • Step name: 'logreg'
  • Parameter: 'C'
  • Full name for tuning: 'logreg__C'
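
The same step__parameter syntax works outside grid search too. A brief sketch, assuming the pipe object defined above:

# Set the regularisation strength of the 'logreg' step directly
pipe.set_params(logreg__C=10)

# List all tunable parameter names, including the nested ones
print(sorted(pipe.get_params().keys()))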





Combining Different Data Types

Real-world data often contains a mix of numeric and categorical features.
You can preprocess each type separately using ColumnTransformer, then include it in a pipeline.


Example: Mixed-Type Preprocessing

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Sample dataset
data = pd.DataFrame({
    'age': [25, 30, None, 40],
    'income': [50000, 60000, 55000, None],
    'city': ['London', 'Paris', 'London', 'Berlin'],
    'purchased': [1, 0, 0, 1]
})

X = data[['age', 'income', 'city']]
y = data['purchased']

numeric_features = ['age', 'income']
categorical_features = ['city']

# Numeric preprocessing
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical preprocessing
categorical_transformer = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine them
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
clf = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf.fit(X_train, y_train)
print("Model accuracy:", clf.score(X_test, y_test))

Everything (imputation, scaling, encoding, and model fitting) now runs in one unified workflow.
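
Because the encoder was created with handle_unknown='ignore', the fitted pipeline can also predict on raw records containing categories it never saw during training. A small illustrative sketch (the new customer rows below are made up):

# New raw records -- 'Madrid' was never seen during training
new_customers = pd.DataFrame({
    'age': [35, None],
    'income': [52000, 61000],
    'city': ['Madrid', 'Paris']
})

# Imputation, scaling and encoding are all applied automatically
print("Predictions:", clf.predict(new_customers))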






Accessing Steps and Attributes

You can inspect or use intermediate steps in a pipeline:

# Access a step
scaler = pipe.named_steps['scaler']

# Retrieve learned attributes
print("Mean values:", scaler.mean_)

Or use the pipeline as a transformer:

# all steps except the final estimator
X_transformed = pipe[:-1].transform(X)





Feature Names After Transformation

After a ColumnTransformer, it’s often useful to see what the final feature names are (especially after one-hot encoding):

feature_names = clf.named_steps['preprocessor'].get_feature_names_out()
print("Transformed feature names:", feature_names)

This helps in model interpretation and debugging.
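
One common use, sketched here assuming clf has been fitted as above, is to pair these names with the transformed array so the preprocessed data can be inspected as a DataFrame (the combined output is dense for this small example):

# View the preprocessed training data with readable column names
X_train_transformed = clf.named_steps['preprocessor'].transform(X_train)
transformed_df = pd.DataFrame(X_train_transformed, columns=feature_names)
print(transformed_df.head())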






Tuning Pipelines

You can tune any parameter inside a pipeline or nested pipeline using the step__parameter syntax.

Example:

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'model__n_estimators': [50, 100, 200]
}

# Note: cross-validation needs enough samples per fold -- use a real dataset
# (larger than the 4-row demo above) for cv=5 to work as expected
search = GridSearchCV(clf, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)

Grid search and randomised search automatically handle pipelines just like any other estimator.
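
For larger grids, a randomised search samples a fixed number of candidates instead of trying every combination. A minimal sketch (the parameter values here are illustrative, and the same dataset-size caveat as above applies):

from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'model__n_estimators': [50, 100, 200, 400],
    'model__max_depth': [None, 5, 10]
}

# n_iter controls how many parameter combinations are sampled
rand_search = RandomizedSearchCV(clf, param_distributions, n_iter=5, cv=5, random_state=42)
rand_search.fit(X_train, y_train)
print("Best params:", rand_search.best_params_)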






Saving and Loading Pipelines

You can save a fully trained pipeline, including preprocessing and the model, using joblib.

import joblib

# Save pipeline
joblib.dump(clf, 'customer_model.pkl')

# Load pipeline
loaded_clf = joblib.load('customer_model.pkl')
print("Loaded model accuracy:", loaded_clf.score(X_test, y_test))

This ensures that the exact same transformations and model logic are reused later in production.






Pipelines with Feature Selection or Dimensionality Reduction

Pipelines can include any combination of preprocessing, feature selection, and modelling.


Example:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Re-use the numeric wine data from the first example
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pca_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('model', LogisticRegression(max_iter=200))
])

pca_pipe.fit(X_train, y_train)
print("Test accuracy:", pca_pipe.score(X_test, y_test))

By chaining these steps, you guarantee the same PCA transformation during both training and prediction.
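
Feature selection steps slot in the same way. A minimal sketch using SelectKBest (the choice of k=5 is illustrative):

from sklearn.feature_selection import SelectKBest, f_classif

select_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=5)),   # keep the 5 most informative features
    ('model', LogisticRegression(max_iter=200))
])

select_pipe.fit(X_train, y_train)
print("Test accuracy:", select_pipe.score(X_test, y_test))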






Nested Pipelines and Advanced Workflows

You can even include pipelines within ColumnTransformers or other pipelines.
This is useful when you need separate processing pipelines for multiple feature groups.


Example:

preprocessor = ColumnTransformer([
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features)
])

full_pipe = Pipeline([
    ('features', preprocessor),
    ('classifier', Pipeline([
        ('pca', PCA(n_components=3)),
        ('logreg', LogisticRegression(max_iter=200))
    ]))
])

This structure lets you modularise your workflow, making it easier to debug and maintain.
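
Parameters of nested steps are addressed by chaining the step names with double underscores. A brief sketch, assuming the full_pipe object defined above:

# Chain names through each level: outer step, inner step, then the parameter itself
full_pipe.set_params(classifier__pca__n_components=2,
                     classifier__logreg__C=0.5)

# The same double-underscore names work as keys in a GridSearchCV param_grid
nested_grid = {'classifier__pca__n_components': [2, 3]}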






Common Pitfalls

  • Fitting transformers separately: test data influences the preprocessing. Solution: use a Pipeline, or fit transformers on the training data only.
  • Forgetting to apply the same transform to test data: leads to inconsistent input. Solution: let the pipeline handle transformations automatically.
  • Using categorical strings directly: models cannot interpret raw text. Solution: encode with OneHotEncoder or OrdinalEncoder.
  • Manually tuning parameters inside pipelines: inconsistent and slow. Solution: use GridSearchCV or RandomizedSearchCV with the step__param syntax.






Best Practices

  1. Always encapsulate preprocessing and modelling together - Pipelines ensure consistency and prevent leakage.
  2. Use ColumnTransformer for mixed data types - Handle numeric, categorical, and text features separately and cleanly.
  3. Integrate tuning and validation directly - Combine pipelines with GridSearchCV for reproducible optimisation.
  4. Save pipelines after training - This guarantees identical preprocessing when deployed.
  5. Keep pipelines modular - Build smaller components (scalers, encoders, models) and combine them; this improves flexibility.
