Hyperparameter Tuning

Scikit-learn Basics

3 min read

Published Nov 17 2025, updated Nov 19 2025



Tags: Clustering, Feature Engineering, K-Means, Linear Regression, Logistic Regression, Machine Learning, NumPy, Python, Random Forests, scikit-learn, Supervised Learning, Unsupervised Learning

Every machine learning model in Scikit-learn has hyperparameters: configuration settings chosen before training.
They control the model’s structure or learning behaviour (e.g., tree depth, regularisation strength, number of neighbours).


Unlike learned parameters (like weights or coefficients), hyperparameters are not learned from data.
Choosing them well can dramatically improve performance — and choosing them poorly can cause overfitting or underfitting.


Scikit-learn provides systematic, automated ways to search for optimal hyperparameter values:

  • Grid Search (exhaustive search across predefined values)
  • Randomised Search (sampling from parameter distributions)
  • Bayesian Optimisation (via external libraries like Optuna — optional; sketched briefly below)
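
For orientation, here is a minimal Bayesian-optimisation sketch with Optuna (an external library installed separately with pip install optuna; not part of scikit-learn). It reuses the Wine dataset from the later examples and lets Optuna's default sampler propose hyperparameter values trial by trial:

import optuna
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

def objective(trial):
    # Optuna proposes a value for each hyperparameter on every trial
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 2, 15)
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth,
                                   random_state=42)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print("Best params:", study.best_params)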





Parameters vs Hyperparameters

Type            | Example                            | Set When        | Learned By
Model Parameter | Regression coefficients (coef_)    | During fit()    | The model
Hyperparameter  | max_depth, n_estimators, C, alpha  | Before training | You (via tuning)


Example:

from sklearn.tree import DecisionTreeClassifier

# Hyperparameters
model = DecisionTreeClassifier(max_depth=5, min_samples_split=4)
model.fit(X_train, y_train)

Here, max_depth and min_samples_split are hyperparameters that control model complexity.
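
To see the distinction directly, here is a small sketch (using LinearRegression and toy data purely for illustration): get_params() lists the hyperparameters you chose, while underscore-suffixed attributes such as coef_ only exist after fit() has learned them from the data.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

model = LinearRegression(fit_intercept=True)   # hyperparameter, set before training
print(model.get_params())                      # e.g. {'fit_intercept': True, ...}

model.fit(X, y)
print(model.coef_, model.intercept_)           # learned parameters, only available after fit()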






Why Tune Hyperparameters?

Proper tuning ensures that:

  • The model generalises better to unseen data
  • Performance is optimised without overfitting
  • You find the right bias-variance balance

Example:
A RandomForest with too few trees might underfit, while adding ever more trees mainly costs time and memory; depth-related settings such as max_depth control how much each tree can overfit.
Hyperparameter tuning helps find the sweet spot.
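
One way to visualise that sweet spot (a sketch, not from the original article) is validation_curve, which reports training and cross-validation scores as a single hyperparameter is varied; a growing gap between the two signals overfitting.

from sklearn.model_selection import validation_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

param_range = [1, 2, 5, 10, 20]
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name='max_depth', param_range=param_range, cv=5
)

# compare mean training vs cross-validation accuracy at each depth
for depth, tr, cv in zip(param_range,
                         train_scores.mean(axis=1),
                         cv_scores.mean(axis=1)):
    print(f"max_depth={depth}: train={tr:.3f}, cv={cv:.3f}")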






Manual Search (Baseline Approach)

You can start by manually trying a few configurations and comparing validation scores:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for n in [50, 100, 200]:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{n} trees: mean accuracy = {scores.mean():.3f}")

Output:

50 trees: mean accuracy = 0.963
100 trees: mean accuracy = 0.978
200 trees: mean accuracy = 0.979

This simple process can guide your initial hyperparameter ranges for grid or randomised search.






Grid Search (Exhaustive Search)

Grid Search evaluates every combination of parameter values across a specified grid. It’s thorough but can be slow on large grids.


Example: GridSearchCV

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)

Output:

Best parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 50}
Best CV score: 0.9774928774928775

Notes:

  • cv=5 → 5-fold cross-validation.
  • scoring='accuracy' can be replaced with any metric (f1, roc_auc, r2, etc.).
  • n_jobs=-1 uses all CPU cores for parallel computation.

After finding the best model:

best_model = grid_search.best_estimator_
print("Test accuracy:", best_model.score(X_test, y_test))






Randomised Search (Efficient Sampling)

When the parameter space is large, RandomizedSearchCV samples random combinations instead of trying all possible ones.

This often finds good results faster and works well for large models (e.g., gradient boosting).


Example:

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import randint
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(2, 10),
    'learning_rate': [0.01, 0.05, 0.1, 0.2]
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)
print("Best params:", random_search.best_params_)
print("Best CV score:", random_search.best_score_)

Output:

Best params: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 142}
Best CV score: 0.9401709401709402

Notes:

  • n_iter controls how many random samples to test.
  • You can pass distributions (scipy.stats.randint, uniform) for continuous ranges, as in the sketch after these notes.
  • Randomised search is ideal when grid search is too expensive.
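
For example, a continuous hyperparameter such as learning_rate can be drawn from a distribution rather than a fixed list. A small sketch using scipy.stats.loguniform (available in recent SciPy versions); the resulting param_dist is passed to RandomizedSearchCV exactly as above:

from scipy.stats import randint, loguniform

param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(2, 10),
    # sample learning_rate log-uniformly between 0.01 and 0.2
    'learning_rate': loguniform(0.01, 0.2)
}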





Nested Cross-Validation (Advanced)

When hyperparameter tuning is part of model selection, nested cross-validation provides an almost unbiased estimate of generalisation performance.


It runs an inner loop for tuning and an outer loop for evaluation:

from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.svm import SVC
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

# Inner loop: GridSearchCV tunes C and kernel within each outer training fold
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=3)

# Outer loop: evaluates the tuned model on folds it never tuned on
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(grid, X, y, cv=outer_cv)

print("Nested CV mean accuracy:", np.mean(nested_scores))

Output:

Nested CV mean accuracy: 0.9552380952380952

Why nested CV?
It avoids data leakage between tuning and validation by ensuring the outer test folds never influence hyperparameter choices.






Custom Scoring Functions

You can customise evaluation metrics for tuning by using make_scorer.


Example (optimise F1-score for binary classification):

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

# model and param_grid as defined in the earlier examples
f1_scorer = make_scorer(f1_score, average='binary')
grid = GridSearchCV(model, param_grid, scoring=f1_scorer, cv=5)
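
make_scorer also accepts fully custom functions. A minimal sketch with a hypothetical cost function that penalises false negatives twice as heavily as false positives (the metric itself is illustrative, not from the original article):

import numpy as np
from sklearn.metrics import make_scorer

def fn_weighted_cost(y_true, y_pred):
    # hypothetical cost: each false negative counts double
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return 2 * fn + fp

# greater_is_better=False tells scikit-learn that a lower cost is better
cost_scorer = make_scorer(fn_weighted_cost, greater_is_better=False)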

Scikit-learn also provides built-in scoring names:

from sklearn.metrics import get_scorer_names
print(get_scorer_names())


You’ll see metrics like 'accuracy', 'roc_auc', 'neg_mean_squared_error', etc.






Practical Tips for Efficient Tuning

  1. Start simple - Begin with a few important hyperparameters; expand only if needed.
  2. Use Randomised Search for large grids - Often finds comparable results much faster.
  3. Parallelise - Set n_jobs=-1 to use all available cores.
  4. Use cross-validation - Ensures your tuning isn’t biased by a lucky train/test split.
  5. Cache results (optional) - Use joblib to store models during search, especially for large datasets (see the sketch after this list).
  6. Monitor runtime - Some models (e.g., SVMs, gradient boosting) scale poorly with very large parameter grids.
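
As one reading of tip 5 (a sketch, not from the original article): a Pipeline's memory argument uses joblib to cache fitted transformers while a search repeatedly refits them, and joblib.dump can persist the winning model afterwards.

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

pipe = Pipeline(
    [('scaler', StandardScaler()), ('svc', SVC())],
    memory='cache_dir'   # joblib-backed cache directory for fitted transformers
)

grid = GridSearchCV(pipe, {'svc__C': [0.1, 1, 10]}, cv=5, n_jobs=-1)
grid.fit(X, y)

joblib.dump(grid.best_estimator_, 'best_model.joblib')   # save the tuned pipeline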





Example: Full Tuning Workflow

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from scipy.stats import randint

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(2, 15),
    'min_samples_split': randint(2, 10)
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-val score:", search.best_score_)
print("Test set accuracy:", search.best_estimator_.score(X_test, y_test))

Output:

Best parameters: {'max_depth': 13, 'min_samples_split': 9, 'n_estimators': 64}
Best cross-val score: 0.9849002849002849
Test set accuracy: 1.0





When and How to Stop Tuning

  • If tuning multiple algorithms, use a coarse search first to identify strong candidates.
  • Once the best model type is known, refine with tighter parameter ranges.
  • Avoid re-tuning endlessly; use validation performance as your stopping point.
  • Keep test data fully unseen until final confirmation.
