Advanced Topics and Next Steps

Scikit-learn Basics


Published Nov 17 2025, updated Nov 19 2025



Tags: Clustering, Feature Engineering, K-Means, Linear Regression, Logistic Regression, Machine Learning, NumPy, Python, Random Forests, scikit-learn, Supervised Learning, Unsupervised Learning

By this point, you’ve mastered Scikit-learn’s core workflow: loading data, preprocessing, training, evaluating, tuning, and deploying models.


This section builds on that foundation, introducing advanced techniques and best practices used in professional machine learning pipelines:

  • Ensemble learning
  • Feature importance and interpretability
  • Model explainability
  • Handling imbalanced data
  • Scalability and performance
  • Extending beyond Scikit-learn

These topics will help you go from “model that works” to “model that’s trustworthy, efficient, and production-ready.”






Ensemble Learning

Ensemble learning combines the predictions of multiple models to achieve better performance than any single model alone.


Why Ensembles Work

Different models make different errors.
By averaging or voting across them, an ensemble can reduce variance (overfitting) and, depending on the strategy, bias (underfitting): bagging mainly attacks variance, while boosting mainly attacks bias.
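
Scikit-learn's VotingClassifier makes the voting idea concrete: it combines several estimators by majority vote (hard voting) or by averaging predicted probabilities (soft voting). The estimators below are illustrative choices, not a recommendation:

from sklearn.datasets import load_wine
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Three model families that tend to make different kinds of errors
voting = VotingClassifier(
    estimators=[
        ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
        ('tree', DecisionTreeClassifier(random_state=42)),
    ],
    voting='soft'  # average predicted probabilities instead of counting votes
)

print("Voting CV accuracy:", cross_val_score(voting, X, y, cv=5).mean().round(3))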


Common ensemble strategies:

  • Bagging (Bootstrap Aggregating)
  • Boosting (Sequential correction of errors)
  • Stacking (Combining multiple model outputs as new features)


Bagging Example – Random Forest

Random Forest builds many decision trees on bootstrapped samples and averages their predictions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    random_state=42
)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))

Bagging reduces overfitting compared to a single decision tree and provides stable results.
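
As a quick illustrative check, fitting a single unconstrained decision tree on the same split usually shows the gap: near-perfect training accuracy but weaker test accuracy than the forest.

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# A lone, fully grown tree tends to memorise the training data
print("Single tree - train accuracy:", tree.score(X_train, y_train))
print("Single tree - test accuracy: ", tree.score(X_test, y_test))
print("Random forest - test accuracy:", rf.score(X_test, y_test))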



Boosting Example – Gradient Boosting

Boosting trains models sequentially: each new model focuses on correcting the errors of the previous one.

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb.fit(X_train, y_train)
print("Test accuracy:", gb.score(X_test, y_test))

Boosting often achieves higher accuracy than bagging, though it can overfit if not tuned carefully.
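
One common safeguard, sketched below, is GradientBoostingClassifier's built-in early stopping: setting n_iter_no_change holds out validation_fraction of the training data and stops adding trees once the validation score stops improving.

gb_early = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting rounds
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,   # internal hold-out used to monitor progress
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=42
)
gb_early.fit(X_train, y_train)

print("Boosting rounds actually used:", gb_early.n_estimators_)
print("Test accuracy:", gb_early.score(X_test, y_test))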



Stacking Example

Stacking combines multiple base models (level-0) and uses a meta-model (level-1) to learn the best combination of predictions.

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svm', SVC(probability=True, random_state=42))
]

stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=200)
)
stack.fit(X_train, y_train)
print("Stacked model accuracy:", stack.score(X_test, y_test))

Stacking learns how to weight or combine diverse models, making it a powerful meta-ensemble technique.
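
To check whether the meta-model actually adds value over its ingredients, one simple (illustrative) approach is to cross-validate the base models and the stacked model on the same data:

from sklearn.model_selection import cross_val_score

for name, model in [('random forest', rf), ('gradient boosting', gb), ('stacking', stack)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")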






Feature Importance and Interpretability

Understanding why your model made a decision is just as important as the accuracy itself, especially in regulated or safety-critical domains.


Feature Importance (Tree-Based Models)

Tree-based models like Random Forest and Gradient Boosting provide built-in measures of feature importance.

import pandas as pd
import matplotlib.pyplot as plt

feature_importances = pd.Series(
    rf.feature_importances_,
    index=load_wine().feature_names
).sort_values(ascending=True)

feature_importances.plot(kind='barh')
plt.title("Feature Importance (Random Forest)")
plt.show()

These values indicate which features contribute most to predictions, which is useful for feature selection and for domain insight.



Permutation Importance

Permutation importance evaluates how much model performance drops when a feature’s values are randomly shuffled.

from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=load_wine().feature_names)
importances.sort_values().plot(kind='barh')
plt.title("Permutation Importance")
plt.show()

Advantage: Works for any model, not just trees.
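
For instance, the same call works unchanged on a linear model. The pipeline below is purely illustrative; scaling is added so Logistic Regression converges on the raw wine features.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lin_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lin_model.fit(X_train, y_train)

# Same API, completely different model family
lin_result = permutation_importance(lin_model, X_test, y_test, n_repeats=10, random_state=42)
pd.Series(lin_result.importances_mean, index=load_wine().feature_names).sort_values().plot(kind='barh')
plt.title("Permutation Importance (Logistic Regression)")
plt.show()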






Model Explainability (XAI)

Explainable AI (XAI) techniques help visualise and interpret model decisions.
Scikit-learn supports basic inspection tools, and advanced libraries like SHAP and LIME provide more granular explanations.


Partial Dependence Plots (PDP)

A partial dependence plot shows how one or two features influence the model’s predictions, averaging out the effect of the remaining features.

from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    rf, X_train, features=['alcohol', 'malic_acid'],
    feature_names=load_wine().feature_names,  # X_train is a NumPy array, so names must be supplied
    target=0)  # class whose predictions are explained; required for multiclass models
plt.show()

Interpretation:

  • Flat line → little effect
  • Steep change → strong influence on prediction


SHAP (SHapley Additive exPlanations)

SHAP values quantify each feature’s contribution to a single prediction.

import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=load_wine().feature_names)

Benefits:

  • Model-agnostic in principle (KernelExplainer works with any model; TreeExplainer above is optimised for tree ensembles)
  • Provides global and per-sample explanations
  • Helps debug biases or unexpected patterns





Handling Imbalanced Data

In many real-world problems (fraud detection, rare diseases), classes are imbalanced: one class vastly outnumbers the others.
A naive model might achieve high accuracy by predicting only the majority class.


Strategies to Handle Imbalance:

  1. Resampling (using the imbalanced-learn package; see the sketch below):
    • Oversample the minority class (RandomOverSampler)
    • Undersample the majority class (RandomUnderSampler)
    • Generate synthetic minority samples (SMOTE)
  2. Algorithmic Adjustments:
    • Use class_weight='balanced' in classifiers (e.g., Logistic Regression, SVM)
  3. Metric Choice:
    • Evaluate using F1, ROC-AUC, or Precision-Recall AUC instead of accuracy.

Example:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced', max_iter=1000)  # weight classes inversely to their frequency
model.fit(X_train, y_train)
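
The resampling classes above come from the separate imbalanced-learn package (pip install imbalanced-learn) rather than Scikit-learn itself. A minimal sketch, assuming that package is installed and that the data is genuinely imbalanced:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # pipeline variant that allows resampling steps
from sklearn.metrics import classification_report

pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),           # synthesise new minority-class samples
    ('clf', LogisticRegression(max_iter=1000))
])
pipe.fit(X_train, y_train)

# Report per-class precision, recall and F1 instead of plain accuracy
print(classification_report(y_test, pipe.predict(X_test)))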





Scaling Up and Performance Optimization

Scikit-learn is optimised for moderate-sized datasets, but there are strategies for scaling up.


Efficiency Tips

  • Use sparse matrices for text or one-hot encoded data.
  • Limit model complexity (max_depth, n_estimators).
  • Use n_jobs=-1 for parallel computation.
  • Use incremental learning (partial_fit) for large data streams (e.g., SGDClassifier, MiniBatchKMeans); see the sketch after this list.
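
A minimal sketch of incremental learning, simulating a data stream by feeding the training set to SGDClassifier in small batches:

import numpy as np
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(random_state=42)
classes = np.unique(y_train)  # all classes must be declared on the first partial_fit call

# Feed the data in chunks, as if it were arriving in batches
for start in range(0, len(X_train), 32):
    sgd.partial_fit(X_train[start:start + 32], y_train[start:start + 32], classes=classes)

print("Test accuracy:", sgd.score(X_test, y_test))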

Distributed and Parallel Alternatives

For massive data:

  • Dask-ML – parallel Scikit-learn-like API
  • Spark MLlib – distributed ML on large clusters
  • cuML (RAPIDS) – GPU-accelerated Scikit-learn-like library





Extending Scikit-learn

Scikit-learn plays nicely with other libraries:

  • XGBoost / LightGBM / CatBoost – advanced gradient boosting frameworks (see the sketch after this list)
  • Statsmodels – statistical models and inference
  • PyTorch / TensorFlow – deep learning
  • Evidently AI – monitoring and drift detection
  • Optuna / Hyperopt – advanced hyperparameter optimisation
