Unsupervised Learning

Scikit-learn Basics


Published Nov 17 2025, updated Nov 19 2025



Unlike supervised learning, unsupervised learning deals with unlabelled data: data where we don’t have known outcomes or target labels.
The goal is to discover hidden patterns, structures, or relationships within the data itself.


Examples include:

  • Grouping customers by purchasing behaviour (clustering)
  • Reducing high-dimensional data for visualisation (PCA)
  • Detecting anomalies (outlier detection)

Scikit-learn provides a variety of unsupervised algorithms that follow the same familiar interface:

model = SomeUnsupervisedEstimator(...)
model.fit(X)
transformed = model.transform(X)

Even though these models don’t use labels (y), they still “learn” from the structure of X.
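For instance, here is a minimal, concrete sketch of that interface using two estimators covered later in this guide: a transformer exposes transform(), while a clusterer exposes predict() or fit_predict(). The data and parameters are only illustrative:

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Unlabelled data: only X is used, never y
X, _ = make_blobs(n_samples=100, n_features=5, random_state=42)

# Transformer-style estimator: fit() then transform()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Clusterer-style estimator: fit() then predict() (or fit_predict())
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

print(X_2d.shape, labels[:10])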






Key Types of Unsupervised Learning

  1. Clustering:
    Groups data points into clusters based on similarity.
    Examples: KMeans, DBSCAN, Agglomerative Clustering.
  2. Dimensionality Reduction:
    Compresses high-dimensional data into fewer features while preserving structure.
    Examples: PCA (Principal Component Analysis), t-SNE.
  3. Anomaly Detection:
    Identifies data points that differ significantly from the majority.
    Examples: Isolation Forest, One-Class SVM.





Clustering

Clustering attempts to find natural groupings in data.
Each algorithm has a different way of defining what a “cluster” means:

  • KMeans: Divides data into k spherical clusters based on distance to centroids.
  • DBSCAN: Groups dense regions together and marks sparse points as outliers.
  • Hierarchical / Agglomerative: Builds a tree (dendrogram) of clusters.
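KMeans and DBSCAN both get worked examples further down. Agglomerative clustering does not, so here is a minimal sketch; the dataset and parameter choices are purely illustrative:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Synthetic data with four natural groups
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Merge points bottom-up (Ward linkage) until four clusters remain
agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = agg.fit_predict(X)

print(labels[:10])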


KMeans Clustering

KMeans is one of the most widely used clustering algorithms. It partitions the data into k clusters by minimising the within-cluster variance.


Algorithm summary:

  1. Choose k cluster centres (centroids).
  2. Assign each point to the nearest centroid.
  3. Recompute centroids.
  4. Repeat until assignments stabilise.

Example:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Train KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualise clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=100)
plt.title("KMeans Clustering Example")
plt.show()

[Figure: KMeans clusters on the synthetic data, coloured by assignment, with centroids marked as red crosses]

Notes:

  • You must specify the number of clusters (n_clusters) in advance.
  • Initialisation can affect results; use n_init to improve stability.
  • The Elbow Method helps estimate the optimal number of clusters.



The Elbow Method

The Elbow Method evaluates clustering quality across different k values using inertia (sum of squared distances to centroids).

inertias = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

[Figure: Elbow Method plot of inertia against the number of clusters k]

The “elbow” point — where inertia stops decreasing sharply — indicates a good value for k.
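One rough way to make that judgement programmatically is to look at how much inertia drops with each extra cluster. This is only a heuristic sketch, reusing the inertias list from the loop above:

import numpy as np

# Drop in inertia gained by each additional cluster
# (assumes the `inertias` list computed in the loop above)
drops = -np.diff(inertias)
for k, drop in zip(range(2, 10), drops):
    print(f"k={k}: inertia drop {drop:.1f}")

# The elbow sits roughly where the drops become small and flatten out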




Silhouette Score

The silhouette score is a metric used to evaluate how well data has been clustered. It tells you how similar a point is to its own cluster compared to other clusters.


For each point i, the silhouette value is:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

Where:

  • a(i) = average distance from point i to all other points in its cluster
  • b(i) = minimum average distance from point i to points in any other cluster

Thus:

  • s(i) ≈ +1 → good clustering (well separated)
  • s(i) ≈ 0 → borderline/overlapping clusters
  • s(i) < 0 → bad clustering (misclassified point)

The overall silhouette score is the mean of all s(i) values.
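You can check this relationship directly in scikit-learn: silhouette_samples returns the per-point values s(i), and their mean matches silhouette_score. A quick sketch using synthetic data for illustration:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, random_state=42).fit_predict(X)

# Per-point silhouette values s(i)
s_values = silhouette_samples(X, labels)

# The overall score is simply their mean
print(s_values.mean(), silhouette_score(X, labels))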


Why Use It With K-Means?

K-means requires choosing k, the number of clusters.


Silhouette score helps determine the best k:

  • Higher silhouette score = better clustering structure.

Common process: compute silhouette scores for k = 2…10 and pick the best.


How To Compute Silhouette Score in Python:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Example data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

# Compute silhouette score
score = silhouette_score(X, labels)

print("Silhouette Score:", score)

How To Select the Best 'k' Using Silhouette Analysis:

scores = []
ks = range(2, 10)

for k in ks:
    model = KMeans(n_clusters=k, random_state=42).fit(X)
    labels = model.labels_
    score = silhouette_score(X, labels)
    scores.append(score)
    print(f"k={k}, silhouette={score:.3f}")

You can plot the scores:

import matplotlib.pyplot as plt

plt.plot(ks, scores, marker="o")
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores for Different k")
plt.show()

How To Interpret Silhouette Scores

  • 0.71 – 1.00 - Excellent clustering
  • 0.51 – 0.70 - Good clustering
  • 0.26 – 0.50 - Fair clustering
  • ≤ 0.25 - Poor clustering / possible overlap



DBSCAN (Density-Based Clustering)

DBSCAN groups together points that are close to each other in dense regions and labels sparse points as outliers.


Advantages:

  • Doesn’t require specifying k
  • Detects arbitrarily shaped clusters
  • Handles outliers naturally

Example:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='plasma')
plt.title("DBSCAN Clustering")
plt.show()

[Figure: DBSCAN cluster assignments on the synthetic data, with noise points shown in their own colour]

Notes:

  • eps: neighbourhood radius; smaller values create more clusters.
  • min_samples: minimum points required to form a dense region.
  • Returns -1 for outlier points.
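Because noise points are labelled -1, a common idiom is to count clusters and outliers straight from the labels array. A small sketch continuing from the clusters array produced above:

import numpy as np

# Using `clusters` from the DBSCAN example above
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = int(np.sum(clusters == -1))

print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")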

Can you use Silhouette Score?

Yes, provided DBSCAN finds at least two clusters (the silhouette score is undefined for a single cluster).


Important notes:

  • DBSCAN labels noise points as -1
  • You must remove noise points (-1) before computing silhouette score (because they don’t belong to any cluster)

Example:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_

# Remove noise points
mask = labels != -1
score = silhouette_score(X[mask], labels[mask])
print(score)






Dimensionality Reduction

Dimensionality reduction simplifies datasets with many features while preserving important structure or variance.


This is crucial for:

  • Visualisation
  • Noise reduction
  • Improving efficiency
  • Handling multicollinearity

Two common methods:

  • PCA (Principal Component Analysis) – linear projection maximising variance.
  • t-SNE (t-distributed Stochastic Neighbour Embedding) – nonlinear visualisation for complex manifolds.


Principal Component Analysis (PCA)

PCA transforms features into new orthogonal components (linear combinations) ordered by variance.
It’s unsupervised but often used as a preprocessing step before supervised tasks.


Why PCA Is Useful

PCA helps when you have:

  • Many features
  • Redundant or highly correlated variables
  • Noise obscuring patterns
  • Algorithms that struggle in high dimensions, e.g. clustering, regression, visualisation

By using PCA, you:

  • Simplify the dataset
  • Remove collinearity
  • Reduce noise
  • Improve model efficiency
  • Often improve clustering or classification accuracy
  • Make it easier to visualise structure, e.g. 2D or 3D plots

How PCA Works

PCA does three main things:

1. Identifies directions of maximum variance

  • It finds the axes along which the data spreads out the most.

2. Creates new features (principal components)

  • Each component is a weighted combination of the original variables.
  • Component 1 → captures the largest amount of variance
  • Component 2 → the second largest, orthogonal to Component 1
  • And so on…

3. Reduces dimensionality

  • Instead of keeping all components, you keep only the first few that explain most of the variation.

For example:

  • 50 features → reduce to 5 components but still capture 90–95% of the information.

You can then specify the number of components inside a pipeline, for example:

PCA(n_components=5)
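As a rough sketch of what that pipeline might look like (the scaler and the final KMeans step are illustrative choices, not a fixed recipe):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Scale -> reduce to 2 components -> cluster
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, random_state=42),
)
labels = pipeline.fit_predict(X)
print(labels[:10])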

Example:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load and scale data
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualise
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA on Iris Dataset")
plt.show()

[Figure: PCA projection of the Iris dataset onto PC1 and PC2, coloured by species]

Output interpretation:

  • The axes (PC1, PC2) are directions of maximum variance.
  • Clusters often emerge naturally even without using labels.

Notes:

  • PCA requires scaled data (StandardScaler first).
  • You can check how much variance each component explains:
print(pca.explained_variance_ratio_)
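Building on the explained variance ratio, you can also pass a fraction to n_components and PCA will keep just enough components to explain that share of the variance. A short sketch (the 0.95 threshold is only an example):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Cumulative variance:", np.cumsum(pca.explained_variance_ratio_))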

PCA is linear; nonlinear structures may need t-SNE or UMAP.




t-SNE for Visualisation

t-SNE (t-distributed Stochastic Neighbour Embedding) is useful for visualising complex, high-dimensional data in 2D or 3D.


Example:

from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load and scale data
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Apply t-SNE
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap='viridis')
plt.title("t-SNE visualisation of Iris data")
plt.show()

[Figure: 2D t-SNE embedding of the Iris dataset, coloured by species]

Notes:

  • t-SNE focuses on preserving local structure; similar points stay close.
  • It’s non-deterministic; small random changes can alter the output.
  • Best used for visualisation, not as input to other models.





Evaluating Unsupervised Models

Evaluating unsupervised models can be tricky because there are no ground-truth labels.


Intrinsic Metrics

These measure internal cohesion and separation between clusters:

  • Inertia (for KMeans) - Sum of squared distances of samples to their assigned centroids.
  • Silhouette Score - Measures how similar a sample is to its own cluster vs others.
from sklearn.metrics import silhouette_score
score = silhouette_score(X, y_kmeans)
print("Silhouette Score:", score)

Closer to 1 = well-separated clusters.


Extrinsic Metrics

When true labels are available (e.g. for benchmarking):

  • Adjusted Rand Index (ARI) - Compares clustering results to true labels.
  • Normalised Mutual Information (NMI) - Measures shared information between predicted and actual clusters.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

ari = adjusted_rand_score(y, y_kmeans)
nmi = normalized_mutual_info_score(y, y_kmeans)
print("ARI:", ari, "NMI:", nmi)





Anomaly Detection (Overview)

Anomaly detection identifies data points that deviate from the overall pattern. It’s used in fraud detection, quality control, and cybersecurity.


Scikit-learn provides:

  • IsolationForest
  • OneClassSVM
  • LocalOutlierFactor

Example:

from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.05, random_state=42)
y_pred = iso.fit_predict(X) # -1 = outlier, 1 = normal

Anomalies are flagged as points that can be isolated from the rest of the data with few random splits, which typically happens in sparse, low-density regions.
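To dig into the flagged points, you can filter on the -1 labels and look at the anomaly scores from decision_function (lower means more anomalous). A short sketch continuing from the snippet above:

# Using `iso`, `X` and `y_pred` from the example above
outliers = X[y_pred == -1]          # rows flagged as anomalies
scores = iso.decision_function(X)   # lower = more anomalous

print("Outliers flagged:", len(outliers))
print("Most anomalous score:", scores.min())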






Comparing and Choosing Unsupervised Methods

Task → Recommended method (notes):

  • Find groups in numeric data → KMeans (simple, fast, needs k)
  • Find irregular shapes / outliers → DBSCAN (density-based, no k required)
  • Reduce high-dimensional data → PCA (linear projection, scalable)
  • Visualise complex data → t-SNE (nonlinear, good for 2D visualisation)
  • Detect anomalies → IsolationForest (works well on high-dimensional data)






Best Practices

  1. Always scale your features before clustering or PCA; unscaled features can dominate results.
  2. Try multiple clustering algorithms; different models capture different patterns.
  3. Use visualisation tools (PCA, t-SNE) to inspect structure and separability.
  4. Don’t overinterpret clusters; unsupervised models find structure, not meaning.
  5. Evaluate with multiple metrics if you have ground-truth labels.
