Unsupervised Learning
Scikit-learn Basics
Unlike supervised learning, unsupervised learning deals with unlabeled data: data where we don't have known outcomes or target labels.
The goal is to discover hidden patterns, structures, or relationships within the data itself.
Examples include:
- Grouping customers by purchasing behaviour (clustering)
- Reducing high-dimensional data for visualisation (PCA)
- Detecting anomalies (outlier detection)
Scikit-learn provides a variety of unsupervised algorithms that follow the same familiar interface:
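A minimal sketch of that shared interface, using random placeholder data purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Placeholder data: 100 samples, 4 features, no labels
X = np.random.rand(100, 4)

# Clustering: fit on X only, then get cluster assignments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Dimensionality reduction: fit on X only, then transform
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```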
Even though these models don’t use labels (y), they still “learn” from the structure of X.
Key Types of Unsupervised Learning
- Clustering: Groups data points into clusters based on similarity. Examples: KMeans, DBSCAN, Agglomerative Clustering.
- Dimensionality Reduction: Compresses high-dimensional data into fewer features while preserving structure. Examples: PCA (Principal Component Analysis), t-SNE.
- Anomaly Detection: Identifies data points that differ significantly from the majority. Examples: Isolation Forest, One-Class SVM.
Clustering
Clustering attempts to find natural groupings in data.
Each algorithm has a different way of defining what a “cluster” means:
- KMeans: Divides data into k spherical clusters based on distance to centroids.
- DBSCAN: Groups dense regions together and marks sparse points as outliers.
- Hierarchical / Agglomerative: Builds a tree (dendrogram) of clusters.
KMeans Clustering
KMeans is the most widely used clustering algorithm. It partitions the data into k clusters by minimising the within-cluster variance.
Algorithm summary:
1. Choose k cluster centres (centroids).
2. Assign each point to the nearest centroid.
3. Recompute centroids.
4. Repeat until assignments stabilise.
Example:
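A minimal sketch, using synthetic make_blobs data as a stand-in dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groups (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit KMeans with k=3; n_init runs several initialisations for stability
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])               # cluster assignment for the first 10 points
print(kmeans.cluster_centers_)   # learned centroids
```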

Notes:
- You must specify the number of clusters (n_clusters) in advance.
- Initialisation can affect results; use n_init to improve stability.
- The Elbow Method helps estimate the optimal number of clusters.
The Elbow Method
The Elbow Method evaluates clustering quality across different k values using inertia (sum of squared distances to centroids).
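A sketch of how this might look in code, again assuming synthetic make_blobs data:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Compute inertia for a range of k values
k_values = range(1, 11)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

# Plot inertia against k and look for the "elbow"
plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()
```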

The “elbow” point — where inertia stops decreasing sharply — indicates a good value for k.
Silhouette Score
The silhouette score is a metric used to evaluate how well data has been clustered. It tells you how similar a point is to its own cluster compared to other clusters.
For each point i, the silhouette value is:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
Where:
- a(i) = average distance from point i to all other points in its cluster
- b(i) = minimum average distance from point i to points in any other cluster
Thus:
- s(i) ≈ +1 → good clustering (well separated)
- s(i) ≈ 0 → borderline/overlapping clusters
- s(i) < 0 → bad clustering (misclassified point)
The overall silhouette score is the mean of all s(i) values.
Why Use It With K-Means?
K-means requires choosing k, the number of clusters.
Silhouette score helps determine the best k:
- Higher silhouette score = better clustering structure.
Common process: compute silhouette scores for k = 2…10 and pick the best.
How To Compute Silhouette Score in Python:
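A minimal sketch using silhouette_score from sklearn.metrics, with synthetic data assumed:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette value over all points: closer to 1 is better
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.3f}")
```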
How To Select the Best 'k' Using Silhouette Analysis:
You can plot the scores:
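A sketch of the loop-and-plot approach, assuming the same kind of synthetic data:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

k_values = range(2, 11)   # silhouette needs at least 2 clusters
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# Plot silhouette score against k and pick the k with the highest score
plt.plot(k_values, scores, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Mean silhouette score")
plt.show()

best_k = k_values[scores.index(max(scores))]
print("Best k:", best_k)
```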
How To Interpret Silhouette Scores
| Silhouette Score | Interpretation |
| --- | --- |
| 0.71 – 1.00 | Excellent clustering |
| 0.51 – 0.70 | Good clustering |
| 0.26 – 0.50 | Fair clustering |
| ≤ 0.25 | Poor clustering / possible overlap |
DBSCAN (Density-Based Clustering)
DBSCAN groups together points that are close to each other in dense regions and labels sparse points as outliers.
Advantages:
- Doesn’t require specifying k
- Detects arbitrarily shaped clusters
- Handles outliers naturally
Example:
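A minimal sketch, using the make_moons dataset (an assumption) since its crescent shapes suit a density-based method:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: a shape KMeans struggles with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# eps = neighbourhood radius, min_samples = points needed for a dense region
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X_scaled)

print(set(labels))   # cluster ids; -1 marks outliers
```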

Notes:
- eps: neighbourhood radius; smaller values create more clusters.
- min_samples: minimum points required to form a dense region.
- Returns -1 for outlier points.
Can you use Silhouette Score?
Yes, as long as DBSCAN assigns cluster labels.
Important notes:
- DBSCAN labels noise points as -1
- You must remove noise points (-1) before computing silhouette score (because they don’t belong to any cluster)
Example:
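A sketch of the idea, assuming a DBSCAN result on make_moons data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Keep only points that were assigned to a cluster (label != -1)
mask = labels != -1
if len(set(labels[mask])) >= 2:          # silhouette needs at least 2 clusters
    score = silhouette_score(X[mask], labels[mask])
    print(f"Silhouette (noise excluded): {score:.3f}")
else:
    print("Not enough clusters to compute a silhouette score.")
```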
Dimensionality Reduction
Dimensionality reduction simplifies datasets with many features while preserving important structure or variance.
This is crucial for:
- Visualisation
- Noise reduction
- Improving efficiency
- Handling multicollinearity
Two common methods:
- PCA (Principal Component Analysis) – linear projection maximising variance.
- t-SNE (t-distributed Stochastic Neighbour Embedding) – nonlinear visualisation for complex manifolds.
Principal Component Analysis (PCA)
PCA transforms features into new orthogonal components (linear combinations) ordered by variance.
It’s unsupervised but often used as a preprocessing step before supervised tasks.
Why PCA Is Useful
PCA helps when you have:
- Many features
- Redundant or highly correlated variables
- Noise obscuring patterns
- Algorithms that struggle in high dimensions, e.g. clustering, regression, visualisation
By using PCA, you:
- Simplify the dataset
- Remove collinearity
- Reduce noise
- Improve model efficiency
- Often improve clustering or classification accuracy
- Make it easier to visualise structure, e.g. 2D or 3D plots
How PCA Works
PCA does three main things:
1. Identifies directions of maximum variance
- It finds the axes along which the data spreads out the most.
2. Creates new features (principal components)
- Each component is a weighted combination of the original variables.
- Component 1 → captures the largest amount of variance
- Component 2 → the second largest, orthogonal to Component 1
- And so on…
3. Reduces dimensionality
- Instead of keeping all components, you keep only the first few that explain most of the variation.
For example:
- 50 features → reduce to 5 components but still capture 90–95% of the information.
You can then specify the number of components inside a pipeline, for example:
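A sketch of what such a pipeline could look like (using KMeans as the downstream step is an assumption here):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Scale, reduce to 5 components, then cluster on the reduced data
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=42)),
])
```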
Example:
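A minimal sketch using the Iris dataset (an assumption) reduced to two components:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                      # 4 original features
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```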

Output interpretation:
- The axes (PC1, PC2) are directions of maximum variance.
- Clusters often emerge naturally even without using labels.
Notes:
- PCA requires scaled data (apply StandardScaler first).
- You can check how much variance each component explains (see the snippet below).
- PCA is linear; nonlinear structures may need t-SNE or UMAP.
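A short sketch of checking the explained variance, assuming a PCA fitted on scaled Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())   # total variance retained
```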
t-SNE for Visualisation
t-SNE (t-distributed Stochastic Neighbour Embedding) is useful for visualising complex, high-dimensional data in 2D or 3D.
Example:
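A minimal sketch using the digits dataset (an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional digit images reduced to 2D for visualisation
X = load_digits().data
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

plt.scatter(X_embedded[:, 0], X_embedded[:, 1], s=5)
plt.title("t-SNE projection of the digits dataset")
plt.show()
```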

Notes:
- t-SNE focuses on preserving local structure: similar points stay close.
- It’s non-deterministic; small random changes can alter the output.
- Best used for visualisation, not as input to other models.
Evaluating Unsupervised Models
Evaluating unsupervised models can be tricky because there are no ground-truth labels.
Intrinsic Metrics
These measure internal cohesion and separation between clusters:
- Inertia (for KMeans) - Sum of squared distances of samples to their assigned centroids.
- Silhouette Score - Measures how similar a sample is to its own cluster vs others; closer to 1 = well-separated clusters.
Extrinsic Metrics
When true labels are available (e.g. for benchmarking):
- Adjusted Rand Index (ARI) - Compares clustering results to true labels.
- Normalised Mutual Information (NMI) - Measures shared information between predicted and actual clusters.
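A short sketch of both metrics on hypothetical label vectors, purely for illustration:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical true labels vs cluster assignments
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # same grouping, different label names

print(adjusted_rand_score(y_true, y_pred))           # 1.0: identical partitions
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0: identical partitions
```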
Anomaly Detection (Overview)
Anomaly detection identifies data points that deviate from the overall pattern. It’s used in fraud detection, quality control, and cybersecurity.
Scikit-learn provides:
- IsolationForest
- OneClassSVM
- LocalOutlierFactor
Example:
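A minimal sketch using IsolationForest on synthetic data (the contamination value is an assumption):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(size=(200, 2))                      # typical points
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))    # scattered anomalies
X_all = np.vstack([X_normal, X_outliers])

# contamination = expected proportion of anomalies (assumed here)
iso = IsolationForest(contamination=0.05, random_state=42)
preds = iso.fit_predict(X_all)                            # 1 = normal, -1 = anomaly

print("Number of flagged anomalies:", (preds == -1).sum())
```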
Anomalies are automatically flagged based on sparse density or isolation.
Comparing and Choosing Unsupervised Methods
| Task | Recommended Method | Notes |
| --- | --- | --- |
| Find groups in numeric data | KMeans | Simple, fast, needs k |
| Find irregular shapes / outliers | DBSCAN | Density-based, no k required |
| Reduce high-dimensional data | PCA | Linear projection, scalable |
| Visualise complex data | t-SNE | Nonlinear, good for 2D visualisation |
| Detect anomalies | IsolationForest | Works well on high-dimensional data |
Best Practices
- Always scale your features before clustering or PCA; unscaled features can dominate results.
- Try multiple clustering algorithms; different models capture different patterns.
- Use visualisation tools (PCA, t-SNE) to inspect structure and separability.
- Don’t overinterpret clusters: unsupervised models find structure, not meaning.
- Evaluate with multiple metrics if you have ground-truth labels.