Measures of Dispersion
Maths: Statistics for machine learning
Published Oct 22 2025, updated Oct 23 2025
Measures of Dispersion describe how spread out or varied the data is in a dataset.
While measures of central tendency (like the mean, median, and mode) tell us the centre of the data, measures of dispersion tell us how far the data values move away from that centre.
In other words:
Central tendency shows where the data is centred.
Dispersion shows how much the data is scattered.
Why It’s Important
- Helps us understand data variability — whether data points are close together or widely spread.
- Two datasets can have the same mean but very different spreads.
- In machine learning, it helps detect outliers, measure data stability, and decide scaling or normalisation methods.
The Main Measures of Dispersion
1. Range
- Definition: The difference between the maximum and minimum values in a dataset.
- Formula: Range = Maximum − Minimum
- Example:
- Data: 10, 12, 15, 18, 20
- Range = 20 − 10 = 10
- Notes:
- Very simple to calculate.
- Highly affected by outliers (extremely high or low values).
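Using the example data above, the range is a one-liner in NumPy (a minimal sketch; `np.ptp`, short for "peak to peak", is a built-in shortcut for max minus min):

```python
import numpy as np

data = np.array([10, 12, 15, 18, 20])

print(data.max() - data.min())  # 20 - 10 = 10
print(np.ptp(data))             # "peak to peak": same result in one call
```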
2. Interquartile Range (IQR)
- Definition: The range between the 25th percentile (Q1) and the 75th percentile (Q3).
It shows the spread of the middle 50% of the data.
- Formula: IQR = Q3 − Q1
- Example:
- Data (ordered): 5, 7, 8, 10, 12, 14, 16, 18
- Q1 = 7.5, Q3 = 15 → IQR = 15 − 7.5 = 7.5
- Notes:
- Largely unaffected by outliers, since it uses only the middle 50% of the data.
- Often used in box plots to show data spread.
- Helps identify outliers (values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR).
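A sketch of the same calculation, using the median-of-halves method from the worked example (note that `np.percentile` uses linear interpolation by default and can give slightly different quartiles):

```python
import numpy as np

data = np.sort(np.array([5, 7, 8, 10, 12, 14, 16, 18]))
half = len(data) // 2

q1 = np.median(data[:half])   # median of lower half -> 7.5
q3 = np.median(data[half:])   # median of upper half -> 15.0
iqr = q3 - q1                 # 7.5

# The 1.5 * IQR fences used by box plots to flag outliers:
lower_fence = q1 - 1.5 * iqr  # -3.75; anything below is a suspected outlier
upper_fence = q3 + 1.5 * iqr  # 26.25; anything above is a suspected outlier
outliers = data[(data < lower_fence) | (data > upper_fence)]  # empty here
```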
3. Variance
- Definition: The average of the squared differences between each data point and the mean.
It tells us how far data points deviate from the mean on average.
- Formula: σ² = Σ(xᵢ − μ)² / n
- Example:
- Data: 2, 4, 6
- Mean = 4
- Variance = ((2−4)² + (4−4)² + (6−4)²) / 3 = (4 + 0 + 4)/3 = 2.67
- Notes:
- The result is in squared units, so not directly comparable to the original data.
- Useful for understanding data spread mathematically.
- The formula above is the population variance (divide by n); the sample variance divides by n − 1 instead (see Bessel's correction below).
4. Standard Deviation
- Definition: The square root of variance.
It shows the average amount that data values deviate from the mean, using the same units as the data.
- Formula: σ = √(Σ(xᵢ − μ)² / n) = √variance
- Example:
- Using the previous data → √2.67 = 1.63
- Notes:
- Most commonly used measure of dispersion.
- Larger standard deviation → more spread out data.
- Smaller standard deviation → values closer to the mean.
5. Coefficient of Variation (CV)
- Definition: The ratio of the standard deviation to the mean, expressed as a percentage.
It allows you to compare variability between datasets with different units or scales.
- Formula: CV = (SD / Mean) × 100%
- Example:
- If mean = 50, SD = 5 → CV = (5 / 50) × 100% = 10%
- Notes:
- Useful for comparing relative variability.
- Commonly used in finance and risk analysis.
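A minimal sketch of the CV (the helper name `cv` is mine, and it uses the population SD). Because the scale cancels out, two datasets that differ only by a constant factor get the same CV:

```python
import numpy as np

def cv(values):
    """Coefficient of variation as a percentage (population SD / mean)."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean() * 100

# The article's example, from summary statistics: mean = 50, SD = 5
print(5 / 50 * 100)  # 10.0 (%)

# From raw data: two datasets on different scales, same relative spread
print(cv([48, 50, 52]))        # small CV: values hug the mean
print(cv([4800, 5000, 5200]))  # identical CV: the scale cancels out
```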
Summary
| Measure | Definition | Formula | Sensitive to Outliers? |
| --- | --- | --- | --- |
| Range | Difference between max and min | Max − Min | Yes |
| IQR | Middle 50% of data | Q3 − Q1 | No |
| Variance | Average of squared deviations | Σ(x−x̄)² / n | Yes |
| Standard Deviation | Average deviation from mean | √Variance | Yes |
| Coefficient of Variation | SD relative to mean | (SD / Mean) × 100 | Yes |
In Machine Learning
- High dispersion → data varies widely, may indicate outliers or noisy data.
- Low dispersion → consistent data, easier for models to learn patterns.
- Used in:
- Feature scaling and normalisation
- Outlier detection
- Feature selection (variance thresholding)
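Variance thresholding can be sketched in plain NumPy (scikit-learn's `VarianceThreshold` does the same job; the matrix and the 0.01 threshold here are arbitrary illustrations):

```python
import numpy as np

# Hypothetical feature matrix: column 1 is constant, so it carries no signal
X = np.array([
    [1.0, 7.0, 0.2],
    [2.0, 7.0, 0.9],
    [3.0, 7.0, 0.4],
    [4.0, 7.0, 0.7],
])

variances = X.var(axis=0)  # per-feature population variance
keep = variances > 0.01    # 0.01 is an arbitrary threshold choice
X_reduced = X[:, keep]     # the zero-variance feature is dropped
print(X_reduced.shape)     # (4, 2)
```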
Bessel’s correction for sample variance
When calculating variance for a population, we divide by n, the total number of items in the population.
When you have data for the entire population, you already know every value.
So when calculating variance, you can use the true mean (μ) and just find how far each value is from that mean.
Because you’re using the actual mean of the population, you don’t need to “adjust” for anything.
Your variance represents the true spread of all values — no bias to correct.
When you only have a sample (a small subset of the population), you don’t know the true mean (μ).
You only have the sample mean (x̄) — an estimate of μ.
That small difference causes a subtle bias:
- The sample mean tends to be closer to the sample data points than the true population mean would be.
- This makes the average squared distances (the variance) a little too small.
So to correct that bias, we divide by (n − 1) instead of n.
This correction is called Bessel’s correction.
In summary:
- Using n−1 gives a better estimate of the true population variance when you only have sample data.
- If you divided by n, your sample variance would consistently underestimate how variable the population really is.
Example:
Say the true population values are 2, 4, 6, 8:
- Population mean (μ) = 5
- Population variance (divide by n = 4):
- ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4 = (9 + 1 + 1 + 9) / 4 = 5
Now take a sample of three of those values, say 2, 4, 6:
- Sample mean (x̄) = 4
- If we divide by n = 3: variance = (4 + 0 + 4) / 3 ≈ 2.67
- If we divide by n − 1 = 2: variance = (4 + 0 + 4) / 2 = 4.0
The n−1 version (4.0) is closer to the true population variance (5).
That’s why we use n−1 — it gives a better, unbiased estimate.
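The bias is easy to see in a quick simulation (a sketch; the seed, population size, and sample size are all arbitrary). Averaged over many small samples, dividing by n lands consistently below the true variance, while dividing by n − 1 lands close to it:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=0.0, scale=1.0, size=100_000)
true_var = population.var()  # population variance (divide by n)

# Draw many small samples and average each variance estimate
samples = rng.choice(population, size=(20_000, 5))
divide_by_n = samples.var(axis=1).mean()                  # biased low
divide_by_n_minus_1 = samples.var(axis=1, ddof=1).mean()  # near true_var
```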
Calculating in Python
NumPy example:
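A minimal sketch using the variance example data from earlier; note that `np.var` and `np.std` default to the population formulas (ddof=0):

```python
import numpy as np

data = np.array([2, 4, 6])

print(np.mean(data))  # 4.0
print(np.var(data))   # population variance (ddof=0): 8/3 ≈ 2.67
print(np.std(data))   # population standard deviation: ≈ 1.63
```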
Pandas example:
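The same data in a pandas Series (a sketch; pandas defaults to the sample formulas, ddof=1):

```python
import pandas as pd

s = pd.Series([2, 4, 6])

print(s.var())        # sample variance (ddof=1): 8/2 = 4.0
print(s.std())        # sample standard deviation: 2.0
print(s.var(ddof=0))  # population variance, matching NumPy: ≈ 2.67
```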
Population and Sample Variance:
NumPy defaults to population variance (divide by n), while Pandas defaults to sample variance (divide by n − 1). Both accept a `ddof` argument to switch between the two.
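The differing defaults side by side (a sketch; `ddof` is the "delta degrees of freedom" subtracted from n in the denominator):

```python
import numpy as np
import pandas as pd

data = [2, 4, 6]

# NumPy: ddof=0 by default (population); pass ddof=1 for sample variance
print(np.var(data))          # 8/3 ≈ 2.67
print(np.var(data, ddof=1))  # 8/2 = 4.0

# Pandas: ddof=1 by default (sample); pass ddof=0 for population variance
print(pd.Series(data).var())        # 4.0
print(pd.Series(data).var(ddof=0))  # ≈ 2.67
```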














