Measures of Dispersion

Maths: Statistics for machine learning

5 min read

Published Oct 22 2025, updated Oct 23 2025



Machine Learning · Maths · NumPy · Pandas · Python · Statistics

Measures of Dispersion describe how spread out or varied the data is in a dataset.

While measures of central tendency (like the mean, median, and mode) tell us the centre of the data, measures of dispersion tell us how far the data values move away from that centre.


In other words —

Central tendency shows where the data is centred.
Dispersion shows how much the data is scattered.




Why It’s Important

  • Helps us understand data variability — whether data points are close together or widely spread.
  • Two datasets can have the same mean but very different spreads.
  • In machine learning, it helps detect outliers, measure data stability, and decide scaling or normalisation methods.
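
To make the second point concrete, here is a tiny NumPy sketch (the numbers are invented): two datasets share a mean of 50 but have very different spreads.

import numpy as np

a = np.array([48, 49, 50, 51, 52])   # tightly clustered around 50
b = np.array([10, 30, 50, 70, 90])   # widely spread around 50

print(np.mean(a), np.mean(b))        # both means are 50.0
print(np.std(a), np.std(b))          # ~1.41 vs ~28.28, very different spreads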



The Main Measures of Dispersion

1. Range

  • Definition: The difference between the maximum and minimum values in a dataset.
  • Formula:
    Range = Maximum − Minimum
  • Example:
    • Data: 10, 12, 15, 18, 20
    • Range = 20 − 10 = 10
  • Notes:
    • Very simple to calculate.
    • Highly affected by outliers (extremely high or low values).
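
A quick sketch of how a single extreme value inflates the range (the added value 95 is invented for illustration):

data = [10, 12, 15, 18, 20]
data_with_outlier = data + [95]   # one extreme value added

print(max(data) - min(data))                            # 10
print(max(data_with_outlier) - min(data_with_outlier))  # 85, the range jumps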

2. Interquartile Range (IQR)

  • Definition: The range between the 25th percentile (Q1) and the 75th percentile (Q3).
    It shows the spread of the middle 50% of the data.
  • Formula:
    IQR = Q3 − Q1
  • Example:
    • Data (ordered): 5, 7, 8, 10, 12, 14, 16, 18
    • Q1 = 7.5, Q3 = 15 → IQR = 15 − 7.5 = 7.5
  • Notes:
    • Not affected by outliers.
    • Often used in box plots to show data spread.
    • Helps identify outliers (values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR).
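
As a quick illustration of the 1.5 × IQR rule above, here is a minimal NumPy sketch; the data values (including the obvious outlier 45) are made up for the example.

import numpy as np

data = np.array([5, 7, 8, 10, 12, 14, 16, 18, 45])   # 45 is an obvious outlier

# np.percentile interpolates, so quartiles can differ slightly from hand-calculated ones
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Anything outside these fences is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("Outliers:", outliers)   # [45]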

3. Variance

  • Definition: The average of the squared differences between each data point and the mean.
    It tells us how far data points deviate from the mean on average.
  • Formula:
    Variance (σ²) = Σ(x − μ)² / n, where μ is the mean and n is the number of values
  • Example:
    • Data: 2, 4, 6
    • Mean = 4
    • Variance = ((2−4)² + (4−4)² + (6−4)²) / 3 = (4 + 0 + 4)/3 = 2.67
  • Notes:
    • The result is in squared units, so not directly comparable to the original data.
    • Useful for understanding data spread mathematically.
    • The formula above is the population variance (divided by n); the sample variance divides by n − 1 instead (see Bessel’s correction below).
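
A quick plain-Python check of the worked example above, showing both the divide-by-n and divide-by-(n − 1) versions:

data = [2, 4, 6]
mean = sum(data) / len(data)                                        # 4.0

pop_var = sum((x - mean) ** 2 for x in data) / len(data)            # divide by n -> 2.67
sample_var = sum((x - mean) ** 2 for x in data) / (len(data) - 1)   # divide by n-1 -> 4.0

print(round(pop_var, 2), round(sample_var, 2))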

4. Standard Deviation

  • Definition: The square root of variance.
    It shows the average amount that data values deviate from the mean, using the same units as the data.
  • Formula:
    Standard Deviation (σ) = √Variance
  • Example:
    • Using the previous data → √2.67 = 1.63
  • Notes:
    • Most commonly used measure of dispersion.
    • Larger standard deviation → more spread out data.
    • Smaller standard deviation → values closer to the mean.

5. Coefficient of Variation (CV)

  • Definition: The ratio of the standard deviation to the mean, expressed as a percentage.
    It allows you to compare variability between datasets with different units or scales.
  • Formula:
    CV = (Standard Deviation / Mean) × 100%
  • Example:
    • If mean = 50, SD = 5 → CV = (5 / 50) × 100% = 10%
  • Notes:
    • Useful for comparing relative variability.
    • Commonly used in finance and risk analysis.
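
A short sketch of why CV is useful for comparing different scales; the height and weight figures are invented for illustration.

import numpy as np

heights_cm = np.array([160, 165, 170, 175, 180])   # centimetres
weights_kg = np.array([55, 62, 70, 78, 85])        # kilograms

# The standard deviations are in different units, so they can't be compared directly,
# but CV expresses each spread relative to its own mean.
cv_height = np.std(heights_cm) / np.mean(heights_cm) * 100
cv_weight = np.std(weights_kg) / np.mean(weights_kg) * 100

print(f"CV height: {cv_height:.1f}%")   # ~4.2%, smaller relative spread
print(f"CV weight: {cv_weight:.1f}%")   # ~15.4%, larger relative spread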




Summary

Measure | Definition | Formula | Sensitive to Outliers?
Range | Difference between max and min | Max − Min | Yes
IQR | Middle 50% of data | Q3 − Q1 | No
Variance | Average of squared deviations | Σ(x − μ)² / n | Yes
Standard Deviation | Average deviation from mean | √Variance | Yes
Coefficient of Variation | SD relative to mean | (SD / Mean) × 100% | Yes





In Machine Learning

  • High dispersion → data varies widely, may indicate outliers or noisy data.
  • Low dispersion → consistent data, easier for models to learn patterns.
  • Used in:
    • Feature scaling and normalisation
    • Outlier detection
    • Feature selection (variance thresholding)
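
For example, variance thresholding drops features whose values barely change. A minimal sketch using scikit-learn's VarianceThreshold (scikit-learn is not used elsewhere in this guide, and the feature matrix is made up):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three features: the middle column is almost constant (very low variance)
X = np.array([
    [1.0, 0.0, 10.0],
    [2.0, 0.0, 20.0],
    [3.0, 0.1, 30.0],
    [4.0, 0.0, 40.0],
])

# Keep only features whose variance exceeds the threshold
selector = VarianceThreshold(threshold=0.5)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)        # (4, 2), the low-variance column is removed
print(selector.variances_)    # per-feature variances used for the decision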




Bessel’s correction for sample variance

When calculating variance for a population, we divide by n, the total number of items in the population.


When you have data for the entire population, you already know every value.
So when calculating variance, you can use the true mean (μ) and just find how far each value is from that mean.

Because you’re using the actual mean of the population, you don’t need to “adjust” for anything.
Your variance represents the true spread of all values — no bias to correct.


When you only have a sample (a small subset of the population), you don’t know the true mean (μ).
You only have the sample mean (x̄) — an estimate of μ.

That small difference causes a subtle bias:

  • The sample mean tends to be closer to the sample data points than the true population mean would be.
  • This makes the average squared distances (the variance) a little too small.

So to correct that bias, we divide by (n − 1) instead of n.
This correction is called Bessel’s correction.


In summary:

  • Using n−1 gives a better estimate of the true population variance when you only have sample data.
  • If you divided by n, your sample variance would consistently underestimate how variable the population really is.

Example:

Say the true population values are:

[2, 4, 6, 8]

  • Population mean (μ) = 5
  • Population variance (divide by n=4):
    • ((2 − 5)² + (4 − 5)² + (6 − 5)² + (8 − 5)²) / 4 = 5

Now take a sample:

[2, 4, 6]

  • Sample mean (x̄) = 4
  • If we divide by n=3, variance = 2.67
  • If we divide by n−1=2, variance = 4.0

The n−1 version (4.0) is closer to the true population variance (5).
That’s why we use n−1 — it gives a better, unbiased estimate.
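
A small simulation sketch (not part of the original example) that repeatedly draws random samples from the values above and compares the two estimators:

import numpy as np

rng = np.random.default_rng(0)
population = np.array([2, 4, 6, 8])
print(np.var(population))                    # true population variance: 5.0

biased, unbiased = [], []
for _ in range(50_000):
    sample = rng.choice(population, size=3, replace=True)   # draw 3 values with replacement
    biased.append(np.var(sample))            # divide by n
    unbiased.append(np.var(sample, ddof=1))  # divide by n-1 (Bessel's correction)

print(round(np.mean(biased), 2))             # noticeably below 5
print(round(np.mean(unbiased), 2))           # close to 5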







Calculating in Python

NumPy example:

import numpy as np

data = [10, 12, 15, 18, 20]

# Range
range_val = max(data) - min(data)

# IQR
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Variance - by default NumPy uses population variance (n)
variance = np.var(data)

# SD
std_dev = np.std(data)

# Coefficient of Variation
cv = (std_dev / np.mean(data)) * 100

print("Range:", range_val)
print("IQR:", iqr)
print("Variance:", variance)
print("Standard Deviation:", std_dev)
print("Coefficient of Variation:", cv)

Pandas example:

import pandas as pd

# Create a small dataset
data = {'Scores': [55, 60, 65, 70, 75, 80, 85, 90, 95]}
df = pd.DataFrame(data)

# Range
range_val = df['Scores'].max() - df['Scores'].min()

# Interquartile Range (IQR)
q1 = df['Scores'].quantile(0.25)
q3 = df['Scores'].quantile(0.75)
iqr = q3 - q1

# Variance - by default Pandas uses sample variance (n-1)
variance = df['Scores'].var()

# Standard Deviation
std_dev = df['Scores'].std()

# Coefficient of Variation (CV)
cv = (df['Scores'].std() / df['Scores'].mean()) * 100

print("Range:", range_val)
print("IQR:", iqr)
print("Variance:", variance)
print("Standard Deviation:", std_dev)
print("Coefficient of Variation:", cv)

Population and Sample Variance:

# numpy population variance and standard deviation
# uses population (n) by default
var_pop_np = np.var(data)
std_dev_pop_np = np.std(data)

# numpy sample variance and standard deviation
# have to set ddof=1 for sample
var_sample_np = np.var(data, ddof=1)
std_dev_sample_np = np.std(data, ddof=1)

# pandas population variance and standard deviation
# have to set ddof=0 for population
var_pop_pd = df['column_name'].var(ddof=0)
std_dev_pop_pd = df['column_name'].std(ddof=0)

# pandas sample variance and standard deviation
# uses sample (n-1) by default
var_sample_pd = df['column_name'].var()
std_dev_sample_pd = df['column_name'].std()

# pandas df.describe() always uses sample (n-1) statistics; you can't switch it to population (n) mode
df.describe()

NumPy defaults to population variance (n), while Pandas defaults to sample variance (n − 1).
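
The standard library's statistics module makes the same distinction explicit, which can be handy without NumPy or Pandas; here it is applied to the worked example data from earlier:

import statistics

data = [2, 4, 6]

print(statistics.pvariance(data))   # population variance (divide by n)   -> 2.666...
print(statistics.variance(data))    # sample variance (divide by n-1)     -> 4.0
print(statistics.pstdev(data))      # population standard deviation
print(statistics.stdev(data))       # sample standard deviation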

