ANOVA (Analysis of Variance)
ANOVA (short for Analysis of Variance) is a parametric statistical test used to determine whether there are significant differences between the means of three or more independent groups.
It compares the variance between groups to the variance within groups — if the between-group variance is much larger, it suggests at least one group’s mean is different.
In simple terms:
“ANOVA tests whether the average values across multiple groups are all the same, or if at least one is significantly different.”
When to Use It
- Comparing 3 or more groups - E.g., three teaching methods, product versions, or model types
- Continuous dependent variable - E.g., scores, revenue, accuracy
- Categorical independent variable(s) - E.g., group, gender, treatment type
- Normal distribution and equal variances - Assumptions of ANOVA
When Not to Use It
- Non-normal data - Use Kruskal–Wallis test instead
- Paired/repeated samples - Use Repeated-Measures ANOVA or Friedman test
Types of ANOVA
- One-Way ANOVA - One independent variable (factor), e.g. compare test scores across 3 teaching methods
- Two-Way ANOVA - Two independent variables (factors), e.g. compare test scores across methods and genders
- Repeated-Measures ANOVA - Same subjects tested under multiple conditions, e.g. compare model accuracy before/after feature tuning (see the sketch below)
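For the repeated-measures case, statsmodels provides AnovaRM. Below is a minimal sketch under assumed data: the subject IDs and the before/after accuracy values are hypothetical.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical repeated-measures data: each subject (e.g. a model run)
# is measured under two conditions: before and after feature tuning.
data = pd.DataFrame({
    "subject":   [1, 2, 3, 4, 5] * 2,
    "condition": ["before"] * 5 + ["after"] * 5,
    "accuracy":  [0.81, 0.79, 0.83, 0.80, 0.82,   # before tuning
                  0.86, 0.84, 0.88, 0.85, 0.87],  # after tuning
})

# Repeated-Measures ANOVA with one within-subject factor ("condition")
result = AnovaRM(data, depvar="accuracy", subject="subject",
                 within=["condition"]).fit()
print(result)
```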
Hypotheses
- H₀ (Null Hypothesis) - All group means are equal (no difference)
- H₁ (Alternative Hypothesis) - At least one group mean is different
How It Works
ANOVA partitions the total variation in the data into:
- Between-group variation (SSB) → how much group means differ from the overall mean
- Within-group variation (SSW) → how much individual observations differ within each group
It then compares the ratio of these two as the F-statistic:

F = MSB / MSW

Where:
- MSB = SSB / (k − 1)
- MSW = SSW / (N − k)
- k = number of groups
- N = total number of observations
If the F-ratio is large → between-group variance greatly exceeds within-group variance → the group means likely differ. The F-ratio is compared against the F-distribution with (k − 1, N − k) degrees of freedom to obtain a p-value; conventionally, p < 0.05 indicates a significant difference.
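To make the partitioning concrete, here is a minimal sketch that computes SSB, SSW, and the F-ratio directly; the three groups of scores are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical observations for three groups
groups = [np.array([0.85, 0.88, 0.86, 0.87, 0.84]),
          np.array([0.78, 0.80, 0.79, 0.81, 0.77]),
          np.array([0.90, 0.92, 0.91, 0.89, 0.93])]

k = len(groups)                          # number of groups
N = sum(len(g) for g in groups)          # total observations
grand_mean = np.concatenate(groups).mean()

# Between-group variation: how far each group mean sits from the grand mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: spread of observations around their own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

msb = ssb / (k - 1)
msw = ssw / (N - k)
f_ratio = msb / msw

# p-value from the F-distribution with (k - 1, N - k) degrees of freedom
p_value = stats.f.sf(f_ratio, k - 1, N - k)
print(f"F = {f_ratio:.2f}, p = {p_value:.4g}")
```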
Example: Comparing Model Performance
You trained three machine learning models and recorded their accuracy scores.
You want to know if their mean accuracies differ significantly.
Python Example
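A minimal sketch using scipy.stats.f_oneway; the accuracy scores for the three models are hypothetical.

```python
from scipy import stats

# Hypothetical accuracy scores from repeated runs of three models
model_a = [0.85, 0.88, 0.86, 0.87, 0.84]
model_b = [0.78, 0.80, 0.79, 0.81, 0.77]
model_c = [0.90, 0.92, 0.91, 0.89, 0.93]

# One-Way ANOVA across the three groups
f_stat, p_value = stats.f_oneway(model_a, model_b, model_c)
print(f"F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.4g}")

if p_value < 0.05:
    print("Reject H0: at least one model's mean accuracy differs.")
else:
    print("Fail to reject H0: no significant difference detected.")
```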
Example Output: with the hypothetical scores above, this prints an F-statistic of roughly 72.7 and a p-value around 2e-07, far below 0.05.
Interpretation:
- There is a significant difference in average accuracy between the models.
- However, ANOVA doesn’t tell which models differ — only that a difference exists. A post-hoc test such as Tukey’s HSD is needed to identify the specific pairs.
Visualisation:

Each box shows the distribution of accuracy scores per model. If the boxes don’t overlap much, ANOVA will likely find a significant difference.
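A minimal sketch of that boxplot, reusing the hypothetical scores from the example above:

```python
import matplotlib.pyplot as plt

# Hypothetical accuracy scores (same as the ANOVA example above)
model_a = [0.85, 0.88, 0.86, 0.87, 0.84]
model_b = [0.78, 0.80, 0.79, 0.81, 0.77]
model_c = [0.90, 0.92, 0.91, 0.89, 0.93]

# One box per model: little overlap suggests ANOVA will find a difference
plt.boxplot([model_a, model_b, model_c])
plt.xticks([1, 2, 3], ["Model A", "Model B", "Model C"])
plt.ylabel("Accuracy")
plt.title("Accuracy distribution per model")
plt.show()
```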
Assumptions of ANOVA
| Assumption | Description | How to Check |
|---|---|---|
| 1. Independence | Observations are independent | Experimental design |
| 2. Normality | Data in each group are normally distributed | Shapiro–Wilk test |
| 3. Homogeneity of variance | Variances are equal across groups | Levene’s test |
If assumptions are violated:
- Use Kruskal–Wallis (non-parametric ANOVA)
- Or transform the data (log, sqrt)
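A minimal sketch of these checks with SciPy, falling back to Kruskal–Wallis when either assumption fails; the groups are the hypothetical scores from earlier.

```python
from scipy import stats

groups = [[0.85, 0.88, 0.86, 0.87, 0.84],
          [0.78, 0.80, 0.79, 0.81, 0.77],
          [0.90, 0.92, 0.91, 0.89, 0.93]]

# Normality: Shapiro-Wilk on each group (H0: data are normal)
normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)

# Homogeneity of variance: Levene's test (H0: variances are equal)
equal_var = stats.levene(*groups).pvalue > 0.05

if normal and equal_var:
    stat, p = stats.f_oneway(*groups)
    print(f"One-way ANOVA: F = {stat:.2f}, p = {p:.4g}")
else:
    # Non-parametric fallback when the assumptions are violated
    stat, p = stats.kruskal(*groups)
    print(f"Kruskal-Wallis: H = {stat:.2f}, p = {p:.4g}")
```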
Two-Way ANOVA (Factorial ANOVA)
The Two-Way ANOVA extends the One-Way ANOVA by including two independent variables (factors) instead of one.
It tests:
- Whether each factor individually affects the mean of the dependent variable, and
- Whether there’s an interaction effect between the two factors — i.e., if the effect of one depends on the level of the other.
In simple terms:
“Two-Way ANOVA tests if two categorical variables (and their combination) significantly influence a continuous outcome.”
When to Use It
- Two categorical independent variables - e.g., Model Type and Dataset Size
- One continuous dependent variable - e.g., Accuracy, Revenue, Test Score
- Data are normally distributed - (within each group)
- Variances are equal - Across all combinations of factors
When Not to Use It
- Violated assumptions - Use Friedman or Scheirer–Ray–Hare test instead
Example Scenario
You test three ML models (A, B, C) on two dataset sizes (Small, Large) and record the accuracy scores.
You want to know:
- Does the model type matter?
- Does dataset size matter?
- Is there an interaction effect (does model performance depend on dataset size)?
Example in Python
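A minimal sketch using statsmodels (OLS plus anova_lm); the accuracy values for each model/dataset-size combination are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical accuracies: 3 models x 2 dataset sizes, 3 runs per cell
data = pd.DataFrame({
    "model":    ["A", "B", "C"] * 6,
    "size":     (["Small"] * 3 + ["Large"] * 3) * 3,
    "accuracy": [0.80, 0.78, 0.85, 0.88, 0.82, 0.91,
                 0.81, 0.77, 0.84, 0.87, 0.83, 0.90,
                 0.79, 0.79, 0.86, 0.89, 0.81, 0.92],
})

# Two-Way ANOVA with interaction: accuracy ~ model + size + model:size
lm = ols("accuracy ~ C(model) * C(size)", data=data).fit()
anova_table = sm.stats.anova_lm(lm, typ=2)
print(anova_table)
```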
Example Output: an ANOVA table with an F-statistic and p-value for the model factor, the dataset-size factor, and their interaction.
Interpretation (suppose the table reports the following p-values):
- Model type (p < 0.001) → significant difference between model means
- Dataset size (p < 0.001) → dataset size significantly affects accuracy
- Interaction (p = 0.07) → no significant interaction (i.e., model ranking is similar across sizes)
Visualisation:

Interpretation of Plot:
- If lines between “Small” and “Large” datasets are parallel, there’s no interaction.
- If lines cross or diverge, that suggests an interaction — some models perform better on specific dataset sizes.
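A minimal sketch of such an interaction plot; the per-cell mean accuracies are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical mean accuracy per (model, dataset size) combination
means = pd.DataFrame({
    "Small": [0.80, 0.78, 0.85],
    "Large": [0.88, 0.82, 0.91],
}, index=["Model A", "Model B", "Model C"])

# One line per model across dataset sizes:
# roughly parallel lines suggest no interaction; crossing lines suggest one
for name, row in means.iterrows():
    plt.plot(["Small", "Large"], row.values, marker="o", label=name)

plt.xlabel("Dataset size")
plt.ylabel("Mean accuracy")
plt.title("Model x dataset-size interaction")
plt.legend()
plt.show()
```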