ANOVA (Analysis of Variance)
ANOVA (short for Analysis of Variance) is a parametric statistical test used to determine whether there are significant differences between the means of three or more independent groups.
It compares the variance between groups to the variance within groups — if the between-group variance is much larger, it suggests at least one group’s mean is different.
In simple terms:
“ANOVA tests whether the average values across multiple groups are all the same, or if at least one is significantly different.”
When to Use It
- Comparing 3 or more groups - E.g., three teaching methods, product versions, or model types
- Continuous dependent variable - E.g., scores, revenue, accuracy
- Categorical independent variable(s) - E.g., group, gender, treatment type
- Normal distribution and equal variances - Assumptions of ANOVA
When Not to Use It
- Non-normal data - Use Kruskal–Wallis test instead
- Paired/repeated samples - Use Repeated-Measures ANOVA or Friedman test
Types of ANOVA
- One-Way ANOVA - One independent variable (factor), e.g. compare test scores across 3 teaching methods
- Two-Way ANOVA - Two independent variables (factors), e.g. compare test scores across methods and genders
- Repeated-Measures ANOVA - Same subjects tested under multiple conditions, e.g. compare model accuracy before/after feature tuning (see the sketch below)
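For the repeated-measures case, statsmodels provides AnovaRM. Below is a minimal sketch under assumed data: the subject IDs and the before/after accuracy values are hypothetical.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical repeated-measures data: each subject (e.g. a model run)
# is measured under two conditions: before and after feature tuning.
data = pd.DataFrame({
    "subject":   [1, 2, 3, 4, 5] * 2,
    "condition": ["before"] * 5 + ["after"] * 5,
    "accuracy":  [0.81, 0.79, 0.83, 0.80, 0.82,   # before tuning
                  0.86, 0.84, 0.88, 0.85, 0.87],  # after tuning
})

# Repeated-Measures ANOVA with one within-subject factor ("condition")
result = AnovaRM(data, depvar="accuracy", subject="subject",
                 within=["condition"]).fit()
print(result)
```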
Hypotheses
- H₀ (Null Hypothesis) - All group means are equal (no difference)
- H₁ (Alternative Hypothesis) - At least one group mean is different
How It Works
ANOVA partitions the total variation in the data into:
- Between-group variation (SSB) → how much group means differ from the overall mean
- Within-group variation (SSW) → how much individual observations differ within each group
It then compares the ratio of these two as the F-statistic:

F = MSB / MSW

Where:
- MSB = SSB / (k − 1)
- MSW = SSW / (N − k)
- k = number of groups
- N = total number of observations
If the F-ratio is large → between-group variance greatly exceeds within-group variance → the group means likely differ. The F-ratio is compared against the F-distribution with (k − 1, N − k) degrees of freedom to obtain a p-value; conventionally, p < 0.05 indicates a significant difference.
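To make the partitioning concrete, here is a minimal sketch that computes SSB, SSW, and the F-ratio directly; the three groups of scores are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical observations for three groups
groups = [np.array([0.85, 0.88, 0.86, 0.87, 0.84]),
          np.array([0.78, 0.80, 0.79, 0.81, 0.77]),
          np.array([0.90, 0.92, 0.91, 0.89, 0.93])]

k = len(groups)                          # number of groups
N = sum(len(g) for g in groups)          # total observations
grand_mean = np.concatenate(groups).mean()

# Between-group variation: how far each group mean sits from the grand mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: spread of observations around their own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

msb = ssb / (k - 1)
msw = ssw / (N - k)
f_ratio = msb / msw

# p-value from the F-distribution with (k - 1, N - k) degrees of freedom
p_value = stats.f.sf(f_ratio, k - 1, N - k)
print(f"F = {f_ratio:.2f}, p = {p_value:.4g}")
```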
Example: Comparing Model Performance
You trained three machine learning models and recorded their accuracy scores.
You want to know if their mean accuracies differ significantly.
Python Example
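A minimal sketch using scipy.stats.f_oneway; the accuracy scores for the three models are hypothetical.

```python
from scipy import stats

# Hypothetical accuracy scores from repeated runs of three models
model_a = [0.85, 0.88, 0.86, 0.87, 0.84]
model_b = [0.78, 0.80, 0.79, 0.81, 0.77]
model_c = [0.90, 0.92, 0.91, 0.89, 0.93]

# One-Way ANOVA across the three groups
f_stat, p_value = stats.f_oneway(model_a, model_b, model_c)
print(f"F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.4g}")

if p_value < 0.05:
    print("Reject H0: at least one model's mean accuracy differs.")
else:
    print("Fail to reject H0: no significant difference detected.")
```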
Example Output: with the hypothetical scores above, this prints an F-statistic of roughly 72.7 and a p-value around 2e-07, far below 0.05.
Interpretation:
- There is a significant difference in average accuracy between the models.
- However, ANOVA doesn’t tell which models differ — only that a difference exists. A post-hoc test such as Tukey’s HSD is needed to identify the specific pairs.
Visualisation:

Each box shows the distribution of accuracy scores per model. If the boxes don’t overlap much, ANOVA will likely find a significant difference.
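A minimal sketch of that boxplot, reusing the hypothetical scores from the example above:

```python
import matplotlib.pyplot as plt

# Hypothetical accuracy scores (same as the ANOVA example above)
model_a = [0.85, 0.88, 0.86, 0.87, 0.84]
model_b = [0.78, 0.80, 0.79, 0.81, 0.77]
model_c = [0.90, 0.92, 0.91, 0.89, 0.93]

# One box per model: little overlap suggests ANOVA will find a difference
plt.boxplot([model_a, model_b, model_c])
plt.xticks([1, 2, 3], ["Model A", "Model B", "Model C"])
plt.ylabel("Accuracy")
plt.title("Accuracy distribution per model")
plt.show()
```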
Assumptions of ANOVA
| Assumption | Description | How to Check |
|---|---|---|
| 1. Independence | Observations are independent | Experimental design |
| 2. Normality | Data in each group are normally distributed | Shapiro–Wilk test |
| 3. Homogeneity of variance | Variances are equal across groups | Levene’s test |
If assumptions are violated:
- Use Kruskal–Wallis (non-parametric ANOVA)
- Or transform the data (log, sqrt)
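A minimal sketch of these checks with SciPy, falling back to Kruskal–Wallis when either assumption fails; the groups are the hypothetical scores from earlier.

```python
from scipy import stats

groups = [[0.85, 0.88, 0.86, 0.87, 0.84],
          [0.78, 0.80, 0.79, 0.81, 0.77],
          [0.90, 0.92, 0.91, 0.89, 0.93]]

# Normality: Shapiro-Wilk on each group (H0: data are normal)
normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)

# Homogeneity of variance: Levene's test (H0: variances are equal)
equal_var = stats.levene(*groups).pvalue > 0.05

if normal and equal_var:
    stat, p = stats.f_oneway(*groups)
    print(f"One-way ANOVA: F = {stat:.2f}, p = {p:.4g}")
else:
    # Non-parametric fallback when the assumptions are violated
    stat, p = stats.kruskal(*groups)
    print(f"Kruskal-Wallis: H = {stat:.2f}, p = {p:.4g}")
```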
Two-Way ANOVA (Factorial ANOVA)
The Two-Way ANOVA extends the One-Way ANOVA by including two independent variables (factors) instead of one.
It tests:
- Whether each factor individually affects the mean of the dependent variable, and
- Whether there’s an interaction effect between the two factors — i.e., if the effect of one depends on the level of the other.
In simple terms:
“Two-Way ANOVA tests if two categorical variables (and their combination) significantly influence a continuous outcome.”
When to Use It
- Two categorical independent variables - e.g., Model Type and Dataset Size
- One continuous dependent variable - e.g., Accuracy, Revenue, Test Score
- Data are normally distributed - (within each group)
- Variances are equal - Across all combinations of factors
When Not to Use It
- Violated assumptions - Use Friedman or Scheirer–Ray–Hare test instead
Example Scenario
You test three ML models (A, B, C) on two dataset sizes (Small, Large) and record the accuracy scores.
You want to know:
- Does the model type matter?
- Does dataset size matter?
- Is there an interaction effect (does model performance depend on dataset size)?
Example in Python
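A minimal sketch using statsmodels (OLS plus anova_lm); the accuracy values for each model/dataset-size combination are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical accuracies: 3 models x 2 dataset sizes, 3 runs per cell
data = pd.DataFrame({
    "model":    ["A", "B", "C"] * 6,
    "size":     (["Small"] * 3 + ["Large"] * 3) * 3,
    "accuracy": [0.80, 0.78, 0.85, 0.88, 0.82, 0.91,
                 0.81, 0.77, 0.84, 0.87, 0.83, 0.90,
                 0.79, 0.79, 0.86, 0.89, 0.81, 0.92],
})

# Two-Way ANOVA with interaction: accuracy ~ model + size + model:size
lm = ols("accuracy ~ C(model) * C(size)", data=data).fit()
anova_table = sm.stats.anova_lm(lm, typ=2)
print(anova_table)
```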
Example Output: an ANOVA table with an F-statistic and p-value for the model factor, the dataset-size factor, and their interaction.
Interpretation (suppose the table reports the following p-values):
- Model type (p < 0.001) → significant difference between model means
- Dataset size (p < 0.001) → dataset size significantly affects accuracy
- Interaction (p = 0.07) → no significant interaction (i.e., model ranking is similar across sizes)
Visualisation:

Interpretation of Plot:
- If lines between “Small” and “Large” datasets are parallel, there’s no interaction.
- If lines cross or diverge, that suggests an interaction — some models perform better on specific dataset sizes.
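A minimal sketch of such an interaction plot; the per-cell mean accuracies are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical mean accuracy per (model, dataset size) combination
means = pd.DataFrame({
    "Small": [0.80, 0.78, 0.85],
    "Large": [0.88, 0.82, 0.91],
}, index=["Model A", "Model B", "Model C"])

# One line per model across dataset sizes:
# roughly parallel lines suggest no interaction; crossing lines suggest one
for name, row in means.iterrows():
    plt.plot(["Small", "Large"], row.values, marker="o", label=name)

plt.xlabel("Dataset size")
plt.ylabel("Mean accuracy")
plt.title("Model x dataset-size interaction")
plt.legend()
plt.show()
```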