Statistical Testing and Hypothesis Testing
Maths: Statistics for machine learning
Published Oct 22 2025, updated Oct 23 2025
Statistical testing (or hypothesis testing) is a structured method used to make decisions or draw conclusions about a population based on data from a sample.
It helps answer questions like:
“Is there really a difference between two groups, or could the difference be due to random chance?”
In simple terms:
“We test whether the data provide enough evidence to support a claim.”
Key Idea
Statistical testing uses sample data to test a hypothesis about a population parameter (like the mean, proportion, or variance).
It’s a way to quantify uncertainty and make objective decisions rather than relying on guesswork.
Steps in Hypothesis Testing
- State the hypotheses - Formulate the null (H₀) and alternative (H₁) hypotheses.
- Choose significance level (α) - Decide how much risk of error you’ll accept (commonly 0.05).
- Collect and summarise data - Compute sample statistics (mean, variance, etc.).
- Calculate test statistic - Use an appropriate formula (z, t, χ², F, etc.) to measure how far the sample result is from what H₀ predicts.
- Compute p-value - Probability of observing a result at least as extreme as yours if H₀ is true.
- Make a decision - If p-value ≤ α → reject H₀; otherwise, fail to reject H₀.
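The six steps can be sketched end to end with a one-sample z-test. This is a minimal illustration using only the Python standard library; the sample values, the H₀ mean of 70 kg, and the known population standard deviation `sigma` are all made-up assumptions, not data from this guide.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical sample of weights (kg), invented for illustration.
sample = [72.1, 69.8, 71.5, 73.2, 70.9, 72.8, 71.1, 70.4, 72.6, 71.9]

mu_0 = 70.0    # Step 1: H0: mu = 70, H1: mu != 70
alpha = 0.05   # Step 2: significance level
sigma = 2.0    # Assumed known population standard deviation (z-test requirement)

# Step 3: summarise the data
n = len(sample)
x_bar = mean(sample)

# Step 4: test statistic = distance between x_bar and mu_0 in standard errors
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Step 5: two-sided p-value under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 6: decision
reject_h0 = p_value <= alpha
print(f"mean={x_bar:.2f}, z={z:.2f}, p={p_value:.4f}, reject H0: {reject_h0}")
```

When σ is not known, the same steps apply with a t-statistic in place of z.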
1. Null and Alternative Hypotheses
Symbol | Meaning | Example |
H₀ (Null Hypothesis) | Assumes no effect or no difference. | “The mean weight = 70 kg.” |
H₁ (Alternative Hypothesis) | Suggests there is an effect or difference. | “The mean weight ≠ 70 kg.” |
We always start by assuming H₀ is true, then use data to decide whether there’s enough evidence to reject it.
2. Significance Level (α)
The significance level (α) represents the probability of rejecting H₀ when it’s actually true (Type I error).
Common choices:
- α = 0.05 (5%) → 95% confidence
- α = 0.01 (1%) → 99% confidence
Smaller α → stricter test (less chance of false positives).
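One way to see what α means is to simulate many experiments in which H₀ is true and count how often the test rejects anyway. The sketch below assumes normally distributed data with known σ and uses hypothetical helper names (`z_test_p_value`, `false_positive_rate`); the rejection rate should land close to the chosen α.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def z_test_p_value(sample, mu_0, sigma):
    """Two-sided p-value for a one-sample z-test with known sigma."""
    z = (mean(sample) - mu_0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(alpha, trials=4000, n=30, seed=0):
    """Fraction of simulated experiments that reject H0 although H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # The true mean equals mu_0, so every rejection is a Type I error.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample, mu_0=0.0, sigma=1.0) <= alpha:
            rejections += 1
    return rejections / trials

rate_05 = false_positive_rate(0.05)
rate_01 = false_positive_rate(0.01)
print(f"alpha=0.05 -> {rate_05:.3f}, alpha=0.01 -> {rate_01:.3f}")
```

Because both calls use the same seed, they see the same simulated samples, so the stricter α can only reject a subset of what the looser one rejects.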
3. Test Statistic
A test statistic measures how far your sample result is from what H₀ predicts — in standardised units (Z, T, F, or χ²).
Test | Use Case | Example |
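Z-test | Known σ (or large n, where s closely approximates σ) | Comparing a sample mean to a population mean |
T-test | Unknown σ estimated from the sample (matters most for small n) | Comparing a sample mean to a reference value, or two sample means |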
Chi-square test (χ²) | Categorical data | Testing independence or goodness-of-fit |
ANOVA (F-test) | Comparing >2 group means | Checking if at least one group differs |
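As an illustration of one row of the table, the chi-square statistic for a test of independence can be computed by hand. The 2x2 counts below are hypothetical, no continuity correction is applied, and the p-value shortcut only works because a 2x2 table has one degree of freedom; treat it as a sketch rather than a general-purpose routine.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical 2x2 contingency table: rows = group, columns = outcome.
observed = [[30, 70],
            [50, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected counts assume rows and columns are independent (H0).
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

# With 1 degree of freedom, a chi-square variable is the square of a
# standard normal variable, so the p-value can be read from the normal CDF.
p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
print(f"chi2={chi2:.3f}, p={p_value:.4f}")
```

For larger tables the degrees of freedom exceed one and the chi-square distribution itself is needed, which is usually taken from a statistics library.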
4. P-value and Decision Rule
- p-value = Probability of observing a test statistic as extreme as (or more extreme than) the one from your data, if H₀ is true.
- Decision:
- If p-value ≤ α → Reject H₀ (evidence supports H₁)
- If p-value > α → Fail to reject H₀ (no strong evidence)
Smaller p-value → stronger evidence against H₀.
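The definition of the p-value can be checked by brute force: generate many samples from a world where H₀ is true and count how often the sample mean lands at least as far from the H₀ value as the observed one. The numbers below (H₀ mean, σ, sample size, observed mean) are hypothetical, and the data are assumed to be normal with known σ.

```python
import random
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n = 170.0, 6.0, 25   # hypothetical H0 mean, known sigma, sample size
observed_mean = 172.4             # hypothetical observed sample mean

# Analytic two-sided p-value from the z statistic.
z_obs = (observed_mean - mu_0) / (sigma / sqrt(n))
p_analytic = 2 * (1 - NormalDist().cdf(abs(z_obs)))

# Simulated p-value: share of H0-true samples that are at least as extreme.
rng = random.Random(42)
trials = 20000
extreme = 0
for _ in range(trials):
    sim_mean = sum(rng.gauss(mu_0, sigma) for _ in range(n)) / n
    if abs(sim_mean - mu_0) >= abs(observed_mean - mu_0):
        extreme += 1
p_simulated = extreme / trials

print(f"analytic p={p_analytic:.4f}, simulated p={p_simulated:.4f}")
```

The two values should agree up to simulation noise, which shrinks as `trials` grows.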
5. Types of Errors
Type | Description | Example |
Type I Error (α) | Rejecting H₀ when it’s true | Concluding a drug works when it doesn’t |
Type II Error (β) | Failing to reject H₀ when it’s false | Missing that a drug actually works |
In practice these errors trade off: for a fixed sample size, lowering α raises β, while a larger sample size reduces β (raises power) at the same α.
6. Confidence Level & Power
- Confidence Level - 1 − α (probability of correctly not rejecting H₀ when it is true)
- Power of a Test - 1 − β (probability of correctly rejecting a false H₀)
A powerful test (power close to 1) is more likely to detect real effects.
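For a two-sided z-test, power has a closed form, which makes it easy to see how sample size drives it. The function name `z_test_power` and the effect size and σ used in the loop are illustrative assumptions; the formula assumes normal data with known σ.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test when the true mean lies
    `effect` units away from the H0 value and sigma is known."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # reject when |z| exceeds this
    shift = effect * sqrt(n) / sigma            # mean of z under the true effect
    # Probability that z falls in the upper or the lower rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

for n in (10, 30, 100):
    print(n, round(z_test_power(effect=2.0, sigma=6.0, n=n), 3))
```

With no true effect (`effect=0`) the function returns α, since every rejection is then a Type I error.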
Example (Two-tailed t-test)
Suppose you’re testing whether the average height of a sample differs from 170 cm.
- H₀: μ = 170 , H₁: μ ≠ 170
- α = 0.05
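- Compute the t-statistic: t = (x̄ − 170) / (s / √n), where x̄ is the sample mean, s the sample standard deviation, and n the sample size
- Find the two-tailed p-value from the t-distribution with n − 1 degrees of freedom and compare it to 0.05
- If p ≤ 0.05 → reject H₀ (significant difference); otherwise, fail to reject H₀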
The result tells you whether your sample mean differs significantly from 170 cm.
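A runnable version of this example might look like the sketch below. The heights are invented for illustration, and the p-value is obtained by numerically integrating the t density with Simpson's rule so that only the standard library is needed; in practice a library routine such as `scipy.stats.ttest_1samp` would normally be used instead.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_t_p_value(t_stat, df, steps=2000):
    """Two-sided p-value: 1 minus the area under the t density between
    -|t| and |t|, integrated with Simpson's rule (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * t_pdf(k * h, df)
    half_area = area * h / 3  # integral from 0 to |t|
    return 1 - 2 * half_area  # the density is symmetric around 0

# Hypothetical sample of heights in cm.
heights = [172.5, 168.9, 174.2, 171.8, 169.5, 175.1, 173.3, 170.6, 172.9, 174.0]
mu_0, alpha = 170.0, 0.05

n = len(heights)
t_stat = (mean(heights) - mu_0) / (stdev(heights) / sqrt(n))
p_value = two_sided_t_p_value(t_stat, df=n - 1)
print(f"t={t_stat:.3f}, p={p_value:.4f}, reject H0: {p_value <= alpha}")
```

Here `stdev` is the sample standard deviation (dividing by n − 1), which is what the t-statistic requires.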
In Machine Learning
- Model validation - Testing if two models perform significantly differently
- Feature selection - Checking if a feature has a significant effect on the target
- A/B testing - Comparing conversion rates, engagement, etc.
- Error analysis - Checking if residuals are normally distributed
- Experiment design - Quantifying uncertainty and statistical significance
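As a sketch of the A/B testing use case, conversion rates of two variants can be compared with a two-proportion z-test. The function name and the conversion counts are hypothetical, and the normal approximation assumes reasonably large samples in both groups.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for H0: both variants share one conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # common rate assumed under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 120 of 2400 users convert on A, 156 of 2400 on B.
z, p_value = two_proportion_z_test(120, 2400, 156, 2400)
print(f"z={z:.3f}, p={p_value:.4f}, significant at 0.05: {p_value <= 0.05}")
```

The same pattern applies to comparing two classifiers' accuracy on independent test sets, although paired tests are usually preferable when both models are scored on the same examples.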














