Chi-Squared Test

Maths: Statistics for machine learning


Published Oct 22 2025, updated Oct 23 2025




The Chi-Squared test (pronounced “kai-squared”) is a statistical hypothesis test used to determine whether there is a significant relationship between categorical variables or whether observed frequencies differ from expected frequencies.

It’s one of the most widely used non-parametric tests for discrete (count) data.


In simple terms:

“The Chi-Squared test checks whether what you observed is different from what you expected — for example, are two categorical variables related, or are differences just due to chance?”




When to Use It

  • Data are categorical - e.g., gender, colour, product type
  • You have counts/frequencies - observed vs expected
  • Expected frequency per cell ≥ 5 - so the χ² approximation is reliable

When Not to Use It

  • Data are continuous - use other tests (t-test, ANOVA, etc.)
  • Small expected counts (< 5 per cell) - use Fisher’s exact test
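For a small 2×2 table, the fallback to Fisher’s exact test might look like the sketch below. The counts are hypothetical, invented purely for illustration; `expected_freq` and `fisher_exact` are SciPy functions.

```python
import numpy as np
from scipy.stats import fisher_exact
from scipy.stats.contingency import expected_freq

# Hypothetical 2x2 table with very small counts (illustration only)
small_table = np.array([[3, 1],
                        [1, 3]])

# Expected counts under independence: row total * column total / grand total
expected = expected_freq(small_table)
print("Expected counts:\n", expected)

if (expected < 5).any():
    # The chi-squared approximation is unreliable here, so use the exact test
    odds_ratio, p_value = fisher_exact(small_table)
    print(f"Fisher's exact test p-value: {p_value:.4f}")
```

Every expected count here is 2, well below the threshold of 5, so the exact test is the safer choice.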



Types of Chi-Squared Tests

  1. Goodness-of-Fit Test - Tests whether observed frequencies match expected proportions, e.g. “Is a die fair?”
  2. Test of Independence - Tests whether two categorical variables are related, e.g. “Is gender related to buying preference?”
  3. Test of Homogeneity - Tests whether distributions are the same across populations, e.g. “Do different stores sell similar proportions of products?”



1. Goodness-of-Fit Test

Hypotheses

  • H₀ (Null) - Observed frequencies match expected frequencies
  • H₁ (Alt) - Observed frequencies differ from expected frequencies

Formula:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

Where:

  • O_i = observed frequency in category i
  • E_i = expected frequency in category i

Degrees of Freedom (df): df = k − 1

where k = number of categories


Example: Are dice fair?

You roll a die 60 times and get:

Face            1    2    3    4    5    6
Observed (O)    8    9   10   12   11   10
Expected (E)   10   10   10   10   10   10

Expected (E) = 60 / 6 = 10 for each face (a fair die gives every face an equal chance).


Python Example

import numpy as np
from scipy.stats import chisquare

observed = np.array([8, 9, 10, 12, 11, 10])
expected = np.array([10, 10, 10, 10, 10, 10])

chi2, p = chisquare(observed, expected)
print(f"Chi-Squared Statistic: {chi2:.3f}")
print(f"P-value: {p:.4f}")

# Compare the p-value with a 5% significance level
if p > 0.05:
    print("No significant difference: no evidence that the die is biased.")
else:
    print("Significant difference: the die may be biased.")

Interpretation:

  • p > 0.05 → no significant departure from the expected distribution (no evidence of bias)
  • p ≤ 0.05 → observed frequencies differ significantly from expected (possible bias)
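To see where the statistic comes from, the goodness-of-fit formula can be applied by hand to the dice counts and checked against `chisquare`. This is a minimal sketch; the variable names are illustrative, and the p-value uses the survival function of SciPy's `chi2` distribution with df = k − 1.

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([8, 9, 10, 12, 11, 10])
expected = np.full(6, observed.sum() / 6)  # 60 rolls / 6 faces = 10 each

# Sum of (O - E)^2 / E over the six faces
statistic = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: number of categories minus one
dof = len(observed) - 1

# P(chi-squared with dof degrees of freedom >= statistic)
p_value = chi2.sf(statistic, dof)

print(f"Manual statistic: {statistic:.3f}, df: {dof}, p-value: {p_value:.4f}")

# Cross-check against SciPy's built-in test
scipy_stat, scipy_p = chisquare(observed, expected)
print(f"SciPy statistic:  {scipy_stat:.3f}, p-value: {scipy_p:.4f}")
```

The squared differences are 4, 1, 0, 4, 1, 0, which sum to 10; dividing by the expected count of 10 gives a statistic of 1.0.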





2. Chi-Squared Test of Independence

Used with a contingency table — tests if two categorical variables are related or independent.

Hypotheses

  • H₀ (Null) - The variables are independent
  • H₁ (Alt) - The variables are dependent (associated)

Example: Gender vs Purchase Preference

Gender / Preference   Buy   Don’t Buy   Total
Male                   40          20      60
Female                 30          30      60
Total                  70          50     120


Python Example

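import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = gender (Male, Female), columns = (Buy, Don’t Buy)
data = np.array([[40, 20],
                 [30, 30]])

# For 2×2 tables SciPy applies Yates' continuity correction by default
# (pass correction=False for the uncorrected Pearson statistic)
chi2, p, dof, expected = chi2_contingency(data)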

print(f"Chi-Squared Statistic: {chi2:.3f}")
print(f"Degrees of Freedom: {dof}")
print(f"P-value: {p:.4f}")
print("Expected Frequencies:\n", expected)

if p < 0.05:
    print("Reject H₀: the variables are associated.")
else:
    print("Fail to reject H₀: no evidence of an association.")

Interpretation:

  • If p < 0.05 → significant association (e.g., gender is associated with buying preference)
  • If p ≥ 0.05 → no evidence of an association (this does not prove the variables are independent)
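The expected frequencies are not magic: under independence, each cell is (row total × column total) / grand total. The sketch below rebuilds them by hand for the gender table and compares the result with `chi2_contingency`. Variable names are illustrative; `correction=False` is used for the comparison because SciPy applies Yates' continuity correction to 2×2 tables by default.

```python
import numpy as np
from scipy.stats import chi2_contingency

data = np.array([[40, 20],
                 [30, 30]])

row_totals = data.sum(axis=1, keepdims=True)  # shape (2, 1): [[60], [60]]
col_totals = data.sum(axis=0, keepdims=True)  # shape (1, 2): [[70, 50]]
n = data.sum()                                # grand total: 120

# Expected count per cell under independence
expected_manual = row_totals * col_totals / n

# Pearson statistic without any continuity correction
chi2_manual = ((data - expected_manual) ** 2 / expected_manual).sum()

chi2_raw, p_raw, dof, expected_scipy = chi2_contingency(data, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(data)  # default: Yates-corrected

print("Expected (manual):\n", expected_manual)
print(f"Manual chi2: {chi2_manual:.4f}, SciPy uncorrected: {chi2_raw:.4f}")
print(f"SciPy with Yates' correction: {chi2_yates:.4f} (p = {p_yates:.4f})")
```

The corrected statistic is smaller than the uncorrected one, so the default 2×2 result is slightly more conservative.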





3. Chi-Squared Test of Homogeneity

Similar to the independence test but used to compare distributions across different populations (e.g., region A vs region B sales distribution).

Same formula, same interpretation — only the data come from different samples rather than one sample classified by two variables.


Degrees of Freedom (df):

For independence or homogeneity:

$$df = (r - 1)(c - 1)$$

where:

  • r = number of rows (categories of variable 1)
  • c = number of columns (categories of variable 2)
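A homogeneity test runs through the same SciPy call as the independence test. The sketch below uses hypothetical sales counts for two stores across three product types (the numbers are invented for illustration) and confirms that the reported degrees of freedom equal (r − 1)(c − 1).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical sales counts (illustration only)
# Columns: product type 1, product type 2, product type 3
sales = np.array([[50, 30, 20],   # Store A
                  [40, 35, 25]])  # Store B

statistic, p_value, dof, expected = chi2_contingency(sales)

r, c = sales.shape
print(f"Chi-Squared Statistic: {statistic:.3f}")
print(f"Degrees of Freedom: {dof} (formula gives {(r - 1) * (c - 1)})")
print(f"P-value: {p_value:.4f}")
print("Expected Frequencies:\n", expected)
```

Because this table is larger than 2×2, no continuity correction is applied.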

Visualisation Example

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Gender': ['Male'] * 60 + ['Female'] * 60,
    'Purchase': ['Buy'] * 40 + ['Don’t Buy'] * 20 + ['Buy'] * 30 + ['Don’t Buy'] * 30
})

sns.countplot(data=df, x='Gender', hue='Purchase', palette='Set2')
plt.title("Gender vs Purchase Preference")
plt.show()

The visual helps show whether proportions differ across categories.


[Figure: grouped count plot of purchase preference by gender]
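The numbers behind the chart can be tabulated with `pd.crosstab`, which is a convenient way to go from raw rows to a contingency table. A minimal sketch, rebuilding the same DataFrame as in the plotting example:

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male'] * 60 + ['Female'] * 60,
    'Purchase': ['Buy'] * 40 + ['Don’t Buy'] * 20 + ['Buy'] * 30 + ['Don’t Buy'] * 30
})

# Raw counts per gender and purchase decision
counts = pd.crosstab(df['Gender'], df['Purchase'])

# Row-wise proportions: the share of each gender that buys or does not buy
proportions = pd.crosstab(df['Gender'], df['Purchase'], normalize='index')

print(counts)
print(proportions.round(3))
```

Two thirds of the men buy compared with half of the women; the chi-squared test asks whether a gap of that size could plausibly arise by chance.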





Assumptions

  • Data type - Frequencies/counts (not percentages)
  • Independence - Observations must be independent
  • Expected frequencies - ≥ 5 per cell for valid χ² approximation
  • Sample size - Large enough that every expected count meets the threshold above
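The expected-frequency assumption can be checked before running the test. The helpers below are hypothetical names introduced for this sketch, not part of any library; they rely on SciPy's `expected_freq` and the conventional threshold of 5.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def min_expected_count(table):
    """Smallest expected cell count under independence (hypothetical helper)."""
    return float(expected_freq(np.asarray(table)).min())

def meets_expected_count_rule(table, threshold=5):
    """True if every expected cell count is at least `threshold`."""
    return min_expected_count(table) >= threshold

gender_table = [[40, 20], [30, 30]]   # the gender vs purchase example
tiny_table = [[2, 3], [3, 2]]         # a table built from only 10 observations

print(min_expected_count(gender_table), meets_expected_count_rule(gender_table))
print(min_expected_count(tiny_table), meets_expected_count_rule(tiny_table))
```

When the check fails, Fisher’s exact test or collecting more data is usually the better route.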





Python Example with Pingouin

import pingouin as pg
import pandas as pd

# Example raw data
df = pd.DataFrame({
    'fbs': [0, 1, 1, 0, 0, 1, 0, 1, 1, 0], # Fasting blood sugar
    'target': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0] # Heart disease
})

# Run chi-squared test on raw categorical columns
expected, observed, stats = pg.chi2_independence(data=df, x='fbs', y='target')

print("Observed Frequencies:\n", observed)
print("\nExpected Frequencies:\n", expected)
print("\nTest Results:\n", stats)

Output:

Observed Frequencies:
 target    0    1
fbs
0        2.5  2.5
1        2.5  2.5

Expected Frequencies:
 target    0    1
fbs
0        2.5  2.5
1        2.5  2.5

Test Results:
                 test    lambda  chi2  dof  pval  cramer  power
0             pearson  1.000000   0.0  1.0   1.0     0.0   0.05
1        cressie-read  0.666667   0.0  1.0   1.0     0.0   0.05
2      log-likelihood  0.000000   0.0  1.0   1.0     0.0   0.05
3       freeman-tukey -0.500000   0.0  1.0   1.0     0.0   0.05
4  mod-log-likelihood -1.000000   0.0  1.0   1.0     0.0   0.05
5              neyman -2.000000   0.0  1.0   1.0     0.0   0.05
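The observed table above shows 2.5 in every cell, although raw counts must be whole numbers. A likely explanation (an assumption about Pingouin's defaults, not something the output states) is that a Yates continuity correction has moved each observed count 0.5 towards its expected value. The sketch below recovers the raw counts with `pd.crosstab` and runs SciPy's test with and without the correction.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    'fbs': [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],     # Fasting blood sugar
    'target': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]   # Heart disease
})

# Raw (uncorrected) contingency table
raw_counts = pd.crosstab(df['fbs'], df['target'])
print("Raw counts:\n", raw_counts)

table = raw_counts.to_numpy()

# SciPy default for 2x2 tables: Yates' continuity correction applied
chi2_corr, p_corr, dof, expected = chi2_contingency(table)

# Uncorrected Pearson statistic for comparison
chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

print(f"With correction:    chi2 = {chi2_corr:.3f}, p = {p_corr:.4f}")
print(f"Without correction: chi2 = {chi2_raw:.3f}, p = {p_raw:.4f}")
```

With only 10 observations every expected count is 2.5, below the ≥ 5 guideline, so this dataset works as a syntax demonstration rather than a meaningful test.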


© 2025 SimpleSteps.guide