Chi-Squared Test
The Chi-Squared test (pronounced “kai-squared”) is a statistical hypothesis test used to determine whether there is a significant relationship between categorical variables or whether observed frequencies differ from expected frequencies.
It’s one of the most widely used non-parametric tests for discrete (count) data.
In simple terms:
“The Chi-Squared test checks whether what you observed is different from what you expected — for example, are two categorical variables related, or are differences just due to chance?”
When to Use It
- Data are categorical - e.g. gender, colour, product type
- You have counts/frequencies - observed vs expected counts
- Expected frequency per cell ≥ 5 - needed for the χ² approximation to be accurate
When Not to Use It
- Data are continuous - use other tests instead (t-test, ANOVA, etc.)
- Small expected counts (< 5 per cell) - use Fisher’s exact test (a short sketch follows this list)
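For reference, Fisher’s exact test is available in SciPy. A minimal sketch, with made-up counts chosen purely for illustration:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table with small counts (rows = group A/B,
# columns = outcome yes/no). Some expected counts fall below 5,
# so Fisher's exact test is the safer choice over chi-squared.
table = [[3, 7],
         [8, 2]]

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.3f}, p = {p_value:.4f}")
```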
Types of Chi-Squared Tests
- Goodness-of-Fit Test - Tests whether observed frequencies match expected proportions, e.g. “Is a die fair?”
- Test of Independence - Tests whether two categorical variables are related, e.g. “Is gender related to buying preference?”
- Test of Homogeneity - Tests whether a distribution is the same across populations, e.g. “Do different stores sell similar proportions of products?”
1. Goodness-of-Fit Test
Hypotheses
- H₀ (Null) - Observed frequencies match expected frequencies
- H₁ (Alt) - Observed frequencies differ from expected frequencies
Formula:
χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
Where:
- Oᵢ = observed frequency for category i
- Eᵢ = expected frequency for category i
Degrees of Freedom (df): df = k − 1
where k = number of categories
Example: Is a die fair?
You roll a die 60 times and get:
| Face | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Observed (O) | 8 | 9 | 10 | 12 | 11 | 10 |
Expected (E) = 10 per face, since a fair die gives each face probability 1/6 and 60 × 1/6 = 10.
Python Example
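A minimal sketch using scipy.stats.chisquare with the counts above:

```python
from scipy.stats import chisquare

# Observed counts from 60 rolls (table above)
observed = [8, 9, 10, 12, 11, 10]
# A fair die gives each face 60 / 6 = 10 expected rolls
expected = [10, 10, 10, 10, 10, 10]

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.2f}, p = {p:.4f}")
# Here chi-squared = 10/10 = 1.00 with df = 6 - 1 = 5, so p is
# large (~0.96): no evidence the die is biased.
```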
Interpretation:
- p > 0.05 → data are consistent with the expected distribution (no evidence of bias)
- p ≤ 0.05 → observed frequencies differ significantly from expected (possible bias)
2. Chi-Squared Test of Independence
Used with a contingency table — tests if two categorical variables are related or independent.
Hypotheses
- H₀ (Null) - The variables are independent
- H₁ (Alt) - The variables are dependent (associated)
Example: Gender vs Purchase Preference
| Gender | Buy | Don’t Buy | Total |
| --- | --- | --- | --- |
| Male | 40 | 20 | 60 |
| Female | 30 | 30 | 60 |
| Total | 70 | 50 | 120 |
Python Example
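A minimal sketch using scipy.stats.chi2_contingency on the table above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table from above: rows = Male/Female, columns = Buy/Don't Buy
observed = np.array([[40, 20],
                     [30, 30]])

stat, p, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {stat:.3f}, p = {p:.4f}, df = {dof}")
print("expected counts:")
print(expected)
# Note: for 2x2 tables SciPy applies Yates' continuity correction by
# default; pass correction=False to reproduce the plain formula.
# For this table the corrected statistic is about 2.78 with p around
# 0.10, so it does not quite reach significance at the 5% level.
```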
Interpretation:
- If p ≤ 0.05 → significant association (gender and buying preference appear related)
- If p > 0.05 → no evidence of a relationship (the variables appear independent)
3. Chi-Squared Test of Homogeneity
Similar to the independence test but used to compare distributions across different populations (e.g., region A vs region B sales distribution).
Same formula and same interpretation; the only difference is that the data come from several separate samples rather than one sample classified by two variables. A short sketch follows the df formula below.
Degrees of Freedom (df):
For independence or homogeneity:
df = (r − 1) × (c − 1)
where:
- r = number of rows (categories of variable 1)
- c = number of columns (categories of variable 2)
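Mechanically, the homogeneity test is the same chi2_contingency call as before; a minimal sketch with hypothetical counts for two independently sampled stores:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = two separately sampled stores,
# columns = product categories sold
counts = np.array([[50, 30, 20],
                   [45, 35, 25]])

stat, p, dof, _ = chi2_contingency(counts)
print(f"chi-squared = {stat:.3f}, p = {p:.4f}, df = {dof}")
# df = (2 - 1) * (3 - 1) = 2; a large p-value would suggest the two
# stores' product distributions are similar (homogeneous).
```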
Visualisation Example
A grouped bar chart of the contingency table helps show whether proportions differ across categories; one way to draw it is sketched below.
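A minimal matplotlib sketch using the gender vs purchase counts from the independence example (labels and styling choices here are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Counts from the gender vs purchase preference example
labels = ["Male", "Female"]
buy = np.array([40, 30])
dont_buy = np.array([20, 30])

x = np.arange(len(labels))  # bar group positions
width = 0.35                # width of each bar

fig, ax = plt.subplots()
ax.bar(x - width / 2, buy, width, label="Buy")
ax.bar(x + width / 2, dont_buy, width, label="Don't Buy")
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel("Count")
ax.set_title("Purchase preference by gender")
ax.legend()
plt.show()
```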
Assumptions
- Data type - Frequencies/counts (not percentages)
- Independence - Observations must be independent
- Expected frequencies - ≥ 5 per cell for valid χ² approximation
- Sample size - Reasonably large
Python code
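A consolidated sketch that runs both tests end to end, reusing the example data above (alpha = 0.05 is an assumed significance level):

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# 1. Goodness of fit: is the die fair?
observed_rolls = np.array([8, 9, 10, 12, 11, 10])
expected_rolls = np.full(6, 10)  # 60 rolls / 6 faces
gof_stat, gof_p = chisquare(observed_rolls, expected_rolls)
print(f"Goodness of fit: chi2 = {gof_stat:.2f}, p = {gof_p:.4f}")

# 2. Independence: gender vs purchase preference
table = np.array([[40, 20],
                  [30, 30]])
ind_stat, ind_p, dof, expected = chi2_contingency(table)
print(f"Independence: chi2 = {ind_stat:.3f}, p = {ind_p:.4f}, df = {dof}")
print("Expected counts (check all are >= 5):")
print(expected)

# Decision rule at alpha = 0.05
for name, p in [("goodness of fit", gof_p), ("independence", ind_p)]:
    verdict = "reject H0" if p <= 0.05 else "fail to reject H0"
    print(f"{name}: {verdict}")
```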