Histograms and KDE
Maths: Statistics for machine learning
4 min read
Published Oct 22 2025, updated Oct 23 2025
Understanding the shape and spread of your data is a key part of statistical analysis and machine learning. Two of the most common tools for visualising data distributions are the histogram and the kernel density estimate (KDE). A histogram provides a simple, discrete view of how data values are distributed by grouping them into bins and showing the frequency of observations in each range. In contrast, a KDE produces a smooth, continuous curve that estimates the probability density function (PDF) of the data. While histograms give a quick, intuitive snapshot of data frequency, KDEs offer a cleaner and often more informative representation of the underlying distribution, especially for identifying patterns, skewness, or multiple peaks. Used together, they provide a powerful way to explore data, detect outliers, and assess normality before applying machine learning models.
Histograms
A histogram is a graphical representation of the distribution of numerical data.
It shows how often (or how densely) values occur within specific intervals, called bins.
In simple terms:
“A histogram tells you how your data are spread out — where values are common or rare.”
How It Works
- The data range is divided into equal-width intervals (bins).
- The height of each bar shows the frequency (or relative frequency) of data values that fall within that interval.
X-axis: data values (grouped into bins)
Y-axis: frequency or density of observations

In a typical histogram you’ll see a series of touching bars, each showing how many values fall within its range (bin). For roughly normal data, the bars trace out a bell shape.
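As a quick illustration, here is a minimal sketch that draws a histogram with NumPy and Matplotlib (the sample data and the choice of 30 bins are purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: 1,000 draws from a standard normal distribution
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=1000)

# 30 equal-width bins spanning the data range
plt.hist(data, bins=30, edgecolor="black")
plt.xlabel("Value")      # x-axis: data values grouped into bins
plt.ylabel("Frequency")  # y-axis: count of observations per bin
plt.title("Histogram of sample data")
plt.show()
```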
What are bins?
A bin in a histogram represents a range (interval) of values into which data points are grouped.
Each bar in the histogram corresponds to one bin, and its height shows how many values fall within that range.
In simple terms:
“Bins are buckets that group nearby values together so you can see patterns in your data.”
How Bins Work
- The x-axis is divided into intervals (bins).
- The y-axis shows the frequency (or density) of data points in each bin.
- Each data point falls into exactly one bin.
Example:
If your bins are 0–10, 10–20, 20–30...
and your data value is 17 → it goes in the 10–20 bin.
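This is easy to check in code: NumPy’s digitize reports which bin a value lands in (a small sketch using the bin edges from the example above):

```python
import numpy as np

edges = np.array([0, 10, 20, 30])  # bins: 0-10, 10-20, 20-30
value = 17

# np.digitize returns the index of the bin the value falls into
i = np.digitize(value, edges)
print(f"{value} falls in the {edges[i - 1]}-{edges[i]} bin")  # -> the 10-20 bin
```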
Why Bin Choice Matters
| Too Few Bins | Too Many Bins |
| --- | --- |
| Over-smoothing | Overfitting / noisy |
| Hides important patterns | Hard to interpret |
| Looks too “blocky” | Overly jagged |
| Misses multimodality | Emphasises random variation |
The goal: find a balance — enough bins to show structure, but not so many that noise dominates.
Example:
- 5 bins - Oversimplified — hides variation
- 30 bins - Balanced — smooth but detailed
- 100 bins - Over-detailed — noisy, hard to read
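To see this effect directly, one can draw the same data at each of these bin counts (a sketch with illustrative data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# Same data, three different bin counts
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 30, 100]):
    ax.hist(data, bins=bins, edgecolor="black")
    ax.set_title(f"{bins} bins")
plt.tight_layout()
plt.show()
```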

How to Choose the Number of Bins
There’s no single “perfect” bin size, but several well-known statistical rules help guide your choice.
1. Sturges’ Rule
Good for small to moderately sized datasets.
k = 1 + log2(n)   (rounded up to the nearest whole number)
- k = number of bins
- n = number of observations
- Assumes approximately normal data
Example:
For n = 1000:
k = 1 + log2(1000) ≈ 11 bins
Simple and widely used, but may oversmooth large datasets.
2. Square Root Rule
Quick and simple rule of thumb.
k = √n   (rounded up to the nearest whole number)
- Works reasonably well for many types of data.
- Fast to compute — good for quick visualisation.
Example:
For n = 1000:
k = √1000 ≈ 32 bins
Useful for general-purpose EDA.
3. Rice Rule
A bit more refined and suitable for larger datasets.
k = 2 × n^(1/3)   (rounded up to the nearest whole number)
Example:
For n = 1000:
k = 2 × 1000^(1/3) ≈ 20 bins
Balanced for moderate-to-large data sizes.
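A quick sketch verifying the three rules above in Python (each result is rounded up to a whole number of bins, matching the worked examples for n = 1000):

```python
import math

n = 1000

sturges = math.ceil(1 + math.log2(n))  # Sturges' rule    -> 11
sqrt_k = math.ceil(math.sqrt(n))       # Square root rule -> 32
rice = math.ceil(2 * n ** (1 / 3))     # Rice rule        -> 20

print(sturges, sqrt_k, rice)
```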
4. Freedman–Diaconis Rule
Adapts to data spread (robust against outliers).
bin width h = 2 × IQR / n^(1/3), so k = (max − min) / h
Where:
- IQR = Q3 − Q1 (interquartile range)
Great for skewed or heavy-tailed data.
Python code example of the Freedman–Diaconis rule, which returns an adaptive number of bins based on data variability:
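A minimal sketch of such a function (the helper name freedman_diaconis_bins is illustrative; NumPy implements the same rule via bins="fd"):

```python
import numpy as np

def freedman_diaconis_bins(data):
    """Number of histogram bins chosen by the Freedman-Diaconis rule."""
    data = np.asarray(data)
    n = data.size
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1                   # interquartile range: robust measure of spread
    width = 2 * iqr / n ** (1 / 3)  # FD bin width: 2 * IQR / n^(1/3)
    if width == 0:                  # guard against zero-spread data
        return 1
    return int(np.ceil((data.max() - data.min()) / width))

rng = np.random.default_rng(1)
data = rng.normal(size=1000)
print(freedman_diaconis_bins(data))

# NumPy's built-in equivalent:
print(len(np.histogram_bin_edges(data, bins="fd")) - 1)
```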
KDE — Kernel Density Estimation
KDE is a non-parametric way to estimate a probability density function (PDF) from data samples; it is often plotted on top of a histogram to help show the distribution.
In Simple Terms:
- A PDF is a theoretical function (e.g. Normal, Exponential).
- A KDE is a data-driven estimate of that function.
KDE is basically: “What the PDF looks like, based on my actual data.”
It smooths the data points using “kernels” (small bumps) to create a continuous density curve that approximates the true underlying distribution.
How KDE Works (Conceptually)
- Take each data point and place a small, smooth “bump” (kernel) around it — typically a Gaussian.
- Add up all the bumps.
- Normalise so the total area = 1.
- The result = smooth curve that represents the estimated PDF.
f̂(x) = (1 / (n·h)) Σᵢ K((x − xᵢ) / h)

Where:
- n: number of data points
- xᵢ: the observed data points
- h: bandwidth (controls smoothness)
- K(): kernel function (often Gaussian)
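A sketch of this formula implemented directly, assuming a Gaussian kernel and a hand-picked bandwidth (both the data and h = 0.4 are illustrative):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    """Evaluate f_hat(x) = (1 / (n*h)) * sum_i K((x - x_i) / h)."""
    x = np.asarray(x)[:, None]               # shape (m, 1)
    samples = np.asarray(samples)[None, :]   # shape (1, n)
    return gaussian_kernel((x - samples) / h).mean(axis=1) / h

rng = np.random.default_rng(7)
data = rng.normal(size=200)
grid = np.linspace(-4.0, 4.0, 201)
density = kde(grid, data, h=0.4)

# The estimated density should integrate to roughly 1
print(density.sum() * (grid[1] - grid[0]))
```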
(Figure: histogram with KDE overlay. The KDE (red curve) is a smooth approximation of the underlying PDF; the histogram is a discrete estimate built from bin counts.)
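In practice there is no need to hand-roll the estimator; a typical histogram-plus-KDE overlay can be produced with scipy.stats.gaussian_kde (a sketch with illustrative data; SciPy picks the bandwidth via Scott’s rule by default):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data = rng.normal(size=1000)

# density=True scales the histogram so both plots share the same y-axis
plt.hist(data, bins=30, density=True, alpha=0.5, edgecolor="black")

grid = np.linspace(data.min(), data.max(), 200)
plt.plot(grid, gaussian_kde(data)(grid), color="red", label="KDE")
plt.legend()
plt.show()
```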