Histograms and KDE
Maths: Statistics for machine learning
4 min read
Published Oct 22 2025, updated Oct 23 2025
Understanding the shape and spread of your data is a key part of statistical analysis and machine learning. Two of the most common tools for visualising data distributions are the histogram and the kernel density estimate (KDE). A histogram provides a simple, discrete view of how data values are distributed by grouping them into bins and showing the frequency of observations in each range. In contrast, a KDE produces a smooth, continuous curve that estimates the probability density function (PDF) of the data. While histograms give a quick, intuitive snapshot of data frequency, KDEs offer a cleaner and often more informative representation of the underlying distribution, especially for identifying patterns, skewness, or multiple peaks. Used together, they provide a powerful way to explore data, detect outliers, and assess normality before applying machine learning models.
Histograms
A histogram is a graphical representation of the distribution of numerical data.
It shows how often (or how densely) values occur within specific intervals, called bins.
In simple terms:
“A histogram tells you how your data are spread out — where values are common or rare.”
How It Works
- The data range is divided into equal-width intervals (bins).
- The height of each bar shows the frequency (or relative frequency) of data values that fall within that interval.
X-axis: data values (grouped into bins)
Y-axis: frequency or density of observations

In a typical histogram you’ll see a series of touching bars, each showing how many values fall within its range (bin). For roughly normal data, the bars trace out a bell shape.
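As a quick illustration, here is a minimal sketch that draws a histogram with NumPy and Matplotlib (the sample data and the choice of 30 bins are purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: 1,000 draws from a standard normal distribution
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=1000)

# 30 equal-width bins spanning the data range
plt.hist(data, bins=30, edgecolor="black")
plt.xlabel("Value")      # x-axis: data values grouped into bins
plt.ylabel("Frequency")  # y-axis: count of observations per bin
plt.title("Histogram of sample data")
plt.show()
```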
What are bins?
A bin in a histogram represents a range (interval) of values into which data points are grouped.
Each bar in the histogram corresponds to one bin, and its height shows how many values fall within that range.
In simple terms:
“Bins are buckets that group nearby values together so you can see patterns in your data.”
How Bins Work
- The x-axis is divided into intervals (bins).
- The y-axis shows the frequency (or density) of data points in each bin.
- Each data point falls into exactly one bin.
Example:
If your bins are 0–10, 10–20, 20–30...
and your data value is 17 → it goes in the 10–20 bin.
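This is easy to check in code: NumPy’s digitize reports which bin a value lands in (a small sketch using the bin edges from the example above):

```python
import numpy as np

edges = np.array([0, 10, 20, 30])  # bins: 0-10, 10-20, 20-30
value = 17

# np.digitize returns the index of the bin the value falls into
i = np.digitize(value, edges)
print(f"{value} falls in the {edges[i - 1]}-{edges[i]} bin")  # -> the 10-20 bin
```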
Why Bin Choice Matters
| Too Few Bins | Too Many Bins |
| --- | --- |
| Over-smoothing | Overfitting / noisy |
| Hides important patterns | Hard to interpret |
| Looks too “blocky” | Overly jagged |
| Misses multimodality | Emphasises random variation |
The goal: find a balance — enough bins to show structure, but not so many that noise dominates.
Example:
- 5 bins - Oversimplified — hides variation
- 30 bins - Balanced — smooth but detailed
- 100 bins - Over-detailed — noisy, hard to read
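To see this effect directly, one can draw the same data at each of these bin counts (a sketch with illustrative data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# Same data, three different bin counts
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 30, 100]):
    ax.hist(data, bins=bins, edgecolor="black")
    ax.set_title(f"{bins} bins")
plt.tight_layout()
plt.show()
```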

How to Choose the Number of Bins
There’s no single “perfect” bin size, but several well-known statistical rules help guide your choice.
1. Sturges’ Rule
Good for small to moderately sized datasets.
k = 1 + log2(n)   (rounded up to the nearest whole number)
- k = number of bins
- n = number of observations
- Assumes approximately normal data
Example:
For n = 1000:
k = 1 + log2(1000) ≈ 11 bins
Simple and widely used, but may oversmooth large datasets.
2. Square Root Rule
Quick and simple rule of thumb.
k = √n   (rounded up to the nearest whole number)
- Works reasonably well for many types of data.
- Fast to compute — good for quick visualisation.
Example:
For n = 1000:
k = √1000 ≈ 32 bins
Useful for general-purpose EDA.
3. Rice Rule
A bit more refined and suitable for larger datasets.
k = 2 × n^(1/3)   (rounded up to the nearest whole number)
Example:
For n = 1000:
k = 2 × 1000^(1/3) ≈ 20 bins
Balanced for moderate-to-large data sizes.
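A quick sketch verifying the three rules above in Python (each result is rounded up to a whole number of bins, matching the worked examples for n = 1000):

```python
import math

n = 1000

sturges = math.ceil(1 + math.log2(n))  # Sturges' rule    -> 11
sqrt_k = math.ceil(math.sqrt(n))       # Square root rule -> 32
rice = math.ceil(2 * n ** (1 / 3))     # Rice rule        -> 20

print(sturges, sqrt_k, rice)
```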
4. Freedman–Diaconis Rule
Adapts to data spread (robust against outliers).
bin width h = 2 × IQR / n^(1/3), so k = (max − min) / h
Where:
- IQR = Q3 − Q1 (interquartile range)
Great for skewed or heavy-tailed data.
Python code example of the Freedman–Diaconis rule, which returns an adaptive number of bins based on data variability:
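A minimal sketch of such a function (the helper name freedman_diaconis_bins is illustrative; NumPy implements the same rule via bins="fd"):

```python
import numpy as np

def freedman_diaconis_bins(data):
    """Number of histogram bins chosen by the Freedman-Diaconis rule."""
    data = np.asarray(data)
    n = data.size
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1                   # interquartile range: robust measure of spread
    width = 2 * iqr / n ** (1 / 3)  # FD bin width: 2 * IQR / n^(1/3)
    if width == 0:                  # guard against zero-spread data
        return 1
    return int(np.ceil((data.max() - data.min()) / width))

rng = np.random.default_rng(1)
data = rng.normal(size=1000)
print(freedman_diaconis_bins(data))

# NumPy's built-in equivalent:
print(len(np.histogram_bin_edges(data, bins="fd")) - 1)
```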
KDE — Kernel Density Estimation
KDE is a non-parametric way to estimate a probability density function (PDF) from data samples; it is often plotted on top of a histogram to help show the distribution.
In Simple Terms:
- A PDF is a theoretical function (e.g. Normal, Exponential).
- A KDE is a data-driven estimate of that function.
KDE is basically: “What the PDF looks like, based on my actual data.”
It smooths the data points using “kernels” (small bumps) to create a continuous density curve that approximates the true underlying distribution.
How KDE Works (Conceptually)
- Take each data point and place a small, smooth “bump” (kernel) around it — typically a Gaussian.
- Add up all the bumps.
- Normalise so the total area = 1.
- The result = smooth curve that represents the estimated PDF.
f̂(x) = (1 / (n·h)) Σᵢ K((x − xᵢ) / h)

Where:
- n: number of data points
- xᵢ: the observed data points
- h: bandwidth (controls smoothness)
- K(): kernel function (often Gaussian)
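A sketch of this formula implemented directly, assuming a Gaussian kernel and a hand-picked bandwidth (both the data and h = 0.4 are illustrative):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    """Evaluate f_hat(x) = (1 / (n*h)) * sum_i K((x - x_i) / h)."""
    x = np.asarray(x)[:, None]               # shape (m, 1)
    samples = np.asarray(samples)[None, :]   # shape (1, n)
    return gaussian_kernel((x - samples) / h).mean(axis=1) / h

rng = np.random.default_rng(7)
data = rng.normal(size=200)
grid = np.linspace(-4.0, 4.0, 201)
density = kde(grid, data, h=0.4)

# The estimated density should integrate to roughly 1
print(density.sum() * (grid[1] - grid[0]))
```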
(Figure: histogram with KDE overlay. The KDE (red curve) is a smooth approximation of the underlying PDF; the histogram is a discrete estimate built from bin counts.)
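In practice there is no need to hand-roll the estimator; a typical histogram-plus-KDE overlay can be produced with scipy.stats.gaussian_kde (a sketch with illustrative data; SciPy picks the bandwidth via Scott’s rule by default):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data = rng.normal(size=1000)

# density=True scales the histogram so both plots share the same y-axis
plt.hist(data, bins=30, density=True, alpha=0.5, edgecolor="black")

grid = np.linspace(data.min(), data.max(), 200)
plt.plot(grid, gaussian_kde(data)(grid), color="red", label="KDE")
plt.legend()
plt.show()
```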