Histograms and KDE

Maths: Statistics for machine learning

4 min read

Published Oct 22 2025, updated Oct 23 2025



Tags: Machine Learning, Maths, NumPy, Pandas, Python, Statistics

Understanding the shape and spread of your data is a key part of statistical analysis and machine learning. Two of the most common tools for visualising data distributions are the histogram and the kernel density estimate (KDE). A histogram provides a simple, discrete view of how data values are distributed by grouping them into bins and showing the frequency of observations in each range. In contrast, a KDE produces a smooth, continuous curve that estimates the probability density function (PDF) of the data. While histograms give a quick, intuitive snapshot of data frequency, KDEs offer a cleaner and often more informative representation of the underlying distribution, especially for identifying patterns, skewness, or multiple peaks. Used together, they provide a powerful way to explore data, detect outliers, and assess normality before applying machine learning models.




Histograms

A histogram is a graphical representation of the distribution of numerical data.
It shows how often (or how densely) values occur within specific intervals, called bins.


In simple terms:

“A histogram tells you how your data are spread out — where values are common or rare.”



How It Works

  1. The data range is divided into equal-width intervals (bins).
  2. The height of each bar shows the frequency (or relative frequency) of data values that fall within that interval.

X-axis: data values (grouped into bins)
Y-axis: frequency or density of observations



[Figure: histogram of normally distributed data]

For normally distributed data you’ll see a series of touching bars tracing a bell-shaped outline; each bar shows how many values fall within its range (bin).
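The binning step can be sketched with NumPy alone (the seed and bin count below are arbitrary choices for illustration):

```python
import numpy as np

# Draw a sample to bin — seed and sample size are arbitrary
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# 30 equal-width bins spanning the data range
counts, edges = np.histogram(data, bins=30)

# counts[i] is the number of values with edges[i] <= value < edges[i+1];
# plotting counts against edges gives the familiar bar chart
```

Note that `np.histogram` returns one more edge than it does counts, since each bin needs a left and a right boundary.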




What are bins?

A bin in a histogram represents a range (interval) of values into which data points are grouped.
Each bar in the histogram corresponds to one bin, and its height shows how many values fall within that range.


In simple terms:

“Bins are buckets that group nearby values together so you can see patterns in your data.”



How Bins Work

  • The x-axis is divided into intervals (bins).
  • The y-axis shows the frequency (or density) of data points in each bin.
  • Each data point falls into exactly one bin.

Example:
If your bins are 0–10, 10–20, 20–30...
and your data value is 17 → it goes in the 10–20 bin.
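This bin lookup is exactly what `np.digitize` does, assuming NumPy is available:

```python
import numpy as np

edges = np.array([0, 10, 20, 30])      # bin boundaries: 0–10, 10–20, 20–30
value = 17

# np.digitize returns the 1-based index of the bin the value falls into
bin_index = np.digitize(value, edges)  # 17 lands in the second bin (10–20)
```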



Why Bin Choice Matters

| Too Few Bins             | Too Many Bins                |
| ------------------------ | ---------------------------- |
| Over-smoothing           | Overfitting / noisy          |
| Hides important patterns | Hard to interpret            |
| Looks too “blocky”       | Overly jagged                |
| Misses multimodality     | Emphasises random variation  |


The goal: find a balance — enough bins to show structure, but not so many that noise dominates.

Example:

  • 5 bins - Oversimplified — hides variation
  • 30 bins - Balanced — smooth but detailed
  • 100 bins - Over-detailed — noisy, hard to read

[Figure: the same data plotted with 5, 30, and 100 bins]


How to Choose the Number of Bins

There’s no single “perfect” bin size, but several well-known statistical rules help guide your choice.


1. Sturges’ Rule

Good for small to moderately sized datasets.

k = 1 + log₂(n)
  • k = number of bins
  • n = number of observations
  • Assumes approximately normal data

Example:
For n = 1000:
k = 1 + log₂(1000) ≈ 11 bins

Simple and widely used, but may oversmooth large datasets.
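Sturges’ rule is a one-liner in NumPy (rounding up so partial bins still count):

```python
import numpy as np

n = 1000
# Sturges' rule: k = 1 + log2(n), rounded up to a whole bin count
k = int(np.ceil(1 + np.log2(n)))
```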



2. Square Root Rule

Quick and simple rule of thumb.

k = √n
  • Works reasonably well for many types of data.
  • Fast to compute — good for quick visualisation.

Example:
For n = 1000:
k = √1000 ≈ 32 bins

Useful for general-purpose EDA.
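The square root rule is equally short in code:

```python
import numpy as np

n = 1000
# Square root rule: k = sqrt(n), rounded up to a whole bin count
k = int(np.ceil(np.sqrt(n)))
```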



3. Rice Rule

A bit more refined and suitable for larger datasets.

k = 2 × n^(1/3)

Example:
For n = 1000:
k = 2 × 1000^(1/3) ≈ 20 bins

Balanced for moderate-to-large data sizes.
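And the Rice rule in the same style:

```python
import numpy as np

n = 1000
# Rice rule: k = 2 * n^(1/3), rounded up to a whole bin count
k = int(np.ceil(2 * n ** (1 / 3)))
```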



4. Freedman–Diaconis Rule

Adapts to data spread (robust against outliers).

bin width = 2 × IQR × n^(−1/3)

Where:

  • IQR = Q3 − Q1 (interquartile range)

Great for skewed or heavy-tailed data.


Python code example of Freedman–Diaconis Rule:

import numpy as np

# Sample data — any 1-D numeric array works here
data = np.random.normal(size=1000)

# Freedman–Diaconis bin width: 2 * IQR * n^(-1/3)
q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25
bin_width = 2 * iqr * (len(data) ** (-1/3))

# Turn the width into a bin count over the data range (round up
# so the last, partial bin is not truncated away)
bins = int(np.ceil((data.max() - data.min()) / bin_width))

Returns an adaptive number of bins based on data variability.







KDE — Kernel Density Estimation

KDE is a non-parametric way to estimate a probability density function (PDF) from data samples. It is often plotted on top of a histogram to help show the shape of the distribution.


In Simple Terms:

  • A PDF is a theoretical function (e.g. Normal, Exponential).
  • A KDE is a data-driven estimate of that function.

KDE is basically: “What the PDF looks like, based on my actual data.”


It smooths the data points using “kernels” (small bumps) to create a continuous density curve that approximates the true underlying distribution.




How KDE Works (Conceptually)

  1. Take each data point and place a small, smooth “bump” (kernel) around it — typically a Gaussian.
  2. Add up all the bumps.
  3. Normalise so the total area = 1.
  4. The result = smooth curve that represents the estimated PDF.

f̂(x) = (1 / (n·h)) · Σᵢ₌₁ⁿ K((x − xᵢ) / h)

Where:

  • n: number of data points
  • h: bandwidth (controls smoothness)
  • K(): kernel function (often Gaussian)
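The four conceptual steps above can be implemented directly with NumPy. This is a minimal sketch using a Gaussian kernel; the function name `kde_gaussian` and the bandwidth value are illustrative choices, not from a particular library:

```python
import numpy as np

def kde_gaussian(x_grid, data, h):
    """Place a Gaussian bump of bandwidth h on each data point and sum."""
    # (len(x_grid), len(data)) matrix of scaled distances
    u = (x_grid[:, None] - data[None, :]) / h
    bumps = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    # Dividing by n*h normalises the total area under the curve to ~1
    return bumps.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(0)
data = rng.normal(size=500)
x = np.linspace(-4, 4, 400)
density = kde_gaussian(x, data, h=0.4)
```

In practice you would reach for `scipy.stats.gaussian_kde` or seaborn’s `kdeplot` rather than hand-rolling this, but the hand-rolled version makes the “sum of bumps, then normalise” idea concrete.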


[Figure: histogram with KDE overlay]

The KDE (red curve) is a smooth approximation of the underlying PDF; the histogram is a discrete estimate using bin counts.



© 2025 SimpleSteps.guide