Population and Sample Sets

Maths: Statistics for machine learning


Published Oct 22 2025, updated Oct 23 2025

Machine Learning, Maths, NumPy, Pandas, Python, Statistics

Population

A population is the entire set of individuals, items, or data points that share a common characteristic of interest in a study.
It includes all members of a defined group about which we want to draw conclusions.

Characteristics:

Complete set: Includes all observations or elements of interest.
Parameter: A numerical value that describes a characteristic of the population (e.g., population mean μ, population variance σ²).

Example populations:

All students in a school — to calculate the average height of students.
All stores nationwide — to identify the most purchased product.
All consumers in a city — to understand purchasing behavior.
All patients in a hospital — to study the effectiveness of a new drug.

Sample Data

A sample is a subset of the population selected for analysis.
Sampling allows researchers to make inferences about the population without studying every individual, which is often impractical or expensive.

Characteristics:

Subset: Represents a portion of the population.
Statistic: A numerical value describing the sample (e.g., sample mean x̄, sample variance s²).
Random Sampling: Samples should be randomly selected to reduce bias and improve representativeness.

Example samples:

A group of 30 students from a school — to estimate the average student height.
Four stores across the country — to predict the most purchased product.
A group of 500 consumers from a city — to estimate city-wide purchasing trends.
A group of 150 patients — to test a drug’s effectiveness before wider rollout.
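
The difference between a parameter and a statistic can be illustrated with a short NumPy sketch. The population here is simulated: the school size of 1,200 and the height distribution are assumptions made up for the example, not real data. The sample of 30 students matches the first example above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: heights (cm) of all 1,200 students in a school.
population = rng.normal(loc=165.0, scale=8.0, size=1200)

# Parameters describe the whole population (ddof=0 divides by N).
mu = population.mean()
sigma2 = population.var(ddof=0)

# Sample: 30 students drawn at random, without replacement.
sample = rng.choice(population, size=30, replace=False)

# Statistics describe the sample (ddof=1 divides by n - 1).
x_bar = sample.mean()
s2 = sample.var(ddof=1)

print(f"population mean = {mu:.2f}, sample mean = {x_bar:.2f}")
print(f"population variance = {sigma2:.2f}, sample variance = {s2:.2f}")
```

The sample values will usually be close to, but not equal to, the population values. Rerunning with a different seed gives a different sample and therefore different statistics, while the parameters of a fixed population do not change.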

Types of Sampling

There are several techniques for selecting a sample from a population.
The choice depends on the research goal, data availability, and required accuracy.

1. Probability Sampling

Each member of the population has a known and non-zero chance of being selected.
This reduces selection bias and allows for statistical inference.

Common methods:

Simple Random Sampling: Every member has an equal chance of being selected.
Example: Drawing names out of a hat.
Systematic Sampling: Selecting every nth member after a random start.
Example: Surveying every 10th customer entering a store.
Stratified Sampling: Dividing the population into strata (groups) based on shared characteristics, then randomly sampling within each.
Example: Dividing employees by department and randomly selecting from each department.
Cluster Sampling: Dividing the population into clusters, randomly selecting a few clusters, and surveying all members within them.
Example: Selecting a few schools and surveying all teachers in those schools.
Multistage Sampling: Combining several sampling methods in stages.
Example: Selecting clusters (schools), then randomly sampling individuals (students) within them.
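
The five methods above can be sketched with pandas as follows. This is a minimal illustration rather than a definitive implementation: the employee table, the `department` and `office` columns, and all sample sizes are hypothetical. Offices stand in for clusters here, in the same role as the schools in the examples above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical population: 1,000 employees across 4 departments and 20 offices.
N = 1000
population = pd.DataFrame({
    "employee_id": np.arange(N),
    "department": rng.choice(["sales", "eng", "ops", "hr"], size=N),
    "office": rng.integers(0, 20, size=N),
})

# Simple random sampling: every employee has the same chance of selection.
simple = population.sample(n=100, random_state=1)

# Systematic sampling: random start, then every k-th row.
k = 10
start = int(rng.integers(0, k))
systematic = population.iloc[start::k]

# Stratified sampling: draw 10% at random within each department.
stratified = population.groupby("department").sample(frac=0.1, random_state=1)

# Cluster sampling: pick 3 offices at random and keep everyone in them.
offices = population["office"].unique()
chosen = rng.choice(offices, size=3, replace=False)
cluster = population[population["office"].isin(chosen)]

# Multistage sampling: pick 5 offices, then randomly sample 10 employees in each.
stage1 = rng.choice(offices, size=5, replace=False)
multistage = (
    population[population["office"].isin(stage1)]
    .groupby("office")
    .sample(n=10, random_state=1)
)

for name, df in [("simple", simple), ("systematic", systematic),
                 ("stratified", stratified), ("cluster", cluster),
                 ("multistage", multistage)]:
    print(f"{name:>10}: {len(df)} rows")
```

Systematic sampling as written assumes the row order has no periodic pattern that lines up with the step `k`. The multistage step also assumes every chosen office has at least 10 employees, which holds comfortably for this simulated table.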

2. Non-Probability Sampling

The probability of selecting any given member is unknown, and some members may have no chance of being selected at all.
These methods are easier and cheaper but may introduce bias, limiting generalisability.
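
The bias risk can be made concrete with a small simulation. Everything in it is an assumption chosen for illustration: a city of 10,000 consumers, 20% of whom live near a store and spend more on average. It compares an estimate from a random sample of 500 consumers with one from 500 consumers who were simply the easiest to reach.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical population: monthly spend of 10,000 consumers in a city.
# Assumption for illustration: consumers living near the store spend more.
N = 10_000
near_store = rng.random(N) < 0.2
spend = np.where(near_store, rng.normal(120, 20, N), rng.normal(80, 20, N))

true_mean = spend.mean()

# Probability sample: 500 consumers chosen at random from the whole city.
random_idx = rng.choice(N, size=500, replace=False)
random_estimate = spend[random_idx].mean()

# Non-probability sample: 500 consumers taken only from those near the store,
# because they are the easiest to reach.
near_idx = np.flatnonzero(near_store)
easy_idx = rng.choice(near_idx, size=500, replace=False)
easy_estimate = spend[easy_idx].mean()

print(f"true mean spend      = {true_mean:.1f}")
print(f"random sample        = {random_estimate:.1f}")
print(f"easy-to-reach sample = {easy_estimate:.1f}")
```

In this setup the easy-to-reach sample overestimates the city-wide mean by a wide margin. Collecting more respondents the same way would not fix it, because consumers far from the store never enter the sample.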

Common methods:

Convenience Sampling: Selecting individuals that are easiest to reach.
Example: Surveying whichever shoppers happen to be in a store at the time.
Judgmental (Purposive) Sampling: Selecting participants based on the researcher’s judgment or expertise.
Example: Choosing experts in a field for a study.
Snowball Sampling: Existing participants recruit new ones from their networks.
Example: Asking participants to refer friends or colleagues.
Quota Sampling: Ensuring certain characteristics are represented by setting quotas (e.g., age, gender), but not selecting participants randomly.