Preparing the Data

Machine Learning Fundamentals with Python

Published Nov 16 2025

Clustering, Images, K-Means, Linear Regression, Logistic Regression, Machine Learning, Neural Networks, NLP, NumPy, Python, Random Forests, scikit-learn, Supervised Learning, Unsupervised Learning

Machine learning starts and ends with data. Before training a model, we need to load, inspect, clean, and prepare the dataset so it’s suitable for algorithms to learn from.

Why Data Preparation Matters

A famous saying in ML:

“Garbage in → garbage out.”

Even the most advanced models will perform poorly if the data is messy, inconsistent, or incorrectly formatted.
In real ML projects, data preparation and exploration usually take up most of the time; 60–80% is a commonly cited estimate.

Loading Data

Data can come from CSV files, databases, or APIs.
Example Python code loading the Iris dataset from Seaborn:

import pandas as pd

import seaborn as sns

# Load Iris dataset

df = sns.load_dataset("iris")

# Display first few rows

print(df.head())

# Check dataset shape

print("\nShape of dataset:", df.shape)

# Check column names and data types

print("\nColumn info:")

df.info()  # info() prints its report directly and returns None, so no print() is needed

Explanation:

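.head() shows the first five rows by default.
.shape returns the number of rows and columns as a tuple.
.info() lists each column’s data type and non-null count, which reveals missing values.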
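
The same inspection steps apply when the data arrives as a CSV file rather than from Seaborn. The sketch below is a minimal illustration, assuming a small in-memory CSV (the contents of `csv_text` are made up for demonstration) in place of a real file path:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,,virginica
"""

# pd.read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.head())
print("Shape:", df_csv.shape)  # (rows, columns)
print(df_csv.dtypes)           # data type of each column
print(df_csv.isna().sum())     # missing values per column
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file's path.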

Exploring the Data (EDA)

Exploratory Data Analysis (EDA) helps you understand what’s in your dataset before you model it.

Look at summaries and distributions:

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

# Load Iris dataset

df = sns.load_dataset("iris")

# Basic statistics

print(df.describe())

# Pairplot: visualize relationships between features

sns.pairplot(df, hue='species')

plt.show()

# Correlation heatmap

plt.figure(figsize=(6,4))

sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")

plt.title("Feature Correlation Heatmap")

plt.tight_layout()

plt.show()

Explanation:

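.describe() gives the count, mean, standard deviation, min, quartiles, and max of each numeric column.
pairplot() helps visualize pairwise relationships between features, colored here by species.
heatmap() shows how strongly each pair of features is linearly correlated (Pearson correlation, the default of .corr()).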
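
The plots can be complemented with a few numeric checks. The sketch below is one possible approach, assuming scikit-learn's bundled copy of Iris (`load_iris(as_frame=True)`) rather than the Seaborn loader, so the column names differ from those used above:

```python
from sklearn.datasets import load_iris

# scikit-learn ships Iris locally; as_frame=True returns pandas objects.
iris = load_iris(as_frame=True)
df_iris = iris.frame  # four feature columns plus an integer 'target' column

# Class balance: how many samples are there per species?
class_counts = df_iris["target"].value_counts().sort_index()
print(class_counts)

# Per-class feature means hint at which features separate the species.
print(df_iris.groupby("target").mean())

# Pairwise Pearson correlation between the numeric features.
corr = df_iris.drop(columns="target").corr()
print(corr.round(2))
```

A balanced class count and a few strongly correlated feature pairs are the kind of facts worth knowing before choosing a model.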

Pairplot of Iris dataset

Heatmap of Iris dataset

Handling Missing Data

Datasets often have missing or invalid values. You can either remove or fill them.

# Check for missing values

print("Missing values per column:")

print(df.isnull().sum())

# Drop rows with missing values

df_clean = df.dropna()

# Or fill missing values with the mean

df_filled = df.fillna(df.mean(numeric_only=True))

Explanation:

dropna() removes every row that contains at least one missing value.
fillna() replaces missing values (e.g., with mean or median).
Always inspect the effect before choosing a method.
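
Because the Iris dataset has no missing values, the effect of each strategy is easier to see on a small frame with gaps. A minimal sketch, assuming made-up numbers in a hypothetical `toy` DataFrame:

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values are illustrative, not from a real dataset).
toy = pd.DataFrame({
    "height": [1.0, 2.0, np.nan, 4.0, 100.0],
    "weight": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Strategy 1: keep only rows with no missing values.
dropped = toy.dropna()

# Strategy 2: fill each gap with its column mean or column median.
mean_filled = toy.fillna(toy.mean(numeric_only=True))
median_filled = toy.fillna(toy.median(numeric_only=True))

print("Rows before:", len(toy), "after dropna:", len(dropped))
print("Mean fill for height:  ", mean_filled.loc[2, "height"])
print("Median fill for height:", median_filled.loc[2, "height"])
```

Here the outlier (100.0) pulls the mean fill up to 26.75, while the median fill is 3.0, and dropping rows discards two of the five samples.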

Encoding Categorical Data

Most machine learning models expect numeric input, not text. We must convert categorical (string) data into numeric form.

# Example: encoding 'species' column into numeric labels

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

df['species_encoded'] = encoder.fit_transform(df['species'])

print(df[['species', 'species_encoded']].head())

Explanation:

LabelEncoder converts text labels such as "setosa", "versicolor", and "virginica" into the integer codes 0, 1, and 2, assigned in sorted order of the labels.
Integer codes imply an order, so for non-ordered categories you can also use one-hot encoding (pd.get_dummies()):

# One-hot encoding example

df_onehot = pd.get_dummies(df, columns=['species'])

print(df_onehot.head())
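
A short sketch of how the two encodings can be inspected, assuming a hypothetical `labels` list in place of the full dataset. `classes_` and `inverse_transform` belong to scikit-learn's `LabelEncoder`; the `dtype=int` argument to `pd.get_dummies` is used here only to make the 0/1 columns explicit:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the 'species' column.
labels = ["virginica", "setosa", "versicolor", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; each label's index is its code.
print(list(encoder.classes_))
print(codes)

# inverse_transform maps the codes back to the original text labels.
restored = encoder.inverse_transform(codes)
print(list(restored))

# One-hot encoding: one 0/1 column per category, with no implied order.
onehot = pd.get_dummies(pd.Series(labels, name="species"), prefix="species", dtype=int)
print(onehot)
```

Keeping the fitted encoder around makes it possible to turn model predictions back into readable labels later.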

Feature Scaling

Some algorithms (e.g., K-Means, SVM, Neural Networks) are sensitive to the scale of data. If one feature has much larger values than another, it can dominate the model’s learning.
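
To make the scale problem concrete, the sketch below measures how much each feature contributes to a Euclidean distance, the quantity K-Means relies on. It assumes two made-up points with hypothetical features, age in years and income in dollars:

```python
import numpy as np

# Two made-up people: [age in years, income in dollars].
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])

# Squared difference per feature, and each feature's share of the total.
sq_diff = (a - b) ** 2
share = sq_diff / sq_diff.sum()
print("Euclidean distance:", np.sqrt(sq_diff.sum()))
print("Share from [age, income]:", share.round(4))

# Same people, but with income expressed in thousands of dollars.
units = np.array([1.0, 1_000.0])
sq_diff_k = ((a - b) / units) ** 2
share_k = sq_diff_k / sq_diff_k.sum()
print("Share from [age, income] with income in $1000s:", share_k.round(4))
```

With income in dollars, the 35-year age gap contributes well under 1% of the squared distance; changing only the unit of income reverses which feature dominates.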

Two common techniques: