Data Exploration
Pandas Basics
Published Sep 29 2025, updated Sep 30 2025
When you load a dataset into a Pandas DataFrame, the first step is usually to explore and understand the data. This involves checking data types, missing values, summary statistics, unique values, and correlations. Some useful habits:
- Combine methods: Use .info(), .describe(), and .value_counts() together to get a quick holistic view.
- Visual inspection: Use .head() and .tail() frequently to catch formatting or entry errors.
- Investigate anomalies: Outliers or unexpected categories are easier to spot with .value_counts() and .unique().
- Correlations early: .corr() helps identify potential predictive relationships or redundant features.
- Missing data: Always check .isna().sum().
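The habits above can be bundled into one small helper. This is a minimal sketch, not part of the original guide: the function name quick_overview and the sample values are hypothetical, chosen only for illustration.

```python
import io

import pandas as pd


def quick_overview(df: pd.DataFrame) -> dict:
    """Collect several exploration summaries in one pass (hypothetical helper)."""
    buf = io.StringIO()
    df.info(buf=buf)  # info() prints by default; buf= captures the text instead
    non_numeric = df.select_dtypes(exclude="number").columns
    return {
        "info": buf.getvalue(),
        "describe": df.describe(),
        # dropna=False keeps missing entries visible in the counts
        "value_counts": {col: df[col].value_counts(dropna=False) for col in non_numeric},
        "missing": df.isna().sum(),
    }


# Invented sample data, for illustration only.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Dept": ["Sales", "IT", "Sales", None],
        "Age": [25, 31, None, 47],
    }
)

overview = quick_overview(df)
print(overview["info"])
print(overview["missing"])
```

Returning a dict rather than printing everything makes each summary easy to inspect on its own, for example overview["value_counts"]["Dept"].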
.info()
Shows a concise summary of the DataFrame, including:
- Number of rows and columns
- Column names
- Non-null counts
- Data types of each column
This makes it quick to see missing values and data types.
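As a small illustration, the snippet below calls .info() on an invented DataFrame (the column names and values are assumptions, not the guide's data). Note that .info() prints its report and returns None, so there is nothing to assign.

```python
import pandas as pd

# Invented sample data; the None entries become missing values.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Name": ["Ana", "Ben", None, "Dee"],
        "Age": [25.0, 31.0, None, 47.0],
    }
)

# Prints the index range, each column's non-null count and dtype,
# and the approximate memory usage.
result = df.info()

# A non-null count lower than the number of rows signals missing values
# (here: Name and Age each have 3 non-null entries out of 4 rows).
```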
.head() and .tail()
.head(n) shows the first n rows (default 5). .tail(n) shows the last n rows.
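A short sketch of both calls, using a made-up ten-row DataFrame so the slices are easy to follow:

```python
import pandas as pd

# Invented data: ten rows with sequential IDs.
df = pd.DataFrame({"ID": list(range(1, 11)), "Score": list(range(10, 110, 10))})

first_rows = df.head()    # no argument: first 5 rows
first_three = df.head(3)  # first 3 rows
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```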
.shape
Returns a tuple: (number of rows, number of columns).
.dtypes
Shows the data type of each column.
.columns
Returns the column names as an Index object; call .tolist() on it to get a plain Python list.
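The three attributes .shape, .dtypes, and .columns can be checked together. The sketch below uses invented data; note that all three are attributes, so they take no parentheses.

```python
import pandas as pd

# Invented sample data, for illustration only.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Name": ["Ana", "Ben", "Cy"],
        "Salary": [50000.0, 54000.0, 61000.0],
    }
)

n_rows, n_cols = df.shape            # tuple: (rows, columns)
column_types = df.dtypes             # Series mapping column name -> dtype
column_names = df.columns.tolist()   # Index converted to a plain list

print(n_rows, n_cols)
print(column_types)
print(column_names)
```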
.unique()
Shows all unique values in a column.
.nunique()
Counts the number of unique values.
.value_counts()
Counts the frequency of each unique value.
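The three methods .unique(), .nunique(), and .value_counts() are typically called on a single column. A minimal sketch with an invented Dept column:

```python
import pandas as pd

# Invented sample data, for illustration only.
df = pd.DataFrame({"Dept": ["Sales", "IT", "Sales", "HR", "IT", "Sales"]})

distinct_values = df["Dept"].unique()    # distinct values, in order of first appearance
n_distinct = df["Dept"].nunique()        # how many distinct values there are
counts = df["Dept"].value_counts()       # frequency of each value, most frequent first
shares = df["Dept"].value_counts(normalize=True)  # same, as proportions

print(list(distinct_values))
print(n_distinct)
print(counts)
```

An unexpected category, such as a misspelled department name, would show up here as an extra low-frequency entry.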
.describe()
Provides summary statistics for numeric columns by default:
- Count, mean, standard deviation
- Minimum and maximum values
- Quartiles (25%, 50%, 75%)
For categorical (string) columns, the summary reports different fields:
- count = number of entries
- unique = number of distinct values
- top = most frequent value
- freq = frequency of the top value
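Both kinds of summary can be sketched as follows. The data is invented, and selecting the non-numeric columns with select_dtypes before calling .describe() is just one way to get the categorical summary; it is an assumption here, not necessarily the call the guide used.

```python
import pandas as pd

# Invented sample data, for illustration only.
df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40, 45],
        "Salary": [50000, 53000, 61000, 64000, 72000],
        "Dept": ["Sales", "IT", "Sales", "HR", "Sales"],
    }
)

# Default: numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = df.describe()

# Non-numeric columns: count, unique, top, freq.
categorical_summary = df.select_dtypes(exclude="number").describe()

print(numeric_summary)
print(categorical_summary)
```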
.corr()
Provides a matrix of pairwise correlations between numeric columns (Pearson correlation by default).
Interpretation: ID and Age are perfectly correlated here (since IDs increase with Age in this sample). Salary also has a strong positive correlation with both (≈ 0.99).
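The sketch below reproduces that kind of pattern. The values are invented (they are not the guide's sample): Age is an exact linear function of ID, and Salary rises almost linearly with both. Passing numeric_only=True skips the string column, which recent pandas versions would otherwise refuse to correlate.

```python
import pandas as pd

# Invented sample data, for illustration only.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
        "Age": [25, 30, 35, 40, 45],
        "Salary": [50000, 53000, 61000, 64000, 72000],
    }
)

# Pearson correlation between every pair of numeric columns.
corr = df.corr(numeric_only=True)
print(corr.round(2))

id_age = corr.loc["ID", "Age"]         # exact linear relationship, so this is 1
age_salary = corr.loc["Age", "Salary"] # strong, but not perfect
```

A pair of features with a correlation near 1 carries largely redundant information, which is worth knowing before modeling.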
.isna() or .isnull()
.isna() (or its alias .isnull()) identifies missing values. Chaining .sum() counts the missing values per column.
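A brief sketch of the missing-value check on invented data, where None marks the gaps:

```python
import pandas as pd

# Invented sample data; None entries are treated as missing.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Name": ["Ana", None, "Cy", "Dee"],
        "Age": [25, None, None, 47],
    }
)

mask = df.isna()                   # boolean DataFrame: True where a value is missing
missing_per_column = mask.sum()    # True counts as 1, so this counts gaps per column
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```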
.sample(n)
Randomly selects n rows.
Useful for quick inspection without printing the entire dataset.
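A minimal sketch on invented data. The random_state argument is optional; it is used here only so the same rows come back on every run.

```python
import pandas as pd

# Invented data: 100 rows.
df = pd.DataFrame({"ID": list(range(1, 101)), "Value": list(range(100, 0, -1))})

# Three random rows; fixing random_state makes the draw reproducible.
peek = df.sample(3, random_state=42)
print(peek)
```

Unlike .head(), a random sample is not biased toward however the file happens to be sorted.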