Data Exploration

Pandas Basics


Published Sep 29 2025, updated Oct 24 2025


When you load a dataset into a Pandas DataFrame, the first step is usually to explore and understand the data. This involves checking data types, missing values, summary statistics, unique values, and correlations. Some useful habits:

Combine methods: Use .info() + .describe() + .value_counts() to get a quick holistic view.
Visual inspection: Use .head() and .tail() frequently to catch formatting or entry errors.
Investigate anomalies: Outliers or unexpected categories are easier to spot with .value_counts() and .unique().
Correlations early: .corr() helps identify potential predictive relationships or redundant features.
Missing data: Always check .isna().sum() to count the missing values in each column.
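As a rough sketch of how these habits combine, the snippet below gathers a few first-look facts into one dictionary. The sample data and the helper name quick_overview are made up for illustration; they are not part of pandas.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the examples in this article.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Carol", "Dan", "Eve"],
    "Age": [25.0, 30.0, None, 40.0, 45.0],  # one missing value on purpose
    "City": ["Chicago", "Houston", "Chicago", "New York", "Los Angeles"],
})

def quick_overview(frame: pd.DataFrame) -> dict:
    """Collect a few first-look facts about a DataFrame (hypothetical helper)."""
    return {
        "shape": frame.shape,                                   # (rows, columns)
        "missing_per_column": frame.isna().sum().to_dict(),     # NaN count per column
        "dtypes": frame.dtypes.astype(str).to_dict(),           # dtype name per column
        "unique_per_column": frame.nunique().to_dict(),         # distinct non-missing values
    }

overview = quick_overview(df)
print(overview["shape"])
print(overview["missing_per_column"])
```

Returning a plain dictionary keeps the result easy to log or compare between datasets; printing df.info() and df.describe() directly works just as well for interactive use.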

.info()

Shows a concise summary of the DataFrame, including:

Number of rows and columns
Column names
Non-null counts
Data types of each column

df.info()

Example output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   ID      100 non-null    int64
 1   Name    100 non-null    object
 2   Age     95 non-null     float64
 3   City    100 non-null    object

Use it to spot missing values and data types at a glance. In this output, Age has 95 non-null entries out of 100, so 5 values are missing.
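The non-null counts that .info() prints can also be read programmatically. The sketch below uses a small made-up DataFrame; it captures the .info() text through the buf parameter and derives the missing counts from .count().

```python
import io
import pandas as pd

# Hypothetical five-row sample with one missing Age.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [25.0, 30.0, None, 40.0, 45.0],
})

# .info() prints to stdout by default; buf= redirects the text to a buffer.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# .count() returns the same non-null counts as a Series.
non_null = df.count()
missing = len(df) - non_null
print(missing)
```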

.head() and .tail()

.head(n) shows the first n rows (default 5).
.tail(n) shows the last n rows (default 5).

df.head(5)

df.tail(5)
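A short sketch of both methods on a made-up ten-row DataFrame. It also shows that a negative n is accepted, which returns everything except the last (or first) n rows.

```python
import pandas as pd

# Hypothetical data: ten rows with IDs 1 through 10.
df = pd.DataFrame({"ID": range(1, 11)})

first_three = df.head(3)         # rows with ID 1, 2, 3
last_two = df.tail(2)            # rows with ID 9, 10
all_but_last_two = df.head(-2)   # negative n: drop the last 2 rows

print(first_three)
print(last_two)
print(len(all_but_last_two))
```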

.shape

Returns (number of rows, number of columns).

df.shape

# Example output: (100, 4)
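Because .shape is a plain tuple, it can be unpacked directly. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical three-row, two-column sample.
df = pd.DataFrame({"ID": [1, 2, 3], "City": ["Chicago", "Houston", "Chicago"]})

n_rows, n_cols = df.shape   # tuple unpacking
print(f"{n_rows} rows x {n_cols} columns")

# len(df) gives the row count only; df.size gives rows * columns.
print(len(df), df.size)
```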

.dtypes

Shows data type of each column.

df.dtypes

# Example output:
# ID        int64
# Name     object
# Age     float64
# City     object
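In the example output Age is float64 rather than an integer type. A likely reason, assuming the column holds whole-number ages with some gaps, is that pandas' default integer dtype cannot store missing values, so the column falls back to float. The sketch below illustrates this with made-up Series, along with the nullable Int64 dtype as one alternative.

```python
import pandas as pd

ages_complete = pd.Series([25, 30, 35])                    # no gaps: integer dtype
ages_with_gap = pd.Series([25, 30, None])                  # None becomes NaN: float64
ages_nullable = pd.Series([25, 30, None], dtype="Int64")   # nullable integer dtype

print(ages_complete.dtype)
print(ages_with_gap.dtype)
print(ages_nullable.dtype)
```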

.columns

Returns the column names as an Index object. Use df.columns.tolist() if you need a plain Python list.

df.columns

# Output: Index(['ID', 'Name', 'Age', 'City'], dtype='object')
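A small sketch of two common uses of .columns, using an empty DataFrame with the same column names as the example output:

```python
import pandas as pd

# Empty DataFrame with the article's column names; no data is needed here.
df = pd.DataFrame(columns=["ID", "Name", "Age", "City"])

column_list = df.columns.tolist()   # plain Python list of names
has_age = "Age" in df.columns       # membership test on the Index
has_salary = "Salary" in df.columns

print(column_list, has_age, has_salary)
```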

.unique()

Returns the distinct values in a column as an array, in order of first appearance.

df['City'].unique()

# Output: array(['New York', 'Los Angeles', 'Chicago', 'Houston'], dtype=object)

.nunique()

Counts the number of unique values, excluding missing values by default.

df['City'].nunique()

# Output: 4
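The two methods treat missing values differently, which can make their results disagree by one. A minimal sketch with a made-up Series containing one missing entry:

```python
import pandas as pd

# Hypothetical column with a repeated value and one missing entry.
cities = pd.Series(["Chicago", "Houston", None, "Chicago"])

distinct = cities.unique()                       # includes the missing value
n_distinct = cities.nunique()                    # excludes missing values by default
n_with_missing = cities.nunique(dropna=False)    # counts the missing value too

print(distinct)
print(n_distinct, n_with_missing)
```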

.value_counts()

Counts the frequency of each unique value, sorted from most to least frequent. Missing values are excluded by default.

df['City'].value_counts()

# Output:
# New York       30
# Los Angeles    25
# Chicago        25
# Houston        20
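Two optional parameters are often useful at this stage: normalize=True returns shares instead of counts, and dropna=False adds a row for missing values. The sketch below uses made-up data.

```python
import pandas as pd

# Hypothetical column: six entries, one of them missing.
cities = pd.Series(["New York", "Chicago", "New York", None, "Houston", "New York"])

counts = cities.value_counts()                    # missing value is ignored
shares = cities.value_counts(normalize=True)      # fractions of the non-missing total
with_missing = cities.value_counts(dropna=False)  # missing value gets its own row

print(counts)
print(shares)
print(with_missing)
```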

.describe()

Provides summary statistics for numeric columns by default:

Count, mean, standard deviation
Minimum and maximum values
Quartiles (25%, 50%, 75%)

df.describe()

Example output (for a five-row sample that also has a Salary column):

            ID       Age        Salary
count  5.00000   5.00000      5.000000
mean   3.00000  35.00000  71000.000000
std    1.58114   7.90569  15968.719423
min    1.00000  25.00000  50000.000000
25%    2.00000  30.00000  60000.000000
50%    3.00000  35.00000  75000.000000
75%    4.00000  40.00000  80000.000000
max    5.00000  45.00000  90000.000000

For categorical (string) columns, pass include="object":

df.describe(include="object")

Example output:

         Name     City
count       5        5
unique      5        4
top     Alice  Chicago
freq        1        2

Summary of categorical (string) columns:

count = number of entries
unique = number of distinct values
top = most frequent value
freq = frequency of top value
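Both views can be pulled from the same DataFrame, and individual statistics can be read with .loc. The sketch below uses made-up data; include="all" combines the numeric and non-numeric summaries in one table.

```python
import pandas as pd

# Hypothetical five-row sample.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan", "Eve"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["Chicago", "Houston", "Chicago", "New York", "Los Angeles"],
})

numeric_stats = df.describe()            # numeric columns only
all_stats = df.describe(include="all")   # numeric and non-numeric together

# Rows are labelled by statistic name, so single values are easy to extract.
median_age = numeric_stats.loc["50%", "Age"]
top_city = all_stats.loc["top", "City"]
top_city_freq = all_stats.loc["freq", "City"]

print(median_age, top_city, top_city_freq)
```

In the combined table, statistics that do not apply to a column (for example mean for City) appear as NaN.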

.corr()

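Provides a pairwise correlation matrix (Pearson by default) for the numeric columns. In pandas 2.0 and later, pass numeric_only=True when the DataFrame also has string columns such as Name or City; otherwise the call raises an error.

df.corr(numeric_only=True)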

Example output:

              ID       Age    Salary
ID      1.000000  1.000000  0.986241
Age     1.000000  1.000000  0.986241
Salary  0.986241  0.986241  1.000000

Interpretation:

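ID and Age are perfectly correlated here because Age rises by the same amount with each ID in this sample. This is an artifact of the sample rather than a meaningful relationship, since ID is only an identifier.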
Salary also has a strong positive correlation with both (≈ 0.99).
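A minimal sketch of computing and reading a correlation matrix. The data is made up (the Salary values in particular are hypothetical and do not reproduce the matrix above), so treat it as an illustration of the calls rather than of the numbers.

```python
import pandas as pd

# Hypothetical sample; Name is a string column and must be excluded.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Carol", "Dan", "Eve"],
    "Age": [25, 30, 35, 40, 45],
    "Salary": [52000, 58000, 61000, 79000, 83000],
})

# numeric_only=True skips non-numeric columns instead of raising an error.
corr = df.corr(numeric_only=True)
print(corr)

# Single coefficients can be read from the matrix or computed per pair.
id_age = corr.loc["ID", "Age"]
age_salary = df["Age"].corr(df["Salary"])
print(id_age, age_salary)
```

Age is an exact linear function of ID in this sample, so their coefficient is 1; in practice, an identifier column like ID would usually be dropped before looking for predictive relationships.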