
Python EDA Guide: Key Functions for Exploratory Data Analysis

A hands-on tutorial on Exploratory Data Analysis in Python using built-in functions and simple plots to quickly understand any dataset.

Drake Nguyen

Founder · System Architect

3 min read

In previous tutorials, we explored how to perform Exploratory Data Analysis (EDA) using visual techniques. In this article, we focus on Python functions that help you understand data quickly—without relying only on charts. EDA is a key step because it reveals the structure of your dataset, distributions, missing values, and relationships between variables. Let’s get started.


Exploratory Data Analysis (EDA)

  • EDA is used to investigate a dataset and summarize key insights.
  • It helps you understand data quality, distributions, and potential issues (e.g., missing values).
  • You can explore data using graphs or Python functions (non-graphical methods).
  • Common analysis types: Univariate (single variable) and Bivariate (two variables, often including the target).
  • Examples of non-graphical EDA: shape, info, describe, isnull, dtypes, and more.
  • Examples of graphical EDA: scatter plots, box plots, bar charts, density plots, and correlation heatmaps.

Load the Data

First, load the dataset into Python. In this tutorial, we’ll use the Titanic dataset as a simple example for EDA.

# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns

# Load the dataset
df = pd.read_csv('titanic.csv')

# Preview the dataset
df.head()

Once the data is loaded, we can start exploring it step by step.


1. Basic Information About the Dataset

A good starting point is to understand the dataset structure: columns, types, non-null counts, and memory usage. You can do this with df.info(). For quick summary statistics of numerical columns, use df.describe().

# Basic structure and column details
df.info()

# Descriptive statistics for numerical columns
df.describe()

These two functions provide a fast overview of data health and basic distributions.


2. Duplicate Values

Duplicate rows can distort analysis results. Check how many duplicates exist using:

# Count duplicate rows
df.duplicated().sum()

If the output is 0, your dataset has no duplicate records.
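If duplicates do appear, you can remove them with drop_duplicates(). A minimal sketch, using a small inline frame as a stand-in for df so the effect is visible:

```python
import pandas as pd

# Small stand-in frame with one duplicated row
sample = pd.DataFrame({'Pclass': [1, 1, 3], 'Sex': ['male', 'male', 'female']})

print(sample.duplicated().sum())  # 1 duplicate row

# Keep the first occurrence of each row, discard the rest
deduped = sample.drop_duplicates()
print(len(deduped))  # 2
```

By default, drop_duplicates() keeps the first occurrence; pass keep='last' or keep=False to change that.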


3. Unique Values in Columns

For categorical features, it’s useful to see all distinct values using unique(). This helps you understand possible categories and validate data consistency.

# Unique values in selected columns
df['Pclass'].unique()
df['Survived'].unique()
df['Sex'].unique()
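Two close companions of unique() are nunique(), which counts the distinct values, and value_counts(), which shows how often each one appears. A quick sketch with a small stand-in frame:

```python
import pandas as pd

# Small stand-in for the Titanic frame
sample = pd.DataFrame({'Sex': ['male', 'female', 'male', 'male']})

# How many distinct categories, and how often each appears
print(sample['Sex'].nunique())        # 2
print(sample['Sex'].value_counts())   # male: 3, female: 1
```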

4. Visualize Category Counts

After checking unique values, you often want to see how frequently each category appears. A quick way is to use a count plot.

# Count plot for a categorical column
sns.countplot(x='Pclass', data=df)

Combining function-based exploration with small visuals gives a clearer picture of the dataset.


5. Find Missing (Null) Values

Missing values are common and must be handled carefully. Start by counting nulls per column:

# Count missing values per column
df.isnull().sum()

Columns like Age or Cabin often contain missing values in the Titanic dataset.
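Raw counts are more meaningful as percentages, since a column with 80% missing values usually needs different treatment than one with 2%. A small sketch (again on an inline stand-in frame):

```python
import pandas as pd
import numpy as np

# Stand-in frame: Age has some missing values, Fare has none
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, np.nan],
    'Fare': [7.25, 71.28, 8.05, 53.10],
})

# Share of missing values per column, as a percentage
missing_pct = sample.isnull().mean() * 100
print(missing_pct)  # Age: 50.0, Fare: 0.0
```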


6. Replace Missing Values

You can replace missing values using replace() or fillna(). Below is a simple example that fills nulls with 0. Note that the fill value should match the column's type: replacing NaN with the string '0' would silently turn numeric columns into text. (In practice, you may prefer mean/median/mode depending on the feature.)

# Replace NaN with 0 (example approach)
df.replace(np.nan, 0, inplace=True)

# Verify missing values again
df.isnull().sum()

Always choose a replacement strategy that makes sense for your data and downstream tasks.
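To make the mean/median/mode idea concrete, here is a sketch on a small stand-in frame: the median for a numerical column, the mode (most frequent value) for a categorical one.

```python
import pandas as pd
import numpy as np

# Stand-in frame with one missing value in each column
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, 26.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Numerical column: fill with the median
sample['Age'] = sample['Age'].fillna(sample['Age'].median())

# Categorical column: fill with the most frequent value (mode)
sample['Embarked'] = sample['Embarked'].fillna(sample['Embarked'].mode()[0])

print(sample.isnull().sum().sum())  # 0
```

The median is robust to outliers (a few extreme fares won't skew it), which is why it is often preferred over the mean for skewed columns.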


7. Check Data Types

Knowing column data types helps avoid analysis errors and ensures correct preprocessing. Use:

# Show data types for each column
df.dtypes
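If a column has the wrong type (for example, numbers stored as strings), you can convert it with astype() before doing numerical analysis. A small sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in frame: Survived is stored as strings
sample = pd.DataFrame({'Survived': ['0', '1', '1']})

print(sample.dtypes)  # Survived is object (strings)

# Convert to an integer type so arithmetic works as expected
sample['Survived'] = sample['Survived'].astype(int)

print(sample['Survived'].sum())  # 2
```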

8. Filter the Data

Filtering lets you focus on a subset of the dataset based on a condition. For example, here we select passengers in first class:

# Filter: passengers with Pclass == 1
df[df['Pclass'] == 1].head()
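Conditions can also be combined. Use & for "and", | for "or", and wrap each condition in parentheses, since these operators bind tighter than ==. A sketch on a stand-in frame:

```python
import pandas as pd

# Small stand-in frame
sample = pd.DataFrame({
    'Pclass': [1, 3, 1, 2],
    'Sex': ['female', 'male', 'male', 'female'],
})

# Combine conditions: first-class AND female
first_class_women = sample[(sample['Pclass'] == 1) & (sample['Sex'] == 'female')]
print(len(first_class_women))  # 1
```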

9. A Quick Box Plot

Box plots help you understand distributions and detect outliers quickly. Here’s a simple example using the Fare column:

# Box plot for a numerical column
df[['Fare']].boxplot()
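A numerical companion to the box plot is a per-group summary with groupby(), which puts exact numbers behind what the plot shows. A sketch with a small stand-in frame:

```python
import pandas as pd

# Stand-in frame: fares for two passenger classes
sample = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Fare': [71.28, 53.10, 7.25, 8.05],
})

# Median fare per class -- the middle line of each box
print(sample.groupby('Pclass')['Fare'].median())
```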

10. Correlation Analysis

Correlation measures the strength of linear relationships between numerical variables. In recent pandas versions, pass numeric_only=True so that text columns (like Name or Sex) are skipped rather than raising an error:

# Correlation matrix (numeric columns only)
df.corr(numeric_only=True)

Values range from -1 to +1: +1 indicates a perfect positive linear relationship, -1 a perfect negative one, and values near 0 indicate little or no linear relationship.

For a clearer view, visualize it with a heatmap:

# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True))
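A common non-graphical follow-up is to rank features by how strongly they correlate with the target. A sketch on a small stand-in frame (the values are illustrative, not real Titanic data):

```python
import pandas as pd

# Stand-in frame with a target column and two numeric features
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Fare': [7.25, 71.28, 53.10, 8.05],
    'Age': [22.0, 38.0, 26.0, 35.0],
})

corr = sample.corr(numeric_only=True)

# Rank features by their correlation with the target
print(corr['Survived'].sort_values(ascending=False))
```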

Ending Note

EDA is one of the most important steps in any data project. It helps you validate assumptions, detect data quality issues, and understand relationships before moving to modeling or reporting.

In this tutorial, we covered practical Python functions and a few quick visual checks to help you explore a dataset efficiently. Keep practicing with different datasets to build stronger intuition.

Happy Python! 🐍
