How To Graph Word Frequency Using matplotlib with Python 3
Practical guide to plot word frequency in Python with matplotlib: tokenization, counting with Counter, preparing rank-frequency lists, and creating log-log plots (Zipf). Includes complete word_freq.py script.
Drake Nguyen
Founder · System Architect
Introduction
This article shows how to plot word frequency in Python using matplotlib. You will learn a practical, reproducible workflow for reading a text file, counting words, preparing a frequency distribution, and visualizing the results with a log-log plot. The techniques apply to general text analysis tasks in Python, from simple word count visualizations to exploring Zipf's law.
Prerequisites
- Python 3 installed and a working development environment.
- matplotlib available (pip install matplotlib).
- Basic familiarity with Python data structures (dictionaries, lists) and the command line.
Step 1 — Create the program and imports
Create a file named word_freq.py. The example below uses matplotlib plus Python's standard library, including collections.Counter, for a compact and efficient word counter.
import re
import sys
import argparse
from collections import Counter
import matplotlib.pyplot as plt
Step 2 — Command-line arguments
Use argparse to accept a filename and a target word to highlight on the plot. You can add an optional --top argument to plot only the top N words.
parser = argparse.ArgumentParser(description='Plot word frequency in Python (log-log)')
parser.add_argument('filename', help='path to a plain text file')
parser.add_argument('word', help='word to highlight on the plot')
parser.add_argument('--top', type=int, default=200, help='number of top words to plot')
args = parser.parse_args()
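To check the argument handling without a terminal, you can pass a list of strings to parse_args() instead of letting it read sys.argv. The sketch below wraps the Step 2 parser in a build_parser() function and adds a positive_int validator for --top; both names are illustrative assumptions, not part of the original script.

```python
import argparse


def positive_int(text):
    """Hypothetical validator: accept only integers greater than zero."""
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError("--top must be at least 1")
    return value


def build_parser():
    """Build the same parser as Step 2, with --top validated (an assumption)."""
    parser = argparse.ArgumentParser(description="Plot word frequency in Python (log-log)")
    parser.add_argument("filename", help="path to a plain text file")
    parser.add_argument("word", help="word to highlight on the plot")
    parser.add_argument("--top", type=positive_int, default=200,
                        help="number of top words to plot")
    return parser


# Passing a list bypasses sys.argv, which is convenient for quick checks
args = build_parser().parse_args(["cities.txt", "fish", "--top", "1000"])
print(args.filename, args.word, args.top)
```

With this validator, a value such as --top 0 makes argparse print a usage error and exit instead of producing an empty plot.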
Step 3 — Read, tokenize, and count words
A robust approach is to normalize case and use a regular expression to extract word tokens. This avoids the pitfalls of a plain split(), which leaves punctuation attached to words and therefore counts "fish" and "fish," as different tokens.
def count_words(path):
    """Return a collections.Counter of word tokens from the file."""
    token_re = re.compile(r"\b\w+\b", flags=re.UNICODE)
    ctr = Counter()
    try:
        with open(path, 'r', encoding='utf-8') as fh:
            for line in fh:
                tokens = token_re.findall(line.lower())
                ctr.update(tokens)
    except FileNotFoundError:
        sys.stderr.write(f"Error: {path} not found\n")
        sys.exit(1)
    return ctr
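To see what the regular expression buys you, the following sketch tokenizes one made-up sentence both ways. The sample sentence and the TOKEN_RE name are assumptions for illustration; the pattern is the same one used in count_words.

```python
import re

TOKEN_RE = re.compile(r"\b\w+\b")

# Made-up sample line, chosen to include punctuation and an apostrophe
line = "The fish swam; the Fish (a big fish!) didn't stop."

naive = line.lower().split()            # whitespace split keeps punctuation attached
tokens = TOKEN_RE.findall(line.lower())  # regex keeps only runs of word characters

print(naive.count("fish"), naive)
print(tokens.count("fish"), tokens)
```

The whitespace split finds "fish" only twice because the third occurrence is stored as "fish!)". The regex finds all three, but note its own limitation: \w+ splits "didn't" into "didn" and "t", which is one reason to consider a dedicated tokenizer for more careful work.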
Step 4 — Prepare ranked data for plotting
Convert the counter into a list of (word, count) pairs sorted by descending frequency, then build parallel lists of ranks and frequencies. These two lists are all you need for a rank-frequency plot or a bar chart of the top words.
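def prepare_ranked_lists(counter, top_n=None):
    """Return (ranks, freqs) for the top_n most common words (all words if None)."""
    # most_common() returns (word, count) pairs sorted by descending count
    most_common = counter.most_common(top_n)
    freqs = [freq for _, freq in most_common]
    # Rank 1 is the most frequent word
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs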
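The plot in the next step also needs the rank and frequency of the word to highlight. One way to get them is a small lookup over the same most_common() ordering, sketched below. The find_rank name and the sample sentence are assumptions for illustration.

```python
from collections import Counter


def find_rank(counter, word, top_n=None):
    """Return (rank, freq) of word within the top_n words, or (None, None)."""
    target = word.lower()
    for rank, (token, freq) in enumerate(counter.most_common(top_n), start=1):
        if token == target:
            return rank, freq
    # The word is absent, or it falls outside the top_n cut-off
    return None, None


counts = Counter("the cat and the dog and the fish".split())
print(find_rank(counts, "The"))            # most frequent word
print(find_rank(counts, "and"))            # second most frequent word
print(find_rank(counts, "fish", top_n=2))  # outside the top 2
```

Words with equal counts still receive distinct consecutive ranks, so the rank of a tied word depends on the order in which most_common() returns it.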
Step 5 — Plotting (log-log rank vs frequency)
Use matplotlib's loglog to visualize the rank-frequency relationship (useful when exploring Zipf's law). The example also highlights a chosen word with a distinct marker.
def plot_rank_freq(ranks, freqs, highlight_rank=None, highlight_freq=None, word=None, filename=None):
    plt.title(f"Word frequency plot: {filename}")
    plt.xlabel('Rank (log scale)')
    plt.ylabel('Frequency (log scale)')
    # Log-log line for the overall distribution.
    # Base 10 is the default; the old basex/basey keywords no longer exist
    # in current matplotlib releases, so they are omitted here.
    plt.loglog(ranks, freqs, marker=',')
    # Highlight a particular word as a star
    if highlight_rank is not None and highlight_freq is not None:
        plt.scatter([highlight_rank], [highlight_freq], color='orange', marker='*', s=100, label=word)
        plt.legend()
    plt.tight_layout()
    plt.show()
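If you want a number to go with the picture, you can estimate the slope of the rank-frequency line on log-log axes. The sketch below is one simple way to do that, using an ordinary least-squares fit written with the standard library only. The fit_zipf_slope name and the synthetic data are assumptions for illustration; under an idealized Zipf distribution the slope would be close to -1, while real texts usually deviate from it.

```python
import math


def fit_zipf_slope(freqs):
    """Least-squares fit of log10(freq) against log10(rank).

    freqs must be positive counts sorted in descending order,
    so that freqs[0] belongs to rank 1. Returns (slope, intercept).
    """
    if len(freqs) < 2:
        raise ValueError("need at least two frequencies to fit a line")
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data that follows freq = 1000 / rank exactly (not real text)
ideal = [1000 / rank for rank in range(1, 51)]
slope, intercept = fit_zipf_slope(ideal)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

For real data you would pass the freqs list from prepare_ranked_lists. Keep in mind that the low-frequency tail, where many words share the same count, can pull the fitted slope away from the value suggested by the top-ranked words.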
Complete example: word_freq.py
#!/usr/bin/env python3
"""Plot a log-log rank-frequency chart for the words in a text file."""
import re
import sys
import argparse
from collections import Counter

import matplotlib.pyplot as plt


def count_words(path):
    """Return a collections.Counter of lowercased word tokens from the file."""
    token_re = re.compile(r"\b\w+\b", flags=re.UNICODE)
    ctr = Counter()
    try:
        with open(path, 'r', encoding='utf-8') as fh:
            for line in fh:
                tokens = token_re.findall(line.lower())
                ctr.update(tokens)
    except FileNotFoundError:
        sys.stderr.write(f"Error: {path} not found\n")
        sys.exit(1)
    return ctr


def plot_rank_freq(ranks, freqs, highlight_rank=None, highlight_freq=None, word=None, filename=None):
    """Draw the log-log rank-frequency line and optionally mark one word."""
    plt.title(f"Word frequency plot: {filename}")
    plt.xlabel('Rank (log scale)')
    plt.ylabel('Frequency (log scale)')
    # Base 10 is the default for loglog(); basex/basey are no longer accepted
    plt.loglog(ranks, freqs, marker=',')
    if highlight_rank is not None and highlight_freq is not None:
        plt.scatter([highlight_rank], [highlight_freq], color='orange', marker='*', s=100, label=word)
        plt.legend()
    plt.tight_layout()
    plt.show()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Plot word frequency in Python (log-log)')
    parser.add_argument('filename', help='path to a plain text file')
    parser.add_argument('word', help='word to highlight on the plot')
    parser.add_argument('--top', type=int, default=200, help='number of top words to plot')
    args = parser.parse_args()

    counter = count_words(args.filename)
    target = args.word.lower()
    if target not in counter:
        sys.stderr.write(f"Error: '{args.word}' not found in {args.filename}\n")
        sys.exit(1)

    most = counter.most_common(args.top)
    ranks = list(range(1, len(most) + 1))
    freqs = [f for _, f in most]

    # Find rank and frequency for the highlighted word (if within top N)
    highlight_rank = None
    highlight_freq = None
    for i, (w, f) in enumerate(most, start=1):
        if w == target:
            highlight_rank = i
            highlight_freq = f
            break
    if highlight_rank is None:
        # The word exists in the file but falls outside the plotted range
        sys.stderr.write(f"Warning: '{args.word}' is not in the top {args.top} words; no marker drawn\n")

    plot_rank_freq(ranks, freqs, highlight_rank, highlight_freq, args.word, args.filename)
Step 6 — Run the script
Download a text file (for example from Project Gutenberg) and run:
python word_freq.py cities.txt fish --top 1000
This will open a matplotlib window with a log-log word frequency plot and an orange star marking the chosen word.
Code improvements and alternatives
- For more advanced tokenization or stopword filtering, use NLTK or spaCy.
- Plot a bar chart of the top N words with plt.bar for a standard word frequency plot, or use a histogram to view the frequency distribution.
- To inspect Zipf's law, use the log-log plot above; the rank-frequency relationship often approximates a straight line on log-log axes.
- Export figures to files with plt.savefig('word_freq.png') when running headless or saving results for reports.
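The bar chart and file export suggestions above can be combined in a few lines. This is a minimal sketch under stated assumptions: the top_words and save_bar_chart names, the figure size, and the default output path are illustrative choices, and the non-interactive Agg backend is selected so the code can run on a machine without a display.

```python
from collections import Counter


def top_words(counter, n=10):
    """Return parallel lists (words, counts) for the n most common words."""
    most = counter.most_common(n)
    words = [word for word, _ in most]
    counts = [count for _, count in most]
    return words, counts


def save_bar_chart(counter, out_path="word_freq.png", n=10):
    """Write a bar chart of the n most common words to out_path."""
    # Imported here so the Agg backend is selected before pyplot loads
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    words, counts = top_words(counter, n)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(words, counts)
    ax.set_xlabel("Word")
    ax.set_ylabel("Frequency")
    ax.set_title(f"Top {len(words)} words")
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)  # release the figure's memory
    return out_path


# Small made-up sample to show the data that feeds the chart
sample = Counter("a a a b b c".split())
print(top_words(sample, 2))
```

In the tutorial's script, calling save_bar_chart(count_words(args.filename)) would write word_freq.png to the current directory instead of opening a window.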
Conclusion
This tutorial covered how to plot word frequency in Python using matplotlib, from reading and tokenizing text to producing a log-log rank-frequency plot and highlighting a specific word. The same pattern scales to larger corpora and is a solid starting point for visual text analysis and exploring frequency distributions.