Tutorial

How To Graph Word Frequency Using matplotlib with Python 3

Practical guide to plot word frequency in Python with matplotlib: tokenization, counting with Counter, preparing rank-frequency lists, and creating log-log plots (Zipf). Includes complete word_freq.py script.

Drake Nguyen

Founder · System Architect


Introduction

This article shows how to plot word frequency in Python using matplotlib. You will learn a practical, reproducible workflow for reading a text file, counting words, preparing a frequency distribution, and visualizing the results with a log-log plot. The techniques here are useful for general text analysis tasks in Python, from simple word count visualizations to exploring Zipf's law.

Prerequisites

  • Python 3 installed and a working development environment.

  • matplotlib available (pip install matplotlib).

  • Basic familiarity with Python data structures (dictionaries, lists) and the command line.

Step 1 — Create the program and imports

Create a file named word_freq.py. The example uses only Python's standard library (including collections.Counter for compact, efficient counting) plus matplotlib for plotting.

import re
import sys
import argparse
from collections import Counter
import matplotlib.pyplot as plt

Step 2 — Command-line arguments

Use argparse to accept a filename and a target word to highlight on the plot. You can add an optional --top argument to plot only the top N words.

parser = argparse.ArgumentParser(description='Plot word frequency in Python (log-log)')
parser.add_argument('filename', help='path to a plain text file')
parser.add_argument('word', help='word to highlight on the plot')
parser.add_argument('--top', type=int, default=500, help='plot only the top N words')
args = parser.parse_args()
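
A quick way to check the parser without a shell is to pass an explicit argument list to parse_args(). The sketch below wraps the same three arguments in a hypothetical build_parser() helper (not part of the original script); the file name and word are placeholder values.

```python
import argparse


def build_parser():
    """Hypothetical helper: builds the parser described in this step."""
    parser = argparse.ArgumentParser(description="Plot word frequency (log-log)")
    parser.add_argument("filename", help="path to a plain text file")
    parser.add_argument("word", help="word to highlight on the plot")
    # default=500 matches the complete script later in this tutorial
    parser.add_argument("--top", type=int, default=500, help="plot only the top N words")
    return parser


# Passing a list to parse_args() bypasses sys.argv, which is handy for testing
args = build_parser().parse_args(["cities.txt", "fish", "--top", "50"])
print(args.filename, args.word, args.top)  # cities.txt fish 50
```

Because type=int is set, args.top arrives as an integer, and argparse rejects a non-numeric value with a usage error before the rest of the script runs.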

Step 3 — Read, tokenize, and count words

A robust approach is to normalize case and use a regular expression to extract word tokens. This avoids the pitfalls of a plain split(), which leaves punctuation attached to words, and gives cleaner counts to plot.

def count_words(path):
    """Return a collections.Counter of word tokens from the file."""
    token_re = re.compile(r"\b\w+\b", flags=re.UNICODE)
    ctr = Counter()
    try:
        with open(path, 'r', encoding='utf-8') as fh:
            for line in fh:
                tokens = token_re.findall(line.lower())
                ctr.update(tokens)
    except FileNotFoundError:
        sys.stderr.write(f"Error: {path} not found\n")
        sys.exit(1)
    return ctr
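
To see why the regular expression matters, the following sketch compares a whitespace split with the same pattern used in count_words on a made-up sample sentence. The sample text is an assumption chosen purely for illustration.

```python
import re
from collections import Counter

text = "The fish swam. The FISH, the fish-market!"

# Naive approach: a whitespace split keeps punctuation attached to words,
# so "fish," and "fish-market!" are counted as separate tokens
naive = Counter(text.lower().split())

# Regex approach from count_words: runs of word characters between boundaries
token_re = re.compile(r"\b\w+\b")
regex = Counter(token_re.findall(text.lower()))

print(naive["fish"])  # 1
print(regex["fish"])  # 3
```

Note that \w also matches digits and underscores, so tokens such as "2024" or "snake_case" are counted as words. Whether that is desirable depends on the text being analyzed.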

Step 4 — Prepare ranked data for plotting

Take the most common words from the counter, already sorted by descending frequency, and build parallel lists of rank and frequency. These two lists are all you need for a rank-frequency plot or a bar chart of the top words.

def prepare_ranked_lists(counter, top_n=None):
    """Return (ranks, freqs) for the top_n most common words (all words if top_n is None)."""
    most_common = counter.most_common(top_n)
    freqs = [freq for _, freq in most_common]
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs
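
The plot in the next step also needs the rank and frequency of the highlighted word. One way to look them up is sketched here; find_word_rank is a hypothetical helper, not part of the original script, and it searches the full ranking rather than only the top N.

```python
from collections import Counter


def find_word_rank(counter, word):
    """Return (rank, freq) of word in the full ranking, or (None, None) if absent."""
    # most_common() with no argument returns every word, most frequent first
    for rank, (token, freq) in enumerate(counter.most_common(), start=1):
        if token == word:
            return rank, freq
    return None, None


counter = Counter("the fish and the cat and the dog".split())
print(find_word_rank(counter, "and"))    # (2, 2)
print(find_word_rank(counter, "whale"))  # (None, None)
```

Words with equal counts keep the order in which they were first seen, so the rank of a tied word depends on its position in the text.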

Step 5 — Plotting (log-log rank vs frequency)

Use matplotlib's loglog to visualize the rank-frequency relationship (useful when exploring Zipf's law). The example also highlights a chosen word with a distinct marker.

def plot_rank_freq(ranks, freqs, highlight_rank=None, highlight_freq=None, word=None, filename=None):
    plt.title(f"Word frequency plot: {filename}")
    plt.xlabel('Rank (log scale)')
    plt.ylabel('Frequency (log scale)')

    # Log-log line for the overall distribution (both axes default to base 10;
    # the old basex/basey keywords were removed from recent matplotlib releases)
    plt.loglog(ranks, freqs, marker=',')

    # Highlight a particular word as a star
    if highlight_rank is not None and highlight_freq is not None:
        plt.scatter([highlight_rank], [highlight_freq], color='orange', marker='*', s=100, label=word)
        plt.legend()

    plt.tight_layout()
    plt.show()
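
Zipf's law says that frequency is roughly proportional to 1/rank, which corresponds to a slope near -1 on log-log axes. As a rough numerical check to go with the plot, the sketch below fits a least-squares line to the logged data. The loglog_slope helper is hypothetical, and the data is synthetic, generated to follow 1/rank exactly; real text will only approximate this.

```python
import math


def loglog_slope(ranks, freqs):
    """Least-squares slope of log10(freq) against log10(rank).

    Needs at least two distinct ranks and strictly positive frequencies.
    """
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data: freq = 1000 / rank, an idealized Zipf distribution
ranks = list(range(1, 101))
freqs = [1000 / r for r in ranks]
print(round(loglog_slope(ranks, freqs), 3))  # -1.0
```

With real counts, passing the ranks and freqs lists from Step 4 gives one number summarizing how steeply frequency falls off with rank.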

Complete example: word_freq.py

#!/usr/bin/env python3

import re
import sys
import argparse
from collections import Counter
import matplotlib.pyplot as plt


def count_words(path):
    token_re = re.compile(r"\b\w+\b", flags=re.UNICODE)
    ctr = Counter()
    try:
        with open(path, 'r', encoding='utf-8') as fh:
            for line in fh:
                tokens = token_re.findall(line.lower())
                ctr.update(tokens)
    except FileNotFoundError:
        sys.stderr.write(f"Error: {path} not found\n")
        sys.exit(1)
    return ctr


def plot_rank_freq(ranks, freqs, highlight_rank=None, highlight_freq=None, word=None, filename=None):
    plt.title(f"Word frequency plot: {filename}")
    plt.xlabel('Rank (log scale)')
    plt.ylabel('Frequency (log scale)')
    # base 10 is the default on both axes; basex/basey no longer exist in recent matplotlib
    plt.loglog(ranks, freqs, marker=',')
    if highlight_rank is not None and highlight_freq is not None:
        plt.scatter([highlight_rank], [highlight_freq], color='orange', marker='*', s=100, label=word)
        plt.legend()
    plt.tight_layout()
    plt.show()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Plot word frequency in python using matplotlib')
    parser.add_argument('filename', help='path to a plain text file')
    parser.add_argument('word', help='word to highlight on the plot')
    parser.add_argument('--top', type=int, default=500, help='plot only the top N words')
    args = parser.parse_args()

    counter = count_words(args.filename)
    if args.word.lower() not in counter:
        sys.stderr.write(f"Error: '{args.word}' not found in {args.filename}\n")
        sys.exit(1)

    most = counter.most_common(args.top)
    ranks = list(range(1, len(most) + 1))
    freqs = [f for _, f in most]

    # find rank and frequency for highlighted word (if within top N)
    highlight_rank = None
    highlight_freq = None
    for i, (w, f) in enumerate(most, start=1):
        if w == args.word.lower():
            highlight_rank = i
            highlight_freq = f
            break

    plot_rank_freq(ranks, freqs, highlight_rank, highlight_freq, args.word, args.filename)

Step 6 — Run the script

Download a text file (for example from Project Gutenberg) and run:

python word_freq.py cities.txt fish --top 1000

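This opens a matplotlib window with a log-log word frequency plot. If the chosen word ranks within the top N, an orange star marks it; if the word does not occur in the file at all, the script exits with an error instead.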

Code improvements and alternatives

  • For more advanced tokenization or stopword filtering, use NLTK or spaCy. Both are useful for serious text analysis projects.

  • Plot a bar chart of the top N words with plt.bar for a conventional word frequency plot, or use plt.hist on the counts to see how the frequencies are distributed.

  • To inspect Zipf's law, use the log-log plot above; the rank vs frequency relationship often approximates a straight line on log-log axes.

  • Export figures to files with plt.savefig('word_freq.png') when running headless or saving results for reports.
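
The bar chart and savefig suggestions above can be combined into a small headless example. This is a minimal sketch: save_top_words_bar is a hypothetical helper, the sample sentence and output file name are placeholders, and the Agg backend is selected so that no display is required.

```python
import os
import tempfile
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is opened
import matplotlib.pyplot as plt


def save_top_words_bar(counter, top_n, out_path):
    """Save a bar chart of the top_n words; assumes counter is not empty."""
    words, counts = zip(*counter.most_common(top_n))
    fig, ax = plt.subplots()
    ax.bar(words, counts)
    ax.set_xlabel("Word")
    ax.set_ylabel("Frequency")
    ax.set_title(f"Top {len(words)} words")
    ax.tick_params(axis="x", labelrotation=45)  # keep long labels readable
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)  # release the figure's memory
    return out_path


counter = Counter("the fish and the cat and the dog".split())
out_path = os.path.join(tempfile.gettempdir(), "word_freq_bar.png")
save_top_words_bar(counter, 3, out_path)
print("saved", out_path)
```

A bar chart stays readable only for a few dozen words; for hundreds of words the log-log line plot is usually the better choice.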

Conclusion

This tutorial covered how to plot word frequency in Python using matplotlib, from reading and tokenizing text to producing a log-log rank-frequency plot and highlighting a specific word. The same pattern scales to larger corpora and is a solid starting point for visual text analysis and exploring frequency distributions.

