K-Nearest Neighbors (KNN) in Python
Drake Nguyen
Founder · System Architect
KNN in Python: an approachable guide to the k-nearest neighbors algorithm
K-nearest neighbors (KNN) is a simple, instance-based supervised learning algorithm used for both classification and regression tasks. When working with machine learning in Python, KNN is easy to prototype with scikit-learn and is valuable for understanding concepts like distance metrics, feature scaling, and lazy learning.
The intuition behind the k-nearest neighbors algorithm
The core idea of the k-nearest neighbors algorithm is intuitive: a data point is likely to share the label or value of nearby points. In classification, a KNN classifier assigns the most common class among the k nearest neighbors. In regression, the algorithm predicts the average (or a distance-weighted average) of the neighbors' target values.
Analogy
Think of social influence: if you spend most time with one person, you may adopt that person’s preferences (k=1). If you regularly interact with five friends, your preferences more closely resemble the group average (k=5). This mirrors how the k value in KNN controls local vs. broader influence.
Distance metrics in KNN
How “near” is measured matters. Common distance metrics in KNN include:
- Euclidean distance (Minkowski with p=2) — common for continuous features.
- Manhattan distance (Minkowski with p=1) — less sensitive to outliers than Euclidean; often a good fit for grid-like or sparse high-dimensional data.
- Minkowski distance — general form that covers Euclidean and Manhattan.
Scaling features (for example with StandardScaler) is essential because Euclidean distance is sensitive to feature magnitudes. Without scaling, large-valued features can dominate neighbor selection.
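As a quick illustration, all three metrics can be computed directly with NumPy. The points below are toy values chosen only for arithmetic clarity:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean (Minkowski with p=2): straight-line distance
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16) = 5.0

# Manhattan (Minkowski with p=1): sum of absolute differences
manhattan = np.sum(np.abs(a - b))  # 3 + 4 = 7.0

# General Minkowski distance for any p >= 1
def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

print(euclidean, manhattan, minkowski(a, b, 3))
```

Note that if one coordinate were measured in thousands and the other in fractions, the large-valued coordinate would dominate every one of these sums, which is exactly why scaling comes first.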
Implementing KNN in Python (scikit-learn example)
This example shows a typical scikit-learn workflow: create data, split it, scale features, fit a KNeighborsClassifier, and evaluate accuracy on the held-out test set.
1) Imports
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
```
2) Create a synthetic dataset
```python
# Synthetic 2-D dataset with four cluster centers
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)
```
3) Train / test split and scaling
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
4) Fit KNN classifiers and predict
```python
# Compare a smoother k=5 model with a noise-sensitive k=1 model
knn5 = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn1 = KNeighborsClassifier(n_neighbors=1)
knn5.fit(X_train_scaled, y_train)
knn1.fit(X_train_scaled, y_train)
y_pred_5 = knn5.predict(X_test_scaled)
y_pred_1 = knn1.predict(X_test_scaled)
print(f'Accuracy with k=5: {accuracy_score(y_test, y_pred_5):.1%}')
print(f'Accuracy with k=1: {accuracy_score(y_test, y_pred_1):.1%}')
```
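Once fitted, the classifier can score new observations, which must pass through the same fitted scaler. A self-contained sketch (the query coordinates here are hypothetical, chosen only for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the dataset and model from the steps above
X, y = make_blobs(n_samples=500, n_features=2, centers=4,
                  cluster_std=1.5, random_state=4)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X_scaled, y)

# Transform the new point with the SAME fitted scaler, never a fresh one
new_point = scaler.transform([[0.0, 5.0]])  # hypothetical coordinates
print(knn5.predict(new_point))        # predicted class label
print(knn5.predict_proba(new_point))  # fraction of the 5 neighbors per class
```

The `predict_proba` output is simply the vote share among the k neighbors, which makes KNN's "probabilities" coarse for small k.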
How to choose k (best k value for KNN)
Choosing the k value in KNN balances bias and variance. Practical tips:
- Small k (e.g., k=1) can overfit and be sensitive to noise.
- Large k can underfit by oversmoothing class boundaries.
- Use odd k for binary classification to reduce ties.
- Try sqrt(n_samples) as a heuristic starting point.
- Use cross-validation or an elbow plot of validation accuracy vs. k to find the best k.
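The cross-validation sweep suggested above can be sketched as follows; the synthetic dataset from the earlier example is rebuilt here so the snippet runs on its own:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=500, n_features=2, centers=4,
                  cluster_std=1.5, random_state=4)
X_scaled = StandardScaler().fit_transform(X)

# Mean 5-fold cross-validated accuracy for each candidate k
scores = {}
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X_scaled, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Strictly speaking, the scaler should sit inside a Pipeline so each fold is scaled using only its own training split; fitting it once on the full dataset, as above, leaks a little information into the validation folds.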
KNN regression and variants
For continuous targets, KNN regression predicts the mean or a weighted mean of neighbors. scikit-learn provides KNeighborsRegressor for these tasks. You can also weight neighbors by distance to give closer points more influence.
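A minimal KNeighborsRegressor sketch on a noisy sine curve, contrasting uniform and distance weighting (the data is synthetic and chosen only for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Noisy 1-D regression problem: y = sin(x) + noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# 'uniform' averages the k neighbors; 'distance' weights closer ones more
uniform = KNeighborsRegressor(n_neighbors=5, weights='uniform').fit(X, y)
weighted = KNeighborsRegressor(n_neighbors=5, weights='distance').fit(X, y)

x_new = np.array([[2.5]])
print(uniform.predict(x_new), weighted.predict(x_new))
```

With dense, evenly spread data the two weightings give similar answers; distance weighting matters more when neighbor distances vary a lot, such as near the edges of the training range.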
Limitations of KNN
- Storage and prediction cost: KNN stores the full training set and can be slow for large datasets.
- Curse of dimensionality: performance degrades with many irrelevant features—perform feature selection or dimensionality reduction first.
- Feature scaling required: distance metrics can be dominated by unscaled features.
- Cannot extrapolate beyond observed data—limited for novel rare events.
Practical recommendations
- Always scale numeric features (StandardScaler or MinMaxScaler).
- Experiment with distance metrics (Euclidean, Manhattan, Minkowski) for your data.
- Use cross-validation for hyperparameter tuning: k value, weights, and metric.
- For large datasets, consider approximate nearest neighbors libraries or different supervised learning algorithms.
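Putting the tuning advice together, one possible sketch uses GridSearchCV with a Pipeline so scaling is refit inside each cross-validation fold:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=500, n_features=2, centers=4,
                  cluster_std=1.5, random_state=4)

# The Pipeline ensures the scaler sees only each fold's training split
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
param_grid = {
    'knn__n_neighbors': [1, 3, 5, 7, 9, 11],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan'],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The grid here is small on purpose; the search cost grows multiplicatively with each parameter list you add.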
Conclusion
KNN in Python is an accessible supervised learning algorithm that helps illustrate instance-based and lazy learning concepts. With scikit-learn's KNeighborsClassifier and KNeighborsRegressor you can quickly prototype classification and regression solutions, but remember to scale features and tune the k value and distance metrics for best results.
For hands-on practice, try implementing the KNN algorithm from scratch in Python to learn the mechanics, then move to scikit-learn for production-ready workflows.
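Such a from-scratch version can be quite short; the sketch below is an unoptimized illustration of the plain (unweighted) classifier:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify one point by majority vote of its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny sanity check on two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.1]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.1]), k=3))  # 1
```

A real implementation would add tie-breaking, distance weighting, and a spatial index (such as a k-d tree) instead of the brute-force distance scan used here.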