Tutorial

K-Nearest Neighbors (KNN) in Python

Clear, original guide to KNN in Python: intuition, distance metrics, scikit-learn code (KNeighborsClassifier), choosing k, scaling, limitations, and practical tips.

Drake Nguyen

Founder · System Architect

3 min read

KNN in Python: an approachable guide to the k-nearest neighbors algorithm

K-nearest neighbors (KNN) is a simple, instance-based supervised learning algorithm used for both classification and regression. In Python, KNN is easy to prototype with scikit-learn and is a good vehicle for understanding concepts like distance metrics, feature scaling, and lazy learning.

The intuition behind the k-nearest neighbors algorithm

The core idea of the k-nearest neighbors algorithm is intuitive: a data point is likely to share the label or value of nearby points. In classification, a KNN classifier assigns the most common class among the k nearest neighbors. In regression, it predicts the average (or a distance-weighted average) of the neighbors' target values.
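The voting idea can be sketched in a few lines of plain NumPy (a minimal illustration, not production code; the toy points and labels below are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Four toy points: two near the origin (class 0), two near (5, 5) (class 1)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # → 0
```

Note that no model is "trained" here: prediction scans the stored training set, which is exactly why KNN is called a lazy learner.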

Analogy

Think of social influence: if you spend most time with one person, you may adopt that person’s preferences (k=1). If you regularly interact with five friends, your preferences more closely resemble the group average (k=5). This mirrors how the k value in KNN controls local vs. broader influence.

Distance metrics in KNN

How “near” is measured matters. Common distance metrics in KNN include:

  • Euclidean distance (Minkowski with p=2) — common for continuous features.
  • Manhattan distance (Minkowski with p=1) — less sensitive to outliers; often useful for high-dimensional or grid-like data.
  • Minkowski distance — general form that covers Euclidean and Manhattan.

Scaling features (for example with StandardScaler) is essential because Euclidean distance is sensitive to feature magnitudes. Without scaling, large-valued features can dominate neighbor selection.
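The effect is easy to demonstrate: with one large-scale feature and one small-scale feature (the income/age values below are hypothetical), the raw Euclidean distance is driven almost entirely by the large feature, while standardized features contribute equally:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two samples: income (scale of thousands) and age (scale of tens)
X = np.array([[50_000.0, 25.0],
              [51_000.0, 60.0]])

# Unscaled Euclidean distance is dominated by the income feature
unscaled = np.linalg.norm(X[0] - X[1])
print(unscaled)  # ≈ 1000.6 — the large age gap barely registers

# After standardization, both features contribute equally
X_scaled = StandardScaler().fit_transform(X)
scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(scaled)  # ≈ 2.83
```

A 35-year age difference is arguably more meaningful than a 1,000-unit income difference, yet without scaling it contributes almost nothing to the distance.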

Implementing KNN in Python (scikit-learn example)

This example shows a typical workflow: create data, split it, scale features, fit a KNeighborsClassifier, and evaluate accuracy at two values of k.

1) Imports

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

2) Create a synthetic dataset

# Generate a 2-D, four-class synthetic dataset
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)

3) Train / test split and scaling

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

4) Fit KNN classifiers and predict

# metric='minkowski' with p=2 is Euclidean distance (scikit-learn's default)
knn5 = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn1 = KNeighborsClassifier(n_neighbors=1)
knn5.fit(X_train_scaled, y_train)
knn1.fit(X_train_scaled, y_train)
y_pred_5 = knn5.predict(X_test_scaled)
y_pred_1 = knn1.predict(X_test_scaled)
print('Accuracy with k=5:', accuracy_score(y_test, y_pred_5) * 100)
print('Accuracy with k=1:', accuracy_score(y_test, y_pred_1) * 100)

How to choose k (best k value for KNN)

Choosing the k value in KNN balances bias and variance. Practical tips:

  • Small k (e.g., k=1) can overfit and be sensitive to noise.
  • Large k can underfit by oversmoothing class boundaries.
  • Use odd k for binary classification to reduce ties.
  • Try sqrt(n_samples) as a heuristic starting point.
  • Use cross-validation or an elbow plot of validation accuracy vs. k to find the best k.
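The cross-validation tip above can be sketched with scikit-learn's cross_val_score, reusing the same synthetic dataset as the earlier example (the range of odd k values tried is an arbitrary choice):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=500, n_features=2, centers=4,
                  cluster_std=1.5, random_state=4)

# Mean 5-fold cross-validation accuracy for each candidate k
scores = {}
for k in range(1, 31, 2):  # odd values of k only
    # Pipeline ensures scaling is fit only on each fold's training split
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Plotting scores against k gives the elbow plot mentioned above; wrapping the scaler and classifier in a pipeline avoids leaking test-fold statistics into the scaler.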

KNN regression and variants

For continuous targets, KNN regression predicts the mean or a weighted mean of neighbors. scikit-learn provides KNeighborsRegressor for these tasks. You can also weight neighbors by distance to give closer points more influence.
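A minimal KNeighborsRegressor sketch on a toy sine curve (the dataset, noise level, and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Noisy 1-D regression target: y = sin(x) + noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# weights='distance' gives closer neighbors more influence than 'uniform'
reg = KNeighborsRegressor(n_neighbors=5, weights='distance')
reg.fit(X, y)

pred = reg.predict([[2.5]])[0]
print(pred)  # should land near sin(2.5) ≈ 0.60
```

Because the prediction is an average of observed targets, it can never fall outside the range of training y values — the extrapolation limitation discussed below.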

Limitations of KNN

  • Storage and prediction cost: KNN stores the full training set and can be slow for large datasets.
  • Curse of dimensionality: performance degrades with many irrelevant features—perform feature selection or dimensionality reduction first.
  • Feature scaling required: distance metrics can be dominated by unscaled features.
  • Cannot extrapolate beyond observed data—limited for novel rare events.

Practical recommendations

  • Always scale numeric features (StandardScaler or MinMaxScaler).
  • Experiment with distance metrics (Euclidean, Manhattan, Minkowski) for your data.
  • Use cross-validation for hyperparameter tuning: k value, weights, and metric.
  • For large datasets, consider approximate nearest neighbors libraries or different supervised learning algorithms.

Conclusion

KNN in Python is an accessible supervised learning algorithm that helps illustrate instance-based and lazy learning concepts. With scikit-learn's KNeighborsClassifier and KNeighborsRegressor you can quickly prototype classification and regression solutions, but remember to scale features and tune the k value and distance metrics for best results.

For hands-on practice, try implementing the KNN algorithm from scratch in Python to learn the mechanics, then move to scikit-learn for production-ready workflows.
