Neighbor-based learning methods include supervised and unsupervised types. Supervised neighbor-based methods are primarily used for classification, though they are also applicable to regression. These methods are lazy, requiring no specialized training phase and retaining all available data, and non-parametric, making no assumptions about the underlying data distribution. Nearest neighbor methods identify the training samples closest to a new data point and predict its label from them.
1. Types of Algorithms
Different types of algorithms can be utilized in neighbor-based methods, as follows (a short selection sketch follows the list):
- Brute Force: Brute-force distance computation for neighbor search is effective for small datasets but becomes impractical as the number of samples increases significantly.
- K-D Tree: K-D tree is a binary tree used to improve computational efficiency by recursively partitioning the parameter space along the data axes, creating nested axis-aligned (orthotropic) regions into which data points are filed for better organization and retrieval.
- Ball Tree: Ball tree data structure was developed to overcome K-D Tree’s inefficiency in higher dimensions. It recursively divides data into nodes defined by a centroid and radius, ensuring all points within a node lie inside the respective hyper-sphere.
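All three strategies are available in Scikit-Learn through the algorithm parameter of the NearestNeighbors class. The following minimal sketch (the five-point array and the query point are illustrative choices, not from the original examples) shows that each strategy returns the same neighbors; only the internal index structure differs.
import numpy as np
from sklearn.neighbors import NearestNeighbors

points = np.array([[0, 1], [1, 1], [3, 2], [4, 0], [2, 4]])

# Serve the same query with each strategy; results are identical,
# only the internal search structure changes.
for method in ('brute', 'kd_tree', 'ball_tree'):
    nn = NearestNeighbors(n_neighbors=2, algorithm=method).fit(points)
    distances, indices = nn.kneighbors([[2, 1]])
    print(method, indices[0], distances[0])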
2. Selecting Nearest Neighbors Algorithm
Selecting the ideal algorithm for a given dataset depends on the following factors; a sketch after the list shows how Scikit-Learn can automate this choice:
- Number of samples (N) and dimensionality (D): Query time is the key consideration. Brute Force query time grows as O[DN]. K-D Tree query time is roughly O[D log(N)] for low dimensionality (D < 20), but as D grows beyond about 20 its cost approaches O[DN]. Ball Tree query time remains roughly O[D log(N)] even in higher dimensions, which makes dimensionality a decisive factor when choosing the algorithm.
- Data structure: The intrinsic dimensionality and sparsity of the data strongly influence Ball Tree and K-D Tree query times; both perform better on sparser data with lower intrinsic dimensionality. Brute Force query time, in contrast, is unaffected by data structure.
- Number of neighbors (K): Increasing the number of neighbors for query points slows query times for Ball Tree and K-D Tree algorithms, while Brute Force’s query time remains unchanged regardless of this number.
- Number of query points: Building a K-D Tree or Ball Tree pays off only when its construction cost is amortized over many queries; when only a few queries are needed, the Brute Force algorithm can outperform both, since it has no construction phase.
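In practice, Scikit-Learn can weigh these factors automatically: passing algorithm='auto' to NearestNeighbors lets the library choose among brute force, K-D tree, and ball tree based on the data passed to fit. A minimal sketch (the random sample sizes are assumed for illustration):
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
samples = rng.rand(1000, 3)  # e.g. N = 1000 samples in D = 3 dimensions

# 'auto' selects a search strategy from the number of samples,
# the dimensionality, and other properties of the fitted data.
nn = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(samples)
distances, indices = nn.kneighbors(samples[:1])
print(indices)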
3. Types of KNN Learning Techniques
KNN is a non-parametric, lazy machine learning algorithm: it assumes no particular data distribution and uses the entire training dataset at prediction time, with no separate model-generation phase. It works in two phases: (i) computing and storing the K nearest neighbors of the training samples, and (ii) retrieving the K nearest neighbors of a query point from the dataset and predicting its class by majority vote among those neighbors.
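The two phases can be made concrete with a short from-scratch sketch (a simplified illustration, not Scikit-Learn's implementation; the toy arrays are assumed):
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    # Phase (i): compute distances from the query to all training samples
    distances = np.linalg.norm(train_x - query, axis=1)
    # Phase (ii): take the K nearest and vote on the majority class
    nearest = np.argsort(distances)[:k]
    return Counter(train_y[nearest]).most_common(1)[0][0]

train_x = np.array([[0, 0], [1, 1], [5, 5], [6, 5]])
train_y = np.array([0, 0, 1, 1])
print(knn_predict(train_x, train_y, np.array([5, 6])))  # predicts class 1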
Scikit-Learn implements K-nearest neighbors for both supervised and unsupervised learning. Unsupervised nearest neighbors uses an algorithm such as Brute Force, K-D Tree, or Ball Tree to identify the nearest neighbors of each sample, and forms the basis of methods like KNN and K-Means. Supervised neighbors-based learning, in contrast, is applied to classification and regression tasks.
3.1. Unsupervised KNN Learning
Because algorithms such as KNN and K-Means would otherwise recompute pairwise distances inefficiently, Scikit-Learn implements nearest neighbor search as its own estimator, NearestNeighbors, improving performance in unsupervised nearest neighbor learning. The following example demonstrates its use.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Five two-dimensional sample points
data = np.array([[2, 3], [-1, 2], [1, -3], [-1, -2], [1, 5]])

# Build a ball tree index and query the 3 nearest neighbors of each
# point (each point is its own nearest neighbor, at distance 0)
nearest_neighbors = NearestNeighbors(n_neighbors=3, algorithm='ball_tree')
nearest_neighbors.fit(data)
distances, indices = nearest_neighbors.kneighbors(data)
print(indices, "\n")
print(distances, "\n")

# Connectivity graph: entry (i, j) is 1 if j is among the
# 3 nearest neighbors of i
print(nearest_neighbors.kneighbors_graph(data).toarray())
Output
[[0 4 1]
[1 0 4]
[2 3 1]
[3 2 1]
[4 0 1]]
[[0. 2.23606798 3.16227766]
[0. 3.16227766 3.60555128]
[0. 2.23606798 5.38516481]
[0. 2.23606798 4. ]
[0. 2.23606798 3.60555128]]
[[1. 1. 0. 0. 1.]
[1. 1. 0. 0. 1.]
[0. 1. 1. 1. 0.]
[0. 1. 1. 1. 0.]
[1. 1. 0. 0. 1.]]
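The tree structures are also exposed directly as sklearn.neighbors.KDTree and sklearn.neighbors.BallTree, which can be queried without going through NearestNeighbors. A brief sketch on the same data:
import numpy as np
from sklearn.neighbors import KDTree

data = np.array([[2, 3], [-1, 2], [1, -3], [-1, -2], [1, 5]])

# Build the K-D tree once, then query the 3 nearest neighbors of each point
tree = KDTree(data)
distances, indices = tree.query(data, k=3)
print(indices)  # same neighbor sets as the example above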
3.2. Supervised KNN Learning
Supervised neighbors-based learning is applied to classification and regression tasks with labeled data. A neighbors-based classifier predicts by majority vote among the nearest neighbors; because it simply stores the training instances, it is a form of non-generalizing learning. Scikit-Learn provides two such classifiers, KNeighborsClassifier and RadiusNeighborsClassifier. A neighbors-based regressor, in contrast, handles continuous labels, predicting the mean of the labels of the nearest neighbors; the library provides KNeighborsRegressor and RadiusNeighborsRegressor.
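Since the examples below cover only the regressors, a minimal classification sketch using KNeighborsClassifier on the iris dataset may be helpful (the split proportion, random seed, and K value are illustrative choices):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
data_train, data_test, target_train, target_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0)

# Predict each test label by majority vote among the 5 nearest
# training samples
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(data_train, target_train)
print("Test accuracy:", classifier.score(data_test, target_test))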
3.2.1. KNeighborsRegressor
The regressor predicts from the K nearest neighbors, where K is a user-specified integer. The best value of K is data-dependent, since it controls how local the prediction is. The following example illustrates the method on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

iris = load_iris()
data = iris.data[:, :4]
target = iris.target
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.30)

# Standardize features using statistics computed on the training set only
scaler = StandardScaler()
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Predict each test label as the mean over the 5 nearest training neighbors
regressor = KNeighborsRegressor(n_neighbors=5)
regressor.fit(data_train, target_train)
print("The mean squared error is:", np.power(target_test - regressor.predict(data_test), 2).mean())

# A second, one-dimensional toy example
data = [[0], [1], [2], [3]]
target = [0, 1, 0, 1]
regressor = KNeighborsRegressor(n_neighbors=3)
regressor.fit(data, target)
print(regressor.predict([[2.5]]))
Output
The mean squared error is: (value varies with the random train/test split)
[0.66666667]
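By default all K neighbors are weighted uniformly. KNeighborsRegressor also accepts weights='distance', which gives closer neighbors proportionally more influence; a brief sketch on the toy data above:
from sklearn.neighbors import KNeighborsRegressor

data = [[0], [1], [2], [3]]
target = [0, 1, 0, 1]

# Closer neighbors (here the points 2 and 3) dominate the prediction
regressor = KNeighborsRegressor(n_neighbors=3, weights='distance')
regressor.fit(data, target)
print(regressor.predict([[2.5]]))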
3.2.2. RadiusNeighborsRegressor
The radius regressor predicts from all neighbors that fall within a user-specified radius R of the query point, so the number of neighbors used can vary from query to query. The following example illustrates the method on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import RadiusNeighborsRegressor
import numpy as np

iris = load_iris()
data = iris.data[:, :4]
target = iris.target
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.30)

# Standardize features using statistics computed on the training set only
scaler = StandardScaler()
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Predict each test label as the mean over all training neighbors
# within radius 1 of the (standardized) query point
regressor = RadiusNeighborsRegressor(radius=1)
regressor.fit(data_train, target_train)
print("The mean squared error is:", np.power(target_test - regressor.predict(data_test), 2).mean())

# A second, one-dimensional toy example
data = [[0], [1], [2], [3]]
target = [0, 1, 0, 1]
regressor = RadiusNeighborsRegressor(radius=1)
regressor.fit(data, target)
print(regressor.predict([[1.5]]))
Output
The mean squared error is: (value varies with the random train/test split)
[0.5]
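One caveat: unlike KNeighborsRegressor, the radius-based regressor may find no neighbors at all within the radius for an outlying query point, in which case Scikit-Learn returns NaN and emits a warning. A minimal sketch of this failure mode on the toy data above:
from sklearn.neighbors import RadiusNeighborsRegressor

data = [[0], [1], [2], [3]]
target = [0, 1, 0, 1]

regressor = RadiusNeighborsRegressor(radius=1)
regressor.fit(data, target)

# No training point lies within radius 1 of 10.0, so the
# prediction is NaN (and a warning is raised)
print(regressor.predict([[10.0]]))  # [nan]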