Anomaly detection identifies data points that deviate from the norm. Anomalies fall into three categories: point anomalies, where an individual data instance is anomalous compared to the rest of the data; contextual anomalies, where a data instance is anomalous only within a specific context; and collective anomalies, where a group of related data instances is anomalous with respect to the entire dataset.
Outlier detection and novelty detection are two distinct approaches to anomaly detection. In outlier detection, the training data contains outliers, defined as observations that deviate from the others; the estimators ignore them and fit the concentrated regions of the data, a setting known as unsupervised anomaly detection. In novelty detection, the training data is assumed to be clean of outliers, and the goal is to decide whether a new observation follows a pattern not seen during training; this setting is termed semi-supervised anomaly detection.
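As a minimal sketch of the distinction, Scikit-Learn's LocalOutlierFactor supports both modes through its novelty parameter; the tiny dataset below is an illustrative assumption, not taken from the examples that follow.
from sklearn.neighbors import LocalOutlierFactor
train = [[0.0], [0.1], [0.2], [0.1]]  # assumed clean training data
new = [[0.15], [5.0]]                 # new observations to score
# novelty=True switches from outlier detection (fit_predict on the
# training data itself) to novelty detection (predict on unseen data)
detector = LocalOutlierFactor(n_neighbors=2, novelty=True).fit(train)
print(detector.predict(new))          # 1 for inliers, -1 for novelties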
The Scikit-Learn algorithms for outlier detection are described below.
Elliptic Envelope
This algorithm assumes the regular data follows a known distribution, such as a Gaussian. The EllipticEnvelope object fits a robust covariance estimate to the data, drawing an ellipse around the central data points while ignoring points outside it. The algorithm is used in the following example.
import numpy as np
from sklearn.covariance import EllipticEnvelope
# A covariance matrix must be symmetric
true_covariance = np.array([[.7, .2], [.2, .8]])
# Draw 1000 samples from a bivariate Gaussian with a fixed random seed
data = np.random.RandomState(0).multivariate_normal(mean=[0, 1], cov=true_covariance, size=1000)
# Fit a robust covariance estimate to the data
covariance = EllipticEnvelope(random_state=0).fit(data)
# The predict method returns 1 for an inlier and -1 for an outlier
print(covariance.predict([[1, 1], [2, 2]]))
Output
[ 1 -1]
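Once fitted, the estimator also exposes the robust estimates it computed, which can be compared against the true parameters used to generate the data; location_ and covariance_ are Scikit-Learn's documented attribute names.
print(covariance.location_)    # robust estimate of the data's centre
print(covariance.covariance_)  # robust estimate of the covariance matrix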
Isolation Forest
The Isolation Forest algorithm detects outliers efficiently, especially in high-dimensional datasets. Scikit-Learn's IsolationForest isolates observations by randomly selecting a feature and then a split value between that feature's minimum and maximum. The number of splits required to isolate a sample equals the path length from the root node to the terminating node, and anomalies tend to produce noticeably shorter paths. The example below fits a forest of 50 trees to the given data.
import numpy as np
from sklearn.ensemble import IsolationForest
# A small dataset in which [30, -40] is clearly isolated from the rest
data = np.array([[0, 0], [-5, -6], [1, 5], [1, 2], [30, -40]])
# Fit an ensemble of 50 randomly grown isolation trees
classifier = IsolationForest(n_estimators=50)
classifier.fit(data)
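The fitted forest can then label the observations; a brief continuation of the example (predict returns 1 for inliers and -1 for outliers, so the isolated point [30, -40] should be flagged).
print(classifier.predict(data))            # 1 for inliers, -1 for outliers
print(classifier.decision_function(data))  # lower scores are more anomalous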
Local Outlier Factor
Local Outlier Factor (LOF) is an effective outlier detection algorithm for moderately high-dimensional data. The method computes a score, the local outlier factor, reflecting how much lower a sample's density is than that of its neighbors; samples with substantially lower density are flagged as anomalies. LOF is built on nearest-neighbor queries, and the following example uses the related NearestNeighbors class to find the nearest neighbor of a query point.
from sklearn.neighbors import NearestNeighbors
samples = [[2., 2., 4.], [1., 0., .5], [0., 0., 0.]]
# Index the samples with a ball tree, using the Manhattan distance (p=1)
neighbors = NearestNeighbors(n_neighbors=1, algorithm="ball_tree", p=1)
neighbors.fit(samples)
# kneighbors returns the distance to, and the index of, the nearest sample
print(neighbors.kneighbors([[1., .4, .5]]))
Output
(array([[0.4]]), array([[1]]))
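LOF itself is available as the LocalOutlierFactor class; the sketch below applies it to the same samples (n_neighbors=2 is an assumption chosen to suit the tiny dataset).
from sklearn.neighbors import LocalOutlierFactor
detector = LocalOutlierFactor(n_neighbors=2)
# fit_predict labels each training sample: 1 for inliers, -1 for outliers
print(detector.fit_predict(samples))
# negative_outlier_factor_ holds the opposite of the LOF score;
# values far below -1 indicate a much lower local density
print(detector.negative_outlier_factor_)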
One-Class SVM
This unsupervised outlier detection algorithm estimates the support of a high-dimensional distribution. It requires a kernel (commonly RBF) and a scalar parameter, nu, to define the frontier. The example below fits the dataset with the OneClassSVM object and prints each sample's raw score.
from sklearn.svm import OneClassSVM
data = [[0.65], [0.70], [0], [1], [0.45]]
# gamma='scale' sets the RBF kernel coefficient from the data's variance
classifier = OneClassSVM(gamma='scale').fit(data)
# score_samples returns the raw, unshifted score of each sample
print(classifier.score_samples(data))
Output
[0.94045373 0.94436003 0.94127002 0.94127001 0.94066049]
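The fitted object can also label the points directly; a short continuation of the example (note that nu, an upper bound on the fraction of training errors, defaults to 0.5, so up to half the samples may be labelled outliers).
print(classifier.predict(data))            # 1 for inliers, -1 for outliers
print(classifier.decision_function(data))  # signed distance to the frontier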
