Outlier
Author: Ernad Mujakic
Date: 2025-07-08
Outliers are data points that differ significantly from the other observations in a dataset. Outliers may occur due to measurement or recording error, or could possibly represent an important anomaly warranting further analysis. There is no fixed definition of what constitutes as an outlier, typically, specific domain knowledge is usually necessary to understand whether a specific observation is an outlier, or is a natural phenomenon of the dataset.
Causes of Outliers
- Measurement Error: Outliers may occur due to user error in the data collection process, or could occur due to errors in autonomous systems, such as sensor failure.
- Natural Variation: Outliers may represent perfectly legitimate values that are an inherent part of the naturally occurring variations in the underlying domain of the dataset.
- Anomaly: Outliers may represent unusual behavior, such as fraudulent transactions, which warrant further investigation and analysis.
Impact
- Outliers can have a significant impact on various statistical measures, such as Mean, Standard Deviation, or Range.
- Outliers could also hinder the performance of some machine learning models such as Logistic Regression or K-Nearest-Neighbors.
Types of Outliers
- Global Outliers: A global outlier deviates significantly from the entire population globally.
- Collective Outliers: A group or subset of data points that collectively deviate considerably from the overall distribution. Typically require special techniques to detect.
- Contextual/Local Outliers: Data points whose value deviate significantly relative to other data points within the same "context." Contextual outliers may not be considered outliers when considered globally, meaning they need special attention to be properly detected and analyzed.
Outliers vs. Extreme Values
While the terms outliers and extreme values may be used interchangeably, they have distinct definitions in statistics:
- Outlier: is a data point that varies significantly from the rest of the dataset.
- Extreme values: values that reside at the outer edges of the dataset, representing the highest and lowest points in a dataset. Extreme values may be outliers, or they be a natural part of a distribution.
Detection
Z-score, sometimes called the standard score, measures how many Standard Deviations a data object is from the mean of the distribution. A positive z-score indicates the values is greater than the mean, while a negative score indicates it is less than the mean.
Formula
Mathematically, the z-score is defined as:
Where:
is the z-score. is the given data value. is the population mean. is the population standard deviation.
Application
The z-score is commonly used to detect outliers by flagging any values that are outside of a specified threshold (commonly -3 and 3).
Interquartile Range (IQR)
The Interquartile Range (IQR) is a measure of statistical dispersion that represents the range within which the central 50% of the data points lie.
Formula
Mathematically, the IQR is defined as:
Where:
Application
The IQR is commonly used to detect outliers by flagging any values that are outside of a specified boundary, typically:
- Lower Bound:
- Upper Bound:
Any data point below the lower bound or above the upper bound is considered an outlier.
K-Nearest Neighbors is a supervised machine learning algorithm which can be used for both classification and regression tasks. The algorithm relies on Distance Metrics such as Euclidean Distance, Manhattan Distance, or Minkowski Distance to find the "
Intuition
- Regression: For regression tasks, the algorithm takes the average values of the k-nearest-neighbors of the data object, where the neighbors come from the training set. Then the mean of the neighbors' values is the predicted value for the object.
- Classification: For classification tasks, a majority vote among the object's k-nearest-neighbors is taken to determine the category for the given data object.
Application
K-nearest-neighbors is used to detect outliers by assigning an outlier score to a data object, which is done by measuring the distance of an object from its nearest neighbors. Though, this approach is not effective for collective outliers.
DBSCAN is a density-based clustering algorithm that clusters data based on the density of data points. DBSCAN excels at identifying an arbitrary number of clusters, and can handle nested clusters of arbitrary shapes.
Intuition
- Two parameters are chosen,
represents the radius within which to classify neighbors; andminPts
, which represents the minimum number of neighbors within to classify the point as a "core point." - A data point is considered a core point if the amount of other data points that falls within its
radius is at leastminPts.
- After all core points are identified, for each core point, create a cluster of the core point and all the points within its
radius. - Expand the cluster by iterating through all reachable points and adding them to the cluster if they're core points.
- Points that are not reachable from any core points are labelled as noise or outliers.
Application
DBSCAN is popular algorithm for anomaly detection, since outliers are detected based on relative density of data.
Handling Outliers
Removal
One straightforward method for dealing with outliers is simply to remove them from the dataset. This approach if effective as long as outliers are not relevant to the analysis, such as the case where the focus is not on Anomaly Detection.
Transformation
Transformations can be applied to the data to minimize the effect of outliers. Some popular techniques include:
- Scaling: Adjusting the range of the data to reduce the influence of extreme values. This includes methods such as Min-Max Scaling or Z-Score Normalization.
- Winsorization: Replacing outlier values with the nearest value within a specific percentile range. For example, any values outside of the middle 95% percentile are replaced with the nearest values within that range.
- Log Transformation: Applying logarithmic transformations to reduce Variance and make the data more normally distributed.
Explicit Modeling
Another approach to handling outliers is explicitly modeling them. This can be done by adding a new Binary Data attribute that specifies whether a given data object is an outlier or not.
References
- J. Han, M. Kamber, and J. Pei, Data Mining : Concepts and Techniques. Burlington, Ma: Elsevier, 2012.
- I. Cohen, “Outlier Detection & Analysis: The Different Types of Outliers,” Anodot, Feb. 25, 2022. https://www.anodot.com/blog/quick-guide-different-types-outliers/
- GeeksforGeeks, “Types of Outliers in Data Mining,” GeeksforGeeks, Jul. 2021. https://www.geeksforgeeks.org/data-analysis/types-of-outliers-in-data-mining/ (accessed Jul. 08, 2025).
- S. Glen, “Outliers: Finding Them in Data, Formula, Examples. Easy Steps and Video,” Statistics How To. https://www.statisticshowto.com/statistics-basics/find-outliers/
- Wikipedia Contributors, “Outlier,” Wikipedia, Apr. 07, 2019. https://en.wikipedia.org/wiki/Outlier
- GeeksforGeeks, “How to Detect Outliers in Machine Learning,” GeeksforGeeks, Jan. 12, 2019. https://www.geeksforgeeks.org/machine-learning/machine-learning-outlier/ (accessed Jul. 10, 2025).
- Wikipedia Contributors, “Standard score,” Wikipedia, Sep. 12, 2019. https://en.wikipedia.org/wiki/Standard_score