Image generated with DALL·E 3
In today’s era of massive data sets and intricate data patterns, the art and science of detecting anomalies or outliers has become more nuanced. While traditional outlier detection techniques are well-equipped to handle scalar or multivariate data, functional data (curves, surfaces, or anything observed over a continuum) poses unique challenges. One of the innovative techniques developed to address this problem is the Density Kernel Depth (DKD) method.
In this article, we will delve into the concept of DKD and its implications in outlier detection for functional data from a data scientist’s point of view.
Before digging into the details of DKD, it is vital to understand what functional data entails. Unlike traditional data points, which are scalar values, functional data consists of curves or functions. Think of it as having an entire curve as a single data observation. This type of data typically arises in situations where measurements are taken continuously over time, such as temperature curves over the course of a day or stock market trajectories.
Given a set of n curves observed over a domain D, each curve can be represented as:

$$x_i(t), \quad t \in D, \quad i = 1, \dots, n$$
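In practice, such a data set is usually stored as a matrix in which each row holds one curve sampled on a common grid. The minimal Python sketch below builds an illustrative synthetic sample (the sine-shaped curves and the injected anomalies are assumptions for demonstration, not data from the article):

```python
import numpy as np

rng = np.random.default_rng(42)
n_curves, n_points = 50, 100        # n curves, each observed at m grid points
t = np.linspace(0, 1, n_points)     # the domain D, discretized

# Typical curves: noisy sine waves; rows are curves x_i, columns are points in D
curves = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((n_curves, n_points))
curves[:3] += 1.5 * np.cos(2 * np.pi * t)  # inject 3 shape outliers for later use
```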
For scalar data, we could calculate the mean and standard deviation and then determine outliers based on data points that lie a certain number of standard deviations from the mean.
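For instance, a z-score rule for scalar data takes only a few lines (a minimal sketch; the readings and the 2-standard-deviation cutoff are illustrative conventions, not taken from the article):

```python
import numpy as np

values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 15.7])  # illustrative readings

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2]   # cutoffs of 2 or 3 are common choices
print(outliers)                           # -> [15.7]
```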
For functional data, this approach is more complicated because each observation is a curve.
One method of measuring the centrality of a curve is to calculate its “depth” relative to the other curves. For example, a simple rank-based depth is the fraction of curves that lie at least as far from the mean curve as $x_i$:

$$\text{Depth}(x_i) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}\left\{ \int_D \left|x_j(t) - \bar{x}(t)\right| dt \;\ge\; \int_D \left|x_i(t) - \bar{x}(t)\right| dt \right\}$$

Where n is the total number of curves and $\bar{x}(t)$ is the pointwise mean curve.
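Continuing the sketch from above (curves stored row-wise on the grid t), this rank-based depth can be computed directly; the implementation below is my own illustration of the formula, not code from the article:

```python
def simple_depth(curves, t):
    """Fraction of curves at least as far from the mean curve as curve i."""
    mean_curve = curves.mean(axis=0)                         # pointwise mean
    dist = np.trapz(np.abs(curves - mean_curve), t, axis=1)  # integral of |x_i - mean|
    return (dist[None, :] >= dist[:, None]).mean(axis=1)     # higher = more central
```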
While the above is a simplified representation, in reality, functional data sets can consist of thousands of curves, making it difficult to visually detect outliers. Mathematical formulations such as the depth measure provide a more structured approach to measuring the centrality of each curve and potentially detecting outliers.
In a practical scenario, more advanced methods, such as Density Kernel Depth, would be needed to effectively determine outliers in functional data.
DKD works by comparing the density of each curve’s values at each point with the overall density of the entire data set at that point. Densities are estimated using kernel methods, non-parametric techniques that can estimate densities in complex data structures.
For each curve, DKD evaluates how central or peripheral its values are at each point and integrates these values across the entire domain. The result is a single number that represents the depth of the curve; lower values indicate possible outliers.
The kernel density estimate at point t for a given curve $x_i(t)$ is defined as:

$$\hat{f}_i(t) = \frac{1}{nh} \sum_{j=1}^{n} K\!\left( \frac{x_i(t) - x_j(t)}{h} \right)$$
Where:
- K(·) is the kernel function, often a Gaussian kernel.
- h is the bandwidth parameter.
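Continuing the sketch, a direct implementation of this pointwise estimate with a Gaussian kernel might look as follows (an O(n²·m) illustration that trades memory for clarity, not the article’s code):

```python
def pointwise_density(curves, h):
    """Density of curve i's value among all curves' values at each point t."""
    diffs = curves[:, None, :] - curves[None, :, :]                # x_i(t) - x_j(t)
    kernel = np.exp(-0.5 * (diffs / h) ** 2) / np.sqrt(2 * np.pi)  # Gaussian K
    return kernel.mean(axis=1) / h                                 # shape (n, m)
```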
The choice of kernel function K(·) and bandwidth h can significantly influence DKD values:
- Kernel function: Gaussian kernels are commonly used because of their smoothness properties.
- Bandwidth h: determines how smooth the density estimate is. Cross-validation is often used to select an optimal h, as in the sketch below.
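One common way to tune h, sketched here with scikit-learn’s KernelDensity and GridSearchCV (my choice of tooling, not one prescribed by the article), is to maximize the cross-validated log-likelihood of the values observed at a given point:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def select_bandwidth(values, candidates=np.logspace(-2, 0, 20)):
    """Pick h by 5-fold cross-validated log-likelihood for a 1-D sample."""
    grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                        {"bandwidth": candidates}, cv=5)
    grid.fit(values.reshape(-1, 1))
    return grid.best_params_["bandwidth"]

h = select_bandwidth(curves[:, 0])   # e.g., tune h at the first grid point
```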
The depth of the curve $x_i$ relative to the entire data set is then obtained by integrating the pointwise density ratio over the domain D:

$$\text{DKD}(x_i) = \int_D \frac{\hat{f}_i(t)}{\hat{f}_{\max}(t)} \, dt$$
where:
- $\hat{f}_i(t)$ is the pointwise kernel density estimate for curve $x_i$ defined above, and
- $\hat{f}_{\max}(t) = \max_{1 \le j \le n} \hat{f}_j(t)$ is the maximum pointwise density over all curves, which bounds the integrand by 1 so that more central curves accumulate larger depth values.
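Putting the pieces together under the same assumptions as the earlier snippets (a sketch of the formula above, not a reference implementation):

```python
def density_kernel_depth(curves, t, h):
    """DKD(x_i): integrate each curve's pointwise density, normalized by the
    maximum pointwise density across curves, over the domain."""
    f = pointwise_density(curves, h)           # shape (n, m)
    ratio = f / f.max(axis=0, keepdims=True)   # divide by max density at each t
    return np.trapz(ratio, t, axis=1)          # one depth value per curve
```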
The resulting DKD value for each curve gives a measure of its centrality:
- Curves with higher DKD values are more central to the data set.
- Curves with lower DKD values are possible outliers.
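A simple way to act on these values is to flag the curves in the lowest tail of the depth distribution (the 5% quantile cutoff below is an illustrative convention, not part of DKD itself):

```python
depths = density_kernel_depth(curves, t, h)    # h from the bandwidth selection

threshold = np.quantile(depths, 0.05)          # lowest 5% of depth values
outlier_idx = np.where(depths <= threshold)[0]
print("Possible outlier curves:", outlier_idx) # likely includes the injected ones
```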
DKD offers several advantages:
- Flexibility: DKD makes no strong assumptions about the underlying distribution of the data, making it versatile for various functional data structures.
- Interpretability: by providing a depth value for each curve, DKD makes it intuitive to understand which curves are central and which are possible outliers.
- Efficiency: despite its complexity, DKD is computationally efficient, making it feasible for large functional data sets.
Imagine a scenario where a data scientist analyzes the heart rate curves of patients over 24 hours. Traditional outlier detection could flag occasional elevated heart rate readings as outliers. However, with analysis of functional data using DKD, entire curves of abnormal heart rate (perhaps indicating arrhythmias) can be detected, providing a more holistic view of the patient’s health.
As data continues to grow in complexity, the tools and techniques used to analyze it must evolve alongside it. Density Kernel Depth offers a promising approach to navigating the intricate landscape of functional data, ensuring that data scientists can confidently detect outliers and derive meaningful insights from them. While DKD is just one of many tools in a data scientist’s arsenal, its potential in functional data analysis is undeniable, and it will pave the way for more sophisticated analysis techniques in the future.
Kulbir Singh is a distinguished leader in the field of data science and analytics, with more than two decades of experience in information technology. His experience is multifaceted, encompassing leadership, data analytics, machine learning, artificial intelligence (AI), innovative solution design, and problem solving. Currently, Kulbir holds the position of Health Information Manager at Elevance Health. Passionate about the advancement of AI, Kulbir founded AIboard.io, an innovative platform dedicated to creating educational content and courses focused on AI and healthcare.