Outlier analysis
- Outliers
- Global: deviates from the rest of the data set
- Contextual: deviates significantly with respect to a specific context of the object
- Collective: objects as a whole deviate significantly from the entire data set
- Categorization based on supervision
- Supervised
- Unsupervised
- Semi-supervised
- Mining contextual outliers
- transforming contextual to conventional
- Modeling normal behavior
- Mining collective outliers
- Exploring the structure of the data
- high-dimensional data
- dimensionality reduction
- partition the original feature space into small regions
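The idea of transforming contextual outliers into conventional ones can be sketched as follows: group the objects by their context attribute, then run an ordinary outlier test inside each group. This is only a minimal illustration; the function name `contextual_outliers`, the z-score test, the `threshold` value, and the temperature data are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_outliers(records, threshold=2.0):
    """Flag (context, value) pairs whose value is unusual within its own context.

    `threshold` is a hypothetical cut-off on the within-context z-score.
    """
    # Group the behavioral attribute (value) by the contextual attribute.
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        values = by_context[context]
        mu, sigma = mean(values), pstdev(values)
        # Conventional (global) test, applied inside one context only.
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Made-up temperatures: 28 is ordinary in summer but unusual in winter.
data = [("winter", t) for t in (1, 2, 0, 3, 1, 2, 28)] + \
       [("summer", t) for t in (27, 29, 28, 30, 26, 28, 27)]
print(contextual_outliers(data))
```

The same value 28 appears in both groups, but only the winter reading is reported, which is what makes it a contextual rather than a global outlier.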
Statistical
- Parametric approaches
- Univariate
- mean + standard deviation:
- median + 1.5 * IQR (interquartile range)
- Grubbs' test: z-score + t-distribution
- Multivariate
- Transform to univariate: reduce each object to a single value, then apply a univariate test
- Chi-square analysis
- modeling the data with multiple parametric distributions
- Non-parametric
- histogram
- kernel density estimation
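The two simplest univariate checks in the list above can be sketched in a few lines. This is a rough illustration, not a reference implementation: the multipliers `k=3.0` and `k=1.5` are the conventional choices and are assumptions here, the IQR rule is written in its common box plot form (fences measured from the first and third quartiles), and the data are made up.

```python
from statistics import mean, stdev, quantiles

def zscore_outliers(xs, k=3.0):
    """Values more than k sample standard deviations away from the mean."""
    mu, sigma = mean(xs), stdev(xs)
    return [x for x in xs if abs(x - mu) > k * sigma]

def iqr_outliers(xs, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (box plot rule)."""
    q1, _, q3 = quantiles(xs, n=4)  # quartiles, Python 3.8+
    iqr = q3 - q1
    return [x for x in xs if x < q1 - k * iqr or x > q3 + k * iqr]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]
print(zscore_outliers(data))
print(iqr_outliers(data))
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so on small samples the z-score rule can miss an outlier that the quartile-based rule still catches.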
Proximity-based approaches
Assumption: the proximity of an outlier object to its nearest neighbors significantly deviates from the proximity of the object to most of the other objects in the data set.
- Distance-based: global view
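- $DB(r, \pi)$-outlier: an object $o$ is an outlier if at most a fraction $\pi$ of the objects in the data set lie within distance $r$ of $o$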
- CELL (grid-based)
- Density-based
- LOF
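- k-distance neighborhood: $N_k(o) = \{o' \in D \setminus \{o\} : dist(o, o') \le dist_k(o)\}$, where $dist_k(o)$ is the distance from $o$ to its $k$-th nearest neighbor
- Reachability distance: $reachdist_k(o, o') = \max\{dist_k(o'), dist(o, o')\}$
- Local reachability density: $lrd_k(o) = |N_k(o)| \,/\, \sum_{o' \in N_k(o)} reachdist_k(o, o')$
- Local outlier factor (the larger, the more abnormal): $LOF_k(o) = \frac{1}{|N_k(o)|} \sum_{o' \in N_k(o)} \frac{lrd_k(o')}{lrd_k(o)}$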
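A brute-force sketch of both proximity-based views may help make the definitions concrete. The distance-based check uses a radius and a fraction threshold (named `r` and `frac` here), and the density-based part follows the usual LOF definitions built from the k-distance, reachability distance, and local reachability density. The parameter values and the sample points are assumptions for illustration, the object itself is not counted among its own neighbors, and the code does not handle duplicate points (which would make a density infinite).

```python
import math

def db_outliers(points, r, frac):
    """Indices of objects with at most `frac` of the data within distance r."""
    n = len(points)
    flagged = []
    for i, o in enumerate(points):
        close = sum(1 for j, p in enumerate(points)
                    if j != i and math.dist(o, p) <= r)
        if close / n <= frac:
            flagged.append(i)
    return flagged

def k_distance_and_neighbors(points, i, k):
    """Return (k-distance of point i, indices in its k-distance neighborhood)."""
    d = sorted((math.dist(points[i], p), j)
               for j, p in enumerate(points) if j != i)
    k_dist = d[k - 1][0]
    # Ties at the k-distance can give more than k neighbors.
    return k_dist, [j for dj, j in d if dj <= k_dist]

def lof_scores(points, k=3):
    n = len(points)
    info = [k_distance_and_neighbors(points, i, k) for i in range(n)]

    def lrd(i):
        nbrs = info[i][1]
        # Reachability distance of i w.r.t. neighbor j:
        # max(k-distance of j, dist(i, j)).
        total = sum(max(info[j][0], math.dist(points[i], points[j]))
                    for j in nbrs)
        return len(nbrs) / total

    lrds = [lrd(i) for i in range(n)]
    # LOF: average ratio of the neighbors' density to the object's own density.
    return [sum(lrds[j] for j in info[i][1]) / (len(info[i][1]) * lrds[i])
            for i in range(n)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
print(db_outliers(points, r=2.0, frac=0.1))
print([round(s, 2) for s in lof_scores(points, k=3)])
```

Scores close to 1 mean an object is about as dense as its neighbors; the isolated point at (5, 5) gets a score several times larger.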
Clustering-based
- basic ideas
- an outlier does not belong to any cluster
- the distance between an outlier and its closest cluster is large
- all objects in a small and sparse cluster can be considered as outliers
- CBLOF
- Find clusters and sort according to decreasing size
- Identify "large" clusters using a preset percentage of the entire data
- for points in a large cluster: CBLOF = size of the cluster × similarity between the point and the cluster
- for points in a small cluster: CBLOF = size of the cluster × similarity between the point and the closest large cluster
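The CBLOF steps above can be sketched as follows, assuming the clustering has already been done and each point carries a cluster label. Several choices here are assumptions rather than part of the notes: the function name `cblof_scores`, the coverage parameter `alpha=0.9`, representing each cluster by its centroid, and using `1 / (1 + distance)` as the similarity. Low scores indicate likely outliers.

```python
import math
from collections import defaultdict

def cblof_scores(points, labels, alpha=0.9):
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    def centroid(members):
        dims = len(points[0])
        return tuple(sum(points[i][d] for i in members) / len(members)
                     for d in range(dims))

    centroids = {lab: centroid(m) for lab, m in clusters.items()}

    # Step 1: sort clusters by decreasing size.
    ordered = sorted(clusters, key=lambda lab: len(clusters[lab]), reverse=True)

    # Step 2: the leading clusters that together cover `alpha` of the data
    # are treated as "large"; the rest are "small".
    large, covered = [], 0
    for lab in ordered:
        if covered >= alpha * len(points):
            break
        large.append(lab)
        covered += len(clusters[lab])

    def similarity(p, c):
        return 1.0 / (1.0 + math.dist(p, c))  # assumed similarity measure

    scores = []
    for idx, lab in enumerate(labels):
        size = len(clusters[lab])
        if lab in large:
            sim = similarity(points[idx], centroids[lab])
        else:
            # Small cluster: compare against the closest large cluster.
            sim = max(similarity(points[idx], centroids[l]) for l in large)
        scores.append(size * sim)
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.5, 0.2),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
labels = ["A"] * 6 + ["B"] * 4 + ["C"]
scores = cblof_scores(points, labels)
print(min(range(len(scores)), key=scores.__getitem__))  # index of lowest score
```

The single-member cluster "C" is both small and far from every large cluster, so both factors of its score are small.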
Classification-based
- outlier vs. normal: severely imbalanced distribution
- One-class SVM
- ν-SVM: separating the "normal" data from the origin with a margin in a feature space
- SVDD: constraining the "normal" data in a ball with a relatively small radius
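For reference, the two one-class formulations are usually written as the optimization problems below. The notation is an assumption, not taken from these notes: $\Phi$ is the feature map, $n$ the number of training objects, $\xi_i$ slack variables, $\nu \in (0, 1]$ and $C > 0$ trade-off parameters, $\rho$ the margin offset, and $a$, $R$ the center and radius of the ball.

```latex
% nu-SVM: separate the data from the origin with margin rho
\[
\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\quad\text{s.t.}\quad
\langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,\qquad \xi_i \ge 0
\]

% SVDD: enclose the data in a ball of center a and radius R
\[
\min_{R,\,a,\,\xi}\ R^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
\lVert \Phi(x_i) - a\rVert^{2} \le R^{2} + \xi_i,\qquad \xi_i \ge 0
\]
```

In both cases a new object is flagged as an outlier when it violates the learned boundary: $\langle w, \Phi(x)\rangle < \rho$ for ν-SVM, or $\lVert \Phi(x) - a\rVert^{2} > R^{2}$ for SVDD.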
Isolation-based
- iForest: outliers are few and different. Thus, when the space is randomly split into small regions, an outlier is more likely to be ISOLATED after only a few splits
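The isolation idea can be sketched with a small pure-Python forest of random trees: each tree splits on a random attribute at a random value, and objects that end up alone after few splits get a high score. The scoring formula $2^{-E[h(x)]/c(\psi)}$ and the normalizer `c` follow the usual isolation forest description; the function names, the parameter defaults, and the synthetic data are assumptions for this example, and the sketch expects at least two points.

```python
import math
import random

def c(n):
    """Average path length of an unsuccessful search in a BST of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))      # random attribute
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                             # cannot split on this attribute
        return {"size": len(points)}
    split = rng.uniform(lo, hi)              # random split value
    return {
        "dim": dim, "split": split,
        "left": build_tree([p for p in points if p[dim] < split],
                           depth + 1, max_depth, rng),
        "right": build_tree([p for p in points if p[dim] >= split],
                            depth + 1, max_depth, rng),
    }

def path_length(x, node, depth=0):
    if "size" in node:
        # Unfinished subtrees are credited with their expected extra depth.
        return depth + c(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

def isolation_scores(points, n_trees=100, sample_size=256, seed=0):
    rng = random.Random(seed)
    psi = min(sample_size, len(points))      # subsample size per tree
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(points, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    scores = []
    for x in points:
        avg = sum(path_length(x, t) for t in trees) / n_trees
        scores.append(2.0 ** (-avg / c(psi)))  # near 1: easily isolated
    return scores

gen = random.Random(1)
points = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(60)] + [(8.0, 8.0)]
scores = isolation_scores(points)
print(scores.index(max(scores)))  # index of the most easily isolated point
```

Scores well above 0.5 suggest an object is isolated much faster than average, while scores around or below 0.5 are typical of normal objects.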