Data Mining

5. Outlier Analysis

2020-02-06

Outlier analysis

  • Outliers
    • Global: deviates from the rest of the data set
    • Contextual: deviates significantly with respect to a specific context of the object
    • Collective: objects as a whole deviate significantly from the entire data set
  • Categorization based on supervision
    • Supervised
    • Unsupervised
    • Semi-supervised
  • Mining contextual outliers
    • Transforming contextual outlier detection to conventional outlier detection
    • Modeling normal behavior
  • Mining collective outliers
    • Exploring the structure of the data
  • High-dimensional data
    • Dimensionality reduction
    • Partition the original feature space into small regions

Statistical

  • Parametric approaches
    • Univariate
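      • Mean + standard deviation: $\mu\pm3\sigma$
      • IQR (interquartile range): values below $Q_1 - 1.5\,\text{IQR}$ or above $Q_3 + 1.5\,\text{IQR}$
      • Grubbs' test: z-score + t-distribution
    • Multivariate
      • Transform to univariate: detect outliers in the univariate set $\{d(o,\overline{o}) \mid o\in D\}$ of distances to the mean $\overline{o}$
      • Chi-square analysis ($\chi^2$ statistic)
      • Modeling the data with a mixture of parametric distributions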
  • Non-parametric
    • histogram
    • kernel density estimation
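
The two simplest univariate rules above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the standard deviation is the population one, and the IQR rule is assumed to use the common boxplot fences $Q_1 - 1.5\,\text{IQR}$ and $Q_3 + 1.5\,\text{IQR}$.

```python
import statistics


def three_sigma_outliers(values, k=3.0):
    """Return the values farther than k standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return [v for v in values if abs(v - mu) > k * sigma]


def iqr_outliers(values, k=1.5):
    """Return the values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Illustrative data: 21 values close to 10 and one extreme value.
sample = [9, 10, 11] * 7 + [100]
flagged_by_sigma = three_sigma_outliers(sample)
flagged_by_iqr = iqr_outliers(sample)
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so on very small samples the 3σ rule may flag nothing; the quartile-based rule is less sensitive to this.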

Proximity-based approaches

The proximity of an outlier to its nearest neighbors deviates significantly from the proximity of most other objects in the data set to their nearest neighbors.

  • Distance-based: global view
    • $\text{DB}(r,\pi)$-outlier: an object $o$ with $\frac{\|\{o' \mid \text{dist}(o,o')\leq r\}\|}{\|D\|}\leq\pi$
    • CELL (grid-based)
  • Density-based
    • LOF
      • k-distance neighborhood: $N_k(o)=\{o' \mid o'\in D,\ d(o,o')\leq d_k(o)\}$, where $d_k(o)$ is the distance from $o$ to its $k$-th nearest neighbor
      • Reachability distance: $\text{reachdist}_k(o\leftarrow o')=\max\{d_k(o),\,d(o,o')\}$
      • Local reachability density: $\text{lrd}_k(o)=\frac{\|N_k(o)\|}{\sum_{o'\in N_k(o)}\text{reachdist}_k(o'\leftarrow o)}$
      • Local outlier factor (the larger, the more abnormal): $\text{LOF}_k(o)=\frac{1}{\|N_k(o)\|}\sum_{o'\in N_k(o)}\frac{\text{lrd}_k(o')}{\text{lrd}_k(o)}$
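
The distance-based and density-based definitions above can be sketched with a brute-force O(n²) implementation. The code below is only an illustration under stated assumptions: the function names are hypothetical, the distance is Euclidean, the object itself is not counted as its own neighbor, and LOF is taken in its standard form (the average ratio of the neighbors' lrd to the object's own lrd).

```python
import math


def db_outliers(data, r, pi):
    """Return the objects whose fraction of other objects within distance r is <= pi."""
    n = len(data)
    result = []
    for i, o in enumerate(data):
        close = sum(1 for j, p in enumerate(data) if j != i and math.dist(o, p) <= r)
        if close / n <= pi:
            result.append(o)
    return result


def lof_scores(data, k):
    """Return LOF_k for every object; scores well above 1 suggest outliers."""
    n = len(data)
    d = [[math.dist(a, b) for b in data] for a in data]

    k_dist, neigh = [], []
    for i in range(n):
        others = sorted(d[i][j] for j in range(n) if j != i)
        dk = others[k - 1]                      # d_k(o): distance to the k-th nearest neighbor
        k_dist.append(dk)
        neigh.append([j for j in range(n) if j != i and d[i][j] <= dk])  # N_k(o), ties kept

    def reach(i, j):
        # reachdist_k(o_i <- o_j) = max{d_k(o_i), d(o_i, o_j)}
        return max(k_dist[i], d[i][j])

    # lrd_k(o) = |N_k(o)| / sum over o' in N_k(o) of reachdist_k(o' <- o)
    # (duplicate points can make the sum zero; not handled in this sketch)
    lrd = [len(neigh[i]) / sum(reach(j, i) for j in neigh[i]) for i in range(n)]

    # LOF_k(o) = average of lrd_k(o') / lrd_k(o) over o' in N_k(o)
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i]) for i in range(n)]


# Illustrative data: a tight square of four points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
global_outliers = db_outliers(points, r=2, pi=0.1)
scores = lof_scores(points, k=2)
```

On this toy data the four corner points have an LOF of 1 (their density matches their neighbors'), while the distant point scores far above 1.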

Clustering-based

  • basic ideas
    • An outlier does not belong to any cluster
    • The distance between an outlier and its closest cluster is large
    • All objects in a small, sparse cluster can be considered outliers
  • CBLOF
    • Find clusters and sort according to decreasing size
    • Identify “large” cluster using a preset percentage of the entire data
    • For points in a large cluster: CBLOF = size of the cluster $\times$ similarity between the point and the cluster
    • For points in a small cluster: CBLOF = size of the cluster $\times$ similarity between the point and the closest large cluster
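
The CBLOF steps above can be sketched as follows. The notes do not fix a clustering algorithm or a similarity measure, so this sketch makes two assumptions that are not from the source: cluster labels are already given (from any clustering method), and similarity is taken as $1/(1+\text{distance to the cluster centroid})$. The function name and the `large_fraction` parameter are hypothetical. Under this convention, a lower score means a more outlying point.

```python
import math
from collections import defaultdict


def cblof_scores(points, labels, large_fraction=0.9):
    """Score each point as (size of its cluster) x (similarity to a large cluster)."""
    clusters = defaultdict(list)
    for p, c in zip(points, labels):
        clusters[c].append(p)

    # Centroid of each cluster (coordinate-wise mean).
    centroids = {c: tuple(sum(x) / len(ps) for x in zip(*ps)) for c, ps in clusters.items()}

    # Sort clusters by decreasing size; the "large" clusters are the first ones
    # that together cover the preset fraction of the data.
    order = sorted(clusters, key=lambda c: len(clusters[c]), reverse=True)
    large, covered = set(), 0
    for c in order:
        if covered >= large_fraction * len(points):
            break
        large.add(c)
        covered += len(clusters[c])

    def sim(p, c):
        # Assumed similarity: decreases with the distance to the centroid.
        return 1.0 / (1.0 + math.dist(p, centroids[c]))

    scores = []
    for p, c in zip(points, labels):
        # Large cluster: compare with own cluster; small cluster: closest large cluster.
        target = c if c in large else max(large, key=lambda g: sim(p, g))
        scores.append(len(clusters[c]) * sim(p, target))
    return scores


# Illustrative data: two dense clusters and one singleton cluster.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 20)]
lbl = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2]
scores = cblof_scores(pts, lbl)
```

Here the singleton at (5, 20) gets the lowest score: its cluster has size 1 and it is far from both large clusters.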

Classification-based

  • Outlier vs. normal: severely imbalanced class distribution
  • One-class SVM
    • ν-SVM: separating the “normal” data from the origin with a margin in a feature space
    • SVDD: constraining the “normal” data in a ball with a relatively small radius
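
For reference, the two one-class formulations are usually written as the optimization problems below. The symbols are the conventional ones and do not appear in the notes: $\Phi$ is the feature map, $n$ the number of training objects, $\xi_i$ the slack variables, $\nu$ and $C$ the trade-off parameters, $\rho$ the offset from the origin, and $a$, $R$ the center and radius of the ball.

```latex
% One-class nu-SVM: separate the data from the origin with margin rho/||w||
\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\quad\text{s.t.}\quad \langle w, \Phi(x_i)\rangle \geq \rho - \xi_i,\quad \xi_i \geq 0

% SVDD: enclose the data in a ball of center a and small radius R
\min_{R,\,a,\,\xi}\ R^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad \|\Phi(x_i) - a\|^2 \leq R^2 + \xi_i,\quad \xi_i \geq 0
```

In both cases an object that violates the learned boundary (negative $\langle w, \Phi(x)\rangle - \rho$, or a distance to $a$ larger than $R$) is reported as an outlier.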

Isolation-based

  • iForest: outliers are few and different. Thus, when the space is randomly split into small regions, an outlier is more likely to be isolated after only a few splits
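
The isolation idea can be illustrated with a simplified sketch that measures how many random axis-parallel splits are needed to isolate a point. This is not the full iForest algorithm: it skips subsampling, does not store trees, and does not normalize path lengths into an anomaly score. The function names and parameters are hypothetical.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Number of random splits needed until `point` is alone in its region."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only dimensions on which the remaining points still differ can be split.
    dims = [k for k in range(len(point))
            if min(p[k] for p in data) != max(p[k] for p in data)]
    if not dims:
        return depth
    dim = rng.choice(dims)
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    split = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the split as `point`.
    same_side = [p for p in data if (p[dim] < split) == (point[dim] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def average_path_length(point, data, n_trees=100, seed=0):
    """Average isolation depth of `point` over n_trees random split sequences."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Illustrative data: a 7x7 grid in the unit square plus one distant point.
grid = [(i / 6, j / 6) for i in range(7) for j in range(7)]
data = grid + [(10.0, 10.0)]
outlier_depth = average_path_length((10.0, 10.0), data)
inlier_depth = average_path_length((0.5, 0.5), data)
```

The distant point is usually cut off by the very first split, whereas an interior grid point needs at least one split on each of its four sides, so its average path is much longer.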