Advanced Algorithm

3. Dimension Reduction

2019-09-04

Metric Embedding

  • Metric Space: $(X,d)$, where $X$ is a set and $d$ is a distance on $X$
  • Embedding: a map $\phi:X\rightarrow Y$ between two metric spaces $(X,d_X),(Y,d_Y)$
    • Distortion $\alpha\geq 1$: $\forall x,y\in X,\ \frac{1}{\alpha}d_X(x,y)\leq d_Y(\phi(x),\phi(y))\leq\alpha d_X(x,y)$
    • Dimension Reduction: map $x_1,\cdots,x_n\in\mathbb{R}^d$ to $y_1,\cdots,y_n\in\mathbb{R}^k$ with $k\ll d$

JLT Embedding

  • Johnson-Lindenstrauss Theorem (JLT, 1984): it is always possible to embed $n$ points in arbitrary dimension into $O(\log n)$ dimensions with constant distortion in Euclidean space
    • $\forall 0<\epsilon<1$, $\forall S\subset\mathbb{R}^d$ with $|S|=n$, $\exists k=O(\epsilon^{-2}\log n)$ and $\phi:\mathbb{R}^d\rightarrow\mathbb{R}^k$ such that $\forall x,y\in S,\ (1-\epsilon)\|x-y\|^2\leq\|\phi(x)-\phi(y)\|^2\leq(1+\epsilon)\|x-y\|^2$
    • (linear embedding): $\phi(x)=Ax$ for a matrix $A\in\mathbb{R}^{k\times d}$
    • (2016) Lower bound: $\Omega(\epsilon^{-2}\log n)$ dimensions are necessary, so the JLT is optimal
  • Construction
    • projection onto a uniform random $k$-dimensional subspace of $\mathbb{R}^d$ (1999)
    • random matrix with i.i.d. $\pm 1$ entries (2003)
    • random matrix with i.i.d. Gaussian entries (1998): each entry of $A\in\mathbb{R}^{k\times d}$ is drawn independently from $\mathcal{N}(0,\frac{1}{k})$
      • To prove: for every unit vector $u$, $P\left(\left|\|Au\|_2^2-1\right|>\epsilon\right)<\frac{1}{n^3}$, then union-bound over the $\binom{n}{2}$ pairwise difference directions
      • $\|Au\|^2=\sum_{i=1}^k Y_i^2$, where $Y_i\sim\mathcal{N}(0,\frac{1}{k})$
      • apply a Chernoff bound for the $\chi^2$-distribution
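The Gaussian construction can be sketched in a few lines of NumPy. This is a toy check rather than the proof: the constant 8 in the choice of $k$ and the random seed are illustrative choices, and at this $k$ a small fraction of pairs may still violate the $(1\pm\epsilon)$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, eps = 50, 10_000, 0.5
# Target dimension k = O(eps^-2 log n); the constant 8 is an
# illustrative choice, not the optimal constant from the theorem.
k = int(np.ceil(8 * np.log(n) / eps**2))

X = rng.normal(size=(n, d))                          # n points in R^d
A = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, d))   # i.i.d. N(0, 1/k) entries
Y = X @ A.T                                          # linear embedding phi(x) = Ax

# Count pairs whose squared distance is preserved up to (1 +/- eps).
good, total = 0, 0
for i in range(n):
    for j in range(i + 1, n):
        orig = np.sum((X[i] - X[j]) ** 2)
        emb = np.sum((Y[i] - Y[j]) ** 2)
        good += (1 - eps) * orig <= emb <= (1 + eps) * orig
        total += 1
print(f"embedded {d} -> {k} dims; pairs preserved: {good}/{total}")
```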

Nearest Neighbor Search (NNS)

Problem

  • NNS
    • Data: $y_1,\cdots,y_n\in X$
    • Query: $x\in X$
    • Answer: the $y_i$ closest to $x$
  • $c$-ANN (Approximate Nearest Neighbor)
    • Answer: find a $y_i$ such that $\text{dist}(x,y_i)\leq c\min_{1\leq j\leq n}\text{dist}(x,y_j)$
  • $(c,r)$-ANN (Approximate Near Neighbor): the answer is

$$\begin{cases}y_{i^*}\in S,\ \text{dist}(x,y_{i^*})\leq cr & \exists y_i\in S,\text{dist}(x,y_i)\leq r\\ \perp & \forall y_i\in S,\text{dist}(x,y_i)>r\\ \text{arbitrary} & \text{o.w.}\end{cases}$$

  • From $(c,r)$-ANN to $c$-ANN
    • Definitions
      • $D_{\min}=\min_{i\neq j}\text{dist}(y_i,y_j)$
      • $D_{\max}=\max_{i,j}\text{dist}(y_i,y_j)$
      • $R=\frac{D_{\max}}{D_{\min}}$ (the aspect ratio)
    • Suppose $\forall r$, $\exists$ a data structure for $(c,r)$-ANN with
      • space $s$
      • answer time $t$
      • success probability $1-\delta$
    • Then $\exists$ a data structure for $c$-ANN with
      • space $O(s\log_c R)$
      • answer time $O(t\log\log_c R)$
      • success probability $1-O(\delta\log\log_c R)$
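The reduction can be sketched as follows, with a brute-force scan standing in for the assumed $(c,r)$-ANN structure (all class and function names here are hypothetical). Note the coarse linear scan over scales spaced by a factor of $c$ only guarantees a $c^2$-approximation; the binary search over the $O(\log_c R)$ scales is what yields the stated bounds.

```python
import numpy as np
from itertools import combinations

class NearNeighbor:
    """Brute-force stand-in for an assumed (c, r)-ANN structure: if some
    point lies within distance r of the query, it returns a point within
    c*r; otherwise it may return None."""
    def __init__(self, points, r, c):
        self.points, self.r, self.c = points, r, c

    def query(self, x):
        for y in self.points:
            if np.linalg.norm(x - y) <= self.c * self.r:
                return y
        return None

def build_scales(points, c):
    """One (c, r)-ANN structure per geometric scale r = D_min * c^i,
    i.e. O(log_c R) structures for aspect ratio R = D_max / D_min."""
    dists = [np.linalg.norm(a - b) for a, b in combinations(points, 2)]
    d_min, d_max = min(dists), max(dists)
    scales, r = [], d_min
    while r <= d_max * c:      # assumes c > 1
        scales.append(NearNeighbor(points, r, c))
        r *= c
    return scales

def c_ann_query(scales, x):
    """Answer at the smallest scale that succeeds. The linear scan is for
    clarity and gives only a c^2-approximation; binary search over the
    sorted scales gives the O(t log log_c R) query time in the notes."""
    for s in scales:
        y = s.query(x)
        if y is not None:
            return y
    return None
```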

Deterministic

  • Dictionary data structure
    • k-d tree
    • Voronoi diagram
  • Curse of dimensionality
    • conjecture: exact NNS in high dimension requires either super-polynomial space or super-polynomial query time
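As a concrete instance of the deterministic structures listed above, here is a minimal k-d tree nearest-neighbor sketch (round-robin axis splitting and plane-distance pruning are the textbook scheme; the names are illustrative):

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Split on coordinate axes in round-robin order, median at the root."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, x, best=None):
    """Exact NNS. In high dimension the pruning rule stops helping and the
    search degrades toward a full scan (the curse of dimensionality)."""
    if node is None:
        return best
    p, axis = node["point"], node["axis"]
    if best is None or np.linalg.norm(x - p) < np.linalg.norm(x - best):
        best = p
    near, far = ((node["left"], node["right"]) if x[axis] < p[axis]
                 else (node["right"], node["left"]))
    best = nearest(near, x, best)
    # Visit the far side only if the splitting plane is closer than best.
    if abs(x[axis] - p[axis]) < np.linalg.norm(x - best):
        best = nearest(far, x, best)
    return best
```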

Dimension Reduction

  • Goal: solve $(c,r)$-ANN in the Hamming space $x\in\{0,1\}^d$, $d\gg\log n$, w.h.p.
  • Data Structure
    • random Boolean matrix $A\in\{0,1\}^{k\times d}$ with i.i.d. Bernoulli($p$) entries
      • project $z_i=Ay_i\in\{0,1\}^k$ over the finite field GF(2)
    • $s$-balls: precompute $B_s(u)=\{y_i\mid\text{dist}(u,z_i)\leq s\}$ for every $u\in\{0,1\}^k$ (a lookup table)
    • space: $O(n2^k)$
  • Answer: return any $y_i\in B_s(Ax)$ ("no" if the ball is empty)
    • query time: $O(kd)+O(1)$
  • Choice of $k,p,s$
    • $k=\frac{\ln n}{(\frac{1}{8}-2^{-(c+2)})^2}=O(\log n)$
    • $p=\frac{1}{2}-2^{-1-1/r}$
    • $s=(\frac{3}{8}-2^{-(c+2)})k$
    • space: $n^{O(1)}$
    • query time: $O(d\log n)$
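A toy version of this table-based structure (function names, the seed, and the tiny parameters in the usage below are illustrative, not the $k,p,s$ chosen above; tabulating all $2^k$ balls is only feasible for small $k$):

```python
import numpy as np

rng = np.random.default_rng(1)

def build(points, k, p, s):
    """Random Boolean matrix A (i.i.d. Bernoulli(p)), project over GF(2),
    then tabulate the s-ball around every u in {0,1}^k. Space O(n 2^k)."""
    d = len(points[0])
    A = (rng.random((k, d)) < p).astype(int)
    z = [(A @ y) % 2 for y in points]                 # z_i = A y_i over GF(2)
    table = {}
    for u in range(2 ** k):
        u_bits = np.array([(u >> b) & 1 for b in range(k)])
        table[u] = [i for i, zi in enumerate(z)
                    if np.sum(u_bits != zi) <= s]     # Hamming dist <= s
    return A, table

def query(A, table, points, x):
    """Return any y_i in B_s(Ax), or None if the ball is empty ("no")."""
    u_bits = (A @ x) % 2
    u = int(sum(int(b) << i for i, b in enumerate(u_bits)))
    bucket = table[u]
    return points[bucket[0]] if bucket else None
```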

Locality Sensitive Hashing

  • $(r,cr,p,q)$-LSH: a hash family $h:X\rightarrow U$ satisfying, $\forall x,y\in X$,
    • $\text{dist}(x,y)\leq r\Rightarrow P(h(x)=h(y))\geq p$
    • $\text{dist}(x,y)>cr\Rightarrow P(h(x)=h(y))\leq q$
  • Amplification: $\exists(r,cr,p,q)$-LSH $\Rightarrow\exists(r,cr,p^k,q^k)$-LSH
    • concatenate $g(x)=(h_1(x),\cdots,h_k(x))\in U^k$ for independent $h_1,\cdots,h_k$
    • choosing $k=\log_{1/q}n$ gives $p^k=n^{-\frac{\log p}{\log q}}$ and $q^k=\frac{1}{n}$
  • With an $(r,cr,p^*,\frac{1}{n})$-LSH $g$, where $p^*=p^k=n^{-\frac{\log p}{\log q}}$:
    • Algorithm 1:
      • space: $O(n)$
        • store $y_1,\cdots,y_n$ sorted in nondecreasing order of $g(y_i)$
      • time: $O(\log n)+O(1)$ in expectation
        • binary-search for all $y_i$ with $g(x)=g(y_i)$ and check $\text{dist}(x,y_i)$ for each
      • real answer "no": always correct
      • real answer not "no": succeeds with probability $\geq p^*$
    • Algorithm 2:
      • space: $O(\frac{n}{p^*})$
        • draw independent $g_1,\cdots,g_{1/p^*}$
        • store $y_1,\cdots,y_n$ in table $j$ sorted in nondecreasing order of $g_j(y_i)$
      • time: $O(\frac{\log n}{p^*})$
        • examine at most $\frac{10}{p^*}$ points $y_i$ with $g_j(x)=g_j(y_i)$ for some $j$
      • real answer "no": always correct
      • real answer not "no": succeeds with probability $>\frac{1}{2}$
  • Overall: solves $(c,r)$-ANN with space $O(n^{1+\frac{\log p}{\log q}})$, query time $O(n^{\frac{\log p}{\log q}}\log n)$, and one-sided error $<0.5$
  • Hamming space: $h(x)=x_i$ for a uniformly random $i\in[d]$ is an $(r,cr,1-\frac{r}{d},1-\frac{cr}{d})$-LSH
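The bit-sampling LSH with amplification and multiple tables (the scheme of Algorithm 2) can be sketched as follows; the values of $k$, $L$, and the seed are illustrative rather than the optimized choices above:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_g(d, k):
    """Amplified hash g(x) = (x_{i_1}, ..., x_{i_k}): a concatenation of
    k bit-sampling hashes h(x) = x_i with uniform i in [d]."""
    idx = rng.integers(0, d, size=k)
    return lambda x: tuple(x[idx])

def build_tables(points, d, k, L):
    """L independent hash tables, one per g_j."""
    gs = [make_g(d, k) for _ in range(L)]
    tables = []
    for g in gs:
        t = {}
        for i, y in enumerate(points):
            t.setdefault(g(y), []).append(i)
        tables.append(t)
    return gs, tables

def near_query(points, gs, tables, x, r, c):
    """Return some point within distance c*r of x if a point within r
    exists (with good probability over the hash draws); else None."""
    for g, t in zip(gs, tables):
        for i in t.get(g(x), []):
            if np.sum(points[i] != x) <= c * r:   # verify the candidate
                return points[i]
    return None
```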