Advanced Algorithm

2. Space-saving

2019-01-16

Problems

  • Checking Identity
    • $\text{EQ}(a,b)=[a=b]$ for bit-strings $a,b\in\{0,1\}^n$
    • Cost measure: bits of communication between the two parties
  • Checking Distinctness
    • Input: multisets $A,B$ with elements from $\{1,2,\cdots,n\}$
    • Determine whether $A=B$ (multiset equivalence)
  • Counting Distinct Elements
    • Data: stream model; $n$ is unknown and elements are presented one at a time
    • Input: a sequence $x_1,x_2,\cdots,x_n\in\Omega$
    • Output: an estimate of $z=|\{x_1,x_2,\cdots,x_n\}|$
  • Set Membership
    • Data: a set of elements $S=\{x_1,\cdots,x_n\}\subseteq\Omega$
    • Input: $x\in\Omega$
    • Output: $[x\in S]$
  • Frequency Estimation
    • Data: a sequence of elements $x_1,\cdots,x_n\in\Omega$
    • Input: $x\in\Omega$
    • Output: $f_x=|\{i\mid x_i=x\}|$

Fingerprint

  • Use fingerprints to design one-sided-error Monte Carlo algorithms
    • $\text{FING}(\cdot)$: a function that is easy to compute and compare
    • $X=Y\Rightarrow\text{FING}(X)=\text{FING}(Y)$
    • $X\neq Y\Rightarrow P(\text{FING}(X)=\text{FING}(Y))$ is small

Communication Protocol

  • Theorem (Yao 1979): any deterministic communication protocol computing $\text{EQ}$ on two $n$-bit strings costs $n$ bits of communication in the worst case.
  • Fingerprinting by PIT (polynomial identity testing)
    • $\text{FING}(x)=\sum_{i=0}^{n-1}x_ir^i\bmod p$ for a prime $p$, with $r\in\mathbb{Z}_p$ chosen uniformly at random
    • $x\neq y\Rightarrow P(\text{FING}(x)=\text{FING}(y))\leq\frac{n-1}{p}$, since the degree-$(n-1)$ polynomial $\sum_i(x_i-y_i)r^i$ has at most $n-1$ roots in $\mathbb{Z}_p$
      • choosing $p\in[n^2,2n^2]$ gives $P\leq\frac{1}{n}$
    • communication cost: $O(\log p)=O(\log n)$
  • Fingerprinting by checksum
    • $\text{FING}(x)=x\bmod p$ for a uniform random prime $p\in[k]$, where $k=2n^2\ln n$
    • $x\neq y\Rightarrow P(\text{FING}(x)=\text{FING}(y))=P(p\text{ divides }|x-y|)\leq\frac{n}{\pi(k)}\leq\frac{1}{n}$
    • Proof technique
      • a number $N$ has at most $\log_2 N$ prime divisors, so $|x-y|<2^n$ has at most $n$ of them
      • number of primes in $[k]$: $\pi(k)\sim\frac{k}{\ln k}$ (prime number theorem)
    • communication cost: $O(\log k)=O(\log n)$
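Both fingerprint protocols above can be sketched in a few lines. This is a minimal illustration, not the canonical implementation: the function names are my own, and the default prime $p=101$ is just one prime lying in $[n^2,2n^2]$ for $n=8$-bit strings.

```python
import random

def fing_pit(x_bits, r, p):
    # Polynomial fingerprint: sum_i x_i * r^i mod p, evaluated by Horner's rule.
    acc = 0
    for b in reversed(x_bits):
        acc = (acc * r + b) % p
    return acc

def eq_by_pit(x_bits, y_bits, p=101):
    # Alice draws r uniformly from Z_p and sends (r, FING(x)): O(log p) bits.
    # Bob answers "equal" iff his own fingerprint matches.
    r = random.randrange(p)
    return fing_pit(x_bits, r, p) == fing_pit(y_bits, r, p)

def primes_up_to(k):
    # Sieve of Eratosthenes, used to draw the random prime for the checksum.
    sieve = [True] * (k + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(k ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, k + 1, i):
                sieve[j] = False
    return [i for i, is_p in enumerate(sieve) if is_p]

def eq_by_checksum(x_int, y_int, k):
    # Checksum fingerprint: compare x mod p and y mod p
    # for a uniform random prime p in [k].
    p = random.choice(primes_up_to(k))
    return x_int % p == y_int % p
```

If the strings are equal, either protocol always answers yes; if they differ, a single round errs with probability $\leq\frac{n-1}{p}$ (PIT) or $\leq\frac{n}{\pi(k)}$ (checksum).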

Checking distinctness

  • $\text{FING}(A)=f_A(r)=\prod_{i=1}^n(r-a_i)\bmod p$, with $r\in\mathbb{Z}_p$ chosen uniformly at random
    • $p$ a uniform random prime in $[(n\log n)^2,2(n\log n)^2]$
    • Theorem (Lipton 1989): $A\neq B\Rightarrow P(\text{FING}(A)=\text{FING}(B))=O(\frac{1}{n})$
    • Proof technique: condition on whether the two polynomials coincide over $\mathbb{Z}_p$
      • $P(\text{FING}(A)=\text{FING}(B))=P(f_A(r)=f_B(r)\mid f_A\not\equiv f_B)P(f_A\not\equiv f_B)+P(f_A(r)=f_B(r)\mid f_A\equiv f_B)P(f_A\equiv f_B)$
      • the first conditional probability is at most $\frac{n}{p}$ by PIT; $P(f_A\equiv f_B)$ is small by a checksum-style argument on the coefficient differences
    • space cost: $O(\log p)=O(\log n)$
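The multiset fingerprint can be sketched as follows. The name `probably_equal_multisets` and the fixed default prime `10007` are illustrative assumptions; the analysis above calls for a random prime in $[(n\log n)^2,2(n\log n)^2]$.

```python
import random

def multiset_fingerprint(elems, r, p):
    # f_A(r) = prod_i (r - a_i) mod p
    fp = 1
    for a in elems:
        fp = fp * (r - a) % p
    return fp

def probably_equal_multisets(A, B, p=10007, trials=1):
    # One-sided test: if A == B as multisets, always answers True;
    # if A != B, each trial answers True only with small probability.
    for _ in range(trials):
        r = random.randrange(p)
        if multiset_fingerprint(A, r, p) != multiset_fingerprint(B, r, p):
            return False
    return True
```

Only $O(\log p)=O(\log n)$ bits of state are needed per pass, since the product is maintained modulo $p$.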

Hashing

Distinct Elements

  • deterministic: $O(n)$ space
  • $(\epsilon,\delta)$-estimator $\hat Z$: $P((1-\epsilon)z\leq\hat Z\leq(1+\epsilon)z)\geq 1-\delta$
    • $\epsilon$: approximation error
    • $\delta$: confidence error
  • UHA (uniform hash assumption)
    • storing a truly random $h:[N]\rightarrow[M]$ takes $\Omega(N\log M)$ bits by information theory
    • SUHA (simple UHA): a uniform random function $h:[N]\rightarrow[M]$ is available and evaluating $h$ is efficient
  • Hashing estimator
    • $h:\Omega\rightarrow[0,1]$
    • compute $h(x_i)$: the distinct values are independent and uniform in $[0,1]$
    • $Z=\frac{1}{\min_ih(x_i)}-1$
      • $E(\min_{1\leq i\leq n}h(x_i))=E(\text{length of a subinterval})=\frac{1}{z+1}$, since $z$ uniform points cut $[0,1]$ into $z+1$ subintervals of equal expected length
      • drawback: $V(\min_ih(x_i))$ is too large
  • Flajolet–Martin algorithm (1985)
    • use $k$ independent hash functions and let $Y_j=\min_ih_j(x_i)$
    • $\overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j$, so $E(\overline{Y})=\frac{1}{z+1}$
    • $Z=\frac{1}{\overline{Y}}-1$
    • $k\geq\lceil\frac{4}{\epsilon^2\delta}\rceil$ suffices for any $\epsilon,\delta<\frac{1}{2}$
    • space cost: $O(\epsilon^{-2}\log n)$ bits
  • optimal (Kane, Nelson, Woodruff 2010): $\Theta(\epsilon^{-2}+\log n)$ bits

Sketch

  • sketch: a lossy representation of the data that tolerates a bounded error in answering queries

Set Membership

  • Dictionary data structures
    • balanced tree: $O(n\log|\Omega|)$ space, $O(\log n)$ query time
    • perfect hashing: $O(n\log|\Omega|)$ space, $O(1)$ query time
  • Bloom filter (1970)
    • $h_1,h_2,\cdots,h_k:\Omega\rightarrow[cn]$ are uniform and independent random hash functions
    • data structure: a bit array $A$ of $cn$ bits, i.e. $O(n)$ space
      • initially: all $0$
      • for each $x\in S$: set $A[h_i(x)]=1$ for $1\leq i\leq k$
    • query $x$: answer yes iff $A[h_i(x)]=1$ for all $1\leq i\leq k$
      • time cost: $O(k)$
    • one-sided error (RP-style)
      • $x\in S$: always answered correctly
      • $x\notin S$: false positives possible
        • $P(A[h_1(x)]=0)=(1-\frac{1}{cn})^{kn}\approx e^{-\frac{k}{c}}$
        • $P(\text{false positive})=(1-e^{-\frac{k}{c}})^k$
        • choosing $k=c\ln 2$ minimizes this: $P=(0.6185)^c$
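A minimal Bloom-filter sketch under the same stand-in for SUHA (SHA-256 seeded by the function index); the class name and the bit-packing into a `bytearray` are my own choices.

```python
import hashlib

class BloomFilter:
    """Bit array of m = c*n bits with k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, x):
        # k pseudo-independent positions in [m], one per hash function.
        for j in range(self.k):
            d = hashlib.sha256(f"{j}:{x}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def query(self, x):
        # Yes iff every probed bit is set; members always answer yes,
        # non-members answer yes only with small probability.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(x))
```

With $c=8$ bits per element and $k=\lfloor c\ln 2\rfloor=5$, the analysis above predicts a false-positive rate around $(0.6185)^8\approx 2\%$.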

Frequency Estimation

  • Additive error: $P(|\hat f_x-f_x|\leq\epsilon n)\geq 1-\delta$
    • heavy hitters: elements $x$ with $f_x>\epsilon n$
  • Count-min sketch (Cormode and Muthukrishnan, 2003)
    • Data structure: a $k\times m$ integer array $\text{CMS}[k][m]$ with uniform independent hash functions $h_1,\cdots,h_k:\Omega\rightarrow[m]$
      • initially: all entries $0$
      • for each $x_i$: increment $\text{CMS}[j][h_j(x_i)]$ for $1\leq j\leq k$
    • Query: $\hat f_x=\min_{1\leq j\leq k}\text{CMS}[j][h_j(x)]$
      • time cost: $O(k)=O(\log\frac{1}{\delta})$
    • Space cost: $O(km\log n)=O(\frac{1}{\epsilon}\log\frac{1}{\delta}\log n)$
    • Proof
      • $\text{CMS}[j][h_j(x)]\geq f_x$ always, so $\hat f_x$ never underestimates
      • $E(\text{CMS}[j][h_j(x)])\leq f_x+\frac{n}{m}$, since each of the other $\leq n$ items collides with probability $\frac{1}{m}$
      • by Markov's inequality and independence of the rows, $P(\hat f_x-f_x>\epsilon n)\leq(\frac{1}{\epsilon m})^k$
      • choose $m=\lceil\frac{e}{\epsilon}\rceil$ and $k=\lceil\ln\frac{1}{\delta}\rceil$, so the bound becomes $e^{-k}\leq\delta$
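The count-min sketch above translates almost directly into code. As in the earlier sketches, SHA-256 seeded by the row index stands in for the uniform hash family, and the class and method names are illustrative.

```python
import hashlib
import math

class CountMin:
    """k x m counter array with parameters m = ceil(e/eps), k = ceil(ln(1/delta))."""

    def __init__(self, eps, delta):
        self.m = math.ceil(math.e / eps)
        self.k = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.m for _ in range(self.k)]

    def _h(self, j, x):
        # Row j's hash of x into [m] (stand-in for a uniform hash function).
        d = hashlib.sha256(f"{j}:{x}".encode()).digest()
        return int.from_bytes(d[:8], "big") % self.m

    def add(self, x):
        # Increment one counter per row.
        for j in range(self.k):
            self.table[j][self._h(j, x)] += 1

    def estimate(self, x):
        # Each row overcounts, so the minimum is the tightest upper bound.
        return min(self.table[j][self._h(j, x)] for j in range(self.k))
```

The estimate never falls below the true frequency, and with probability $\geq 1-\delta$ it exceeds it by at most $\epsilon n$.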