8-Methodology

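Semi-supervised Learning

Self-training (also called self-teaching or bootstrapping): first train a model on the labeled data, then add the unlabeled samples it predicts with high confidence, together with their pseudo-labels, to the training set and retrain
Co-training: classifiers built on different views of the data help train each other
- Train two models $f_1$ and $f_2$ on the labeled training set, each on a different view
- Predict on the unlabeled set with both models, add the samples each model predicts with high confidence to the training set, and retrain the two view-specific models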
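The self-training loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the nearest-centroid classifier, the margin-based confidence score, and the names `centroid_fit`, `centroid_predict`, `self_train`, and `threshold` are all assumptions made for the example.

```python
def centroid_fit(xs, ys):
    """Toy classifier (assumption): store the mean of the 1-D features of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}


def centroid_predict(centroids, x):
    """Return (label, confidence). Confidence is a normalized margin between
    the two nearest centroids, so it needs at least two classes."""
    dists = sorted((abs(x - m), c) for c, m in centroids.items())
    (d0, label), (d1, _) = dists[0], dists[1]
    confidence = (d1 - d0) / (d1 + d0 + 1e-12)
    return label, confidence


def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.5, max_rounds=10):
    """Repeatedly move confidently predicted samples, with their pseudo-labels,
    into the training set and refit. Returns the final model and the leftover pool."""
    xs, ys, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(max_rounds):
        model = centroid_fit(xs, ys)
        remaining, added = [], False
        for x in pool:
            label, conf = centroid_predict(model, x)
            if conf >= threshold:
                xs.append(x)
                ys.append(label)  # pseudo-label
                added = True
            else:
                remaining.append(x)
        pool = remaining
        if not added:
            break
    return centroid_fit(xs, ys), pool


model, leftover = self_train([0.0, 10.0], [0, 1], [1.0, 2.0, 9.0, 8.0, 5.0])
```

Co-training follows the same pattern with two models, each fit on its own view, adding their confident predictions to the shared training set.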

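Multi-task learning: a form of inductive transfer learning that uses information from related tasks as an inductive bias to improve generalization
Sharing patterns
- Hard sharing: the networks for different tasks use common shared modules to extract general-purpose features
- Soft sharing: each task draws some information from the other tasks (e.g. hidden states, attention mechanisms)
- Hierarchical sharing: different layers of a network extract different kinds of features; lower layers tend to extract low-level local features and higher layers extract high-level abstract semantic features, so lower-level tasks can output from lower layers and higher-level tasks from higher layers
- Shared-private: separates the responsibilities of the shared modules and the task-specific (private) modules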
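As a rough illustration of the hard sharing pattern, the sketch below runs a forward pass through one shared layer followed by a per-task head. The class name `HardSharedModel`, the helper functions, and the hand-picked weights are hypothetical; training is omitted.

```python
def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


class HardSharedModel:
    """One shared feature extractor, one output head per task (forward pass only)."""

    def __init__(self, shared, heads):
        self.shared = shared  # (weights, bias) used by every task
        self.heads = heads    # task name -> (weights, bias)

    def forward(self, task, x):
        h = relu(linear(*self.shared, x))    # common features
        return linear(*self.heads[task], h)  # task-specific output


model = HardSharedModel(
    shared=([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),
    heads={
        "task_a": ([[1.0, 1.0]], [0.0]),
        "task_b": ([[2.0, 0.0]], [1.0]),
    },
)
out_a = model.forward("task_a", [3.0, 2.0])
out_b = model.forward("task_b", [3.0, 2.0])
```

Both tasks read the same hidden vector `h`, so gradients from every task would update the shared parameters during training.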

Different Tasks: $p_S(y|x)\neq p_T(y|x), p_S(x)=p_T(x)$

Multi-task Learning: source domain labels are available
- learn the source and target tasks jointly
Self-taught Learning: source domain labels are unavailable
- feature based: learn good features on the source domain
- fine-tuning: pretrain a model on the source domain, then fine-tune it on the target

$p_S(x,y)\neq p_T(x,y)$; assume the source domain has a large amount of labeled data and the target domain has unlabeled data

Domain Adaptation: covariate shift, $p_S(x)\neq p_T(x),\ p_S(y|x)=p_T(y|x)$
- Learn domain-invariant features so that the learned features are not tied to the source domain (which would cause over-fitting), reducing the covariate shift
- Covariate: a statistical variable that may affect the prediction; in machine learning it can be viewed as the input
Concept shift: different tasks, $p_S(y|x)\neq p_T(y|x)$ with $p_S(x)=p_T(x)$
Prior shift: $p_S(y)\neq p_T(y),\ p_S(x|y)=p_T(x|y)$

No labeled data in both source and target domain

Learn a model $f:\mathcal{X}\rightarrow\mathcal{Y}$
- $\mathcal{R}_T(\theta_f)=E_{(x,y)\sim p_S(x,y)}\left[\frac{p_T(x)}{p_S(x)}L(f(x;\theta_f),y)\right]$
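The importance-weighted risk above can be estimated from labeled source samples by averaging the loss weighted by $p_T(x)/p_S(x)$. The sketch below assumes both densities are known 1-D Gaussians, which is rarely true in practice (the ratio usually has to be estimated); the function names and the toy data are assumptions for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weighted_risk(samples, predict, loss, p_source, p_target):
    """Monte Carlo estimate of the target risk from source samples (x, y)."""
    total = 0.0
    for x, y in samples:
        weight = p_target(x) / p_source(x)  # density ratio p_T(x) / p_S(x)
        total += weight * loss(predict(x), y)
    return total / len(samples)


source = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
predict = lambda x: 0.5 * x
squared = lambda y_hat, y: (y_hat - y) ** 2
p_s = lambda x: gaussian_pdf(x, 0.0, 1.0)
p_t = lambda x: gaussian_pdf(x, 1.0, 1.0)

risk_shifted = importance_weighted_risk(source, predict, squared, p_s, p_t)
risk_same = importance_weighted_risk(source, predict, squared, p_s, p_s)
```

When the two densities coincide, every weight is 1 and the estimate reduces to the ordinary empirical risk on the source data.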
Domain-invariant representation: $g:\mathcal{X}\rightarrow\mathbb{R}^d$
- $p_S(g(x;\theta_g))=p_T(g(x;\theta_g)),\forall x\in\mathcal{X}$
- $\mathcal{R}_T(\theta_f,\theta_g)=E_{(x,y)\sim p_S(x,y)}\left[L(f(g(x;\theta_g);\theta_f),y)\right]+\gamma d_g(S,T)$
Distribution discrepancy
- MMD (Maximum Mean Discrepancy)
- CMD (Central Moment Discrepancy)
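As one concrete choice for $d_g(S,T)$, the squared MMD between two feature samples can be estimated with a kernel. The sketch below uses the biased estimator with an RBF kernel; the kernel choice, the `bandwidth` parameter, and the function names are assumptions, not something fixed by these notes.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


source_feats = [(0.0, 0.0), (1.0, 0.0)]
target_feats = [(5.0, 5.0), (6.0, 5.0)]
same = mmd_squared(source_feats, source_feats)
shifted = mmd_squared(source_feats, target_feats)
```

The estimate is zero when both samples are identical and grows as the two feature distributions move apart.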
Adversarial learning
- Discriminator $c(h,\theta_c)$ on features $h=g(x;\theta_g)$: $L_c(\theta_g,\theta_c)=\frac{1}{N}\sum_{n=1}^N\log c(h_S^{(n)},\theta_c)+\frac{1}{M}\sum_{m=1}^M\log(1-c(h_T^{(m)},\theta_c))$
- Feature extraction: $d_g(S,T)=L_c(\theta_g,\theta_c)$
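To make the discriminator objective concrete, the sketch below evaluates $L_c$ for a logistic discriminator on fixed feature vectors. The logistic form of $c$, the parameter layout `(w, b)`, and the function names are assumptions; no optimization step is shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discriminator(h, theta_c):
    """c(h, theta_c): estimated probability that feature h comes from the source domain."""
    w, b = theta_c
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def domain_objective(source_feats, target_feats, theta_c):
    """L_c: mean log c(h_S) over source features plus mean log(1 - c(h_T)) over target features."""
    src = sum(math.log(discriminator(h, theta_c)) for h in source_feats) / len(source_feats)
    tgt = sum(math.log(1.0 - discriminator(h, theta_c)) for h in target_feats) / len(target_feats)
    return src + tgt


source_feats = [[1.0, 0.0], [2.0, 0.0]]
target_feats = [[-1.0, 0.0], [-2.0, 0.0]]
chance = domain_objective(source_feats, target_feats, ([0.0, 0.0], 0.0))
separating = domain_objective(source_feats, target_feats, ([1.0, 0.0], 0.0))
```

As the objective is written here, the discriminator tries to increase $L_c$, while the feature extractor, through the $\gamma d_g(S,T)$ term, tries to push it back toward the chance level of $-2\log 2$.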

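Learn $\mathcal{T}_{m+1}$ with the help of the previous tasks $\mathcal{T}_1,\mathcal{T}_2,\cdots,\mathcal{T}_m$
Avoiding catastrophic forgetting: when learning several tasks in sequence, learn the new task without forgetting the tasks learned earlier
Elastic Weight Consolidation (2017)
- $\log p(\theta|D)=\log p(D_B|\theta)+\log p(\theta|D_A)-\log p(D_B)$, where $D$ consists of the data $D_A$ of the earlier task and the data $D_B$ of the new task
- Assume $p(\theta|D_A)$ is Gaussian, with mean $\theta_A$ (the parameters learned on task $\mathcal{T}_A$) and precision matrix (the inverse of the covariance matrix) approximated by the Fisher information matrix of $\theta$ on $D_A$: $p(\theta|D_A)=\mathcal{N}(\theta_A,F^{-1})$
- Fisher information matrix: a measure of how much information the likelihood function $p(x;\theta)$ carries about the parameter $\theta$; its diagonal reflects the uncertainty of the maximum likelihood estimate: the larger the value, the smaller the variance of the parameter estimate and the more reliable it is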
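One way to see how the Gaussian assumption produces a quadratic penalty (a short derivation, assuming $F$ is further approximated by its diagonal with entries $F_i$):

$$
\log p(\theta|D_A) = -\frac{1}{2}(\theta-\theta_A)^\top F(\theta-\theta_A) + \text{const} \approx -\frac{1}{2}\sum_{i} F_i(\theta_i-\theta_{A,i})^2 + \text{const}
$$

Maximizing $\log p(\theta|D)$ therefore amounts to minimizing the task-B loss $-\log p(D_B|\theta)$ plus a Fisher-weighted squared distance from $\theta_A$; the term $\log p(D_B)$ does not depend on $\theta$ and can be dropped.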
Score function: $s(\theta)=\nabla_\theta\log p(x;\theta)$
- $E(s(\theta))=0$
- Fisher information matrix: the covariance matrix of $s(\theta)$, $F(\theta)=E(s(\theta)s(\theta)^\top)$
- $L(\theta)=L_B(\theta)+\sum_{i=1}^N\frac{\lambda}{2}F_i^A(\theta_i-\theta_{A,i}^*)^2$
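The diagonal Fisher estimate and the penalized loss above can be sketched as follows. The example assumes a unit-variance Gaussian model with unknown mean, whose score is $x-\theta$; the function names, the toy task-B loss, and the value of `lam` are hypothetical.

```python
def diagonal_fisher(score_fn, samples, theta):
    """Estimate diag(F) as the sample mean of the squared score components."""
    fisher = [0.0] * len(theta)
    for x in samples:
        s = score_fn(x, theta)
        for i, s_i in enumerate(s):
            fisher[i] += s_i ** 2
    return [f / len(samples) for f in fisher]


def ewc_loss(loss_b, theta, theta_a, fisher_a, lam):
    """L(theta) = L_B(theta) + sum_i (lam / 2) * F_i * (theta_i - theta_A_i)^2."""
    penalty = sum(
        0.5 * lam * f * (t - ta) ** 2
        for f, t, ta in zip(fisher_a, theta, theta_a)
    )
    return loss_b(theta) + penalty


def gaussian_mean_score(x, theta):
    """Score of N(theta, 1) with respect to the mean (assumed toy model)."""
    return [x - theta[0]]


theta_a = [1.0]                              # parameters learned on task A
data_a = [0.0, 2.0]                          # toy task-A data
fisher_a = diagonal_fisher(gaussian_mean_score, data_a, theta_a)

loss_b = lambda th: (th[0] - 3.0) ** 2       # toy task-B loss
total = ewc_loss(loss_b, [2.0], theta_a, fisher_a, lam=4.0)
```

Parameters with a large Fisher value are pulled strongly back toward their task-A values, while parameters with a small value remain free to adapt to task B.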