A Brief Analysis of the DDPM Paper

Paper: Denoising Diffusion Probabilistic Models

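The paper makes two main contributions:

It shows that diffusion models are indeed capable of generating high-quality samples.
It shows that a particular parameterization of diffusion models is equivalent to denoising score matching.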

Main prerequisites:

Variational inference

Diffusion models: Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Score matching: Estimation of Non-Normalized Statistical Models by Score Matching

Score-based generative models: Generative Modeling by Estimating Gradients of the Data Distribution

Another very good reference blog post: https://aman.ai/primers/ai/diffusion-models/

Diffusion Model

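Let $\mathbf{x}_0$ be a sample from the given dataset, drawn from the data distribution $q(\mathbf{x}_0)$.

Let $\mathbf{x}_1, \dots, \mathbf{x}_T$ be latent variables with the same dimensionality as $\mathbf{x}_0$, obtained by sampling from a fixed Markov chain.

This Markov chain gradually adds Gaussian noise to $\mathbf{x}_0$ according to a variance schedule $\beta_1, \dots, \beta_T$. It is called the forward process or diffusion process: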

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right)$$

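An important property of the forward process is that $\mathbf{x}_t$ can be sampled in closed form at any timestep $t$. Let $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$; then

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$$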

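Proof:

Using the reparameterization trick with independent noise terms $\boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, we have

$$\begin{aligned}
\mathbf{x}_t &= \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} \\
&= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-2}\right) + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{\alpha_t-\alpha_t\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-2} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{(\alpha_t-\alpha_t\alpha_{t-1}) + (1-\alpha_t)}\,\boldsymbol{\epsilon}^{*}_{t-2} \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\boldsymbol{\epsilon}^{*}_{t-2} \\
&= \dots \\
&= \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}^{*}_{0}
\end{aligned}$$

The fourth line merges two independent zero-mean Gaussian terms: their sum is again Gaussian, with variance equal to the sum of the two variances, so it can be written as a single standard normal $\boldsymbol{\epsilon}^{*}_{t-2} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ scaled by the square root of that sum.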
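The closed-form result can be checked numerically. The sketch below is only an illustration: it assumes a linear variance schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps and a scalar $\mathbf{x}_0$, and the function names `q_sample` and `q_sample_iterative` are made up for this example. It compares one-shot sampling from $q(\mathbf{x}_t \mid \mathbf{x}_0)$ with running the Markov chain step by step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear variance schedule (illustrative values, not from the text above).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed as in the text."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

def q_sample_iterative(x0, t):
    """Draw x_t by applying q(x_s | x_{s-1}) for s = 1, ..., t."""
    x = x0
    for s in range(t):
        x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

# Many independent copies of the same scalar x_0 = 2.
x0 = np.full(20000, 2.0)
t = 300
closed = q_sample(x0, t, rng.standard_normal(x0.shape))
chained = q_sample_iterative(x0, t)

# Both samplers should match the analytic mean and variance.
print(closed.mean(), chained.mean(), np.sqrt(alpha_bars[t - 1]) * 2.0)
print(closed.var(), chained.var(), 1.0 - alpha_bars[t - 1])
```

Up to Monte Carlo error, the empirical mean and variance of both samplers should agree with $\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0$ and $1-\bar{\alpha}_t$.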

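Given $\mathbf{x}_0$, the joint distribution of the whole latent sequence, called the posterior $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)$, factorizes as

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$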

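Now define a second Markov chain that runs in the opposite direction. Its initial distribution is $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$ and its transitions are learned Gaussians. This chain is called the reverse process:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right)$$

This gives the joint distribution $p_\theta(\mathbf{x}_{0:T})$:

$$p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$$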

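The idea of diffusion models (DM) is to use $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ to approximate $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, so that $p_\theta(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}$ becomes an approximation of the true data distribution $q(\mathbf{x}_0)$.

Put informally, a DM keeps adding noise to a sample until almost nothing but Gaussian noise remains, and then learns the reverse of that noising process, i.e., it denoises step by step from pure Gaussian noise until a sample is obtained.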

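$p_\theta(\mathbf{x}_0)$ is trained by optimizing the usual variational bound on the negative log-likelihood:

$$\mathbb{E}_q\left[-\log p_\theta(\mathbf{x}_0)\right] \le \mathbb{E}_q\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] = \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \sum_{t \ge 1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}\right] =: L$$

Estimating this expression directly relies on Monte Carlo samples of the log-ratio terms, which has high variance, so $L$ is rewritten as

$$L = \mathbb{E}_q\Big[\underbrace{D_{KL}\left(q(\mathbf{x}_T \mid \mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\right)}_{L_T} + \sum_{t>1} \underbrace{D_{KL}\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right)}_{L_{t-1}} \underbrace{-\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{L_0}\Big]$$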

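Proof:

$$\begin{aligned}
L &= \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \sum_{t \ge 1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}\right] \\
&= \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \log \frac{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}{q(\mathbf{x}_1 \mid \mathbf{x}_0)} - \sum_{t>1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}\right] \\
&= \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \log \frac{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}{q(\mathbf{x}_1 \mid \mathbf{x}_0)} - \sum_{t>1} \log \left(\frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} \cdot \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)}\right)\right] \\
&= \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \log \frac{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}{q(\mathbf{x}_1 \mid \mathbf{x}_0)} - \log \frac{q(\mathbf{x}_1 \mid \mathbf{x}_0)}{q(\mathbf{x}_T \mid \mathbf{x}_0)} - \sum_{t>1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}\right] \\
&= \mathbb{E}_q\left[-\log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)} - \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) - \sum_{t>1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}\right] \\
&= \mathbb{E}_q\left[D_{KL}\left(q(\mathbf{x}_T \mid \mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\right) + \sum_{t>1} D_{KL}\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right) - \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\right]
\end{aligned}$$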

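The rewritten $L$ is easier to interpret: minimizing $L$ essentially means minimizing $D_{KL}\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right)$ for every $t > 1$, i.e., making the learned reverse transition $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ as close as possible to the forward-process posterior, which becomes tractable once we condition on $\mathbf{x}_0$.

By Bayes' rule, the posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ appearing in $L_{t-1}$ is

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde{\beta}_t \mathbf{I}\right),$$

$$\text{where}\quad \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t \quad\text{and}\quad \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

The KL divergence between two Gaussians has a closed form, so these terms can be computed exactly instead of being estimated with high-variance Monte Carlo samples.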
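As a small illustration of these two ingredients, the sketch below computes $\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$ and evaluates the closed-form KL divergence between two isotropic Gaussians. The linear schedule and the function names (`posterior_mean_var`, `kl_isotropic_gaussians`) are assumptions made for this example.

```python
import numpy as np

# Assumed linear variance schedule (illustrative values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean_var(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for 2 <= t <= T (1-indexed)."""
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    a_bar_t, a_bar_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    coef_x0 = np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t  # beta_tilde_t
    return coef_x0 * x0 + coef_xt * xt, var

def kl_isotropic_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q I) || N(mu_p, var_p I)) in closed form."""
    d = mu_q.size
    return 0.5 * (d * (var_q / var_p - 1.0 + np.log(var_p / var_q))
                  + np.sum((mu_q - mu_p) ** 2) / var_p)

x0 = np.array([0.5, -0.3])
xt = np.array([0.1, 0.2])
mu_q, var_q = posterior_mean_var(x0, xt, t=500)

# KL against a (hypothetical) model prediction with mean 0 and variance beta_t.
kl = kl_isotropic_gaussians(mu_q, var_q, mu_p=np.zeros(2), var_p=betas[499])
print(mu_q, var_q, kl)
```

When both Gaussians share the same variance $\sigma^2$, the KL reduces to $\|\boldsymbol{\mu}_q - \boldsymbol{\mu}_p\|^2 / (2\sigma^2)$.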

Score-based Model

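Score matching was originally proposed to estimate non-normalized probability models without having to compute their normalization constant.

The original paper defines the score function as the gradient of the log-density:

$$\psi(\boldsymbol{\xi}; \theta) = \begin{pmatrix} \dfrac{\partial \log p(\boldsymbol{\xi}; \theta)}{\partial \xi_1} \\ \vdots \\ \dfrac{\partial \log p(\boldsymbol{\xi}; \theta)}{\partial \xi_n} \end{pmatrix} = \nabla_{\boldsymbol{\xi}} \log p(\boldsymbol{\xi}; \theta)$$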

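The goal of a score-based model is to train a score network $\mathbf{s}_\theta(\mathbf{x})$ that estimates the score function of the data, i.e., to minimize

$$\frac{1}{2}\,\mathbb{E}_{p_{data}}\left[\left\| \mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p_{data}(\mathbf{x}) \right\|_2^2\right]$$

Because the score function of the data is unknown, this objective has to be transformed. It can be shown that, up to a constant, it is equivalent to

$$\mathbb{E}_{p_{data}(\mathbf{x})}\left[\operatorname{tr}\left(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})\right) + \frac{1}{2}\left\| \mathbf{s}_\theta(\mathbf{x}) \right\|_2^2\right]$$

Here $\operatorname{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x}))$ is still expensive to compute, and the different score-based methods mainly differ in how they get around this problem.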

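A commonly used approach is denoising score matching.

It first perturbs each data point $\mathbf{x}$ with a noise distribution $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})$, and then uses the score network to estimate the score of the perturbed data distribution $q_\sigma(\tilde{\mathbf{x}}) := \int q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p_{data}(\mathbf{x})\, d\mathbf{x}$:

$$\frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p_{data}(\mathbf{x})}\left[\left\| \mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) \right\|_2^2\right]$$

It can be shown that the minimizer of this objective satisfies $\mathbf{s}^{*}_{\theta}(\tilde{\mathbf{x}}) = \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}})$, and when the noise is small enough, $q_\sigma(\mathbf{x}) \approx p_{data}(\mathbf{x})$.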
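A toy check of this claim, under assumptions chosen so that everything is known in closed form: the data are 1-D with $p_{data} = \mathcal{N}(0, 1)$, the perturbation kernel is Gaussian with standard deviation $\sigma$, and the score model is restricted to the hypothetical linear family $s_a(\tilde{x}) = -a\,\tilde{x}$. The perturbed density is then $\mathcal{N}(0, 1+\sigma^2)$, whose score is $-\tilde{x}/(1+\sigma^2)$, so the loss should be minimized near $a = 1/(1+\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

# Toy data p_data = N(0, 1), perturbed with a Gaussian kernel (assumption).
x = rng.standard_normal(200_000)
x_tilde = x + sigma * rng.standard_normal(x.shape)

def dsm_loss(a):
    """Denoising score matching loss for the linear model s_a(x_tilde) = -a * x_tilde."""
    # Gradient of log q_sigma(x_tilde | x) for the Gaussian kernel.
    target = -(x_tilde - x) / sigma**2
    return 0.5 * np.mean((-a * x_tilde - target) ** 2)

# Grid search over the single parameter a.
grid = np.linspace(0.0, 2.0, 201)
best_a = grid[np.argmin([dsm_loss(a) for a in grid])]

# The minimizer should be close to the slope of the true perturbed score.
print(best_a, 1.0 / (1.0 + sigma**2))
```

Note that the regression target uses only $\tilde{x}$ and $x$, never the unknown data score, which is the practical appeal of the method.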

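Once the score network $\mathbf{s}_\theta(\mathbf{x})$ is available, we can draw samples from $p_{data}(\mathbf{x})$ using only the score function $\nabla_{\mathbf{x}} \log p_{data}(\mathbf{x})$ via Langevin dynamics (also called Langevin sampling), which is an MCMC method.

Given a fixed step size $\epsilon > 0$, a prior distribution $\pi$, and an initial value $\tilde{\mathbf{x}}_0 \sim \pi(\mathbf{x})$, Langevin sampling iterates

$$\tilde{\mathbf{x}}_t = \tilde{\mathbf{x}}_{t-1} + \frac{\epsilon}{2}\,\nabla_{\mathbf{x}} \log p(\tilde{\mathbf{x}}_{t-1}) + \sqrt{\epsilon}\,\mathbf{z}_t$$

where $\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. As $\epsilon \to 0$ and $T \to \infty$, $\tilde{\mathbf{x}}_T$ becomes (under some regularity conditions) an exact sample from $p(\mathbf{x})$.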
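The update rule translates almost line by line into code. The following is a minimal sketch in which the learned score network is replaced by the exact score of a 1-D Gaussian target $\mathcal{N}(3, 1)$; the target, step size, and step count are arbitrary choices for this example.

```python
import numpy as np

def langevin_sample(score_fn, x_init, step_size, n_steps, rng):
    """Run the Langevin update x <- x + (eps/2) * score(x) + sqrt(eps) * z."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

# Stand-in for a trained score network: exact score of N(mu, sigma^2).
mu, sigma = 3.0, 1.0

def gaussian_score(x):
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x_init = rng.standard_normal(5000)  # prior pi = N(0, 1), 5000 independent chains
samples = langevin_sample(gaussian_score, x_init, step_size=0.1, n_steps=500, rng=rng)
print(samples.mean(), samples.std())
```

With a finite step size the chain is only approximately correct, so the sample statistics should be close to, but not exactly, those of the target.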

Diffusion Models and Denoising Autoencoders

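Different choices of $\beta_t$ and of the parameterization of the reverse process $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ lead to many different implementations of diffusion models.

The implementation discussed below makes the diffusion model behave equivalently to denoising score matching.

The $\beta_t$ are fixed to constants. The approximate posterior $q$ then has no learnable parameters, so $L_T$ is a constant and can be ignored.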

For the reverse process $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\left(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right)$:

First, set $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}$, where the $\sigma_t^2$ are time-dependent constants that are not trained.

Experimentally, $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde{\beta}_t$ give similar results. The paper's explanation, which I have not fully understood yet, is:

The first choice is optimal for $x_{0} \sim N (0, I)$ , and the second is optimal for $x_{0}$ deterministically set to one point. These are the two extreme choices corresponding to upper and lower bounds on reverse process entropy for data with coordinatewise unit variance.

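With this choice, the closed-form KL divergence between Gaussians gives

$$L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\| \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \right\|^2\right] + C$$

where $C$ is a constant that does not depend on $\theta$.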

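Recall that the forward process can be sampled at any timestep: $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I}\right)$.

Applying the reparameterization trick with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ gives $\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$. Substituting this into $L_{t-1}$ yields

$$\begin{aligned}
L_{t-1} - C &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2\sigma_t^2}\left\| \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}),\ \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}\right)\right) - \boldsymbol{\mu}_\theta(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), t) \right\|^2\right] \\
&= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2\sigma_t^2}\left\| \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}\right) - \boldsymbol{\mu}_\theta(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), t) \right\|^2\right]
\end{aligned}$$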
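The step between the two lines above is pure algebra. A short worked version, using only the definitions of $\tilde{\boldsymbol{\mu}}_t$, $\alpha_t$ and $\bar{\alpha}_t$ given earlier and abbreviating $\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon})$ as $\mathbf{x}_t$, might read:

$$\begin{aligned}
\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t,\ \tfrac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}\right)\right)
&= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{(1-\bar{\alpha}_t)\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}\right) + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t \\
&= \frac{\beta_t + \alpha_t(1-\bar{\alpha}_{t-1})}{(1-\bar{\alpha}_t)\sqrt{\alpha_t}}\,\mathbf{x}_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon} \\
&= \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}\right)
\end{aligned}$$

The second line uses $\sqrt{\bar{\alpha}_{t-1}} / \sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$, and the third uses $\beta_t + \alpha_t(1-\bar{\alpha}_{t-1}) = (1-\alpha_t) + \alpha_t - \bar{\alpha}_t = 1-\bar{\alpha}_t$.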

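This result suggests parameterizing the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ as

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t,\ \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right)\right) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right)$$

where $\mathbf{x}_t$ is the given model input and $\boldsymbol{\epsilon}_\theta$ is a function approximator that predicts $\boldsymbol{\epsilon}$ from $\mathbf{x}_t$.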

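With this parameterization of the mean, $L_{t-1} - C$ simplifies to

$$\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\right) \right\|^2\right]$$

This expression closely resembles denoising score matching. In effect it performs denoising score matching over multiple noise scales indexed by $t$, and $\boldsymbol{\epsilon}_\theta$ plays the role of the learned gradient of the data log-density (the score), up to a scaling factor.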
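To make the correspondence concrete, the denoising score matching target for the Gaussian kernel $q(\mathbf{x}_t \mid \mathbf{x}_0)$ can be written out explicitly. This is only a sketch; the symbol $\mathbf{s}_\theta(\mathbf{x}_t, t)$ for the implied score estimate is introduced here for illustration.

$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{1-\bar{\alpha}_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar{\alpha}_t}}, \qquad \mathbf{s}_\theta(\mathbf{x}_t, t) := -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

$$\left\| \mathbf{s}_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) \right\|^2 = \frac{1}{1-\bar{\alpha}_t}\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right\|^2$$

So predicting the noise and predicting the score of the perturbed data differ only by the factor $-1/\sqrt{1-\bar{\alpha}_t}$, and the per-timestep losses differ only in their weighting.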

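The paper gives pseudocode for the diffusion model under this parameterization. As its Algorithm 2 shows, sampling $\mathbf{x}_{t-1} \sim p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ amounts to computing $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, a procedure that resembles Langevin dynamics.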
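The sampling loop can be sketched in a few lines. Since no trained network is available here, the example substitutes a hypothetical stand-in for $\boldsymbol{\epsilon}_\theta$: for toy 1-D data $x_0 \sim \mathcal{N}(m, s^2)$ the conditional expectation $\mathbb{E}[\epsilon \mid x_t]$ is known in closed form and is used as the "model". The linear schedule, the choice $\sigma_t^2 = \beta_t$, and all names are assumptions for this illustration; following Algorithm 2 in the paper, no noise is added at the last step $t = 1$.

```python
import numpy as np

# Assumed linear variance schedule (illustrative values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Hypothetical stand-in for the trained network: for 1-D data x_0 ~ N(m, s^2),
# E[eps | x_t] has a closed form because (eps, x_t) are jointly Gaussian.
m, s = 2.0, 0.5

def eps_model(x_t, t):
    a_bar = alpha_bars[t - 1]
    var_t = a_bar * s**2 + (1.0 - a_bar)  # marginal variance of x_t
    return np.sqrt(1.0 - a_bar) * (x_t - np.sqrt(a_bar) * m) / var_t

def p_sample_loop(eps_fn, shape, rng):
    """Ancestral sampling x_T -> x_0 with sigma_t^2 = beta_t."""
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        # No noise is added at the final step t = 1.
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        coef = betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1])
        mean = (x - coef * eps_fn(x, t)) / np.sqrt(alphas[t - 1])
        x = mean + np.sqrt(betas[t - 1]) * z
    return x

rng = np.random.default_rng(0)
samples = p_sample_loop(eps_model, (5000,), rng)
print(samples.mean(), samples.std())
```

With this idealized noise predictor, the generated samples should have roughly the mean and standard deviation of the toy data distribution. A real model would replace `eps_model` with a trained neural network.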

Discrete Decoder of Reverse Process

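Each pixel of an RGB image takes a discrete value in $0, 1, \dots, 255$, but every step of the reverse process so far is a continuous Gaussian.

To obtain discrete likelihoods, the last step of the reverse process is rewritten as an independent discrete decoder.

In what follows we assume that all pixel values have been scaled linearly from $[0, 255]$ to $[-1, 1]$.

Since the derivation above assumes a diagonal covariance matrix for the reverse process, we immediately have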

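$$p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) = \prod_{i=1}^{D} p_\theta(x_0^i \mid \mathbf{x}_1) = \prod_{i=1}^{D} \mathcal{N}\left(x_0^i;\ \mu_\theta^i(\mathbf{x}_1, 1),\ \sigma_1^2\right)$$

where $D$ is the dimensionality of the data.

Next we discretize by binning: a pixel value falling in the interval $[x - \frac{1}{255}, x + \frac{1}{255}]$ is treated as the discrete value $x$, where $x \in \{-1, -1 + \frac{2}{255}, \dots, 1\}$ ranges over the 256 scaled pixel levels. This gives

$$\begin{aligned}
p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) &= \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\left(x;\ \mu_\theta^i(\mathbf{x}_1, 1),\ \sigma_1^2\right) dx \\
\delta_+(x) &= \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}
\qquad
\delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}
\end{aligned}$$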
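For a single pixel, each integral is just a difference of two Gaussian CDF values. The sketch below evaluates it with the standard library's `math.erf`; the function names and the example values of the mean and standard deviation are made up for illustration.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def discretized_gaussian_prob(x0, mu, sigma):
    """Probability mass assigned to the bin of pixel level x0 in [-1, 1]."""
    # The edge bins extend to +/- infinity, as in the definitions of delta_+ and delta_-.
    upper = 1.0 if x0 >= 1.0 else gaussian_cdf(x0 + 1.0 / 255.0, mu, sigma)
    lower = 0.0 if x0 <= -1.0 else gaussian_cdf(x0 - 1.0 / 255.0, mu, sigma)
    return upper - lower

# The 256 pixel levels after scaling [0, 255] to [-1, 1].
levels = [-1.0 + 2.0 * k / 255.0 for k in range(256)]

# The bins tile the real line, so the probabilities should sum to 1.
total = sum(discretized_gaussian_prob(x, mu=0.3, sigma=0.05) for x in levels)
print(total)
```

The full log-likelihood of an image would sum the logarithm of this quantity over all $D$ coordinates, using the model's predicted mean for each coordinate.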

Simplified training objective

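The authors found experimentally that simplifying the training objective derived above, by dropping the per-timestep weighting, gives better sample quality:

$$L_{simple}(\theta) := \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\right) \right\|^2\right]$$

where $t$ is drawn uniformly from $\{1, \dots, T\}$.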
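A Monte Carlo estimate of $L_{simple}$ for one batch might look like the sketch below. It uses NumPy only, so there is no gradient step; the linear schedule, the random placeholder "dataset", and the untrained `zero_model` predictor are all assumptions for illustration.

```python
import numpy as np

# Assumed linear variance schedule (illustrative values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def l_simple(eps_fn, x0_batch, rng):
    """One-batch Monte Carlo estimate of L_simple for a noise predictor eps_fn."""
    n = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=n)   # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)
    a_bar = alpha_bars[t - 1][:, None]   # broadcast over the feature dimension
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    return np.mean(np.sum((eps - eps_fn(x_t, t)) ** 2, axis=1))

def zero_model(x_t, t):
    """Hypothetical untrained predictor that always outputs zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0_batch = rng.uniform(-1.0, 1.0, size=(4096, 8))  # placeholder data in [-1, 1]
loss = l_simple(zero_model, x0_batch, rng)
print(loss)
```

For the zero predictor the loss is just $\mathbb{E}\|\boldsymbol{\epsilon}\|^2$, i.e., roughly the data dimension (8 here). In actual training, `eps_fn` would be a neural network and this quantity would be minimized with a gradient-based optimizer in an autodiff framework.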

Experiment

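IS and FID are two commonly used quantitative metrics for evaluating (image) generative models.

IS (Inception Score) feeds each generated image $x$ through an Inception network to obtain a 1000-dimensional class-probability vector $p(y \mid x)$, and quantifies the quality of generated images from two aspects:

Sharpness: for a single generated image, the entropy of its class distribution should be as small as possible; for a sharp image, the probability of one class should approach 1 and the others should approach 0. We therefore minimize
$\mathbb{E}_{x \sim p_g}\left[H(p(y \mid x))\right]$
Diversity: over a batch of generated images, the entropy of the overall class distribution should be as large as possible, i.e., the generated images should cover as many classes as possible. We therefore maximize
$H\left(\mathbb{E}_{x \sim p_g}\left[p(y \mid x)\right]\right)$

Negating the former, adding the two, and exponentiating gives the IS; the larger the IS, the higher the quality of the generated images. With $p(y) = \mathbb{E}_{x \sim p_g}[p(y \mid x)]$, the identity $H(p(y)) - \mathbb{E}_{x \sim p_g}[H(p(y \mid x))] = \mathbb{E}_{x \sim p_g}[D_{KL}(p(y \mid x)\,\|\,p(y))]$ yields

$$IS = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{KL}\left(p(y \mid x)\,\|\,p(y)\right)\right]\right)$$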
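Given the matrix of predicted class probabilities, the IS formula is short to implement. The sketch below assumes the probabilities are already available as an array, so the Inception network itself is not part of the example; the small `eps` guarding the logarithm is an implementation choice.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array whose rows are p(y | x) for N generated images."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    # Per-image KL(p(y|x) || p(y)); eps avoids log(0) for zero probabilities.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse predictions: one-hot rows covering all 10 classes.
sharp_diverse = np.eye(10)
# Uninformative predictions: every row is the uniform distribution.
uniform = np.full((10, 10), 0.1)

print(inception_score(sharp_diverse), inception_score(uniform))
```

The score ranges from 1 (every image gets the same class distribution) up to the number of classes $K$ (confident predictions spread evenly over all classes), which is what the two toy inputs probe.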

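FID (Fréchet Inception Distance) uses Inception-v3 with the final classification layer removed to obtain a 2048-dimensional feature vector for each image.

The idea of FID is to directly measure the distance between the distribution of real images and that of generated images. Image distributions are too high-dimensional to work with directly, so the Fréchet distance is instead computed between Gaussians fitted to the 2048-dimensional Inception features. A lower FID is better.

$$FID = \left\| \boldsymbol{\mu}_x - \boldsymbol{\mu}_g \right\|^2 + \operatorname{tr}\left(\boldsymbol{\Sigma}_x + \boldsymbol{\Sigma}_g - 2\left(\boldsymbol{\Sigma}_x \boldsymbol{\Sigma}_g\right)^{\frac{1}{2}}\right)$$

where $\boldsymbol{\mu}_x, \boldsymbol{\Sigma}_x$ and $\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g$ are the mean and covariance of the features of real and generated images, respectively.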
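The formula can be evaluated from feature statistics alone. The following sketch uses random vectors in place of real Inception features and computes $\operatorname{tr}\left((\boldsymbol{\Sigma}_x \boldsymbol{\Sigma}_g)^{1/2}\right)$ as the sum of square roots of the eigenvalues of $\boldsymbol{\Sigma}_x \boldsymbol{\Sigma}_g$, which is one of several possible ways to do it; the function names are made up for this example.

```python
import numpy as np

def frechet_distance(mu_x, sigma_x, mu_g, sigma_g):
    """Frechet distance between N(mu_x, sigma_x) and N(mu_g, sigma_g)."""
    diff = mu_x - mu_g
    # tr((Sx Sg)^{1/2}) = sum of sqrt of eigenvalues of Sx Sg, which are real and
    # non-negative for PSD covariances; clip to absorb small numerical errors.
    eigvals = np.linalg.eigvals(sigma_x @ sigma_g)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma_x) + np.trace(sigma_g) - 2.0 * tr_sqrt)

def feature_stats(features):
    """Mean vector and covariance matrix of an (N, d) feature array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

# Placeholder "features": the fake set is the real set shifted by 1 in each of 4 dims.
rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 4))
fake = real + 1.0

fid = frechet_distance(*feature_stats(real), *feature_stats(fake))
print(fid)
```

Because the two placeholder sets share the same covariance, only the mean term contributes here and the distance is the squared norm of the shift.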

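Table 1 and Table 2 of the paper compare DDPM's IS, FID, and NLL.

Table 1 shows that DDPM's IS is second only to StyleGAN2 + ADA, while its FID is the best. For the negative log-likelihood (NLL), the paper reports that training with the unsimplified variational objective gives a better NLL, but worse image quality than the simplified objective.

Table 2 compares different objectives $L$ and different parameterizations of the reverse process.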