An introduction to Flow Matching and Diffusion Models
Related work
Linqi Zhou --- Terminal Velocity Matching
Terminal Velocity Matching
Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song
https://arxiv.org/abs/2511.19797
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the \(2\)-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.
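Problems
Diffusion models restrict the choice of the probability path used for sampling, which leads to long training times and inefficient sampling.
The distributions used in diffusion models are Gaussians, which are easy to work with, but the integral that defines the likelihood is intractable. DDPM sidesteps the integral by optimizing the ELBO as an approximation to the true likelihood. The ELBO is only a bound, however, and it can be far from tight.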
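Prerequisites
Normalizing flows
Normalizing flows transform a simple distribution (such as a Gaussian) into a complex one through a sequence of invertible transformations. The core idea is to use the change of variables formula to compute the transformed probability density exactly.
Change of variables formula
Suppose a random variable \( z \sim p_z(z) \) is mapped by an invertible map \( f: \mathbb{R}^d \to \mathbb{R}^d \) to a new variable \( x = f(z) \). Then the probability density of \( x \) is:
$$p_x(x) = p_z\left(f^{-1}(x)\right) \left| \det \left( \frac{\partial f^{-1}(x)}{\partial x} \right) \right|$$
where:
- \( f^{-1} \) is the inverse of \( f \),
- \( \det \left( \frac{\partial f^{-1}(x)}{\partial x} \right) \) is the determinant of the Jacobian of the inverse map.
To keep the computation cheap, \( f \) is usually required to have a Jacobian determinant that is easy to evaluate, for example a transformation with a triangular Jacobian or one whose determinant is exactly 1.
Basic structure of a flow model
A normalizing flow builds a complex distribution by composing several simple transformations:
$$x = f_K \circ f_{K-1} \circ \cdots \circ f_1(z)$$
The corresponding probability density satisfies:
$$\log p_x(x) = \log p_z(z) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k(z_{k-1})}{\partial z_{k-1}} \right|$$
where \( z_0 = z \) and \( z_k = f_k(z_{k-1}) \).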
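As a minimal sketch of the change of variables formula for a composed flow, the snippet below pushes a standard normal through two 1-D affine maps and evaluates the exact log-density by inverting the layers and summing the log-determinants. The names `AffineFlow` and `flow_log_prob` and the chosen coefficients are illustrative assumptions, not part of any library.

```python
import math


class AffineFlow:
    """Invertible 1-D map f(z) = a * z + b with a != 0 (illustrative only)."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def inverse(self, x):
        return (x - self.b) / self.a

    def log_abs_det(self):
        # In 1-D the Jacobian determinant of f is just the slope a.
        return math.log(abs(self.a))


def std_normal_log_prob(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def flow_log_prob(x, flows):
    """log p_x(x) for x = f_K(...f_1(z)...), with z ~ N(0, 1)."""
    log_det_sum = 0.0
    # Invert the layers from last to first, accumulating log|det J_k|.
    for f in reversed(flows):
        x = f.inverse(x)
        log_det_sum += f.log_abs_det()
    return std_normal_log_prob(x) - log_det_sum


flows = [AffineFlow(2.0, 1.0), AffineFlow(-0.5, 3.0)]
x = 1.7
log_p = flow_log_prob(x, flows)

# The composition is x = -0.5 * (2 z + 1) + 3 = -z + 2.5, so x ~ N(2.5, 1).
closed_form = std_normal_log_prob(x - 2.5)
```

Here `log_p` and `closed_form` agree, because the composed map is itself affine and the pushed-forward density is known in closed form. Real flows use richer layers, but the bookkeeping is the same.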
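Advantages and limitations
- Advantages:
- The likelihood can be computed exactly, which suits probability density estimation.
- Sampling is simple: draw from the base distribution and apply the transformations.
- Limitations:
- The transformations must be invertible with an easily computed Jacobian determinant, which limits the expressiveness of the model.
- For high-dimensional data, designing and evaluating the transformations becomes expensive.
Normalizing flows provide the theoretical foundation for understanding flow matching, which learns a probability path in a more flexible way.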
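Normalizing Flows
Simple distributions are easy to work with but cannot represent complex shapes; complex distributions can represent the data, but their probability density is hard to compute.
A normalizing flow gradually transforms a simple distribution into a complex one through a sequence of invertible transformations.
Change of variables formula
The change of variables formula is the core of flow models. If \( z \sim p_z(z) \) and \( x = f(z) \) for an invertible map \( f \), then the distribution of \( x \) is:
$$p_x(x) = p_z\left(f^{-1}(x)\right) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|$$
or equivalently:
$$p_x(x) = p_z(z) \left| \det \frac{\partial f}{\partial z} \right|^{-1}$$
where \(\det \frac{\partial f}{\partial z}\) is the Jacobian determinant of the transformation \( f \), which keeps the probability density normalized under the transformation.
Flow
Because a single transformation has limited flexibility, we chain several invertible transformations \( f = f_K \circ \cdots \circ f_1 \) (with \( f_1 \) applied first):
$$x = f(z) = f_K(f_{K-1}(\cdots f_1(z)))$$
The log-density becomes:
$$\log p_x(x) = \log p_z(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k(z_{k-1})}{\partial z_{k-1}} \right|$$
where \( z_0 = z \), \( z_k = f_k(z_{k-1}) \), and \( z_K = x \).
Key requirements:
- Every \( f_k \) is invertible
- The Jacobian determinant is easy to compute
- The transformations are flexible enough to fit complex distributions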
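Continuous normalizing flows
Transform the probability density in continuous time.
Flow matching
A simulation-free method for training CNFs: it regresses the model vector field onto a target vector field that generates the desired probability path. Training therefore needs no numerical ODE simulation, which makes both training and sampling efficient.
Conditional flow matching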
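Manifold hypothesis
Manifold Hypothesis - AIDIY Wiki
The manifold hypothesis is a core assumption in machine learning and representation learning: real-world high-dimensional data usually lies on or near a low-dimensional manifold. This assumption provides the theoretical basis for dimensionality reduction and feature learning.
Mathematical definition
Let the observed data \( \mathbf{x} \in \mathbb{R}^D \) come from a high-dimensional space. The manifold hypothesis states that there exist a low-dimensional manifold \( \mathcal{M} \subset \mathbb{R}^D \) (of dimension \( d \ll D \)) and a generative process:
$$\mathbf{x} = f(\mathbf{z}) + \mathbf{\epsilon}$$
where:
- \( \mathbf{z} \in \mathbb{R}^d \) is a low-dimensional latent variable
- \( f: \mathbb{R}^d \to \mathbb{R}^D \) is a smooth map
- \( \mathbf{\epsilon} \) is observation noise
- \( \mathcal{M} = \{f(\mathbf{z}) : \mathbf{z} \in \mathcal{Z}\} \) is the data manifold
This means that an observation \(x\) either lies on \(\mathcal{M}\) or is very close to it.
Intuition
- /Local Euclidean structure/: every point on the manifold has a neighborhood homeomorphic to Euclidean space
- /Intrinsic dimension/: the true number of degrees of freedom of the data is far smaller than the observed dimension
- /Continuity/: nearby points on the manifold correspond to semantically similar samples
Example: handwritten digits
Although digit images live in a 784-dimensional space (28×28 pixels), the set of all possible digit images forms a low-dimensional manifold whose intrinsic dimension is far below 784.
Suppose we want to write the digit "2".
We only need to control a few key parameters: where the stroke starts, how much it curves, where it ends, and so on. These parameters form a low-dimensional space, and all possible images of a "2" lie on the manifold obtained by mapping that space into the high-dimensional image space.
This suggests that for two plausible ways of writing a 2, \(x_1\) and \(x_2\), there is a continuous path \(\gamma(t)\) on the manifold that connects them, and every point on the path is a plausible way of writing a "2".
Here \(\gamma(t)\) satisfies:
$$\gamma: [0, 1] \to \mathcal{M}, \quad \gamma(0) = x_1, \quad \gamma(1) = x_2$$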
Code: manifold learning algorithms
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import Isomap, LocallyLinearEmbedding, TSNE
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from umap import UMAP  # provided by the umap-learn package


def demonstrate_manifold_hypothesis():
    """Demonstrate the manifold hypothesis on a Swiss roll."""
    # 1. Generate the Swiss roll dataset (a classic manifold example)
    print("1. Swiss roll dataset")
    n_samples = 1000
    noise = 0.1

    # Build the Swiss roll: a 2-D sheet rolled up in 3-D
    t = 1.5 * np.pi * (1 + 2 * np.random.rand(1, n_samples))
    x = t * np.cos(t)
    y = 30 * np.random.rand(1, n_samples)
    z = t * np.sin(t)
    X_swiss = np.concatenate((x, y, z), axis=0).T
    X_swiss += noise * np.random.randn(n_samples, 3)
    t = t.squeeze()

    print(f"Data shape: {X_swiss.shape}")
    print("Observed dimension: 3")
    print("Intrinsic dimension: 2 (in theory)")

    # 2. Apply different manifold learning algorithms
    methods = {
        'PCA': PCA(n_components=2),
        'Isomap': Isomap(n_components=2, n_neighbors=10),
        'LLE': LocallyLinearEmbedding(n_components=2, n_neighbors=10),
        't-SNE': TSNE(n_components=2, random_state=42),
        'UMAP': UMAP(n_components=2, random_state=42)
    }

    # Visualize the results
    fig = plt.figure(figsize=(20, 12))

    # Original 3-D data
    ax1 = fig.add_subplot(2, 3, 1, projection='3d')
    ax1.scatter(X_swiss[:, 0], X_swiss[:, 1], X_swiss[:, 2], c=t, cmap='viridis')
    ax1.set_title('Original Swiss roll (3D)')
    ax1.set_xlabel('X')
    ax1.set_ylabel('Y')
    ax1.set_zlabel('Z')

    # Embedding produced by each method
    for i, (name, method) in enumerate(methods.items()):
        X_embedded = method.fit_transform(X_swiss)
        ax = fig.add_subplot(2, 3, i + 2)
        scatter = ax.scatter(X_embedded[:, 0], X_embedded[:, 1], c=t, cmap='viridis')
        ax.set_title(f'{name}')
        ax.set_xlabel('Component 1')
        ax.set_ylabel('Component 2')
        plt.colorbar(scatter, ax=ax)
    plt.tight_layout()
    plt.show()

    # 3. Intrinsic dimension estimation
    print("\n2. Intrinsic dimension estimation")

    def intrinsic_dimension_estimation(X, k=10):
        """Estimate the intrinsic dimension from nearest-neighbor distances."""
        nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
        distances, _ = nbrs.kneighbors(X)
        # Levina-Bickel style maximum likelihood estimate.
        # Column 0 is the query point itself (distance 0), so it is skipped.
        r_k = distances[:, k]  # distance to the k-th nearest neighbor
        log_ratios = np.log(r_k[:, None] / distances[:, 1:k])
        # Average the inverse-dimension estimates over all points, then invert.
        return 1.0 / np.mean(log_ratios)

    dim_estimate = intrinsic_dimension_estimation(X_swiss)
    print(f"Estimated intrinsic dimension: {dim_estimate:.2f}")
    return X_swiss, t, methods


X_swiss, t, methods = demonstrate_manifold_hypothesis()
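Taxonomy of manifold learning algorithms
Linear methods
- /Principal component analysis/ (PCA): finds the directions of maximum variance
- /Multidimensional scaling/ (MDS): preserves the distances between samples
Nonlinear methods
- /Isometric mapping/ (Isomap): preserves geodesic distances
- /Locally linear embedding/ (LLE): preserves local linear relationships
- /Laplacian eigenmaps/: based on the graph Laplacian
- /t-SNE/: preserves local similarity probabilities
- /UMAP/: based on Riemannian geometry and algebraic topology
Theoretical foundations of the manifold hypothesis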
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from scipy.linalg import svd
from sklearn.neighbors import NearestNeighbors


class ManifoldLearningTheory:
    """Demonstrations of manifold learning theory."""

    @staticmethod
    def tangent_space_estimation(X, point_idx, k=20):
        """Estimate the tangent space of the manifold at a given point."""
        # Find the nearest neighbors
        nbrs = NearestNeighbors(n_neighbors=k).fit(X)
        _, indices = nbrs.kneighbors(X[point_idx:point_idx + 1])
        # Local neighborhood
        neighborhood = X[indices[0]]
        # Center it
        center = neighborhood.mean(axis=0)
        centered = neighborhood - center
        # SVD gives the principal directions
        U, s, Vt = svd(centered, full_matrices=False)
        # The tangent space is spanned by the first d right singular vectors
        tangent_basis = Vt[:2].T  # assumes intrinsic dimension 2
        return center, tangent_basis, neighborhood

    @staticmethod
    def manifold_regularization_loss(encoder, X, lambda_reg=0.1):
        """Manifold regularization loss: encourage the encoder to keep local structure."""
        # scikit-learn works on NumPy arrays, so search on a detached copy
        X_np = X.detach().cpu().numpy()
        nbrs = NearestNeighbors(n_neighbors=5).fit(X_np)
        _, indices = nbrs.kneighbors(X_np)
        batch_size = X.shape[0]
        reg_loss = 0
        for i in range(batch_size):
            # Neighbors in the original space
            neighbors = X[indices[i]]
            # Encoded representations
            z_i = encoder(X[i:i + 1])
            z_neighbors = encoder(neighbors)
            # Local distances before and after encoding
            original_dists = torch.cdist(X[i:i + 1], neighbors)
            encoded_dists = torch.cdist(z_i, z_neighbors)
            # Distance preservation loss
            reg_loss += torch.mean((original_dists - encoded_dists) ** 2)
        return lambda_reg * reg_loss / batch_size


def demonstrate_tangent_spaces():
    """Demonstrate tangent space estimation."""
    # Generate points on the unit sphere
    n_samples = 500
    theta = np.random.uniform(0, 2 * np.pi, n_samples)
    phi = np.random.uniform(0, np.pi, n_samples)
    # Spherical to Cartesian coordinates
    x = np.sin(phi) * np.cos(theta)
    y = np.sin(phi) * np.sin(theta)
    z = np.cos(phi)
    X_sphere = np.column_stack([x, y, z])

    # Estimate the tangent space at a few points
    test_points = [0, 100, 200]
    fig = plt.figure(figsize=(15, 5))
    for i, point_idx in enumerate(test_points):
        center, tangent_basis, neighborhood = \
            ManifoldLearningTheory.tangent_space_estimation(X_sphere, point_idx)
        # 3-D visualization
        ax = fig.add_subplot(1, 3, i + 1, projection='3d')
        # Plot the sphere
        ax.scatter(X_sphere[:, 0], X_sphere[:, 1], X_sphere[:, 2],
                   alpha=0.3, s=10, color='blue')
        # Plot the local neighborhood
        ax.scatter(neighborhood[:, 0], neighborhood[:, 1], neighborhood[:, 2],
                   color='red', s=50)
        # Plot the tangent plane
        u, v = tangent_basis[:, 0], tangent_basis[:, 1]
        u, v = u * 0.3, v * 0.3  # scale the tangent vectors
        # Grid on the tangent plane
        s, t = np.meshgrid(np.linspace(-1, 1, 2), np.linspace(-1, 1, 2))
        plane_x = center[0] + s * u[0] + t * v[0]
        plane_y = center[1] + s * u[1] + t * v[1]
        plane_z = center[2] + s * u[2] + t * v[2]
        ax.plot_surface(plane_x, plane_y, plane_z, alpha=0.5, color='orange')
        ax.set_title(f'Tangent space at point {point_idx}')
        ax.set_xlim(-1, 1)
        ax.set_ylim(-1, 1)
        ax.set_zlim(-1, 1)
    plt.tight_layout()
    plt.show()
    print("Tangent space estimation finished")
    print("The tangent space at each point is spanned by two orthogonal vectors")


demonstrate_tangent_spaces()
Applications in deep learning
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.neighbors import NearestNeighbors


class ManifoldAwareAutoencoder(nn.Module):
    """Autoencoder that takes the manifold structure into account."""

    def __init__(self, input_dim, hidden_dims, latent_dim):
        super(ManifoldAwareAutoencoder, self).__init__()
        # Encoder
        encoder_layers = []
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            encoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.LeakyReLU(0.2)
            ])
            prev_dim = hidden_dim
        self.encoder = nn.Sequential(*encoder_layers)
        self.fc_mu = nn.Linear(prev_dim, latent_dim)
        self.fc_logvar = nn.Linear(prev_dim, latent_dim)
        # Decoder
        decoder_layers = []
        prev_dim = latent_dim
        for hidden_dim in reversed(hidden_dims):
            decoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.LeakyReLU(0.2)
            ])
            prev_dim = hidden_dim
        decoder_layers.append(nn.Linear(prev_dim, input_dim))
        self.decoder = nn.Sequential(*decoder_layers)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        # Decode
        x_recon = self.decoder(z)
        return x_recon, z, mu, logvar

    def manifold_aware_loss(self, x, x_recon, z, mu, logvar, lambda_manifold=0.1):
        """Loss with a manifold-aware term."""
        # Reconstruction loss
        recon_loss = F.mse_loss(x_recon, x, reduction='mean')
        # KL divergence
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
        # Manifold regularization (simplified)
        manifold_loss = self.local_structure_preservation_loss(x, z)
        total_loss = recon_loss + 0.01 * kl_loss + lambda_manifold * manifold_loss
        return total_loss, recon_loss, kl_loss, manifold_loss

    def local_structure_preservation_loss(self, x, z, k=5):
        """Local structure preservation loss."""
        # Find nearest neighbors in the original space
        x_np = x.detach().cpu().numpy()
        nbrs = NearestNeighbors(n_neighbors=k + 1).fit(x_np)
        _, indices = nbrs.kneighbors(x_np)
        batch_size = x.size(0)
        loss = 0
        for i in range(batch_size):
            # Neighbors in the original space; index 0 is the point itself
            orig_neighbors = x[indices[i, 1:]]
            # Corresponding points in the latent space
            z_i = z[i:i + 1]
            z_neighbors = z[indices[i, 1:]]
            # Distances to the same neighbors in both spaces
            orig_dists = torch.cdist(x[i:i + 1], orig_neighbors).squeeze()
            latent_dists = torch.cdist(z_i, z_neighbors).squeeze()
            # Compare distances after normalizing each set by its maximum
            loss += F.mse_loss(orig_dists / orig_dists.max(),
                               latent_dists / latent_dists.max())
        return loss / batch_size


def demonstrate_manifold_learning():
    """Demonstrate manifold learning in a deep learning setting."""
    # Generate a concentric circles dataset
    n_samples = 1000
    t = np.linspace(0, 2 * np.pi, n_samples)
    # Two concentric circles
    r1, r2 = 1.0, 2.0
    x1 = r1 * np.cos(t) + 0.1 * np.random.randn(n_samples)
    y1 = r1 * np.sin(t) + 0.1 * np.random.randn(n_samples)
    x2 = r2 * np.cos(t) + 0.1 * np.random.randn(n_samples)
    y2 = r2 * np.sin(t) + 0.1 * np.random.randn(n_samples)
    X_circle = np.vstack([np.column_stack([x1, y1]), np.column_stack([x2, y2])])
    labels = np.hstack([np.zeros(n_samples), np.ones(n_samples)])
    # Convert to a PyTorch tensor
    X_tensor = torch.FloatTensor(X_circle)

    # Create the manifold-aware autoencoder
    model = ManifoldAwareAutoencoder(
        input_dim=2,
        hidden_dims=[64, 32],
        latent_dim=2
    )

    # Training (simplified, full batch)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    print("Training the manifold-aware autoencoder...")
    for epoch in range(1000):
        optimizer.zero_grad()
        x_recon, z, mu, logvar = model(X_tensor)
        total_loss, recon_loss, kl_loss, manifold_loss = \
            model.manifold_aware_loss(X_tensor, x_recon, z, mu, logvar)
        total_loss.backward()
        optimizer.step()
        if epoch % 200 == 0:
            print(f'Epoch {epoch}: Total Loss: {total_loss.item():.4f}, '
                  f'Recon: {recon_loss.item():.4f}, Manifold: {manifold_loss.item():.4f}')

    # Visualize the results; eval mode makes BatchNorm use its running statistics
    model.eval()
    with torch.no_grad():
        x_recon, z, _, _ = model(X_tensor)
    plt.figure(figsize=(15, 5))
    # Original data
    plt.subplot(1, 3, 1)
    plt.scatter(X_circle[:, 0], X_circle[:, 1], c=labels, cmap='viridis', alpha=0.6)
    plt.title('Original data (2D)')
    plt.xlabel('x1')
    plt.ylabel('x2')
    # Latent space
    plt.subplot(1, 3, 2)
    z_np = z.numpy()
    plt.scatter(z_np[:, 0], z_np[:, 1], c=labels, cmap='viridis', alpha=0.6)
    plt.title('Latent space (2D)')
    plt.xlabel('z1')
    plt.ylabel('z2')
    # Reconstructed data
    plt.subplot(1, 3, 3)
    x_recon_np = x_recon.numpy()
    plt.scatter(x_recon_np[:, 0], x_recon_np[:, 1], c=labels, cmap='viridis', alpha=0.6)
    plt.title('Reconstructed data (2D)')
    plt.xlabel('x1_recon')
    plt.ylabel('x2_recon')
    plt.tight_layout()
    plt.show()
    return model, X_circle, labels


model, X_circle, labels = demonstrate_manifold_learning()
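Theoretical significance and applications
- /Basis for dimensionality reduction/: explains why high-dimensional data can be reduced effectively
- /Explaining generalization/: the structure of data on a manifold helps generalization
- /Adversarial attack analysis/: adversarial examples may lie off the manifold
- /Generative models/: VAEs, GANs and similar models are essentially learning the data manifold
Challenges and limitations
- /Curse of dimensionality/: manifold learning becomes hard in high-dimensional spaces
- /Sensitivity to noise/: noise in real data can destroy the manifold structure
- /Computational complexity/: many manifold learning algorithms are expensive
- /Theoretical guarantees/: the theory for complex manifolds is still incomplete
The manifold hypothesis offers an important perspective for understanding the structure of high-dimensional data and for designing effective machine learning algorithms, and it is one of the theoretical foundations of modern representation learning.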
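Flow matching
Flow Matching is a training method for generative models built on Continuous Normalizing Flows (CNF). It learns the transformation from a simple distribution to the data distribution by modeling a probability path between the two.
Basic idea
Traditional normalizing flows model the change of distribution with a discrete sequence of invertible transformations, whereas flow matching considers a continuous-time dynamical system:
$$\frac{dz_t}{dt} = v_t(z_t)$$
where \( z_t \) is the state at time \( t \in [0,1] \) and \( v_t \) is the velocity field. The initial distribution \( p_0 \) is a simple distribution (such as a Gaussian), and the goal is to obtain the data distribution \( p_1 \) at \( t=1 \).
Probability path
Define a probability path \( p_t(z_t) \) such that:
- \( p_0 \) is a simple prior distribution
- \( p_1 \) is the data distribution
- there is a corresponding vector field \( u_t(z_t) \) such that \( p_t \) satisfies the continuity equation:
$$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t u_t) = 0$$
Training objective
The goal of flow matching is to learn a parameterized vector field \( v_\theta(z_t, t) \) that matches the true vector field \( u_t(z_t) \):
$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, z_t \sim p_t} \left\| v_\theta(z_t, t) - u_t(z_t) \right\|^2$$
Conditional flow matching
In practice computing \( u_t \) directly is hard, so Conditional Flow Matching is used instead:
$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, z \sim p_0,\, x \sim p_1} \left\| v_\theta(z_t, t) - u_t(z_t \mid z, x) \right\|^2$$
where \( z_t \) lies on the straight-line path from \( z \) to \( x \):
$$z_t = (1 - t)\, z + t\, x$$
The corresponding conditional vector field is:
$$u_t(z_t \mid z, x) = x - z$$
Advantages
- Avoids computing Jacobian determinants, so training is more efficient
- Supports flexible architecture design
- Performs well in image generation, molecular design, and other domains
- Is closely connected to diffusion models
Implementation example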
import torch
import torch.nn as nn


class SimpleFlowMatching(nn.Module):
    def __init__(self, dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden_dim),  # +1 for time
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim)
        )

    def forward(self, z, t):
        # z: [batch, dim], t: [batch, 1]
        inputs = torch.cat([z, t], dim=-1)
        return self.net(inputs)


# Training sketch; assumes data_loader yields tensors of shape [batch, dim]
def train_step(model, data_loader, optimizer):
    model.train()
    for batch in data_loader:
        # Sample a time for each example
        t = torch.rand(batch.size(0), 1, device=batch.device)
        # Sample noise
        z0 = torch.randn_like(batch)
        # Build the point on the straight interpolation path
        zt = (1 - t) * z0 + t * batch
        # Target vector field
        ut = batch - z0
        # Predicted vector field
        vt = model(zt, t)
        # Regression loss
        loss = torch.mean((vt - ut) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
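Flow matching offers a new perspective on generative modeling: it generalizes a discrete sequence of transformations to a continuous dynamical system, and it shows strong performance on many tasks.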
Flow and Diffusion Models
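Generation as Sampling : - Generating an object means drawing one sample from a distribution, \(z\sim p_\text{data}\)
Dataset : - A dataset is a finite set of samples \(z_1,\cdots,z_N\sim p_\text{data}\)
Trajectory : - \([0, 1]\rightarrow \mathrm R^d\)
Vector Field : - \(u:\mathrm {R}^d\times [0, 1]\rightarrow \mathrm R^d\)
Ordinary Differential Equation(ODE)
An ordinary differential equation (ODE) describes how a trajectory is driven by a vector field.
Given a vector field \(u(x, t)\), a trajectory \(\phi: [0,1] \to \mathbb{R}^d\) satisfies:
$$\frac{d}{dt}\phi(t) = u(\phi(t), t), \quad \phi(0) = x_0$$
That is, the instantaneous velocity of the trajectory at every point equals the value of the vector field at that point and time.
Flow
The flow generated by the vector field \(u\) is the family of maps \(\psi_t: \mathbb{R}^d \to \mathbb{R}^d\) that satisfies:
$$\frac{d}{dt}\psi_t(x) = u(\psi_t(x), t), \quad \psi_0(x) = x$$
That is, \(\psi_t\) pushes an initial point \(x\) along the ODE to its position at time \(t\). A flow is a time-dependent diffeomorphism and is used to deform a simple distribution (such as a Gaussian) into the target distribution.
The vector field that drives it varies with time.
When \(u\) is Lipschitz continuous in \(x\), the ODE has a unique solution, and that solution is given by the flow.
Simulating the ODE with the Euler method
$$X_{t+h} = X_t + h\, u(X_t, t)$$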
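The Euler method can be sketched in a few lines. The example below is a minimal illustration, assuming a scalar state and a hypothetical test field \(u(x, t) = -x\) chosen only because its exact solution \(x_0 e^{-t}\) is known; `euler_ode` is an illustrative name.

```python
import math


def euler_ode(u, x0, n_steps=1000, t0=0.0, t1=1.0):
    """Integrate dx/dt = u(x, t) from t0 to t1 with the explicit Euler method."""
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    for _ in range(n_steps):
        x = x + h * u(x, t)  # X_{t+h} = X_t + h * u(X_t, t)
        t += h
    return x


# Hypothetical test field u(x, t) = -x, whose exact flow is psi_t(x) = x * exp(-t).
x1 = euler_ode(lambda x, t: -x, x0=2.0)
exact = 2.0 * math.exp(-1.0)
```

The gap between `x1` and `exact` shrinks roughly in proportion to the step size `h`, which is the usual first-order behavior of the Euler method.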
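Flow Model
Sample an initial point from a simple distribution and evolve it through the ODE until it reaches \(X_1\); the goal is for \(X_1\) to be distributed according to \(p_\text{data}\).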
Diffusion Models
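A diffusion model replaces the ODE with a stochastic differential equation (SDE):
$$dX_t = u_t(X_t)\,dt + \sigma_t(X_t)\,dW_t$$
\(X_t\in \mathrm R^d\) : the state of the process at time \(t\), a random variable.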
Vector Field \(u_t(x)\) : A function that describes the deterministic drift of the process at time \(t\) and position \(x\).
Diffusion Coefficient \(\sigma_t(x)\) : A function (often a scalar or matrix) that controls the magnitude and direction of random noise added at each step.
Brownian Motion
Brownian motion \(W_t\) is a continuous-time stochastic process with the following properties:
- \(W_0 = 0\) almost surely
- Independent increments: \(W_t - W_s\) independent of past for \(t > s\)
- Gaussian increments: \(W_t - W_s \sim \mathcal{N}(0, t-s)\)
- Continuous paths (almost surely)
Key characteristics:
- Non-differentiable paths (nowhere differentiable)
- Quadratic variation: \([W]_t = t\)
- Martingale property: \(\mathbb{E}[W_t | \mathcal{F}_s] = W_s\) for \(t > s\)
In the SDE, \(\sigma_t dW_t\) injects Gaussian noise scaled by \(\sigma_t\), making the trajectory random rather than purely deterministic.
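SDE and ODE
Saying that an SDE has a unique solution means that the distribution of its trajectories is uniquely determined; individual sample paths still differ from run to run.
Euler
Forward Euler method for an SDE (Euler-Maruyama):
$$X_{t+h} = X_t + h\, u_t(X_t) + \sqrt{h}\, \sigma_t\, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I_d)$$
Removing the random term from a noisy observation (simple denoising)
Reverse-time denoising Euler method (the probability flow ODE of diffusion models)
A practical denoising iteration (approximating the score with a neural network \(s_\theta\))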
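As a minimal sketch of the forward Euler (Euler-Maruyama) scheme for an SDE, the snippet below simulates a scalar process and compares the sample variance at \(t = 1\) with a known closed form. The Ornstein-Uhlenbeck drift \(u(x, t) = -x\), the constant \(\sigma_t = 1\), and the name `euler_maruyama` are assumptions made only for this illustration.

```python
import math
import random


def euler_maruyama(u, sigma, x0, n_steps, rng):
    """Simulate dX_t = u(X_t, t) dt + sigma(t) dW_t on [0, 1] and return X_1."""
    h = 1.0 / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        eps = rng.gauss(0.0, 1.0)  # eps ~ N(0, 1)
        # Deterministic Euler step plus Gaussian noise scaled by sqrt(h).
        x = x + h * u(x, t) + math.sqrt(h) * sigma(t) * eps
        t += h
    return x


rng = random.Random(0)
# Hypothetical example: Ornstein-Uhlenbeck process dX = -X dt + dW with X_0 = 0.
samples = [euler_maruyama(lambda x, t: -x, lambda t: 1.0, 0.0, 200, rng)
           for _ in range(4000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
exact_var = 0.5 * (1.0 - math.exp(-2.0))  # Var(X_1) for this process
```

Each run produces a different path, but the empirical mean and variance of `samples` should be close to 0 and `exact_var`, which illustrates that it is the distribution of the trajectories that is determined.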
Flow Matching
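Probability Path
Gaussian Probability Path
$$p_t(\cdot \mid z) = \mathcal{N}(\alpha_t z, \beta_t^2 I_d)$$
At time zero this is a Gaussian distribution. As \(t\) rises to 1 the variance falls to 0, at which point it becomes a Dirac distribution that can only take one fixed value.
From this we get the boundary conditions:
$$\alpha_0 = \beta_1 = 0, \quad \alpha_1 = \beta_0 = 1$$
As time increases, \(\alpha_t\) rises from 0 to 1 while \(\beta_t\) falls from 1 to 0. The noise gradually fades away.
Marginal Prob Path
$$p_t(x) = \int p_t(x \mid z)\, p_\text{data}(z)\, dz$$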
Vector Field
Conditional Vector Field
A conditional vector field \( u_t^{\text{target}}(x \mid z) \) is a time-dependent vector field that depends on a conditioning variable \( z \). It appears in flow-based generative modeling, particularly in conditional flow matching.
Key properties:
- Domain: \( t \in [0,1] \), \( x, z \in \mathbb{R}^d \)
- Purpose: Defines a probability path \( p_t(x \mid z) \) that transports a simple base distribution (e.g., Gaussian at \( t=0 \)) to a target conditional distribution (e.g., data distribution given \( z \) at \( t=1 \)).
- Relation to ODE: The vector field generates a flow \( \phi_t \) via
$$\frac{d}{dt} \phi_t(x) = u_t^{\text{target}}(\phi_t(x) \mid z), \quad \phi_0(x) = x$$
such that \( x_1 = \phi_1(x_0) \sim p_1(x \mid z) \).
In conditional flow matching, the model learns to approximate this field by regressing onto a tractable conditional target field (e.g., from a Gaussian conditional probability path).
Cond. Gaussian Vector Field
This is the conditional Gaussian vector field used in diffusion models and score-based generative modeling.
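The following expression gives the target velocity field \( u_t^{\text{target}}(x \mid z) \) that transports a sample \( x \) at time \( t \) toward the endpoint \( z \), given the noise schedule parameters \( \alpha_t, \beta_t \):
$$u_t^{\text{target}}(x \mid z) = \left(\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t\right) z + \frac{\dot\beta_t}{\beta_t}\, x$$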
Key components:
- \( \alpha_t, \beta_t \): scalar functions defining the forward noising process (typically \( \alpha_t^2 + \beta_t^2 = 1 \) for variance-preserving SDEs)
- \( \dot\alpha_t, \dot\beta_t \): time derivatives
- \( z \): the clean target (data point)
- \( x \): current noisy sample
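A small numerical check can make these components concrete. The sketch below assumes the linear schedule \(\alpha_t = t\), \(\beta_t = 1 - t\) and the standard Gaussian-path field \(u_t(x \mid z) = (\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t) z + \frac{\dot\beta_t}{\beta_t} x\). It samples \(x = \alpha_t z + \beta_t \epsilon\) and confirms that the field evaluated there equals \(\dot\alpha_t z + \dot\beta_t \epsilon\). All function names are illustrative.

```python
import random


def alpha(t):
    return t  # assumed schedule: alpha_t = t


def beta(t):
    return 1.0 - t  # assumed schedule: beta_t = 1 - t


def alpha_dot(t):
    return 1.0


def beta_dot(t):
    return -1.0


def sample_conditional_path(z, t, rng):
    """Draw x ~ N(alpha_t z, beta_t^2 I) and also return the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in z]
    x = [alpha(t) * zi + beta(t) * ei for zi, ei in zip(z, eps)]
    return x, eps


def conditional_vector_field(x, z, t):
    """u_t(x|z) = (alpha_dot - beta_dot / beta * alpha) z + beta_dot / beta * x."""
    ratio = beta_dot(t) / beta(t)
    coef_z = alpha_dot(t) - ratio * alpha(t)
    return [coef_z * zi + ratio * xi for zi, xi in zip(z, x)]


rng = random.Random(0)
z, t = [1.0, -2.0, 0.5], 0.3
x, eps = sample_conditional_path(z, t, rng)
u = conditional_vector_field(x, z, t)
# Along the path the field reduces to alpha_dot * z + beta_dot * eps = z - eps.
expected = [zi - ei for zi, ei in zip(z, eps)]
```

For this particular schedule the target simplifies to \(z - \epsilon\), which is the regression target used in the straight-line flow matching setup.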
Marginal Vector Field
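The marginal vector field is obtained by taking the expectation of the conditional field over the posterior \( z \sim p_t(z \mid x) \):
$$u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x \mid z)\, \frac{p_t(x \mid z)\, p_\text{data}(z)}{p_t(x)}\, dz$$
This marginal field defines the probability flow ODE that generates samples from the data distribution.
The marginal vector field \( u_t(x) \) satisfies:
Initial condition: at \( t=0 \), \( p_0(x) \) is a simple prior distribution (such as a standard Gaussian), and the vector field carries samples away from it.
Terminal condition: at \( t=1 \), \( p_1(x) \) approximates the data distribution \( p_{\text{data}}(x) \); the vector field has smoothly transformed the initial noise into target data.
Therefore, solving the ODE \( dx/dt = u_t(x) \) turns samples from the prior into samples from the data distribution.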
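For a finite dataset the posterior expectation is a weighted sum, so the marginal field can be evaluated exactly. The sketch below assumes a hypothetical 1-D dataset of three points, the schedule \(\alpha_t = t\), \(\beta_t = 1 - t\), and Euler integration; it is meant only to illustrate how the ODE carries Gaussian noise onto the data.

```python
import math
import random

DATA = [-2.0, 0.0, 3.0]  # hypothetical 1-D dataset; p_data is uniform over it


def marginal_vector_field(x, t):
    """u_t(x) = sum_i p(z_i | x) * u_t(x | z_i) for alpha_t = t, beta_t = 1 - t."""
    beta = 1.0 - t
    # Posterior weights p(z_i | x) are proportional to N(x; t * z_i, beta^2).
    logits = [-((x - t * z) ** 2) / (2.0 * beta ** 2) for z in DATA]
    top = max(logits)
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    # Conditional field for this schedule: u_t(x | z) = (z - x) / (1 - t).
    return sum(w / total * (z - x) / beta for w, z in zip(weights, DATA))


def sample(rng, n_steps=200):
    x = rng.gauss(0.0, 1.0)  # X_0 ~ N(0, 1)
    h = 1.0 / n_steps
    for k in range(n_steps):  # Euler steps at t = 0, h, ..., 1 - h
        x += h * marginal_vector_field(x, k * h)
    return x


rng = random.Random(0)
samples = [sample(rng) for _ in range(200)]
```

Every entry of `samples` should end up essentially on one of the points in `DATA`. With a real dataset the exact field is replaced by a learned network, since memorizing the training points is not the goal.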
Continuity Equation
Definition: Divergence
Divergence (\(\nabla\cdot F\)) measures the net "outflow" of a vector field from a point.
- Positive divergence: net source (fluid expanding outward).
- Negative divergence: net sink (fluid converging inward).
- Zero divergence: incompressible flow (solenoidal).
Mathematically: \(\nabla\cdot F = \partial F_x/\partial x + \partial F_y/\partial y + \partial F_z/\partial z\) (in 3D Cartesian coordinates).
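The three cases can be checked numerically. The following sketch estimates the divergence with central finite differences; the helper `divergence` and the three example fields are illustrative assumptions.

```python
def divergence(field, x, h=1e-5):
    """Central-difference estimate of div F = sum_i dF_i/dx_i at the point x."""
    total = 0.0
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        total += (field(forward)[i] - field(backward)[i]) / (2.0 * h)
    return total


def source(x):
    return [xi for xi in x]  # F(x) = x, divergence 3 in 3-D


def rotation(x):
    return [-x[1], x[0], 0.0]  # rigid rotation about the z-axis, divergence 0


def sink(x):
    return [-2.0 * xi for xi in x]  # F(x) = -2x, divergence -6 in 3-D


point = [0.3, -1.2, 0.7]
div_source = divergence(source, point)
div_rotation = divergence(rotation, point)
div_sink = divergence(sink, point)
```

The three estimates come out positive, zero, and negative respectively, matching the source, incompressible, and sink cases listed above.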
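The continuity equation states that a vector field \(u_t\) generates the probability path \(p_t\) exactly when
$$\partial_t p_t(x) = -\nabla\cdot(p_t u_t)(x)$$
Using it, we can show that because the conditional vector field generates the conditional probability path, the marginal vector field generates the marginal probability path as well.
Training Flow Matching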
Score Functions and Score Matching
Conditional Score
The conditional score extends score matching to model conditional distributions \( p(\mathbf{x} \mid \mathbf{y}) \).
Definition: The conditional score is the gradient of the log conditional density:
$$\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y})$$
Key uses:
- Conditional generation (e.g., image inpainting, super-resolution, text-to-image)
- Inverse problems (denoising, deblurring)
- Classifier-free guidance in diffusion models
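Training: Minimize the conditional score matching objective:
$$\mathcal{L}(\theta) = \mathbb{E}_{(\mathbf{x}, \mathbf{y})}\Big[\big\| s_\theta(\mathbf{x}, \mathbf{y}) - \nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y}) \big\|^2\Big]$$
Connection to denoising: The conditional score can be approximated via a conditional denoising autoencoder. For a noisy input \( \tilde{\mathbf{x}} = \mathbf{x} + \sigma \mathbf{\epsilon} \) and a denoiser \( D_\theta \) that estimates \( \mathbb{E}[\mathbf{x} \mid \tilde{\mathbf{x}}, \mathbf{y}] \):
$$s_\theta(\tilde{\mathbf{x}}, \mathbf{y}) \approx \frac{D_\theta(\tilde{\mathbf{x}}, \mathbf{y}) - \tilde{\mathbf{x}}}{\sigma^2}$$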
Gaussian Score
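For the Gaussian probability path \( \mathcal{N}(\alpha_t z, \beta_t^2 I_d) \) the conditional score has the closed form \( \nabla_x \log p_t(x \mid z) = -(x - \alpha_t z)/\beta_t^2 \). The sketch below checks this in 1-D against a finite-difference derivative of the log-density; the function names and the numbers plugged in are illustrative assumptions.

```python
import math


def gaussian_log_density(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def gaussian_score(x, z, alpha_t, beta_t):
    """Closed-form score of N(alpha_t * z, beta_t^2) in one dimension."""
    return -(x - alpha_t * z) / beta_t ** 2


x, z, alpha_t, beta_t = 0.4, 1.5, 0.6, 0.4
h = 1e-6
# Central finite difference of the log-density with respect to x.
numeric = (gaussian_log_density(x + h, alpha_t * z, beta_t)
           - gaussian_log_density(x - h, alpha_t * z, beta_t)) / (2.0 * h)
closed = gaussian_score(x, z, alpha_t, beta_t)
```

The score points from the current sample toward the mean \( \alpha_t z \), and its magnitude grows as \( \beta_t \) shrinks.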
Marginal Score
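Marginal Score refers to the score of the marginal distribution \( p(\mathbf{x}) \), obtained by integrating out latent variables or conditions.
Definition:
$$\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \int p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$$
Key properties:
- No conditioning: it models the data distribution alone
- Intractable directly due to the integral over \(\mathbf{z}\)
- Denoising Score Matching provides a tractable surrogate
Connection to Conditional Score: By the law of total probability, the marginal score is the posterior average of the conditional score:
$$\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z} \mid \mathbf{x})}\big[\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{z})\big]$$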
Use in diffusion models:
- Unconditional diffusion models learn the marginal score
- Classifier-free guidance interpolates between conditional and marginal scores:
$$\tilde{s}_\theta(\mathbf{x}, \mathbf{y}) = s_\theta(\mathbf{x}, \mathbf{y}) + w \big(s_\theta(\mathbf{x}, \mathbf{y}) - s_\theta(\mathbf{x})\big)$$
where \( s_\theta(\mathbf{x}) \) is the marginal score and \( w \) controls guidance strength.
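The guidance rule is a simple elementwise combination of two score vectors. A minimal sketch, with made-up score values standing in for network outputs and `guided_score` as an illustrative name:

```python
def guided_score(cond_score, uncond_score, w):
    """s_tilde = s(x, y) + w * (s(x, y) - s(x)), applied elementwise."""
    return [c + w * (c - u) for c, u in zip(cond_score, uncond_score)]


cond = [1.0, -0.5]    # hypothetical conditional score s_theta(x, y)
uncond = [0.5, 0.0]   # hypothetical marginal score s_theta(x)
no_guidance = guided_score(cond, uncond, w=0.0)
strong = guided_score(cond, uncond, w=2.0)
```

With `w=0.0` the result is just the conditional score; larger `w` pushes further along the direction in which the condition changes the score.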
Sampling with SDE
SDE Extension Trick
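The SDE extension trick (Theorem 17) states:
For any diffusion coefficient \(\sigma_t \ge 0\), one can construct an SDE by adding stochastic dynamics on top of the dynamics of the original ODE, such that the trajectories of the SDE still follow the same probability path \(p_t\). Specifically, the SDE has the form:
$$dX_t = \left[u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\nabla\log p_t(X_t)\right]dt + \sigma_t\, dW_t, \quad X_0 \sim p_0$$
where \(u_t^{\text{target}}(x)\) is the marginal vector field and \(\nabla\log p_t(x)\) is the marginal score function. The trajectories of this SDE satisfy \(X_t \sim p_t\) (\(0 \le t \le 1\)), and in particular \(X_1 \sim p_{\text{data}}\). The key point of the trick is that the marginal distributions \(p_t\) stay unchanged even though randomness has been added.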
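The claim can be checked numerically in a case where everything is available in closed form. The sketch below assumes a 1-D Gaussian target \(p_{\text{data}} = \mathcal{N}(M, S^2)\), a standard normal prior, the schedule \(\alpha_t = t\), \(\beta_t = 1 - t\), and a constant \(\sigma_t\). Under these assumptions \(p_t = \mathcal{N}(tM,\, t^2S^2 + (1-t)^2)\), and both the marginal vector field and the marginal score are linear in \(x\). All names and constants are illustrative.

```python
import math
import random

M, S = 2.0, 0.5   # hypothetical target p_data = N(M, S^2); the prior is N(0, 1)
SIGMA = 1.0       # constant diffusion coefficient sigma_t


def variance(t):
    # With x = t * z + (1 - t) * eps, p_t = N(t * M, t^2 S^2 + (1 - t)^2).
    return (t * S) ** 2 + (1.0 - t) ** 2


def marginal_field(x, t):
    # Linear field transporting N(mu_t, v_t): u = mu_dot + v_dot / (2 v) * (x - mu).
    v_dot = 2.0 * t * S ** 2 - 2.0 * (1.0 - t)
    return M + v_dot / (2.0 * variance(t)) * (x - t * M)


def marginal_score(x, t):
    return -(x - t * M) / variance(t)


def sample_sde(rng, n_steps=200):
    h = 1.0 / n_steps
    x = rng.gauss(0.0, 1.0)  # X_0 ~ N(0, 1)
    for k in range(n_steps):
        t = k * h
        # Drift of the extended SDE: ODE field plus the score correction.
        drift = marginal_field(x, t) + 0.5 * SIGMA ** 2 * marginal_score(x, t)
        x += h * drift + SIGMA * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
xs = [sample_sde(rng) for _ in range(4000)]
mean = sum(xs) / len(xs)
std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
```

Despite the injected noise, `mean` and `std` should land close to `M` and `S`. Setting `SIGMA = 0.0` recovers the plain ODE sampler with the same endpoint distribution.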