An introduction to Flow Matching and Diffusion Models

Related content

Linqi Zhou --- Terminal Velocity Matching

Terminal Velocity Matching

Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song

https://arxiv.org/abs/2511.19797

We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the \(2\)-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.

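Problems

Diffusion models restrict the choice of sampling probability path, which leads to long training times and inefficient sampling.

The distributions used in diffusion models are Gaussians, which are easy to work with, but the integral that defines the marginal likelihood is intractable. DDPM sidesteps this integral by optimizing the evidence lower bound (ELBO) as a surrogate for the true likelihood. The ELBO is only a bound, however, and it can be far from the true log-likelihood.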

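Prerequisites

Normalizing flows

A normalizing flow transforms a simple distribution (such as a Gaussian) into a complex one through a sequence of invertible maps. The core idea is to use the change of variables formula to compute the density of the transformed variable exactly.

Change of variables formula

Suppose a random variable \( z \sim p_z(z) \) is mapped through an invertible map \( f: \mathbb{R}^d \to \mathbb{R}^d \) to a new variable \( x = f(z) \). Then the density of \( x \) is: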

$$p_x(x) = p_z(f^{-1}(x)) \left| \det \left( \frac{\partial f^{-1}(x)}{\partial x} \right) \right|$$

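where \( \partial f^{-1}(x) / \partial x \) is the Jacobian matrix of the inverse map, and the absolute value of its determinant accounts for how the map locally stretches or compresses volume.

To keep the computation cheap, \( f \) is usually required to have a Jacobian determinant that is easy to evaluate, for example a triangular Jacobian or a transformation whose determinant is exactly 1.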
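As a small sanity check of the change of variables formula, the sketch below pushes a standard normal through an affine map \( f(z) = az + b \) in one dimension and compares the result with the known density \( \mathrm N(b, a^2) \). The affine map and the values of `a` and `b` are assumptions chosen only because the answer is available in closed form.

```python
import math

def standard_normal_pdf(z):
    """Density of N(0, 1)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def pushforward_pdf(x, a, b):
    """Density of x = f(z) = a*z + b with z ~ N(0, 1), via change of variables."""
    z = (x - b) / a        # f^{-1}(x)
    jac_inv = 1.0 / a      # derivative of f^{-1} with respect to x
    return standard_normal_pdf(z) * abs(jac_inv)

def normal_pdf(x, mean, std):
    """Closed-form density of N(mean, std^2), used only as a reference."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

a, b = 2.0, 1.0  # illustrative values
for x in (-1.0, 0.5, 3.0):
    # The two columns should agree up to floating point error
    print(x, pushforward_pdf(x, a, b), normal_pdf(x, b, abs(a)))
```

In higher dimensions the scalar `jac_inv` becomes a Jacobian matrix and `abs(jac_inv)` becomes the absolute value of its determinant, which is why flows favor maps with triangular Jacobians.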

Basic structure of a flow model

A normalizing flow builds a complex distribution by composing several simple transformations:

$$x = f_K \circ f_{K-1} \circ \cdots \circ f_1(z)$$

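The corresponding log-density is:

$$\log p(x) = \log p_z(z) - \sum_{k=1}^K \log \left| \det \left( \frac{\partial f_k}{\partial z_{k-1}} \right) \right|$$

where \( z_0 = z \) and \( z_k = f_k(z_{k-1}) \).

Advantages and limitations

Normalizing flows provide the theoretical foundation for understanding Flow Matching, which learns probability paths in a more flexible way.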

Normalizing Flows

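Simple distributions are easy to handle but cannot represent complex shapes; complex distributions can represent data, but their densities are hard to compute.

A normalizing flow applies a sequence of invertible transformations that gradually turn a simple distribution into a complex one.

Change of variables formula

See the change of variables formula from multivariable calculus.

The change of variables formula is the core of flow models. If \( z \sim p_z(z) \) and \( x = f(z) \) for an invertible map \( f \), then \( x \) is distributed as: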

$$p_x(x) = p_z(f^{-1}(x)) \cdot \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|$$

or equivalently:

$$p_x(x) = p_z(z) \cdot \left| \det \frac{\partial f(z)}{\partial z} \right|^{-1}$$

where \(\det \frac{\partial f}{\partial z}\) is the Jacobian determinant of the transformation \( f \); it ensures that the density stays normalized under the transformation.

Flow

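Because a single transformation has limited flexibility, we chain several invertible transformations together.

With the composition \( f = f_K \circ \cdots \circ f_2 \circ f_1 \), we have: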

$$x = f(z) = f_K \circ \cdots \circ f_2 \circ f_1(z)$$

The density becomes:

$$p_x(x) = p_z(z) \prod_{k=1}^K \left| \det \frac{\partial f_k(z_{k-1})}{\partial z_{k-1}} \right|^{-1}$$

where \( z_0 = z \), \( z_k = f_k(z_{k-1}) \), and \( z_K = x \).
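To make the product of Jacobian terms concrete, here is a minimal sketch of a two-layer flow in one dimension. The layers \( f_1(z) = az + b \) and \( f_2(y) = e^y \) are assumptions picked for illustration: their composition maps a standard normal to a log-normal distribution, whose density is known in closed form and serves as a reference.

```python
import math

def log_standard_normal(z):
    """log density of N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_log_density(x, a, b):
    """log p_x(x) for x = f_2(f_1(z)), with f_1(z) = a*z + b and f_2(y) = exp(y)."""
    # Invert the layers in reverse order to recover z_1 and z_0
    z1 = math.log(x)          # inverse of f_2
    z0 = (z1 - b) / a         # inverse of f_1
    # Log absolute Jacobian determinant of each layer, evaluated at its input
    log_det_f1 = math.log(abs(a))   # |d f_1 / d z_0| = |a|
    log_det_f2 = z1                 # |d f_2 / d z_1| = exp(z_1)
    return log_standard_normal(z0) - (log_det_f1 + log_det_f2)

def log_normal_log_density(x, mu, sigma):
    """Closed-form log density of a log-normal distribution, used as a reference."""
    return (-math.log(x) - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * ((math.log(x) - mu) / sigma) ** 2)

a, b = 0.5, 0.3  # illustrative values
for x in (0.4, 1.7, 5.0):
    print(x, flow_log_density(x, a, b), log_normal_log_density(x, b, abs(a)))
```

Each layer contributes one additive log-determinant term, so stacking more layers only extends the sum.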

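Key requirements: each \( f_k \) must be invertible, and its Jacobian determinant must be cheap to compute.

Continuous normalizing flows

The probability density is transformed in continuous time.

Flow matching

A simulation-free method for training CNFs. It generates a probability path by matching the model vector field to a target vector field, which avoids numerically simulating the ODE during training and gives efficient training and sampling.

Conditional flow matching

Manifold hypothesis

Manifold Hypothesis - AIDIY Wiki

The manifold hypothesis is a core assumption in machine learning and representation learning: real-world high-dimensional data usually lies on or near a low-dimensional manifold. This assumption provides the theoretical basis for dimensionality reduction and feature learning.

Mathematical definition

Let the observed data \( \mathbf{x} \in \mathbb{R}^D \) live in a high-dimensional space. The manifold hypothesis states that there exist a low-dimensional manifold \( \mathcal{M} \subset \mathbb{R}^D \) (of dimension \( d \ll D \)) and a generative process: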

$$\mathbf{x} = f(\mathbf{z}) + \mathbf{\epsilon}$$

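where \( \mathbf{z} \in \mathbb{R}^d \) is a low-dimensional latent coordinate, \( f \) maps latent coordinates onto \( \mathcal{M} \), and \( \mathbf{\epsilon} \) is a small noise term. In other words, each observation \( \mathbf{x} \) either lies on \( \mathcal{M} \) or is very close to it.

Intuition

  1. /Locally Euclidean/: every point on the manifold has a neighborhood homeomorphic to Euclidean space
  2. /Intrinsic dimension/: the true number of degrees of freedom of the data is far smaller than the observed dimension
  3. /Continuity/: nearby points on the manifold correspond to semantically similar samples

Example: handwritten digits

Digit images live in a 784-dimensional space (28×28 pixels), but the set of all plausible digit images forms a low-dimensional manifold whose intrinsic dimension is far smaller than 784.

Suppose we want to write the digit "2".

We only need to control a few key parameters: where the stroke starts, how much it curves, where it ends, and so on. These parameters form a low-dimensional space, and all possible images of "2" lie on the manifold obtained by mapping that low-dimensional space into the high-dimensional pixel space.

This suggests that for two plausible ways of writing a 2, \(x_1\) and \(x_2\), there is a continuous path \(\gamma(t)\) on the manifold connecting them, and every point on the path corresponds to a plausible way of writing "2".

The path \(\gamma(t)\) is parameterized by

$$t \in \left [0, 1\right]$$
with
$$\gamma(0)=x_1, \gamma(1)=x_2$$

Code: manifold learning algorithms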

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import Isomap, LocallyLinearEmbedding, TSNE
from sklearn.decomposition import PCA
from umap import UMAP

def demonstrate_manifold_hypothesis():
    """Demonstrate the manifold hypothesis."""

    # 1. Generate the Swiss roll dataset (a classic manifold example)
    print("1. Swiss roll dataset")
    n_samples = 1000
    noise = 0.1

    # Build the Swiss roll: a 2D sheet rolled up in 3D
    t = 1.5 * np.pi * (1 + 2 * np.random.rand(1, n_samples))
    x = t * np.cos(t)
    y = 30 * np.random.rand(1, n_samples)
    z = t * np.sin(t)

    X_swiss = np.concatenate((x, y, z), axis=0).T
    X_swiss += noise * np.random.randn(n_samples, 3)
    t = t.squeeze()

    print(f"Data shape: {X_swiss.shape}")
    print("Observed dimension: 3")
    print("Intrinsic dimension: 2 (in theory)")

    # 2. Apply several manifold learning algorithms
    methods = {
        'PCA': PCA(n_components=2),
        'Isomap': Isomap(n_components=2, n_neighbors=10),
        'LLE': LocallyLinearEmbedding(n_components=2, n_neighbors=10),
        't-SNE': TSNE(n_components=2, random_state=42),
        'UMAP': UMAP(n_components=2, random_state=42)
    }

    # Visualize the results
    fig = plt.figure(figsize=(20, 12))

    # Original 3D data
    ax1 = fig.add_subplot(2, 3, 1, projection='3d')
    ax1.scatter(X_swiss[:, 0], X_swiss[:, 1], X_swiss[:, 2], c=t, cmap='viridis')
    ax1.set_title('Original Swiss roll (3D)')
    ax1.set_xlabel('X')
    ax1.set_ylabel('Y')
    ax1.set_zlabel('Z')

    # Embedding produced by each method
    for i, (name, method) in enumerate(methods.items()):
        X_embedded = method.fit_transform(X_swiss)

        ax = fig.add_subplot(2, 3, i+2)
        scatter = ax.scatter(X_embedded[:, 0], X_embedded[:, 1], c=t, cmap='viridis')
        ax.set_title(f'{name}')
        ax.set_xlabel('Component 1')
        ax.set_ylabel('Component 2')
        plt.colorbar(scatter, ax=ax)

    plt.tight_layout()
    plt.show()

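    # 3. Intrinsic dimension estimation
    print("\n2. Intrinsic dimension estimation")

    def intrinsic_dimension_estimation(X, k=10):
        """Estimate the intrinsic dimension from nearest-neighbor distances."""
        from sklearn.neighbors import NearestNeighbors

        # Column 0 of `distances` is each point's distance to itself (zero)
        nbrs = NearestNeighbors(n_neighbors=k+1).fit(X)
        distances, indices = nbrs.kneighbors(X)

        # Maximum-likelihood-style estimate: inverse of the mean log-ratio
        # between the k-th and the j-th neighbor distance, for j < k
        r_k = distances[:, k]  # distance to the k-th nearest neighbor
        d_hat = 1 / np.mean(np.log(r_k[:, None] / distances[:, 1:k]))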

        return d_hat

    dim_estimate = intrinsic_dimension_estimation(X_swiss)
    print(f"Estimated intrinsic dimension: {dim_estimate:.2f}")

    return X_swiss, t, methods

X_swiss, t, methods = demonstrate_manifold_hypothesis()

Taxonomy of manifold learning algorithms

Theoretical basis of the manifold hypothesis

import torch
import torch.nn as nn

class ManifoldLearningTheory:
    """Demonstrations of manifold learning theory."""

    @staticmethod
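    def tangent_space_estimation(X, point_idx, k=20):
        """Estimate the tangent space of the manifold at a given point."""
        from sklearn.neighbors import NearestNeighbors
        from scipy.linalg import svd

        # Find the k nearest neighbors of the query point
        nbrs = NearestNeighbors(n_neighbors=k).fit(X)
        distances, indices = nbrs.kneighbors(X[point_idx:point_idx+1])

        # Local neighborhood
        neighborhood = X[indices[0]]

        # Center the neighborhood
        center = neighborhood.mean(axis=0)
        centered = neighborhood - center

        # SVD gives the principal directions of the neighborhood
        U, s, Vt = svd(centered, full_matrices=False)

        # The tangent space is spanned by the first d right singular vectors
        tangent_basis = Vt[:2].T  # assumes intrinsic dimension d = 2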

        return center, tangent_basis, neighborhood

    @staticmethod
    def manifold_regularization_loss(encoder, X, lambda_reg=0.1):
        """Manifold regularization loss that encourages the encoder to preserve local structure."""
        from sklearn.neighbors import NearestNeighbors

        # Find each point's nearest neighbors (sklearn works on NumPy arrays)
        X_np = X.detach().cpu().numpy()
        nbrs = NearestNeighbors(n_neighbors=5).fit(X_np)
        distances, indices = nbrs.kneighbors(X_np)

        batch_size = X.shape[0]
        reg_loss = 0

        for i in range(batch_size):
            # Neighbors in the original space
            neighbors = X[indices[i]]

            # Encoded representations
            z_i = encoder(X[i:i+1])
            z_neighbors = encoder(neighbors)

            # Distances to the neighbors before and after encoding
            original_dists = torch.cdist(X[i:i+1], neighbors)
            encoded_dists = torch.cdist(z_i, z_neighbors)

            # Penalize changes in local distances
            reg_loss += torch.mean((original_dists - encoded_dists)**2)

        return lambda_reg * reg_loss / batch_size

def demonstrate_tangent_spaces():
    """Demonstrate tangent space estimation."""

    # Generate points on the unit sphere
    n_samples = 500
    theta = np.random.uniform(0, 2*np.pi, n_samples)
    phi = np.random.uniform(0, np.pi, n_samples)

    # Spherical to Cartesian coordinates
    x = np.sin(phi) * np.cos(theta)
    y = np.sin(phi) * np.sin(theta)
    z = np.cos(phi)

    X_sphere = np.column_stack([x, y, z])

    # Estimate the tangent space at a few points
    test_points = [0, 100, 200]

    fig = plt.figure(figsize=(15, 5))

    for i, point_idx in enumerate(test_points):
        center, tangent_basis, neighborhood = \
            ManifoldLearningTheory.tangent_space_estimation(X_sphere, point_idx)

        # 3D visualization
        ax = fig.add_subplot(1, 3, i+1, projection='3d')

        # Plot the sphere samples
        ax.scatter(X_sphere[:, 0], X_sphere[:, 1], X_sphere[:, 2],
                  alpha=0.3, s=10, color='blue')

        # Plot the local neighborhood
        ax.scatter(neighborhood[:, 0], neighborhood[:, 1], neighborhood[:, 2],
                  color='red', s=50)

        # Plot the tangent plane
        u, v = tangent_basis[:, 0], tangent_basis[:, 1]
        u, v = u * 0.3, v * 0.3  # scale the tangent vectors

        # Grid of points on the tangent plane
        s, t = np.meshgrid(np.linspace(-1, 1, 2), np.linspace(-1, 1, 2))
        plane_x = center[0] + s * u[0] + t * v[0]
        plane_y = center[1] + s * u[1] + t * v[1]
        plane_z = center[2] + s * u[2] + t * v[2]

        ax.plot_surface(plane_x, plane_y, plane_z, alpha=0.5, color='orange')

        ax.set_title(f'Tangent space at point {point_idx}')
        ax.set_xlim(-1, 1)
        ax.set_ylim(-1, 1)
        ax.set_zlim(-1, 1)

    plt.tight_layout()
    plt.show()

    print("Tangent space estimation finished")
    print("The tangent space at each point is spanned by two orthogonal vectors")

demonstrate_tangent_spaces()

Applications in deep learning

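import torch.nn.functional as F  # needed by manifold_aware_loss below

class ManifoldAwareAutoencoder(nn.Module):
    """Autoencoder that takes manifold structure into account."""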

    def __init__(self, input_dim, hidden_dims, latent_dim):
        super(ManifoldAwareAutoencoder, self).__init__()

        # Encoder
        encoder_layers = []
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            encoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.LeakyReLU(0.2)
            ])
            prev_dim = hidden_dim
        self.encoder = nn.Sequential(*encoder_layers)
        self.fc_mu = nn.Linear(prev_dim, latent_dim)
        self.fc_logvar = nn.Linear(prev_dim, latent_dim)

        # Decoder
        decoder_layers = []
        prev_dim = latent_dim
        for hidden_dim in reversed(hidden_dims):
            decoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.LeakyReLU(0.2)
            ])
            prev_dim = hidden_dim
        decoder_layers.append(nn.Linear(prev_dim, input_dim))
        self.decoder = nn.Sequential(*decoder_layers)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)

        # Decode
        x_recon = self.decoder(z)

        return x_recon, z, mu, logvar

    def manifold_aware_loss(self, x, x_recon, z, mu, logvar, lambda_manifold=0.1):
        """Loss with a manifold-aware regularization term."""

        # Reconstruction loss
        recon_loss = F.mse_loss(x_recon, x, reduction='mean')

        # KL divergence
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)

        # Manifold regularization (simplified)
        manifold_loss = self.local_structure_preservation_loss(x, z)

        total_loss = recon_loss + 0.01 * kl_loss + lambda_manifold * manifold_loss

        return total_loss, recon_loss, kl_loss, manifold_loss

    def local_structure_preservation_loss(self, x, z, k=5):
        """Local structure preservation loss."""
        from sklearn.neighbors import NearestNeighbors
        import torch.nn.functional as F

        # Find nearest neighbors in the original space
        x_np = x.detach().cpu().numpy()
        nbrs = NearestNeighbors(n_neighbors=k+1).fit(x_np)
        distances, indices = nbrs.kneighbors(x_np)

        batch_size = x.size(0)
        loss = 0

        for i in range(batch_size):
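            # Neighbors in the original space
            orig_neighbors = x[indices[i, 1:]]  # skip the first entry (the point itself)

            # Corresponding points in the latent space
            z_i = z[i:i+1]
            z_neighbors = z[indices[i, 1:]]

            # Distances to the neighbors in both spaces
            orig_dists = torch.cdist(x[i:i+1], orig_neighbors).squeeze()
            latent_dists = torch.cdist(z_i, z_neighbors).squeeze()

            # Compare the distances after normalizing each set by its maximum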
            loss += F.mse_loss(orig_dists / orig_dists.max(),
                              latent_dists / latent_dists.max())

        return loss / batch_size

def demonstrate_manifold_learning():
    """Demonstrate manifold learning in a deep learning setting."""

    # Generate a dataset of concentric circles
    n_samples = 1000
    t = np.linspace(0, 2*np.pi, n_samples)

    # Two concentric circles
    r1, r2 = 1.0, 2.0
    x1 = r1 * np.cos(t) + 0.1 * np.random.randn(n_samples)
    y1 = r1 * np.sin(t) + 0.1 * np.random.randn(n_samples)
    x2 = r2 * np.cos(t) + 0.1 * np.random.randn(n_samples)
    y2 = r2 * np.sin(t) + 0.1 * np.random.randn(n_samples)

    X_circle = np.vstack([np.column_stack([x1, y1]), np.column_stack([x2, y2])])
    labels = np.hstack([np.zeros(n_samples), np.ones(n_samples)])

    # Convert to a PyTorch tensor
    X_tensor = torch.FloatTensor(X_circle)

    # Create the manifold-aware autoencoder
    model = ManifoldAwareAutoencoder(
        input_dim=2,
        hidden_dims=[64, 32],
        latent_dim=2
    )

    # Training (simplified)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    print("Training the manifold-aware autoencoder...")
    for epoch in range(1000):
        optimizer.zero_grad()

        x_recon, z, mu, logvar = model(X_tensor)
        total_loss, recon_loss, kl_loss, manifold_loss = \
            model.manifold_aware_loss(X_tensor, x_recon, z, mu, logvar)

        total_loss.backward()
        optimizer.step()

        if epoch % 200 == 0:
            print(f'Epoch {epoch}: Total Loss: {total_loss.item():.4f}, '
                  f'Recon: {recon_loss.item():.4f}, Manifold: {manifold_loss.item():.4f}')

    # Visualize the results
    with torch.no_grad():
        x_recon, z, _, _ = model(X_tensor)

        plt.figure(figsize=(15, 5))

        # Original data
        plt.subplot(1, 3, 1)
        plt.scatter(X_circle[:, 0], X_circle[:, 1], c=labels, cmap='viridis', alpha=0.6)
        plt.title('Original data (2D)')
        plt.xlabel('x1')
        plt.ylabel('x2')

        # Latent space
        plt.subplot(1, 3, 2)
        z_np = z.numpy()
        plt.scatter(z_np[:, 0], z_np[:, 1], c=labels, cmap='viridis', alpha=0.6)
        plt.title('Latent space (2D)')
        plt.xlabel('z1')
        plt.ylabel('z2')

        # Reconstructed data
        plt.subplot(1, 3, 3)
        x_recon_np = x_recon.numpy()
        plt.scatter(x_recon_np[:, 0], x_recon_np[:, 1], c=labels, cmap='viridis', alpha=0.6)
        plt.title('Reconstructed data (2D)')
        plt.xlabel('x1_recon')
        plt.ylabel('x2_recon')

        plt.tight_layout()
        plt.show()

    return model, X_circle, labels

model, X_circle, labels = demonstrate_manifold_learning()

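Theoretical significance and applications

  1. /Theoretical basis for dimensionality reduction/: explains why high-dimensional data can be reduced effectively
  2. /Explaining generalization/: the structure of data on a manifold helps models generalize
  3. /Analysis of adversarial attacks/: adversarial examples may lie off the manifold
  4. /Generative models/: VAEs, GANs, and similar models are essentially learning the data manifold

Challenges and limitations

  1. /Curse of dimensionality/: manifold learning becomes difficult in high-dimensional spaces
  2. /Noise sensitivity/: noise in real data can destroy the manifold structure
  3. /Computational complexity/: many manifold learning algorithms are computationally expensive
  4. /Theoretical guarantees/: the theoretical understanding of complex manifolds is still incomplete

The manifold hypothesis offers an important perspective for understanding the structure of high-dimensional data and for designing effective machine learning algorithms, and it is one of the theoretical foundations of modern representation learning.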

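Flow matching

Flow Matching is a method for training generative models based on Continuous Normalizing Flows (CNFs). It learns the transformation from a simple distribution to the data distribution by modeling a probability path between them.

Basic idea

A traditional normalizing flow models the change of distribution with a discrete sequence of invertible transformations, whereas flow matching considers a continuous-time dynamical system: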

$$\frac{dz_t}{dt} = v_t (z_t)$$

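where \( z_t \) is the state at time \( t \in [0,1] \) and \( v_t \) is the velocity field. The initial distribution \( p_0 \) is a simple distribution (such as a Gaussian), and the goal is to obtain the data distribution \( p_1 \) at \( t=1 \).

Probability path

A probability path \( p_t(z_t) \) generated by a target velocity field \( u_t \) satisfies the continuity equation: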

$$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t u_t) = 0$$

Training objective

The goal of flow matching is to learn a parameterized vector field \( v_\theta(z_t, t) \) that matches the target vector field \( u_t(z_t) \):

$$\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t \sim U(0,1), z_t \sim p_t} \left[ \| v_\theta(z_t, t) - u_t(z_t) \|^2 \right]$$

Conditional flow matching

In practice \( u_t \) is hard to compute directly, so Conditional Flow Matching is used instead:

$$\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t \sim U(0,1), x \sim p_1, z \sim p_0} \left[ \| v_\theta(z_t, t) - u_t(z_t | x) \|^2 \right]$$

where \( z_t \) is the straight-line path from \( z \) to \( x \):

$$z_t = (1-t) z + tx$$

The corresponding conditional vector field is:

$$u_t(z_t | x) = x - z$$

Advantages

Implementation example

import torch
import torch.nn as nn

class SimpleFlowMatching(nn.Module):
    def __init__(self, dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden_dim),  # +1 for time
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim)
        )

    def forward(self, z, t):
        # z: [batch, dim], t: [batch, 1]
        inputs = torch.cat([z, t], dim=-1)
        return self.net(inputs)

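# Training sketch: one pass over the data loader
def train_step(model, data_loader, optimizer):
    model.train()
    for batch in data_loader:
        # Sample a time t ~ U(0, 1) for each example
        t = torch.rand(batch.size(0), 1, device=batch.device)

        # Sample noise z0 ~ p_0
        z0 = torch.randn_like(batch)

        # Build the straight-line interpolation z_t = (1 - t) z0 + t x
        zt = (1 - t) * z0 + t * batch

        # Target conditional vector field u_t(z_t | x) = x - z0
        ut = batch - z0

        # Predicted vector field
        vt = model(zt, t)

        # Mean squared error between prediction and target
        loss = torch.mean((vt - ut) ** 2)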

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

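Flow matching offers a new perspective on generative modeling: it generalizes a discrete sequence of transformations to a continuous dynamical system, and it has shown strong performance on many tasks.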

Flow and Diffusion Models

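Generation as Sampling: generating an object amounts to drawing one sample \(z\sim p_\text{data}\) from the data distribution.

Dataset: a dataset is a finite collection of samples \(z_1,\cdots,z_N\sim p_\text{data}\).

Trajectory: a map \([0, 1]\rightarrow \mathbb{R}^d\).

Vector Field: a map \(u:\mathbb{R}^d\times [0, 1]\rightarrow \mathbb{R}^d\).

Ordinary Differential Equation (ODE)

An ordinary differential equation (ODE) describes how a trajectory is driven by a vector field.

Given a vector field \(u(x, t)\), a trajectory \(\phi: [0,1] \to \mathbb{R}^d\) satisfies:

$$\frac{d}{dt} \phi(t) = u(\phi(t), t), \quad \phi(0) = x_0$$

That is, the instantaneous velocity of the trajectory at each point equals the value of the vector field at that point and time.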

Flow

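A flow is the family of maps \(\psi_t: \mathbb{R}^d \to \mathbb{R}^d\) generated by the vector field \(u\), satisfying:

$$\frac{d}{dt} \psi_t(x) = u(\psi_t(x), t), \quad \psi_0(x) = x$$

\(\psi_t\) carries an initial point \(x\) along the ODE to its position at time \(t\). A flow is a time-dependent diffeomorphism, used to deform a simple distribution (such as a Gaussian) into the target distribution.

The vector field \(u\) that drives it varies with time.

Under mild regularity conditions on \(u\) (for example, Lipschitz continuity in \(x\)), the ODE has a unique solution, so the flow is well defined.

Simulating an ODE with the Euler method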
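A minimal sketch of Euler simulation: starting from \(x_0\), repeatedly step along the vector field with step size \(h = 1/n\). The test field \(u(x, t) = -x\) is an assumption chosen because its exact solution \(x_0 e^{-t}\) is known; in a flow model the same loop would call the learned vector field instead.

```python
import math

def euler_simulate(u, x0, n_steps):
    """Integrate dx/dt = u(x, t) from t = 0 to t = 1 with n_steps Euler steps."""
    h = 1.0 / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        x = x + h * u(x, t)   # X_{t+h} = X_t + h * u(X_t, t)
        t += h
    return x

def decay_field(x, t):
    """Illustrative vector field with exact solution x(t) = x0 * exp(-t)."""
    return -x

approx = euler_simulate(decay_field, x0=1.0, n_steps=1000)
exact = math.exp(-1.0)
print(approx, exact)  # the gap shrinks roughly linearly as n_steps grows
```

The Euler method is first order, so halving the step size roughly halves the error; higher-order solvers trade more vector field evaluations per step for fewer steps.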

Flow Model

$$p_\text{init}\xrightarrow{\text{ODE}} p_\text{data}$$

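Sample an initial point \(X_0 \sim p_\text{init}\) and evolve it through the ODE to reach \(X_1\); the goal is for \(X_1\) to be distributed as a sample from \(p_\text{data}\).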

Diffusion Models

\(X_t\in \mathbb{R}^d\) is a random variable.

Vector Field \(u_t(x)\) : A function that describes the deterministic drift of the process at time \(t\) and position \(x\).

Diffusion Coefficient \(\sigma_t(x)\) : A function (often a scalar or matrix) that controls the magnitude and direction of random noise added at each step.

$$d X_t = \underbrace{u_t(X_t)dt}_{\text{ODE term}} + \underbrace{\sigma_t dW_t}_{\text{stochastic term}}$$

Brownian Motion

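Brownian motion \(W_t\) is a continuous-time stochastic process with \(W_0 = 0\), continuous sample paths, and independent Gaussian increments: \(W_t - W_s \sim \mathrm N(0, (t-s) I_d)\) for \(0 \le s < t\).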

In the SDE, \(\sigma_t dW_t\) injects Gaussian noise scaled by \(\sigma_t\), making the trajectory random rather than purely deterministic.
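A rough sketch of how Brownian motion can be simulated on a time grid, relying on the standard facts that it starts at zero and that increments over a step of length \(h\) are independent \(\mathrm N(0, h)\) variables. The grid size, the number of paths, and the seed are arbitrary choices for illustration.

```python
import math
import random

def brownian_path(n_steps, rng):
    """Return W at times 0, h, 2h, ..., 1 with h = 1 / n_steps (one dimension)."""
    h = 1.0 / n_steps
    w = [0.0]                                   # W_0 = 0
    for _ in range(n_steps):
        increment = math.sqrt(h) * rng.gauss(0.0, 1.0)   # N(0, h)
        w.append(w[-1] + increment)
    return w

rng = random.Random(0)
endpoints = [brownian_path(100, rng)[-1] for _ in range(5000)]
mean_w1 = sum(endpoints) / len(endpoints)
var_w1 = sum((w - mean_w1) ** 2 for w in endpoints) / len(endpoints)
# W_1 should be approximately N(0, 1): mean near 0 and variance near 1
print(mean_w1, var_w1)
```

The \(\sqrt{h}\) scaling is the important detail: variance, not standard deviation, grows linearly with elapsed time.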

SDE vs. ODE

ODE with an error term:
$$\frac{dX}{dt} = f(X,t) + \varepsilon(t)$$
SDE with an error term:
$$dX_t = \bigl(f(X_t, t) + \varepsilon(t)\bigr) dt + \sigma(X_t, t) dW_t$$
Annotated version:
$$dX_t = \underbrace{f(X_t, t) dt}_{\text{deterministic drift}} + \underbrace{\varepsilon(t) dt}_{\text{error term}} + \underbrace{\sigma(X_t, t) dW_t}_{\text{stochastic diffusion}}$$

An SDE has a unique solution in the sense that the distribution over trajectories is uniquely determined; individual sample paths still differ because of the noise.

Euler

Euler method for the forward SDE

$$X_{n+1} = X_n + f(X_n, t_n) \Delta t + \sigma(t_n) \sqrt{\Delta t} \, \xi_n$$

Removing the random term from a noisy observation (simple denoising)

$$\hat{X}_{n+1}^{\text{denoised}} = X_{n+1} - \sigma(t_n) \sqrt{\Delta t} \, \mathbb{E}[\xi_n | X_{n+1}]$$

Denoising Euler method in reverse time (the probability flow ODE of diffusion models)

$$X_n = X_{n+1} - \left[ f(X_{n+1}, t_{n+1}) - \frac{1}{2} \sigma(t_{n+1})^2 \nabla \log p_{t_{n+1}}(X_{n+1}) \right] \Delta t$$

A practical denoising iteration (approximating the score with a neural network \(s_\theta\))

$$X_{t-\Delta t} = X_t - \left[ f(X_t, t) - \frac{1}{2} \sigma(t)^2 s_\theta(X_t, t) \right] \Delta t$$

Flow Matching

Probability Path

Gaussian Probability Path

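At time 0 the conditional distribution is a Gaussian. As \(t\) increases to 1, its variance shrinks to 0, at which point it becomes a Dirac distribution that can take only a single fixed value.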

$$z\sim p_\text{data}, x\sim p_t(\cdot\mid z)$$
$$p_t(\cdot\mid z)=\mathrm N(\alpha_t z, \beta_t^2 I_d)$$

From this we obtain the boundary conditions:

$$p_0(\cdot\mid z)=p_\text{init}$$

$$p_1(\cdot\mid z)=\delta_z\quad \text{(Dirac delta)}$$

As time increases, \(\alpha_t\) rises from 0 to 1 while \(\beta_t\) falls from 1 to 0. Noise gradually fades away along the path.
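As an illustration, sampling from the conditional Gaussian path only requires scaling the data point and adding scaled noise: \(x = \alpha_t z + \beta_t \epsilon\) with \(\epsilon \sim \mathrm N(0, I_d)\). The linear schedule \(\alpha_t = t\), \(\beta_t = 1 - t\) below is an assumption; any schedule with the same boundary values would fit the description.

```python
import random

def alpha(t):
    """Assumed schedule: alpha_0 = 0, alpha_1 = 1."""
    return t

def beta(t):
    """Assumed schedule: beta_0 = 1, beta_1 = 0."""
    return 1.0 - t

def sample_conditional_path(z, t, rng):
    """Draw x ~ N(alpha_t * z, beta_t^2 I) for a data point z given as a list of floats."""
    return [alpha(t) * z_i + beta(t) * rng.gauss(0.0, 1.0) for z_i in z]

rng = random.Random(0)
z = [1.0, -2.0, 0.5]
for t in (0.0, 0.5, 1.0):
    # t = 0 gives pure noise, t = 1 returns z itself
    print(t, sample_conditional_path(z, t, rng))
```

At \(t = 1\) the noise coefficient is exactly zero, which is the Dirac endpoint of the path.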

Marginal Prob Path

$$p_t(x)=\int p_t(x\mid z)p_\text{data}(z)\text{d} z$$
$$p_0=p_\text{init},\quad p_1=p_\text{data}$$

Vector Field

Conditional Vector Field

$$u_t^\text{target}(x\mid z)\quad 0\le t\le 1, x, z\in \mathbb{R}^d$$

A conditional vector field \( u_t^{\text{target}}(x \mid z) \) is a time-dependent vector field that depends on a conditioning variable \( z \). It appears in flow-based generative modeling, particularly in conditional flow matching.

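Key properties: for every fixed \( z \), simulating the ODE \( \frac{d}{dt} X_t = u_t^{\text{target}}(X_t \mid z) \) from \( X_0 \sim p_\text{init} \) yields \( X_t \sim p_t(\cdot \mid z) \), so the field generates the conditional probability path.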

In conditional flow matching, the model learns to approximate this field by regressing onto a tractable conditional target field (e.g., from a Gaussian conditional probability path).

Cond. Gaussian Vector Field

This is the conditional Gaussian vector field used in diffusion models and score-based generative modeling.

The following expression gives the target velocity field \( u_t^{\text{target}}(x \mid z) \) that transports a sample \( x \) at time \( t \) toward the endpoint \( z \), given the noise schedule parameters \( \alpha_t, \beta_t \).

$$u_t^\text{target}(x\mid z)= \left(\dot \alpha_t-\frac {\dot \beta_t}{\beta_t}\alpha_t\right)z+\frac{\dot \beta_t}{\beta_t}x\quad 0\le t\le 1, x, z\in \mathbb{R}^d$$

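Key components: \( \dot\alpha_t \) and \( \dot\beta_t \) are the time derivatives of the schedules \( \alpha_t \) and \( \beta_t \); the first term depends on the endpoint \( z \), and the second term is linear in the current position \( x \).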
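One way to check the formula numerically is to compare it with the time derivative of the conditional flow \( \psi_t(x_0 \mid z) = \alpha_t z + \beta_t x_0 \), which moves a noise sample \(x_0\) along the Gaussian path. The sketch below does this in one dimension with a central finite difference. The trigonometric schedule is an assumption picked so that the derivatives are not trivial.

```python
import math

def alpha(t):
    return math.sin(0.5 * math.pi * t)      # assumed schedule, alpha_0 = 0, alpha_1 = 1

def beta(t):
    return math.cos(0.5 * math.pi * t)      # assumed schedule, beta_0 = 1, beta_1 = 0

def alpha_dot(t):
    return 0.5 * math.pi * math.cos(0.5 * math.pi * t)

def beta_dot(t):
    return -0.5 * math.pi * math.sin(0.5 * math.pi * t)

def conditional_flow(x0, z, t):
    """psi_t(x0 | z) = alpha_t z + beta_t x0."""
    return alpha(t) * z + beta(t) * x0

def target_field(x, z, t):
    """u_t(x | z) = (alpha_dot - beta_dot / beta * alpha) z + beta_dot / beta * x."""
    ratio = beta_dot(t) / beta(t)
    return (alpha_dot(t) - ratio * alpha(t)) * z + ratio * x

x0, z, t, h = 0.7, -1.3, 0.4, 1e-6
x_t = conditional_flow(x0, z, t)
# Velocity of the trajectory, estimated by a central difference in time
finite_diff = (conditional_flow(x0, z, t + h) - conditional_flow(x0, z, t - h)) / (2.0 * h)
print(finite_diff, target_field(x_t, z, t))
```

For the linear schedule \(\alpha_t = t\), \(\beta_t = 1 - t\) the same formula reduces to \((z - x)/(1 - t)\).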

Marginal Vector Field

The marginal vector field is obtained by taking the expectation over the posterior \( z \sim p_t(z \mid x) \), where \( p_t(z \mid x) = p_t(x \mid z)\, p_\text{data}(z) / p_t(x) \) by Bayes' rule:

$$u_t(x) = \mathbb{E}_{z \sim p_t(z \mid x)}[u_t^{\text{target}}(x \mid z)]=\int u_t^\text{target}(x\mid z)\frac {p_t(x\mid z)p_\text{data}(z)}{p_t(x)}\text{d} z$$

This marginal field defines the probability flow ODE that generates samples from the data distribution.

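The marginal vector field \( u_t(x) \) satisfies:

  1. Initial condition: at \( t=0 \), \( p_0(x) \) is a simple prior distribution (such as a standard Gaussian), and the vector field transports samples starting from this distribution.

  2. Terminal condition: at \( t=1 \), \( p_1(x) \) matches the data distribution \( p_{\text{data}}(x) \); the vector field has smoothly transformed the initial noise into data.

Therefore, solving the ODE \( dx/dt = u_t(x) \) starting from a prior sample produces a sample from the data distribution.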

Continuity Equation

Definition: Divergence

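Divergence (∇·F) measures the net "outflow" of a vector field from a point.

Mathematically: ∇·F = ∂F_x/∂x + ∂F_y/∂y + ∂F_z/∂z (in 3D Cartesian).

$$\mathrm{div}(v_t)(x)=\sum^d_{i=1}\frac \partial {\partial x_i} v_t^i(x) \quad v_t:\mathbb{R}^d\rightarrow \mathbb R^d$$
where \( v_t^i \) is the \( i \)-th component of \( v_t \). It measures the difference between outflow and inflow at a point.

Using the continuity equation, one can show that if the conditional vector field generates the conditional probability path, then the marginal vector field generates the marginal probability path.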
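A sketch of that argument, assuming differentiation and integration can be interchanged: apply the continuity equation to each conditional path, then pull the divergence outside the integral over \(z\) and recognize the posterior-weighted average that defines \(u_t(x)\).

$$\begin{aligned}
\partial_t p_t(x) &= \int \partial_t p_t(x\mid z)\, p_\text{data}(z)\,\text{d} z \\
&= -\int \mathrm{div}\big(p_t(\cdot\mid z)\, u_t^\text{target}(\cdot\mid z)\big)(x)\, p_\text{data}(z)\,\text{d} z \\
&= -\mathrm{div}\Big(p_t(\cdot) \int u_t^\text{target}(\cdot\mid z)\, \frac{p_t(\cdot\mid z)\, p_\text{data}(z)}{p_t(\cdot)}\,\text{d} z\Big)(x) \\
&= -\mathrm{div}(p_t\, u_t)(x)
\end{aligned}$$

The last line is the continuity equation for the marginal path, so \(u_t\) generates \(p_t\).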

Training Flow Matching

Score Functions and Score Matching

Conditional Score

Conditional Score extends score matching to model conditional distributions \( p(\mathbf{x} \mid \mathbf{y}) \).

Definition: The conditional score is the gradient of the log conditional density:

$$\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y})$$

Key uses:

Training: Minimize the conditional score matching objective:

$$\mathbb{E}_{p(\mathbf{x},\mathbf{y})} \left[ \| s_\theta(\mathbf{x}, \mathbf{y}) - \nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y}) \|^2 \right]$$

Connection to denoising: The conditional score can be approximated via a conditional denoising autoencoder:

$$\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y}) \approx \frac{\hat{\mathbf{x}}_\theta(\mathbf{x}, \mathbf{y}) - \mathbf{x}}{\sigma^2}$$
where \(\hat{\mathbf{x}}_\theta\) predicts the clean signal given noisy \(\mathbf{x}\) (corrupted by Gaussian noise of variance \(\sigma^2\)) and condition \(\mathbf{y}\).

Gaussian Score
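For the Gaussian conditional path \( p_t(\cdot \mid z) = \mathrm N(\alpha_t z, \beta_t^2 I_d) \), differentiating the log-density gives the closed form \( \nabla_x \log p_t(x \mid z) = -(x - \alpha_t z)/\beta_t^2 \). The sketch below checks this in one dimension against a finite difference of the log-density. The numeric values are arbitrary test inputs.

```python
import math

def gaussian_log_density(x, mean, std):
    """log N(x; mean, std^2) in one dimension."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def conditional_gaussian_score(x, z, alpha_t, beta_t):
    """Closed form: d/dx log N(x; alpha_t z, beta_t^2) = -(x - alpha_t z) / beta_t^2."""
    return -(x - alpha_t * z) / beta_t ** 2

x, z, alpha_t, beta_t, h = 0.8, -0.5, 0.3, 0.7, 1e-5
# Central finite difference of the log-density, used only as a numerical reference
numeric = (gaussian_log_density(x + h, alpha_t * z, beta_t)
           - gaussian_log_density(x - h, alpha_t * z, beta_t)) / (2.0 * h)
closed_form = conditional_gaussian_score(x, z, alpha_t, beta_t)
print(numeric, closed_form)
```

The score points from \(x\) back toward the mean \(\alpha_t z\), and it grows as \(\beta_t \to 0\), which is why the endpoint \(t = 1\) needs care in practice.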

Marginal Score

Marginal Score refers to the score of the marginal distribution \( p(\mathbf{x}) \), obtained by integrating out latent variables or conditions.

Definition:

$$\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) = \frac {\nabla p_t(x)}{p_t(x)} = \nabla_{\mathbf{x}} \log \int p_t(\mathbf{x} \mid \mathbf{z}) p_\text{data}(\mathbf{z}) \, d\mathbf{z}=\int \nabla\log p_t(x\mid z)\frac {p_t(x\mid z)p_\text{data}(z)}{p_t(x)}\text{d} z$$
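The last equality says that the marginal score is the posterior-weighted average of the conditional scores. A small numerical check, under the assumption of a toy data distribution with two equally likely points and a one-dimensional Gaussian conditional path (so the integral over \(z\) becomes a sum):

```python
import math

# Assumed toy data distribution: pairs of (z, p_data(z))
DATA = [(-1.0, 0.5), (1.0, 0.5)]

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def marginal_density(x, alpha_t, beta_t):
    """p_t(x) = sum_z p_t(x | z) p_data(z) for discrete data."""
    return sum(w * normal_pdf(x, alpha_t * z, beta_t) for z, w in DATA)

def score_from_posterior(x, alpha_t, beta_t):
    """Average the Gaussian conditional scores with weights p_t(x | z) p_data(z) / p_t(x)."""
    p_x = marginal_density(x, alpha_t, beta_t)
    total = 0.0
    for z, w in DATA:
        posterior = w * normal_pdf(x, alpha_t * z, beta_t) / p_x
        conditional_score = -(x - alpha_t * z) / beta_t ** 2
        total += posterior * conditional_score
    return total

def score_from_finite_difference(x, alpha_t, beta_t, h=1e-5):
    """Central difference of log p_t(x), used only as a numerical reference."""
    upper = math.log(marginal_density(x + h, alpha_t, beta_t))
    lower = math.log(marginal_density(x - h, alpha_t, beta_t))
    return (upper - lower) / (2.0 * h)

x, alpha_t, beta_t = 0.3, 0.6, 0.4
print(score_from_posterior(x, alpha_t, beta_t))
print(score_from_finite_difference(x, alpha_t, beta_t))
```

The two printed values should agree closely. The weights are the posterior over \(z\) given \(x\), not the prior \(p_\text{data}(z)\).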

Key properties:

Connection to Conditional Score: By the law of total probability:

$$p(\mathbf{x}) = \mathbb{E}_{p(\mathbf{y})}[p(\mathbf{x} \mid \mathbf{y})]$$
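The marginal score is the expectation of the conditional scores under the posterior \( p(\mathbf{y} \mid \mathbf{x}) \), which in general differs from their expectation under the prior \( p(\mathbf{y}) \):
$$\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \mathbb{E}_{p(\mathbf{y} \mid \mathbf{x})}[\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y})] \neq \mathbb{E}_{p(\mathbf{y})}[\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y})]$$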

Use in diffusion models:

Sampling with SDE

SDE Extension Trick

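The SDE extension trick (Theorem 17) states:

For any diffusion coefficient \(\sigma_t \ge 0\), one can add stochastic dynamics on top of the original ODE dynamics to construct an SDE whose trajectories still follow the same probability path \(p_t\). Specifically, the SDE has the form:

$$X_0\sim p_{\text{init}},\qquad dX_t = u_t^{\text{target}}(X_t)dt + \frac{\sigma_t^2}{2}\nabla\log p_t(X_t)dt + \sigma_t dW_t$$

where \(u_t^{\text{target}}(x)\) is the marginal vector field and \(\nabla\log p_t(x)\) is the marginal score function. The trajectories of this SDE satisfy \(X_t \sim p_t\) (\(0 \le t \le 1\)); in particular, \(X_1 \sim p_{\text{data}}\). The key point is that the marginal distributions \(p_t\) stay unchanged even though randomness has been added.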
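To see the trick at work, the sketch below simulates the SDE with the Euler-Maruyama scheme in a one-dimensional case where everything is available in closed form. The setup is an assumption for illustration: \(p_\text{init} = \mathrm N(0, 1)\), \(p_\text{data} = \mathrm N(\mu, s^2)\), the linear schedule \(\alpha_t = t\), \(\beta_t = 1 - t\), and a constant \(\sigma_t\). Under these choices \(p_t = \mathrm N(t\mu,\, t^2 s^2 + (1-t)^2)\), and both the marginal vector field and the marginal score are linear in \(x\).

```python
import math
import random

MU, S = 2.0, 0.5      # assumed data distribution N(MU, S^2)
SIGMA = 0.5           # assumed constant diffusion coefficient

def marginal_variance(t):
    """Variance of p_t = N(t * MU, t^2 S^2 + (1 - t)^2)."""
    return (t * S) ** 2 + (1.0 - t) ** 2

def marginal_field(x, t):
    """Marginal vector field of the Gaussian path: MU + (v_dot / (2 v)) * (x - t * MU)."""
    half_v_dot = t * S ** 2 - (1.0 - t)
    return MU + (half_v_dot / marginal_variance(t)) * (x - t * MU)

def marginal_score(x, t):
    """Score of p_t: -(x - t * MU) / v_t."""
    return -(x - t * MU) / marginal_variance(t)

def simulate_sde(n_steps, rng):
    """Euler-Maruyama for dX = [u_t + 0.5 * SIGMA^2 * score] dt + SIGMA dW."""
    h = 1.0 / n_steps
    x = rng.gauss(0.0, 1.0)                     # X_0 ~ p_init
    for n in range(n_steps):
        t = n * h
        drift = marginal_field(x, t) + 0.5 * SIGMA ** 2 * marginal_score(x, t)
        x += drift * h + SIGMA * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [simulate_sde(200, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(mean, std)   # should land near MU and S
```

Setting `SIGMA = 0.0` recovers the plain ODE sampler; the endpoint statistics should stay roughly the same for either choice, which is the content of the theorem.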

Guidance

Building Large-Scale Image or Video Generators

Discrete Diffusion Models