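Daily arXiv Paper Briefing

2026/05/21 10:00:00

Daily AI Paper Digest · 254 min read

arXiv · Papers · AI · Vision Encoders

Daily Radar

Daily Overview

Today's digest tracks 51 relevant papers. Entries are grouped by research direction so that different problem domains do not end up mixed in one long list.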

Autoregressive

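12 papers

Today's autoregressive papers center on three themes in visual generative models: long-sequence capability, spatial consistency, and personalized compositional generation. CPC-VAR extends VAR models to continual personalization, and its core value lies in handling forgetting and multi-concept composition under continual learning. PanoWorld models whole-house VR roaming as autoregressive generation of node-based 360° panoramas, a sign that AR architectures are moving into spatial world modeling. LongLive-2.0 approaches long video generation from NVFP4, sequence parallelism, and KV cache quantization, providing low-level infrastructure. Overall, autoregressive visual generation is shifting from single-image generation toward longer, more complex simulation systems with stronger spatial memory.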

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

2026-05-21T04:00:00 · autoregressive, cs.AI, cs.CV, cs.LG, cs.RO · oai:arXiv.org:2605.20390v1

Authors: Yingwei Li, Xin Huang, Yang Liu, Yang Fu, Alex Zihao Zhu, Chen Song, Junwen Yao, Anant Subramanian, Hao Xiang, Weijing Shi, Yuliang Zou, Tom Hoddes, Zhaoqi Leng, Govind Thattai, Dragomir Anguelov, Mingxing Tan

Abstract:

Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

2026-05-21T04:00:00 · autoregressive, cs.CV · oai:arXiv.org:2605.20476v1

Authors: Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Xingzhe He

Abstract:

Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues, which manifest as object permanence issues or improperly rendered transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce Anchored Tree Sampling (ATS): a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from K sequential rollout steps to L+1 tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the static-camera regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable ≥40-minute generation on LTX-2.3 across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.
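
The drop from K sequential steps to L+1 tree levels can be made concrete with a toy scheduler. The sketch below is not the authors' scheduler: `tree_schedule`, `leaf_len`, and `branching` are hypothetical names, and it assumes every anchor-bounded span is split into a fixed number of roughly equal sub-spans until spans are short enough to synthesize directly. It also assumes that L counts the anchor-generating levels and the extra step is leaf synthesis. It only counts levels and calls no video model.

```python
def tree_schedule(horizon, leaf_len, branching=4):
    """Split [0, horizon) into anchor-bounded spans, level by level.

    Returns (levels, leaves): `levels` holds the spans refined at each
    anchor-generating level, `leaves` the final spans to be synthesized.
    Assumption: equal splits with a fixed branching factor (illustrative).
    """
    if branching < 2 or leaf_len < 1:
        raise ValueError("need branching >= 2 and leaf_len >= 1")
    levels, spans = [], [(0, horizon)]
    while any(end - start > leaf_len for start, end in spans):
        levels.append(spans)
        refined = []
        for start, end in spans:
            if end - start <= leaf_len:
                refined.append((start, end))  # already short enough
                continue
            step = -(-(end - start) // branching)  # ceiling division
            cuts = list(range(start, end, step)) + [end]
            refined.extend(zip(cuts[:-1], cuts[1:]))
        spans = refined
    return levels, spans


levels, leaves = tree_schedule(horizon=64, leaf_len=4, branching=4)
sequential_steps = len(leaves)   # K: one left-to-right step per leaf span
tree_steps = len(levels) + 1     # L anchor levels + 1 leaf-synthesis level
```

Under these assumptions, a 64-frame horizon with 4-frame leaves takes 16 dependent steps left to right but only 3 dependent levels in the tree, because spans within one level do not depend on each other.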

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

2026-05-21T04:00:00 · autoregressive, cs.CV · oai:arXiv.org:2605.20600v1

Authors: Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Yunming Ye

Abstract:

Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, i.e., a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.
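
The head-aware budgeting idea can be sketched in a few lines. This is an illustration under stated assumptions, not HeadKV itself: the abstract does not give the head-typing criterion or the allocation rule, so `locality_score` (share of attention mass on the most recent keys) and the proportional split in `allocate_budgets` are hypothetical stand-ins. In practice the scores would be measured on the first generated tokens and then reused for the rest of generation.

```python
import numpy as np


def locality_score(attn, window):
    """Fraction of a head's attention mass on the `window` most recent keys.

    attn: (T, T) causal attention matrix for one head (rows = queries).
    """
    T = attn.shape[0]
    offset = np.arange(T)[:, None] - np.arange(T)[None, :]  # query - key
    local = (offset >= 0) & (offset < window)
    return float((attn * local).sum() / attn.sum())


def allocate_budgets(scores, total_budget, min_budget):
    """Give broad heads more cache slots than locality-biased heads.

    Assumption: budget beyond `min_budget` is split in proportion to
    (1 - locality score); the paper's actual rule may differ.
    """
    scores = np.asarray(scores, dtype=float)
    breadth = 1.0 - scores
    spare = total_budget - min_budget * len(scores)
    if spare < 0:
        raise ValueError("total_budget too small for min_budget per head")
    if breadth.sum() > 0:
        share = breadth / breadth.sum()
    else:
        share = np.full(len(scores), 1.0 / len(scores))
    return min_budget + np.floor(spare * share).astype(int)


T = 16
local_head = np.eye(T)                               # attends only to itself
broad_head = np.tril(np.ones((T, T)))
broad_head /= broad_head.sum(axis=1, keepdims=True)  # uniform causal attention

scores = [locality_score(h, window=4) for h in (local_head, broad_head)]
budgets = allocate_budgets(scores, total_budget=64, min_budget=8)
```

In this toy case the purely local head keeps only the minimum budget and the broad head receives the remainder, which is the qualitative behavior the abstract describes.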

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

2026-05-21T04:00:00 · autoregressive, cs.AI, cs.CV, cs.LG, diffusion · oai:arXiv.org:2605.20624v1

Authors: Taesung Kwon, Jonghyun Park, Hyungjin Chung, Jong Chul Ye

Abstract:

Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

2026-05-21T04:00:00 · autoregressive, cs.CV, diffusion · oai:arXiv.org:2605.20910v1

Authors: Jangho Park, Geon Yeong Park, Gihyun Kwon, Jong Chul Ye

Abstract:

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.
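
The overlapping-window bookkeeping can be illustrated independently of any diffusion model. The sketch below swaps in a plain linear cross-fade for the paper's Tweedie matching, whose exact form the abstract does not give; `blend_windows` and `stride` are hypothetical names, and there is no denoiser or noise re-injection here. It only shows how per-window clean-sample estimates could be merged so that overlapping frames agree.

```python
import numpy as np


def blend_windows(windows, stride):
    """Merge overlapping windows of clean-sample estimates into one sequence.

    windows: list of arrays shaped (W, ...) covering frames
             [i * stride, i * stride + W). Overlaps are cross-faded linearly.
    """
    W = windows[0].shape[0]
    overlap = W - stride
    if not 0 <= overlap < W:
        raise ValueError("stride must satisfy 0 < stride <= W")
    total = stride * (len(windows) - 1) + W
    out = np.zeros((total,) + windows[0].shape[1:])
    norm = np.zeros(total)
    ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]  # strictly inside (0, 1)
    for i, win in enumerate(windows):
        w = np.ones(W)
        if overlap and i > 0:
            w[:overlap] = ramp            # fade in over the shared frames
        if overlap and i < len(windows) - 1:
            w[-overlap:] = ramp[::-1]     # fade out over the shared frames
        s = i * stride
        out[s:s + W] += w.reshape(-1, *([1] * (win.ndim - 1))) * win
        norm[s:s + W] += w
    return out / norm.reshape(-1, *([1] * (out.ndim - 1)))


windows = [np.full((8, 2), 3.0) for _ in range(3)]
merged = blend_windows(windows, stride=5)  # three 8-frame windows, 3-frame overlap
```

Dividing by the accumulated weights keeps the result a convex combination of the window estimates, so windows that already agree are left unchanged.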

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

2026-05-21T04:00:00 · autoregressive, cs.AI, cs.CV · oai:arXiv.org:2605.21028v1

Authors: Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

Abstract:

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.
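
The retrieval half of this idea (picking relevant history instead of always keeping the earliest frames) can be sketched as a top-k lookup. This is a minimal illustration, not the DySink implementation: cosine similarity over per-frame feature vectors is an assumed stand-in for "visually relevant", `select_dynamic_sinks` is a hypothetical name, and the sink anomaly gate is not modeled.

```python
import numpy as np


def select_dynamic_sinks(memory, query, k):
    """Return indices of the k stored frames most similar to the current one.

    memory: (N, D) frame features kept in the bank; query: (D,) current feature.
    Assumption: cosine similarity stands in for "visually relevant".
    """
    mem = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = mem @ q
    top = np.argsort(-sims)[:k]
    return np.sort(top), sims  # sorted so the sinks stay in temporal order


memory = np.array([[1.0, 0.0],
                   [0.9, 0.1],
                   [0.0, 1.0],
                   [-1.0, 0.0]])
query = np.array([0.0, 1.0])
sink_ids, sims = select_dynamic_sinks(memory, query, k=2)
```

A static-sink scheme would always keep frame 0 here, even though it is orthogonal to the current frame; the retrieval picks frames 1 and 2 instead.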

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

2026-05-21T04:00:00 · autoregressive, cs.CV, diffusion · oai:arXiv.org:2605.21072v1

Authors: Siao Tang, Xinyin Ma, Gongfan Fang, Xingyi Yang, Xinchao Wang

Abstract:

Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.
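
Why isolating outlier channels helps can be seen with a small numeric sketch. The code below is an assumption-laden illustration rather than Q-ARVD's method: the detection rule (a channel whose absolute maximum exceeds `ratio` times the median channel maximum) and the symmetric 8-bit quantizer are hypothetical choices, and the frame-weighting objective (S1) is not modeled.

```python
import numpy as np


def quantize(x, scale):
    """Symmetric 8-bit fake quantization with a single scale."""
    return np.clip(np.round(x / scale), -127, 127) * scale


def dual_scale_quantize(W, ratio=5.0):
    """Quantize outlier and normal input channels with separate scales.

    Assumption: a column is an outlier channel when its absolute maximum
    exceeds `ratio` times the median column maximum (illustrative rule).
    """
    col_max = np.abs(W).max(axis=0)
    outlier = col_max > ratio * np.median(col_max)
    Wq = np.empty_like(W)
    for mask in (outlier, ~outlier):
        if mask.any():
            scale = max(np.abs(W[:, mask]).max(), 1e-12) / 127
            Wq[:, mask] = quantize(W[:, mask], scale)
    return Wq, outlier


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
W[:, 3] *= 50.0                                   # one pronounced outlier channel

single = quantize(W, np.abs(W).max() / 127)       # one scale for everything
dual, outlier = dual_scale_quantize(W)

err_single = float(np.mean((W - single) ** 2))
err_dual = float(np.mean((W - dual) ** 2))
```

With a single scale, the outlier channel sets the step size for every weight, so the normal channels are rounded coarsely; giving the outlier its own scale removes that coupling.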

UniT: Unified Geometry Learning with Group Autoregressive Transformer

2026-05-21T04:00:00 · autoregressive, cs.CV · oai:arXiv.org:2605.21131v1

Authors: Haotian Wang, Yusong Huang, Zhaonian Kuang, Hongliang Lu, Xinhu Zheng, Meng Yang, Gang Hua

Abstract:

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
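
The queue-style KV cache mentioned above amounts to a first-in, first-out buffer over autoregressive groups. The sketch below is a generic illustration with hypothetical names (`QueueKVCache`, `max_groups`); the keys and values are placeholder strings rather than tensors, and nothing here reflects UniT's actual cache layout.

```python
from collections import deque


class QueueKVCache:
    """Bounded FIFO cache: one entry per autoregressive group."""

    def __init__(self, max_groups):
        # deque with maxlen drops the oldest entry automatically when full
        self._entries = deque(maxlen=max_groups)

    def append(self, group_id, keys, values):
        self._entries.append((group_id, keys, values))

    def context(self):
        """Group ids currently visible to attention, oldest first."""
        return [gid for gid, _, _ in self._entries]

    def __len__(self):
        return len(self._entries)


cache = QueueKVCache(max_groups=4)
for step in range(10):  # online mode: one single-frame group per step
    cache.append(step, keys=f"K{step}", values=f"V{step}")
visible = cache.context()
```

Memory stays bounded at four groups no matter how long the stream runs. Such eviction is only safe if the model does not rely on the earliest frames, which is the role the abstract assigns to anchor-free relational modeling.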

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

2026-05-21T04:00:00 · autoregressive, cs.CV, cs.GR, cs.MM · oai:arXiv.org:2605.14382v3

Authors: Yuheng Wu, Xiangbo Gao, Tianhao Chen, Xinghao Chen, Qing Yin, Zhengzhong Tu, Dongman Lee

Abstract:

Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.

Spectral Progressive Diffusion for Efficient Image and Video Generation

2026-05-21T04:00:00 · autoregressive, cs.CV, diffusion · oai:arXiv.org:2605.18736v2

Authors: Howard Xiao, Brian Chao, Lior Yariv, Gordon Wetzstein

Abstract:

Diffusion models have been shown to implicitly generate visual content autoregressively in the frequency domain, where low-frequency components are generated earlier in the denoising process while high-frequency details emerge only in later timesteps. This structure offers a natural opportunity for efficient generation, as high-resolution computation on noise-dominated frequencies is largely redundant. We propose Spectral Progressive Diffusion, a general framework that progressively grows resolution along the denoising trajectory of pretrained diffusion models. To this end, we develop a spectral noise expansion mechanism and derive an optimal resolution schedule from the model's power spectrum. Our framework supports training-free acceleration and a novel fine-tuning recipe that further improves efficiency and quality. We demonstrate significant speedups on state-of-the-art pretrained image and video generation models while preserving visual quality.

MeshTailor: Cutting Seams via Generative Mesh Traversal

2026-05-21T04:00:00 · autoregressive, cs.CV, cs.GR · oai:arXiv.org:2603.27309v2

Authors: Xueqi Ma, Xingguang Yan, Congyue Zhang, Hui Huang

Abstract:

We present MeshTailor, the first mesh-native generative framework for synthesizing edge-aligned seams on 3D surfaces. Unlike prior optimization-based or extrinsic learning-based methods, MeshTailor operates directly on the mesh graph, eliminating projection artifacts and fragile snapping heuristics. We introduce ChainingSeams, a hierarchical serialization of the seam graph that orders chains from global structural cuts down to local details in a coarse-to-fine manner, and a dual-stream encoder that fuses topological and geometric context. Leveraging this hierarchical representation and dual-stream vertex embeddings, our MeshTailor Transformer utilizes an autoregressive pointer layer to trace seams vertex-by-vertex within local neighborhoods. Extensive evaluations show that MeshTailor produces more coherent and structurally regular seam layouts compared to recent optimization-based and learning-based baselines.

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

2026-05-21T04:00:00 · autoregressive, cs.CV, cs.RO, diffusion · oai:arXiv.org:2605.14417v2

Authors: Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

Abstract:

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

Diffusion

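38 papers

Today's diffusion section has noticeably more papers, covering generative modeling, long-video systems, model alignment, copyright watermarking, 3D geometry generation, and data synthesis for visual understanding. Four items are worth reading first: Heat Dissipation Flow Matching brings the heat dissipation process into Flow Matching, offering a new physical prior for multi-scale generation; LongLive-2.0 optimizes long-video diffusion models at the level of training and inference infrastructure; Sparse MoE Routing in Visual Diffusion Transformers systematically analyzes MoE routing collapse in visual DiTs; and When Preference Labels Fall Short discusses aligning diffusion models with real data. The overall trend is that diffusion research is moving from simply improving generation quality toward efficiency, safety, systematic training, and cross-task controllability.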

Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model

2026-05-21T04:00:00 · cs.AI, cs.CV, diffusion · oai:arXiv.org:2605.20267v1

Authors: Suya Li, Kaushik Dutta, Debojyoti Pal, Jingqin Luo, Kooresh I. Shoghi

Abstract:

Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based simulation approaches are computationally intensive, limited in anatomical variability, and often fail to capture heterogeneous PET uptake. This study developed a pretrained domain-adapted diffusion (PAD) model for anatomy-conditioned PET synthesis from uniform organ activity maps. PAD adopts a natural-image pretrained text-to-image decoder with an upstream conditioning encoder and a downstream PET-domain adapter. A two-phase training strategy was used, with the first phase learning coarse uptake distributions and the second refining local image details. Uniform organ activity maps were generated from CT-based segmentations by assigning each organ its mean uptake from the paired PET image. Evaluation included quantitative accuracy, noise assessment, radiomic analysis, tumor segmentation performance, and a human observer study. PAD-generated images achieved high quantitative accuracy, with concordance correlation coefficients above 0.92 between organ mean SUVs and assigned activity values. The synthesized images showed noise levels and texture characteristics similar to target PET images and produced comparable tumor segmentation performance. In a two-alternative forced-choice observer study, four readers achieved approximately 50% accuracy, indicating visual indistinguishability between synthesized and target images. PAD also generated realistic PET images from XCAT-derived activity maps, demonstrating compatibility with phantom-based anatomical priors. Overall, PAD provides a diffusion-based framework for generating clinically relevant heterogeneous PET images from uniform organ activity maps derived from clinical segmentations or digital phantoms, supporting data augmentation and downstream imaging studies.
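
The conditioning input described above, a uniform organ activity map, is straightforward to construct from a segmentation and a paired PET image. The following is a minimal sketch of that one step; `uniform_activity_map` is a hypothetical name, the arrays are toy 2D data rather than real volumes, and leaving background voxels at zero is an assumption the abstract does not address.

```python
import numpy as np


def uniform_activity_map(labels, pet, background=0):
    """Replace each organ's voxels with that organ's mean PET uptake.

    labels: integer organ segmentation (same shape as pet); `background`
    voxels are left at zero (an assumption; the abstract does not say).
    """
    activity = np.zeros_like(pet, dtype=float)
    for organ in np.unique(labels):
        if organ == background:
            continue
        mask = labels == organ
        activity[mask] = pet[mask].mean()
    return activity


labels = np.array([[0, 1, 1],
                   [2, 2, 1]])
pet = np.array([[0.1, 2.0, 4.0],
                [1.0, 3.0, 6.0]])
activity = uniform_activity_map(labels, pet)
```

Organ 1 (uptake 2, 4, 6) becomes a flat region of 4 and organ 2 (uptake 1, 3) a flat region of 2; the model's task is then to restore realistic heterogeneity on top of these flat regions.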

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

2026-05-21T04:00:00 · cs.AI, cs.CV, diffusion · oai:arXiv.org:2605.20309v1

Authors: Runyuan Cai, Yiming Wang, Yu Lin, Xiaodong Zeng

Abstract:

Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.
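
The "explicit lexical address and activation boundary" can be pictured as a lookup that touches hidden states only where a registered n-gram matches. The sketch below is illustrative only: `find_trigger` and `apply_concept` are hypothetical names, word-level tokens stand in for real tokenizer output, and additive modulation is an assumption (the abstract says the entries "modulate" hidden states without giving the form).

```python
import numpy as np


def find_trigger(tokens, trigger):
    """Return (start, end) of the first exact n-gram match, or None."""
    n = len(trigger)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == trigger:
            return i, i + n
    return None


def apply_concept(hidden, tokens, trigger, memory):
    """Add memory entries to hidden states only inside the trigger span.

    hidden: (T, D) text-encoder states; memory: (len(trigger), D) entries.
    Assumption: additive modulation; the abstract only says "modulate".
    """
    span = find_trigger(tokens, trigger)
    if span is None:
        return hidden  # no match: identical to the frozen base path
    out = hidden.copy()
    out[span[0]:span[1]] += memory
    return out


tokens = ["a", "photo", "of", "sks", "dog", "running"]
trigger = ["sks", "dog"]
hidden = np.zeros((6, 4))
memory = np.ones((2, 4))
out = apply_concept(hidden, tokens, trigger, memory)
```

Only rows 3 and 4 change. A prompt without the trigger returns the input untouched, mirroring the claim that conditioning outside the lexical support matches the frozen base model.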

EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis

2026-05-21T04:00:00 | cs.AI, cs.CV, diffusion, physics.med-ph | oai:arXiv.org:2605.20470v1

中文标题：EPC-3D-Diff：用于CBCT至CT合成的等变物理一致性条件三维潜在扩散模型

作者：Alzahra Altalib, Chunhui Li, Haytham Al Ewaidat, Khaled Alawneh, Ahmad Qendel, Alessandro Perelli

摘要：

Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT to CT synthesis that introduces a projection domain equivariance loss derived from acquisition physics. Unlike common image domain equivariance, we exploit the fact that an in plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward projecting rotated synthesized CT volumes and matching them to appropriately angle shifted projections of the paired target CT, yielding a physics consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and paired clinical data using patient wise splits, and perform single and mixed domain training, ablations, and comparisons with diffusion and CycleGAN. EPC-3D-Diff generalizes well and achieved substantial improvements, +7.4 dB (phantom) and +1.8 dB (clinical data) in PSNR compared to state of the art methods, alongside improved SSIM and HU accuracy, within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU aware synthesis for downstream radiotherapy workflows.

摘要中文：

arXiv:2605.20470v1公告类型：新摘要：在放射治疗过程中，锥束CT（CBCT）常规用于患者摆位，但其定量可靠性因散射、噪声及重建伪影而降低，从而限制了亨氏单位（HU）的准确性。我们提出了EPC-3D-Diff，这是一种新型的条件三维潜在扩散框架，用于体积CBCT到CT的合成，并引入了一种由采集物理推导出的投影域等变性损失。与常见的图像域等变性不同，我们利用了这样一个事实：体数据的平面内旋转对应于其投影的角度平移。在训练过程中，我们对旋转后的合成CT体数据进行前向投影，并将其与配对目标CT经相应角度平移后的投影相匹配，以此强制这一关系成立，从而得到一个与物理一致的等变性约束，并将其融入扩散目标函数。为高效捕捉完整的三维上下文，条件扩散在由轻量级三维自编码器学得的紧凑潜在空间中进行，该自编码器在对平面内分辨率下采样的同时保留轴向深度，以保证训练稳定。我们在包含重复扫描的配对头部CBCT/CT体模数据集以及按患者划分的配对临床数据上进行了验证，并开展了单域与混合域训练、消融实验，以及与扩散模型和CycleGAN的对比。EPC-3D-Diff泛化良好，与最先进的方法相比，PSNR分别提升了7.4 dB（体模）和1.8 dB（临床数据），同时在组织边界内的SSIM和HU精度也有所改善。总体而言，EPC-3D-Diff提升了鲁棒性和物理一致性，支持面向下游放疗工作流的HU感知合成。
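The rotation-to-angular-shift relationship behind the projection-domain loss can be checked on a toy case. The sketch below is an assumption-laden simplification: it uses a 2D image, an idealized parallel-beam projector restricted to multiples of 90 degrees (so that `np.rot90` is exact), and a plain mean-squared error. The paper works with 3D volumes and cone-beam acquisition physics, which this does not model; `project`, `sinogram`, and `equivariance_loss` are hypothetical names.

```python
import numpy as np

def project(image, k):
    """Toy parallel-beam projection at angle k * 90 degrees (rotate, then sum columns)."""
    return np.rot90(image, k).sum(axis=0)

def sinogram(image):
    """Stack the four projections at 0, 90, 180 and 270 degrees."""
    return np.stack([project(image, k) for k in range(4)])

def equivariance_loss(synth, target, k):
    """Compare projections of the rotated synthesis with angle-shifted target projections."""
    rotated_sino = sinogram(np.rot90(synth, k))
    shifted_target = np.roll(sinogram(target), -k, axis=0)
    return float(np.mean((rotated_sino - shifted_target) ** 2))

rng = np.random.default_rng(0)
ct = rng.random((8, 8))                           # stand-in for a target CT slice
loss_same = equivariance_loss(ct, ct, k=1)        # perfect synthesis
loss_diff = equivariance_loss(ct + 0.1, ct, k=1)  # synthesis with an intensity offset
```

For a perfect synthesis the loss vanishes because rotating the image by one angular step only re-indexes the rows of the sinogram; any intensity error in the synthesis shows up as a nonzero projection mismatch.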

Oracle Supervision Transfers for Hyperparameter Prediction in Model-Based Image Denoising

2026-05-21T04:00:00 | cs.CV, cs.LG, diffusion | oai:arXiv.org:2605.20479v1

中文标题：面向基于模型的图像去噪超参数预测的Oracle监督迁移

作者：Jianmin Liao, Lixin Shen, Yuesheng Xu

摘要：

Hyperparameter prediction is a critical practical bottleneck for model-based image denoisers, ranging from classical TV/TGV variational solvers to modern diffusion-based models such as DiffPIR. While existing learned predictors can achieve near-oracle performance, this approach scales poorly: each new configuration conventionally requires its own oracle-labeled training set, and each label requires a hierarchical grid search evaluated against clean ground truth. We therefore ask whether oracle supervision collected on source configurations can transfer to target configurations with few or no target oracle labels. We propose HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations and predicts heterogeneous hyperparameters for new denoiser-noise configurations. In a cross-paradigm experiment, HyperDn transfers from relatively cheap TV/TGV variational sources to more expensive diffusion-based DiffPIR. With only 2 target oracle labels, it reaches 30.23 dB, within 0.90 dB of the oracle, and outperforms the 64-label per-configuration predictor trained from scratch, using 1/32 as many target labels as that baseline point. Without any target oracle labels, HyperDn also reaches near-oracle PSNR on two unseen mixtures of seen noise types and on transfer from relatively cheap $96\times 96$ source images to $512\times 768$ targets. Together, these results show that expensive oracle supervision for hyperparameter prediction can be transferred from source to new target configurations, reducing the need to rebuild oracle labels for each new denoising configuration.

摘要中文：

arXiv:2605.20479v1公告类型：新摘要：超参数预测是基于模型的图像去噪方法在实践中的一个关键瓶颈，这类方法涵盖从经典的TV/TGV变分求解器到DiffPIR等现代扩散模型。尽管现有的学习型预测器能够取得接近oracle（即对照干净真值搜索得到的最优超参数）的性能，但这一思路的可扩展性较差：每一种新配置通常都需要各自的oracle标注训练集，而每一个标签都需要对照干净真值进行一次分层网格搜索。因此，我们探讨在源配置上收集的oracle监督能否迁移到目标配置，而目标配置只需极少甚至无需oracle标签。我们提出了HyperDn，这是一个单一的、以配置为条件的预测器，它汇集各源配置上的oracle监督，并为新的去噪器-噪声配置预测异构超参数。在一项跨范式实验中，HyperDn从成本相对较低的TV/TGV变分源配置迁移到成本更高的基于扩散的DiffPIR。仅使用2个目标oracle标签时，其性能即达到30.23 dB，与oracle的差距在0.90 dB以内，并且优于从零开始训练、每个配置使用64个标签的预测器，所用目标标签数量仅为该基线的1/32。在没有任何目标oracle标签的情况下，HyperDn在两种由已见噪声类型组成的未见混合噪声上，以及在从成本相对较低的96×96源图像迁移到512×768目标图像时，同样取得了接近oracle的PSNR。综上，这些结果表明，用于超参数预测的昂贵oracle监督可以从源配置迁移到新的目标配置，从而减少为每一种新的去噪配置重新构建oracle标签的需要。
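To make the cost of an oracle label concrete, here is a minimal sketch of a coarse-to-fine grid search against clean ground truth. It assumes a toy 1D denoiser with a single blending weight `lam` and a signal peak of 1.0 for PSNR; `toy_denoise` and `oracle_label` are hypothetical stand-ins and do not reproduce TV/TGV, DiffPIR, or the paper's search procedure. The last line simply restates the label-budget arithmetic from the abstract.

```python
import numpy as np

def psnr(clean, estimate):
    """PSNR in dB, assuming a signal peak of 1.0."""
    mse = np.mean((clean - estimate) ** 2)
    return 10.0 * np.log10(1.0 / mse)

def toy_denoise(noisy, lam):
    """Blend the noisy signal with its 3-tap moving average; lam in [0, 1]."""
    smooth = np.convolve(noisy, np.ones(3) / 3.0, mode="same")
    return (1.0 - lam) * noisy + lam * smooth

def oracle_label(clean, noisy, levels=3, points=5):
    """Coarse-to-fine grid search for the lam that maximizes PSNR against `clean`."""
    lo, hi = 0.0, 1.0
    best_lam, best_psnr = 0.0, -np.inf
    for _ in range(levels):
        grid = np.linspace(lo, hi, points)
        scores = [psnr(clean, toy_denoise(noisy, lam)) for lam in grid]
        i = int(np.argmax(scores))
        if scores[i] > best_psnr:
            best_lam, best_psnr = float(grid[i]), float(scores[i])
        step = (hi - lo) / (points - 1)
        lo, hi = max(0.0, grid[i] - step), min(1.0, grid[i] + step)  # zoom in
    return best_lam, best_psnr

rng = np.random.default_rng(0)
clean = 0.5 + 0.5 * np.sin(np.linspace(0.0, 2.0 * np.pi, 200))
noisy = clean + 0.1 * rng.normal(size=clean.shape)
lam_star, psnr_star = oracle_label(clean, noisy)
baseline = psnr(clean, toy_denoise(noisy, 0.0))  # no denoising at all

# Label budget from the abstract: 2 target labels versus the 64-label baseline.
target_label_fraction = 2 / 64
```

Each call to `oracle_label` runs the denoiser `levels * points` times for a single label, which is why pooling such labels across configurations, as the abstract proposes, is attractive when the denoiser itself is expensive.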

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

2026-05-21T04:00:00 | autoregressive, cs.AI, cs.CV, cs.LG, diffusion | oai:arXiv.org:2605.20624v1

中文标题：利用自回归扩散模型加速视频逆问题求解器

作者：Taesung Kwon, Jonghyun Park, Hyungjin Chung, Jong Chul Ye

摘要：

摘要中文：

Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

2026-05-21T04:00:00 | cs.AI, cs.CV, diffusion | oai:arXiv.org:2605.20640v1

中文标题：帕累托增强的人像生成：面向对齐、真实感与美学的视觉对齐文本监督

作者：Yunlong Wang, Jinjin Shi, Wenbin Gao, Xuran Xu, Runyu Shi, Ying Huang

摘要：

Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model's original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.

摘要中文：

arXiv:2605.20640v1公告类型：新摘要：文本到图像的扩散模型在人像生成任务中往往面临严峻的三难困境：文本与图像的对齐、照片级真实感以及人类感知的美学品质彼此之间存在内在的相互制约。监督微调（SFT）是一种提升图像生成逼真度的有效方法。然而，这往往会导致对训练数据集的过拟合，破坏预训练图像先验，并降低对齐质量或审美效果。为突破这一瓶颈，我们提出了一种面向多模态扩散Transformer（MM-DiT）的特征监督范式。具体而言，我们提出了一种轻量级的跨模态对齐机制，该机制在训练阶段从SigLIP 2中隐式地提取多粒度的视觉对齐文本表示，并对MM-DiT的图像分支施加监督，且在推理时不引入任何额外开销。我们的方法在保持基础模型原有泛化能力的同时，注入与视觉对齐的文本指导，从而避免由监督微调（SFT）带来的性能退化。此外，我们的方法直接从预训练的视觉基础模型中挖掘隐式的多粒度美学信号，以优化人类感知到的审美效果。在MM-DiTs上的大量实验表明，我们的方法能够推动帕累托前沿，并在文本‑图像对齐、真实感以及人类感知的美学质量等方面实现协同提升。

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

2026-05-21T04:00:00 | cs.CV, cs.LG, diffusion | oai:arXiv.org:2605.20659v1

中文标题：RoPeSLR：基于三维RoPE的稀疏低秩注意力机制，用于高效扩散Transformer模型

作者：Yuxi Liu, Zekun Zhang, Yixiang Cai, Renjia Deng, Yutong He, Kun Yuan

摘要：

Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3% average VBench degradation).

摘要中文：

arXiv:2605.20659v1公告类型：新摘要：扩散Transformer（DiT）已彻底革新了高保真视频生成，然而其 $\mathcal{O}(L^2)$ 的注意力复杂度为长序列合成带来了严峻的瓶颈。尽管近期的稀疏-线性注意力混合方法旨在缓解这一问题，但在极端稀疏条件下，其性能会严重下降，原因在于“RoPE困境”：标准的线性注意力无法保留三维旋转位置编码（RoPE）的正交相对位置结构，从而抵消了至关重要的距离感知能力。为此，我们提出了RoPeSLR，一种由3D RoPE引导的稀疏-低秩注意力框架。我们证明，在经实证验证的假设下，DiT的注意力流形可解耦为一个高频语义尖峰集（其稀疏度以 $\mathcal{O}(L^{3/2})$ 为界）与一个极低秩（$\mathcal{O}(d_h \log L)$）的背景连续体。在这一结构先验的指导下，RoPeSLR摒弃了标准的线性注意力，转而采用按注意力头的低秩参数化，并配以可学习的三维绝对位置编码（PE）注入，从而无缝地合成长程相对距离衰减。通过保证次二次的稀疏度和次线性的秩增长，RoPeSLR非常适合扩展到超长视频推理。大量评测验证了这种可扩展的优势：在90%的稀疏度下，RoPeSLR在Wan2.1-1.3B上最多可将FLOPs减少至原来的十分之一，并在HunyuanVideo-13B的超长（10万以上token）序列上实现了2.26倍的端到端推理加速，同时保持近乎无损的生成保真度（VBench平均下降不到1.3%）。
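The "relative-position structure" that the abstract says linear attention fails to preserve is a standard property of rotary embeddings: after rotation, a query-key dot product depends only on the position offset. The sketch below demonstrates this for a single 1D axis. It is not the RoPeSLR method; the `rope` helper, the base of 10000, and the channel pairing are common conventions assumed here, and 3D variants are typically built by applying the same rotation per axis on separate channel groups.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of a 1D vector by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per channel pair
    angles = pos * freqs
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

def score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n."""
    return float(rope(q, m) @ rope(k, n))

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)
near = score(q, k, 5, 2)     # offset of 3 near the start of the sequence
far = score(q, k, 13, 10)    # the same offset of 3, shifted by 8 positions
```

Both scores agree up to floating-point error because each channel pair contributes a term that depends only on the angle difference, which is proportional to the offset between the two positions.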

Rethinking Cross-Layer Information Routing in Diffusion Transformers

2026-05-21T04:00:00 | cs.AI, cs.CV, diffusion | oai:arXiv.org:2605.20708v1

中文标题：重新思考扩散Transformer中的跨层信息路由

作者：Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang

摘要：

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet $256\times256$, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with $8.75\times$ fewer training iterations. Stacked on top of REPA, it yields a $2\times$ training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

摘要中文：

arXiv:2605.20708v1公告类型：新摘要：扩散Transformer（DiT）已成为现代视觉生成事实上的骨干架构，其设计的几乎每一个主要维度（标记化、注意力、条件注入、训练目标以及潜在自编码器）都得到了广泛的重新审视。然而，决定信息如何在各层间累积的残差流却直接沿袭自原始的Transformer。本文沿深度与去噪时间步两个维度，对DiT中的跨层信息流进行了系统的实证分析，并指出传统残差相加的三个具体症状：前向幅值单调膨胀、反向梯度急剧衰减以及显著的块间冗余。基于这一诊断，我们提出了扩散自适应路由（DAR），这是一种即插即用的残差替代方案，能够对子层输出的历史执行可学习的、随时间步自适应的、非增量式的聚合。此外，DAR与REPA等多种现代Transformer增强方法兼容。在ImageNet 256×256上，DAR将SiT-XL/2的FID改善了2.11（7.56对比9.67），并以少8.75倍的训练迭代次数达到基线收敛时的质量。叠加在REPA之上时，它在训练早期带来2倍的训练加速，这表明跨层信息路由是扩散建模中一个尚未被充分探索的设计维度，并且与现有的表征对齐目标正交。除预训练之外，DAR还可应用于大规模文本到图像（T2I）模型的微调阶段，并在分布匹配蒸馏（Distribution Matching Distillation）过程中保留高频细节。
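One way to picture "learnable, timestep-adaptive, non-incremental aggregation over the history of sublayer outputs" is a softmax-weighted mix of all stored outputs, with weights computed from a timestep embedding. This is only a guess at the general shape of such a mechanism: the abstract does not give the DAR formula, and `route`, the matrix `W`, and the softmax choice are all hypothetical. The plain residual sum is included for contrast, and the final line checks the FID arithmetic quoted in the abstract.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route(history, t_emb, W):
    """Aggregate all earlier sublayer outputs with timestep-dependent weights."""
    H = np.stack(history)                         # (num_outputs, tokens, dim)
    weights = softmax(W[: len(history)] @ t_emb)  # one weight per stored output
    return np.tensordot(weights, H, axes=1), weights

rng = np.random.default_rng(0)
history = [rng.normal(size=(16, 8)) for _ in range(3)]  # three sublayer outputs
t_emb = rng.normal(size=4)                              # stand-in timestep embedding
W = rng.normal(size=(6, 4))                             # hypothetical routing weights

routed, weights = route(history, t_emb, W)
residual_sum = np.sum(history, axis=0)  # what plain residual addition accumulates

# FID figures quoted in the abstract: 9.67 (baseline) versus 7.56 (with DAR).
fid_gain = round(9.67 - 7.56, 2)
```

Because the weights form a convex combination, the routed state cannot grow without bound as layers are added, whereas the residual sum keeps accumulating; that contrast is one plausible reading of the "magnitude inflation" symptom named in the abstract.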

Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches

2026-05-21T04:00:00 | cs.CV, diffusion | oai:arXiv.org:2605.20733v1

中文标题：Sketch2MinSurf：基于视觉-语言引导、从手绘草图生成可编辑极小曲面

作者：Wenda Wang, Anqi Liu, Junqi Yang, Lei He, Luying Wang, Jiachen Lu, Weixin Huang

摘要：

Converting hand-drawn sketches into structured 3D geometries remains challenging due to the difficulty of representing non-Euclidean surfaces and maintaining topological consistency. Existing generative models such as GANs, NeRFs, and diffusion architectures often fail to produce editable manifolds directly usable in downstream design workflows. We present Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free from non-manifold artifacts. A public art installation at a university showcases the method's potential for human-intent-driven 3D form generation. The dataset and code are available at https://anonymous.4open.science/r/Sketch2MinSurf/.

摘要中文：

arXiv:2605.20733v1公告类型：新摘要：由于难以表示非欧几里得曲面并保持拓扑一致性，将手绘草图转换为结构化的三维几何仍是一项挑战。现有的生成模型，如GAN、NeRF以及扩散架构，往往无法生成可直接用于下游设计工作流的可编辑流形。我们提出了Sketch2MinSurf，这是一种融合视觉-语言与几何优化的混合框架，它将视觉-语言引导与极小曲面理论相结合，从手绘草图生成光滑且可编辑的三维曲面。我们方法的核心是一种空间-拓扑编码，它将几何表示为由节点坐标与实/虚边骨架构成的元组，从而在生成过程中实现稳定的拓扑控制。我们进一步提出了Sketch2MinSurf结构损失（S2MS-Loss），这是一种由奖励调制的目标函数，可同时约束几何重建与拓扑一致性。在包含100张草图的测试集上，Sketch2MinSurf的拓扑相似度得分为0.844，优于现有的草图到形状基线方法。生成的流形可直接编辑，且不存在非流形瑕疵。一所大学中的公共艺术装置展示了该方法在以人类意图驱动的三维形态生成方面的潜力。数据集和代码可在 https://anonymous.4open.science/r/Sketch2MinSurf/ 获取。

Diffuse to Detect: Bi-Level Sample Rebalancing with Pseudo-Label Diffusion for Point-Supervised Infrared Small-Target Detection

2026-05-21T04:00:00 | cs.CV, diffusion | oai:arXiv.org:2605.20766v1

中文标题：扩散检测：基于伪标签扩散的双层样本重平衡方法用于点监督红外小目标检测

作者：Zhu Liu, Yuanhang Yao, Ping Qian, Zihang Chen, Risheng Liu

摘要：

Point supervision has become a scalable solution to address dense annotation for infrared small target detection, but its performance is limited by two coupled bottlenecks: unstable pseudo-label evolution in cluttered, low-contrast infrared imagery and severe sample-distribution imbalance. In this paper, we present a more adaptive and stable framework to address these issues. Leveraging the intrinsic consistency between thermal radiation patterns and heat diffusion, we propose a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks. To further enhance supervision and alleviate sample imbalance, we develop a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters. A meta-classifier dynamically predicts sample-wise loss weights, while a differentiable diffusion module refines pseudo-labels with detection feedback, enabling adaptive interaction between training and hyperparameter optimization. Extensive experiments across multiple datasets demonstrate five-fold annotation acceleration, superior detection accuracy, and comparable performance with 30% of the training data, validating the efficiency and practicality of our approach. Our code is available at https://github.com/yuanhang-yao/diffuse-to-detect.

摘要中文：

arXiv:2605.20766v1公告类型：新摘要：点监督已成为应对红外小目标检测中密集标注问题的一种可扩展解决方案，但其性能受限于两个相互耦合的瓶颈：在杂乱、低对比度的红外图像中伪标签演化不稳定，以及样本分布严重失衡。本文提出了一种更具自适应性和稳定性的框架，以解决上述问题。利用热辐射模式与热扩散之间的内在一致性，我们提出了一种基于物理机制的标注策略，可将单点标签扩展为可靠的伪掩码。为进一步强化监督并缓解样本不平衡问题，我们构建了一个双层双重更新框架，该框架可联合优化检测器权重、样本权重以及扩散参数。一个元分类器动态地为每个样本预测损失权重，而一个可微分的扩散模块则借助检测反馈对伪标签进行精化，从而实现训练过程与超参数优化之间的自适应交互。在多个数据集上的大量实验表明，该方法可实现五倍的标注效率提升、更优的检测精度，并且仅使用30%的训练数据即可达到与全量数据相当的性能，从而验证了所提方法的高效性和实用性。我们的代码已在 https://github.com/yuanhang-yao/diffuse-to-detect 上公开。
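The idea of expanding a single-point label into a pseudo-mask through heat diffusion can be sketched with an explicit finite-difference scheme. The version below assumes a plain isotropic heat equation with fixed, hand-picked parameters and zero-flux borders; the paper describes learnable diffusion parameters refined by detection feedback, which is not modeled here. `diffuse_point_label`, `alpha`, and `threshold` are hypothetical names.

```python
import numpy as np

def diffuse_point_label(shape, point, steps=20, alpha=0.15, threshold=0.01):
    """Spread a unit heat source from `point` and threshold it into a pseudo-mask."""
    heat = np.zeros(shape)
    heat[point] = 1.0
    for _ in range(steps):
        padded = np.pad(heat, 1, mode="edge")  # zero-flux (insulated) borders
        laplacian = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                     + padded[1:-1, :-2] + padded[1:-1, 2:] - 4.0 * heat)
        heat = heat + alpha * laplacian        # explicit step, stable for alpha <= 0.25
    mask = heat >= threshold * heat.max()
    return heat, mask

heat, mask = diffuse_point_label((32, 32), (16, 16))
```

More steps or a larger `alpha` give a wider mask, so these two knobs play the role that the learned diffusion parameters would play in a trainable version.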

Findings of the Counter Turing Test: AI-Generated Image Detection

2026-05-21T04:00:00 | cs.CV, diffusion | oai:arXiv.org:2605.20787v1

中文标题：反图灵测试的研究结果：人工智能生成图像检测

作者：Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

摘要：

The rapid advancements in generative AI technologies, such as Stable Diffusion, DALL-E, and Midjourney, have significantly transformed the creation of synthetic visual content. While these models enable innovation across industries, they also pose serious challenges, including misinformation, disinformation, and biased content generation. The increasing realism of AI-generated images makes their detection a pressing concern for researchers, policymakers, and industry stakeholders. In this paper, we present the findings of the Defactify 4.0 workshop, which introduced the Counter Turing Test (CT2) for AI-Generated Image Detection. The competition consisted of two key tasks: (1) binary classification of images as either AI-generated or real and (2) identification of the specific generative model responsible for an AI-generated image. To facilitate this, we developed the MS COCOAI dataset, consisting of 50,000 synthetic images from multiple generative models alongside real-world images from the MS COCO dataset. Participants employed diverse detection strategies, including convolutional neural networks (CNNs), Vision Transformers (ViTs), frequency-based analysis, contrastive learning, and multimodal techniques. The results demonstrated that while AI-generated images can be detected with high accuracy (F1-score > 0.83), identifying the exact model used remains significantly more challenging (highest F1-score: 0.4986). These findings highlight the need for improved model fingerprinting, adversarial robustness, and real-time detection mechanisms.

摘要中文：

arXiv:2605.20787v1公告类型：新摘要：生成式人工智能技术的迅猛发展，如Stable Diffusion、DALL-E和Midjourney等，已深刻改变了合成视觉内容的创作方式。尽管这些模型推动了各行业的创新，但也带来了严峻挑战，包括虚假信息、错误信息以及有偏见的内容生成等问题。人工智能生成图像的逼真度不断提升，使其检测成为研究人员、政策制定者及行业相关方亟待解决的重要问题。本文报告了Defactify 4.0研讨会的研究成果，该研讨会提出了用于人工智能生成图像检测的反图灵测试（CT2）。该竞赛包含两项核心任务：（1）对图像进行二分类，判定其为人工智能生成或真实；（2）识别生成某张人工智能生成图像的具体生成模型。为此，我们构建了MS COCOAI数据集，该数据集由来自多个生成模型的5万张合成图像以及MS COCO数据集中的真实图像组成。参与者采用了多种检测策略，包括卷积神经网络（CNN）、视觉Transformer（ViT）、基于频域的分析、对比学习以及多模态方法。结果表明，尽管人工智能生成图像可被以较高的准确率检测出来（F1分数>0.83），但识别其所使用的具体模型仍面临较大挑战（最高F1分数为0.4986）。这些研究结果凸显了改进模型指纹识别、对抗鲁棒性以及实时检测机制的必要性。

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

2026-05-21T04:00:00 | cs.CV, diffusion | oai:arXiv.org:2605.20808v1

中文标题：用于超高分辨率图像合成的空间格拉姆对齐

作者：Jinjin Zhang, Xiefan Guo, Di Huang

摘要：

Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.

摘要中文：

arXiv:2605.20808v1公告类型：新摘要：现代超高分辨率图像合成在很大程度上依赖于大规模预训练潜扩散模型（LDM）强大的生成能力。近期的表征对齐方法将基础模型（如SAM或DINO）中的视觉先验蒸馏到生成式潜在特征中，已被证明行之有效；但把这类方法扩展到极高分辨率下的预训练LDM时，会暴露出可学习性与保真度之间的关键冲突。具体而言，强制进行直接的逐块特征蒸馏必然会扰动预训练的潜在流形，最终导致生成质量下降。为解决这一瓶颈，我们提出了空间格拉姆对齐（SGA），这一新框架在保留LDM原生生成能力的同时，显式地利用视觉基础模型的表征先验。SGA不再采用限制性较强的直接对齐，而是将生成特征的内部自相似性与基础模型先验的自相似性对齐，从而施加一种非侵入式的空间约束。这种空间约束有效地建立了宏观的结构一致性，而原生的生成目标则保留了原始LDM固有的微观像素级保真度。值得注意的是，这一通用策略可无缝应用于预训练LDM中的中间扩散特征和VAE潜变量。大量实验表明，SGA在超高分辨率文本到图像合成上取得了最先进的性能，有效兼顾了全局结构完整性与细粒度视觉细节。代码可在 https://github.com/zhang0jhon/SGA 获取。
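Aligning self-similarities instead of raw features can be sketched as a loss between two position-by-position Gram matrices. The code below is an illustrative assumption, not the SGA implementation: it uses cosine similarity and a mean-squared error, and `self_similarity` and `gram_alignment_loss` are hypothetical names. It shows why such a constraint is less invasive than direct patch-wise matching: the Gram matrix ignores any orthogonal change of the channel basis, and the two feature sets need not share a channel width.

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity (Gram) matrix over spatial positions; features: (N, C)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ normed.T

def gram_alignment_loss(gen_feats, prior_feats):
    """Mean squared difference between the two (N, N) self-similarity matrices."""
    diff = self_similarity(gen_feats) - self_similarity(prior_feats)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
prior = rng.normal(size=(16, 32))              # stand-in foundation-model features
Q, _ = np.linalg.qr(rng.normal(size=(32, 32))) # random orthogonal channel rotation
gen = prior @ Q                                # same spatial structure, different basis

loss_rotated = gram_alignment_loss(gen, prior)   # structure matches
direct_mse = float(np.mean((gen - prior) ** 2))  # patch-wise matching still penalizes
other = rng.normal(size=(16, 8))                 # unrelated features, narrower channels
loss_other = gram_alignment_loss(other, prior)
```

A generator whose features already encode the same spatial relationships incurs no Gram penalty even when its feature values differ, which is one way to read the abstract's claim that the constraint leaves the pre-trained latent manifold undisturbed.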

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

2026-05-21T04:00:00 | autoregressive, cs.CV, diffusion | oai:arXiv.org:2605.20910v1

中文标题：FlowLong：基于流形约束的特威迪匹配实现推理时长视频生成

作者：Jangho Park, Geon Yeong Park, Gihyun Kwon, Jong Chul Ye

摘要：

摘要中文：

DrawMotion: Generating 3D Human Motions by Freehand Drawing

2026-05-21T04:00:00 | cs.CV, diffusion | oai:arXiv.org:2605.20955v1

中文标题：DrawMotion：通过手绘生成三维人体运动

作者：Tao Wang, Lei Jin, Zhihua Wu, Qiaozhi He, Jiaming Chu, Yu Cheng, Junliang Xing, Jian Zhao, Shuicheng Yan, Li Wang

摘要：

Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.

摘要中文：

arXiv:2605.20955v1公告类型：新摘要：文本到动作生成将文本描述转化为人体运动，其面临的挑战在于，用户往往难以仅凭文字准确表达自己想要的动作。为解决这一问题，本文提出了DrawMotion，一个面向多条件场景设计的高效扩散框架。DrawMotion同时基于传统的文本条件和一种新的手绘条件来生成运动，二者分别为生成的运动提供语义控制和空间控制。具体而言，我们从三个方面处理细粒度运动生成任务：1）手绘条件。为在无需繁琐文本输入的情况下准确捕捉用户想要的动作，我们开发了一种算法，可针对不同的数据集格式自动生成手绘火柴人草图；2）多条件融合。我们提出了一个多条件模块（MCM），将其集成到扩散过程中，使模型能够利用所有可能的条件组合，同时相较于传统方法降低计算复杂度；3）免训练引导。值得注意的是，DrawMotion中的MCM保证其中间特征位于连续空间中，从而允许分类器引导的梯度更新这些特征，使生成的运动与用户意图对齐，同时保持保真度。定量实验与用户研究表明，在生成符合用户想象的运动时，手绘方式可将用户耗时减少约46.7%。代码、演示以及相关数据已在 https://github.com/InvertedForest/DrawMotion 上公开。

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

2026-05-21T04:00:00 | cs.CV, diffusion | oai:arXiv.org:2605.20961v1

中文标题：保持、揭示、扩展：基于区域感知条件的忠实4D视频编辑

作者：Zhangchi Hu, Wenzhang Sun, Xiangchen Yin, Jiahui Yuan, Chunfeng Wang, Hao Li, Kun Zhan, Xiaoyan Sun

摘要：

Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: https://ricepastem.github.io/PREX-Open

摘要中文：

arXiv:2605.20961v1公告类型：新摘要：现有的4D驱动视频扩散模型主要以生成合理可信的内容为目标，而忠实的4D编辑则要求在保留源视频中已观测区域的同时，合成因遮挡解除而显露的内容或视野之外的内容。我们指出了“证据-角色错配”（Evidence-Role Mismatch）问题：有源数据支撑的可靠证据、不可靠的渲染线索以及缺乏支撑的区域被纠缠在同一个条件信号中，从而导致保留漂移、重影和不稳定的外推。我们提出了PREX（Preserve, Reveal, Expand），这是一个区域感知框架，它根据观测支持和场景范围，将目标时空体分解为“保留”（Preserve）、“揭示”（Reveal）和“扩展”（Expand）三种角色。PREX构建有观测支撑、带校准置信度的外观线索，并通过一个区域感知适配器将其注入冻结的视频扩散主干；该适配器用代理任务训练，无需成对的编辑视频。我们进一步提出了PREBench，这是一个诊断性基准，包含精心筛选的编辑样例、区域-角色掩码以及与人类判断对齐的指标，用以补充全局视频质量和4D控制方面的评估。实验表明，PREX在保持较强视觉质量和4D编辑控制能力的同时，减少了区域结构化的失败情形。项目页面：https://ricepastem.github.io/PREX-Open
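The three-way role decomposition can be pictured as a per-pixel labeling driven by two boolean maps. The mapping used below is an assumption made for illustration, since the abstract only says that roles follow "observation support and scene extent": pixels backed by a source observation are Preserve, pixels inside the source scene extent but without observation are Reveal, and everything else is Expand. The function `assign_roles` and the integer codes are hypothetical.

```python
import numpy as np

PRESERVE, REVEAL, EXPAND = 0, 1, 2

def assign_roles(observed, in_scene_extent):
    """Label each target pixel by its evidence role (assumed mapping, see lead-in)."""
    roles = np.full(observed.shape, EXPAND, dtype=np.int8)
    roles[in_scene_extent & ~observed] = REVEAL  # inside the scene, never observed
    roles[observed] = PRESERVE                   # backed by a source observation
    return roles

# Toy 2 x 4 target frame.
observed = np.array([[1, 1, 0, 0],
                     [1, 0, 0, 0]], dtype=bool)
in_scene_extent = np.array([[1, 1, 1, 0],
                            [1, 1, 0, 0]], dtype=bool)
roles = assign_roles(observed, in_scene_extent)
```

Keeping the roles as separate masks, rather than folding them into one conditioning image, is what lets a downstream adapter treat reliable and unsupported regions differently.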

Dynamic Video Generation: Shaping Video Generation Across Time and Space

2026-05-21T04:00:00 | cs.CV, diffusion | oai:arXiv.org:2605.21042v1

中文标题：动态视频生成：跨时空塑造视频生成

作者：Shikang Zheng, Jingkai Huang, Jiacheng Liu, Guantao Chen, Lixuan, Yuqi Lin, Peiliang Cai, Linfeng Zhang

摘要：

Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach by reducing latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to 7 times speedup on HunyuanVideo and HunyuanVideo-1.5, and 18 times when combined with distillation, demonstrating its potential as a key component in today's large-scale efficient video generation systems. Our code is in supplementary material and will be released on Github.

摘要中文：

arXiv:2605.21042v1公告类型：新摘要：扩散模型在视频生成任务中取得了令人瞩目的性能，但由于在每个时间步都要处理大量标记，其迭代去噪过程仍然计算开销巨大。近来，渐进式分辨率采样通过在早期阶段降低潜在分辨率，已成为一种颇具前景的加速方法。然而，将这一思路扩展至视频生成仍面临诸多挑战：额外的时间维度使得不同视频呈现出多样的时空需求，而仅对单一维度进行压缩往往难以带来显著的加速效果，甚至会导致质量下降。为此，我们提出了DVG——一种动态视频生成框架，它能够跨时间和空间进行联合计算资源分配，在无需人工调优或重新训练的情况下自动选择内容感知型加速策略。DVG在各类模型和任务上均实现了近乎无损的加速，在HunyuanVideo和HunyuanVideo‑1.5上最高可获得7倍的加速，配合知识蒸馏更是可达18倍，充分展现了其作为当前大规模高效视频生成系统核心组件的潜力。我们的代码已收录于补充材料，并将于GitHub上公开发布。

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

2026-05-21T04:00:00 | autoregressive, cs.CV, diffusion | oai:arXiv.org:2605.21072v1

中文标题：Q-ARVD：量化自回归视频扩散模型

作者：Siao Tang, Xinyin Ma, Gongfan Fang, Xingyi Yang, Xinchao Wang

摘要：

摘要中文：

Semantic Granularity Navigation in Image Editing

2026-05-21T04:00:00 | cs.CV, diffusion | oai:arXiv.org:2605.21190v1

中文标题：图像编辑中的语义粒度导航

作者：Liangsi Lu, Minzhe Guo, Xuhang Chen, Yang Shi

摘要：

Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We trace a primary cause of this limitation to the implicit coupling of edit progress with model scale in existing paradigms. Under this coupling, stronger edits typically require visiting noisier states, which spends computation on destabilizing layout before the semantic change is well localized. We introduce NaviEdit, a training-free inference-time controller that decouples edit progress from model scale traversal through a strict self-consistency contract. NaviEdit operates at the rollout level and leaves the underlying pretrained model unchanged. It treats scale as a control input and reallocates a fixed step budget toward semantically responsive intermediate scales instead of destructive high-noise regimes. Experiments show positive average gains across compatible editors and flow backbones, supporting decoupling as a portable inference-time control principle.

摘要中文：

arXiv:2605.21190v1公告类型：新摘要：尽管扩散模型和流模型具备强大的生成能力，真实图像编辑仍然受制于语义可编辑性与结构保真度之间长期存在的权衡。我们将这一局限的一个主要原因追溯到现有范式中编辑进度与模型尺度（scale）之间的隐式耦合。在这种耦合下，更强的编辑通常需要访问噪声更大的状态，这使得计算在语义变化被充分局部化之前就被用于破坏布局的稳定性。我们提出了NaviEdit，这是一种免训练的推理时控制器，它通过严格的自洽性约定，将编辑进度与对模型尺度的遍历解耦。NaviEdit在采样展开（rollout）层面运行，并保持底层预训练模型不变。它将尺度视为控制输入，把固定的步数预算重新分配给语义响应较强的中间尺度，而非破坏性的高噪声区间。实验表明，该方法在各类兼容的编辑器和流模型主干上均取得了正的平均增益，支持将解耦作为一种可移植的推理时控制原则。

PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection

2026-05-21T04:00:00 | cs.CV, diffusion | oai:arXiv.org:2605.21207v1

中文标题：PGC：面向可泛化人工智能生成图像检测的峰值引导校准

作者：Xiaoyu Zhou, Jianwei Fei, Peipeng Yu, Jingchang Xie, Chong Cheng, Zhihua Xia

摘要：

The rapid evolution of generative AI, from GANs to modern diffusion models, has resulted in increasingly subtle discriminative clues. These fine-grained signals are often overshadowed by dominant, high-fidelity image content (e.g., the main subject), limiting the reliability of existing detectors that predominantly rely on global representations. To address this challenge, we propose the Peak-Guided Calibration (PGC) framework. PGC introduces a novel strategy that aggregates salient features via a peak-focusing mechanism. Specifically, by employing a peak-sensitive aggregation that accentuates the most discriminative local clues, PGC leverages these critical signals to calibrate the global decision. This approach recovers subtle patterns that would otherwise be submerged in the global context. Furthermore, to better simulate real-world threats, we introduce the CommGen15 dataset, a challenging benchmark comprising samples from 15 commercial models. Extensive experiments demonstrate that PGC achieves state-of-the-art performance. Specifically, it improves mean accuracy by +12.3% on our CommGen15 dataset, and sets new records on standard benchmarks, including GenImage (+2.1%), AIGI (+3.5%), and UniversalFakeDetect (+1.7%). Code is available at https://github.com/xiaoyu6868/PGC.

摘要中文：

arXiv:2605.21207v1 公告类型：新提交。生成式人工智能从GAN到现代扩散模型的快速演进，使得判别线索愈发细微。这些细粒度信号往往被占主导地位的高保真图像内容（如主体）所掩盖，从而限制了主要依赖全局表征的现有检测器的可靠性。为应对这一挑战，我们提出了峰值引导校准（PGC）框架。PGC提出了一种新颖的策略，通过峰值聚焦机制来聚合显著特征。具体而言，PGC采用对峰值敏感的聚合方式来凸显最具判别性的局部线索，并利用这些关键信号对全局决策进行校准。该方法能够恢复那些原本会被全局上下文淹没的细微模式。此外，为了更好地模拟现实世界的威胁，我们提出了CommGen15数据集，这是一个由15个商用模型的样本构成的高难度基准。大量实验表明，PGC取得了当前最优的性能。具体而言，它在我们的CommGen15数据集上将平均准确率提升了12.3%，并在多项标准基准上刷新了纪录，包括GenImage（+2.1%）、AIGI（+3.5%）和UniversalFakeDetect（+1.7%）。代码可在 https://github.com/xiaoyu6868/PGC 获取。
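
Illustrative sketch for the PGC entry above: the abstract says a peak-sensitive aggregation of local clues calibrates the global decision, but gives no formula. The code below is only one plausible reading, using log-mean-exp pooling as the peak-sensitive operator. The function names, the `temperature` and `weight` parameters, and the additive calibration rule are all assumptions made for illustration and are not taken from the paper or its repository.

```python
import math


def log_mean_exp_pool(patch_scores, temperature=0.1):
    """Smooth, peak-sensitive pooling (assumed operator, not the paper's).

    Tends to max() as temperature -> 0 and to the plain mean as it grows large.
    """
    peak = max(patch_scores)
    # Subtract the peak before exponentiating for numerical stability.
    total = sum(math.exp((s - peak) / temperature) for s in patch_scores)
    return peak + temperature * math.log(total / len(patch_scores))


def calibrated_logit(global_logit, patch_scores, weight=1.0, temperature=0.1):
    """Hypothetical calibration: shift the global logit by the pooled local evidence."""
    return global_logit + weight * log_mean_exp_pool(patch_scores, temperature)


# Toy example: 15 uninformative patches and one strongly suspicious patch.
scores = [0.0] * 15 + [4.0]
mean_score = sum(scores) / len(scores)   # 0.25: the peak is washed out by averaging
peak_score = log_mean_exp_pool(scores)   # roughly 3.7: the peak survives pooling
```

With mean pooling the single suspicious patch barely moves the decision, while the peak-sensitive pool keeps most of its magnitude, which is the effect the abstract attributes to peak-focused aggregation.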

SR-Ground: Image Quality Grounding for Super-Resolved Content

2026-05-21T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.21244v1

中文标题：SR-Ground：面向超分辨率内容的图像质量定位

作者：Artem Borisov, Evgeney Bogatyrev, Khaled Abud, Dmitriy Vatolin

摘要：

Super-Resolution (SR) has advanced rapidly in recent years, with diffusion-based models achieving unprecedented fidelity at the cost of introducing new types of visual artifacts. While existing Image Quality Assessment (IQA) methods provide holistic quality scores, they lack interpretability and fail to distinguish between different artifact types arising from modern SR approaches. To address this gap, we introduce SR-Ground, a large-scale dataset specifically designed for fine-grained artifact segmentation in super-resolved images. The dataset comprises images processed by a diverse set of state-of-the-art SR models, with pixel-level annotations for multiple artifact categories. We conduct a large-scale crowdsourcing study involving 1,062 participants to validate and refine automatically generated segmentations, resulting in a high-quality dataset of 63,000 images spanning 6 distinct artifact types. We demonstrate that training IQA models with grounding capabilities on SR-Ground significantly improves performance on downstream tasks. Furthermore, we introduce a fine-tuning pipeline that leverages our grounding model to reduce perceptible artifacts in SR outputs, showcasing the practical utility of our dataset.

摘要中文：

arXiv:2605.21244v1 公告类型：新提交。近年来，超分辨率（SR）技术发展迅速，基于扩散模型的方法在实现前所未有的保真度的同时，也引入了新型的视觉伪影。尽管现有的图像质量评价（IQA）方法能够给出整体的质量评分，但它们缺乏可解释性，且无法区分现代超分辨率方法所产生的不同伪影类型。为弥补这一不足，我们提出了SR-Ground，这是一个专为超分辨率图像中的细粒度伪影分割而构建的大规模数据集。该数据集由一系列最先进的超分辨率模型处理得到的图像构成，并针对多种伪影类别提供了像素级标注。我们开展了一项有1,062名参与者的大规模众包研究，用于验证并优化自动生成的分割结果，最终得到一个包含63,000张图像、涵盖6种不同伪影类型的高质量数据集。我们证明，在SR-Ground上训练具备定位（grounding）能力的IQA模型，能够显著提升其在下游任务中的性能。此外，我们还提出了一条微调流水线，利用我们的定位模型来减少超分辨率输出中的可见伪影，从而展示了该数据集的实用价值。

OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

2026-05-21T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.21343v1

中文标题：OcclusionFormer：面向版面约束的图像生成中的深度排序

作者：Ziye Li, Henghui Ding

摘要：

Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.

摘要中文：

arXiv:2605.21343v1 公告类型：新提交。近年来，布局到图像生成模型在空间可控性方面取得了显著进展。然而，它们在处理物体间遮挡时仍存在困难。当边界框发生重叠时，大多数现有方法缺乏显式的遮挡信息，这使得交叠区域的生成本质上具有歧义，并妨碍了复杂遮挡关系的判定。因此，它们在重叠区域往往会产生相互纠缠的纹理或物理上不一致的层叠效果。为解决这一问题，我们首先构建了SA-Z，这是一个带有显式遮挡顺序和像素级标注的大规模数据集。基于该数据集，我们提出了OcclusionFormer，这是一种新颖的遮挡感知扩散Transformer框架，通过将实例解耦并借助体渲染进行合成，显式地建模Z序优先级。此外，为确保细粒度的空间精度，我们引入了一种查询对齐损失，对单个实例进行显式监督，并增强语义一致性。所提出的方法能够有效降低重叠区域的歧义，强制满足正确的遮挡依赖关系，并保持结构完整性，从而在各类场景中带来显著的精度提升。
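
Illustrative sketch for the OcclusionFormer entry above: the abstract describes decoupling instances and compositing them by Z-order via volume rendering. The code below shows only the generic front-to-back alpha compositing rule that volume rendering reduces to for a single pixel. The tuple layout, the convention that a smaller `z_order` is nearer the camera, and the scalar colors are assumptions for illustration, not the paper's implementation.

```python
def composite_pixel(instances):
    """Front-to-back compositing of per-instance contributions at one pixel.

    instances: list of (z_order, alpha, color); smaller z_order is assumed nearer.
    Returns the composited color and the remaining transmittance.
    """
    color = 0.0
    transmittance = 1.0  # fraction of light not yet blocked by nearer instances
    for _, alpha, value in sorted(instances, key=lambda inst: inst[0]):
        color += transmittance * alpha * value
        transmittance *= 1.0 - alpha
    return color, transmittance


# Two overlapping opaque instances: the nearer one (z_order 0) must win,
# regardless of the order in which the boxes were listed.
front_first = composite_pixel([(0, 1.0, 0.2), (1, 1.0, 0.9)])
back_first = composite_pixel([(1, 1.0, 0.9), (0, 1.0, 0.2)])
```

Because the result depends only on the explicit ordering, the overlap region is no longer ambiguous: swapping the list order leaves the pixel unchanged, while swapping the `z_order` values changes which instance is visible.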

A Non-Reference Diffusion-Based Restoration Framework for Landsat 7 ETM+ SLC-off Imagery in Antarctica

2026-05-21T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.21371v1

中文标题：面向南极Landsat 7 ETM+ SLC-off影像的无参考、基于扩散的复原框架

作者：Leyue Tang, Jonathan Louis Bamber, Gang Qiao, Yuanhang Kong

摘要：

Acquiring usable optical imagery in Antarctica is inherently challenging due to prolonged polar nights and frequent cloud cover. Landsat provides the longest and most continuous optical observations and constitutes one of the most important remote sensing data sources for Antarctic studies. However, the scan-line corrector (SLC) failure in 2003 resulted in approximately 22% missing pixels in Landsat 7 ETM+ SLC-off imagery, severely limiting its usability. Unlike many non-polar environments, Antarctic surfaces undergo rapid and substantial changes, which makes it difficult to obtain reliable reference imagery and reduces the applicability of conventional reference-based gap-filling methods. To address this challenge, we propose DiffGF, a non-reference diffusion-based framework for restoring Landsat 7 SLC-off imagery without requiring any external reference data. DiffGF adopts a two-stage design consisting of a latent-space diffusion process and a pixel-space refinement. A dedicated Antarctic dataset, SLCANT, is constructed for training and evaluation. Quantitative and qualitative results demonstrate that DiffGF restores Antarctic SLC-off imagery with high fidelity. Its practical value is further examined through a downstream crevasse segmentation application. The results suggest that DiffGF provides a useful approach for exploiting Landsat 7 SLC-off archives in Antarctica, enabling the extraction of valuable information from historical records and supporting related Antarctic studies.

摘要中文：

arXiv:2605.21371v1 公告类型：新提交。由于极夜漫长且云层覆盖频繁，在南极获取可用的光学影像本身就极具挑战。Landsat提供了持续时间最长、最为连续的光学观测，是南极研究最重要的遥感数据源之一。然而，2003年扫描线校正器（SLC）发生故障，导致Landsat 7 ETM+ SLC-off影像约有22%的像素缺失，严重限制了其可用性。与许多非极地环境不同，南极地表变化迅速且幅度大，这使得可靠的参考影像难以获取，并降低了传统的基于参考影像的缺口填补方法的适用性。为应对这一挑战，我们提出了DiffGF，这是一种无参考的、基于扩散的框架，用于复原Landsat 7 SLC-off影像，且无需任何外部参考数据。DiffGF采用两阶段设计，包括潜在空间扩散过程和像素空间细化。我们构建了专门的南极数据集SLCANT用于训练与评估。定量与定性结果均表明，DiffGF能够以高保真度复原南极地区的SLC-off影像。其实用价值通过一项下游的冰裂隙分割应用得到了进一步检验。结果表明，DiffGF为利用南极地区的Landsat 7 SLC-off影像档案提供了一种有用的途径，有助于从历史记录中提取有价值的信息，并为相关的南极研究提供支撑。

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

2026-05-21T04:00:00 · cs.CV, cs.LG, diffusion · oai:arXiv.org:2605.21381v1

中文标题：在用于可控图像修复的随机插值中分离生成与回归

作者：Yi Liu, Jia Ma, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang

摘要：

Recent advances in Image Restoration (IR) have been largely driven by generative methods such as Diffusion Models and Flow Matching, which excel in synthesizing realistic textures while suffering from slow multi-step inference and compromised pixel fidelity. In contrast, classical regression-based IR methods excel precisely in these aspects, offering single-step efficiency and high pixel-level reconstruction fidelity. To bridge this gap, we propose DiSI, a unified framework that Disentangles the underlying Stochastic Interpolant process into independent generation and regression components. This decoupling endows DiSI with remarkable versatility, enabling a continuous and controllable transition from a pure regression process to a fully generative one. Technically, we instantiate this framework with two specific sampling trajectories, accompanied by a unified sampler for high-quality, few-step inference on arbitrary trajectories. Furthermore, we design a dual-branch U-Net style transformer network in pixel space, using a dedicated branch to enhance conditional guidance while ensuring high throughput. Extensive experiments demonstrate that DiSI efficiently achieves competitive results on various IR tasks, while uniquely offering the inference-time flexibility to control the distortion-perception trade-off within a single model.

摘要中文：

arXiv:2605.21381v1 公告类型：新提交。近年来，图像修复（IR）领域的进展在很大程度上由扩散模型和流匹配等生成式方法推动；这些方法擅长合成逼真的纹理，但存在多步推理速度慢、像素保真度受损的问题。相比之下，经典的基于回归的图像修复方法恰恰在这些方面表现出色，兼具单步推理的高效性与较高的像素级重建保真度。为弥合这一差距，我们提出了DiSI，这是一个统一框架，将底层的随机插值过程解耦为相互独立的生成分量与回归分量。这种解耦使DiSI具备出色的通用性，能够实现从纯回归过程到完全生成式过程的连续且可控的过渡。在技术上，我们通过两条特定的采样轨迹实例化该框架，并配备一个统一的采样器，以在任意轨迹上实现高质量的少步推理。此外，我们在像素空间中设计了一种双分支的U-Net风格Transformer网络，通过专门的分支来强化条件引导，同时确保较高的吞吐量。大量实验表明，DiSI在各类图像修复任务上均能高效地取得具有竞争力的结果，并且独特地提供了推理时的灵活性，可在单个模型内调控失真与感知质量之间的权衡。

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

2026-05-21T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.21431v1

中文标题：iTryOn：基于空间语义引导的交互式视频虚拟试穿技术研究

作者：Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

摘要：

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

摘要中文：

arXiv:2605.21431v1 公告类型：新提交。视频虚拟试穿（VVT）旨在将视频中人物身上的服装无缝替换为一件新服装。尽管现有方法在保持时序一致性方面取得了显著进展，但它们主要局限于非交互式场景，即模特仅仅展示服装。这一局限忽视了现实服装展示中一个至关重要的方面：人与服装之间的主动交互。为弥合这一差距，我们提出并形式化了一项全新的、具有挑战性的任务：交互式视频虚拟试穿（Interactive VVT），其中视频中的人物会主动与其服装进行交互。这项任务带来了超出单纯纹理保持的独特挑战，包括：（1）仅凭标准的姿态信息难以消解交互的语义歧义；（2）需要从交互片段稀少且短暂的视频中学习复杂的服装形变。为应对这些挑战，我们提出了iTryOn，这是一种基于大规模视频扩散Transformer构建的新型框架。iTryOn首创了一种多层级交互注入机制，用于引导复杂动态的生成。在空间层面，我们引入了一种与服装无关的三维手部先验，为精确的手与服装接触提供细粒度引导，从而有效消解空间歧义。在语义层面，iTryOn利用全局文本描述提供整体上下文，并利用带时间戳的动作描述刻画局部交互，二者通过我们提出的动作感知旋转位置编码（A-RoPE）实现同步。大量实验表明，iTryOn不仅在传统的VVT基准上取得了最先进的性能，在全新的交互式场景中也遥遥领先，标志着向更加动态、可控的虚拟试穿体验迈出了重要一步。

One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration

2026-05-21T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.21484v1

中文标题：基于不动点迭代的离散扩散图像生成器一步式蒸馏

作者：Chaoyang Wang, Yunhai Tong

摘要：

Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parameterizations and multi-stage pipelines that fragment optimization. In this paper, we introduce Fixed-Point Distillation (FPD), an end-to-end framework that constructs local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. To compute the training objective in a semantically meaningful space, we lift discrete tokens into continuous features and apply a multi-bandwidth drift loss that iteratively accumulates these corrections. To backpropagate through the discrete bottleneck, we employ a straight-through estimator that feeds exact hard-sampled tokens to the teacher and decoder during the forward pass, ensuring that training and inference operate on the same codebook manifold, while routing continuous gradients back to the student logits. This fully differentiable pathway additionally accommodates an optional unconditional adversarial objective to enhance perceptual realism. Evaluations on both class- and text-conditional generation validate the effectiveness of our framework. FPD achieves competitive visual fidelity and structural alignment within a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.

摘要中文：

arXiv:2605.21484v1 公告类型：新提交。离散扩散模型在视觉合成方面表现优异，但依赖缓慢的迭代解码。现有的单步蒸馏方法试图绕过这一瓶颈，要么训练辅助的分数网络，使计算量实际上翻倍；要么引入专用的参数化方案和多阶段流水线，使优化过程被割裂。在本文中，我们提出了不动点蒸馏（FPD），这是一个端到端框架，它对学生模型的单步草稿进行部分破坏，再用教师模型的单步更新对其进行精炼，从而构建局部校正目标。为了在语义上有意义的空间中计算训练目标，我们将离散token提升为连续特征，并施加一种多带宽漂移损失，以迭代地累积这些校正。为了让梯度穿过离散瓶颈进行反向传播，我们采用了直通估计器：在前向传播中，将硬采样得到的精确token送入教师模型和解码器，确保训练与推理在同一个码本流形上进行，同时将连续梯度回传至学生模型的logits。这条完全可微的路径还可以容纳一个可选的无条件对抗目标，以增强感知真实感。在类别条件生成与文本条件生成上的评估均验证了该框架的有效性。FPD在单个推理步内即可取得具有竞争力的视觉保真度和结构对齐，缩小了与多步教师模型之间的差距，同时优于现有的离散蒸馏基线方法。

Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

2026-05-21T04:00:00 · cs.AI, cs.CV, cs.LG, diffusion, stat.AP, stat.ML · oai:arXiv.org:2605.20502v1

中文标题：用于多编码器分布外检测的表征空间扩散模型蒂佩特-最小值融合

作者：Neelkamal Bhuyan

摘要：

We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts -- global domain changes, semantic divergence, texture differences, and covariate corruptions -- through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder's sensitivity to specific shift types from ID data alone and introduce EncMin2L, an encoder-agnostic two-level $\min(\cdot)$-gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at $2.3\times$ lower parameter cost. Two ID-data diagnostics, $\eta^2$ (class-conditional F-test) and $\Delta\mu$ (log-likelihood shift under synthetic corruptions), quantify encoder specialization, while a Tippett minimum $p$-value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves $\geq 0.94$ AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.

摘要中文：

arXiv:2605.20502v1 公告类型：交叉列表。我们将各编码器各自的表征空间扩散模型（RDM）进行多编码器融合，以应对全谱分布偏移下的分布外（OOD）检测，包括全局域变化、语义差异、纹理差异以及协变量损坏。我们仅基于分布内（ID）数据，从统计上识别出各编码器对特定偏移类型的敏感性，并提出了EncMin2L，这是一种与编码器无关的两级 $\min(\cdot)$ 门控机制，它无需OOD标签即可组合并校准各编码器基于扩散的似然检测器，并以低 $2.3\times$ 的参数成本优于单体式多编码器基线。两种基于ID数据的诊断指标，即 $\eta^2$（类条件F检验）和 $\Delta\mu$（合成损坏下的对数似然偏移），用于量化编码器的专长；同时，Tippett最小 $p$ 值组合将各编码器的得分聚合为单一且校准稳定的OOD信号。EncMin2L在全部四种偏移类型上同时取得了不低于0.94的AUROC，在重叠的基准上优于最先进的表征空间扩散OOD检测器。
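
Illustrative sketch for the EncMin2L entry above: the abstract names a Tippett minimum $p$-value combination over per-encoder scores. The code below shows the textbook Tippett rule, $p_\text{comb} = 1 - (1 - \min_i p_i)^k$, together with a simple empirical $p$-value computed from in-distribution calibration scores. The helper names, the "larger score means more anomalous" convention, and the independence assumption behind the closed form are assumptions for illustration; the paper's two-level gate and calibration procedure are not reproduced here.

```python
def empirical_p_value(score, id_calibration_scores):
    """Fraction of ID calibration scores at least as extreme, with +1 smoothing.

    Convention (assumed): a larger score means the sample looks more anomalous.
    """
    n_extreme = sum(1 for s in id_calibration_scores if s >= score)
    return (1 + n_extreme) / (1 + len(id_calibration_scores))


def tippett_combine(p_values):
    """Tippett's method: p-value of the smallest of k p-values.

    The closed form is exact only when the k tests are independent under the null.
    """
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k


# Two hypothetical encoders: one sees nothing unusual, one flags the sample.
combined = tippett_combine([0.9, 0.1])   # 1 - 0.9**2, i.e. about 0.19
```

Taking the minimum lets a single encoder that is sensitive to a given shift type drive the decision, while the exponent `k` corrects for having looked at several encoders.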

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

2026-05-21T04:00:00 · cs.AI, cs.CV, cs.LG, cs.RO, diffusion · oai:arXiv.org:2605.20758v1

中文标题：面向组合奖励的流模型冲突感知加性引导

作者：Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh

摘要：

Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.

摘要中文：

arXiv:2605.20758v1 公告类型：交叉列表。推理时的引导采样将生成过程视为一条可控的轨迹，从而无需微调即可引导当前最先进的扩散模型和流模型。这为受控生成提供了一种简单而灵活的方式来注入外部约束（例如代价函数或预训练验证器）。然而，现有方法在同时组合多个约束时往往失效，导致结果偏离真实数据流形。在本工作中，我们找出了这种流形外漂移的根本原因，并发现近似误差会随梯度不对齐程度急剧增大。基于这些发现，我们提出了冲突感知加性引导（$g^\text{car}$），这是一种轻量且可学习的方法，通过动态检测并化解梯度冲突来主动纠正流形外漂移。我们在多个不同领域对 $g^\text{car}$ 进行了验证，涵盖合成数据集、图像编辑，以及用于规划与控制的生成式决策。结果表明，$g^\text{car}$ 能够有效纠正流形外漂移，在计算开销较低的同时，在生成保真度上超越基线方法。代码可在 https://github.com/yuxuehui/CAR-guidance 获取。
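
Illustrative sketch for the conflict-aware guidance entry above: the abstract says the method detects and resolves gradient conflicts but does not state the rule, and describes it as learnable. As a stand-in, the code below uses a fixed PCGrad-style projection from multi-task learning, which is a different and simpler technique: a conflict is flagged when two guidance gradients have negative cosine similarity, and each gradient then drops its component along the other before the two are added. All function names are hypothetical, and nothing here should be read as the $g^\text{car}$ update itself.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def resolve_conflict(g_a, g_b):
    """Add two guidance gradients, projecting out mutual conflict first.

    PCGrad-style stand-in (assumption): if the gradients point in opposing
    directions, remove from each its component along the other.
    """
    d = dot(g_a, g_b)
    if d >= 0.0:  # no conflict: plain additive guidance
        return [x + y for x, y in zip(g_a, g_b)]
    a_proj = [x - d / dot(g_b, g_b) * y for x, y in zip(g_a, g_b)]
    b_proj = [y - d / dot(g_a, g_a) * x for x, y in zip(g_a, g_b)]
    return [x + y for x, y in zip(a_proj, b_proj)]


g_a, g_b = [1.0, 0.0], [-1.0, 1.0]     # two reward gradients that partly oppose
combined = resolve_conflict(g_a, g_b)  # [0.5, 1.5]
```

After projection the combined direction has a non-negative inner product with both original gradients, so neither constraint is actively pushed backwards by the other.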

Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

2026-05-21T04:00:00 · cs.CV, cs.LG, diffusion · oai:arXiv.org:2605.20780v1

中文标题：学会以物理的方式思考：通过表征对齐打破科学扩散模型中的捷径学习

作者：Haozhe Jia, Pengyu Yin, Wenshuo Chen, Shaofeng Liang, Lei Wang, Bowen Tian, Xiucheng Wang, Nanqian Jia, Yutao Yue

摘要：

Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce REPA-P, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight $1{\times}1$ projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing zero overhead. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to $2{\times}$, reduces physics residuals by up to $66.4\%$, and improves out-of-distribution robustness by up to $49.3\%$, with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at https://github.com/Hxxxz0/REPA-P.

摘要中文：

arXiv:2605.20780v1 公告类型：交叉列表。物理信息扩散模型通常仅在最终输出上施加偏微分方程（PDE）约束，中间表征不受约束，在边界条件发生偏移时容易出现捷径学习。我们提出了REPA-P，这是一个无需教师、与架构无关的框架，利用第一性原理残差将中间特征与物理状态对齐。REPA-P在选定的层上附加轻量级的 $1{\times}1$ 投影头，将隐藏激活解码为物理量，并在训练过程中施加PDE残差损失。这些投影头在推理时被丢弃，不引入任何额外开销。在达西流、拓扑优化、静电势和湍流槽道流四项PDE任务中，REPA-P将收敛速度最高提升至 $2{\times}$，将物理残差最多降低66.4%，并将分布外鲁棒性最多提高49.3%，且在U-Net和扩散Transformer两种主干网络上均取得了一致的增益。消融实验表明，监督少量中间层即可获得大部分收益，并且与输出层的物理损失互补。代码可在 https://github.com/Hxxxz0/REPA-P 获取。
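
Illustrative sketch for the REPA-P entry above: the abstract describes $1{\times}1$ projection heads that decode hidden activations into physical quantities, followed by a PDE residual loss. The code below is a minimal pure-Python version under strong assumptions: the PDE is Laplace's equation (a charge-free electrostatic potential), the residual uses the standard 5-point finite-difference Laplacian on a uniform grid, and the "hidden features" are a hand-built two-channel toy. The function names and the grid layout are hypothetical and are not taken from the paper's code.

```python
def project_1x1(features, weights, bias=0.0):
    """A 1x1 head applies the same linear map to the channel vector at every pixel."""
    return [[sum(w * c for w, c in zip(weights, pixel)) + bias for pixel in row]
            for row in features]


def laplace_residual_loss(phi, h=1.0):
    """Mean squared 5-point Laplacian over interior points (residual of Laplace's equation)."""
    n_rows, n_cols = len(phi), len(phi[0])
    total, count = 0.0, 0
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            lap = (phi[i + 1][j] + phi[i - 1][j] + phi[i][j + 1] + phi[i][j - 1]
                   - 4.0 * phi[i][j]) / h ** 2
            total += lap * lap
            count += 1
    return total / count


# Toy hidden features on a 5x5 grid with two channels: x^2 and y^2.
features = [[[float(x * x), float(y * y)] for x in range(5)] for y in range(5)]
harmonic = project_1x1(features, weights=[1.0, -1.0])     # x^2 - y^2 solves Laplace
non_harmonic = project_1x1(features, weights=[1.0, 1.0])  # x^2 + y^2 does not
```

The decoded field `x^2 - y^2` gives zero residual while `x^2 + y^2` gives a constant Laplacian of 4, so the loss separates projections that are physically consistent from those that are not. Since the heads are used only for this training signal, dropping them at inference costs nothing.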

Variance Reduction for Expectations with Diffusion Teachers

2026-05-21T04:00:00 · cs.AI, cs.CV, cs.LG, diffusion, stat.CO, stat.ML · oai:arXiv.org:2605.21489v1

中文标题：带有扩散教师的期望值方差缩减

作者：Jesse Bettencourt, Xindi Wu, Matan Atzmon, James Lucas, Jonathan Lorraine

摘要：

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.

摘要中文：

arXiv:2605.21489v1 公告类型：交叉列表。预训练的扩散模型作为冻结的教师模型，为文本到3D、单步蒸馏以及数据归因等下游流水线提供信号。这些流水线所使用的教师梯度是关于噪声水平和高斯噪声样本的蒙特卡洛（MC）期望；由于每次采样都需要代价高昂的上游工作（渲染、仿真、编码），估计量的方差主导了计算成本。我们提出了CARV，这是一个考虑计算成本的方差核算框架，它引出了一种分层MC估计器：将代价高昂的上游计算摊销到廉价的扩散噪声重采样上，并通过时间步重要性采样和分层逆CDF构造进一步改进。在我们的文本到3D蒸馏与归因实验中，CARV在不改变目标函数的前提下带来了2至3倍的有效计算倍率（其中大部分来自摊销复用，约25%的额外增益来自重要性采样与分层）；而在单步蒸馏中，同样的技术将梯度方差降低了一个数量级，却并未改善下游FID，这标示出MC方差不再是瓶颈的情形。
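
Illustrative sketch for the CARV entry above: the abstract combines three ingredients, namely reusing one expensive upstream computation across many cheap noise resamples, importance sampling over timesteps, and a stratified inverse-CDF construction. The code below shows a generic version of each on a scalar toy problem, assuming the target timestep distribution is uniform on (0, 1) so that the importance weight is `1 / q(t)` for proposal density `q`. The function names, the toy integrand, and the proposal `q(t) = 2t` are assumptions for illustration, not the paper's estimator.

```python
import math
import random


def stratified_timesteps(inv_cdf, n, rng):
    """One draw per equal-probability stratum, mapped through the proposal's inverse CDF."""
    return [inv_cdf((i + rng.random()) / n) for i in range(n)]


def amortized_estimate(upstream, integrand, inv_cdf, density, n_inner, rng):
    """Estimate E over t ~ U(0,1) and eps ~ N(0,1) of integrand(state, t, eps).

    The expensive upstream() call happens once and is reused for all inner draws.
    """
    state = upstream()                        # e.g. a render or simulation (assumed costly)
    total = 0.0
    for t in stratified_timesteps(inv_cdf, n_inner, rng):
        eps = rng.gauss(0.0, 1.0)             # cheap diffusion-noise resample
        total += integrand(state, t, eps) / density(t)  # importance weight 1 / q(t)
    return total / n_inner


# Toy proposal q(t) = 2t on (0, 1): its CDF is t^2, so the inverse CDF is sqrt(u).
estimate = amortized_estimate(
    upstream=lambda: 3.0,
    integrand=lambda state, t, eps: state * t,  # true expectation is 3 * 0.5 = 1.5
    inv_cdf=math.sqrt,
    density=lambda t: 2.0 * t,
    n_inner=8,
    rng=random.Random(0),
)
```

In this toy case the proposal is proportional to the integrand, so every weighted draw equals 1.5 and the estimator has essentially zero variance. Real integrands will not be matched this well, but the example shows why a well-chosen timestep proposal and stratification can reduce variance without extra upstream calls.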

UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

2026-05-21T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2504.13109v2

中文标题：UniEdit-Flow：在流模型时代释放反演与编辑能力

作者：Guanlong Jiao, Biqing Huang, Kuan-Chieh Wang, Renjie Liao

摘要：

Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: https://uniedit-flow.github.io/

摘要中文：

arXiv:2504.13109v2 公告类型：替换。流匹配模型已成为扩散模型的有力替代方案，但为扩散模型设计的现有反演与编辑方法在流模型上往往效果不佳甚至无法适用。流模型的直线型、不相交轨迹既给基于扩散的方法带来了挑战，也为新的解决方案开辟了途径。本文提出了一种基于预测-校正的流模型反演与编辑框架。首先，我们提出了Uni-Inv，这是一种为精确重建而设计的有效反演方法。在此基础上，我们将延迟注入的概念拓展至流模型，并提出了Uni-Edit，这是一种具有区域感知能力的鲁棒图像编辑方法。我们的方法无需微调、与模型无关，兼具高效性和有效性，在实现多样化编辑的同时，能够很好地保留与编辑无关的区域。在多种生成模型上的大量实验表明，即便在低成本设置下，Uni-Inv和Uni-Edit也具有优越性和良好的泛化能力。项目页面：https://uniedit-flow.github.io/

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

2026-05-21T04:00:00 · cs.AI, cs.CV, diffusion · oai:arXiv.org:2601.04068v4

中文标题：关注生成细节：面向视频扩散模型的直接局部细节偏好优化

作者：Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

摘要：

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that efficiently collects preference data, generating preference pairs with a single inference per prompt and eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment. The code is available at https://github.com/1170300714/Local-DPO.

摘要中文：

arXiv:2601.04068v4 公告类型：替换。使文本到视频扩散模型与人类偏好对齐，对于生成高质量视频至关重要。现有的直接偏好优化（DPO）方法依赖多样本排序和任务特定的评判模型，这不仅效率低下，而且往往产生含糊的全局监督信号。为应对这些局限，我们提出了LocalDPO，这是一种新颖的后训练框架，它从真实视频中构建局部化的偏好对，并在时空区域层面优化对齐。我们设计了一条自动化流水线来高效地收集偏好对数据，每个提示只需一次推理即可生成偏好对，无需外部评判模型或人工标注。具体而言，我们将高质量的真实视频视为正样本，并用随机时空掩码对其进行局部破坏，再仅用冻结的基础模型复原被掩码的区域，从而生成相应的负样本。在训练过程中，我们引入了一种区域感知的DPO损失，将偏好学习限定在被破坏的区域，以实现快速收敛。在Wan2.1和CogVideoX上的实验表明，与其他后训练方法相比，LocalDPO在视频保真度、时序连贯性以及人类偏好评分上均有一致的提升，为视频生成器对齐建立了一种更高效、更细粒度的范式。代码可在 https://github.com/1170300714/Local-DPO 获取。
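
Illustrative sketch for the LocalDPO entry above: the abstract mentions a region-aware DPO loss restricted to the corrupted areas, without giving its form. The code below assumes a Diffusion-DPO-style objective in which denoising errors stand in for log-likelihoods, and simply restricts each error to a binary mask. The function names, the flat-list tensors, and the `beta` parameter are assumptions for illustration; the paper's actual loss may differ.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error over positions where mask == 1 (the corrupted region)."""
    area = sum(mask)
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / area


def region_dpo_loss(err_policy_win, err_ref_win, err_policy_lose, err_ref_lose, beta=1.0):
    """Assumed Diffusion-DPO-style loss on region-restricted denoising errors.

    margin < 0 means the policy improved on the preferred sample more than on the
    rejected one, relative to the frozen reference model.
    """
    margin = (err_policy_win - err_ref_win) - (err_policy_lose - err_ref_lose)
    # -log(sigmoid(-beta * margin)) written as log(1 + exp(beta * margin))
    return math.log1p(math.exp(beta * margin))


# Errors outside the mask are ignored: only the third position counts here.
local_error = masked_mse(pred=[0.0, 5.0, 1.0], target=[0.0, 0.0, 0.0], mask=[0, 0, 1])
```

When the policy matches the reference the loss equals log 2, and it falls below that once the policy denoises the real (preferred) region better than the reference does. Restricting the errors to the mask keeps untouched regions, which are identical in both samples, from diluting the signal.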

Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

2026-05-21T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2602.01273v4

中文标题：Q-DiT4SR：面向真实场景图像超分辨率的细节保持型扩散Transformer量化方法研究

作者：Xun Zhang, Kaicheng Yang, Hongliang Lu, Haotong Qin, Yong Guo, Yulun Zhang

摘要：

Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by $5.8\times$ and computational operations by $6.14\times$. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR.

摘要中文：

arXiv:2602.01273v4 公告类型：替换。近来，扩散Transformer（DiT）已被用于真实场景图像超分辨率（Real-ISR）以生成高质量纹理，但其沉重的推理负担阻碍了实际部署。尽管训练后量化（PTQ）是一种颇具前景的加速方案，但现有的超分辨率量化方法大多聚焦于U-Net架构，而通用的DiT量化通常是为文本到图像任务设计的。将这些方法直接应用于基于DiT的超分辨率模型会导致局部纹理严重退化。为此，我们提出了Q-DiT4SR，这是首个专为基于DiT的Real-ISR量身定制的PTQ框架。我们提出了H-SVD，这是一种分层奇异值分解，在相匹配的参数预算下将全局低秩分支与局部逐块秩1分支相结合。我们进一步提出了方差感知的时空混合精度：VaSMP基于率失真理论，以无需数据的方式分配跨层的权重位宽；VaTMP则通过动态规划（DP）并以极少的校准，在扩散时间步上调度层内的激活精度。在多个真实场景数据集上的实验表明，Q-DiT4SR在W4A6和W4A4两种设置下均取得了当前最优的性能。值得注意的是，W4A4量化配置可将模型大小缩小 $5.8\times$，将计算量减少 $6.14\times$。我们的代码和模型将发布在 https://github.com/xunzhang1128/Q-DiT4SR。

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

2026-05-21T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2602.06886v3

中文标题：提示重注入：缓解多模态扩散Transformer中的提示遗忘问题

作者：Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu

摘要：

Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch are progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.

摘要中文：

arXiv:2602.06886v3 公告类型：替换。用于文本到图像生成的多模态扩散Transformer（MMDiT）保留相互独立的文本分支和图像分支，并在整个去噪过程中让文本token与视觉潜变量之间进行双向信息流动。在这一设置下，我们观察到一种提示遗忘现象：随着深度增加，文本分支中提示表征的语义被逐渐遗忘。我们进一步在SD3、SD3.5和FLUX.1这三种具有代表性的MMDiT上，通过探测文本分支各层表征的语言学属性验证了这一效应。受这些发现的启发，我们提出了一种无需训练的方法，即提示重注入：将早期层的提示表征重新注入到后续层中，以缓解这种遗忘。在GenEval、DPG和T2I-CompBench++上的实验表明，该方法在指令遵循能力上取得了一致的提升，同时在衡量偏好、美学以及文本-图像生成整体质量的指标上也有所改善。

Predicting 3D structure by latent posterior sampling

2026-05-21T04:00:00 · cs.CV, cs.LG, diffusion · oai:arXiv.org:2605.10830v3

中文标题：通过潜在后验采样预测三维结构

作者：Azmi Haider, Dan Rosenbaum

摘要：

The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.

摘要中文：

arXiv:2605.10830v3 公告类型：替换。二维图像生成模型与三维场景的神经场表示均取得了令人瞩目的成果，这为整合两者的优势提供了极具吸引力的契机。在本工作中，我们提出了一种方法，将基于NeRF的三维场景表示与基于扩散模型的概率建模及推理相结合。我们将三维重建视为一个具有固有不确定性的感知问题，因而能够从概率推理方法中获益。核心思想是将三维场景表示为一个随机潜变量，我们可以为其学习先验，并在给定一组观测的情况下利用该先验进行后验推断。我们将扩散模型的基于分数的推理方法与一个似然项相结合来构建后验采样，该似然项由包含体渲染的重建模型计算得到。我们采用两阶段训练流程：首先，在对一个三维场景数据集的潜在表征进行自解码（auto-decoding）的同时训练重建模型；随后，用扩散模型训练潜变量上的先验。通过用该模型从后验中生成样本，我们展示了可以完成多种三维重建任务，它们的区别在于用作输入的观测类型。我们展示了基于单视图、多视图、含噪图像、稀疏像素以及稀疏深度数据的重建。这些观测为场景提供的信息量各不相同，我们展示了该方法能够刻画与每项任务相关的不同程度的固有不确定性。实验表明，这一思路得到了一种通用的方法，能够从多种类型的观测中准确预测三维结构。
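
Illustrative sketch for the latent posterior sampling entry above: the abstract combines a learned prior score with a likelihood term from a reconstruction model. The identity behind this is that the posterior score is the sum of the prior score and the gradient of the log-likelihood. The code below checks it on a one-dimensional toy where everything is known in closed form: a standard normal prior on the latent `z` and a noisy observation `y = z + noise`. The toy model, the function names, and the use of plain (time-independent) Langevin dynamics instead of a diffusion sampler are assumptions for illustration only.

```python
import math
import random


def prior_score(z):
    """d/dz log N(z; 0, 1): stands in for a learned diffusion prior score."""
    return -z


def likelihood_score(z, y, sigma):
    """d/dz log N(y; z, sigma^2): stands in for the rendering-based likelihood term."""
    return (y - z) / sigma ** 2


def posterior_score(z, y, sigma):
    """Bayes' rule in score form: posterior score = prior score + likelihood score."""
    return prior_score(z) + likelihood_score(z, y, sigma)


def langevin_samples(y, sigma, n_steps=20000, step=0.01, seed=0):
    """Unadjusted Langevin dynamics driven by the posterior score."""
    rng = random.Random(seed)
    z, samples = 0.0, []
    for _ in range(n_steps):
        z += step * posterior_score(z, y, sigma) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        samples.append(z)
    return samples


y_obs, noise_std = 1.0, 0.5
analytic_mean = y_obs / (1.0 + noise_std ** 2)   # 0.8 for this conjugate Gaussian toy
samples = langevin_samples(y_obs, noise_std)
sample_mean = sum(samples) / len(samples)
```

The posterior score vanishes at the analytic posterior mean, and the sampled mean lands close to it. A less informative observation (larger `noise_std`) widens the posterior, which mirrors the abstract's point that different observation types carry different amounts of uncertainty.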

SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

2026-05-21T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.16530v2

中文标题：SWoMo：用于白内障手术仿真的神经符号世界模型

作者：Ssharvien Kumar Sivakumar, Akwele Johnson, Anirudh Dhingra, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay

摘要：

Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/

摘要中文：

arXiv:2605.16530v2，公告类型：替换。逼真的手术仿真在新手外科医生的培训以及自主智能体的研发中发挥着至关重要的作用。世界模型可根据当前观测与手术操作预测患者未来的状态，从而将此类仿真环境扩展至真实且多样化的手术流程。然而，当前的最先进方法往往难以满足临床应用所必需的关键要求，包括视觉真实感、基于物理的交互，以及对超出训练分布的场景进行仿真的能力。为此，我们提出了SWoMo，这是一种用于白内障手术仿真的神经符号世界模型，它将运动生成与视觉真实感解耦。符号化模块由基于规则的仿真器与场景图表示组成，用于建模运动动力学和工具-组织交互；扩散模型则生成逼真的视觉外观，包括纹理与组织变形。我们提出了一种逆向配对策略：在仿真器中重建真实的手术视频，从而获得成对的仿真与真实视频，再用这些成对视频训练我们的视频扩散模型，以完成方向相反的目标，即从仿真到真实的转换。我们的实验表明，该方法在定性和定量两方面均优于先前工作。我们证明，我们的仿真器还满足上述关键要求，包括对未见交互几何的泛化能力、下游手术阶段检测性能的提升，以及无监督视频风格迁移。代码、数据及模型权重已公开，访问地址为：https://ssharvienkumar.github.io/SWoMo/

Spectral Progressive Diffusion for Efficient Image and Video Generation

2026-05-21T04:00:00 | autoregressive, cs.CV, diffusion | oai:arXiv.org:2605.18736v2

中文标题：用于高效图像与视频生成的谱渐进扩散模型

作者：Howard Xiao, Brian Chao, Lior Yariv, Gordon Wetzstein

摘要：

摘要中文：

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

2026-05-21T04:00:00 | cs.AI, cs.CV, diffusion | oai:arXiv.org:2605.19729v2

中文标题：LIFT与PLACE：一种简单、稳定且有效的轻量化扩散模型知识蒸馏框架

作者：Hyunsoo Han, Sangyeop Yeo, Jaejun Yoo

摘要：

We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process - stemming from its substantially larger capacity - poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE is effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), datasets, and even extends to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

摘要中文：

arXiv:2605.19729v2，公告类型：替换。我们证明，在扩散模型的知识蒸馏中，教师网络因容量显著更大而具有高度复杂的去噪过程，学生模型很难忠实地模仿这一过程。为解决这一问题，我们提出了一种由粗到精的蒸馏框架，包含基于线性拟合的蒸馏（LIFT）与分段局部自适应系数估计（PLACE）。首先，LIFT将目标分解为“粗略”对齐和“精细”优化。学生模型先接受粗略对齐的训练，再进入精细优化。其次，PLACE对LIFT加以扩展，通过将输出按误差大小划分为若干组来处理空间上不均匀的误差，从而提供局部自适应的指导。我们的实验表明，LIFT和PLACE在不同的扩散空间（图像/潜在空间）、骨干网络（U-Net/DiT）、任务（无条件/条件生成）和数据集上均有效，甚至可扩展至基于流的模型，如MMDiT（SD3）。此外，在学生模型仅有130万参数（仅为教师模型的1.6%）的极端压缩条件下，传统知识蒸馏无法为稳定训练提供足够的指导，FID往往恶化至50-200+；而我们的方法仍能稳定收敛，并取得了15.73的FID。

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

2026-05-21T04:00:00 | autoregressive, cs.CV, cs.RO, diffusion | oai:arXiv.org:2605.14417v2

中文标题：在身体运动之前：学习用于语言条件化人形机器人控制的预判性关节意图

作者：Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

摘要：

摘要中文：

image_compression

Image Compression

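1 paper

Today's Image Compression track covers 1 paper. The digest keeps the original abstract, the Chinese abstract, the authors, and the link, so you can screen quickly first and then pick the papers worth a close read to file into org-roam.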

Efficient training for compact compression models via sequential distillation

2026-05-21T04:00:00 | cs.CV, cs.LG, image_compression | oai:arXiv.org:2601.05639v2

中文标题：通过顺序蒸馏实现紧凑压缩模型的高效训练

作者：Caroline Mazini Rodrigues (COMPACT), Nicolas Keriven (COMPACT), Thomas Maugey (COMPACT)

摘要：

Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to significantly reduce autoencoder-based compression networks in a more stable Knowledge Distillation process. The intuition is that highly reduced architectures benefit from simplified optimization objectives in early training, with complexity gradually introduced later. Therefore, our approach begins with a sequential encoder-decoder distillation stage that provides a robust initialization for the lightweight model. This is followed by standard training that can be regularized with latent distillation. We evaluate the resulting lightweight autoencoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity in early epochs better than training lightweight autoencoders with the original loss, making it practical for resource-limited environments.

摘要中文：

arXiv:2601.05639v2，公告类型：替换。用于图像压缩的深度学习模型在硬件受限的应用场景中往往面临实际限制。尽管这些模型能够实现高质量的重建，但它们通常结构复杂、体量庞大，并且需要大量的训练数据和计算资源。我们提出了一种方法，通过更稳定的知识蒸馏过程大幅缩减基于自编码器的压缩网络。其直觉在于，大幅缩减后的架构在训练初期受益于简化的优化目标，复杂度则在之后逐步引入。因此，我们的方法首先进行顺序的编码器-解码器蒸馏阶段，为轻量化模型提供稳健的初始化。随后进行标准训练，该训练可用潜在表示蒸馏加以正则化。我们在图像压缩任务上，针对两种不同架构评估了所得到的轻量化自编码器。实验表明，与采用原始损失训练轻量化自编码器相比，我们的方法在训练早期的轮次中能更好地保持重建质量和统计保真度，因而适用于资源受限的环境。