Daily arXiv Paper Digest

2026-06-17 · 92 papers · grouped by research area
Auto-tracking · LLM overview · research radar
92 Total Papers
19 Autoregressive
68 Diffusion
5 Image Compression
0 1D Visual Tokenizer
0 Diffusion Visual Encoder
Daily Radar
Daily Overview

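Today's arXiv papers are dominated by two themes: multimodal fusion and the continued move of generative AI into practical use. Diffusion has the most papers (68) and remains the main technical route for generative models. The Autoregressive direction (19 papers) focuses on architectural optimization and concept control for visual autoregressive models. The two directions are converging: SACE, for example, is tagged under both categories, which reflects a trend toward unifying generation and control techniques.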

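On the application side, video generation and autonomous driving are the most active cross-cutting areas. Video generation work concentrates on long-horizon consistency (Steady-Forcing, DySink), while autonomous driving work emphasizes real-time world models (CausalDrive) and controllable scene generation (ControlMap). Both point toward predictable simulation of the physical world. In addition, robot control (MotionVLA, MimicIK) is closely coupled with reinforcement learning (Efficient Reinforcement, Trust-Region Diffusion Policies), and vision-language-action (VLA) models are moving from research toward real deployment.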

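Among the technical highlights, work on foundational capabilities continues to advance: concept erasure (SACE), quantization and compression (Shift-and-Sum Quantization), and adversarial robustness (BadWorld). This suggests generative AI is evolving from "can generate" to "can control". Image compression has few papers (5), but its optimizations for low-power devices and VLA models are worth watching.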

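Recommended papers:

  • SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models - erases concepts at the "semantic singularity" (Scale-0), a new approach to controllable generation. It targets visual autoregressive models, where erasure techniques designed for diffusion models break down.
  • ControlMap: Controllable High-Definition Map Generation for Traffic Scenario Simulation - controllable HD map generation that directly serves autonomous driving simulation testing, with strong practical value.
  • Trust-Region Diffusion Policies for Massively Parallel On-Policy RL - brings trust-region methods to diffusion policies for massively parallel reinforcement learning, improving the efficiency of robot skill learning.
  • Learned Image Compression for Vision-Language-Action Models - image compression tuned for VLA models, connecting two active topics: generative AI and on-device deployment.
  • BadWorld: Adversarial Attacks on World Models - a systematic study of adversarial attacks on world models, and a warning for the safe deployment of embodied AI.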
Autoregressive
19 papers

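Today's Autoregressive papers show an overall trend of improving both the efficiency and the quality of video generation, while the safety and compression of visual autoregressive models also receive attention. Key highlights include a dynamic frame-sink mechanism for long video generation, a quantization method for visual autoregressive models, and the extension of streaming video generation to high resolutions.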

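  • DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation - replaces static early-frame sinks with retrieved, visually relevant historical frames, targeting outdated long-range context and sink collapse in autoregressive long video generation.
  • Shift-and-Sum Quantization for Visual Autoregressive Models - a post-training quantization method for visual autoregressive models that reduces reconstruction error in attention-value products, with practical value for efficient deployment.
  • SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models - addresses the safety of visual autoregressive models with a scale-aware concept erasure method, which matters for model controllability.
  • Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention - accelerates autoregressive video diffusion with sparse attention, a new direction for efficient video generation.
  • Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions - extends streaming video generation to high resolutions, pushing real-time generation toward practical use.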

Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV, cs.LG, cs.MM, diffusion · arXiv:2606.14732

Chinese title: Steady-Forcing:平衡长时序自然视频扩散中的空间持久性与运动连续性

Authors: Matiur Rahman Minar, Seunghun Oh, GangHyeon Jeong, Unsang Park

Abstract:

Autoregressive video diffusion models enable streaming generation but often degrade over long rollouts: static scene layouts drift, while mechanisms that improve spatial stability tend to suppress motion, causing natural flows such as water, fire, or smoke to stagnate. We study this stability-motion trade-off in fixed-camera long-horizon nature video generation, where the two failure modes can be more clearly separated than in moving-camera settings. We propose Steady-Forcing, a memory and training framework combining a persistent visual anchor (V-Sink), an exponential moving-average motion memory (EMA-Sink), block-relative temporal encoding, periodic cache purification, and distillation from a Wan2.1-14B teacher with motion-rewarded priors under task-focused configurations. Together, these components are designed to preserve background identity while sustaining visually plausible fluid dynamics over multi-minute autoregressive rollouts. Evaluations across seven baselines show that Steady-Forcing improves long-horizon background consistency and imaging quality, while a blind user study indicates stronger perceived stability and motion continuity. The benchmark evaluation further suggests that generic VBench aggregate scores under-penalize fixed-camera artifacts and reward drift-induced optical flow as Dynamic Degree, while not directly penalizing texture hardening or flow stagnation. This motivates future task-specific benchmarks for static-camera nature-flow evaluation. Project page: https://minar09.github.io/steadyforcing/

Abstract (Chinese):

自回归视频扩散模型支持流式生成,但在长时序展开过程中往往会出现质量退化:静态场景布局发生漂移,而提高空间稳定性的机制往往会抑制运动,导致水、火、烟雾等自然流动停滞。本文在固定摄像头的长时序自然视频生成任务中研究这一稳定性-运动权衡问题,该场景下两种失败模式比移动摄像机设置更容易区分。我们提出了Steady-Forcing,一个结合了持久视觉锚点(V-Sink)、指数移动平均运动记忆(EMA-Sink)、块相对时间编码、周期性缓存净化以及基于运动奖励先验的任务聚焦配置下从Wan2.1-14B教师模型蒸馏的训练框架。这些组件共同设计用于在数分钟的自回归生成过程中保持背景一致性同时维持视觉上可信的流体动力学。七个基线模型的评估表明,Steady-Forcing提高了长时序背景一致性和成像质量,而盲测用户研究显示其在感知稳定性和运动连续性方面表现更强。基准评估进一步表明,通用VBench综合分数对固定摄像头伪影的惩罚不足,同时将漂移导致的光流奖励为动态程度,却未直接惩罚纹理硬化或流动停滞——这促使未来需要针对静态摄像头自然流评估的专用基准测试。项目主页:https://minar09.github.io/steadyforcing/
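
The abstract names an exponential moving-average motion memory (EMA-Sink) without giving its update rule. As a rough illustration of what an EMA memory does in general, here is a minimal sketch. The function name `ema_update`, the decay of 0.9, and the toy feature vectors are all assumptions for illustration, not the paper's implementation.

```python
def ema_update(memory, new_feature, decay=0.9):
    """One EMA step per element: keep `decay` of the old memory, blend in the rest."""
    return [decay * m + (1.0 - decay) * x for m, x in zip(memory, new_feature)]


# Hypothetical per-chunk motion features (toy 3-D vectors standing in for latents).
chunk_features = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.0],
]

memory = list(chunk_features[0])  # initialise the memory from the first chunk
for feature in chunk_features[1:]:
    memory = ema_update(memory, feature)

print(memory)  # a smoothed summary that lags behind the most recent chunk
```

A decay close to 1 makes the memory change slowly, which favours stability. A smaller decay tracks recent motion more closely.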

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV, diffusion · arXiv:2606.14792

Chinese title: 基于离散扩散模型的高效视觉-文本思维强化学习

Authors: Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

Abstract:

RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.

Abstract (Chinese):

基于强化学习的后训练已被广泛采用,以使统一的文本和图像生成多模态模型具备交织视觉和文本推理能力。然而,现有方法大多基于自回归统一模型,这类模型在视觉推理过程中需要完整重建图像。本工作证明了多模态离散扩散模型是交织推理中强化学习的有效替代方案,其优势在于能够通过局部视觉编辑而非完整图像token重建来执行高效的视觉rollout。与自回归基线相比,这使GRPO的推演计算量减少了26.9%,同时性能损失极小。尽管效率有所提升,我们发现采用跨模态共享奖励信号的联合奖励分配会在强化学习更新中引入不相关图像和文本token序列之间的跨模态干扰。为解决这一问题,我们提出了分解奖励分配策略,即对文本和视觉段落独立分配奖励。使用分解奖励分配后,我们的强化学习方法相比联合奖励分配提升了11.2%,相比基础模型提升了38.04%。
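
The contrast between joint and factorized reward assignment can be sketched with group-normalized advantages in the style of GRPO. The abstract does not give the exact reward definitions, so everything here is an assumption for illustration: the joint variant sums the two rewards, and all function names and the toy rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardise rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_assignment(text_rewards, image_rewards, image_masks):
    """Every token of a rollout receives the same advantage from the summed reward."""
    totals = [t + i for t, i in zip(text_rewards, image_rewards)]
    adv = group_advantages(totals)
    return [[a for _ in mask] for a, mask in zip(adv, image_masks)]


def factorized_assignment(text_rewards, image_rewards, image_masks):
    """Text tokens use the text advantage, image tokens use the image advantage."""
    adv_text = group_advantages(text_rewards)
    adv_image = group_advantages(image_rewards)
    return [
        [ai if is_image else at for is_image in mask]
        for at, ai, mask in zip(adv_text, adv_image, image_masks)
    ]


# Two rollouts, four tokens each: two text tokens followed by two image tokens.
masks = [[False, False, True, True], [False, False, True, True]]
text_r, image_r = [1.0, 0.0], [0.0, 1.0]
print(joint_assignment(text_r, image_r, masks))
print(factorized_assignment(text_r, image_r, masks))
```

In this toy case the summed rewards are identical across rollouts, so the joint variant gives every token zero advantage. The factorized variant still credits the better text segment and the better image segment separately.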

SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV, diffusion · arXiv:2606.15819

Chinese title: SACE:视觉自回归模型语义奇点处的概念抹除

Authors: Siya Yang, Nanxiang Jiang, Zhaoxin Fan, Yunfeng Diao

Abstract:

The rapid progress of visual autoregressive (VAR) models has unlocked a transformative frontier for high-fidelity text-to-image synthesis, while heightening concerns over the safety alignment of generated content. Naive application of existing erasure techniques to VAR models causes catastrophic semantic collapse and visual artifacts, since they are predominantly designed for the homogeneous denoising steps of diffusion models. To address this foundational challenge, we first propose the Semantic Singularity Axiom, which posits that any target semantic concept embedded within a prompt is definitively locked at Scale-0. We then rigorously validate this axiom through our proposed Incremental Semantic Saliency Analysis (ISSA), which also enables the community to transparently inspect the coarse-to-fine semantic injection process. Guided by this insight, we introduce the first scale-aware concept erasure framework (SACE) for VAR models. By strictly confining interventions to the first scale, our approach couples an Entropy-Regularized Erasure Objective to prevent high-entropy sampling degeneration, alongside a restorative preservation loss to safely anchor the integrity of entangled benign priors. Extensive experiments demonstrate that our method achieves surgical concept erasure performance across various domains with minimal training overhead, resolving in a timely and elegant way the critical safety vulnerabilities inherent in emerging VAR architectures. Code is available at: https://github.com/limerenceysy/SACE

Abstract (Chinese):

视觉自回归(VAR)模型的快速发展为高保真文本到图像合成开辟了变革性前沿,同时也加剧了生成内容安全对齐方面的担忧。将现有抹除技术简单应用于VAR模型会导致灾难性的语义崩溃和视觉伪影,因为这些技术主要针对扩散模型的同质化去噪步骤。为解决这一基础性挑战,我们首先提出语义奇点公理,该公理认为嵌入在提示词中的任何目标语义概念都明确锁定在Scale-0。随后,我们通过提出的增量语义显著性分析(ISSA)严格验证了这一公理,从而使社区能够透明地检查粗到细的语义注入过程。在此洞察的指引下,我们为VAR模型引入了首个尺度感知概念抹除框架(SACE)。通过严格将干预限制在第一个尺度,我们的方法结合了熵正则化抹除目标以防止高熵采样退化,同时配合恢复性保留损失来安全地锚定纠缠良性先验的完整性。大量实验表明,我们的方法在各种领域实现了精确的概念抹除性能,且训练开销极小,及时而优雅地解决了新兴VAR架构固有的关键安全漏洞。代码见:https://github.com/limerenceysy/SACE

RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV · arXiv:2606.16278

Chinese title: RealityBridge: 弥合可编辑3D高斯溅射驾驶仿真与真实世界视频之间的域差距

Authors: Zhenhua Wu, Yun Pang, Mingkun Chang, Yuwei Ning, Liangzhi Wang, Yi Xiao, Guanbin Li

Abstract:

Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

Abstract (Chinese):

长尾危险场景对于面向安全的自动驾驶至关重要,但难以大规模收集和复现。可编辑3D高斯溅射(3DGS)仿真通过重建真实驾驶场景并支持可控的场景编辑,提供了一种有前景的替代方案。然而,经编辑的3DGS渲染视频仍存在显著的仿真到真实域差距,包括渲染伪影、前景目标质量下降、光照不一致以及时间闪烁。现有的修复和视频生成方法不足以完成此任务,因为它们通常无法同时修复3DGS特有的伪影、提升视觉真实感并确保时间一致性。为填补这一空白,我们提出了RealityBridge,一个用于经编辑3DGS驾驶视频的结构保持且目标感知的仿真到真实域框架。RealityBridge采用多模态控制,包括渲染视频、前景掩码、边缘图和语义掩码,并结合轻量级GateNet实现跨主干网络层的自适应条件分配。我们进一步构建了针对性训练数据,并引入自回归长视频训练和奖励引导的后训练来提升修复质量、时间稳定性和幻觉抑制。在内部和公开驾驶数据集上的大量实验表明,RealityBridge在伪影去除、光照协调和长序列时间一致性方面优于现有方法。

NeuronFabric: A Software Reference Architecture for On-Chip Transformer Training with Local Adam

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.AR, cs.LG · arXiv:2606.16440

Chinese title: NeuronFabric: 一种面向片上Transformer训练(采用本地Adam优化器)的软件参考架构

Authors: Evgeny Ukladchikov

Abstract:

Publicly documented accelerator architectures generally separate training computation from optimizer-state updates or rely on external memory and host orchestration. This paper presents NeuronFabric, a software reference architecture intended for future FPGA and ASIC implementations of transformer training with local Adam updates. A complete C# prototype implements forward pass, backpropagation, and Adam optimization without external machine-learning frameworks. The goal is to validate numerical correctness and memory requirements before hardware implementation. The evaluated model is a 334K-parameter autoregressive transformer (d=88, H=4, f=264, L=4, vocab=256) trained on the Shakespeare corpus. The BF16W configuration achieves evaluation loss 1.5426 after 80K samples, compared with 1.5224 for an FP32 GPU reference, while producing coherent character-level text. The paper introduces BF16W, which stores weights in BF16 while retaining Adam optimizer moments in FP32. This reduces memory requirements for on-chip training. A 334K-parameter FP32 model with Adam moments requires approximately 4.0 MB, matching the BRAM capacity of a Xilinx ZCU102 device. The BF16W variant requires approximately 3.34 MB, leaving memory available for activation storage. We describe the vocabulary-budget constraint observed during earlier experiments, quantify BF16W memory savings, and outline FPGA training as the next stage of development. No FPGA measurements are included in this paper. This publication serves as a public architectural disclosure and software reference implementation for future FPGA and ASIC exploration of the NeuronFabric architecture.

Abstract (Chinese):

公开记录的加速器架构通常将训练计算与优化器状态更新分离,或依赖外部内存和主机编排。本文提出NeuronFabric,一个面向未来FPGA和ASIC实现的本地Adam更新Transformer训练的软件参考架构。完整的C#原型实现了前向传播、反向传播和Adam优化,不依赖外部机器学习框架。目标是在硬件实现前验证数值正确性和内存需求。评估模型是一个334K参数的自回归Transformer(d=88, H=4, f=264, L=4, vocab=256),在莎士比亚语料库上进行训练。BF16W配置在80K样本后达到评估损失1.5426,而FP32 GPU参考模型为1.5224,同时生成了连贯的字符级文本。本文引入的BF16W将权重存储为BF16格式,同时将Adam优化器动量保留为FP32格式,从而降低了片上训练的内存需求。包含Adam动量的334K参数FP32模型需要约4.0 MB内存,与Xilinx ZCU102设备的BRAM容量相当。BF16W变体需要约3.34 MB,为激活存储留出了内存空间。本文描述了早期实验中观察到的词汇预算约束,量化了BF16W的内存节省,并概述了FPGA训练作为下一阶段开发计划。本文未包含FPGA测量结果。本出版物作为NeuronFabric架构的公开架构披露和软件参考实现,供未来FPGA和ASIC探索使用。
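
The memory figures quoted in the abstract can be checked with simple arithmetic. The sketch below assumes that the totals count only the weights plus the two Adam moments, with no gradients or activations, and that 1 MB is 10^6 bytes. The helper name `training_state_mb` is hypothetical.

```python
PARAMS = 334_000  # parameter count quoted in the abstract (rounded)


def training_state_mb(params, weight_bytes, moment_bytes=4, n_moments=2):
    """Bytes for weights plus Adam's first and second moments, in MB (1e6 bytes)."""
    total_bytes = params * (weight_bytes + n_moments * moment_bytes)
    return total_bytes / 1e6


fp32_mb = training_state_mb(PARAMS, weight_bytes=4)   # FP32 weights, FP32 moments
bf16w_mb = training_state_mb(PARAMS, weight_bytes=2)  # BF16 weights, FP32 moments
print(f"FP32: {fp32_mb:.2f} MB, BF16W: {bf16w_mb:.2f} MB")
```

Under these assumptions the script gives about 4.01 MB for FP32 and 3.34 MB for BF16W, in line with the roughly 4.0 MB and 3.34 MB stated in the abstract.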

Probing Low Frame Rate Degradation in Neural Audio Codecs

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.SD, eess.AS · arXiv:2606.16969

Chinese title: 神经音频编解码器低帧率降质问题研究

Authors: Alex Gichamba, Moise Busogi

Abstract:

Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.

Abstract (Chinese):

低帧率神经音频编解码器对于自回归语音合成具有吸引力,因为其生成成本随序列长度线性增长。近期研究表明编解码器可在12.5 Hz甚至更低的帧率下运行,但低帧率降质的潜在机制仍缺乏充分理解。本文通过受控的帧率消融实验探究这些机制。我们复现了先前工作中报道的6.25 Hz质量悬崖现象,并评估了两种候选解释:音素冲突和码本饱和,均未发现根本性障碍的证据。该质量悬崖实际上是由次优训练配置所致:训练时固定片段时长导致低帧率下令牌数量过少,使解码器缺乏足够的令牌间上下文。修正该问题后,在帧率降至3.1 Hz和1.6 Hz的过程中,词错误率随音素负载平滑恶化,表明低帧率编解码器的推理时效率增益比此前假设的更容易实现。
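
The training-configuration issue described above comes down to how many tokens a fixed-length clip yields at each frame rate. The sketch below illustrates that arithmetic only. The 2-second clip length and the 25-token target are hypothetical values, not taken from the paper.

```python
def tokens_per_clip(frame_rate_hz, clip_seconds):
    """Number of whole codec frames (tokens per codebook) in one training clip."""
    return int(frame_rate_hz * clip_seconds)


def clip_seconds_for_tokens(frame_rate_hz, target_tokens):
    """Clip duration needed to keep a fixed token count at a given frame rate."""
    return target_tokens / frame_rate_hz


CLIP_SECONDS = 2.0   # hypothetical fixed clip length
TARGET_TOKENS = 25   # hypothetical token budget to hold constant

for hz in (12.5, 6.25, 3.125, 1.5625):
    fixed = tokens_per_clip(hz, CLIP_SECONDS)
    needed = clip_seconds_for_tokens(hz, TARGET_TOKENS)
    print(f"{hz} Hz: {fixed} tokens per fixed clip, {needed} s for {TARGET_TOKENS} tokens")
```

Each halving of the frame rate halves the tokens in a fixed clip, so keeping the decoder's inter-token context constant requires doubling the clip duration.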

MapDream: Task-Driven Map Learning for Vision-Language Navigation

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV, cs.RO · arXiv:2602.00222

Chinese title: MapDream:视觉语言导航的任务驱动地图学习

Authors: Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Deying Li, Zhaoxin Fan

Abstract:

Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.

Abstract (Chinese):

视觉语言导航(Vision-Language Navigation, VLN)要求智能体在部分观测的3D环境中遵循自然语言指令,这促使人们研究能够聚合局部感知之外的空间上下文的地图表征。然而,大多数现有方法依赖于独立于导航策略构建的人工设计地图。我们认为地图应该是直接由导航目标塑造的习得表征,而非穷举式重建。基于这一洞察,我们提出了MapDream,一个地图嵌入循环框架,将地图构建形式化为自回归鸟瞰图(BEV)图像合成。该框架联合学习地图生成和动作预测,将环境上下文蒸馏为紧凑的三通道BEV地图,仅保留导航关键的可供性。监督预训练引导建立可靠的地图到控制接口,而自回归设计使得通过强化微调进行端到端联合优化成为可能。在R2R-CE和RxR-CE数据集上的实验取得了最先进的单目性能,验证了任务驱动的生成式地图学习。

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV · arXiv:2605.21028

Chinese title: DySink:用于自回归长视频生成的动态帧Sink

Authors: Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

Abstract:

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.

Abstract (Chinese):

自回归长视频生成通常采用有界内存流式处理以提高效率,通常结合局部窗口实现短期连续性,并使用静态早期帧sink作为长程锚点。然而,这种固定分配方式会将早期帧一直保留在缓存中,即使当前视觉状态已与其显著偏离,同时丢弃可能更相关的中间历史。因此,保留的长程上下文可能变得不那么适应,并使生成偏向过时的线索;在严重情况下,RoPE引起的相位重新对齐会导致头间注意力同质化并引发sink崩溃,即内容向sink帧回归。我们提出了DySink,一个基于检索的框架,它维护一个紧凑的记忆库,并选择视觉相关的历史帧作为动态帧sink。DySink将自适应检索与sink异常门相结合,该门检测所检索上下文上的过度头间共识并抑制易崩溃的上下文。在分钟级长视频上的实验表明,DySink在动态程度方面持续优于强基线,同时实现了更高的时间质量。代码和模型权重将发布于 https://github.com/yebo0216best/DySink。
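
The retrieval step (choosing visually relevant historical frames from a compact memory bank) can be sketched as a top-k similarity lookup. The abstract does not say which features or similarity DySink uses, so cosine similarity over toy 2-D frame embeddings is an assumption here. The function names are hypothetical and the sink anomaly gate is not modelled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_dynamic_sinks(memory_bank, query, k=2):
    """Return indices of the k stored frames most similar to the current frame."""
    scores = [cosine(frame, query) for frame in memory_bank]
    ranked = sorted(range(len(memory_bank)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy embeddings for four cached frames, plus the current frame as the query.
bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
current = [1.0, 0.1]
print(select_dynamic_sinks(bank, current, k=2))
```

Unlike a static early-frame sink, the selected indices change as the query drifts, so the long-range context follows the current visual state.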

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV, diffusion · arXiv:2606.11751

Chinese title: AnchorEdit:通过因果记忆在多轮图像编辑中保持时间一致性

Authors: Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

Abstract:

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

Abstract (Chinese):

多轮图像编辑对于迭代设计至关重要,但当前模型在连续步骤中常常面临身份漂移和错误累积的问题。尽管现有研究利用视频先验来保持一致性,但其对双向注意力的依赖从根本上与交互编辑的因果时序性质相矛盾。本文提出了 AnchorEdit,这是首个专门为高分辨率、长期多轮编辑设计的自回归(AR)扩散框架。AnchorEdit 通过三阶段训练课程弥合了视频先验与因果推理之间的差距:身份保留的单轮预训练、采用新型自 rollout 策略的因果 AR 强制微调以缓解曝光偏差,以及用于高效 4 步生成的一致性蒸馏。在推理过程中,我们引入了一种记忆机制来锚定初始主体身份,确保在延长编辑轨迹上的稳定外推。为了评估性能,我们构建了一个新的高分辨率多轮编辑基准测试,旨在压力测试长程稳定性。大量实验表明,AnchorEdit 实现了最先进的结果,即使在超过 10 轮交互后仍能保持优异的主体保真度和指令遵循能力。

MotionVLA: Vision-Language-Action Model for Humanoid Motion

2026-06-16T04:00:00 · autoregressive, cs.CV, cs.RO · arXiv:2606.15142

Chinese title: MotionVLA:用于人形运动的视觉-语言-动作模型

Authors: Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang

Abstract:

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.

Abstract (Chinese):

从场景图像和文本生成逼真的人形运动需要同时处理低频姿态语义和高频物理动力学。然而,许多现有方法使用单一共享码书对运动进行分词,强制将异构运动信号置于同一量化空间内。我们对人类运动数据的频域分析揭示了单码书量化与运动统计之间的明显不匹配:五个DCT系数可捕获93%的关节位置能量,但仅能捕获37%的关节速度能量,这可能导致量化偏向姿态统计而低估高频速度分量。第二个挑战在于将标准自回归模型适配以有效建模运动序列中的高频物理信号。因此,我们提出了DSFT,这是一种双流频率分词器,将运动分离为基底流和物理流,并分别通过DCT截断和BPE进行独立压缩。此外,我们提出了MotionVLA,这是一种基于Qwen3.5的模型,将基底和物理标记统一排列在同一序列中,其中物理标记在基底标记之后进行预测。在HumanML3D和MBench上的实验表明,尽管使用轻量级2B骨干网络,MotionVLA在HumanML3D上将多样性差距缩小了50%以上,并在MBench上提升了3.8%的运动-条件一致性,验证了频率感知双流解耦是自回归运动生成的有效方案。代码:https://github.com/AIGeeksGroup/MotionVLA。网站:https://aigeeksgroup.github.io/MotionVLA。
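
The frequency-domain observation (a few DCT coefficients capture most position energy but much less velocity energy) can be reproduced qualitatively on synthetic data. The sketch below uses a made-up one-joint trajectory, not real motion capture, so its numbers will not match the 93% and 37% reported in the abstract. The helper names are hypothetical.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II, written out directly (O(N^2), fine for short clips)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal)
        ))
    return coeffs


def energy_fraction(signal, keep):
    """Share of total energy held by the first `keep` DCT coefficients."""
    coeffs = dct_ii(signal)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total


# Synthetic joint position: a slow arc plus a small fast jitter (assumed data).
n = 64
t = [i / (n - 1) for i in range(n)]
position = [math.sin(math.pi * x) + 0.02 * math.sin(2 * math.pi * 12 * x) for x in t]
# Finite-difference velocity; differentiation amplifies the fast component.
velocity = [(position[i + 1] - position[i]) * (n - 1) for i in range(n - 1)]

print("position:", round(energy_fraction(position, 5), 3))
print("velocity:", round(energy_fraction(velocity, 5), 3))
```

Because differentiation scales each frequency component by its frequency, the velocity signal carries proportionally more high-frequency energy. A truncation tuned on positions therefore under-represents velocities.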

GeoStream: Toward Precise Camera Controlled Streaming Video Generation

2026-06-16T04:00:00 · autoregressive, cs.CV · arXiv:2606.15162

Chinese title: GeoStream:面向精确相机控制的流式视频生成

Authors: Yizhou Zhao, Yifan Wang, Xiaoyuan Wang, Yushu Wu, Hao Zhang, Moayed Haji-Ali, Rameen Abdal, Ashkan Mirzaei, Yanyu Li, Willi Menapace, Laszlo Jeni, Sergey Tulyakov, Peter Wonka, Chaoyang Wang

Abstract:

Accurate interactive camera control is essential for video-based world models, but most existing approaches learn camera motion implicitly, leading to inaccurate control under out-of-distribution trajectories. Explicit geometric conditioning improves controllability, but existing methods are non-autoregressive and rely on a static 3D cache built from an initial frame, which becomes ineffective once the viewpoint moves beyond the original frustum. We propose GeoStream, a framework that enables precise metric-scale camera control in autoregressive streaming video generation. Our method maintains a self-refreshing 3D cache that is periodically updated online from the model's own outputs: we estimate depth from the most recently generated frame, unproject to 3D, and reproject into the target view to produce point reprojections as geometric conditioning for subsequent synthesis. By the same principle, the conditioning seen during training is also rendered from the student's own generated frames, yielding a fully on-policy distillation that naturally aligns the train and inference conditioning distributions. Unlike prior work that uses off-policy condition noising, our approach trains the model against the exact error distribution it encounters at inference, mitigating both standard autoregressive drift and the second-order geometric feedback loop that arises when the cache itself is derived from generated outputs. Quantitative and qualitative results show that our approach substantially improves camera controllability.

Abstract (Chinese):

准确的交互式相机控制对于基于视频的世界模型至关重要,但现有方法大多隐式学习相机运动,导致在分布外轨迹下控制不准确。显式几何条件可改善可控性,但现有方法采用非自回归方式且依赖由初始帧构建的静态3D缓存,一旦视角超出原始视锥体便失效。我们提出GeoStream,一个能够在自回归流式视频生成中实现精确度量尺度相机控制的框架。我们的方法维护一个自刷新的3D缓存,该缓存通过模型的自身输出在线定期更新:我们从最近生成的帧估计深度,反投影至3D空间,再重投影到目标视角,产生点重投影作为后续合成的几何条件。遵循相同原则,训练期间的条件渲染也来自学生模型自身生成的帧,实现了完全策略内蒸馏,自然地对齐了训练和推理的条件分布。与使用策略外条件噪声的先前工作不同,我们的方法针对推理时遇到的精确误差分布进行训练,缓解了标准自回归漂移以及当缓存本身由生成输出派生时产生的二阶几何反馈循环。定量和定性结果表明,我们的方法显著提升了相机可控性。
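
The geometric core of the cache refresh (estimate depth, unproject to 3D, reproject into the target view) follows standard pinhole-camera math. The sketch below shows that math for a single pixel under several assumptions: an ideal pinhole model, identical intrinsics for both views, and a known rigid source-to-target transform. All function names and numeric values are hypothetical, not GeoStream's code.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the source camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(point, rotation, translation):
    """Apply a rigid source-to-target transform: R @ p + t."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def project(point, fx, fy, cx, cy):
    """3D point in a camera frame -> pixel coordinates (assumes z > 0)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Move one source pixel into the target view using its depth."""
    p_src = unproject(u, v, depth, *intrinsics)
    p_tgt = transform(p_src, rotation, translation)
    return project(p_tgt, *intrinsics)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
INTRINSICS = (100.0, 100.0, 32.0, 24.0)  # fx, fy, cx, cy (made-up values)

# The point shifts by 0.1 m along x in the target frame; at depth 2 m that is 5 px.
print(reproject_pixel(10.0, 20.0, 2.0, INTRINSICS, IDENTITY, (0.1, 0.0, 0.0)))
```

Applying this to every pixel of the most recent frame gives the point reprojections used as conditioning. Errors in the estimated depth carry straight through to the reprojected positions, which is the feedback loop the abstract describes.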

CausalDrive: Real-time Causal World Models for Autonomous Driving

2026-06-16T04:00:00 · autoregressive, cs.CV, diffusion · arXiv:2606.15341

Chinese title: CausalDrive:自动驾驶实时因果世界模型

Authors: Tianyi Yan, Huan Zheng, Dubing Chen, Meizhi Qu, Yingying Shen, Lijun Zhou, Mingfei Tu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Cheng-zhong Xu, Jianbing Shen

Abstract:

World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on "oracle" future trajectories of all background agents, rendering them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive operates solely on the initial front-view frame, the ego-vehicle's trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions, enabling text-driven control over Driving Sociology, allowing users to dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture. This combines continuous flow-matching with a self-correcting distillation objective, achieving interactive speeds of 12 FPS. This breakthrough transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly mitigated collision artifacts, (2) large-scale Reinforcement Learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments validate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.

Abstract (Chinese):

世界模型作为扩展自动驾驶数据的有前景范式应运而生,然而现有视频生成模型作为交互式模拟器仍存在不足。布局条件渲染器依赖所有背景车辆的“神谕”未来轨迹,导致其严格意义上不具有响应性。相反,纯动作条件预测器缺乏对复杂交互的语义控制能力,且受制于极高的扩散延迟,阻碍了闭环策略学习。为弥合这一差距,我们提出CausalDrive,一个可控的实时基础驾驶世界渲染器。CausalDrive仅基于初始前视帧、自车轨迹和宏观文本提示进行操作。通过排除未来NPC布局,我们强制模型内在地预测因果交互,实现对驾驶社会学的文本驱动控制,使用户能够动态编排相同自车动作下的多样化反事实反应。为克服效率瓶颈并解决自回归生成中的协变量偏移问题,我们提出了一种新颖的Context-Forced DMD架构。该架构将连续流匹配与自纠正蒸馏目标相结合,达到12 FPS的交互速度。这一突破将被动视频生成器转变为可玩神经模拟器。我们在三个下游应用中展示其通用性:(1)显著降低碰撞伪影的生成式闭环评估;(2)由Video2Reward模块驱动的大规模强化学习后训练;(3)实时人在回路仿真。广泛实验表明,在CausalDrive的反应性场景中训练的策略在现实世界中展现出更优越的交互能力。

Shift-and-Sum Quantization for Visual Autoregressive Models

2026-06-16T04:00:00 · autoregressive, cs.CV, cs.LG · arXiv:2606.16131

Chinese title: 视觉自回归模型的平移求和量化

Authors: Jaehyeon Moon, Bumsub Ham

Abstract:

Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention-value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, inpainting, outpainting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.

Abstract (Chinese):

训练后量化(PTQ)能够使用少量数据实现深度网络的高效部署。然而,将其应用于视觉自回归模型(VAR)仍相对未被探索。我们发现将PTQ应用于VAR存在两个关键挑战:(i)注意力值乘积的重建误差较大,特别是在粗尺度上,高注意力分数出现更为频繁;(ii)由于校准数据有限,码本条目的采样频率与其预测概率之间存在差异。针对这些挑战,我们提出了一个专为VAR设计的PTQ框架。首先,我们引入了一种平移求和量化方法,通过聚合值令牌的、对称平移副本的量化结果来降低重建误差。其次,我们提出了一种校准数据的重采样策略,使码本条目的采样频率与其预测概率相一致。在类别条件图像生成、图像修复、图像扩展和类别条件编辑任务上的实验表明,我们的方法在各类VAR架构上均取得了一致的性能提升,树立了VAR后训练量化领域的新标杆。
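
The abstract describes aggregating quantized results from symmetrically shifted duplicates, but gives neither the shift size nor the aggregation rule. The sketch below is therefore only a generic illustration of why shifted duplicates can help, not the paper's algorithm. It averages a uniform round-to-nearest quantizer applied to `x + step/4` and `x - step/4`, both choices being assumptions.

```python
def quantize(x, step):
    """Uniform round-to-nearest quantizer with the given step size."""
    return step * round(x / step)


def shifted_average(x, step):
    """Average the quantized values of two symmetrically shifted copies of x."""
    shift = step / 4.0
    return 0.5 * (quantize(x + shift, step) + quantize(x - shift, step))


STEP = 0.25
samples = [i / 1000.0 for i in range(-1000, 1001)]

plain_err = max(abs(quantize(x, STEP) - x) for x in samples)
shifted_err = max(abs(shifted_average(x, STEP) - x) for x in samples)
print(f"plain max error:   {plain_err:.4f}")
print(f"shifted max error: {shifted_err:.4f}")
```

In this toy setting the averaged output lands on a grid twice as fine as the original one, so the worst-case error drops from `step / 2` to at most `step / 4`. The price is that the quantized values are processed twice.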

BadWorld: Adversarial Attacks on World Models

2026-06-16T04:00:00 · autoregressive, cs.CV · arXiv:2606.16519

Chinese title: BadWorld:面向世界模型的对抗性攻击

Authors: Linghui Shen, Mingyue Cui, Xingyi Yang

Abstract:

Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks for deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.

Abstract (Chinese):

视觉世界模型(VWMs)能够从单一上下文图像中合成交互式、动作条件的展开序列。然而,这些模型在面对对抗性扰动时的鲁棒性仍是一个开放性问题。标准对抗性攻击无法有效评估这一漏洞,因为攻击者缺乏真实未来视频且无法预测后续用户控制。我们提出了BadWorld,一个专为自回归VWMs设计的无标签对抗框架,系统性地克服了这两个限制。首先,为了规避对未来监督的需求,我们提出了自监督速度攻击,直接干扰模型的早期去噪动态。其次,为确保攻击能够泛化到不可预测的用户动作,我们构建了轨迹自适应双层优化,主动挖掘困难控制序列以生成控制无关扰动。在具有连续和离散控制的代表性VWMs上评估,BadWorld揭示了严重的结构脆弱性。视觉上难以区分的对抗图像可靠地触发未来展开序列的灾难性退化,导致去噪不完整、结构崩溃和控制不一致。这些发现揭示了VWMs部署在安全关键系统中的关键风险,同时为隐私保护提供了一种实用机制。

DreamX-World 1.0: A General-Purpose Interactive World Model

2026-06-16T04:00:00 · autoregressive, cs.CV · arXiv:2606.16993

Chinese title: DreamX-World 1.0:一个通用的交互式世界模型

Authors: DreamX Team, Yancheng Bai, Rui Chen, Xiangxiang Chu, Rujing Dang, Hao Dou, Bingjie Gao, Qiwen Gu, Siyu Hong, Jiachen Lei, Geng Li, Jifan Li, Ruimin Lin, Qingfeng Shi, Bingze Song, Lei Sun, Jing Tang, Ruitian Tian, Jun Wang, Jiahong Wu, Pengfei Zhang, Shen Zhang, Jiashu Zhu

Abstract:

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 and LingBot-World in overall score, which achieve 80.79 and 80.45, respectively.

摘要中文:

DreamX-World 1.0 是一个通用的交互式文本/图像到视频世界模型,用于可控的长时序生成。它支持相机导航、重访先前观察到的区域,以及跨真实感、游戏风格和风格化领域的可提示事件。我们的数据引擎结合了精确相机视角的 Unreal Engine 渲染、富含动作的游戏录制以及带有恢复相机几何结构的真实世界视频。对于相机控制,我们引入了 E-PRoPE,这是投影位置编码的一种轻量级变体,它保留了 PRoPE 的投影相机几何结构,同时对空间缩减的 token 应用相机感知注意力。我们使用因果强制、DMD 风格蒸馏和长时序 rollout 训练,将双向视频生成器转换为几步自回归世界模型。在自生成的长时序上下文上训练使模型接触到自身生成的历史,并减少了跨自回归块累积的风格和颜色漂移。记忆条件场景持久化通过基于相机几何结构的检索来检索更早的视图,而残差回收使条件路径对不完善的记忆潜在变量不那么敏感。事件指令调优增加了可组合的事件控制,强化学习对齐在蒸馏后恢复了相机控制和视觉质量。通过混合精度 DiT 执行、残差回收、75% 剪枝的 VAE 解码和异步流水线并行,DreamX-World 1.0 在八个 RTX 5090 GPU 上达到了最高 16 FPS。在我们的 5 秒基础评估中,DreamX-World 1.0 达到了 73.75 的相机控制分数和 84.76 的总体分数,在总体分数上超过了 HY-WorldPlay 1.5 和 LingBot-World,后者的分数分别为 80.79 和 80.45。

DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation

2026-06-16T04:00:00 · autoregressive, cs.CV, cs.GR, cs.RO, diffusion · 2606.14721

中文标题:DC-Motion:通过离散-连续令牌解耦语义与细节的人体动作生成

作者:Hequan Wang, Jiaxu Zhang, Zhengbo Zhang, Zhigang Tu

摘要:

Text-to-motion generation requires synthesizing physically realistic dynamics that strictly follow complex and long-horizon textual instructions. Existing approaches rely on homogeneous representation spaces that may fail to capture the hierarchical nature of human motion, with diffusion models struggling at compositional semantic reasoning and AR models sacrificing fine-grained physical details due to quantization. To solve it, we introduce DC-Motion, a factorized generative framework designed to explicitly decouple semantics and details via discrete-continuous tokens. A Discrete-Continuous VAE (DC-VAE) first decomposes motion into discrete tokens for semantics and continuous residuals for fine-grained dynamics. Then, a masked AR model predicts the discrete structure from text, and a lightweight residual diffusion model recovers the continuous physical details. Extensive experiments demonstrate that DC-Motion effectively improves the capability to follow complex instructions. By effectively balancing semantic controllability and physical realism, our approach offers a highly adaptable modeling paradigm for human motion generation. On both HumanML3D and KIT-ML datasets, DC-Motion achieves state-of-the-art performance, delivering the best FID for motion realism and R-precision for text alignment.

摘要中文:

文本到动作生成需要合成严格遵循复杂且长时序文本指令的物理真实动力学。现有的方法依赖于同质化表示空间,可能无法捕捉人体动作的层级特性,其中扩散模型在组合语义推理方面存在困难,而自回归模型则因量化而牺牲了细粒度的物理细节。为解决这一问题,我们提出了DC-Motion,一个通过离散-连续令牌显式解耦语义与细节的因子化生成框架。离散-连续VAE(DC-VAE)首先将动作分解为用于语义的离散令牌和用于细粒度动力学的连续残差。然后,掩码自回归模型从文本预测离散结构,轻量级残差扩散模型恢复连续的物理细节。大量实验表明,DC-Motion有效提升了遵循复杂指令的能力。通过有效平衡语义可控性和物理真实感,我们的方法为人体动作生成提供了一个高度适应性强的建模范式。在HumanML3D和KIT-ML数据集上,DC-Motion实现了最先进的性能,在动作真实性的FID和文本对齐的R-precision方面均取得了最佳结果。
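The discrete-continuous split described in this abstract can be illustrated with a toy quantizer. The sketch below is not the paper's DC-VAE: it assumes a fixed, hypothetical codebook (`CODEBOOK`) and plain nearest-neighbour lookup, and only shows how a feature separates into a discrete token plus a continuous residual that restores what quantization drops.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def split_discrete_continuous(vector, codebook):
    """Return (token, residual): a discrete index plus the continuous remainder."""
    token = nearest_code(vector, codebook)
    residual = [v - c for v, c in zip(vector, codebook[token])]
    return token, residual


def reconstruct(token, residual, codebook):
    """Add the residual back onto the selected code."""
    return [c + r for c, r in zip(codebook[token], residual)]


# Hypothetical 2-D codebook and feature, for illustration only.
CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
token, residual = split_discrete_continuous([0.9, 1.2], CODEBOOK)
restored = reconstruct(token, residual, CODEBOOK)
```

In this toy setting the token alone gives a coarse approximation, and adding the residual recovers the input up to floating-point error. In the abstract, separate models predict the two parts: a masked AR model for the tokens and a residual diffusion model for the details.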

Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

2026-06-16T04:00:00 · autoregressive, cs.CV, diffusion · 2602.04789

中文标题:轻量强制:通过稀疏注意力加速自回归视频扩散

作者:Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang

摘要:

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. Such two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2× to 1.3× end-to-end speedup). Combined with other efficient solutions, Light Forcing further achieves a 2.0× to 3.0× end-to-end speedup across diverse GPUs (e.g., 27.4 FPS on RTX 5090 and 33.9 FPS on H100). Code: https://github.com/chengtao-lv/LightForcing

摘要中文:

高级自回归视频生成模型显著提升了视觉保真度和交互性,但注意力机制的二次复杂度仍是高效部署的主要瓶颈。现有的稀疏注意力方案在双向模型上展现出良好的应用前景,但我们发现将其直接应用于自回归模型会导致显著的性能下降,原因包括:对分块生成的孤立考虑以及对历史信息上下文的利用不足。针对这些问题,我们提出了Light Forcing,这是首个专为自回归视频生成模型设计的稀疏注意力方案。该方法引入了一种分块感知增长机制,用于定量评估每个分块的贡献程度,进而确定其稀疏度分配。这种渐进式稀疏度增长策略使得当前分块在生成过程中能够继承先前分块的先验知识。此外,我们还提出了分层稀疏注意力机制,以粗到细的方式捕获信息丰富的历史和局部上下文。这种两级掩码选择策略(即帧级和块级)能够自适应地处理多样化的注意力模式。大量实验表明,我们的方法在质量(例如VBench上84.5分)和效率(例如端到端加速1.2至1.3倍)方面均优于现有稀疏注意力方法。结合其他高效方案,Light Forcing在多种GPU上进一步实现了2.0至3.0倍的端到端加速(例如RTX 5090上27.4 FPS,H100上33.9 FPS)。代码已通过链接发布。

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

2026-06-16T04:00:00 · autoregressive, cs.CV, diffusion · 2606.09150

中文标题:Ultra Flash:面向高分辨率的实时流式视频生成扩展方法

作者:Luxury, Jie Huang, Zihao Fan, Xiaoxiao Ma, Jun-hao Zhuang, Yuming Li, Zeyue Xue, Siming Fu, Haoran Li, Mingchen Zhong, Guohui Zhang, Shichen Ma, Yijun Liu, Jiaqi Shi, Yanwen Ma, Yaofeng Su, Haoyu Wang, Yaowei Li, Songchun Zhang, Weiyang Jin, Yuxuan Bian, Shiyi Zhang, Haojun Xu, Shuai Lu, Xin Han, Wei Tang, Haoyang Huang, Nan Duan

摘要:

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency. Project Page: https://xin1u.github.io/UltraFlash/

摘要中文:

尽管当前的自回归视频扩散模型已能实现优异的流式质量,但其仍受限于低分辨率(如480P),使得高效、可扩展的实时高分辨率视频生成成为一个亟待解决的核心挑战。为弥补这一差距,我们提出了Ultra Flash,一个能够实现实时高分辨率视频生成的级联流式框架。Ultra Flash在单GPU上于1K分辨率达到约30 FPS、于2K分辨率达到约18 FPS,主要通过以下三项技术贡献实现:(1)一种架构保持的T2V到TV2V超分辨率训练范式,结合面向AIGC的数据降质管道,能够有效保留基模型的生成能力,使其在与主流低分辨率生成模型级联时能够增强高分辨率细节;(2)一个因果流式潜在上采样器配合高分辨率解码器,在增强时空一致性的同时实现高效的潜在空间缩放和精确的高分辨率解码,且计算开销可忽略不计;(3)一种级联高分辨率流式视频生成优化方案,首先对超分辨率模型执行混合奖励增强的稀疏因果化和单步蒸馏,随后引入带动态缓存管理的级联流式自强制偏好优化,共同增强整体一致性、提升质量并实现实时高分辨率流式视频生成。大量实验表明,Ultra Flash能够可靠地生成超高分辨率流式视频,同时保持最先进的视觉质量和优异的效率。项目主页:https://xin1u.github.io/UltraFlash/

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

2026-06-16T04:00:00 · autoregressive, cs.CV, cs.RO · 2601.04061

中文标题:CLAP:用于从人类视频学习视觉-语言-动作模型的对比性潜动作预训练

作者:Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang

摘要:

Generalist Vision-Language-Action models remain constrained by the scarcity of robotic data relative to the abundance of human video demonstrations. Existing Latent Action Models attempt to use video data but often suffer from visual entanglement, encoding noise rather than manipulation skills. To address this limitation, we propose Contrastive Latent Action Pretraining (CLAP), a framework that first uses Act-VAE to learn an executable action-token vocabulary from robot trajectories and then aligns human visual transitions with this vocabulary through contrastive learning. This alignment maps unlabeled human videos into a physically grounded latent action space rather than reconstructing appearance. Building on the aligned tokens, we train CLAP-NTP as an autoregressive VLA using robot demonstrations and pseudo-labeled human videos, preserving instruction following and object generalization. For deployment and target-domain adaptation, we further introduce a post-training strategy that combines CLAP-RF, a Rectified Flow action head for low-latency continuous action chunk prediction, with Knowledge Matching regularization to preserve pretrained semantic knowledge during fine-tuning. Extensive experiments show that CLAP achieves strong performance against competitive baselines while enabling effective skill transfer from human videos to robotic execution.

摘要中文:

通用视觉-语言-动作模型受限于机器人数据的稀缺,而人类视频演示数据却非常丰富。现有的潜动作模型尝试使用视频数据,但经常遭受视觉纠缠问题,编码的是噪声而非操作技能。为解决这一局限性,我们提出了对比性潜动作预训练(CLAP),该框架首先使用Act-VAE从机器人轨迹学习可执行的动作token词汇表,然后通过对比学习将人类视觉转换与该词汇表对齐。这种对齐将未标记的人类视频映射到物理基础的潜动作空间,而非重建外观。基于对齐的token,我们使用机器人演示和伪标记人类视频训练CLAP-NTP作为自回归视觉-语言-动作模型,保留了指令跟随和对象泛化能力。对于部署和目标域适应,我们进一步提出了一种后训练策略,结合CLAP-RF(一种用于低延迟连续动作块预测的修正流动作头)与知识匹配正则化,以在微调期间保留预训练的语义知识。大量实验表明,CLAP在与竞争性基线的对比中取得了强劲的性能,同时实现了从人类视频到机器人执行的有效技能迁移。

diffusion
Diffusion
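68 papers

Today's Diffusion papers: overview

Today's Diffusion papers span a wide range of topics and cluster around three directions. The first is long video generation and control: Steady-Forcing, for example, balances spatial persistence against motion continuity to improve long-horizon nature video generation. The second is reinforcement learning and policy learning: Trust-Region Diffusion Policies, VANDERER, and related work apply diffusion models to robot visual navigation and policy learning. The third is model efficiency and fairness: Divide-and-Denoise uses a game-theoretic method to compose models fairly, and Null-Space Diffusion Distillation improves inference speed and imaging quality. On the theory side, a Wasserstein convergence analysis gives ODE-based samplers a mathematical footing. Image editing, 3D generation, and medical imaging also see new work.

  • Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion - Balances spatial persistence and motion continuity to address temporal consistency in long-horizon nature video generation; a useful reference for video diffusion models.
  • Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models - Solves a fair-division game to split each noisy sample among several pre-trained diffusion models, so that no single model dominates the composition.
  • Trust-Region Diffusion Policies for Massively Parallel On-Policy RL - Combines a trust-region update with diffusion policies for massively parallel on-policy RL, matching or outperforming strong baselines across 73 tasks.
  • Wasserstein Convergence of ODE-Based Samplers in Decentralized Diffusion Model via Velocity Field Decomposition - Analyzes the Wasserstein-2 convergence of ODE-based samplers via velocity field decomposition, a contribution to diffusion model theory.
  • MoECa: Aligning Feature Reuse with Expert Decomposition in Diffusion Transformers - Aligns feature reuse with expert decomposition to improve the efficiency and performance of diffusion Transformers.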

Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV, cs.LG, cs.MM, diffusion · 2606.14732

中文标题:Steady-Forcing:平衡长时序自然视频扩散中的空间持久性与运动连续性

作者:Matiur Rahman Minar, Seunghun Oh, GangHyeon Jeong, Unsang Park

摘要:

Autoregressive video diffusion models enable streaming generation but often degrade over long rollouts: static scene layouts drift, while mechanisms that improve spatial stability tend to suppress motion, causing natural flows such as water, fire, or smoke to stagnate. We study this stability-motion trade-off in fixed-camera long-horizon nature video generation, where the two failure modes can be more clearly separated than in moving-camera settings. We propose Steady-Forcing, a memory and training framework combining a persistent visual anchor (V-Sink), an exponential moving-average motion memory (EMA-Sink), block-relative temporal encoding, periodic cache purification, and distillation from a Wan2.1-14B teacher with motion-rewarded priors under task-focused configurations. Together, these components are designed to preserve background identity while sustaining visually plausible fluid dynamics over multi-minute autoregressive rollouts. Evaluations across seven baselines show that Steady-Forcing improves long-horizon background consistency and imaging quality, while a blind user study indicates stronger perceived stability and motion continuity. The benchmark evaluation further suggests that generic VBench aggregate scores under-penalize fixed-camera artifacts and reward drift-induced optical flow as Dynamic Degree, while not directly penalizing texture hardening or flow stagnation, which motivates future task-specific benchmarks for static-camera nature-flow evaluation. Project page: https://minar09.github.io/steadyforcing/

摘要中文:

自回归视频扩散模型支持流式生成,但在长时序展开过程中往往会出现质量退化:静态场景布局发生漂移,而提高空间稳定性的机制往往会抑制运动,导致水、火、烟雾等自然流动停滞。本文在固定摄像头的长时序自然视频生成任务中研究这一稳定性-运动权衡问题,该场景下两种失败模式比移动摄像机设置更容易区分。我们提出了Steady-Forcing,一个结合了持久视觉锚点(V-Sink)、指数移动平均运动记忆(EMA-Sink)、块相对时间编码、周期性缓存净化以及基于运动奖励先验的任务聚焦配置下从Wan2.1-14B教师模型蒸馏的训练框架。这些组件共同设计用于在数分钟的自回归生成过程中保持背景一致性同时维持视觉上可信的流体动力学。七个基线模型的评估表明,Steady-Forcing提高了长时序背景一致性和成像质量,而盲测用户研究显示其在感知稳定性和运动连续性方面表现更强。基准评估进一步表明,通用VBench综合分数对固定摄像头伪影的惩罚不足,同时将漂移导致的光流奖励为动态程度,却未直接惩罚纹理硬化或流动停滞——这促使未来需要针对静态摄像头自然流评估的专用基准测试。项目主页:https://minar09.github.io/steadyforcing/
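The abstract names an exponential moving-average motion memory (EMA-Sink) but does not give its update rule. As a reminder of what a generic EMA does, here is a minimal sketch; the `decay` value, the two-element feature vectors, and the function name are assumptions for illustration and are not taken from the paper.

```python
def ema_update(memory, new_features, decay=0.9):
    """Generic EMA: keep `decay` of the old memory and blend in the new features."""
    return [decay * m + (1.0 - decay) * x for m, x in zip(memory, new_features)]


# Feed the same hypothetical per-chunk feature three times, starting from zeros.
memory = [0.0, 0.0]
for chunk_features in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    memory = ema_update(memory, chunk_features)
```

With a constant input the memory approaches that input geometrically: after `n` updates from zero it holds a fraction `1 - decay**n` of the input. A larger `decay` therefore gives a slower, smoother memory.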

Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models

2026-06-16T04:00:00 · cs.AI, cs.CV, cs.LG, diffusion · 2606.14756

中文标题:分割去噪:一种公平组合扩散模型的博弈论方法

作者:Abhi Gupta, Polina Barabanshchikova, Vikas Garg, Samuel Kaski, Tommi Jaakkola

摘要:

The abundance of pre-trained diffusion models provides an opportunity for composition. Combining several models, however, runs the risk of one model dominating or models disagreeing with each other. Here, we propose Divide-and-Denoise, a method for coordinating multiple pre-trained diffusion models during sampling. Much like managing a specialized workforce, our method creates a fair but efficient division of labor across models. Central to our method is the notion of an allocation which defines the responsibility of each model to every region of the noisy sample. At every timestep, we then denoise by (i) updating the allocation by solving a fair division game, where we divide the sample into regions that maximize total utility under fairness constraints, and (ii) aligning the models with this allocation, where we guide each model to denoise within its assigned region. This leads to a new composite denoising process that evolves in tandem with a division process. We evaluate Divide-and-Denoise on conditional image generation. Across several quality metrics, including the GenEval benchmark, our method outperforms baselines and resolves common failures including missing objects and mismatched attributes. Experiments show that Divide-and-Denoise utilizes each model's expertise without neglecting any other model.

摘要中文:

大量预训练扩散模型的出现为模型组合提供了机会。然而,组合多个模型存在模型主导或模型间不一致的风险。为此,我们提出了分割去噪(Divide-and-Denoise)方法,用于在采样过程中协调多个预训练扩散模型。我们的方法类似于管理一支专业化团队,在各模型间创建公平且高效的劳动分工。我们的方法核心是分配(allocation)概念,它定义了每个模型对噪声样本各区域的责任。在每个时间步,我们通过以下两步进行去噪:(i)通过求解公平分割博弈来更新分配,将样本分割为在公平约束下最大化总效用的区域;(ii)使各模型与此分配对齐,引导每个模型在其指定区域内进行去噪。这产生了一个与分割过程协同演化的新型组合去噪过程。我们在条件图像生成任务上评估了分割去噪方法。在包括GenEval基准在内的多项质量指标上,我们的方法优于基线方法,并解决了对象缺失和属性不匹配等常见失败案例。实验表明,分割去噪能够利用每个模型的专业能力,同时不忽视其他模型。

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV, diffusion · 2606.14792

中文标题:基于离散扩散模型的高效视觉-文本思维强化学习

作者:Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

摘要:

RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.

摘要中文:

基于强化学习的后训练已被广泛采用,以使统一的文本和图像生成多模态模型具备交织视觉和文本推理能力。然而,现有方法大多基于自回归统一模型,这类模型在视觉推理过程中需要完整重建图像。本工作证明了多模态离散扩散模型是交织推理中强化学习的有效替代方案,其优势在于能够通过局部视觉编辑而非完整图像token重建来执行高效的视觉rollout。与自回归基线相比,这使GRPO的推演计算量减少了26.9%,同时性能损失极小。尽管效率有所提升,我们发现采用跨模态共享奖励信号的联合奖励分配会在强化学习更新中引入不相关图像和文本token序列之间的跨模态干扰。为解决这一问题,我们提出了分解奖励分配策略,即对文本和视觉段落独立分配奖励。使用分解奖励分配后,我们的强化学习方法相比联合奖励分配提升了11.2%,相比基础模型提升了38.04%。
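The contrast between joint and factorized reward assignment can be made concrete with a small sketch. It assumes GRPO-style group normalization (reward minus group mean, divided by group standard deviation) and a per-token boolean mask marking vision tokens. The function names and the exact normalization are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of rollouts (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def factorized_token_advantages(text_rewards, vision_rewards, vision_masks):
    """Give each token the advantage of its own modality.

    vision_masks[i][t] is True when token t of rollout i is a vision token.
    """
    adv_text = group_advantages(text_rewards)
    adv_vision = group_advantages(vision_rewards)
    return [
        [adv_vision[i] if is_vision else adv_text[i] for is_vision in mask]
        for i, mask in enumerate(vision_masks)
    ]


# Two hypothetical rollouts: the first has the better text, the second the better image.
masks = [[False, False, True, True], [False, True, True, False]]
adv = factorized_token_advantages([1.0, 0.0], [0.0, 1.0], masks)
```

Under a joint scheme both rollouts would receive one shared scalar for all tokens. Here the text tokens of the first rollout are pushed up while its vision tokens are pushed down, which is the kind of separation the abstract credits with reducing cross-modal interference.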

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

2026-06-16T04:00:00 · cs.AI, cs.LG, cs.RO, diffusion · 2606.14801

中文标题:QPILOTS:流策略的高效推理时Q导向方法

作者:Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart

摘要:

Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.

摘要中文:

流匹配和扩散策略是表达性强的动作生成器,但利用时序差分强化学习(RL)对其进行优化仍然困难。有效的策略提取需要利用评论家的动作梯度,但直接通过多步去噪过程反向传播该信号可能导致数值不稳定。现有方法通过丢弃梯度信息、将策略蒸馏为更简单的一步执行器,或在评论家改进时反复微调去噪策略来规避这一问题。我们提出QPILOTS方法,该方法保持原始策略不变,而在推理时引导去噪过程。在每个去噪步骤中,我们不在噪声中间动作上评估评论家(因为评论家预测不可靠),而是先将中间状态投影到最终干净动作的估计值,并在该位置计算评论家梯度。我们引入两种变体:QPILOTS-U使用快速单点近似,而QPILOTS-M通过学习的辅助网络绘制可微后验样本。在标准的离线到在线强化学习基准测试中,QPILOTS实现了最佳综合性能,在50个任务中达到90%的平均成功率。我们还将QPILOTS应用于引导大型冻结的预训练视觉语言动作(VLA)基础模型,在模拟中的六个操作任务上优于或匹配先前的推理时方法。
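The core idea in this abstract, evaluating the critic gradient at an estimate of the clean action rather than at the noisy intermediate one, can be sketched on a toy problem. The sketch makes several assumptions that are not from the paper: a linear flow path `a_t = (1 - t) * noise + t * a_clean`, so that `a_t + (1 - t) * velocity` is a one-shot estimate of the clean action; a toy quadratic critic with a hypothetical optimum `GOAL`; and a fixed steering `scale`. It also ignores the Jacobian of the projection for simplicity.

```python
GOAL = [1.0, -1.0]  # hypothetical optimum of the toy critic Q(a) = -||a - GOAL||^2


def critic_grad(action):
    """Gradient of the toy critic with respect to the action."""
    return [-2.0 * (a - g) for a, g in zip(action, GOAL)]


def steered_step(a_t, t, velocity, dt, scale=0.1):
    """One Euler step of the flow, nudged by the critic gradient."""
    # Project the noisy intermediate action to an estimate of the final clean action.
    a_clean_hat = [a + (1.0 - t) * v for a, v in zip(a_t, velocity)]
    # Evaluate the critic gradient at the estimate, not at the noisy a_t.
    grad = critic_grad(a_clean_hat)
    return [a + dt * (v + scale * g) for a, v, g in zip(a_t, velocity, grad)]


nudged = steered_step([0.0, 0.0], 0.0, [0.0, 0.0], 1.0)
```

The policy's velocity is used as given and only the sampling step changes, which mirrors the abstract's point that the original policy stays unmodified. When the projected action already sits at the critic's optimum, the gradient vanishes and the step reduces to the plain flow update.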

Inference-time Policy Steering via Vision and Touch

2026-06-16T04:00:00 · cs.AI, cs.LG, cs.RO, diffusion · 2606.14981

中文标题:基于视觉和触觉的推理时策略引导

作者:Yilin Wu, Zilin Si, Zeynep Temel, Oliver Kroemer, Andrea Bajcsy

摘要:

Inference-time steering adapts pre-trained generative robot policies during deployment by verifying candidate actions before execution. While prior methods typically perform this verification only with visual observations, vision alone is often insufficient for contact-rich manipulation, where success depends on both global task progress and subtle local interactions such as contact force. We introduce ViTaL, a visuo-tactile inference-time steering framework that formulates multimodal guidance as a bi-level optimization problem. At the high level, visual sampling-and-verification performs long-horizon mode selection, deciding what behavior the robot should execute. At the low level, tactile-guided diffusion editing refines the selected action sequence over a shorter horizon to satisfy local contact requirements. To support outcome-based steering, ViTaL learns a visuo-tactile latent world model and employs semantically aligned visual and tactile verifiers, including a novel text-conditioned tactile reward that scores predicted tactile futures directly in latent space. Across three real-world contact-rich manipulation tasks, ViTaL improves overall success by 51% over the base policy, outperforms unimodal steering by at least 33%, and exceeds naive multimodal fusion by at least 20%. Website: https://yilin-wu98.github.io/vital_website.

摘要中文:

推理时引导通过在执行前验证候选动作,对部署时的预训练生成式机器人策略进行适配。虽然现有方法通常仅使用视觉观测进行验证,但仅凭视觉往往不足以处理接触丰富的操作任务,因为成功既取决于全局任务进度,也取决于细微的局部交互(如接触力)。我们提出了ViTaL,一个视觉-触觉推理时引导框架,将多模态指导形式化为双层优化问题。在高层,视觉采样与验证执行长视野模式选择,决定机器人应执行何种行为。在低层,触觉引导的扩散编辑在较短视野内细化所选动作序列,以满足局部接触要求。为支持基于结果的引导,ViTaL学习了一个视觉-触觉潜在世界模型,并采用语义对齐的视觉和触觉验证器,包括一种新颖的文本条件触觉奖励,可在潜在空间直接对预测的触觉未来进行评分。在三个真实世界的接触丰富操作任务中,ViTaL相比基础策略将整体成功率提升了51%,优于单模态引导至少33%,并超越朴素多模态融合至少20%。网站:https://yilin-wu98.github.io/vital_website

MimicIK: Real-Time Generative Inverse Kinematics from Teleoperation with FK Consistency

2026-06-16T04:00:00 · cs.AI, cs.RO, diffusion · 2606.15148

中文标题:MimicIK: 基于遥操作的正向运动学一致性实时生成式逆运动学

作者:Jiahao Yang, Shenhao Yan, Fan Feng, Chengsi Yao, Ge Wang, Zhixin Mai, Yiming Zhao, Yatong Han

摘要:

Inverse kinematics (IK) remains a critical bottleneck for real-time robot manipulation. Classical numerical solvers achieve high geometric precision but often suffer from discontinuous branch switching and unstable behavior near kinematic singularities during closed-loop deployment. Meanwhile, learned IK approaches frequently struggle to balance spatial accuracy, motion smoothness, and real-time efficiency, particularly when trained on noisy human teleoperation data. We present MimicIK, a real-time generative inverse kinematics framework that learns smooth and robust joint-space motion priors from teleoperation demonstrations through conditional flow matching. Given the current joint configuration and a target end-effector pose, MimicIK predicts continuous delta-joint commands using an efficient two-step iterative refinement process based on a Minimal Iterative Policy (MIP) backbone. To enforce physical consistency, we further introduce an FK consistency loss, a differentiable forward-kinematics regularization that penalizes task-space deviations from the target pose during training. We evaluate MimicIK on a real-world 6-DOF robot dataset containing 8,848 teleoperation demonstrations. MimicIK achieves a mean position error of 4.65 mm, a 10 mm success rate of 92.01%, and a trajectory spike rate of only 7.99%. Compared with a UNet diffusion baseline, our method improves both spatial accuracy and motion smoothness while reducing inference latency from 21.66 ms to 6.74 ms. Furthermore, unlike deterministic MLP baselines that catastrophically diverge under out-of-distribution deployment, MimicIK remains stable near singular configurations and enables robust 20 Hz real-time control on deployment hardware.

摘要中文:

逆运动学(IK)仍是实时机器人操作的关键瓶颈。经典数值求解器虽能达到较高的几何精度,但在闭环部署中常因不连续分支切换和运动学奇异性附近的非稳态行为而表现不佳。与此同时,学习式逆运动学方法在平衡空间精度、运动平滑性和实时效率方面面临挑战,尤其是在基于噪声人类遥操作数据训练时更是如此。我们提出MimicIK,这是一个实时生成式逆运动学框架,通过条件流匹配从遥操作演示中学习平滑且稳健的关节空间运动先验。给定当前关节配置和目标末端执行器姿态,MimicIK使用基于最小迭代策略(MIP)主干的高效两步迭代细化过程来预测连续的关节增量命令。为强化物理一致性,我们进一步引入FK一致性损失,这是一种可微的正向运动学正则化方法,在训练时惩罚任务空间与目标姿态的偏差。我们在包含8,848条遥操作演示的真实世界6自由度机器人数据集上评估MimicIK。MimicIK实现了4.65 mm的平均位置误差、92.01%的10 mm成功率以及仅7.99%的轨迹尖峰率。与UNet扩散基线相比,我们的方法在空间精度和运动平滑性上均有提升,同时将推理延迟从21.66 ms降低至6.74 ms。此外,与确定性MLP基线在分布外部署时灾难性发散不同,MimicIK在奇异构型附近保持稳定,并能在部署硬件上实现20 Hz的稳健实时控制。
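To show what a forward-kinematics consistency term measures, the sketch below uses a planar two-link arm instead of the 6-DOF robot in the paper. The link lengths, the position-only target, and the squared-error form are assumptions for illustration. In training, the same expression would be written in an autodiff framework so that gradients flow back to the predicted joint deltas.

```python
import math


def forward_kinematics(q, link_lengths=(1.0, 1.0)):
    """End-effector (x, y) of a planar two-link arm with joint angles q in radians."""
    l1, l2 = link_lengths
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return (x, y)


def fk_consistency_loss(q_current, delta_q, target_xy):
    """Squared task-space distance between FK(q + delta_q) and the target position."""
    q_next = [q + dq for q, dq in zip(q_current, delta_q)]
    x, y = forward_kinematics(q_next)
    return (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2


# Bending the elbow by 90 degrees from the stretched pose reaches (1, 1).
good = fk_consistency_loss([0.0, 0.0], [0.0, math.pi / 2], (1.0, 1.0))
# Predicting no motion leaves the arm at (2, 0), away from the target.
bad = fk_consistency_loss([0.0, 0.0], [0.0, 0.0], (1.0, 1.0))
```

The loss is near zero only when the predicted joint delta actually brings the end effector to the target. This ties a joint-space prediction to a task-space goal.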

Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

2026-06-16T04:00:00 · cs.AI, cs.LG, diffusion · 2606.15260

中文标题:面向大规模并行同策略强化学习的信任域扩散策略

作者:Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann

摘要:

Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.

摘要中文:

大规模并行模拟的强化学习已成为开发健壮可部署策略的标准框架;然而,现有大多数方法仍依赖于简单的高斯策略参数化。扩散模型提供了更具表达力的策略类别,并在具有挑战性的控制问题上展现出优异性能,但大多数基于扩散的强化学习方法都是针对离线或异策略训练设计的。在本工作中,我们探讨扩散策略是否能够在大规模并行、同策略训练环境下得到有效训练。为此,我们提出了信任域扩散策略(TruDi),使扩散策略能够在大规模并行模拟中进行同策略强化学习训练。这一设置尤为挑战性,因为数据分布在更新过程中变化迅速,使得用复杂策略实现稳定训练变得困难。TruDi通过整合信任域优化规则来解决这一问题,对整个扩散轨迹执行KL散度约束。实证上,我们在一组包含总计73个任务的大规模并行强化学习基准上评估了TruDi。在这些任务中,TruDi在标准任务上始终优于或与强基线方法持平,并在更具挑战性的人形控制任务上取得了显著提升,为大规模并行同策略强化学习确立了新的强基线。

Distilling Drifting Transformers with Representation Autoencoders

2026-06-16T04:00:00 · cs.AI, cs.LG, diffusion · 2606.15553

中文标题:使用表示自编码器蒸馏漂移Transformer

作者:Jiawei Zhang, Mengfei Xia, Gen Li, Yuantao Gu

摘要:

Representation Autoencoders (RAEs) have improved diffusion and flow models through a semantically richer latent space, owing to the strongly label-wise clustered DINO features in the pretrained encoders. Yet in the distillation stage, the severe anisotropy and large curvatures caused by the rich semantic representations hinder convergence and performance, making trajectory-based distillation unstable. In this work, we argue that the RAE latent space is compatible with distillation via the newly proposed Drifting Models. We first quantitatively study the curvature and isotropy statistics across different autoencoders, and theoretically reveal that the Drifting Model itself is highly likely to fail on extremely scattered spaces like those of reconstruction-based VAEs. These findings motivate us to apply the drifting paradigm directly to representation autoencoders. Our proposed method, Drift-RAE, distills pretrained flow models in RAE latent spaces using Drifting, together with insightful modifications that improve training stability by theoretically aligning drifting fields with other frameworks. Experimentally, we achieve 1.77 FID on the ImageNet 256 dataset using only 10k distillation steps, surpassing state-of-the-art RAE distillation methods and performing comparably to the original Drifting Model without requiring an auxiliary MAE feature extractor. The code will be made publicly available.

摘要中文:

表示自编码器(RAEs)通过预训练编码器中具有强标签级聚类特性的DINO特征,实现了语义更丰富的潜空间,从而提升了扩散模型和流模型的性能。然而,在蒸馏阶段,丰富的语义表示所导致的严重各向异性和大曲率会阻碍收敛和性能,使得基于轨迹的蒸馏不稳定。在本文中,我们认为RAE潜空间与新提出的漂移模型是兼容的。我们首先定量研究了不同自编码器之间的曲率和各向同性统计,并从理论上揭示了漂移模型本身在类似重建型VAE这样的极度分散空间中很可能失败。这些发现促使我们将漂移范式直接应用于表示自编码器。我们提出的方法Drift-RAE,在RAE潜空间中使用漂移技术蒸馏预训练流模型,并进行了理论性的改进,通过将漂移场与其他框架进行理论对齐来提高训练稳定性。在实验验证方面,我们仅使用10k蒸馏步骤就在ImageNet 256数据集上实现了1.77的FID,超越了目前最先进的RAE蒸馏方法,并与原始漂移模型表现相当,同时无需使用辅助的MAE特征提取器。代码将公开发布。

SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV, diffusion · 2606.15819

中文标题:SACE:视觉自回归模型语义奇点处的概念抹除

作者:Siya Yang, Nanxiang Jiang, Zhaoxin Fan, Yunfeng Diao

摘要:

The rapid progress of visual autoregressive (VAR) models has unlocked a transformative frontier for high-fidelity text-to-image synthesis, while heightening concerns over the safety alignment of generated content. Naive application of existing erasure techniques to VAR models causes catastrophic semantic collapse and visual artifacts, since they are predominantly designed for the homogeneous denoising steps of diffusion models. To address this foundational challenge, we first propose the Semantic Singularity Axiom, which posits that any target semantic concept embedded within a prompt is definitively locked at Scale-0. We then rigorously validate this axiom through our proposed Incremental Semantic Saliency Analysis (ISSA), which also enables the community to transparently inspect the coarse-to-fine semantic injection process. Guided by this insight, we introduce the first scale-aware concept erasure framework (SACE) for VAR models. By strictly confining interventions to the first scale, our approach couples an Entropy-Regularized Erasure Objective to prevent high-entropy sampling degeneration, alongside a restorative preservation loss to safely anchor the integrity of entangled benign priors. Extensive experiments demonstrate that our method achieves surgical concept erasure performance across various domains with minimal training overhead, resolving the critical safety vulnerabilities inherent in emerging VAR architectures in a timely and elegant manner. Code is available at: https://github.com/limerenceysy/SACE

摘要中文:

视觉自回归(VAR)模型的快速发展为高保真文本到图像合成开辟了变革性前沿,同时也加剧了生成内容安全对齐方面的担忧。将现有抹除技术简单应用于VAR模型会导致灾难性的语义崩溃和视觉伪影,因为这些技术主要针对扩散模型的同质化去噪步骤。为解决这一基础性挑战,我们首先提出语义奇点公理,该公理认为嵌入在提示词中的任何目标语义概念都明确锁定在Scale-0。随后,我们通过提出的增量语义显著性分析(ISSA)严格验证了这一公理,从而使社区能够透明地检查粗到细的语义注入过程。在此洞察的指引下,我们为VAR模型引入了首个尺度感知概念抹除框架(SACE)。通过严格将干预限制在第一个尺度,我们的方法结合了熵正则化抹除目标以防止高熵采样退化,同时配合恢复性保留损失来安全地锚定纠缠良性先验的完整性。大量实验表明,我们的方法在各种领域实现了精确的概念抹除性能,且训练开销极小,及时而优雅地解决了新兴VAR架构固有的关键安全漏洞。代码见:https://github.com/limerenceysy/SACE

Wasserstein Convergence of ODE-Based Samplers in Decentralized Diffusion Model via Velocity Field Decomposition

2026-06-16T04:00:00 · cs.AI, cs.LG, diffusion · 2606.15835

中文标题:基于速度场分解的去中心化扩散模型中基于常微分方程采样器的Wasserstein收敛性

作者:Chencheng Tang, Xuanyu Xue, Fangyikang Wang, Chao Zhang, Hubery Yin

摘要:

Diffusion models have achieved impressive empirical success in generative tasks, and their convergence theory is now relatively well understood. Motivated by privacy and scalability, recent decentralized diffusion architectures replace a single global velocity field with multiple local experts and a routing mechanism, yielding a sampling dynamics with stochastic expert switching that falls outside standard diffusion convergence analyses. In this work, we study a decentralized diffusion framework with stochastic velocity fields and ODE-based sampling. We establish a convergence guarantee in Wasserstein-2 distance, showing that the distribution of the $N$-step discretization converges to the analytical solution at rate $\mathcal{O}(N^{-1/2}+\varepsilon)$ in $W_2$, where $\varepsilon$ captures the neural approximation errors. To our knowledge, this is the first $W_2$ convergence result for decentralized diffusion models with an ODE-based sampling scheme.

摘要中文:

扩散模型在生成任务中取得了令人瞩目的实证成功,其收敛理论目前已得到相对充分的理解。出于隐私和可扩展性的考虑,最近的去中心化扩散架构用多个本地专家和一个路由机制取代了单一的全局速度场,产生了一种具有随机专家切换的采样动力学,这超出了标准扩散收敛分析的范围。在本文中,我们研究了一个具有随机速度场和基于常微分方程采样的去中心化扩散框架。我们在Wasserstein-2距离上建立了收敛保证,表明N步离散化的分布以O(N^{-1/2}+ε)的速率收敛到解析解,其中ε表示神经近似误差。据我们所知,这是关于具有基于常微分方程采样方案的去中心化扩散模型的第一个W2收敛结果。

ControlMap: Controllable High-Definition Map Generation for Traffic Scenario Simulation

2026-06-16T04:00:00 · cs.AI, cs.RO, diffusion · 2606.15930

中文标题:ControlMap:面向交通场景仿真的可控高清地图生成

作者:Marwan Farag, Steffen Wäldele, Yu Yao

摘要:

Simulation is central to validating autonomous driving systems, yet current pipelines are limited by insufficient scenario diversity due to costly High Definition (HD) map creation. Scaling HD maps requires expensive data collection and manual processing. Moreover, existing generative models lack the fine-grained control necessary to target specific road topologies during generation. This paper presents a data-driven pipeline for controllable HD map generation using latent diffusion and ControlNet for spatial conditioning. To our knowledge, we are the first to inject spatial guidance signals into a diffusion model for HD map synthesis. Furthermore, our model supports adjustable conditioning strength through classifier-free guidance and city-level style transfer via city label conditioning. To complement existing metrics, we introduce two novel metrics to evaluate adherence to the control signal and similarity to ground-truth maps. Experiments demonstrate that our model generates realistic HD maps that faithfully follow input road topologies while accurately preserving city-specific details.

摘要中文:

仿真是验证自动驾驶系统的核心环节,但当前流程因高精地图(HD Map)创建成本高昂而导致场景多样性不足。高精地图的规模化扩展需要昂贵的数据收集和人工处理。此外,现有生成模型缺乏在生成过程中针对特定道路拓扑进行细粒度控制的能力。 本文提出了一种基于潜在扩散和ControlNet空间条件控制的可控高精地图生成数据驱动流程。据我们所知,我们是首个将空间引导信号注入扩散模型用于高精地图合成的研究。此外,我们的模型支持通过无分类器引导实现可调的条件强度,并通过城市标签条件实现城市级风格迁移。为补充现有指标,我们引入了两个新指标来评估对控制信号的遵循程度以及与真实地图的相似度。实验表明,我们的模型生成的逼真高精地图能够忠实遵循输入的道路拓扑结构,同时准确保留城市特定细节。
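The "adjustable conditioning strength through classifier-free guidance" mentioned in the abstract refers to the standard classifier-free guidance combination of conditional and unconditional noise predictions. The sketch below shows only that generic formula; the name `cfg_combine` is hypothetical, and how ControlMap wires it to its ControlNet branch is not specified in the abstract.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction and
    move toward (or beyond) the conditional one by `guidance_scale`."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with made-up predictions.
eps_u = np.array([0.0, 1.0])
eps_c = np.array([1.0, 3.0])
guided = cfg_combine(eps_u, eps_c, guidance_scale=2.0)
```

A scale of 0 ignores the condition, 1 reproduces the conditional prediction, and values above 1 extrapolate past it, which is what makes the conditioning strength adjustable.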

Training and Evaluating Diffusion Policies with Long Context Lengths

2026-06-16T04:00:00 · cs.AI, cs.RO, diffusion · 2606.16447

中文标题:长上下文长度扩散策略的训练与评估

作者:Abhinav Agarwal, Adam Wei, Taylan Kargin, Michael Zeng, Cole Becker, Arif Kerem Dayi, Pablo Parrilo, Asuman Ozdaglar, Russ Tedrake

摘要:

Imitation learning has enabled highly dexterous robotic manipulation from RGB observations. Policies trained with these methods, however, typically condition robot actions on only a short history of observations. These policies cannot solve tasks that require memory and can get stuck repeatedly executing the same failing motions. In this work, we first benchmark policy performance as context length is incrementally increased from short to long, across a spectrum of tasks with varying local stability and memory requirements, and in multiple data regimes. To our knowledge, this is the first study to investigate context length in imitation learning at this level of detail. Our results challenge prior claims: naively scaling context length is not as brittle as advertised in the literature. With an appropriate conditioning method and denoising backbone (UNet+Cross-Attention), single-task policies achieve high success rates on many tasks in the usual data regime even with naive scaling. Next, we propose a training algorithm to jointly train policies at multiple context lengths, further reducing the sample complexity of long-context learning. Finally, we apply our findings to re-evaluate some previously proposed solutions to long-context imitation learning.

摘要中文:

模仿学习已能够从RGB观测中实现高度灵巧的机器人操作。然而,使用这些方法训练的策略通常仅以较短的观测历史作为条件来执行机器人动作。这类策略无法解决需要记忆的任务,可能会陷入重复执行相同失败动作的困境。在本研究中,我们首先对策略性能进行基准测试,在局部稳定性和记忆需求各异的任务谱系中,以及多种数据条件下,将上下文长度从短到长逐步增加。据我们所知,这是首次在如此详细的层面研究模仿学习中的上下文长度。我们的结果对先前的主张提出了挑战:简单地扩展上下文长度并不像文献中所宣称的那样脆弱。通过适当的条件化方法和去噪骨干网络(UNet+Cross-Attention),即使采用简单扩展,单任务策略在常规数据条件下也能在许多任务中取得高成功率。随后,我们提出了一种在多个上下文长度下联合训练策略的算法,进一步降低了长上下文学习的样本复杂度。最后,我们将研究发现应用于重新评估先前提出的若干长上下文模仿学习解决方案。

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

2026-06-16T04:00:00 · cs.AI, cs.CL, diffusion · 2606.16568

中文标题:快速判断何时,仔细判断谁:基于扩散增强的双过程多方轮换机制

作者:Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

摘要:

Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide Hold or Shift and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.

摘要中文:

可靠的轮换机制对于口语对话系统至关重要。然而,大多数现有方法都是为双人交互设计的,难以处理包含话语重叠和快速说话人切换的真实多方音频。我们在VoxConverse数据集上研究多方轮换问题,并提出了一个纯音频的两阶段管道,将何时触发轮换边界与话语权是否实际转移这两个问题解耦处理。快速触发器扫描音频并提出候选的轮换结束时间点,而轻量级验证器仅在这些时间点运行,以决定保持话语权(Hold)或转移话语权(Shift),并支持下一说话人预测。我们报告了完整多方设置下的实验结果,以及为便于比较而采用的受控双人 top-2 投影设置下的结果。我们还研究了基于扩散模型的标签保持背景音频混合作为数据增强策略。实验结果显示,相较于基线模型,话语权转移检测性能有所提升,而扩散增强策略带来了进一步的性能改进。
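The two-stage structure described above (a cheap trigger that proposes candidate end-of-turn times, and a verifier that runs only at those times) can be sketched as follows. Everything here is an illustrative assumption: the energy-drop trigger, the function names, and the stand-in verifier are hypothetical and are not the paper's models.

```python
def propose_boundaries(energy, threshold):
    """Fast trigger: frame indices where energy falls below the threshold."""
    return [i for i in range(1, len(energy))
            if energy[i - 1] >= threshold and energy[i] < threshold]

def two_stage_turn_taking(energy, threshold, verifier):
    """Run the (more expensive) verifier only at the trigger's candidates."""
    return [(i, verifier(i)) for i in propose_boundaries(energy, threshold)]

# Toy usage: frame energies and a stand-in verifier (both made up).
energy = [0.9, 0.8, 0.1, 0.1, 0.7, 0.2]
decisions = two_stage_turn_taking(
    energy, 0.5, verifier=lambda i: "SHIFT" if i == 5 else "HOLD")
```

The point of the split is cost: the verifier is evaluated at two frames here rather than at all six.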

Sample from What You See: Visuomotor Policy Learning via Diffusion Bridge with Observation-Embedded Stochastic Differential Equation

2026-06-16T04:00:00 · cs.AI, cs.LG, diffusion · 2512.07212

中文标题:从所见中采样:通过嵌入观察的随机微分方程实现扩散桥的视觉运动策略学习

作者:Zhaoyang Liu, Mokai Pan, Zhongyi Wang, Kaizhen Zhu, Haotao Lu, Haipeng Zhang, Jingya Wang, Ye Shi

摘要:

Imitation learning with diffusion models has advanced robotic control by capturing the multi-modal action distributions. However, existing methods typically treat observations only as high-level conditions to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, the sampling is forced to begin from random noise, weakening the coupling between perception and control and often yielding suboptimal performance. We propose BridgePolicy, a generative visuomotor policy that directly integrates observations into the stochastic dynamics via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich and informative prior rather than random noise, substantially improving precision and reliability in control. A key difficulty is that diffusion bridge normally connects distributions of matched dimensionality, while robotic observations are heterogeneous and not naturally aligned with actions. To overcome this, we introduce a semantic aligner to unify the visual and state inputs and align the observations with action representations, making diffusion bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and 5 real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies. Our code is available at https://jianghcsr.github.io/BridgePolicy_page/.

摘要中文:

基于扩散模型的模仿学习通过捕捉多模态动作分布提升了机器人控制能力。然而,现有方法通常仅将观察作为去噪网络的高层条件,而未将其整合到扩散过程本身的随机动力学中。因此,采样被迫从随机噪声开始,削弱了感知与控制之间的耦合,往往导致性能不佳。我们提出BridgePolicy,这是一种生成式视觉运动策略,通过扩散桥公式直接将观察整合到随机动力学中。通过构建由观察引导的轨迹,BridgePolicy使采样能够从丰富且具有信息量的先验开始,而非随机噪声,从而显著提高了控制的精度和可靠性。一个关键难点在于扩散桥通常连接维度匹配的分布,而机器人观察是异构的,无法自然地与动作对齐。为此,我们引入语义对齐器来统一视觉和状态输入,并将观察与动作表示对齐,使扩散桥能够应用于异构机器人数据。在三个基准的52个模拟任务和5个真实世界任务上的广泛实验表明,BridgePolicy始终优于最先进的生成式策略。代码可访问https://jianghcsr.github.io/BridgePolicy_page/。

Region-Adaptive Sampling for Diffusion Transformers

2026-06-16T04:00:00 · cs.AI, cs.CV, diffusion · 2502.10389

中文标题:扩散变换器的区域自适应采样

作者:Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang

摘要:

Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable quality under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.

摘要中文:

扩散模型已成为跨领域生成任务的首选方案。然而,其对多次顺序前向传递的依赖严重限制了实时性能。先前的加速方法主要聚焦于减少采样步骤数量或复用中间结果,未能利用图像内空间区域的差异性,这是因为受到了卷积U-Net结构的约束。通过利用扩散变换器(DiTs)在处理可变数量标记时的灵活性,我们提出了一种名为RAS的全新无需训练的采样策略,该策略根据DiT模型的关注点动态为图像内的不同区域分配不同的采样比率。我们的关键观察是,在每个采样步骤中,模型会聚焦于语义有意义的区域,而这些关注区域在连续步骤之间表现出很强的连续性。利用这一洞察,RAS仅更新当前处于聚焦状态的区域,其他区域则使用前一步骤缓存的噪声进行更新。模型的关注点基于前一时刻的输出确定,利用了我们观察到的时间一致性。我们在Stable Diffusion 3和Lumina-Next-T2I上评估了RAS,分别实现了高达2.36倍和2.51倍的加速,而生成质量的下降微乎其微。此外,用户研究表明,RAS在实现1.6倍加速的同时提供了相当的质量。我们的方法朝着更高效的扩散变换器迈出了重要一步,增强了其在实时应用中的潜力。
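The update rule described in the abstract (recompute noise only for tokens currently in focus, reuse cached noise elsewhere) can be sketched on a flat token array. This is a simplified illustration under assumptions: `ras_step`, the `predict_noise` callable, the way `focus_scores` is obtained, and the plain Euler-style update are hypothetical stand-ins, not the RAS implementation.

```python
import numpy as np

def ras_step(latents, cached_noise, focus_scores, predict_noise, ratio, step_size):
    """One region-adaptive step over tokens of shape (n_tokens, dim).

    Only the top `ratio` fraction of tokens by focus score is passed through
    the model; the remaining tokens reuse the noise cached from the last step.
    """
    n_tokens = latents.shape[0]
    n_active = max(1, int(round(ratio * n_tokens)))
    active = np.argsort(focus_scores)[-n_active:]
    noise = cached_noise.copy()
    noise[active] = predict_noise(latents[active])  # fresh prediction for focused tokens
    new_latents = latents - step_size * noise
    return new_latents, noise, active

# Toy usage: 4 tokens, the model is replaced by an identity stand-in.
latents = np.ones((4, 2))
cache = np.zeros((4, 2))
scores = np.array([0.1, 0.9, 0.2, 0.8])
new_latents, new_cache, active = ras_step(
    latents, cache, scores, predict_noise=lambda x: x, ratio=0.5, step_size=0.5)
```

The returned noise array doubles as the cache for the next step, so the saving comes from calling the model on only the active subset.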

Akasha 2: Hamiltonian State Space Duality and Visual-Language Joint Embedding Predictive Architecture

2026-06-16T04:00:00 · cs.AI, cs.CV, diffusion · 2601.06212

中文标题:Akasha 2:哈密顿状态空间对偶性与视觉-语言联合嵌入预测架构

作者:Yani Meziani

摘要:

We present Akasha 2, a state-of-the-art multimodal architecture that integrates Hamiltonian State Space Duality (H-SSD) with Visual-Language Joint Embedding Predictive Architecture (VL-JEPA). The system leverages the Mamba-3 Selective State Space Model (SSM) augmented by a Sparse Mixture of Hamiltonian Experts (SMoE-HE) that enforces latent physical conservation laws through symplectic integration. For visual synthesis, we introduce Hamiltonian Flow Matching (HFM) and persistent 3D Gaussian Splatting (3DGS), enabling ultra-low latency (<50ms) on mobile hardware. This work establishes a new paradigm in latent world models, achieving unprecedented spatiotemporal coherence through a holographic memory architecture. Our approach demonstrates that incorporating physics-inspired inductive biases into neural architectures yields significant improvements: state-of-the-art video prediction (FVD: 287), 4x faster visual synthesis than diffusion models, and 3-18x inference speedup over transformer baselines while maintaining energy conservation over extended horizons.

摘要中文:

本文提出Akasha 2,一个将哈密顿状态空间对偶性(H-SSD)与视觉-语言联合嵌入预测架构(VL-JEPA)相融合的先进多模态架构。该系统采用Mamba-3选择性状态空间模型(SSM),并通过稀疏哈密顿专家混合(SMoE-HE)进行增强,利用辛积分强制执行潜在物理守恒定律。在视觉合成方面,本文引入哈密顿流匹配(HFM)和持久化3D高斯溅射(3DGS)技术,能够在移动硬件上实现超低延迟(<50ms)。本研究在潜在世界模型领域建立了新的范式,通过全息记忆架构实现了前所未有的时空一致性。我们的方法表明,将物理启发的归纳偏置融入神经架构可带来显著提升:实现了最先进的视频预测性能(FVD:287),视觉合成速度比扩散模型快4倍,推理速度比Transformer基线快3-18倍,同时在长时间范围内保持能量守恒。

Learning Permutation Distributions via Reflected Diffusion on Ranks

2026-06-16T04:00:00 · cs.AI, cs.LG, diffusion · 2603.17353

中文标题:通过排名上的反射扩散学习置换分布

作者:Sizhuang He, Yangtian Zhang, Shiyang Zhang, David van Dijk

摘要:

The finite symmetric group S_n provides a natural domain for permutations, yet learning probability distributions on S_n is challenging due to its factorially growing size and discrete, non-Euclidean structure. Recent permutation diffusion methods define forward noising via shuffle-based random walks (e.g., riffle shuffles) and learn reverse transitions with Plackett-Luce (PL) variants, but the resulting trajectories can be abrupt and increasingly hard to denoise as n grows. We propose Soft-Rank Diffusion, a discrete diffusion framework that replaces shuffle-based corruption with a structured soft-rank forward process: we lift permutations to a continuous latent representation of order by relaxing discrete ranks into soft ranks, yielding smoother and more tractable trajectories. For the reverse process, we introduce contextualized generalized Plackett-Luce (cGPL) denoisers that generalize prior PL-style parameterizations and improve expressivity for sequential decision structures. Experiments on sorting and combinatorial optimization benchmarks show that Soft-Rank Diffusion consistently outperforms prior diffusion baselines, with particularly strong gains in long-sequence and intrinsically sequential settings.

摘要中文:

有限对称群S_n为置换提供了自然域,但由于其阶乘级增长的规模和离散、非欧几里得结构,学习S_n上的概率分布具有挑战性。近期的置换扩散方法通过基于洗牌的随机游走(如交错洗牌)定义前向噪声,并使用Plackett-Luce(PL)变体学习反向转换,但产生的轨迹可能较为突兀,且随着n增大而难以去噪。我们提出Soft-Rank Diffusion,这是一种离散扩散框架,用结构化的软排名前向过程替代基于洗牌的损坏:我们将置换提升为顺序的连续潜在表示,将离散排名松弛为软排名,从而获得更平滑、更易处理的轨迹。对于反向过程,我们引入了情境化广义Plackett-Luce(cGPL)去噪器,它推广了先前的PL风格参数化,并提高了序贯决策结构的表达能力。在排序和组合优化基准测试上的实验表明,Soft-Rank Diffusion始终优于先前的扩散基线,在长序列和内在序贯设置中表现尤为突出。
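The idea of "relaxing discrete ranks into soft ranks" can be illustrated with a common pairwise-sigmoid relaxation. This is one standard construction chosen for illustration; the abstract does not say which relaxation Soft-Rank Diffusion uses, and `soft_rank` and the temperature `tau` are hypothetical names.

```python
import numpy as np

def soft_rank(scores, tau=1.0):
    """Soft 0-based ascending rank: rank_i = sum over j != i of sigmoid((s_i - s_j) / tau).

    As tau -> 0 this approaches the hard rank; larger tau gives smoother values.
    """
    s = np.asarray(scores, dtype=float)
    diff = (s[:, None] - s[None, :]) / tau
    sig = 0.5 * (1.0 + np.tanh(0.5 * diff))  # numerically stable sigmoid
    return sig.sum(axis=1) - 0.5             # drop the self-comparison sigmoid(0) = 0.5

scores = [0.3, 2.0, 1.1]
nearly_hard = soft_rank(scores, tau=1e-3)  # close to the discrete ranks
smooth = soft_rank(scores, tau=1.0)        # continuous relaxation
```

Because sigmoid(x) + sigmoid(-x) = 1, the soft ranks always sum to n(n-1)/2, the same total as the hard ranks, at any temperature.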

Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing

2026-06-16T04:00:00 · cs.AI, cs.CR, cs.CV, diffusion · 2603.17531

中文标题:Rel-Zero:利用块对不变性实现抗AI编辑的鲁棒零水印

作者:Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Xiaojun Chen, Wu Liu, Weiping Wang

摘要:

Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.

摘要中文:

近年来,基于扩散模型的图像编辑技术对数字视觉内容的真实性构成了重大威胁。传统基于嵌入的水印方法通常需要引入可感知的扰动来维持鲁棒性,这不可避免地会损害视觉保真度。与此同时,现有的零水印方法通常依赖于全局图像特征,难以抵御复杂的编辑操作。本研究揭示了一个关键观察:尽管单个图像块在AI编辑过程中会发生实质性变化,但块对之间的相对距离保持相对不变。利用这一特性,我们提出了关系零水印(Rel-Zero),这是一个无需对原始图像进行任何修改,而是从这些抗编辑的块关系中提取独特零水印的新框架。由于水印基于内在结构一致性而非绝对外观,Rel-Zero提供了一种非侵入性且具有鲁棒性的内容认证机制。大量实验表明,与现有的零水印方法相比,Rel-Zero在各种编辑模型和操作下实现了显著提升的鲁棒性。
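The core idea (derive a signature from distances between patch pairs rather than from patch appearance, leaving the image untouched) can be sketched with raw pixel patches. This is a toy illustration under assumptions: the function names, the raw-pixel features, and the median-threshold binarization are hypothetical, and the paper's feature extractor is not described in the abstract.

```python
import numpy as np

def patch_features(image, patch):
    """Split a 2D grayscale image into non-overlapping patches, one row per patch."""
    h, w = image.shape
    gh, gw = h // patch, w // patch
    blocks = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(gh * gw, patch * patch)

def relational_signature(image, pairs, patch=8):
    """One bit per patch pair: is the pair's distance above the median distance?"""
    feats = patch_features(image, patch)
    d = np.linalg.norm(feats[pairs[:, 0]] - feats[pairs[:, 1]], axis=1)
    return (d > np.median(d)).astype(np.uint8)

def bit_agreement(sig_a, sig_b):
    """Fraction of matching bits between a stored and a recomputed signature."""
    return float(np.mean(sig_a == sig_b))

rng = np.random.default_rng(0)
image = rng.random((32, 32))
pairs = rng.integers(0, 16, size=(15, 2))      # 16 patches of size 8x8
stored = relational_signature(image, pairs)     # kept by the owner; image is not modified
shifted = relational_signature(image + 0.1, pairs)
agreement = bit_agreement(stored, shifted)
```

In this toy setup a global brightness shift changes every patch but leaves pairwise distances essentially unchanged, so the recomputed signature agrees with the stored one.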

From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

2026-06-16T04:00:00 · cs.AI, cs.RO, diffusion · 2604.21391

中文标题:从噪声到意图:基于残差桥的生成式VLA策略锚定

作者:Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang, Qingqiu Huang, Xinge Zhu, Yuexin Ma

摘要:

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. ResVLA also demonstrates strong performance in real-world robot experiments.

摘要中文:

弥合高层语义理解与低层物理控制之间的差距仍是具身智能领域的一项持久挑战,这源于认知与动作之间根本的时空尺度不匹配问题。现有的生成式VLA策略通常采用"从噪声生成"范式,这种范式忽视了上述差异,导致表征效率低下且优化过程中条件对齐能力较弱。本研究提出ResVLA架构,将范式转变为"从意图精炼"。鉴于机器人运动自然可分解为全局意图和局部动力学,ResVLA利用频谱分析将控制解耦为确定性低频锚点和随机高频残差。通过将生成过程锚定在预测的意图上,我们的模型专注于通过残差扩散桥来精炼局部动力学。大量模拟实验表明,ResVLA取得了具有竞争力的性能,对语言和机器人具身扰动具有较强的鲁棒性,且收敛速度优于标准生成式基线。ResVLA在真实机器人实验中也表现出色。
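The decoupling of control into a low-frequency anchor and a high-frequency residual can be illustrated with a plain FFT low-pass split of an action trajectory. This is a minimal sketch under assumptions: `split_low_high` and the hard cutoff `n_low` are hypothetical, and the abstract does not specify which spectral analysis ResVLA actually uses.

```python
import numpy as np

def split_low_high(actions, n_low):
    """Split a trajectory of shape (T, dim) along time into a low-frequency
    anchor (first `n_low` rFFT bins) and the remaining high-frequency residual."""
    spec = np.fft.rfft(actions, axis=0)
    low_spec = spec.copy()
    low_spec[n_low:] = 0
    anchor = np.fft.irfft(low_spec, n=actions.shape[0], axis=0)
    residual = actions - anchor
    return anchor, residual

# Toy usage: a slow motion plus a small fast oscillation.
t = np.arange(64) / 64.0
slow = np.sin(2 * np.pi * t)
fast = 0.1 * np.sin(2 * np.pi * 10 * t)
actions = (slow + fast)[:, None]
anchor, residual = split_low_high(actions, n_low=3)
```

In the paradigm the abstract describes, the anchor would be predicted deterministically and only the residual would be handled by the generative model; here both are simply read off a known trajectory.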

Improved Baselines with Representation Autoencoders

2026-06-16T04:00:00 · cs.AI, cs.CV, cs.GR, cs.LG, diffusion, stat.ML · 2605.18324

中文标题:基于表示自编码器的改进基线

作者:Jaskirat Singh, Boyang Zheng, Zongze Wu, Richard Zhang, Eli Shechtman, Saining Xie

摘要:

Representation Autoencoders (RAE) replace the traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as the sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr6, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EPFID@k (epochs to reach unguided gFID < k) as a measure of training efficiency. RAEv2 attains an EPFID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. The code is available at https://raev2.github.io.

摘要中文:

表示自编码器(Representation Autoencoders, RAE)用预训练视觉编码器替代传统变分自编码器。本论文系统研究了几种设计选择,并发现了三个可简化并改进RAE的见解。首先,我们研究了一种通用公式化方法,其中表示被定义为最后k个编码器层之和,而非仅使用最后一层。这种简单变化显著提升了重建效果,且无需对编码器进行微调或使用专用数据(如文本、人脸)。其次,我们研究了一个普遍假设:RAE(使用预训练表示作为编码器)可以替代表示对齐(Representation Alignment, REPA),后者将相同表示蒸馏到中间层。通过大规模实证分析,我们发现了一个令人惊讶的结论:RAE和REPA表现出互补的工作机制,使得同一表示既可用作编码器,又可用作中间扩散层的对齐目标。最后,原版RAE难以处理无分类器引导(Classifier-Free Guidance, CFG),需要训练第二个较弱的扩散模型用于自动引导(AutoGuidance, AG)。我们证明REPA本身可以视为RAE潜在空间中的x预测。通过简单地重参数化DiT模型的输出,它可以实现“免费”引导。总体而言,RAEv2相比原版RAE实现了超过10倍的收敛速度提升,在ImageNet-256上仅用80个epoch就达到了1.06的最优gFID。在FDr6上,RAEv2仅用80个epoch就达到了2.17的最优成绩,而此前最佳成绩为3.26(需800个epoch),且无需任何后训练。这促使我们提出EPFID@k(达到无引导gFID < k所需的epoch数)作为训练效率的衡量指标。RAEv2达到EPFID@2仅需35个epoch,而原版RAE需要177个epoch。我们还在文本到图像生成和导航世界模型等多种不同设置中验证了我们的方法,表现出持续的改进效果。代码见 https://raev2.github.io。
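Two definitions from the abstract are simple enough to write down directly: the representation as the sum of the last k encoder layers, and EPFID@k as the first epoch at which unguided gFID drops below k. The sketch below uses hypothetical function names and made-up numbers that are not results from the paper.

```python
import numpy as np

def sum_last_k_layers(hidden_states, k):
    """Representation defined as the sum of the last k encoder layer outputs.

    `hidden_states` is a list of per-layer arrays with identical shapes.
    """
    return np.sum(np.stack(hidden_states[-k:]), axis=0)

def epfid_at_k(epochs, gfid, k):
    """EPFID@k: first logged epoch whose unguided gFID is below k, else None."""
    for epoch, fid in zip(epochs, gfid):
        if fid < k:
            return epoch
    return None

# Toy usage with invented values (illustration only).
layers = [np.full((2, 3), float(i)) for i in range(1, 5)]  # layer outputs 1..4
rep = sum_last_k_layers(layers, k=2)                       # 3 + 4 = 7 everywhere
first_epoch = epfid_at_k([10, 20, 30, 40], [9.0, 4.0, 1.8, 1.2], k=2)
```

With k=1 the first function falls back to the usual final-layer representation, which is the original RAE setting the abstract compares against.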

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

2026-06-16T04:00:00 · cs.AI, cs.SD, diffusion · 2606.07015

中文标题:面向伴奏共同生成的统一歌曲生成与歌声转换研究

作者:Ziyu Zhang, Chunyu Qiang, Xiaopeng Wang, Yuxin Guo, Kang Yin, Wenjie Tian, Jingbin Hu, Tianlun Zuo, Zhao Guo, Teng Ma, Yuzhe Liang, Chen Zhang, Lei Xie

摘要:

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space transferring speaker representation from SVC to song generation, endowing fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show that UniSinger achieves state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.

摘要中文:

尽管歌曲生成和歌声转换(SVC)已取得显著发展,但两者长期独立演进:前者缺乏零样本说话人克隆功能,后者则忽视了人声与伴奏的协同效应。为弥补这一空白,我们提出UniSinger,这是首个端到端框架,统一了带说话人克隆的歌曲生成与带伴奏共同生成的歌声转换。基于多模态扩散变换器,我们构建了统一的说话人嵌入空间,实现从SVC到歌曲生成的说话人表示迁移,从而赋予细粒度的跨任务音色控制能力。为缓解多任务优化冲突,我们设计了课程学习策略,采用任务特定的模态掩码,引导模型逐步掌握语义内容、人声音色和伴奏之间的生成机制。实验表明,本方法在两项任务上均达到最先进的性能,并实现了互补优势,为智能音乐制作提供了新的可能性。

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

2026-06-16T04:00:00 · autoregressive, cs.AI, cs.CV, diffusion · 2606.11751

中文标题:AnchorEdit:通过因果记忆在多轮图像编辑中保持时间一致性

作者:Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

摘要:

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

摘要中文:

多轮图像编辑对于迭代设计至关重要,但当前模型在连续步骤中常常面临身份漂移和错误累积的问题。尽管现有研究利用视频先验来保持一致性,但其对双向注意力的依赖从根本上与交互编辑的因果时序性质相矛盾。本文提出了 AnchorEdit,这是首个专门为高分辨率、长期多轮编辑设计的自回归(AR)扩散框架。AnchorEdit 通过三阶段训练课程弥合了视频先验与因果推理之间的差距:身份保留的单轮预训练、采用新型自 rollout 策略的因果 AR 强制微调以缓解曝光偏差,以及用于高效 4 步生成的一致性蒸馏。在推理过程中,我们引入了一种记忆机制来锚定初始主体身份,确保在延长编辑轨迹上的稳定外推。为了评估性能,我们构建了一个新的高分辨率多轮编辑基准测试,旨在压力测试长程稳定性。大量实验表明,AnchorEdit 实现了最先进的结果,即使在超过 10 轮交互后仍能保持优异的主体保真度和指令遵循能力。

Style-CCL: Content-Preserving Style Transfer via Curriculum Continual Learning

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.14746

中文标题:Style-CCL: 基于课程持续学习的内容保持风格迁移

作者:Shiwen Zhang, Haoyuan Wang, Xianghao Zang, Haibin Huang, Chi Zhang, Xuelong Li

摘要:

Content-Preserving Style transfer, given content and style references, remains challenging for Diffusion Transformers (DiTs) due to entangled content and style features. With a reverse triplet synthesis pipeline to build a million-scale training set and a dual-branch Style-Content DiT (SC-DiT) that decouples style and content via separate ROPE embeddings and causal masking, we observe that such a one-stage training paradigm on mixed style categories causes semantic styles to dominate, hindering texture style learning, and harming content preservation. To address these issues, we propose Style-CCL, a Multi-Stage Curriculum Continual Learning framework that trains SC-DiT from semantic (easy) to texture (hard) styles, and from clean to synthetic data, with Random Memory Rehearsal across stages to avoid catastrophic forgetting. Extensive experiments demonstrate that our Style-CCL achieves state-of-the-art performance in three core metrics: style similarity, content consistency, and aesthetic quality.

摘要中文:

在给定内容参考和风格参考的条件下,内容保持的风格迁移由于内容特征与风格特征相互纠缠,对扩散Transformer(DiT)而言仍具挑战性。我们通过反向三元组合成流水线构建了百万规模的训练集,并设计了双分支风格-内容DiT(SC-DiT)模型,利用独立的ROPE嵌入和因果掩码实现风格与内容的解耦。我们发现,在混合风格类别上进行单阶段训练会导致语义风格占主导,阻碍纹理风格学习,并损害内容保持能力。为解决这些问题,我们提出了Style-CCL,这是一种多阶段课程持续学习框架,能够从语义(简单)风格到纹理(困难)风格、从干净数据到合成数据对SC-DiT进行训练,并采用跨阶段随机记忆回放来避免灾难性遗忘。大量实验表明,我们的Style-CCL在风格相似度、内容一致性和美学质量三个核心指标上均达到了最先进的性能。

ReGenHuman: Re-Generating Human Appearances for Realistic Full-Body Video Anonymization

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.14972

中文标题:ReGenHuman:用于逼真全身视频匿名化的人类外观重生成

作者:Adam Sun, Eshaan Barkataki, Arnold Milstein, Gordon Wetzstein, Ehsan Adeli

摘要:

Anonymizing human-centric video data is an understudied problem. Prior anonymization techniques either blur or redact pixels at the cost of realism and downstream utility, or generate frame-by-frame at the cost of temporal coherence. We introduce ReGenHuman, the first full-body video anonymization pipeline that is simultaneously realistic, temporally consistent, and anonymous by construction. Contrary to past approaches which redact or edit the inputs directly, we propose a regenerate, don't edit paradigm. Our approach composites 2D pose, segmentation, and monocular depth into two complementary conditioning streams - StructAll and StructHuman, which are used to fine-tune a video-to-video diffusion backbone on in-the-wild human videos, synthesizing the human regions entirely from identity-free structural cues. We evaluate our model on privacy, quality, and utility, and show that our ReGenHuman achieves the best tradeoff across all three axes against current baselines. We further show that our anonymized videos remain effective for downstream tasks, including video question answering.

摘要中文:

以人为中心的视频数据匿名化是一个尚未得到充分研究的问题。现有的匿名化技术要么通过模糊或涂抹像素的方式实现匿名化,但以牺牲真实性和下游实用性为代价;要么采用逐帧生成的方式,但以牺牲时间一致性为代价。本研究提出ReGenHuman,这是首个同时具备逼真性、时间一致性和构建式匿名化的全身视频匿名化管道。与过去直接涂抹或编辑输入的方法不同,本研究提出了一种“重新生成,而非编辑”的范式。本方法将2D姿态、分割和单目深度合成为两个互补的条件流(StructAll和StructHuman),并利用它们在真实场景的人类视频上微调视频到视频扩散骨干网络,从而完全通过无身份信息的结构线索来合成人体区域。本研究在隐私性、质量和实用性方面对模型进行了评估,结果表明,与现有基线方法相比,ReGenHuman在这三个维度之间实现了最佳权衡。此外,本研究还证明经过匿名化处理的视频在下游任务中仍然有效,包括视频问答任务。

Texture-Shape Bias Balancing for Robust Synthetic-to-Real Semantic Segmentation in Automotive NIR Imagery

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15072

中文标题:用于汽车近红外图像合成到真实域鲁棒语义分割的纹理-形状偏差平衡方法

作者:Felix Stillger, Ben Hamscher, Lukas Hahn, Annika Mütze, Tobias Meisen, Kira Maag

摘要:

Semantic segmentation is a fundamental component of visual perception in modern automotive systems, enabling pixel-level scene understanding. Near-Infrared imaging (NIR) offers stable detection under difficult illumination conditions, but the development of domain-specific semantic segmentation models remains challenging due to the lack of high-quality annotated data from real-world scenarios. Synthetic datasets offer a scalable alternative, but models trained on synthetic images often suffer performance degradation when transferred to real domains. We present the first systematic study on synthetic to real domain adaptation for semantic segmentation in NIR images in the automotive domain. We propose a generative augmentation framework that transforms synthetic images into realistic NIR-style variants via our introduced target style adaptation (TSA). TSA fine-tunes a latent diffusion model via low-rank adaptation on a small curated set of real NIR images and applies it to synthetic training data using structure-preserving multi-signal conditioning. To reduce texture bias and improve segmentation robustness, we further apply a Voronoi-based style diversification strategy (VSD) that modifies the original textures while preserving scene geometry. Experiments with multiple model architectures on NIR data from vehicle interiors and street scenes show that balancing inductive bias during training leads to noticeably more robust semantic segmentation and effectively reduces the domain gap in our real-world scenarios by up to 63.6% on exterior and 28.4% on interior data. The code is available at GitHub.

摘要中文:

语义分割是现代汽车系统视觉感知的基础组件,能够实现像素级的场景理解。近红外成像(NIR)在困难光照条件下提供稳定检测,但由于缺乏来自真实场景的高质量标注数据,开发领域特定的语义分割模型仍然具有挑战性。合成数据集提供了一种可扩展的替代方案,但基于合成图像训练的模型在迁移到真实领域时经常出现性能下降。本文首次针对汽车领域NIR图像的合成到真实域适应问题开展系统研究。我们提出了一个生成式增强框架,通过引入的目标风格适配(TSA)将合成图像转换为逼真的NIR风格变体。TSA利用少量精选的真实NIR图像通过低秩适配微调潜在扩散模型,并采用保结构多信号条件将其应用于合成训练数据。为减少纹理偏差并提高分割鲁棒性,我们进一步应用了基于Voronoi的风格多样化策略(VSD),该策略在保持场景几何结构的同时修改原始纹理。使用多种模型架构在车辆内部和街道场景的NIR数据上进行的实验表明,训练过程中平衡归纳偏差能够显著提高语义分割的鲁棒性,并有效缩小真实场景中的域间差距:在外部场景数据上最多缩小63.6%,在内部场景数据上最多缩小28.4%。代码已开源于GitHub。

RefGC-SR$^2$: Reference-guided Generated Content Super-Resolution and Refinement

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15158

中文标题:RefGC-SR$^2$: 参考图引导的生成内容超分辨率与细化

作者:Jeahun Sung, Dahyeon Kye, Soo Ye Kim, Jihyong Oh

摘要:

Reference-guided generation (e.g., object compositing, customization) has progressed rapidly, yet current pipelines share a fundamental limitation: the object-centric high-resolution reference image (HRRI) provided by users is downsampled to a fixed low-resolution (LR) before being fed into the model, so the fine-grained details are discarded before the output is even produced. In addition, the generation step then introduces its own artifacts (e.g., identity distortion) on top of this loss. Existing reference-guided generated content refinement (RefGCR) methods can correct some of these artifacts but still operate in the LR domain; reference-guided super-resolution (RefSR) methods recover resolution but assume natural-image degradations and ignore the artifact distribution of generative pipelines. To address both gaps in a single formulation, we introduce a new task: reference-guided generated content super-resolution-refinement (RefGC-SR$^2$), where the original HRRI is reused at the post-processing stage to recover lost details, refine generative artifacts, and upscale the output simultaneously. We construct the first real-world triplet data generation pipeline for this RefGC-SR$^2$ task, training a diptych-conditioned generator to synthesize paired low-quality anchors that public pretrained models cannot provide. We further present a frequency-aware diffusion transformer model for RefGC-SR$^2$ that selectively injects fine details from the HRRI while removing generative artifacts. Extensive experiments demonstrate that our RefGC-SR$^2$ model successfully (i) refines the object identity faithfully with respect to the reference, and (ii) recovers high-resolution details, so that the final result is significantly higher quality and practically more usable compared to existing RefGCR and RefSR baselines.

摘要中文:

参考图引导生成(如目标合成、个性化定制)发展迅速,但当前流程存在一个根本性限制:用户提供的高分辨率参考图像在输入模型前被下采样至固定低分辨率,导致细粒度细节在输出产生前就已丢失。此外,生成步骤还会引入额外的伪影(如身份失真)。现有的参考图引导生成内容细化方法虽能纠正部分伪影,但仍只在低分辨率域操作;参考图引导超分辨率方法虽能恢复分辨率,但假设的是自然图像退化,忽略了生成管线的伪影分布。为在一个统一的框架中解决这两个问题,我们提出了一个新任务:参考图引导生成内容超分辨率与细化(RefGC-SR$^2$),在后处理阶段重用原始高分辨率参考图像以恢复丢失的细节、细化生成伪影并同时提升输出分辨率。我们为该任务构建了首个真实世界三元组数据生成流程,训练了一个双联图条件生成器来合成公开预训练模型无法提供的成对低质量锚点。我们进一步提出了一种频率感知的扩散Transformer模型用于RefGC-SR$^2$,该模型可从高分辨率参考图像中选择性注入细节同时去除生成伪影。大量实验表明,我们的RefGC-SR$^2$模型成功实现了(i)相对于参考图准确细化目标身份,(ii)恢复高分辨率细节,使得最终结果的质量显著更高,且与现有RefGCR和RefSR基线方法相比更具实用性。

Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15236

中文标题:展示信号,隐藏噪声:像素空间扩散的频谱强制方法

作者:Weichen Fan, Haiwen Diao, Penghao Wu, Ziwei Liu

摘要:

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^{*}(t) = (1-t)^{-2/\alpha}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally and can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with the diffusion time and becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across different training epochs, demonstrating robust gains throughout training; at finer tokenization, the spectral forcing is still competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.

摘要中文:

像素空间扩散模型在全带宽噪声图像上进行训练,然而降噪器可用的有用信号与频率密切相关。在整流流扩散和自然图像幂律谱条件下,每频带的数据-噪声轮廓 $k^{*}(t) = (1-t)^{-2/\alpha}$ 在每个时间步 $t$ 将包含信号的低频区域与噪声主导的高频区域分开。我们表明这种隐式的粗到细结构不仅仅是描述性的:它引入了容量分配问题。标准像素空间降噪器必须在内部发现移动的带宽边界,并可能将计算资源浪费在最优预测退化为确定性基线而非数据分布建模的频率-时间区域。为了使这一边界显式化,我们引入频谱强制(Spectral Forcing),这是一个无参数、时间条件的二维离散余弦变换低通算子,在补丁嵌入器之前应用于噪声输入。其截止频率随扩散时间单调扩展,并在数据端点处成为恒等算子。通过受控的合成实验,我们确定了该算子有益的适用范围:粗糙的补丁标记化以及高频内容主要为噪声而非本质信号的数据。在 ImageNet-256 上使用 JiT-700M/32,频谱强制在不同训练轮次持续改善 FID 和 Inception Score,展示了训练过程中的稳健增益;在更精细的标记化下,频谱强制仍具竞争力。我们进一步将未经修改的该算子插入 SenseNova-U1(一种统一文生图模型),其中它改善了 DPG-Bench 和 GenEval,表明输入侧频谱先验可超越类别条件生成进行迁移。这些结果通过展示信号和隐藏噪声,为容量高效的像素空间扩散提供了途径。
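The time-conditional 2D-DCT low-pass described in this abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the square (per-axis) mask, the base scale `k0`, the default `alpha = 2`, and the convention that `t = 1` is the data endpoint are all assumptions made here for illustration. Only the contour $k^{*}(t) = (1-t)^{-2/\alpha}$ comes from the abstract.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C, so that C @ C.T is the identity."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def cutoff_index(t: float, n: int, alpha: float = 2.0, k0: float = 1.0) -> int:
    """Number of DCT frequencies kept per axis at time t (assumed: t = 1 is data)."""
    if t >= 1.0:
        return n
    # Contour from the abstract, scaled by the hypothetical base bandwidth k0.
    k_star = k0 * (1.0 - t) ** (-2.0 / alpha)
    return int(min(n, max(1, np.ceil(k_star))))

def spectral_low_pass(x: np.ndarray, t: float, alpha: float = 2.0, k0: float = 1.0) -> np.ndarray:
    """Zero every 2D-DCT coefficient at or above the time-dependent cutoff."""
    h, w = x.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ x @ cw.T                 # forward 2D DCT
    mask = np.zeros((h, w))
    mask[:cutoff_index(t, h, alpha, k0), :cutoff_index(t, w, alpha, k0)] = 1.0
    return ch.T @ (coeffs * mask) @ cw     # inverse 2D DCT

rng = np.random.default_rng(0)
noisy = rng.standard_normal((16, 16))
cutoffs = [cutoff_index(t, 16) for t in (0.0, 0.5, 0.75, 0.9, 1.0)]
```

With these assumed defaults the operator keeps only the DC coefficient at `t = 0` and is the identity at `t = 1`, matching the monotonically expanding cutoff the abstract describes.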

CausalDrive: Real-time Causal World Models for Autonomous Driving

2026-06-16T04:00:00 · autoregressive, cs.CV, diffusion · 2606.15341

中文标题:CausalDrive:自动驾驶实时因果世界模型

作者:Tianyi Yan, Huan Zheng, Dubing Chen, Meizhi Qu, Yingying Shen, Lijun Zhou, Mingfei Tu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Cheng-zhong Xu, Jianbing Shen

摘要:

World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on "oracle" future trajectories of all background agents, rendering them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive operates solely on the initial front-view frame, the ego-vehicle's trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions, enabling text-driven control over Driving Sociology, allowing users to dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture. This combines continuous flow-matching with a self-correcting distillation objective, achieving interactive speeds of 12 FPS. This breakthrough transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly mitigated collision artifacts, (2) large-scale Reinforcement Learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments validate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.

摘要中文:

世界模型作为扩展自动驾驶数据的有前景范式应运而生,然而现有视频生成模型作为交互式模拟器仍存在不足。布局条件渲染器依赖所有背景车辆的“神谕”未来轨迹,导致其严格意义上不具有响应性。相反,纯动作条件预测器缺乏对复杂交互的语义控制能力,且受制于极高的扩散延迟,阻碍了闭环策略学习。为弥合这一差距,我们提出CausalDrive,一个可控的实时基础驾驶世界渲染器。CausalDrive仅基于初始前视帧、自车轨迹和宏观文本提示进行操作。通过排除未来NPC布局,我们强制模型内在地预测因果交互,实现对驾驶社会学的文本驱动控制,使用户能够动态编排相同自车动作下的多样化反事实反应。为克服效率瓶颈并解决自回归生成中的协变量偏移问题,我们提出了一种新颖的Context-Forced DMD架构。该架构将连续流匹配与自纠正蒸馏目标相结合,达到12 FPS的交互速度。这一突破将被动视频生成器转变为可玩神经模拟器。我们在三个下游应用中展示其通用性:(1)显著降低碰撞伪影的生成式闭环评估;(2)由Video2Reward模块驱动的大规模强化学习后训练;(3)实时人在回路仿真。广泛实验表明,在CausalDrive的反应性场景中训练的策略在现实世界中展现出更优越的交互能力。

Timestep Rescheduling in Diffusion Inversion

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15389

中文标题:扩散反演中的时间步重新调度

作者:Shangquan Sun, Ting Gong, Zhirui Liu, Jiamin Wu, Runkai Zhao, Mianxin Liu, Wenqi Ren, Xiaochun Cao

摘要:

Diffusion inversion, which maps images back to the Gaussian latent space of a diffusion model, is a critical task for image reconstruction and editing. While DDIM enables fast deterministic inversion, it inherently introduces deviations that accumulate into noticeable inversion errors. Existing methods often address this by solving a fixed-point problem but largely overlook how the selection of the diffusion timestep in the noise scheduler influences inversion fidelity. In this work, we reveal that the deviation scale in diffusion inversion is strongly dependent on the timestep size, and exhibits a parabolic trend, with larger errors concentrated at both small and large timesteps. Based on this finding, we propose a simple yet effective nonuniform timestep scheduler that integrates a global rescaling with a local dynamic programming based rescheduling, enabling a strategic allocation of computational effort that minimizes the overall inversion error and preserves higher inversion accuracy. Our method serves as an off-the-shelf enhancement for existing inversion techniques and requires no extra parameters or computational overhead. Through extensive experiments, we verify that integrating our scheduler consistently boosts the performance of existing inversion methods, achieving superior results in image reconstruction and editing.

摘要中文:

扩散反演是将图像映射回扩散模型高斯潜在空间的关键任务,对于图像重建和编辑至关重要。虽然DDIM能够实现快速确定性反演,但它固有地引入偏差,这些偏差会累积成明显的反演误差。现有方法通常通过求解不动点问题来解决这一问题,但很大程度上忽略了噪声调度器中扩散时间步的选择对反演保真度的影响。在本工作中,我们发现扩散反演中的偏差规模与时间步大小密切相关,并呈现抛物线趋势,较大的误差集中于小时间步和大时间步。基于这一发现,我们提出了一种简单而有效的非均匀时间步调度器,该调度器结合了全局重缩放与基于局部动态规划的重新调度策略,能够战略性分配计算资源,从而最大程度降低整体反演误差并保持较高的反演精度。我们的方法可作为现有反演技术的即插即用增强方案,无需额外参数或计算开销。通过大量实验验证,我们证明将我们的调度器集成到现有反演方法中能够持续提升其性能,在图像重建和编辑任务中取得优异效果。
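The idea of allocating inversion steps with dynamic programming can be sketched as follows. This is a generic shortest-path DP over a candidate timestep grid, not the authors' scheduler: the per-jump error model `toy_cost` (quadratic in step size, with a parabolic weight that is larger near both ends of the grid) is a hypothetical stand-in for the deviation behaviour the abstract reports, and the global rescaling stage is omitted.

```python
from typing import Callable, List

def schedule_timesteps(num_candidates: int, num_steps: int,
                       cost: Callable[[int, int], float]) -> List[int]:
    """Pick exactly num_steps jumps from index 0 to the last index, minimizing summed cost."""
    last = num_candidates - 1
    if not 1 <= num_steps <= last:
        raise ValueError("num_steps must be between 1 and num_candidates - 1")
    inf = float("inf")
    # best[m][j]: minimal cost of reaching candidate j with exactly m jumps.
    best = [[inf] * num_candidates for _ in range(num_steps + 1)]
    prev = [[-1] * num_candidates for _ in range(num_steps + 1)]
    best[0][0] = 0.0
    for m in range(1, num_steps + 1):
        for j in range(1, num_candidates):
            for i in range(j):
                candidate = best[m - 1][i] + cost(i, j)
                if candidate < best[m][j]:
                    best[m][j] = candidate
                    prev[m][j] = i
    # Backtrack from the final timestep to recover the chosen schedule.
    path = [last]
    for m in range(num_steps, 0, -1):
        path.append(prev[m][path[-1]])
    return path[::-1]

GRID = 21  # hypothetical number of candidate timesteps

def toy_cost(i: int, j: int) -> float:
    """Assumed error model: larger jumps and jumps near either end cost more."""
    mid = 0.5 * (i + j) / (GRID - 1)
    weight = 1.0 + 8.0 * (mid - 0.5) ** 2
    return weight * (j - i) ** 2

def total_cost(path: List[int], cost: Callable[[int, int], float]) -> float:
    return sum(cost(a, b) for a, b in zip(path, path[1:]))

uniform = [0, 4, 8, 12, 16, 20]
optimized = schedule_timesteps(GRID, 5, toy_cost)
```

Because the uniform schedule is one of the paths the DP considers, the optimized schedule can never cost more under the same error model. How much it helps in practice depends entirely on how well the cost model reflects the real inversion error.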

Lesion-DDPM: Lesion-Enhanced 3D Diffusion for MS MRI Synthesis

2026-06-16T04:00:00 · cs.CV, cs.LG, diffusion · 2606.15457

中文标题:Lesion-DDPM:用于MS MRI合成的病灶增强型3D扩散模型

作者:Weidong Zhang, Yongchan Jung, Shafayat Mowla Anik, Furen Xiao, Vasudevan Janarthanan, Enkhzaya Chuluunbaatar, Byeong Kil Lee, Jeeho Ryoo

摘要:

3D FLAIR MRI is widely recommended as one of the standard MRI sequences for brain imaging in multiple sclerosis (MS), but publicly available MS datasets remain relatively small and vary across scanners, acquisition protocols, and lesion patterns. This scarcity and variability hinder the development of robust neuroimaging machine learning models and are particularly challenging for generative models that aim to synthesize images while preserving small, sparse lesions. We propose Lesion-DDPM, a 3D conditional diffusion framework for lesion-aware FLAIR synthesis that incorporates multi-level anatomical mask injection together with a lesion-weighted reconstruction loss to emphasize lesion voxels while maintaining global brain structure. Using a curated subset of the MSLesSeg dataset, we compare Lesion-DDPM with representative state-of-the-art GAN- and diffusion-based models, assessing both image-generation metrics and downstream 3D U-Net segmentation. In our experiments, Lesion-DDPM achieved the lowest lesion-region reconstruction error among all methods. In a downstream 3D U-Net lesion segmentation task, a model trained only on Lesion-DDPM-generated scans and evaluated on real MRIs reached a Dice score of 0.616 compared with 0.569 for the best competing synthetic dataset. When Lesion-DDPM images were added to the real training set, the Dice score further increased to 0.685.

摘要中文:

3D FLAIR MRI 被广泛推荐为多发性硬化症(MS)脑影像的标准MRI序列之一,但公开可用的MS数据集仍然相对较小,且在不同扫描仪、采集协议和病灶模式间存在差异。这种稀缺性和变异性阻碍了鲁棒神经影像机器学习模型的发展,对于旨在合成图像同时保留小型稀疏病灶的生成模型尤其具有挑战性。我们提出了 Lesion-DDPM,这是一种用于病灶感知FLAIR合成的3D条件扩散框架,结合了多层解剖掩膜注入和病灶加权重建损失,以强调病灶体素同时保持全局脑结构。使用 MSLesSeg 数据集的精选子集,我们将 Lesion-DDPM 与代表性的先进GAN和扩散模型进行比较,评估图像生成指标和下游3D U-Net分割任务。在我们的实验中,Lesion-DDPM 在所有方法中实现了最低的病灶区域重建误差。在下游3D U-Net病灶分割任务中,仅使用 Lesion-DDPM 生成的扫描进行训练并在真实MRI上评估的模型达到了0.616的Dice系数,而最佳竞争合成数据集为0.569。当 Lesion-DDPM 生成的图像被添加到真实训练集后,Dice系数进一步提升至0.685。
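Two quantities in this abstract lend themselves to a short worked sketch: a lesion-weighted reconstruction loss and the Dice score used for the downstream segmentation results. The weighting scheme below (a constant `lesion_weight` on lesion voxels, normalized by the total weight) is an assumption for illustration; the abstract does not specify the exact form used in Lesion-DDPM. The Dice definition is the standard one.

```python
import numpy as np

def lesion_weighted_mse(pred: np.ndarray, target: np.ndarray,
                        lesion_mask: np.ndarray, lesion_weight: float = 10.0) -> float:
    """Mean squared error where lesion voxels count lesion_weight times as much (assumed form)."""
    weights = 1.0 + (lesion_weight - 1.0) * lesion_mask
    return float(np.sum(weights * (pred - target) ** 2) / np.sum(weights))

def dice_score(pred_mask: np.ndarray, true_mask: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2 |A ∩ B| / (|A| + |B|) for two binary masks."""
    a = pred_mask.astype(bool)
    b = true_mask.astype(bool)
    intersection = np.logical_and(a, b).sum()
    return float(2.0 * intersection / (a.sum() + b.sum() + eps))

# Toy 4x4x4 volume with a single lesion voxel.
target = np.zeros((4, 4, 4))
mask = np.zeros((4, 4, 4))
mask[1, 1, 1] = 1.0

error_on_lesion = target.copy()
error_on_lesion[1, 1, 1] = 1.0        # unit error placed on the lesion voxel
error_on_background = target.copy()
error_on_background[0, 0, 0] = 1.0    # same unit error placed on healthy tissue

loss_lesion = lesion_weighted_mse(error_on_lesion, target, mask)
loss_background = lesion_weighted_mse(error_on_background, target, mask)

half_overlap = dice_score(np.array([1, 1, 0]), np.array([0, 1, 1]))
```

The same unit error is penalized more when it falls on the lesion voxel, which is the behaviour a lesion-weighted loss is meant to produce for small, sparse lesions.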

ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15486

中文标题:ST-DiffEye:基于扩散模型的联合扫描路径-轨迹建模的连续视线生成

作者:Brian Nlong Zhao, Ozgur Kara, Junho Kim, James M. Rehg

摘要:

We study the problem of human gaze modeling, which aims to generate the gaze patterns a viewer produces while observing a visual stimulus. Gaze is primarily captured through two modalities: continuous eye-tracking trajectories, which describe fine-grained motion dynamics, and discrete scanpaths, which describe high-level fixation structure. Because gaze varies substantially across viewers and trials, we treat this variability as a defining property rather than noise and model gaze as a stochastic generative process. Existing generative gaze models supervise on only one of these two representations in isolation. We hypothesize that trajectories and scanpaths describe gaze at complementary scales and are jointly informative during training, and test this hypothesis through ST-DiffEye, a joint trajectory-scanpath diffusion framework that couples both modalities by concatenating them as an additional raw input channel, requiring no architectural overhead beyond an input and output channel expansion. We further introduce a principled evaluation framework based on the Continuous Ranked Probability Score (CRPS), which generalizes any existing sequence similarity metric into a proper scoring rule that jointly assesses the accuracy and diversity of generated gaze. Experiments on task-driven visual search, covering both target-present and target-absent scenarios, and on free-viewing benchmarks demonstrate state-of-the-art performance. These results, along with detailed ablations, confirm the benefit of joint modeling and the value of distribution-aware evaluation in capturing the intrinsic variability of human gaze. Project webpage: https://st-diffeye.github.io/

摘要中文:

本文研究人类视线建模问题,旨在生成观察者在观看视觉刺激时产生的注视模式。视线主要通过两种模态进行捕捉:描述细粒度运动动态的连续眼动轨迹,以及描述高层注视结构的离散扫描路径。由于视线在不同的观察者和试次之间存在显著差异,我们将这种变异性视为定义性特征而非噪声,并将视线建模为随机生成过程。现有的生成式视线模型仅单独监督这两种表示中的一种。我们假设轨迹和扫描路径在不同尺度上描述视线,并在训练过程中具有联合信息价值。为验证这一假设,我们提出了ST-DiffEye——一种联合轨迹-扫描路径的扩散框架,该框架通过将两种模态拼接作为额外的原始输入通道来耦合它们,除去输入和输出通道扩展外无需额外的架构开销。我们进一步引入了一种基于连续排序概率分数(CRPS)的原则性评估框架,该框架将任何现有的序列相似性度量泛化为一种恰当的评分规则,可联合评估生成视线的准确性和多样性。在任务驱动的视觉搜索(包括目标存在和目标不存在场景)以及自由观看基准上的实验表明,本方法达到了最先进的性能。这些结果连同详细的消融实验,证实了联合建模的价值以及分布感知评估在捕捉人类视线内在变异性方面的价值。项目主页:https://st-diffeye.github.io
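A sample-based CRPS-style score built on an arbitrary sequence distance can be sketched as below. The estimator used here, mean distance to the observation minus half the mean pairwise distance among samples, is the common energy-form estimator and is an assumption about how the generalization might look; the abstract does not give the paper's exact formula. The helper `mean_pointwise_distance` is a hypothetical example metric.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")

def sample_crps(samples: Sequence[T], observation: T,
                distance: Callable[[T, T], float]) -> float:
    """Energy-form CRPS estimate: accuracy term minus half the sample spread (lower is better)."""
    n = len(samples)
    accuracy = sum(distance(s, observation) for s in samples) / n
    spread = sum(distance(a, b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread

Gaze = Sequence[Tuple[float, float]]

def mean_pointwise_distance(a: Gaze, b: Gaze) -> float:
    """Hypothetical metric: mean Euclidean distance between equal-length 2D gaze sequences."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

# Scalar sanity check with absolute difference as the distance.
scalar_score = sample_crps([0.0, 2.0], 1.0, lambda a, b: abs(a - b))

# Sequence example: every generated trajectory equals the observed one.
observed = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
sequence_score = sample_crps([observed, observed], observed, mean_pointwise_distance)
```

The first term rewards samples that are close to the observed gaze, while the second term avoids penalizing genuine diversity among samples, which is why a score of this shape can assess accuracy and diversity together.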

Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15534

中文标题:Track2View:基于配对3D点轨迹的4D一致性相机控制视频生成

作者:Feng Qiao, Zhaochong An, Zhexiao Xiong, Serge Belongie, Nathan Jacobs

摘要:

Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from source to target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. Project page: https://qjizhi.github.io/track2view

摘要中文:

从新相机视角对现有视频进行重渲染,要求输出遵循指定的相机轨迹,同时在每一帧中保持原始场景的外观和动态特征。现有方法依赖于逐帧姿态嵌入、噪声点云渲染或隐式学习的对应关系,均无法提供源像素与目标像素之间显式的时间连续链接。我们提出Track2View,该方法将视频扩散变换器置于配对3D点轨迹的条件之下:这些稀疏轨迹将场景点投影到源相机视角和目标相机视角中。这些轨迹提供了由构建方式决定的时间连续的显式时空对应关系,编码了内容在何时何地应该出现。Track2View的核心是一个双视图轨迹调节器,通过无参数的几何操作和学习的时序聚合将视觉上下文从源视图转移到目标视图,确保了对任意相机轨迹的泛化能力,而无需记忆特定运动模式。我们进一步引入了一个数据整理流程,通过在时间连接的多相机视图对上运行3D点跟踪器来提取一对一轨迹对应关系。在涵盖静态和动态场景的400个视频基准测试中,Track2View在视觉质量、视图同步和相机精度方面均达到了最先进水平,与领先基线相比,旋转误差降低30-65%,平移误差降低61-72%。项目页面见此链接:https://qjizhi.github.io/track2view
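The notion of "paired" 3D point tracks can be made concrete with a pinhole-projection sketch: the same world-space tracks are projected into a source and a target camera, and the shared track index gives the pixel correspondence. The function name, array layout, and camera convention (`x_cam = R @ x_world + t`) are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def project_tracks(points_world: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray):
    """Project tracks of shape (T, N, 3) with per-frame extrinsics R (T, 3, 3), t (T, 3).

    Returns pixel coordinates (T, N, 2) and camera-space depth (T, N).
    """
    cam = np.einsum("tij,tnj->tni", R, points_world) + t[:, None, :]
    depth = cam[..., 2]
    uv_h = np.einsum("ij,tnj->tni", K, cam)      # homogeneous pixel coordinates
    uv = uv_h[..., :2] / uv_h[..., 2:3]
    return uv, depth

# One frame, two scene points, a simple intrinsic matrix.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
tracks = np.array([[[0.0, 0.0, 2.0],
                    [1.0, 0.0, 2.0]]])           # shape (1, 2, 3)
identity = np.eye(3)[None]                       # shape (1, 3, 3)

# Source camera at the origin; target camera shifted 1 unit along +x.
uv_source, _ = project_tracks(tracks, K, identity, np.zeros((1, 3)))
uv_target, _ = project_tracks(tracks, K, identity, np.array([[-1.0, 0.0, 0.0]]))
```

Row `n` of `uv_source` and row `n` of `uv_target` describe the same scene point in the two views, which is the kind of explicit source-to-target link the abstract contrasts with per-frame pose embeddings.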

Toward the Whole Picture: Accumulative Fingerprint Mapping and Reconstruction for Small-Area Mobile Sensors

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15574

中文标题:迈向完整图像:小区域移动传感器的累积指纹映射与重建

作者:Xiongjun Guan, Jianjiang Feng, Jie Zhou

摘要:

Small-area fingerprint sensing on mobile devices creates a fundamental mismatch between acquisition and recognition: each touch captures only a tiny, pose-varying local patch, while reliable biometric matching ultimately requires a stable and sufficiently complete fingerprint representation. Existing pipelines largely cope with this mismatch by treating repeated touches as independent partial templates, which leads to repeated registration, repeated matching, and no guarantee of adequate global coverage. In this paper, we advocate a different formulation, namely accumulative fingerprint mapping and reconstruction for small-area mobile sensing. Rather than matching every partial patch separately, the proposed perspective converts a sequence of local observations into a unified fingerprint state that is progressively refined as new touches arrive and can be matched only once after consolidation. As a concrete baseline, we present a classical pipeline that performs patch-wise structural feature extraction, feature-level registration and fusion, fingerprint map construction, and phase-based ridge reconstruction. More importantly, we position this baseline within a broader mobile fingerprint framework that integrates structured token learning, two-stage pose reasoning, and diffusion-based generative reconstruction. This viewpoint reframes mobile fingerprint recognition from multi-capture multi-match processing to accumulative map building, state refinement, and one-shot matching, offering a principled route toward efficient, pose-robust, and deployment-friendly biometrics for small-area mobile platforms. The baseline implementation has been publicly released at https://github.com/XiongjunGuan/FpReconstruction.

摘要中文:

移动设备上的小区域指纹感知在采集与识别之间产生了根本性的不匹配:每次触摸仅能捕获一个姿态各异的小范围局部区域,而可靠的生物特征匹配最终需要稳定且足够完整的指纹表示。现有流程在很大程度上通过将重复触摸视为独立的部分模板来应对这种不匹配,这导致重复配准、重复匹配,且无法保证足够的全局覆盖。本文倡导一种不同的方案,即小区域移动感知的累积指纹映射与重建。所提出的视角并非分别匹配每个局部图像块,而是将一系列局部观测转换为统一的指纹状态,随着新触摸的到来逐步精炼,并在整合后仅需匹配一次。作为具体基线,我们提出了一种经典流程,包括逐图像块的结构特征提取、特征级配准与融合、指纹图构建以及基于相位的脊线重建。更重要的是,我们将该基线置于更广泛的移动指纹框架中,该框架集成了结构化令牌学习、两阶段姿态推理和基于扩散的生成式重建。这一视角将移动指纹识别从多采集多匹配处理重新定义为累积图构建、状态精炼和一次性匹配,为小区域移动平台提供了一种高效、姿态鲁棒且易于部署的生物特征识别原则性路径。基线实现已公开发布于 https://github.com/XiongjunGuan/FpReconstruction。

Unlocking Diffusion Hierarchies: Adaptive Timestep Selection for Zero-Shot Segmentation

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15590

中文标题:解锁扩散层级:面向零样本分割的自适应时间步选择

作者:Ramin Nakhli, Mahesh Ramachandran, Luca Ballan

摘要:

Zero-shot segmentation has recently shown notable improvement by leveraging the rich visual priors in large-scale text-to-image diffusion models, such as Stable Diffusion. However, current diffusion-based methods often face limitations due to the trade-off between spatial resolution and contextual information, as well as their reliance on a single static timestep for feature extraction. To overcome these challenges, our work introduces two key advancements. First, our Contextual Similarity Maps fuse high-resolution attention maps with rich U-Net encoder features, providing both fine-grained and robust per-pixel representations. Second, we identify an emergent hierarchical semantic progression within the denoising process of various diffusion models: representations transition from part-level abstractions at earlier timesteps to object-level abstractions at later stages. Leveraging this insight, we introduce a mechanism to adaptively select the optimal timestep for each pixel. Extensive experiments demonstrate that our method consistently outperforms existing zero-shot segmentation baselines, validating the efficacy of combining contextual features with dynamic, hierarchical timestep selection.

摘要中文:

零样本分割最近通过利用大规模文本到图像扩散模型(如Stable Diffusion)丰富的视觉先验,表现出显著改进。然而,当前基于扩散的方法经常面临空间分辨率与上下文信息之间的权衡,以及对单一静态时间步特征提取的依赖所带来的限制。为克服这些挑战,我们的工作引入了两个关键进展。首先,上下文相似性图将高分辨率注意力图与丰富的U-Net编码器特征融合,提供细粒度且鲁棒的逐像素表示。其次,我们发现了各种扩散模型去噪过程中出现的层级语义演进:表示从早期时间步的部分级抽象过渡到后期的对象级抽象。利用这一洞察,我们引入了一种机制来自适应地为每个像素选择最优时间步。大量实验表明,我们的方法始终优于现有的零样本分割基线,验证了结合上下文特征与动态层级时间步选择的有效性。

Variational Test-time Optimization for Diffusion Synchronization

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15614

中文标题:用于扩散同步的变分测试时优化

作者:Hyunsoo Lee, Farrin Marouf Sofian, Kushagra Pandey, Stephan Mandt

摘要:

Collaborative generation, which coordinates multiple diffusion trajectories to extend the capabilities of pretrained priors, has emerged as a powerful paradigm for extending the applicability of diffusion models. Among existing approaches, diffusion synchronization provides a scenario-agnostic solution by introducing general guidance mechanisms. However, current synchronization approaches rely heavily on heuristics and still require task-specific tailoring, which limits their generalizability and performance. In this work, we mathematically derive a synchronization framework based on optimal control, providing a principled explanation of diffusion synchronization. During sampling, we optimize control variables to guide multiple trajectories toward coherent solutions while remaining close to the underlying diffusion prior. Our method operates entirely at test-time without additional training, thereby enabling broad applicability across diverse generation scenarios when combined with strong pretrained priors. We demonstrate consistent improvements over baselines on three representative collaborative generation tasks, covering a wide range of modalities and applications. Beyond performance gains, our work establishes a novel foundation for collaborative generation, opening a principled path toward extending pretrained generative models to new collaborative generation settings.

摘要中文:

协作生成通过协调多个扩散轨迹来扩展预训练先验的能力,已成为扩展扩散模型适用性的强大范式。在现有方法中,扩散同步通过引入通用引导机制提供了场景无关的解决方案。然而,当前的同步方法严重依赖启发式方法,仍然需要针对特定任务进行调整,这限制了其泛化性和性能。在本工作中,我们从数学上推导出一个基于最优控制的同步框架,为扩散同步提供了原则性解释。在采样过程中,我们优化控制变量来引导多个轨迹趋向一致解,同时保持与底层扩散先验的接近。我们的方法完全在测试时运行,无需额外训练,因此与强大的预训练先验结合时能够广泛应用于不同的生成场景。我们在三个代表性协作生成任务上展示了相对于基线方法的一致性改进,涵盖了广泛的模态和应用。除了性能提升之外,我们的工作为协作生成奠定了新的基础,为将预训练生成模型扩展到新的协作生成设置开辟了原则性路径。

SiGnature: Explicit Motion Diffusion for Stylized Semantic Gesture

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.15889

中文标题:SiGnature:面向风格化语义手势的显式运动扩散

作者:Adi Rosenthal, Tomer Koren, Nadav Shaked, Doron Friedman, Ariel Shamir

摘要:

While recent advances in co-speech gesture generation have achieved impressive rhythmic synchronization, synthesizing gestures that are both semantically meaningful and faithful to a speaker's unique non-verbal style remains an open challenge. Semantic gestures, such as iconic shapes or deictic pointing, are statistically sparse, making them difficult to learn effectively within standard generative models. We present SiGnature, a framework for Stylized and Semantic Gesture generation that reconciles precise semantic control with high-fidelity style preservation. Unlike prevalent methods that rely on entangled latent representations, SiGnature operates in an explicit joint-rotation space. This design enables our core contribution, Joint Motion Integration (JMI), a training-free inference mechanism capable of injecting any external motion sequence, particularly in-the-wild semantic gestures, directly into the diffusion process. JMI automatically identifies the specific "active joints" conveying a semantic action and injects them into the generation, while relying on the diffusion backbone to synthesize the remaining body dynamics, including posture and flow, in accordance with the pre-learned style of the target speaker. This allows for the plug-and-play integration of arbitrary motions, including complex semantic gestures, without retraining or introducing the "Frankenstein" artifacts typical of cut-and-paste methods. Extensive experiments and perceptual studies demonstrate that SiGnature offers superior semantic motion control while maintaining smooth and natural co-speech gesture generation and preserving the distinct characteristics of the speaker, thereby outperforming state-of-the-art baselines.

摘要中文:

尽管近年来伴随言语手势生成在节奏同步方面取得了显著进展,但如何合成兼具语义意义并忠实于说话者独特非言语风格的手势仍是一项开放性挑战。语义手势(如象形手势或指示手势)在统计上较为稀疏,使得标准生成模型难以有效学习。本研究提出SiGnature,一个风格化与语义手势生成框架,能够在精确语义控制与高保真风格保持之间实现平衡。与依赖纠缠潜在表示的流行方法不同,SiGnature在显式关节旋转空间中运行。这一设计使得本研究的核心贡献——联合运动整合(Joint Motion Integration, JMI)成为可能。JMI是一种无需训练的推理机制,能够将任意外部运动序列(尤其是真实场景中的语义手势)直接注入扩散过程。JMI自动识别传达语义动作的特定“活跃关节”并将其注入生成过程,同时依赖扩散主干网络根据目标说话者的预学习风格合成其余身体动态,包括姿态和流畅度。这使得任意运动的即插即用集成成为可能,包括复杂语义手势,无需重新训练,也不会引入剪切粘贴方法中常见的“拼贴”伪影。大量实验和感知研究表明,SiGnature在保持流畅自然的伴随言语手势生成并保留说话者独特特征的同时,提供了卓越的语义运动控制能力,从而优于现有最先进的方法。

PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.16048

中文标题:PointDiffusion:基于扩散的点云域场景补全

作者:Chidera Agbasiere, Mikhail Sannikov, Faith Ogunwoye, Erik Shaikhiev, Alex Kozinov, Ilya Mikhalchuk, Iana Zhura, Dzmitry Tsetserukou

摘要:

Reconstructing dense 3D scenes from sparse LiDAR point clouds is a fundamental challenge in autonomous driving, where latent diffusion models offer a promising solution. However, existing approaches rely on object-level autoencoders that collapse into unstable global representations at outdoor scale and suffer from ground truth data corrupted by odometry drift that systematically degrades supervision quality. Furthermore, multi-step diffusion inference incurs prohibitive latency for real-time deployment. We propose a novel multi-token Gaussian VAE with cross-attention pooling for stable scene-scale LiDAR compression, combined with an anchor-based ICP ground truth refinement pipeline that eliminates drift-induced noise from training supervision. Together, these components enable a scaffold-free single-step diffusion completion model that achieves an approximately 16x reduction in squared Chamfer distance on SemanticKITTI seq. 08 (0.396 m^2 to 0.024 m^2), surpasses LiDiff and ScoreLiDAR by 17-19% and 10-11%, respectively, and operates at 25-143x lower inference latency. Our results demonstrate that data quality dominates model design in this regime and that multi-token latent spaces provide a stable first stage for latent diffusion-based scene completion.

摘要中文:

从稀疏LiDAR点云重建密集3D场景是自动驾驶领域的一项基础性挑战,潜扩散模型为此提供了一种有前景的解决方案。然而,现有方法依赖于目标级自编码器,这些自编码器在室外规模下会坍缩为不稳定的全局表示,并且受到里程计漂移污染的真值数据的困扰,这会系统性降低监督质量。此外,多步扩散推理会产生难以满足实时部署要求的过高延迟。我们提出了一种具有交叉注意力池化的新型多标记高斯VAE,用于稳定的场景级LiDAR压缩,并结合基于锚点的ICP真值精炼管道,以消除训练监督中由漂移引起的噪声。这些组件共同实现了一个无脚手架的单步扩散补全模型,在SemanticKITTI seq. 08上实现了平方Chamfer距离约16倍的降低(从0.396 m²降至0.024 m²),分别超越LiDiff和ScoreLiDAR 17-19%和10-11%,且推理延迟降低25-143倍。我们的结果表明,在该任务中数据质量主导模型设计,多标记潜空间为基于潜扩散的场景补全提供了稳定的第一阶段。
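The headline metric here, squared Chamfer distance, can be written in a few lines of NumPy. Conventions differ between papers (summing versus averaging the two directions, squared versus unsquared distances); the version below sums the two directional means of squared nearest-neighbour distances and is an assumption, not necessarily the exact variant used for the SemanticKITTI numbers. The last line simply checks the "approximately 16x" figure against the two values quoted in the abstract.

```python
import numpy as np

def squared_chamfer(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric squared Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared distances, shape (N, M). Fine for small clouds;
    # large LiDAR scans would need a KD-tree or chunking instead.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

cloud_a = np.array([[0.0, 0.0, 0.0]])
cloud_b = np.array([[1.0, 0.0, 0.0]])
toy_distance = squared_chamfer(cloud_a, cloud_b)

# Reduction factor implied by the values reported in the abstract (in m^2).
reported_reduction = 0.396 / 0.024
```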

TEASR: Training-Efficient Any-Step Diffusion Transformer for Real-World Image Super-Resolution

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.16188

中文标题:TEASR:用于真实世界图像超分辨率的训练高效任意步扩散Transformer

作者:Xiang Gao, Chenxin Zhu, Yushun Fang, Qiang Hu, Xiaoyun Zhang

摘要:

Diffusion models excel in Real-World Image Super-Resolution (Real-ISR) due to their powerful generative priors but suffer from slow iterative sampling. Although existing one-step distillation methods accelerate inference, they typically require auxiliary teacher models that inflate training memory and restrict scalability to large-scale architectures. Furthermore, these fixed-step models lack the flexibility to trade off speed for quality. In this paper, we propose TEASR, a training-efficient any-step diffusion framework for Real-ISR that enables both one-step and multi-step restoration within a unified model. Our key idea is to perform self-adversarial distillation within a single diffusion model, eliminating the need for auxiliary teachers or discriminators. Specifically, we propose a timestep-aware rectification strategy that stabilizes one-step generation across noise levels. These two designs further enable the distillation of 20B-parameter diffusion models on a single GPU, significantly improving training efficiency. Moreover, we introduce a dual-branch diffusion transformer with decoupled timestep conditions that separates the current noise state from the denoising target to enhance sampling quality. Extensive experiments demonstrate that TEASR supports seamless any-step sampling and consistently outperforms state-of-the-art methods across multiple datasets.

摘要中文:

扩散模型凭借其强大的生成先验在真实世界图像超分辨率任务中表现出色,但存在采样迭代速度慢的问题。尽管现有的一步蒸馏方法能够加速推理,但通常需要辅助教师模型,这增加了训练内存消耗并限制了其向大规模架构的可扩展性。此外,这些固定步数的模型缺乏灵活调整速度与质量之间权衡的能力。本文提出TEASR,一个用于真实世界图像超分辨率的训练高效任意步扩散框架,能够在同一统一模型中实现一步和多步恢复。我们的核心思想是在单一扩散模型内进行自对抗蒸馏,从而无需辅助教师模型或判别器。具体而言,我们提出了一种时间步感知校正策略,以稳定跨噪声水平的一步生成。这两种设计使得能够在单GPU上蒸馏200亿参数的扩散模型,显著提升了训练效率。此外,我们引入了一种带有解耦时间步条件的双分支扩散Transformer,以分离当前噪声状态与去噪目标,从而增强采样质量。大量实验表明,TEASR支持无缝任意步采样,并在多个数据集上持续优于最先进的方法。

Structure-Semantic Co-optimized Latent Diffusion Model for Fast Visual Anagram Synthesis

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.16241

中文标题:用于快速视觉字谜合成的结构-语义协同优化潜在扩散模型

作者:Xiang Gao, Yunpeng Jia

摘要:

Visual anagram is an intriguing form of art creation wherein a single image presents different conceptual interpretations under transformations such as flipping or rotation. Recent work has achieved visual anagram synthesis by leveraging pretrained text-to-image (T2I) diffusion models, yet still suffers from several key limitations including computational inefficiency, suboptimal aesthetic quality, and weak semantic fidelity and expressiveness. This work focuses on generating visual anagrams with substantially improved visual quality at minimal computational cost, thereby advancing intelligent creation of illusionary digital art. To increase image resolution while reducing time overhead, we adapt the cutting-edge parallel denoising algorithm from pixel-based T2I model to the adversarially distilled latent-based one, and accordingly propose a structure-semantic co-optimization (S2CO) framework to counteract the consequent visual degradation. As the core of our approach, S2CO framework comprises three key innovations: (i) null-text structure alignment optimization; (ii) semantic enhancement optimization; (iii) attention-guided noise fusion. Building upon these components, our method dubbed S2CO-Anagram is able to generate higher-resolution anagram images with noticeably superior visual harmony and semantic faithfulness than related SOTA approaches, all while achieving substantially faster inference speed. Code will be publicly available.

摘要中文:

视觉字谜是一种独特的艺术创作形式,其特点在于单张图像在翻转或旋转等变换下呈现不同的概念解读。当前研究虽已利用预训练文生图扩散模型实现视觉字谜合成,但仍存在计算效率低、审美质量欠佳、语义保真度和表达力不足等关键问题。本研究聚焦于以极低计算成本生成高质量视觉字谜,从而推动幻觉数字艺术的智能创作。为在提升图像分辨率的同时降低时间开销,我们将基于像素的文生图模型中的前沿并行去噪算法适配至对抗蒸馏的潜在扩散模型,并相应提出结构-语义协同优化框架以应对随之而来的视觉退化问题。作为本方法的核心,结构-语义协同优化框架包含三项关键创新:(一)空文本结构对齐优化;(二)语义增强优化;(三)注意力引导噪声融合。基于这些组件,本方法名为S2CO-Anagram,能够生成更高分辨率的字谜图像,在视觉和谐度和语义保真度方面显著优于相关前沿方法,同时实现更快的推理速度。代码将公开发布。

Training-free sparse attention based on cumulative energy filtering

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.16317

中文标题:基于累积能量过滤的无训练稀疏注意力

作者:Chunlu Li, Yixuan Pan, Bai Du, Zhenyuan Chen, Yanzhao Li, Hui Dong, Hui Wang, Zhiqiang Zou

摘要:

Sparse attention accelerates Diffusion Transformers (DiTs) for video generation by computing only the important tokens while skipping the rest. The token selection strategy is key to balancing sparsity and accuracy. We formulate the token filtering process as a dual-goal optimization problem: maximizing sparsity and minimizing accuracy degradation. Existing algorithms cannot fulfill both objectives simultaneously. For example, Top-p only considers the accuracy constraint, while Top-k maintains a fixed computational budget but loosens the accuracy constraint. This paper demonstrates that maintaining a fixed recall rate is sufficient for ensuring accuracy, whereas a fixed threshold is suboptimal for reducing computational cost. Therefore, we propose a dynamic thresholding scheme to improve sparsity while maintaining the same level of accuracy. Furthermore, our algorithm is deeply integrated with Flash Attention (FA), eliminating the need for any additional masking computation overhead. Experimental results on Wan 2.2 validate that, compared to the BLASST algorithm which is also integrated with FA, our dynamic thresholding strategy enhances sparsity from 61.42% to 82% with a VBench metric drop of less than 5%. This results in an approximately 15% reduction in attention computation and a $1.61\times$ increase in computational efficiency, which is $1.18\times$ higher than that of BLASST.

摘要中文:

稀疏注意力通过仅计算重要令牌而跳过其余令牌来加速用于视频生成的扩散Transformer(DiTs)。令牌选择策略是平衡稀疏性和准确性的关键。我们将令牌过滤过程形式化为一个双目标优化问题:最大化稀疏性并最小化精度下降。现有的算法无法同时实现这两个目标。例如,Top-p仅考虑精度约束,而Top-k保持固定的计算预算但放松了精度约束。本文证明,保持固定的召回率足以确保精度,而固定阈值对于降低计算成本是次优的。因此,我们提出了一种动态阈值方案,在保持相同精度水平的同时提高稀疏性。此外,我们的算法与Flash Attention(FA)深度集成,无需任何额外的掩码计算开销。在Wan 2.2上的实验验证表明,与同样集成FA的BLASST算法相比,我们的动态阈值策略将稀疏性从61.42%提高到82%,同时VBench指标下降小于5%。这使得注意力计算量减少约15%,计算效率提升1.61倍,比BLASST高出1.18倍。
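The contrast between a fixed budget (Top-k) and a fixed recall target (cumulative attention mass) can be shown on a single attention row. This is a plain-Python illustration of the two selection rules only; it is not the paper's dynamic-threshold algorithm or its Flash Attention integration, and the function names are hypothetical.

```python
import math
from typing import List

def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_top_k(probs: List[float], k: int) -> List[int]:
    """Fixed budget: always keep the k largest attention weights."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(order[:k])

def select_by_cumulative_mass(probs: List[float], recall: float) -> List[int]:
    """Fixed recall: keep the fewest keys whose total attention mass reaches `recall`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= recall:
            break
    return sorted(kept)

peaked_row = softmax([8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
flat_row = softmax([0.0] * 8)

peaked_kept = select_by_cumulative_mass(peaked_row, recall=0.95)
flat_kept = select_by_cumulative_mass(flat_row, recall=0.95)
flat_top2_mass = sum(flat_row[i] for i in select_top_k(flat_row, 2))
```

On the peaked row a single key already covers 95% of the mass, so the recall rule is far sparser than any fixed `k > 1`. On the flat row it keeps every key, whereas Top-2 would retain only a quarter of the attention mass. That asymmetry is the trade-off the abstract describes between the two families of rules.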

GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.16354

中文标题:GraphBEV++:自动驾驶中的多模态特征对齐

作者:Ziying Song, Caiyan Jia, Lin Liu, Shaoqing Xu, Lei Yang, Yadan Luo

摘要:

Feature misalignment in BEV perception is a critical yet often overlooked challenge in autonomous driving, especially under calibration uncertainties between LiDAR and camera sensors. To address this issue, we propose a robust multi-modal fusion framework, GraphBEV++, which systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with BEVFusion and BEVFormer architectures for consistent cross-paradigm alignment. GlobalAlign-v2 encompasses two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.

摘要中文:

BEV感知中的特征错位是自动驾驶中一个关键但经常被忽视的挑战,尤其是在激光雷达和相机传感器之间存在校准不确定性的情况下。为解决这一问题,我们提出了一种鲁棒的多模态融合框架GraphBEV++,该框架能够系统性地减轻投影引起的特征错位。该框架包含两个关键模块:LocalAlign-v2和GlobalAlign-v2。LocalAlign-v2通过图匹配引入邻域感知深度特征来纠正局部错位,同时支持基于LSS和基于查询的BEV表示,使其与BEVFusion和BEVFormer架构兼容,实现跨范式的一致性对齐。GlobalAlign-v2包含两个变体:可变形变体和扩散变体。可变形变体通过显式学习跨模态特征偏移来解决基于LSS的多模态BEV中的全局错位问题;而扩散变体则针对基于查询的BEV中的隐式错位问题,通过注入噪声来模拟错位,并采用去噪过程来恢复对齐特征。实验结果表明,GraphBEV++在nuScenes和Waymo子集上的错位噪声条件下达到了最先进的性能,在Argoverse2上提升了远程检测能力,并能够有效泛化到3D占用预测任务,在干净和噪声环境下均能持续提升占用估计的准确性和鲁棒性。此外,GraphBEV++还能有效缓解端到端自动驾驶中的错位问题。与五个基线模型(UniAD、VAD、FusionAD、MomAD和WoTE)相比,它在nuScenes的开环评估以及Bench2Drive和NAVSIM的闭环评估中,在感知、预测和规划任务上均展现出更优异的性能。

SP³: Spherical Priors for Plug-and-Play Restoration

2026-06-16T04:00:00 · cs.CV, diffusion, eess.IV · 2606.16396

中文标题:SP³: 球面先验用于即插即用图像恢复

作者:Sean Man, Ron Raphaeli, Matan Kleiner, Or Ronai

摘要:

In this paper, we introduce SP³, a novel Plug-and-Play algorithm that accelerates maximum a posteriori image restoration by replacing denoisers with Spherical Encoders (SE) as generative priors. SP³ approximates the intractable proximal prior step by utilizing the SE's tightly structured latent space as a robust projection onto the natural image manifold. Alternating this projection with a closed-form data-consistency step, via Half-Quadratic Splitting, achieves stable convergence without requiring gradient computation during inference. This unique formulation unlocks "anytime" restoration capabilities, producing sharp, plausible images from the first iteration. Evaluations across a variety of image restoration tasks demonstrate that SP³ achieves perceptual quality comparable to state-of-the-art zero-shot diffusion and flow methods while being 3-630× faster.

摘要中文:

本文提出SP³,一种新型即插即用算法,通过用球面编码器(SE)作为生成先验替代去噪器来加速最大后验概率图像恢复。SP³利用SE结构紧密的潜在空间作为向自然图像流形的稳健投影,以近似难以处理的近端先验步骤。通过半二次分裂法将该投影与闭式的数据一致性步骤交替进行,实现了稳定的收敛,且在推理过程中无需计算梯度。这种独特的表述带来了“随时”恢复能力,从第一次迭代起就能生成清晰、逼真的图像。在各种图像恢复任务上的评估表明,SP³达到了与最先进的零样本扩散和流方法相当的感知质量,同时速度提升了3至630倍。
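
The alternation described in the SP³ abstract (a closed-form data-consistency step followed by a projection onto a prior set) is the general shape of Half-Quadratic Splitting. The sketch below shows that loop on a toy 1-D inpainting problem. The degradation is assumed to be a binary mask, and `project_to_prior` is a hypothetical stand-in (projection onto constant signals), not the paper's Spherical Encoder.

```python
import numpy as np

def data_step(y, mask, z, mu):
    """Closed-form minimizer of 0.5*||mask*x - y||^2 + 0.5*mu*||x - z||^2."""
    return (mask * y + mu * z) / (mask + mu)

def project_to_prior(x):
    """Hypothetical stand-in for the learned projection: nearest constant signal."""
    return np.full_like(x, x.mean())

def hqs_restore(y, mask, mu=0.5, n_iters=50):
    """Alternate the two steps; no gradients are computed anywhere."""
    z = np.zeros_like(y)
    for _ in range(n_iters):
        x = data_step(y, mask, z, mu)   # enforce consistency with the measurements
        z = project_to_prior(x)         # pull the estimate back onto the prior set
    return z

truth = np.full(8, 3.0)
mask = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])  # observe every other sample
y = mask * truth
restored = hqs_restore(y, mask)
print(restored)
```

Because the mask is diagonal, the data step needs only elementwise arithmetic. A general linear operator would need a linear solve in its place.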

ResEdit: Residual embeddings for precise generative image editing

2026-06-16T04:00:00 · cs.CV, cs.GR, diffusion · 2606.16457

中文标题:ResEdit:用于精确生成式图像编辑的残差嵌入

作者:Ahmet Canberk Baykal, Valentin Deschaintre, Yannick Hold-Geoffroy, Michael Fischer, Anna Frühstück, Cengiz Öztireli, Iliyan Georgiev

摘要:

Conditional diffusion image generators can be repurposed for editing through inversion, without the need for large-scale paired fine-tuning data. However, producing high-quality, targeted edits while maintaining image identity and global consistency remains challenging, as weakly conditioned inversion often embeds conflicting image features into the noise. We demonstrate that incorporating a residual image encoding as additional conditioning enables both improved identity preservation and better editability. We optimize this residual encoding to provide a strong conditioning signal for reconstruction, thereby reducing the reliance on inversion and susceptibility to its aforementioned pitfalls. To ensure this residual does not interfere with desired edits, we incorporate a gradient reversal-based optimization strategy that disentangles the residual from the edited condition. We illustrate our method's ability to produce high-fidelity results across precise intrinsic-based editing and relighting, and show proof-of-concept text-guided manipulation.

摘要中文:

条件扩散图像生成器可通过反演技术进行图像编辑,无需大规模成对微调数据。然而,在保持图像身份和全局一致性的同时实现高质量、针对性的编辑仍具挑战性,因为弱条件反演常将相互冲突的图像特征嵌入到噪声中。我们证明,引入残差图像编码作为额外条件能够同时改善身份保持和可编辑性。我们对该残差编码进行优化,使其为重建提供强条件信号,从而减少对反演的依赖,并降低受其上述缺陷影响的程度。为确保该残差不会干扰预期编辑,我们采用基于梯度反转的优化策略将其与编辑条件解耦。我们验证了本方法在基于本征属性的精确编辑和重光照任务中生成高保真结果的能力,并展示了文本引导操控的概念验证。

MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.16673

中文标题:MMDiff:扩展扩散变换器用于多模态生成

作者:Yagmur Akarken, Orest Kupyn, Christian Rupprecht

摘要:

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.

摘要中文:

扩散变换器已展现出卓越的生成能力,然而其去噪轨迹中计算的丰富感知表征在内容渲染完成后即被丢弃。我们提出 MMDiff 框架,该框架将冻结的扩散变换器转换为多模态生成系统,能够通过轻量级解码器头协同生成图像及任意组合的密集感知模态。我们的核心发现是:感知信息沿去噪轨迹时间分布,且使用空间变化聚合权重的多时间步特征融合至关重要,相比单时间步提取可将语义分割结果提升高达 28.7% mIoU。我们进一步采用概念驱动注意力提取实现可解释的空间引导,并表明冻结的扩散特征与 DINOv3 等最先进编码器相比具有竞争力且互为补充。通过仅在冻结的主干网络上训练轻量级解码器头,我们在语义分割、显著性目标检测和深度估计任务中取得了优异性能,并证明该框架能够实现大规模有效合成数据生成。
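
The MMDiff abstract credits its gains to fusing features from several denoising timesteps with weights that vary per spatial location. The sketch below shows one simple form such a fusion could take. The per-pixel softmax over timesteps and the function name `fuse_timesteps` are assumptions for illustration, since the abstract does not specify the weighting.

```python
import numpy as np

def fuse_timesteps(features, logits):
    """Fuse per-timestep feature maps with per-pixel weights.

    features: (T, H, W, C) feature maps taken at T denoising timesteps
    logits:   (T, H, W) unnormalized per-pixel scores for each timestep
    Returns the fused (H, W, C) map and the (T, H, W) weights.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over timesteps
    fused = (weights[..., None] * features).sum(axis=0)
    return fused, weights

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 2, 2, 4))
logits = np.zeros((3, 2, 2))
logits[1, 0, 0] = 50.0  # pixel (0, 0) relies almost entirely on timestep 1
fused, weights = fuse_timesteps(features, logits)
print(fused.shape, weights[:, 0, 0].round(3))
```

In a trained system the logits would come from a small learned head. Here they are set by hand so that one pixel selects a single timestep while the others average uniformly.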

Redirecting the Flow: Image Customization through Attention Distribution Shift

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.16866

中文标题:重定向流向:通过注意力分布偏移实现图像定制

作者:Jie Li, Suorong Yang, Jian Zhao, Furao Shen

摘要:

Subject-driven image customization aims to generate images that not only follow textual instructions but also preserve the identity of a given reference subject. Existing approaches, including test-time fine-tuning, encoder-based methods, and token competition in shared attention spaces, suffer from limited efficiency, misalignment between extracted reference features and the generative process, and interference from irrelevant information. To address these limitations, we formulate the customization task as a distribution shift induced by incorporating reference images into text-to-image generation, and derive a Conditional Attention Distribution Shift formulation grounded in maximum entropy theory. Building on this formulation, we propose CustomShift, a dual-branch architecture based on Stable Diffusion 3. The Reference-Alignment Branch leverages self-attention between reference images and subject names to achieve layer-wise alignment with latent representations, while the Cross-Guidance Branch integrates textual and reference cues to guide generation. Experiments on the DreamBooth and Custom101 benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, achieving a better balance between semantic fidelity and subject consistency.

摘要中文:

主体驱动的图像定制旨在生成既遵循文本指令又保留给定参考主体身份的图像。现有的方法,包括测试时微调、基于编码器的方法以及共享注意力空间中的token竞争,在效率、提取的参考特征与生成过程之间的对齐以及无关信息的干扰方面存在局限。为解决这些局限性,我们将定制任务形式化为将参考图像纳入文本到图像生成所诱导的分布偏移,并推导出基于最大熵理论的条件注意力分布偏移公式。基于该公式,我们提出了CustomShift,一个基于Stable Diffusion 3的双分支架构。参考对齐分支利用参考图像与主体名称之间的自注意力实现与潜在表示的逐层对齐,而跨引导分支则整合文本和参考线索来引导生成。在DreamBooth和Custom101基准数据集上的实验表明,我们的方法始终优于最先进的方法,在语义保真度和主体一致性之间实现了更好的平衡。

DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation

2026-06-16T04:00:00 · autoregressive, cs.CV, cs.GR, cs.RO, diffusion · 2606.14721

中文标题:DC-Motion:通过离散-连续令牌解耦语义与细节的人体动作生成

作者:Hequan Wang, Jiaxu Zhang, Zhengbo Zhang, Zhigang Tu

摘要:

Text-to-motion generation requires synthesizing physically realistic dynamics that strictly follow complex and long-horizon textual instructions. Existing approaches rely on homogeneous representation spaces that may fail to capture the hierarchical nature of human motion, with diffusion models struggling at compositional semantic reasoning and AR models sacrificing fine-grained physical details due to quantization. To solve it, we introduce DC-Motion, a factorized generative framework designed to explicitly decouple semantics and details via discrete-continuous tokens. A Discrete-Continuous VAE (DC-VAE) first decomposes motion into discrete tokens for semantics and continuous residuals for fine-grained dynamics. Then, a masked AR model predicts the discrete structure from text, and a lightweight residual diffusion model recovers the continuous physical details. Extensive experiments demonstrate that DC-Motion effectively improves the capability to follow complex instructions. By effectively balancing semantic controllability and physical realism, our approach offers a highly adaptable modeling paradigm for human motion generation. On both HumanML3D and KIT-ML datasets, DC-Motion achieves state-of-the-art performance, delivering the best FID for motion realism and R-precision for text alignment.

摘要中文:

文本到动作生成需要合成严格遵循复杂且长时序文本指令的物理真实动力学。现有的方法依赖于同质化表示空间,可能无法捕捉人体动作的层级特性,其中扩散模型在组合语义推理方面存在困难,而自回归模型则因量化而牺牲了细粒度的物理细节。为解决这一问题,我们提出了DC-Motion,一个通过离散-连续令牌显式解耦语义与细节的因子化生成框架。离散-连续VAE(DC-VAE)首先将动作分解为用于语义的离散令牌和用于细粒度动力学的连续残差。然后,掩码自回归模型从文本预测离散结构,轻量级残差扩散模型恢复连续的物理细节。大量实验表明,DC-Motion有效提升了遵循复杂指令的能力。通过有效平衡语义可控性和物理真实感,我们的方法为人体动作生成提供了一个高度适应性强的建模范式。在HumanML3D和KIT-ML数据集上,DC-Motion实现了最先进的性能,在动作真实性的FID和文本对齐的R-precision方面均取得了最佳结果。
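
The discrete-continuous split in DC-Motion can be illustrated with a plain vector-quantization step: each latent is mapped to its nearest codebook entry (the discrete token), and what is left over is kept as a continuous residual. This is a minimal sketch assuming nearest-neighbour quantization. The actual DC-VAE design is not given in the abstract, and `encode` and `decode` are hypothetical names.

```python
import numpy as np

def encode(latents, codebook):
    """Split latents (N, D) into discrete token ids and continuous residuals."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)            # coarse, discrete part
    residual = latents - codebook[tokens]    # fine, continuous part
    return tokens, residual

def decode(tokens, residual, codebook):
    """Recombine the discrete and continuous parts."""
    return codebook[tokens] + residual

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))   # 16 codes of dimension 8 (toy sizes)
latents = rng.normal(size=(5, 8))
tokens, residual = encode(latents, codebook)
reconstructed = decode(tokens, residual, codebook)
print(tokens, float(np.abs(residual).mean()))
```

In the pipeline the abstract describes, an autoregressive model would predict `tokens` from text and a separate diffusion model would generate `residual`. Here both come from the encoder, so reconstruction is exact up to floating-point error.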

VANDERER: Map-Free Exploration using Future-Aware and Visual-Curiosity-Guided Diffusion Policy

2026-06-16T04:00:00 · cs.CV, cs.LG, cs.RO, diffusion · 2606.14879

中文标题:VANDERER:基于未来感知与视觉好奇心引导的免地图探索扩散策略

作者:Venkata Naren Devarakonda, Raktim Gautam Goswami, Prashanth Krishnamurthy, Farshad Khorrami

摘要:

Mobile agents require efficient exploration strategies to map unseen environments and autonomously plan tasks. Traditional methods rely on generating occupancy maps and optimizing the sequence in which unexplored regions are visited. However, in sensor-constrained settings, such as those limited to monocular cameras, generating accurate occupancy maps is challenging. To address this, we propose VANDERER, an exploration framework that leverages a Visual Curiosity Module (VCM) to guide pre-trained diffusion policies using only monocular image data. This curiosity module predicts the outcomes of proposed actions via a navigation world model and evaluates them through a curiosity cost. The cost then guides the diffusion process toward generating actions that maximize exploration. Evaluated across diverse simulated environments, VANDERER consistently outperforms established baselines, exploring an average of 13.4% more area than NoMaD. Our results reveal a direct correlation between visual and geometric curiosity in outdoor environments, demonstrating that VANDERER can effectively leverage this relationship for efficient exploration using sensor-constrained agents.

摘要中文:

移动智能体需要高效的探索策略来绘制未见环境的地图并自主规划任务。传统方法依赖于生成占用地图并优化访问未探索区域的顺序。然而,在传感器受限的环境中,例如仅配备单目相机的场景,生成精确的占用地图具有挑战性。为解决这一问题,我们提出了VANDERER,一种利用视觉好奇心模块(VCM)仅使用单目图像数据引导预训练扩散策略的探索框架。该好奇心模块通过导航世界模型预测所提动作的结果,并基于好奇心代价对其进行评估。随后,该代价引导扩散过程生成能够最大化探索面积的动作。在多种模拟环境中进行评估,VANDERER始终优于现有基线方法,其探索面积比NoMaD平均多13.4%。我们的结果揭示了户外环境中视觉好奇心与几何好奇心之间的直接相关性,表明VANDERER能够有效利用这一关系实现传感器受限智能体的高效探索。

Temporal Difference Learning for Diffusion Models

2026-06-16T04:00:00 · cs.CV, cs.LG, diffusion · 2606.15048

中文标题:扩散模型的时序差分学习

作者:Qizhen Ying, Yangchen Pan, Victor Adrian Prisacariu, Junfeng Wen

摘要:

Diffusion models are typically trained with objectives that focus on local denoising targets at individual time steps (or adjacent pairs), which do not enforce consistency between predictions along the denoising trajectory. This lack of cross-time consistency can degrade performance, especially for few-step samplers. We introduce a temporal difference (TD) objective that penalizes inconsistency of the model's multi-step progress along the denoising path. By reformulating the diffusion process as a Markov reward process and casting denoising as a policy evaluation problem in reinforcement learning, we derive a unified TD approach that applies to both discrete- and continuous-time diffusion formulations. We further propose a principled sample-based reweighting method that stabilizes training. Empirically, we show that using our TD training can significantly improve sample quality measured by FID, with stronger advantages when the number of sampling steps is small, highlighting its practical utility under low-computation-budget scenarios. We provide ablation studies to justify our design choices, including pairwise loss reweighting, regularization weight, and one-step stride. Overall, our TD approach can be a general drop-in that enforces cross-time consistency and improves generation quality across different diffusion generative models.

摘要中文:

扩散模型的训练目标通常聚焦于单个时间步(或相邻时间步对)的局部去噪目标,并未强制执行去噪轨迹上预测之间的跨时间一致性。这种跨时间一致性的缺失可能导致性能下降,尤其是在少步采样器中。我们引入了一种时序差分(TD)目标,用于惩罚模型在去噪路径上多步进展的不一致性。通过将扩散过程重新表述为马尔可夫奖励过程,并将去噪问题转化为强化学习中的策略评估问题,我们推导出一种统一的TD方法,适用于离散时间和连续时间的扩散公式。我们进一步提出了一种基于样本的原则性重加权方法,以稳定训练过程。实验表明,使用我们的TD训练可以显著提升以FID衡量的样本质量,且在采样步数较少时优势更为明显,突出了其在低计算预算场景下的实用价值。我们提供了消融实验以验证设计选择,包括成对损失重加权、正则化权重和单步跨度。总体而言,我们的TD方法可以作为一种通用的即插即用模块,强制执行跨时间一致性,并提升不同扩散生成模型的生成质量。

MoECa: Aligning Feature Reuse with Expert Decomposition in Diffusion Transformers

2026-06-16T04:00:00 · cs.CV, cs.LG, diffusion · 2606.15615

中文标题:MoECa: 在扩散Transformer中以专家分解对齐特征重用

作者:Maoliang Li, Haojing Chen, Jiayu Chen, Zihao Zheng, Xinhao Sun, Hailong Zou, Xiang Chen

摘要:

Diffusion Transformers with Mixture-of-Experts (DiT-MoE) improve model capacity under sparse activation, but diffusion inference is still bottlenecked by redundant computation across timesteps. Existing caching methods mainly operate at the token level, which becomes suboptimal in DiT-MoE because each token update is internally decomposed into multiple routed expert branches. Our analysis shows that cross-timestep redundancy in DiT-MoE is better characterized at the expert-branch level than at the whole-token level. Based on this observation, we propose MoECa, a fine-grained caching framework that performs branch-level feature reuse across timesteps. MoECa further introduces expert-aware adaptive control and synchronized cache updates across MoE and attention paths to maintain stable intermediate states. Experiments on multiple DiT-MoE models show that MoECa consistently achieves a better speed-quality trade-off than prior caching methods, with up to 2.83× inference speedup and minimal quality degradation.

摘要中文:

扩散专家混合Transformer(DiT-MoE)在稀疏激活下提升了模型容量,但扩散推理仍受时间步间冗余计算的瓶颈限制。现有缓存方法主要在token级别操作,这在DiT-MoE中并非最优,因为每个token更新在内部被分解为多个路由专家分支。我们的分析表明,DiT-MoE中跨时间步的冗余在专家分支级别比在整个token级别更能准确刻画。基于这一观察,我们提出了MoECa,一个细粒度缓存框架,在时间步间执行分支级特征重用。MoECa进一步引入专家感知自适应控制以及跨MoE和注意力路径的同步缓存更新,以维持稳定的中间状态。在多个DiT-MoE模型上的实验表明,MoECa始终获得比现有缓存方法更好的速度-质量权衡,最高可达2.83倍推理加速,且质量损失极小。
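
Branch-level reuse, as the MoECa abstract describes it, means deciding per expert branch (rather than per token) whether the previous timestep's output can be reused. The sketch below is a deliberately simplified illustration. The relative-change test, the tolerance, and the `BranchCache` class are assumptions, and it ignores routing changes between timesteps and the synchronized attention-path updates the paper mentions.

```python
import numpy as np

class BranchCache:
    """Cache one output per expert branch and reuse it when the input barely moved."""

    def __init__(self, tol):
        self.tol = tol
        self.inputs = {}
        self.outputs = {}
        self.hits = 0

    def run(self, expert_id, expert_fn, x):
        prev = self.inputs.get(expert_id)
        if prev is not None and prev.shape == x.shape:
            rel_change = np.linalg.norm(x - prev) / (np.linalg.norm(prev) + 1e-8)
            if rel_change < self.tol:
                self.hits += 1
                return self.outputs[expert_id]   # reuse the cached branch output
        out = expert_fn(x)                        # recompute and refresh the cache
        self.inputs[expert_id] = x.copy()
        self.outputs[expert_id] = out
        return out

# Two toy "experts" (hypothetical stand-ins for expert MLPs).
experts = {0: lambda x: 2.0 * x, 1: lambda x: x + 1.0}
cache = BranchCache(tol=0.05)
x0 = np.ones(4)

# Timestep 1: both branches are computed and cached.
cache.run(0, experts[0], x0)
cache.run(1, experts[1], x0)

# Timestep 2: branch 0 sees a nearly identical input, branch 1 a very different one.
out0 = cache.run(0, experts[0], x0 * 1.001)
out1 = cache.run(1, experts[1], x0 * 2.0)
print(cache.hits, out0, out1)
```

The point of the per-branch key is that one expert can be served from cache while another, whose input changed, is recomputed in the same step.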

R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies

2026-06-16T04:00:00 · cs.CV, cs.RO, diffusion · 2606.17040

中文标题:R2RDreamer: 面向空间泛化2D操作策略的3D感知数据增强

作者:Xiuwei Xu, Haowen Sun, Angyuan Ma, Yiwei Zhang, Zhenyu Wu, Xiaofeng Wang, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu

摘要:

Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Data augmentation from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and often produce observations tailored to 3D pointcloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object pointclouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, with analyses validating the contributions of 3D editing, occlusion-aware projection, and video completion.

摘要中文:

空间泛化对于模仿学习操作策略至关重要,但实现空间泛化通常需要在多样化的物体姿态、机器人配置和相机视角下扩展演示数据。从少量源演示进行数据增强为昂贵的真实世界数据收集提供了一种实用的替代方案。基于模拟的数据增强可以创建可控的变化,但需要复杂的环境和物体设置,并可能引入模拟到真实的差距。近期提出的真实到真实方法通过联合编辑真实演示中的3D观测和动作轨迹来避免这些问题,但它们仍然依赖于强3D场景解析和几何补全,且通常生成的观测适用于3D点云策略而非基于RGB的2D策略。我们提出了R2RDreamer,一个真实到真实演示增强框架,它在将视觉补全迁移到2D视频空间的同时保留了3D动作-观测编辑的几何一致性。具体而言,R2RDreamer首先通过在共享3D坐标系中编辑不完整的物体点云和末端执行器轨迹来进行轻量级3D增强;然后将编辑后的场景投影到带有遮挡感知推理的掩码图像空间控制视频中,并使用密集控制图像到视频模型补全出时间一致的RGB观测。在空间偏移操作任务上的实验表明,无论是对2D扩散风格策略还是视觉语言动作策略,R2RDreamer都能从有限的源演示中提升空间泛化能力,分析验证了3D编辑、遮挡感知投影和视频补全的贡献。
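
The first stage in the R2RDreamer abstract edits object points and end-effector waypoints in one shared 3D frame, then projects the result to image space. The sketch below shows why a shared rigid transform keeps action and observation geometrically consistent. It assumes a pinhole camera with points already expressed in the camera frame, uses hypothetical helper names, and omits the occlusion reasoning and video completion entirely.

```python
import numpy as np

def rigid_transform(yaw, translation):
    """4x4 rigid transform: rotation about the z-axis followed by a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def apply_transform(T, points):
    """Apply a 4x4 rigid transform to (N, 3) points."""
    return points @ T[:3, :3].T + T[:3, 3]

def project(points, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    z = points[:, 2]
    u = fx * points[:, 0] / z + cx
    v = fy * points[:, 1] / z + cy
    return np.stack([u, v], axis=1)

object_points = np.array([[0.0, 0.0, 1.0]])
trajectory = np.array([[0.0, 0.1, 1.0], [0.0, 0.0, 1.0]])  # gripper approaches the object

# The same edit is applied to the observation and to the action trajectory.
T = rigid_transform(np.pi / 2, [0.1, 0.0, 0.0])
new_points = apply_transform(T, object_points)
new_trajectory = apply_transform(T, trajectory)

pixels = project(new_points, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
print(pixels, np.linalg.norm(new_trajectory[-1] - new_points[0]))
```

Since both arrays go through the same transform, the gripper-to-object offsets are unchanged, so the augmented demonstration still ends at the (moved) object.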

Exact Posterior Score Estimation for Solving Linear Inverse Problems

2026-06-16T04:00:00 · cs.CV, cs.LG, diffusion, stat.ML · 2606.17048

中文标题:精确后验分数估计用于求解线性逆问题

作者:Abbas Mammadov, Ozgur Kara, Kaan Oktay, Iskander Azangulov, Adil Kaan Akan, Hyungjin Chung, James Matthew Rehg, Yee Whye Teh

摘要:

Diffusion and flow-based models learn powerful data priors by training a denoiser to reverse Gaussian corruption. To use this prior to solve a linear inverse problem, one needs to sample from the posterior, but the score that the prior provides is the unconditional score, not the posterior score. Existing methods either steer a fixed pretrained denoiser with approximate measurement-matching corrections, or train a conditional restoration model that abandons the denoising structure of the prior. We derive the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, and show that posterior sampling reduces to a denoising problem at an operator-dependent shifted pivot under an anisotropic noise covariance. We turn this identity into Exact Posterior Score (EPS), a denoising training objective that preserves the input/output structure of standard pretraining and can therefore be trained from scratch or fine-tuned from a pretrained denoiser. At inference, EPS uses the same sampler as the underlying backbone, with no likelihood gradients or projections. We evaluate EPS on five linear inverse problems across FFHQ and ImageNet, where it outperforms training-free and training-based baselines on fidelity, perceptual, and distributional metrics, while using roughly an order of magnitude fewer denoiser evaluations than gradient-based posterior samplers.

摘要中文:

扩散模型和基于流的模型通过训练去噪器逆向高斯噪声来学习强大的数据先验。要将此先验用于求解线性逆问题,需要从后验分布中采样,但先验提供的分数是无条件分数,而非后验分数。现有方法要么使用近似测量匹配校正来引导固定预训练去噪器,要么训练条件恢复模型而放弃先验的去噪结构。我们在一般高斯插值下推导出线性高斯逆问题的精确后验分数闭式解,并证明在各向异性噪声协方差下,后验采样等价于在算子依赖的偏移轴点处的去噪问题。我们将这一恒等式转化为精确后验分数(EPS)去噪训练目标,它保留了标准预训练的输入/输出结构,因此可以从头训练或从预训练去噪器微调。在推理阶段,EPS使用与骨干模型相同的采样器,无需似然梯度或投影。我们在FFHQ和ImageNet上的五个线性逆问题中评估EPS,在保真度、感知质量和分布质量指标上优于无训练和基于训练的基线方法,同时去噪器评估次数比基于梯度的后验采样器少约一个数量级。

CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning

2026-06-16T04:00:00 · cs.CV, diffusion · 2503.06637

中文标题:CLAD:用于视觉-语言过程规划的约束潜在动作扩散模型

作者:Lei Shi, Andreas Bulling

摘要:

We propose CLAD, a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos. Procedure planning is the challenging task of predicting intermediate actions given a visual observation of a start and a goal state. However, future interactive AI systems must also be able to plan procedures using multi-modal input, e.g., where visual observations are augmented with language descriptions. To tackle this vision-language procedure planning task, our method uses a Variational Autoencoder (VAE) to learn the latent representation of actions and observations as constraints and integrate them into the diffusion process. This approach exploits that the latent space of diffusion models already has semantics that can be used. We use the latent constraints to steer the diffusion model to better generate actions. We report extensive experiments on the popular CrossTask, Coin, and NIV datasets and show that our method outperforms state-of-the-art methods by a large margin. By evaluating ablated versions of our method, we further show that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.

摘要中文:

我们提出了CLAD,这是一个用于教学视频中视觉-语言过程规划的约束潜在动作扩散模型。过程规划是一项具有挑战性的任务,需要在给定起始状态和目标状态的视觉观察下预测中间动作。然而,未来的交互式人工智能系统还必须能够使用多模态输入进行过程规划,例如在视觉观察中加入语言描述。为了解决这一视觉-语言过程规划任务,我们的方法使用变分自编码器(VAE)学习动作和观察的潜在表示作为约束,并将其整合到扩散过程中。该方法利用了扩散模型的潜在空间已具备可用语义这一特点。我们使用潜在约束来引导扩散模型更好地生成动作。我们在流行的CrossTask、Coin和NIV数据集上进行了广泛的实验,结果表明我们的方法大幅优于最先进的方法。通过对我们方法的消融版本进行评估,我们进一步证明了VAE潜在空间中学习的动作和观察表示的整合是这些性能提升的关键。

Dual-branch Prompting for Multimodal Machine Translation

2026-06-16T04:00:00 · cs.CL, cs.CV, diffusion · 2507.17588

中文标题:用于多模态机器翻译的双分支提示方法

作者:Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang

摘要:

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches. Our code is publicly available at https://github.com/MentaY/DDP.

摘要中文:

多模态机器翻译(MMT)通常通过引入对齐的视觉特征来增强仅文本翻译。尽管取得了显著进展,但最先进的MMT方法在推理时往往依赖于成对的图像-文本输入,且对不相关的视觉噪声敏感,这限制了它们的鲁棒性和实际应用范围。为解决这些问题,我们提出了D2P-MMT,一个基于扩散的视觉引导翻译双分支提示框架。具体而言,D2P-MMT仅需要源文本和由预训练扩散模型生成的重建图像,该重建图像能够在保留语义线索的同时自然地过滤掉干扰性的视觉细节。在训练过程中,模型使用双分支提示策略从真实图像和重建图像中联合学习,以促进丰富的跨模态交互。为了弥合模态差距并减少训练-推理差异,我们引入了一种分布对齐损失,用于强制两个分支的输出分布保持一致。在Multi30K数据集上的广泛实验表明,D2P-MMT与现有最先进的方法相比取得了优异的翻译性能。我们的代码已在https://github.com/MentaY/DDP上公开。
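
The D2P-MMT abstract mentions a distributional alignment loss that keeps the two branches' output distributions consistent. One common way to express such a term is a KL divergence between the token distributions of the authentic-image branch and the reconstructed-image branch. The sketch below assumes that form, which the abstract does not confirm, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def alignment_loss(logits_real, logits_recon):
    """Mean KL(p_real || p_recon) over positions; zero when the branches agree."""
    log_p = log_softmax(logits_real)
    log_q = log_softmax(logits_recon)
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_position.mean())

# Toy logits: 2 target positions, vocabulary of 3 tokens.
real_branch = np.array([[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]])
recon_branch = np.array([[1.5, 0.2, -1.0], [0.5, 0.4, 0.6]])

print(alignment_loss(real_branch, real_branch))
print(alignment_loss(real_branch, recon_branch))
```

Whether the loss is one-directional, symmetric, or uses a different divergence is a design choice left open here. The one-directional KL is only the simplest instance.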

Null-Space Diffusion Distillation Unlocks Speed, Fidelity and Realism in Lensless Imaging

2026-06-16T04:00:00 · cs.CV, diffusion · 2511.12024

中文标题:零空间扩散蒸馏解锁无透镜成像的速度、保真度和真实感

作者:Jose Reinaldo Cunha Santos A V Silva Neto, Hodaka Kawachi, Yasushi Yagi, Tomoya Nakamura

摘要:

Lensless imaging reconstructs scenes from highly multiplexed measurements, resulting in a severely ill-posed inverse problem. In this work, we identify a fundamental trade-off between measurement consistency, perceptual quality, and inference speed across lensless reconstruction paradigms. Traditional methods favor consistency but produce perceptually degraded results, supervised approaches achieve high-quality reconstructions with fast inference but may violate physical constraints, and diffusion-prior methods achieve high perceptual quality and consistency (particularly when structured constraints such as range-null decomposition are used) but remain slow due to iterative sampling. Motivated by this observation, we propose Null-Space Diffusion Distillation (NSDD), a single-pass reconstruction model that distills structured diffusion-prior inference into an efficient feed-forward network. NSDD learns to produce high-quality reconstructions that preserve measurement consistency while avoiding costly iterative sampling. Experimental results demonstrate that NSDD achieves perceptual quality and consistency competitive with diffusion-prior methods, while providing significantly faster inference and offering a favorable balance across all three objectives. Furthermore, ablation experiments show that distilling the range-null decomposition improves reconstruction quality and robustness over unstructured full-reconstruction distillation, including on unseen real scenes. These results highlight the potential of structure-aware distillation for efficient lensless imaging. Code is available at github.com/JRCSAVSN/NullSpaceDiffusionDistillation.

摘要中文:

无透镜成像从高度复用测量中重建场景,导致了一个严重不适定的逆问题。在本工作中,我们发现了无透镜重建范式中测量一致性、感知质量和推理速度之间的根本权衡。传统方法优先考虑一致性,但产生感知质量退化的结果;有监督方法实现高质量重建且推理速度快,但可能违反物理约束;扩散先验方法实现高感知质量和高一致性(尤其是当使用range-null分解等结构化约束时),但由于迭代采样而推理速度较慢。基于这一观察,我们提出了零空间扩散蒸馏(Null-Space Diffusion Distillation, NSDD),这是一种单次重建模型,将结构化扩散先验推理蒸馏为高效的前馈网络。NSDD学习生成高质量重建结果,在避免代价高昂的迭代采样的同时保持测量一致性。实验结果表明,NSDD在感知质量和一致性方面与扩散先验方法相当,同时提供了显著更快的推理速度,并在所有三个目标之间实现了有利的平衡。此外,消融实验表明,蒸馏range-null分解提高了重建质量和鲁棒性,优于非结构化的完整重建蒸馏,包括在未见过的真实场景中。这些结果突出了结构感知蒸馏在高效无透镜成像中的潜力。代码可访问github.com/JRCSAVSN/NullSpaceDiffusionDistillation。
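
The range-null decomposition named in the NSDD abstract splits any reconstruction into a part fixed by the measurements and a part the measurements cannot see: x = A⁺y + (I − A⁺A)x̂. The sketch below checks that identity numerically with a small random operator. The matrix sizes and the random "generated" estimate are illustrative assumptions, and no distillation or diffusion prior is involved.

```python
import numpy as np

def range_null_compose(A, y, x_generated):
    """Combine the measured range-space part with the null-space part of a guess."""
    A_pinv = np.linalg.pinv(A)
    range_part = A_pinv @ y                                  # determined by the data
    null_part = x_generated - A_pinv @ (A @ x_generated)     # invisible to A
    return range_part + null_part

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))          # 3 measurements of a 6-dimensional scene
x_true = rng.normal(size=6)
y = A @ x_true                       # noiseless measurements
x_generated = rng.normal(size=6)     # stand-in for a generative model's output

x_hat = range_null_compose(A, y, x_generated)
print(np.abs(A @ x_hat - y).max())
```

Whatever the generator proposes, the composed estimate reproduces the measurements whenever `A` has full row rank, which is why the decomposition is attractive as a consistency constraint.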

Through-Foliage Surface-Temperature Reconstruction for Early Wildfire Detection

2026-06-16T04:00:00 · cs.CV, diffusion · 2511.12572

中文标题:用于早期野火检测的穿透植被地表温度重建方法

作者:Mohamed Youssef, Lukas Brunner, Klaus Rundhammer, Gerald Czech, Oliver Bimber

摘要:

We present a method to reconstruct surface temperatures through forest vegetation by combining signal processing and machine learning, enabling fully automated aerial wildfire monitoring with drones for early fire detection. Synthetic aperture (SA) sensing reduces canopy occlusion but introduces thermal blur. To overcome this, we train a visual state space model to recover subtle thermal signals of partially occluded soil and fire hotspots from blurred data. To address limited real-world training data, we generate realistic surface temperature simulations using a latent diffusion model, temperature augmentation, and procedural thermal forest modeling. On simulated datasets, our method reduces RMSE by a factor of 2-2.5 versus conventional thermal and uncorrected SA imaging; in field experiments on hotspots, RMSE improved by 12.8-fold and 2.6-fold, respectively. Our approach also generalizes to other thermal signals, including human signatures, capturing morphology and extent (critical where simple thresholding fails), while conventional imaging struggles with partial occlusion.

摘要中文:

本研究提出了一种结合信号处理与机器学习的穿透森林植被地表温度重建方法,实现基于无人机的全自动空中野火监测以进行早期火情检测。合成孔径(SA)传感技术可减少树冠遮挡,但会引入热模糊问题。为解决这一问题,我们训练了一个视觉状态空间模型,从模糊数据中恢复部分遮挡土壤和火灾热点的微弱热信号。针对真实训练数据有限的问题,我们利用潜空间扩散模型、温度增强和程序化热森林建模生成逼真的地表温度模拟数据。在模拟数据集上,我们的方法相较于传统热成像和未校正的SA成像将均方根误差降低了2至2.5倍;在野外热点实验中,均方根误差分别改善了12.8倍和2.6倍。我们的方法还可泛化至其他热信号,包括人体特征,能够捕捉形态和范围——在简单阈值处理失效的场景中至关重要——而传统成像在部分遮挡情况下表现困难。

Bridging Information Asymmetry: A Hierarchical Framework for Blind Face Restoration with Reduced Uncertainty

2026-06-16T04:00:00 · cs.CV, diffusion · 2601.19506

中文标题:弥合信息不对称:一种降低不确定性的盲人脸恢复分层框架

作者:Zhengjian Yao, Jiakui Hu, Kaiwen Li, Hangzhou He, Xinliang Zhang, Shuang Zeng, Lei Zhu, Yanye Lu

摘要:

Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative paradigms, while capable of synthesizing realistic facial details, remain limited by the under-constrained nature of blind restoration, where severely degraded inputs can be mapped to plausible yet identity-inconsistent outputs. To address this issue, we present Pref-Restore, a hierarchical framework for BFR with reduced restoration uncertainty. Our design is organized around three complementary principles: (1) Semantic Information Augmentation, where an auto-regressive semantic branch converts image and text cues into structured tokens that provide a stable high-level anchor; (2) Texture-level Fidelity Alignment, where the diffusion generator is trained under this anchor to recover identity-relevant details; and (3) Fidelity-constrained Preference Optimization, where a face-aware reward refines the diffusion trajectory while controlling the quality-fidelity trade-off. Extensive experiments on synthetic and real-world benchmarks show that Pref-Restore achieves state-of-the-art performance, with stronger identity-sensitive fidelity and lower restoration uncertainty across repeated sampling. Systematic ablations further attribute these gains to the proposed hierarchical design, showing the necessity of staged training, the robustness and quality dependence of the text pathway, and the benefit of fidelity-constrained preference optimization.

摘要中文:

盲人脸恢复由于从严重受限的观测中重建整体结构所固有的不适定性而一直是一项挑战。现有的生成范式虽然能够合成逼真的人脸细节,但仍受限于盲恢复的欠约束性质,即严重退化的输入可能被映射到看似合理但身份不一致的输出。为解决这一问题,我们提出了Pref-Restore,一种降低恢复不确定性的BFR分层框架。我们的设计围绕三个互补原则组织:(1)语义信息增强,其中自回归语义分支将图像和文本线索转换为提供稳定高层锚点的结构化令牌;(2)纹理级保真度对齐,其中扩散生成器在此锚点下训练以恢复身份相关细节;(3)保真度约束的偏好优化,其中人脸感知奖励在控制质量-保真度权衡的同时优化扩散轨迹。在合成和真实世界基准数据集上的广泛实验表明,Pref-Restore实现了最先进的性能,在重复采样中具有更强的身份敏感保真度和更低的恢复不确定性。系统的消融实验进一步将这些增益归因于所提出的分层设计,展示了分阶段训练的必要性、文本路径的鲁棒性和质量依赖性,以及保真度约束偏好优化的优势。

Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

2026-06-16T04:00:00 · autoregressive, cs.CV, diffusion · 2602.04789

中文标题:轻量强制:通过稀疏注意力加速自回归视频扩散

作者:Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang

摘要:

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. Such two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2-1.3× end-to-end speedup). Combined with other efficient solutions, Light Forcing further achieves a 2.0-3.0× end-to-end speedup across diverse GPUs (e.g., 27.4 FPS on RTX 5090 and 33.9 FPS on H100). Code is released at https://github.com/chengtao-lv/LightForcing.

摘要中文:

先进的自回归(AR)视频生成模型提升了视觉保真度和交互性,但注意力机制的二次复杂度仍是高效部署的主要瓶颈。现有的稀疏注意力方案在双向模型上展现出良好的应用前景,但我们发现将其应用于自回归模型会导致显著的性能下降,原因有二:对分块生成的孤立考虑,以及对历史信息上下文的利用不足。针对这些问题,我们提出了Light Forcing,这是首个专为自回归视频生成模型设计的稀疏注意力方案。该方法引入了一种分块感知增长机制,用于定量评估每个分块的贡献程度,进而确定其稀疏度分配。这种渐进式稀疏度增长策略使得当前分块在生成过程中能够继承先前分块的先验知识。此外,我们还提出了分层稀疏注意力机制,以由粗到细的方式捕获信息丰富的历史和局部上下文。这种两级掩码选择策略(即帧级和块级)能够自适应地处理多样化的注意力模式。大量实验表明,我们的方法在质量(例如VBench上84.5分)和效率(例如端到端加速1.2至1.3倍)方面均优于现有稀疏注意力方法。结合其他高效方案,Light Forcing在多种GPU上进一步实现了2.0至3.0倍的端到端加速(例如RTX 5090上27.4 FPS,H100上33.9 FPS)。代码已发布于 https://github.com/chengtao-lv/LightForcing。
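
The two-level mask selection mentioned in the Light Forcing abstract (frame level, then block level) can be pictured as a coarse-to-fine top-k. The sketch below is a guess at the general shape only. The mean-score frame ranking, the fixed keep counts, and the name `hierarchical_mask` are assumptions, and the Chunk-Aware Growth schedule is not modelled.

```python
import numpy as np

def hierarchical_mask(block_scores, keep_frames, keep_blocks):
    """Select frames first, then the best blocks inside each selected frame.

    block_scores: (n_frames, n_blocks) importance of each key block for one query
    Returns a boolean mask of the same shape; True means "attend to this block".
    """
    n_frames, n_blocks = block_scores.shape
    mask = np.zeros((n_frames, n_blocks), dtype=bool)
    frame_scores = block_scores.mean(axis=1)                  # coarse level
    top_frames = np.argsort(frame_scores)[::-1][:keep_frames]
    for f in top_frames:
        top_blocks = np.argsort(block_scores[f])[::-1][:keep_blocks]  # fine level
        mask[f, top_blocks] = True
    return mask

# Toy scores for 3 past frames with 3 blocks each.
scores = np.array([[0.9, 0.1, 0.8],
                   [0.1, 0.1, 0.1],
                   [0.2, 0.7, 0.6]])
mask = hierarchical_mask(scores, keep_frames=2, keep_blocks=2)
print(mask.astype(int))
```

Frame 1 is dropped as a whole, so none of its blocks are scored at the fine level. That skipped work is where a coarse-to-fine scheme saves computation over a flat block-level top-k.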

FireRed-Image-Edit-1.0 Technical Report

2026-06-16T04:00:00 · cs.CV, diffusion, eess.IV · 2602.13344

中文标题:FireRed-Image-Edit-1.0 技术报告

作者:Super Intelligence Team, Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, Shuang Sun, Wei Zhu, Xu Tang, Yao Hu, Yibo Chen, Yuhao Huang, Yuxuan Duan, Zhiyi Chen, Ziyuan Guo

摘要:

We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. To support future research, our code, models, and benchmark suite are publicly available at https://github.com/FireRedTeam/FireRed-Image-Edit/ .

Abstract (Chinese translation):

我们提出 FireRed-Image-Edit,一个基于指令的图像编辑扩散变换器,通过系统性地优化数据整理、训练方法和评估设计,实现了领先水平的性能。我们构建了一个包含16亿样本的训练语料库,其中包括9亿个文本到图像对和7亿个图像编辑对,来源于多种渠道。经过严格的清洗、分层、自动标注和两阶段过滤,我们保留了超过1亿个高质量样本,在生成和编辑任务之间保持平衡,确保了强大的语义覆盖和指令对齐。我们的多阶段训练流程通过预训练、监督微调和强化学习逐步构建编辑能力。为了提高数据效率,我们引入了多条件感知桶采样器用于可变分辨率批处理,以及具有动态提示词重索引的随机指令对齐。为了稳定优化并增强可控性,我们提出了用于DPO的非对称梯度优化、用于文本编辑的具有布局感知OCR奖励的DiffusionNFT,以及用于身份保持的可微一致性损失。我们进一步建立了REDEdit-Bench,这是一个涵盖15个编辑类别的综合基准测试,包括新引入的美化和低级增强任务。在REDEdit-Bench以及公开基准测试(ImgEdit和GEdit)上的大量实验表明,与开源和专有系统相比,我们的模型具有竞争性或更优越的性能。为支持未来研究,我们的代码、模型和基准测试套件已在 https://github.com/FireRedTeam/FireRed-Image-Edit/ 上公开可用。

DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers

2026-06-16T04:00:00 · cs.CV, diffusion · 2603.04239

Chinese title: DiverseDiT:面向扩散变换器的多样化表示学习

Authors: Mengping Yang, Zhiyu Tan, Binglei Li, Xiaomeng Yang, Hesen Chen, Hao Li

Abstract:

Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms governing representation learning within DiTs are not well understood. To this end, we first systematically investigate the representation dynamics of DiTs. Through analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this key insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity. DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss to encourage blocks to learn distinct features. Extensive experiments on ImageNet 256x256 and 512x512 demonstrate that our DiverseDiT yields consistent performance gains and convergence acceleration when applied to different backbones with various sizes, even when tested on the challenging one-step generation setting. Furthermore, we show that DiverseDiT is complementary to existing representation learning techniques, leading to further performance gains. Our work provides valuable insights into the representation learning dynamics of DiTs and offers a practical approach for enhancing their performance.

Abstract (Chinese translation):

扩散变换器(Diffusion Transformers,DiTs)因其卓越的可扩展性而在视觉合成领域取得了突破性进展。为了提升DiTs捕捉有意义内部表示的能力,近期研究如REPA引入外部预训练编码器进行表示对齐。然而,DiTs内部表示学习的底层机制尚未被充分理解。为此,我们首先系统性地研究了DiTs的表示动态特征。通过分析不同设置下内部表示的演变与影响,我们发现跨块的表示多样性是有效学习的关键因素。基于这一重要洞见,我们提出了DiverseDiT,一个明确促进表示多样性的新型框架。DiverseDiT引入长残差连接来多样化跨块的输入表示,并采用表示多样性损失来鼓励不同块学习独特特征。在ImageNet 256×256和512×512上的大量实验表明,我们的方法在不同规模和骨干网络的应用中均能获得一致的性能提升和收敛加速,即使在极具挑战性的单步生成设置下测试也不例外。此外,我们证明DiverseDiT与现有表示学习技术具有互补性,能够带来进一步的性能提升。我们的工作为DiTs的表示学习动态提供了宝贵洞见,并为提升其性能提供了一种实用方法。
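
The abstract does not give the form of the representation diversity loss. Purely as an illustration, the sketch below assumes one simple choice: penalize the mean pairwise cosine similarity between per-block feature vectors, so that blocks with near-identical features incur a higher penalty. The name `diversity_penalty` and the pooled per-block inputs are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def diversity_penalty(block_features):
    """Mean pairwise cosine similarity across blocks (assumed loss form).

    block_features: one pooled feature vector per transformer block.
    A value near 1 means the blocks are redundant; lower is more diverse.
    """
    pairs = list(combinations(block_features, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

redundant = [[1.0, 2.0], [1.0, 2.0], [2.0, 4.0]]   # all blocks point the same way
distinct = [[1.0, 0.0], [0.0, 1.0]]                # orthogonal block features
print(diversity_penalty(redundant), diversity_penalty(distinct))
```

In a training loop, a term like this would be added to the denoising objective with a small weight; the weighting and the pooling of token features into one vector per block are further assumptions.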

Structural Energy Guidance for View-Consistent Text-to-3D Generation

2026-06-16T04:00:00 · cs.CV, diffusion · 2605.19876

Chinese title: 用于视图一致的文本到3D生成的结构能量引导

Authors: Qing Zhang, Jinguang Tong, Jing Zhang, Jie Hong, Xuesong Li

Abstract:

Text-to-3D generation based on diffusion models often suffers from the Janus problem, leading to inconsistent geometry across viewpoints. This work identifies viewpoint bias in 2D diffusion priors as the main cause and proposes Structural Energy-Guided Sampling (SEGS), a training-free and plug-and-play framework to improve multi-view consistency. SEGS constructs a structural energy in the PCA subspace of U-Net features and injects its gradient into the denoising process. It can be easily integrated into SDS/VSD pipelines without retraining. Experiments show that SEGS reduces the Janus Rate by about 10% on average and improves View-CS scores across multiple baselines, including DreamFusion, Magic3D, and LucidDreamer. This method effectively alleviates viewpoint artifacts while preserving appearance fidelity, providing a flexible solution for high-quality text-to-3D content generation.

Abstract (Chinese translation):

基于扩散模型的文本到3D生成经常遭受Janus问题的影响,导致不同视角下的几何形状不一致。本工作认为2D扩散先验中的视角偏差是主要原因,并提出了结构能量引导采样(SEGS),这是一个无需训练且即插即用的框架,用于改善多视角一致性。SEGS在U-Net特征的PCA子空间中构建结构能量,并将其梯度注入到去噪过程中。它可以轻松集成到SDS/VSD pipeline中而无需重新训练。实验表明,SEGS平均将Janus率降低约10%,并在多个基线模型(包括DreamFusion、Magic3D和LucidDreamer)上提升了View-CS分数。该方法有效缓解了视角伪影,同时保持了外观保真度,为高质量文本到3D内容生成提供了灵活的解决方案。

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion

2026-06-16T04:00:00 · cs.CV, diffusion · 2605.25449

Chinese title: Pantheon360:通过3D感知360°视频扩散控制数字孪生生成

Authors: Ting-Hsuan Chen, Ying-Huan Chen, Tao Tu, Jie-Ying Lee, Cho-Ying Wu, Fangzhou Lin, Hengyuan Zhang, David Paz, Xinyu Huang, Yuliang Guo, Yu-Lun Liu, Yue Wang, Liu Ren

Abstract:

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.

Abstract (Chinese translation):

从视频生成完整的数字孪生需要精确的相机控制、全局场景覆盖以及严格的时空一致性约束,这些要求对于透视视频生成器而言仍具挑战性,原因在于其有限的视场角(FoV)。狭窄的视场角迫使采用长轨迹或多视角轨迹,从而加剧了跨视角不一致性和时间漂移问题。我们认为,360°视频生成提供了自然解决方案:全景覆盖简化了轨迹设计,并为保持一致性提供了强大的全局上下文。我们提出了Pantheon360:通过3D感知360°视频扩散控制数字孪生生成,这是一个可控的360°视频生成框架,能够从稀疏的360°输入合成高保真视频。其核心思想是利用从输入重建的显式3D Cache作为几何支架,支持任意用户定义的相机路径。这使得扩散模型能够专注于逼真的纹理细化,而3D Cache则强制执行全局几何一致性。实验表明,Pantheon360实现了卓越的视觉质量和无与伦比的几何一致性,能够为下游仿真和数字孪生应用提供可靠且灵活的360°场景生成。

KGEdit: Ambiguity-Aware Knowledge Graphs for Training-Free Precise Video Generation and Editing

2026-06-16T04:00:00 · cs.CV, diffusion · 2605.29509

Chinese title: KGEdit:用于免训练精确视频生成与编辑的歧义感知知识图谱

Authors: Mingshu Cai, Miao Zhang, Chenghe Yang, Yixuan Li, Osamu Yoshie, Yuya Ieiri

Abstract:

In recent years, training-free video generation has progressed remarkably. However, when handling complex textual instructions, existing methods still suffer from semantic ambiguity, incorrect concept binding, and cross-frame inconsistency. To address these issues, we propose KGEdit, a structured semantic control framework for text-to-video (T2V) diffusion models. Specifically, we first construct an ambiguity-aware knowledge graph (AAKG) to disentangle and disambiguate the input prompt, converting it into four types of structured semantics: identity, relation, attribute, and negative constraints. We then design a structured semantic injection module (SSIM) to inject these semantic signals into key layers of the diffusion Transformer, enabling fine-grained semantic control. In addition, we introduce a temporal-aware semantic control (TASC) module that dynamically schedules semantic objectives according to the stage-wise characteristics of the denoising process, further improving semantic alignment and temporal consistency. Experiments show that KGEdit outperforms existing methods in editing precision and temporal stability, while offering higher efficiency and controllability in text-driven interaction scenarios.

Abstract (Chinese translation):

近年来,免训练视频生成取得了显著进展。然而,在处理复杂文本指令时,现有方法仍面临语义歧义、概念绑定错误和跨帧不一致等问题。为解决这些问题,我们提出了KGEdit,一个用于文本到视频(T2V)扩散模型的结构化语义控制框架。具体而言,我们首先构建了歧义感知知识图谱(AAKG)来解耦和消歧输入提示词,将其转换为四种类型的结构化语义:身份、关系、属性和负面约束。随后,我们设计了结构化语义注入模块(SSIM)将这些语义信号注入扩散Transformer的关键层,实现细粒度语义控制。此外,我们引入了时序感知语义控制(TASC)模块,根据去噪过程的阶段特征动态调度语义目标,进一步提升语义对齐和时序一致性。实验表明,KGEdit在编辑精度和时序稳定性方面优于现有方法,同时在文本驱动交互场景中具有更高的效率和可控性。

MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer

2026-06-16T04:00:00 · cs.CV, cs.GR, diffusion · 2606.04621

Chinese title: MeshFlow:基于MeshVAE和基于流的扩散Transformer的高效艺术网格生成

Authors: Weiyu Li, Antoine Toisoul, Tom Monnier, Roman Shapovalov, Rakesh Ranjan, Ping Tan, Andrea Vedaldi

Abstract:

We present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow

Abstract (Chinese translation):

我们提出了MeshFlow,一种生成艺术家风格三维网格的新方法。当前的网格生成器通常采用自回归(AR)下一个token预测范式,鉴于网格拓扑的离散性质,这是一个自然的选择。然而,自回归方法的可扩展性较差,因为推理成本随网格大小呈二次方增长。此外,这些方法还需要对顶点坐标进行离散化,从而引入量化误差。为解决这些挑战,我们引入了一种变分自编码器(VAE),在对比损失的监督下,将连续的顶点位置和离散的连接关系统一表示在一个连续潜在空间中。该潜在空间比以往基于token的网格表示紧凑得多。在此基础上,我们构建了一个基于整流流(Rectified Flow)Transformer的三维生成器,可并行生成所有网格顶点和边。我们的模型生成网格的速度比最快的自回归生成器快18倍,同时在标准网格生成指标上也达到了优异的精度。主页:https://mesh-flow.github.io/ ,代码:https://github.com/facebookresearch/meshflow

RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision

2026-06-16T04:00:00 · cs.CV, diffusion · 2606.06176

Chinese title: RQUL-UIE:通过数据集内自监督重新利用质量不稳定标签的水下图像增强

Authors: Haochen Hu, Yanrui Bin, Chih-yung Wen, Bing Wang

Abstract:

Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be made available upon acceptance.

Abstract (Chinese translation):

水下图像增强(UIE)对于减轻水介质引起的图像退化至关重要。尽管基于学习的方法已取得显著进展,但大多数方法依赖于标签质量不稳定的配对数据集,这制约了模型性能。本文提出了一种基于扩散模型的数据集内自监督学习策略,旨在利用训练标签的质量分布。具体而言,我们通过预训练扩散模型的语义感知嵌入,以无需训练的方式评估标签质量。这些质量分数随后被量化为噪声水平索引,用于指导多步去噪过程,实现逐级监督。该机制可防止低质量标签损害模型性能,同时在训练中最大化这些标签的效用。此外,我们还引入了一个基于傅里叶的细化网络来显式重建高频分量。大量评估表明,我们的方法在复原质量方面始终优于现有最先进(SOTA)方法。代码和预训练模型将在论文录用后公开。

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

2026-06-16T04:00:00 · autoregressive, cs.CV, diffusion · 2606.09150

Chinese title: Ultra Flash:面向高分辨率的实时流式视频生成扩展方法

Authors: Luxury, Jie Huang, Zihao Fan, Xiaoxiao Ma, Jun-hao Zhuang, Yuming Li, Zeyue Xue, Siming Fu, Haoran Li, Mingchen Zhong, Guohui Zhang, Shichen Ma, Yijun Liu, Jiaqi Shi, Yanwen Ma, Yaofeng Su, Haoyu Wang, Yaowei Li, Songchun Zhang, Weiyang Jin, Yuxuan Bian, Shiyi Zhang, Haojun Xu, Shuai Lu, Xin Han, Wei Tang, Haoyang Huang, Nan Duan

Abstract:

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency. Project Page: https://xin1u.github.io/UltraFlash/

Abstract (Chinese translation):

尽管当前的自回归视频扩散模型已能实现优异的流式质量,但其仍受限于低分辨率(如480P),使得高效、可扩展的实时高分辨率视频生成成为一个亟待解决的核心挑战。为弥补这一差距,我们提出了Ultra Flash,一个能够实现实时高分辨率视频生成的级联流式框架。Ultra Flash在单GPU上于1K分辨率达到约30 FPS、于2K分辨率达到约18 FPS,主要通过以下三项技术贡献实现:(1)一种架构保持的T2V到TV2V超分辨率训练范式,结合面向AIGC的数据降质管道,能够有效保留基模型的生成能力,使其在与主流低分辨率生成模型级联时能够增强高分辨率细节;(2)一个因果流式潜在上采样器配合高分辨率解码器,在增强时空一致性的同时实现高效的潜在空间缩放和精确的高分辨率解码,且计算开销可忽略不计;(3)一种级联高分辨率流式视频生成优化方案,首先对超分辨率模型执行混合奖励增强的稀疏因果化和单步蒸馏,随后引入带动态缓存管理的级联流式自强制偏好优化,共同增强整体一致性、提升质量并实现实时高分辨率流式视频生成。大量实验表明,Ultra Flash能够可靠地生成超高分辨率流式视频,同时保持最先进的视觉质量和优异的效率。项目主页:https://xin1u.github.io/UltraFlash/

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

2026-06-16T04:00:00 · cs.CV, cs.GR, diffusion · 2606.13655

Chinese title: Flex4DHuman:用于4D人体重建的灵活多视角视频扩散模型

Authors: Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang

Abstract:

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.

Abstract (Chinese translation):

我们提出了Flex4DHuman,一个多视角视频扩散模型,仅以相对相机位姿为条件,即可将动态主体的单目或稀疏多视角视频转换为同步的密集多视角视频。与以往依赖骨架、深度图、法线或渲染的目标视角几何的以人为中心的方法不同,Flex4DHuman无需显式的几何先验,而是通过相对相机位姿位置编码来条件化生成。生成的视频可直接输入下游重建流程,以创建动态4D高斯溅射表示。Flex4DHuman基于Wan 2.1 1.3B文本到视频模型构建,保留了主干网络架构,并通过五轴位置编码对相机和视角信息进行编码,该编码在时空RoPE的基础上扩展了视角索引和连续的SE(3)相对相机几何。三阶段课程式训练逐步让模型学会位姿跟随、灵活的参考视角到目标视角生成以及时间展开(temporal rollout)。为支持时间展开,我们使用干净的历史目标视角token进行训练。我们还加入了多视角文本描述(caption),以实现测试时的文本控制。结合现成的4D高斯溅射阶段,我们的框架可将静态相机拍摄的单目视频提升为动态4D高斯溅射表示。在DNA-Rendering和ActorsHQ上的实验表明,Flex4DHuman超越了此前最先进的方法,并且在人与动物混合训练之后,同一建模方式可以泛化到动物类别。这些能力使Flex4DHuman朝着利用随手拍摄的单目视频进行规模化4D内容创作迈出了切实的一步,可用于仿真、游戏、AR/VR和视频重拍。
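
The conditioning signal in this abstract is the relative SE(3) pose between cameras. As a small worked example of that quantity only (not of the paper's five-axis positional encoding), the sketch below computes the pose of a target camera in a reference camera's frame. It assumes camera-to-world poses stored as a 3x3 rotation plus a translation; that convention and all function names are assumptions made for illustration.

```python
def transpose3(r):
    """Transpose of a 3x3 matrix given as nested lists."""
    return [[r[j][i] for j in range(3)] for i in range(3)]

def invert_pose(rotation, translation):
    """Invert the rigid transform x -> R x + t, giving x -> R^T x - R^T t."""
    r_inv = transpose3(rotation)
    t_inv = [-sum(r_inv[i][j] * translation[j] for j in range(3)) for i in range(3)]
    return r_inv, t_inv

def compose(pose_a, pose_b):
    """Return the transform that applies pose_b first, then pose_a."""
    (ra, ta), (rb, tb) = pose_a, pose_b
    r = [[sum(ra[i][k] * rb[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    t = [sum(ra[i][k] * tb[k] for k in range(3)) + ta[i] for i in range(3)]
    return r, t

def relative_pose(ref_c2w, tgt_c2w):
    """Target camera pose expressed in the reference camera's frame."""
    return compose(invert_pose(*ref_c2w), tgt_c2w)

rot_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ref = (rot_z90, [1.0, 2.0, 3.0])
tgt = (identity, [4.0, 5.0, 6.0])
rel = relative_pose(ref, tgt)
print(rel[1])  # translation of the target camera as seen from the reference
```

Because only the relative transform enters the computation, the result is unchanged if both cameras are moved by the same world-space rigid motion, which is one reason relative poses are a convenient conditioning input.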

Semantic Editing with Coupled Stochastic Differential Equations

2026-06-16T04:00:00 · cs.CV, cs.LG, diffusion, stat.ML · 2509.24223

Chinese title: 基于耦合随机微分方程的语义编辑

Authors: Jianxin Zhang, Clayton Scott

Abstract:

Editing the content of an image with a pretrained text-to-image model remains challenging. Existing methods often distort fine details or introduce unintended artifacts. We propose using coupled stochastic differential equations (coupled SDEs) to guide the sampling process of any pre-trained generative model that can be sampled by solving an SDE, including diffusion and rectified flow models. By driving both the source image and the edited image with the same correlated noise, our approach steers new samples toward the desired semantics while preserving visual similarity to the source. The method works out-of-the-box, without retraining or auxiliary networks, and achieves high prompt fidelity along with near-pixel-level consistency. These results position coupled SDEs as a simple yet powerful tool for controlled generative AI. Project page: https://z-jianxin.github.io/syncSDE-release/. Code: https://github.com/Z-Jianxin/syncSDE-release.

Abstract (Chinese translation):

使用预训练文生图模型编辑图像内容仍具挑战性。现有方法往往会导致细节扭曲或引入非预期伪影。本文提出使用耦合随机微分方程(coupled SDEs)来引导任意可通过求解SDE进行采样的预训练生成模型(包括扩散模型和整流流模型)的采样过程。通过使用相同的相关噪声驱动源图像和编辑后的图像,本方法能够引导新样本朝向期望的语义,同时保持与源图像的视觉相似性。该方法即插即用,无需重训练或辅助网络即可实现高提示词保真度和近乎像素级的一致性。这些结果表明,耦合随机微分方程作为简单而强大的工具,在可控生成式AI领域具有重要应用价值。项目主页:https://z-jianxin.github.io/syncSDE-release/。代码:https://github.com/Z-Jianxin/syncSDE-release。
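
To make the idea of driving two samplers with the same noise concrete, here is a toy one-dimensional sketch. It is not the paper's algorithm: it assumes two Ornstein-Uhlenbeck-style SDEs whose drifts pull toward different targets (`mu_src` for the source, `mu_edit` for the edit), integrated with Euler-Maruyama. All names and the drift form are hypothetical.

```python
import math
import random

def euler_maruyama_pair(mu_src, mu_edit, x0, steps=200, dt=0.01,
                        sigma=1.0, shared_noise=True, seed=0):
    """Integrate dx = -(x - mu) dt + sigma dW for a source and an edited path.

    With shared_noise=True both paths receive the same Brownian increments,
    so they differ only through their drift terms.
    """
    rng_src = random.Random(seed)
    rng_edit = random.Random(seed + 1)  # used only when the noise is independent
    x_src, x_edit = x0, x0
    for _ in range(steps):
        dw_src = rng_src.gauss(0.0, math.sqrt(dt))
        dw_edit = dw_src if shared_noise else rng_edit.gauss(0.0, math.sqrt(dt))
        x_src += -(x_src - mu_src) * dt + sigma * dw_src
        x_edit += -(x_edit - mu_edit) * dt + sigma * dw_edit
    return x_src, x_edit

coupled = euler_maruyama_pair(0.0, 2.0, x0=0.5)
independent = euler_maruyama_pair(0.0, 2.0, x0=0.5, shared_noise=False)
print(coupled[1] - coupled[0], independent[1] - independent[0])
```

In this toy setting the coupled gap is deterministic: the noise cancels in the difference, leaving `(mu_edit - mu_src) * (1 - (1 - dt) ** steps)`. With independent noise the gap also carries a random component, which is a simplified analogue of the unintended changes the abstract attributes to uncoupled editing.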

PhyloSDF: Phylogenetically-Conditioned Neural Generation of 3D Skull Morphology via Residual Flow Matching

2026-06-16T04:00:00 · cs.CV, diffusion, q-bio.QM · 2604.25371

Chinese title: PhyloSDF:基于残差流匹配的系统发育条件化3D头骨形态神经生成方法

Authors: Kaikwan Lau, Gary P. T. Choi

Abstract:

Generating novel, biologically plausible three-dimensional morphological structures is a fundamental challenge in computational evolutionary biology, hampered by extreme data scarcity and the requirement that generated shapes respect phylogenetic relationships among species. In this work, we present PhyloSDF, a phylogenetically-conditioned neural generative model for 3D biological morphology that integrates two innovations: (1) a DeepSDF auto-decoder regularized by a novel Phylogenetic Consistency Loss that structures the latent space to correlate with evolutionary distances (Pearson r=0.993); (2) a Residual Conditional Flow Matching (Residual CFM) architecture that factorizes generation into analytic species-centroid lookup and learned residual prediction, enabling generation from as few as ~4 specimens per species. We evaluate PhyloSDF on 100 micro-CT-scanned skulls of Darwin's Finches and their relatives across 24 species. The model generates novel meshes achieving 88-129% of real intra-species variation at the code level, with all 180 generated meshes verified as non-memorized. Residual CFM surpasses denoising diffusion (which fails entirely at this scale), standard flow matching (which mode-collapses to 3-6% variation), and a Gaussian mixture baseline in both fidelity (Chamfer Distance 0.00181 vs. 0.00190) and morphometric Fréchet distance (10,641 vs. 13,322). Leave-one-species-out experiments across 18 species demonstrate phylogenetic extrapolation capability, and smooth latent interpolations produce biologically plausible ancestral skull reconstructions.

Abstract (Chinese translation):

生成新颖且在生物学上合理的三维形态结构是计算进化生物学中的一项根本性挑战,其难点在于数据极度稀缺,并且生成的形状必须符合物种间的系统发育关系。本研究提出了PhyloSDF,一个用于三维生物形态的系统发育条件化神经生成模型,它整合了两项创新:(1)一个由新型系统发育一致性损失(Phylogenetic Consistency Loss)正则化的DeepSDF自解码器,该损失使潜在空间的结构与进化距离相关(Pearson r=0.993);(2)一个残差条件流匹配(Residual CFM)架构,它将生成过程分解为解析的物种质心查找和学习得到的残差预测,使得每个物种仅需约4个标本即可进行生成。我们在100个经显微CT(micro-CT)扫描的达尔文雀(Darwin's Finches)及其近缘物种的头骨(涵盖24个物种)上评估了PhyloSDF。该模型生成的新网格在潜码(latent code)层面达到了真实种内变异的88-129%,且全部180个生成网格均经验证并非对训练样本的记忆。残差CFM在保真度(Chamfer距离0.00181对比0.00190)和形态测量Fréchet距离(10,641对比13,322)两方面均优于去噪扩散模型(在该数据规模下完全失效)、标准流匹配(模式坍塌至3-6%的变异)以及高斯混合基线。覆盖18个物种的留一物种(leave-one-species-out)实验展示了系统发育外推能力,平滑的潜在空间插值则产生了生物学上合理的祖先头骨重建。

image_compression
Image Compression
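5 papers

Today's Image Compression papers focus on deep-learning-driven, end-to-end compression schemes, with particular attention to optimizing compression jointly with downstream vision tasks. Two trends stand out: first, a shift from traditional PSNR/SSIM metrics toward compression tailored to DNN perception performance; second, demand for real-time compression in bandwidth-constrained settings such as robotics and autonomous driving. Highlights include compression adapted to VLA models, a geometric sampler for perception systems, and the use of VQ-VAE on low-power devices.

Papers worth noting:

  • Learned JPEG Compression for DNN Vision (2606.16185): proposes a learned JPEG scheme optimized for deep-neural-network vision tasks, targeting DNN inference performance rather than traditional image-quality metrics, an example of co-designing the codec with the downstream task.
  • Learned Image Compression for Vision-Language-Action Models (2606.16253): designs a learned compression framework tailored to VLA models, addressing the bandwidth and latency bottleneck of visual communication in robotic systems, with clear practical value.
  • Double-Helix Vision (DH-V2): A Geometry-Based Visual Sampler (2606.14773): introduces a geometry-based visual sampler that selects information efficiently in bandwidth-constrained perception, offering a geometric prior as a different angle on compression.
  • Sustainable Face Recognition on Low-Power Devices with VQ-VAE Embeddings (2606.15355): uses VQ-VAE embeddings for face recognition on low-power devices, showing that compressed representations are workable for edge AI.

Note: "Improved Knowledge Distillation for Land-Use Image Classification" mainly belongs to knowledge distillation and remote-sensing classification and is not a core paper for this category.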

Double-Helix Vision (DH-V2): A Geometry-Based Visual Sampler for Bandwidth-Constrained Perception

2026-06-16T04:00:00 · cs.AI, cs.CV, image_compression · 2606.14773

Chinese title: 双螺旋视觉(DH-V2):一种面向带宽受限感知的基于几何的视觉采样器

Authors: Jinwen Wen

Abstract:

We present Double-Helix Vision (DH), a geometry-based visual sampler that compresses 2D images into compact 1D signals using paired golden-ratio-inspired spiral trajectories. Rather than processing every pixel uniformly, DH employs two phase-shifted helices (Alpha and Beta, offset by 180 degrees) to sample the image with biologically-inspired foveation: high density at the center, sparse coverage at the periphery. At 4K resolution, DH achieves a 1,433x compression ratio (99.93% reduction) while preserving the geometric structure of the scene. The full perception pipeline -- including spatial mapping, temporal collision detection, and intra-frame structural disparity estimation -- runs in 0.52 ms at 1080p on CPU-only hardware, with no neural network dependencies. On CIFAR-10 at extreme sampling budgets (K=128 points per helix), DH achieves a +6.03% accuracy gain over uniform random sampling. A JSON-serializable Robotics API is provided, delivering sub-millisecond spatial perception reports in 2.7 KB packets. Code and benchmarks are available under the MIT License.

Abstract (Chinese translation):

我们提出双螺旋视觉(Double-Helix Vision,DH),一种基于几何的视觉采样器,利用一对受黄金比例启发的螺旋轨迹将2D图像压缩为紧凑的1D信号。DH并非均匀处理每个像素,而是使用两条相位错开的螺旋线(Alpha和Beta,相位差180度),以受生物启发的中央凹方式对图像采样:中心区域高密度,周边区域稀疏覆盖。在4K分辨率下,DH实现了1433倍的压缩比(数据量减少99.93%),同时保留了场景的几何结构。完整的感知流程(包括空间映射、时序碰撞检测和帧内结构视差估计)在仅使用CPU的硬件上处理1080p图像只需0.52毫秒,且不依赖任何神经网络。在CIFAR-10上,在极低的采样预算下(每条螺旋K=128个点),DH相比均匀随机采样取得了+6.03%的准确率提升。我们还提供了可JSON序列化的机器人API,以2.7 KB的数据包提供亚毫秒级的空间感知报告。代码和基准测试以MIT许可证发布。
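
The abstract describes two golden-ratio-inspired spirals offset by 180 degrees with denser sampling near the image center, but gives no formulas. The sketch below is one plausible reading, not the released implementation: it assumes a golden-angle spiral whose radius grows as `t ** foveation`, where an exponent above 0.5 concentrates points near the center. The function names and the `foveation` parameter are hypothetical. The second helper only converts a compression ratio into a percentage reduction, which reproduces the 1,433x and 99.93% pairing quoted above.

```python
import math

GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))  # about 2.39996 rad (137.5 degrees)

def double_helix_points(width, height, k, foveation=1.5):
    """Return two lists of k pixel coordinates, one per helix (assumed layout)."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r_max = min(cx, cy)
    alpha, beta = [], []
    for i in range(k):
        t = (i + 0.5) / k                  # normalized position along the spiral
        radius = r_max * t ** foveation    # exponent > 0.5 packs points near the center
        theta = i * GOLDEN_ANGLE
        # Beta is Alpha shifted by 180 degrees (pi radians).
        for helix, phase in ((alpha, 0.0), (beta, math.pi)):
            x = int(round(cx + radius * math.cos(theta + phase)))
            y = int(round(cy + radius * math.sin(theta + phase)))
            helix.append((x, y))
    return alpha, beta

def reduction_percent(compression_ratio):
    """Percentage of data removed at a given compression ratio."""
    return 100.0 * (1.0 - 1.0 / compression_ratio)

alpha, beta = double_helix_points(1920, 1080, 128)
print(len(alpha), len(beta))
print(round(reduction_percent(1433), 2))  # 99.93
```

Reading pixel values along `alpha` and then `beta` yields the kind of compact 1D signal the abstract refers to; how the actual sampler orders, weights, or interpolates samples is not stated in the source.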

Improved Knowledge Distillation for Land-Use Image Classification

2026-06-16T04:00:00 · cs.AI, cs.CV, image_compression · 2606.14886

Chinese title: 用于土地利用图像分类的改进知识蒸馏方法

Authors: Arundhuti Sur, Abhiroop Chatterjee, Susmita Ghosh, Emmett Ientilucci

Abstract:

In the present article, an improved Knowledge Distillation (KD) framework has been proposed for efficient compression of deep convolutional neural networks for the land-use image classification task. Motivated by the need to achieve competitive classification accuracy while reducing computational complexity, a teacher-student learning paradigm is adopted in which a VGG16 network transfers knowledge to a lightweight MobileNetV2 model. The proposed framework integrates hard supervision from ground truth labels with a soft supervision strategy that combines Kullback-Leibler divergence and Cosine Similarity losses. Experiments conducted on three land-use datasets show that the proposed KD-based method yields improved performance and achieves an accuracy of 99.04%, outperforming both baseline student training and single-loss distillation approaches, while retaining substantial model compression.

Abstract (Chinese translation):

本文提出了一种改进的知识蒸馏(KD)框架,用于高效压缩面向土地利用图像分类任务的深度卷积神经网络。为了在降低计算复杂度的同时获得有竞争力的分类精度,本研究采用师生学习范式,由VGG16网络将知识迁移至轻量级的MobileNetV2模型。该框架将来自真实标签的硬监督与一种软监督策略相结合,后者融合了Kullback-Leibler(KL)散度损失和余弦相似度损失。在三个土地利用数据集上的实验表明,所提出的基于知识蒸馏的方法提升了性能,准确率达到99.04%,优于基线学生模型训练和单一损失蒸馏方法,同时保持了可观的模型压缩效果。
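
The loss described here combines hard supervision from labels with a soft term built from KL divergence and cosine similarity. The sketch below shows one way such a combined loss could look for a single example. The temperature, the weights, the `T ** 2` scaling of the KL term (a common distillation convention), and the choice to compute cosine similarity on the softened probabilities are all assumptions, since the abstract does not specify them.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label,
            temperature=4.0, w_hard=1.0, w_kl=1.0, w_cos=1.0):
    """Hard cross-entropy + KL(teacher || student) + (1 - cosine similarity)."""
    # Hard supervision: cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft supervision on temperature-softened distributions.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    dot = sum(t * s for t, s in zip(p_t, p_s))
    norm_t = math.sqrt(sum(t * t for t in p_t))
    norm_s = math.sqrt(sum(s * s for s in p_s))
    cos_loss = 1.0 - dot / (norm_t * norm_s)
    return w_hard * hard + w_kl * (temperature ** 2) * kl + w_cos * cos_loss

teacher = [2.0, 0.5, -1.0]
student = [0.1, 1.5, 0.3]
print(kd_loss(student, teacher, label=0))
```

When the student reproduces the teacher's logits exactly, both soft terms vanish and only the cross-entropy remains, which gives a quick sanity check for an implementation of this kind.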

Learned Image Compression for Vision-Language-Action Models

2026-06-16T04:00:00 · cs.AI, cs.CV, image_compression · 2606.16253

Chinese title: 视觉-语言-动作模型的学习型图像压缩

Authors: Hyeonjun Kim, Jegwang Ryu, Sangbeom Ha, Junhyeok Lee, Jun-Hyuk Kim, Hyemin Ahn, Jaeho Lee

Abstract:

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

Abstract (Chinese translation):

视觉-语言-动作(VLA)模型日益依赖高频率的多摄像头观测,这使得视觉通信成为带宽受限或分布式部署场景下实时机器人控制的主要瓶颈。然而,现有的图像和视频编解码器的设计目标是保持通用的视觉保真度,而非下游VLA策略的控制性能。本工作提出了SPARC(SPatially Adaptive Rate Control,空间自适应码率控制),一种专为VLA驱动的机器人设计的学习型图像压缩框架。我们的核心观察是,视觉信息的重要性在不同摄像头视角之间以及图像内的不同空间区域之间差异很大。基于这一观察,SPARC采用轻量级的时序掩码选择器,在利用时序上下文的同时,根据任务相关性在潜在表示上自适应地分配码率。我们进一步提出了倾斜率损失(tilted rate loss),它通过减弱基于熵的目标函数过度抑制稀有但对任务至关重要的视觉模式的倾向来稳定训练。在RoboCasa365、VLABench和LIBERO等多种机器人基准上的实验表明,在相同码率预算下,SPARC始终比传统图像/视频编解码器和近期的学习型压缩方法取得更强的控制性能。我们还展示了该方法在远程控制场景中的实际部署优势,它显著改善了码率与成功率之间的权衡。

Sustainable Face Recognition on Low-Power Devices with VQ-VAE Embeddings

2026-06-16T04:00:00 · cs.CV, image_compression · 2606.15355

Chinese title: 基于VQ-VAE嵌入的低功耗设备可持续人脸识别

Authors: Christos Chronis, Georgios Th. Papadopoulos, Iraklis Varlamis

Abstract:

Face recognition has become a cornerstone of modern AI applications, yet conventional approaches often rely on computationally intensive models deployed in cloud environments, leading to increased network traffic, high energy consumption, and a heavy carbon footprint. This work introduces a sustainable, edge-deployable face recognition framework based on Vector-Quantized Variational Autoencoders (VQ-VAE), which generates compact and semantically rich latent representations of facial images. By leveraging the compression capacity and reconstruction quality of VQ-VAE embeddings on the edge and combining them with the power of pre-trained face embeddings in a knowledge distillation setup, our system achieves comparable accuracy to state-of-the-art face embedding models while significantly reducing memory and computation requirements on the edge, making it suitable for low-power edge devices. The integration of VQ-VAE compression minimizes network overhead while keeping the matching accuracy high by retaining only the most informative facial features in the latent space. As a result, the reconstructed images preserve the key identity characteristics, improving the robustness and overall performance of the face embeddings.

Abstract (Chinese translation):

人脸识别已成为现代人工智能应用的基石,然而传统方法通常依赖部署在云环境中的计算密集型模型,导致网络流量增加、能源消耗高及碳足迹沉重。本研究提出了一种基于矢量量化变分自编码器(VQ-VAE)的可持续边缘部署人脸识别框架,该框架能够生成紧凑且语义丰富的人脸图像潜在表征。通过充分利用VQ-VAE嵌入在边缘设备上的压缩能力和重构质量,并结合知识蒸馏框架中预训练人脸嵌入的能力,我们的系统在达到与最先进的人脸嵌入模型相当精度的同时,显著降低了边缘设备的内存和计算需求,使其适用于低功耗边缘设备。VQ-VAE压缩技术的集成最大程度减少了网络开销,同时通过仅保留潜在空间中最具信息量的人脸特征来保持高匹配精度。因此,重构图像保留了关键的身份特征,提高了人脸嵌入的鲁棒性和整体性能。
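
The network savings mentioned here come from the vector-quantization step of a VQ-VAE: each latent vector is replaced by the index of its nearest codebook entry, so only small integers need to be transmitted. The sketch below illustrates that generic step with a tiny hand-written codebook. It is not the paper's model; the codebook values and function names are made up for illustration.

```python
import math

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def dequantize(indices, codebook):
    """Recover the quantized vectors from their indices."""
    return [codebook[i] for i in indices]

def index_bits(num_vectors, codebook_size):
    """Bits needed to send the indices with a fixed-length code."""
    return num_vectors * math.ceil(math.log2(codebook_size))

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [0.0, 4.0]]   # toy 4-entry codebook
latents = [[0.9, 1.2], [3.5, 0.2], [0.1, -0.2]]               # toy encoder outputs
codes = quantize(latents, codebook)
print(codes)                                  # [1, 2, 0]
print(index_bits(len(codes), len(codebook)))  # 6
```

In this toy case three 2-D float vectors are sent as six bits of indices. A trained VQ-VAE learns the codebook jointly with the encoder and decoder, which this sketch does not attempt to show.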

Learned JPEG Compression for DNN Vision

2026-06-16T04:00:00 · cs.CV, image_compression · 2606.16185

Chinese title: 面向DNN视觉任务的学习型JPEG压缩

Authors: Kaixiang Zheng, Ahmed H. Salamah, Siyu Chen, En-Hui Yang

Abstract:

JPEG, a lossy image compression technique designed for human viewers, has maintained its dominance for decades. However, in the era of artificial intelligence (AI), a substantial portion of image data, often compressed by JPEG, is and will continue to be consumed by deep neural networks (DNNs) instead of humans, thus creating a need to optimize JPEG for DNN inference performance. To this end, we propose learned JPEG compression for DNN vision (J4D), a novel training framework for determining JPEG encoding parameters to minimize compression rate while maximizing DNN inference performance. The major challenge of solving this optimization problem lies in representing the JPEG codec and compression rate in closed form. By incorporating a differentiable soft quantizer based on a probabilistic quantization scheme, we not only obtain a differentiable proxy for the JPEG codec, but are also able to compute the entropy of the coded source analytically, which is a close estimate of the actual compression rate. Equipped with both the differentiable JPEG codec and the information-theoretic rate estimator, we are then able to solve the aforementioned optimization problem with backpropagation. After training, the learned encoding parameters will be subsequently used in actual JPEG encoding based on probabilistic quantization. Extensive experimental results across multiple datasets and DNN architectures demonstrate that J4D consistently and significantly outperforms the default JPEG and other competitive JPEG codecs optimized for DNNs. Notably, compared to the default JPEG, J4D achieves an increase in accuracy by as much as 11.60% at the same rate, or a reduction of compression rate up to 80.05% at the same accuracy. Additionally, with the help of J4D, we show the potential to design universal JPEG encoding parameters for various DNN architectures for the first time.

Abstract (Chinese translation):

JPEG是一种为人类观看者设计的有损图像压缩技术,数十年来一直占据主导地位。然而,在人工智能(AI)时代,相当一部分通常经JPEG压缩的图像数据正在并将继续由深度神经网络(DNN)而非人类使用,因此需要针对DNN推理性能来优化JPEG。为此,我们提出了面向DNN视觉的学习型JPEG压缩(J4D),这是一种新颖的训练框架,用于确定JPEG编码参数,以在最大化DNN推理性能的同时最小化压缩码率。求解这一优化问题的主要挑战在于以闭式形式表示JPEG编解码器和压缩码率。通过引入基于概率量化方案的可微软量化器,我们不仅得到了JPEG编解码器的可微代理,还能够解析地计算编码后信源的熵,它是实际压缩码率的一个接近的估计。借助可微JPEG编解码器和信息论码率估计器,我们便可以通过反向传播求解上述优化问题。训练完成后,学习到的编码参数将用于基于概率量化的实际JPEG编码。在多个数据集和DNN架构上的大量实验结果表明,J4D始终显著优于默认JPEG以及其他针对DNN优化的有竞争力的JPEG编解码器。值得注意的是,与默认JPEG相比,J4D在相同码率下准确率最多提升11.60%,或在相同准确率下压缩码率最多降低80.05%。此外,借助J4D,我们首次展示了为多种DNN架构设计通用JPEG编码参数的潜力。
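
The abstract names two ingredients: a soft quantizer built on probabilistic quantization, and an entropy that can be computed analytically from it. The exact formulation is not given, so the sketch below assumes a common construction: each coefficient is softly assigned to integer quantization levels with a softmax over negative squared distances, the soft output is the expected reconstruction, and the rate proxy is the entropy of the average assignment distribution. The names and the `sharpness` parameter are hypothetical.

```python
import math

def soft_assign(x, step, levels, sharpness=50.0):
    """Probability of assigning coefficient x to each integer level k (value k * step)."""
    logits = [-sharpness * (x / step - k) ** 2 for k in levels]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def soft_quantize(x, step, levels, sharpness=50.0):
    """Expected reconstruction under the soft assignment (smooth in x and step)."""
    probs = soft_assign(x, step, levels, sharpness)
    return sum(p * k * step for p, k in zip(probs, levels))

def entropy_bits(coeffs, step, levels, sharpness=50.0):
    """Entropy (bits per coefficient) of the average level distribution."""
    avg = [0.0] * len(levels)
    for x in coeffs:
        for i, p in enumerate(soft_assign(x, step, levels, sharpness)):
            avg[i] += p / len(coeffs)
    return -sum(p * math.log2(p) for p in avg if p > 0)

levels = list(range(-4, 5))
print(soft_quantize(10.0, 10.0, levels))        # close to 10.0
print(entropy_bits([0.0, 10.0], 10.0, levels))  # close to 1 bit
```

Because every operation here is smooth in the quantization step, the same expressions written in an autodiff framework would let gradients flow from a task loss and the rate proxy back to the step sizes, which is the role the abstract assigns to the differentiable proxy.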

visual_tokenizer_1d
1D Visual Tokenizer
0 papers

No matching papers were found in this category today.

diffusion_visual_encoder
Diffusion Visual Encoder
0 papers

No matching papers were found in this category today.