Daily arXiv Paper Digest

2026/06/11 10:00:00

Daily Radar

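Daily Overview

Today's arXiv papers show generative AI moving toward real-time, interactive, and safety-aware systems. The boundary between autoregressive (AR) and diffusion models is blurring: BiWM is listed under both categories and shows how bidirectional autoregression supports interactive video world models, while Lip Forcing and PathRelax target few-step and parallel decoding for real-time generation. On the industrial side, HiGR reports a large-scale deployment of hierarchical generative slate recommendation, and the work on adversarial hijacking of robotic diffusion policies (Test-time Adversarial Takeover) exposes vulnerabilities in safety-critical settings. The overlap between scientific computing and AI keeps deepening: papers on PDE surrogate training, latent diffusion for data assimilation in subsurface flow, and neural activity dynamics suggest that generative models are becoming tools for scientific discovery. Video generation is especially active, with FadeMem and Making Time Editable approaching controllability through memory mechanisms and temporal editability, respectively. Overall, the day's main themes are architectural innovation (bidirectional autoregression, temporal editing), safety and robustness (adversarial attacks, copyright-protection bypass), and domain deployment (robotics, recommender systems, scientific simulation).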

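BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression - Presented by its authors as the first full-stack open-source framework for interactive video world models under the bidirectional autoregressive paradigm; it aims to close the quality gap between causal autoregressive pipelines and bidirectional diffusion models and offers a new baseline for real-time controllable video generation.
Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies - Exposes a test-time adversarial hijacking vulnerability in robotic diffusion policies, a clear warning for anyone deploying diffusion models on physical systems.
HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent - Tencent's industrial-scale hierarchical generative slate recommendation system, deployed to hundreds of millions of users; it shows that generative methods are viable at production scale and has both research and industrial value.
Making Time Editable in Video Diffusion Transformers - Adds temporal editing to video diffusion transformers, addressing control over the time dimension in video generation and video editing workflows.
Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving - Applies diffusion models to autonomous driving planning, combining history annealing with time-dependent guidance into a probabilistic framework for decision-making in complex urban scenes.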

autoregressive

Autoregressive

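10 papers

Overview of today's Autoregressive category:

Today's autoregressive papers combine multimodal expansion with inference acceleration. In video generation, autoregression is increasingly paired with diffusion: BiWM builds interactive video world models on bidirectional autoregression, and FadeMem consolidates the KV cache of autoregressive video diffusion to improve long-range temporal consistency. On speed, few-step inference and speculative decoding both advance: PathRelax reports roughly 4x speedups for text-to-image generation, and Lip Forcing reaches real-time lip synchronization at 31 FPS with its 1.3B student. For multimodal modeling, ARM uses unified discrete representations to handle image understanding, generation, and editing in one autoregressive model. Industrial deployment continues as well, with Tencent's HiGR showing hierarchical generation running at scale in a recommender system. Overall, autoregressive modeling is spreading from language generation into video, audio, and robotic manipulation, and inference efficiency has become a central research direction.

Recommended papers:

BiWM - A bidirectional autoregressive framework for interactive video world models; worth watching for its open-source release and its potential in game simulation.
Lip Forcing - Few-step autoregressive diffusion for real-time lip synchronization; the 1.3B student runs at 31 FPS, 17.6x faster than its same-scale bidirectional model, which makes it a useful reference for real-time video applications.
PathRelax - Speculative Jacobi decoding that accelerates autoregressive text-to-image generation by about 4x, exploring a hybrid of autoregressive and parallel decoding.
HiGR - Tencent's industrial-scale hierarchical generative recommendation system, showing how autoregressive generation is deployed at scale in a production setting.
ARM - An autoregressive large multimodal model with unified discrete representations, offering a single framework for multimodal understanding and generation.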

BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

2026-06-10T04:00:00 | autoregressive, cs.AI, cs.CV, diffusion | arXiv:2606.10135

Chinese title: BiWM：利用双向自回归推进开源交互式视频世界模型

Authors: Shaohao Rui, Xiaofeng Mao, Zhanyu Zhang, Peijia Lin, Yansong Zhu, Yibo Zhang, Haibin Wan, Weijie Ma

Abstract:

Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.

Abstract (Chinese):

将双向视频扩散模型转换为自回归范式能够提升视频世界模型的交互性，但现有因果管道需要多个训练阶段（控制微调、自回归训练、因果初始化、少步蒸馏），且由于误差累积仍无法达到双向模型的质量。最近的世界模型如Yume-1.5和Matrix-Game-3.0转而采用双向自回归方法，从自我纠正的误差传播中获得高保真度和稳定的长期展开能力，但开源框架（如minWM）仅支持因果模型。我们提出BiWM，这是首个基于双向自回归范式的交互式视频世界模型全栈框架，同时优化生成质量和推理速度。从预训练的视频主干网络开始，BiWM通过微调注入相机控制，然后运行少量步骤的Distribution Matching Distillation（DMD）阶段，将主干网络转换为可控制动作/相机的世界模型：仅需两个训练阶段而非minWM的四个阶段，在8xH200 GPU上几百步即可收敛。一个统一的方案涵盖Wan2.1-1.3B、Wan2.2-5B、HunyuanVideo-1.5-8B和LTX-2.3-22B，并支持对现有双向模型的二次微调。BiWM实现了minWM失去可控性的真实世界相机控制，集成了可插拔的历史压缩（FramePack风格和PackForcing风格）用于长期展开，并提供可选的NVFP4 4位训练/推理管道。为解决DMD的模式寻求退化问题，我们添加了GAN和mass-covering forward-KL目标以保留场景动态。我们开源BiWM，以支持资源受限的研究和高保真环境仿真。

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

2026-06-10T04:00:00 | autoregressive, cs.AI, cs.IR | arXiv:2512.24787

Chinese title: HiGR: 腾讯工业级分层生成式榜单推荐框架

Authors: Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Chengxiang Zhuo, Zang Li

Abstract:

Slate recommendation, which presents users with a ranked item list in a single display, is ubiquitous across mainstream online platforms. While recent generative recommendation methods have shown strong potential in modeling item sequences with semantic IDs, directly applying them to industrial-scale slate recommendation faces a fundamental disconnect: entangled SID spaces confound high-level list planning, fine-grained autoregressive decoding over long sequences limits semantic planning efficiency, and token-level objectives misalign with holistic slate quality. In this paper, we propose HiGR, an industrial-scale hierarchical generative framework for slate recommendation that bridges this disconnect through a co-designed pipeline. First, HiGR learns structured SIDs via a Prefix-Contrastive Residual Quantized VAE (PCRQ-VAE). By enforcing high-level prefixes to capture shared semantics, PCRQ-VAE creates a controllable discrete space that acts as a prerequisite for efficient planning. Leveraging this structured space, our Hierarchical Slate Decoder (HSD) shifts autoregressive modeling from entangled token-level decoding to coarse-grained preference embeddings. This design significantly reduces inference latency while allowing explicit global slate structure planning. Finally, this stable planning space enables an ORPO-based listwise alignment mechanism to optimize triple-objective implicit feedback: ranking fidelity, genuine user interest, and diversity. Extensive offline experiments show that HiGR outperforms state-of-the-art baselines by over 10% in offline recommendation quality while achieving a 5x inference speedup. Online A/B tests on Tencent platforms further improve watch time by 1.22% and video plays by 1.73%. HiGR has been deployed on multiple Tencent platform surfaces, serving hundreds of millions of users and proving its industrial-scale applicability.

Abstract (Chinese):

榜单推荐在主流在线平台上非常普遍，它在单个展示中向用户呈现排序后的商品列表。虽然近期生成式推荐方法在利用语义ID建模项目序列方面展现出强大潜力，但直接将其应用于工业级榜单推荐面临根本性的脱节问题：纠缠的SID空间干扰高层列表规划，长序列上的细粒度自回归解码限制语义规划效率，以及token级目标与整体榜单质量不一致。本文提出HiGR，一种工业级分层生成式榜单推荐框架，通过协同设计 pipeline 来弥合这一脱节。首先，HiGR通过前缀对比残差量化变分自编码器（PCRQ-VAE）学习结构化SID。通过强制高层前缀捕获共享语义，PCRQ-VAE创建了一个可控的离散空间，作为高效规划的前提条件。利用这一结构化空间，我们的分层榜单解码器（HSD）将自回归建模从纠缠的token级解码转向粗粒度偏好嵌入。该设计显著降低推理延迟，同时支持显式的全局榜单结构规划。最后，这一稳定的规划空间使得基于ORPO的列表级对齐机制能够优化三重目标：隐式反馈-排序保真度、真实用户兴趣和多样性。大量离线实验表明，HiGR在离线推荐质量上超越最先进基线超过10%，同时实现5倍推理加速。腾讯平台的在线A/B测试进一步将观看时间提升1.22%，视频播放量提升1.73%。HiGR已部署于多个腾讯平台场景，服务数亿用户，证明了其工业级适用性。
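
The HiGR abstract builds its semantic IDs with a residual-quantized VAE whose high-level prefixes capture shared semantics. The sketch below illustrates only the plain residual quantization step, under loud assumptions: the codebooks are hand-picked toy values rather than learned, there is no VAE and no prefix-contrastive loss, and `CODEBOOKS`, `nearest`, and `residual_quantize` are hypothetical names, not the paper's API.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes what the previous ones missed.

    Returns the semantic ID (one index per level) and the final residual.
    """
    sid, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return sid, residual


# Hypothetical toy codebooks: level 1 is coarse, level 2 refines the residual.
CODEBOOKS = [
    [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]],
    [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]],
]

if __name__ == "__main__":
    for item in ([4.9, 0.8], [3.2, -0.9], [0.5, 4.6]):
        sid, _ = residual_quantize(item, CODEBOOKS)
        print(item, "->", sid)
```

In this toy setup the first two items land near the same coarse centroid and therefore share the first index of their IDs, which is the kind of prefix structure a hierarchical decoder could plan over before filling in fine-grained tokens.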

Variational Learning for Insertion-based Generation

2026-06-10T04:00:00 | autoregressive, cs.AI, cs.LG, diffusion | arXiv:2606.02133

Chinese title: 面向插入式生成的变分学习

Authors: Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying, David van Dijk, Michalis K. Titsias, Jiaxin Shi

Abstract:

Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.

Abstract (Chinese):

非单调序列生成方法（如掩码扩散模型）通过允许词元以非固定和预定义的顺序生成，为从左到右的自回归建模提供了一种灵活的替代方案。尽管具有实际优势，但现有大多数非单调模型都是顺序无关的，并且依赖于固定长度的网格，限制了其支持变长生成和自适应插入顺序的能力。在本工作中，我们引入了一个用于学习变长插入模型中插入顺序的概率框架。我们形式化建立了插入轨迹与排列之间的双射对应关系，这使得数据似然可以精确地重参数化为排列求和的形式。基于这一结果，我们提出了插入过程（Insertion Process, IP），这是一种随机生成模型，能够联合学习插入位置、插入内容以及终止时机，并通过基于排列的变分推断进行训练。与先前固定画布的方法不同，IP原生支持变长生成，并学习数据驱动的插入顺序偏好。在目标条件规划和分子字符串生成上的实验表明，在没有规范从左到右结构的领域中，学习插入顺序能够提升建模质量和泛化能力。
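
The abstract's key structural claim is a bijection between insertion trajectories and permutations. A minimal sketch of one way to realize such a correspondence is given below; it is an illustration under assumptions, not the paper's construction, and the function names (`trajectory_from_permutation`, `replay`, `all_trajectories`) are hypothetical. A trajectory is represented as a list of `(slot, token)` pairs, where `slot` is the position in the current partial sequence.

```python
from itertools import permutations


def trajectory_from_permutation(tokens, order):
    """order[t] is the final-sequence index of the token inserted at step t."""
    placed = []       # final indices inserted so far, kept sorted
    trajectory = []
    for final_idx in order:
        # The slot is the number of already-placed tokens that end up to the left.
        slot = sum(1 for p in placed if p < final_idx)
        trajectory.append((slot, tokens[final_idx]))
        placed.insert(slot, final_idx)
    return trajectory


def replay(trajectory):
    """Rebuild the final sequence and recover the permutation from a trajectory."""
    seq, step_at = [], []
    for step, (slot, tok) in enumerate(trajectory):
        seq.insert(slot, tok)
        step_at.insert(slot, step)   # remember which step produced each position
    order = [0] * len(seq)
    for final_idx, step in enumerate(step_at):
        order[step] = final_idx
    return seq, order


def all_trajectories(tokens):
    """Set of distinct trajectories, one per insertion order."""
    n = len(tokens)
    return {tuple(trajectory_from_permutation(tokens, o)) for o in permutations(range(n))}
```

Because every permutation yields a distinct trajectory and `replay` inverts the mapping, a sum over trajectories that produce a given sequence can be rewritten as a sum over permutations, which is the form the abstract uses for the data likelihood.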

LiveBand: Live Accompaniment Generation in the Audio Domain

2026-06-10T04:00:00 | autoregressive, cs.AI, cs.SD, eess.AS | arXiv:2606.03803

Chinese title: LiveBand：音频域中的实时伴奏生成

Authors: Marco Pasini, Javier Nistal, Ben Hayes, Mathias Rose Bjare, Stefan Lattner, George Fazekas

Abstract:

We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.

Abstract (Chinese):

我们提出LiveBand，一个实时系统，可对实时音频输入生成高保真音乐伴奏，同时满足严格的因果约束。我们的方法在预训练因果音频自编码器的连续潜空间中训练因果Transformer生成器，并使用判别器进行对抗性序列级监督。在每个时间步，生成器仅接收因果可用的混音上下文和高斯噪声，在无法访问未来混音帧或真值目标潜向量的条件下预测伴奏潜向量。训练在因果掩码下以单次并行前向传播完成，而流式推理则采用滚动注意力状态进行自回归。模型的训练和推理计算在设计上保持一致，消除了教师强制及其带来的曝光偏差。在多乐器音乐伴奏基准测试中，LiveBand在音频质量、节拍对齐和混音契合度的客观指标上优于先前工作，同时能够在消费者硬件上实现无需前瞻未来信息的实时流式生成。

PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation

2026-06-10T04:00:00 | autoregressive, cs.CV | arXiv:2606.10492

Chinese title: PathRelax: 用于加速自回归文生图生成的并行路径放松式推测Jacobi解码

Authors: Haodong Lei, Hongsong Wang, Bingxuan Dai, Pan Zhou

Abstract:

The growing need for high-resolution image generation in autoregressive text-to-image models has resulted in extended token sequences, significantly increasing computational costs and inference times. However, existing state-of-the-art methods for accelerating autoregressive text-to-image models rely on chain-structured draft token sequences, leading to inefficient draft token search and limited acceptance lengths. To address this, we propose parallel-path cross-relaxed speculative Jacobi decoding (PathSpec), a novel framework that enhances efficiency through a multi-sequence draft tree structure. Our parallel-path speculative Jacobi decoding (PathExplore) expands the token search space, achieving a higher speedup ratio without sacrificing image quality. Additionally, we introduce cross-path relaxed verification (PathRelax) that exploits semantic similarities across sequences to further boost token acceptance rates. Evaluated on the Parti-Prompts, MSCOCO2017, and T2ICompBench datasets, our method achieves a speedup ratio of 4.14x, 3.95x, and 4.18x, respectively. Remarkably, PathExplore, without any relaxed sampling, outperforms relaxed sampling methods in the speedup ratio, such as GSD and LANTERN. Moreover, PathRelax's relaxation mechanism can be seamlessly integrated with other relaxation techniques, enabling further acceleration and providing an efficient solution for real-time text-to-image generation. Our code is available at https://github.com/Haodong-Lei-Ray/PathSpec.

Abstract (Chinese):

自回归文生图模型对高分辨率图像生成的需求日益增长，导致token序列变长，大幅增加了计算成本和推理时间。然而，现有最先进的自回归文生图加速方法依赖链式结构的草稿token序列，导致草稿token搜索效率低下且接受长度受限。为此，我们提出了并行路径交叉放松式推测Jacobi解码（PathSpec），这是一个通过多序列草稿树结构提升效率的全新框架。我们的并行路径推测Jacobi解码（PathExplore）扩展了token搜索空间，在不牺牲图像质量的前提下实现了更高的加速比。此外，我们还引入了跨路径放松验证（PathRelax），利用序列间的语义相似性进一步提升token接受率。在Parti-Prompts、MSCOCO2017和T2ICompBench数据集上，我们的方法分别实现了4.14倍、3.95倍和4.18倍的加速比。值得注意的是，未使用任何放松采样的PathExplore在加速比上优于GSD和LANTERN等放松采样方法。此外，PathRelax的放松机制可与其他放松技术无缝集成，进一步提升加速效果，为实时文生图生成提供了高效的解决方案。代码已开源：https://github.com/Haodong-Lei-Ray/PathSpec
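
For readers unfamiliar with the Jacobi decoding that PathSpec builds on, the sketch below shows only the basic greedy variant: all positions are refreshed in parallel from the previous guess until a fixed point is reached, and that fixed point equals ordinary sequential greedy decoding. It does not implement speculative sampling, the multi-path draft tree, or relaxed verification. The `next_token` function is a hypothetical deterministic stand-in for a real model, and every name here is an assumption for illustration.

```python
def next_token(prefix):
    """Hypothetical toy 'model': a deterministic function of the whole prefix."""
    return (sum((i + 1) * t for i, t in enumerate(prefix)) * 7 + 3) % 11


def greedy_decode(prompt, n):
    """Standard sequential decoding: n dependent model calls."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n, init=0):
    """Fixed-point iteration over all n positions at once.

    Each sweep recomputes every position from the previous guess, so on real
    hardware one sweep would be a single batched forward pass.
    """
    guess = [init] * n
    iterations = 0
    while True:
        iterations += 1
        updated = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if updated == guess:          # fixed point: every token is self-consistent
            return guess, iterations
        guess = updated
```

After sweep k at least the first k tokens are final, so the loop ends within n + 1 sweeps; speedups come from cases where many tokens settle per sweep, which is what draft search and relaxed acceptance try to increase.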

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

2026-06-10T04:00:00 | autoregressive, cs.CV, diffusion | arXiv:2606.10671

Chinese title: FadeMem: 面向自回归视频扩散的距离感知记忆巩固

Authors: Yu Lu, Junjie Yang, Piotr Koniusz, YuXin Song, Yi Yang

Abstract:

Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical KV cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within one cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.

Abstract (Chinese):

自回归视频生成器通过生成连续的时间片段来合成较长的视频，但其历史KV缓存随视频长度增长。现有的有限缓存方法通过局部窗口、汇聚标记或压缩记忆状态来降低这一成本，但它们通常为历史的不同部分分配固定角色。我们提出FadeMem，一种距离感知的KV记忆巩固机制，在固定缓存预算下将历史KV块组织成时间层次结构。这一设计源于频率依赖的时间衰减：细节很快去相关，而粗糙的场景结构和身份特征在较长时间范围内仍然有用。在生成过程中，新的历史作为细粒度条目被插入，而较旧的相邻条目在幂律时间分配调度下逐步合并，在单一缓存中形成近密远疏的记忆结构。无需架构修改，FadeMem保留了用于短期动态的近期上下文，以及用于身份和场景一致性的紧凑长期锚点。实验表明，在主体一致性、背景稳定性和时间一致性方面优于现有的有限缓存策略。
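
The dense-near, sparse-far cache described in the FadeMem abstract can be pictured with a toy bookkeeping sketch. Everything below is an assumption for illustration: entries are plain frame spans rather than KV blocks, and the merge rule (merge the adjacent pair with the smallest span divided by distance raised to `alpha`) is a hypothetical stand-in for the paper's power-law temporal allocation schedule, which the abstract does not spell out.

```python
def consolidate(cache, now, budget, alpha=1.0):
    """Merge adjacent entries in place until the cache fits the budget.

    cache: list of (start, end) frame spans, oldest first, end exclusive.
    """
    while len(cache) > budget:
        def cost(i):
            merged_len = cache[i + 1][1] - cache[i][0]
            distance = now - cache[i + 1][1] + 1   # 1 for the newest pair
            return merged_len / distance ** alpha  # far-away pairs are cheap to merge
        i = min(range(len(cache) - 1), key=cost)
        cache[i:i + 2] = [(cache[i][0], cache[i + 1][1])]
    return cache


def run(num_frames, budget, alpha=1.0):
    """Insert one fine-grained entry per frame and consolidate after each insert."""
    cache = []
    for t in range(num_frames):
        cache.append((t, t + 1))
        consolidate(cache, now=t + 1, budget=budget, alpha=alpha)
    return cache


if __name__ == "__main__":
    print(run(8, 4))  # → [(0, 4), (4, 6), (6, 7), (7, 8)]
```

With a budget of four entries and eight frames, the two newest frames keep their own entries while older history is covered by progressively wider spans. In a real KV cache the merge would also have to combine the keys and values of the two blocks, which this sketch leaves out.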

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

2026-06-10T04:00:00 | autoregressive, cs.CV | arXiv:2606.11096

Chinese title: IDEAL：深度对齐构建离散表示自编码器

Authors: Yitong Chen, Zijie Diao, Junke Wang, Lingyu Kong, Yixuan Ren, Bo He, Yu-Gang Jiang, Zuxuan Wu

Abstract:

Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where missing low-level information is difficult to recover. In fact, we observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by deep features used in existing RAEs. Motivated by this complementary property, we propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that Ideal yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28. When used for autoregressive image generation, Ideal further produces a gFID of 1.89, establishing a new state of the art for autoregressive image generation.

Abstract (Chinese):

基于预训练视觉基础模型（Vision Foundation Models，VFMs），表示自编码器（Representation Autoencoders，RAEs）已成为构建语义丰富图像生成潜在空间的一种很有前景的方法。然而，其重建质量往往仍不理想，主要原因在于深度VFM表示未能保留足够的细粒度视觉细节。这一局限在离散化后更加严重，因为低层信息缺失难以恢复。事实上，我们观察到浅层VFM特征保留了更丰富的局部外观和结构细节，这与现有RAEs所使用的深度特征所携带的高层语义形成互补。基于这一互补特性，我们提出了IDEAL，一个用于离散表示自编码器的深度对齐（In-DEpth ALignment）框架。通过将量化token同时与浅层和深层VFM特征进行对齐，IDEAL使生成的离散视觉token同时保留视觉保真度和丰富语义。大量实验表明，IDEAL实现了优越的重建性能，在ImageNet上达到0.61 rFID，超越此前最佳方法0.28。当用于自回归图像生成时，IDEAL进一步实现了1.89 gFID，确立了自回归图像生成的新技术前沿。

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

2026-06-10T04:00:00 | autoregressive, cs.CV, diffusion | arXiv:2606.11180

Chinese title: Lip Forcing：面向实时唇同步的少步自回归扩散方法

Authors: Paul Hyunbin Cho (KAIST AI), Jinhyuk Jang (KAIST AI), SeokYoung Lee (KAIST AI), Joungbin Lee (KAIST AI), Siyoon Jin (KAIST AI), Heeseong Shin (KAIST AI), Jung Yi (KAIST AI), Yunjin Park (AIPARK), Chulmin Park (AIPARK), Seungryong Kim (KAIST AI)

Abstract:

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student crosses into real-time streaming at 31 FPS, 17.6x faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs 39.8x faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.

Abstract (Chinese):

基于扩散的唇同步模型能够实现较强的视觉质量和音画对齐效果，但全序列双向注意力和大量去噪步骤使其难以用于实时推理。我们提出了Lip Forcing，据我们所知，这是首个用于视频到视频（V2V）唇同步的自回归扩散方法，它将一个140亿参数的音频条件双向视频扩散教师模型蒸馏为因果学生模型。在推理时，学生模型仅需两个去噪步骤即可生成每个片段，且无需推理时的无分类器引导（CFG），从而实现实时唇同步。针对唇同步的教师轨迹分析揭示了CFG在保真度与同步之间的权衡：无CFG预测更偏向参考保真度，而CFG引导的预测在轨迹中段的一个区间内更有利于同步。Lip Forcing将这一发现转化为三个源自分析的组件：Sync-Window DMD、两步推理调度以及基于SyncNet的奖励。我们在两个学生模型规模上验证了Lip Forcing，两者均从140亿参数的教师模型蒸馏而来。13亿参数的学生模型达到31 FPS的实时流式速度，比同等规模的双向模型快17.6倍。140亿参数的学生模型是已报道的V2V唇同步中规模最大的扩散模型，在参考保真度相当的情况下比其教师模型快39.8倍。两种规模的首帧生成时间均低于1毫秒，远低于所有扩散基线。

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

2026-06-10T04:00:00 | autoregressive, cs.CV | arXiv:2606.11188

Chinese title: ARM：一种具有统一离散表示的自回归大型多模态模型

Authors: Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang

Abstract:

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.

Abstract (Chinese):

本文介绍了ARM，一个基于离散表示的自回归模型，将图像理解、生成和编辑统一在下一个标记预测框架中。ARM建立在三个方面的努力之上：首先，我们训练了一个离散语义视觉分词器，将图像映射为紧凑的标记序列。我们的分词器采用多个目标进行联合监督，共同促进语义可区分性、语言对齐和忠实重建，从而在共享潜在空间中支持多样化的任务。在此基础上，我们在大规模的文本和图像标记序列上训练了一个70亿参数的自回归模型，无缝地发展出视觉-语言感知和生成能力。最后，为了进一步提升文本到图像生成和指令引导编辑的偏好对齐行为，ARM采用强化学习（RL）来优化视觉质量、指令遵循和编辑一致性等任务级目标。令人惊讶的是，结果表明RL不仅显著提升了目标任务的性能（例如，将WISE总体得分从0.50提升至0.56，GEdit-Bench-EN的G_O从5.75提升至6.68），还诱导了文本到图像生成与编辑之间的跨任务协同。这些发现共同表明，当自回归模型与强大的表示和偏好优化相结合时，可以作为多模态智能的可扩展基础。代码：https://github.com/wdrink/ARM

Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

2026-06-10T04:00:00 | autoregressive, cs.CV, cs.LG, cs.RO | arXiv:2606.10614

Chinese title: Dexterous Point Policy：从人类演示学习基于关键点的灵巧手策略

Authors: Beomjun Kim, Seong Hyeon Park, Seunghoon Sim, Seungjun Moon, Sanghyeok Lee, Jinwoo Shin

Abstract:

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

Abstract (Chinese):

在人类演示视频上预训练的机器人基础模型已显示出前景，但当所得策略部署到真实机器人时，仍存在显著的具身差距（embodiment gap）。一种常见的补救方法是在机器人专属演示上对这些模型进行微调。然而，机器人数据收集的成本可能极高且耗时，这在灵巧操作中尤为突出，例如，即使只为一个原子任务遥操作多指手，也可能需要数天时间。为解决这一问题，我们提出了Dexterous Point Policy框架，该框架可直接从人类视频学习灵巧操作策略，无需机器人演示。我们的核心见解是，当同时用于观测和动作时，统一的3D关键点表示可以弥合人类与机器人之间的具身差异。具体而言，我们从原始视频中提取任务相关物体和人手的3D关键点，并在这些关键点上训练自回归Transformer。我们观察到，在关键点层面，尤其是手腕和指尖，人类和机器人行为高度一致，从而实现了直接的策略迁移。在一系列涵盖拾取放置和工具使用的真实机器人任务中，Dexterous Point Policy达到了75.0%的成功率，而最先进的VLA基线仅达到1.0%。此外，我们的方法对未见过的场景具有很强的泛化能力，包括多物体环境和新的物体类别。

diffusion

Diffusion

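35 papers

Today's Diffusion papers span video generation, robot control, reinforcement-learning planning, image editing, and model compression. Five trends stand out: 1) inference acceleration is a hot topic, with several papers on few-step generation, distillation, and compression; 2) video diffusion draws attention, including temporal editing, attention mechanisms, and memory for long videos; 3) work on controllability continues, covering text alignment, editing boundaries, and distribution adaptation; 4) applications extend to engineering problems such as robot decision-making, subsurface flow simulation, and PDE surrogate training; 5) safety and copyright are gaining attention, with one paper studying how copyright protection can be bypassed.

Bypassing Copyright Protection in Diffusion-based Customization via Two-Stage Latent Feature Optimization - Proposes a method for bypassing copyright protection in diffusion-based customization; useful for understanding where current defenses fail.
Making Time Editable in Video Diffusion Transformers - Adds editing along the time dimension to video diffusion transformers, a notable step for controllable video generation.
Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies - Exposes an adversarial vulnerability in robotic diffusion policies, an important warning for safe deployment.
Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving - Applies diffusion to autonomous driving planning, proposing history-annealed planning with time-dependent guidance.
Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization - Few-step autoregressive diffusion for real-time lip synchronization, balancing efficiency and quality.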

Tractogram foundation model

2026-06-10T04:00:00 | cs.AI, cs.LG, diffusion, eess.IV | arXiv:2606.09893

Chinese title: 纤维追踪基础模型

Authors: Guikun Chen, Yuqian Chen, Yijie Li, Yogesh Rathi, Nikos Makris, Fan Zhang, Wenguan Wang, Lauren J. O'Donnell

Abstract:

Diffusion MRI (dMRI) tractography is the only noninvasive approach for mapping white-matter pathways in the living human brain. It represents each brain as a tractogram: a large, unordered set of three-dimensional streamlines that includes information about both local streamline geometry and whole-brain anatomical organization. This structure makes tractograms a natural but challenging target for representation learning. Existing methods treat streamline classification and subject-level prediction as separate problems: streamline classifiers focus on geometric patterns, whereas subject-level prediction often depends on hand-crafted features. As a result, current methods do not learn reusable representations that connect streamline anatomy with whole-brain inter-subject variation. Here we introduce TractFM, a tractogram foundation model that learns reusable representations directly from whole-brain streamline sets. TractFM combines a local streamline encoder with a permutation-equivariant tractogram encoder, allowing all streamlines from a subject to be contextualized jointly in a single forward pass. Pretraining on dense anatomical tract parcellation, i.e., assigning anatomical labels to individual streamlines, yields two complementary representations: contextualized streamline-level embeddings for tract parcellation and compact subject-level descriptors for downstream prediction of subject phenotypes. Across three tractography algorithms and five dMRI datasets, TractFM transfers to both streamline-level and subject-level tasks. Its frozen representations achieve accurate tract parcellation and predict age and sex across independent datasets. These results show that whole-brain geometric context, learned once, can generalize across tractography pipelines, datasets, and prediction tasks.

Abstract (Chinese):

扩散磁共振成像（dMRI）纤维追踪是目前唯一能够无创绘制活体人脑白质通路的方法。它将每个大脑表示为一份纤维追踪图：一个庞大的无序三维流线集合，包含了局部流线几何结构和全脑解剖组织的信息。这种结构使纤维追踪图成为自然但具有挑战性的表示学习目标。现有的方法将流线分类和受试者水平预测视为独立的问题：流线分类器侧重于几何模式，而受试者水平预测往往依赖于手工设计的特征。因此，当前方法无法学习将流线解剖结构与全脑受试者间差异联系起来的高复用性表示。本研究提出了TractFM，一个直接从全脑流线集合学习高复用性表示的纤维追踪基础模型。TractFM将局部流线编码器与排列等变的纤维追踪图编码器相结合，使所有流线能够在单次前向传播中被联合上下文化。在密集解剖学纤维束分区（即为每条流线分配解剖标签）上进行预训练后，产生了两种互补的表示：用于纤维束分区的上下文化流线级嵌入，以及用于下游受试者表型预测的紧凑受试者级描述符。在三种纤维追踪算法和五个dMRI数据集上，TractFM能够迁移到流线级和受试者级任务。其冻结表示在独立数据集上实现了准确的纤维束分区，并能预测年龄和性别。这些结果表明，学习一次的全脑几何上下文能够跨纤维追踪流程、数据集和预测任务进行泛化。

Bypassing Copyright Protection in Diffusion-based Customization via Two-Stage Latent Feature Optimization

2026-06-10T04:00:00 | cs.AI, cs.CR, cs.CV, diffusion | arXiv:2606.09909

Chinese title: 通过两阶段潜在特征优化绕过基于扩散模型定制化中的版权保护

Authors: Ziang Xu, Wenbo Yu, Hongyao Yu, Hao Fang, Jiawei Kong, Bin Chen, Hao Wu, Shu-Tao Xia, Zhiyong Wu

Abstract:

With the growing concerns over copyright infringement in diffusion-based customization, adversarial attacks have emerged as a prominent defense strategy to prevent malicious content forgery in personalized image generation. However, current defenses typically introduce persistent perturbations in the latent space of Latent Diffusion Models (LDMs), which remain susceptible to adaptive bypasses by adversaries. In this paper, we introduce Two-Stage Latent Feature Optimization (TS-LFO), an efficient and effective copyright-stealing attack against protected diffusion-based customization. We begin by observing that existing defenses primarily disrupt the mapping between input images and their latent representations, thereby degrading the model's ability to produce personalized outputs. To counteract this, TS-LFO restores the broken mapping through a two-stage optimization process. In the Latent Denoising Stage, we enhance semantic consistency between latent codes and input images by jointly minimizing a Latent-Image Alignment Loss and a Latent Diffusion Loss with timestep-dependent weights, effectively suppressing the high-frequency noise introduced by defenses. In the Latent Reconstruction Stage, we recover low-frequency semantic information using pixel-level constraints to refine the latent features. Extensive experiments show that TS-LFO consistently bypasses state-of-the-art (SOTA) copyright defenses and outperforms SOTA copyright attacks such as DiffPure, GrIDPure and IMPRESS across diverse settings.

Abstract (Chinese):

随着对扩散模型定制化中版权侵权问题的日益关注，对抗攻击已成为防止个性化图像生成中恶意内容伪造的重要防御策略。然而，当前防御方法通常在潜在扩散模型（LDMs）的潜在空间中引入持久性扰动，容易受到对手的自适应绕过攻击。本文提出两阶段潜在特征优化（TS-LFO），一种针对受保护扩散模型定制化的高效版权窃取攻击。我们首先观察到，现有防御方法主要破坏输入图像与其潜在表示之间的映射关系，从而削弱模型生成个性化输出的能力。为应对这一问题，TS-LFO通过两阶段优化过程恢复被破坏的映射。在潜在去噪阶段，我们通过联合最小化潜在-图像对齐损失和具有时间步依赖权重的潜在扩散损失，增强潜在码与输入图像之间的语义一致性，有效抑制防御引入的高频噪声。在潜在重建阶段，我们使用像素级约束来恢复低频语义信息，从而优化潜在特征。大量实验表明，TS-LFO始终能够绕过最先进的版权防御，并在各种设置下优于DiffPure、GrIDPure和IMPRESS等先进版权攻击方法。

Learning Where to Simulate: Generative Active Sampling for Online PDE Surrogate Training

2026-06-10T04:00:00 | cs.AI, cs.LG, diffusion | arXiv:2606.09949

Chinese title: 学习模拟何处：面向在线偏微分方程代理模型训练的生成式主动采样方法

Authors: Pierre Cesar (DATAMOVE), Sofya Dymchenko (DATAMOVE), Abhishek Purandare (DATAMOVE), Bruno Raffin (DATAMOVE)

Abstract:

Data-driven PDE surrogates are trained with data produced by numerical PDE solvers. However, when the surrogate's goal is to generalize across a wide range of PDE configurations (e.g., initial conditions and physical coefficients), generating a representative training set is non-trivial. Uniform sampling of configuration parameters often under-represents trajectories exhibiting challenging dynamics, leading to high prediction errors and large error variance in the trained surrogate. Online training, where data generation and surrogate training are coupled, offers a natural advantage by allowing solver parameters to be steered on-the-fly. To efficiently exploit this capability, we introduce Online Generative Active Sampling (OGAS), an active learning method that reactively learns the relationship between configuration parameters and surrogate performance to control the sampling distribution. OGAS trains a fast diffusion model in parallel to the surrogate to act as a conditional sampler, mapping a surrogate-derived difficulty signal (e.g., loss or uncertainty) to configuration parameters. By actively drawing target signals from a prior biased toward high difficulty, OGAS continuously steers data generation toward challenging regimes without delaying the training workflow. We evaluate OGAS across 2D PDEs with distinct challenging dynamics (Kuramoto-Sivashinsky, Navier-Stokes, Gray-Scott) and up to 308 parameters, using multiple surrogate architectures. Across all settings, OGAS consistently improves tail statistics, yielding substantial reductions in errors above the 99th percentile and overall error dispersion compared to uniform sampling. While prioritizing challenging trajectories introduces a trade-off with average error, OGAS effectively ensures worst-case reliability of trained surrogates with negligible wall-time overhead.

摘要中文：

数据驱动的偏微分方程代理模型使用数值偏微分方程求解器产生的数据进行训练。然而，当代理模型的目标是跨广泛的偏微分方程配置（例如初始条件和物理系数）进行泛化时，生成具有代表性的训练集并非易事。配置参数的均匀采样往往低估了具有挑战性动力学的轨迹，导致训练后的代理模型预测误差较高且误差方差较大。数据生成与代理训练相耦合的在线训练通过允许求解器参数实时调整而具有天然优势。为了有效利用这一能力，我们提出了在线生成式主动采样（Online Generative Active Sampling, OGAS），这是一种主动学习方法，能够学习配置参数与代理模型性能之间的关系，从而控制采样分布。OGAS在代理模型旁并行训练一个快速扩散模型作为条件采样器，将代理模型衍生的难度信号（例如损失值或不确定性）映射到配置参数。通过主动从偏向高难度的先验分布中提取目标信号，OGAS持续将数据生成引导至具有挑战性的区域，而不会延误训练流程。我们在具有不同挑战性动力学的二维偏微分方程上评估OGAS，包括Kuramoto-Sivashinsky方程、Navier-Stokes方程和Gray-Scott方程，参数规模高达308个，并使用多种代理模型架构。在所有设置中，OGAS始终改善了尾部统计特性，与均匀采样相比，在第99百分位以上的误差和整体误差离散度方面实现了显著降低。虽然优先处理挑战性轨迹会在平均误差上引入权衡，但OGAS有效地确保了训练后代理模型在最坏情况下的可靠性，且时间开销可忽略不计。
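
The sampling loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the conditional diffusion sampler is replaced by a nearest-neighbour lookup, and `difficulty`, the `Beta(5, 1)` prior, and the one-dimensional configuration space are all assumptions made for illustration.

```python
import numpy as np

def difficulty(theta):
    # Hypothetical stand-in for the surrogate's loss at configuration theta.
    return theta ** 2

rng = np.random.default_rng(0)

# Buffer of configurations already simulated, with their observed difficulty.
buffer_theta = rng.uniform(-1.0, 1.0, size=500)
buffer_loss = difficulty(buffer_theta)

def propose(n, rng, jitter=0.02):
    """Draw target difficulties from a prior biased toward hard cases,
    then map each target back to a configuration."""
    # Beta(5, 1) puts most of its mass near 1, i.e. near the top loss quantiles.
    u = rng.beta(5.0, 1.0, size=n)
    targets = np.quantile(buffer_loss, u)
    # Nearest-neighbour lookup replaces the conditional diffusion sampler.
    idx = np.abs(buffer_loss[None, :] - targets[:, None]).argmin(axis=1)
    theta = buffer_theta[idx] + jitter * rng.normal(size=n)
    return np.clip(theta, -1.0, 1.0)

active = propose(200, rng)                    # difficulty-biased proposals
uniform = rng.uniform(-1.0, 1.0, size=200)    # uniform baseline
```

Comparing `difficulty(active).mean()` with `difficulty(uniform).mean()` shows the intended shift toward hard configurations. In an online setting the buffer and the sampler would presumably be refreshed as the surrogate trains.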

Temporal Sheaf Neural Networks with Dynamic Orthogonal Transport

2026-06-10T04:00:00cs.AI, cs.LG, diffusion2606.10071

中文标题：具有动态正交传输的时间层神经网络

作者：Md Sadek Hossain Asif, Tanzila Khan, Md. Mosaddek Khan

摘要：

We introduce Temporal Sheaf Neural Networks (TSNN), a temporal link prediction framework that equips each node with a time-varying orthogonal frame and compares node states only after explicit transport between local coordinate systems. In contrast to existing continuous-time graph models that operate in a shared global embedding space, TSNN models node-specific and evolving interaction semantics through dynamic local frames. The model parameterizes per-node frames via efficient low-rank Householder products, preserves stored hidden states exactly under frame updates, and uses a geometric-residual decoder that anchors predictions on transported distances while learning residual corrections. All computations are strictly causal and use only the pre-event history. We show that the symmetric degree-normalized sheaf Laplacian is orthogonally similar to the symmetric normalized graph Laplacian, with the random-walk normalized form similar in the corresponding degree metric; the full-active, feature-scaled diffusion used by TSNN is exactly a metric-gradient step on the combinatorial sheaf Dirichlet energy, with a degree-free monotone-descent and non-expansiveness guarantee. Frame drift perturbs updates only linearly. Across TGB v2 link-prediction and temporal-heterogeneous leaderboards, together with the DGB benchmark suite, TSNN matches or surpasses the strongest prior methods on most benchmarks, with the largest improvements on graphs exhibiting strong node-role heterogeneity. Ablations confirm the distinct benefit of dynamic frames, orthogonal transport, and geometric-residual decoding.

摘要中文：

我们提出时间层神经网络（TSNN），这是一个时序链接预测框架，为每个节点配备随时间变化的正交框架，并仅在局部坐标系之间进行显式传输后才比较节点状态。与现有的在共享全局嵌入空间中工作的连续时间图模型不同，TSNN通过动态局部框架建模节点特定且不断演化的交互语义。该模型通过高效的低秩Householder乘积参数化每个节点的框架，在框架更新时精确保持存储的隐藏状态，并使用几何残差解码器将预测锚定在传输距离上，同时学习残差校正。所有计算都是严格因果的，仅使用事件发生前的历史。我们证明了对称度归一化层拉普拉斯算子与对称归一化图拉普拉斯算子正交相似，随机游走归一化形式则在相应的度度量下相似；TSNN使用的全活跃特征缩放扩散恰好是组合层Dirichlet能量上的一次度量梯度步，并具有与节点度无关的单调下降和非扩张性保证。框架漂移对更新的扰动仅为线性。在TGB v2链接预测和时序异构排行榜以及DGB基准测试套件中，TSNN在大多数基准测试上持平或超越此前最强的方法，在节点角色异构性强的图上改进最大。消融实验证实了动态框架、正交传输和几何残差解码各自的独特贡献。
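
A minimal sketch of the two geometric ingredients named in the abstract: an orthogonal frame built as a product of a few Householder reflections, and transport of a state between two local frames. The function names, and the convention that a frame's columns are the local basis vectors, are assumptions for illustration. The paper's time-varying parameterization is not reproduced.

```python
import numpy as np

def householder_frame(vectors):
    """Orthogonal matrix built as a product of Householder reflections
    H = I - 2 v v^T / (v^T v), one per row of `vectors`."""
    d = vectors.shape[1]
    frame = np.eye(d)
    for v in vectors:
        v = v / np.linalg.norm(v)
        frame = frame @ (np.eye(d) - 2.0 * np.outer(v, v))
    return frame

def transport(x_i, frame_i, frame_j):
    """Re-express a state stored in node i's frame in node j's frame.
    Assumed convention: global = frame @ local."""
    return frame_j.T @ (frame_i @ x_i)

rng = np.random.default_rng(0)
d, k = 8, 3                          # state dimension, reflections per frame
F_i = householder_frame(rng.normal(size=(k, d)))
F_j = householder_frame(rng.normal(size=(k, d)))
x_i = rng.normal(size=d)
x_in_j = transport(x_i, F_i, F_j)    # now comparable with node j's own state
```

Because each frame is orthogonal, transport preserves norms, so distances measured after transport are not distorted by the change of coordinates. With `k` reflections the frame differs from the identity by a matrix of rank at most `k`, which is one reading of "low-rank" here.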

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

2026-06-10T04:00:00cs.AI, cs.LG, diffusion, q-bio.QM2606.10080

中文标题：VFUSE：基于稀疏自编码器的毒性特征理解

作者：Michael Yu, Matthew L. Olson

摘要：

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model's representations: improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC $0.84$ ($q < 10^{-13}$). To our knowledge this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.

摘要中文：

生成模型在蛋白质设计等多个领域取得了显著进展，但这种能力也使得危险蛋白质的生成变得不透明。本研究提出了VFUSE（基于稀疏自编码器的毒性特征理解），这是一种机制可解释性方法，通过在扩散Transformer激活值上训练稀疏自编码器来审计蛋白质模型的危险感知特征。我们将VFUSE应用于RoseTTAFold3和RFDiffusion3，这两个是用于蛋白质折叠和合成的流行开放权重模型。我们发现，对于某些模块，在稀疏自编码器潜在空间中拟合的线性探针检测危险设计的效果显著优于在原始模型表征上拟合的效果：在不牺牲模型性能的前提下提高了可解释性。此外，我们从稀疏自编码器中识别出单语义特征，这些特征仅在危险设计上激活，AUROC最高达0.84（q < 10^{-13}）。据我们所知，这是首个在全原子扩散模型上训练的稀疏自编码器，也是首个对蛋白质设计模型进行特征级毒性审计的研究，为安全且可解释的蛋白质设计铺平了道路。

BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

2026-06-10T04:00:00autoregressive, cs.AI, cs.CV, diffusion2606.10135

中文标题：BiWM：利用双向自回归推进开源交互式视频世界模型

作者：Shaohao Rui, Xiaofeng Mao, Zhanyu Zhang, Peijia Lin, Yansong Zhu, Yibo Zhang, Haibin Wan, Weijie Ma

摘要：

摘要中文：

Making Time Editable in Video Diffusion Transformers

2026-06-10T04:00:00cs.AI, cs.CV, cs.MM, diffusion2606.10183

中文标题：实现视频扩散变换器中的时间编辑

作者：Konstantin Kuklev, Viacheslav Vasilev, Alexander Kunitsyn, Andrei Ivaniuta, Denis Dimitrov

摘要：

Modern Diffusion Transformers for video generation provide limited control over the progression of time and the editing of temporal dynamics. We propose a temporal-control methodology that extends a pretrained DiT with explicit time editing, allowing control over motion speed and temporal structure without redesigning the backbone. Its core implementation augments the pretrained model with a lightweight temporal module, preserving the original generative prior while expanding its controllable dynamic range.

摘要中文：

现代视频生成扩散变换器对时间进程和时间动态编辑的控制能力有限。本研究提出一种时序控制方法，通过为预训练的DiT模型添加显式时间编辑功能，实现了对运动速度和时间结构的控制，而无需重新设计主干网络。其核心实现方式是为预训练模型配备一个轻量级的时序模块，在保留原始生成先验的同时扩展其可控动态范围。

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

2026-06-10T04:00:00cs.AI, cs.RO, diffusion2606.10371

中文标题：测试时对抗劫持：针对机器人扩散策略的实时劫持接口

作者：Zi Yin, Peilin Chai, Siyuan Huang, Zhanhao Hu

摘要：

Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.

摘要中文：

基于扩散的动作生成已成为具身智能的基础组件，但其对视觉条件的依赖使得已部署的视觉运动策略容易受到对抗性操纵。先前的攻击大多聚焦于干扰：它们对观察流进行扰动以降低任务成功率或引发异常行为。我们研究了一种更严重的威胁：测试时对抗劫持（TAKO），攻击者由此获得对冻结机器人策略的实时操控接口，并将其转变为可远程驾驶的工具。TAKO 通过可微扩散推理学习少量可复用的通用补丁；在测试时，攻击者在摄像头流中切换这些补丁以组合出攻击者选择的轨迹。该方法之所以奏效，是因为扰动作用于视觉条件通路，由此产生的偏差可以在迭代生成推理中持续存在。我们进一步表明，自然的定向基线（目标策略匹配）会失败，因为受害策略无法在分布外的目标偏移上可靠地监督自身。在四项任务（二维操作、模拟空中投递、模拟地面导航和真实世界地面导航）、两种视觉编码器（ResNet-18 和 EfficientNet-B0 + Transformer）以及三种生成推理家族（DDPM、DDIM 和流匹配）上，人类操作员在每个评估设置中均实现了对攻击者定义目标的 100% 劫持成功率。项目页面见 https://tako-attack.github.io。

Machine Learning Methods for Studying Latent Neural Activity Dynamics

2026-06-10T04:00:00cs.AI, cs.LG, diffusion2606.10530

中文标题：用于研究潜在神经活动动态的机器学习方法

作者：Shufeng Kong, Fumei Deng, Xinyi Dong, Caihua Liu, Weiwei Chen, Yingheng Wang, Daniel Cao, Azahara Oliva, Antonio Fernandez-Ruiz, Carla Gomes

摘要：

Recent developments in brain recording are driving a demand for machine learning tools capable of decoding the latent structure of large populations of neurons. In this paper, we provide a comprehensive survey that outlines the trajectory of Latent Variable Models (LVMs) from early state-space models to more recent deep generative models. We organize the literature into three closely related domains: (1) Single-Region Latent Dynamics, which includes models such as linear dynamical systems to more complex dynamics represented by Recurrent Neural Networks (RNNs) and Neural Ordinary Differential Equations (ODEs); (2) Multi-Region Communication, which employs probabilistic as well as subspace methods to study how information is transferred across different brain areas considering synaptic propagation delays and network connectivity; and (3) Behavior-Aligned Modeling, which seeks to disentangle neural activity related to task performance from other internal states via supervised or contrastive learning. This survey also includes large-scale neural foundation models, such as Transformers and diffusion models, that rely on large-scale pre-training for optimal performance across subjects. Finally, we conclude and discuss benchmarks, evaluation criteria, and open challenges, such as the ability to identify causal links or directionality of communication, to facilitate future research for bridging interpretable brain dynamics with reliable neural decoding.

摘要中文：

脑记录技术的最新发展推动了对机器学习工具的需求，这些工具能够解码大规模神经元群体的潜在结构。本文提供了一篇综合综述，概述了潜在变量模型（LVMs）从早期状态空间模型到近期深度生成模型的发展轨迹。我们将相关文献组织为三个紧密相关的领域：（1）单区域潜在动态，包括从线性动态系统到由循环神经网络和神经常微分方程（Neural ODEs）表示的更复杂动态；（2）多区域通信，采用概率方法和子空间方法研究信息在不同脑区之间的传递，同时考虑突触传播延迟和网络连接；（3）行为对齐建模，通过监督学习或对比学习将任务表现相关的神经活动与其他内部状态分离。本综述还涵盖了大尺度神经基础模型，如Transformer和扩散模型，这些模型依赖于大规模预训练以在不同被试间实现最佳性能。最后，我们总结并讨论了基准测试、评估标准以及开放性挑战，例如识别因果关系或通信方向性的能力，以促进未来将可解释的大脑动态与可靠的神经解码相结合的研究。

Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

2026-06-10T04:00:00cs.AI, cs.LG, diffusion2606.10613

中文标题：基于自助流Q学习的离线强化学习快速高表达力策略学习

作者：Thanh Nguyen, Tri Ton, Hongbin Choe, Tung M. Luu, Chang D. Yoo

摘要：

Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference, without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline Reinforcement Learning.

摘要中文：

基于扩散的Q学习已成为离线强化学习的一种强大范式，但其对多步去噪的依赖使得训练和推理过程计算成本高且不够稳健。近年来，加速扩散Q学习以实现单步动作生成的努力通常引入辅助网络、策略蒸馏或多阶段训练，这些方法往往会影响简洁性、稳定性或性能。为解决这些局限性，我们提出了自助流Q学习（Bootstrapped Flow Q-Learning, BFQ），这是一个新颖的框架，能够在训练和推理过程中实现精确的单步动作生成，且无需辅助网络或蒸馏过程。BFQ采用分而治之的视角处理流路径上的位移向量：它首先学习可从流匹配边缘速度精确估计的短程位移，然后自助引导这些分量，以在单步中直接学习噪声到动作的映射。这种表述消除了多步去噪，从而实现了显著更快、更简单且更稳健的学习过程。广泛的D4RL评估表明，BFQ在提升性能的同时显著降低了计算成本，相比多步扩散基线方法具有优势，证明了单步动作生成足以实现高性能离线强化学习。

Improving Text-Instance Alignment Of Foreground Conditioned Out-Painting Via Customized Concept Embedding

2026-06-10T04:00:00cs.AI, cs.CV, diffusion2606.10892

中文标题：通过自定义概念嵌入改进前景条件外绘的文本-实例对齐

作者：Yihao Zhao, Xuan Han, Bin He, Mingyu You

摘要：

To showcase products, merchants often incur substantial costs creating high-quality display images. Foreground Conditioned Outpainting (FCO) meets this demand, allowing users to create desired backgrounds for foreground instances at a low cost by adjusting the text prompt. However, existing text-driven FCO methods exhibit critical flaws in their outputs, most notably the presence of artifacts, which refer to regions in the synthesized background that share the same semantics as the foreground instance. Such artifacts diminish the object's prominence and degrade image quality. We attribute the issue to the misalignment between the given instance and text-derived concept embeddings. To address this, we propose the Customized Concept Embedding Diffusion (CCE-Diffusion) framework. Its core is a CCE-Module to customize concept embeddings, bridging the gap between generic noun semantics and a specific visual instance. An Instance-Aware Loss guides the module's optimization, while a Semantic-Preserving Prompt Template prevents customized embeddings from distorting other words in the prompt. Both qualitative and quantitative evaluations demonstrate that CCE-Diffusion significantly reduces artifacts in the outputs. As a plug-and-play component, the CCE-Module can integrate with various FCO methods, enhancing their performance.

摘要中文：

为展示产品，商家通常需要投入大量成本创建高质量的展示图像。前景条件外绘（FCO）满足了这一需求，允许用户通过调整文本提示词，以低成本为前景实例创建所需的背景。然而，现有的文本驱动FCO方法在输出结果中存在关键缺陷，尤其是伪影问题——即合成背景中存在与前景实例语义相同的区域。这种伪影会削弱物体的突出性，降低图像质量。我们将这一问题归因于给定实例与文本派生概念嵌入之间的错位。为解决这一问题，我们提出了自定义概念嵌入扩散（CCE-Diffusion）框架。其核心是CCE-Module，用于自定义概念嵌入，弥合通用名词语义与特定视觉实例之间的差距。实例感知损失（Instance-Aware Loss）引导该模块的优化，而语义保持提示词模板（Semantic-Preserving Prompt Template）则防止自定义嵌入扭曲提示词中的其他词汇。定性和定量评估表明，CCE-Diffusion显著减少了输出中的伪影。作为一个即插即用的组件，CCE-Module可以与各种FCO方法集成，提升其性能。

Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving

2026-06-10T04:00:00cs.AI, cs.RO, diffusion2606.11019

中文标题：扩散强制规划器：面向自动驾驶的基于时间依赖引导的历史退火规划方法

作者：Zehan Zhang, Neng Zhang, Yaoyi Li, Jia Cai, Zhiling Wang

摘要：

Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can accumulate into unstable trajectories, degrading comfort and safety in closed-loop driving. Several methods attempt to inject history as a static conditioning signal to stabilize outputs, only to induce the planner to copy historical patterns instead of adapting to environment contexts. To address this limitation, we propose Diffusion Forcing Planner (DFP), a diffusion-based planning framework driven by history-guided control. Specifically, DFP decomposes the full trajectory into history, current and future segments, and assigns independent noise levels to each segment. The model jointly denoises the historical and the future segments, enforcing a heterogeneous joint diffusion process. At inference, classifier-free guidance (CFG) is applied to steer future sampling using annealed history in a controllable manner. Closed-loop evaluation and comprehensive ablations on nuPlan show that DFP achieves competitive performance while producing continuous, stable, and controllable motion plans in complex driving scenarios.

摘要中文：

基于学习的运动规划器尽管近期取得了进展，但常常存在时间不一致性问题。跨帧的小扰动会累积为不稳定轨迹，从而在闭环驾驶中降低舒适性和安全性。多种方法尝试将历史作为静态条件信号注入以稳定输出，却导致规划器复制历史模式而非适应环境上下文。为解决这一局限性，我们提出了扩散强制规划器（Diffusion Forcing Planner, DFP），一种由历史引导控制驱动的基于扩散的规划框架。具体而言，DFP将完整轨迹分解为历史、当前和未来三个段落，并为每个段落分配独立的噪声水平。该模型对历史段落和未来段落联合去噪，强制执行异质联合扩散过程。在推理阶段，应用无分类器引导（classifier-free guidance, CFG）以可控方式使用退火历史引导未来采样。在nuPlan上的闭环评估和全面消融实验表明，DFP在复杂驾驶场景中实现了具有竞争力的性能，同时产生了连续、稳定且可控的运动规划结果。
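
The two mechanisms in this abstract, per-segment noise levels and classifier-free guidance, can be sketched as follows. The segment lengths, the `alpha_bars` values, and the function names are illustrative assumptions. The paper's actual noise schedule and history-annealing rule are not given in the abstract.

```python
import numpy as np

def noise_segments(traj, seg_lengths, alpha_bars, rng):
    """Forward-noise a trajectory with a separate noise level per segment:
    x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, where `a` is constant within a segment."""
    a = np.repeat(np.asarray(alpha_bars, dtype=float), seg_lengths)[:, None]
    eps = rng.normal(size=traj.shape)
    return np.sqrt(a) * traj + np.sqrt(1.0 - a) * eps

def cfg(pred_uncond, pred_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

rng = np.random.default_rng(0)
seg_lengths = [4, 1, 8]    # history, current, future waypoints (assumed sizes)
traj = np.stack([np.linspace(0.0, 12.0, 13), np.zeros(13)], axis=1)

# Lightly noised history, clean current state, heavily noised future.
noisy = noise_segments(traj, seg_lengths, alpha_bars=[0.9, 1.0, 0.1], rng=rng)
```

In this sketch, lowering the history segment's `alpha_bar` at inference weakens how strongly the sample is tied to past waypoints, which is one plausible reading of "annealed history". `cfg` would then combine model predictions made with and without that history.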

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

2026-06-10T04:00:00cs.AI, cs.LG, diffusion2606.11087

中文标题：强化学习中流策略的测试时梯度引导

作者：Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine

摘要：

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.

摘要中文：

表达性连续控制策略（如扩散模型和流模型）是近期模拟与真实机器人控制领域模仿学习规模化进展的核心技术。尽管这些策略在监督模仿学习设置下能够稳定扩展，但将其整合到强化学习（RL）流程中进行策略改进则更为困难。这通常需要专门的训练目标或通过去噪过程进行反向传播，从而引发众所周知的稳定性问题，并影响可扩展性。本文研究仅在测试时采用简单策略改进方案、同时保持稳定的监督策略训练不变，是否可以成为规避这些问题的有竞争力的替代方案。为此，我们提出 QGF（Q-Guided Flow），一种完全在测试时进行策略优化的 RL 算法。QGF 的工作原理是：预训练一个参考流策略（通过标准行为克隆目标）和一个价值函数评论家，在测试时利用价值梯度引导参考策略生成更高价值的动作，而无需进行额外的策略学习。实证结果表明，QGF 在高维动作空间的单任务和目标条件离线 RL 基准测试中优于此前的测试时 RL 方法，并且与最先进的训练时算法具有竞争力，同时运行成本更低。此外，由于避免了训练时演员-评论家训练的不稳定性，QGF 表现出良好的模型规模可扩展性，为使用表达性策略的 RL 提供了一种实用有效的替代算法。
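
A toy sketch of value-gradient guidance applied only at sampling time. The additive form `velocity + guidance * grad Q`, the quadratic critic, and the straight-line reference flow are assumptions chosen so the example runs without trained networks. The abstract does not spell out QGF's exact guidance rule.

```python
import numpy as np

a_bc = np.array([0.0, 0.0])      # action the toy reference policy produces
a_star = np.array([1.0, 0.5])    # maximiser of the toy critic

def q_value(a):
    # Hypothetical critic: higher is better, peak at a_star.
    return -np.sum((a - a_star) ** 2)

def q_grad(a):
    return -2.0 * (a - a_star)

def ref_velocity(a, t):
    # Straight-line flow that carries any point to a_bc at t = 1.
    return (a_bc - a) / (1.0 - t)

def sample(a0, guidance=0.0, n_steps=10):
    """Euler-integrate the reference flow, optionally nudged by the critic gradient."""
    a, dt = a0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        a = a + dt * (ref_velocity(a, t) + guidance * q_grad(a))
    return a

a0 = np.random.default_rng(0).normal(size=2)   # initial noise
plain = sample(a0)                             # reproduces the reference action
guided = sample(a0, guidance=0.5)              # shifted toward higher Q
```

The reference policy's parameters are never updated. Only the sampling trajectory changes, which is the property the abstract emphasizes.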

Data assimilation for subsurface flow using latent diffusion model parameterization: performance of ensemble-Kalman and Monte Carlo techniques

2026-06-10T04:00:00cs.AI, cs.LG, diffusion, physics.geo-ph, stat.AP, stat.ML2606.11140

中文标题：基于潜扩散模型参数化的地下水流数据同化：集合卡尔曼与蒙特卡洛技术的性能比较

作者：Guido Di Federico, Wenchao Teng, Louis J. Durlofsky

摘要：

Data assimilation (DA) in subsurface flow entails calibrating model parameters to match observed data, typically at wells, while preserving geological realism. Latent diffusion models (LDMs) provide efficient mappings from high-dimensional geological model space to a low-dimensional latent variable, reducing the dimensionality of the inverse problem while maintaining plausibility in posterior geomodels. However, the high nonlinearity in the LDM mapping may degrade the performance of Kalman-gain-based ensemble updates. We present a systematic comparison of DA algorithms applied to large-scale 3D channelized geomodels with hierarchical geological uncertainty. We compare model-space and latent-space DA using the ensemble smoother with multiple data assimilation (ESMDA), and demonstrate a key trade-off: model-space updates achieve significant uncertainty reduction but produce geologically unrealistic posterior models, while latent-space updates preserve realism but exhibit limited uncertainty reduction. Motivated by this, we explore rigorous Markov chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) algorithms in the 3D-LDM latent space. To accommodate their high computational demands, we develop a fast surrogate flow model that approximates well-rate responses. MCMC and SMC are evaluated against ESMDA across three synthetic test cases, with DA performed in the LDM latent space. All models maintain geological realism due to the LDM parameterization. MCMC and SMC are consistent with one another and achieve lower data mismatch and more uncertainty reduction than latent-space ESMDA. Our overall results demonstrate that ensemble Kalman methods may provide overestimated posterior uncertainty with highly nonlinear parameterizations, while rigorous Monte Carlo sampling, enabled by fast surrogate models, can provide a more reliable alternative.

摘要中文：

地下水流数据同化需要对模型参数进行校准以匹配观测数据（通常为井点数据），同时保持地质真实性。潜扩散模型（LDM）提供了从高维地质模型空间到低维潜变量的高效映射，在降低逆问题维度的同时保持后验地质模型的合理性。然而，LDM映射的高度非线性可能会降低基于卡尔曼增益的集合更新的性能。我们系统比较了应用于具有层级地质不确定性的大规模3D河道型地质模型的数据同化算法。我们比较了使用多重数据同化集合平滑器（ESMDA）在模型空间和潜空间中进行的数据同化，并展示了关键的权衡：模型空间更新实现了显著的不确定性降低，但产生了地质上不真实的后验模型；而潜空间更新保持了真实性，但不确定性降低有限。基于此，我们在3D-LDM潜空间中探索了严格的马尔可夫链蒙特卡洛（MCMC）和序贯蒙特卡洛（SMC）算法。为了满足其高计算需求，我们开发了一个近似井产量响应的快速代理流动模型。我们在三个合成测试案例中将MCMC和SMC与ESMDA进行了对比评估，数据同化均在LDM潜空间中进行。由于采用LDM参数化，所有模型都保持了地质真实性。MCMC和SMC的结果彼此一致，并且相比潜空间ESMDA实现了更低的数据失配和更大的不确定性降低。我们的整体结果表明，集合卡尔曼方法在高度非线性的参数化下可能高估后验不确定性，而借助快速代理模型实现的严格蒙特卡洛采样可以提供更可靠的替代方案。
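
For reference, a minimal sketch of the standard ES-MDA update on a toy linear problem. The forward model `G`, the dimensions, and the choice of four equal inflation factors are assumptions. In the paper's setting the ensemble members would be LDM latent vectors and the forward map would include decoding and flow simulation.

```python
import numpy as np

def esmda(M, forward, d_obs, obs_std, alphas, rng):
    """Ensemble smoother with multiple data assimilation.
    M: (n_params, n_ens) ensemble; alphas must satisfy sum(1 / alpha) == 1."""
    n_d, n_e = d_obs.size, M.shape[1]
    C_e = (obs_std ** 2) * np.eye(n_d)          # observation-error covariance
    for alpha in alphas:
        D = forward(M)                          # predicted data, (n_d, n_ens)
        dM = M - M.mean(axis=1, keepdims=True)
        dD = D - D.mean(axis=1, keepdims=True)
        C_md = dM @ dD.T / (n_e - 1)            # parameter-data cross-covariance
        C_dd = dD @ dD.T / (n_e - 1)            # data auto-covariance
        # Perturb observations with noise inflated by sqrt(alpha).
        D_obs = d_obs[:, None] + np.sqrt(alpha) * obs_std * rng.normal(size=(n_d, n_e))
        K = C_md @ np.linalg.inv(C_dd + alpha * C_e)
        M = M + K @ (D_obs - D)
    return M

rng = np.random.default_rng(0)
G = rng.normal(size=(3, 5))                     # toy linear forward model
m_true = rng.normal(size=5)
obs_std = 0.1
d_obs = G @ m_true + obs_std * rng.normal(size=3)

M_prior = rng.normal(size=(5, 200))
M_post = esmda(M_prior, lambda M: G @ M, d_obs, obs_std, alphas=[4.0] * 4, rng=rng)

def mismatch(M):
    # Distance between the ensemble-mean prediction and the observations.
    return np.linalg.norm(G @ M.mean(axis=1) - d_obs)
```

The update is linear in the ensemble anomalies, which is why a strongly nonlinear parameterization such as an LDM decoder can degrade it, as the abstract notes.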

Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making

2026-06-10T04:00:00cs.AI, cs.RO, cs.SY, diffusion, eess.SY2512.08280

中文标题：基于模型的扩散采样预测控制用于离线决策

作者：Haldun Balim, Na Li, Yilun Du

摘要：

Offline decision-making via diffusion models often produces trajectories that are misaligned with system dynamics, limiting their reliability for control. We propose Model Predictive Diffuser (MPDiffuser), a compositional diffusion framework that combines a diffusion planner with a dynamics diffusion model to generate task-aligned and dynamically plausible trajectories. MPDiffuser interleaves planner and dynamics updates during sampling, progressively correcting feasibility while preserving task intent. A lightweight ranking module then selects trajectories that best satisfy task objectives. The compositional design improves sample efficiency and adaptability by enabling the dynamics model to leverage diverse and previously unseen data independently of the planner. Empirically, we demonstrate consistent improvements over prior diffusion-based methods on unconstrained (D4RL) and constrained (DSRL) benchmarks, and validate practicality through deployment on a real quadrupedal robot.

摘要中文：

通过扩散模型进行离线决策时，产生的轨迹往往与系统动力学不一致，这限制了其在控制中的可靠性。我们提出了模型预测扩散器（Model Predictive Diffuser，MPDiffuser），这是一种组合式扩散框架，将扩散规划器与动力学扩散模型相结合，以生成任务对齐且动力学可行的轨迹。MPDiffuser在采样过程中交错执行规划器和动力学更新，逐步纠正可行性同时保留任务意图。随后，一个轻量级排序模块选择最能满足任务目标的轨迹。这种组合式设计通过使动力学模型能够独立于规划器利用多样化的前所未见的数据，提高了样本效率和适应性。我们在无约束（D4RL）和有约束（DSRL）基准测试上验证了相较于先前基于扩散的方法的一致性改进，并通过在真实四足机器人上的部署验证了其实用性。

MMD Guidance: Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance

2026-06-10T04:00:00cs.AI, cs.CV, cs.LG, diffusion2601.08379

中文标题：MMD引导：基于最大均值差异的扩散模型无训练分布适配

作者：Matina Mahdizadeh Sani, Nima Jamali, Mohammad Jalali, Farzan Farnia

摘要：

Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose MMD Guidance, a training-free mechanism that augments the reverse diffusion process with gradients of the Maximum Mean Discrepancy (MMD) between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity. The project code is available at github.com/matinamehdizadeh/MMD-Guidance.

摘要中文：

预训练扩散模型已成为强大的生成先验，可用于无条件样本生成和条件样本生成，但其输出往往与用户特定目标数据的特征存在偏差。这种不匹配在领域适应任务中尤为棘手，因为这类任务仅提供少量参考样本，且重新训练扩散模型并不可行。现有的推理时引导方法可以调整采样轨迹，但通常优化分类器似然等代理目标，而非直接与目标分布对齐。我们提出MMD引导（MMD Guidance），一种无需训练的机制，通过生成样本与参考数据集之间最大均值差异（Maximum Mean Discrepancy, MMD）的梯度来增强逆向扩散过程。MMD能够从有限数据中提供可靠的分布估计，在实际应用中具有低方差特性，且可高效求导，这使其非常适合引导任务。我们的框架可通过乘积核自然地扩展到条件生成模型中的提示感知适配。此外，由于引导作用于潜在扩散模型（LDMs）的潜在空间，该方法可以高效地应用于这类模型。在合成数据集和真实世界基准上的实验表明，MMD引导能够在保持样本保真度的同时实现分布对齐。项目代码可访问 github.com/matinamehdizadeh/MMD-Guidance。
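
A minimal sketch of the quantity that drives the guidance: the squared MMD under an RBF kernel and its analytic gradient with respect to the generated samples. The kernel choice, bandwidth, step size, and the plain gradient-descent loop standing in for the reverse diffusion process are all assumptions. Product kernels and latent-space application are omitted.

```python
import numpy as np

def rbf(A, B, sigma):
    """RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma):
    """Biased estimate of squared MMD between sample sets X and Y."""
    return rbf(X, X, sigma).mean() + rbf(Y, Y, sigma).mean() - 2.0 * rbf(X, Y, sigma).mean()

def mmd2_grad(X, Y, sigma):
    """Gradient of mmd2 with respect to each row of X."""
    n, m = len(X), len(Y)
    Kxx, Kxy = rbf(X, X, sigma), rbf(X, Y, sigma)
    # Row i of each term is sum_j K[i, j] * (x_i - other_j).
    pull_xx = Kxx.sum(1, keepdims=True) * X - Kxx @ X
    pull_xy = Kxy.sum(1, keepdims=True) * X - Kxy @ Y
    return (2.0 / sigma ** 2) * (pull_xy / (n * m) - pull_xx / n ** 2)

rng = np.random.default_rng(0)
Y = rng.normal(size=(64, 2))            # reference samples (target domain)
X = rng.normal(size=(64, 2)) + 3.0      # generated samples, shifted away
sigma, step = 2.0, 5.0

before = mmd2(X, Y, sigma)
X_guided = X.copy()
for _ in range(50):
    # Stand-in for adding the guidance term at each reverse step.
    X_guided = X_guided - step * mmd2_grad(X_guided, Y, sigma)
after = mmd2(X_guided, Y, sigma)
```

In an actual sampler the gradient would be added to each denoising update rather than applied in isolation. The closed-form gradient is what makes the term cheap to evaluate.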

Variational Learning for Insertion-based Generation

2026-06-10T04:00:00autoregressive, cs.AI, cs.LG, diffusion2606.02133

中文标题：基于插入的生成的变分学习

作者：Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying, David van Dijk, Michalis K. Titsias, Jiaxin Shi

摘要：

摘要中文：

Benchmarking stereo reconstruction for 3D printable Martian terrain models

2026-06-10T04:00:00cs.CV, diffusion2606.10364

中文标题：面向3D可打印火星地形模型的立体视觉重建基准测试

作者：Josephine Wang

摘要：

Reconstructing printable 3D models from Mars rover imagery is challenging because Martian terrain is low-texture, irregular, and partially observed. We evaluate a pipeline that estimates stereo depth from NASA Curiosity images, completes geometry, and exports watertight OBJ meshes. On Middlebury, RAFT-Stereo outperforms semi-global block matching (SGBM), reducing disparity MAE from 3.22px to 0.73px and increasing valid prediction coverage from 76.3% to 100.0%. On Curiosity imagery, however, RAFT's denser disparities show weaker edge alignment and higher photometric reprojection error, suggesting that benchmark accuracy does not directly transfer to Martian terrain reconstruction. Geometry completion demonstrates a tradeoff between local fidelity and global connectivity. We find that alpha shapes preserve accurate but fragmented structure, Poisson reconstruction produces more coherent meshes but adds unsupported surfaces, and a deterministic diffusion-fill baseline is intermediate but sensitive to stereo quality. Overall, standard stereo and completion methods can produce printable approximations of Martian terrain, but reliable reconstruction requires stronger domain-specific validation.

摘要中文：

从火星探测器图像重建可打印的3D模型具有挑战性，因为火星地形具有低纹理、不规则且仅被部分观测的特点。我们评估了一个从NASA好奇号图像估计立体深度、补全几何并导出水密OBJ网格模型的流程。在Middlebury数据集上，RAFT-Stereo优于半全局块匹配（SGBM），将视差平均绝对误差从3.22像素降低到0.73像素，并将有效预测覆盖率从76.3%提高到100.0%。然而，在好奇号图像上，RAFT更密集的视差表现出更弱的边缘对齐和更高的光度重投影误差，表明基准精度不能直接迁移到火星地形重建。几何补全展示了局部保真度与全局连通性之间的权衡。我们发现alpha形状保留了准确但碎片化的结构，泊松重建产生更连贯的网格但添加了缺乏观测支撑的表面，而确定性扩散填充基线介于两者之间但对立体匹配质量敏感。总体而言，标准立体视觉和补全方法可以产生火星地形的可打印近似，但可靠的重建需要更强的领域特定验证。
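
The two Middlebury metrics quoted above, disparity MAE and valid-prediction coverage, can be computed roughly as below. The validity rule (finite and positive disparity) and the function name are assumptions. The abstract does not state how invalid pixels are flagged.

```python
import numpy as np

def disparity_metrics(pred, gt):
    """Return (MAE in pixels over valid predictions, fraction of valid predictions)."""
    with np.errstate(invalid="ignore"):
        valid = np.isfinite(pred) & (pred > 0)   # assumed validity rule
    coverage = valid.mean()
    mae = np.abs(pred[valid] - gt[valid]).mean()
    return mae, coverage

# Tiny 2x2 example: one pixel has no prediction (NaN).
pred = np.array([[1.0, np.nan], [3.5, 4.0]])
gt = np.array([[1.0, 2.0], [3.0, 5.0]])
mae, coverage = disparity_metrics(pred, gt)      # mae = 0.5, coverage = 0.75
```

Reporting both numbers together matters because a method can lower its MAE simply by declining to predict on hard pixels.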

Few-step Generative Models as Lossy Compression

2026-06-10T04:00:00cs.CV, cs.LG, diffusion2606.10450

中文标题：少步生成模型作为有损压缩

作者：Fuma Kimishima, Jinjia Zhou

摘要：

DiffC provides a principled way to reuse pre-trained diffusion models for lossy compression, but its encoding and decoding procedures remain slow because they require many discretized forward and reverse steps. We study whether few-step generative models -- Rectified Flow, Consistency Trajectory Models (CTM), and MeanFlow -- can be cast as codecs within the same reverse channel coding (RCC) framework. The main challenge is that RCC requires posterior and shared distribution parameters, whereas these models do not explicitly parameterize intermediate conditional distributions. For Rectified Flow and MeanFlow, we use the equivalence between velocity parameterization and diffusion-style denoising parameterization to derive the quantities required by RCC. For CTM, which is distilled from EDM, we adopt the EDM noise parameterization together with local Gaussian approximations of the sender and shared distributions at intermediate states. This yields a proof-of-concept probabilistic formulation that enables compression with pre-trained few-step generative models without retraining. On low-resolution benchmarks, the resulting codecs reduce encoding and decoding time and improve realism in the low-bit-rate regime.

摘要中文：

DiffC提供了一种原则性的方法来复用预训练扩散模型进行有损压缩，但由于其编码和解码过程需要许多离散化的前向和反向步骤，速度仍然较慢。我们研究了少步生成模型——Rectified Flow、Consistency Trajectory Models（CTM）和MeanFlow——是否可以在相同的反向通道编码（RCC）框架下被转换为编解码器。主要挑战在于RCC需要后验和共享分布参数，而这些模型并未显式参数化中间条件分布。对于Rectified Flow和MeanFlow，我们利用速度参数化与扩散式去噪参数化之间的等价性，推导出RCC所需的量。对于从EDM蒸馏而来的CTM，我们采用EDM噪声参数化，并结合发送方和共享分布在中间状态的局部高斯近似。这产生了一种概念验证的概率公式化方法，使得无需重新训练即可使用预训练少步生成模型进行压缩。在低分辨率基准测试中，所得到的编解码器减少了编码和解码时间，并在低比特率场景下提升了真实感。
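
The velocity/denoising equivalence mentioned for Rectified Flow can be checked numerically. The sketch assumes the common convention `x_t = (1 - t) * x0 + t * eps` with velocity `v = eps - x0`. The paper may use a different time direction or scaling.

```python
import numpy as np

def velocity_to_denoised(x_t, v, t):
    # Recover the clean sample implied by a velocity prediction.
    return x_t - t * v

def velocity_to_noise(x_t, v, t):
    # Recover the noise implied by the same velocity prediction.
    return x_t + (1.0 - t) * v

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)      # clean sample
eps = rng.normal(size=4)     # Gaussian noise
t = 0.3

x_t = (1.0 - t) * x0 + t * eps   # point on the straight path
v = eps - x0                     # exact velocity along that path

x0_hat = velocity_to_denoised(x_t, v, t)
eps_hat = velocity_to_noise(x_t, v, t)
```

Once a velocity model can be read as a denoiser in this way, the Gaussian conditionals that reverse channel coding needs can be written in the usual diffusion form.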

SSR-Merge: Subspace Signal Routing for Training-Free LoRA Merging in Diffusion Models

2026-06-10T04:00:00cs.CV, diffusion2606.10617

中文标题：SSR-Merge：扩散模型中免训练LoRA融合的子空间信号路由方法

作者：Zhengxuan Wei, Yi Dong, Zonghui Li, Xianhui Lin, Xing Liu, Hong Gu, Shaofeng Zhang, Wenbin Li, Qi Fan

摘要：

Low-Rank Adaptation (LoRA) merging can efficiently combine diverse generative capabilities from multiple trained LoRAs for a diffusion model. However, existing LoRA merging techniques often suffer from severe parameter interference, causing destructive collisions in the shared parameter space. To address this, we propose Subspace Signal Routing (SSR), which resolves interference by routing internal signals instead of performing parameter-space merge. Specifically, SSR first constructs a unified subspace by concatenating candidate LoRAs along the rank dimension. Next, SSR employs an inverse correlation matrix to decorrelate mixed signals within this space. Finally, a directional guide matrix steers these purified signals into their respective task-specific subspaces. We provide a rigorous theoretical analysis proving that SSR aligns with the Ordinary Least Squares (OLS) solution, thereby ensuring mathematical optimality. We utilize the additivity of sufficient statistics to design a streaming algorithm. This enables on-the-fly updates that significantly reduce memory overhead and computation time. Extensive experiments validate that SSR significantly outperforms state-of-the-art methods while maintaining comparable efficiency. Code is available at https://github.com/nagara214/SSR-Merge.

摘要中文：

低秩适应（LoRA）融合能够高效地将多个已训练LoRA的多样化生成能力整合到扩散模型中。然而，现有的LoRA融合技术常常遭受严重的参数干扰问题，导致共享参数空间中出现破坏性冲突。为解决这一问题，我们提出了子空间信号路由（SSR）方法，该方法通过路由内部信号而非执行参数空间融合来消除干扰。具体而言，SSR首先沿秩维度连接候选LoRA以构建统一子空间。随后，SSR利用逆相关矩阵对该空间内的混合信号进行解相关处理。最后，方向引导矩阵将这些净化后的信号引导至各自的任务特定子空间。我们提供了严谨的理论分析，证明了SSR与普通最小二乘（OLS）解相一致，从而确保了数学上的最优性。我们利用充分统计量的可加性设计了一种流式算法，实现了显著降低内存开销和计算时间的即时更新。大量实验验证表明，SSR在保持相当效率的同时显著优于最先进的方法。代码可访问 https://github.com/nagara214/SSR-Merge。
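
The first step, building a unified subspace by concatenating LoRAs along the rank dimension, rests on a simple identity: the sum of the individual updates `B_i @ A_i` equals one product of the concatenated factors. The sketch below checks only that identity. Shapes and variable names are assumptions, and the decorrelation and routing matrices are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, ranks = 6, 5, [2, 3]

# Each LoRA is a pair (B, A) with update B @ A of shape (d_out, d_in).
loras = [(rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))) for r in ranks]

# Naive parameter-space merge: add the individual updates.
summed = sum(B @ A for B, A in loras)

# Rank-dimension concatenation: one wide B and one tall A.
B_cat = np.concatenate([B for B, _ in loras], axis=1)   # (d_out, sum of ranks)
A_cat = np.concatenate([A for _, A in loras], axis=0)   # (sum of ranks, d_in)
```

`A_cat @ x` exposes each LoRA's internal signal as a separate block of coordinates, which is presumably where a routing step can act before `B_cat` maps back to the output space.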

STEDiff: Strengthening Text Embedding for Text-to-Image Alignment in Diffusion Model

2026-06-10T04:00:00cs.CV, diffusion2606.10653

中文标题：STEDiff：增强文本嵌入以提升扩散模型中的文本到图像对齐

作者：Hailan Zhang, Haipeng Liu, Bo Fu, Yang Wang

摘要：

Although pretrained text-to-image (T2I) generation models can produce high-quality images, they often fail to faithfully reflect the semantic intent of complex prompts due to stochastic noise and inherent model limitations. This issue frequently manifests as the model overlooking specific objects or failing to correctly bind attributes to their corresponding entities, a challenge referred to as semantic alignment. Unlike existing approaches that rely on computationally expensive fine-tuning or labor-intensive layout priors, we propose STEDiff, a training-free method designed to enhance semantic representations directly within the text-embedding space. Specifically, we introduce a method that primarily leverages the [EOT] token to strengthen the relevant semantics of sub-sentences and then replaces the corresponding tokens in the original prompt. Furthermore, a novel semantic enhancement loss is incorporated to enforce spatial constraints, ensuring that the semantics of each entity are precisely mapped to their respective image regions. Extensive quantitative and qualitative evaluations on the T2I-CompBench demonstrate that our method notably improves semantic consistency and generation integrity in complex scenarios.

摘要中文：

虽然预训练的文本到图像（T2I）生成模型能够生成高质量图像，但由于随机噪声和固有模型局限性，它们往往无法忠实地反映复杂提示词的语义意图。此问题通常表现为模型忽略特定对象或未能正确将属性绑定到相应实体，即所谓的语义对齐难题。与现有依赖计算成本高昂的微调方法或劳动密集型布局先验的方法不同，我们提出了STEDiff，这是一种无需训练的方法，旨在直接增强文本嵌入空间中的语义表示。具体而言，我们引入了一种主要利用[EOT]标记来强化子句相关语义的方法，随后用强化后的表示替换原始提示词中的对应标记。此外，我们还引入了语义增强损失来强制执行空间约束，确保每个实体的语义精确映射到各自对应的图像区域。在T2I-CompBench上的大量定量和定性评估表明，我们的方法在复杂场景下显著提升了语义一致性和生成完整性。

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

2026-06-10T04:00:00 | autoregressive, cs.CV, diffusion | 2606.10671

中文标题：FadeMem: 面向自回归视频扩散的距离感知记忆巩固

作者：Yu Lu, Junjie Yang, Piotr Koniusz, YuXin Song, Yi Yang

摘要：

摘要中文：

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

2026-06-10T04:00:00 | autoregressive, cs.CV, diffusion | 2606.11180

中文标题：Lip Forcing：面向实时唇同步的少步自回归扩散方法

摘要：

摘要中文：

On the Controllability-Fidelity Frontier in Diffusion Editing

2026-06-10T04:00:00 | cs.CV, cs.GR, cs.HC, cs.LG, cs.MM, diffusion | 2606.09901

中文标题：扩散编辑中的可控性-保真度前沿研究

作者：Yi Hu, Leying Yi, Emily Davis, Finn Carter

摘要：

Diffusion-based generative models enable powerful image editing capabilities, but achieving precise control while maintaining fidelity and safety remains challenging. We present a comprehensive theoretical and empirical study of controllable diffusion-based image editing, analyzing the trade-offs between adherence to user intent, preservation of non-target content, and output quality. Our work spans text- and mask-guided edits, point/drag manipulation, and inversion-based pipelines. We derive mathematical formulations of editing objectives and analyze dynamics of noise injection, score guidance, and inversion error. We provide theoretical bounds on reconstruction error, stability under repeated edits, and locality of changes. We propose algorithmic frameworks (with pseudocode) for mask-localized and instruction-guided editing, and present extensive experiments comparing state-of-the-art methods (e.g., TF-ICON, DragFlow, InstructPix2Pix, UltraEdit) on multiple tasks and metrics (FID, identity similarity, CLIP alignment, artifact scores, etc.). Our results reveal key failure modes, such as identity drift, prompt sensitivity, and compositional errors. We also discuss ethical considerations in image editing, including misuse risks, bias, consent, and concept erasure techniques (e.g., MACE, ANT, EraseAnything) as safeguards. We conclude with best practices and future directions for responsible, high-fidelity diffusion-based editing.

摘要中文：

基于扩散的生成模型实现了强大的图像编辑能力，但如何在保持保真度和安全性的同时实现精确控制仍然具有挑战性。本文对可控的基于扩散的图像编辑进行了全面的理论和实证研究，分析了遵循用户意图、保留非目标内容与输出质量之间的权衡。本研究涵盖文本和掩码引导的编辑、点/拖拽操作以及基于反演的管道。我们推导了编辑目标的数学公式，并分析了噪声注入、分数引导和反演误差的动态特性。我们提供了关于重建误差、重复编辑下的稳定性以及变化局部性的理论界。我们提出了掩码局部化和指令引导编辑的算法框架（附伪代码），并进行了广泛实验，在多个任务和指标上比较了先进方法（如TF-ICON、DragFlow、InstructPix2Pix、UltraEdit等）。我们的结果揭示了关键失效模式，如身份漂移、提示敏感度和组合错误。我们还讨论了图像编辑中的伦理考量，包括误用风险、偏见、同意问题以及概念擦除技术（如MACE、ANT、EraseAnything）作为防护措施。最后，我们总结了负责任、高保真扩散编辑的最佳实践和未来方向。

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

2026-06-10T04:00:00 | cs.CV, cs.LG, cs.RO, diffusion | 2606.10025

中文标题：GHOST: 用于泛化机器人操作的分层子目标策略

作者：Sriram Krishna, Ben Eisner, Haotian Zhan, Ying Yuan, Haoyu Zhen, Chuang Gan, Shubham Tulsiani, David Held

摘要：

We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.

摘要中文：

我们提出了GHOST框架，用于学习能够超越训练分布进行泛化的视觉运动操作策略。GHOST将控制分解为：(i)一个高层策略，根据多视角RGB-D观测预测作为3D末端执行器姿态分布的下一个子目标；(ii)一个低层目标条件控制器，执行具身特定的动作。为了将基于图像的策略条件化到3D目标上，我们引入了一个简单的空间接口，将预测的目标投影到图像平面并将其表示为末端执行器热图。在一系列操作任务中，这种分层分解方法相比扁平扩散策略(Diffusion Policy)能够持续提升性能和鲁棒性。此外，我们还表明这种分层接口可以轻松整合人类演示，而无需依赖（噪声较大的）动作重定向。由于子目标在很大程度上与具身无关，我们在人类视频上训练高层策略，以指定习得技能的组合方式，同时保持低层策略完全基于机器人数据训练。这种分层架构使得仅使用少量人类演示就能适应新物体和任务变体。
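
The spatial interface described above (project a predicted 3D goal into the image plane, then represent it as a heatmap) can be sketched with a standard pinhole camera model. The Gaussian heatmap shape, the `sigma` value, the intrinsics, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math


def project_point(point_xyz, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates (u, v)."""
    x, y, z = point_xyz
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return fx * x / z + cx, fy * y / z + cy


def goal_heatmap(point_xyz, intrinsics, height, width, sigma=2.0):
    """Render a Gaussian heatmap centred on the projected goal (assumed form)."""
    u, v = project_point(point_xyz, *intrinsics)
    return [
        [
            math.exp(-((col - u) ** 2 + (row - v) ** 2) / (2.0 * sigma ** 2))
            for col in range(width)
        ]
        for row in range(height)
    ]


# Made-up intrinsics (fx, fy, cx, cy) and a goal 1 m in front of the camera.
intrinsics = (100.0, 100.0, 32.0, 24.0)
heatmap = goal_heatmap((0.1, -0.05, 1.0), intrinsics, height=48, width=64)

# Locate the brightest pixel as (row, col).
peak = max(
    ((r, c) for r in range(48) for c in range(64)),
    key=lambda rc: heatmap[rc[0]][rc[1]],
)
```

An image-based low-level policy can then take such a heatmap as an extra input channel, which is one plausible way to condition on a 3D goal without changing the policy's input modality.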

Overlapped Wavelet Diffusion for Low-Light Image Enhancement

2026-06-10T04:00:00 | cs.CV, diffusion, eess.IV | 2606.10280

中文标题：用于低光照图像增强的重叠小波扩散框架

作者：Fen Peng, Taizo Suzuki, Seisuke Kyochi

摘要：

In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.

摘要中文：

本研究提出了一种用于低光照图像增强（LLIE）的重叠小波扩散框架，该框架包含两个互补组件，以实现无块状伪影且保留细节的增强效果。尽管与传统方法相比，近期基于扩散的LLIE方法已展现出优异的性能，但DiffLL仍存在由哈尔小波变换（WT）导致的块状伪影问题，以及由于高频恢复模块（HFRM）的局限性而产生的边缘模糊或纹理过度平滑问题。为克服这些问题，我们引入了重叠小波变换（OWT），该方法融合了相邻区域之间的相关性，从而从结构上防止块状伪影。此外，我们整合了一个低频引导的高频增强块（HFEBlock）以加强细节恢复，从而获得更清晰的边缘和更可靠的纹理。在LOLv1和LOLv2-real数据集上的大量实验表明，我们提出的框架OWDiff在定性和定量方面均持续优于现有的LLIE方法，在保持计算效率的同时实现了卓越的视觉质量。OWDiff有效解决了Haar WT的结构性局限性和HFRM的不足，在LOLv1和LOLv2-real数据集上相比DiffLL实现了平均0.58 dB的PSNR提升，以及1.64%的SSIM相对提升和5.9%的LPIPS相对降低。
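
To see why a Haar transform can introduce block boundaries, it helps to look at one level of the standard 2D Haar transform: every coefficient is computed from exactly one non-overlapping 2x2 block. The sketch below shows only this textbook transform, not the paper's Overlapped WT; the function name and sub-band labels are chosen here for illustration.

```python
def haar_level1(image):
    """One level of the orthonormal 2D Haar transform on an even-sized image.

    Each output coefficient depends on a single 2x2 block, so neighbouring
    blocks never share information.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image sides must be even")
    approx, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, c_row, r_row, g_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            p, q = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((a + b + p + q) / 2.0)  # low-pass average
            c_row.append((a - b + p - q) / 2.0)  # difference across columns
            r_row.append((a + b - p - q) / 2.0)  # difference across rows
            g_row.append((a - b - p + q) / 2.0)  # diagonal difference
        approx.append(a_row)
        d_cols.append(c_row)
        d_rows.append(r_row)
        d_diag.append(g_row)
    return approx, d_cols, d_rows, d_diag


single_block = haar_level1([[1.0, 2.0], [3.0, 4.0]])

# Changing a pixel in the right-hand block leaves the left-hand block's
# coefficients untouched, which is the locality the text refers to.
before = haar_level1([[1.0, 2.0, 5.0, 5.0], [3.0, 4.0, 5.0, 5.0]])
after = haar_level1([[1.0, 2.0, 9.0, 5.0], [3.0, 4.0, 5.0, 5.0]])
```

Because errors introduced in one block's coefficients cannot be smoothed by its neighbours, independent processing of the sub-bands can leave visible seams; an overlapped transform ties adjacent regions together instead.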

One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

2026-06-10T04:00:00 | cs.CV, diffusion | 2503.13358

中文标题：基于蒸馏的图像超分辨率单步残差偏移扩散模型

作者：Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin

摘要：

Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift. Our method is based on training the student network to produce images such that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a noticeable margin in various perceptual metrics (LPIPS, CLIPIQA, MUSIQ). We show that our distillation method can surpass SinSR, the other distillation-based method for ResShift, making it on par with state-of-the-art diffusion SR distillation methods with limited computational costs in terms of perceptual quality. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality and requires fewer parameters, GPU memory, and training cost. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K. We provide the code at https://github.com/Daniil-Selikhanovych/RSD.

摘要中文：

用于超分辨率的扩散模型能够生成高质量的视觉结果，但需要较高的计算成本。尽管已开发出多种加速基于扩散的超分辨率模型的方法，但其中一些方法（如SinSR）无法产生逼真的感知细节，而另一些方法（如OSEDiff）可能会产生不存在的结构幻觉。为克服这些问题，我们提出了RSD，这是一种用于ResShift的新型蒸馏方法。我们的方法基于训练学生网络使其生成的图像能够使基于这些图像训练的新伪ResShift模型与教师模型重合。RSD实现了单步重建，并在各种感知指标（LPIPS、CLIPIQA、MUSIQ）上以明显优势超越教师模型。我们表明，我们的蒸馏方法能够超越SinSR（另一种基于ResShift的蒸馏方法），使其在感知质量方面与最先进的扩散超分辨率蒸馏方法相当，且计算成本有限。与基于预训练文本到图像模型的超分辨率方法相比，RSD产生了具有竞争力的感知质量，且参数更少、GPU显存更少、训练成本更低。我们在各种真实世界和合成数据集上提供了实验结果，包括RealSR、RealSet65、DRealSR、ImageNet和DIV2K。我们在https://github.com/Daniil-Selikhanovych/RSD上提供了代码。

Cost-Aware Routing for Efficient Text-To-Image Generation

2026-06-10T04:00:00 | cs.CV, cs.LG, diffusion | 2506.14753

中文标题：面向高效文本到图像生成的成本感知路由方法

作者：Qinchan Li, Kenneth Chen, Changyue Su, Wittawat Jitkrittum, Qi Sun, Patsorn Sangkloy

摘要：

Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone. Code is available at https://github.com/winglicopy/CATImage.

摘要中文：

扩散模型以其通过迭代去噪过程为输入提示词生成高保真图像的能力而闻名。然而，由于其固有的顺序生成特性，高保真度也伴随着高昂的计算成本。本研究旨在寻求质量与计算成本之间的最优平衡，并提出一个框架，允许根据每个提示词的复杂程度动态调整计算量。每个提示词会自动路由至最合适的文本到图像生成函数，该函数可能对应于扩散模型的不同去噪步数，或者是不同的独立文本到图像模型。与均匀成本削减技术（如蒸馏、模型量化）不同，本方法通过学习仅对少数复杂提示词保留昂贵选项（如100+去噪步数），并对较简单的提示词采用更经济的选项（如小型蒸馏模型），从而实现最优权衡。我们在COCO和DiffusionDB数据集上的实证表明，通过学习路由到九个已训练的文本到图像模型，本方法能够实现高于其中任何一个单独模型的平均质量。代码可访问 https://github.com/winglicopy/CATImage。
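
A per-prompt router of the kind described above can be sketched as a rule that picks the candidate with the best predicted quality after subtracting a cost penalty. The penalty form, the candidate names, the cost units, and the quality numbers below are all assumptions for illustration; in the paper the quality estimates would come from a learned predictor, which is not reproduced here.

```python
def route_prompt(predicted_quality, costs, trade_off):
    """Pick the candidate maximising predicted quality minus trade_off * cost."""
    return max(
        predicted_quality,
        key=lambda name: predicted_quality[name] - trade_off * costs[name],
    )


# Hypothetical candidates; cost is measured in denoising steps.
costs = {"distilled_4_steps": 4, "base_50_steps": 50, "base_100_steps": 100}

# Hypothetical predicted quality for an easy and a hard prompt.
simple_prompt = {"distilled_4_steps": 0.80, "base_50_steps": 0.82, "base_100_steps": 0.83}
complex_prompt = {"distilled_4_steps": 0.40, "base_50_steps": 0.70, "base_100_steps": 0.85}

choice_simple = route_prompt(simple_prompt, costs, trade_off=0.001)
choice_complex = route_prompt(complex_prompt, costs, trade_off=0.001)
```

With these numbers the easy prompt goes to the cheap distilled model, since extra steps buy little quality, while the hard prompt is sent to the 100-step option. Raising `trade_off` shifts more prompts toward cheaper candidates.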

FG-Attn: Leveraging Fine-Grained Sparse Attention in Video Diffusion Models

2026-06-10T04:00:00 | cs.AR, cs.CV, diffusion | 2509.16518

中文标题：FG-Attn：在视频扩散模型中利用细粒度稀疏注意力

作者：Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Tianlei Pang, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar

摘要：

Using diffusion transformers for media generation may require evaluating attention over extremely long sequences, with attention layers accounting for the majority of generation latency. Exploiting sparsity in attention maps offers a promising opportunity to reduce this cost. In this work, we show that attention maps in diffusion transformers exhibit significant fine-grained sparsity in video generation models. Existing sparse attention methods, however, are too coarse-grained, leaving a large fraction of redundant computation unaddressed, or incur high overheads at finer granularity. We propose FG-Attn, a novel, low-overhead fine-grained sparse attention mechanism that skips score computations at the granularity of an MxN tile, where N>=1 and M>=16, and where each tile is the result of query-key dot products between M queries and N keys. FG-Attn addresses the key challenge of hardware underutilization in sparse attention kernels on GPUs, without incurring the overheads of irregular memory access and redundant operations. FG-Attn can fully supersede existing sparse attention methods and extend block sparse attention methods to finer granularities on modern GPUs. At 70% sparsity, FG-Attn is up to 2.45X faster than the state-of-the-art FlashInfer, and reduces attention kernel time by 14.7% on average. FG-Attn speeds up end-to-end video generation times by up to 1.40X (1.18X on average) over Flash Attention 3.

摘要中文：

使用扩散变换器进行媒体生成可能需要评估极长序列上的注意力，而注意力层占据了大部分生成延迟。利用注意力图中的稀疏性为降低这一成本提供了很好的机会。在本工作中，我们发现视频生成模型中扩散变换器的注意力图表现出显著的细粒度稀疏性。然而，现有的稀疏注意力方法粒度过粗，导致大量冗余计算未被处理，或在细粒度下产生过高开销。我们提出FG-Attn，这是一种新型低开销细粒度稀疏注意力机制，可在MxN瓦片粒度上跳过得分计算，其中N≥1且M≥16，每个瓦片是M个查询与N个键之间查询-键点积的结果。FG-Attn解决了GPU上稀疏注意力内核中硬件利用不足的关键问题，同时避免了不规则内存访问和冗余操作的开销。FG-Attn能完全替代现有稀疏注意力方法，并将块稀疏注意力扩展到现代GPU上的更细粒度。在70%稀疏率下，FG-Attn比最先进的FlashInfer最高快2.45倍，平均减少14.7%的注意力内核计算时间。相对于Flash Attention 3，FG-Attn将端到端视频生成速度最高提升至1.40倍（平均1.18倍）。
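
The semantics of tile-level skipping can be shown with a small reference implementation: a boolean map says which (query tile, key tile) pairs are computed, and skipped tiles contribute no dot products and no softmax weight. This is only a functional sketch in plain Python with toy tile sizes (the abstract states M >= 16 on real hardware); it says nothing about the GPU kernel design, and how the keep-map is chosen is left out entirely.

```python
import math


def tile_sparse_attention(q, k, v, keep, m, n):
    """Softmax attention that skips whole m-by-n tiles of the score matrix.

    keep[i][j] is True when the tile pairing query tile i with key tile j
    should be computed; otherwise every score in that tile is skipped.
    """
    dim = len(q[0])
    out = []
    for qi, query in enumerate(q):
        scores = {}
        for kj, key in enumerate(k):
            if not keep[qi // m][kj // n]:
                continue  # whole tile skipped: no dot product is evaluated
            scores[kj] = sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
        if not scores:
            out.append([0.0] * len(v[0]))  # no key kept for this query
            continue
        top = max(scores.values())
        weights = {kj: math.exp(s - top) for kj, s in scores.items()}
        norm = sum(weights.values())
        out.append(
            [sum(w * v[kj][d] for kj, w in weights.items()) / norm for d in range(len(v[0]))]
        )
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]

# One query tile of 2 rows, two key tiles of 1 column each.
dense = tile_sparse_attention(q, k, v, keep=[[True, True]], m=2, n=1)
sparse = tile_sparse_attention(q, k, v, keep=[[True, False]], m=2, n=1)
```

When every tile is kept the result matches ordinary softmax attention; when the second key tile is dropped, both queries attend only to the first key.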

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

2026-06-10T04:00:00 | cs.CV, cs.GR, diffusion | 2512.14614

中文标题：WorldPlay：面向实时交互式世界建模的长期几何一致性

作者：Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo

摘要：

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key ingredients. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

摘要中文：

本文提出了WorldPlay，一个流式视频扩散模型，能够实现具有长期几何一致性的实时交互式世界建模，解决了当前方法中速度与记忆之间的权衡问题。WorldPlay得益于三个关键组成部分。1）我们采用双动作表示（Dual Action Representation），以响应用户的键盘和鼠标输入，实现稳健的动作控制。2）为强化长期一致性，我们的重构上下文记忆（Reconstituted Context Memory）动态地从过去帧重建上下文，并使用时间重帧技术使几何上重要但久远的帧保持可访问性，有效缓解了记忆衰减问题。3）我们还提出了上下文强制（Context Forcing），一种专为记忆感知模型设计的新型蒸馏方法。通过对齐师生模型之间的记忆上下文，保持学生模型使用长程信息的能力，从而在实现实时推理速度的同时防止误差漂移。综合而言，WorldPlay能够以24 FPS生成长时程720p流式视频，具有卓越的一致性，与现有技术相比表现出色，并展现出在多样化场景中的强大泛化能力。项目页面和在线演示可见：https://3d-models.hunyuan.tencent.com/world/ 和 https://3d.hunyuan.tencent.com/sceneTo3D。

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

2026-06-10T04:00:00 | cs.CV, diffusion | 2602.06886

中文标题：提示词重注入：缓解多模态扩散变换器中的提示词遗忘

作者：Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu

摘要：

Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.

摘要中文：

用于文本到图像生成的多模态扩散变换器（MMDiT）维持独立的文本分支和图像分支，在整个去噪过程中文本token与视觉潜向量之间存在双向信息流动。在此设置下，我们观察到一种提示词遗忘现象：文本分支中提示词表示的语义随层深增加而逐渐被遗忘。我们通过探测文本分支各层表示的语言属性，在三个代表性MMDiT模型（SD3、SD3.5和FLUX.1）上进一步验证了这一效应。基于这些发现，我们引入了一种无需训练的方法——提示词重注入，将早期层的提示词表示重新注入到后期层以缓解这种遗忘。在GenEval、DPG和T2I-CompBench++数据集上的实验表明，指令遵循能力获得了一致的提升，同时在偏好、美学以及整体文本-图像生成质量等指标上也都有所改善。

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

2026-06-10T04:00:00 | cs.CV, cs.RO, diffusion | 2603.20850

中文标题：Glove2Hand：从多模态传感手套合成自然手-物体交互

作者：Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, Li Guan

摘要：

Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.

摘要中文：

理解手-物体交互（HOI）对于计算机视觉、机器人和增强现实/虚拟现实至关重要。然而，传统的手部视频通常缺乏关键的物理信息，如接触力和运动信号，且容易频繁出现遮挡。为解决这些挑战，我们提出了Glove2Hand，一个将多模态传感手套HOI视频转换为逼真的赤手图像，同时忠实保留底层物理交互动力学的框架。我们引入了一种确保时间渲染一致性的新型3D高斯手部模型。渲染的手部使用基于扩散的手部修复器无缝集成到场景中，该修复器能有效处理复杂的手-物体交互和非刚性变形。利用Glove2Hand，我们创建了HandSense，这是首个多模态HOI数据集，包含具有同步触觉和IMU信号的手套到手部视频。我们证明HandSense显著增强了赤手下游应用，包括基于视频的接触估计和严重遮挡下的手部追踪。

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

2026-06-10T04:00:00 | cs.CV, diffusion | 2605.07800

中文标题：SARA：视频扩散模型的语义自适应关系对齐

作者：Jiesong Lian, Zixiang Zhou, Ruizhe Zhong, Yuan Zhou, Qinglin Lu, Rui Wang, Long Hu, Yixue Hao, Baoru Huang

摘要：

Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study. Project page: https://saradit.github.io/.

摘要中文：

尽管最新的视频扩散模型（VDMs）能够合成视觉上令人信服的片段，但仍会出现实体遗漏、属性误绑以及削弱提示词中指定的交互等问题。VideoREPA和MoAlign等表征对齐目标通过从冻结的视觉基础模型中提取时空token关系来改进细粒度文本跟随，但其成对监督预算是基于视觉或运动线索分配的，而非基于每对关系与提示词的相关程度。本文提出SARA（语义自适应关系对齐），该方法保持token关系蒸馏（TRD）在冻结的VFM目标上运行，并引入文本条件显著性来确定哪些token对需要监督。我们训练了一个轻量级的第一阶段对齐器，使用每个实体的SAM 3.1掩码监督和InfoNCE正则化器，并将其连续显著性通过成对路由算子融合到TRD中——该算子每当两个端点中的任一端点显著时即为该token对分配权重，从而将监督引导至主体-主体和主体-背景对，而远离背景-背景对。在Wan2.2持续训练设置下，SARA在13维VLM评分标准、公开VBench基准测试以及盲评用户研究中，均在文本对齐和运动质量方面超越了SFT、VideoREPA和MoAlign。项目主页：https://saradit.github.io/
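
The pair-routing idea (a token pair receives supervision when either endpoint is salient) can be sketched with a soft OR over per-token saliency. Using `max` as the soft OR, the squared-error relation loss, and the function names are assumptions for illustration; the abstract does not give the operator's exact form.

```python
def pair_weights(saliency):
    """Weight for each token pair: high if either endpoint is salient (max as a soft OR)."""
    n = len(saliency)
    return [[max(saliency[i], saliency[j]) for j in range(n)] for i in range(n)]


def weighted_relation_loss(student_rel, teacher_rel, weights):
    """Weighted mean squared difference between student and teacher token relations."""
    num = 0.0
    den = 0.0
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            num += w * (student_rel[i][j] - teacher_rel[i][j]) ** 2
            den += w
    return num / den if den > 0 else 0.0


# Token 0 is the subject; tokens 1 and 2 are background.
weights = pair_weights([1.0, 0.0, 0.0])

teacher = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.9], [0.1, 0.9, 1.0]]
# The student disagrees only on the background-background pair (1, 2).
student = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.0], [0.1, 0.0, 1.0]]
loss_background_error = weighted_relation_loss(student, teacher, weights)

# A student that disagrees on a subject-background pair is penalised.
student_subject_error = [[1.0, 0.7, 0.1], [0.7, 1.0, 0.9], [0.1, 0.9, 1.0]]
loss_subject_error = weighted_relation_loss(student_subject_error, teacher, weights)
```

Under this weighting, mismatches between two background tokens carry no loss, while any pair touching the subject does, which matches the routing behaviour the abstract describes.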

The Emergence of Reproducibility and Generalizability in Diffusion Models

2026-06-10T04:00:00 | cs.CV, cs.LG, diffusion | 2310.05264

中文标题：扩散模型中可复现性与泛化性的涌现

作者：Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Peng Wang, Liyue Shen, Qing Qu

摘要：

In this work, we investigate an intriguing and prevalent phenomenon of diffusion models which we term as "consistent model reproducibility": given the same starting noise input and a deterministic sampler, different diffusion models often yield remarkably similar outputs. We confirm this phenomenon through comprehensive experiments, implying that different diffusion models consistently reach the same data distribution and scoring function regardless of diffusion model frameworks, model architectures, or training procedures. More strikingly, our further investigation implies that diffusion models are learning distinct distributions affected by the training data size. This is supported by the fact that the model reproducibility manifests in two distinct training regimes: (i) "memorization regime", where the diffusion model overfits to the training data distribution, and (ii) "generalization regime", where the model learns the underlying data distribution. Our study also finds that this valuable property generalizes to many variants of diffusion models, including those for conditional use, solving inverse problems, and model fine-tuning. Finally, our work raises numerous intriguing theoretical questions for future investigation and highlights practical implications regarding training efficiency, model privacy, and the controlled generation of diffusion models.

摘要中文：

在这项工作中，我们研究了一个有趣且普遍存在的扩散模型现象，称之为“一致的模型可复现性”：给定相同的初始噪声输入和确定性采样器，不同的扩散模型往往产生非常相似的输出。我们通过全面的实验确认了这一现象，表明无论扩散模型框架、模型架构或训练过程如何，不同的扩散模型始终能够达到相同的数据分布和评分函数。更令人惊讶的是，我们进一步研究表明，扩散模型学习的是受训练数据量影响的独特分布。这一发现得到以下事实的支持：模型可复现性在两种不同的训练模式下表现出来：(i) “记忆化模式”，即扩散模型过拟合到训练数据分布；(ii) “泛化模式”，即模型学习潜在的数据分布。我们的研究还发现，这一宝贵特性可推广到多种扩散模型变体，包括条件生成、逆问题求解和模型微调等场景。最后，我们的工作提出了许多有趣的理论问题供未来研究，并强调了关于训练效率、模型隐私和扩散模型可控生成的实际意义。

Breaking the Curse of Dimensionality: Diffusion Models Efficiently Learn Low-Dimensional Distributions

2026-06-10T04:00:00 | cs.CV, cs.LG, diffusion | 2409.02426

中文标题：打破维度灾难：扩散模型高效学习低维分布

作者：Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, Qing Qu

摘要：

Despite their empirical success across a wide range of generative tasks, the fundamental principles underlying the ability of diffusion models to learn data distributions are poorly understood. In this work, we develop a new mathematical framework that explains how diffusion models can effectively learn low-dimensional distributions from a finite number of training samples without suffering from the curse of dimensionality. Specifically, motivated by the intrinsic low-dimensional structure of image data, we theoretically analyze a setting in which the data distribution is modeled as a mixture of low-rank Gaussians. Under suitable network parameterization, we show that optimizing the training objective of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples, where each subspace basis corresponds to the low-rank covariance of a Gaussian component. This equivalence allows us to show that the sample complexity for learning the underlying distribution scales linearly with the intrinsic dimension of the data, rather than exponentially with the ambient dimension. Our theoretical findings are further supported by empirical evidence that demonstrates phase transition phenomena in generalization on both synthetic and real-world image datasets. Moreover, we establish a correspondence between the learned subspace bases and semantic attributes of image data, providing a principled foundation for controllable image generation.

摘要中文：

尽管扩散模型在广泛的生成任务中取得了实证成功，但扩散模型学习数据分布能力的基本原理仍缺乏充分理解。在本文中，我们构建了一个新的数学框架，解释扩散模型如何能够从有限数量的训练样本中有效学习低维分布，而不受维度灾难之苦。具体而言，受图像数据内在低维结构的启发，我们理论上分析了一种数据分布被建模为低秩高斯混合模型的设置。在合适的网络参数化条件下，我们表明优化扩散模型的训练目标等价于在训练样本上求解典型子空间聚类问题，其中每个子空间基对应于高斯分量的低秩协方差。这一等价关系使我们能够证明学习底层分布的样本复杂度随数据的内在维度线性增长，而非随环境维度指数增长。我们的理论发现进一步得到了实证证据的支持，证明了在合成和真实世界图像数据集上泛化过程中存在的相变现象。此外，我们建立了学习到的子空间基与图像数据语义属性之间的对应关系，为可控图像生成提供了原则性基础。
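
For readers who want the data model in symbols, a mixture of low-rank Gaussians can be written as below. The zero means, the orthonormal bases, and the symbols $K$, $\pi_k$, $U_k$, $n$, $d_k$ are assumptions of this sketch; the abstract only states that the components have low-rank covariances whose bases correspond to subspaces.

```latex
% Mixture of low-rank Gaussians; zero means and orthonormal U_k are assumed here.
x \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(0,\; U_k U_k^{\top}\right),
\qquad
U_k \in \mathbb{R}^{n \times d_k},
\quad U_k^{\top} U_k = I_{d_k},
\quad d_k \ll n,
\quad \sum_{k=1}^{K} \pi_k = 1 .
```

Each component is supported on the $d_k$-dimensional subspace spanned by the columns of $U_k$, which is why recovering the distribution resembles subspace clustering and why the relevant dimension is $d_k$ rather than the ambient dimension $n$.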

image_compression

Image Compression

3 篇论文

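None of the three papers filed under this category today directly addresses image compression. Their actual research directions are:

Sigma-Branch (2606.09924): neural network architecture optimization, focusing on dynamic inference and single-path network reconstruction to reduce the number of active parameters and improve computational efficiency.
Dissect and Prune (2606.10309): AI-generated image detection, studying how to make the recognition of synthetic images more robust; this is closest to digital media forensics.
Advancing Wood Identification in the Philippines (2606.10876): an applied computer vision study focused on AI-based identification of Philippine wood species.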

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

2026-06-10T04:00:00 | cs.AI, cs.LG, image_compression | 2606.09924

中文标题：Sigma-Branch：用于减少活跃参数动态推理的分层单路径网络重构

作者：Kohga Tanaka, Hiroaki Nishi

摘要：

Deploying deep neural networks on memory-constrained edge accelerators is bottlenecked by per-inference off-chip weight transfer rather than computation: the dense network cannot be retained on-chip, and every parameter must be loaded for every input. Existing model compression reduces this transfer only at the cost of permanent capacity loss. We propose Sigma-Branch (SigmaB), a framework that restructures a pretrained dense network into a hierarchical binary tree composed of a shared backbone, hierarchical routers, and specialized leaves. Pretrained weights are distributed across the tree via activation-based spherical k-means clustering, which jointly initializes router weights and per-branch channel allocations; soft-routing fine-tuning then aligns each leaf with its routed input subset. At inference, the resulting network executes only a single root-to-leaf path, reducing the active-parameter footprint while storing the complete dense parameter set in memory. Across CIFAR-100 / ResNet-50, ImageNet-1K / ResNet-50, and ModelNet40 / PointNet++, SigmaB-Net reduces per-inference active parameters by 58-60% while remaining within 1.72 percentage points (pp) of the dense baseline Top-1. At comparable ImageNet-1K Top-1, the active-parameter reduction exceeds static structured pruning (FPGM, HRank) by 14-23 pp. The cross-modal evaluation, spanning 2D vision and 3D point-cloud backbones, substantiates a framework-level claim that decouples per-inference memory traffic from the total parameter count.

摘要中文：

在内存受限的边缘加速器上部署深度神经网络时，每次推理的片外权重传输而非计算成为瓶颈：密集网络无法保留在芯片上，且每个输入必须加载全部参数。现有的模型压缩技术仅能以永久容量损失为代价减少这种传输。我们提出Sigma-Branch（SigmaB）框架，将预训练的密集网络重构为分层二叉树结构，由共享骨干网络、分层路由器和专用叶子节点组成。预训练权重通过基于激活的球面k-means聚类在树中分布，联合初始化路由器权重和每分支通道分配；软路由微调随后将每个叶子节点与其路由输入子集对齐。在推理过程中，产生的网络仅执行单条从根到叶的路径，在内存中存储完整密集参数集的同时减少活跃参数占用。在CIFAR-100/ResNet-50、ImageNet-1K/ResNet-50和ModelNet40/PointNet++上，SigmaB-Net在每次推理中减少58-60%的活跃参数，同时与密集基线Top-1精度的差距控制在1.72个百分点以内。在ImageNet-1K Top-1精度相当的情况下，活跃参数减少幅度比静态结构化剪枝（FPGM、HRank）高出14-23个百分点。跨模态评估涵盖2D视觉和3D点云骨干网络，验证了框架级声明：每次推理的内存传输与总参数数量解耦。
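
The inference-time behaviour described above (follow a single root-to-leaf path, so only part of the stored parameters is touched) can be sketched with a toy binary tree. The linear router, the sign-based branching rule, the parameter counts, and all names are hypothetical; the clustering-based initialization and soft-routing fine-tuning are not shown.

```python
def run_single_path(tree, x, backbone_params):
    """Follow one root-to-leaf path and count the parameters that were touched."""
    node, path, active = tree, [], backbone_params
    while "leaf" not in node:
        active += len(node["router"])  # router weights on the path are loaded
        score = sum(w * xi for w, xi in zip(node["router"], x))
        branch = "right" if score > 0 else "left"
        path.append(branch)
        node = node[branch]
    return node["leaf"], path, active + node["params"]


def total_params(tree, backbone_params):
    """Count every parameter stored in the tree, whether or not it is executed."""
    def count(node):
        if "leaf" in node:
            return node["params"]
        return len(node["router"]) + count(node["left"]) + count(node["right"])
    return backbone_params + count(tree)


# Toy tree: a root router, one leaf on the left, a second router on the right.
tree = {
    "router": [1.0, -1.0],
    "left": {"leaf": "leaf_A", "params": 40},
    "right": {
        "router": [0.0, 1.0],
        "left": {"leaf": "leaf_B", "params": 40},
        "right": {"leaf": "leaf_C", "params": 40},
    },
}

leaf, path, active = run_single_path(tree, (2.0, 1.0), backbone_params=100)
stored = total_params(tree, backbone_params=100)
```

The gap between `active` and `stored` is the quantity the abstract targets: the full parameter set stays in memory, but each input only pulls in the weights along its own path.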

Dissect and Prune: Enhancing Robustness in AI-Generated Image Detection

2026-06-10T04:00:00 | cs.CV, image_compression | 2606.10309

中文标题：分析并剪枝：增强AI生成图像检测的鲁棒性

作者：Dahye Kim, Jaehyun Choi, Hyun Seok Seong, Seongho Kim, Donghun Lee, Sungwon Yi, Jang-Ho Choi

摘要：

While existing AI-generated image detectors report high performance, we identify that this is largely driven by a critical prediction asymmetry: a bias toward the real class that severely limits sensitivity to generated content, especially under standard post-processing operations such as compression and resizing. We hypothesize that this stems from the model's reliance on spurious features, distracting signals that obscure true generative artifacts. To address this, we propose DEAR (Dissect and Prune), which leverages inpainted images to identify and prune these interfering components. Specifically, we find that features strongly aligned to either inpainted or non-inpainted regions are less robust to post-processing. By measuring the alignment between channel activations and inpaint masks, DEAR removes features at both extremes, retaining only those that capture genuine generative artifacts. Experimental results demonstrate that our approach significantly enhances robustness against unseen generators and post-processing, effectively mitigating the prediction asymmetry. Our code is available at https://github.com/dahyedahye/dear.

摘要中文：

虽然现有的AI生成图像检测器报告了高性能，但我们发现这主要由关键的预测不对称驱动：即对真实类别的严重偏向，严重限制了检测生成内容的能力，尤其是在标准后处理操作（如压缩和调整大小）下。我们假设这源于模型对虚假特征的依赖，这些干扰信号掩盖了真正的生成伪影。为此，我们提出了DEAR（分析并剪枝），该方法利用修复图像来识别并剪枝这些干扰成分。具体而言，我们发现与修复区域或非修复区域高度相关的特征对后处理的鲁棒性较差。通过测量通道激活与修复掩模之间的对齐程度，DEAR移除两端的特征，仅保留那些捕获真实生成伪影的特征。实验结果表明，我们的方法显著增强了对未知生成器和后处理的鲁棒性，有效缓解了预测不对称问题。我们的代码可访问 https://github.com/dahyedahye/dear。
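
The pruning rule described above (score each channel by how strongly it aligns with the inpaint mask, then drop channels at both extremes) can be sketched as follows. The alignment measure used here, the share of absolute activation inside the mask, and the thresholds `low` and `high` are assumptions for illustration, not the paper's definitions.

```python
def mask_alignment(activation, mask):
    """Share of a channel's absolute activation that falls inside the inpaint mask."""
    inside = 0.0
    total = 0.0
    for act_row, mask_row in zip(activation, mask):
        for a, m in zip(act_row, mask_row):
            total += abs(a)
            inside += abs(a) * m
    # A silent channel carries no evidence either way; treat it as neutral.
    return inside / total if total > 0 else 0.5


def channels_to_keep(activations, mask, low=0.2, high=0.8):
    """Keep channels whose alignment is neither extreme (thresholds are assumed)."""
    return [
        idx
        for idx, act in enumerate(activations)
        if low <= mask_alignment(act, mask) <= high
    ]


# The top row of the 2x2 image was inpainted.
mask = [[1, 1], [0, 0]]
activations = [
    [[1.0, 1.0], [0.0, 0.0]],  # fires only on the inpainted region
    [[0.0, 0.0], [2.0, 2.0]],  # fires only on the untouched region
    [[1.0, 0.0], [0.0, 1.0]],  # fires on both regions
]
kept = channels_to_keep(activations, mask)
```

Only the third channel survives in this toy case, mirroring the abstract's claim that features tied strongly to either region are the less robust ones.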

Advancing Wood Identification in the Philippines: Utilizing the Xylorix Platform for Efficient AI Model Development and Deployment for Five Key Species

2026-06-10T04:00:00 | cs.CV, image_compression | 2606.10876

中文标题：推动菲律宾木材鉴定发展：利用Xylorix平台高效开发并部署五种关键树种的人工智能模型

作者：Rosalie C. Mendoza, Vivian C. Daracan, Arlene D. Romano, Ronniel D. Manalo, Xin Jie Tang, Yi Hong Wong, Yong Haur Tay

摘要：

Illegal logging and timber trade continue to pose significant challenges in the Philippines, where accurate wood species identification is essential for enforcement but limited by the need for specialised equipment and expertise. This study aims to evaluate whether AI models for macroscopic wood identification can be developed and deployed by wood scientists without programming expertise using the Xylorix platform, focusing on five Philippine hardwood species: Mangium (Acacia mangium Willd.), Rain Tree [Samanea saman (Jacq.) Merr.], Banuyo (Wallaceodendron celebicum Koord.), Tindalo [Afzelia rhomboidea (Blanco) Vidal], and Ipil [Intsia bijuga (Colebr.) O. Kuntze]. Binary classifiers were trained on 10,663 verified cross-section images from 260 specimens and evaluated using specimen-level mean scoring to mirror operational field conditions. Area Under the ROC Curve (AUC) values ranged from 0.969 (Ipil) to 1.000 (Mangium), and Average Precision (AP) values ranged from 0.589 (Samanea) to 1.000 (Mangium). Four of five species achieved AA grade (AUC and AP both ≥ 0.90); Rain Tree received AE (AUC ≥ 0.90, AP < 0.60) due to AP compression from its small positive test set (3 specimens). All five classifiers rank their target specimens above non-target specimens with near-perfect fidelity. Specimen-level error analysis revealed 9 false negatives from Ipil, primarily stemming from localized image artifacts, and 3 false positives for Rain Tree and 1 false positive for Tindalo caused by shared tribal-level anatomical traits. These findings demonstrate that non-programmers can leverage the Xylorix platform to construct operationally reliable wood identification models suitable for field deployment at supply chain checkpoints.

摘要中文：

非法采伐和木材贸易持续对菲律宾构成重大挑战，尽管准确的木材树种鉴定对执法至关重要，但目前仍受限于专业设备和专家知识的需求。本研究旨在评估木材科学家能否在无需编程专业知识的情况下，利用Xylorix平台开发和部署宏观木材鉴定人工智能模型，研究聚焦于五种菲律宾阔叶树种：马占相思树（Acacia mangium Willd.）、雨树[Samanea saman (Jacq.) Merr.]、巴努约（Wallaceodendron celebicum Koord.）、廷达洛[Afzelia rhomboidea (Blanco) Vidal]和伊皮尔[Intsia bijuga (Colebr.) O. Kuntze]。二分类器基于260个样本的10,663张经核实的横截面图像进行训练，并采用样本级平均评分进行评估，以模拟实际操作现场条件。ROC曲线下面积（AUC）值范围从0.969（伊皮尔）到1.000（马占相思树），平均精度（AP）值范围从0.589（雨树）到1.000（马占相思树）。五个树种中有四个达到AA级（AUC和AP均≥0.90）；雨树获得AE级（AUC≥0.90，AP<0.60），这是由于其阳性测试集样本量较小（仅3个样本）导致AP被压缩。全部五个分类器均以近乎完美的保真度将目标样本排在非目标样本之前。样本级错误分析显示伊皮尔出现9个假阴性，主要源于局部图像伪影；雨树出现3个假阳性，廷达洛出现1个假阳性，均由共享的族级解剖特征引起。这些研究结果表明，非程序员木材科学家可以利用Xylorix平台构建适用于供应链检查站现场部署的操作可靠木材鉴定模型。
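
Specimen-level mean scoring, as used in the evaluation above, averages the per-image scores of each specimen before any ranking metric is computed. The sketch below pairs that with a plain pairwise-ranking AUC. The specimen IDs, scores, and labels are made up, and the function names are hypothetical; this is not the Xylorix platform's evaluation code.

```python
from collections import defaultdict


def specimen_mean_scores(image_scores):
    """Average per-image classifier scores into one score per specimen."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for specimen_id, score in image_scores:
        sums[specimen_id] += score
        counts[specimen_id] += 1
    return {sid: sums[sid] / counts[sid] for sid in sums}


def auc(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up image scores: s1 is the target species, s2 and s3 are not.
image_scores = [
    ("s1", 0.9), ("s1", 0.7),
    ("s2", 0.4), ("s2", 0.2),
    ("s3", 0.6), ("s3", 0.6),
]
targets = {"s1"}

means = specimen_mean_scores(image_scores)
positives = [score for sid, score in means.items() if sid in targets]
negatives = [score for sid, score in means.items() if sid not in targets]
specimen_auc = auc(positives, negatives)
```

Averaging first means a single poor image, for example one with a local artifact, is less likely to flip a specimen's ranking than it would be under per-image evaluation.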

visual_tokenizer_1d

1D Visual Tokenizer

0 篇论文

今日未找到该分类的匹配论文。

diffusion_visual_encoder

Diffusion Visual Encoder

0 篇论文

今日未找到该分类的匹配论文。