Daily arXiv Paper Digest

2026-06-10 · 77 papers · grouped by research area
Automated tracking · LLM overview · Research radar
77 Total Papers
12 Autoregressive
61 Diffusion
4 Image Compression
0 1D Visual Tokenizer
0 Diffusion Visual Encoder
Daily Radar
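Daily Overview

Today's arXiv papers show three broad trends: the dominance of diffusion models, deeper multimodal fusion, and a surge in video generation. Diffusion accounts for 61 of the 77 papers and spans robot manipulation, video generation, medical imaging, anomaly detection, and other settings. The 12 autoregressive papers concentrate on narrower areas such as video generation, neuro-symbolic reasoning, and asset compression, and the 4 image-compression papers focus on efficient ViT architecture design. The two generative paradigms are also starting to converge: TBD-VLA combines temporal block diffusion with an autoregressive policy, and Unifying World Models and Diffusion Policy explore hierarchical robot-control frameworks. World models are emerging as the core abstraction linking visual understanding, action prediction, and planning, and EgoTactile, Latent Diffusion Policy, and related work further close the loop between tactile feedback and physical interaction.

  • OmniGen-AR: AutoRegressive Any-to-Image Generation - achieves autoregressive image generation under arbitrary input conditions and unifies the multimodal control paradigm
  • TBD-VLA: Temporal Block Diffusion Vision Language Action Model - fuses temporal block diffusion with autoregressive modeling, offering a new paradigm for long-horizon robot tasks
  • PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies - presented as the first self-evolving physical safety alignment framework for diffusion policies, aimed at environments shared by humans and robots
  • Neuro-Symbolic Injection of LTLf Constraints in Autoregressive RL Policies - injects formal LTLf constraints into autoregressive reinforcement learning policies, balancing interpretability and sample efficiency
  • What Makes Video Foundation Model Latents Action-Relevant - analyzes in depth how action-relevant the latent representations of video models are, giving a theoretical anchor for building world models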
Autoregressive
12 papers

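Today's Autoregressive paper overview:

Today's autoregressive papers are diverse and cluster around three directions: visual content generation and understanding, efficient model compression, and cross-modal applications. Video generation is prominent, covering long video generation, real-time streaming video generation, and high-resolution video processing, which reflects continued progress for autoregressive models in video. Model efficiency is another important trend: HACK++ and related work target key-value compression for autoregressive visual models, offering new options for edge deployment. In addition, combining neuro-symbolic methods with reinforcement learning (for example, LTLf constraint injection) shows the potential of autoregressive policies in complex decision-making tasks. Overall, the autoregressive paradigm is expanding from image generation to video, point clouds, action control, and other multimodal settings, while seeking a better balance between efficiency and quality.

Recommended papers:

  • Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions - presented as the first real-time streaming video generation at high resolution, addressing both the speed and the quality bottlenecks of autoregressive video generation.
  • OmniGen-AR: AutoRegressive Any-to-Image Generation - proposes a general autoregressive framework for generating images from arbitrary conditions, representing a new paradigm for autoregressive image generation.
  • DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation - introduces dynamic frame sinks to address temporal consistency in long video generation.
  • HACK++: Towards More Effective Head-Aware Key-Value Compression - proposes head-aware key-value compression that substantially improves the inference efficiency of visual autoregressive models.
  • TBD-VLA: Temporal Block Diffusion Vision Language Action Model - combines diffusion and autoregressive mechanisms, exploring a new route to unified vision-language-action modeling.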

EditSR: Enhancing Neural Symbolic Regression via Edit-based Rectification

2026-06-09T04:00:00 · autoregressive, cs.AI · arXiv:2606.07915

Authors: Da Li, Xinxin Li, Xingyu Cui, Jin Xu, Juan Zhang, Junping Yin

Abstract:

Neural symbolic regression models improve inference efficiency by shifting structural search to pretraining, but their one-pass autoregressive decoding is prone to error accumulation, which may lead to generating structurally incorrect expressions, especially in complex expression generation scenarios. Existing rectification strategies can alleviate this issue, but they often depend on restarting global search, thereby weakening the efficiency advantage of neural models, and remain susceptible to error accumulation. In this paper, we propose EditSR, a two-layer framework that combines a neural symbolic regression model in the first layer with an edit-based Rectifier in the second layer to achieve efficient prediction and post-hoc rectification. Instead of restarting the global search, we maintain rectification efficiency by pretraining the Rectifier. Specifically, we formulate the rectification process as a step-by-step state-transition chain starting from an incorrect expression, and develop a state-transition algorithm to construct supervised rectification chains for training the Rectifier. To ensure syntactic validity throughout rectification, each edit action is restricted to a syntactically valid space so that every edited expression remains parseable. In addition, because each edit decision is conditioned on the current state rather than the history, the Rectifier allows errors made in earlier steps to be rectified by subsequent edits, thereby reducing the risk of error accumulation. Extensive experiments and ablation studies show that EditSR substantially improves symbolic structure recovery with limited extra cost, with more pronounced gains on complex expressions, where one-pass autoregressive decoding is more susceptible to error accumulation.

Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

2026-06-09T04:00:00 · autoregressive, cs.AI, cs.FL · arXiv:2606.08312

Authors: Ashkan Ansarifard (Sapienza University of Rome), Matteo Mancanelli (Sapienza University of Rome), Elena Umili (Sapienza University of Rome), Fabio Patrizi (Sapienza University of Rome)

Abstract:

In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to address RL as a sequence modeling problem. However, these methods optimize purely for reward and do not account for high-level temporal requirements. Here, we introduce a neurosymbolic framework that injects LTLf background knowledge into such transformer-based RL policies. Our approach compiles LTLf formulas into deterministic finite automata (DFAs) and integrates them into the learning process through a differentiable representation and a logic-based loss function. In particular, we derive differentiable satisfaction signals from DFA progression and use them as a regularization term during training. The resulting method is architecture-agnostic across different models. We evaluate the proposed framework on navigation environments with specification suites covering combinations of safety and reachability temporal properties. Experimental results show that incorporating background knowledge not only improves constraint satisfaction, but also maintains competitive return compared to vanilla baselines.
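
As a rough illustration of the DFA-progression idea described in this abstract (not the authors' implementation), the sketch below hand-writes a three-state DFA for a hypothetical property, "avoid `hazard` until `goal` is reached", and propagates a probability distribution over DFA states. The symbol names, the transition table, the assumption of independent per-step symbol probabilities, and the `1 - satisfaction` loss are all assumptions made for this example. A real system would compile the automaton from the LTLf formula and run the same recurrence on autodiff tensors.

```python
# Hypothetical DFA for "avoid `hazard` until `goal` is reached".
# State 0: still searching, state 1: goal reached (accepting, absorbing),
# state 2: hazard hit before the goal (rejecting, absorbing).
SYMBOLS = ("empty", "goal", "hazard")
ACCEPTING = {1}
DELTA = {
    (0, "empty"): 0, (0, "goal"): 1, (0, "hazard"): 2,
    (1, "empty"): 1, (1, "goal"): 1, (1, "hazard"): 1,
    (2, "empty"): 2, (2, "goal"): 2, (2, "hazard"): 2,
}


def accepts(trace):
    """Hard DFA progression over a finite trace of symbols."""
    state = 0
    for symbol in trace:
        state = DELTA[(state, symbol)]
    return state in ACCEPTING


def soft_satisfaction(symbol_probs):
    """Soft progression: push a distribution over DFA states through the trace.

    symbol_probs is a list with one dict per time step, mapping each symbol
    to its predicted probability (assumed independent across steps).
    Returns the probability mass that ends in an accepting state.
    """
    dist = {0: 1.0, 1: 0.0, 2: 0.0}
    for probs in symbol_probs:
        nxt = {0: 0.0, 1: 0.0, 2: 0.0}
        for state, mass in dist.items():
            for symbol in SYMBOLS:
                nxt[DELTA[(state, symbol)]] += mass * probs.get(symbol, 0.0)
        dist = nxt
    return sum(dist[q] for q in ACCEPTING)


def logic_loss(symbol_probs):
    """Assumed regularizer: penalize the probability of violating the property."""
    return 1.0 - soft_satisfaction(symbol_probs)
```

With one-hot symbol probabilities the soft value coincides with the hard accept/reject decision, which makes a convenient sanity check before adding such a term to a reward-driven training loss.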

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

2026-06-09T04:00:00 · autoregressive, cs.AI, cs.CV · arXiv:2606.07639

Authors: Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

Abstract:

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.

Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman

2026-06-09T04:00:00 · autoregressive, cs.AI, cs.LG, q-fin.PM · arXiv:2606.09104

Authors: Daniil Mikriukov (University of Liverpool, Xi'an Jiaotong-Liverpool University), Ruoyu Sun (Xi'an Jiaotong-Liverpool University), Angelos Stefanidis (Xi'an Jiaotong-Liverpool University), Jionglong Su (Xi'an Jiaotong-Liverpool University), Zhengyong Jiang (Xi'an Jiaotong-Liverpool University)

Abstract:

Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dynamically from market data. However, these models fail to account for fat-tailed returns, which characterize actual market behavior with more frequent extreme events. Furthermore, historical data is treated homogeneously, without accounting for temporal importance, leading models to fail during regime changes. We propose a new BAVAR-BLED algorithm that combines methods derived from Bayesian-Averaging Vector Autoregressive (BAVAR) and the Black-Litterman model using Elliptical Distributions (BLED) within a TD3 architecture. BAVAR captures a set of vector autoregressive representations that consider multi-scale temporal features, enabling adaptive allocation decisions based on regime-aware estimates of return expectations and dispersion matrices. These estimates serve as prior inputs to BLED, a model that uses Student's t-distributions, allowing for more realistic fat tail return estimates. The BAVAR-BLED algorithm uses transformer networks for view construction and CNNs for risk-aversion estimates, which modify dynamic allocation decisions based on market conditions. An evaluation of 29 Dow Jones Industrial Average constituents over a decade-long market period shows that BAVAR-BLED significantly outperforms state-of-the-art methods, achieving Sharpe and Sortino ratios of 1.72 and 2.70, respectively, and total returns of 57.26%.
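
For readers unfamiliar with the two risk-adjusted metrics quoted in this abstract, here is a minimal sketch of how Sharpe and Sortino ratios are commonly computed. The conventions are assumptions: population standard deviation, a zero risk-free rate and target return by default, and no annualization. The abstract does not say which conventions the paper uses, and the return series below is made up for illustration.

```python
import math


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the population std of excess returns.

    Raises ZeroDivisionError if the returns have no variance.
    """
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((x - mean) ** 2 for x in excess) / len(excess)
    return mean / math.sqrt(variance)


def sortino_ratio(returns, target=0.0):
    """Mean return above `target` divided by the downside deviation.

    Only returns below the target contribute to the denominator, so upside
    volatility is not penalized. Raises ZeroDivisionError if no return
    falls below the target.
    """
    excess = [r - target for r in returns]
    mean = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(0.0, x) ** 2 for x in excess) / len(excess))
    return mean / downside


# Made-up per-period returns, for illustration only.
example_returns = [0.02, -0.01, 0.03, 0.00]
```

Because the Sortino denominator ignores gains, it is typically at least as large as the Sharpe ratio for the same series, which matches the ordering of the two figures reported in the abstract.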

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

2026-06-09T04:00:00 · autoregressive, cs.AI, cs.CV · arXiv:2605.21028

Authors: Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

Abstract:

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.
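
The retrieval step described in this abstract can be pictured with a small sketch: instead of always keeping the earliest frames as sinks, pick the cached frames whose features are most similar to the current one. Everything here is an assumption for illustration (the feature vectors, cosine similarity as the relevance score, the function name `select_dynamic_sinks`); the sink anomaly gate is not modeled, and the authors' actual retrieval rule may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_dynamic_sinks(memory_bank, current_feature, k):
    """Return the indices of the k cached frames most similar to the current frame.

    memory_bank is a list of (frame_index, feature_vector) pairs. The result
    is sorted by frame index so the selected sinks keep their temporal order.
    """
    ranked = sorted(
        memory_bank,
        key=lambda item: cosine(item[1], current_feature),
        reverse=True,
    )
    return sorted(index for index, _ in ranked[:k])


# Toy memory bank: frame 0 looks nothing like the current frame, so a static
# early-frame sink would be a poor anchor here.
toy_bank = [(0, [1.0, 0.0]), (10, [0.0, 1.0]), (20, [0.6, 0.8]), (30, [0.1, 0.9])]
```

In this toy bank a static policy would keep frame 0, whereas similarity-based selection with `k=2` keeps frames 10 and 30.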

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

2026-06-09T04:00:00 · autoregressive, cs.CV, cs.RO, diffusion · arXiv:2606.07895

Authors: Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo

Abstract:

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
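
The control flow this abstract describes (autoregressive across temporal blocks, masked parallel decoding inside each block) can be sketched as below. This is a toy under stated assumptions, not the TBD-VLA implementation: `toy_denoiser` is a hypothetical deterministic stand-in for a learned model, and the confidence-ranked unmasking schedule is one common choice for masked discrete diffusion that the abstract does not confirm.

```python
MASK = -1  # placeholder id for a not-yet-decoded action token


def toy_denoiser(context, block):
    """Hypothetical stand-in for a learned model.

    For every masked slot it returns a (token, confidence) guess. A real
    model would condition on observations, language, and prior action blocks.
    """
    predictions = {}
    for i, token in enumerate(block):
        if token == MASK:
            predictions[i] = ((len(context) + i) % 7, 1.0 / (1 + i))
    return predictions


def generate(num_blocks, block_size, steps_per_block, denoiser=toy_denoiser):
    """Decode blocks left to right; fill each block in a few parallel steps."""
    sequence = []
    per_step = -(-block_size // steps_per_block)  # ceiling division
    for _ in range(num_blocks):  # autoregressive across blocks
        block = [MASK] * block_size
        while MASK in block:  # masked diffusion within the block
            predictions = denoiser(sequence, block)
            # Commit the most confident slots in parallel, keep the rest masked.
            confident = sorted(
                predictions, key=lambda i: predictions[i][1], reverse=True
            )[:per_step]
            for i in confident:
                block[i] = predictions[i][0]
        sequence.extend(block)  # finished block becomes context for the next
    return sequence
```

With `generate(2, 4, 2)` the toy model is called four times to produce eight tokens, whereas token-by-token decoding would need eight calls. This is the latency argument the abstract makes for parallel decoding within a block.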

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

2026-06-09T04:00:00 · autoregressive, cs.CV, image_compression · arXiv:2606.08302

Authors: Ziran Qin, Yuchen Jiang, Mingbao Lin, Youru Lv, Hang Guo, Wen Fei, Weiyao Lin

Abstract:

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale paradigm. We begin with an in-depth analysis of VAR attention and observe that attention heads can be stably divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads preserve spatial coherence. Their functional divergence makes existing one-size-fits-all compression methods perform poorly on VAR models. We further find that the two head types differ markedly in their reliance on historical scales, and that this reliance shifts across layers and generation steps, arguing for an adaptive cache budget allocation. To address these challenges, we propose HACK++, a training-free Head-Aware key-value Compression frameworK for VAR models. From a one-time offline calibration, HACK++ classifies head types and derives head-specific priors. At inference, it decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively, via pattern-specific strategies and a reliance-aware budget allocation. Extensive experiments on multiple VAR models across text-to-image, class-conditional, and unified understanding-and-generation tasks validate the effectiveness and generalizability of HACK++. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

2026-06-09T04:00:00 · autoregressive, cs.CV, diffusion · arXiv:2606.08833

Authors: Malgorzata Galinska, Bart Pogodzinski, Jan Eric Lenssen

Abstract:

We introduce Contrast Sensitive Flow (CSFlow), a weighting scheme that connects the human eye's Contrast Sensitivity Function (CSF) to the iterative denoising steps of flow matching. Because real-world images concentrate signal at low spatial frequencies, these components reach high signal-to-noise ratio earlier during continuous diffusion than high-frequency components. When generating images with diffusion or flow matching models, this induces a soft autoregressive structure in Fourier space, where coarse image content stabilizes before fine detail. Meanwhile, the human visual system is unequally sensitive to spatial frequencies: very low and very high frequencies require significantly higher contrast to be perceived. We for the first time merge these observations through two contributions: (1) a metric that estimates which frequencies are generated at each reverse flow interval and (2) timestep weights obtained by aligning the frequencies generated at each noise level with human contrast sensitivity. We validate our contributions experimentally showing that these weights can improve generative performance by lowering FID by 4.7%, increasing Inception Score by 2.2% and improving GenEval scores by 2.5% using inference-only timestep modification or short fine-tuning. Qualitatively, we find that our CSFlow weights lead to better visual realism and less cartoonish appearance of generated images.

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

2026-06-09T04:00:00 · autoregressive, cs.CV, diffusion · arXiv:2606.09150

Authors: Luxury, Jie Huang, Zihao Fan, Xiaoxiao Ma, Yuming Li, Jun-hao Zhuang, Zeyue Xue, Siming Fu, Haoran Li, Mingchen Zhong, Guohui Zhang, Shichen Ma, Yijun Liu, Jiaqi Shi, Yanwen Ma, Yaofeng Su, Haoyu Wang, Yaowei Li, Songchun Zhang, Weiyang Jin, Yuxuan Bian, Shiyi Zhang, Haojun Xu, Shuai Lu, Xin Han, Wei Tang, Haoyang Huang, Nan Duan

Abstract:

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency.

OmniGen-AR: AutoRegressive Any-to-Image Generation

2026-06-09T04:00:00 · autoregressive, cs.CV · arXiv:2606.09156

Authors: Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang

Abstract:

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.

PACE: Post-Causal Entropy Modeling for Learned LiDAR Point Cloud Compression

2026-06-09T04:00:00 · autoregressive, cs.CV · arXiv:2605.01320

Authors: Jiahao Zhu, Kang You, Dandan Ding, Zhan Ma

Abstract:

LiDAR point cloud compression is vital for autonomous systems to handle massive data from high-resolution sensors. While learned entropy modeling built upon octree structures yields high compression gains, it faces two critical bottlenecks: 1) prohibitive latency, particularly during decoding, caused by causal, multi-stage context modeling; and 2) a rigid performance-latency trade-off, preventing a single model from adapting to varying constraints. These limitations stem from the tight coupling between the context aggregation backbone and probability prediction. To address this, we propose PACE, a new framework that reformulates ancestral context aggregation as a non-causal backbone and confines causality to a lightweight, stage-scalable predictor, eliminating repetitive backbone executions and reducing computational overhead. The predictor supports an arbitrary number of prediction stages, enabling seamless adaptation across diverse performance-latency trade-offs without reloading parameters. Experiments demonstrate that PACE sets a new state-of-the-art in compression efficiency, achieving notable BD-BR savings and reducing decoding latency by over 90% in autoregressive mode, making it attractive for practical applications.

ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

2026-06-09T04:00:00 · autoregressive, cs.CV, cs.GR · arXiv:2605.27852

Authors: Yu Zhang, Yidi Shao, Wenqi Ouyang, Yushi Lan, Zhexin Liang, Chengrui Wu, Xudong Xu, Xingang Pan

Abstract:

Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios -- body-driven garments, robotic manipulation, and free-fall collisions -- under a single model and achieves approximately 4-9× lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ~493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts. Project Page: https://yucrazing.github.io/clothtransformer/

Diffusion
61 papers

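Daily overview of Diffusion model papers, 2026-06-10

Today's Diffusion papers reach deep into many fields, covering robot control, medical imaging, video generation, 3D reconstruction, theoretical analysis, and more. Highlights: 1) diffusion models and world models are increasingly intertwined, with several studies exploring action-space modeling and memory mechanisms (MemoryVLA++, Echo-Memory, Latent Spatial Memory); 2) video generation and editing remain active, with attention to temporal consistency (Beyond Consistency, MilliVid) and real-time generation (Ultra Flash, SwiftVR); 3) training efficiency has become an important topic, with interest in methods such as Token-Subset Representation Alignment and Self-Conditioning; 4) unified-framework analysis of Flow Matching and Diffusion Bridge shows that theoretical work is still deepening.

  • IDEQ -- Improving Diffusion Models for the Traveling Salesman Problem - improves diffusion models for TSP by exploiting the structure of the solution space, offering a new angle on combinatorial optimization.
  • Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation - combines diffusion policies with latent-space shaping to improve generalization in robot manipulation.
  • MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models - introduces memory and imagination mechanisms to strengthen temporal modeling in VLA models, an important step for diffusion combined with world models.
  • WaveDiT: Distribution-Aware Wavelet Flow Matching for Efficient 3D Brain MRI Synthesis - performs flow matching in the wavelet domain for efficient 3D brain MRI synthesis, with clear value for medical applications.
  • Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis - gives a unified analysis and comparison of Diffusion Bridge and Flow Matching, advancing theoretical understanding.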

Leveraging Structural Constraints for Diffusion-based Neural TSP Solvers

2026-06-09T04:00:00 · cs.AI, diffusion · arXiv:2606.09343

Authors: Mickaël Basson (CRIStAL, Scool), Philippe Preux (CRIStAL, Scool)

Abstract:

Neural combinatorial optimization has recently achieved strong results on the Euclidean Traveling Salesman Problem (TSP) using generative models such as diffusion and consistency models. State-of-the-art approaches like FT2T combine fast consistency-based prediction with gradient-based inference time refinement. However, gradient search often incurs significant computational overhead and may not align with the discrete structure of feasible solutions. We introduce Projected Consistency Inference (PCI), a plug-and-play, retraining-free alternative that replaces gradient refinement with structure-aware projections: PCI decodes valid Hamiltonian tours from the consistency model output and applies a lightweight local search (e.g., 2-opt). PCI achieves an average optimality gap (OG) of 0.17% on TSP with 500 cities, and 0.31% on TSP with 1000 cities, outperforming FT2T best settings (OG 0.22% and 0.36%, respectively) while reducing the inference time by up to 30 to 40%. PCI also exhibits lower variance and memory usage, and can surpass classical heuristics such as LKH3 in rapid solution generation. Our results demonstrate that structure-aware inference time operations provide a practical and principled path for neural TSP solvers, complementing training time objectives.
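
The "decode a valid tour, then apply 2-opt" pipeline named in this abstract can be sketched in a few lines. This is an illustration under assumptions, not PCI itself: the edge-score matrix `heatmap` stands in for a model output, greedy nearest-score decoding is an assumed way of producing a Hamiltonian tour, and the 2-opt loop is the textbook segment-reversal version.

```python
import math


def greedy_tour(heatmap):
    """Decode a Hamiltonian tour by always following the highest-scoring unvisited edge."""
    n = len(heatmap)
    tour, visited = [0], {0}
    while len(tour) < n:
        current = tour[-1]
        nxt = max(
            (j for j in range(n) if j not in visited),
            key=lambda j: heatmap[current][j],
        )
        tour.append(nxt)
        visited.add(nxt)
    return tour


def tour_length(tour, points):
    """Euclidean length of the closed tour."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def two_opt(tour, points):
    """Reverse segments while doing so shortens the tour (plain 2-opt)."""
    tour = tour[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, points) < tour_length(tour, points) - 1e-12:
                    tour, improved = candidate, True
    return tour


# Unit square with a made-up heatmap whose greedy decoding crosses itself.
square = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
toy_heatmap = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.8, 0.1],
    [0.1, 0.8, 0.0, 0.7],
    [0.2, 0.1, 0.7, 0.0],
]
```

On this toy input the greedy tour uses both diagonals of the square, and one 2-opt reversal removes the crossing and leaves the perimeter tour. Both steps operate on valid tours throughout, which is the contrast the abstract draws with gradient-based refinement.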

摘要中文:

神经组合优化近年来在使用扩散模型和一致性模型等生成模型解决欧几里得旅行商问题(TSP)方面取得了显著成果。FT2T等先进方法将快速一致性预测与基于梯度的推理时优化相结合。然而,梯度搜索通常会产生大量计算开销,且可能与可行解的离散结构不兼容。我们提出了投影一致性推理(Projected Consistency Inference, PCI),这是一种即插即用、无需重新训练的替代方案,用结构感知的投影替换梯度优化:PCI从一致性模型输出中解码有效的哈密顿回路,并应用轻量级局部搜索(如2-opt)。PCI在500城市TSP上实现了0.17%的平均最优性间隙(OG),在1000城市TSP上实现0.31%,优于FT2T最佳设置(OG分别为0.22%和0.36%),同时将推理时间最多减少30%至40%。PCI还表现出更低的方差和内存消耗,并能在快速解生成中超越LKH3等经典启发式算法。我们的结果表明,结构感知的推理时操作为神经TSP求解器提供了一条实用且有原则的路径,可作为训练时目标的有效补充。
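The abstract names 2-opt as the lightweight local search applied after decoding a tour. The block below is a minimal, textbook 2-opt sketch in Python, not the authors' PCI implementation; the function names `tour_length` and `two_opt` are illustrative assumptions, and the decoding step from consistency-model output is not shown.

```python
import math

def tour_length(points, tour):
    """Total Euclidean length of a closed tour (it returns to the start city)."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))

def two_opt(points, tour):
    """Plain 2-opt: reverse a segment whenever that shortens the tour, until no move helps."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share city tour[0]; nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Replace edges (a,b) and (c,d) with (a,c) and (b,d).
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    tour[i + 1 : j + 1] = reversed(tour[i + 1 : j + 1])
                    improved = True
    return tour

# Usage: a self-crossing tour on the unit square gets uncrossed.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
fixed = two_opt(square, [0, 2, 1, 3])
```

Each accepted move strictly shortens the tour, so the loop terminates; this simple version rescans all pairs and is meant for illustration rather than for 500 or 1000 cities.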

DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

2026-06-09T04:00:00cs.AI, cs.CV, cs.LG, diffusion2606.07599

中文标题:DiffoR:通用有序回归的统一连续生成框架

作者:Hongxu Ma, Lin Wang, Chenghou Jin, Han Zhou, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

摘要:

Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data. In this paper, we propose a novel paradigm where OR is formulated as a Continuous Generative Ordinal Regression task. Under the novel paradigm, we introduce DiffOR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: Spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; Temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffOR's consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.

摘要中文:

有序回归(Ordinal Regression,OR)旨在预测具有内在顺序的目标值,支撑着从推荐系统到计算机视觉等众多领域的关键应用。尽管已从朴素回归演进到基于离散化的分类和生成方法,但现有范式仍从根本上受限于量化伪影和全局有序拓扑感知缺失。这些方法通常强制执行刚性边界划分,无法捕捉有序数据固有的非平稳语义过渡。本论文提出一种新范式,将有序回归形式化为连续生成有序回归任务。在这一新范式下,我们引入DiffoR,这是一个统一的框架,利用扩散模型通过迭代去噪恢复连续有序值,从而能够动态学习软语义过渡。为显式保持有序拓扑,我们设计了双重解耦策略:在空间上,多尺度增量聚合将目标分解为层级连续增量;在时间上,动态去噪感知将去噪步骤与特征频率同步,确保稳健的由粗到细的精炼。理论上,我们证明所提出的方法能显著增强表征能力和机制可解释性。在四个领域的12个基准数据集上进行的广泛实验验证了DiffoR相对于最先进方法的一致优越性,树立了新标准,并展现出作为通用有序回归的通用解决方案的强大潜力。

LFNO: Bridging Laplace and Fourier via Transient-Steady Decomposition

2026-06-09T04:00:00cs.AI, cs.LG, diffusion2606.07601

中文标题:LFNO:通过瞬态-稳态分解构建拉普拉斯与傅里叶的桥梁

作者:Jeongun Ha, Sanga Yoon, Donghun Lee

摘要:

We introduce the Laplace-Fourier Neural Operator (LFNO), a unified framework for modeling dynamical systems across transient and steady-state regimes by integrating the spectral advantages of Laplace and Fourier Neural Operators. LFNO employs a dual-branch architecture that explicitly decomposes system dynamics into transient and steady-state components. We evaluate LFNO on nine benchmarks, including three ODE systems (Duffing, Lorenz, and Pendulum) and six PDE systems (Euler-Bernoulli beam, Heat, Reaction-diffusion, Brusselator, Burgers, and Navier-Stokes). LFNO significantly outperforms existing operators on ODE systems, where transient dynamics dominate, and consistently surpasses LNO while achieving performance competitive with FNO on PDE benchmarks. Furthermore, LFNO offers improved stability and physical interpretability through its component-wise decomposition. These results demonstrate that LFNO provides a robust and unified approach for learning complex dynamical systems across multiple temporal scales.

摘要中文:

本文提出了拉普拉斯-傅里叶神经算子(LFNO),该框架通过整合拉普拉斯神经算子和傅里叶神经算子的谱优势,建立了一个统一的动态系统建模框架,能够同时处理瞬态和稳态两种状态。LFNO采用双分支架构,将系统动力学显式分解为瞬态和稳态分量。本文在九个基准问题上评估了LFNO的性能,包括三个常微分方程系统(Duffing系统、Lorenz系统和单摆系统)以及六个偏微分方程系统(Euler-Bernoulli梁、热传导方程、反应-扩散方程、Brusselator方程、Burgers方程和Navier-Stokes方程)。实验结果表明,LFNO在瞬态动力学占主导地位的常微分方程系统上显著优于现有神经算子,并在偏微分方程基准上持续超越LNO,同时达到与FNO相当的性能。此外,LFNO通过分量分解提供了更好的稳定性和物理可解释性。这些结果证明了LFNO能够为跨多时间尺度的复杂动态系统学习提供一种稳健且统一的处理方法。
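To make the transient/steady-state split concrete, here is a worked textbook example: a first-order linear system with sinusoidal forcing. It illustrates the kind of decomposition the abstract describes and is not taken from the paper; the symbols a, ω and x_0 are assumptions introduced here.

```latex
\begin{aligned}
\dot{x}(t) &= -a\,x(t) + \cos(\omega t), \qquad x(0) = x_0,\quad a > 0,\\[4pt]
x(t) &= \underbrace{\Bigl(x_0 - \tfrac{a}{a^2+\omega^2}\Bigr)\,e^{-a t}}_{\text{transient (decays, pole at } s=-a\text{)}}
      \;+\; \underbrace{\frac{a\cos(\omega t) + \omega\sin(\omega t)}{a^2+\omega^2}}_{\text{steady state (pure frequency } \omega\text{)}}.
\end{aligned}
```

The decaying exponential is naturally described by a Laplace-domain pole, while the persistent oscillation is a single Fourier mode, which is the intuition behind pairing the two operator families.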

NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis

2026-06-09T04:00:00cs.AI, cs.CV, diffusion2606.07635

中文标题:NeuroAlign:用于MCI分析的动态与结构神经影像层次化多模态融合

作者:Xiongri Shen, Zhenxi Song, Jiaqi Wang, Yi Zhong, Leilei Zhao, Chenqi Xu, Linling Li, Yichen Wei, Lingyan Liang, Demao Deng, Luping Song, Ping Luan, Ahmed M. Anter, Shuqiang Wang, Baiying Lei, Zhiguo Zhang

摘要:

Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive impairment analysis, but remains challenged by heterogeneous feature spaces and misaligned representations. We propose NeuroAlign, a hierarchical framework for structured multimodal fusion. It introduces (1) Dual-Modal Hierarchical Alignment (DMHA), which models multi-scale dynamic connectivity and aligns dynamic-static and functional-structural embeddings; and (2) Dual-Domain Hierarchical Interaction (DDHI), which enables fine-grained modulation and global interaction between connectivity- and region-level features. To support feature-level inspection, we design Synergistic Activation Mapping (SAM), a gradient-free, marker-oriented attribution method for DFC, SFC, ALFF, and FA. Evaluated on GUTCM, ADNI, and OASIS under five-fold validation, NeuroAlign achieves competitive MCI/SCD detection and preliminary cross-dataset transferability. Attribution analyses reveal modality-specific and partially consistent brain patterns, providing model-derived evidence for multimodal representation analysis.

摘要中文:

功能磁共振成像(fMRI)与扩散张量成像(DTI)的多模态神经影像融合为认知障碍分析提供了互补信息,但仍面临异构特征空间和表示未对齐的挑战。我们提出了NeuroAlign,一个用于结构化多模态融合的层次化框架。该框架引入(1)双模态层次对齐(DMHA),用于建模多尺度动态连接并对齐动态-静态以及功能-结构嵌入;以及(2)双域层次交互(DDHI),实现连接级和区域级特征之间的细粒度调制和全局交互。为支持特征级检查,我们设计了协同激活映射(SAM),这是一种无梯度、面向标志物的归因方法,适用于DFC、SFC、ALFF和FA。在GUTCM、ADNI和OASIS数据集上采用五折交叉验证进行评估,NeuroAlign在MCI/SCD检测中取得了有竞争力的性能,并展现了初步的跨数据集迁移能力。归因分析揭示了模态特异性和部分一致的大脑模式,为多模态表示分析提供了源自模型的证据。

Anchor-Conditioned Compositional Control for Landscape Image Generation

2026-06-09T04:00:00cs.AI, cs.CV, diffusion2606.07638

中文标题:面向风景图像生成的锚点条件化构图控制

作者:Gadha Lekshmi P, Govind Arun, Rohith Syam, Ahmed Elgammal

摘要:

Image generative models, though widely used as creative tools, offer limited support for the kind of compositional control that photographers and visual artists routinely exercise. This paper presents early results on an anchor conditioned finetuning framework for landscape image generation, in which a four dimensional compositional anchor vector is extracted from training images and injected into a diffusion model via a decoupled cross attention mechanism with Fourier encoding and three way classifier free guidance dropout. Quantitative evaluation against a baseline and three ablation variants shows that the proposed architecture achieves the highest horizon detection rate of 0.850 and the highest rule of thirds alignment of 0.817. A category specific ablation further demonstrates that training on compositionally homogeneous scene subsets reduces horizon deviation by up to 40 percent compared to mixed training. This establishes that compositional control precision is category dependent.

摘要中文:

图像生成模型虽被广泛用作创意工具,但对摄影师和视觉艺术家日常使用的构图控制支持有限。本文报告了一种面向风景图像生成的锚点条件化微调框架的初步成果,该框架从训练图像中提取四维构图锚点向量,并通过带有傅里叶编码和三路无分类器引导dropout的解耦交叉注意力机制将其注入扩散模型。相较于基线和三个消融变体的定量评估表明,所提架构实现了最高的地平线检测率(0.850)和最高的三分法对齐度(0.817)。按类别进行的消融进一步表明,与混合训练相比,在构图同质的场景子集上进行训练可将地平线偏差降低多达40%。这表明构图控制精度具有类别依赖性。
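The abstract mentions Fourier encoding of a four-dimensional anchor vector before it is injected through cross attention. The sketch below shows one common form of Fourier feature encoding in plain Python; the band count, the power-of-two frequency schedule, and the example anchor values are assumptions, since the abstract does not specify them.

```python
import math

def fourier_encode(anchor, num_bands=4):
    """Map each scalar x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..num_bands-1."""
    features = []
    for x in anchor:
        for k in range(num_bands):
            angle = (2.0 ** k) * math.pi * x
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features

# Hypothetical anchor: four normalized composition values in [0, 1].
anchor = [0.5, 0.3, 0.1, 0.9]
encoded = fourier_encode(anchor)  # 4 values * 4 bands * 2 = 32 features
```

Spreading each scalar across several frequencies lets a downstream attention layer distinguish small changes in the anchor that a raw four-number input would make hard to resolve.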

What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction

2026-06-09T04:00:00cs.AI, cs.CV, diffusion2606.07687

中文标题:视频世界模型潜在表示为何与动作相关:预测优于重建

作者:Jewon Yeom, Hanseul Kim, Jeongjae Park, Sungmok Jung, Jaejin Lee, Taesup Kim

摘要:

Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals induce action-relevant structure in their latent spaces. We study this question through a unified probe-based evaluation across diverse encoder families, including image-only self-supervision, video pretraining with and without latent prediction, reconstruction-based autoencoders, diffusion models, and shortcut-forcing dynamics models. Using a common inverse-dynamics probing objective, we find that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity: models with strong pixel decoding quality can exhibit near-zero action recoverability, while video-pretrained self-supervised encoders consistently achieve the best Pareto trade-off between visual fidelity and action prediction. Comparing V-JEPA and VideoMAE further shows that most gains arise from natural-video temporal context, with feature-level latent prediction providing a smaller additional benefit. These trends transfer across robotic benchmarks, though CALVIN reveals that static-environment tasks can partially mask the importance of temporal structure by allowing strong image priors to suffice. Finally, inverse-dynamics supervision substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry beyond clean-setting performance. Our results identify temporal predictive structure -- not reconstruction fidelity -- as the primary ingredient underlying action-relevant video representations.

摘要中文:

视频世界模型越来越多地被用于提供预测性视觉表示,但目前尚不清楚哪些预训练信号会在其潜在空间中诱导出与动作相关的结构。我们通过统一的基于探针的评估来研究这一问题,评估涵盖了多种编码器家族,包括纯图像自监督学习、带和不带潜在预测的视频预训练、基于重建的自编码器、扩散模型和捷径强制动力学模型。使用统一的逆动力学探针目标,我们发现与动作相关的结构主要由时序视频预训练驱动,而非像素重建保真度:像素解码质量较强的模型可能表现出接近零的动作可恢复性,而视频预训练的自监督编码器在视觉保真度和动作预测之间始终实现最佳的帕累托权衡。对V-JEPA和VideoMAE的比较进一步表明,大部分收益来自自然视频的时间上下文,特征级潜在预测只带来较小的额外收益。这些趋势在各机器人基准上同样成立,不过CALVIN的结果显示,在静态环境任务中,强图像先验本身已足以完成任务,从而可能部分掩盖时间结构的重要性。最后,逆动力学监督显著提高了对视觉退化的鲁棒性,表明动作感知目标对潜在几何结构的正则化作用不限于干净环境下的性能。我们的结果表明,时间预测结构(而非重建保真度)是与动作相关的视频表示的主要成分。

PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation

2026-06-09T04:00:00cs.AI, cs.RO, diffusion2606.08414

中文标题:PACT:扩散策略在具身操作中的自演进物理安全对齐

作者:Lingxuan Wu, Zijian Zhu, Lizhong Wang, Chengyang Ying, Huayu Chen, Xiao Yang, Fangming Liu, Jun Zhu

摘要:

Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints required for safe deployment. Existing approaches impose safety either prematurely during training or reactively via external guardrails at test time, limiting policy expressivity and overall scalability. We propose Physical safety Alignment for Constrained Trajectories (PACT), a self-evolving post-training framework that projects pretrained diffusion policies onto constraint-feasible regions without accessing demonstration data or task rewards. PACT distills constraint gradients into the diffusion model through a reverse-KL objective with dense supervision across timesteps. It incorporates a curriculum that progressively tightens constraints while maintaining theoretically bounded policy shift and monotone improvement, mitigating the safety-performance trade-off from catastrophic forgetting. On simulated and real-world embodied manipulation benchmarks, PACT significantly reduces safety violations by 31.0% on average while improving task success by 30.7%.

摘要中文:

扩散策略在机器人操作中取得了显著成功,但往往无法满足安全部署所需的严格物理约束。现有的安全方法要么在训练期间过早地施加约束,要么在测试时通过外部防护栏被动应对,这限制了策略的表达能力和整体可扩展性。我们提出了约束轨迹物理安全对齐(PACT),这是一种自演进的后训练框架,无需访问演示数据或任务奖励,即可将预训练的扩散策略投影到约束可行区域。PACT 通过带有跨时间步密集监督的反向 KL 目标,将约束梯度蒸馏到扩散模型中。它引入了一种课程学习方法,在逐步收紧约束的同时,保持理论上有界的策略偏移和单调改进,从而缓解灾难性遗忘导致的安全-性能权衡问题。在模拟和真实世界的具身操作基准测试中,PACT 平均将安全违规降低 31.0%,同时将任务成功率提高 30.7%。

Reconstructing Synthetic SDO/AIA 193 A EUV Images from He I 10830 A Observations with Diffusion Model Translator

2026-06-09T04:00:00astro-ph.SR, cs.AI, cs.CV, diffusion2606.08652

中文标题:利用扩散模型翻译器从He I 10830 Å观测重建合成SDO/AIA 193 Å EUV图像

作者:Marco Marena, Qin Li, Haimin Wang, Haodi Jiang, Prajwal Shah, Bo Shen

摘要:

Routine full-disk EUV imaging has been available only since the modern era, such as SOHO and SDO. To extend EUV coronal context into earlier periods, we leverage the multi-decade availability of full-disk He I observations, whose absorption is modulated by coronal irradiance and magnetic topology and is widely used as a proxy for open-field regions. We present a diffusion-based conditional image translation framework, Coronal Hole-aware Diffusion Model Translator (CH-aware DMT), to reconstruct synthetic SDO/AIA 193 Å EUV images from He I inputs. The model is trained on temporally co-aligned SOLIS He I and AIA 193 Å pairs spanning 2011–2015 using a month-based split, where January–October are used for training, November is used for validation, and December for testing. On the held-out test set, the reconstructions preserve dominant full-disk EUV morphology (CC=0.92) and recover CH-related low-intensity structure (CC=0.84). We further assess historical applicability by (1) comparing reconstructed AIA 193 Å morphology with SOHO/EIT 195 Å over 2005–2015; (2) comparing reconstructed AIA 193 Å images generated from KPVT He I inputs against Yohkoh/SXT soft X-ray observations; and (3) evaluating long-term reconstructed disk-integrated emission statistics against observational EUV series and independent solar activity proxies (sunspot number and F10.7 radio flux over 1974–2015). These results indicate that CH-aware DMT conditioned on He I can provide a physically plausible synthetic AIA 193 Å coronal proxy for historical studies, supporting multi-decade analyses of large-scale coronal evolution before the direct EUV imaging was available.

摘要中文:

常规的全日面EUV成像仅在SOHO、SDO等现代任务出现后才可获得。为了将EUV日冕背景信息扩展到更早的时期,我们利用了已积累数十年的全日面He I观测数据,其吸收受日冕辐照度和磁拓扑调制,并被广泛用作开放场区的代理。我们提出了一种基于扩散的条件图像翻译框架,即日冕洞感知扩散模型翻译器(CH-aware DMT),用于从He I输入重建合成的SDO/AIA 193 Å EUV图像。该模型使用2011-2015年间时间对齐的SOLIS He I和AIA 193 Å配对数据进行训练,采用按月份划分的方式,其中1月至10月用于训练,11月用于验证,12月用于测试。在保留的测试集上,重建结果保持了主导的全日面EUV形态(CC=0.92),并恢复了与日冕洞相关的低强度结构(CC=0.84)。我们进一步评估了历史适用性,具体包括:(1)将重建的AIA 193 Å形态与2005-2015年间的SOHO/EIT 195 Å进行比较;(2)将基于KPVT He I输入生成的重建AIA 193 Å图像与Yohkoh/SXT软X射线观测进行比较;(3)将长期重建的全日面积分发射统计特性与观测EUV序列和独立太阳活动代理(1974-2015年间的太阳黑子数和F10.7射电通量)进行对比。这些结果表明,以He I为条件的CH-aware DMT可以为历史研究提供物理上可信的合成AIA 193 Å日冕代理,支持对直接EUV成像出现之前的大尺度日冕演化进行跨数十年的分析。
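The month-based split described in the abstract (January to October for training, November for validation, December for testing) can be expressed as a small helper. This is a sketch of that rule only; the function name `split_for` and the use of `datetime.date` are assumptions, not the authors' data pipeline.

```python
from datetime import date

def split_for(observation_date):
    """Assign an observation to a subset by calendar month:
    January-October -> train, November -> val, December -> test."""
    month = observation_date.month
    if month <= 10:
        return "train"
    if month == 11:
        return "val"
    return "test"

# Usage: every year in 2011-2015 contributes to all three subsets.
subsets = [split_for(date(2013, m, 15)) for m in range(1, 13)]
```

Splitting by whole months, rather than sampling individual images at random, keeps near-consecutive solar images (which tend to look alike) out of different subsets, which is presumably why the authors chose it.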

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation

2026-06-09T04:00:00cs.AI, cs.RO, diffusion2606.08657

中文标题:潜在扩散策略:为基于扩散的机器人操作塑造潜在空间

作者:Zhexuan Zhou, Yichen Lai, Jinhao Zhang, Huizhe Li, Youmin Gong, Jie Mei

摘要:

Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.

摘要中文:

直接在原始动作空间中运行的基于扩散的视觉运动策略将场景理解与轨迹生成混淆在单一去噪过程中。生成的速度场必须同时编码场景信息并生成精确轨迹,这增加了学习难度,并限制了需要多臂精确时间协调的任务性能。为了简化这一联合学习问题,我们提出了潜在扩散策略(LDP),一个在精心塑造的潜在空间中进行流匹配的两阶段框架。通过将场景理解吸收到观察条件化的CVAE编码器中,LDP集中了每个观察的条件分布。因此,流模型无需隐式解析场景依赖结构;相反,它在预先集中的分布中生成,该分布具有更平滑的速度场,从而简化了从有限演示中学习。此外,为了捕获潜在token之间的时间依赖性,LDP采用每token扩散强制进行训练,并使用阶梯推理采样来解决由此产生的分布不匹配问题。我们还提出了重建FID(rFID)作为轻量级代理,仅根据潜在空间统计数据预测下游任务成功。在RoboTwin 2.0的协调密集型任务上,LDP显著优于DP3,并可有效迁移到真实世界的双臂部署场景。

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

2026-06-09T04:00:00cs.AI, cs.RO, diffusion2606.08775

中文标题:统一以对象为中心的世界模型与扩散策略:面向多阶段机器人任务的分层框架

作者:Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami

摘要:

Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.

摘要中文:

视觉世界模型在学习复杂系统动力学方面展现出巨大潜力。近期研究将这些模型作为转移函数应用于模型预测控制(MPC)框架中,以解决各类控制任务。然而,当应用于机器人领域时,现有方法仅限于到达或抓取等单阶段任务,难以处理需要复杂序列规划的多阶段任务。本研究提出了WorldDP,一个面向多阶段机器人操作的世界模型框架。我们的分层方法将高层世界模型作为转移函数,在运行时优化可行的子目标,随后由低层扩散策略(Diffusion Policy)实现这些子目标。为进一步辅助动力学学习和规划,我们引入了以对象为中心的表示方法,将环境实体解耦,使我们能够针对每个对象依次进行规划。在多个机器人基准测试上的评估表明,WorldDP始终优于现有基线方法,验证了将世界模型基于物理的规划与扩散策略的高效执行相结合,能够实现更优的多阶段性能。

EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

2026-06-09T04:00:00cs.AI, cs.CV, diffusion2606.09243

中文标题:EgoTactile:从自我中心视频学习日常物品的抓取压力

作者:Yuan Zeng, Yujia Shi, Tiao Tan, Xingting Li, Yaqi Qin, Zongqing Lu, Wenming Yang, Jing-Hao Xue, Qingmin Liao

摘要:

Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively infers plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is available at https://egotactile.github.io/.

摘要中文:

从自我中心视频估计全手抓取压力对于沉浸式VR和机器人操作至关重要,然而密集触觉传感通常依赖于侵入性硬件。现有基于视觉的方法主要依赖平面表面或指尖接触,难以推广到复杂的3D物体交互。因此,我们提出了EgoTactile,这是一个将自我中心视频与全手压力监督配对的基准数据集,包含用于实现自然场景泛化的裸手迁移子集。利用该基准,我们首先建立了EgoPressureFormer作为判别基线。除此之外,为了明确解决部分观测中的不确定性,我们提出了EgoPressureDiff,这是一个条件扩散框架,适配大规模预训练的视频扩散骨干。通过将丰富的世界知识先验与物理信息特征校正层相结合以注入语义约束,我们的方法能够有效推断合理的接触模式并解决视觉-物理歧义。大量实验表明,我们的方法在该基准上实现了卓越的性能,并具有对野外场景的强大迁移能力。项目页面见 https://egotactile.github.io/。

BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

2026-06-09T04:00:00cs.AI, cs.LG, diffusion, stat.ML2606.09257

中文标题:BSTabDiff:高维表格数据生成的块-亚基扩散先验

作者:Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Gyawali, Gianfranco Doretto, Donald A. Adjeroh

摘要:

High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $m$ = number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned since $n \ll m$. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.

摘要中文:

高维低样本量(HDLSS)表格数据领域(如组学数据)的特点在于$n \ll m$,其中$n$为样本数,$m$为特征数。这类领域通常表现出强局部相关组、稀疏的跨组依赖、重尾非高斯边缘分布、异方差噪声以及结构化缺失等特征,使得在$\mathbb{R}^m$空间中进行直接密度学习呈现病态性质,因为$n \ll m$。我们提出BSTabDiff,一个块-亚基生成框架,将$m$个观测特征划分为$M$个潜在块($M \ll m$),并通过共享的低维亚基变量生成每个块,将全局依赖学习集中在紧凑的块潜在空间$\mathbb{R}^M$中,同时利用Copula驱动的依赖关系、灵活的逐特征边缘分布以及显式的缺失机制解码到完整特征空间。 BSTabDiff支持在块潜在变量上的现代深度先验,包括扩散模型和归一化流,使得在HDLSS场景下能够进行稳定的合成和可控的基准生成。实证结果表明,与非结构化表格生成器相比,BSTabDiff在HDLSS数据上能够生成更加真实且稳定的高维合成数据。

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

2026-06-09T04:00:00cs.AI, cs.CV, cs.LG, diffusion2606.09646

中文标题:视频基础模型是否理解直观物理?逐层探针分析

作者:Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra, Salvatore Lo Sardo, Mohammadreza Salehi

摘要:

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.

摘要中文:

我们研究预训练视频基础模型在其冻结表示中是否编码了直观物理信息,以及该信息如何随模型家族、层数和探测类型而变化。使用IntPhys2和Minimal Video Pairs(MVP)数据集上的冻结特征探测,我们比较了预测性联合嵌入模型(V-JEPA)、掩码重建模型(VideoMAE)以及基于扩散的视频生成器(LTX-Video)。V-JEPA在各项基准测试中取得最佳总体结果,特别是使用建模时间动态的探测器时表现优异;VideoMAE保持竞争力;LTX-Video则恢复了较弱但非平凡的信号。逐层分析表明,物理相关信息在早期层中最弱,在中后深度层中最易获取;时间控制实验表明,打破帧序会显著降低性能,尤其在MVP上。综上所述,这些结果表明直观物理知识在预训练视频表示中可靠地涌现,但其可访问性强烈依赖于预训练范式、表示深度和读出机制。

Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision

2026-06-09T04:00:00cs.AI, cs.CV, diffusion2606.09670

中文标题:视觉提示与双教师监督下的基于特征重构的异常检测

作者:Mateo Diaz-Bone, Daniel Caraballo, Florian Scheidegger, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Roy Assaf, Niccolo Avogaro, Yagmur G. Cinar, Brown Ebouky, Filip M. Janicki, Piotr S. Kluska, Cezary Skura, Cristiano Malossi

摘要:

Recent Anomaly Detection methods achieve perfect detection and segmentation scores on well-established datasets, such as MVTec. However, many of these methods face challenges when foundational assumptions - such as consistent object scale, viewpoint, background, illumination, and centered placement - are violated. Those variations that occur render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. We achieve a 3.5 percentage point improvement over the previous state-of-the-art on the challenging AeBAD dataset by using the Masked Multiscale Reconstruction (MMR) model as our backbone.

摘要中文:

近期的异常检测方法在MVTec等成熟数据集上取得了完美的检测和分割分数。然而,当基本假设(如一致的物体尺度、视角、背景、光照和居中放置)被违反时,许多方法面临挑战。这些变化使得异常检测方法在许多实际场景中无法使用。为了解决这些局限性,我们引入了三个关键贡献:(1)使用前景-背景掩码隔离物体的视觉提示流程;(2)一种解冻学生-教师模型中教师网络以提升领域适应性的机制;(3)利用扩散生成的合成图像增强异常检测性能的数据增强策略。通过使用掩码多尺度重建(MMR)模型作为主干网络,我们在具有挑战性的AeBAD数据集上比此前的最先进方法提升了3.5个百分点。

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

2026-06-09T04:00:00cs.AI, cs.CV, cs.RO, diffusion2606.09811

中文标题:AHA-WAM:基于观察引导上下文路由的异步时域自适应世界-动作建模

作者:Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu

摘要:

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.

摘要中文:

世界-动作模型已成为机器人操作领域一种有前景的范式,通过联合建模视觉场景动态和动作来为策略学习注入物理先验。然而,现有世界-动作模型以相同的时间分辨率耦合世界预测和动作执行,迫使世界分支对冗余且信息量较少的近期帧间变化进行建模。我们认为,将世界预测与动作执行严格绑定到相同的时间节奏可能无法充分发挥视频分支在具身控制中的潜力。因此,我们提出了AHA-WAM,一种基于双扩散变换器(DiT)架构构建的异步时域自适应世界-动作模型,围绕这种时间不对称性重新组织世界-动作建模。AHA-WAM将视频DiT实例化为一个低频世界规划器,它对过去的观察维护滚动键值记忆,并输出可复用的逐层潜在上下文,其中编码了长时程场景演化;而高频动作DiT通过逐层联合注意力查询该上下文,以闭环方式执行短动作块。为支持异步执行,我们引入了时域自适应偏移训练和观察引导的视频-上下文路由(OVCR),二者共同使动作专家能够利用长时程世界上下文,同时保持对实时执行状态的响应,而无需重新运行视频DiT。在RoboTwin和真实世界操作任务上的实验表明,AHA-WAM在无需任何机器人数据预训练的情况下达到了最先进的性能,在RoboTwin上获得92.80%的平均成功率,在4个真实世界任务上获得78.3%的成功率,同时实现了24.17 Hz的闭环控制频率,相比Fast-WAM提速4.59倍。

PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws

2026-06-09T04:00:00cs.AI, cs.CV, diffusion, math.PR2606.09816

中文标题:PTL-Diffusion:基于周期性终端律的流形感知扩散模型

作者:Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins, Xiaojie Wang, Ke Chen, Yue Wu

摘要:

Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein-Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.

摘要中文:

标准扩散模型通常采用单一时间齐次高斯终端分布作为生成的参考律。尽管这一选择便于分析且经验上效果显著,但它对集中在低维流形附近的数据几乎不提供显式结构,而数据分布的不同区域可能对应于不同的局部几何或语义因素。因此,反向模型几乎必须完全从无结构的终端参考分布中恢复流形级结构。我们提出PTL-Diffusion,一个概念验证型扩散框架,其前向加噪过程收敛到非恒定的周期性高斯终端律族,而非单一不变律。与相位条件DDPM不同(相位信息仅进入去噪网络而前向过程保持不变),PTL-Diffusion将相位结构直接嵌入前向加噪动力学中。所提出的构造与标准去噪扩散模型保持接近:对于周期性驱动的Ornstein-Uhlenbeck型前向过程,我们推导出了闭式的前向边缘分布、极限周期高斯终端族以及显式的高斯反向后验,从而可以使用标准的噪声预测训练。我们还引入了一种不变平均正则化项,通过平均周期参考律将相位条件反向动力学耦合起来。在环面和圆柱点云基准以及Olivetti人脸数据集上的实验表明,PTL-Diffusion在流形级分布匹配方面优于对应的DDPM基线,能够降低相位条件误差、特征空间协方差误差和最近邻流形距离。这些结果表明结构化终端参考律是一个有前景的研究方向,同时也表明需要更具表达力的相位构造和更大规模的评估。
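To illustrate why a periodically forced Ornstein-Uhlenbeck process has closed-form Gaussian marginals and a periodic limiting law, here is the standard scalar case. The parameterization (rate θ, noise scale σ, T-periodic mean m(t)) is an assumption for illustration; the paper's exact forward process may differ, for example by also letting the variance depend on phase.

```latex
\begin{aligned}
dX_t &= \theta\bigl(m(t) - X_t\bigr)\,dt + \sigma\,dW_t, \qquad m(t+T) = m(t),\quad \theta > 0,\\[4pt]
X_t \mid X_0 = x_0 &\sim \mathcal{N}\!\Bigl(x_0 e^{-\theta t} + \theta\!\int_0^t e^{-\theta(t-s)}\, m(s)\,ds,\;\; \tfrac{\sigma^2}{2\theta}\bigl(1 - e^{-2\theta t}\bigr)\Bigr),\\[4pt]
\mu_\infty(t) &= \theta\!\int_{-\infty}^{t} e^{-\theta(t-s)}\, m(s)\,ds, \qquad \mu_\infty(t+T) = \mu_\infty(t).
\end{aligned}
```

As t grows, the dependence on x_0 vanishes and the marginal approaches N(μ_∞(t), σ²/(2θ)), which is a family of Gaussians indexed by the phase of t rather than a single invariant law.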

IDEQ -- Improving Diffusion Models for the Traveling Salesman Problem (TSP) by Leveraging the Structure of the Solution Space

2026-06-09T04:00:00cs.AI, cs.LG, diffusion2412.13858

中文标题:IDEQ——通过利用解空间结构改进求解旅行商问题的扩散模型

作者:Mickael Basson, Philippe Preux

摘要:

We investigate diffusion models to solve the Traveling Salesman Problem. Building on the recent DIFUSCO and T2TCO approaches, we propose IDEQ. IDEQ improves the quality of the solutions by leveraging the constrained structure of the state space of the TSP. Another key component of IDEQ consists in replacing the last stages of DIFUSCO curriculum learning by considering a uniform distribution over the Hamiltonian tours whose orbits by the 2-opt operator converge to the optimal solution as the training objective. Our experiments show that IDEQ improves the state of the art for such neural network based techniques on synthetic instances. More importantly, our experiments show that IDEQ performs very well on the instances of the TSPlib, a reference benchmark in the TSP community: it closely matches the performance of the best heuristics, LKH3, being even able to obtain better solutions than LKH3 on 2 instances of the TSPlib defined on 1577 and 3795 cities. IDEQ obtains 0.3% optimality gap on TSP instances made of 500 cities, and 0.5% on TSP instances with 1000 cities. This sets a new SOTA for neural based methods solving the TSP. Moreover, IDEQ exhibits a lower variance and better scales-up with the number of cities with regards to DIFUSCO and T2TCO.

摘要中文:

我们研究利用扩散模型求解旅行商问题(TSP)。基于近期的DIFUSCO和T2TCO方法,我们提出了IDEQ。IDEQ通过利用TSP状态空间的约束结构来提高解的质量。IDEQ的另一个关键组成部分是替换DIFUSCO课程学习的最后阶段:以在2-opt算子作用下轨道收敛于最优解的哈密顿回路上的均匀分布作为训练目标。我们的实验表明,IDEQ在合成实例上刷新了此类基于神经网络的方法的最新水平。更重要的是,IDEQ在TSP领域的参考基准TSPlib的实例上表现优异:其性能与最佳启发式算法LKH3非常接近,甚至在TSPlib中分别包含1577和3795个城市的两个实例上获得了比LKH3更好的解。IDEQ在500个城市的TSP实例上取得0.3%的最优性差距,在1000个城市的实例上取得0.5%的最优性差距,为求解TSP的神经网络方法树立了新的SOTA。此外,与DIFUSCO和T2TCO相比,IDEQ的方差更低,并且随城市数量增长的扩展性更好。
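The 2-opt operator mentioned in the abstract can be illustrated with a minimal local search. This is only a sketch of plain 2-opt on Euclidean points, not IDEQ's training procedure; the function names `tour_length` and `two_opt` and the four-city example are assumptions made for illustration.

```python
import math

def tour_length(points, tour):
    """Length of the closed tour that visits `points` in the order given by `tour`."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))

def two_opt(points, tour):
    """Apply improving 2-opt moves until none remains (a 2-opt local optimum)."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a city, nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length when edges (a,b),(c,d) become (a,c),(b,d).
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    tour[i + 1 : j + 1] = reversed(tour[i + 1 : j + 1])
                    improved = True
    return tour

# Hypothetical example: a unit square visited in a self-crossing order.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
crossing_tour = [0, 2, 1, 3]
untangled = two_opt(square, crossing_tour)
```

Repeatedly applying such moves to a tour traces out its orbit under 2-opt; the abstract's training target is a distribution over the tours whose orbits end at the optimal solution.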

Complement or substitute? How AI increases the demand for human skills

2026-06-09T04:00:00 · cs.AI, diffusion, econ.GN, q-fin.EC · 2412.19754

中文标题:互补还是替代?人工智能如何增加对人类技能的需求

作者:Elina Mäkelä, Matthew Bone, Mareike Sehrer, Farah Nanji, Fabian Stephany

摘要:

Artificial Intelligence (AI) is transforming the nature of work, yet there is limited empirical evidence on how it affects demand for human skills. This paper examines whether AI adoption increases the prevalence and value of human capabilities that complement technical AI skills, such as analytical thinking, resilience, or ethical judgment, within and beyond AI-intensive job roles. Using a dataset of nearly 30 million job postings from the US, the UK and Australia, between 2018 and 2024, we distinguish between internal effects (within AI roles) and external effects (in non-AI roles) across companies, industries, and regions. This paper has three main findings. First, we find that AI-intensive roles are significantly more likely to require complementary non-technical capabilities, such as analytical thinking, resilience, and digital literacy. Second, these complementary skills are associated with meaningful wage premiums, particularly in managerial, sales or finance roles working with AI. Third, we show that AI diffusion has potential spillover effects: as AI adoption rises within companies, industries, and regions, demand for complementary skills increases even in non-AI roles while demand for substitutable skills - summarisation, translation or customer service - decreases. These trends hold across geographies, including the United States, United Kingdom, and Australia, confirming the robustness of our findings. Together, these findings indicate that AI is not simply replacing tasks or requiring more AI developer skills; it may be transforming workforce skill requirements to favor human attributes that enhance collaboration with intelligent systems.

摘要中文:

人工智能(AI)正在改变工作的性质,但关于其如何影响人类技能需求的实证证据仍然有限。本文探讨AI的采用是否会增加与AI技术技能形成互补的人类能力(如分析性思维、韧性或道德判断)在AI密集型岗位内部及之外的普及程度和价值。利用2018年至2024年间来自美国、英国和澳大利亚近3000万条招聘帖子数据集,本文区分了公司、行业和地区层面的内部效应(在AI岗位内)和外部效应(在非AI岗位中)。本文有三个主要发现。首先,我们发现AI密集型岗位显著更可能要求互补性的非技术能力,如分析性思维、韧性和数字素养。其次,这些互补性技能与可观的工资溢价相关,尤其是在与AI配合工作的管理、销售或财务岗位中。第三,我们发现AI扩散具有潜在的溢出效应:随着公司、行业和地区内AI采用率的上升,即使在非AI岗位中,对互补性技能的需求也在增加,而对可替代技能——如总结、翻译或客户服务——的需求则在减少。这些趋势在美国、英国和澳大利亚等不同地区均成立,证实了我们研究结果的稳健性。综上,这些发现表明AI并非简单地替代任务或仅需要更多的AI开发技能;它可能正在改变劳动力技能需求,使其倾向于有利于增强与智能系统协作的人类特质。

CLONE: A 3DGS-Based Closed-Loop Differentiable Optimization Framework for Single-Image Normal Estimation

2026-06-09T04:00:00 · cs.AI, cs.CV, diffusion · 2508.05950

中文标题:CLONE:基于3DGS的单图像法线估计闭环可微优化框架

作者:Yanxing Liang, Yinghui Wang, Wei Li, Tao Yan, Jiaxing Shen

摘要:

We propose CLONE, a 3DGS-based Closed-Loop differentiable Optimization framework for single-image Normal Estimation. The core idea is to construct an "image-geometry-image" consistency loop that unifies and jointly constrains the limitations of both paradigms: the reliance on explicit supervision without cross-domain geometric constraints in discriminative methods, and the absence of stable differentiable optimization pathways in generative methods despite strong generative priors. Specifically, we first employ 3D Gaussian Splatting to explicitly parameterize the scene and derive continuous and differentiable surface normals via covariance eigen-decomposition, providing an analytical gradient pathway for geometric modeling. We then introduce a differentiable illumination model with a learnable light modulation kernel to establish a continuous mapping between surface normals and image radiance, enabling reprojection errors to directly supervise the underlying 3D geometry. Furthermore, to compensate for the limited local detail expressiveness of Gaussian representations, we design a one-step deterministic diffusion-inspired refinement network, which enhances local geometric details while preserving end-to-end differentiability. A cross-domain gating fusion mechanism is introduced to coordinate global geometric consistency and local detail reconstruction. Finally, all components are jointly optimized under a unified reprojection objective, forming a closed-loop and stable gradient propagation pathway. This enables effective constraint of the multi-solution space and improved geometric consistency without requiring ground-truth normal supervision.

摘要中文:

我们提出CLONE,一个基于三维高斯溅射(3D Gaussian Splatting,3DGS)的单图像法线估计闭环可微优化框架。其核心思想是构建一个“图像-几何-图像”一致性循环,统一处理并联合约束两类范式各自的局限:判别式方法依赖显式监督却缺乏跨域几何约束,而生成式方法虽具备强大的生成先验,却缺乏稳定的可微优化路径。具体而言,我们首先采用三维高斯溅射对场景进行显式参数化,并通过协方差特征分解推导出连续且可微的表面法线,从而为几何建模提供解析梯度路径。随后,我们引入一个带有可学习光照调制核的可微光照模型,建立表面法线与图像辐射之间的连续映射,使重投影误差能够直接监督底层三维几何。此外,为弥补高斯表示在局部细节表达方面的不足,我们设计了一个受扩散启发的单步确定性细化网络,在保持端到端可微性的同时增强局部几何细节。我们还引入跨域门控融合机制,以协调全局几何一致性与局部细节重建。最终,所有组件在统一的重投影目标下联合优化,形成闭环且稳定的梯度传播路径。这使得在无需真值法线监督的情况下,能够有效约束多解空间并提升几何一致性。
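Deriving a normal from a Gaussian's covariance by eigen-decomposition can be sketched as follows. The sketch assumes the common 3DGS parameterization Sigma = R S S^T R^T and takes the eigenvector of the smallest eigenvalue as the normal, a usual convention for flat Gaussians; whether CLONE uses exactly this rule is not stated in the abstract. The function names and example values are hypothetical.

```python
import numpy as np

def covariance_from_scale_rotation(scales, rotation):
    """Build Sigma = R S S^T R^T from per-axis scales and a 3x3 rotation matrix."""
    S = np.diag(scales)
    return rotation @ S @ S.T @ rotation.T

def gaussian_normal(covariance):
    """Unit eigenvector of the smallest eigenvalue (its sign is ambiguous)."""
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # eigenvalues ascending
    normal = eigenvectors[:, 0]
    return normal / np.linalg.norm(normal)

# Hypothetical flat Gaussian: thin along its local z axis, then rotated
# 90 degrees about x so that the thin axis points along world -y.
rot_x_90 = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, -1.0],
                     [0.0, 1.0, 0.0]])
sigma = covariance_from_scale_rotation([2.0, 1.0, 0.01], rot_x_90)
normal = gaussian_normal(sigma)
```

Because the sign of an eigenvector is arbitrary, a real pipeline would additionally orient the normal, for instance toward the camera.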

TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks

2026-06-09T04:00:00 · cs.AI, cs.CR, cs.LG, cs.SY, diffusion, eess.SY · 2510.16028

中文标题:TAO:浮点神经网络的容差感知乐观验证

作者:Jianzhu Yao, Hongxu Su, Taobo Liao, Zerui Cheng, Huan Zhang, Xuechao Wang, Pramod Viswanath

摘要:

Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point (FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present TAO: a Tolerance Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. TAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement TAO as a PyTorch-compatible runtime and a contract layer currently deployed on Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers and diffusion models on A100, H100, RTX6000, RTX4090, empirical thresholds are $10^2$ to $10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. Together, TAO reconciles scalability with verifiability for real-world heterogeneous ML compute.

摘要中文:

神经网络日益运行在用户无法控制的硬件上(云GPU、推理市场)。然而,机器学习即服务(ML-as-a-Service)很少披露实际运行的内容,也不说明返回的输出是否忠实对应预期输入。用户对服务降级(模型替换、量化、图重写,或广告嵌入被篡改等差异)缺乏追索手段。验证输出很困难,因为异构加速器上的浮点(FP)执行本质上是非确定性的。现有方法要么对真实的浮点神经网络不切实际,要么重新引入对供应商的信任。我们提出TAO:一种容差感知的乐观验证协议,它接受落在有原则依据的算子级接受区域内的输出,而非要求逐位相等。TAO结合两种误差模型:(i)可靠(sound)的逐算子IEEE-754最坏情况界,以及(ii)跨硬件校准的紧致经验百分位分布。出现差异时会触发一个由Merkle锚定、阈值引导的争议博弈,递归划分计算图直到只剩一个算子,此时裁决简化为轻量级的理论界检查,或针对经验阈值的小规模诚实多数投票。未受挑战的结果在挑战窗口结束后最终确定,无需可信硬件或确定性内核。我们将TAO实现为PyTorch兼容的运行时和合约层,目前部署在以太坊Holesky测试网上。该运行时对计算图进行插桩,计算逐算子界,并以可忽略的开销(在Qwen3-8B上为0.3%)以FP32运行未经修改的供应商内核。在A100、H100、RTX6000、RTX4090上针对CNN、Transformer和扩散模型的实验表明,经验阈值比理论界紧$10^2$至$10^3$倍,且感知界的对抗攻击成功率为0%。综上,TAO在真实世界的异构ML计算中兼顾了可扩展性与可验证性。
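The dispute game described above narrows a disagreement down to a single operator. The sketch below shows that bisection idea under strong simplifying assumptions: the computation is a linear chain of scalar operators rather than a graph, there are no Merkle commitments, and one fixed tolerance `tol` stands in for TAO's per-operator bounds. All names (`run_trace`, `bisect_dispute`, `adjudicate`) are hypothetical.

```python
def run_trace(ops, x):
    """Return [input, out_1, ..., out_n] for a linear chain of operators."""
    trace = [x]
    for op in ops:
        trace.append(op(trace[-1]))
    return trace

def bisect_dispute(claimed, honest, tol):
    """Narrow a disputed trace to one operator (1-based index into the chain).

    Invariant: the parties agree (within tol) at `lo` and disagree at `hi`.
    Assumes they agree on claimed[0] and disagree on claimed[-1].
    """
    lo, hi = 0, len(claimed) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if abs(claimed[mid] - honest[mid]) <= tol:
            lo = mid
        else:
            hi = mid
    return hi

def adjudicate(ops, claimed, k, tol):
    """Re-execute only operator k on its agreed input and apply the tolerance."""
    return abs(ops[k - 1](claimed[k - 1]) - claimed[k]) <= tol

# Hypothetical four-operator chain; the proposer misreports the third output.
ops = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0, lambda x: x * x]
honest = run_trace(ops, 1.0)
claimed = honest[:3] + [1.5, 2.25]
disputed = bisect_dispute(claimed, honest, tol=1e-6)
```

The bisection takes a logarithmic number of rounds in the chain length, and the final check re-runs one operator only, which is why adjudication can stay lightweight.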

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

2026-06-09T04:00:00 · cs.AI, cs.CV, cs.LG, diffusion · 2601.23286

中文标题:VideoGPA:用于3D一致性视频生成的几何先验蒸馏

作者:Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

摘要:

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, geometric plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.

摘要中文:

尽管近期视频扩散模型(VDMs)能够生成视觉效果出众的视频,但在维持3D结构一致性方面存在根本性困难,常导致物体变形或空间漂移。我们假设这些失败源于标准去噪目标缺乏对几何一致性的显式激励。为解决此问题,我们提出VideoGPA(视频几何偏好对齐),这是一种数据高效的自监督框架,利用几何基础模型自动导出密集偏好信号,通过直接偏好优化(DPO)引导VDMs。该方法有效地将生成分布导向固有3D一致性,无需人工标注。VideoGPA使用极少的偏好对即可显著增强时间稳定性、几何合理性和运动一致性,在大量实验中始终优于最先进的基线方法。
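For reference, the standard Direct Preference Optimization loss on one preference pair can be written in a few lines. This is the generic log-likelihood form of DPO, not VideoGPA's exact objective: for diffusion models the log-likelihoods are usually replaced by a tractable surrogate, which the abstract does not specify. The function name and `beta` default are assumptions.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """-log sigmoid(beta * margin) for one (preferred, rejected) pair.

    margin compares how much the policy raised the preferred sample's
    log-probability over the reference model, relative to the rejected sample.
    """
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    # -log(sigmoid(z)) equals log(1 + exp(-z)).
    return math.log1p(math.exp(-beta * margin))

# Hypothetical numbers: the policy equals the reference, then favors the winner.
loss_at_reference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
loss_after_update = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

In the setting of the abstract, the preferred and rejected samples would be ranked automatically by a geometry foundation model rather than by human annotators.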

Cosmo3DFlow: Wavelet Flow Matching for Spatial-to-Spectral Compression in Reconstructing the Early Universe

2026-06-09T04:00:00 · astro-ph.IM, cs.AI, diffusion · 2602.10172

中文标题:Cosmo3DFlow:用于重建早期宇宙的空间到频谱压缩的小波流匹配

作者:Md. Khairul Islam, Zeyu Xia, Ryan Goudjil, Jialu Wang, Arya Farahi, Judy Fox

摘要:

Reconstructing the early universe from the evolved present-day universe is a challenging and computationally demanding problem in modern astrophysics. We devise a novel generative framework, Cosmo3DFlow, designed to address dimensionality and sparsity, the critical bottlenecks inherent in current state-of-the-art methods for cosmological inference. By integrating 3D Discrete Wavelet Transform (DWT) with flow matching, we effectively represent high-dimensional cosmological structures. The Wavelet Transform addresses the "void problem" by translating spatial emptiness into spectral sparsity. It decouples high-frequency details from low-frequency structures, and wavelet-space velocity fields facilitate stable ordinary differential equation (ODE) solvers with large step sizes. Using large-scale cosmological $N$-body simulations at $128^3$ resolution, we achieve up to $46\times$ faster sampling than diffusion models. Our results enable initial conditions to be sampled in seconds, compared to minutes for previous methods.

摘要中文:

从演化至今的宇宙重建早期宇宙是现代天体物理学中一项具有挑战性且计算密集的问题。我们设计了一个新的生成框架Cosmo3DFlow,旨在解决维度和稀疏性问题,这是当前最先进的宇宙学推断方法中固有的关键瓶颈。通过将3D离散小波变换(DWT)与流匹配相结合,我们有效地表征了高维宇宙结构。小波变换通过将空间空旷性转化为频谱稀疏性来解决“空洞问题”。它将高频细节与低频结构解耦,且小波空间的速度场使常微分方程(ODE)求解器能够以大步长稳定求解。利用$128^3$分辨率的大尺度宇宙学$N$体模拟,我们实现了比扩散模型快至多$46\times$的采样速度。我们的结果使得初始条件能够在数秒内完成采样,而此前的方法需要数分钟。
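How a wavelet transform turns spatial emptiness into sparse coefficients can be seen with one level of a Haar transform. The sketch is one-dimensional for brevity, whereas the paper uses a 3D DWT (a separable 3D Haar transform applies the same step along each axis). The function names and the toy density array are assumptions.

```python
import numpy as np

def haar_1d(signal):
    """One level of the orthonormal Haar transform along the last axis (even length)."""
    even, odd = signal[..., 0::2], signal[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency band
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency band
    return approx, detail

def inverse_haar_1d(approx, detail):
    """Exact inverse of haar_1d."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    out = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],))
    out[..., 0::2], out[..., 1::2] = even, odd
    return out

# Hypothetical density field: a void everywhere except one small structure.
density = np.zeros(16)
density[6:8] = [3.0, 1.0]
approx, detail = haar_1d(density)
reconstructed = inverse_haar_1d(approx, detail)
```

Every coefficient that covers only the void is exactly zero, so the empty regions cost nothing in the wavelet representation while the transform stays invertible.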

Speech Enhancement Based on Drifting Models

2026-06-09T04:00:00 · cs.AI, cs.SD, diffusion, eess.AS, eess.SP · 2604.24199

中文标题:基于漂移模型的语音增强

作者:Liang Xu, Diego Caviedes-Nozal, W. Bastiaan Kleijn, Longfei Felix Yan, Rasmus Kongsgaard Olsson

摘要:

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

摘要中文:

我们提出了基于漂移模型的语音增强(DriftSE),一种将去噪表述为均衡问题的新型生成框架。DriftSE不依赖迭代采样,而是通过演化映射函数的前推分布来直接匹配纯净语音分布,从而原生实现单步推理。这种演化由漂移场驱动——一种学习得到的修正向量,引导样本朝向纯净分布的高密度区域,这自然地通过匹配分布而非成对样本来促进在非成对数据上的训练。我们在两种框架变体下进行研究:一是从噪声观测的直接映射,二是基于高斯先验的随机条件生成模型。在VoiceBank-DEMAND基准数据集上的实验表明,DriftSE能够单步实现高保真增强,优于多步扩散基线方法,确立了语音增强的新范式。

Simple Self-Conditioning Adaptation for Masked Diffusion Models

2026-06-09T04:00:00 · cs.AI, cs.LG, diffusion · 2604.26985

中文标题:掩码扩散模型的简单自条件化适配方法

作者:Michael Cardei, Huu Binh Ta, Ferdinando Fioretto

摘要:

Masked diffusion models (MDMs) generate discrete sequences by iterative denoising under an absorbing masking process. In standard masked diffusion, if a token remains masked after a reverse update, the model discards its clean-state prediction for that position. Thus, still-masked positions must be repeatedly inferred from the mask token alone. This design choice limits cross-step refinement. To address this limitation, this paper proposes a simple, yet effective, post-training adaptation for MDMs that conditions each denoising step on the model's own previous clean-state predictions. The resulting method, called Self-Conditioned Masked Diffusion Models (SCMDM), requires minimal architectural change, does not introduce a recurrent latent-state pathway, does not rely on an auxiliary reference model, and adds no extra denoiser evaluations during sampling. This is an important departure from partial self-conditioning approaches, which require expensive model training from scratch. In particular, the paper shows that partial self-conditioning, including the commonly used 50% dropout strategy for training self-conditioned models from scratch, is suboptimal in the post-training regime. Instead, once the model's self-generated clean-state estimates become informative, the specialization to refinement is preferable to mixing conditional and unconditional objectives. SCMDM is evaluated across multiple domains, demonstrating consistent improvement over vanilla MDM baselines, achieving nearly a 50% reduction in generative perplexity on OWT-trained models (42.89 to 23.72), alongside strong improvements in discretized image synthesis quality, small molecular generation, and enhanced fidelity in genomic distribution modeling.

摘要中文:

掩码扩散模型(MDMs)在吸收式掩码过程下通过迭代去噪生成离散序列。在标准掩码扩散中,如果某个标记在一次反向更新后仍处于掩码状态,模型就会丢弃对该位置的干净状态预测。因此,仍被掩码的位置只能反复地仅从掩码标记本身进行推断。这种设计限制了跨步细化的能力。为解决这一局限,本文提出了一种针对MDMs的简单而有效的训练后适配方法,使每个去噪步骤以模型自身先前的干净状态预测为条件。由此得到的方法称为自条件掩码扩散模型(SCMDM),它只需极少的架构改动,不引入循环潜状态路径,不依赖辅助参考模型,且在采样过程中不增加额外的去噪器评估。这与部分自条件化方法有重要区别,后者需要代价高昂的从头训练。本文特别指出,部分自条件化(包括从头训练自条件模型时常用的50% dropout策略)在训练后适配场景下是次优的。相反,一旦模型自生成的干净状态估计变得有信息量,专注于细化就优于混合条件与无条件目标。SCMDM在多个领域进行了评估,相比原始MDM基线取得了一致的提升:在OWT上训练的模型的生成困惑度降低近50%(从42.89降至23.72),同时在离散化图像合成质量、小分子生成以及基因组分布建模的保真度方面均有明显改进。

When Do Diffusion Models learn to Generate Multiple Objects?

2026-06-09T04:00:00 · cs.AI, cs.CV, diffusion · 2605.00273

中文标题:扩散模型何时学会生成多个物体?

作者:Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach

摘要:

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.

摘要中文:

文生图扩散模型在视觉保真度方面取得了令人瞩目的成就,然而在多物体生成方面仍不可靠。尽管已有大量实证证据表明这些失败现象,但其根本原因尚不明确。我们首先探究这一局限性在多大程度上源于数据本身。为了分离数据的影响,我们在不同数据集规模下考虑两种情形:(1)概念泛化,即每个单独概念都在训练中出现过,但数据分布可能不平衡;(2)组合泛化,即特定的概念组合被系统性地从训练中留出。为了研究这些情形,我们引入了mosaic(Multi-Object Spatial relations, AttrIbution, Counting,即多物体空间关系、属性归属、计数),这是一个用于数据集生成的受控框架。通过在mosaic上训练扩散模型,我们发现起主导作用的是场景复杂度而非概念不平衡,并且在低数据情形下计数尤其难以学习。此外,随着训练中留出的概念组合越来越多,组合泛化能力会崩溃。这些发现揭示了扩散模型的根本局限,并促使我们寻求更强的归纳偏置和更好的数据设计,以实现稳健的多物体组合生成。

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

2026-06-09T04:00:00 · autoregressive, cs.CV, cs.RO, diffusion · 2606.07895

中文标题:TBD-VLA:时间块扩散视觉语言动作模型

作者:Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo

摘要:

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/

摘要中文:

离散视觉语言动作(VLA)模型通常将动作生成表述为离散化动作空间上的下一个token预测,每个token都以先前上下文为条件自回归地生成。这一范式虽然有效,但推理延迟高,且在很大程度上忽略了动作轨迹中固有的时间结构。近期的研究引入并行解码来提高效率、加快推理,但缺乏显式建模token依赖关系的机制。我们提出TBD-VLA,一个基于离散token的VLA框架,采用块扩散实现时间动作生成。我们将动作序列划分为时间块,在每个块内执行掩码离散扩散,同时在块之间保持自回归生成。这种设计统一了时间自回归和并行动作解码,既获得了较强的时间一致性,又提高了推理速度。此外,显式的时间建模支持通过时间修复(temporal in-painting)实现动作块的异步执行(例如Real-Time Chunking)。TBD-VLA在仿真和真实世界操作任务中均显著优于现有VLA方法,为构建快速、时间感知的离散VLA模型提供了一条可扩展的路径。项目主页:https://tbd-vla.github.io/
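The decoding pattern described above, masked diffusion inside each temporal block and autoregression across blocks, can be sketched as a control loop. This is a toy under stated assumptions: `predict` is a hypothetical stand-in for the model that returns a (token, confidence) pair per block position, and unmasking the most confident positions first is one common schedule, not necessarily the one TBD-VLA uses.

```python
import math

MASK = -1  # placeholder id for an action token that is not decoded yet

def decode_blocks(predict, seq_len, block_size, steps_per_block):
    """Decode seq_len tokens block by block; each block is unmasked in parallel steps."""
    tokens = []
    for start in range(0, seq_len, block_size):
        block = [MASK] * min(block_size, seq_len - start)
        for step in range(steps_per_block):
            masked = [i for i, t in enumerate(block) if t == MASK]
            if not masked:
                break
            # One parallel prediction for the block, conditioned on earlier blocks.
            proposals = predict(tokens, block)
            # Spread the remaining masked positions over the remaining steps.
            k = math.ceil(len(masked) / (steps_per_block - step))
            most_confident = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
            for i in most_confident:
                block[i] = proposals[i][0]
        tokens.extend(block)  # later blocks are conditioned on this one
    return tokens

def toy_predict(context, block):
    """Hypothetical model: token = global position mod 7, earlier positions more confident."""
    return [((len(context) + i) % 7, 1.0 / (1 + i)) for i in range(len(block))]

actions = decode_blocks(toy_predict, seq_len=10, block_size=4, steps_per_block=2)
```

With these settings the ten tokens take six model calls instead of ten, which is where the speedup over purely token-by-token decoding comes from.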

Property-Informed Diffusion-Based Text-to-Microstructure Generation

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.08150

中文标题:融合属性信息的基于扩散的文本到微结构生成

作者:Bingxuan Dai, Hongsong Wang, Jie Gui

摘要:

Designing 3D metamaterial microstructures that meet the intended functions remains a major challenge, as it typically requires domain expertise, iterative simulations, and extensive manual tuning. Existing work on inverse design that automatically generates microstructures based on desired target properties often suffers from limited design diversity and faces challenges in ensuring the physical feasibility of the generated structures. To address this issue, a property-informed diffusion-based network is proposed that enables the generation of 3D microstructures directly from textual descriptions. Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. To enforce consistency between the generated structures and the target textual prompts, a dual alignment strategy is adopted, including contrastive text-structure alignment and test-time reward-guided alignment. Experimental results show that the model is capable of generating semantically meaningful and physically plausible structures across a wide range of material categories. Our approach has good potential for interactive microstructure design and opens up new directions for combining language-based interfaces with inverse material discovery. Code is available at: https://github.com/hongsong-wang/PropDiff-TMG

摘要中文:

设计满足预期功能的三维超材料微结构仍然是一项重大挑战,因为这通常需要领域专业知识、迭代模拟和大量手动调优。现有的逆向设计工作虽然能够根据所需的目标属性自动生成微结构,但往往存在设计多样性有限的问题,并面临确保生成结构物理可行性的挑战。为解决这一问题,我们提出了一种属性信息扩散网络,能够直接从文本描述生成三维微结构。与传统的属性条件方法不同,我们的方法利用文本输入中丰富的语义和物理属性指导,以支持多样化的结构合成。为确保生成结构与目标文本提示的一致性,我们采用了双对齐策略,包括对比性文本-结构对齐和测试时奖励引导对齐。实验结果表明,该模型能够跨广泛材料类别生成语义上有意义且物理上合理的结构。我们的方法在交互式微结构设计中具有良好潜力,并为结合基于语言的接口与逆向材料发现开辟了新方向。代码可访问:https://github.com/hongsong-wang/PropDiff-TMG

WaveDiT: Distribution-Aware Wavelet Flow Matching for Efficient 3D Brain MRI Synthesis

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.08670

中文标题:WaveDiT:用于高效3D脑MRI合成的分布感知小波流匹配

作者:Danilo Danese, Angela Lombardi, Giuseppe Fasano, Matteo Attimonelli, Tommaso Di Noia

摘要:

Large and demographically balanced datasets are essential for reliable neuroimaging biomarkers. Full-resolution 3D brain MRI synthesis can support data augmentation in this setting, but existing approaches either incur prohibitive computational cost at volumetric scale or rely on lossy latent compression that may compromise anatomical detail. As a result, practical 3D generative augmentation often requires specialized compute infrastructure. We propose WaveDiT, a conditional flow matching framework operating in the coefficient space of a 3D Haar Discrete Wavelet Transform. The model combines factorized spatio-depth attention with band-wise heteroscedastic uncertainty modeling derived from higher-order wavelet statistics. Predicted log-variance is integrated directly into both the flow objective and conditioning pathway, enabling adaptive precision consistent with the heavy-tailed and input-dependent variance structure of anatomical detail. This formulation supports full-resolution 3D synthesis under practical memory and time constraints on a single modern GPU. Evaluation on a multi-site cohort demonstrates improved alignment between generated and real MRI distributions, together with enhanced downstream brain age prediction and region-level anatomical agreement relative to diffusion, latent, and wavelet-based baselines. Code is available at https://github.com/sisinflab/WaveDiT

摘要中文:

大规模且人口统计学上均衡的数据集对于可靠的神经影像生物标志物至关重要。全分辨率3D脑MRI合成可在该场景下支持数据增强,但现有方法要么在体数据尺度上计算成本过高,要么依赖可能损害解剖细节的有损潜空间压缩。因此,实用的3D生成式增强通常需要专门的计算基础设施。我们提出了WaveDiT,一种在3D Haar离散小波变换系数空间中运行的条件流匹配框架。该模型将分解式空间-深度注意力与逐频带的异方差不确定性建模相结合,后者由高阶小波统计量导出。预测的对数方差被直接集成到流匹配目标和条件通路中,使自适应精度与解剖细节的重尾、依赖输入的方差结构相一致。该方案可在单块现代GPU的实际内存和时间约束下完成全分辨率3D合成。在多中心队列上的评估表明,相对于扩散、潜空间和小波类基线,生成MRI与真实MRI的分布更加一致,下游脑龄预测和区域级解剖一致性也有所提升。代码见 https://github.com/sisinflab/WaveDiT

Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.08780

中文标题:超越一致性:零样本视频编辑中时间结构的保留

作者:Deyin Liu, Yisheng Ding, Zhe Jin, Xiatian Zhu, Anjan Dutta, Lin Wu

摘要:

Existing zero-shot video editing methods rely on pre-trained diffusion models, successfully achieving spatial control and basic temporal consistency but fundamentally fail to preserve the video's original temporal structure. This distinction is critical: temporal consistency ensures visual smoothness, but temporal structure dictates the video's high-level narrative, rhythm, and semantic flow. Without this preservation, the edited output, especially for long videos with complex semantic variations, becomes narratively incoherent and semantically ambiguous. To address this limitation, we introduce a novel zero-shot editing approach that, for the first time, explicitly focuses on preserving the source video's temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To enhance both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy which leverages the anchor's semantic dominance to stabilize the editing. Furthermore, we employ an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction. Extensive experiments demonstrate that our method achieves state-of-the-art results, successfully balancing the preservation of original temporal structure with computational efficiency, and setting a new benchmark for zero-shot video editing fidelity.

摘要中文:

现有零样本视频编辑方法依赖于预训练扩散模型,成功实现了空间控制和基本的时间一致性,但在根本层面上未能保留视频原始的时间结构。这一区别至关重要:时间一致性确保视觉平滑性,但时间结构决定了视频的高层叙事、节奏和语义流动。若缺乏这种保留,编辑后的输出尤其是具有复杂语义变化的长视频将变得叙事不连贯且语义模糊。针对这一局限性,我们引入了一种新颖的零样本编辑方法,首次明确聚焦于保留源视频的时间结构。我们通过基于特征相似性将视频自适应划分为语义不同的片段,并为每个片段选择代表性锚帧来实现这一目标。为了增强片段内保真度并提升计算效率,我们设计了一种片段自适应token合并策略,利用锚帧的语义主导性来稳定编辑过程。此外,我们采用交替组合策略以确保片段间过渡的流畅性同时保持语义区分度。大量实验表明,我们的方法达到了最先进的结果,成功平衡了原始时间结构的保留与计算效率,并为零样本视频编辑保真度设定了新的基准。

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.08788

中文标题:MaskAlign:用于高效扩散训练的令牌子集表示对齐

作者:Lianyu Pang, Tianlin Pan, Cheng Da, Changqian Yu, Huan Yang, Kun Gai, Song Guo, Wenhan Luo

摘要:

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.

摘要中文:

与预训练视觉模型的表示对齐最近在加速扩散Transformer训练方面展现出强大潜力。通过将中间扩散特征与自监督视觉编码器提取的清晰图像表示对齐,现有方法改善了收敛性和生成质量。然而,这种对齐也引入了一个不可忽视的约束:扩散模型在噪声输入上运行,其可用信息随时间步变化,而参考特征则从清晰图像中提取。本文从令牌级视角重新审视这一不匹配问题。研究发现,在全令牌表示对齐下,具有大对齐梯度范数的令牌表现出稳定的空间偏好,表明对齐目标并非均匀地影响所有令牌,且可能鼓励模型依赖完整的清晰图像令牌集。针对这一问题,我们提出了MaskAlign,一种令牌子集表示对齐方法,在训练过程中对随机采样的令牌子集进行对齐。通过让模型在不同迭代中接触不同的令牌子集,MaskAlign减少了对完整令牌集的依赖,并鼓励产生在令牌子集扰动下更为稳定的对齐行为。为弥补直接丢弃令牌造成的信息损失,我们进一步引入了一个轻量级的掩码前令牌混合块,在掩码前在令牌间共享信息。
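Applying the alignment loss to a random token subset can be sketched as below. The cosine-distance form of the loss, the `keep_ratio` parameter, and the function name are assumptions for illustration; the abstract does not give MaskAlign's exact loss, and the pre-mask token mixing block is omitted.

```python
import numpy as np

def subset_alignment_loss(diffusion_feats, encoder_feats, keep_ratio, rng):
    """Mean (1 - cosine similarity) over a random subset of tokens.

    Both feature arrays have shape (num_tokens, dim); the same token indices
    are selected from each so that corresponding tokens are compared.
    """
    num_tokens = diffusion_feats.shape[0]
    num_keep = max(1, int(round(keep_ratio * num_tokens)))
    idx = rng.choice(num_tokens, size=num_keep, replace=False)
    a, b = diffusion_feats[idx], encoder_feats[idx]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    cosine = np.sum(a * b, axis=1) / norms
    return float(np.mean(1.0 - cosine)), idx

# Hypothetical features: 16 tokens with 8 dimensions each.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
loss_same, kept = subset_alignment_loss(feats, feats, keep_ratio=0.25, rng=rng)
loss_opposite, _ = subset_alignment_loss(feats, -feats, keep_ratio=0.25, rng=rng)
```

Drawing a fresh subset at every iteration means that no fixed group of tokens carries the alignment signal, which is the dependence the abstract aims to reduce.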

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

2026-06-09T04:00:00 · autoregressive, cs.CV, diffusion · 2606.08833

中文标题:CSFlow:将流匹配与人类对比敏感度对齐

作者:Malgorzata Galinska, Bart Pogodzinski, Jan Eric Lenssen

摘要:

We introduce Contrast Sensitive Flow (CSFlow), a weighting scheme that connects the human eye's Contrast Sensitivity Function (CSF) to the iterative denoising steps of flow matching. Because real-world images concentrate signal at low spatial frequencies, these components reach high signal-to-noise ratio earlier during continuous diffusion than high-frequency components. When generating images with diffusion or flow matching models, this induces a soft autoregressive structure in Fourier space, where coarse image content stabilizes before fine detail. Meanwhile, the human visual system is unequally sensitive to spatial frequencies: very low and very high frequencies require significantly higher contrast to be perceived. We for the first time merge these observations through two contributions: (1) a metric that estimates which frequencies are generated at each reverse flow interval and (2) timestep weights obtained by aligning the frequencies generated at each noise level with human contrast sensitivity. We validate our contributions experimentally showing that these weights can improve generative performance by lowering FID by 4.7%, increasing Inception Score by 2.2% and improving GenEval scores by 2.5% using inference-only timestep modification or short fine-tuning. Qualitatively, we find that our CSFlow weights lead to better visual realism and less cartoonish appearance of generated images.

摘要中文:

我们提出了对比敏感流(CSFlow),这是一种加权方案,将人眼的对比敏感度函数(CSF)与流匹配的迭代去噪步骤联系起来。由于真实世界图像的信号集中在低空间频率,这些成分在连续扩散过程中比高频成分更早达到高信噪比。在使用扩散或流匹配模型生成图像时,这会在傅里叶空间中诱导出一种软自回归结构:粗略的图像内容先于细节稳定下来。同时,人类视觉系统对不同空间频率的敏感度并不相同:极低和极高的频率需要明显更高的对比度才能被感知。我们首次通过两项贡献将这些观察结合起来:(1)一个度量,用于估计每个反向流区间内生成了哪些频率;(2)时间步权重,通过将每个噪声水平下生成的频率与人类对比敏感度对齐而得到。实验验证表明,只需在推理阶段修改时间步或进行短时微调,这些权重即可提升生成性能:FID降低4.7%,Inception Score提高2.2%,GenEval分数提高2.5%。定性来看,CSFlow权重能带来更好的视觉真实感,并减轻生成图像的卡通化外观。
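The claim that low frequencies reach a high signal-to-noise ratio earlier can be checked with a small calculation. The sketch assumes a linear interpolation path x_t = t * data + (1 - t) * noise with t = 1 at the data end, white noise of standard deviation `noise_std`, and an image power spectrum proportional to freq ** (-alpha) with alpha = 2; none of these choices is taken from the paper, and the function names are hypothetical.

```python
def snr(freq, t, noise_std=1.0, alpha=2.0):
    """Per-frequency power SNR of x_t = t * data + (1 - t) * noise."""
    signal_power = (t ** 2) * freq ** (-alpha)
    noise_power = ((1.0 - t) ** 2) * noise_std ** 2
    return signal_power / noise_power

def crossing_time(freq, noise_std=1.0, alpha=2.0):
    """Time t at which the SNR of this frequency reaches 1.

    Solving t * A = (1 - t) * noise_std with amplitude A = freq ** (-alpha / 2)
    gives t = noise_std / (noise_std + A).
    """
    amplitude = freq ** (-alpha / 2.0)
    return noise_std / (noise_std + amplitude)

frequencies = [1.0, 2.0, 4.0, 8.0, 16.0]
times = [crossing_time(f) for f in frequencies]
```

Under these assumptions the crossing time is freq / (freq + 1), which grows with frequency: coarse content becomes distinguishable from noise first and fine detail last, matching the soft autoregressive ordering in Fourier space described in the abstract.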

CHROMA: Detecting AI-Generated Images through Inter-Channel Color-Space Correlations

2026-06-09T04:00:00 · cs.CV, cs.LG, diffusion · 2606.08864

中文标题:CHROMA:通过通道间颜色空间相关性检测AI生成的图像

作者:Juan Pablo Sotelo, Marina Gardella, Pablo Musé

摘要:

The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to distinguish synthetic imagery from real photographs. While automated detectors have been proposed, their generalization to unseen generators remains brittle. To address this limitation, we investigate inter-channel color correlations, a lightweight and underexploited forensic cue. We first demonstrate that LPIPS, a widely used perceptual metric, exhibits inconsistent responses to perturbations that selectively alter channel dependence across different color-space parameterizations, indicating that cross-channel statistics are not uniformly constrained by common perceptual training objectives. Motivated by this, we analyze the distributions of pairwise inter-channel correlation features across multiple color spaces. Our analysis reveals systematic, generator-specific differences in these distributions, with RGB and Lab color spaces providing the most apparent separation between real and generated images. Building on this, we introduce Chroma, a detector of AI-generated images which augments standard RGB inputs with inter-channel correlation maps and employs a fixed CNN backbone trained with a modest computational budget. We assess its robustness under both single-generator training and a limited multi-generator supervision regime, where only a few samples from additional generators are available. Across a standard benchmark protocol, correlation-augmented inputs improve real-vs-generated discrimination and robustness, yielding performance competitive with recent detectors while maintaining a simple architecture and training procedure. Code is available at https://github.com/JPSoteloSilva/CHROMA

摘要中文:

扩散模型和大规模生成模型的快速普及,使得区分合成图像与真实照片变得越来越困难。尽管已有多种自动检测器被提出,但它们对未见过的生成器的泛化能力仍然脆弱。为解决这一局限,我们研究了通道间颜色相关性这一轻量且尚未被充分利用的取证线索。我们首先表明,广泛使用的感知度量LPIPS对于在不同颜色空间参数化下选择性改变通道依赖性的扰动,响应并不一致,这说明常见的感知训练目标并未对跨通道统计量施加统一约束。受此启发,我们分析了多个颜色空间中成对通道间相关性特征的分布。分析揭示了这些分布中系统性的、因生成器而异的差异,其中RGB和Lab颜色空间对真实图像与生成图像的区分最为明显。在此基础上,我们提出了Chroma,一个AI生成图像检测器,它用通道间相关性图扩充标准RGB输入,并采用固定的CNN骨干网络,以适中的计算预算进行训练。我们在单生成器训练和有限多生成器监督(仅有少量来自其他生成器的样本可用)两种设置下评估了其鲁棒性。在标准基准协议下,相关性增强的输入提升了真实与生成图像的区分能力和鲁棒性,性能与近期检测器相当,同时保持了简单的架构和训练流程。代码见:https://github.com/JPSoteloSilva/CHROMA
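A pairwise inter-channel correlation feature can be computed in a few lines. The sketch uses a single global Pearson correlation per channel pair in RGB, whereas the paper works with correlation maps and several color spaces; the function name and the synthetic image are assumptions.

```python
import numpy as np

def channel_correlations(image):
    """Pearson correlation for each channel pair of an (H, W, 3) array."""
    flat = image.reshape(-1, 3).astype(np.float64)
    corr = np.corrcoef(flat, rowvar=False)  # 3x3 matrix, columns are channels
    return {"RG": corr[0, 1], "RB": corr[0, 2], "GB": corr[1, 2]}

# Hypothetical image: G is an affine function of R, B is independent noise.
rng = np.random.default_rng(0)
red = rng.random((64, 64))
blue = rng.random((64, 64))
image = np.stack([red, 2.0 * red + 1.0, blue], axis=-1)
features = channel_correlations(image)
```

A detector in the spirit of the abstract would compute such statistics locally, for example per patch, and feed the resulting maps to the CNN alongside the RGB input.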

Rethinking 3D Shape Generation: Diffusion over Superquadrics

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.08957

中文标题:重新思考3D形状生成:基于超二次曲面的扩散模型

作者:Zhiyang Liu, Wanze Li, Yuwei Wu, Chengran Yuan, Jiawei Sun, Rui Zheng, Marcelo H Ang Jr

摘要:

Diffusion models have advanced 3D shape generation, yet most methods still denoise in high-cardinality spaces (e.g., voxel/SDF grids, meshes, or point clouds), which is computationally and memory intensive and makes it difficult to scale in terms of both higher resolution and stronger controllability. We rethink the diffusion representation and propose to move diffusion from dense geometry to compact geometric primitives, representing each shape as a small set of superquadrics. Instead of operating on thousands to millions of geometric representation values, we leverage 7KB superquadric parameters (pose, size, and shape), drastically reducing diffusion-state dimensionality and per-step compute/memory. Our diffusion-over-superquadrics improves scalability by supporting broader capabilities (e.g., resolution-free point-cloud decoding, part-level editing, and constraint-based design) and achieving competitive surface-fidelity and distributional performance on standard benchmarks after point-cloud decoding, while enabling efficient generation within 0.6s per shape for most conditions.

摘要中文:

扩散模型推动了3D形状生成的发展,然而大多数方法仍在高基数空间(如体素/SDF网格、网格或点云)中进行去噪,这导致计算和内存密集,且难以在更高分辨率和更强可控性方面进行扩展。我们重新思考扩散表示,提出将扩散从密集几何迁移到紧凑几何基元,将每个形状表示为少量的超二次曲面。不同于操作数千到数百万个几何表示值,我们利用7KB的超二次曲面参数(姿态、大小和形状),大幅降低了扩散状态维度和每步计算/内存开销。我们的超二次曲面扩散模型通过支持更广泛的能力(如无需分辨率的点云解码、部件级编辑和基于约束的设计)提升了可扩展性,并在点云解码后的标准基准测试中实现了具有竞争力的表面保真度和分布性能,同时能够在大多数条件下以每个形状0.6秒的效率完成生成。
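
The "resolution-free point-cloud decoding" mentioned above follows from the fact that a superquadric is a closed-form surface: given its size and shape parameters, points can be sampled at any density. The sketch below uses the standard textbook superquadric parameterization and its inside-outside function. It is not the authors' decoder, and the function names and the uniform angular sampling are assumptions.

```python
import numpy as np

def _signed_pow(base, exponent):
    """sign(base) * |base| ** exponent, the usual superquadric helper."""
    return np.sign(base) * np.abs(base) ** exponent

def sample_superquadric(scale, eps, rotation=None, translation=None,
                        n_eta=32, n_omega=64):
    """Sample surface points of one superquadric.

    scale: (a1, a2, a3) semi-axis lengths; eps: (e1, e2) shape exponents.
    rotation (3x3) and translation (3,) give the pose; both are optional.
    Returns an array of shape (n_eta * n_omega, 3).
    """
    a1, a2, a3 = scale
    e1, e2 = eps
    eta = np.linspace(-np.pi / 2, np.pi / 2, n_eta)
    omega = np.linspace(-np.pi, np.pi, n_omega, endpoint=False)
    eta, omega = np.meshgrid(eta, omega, indexing="ij")
    x = a1 * _signed_pow(np.cos(eta), e1) * _signed_pow(np.cos(omega), e2)
    y = a2 * _signed_pow(np.cos(eta), e1) * _signed_pow(np.sin(omega), e2)
    z = a3 * _signed_pow(np.sin(eta), e1)
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    if rotation is not None:
        pts = pts @ np.asarray(rotation).T
    if translation is not None:
        pts = pts + np.asarray(translation)
    return pts

def inside_outside(points, scale, eps):
    """Inside-outside function F for canonical-pose points: F == 1 on the surface."""
    e1, e2 = eps
    x, y, z = (np.abs(points) / np.asarray(scale)).T
    return (x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1) + z ** (2 / e1)
```

Raising `n_eta` and `n_omega` yields a denser point cloud from the same handful of parameters, which is why the decoded resolution is independent of the size of the diffusion state.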

Leveraging NeRF-Rendered Images for 3D Gaussian Splatting

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.09034

中文标题:利用NeRF渲染图像进行3D高斯溅射

作者:Mizuki Morikawa, Yuta Shimizu, Chunyu Li, Yusuke Monno, Masatoshi Okutomi

摘要:

Neural radiance field (NeRF) and 3D Gaussian splatting (3DGS) are two mainstream approaches for novel view synthesis. They often show complementary performance, i.e., 3DGS demonstrating faster rendering speed and NeRF demonstrating higher rendering quality. Motivated by this, we propose leveraging NeRF-rendered images for 3DGS. Specifically, we target street scenes and utilize a pre-trained street-specific NeRF method to produce training images for a target 3DGS method. In our 3DGS training, NeRF-rendered images are used to remove transient objects in street-level input views and to generate bird's-eye views as additional views, inheriting the higher-quality rendering of NeRF into 3DGS. We further incorporate a diffusion-based image enhancement to improve the image quality of the additional views. Experimental results on one synthetic and two real datasets demonstrate that our proposed method improves street-scene rendering while preserving the speed of 3DGS and the quality of NeRF.

摘要中文:

神经辐射场(NeRF)和3D高斯溅射(3DGS)是新视角合成的两种主流方法。它们通常表现出互补的性能,即3DGS具有更快的渲染速度,而NeRF具有更高的渲染质量。基于此,我们提出利用NeRF渲染图像来增强3DGS。具体而言,我们针对街景场景,利用预训练的街景专用NeRF方法为目标3DGS方法生成训练图像。在我们的3DGS训练中,NeRF渲染图像被用于移除街景级输入视图中的瞬态物体,并生成鸟瞰图作为额外视图,从而将NeRF的高质量渲染特性继承到3DGS中。我们进一步结合基于扩散的图像增强来提高额外视图的图像质量。在一个合成数据集和两个真实数据集上的实验结果表明,我们提出的方法在保持3DGS速度的同时提升了街景渲染质量,并继承了NeRF的渲染质量。

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

2026-06-09T04:00:00 · cs.CV, cs.LG, diffusion · 2606.09056

中文标题:MilliVid:用于视频生成中长程一致性的层级潜变量

作者:Ishaan Preetam Chandratreya, David Charatan, Basile Van Hoorick, Sergey Zakharov, Vitor Guizilini, Phillip Isola, Vincent Sitzmann

摘要:

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.

摘要中文:

视频生成模型已变得日益强大,但长程一致性仍难以实现,因为即使仅几十帧也需要过长的Transformer序列长度。我们表明,通过在多尺度token空间中采用从粗到细的生成方式可以缓解这一问题。我们的方法很简单:首先,我们预训练一个自编码器,将每一帧压缩为一个层级的token,层级从典型的潜空间分辨率到每帧仅几个token不等。最粗粒度的层级捕获最关键的信息,如场景布局和语义,而更细的层级则添加高频外观和纹理。然后,我们训练一个视频扩散模型来生成这些token,采用从粗到细的生成方式。通过仔细控制每个生成步骤中生成帧和用作上下文的细节层级,我们能够在几何和物体持久性方面保持长程一致性,同时在感知相关性较低的细节上减少计算资源消耗。我们在自定义的Minecraft长视频数据集上验证了这一方法,与现有基线方法相比,该方法产生了明显更加一致的生成结果。

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

2026-06-09T04:00:00 · autoregressive, cs.CV, diffusion · 2606.09150

中文标题:Ultra Flash:扩展实时流式视频生成至高分辨率

作者:Luxury, Jie Huang, Zihao Fan, Xiaoxiao Ma, Yuming Li, Jun-hao Zhuang, Zeyue Xue, Siming Fu, Haoran Li, Mingchen Zhong, Guohui Zhang, Shichen Ma, Yijun Liu, Jiaqi Shi, Yanwen Ma, Yaofeng Su, Haoyu Wang, Yaowei Li, Songchun Zhang, Weiyang Jin, Yuxuan Bian, Shiyi Zhang, Haojun Xu, Shuai Lu, Xin Han, Wei Tang, Haoyang Huang, Nan Duan

摘要:

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency.

摘要中文:

尽管当前的自回归视频扩散模型已实现了卓越的流式质量,但其仍局限于低分辨率(如480P),高效、可扩展的实时高分辨率视频生成仍是一个根本性的开放挑战。为弥补这一差距,我们提出了Ultra Flash,一个能够实现实时高分辨率视频生成的级联流式框架。Ultra Flash在单GPU上于1K分辨率达到约30 FPS、于2K分辨率达到约18 FPS,其核心贡献包括三个方面:(1)一种架构保持的T2V到TV2V超分辨率训练范式,结合AIGC导向的数据降级管道,能够有效保留基础模型的生成能力,使其在与主流低分辨率生成模型级联时能够增强高分辨率细节;(2)因果流式潜在上采样器与高分辨率解码器配对使用,可增强时空一致性,同时实现高效的潜在空间缩放和精确的高分辨率解码,且计算开销可忽略不计;(3)一种级联高分辨率流式视频生成优化方案,首先对超分辨率模型进行混合奖励增强的稀疏因果化和单步蒸馏,随后引入带动态缓存管理的级联流式自强制偏好优化,共同增强整体一致性、提升质量并实现实时高分辨率流式视频生成。大量实验表明,Ultra Flash能够在保持最先进视觉质量和卓越效率的同时可靠地生成超高清流式视频。

CP4D: Compositional Physics-aware 4D Scene Generation

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.09187

中文标题:CP4D:组合式物理感知的4D场景生成

作者:Hanxin Zhu, Cong Wang, Tianyu He, Long Chen, Xin Jin, Chen Gao, Zhibo Chen

摘要:

4D generation (i.e., dynamic 3D generation) has recently emerged as a rapidly growing research frontier due to its powerful spatiotemporal modeling capabilities. However, despite notable advances, existing approaches typically fail to capture the underlying physical principles, producing results that are both physically inconsistent and visually implausible. To overcome this limitation, we present CP4D, a novel paradigm for photorealistic 4D scene synthesis with faithful adherence to complex physical dynamics. Drawing inspiration from the compositional nature of real-world scenes, where immutable static backgrounds coexist with dynamic, physically plausible foregrounds, CP4D reformulates 4D generation as the integration of a static 3D environment with physically grounded dynamic objects. On this basis, our framework follows a three-stage pipeline: 1) First, we leverage pre-trained expert models to generate high-fidelity 3D representations of the environment and foreground objects, respectively. 2) Subsequently, to produce physically plausible trajectories and realistic interactions for these objects, we propose a hybrid motion synthesis strategy that integrates priors from physical simulators with the common sense embedded in video diffusion models. 3) Finally, we develop an automated composition mechanism that seamlessly fuses the static environment and dynamic objects into coherent, physically consistent 4D scenes. Extensive experiments demonstrate that CP4D can generate explorable and interactive 4D scenes with high visual fidelity, strong physical plausibility, and fine-grained controllability, significantly outperforming existing methods. Project page: https://anonymous.4open.science/w/CP4D/

摘要中文:

4D生成(即动态3D生成)因其强大的时空建模能力而成为近年来迅速发展的研究前沿。然而,尽管取得了显著进展,现有方法通常无法捕捉底层物理原理,产生的结果在物理上不一致且视觉上不可信。为克服这一局限性,我们提出了CP4D,这是一种在逼真4D场景合成中忠实遵循复杂物理动力学的新范式。借鉴现实世界场景的组合性质——不可变的静态背景与动态、物理上可信的前景共存——CP4D将4D生成重新表述为静态3D环境与基于物理的动态对象的集成。在此基础上,我们的框架遵循三阶段流程:1)首先,我们利用预训练专家模型分别生成环境和前景对象的高保真3D表示;2)随后,为了产生这些对象物理上可信的轨迹和逼真的交互,我们提出了一种混合运动合成策略,将物理模拟器的先验知识与视频扩散模型中嵌入的常识相融合;3)最后,我们开发了一种自动组合机制,将静态环境和动态对象无缝融合为连贯、物理一致的4D场景。大量实验表明,CP4D能够生成具有高视觉保真度、强物理可信性和细粒度可控性的可探索和可交互4D场景,显著优于现有方法。项目页面:https://anonymous.4open.science/w/CP4D/。

LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.09250

中文标题:LiteVSR:面向视频超分辨率的冻结扩散Transformer轻量级适配

作者:Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong, Jifei Song

摘要:

Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality (LQ) to High-Quality mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers, since the absence of an encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. By predicting a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable an adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while remaining compatible with fast sampling (down to a single step).

摘要中文:

将大规模预训练视频生成器适配到新领域的视频超分辨率(VSR)任务中仍存在计算成本过高的问题。将生成重新定义为直接的从低质量到高质量的映射方案偏离了原始生成公式,需要大量的微调工作。ControlNet风格的适配器在现代扩散Transformer中失去了效率优势,因为缺乏编码器-解码器层次结构迫使人们复制整个骨干网络。我们注意到流匹配为跨域VSR适配提供了一种原则性的替代方案。通过预测所有时间步的恒定速度场,适配任务简化为学习固定注入模式而非时变变换。基于这一洞察,我们提出了LiteVSR,一个极简框架,使用完全冻结的扩散Transformer配合轻量级状态感知适配器执行VSR任务。该适配器采用双流架构,从低质量输入中提取静态结构线索,从中间去噪状态中提取动态线索,通过时间依赖的交叉注意力进行对齐,使模型能够在去噪过程中自适应地从结构对齐过渡到纹理细化。LiteVSR仅使用11.25%的可训练参数和在单块A100上12 GPU小时的训练即可达到具有竞争力的恢复质量,同时保持快速采样能力(可低至单步采样)。
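
The remark about a "constant velocity field" can be illustrated with the straight-line (rectified-flow style) form of flow matching. Along the path x_t = (1 - t) x_0 + t x_1 the target velocity is x_1 - x_0, which does not depend on t. The NumPy sketch below shows this and why such a field is compatible with single-step sampling. It assumes the linear-path formulation, takes x_0 as the source sample and x_1 as the target, and uses an oracle velocity where a trained network would normally go. None of this is taken from the LiteVSR code.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Flow-matching regression target for the linear path.

    d/dt interpolate(x0, x1, t) = x1 - x0, the same for every t.
    """
    return x1 - x0

def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

With an exactly constant velocity, one Euler step and many Euler steps reach the same endpoint. A learned velocity is only approximately constant, so in practice few-step sampling trades some accuracy for speed.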

EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.09273

中文标题:EditSSC:基于无条件扩散模型的可编辑语义占用场景生成方法

作者:Fatima Balde, Raoul de Charette, Alexandre Boulch

摘要:

3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex 3D-specific architectures such as triplane encoders and adapted diffusion networks, limiting both their simplicity and their editing capabilities. We propose EditSSC, an editing-ready method for 3D semantic scene generation using 2D Bird's Eye View (BEV) representations and an off-the-shelf latent diffusion network. Our approach reshapes 3D semantic occupancy grids into multi-channel BEV images and leverages the quantized autoencoder and UNet from Stable Diffusion with minimal modifications. We perform diffusion on the latents after quantization, which enables training-free editing capabilities. By exploiting class-to-code correspondences in the codebook, our method supports sketch-guided generation, inpainting, and outpainting without any retraining. On SemanticKITTI, EditSSC outperforms existing 3D-specific baselines on unconditional generation, demonstrating that well-established 2D architectures can be effectively repurposed for 3D scene generation and editing.

摘要中文:

3D语义场景生成对自动驾驶应用至关重要,然而现有方法大多依赖三平面编码器和适配扩散网络等复杂的3D专用架构,这限制了其简洁性和编辑能力。我们提出EditSSC,一种基于2D鸟瞰图(BEV)表示和现成潜在扩散网络、可直接用于编辑的3D语义场景生成方法。我们的方法将3D语义占用网格重塑为多通道BEV图像,并仅以极少的修改复用Stable Diffusion中的量化自编码器和UNet。我们在量化之后的潜变量上执行扩散,从而获得无需训练的编辑能力。借助码本中的类别-码字对应关系,我们的方法支持草图引导生成、图像修复和图像外扩,且无需任何重训练。在SemanticKITTI数据集上,EditSSC在无条件生成任务上优于现有的3D专用基线方法,证明了成熟的2D架构可以有效迁移用于3D场景生成和编辑。
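
One simple reading of "reshapes 3D semantic occupancy grids into multi-channel BEV images" is to treat the height axis as the channel axis, so that a 2D network sees an (X, Y) image with Z channels. The sketch below shows that reshaping and its inverse. The axis order, the scaling of class indices to [-1, 1], and the function names are assumptions made for illustration. The abstract does not specify the exact encoding.

```python
import numpy as np

def occupancy_to_bev(grid, num_classes):
    """Turn an (X, Y, Z) grid of integer class labels into a (Z, X, Y) image.

    Each height slice becomes one image channel. Labels are scaled to
    [-1, 1]; this scaling is a hypothetical choice for the sketch.
    """
    bev = np.transpose(grid, (2, 0, 1)).astype(np.float32)
    return bev / (num_classes - 1) * 2.0 - 1.0

def bev_to_occupancy(bev, num_classes):
    """Invert occupancy_to_bev: round back to labels and restore (X, Y, Z)."""
    labels = np.rint((bev + 1.0) / 2.0 * (num_classes - 1)).astype(np.int64)
    return np.transpose(labels, (1, 2, 0))
```

Because the mapping is a pure re-indexing, nothing is lost before the autoencoder is applied, and edits made in the 2D image plane (for example a masked region for inpainting) correspond to vertical columns of the 3D scene.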

SwiftVR: Real-Time One-Step Generative Video Restoration

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.09516

中文标题:SwiftVR:实时一步生成式视频恢复

作者:Jiaqi Yan, Xiangyu Chen, Xinlin Zhong, Haibin Huang, Chi Zhang, Jie Liu, Jiantao Zhou, Xuelong Li

摘要:

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. Project is available at https://h-oliday.github.io/SwiftVR.

摘要中文:

面向直播流的实时视频恢复(VR)需要在严格的单帧延迟约束下输出高分辨率结果。现有基于一步扩散的VR模型由于两大瓶颈难以在消费级GPU上部署:高分辨率下的二次方空间注意力以及大型视频自编码器的延迟-内存开销。我们提出SwiftVR,一个流式一步生成式VR框架,在因果分块协议下同时降低这两个瓶颈。对于注意力机制,无掩码移位窗口自注意力通过确定性索引将每个空间窗口聚合成密集张量,使所有注意力调用保持在密集缩放点积注意力路径上,无需掩码、循环移位、填充或硬件特定的稀疏核。由于SwiftVR仅使用标准密集SDPA调用,训练后的模型可迁移至消费级GPU而无需重训练或自定义核。对于自编码器,轻量级恢复感知自编码器能够在保持重建质量的同时实现快速分块解码。在单块H100上,SwiftVR在2560x1440分辨率下可维持31帧/秒,在3840x2160分辨率下达14帧/秒,而所有对比的基于扩散的VR基线在4K分辨率下均超出内存限制。在消费级RTX 5090上,SwiftVR在1920x1080分辨率下达26帧/秒。据我们所知,SwiftVR是首个在消费级GPU上实现1080p实时流式处理的生成式VR模型,同时以更低的推理成本获得了更强的无参考感知质量。项目主页:https://h-oliday.github.io/SwiftVR

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.09608

中文标题:TUDSR:用于更高分辨率超分辨率的两次上采样扩散方法

作者:Zhiqiang Wu, Yitong Dong, Xian Wei

摘要:

Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g., 2048²) outputs often remains extremely poor, primarily due to two factors: the image upsampling ratio (e.g., ×8) exceeding the model's native-supported upsampling ratio (e.g., ×4), and the model's native-supported resolution. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making training hard on resource-limited hardware. Thus, we present TUDSR, a Twice Upsampling-Diffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at R-resolution, and the second introduces a looped chunk-based training strategy at NR-resolution. Each stage adapts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at resolutions of 1024² and even 2048², significantly outperforming existing approaches. Code is available at https://github.com/wuer5/TUDSR.

摘要中文:

基于扩散的生成模型在真实世界图像超分辨率(SR)任务中取得了显著成功。通过分块扩散技术,这些模型可以生成超出其原生支持分辨率的高分辨率图像。然而,此类高分辨率(如2048²)输出的质量通常仍然极差,主要原因可归结为两个方面:我们认为图像上采样倍数(如×8)超出了模型原生支持的上采样倍数(如×4),以及模型的原生支持分辨率。在实践中,训练原生高分辨率模型需要更大的架构,会产生显著的计算开销和GPU显存成本,使其难以在资源受限的设备上运行。因此,我们提出了TUDSR,一个用于更高分辨率超分辨率的两次上采样扩散框架。TUDSR框架主要包含两个阶段:第一阶段在R分辨率下进行训练,第二阶段引入循环分块训练策略在NR分辨率下进行训练。每个阶段采用一步GAN架构,包含一个生成器和一个判别器。基于SD2.1-base,我们开发了TUDSR-S,在多个基准测试上实现了最先进的性能。大量实验进一步表明,TUDSR-S能够在1024²甚至2048²分辨率下生成高质量图像,显著优于现有方法。代码可访问 https://github.com/wuer5/TUDSR。

Cranio-Diff: Diffusion-based Cross-domain Craniofacial Reconstruction with 2D X-ray Skull Guidance and Structural Identity Constraints

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.09699

中文标题:Cranio-Diff:结合2D X射线颅骨引导与结构身份约束的基于扩散的跨域颅颌面重建

作者:Ravi Shankar Prasad, Naresh Gurjar, Shashank Baghel, Chirag, Dinesh Singh

摘要:

State-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models, have demonstrated remarkable performance in face generation. However, they fail to effectively capture cross-modality semantic information in craniofacial reconstruction when translating from the skull (X-ray) domain to the face (optical) domain, due to a mismatch in the alignment of structural identity across modalities. To address this issue, we propose Cranio-Diff, a diffusion-based framework for cross-domain craniofacial reconstruction from 2D X-ray skull images. The proposed approach integrates skull-conditioned structural guidance through ControlNet with biometric text conditioning to generate a face that is more semantically and structurally aligned with the given skull. Cranio-Diff is evaluated on a skull-face dataset obtained from X-ray scans of 120 subjects in lateral and frontal views. To enable controlled evaluation, each face image is synthesised across three age groups (25, 45, 65) and three BMI variations (-10%, baseline, and +10%), yielding 4320 paired samples. To the best of our knowledge, this is the only X-ray-face dataset of this magnitude. Extensive experiments show that the proposed method outperforms recent approaches in both generated image quality and the retrieval task. Generated image quality is evaluated using FID, IS, SSIM, LPIPS, PSNR, and ArcFace score, and retrieval performance using recall@k, mAP@k, and MRR@k. The experimental results demonstrate that the proposed method can serve as an alternative tool to aid forensic investigations.

摘要中文:

当前最先进的生成模型(如CycleGAN、Pix2Pix和扩散模型)在面部生成任务中展现了卓越的性能。然而,由于跨模态结构身份对齐不匹配,这些模型在从颅骨(X射线)域到面部(光学)域的转换中进行颅颌面重建时,无法有效捕获跨模态语义信息。为解决这一问题,我们提出了Cranio-Diff,一个基于扩散的跨域颅颌面重建框架,可从2D X射线颅骨图像生成面部。该方法通过ControlNet整合颅骨条件结构引导和生物特征文本条件,以生成在语义和结构上与给定颅骨更加对齐的面部。所提出的Cranio-Diff方法在来自120名受试者的X射线扫描的颅骨-面部数据集上进行了评估,涵盖侧位和正位视图。为实现可控评估,每张面部图像跨越三个年龄组(25岁、45岁、65岁)和三种BMI变化(-10%、基准值、+10%)进行合成,共生成4320个配对样本。据我们所知,这是该规模下唯一的X射线-面部数据集。大量实验表明,所提方法在生成图像质量和检索任务上均优于现有方法。最后,为评估所提方法的性能,我们使用FID、IS、SSIM、LPIPS、PSNR和ArcFace分数评估了生成图像的质量。此外,检索性能使用recall@k、mAP@k和MRR@k进行评估。实验结果表明,所提方法可作为法医调查中的辅助工具。
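
For readers unfamiliar with the retrieval metrics named above, the following NumPy sketch computes recall@k and MRR@k from a query-by-gallery similarity matrix. It uses the common textbook definitions and assumes exactly one correct gallery item per query, stored at the same index as the query. The paper's exact protocol is not given in the abstract, and `retrieval_metrics` is a hypothetical helper name.

```python
import numpy as np

def retrieval_metrics(similarity, k):
    """Compute (recall@k, MRR@k) for a square similarity matrix.

    similarity[i, j] is the score between query i and gallery item j.
    Assumption for this sketch: the correct match for query i is item i.
    """
    n = similarity.shape[0]
    # Gallery indices sorted from most to least similar, per query.
    order = np.argsort(-similarity, axis=1)
    # 1-based rank at which the correct item appears.
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    hit = ranks <= k
    recall_at_k = hit.mean()
    # Reciprocal rank counts only when the match is inside the top k.
    mrr_at_k = np.where(hit, 1.0 / ranks, 0.0).mean()
    return recall_at_k, mrr_at_k
```

In a skull-to-face setting, the similarity matrix could be built from face-embedding similarity between each generated face and a gallery of real faces. With a single relevant item per query, average precision at k reduces to the same reciprocal-rank term used here.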

Echo-Memory: A Controlled Study of Memory in Action World Models

2026-06-09T04:00:00 · cs.CV, cs.GR, cs.LG, diffusion, image_compression · 2606.09803

中文标题:Echo-Memory:动作世界模型中记忆机制的控制变量研究

作者:Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

摘要:

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.

摘要中文:

我们提出Echo-Memory,这是一项针对动作条件世界模型中记忆机制的控制变量研究。这些模型根据首帧、文本提示和相机动作序列生成多段视频,但其核心失效往往出在记忆而非局部图像合成上:相机离开再返回后,场景或显著物体可能已悄然改变。现有的记忆设计难以相互比较,因为性能提升与骨干网络、训练、检索和评估上的差异相互纠缠。Echo-Memory固定了动作到视频的接口,仅改变历史信息的存储方式以及生成器读取它的方式。在共享的视频扩散骨干网络、优化器、相机动作表示、采样器和评估流程下,我们比较了原始上下文、基于压缩的记忆、具有不同读出路径的空间摘要以及状态空间递归。这一配对矩阵分离了四个原本相互混淆的维度:容量、压缩、读出和递归。我们还通过三分支协议评估记忆:重放质量、域内循环重访和开放域返回探针。这些分支的结论经常不一致,表明重放保真度不足以代表对一个世界的记忆。由此得出三点发现。第一,原始上下文是一个强有力的容量基线,它对开放域返回的提升远大于对重放指标的提升。第二,紧凑性并不能无代价地替代容量:激进的空间记忆和混合压缩记忆会丢失返回所需的显著证据。最后,分块状态空间递归是我们矩阵中最强的开放域返回机制,表明隐式记忆的结构与是否使用隐式记忆这一决定同等重要。这些结果为研究动作世界模型中的记忆提供了一套精简的协议,不再局限于孤立的重放指标。

Latent Spatial Memory for Video World Models

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.09828

中文标题:视频世界模型的潜在空间记忆

作者:Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen, Zeyu Zhang, Yefei He, Zicheng Duan, Donny Y. Chen, Yuqing Yang, Bohan Zhuang

摘要:

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and a 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.

摘要中文:

在生成帧之间保持3D空间一致性的视频世界模型通常依赖于在RGB空间中构建的显式点云记忆。这种设计计算成本高昂,需要重复渲染和VAE编码,且本质上是有损的,因为通过像素空间的往返会丢弃学习到的潜在表示的丰富特征。本文为视频世界模型引入了潜在空间记忆,这是一种直接在扩散潜在空间中存储场景信息的持久3D缓存,避免了像素空间重建。基于此,我们提出了Mirage,一个潜空间空间记忆框架,该框架通过深度引导反向投影将潜在token提升到3D来构建记忆,并通过直接潜空间扭曲合成新视角来进行查询。这种统一公式消除了像素空间重建的信息丢失和重复编码与渲染的计算负担。实验表明,潜在空间记忆相对于显式3D基线实现了高达10.57倍的端到端视频生成速度提升和55倍的内存占用减少。利用扩散模型的几何先验,Mirage在WorldScore上达到了最先进的性能,并在RealEstate10K上获得了强大的重建质量。
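
"Depth-guided back-projection" is, at its core, the standard pinhole-camera operation of lifting each 2D grid cell to a 3D point using its depth, the camera intrinsics, and the camera pose. Re-projecting those points into another camera is the geometric basis of warping. The NumPy sketch below shows both operations for a generic grid. Applying it to a latent token grid would additionally require depth and intrinsics at the latent resolution, which is assumed here and not taken from the Mirage implementation. All function names are hypothetical.

```python
import numpy as np

def pixel_centers(h, w):
    """(h * w, 2) array of (u, v) cell centres in row-major order."""
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([u + 0.5, v + 0.5], axis=-1).reshape(-1, 2)

def backproject(depth, K, cam_to_world):
    """Lift every grid cell to a 3D world point.

    depth: (h, w) z-depth along the optical axis; K: 3x3 intrinsics;
    cam_to_world: 4x4 pose. Returns (h * w, 3) world coordinates.
    """
    h, w = depth.shape
    ones = np.ones((h * w, 1))
    uv1 = np.concatenate([pixel_centers(h, w), ones], axis=1)
    rays = uv1 @ np.linalg.inv(K).T          # camera-space rays with z == 1
    pts_cam = rays * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts_cam, ones], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]

def project(points_world, K, world_to_cam):
    """Project world points into a camera; returns (uv, z_depth)."""
    ones = np.ones((len(points_world), 1))
    pts_cam = (np.concatenate([points_world, ones], axis=1) @ world_to_cam.T)[:, :3]
    uvz = pts_cam @ K.T
    return uvz[:, :2] / uvz[:, 2:3], pts_cam[:, 2]
```

A memory built this way stores one feature vector per 3D point. Querying a new view amounts to calling `project` with the new camera and scattering the stored features to the resulting grid locations, with depth used to resolve occlusions.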

Where the Score Lives: A Wavelet View of Diffusion

2026-06-09T04:00:00 · cs.CV, cs.LG, diffusion · 2606.08309

中文标题:分数归处:扩散模型的小波视角

作者:Emma Finn, Binxu Wang, T. Anderson Keller, Demba E. Ba

摘要:

Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures including CNNs, U-Nets, and Transformers have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parameterization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.

摘要中文:

基于分数的生成模型在过去十年中在生成多样化视觉逼真的图像方面取得了显著成功。在此类扩散建模中,CNN、U-Nets和Transformer等多种架构已被用作分数近似网络;然而迄今为止,这些架构选择如何影响生成行为的相关研究仍较为有限。为解决这一问题,本工作提出了一种利用二维正交小波基展开对分数函数进行解析可解参数化的方法。具体而言,我们从数据分布的矩出发推导了可解释的最优分数函数。我们利用这种参数化方法提供了一种与架构无关的、基于矩的分析,揭示了数据分布的哪些属性对去噪最为重要。我们的分数机器足够灵活,能够部分模拟多种架构(包括U-Nets和CNN)的相关归纳偏置,为理解不同分数架构为何表现出不同的生成行为迈出了重要一步。由于我们的分数函数可以基于数据分布的矩进行解析求解,我们得以开始理解数据分布与分数网络之间的相互作用,从而产生我们在扩散模型中观察到的行为。
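
As a concrete instance of a "2D orthogonal wavelet basis", the sketch below implements one level of the orthonormal 2D Haar transform and its inverse in NumPy. Haar is chosen here only because it is the simplest orthogonal wavelet. The abstract does not state which wavelet family the authors use, and the function names are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar2d(x):
    """One level of the orthonormal 2D Haar transform.

    x: array of shape (H, W) with even H and W.
    Returns four (H // 2, W // 2) sub-bands: (ll, lh, hl, hh).
    """
    a = (x[0::2] + x[1::2]) / SQRT2          # low-pass along rows
    d = (x[0::2] - x[1::2]) / SQRT2          # high-pass along rows
    ll = (a[:, 0::2] + a[:, 1::2]) / SQRT2
    lh = (a[:, 0::2] - a[:, 1::2]) / SQRT2
    hl = (d[:, 0::2] + d[:, 1::2]) / SQRT2
    hh = (d[:, 0::2] - d[:, 1::2]) / SQRT2
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Inverse of haar2d; reconstructs the (H, W) array exactly."""
    h2, w2 = ll.shape
    a = np.empty((h2, 2 * w2))
    d = np.empty((h2, 2 * w2))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    d[:, 0::2], d[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((2 * h2, 2 * w2))
    x[0::2], x[1::2] = (a + d) / SQRT2, (a - d) / SQRT2
    return x
```

Because the transform is orthonormal, it preserves squared norms and maps isotropic Gaussian noise to isotropic Gaussian noise. The forward noising process therefore looks the same in wavelet coordinates as in pixel coordinates, which is one reason such a basis is convenient for analysing the score scale by scale.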

Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles

2026-06-09T04:00:00 · cs.CV, cs.LG, diffusion · 2606.09718

中文标题:通过自监督原则评估扩散模型的表示空间

作者:Xiao Li, Yixuan Jia, Zekai Zhang, Xiang Li, Lianghe Shi, Jinxin Zhou, Zhihui Zhu, Liyue Shen, Qing Qu

摘要:

Diffusion models have demonstrated remarkable generative capabilities and have also emerged as powerful self-supervised representation learners, yet the connection between these two abilities remains less explored. Drawing inspiration from self-supervised learning (SSL), we introduce a framework for jointly evaluating the representation and generation capabilities of diffusion models. Specifically, we decompose features into invariant and residual components and derive the Invariant Contamination Ratio (ICR), a Fisher-based metric that quantifies how residual variation contaminates invariant signal in feature space. We use this framework to analyze both discriminative and generative behavior of diffusion models. On the representation side, we find that invariance peaks at intermediate noise levels, which also yield the best downstream classification performance. On the generative side, we study how training transitions from genuine generalization to memorization in data-limited regimes, and show that ICR serves as a sensitive training-time indicator of early learning: increasing residual energy along Fisher directions marks the onset of memorization, detectable from training features alone without external evaluators or held-out test sets. Overall, our results show that diffusion models can be monitored from a self-supervised perspective through the geometry of their learned representations.

摘要中文:

扩散模型已展现出显著的生成能力,并已成为强大的自监督表示学习器,然而这两种能力之间的关联仍待深入探索。借鉴自监督学习(SSL)的思想,我们引入了一个用于联合评估扩散模型表示能力与生成能力的框架。具体而言,我们将特征分解为不变分量和残余分量,并推导了不变污染比率(ICR),这是一种基于Fisher的度量指标,用于量化残余变异如何在特征空间中污染不变信号。我们利用该框架分析扩散模型的判别行为和生成行为。在表示方面,我们发现不变性在中等噪声水平达到峰值,而这些噪声水平也对应最佳的下游分类性能。在生成方面,我们研究了在数据有限条件下,训练如何从真正的泛化转变为记忆化,并表明ICR可作为早期学习的敏感训练时间指标:沿Fisher方向的残余能量增加标志着记忆化的开始,仅从训练特征即可检测到这一现象,无需外部评估器或保留的测试集。总体而言,我们的研究表明可以通过学习表示的几何结构从自监督视角对扩散模型进行监控。

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

2026-06-09T04:00:00 · cs.CV, cs.RO, diffusion · 2606.09827

中文标题:MemoryVLA++:视觉-语言-动作模型中基于记忆和想象的时序建模

作者:Hao Shi, Weiye Li, Bin Xie, Yulin Wang, Renping Zhou, Tiancai Wang, Xiangyu Zhang, Ping Luo, Gao Huang

摘要:

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web

摘要中文:

时序建模对于机器人操作至关重要,因为有效的控制既需要记忆过去的交互,也需要想象未来的状态。然而,大多数VLA模型主要依赖当前观测,因此在长时序、时序依赖的任务中表现不佳。认知科学表明,人类依赖工作记忆来缓冲短暂的上下文,利用海马体系统保存过去经验的情景记忆,并使用内部模型来想象可能的未来状态。受这些机制启发,我们提出了MemoryVLA++,这是一个完整的时序建模框架,为VLA模型配备记忆和想象能力以用于机器人操作。预训练的VLM将当前观测编码为感知token和认知token,形成工作记忆。这些token查询感知-认知记忆库以检索相关的历史上下文。该记忆库存储过去交互的低级细节和高级语义,并通过冗余感知的整合进行更新。世界模型在去噪潜空间中想象未来状态,在记忆引导下将想象的潜变量整合为完整的时序感知token。生成的token条件化扩散动作专家来预测时间一致的动作序列。我们在5个模拟基准测试和涵盖通用操作、长时序任务、鲁棒性和泛化性的3类真实机器人任务上进行了广泛实验,跨越3种机器人。我们的方法在Libero、SimplerEnv、Mikasa-Robo、Calvin、Libero-Plus和多种真实机器人任务上取得了优异性能,验证了带记忆和想象的完整时序建模的有效性。例如,在真实机器人上,它在通用任务、记忆依赖任务和想象依赖任务上分别实现了+9%、+26%、+28%的提升。

Coop-WD: Cooperative Perception with Weighting and Denoising for Robust V2V Communication

2026-06-09T04:00:00 · cs.CV, diffusion · 2505.03528

中文标题:Coop-WD:用于稳健V2V通信的加权去噪协同感知

作者:Chenguang Liu, Jianjun Chen, Yunfei Chen, Yubei He, Zhuangkun Wei, Hongjian Sun, Haiyan Lu, Qi Hao

摘要:

Cooperative perception, leveraging shared information from multiple vehicles via vehicle-to-vehicle (V2V) communication, plays a vital role in autonomous driving to alleviate the limitation of single-vehicle perception. Existing works have explored the effects of V2V communication impairments on perception precision, but they lack generalization to different levels of impairments. In this work, we propose a joint weighting and denoising framework, Coop-WD, to enhance cooperative perception subject to V2V channel impairments. In this framework, the self-supervised contrastive model and the conditional diffusion probabilistic model are adopted hierarchically for vehicle-level and pixel-level feature enhancement. An efficient variant model, Coop-WD-eco, is proposed to selectively deactivate denoising to reduce processing overhead. Rician fading, non-stationarity, and time-varying distortion are considered. Simulation results demonstrate that the proposed Coop-WD outperforms conventional benchmarks in all types of channels. Qualitative analysis with visual examples further proves the superiority of our proposed method. The proposed Coop-WD-eco achieves up to 50% reduction in computational cost under severe distortion while maintaining comparable accuracy as channel conditions improve.

摘要中文:

协同感知通过车对车(V2V)通信共享多辆车上的信息,在自动驾驶中发挥着重要作用,能够缓解单车感知的局限性。现有研究已探索了V2V通信损伤对感知精度的影响,但缺乏对不同程度损伤的泛化能力。在本文中,我们提出了一种联合加权与去噪框架Coop-WD,以增强V2V信道损伤下的协同感知能力。在该框架中,自监督对比模型和条件扩散概率模型被层次化地用于车辆级和像素级特征增强。我们还提出了一种高效变体模型Coop-WD-eco,通过选择性关闭去噪模块来降低处理开销。研究考虑了莱斯衰落、非平稳性和时变失真。仿真结果表明,所提出的Coop-WD在所有类型的信道中均优于传统基准方法。定性分析结合可视化示例进一步证明了所提方法的优越性。Coop-WD-eco在严重失真条件下可实现高达50%的计算成本降低,同时随着信道条件的改善保持相当的精度。

HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

2026-06-09T04:00:00 · cs.CV, cs.GR, diffusion, image_compression · 2508.07011

中文标题:HiMat:基于DiT的超高分辨率SVBRDF生成

作者:Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

摘要:

Creating ultra-high-resolution spatially varying bidirectional reflectance functions (SVBRDFs) is critical for photorealistic 3D content creation, to faithfully represent fine-scale surface details required for close-up rendering. However, achieving 4K generation faces two key challenges: (1) the need to synthesize multiple reflectance maps at full resolution, which multiplies the pixel budget and imposes prohibitive memory and computational cost, and (2) the requirement to maintain strong pixel-level alignment across maps at 4K, which is particularly difficult when adapting pretrained models designed for the RGB image domain. We introduce HiMat, a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. To address the first challenge, HiMat performs generation in a high-compression latent space via DC-AE, and employs a pretrained diffusion transformer with linear attention to improve per-map efficiency. To address the second challenge, we propose CrossStitch, a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention. Our experiments show that HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods. Beyond materials, our framework also generalizes to related applications such as intrinsic decomposition.

摘要中文:

创建超高分辨率空间变化双向反射分布函数(SVBRDF)对于逼真的3D内容创作至关重要,能够真实呈现特写渲染所需的细粒度表面细节。然而,实现4K生成面临两个关键挑战:(1) 需要在全分辨率下合成多个反射图,这成倍增加了像素预算,带来难以承受的内存和计算成本;(2) 需要在4K分辨率下保持各图之间严格的像素级对齐,在适配为RGB图像域设计的预训练模型时尤其困难。我们提出了HiMat,一个专为高效且多样化的4K SVBRDF生成而设计的基于扩散的框架。为解决第一个挑战,HiMat通过DC-AE在高压缩率的潜在空间中进行生成,并采用具有线性注意力的预训练扩散Transformer以提高单图效率。为解决第二个挑战,我们提出了CrossStitch,这是一个轻量级卷积模块,可在不产生全局注意力开销的情况下强制跨图一致性。实验表明,与现有方法相比,HiMat实现了高保真的4K SVBRDF生成,在效率、结构一致性和多样性方面表现更优。除了材质,我们的框架还可推广到本征分解(intrinsic decomposition)等相关应用。

Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis

2026-06-09T04:00:00 · cs.CV, diffusion · 2509.24531

中文标题:扩散桥还是流匹配?统一框架与比较分析

作者:Kaizhen Zhu, Mokai Pan, Zhechuan Yu, Jingya Wang, Jingyi Yu, Ye Shi

摘要:

Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. However, there remains confusion about which approach is generally preferable, and the substantial discrepancies in their modeling assumptions and practical implementations have hindered a unified theoretical account of their relative merits. We have, for the first time, provided a unified theoretical and experimental validation of these two models. We recast their frameworks through the lens of Stochastic Optimal Control and prove that the cost function of the Diffusion Bridge is lower, guiding the system toward more stable and natural trajectories. Simultaneously, from the perspective of Optimal Transport, interpolation coefficients $t$ and $1-t$ of Flow Matching become increasingly ineffective when the training data size is reduced. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer, and implement a Flow Matching model with the same structure to enable a fair performance comparison in various experiments. Comprehensive experiments are conducted across Image Restoration, Translation, and Style Transfer tasks, systematically varying both the distributional discrepancy (different difficulty) and the training data size. Extensive empirical results align perfectly with our theoretical predictions and allow us to delineate the respective advantages and disadvantages of these two models. Our code is available at https://github.com/zhukaizhen/diffusion_bridge_flow_matching.

摘要中文:

扩散桥和流匹配在任意分布间的转换任务中均展现出优异的实证性能。然而,目前尚不清楚哪种方法通常更为可取,且两者在建模假设和实际实现上的显著差异阻碍了对其相对优势的统一理论阐释。我们首次对这两种模型进行了统一的理论和实验验证。我们从随机最优控制的视角重新审视了它们的框架,并证明了扩散桥的代价函数更低,能够引导系统沿更稳定、自然的轨迹运行。同时,从最优传输的视角来看,当训练数据规模减小时,流匹配的插值系数t和1-t逐渐失效。为验证这些理论预测,我们提出了一种基于潜在Transformer的扩散桥新架构,并实现了具有相同结构的流匹配模型以在各实验中实现公平的性能比较。我们在图像修复、翻译和风格迁移任务上进行了全面实验,系统地改变了分布差异(不同难度)和训练数据规模。大量实证结果与我们的理论预测高度一致,使我们能够阐明这两种模型各自的优劣势。我们的代码开源于 https://github.com/zhukaizhen/diffusion_bridge_flow_matching。
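The abstract above refers to the interpolation coefficients $t$ and $1-t$ of Flow Matching. As a minimal sketch, assuming the common linear-interpolant formulation (the paper's exact parameterization is not given in this digest), the interpolant and its regression target look like this. The function names and the `predict_velocity` callback are hypothetical, introduced only for illustration.

```python
def fm_interpolate(x0, x1, t):
    """Linear interpolant x_t = (1 - t) * x0 + t * x1 between paired samples."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def fm_target_velocity(x0, x1):
    """Velocity of the straight path from x0 to x1; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def fm_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a predicted velocity and the straight-line target.

    predict_velocity is a hypothetical callback (x_t, t) -> velocity vector
    standing in for a trained network.
    """
    xt = fm_interpolate(x0, x1, t)
    v_hat = predict_velocity(xt, t)
    v = fm_target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_hat, v)) / len(v)
```

A predictor that always returns `x1 - x0` drives this loss to zero for every `t`; with few training pairs, the regression target is defined only on the handful of straight segments the data provides.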

Mitigating Diffusion Model Hallucinations with Dynamic Guidance

2026-06-09T04:00:00 · cs.CV, cs.LG, diffusion · 2510.05356

中文标题:利用动态引导缓解扩散模型幻觉

作者:Kostas Triaridis, Alexandros Graikos, Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras

摘要:

Hallucinations in diffusion models are samples with structural inconsistencies that can emerge due to the excessive smoothing of the learned score function, which in turn leads to interpolations between modes of the data distribution. Since semantic interpolations are often desirable and contribute to sample diversity, we believe that a nuanced and targeted solution is required to address diffusion model hallucinations. In this work, we introduce Dynamic Guidance, which mitigates hallucinations by selectively sharpening the score function only along the pre-determined directions known to cause artifacts, while preserving valid semantic variations. This sharpening can be performed using either pre-determined classes or semantically coherent clusters that form pseudo-classes over the data distribution. The latter allows for a principled extension of Dynamic Guidance to text-to-image generation, where we select modes to correspond to fine-grained contextual differences in textual descriptions. To our knowledge, this is the first approach that addresses hallucinations at generation time rather than through post-hoc filtering. Dynamic Guidance substantially reduces hallucinations on both controlled and natural image datasets, significantly outperforming baselines.

摘要中文:

扩散模型中的幻觉是指具有结构性不一致的样本,这类样本可能由于学习到的得分函数过度平滑而产生,而过度平滑又会导致数据分布各模态之间的插值。由于语义插值通常是可取的并有助于样本多样性,我们认为需要一个细致且有针对性的方案来解决扩散模型幻觉问题。在本工作中,我们引入动态引导(Dynamic Guidance),仅沿已知会导致伪影的预定方向选择性地锐化得分函数,从而缓解幻觉,同时保留有效的语义变化。这种锐化可以借助预定义的类别,或借助在数据分布上形成伪类别的语义连贯聚类来执行。后者使动态引导能够有原则地扩展到文本到图像生成,在该场景中,我们选择的模态对应文本描述中的细粒度上下文差异。据我们所知,这是首个在生成时而非通过事后过滤来解决幻觉问题的方法。动态引导在受控和自然图像数据集上均大幅减少了幻觉,性能明显优于基线方法。

Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

2026-06-09T04:00:00 · cs.CV, diffusion · 2512.13869

中文标题:基于扩散模型的无人机人体检测粗到细层次对齐方法

作者:Wenda Li, Meng Wu, Liangzhao Chen, Sungmin Eum, Heesung Kwon, Qing Qu

摘要:

Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data, with a low annotation cost. However, the domain gap between synthetic and real images hinders the model from being effectively applied to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer -- a diffusion model aligns color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement -- a super-resolution diffusion model is used to facilitate fine-grained and photorealistic details for the small objects, such as human instances, preserving shape and boundary integrity; (3) Hallucination Removal -- a module that filters out human instances whose visual attributes do not align with real-world data to make the human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our method significantly improves the detection accuracy compared to the non-transformed baselines. Specifically, our method achieves up to +14.1 improvement in mAP50 on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at https://github.com/liwd190019/CFHA.

摘要中文:

训练目标检测器需要大量针对特定任务的标注数据,然而在基于无人机的人体检测任务中,由于目标分布不断变化且带标注图像稀缺,这一要求变得不切实际。作为一种补救措施,研究者采用合成模拟器生成带标注数据,以降低标注成本。然而,合成图像与真实图像之间的域差距阻碍了模型在目标域的有效应用。为此,本文提出了粗到细层次对齐(Coarse-to-Fine Hierarchical Alignment, CFHA)方法——一个三阶段基于扩散模型的框架,旨在将合成数据转换为适用于无人机人体检测的数据,在保留原始合成标签的同时缩小域差距。CFHA显式解耦全局风格与局部内容的域差异,并利用三个模块弥合这些差距:(1)全局风格迁移——仅使用少量真实参考图像,通过扩散模型将合成图像的颜色、光照和纹理统计信息对齐到真实风格;(2)局部细化——使用超分辨率扩散模型为行人等小目标补充细粒度逼真细节,同时保持形状和边界完整性;(3)幻觉去除——过滤掉视觉属性与真实数据不一致的人体实例,使人体外观更接近目标分布。在公开的无人机Sim2Real检测基准数据集上进行了大量实验,结果表明本文方法相比未转换的基线方法显著提升了检测精度。具体而言,本文方法在Semantic-Drone基准上实现了最高+14.1的mAP50提升。消融实验证实了全局阶段和局部阶段的互补作用,并强调了层次对齐的重要性。代码已发布于https://github.com/liwd190019/CFHA。

GimmBO: Interactive Generative Image Model Merging via Bayesian Optimization

2026-06-09T04:00:00 · cs.CV, cs.GR, diffusion · 2601.18585

中文标题:GimmBO: 基于贝叶斯优化的交互式生成图像模型合并

作者:Chenxi Liu, Selena Ling, Alec Jacobson

摘要:

Fine-tuning-based adaptation is widely used to customize diffusion-based image generation, leading to large collections of community-created adapters that capture diverse subjects and styles. Adapters derived from the same base model can be merged with weights, enabling the synthesis of new visual results within a vast and continuous design space. To explore this space, current workflows rely on manual slider-based tuning, an approach that scales poorly and makes weight selection difficult, even when the candidate set is limited to 20-30 adapters. We propose GimmBO to support interactive exploration of adapter merging for image generation through Preferential Bayesian Optimization (PBO). Motivated by observations from real-world usage, including sparsity and constrained weight ranges, we introduce a two-stage BO backend that improves sampling efficiency and convergence in high-dimensional spaces. We evaluate our approach with simulated users and a user study, demonstrating improved convergence, high success rates, and consistent gains over BO and line-search baselines, and further show the flexibility of the framework through several extensions.

摘要中文:

基于微调的适配方法被广泛用于定制基于扩散的图像生成,由此产生了大量由社区创建的适配器,这些适配器捕捉了丰富多样的主体和风格。源自同一基础模型的适配器可以按权重合并,从而在一个广阔且连续的设计空间中合成新的视觉结果。当前的工作流程依赖于基于滑块的手动调参来探索这一空间,这种方法可扩展性差,即使候选集合仅限于20-30个适配器,权重选择也十分困难。我们提出GimmBO,通过偏好贝叶斯优化(PBO)来支持图像生成中适配器合并的交互式探索。受真实使用场景中的观察(包括稀疏性和受约束的权重范围)启发,我们引入了一种两阶段贝叶斯优化后端,以提高高维空间中的采样效率和收敛性。我们通过模拟用户和用户研究来评估该方法,结果表明其收敛性更好、成功率高,并相对贝叶斯优化和线搜索基线取得了稳定的提升;我们还通过若干扩展展示了该框架的灵活性。
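The design space that GimmBO explores is the vector of merge weights over adapters sharing one base model. A minimal sketch of that merge follows, assuming each adapter is stored as an additive delta on a flat list of base parameters (a common convention for LoRA-style adapters, not something the abstract specifies). `merge_adapters` is a hypothetical helper, not part of GimmBO.

```python
def merge_adapters(base, deltas, weights):
    """Return base + sum_i weights[i] * deltas[i] for flat parameter lists.

    Assumption: every adapter is an additive delta with the same shape as base.
    """
    if len(deltas) != len(weights):
        raise ValueError("need exactly one weight per adapter")
    merged = list(base)
    for delta, w in zip(deltas, weights):
        if len(delta) != len(base):
            raise ValueError("adapter shape does not match the base model")
        for k, d in enumerate(delta):
            merged[k] += w * d
    return merged
```

With 20-30 adapters the weight vector has 20-30 dimensions, which is why slider-based tuning scales poorly. A sparse weight vector (most entries zero) leaves most adapters inactive, matching the sparsity observation in the abstract.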

Enhancing Adversarial Robustness with Signed Distance Fields for Harmonizing Geometric Invariance and Texture

2026-06-09T04:00:00 · cs.CV, diffusion · 2602.05175

中文标题:利用有符号距离场协调几何不变性与纹理以增强对抗鲁棒性

作者:Zhe Li, Bernhard Kainz

摘要:

Deep neural networks demonstrate impressive performance in visual recognition but remain highly vulnerable to imperceptible adversarial attacks. Existing defense strategies such as adversarial training and diffusion-based purification have achieved significant progress but are frequently constrained by high computational cost, information loss, and inference latency. To address these challenges, we propose a Geometric and Texture balancing Purification (GeoTexPuri) framework that enhances adversarial robustness by harmonizing invariant geometric structures with textural features. Specifically, the framework integrates dense geometric guidance into the training phase by transforming discrete image masks into continuous spatial fields via Signed Distance Fields (SDF). This process establishes stable structural anchors that shield the model from local pixel noise. Through a multi-stream training objective, the model learns to internalize purified representations that effectively align semantic textural cues with these underlying geometric invariants. Extensive experiments on ImageNet demonstrate the efficacy of our approach. GeoTexPuri achieves 84.79% clean accuracy and 83.52% robust accuracy under the AutoAttack. Crucially, GeoTexPuri functions as a deterministic classifier during inference, requiring only the input image without any auxiliary geometric modules or additional computational costs, thereby ensuring a scalable and efficient solution for real-time applications.

摘要中文:

深度神经网络在视觉识别任务中展现出卓越的性能,但极易受到难以察觉的对抗攻击。现有的防御策略,如对抗训练和基于扩散的净化方法,已取得显著进展,但通常受限于高计算成本、信息损失和推理延迟等问题。为解决这些挑战,我们提出了几何与纹理平衡净化框架(GeoTexPuri),该框架通过协调不变的几何结构与纹理特征来增强对抗鲁棒性。具体而言,该框架通过有符号距离场(SDF)将离散图像掩模转换为连续空间场,从而将密集几何引导集成到训练阶段。这一过程建立了稳定的结构锚点,使模型免受局部像素噪声的影响。通过多流训练目标,模型学习内化净化后的表示,有效地将语义纹理线索与底层几何不变性对齐。在ImageNet数据集上的大量实验验证了我们方法的有效性。GeoTexPuri在干净数据上达到84.79%的准确率,在AutoAttack攻击下达到83.52%的鲁棒准确率。关键的是,GeoTexPuri在推理阶段作为确定性分类器运行,仅需要输入图像,无需任何辅助几何模块或额外计算成本,从而为实时应用提供了可扩展且高效的解决方案。
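The step of "transforming discrete image masks into continuous spatial fields via Signed Distance Fields" can be illustrated with a brute-force sketch. The sign convention (negative inside the mask, positive outside) and the pixel-center Euclidean distance are assumptions made here; the abstract does not state which convention GeoTexPuri uses. `signed_distance_field` is a hypothetical name.

```python
import math


def signed_distance_field(mask):
    """Brute-force SDF of a binary mask given as a list of rows of 0/1.

    Assumed convention: negative inside the mask, positive outside. Each value
    is the Euclidean distance to the nearest pixel of the opposite class.
    """
    h, w = len(mask), len(mask[0])
    inside = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    outside = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    if not inside or not outside:
        raise ValueError("mask must contain both foreground and background pixels")
    sdf = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            targets = outside if mask[r][c] else inside
            d = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
            sdf[r][c] = -d if mask[r][c] else d
    return sdf
```

This version costs O((HW)^2) and is only meant for small grids; for real images a linear-time Euclidean distance transform would be the practical choice. A one-pixel perturbation of the mask boundary moves nearby field values by at most about one pixel, which is one way to read the abstract's "stable structural anchors".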

IDDM: Identity-Decoupled Personalized Diffusion Models with a Tunable Privacy-Utility Trade-off

2026-06-09T04:00:00 · cs.CV, diffusion · 2604.00903

中文标题:IDDM:具有可调隐私-效用权衡的身份解耦个性化扩散模型

作者:Linyan Dai, Xinwei Zhang, Haoyang Li, Qingqing Ye, Haibo Hu

摘要:

Personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) enable users to synthesize high-fidelity avatars from a few reference photos for social expression. However, once these generations are shared on social media platforms (e.g., Instagram, Facebook), they can be linked to the real user via face recognition systems, enabling identity tracking and profiling. Existing defenses mainly follow an anti-personalization strategy that protects publicly released reference photos by disrupting model fine-tuning. While effective against unauthorized personalization, they do not address another practical setting in which personalization is authorized, but the resulting public outputs still leak identity information. To address this problem, we introduce a new defense setting, termed model-side output immunization, whose goal is to produce a personalized model that supports authorized personalization while reducing the identity linkability of public generations, with tunable control over the privacy-utility trade-off to accommodate diverse privacy needs. To this end, we propose Identity-Decoupled personalized Diffusion Models (IDDM), a model-side defense that integrates identity decoupling into the personalization pipeline. Concretely, IDDM follows an alternating procedure that interleaves short personalization updates with identity-decoupled data optimization, using a two-stage schedule to balance identity linkability suppression and generation utility. Extensive experiments across multiple datasets, diverse prompts, and state-of-the-art face recognition systems show that IDDM consistently reduces identity linkability while preserving high-quality personalized generation.

摘要中文:

个性化文本到图像扩散模型(如 DreamBooth、LoRA)使用户能够从少量参考照片合成高质量头像,用于社交表达。然而,一旦这些生成内容在社交媒体平台(如 Instagram、Facebook)上发布,它们可以通过人脸识别系统与真实用户关联,从而实现身份追踪和画像。现有的防御方法主要采用反个性化策略,通过干扰模型微调来保护公开发布的参考照片。虽然这类方法对未授权的个性化有效,但无法解决另一个实际场景:个性化是经过授权的,但由此产生的公开输出仍然会泄露身份信息。为解决这一问题,我们引入了一种新的防御设置,称为模型端输出免疫,其目标是生成一个支持授权个性化的个性化模型,同时降低公开生成内容的身份关联性,并提供可调的隐私-效用权衡控制以满足不同的隐私需求。为此,我们提出了身份解耦个性化扩散模型(IDDM),这是一种将身份解耦集成到个性化流程中的模型端防御方法。具体而言,IDDM 采用交替流程,将短期个性化更新与身份解耦数据优化交错进行,并使用两阶段调度来平衡身份关联性抑制和生成效用。在多个数据集、多样化提示词以及先进的人脸识别系统上进行的广泛实验表明,IDDM 在保持高质量个性化生成的同时,持续降低了身份关联性。

Anomaly-Preference Image Generation

2026-06-09T04:00:00 · cs.CV, cs.LG, diffusion · 2605.02439

中文标题:异常偏好图像生成

作者:Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang, Dan Wang, Hui Yan, Xin Liu, Zhen Cui

摘要:

Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, respectively. To mitigate this, we introduce Anomaly Preference Optimization, a novel paradigm that reformulates anomaly generation as a preference learning problem. Central to our approach is an implicit preference alignment mechanism that leverages real anomalies as positive references, deriving optimization signals directly from denoising trajectory deviations without requiring costly human annotation. Furthermore, we propose a Time-Aware Capacity Allocation module that dynamically distributes model capacity along the diffusion timeline, prioritizing structural diversity during high-noise phases while enhancing fine-grained fidelity in low-noise stages. During inference, a hierarchical sampling strategy modulates the coherence-alignment trade-off, enabling precise control over generation. Extensive experiments demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance in both realism and diversity.

摘要中文:

从有限数据中合成真实且多样的异常样本对于模型的鲁棒泛化至关重要。然而,现有方法难以平衡保真度与多样性,分别受到分布错位和过拟合问题的困扰。为缓解这一问题,我们提出异常偏好优化,一种将异常生成重新表述为偏好学习问题的新范式。我们的核心方法是隐式偏好对齐机制,该机制利用真实异常作为正样本参考,直接从去噪轨迹偏差中获取优化信号,无需昂贵的人工标注。此外,我们提出时间感知容量分配模块,该模块沿扩散时间线动态分配模型容量,在高噪声阶段优先考虑结构多样性,而在低噪声阶段增强细粒度保真度。推理阶段采用分层采样策略调节一致性与对齐之间的权衡,实现对生成过程的精确控制。大量实验表明,本方法显著优于现有基线,在真实性和多样性方面均达到了最先进的性能。

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion

2026-06-09T04:00:00 · cs.CV, diffusion · 2605.25449

中文标题:Pantheon360:通过3D感知360度视频扩散驯服数字孪生生成

作者:Ting-Hsuan Chen, Ying-Huan Chen, Tao Tu, Jie-Ying Lee, Cho-Ying Wu, Fangzhou Lin, Hengyuan Zhang, David Paz, Xinyu Huang, Yuliang Guo, Yu-Lun Liu, Yue Wang, Liu Ren

摘要:

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.

摘要中文:

从视频生成完整的数字孪生需要精确的相机控制、全局场景覆盖以及严格的时空一致性约束,这些要求对于视场角有限的透视视频生成器而言仍具挑战。狭窄的视场角迫使生成器采用长轨迹或多视角轨迹,从而加剧了跨视角不一致性和时间漂移问题。我们认为,360度视频生成提供了一种自然解决方案:全景覆盖简化了轨迹设计,并提供了维持一致性的强有力全局上下文。我们提出了Pantheon360:通过3D感知360度视频扩散驯服数字孪生生成,这是一种可控的360度视频生成框架,能够从稀疏的360度输入合成高质量视频。其核心思想是从输入中显式重建3D缓存,作为任意用户定义相机路径的几何支架。这使得扩散模型能够专注于真实感纹理细化,而3D缓存则强制执行全局几何一致性。实验表明,Pantheon360在视觉质量和几何一致性方面均达到了卓越水平,能够为下游仿真和数字孪生应用提供可靠且灵活的360度场景生成能力。

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

2026-06-09T04:00:00 · cs.CV, diffusion · 2605.26108

中文标题:通过奖励倾斜分布匹配强化少步生成器

作者:Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang

摘要:

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.

摘要中文:

少步扩散蒸馏的最新进展实现了高效图像生成,但使这些模型与人类偏好对齐仍具挑战性。我们提出了奖励倾斜分布匹配蒸馏(RTDMD),这是一个两阶段框架,将分布匹配蒸馏与奖励引导的强化学习统一应用于少步流生成器。我们证明,最小化与奖励倾斜教师分布的KL散度可以自然分解为一个分布匹配项和一个奖励最大化项。在第一阶段,我们引入了环境一致分布匹配蒸馏(AC-DMD),该方法执行子区间分布匹配,并用一致性正则化器增强伪分数目标,以帮助伪分数模型在有限更新下跟踪变化的生成器分布。在第二阶段,我们联合优化两个项:对于奖励最大化项,我们推导了一种混合策略梯度,结合了用于随机中间转移的GRPO风格估计器和通过确定性最终步骤的直接奖励反向传播,并进一步引入了步子集GRPO(SubGRPO)以降低方差。在SD3、SD3.5和FLUX.2上的实验表明,RTDMD仅用4个推理步骤就在偏好、美学和构图指标上建立了新的最先进结果,优于以往的少步文生图生成方法。代码和模型可访问 https://github.com/Harahan/RTDMD。
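The decomposition claimed in the abstract can be checked in a few lines under one explicit assumption: that the reward-tilted teacher is the exponential tilt of the teacher density $p_T$ by a reward $r$ with temperature $\beta > 0$. The symbols $p_T$, $p_\theta$, $r$, $\beta$ and $Z$ are notation introduced here, and the paper's exact definition may differ.

```latex
% Assumed tilted target: p^*(x) = p_T(x) \exp(r(x)/\beta) / Z,
% with normalizer Z = \mathbb{E}_{x \sim p_T}[\exp(r(x)/\beta)].
\begin{aligned}
\mathrm{KL}\left(p_\theta \,\|\, p^*\right)
  &= \mathbb{E}_{x \sim p_\theta}\left[\log p_\theta(x) - \log p_T(x)
     - \tfrac{1}{\beta}\, r(x) + \log Z\right] \\
  &= \underbrace{\mathrm{KL}\left(p_\theta \,\|\, p_T\right)}_{\text{distribution matching}}
     \;-\; \underbrace{\tfrac{1}{\beta}\, \mathbb{E}_{x \sim p_\theta}\left[r(x)\right]}_{\text{reward maximization}}
     \;+\; \log Z .
\end{aligned}
```

Since $\log Z$ does not depend on the generator parameters $\theta$, minimizing the left-hand side amounts to matching the teacher distribution while maximizing expected reward, with $\beta$ setting the balance between the two terms.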

DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

2026-06-09T04:00:00 · cs.CV, diffusion · 2606.07419

中文标题:DisPOSE:用于自监督多视角三维人体姿态估计的投影多项随机扩散方法

作者:Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

摘要:

Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.

摘要中文:

从不同相机视角恢复多人的三维人体姿态是分析交互行为的关键瓶颈。现有的自监督方法利用合成三维姿态目录;然而,由于分布偏移,这导致在现实场景中泛化性能较差。因此,我们提出了DisPOSE,这是一个自监督框架,将本质上离散的多视角人物分配问题近似为多项随机张量空间上的生成扩散过程。通过在去噪过程中采用可微分的Sinkhorn投影,我们的模型学习根据二维图像先验信息引导解决方案趋向有效且可行的分配。随后,使用超图卷积解码器回归局部化个体的完整三维骨架,该解码器明确建模了多视角间的关联结构和关节连接。所提方法在标准数据集上优于当前最先进的自监督方法,并在包含手术室高度遮挡场景的新基准测试中展现出优异的性能。我们基于扩散的定位方法表现出极高的标签效率,仅使用10%的伪标签即可保持99%的性能。值得注意的是,通过解耦分配和根回归组件并保持可微性,DisPOSE几乎不受不同相机配置的影响。
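The "differentiable Sinkhorn projections" mentioned above can be illustrated in the simplest two-view case, where a score matrix between detections in two views is pushed toward a doubly stochastic (soft assignment) matrix by alternating row and column normalization. This is a sketch of the classical matrix Sinkhorn iteration only; DisPOSE operates on polystochastic tensors across several views, which this example does not cover. The `temperature` parameter and the function name are assumptions.

```python
import math


def sinkhorn(scores, n_iters=200, temperature=1.0):
    """Map a square score matrix to an approximately doubly stochastic matrix."""
    n = len(scores)
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    p = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Normalize every row to sum to 1.
        p = [[v / sum(row) for v in row] for row in p]
        # Normalize every column to sum to 1.
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return p
```

Every step is a composition of exponentials, sums and divisions, so gradients can flow through the iteration when the same operations are written in an autodiff framework. Lowering `temperature` pushes the result closer to a hard permutation.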

Bokeh Diffusion: Defocus Blur Control in Text-to-Image Diffusion Models

2026-06-09T04:00:00 · cs.CV, cs.GR, diffusion · 2503.08434

中文标题:Bokeh Diffusion: 文本到图像扩散模型中的失焦模糊控制

作者:Armando Fortes, Tianyi Wei, Shangchen Zhou, Xingang Pan

摘要:

Recent advances in large-scale text-to-image models have revolutionized creative fields by generating visually captivating outputs from textual prompts; however, while traditional photography offers precise control over camera settings to shape visual aesthetics - such as depth-of-field via aperture - current diffusion models typically rely on prompt engineering to mimic such effects. This approach often results in crude approximations and inadvertently alters the scene content. In this work, we propose Bokeh Diffusion, a scene-consistent bokeh control framework that explicitly conditions a diffusion model on a physical defocus blur parameter. To overcome the scarcity of paired real-world images captured under different camera settings, we introduce a hybrid training pipeline that aligns in-the-wild images with synthetic blur augmentations, providing diverse scenes and subjects as well as supervision to learn the separation of image content from lens blur. Central to our framework is our grounded self-attention mechanism, trained on image pairs with different bokeh levels of the same scene, which enables blur strength to be adjusted in both directions while preserving the underlying scene. Extensive experiments demonstrate that our approach enables flexible, lens-like blur control, supports downstream applications such as real image editing via inversion, and generalizes effectively across both Stable Diffusion and FLUX architectures.

摘要中文:

大规模文本到图像模型的最新进展通过从文本提示生成视觉吸引人的输出彻底改变了创意领域;然而,传统摄影能够通过相机设置精确控制视觉美学(如通过光圈控制景深),而当前的扩散模型通常依赖提示工程来模拟此类效果。这种方法往往只能得到粗略的近似,并会在无意中改变场景内容。本研究提出Bokeh Diffusion,一个场景一致的散景控制框架,以物理失焦模糊参数作为扩散模型的显式条件。为解决在不同相机设置下拍摄的成对真实图像稀缺的问题,我们引入了一种混合训练流程,将自然场景图像与合成模糊增强对齐,既提供了多样化的场景和主体,也为学习将图像内容与镜头模糊分离提供了监督。我们框架的核心是接地自注意力(grounded self-attention)机制,该机制在同一场景、不同散景程度的图像对上训练,使模糊强度可以双向调节,同时保持底层场景不变。大量实验表明,我们的方法实现了灵活的类镜头模糊控制,支持通过反演进行真实图像编辑等下游应用,并在Stable Diffusion和FLUX架构上均能有效泛化。

Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers

2026-06-09T04:00:00 · cs.CV, cs.GR, cs.HC, diffusion · 2606.06497

中文标题:实时注意力弯曲器:视频扩散变换器的精细化交互式网络弯曲

作者:Adam Cole, Rebecca Fiebrink, Mick Grierson

摘要:

Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call "material intimacy" with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model's default representational space.

摘要中文:

生成式视频模型已取得显著的视觉保真度,但其仅支持提示词的交互界面提供的创作控制力有限,且使艺术家难以洞察模型的内部运作过程。我们提出实时注意力弯曲器(Real-Time AttentionBender),该工具将网络弯曲实践扩展至视频扩散变换器(DiT)的完整深度,并将其带入实时交互式生成领域。该工具作为DayDream Scope生态系统内的插件构建,并封装了开源的实时Wan流水线,将自注意力、跨注意力和前馈网络暴露为可独立操作的界面,支持精确到单个扩散步骤、DiT层、提示词标记和隐藏神经元的干预。实时操作的即时性赋予了我们所称的与模型之间的“物质亲密感”:一种响应式的、近乎机械性的感知,使人能切身体会特定层和神经元如何塑造生成的视频。我们将该工具定位为同时作为可解释人工智能与艺术(XAIxArts)探究变换器内部机制的探针,以及用于探索模型默认表示空间之外美学的表现性工具。

image_compression
Image Compression
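4 papers

Overview of today's image compression / efficient vision model research:

Today's arXiv papers tagged with image compression concentrate on making vision models more efficient. The 4 papers span ViT pruning, KV compression, and memory mechanisms for world models. The overall trend is a shift from traditional image compression algorithms toward efficiency optimization of deep learning models, aiming to cut computational cost substantially while preserving performance. Note that these works belong to "model compression" and "efficient vision models" rather than to image coding in the traditional sense.

  • RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT - Combines layer-wise redundancy-aware pruning with importance-driven token merging, giving ViT models a systematic, training-free efficiency scheme. Worth attention.
  • HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling - Head-aware KV cache compression for visual autoregressive models. It reports near-lossless generation on Infinity-2B/8B with a 30% attention budget and a 10% cache budget, and is a useful reference for structured compression of visual generative models.
  • HiMat: DiT-based Ultra-High Resolution SVBRDF Generation - DiT-based ultra-high-resolution material generation. This is application-level generative modeling work, only loosely related to image compression.

Summary: Today's work centers on improving the efficiency of vision Transformers and autoregressive generative models, reflecting the strong demand for "smaller, faster" models in the large-model era. RAPID and HACK++ deserve the closest look, since both target efficiency bottlenecks in mainstream vision architectures.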

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

2026-06-09T04:00:00 · cs.AI, cs.CV, image_compression · 2606.08156

中文标题:RAPID:面向高效ViT的层级冗余感知剪枝与重要性驱动Token合并

作者:Kyumin Choi, Ikbeom Jang

摘要:

Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.

Abstract (Chinese):

视觉Transformer(ViT)虽然性能优异,但因自注意力复杂度呈二次方增长而面临高计算成本问题。尽管剪枝与合并等Token减少技术可以缓解这一困境,但现有方法通常忽视了表征在网络深度上的演化规律。本文提出RAPID,一种深度感知的Token减少框架,能够根据Token表征的层级特性自适应调整减少策略。本研究的主要方法论贡献在于提出一种两阶段的分叉策略:在浅层到中层,RAPID采用冗余-相似性感知剪枝度量来消除过度表示的局部模式;随着特征在深层过渡到全局语义概念,框架切换至重要性-相似性感知合并机制,该阶段利用分类(CLS)Token的注意力权重保护语义关键Token,同时融合重要性较低但相似的相邻Token。在ImageNet-1K数据集上针对ViT和DeiT架构的实证验证表明,RAPID在准确率-压缩权衡上建立了优于ToMe和ToFu等即插即用基线的帕累托前沿。RAPID在激进压缩场景下尤为鲁棒,在极端压缩率下准确率比ToMe最多高出4.29%。本框架提供了一个无需训练的模板,通过将减少策略与层级特征演化相匹配来优化视觉模型。
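To make the two-stage idea in the abstract concrete, here is a minimal pure-Python sketch. It is not the RAPID implementation: the abstract does not give the actual pruning or merging metrics, so the mean cosine similarity used as a "redundancy" score, the averaging merge rule, and the function names `prune_redundant` and `merge_by_importance` are all illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def prune_redundant(tokens, n_drop):
    """Shallow-layer stand-in (assumed metric): greedily drop the token whose
    mean similarity to the remaining tokens is highest, n_drop times."""
    tokens = list(tokens)
    for _ in range(n_drop):
        if len(tokens) < 2:
            break
        def redundancy(i):
            others = [cosine(tokens[i], t) for j, t in enumerate(tokens) if j != i]
            return sum(others) / len(others)
        worst = max(range(len(tokens)), key=redundancy)
        del tokens[worst]
    return tokens

def merge_by_importance(tokens, cls_attn, n_keep):
    """Deep-layer stand-in (assumed rule): protect the n_keep tokens with the
    highest CLS attention and average every other token into its most similar
    protected token."""
    order = sorted(range(len(tokens)), key=lambda i: cls_attn[i], reverse=True)
    keep = sorted(order[:n_keep])
    sums = {i: list(tokens[i]) for i in keep}
    counts = {i: 1 for i in keep}
    for i in order[n_keep:]:
        target = max(keep, key=lambda k: cosine(tokens[i], tokens[k]))
        sums[target] = [s + x for s, x in zip(sums[target], tokens[i])]
        counts[target] += 1
    return [[s / counts[k] for s in sums[k]] for k in keep]

if __name__ == "__main__":
    # Two identical tokens and one distinct token: one duplicate is pruned.
    print(prune_redundant([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], n_drop=1))
    # The low-attention third token is folded into the protected token it resembles.
    print(merge_by_importance([[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]], [0.5, 0.4, 0.1], n_keep=2))
```

A depth-aware scheduler in the spirit of the abstract would call the first function in shallow-to-middle layers and the second in deeper layers; where that switch happens is not stated in the abstract and is left out of the sketch.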

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

2026-06-09T04:00:00 · autoregressive, cs.CV, image_compression · arXiv:2606.08302

Chinese title: HACK++:面向高效视觉自回归建模的更有效头部感知键值缓存压缩

Authors: Ziran Qin, Yuchen Jiang, Mingbao Lin, Youru Lv, Hang Guo, Wen Fei, Weiyao Lin

Abstract:

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale paradigm. We begin with an in-depth analysis of VAR attention and observe that attention heads can be stably divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads preserve spatial coherence. Their functional divergence makes existing one-size-fits-all compression methods perform poorly on VAR models. We further find that the two head types differ markedly in their reliance on historical scales, and that this reliance shifts across layers and generation steps, arguing for an adaptive cache budget allocation. To address these challenges, we propose HACK++, a training-free Head-Aware key-value Compression frameworK for VAR models. From a one-time offline calibration, HACK++ classifies head types and derives head-specific priors. At inference, it decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively, via pattern-specific strategies and a reliance-aware budget allocation. Extensive experiments on multiple VAR models across text-to-image, class-conditional, and unified understanding-and-generation tasks validate the effectiveness and generalizability of HACK++. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.

Abstract (Chinese):

视觉自回归(VAR)模型采用下一尺度预测范式,以显著更少的解码步骤提供高质量生成。然而,现有VAR模型因跨尺度累积的键值(KV)缓存而面临显著的注意力复杂度和严重的内存开销问题。本文通过将KV缓存压缩引入下一尺度范式来解决这一挑战。我们首先对VAR注意力进行深入分析,发现注意力头可以稳定地分为两个功能不同的类别:上下文头(Contextual Heads)专注于维护语义一致性,而结构头(Structural Heads)保持空间一致性。二者的功能差异使得现有的一刀切压缩方法在VAR模型上表现不佳。我们进一步发现,两种头部类型对历史尺度的依赖程度存在显著差异,且这种依赖随层和生成步骤而变化,这表明需要自适应的缓存预算分配。为了应对这些挑战,我们提出了HACK++,一个用于VAR模型的免训练头部感知键值缓存压缩框架。通过一次性离线校准,HACK++对头部类型进行分类并得出头部特定的先验知识。在推理过程中,它在独立预算下将注意力与缓存压缩解耦,通过模式特定策略和依赖感知预算分配,在限制当前尺度注意力成本的同时以远为激进的力度压缩累积缓存。在多个VAR模型上进行了广泛实验,涵盖文本到图像、类别条件和统一理解与生成任务,验证了HACK++的有效性和泛化能力。例如,在Infinity-2B/8B上,HACK++仅使用30%的注意力预算和10%的缓存预算即可保持近无损生成,甚至在1%缓存预算下仍然鲁棒。
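The budget figures in the abstract (30% attention, 10% or 1% cache) are easier to picture with a small sketch of what a fractional cache budget means. The code below is a hedged illustration, not HACK++ itself: the abstract's pattern-specific strategies and reliance-aware allocation are not specified, so a plain top-k selection by score is swapped in, and the function names, the square token maps, and the example scale sizes are assumptions.

```python
def accumulated_cache_tokens(scale_sides):
    """Number of tokens cached from all scales before the last one,
    assuming each scale is a square token map of side * side tokens."""
    return sum(side * side for side in scale_sides[:-1])

def keep_top_indices(scores, budget_pct):
    """Indices (in original order) of the highest-scoring cache entries that
    fit an integer percentage budget; at least one entry is always kept."""
    n_keep = max(1, (len(scores) * budget_pct + 99) // 100)  # ceiling division
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

def compress_cache(keys, values, scores, budget_pct):
    """Keep only the key/value pairs selected by keep_top_indices."""
    idx = keep_top_indices(scores, budget_pct)
    return [keys[i] for i in idx], [values[i] for i in idx]

if __name__ == "__main__":
    # Hypothetical next-scale schedule: 1x1, 2x2, 3x3, then 4x4 token maps.
    print(accumulated_cache_tokens([1, 2, 3, 4]))
    # Hypothetical per-entry importance scores for one attention head.
    scores = [0.1, 0.9, 0.3, 0.05, 0.7, 0.2, 0.01, 0.02, 0.6, 0.4]
    print(keep_top_indices(scores, budget_pct=30))
    print(keep_top_indices(scores, budget_pct=10))
```

In a head-aware variant, each head type would get its own scoring rule and budget; the sketch applies a single rule to one head only to show how the percentage translates into retained entries.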

Echo-Memory: A Controlled Study of Memory in Action World Models

2026-06-09T04:00:00 · cs.CV, cs.GR, cs.LG, diffusion, image_compression · arXiv:2606.09803

Chinese title: Echo-Memory:动作世界模型中记忆机制的控制变量研究

Authors: Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

Abstract:

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.

Abstract (Chinese):

我们提出Echo-Memory,这是一个对动作条件世界模型中记忆机制的控制变量研究。这些模型根据第一帧、文本提示和相机动作序列生成多段视频,但它们的核心失败模式往往是记忆而非局部图像合成:相机离开后返回时,场景或显著物体可能会悄然改变。现有的记忆设计难以比较,因为性能提升与骨干网络、训练、检索和评估的差异相互交织。Echo-Memory固定了动作到视频的接口,仅改变生成器存储和读取历史的方式。在共享的视频扩散骨干网络、优化器、相机动作表示、采样器和评估流程下,我们比较了原始上下文、基于压缩的记忆、具有不同读取路径的空间摘要和状态空间递归。这个匹配的矩阵分离了四个原本混淆的轴:容量、压缩、读取和递归。我们还通过三分支协议评估记忆:重放质量、域内循环重访和开放域返回探针。这些分支经常不一致,表明重放保真度并非记住一个世界的充分代理。由此得出三个发现。原始上下文是一个强大的容量基线,其对开放域返回的提升远大于对重放指标的提升。紧凑性不能作为容量的免费替代品:激进的空间和混合压缩记忆会丢失返回所需的显著证据。最后,块级状态空间递归是我们矩阵中开放域返回最强的机制,表明隐式记忆的结构与是否使用隐式记忆的决策同等重要。这些结果为研究动作世界模型中的记忆提供了一个精简的协议,超越了单独的重放指标。

HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

2026-06-09T04:00:00 · cs.CV, cs.GR, diffusion, image_compression · arXiv:2508.07011

Chinese title: HiMat:基于DiT的超高分辨率SVBRDF生成

Authors: Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

Abstract:

Creating ultra-high-resolution spatially varying bidirectional reflectance functions (SVBRDFs) is critical for photorealistic 3D content creation, to faithfully represent fine-scale surface details required for close-up rendering. However, achieving 4K generation faces two key challenges: (1) the need to synthesize multiple reflectance maps at full resolution, which multiplies the pixel budget and imposes prohibitive memory and computational cost, and (2) the requirement to maintain strong pixel-level alignment across maps at 4K, which is particularly difficult when adapting pretrained models designed for the RGB image domain. We introduce HiMat, a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. To address the first challenge, HiMat performs generation in a high-compression latent space via DC-AE, and employs a pretrained diffusion transformer with linear attention to improve per-map efficiency. To address the second challenge, we propose CrossStitch, a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention. Our experiments show that HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods. Beyond materials, our framework also generalizes to related applications such as intrinsic decomposition.

Abstract (Chinese):

创建超高分辨率空间变化双向反射分布函数(SVBRDF)对于逼真的3D内容创作至关重要,能够真实呈现特写渲染所需的细粒度表面细节。然而,实现4K生成面临两个关键挑战:(1) 需要在全分辨率下合成多个反射图,这成倍增加了像素预算,带来难以承受的内存和计算成本;(2) 需要在4K分辨率下保持各图之间严格的像素级对齐,在适配原本为RGB图像域设计的预训练模型时尤为困难。我们提出了HiMat,一个专为高效多样化的4K SVBRDF生成而设计的基于扩散的框架。为解决第一个挑战,HiMat通过DC-AE在高度压缩的潜在空间中进行生成,并采用具有线性注意力的预训练扩散Transformer以提高单图效率。为解决第二个挑战,我们提出了CrossStitch,这是一个轻量级卷积模块,可在不产生全局注意力开销的情况下强制跨图一致性。实验表明,与现有方法相比,HiMat实现了高保真的4K SVBRDF生成,在效率、结构一致性和多样性方面表现更优。除了材质,我们的框架还可推广到本征分解(intrinsic decomposition)等相关应用。

visual_tokenizer_1d
1D Visual Tokenizer
0 papers

No matching papers were found for this category today.

diffusion_visual_encoder
Diffusion Visual Encoder
0 papers

No matching papers were found for this category today.