Daily arXiv Paper Briefing

2026-05-23 · 42 papers · grouped by research area

Automatic tracking · LLM overview · Research radar

42 Total Papers

12 Autoregressive

29 Diffusion

1 Image Compression

2026/05/23 10:00:00

AI Paper Daily Digest · 208 min read

arXiv papers · AI · vision encoders

Daily Radar

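Daily Overview

Today's tracker found 42 relevant papers. They are split by research area so that unrelated problem domains do not end up in one long list.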

Autoregressive

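12 papers

Today's tracker found 12 papers in the Autoregressive area. Each entry keeps the original abstract, authors, and links, so you can screen quickly first and then pick the papers worth a close read to file into org-roam.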

MRecover: A Conditional Generative Model for Recovering Motion-Corrupted MR images Using AI Generated Contrast

2026-05-22T04:00:00 · autoregressive, cs.AI, cs.CV · oai:arXiv.org:2605.21669v1

Authors: Jinghang Li, Tales Santini, Courtney Clark, Bruno de Almeida, Cong Chu, Salem Alkhateeb, Andrea Sajewski, Jacob Berardinelli, Hecheng Jin, Tobias Campos, Jeremy J. Berardo, Joseph Mettenburg, Ariel Gildengers, Howard J. Aizenstein, Minjie Wu, Tamer S. Ibrahim

Abstract:

Hippocampal subfield segmentation requires high-resolution T2w turbo spin echo (TSE) MRI, yet this sequence is susceptible to motion artifacts, leading to substantial data loss. We developed a conditional generative model (MRecover) that synthesizes TSE images from routinely acquired T1w images, with autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in-domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out-of-domain 3T data: subfield volumes from the synthesized and the as-acquired images closely matched (n=416, r=0.87-0.97), and the model yielded 31.8% more analyzable subjects in the motion-affected ADNI3 dataset after quality control (593 vs 450). Because of the larger sample size, the synthesized images also achieved larger effect sizes for diagnostic group differences in hippocampal subfield atrophy (whole hippocampus $\epsilon^2$ = 0.121-0.100 vs. 0.086-0.062, left-right hemispheres). Project page: https://jinghangli98.github.io/MRecover/

Announce type: new
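
The 31.8% figure in the MRecover abstract can be reproduced from the two subject counts it reports. The check below is a minimal sketch, assuming the percentage is the relative increase of 593 over 450; the function name is illustrative and does not come from the paper.

```python
def relative_increase(new: int, old: int) -> float:
    """Return the relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Analyzable subjects after quality control, as reported in the abstract.
gain = relative_increase(593, 450)
print(f"{gain:.1f}% more analyzable subjects")
```

Under that assumption, 143 extra subjects on a base of 450 rounds to the reported 31.8%.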

GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery

2026-05-22T04:00:00 · autoregressive, cs.CV · oai:arXiv.org:2605.22086v1

Authors: Zhiqing Hong, Zelong Li, Xiubin Fan, Guang Yang, Baoshen Guo, Haotian Wang, Tian He, Desheng Zhang

Abstract:

Human Activity Recognition (HAR) has shown remarkable effectiveness in various applications, such as smart healthcare and intelligent manufacturing. However, a major challenge faced by HAR is the distribution shift across different sensor data domains, which often leads to decreased performance when deployed for real-world applications. To address this issue, this paper introduces GenHAR, a novel framework designed to mitigate the domain gap by learning domain-invariant sensor representations. GenHAR aims to enhance the generalization capabilities of HAR on target domains purely with data from the source domain. The key novelty of GenHAR lies in two aspects. Firstly, GenHAR tokenizes sensor data and learns correlations among frequency sensor channel dimensions to improve the robustness of HAR models. Secondly, GenHAR improves the efficiency via selective masking and an efficient attention mechanism. We conduct a systematic analysis of GenHAR by comparing it with state-of-the-art HAR methods on real-world human activity datasets. Results show that GenHAR outperforms state-of-the-art methods by 9.97% in accuracy, and reduces Floating Point Operations by 6.4 times. Moreover, we deploy GenHAR at a leading logistics company in 4 cities, and have detected 2.15 billion real-time activities. We release our code at: https://github.com/Sensor-FoundationModel/GenHAR.

Announce type: new

LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

2026-05-22T04:00:00 · autoregressive, cs.AI, cs.CV · oai:arXiv.org:2605.22089v1

Authors: Xiaodong Mei, Diankun Zhang, Hongwei Xie, Guang Chen, Hangjun Ye, Dan Xu

Abstract:

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to conduct the future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action supervised methods and image-reconstruction-based world model approaches.

Announce type: new

REACH: Hand Pose Estimation from Room Corners

2026-05-22T04:00:00 · autoregressive, cs.CV · oai:arXiv.org:2605.22231v1

Authors: Shu Nakamura, Ryo Kawahara, Genki Kinoshita, Ryosuke Hirai, Yasutomo Kawanishi, Shohei Nobuhara, Ko Nishino

Abstract:

We introduce a novel 3D hand pose estimator that can accurately recover the shape and pose of people's hands in a room from afar, typically from fixed cameras at room corners, in extremely low-resolution and frequently occluded views. Our key idea is to fully leverage hand-body coordination, its temporal progression, and multiview observations. We achieve this with a novel Transformer-based model, in which hand and body configurations are modeled through correlations between their visual features expressed as per-view tokens, and their temporal coordination is exploited in an autoregressive manner. We introduce a novel dataset, which we refer to as REACH, Room-Environment dataset Annotated with Chest cameras for Hand pose estimation, to train and test our method. REACH is a first-of-its-kind large-scale hand pose dataset that captures accurate hand movements of 50 participants across a wide variety of daily activities. In order to avoid interfering with natural movements while annotating the hands with accurate shape and pose, we leverage concealed chest cameras. Through extensive experiments, including comparative studies with existing methods, we show that our model, REACH-Net, achieves highly accurate 3D hand pose estimation from afar. These results broaden the horizon of 3D hand pose estimation, especially towards "in-the-wild" continuous human behavior analysis.

Announce type: new

FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers

2026-05-22T04:00:00 · autoregressive, cs.AI, cs.CV · oai:arXiv.org:2605.22422v1

Authors: Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet

Abstract:

Table structure recognition (TSR) requires both table-level coherence (row/column counts, headers, spanning cells) and precise separator localization. We introduce FastTab, a grid-centric TSR model that avoids autoregressive HTML decoding by combining (i) a lightweight Tiny Recursive Module (TRM) for global reasoning and (ii) axial 1D Transformer encoders that capture long-range dependencies along rows and columns. The model predicts row/column counts, header rows, and separators to construct a grid, then infers rowspan/colspan using ROI-aligned cell features. Across four benchmarks (PubTabNet, FinTabNet, PubTables-1M, and SciTSR), FastTab achieves competitive structure recovery performance while operating at low-latency inference. We further study robustness under pixel-level anonymisation and show an extension to curved separators for camera-captured documents. The source code will be made publicly available at https://github.com/hamdilaziz/FastTab .

Announce type: new
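
The grid-construction step in the FastTab abstract can be pictured as turning predicted separator positions into cell boxes. The sketch below is not FastTab's code: the function name, the pixel-coordinate convention, and the example values are all assumptions made for illustration, and the rowspan/colspan inference is left out.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def build_grid(width: int, height: int,
               row_seps: List[int], col_seps: List[int]) -> List[List[Box]]:
    """Split a table image into cell boxes.

    row_seps holds the y-coordinates of separators between rows and
    col_seps holds the x-coordinates of separators between columns.
    """
    ys = [0] + sorted(row_seps) + [height]
    xs = [0] + sorted(col_seps) + [width]
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


# Hypothetical 300x90 table with two row separators and two column separators.
grid = build_grid(width=300, height=90, row_seps=[30, 60], col_seps=[100, 200])
print(len(grid), "rows x", len(grid[0]), "columns")
print(grid[1][2])  # box of the cell in row 1, column 2
```

Two separators per axis give a 3 x 3 grid; a spanning cell would later be expressed as a merge of several of these boxes.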

GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT

2026-05-22T04:00:00 · autoregressive, cs.CV · oai:arXiv.org:2605.22619v1

Authors: Shuo Jiang, Yuhao Hong, Chunbo Jiang, Weihong Chen, Huangwei Chen, Shenghao Zhu, Beining Wu, Mingxuan Liu, Zhu Zhu, Feiwei Qin, Min Tan, Yifei Chen

Abstract:

Grounding radiology report descriptions to 3D CT volumes is essential for verifiable clinical interpretation, yet remains challenging due to the semantic-spatial gap between free-text narratives and volumetric anatomy. Existing report-assisted and vision-language grounding methods typically rely on phrase-level alignment or dense pixel supervision, resulting in limited lesion-wise correspondence and suboptimal localization accuracy. We propose GLeVE, a graph-guided lesion grounding framework with anatomical prior verification and octree-based autoregressive refinement. GLeVE treats each lesion description as an atomic semantic unit and encodes organ attribution, attributes, and inter-lesion relations through relation-aware graph reasoning to produce discriminative lesion-wise queries. Anatomy-aware proposal generation with region-level verification enforces one-to-one text-lesion alignment, while hierarchical octree refinement progressively improves boundary delineation. Experiments on AbdomenAtlas 3.0 demonstrate consistent gains over classical multimodal foundation models and report-supervised baselines in both segmentation accuracy and lesion-level localization.

Announce type: new

WorldKV: Efficient World Memory with World Retrieval and Compression

2026-05-22T04:00:00 · autoregressive, cs.CV, diffusion · oai:arXiv.org:2605.22718v1

Authors: Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim

Abstract:

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/

Announce type: new
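
The World Compression idea, pruning tokens whose keys are redundant with an anchor frame, can be sketched in a few lines. This is not the authors' implementation: it assumes that "redundant" means "highest cosine similarity to any anchor key" and that exactly half of the tokens are kept; `prune_chunk` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_chunk(chunk_keys: List[Sequence[float]],
                anchor_keys: List[Sequence[float]],
                keep_ratio: float = 0.5) -> List[int]:
    """Return indices of the chunk tokens to keep.

    Assumption: a token is redundant when its key closely matches some
    anchor-frame key, so the tokens least similar to the anchor survive.
    """
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in chunk_keys]
    n_keep = max(1, int(len(chunk_keys) * keep_ratio))
    order = sorted(range(len(chunk_keys)), key=lambda i: redundancy[i])
    return sorted(order[:n_keep])


# Toy 2-D keys: tokens 0 and 3 point the same way as an anchor key.
anchor = [[1.0, 0.0], [0.0, 1.0]]
chunk = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.5], [0.0, 2.0]]
kept = prune_chunk(chunk, anchor)
print(kept)
```

Keeping half of the tokens halves the per-chunk storage, which is what lets twice as much history fit in a fixed budget.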

LACO: Adaptive Latent Communication for Collaborative Driving

2026-05-22T04:00:00 · autoregressive, cs.AI, cs.CV · oai:arXiv.org:2605.22504v1

Authors: Tianhao Chen, Yuheng Wu, Dongman Lee

Abstract:

Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordination. Though communicating in language provides intuitive information, it introduces two challenges: high latency caused by autoregressive decoding and information loss caused by compressing rich internal representations into discrete tokens. To address these challenges, we analyze latent communication in collaborative driving under inherent limitations of multi-agent settings. Our analysis reveals agent identity confusion, where direct fusion of latent states entangles decision representations across vehicles. Motivated by this, we propose LACO, a training-free LAtent COmmunication paradigm that seamlessly adapts pretrained driving models to collaborative settings. LACO introduces Iterative Latent Deliberation (ILD) for latent reasoning, Cross-Horizon Saliency Attribution (CHSA) for communication-efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego-centric decision making. Closed-loop experiments in CARLA show that LACO notably reduces communication and inference latency while maintaining strong collaborative driving performance.

Announce type: cross

InfVSR: Breaking Length Limits of Generic Video Super-Resolution

2026-05-22T04:00:00 · autoregressive, cs.CV, diffusion · oai:arXiv.org:2510.00948v2

Authors: Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, Yulun Zhang

Abstract:

Real-world videos often extend over thousands of frames. Existing generative video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor consistency caused by temporal decomposition, which introduces artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive-one-step-diffusion paradigm, and enables streaming inference with video diffusion priors. First, we adapt the pretrained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Our code and models are available at https://github.com/Kai-Liu001/InfVSR.

Announce type: replace
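
A rolling KV-cache, as mentioned in the InfVSR abstract, bounds memory by keeping only the most recent chunks. The class below is a generic sketch of that idea rather than InfVSR's code: the `RollingKVCache` name, the chunk granularity, and the string placeholders standing in for tensors are assumptions.

```python
from collections import deque
from typing import Deque, List, Tuple


class RollingKVCache:
    """Keep key/value entries for only the most recent `window` chunks."""

    def __init__(self, window: int) -> None:
        # A bounded deque drops the oldest chunk automatically on append.
        self._chunks: Deque[Tuple[list, list]] = deque(maxlen=window)

    def append(self, keys: list, values: list) -> None:
        """Store the keys and values produced for one new chunk."""
        self._chunks.append((keys, values))

    def context(self) -> Tuple[List, List]:
        """Return all cached keys and values, oldest chunk first."""
        keys = [k for ks, _ in self._chunks for k in ks]
        values = [v for _, vs in self._chunks for v in vs]
        return keys, values


cache = RollingKVCache(window=2)
for chunk_id in range(4):
    cache.append([f"k{chunk_id}"], [f"v{chunk_id}"])
print(cache.context())
```

After four chunks with a window of two, only chunks 2 and 3 remain, so the attention cost per step stays constant however long the video runs.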

UIKA: Fast Universal Head Avatar from Pose-Free Images

2026-05-22T04:00:00 · autoregressive, cs.CV · oai:arXiv.org:2601.07603v3

Authors: Zijian Wu, Boyao Zhou, Liangxiao Hu, Hongyu Liu, Yuan Sun, Xuan Wang, Xun Cao, Yujun Shen, Hao Zhu

Abstract:

We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of pose-free inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise facial correspondence estimation. Such correspondence estimation allows us to reproject each valid pixel color from screen space to UV space, which is independent of camera pose and character expression. Furthermore, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV tokens can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings.

Announce type: replace

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

2026-05-22T04:00:00 · autoregressive, cs.CV, diffusion · oai:arXiv.org:2602.02214v3

Authors: Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu

Abstract:

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in Self Forcing. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing.

Announce type: replace

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

2026-05-22T04:00:00 · autoregressive, cs.CV, cs.LG, diffusion · oai:arXiv.org:2605.16579v2

Authors: Kunyang Li, Mubarak Shah, Yuzhang Shang

Abstract:

Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to quadratic compute complexity in sequence length and memory usage due to key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than sequential updates. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to 2.26x wall-clock speedup and 54% memory reduction, while maintaining comparable quality and improving temporal consistency.

Announce type: replace
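
The inter-frame branch described in the ARL2 abstract keeps a fixed-size state instead of a growing cache. The sketch below shows one common form of gated linear attention; it is not the paper's exact formulation. It assumes a scalar gate, no normalization, and plain Python lists in place of tensors, and it omits the intra-frame softmax branch. All function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def read_state(state: Matrix, query: Vector) -> Vector:
    """Linear-attention read: out = query @ state."""
    d_v = len(state[0])
    return [sum(query[i] * state[i][j] for i in range(len(query)))
            for j in range(d_v)]


def update_state(state: Matrix, keys: List[Vector], values: List[Vector],
                 gate: float) -> Matrix:
    """Gated update: state <- gate * state + sum over tokens of outer(k, v)."""
    d_k, d_v = len(state), len(state[0])
    new = [[gate * state[i][j] for j in range(d_v)] for i in range(d_k)]
    for k, v in zip(keys, values):
        for i in range(d_k):
            for j in range(d_v):
                new[i][j] += k[i] * v[j]
    return new


def process_frame(state: Matrix, queries: List[Vector], keys: List[Vector],
                  values: List[Vector], gate: float = 0.9
                  ) -> Tuple[List[Vector], Matrix]:
    # Every token in the frame reads the same pre-update state; the state is
    # written once per frame, after the reads.
    outputs = [read_state(state, q) for q in queries]
    return outputs, update_state(state, keys, values, gate)


state = [[0.0, 0.0], [0.0, 0.0]]          # d_k x d_v, size never changes
q, k, v = [[1.0, 0.0]], [[1.0, 0.0]], [[2.0, 3.0]]
out1, state = process_frame(state, q, k, v)
out2, state = process_frame(state, q, k, v)
print(out1, out2)
```

The first frame reads an empty memory, and the second frame retrieves what the first one wrote, while the state stays a 2 x 2 matrix throughout.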

Diffusion

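29 papers

Today's tracker found 29 papers in the Diffusion area. Each entry keeps the original abstract, authors, and links, so you can screen quickly first and then pick the papers worth a close read to file into org-roam.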

UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

2026-05-22T04:00:00 · cs.CV, cs.LG, diffusion · oai:arXiv.org:2605.21611v1

Authors: Jiayun Wang, Yu Wang, Weijie Gan, Zhenting Wang, Wei Wei

Abstract:

We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL, that fuses visual and semantic intent with spatial locations in a single token sequence. A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings, eliminating the standalone text encoder, such as T5. Although this reframing uses a deliberately minimal text interface, it yields strong empirical gains. On UniVL-ImgGen, a benchmark of 477K mask-annotated images that we construct for training and evaluation, UniVL improves image quality over text-prompted baselines, reducing FID from 14 to 11 and increasing PSNR from 16 to 20. It also eliminates the text encoder entirely, reducing inference TFLOPs by up to 52% and runtime by up to 44%. Additional ablation studies validate the contributions of the proposed components, paving the way for efficient, spatially grounded image generation with a unified conditioning paradigm.

Announce type: new

BodyReLux: Temporally Consistent Full-Body Video Relighting

2026-05-22T04:00:00 · cs.CV, cs.GR, diffusion · oai:arXiv.org:2605.21766v1

Authors: Li Ma, Mingming He, Xueming Yu, David M. George, Ahmet Levent Taşel, Paul Debevec, Julien Philip

Abstract:

Being able to relight human performance is a fundamental task for post production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances and viewpoints. To acquire such a dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture and a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage the generative priors for producing high quality videos. To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances.

Announce type: new

Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.21907v1

Authors: Gang Dai, Yining Huang, Yiming Xia, Guohao Chen, Shuaicheng Niu

Abstract:

The efficient Test-Time Scaling (TTS) paradigm offers a promising perspective for enhancing the generation performance of diffusion models. However, current solutions are limited to a static, pre-defined noise pool and suffer from inflexible noise exploration across the denoising trajectory. To bridge this gap, we propose RTS, a novel Reward-guided Trajectory Scaling method to fully unlock the generative potential of diffusion models. Unlike existing methods, RTS facilitates the synthesis of refined, high-fidelity images via two core innovations: 1) a reward-guided noise optimization strategy to actively direct the search towards promising regions; and 2) a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space, effectively compressing the search space. Experiments show that our approach outperforms baselines by 15.6% in GenEval score and by 60.4% in ImageReward score, setting a new SOTA while providing a practical guideline for more effective test-time scaling across diffusion-specific architectures.

Announce type: new
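
The contrast the RTS abstract draws, picking from a static noise pool versus actively steering the search with a reward, can be illustrated with a toy example. This is not the RTS algorithm: it uses a simple hill-climbing stand-in, a hypothetical quadratic reward in place of an image reward model, and no denoising steps at all.

```python
import random
from typing import Callable, List

Noise = List[float]


def best_of_pool(pool: List[Noise], reward: Callable[[Noise], float]) -> Noise:
    """Static baseline: score a fixed pool of noises and keep the best one."""
    return max(pool, key=reward)


def reward_guided_search(start: Noise, reward: Callable[[Noise], float],
                         steps: int = 50, sigma: float = 0.3,
                         seed: int = 0) -> Noise:
    """Hill-climb in noise space: keep a perturbation only if reward improves."""
    rng = random.Random(seed)
    best, best_r = list(start), reward(start)
    for _ in range(steps):
        cand = [x + rng.gauss(0.0, sigma) for x in best]
        r = reward(cand)
        if r > best_r:
            best, best_r = cand, r
    return best


def toy_reward(z: Noise) -> float:
    """Hypothetical reward that peaks at z = (1, -1)."""
    return -((z[0] - 1.0) ** 2 + (z[1] + 1.0) ** 2)


pool_rng = random.Random(1)
pool = [[pool_rng.gauss(0, 1), pool_rng.gauss(0, 1)] for _ in range(4)]
static_best = best_of_pool(pool, toy_reward)
refined = reward_guided_search(static_best, toy_reward)
print(toy_reward(static_best), toy_reward(refined))
```

Because the search only accepts improvements, the refined noise never scores below the best pool member it started from.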

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.21981v1

中文标题：RiT: 在表示空间中，原始 (vanilla) 扩散Transformer就已足够

作者：Le Zhang, Ning Mang, Aishwarya Agrawal

摘要：

Flow matching with $x$-prediction (regressing the clean data point rather than the ambient velocity) is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d} \approx 33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet $256 \times 256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with 19% fewer parameters (676M vs. 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, 5 Heun steps already reach FID 2.0 and 10 steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.

摘要中文：

已知采用 $x$ 预测的流匹配 (回归干净的数据点而不是环境速度) 能够有效利用像素空间中的低维流形结构 \cite{li2025back}。我们探究: 预训练的表示空间在包含内在维度相当的低维数据流形的同时，是否提供了更有利于流匹配学习的分布。沿四个几何维度比较像素、SD-VAE和DINOv2特征，我们发现像素与DINOv2的内在维度几乎相同 (均为 $\hat{d} \approx 33$)，但DINOv2的有效秩高 $7.3\times$、协方差条件数好 $35\times$、超额峰度低 $11.5\times$、流形上插值误差低 $1.7\times$；SD-VAE潜变量始终处于两者之间，这表明优势源于表示学习目标，而不仅仅是压缩。这些统计特性使流匹配回归条件良好，并且不再需要先前DINOv2扩散方法所使用的专用预测头或黎曼传输。我们提出表示图像Transformer (Representation Image Transformer, RiT): 一种在冻结的DINOv2特征上通过 $x$ 预测训练的原始 (vanilla) 扩散Transformer，仅辅以维度感知的噪声调度和 [CLS] 与patch的联合建模。在ImageNet $256 \times 256$ 上，RiT在无引导时取得FID 1.45，在使用无分类器引导时取得1.14，以少19%的参数量 (676M对839M) 超过了DiT$^\text{DH}$-XL。所得ODE在粗离散化下即可高效求解: 使用无分类器引导时，5个Heun步即可达到FID 2.0，10步达到1.25，无需蒸馏或一致性训练。代码见 https://github.com/lezhang7/RiT。
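
示例 (编者核算，并非论文代码): 摘要称RiT的参数量比DiT$^\text{DH}$-XL少19% (676M对839M)。下面的最小示意仅用摘要中给出的这两个数字复核该比例；函数名relative_reduction为本示例自拟。

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """返回 ours 相对 baseline 减少的比例。"""
    return (baseline - ours) / baseline


# 参数量 (单位: 百万)，数值取自摘要
dit_dh_xl_params = 839.0
rit_params = 676.0

reduction = relative_reduction(dit_dh_xl_params, rit_params)
print(f"{reduction:.1%}")
```

按这两个数字计算得到的比例约为19.4%，与摘要中取整后的19%一致。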

Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22011v1

中文标题：基于输出相似性感知重新思考扩散模型的token缩减

作者：Hangyeol Lee, Hyojeong Lee, Joo-Young Kim

摘要：

Diffusion Transformers (DiTs) achieve superior image generation quality but suffer from quadratic computational complexity relative to token count. While various token reduction (TR) methods have been proposed to mitigate this cost, they overlook the primary objective of generative models: minimizing recovery error, which requires reflecting output token similarity. They rely solely on input token similarity inherited from reduction-only ViT paradigms, leading to a fundamental misalignment with this objective. To bridge this gap, we propose DiTo, a novel TR paradigm that shifts the focus toward output-centric token reduction. Based on the observation that output token similarity is consistently preserved across adjacent timesteps, DiTo utilizes prior-step similarities as an effective proxy to establish token correspondences at a Matching timestep, which are then reused across multiple subsequent Reduction timesteps. To optimize this interleaved scheduling, we propose Pair Match Ratio (PMR)-guided Interval Scheduling to determine the optimal matching frequency. Furthermore, to mitigate localized approximation errors and resulting blocking artifacts caused by repeated reuse, we propose Frequency-aware Token Matching by incorporating a selection-frequency penalty. Extensive experiments demonstrate that DiTo consistently outperforms existing TR methods with 1.6-3.9 dB higher PSNR at comparable speedups, achieving a superior Pareto frontier.

摘要中文：

扩散Transformer (DiT) 能够实现出色的图像生成质量，但其计算复杂度随token数量呈二次增长。虽然已有多种token缩减 (TR) 方法被提出以降低这一开销，但它们忽略了生成模型的主要目标: 最小化恢复误差，而这需要反映输出token的相似性。这些方法仅依赖输入token相似性，这一做法继承自只做缩减的ViT范式，因而与该目标存在根本性的错位。为弥合这一差距，我们提出DiTo，一种将重点转向以输出为中心的token缩减的新TR范式。基于输出token相似性在相邻时间步之间稳定保持这一观察，DiTo将上一步的相似性作为有效代理，在匹配 (Matching) 时间步建立token对应关系，随后在多个后续的缩减 (Reduction) 时间步中重复使用。为优化这种交错调度，我们提出由配对匹配率 (PMR) 引导的间隔调度，以确定最佳匹配频率。此外，为减轻重复复用带来的局部近似误差及由此产生的块状伪影，我们提出频率感知token匹配，在匹配中加入选择频率惩罚。大量实验表明，DiTo在相近的加速比下PSNR比现有TR方法高1.6-3.9 dB，始终优于后者，实现了更优的帕累托前沿。
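
示例 (示意性草图，并非DiTo的官方实现): 摘要描述的核心思路是用上一时间步的输出token相似性建立匹配，再在后续缩减时间步复用。下面用纯Python给出一个最小示意，假设采用二分划分的做法 (偶数位token作为保留目标，奇数位token作为被缩减的候选)；这一划分方式、各函数名以及用余弦相似度度量相似性均为本示例的假设，摘要中的PMR间隔调度和选择频率惩罚未在此实现。

```python
import math


def cosine(a, b):
    """两个向量的余弦相似度；任一向量为零向量时返回 0.0。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_tokens(prev_outputs, num_reduce):
    """用上一时间步的输出建立 {被缩减token下标: 目标token下标} 的匹配。"""
    dst = list(range(0, len(prev_outputs), 2))  # 假设: 偶数位保留
    src = list(range(1, len(prev_outputs), 2))  # 假设: 奇数位为候选
    candidates = []
    for s in src:
        best = max(dst, key=lambda d: cosine(prev_outputs[s], prev_outputs[d]))
        candidates.append((cosine(prev_outputs[s], prev_outputs[best]), s, best))
    candidates.sort(reverse=True)  # 相似度最高的配对优先被缩减
    return {s: d for _, s, d in candidates[:num_reduce]}


def reduce_tokens(tokens, matches):
    """丢弃已匹配的候选token，只保留其余token参与计算。"""
    return [t for i, t in enumerate(tokens) if i not in matches]


def restore_tokens(reduced_outputs, matches, num_tokens):
    """把目标token的输出复制回被丢弃的位置，恢复完整序列。"""
    kept = [i for i in range(num_tokens) if i not in matches]
    full = [None] * num_tokens
    for i, out in zip(kept, reduced_outputs):
        full[i] = out
    for s, d in matches.items():
        full[s] = full[d]
    return full


# 匹配时间步: 用上一时间步的输出建立匹配
prev_outputs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matches = match_tokens(prev_outputs, num_reduce=2)

# 缩减时间步: 复用同一份匹配
reduced = reduce_tokens(["t0", "t1", "t2", "t3"], matches)
restored = restore_tokens(["o0", "o2"], matches, 4)
print(matches, reduced, restored)
```

同一份matches可以在多个缩减时间步中重复传给reduce_tokens和restore_tokens，这对应摘要中匹配结果跨时间步复用的说法。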

ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

2026-05-22T04:00:00 · cs.AR, cs.CV, diffusion · oai:arXiv.org:2605.22015v1

中文标题：ORBIS: 具有分布感知匹配的输出引导令牌减少，用于视频扩散加速

作者：Hangyeol Lee, Joo-Young Kim

摘要：

Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal acceleration. To overcome these limitations, we propose ORBIS, an SW-HW co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures global token distribution and explicitly minimizes token-pair loss for additional gains. To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, occupying only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.

摘要中文：

扩散Transformer (DiT) 已成为生成高质量图像和视频的强大模型架构。在视频DiT中，三维时空注意力使token长度随帧数成比例增长，导致计算成本急剧上升。token缩减方法通过利用空间冗余来降低这一开销，但现有方法依赖于不准确的相似度估计和轻量级匹配算法，导致匹配质量较差且加速效果有限。为克服这些局限性，我们提出了ORBIS，这是一种面向视频DiT的软硬件协同设计加速器。ORBIS利用前一时间步的输出激活来获得更精确的token间相似度，从而显著提升匹配质量，并支持更高的token缩减比。我们进一步提出了一种分布感知的token匹配 (DATM) 算法，该算法能够捕捉全局token分布，并通过显式地最小化token对损失来获得额外的性能提升。为完全隐藏DATM的延迟，我们设计了专用的深度流水线硬件，并通过量化将其硬件开销降至最低，仅占总面积的2.4%，且精度损失可忽略不计。大量实验表明，ORBIS的token缩减比约为当前最优方法AsymRnR的2倍，同时与NVIDIA A100 GPU相比，最高可实现4.5倍加速并降低79.3%的能耗。

Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22017v1

中文标题：多样化但一致: 上下文引导的扩散与基于能量的联合细化，用于多智能体运动预测

作者：Lei Chu, Yuhuan Zhao

摘要：

Deep generative models have become a promising approach for human motion prediction due to their ability to capture multimodal distributions and represent diverse human behaviors. However, generating predictions that are both diverse and jointly consistent among interacting agents remains challenging. In addition, most existing approaches are primarily evaluated using single-agent (marginal) metrics, which fail to fully reflect the joint dynamics of multi-agent interactions. We propose a diffusion-based framework that improves multi-agent motion prediction by leveraging rich contextual information from historical trajectories. This information is incorporated through a guidance mechanism to enhance the diversity and expressiveness of predicted motions. To further enforce interaction consistency, we introduce an energy-based formulation that refines the joint trajectory distribution while preserving the plausibility of individual trajectories. Extensive experiments on four benchmark datasets demonstrate that our approach consistently outperforms existing methods. Notably, our approach substantially improves both marginal (ADE/FDE) and joint (JADE/JFDE) metrics on ETH/UCY over strong marginal baselines. Compared with prior joint prediction methods, it delivers significant gains in marginal metrics while maintaining competitive joint performance.

摘要中文：

深度生成模型能够捕获多模态分布并表示多样的人类行为，因此已成为人类运动预测的一种有前景的方法。然而，生成既多样又在相互作用的智能体之间联合一致的预测仍然具有挑战性。此外，大多数现有方法主要使用单智能体 (边际) 指标进行评估，无法充分反映多智能体交互的联合动态。我们提出了一种基于扩散的框架，通过利用历史轨迹中丰富的上下文信息来改进多智能体运动预测。这些信息通过引导机制融入模型，以增强预测运动的多样性和表现力。为进一步保证交互一致性，我们引入了一种基于能量的形式化方法，在保持单个轨迹合理性的同时对联合轨迹分布进行细化。在四个基准数据集上的大量实验表明，我们的方法始终优于现有方法。值得注意的是，相比强大的边际基线，我们的方法在ETH/UCY上大幅提升了边际 (ADE/FDE) 和联合 (JADE/JFDE) 指标。与先前的联合预测方法相比，它在保持有竞争力的联合性能的同时，在边际指标上取得了显著提升。
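
示例 (示意性草图): 摘要区分了边际指标 (ADE/FDE) 与联合指标 (JADE/JFDE)。下面按一种常见定义给出最小示意: 边际指标对每个智能体独立地在K个样本中取最优，联合指标要求所有智能体使用同一个样本。具体定义以论文为准，此处的数据布局samples[k][a][t]和各函数名均为本示例的假设。

```python
import math


def ade(pred, gt):
    """单条轨迹的平均位移误差: 各时间步L2距离的均值。"""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred, gt):
    """单条轨迹的终点位移误差: 最后一个时间步的L2距离。"""
    return math.dist(pred[-1], gt[-1])


def marginal_min(samples, gt, err):
    """边际指标: 每个智能体各自取K个样本中的最小误差，再对智能体取平均。"""
    n_agents = len(gt)
    return sum(min(err(s[a], gt[a]) for s in samples) for a in range(n_agents)) / n_agents


def joint_min(samples, gt, err):
    """联合指标: 先在每个样本内对智能体取平均，再取K个样本中的最小值。"""
    n_agents = len(gt)
    return min(sum(err(s[a], gt[a]) for a in range(n_agents)) / n_agents for s in samples)


# 两个智能体、两个时间步的真值轨迹
gt = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]]
# 两个联合样本: 样本0中智能体0完全正确，样本1中智能体1完全正确
samples = [
    [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 2.0), (1.0, 2.0)]],
    [[(0.0, 1.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 1.0)]],
]

print(marginal_min(samples, gt, ade), joint_min(samples, gt, ade))
print(marginal_min(samples, gt, fde), joint_min(samples, gt, fde))
```

在这个玩具例子里，每个智能体都能在某个样本中找到完全正确的轨迹，因此边际误差为0；但没有任何一个样本同时对两个智能体都正确，所以联合误差不为0。这说明了为什么仅看边际指标可能高估多智能体预测的质量。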

Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22050v1

中文标题：破碎的记忆: 利用退化的生成结果检测并缓解扩散模型中的记忆现象

作者：Yuanmin Huang, Mi Zhang, Chen Chen, Feifei Li, Geng Hong, Xiaoyu You, Min Yang

摘要：

While diffusion models excel at generating high-quality images, their tendency to memorize training data poses significant privacy and copyright risks. In this work, we for the first time identify that memorization induces internal numerical instability, often manifesting as visually "broken" artifacts. Inspired by stability analysis in numerical methods, we introduce empirical stability regions based on latent update norms to quantitatively characterize stable behavior during generation. Leveraging this, we propose a principled, on-the-fly framework for step-wise detection and adaptive mitigation. Our approach suppresses memorization without altering prompts or guidance, thereby preserving semantic fidelity and image quality. Extensive experiments on Stable Diffusion 1.4 demonstrate that our method achieves an AUC $>0.999$ detection performance and a 0.0% memorization rate after mitigation with negligible overhead ($\approx 0.01$ s per image).

摘要中文：

虽然扩散模型擅长生成高质量图像，但其记忆训练数据的倾向带来了显著的隐私和版权风险。在这项工作中，我们首次发现记忆会引发内部数值不稳定，通常表现为视觉上“破损”的伪影。受数值方法中稳定性分析的启发，我们基于潜变量更新范数引入经验稳定域，以定量刻画生成过程中的稳定行为。在此基础上，我们提出了一个有原则的即时框架，用于逐步检测和自适应缓解。我们的方法在不改变提示词或引导的情况下抑制记忆，从而保持语义保真度和图像质量。在Stable Diffusion 1.4上的大量实验表明，我们的方法取得了AUC $>0.999$ 的检测性能，缓解后的记忆率为0.0%，且开销可忽略不计 (每张图像 $\approx 0.01$ s)。
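
示例 (示意性草图，并非论文实现): 摘要提到基于潜变量更新范数的经验稳定域，并据此逐步检测。下面的最小示意计算相邻去噪步之间潜变量变化的L2范数，并报告第一个超出给定上界的步；上界数值、各函数名以及只检查上界的做法均为本示例的假设，论文中稳定域的具体构造方式未在摘要中给出。

```python
import math


def update_norms(latents):
    """相邻两个去噪步之间潜变量变化量的L2范数。"""
    return [
        math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, cur)))
        for prev, cur in zip(latents, latents[1:])
    ]


def first_unstable_step(latents, upper_bounds):
    """返回第一个更新范数超出上界的步下标；全部在界内时返回 None。"""
    for step, (norm, bound) in enumerate(zip(update_norms(latents), upper_bounds)):
        if norm > bound:
            return step
    return None


# 玩具数据: 前两步更新平稳，第三步出现突变
latents = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [4.0, 4.0]]
upper_bounds = [1.0, 1.0, 1.0]  # 假设的逐步上界，实际应由经验统计得到

print(update_norms(latents))
print(first_unstable_step(latents, upper_bounds))
```

检测到异常步之后如何进行自适应缓解属于论文方法的一部分，这里不作推测。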

Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates

2026-05-22T04:00:00 · cs.CV, diffusion, image_compression · oai:arXiv.org:2605.22061v1

中文标题：极低比特率下利用多模态边信息的分布式图像压缩

作者：Guojun Xu, Mingyang Zhang, Jianwen Xiang, Cheng Tan, Yanchao Yang, Junwei Zhou

摘要：

Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates (< 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: first, it guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; second, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.

摘要中文：

分布式图像压缩 (DIC) 对于多视图传输至关重要，尤其是在极低比特率 (< 0.1 bpp) 下运行时。其核心挑战是在严格的比特率预算下有效利用边信息实现高质量重建。然而，现有DIC方法难以利用边信息中的全局上下文和对象级细节，导致重建结果出现局部模糊和精细细节丢失。为解决这些局限，我们提出了多模态DIC框架 (MDIC)，首次在DIC范式中以多模态方式利用边信息，有效保留细粒度的局部细节并提升重建图像的全局感知质量。具体来说，我们引入了一种基于文本到图像扩散的解码器，该解码器以从相关图像中提取的文本边信息为条件，以捕获共享的全局语义。此外，我们设计了一个特征掩码生成器，由多模态细粒度对齐任务进行监督，以加强对视觉边信息的利用。生成的掩码有两个作用: 第一，它指导从无损传输的边信息中提取细粒度细节，以保持重建细节的语义一致性；第二，它调节从量化的VQ-VAE嵌入中提取聚类特征表示的过程，以补偿主图像在极端压缩下丢失的类别信息。在广泛使用的KITTI Stereo和Cityscapes数据集上的大量实验表明，MDIC在极低比特率下实现了最先进的感知质量。
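
示例 (编者核算): 摘要将极低比特率界定为低于0.1 bpp。下面的最小示意把bpp换算为单张图像的字节预算；512×256的图像尺寸仅为举例假设，并非论文的实验设置，函数名byte_budget为本示例自拟。

```python
def byte_budget(bpp: float, width: int, height: int) -> float:
    """返回给定bpp (每像素比特数) 和图像尺寸下的字节预算。"""
    return bpp * width * height / 8.0


# 假设的图像尺寸，仅用于说明量级
budget = byte_budget(0.1, 512, 256)
print(f"{budget:.1f} bytes")
```

在这一假设尺寸下，0.1 bpp对应的预算不到2 KB，这有助于理解为什么在该码率下需要借助边信息和生成式先验来补足细节。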

MotionDPS: Motion-Compensated 3D Brain MRI Reconstruction

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22121v1

中文标题：MotionDPS: 运动补偿3D脑MRI重建

作者：Antonio Ortiz-Gonzalez, Erich Kobler, Lukas Schletter, Alexander Effland

摘要：

Magnetic resonance imaging (MRI) is highly susceptible to patient motion due to its relatively long acquisition times and the fact that data are acquired sequentially in k-space. Even small patient movements introduce phase inconsistencies across measurements, leading to severe artifacts such as blurring, ghosting, and geometric distortions that can compromise diagnostic quality. Retrospective motion compensation remains challenging, particularly in accelerated acquisitions, due to the ill-posed nature of the joint reconstruction and motion estimation problem. In this work, we propose a unified Bayesian framework for motion-compensated 3D MRI that jointly estimates the anatomical image, rigid-body motion parameters, and coil sensitivity maps directly from motion-corrupted k-space data. Our approach integrates pretrained 3D complex-valued score-based diffusion models as expressive anatomical image priors within a physics-based forward model. Inference is performed by alternating diffusion posterior image updates with efficient proximal optimization steps for motion and coil sensitivity estimation, enabling fully unsupervised reconstruction without the need for paired motion-free training data. Experiments on simulated and real-motion brain MRI datasets demonstrate that the proposed method achieves improved image quality and motion robustness compared to state-of-the-art classical and learning-based motion correction techniques, particularly in the presence of severe motion and high acceleration.

摘要中文：

磁共振成像 (MRI) 的采集时间相对较长，且数据在k空间中按顺序采集，因此极易受到患者运动的影响。即使是轻微的患者运动也会在各次测量之间引入相位不一致，从而导致模糊、重影和几何失真等严重伪影，可能损害诊断质量。由于联合重建与运动估计问题的不适定性，回顾性运动补偿仍然具有挑战性，在加速采集中尤其如此。在这项工作中，我们提出了一个用于运动补偿3D MRI的统一贝叶斯框架，直接从受运动破坏的k空间数据中联合估计解剖图像、刚体运动参数和线圈灵敏度图。我们的方法将预训练的3D复值基于分数的扩散模型作为表达能力强的解剖图像先验，集成到基于物理的前向模型中。推断通过交替执行扩散后验图像更新与用于运动和线圈灵敏度估计的高效近端优化步骤来完成，从而实现完全无监督的重建，无需成对的无运动训练数据。在模拟和真实运动的脑部MRI数据集上的实验表明，与最先进的经典方法和基于学习的运动校正技术相比，所提方法取得了更好的图像质量和运动鲁棒性，在存在剧烈运动和高加速倍数的情况下尤为明显。
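
示例 (示意性草图): 摘要指出，即使轻微的运动也会在测量之间引入相位不一致。其背后是傅里叶平移定理: 图像域的平移等价于在k空间乘以线性相位。下面用纯Python的一维离散傅里叶变换做最小演示，仅考虑整数像素的循环平移；旋转、线圈灵敏度以及论文中的扩散先验均未涉及，各函数名为本示例自拟。

```python
import cmath


def dft(x):
    """一维离散傅里叶变换 (朴素 O(n^2) 实现)。"""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """一维离散傅里叶逆变换。"""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * j / n) for k in range(n)) / n
        for j in range(n)
    ]


def shift_in_kspace(spectrum, shift):
    """在k空间乘以线性相位，等价于图像域循环右移 shift 个采样点。"""
    n = len(spectrum)
    return [spectrum[k] * cmath.exp(-2j * cmath.pi * k * shift / n) for k in range(n)]


signal = [0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0]
shifted = idft(shift_in_kspace(dft(signal), 2))
print([round(v.real, 6) for v in shifted])
```

如果采集过程中不同的k空间数据对应不同的平移量，各部分就会带有互不一致的相位，重建图像中便会出现摘要所说的模糊和重影。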

Flow-based Gaussian Splatting for Continuous-Scale Remote Sensing Image Super-Resolution

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22147v1

中文标题：基于流的高斯泼溅用于连续尺度遥感图像超分辨率

作者：Jiangwei Mo, Xi Lu, Hanlin Wu

摘要：

High-resolution remote sensing images (RSIs) are crucial for Earth observation applications, yet acquiring them is often limited by sensor constraints and costs. In recent years, generative super-resolution (SR) methods, particularly diffusion models, have made significant progress. However, they typically require slow iterative inference with 40-1000 steps and exhibit limited flexibility in continuous-scale SR settings. To address these issues, we propose FlowGS, a generative reconstruction framework for arbitrary-scale SR of RSIs. FlowGS models the high-frequency detail representations between high- and low-resolution images and learns a continuous probability flow from noise to detail priors via flow matching (FM) constrained by shortcut consistency, thereby reducing generative complexity and improving inference efficiency. Additionally, we employ 2D Gaussian splatting to construct a continuous feature field, thereby enabling flexible reconstruction at arbitrary query locations. Experimental results show that FlowGS delivers competitive perceptual quality compared with existing methods in both continuous-scale and fixed-scale SR settings, with substantially improved inference efficiency.

摘要中文：

高分辨率遥感图像 (RSI) 对地球观测应用至关重要，但其获取往往受到传感器约束和成本的限制。近年来，生成式超分辨率 (SR) 方法，尤其是扩散模型，取得了重大进展。然而，它们通常需要40至1000步的缓慢迭代推理，并且在连续尺度SR设置中灵活性有限。为解决这些问题，我们提出FlowGS，一种用于RSI任意尺度SR的生成式重建框架。FlowGS对高分辨率与低分辨率图像之间的高频细节表示进行建模，并通过受捷径一致性约束的流匹配 (FM) 学习从噪声到细节先验的连续概率流，从而降低生成复杂度并提高推理效率。此外，我们采用2D高斯泼溅来构建连续特征场，从而可以在任意查询位置进行灵活重建。实验结果表明，在连续尺度和固定尺度SR设置下，FlowGS与现有方法相比具有有竞争力的感知质量，同时推理效率大幅提升。

D3Seg: Dependency-Aware Diffusion for Brain Tumor Segmentation with Missing Modalities

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22249v1

中文标题：D3Seg: 用于缺失模态的脑肿瘤分割的依赖感知扩散

作者：Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan

摘要：

Accurate brain tumor segmentation using multiparametric MRI is critical for effective treatment planning. However, in clinical settings, complete acquisition of all MRI sequences is not always possible. The absence of certain MRI modalities results in substantial performance degradation in existing segmentation methods, which typically rely on naive feature concatenation or direct fusion strategies. To address this limitation, we propose a novel segmentation model D3Seg which is designed to maintain stable performance under missing-modality settings. D3Seg introduces Multi-hop Modality Graph Fusion (MMGF) to model higher order inter-modality dependencies, a lightweight diffusion-based imputation mechanism to compensate for missing T1ce representations in latent space, and probability-space decision refinement to mitigate dominant class overconfidence and improve delineation of underrepresented tumor subregions. Extensive evaluation on BraTS 2023 dataset demonstrates that our D3Seg model consistently improves segmentation performance under missing modality configurations. The proposed model achieves approximately 1.5-2.0% Dice improvement on enhancing tumor (ET) and around 1.0% on tumor core (TC) across multiple missing modality configurations compared to the current state-of-the-art model, while maintaining computational efficiency.

摘要中文：

使用多参数MRI进行准确的脑肿瘤分割对于有效的治疗规划至关重要。然而，在临床环境中，并不总能完整采集所有MRI序列。某些MRI模态的缺失会导致现有分割方法的性能大幅下降，这些方法通常依赖朴素的特征拼接或直接融合策略。为解决这一局限，我们提出了一种新的分割模型D3Seg，旨在缺失模态设置下保持稳定的性能。D3Seg引入多跳模态图融合 (MMGF) 来建模高阶模态间依赖，引入轻量级的基于扩散的插补机制来补偿潜在空间中缺失的T1ce表示，并引入概率空间决策细化来缓解主导类别的过度自信，改善对代表性不足的肿瘤子区域的勾画。在BraTS 2023数据集上的大量评估表明，D3Seg模型在缺失模态配置下能够稳定提升分割性能。与当前最先进的模型相比，所提模型在多种缺失模态配置下，在增强肿瘤 (ET) 上取得约1.5-2.0%的Dice提升，在肿瘤核心 (TC) 上取得约1.0%的提升，同时保持了计算效率。
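
示例 (示意性草图): 摘要以Dice系数报告分割性能的提升。下面给出二值掩码上Dice系数的标准计算方式 $\text{Dice} = 2|P \cap T| / (|P| + |T|)$；两个掩码都为空时返回1.0是本示例采用的一种约定，未必与论文的评测脚本一致。

```python
def dice(pred, target):
    """二值掩码 (展平后的0/1序列) 之间的Dice系数。"""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # 约定 (本示例的假设): 两个掩码均为空时视为完全一致
    return 1.0 if total == 0 else 2.0 * intersection / total


pred_mask = [1, 1, 0, 0]
target_mask = [1, 0, 1, 0]
print(dice(pred_mask, target_mask))
```

对多类别的肿瘤子区域 (如ET、TC)，通常是对每个区域各自构造二值掩码后分别计算Dice。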

PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22311v1

中文标题：PIU: ID条件扩散模型中的邻近引导身份遗忘

作者：Jose Edgar Hernandez Cancino Estrada, Mauro Díaz Lupone, Žiga Emeršič, Vitomir Štruc, Peter Peer, Darian Tomašević

摘要：

Identity-conditioned diffusion models enable high-quality and identity-consistent face generation, but they also raise severe privacy concerns, as models may continue to synthesize individuals despite their right to be forgotten. While machine unlearning has been extensively studied for concept and data removal, identity unlearning remains largely unexplored, particularly in models conditioned directly on identity embeddings rather than text prompts. In this work, we study identity unlearning in Arc2Face, a state-of-the-art identity-conditioned latent diffusion model for face generation, and introduce Proximity-guided Identity Unlearning (PIU), an anchor-guided framework for identity unlearning. Specifically, we formulate identity removal as an identity replacement objective that reassigns the source identity to a selected anchor identity in the learned identity space, and we complement it with a proximity-based anchor selection strategy motivated by the geometry of ArcFace representations. We further show that effective unlearning can be achieved through localized fine-tuning of a small subset of identity-sensitive cross-attention layers. Experiments across many target identities show that our framework effectively suppresses generation of the target identity while preserving realism and identity consistency for retained identities, as validated by improved performance on unlearning and image-quality metrics, together with qualitative evaluation. The source code for the PIU framework is publicly available at https://github.com/edgarcancinoe/piu_unlearning .

摘要中文：

身份条件扩散模型能够实现高质量且身份一致的人脸生成，但也引发了严重的隐私问题，因为即使个人享有被遗忘权，模型仍可能继续合成其形象。虽然机器遗忘 (machine unlearning) 在概念和数据移除方面已得到广泛研究，但身份遗忘在很大程度上仍未被探索，尤其是在直接以身份嵌入而非文本提示为条件的模型中。在这项工作中，我们研究了Arc2Face中的身份遗忘，Arc2Face是一种用于人脸生成的最先进的身份条件潜在扩散模型；我们提出了邻近引导的身份遗忘 (PIU)，一种用于身份遗忘的锚点引导框架。具体来说，我们将身份移除表述为身份替换目标，即在学习到的身份空间中将源身份重新分配给选定的锚点身份，并辅以一种受ArcFace表示几何结构启发的基于邻近度的锚点选择策略。我们进一步表明，只需对一小部分身份敏感的交叉注意力层进行局部微调，即可实现有效的遗忘。在大量目标身份上的实验表明，我们的框架能有效抑制目标身份的生成，同时保持保留身份的真实感和身份一致性，这一点由遗忘指标和图像质量指标上的性能提升以及定性评估所验证。PIU框架的源代码已在 https://github.com/edgarcancinoe/piu_unlearning 公开。

Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction

2026-05-22T04:00:00 · cs.AI, cs.CV, cs.RO, diffusion · oai:arXiv.org:2605.22420v1

中文标题：用于城市场景重建的扩散引导可泛化增强器

作者：Henry Che, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun

摘要：

Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting the applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe produces robust and high-fidelity representation efficiently that generalizes reliably to challenging unseen viewpoints (e.g., lane change). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.

摘要中文：

基于真实世界观测的城市场景重建已经成为自动驾驶开发和测试的有力工具。尽管现有的神经渲染方法能够在已采集的视点轨迹上实现高保真渲染，但在大范围视角变化下其重建质量会显著下降，从而限制了其在闭环仿真中的应用。近期的研究表明，利用扩散模型在这些具有挑战性的视角下提升图像质量，并将所取得的改进回溯至三维表征，已取得了令人鼓舞的成果。然而，它们通常需要代价高昂的逐场景优化，且所得到的蒸馏表示仍然较为脆弱，难以在有限的合成视图之外实现有效的泛化。为应对这些局限性，我们提出了GenRe，一种基于扩散模型的新型可泛化的城市场景重建增强器。GenRe以任意预训练的3D高斯表示为输入，并在几分钟内修复其存在的缺陷。通过学习在不同场景间提炼生成先验，GenRe能够高效地构建鲁棒且高保真的表征，并可靠地泛化至具有挑战性的未见视角（例如变道场景）。实验结果表明，GenRe在质量和效率方面均优于现有方法，并能有效提升多种下游任务的性能，从而为自动驾驶提供稳健且可扩展的传感器仿真能力。

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22668v1

中文标题：SEGA: 用于扩散Transformer分辨率外推的频谱能量引导注意力

作者：Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati

摘要：

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.

摘要中文：

扩散Transformer (DiT) 已成为文本到图像生成的主流架构，但在以超出训练范围的分辨率生成时，其性能会下降。现有的无需训练的方法通过调整推理阶段的注意力行为来缓解这一问题，通常采用旋转位置嵌入 (RoPE) 外推并结合注意力缩放。然而，这些策略对具有不同频率特性的RoPE各分量采用统一且与内容无关的缩放，从而在保持全局结构与恢复精细细节之间形成权衡。我们提出SEGA，这是一种无需训练的方法，它在每个去噪步骤中，根据潜变量的空间频率结构，对RoPE各分量的注意力进行动态缩放。这种自适应缩放能够同时提升结构一致性和细节保真度。实验表明，SEGA在多种目标分辨率下均能稳定提升高分辨率合成效果，优于当前最先进的无需训练基线方法。

WorldKV: Efficient World Memory with World Retrieval and Compression

2026-05-22T04:00:00 · autoregressive, cs.CV, diffusion · oai:arXiv.org:2605.22718v1

中文标题：WorldKV: 基于世界检索与压缩的高效世界记忆

作者：Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim

摘要：

摘要中文：

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22777v1

中文标题：DecQ: 用于增强表示自编码器重建与生成的细节凝聚查询

作者：Tianhang Wang, Yitong Chen, Wei Song, Zuxuan Wu, Min Li, Jiaqi Wang

摘要：

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3$\times$ faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.

摘要中文：

表示自编码器 (RAE) 利用冻结的视觉基础模型 (VFM) 作为tokenizer编码器，提供鲁棒的高层表示，有助于潜在扩散模型的快速收敛和高质量生成。然而，冻结VFM会从根本上限制其空间重建能力，从而制约细粒度生成与图像编辑；相比之下，通过微调引入以重建为导向的信号，则会破坏预训练得到的语义空间，并降低生成保真度。为应对这一权衡，我们提出了DecQ，一种简单而有效的RAE框架。具体而言，DecQ引入了轻量级的细节凝聚查询，通过凝聚器模块从中间VFM特征中提取细粒度信息。这些查询被整合到解码器中以支持重建，并在生成式建模过程中与patch token共同生成。通过聚合浅层与深层的信息，DecQ能够有效缓解重建与生成之间的权衡，同时提升重建质量和生成性能。我们的实验表明: (1) 仅需8个额外查询和3.9%的额外计算，DecQ即可改善基于冻结DINOv2的RAE的重建效果，将PSNR从19.13 dB提高到22.76 dB；(2) 在生成式建模方面，DecQ的收敛速度比RAE快3.3倍，在无引导时FID达到1.41，在有引导时达到1.05。
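
示例 (编者核算): 摘要报告PSNR从19.13 dB提高到22.76 dB。按PSNR的定义 $\text{PSNR} = 10 \log_{10}(\text{MAX}^2 / \text{MSE})$，在峰值相同的前提下，PSNR的增益可以换算为均方误差的缩小倍数。下面的最小示意完成这一换算；各函数名为本示例自拟。

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """由均方误差和信号峰值计算PSNR (单位: dB)。"""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gain(gain_db: float) -> float:
    """PSNR提高 gain_db 时，均方误差缩小为原来的多少分之一。"""
    return 10.0 ** (gain_db / 10.0)


# 两个PSNR数值取自摘要
gain = 22.76 - 19.13
ratio = mse_ratio_from_psnr_gain(gain)
print(f"gain = {gain:.2f} dB, MSE ratio = {ratio:.2f}")
```

也就是说，3.63 dB的增益大致对应均方误差降到原来的一半以下 (约为原来的1/2.3)。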

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.22809v1

中文标题：Sensor2Sensor: 面向自动驾驶的跨具身传感器转换

作者：Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng, Zehao Zhu, Meng-Li Shih, Xander Masotto, Shih-Yang Su, Kanaad V Parvate, Tiancheng Ge, Linn Bieske, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang

摘要：

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering. Sensor2Sensor then utilizes a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations on the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, further unlocking vast external data sources for AV development.

摘要中文：

自动驾驶系统 (ADS) 的鲁棒训练和验证需要大量多样化的数据集。自动驾驶车队所采集的专有数据虽然具有较高的保真度，但在规模、传感器配置的多样性以及地理覆盖范围和长尾行为的表征方面均存在局限。相比之下，来自行车记录仪等渠道的自然场景数据具有庞大的规模与丰富的多样性，能够捕捉到关键的长尾场景与全新环境。然而，这些非结构化的、真实场景下的视频数据与自动驾驶系统所期望的用于验证和训练的结构化多模态传感器输入并不兼容。为弥补这一数据缺口，我们提出了Sensor2Sensor，这是一种新颖的生成式建模范式，可将野外采集的单目车载摄像头视频转化为高保真的多模态传感器数据集（自动驾驶日志），其中包括多视角相机图像和激光雷达点云。一个核心挑战是缺乏成对的训练数据。为此，我们通过四维高斯泼溅（4DGS）重建与新视角渲染，将真实的AV日志转换为行车记录仪风格的视频。随后，Sensor2Sensor采用扩散模型架构来实现生成式转换。我们对所生成传感器数据的保真度与真实感进行了全面的定量评估。我们通过将极具挑战性的野外互联网视频与车载摄像头视频转换为逼真的多模态数据格式，展示了Sensor2Sensor的实用价值，从而为自动驾驶系统的研发进一步解锁了海量的外部数据源。

Hierarchical Variational Policies for Reward-Guided Diffusion

2026-05-22T04:00:00 · cs.AI, cs.CV, cs.LG, diffusion · oai:arXiv.org:2605.21661v1

中文标题：奖励引导扩散的分层变分策略

作者：Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt

摘要：

Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples at substantially reduced inference cost. Our approach formulates test-time adaptation as a hierarchical variational model, where control is amortized into a lightweight yet expressive stochastic policy. This formulation naturally supports few-step diffusion sampling: large step sizes enable fast inference, while the learned policy maintains sample quality by providing structured per-step control. The resulting fully amortized sampler achieves a strong quality-speed tradeoff, matching or exceeding recent test-time scaling baselines while requiring significantly less compute. For example, on 4x super-resolution, our method achieves better perceptual quality with more than 5x faster inference compared to the best-performing baseline. We further extend our approach to a semi-amortized regime that combines cheap amortized proposals with limited test-time optimization, achieving state-of-the-art perceptual quality across several challenging inverse problems.

摘要中文：

arXiv:2605.21661v1，公告类型：交叉列表。将预训练扩散模型适配到下游目标（如逆问题）通常需要代价高昂的测试时引导或优化。我们提出了一个有原则的框架，能以大幅降低的推理成本生成高质量、与奖励对齐的样本。我们的方法将测试时适配表述为一个分层变分模型，其中控制被摊销到一个轻量但表达力强的随机策略中。这一表述天然支持少步扩散采样：大步长带来快速推理，而学到的策略通过提供结构化的逐步控制来保持样本质量。由此得到的完全摊销采样器实现了出色的质量-速度权衡，在计算量显著更少的情况下达到或超过近期的测试时扩展（test-time scaling）基线。例如，在4倍超分辨率任务上，我们的方法取得了更好的感知质量，且推理速度比表现最好的基线快5倍以上。我们进一步将该方法扩展到半摊销模式，把低成本的摊销提议与有限的测试时优化相结合，在多个具有挑战性的逆问题上取得了最先进的感知质量。

The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution

2026-05-22T04:00:00 · cs.CL, cs.CV, cs.LG, diffusion · oai:arXiv.org:2605.22635v1

中文标题：多任务放射学报告生成中的双重困境: 梯度动力学分析与解决方案

作者：Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo

摘要：

While multi-task learning based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most focus on architectural designs yet remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity, but also avoids local optimal solutions. Then, the adaptive gradient fusion mechanism is used to establish a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at https://github.com/vpsg-research/CAME-Grad.

摘要中文：

arXiv:2605.22635v1，公告类型：交叉列表。基于多任务学习的自动放射学报告生成 (RRG) 被广泛用于保证临床一致性，但多数工作侧重于架构设计，仍局限于粗糙的线性标量化策略。这些策略无法有效兼顾判别式临床监督的硬约束与报告生成的平滑性要求。为解决这些问题，我们从梯度动力学的角度分析了线性标量化的失效机制，并借助随机微分方程（SDE）框架，将其刻画为漂移项偏离与扩散项衰减的“双重困境”。基于此，我们提出了一种与主干网络无关的优化器：冲突规避、幅度增强的梯度下降（CAME-Grad）。通过冲突规避的方向修正与幅度增强的能量注入，该算法不仅保证了几何有效性，还能避免陷入局部最优解。随后，采用自适应梯度融合机制，在理论最优方向与任务特定的归纳偏置之间建立动态平衡。实验表明，作为一种通用的即插即用优化器，CAME-Grad在八种不同的RRG方法上均带来了显著且一致的提升，整体临床有效性（clinical efficacy）指标在MIMIC-CXR上平均提升2.3%，在IU X-Ray上平均提升1.9%。代码见 https://github.com/vpsg-research/CAME-Grad。
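
摘要没有给出 CAME-Grad 的具体更新公式。下面只是一个示意性草图，用来说明“线性标量化”与“冲突规避的方向修正”这两类做法的差别。这里换用了广为人知的 PCGrad 式梯度投影（两个任务梯度内积为负时，把各自投影到对方的法平面上），它并不是 CAME-Grad 本身；函数名 `linear_scalarization`、`project_away`、`pcgrad_two_tasks` 以及示例梯度都是为演示而假设的。

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def linear_scalarization(g1: Vector, g2: Vector, w1: float = 1.0, w2: float = 1.0) -> Vector:
    """Weighted sum of task gradients (the baseline strategy named in the abstract)."""
    return [w1 * x + w2 * y for x, y in zip(g1, g2)]


def project_away(g: Vector, other: Vector) -> Vector:
    """PCGrad-style step (a stand-in, NOT CAME-Grad): if g conflicts with
    `other` (negative inner product), remove the conflicting component."""
    d = dot(g, other)
    if d >= 0.0:
        return list(g)
    scale = d / dot(other, other)
    return [x - scale * y for x, y in zip(g, other)]


def pcgrad_two_tasks(g1: Vector, g2: Vector) -> Vector:
    """Project each gradient against the original other one, then sum."""
    p1 = project_away(g1, g2)
    p2 = project_away(g2, g1)
    return [x + y for x, y in zip(p1, p2)]


# Hypothetical gradients: a classification-style task and a generation-style task.
g_cls = [1.0, 0.0]
g_gen = [-3.0, 1.0]

naive = linear_scalarization(g_cls, g_gen)
fixed = pcgrad_two_tasks(g_cls, g_gen)
print("naive update vs. g_cls:", dot(naive, g_cls))
print("projected update vs. g_cls:", dot(fixed, g_cls))
```

在这个假设的例子里，线性求和得到的更新方向与 `g_cls` 的内积为负，也就是会损害其中一个任务；投影之后的更新方向与两个任务梯度的内积都为正。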

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

2026-05-22T04:00:00 · cs.AI, cs.CV, cs.LG, diffusion · oai:arXiv.org:2311.04938v5

中文标题：改进的基于矩匹配高斯混合的DDIM采样

作者：Prasad Gabbur

摘要：

We propose using a Gaussian Mixture Model (GMM) as reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on COYO700M datasets respectively. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models. Code: https://github.com/pgabbur/ddim-gmm.

摘要中文：

arXiv:2311.04938v5，公告类型：替换。我们提出在去噪扩散隐式模型 (DDIM) 框架中使用高斯混合模型 (GMM) 作为反向转移算子（核）；DDIM是从预训练的去噪扩散概率模型 (DDPM) 中加速采样时使用最广泛的方法之一。具体来说，我们通过约束GMM的参数来匹配DDPM前向边缘分布的一阶和二阶中心矩。我们发现，矩匹配足以得到质量不低于甚至优于原始高斯核DDIM的样本。我们给出了以下实验结果：在CelebAHQ和FFHQ上训练的无条件模型、在ImageNet上训练的类条件模型，以及在COYO700M数据集上使用Stable Diffusion v2.1的文本到图像生成。结果表明，以FID和IS衡量，当采样步数较少时，使用GMM核能显著提升生成样本的质量。例如，在ImageNet 256x256上使用10个采样步时，GMM核取得了6.94的FID和207.85的IS，而高斯核分别为10.15和196.73。此外，我们为整流流匹配（rectified flow matching）模型推导了新的SDE采样器，并用所提出的方法进行了实验；在1-rectified flow和2-rectified flow模型上均观察到改进。代码：https://github.com/pgabbur/ddim-gmm。
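
“通过约束 GMM 参数来匹配一阶和二阶中心矩”可以用一个一维小例子来体会。下面是一个最小草图，并非论文的参数化方式：假设所有分量共享同一个方差，先把分量均值的偏移量中心化（保证混合均值等于目标均值），再从目标方差中扣掉分量均值带来的离散度，剩余部分作为各分量的方差。函数名 `moment_matched_gmm`、`gmm_mean_var`、`sample_gmm` 和所有数值都是为演示而假设的。

```python
import math
import random
from typing import List, Sequence, Tuple


def moment_matched_gmm(
    mu: float, var: float, offsets: Sequence[float], weights: Sequence[float]
) -> Tuple[List[float], List[float], float]:
    """Build a 1-D GMM whose mean is `mu` and whose variance is `var`.

    Assumption of this sketch: all components share one variance.
    """
    total = sum(weights)
    w = [x / total for x in weights]
    # Center the offsets so the weighted mean of the component means equals mu.
    shift = sum(wi * oi for wi, oi in zip(w, offsets))
    centered = [o - shift for o in offsets]
    # Variance contributed by the spread of the component means.
    spread = sum(wi * c * c for wi, c in zip(w, centered))
    if spread >= var:
        raise ValueError("offsets are too spread out for the target variance")
    means = [mu + c for c in centered]
    return w, means, var - spread


def gmm_mean_var(w: Sequence[float], means: Sequence[float], comp_var: float) -> Tuple[float, float]:
    """Closed-form mean and variance of the shared-variance mixture."""
    mean = sum(wi * mi for wi, mi in zip(w, means))
    var = comp_var + sum(wi * (mi - mean) ** 2 for wi, mi in zip(w, means))
    return mean, var


def sample_gmm(rng: random.Random, w: Sequence[float], means: Sequence[float], comp_var: float) -> float:
    """Draw one sample: pick a component, then sample its Gaussian."""
    k = rng.choices(range(len(w)), weights=w)[0]
    return rng.gauss(means[k], math.sqrt(comp_var))


# Hypothetical target moments and mixture shape.
w, means, comp_var = moment_matched_gmm(
    mu=0.5, var=2.0, offsets=[-1.0, 0.5, 1.0], weights=[0.2, 0.5, 0.3]
)
print(gmm_mean_var(w, means, comp_var))
```

这样得到的混合分布与同均值、同方差的单个高斯具有相同的前两阶矩，但形状可以是多峰或偏斜的；摘要中的方法正是在保持这两阶矩的前提下替换 DDIM 的高斯核。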

RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2509.23582v2

中文标题：RobuQ: 通过鲁棒激活量化将DiT推进到W1.58A2

作者：Kaicheng Yang, Xun Zhang, Haotong Qin, Yucheng Lin, Kaisen Yang, Xianglong Yan, Yulun Zhang

摘要：

Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configuration. To the best of our knowledge, RobuQ is the first achieving stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to average 2 bits. The code and models will be available at https://github.com/racoonykc/RobuQ .

摘要中文：

arXiv:2509.23582v2，公告类型：替换。扩散Transformer (DiT) 近来已成为图像生成的强大主干网络，在可扩展性和性能上优于U-Net架构。然而，其巨大的计算和内存开销阻碍了实际部署。尽管量化感知训练 (QAT) 在U-Net上已展现出潜力，但将其应用于DiT面临独特的挑战，这主要源于激活值的敏感性和分布复杂性。在这项工作中，我们指出激活量化是将DiT推向极低比特设置的主要瓶颈。为此，我们为DiT提出了一个系统的QAT框架，名为RobuQ。我们首先建立了一个强大的三值权重 (W1.58A4) DiT基线。在此基础上，我们提出RobustQuantizer来实现鲁棒的激活量化。我们的理论分析表明，Hadamard变换可以将未知的逐token分布转换为逐token的正态分布，为该方法提供了坚实的基础。此外，我们提出了AMPN，这是首个面向DiT的仅激活混合精度网络（Activation-only Mixed-Precision Network）流程。该方法在整个网络中使用三值权重，同时为各层分配不同的激活精度，以消除信息瓶颈。通过在无条件和有条件图像生成上的大量实验，RobuQ框架在低于4比特的量化配置下取得了DiT量化的最先进性能。据我们所知，RobuQ是首个在激活平均量化到2比特的情况下，仍能在ImageNet-1K等大型数据集上实现稳定且有竞争力的图像生成的方法。代码和模型将发布于 https://github.com/racoonykc/RobuQ 。
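
摘要提到 Hadamard 变换能让逐 token 的激活分布更接近正态分布。下面的草图只演示其中一个直观侧面：归一化的 Hadamard 旋转会把单个离群激活的能量均摊到所有通道上，同时保持总能量不变且可逆。它不是 RobustQuantizer 的实现；函数名 `fwht` 和示例向量都是为演示而假设的，向量长度需为 2 的幂。

```python
import math
from typing import List, Sequence


def fwht(x: Sequence[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = [float(v) for v in x]
    h = 1
    while h < n:
        # Butterfly stage: combine entries that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


# Hypothetical token activation with a single large outlier channel.
token = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rotated = fwht(token)
restored = fwht(rotated)

print("max |x| before:", max(abs(v) for v in token))
print("max |x| after: ", max(abs(v) for v in rotated))
```

旋转之后各通道的幅值变得均匀，低比特量化器的动态范围不再被单个离群值占满；再做一次同样的变换即可恢复原向量。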

InfVSR: Breaking Length Limits of Generic Video Super-Resolution

2026-05-22T04:00:00 · autoregressive, cs.CV, diffusion · oai:arXiv.org:2510.00948v2

中文标题：InfVSR: 打破通用视频超分辨率的长度限制

作者：Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, Yulun Zhang

摘要：

摘要中文：

MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2602.01760v2

中文标题：MagicFuse: 用于视觉和语义强化的单图像融合

作者：Hao Zhang, Yanping Zha, Zizhuo Li, Meiqi Gong, Jiayi Ma

摘要：

This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image. The code is publicly available at https://github.com/zhayanping/MagicFuse.

摘要中文：

arXiv:2602.01760v2，公告类型：替换。本文关注一个高度实用的场景：在恶劣条件下只有可见光成像传感器可用时，如何继续获得多模态图像融合的优势。为此，我们提出了单图像融合这一新概念，将传统的数据级融合扩展到知识级。具体来说，我们开发了MagicFuse，这是一种新颖的单图像融合框架，能够从单张低质量可见光图像中得到全面的跨光谱场景表示。MagicFuse首先引入了基于扩散模型的谱内知识强化分支和跨谱知识生成分支，二者分别挖掘可见光谱中被遮蔽的场景信息，并学习迁移到红外光谱的热辐射分布模式。在此基础上，我们设计了一个多域知识融合分支，整合来自这两个分支扩散流的概率噪声，并通过连续采样得到跨光谱场景表示。然后，我们同时施加视觉和语义约束，确保该场景表示既能满足人眼观察，又能支持下游语义决策。大量实验表明，尽管仅依赖单张退化的可见光图像，MagicFuse在视觉和语义表示上的性能仍可与使用多模态输入的最先进融合方法相当，甚至更好。代码已公开于 https://github.com/zhayanping/MagicFuse。

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

2026-05-22T04:00:00 · autoregressive, cs.CV, diffusion · oai:arXiv.org:2602.02214v3

中文标题：因果强迫: 正确完成自回归扩散蒸馏以生成高质量的实时交互式视频

作者：Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu

摘要：

摘要中文：

ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

2026-05-22T04:00:00 · cs.CV, cs.GR, diffusion · oai:arXiv.org:2604.17623v3

中文标题：ViPS: 面向自动绑定网格的视频引导姿态空间

作者：Honglin Chen, Karran Pandey, Rundi Wu, Matheus Gadelha, Yannick Hold-Geoffroy, Ayush Tewari, Niloy J. Mitra, Changxi Zheng, Paul Guerrero

摘要：

Kinematic rigs provide a structured interface for articulating 3D meshes but lack any associated pose space, i.e., an explicit representation of the plausible manifold of joint configurations for a given mesh. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters easily results in semantic and/or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce, artist-authored 4D datasets, or focus on reconstructing instances of individual motions, ViPS transfers generative video model priors into a universal distribution over the given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity without requiring manual regularizers. Our feedforward model reveals a smooth, compact, and controllable pose space. This, in turn, supports sampling for diverse shape variations, manifold projection for inverse kinematics, and temporally coherent trajectories for animation and keyframing. Further, the distilled 3D pose samples serve as semantic proxies to guide video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely using video priors, matches the performance of state-of-the-art models trained on synthetic artist-created 4D data in both plausibility and diversity. Additionally, as a universal model, ViPS exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.

摘要中文：

arXiv:2604.17623v3，公告类型：替换。运动学绑定（rig）为驱动3D网格的关节运动提供了结构化接口，但缺少与之对应的姿态空间，即对给定网格的合理关节配置流形的显式表示。没有这样的姿态空间，对原始绑定参数进行随机采样或手动操纵很容易造成语义和/或几何上的违例，例如解剖学上的过度伸展和非物理的自相交。我们提出了视频引导的姿态空间 (ViPS)，这是一种前馈框架，通过从预训练视频扩散模型中蒸馏运动先验，来发现自动绑定网格的有效关节姿态的潜在分布。与依赖稀缺的艺术家创作4D数据集、或专注于重建单个运动实例的现有方法不同，ViPS将生成式视频模型的先验迁移为给定绑定参数化上的通用分布。作用于蒙皮网格的可微几何验证器可以强制保证特定形状的完整性，而无需手工设计的正则项。我们的前馈模型给出了一个平滑、紧凑且可控的姿态空间。这进而支持：为得到多样的形状变化而采样、用于逆向运动学的流形投影，以及用于动画和关键帧的时间连贯轨迹。此外，蒸馏得到的3D姿态样本可作为语义代理来引导视频扩散，从而在生成式2D先验与结构化3D运动学控制之间形成闭环。评估表明，仅用视频先验训练的ViPS，在合理性和多样性上都与在合成的艺术家创作4D数据上训练的最先进模型相当。此外，作为通用模型，ViPS对分布外的物种和未见过的骨骼拓扑表现出稳健的零样本泛化能力。

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

2026-05-22T04:00:00 · autoregressive, cs.CV, cs.LG, diffusion · oai:arXiv.org:2605.16579v2

中文标题：局部注意，线性记忆：将线性注意力作为自回归视频扩散模型的跨帧记忆

作者：Kunyang Li, Mubarak Shah, Yuzhang Shang

摘要：

摘要中文：

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

2026-05-22T04:00:00 · cs.AI, cs.CV, diffusion · oai:arXiv.org:2605.17837v2

中文标题：面向高效的基于扩散模型的视频生成的时序感知剪枝方法

作者：Sheng Li, Yang Sui, Junhao Ran, Bo Yuan, Yue Dai, Xulong Tang

摘要：

Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token-importance across adjacent frames and suppress selection jitter; and (ii) performs token reselection in selected layers to align token pruning with layers' diverse semantic focus and avoid error accumulation in specific areas; it also (iii) adopt a timestep-level budget scheduling that prunes aggressively at early noisy steps and relaxes pruning during fidelity-critical refinement. The experimental results show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.

摘要中文：

arXiv:2605.17837v2，公告类型：替换。近年来，基于ViT架构的视频扩散模型实现了高质量的视频生成，但由于在生成过程中需要对长时空序列进行注意力计算，其计算开销依然巨大。令牌剪枝已被证明对视觉Transformer和视觉语言模型有效。然而，大多数现有的剪枝方法均基于注意力机制且逐帧进行，无法在视频生成任务中保证帧间至关重要的时序一致性。在实践中，简单地采用仅基于注意力的剪枝方法会导致明显的性能退化，表现为背景一致性变差、画面闪烁以及图像质量下降。为此，我们提出了TAPE，一种无需训练的、面向时序感知的剪枝方法，用于高效生成基于扩散模型的视频。TAPE（i）通过时间平滑处理，使相邻帧之间的令牌重要性保持一致，并抑制选择抖动；（ii）在选定的层中进行令牌重选，以使令牌剪枝与各层不同的语义侧重相匹配，避免特定区域的误差累积；此外，（iii）采用基于时间步的预算调度策略，在早期噪声较大的步骤中实施更激进的剪枝，而在对保真度要求较高的细化阶段则适当放宽剪枝力度。实验结果表明，TAPE在保持高视觉保真度的同时实现了显著的加速效果，优于现有的令牌缩减方法。
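
摘要中的（i）时间平滑和（iii）时间步预算调度可以用一个玩具例子来示意。下面的草图并非 TAPE 的实现：平滑方式（相邻帧滑动平均）、线性预算曲线、步数从 0（噪声最大）开始计数，以及函数名 `smooth_over_frames`、`keep_top_k`、`keep_ratio` 和所有分数，都是为演示而做的假设；摘要并未给出这些细节。

```python
from typing import List


def smooth_over_frames(scores: List[List[float]], radius: int = 1) -> List[List[float]]:
    """Average each token's importance over a window of neighboring frames."""
    n_frames = len(scores)
    out = []
    for f in range(n_frames):
        lo, hi = max(0, f - radius), min(n_frames, f + radius + 1)
        window = scores[lo:hi]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def keep_top_k(frame_scores: List[float], k: int) -> List[int]:
    """Indices of the k most important tokens in one frame, in ascending order."""
    order = sorted(range(len(frame_scores)), key=lambda t: frame_scores[t], reverse=True)
    return sorted(order[:k])


def keep_ratio(step: int, total_steps: int, early_keep: float = 0.5, late_keep: float = 1.0) -> float:
    """Assumed linear schedule: keep few tokens at early noisy steps, more later."""
    if total_steps <= 1:
        return late_keep
    frac = step / (total_steps - 1)
    return early_keep + (late_keep - early_keep) * frac


# Hypothetical per-frame token importance (3 frames x 3 tokens).
scores = [
    [0.9, 0.1, 0.5],
    [0.4, 0.6, 0.5],
    [0.9, 0.1, 0.5],
]
raw = [keep_top_k(s, 1) for s in scores]
smoothed = [keep_top_k(s, 1) for s in smooth_over_frames(scores)]
print("per-frame selection:", raw)
print("after smoothing:    ", smoothed)
```

在这个例子里，逐帧选择在中间一帧换了令牌，平滑之后三帧保留的是同一个令牌，这正是摘要所说的“抑制选择抖动”的含义。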

Findings of the Counter Turing Test: AI-Generated Image Detection

2026-05-22T04:00:00 · cs.CV, diffusion · oai:arXiv.org:2605.20787v2

中文标题：反图灵测试的研究结果：人工智能生成图像检测

作者：Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

摘要：

The rapid advancements in generative AI technologies, such as Stable Diffusion, DALL-E, and Midjourney, have significantly transformed the creation of synthetic visual content. While these models enable innovation across industries, they also pose serious challenges, including misinformation, disinformation, and biased content generation. The increasing realism of AI-generated images makes their detection a pressing concern for researchers, policymakers, and industry stakeholders. In this paper, we present the findings of the Defactify 4.0 workshop, which introduced the Counter Turing Test (CT2) for AI-Generated Image Detection. The competition consisted of two key tasks: (1) binary classification of images as either AI-generated or real and (2) identification of the specific generative model responsible for an AI-generated image. To facilitate this, we developed the MS COCOAI dataset, consisting of 50,000 synthetic images from multiple generative models alongside real-world images from the MS COCO dataset. Participants employed diverse detection strategies, including convolutional neural networks (CNNs), Vision Transformers (ViTs), frequency-based analysis, contrastive learning, and multimodal techniques. The results demonstrated that while AI-generated images can be detected with high accuracy (F1-score > 0.83), identifying the exact model used remains significantly more challenging (highest F1-score: 0.4986). These findings highlight the need for improved model fingerprinting, adversarial robustness, and real-time detection mechanisms.

摘要中文：

arXiv:2605.20787v2，公告类型：替换。生成式人工智能技术的迅猛发展，如Stable Diffusion、DALL-E和Midjourney等，已深刻改变了合成视觉内容的创作方式。尽管这些模型推动了各行业的创新，但也带来了严峻挑战，包括虚假信息、错误信息以及有偏见的内容生成等问题。人工智能生成图像的逼真度不断提升，使其检测成为研究人员、政策制定者及行业相关方亟待解决的重要问题。本文报告了Defactify 4.0研讨会的研究成果，该研讨会提出了用于人工智能生成图像检测的反图灵测试（CT2）。该竞赛包含两项核心任务：（1）对图像进行二分类，判定其为人工智能生成或真实；（2）识别生成某张人工智能生成图像的具体生成模型。为此，我们构建了MS COCOAI数据集，该数据集由来自多个生成模型的5万张合成图像以及MS COCO数据集中的真实图像组成。参与者采用了多种检测策略，包括卷积神经网络（CNN）、视觉Transformer（ViT）、基于频域的分析、对比学习以及多模态方法。结果表明，尽管人工智能生成图像可被以较高的准确率检测出来（F1分数>0.83），但识别其所使用的具体模型仍面临较大挑战（最高F1分数为0.4986）。这些研究结果凸显了改进模型指纹识别、对抗鲁棒性以及实时检测机制的必要性。
