
Daily arXiv Paper Digest

2026-05-28 · 53 papers · grouped by research area
Auto-tracking · LLM overview · Research radar
53 Total Papers
8 Autoregressive
42 Diffusion
3 Image Compression
0 1D Visual Tokenizer
0 Diffusion Visual Encoder
Daily Radar
Daily Overview

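Today's arXiv papers span three areas (autoregressive models, diffusion, and image compression) and show three trends: generative models unifying across modalities, finer-grained compression of inference cost, and generation quality advancing alongside safety. Autoregressive work focuses on structured generation (BrickAnything, SenBen) and on controllable speech and music, through a modular recipe (PilotTTS) and activation steering (Genre Controlled Music Generation). Diffusion models push deeper into real-world settings such as video, 3D pose, and financial time series; E³C and Neuro‑Inspired Inverse Learning combine environmental memory with planning and control. KV‑cache quantization and Domain‑Gated Latent Diffusion cut inference cost, while other papers examine safety issues such as membership inference attacks. Image compression work starts from feature distillation and low-rank coding to support efficient generation. The directions intersect in unified discrete diffusion frameworks, concept grounding, and activation steering, which suggests that future generative systems will become more modular, interpretable, and resource-friendly.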

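  • Quantized Keys Steal Attention: Bias Correction for KV‑Cache Compression in Video Diffusion – identifies the Jensen bias that low-bit key quantization adds to attention weights and corrects it on the fly, reaching near-BF16 video quality at INT2.
  • E³C: Video Generation with 3D Environmental Memory and Ego‑Exo Human Pose Control – combines a point-cloud 3D environmental memory with ego and exo human pose control for egocentric video generation.
  • BrickAnything: Geometry‑Conditioned Buildable Brick Generation with Structure‑Aware Tokenization – introduces a structure-aware tree tokenization so that autoregressively generated brick sequences stay physically buildable.
  • PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis – a lightweight autoregressive TTS recipe trained on 200K hours of data processed with open-source tools, with competitive results on Seed-TTS Eval.
  • Self‑Cascaded Diffusion Models for Arbitrary‑Scale Image Super‑Resolution – uses a self-cascading mechanism to support super-resolution at arbitrary scales, making high-resolution generation more flexible.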
Autoregressive
8 papers

Autoregressive: Daily Overview

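Today's autoregressive papers center on multimodal applications of generative models and on techniques for optimizing them. The overall trend is that autoregressive frameworks are moving beyond language generation into vision, speech, and music. Core directions include inference efficiency (for example KV-cache compression), controllability and alignment of the generation process, and unified cross-modal architectures. Tokenization is also becoming more prominent in visual generation, where it may borrow the next-token prediction idea from language models.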

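Recommended papers:

  • BrickAnything (arxiv.org/abs/2605.26182) - Proposes a structure-aware tree tokenization that gives geometry-conditioned generation an autoregressive formulation; worth watching for its potential in modeling complex brick structures.
  • PilotTTS (arxiv.org/abs/2605.27258) - A compact autoregressive TTS system; its Q-Former-based conditioning, which decouples speaker identity from speaking style, is the design choice most worth studying.
  • Aligning Few-Step Generative Models (arxiv.org/abs/2605.26552) - Aligns few-step generators with sample-based variational inference (SVGD) amortized into the generator parameters, requiring only sample access; suited to applications that need fast inference.
  • Muddit (arxiv.org/abs/2505.23606) - A unified discrete diffusion framework that goes beyond text-to-image; its token-based generation is technically related to the autoregressive paradigm and is a useful reference for multimodal generation research.
  • Quantized Keys (arxiv.org/abs/2605.26266) - A bias correction for KV-cache quantization that cuts cache memory in chunk-wise autoregressive video diffusion models while preserving quality, with practical deployment value.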

BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization

2026-05-27T04:00:00 · autoregressive, cs.AI, cs.GR · 2605.26182

Authors: Zhengyang Ni, Feng Yan, Yu Guo, Fei Wang

Abstract:

Generating physically buildable brick structures from 3D shapes requires more than geometric reconstruction: the output must also satisfy discrete part constraints and structural stability. Existing brick generation methods either rely on heuristic optimization, which can break down when the target 3D shape does not admit a feasible structure under predefined constraints, or generate brick sequences without explicitly modeling the underlying 3D geometry and assembly relations. In this work, we present BrickAnything, a geometry-conditioned autoregressive framework for generating buildable brick structures from diverse 3D representations. BrickAnything uses point clouds as a unified geometric interface and predicts brick sequences that reconstruct the target shape under assembly constraints. To model structural dependencies among bricks, we introduce a structure-aware tree tokenization, which represents brick structures through local attachment relations. This formulation makes sequence generation more consistent with the physical construction process, and reduces invalid intermediate states. We further introduce preference-based alignment post-training, validity-constrained decoding and adaptive rollback to improve buildability objectives such as stability and geometric fidelity. Extensive experiments demonstrate that BrickAnything produces geometrically faithful and physically realizable brick structures, and that the proposed tokenization effectively reduces rollback and regeneration compared with conventional ordering strategies.
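
The abstract does not spell out how validity-constrained decoding and rollback interact, so the following is only a minimal sketch of the general idea: filter proposals through a validity check and back up one step when no valid proposal remains. The function names, the depth-first rollback policy, and the toy "fill a row of width 5 with bricks of length 2 or 3" task are all illustrative assumptions, not BrickAnything's actual procedure.

```python
def decode_with_rollback(propose, is_valid, is_complete, max_steps=1000):
    """Depth-first decoding: keep only valid tokens, roll back one step when stuck.

    propose(seq) returns candidate tokens in preference order (a stand-in for
    the model's ranking); is_valid(seq, token) is the hypothetical constraint check.
    """
    sequence = []
    tried = [set()]  # tried[d] holds tokens already attempted at depth d
    rollbacks = 0
    for _ in range(max_steps):
        if is_complete(sequence):
            return sequence, rollbacks
        candidates = [
            t for t in propose(sequence)
            if t not in tried[-1] and is_valid(sequence, t)
        ]
        if candidates:
            token = candidates[0]
            tried[-1].add(token)
            sequence.append(token)
            tried.append(set())
        else:
            if not sequence:
                raise ValueError("no valid sequence exists")
            sequence.pop()      # undo the last brick
            tried.pop()         # forget attempts made below it
            rollbacks += 1
    raise RuntimeError("step budget exhausted")


# Toy task: bricks of length 2 or 3 must exactly fill a row of width 5.
result = decode_with_rollback(
    propose=lambda seq: [2, 3],
    is_valid=lambda seq, t: sum(seq) + t <= 5,
    is_complete=lambda seq: sum(seq) == 5,
)
print(result)  # ([2, 3], 1)
```

The single rollback happens after the greedy prefix `[2, 2]` leaves no brick that fits; an ordering that avoids such dead ends is the kind of saving the abstract attributes to the tree tokenization.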

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

2026-05-27T04:00:00 · autoregressive, cs.AI, cs.CV, cs.GR, cs.LG, diffusion, eess.IV · 2605.26266

Authors: Tuna Tuncer, Felix Becker, Thomas Pfeil

Abstract:

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
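
The Jensen bias can be checked numerically. The sketch below assumes a simple noise model that is not taken from the paper: each key coordinate receives independent uniform rounding error in [-step/2, step/2], so the logit noise has variance sigma² = ||q||² · step² / (12 · d) after the 1/sqrt(d) scaling. Because exp is convex, the average of exp(logit + noise) exceeds exp(logit); to second order the inflation factor is about exp(sigma²/2), so subtracting sigma²/2 from each noisy logit removes most of it. The paper's exact estimator may differ.

```python
import math
import random


def jensen_bias_demo(dim=64, step=1.0, trials=5000, seed=0):
    """Compare clean, noisy, and bias-corrected unnormalized attention weights."""
    rng = random.Random(seed)
    query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = 1.0 / math.sqrt(dim)
    clean_logit = scale * sum(q * k for q, k in zip(query, key))

    # Variance of the logit noise under the assumed uniform rounding error.
    sigma_sq = scale ** 2 * sum(q * q for q in query) * step ** 2 / 12.0

    raw_sum = 0.0
    corrected_sum = 0.0
    for _ in range(trials):
        noise = scale * sum(q * rng.uniform(-step / 2, step / 2) for q in query)
        raw_sum += math.exp(clean_logit + noise)
        # Per-score correction: shift the noisy logit down by sigma_sq / 2.
        corrected_sum += math.exp(clean_logit + noise - sigma_sq / 2.0)

    clean = math.exp(clean_logit)
    return clean, raw_sum / trials, corrected_sum / trials, sigma_sq


clean, raw, corrected, sigma_sq = jensen_bias_demo()
print(f"inflation without correction: {raw / clean:.4f}")
print(f"inflation with correction:    {corrected / clean:.4f}")
```

In a softmax over cached (quantized) and current (unquantized) keys, only the cached keys receive this inflation, which is how they end up taking attention mass from the current chunk.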

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

2026-05-27T04:00:00 · autoregressive, cs.AI, cs.LG · 2605.26552

Authors: Jaewoo Lee, Hyeongyu Kang, Dohyun Kim, Kyuil Sim, Woocheol Shin, Minsu Kim, Taeyoung Yun, Jeongjae Lee, Sanghyeok Choi, Tabitha Edith Lee, Jong Chul Ye, Jinkyoo Park

Abstract:

Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV, Few-step Generative Models Alignment via Sample-based Variational Inference, a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward-tilted distribution anchored to a reference distribution. We leverage Stein Variational Gradient Descent as a sample-based variational inference scheme and amortize its particle updates into the generator parameters via fixed-point regression. We evaluate FAV on two domains: robotics manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline-to-online RL tasks. For image generator alignment, FAV fine-tunes diverse few-step backbones, including GAN, drifting model, consistency models, and flow maps, scaling from ImageNet-$256$ to 1024$^2$ text-to-image synthesis. Code is available at https://github.com/Jaewoopudding/FAV.
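
For readers unfamiliar with Stein Variational Gradient Descent, here is a one-dimensional sketch of the particle update on a reward-tilted target. Everything concrete is an assumption made for illustration: the reference is N(0, 1), the reward is r(x) = -(x - 2)², the temperature is beta = 2, and the kernel is an RBF with fixed bandwidth. Under these choices the tilted density is Gaussian with mean 1. FAV's amortization step (regressing the generator onto the updated particles) is not shown.

```python
import math
import random


def grad_log_target(x, beta=2.0):
    """Gradient of log p_ref(x) + r(x) / beta for the assumed reference and reward."""
    return -x - 2.0 * (x - 2.0) / beta


def svgd_step(particles, step=0.1, bandwidth=0.5):
    """One SVGD update: kernel-weighted attraction plus kernel-gradient repulsion."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))
            grad_k = -(xj - xi) / bandwidth ** 2 * k  # derivative of k w.r.t. xj
            phi += k * grad_log_target(xj) + grad_k
        updated.append(xi + step * phi / n)
    return updated


rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(30)]  # samples from the reference
for _ in range(500):
    particles = svgd_step(particles)
print(sum(particles) / len(particles))  # should settle near the tilted mean of 1
```

Note that the update only needs samples and the gradient of the tilted log-density, which matches the abstract's emphasis on sample access rather than a tractable generator likelihood.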

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

2026-05-27T04:00:00 · autoregressive, cs.AI, cs.SD · 2605.27258

Authors: Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang, Mingli Jin, Yihang Lin, Jiahui Zhao, Weibo Xiong, Dongrui Zhang, Keming Chen, Yunze Gao, Zeyang Lin, Yuze Zhou, Yue Liu

Abstract:

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.

Genre Controlled Music Generation via Activation Steering

2026-05-27T04:00:00 · autoregressive, cs.AI, cs.SD, eess.AS · 2506.10225

Authors: Swathi Narashiman, Pranay Mathur, Dipanshu Panda, Jayden Koshy Joe, Harshith M R, Anish Veerakumar, Aniruddh Krishna, Keerthiharan A

Abstract:

Computational Music Generation is evolving towards non-conventional styles, demanding methods that enable precise and controllable blending of diverse music elements. In this work, we present a method for fine-grained control using inference-time interventions on an autoregressive generative transformer, MusicGen. Through our approach, we achieve genre control by steering the residual stream using weights of a linear probe on it. By framing activation steering as a human-controllable interaction, our work highlights how interpretable model behaviors can empower co-creative music generation. Audio samples demonstrating our method are available on our demo page.
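
The core operation, shifting a residual-stream vector along a linear probe's weight direction, can be sketched in a few lines. This is a generic illustration under stated assumptions: the unit-normalization of the direction, the strength parameter `alpha`, and the helper names are hypothetical, and hooking the shift into a particular MusicGen layer is not shown.

```python
import math


def probe_logit(hidden, probe_weights, bias=0.0):
    """Linear probe: the genre logit read off a residual-stream vector."""
    return sum(h * w for h, w in zip(hidden, probe_weights)) + bias


def steer(hidden, probe_weights, alpha):
    """Add alpha times the unit-norm probe direction to the hidden state."""
    norm = math.sqrt(sum(w * w for w in probe_weights))
    return [h + alpha * w / norm for h, w in zip(hidden, probe_weights)]


hidden = [0.2, -0.1, 0.5]
probe = [3.0, 0.0, 4.0]  # toy probe weights with norm 5
steered = steer(hidden, probe, alpha=2.0)
print(probe_logit(hidden, probe), probe_logit(steered, probe))
```

Steering by `alpha` raises the probe logit by `alpha` times the norm of the probe weights, which gives a user a single dial for how strongly the target genre is expressed.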

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

2026-05-27T04:00:00 · autoregressive, cs.AI, cs.CV, cs.LG, cs.MM · 2604.08819

Authors: Fatih Cagatay Akyon, Alptekin Temizel

Abstract:

Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at $7.6\times$ faster inference and $16\times$ less GPU memory.

Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation

2026-05-27T04:00:00 · autoregressive, cs.CV, diffusion · 2507.16116

Authors: Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel

Abstract:

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa V1.0, a versatile model that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Note that VTA is a non-destructive adaptation, which means that it fully preserves the capabilities of the base model. Unlike conventional methods like Wan-I2V, which finetune a base text-to-video (T2V) model with abundant resources to do image-to-video (I2V), we achieve comparable results in a zero-shot manner after an ultra-efficient finetuning process based on VTA. Moreover, this method also unlocks many other zero-shot capabilities simultaneously, such as start-end frames and video extension -- all without task-specific training. Meanwhile, it keeps the T2V capability from the base model. Mechanistic analyses also reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to the vectorized timestep. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike.
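
The difference between a scalar and a vectorized timestep is easiest to see in the noising step. The sketch below gives each frame its own timestep in [0, 1] and assumes a linear interpolation between the clean frame and Gaussian noise; the abstract does not state the noise schedule, so that interpolation and the function name are assumptions.

```python
import random


def noise_frames(frames, timesteps, rng):
    """Noise each frame with its own timestep t: (1 - t) * frame + t * noise."""
    if len(frames) != len(timesteps):
        raise ValueError("need exactly one timestep per frame")
    noisy = []
    for frame, t in zip(frames, timesteps):
        noisy.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in frame])
    return noisy


rng = random.Random(0)
frames = [[0.5, -0.5], [1.0, 2.0], [3.0, 4.0]]  # three tiny two-pixel "frames"

# Scalar timestep: every frame is noised to the same level.
synchronized = noise_frames(frames, [0.7, 0.7, 0.7], rng)

# Vectorized timestep: frame 0 stays clean and acts as the image condition.
image_to_video = noise_frames(frames, [0.0, 0.7, 0.7], rng)
print(image_to_video[0] == frames[0])  # True
```

Setting the first and last entries to 0 would correspond to the start-end frame case, and keeping a clean prefix of several frames to video extension, which is one way to read how a single model covers those tasks without task-specific training.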

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

2026-05-27T04:00:00 · autoregressive, cs.CV, cs.LG, diffusion · 2505.23606

Authors: Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

Abstract:

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce the second-generation Meissonic: Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

Diffusion
42 papers

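Today's diffusion research gives equal weight to multimodal controllable generation, efficiency, and safety and privacy. Video generation keeps heating up: from 3D environment awareness and human pose control to precise temporal control, several papers work on making generated video more controllable and physically realistic. In image generation, applications continue to widen, from molecular and materials discovery to financial data synthesis. Inverse problems remain an important direction, covering super-resolution, anomaly detection, and physical field reconstruction. Notably, the number of safety and privacy papers has increased markedly, reflecting the community's close attention to model memorization and membership privacy leakage.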

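Recommended reading:

  • PARE: Pruning and Adaptive Routing for Efficient Video Generation – improves video generation efficiency through pruning and adaptive routing, which matters for practical deployment.
  • A Unified Framework for Diffusion Model Unlearning with f-Divergence – offers a systematic approach to unlearning in diffusion models, directly addressing data privacy and copyright governance needs.
  • Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models – moves reward alignment from pairwise to listwise comparisons to model preferences more precisely.
  • TAG: Tangential Amplifying Guidance for Hallucination-Resistant Sampling – a guidance method aimed at reducing hallucinations during sampling, improving the reliability of generated outputs.
  • Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction – brings geometry awareness into multi-view 3D reconstruction to make it more robust.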

Co-folding model guided by structural proteomics

2026-05-27T04:00:00 · cs.AI, cs.LG, diffusion, q-bio.BM · 2605.26192

Authors: Alon Shtrikman, Nitzan Simchi, Michal Ran Shchory, Sagie Brodsky, Eran Seger, Kirill Pevzner

Abstract:

Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the correct conformational state of protein complexes, critical for protein design and induced proximity modalities such as antibodies and PROTACs. While structural proteomics techniques like Cross-Linking Mass Spectrometry (XL-MS) and Hydrogen-Deuterium Exchange (HDX-MS) offer valuable spatial and dynamic insights, integrating these sparse, heterogeneous measurements into these models remains an open challenge. Here, we bridge this gap by combining structural proteomics data with the rich biophysical priors learned by pretrained diffusion models. We introduce AIMS-Fold, an inference-time guided-diffusion framework that actively steers the generative sampling trajectory using differentiable physical potentials derived from XL-MS spatial restraints and HDX-MS solvent accessibility profiles. We demonstrate that these structural methods individually enhance predictive accuracy, and their integration yields synergistic improvement. Crucially, by leveraging these experimental restraints, AIMS-Fold achieves higher accuracy on challenging induced proximity targets than purely computational, unguided state-of-the-art models like Boltz-2. This establishes our framework as a powerful, integrative computational approach for the structure based drug design of induced proximity drugs. Evaluation code will be made publicly available upon publication.
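
To make "differentiable physical potentials derived from XL-MS spatial restraints" concrete, the sketch below uses a flat-bottom harmonic penalty on cross-linked residue pairs and takes one gradient step on the coordinates. The functional form, the 30 Å cutoff, the guidance scale, and the function names are assumptions chosen for illustration; the abstract does not give AIMS-Fold's actual potentials, and the denoiser update that such a gradient would be added to is omitted.

```python
import math


def crosslink_energy_and_grad(coords, links, max_dist):
    """Flat-bottom harmonic penalty: zero within max_dist, quadratic beyond it."""
    energy = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j in links:
        diff = [a - b for a, b in zip(coords[i], coords[j])]
        dist = math.sqrt(sum(d * d for d in diff))
        excess = dist - max_dist
        if excess > 0.0:
            energy += excess ** 2
            for axis in range(3):
                g = 2.0 * excess * diff[axis] / dist
                grad[i][axis] += g
                grad[j][axis] -= g
    return energy, grad


def guided_step(coords, links, max_dist, scale=0.1):
    """Move coordinates down the restraint gradient (the guidance term only)."""
    _, grad = crosslink_energy_and_grad(coords, links, max_dist)
    return [[c - scale * g for c, g in zip(point, gpoint)]
            for point, gpoint in zip(coords, grad)]


coords = [[0.0, 0.0, 0.0], [40.0, 0.0, 0.0]]  # two residues 40 Å apart
links = [(0, 1)]                               # one observed cross-link
before, _ = crosslink_energy_and_grad(coords, links, max_dist=30.0)
after, _ = crosslink_energy_and_grad(guided_step(coords, links, 30.0), links, 30.0)
print(before, after)  # the penalty drops as the pair is pulled together
```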

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

2026-05-27T04:00:00 · autoregressive, cs.AI, cs.CV, cs.GR, cs.LG, diffusion, eess.IV · 2605.26266

Authors: Tuna Tuncer, Felix Becker, Thomas Pfeil

Abstract:

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.

E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

2026-05-27T04:00:00 · cs.AI, cs.CV, diffusion · 2605.26316

Authors: Qiao Gu, Lingni Ma, Adam W Harley, Richard Newcombe, Florian Shkurti, Julian Straub

Abstract:

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E$^3$C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately. The observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer's body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attention tokens. Experiments on Nymeria show that E$^3$C improves visual fidelity, camera-motion accuracy, object consistency, and ego & exo human control over strong baselines, while also enabling intuitive scene editing.

Personalized Generative Models for Contextual Debiasing

2026-05-27T04:00:00 · cs.AI, cs.CV, cs.LG, diffusion · 2605.26353

Authors: Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu, Vikram V. Ramaswamy, Olga Russakovsky

Abstract:

Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.

DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection

2026-05-27T04:00:00 · cs.AI, cs.LG, diffusion · 2605.26446

Authors: Yuxin Yang, Limei Hu, Feng Chen

Abstract:

Graph anomaly detection (GAD) aims to identify nodes or substructures whose behavior or attributes deviate significantly from the overall pattern in graph-structured data, with critical applications in financial risk control, social network analysis, and cybersecurity. However, existing GCN-based methods suffer from the fundamental problem of contamination propagation, where anomalous nodes pollute the representations of their neighbors through message passing, leading to degraded detection performance. In this paper, we propose DDGAD, a novel diffusion-based graph anomaly detection framework that leverages trajectory dynamics to distinguish normal and anomalous nodes. Our key insight is that normal nodes exhibit consistent and stable representation trajectories under the coupled effects of diffusion regularization and reliability-aware neighborhood consensus, while anomalous nodes exhibit unstable and conflicting dynamics due to the directional disagreement between the global manifold prior and locally contaminated message passing. To mitigate contamination propagation, we introduce a distributed reliability-aware consensus refinement mechanism and define three complementary anomaly signals: neighbor inconsistency, reliability weight, and dynamical conflict energy. We further provide a preliminary theoretical analysis on normal node stability under the coupled dynamics. These signals collectively characterize anomalous behaviors from the perspectives of local inconsistency, consensus reliability, and dynamical instability. Extensive experiments on five real-world datasets demonstrate the effectiveness of the proposed framework.

Cross-scale Aligned Supervision for Training GANs

2026-05-27T04:00:00 · cs.AI, cs.CV, diffusion · 2605.26449

Authors: Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

Abstract:

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.
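
One plausible reading of a "generator-side consistency regularization that aligns intermediate outputs with the final output" is a penalty between each intermediate image and a downsampled copy of the final image. The sketch below assumes average pooling and a mean-squared error; neither choice is stated in the abstract, and the function names are hypothetical.

```python
def downsample(image, factor):
    """Average-pool a square image (list of rows) by an integer factor."""
    size = len(image) // factor
    return [
        [
            sum(image[r * factor + dr][c * factor + dc]
                for dr in range(factor) for dc in range(factor)) / factor ** 2
            for c in range(size)
        ]
        for r in range(size)
    ]


def consistency_loss(intermediates, final):
    """Sum of per-scale MSEs between each intermediate and the pooled final image."""
    total = 0.0
    for inter in intermediates:
        target = downsample(final, len(final) // len(inter))
        pixels = len(inter) ** 2
        total += sum((a - b) ** 2
                     for row_i, row_t in zip(inter, target)
                     for a, b in zip(row_i, row_t)) / pixels
    return total


final = [[1, 1, 3, 3], [1, 1, 3, 3], [5, 5, 7, 7], [5, 5, 7, 7]]
aligned = [[1, 3], [5, 7]]   # a coarse view of the same sample
drifted = [[0, 3], [5, 7]]   # realistic-looking, but a different sample
print(consistency_loss([aligned], final), consistency_loss([drifted], final))  # 0.0 0.25
```

A scale-wise discriminator alone could rate both coarse images as equally realistic; a term of this kind is what ties the coarse stage to the sample the final stage actually produces.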

AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

2026-05-27T04:00:00 · cs.AI, cs.CV, diffusion · 2605.26460

中文标题:AnchorDiff:基于锚点图传播的免训练MM-DiT概念定位方法

作者:Jian Zhang, Zhijun Zhang

摘要:

Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.

摘要中文:

多模态扩散变换器(MM-DiTs)能够为无训练的概念对齐编码出丰富的表征,但现有的基于注意力的方法在处理视觉上易混淆的概念时,往往会产生重叠的激活,这种失效模式被称为“概念泄漏”,即目标响应会溢出到非目标对象上。为解决这一问题,我们提出了AnchorDiff,这是一种无需训练的定位方法,它将语义定位与结构细化相分离。AnchorDiff从概念到图像的注意力图中选取置信度较高的锚点,并将其作为独热种子,沿由图像到图像自注意力构建的混合图谱进行传播。该图采用输出空间相似性实现对象内部的密集传播,并通过行级注意力门控机制抑制跨对象连接。此外,我们提出了多概念混淆数据集,该数据集包含具有多个视觉上相似概念的图像及相应的分割掩码,从而能够对概念泄露进行明确的评估。实验结果表明,AnchorDiff在ImageNet-分割和Pascal VOC数据集上均取得了优异的定位性能,同时在我们构建的多概念混淆数据集上显著降低了概念泄露问题。
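The anchor-then-propagate idea can be sketched on a toy graph. This is a minimal illustration, not the paper's algorithm: the affinity matrix is assumed to be given (the paper derives it from self-attention with output-space similarity and a row-wise gate), and the function name, `steps`, and `alpha` are hypothetical.

```python
def propagate_from_anchor(concept_attn, affinity, steps=10, alpha=0.8):
    """Pick the highest-attention patch as the anchor, then spread a one-hot
    seed over a row-normalized patch graph (simple label propagation)."""
    n = len(concept_attn)
    anchor = max(range(n), key=lambda i: concept_attn[i])
    seed = [1.0 if i == anchor else 0.0 for i in range(n)]

    # Row-normalize the affinity so each patch averages over its neighbours.
    rows = []
    for row in affinity:
        s = sum(row)
        rows.append([v / s if s > 0 else 0.0 for v in row])

    score = seed[:]
    for _ in range(steps):
        spread = [sum(rows[i][j] * score[j] for j in range(n)) for i in range(n)]
        # Blend the propagated scores with the seed to keep the anchor dominant.
        score = [alpha * spread[i] + (1 - alpha) * seed[i] for i in range(n)]
    return anchor, score
```

With a block-diagonal affinity (no cross-object edges), patches of a visually similar second object receive a score of zero even when their raw concept attention is high, which is the leakage-suppression behaviour the abstract describes.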

Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection

2026-05-27T04:00:00 · cs.AI, cs.LG, diffusion · 2605.26468

中文标题:通过扩散进行检测:用于无监督集成电路异常检测的生成式扩散模型

作者:Yuxuan Yin, Chen He, Todd Jacobs, Jialei He, Boxun Xu, Robert Jin, Peng Li

摘要:

Latent defect screening is challenged by extremely low failure rates, high-dimensional test data, and absence of labeled anomalies. We propose the first unsupervised anomaly detection framework incorporating a Diffusion Transformer. Raw test measurements are first compressed by an autoencoder, then reshaped into a structured token sequence enriched with sinusoidal and per-device wafer-position embeddings. Anomaly scores are derived from the noise-prediction error over mid-range diffusion timesteps, enabling fast wafer-scale screening without any labeled defects or manual feature engineering. Our approach achieves state-of-the-art performance on industrial 16nm IC test data under extreme class imbalance, offering interpretable failure localization through latent-space reconstruction residuals.

摘要中文:

潜在缺陷检测面临诸多挑战,包括故障率极低、测试数据维度高以及缺乏标注的异常样本。我们提出了首个融合扩散Transformer的无监督异常检测框架。原始测试数据首先通过自编码器进行压缩,随后被重新塑形为一种结构化的标记序列,并在其中融入了正弦位置编码以及基于每台设备的晶圆位置嵌入。异常得分由中等扩散步长区间内的噪声预测误差计算得到,可在无需任何缺陷标注或人工特征工程的情况下实现晶圆级的快速筛查。我们的方法在极端类别不平衡的工业16纳米集成电路测试数据上取得了当前最佳性能,并通过潜在空间重构残差实现了可解释的故障定位。
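Scoring by noise-prediction error can be sketched as follows. This assumes the standard DDPM forward process x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε; here `x0` stands in for the autoencoder latent, and `predict_noise`, `alphas_bar`, and the choice of timesteps are hypothetical placeholders rather than the paper's configuration.

```python
import math
import random


def anomaly_score(x0, predict_noise, alphas_bar, timesteps, rng):
    """Mean squared noise-prediction error averaged over the given timesteps."""
    errors = []
    for t in timesteps:
        a = alphas_bar[t]
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        # Forward-diffuse the clean latent to timestep t.
        xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
        pred = predict_noise(xt, t)
        errors.append(sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(x0))
    return sum(errors) / len(errors)
```

A device whose latent lies where the model has learned the data distribution yields a small error, while an out-of-distribution latent yields a large one, so the score can be thresholded without any labeled defects.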

DGLD: Domain-Gated Latent Diffusion for the Discovery of Novel Energetic Materials

2026-05-27T04:00:00 · cs.AI, diffusion, physics.chem-ph · 2605.26540

中文标题:DGLD:用于新型含能材料发现的域门控潜在扩散模型

作者:Yehudit Aperstein, Alexander Apartsin

摘要:

Energetic-materials performance gains translate directly into reduced propellant mass, smaller warheads, and more efficient civilian gas-generators, yet no new HMX-class compound has been disclosed in fifteen years. Designing one is a sparse-label problem: of ~66 k labelled CHNO molecules only ~3 k carry experimental or DFT-quality measurements, and naive generative models trained on the full mixture either memorise the high-performance tail or extrapolate without calibration. We introduce Domain-Gated Latent Diffusion (DGLD): a label-quality gate at training time, multi-task score-model guidance at sample time, and a four-stage chemistry-validation funnel ending in first-principles DFT audit. The result is 12 DFT-confirmed novel leads. The headline compound, 3,4,5-trinitro-1,2-isoxazole (L1), reaches ρ_cal = 2.09 g/cm³ and D_K-J,cal = 8.25 km/s and is structurally dissimilar from all 65 980 training molecules (nearest-neighbour Tanimoto 0.27). A co-headline lead, E1 (4-nitro-1,2,3,5-oxatriazole), exceeds L1 on calibrated detonation velocity (D_K-J,cal = 9.00 km/s) from a chemotype family disjoint from L1's. DGLD is the only method to land in the productive quadrant (simultaneously novel and on-target) at DFT level. SMILES-LSTM memorises 18.3% of its outputs exactly; SELFIES-GA's best novel candidate loses 3.5 km/s under DFT audit; REINVENT 4 generates novel high-N heterocycles but peaks at D=9.02 km/s. Code, checkpoints, and 918 mined hard negatives are released on Zenodo (DOI 10.5281/zenodo.19821953); the next compound to enter the HMX-class band can be discovered, validated, and recommended for synthesis at the cost of a few GPU-days.

摘要中文:

含能材料性能的提升可直接转化为推进剂质量的降低、战斗部尺寸的缩小以及民用燃气发生器效率的提高,然而在过去的十五年里,尚未有新的HMX级化合物被公开。设计这样的化合物是一个稀疏标签问题:在约6.6万个已标注的CHNO分子中,仅有约3千个拥有实验或DFT精度的测量数据;而直接在全部混合数据上训练的朴素生成模型,要么记住高性能尾部的分子,要么在缺乏校准的情况下进行外推。我们提出了域门控潜在扩散(DGLD):在训练阶段引入标签质量门控,在采样阶段采用多任务得分模型引导,并构建了一个四阶段的化学验证漏斗,最终以第一性原理DFT审核收尾。最终获得了12个经DFT确认的新型先导化合物。首要化合物3,4,5-三硝基-1,2-异恶唑(L1)的ρ_cal达到2.09 g/cm³,D_K-J,cal达到8.25 km/s,且其结构与全部65,980个训练分子均不相似(最近邻Tanimoto相似度为0.27)。另一并列首要的先导化合物E1(4-硝基-1,2,3,5-恶三唑)在校准爆速上超过了L1(D_K-J,cal = 9.00 km/s),且其化学型家族与L1的互不重叠。DGLD是唯一在DFT层面落入高产象限(同时兼具新颖性与命中目标)的方法。SMILES-LSTM有18.3%的输出属于逐字记忆;SELFIES-GA的最佳新颖候选分子在DFT审核下损失了3.5 km/s;REINVENT 4能够生成新颖的高氮杂环化合物,但峰值为D=9.02 km/s。代码、检查点以及918个挖掘得到的难负样本已在Zenodo上发布(DOI 10.5281/zenodo.19821953);下一个进入HMX级区间的化合物,只需耗费数个GPU天,即可被发现、验证并推荐合成。

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

2026-05-27T04:00:00 · cs.AI, cs.LG, diffusion · 2605.26582

中文标题:论离散扩散过程中随机性带来的纠错效应

作者:William Yuan, Sungwon Jeong, Amirali Aghazadeh

摘要:

Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently balance sampling efficiency and sample quality. In this work, we present a systematic study of how the degree of stochasticity in Markov transitions governs the sampling tradeoff. We show that highly deterministic transitions converge rapidly but suffer from error accumulation, while more stochastic transitions converge more slowly yet can achieve higher final sample quality. Using an information-theoretic analysis, we identify the underlying mechanism as an error-correcting effect induced by redundant transitions that symmetrically exchange mass between states, and show that these transitions can provably contract sampling errors. Motivated by this analysis, we propose Discrete Churn and Restart Sampling (DCRS), a novel inference algorithm that injects controlled stochasticity by alternating between forward and reverse diffusion processes. Experiments on synthetic datasets and large-scale benchmarks show that DCRS improves the speed-quality tradeoff in the low number of function evaluations regime. On image datasets, DCRS achieves up to a 10× reduction in sampling steps compared to standard samplers while maintaining competitive sample quality, whereas on language benchmarks, we observe more nuanced behavior depending on the corruption process and sampling procedure.

摘要中文:

离散扩散模型在文本与图像生成任务中取得了优异的性能,但其推理过程依然较为缓慢,并且在采样效率与样本质量之间始终存在内在的权衡。在本研究中,我们系统地探讨了马尔可夫转移中的“随机性程度”如何调控这一采样权衡。我们证明,高度确定性的转移能够快速收敛,但会面临误差累积问题;而更具随机性的转移虽然收敛速度较慢,却能获得更高的最终样本质量。借助信息论分析,我们确定其内在机制是由“冗余转移”所诱发的纠错效应:这些转移以对称的方式在各状态之间交换概率质量,并且可以被证明能够收缩采样误差。受这一分析的启发,我们提出了“离散扰动与重启采样”(Discrete Churn and Restart Sampling,DCRS),这是一种通过在前向与逆向扩散过程之间交替来注入可控随机性的新型推理算法。在合成数据集和大规模基准测试上的实验表明,DCRS在函数评估次数较少的场景下能够改善速度与质量之间的权衡。在图像数据集上,DCRS相比标准采样器可将采样步数减少多达10倍,同时保持有竞争力的样本质量;而在语言基准测试中,我们观察到其表现随加噪(corruption)过程与采样流程的不同而呈现出更为微妙的差异。

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

2026-05-27T04:00:00 · cs.AI, cs.CV, diffusion · 2605.27020

中文标题:针对图像生成模型预训练数据的黑盒成员推断攻击

作者:Tao Qi, Huili Wang, Yuanhong Huang, Wendan Wang, Lianchao Zhao, Jinrui Wang, Zichen Qin, Shangguang Wang, Yongfeng Huang

摘要:

The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess the ability of model to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.

摘要中文:

基于扩散的图像生成模型的迅猛发展,引发了人们对相关技术可能侵犯人类创作数据之著作权与隐私权的严重担忧。成员推断攻击(MIAs)已作为一种有前景的工具,用于识别模型训练过程中未经授权的数据使用行为。现有方法通常将模型对受扰动的可疑图像进行去噪的能力作为成员身份的判别指标。然而,这类特征的判别能力在很大程度上取决于模型的记忆程度,并且当应用于暴露程度较低的数据(例如预训练数据)时会显著下降。尽管已有多种方法试图通过利用模型内部特征来提升检测性能,但这些特征在主流的闭源图像生成平台上通常无法获取,从而限制了其实际应用价值。本文表明,通过分析黑盒扩散模型如何对目标图像及其对应的扰动文本指令进行去噪,可以揭示出更为显著的成员身份线索。基于这一洞察,我们提出了一种黑盒成员推理攻击框架(称为SD-MIA),该框架利用跨模态数据扰动机制来检测扩散模型中的预训练数据。我们在一个公开的基准数据集和一个新构建的数据集上开展了大量实验,这两个数据集均包含预训练阶段的成员样本与非成员样本,且二者具有相同的分布。实验结果表明,SD-MIA的性能优于现有基线方法,即便这些基线方法在获取模型内部特征方面具有不公平优势。

High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework

2026-05-27T04:00:00 · cs.AI, cs.LG, diffusion · 2605.27113

中文标题:基于GAN-扩散模型框架的高质量合成金融时间序列

作者:Giuseppe Masi, Andrea Coletta, Novella Bartolini

摘要:

In recent years, financial institutions and firms have increasingly adopted synthetic data to address data scarcity and to generate counterfactual market scenarios. However, reproducing all the statistical properties of financial time series, commonly known as stylized facts, remains an open challenge for many existing general-purpose architectures. In this paper, we present a quality-aware generative framework that combines two classes of generative methods, demonstrating how their integration addresses existing limitations while enhancing the realism of synthetic data. Specifically, we first introduce CoMeTS-GAN (Correlated Multivariate Time Series GAN), a Conditional Generative Adversarial Network (C-GAN) designed to jointly generate mid-price and volume time-series for correlated stocks. We then show how our GAN architecture can be incorporated into state-of-the-art diffusion models to enhance the quality of generated correlation structures. Specifically, the GAN's Critic serves as a quality evaluation module that guides the diffusion process, enforcing learned correlation structures in the generated time-series. Our framework offers a lightweight and responsive solution for realistic stock market simulation, explicitly modeling inter-asset correlation structures. We experimentally validate our framework against leading generative architectures, showing that it more effectively captures the stylized facts of stock markets and models inter-asset correlations.

摘要中文:

近年来,金融机构和企业越来越多地采用合成数据,以应对数据匮乏问题并构建反事实市场情景。然而,再现金融时间序列的所有统计特性——通常被称为“典型事实”——对于许多现有的通用架构而言仍是一项尚未解决的难题。本文提出了一种质量感知的生成框架,该框架整合了两类生成方法,展示了它们的融合如何克服现有局限,并提升合成数据的逼真度。具体而言,我们首先提出了CoMeTS-GAN(相关多变量时间序列生成对抗网络),这是一种条件生成对抗网络(C-GAN),旨在联合生成具有相关性的股票的中间价与成交量时间序列。随后,我们展示了如何将我们的GAN架构集成到当前最先进的扩散模型中,以提升所生成相关结构的质量。具体而言,GAN的判别器充当一个质量评估模块,用于指导扩散过程,从而在生成的时序数据中强化所学习到的依赖结构。我们的框架为真实的股票市场仿真提供了一种轻量且响应迅速的解决方案,能够显式地刻画资产间的相关性结构。我们针对主流的生成式架构对所提出的框架进行了实验验证,结果表明,该框架能够更有效地刻画股票市场的典型事实,并准确建模资产间的相关性。

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

2026-05-27T04:00:00 · cs.AI, cs.CV, diffusion · 2605.27155

中文标题:基于图像修复的语义鲁棒性探测:一种面向安全关键型目标检测的交互式工具

作者:Nico Steckhan, Krutarth Prajapati, Weija Shao, Silvia Vock

摘要:

Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemProbe, a tool for semantic robustness probing: users upload deployment images, create masks manually or automatically, select operational design domain-derived factors (or custom prompts), and run diffusion-based controlled inpainting. The system supports batch jobs, parallel seed/workflow variations, and configurable generation parameters. After each output, model inference runs automatically and displays annotated before/after comparisons with performance deltas. All probes are logged as structured artifacts, enabling traceable robustness evidence aligned with safety evaluation workflows. We demonstrate SemProbe on hand detection for dimension saws, targeting factors from insurance-oriented test criteria.

摘要中文:

在安全攸关领域测试目标检测器时,需要超越像素级扰动的、具有语义意义的探测手段。我们提出了SemProbe,这是一种用于语义鲁棒性探测的工具:用户上传部署场景的图像,手动或自动创建掩码,选择源自运行设计域(operational design domain)的因素(或自定义提示词),并运行基于扩散模型的可控图像修复(inpainting)。该系统支持批处理作业、并行的种子/工作流变体,以及可配置的生成参数。每次生成输出后,模型推理会自动运行,并展示带标注的前后对比及性能差值。所有探测均被记录为结构化工件,从而提供与安全评估工作流程相一致的、可追溯的鲁棒性证据。我们以尺寸锯(dimension saw)场景下的手部检测为例展示了SemProbe,所针对的因素取自面向保险的测试准则。

Neuro-Inspired Inverse Learning for Planning and Control

2026-05-27T04:00:00 · cs.AI, diffusion · 2605.24152

中文标题:面向规划与控制的神经启发式逆向学习

作者:Maryna Kapitonova, Tonio Ball

摘要:

We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning over whole trajectories, but with iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Forward Model (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

摘要中文:

我们提出了一种面向具身规划与控制的神经启发式框架。该框架基于哺乳动物大脑实现快速且高效的目标导向行为的三项原则:成对的正向与逆向内部模型、开环多步运动指令,以及动作的序列化、层次化组织。我们的Inverter框架采用可学习的组件,通过逆向学习(Inverse Learning,IL)进行端到端训练,并在合适之处辅以解析或算法模块;我们对IL进行了形式化,并将其与监督学习、强化学习和模仿学习加以区分。IL在两类方法之间架起了桥梁:一类是强化学习(RL)式的摊销(amortization),只需一次前向传播即可运行,但每次仅输出一个动作;另一类是最优控制(OC)式的整条轨迹序列规划,但需要在测试时进行迭代计算。单个Inverter或分层的n=2 Inverter堆叠,在全部3个maze2d和6个antmaze D4RL变体上,与离线强化学习和扩散规划器基线持平或更优,平均提升24.2%(范围为-1.9%至+78.2%),且推理计算时间降低了一到两个数量级。其独特之处在于,通过前向模型(Forward Model,FoM)对整个T步动作序列进行优化,而非逐步优化,使Inverter能够生成平滑、与目标一致、贯穿整条轨迹的结构,并得到比训练数据背后的策略更接近解析最优解的控制策略。我们还识别出IL的一种失效模式:在训练数据覆盖范围较窄时出现的FoM hacking;对此,我们通过采用覆盖范围更广的随机训练数据来加以缓解。作为应用实例,Pulse Inverter能够以与标准迭代数值基线(GRAPE)相当的保真度合成任意单量子比特量子门,而每个门的计算时间降低了1000倍以上。综上所述,我们得出结论:IL能够支持一类功能多样的世界接口,尤其适用于对延迟和资源要求苛刻的具身人工智能。

Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

2026-05-27T04:00:00 · cs.AI, cs.CV, diffusion · 2506.07813

中文标题:用于任意尺度图像超分辨率的自级联扩散模型

作者:Junseo Bang, Joonhee Lee, Kyeonghyun Lee, Haechang Lee, Dong Un Kang, Se Young Chun

摘要:

Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches based on regression-based or generative models have shown promising results but often suffer from scale inconsistency due to their single-stage formulation, which must handle a wide range of scaling factors simultaneously. To address this, we propose CasArbi, a self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi decomposes varying scaling factors into smaller sequential steps, progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. CasArbi leverages a coordinate-conditioned diffusion model for learning continuous image representations and adopts self-consistency guidance to generate scale-consistent details at inference time. Extensive experiments show that CasArbi outperforms existing methods in both perceptual and distortion metrics and demonstrates superior scale consistency across diverse arbitrary-scale super-resolution benchmarks. Our code is available at https://github.com/junseo88/CasArbi.

摘要中文:

任意尺度图像超分辨率旨在将图像上采样至任意目标分辨率,相较于传统的固定尺度超分辨率具有更高的灵活性。近年来,基于回归或生成模型的方法已展现出良好的性能,但由于其单阶段的架构设计,往往难以处理不同尺度之间的不一致性,因为该架构必须同时应对广泛的缩放因子。为此,我们提出了CasArbi,一种用于任意尺度图像超分辨率的自级联扩散框架。CasArbi将变化的缩放因子分解为一系列较小的步骤,在每个步骤中逐步提升图像分辨率,并针对任意缩放比例实现无缝过渡。CasArbi借助一种基于坐标条件的扩散模型来学习连续的图像表示,并在推理阶段采用自一致性引导机制,以生成尺度一致的细节。大量实验表明,CasArbi在主观感知与失真度量方面均优于现有方法,并在多种任意尺度超分辨率基准上展现出优异的尺度一致性。我们的代码已在 https://github.com/junseo88/CasArbi 上公开。
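One simple way to break an arbitrary scale factor into smaller sequential steps is sketched below. The abstract does not specify CasArbi's decomposition rule, so the equal geometric split, the function name `decompose_scale`, and the `max_step` cap are all assumptions made for illustration.

```python
def decompose_scale(scale, max_step=2.0):
    """Split `scale` into k equal multiplicative steps, each at most `max_step`.

    The product of the returned steps equals `scale` (up to rounding).
    """
    if scale < 1.0 or max_step <= 1.0:
        raise ValueError("expects scale >= 1 and max_step > 1")
    k = 1
    # Increase the number of steps until each step is small enough.
    while scale ** (1.0 / k) > max_step + 1e-9:
        k += 1
    return [scale ** (1.0 / k)] * k
```

For example, a request for 8× upsampling with `max_step=2.0` becomes three 2× stages, while 3.5× becomes two stages of roughly 1.87× each; a cascaded model would then be applied once per stage.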

Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

2026-05-27T04:00:00 · cs.AI, cs.CV, cs.LG, diffusion · 2510.03352

中文标题:基于扩散模型的图像重建中利用辅助信息的推理时搜索方法

作者:Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

摘要:

Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel framework that incorporates side information into existing diffusion-based inverse problem solvers via inference-time search, in a plug-and-play, training-free manner. Through extensive experiments across a range of inverse problems, including inpainting, super-resolution, and several deblurring tasks, and across multiple diffusion-based inverse problem solvers (DPS, DAPS, and MPGD), we show that augmenting each solver with our framework consistently improves the quality of the reconstructions over the corresponding original method. To demonstrate the generality of our approach, we consider diverse forms of side information, including reference images, textual descriptions, and anatomical MRI scans. The code is available at https://github.com/mahdi-farahbakhsh/DISS.

摘要中文:

扩散模型已被用作求解逆问题的先验。然而,现有方法通常忽视了能够显著提升重建质量的辅助信息,尤其是在严重不适定的情况下。在本工作中,我们提出了一种新颖的框架,通过推理时搜索,以即插即用、无需训练的方式,将辅助信息融入现有的基于扩散的逆问题求解器中。我们在一系列逆问题(包括图像修复、超分辨率以及多项去模糊任务)和多种基于扩散的逆问题求解器(DPS、DAPS和MPGD)上开展了大量实验,结果表明:为每种求解器加入我们的框架后,重建质量均一致优于相应的原始方法。为验证我们方法的通用性,我们考虑了多种形式的辅助信息,包括参考图像、文本描述以及解剖学MRI扫描。代码可在 https://github.com/mahdi-farahbakhsh/DISS 获取。

CFG-OEC: Classifier Free Guidance with Orthogonal Error Correction

2026-05-27T04:00:00 · cs.AI, cs.LG, diffusion · 2511.14075

中文标题:CFG-OEC:带有正交误差修正的无分类器引导

作者:Nakgyu Yang, Yechan Lee, SooJean Han

摘要:

Classifier-free guidance is a standard method for conditional sampling in diffusion models, but its sampling rule is not aligned with the objective used in training. This mismatch induces a structural sampling error through the interaction of conditional and unconditional prediction errors. We analyze this issue by decomposing the sampling error into a base term and a cross term determined by the alignment of the two errors. Based on this analysis, we propose CFG with orthogonal error correction (CFG-OEC), a structural modification that reduces the interaction term. For practical settings where ground-truth noise is not observable, we introduce a proxy computed from model predictions and a dynamic method that stabilizes correction across diffusion timesteps. Experiments in a controlled environment validate our theoretical error decomposition and proxy construction. Image generation experiments on Stable Diffusion v1.5 and Stable Diffusion XL show that CFG-OEC improves FID and CLIP scores over CFG and CFG++ across multiple samplers and guidance regimes.

摘要中文:

无分类器引导(CFG)是扩散模型中用于条件采样的标准方法,但其采样规则与训练阶段所采用的目标函数并不一致。这种不匹配通过条件预测误差与无条件预测误差的相互作用,引入了结构性的采样误差。我们将采样误差分解为一个基础项和一个由两类误差的对齐程度所决定的交叉项,以此分析这一问题。基于上述分析,我们提出了带有正交误差修正的CFG(CFG-OEC),这是一种能够减小该交互项的结构性修改。针对无法观测真实噪声的实际场景,我们引入了一种由模型预测计算得到的代理量,以及一种使修正在各扩散时间步上保持稳定的动态方法。在受控环境下的实验验证了我们的理论误差分解与代理量构造。在Stable Diffusion v1.5和Stable Diffusion XL上的图像生成实验表明,在多种采样器和引导设置下,CFG-OEC在FID和CLIP分数上均优于CFG和CFG++。
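The base-plus-cross-term structure can be seen in a short worked decomposition. This is a sketch under the common CFG convention with guidance weight $w$; the symbols $\epsilon^{*}$ (reference noise), $e_c$, and $e_u$ are notation introduced here for illustration, and the paper's own definitions and target may differ.

```latex
\begin{align}
\hat{\epsilon}_w &= \epsilon_u + w(\epsilon_c - \epsilon_u)
                  = w\,\epsilon_c + (1-w)\,\epsilon_u, \\
e_c &= \epsilon_c - \epsilon^{*}, \qquad e_u = \epsilon_u - \epsilon^{*}, \\
\hat{\epsilon}_w - \epsilon^{*} &= w\,e_c + (1-w)\,e_u, \\
\bigl\|\hat{\epsilon}_w - \epsilon^{*}\bigr\|^2
  &= \underbrace{w^2\|e_c\|^2 + (1-w)^2\|e_u\|^2}_{\text{base term}}
   + \underbrace{2\,w(1-w)\,\langle e_c, e_u\rangle}_{\text{cross term}}.
\end{align}
```

The third line uses $w + (1-w) = 1$. Under this reading, the cross term depends only on how the two errors align and vanishes when $e_c \perp e_u$, which is one plausible interpretation of "orthogonal error correction".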

Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade

2026-05-27T04:00:00 · cs.AI, cs.LG, diffusion, physics.app-ph · 2512.01572

中文标题:基于自编码器-扩散模型级联的极稀疏测量数据多尺度物理场重建

作者:Letian Yi, Tingpeng Zhang, Mingyuan Zhou, Guannan Wang, Quanke Su, Zhilu Lai

摘要:

Extreme sensor sparsity makes full-field reconstruction a fundamentally ill-posed problem in scientific sensing, where the goal is to infer physical fields from sparse measurements. In this regime, the posterior is severely underconstrained and inherently multimodal, making its approximation highly ill-conditioned. Specifically, deterministic mappings collapse uncertainty, direct conditional learning cannot cover the space of possible observation-conditioned solutions, and likelihood-guided sampling becomes highly sensitive to noise and sensor configurations. These limitations result in unstable posterior estimates and highlight the need for modeling uncertainty in a structural manner. To this end, we propose Cascaded Sensing, a hierarchical framework that restructures posterior inference across scales. Rather than modeling the full-field posterior directly, Cas-Sensing first resolves global structural ambiguity through a deterministic coarse-stage estimator. A neural-operator-based functional autoencoder, trained with masked inputs, maps sparse observations to a coarse-scale structural field, acting analogously to a maximum a posteriori estimator that selects the dominant global configuration. This structural anchor fixes the principal degrees of freedom of the posterior and transforms the problem into a better-conditioned residual inference task. A conditional diffusion model then learns only the refined-scale residual distribution, confining sampling to a stable neighborhood of plausible solutions and suppressing competition among observation-consistent modes. To enhance robustness under varying sensing conditions, we introduce mask-cascade training, which exposes the model to diverse sparse observation patterns through intermediate coarse reconstructions. During inference, manifold-constrained guidance enforces observation consistency as a refinement mechanism rather than a global mode-selection process.

摘要中文:

在科学传感领域,极端的传感器稀疏性使得全场重建成为一个本质上不适定的问题,其目标是从稀疏测量中推断物理场。在这一情形下,后验分布严重欠约束且本质上呈多峰,导致对其的近似高度病态。具体而言,确定性映射会压缩不确定性,直接的条件学习无法覆盖所有可能的观测条件解空间,而基于似然的采样则对噪声和传感器配置极为敏感。这些局限性造成后验估计的不稳定,并凸显了以结构化方式建模不确定性的必要性。为此,我们提出Cascaded Sensing,一个跨尺度重构后验推断的分层框架。Cas-Sensing并不直接建模全场后验,而是首先通过一个确定性的粗粒度估计器化解全局结构上的歧义。一个基于神经算子的泛函自编码器采用掩码输入进行训练,将稀疏观测映射为粗尺度的结构场,其作用类似于最大后验估计器,用于选择占主导地位的全局构型。这一结构锚点固定了后验的主要自由度,从而将原问题转化为一个条件更好的残差推断任务。随后,一个条件扩散模型仅学习细尺度的残差分布,将采样限制在合理解的稳定邻域内,并抑制各个与观测相容的模式之间的竞争。为进一步提升在不同传感条件下的鲁棒性,我们引入了掩码级联训练策略,通过中间的粗重建使模型暴露于多种稀疏观测模式之下。在推理过程中,流形约束引导将观测一致性作为精化机制来执行,而非全局的模式选择过程。

Demystifying Video Reasoning

2026-05-27T04:00:00 · cs.AI, cs.CV, diffusion · 2603.16870

中文标题:揭秘视频推理

作者:Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

摘要:

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

摘要中文:

近年来,视频生成领域的研究取得了重要进展,并揭示了一个出乎意料的现象:基于扩散模型的视频生成模型展现出非平凡的推理能力。先前的研究将这一现象归因于帧链机制,该机制假定推理过程在视频的各帧之间依次展开。在本研究中,我们对这一假设提出了质疑,并揭示了一种截然不同的作用机制。我们发现,视频模型中的推理能力主要是在扩散去噪的各个步骤中逐步涌现出来的。通过定性分析与针对性的探查实验,我们发现模型在去噪过程的早期阶段会探索多个候选解,并逐步收敛至最终答案,这一过程我们称之为“步骤链”(Chain-of-Steps,CoS)。在这一核心机制之外,我们还识别出若干对模型性能至关重要的涌现推理行为:(1)工作记忆,支持持续的引用;(2)自我修正与优化,使模型能够从错误的中间解中恢复;以及(3)先感知后行动,即前期步骤建立语义基础,后期步骤实施结构化操作。在扩散步骤中,我们进一步揭示了扩散变换器内部自演化形成的功能专业化:早期层编码密集的感知结构,中间层执行推理,而后期层则对潜在表征进行整合。受这些洞察的启发,我们提出了一种简单的无训练策略作为概念验证,展示了如何通过集成来自采用不同随机种子的同一模型的潜在轨迹来提升推理能力。总体而言,我们的研究系统地揭示了视频生成模型中推理能力的涌现机制,为未来研究提供了理论基础,有助于更充分地挖掘视频模型内在的推理动力学,将其作为一种新型智能载体加以利用。

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

2026-05-27T04:00:00 · cs.AI, cs.CV, cs.LG, diffusion · 2605.24001

中文标题:Diff-Instruct with Diffused Reward:迈向有理论依据的单步生成器强化学习

作者:Junyi Wu, Weijian Luo, Haoyang Zheng, Ruizhe Zhang, Guang Lin

摘要:

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

摘要中文:

近年来,单步文本到图像生成技术取得了显著进展,实现了兼具卓越效率与质量的实时合成。先前针对单步生成器的强化学习方法将图像空间中的奖励优化与扩散模型在噪声空间中的分布匹配相结合。这一范式由于终端奖励优化与底层生成式动态之间的不匹配而带来了诸多挑战。因此,优化往往倾向于利用随机性带来的自由度,常常以牺牲图像保真度为代价来提升奖励。为解决这一问题,我们提出了基于扩散奖励的Diff-Instruct(DIDR),这是一种源自积分KL散度最小化的无数据轨迹级对齐框架。DIDR在整个扩散轨迹上,将RLHF优化后的奖励偏置清洁图像分布推广至所有噪声水平。我们证明,该目标函数与用于干净图像的RLHF具有相同的极小值点,并且能够自然地诱导出扩散奖励得分(DRS),后者作为对参考评分函数的一种奖励驱动型修正。为使其更具实用性,我们进一步提出了扩散奖励代理(DRP),这是一种基于可微分的短步去噪过程所构建的、用于估计DRS的有效指标。大量实验表明,DIDR始终在帕累托意义上优于现有的单步SDXL基线方法。此外,当迁移至6B参数的DiT骨干模型(Z-Image)时,DIDR在偏好对齐方面超越了其50步教师模型,且仅需一步生成。
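The idea of propagating a reward-tilted distribution across noise levels can be made concrete with a short derivation. It uses the standard KL-regularized RLHF optimum and a generic forward kernel $q_t(x_t \mid x_0)$; the temperature $\beta$ and the notation are assumptions introduced here, and whether the correction term below coincides with the paper's Diffused Reward Score is not confirmed by the abstract.

```latex
\begin{align}
p^{*}(x_0) &= \frac{1}{Z}\, p_{\mathrm{ref}}(x_0)\, \exp\!\bigl(r(x_0)/\beta\bigr), \\
p^{*}_t(x_t) &= \int q_t(x_t \mid x_0)\, p^{*}(x_0)\, \mathrm{d}x_0
  = \frac{1}{Z}\, p_{\mathrm{ref},t}(x_t)\;
    \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\bigl[\exp\!\bigl(r(x_0)/\beta\bigr)\bigr], \\
\nabla_{x_t} \log p^{*}_t(x_t) &= \nabla_{x_t} \log p_{\mathrm{ref},t}(x_t)
  + \nabla_{x_t} \log \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\bigl[\exp\!\bigl(r(x_0)/\beta\bigr)\bigr].
\end{align}
```

The second line follows from Bayes' rule, $q_t(x_t \mid x_0)\,p_{\mathrm{ref}}(x_0) = p_{\mathrm{ref},t}(x_t)\,p_{\mathrm{ref}}(x_0 \mid x_t)$. The last line shows the tilted score at each noise level as the reference score plus a reward-driven correction, which matches the role the abstract assigns to a score correction.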

Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.26230

中文标题:面向鲁棒多视图三维重建的几何感知表示去噪方法

作者:Jin Hyeon Kim, Jaeeun Lee, Claire Kim, Kyoungjin Oh, Paul Hyunbin Cho, Jaewon Min, Yeji Choi, Jihye Park, Hyunhee Park, Minkyu Park, Seungryong Kim

摘要:

Multi-view 3D reconstruction has achieved remarkable progress with the advent of feed-forward 3D reconstruction models. However, these models are typically trained and evaluated under ideal, degradation-free imaging conditions, whereas real-world observations often contain degradations that differ significantly from such settings. Improving robustness for multi-view 3D reconstruction under degraded conditions therefore remains an important challenge. We present Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. This design exploits the geometry-aware feature representations of the 3D reconstructor to effectively recover accurate scene geometry. Furthermore, by employing an additional RGB image decoder, the refined representations can also be used to restore high-quality RGB images, thereby enabling the simultaneous recovery of 3D scene geometry and high-quality imagery. Comprehensive experiments on the Depth Anything 3 (DA3) benchmark demonstrate the effectiveness of the proposed GARD framework.

摘要中文:

随着前馈式三维重建模型的出现,多视角三维重建技术取得了显著进展。然而,这些模型通常是在理想的、无退化的成像条件下进行训练和评估的,而真实场景中的观测往往存在与上述假设显著不同的退化现象。因此,在退化条件下提升多视角三维重建的鲁棒性仍是一项重要挑战。我们提出了几何感知表征去噪(GARD),这是一种新颖的框架,能够在前向3D重建模型的特征空间中直接进行基于扩散的多视角图像复原。该设计利用三维重建器的几何感知特征表示,从而有效地恢复精确的场景几何。此外,通过引入一个额外的RGB图像解码器,这些经过优化的表征还可用于重建高质量的RGB图像,从而实现三维场景几何与高质量图像的同步恢复。在Depth Anything 3(DA3)基准上的综合实验验证了所提出的GARD框架的有效性。

Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.26470

中文标题:面向逆问题的三元动力学感知扩散后验采样:引导与随机性调度的优化

作者:Junseo Bang, Dong Ju Mun, Hoigi Seo, Seongmin Hong, Se Young Chun

摘要:

Generative posterior sampling using diffusion models has emerged as a dominant paradigm for solving inverse problems in imaging, which usually consists of three main components: data consistency (DC) guidance, classifier-free guidance (CFG) and stochasticity. While prior work has focused on how to develop each or all components, less attention has been given to how to schedule them, leading to heuristically fixed or partially adjusted suboptimal schedules. In this work, we argue that the interactions among all three components in terms of scheduling are crucial for significantly improved performance in solving inverse problems in imaging. Our analysis shows that aggressive CFG early in sampling conflicts with DC guidance, while stochasticity brings the trajectory back to higher-probability regions. Based on these findings, we propose Triadic Dynamics Aware Posterior Sampling (TriPS), which reformulates posterior sampling as a time-varying control problem and optimizes schedules following a triadic trend of decreasing DC and stochasticity scales alongside increasing CFG scale. TriPS achieves this through two strategies: template-based search over functional priors for reliable baseline schedules, and Group Relative Policy Optimization (GRPO)-based reinforcement learning for more flexible temporal curves. Experiments demonstrate TriPS outperforms state-of-the-art baselines in data fidelity and perceptual realism.

摘要中文:

基于扩散模型的生成式后验采样已成为求解成像领域逆问题的主导范式,其通常由三个主要组成部分构成:数据一致性(DC)引导、无分类器引导(CFG)以及随机性。尽管既有研究多聚焦于各组件乃至所有组件的开发方法,却较少关注其调度策略,从而导致所采用的调度方案往往是启发式固定的或仅作局部调整的次优方案。在本研究中,我们论证了在成像逆问题求解中,三类要素之间在调度层面的相互作用对于性能的显著提升至关重要。我们的分析表明,在采样初期,激进的CFG会与DC引导相冲突,而随机性则使轨迹回归到更高概率的区域。基于上述发现,我们提出了三元动力学感知后验采样(TriPS),该方法将后验采样重新表述为一个时变控制问题,并按照DC尺度和随机性尺度递减、CFG尺度递增的三元趋势来优化调度策略。TriPS通过两种策略实现这一目标:一是在函数先验上进行基于模板的搜索,以获得可靠的基线调度;二是采用基于组相对策略优化(GRPO)的强化学习,以生成更为灵活的时间曲线。实验结果表明,TriPS在数据保真度和感知真实感方面均优于当前最先进的基线方法。
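
The triadic trend described in the abstract (DC and stochasticity scales decreasing while the CFG scale increases over sampling) can be illustrated with a small schedule template. This is only a sketch under stated assumptions: the cosine shape, the function names `cosine_ramp` and `triadic_schedule`, and all numeric ranges are hypothetical placeholders, not the templates or values used by TriPS.

```python
import math


def cosine_ramp(t, start, end):
    """Move smoothly from `start` (t = 0) to `end` (t = 1); t is sampling progress."""
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t))


def triadic_schedule(num_steps, dc=(1.0, 0.1), noise=(1.0, 0.0), cfg=(1.0, 4.0)):
    """Return one (dc_scale, noise_scale, cfg_scale) tuple per sampling step.

    The (start, end) ranges are illustrative assumptions, not values from the paper.
    """
    schedule = []
    for i in range(num_steps):
        t = i / max(num_steps - 1, 1)  # progress in [0, 1]
        schedule.append((
            cosine_ramp(t, *dc),     # data-consistency scale: decreasing
            cosine_ramp(t, *noise),  # stochasticity scale: decreasing
            cosine_ramp(t, *cfg),    # CFG scale: increasing
        ))
    return schedule
```

A sampler would read one tuple per step and use it to weight its DC update, injected noise, and CFG term. A template search in this spirit would then vary the shape and the endpoints.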

Scheduled Style Injection: Expanding the Style-Content Pareto Frontier in Training-Free Diffusion-based Style Transfer

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.26538

中文标题:预设风格注入:在无训练的扩散模型风格迁移中拓展风格-内容帕累托前沿

作者:Amey Sunil Kulkarni

摘要:

Style transfer with pre-trained diffusion models has advanced rapidly, but a core question remains underexplored: where in the model should style injection be strongest? StyleID, the leading training-free method, uses a single global parameter (gamma) uniformly across all layers and timesteps, which forces a fixed tradeoff between style quality and content preservation. We show this tradeoff is unnecessarily rigid. We systematically explore four dimensions of control: varying style injection strength across decoder layers, across denoising timesteps, and scheduling ControlNet geometric conditioning along both axes. The pattern is consistent everywhere: decreasing schedules, with stronger structural signal injection in shallower layers and earlier timesteps, reliably outperform the reverse. Beyond direction, schedule shape matters: cosine and square-root timestep schedules outperform linear. Most importantly, we find that gamma scheduling and ControlNet conditioning are nearly independent. The resulting combined configurations expand the Pareto frontier, offering superior tradeoffs between style fidelity and content preservation compared to any single baseline setting. Our best balanced configuration achieves ArtFID of 27.036 versus StyleID's 28.801 - a 6.1% relative improvement, with consistent gains across the full style-content tradeoff frontier. Results are validated across 35 configurations totaling over 28,000 stylized images using four complementary metrics. These findings generalize across SD backbones with identical rank ordering. All modifications are training-free, parameter-free, and require only a few lines of scheduling code; code is available at https://github.com/ameyskulkarni/scheduled_style_injection.

摘要中文:

基于预训练扩散模型的风格迁移技术发展迅速,但一个核心问题仍未得到充分探讨:风格注入应当在模型的何处最强?StyleID是一种领先的无训练方法,它在所有层和所有时间步上统一使用一个全局参数(γ),这导致风格质量与内容保真度之间存在固定的权衡。我们证明了这种权衡存在不必要的僵化。我们系统地探讨了控制的四个维度:在解码器各层之间、在去噪的各个时间步之间调节风格注入的强度,并在这两个维度上对ControlNet的几何条件进行调度。这一规律在各处均一致:递减型调度(即在较浅层和较早时间步注入更强的结构信号)的性能始终优于相反的配置。除了调度方向之外,调度的形状同样重要:余弦和平方根型的时间步调度优于线性调度。最重要的是,我们发现γ调度与ControlNet条件控制几乎相互独立。由此得到的组合配置扩展了帕累托前沿,与任何单一基线设置相比,在风格保真度与内容保留之间提供了更优的权衡。我们最优的平衡配置在ArtFID指标上达到27.036,而StyleID为28.801,相对提升了6.1%,并且在整个风格–内容权衡前沿上均表现出一致的增益。研究结果基于四种互补的评价指标,在35种配置下、共计超过2.8万张风格化图像上得到了验证。这些发现可推广到不同的SD主干网络,且各配置的排名顺序保持一致。所有改动均无需训练、不引入额外参数,且仅需几行调度代码即可实现;相关代码已公开,地址为:https://github.com/ameyskulkarni/scheduled_style_injection。
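
To make the "few lines of scheduling code" concrete, the following sketch shows decreasing linear, cosine, and square-root schedules for an injection strength. The function name `injection_strength`, the exact curve definitions, and the endpoint values are assumptions for illustration; the repository linked above is the authoritative source for the schedules actually used.

```python
import math


def injection_strength(progress, strong, weak, shape="cosine"):
    """Decreasing schedule for an injection strength.

    progress: 0.0 at the earliest timestep (or shallowest layer), 1.0 at the last.
    strong / weak: strength at progress 0 and 1 (hypothetical endpoints).
    shape: "linear", "cosine", or "sqrt" (one plausible reading of the curve names).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if shape == "linear":
        w = 1.0 - progress
    elif shape == "cosine":
        w = 0.5 * (1.0 + math.cos(math.pi * progress))
    elif shape == "sqrt":
        w = 1.0 - math.sqrt(progress)
    else:
        raise ValueError(f"unknown shape: {shape}")
    return weak + (strong - weak) * w
```

All three shapes share the same endpoints and differ only in how quickly the strength falls off, which is the property the abstract reports as mattering beyond the direction of the schedule.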

Self-Intersection-Aware 3D Human Motion Generation Using an Efficient Human Sphere Proxy

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.26744

中文标题:基于高效人体球体代理的自相交感知三维人体运动生成

作者:Pascal Herrmann, Maarten Bieshaar, Dennis Mack, Robert Herzog, Juergen Gall

摘要:

Human motion generation has made tremendous progress in recent years, with state-of-the-art approaches surpassing ground truth data in leading evaluation benchmarks. However, visual inspection of the generated motions paints a different picture. Even state-of-the-art approaches generate motions frequently containing self-intersections, i.e., body parts interpenetrating, which are strong artifacts, severely limiting the perceived motion quality. We introduce a novel loss, which explicitly penalizes self-intersections, to the training of human motion generation methods. We base our loss on a sphere proxy of human geometry, which allows us to calculate a self-intersection loss 98% faster and uses 83% less memory than comparable methods based on triangular meshes. The loss is agnostic to the specific approach, and we add it to the training of the recent human motion generation methods human motion diffusion model (MDM) and MoMask. Our extensive experiments show a reduction of self-intersections in generated motions of up to 49% while improving other evaluation metrics. The code is available at https://github.com/boschresearch/humansphereproxy .

摘要中文:

近年来,人体运动生成技术取得了巨大进展,其中最先进的方法在主流评估基准上已超越真实数据的水平。然而,对生成运动的目视检查却呈现出另一幅图景。即便是最先进的方法,生成的运动也常常存在自相交现象,即人体各部位相互穿透,这种伪影十分明显,严重降低了运动的感知质量。我们为人体运动生成方法的训练引入了一种新型损失函数,该损失函数能够显式地对自相交现象施加惩罚。我们基于人体几何的球体代理来定义损失函数,这使得自相交损失的计算速度比基于三角网格的同类方法快98%,且内存占用降低83%。该损失函数与具体方法无关,我们将其加入近期人体运动生成方法人体运动扩散模型(MDM)和MoMask的训练过程。我们的大量实验表明,所生成运动中的自相交现象可减少多达49%,同时其他评估指标也得到提升。该代码可在 https://github.com/boschresearch/humansphereproxy 获取。
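
A sphere proxy makes the intersection test cheap because two spheres overlap exactly when the distance between their centers is smaller than the sum of their radii. The sketch below sums squared penetration depths over sphere pairs. It is an assumed, simplified form: the function name, the squared-depth penalty, and the `skip_pairs` argument (for neighbouring body parts that always touch) are hypothetical and not taken from the paper's code.

```python
import math


def self_intersection_loss(spheres, skip_pairs=frozenset()):
    """Sum of squared penetration depths between proxy spheres.

    spheres: list of (x, y, z, radius) tuples approximating the body.
    skip_pairs: set of (i, j) index pairs with i < j to ignore,
                e.g. adjacent spheres on the same limb (an assumption).
    """
    loss = 0.0
    for i in range(len(spheres)):
        for j in range(i + 1, len(spheres)):
            if (i, j) in skip_pairs:
                continue
            xi, yi, zi, ri = spheres[i]
            xj, yj, zj, rj = spheres[j]
            dist = math.dist((xi, yi, zi), (xj, yj, zj))
            depth = max(0.0, ri + rj - dist)  # zero when the spheres do not overlap
            loss += depth ** 2
    return loss
```

In training, the same computation would be written with differentiable tensor operations so that gradients flow back to the generated poses.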

Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.26933

中文标题:利用文本到图像的扩散模型进行无监督视觉目标跟踪

作者:Zhengbo Zhang, Zhigang Tu, Junsong Yuan, De Wen Soh, Bo Du

摘要:

Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often struggle in scenarios that demand fine-grained understanding of semantic and visual structural information within video frames. Text-to-image diffusion models are well known for their ability to generate images that accurately reflect the semantics and structures described in the input prompt, demonstrating a strong grasp of visual semantics and structures. Building on this capability, we approach unsupervised tracking from a new perspective by exploiting the rich semantic knowledge encoded in pretrained text-to-image diffusion models. To adapt the diffusion models, which are originally developed for image generation, to the tracking task, we reinterpret the models as a bridge between text and image modalities. This connection is realized through the cross-attention mechanism: when both text and an image are input into the models, they highlight the regions of the image that are semantically aligned with the text in the cross-attention maps. We therefore learn a prompt that represents the tracking target and activates its corresponding region in the cross-attention map for each frame, which enables object tracking with the diffusion model. Specifically, our method Diff-Tracking is composed of two main components: an initial prompt learner and an online prompt updater. The initial prompt learner generates a prompt that captures the target object in the first frame, allowing the diffusion model to identify the target. The online prompt updater refines the prompt based on motion information, enabling consistent tracking across video frames. Evaluations on six challenging tracking datasets demonstrate the effectiveness of our approach.

摘要中文:

无监督视觉目标跟踪是一项具有挑战性的任务,它要求在不依赖真实标注数据进行训练的情况下,对视频中的任意目标进行持续跟踪。尽管取得了显著进展,现有的最先进无监督跟踪算法在需要对视频帧中的语义信息和视觉结构信息进行细粒度理解的场景下仍表现欠佳。文本到图像的扩散模型以其能够生成准确反映输入提示中语义与结构的图像而闻名,展现出对视觉语义与结构的深刻理解。在此基础上,我们通过利用预训练文本-图像扩散模型中所蕴含的丰富语义知识,从一个新的视角来解决无监督跟踪问题。为了将最初为图像生成而设计的扩散模型适配于跟踪任务,我们将其重新诠释为连接文本与图像两种模态的桥梁。这种关联性通过交叉注意力机制得以实现:当文本与图像同时输入模型时,模型会在交叉注意力图中突出显示与文本语义对齐的图像区域。因此,我们学习一个用于表征跟踪目标的提示,并在每一帧的交叉注意力图中激活其对应区域,从而实现基于扩散模型的目标跟踪。具体而言,我们的方法Diff-Tracking由两个主要模块组成:初始提示学习器和在线提示更新器。初始提示学习器会生成一个能够捕捉第一帧中目标对象的提示,从而使扩散模型能够识别该目标。在线提示词更新器基于运动信息对提示词进行优化,从而实现视频帧间的稳定跟踪。我们在六个具有挑战性的目标跟踪数据集上对所提出的方法进行了评估,验证了其有效性。

JLT: Clean-Latent Prediction in Latent Diffusion Transformers

2026-05-27T04:00:00 · cs.CV, cs.LG, diffusion, image_compression · 2605.27102

中文标题:JLT:潜在扩散Transformer中的干净潜变量预测

作者:Funing Fu, Tenghui Wang, Guanyu Zhou, Junyong Cen, Qichao Zhu

摘要:

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256 x 256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.

摘要中文:

基于干净数据预测的流匹配方法表明,回归干净样本点能够比预测环境空间中的加噪量更有效地利用低维结构。我们探讨,在图像被映射到学习得到的潜在空间之后,这一原则是否仍然有效;在该空间中,压缩已消除了大量原始像素级的变异。我们提出了JLT,这是一种基于冻结的FLUX.2 VAE编码的1.3亿参数潜在扩散Transformer,并在相同的表示、主干网络和训练设置下,将干净潜变量预测与对应的速度预测DiT进行了对比。尽管在固定的加噪时间下,变量x、ε和v之间可以线性互换,但局部高斯分析表明,速度回归会继承一个各向同性的目标协方差下界,并放大低方差的潜变量方向;而干净预测则会抑制这些方向。在ImageNet 256×256数据集上,JLT-B/1在使用无分类器引导时取得了2.50的FID-50K,且在目标匹配的对比中大幅领先速度预测。这些结果表明,潜在扩散模型中的预测目标是依赖于表征的几何选择,而非可互换的代数参数化。
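
The statement that x, epsilon, and v are linearly convertible at a fixed corruption time can be checked directly. The sketch below assumes the common rectified-flow convention x_t = (1 - t) x + t ε with v = ε - x; the abstract does not state which convention JLT uses, so the formulas and function names here are illustrative assumptions.

```python
def noised(x, eps, t):
    """x_t = (1 - t) * x + t * eps (assumed interpolation convention)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x, eps)]


def velocity(x, eps):
    """v = eps - x, the time derivative of x_t under the convention above."""
    return [b - a for a, b in zip(x, eps)]


def clean_from_velocity(x_t, v, t):
    """Recover the clean point: x = x_t - t * v."""
    return [z - t * w for z, w in zip(x_t, v)]


def noise_from_velocity(x_t, v, t):
    """Recover the noise: eps = x_t + (1 - t) * v."""
    return [z + (1.0 - t) * w for z, w in zip(x_t, v)]
```

The conversion is exact for any single sample, which is why the abstract's point concerns regression geometry (which directions a learned predictor amplifies or damps) rather than algebraic expressiveness.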

MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.27235

中文标题:MRT:用于大规模分层图像生成与编辑的掩码区域Transformer

作者:Zhicong Tang, Zhao Zhang, Jingye Chen, Mohan Zhou, Yifan Pu, Yuchi Liu, Yalong Bai, Ethan Smith, Yuhui Yuan

摘要:

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks including text-to-layers, image-to-layers, and layers-to-layers within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling complete editable layers extending beyond visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100× faster inference and reducing activation GPU memory consumption by 50-90% during image-to-layer inference.

摘要中文:

分层图像生成与编辑是一项基础性能力,它支持对生成的视觉内容进行逐层复用、编辑与合成,类似于自然语言中的词级编辑。尽管其重要性不言而喻,但这一领域在规模化层面仍鲜有深入研究。为弥补这一不足,我们提出了MRT——一种专用于多层透明图像生成与编辑的200亿参数掩码区域扩散模型,该模型基于超过1000万张涵盖多种宽高比与文本提示的多语言设计样本进行训练。为充分发挥这一规模效应,我们提出了两项关键性的技术贡献。首先,我们在一个共享的掩码区域扩散框架中统一了文本到图层、图像到图层以及图层到图层这三项互补任务,其中选择性标记掩码实现了灵活的逐层生成与编辑。其次,为支持溢出图层的生成,我们引入了一个具有溢出感知能力的画布图层,该图层能够处理边界不一致问题,并支持半透明背景合成,从而实现可完全编辑的图层跨越可见画布边界。此外,我们采用扩散蒸馏技术,实现了8步实时多层生成,且质量损失极小。大量实验表明,我们的框架在三项任务上均显著优于现有最先进方法,包括各类商用系统,为多层透明图像生成树立了新的基准。值得注意的是,根据用户研究结果,我们的模型在图像到图层的生成质量方面显著优于同期的Qwen-Image-Layered模型,同时推理速度提升了10至100倍,并且在图像到图层的推理过程中将激活态显存占用降低了50%至90%。

PARE: Pruning and Adaptive Routing for Efficient Video Generation

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.27336

中文标题:PARE:用于高效视频生成的剪枝与自适应路由

作者:Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao, Yaohui Wang, Xinyuan Chen, Chang Xu

摘要:

Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.

摘要中文:

视频扩散Transformer(DiT)能够生成高质量视频,但由于其采用宽块结构、深层网络以及迭代采样,对计算资源的需求极为庞大。近年来的方法通过压缩网络的宽度、深度或采样步数来降低计算成本,但通常采用固定的网络架构,无法根据具体输入或去噪过程的不同阶段进行自适应调整。我们提出了PARE(用于高效视频生成的剪枝与自适应路由),该方法通过结构感知剪枝和输入自适应路由,同时对网络的宽度与深度进行压缩。在宽度方面,我们观察到注意力头会分化为空间和时间两类角色,并据此设计了考虑这一区别的重要性评分,从而避免对运动至关重要的时间注意力头被过早剪枝。在深度方面,我们训练了一个以去噪时间步和视觉内容为条件的轻量级路由器,用于在每一步动态选择需要执行的网络块,从而实现针对每个输入的计算自适应,而非静态地移除网络块。一条渐进式流水线首先通过蒸馏恢复宽度剪枝后的质量,随后联合优化学生模型与路由器,以解耦这两个学习目标。在Wan2.1-14B上针对图像到视频和文本到视频生成的实验表明,PARE在保持VBench各维度质量的同时,显著降低了每步计算量,并可与步数蒸馏组合使用,实现进一步加速。

Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

2026-05-27T04:00:00 · cs.CV, cs.LG, diffusion · 2605.27343

中文标题:通过表征条件扩散模型实现可控图像生成

作者:Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

摘要:

Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text prompts or semantic maps, which require extensively annotated datasets. In this preliminary work, we explore diffusion models conditioned on representations from a pre-trained self-supervised model. The self-conditioning mechanism not only improves the quality of unconditional image generation, but also provides a representation space that can be used to control the generation. We explore this conditioning space by identifying directions of variations, and demonstrate promising properties in terms of smoothness and disentanglement.

摘要中文:

扩散模型已成为生成和编辑高质量图像的强大工具,但如何引导这些模型生成特定的输出仍是一项挑战。传统方法依赖于诸如文本提示或语义图之类的条件机制,而这些机制需要大规模标注的数据集。在本项初步工作中,我们探讨了以预训练自监督模型所提取的表征为条件的扩散模型。这种自条件机制不仅提升了无条件图像生成的质量,还提供了一个可用于控制生成过程的表征空间。我们通过识别变化方向来探索这一条件空间,并在平滑性和解耦性方面展示了有前景的特性。

Garment Particles: A 2D--3D Symmetric Garment Representation for Generation and Editing

2026-05-27T04:00:00 · cs.CV, cs.GR, diffusion · 2605.26391

中文标题:服装粒子:一种用于生成与编辑的二维-三维对称服装表示方法

作者:Kiyohiro Nakayama, I-Chao Shen, Ruofan Liu, Yiming Wang, Gordon Wetzstein, Takeo Igarashi

摘要:

Practical garment design spans two modes: intuitive creation from high-level intent, such as a reference image or text description, and complex low-level editing across 2D sewing patterns and 3D draped geometry, which requires professional training to navigate their complex interdependencies. Yet existing frameworks address only part of this challenge, offering either garment generation from casual inputs or direct editing on sewing patterns. To support both ends of the spectrum, we propose Garment Particles, a 5D point-cloud representation that jointly encodes 2D sewing patterns and 3D geometry. This representation enables Garment Particles Flow (GPF), a rectified flow framework that supports intuitive generation from high-level inputs (text, images, sketches) and various editing operations on 2D sewing patterns and 3D geometries via diffusion posterior sampling. Finally, we introduce Particles-to-Pattern Flow, which converts generated garment particles into curve-based patterns for simulation. We validate our model's generation ability on multiple datasets, achieving state-of-the-art garment generation results against competitive baselines. Our model also enables many garment editing scenarios, including garment interpolation, sewing pattern editing, and point-cloud- and silhouette-conditioned garment generation. Our project website is at https://garment-particles.github.io .

摘要中文:

实用的服装设计涵盖两种模式:一是基于高层次意图的直观创作,例如参考图或文字描述;二是对二维裁剪样板与三维悬垂形态进行复杂的底层编辑,而后者因其错综复杂的相互依赖关系,需要经过专业训练才能熟练驾驭。然而,现有框架仅能应对这一挑战的一部分,要么根据随意输入生成服装,要么直接在裁剪图上进行编辑。为同时支持这两个极端,我们提出了“服装粒子”这一5D点云表示方法,它能够联合编码二维缝制图案与三维几何信息。这种表示方法实现了服装粒子流(GPF),这是一种修正流框架,支持从高层次输入(文本、图像、草图)进行直观生成,并可通过扩散后验采样对二维裁剪样板和三维几何体实施多种编辑操作。最后,我们提出了粒子‑图案流,该方法可将生成的服装粒子转换为基于曲线的裁剪样片,以用于仿真。我们在多个数据集上验证了所提模型的生成能力,并在与多种竞争性基线方法的对比中取得了最先进的服装生成效果。我们的模型还支持多种服装编辑场景,包括服装插值、裁剪图编辑,以及基于点云和轮廓的服装生成。我们的项目网站位于 https://garment-particles.github.io 。

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

2026-05-27T04:00:00 · cs.CV, cs.LG, diffusion · 2605.26491

中文标题:超越成对偏好:面向扩散模型的列表级奖励感知对齐

作者:Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue

摘要:

Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.

摘要中文:

偏好优化作为一种高效替代方案,已用于对齐文本到图像的扩散模型,以取代基于人类反馈的在线强化学习(RLHF)。然而,现有方法在很大程度上将监督简化为二元成对比较。当训练数据中自然地为同一提示词包含多个候选图像时,这种成对约简方法便显得局限;同时,连续的奖励分数相较于单一的优胜者–失败者标签能够提供更为丰富的信息。为应对这些局限性,我们提出了Diffusion LAIR,这是一种面向扩散模型的、基于奖励的列表级偏好优化方法。对于每个提示,LAIR将一组候选图像的奖励得分转换为中心化的优势权重,然后以隐式奖励为目标函数对其进行优势加权回归优化;该隐式奖励定义为当前模型相对于固定参考模型在去噪损失上的改善,并附加一项二次惩罚项以约束隐式奖励的大小。由此得到的目标函数同时利用所有候选对象,而非仅选取成对的样本,并通过显式地约束隐式奖励的幅度来保持其保守性。LAIR目标在隐式奖励空间中存在一个有界的闭式最优解,从而明确了正则化强度如何调控偏好更新的幅度。实验结果表明,在文本到图像生成、组合式生成以及图像编辑等基准测试中,Diffusion LAIR在SD1.5和SDXL上均优于现有的强偏好优化基线方法。
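
The listwise idea can be sketched in a few lines: rewards for a group of candidates are centered into advantages, and each candidate's implicit reward (reference denoising loss minus current-model denoising loss) is pushed in the direction of its advantage under a quadratic penalty. The exact LAIR objective is not given in the abstract, so the specific form below, the function names, and the `lam` parameter are assumptions chosen to match the description, not the paper's definition.

```python
def centered_advantages(rewards):
    """Subtract the group mean so that advantages sum to zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def listwise_loss(rewards, loss_model, loss_ref, lam=1.0):
    """Advantage-weighted objective with a quadratic penalty (assumed form).

    rewards:    reward score of each candidate image for one prompt.
    loss_model: denoising loss of the current model on each candidate.
    loss_ref:   denoising loss of the fixed reference model on each candidate.
    lam:        hypothetical regularization strength.
    """
    advantages = centered_advantages(rewards)
    implicit = [ref - cur for cur, ref in zip(loss_model, loss_ref)]
    total = sum(-a * r + 0.5 * lam * r * r for a, r in zip(advantages, implicit))
    return total / len(rewards)
```

Under this assumed form each term is minimized at implicit reward = advantage / lam, so the optimum is bounded whenever the rewards are bounded, and a larger `lam` shrinks the preference update. That mirrors the role the abstract attributes to the regularization strength.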

Do Modern Post-Hoc Watermarking Methods Beat Broken-Arrows?

2026-05-27T04:00:00 · cs.CR, cs.CV, diffusion · 2605.27135

中文标题:现代事后水印方法能否胜过经典的Broken Arrows水印?

作者:Enoal Gesny, Eva Giboulot

摘要:

With the rapid proliferation of generative models, such as diffusion models, digital watermarking has emerged as a crucial solution for identifying AI-generated images. Modern post-hoc watermarking schemes use neural networks to achieve an extremely low false-alarm rate while remaining robust to common image transformations. However, there is a lack of comparison between these modern methods and classic ones, particularly in real-world scenarios where robustness and security take precedence over achieving an extremely low false-alarm probability. In this paper, we propose a fair comparison of robustness and security between modern and classic post-hoc watermarking across various types of classic augmentations and recent sophisticated attacks. Our experiments show that, in a realistic scenario, classic watermarking outperforms modern techniques in terms of security while maintaining robustness.

摘要中文:

随着扩散模型等生成模型的迅速普及,数字水印技术已成为识别人工智能生成图像的关键解决方案。现代事后水印方案利用神经网络,在对常见图像变换保持鲁棒性的同时,实现了极低的误报率。然而,这些现代方法与经典方法之间缺乏系统性的对比研究,尤其是在实际应用场景中,鲁棒性和安全性往往优先于追求极低的虚警率。本文针对多种经典数据增强方法以及近年来出现的各类复杂攻击,对现代后处理水印与经典后处理水印在鲁棒性和安全性方面的性能进行了公平比较。我们的实验表明,在实际应用场景下,经典水印技术在保证鲁棒性的同时,其安全性优于现代方法。

Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation

2026-05-27T04:00:00 · autoregressive, cs.CV, diffusion · 2507.16116

中文标题:Pusa V1.0:通过向量化时间步长适配解锁预训练视频扩散模型中的时序控制

作者:Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel

摘要:

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa V1.0, a versatile model that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Note that VTA is a non-destructive adaptation, which means that it fully preserves the capabilities of the base model. Unlike conventional methods like Wan-I2V, which finetune a base text-to-video (T2V) model with abundant resources to do image-to-video (I2V), we achieve comparable results in a zero-shot manner after an ultra-efficient finetuning process based on VTA. Moreover, this method also unlocks many other zero-shot capabilities simultaneously, such as start-end frames and video extension, all without task-specific training. Meanwhile, it keeps the T2V capability from the base model. Mechanistic analyses also reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to the vectorized timestep. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike.

摘要中文:

视频扩散模型的快速发展一直受到时间建模方面根本性局限的制约,尤其是传统标量时间步变量所施加的帧间演化刚性同步。尽管针对特定任务的适配方法与自回归模型已尝试应对这些挑战,但它们仍受限于计算效率低下、灾难性遗忘或适用范围狭窄等问题。在本工作中,我们提出了Pusa V1.0,这是一种基于向量化时间步适配(VTA)的通用模型,能够在统一的视频扩散框架中实现细粒度的时序控制。需要注意的是,VTA是一种非破坏性的适配方法,这意味着它能够完整保留基础模型的各项能力。与Wan-I2V等依赖大量资源对基础文生视频(T2V)模型进行微调以实现图生视频(I2V)的传统方法不同,我们在基于VTA的超高效微调之后,便能以零样本方式取得相当的效果。此外,该方法还能同时解锁多项其他零样本能力,例如首尾帧生成与视频扩展,且均无需进行任务特定的训练。同时,它保留了基础模型的T2V能力。机制分析还表明,我们的方法在保留基础模型生成先验的同时,精准地注入了时间动态,避免了向量化时间步所固有的组合爆炸问题。本研究构建了一种可扩展、高效且通用的新一代视频合成范式,使科研与工业界都能更容易地使用高保真视频生成。
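
The difference between a scalar and a vectorized timestep can be shown with a toy noising routine in which every frame carries its own timestep. The sketch assumes the interpolation x_t = (1 - t) x + t ε and assumes that image-to-video conditioning keeps the first frame at t = 0 (clean) while the remaining frames share a noise level. Both are illustrative assumptions rather than details confirmed by the abstract, and the function names are hypothetical.

```python
def noise_frames(frames, noise, timesteps):
    """Apply a separate timestep to each frame.

    frames, noise: lists of frames, each frame a flat list of floats.
    timesteps: one value in [0, 1] per frame (the "vectorized" timestep).
    """
    return [
        [(1.0 - t) * p + t * n for p, n in zip(frame, eps)]
        for frame, eps, t in zip(frames, noise, timesteps)
    ]


def i2v_timesteps(num_frames, t):
    """Keep the first frame clean and give all other frames noise level t."""
    return [0.0] + [t] * (num_frames - 1)
```

A scalar-timestep model corresponds to the special case where every entry of `timesteps` is equal. Start-end-frame conditioning would, under the same assumption, set both the first and last entries to zero.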

TAG: Tangential Amplifying Guidance for Hallucination-Resistant Sampling

2026-05-27T04:00:00 · cs.CV, diffusion · 2510.04533

中文标题:TAG:用于抗幻觉采样的切向放大引导

作者:Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, Kyong Hwan Jin

摘要:

Diffusion models achieve state-of-the-art image generation but often produce semantic inconsistencies, or hallucinations. Existing inference-time guidance methods rely on external signals or architectural modifications, adding computational overhead. We propose Tangential Amplifying Guidance (TAG), a training-free, architecture-agnostic, plug-and-play guidance method that operates purely on trajectory signals. TAG uses an intermediate sample as a projection basis and amplifies the tangential components of the estimated score to correct the sampling trajectory. A first-order Taylor analysis shows that this steers the state toward higher-probability regions of the data manifold, reducing inconsistencies and improving fidelity while adding negligible overhead to existing samplers. Code is available at our Project Page (https://hyeon-cho.github.io/TAG/).

摘要中文:

扩散模型在图像生成任务上取得了最先进的性能,但往往会产生语义上的不一致,即所谓的“幻觉”现象。现有的推理时引导方法依赖于外部信号或架构修改,从而增加了计算开销。我们提出了切向放大引导(Tangential Amplifying Guidance,简称TAG),这是一种无需训练、与网络架构无关的即插即用式引导方法,仅基于轨迹信号进行操作。TAG以一个中间样本作为投影基准,并通过放大估计得分的切向分量来校正采样轨迹。一阶泰勒展开分析表明,这一方法能够引导状态向数据流形上概率更高的区域演化,在几乎不增加现有采样器计算开销的同时,降低不一致性并提升保真度。代码可在我们的项目页面获取(https://hyeon-cho.github.io/TAG/)。
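
The projection step described above can be sketched as follows: the estimated score is split into a component parallel to the intermediate sample and a component orthogonal (tangential) to it, and only the tangential part is scaled. This is a minimal illustration of that decomposition. The function name `amplify_tangential`, the `eta` parameter, and the choice to leave the parallel component unchanged are assumptions, and the project page remains the reference for the actual update rule.

```python
def amplify_tangential(score, sample, eta=1.5):
    """Scale the part of `score` that is orthogonal to `sample` by `eta`.

    score, sample: flat lists of floats of equal length.
    eta: hypothetical amplification factor (eta = 1 leaves the score unchanged).
    """
    norm_sq = sum(x * x for x in sample)
    if norm_sq == 0.0:
        return list(score)  # no projection basis available
    coef = sum(s * x for s, x in zip(score, sample)) / norm_sq
    parallel = [coef * x for x in sample]                 # component along the sample
    tangential = [s - p for s, p in zip(score, parallel)]  # orthogonal remainder
    return [p + eta * t for p, t in zip(parallel, tangential)]
```

Because the operation only needs one dot product and a few vector additions per step, it is consistent with the abstract's claim of negligible overhead.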

Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representation Alignment

2026-05-27T04:00:00 · cs.CV, cs.LG, diffusion · 2511.16870

中文标题:对齐与反演:基于表征对齐的扩散模型与流模型在逆问题求解中的应用

作者:Loukas Sfountouris, Giannis Daras, Paris Giampouras

摘要:

Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a DINOv2 visual encoder, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we empirically show that aligning model representations of approximate target features can substantially enhance reconstruction quality and perceptual realism. We provide theoretical results showing (a) that REPA regularization can be viewed as a variational approach for minimizing a divergence measure in the DINOv2 embedding space, and (b) how under certain regularity assumptions REPA updates steer the latent diffusion states toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating REPA into multiple state-of-the-art inverse problem solvers, and provide extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirming that our method consistently improves reconstruction quality, while also providing efficiency gains by reducing the number of required discretization steps.

摘要中文:

近期研究表明,强制使基于扩散或流的生成模型与其预训练的自监督编码器在内部表征上保持一致,能够提供强大的归纳偏置,从而同时提升收敛速度与样本质量。在本文中,我们将这一思想推广至反问题领域,并将预训练的生成模型用作先验。我们提出在基于扩散或流的模型与DINOv2视觉编码器之间引入表征对齐(REPA),以在推理阶段指导重建过程。尽管在反问题中无法获得真实标签信号,我们通过实验证明,对近似目标特征的模型表征进行对齐,能够显著提升重建质量和感知逼真度。我们给出了理论结果,表明:(a) REPA正则化可被视为一种变分方法,用于在DINOv2嵌入空间中最小化某种散度度量;以及 (b) 在某些正则性假设下,REPA更新会引导潜在扩散状态逐步逼近干净图像的相应状态。这些结果为REPA在提升感知保真度方面的作用提供了新的见解。最后,我们通过将REPA集成到多个最先进的反问题求解器中,并在超分辨率、矩形区域修复、高斯模糊去卷积以及运动模糊去卷积等任务上开展了大量实验,充分验证了所提方法的通用性。实验结果表明,该方法不仅能够持续提升重建质量,还显著提高了计算效率,减少了所需的离散化迭代步数。
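
Representation alignment is commonly implemented as a cosine-similarity loss between (projected) model features and encoder features, computed per patch. The sketch below shows that loss in plain Python. The function names are hypothetical, and it assumes the model features have already been projected to the encoder's dimension and that the target features come from an approximate reconstruction, since the true clean image is unavailable in an inverse problem.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def alignment_loss(model_feats, target_feats):
    """One minus the mean patch-wise cosine similarity.

    model_feats:  list of per-patch feature vectors from the generative model.
    target_feats: list of per-patch encoder features of an approximate target.
    """
    sims = [cosine_similarity(m, t) for m, t in zip(model_feats, target_feats)]
    return 1.0 - sum(sims) / len(sims)
```

At inference time, the gradient of such a loss with respect to the current latent would be added to the solver's update as a regularization term.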

Guiding Token-Sparse Diffusion Models

2026-05-27T04:00:00 · cs.CV, diffusion · 2601.01608

中文标题:引导token稀疏的扩散模型

作者:Felix Krause, Stefan Andreas Baumann, Johannes Schusterbauer, Olga Grebenkova, Ming Gui, Vincent Tao Hu, Björn Ommer

摘要:

Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle in inference. This is due to their lacking response to Classifier-free Guidance (CFG) leading to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG preserves the high-variance of the conditional prediction better, achieving good quality and high variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while increasing throughput at the same time.

摘要中文:

扩散模型在图像生成任务中能够产出高质量结果,但在训练与推理阶段的计算开销依然高昂。近期的研究利用视觉内容固有的冗余性,通过仅基于部分视觉信息进行训练,从而降低了训练成本。尽管这些方法成功地实现了更廉价、更高效的训练,但稀疏训练的扩散模型在推理阶段仍表现欠佳。这是因为它们对无分类器引导(CFG)的响应不足,导致推理阶段的性能不尽如人意。为克服这一问题,我们提出了稀疏引导(SG)。SG不再以条件丢弃作为引导扩散模型的信号,而是采用token级别的稀疏性。因此,SG能够更好地保留条件预测的高方差特性,从而得到质量好且方差高的输出。通过在推理阶段利用token级稀疏性,SG能以更低的计算成本提升保真度:在常用的ImageNet-256基准上,以减少25%的浮点运算量取得了1.58的FID;而在与基线质量相当的情况下,最多可节省58%的浮点运算量。为验证稀疏引导的有效性,我们采用训练时稀疏策略训练了一个25亿参数的文本到图像扩散模型,并在推理阶段使用SG。SG在提升吞吐量的同时,还改善了构图能力与人类偏好评分。
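
Classifier-free guidance extrapolates from a weaker prediction toward a stronger one. The abstract indicates that Sparse Guidance obtains its guidance signal from token-level sparsity instead of conditional dropout. The sketch below shows only the extrapolation step, assuming the weaker prediction comes from a token-sparse forward pass of the same conditional model; the function name, argument names, and this exact combination rule are assumptions rather than the paper's stated formula.

```python
def guided_prediction(dense_pred, sparse_pred, scale):
    """Extrapolate from the token-sparse prediction toward the dense one.

    dense_pred:  model output computed on all tokens.
    sparse_pred: model output computed on a subset of tokens (weaker signal).
    scale:       guidance scale; 1.0 returns the dense prediction unchanged.
    """
    return [s + scale * (d - s) for d, s in zip(dense_pred, sparse_pred)]
```

Under this reading the second forward pass is cheaper than in standard CFG because it processes fewer tokens, which would be one source of the FLOP savings reported above.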

UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.04635

中文标题:UniPCB:一种生成辅助的印刷电路板缺陷检测框架

作者:Huan Zhang, Lianghong Tan, Yichu Xu, Zishan Su, Jiangzhong Cao, Huanqi Wu, Linwei Zhu, Xu Zhang

摘要:

In the Industrial Internet of Things (IIoT), enabling intelligent, real-time Printed Circuit Board (PCB) defect inspection is critical for ensuring product reliability. However, existing IIoT-based visual inspection systems face two compounding challenges: scarce and imbalanced defect samples that limit model training, and insufficient feature representation under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To resolve both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection within an IIoT-enabled pipeline. On the generation side, a Multi-modal Condition Generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and a Condition Modulation applies FiLM-style spatially-adaptive modulation at each scale, enabling structurally aligned and defect-aware sample synthesis to augment the scarce IIoT dataset. On the detection side, an Inverted Residual Shift Attention couples self-attention with shift-wise convolution to jointly capture global context and local texture, and a Cross-level Complementary Fusion Block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ demonstrate that UniPCB achieves mAP@0.5 of 98.0% and mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and SSIM of 0.619, outperforming existing conditional generation approaches.

摘要中文:

在工业物联网(IIoT)领域,实现智能化、实时的印刷电路板(PCB)缺陷检测对于保障产品可靠性至关重要。然而,现有的基于工业物联网的视觉检测系统面临两大相互叠加的挑战:缺陷样本稀缺且分布不均衡,制约了模型的训练;同时,在复杂的电路背景下,特征表达能力不足。现有的生成方法依赖于单模态条件,且对结构的控制较为粗略;而检测方法则侧重于改进模型架构,却未能解决数据瓶颈问题。为协同解决上述两项挑战,我们提出了一种基于生成技术的PCB缺陷检测框架,该框架在工业物联网赋能的流水线中实现了受控缺陷合成与任务特定缺陷检测的有机融合。在生成端,多模态条件生成器并行提取互补的边缘、深度和文本条件。随后,一个尺度编码器将这些条件信息以四种分辨率嵌入到扩散U-Net中,而条件调制模块则在每个尺度上采用FiLM风格的自适应空间调制,从而实现结构对齐且兼顾缺陷感知的样本生成,以扩充稀缺的工业物联网数据集。在检测端,一种倒残差移位注意力机制将自注意力与移位卷积相结合,协同捕获全局上下文与局部纹理;同时,跨层互补融合模块为像素级门控生成信号,实现选择性的跨层特征融合。合成样本直接扩充了检测训练集,从而使生成能力的提升与检测性能的提升相互促进。在DsPCBSD+ 数据集上的大量实验表明,UniPCB在缺陷检测任务上分别取得了98.0% 的mAP@0.5和61.8% 的mAP@0.5:0.95,优于所有对比方法;同时,其生成分支的FID达到129.61、SSIM达到0.619,性能均优于现有条件生成方法。
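
FiLM-style spatially-adaptive modulation scales and shifts a feature map with per-position coefficients predicted from the condition. The sketch below applies that operation to a single-channel feature map stored as nested lists. In a real model `gamma` and `beta` would be produced by a small network from the edge, depth, and text conditions; here they are passed in directly, and the function name is hypothetical.

```python
def film_modulate(features, gamma, beta):
    """Apply out[i][j] = gamma[i][j] * features[i][j] + beta[i][j].

    features, gamma, beta: H x W nested lists of floats (one channel).
    """
    return [
        [g * f + b for f, g, b in zip(f_row, g_row, b_row)]
        for f_row, g_row, b_row in zip(features, gamma, beta)
    ]
```

Applying this at each of the four U-Net resolutions mentioned in the abstract would mean calling the same operation once per scale, with condition maps resized to match.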

MiVE: Multiscale Vision-language features for reference-guided video Editing

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.14664

Chinese title: MiVE:用于参考引导视频编辑的多尺度视觉-语言特征

Authors: Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu

Abstract:

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.

Abstract (Chinese):

参考引导的视频编辑以源视频、文本指令和参考图像为输入,要求模型在忠实执行所指示的编辑操作的同时,保留原始运动及未编辑的内容。现有方法主要分为两大范式,且各自存在固有局限:解耦式编码器在分别处理指令与视觉内容时会面临模态差异问题,而统一的视觉‑语言编码器则因仅依赖最后一层的表示而丢失细粒度的空间信息。我们观察到,视觉语言模型的各层以层次化的方式编码互补信息:浅层捕捉对精确编辑至关重要的局部空间细节,而深层则编码用于理解指令的全局语义。基于这一洞察,我们提出了MiVE(用于参考引导视频编辑的多尺度视觉‑语言特征),该框架将视觉‑语言模型重新定位为多尺度特征提取器。MiVE从Qwen3-VL中提取层次化特征,并将其整合至统一的自注意力扩散Transformer中,从而消除了交叉注意力架构固有的模态不匹配问题。实验结果表明,MiVE在人类偏好评测中位居榜首,性能达到当前最优水平,优于各类学术方法和商用系统。

Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.22050

Chinese title: 破碎的记忆:利用退化生成样本来检测并缓解扩散模型中的记忆现象

Authors: Yuanmin Huang, Mi Zhang, Chen Chen, Feifei Li, Geng Hong, Xiaoyu You, Min Yang

Abstract:

While diffusion models excel at generating high-quality images, their tendency to memorize training data poses significant privacy and copyright risks. In this work, we for the first time identify that memorization induces internal numerical instability, often manifesting as visually "broken" artifacts. Inspired by stability analysis in numerical methods, we introduce empirical stability regions based on latent update norms to quantitatively characterize stable behavior during generation. Leveraging this, we propose a principled, on-the-fly framework for step-wise detection and adaptive mitigation. Our approach suppresses memorization without altering prompts or guidance, thereby preserving semantic fidelity and image quality. Extensive experiments on Stable Diffusion 1.4 demonstrate that our method achieves an AUC $>0.999$ detection performance and a $0.0\%$ memorization rate after mitigation with negligible overhead ($\approx0.01$s per image).

Abstract (Chinese):

尽管扩散模型在生成高质量图像方面表现卓越,但其易对训练数据产生记忆的特性也带来了严重的隐私与版权风险。在本研究中,我们首次发现,记忆现象会引发内部数值不稳定,通常表现为视觉上的“破损”伪影。受数值方法中稳定性分析的启发,我们基于潜变量更新范数提出了经验稳定性区域,以定量刻画生成过程中的稳定行为。基于此,我们提出了一种有原则的、即时(on-the-fly)的框架,用于逐步检测与自适应缓解。我们的方法在不改变提示词或引导(guidance)的前提下抑制记忆,从而保持语义保真度和图像质量。在Stable Diffusion 1.4上开展的大量实验表明,所提方法的检测AUC超过0.999,缓解后的记忆率为0.0%,且计算开销可忽略不计(每张图像约0.01秒)。

AI-T2I: Aggregating-and-Isolating Cross-Attention to Diffusion Models for Text-to-Image Synthesis

2026-05-27T04:00:00 · cs.CV, diffusion · 2605.25763

Chinese title: AI-T2I:面向文本到图像合成的扩散模型中的聚合与隔离式交叉注意力机制

Authors: Shipeng Cao, Biao Qian, Haipeng Liu, Yang Wang, Meng Wang

Abstract:

Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primarily focus on inter-subject-token activations (i.e., cross-attention scores) overlap for different subjects, overlooking the intra-subject-token activations scattering issue for identical subjects. In this paper, we propose an Aggregating-and-Isolating cross-attention approach to diffusion models for Text-to-Image synthesis, dubbed AI-T2I. Technically, to address the scattering issue, we devise an aggregation loss to identify and consolidate the scattered intra-token activations, which implicitly helps mitigate the potential overlap issue. Upon that, an isolation loss is further introduced to push the inter-token activations apart, thus fulfilling precise text-to-image alignment. Extensive experiments on various benchmarks demonstrate the superiority of AI-T2I over the state-of-the-art works for text-to-image synthesis. Furthermore, our AI-T2I exhibits excellent generalization across other tasks, e.g., controllable layout generation and personalized generation. Our code is available at https://github.com/Hatter77/AI-T2I.

Abstract (Chinese):

文本到图像生成取得了显著进展,这得益于扩散模型强大的生成能力。然而,这些模型在去噪过程中难以在交叉注意力图中实现文本与图像的精确对齐。现有研究主要关注不同主体之间标记激活值(即交叉注意力得分)的重叠,而忽视了同一主体内部标记激活值的分散问题。本文提出了一种用于文本到图像生成的扩散模型的聚合与隔离交叉注意力方法,称为AI-T2I。从技术层面而言,为解决激活分散问题,我们设计了一种聚合损失,用于识别并整合分散的标记内激活,这也隐式地有助于缓解潜在的重叠问题。在此基础上,进一步引入隔离损失,将不同标记之间的激活推开,从而实现精确的文本-图像对齐。在多个基准上的大量实验表明,AI-T2I在文本到图像生成任务上优于现有最先进方法。此外,AI-T2I在其他任务上也表现出优异的泛化能力,例如可控布局生成和个性化生成。我们的代码已在 https://github.com/Hatter77/AI-T2I 上公开。

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

2026-05-27T04:00:00 · autoregressive, cs.CV, cs.LG, diffusion · 2505.23606

Chinese title: Muddit:以统一的离散扩散模型解放文本到图像之外的生成

Authors: Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

Abstract:

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce the second-generation Meissonic: Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

Abstract (Chinese):

统一生成模型旨在以单一架构和解码范式,同时处理跨模态的多样化任务,如文本生成、图像生成以及视觉‑语言推理等。自回归统一模型因序列解码而推理速度较慢,而非自回归统一模型则因预训练主干网络的局限性而导致泛化能力较弱。我们推出了第二代Meissonic:Muddit,这是一种统一的离散扩散Transformer,能够在文本和图像两种模态上实现快速且并行的生成。与以往从头训练的统一扩散模型不同,Muddit将预训练文本到图像骨干网络的强大视觉先验与轻量级文本解码器相结合,在统一的架构下实现了灵活且高质量的多模态生成。实证结果表明,Muddit在质量和效率两方面均取得了与规模远大于自身的自回归模型相当甚至更优的性能。该研究指出,当配备强大的视觉先验时,纯离散扩散模型有望成为一种可扩展且高效的统一生成任务骨干。

A Unified Framework for Diffusion Model Unlearning with f-Divergence

2026-05-27T04:00:00 · cs.CV, cs.LG, diffusion · 2509.21167

Chinese title: 一种基于f-散度的扩散模型遗忘统一框架

Authors: Nicola Novello, Federico Fontana, Luigi Cinque, Deniz Gunduz, Andrea M. Tonello

Abstract:

Most existing methods for concept unlearning in text-to-image diffusion models minimize a mean squared error (MSE) loss between the denoiser outputs conditioned on a target and an anchor concept, which is implicitly the KL divergence between two Gaussians. We generalize this objective to any $f$-divergence, recovering MSE as the KL instance, and identify a family of $\alpha$-divergences whose Gaussian closed-form yields cheap, MSE-like training objectives. For the remaining $f$-divergences, we provide a min-max objective based on the variational formulation of the $f$-divergence. We theoretically analyze and numerically validate how different $f$-divergences impact the gradient magnitude and the convergence properties of the algorithm, affecting the quality of unlearning. For instance, we observe that the Hellinger closed-form instance consistently dominates MSE across multiple scenarios. More generally, the proposed unified framework offers a flexible paradigm for selecting the optimal divergence based on the application and user goal, allowing for finer control over the trade-off between unlearning efficacy and generative fidelity.

Abstract (Chinese):

现有大多数用于文本到图像扩散模型的概念遗忘方法,都是通过最小化目标概念与锚定概念条件下的去噪器输出之间的均方误差(MSE)损失来实现的,而该损失在隐含意义上等价于两个高斯分布之间的KL散度。我们将该目标泛化至任意f-散度,其中KL散度的特例即为均方误差;同时,我们识别出一类α-散度,其高斯分布下的闭式解可导出计算成本低廉、形式与均方误差相似的训练目标。对于其余的f-散度,我们基于f-散度的变分表示给出了一个极小极大目标函数。我们从理论上分析并以数值实验验证了不同f-散度对梯度幅值及算法收敛性质的影响,这些影响进而决定遗忘的质量。例如,我们观察到,在多个场景下,基于Hellinger距离的闭式解始终优于均方误差。更一般地,所提出的统一框架为根据具体应用与用户目标选择最优散度提供了一种灵活的范式,从而能够更精细地调控遗忘效果与生成保真度之间的权衡。
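The claim that MSE is "implicitly" a KL divergence follows from the closed form for two Gaussians with a shared isotropic covariance. The block below states that standard identity alongside the squared Hellinger distance for the same pair. Identifying the means with the denoiser outputs for the target and anchor concepts, and the normalization of the Hellinger distance, are assumptions made here for illustration; the paper's exact parameterization is not given in the abstract.

```latex
% Assumed setup: p = N(\mu_p, \sigma^2 I), q = N(\mu_q, \sigma^2 I),
% with \mu_p, \mu_q standing in for the two conditional denoiser outputs.
\begin{align}
  D_{\mathrm{KL}}(p \,\|\, q)
    &= \frac{\lVert \mu_p - \mu_q \rVert^2}{2\sigma^2}, \\
  H^2(p, q)
    &= 1 - \exp\!\left( -\frac{\lVert \mu_p - \mu_q \rVert^2}{8\sigma^2} \right).
\end{align}
```

The KL term is a rescaled squared error, so minimizing it over the means is the same as minimizing MSE. The Hellinger term is a bounded, saturating function of that same squared error, which is one way a different divergence can change gradient magnitudes without adding much cost.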

image_compression
Image Compression
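3 papers

Today's three Image Compression papers approach compression from different angles. The first, Degradation-Consistent Paired Training (DCPT), trains AI-generated image detectors on paired clean and degraded views so that accuracy holds up under JPEG compression, Gaussian blur, and resolution downsampling. The second, JLT, is a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes; it finds that predicting the clean latent outperforms a matched velocity-prediction DiT and reports FID-50K 2.50 on ImageNet 256×256 with classifier-free guidance. The third revisits why feature-map distillation fails when compressing Vision Transformers, attributes the failure to an encoding mismatch between teacher and student, and proposes two minimal fixes, Lift and WideLast. Taken together, the papers cover robustness to compression artifacts, generation in a compressed latent space, and model compression.

Featured papers

  • Degradation-Consistent Paired Training for Robust AI-Generated Image Detection: enforces feature and prediction consistency between clean and degraded views, raising degraded-condition average accuracy by 9.1 percentage points on Synthbuster at a cost of 0.9% clean accuracy.
  • JLT: Clean-Latent Prediction in Latent Diffusion Transformers: argues that the prediction target in latent diffusion is a representation-dependent geometric choice, with clean-latent prediction damping low-variance latent directions that velocity regression amplifies.
  • From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers: explains the failure of feature distillation in ViT compression and improves DeiT-Tiny distilled from CaiT-S24 from 74.86% to 77.53%/78.23% top-1 accuracy on ImageNet-1K.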

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

2026-05-27T04:00:00 · cs.AI, cs.CV, image_compression · 2604.10102

Chinese title: 面向鲁棒性人工智能生成图像检测的退化一致性成对训练

Authors: Zongyou Yang, Yinghan Hou, Xiaokun Yang

Abstract:

AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.

Abstract (Chinese):

在面对JPEG压缩、高斯模糊和分辨率下采样等真实场景中的图像退化时,人工智能生成图像检测器的性能会显著下降。我们观察到,包括B-Free在内的最先进方法将退化鲁棒性视为数据增强的副产品,而非明确的训练目标。在本工作中,我们提出了退化一致成对训练(DCPT),这是一种简单而有效的训练策略,通过成对的一致性约束显式地提升模型的鲁棒性。对于每一张训练图像,我们分别构建一个干净视图和一个退化视图,并施加两种约束:一是特征一致性损失,用于最小化干净视图与退化视图表示之间的余弦距离;二是基于对称KL散度的预测一致性损失,用于对齐不同视图的输出分布。DCPT不引入任何额外的参数,且推理开销为零。在Synthbuster基准测试集上(包含9个生成器和8种退化条件)的实验结果表明,与不采用成对训练的同基准模型相比,DCPT将退化条件下的平均准确率提升了9.1个百分点,同时仅损失了0.9% 的纯净样本准确率。在JPEG压缩条件下,性能提升最为显著,增幅为15.7%至17.9%。消融实验进一步表明,在训练数据有限的情况下,增加网络架构组件会导致过拟合,从而证实对于退化鲁棒性而言,优化训练目标比单纯扩充网络架构更为有效。
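The two consistency constraints described in the abstract (cosine distance between clean and degraded features, symmetric KL between the two output distributions) can be sketched as below. This is a hedged illustration only: the function names, the 0.5 factor in the symmetric KL, and the weights `lam_feat` and `lam_pred` are assumptions, and the classification loss that would accompany these terms is omitted.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def feature_consistency(f_clean, f_deg, eps=1e-8):
    # Mean cosine distance (1 - cosine similarity) between paired features.
    num = (f_clean * f_deg).sum(axis=-1)
    den = np.linalg.norm(f_clean, axis=-1) * np.linalg.norm(f_deg, axis=-1) + eps
    return float(np.mean(1.0 - num / den))

def prediction_consistency(logits_clean, logits_deg, eps=1e-12):
    # Symmetric KL between the two predicted class distributions.
    p, q = softmax(logits_clean), softmax(logits_deg)
    log_ratio = np.log(p + eps) - np.log(q + eps)
    kl_pq = (p * log_ratio).sum(axis=-1)
    kl_qp = (q * -log_ratio).sum(axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))

def paired_consistency_loss(f_clean, f_deg, logits_clean, logits_deg,
                            lam_feat=1.0, lam_pred=1.0):
    # Hypothetical weighting of the two terms; the task loss is left out.
    return (lam_feat * feature_consistency(f_clean, f_deg)
            + lam_pred * prediction_consistency(logits_clean, logits_deg))
```

Both terms act only on training-time outputs of the same network, which is consistent with the abstract's statement that the method adds no parameters and no inference overhead.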

JLT: Clean-Latent Prediction in Latent Diffusion Transformers

2026-05-27T04:00:00 · cs.CV, cs.LG, diffusion, image_compression · 2605.27102

Chinese title: JLT:潜在扩散Transformer中的干净潜在预测

Authors: Funing Fu, Tenghui Wang, Guanyu Zhou, Junyong Cen, Qichao Zhu

Abstract:

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256 x 256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.

Abstract (Chinese):

基于干净数据预测的流匹配方法表明,与预测环境空间中的含噪量相比,回归干净样本点能够更有效地利用低维结构。我们探讨,在图像被映射到学习得到的潜在空间之后,这一原则是否仍然有效;在该空间中,压缩已消除了大量原始像素级的变异。我们提出了JLT,这是一种基于冻结的FLUX.2 VAE编码的1.3亿参数潜在扩散Transformer,并在相同的表示、主干网络和训练设置下,将干净潜在预测与对应的速度预测DiT进行了对比。尽管在固定的加噪时刻下,变量x、ε和v之间可以线性互换,但局部高斯分析表明,速度回归会继承一个各向同性的目标协方差下界,并放大低方差的潜在方向;而干净预测则会抑制这些方向。在ImageNet 256×256上,JLT-B/1在使用无分类器引导时取得了2.50的FID-50K,并在目标匹配的对比中大幅领先速度预测。这些结果表明,潜在扩散模型中的预测目标是依赖于表征的几何选择,而非可互换的代数参数化。
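The statement that x, ε, and v are linearly convertible at a fixed corruption time can be checked numerically. The sketch below assumes one common flow-matching convention, z_t = t·x + (1 − t)·ε with v = x − ε; the abstract does not state which convention JLT uses, and all function names are hypothetical.

```python
import numpy as np

def corrupt(x, eps, t):
    # Assumed interpolation path: t = 1 is clean data, t = 0 is pure noise.
    return t * x + (1.0 - t) * eps

def velocity(x, eps):
    # Time derivative of the path above.
    return x - eps

def x_from_v(z_t, v, t):
    # Recover the clean latent from a velocity prediction.
    return z_t + (1.0 - t) * v

def eps_from_v(z_t, v, t):
    # Recover the noise from a velocity prediction.
    return z_t - t * v

def v_from_x(z_t, x, t):
    # Recover the velocity from a clean-latent prediction (requires t < 1).
    return (x - z_t) / (1.0 - t)
```

Under this convention the conversions are exact, but the factor 1 / (1 − t) in `v_from_x` shows that an error in one target is rescaled when expressed as another, so regressing different targets is not equivalent as a learning problem even though the algebra is invertible.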

From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

2026-05-27T04:00:00 · cs.CV, image_compression · 2511.15572

Chinese title: 从单张图像的低秩性到编码不匹配:重新审视视觉Transformer中的特征蒸馏

Authors: Huiyuan Tian, Bonan Xu, Shijian Li

Abstract:

Feature-map knowledge distillation (KD) transfers internal representations well between comparably sized Vision Transformers (ViTs), but it often fails in compression. We revisit this failure and uncover a paradox. Sample-wise SVD shows that each image is highly compressible, which seems to suggest that a narrow student with a linear projector should match the teacher "in principle". However, a dataset-level view contradicts this intuition: PCA shows that the teacher is a union of low-rank subspaces with significant subspace rotation across inputs. We further introduce token-level Spectral Energy Patterns (SEP) and find an architecture-invariant encoding law: tokens spread energy broadly across channel modes even when they live in low-rank subspace, creating a bandwidth mismatch. We refer to this combined phenomenon as an encoding mismatch. We propose two minimal remedies, Lift or WideLast: (i) Lift retains a lightweight lifting projector at inference to provide wider channel, or (ii) WideLast widens only the student's last block, enabling an input-dependent expansion. On ImageNet-1K, these fixes revive feature KD for ViT compression, improving DeiT-Tiny distilled from CaiT-S24 from 74.86% to 77.53%/78.23% top-1 accuracy, and they also strengthen students trained without distillation. Our analyses clarify when and why feature-map KD fails and then how to fix it. Code and raw data are provided in https://github.com/thy960112/From-Per-Image-Low-Rank-to-Encoding-Mismatch.

Abstract (Chinese):

特征图知识蒸馏(KD)能够在尺寸相近的视觉Transformer(ViT)之间较好地迁移内部表征,但在模型压缩任务中往往效果不佳。我们重新审视这一失败,并揭示了一个悖论。基于样本的奇异值分解表明,每幅图像都具有很高的可压缩性,这似乎暗示着:在原则上,配备线性投影器的轻量级学生模型应当能够与教师模型相媲美。然而,从数据集层面来看,这一直觉却与之相悖:主成分分析表明,教师模型可视为由多个低秩子空间的并集构成,且这些子空间在不同输入之间存在显著的旋转。我们进一步提出了基于标记的谱能量模式(SEP),并发现了一条与架构无关的编码规律:即使标记位于低秩子空间中,它们也会将能量广泛地分布于各个通道模态上,从而导致带宽失配。我们将这一复合现象称为编码不匹配。我们提出了两种最小化改进方法:Lift和WideLast:(i) Lift在推理阶段保留一个轻量级的提升投影层,以扩大通道宽度;(ii) WideLast仅扩展学生模型的最后一个模块,从而实现基于输入的自适应扩张。在ImageNet‑1K数据集上,这些改进使特征蒸馏技术重新适用于ViT的压缩,将从CaiT‑S24中蒸馏得到的DeiT‑Tiny的top‑1准确率由74.86% 提升至77.53%/78.23%;同时,它们还增强了未经过蒸馏训练的学生模型的表现。我们的分析阐明了特征图知识蒸馏在何时以及为何会失效,并进一步提出了相应的改进方法。代码和原始数据已在 https://github.com/thy960112/From-Per-Image-Low-Rank-to-Encoding-Mismatch 中提供。
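The paradox in the abstract, where each image is low-rank but the dataset is not, can be reproduced on synthetic data. The sketch below is a toy illustration, not the paper's analysis: it builds token features that are exactly rank 2 per image but live in a different random subspace for each image, then compares a 99% energy rank per image with the same measure on the mean-centered stack. The sizes, the energy threshold, and all names are assumptions.

```python
import numpy as np

def energy_rank(matrix, energy=0.99):
    # Smallest number of singular directions holding `energy` of the spectrum.
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)

def make_features(num_images=20, tokens=16, channels=32, rank=2, seed=0):
    # Each image gets its own random rank-`rank` subspace (subspace rotation).
    rng = np.random.default_rng(seed)
    return np.stack([
        rng.normal(size=(tokens, rank)) @ rng.normal(size=(rank, channels))
        for _ in range(num_images)
    ])

def compare_ranks(features):
    # Sample-wise view: SVD of each image's token-by-channel matrix.
    per_image = [energy_rank(f) for f in features]
    # Dataset-level view: PCA-style spectrum of all tokens stacked together.
    stacked = features.reshape(-1, features.shape[-1])
    centered = stacked - stacked.mean(axis=0, keepdims=True)
    return per_image, energy_rank(centered)
```

Calling `compare_ranks(make_features())` returns per-image ranks of at most 2 and a much larger dataset-level rank, which mirrors why a narrow student with a single fixed linear projector cannot cover every input's subspace at once.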

visual_tokenizer_1d
1D Visual Tokenizer
0 papers

No matching papers were found in this category today.

diffusion_visual_encoder
Diffusion Visual Encoder
0 papers

No matching papers were found in this category today.