Daily arXiv Paper Digest
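Today's arXiv papers show generative models diversifying and merging across fields. The autoregressive and diffusion paradigms overlap on video generation, graph-structured modeling, and visual tokenization, which suggests that researchers are exploring unified frameworks that balance efficiency and quality. Several papers focus on efficiency and real-time operation: DSA speeds up autoregressive video generation through dynamic step allocation, ChannelTok compresses visual tokens to flexible lengths, and InstantRetouch emphasizes training-free image restoration.
Combining agents with diffusion models is another emerging theme: AgenticDiffusion applies diffusion-based planning to vision-based UAV navigation, and PointAction uses 3D points as a universal action representation for robot control, extending generative models toward physical interaction. On personalization and controllability, Concept-Incremental Versatile Customization supports incremental expansion of concepts, while DiverAge produces diverse face aging results while preserving identity.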
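- Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation. Proposes a learnable evolving memory for real-time infinite video generation, targeting long-horizon consistency.
- Representation Forcing for Bottleneck-Free Unified Multimodal Models. Addresses the information bottleneck in unified multimodal models, a step toward unified multimodal understanding and generation.
- PointAction: 3D Points as Universal Action Representations for Robot Control. Uses 3D points as a universal action representation, aiming to make robot control more generalizable and flexible.
- AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation. Combines diffusion-based planning with agentic reasoning for vision-based UAV navigation.
- Crafting Your Evolving Dreams: Concept-Incremental Versatile Customization. Supports concept-incremental customization of generative models, balancing diversity and fidelity.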
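Today's Autoregressive track covers 11 papers. The digest keeps each paper's original abstract, Chinese abstract, authors, and link, so you can screen quickly first and then move the papers worth a close read into org-roam.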
Scaling Novel Graph Generation via Lightweight Structure-Guided Autoregressive Models
Chinese title: 基于轻量级结构引导自回归模型的可扩展新图生成
Authors: Alessio Barboni, Massimiliano Lupo Pasini, Bishal Lakha, Edoardo Serra
Generating realistic and diverse graphs is a key problem in machine learning, with applications in molecular discovery, circuit design, cybersecurity, and beyond. However, current graph generative models remain limited by scalability and novelty. Diffusion-based methods often require costly full-adjacency operations and long denoising chains, while many autoregressive and hybrid models have at least quadratic complexity. In addition, these models often imitate training graphs rather than generalize beyond them. We propose a lightweight autoregressive framework to address these issues. It uses a structure-guided topological ordering to serialize graphs into regular edge sequences, enabling near log-linear generation, and a two-phase training strategy that combines exploration-oriented augmentation with iterative refinement to reduce overfitting and promote controlled novelty. Experiments on molecular and non-molecular benchmarks show that our approach improves novelty while preserving high validity and uniqueness. The framework also supports both LSTM and Mamba-style causal sequence backbones, with large-memory accelerators enabling longer graph-sequence experiments beyond typical GPU limits.
生成真实且多样的图是机器学习中的关键问题,应用于分子发现、电路设计、网络安全等领域。然而,当前的图生成模型在可扩展性和新颖性方面仍存在局限性。扩散方法通常需要昂贵的完整邻接操作和较长的去噪链,而许多自回归和混合模型至少具有二次复杂度。此外,这些模型通常模仿训练图而非超越它们进行泛化。我们提出了一个轻量级自回归框架来解决这些问题。该框架使用结构引导的拓扑排序将图序列化为规则的边序列,实现近似对数线性的生成;并采用两阶段训练策略,结合探索性增强和迭代细化来减少过拟合并促进受控的新颖性。在分子和非分子基准上的实验表明,我们的方法在保持高有效性和唯一性的同时提高了新颖性。该框架还支持LSTM和Mamba风格的因果序列骨干网络,大内存加速器能够支持超出典型GPU限制的更长图序列实验。
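A minimal sketch of serializing a graph into an edge sequence that an autoregressive model could consume one element at a time. The abstract does not spell out the structure-guided ordering, so this sketch swaps in a plain breadth-first ordering as a stand-in. The function name `serialize_edges` and the toy graph are hypothetical.

```python
from collections import deque

def serialize_edges(adjacency, start):
    """Return edges in BFS discovery order (a stand-in for the paper's ordering).

    adjacency: dict mapping node -> iterable of neighbours (undirected graph).
    Each undirected edge is emitted exactly once, as a pair of BFS positions.
    """
    order = {start: 0}          # node -> position in the BFS ordering
    queue = deque([start])
    sequence = []
    seen_edges = set()
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency[node]):
            edge = frozenset((node, neighbour))
            if edge in seen_edges:
                continue
            seen_edges.add(edge)
            if neighbour not in order:
                order[neighbour] = len(order)
                queue.append(neighbour)
            sequence.append((order[node], order[neighbour]))
    return sequence

# Toy 4-cycle with one chord; node labels are arbitrary.
graph = {"a": ["b", "d"], "b": ["a", "c", "d"], "c": ["b", "d"], "d": ["a", "b", "c"]}
edge_sequence = serialize_edges(graph, "a")
```

Because nodes are relabeled by their position in the traversal, the same graph with shuffled node names yields a sequence of the same form, which is the property a sequence model needs from any such ordering.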
Extending Fair Null-Space Projections for Continuous Attributes to Kernel Methods
Chinese title: 将面向连续属性的公平零空间投影扩展到核方法
Authors: Felix Störck, Fabian Hinder, Barbara Hammer
With the ongoing integration of machine learning systems into the everyday social life of millions, the notion of fairness becomes an ever-increasing priority in their development. Fairness notions commonly rely on protected attributes to assess potential biases. Here, the majority of the literature focuses on discrete setups regarding both target and protected attributes. The literature on continuous attributes, especially in conjunction with regression (we refer to this as "continuous fairness"), is scarce. A common strategy is iterative null-space projection, which as of now has only been explored for linear models or embeddings such as those obtained by a non-linear encoder. We improve on this by extending it to kernel-induced feature spaces by means of the "empirical feature space". We theoretically derive this as a direct transformation of the kernel matrix, yielding a model- and fairness-score-agnostic method applicable to continuous protected attributes. We demonstrate that our novel approach, in conjunction with Support Vector Regression (SVR), provides competitive or improved performance across multiple datasets in comparison to other contemporary methods.
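A minimal sketch of a single null-space projection step in the linear case, which the abstract names as the setting the authors extend to kernel feature spaces. Given a direction `w` along which a protected attribute can be linearly predicted, each feature vector is projected onto the subspace orthogonal to `w`. The helper name `project_out` and the toy data are assumptions for illustration, and the kernel ("empirical feature space") variant is not shown.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(x, w):
    """Project x onto the null space of w: x - (x.w / w.w) * w."""
    scale = dot(x, w) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(x, w)]

# Hypothetical direction that predicts the protected attribute.
w = [1.0, 2.0, 0.0]
features = [[3.0, 1.0, 4.0], [0.5, -2.0, 1.0]]
projected = [project_out(x, w) for x in features]
```

In an iterative scheme, this step would be repeated with a freshly fitted `w` until the attribute can no longer be predicted from the projected features.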
Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics
Chinese title: 通过潜动态统一无模型效率与基于模型的表示
Authors: Jashaswimalya Acharjee, Balaraman Ravindran
We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm that unifies the efficiency of model-free methods with the representational strengths of model-based approaches, without incurring planning overhead. By embedding state-action pairs into a latent space in which the true value function is approximately linear, our method supports a single set of hyperparameters across diverse domains -- from continuous control with low-dimensional and pixel inputs to high-dimensional Atari games. We prove that, under mild conditions, the fixed point of our embedding-based temporal-difference updates coincides with that of a corresponding linear model-based value expansion, and we derive explicit error bounds relating embedding fidelity to value approximation quality. In practice, ULD employs synchronized updates of encoder, value, and policy networks, auxiliary losses for short-horizon predictive dynamics, and reward-scale normalization to ensure stable learning under sparse rewards. Evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari, our approach matches or exceeds the performance of specialized model-free and general model-based baselines -- achieving cross-domain competence with minimal tuning and a fraction of the parameter footprint. These results indicate that value-aligned latent representations alone can deliver the adaptability and sample efficiency traditionally attributed to full model-based planning.
我们提出统一潜动态(Unified Latent Dynamics, ULD),这是一种新型强化学习算法,将无模型方法的高效性与基于模型方法的表示优势相结合,且无需引入规划开销。通过将状态-动作对嵌入到一个真实价值函数近似线性的潜空间中,我们的方法支持在从低维和像素输入的连续控制到高维Atari游戏的多种领域中使用单一超参数集。我们证明,在温和条件下,我们基于嵌入的时序差分更新的不动点与相应线性基于模型的价值扩展的不动点一致,并推导出了将嵌入保真度与价值近似质量联系起来的具体误差界。在实际应用中,ULD采用编码器、价值网络和策略网络的同步更新、短期预测动力学的辅助损失,以及奖励尺度归一化来确保稀疏奖励下的稳定学习。在涵盖Gym运动控制、DeepMind控制(本体感知和视觉)和Atari的80个环境中进行评估,我们的方法达到了或超越了专门的无模型和通用基于模型基线的性能——以极少的调参和一小部分参数实现了跨领域能力。这些结果表明,仅凭与价值对齐的潜表示就能实现传统上归因于完整基于模型规划的适应性和样本效率。
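The idea of a value function that is linear in a latent embedding can be illustrated with a textbook linear TD(0) update. This is a generic sketch, not the ULD algorithm: the embeddings are fixed toy vectors rather than a learned encoder, and `td0_update`, `alpha`, and `gamma` are illustrative names.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def td0_update(w, phi, reward, phi_next, alpha=0.1, gamma=0.9):
    """One linear TD(0) step: w <- w + alpha * delta * phi."""
    delta = reward + gamma * dot(w, phi_next) - dot(w, phi)  # TD error
    return [wi + alpha * delta * pi for wi, pi in zip(w, phi)]

# Two-state toy chain: s0 -> s1 (reward 0), s1 -> terminal (reward 1).
phi_s0, phi_s1, phi_terminal = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
w = [0.0, 0.0]
for _ in range(500):
    w = td0_update(w, phi_s0, 0.0, phi_s1)
    w = td0_update(w, phi_s1, 1.0, phi_terminal)
```

With one-hot embeddings the weights are simply the state values, so the fixed point is V(s1) = 1 and V(s0) = gamma * V(s1) = 0.9.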
End-to-End Text Line Detection and Ordering
Chinese title: 端到端文本行检测与排序
Authors: Benjamin Kiessling (ALMAnaCH)
Practical text-recognition pipelines for historical documents typically decompose layout analysis into line detection followed by a separate reading-order step, with the latter most often handled by a hand-coded geometric heuristic that struggles with marginalia, multiple columns, tables, and source-specific editorial conventions. This article introduces Orli (Ordered Regression of Lines), an end-to-end model that casts both sub-tasks as a single image-to-sequence problem: from a page image, Orli autoregressively generates text-line baselines directly in reading order. Baselines are represented in a chord-frame parameterization that anchors a line's position, orientation, and extent while encoding local geometry through perpendicular offsets; an iterative refinement head and a local visual refiner produce the final curve. Trained on a heterogeneous corpus of 196,691 pages spanning ten writing systems, Orli marginally exceeds the previously reported state of the art for cBAD line detection without dataset-specific training, reaches near perfect coverage and ordering on multiple reading-order benchmarks zero-shot, and adapts to more specialized out-of-domain layouts with limited fine-tuning. The method's source code and model weights are available under an open license at https://github.com/mittagessen/orli.
历史文档的实用文本识别流程通常将布局分析分解为行检测步骤和单独的阅读顺序步骤,其中后者最常采用手工编码的几何启发式方法处理,但这种方法难以应对旁注、多栏、表格和特定来源的编辑惯例。本文提出Orli(有序行回归),这是一种将两个子任务统一为单一图像到序列问题的端到端模型:从页面图像出发,Orli自回归地直接按阅读顺序生成文本行基线。基线采用弦框架参数化表示,该表示锚定线条的位置、方向和范围,同时通过垂直偏移编码局部几何信息;迭代细化头和局部视觉细化器共同生成最终曲线。Orli在包含10种书写系统、共196,691页的异构语料库上进行训练,在未经数据集特定训练的情况下,其cBAD行检测性能略超此前报道的最佳水平,在多个阅读顺序基准上实现接近完美的覆盖率和排序准确率(零样本),并可在有限微调下适应更专业的域外布局。该方法的源代码和模型权重已在开源许可下发布于https://github.com/mittagessen/orli。
DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation
Chinese title: DSA:用于快速自回归视频生成的动态步长分配
Authors: Thanh-Tung Le, Yunhan Zhao, Menglei Chai, Zhengyang Shen, Zhe Cao, Danhang Tang, Xiaohui Xie, Deying Kong
Video diffusion transformers have achieved state-of-the-art visual quality, but their high inference cost remains a major bottleneck for real-time applications. Recent distillation frameworks produce autoregressive video diffusion models with reduced latency, yet these models still use a fixed number of denoising steps per frame, wasting computation on predictable frames and under-refining challenging ones. We present DSA, a confidence-guided adaptive computation framework for AR video diffusion. DSA introduces a lightweight confidence head, trained jointly with the generator under a distribution-matching distillation objective, to estimate per-frame denoising reliability. At inference, this confidence signal dynamically adjusts the number of diffusion steps: simple frames terminate early for speed, while complex frames receive additional refinement. Our method requires no extra video data, no heuristics, and little architectural modification. Experiments show that DSA achieves real-time autoregressive video generation, reaching 22.63 FPS with sub-second latency on H100 GPUs, while maintaining competitive or superior VBench quality compared to recent autoregressive and bidirectional video diffusion models. Our results demonstrate that confidence-guided adaptive sampling provides an effective and practical path toward interactive video generation.
视频扩散Transformer已实现最先进的视觉质量,但其高推理成本仍是实时应用的主要瓶颈。近期蒸馏框架能够生成降低延迟的自回归视频扩散模型,但这些模型每帧仍使用固定数量的去噪步骤,在可预测的帧上浪费计算资源,而对具有挑战性的帧细化不足。我们提出了DSA,一个用于自回归视频扩散的置信度引导自适应计算框架。DSA引入了一个轻量级置信度头,在分布匹配蒸馏目标下与生成器联合训练,用于估计每帧的去噪可靠性。在推理时,该置信度信号动态调整扩散步数:简单帧提前终止以提速,而复杂帧获得额外细化。我们的方法不需要额外视频数据、不需要启发式方法,且仅需少量架构修改。实验表明,DSA实现了实时自回归视频生成,在H100 GPU上达到22.63 FPS且延迟低于1秒,同时在VBench质量上保持与近期自回归和双向视频扩散模型相当或更优的水平。我们的结果表明,置信度引导的自适应采样为交互式视频生成提供了一条有效且实用的路径。
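The confidence-guided early exit described in the abstract can be sketched as a loop that stops denoising once a confidence score crosses a threshold. Everything numeric here is an assumption for illustration: the toy step halves the residual noise, confidence is defined as one minus that noise, and `adaptive_denoise`, `threshold`, and `max_steps` are hypothetical names. In the paper the confidence comes from a learned head.

```python
def adaptive_denoise(noise_level, threshold=0.9, max_steps=8):
    """Run denoising steps until a toy confidence score crosses the threshold.

    Assumptions: each step halves the residual noise, and confidence is
    1 - noise. A real system would query a learned confidence head instead.
    """
    steps = 0
    while steps < max_steps:
        noise_level *= 0.5           # stand-in for one diffusion step
        steps += 1
        confidence = 1.0 - noise_level
        if confidence >= threshold:  # early exit for easy frames
            break
    return steps, noise_level

easy_steps, _ = adaptive_denoise(0.15)  # predictable frame
hard_steps, _ = adaptive_denoise(1.0)   # challenging frame
```

The easy frame exits after a single step while the hard frame takes four, which is the compute reallocation the abstract describes.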
Imagine Before You Draw: Visual Prompt Engineering for Image Generation
Chinese title: 先想后画:图像生成的视觉提示工程
Authors: Liyu Jia, Fengda Zhang, Jiachun Pan, Kesen Zhao, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao, Aojun Zhou, Hanwang Zhang
Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
将视觉语义表示作为图像生成之前的中间步骤可以减少文本与图像之间的建模难度,从而提高生成质量。最近的研究如X-Omni和BLIP3o-Next探索了这一方向,但它们通常采用两阶段外部流程:独立的自回归模型首先生成语义token,然后将其作为条件输入到独立的扩散解码器。由于解码器无法同时访问原始输入和语义规划,这种设计引入了信息瓶颈,限制了编辑等下游任务中的细节保留。Transfusion、BAGEL和Show-o2等内部架构通过在单一模型中实现跨模态交互来避免这一瓶颈,但它们仍然面临缺乏中间语义指导的文本到像素建模差距。我们提出视觉提示工程(VPE),可无缝集成到此类内部框架中。具体而言,模型首先自回归生成捕捉语义布局的视觉语义token(如SigLIP 2)作为“视觉提示”,然后在此规划条件下生成完整图像token。我们在类别条件生成、文本到图像生成和图像编辑等不同任务中验证了VPE,涵盖了多种token类型和模型架构。结果表明,VPE能够加速收敛、提升质量上限,并通过内部集成实现显著更好的编辑保真度(PSNR:26.76 vs 19.92),同时保持具有竞争力的编辑响应能力。
ChannelTok: Efficient Flexible-Length Vision Tokenization
Chinese title: ChannelTok:高效的灵活长度视觉标记化
Authors: Sukriti Paul, Arpit Bansal, Tom Goldstein
Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: https://channeltok.github.io
领先的灵活视觉标记化方法以极高的代价实现了最先进的质量,依赖于参数量庞大的骨干网络和缓慢的多步生成解码器。我们摒弃这种复杂的空间标记范式,引入一种简单、轻量且快速的通道级灵活长度标记化器。我们的方法将每个潜在通道视为一个视觉标记,实现了参数高效的CNN-Transformer混合骨干网络。此外,在训练过程中采用随机尾部丢弃范式自然地迫使通道按语义重要性组织。这使得在推理时可以通过简单地保留前k个通道实现灵活压缩,并自然地支持可变长度自回归图像生成。我们在ImageNet上通过大量实验验证了我们的方法,证明了在不同的标记预算下均能保持一致的质量。结果确立了新的质量-效率前沿:我们的模型实现了最先进的感知质量(rFID 2.92),同时解码速度比次优方案快8.6倍,参数规模小2.1倍(159M参数)。我们的工作确立了通道级标记化作为高效视觉表示的一种强大且实用的范式。项目主页:https://channeltok.github.io
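A minimal sketch of the tail-dropping idea from the abstract: during training a random-length prefix of the channel tokens is kept, and at inference the code is shortened by keeping the first k channels. The function names and the toy latent are hypothetical, and the encoder and decoder networks are omitted entirely.

```python
import random

def sample_tail_drop(channels, rng):
    """Training-time view: keep a random-length prefix, drop the tail."""
    k = rng.randint(1, len(channels))   # inclusive on both ends
    return channels[:k]

def compress(channels, k):
    """Inference-time compression: keep only the first k channel tokens."""
    return channels[:k]

# Each inner list stands for one latent channel treated as a token.
latent = [[0.9, 0.1], [0.4, 0.3], [0.05, 0.02], [0.01, 0.0]]
rng = random.Random(0)
train_view = sample_tail_drop(latent, rng)
short_code = compress(latent, 2)
```

Because later channels are dropped more often than earlier ones during training, a decoder trained this way has an incentive to place the most important information in the leading channels, which is what makes prefix truncation usable at inference.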
MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation
Chinese title: MeshWeaver:稀疏体素引导的表面编织自回归网格生成
Authors: Jiale Xu, Wang Zhao, Ying Shan
Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language-modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high-poly meshes, and (ii) absence of geometry-aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi-level sparse-voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross-attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse-to-fine vertex prediction in a single decoding step, while tightly coupling the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state-of-the-art compression ratio of 18%, can generate meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.
自回归网格生成通过将网格标记化为序列并以语言建模方式训练模型而受到关注。然而,现有方法存在两个根本性局限:(i) 标记化效率低下,导致标记序列过长,无法扩展到高多边形网格;(ii) 缺乏几何感知引导,因为生成仅以全局形状嵌入为条件,而非局部表面线索。本文提出MeshWeaver,一种将网格生成视为表面编织过程的自回归框架,通过直接预测下一个顶点而非独立坐标来进行生成。其核心是一个多层稀疏体素编码器,通过三种互补方式将几何上下文注入生成过程:提供体素特征作为顶点表示,通过对体素特征的交叉注意力引导标记预测,以及作为结构支架将生成约束在输入表面周围。我们的层次化设计能够在单次解码步骤中实现粗到细的顶点预测,同时将生成模型与3D几何紧密耦合。大量实验表明,MeshWeaver达到了18%的最先进压缩比,能够生成最多16K面的网格,并在几何保真度方面显著优于先前方法。
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
Chinese title: Echo-Infinity:面向实时无限视频生成的演进记忆学习
Authors: Yuxuan Bian, Zeyue Xue, Songchun Zhang, Shiyi Zhang, Weiyang Jin, Yaowei Li, Junhao Zhuang, Haoran Li, Jie Huang, Haoyang Huang, Nan Duan, Qiang Xu
We present Echo-Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors due to their limited cache window and ignorance of autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance, and, to our knowledge, demonstrates promising 24-hour (>1.3 M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.
我们提出 Echo-Infinity,一个面向实时无限视频生成的自回归(AR)框架,该框架采用可学习的演进记忆来以恒定成本动态过滤、抽象和压缩任意长度的历史信息。现有方法主要通过预设的 KV-cache 调度策略、固定比例的启发式压缩或推理时的 RoPE 适配来管理记忆。这些设计不可避免地会丢失历史信息,并因有限的缓存窗口和对自回归生成噪声的忽视而放大累积误差。受到人类记忆整合的启发,Echo-Infinity 用可学习的记忆查询(Memory Query)取代了人工设计的记忆管理机制,这些查询通过注意力机制和门控机制进行更新,当过去的帧从局部窗口中被移除时会被更新。这些查询与视频扩散Transformer(DiT)进行端到端优化,形成了一种支持任意压缩比且计算量与视频长度无关的演进记忆。它们还充当可泛化的生成先验,即使仅使用优化后的初始状态也能提升质量。我们进一步提出统一相对 RoPE 方案(Unified Relative RoPE Recipe),将汇帧(sink frames)的 id 锚定为 0,并让最新的帧 id 在训练和推理过程中最多增长到 DiT 预训练的最大时间 RoPE id,使模型摆脱有限 RoPE 的约束,并弥合训练-测试 RoPE 外推差距。在长视频和短视频生成任务上,Echo-Infinity 达到了最先进的性能,据我们所知,它首次展示了有前景的 24 小时(超过 130 万帧)实时生成,为无限视频生成提供了一条可行路径。
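A rough sketch of a fixed-size memory updated by attention and a gate when frames leave the local window, as the abstract describes at a high level. The details are assumptions: the gate is a fixed scalar rather than a learned function, features are tiny toy vectors, and `update_memory` is a hypothetical name. The point is only that the memory size stays constant no matter how many frames are absorbed.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def update_memory(memory, evicted, gate):
    """Blend each memory query with an attention read over evicted frames.

    memory:  list of query vectors (fixed length across updates)
    evicted: list of feature vectors for frames leaving the local window
    gate:    scalar in [0, 1]; 0 keeps the old memory, 1 overwrites it
    """
    updated = []
    for q in memory:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(len(q)) for f in evicted]
        weights = softmax(scores)
        read = [sum(w * f[d] for w, f in zip(weights, evicted)) for d in range(len(q))]
        updated.append([(1.0 - gate) * qd + gate * rd for qd, rd in zip(q, read)])
    return updated

memory = [[1.0, 0.0], [0.0, 1.0]]
evicted = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
new_memory = update_memory(memory, evicted, gate=0.5)
```

Calling `update_memory` repeatedly as more frames are evicted keeps `len(new_memory)` fixed, so the cost of attending to the memory does not grow with video length.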
Representation Forcing for Bottleneck-Free Unified Multimodal Models
Chinese title: 用于无瓶颈统一多模态模型的表征强制方法
Authors: Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu
Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.
统一多模态模型(UMMs)旨在在单一模型中处理感知和生成任务。然而,现有的UMMs仍然依赖于冻结的、单独预训练的VAE进行图像生成,这造成了结构瓶颈。简单地移除它会引入质量差距,因为模型必须从原始像素同时学习高级结构和低级细节。在本文中,我们提出了表征强制(Representation Forcing,RF)技术,这是一种通过使表征预测成为模型的原生能力来弥合这一差距的方法。具体而言,RF强制解码器在像素之前自回归地预测视觉表征作为中间tokens;这些tokens保持在上下文中,以在同一骨干网络内引导像素扩散。通过将表征从感知输出转变为生成目标,RF消除了对任何外部生成潜空间的需求。我们发现RF同时有益于理解和生成任务。在图像生成方面,采用RF的像素空间模型达到了最先进的基于VAE的统一模型的水平。在图像理解方面,像素空间RF通常优于其基于VAE的变体。这些结果为实现端到端、无瓶颈的UMMs提供了有效的一步。
AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation
Chinese title: AAD-1:用于一步自回归视频生成的非对称对抗蒸馏方法
Authors: Haobo Li, Yanhong Zeng, Yunhong Lu, Jiapeng Zhu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yujun Shen, Zhipeng Zhang
We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.
我们提出了AAD-1,一个用于一步自回归图像到视频生成的非对称对抗蒸馏框架。当前最先进的方法采用对抗蒸馏,但存在运动崩溃和训练不稳定的问题,导致生成静态视频。AAD-1通过架构和训练策略两个关键设计来解决这些挑战。我们的关键架构洞察是打破生成器与判别器之间的对称性。生成器保持因果结构以保留自回归采样能力,而判别器则对完整的时空上下文进行双向注意力处理,并为整个视频序列生成单一的整体真实感评分。这种非对称设计使判别器能够有效检测导致自回归生成中运动崩溃的全局时间失败和长期漂移。为稳定训练,我们引入了分阶段策略,首先使用分布匹配来引导稳定的一步生成器,提供预热阶段使学生分布更接近教师模型,随后开始对抗蒸馏。在VBench上的大量实验表明,AAD-1在一步自回归视频生成中实现了最先进的性能。
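The asymmetry between generator and discriminator comes down to which positions each one may attend to. The sketch below builds frame-level attention masks for the two cases. It is a simplification: a real video model attends over spatiotemporal tokens, and the function names are hypothetical.

```python
def causal_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]

def bidirectional_mask(num_frames):
    """Every frame may attend to every other frame."""
    return [[True] * num_frames for _ in range(num_frames)]

generator_mask = causal_mask(4)             # preserves autoregressive sampling
discriminator_mask = bidirectional_mask(4)  # sees the whole clip at once
```

With the causal mask, the first frame sees only itself, so the generator can still emit frames one at a time. The bidirectional mask lets the discriminator compare early and late frames directly, which is what would allow it to notice long-range drift.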
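Today's Diffusion track covers 24 papers. The digest keeps each paper's original abstract, Chinese abstract, authors, and link, so you can screen quickly first and then move the papers worth a close read into org-roam.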
Counterfactual Explanations for Deep Two-Sample Testing
Chinese title: 深度双样本检验的反事实解释
Authors: Wei-Cheng Lai, Marco Simnacher, Christoph Lippert
Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
双样本检验是检测跨科学领域分布差异的基本工具,但经典检验(包括基于核的检验)在处理图像等高维结构化数据时可能效果不佳。深度双样本检验通过学习信息表示来提高这些场景下的灵敏度,但它们对驱动拒绝零假设H0的数据特征提供的见解有限。为解决这一问题,我们提出了一种用于深度双样本检验的反事实解释框架,该框架生成样本级编辑,将观测值从源组移向目标组,同时明确减少检验所测量的差异。我们的方法将扩散自编码器与预训练的深度双样本检验模型相结合,并在检验模型的表示空间中优化最大均值差异(MMD)目标,以生成合理的反事实。我们通过检验统计量的变化和由此产生的双样本p值来量化分布层面的效应。我们在合成2D形状数据集和两个MRI队列上评估了该方法。在这两种设置下,反事实变换相对于原始样本持续增加p值,表明编辑后的源集在检验下与目标分布在统计上更加接近。我们使用LPIPS测量最小化程度,以确保反事实与原始样本保持接近。由此产生的编辑提供了与检测到的组差异相关特征的可解释证据。在MRI数据上,局部变化与队列之间的已知解剖学差异一致。
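For readers unfamiliar with the maximum mean discrepancy, here is a small sketch of the biased (V-statistic) estimate of squared MMD with an RBF kernel. In the paper the objective is computed in the test model's representation space; this sketch uses raw toy vectors instead, and the `edited` points are made-up stand-ins for counterfactuals.

```python
import math

def rbf(x, y, bandwidth=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    return sum(rbf(x, y, bandwidth) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

source = [[0.0, 0.0], [0.1, -0.1], [-0.2, 0.1]]
target = [[2.0, 2.0], [2.1, 1.9], [1.8, 2.2]]
edited = [[1.9, 2.0], [2.0, 2.1], [2.1, 1.8]]  # hypothetical counterfactuals

before = mmd2(source, target)
after = mmd2(edited, target)
```

A counterfactual edit that moves the source set toward the target distribution should lower this quantity, which corresponds to the larger p-values the abstract reports.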
AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation
Chinese title: AgenticDiffusion:基于智能体扩散的视觉无人机导航路径规划
Authors: Faryal Batool, Muhammad Ahsan Mustafa, Fawad Mehboob, Valerii Serpiva, Dzmitry Tsetserukou
Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view observations. Existing vision-based navigation frameworks typically rely on single-view observations, limiting their ability to reason about occlusions, target visibility, and global scene structure. In this work, we propose AgenticDiffusion, a multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and NMPC within a unified aerial navigation pipeline. Given a natural language instruction and synchronized first-person-view (FPV) and top-view observations, the framework determines the most informative viewpoint for navigation and generates a mission plan prior to trajectory execution. The targets are localized using an open-vocabulary grounding model, after which viewpoint-specific diffusion planners generate navigation trajectories for UAV execution. Using complementary viewpoints, the proposed framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments. The framework was validated in four real-world UAV navigation scenarios involving adaptive viewpoint selection, multi-stage mission execution, long-horizon navigation, and safe landing-site selection. The experimental results demonstrated an overall mission success rate of 80% in 40 real-world trials, while the diffusion planners achieved a trajectory generation success rate of 100%.
室内无人机导航需要在有限视场观测下实现高效探索、场景理解和可靠的轨迹执行。现有的基于视觉的导航框架通常依赖于单视角观测,限制了其推理遮挡、目标可见性和全局场景结构的能力。本研究提出AgenticDiffusion,一个多视角无人机导航框架,在统一的空中导航流程中协调语言引导推理、开放词汇目标定位、基于视觉的扩散规划和NMPC。给定自然语言指令以及同步的第一人称视角(FPV)和顶视图观测,该框架确定导航的最具信息量的视角,并在轨迹执行前生成任务计划。目标使用开放词汇定位模型进行定位,随后视点特定的扩散规划器为无人机执行生成导航轨迹。所提出的框架利用互补视角减少了重复目标探索,并在杂乱的室内环境中提高了导航效率。该框架在四个真实无人机导航场景中进行了验证,包括自适应视点选择、多阶段任务执行、长程导航和安全着陆点选择。实验结果表明,在40次真实世界试验中总体任务成功率达80%,扩散规划器的轨迹生成成功率达100%。
Scaling Novel Graph Generation via Lightweight Structure-Guided Autoregressive Models
Chinese title: 基于轻量级结构引导自回归模型的可扩展新图生成
Authors: Alessio Barboni, Massimiliano Lupo Pasini, Bishal Lakha, Edoardo Serra
Generating realistic and diverse graphs is a key problem in machine learning, with applications in molecular discovery, circuit design, cybersecurity, and beyond. However, current graph generative models remain limited by scalability and novelty. Diffusion-based methods often require costly full-adjacency operations and long denoising chains, while many autoregressive and hybrid models have at least quadratic complexity. In addition, these models often imitate training graphs rather than generalize beyond them. We propose a lightweight autoregressive framework to address these issues. It uses a structure-guided topological ordering to serialize graphs into regular edge sequences, enabling near log-linear generation, and a two-phase training strategy that combines exploration-oriented augmentation with iterative refinement to reduce overfitting and promote controlled novelty. Experiments on molecular and non-molecular benchmarks show that our approach improves novelty while preserving high validity and uniqueness. The framework also supports both LSTM and Mamba-style causal sequence backbones, with large-memory accelerators enabling longer graph-sequence experiments beyond typical GPU limits.
生成真实且多样的图是机器学习中的关键问题,应用于分子发现、电路设计、网络安全等领域。然而,当前的图生成模型在可扩展性和新颖性方面仍存在局限性。扩散方法通常需要昂贵的完整邻接操作和较长的去噪链,而许多自回归和混合模型至少具有二次复杂度。此外,这些模型通常模仿训练图而非超越它们进行泛化。我们提出了一个轻量级自回归框架来解决这些问题。该框架使用结构引导的拓扑排序将图序列化为规则的边序列,实现近似对数线性的生成;并采用两阶段训练策略,结合探索性增强和迭代细化来减少过拟合并促进受控的新颖性。在分子和非分子基准上的实验表明,我们的方法在保持高有效性和唯一性的同时提高了新颖性。该框架还支持LSTM和Mamba风格的因果序列骨干网络,大内存加速器能够支持超出典型GPU限制的更长图序列实验。
ParetoPilot: Zero-Surrogate Offline Multi-Objective Optimization via Infer-Perturb-Guide Diffusion
Chinese title: ParetoPilot:基于推断-扰动-引导扩散的零代理离线多目标优化
Authors: Ruiqing Sun, Sen Yang, Dawei Feng, Bo Ding, Yijie Wang, Huaimin Wang
Offline multi-objective optimization (Offline MOO) aims to discover novel Pareto-optimal designs based on static datasets without expensive environment interactions. While recent generative methods have achieved notable success, they predominantly rely on external surrogate models. This dependency introduces significant computational overhead, suffers from deceptive evaluations, and deviates from the prevailing paradigm of jointly training mainstream generative models with conditions. To address these bottlenecks, we propose ParetoPilot, a novel zero-surrogate diffusion framework for offline MOO. ParetoPilot fully leverages the conditional priors inherently embedded within pre-trained diffusion models. At its core, the framework introduces the Infer-Perturb-Guide (IPG) engine, which is seamlessly interleaved within the unconditional denoising steps of the reverse generation process. First, it implicitly infers the instantaneous objective direction by matching conditional and unconditional noise predictions. Next, it mathematically orthogonalizes a parallel gravity field for strict convergence and an edgeness-aware repulsive force for mutual diversity, creating a dynamically annealed perturbation vector. Finally, this perturbed target seamlessly steers the generation process via standard Classifier-Free Guidance (CFG). Extensive experiments across 51 tasks demonstrate that ParetoPilot outperforms 14 state-of-the-art surrogate-based and inverse generative baselines. By eliminating auxiliary proxy training, our approach preserves data privacy while achieving hypervolume improvement and robust Pareto front coverage.
离线多目标优化(Offline MOO)旨在基于静态数据集发现新颖的帕累托最优设计,无需昂贵的环境交互。尽管当前生成方法取得了显著成功,但主要依赖于外部代理模型。这种依赖带来了显著的额外计算开销,遭受欺骗性评估的困扰,并且偏离了与条件联合训练主流生成模型的现行范式。为解决这些瓶颈,我们提出了ParetoPilot,这是一个用于离线多目标优化的新型零代理扩散框架。ParetoPilot充分利用预训练扩散模型中固有的条件先验。其核心是推断-扰动-引导(IPG)引擎,该引擎无缝地交错插入反向生成过程的去噪步骤中。首先,它通过匹配条件和无条件噪声预测来隐式推断瞬时目标方向。其次,它在数学上正交化一个用于严格收敛的并行引力场和一个用于相互多样性的边缘感知斥力,从而创建一个动态退火的扰动向量。最后,这个扰动目标通过标准无分类器引导(CFG)无缝地引导生成过程。在51个任务上的广泛实验表明,ParetoPilot优于14个基于代理模型和逆生成的最新基线方法。通过消除辅助代理训练,我们的方法在实现超体积改进和稳健的帕累托前沿覆盖的同时,还保护了数据隐私。
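Two pieces of the pipeline in the abstract rest on standard classifier-free guidance arithmetic: the difference between conditional and unconditional noise predictions gives a direction, and CFG extrapolates along it. The sketch below shows only that arithmetic on toy vectors. The perturbation step (gravity field and repulsive force) is not reproduced, and the function names are hypothetical.

```python
def guidance_direction(eps_cond, eps_uncond):
    """Difference between conditional and unconditional noise predictions."""
    return [c - u for c, u in zip(eps_cond, eps_uncond)]

def cfg_combine(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    direction = guidance_direction(eps_cond, eps_uncond)
    return [u + scale * d for u, d in zip(eps_uncond, direction)]

eps_uncond = [0.2, -0.1, 0.0]
eps_cond = [0.5, 0.1, -0.3]
guided = cfg_combine(eps_cond, eps_uncond, scale=2.0)
```

A scale of 0 recovers the unconditional prediction and a scale of 1 recovers the conditional one. Values above 1 push further along the inferred direction.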
DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance
Chinese title: DiverAge:基于跨年龄身份关系引导的可靠多元化人脸老化方法
Authors: Yueying Zou, Peipei Li, Qianrui Teng, Dianyan Xu, Zekun Li
Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.
人脸老化在长期生物特征分析、跨年龄身份验证和法医身份分析中具有重要作用。由于同一主体在目标年龄可能因遗传、环境和生活方式因素呈现多种合理外观,人脸老化本质上是一个一对多的生成问题。然而,仅有多样性不足以实现可靠的人脸老化:模型应在每个年龄组内提供外观层面的候选多样性,同时在有序年龄组之间保持序列层面的序数可靠性。现有的确定性老化方法能够合成视觉上可信的年龄进展人脸,但通常缺乏随机多样性。相比之下,多元化老化方法引入了局部外观变化,但往往未能明确规范完整老化序列的身份演化。本文提出DiverAge,一个基于扩散自编码的层级多元化人脸老化框架。DiverAge通过随机扩散解码和年龄条件语义调制来保留外观层面的多样性。为提升序列层面的可靠性,本文引入跨年龄身份关系调节器(CARR),这是一种在推理时对多个目标年龄组进行联合去噪的引导策略。CARR由从真实同身份跨年龄对估计的跨年龄身份相似度(CIS)先验引导,并通过单侧的采样时引导抑制过度的跨年龄身份漂移,无需修改训练目标或引入额外可训练参数。实验表明,DiverAge在维持身份保持度、年龄准确性、图像质量和外观层面多样性的同时,提升了序列层面的序数可靠性。
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
中文标题:扩散Transformer中基于上下文空间的即时排斥实现丰富多样性
作者:Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or
Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.
现代文本到图像(T2I)扩散模型已实现卓越的语义对齐,然而它们常常面临多样性显著不足的问题,对于任何给定提示都收敛于有限的视觉解决方案集合。这种典型性偏差对需要广泛生成结果的创意应用构成了挑战。我们识别出现有多样性方法中的一个根本权衡:修改模型输入需要昂贵的优化过程来整合生成路径的反馈。相比之下,对空间上已定型的中间潜在变量进行操作往往会破坏正在形成的视觉结构,导致伪影。本工作提出在上下文空间中应用排斥作为一种新框架,以在扩散Transformer中实现丰富多样性。通过干预多模态注意力通道,我们在Transformer的前向传播过程中应用即时排斥,在文本条件已被新涌现的图像结构所丰富的模块之间注入干预。这使得我们能够在引导轨迹获得结构信息之后、但在构图固定之前对其进行重定向。我们的结果表明,上下文空间中的排斥能够产生显著更丰富的多样性,同时不牺牲视觉保真度或语义一致性。此外,我们的方法具有独特的效率优势,计算开销很小,即使在传统的基于轨迹的干预通常失效的现代Turbo模型和蒸馏模型中也能保持有效性。
Reflection Separation from a Single Image via Joint Latent Diffusion
中文标题:基于联合潜在扩散的单图像反射分离
作者:Zheng-Hui Huang, Zhixiang Wang, Yu-Lun Liu, Yung-Yu Chuang
Single-image reflection separation is highly challenging under extreme conditions like glare or weak reflections. Existing methods often struggle to recover both layers in glare or weak-reflection scenarios because of insufficient information. This paper presents a diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. Our method simultaneously generates transmission and reflection layers through a unified diffusion model, incorporating a novel cross-layer self-attention mechanism for better feature disentanglement. We further introduce a disjoint sampling strategy to iteratively reduce interference between the layers during diffusion and a latent optimization step with a learned composition function for improved results in complex real-world scenarios. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods on multiple real-world benchmarks. Project page: https://brian90709.github.io/diff-reflection-separation/
单图像反射分离在眩光或弱反射等极端条件下极具挑战性。现有方法由于信息不足,在处理眩光或弱反射场景时往往难以准确恢复两个图层。本文提出了一种针对该任务专门微调的扩散模型,充分利用生成式扩散先验实现稳健的分离效果。我们的方法通过统一的扩散模型同时生成透射层和反射层,并引入了一种新颖的跨层自注意力机制以实现更好的特征解耦。我们进一步提出了一种分离采样策略,在扩散过程中迭代减少图层间的干扰,并采用一种学习到的组合函数进行潜在优化,从而在复杂的真实场景中获得更好的结果。大量实验表明,我们的方法在多个真实世界基准数据集上超越了当前最先进的方法。项目页面:https://brian90709.github.io/diff-reflection-separation/
Efficient and Training-Free Single-Image Diffusion Models
作者:Haojun Qiu, Kiriakos N. Kutulakos, David B. Lindell
We consider the problem of generating images whose internal structure -- defined by the distribution of patches across multiple scales -- matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
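摘要中提到,当数据集有限时,带噪补丁的得分函数可以用闭式最优去噪器直接算出。下面是一个极简示意,并非论文的实现:假设噪声模型为 y = x + sigma * eps(方差爆炸式参数化,属于本示例的假设),函数名 optimal_denoise 与 score 以及示例数据均为说明用途而虚构。此时后验均值就是各补丁按 softmax(-||y - x_i||^2 / (2 sigma^2)) 加权的平均。

```python
import math


def optimal_denoise(noisy, patches, sigma):
    """对有限补丁集合的闭式最优去噪:返回后验均值 E[x | noisy]。"""
    # 每个补丁的对数权重:-||noisy - x_i||^2 / (2 sigma^2)
    logits = [
        -sum((n - p) ** 2 for n, p in zip(noisy, patch)) / (2.0 * sigma ** 2)
        for patch in patches
    ]
    m = max(logits)  # 减去最大值,保证 softmax 数值稳定
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [
        sum(w * patch[d] for w, patch in zip(weights, patches)) / total
        for d in range(len(noisy))
    ]


def score(noisy, patches, sigma):
    """由去噪结果得到得分函数:(E[x | noisy] - noisy) / sigma^2。"""
    denoised = optimal_denoise(noisy, patches, sigma)
    return [(d - n) / sigma ** 2 for d, n in zip(denoised, noisy)]


# 假设的二维“补丁”数据集,仅用于演示
patches = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
low_noise = optimal_denoise([0.9, 1.1], patches, sigma=0.1)
high_noise = optimal_denoise([0.9, 1.1], patches, sigma=100.0)
```

噪声很小时,结果几乎等于最近的补丁;噪声很大时,结果趋近于全体补丁的均值。整个过程不需要训练任何网络,这与摘要描述的思路一致。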
DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation
中文标题:DSA:用于快速自回归视频生成的动态步长分配
作者:Thanh-Tung Le, Yunhan Zhao, Menglei Chai, Zhengyang Shen, Zhe Cao, Danhang Tang, Xiaohui Xie, Deying Kong
Video diffusion transformers have achieved state-of-the-art visual quality, but their high inference cost remains a major bottleneck for real-time applications. Recent distillation frameworks produce autoregressive video diffusion models with reduced latency, yet these models still use a fixed number of denoising steps per frame, wasting computation on predictable frames and under-refining challenging ones. We present DSA, a confidence-guided adaptive computation framework for AR video diffusion. DSA introduces a lightweight confidence head, trained jointly with the generator under a distribution-matching distillation objective, to estimate per-frame denoising reliability. At inference, this confidence signal dynamically adjusts the number of diffusion steps: simple frames terminate early for speed, while complex frames receive additional refinement. Our method requires no extra video data, no heuristics, and little architectural modification. Experiments show that DSA achieves real-time autoregressive video generation, reaching 22.63 FPS with sub-second latency on H100 GPUs, while maintaining competitive or superior VBench quality compared to recent autoregressive and bidirectional video diffusion models. Our results demonstrate that confidence-guided adaptive sampling provides an effective and practical path toward interactive video generation.
视频扩散Transformer已实现最先进的视觉质量,但其高推理成本仍是实时应用的主要瓶颈。近期蒸馏框架能够生成降低延迟的自回归视频扩散模型,但这些模型每帧仍使用固定数量的去噪步骤,在可预测的帧上浪费计算资源,而对具有挑战性的帧细化不足。我们提出了DSA,一个用于自回归视频扩散的置信度引导自适应计算框架。DSA引入了一个轻量级置信度头,在分布匹配蒸馏目标下与生成器联合训练,用于估计每帧的去噪可靠性。在推理时,该置信度信号动态调整扩散步数:简单帧提前终止以提速,而复杂帧获得额外细化。我们的方法不需要额外视频数据、不需要启发式方法,且仅需少量架构修改。实验表明,DSA实现了实时自回归视频生成,在H100 GPU上达到22.63 FPS且延迟低于1秒,同时在VBench质量上保持与近期自回归和双向视频扩散模型相当或更优的水平。我们的结果表明,置信度引导的自适应采样为交互式视频生成提供了一条有效且实用的路径。
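“按置信度动态分配去噪步数”的思路可以用下面的玩具代码示意。这只是基于摘要的草图,不是 DSA 的实现:论文中的置信度头是与生成器联合训练的网络,这里用手写函数代替;generate_frame、阈值 0.9、步数上限 4 以及“每步把残余噪声减半”的去噪器都是为演示而做的假设。

```python
def generate_frame(denoise_step, confidence, latent,
                   min_steps=1, max_steps=4, threshold=0.9):
    """逐步去噪;置信度达到阈值即提前终止,否则用满 max_steps。"""
    steps_used = 0
    for _ in range(max_steps):
        latent = denoise_step(latent)
        steps_used += 1
        if steps_used >= min_steps and confidence(latent) >= threshold:
            break
    return latent, steps_used


# 玩具设定:latent 是一个标量,表示“残余噪声量”
def toy_denoise(latent):
    return latent * 0.5  # 每一步把残余噪声减半


def toy_confidence(latent):
    return max(0.0, 1.0 - latent)  # 残余噪声越少,置信度越高


easy_frame, hard_frame = 0.15, 1.0  # 简单帧起始噪声小,困难帧起始噪声大
steps = [generate_frame(toy_denoise, toy_confidence, x)[1]
         for x in (easy_frame, hard_frame)]
```

在这个设定下,简单帧只用 1 步就结束,困难帧用满 4 步,对应摘要中“简单帧提前终止、复杂帧获得额外细化”的行为。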
Imagine Before You Draw: Visual Prompt Engineering for Image Generation
中文标题:先想后画:图像生成的视觉提示工程
作者:Liyu Jia, Fengda Zhang, Jiachun Pan, Kesen Zhao, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao, Aojun Zhou, Hanwang Zhang
Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
将视觉语义表示作为图像生成之前的中间步骤可以减少文本与图像之间的建模难度,从而提高生成质量。最近的研究如X-Omni和BLIP3o-Next探索了这一方向,但它们通常采用两阶段外部流程:独立的自回归模型首先生成语义token,然后将其作为条件输入到独立的扩散解码器。由于解码器无法同时访问原始输入和语义规划,这种设计引入了信息瓶颈,限制了编辑等下游任务中的细节保留。Transfusion、BAGEL和Show-o2等内部架构通过在单一模型中实现跨模态交互来避免这一瓶颈,但由于缺乏中间语义指导,它们仍然面临困难的文本到像素建模差距。我们提出视觉提示工程(VPE),可无缝集成到此类内部框架中。具体而言,模型首先自回归生成捕捉语义布局的视觉语义token(如SigLIP 2)作为“视觉提示”,然后在此规划条件下生成完整图像token。我们在类别条件生成、文本到图像生成和图像编辑等不同任务中验证了VPE,涵盖了多种token类型和模型架构。结果表明,VPE能够加速收敛、提升质量上限,并且借助内部集成,在编辑时的内容保留上显著优于同等参数规模的外部方案(PSNR:26.76 vs. 19.92),同时保持具有竞争力的编辑响应能力。
MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer
中文标题:MeshFlow:基于MeshVAE与流式(Flow-based)扩散Transformer的高效艺术网格生成
作者:Weiyu Li, Antoine Toisoul, Tom Monnier, Roman Shapovalov, Rakesh Ranjan, Ping Tan, Andrea Vedaldi
We present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow
我们提出MeshFlow,这是一种生成类艺术家风格3D网格的新方法。当前的网格生成器通常采用自回归(AR)下一个token预测方法,考虑到网格拓扑的离散性质,这是一种自然的选择。然而,自回归方法的推理成本随网格大小呈二次方增长,导致可扩展性较差。此外,这些方法需要对顶点坐标进行离散化处理,从而引入量化误差。为解决这些挑战,我们引入了一种变分自编码器(VAE),通过对比损失进行监督,将连续顶点位置和离散连通性表示在一个连续的潜在空间中。该潜在空间比先前基于token的网格表示紧凑得多。随后,我们基于修正流(Rectified Flow)Transformer构建了一个3D生成器,并行生成所有网格顶点和边。我们的模型在网格生成速度上比最快的自回归生成器快18倍,同时在标准网格生成指标上也达到了优异的准确度。主页:https://mesh-flow.github.io/ 代码:https://github.com/facebookresearch/meshflow
Crafting Your Evolving Dreams: Concept-Incremental Versatile Customization
中文标题:打造你演进中的梦想:概念增量式通用定制
作者:Jiahua Dong, Wenqi Liang, Hongliu Li, Yang Cong, Duzhen Zhang, Hanbin Zhao, Henghui Ding, Yulun Zhang, Salman Khan, Fahad Shahbaz Khan
Custom diffusion models (CDMs) have garnered significant interest owing to their remarkable capacity for generating personalized concepts. However, the majority of CDMs unrealistically presume that the user's collection of personalized concepts is static and incapable of incremental growth over time. Furthermore, they exhibit significant catastrophic forgetting and concept neglect of previously learned concepts when incrementally learning a sequence of new ones. To resolve the above challenges, we develop a novel Continually Customizable Diffusion Model (CCDM), enabling users to perform concept-incremental versatile customization. Specifically, we design an attribute-decoupled LoRA (AD-LoRA) module and a relevance-guided AD-LoRA aggregation strategy to mitigate catastrophic forgetting. They can preserve concept-specific attributes of each task and leverage beneficial inter-task correlations to enhance the continual learning of new customization tasks. Additionally, to address the challenge of concept neglect, we propose a controllable regional context synthesis strategy that performs multi-concept composition in alignment with user-provided conditions. This strategy enhances the overall consistency in multi-concept synthesis by guaranteeing semantic independence between user-defined regions and their smooth boundary transitions. Experiments show our CCDM exhibits significant improvements over baseline methods.
定制扩散模型(Custom Diffusion Models, CDMs)因其生成个性化概念的卓越能力而受到广泛关注。然而,大多数CDMs不切实际地假设用户的个性化概念集合是静态的,无法随时间增量增长。此外,当增量学习一系列新概念时,它们会出现严重的灾难性遗忘问题,并忽视先前学习过的概念。为解决上述挑战,我们开发了一种新型的持续可定制扩散模型(Continually Customizable Diffusion Model, CCDM),使用户能够执行概念增量式多功能定制。具体而言,我们设计了属性解耦LoRA(Attribute-Decoupled LoRA, AD-LoRA)模块和相关性引导的AD-LoRA聚合策略,以缓解灾难性遗忘。它们能够保留每个任务的概念特定属性,并利用有益的任务间相关性来增强新定制任务的持续学习。此外,为解决概念忽视问题,我们提出了一种可控区域上下文合成策略,该策略根据用户提供的条件执行多概念组合。该策略通过保证用户定义区域之间的语义独立性及其平滑的边界过渡,增强了多概念合成的整体一致性。实验表明,我们的CCDM相比基线方法表现出显著的改进。
IRIS-GAN: Staged Specialist Detection of Deepfake Faces
中文标题:IRIS-GAN:深度伪造人脸的分阶段专家检测
作者:Jaume M. Trenchs, Veronica Sanz
We introduce IRIS-GAN, a specialist forensic detector for synthetic face images under cross-generator shift. Rather than addressing universal synthetic-image detection, we focus on faces generated by generative adversarial networks (GANs), which are state-of-the-art in deepfake content, and train the detector through staged exposure to increasingly demanding GAN families while retaining earlier generators. The final model reaches fake-detection rates above 99% across the GAN families considered and classifies an external real-face dataset with 98.9% accuracy. Grad-CAM analysis further reveals measurable generator-dependent spatial response patterns, which remain informative for a secondary heatmap-only classifier. Out-of-family tests on diffusion-generated faces confirm that IRIS-GAN is a specialist detector, with some capability to reach non-GAN deepfakes. These results establish staged training as an effective strategy for robust GAN-face forensics.
InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
中文标题:InstantRetouch:基于双边空间的高效高保真指令引导图像修饰
作者:Jiarui Wu, Yujin Wang, Ruikang Li, Fan Zhang, Mingde Yao, Tianfan Xue
Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recently, diffusion-based retouching shows a superior visual quality, but often struggles with both fidelity issues due to its generative nature and efficiency because of its iterative sampling process. In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which are sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency. To retain strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency. Compared to the latest retouch methods, like Gemini-2.5-Flash (Nano-Banana), our method can avoid content drift, significantly improve latency, and generate visually pleasing edits, while maintaining a high level of fidelity. Project page: https://openimaginglab.github.io/InstantRetouch/.
语言引导的照片修饰旨在调整颜色和色调,同时保留几何形状和纹理。近年来,基于扩散模型的修饰方法展现出优异的视觉质量,但由于其生成性质导致的保真度问题以及迭代采样过程带来的效率问题,往往表现不佳。本工作提出了一种高效且保持保真度的修饰方法,该方法采用双边空间操作,既紧凑又实现内容解耦。具体而言,我们的模型不直接编辑像素或图像潜在表示,而是预测一个低分辨率的仿射变换双边网格,然后使用学习得到的引导图对其进行切片,并应用于全分辨率图像。这种方法实现了高保真度和改进的效率。为了保留预训练生成模型的强先验知识,我们使用变分分数蒸馏将多步扩散模型蒸馏到双边网格框架中,并辅以提示对齐损失来指导指令跟随行为。此外,我们引入了一个新的基准测试集,从多个维度评估我们的方法:保真度、指令跟随和效率。与最新的修饰方法(如 Gemini-2.5-Flash (Nano-Banana))相比,我们的方法可以避免内容漂移,显著降低延迟,并生成视觉愉悦的编辑效果,同时保持高水平的保真度。项目页面:https://openimaginglab.github.io/InstantRetouch/。
PointAction: 3D Points as Universal Action Representations for Robot Control
中文标题:PointAction:以3D点作为机器人控制的通用动作表示
作者:Mutian Tong, Han Jiang, Qiao Feng, Lingjie Liu, Jiatao Gu
Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.
视频动作模型(Video-Action Models, VAMs)利用预训练视频扩散模型所捕获的广泛视觉动力学,为通用化机器人操作提供了有前景的研究路径。然而,纯RGB视频rollout无法直接转化为可执行动作:其中的度量3D运动、接触几何和细粒度空间约束都没有被明确指定,导致动作锚定存在歧义。同时,在不同任务和具身形态间扩展动作监督的成本仍然很高。本研究提出PointAction框架,通过基于点的显式4D建模将视频预测与机器人动作连接起来。PointAction对基础视频生成模型进行微调,以联合预测未来RGB帧和动态3D点图,生成任务相关场景几何的时间一致3D运动。这些点动力学作为结构化、具身无关的动作接口,由基于扩散的动作解码器将其映射为可执行的机器人动作。通过使用度量3D点动力学作为视频预测与控制之间的接口,PointAction降低了纯RGB动作锚定的歧义性,并支持在有限动作监督下跨任务和具身形态进行迁移。实验表明,PointAction在机器人场景中实现了最先进的4D生成质量,在模拟环境中优于现有基线,并能够泛化到两种预训练中未见过的真实机械臂。
TGSD: Topology-Guided State-Space Diffusion for EEG Spatial Super-Resolution
中文标题:TGSD:用于脑电图空间超分辨率的拓扑引导状态空间扩散
作者:Zijian Kang, Weiming Zeng, Yueyang Li, Shengyu Gong, Hongjie Yan, Wai Ting Siok, Nizhuan Wang
Low-density EEG is more suitable for wearable and IoT-based brain sensing, but sparse electrode sampling often lacks sufficient spatial information to characterize cross-regional neural activity. EEG spatial super-resolution aims to recover dense-channel EEG from sparse recordings, yet remains challenging because channel missingness typically occurs at the whole-channel level, spatiotemporal dependencies over the full electrode layout are often underexplored, and the mapping from sparse to dense signals is inherently ambiguous. To address these issues, we propose TGSD, a topology-guided state-space diffusion framework for EEG spatial super-resolution. TGSD first employs a Hierarchical Spatial Prior Encoder to learn topology-aware priors over the complete electrode layout by integrating local geometric relationships with region-level contextual information. Based on these priors and sparse observations, a Conditional State-Space Diffusion Reconstructor progressively generates missing-channel signals through reverse diffusion, while alternating temporal and channel-wise state-space modeling captures long-range temporal dynamics and inter-channel dependencies in a unified framework. Experiments on the SEED and PhysioNet MM/I datasets show that TGSD consistently outperforms representative baselines under different super-resolution factors in both reconstruction fidelity and downstream classification performance. These results demonstrate the effectiveness of combining topology-aware spatial priors with conditional diffusion for enhancing practical low-density EEG sensing in wearable and IoT scenarios. The official implementation code is available at https://github.com/jtggz/TGSD.
低密度脑电图更适合可穿戴和基于物联网的脑感知应用,但稀疏电极采样往往缺乏足够的空间信息来表征跨区域神经活动。脑电图空间超分辨率旨在从稀疏记录中恢复密集通道脑电图,然而仍面临诸多挑战:通道缺失通常发生在全通道层面,全电极布局的时空依赖性往往未被充分探索,且从稀疏信号到密集信号的映射本身具有歧义性。为解决这些问题,我们提出了TGSD,一种用于脑电图空间超分辨率的拓扑引导状态空间扩散框架。TGSD首先采用分层空间先验编码器,通过整合局部几何关系与区域级上下文信息,在完整电极布局上学习拓扑感知先验。基于这些先验和稀疏观测,条件状态空间扩散重建器通过逆向扩散逐步生成缺失通道信号,同时交替进行时间和通道级状态空间建模,在统一框架下捕捉长程时间动态和通道间依赖性。在SEED和PhysioNet MM/I数据集上的实验表明,TGSD在不同超分辨率因子下,在重建保真度和下游分类性能方面均一致优于代表性基线方法。这些结果证明了将拓扑感知空间先验与条件扩散相结合,对增强可穿戴和物联网场景中实用低密度脑电图感知的有效性。官方实现代码可访问 https://github.com/jtggz/TGSD。
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
中文标题:Echo-Infinity:面向实时无限视频生成的演进记忆学习
作者:Yuxuan Bian, Zeyue Xue, Songchun Zhang, Shiyi Zhang, Weiyang Jin, Yaowei Li, Junhao Zhuang, Haoran Li, Jie Huang, Haoyang Huang, Nan Duan, Qiang Xu
We present Echo Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors due to their limited cache window and ignorance of autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Query, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance, and, to our knowledge, demonstrates promising 24-hour (>1.3 M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.
我们提出 Echo-Infinity,一个面向实时无限视频生成的自回归(AR)框架,该框架采用可学习的演进记忆来以恒定成本动态过滤、抽象和压缩任意长度的历史信息。现有方法主要通过预设的 KV-cache 调度策略、固定比例的启发式压缩或推理时的 RoPE 适配来管理记忆。这些设计不可避免地会丢失历史信息,并因有限的缓存窗口和对自回归生成噪声的忽视而放大累积误差。受到人类记忆整合的启发,Echo-Infinity 用可学习的记忆查询(Memory Query)取代了人工设计的记忆管理机制:当过去的帧被移出局部窗口时,这些查询通过注意力和门控机制进行更新。这些查询与视频扩散Transformer(DiT)进行端到端优化,形成了一种支持任意压缩比且计算量恒定、与视频长度无关的演进记忆。它们还充当可泛化的生成先验,即使仅使用优化后的初始状态也能提升质量。我们进一步提出统一相对 RoPE 方案(Unified Relative RoPE Recipe),将汇帧(sink frames)的 id 锚定为从 0 开始,并让最新的帧 id 在训练和推理过程中最多增长到 DiT 预训练的最大时间 RoPE id,使模型摆脱有限 RoPE 的约束,并弥合训练-测试 RoPE 外推差距。在长视频和短视频生成任务上,Echo-Infinity 达到了最先进的性能,据我们所知,它首次展示了有前景的 24 小时(超过 130 万帧)实时生成,为无限视频生成提供了一条可行路径。
ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Discrete Diffusion Models
中文标题:ViewMask-1-to-3:基于多模态离散扩散模型的多视角一致图像生成
作者:Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li
Motivated by discrete diffusion's success in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, formulating multi-view generation as a discrete sequence modeling problem where each viewpoint is represented as visual tokens from MAGVIT-v2. Through discrete diffusion via masked token prediction, our approach enables progressive multi-view generation via iterative token unmasking, unifying language and vision in a shared token space. Importantly, simple random masking combined with self-attention naturally encourages cross-view consistency without specialized architectures or 3D geometric priors. Our method outperforms the baseline on the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics, and achieving a 10.6% higher IoU than continuous diffusion models on 3D-FUTURE. Furthermore, the proposed framework can be naturally extended to support text-to-image generation and multimodal understanding, highlighting its potential toward a more unified paradigm for multimodal understanding and generation.
受离散扩散在语言-视觉建模中成功的启发,我们探索其在多视角生成任务中的潜力,而该任务目前主要由连续方法主导。我们提出了ViewMask-1-to-3,将多视角生成表述为离散序列建模问题,其中每个视角由MAGVIT-v2提取的视觉token表示。通过基于masked token预测的离散扩散,我们的方案能够通过迭代token去掩码实现渐进式多视角生成,并在共享token空间中统一语言和视觉。重要的是,简单的随机掩码结合自注意力机制自然地促进了跨视角一致性,无需专门的架构或3D几何先验。我们的方法在GSO和3D-FUTURE基准数据集上超越了基线模型,在标准图像指标上平均排名第一,并在3D-FUTURE上比连续扩散模型实现了10.6%的IoU提升。此外,所提出的框架可以自然扩展以支持文本到图像生成和多模态理解,突显其在实现多模态理解和生成更统一范式方面的潜力。
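“通过迭代 token 去掩码逐步生成”的推理过程大致可以这样示意。摘要没有给出具体的去掩码调度,因此下面采用一种常见做法作为假设:每一步保留置信度最高的若干预测,其余位置继续保持掩码。iterative_unmask、toy_predict、MASK 取值和线性调度均为虚构的演示设定,预测器也只是返回随机置信度的占位函数,并非真实模型。

```python
import math
import random

MASK = -1  # 假设的掩码 token 编号


def iterative_unmask(predict, length, num_steps):
    """从全掩码序列出发,每步揭开置信度最高的一批位置。"""
    tokens = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predict(tokens)  # {位置: (预测 token, 置信度)}
        # 线性调度:把剩余掩码位置平均分给剩余步数
        keep = math.ceil(len(masked) / (num_steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:keep]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


_rng = random.Random(0)


def toy_predict(tokens):
    """占位预测器:位置 i 的 token 恒为 i % 7,置信度随机。"""
    return {i: (i % 7, _rng.random())
            for i, t in enumerate(tokens) if t == MASK}


result = iterative_unmask(toy_predict, length=12, num_steps=4)
```

真实系统中,predict 会是一个对所有视角的 token 做自注意力的网络,已揭开的 token 作为上下文约束后续预测。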
Shifting the Breaking Point of Flow Matching for Multi-Instance Editing
中文标题:突破流匹配多实例编辑的极限
作者:Carmine Zaccagnino, Fabio Quattrini, Enis Simsar, Marta Tintoré Gazulla, Rita Cucchiara, Alessio Tonioni, Silvia Cascianelli
Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.
流匹配模型作为扩散模型的高效替代方案近期兴起,尤其在文本引导的图像生成与编辑任务中表现出色,通过连续时间动力学实现更快的推理速度。然而,现有的基于流的编辑方法主要支持全局或单一指令编辑,难以处理多实例场景,即需要对参考输入的多个部分进行独立编辑而避免语义干扰。我们将这一局限归因于全局条件速度场和联合注意力机制,它们会使并发的编辑相互纠缠。为解决这一问题,我们提出了实例解耦注意力机制,该机制对联合注意力操作进行分区,在速度场估计过程中强制实现实例特定文本指令与空间区域的绑定。我们在自然图像编辑以及一个新引入的、带区域级编辑指令的文本密集型信息图基准上评估了我们的方法。实验结果表明,我们的方法在保持全局输出一致性的同时促进了编辑解耦和局部性,实现了单次前向即可完成的实例级编辑。
Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges
中文标题:基于潜在一致性桥接的跨冻结扩散模型可迁移多比特水印
作者:Hong-Hanh Nguyen-Le, Van-Tuan Tran, Thuc D. Nguyen, Nhien-An Le-Khac
As generative AI advances, global governance frameworks increasingly mandate verifiable content provenance. However, existing watermarking techniques face a critical policy-to-technology disconnect: sampling-based methods require computationally prohibitive inversion, while fine-tuning approaches are tethered to specific model checkpoints, hindering standardized, cross-model oversight. To bridge this gap, we introduce DiffMark, a plug-and-play multi-bit watermarking framework. DiffMark embeds a persistent, learned perturbation into every denoising step of a frozen diffusion model, accumulating a recoverable signal in the final latent space. To enable efficient training through the frozen network, we utilize Latent Consistency Models (LCMs) as a differentiable training bridge. DiffMark achieves 64-bit extraction in a single 16.4 ms forward pass, which is a 45× speed-up over inversion baselines. By enabling per-image key flexibility and cross-architecture transferability without retraining, DiffMark provides the practical, scalable technical tooling necessary to operationalize user accountability and enforce emerging AI governance mandates.
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
中文标题:HyperDiT:用于高保真像素空间扩散的超连接Transformer
作者:Yu He, Lichen Ma, Zipeng Guo, Xinyuan Shan, Jingling Fu, Dong Chen, Junshi Huang, Yan Li
Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework establishing Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifold. Diverging from injecting semantics by AdaLN, HyperDiT utilizes Cross-Attention mechanisms, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE) to ensure precise geometric alignment among tokens of varying patch sizes. Furthermore, we incorporate Registers to learn the dense semantics from a pretrained Visual Foundation Model (VFM), effectively reducing generation hallucination and artifacts. Extensive experiments demonstrate that HyperDiT achieves state-of-the-art (SoTA) FID of 1.56 on ImageNet 256×256 directly within the pixel space. By combining the fine-grained stream with semantic guidance, HyperDiT offers a superior paradigm for high-fidelity pixel generation.
像素空间扩散模型绕过了变分自编码器(VAE)的重建瓶颈,但面临一个基本的“粒度困境”:捕获全局语义需要较大的patch尺度,而生成高保真细节则需要细粒度输入。为解决这一问题,我们提出了HyperDiT,这是一个统一框架,用于建立超连接跨尺度交互,以弥合语义流形和像素流形。与通过AdaLN注入语义不同,HyperDiT利用交叉注意力机制,使细粒度token能够全局查询多层语义锚点。为解决多尺度交互中的空间错位问题,我们引入了尺度感知旋转位置嵌入(SA-RoPE),以确保不同patch大小的token之间的精确几何对齐。此外,我们结合寄存器(Registers)从预训练的视觉基础模型(VFM)学习密集语义,有效减少生成幻觉和伪影。大量实验表明,HyperDiT在ImageNet 256×256上直接于像素空间内实现了最先进(SoTA)的FID分数1.56。通过将细粒度流与语义指导相结合,HyperDiT为高保真像素生成提供了更优的范式。
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
中文标题:Flash-GRPO:基于单步策略优化的高效视频扩散模型对齐方法
作者:Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He, Yuming Li, Dacheng Yin, Shuai Dong, Haoyang Huang, Hongfa Wang, Nan Duan, Bohan Zhuang
Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B parametered model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs through sliding window subsampling training timesteps, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
组相对策略优化(GRPO)已成为视频扩散模型与人类偏好对齐的关键技术,但面临严峻的计算瓶颈:训练一个140亿参数的模型通常每个实验需要数百GPU天。现有的效率方法通过滑动窗口采样训练时间步来降低成本,但从根本上损害了优化过程,表现出严重的不稳定性,无法达到完整轨迹训练的性能。我们提出了Flash-GRPO,这是一种单步训练框架,在低计算预算下其对齐质量优于完整轨迹训练,同时显著提升了训练效率。Flash-GRPO解决了两大关键挑战:等时间分组通过强制提示词级别的时间一致性消除了时间步混淆的方差,将策略性能与时间步难度解耦;时间梯度校正中和了导致不同时间步梯度幅度严重不一致的时间依赖缩放因子。在1.3B至14B参数模型上的实验验证了Flash-GRPO的有效性,展现出显著的训练加速、一致的稳定性以及最优的对齐质量。
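As a rough illustration of the two ingredients named in the Flash-GRPO abstract, the sketch below computes the usual group-relative advantage (reward standardized within a group) and shows one plausible reading of iso-temporal grouping: every sample generated for the same prompt is trained at the same randomly drawn timestep. The function names, the uniform timestep draw, and the toy rewards are assumptions for illustration, not the paper's implementation.

```python
import random
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def sample_iso_temporal_timesteps(prompts, group_size, num_timesteps, rng):
    """Draw one timestep per prompt and share it across that prompt's group.

    Sharing the timestep means reward differences inside a group are not
    mixed with differences in timestep difficulty (assumed interpretation).
    """
    plan = {}
    for prompt in prompts:
        t = rng.randrange(num_timesteps)
        plan[prompt] = [t] * group_size
    return plan

rng = random.Random(0)
plan = sample_iso_temporal_timesteps(
    ["a cat surfing", "a city at night"], group_size=4, num_timesteps=1000, rng=rng
)
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

The temporal gradient rectification step is not sketched here because the abstract does not give the form of the time-dependent scaling factor.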
Representation Forcing for Bottleneck-Free Unified Multimodal Models
Chinese title: 用于无瓶颈统一多模态模型的表征强制方法
Authors: Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu
Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.
统一多模态模型(UMMs)旨在在单一模型中处理感知和生成任务。然而,现有的UMMs仍然依赖于冻结的、单独预训练的VAE进行图像生成,这造成了结构瓶颈。简单地移除它会引入质量差距,因为模型必须从原始像素同时学习高级结构和低级细节。在本文中,我们提出了表征强制(Representation Forcing,RF)技术,这是一种通过使表征预测成为模型的原生能力来弥合这一差距的方法。具体而言,RF强制解码器在像素之前自回归地预测视觉表征作为中间tokens;这些tokens保持在上下文中,以在同一骨干网络内引导像素扩散。通过将表征从感知输出转变为生成目标,RF消除了对任何外部生成潜空间的需求。我们发现RF同时有益于理解和生成任务。在图像生成方面,采用RF的像素空间模型达到了最先进的基于VAE的统一模型的水平。在图像理解方面,像素空间RF通常优于其基于VAE的变体。这些结果为实现端到端、无瓶颈的UMMs提供了有效的一步。
Mamba-Enhanced Implicit Motion Learning for Audio-Driven Portrait Animation
Chinese title: Mamba增强的隐式运动学习用于音频驱动的人物肖像动画
Authors: Xuan Wei, Jiahui Chen, Kaiheng Li, Mingyu Shao, Qingqi Hong
Audio-driven human motion video generation aims to synthesize realistic and temporally coherent human animations from a single static image, with applications in talking-head synthesis, co-speech gesture generation, and dynamic presentations. Moving beyond conventional keypoint-based methods that often struggle to capture subtle motion dynamics, we propose a novel implicit-motion framework for generating realistic and temporally coherent human motion videos from a single static image and audio. Our approach uses a two-stage pipeline that decouples motion prediction from rendering. The first stage integrates appearance priors and hierarchical depth cues into a region-aware attention mechanism to model latent motion features. The second stage employs a Mamba-enhanced diffusion model to directly predict these features from audio and the source image, enabling unsupervised learning of fine-grained motion patterns. This decoupled architecture enhances flexibility and efficiency. Trained on a new 380-hour high-quality dataset, our method outperforms prior work across multiple public benchmarks and our collected data in accuracy, naturalness, and temporal coherence, setting a new state-of-the-art.
音频驱动的人体动作视频生成旨在从单张静态图像合成逼真且时间一致的人体动画,应用于说话人头部(talking-head)合成、伴随语音的手势生成和动态演示等领域。传统基于关键点的方法往往难以捕捉细微的动作动态,针对这一问题,我们提出了一种新颖的隐式运动框架,用于从单张静态图像和音频生成逼真且时间一致的人体动作视频。我们的方法采用解耦运动预测与渲染的两阶段流水线。第一阶段将外观先验和分层深度线索整合到区域感知注意力机制中,以建模潜在运动特征。第二阶段采用Mamba增强的扩散模型直接从音频和源图像预测这些特征,实现细粒度运动模式的无监督学习。这种解耦架构增强了灵活性和高效性。我们的方法在一个新的380小时高质量数据集上进行训练,在多个公共基准测试和自收集数据上,其准确性、自然性和时间一致性均超越了现有工作,达到了新的最先进水平。
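Today's Image Compression track covers 1 paper. The digest keeps the original abstract, the Chinese abstract, the authors, and the links, so you can skim first and then pick the papers worth a close read to file in org-roam.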
ChannelTok: Efficient Flexible-Length Vision Tokenization
Chinese title: ChannelTok:高效的灵活长度视觉标记化
Authors: Sukriti Paul, Arpit Bansal, Tom Goldstein
Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: https://channeltok.github.io
领先的灵活视觉标记化方法以极高的成本实现了最先进的质量,依赖于参数量庞大的骨干网络和慢速的多步生成解码器。我们摒弃这种复杂的空间标记范式,引入一种简单、轻量且快速的通道级灵活长度标记化器。我们的方法将每个潜在通道视为一个视觉标记,实现了参数高效的CNN-Transformer混合骨干网络。此外,在训练过程中采用随机尾部丢弃范式自然地迫使通道按语义重要性组织。这使得在推理时可以通过简单地保留前k个通道实现灵活压缩,并自然地支持可变长度自回归图像生成。我们在ImageNet上通过大量实验验证了我们的方法,证明了在不同的标记预算下均能保持一致的质量。结果确立了新的质量-效率前沿:我们的模型实现了最先进的感知质量(rFID 2.92),同时与次优方案相比,解码速度快8.6倍,模型规模小2.1倍(159M参数)。我们的工作确立了通道级标记化作为高效视觉表示的一种强大且实用的范式。项目主页:https://channeltok.github.io
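The tail-dropping idea in the ChannelTok abstract can be illustrated with a small sketch. It assumes a latent is a list of channel tokens, that training zeroes out every channel after a randomly sampled cut point, and that inference simply keeps the first `k` channels. Whether the paper zeroes, masks, or removes dropped channels, and how it samples the cut point, is not stated in the abstract; the function names here are hypothetical.

```python
import random

def tail_drop(channels, keep):
    """Keep the first `keep` channel tokens and zero out the rest."""
    return [
        list(ch) if i < keep else [0.0] * len(ch)
        for i, ch in enumerate(channels)
    ]

def stochastic_tail_drop(channels, rng):
    """Training-time variant: sample how many leading channels survive."""
    keep = rng.randint(1, len(channels))  # inclusive on both ends
    return tail_drop(channels, keep), keep

def truncate(channels, k):
    """Inference-time compression: retain only the first k channel tokens."""
    return channels[:k]

# Toy latent with 4 channel tokens of 3 values each.
latent = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [0.5, 0.5, 0.5]]
rng = random.Random(0)
dropped, kept = stochastic_tail_drop(latent, rng)
compressed = truncate(latent, 2)
```

Because later channels are dropped more often than earlier ones during training, the earliest channels are pushed to carry the most important information, which is what makes prefix truncation usable at inference.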
No matching papers were found for this category today.