
Daily arXiv Paper Digest

2026-06-03 · 131 papers · grouped by research area
Automated tracking · LLM overview · Research radar
131 Total Papers
26 Autoregressive
100 Diffusion
5 Image Compression
0 1D Visual Tokenizer
0 Diffusion Visual Encoder
Daily Radar
Daily Overview

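Today's arXiv papers are dominated by diffusion models (100 papers), with autoregressive and diffusion techniques increasingly combined. Notable trends:

  • Efficient inference is a central theme: several papers focus on speculative decoding, such as TAPS and the budget-aware training of BudgetDraft, reflecting the pressure that large-model deployment puts on inference speed
  • Deeper multimodal integration: diffusion is closely coupled with cross-modal tasks, covering image reconstruction, 3D generation, video editing, gesture recognition, and related settings
  • Progress in physical controllability: papers such as Physical Object Understanding and Parameterized Diffusion Policies show a push toward understanding the physical world and generating it controllably
  • End-to-end integration of compression and generation: streamable talking portraits and deep compression VAEs illustrate the trend of combining real-time generation with low-bitrate compression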

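Papers most worth reading today:

  • TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding - introduces prefix-tree selection into diffusion-drafted speculative decoding, offering a new approach to efficient inference
  • TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models - presented as the first systematic evaluation of trustworthiness in language diffusion models, addressing a gap in safety evaluation for this area
  • Physical Object Understanding with a Physically Controllable World Model - explores physical world modeling and controllable generation, a step toward models that capture physical regularities
  • Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs - combines streaming generation with deep latent compression for real-time talking-portrait synthesis
  • Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms - argues that ideas from inference-time scaling can inform the design of pre-training algorithms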
Autoregressive
26 papers

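Autoregressive: Daily Overview

Today's autoregressive papers center on two themes: generation efficiency and multimodal content creation. On inference efficiency, speculative decoding continues to evolve: TAPS and BudgetDraft speed up inference through prefix-tree selection and multi-view training, respectively, while the Causal Forcing line uses diffusion distillation for real-time interactive video generation. On content generation, work on visual autoregressive generation explores redundancy reduction, and applications span 4D human generation, talking-portrait video, and long video. The overall trend is a closer integration of autoregressive and diffusion models, with inference and generation optimized together.

  • Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation - proposes a latent-discrepancy metric to remove redundant computation in visual autoregressive generation, balancing efficiency and quality.
  • Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering - improves text rendering in autoregressive models by adapting the tokenizer decoder while keeping the token IDs unchanged.
  • MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents - an autoregressive 4D human generation framework built on temporally structured latents, described as the first of its kind.
  • Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation - scales up few-step distillation for high-quality real-time interactive video generation.
  • LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation - a retrieval-augmented framework aimed at long-horizon video generation.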

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

2026-06-02T04:00:00 · autoregressive, cs.AI, diffusion · 2606.00487

Chinese title: TAPS: 面向扩散起草投机解码的目标感知前缀树选择

Authors: Zhuoyu Wang, Junnan Huang, Xinyu Chen

Abstract:

Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch in existing draft-tree methods: existing diffusion-tree methods rank nodes by the marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding, outperforming state-of-the-art DFlash and DDTree by 1.36x and 1.74x respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD

Abstract (Chinese translation):

使用扩散模型进行并行起草是投机解码的一种有前景的方法。通过在单次前向传播中预测多个未来位置的token,扩散起草模型大幅降低了起草延迟。然而,这把瓶颈转移到了验证环节:验证单序列限制了接受长度,而验证大型起草树会产生过高的目标模型延迟。我们发现了现有起草树方法中的一个关键不匹配:现有的扩散树方法按边缘概率对节点排序,忽视了验证是以前缀为条件的。因此,它们可能会验证被拒绝前缀的不可达后代,在接受收益有限的情况下增加延迟。为了解决这一问题,我们提出了TAPS,这是一种目标感知的前缀选择方法,能够将扩散边缘概率转化为路径条件的接受估计。TAPS在固定验证预算下选择紧凑的前缀封闭子树,而不是简单地扩展起草树,从而改善了接受成本权衡。跨不同数据集和模型家族的实验表明,TAPS相比 vanilla 自回归解码实现了高达7.9倍的无损端到端加速,性能分别优于最先进的DFlash和DDTree方法达1.36倍和1.74倍。
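
The selection idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the tree layout, the helper name `select_prefix_closed`, and the use of a plain product of per-node probabilities as the path-conditioned acceptance estimate are all assumptions made for illustration. Because a node's path score can never exceed its parent's, a best-first expansion from the root yields a prefix-closed subtree under a fixed node budget.

```python
import heapq

def select_prefix_closed(children, prob, root, budget):
    """Pick up to `budget` draft nodes by path probability (hypothetical helper).

    children: dict node -> list of child nodes
    prob:     dict node -> per-node acceptance estimate in [0, 1] (assumed given)
    The root stands for the already-verified prefix and is not counted.
    """
    selected = []
    heap = []      # max-heap on path probability, stored as negated scores
    counter = 0    # tie-breaker so the heap never compares node objects
    for child in children.get(root, []):
        heapq.heappush(heap, (-prob[child], counter, child))
        counter += 1
    while heap and len(selected) < budget:
        neg_score, _, node = heapq.heappop(heap)
        selected.append((node, -neg_score))
        for child in children.get(node, []):
            # A child is reachable only if its whole prefix is accepted,
            # so its score is the parent's path score times its own estimate.
            heapq.heappush(heap, (neg_score * prob[child], counter, child))
            counter += 1
    return selected

children = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
prob = {"a": 0.9, "b": 0.3, "c": 0.8, "d": 0.2, "e": 0.95}
picked = [node for node, _ in select_prefix_closed(children, prob, "root", budget=3)]
print(picked)  # ['a', 'c', 'b']
```

Ranking the same nodes by their individual probabilities would pick `e`, `a`, and `c`, even though `e` can only be reached through the unlikely prefix `b`. That is the mismatch the abstract describes.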

S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.SY, eess.SY · 2606.02151

Chinese title: S3TS:面向不确定性下高级规划的随机场景结构树搜索

Authors: Fabio Pavirani, Bert Claessens, Pierre Pinson, Chris Develder

Abstract:

Effective scheduling in the energy sector is essential to ensure the reliable operation of electrical grids and their connected assets by, for instance, optimizing the dispatch of generation units and storage systems. An effective planning strategy must (a) accommodate advanced and potentially non-linear system models -- exploiting the increasing data availability of modern grids, and (b) explicitly handle uncertainties arising, for instance, from the integration of renewable energy sources. While existing approaches can address either non-linearity (e.g., Monte Carlo Tree Search) or uncertainty (e.g., stochastic mathematical optimization), there is a lack of planning techniques capable of addressing both challenges simultaneously. To bridge this gap, we propose a Stochastic Scenario-Structured Tree Search (S3TS) algorithm that explicitly represents uncertainty through scenario trees while enabling the integration of advanced non-linear models. We evaluate S3TS on a simulated demand response signal publication problem, largely mimicking the imbalance settlement mechanism in Belgium. The results demonstrate near-optimal performance in linear, analytically tractable settings, with costs within 14% of the mathematically optimal solution conditioned to the scenario trees. In highly non-linear scenarios, S3TS significantly outperforms baseline methods, achieving cost reductions of up to 51% and 5.4% compared to a myopic algorithm and deterministic MCTS, respectively.

Abstract (Chinese translation):

能源领域的高效调度对于确保电网及其关联资产的可靠运行至关重要,例如通过优化发电机组和储能系统的调度。有效的规划策略必须(a)接纳高级且潜在的非线性系统模型——利用现代电网日益增长的数据可用性,(b)明确处理由可再生能源整合等所带来的不确定性。虽然现有方法能够解决非线性问题(例如蒙特卡洛树搜索)或不确定性问题(例如随机数学优化),但目前缺乏能够同时应对这两个挑战的规划技术。为弥补这一空白,我们提出了随机场景结构树搜索(S3TS)算法,该算法通过场景树显式表示不确定性,同时支持整合高级非线性模型。我们在一个模拟的需求响应信号发布问题(大致模拟比利时的偏差结算机制)上对S3TS进行了评估。结果表明,在线性、解析可处理的设置中,S3TS实现了接近最优的性能,其成本比场景树条件下的数学最优解高出不超过14%。在高度非线性场景中,S3TS显著优于基线方法,相较于短视算法和确定性蒙特卡洛树搜索分别实现了最高51%和5.4%的成本降低。

Tracking the Behavioral Trajectories of Adapting Agents

2026-06-02T04:00:00 · autoregressive, cs.AI · 2606.02536

Chinese title: 追踪适应型代理的行为轨迹

Authors: Jonah Leshin, Manish Shah, Ian Timmis

Abstract:

Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Through edits by humans or the agents themselves, these files may evolve over time, directly steering the agent's behavior in future interactions. We present a methodology and framework for measuring agent traits by defining traits as directions in the embedding space of a text embedding model. We train a linear model on labeled "before" versus "after" skill file diffs to learn a trait vector, then score arbitrary skill edits by projecting their embedding diffs onto this vector. Evaluated on 68 labeled skill diff pairs for the trait of propensity to seek sensitive data, our method achieves 91.2% sign classification accuracy and a Spearman rank correlation of $\rho = 0.82$ under leave-one-out cross-validation. We build this trait evaluation into a broader agent-to-agent protocol that enables one agent to evaluate another's skill file updates through a trusted intermediary.

Abstract (Chinese translation):

技能文件、记忆文件和行为配置文件等文本文件在定义现代代理的行为方式中起着核心作用。通过人类或代理自身的编辑,这些文件可能会随时间演变,从而直接引导代理在后续交互中的行为。我们提出一种通过定义特征为文本嵌入模型嵌入空间中的方向来测量代理特征的方法与框架。我们在标记的技能文件“修改前”与“修改后”差异对上进行线性模型训练,以学习特征向量,然后通过将任意技能编辑的嵌入差异投影到该向量上进行评分。在针对寻求敏感数据倾向这一特征标记的68个技能差异对的评估中,我们的方法在留一法交叉验证下达到了91.2%的符号分类准确率和斯皮尔曼等级相关系数ρ=0.82。我们将这一特征评估构建到一个更广泛的代理间协议中,使一个代理能够通过可信中介来评估另一个代理的技能文件更新。
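
The projection step in the abstract can be sketched as follows. This is a toy illustration under stated assumptions: random vectors stand in for real text embeddings, labels are noiseless, and ordinary least squares plays the role of the linear model. The names `fit_trait_vector` and `trait_score` are hypothetical and do not come from the paper.

```python
import numpy as np

def fit_trait_vector(diffs, labels):
    """Least-squares fit of a direction w so that diffs @ w approximates labels."""
    w, *_ = np.linalg.lstsq(diffs, labels, rcond=None)
    return w

def trait_score(before_emb, after_emb, w):
    """Project the embedding diff of an edit onto the trait vector."""
    return float((after_emb - before_emb) @ w)

rng = np.random.default_rng(0)
dim, n = 16, 68
true_dir = np.zeros(dim)
true_dir[0] = 1.0                          # synthetic "trait" axis (assumption)
diffs = rng.normal(size=(n, dim))          # stand-ins for after-minus-before embeddings
labels = diffs @ true_dir                  # noiseless synthetic labels
w = fit_trait_vector(diffs, labels)

before = np.zeros(dim)
after = 2.0 * true_dir                     # an edit that moves along the trait axis
print(trait_score(before, after, w) > 0)   # True
```

The sign of the score indicates whether an edit moves toward or away from the trait, and the magnitude can be used to rank edits, which is what the sign accuracy and rank correlation in the abstract measure.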

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.LG · 2606.00144

Chinese title: BudgetDraft:面向稀疏KV投机解码的接受率感知多视角训练方法

Authors: Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen, Kangning Cui, Qizhen Lan, Xilu Wang

Abstract:

Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K--16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to 6.55x, 4.46x, and 2.10x end-to-end speedup vs AR at 4K, 8K, and 16K context lengths, while keeping the inference pipeline memory-friendly.

Abstract (Chinese translation):

投机解码通过使用起草模型并行提出多个token供验证器验证来加速自回归解码。在资源受限的部署中,起草模型使用稀疏KV缓存在固定KV预算下限制峰值GPU内存和端到端延迟,而验证器保持完整KV缓存。中长上下文推理(4K-16K上下文长度)在实际应用中很常见。然而,简单的稀疏/完整投机解码在上下文长度增加时存在稀疏/完整不匹配问题,导致接受率迅速下降。我们提出了BudgetDraft,一种用于中长推理中稀疏起草的多视角稀疏训练方法。起草模型在训练过程中接触多个采样的KV预算,并学习将每个稀疏视角与一个共享的完整缓存教师目标对齐。BudgetDraft结合了完整缓存分支的接受率感知损失和稀疏缓存分支的多视角损失,产生了一个预算鲁棒的单一起草模型,能够在不同稀疏度下恢复接受率,且无需额外的推理时组件。在PG-19、LongBench和LWM上的实验表明,BudgetDraft在4K、8K和16K上下文长度下分别实现了高达6.55倍、4.46倍和2.10倍相对于自回归的端到端加速比,同时保持了推理管道的内存友好性。

(HB-ARFM) History-Bootstrapped Flow Matching for Inverse Boiling Reconstruction

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.CE, cs.LG · 2606.00349

Chinese title: (HB-ARFM) 历史引导流匹配用于逆沸腾重建

Authors: Xianwei Zou, Sheikh Md Shakeel Hassan, Arthur Feeney, Aparna Chandramowlishwaran

Abstract:

Reconstructing spatiotemporal fields from partial observations is fundamental to scientific inference, from inferring atmospheric states from satellite data to recovering fluid states from imaging. When observations are incomplete, the inverse problem is fundamentally ill-posed: even when the underlying PDE dynamics are Markovian in the full state, partial observation operators induce a non-Markovian posterior that cannot be resolved from a single timestep. We propose a history-bootstrapped autoregressive flow matching (HB-ARFM) for spatiotemporal inverse reconstruction under partial observability. Observation history bootstraps the initial reconstruction via conditional flow matching, reducing ambiguities. The same conditional transport model is then applied autoregressively, conditioning on both new observations and past predictions to propagate the reconstruction forward in time. We evaluate the method on boiling dynamics reconstruction, recovering full velocity and temperature fields from interface geometry and motion. Across two inverse tasks with varying observation sparsity, HB-ARFM produces physically and temporally valid reconstructions where other models fail.

Abstract (Chinese translation):

从部分观测重建时空场是科学推理的基础,例如从卫星数据推断大气状态,或从成像恢复流体状态。当观测不完整时,逆问题本质上是不适定的:即使底层偏微分方程动力学在完整状态下是马尔可夫的,部分观测算子也会诱导出非马尔可夫后验,无法从单一时间步解决。我们提出了历史引导自回归流匹配(HB-ARFM)方法,用于在部分可观测条件下的时空逆重建。观测历史通过条件流匹配引导初始重建,减少了歧义。随后相同的条件传输模型以自回归方式应用,基于新观测和过去预测进行条件化,以将重建结果向前传播。我们在沸腾动力学重建任务上评估了该方法,从界面几何形状和运动中恢复完整的速度和温度场。在两个观测稀疏程度不同的逆任务中,HB-ARFM 产生了物理上和时间上有效的重建结果,而其他模型则无法做到。

Rank-Constrained Deep Matrix Completion for Group Recommendation

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.IR · 2606.01948

Chinese title: 面向群组推荐的秩约束深度矩阵补全

Authors: Mubaraka Sani Ibrahim, Lehel Csató, Isah Charles Saidu

Abstract:

The growing popularity of group activities has increased the need for methods that provide recommendations to groups of users given their individual preferences. Many existing group recommender systems rely on aggregating individual user preferences, but they often struggle with high-dimensional and highly sparse rating data commonly found in real-world scenarios. We propose Group Rank-Constrained Deep Matrix Completion (Group RC-DMC), a novel framework that extends RC-DMC by integrating group-level representation learning via a Set-Transformer aggregator, jointly leveraging low-rank structure and attention-based nonlinear modeling. Unlike most existing group recommender systems, Group RC-DMC unifies explicit low-rank regularization, linear encoder-decoder architectures, and attention-based nonlinear group modeling within a single framework, yielding accurate predictions at both the individual and group levels. Group RC-DMC addresses data sparsity through low-rank matrix completion, computing per-user latent representations from observed ratings only, and enforcing a rank constraint on the latent space using a nuclear-norm proximal step based on periodic singular value thresholding. The decoder is parametrized as a low-rank factorization, enabling efficient inference. Experimental results on the MovieLens and Goodbooks datasets demonstrate that Group RC-DMC achieves superior reconstruction accuracy, measured by lower group RMSE, while remaining computationally efficient and competitive in group-level performance in terms of precision, recall, and F1 score compared with weighted-before-factorization (WBF) and after-factorization (AF) baselines. The results highlight the model's ability to recover the underlying low-rank structure of user-item interactions and provide robust group recommendations across small, medium, and large user groups.

Abstract (Chinese translation):

群组活动的日益流行增加了根据用户个人偏好为群组提供推荐方法的需求。许多现有的群组推荐系统依赖于聚合单个用户偏好,但它们通常难以处理现实场景中常见的高维且高度稀疏的评分数据。我们提出了群组秩约束深度矩阵补全(Group RC-DMC),这是一个将RC-DMC扩展的新型框架,通过Set-Transformer聚合器整合群组级表示学习,共同利用低秩结构和基于注意力的非线性建模。与大多数现有群组推荐系统不同,Group RC-DMC将显式低秩正则化、线性编码器-解码器架构以及基于注意力的非线性群组建模统一在单一框架中,从而在个体和群组层面都能产生准确预测。Group RC-DMC通过低秩矩阵补全解决数据稀疏性问题,仅从观察到的评分计算每个用户的潜在表示,并使用基于周期性奇异值阈值化的核范数近端步对潜在空间施加秩约束。解码器参数化为低秩分解,能够实现高效推理。在MovieLens和Goodbooks数据集上的实验结果表明,Group RC-DMC实现了更优的重建精度(以较低的群组RMSE衡量),同时在计算效率方面表现优异,并且在群组级别的精确率、召回率和F1分数方面与加权分解前聚合(WBF)和分解后聚合(AF)基线方法相比具有竞争力。结果突出了该模型能够恢复用户-项目交互的底层低秩结构,并为小型、中型和大型用户群组提供稳健的群组推荐。
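
The "nuclear-norm proximal step based on singular value thresholding" mentioned in the abstract refers to a standard operator: soft-thresholding the singular values of a matrix. The sketch below shows only that operator on a toy matrix. How and how often the paper applies it to its latent space is not specified here, and the function name is an illustrative choice.

```python
import numpy as np

def singular_value_threshold(Z, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)    # singular values below tau become zero
    return (U * s_shrunk) @ Vt             # rebuild the matrix with shrunken spectrum

Z = np.diag([5.0, 2.0, 0.5])
Z_low = singular_value_threshold(Z, tau=1.0)
print(np.linalg.matrix_rank(Z_low))  # 2
```

Every singular value shrinks by `tau`, and those that fall to zero reduce the rank, which is how the step pushes a representation toward low rank.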

Variational Learning for Insertion-based Generation

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.LG, diffusion · 2606.02133

Chinese title: 基于插入的生成的变分学习

Authors: Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying, David van Dijk, Michalis K. Titsias, Jiaxin Shi

Abstract:

Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.

Abstract (Chinese translation):

非单调序列生成方法(如掩码扩散模型)通过允许以非固定和预定义的顺序生成token,为从左到右的自回归建模提供了灵活的替代方案。尽管具有实际优势,但现有大多数非单调模型都是顺序无关的,并且依赖固定长度的网格,限制了它们支持可变长度生成和自适应插入顺序的能力。在本工作中,我们引入了一个用于学习可变长度插入模型中插入顺序的概率框架。我们建立了插入轨迹与排列之间的双射对应关系,这使得数据似然可以精确地重参数化为对所有排列的求和。基于这一结果,我们提出了插入过程(Insertion Process, IP),这是一个随机生成模型,联合学习插入位置、插入内容和终止时机,并通过基于排列的变分推理进行训练。与之前的固定画布方法不同,IP原生支持可变长度生成,并学习数据驱动的插入顺序偏好。在目标条件规划和分子字符串生成上的实验表明,在没有规范从左到右结构的领域中,学习插入顺序能够提高建模质量和泛化能力。
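
One way to picture the correspondence between insertion trajectories and permutations is the round trip below. The encoding used here (a permutation that lists, for each step, the final index of the token inserted at that step) is an assumption chosen for illustration and may differ from the paper's formal construction.

```python
def permutation_to_trajectory(tokens, order):
    """order[t] is the final index of the token inserted at step t."""
    placed = []        # final indices already present, kept in sorted order
    trajectory = []
    for final_idx in order:
        # Slot in the partial sequence = number of placed tokens to its left.
        pos = sum(1 for j in placed if j < final_idx)
        trajectory.append((pos, tokens[final_idx]))
        placed.insert(pos, final_idx)
    return trajectory

def trajectory_to_permutation(trajectory):
    """Replay the insertions; recover the final sequence and the generation order."""
    seq, steps = [], []
    for t, (pos, tok) in enumerate(trajectory):
        seq.insert(pos, tok)
        steps.insert(pos, t)   # steps[i] = step at which final position i was filled
    order = [0] * len(steps)
    for final_idx, t in enumerate(steps):
        order[t] = final_idx
    return seq, order

tokens = ["a", "b", "c", "d"]
order = [2, 0, 3, 1]
traj = permutation_to_trajectory(tokens, order)
print(traj)                             # [(0, 'c'), (0, 'a'), (2, 'd'), (1, 'b')]
print(trajectory_to_permutation(traj))  # (['a', 'b', 'c', 'd'], [2, 0, 3, 1])
```

Since each permutation maps to exactly one trajectory and back, a sum over insertion trajectories of a fixed sequence can be rewritten as a sum over permutations, which is the reparameterization the abstract refers to.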

Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.LG, diffusion · 2503.07154

Chinese title: 推理时扩展的思想可以改进生成式预训练算法

Authors: Jiaming Song, Linqi Zhou

Abstract:

Generative pre-training is often framed through a false dichotomy between autoregressive models for discrete signals and diffusion models for continuous signals. We argue that the dichotomy is false because it conflates model family, data representation, training objective, and inference procedure. Autoregression is an inference procedure that expands a sequence through normalized conditional draws, while diffusion is a refinement procedure that repeatedly revises an existing state. The more useful contrast is therefore not autoregressive versus diffusion, but discrete tokens learned with cross-entropy versus continuous tokens learned with diffusion-style objectives, together with the inference algorithms used to sample from them. From this perspective, algorithmic progress should prioritize inference-time efficiency along two axes: sequence expansion and state refinement. We advocate designing the inference procedure before the training objective, because a training method cannot compensate for an inference map that omits necessary arguments or imposes an incorrect factorization. We illustrate this principle through a target-time limitation of DDIM-style samplers, a joint-distribution limitation of multi-token prediction, and recent flow-map and few-step distillation methods that directly parameterize long-range inference moves.

Abstract (Chinese translation):

生成式预训练常被框定为一个错误的二分法:离散信号使用自回归模型,连续信号使用扩散模型。我们认为这个二分法是错误的,因为它混淆了模型家族、数据表示、训练目标和推理程序。自回归是一种推理程序,通过规范化条件采样来扩展序列;而扩散是一种精炼程序,反复修改现有状态。因此,更有用的对比不是自回归与扩散,而是使用交叉熵学习的离散标记与使用扩散式目标学习的连续标记,以及用于从它们采样的推理算法。从这个角度看,算法进步应该优先考虑推理时在两个维度上的效率:序列扩展和状态精炼。我们主张在训练目标之前设计推理程序,因为一种训练方法无法弥补省略必要参数或施加错误分解的推理映射。我们通过DDIM风格采样器的目标时间限制、多标记预测的联合分布限制,以及直接参数化远程推理移动的流映射和少步蒸馏方法来阐明这一原则。

Learning-To-Measure: In-Context Active Feature Acquisition

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.LG · 2510.12624

Chinese title: 学习测量:上下文内主动特征获取

Authors: Yuta Kobayashi, Zilin Jing, Jiayu Yao, Hongseok Namkoong, Shalmali Joshi

Abstract:

Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal is to learn acquisition policies across various tasks. We introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided greedy feature acquisition agent that maximizes conditional mutual information. We demonstrate a sequence-modeling or autoregressive pre-training approach that underpins reliable uncertainty quantification for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the meta-AFA task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness.

Abstract (Chinese translation):

主动特征获取(AFA)是一个序列决策问题,其目标是通过自适应选择要获取的特征来提升测试实例的模型性能。在实践中,AFA方法通常从具有特征系统性缺失和有限特定任务标签的回溯数据中学习。之前的大多数工作都针对单一预定义任务进行获取策略学习,这限制了其可扩展性。为解决这一局限性,我们形式化定义了元AFA问题,其目标是学习跨任务的获取策略。我们提出了学习度量(L2M)方法,该方法包含:i)对未见任务的可信不确定性量化,以及 ii)一个不确定性引导的贪婪特征获取智能体,用于最大化条件互信息。我们展示了一种序列建模或自回归预训练方法,该方法为具有任意缺失的任务提供可信的不确定性量化支撑。L2M直接在具有回溯缺失的数据集上运行,并在情境中执行元AFA任务,无需进行任务级别的重训练。在合成和真实表格基准数据集上的实验表明,L2M能够匹配或超越任务特定基线方法,尤其是在标签稀缺和高缺失率的场景下。

Multi-Modal Learning meets Genetic Programming: Analyzing Alignment in Latent Space Optimization

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.NE · 2604.08324

Chinese title: 多模态学习与遗传编程的结合:潜在空间优化中的对齐性分析

Authors: Benjamin Léger, Kazem Meidani, Christian Gagné

Abstract:

Symbolic regression (SR) aims to discover mathematical expressions from data, a task traditionally tackled using Genetic Programming (GP) through combinatorial search over symbolic structures. Latent Space Optimization (LSO) methods use neural encoders to map symbolic expressions into continuous spaces, transforming the combinatorial search into continuous optimization. SNIP (Meidani et al., 2024), a contrastive pre-training model inspired by CLIP, advances LSO by introducing a multi-modal approach: aligning symbolic and numeric encoders in a shared latent space to learn the phenotype-genotype mapping, enabling optimization in the numeric space to implicitly guide symbolic search. However, this relies on fine-grained cross-modal alignment, whereas literature on similar models like CLIP reveals that such an alignment is typically coarse-grained. In this paper, we investigate whether SNIP delivers on its promise of effective bi-modal optimization for SR. Our experiments show that: (1) cross-modal alignment does not improve during optimization, even as fitness increases, and (2) the alignment learned by SNIP is too coarse to efficiently conduct principled search in the symbolic space. These findings reveal that while multi-modal LSO holds significant potential for SR, effective alignment-guided optimization remains unrealized in practice, highlighting fine-grained alignment as a critical direction for future work.

Abstract (Chinese translation):

符号回归(SR)旨在从数据中发现数学表达式,这一任务传统上通过遗传编程(GP)在符号结构上进行组合搜索来解决。潜在空间优化(LSO)方法使用神经编码器将符号表达式映射到连续空间,将组合搜索转化为连续优化。SNIP(Meidani et al., 2024)是一个受CLIP启发的对比预训练模型,通过引入多模态方法推进了LSO:在共享潜在空间中对齐符号编码器和数值编码器以学习表型-基因型映射,使得在数值空间中的优化能够隐式引导符号搜索。然而,这一方法依赖于细粒度的跨模态对齐,而类似CLIP的文献表明这种对齐通常是粗粒度的。本文研究了SNIP是否能兑现其对SR进行有效双模态优化的承诺。我们的实验表明:(1)即使适应度增加,跨模态对齐在优化过程中并未改善;(2)SNIP学习的对齐过于粗粒度,无法在符号空间中进行高效的系统性搜索。这些发现表明,尽管多模态LSO在SR中具有重要潜力,但有效的对齐引导优化仍未在实践中实现,这凸显了细粒度对齐是未来工作的关键方向。

Channel-wise Vector Quantization

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.CV · 2605.26089

Chinese title: 通道式向量量化

Authors: Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Min Li, Jiaqi Wang, Kaicheng Yu

Abstract:

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.

Abstract (Chinese translation):

本文提出了通道式向量量化(Channel-wise Vector Quantization,CVQ),这是一种新颖的图像标记化范式,用通道级标记取代块级标记。不同于传统向量量化将离散标记分配给每个块特征向量,CVQ对特征图的每个通道进行量化。该方法将图像表示为视觉细节的离散层级,而非空间块的网格结构。在此基础上,我们引入了一种新的视觉自回归框架,采用“下一个通道预测”。与传统的按光栅顺序逐块渲染图像不同,我们的通道自回归模型(Channel-wise Autoregressive,CAR)按顺序预测图像通道,逐步生成愈加丰富的视觉细节。具体而言,该模型首先勾勒全局结构,然后细化细粒度属性,类似于人类艺术家的创作过程。实验结果表明:(1)CVQ在无需任何额外技巧的情况下实现了100%的码本利用率,码本规模超过16000,显著提升了重建质量;(2)CAR在文本到图像生成任务中取得了DPG分数86.7和GenEval分数0.79,表现出强大的有效性。
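
The difference between patch-wise and channel-wise tokens comes down to which axis of the feature map is treated as the vector to quantize. The sketch below contrasts the two on a random feature map. The abstract does not spell out how a channel is mapped to a codebook entry, so the nearest-neighbor lookup over the flattened H×W channel, the random codebooks, and the helper name `nearest_code` are assumptions for illustration only.

```python
import numpy as np

def nearest_code(vectors, codebook):
    """Index of the nearest codebook row (squared L2) for each row of `vectors`."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 32
feat = rng.normal(size=(C, H, W))          # toy encoder feature map

# Patch-wise VQ: one token per spatial location; each vector has C entries.
patch_vectors = feat.reshape(C, H * W).T   # shape (H*W, C)
patch_tokens = nearest_code(patch_vectors, rng.normal(size=(K, C)))

# Channel-wise VQ: one token per channel; each vector has H*W entries.
channel_vectors = feat.reshape(C, H * W)   # shape (C, H*W)
channel_tokens = nearest_code(channel_vectors, rng.normal(size=(K, H * W)))

print(patch_tokens.shape, channel_tokens.shape)  # (16,) (8,)
```

With channel-wise tokens the sequence length equals the number of channels rather than the number of patches, which is what makes "next-channel prediction" possible.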

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

2026-06-02T04:00:00 · autoregressive, cs.AI, cs.SD, diffusion, eess.AS · 2605.30748

Chinese title: Chatterbox-Flash:用于流式零样本TTS的先验校准块扩散模型

Authors: Deokjin Seo, Gangin Park, Kihyun Nam

Abstract:

We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. We find that naively transferring mainstream block-diffusion decoding to discrete speech tokens degrades quality, as a long-tail token distribution biases parallel position selection toward a few high-frequency tokens. To mitigate this without architectural modification, we introduce two inference-time techniques: prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively terminates iteration based on calibrated confidence. On standard zero-shot TTS benchmarks, Chatterbox-Flash attains high-fidelity synthesis comparable to strong autoregressive and non-autoregressive baselines, while supporting streaming inference with time-to-first-packet on par with streaming AR systems and substantially lower real-time factor. Code and audio samples are available at https://github.com/resemble-ai/chatterbox-flash.

Abstract (Chinese translation):

我们提出了Chatterbox-Flash,这是一个零样本文本到语音模型,通过将预训练的自回归TTS解码器微调为块扩散解码器获得,能够在每个块内实现并行token生成,同时保持逐块的流式推理。我们发现,将主流的块扩散解码直接迁移到离散语音token会导致质量下降,因为长尾token分布会使并行位置选择偏向少数高频token。为了在不修改架构的情况下解决这个问题,我们引入了两种推理时技术:先验校准评分,即减去块级边缘token分布;以及早停解码调度,即根据校准置信度自适应终止迭代。在标准零样本TTS基准测试中,Chatterbox-Flash实现了与强自回归和非自回归基线相当的高保真合成,同时支持流式推理,首包延迟与流式AR系统相当,且实时因子显著更低。代码和音频样本可访问 https://github.com/resemble-ai/chatterbox-flash。
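
A rough sketch of what prior-calibrated scoring might look like follows. The abstract only says that the block-level marginal token distribution is subtracted, so the exact form used here (log-probability of the greedy token minus the log of the block-averaged probability of that token) and the function names are assumptions, not the released implementation.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def rank_positions(logits, calibrate):
    """Order the positions in a block from most to least confident."""
    logp = log_softmax(logits)                  # shape (positions, vocab)
    tokens = logp.argmax(axis=-1)               # greedy token per position
    idx = np.arange(len(tokens))
    score = logp[idx, tokens]
    if calibrate:
        # Block-level marginal: average token distribution over the block.
        log_prior = np.log(np.exp(logp).mean(axis=0))
        score = score - log_prior[tokens]
    return np.argsort(-score)

# Positions 0 and 1 both predict the frequent token 0; position 2 predicts token 3.
logits = np.array([[3.0, 0.0, 0.0, 0.0],
                   [2.8, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 2.6]])
print(rank_positions(logits, calibrate=False))  # [0 1 2]
print(rank_positions(logits, calibrate=True))   # [2 0 1]
```

Without calibration, the positions predicting the frequent token are committed first. After subtracting the block-level prior, the position carrying the rarer token ranks highest, which is the bias correction the abstract describes.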

Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation

2026-06-02T04:00:00 · autoregressive, cs.CV · 2606.00310

Chinese title: 何处细化,何时停止:通过潜在差异重新思考冗余以实现高效视觉自回归生成

Authors: Changwang Mei, Peisong Wang, Zekun Li, Changsheng Li, Shuang Qiu, Qinghao Hu, Gang Li, Yifan Zhang, Zhihui Wei, Jian Cheng

Abstract:

Visual Autoregressive (VAR) models deliver high-quality image generation but suffer from significant inference latency at high resolutions. Recent acceleration approaches mostly rely on heuristic measures with layer features to prune tokens. Such heuristics are sensitive to complex contextual semantics, leading to inaccurate identification of redundant computation and poor adaptability across prompts. We rethink redundancy in VAR from the perspective of its impact on pixel-space generation and introduce Latent Discrepancy. This unified metric quantifies a token's contribution by measuring the change in model states during generation. Our analysis shows that redundancy is more accurately identified when guided by image latent or pixel-space signals. We further observe that in classifier-free guidance (CFG), the convergence trend of the discrepancy between conditional and unconditional branches varies strongly across prompts. Based on these findings, we propose LD-Pruning (Latent Discrepancy Pruning), a training-free framework that removes redundancy via latent discrepancy by integrating decoding-free region selection and adaptive unconditional-branch skipping. Extensive experiments show that LD-Pruning substantially reduces inference latency while maintaining high generation quality, achieving up to 2.35x speedup on Infinity-8B.

Abstract (Chinese translation):

视觉自回归(VAR)模型能够生成高质量图像,但在高分辨率下存在严重的推理延迟问题。近期加速方法大多依赖层特征的启发式度量来剪枝标记。这些启发式方法对复杂上下文语义敏感,导致冗余计算识别不准确,且在不同提示词间适应性较差。我们从对像素空间生成的影响角度重新审视VAR中的冗余问题,并提出潜在差异(Latent Discrepancy)。这一统一度量通过测量生成过程中模型状态的变化来量化标记的贡献。我们的分析表明,在图像潜在空间或像素空间信号的引导下,冗余识别更为准确。我们进一步观察到,在无分类器引导(CFG)中,条件分支与非条件分支之间的差异收敛趋势在不同提示词下表现出较高的动态性。基于这些发现,我们提出LD-Pruning(潜在差异剪枝),这是一个无需训练的框架,通过整合无需解码的区域选择和自适应非条件分支跳过来利用潜在差异去除冗余。大量实验表明,LD-Pruning在保持高生成质量的同时大幅降低推理延迟,在Infinity-8B上实现了最高2.35倍的加速。

Physical Object Understanding with a Physically Controllable World Model

2026-06-02T04:00:00 · autoregressive, cs.CV · 2606.00439

Chinese title: 基于物理可控世界模型的物体理解

Authors: Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Wanhee Lee, Gia Ancone, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel LK Yamins

Abstract:

A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations - capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract objects and articulated object subparts. Having discovered these objects, we show that our world model can manipulate them in 3D. Finally, we demonstrate how physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.

Abstract (Chinese translation):

视觉智能的核心挑战在于从原始视频中学习场景的物理结构:区域如何形成物体,以及支配其相互作用的规律。解决这些任务需要世界模型具备从部分观测中推断世界分布状态的能力,而这是当前架构所无法提供的。我们引入了一类新的概率世界模型,支持在任何视觉变量(如外观和动力学)以其他任意变量为条件的条件下估计其概率。在此,我们发现这些模型可以通过自回归序列建模进行高效训练,从而产生能够实现丰富物体理解的世界模型。首先,我们通过顺序推理生成多个合理的未来世界状态,验证了模型能够捕捉控制物体运动的物理定律。随后,通过分析这些未来状态之间的运动相关性,我们提取出了物体及其关节子部件。在发现这些物体后,我们展示了世界模型可以在三维空间中对它们进行操作。最后,我们演示了如何从世界模型中计算物体之间的物理关系,从而实现了视觉积木(Visual Jenga)等应用。

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

2026-06-02T04:00:00 · autoregressive, cs.CV, diffusion, image_compression · 2606.01620

Chinese title: 基于参考引导深度压缩VAE的可流式传输说话人像视频实时生成

Authors: Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang, Zhan Chen, Jiaolong Yang, Baining Guo

Abstract:

Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.

Abstract (Chinese translation):

视频扩散模型在肖像视频生成方面取得了显著进展,但其高计算需求限制了其在交互式应用中的使用。本研究提出了一个基于语音音频和参考图像生成可流式传输说话人像视频的框架。该框架专为流式场景设计,包含一个用于深度潜空间压缩的因果视频VAE和一个自回归潜空间去噪模型。我们的因果VAE整合了可变数量的参考图像作为引导,使网络能够专注于动态信息而非静态外观,从而提高了压缩效率和重建质量。此外,我们将残差自编码范式扩展应用于VAE中,以改进时空因果关系的处理。生成器基于修正流Transformer架构,以分块自回归方式生成视频潜空间。我们的方法实现了高质量说话人像视频的实时生成,生成速度显著优于基线模型。全面的实验表明,本方法在真实感、生动性和视频质量方面与这些大型模型相当,甚至更优。

Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

2026-06-02T04:00:00 · autoregressive, cs.CV · arXiv:2606.01911

中文标题:残差解码器适配器:用于自回归文本渲染的保ID分词器适配方法

作者:Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan

摘要:

Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and disrupted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post-hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch to learn the tiny differences (residual) between the reconstructed image and the ground-truth images in the pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA improves text rendering by a large margin. For instance, it raises the OCR accuracy of fine-tuned Janus-Pro from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth) on the competitive TextAtlas benchmark. The code is available at https://github.com/CSU-JPG/RDA

摘要中文:

视觉自回归(AR)模型通过预测离散token并由视觉分词器解码来生成图像。尽管这些模型展示了强大的整体图像生成能力,但在文本渲染方面仍表现不佳,常出现笔画模糊和字母形状失真的问题。本工作将这一局限性归因于视觉分词器难以重建细粒度细节。改进分词器的方法虽然直接,但代价高昂,因为它需要同时重训练分词器和AR模型。是否可以在不重训练现有分词器和AR模型的情况下提升文本渲染性能?为实现这一目标,我们提出了残差解码器适配器(RDA),它可以在不改变token空间的情况下对现有分词器进行事后升级。具体而言,它通过引入两个新组件来优化视觉分词器的解码器输出:(i)一个与原始码本共享token分布的配对码本;(ii)一个并行分支,用于学习像素空间中重建图像与真实图像之间的微小差异(残差)。这种残差设计使我们能够在保持与现有AR模型兼容性的同时,以非侵入式方式增强分词器性能。RDA大幅提升了文本渲染效果。例如,在竞争激烈的TextAtlas基准上,微调后的Janus-Pro的OCR准确率从24.52%提升至58.26%(TextVisionBlend),从12.75%提升至36.81%(StyledTextSynth)。代码已开源于https://github.com/CSU-JPG/RDA

Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation

2026-06-02T04:00:00 · autoregressive, cs.CV · arXiv:2606.02479

中文标题:检索缺失内容:面向一致长视频生成的覆盖度最大化检索

作者:Minseok Joo, Dogyun Park, Taehoon Lee, Kyujin Lee, Hyunwoo J. Kim

摘要:

Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key design choices: what 3D-geometric evidence should represent past observations, and how memory frames should be selected from this evidence. Existing methods often rely on camera poses or field-of-view overlap, which are lightweight but too coarse to reason about pixel-wise visibility, or use explicit 3D reconstruction, which provides fine-grained evidence but is costly to maintain over long rollouts. We propose Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework that uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence. For frame selection, COVRAG maximizes residual coverage gain, iteratively retrieving frames that explain target-view regions not covered by the current context or previously selected memories. To improve scalability in long-video generation, we introduce sliding-window depth caching for efficient geometry estimation. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.

摘要中文:

保持长期几何一致性对于长时序自回归视频生成仍然具有挑战性。记忆增强生成模型通过检索历史帧来解决这一问题,但其有效性取决于两个关键设计选择:应该用什么样的3D几何证据来表示过去的观测,以及如何从这些证据中选择记忆帧。现有方法通常依赖相机位姿或视场重叠,这些方法轻量级但过于粗略,无法推理像素级的可见性;或者使用显式3D重建,这种方法提供细粒度证据,但在长时序生成中维护成本较高。我们提出了覆盖度最大化检索增强生成(COVRAG),一种基于深度的记忆检索框架,使用预训练3D先验构建目标视图覆盖图作为轻量级3D记忆证据。在帧选择方面,COVRAG最大化残差覆盖增益,迭代检索能够解释当前上下文或先前选择记忆未覆盖的目标视图区域的帧。为了提升长视频生成的可扩展性,我们引入了滑动窗口深度缓存以实现高效的几何估计。在RealEstate10K和DL3DV10K数据集上的实验表明,COVRAG在保持低延迟的同时提升了长时序几何一致性。
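
摘要中"最大化残差覆盖增益、迭代检索"的帧选择过程,可以用一个贪心最大覆盖的草图来说明。以下只是一个最小示意,并非论文的实现:这里假设覆盖图已被简化为像素索引的集合(基于深度和预训练3D先验构建覆盖图的步骤没有体现),函数名 select_memory_frames 及其参数都是为演示而假设的命名。

```python
def select_memory_frames(target_pixels, context_cover, candidates, budget):
    """贪心选择记忆帧:每轮选取残差覆盖增益最大的候选帧(示意用的假设接口)。

    target_pixels: 目标视图的像素索引
    context_cover: 当前上下文已覆盖的像素索引
    candidates:    {帧ID: 该帧能解释的目标视图像素集合}
    budget:        最多检索的帧数
    """
    # 残差区域:目标视图中尚未被当前上下文覆盖的像素
    uncovered = set(target_pixels) - set(context_cover)
    remaining = dict(candidates)
    selected = []
    while uncovered and remaining and len(selected) < budget:
        # 增益只按"尚未覆盖"的像素计算,已被解释的区域不重复计分
        best_id = max(remaining, key=lambda fid: len(remaining[fid] & uncovered))
        gain = remaining[best_id] & uncovered
        if not gain:
            break  # 没有候选帧能再带来新的覆盖
        selected.append(best_id)
        uncovered -= gain
        del remaining[best_id]
    return selected, uncovered


# 玩具数据:目标视图有10个像素,上下文已覆盖其中2个
target = range(10)
context = {0, 1}
candidates = {
    "f1": {1, 2, 3, 4},
    "f2": {3, 4, 5},
    "f3": {6, 7, 8, 9},
    "f4": {2, 3},
}
selected, left = select_memory_frames(target, context, candidates, budget=3)
print(selected, left)
```

在这个玩具例子中,f3 最先被选中,因为它解释的未覆盖像素最多;此后各帧的增益只按剩余的未覆盖像素重新计算。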

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

2026-06-02T04:00:00 · autoregressive, cs.CV · arXiv:2606.02491

中文标题:MORPHOS:基于时序结构潜在变量的自回归四维生成

作者:Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim, JongMin Lee, Seungryong Kim

摘要:

We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.

摘要中文:

我们提出了MORPHOS,这是一种新型自回归框架,能够从视频中生成跨多种表示形式的动态3D资产,包括网格、3D高斯和辐射场。现有的方法通常仅限于单一表示形式,难以建模拓扑变化,或无法在长视频中保持时间一致性。为解决这些局限性,我们引入了时序结构潜在变量(T-SLAT),这是一种统一的四维表示,能够沿时间维度联合编码几何和外观信息。利用T-SLAT,MORPHOS通过因果注意力自回归生成动态3D资产,每一帧以其前序历史为条件,从而在处理不断演化的拓扑结构的同时确保时间一致性。我们还提出了一种时序结构增强方法,以减轻自回归生成中的误差累积。MORPHOS在多个基准上的外观质量达到了最先进水平,几何质量也具有竞争力,展现了出色的跨表示泛化能力和长时序生成的鲁棒性。

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

2026-06-02T04:00:00 · autoregressive, cs.CV, diffusion · arXiv:2606.02553

中文标题:LongLive-RAG:一种用于长视频生成的通用检索增强框架

作者:Qixin Hu, Shuai Yang, Wei Huang, Song Han, Yukang Chen

摘要:

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.

摘要中文:

自回归(AR)视频扩散模型能够实现可变长度合成,但长时序生成往往面临误差累积和身份漂移问题。为提高效率,现有方法在生成过程中普遍采用滑动窗口注意力机制。这造成了一条不可逆的生成轨迹:一旦活跃窗口累积了外观误差,后续生成只能以这一退化轨迹为条件,从而进一步加剧漂移。我们通过将长视频生成建模为检索增强生成(RAG)问题来应对这一局限性。我们不再仅仅依赖最近的窗口,而是将先前生成的潜变量视为一个动态可搜索的历史记录。我们提出了LongLive-RAG,这是一个用于自回归视频生成的通用检索框架。在每个新块生成时,LongLive-RAG使用查询嵌入来检索相关的历史潜变量。这一轻量级的检索步骤相较于生成过程只增加了少量开销,却能让生成器以非局部上下文而非仅以最近窗口为条件。为使检索更具区分性,我们引入了窗口时序差分损失(Window Temporal Delta Loss),该损失函数抑制冗余的局部相似性,并促使嵌入向量捕捉有意义的时序变化。这些组件共同帮助减少了由滑动窗口注意力引起的误差累积。在多种自回归主干网络和生成长度上的实验表明,LongLive-RAG提升了长视频质量,并取得了最佳的VBench-Long平均排名。据我们所知,在开放式自回归长视频生成方法中,LongLive-RAG首次将自生成的潜变量历史建模为内容可寻址的检索记忆。代码已开源于 https://github.com/qixinhu11/LongLive-RAG。
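
"用查询嵌入检索相关历史潜变量"这一步可以粗略地写成一次 top-k 相似度检索。下面是一个最小草图,包含若干摘要中没有说明的假设:相似度取余弦相似度,检索范围排除最近的滑动窗口(因为窗口内的块本来就在上下文里),retrieve_history 等名称均为演示用的假设命名。论文中的嵌入由 Window Temporal Delta Loss 训练得到,这里直接用手写的二维向量代替。

```python
import math


def cosine(a, b):
    """两个向量的余弦相似度;任一向量为零向量时返回 0。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_history(query, history, window, top_k):
    """从历史块中检索与查询最相似的 top_k 个块(示意用的假设接口)。

    history: [(块编号, 嵌入向量), ...],按生成顺序排列
    window:  最近多少个块属于滑动窗口,假设它们不参与检索
    """
    searchable = history[:-window] if window > 0 else history
    ranked = sorted(searchable, key=lambda item: cosine(query, item[1]), reverse=True)
    return [block_id for block_id, _ in ranked[:top_k]]


# 玩具历史:块 0、1 与查询方向接近,块 2 方向不同,块 3、4 位于滑动窗口内
history = [
    (0, [1.0, 0.0]),
    (1, [0.9, 0.1]),
    (2, [0.0, 1.0]),
    (3, [0.1, 0.9]),
    (4, [0.5, 0.5]),
]
retrieved = retrieve_history([1.0, 0.05], history, window=2, top_k=2)
print(retrieved)
```

检索到的块随后与最近窗口一起作为生成器的条件;具体如何注入检索结果,摘要没有给出,这里也不作假设。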

HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

2026-06-02T04:00:00 · autoregressive, cs.CV · arXiv:2606.02573

中文标题:HumanNOVA:基于单张图像的逼真、通用且快速的3D数字人建模

作者:Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang, Jonathan C. Liu, Zhiwen Fan, Kai Wang, Zhangyang Wang, Georgios Pavlakos

摘要:

In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions. Project page at https://HumanNOVA.github.io .

摘要中文:

本文提出HumanNOVA,一个逼真、通用且快速的模型,用于从单张RGB图像生成3D数字人虚拟形象。由于高质量、多样化3D人体数据的稀缺,同时实现逼真效果和泛化能力颇具挑战。为此,我们构建了一个可扩展的数据生成 pipeline,采用两种策略:第一是利用现有的绑定资产,并用日常生活中的大量姿态驱动其动画;第二是利用现有的多相机人体采集数据,通过拟合生成更多样化的训练视角。这两种策略使数据规模扩展至10万个资产,显著提升了数据的数量和多样性,有助于稳健的模型训练。在架构方面,HumanNOVA采用前馈、token条件化的虚拟形象建模框架,推理时间小于1秒,且无需测试时优化。给定输入图像和估计的简化人体网格(SMPL,无细节几何和外观信息),模型首先将两者编码为紧凑的token表示。这些token作为条件信号,通过交叉注意力融合,构建基于三平面(triplane)的3D虚拟形象表示。在多个基准数据集上的广泛实验表明,我们的方法在定量和定性评估上均具优势,并在不同输入图像条件下保持鲁棒。项目页面:https://HumanNOVA.github.io

From Zero to Hero: Training-Free Custom Concept Spawning in World Models

2026-06-02T04:00:00 · autoregressive, cs.CV · arXiv:2606.02575

中文标题:从零到英雄:世界模型中无需训练的自定义概念生成方法

作者:Kiymet Akdemir, Pinar Yanardag

摘要:

Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the moment the user navigates beyond what is visible in that frame, the unseen regions are populated by the base model's priors, with no mechanism for the user to specify what should appear and where. This is a fundamental limitation for applications such as gaming, interactive storytelling, and simulation, where controllable scene composition is essential. We refer to this missing capability as concept spawning: introducing a user-specified visual concept into a world model, analogous to spawning in a game engine. We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. SPAWN exploits a structural property of image-to-video backbones: the first slot of the context memory is pinned to the reference frame and acts as a foundational anchor for every generated chunk. By swapping this anchor with an external concept latent over a short injection window and letting the original anchor return, we cause the concept to propagate naturally through the rollout via the model's own memory. SPAWN supports concepts from fine-grained entities such as characters and props to large-scale elements such as buildings and landmarks, and accepts either a concept image or a text description as input. Experiments show that SPAWN integrates concepts with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without any training.

摘要中文:

自回归世界模型已成为交互式视频生成的有力范式,允许用户通过动作导航动态生成的环境。这些模型通常以文本提示和/或单帧参考图像为条件,从中生成整个世界。然而,当用户导航至该帧可见区域之外时,未见区域由基础模型的先验知识填充,用户无法指定应出现的内容及其位置。这是游戏、交互式故事叙述和模拟等应用的根本限制,这些应用需要可控的场景组合。我们将这种缺失的能力称为概念生成;即向世界模型中引入用户指定的视觉概念,类似于游戏引擎中的生成机制。我们提出了SPAWN(锚点替换与窗口注入),一种无需训练的概念生成方法。SPAWN利用图像到视频骨干网络的结构特性:上下文记忆的第一槽位固定在参考帧上,作为每个生成片段的基础锚点。通过在短注入窗口内将此锚点与外部概念潜向量交换,并让原始锚点回归,使概念能够通过模型自身的记忆在推演过程中自然传播。SPAWN支持从细粒度实体(如角色和道具)到大规模元素(如建筑物和地标)的概念,并接受概念图像或文本描述作为输入。实验表明,SPAWN能够在保持身份特征和时间一致性的同时,以一致的光照、尺度和视角整合概念,证明了在现有自回归世界模型中无需任何训练即可实现可控概念生成。
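
"在短注入窗口内替换槽位0的锚点,之后让原始锚点回归"可以理解为一个按片段编号切换的调度。下面的草图只演示这个调度本身,属于示意性假设:潜变量用字符串代替,build_context、inject_start、inject_len 都是为演示而虚构的名称,真实模型中上下文记忆的结构以论文为准。

```python
def build_context(chunk, reference, concept, recent, inject_start, inject_len):
    """返回生成第 chunk 个片段时使用的上下文记忆(示意用的假设接口)。

    槽位 0 是锚点:注入窗口内换成概念潜变量,窗口外恢复为参考帧。
    recent 是模型自身最近生成的片段,概念只通过它们向后传播。
    """
    in_window = inject_start <= chunk < inject_start + inject_len
    anchor = concept if in_window else reference
    return [anchor] + list(recent)


# 6 个片段,其中第 2、3 个片段处于注入窗口内
schedule = [
    build_context(c, "REF", "CONCEPT", recent=[], inject_start=2, inject_len=2)[0]
    for c in range(6)
]
print(schedule)
```

窗口结束后锚点恢复为参考帧,概念不再被显式注入,只能依靠 recent 中已生成的片段延续下去,这对应摘要所说的"通过模型自身的记忆自然传播"。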

ChWDTA: Channel-wise Wavelet-Domain Transformer Attention and Entropy Modeling for Learned Image Compression

2026-06-02T04:00:00 · autoregressive, cs.CV, cs.LG, eess.IV, image_compression · arXiv:2606.00111

中文标题:ChWDTA:用于端到端可学习图像压缩的通道级小波域Transformer注意力和熵建模

作者:Haisheng Fu, Runyu Yang, Feng Ding, Siyu Zhu, Jie Liang, Xiaoxiao Li, Zhenman Fang, Jingning Han

摘要:

State-of-the-art learned image compression (LIC) schemes are increasingly based on hybrid CNN-transformer architectures. To further improve rate-distortion performance, we introduce channel-wise wavelet transforms into both the transformer and entropy-coding components. First, we propose a channel-wise wavelet-domain transformer attention (ChWDTA) mechanism. ChWDTA keeps the efficient windowed spatial self-attention used in modern LIC backbones, but computes the Q/K/V projections on channel-wise wavelet-transformed features before mapping the attention output back with the inverse transform. The resulting Channel-wise Wavelet-Domain Transformer Block (ChWDTB) therefore preserves the spatial tokenization pattern of windowed attention while sparsifying the channel covariance seen by the attention projections. Second, in the entropy-coding stage, we introduce a channel-wise wavelet packet (ChWP) decomposition that produces four equal-sized subbands, which better fit channel-wise slice-based autoregressive entropy modeling. When each channel-wise subband is divided into two slices, we use eight slices for entropy coding. With this configuration, the proposed scheme obtains BD-rate reductions of -17.82%, -19.15%, and -22.56% on the Kodak, CLIC Professional Validation, and Tecnick test sets, respectively. Even when each channel-wise subband is coded as a single slice, the scheme still retains most of the coding gains with lower complexity. The results confirm the advantage of introducing wavelet transform in CNN-transformer-based LIC schemes.

摘要中文:

最先进的可学习图像压缩(LIC)方案越来越多地基于混合CNN-Transformer架构。为了进一步提高率失真性能,我们将通道级小波变换引入到Transformer和熵编码组件中。首先,我们提出了通道级小波域Transformer注意力(ChWDTA)机制。ChWDTA保留了现代LIC骨干网络中高效的窗口空间自注意力,但在通道级小波变换后的特征上计算Q/K/V投影,再通过逆变换将注意力输出映射回原域。因此,通道级小波域Transformer模块(ChWDTB)在保持窗口注意力空间标记化模式的同时,稀疏化了注意力投影所见的通道协方差。其次,在熵编码阶段,我们引入通道级小波包(ChWP)分解,生成四个等大小的子带,更好地适应通道级分片自回归熵建模。当每个通道级子带被划分为两个分片时,我们使用八个分片进行熵编码。在这种配置下,所提出的方案在Kodak、CLIC Professional Validation和Tecnick测试集上的BD-rate分别为-17.82%、-19.15%和-22.56%。即使将每个通道级子带编码为单个分片,该方案仍能以更低的复杂度保持大部分编码增益。结果证实了在基于CNN-Transformer的LIC方案中引入小波变换的优势。
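
"四个等大小的子带,每个子带再分为两个分片,共八个分片"这一算术关系可以用一个两级小波包分解来核对。以下草图带有明确的假设:摘要没有说明所用的小波,这里选用最简单的正交Haar滤波器;输入只是单个空间位置上的通道向量;wavelet_packet_4 与 split_slices 均为演示用的假设命名。

```python
import math


def haar_step(x):
    """一级正交Haar分解:返回低频与高频两个半长序列(长度须为偶数)。"""
    s = 1.0 / math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return low, high


def wavelet_packet_4(channels):
    """沿通道维做两级完整小波包分解,得到四个等长子带。"""
    low, high = haar_step(channels)
    ll, lh = haar_step(low)    # 小波包:低频分支继续分解
    hl, hh = haar_step(high)   # 高频分支同样继续分解
    return [ll, lh, hl, hh]


def split_slices(subbands, per_band):
    """把每个子带等分为 per_band 个分片,用于分片式自回归熵建模。"""
    slices = []
    for band in subbands:
        size = len(band) // per_band
        slices.extend(band[i * size:(i + 1) * size] for i in range(per_band))
    return slices


channels = [float(c) for c in range(16)]   # 假设有16个通道
subbands = wavelet_packet_4(channels)
slices = split_slices(subbands, per_band=2)
print(len(subbands), [len(b) for b in subbands], len(slices))
```

与只对低频分支递归的普通小波分解不同,小波包对高频分支也继续分解,所以各子带长度相同,便于划分为等大小的分片。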

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

2026-06-02T04:00:00 · autoregressive, cs.CV, cs.LG, cs.RO, diffusion · arXiv:2508.20072

中文标题:离散扩散VLA:将离散扩散引入视觉-语言-动作策略的动作解码

作者:Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

摘要:

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order with poor performance or attach separate diffusion heads outside the backbone that fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA that discretizes action chunks and models them with discrete diffusion pattern retaining progressive refinement inside the unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves the efficiency. Discrete Diffusion VLA achieves 96.4% avg. success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% of parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating well retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on AgileX Cobot Magic platform to show the method's effectiveness.

摘要中文:

视觉-语言-动作(Vision-Language-Action, VLA)模型对大型视觉语言骨干网络进行适配,以将图像和指令映射为机器人动作。然而,现有的VLA模型要么以固定的从左到右顺序自回归生成动作(性能不佳),要么在骨干网络外附加独立的扩散头(导致信息路径碎片化,阻碍统一且可扩展的架构)。为此,我们提出离散扩散VLA(Discrete Diffusion VLA),该方法将动作块离散化,并以离散扩散方式对其建模,在统一的Transformer骨干网络内保留渐进式细化能力。我们的方法实现了自适应解码顺序,先确定高置信度的动作元素,再处理较难的元素,并采用二次重掩码策略重新审视不确定的预测,从而实现鲁棒的错误修正。该设计保留了预训练的视觉语言先验,支持并行解码,并提高了效率。离散扩散VLA在LIBERO上达到96.4%的平均成功率,在SimplerEnv-Fractal上达到71.2%的视觉匹配率,在SimplerEnv-Bridge上达到54.2%的整体成功率。在LIBERO-Goal的分布外测试中,我们的方法仅出现0.8%的语言性能下降(并行解码为8.0%),以及20.4%的视觉性能下降(连续扩散为29.0%),表明其较好地保留了预训练的视觉语言能力。我们还在AgileX Cobot Magic平台上进行了两项真实机器人评估,以展示该方法的有效性。
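
"先确定高置信度元素、再处理较难元素,并对不确定的预测重新掩码"的解码流程可以写成下面的草图。这只是一个假设性的最小示意,并非论文的算法:每步提交的数量按剩余步数均分,重掩码采用固定置信度阈值,predict 是一个返回固定结果的玩具预测器,这些细节摘要都没有给出。

```python
import math

MASK = None  # 用 None 表示尚未解码的位置


def decode(predict, length, steps, remask_threshold):
    """按置信度自适应顺序解码,并带二次重掩码(示意用的假设接口)。

    predict(tokens) 返回每个位置的 (预测token, 置信度)。
    """
    tokens = [MASK] * length
    order = []  # 记录各位置被提交的先后顺序
    for step in range(steps):
        preds = predict(tokens)
        # 二次重掩码:已提交但当前置信度过低的位置重新变为待解码
        for i in range(length):
            if tokens[i] is not MASK and preds[i][1] < remask_threshold:
                tokens[i] = MASK
        masked = [i for i in range(length) if tokens[i] is MASK]
        if not masked:
            break
        # 把剩余位置均摊到剩余步数上;最后一步会提交所有剩余位置
        k = max(1, math.ceil(len(masked) / (steps - step)))
        ranked = sorted(masked, key=lambda j: preds[j][1], reverse=True)
        for i in ranked[:k]:
            tokens[i] = preds[i][0]
            order.append(i)
    return tokens, order


def toy_predict(tokens):
    """玩具预测器:预测与置信度都固定,不依赖已解码的上下文。"""
    return [(7, 0.9), (3, 0.4), (5, 0.8), (1, 0.6)]


tokens, order = decode(toy_predict, length=4, steps=4, remask_threshold=0.3)
print(tokens, order)
```

提交顺序由置信度而非位置决定,这正是它与固定的从左到右自回归解码的区别。玩具预测器的置信度不变,因此重掩码分支在本例中不会触发。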

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

2026-06-02T04:00:00 · autoregressive, cs.CV, diffusion · arXiv:2602.02214

中文标题:因果强制:正确实现的高质量实时交互视频生成自回归扩散蒸馏

作者:Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu

摘要:

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in Self Forcing. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing.

摘要中文:

为实现实时交互视频生成,当前方法将预训练的双向视频扩散模型蒸馏为少步自回归(AR)模型,但当完整注意力被替换为因果注意力时,会面临架构差距。然而,现有方法并未从理论上弥合这一差距。它们通过ODE蒸馏初始化AR学生模型,这要求帧级单射性(injectivity),即在AR教师的PF-ODE下,每个噪声帧必须映射到唯一的干净帧。从双向教师蒸馏AR学生违反了这一条件,无法恢复教师的流映射,只能得到条件期望解,从而导致性能下降。针对这一问题,我们提出了因果强制方法,该方法使用自回归教师进行ODE初始化以弥合架构差距,然后应用与自强制(Self Forcing)相同的DMD流程。实验表明,我们的方法在所有指标上均优于所有基线,在动态程度、视觉奖励和指令遵循方面分别超越SOTA自强制方法19.3%、8.7%和16.7%。项目页面:https://thu-ml.github.io/CausalForcing.github.io/;代码:https://github.com/thu-ml/Causal-Forcing

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

2026-06-02T04:00:00 · autoregressive, cs.CV, diffusion · arXiv:2605.15141

中文标题:因果强制++:用于实时交互视频生成的可扩展少步数自回归扩散蒸馏

作者:Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, Jun Zhu

摘要:

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by ~4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM.

摘要中文:

实时交互式视频生成需要低延迟、流式输出和可控的推演(rollout)。现有的自回归扩散蒸馏方法通过将双向基础模型蒸馏为少步数自回归学生模型,在分块式4步生成中取得了优异成绩,但仍受限于粗糙的响应粒度和不可忽视的采样延迟。本文研究了一个更为激进的设置:仅使用1-2个采样步数的逐帧自回归生成。在该设置下,我们发现少步数自回归学生的初始化是关键瓶颈:现有策略要么与目标不对齐,要么无法支持少步数生成,要么成本过高难以扩展。我们提出因果强制++(Causal Forcing++),这是一个有原则且可扩展的pipeline,采用因果一致性蒸馏(causal CD)进行少步数自回归初始化。其核心思想是,因果一致性蒸馏学习与因果ODE蒸馏相同的自回归条件流映射,但只从相邻时间步之间的单个在线教师ODE步骤获取监督,避免了预计算和存储完整概率流ODE轨迹的需要。这使得初始化更加高效且更容易优化。所得的Causal Forcing++ pipeline在逐帧2步设置下,在VBench Total上超越当前最先进的4步分块式因果强制方法0.1分,在VBench Quality上超越0.3分,在VisionReward上超越0.335分,同时将首帧延迟降低50%,并将第二阶段训练成本降低约4倍。我们进一步将该pipeline扩展到类Genie3的动作条件世界模型生成。项目页面:https://github.com/thu-ml/Causal-Forcing 和 https://github.com/shengshu-ai/minWM

Diffusion Models, Denoiser Architecture and Creativity

2026-06-02T04:00:00 · cs.CV, cs.LG, diffusion · arXiv:2605.16415

中文标题:扩散模型、去噪器架构与创造力

作者:Itamar Levine, Yair Weiss

摘要:

The creativity of diffusion models refers to their ability to generate highly realistic images that are different from their training data. Creativity is somewhat surprising since it is known that if the denoiser used in the diffusion model is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples. In this paper we present empirical and theoretical results that suggest that creativity in diffusion models is due to an interaction between the denoiser architecture and the target distribution. Theoretically, we give explicit forms for the distribution of generated samples as a function of the target distribution and the denoiser architecture for three different denoiser architectures (linear, polynomial, bottleneck). Empirically, we show that small changes in the popular UNET denoiser architecture leads to very different forms of creativity, and these small changes often yield samples that are highly nonrealistic. Taken together, our results show that diffusion models will only be successful if the inductive bias of the denoiser architecture is in strong alignment with the true target distribution.

摘要中文:

扩散模型的创造力指的是它们生成与训练数据不同的高度逼真图像的能力。创造力有些令人惊讶,因为已知如果扩散模型中使用的去噪器是给定训练集的贝叶斯最优去噪器,那么模型只会复制训练样本。在本文中,我们提供了经验和理论结果,表明扩散模型中的创造力是由于去噪器架构与目标分布之间的相互作用。理论上,我们给出了三种不同去噪器架构(线性、多项式、瓶颈)生成样本分布作为目标分布和去噪器架构函数的显式形式。经验上,我们表明,对流行的UNET去噪器架构的微小改变会导致非常不同的创造力形式,并且这些微小改变通常会产生高度不逼真的样本。综合来看,我们的结果表明,扩散模型只有在去噪器架构的归纳偏置与真实目标分布强烈对齐时才会成功。
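
"贝叶斯最优去噪器只会复制训练样本"这一点可以用一个小例子直接验证。下面的草图假设加噪形式为 x_t = x_0 + σ·ε(方差爆炸形式,属于本示例的假设,论文可能采用其他参数化)。在训练集的经验分布下,最优去噪器是后验均值,即各训练样本的加权平均,权重为 softmax(-||x_t - x_i||² / (2σ²))。函数名 optimal_denoise 为演示用的假设命名。

```python
import math


def optimal_denoise(x_t, train, sigma):
    """经验分布下的贝叶斯最优去噪器:训练样本的后验加权平均。"""
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x_t, x)) / (2.0 * sigma ** 2)
        for x in train
    ]
    top = max(logits)                            # 减去最大值,避免指数上溢
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [
        sum(w * x[d] for w, x in zip(weights, train)) / total
        for d in range(len(x_t))
    ]


train = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]    # 三个训练样本
noisy = [3.0, 0.5]                               # 离 [4, 0] 最近的带噪点

small = optimal_denoise(noisy, train, sigma=0.1)   # 低噪声:权重集中到最近样本
large = optimal_denoise(noisy, train, sigma=10.0)  # 高噪声:接近训练集均值
print(small, large)
```

噪声水平趋近于零时,输出坍缩到最近的训练样本,因此用这种去噪器采样只能复现训练集。摘要的论点正是:实际模型的"创造力"来自去噪器架构对这一最优解的偏离。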

Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation

2026-06-02T04:00:00 · autoregressive, cs.CV, diffusion · arXiv:2605.30855

中文标题:鲁棒Dreamer:面向动作控制AR视频生成的偏差感知潜在高斯记忆

作者:Hanlin Chen, Jiaxin Wei, Xibin Song, Yifu Wang, Steve Wang, Hongdong Li, Pan Ji, Gim Hee Lee

摘要:

Frame-wise action-controlled image-to-video generation is a promising paradigm for interactive world simulation, where each control signal should elicit an immediate visual response. However, maintaining visual fidelity and 3D consistency over long autoregressive rollouts remains challenging. Existing 3D-aware methods often suffer from catastrophic drift due to two impediments: information loss from Latent-RGB Cycling, where generated latents are repeatedly decoded to RGB and re-encoded for future conditioning, and the training-inference gap induced by the error-free hypothesis, where clean training memory fails to match prediction-corrupted inference memory. To address these challenges, we present Robust Dreamer, a memory-augmented framework built around how to design 3D memory and how to use it robustly. First, we introduce Latent Gaussian Memory, which anchors diffusion latents inherited from the generation process to Gaussian primitives and recalls them via latent-space Gaussian splatting. This provides dense, geometry-aware, view-aligned conditioning while avoiding accumulated degradation from repeated VAE conversion. Second, we propose Deviation Learning with Dynamic Deviation Archive, which synthesizes rollout-induced latent deviations through a one-step approximation, stores them by autoregressive stage and denoising timestamp, and injects them into historical memory during training. This exposes the generator to realistic corrupted memory states and teaches internal correction before inference. Experiments on ScanNet, DL3DV, and OmniWorldGame demonstrate state-of-the-art long-horizon performance.

摘要中文:

逐帧动作控制的图像到视频生成是交互式世界模拟的一种有前景的范式,其中每个控制信号都应引发即时的视觉响应。然而,在长时间自回归推演中保持视觉保真度和3D一致性仍然具有挑战性。现有的3D感知方法经常出现灾难性漂移,主要源于两个障碍:一是潜变量-RGB循环(Latent-RGB Cycling)导致的信息损失,即生成的潜变量被反复解码为RGB并重新编码以用于后续条件;二是由无误差假设引起的训练-推理差距,即干净的训练记忆无法匹配被预测误差污染的推理记忆。为了应对这些挑战,我们提出了鲁棒Dreamer,一个围绕如何设计3D记忆以及如何稳健地使用它而构建的记忆增强框架。首先,我们引入了潜在高斯记忆,将从生成过程继承的扩散潜变量锚定到高斯基元上,并通过潜空间高斯溅射进行召回。这提供了稠密的、几何感知的、与视图对齐的条件,同时避免了反复VAE转换带来的累积退化。其次,我们提出了带动态偏差存档的偏差学习,通过一步近似合成推演引起的潜变量偏差,按自回归阶段和去噪时间步存储它们,并在训练期间将它们注入历史记忆。这使生成器在训练中接触到接近真实的受损记忆状态,并在推理之前学会内部校正。在ScanNet、DL3DV和OmniWorldGame上的实验表明,该方法具有最先进的长时域性能。

Diffusion Models for Hyperspectral Image Analysis: A Comprehensive Review

2026-06-02T04:00:00 · cs.CV, diffusion, eess.IV · arXiv:2505.11158

中文标题:高光谱图像分析中的扩散模型:综合综述

作者:Xing Hu, Xiangcheng Liu, Qianqian Duan, Lian Zhang, Huiliang Shang, Linhua Jiang, Haima Yang, Dawei Zhang

摘要:

Hyperspectral image (HSI) analysis plays a critical role in remote sensing, agriculture, and environmental monitoring. However, traditional methods often struggle to handle the high dimensionality, spectral redundancy, and noise inherent in HSI data, limiting their accuracy and scalability. Recently, diffusion models, including denoising diffusion probabilistic models and other generative frameworks based on stochastic differential equations, have shown strong potential in capturing complex spectral-spatial structures and generating high-fidelity HSI data. These models offer effective solutions for tasks such as noise suppression, data augmentation, classification, and anomaly detection. This review presents a systematic summary of recent advances in diffusion models for HSI processing. We categorize existing methods, highlight their strengths in handling high-dimensional data, and compare their performance with conventional approaches. Special attention is given to critical applications such as change detection and post-disaster anomaly identification. The review also discusses current limitations, such as computational cost and training stability, and outlines potential research directions. Our main contributions can be summarized as follows: we provide a systematic taxonomy of diffusion-based HSI methods, examine their applications across major remote sensing tasks, and offer perspectives on potential directions for future research. With these efforts, this review seeks to support the community in harnessing deep learning models to achieve more effective and efficient hyperspectral image analysis.

摘要中文:

高光谱图像(HSI)分析在遥感、农业和环境监测中发挥着关键作用。然而,传统方法往往难以处理高光谱图像数据固有的高维度、光谱冗余和噪声问题,限制了其准确性和可扩展性。近年来,包括去噪扩散概率模型和基于随机微分方程的其他生成框架在内的扩散模型,在捕捉复杂的光谱-空间结构和生成高保真高光谱数据方面展现出强大潜力。这些模型为噪声抑制、数据增强、分类和异常检测等任务提供了有效的解决方案。本综述对高光谱图像处理中扩散模型的最新进展进行了系统总结。我们对现有方法进行分类,突出其在处理高维数据方面的优势,并与传统方法进行性能比较。特别关注变化检测和灾后异常识别等关键应用。本综述还讨论了当前的局限性,如计算成本和训练稳定性,并展望了潜在的研究方向。我们的主要贡献可归纳如下:提供了基于扩散的高光谱图像方法的系统分类,审查了其在主要遥感任务中的应用,并展望了未来研究的潜在方向。通过这些努力,本综述旨在支持学界利用深度学习模型实现更有效、更高效的高光谱图像分析。

image_compression
Image Compression
5 篇论文

图像压缩领域每日总览

今日图像压缩领域的论文呈现出多元化创新趋势,主要集中在以下几个方向:超低比特率压缩技术、生成式压缩模型、端到端可扩展编码以及针对人机混合场景的优化。值得注意的是,多篇论文开始探索语义信息与像素级特征的深度融合,同时Transformer架构在小波域熵建模中的应用也成为新的研究热点。此外,矢量量化(VQ)与超先验结合的生成式压缩方案显示出显著潜力,为高效重建质量提供了新思路。

重点论文推荐:

  • Exploiting Semantic and Pixel Representations for Ultra-Low Bitrate Image Compression — 提出基于扩散的SPRDiff,同时利用语义表示与像素表示,在超低比特率(低于0.03 bpp)下兼顾感知质量与像素级保真度,对极端压缩场景具有参考价值。
  • Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs — 提出参考引导的深度压缩VAE架构,实现了说话人像视频的实时生成与流式传输,为数字人生成和低带宽视频通信提供了新方案。
  • ChWDTA: Channel-wise Wavelet-Domain Transformer Attention and Entropy Modeling for Learned Image Compression — 在通道级小波变换后的特征上计算注意力的Q/K/V投影,并以通道级小波包分解配合分片自回归熵建模,在Kodak、CLIC Professional Validation和Tecnick上的BD-rate分别为-17.82%、-19.15%和-22.56%。
  • HyperVQ: Enabling Hyperprior Entropy Modeling for VQ-Based Generative Image Compression — 解决了矢量量化生成压缩中熵建模困难的瓶颈问题,通过超先验机制大幅提升了码率利用效率,是生成式压缩的重要进展。
  • Training-Free Continuous Bitrate Control for Scalable Image Coding — 提出无需训练的连续比特率控制方法,同时满足人类感知和机器视觉需求,为可扩展编码的实际应用提供了灵活方案。

Exploiting Semantic and Pixel Representations for Ultra-Low Bitrate Image Compression

2026-06-02T04:00:00 · cs.CV, diffusion, image_compression · arXiv:2606.01608

中文标题:利用语义和像素表示的超低比特率图像压缩

作者:Hao Wei, Yanhui Zhou, Chenyang Ge, Saeed Anwar, Ajmal Mian

摘要:

Most existing extreme compression methods fail to achieve an optimal rate-distortion-perception trade-off, as they typically prioritize perceptual fidelity and visual realism over pixel-level accuracy. Consequently, the resulting reconstructions often deviate noticeably from the originals. Ultra-low bitrate image compression is therefore crucial, not only for producing extremely compact representations but also for ensuring that reconstructed images remain semantically coherent and faithful to the source at the pixel level. To this end, we propose SPRDiff, a diffusion-based compression method that fully leverages both semantic and pixel representations, thereby enhancing reconstruction fidelity under ultra-low bitrate constraints. Specifically, we develop a triple-encoder architecture that utilizes high-fidelity features from the pretrained distortion-oriented and semantic-oriented encoders to compensate for the limited representations extracted by the frozen VAE encoder, thereby improving latent compression and entropy modeling. To further enhance the reconstruction fidelity of diffusion models, we introduce a distortion-aware reconstruction module with dual feature extraction. This module not only generates a coarse reconstruction that preserves the main structures, but also provides practical and accurate semantic- and pixel-level conditional signals to guide the diffusion model. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in the rate-distortion-perception trade-off at extremely low bitrates (below 0.03 bpp), effectively preserving both perceptual quality and pixel-wise fidelity in the reconstructed images. We will release the source code and trained models at https://github.com/cshw2021/SPRDiff.

Abstract (Chinese):

现有大多数极端压缩方法未能实现最优的率-失真-感知权衡,因为它们通常优先考虑感知保真度和视觉真实感,而忽视像素级精度。因此,重建结果与原始图像存在明显偏差。超低比特率图像压缩至关重要——不仅能够生成极紧凑的表示,还能确保重建图像在语义上保持一致并在像素级别上忠实于源图像。为此,我们提出了SPRDiff,这是一种基于扩散的压缩方法,充分挖掘语义和像素表示,从而在超低比特率约束下提升重建质量。具体而言,我们开发了一种三重编码器架构,利用预训练的失真导向编码器和语义导向编码器的高保真特征来补偿冻结VAE编码器提取的有限表示,从而改进潜在压缩和熵建模。为进一步提升扩散模型的重建质量,我们引入了具有双特征提取功能的失真感知重建模块。该模块不仅生成保留主要结构的粗重建,还提供实用且精确的语义级和像素级条件信号来引导扩散模型。在基准数据集上的大量实验表明,我们的方法在极低比特率(低于0.03 bpp)下的率-失真-感知权衡方面优于现有最佳方法,有效保持了重建图像的感知质量和像素级保真度。我们将在 https://github.com/cshw2021/SPRDiff 上发布源代码和训练模型。
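To give a feel for what "below 0.03 bpp" means in practice, the arithmetic can be sketched as follows. The 768×512 resolution is an assumption chosen for illustration (it matches the images in the Kodak test set); the abstract does not say which benchmark datasets were used, and `bit_budget` is a hypothetical helper name.

```python
def bit_budget(bpp, width, height):
    """Return (total bits, total bytes) for an image coded at `bpp` bits per pixel."""
    bits = bpp * width * height
    return bits, bits / 8


# Assumed example resolution: 768x512 (illustrative, not taken from the abstract).
bits, nbytes = bit_budget(0.03, 768, 512)
print(f"{bits:.0f} bits, about {nbytes:.0f} bytes")

# Raw 24-bit RGB at the same resolution, for comparison.
raw_bits, raw_bytes = bit_budget(24, 768, 512)
print(f"compression ratio vs raw RGB: about {raw_bits / bits:.0f}:1")
```

Under these assumptions the whole image must fit in roughly 1.5 kB, about 800 times smaller than uncompressed 24-bit RGB, which is why conditioning signals for the diffusion model have to be so compact.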

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

2026-06-02T04:00:00 · autoregressive, cs.CV, diffusion, image_compression · arXiv:2606.01620

Chinese title: 基于参考引导深度压缩VAE的可流式传输说话人像视频实时生成

Authors: Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang, Zhan Chen, Jiaolong Yang, Baining Guo

Abstract:

Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.

Abstract (Chinese):

视频扩散模型在肖像视频生成方面取得了显著进展,但其高计算需求限制了其在交互式应用中的使用。本研究提出了一个以语音音频和参考图像为条件、生成可流式传输说话人像视频的框架。该框架专为流式场景设计,包含一个用于深度潜空间压缩的因果视频VAE和一个自回归潜变量去噪模型。我们的因果VAE整合了可变数量的参考图像作为引导,使网络能够专注于动态信息而非静态外观,从而提高了压缩效率和重建质量。此外,我们扩展了残差自编码范式,以改进VAE中对时空因果性的处理。生成器基于修正流(Rectified Flow)Transformer架构,以分块自回归方式生成视频潜变量。我们的方法实现了高质量说话人像视频的实时生成,生成速度显著快于基线模型。此外,全面的实验表明,本方法在真实感、生动性和视频质量方面与这些大型模型相当,甚至更优。
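The blockwise autoregressive generation described in the abstract can be pictured with a toy loop. This is only a sketch under stated assumptions: `toy_velocity` is a hypothetical stand-in for the learned Rectified Flow Transformer, the latents are short lists of floats rather than video latents, and the convention that sampling runs from noise at t = 0 to data at t = 1 is assumed rather than taken from the paper.

```python
import random


def toy_velocity(x, t, context, audio):
    """Hypothetical stand-in for the learned velocity network.

    It points each latent value toward a target built from the previous
    block (context) and the audio feature for the current block.
    """
    target = [0.5 * c + a for c, a in zip(context, audio)]
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]


def sample_block(context, audio, steps, rng):
    """Integrate the flow with Euler steps, starting from Gaussian noise."""
    x = [rng.gauss(0.0, 1.0) for _ in context]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) never hits zero
        v = toy_velocity(x, t, context, audio)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate(audio_blocks, dim=4, steps=4, seed=0):
    """Generate latent blocks one at a time, each conditioned on the last."""
    rng = random.Random(seed)
    context = [0.0] * dim
    blocks = []
    for audio in audio_blocks:
        block = sample_block(context, audio, steps, rng)
        blocks.append(block)
        context = block  # causal: later blocks only see earlier outputs
    return blocks


blocks = generate([[1.0] * 4, [0.0] * 4])
print(blocks)
```

Because each block depends only on blocks that were already produced, a finished block can be decoded and streamed while the next one is still being sampled. The toy velocity field is deliberately degenerate (it always lands on its target regardless of the starting noise); a trained model would not behave this way.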

ChWDTA: Channel-wise Wavelet-Domain Transformer Attention and Entropy Modeling for Learned Image Compression

2026-06-02T04:00:00 · autoregressive, cs.CV, cs.LG, eess.IV, image_compression · arXiv:2606.00111

Chinese title: ChWDTA:用于学习式图像压缩的通道级小波域Transformer注意力和熵建模

Authors: Haisheng Fu, Runyu Yang, Feng Ding, Siyu Zhu, Jie Liang, Xiaoxiao Li, Zhenman Fang, Jingning Han

Abstract:

State-of-the-art learned image compression (LIC) schemes are increasingly based on hybrid CNN-transformer architectures. To further improve rate-distortion performance, we introduce channel-wise wavelet transforms into both the transformer and entropy-coding components. First, we propose a channel-wise wavelet-domain transformer attention (ChWDTA) mechanism. ChWDTA keeps the efficient windowed spatial self-attention used in modern LIC backbones, but computes the Q/K/V projections on channel-wise wavelet-transformed features before mapping the attention output back with the inverse transform. The resulting Channel-wise Wavelet-Domain Transformer Block (ChWDTB) therefore preserves the spatial tokenization pattern of windowed attention while sparsifying the channel covariance seen by the attention projections. Second, in the entropy-coding stage, we introduce a channel-wise wavelet packet (ChWP) decomposition that produces four equal-sized subbands, which better fit channel-wise slice-based autoregressive entropy modeling. When each channel-wise subband is divided into two slices, we use eight slices for entropy coding. With this configuration, the proposed scheme obtains BD-rate reductions of -17.82%, -19.15%, and -22.56% on the Kodak, CLIC Professional Validation, and Tecnick test sets, respectively. Even when each channel-wise subband is coded as a single slice, the scheme still retains most of the coding gains with lower complexity. The results confirm the advantage of introducing wavelet transform in CNN-transformer-based LIC schemes.

Abstract (Chinese):

最先进的学习式图像压缩(LIC)方案越来越多地基于混合CNN-Transformer架构。为了进一步提高率失真性能,我们将通道级小波变换引入到Transformer和熵编码组件中。首先,我们提出了通道级小波域Transformer注意力(ChWDTA)机制。ChWDTA保留了现代LIC骨干网络中高效的窗口空间自注意力,但在通道级小波变换后的特征上计算Q/K/V投影,再通过逆变换将注意力输出映射回去。因此,通道级小波域Transformer模块(ChWDTB)在保持窗口注意力空间标记化模式的同时,稀疏化了注意力投影所见的通道协方差。其次,在熵编码阶段,我们引入通道级小波包(ChWP)分解,生成四个等大小的子带,更好地适应基于通道分片的自回归熵建模。当每个通道级子带被划分为两个分片时,共使用八个分片进行熵编码。在这种配置下,所提出的方案在Kodak、CLIC Professional Validation和Tecnick测试集上分别获得了-17.82%、-19.15%和-22.56%的BD-rate降低。即使将每个通道级子带编码为单个分片,该方案仍能保持大部分编码增益,同时降低复杂度。结果证实了在基于CNN-Transformer的LIC方案中引入小波变换的优势。
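The "four equal-sized subbands, two slices each, eight slices in total" bookkeeping can be illustrated with a small sketch. It assumes an orthonormal Haar filter and a two-level wavelet packet applied along the channel axis of a single spatial position; the abstract does not state which wavelet or depth the authors use, and the function names here are hypothetical.

```python
import math


def haar_step(x):
    """One orthonormal Haar analysis step on an even-length list.

    Returns (low, high), each half the input length.
    """
    s = 1.0 / math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return low, high


def wavelet_packet_4(channels):
    """Two-level packet: split BOTH halves again, giving 4 equal subbands."""
    low, high = haar_step(channels)
    ll, lh = haar_step(low)
    hl, hh = haar_step(high)
    return [ll, lh, hl, hh]


def split_into_slices(subbands, per_band=2):
    """Cut every subband into `per_band` equal slices for entropy coding."""
    out = []
    for band in subbands:
        n = len(band) // per_band
        out.extend(band[i * n:(i + 1) * n] for i in range(per_band))
    return out


channels = [float(i) for i in range(16)]  # channel vector at one position
subbands = wavelet_packet_4(channels)
coded_slices = split_into_slices(subbands, per_band=2)
print(len(subbands), [len(b) for b in subbands], len(coded_slices))
```

A plain dyadic wavelet transform would recurse only on the low band and yield subbands of sizes C/2, C/4 and C/4. Splitting both halves is what makes the four subbands equal in size, which suits an entropy model that codes fixed-size channel slices in sequence.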

Training-Free Continuous Bitrate Control for Scalable Image Coding for Humans and Machines

2026-06-02T04:00:00 · cs.CV, eess.IV, image_compression · arXiv:2606.00158

Chinese title: 面向人类和机器的可伸缩图像编码的免训练连续码率控制

Authors: Yui Tatsumi, Hiroshi Watanabe

Abstract:

Continuous variable-rate compression is highly demanded in real-world applications, but remains underexplored in scalable image coding for humans and machines. In this paper, we propose a training-free variable-rate scalable image coding framework. By adjusting quantization steps based on predicted scale values, the proposed method achieves continuous bitrate control while preserving high-scale information in the machine and enhancement layers. Experimental results demonstrate the effectiveness of the proposed method and highlight the importance of bitrate allocation between the two layers.

Abstract (Chinese):

连续可变码率压缩在实际应用中需求强烈,但在面向人类和机器的可伸缩图像编码中仍缺乏充分探索。本文提出了一种免训练的可变码率可伸缩图像编码框架。该方法通过基于预测的缩放值调整量化步长,在保持机器层和增强层中高尺度信息的同时实现连续码率控制。实验结果验证了所提方法的有效性,并突出了两层之间码率分配的重要性。
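The idea of steering bitrate by changing quantization steps, while leaving high-scale entries alone, can be sketched without any training. Everything specific below is an assumption made for illustration: the threshold rule, the value 2.0, the synthetic scales and latents, and the use of empirical entropy as a rate estimate. The abstract does not describe the authors' actual rule.

```python
import math
import random
from collections import Counter


def quantize(latents, scales, step, keep_threshold=2.0):
    """Quantize each latent; entries with a large predicted scale keep step 1.0.

    The threshold rule is an assumption made for this sketch only.
    """
    out = []
    for y, s in zip(latents, scales):
        q = 1.0 if s >= keep_threshold else step
        out.append(round(y / q) * q)
    return out


def empirical_bits(symbols):
    """Total bits implied by the empirical entropy of the quantized symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())


rng = random.Random(0)
# Synthetic stand-ins for predicted scales and the latents drawn from them.
scales = [0.5 if i % 2 else 3.0 for i in range(2000)]
latents = [rng.gauss(0.0, s) for s in scales]

for step in (1.0, 1.7, 2.5, 4.0):  # any real-valued step is allowed
    q = quantize(latents, scales, step)
    print(f"step={step:.1f}  estimated bits={empirical_bits(q):.0f}")
```

Since the step is a real number chosen at coding time, the rate can be moved continuously with a fixed set of network weights; coarser steps on the low-scale entries typically lower the estimated bit count while the high-scale entries are reproduced exactly as before.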

HyperVQ: Enabling Hyperprior Entropy Modeling for VQ-Based Generative Image Compression

2026-06-02T04:00:00 · cs.CV, image_compression · arXiv:2512.07192

Chinese title: HyperVQ:面向基于VQ的生成式图像压缩的超先验熵建模

Authors: Niu Yi, Xu Tianyi, Ma Mingming, Wang Xinkun

Abstract:

Vector Quantization (VQ) based generative image compression has achieved remarkable perceptual quality. However, existing VQ codecs suffer from two fundamental limitations. First, they lack efficient content-adaptive entropy modeling and rely on static frequencies, leading to low coding efficiency. Second, the inherent conflict between discrete indices and continuous priors prevents true end-to-end joint Rate-Distortion (RD) optimization. To resolve these issues, we propose HyperVQ, a principled framework that establishes a high-performance hyperprior entropy foundation for VQ-based codecs. The core insight of HyperVQ is to shift probability modeling entirely into the continuous embedding space. Instead of directly predicting probabilities for discrete symbols, HyperVQ predicts a high-dimensional continuous multivariate Gaussian distribution for the continuous latents. By treating the discrete codebook entries as fixed "anchors" in this space, we convert the continuous Gaussian density into categorical index probabilities based on relative distances. This elegant formulation provides a powerful, spatially-adaptive entropy engine and renders the cross-entropy rate objective fully differentiable, empowering the network to actively and dynamically optimize the RD trade-off during training. To ensure practicality, we design the lightweight H Block and the Probability Estimation Engine (PEE) to facilitate highly parallel, millisecond-level inference. Experiments demonstrate that HyperVQ acts as a universal module across diverse VQ architectures (single-scale, large-codebook, RVQ), achieving an average bitrate saving of 18.5%, which is 7.28x the saving achieved by conventional Huffman coding. This establishes a robust, RD-controllable foundation for next-generation generative image compression.

Abstract (Chinese):

基于矢量量化(VQ)的生成式图像压缩已取得显著的感知质量。然而,现有的VQ编解码器存在两个根本性局限。首先,它们缺乏高效的内容自适应熵建模,仅依赖静态频率,导致编码效率低下。其次,离散索引与连续先验之间的固有冲突阻碍了真正的端到端率失真(RD)联合优化。为解决这些问题,我们提出了HyperVQ,这是一个为VQ编解码器建立高性能超先验熵基础的原理性框架。HyperVQ的核心洞见是将概率建模完全转移至连续嵌入空间。HyperVQ不直接预测离散符号的概率,而是为连续潜变量预测高维多元高斯分布。通过将离散码本条目视为该空间中的固定锚点,我们根据相对距离将连续高斯密度转换为类别索引概率。这种优雅的公式提供了一个强大的空间自适应熵引擎,并使交叉熵率目标函数完全可微分,使网络能够在训练过程中主动、动态地优化RD权衡。为确保实用性,我们设计了轻量级H模块和概率估计引擎(PEE),以实现高度并行化的毫秒级推理。实验表明,HyperVQ作为通用模块在多种VQ架构(单尺度、大码本、RVQ)中表现出色,平均比特率节省达18.5%,是传统霍夫曼编码节省量的7.28倍。这为下一代生成式图像压缩建立了稳健的RD可控基础。
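One plausible reading of "convert the continuous Gaussian density into categorical index probabilities based on relative distances" is to evaluate the predicted Gaussian at every codebook entry and normalize. The sketch below assumes a diagonal covariance and a plain softmax over log-densities; both are assumptions for illustration, and the function names are hypothetical rather than the paper's H Block or PEE.

```python
import math


def index_probabilities(mean, std, codebook):
    """Softmax of diagonal-Gaussian log-densities evaluated at the codebook.

    The Gaussian normalizing constant is identical for every entry, so it
    cancels and only the scaled squared distances matter.
    """
    logits = [
        -0.5 * sum(((c - m) / s) ** 2 for c, m, s in zip(code, mean, std))
        for code in codebook
    ]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def rate_bits(probs, index):
    """Ideal code length, in bits, for transmitting `index` under `probs`."""
    return -math.log2(probs[index])


codebook = [[0.0, 0.0], [1.0, 1.0], [3.0, -2.0]]  # fixed "anchors"
mean, std = [0.9, 1.1], [0.5, 0.5]                # predicted per position
probs = index_probabilities(mean, std, codebook)
best = max(range(len(codebook)), key=lambda i: probs[i])
print(best, [round(p, 4) for p in probs], round(rate_bits(probs, best), 4))
```

Every operation here is differentiable with respect to `mean` and `std`, which is consistent with the abstract's claim that the cross-entropy rate term can be optimized jointly with distortion. An entry close to the predicted mean costs few bits, while a distant one becomes expensive.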

visual_tokenizer_1d
1D Visual Tokenizer
0 papers

No matching papers were found in this category today.

diffusion_visual_encoder
Diffusion Visual Encoder
0 papers

No matching papers were found in this category today.