
Daily arXiv Paper Digest

2026-06-07 · 19 papers · grouped by research area
Automated tracking · LLM overview · research radar
Total Papers: 19
Autoregressive: 5
Diffusion: 13
Image Compression: 1
1D Visual Tokenizer: 0
Diffusion Visual Encoder: 0
Daily Radar
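Daily Overview

Global Overview of Today's arXiv Papers

Today's papers follow two main threads: multimodal unification and methodological convergence. The boundary between diffusion and autoregressive models keeps blurring: DiG-Plan and World-Language-Action Model are each listed under both categories, reflecting the emerging patterns of "diffusion-guided planning" and "unified world modeling". On the theory side, work is moving from observing phenomena to explaining mechanisms: a deep-linear-network theory of grokking with conditional ReLU reductions and a symbolic analysis of tree-ensemble sensitivity each try to explain model behavior rather than only report it.

On the application side, the emphasis is on safety-critical settings: RiskFlow generates safety-critical traffic scenarios, CLEAR performs adaptive routing for end-to-end autonomous driving, and Edit-R2 handles multi-turn image editing. These works put reliability and controllability ahead of raw performance gains. Physical structure is also surfacing inside generative models: The Invisible Hand of Physics finds that physical plausibility can be linearly decoded from the internal states of a video diffusion model, which offers a new angle on interpretability.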

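Papers worth noting today:

  • DiG-Plan (cross-listed): pairs a diffusion-based proposer with an AR refiner for tool-graph planning to mitigate early commitment, a new direction for LLM tool-use planning
  • World-Language-Action Model (cross-listed): unifies world modeling, language reasoning, and action synthesis in one embodied foundation model
  • The Invisible Hand of Physics: finds that video diffusion models encode more physical knowledge than their outputs show, opening a line of interpretability research
  • When Attention Beats Fourier: compares attention mechanisms with Fourier methods for PDE solving, giving guidance on architecture choice
  • UniVoice: a unified model for speech and singing voice generation, with a speech PER on par with dedicated TTS systems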
Autoregressive
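5 papers

Today's autoregressive papers span tool-use planning, training-dynamics theory, multimodal world models, model sensitivity analysis, and neural scaling laws. The trend is a shift from single-architecture models toward unified cross-modal reasoning, alongside deeper work on training mechanisms and theoretical interpretability. Two of the five papers, on grokking and on shallow-network scaling laws, focus on the mechanics behind scaling and training dynamics.

  • World-Language-Action Model for Unified World Modeling (2606.05979): proposes a unified world-language-action model that gives embodied AI an end-to-end framework for reasoning and acting.
  • Deciphering Two Training Clocks in Grokking (2606.05863): uses deep linear networks to separate grokking into two training clocks, fast loss decay and slower representation simplification, which is theoretically valuable for understanding generalization.
  • DiG-Plan: Mitigating Early Commitment (2606.05728): applies diffusion models to tool-graph planning to mitigate early commitment, a new direction for LLM planning.
  • Scaling Laws and Spectra of Shallow Neural Networks (2509.24882): derives scaling laws for quadratic and diagonal networks in the feature-learning regime, extending a theory that has so far been largely confined to linear models.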

DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance

2026-06-06T04:00:00 · autoregressive, cs.AI, cs.CL, diffusion · arXiv:2606.05728

Authors: Yansi Li, Zhuosheng Zhang

Abstract:

Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG-Plan.
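
The Pass@10 coverage figure above can be read as the share of tasks for which at least one of ten sampled plans is correct. The Python sketch below implements that empirical reading. The function name and the toy outcomes are illustrative assumptions, and the paper may use a different estimator.

```python
def pass_at_k_coverage(outcomes, k):
    """Share of tasks solved by at least one of the first k samples.

    outcomes: one list of booleans per task (True = that sampled plan is correct).
    """
    if not outcomes:
        raise ValueError("need at least one task")
    solved = sum(1 for samples in outcomes if any(samples[:k]))
    return solved / len(outcomes)


# Toy data (hypothetical): 4 tasks, 3 sampled plans each.
toy_outcomes = [
    [False, False, True],
    [False, False, False],
    [True, False, False],
    [False, True, False],
]
coverage_at_1 = pass_at_k_coverage(toy_outcomes, k=1)
coverage_at_3 = pass_at_k_coverage(toy_outcomes, k=3)
```

Coverage can only grow with k, so a sampler that spreads its samples over more distinct tool sets gains more from extra samples than one that keeps repeating the same early choices.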

Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

2026-06-06T04:00:00 · autoregressive, cs.AI, cs.LG · arXiv:2606.05863

Authors: Hu Tan, Kuo Gai, Shihua Zhang

Abstract:

Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level epsilon on a logarithmic time scale. In contrast, when layerwise weight decay is present, the induced regularization on the end-to-end map can be expressed as a Schatten-type penalty; under a sharp late-time Kurdyka-Lojasiewicz tail, this structural energy closes on a polynomial time scale. The two clocks, therefore, separate fitting from representation simplification. We then explain how the same mechanism can appear in ReLU MLPs. In regions where the activation patterns on the training set remain fixed, the network reduces to a linear model in the active coordinates. In a two-layer ReLU embedding model, chain-rule estimates further show that the classifier head can receive larger effective gradients than the embedding block under controlled downstream norms. This supports a two-stage mechanism in which the classifier fits first, while the representation continues to simplify later. We use modular addition as the main experimental setting. The deep linear theory provides the rigorous core of the analysis. But the ReLU results are formulated as conditional reductions that account for empirical behavior without claiming a global proof for nonlinear training dynamics.
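
To make the "Schatten-type penalty" concrete: a standard result in the deep-linear literature says that, for a depth-L factorization of an end-to-end map with singular values σ_j, the smallest achievable sum of squared layer norms is L·Σ_j σ_j^(2/L). The Python sketch below checks this on a diagonal toy map, one scalar chain per singular value. Whether this is exactly the penalty used in the paper is an assumption, and all names are illustrative.

```python
def weight_decay_cost(factors):
    """Sum of squared layer weights for one scalar chain w = w_L * ... * w_1."""
    return sum(w * w for w in factors)


def balanced_factors(sigma, depth):
    """Split a singular value sigma >= 0 evenly across `depth` layers."""
    return [sigma ** (1.0 / depth)] * depth


def schatten_penalty(singular_values, depth):
    """depth * sum_j sigma_j^(2/depth), a Schatten-(2/depth) quasi-norm penalty."""
    return depth * sum(s ** (2.0 / depth) for s in singular_values)


singular_values = [4.0, 1.0, 0.25]   # toy end-to-end map (hypothetical)
depth = 3

# Balanced layers attain the Schatten-type value.
balanced_cost = sum(
    weight_decay_cost(balanced_factors(s, depth)) for s in singular_values
)
# Same end-to-end map, but all of each singular value sits in one layer.
unbalanced_cost = sum(
    weight_decay_cost([s] + [1.0] * (depth - 1)) for s in singular_values
)
```

By the AM-GM inequality, no factorization of the same map costs less than the balanced one, which is why layerwise weight decay acts as a penalty on the singular values of the end-to-end map.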

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

2026-06-06T04:00:00 · autoregressive, cs.AI, cs.RO, diffusion · arXiv:2606.05979

Authors: Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

Abstract:

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

2026-06-06T04:00:00 · autoregressive, cs.AI, cs.LG · arXiv:2605.13830

Authors: Ajinkya Naik, Chaitanya Garg, S. Akshay, Ashutosh Gupta, Kuldeep S. Meel

Abstract:

Decision tree ensembles (DTE) are a popular model for a wide range of AI classification tasks, used in multiple safety critical domains, and hence verifying properties on these models has been an active topic of study over the last decade. One such verification question is the problem of sensitivity, which asks, given a DTE, whether a small change in subset of features can lead to misclassification of the input. In this work, our focus is to build a quantitative notion of sensitivity, tailored to DTEs, by discretizing the input space of the model and enumerating the regions which are susceptible to sensitivity. We propose a novel algorithmic technique that can perform this computation efficiently, within a certified error and confidence bound. Our approach is based on encoding the problem as an algebraic decision diagram (ADD), and further splitting it into subproblems that can be solved efficiently and make the computation compositional and scalable. We evaluate the performance of our technique over benchmarks of varying size in terms of number of trees and depth, comparing it against the performance of model counters over the same problem encoding. Experimental results show that our tool XCount achieves significant speedup over other approaches and can scale well with the increasing sizes of the ensembles.
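
As a rough illustration of what a quantitative sensitivity measure counts, the Python sketch below enumerates a discretized input grid by brute force and reports the fraction of grid points whose ensemble label flips when only a chosen subset of features moves by at most one grid step. This is not the ADD-based XCount algorithm. The tree encoding, the majority vote, and the toy ensemble are assumptions made for the example.

```python
from itertools import product


# A tree is either an int leaf label (0 or 1) or (feature, threshold, left, right).
def predict_tree(tree, x):
    while not isinstance(tree, int):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree


def predict_ensemble(trees, x):
    votes = sum(predict_tree(tree, x) for tree in trees)
    return int(2 * votes > len(trees))          # majority vote over 0/1 labels


def sensitive_fraction(trees, grid, num_features, subset, radius=1):
    """Fraction of grid points whose label flips when only the features in
    `subset` move by at most `radius` grid steps."""
    offsets = list(product(range(-radius, radius + 1), repeat=len(subset)))
    sensitive = 0
    total = 0
    for idx in product(range(len(grid)), repeat=num_features):
        total += 1
        base = predict_ensemble(trees, [grid[i] for i in idx])
        for delta in offsets:
            moved = list(idx)
            for feature, step in zip(subset, delta):
                moved[feature] += step
            if any(i < 0 or i >= len(grid) for i in moved):
                continue                         # perturbation leaves the grid
            if predict_ensemble(trees, [grid[i] for i in moved]) != base:
                sensitive += 1
                break
    return sensitive / total


# Toy ensemble (hypothetical): three depth-1 trees over two features.
TREES = [(0, 1, 0, 1), (1, 1, 0, 1), (0, 2, 0, 1)]
GRID = [0, 1, 2, 3]
sensitivity_feature_0 = sensitive_fraction(TREES, GRID, 2, subset=[0])
sensitivity_feature_1 = sensitive_fraction(TREES, GRID, 2, subset=[1])
```

The enumeration grows exponentially with the number of features, which is the cost that a symbolic, compositional encoding is meant to avoid.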

Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime

2026-06-06T04:00:00 · autoregressive, cond-mat.dis-nn, cs.AI, cs.LG, stat.ML · arXiv:2509.24882

Authors: Leonardo Defilippis, Yizhou Xu, Julius Girardin, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová, Bruno Loureiro, Florent Krzakala

Abstract:

Neural scaling laws underlie many of the recent advances in deep learning, yet their theoretical understanding remains largely confined to linear models. In this work, we present a systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime. Leveraging connections with matrix compressed sensing and LASSO, we derive a detailed phase diagram for the scaling exponents of the excess risk as a function of sample complexity and weight decay. This analysis uncovers crossovers between distinct scaling regimes and plateau behaviors, mirroring phenomena widely reported in the empirical neural scaling literature. Furthermore, we establish a precise link between these regimes and the spectral properties of the trained network weights, which we characterize in detail. As a consequence, we provide a theoretical validation of recent empirical observations connecting the emergence of power-law tails in the weight spectrum with network generalization performance, yielding an interpretation from first principles.
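
A scaling exponent of the kind discussed above is usually read off as the slope of excess risk against sample size on log-log axes. The Python sketch below fits that slope by ordinary least squares. The data are synthetic and noise-free, and the function name and constants are illustrative assumptions rather than values from the paper.

```python
import math


def fit_power_law(sample_sizes, risks):
    """Least-squares fit of risk ~ prefactor * n^(-exponent) in log-log space."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(r) for r in risks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic excess-risk curve (hypothetical): risk = 3 * n^(-1/2).
sample_sizes = [100, 200, 400, 800, 1600]
risks = [3.0 * n ** -0.5 for n in sample_sizes]
exponent, prefactor = fit_power_law(sample_sizes, risks)
```

On real curves the fit should be restricted to a single regime, since the crossovers and plateaus described in the abstract would otherwise bias the estimated slope.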

Diffusion
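13 papers

Overview of today's Diffusion papers:

Two trends stand out in today's Diffusion papers. First, diffusion models are pushing further into multimodal and embodied settings, covering unified vision-language-action models, end-to-end autonomous driving, and tool planning. Second, theory continues to deepen, with a new result that maps score-based sampling to adiabatic transport. Guidance for discrete diffusion models and one-step generation also stand out as directions for improving efficiency.

Recommended papers:

  • Generating Graph-Like Logical Rules for Knowledge Graph Reasoning via Diffusion Models: described as the first work to generate logical rules for knowledge-graph reasoning with diffusion models, linking symbolic reasoning with generative AI.
  • UniVoice: A Unified Model for Speech and Singing Voice Generation: generates both speech and singing with one model, extending diffusion-style models further into audio.
  • The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show: reveals physical knowledge encoded inside video diffusion models, which matters for understanding how far generative models act as world models.
  • The Score Hamiltonian: Mapping Diffusion Models to Adiabatic Transport: establishes a mathematical correspondence between diffusion sampling and adiabatic transport, yielding density reconstruction bounds and principled annealing schedules.
  • RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation: generates safety-critical scenarios for autonomous driving efficiently, with clear practical value for simulation-based testing.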

DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance

2026-06-06T04:00:00 · autoregressive, cs.AI, cs.CL, diffusion · arXiv:2606.05728

Authors: Yansi Li, Zhuosheng Zhang

Abstract:

Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG-Plan.

Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

2026-06-06T04:00:00 · cs.AI, diffusion · arXiv:2606.05950

Authors: Yuxiao Ye, Haoran He, Fangyuan Kong, Xintao Wang, Pengfei Wan, Kun Gai, Ling Pan

Abstract:

Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.

Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Models

2026-06-06T04:00:00 · cs.AI, diffusion · arXiv:2606.06356

Authors: Renjith Prasad, Chathurangi Shyalika, Anushka Pawar, Amit Sheth

Abstract:

Multimodal generative models produce fluent outputs but remain unreliable when generation must respect structured, domain-specific, or safety-critical knowledge. Existing methods incorporate knowledge through mechanisms such as prompt augmentation, guidance, latent editing, or fine-tuning, yet they are typically categorized by technique rather than by the component of the generative process they modify. We argue that knowledge infusion in iterative generative models is fundamentally an intervention-layer problem. Since the generative process unfolds as a trajectory of internal states, knowledge can act on four structurally distinct components of this process: the input/output boundary, the transition function, the intermediate state, and the model parameters. This maps to four intervention layers: surface, trajectory, latent, and parametric infusion. We instantiate the framework in diffusion models, map representative methods to all four layers, and derive design principles for multi-layer composition. In a controlled safety-alignment experiment using a multimodal knowledge graph with two diffusion backbones, we implement three of the four layers cumulatively, surface (input-side and output-side) and trajectory–latent (mid-generation). We show empirically that each additional layer addresses failure classes that prior layers cannot reach, reducing knowledge-violating outputs by 70.97% compared to vanilla generation and empirically confirming the framework's complementarity prediction.
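
The four intervention layers can be pictured as four places to hook into an iterative generation loop. The Python sketch below uses a toy scalar "generator" to mark where each layer would act: surface at the input and output boundary, trajectory on the transition, latent on the intermediate state, and parametric on the model parameters. The update rule, hook names, and numbers are all hypothetical and are not taken from the paper.

```python
def generate(prompt_value, steps, params, hooks=None):
    """Toy iterative generator exposing the four intervention points."""
    hooks = hooks or {}
    # Surface layer, input side: rewrite what enters the generator.
    state = hooks.get("surface_in", lambda v: v)(prompt_value)
    for _ in range(steps):
        # Transition function of the toy model, governed by its parameters.
        update = params["rate"] * (params["target"] - state)
        # Trajectory layer: modify the transition itself (guidance-style).
        update = hooks.get("trajectory", lambda s, u: u)(state, update)
        state = state + update
        # Latent layer: edit the intermediate state directly.
        state = hooks.get("latent", lambda v: v)(state)
    # Surface layer, output side: filter or repair the final output.
    return hooks.get("surface_out", lambda v: v)(state)


BASE_PARAMS = {"rate": 0.5, "target": 10.0}
vanilla = generate(0.0, 2, BASE_PARAMS)
latent_clamped = generate(0.0, 2, BASE_PARAMS, {"latent": lambda v: min(v, 6.0)})
trajectory_damped = generate(0.0, 2, BASE_PARAMS, {"trajectory": lambda s, u: 0.5 * u})
output_filtered = generate(0.0, 2, BASE_PARAMS, {"surface_out": lambda v: min(v, 7.0)})
# Parametric layer: change the parameters themselves (fine-tuning-style).
parametric = generate(0.0, 2, {"rate": 0.5, "target": 4.0})
```

Each hook changes a different part of the loop, which is the sense in which the layers can be composed rather than treated as competing techniques.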

The Score Hamiltonian: Mapping Diffusion Models to Adiabatic Transport

2026-06-06T04:00:00 · cs.AI, cs.LG, diffusion, math-ph, math.MP, physics.data-an · arXiv:2606.05217

Authors: Peter Halmos, Boris Hanin

Abstract:

We exhibit an exact correspondence between sampling with score-based diffusion models and adiabatic transport of ground states for a family of Schrödinger operators we call Score Hamiltonians, built from the learned score's quantum potential. We obtain novel density reconstruction bounds and principled annealing schedules via adiabatic theorems for Fokker-Planck equations with time-varying potentials. We find the fundamental limit of sampling is set by the ratio of squared score-matching error to Score Hamiltonian spectral gap - the inverse Poincaré constant of the data density.
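
For orientation, the textbook ground-state transform behind this kind of correspondence can be written out as follows. This is a sketch under assumed notation (p for the density, s for its score, ψ for the square root of p), and the paper's own definition of the Score Hamiltonian may differ in normalization or in how the learned score enters. The second display restates the sampling limit from the abstract with constants omitted.

```latex
% Assumed notation: p = density, s = \nabla \log p (score), \psi = \sqrt{p}.
% Since \nabla\psi = \tfrac12 \psi\, s, the operator below has \psi as a zero mode.
\[
  H = -\Delta + V, \qquad
  V = \frac{\Delta \psi}{\psi}
    = \tfrac{1}{4}\,\lVert s \rVert^{2} + \tfrac{1}{2}\,\nabla \cdot s,
  \qquad H\psi = 0 .
\]
% Sampling limit as stated in the abstract: \varepsilon is the score-matching error,
% \lambda_{\mathrm{gap}} the spectral gap, C_P the Poincar\'e constant of the data density.
\[
  \text{sampling limit} \;\sim\; \frac{\varepsilon^{2}}{\lambda_{\mathrm{gap}}}
  \;=\; C_{P}\,\varepsilon^{2},
  \qquad \lambda_{\mathrm{gap}} = \frac{1}{C_{P}} .
\]
```

Read this way, a data density with a large Poincaré constant, for example one with well-separated modes, amplifies a given score-matching error.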

The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show

2026-06-06T04:00:00 · cs.AI, cs.CV, cs.GR, cs.LG, diffusion · arXiv:2606.05328

Authors: Parsa Esmati, Somjit Nath, Katja Hofmann, Derek Nowrouzezahrai, Samira Ebrahimi Kahou, Majid Mirmehdi

Abstract:

Modern video diffusion models generate increasingly realistic and temporally coherent videos, motivating their use as candidate world simulators. Yet it remains unclear whether these models internally encode physical structure, or merely reproduce motion patterns seen during training. We study this question by probing video diffusion models along latent trajectories corresponding to real videos with known physical plausibility. To obtain such trajectories, we approximately invert the deterministic sampling process by integrating the learned velocity field backward from a clean video latent to noise, giving access to the model's intermediate states and attention maps. Using these recovered trajectories, we show that physical plausibility is linearly decodable from diffusion transformer states across IntPhys and InfLevel, reaching around 81.27% average accuracy and outperforming dedicated representation-learning baselines such as V-JEPA and VideoMAE. Surprisingly, this signal is absent from the VAE latent input and emerges inside the denoising transformer itself, despite the model not being trained with a self-supervised predictive objective. These findings suggest that physically meaningful representations can arise as a byproduct of generative denoising.
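
The inversion step described above, integrating a velocity field backward from a clean latent to noise while keeping the intermediate states, can be sketched with plain Euler steps. In the Python sketch below the velocity field is a hypothetical scalar stand-in for a learned model, and the time convention (t=0 noise, t=1 clean) is an assumption.

```python
def toy_velocity(x, t):
    """Hypothetical stand-in for a learned velocity field v(x, t)."""
    return -0.5 * x + t


def sample(noise, velocity, steps):
    """Euler integration from t=0 (noise) to t=1 (clean); returns all states."""
    h = 1.0 / steps
    states = [noise]
    for k in range(steps):
        x = states[-1]
        states.append(x + h * velocity(x, k * h))
    return states


def invert(clean, velocity, steps):
    """Euler integration backward from t=1 (clean) to t=0 (noise); returns all states."""
    h = 1.0 / steps
    states = [clean]
    for k in range(steps, 0, -1):
        x = states[-1]
        states.append(x - h * velocity(x, k * h))
    return states


clean_latent = 1.3
recovered_trajectory = invert(clean_latent, toy_velocity, steps=1000)
recovered_noise = recovered_trajectory[-1]
# Sampling forward again from the recovered noise approximately returns the input.
round_trip = sample(recovered_noise, toy_velocity, steps=1000)[-1]
```

The inversion is only approximate because each Euler step carries discretization error. The states in `recovered_trajectory` play the role of the intermediate states that a linear probe would be trained on.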

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

2026-06-06T04:00:00 · cs.AI, cs.CV, cs.LG, cs.RO, diffusion · arXiv:2606.05737

Authors: Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu

Abstract:

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
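
"Biasing the training time distribution toward high-noise states" amounts to changing how the time variable is sampled for each training example. The Python sketch below contrasts uniform sampling with a Beta-skewed alternative. The Beta family, the parameter name `high_noise_bias`, and the convention that t=1 means pure noise are assumptions for illustration, since the abstract does not specify the schedule.

```python
import random


def sample_times(n, high_noise_bias=1.0, seed=0):
    """Draw n training times in [0, 1], where t=1 is taken to mean pure noise.

    high_noise_bias=1.0 gives the uniform distribution; larger values use
    Beta(high_noise_bias, 1), which concentrates mass near t=1.
    """
    rng = random.Random(seed)
    return [rng.betavariate(high_noise_bias, 1.0) for _ in range(n)]


uniform_times = sample_times(20000, high_noise_bias=1.0)
biased_times = sample_times(20000, high_noise_bias=3.0)
uniform_mean = sum(uniform_times) / len(uniform_times)
biased_mean = sum(biased_times) / len(biased_times)
```

A one-step decoder only ever evaluates the network at the pure-noise end, so spending more training examples there targets exactly the region one-step inference depends on.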

UniVoice: A Unified Model for Speech and Singing Voice Generation

2026-06-06T04:00:00 · cs.AI, cs.SD, diffusion, eess.AS · arXiv:2606.05852

Authors: Junjie Zheng, Huixin Xue, Shihong Ren, Chaofan Ding, Hao Liu, Zihao Chen

Abstract:

Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).
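
The PER figures above are error rates over recognized token sequences. Assuming PER denotes phoneme error rate, it can be computed as the edit distance between reference and recognized phoneme sequences divided by the reference length. The Python sketch below shows that calculation. The phoneme strings are toy examples, and the recognizer used in the paper's evaluation is not modeled.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def phoneme_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference phonemes."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = ["k", "ae", "t"]
per_substitution = phoneme_error_rate(reference, ["k", "ah", "t"])   # one substitution
per_deletion = phoneme_error_rate(reference, ["k", "t"])             # one deletion
```

Because insertions count as errors, the rate can exceed 100% when the recognized sequence is much longer than the reference.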

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

2026-06-06T04:00:00 · autoregressive, cs.AI, cs.RO, diffusion · arXiv:2606.05979

Authors: Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

Abstract:

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.

CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

2026-06-06T04:00:00 · cs.AI, cs.RO, diffusion · arXiv:2606.06219

Authors: Yining Xing, Zehong Ke, Zhiyuan Liu, Yanbo Jiang, Wenhao Yu, Jianqiang Wang

Abstract:

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient α and sample count N from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.

Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction

2026-06-06T04:00:00 · cs.AI, cs.LG, diffusion · arXiv:2606.06303

Authors: Hongkun Dou, Zike Chen, Fengji Li, Hongjue Li, Yue Deng

Abstract:

Controllable generation with discrete diffusion models is often hindered by high computational overhead or the need for retraining. In this paper, we present Gradient-Informed Logit Correction (GILC), a plug-and-play framework that efficiently estimates guidance signals by repurposing the pretrained denoising network as a variational proxy. To circumvent the gradient instability inherent in high-dimensional discrete spaces, we introduce a Jacobian-free mechanism that directly corrects the clean prediction logits, facilitating stable and effective guidance. Our method accommodates both differentiable and non-differentiable reward functions. Extensive experiments across DNA, protein sequence, and molecular generation tasks demonstrate that GILC achieves state-of-the-art performance without additional training, frequently outperforming fine-tuning approaches.

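Abstract (translated from Chinese):

Controllable generation with discrete diffusion models is often held back by high computational overhead or the need for retraining. This work proposes Gradient-Informed Logit Correction (GILC), a plug-and-play framework that estimates guidance signals efficiently by reusing the pretrained denoising network as a variational proxy. To overcome the gradient instability inherent in high-dimensional discrete spaces, the method introduces a Jacobian-free mechanism that directly corrects the clean prediction logits, giving stable and effective guidance. The method works with both differentiable and non-differentiable reward functions. Extensive experiments on DNA, protein sequence, and molecular generation tasks show that GILC reaches state-of-the-art performance without additional training and often outperforms fine-tuning approaches.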
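
To make the idea of "correcting logits with a reward" concrete, here is a minimal sketch of generic exponential reward tilting on a categorical distribution. It is not GILC's estimator, which the abstract does not specify. The function names, the 4-token vocabulary, the reward values, and `guidance_scale` are all hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tilt_logits(logits: List[float], rewards: List[float],
                guidance_scale: float) -> List[float]:
    """Add a scaled per-token reward to each logit (exponential tilting).

    Assumption: rewards[i] is a black-box score for choosing token i,
    so no gradient of the reward function is required.
    """
    if len(logits) != len(rewards):
        raise ValueError("logits and rewards must have the same length")
    return [logit + guidance_scale * reward
            for logit, reward in zip(logits, rewards)]


# Hypothetical example: the denoiser is undecided, the reward prefers token 2.
clean_logits = [0.0, 0.0, 0.0, 0.0]
token_rewards = [0.0, 0.0, 1.0, 0.0]

unguided = softmax(clean_logits)
guided = softmax(tilt_logits(clean_logits, token_rewards, guidance_scale=2.0))
print("unguided:", unguided)
print("guided:  ", guided)
```

Because the correction is applied to the logits directly, a guidance scale of zero recovers the unguided distribution, and larger scales shift probability mass toward high-reward tokens.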

RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation

2026-06-06T04:00:00 · cs.AI, cs.RO, diffusion · 2606.06423

Title (translated from Chinese): RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation

Authors: Qi Lan, Yining Tang, Yu Shen, Yi Zhou, Yuhao Wei, Jie Li, Guofa Li

Abstract:

Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts such as jitter, abnormal acceleration, and off-road behavior. To address these issues, we propose RiskFlow, a closed-loop safety-critical multi-agent traffic generation framework that formulates future trajectory generation as transport in the action space. Instead of relying on iterative denoising, RiskFlow learns an average velocity field over a finite interval to transform Gaussian action sequences into future acceleration and yaw-rate commands with a single forward pass, using a JVP-based objective for efficient and stable training. At test time, RiskFlow applies output-space guidance to the generated actions, steering selected critical agents toward risky interactions while regularizing off-road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Experiments on nuScenes with tbsim closed-loop evaluation show that RiskFlow achieves a strong adversariality-realism trade-off across multi-agent and long-horizon settings. Compared with representative baselines, RiskFlow consistently improves realism while maintaining competitive safety-critical generation capability, and substantially reduces inference time for evaluation.

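Abstract (translated from Chinese):

Safety-critical traffic scenario generation is essential for evaluating how autonomous driving systems behave in rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising is computationally expensive and can accumulate sampling and guidance errors over long rollouts, producing unrealistic motion artifacts such as jitter, abnormal acceleration, and off-road driving. To address these problems, we propose RiskFlow, a closed-loop, safety-critical, multi-agent traffic generation framework that formulates future trajectory generation as transport in the action space. Rather than relying on iterative denoising, RiskFlow learns an average velocity field over a finite time interval, converts Gaussian action sequences into future acceleration and yaw-rate commands in a single forward pass, and uses a JVP-based objective for efficient, stable training. At test time, RiskFlow applies output-space guidance to the generated actions, steering selected critical agents toward risky interactions while constraining off-road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Closed-loop tbsim evaluation on the nuScenes dataset shows that RiskFlow achieves a good adversariality-realism trade-off in multi-agent and long-horizon settings. Compared with representative baselines, RiskFlow consistently improves realism while keeping competitive safety-critical generation capability, and it substantially shortens inference time for evaluation.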
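
The step of reconstructing trajectories from acceleration and yaw-rate commands can be illustrated with a minimal sketch. The abstract does not say which vehicle model RiskFlow uses; the unicycle model, the forward-Euler integration, the time step `dt`, and the non-negative speed clamp below are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

# (x [m], y [m], heading [rad], speed [m/s])
State = Tuple[float, float, float, float]
# (acceleration [m/s^2], yaw rate [rad/s])
Action = Tuple[float, float]


def rollout_unicycle(state: State, actions: List[Action],
                     dt: float = 0.1) -> List[State]:
    """Integrate a unicycle model with forward Euler steps.

    Assumption: speed is clamped at zero so the vehicle never reverses.
    Returns the initial state followed by one state per action.
    """
    x, y, heading, speed = state
    trajectory = [state]
    for accel, yaw_rate in actions:
        speed = max(0.0, speed + accel * dt)
        heading = heading + yaw_rate * dt
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        trajectory.append((x, y, heading, speed))
    return trajectory


# Hypothetical usage: one second of gentle left turn at constant speed.
start = (0.0, 0.0, 0.0, 10.0)
commands = [(0.0, 0.2)] * 10
for step in rollout_unicycle(start, commands):
    print(step)
```

Generating actions and integrating them through a dynamics model, instead of predicting positions directly, keeps every rollout consistent with the model's kinematics by construction.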

Generating Graph-Like Logical Rules for Knowledge Graph Reasoning via Diffusion Models

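2026-06-06T04:00:00 · cs.AI, diffusion · 2605.30747

Title (translated from Chinese): Generating Graph-Like Logical Rules for Knowledge Graph Reasoning with Diffusion Models

Authors: Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng, Zhipeng Lin, Haoxuan Li, Changjun Fan, Shixuan Liu

Abstract: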

Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational patterns. However, existing rule mining methods predominantly focus on simple chain-like rules and therefore neglect the richer relational information encoded in graph-like structures, such as cycles and branches. This limitation is further exacerbated by computational bottlenecks caused by the combinatorial explosion of the search space, which is especially challenging for graph-like rules. Meanwhile, generative approaches such as diffusion models, despite their success in other domains, cannot be directly applied to rule mining because their training objectives are not aligned with the goal of learning high-quality rules, and non-differentiable KG rule quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph-like rule discovery as a discrete generative process conditioned on the target relation. GRiD employs a two-phase training strategy. First, supervised pre-training enables GRiD to capture structural priors from subgraphs sampled from the KG meta-graph. Subsequently, reinforcement learning is applied to fine-tune GRiD through policy gradient optimization guided directly by non-differentiable rule-quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on KG completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph-like rules complement chain-like rules in KG completion. Our code and datasets are available in https://github.com/Haoxiang-Cheng/GRiD.

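Abstract (translated from Chinese):

Logical rules are a cornerstone of knowledge graph reasoning, valued for their interpretability and their ability to model relational patterns. Existing rule mining methods, however, focus mainly on simple chain-like rules and therefore miss the richer relational information encoded in graph-like structures such as cycles and branches. This limitation is made worse by the computational bottleneck caused by the combinatorial explosion of the search space, which is especially challenging for graph-like rules. Meanwhile, generative approaches such as diffusion models have succeeded in other domains but cannot be applied directly to rule mining: their training objectives are not aligned with the goal of learning high-quality rules, and non-differentiable knowledge graph rule-quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph-like rule discovery as a discrete generative process conditioned on the target relation. GRiD uses a two-phase training strategy. First, supervised pre-training lets GRiD capture structural priors from subgraphs sampled from the knowledge graph meta-graph. Then reinforcement learning fine-tunes GRiD through policy gradient optimization guided directly by non-differentiable rule-quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on knowledge graph completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph-like rules and chain-like rules complement each other in knowledge graph completion. The code and datasets are available at https://github.com/Haoxiang-Cheng/GRiD.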
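
The second training phase relies on the fact that a policy gradient needs only reward values, not reward gradients. The sketch below shows this on a toy problem and is not GRiD's training loop: the three candidate rules, their quality scores, and the learning rate are hypothetical, and the exact expected gradient is used (enumerating all actions) so the example is deterministic, whereas a real system would estimate it from sampled rules.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(probs: List[float], rewards: List[float]) -> float:
    """Expected reward of a categorical policy."""
    return sum(p * r for p, r in zip(probs, rewards))


def policy_gradient_step(logits: List[float], rewards: List[float],
                         learning_rate: float) -> List[float]:
    """One exact gradient-ascent step on the expected reward.

    For a softmax policy, dE[R]/dlogit_i = p_i * (R_i - E[R]).
    The rewards are plain numbers; they are never differentiated.
    """
    probs = softmax(logits)
    baseline = expected_reward(probs, rewards)
    return [logit + learning_rate * p * (r - baseline)
            for logit, p, r in zip(logits, probs, rewards)]


# Hypothetical quality scores for three candidate rules (a black-box metric).
rule_quality = [0.1, 0.9, 0.4]
logits = [0.0, 0.0, 0.0]
initial_value = expected_reward(softmax(logits), rule_quality)

for _ in range(200):
    logits = policy_gradient_step(logits, rule_quality, learning_rate=1.0)

final_probs = softmax(logits)
final_value = expected_reward(final_probs, rule_quality)
print("final policy:", final_probs)
print("expected quality:", initial_value, "->", final_value)
```

The policy concentrates on the highest-scoring candidate even though the score is only ever queried, which is why a non-differentiable metric can serve as the training signal.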

When Attention Beats Fourier: Multi-Scale Transformers for PDE Solving on Irregular Domains

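2026-06-06T04:00:00 · cs.AI, cs.LG, cs.NA, diffusion, math.NA, physics.comp-ph, stat.ML · 2605.08318

Title (translated from Chinese): When Attention Surpasses Fourier: Multi-Scale Transformers for Solving PDEs on Irregular Domains

Authors: Brandon Yee, Pairie Koh, Jack Rodriguez, Mihir Tekal

Abstract: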

We study the problem of architecture selection for deep learning models trained to solve partial differential equations (PDEs), asking when transformer-based architectures with learned attention outperform Fourier-domain neural operators. We introduce the Multi-Scale Attention Transformer (MSAT), a deep learning architecture that encodes spatiotemporal solution histories as token sequences and trains end-to-end via a composite supervised objective with optional physics-informed regularization terms. We conduct a comprehensive empirical evaluation against nine baselines, including physics-informed neural networks (PINNs), neural operators (FNO, DeepONet, GNOT), and state-space models (Mamba-NO), across five benchmark problems from the PINNacle suite, using identical train/test splits and reference data for all methods. MSAT achieves state-of-the-art generalization on complex geometry problems ($L^2_\mathrm{rel} = 0.0101$ on Heat2D-CG, a $3.7\times$ improvement over FNO) at 34 s total inference vs. 120,812 s for Mamba-NO. Ablation studies over the physics regularization component reveal a precise inductive bias tradeoff: physics priors reduce test error on diffusion-dominated problems but degrade generalization on chaotic and recirculating-flow regimes, directly characterizing the prior misspecification boundary. Approximation error bounds as a function of domain boundary complexity $\kappa$ provide a theoretical basis for these empirical findings and a principled rule for architecture selection.

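Abstract (translated from Chinese):

This paper studies architecture selection for deep learning models that solve partial differential equations (PDEs), asking when Transformer architectures with learned attention outperform Fourier-domain neural operators. We propose the Multi-Scale Attention Transformer (MSAT), a deep learning architecture that encodes spatiotemporal solution histories as token sequences and is trained end-to-end with a composite supervised objective that includes optional physics-informed regularization terms. On five benchmark problems from the PINNacle suite, we run a comprehensive empirical evaluation against nine baselines, including physics-informed neural networks (PINNs), neural operators (FNO, DeepONet, GNOT), and state-space models (Mamba-NO), with identical train/test splits and reference data for all methods. MSAT achieves state-of-the-art generalization on complex-geometry problems (a relative L2 error of 0.0101 on Heat2D-CG, a 3.7× improvement over FNO), with a total inference time of 34 seconds versus 120,812 seconds for Mamba-NO. Ablations of the physics regularization component reveal a precise inductive bias trade-off: physics priors lower test error on diffusion-dominated problems but degrade generalization in chaotic and recirculating-flow regimes, which directly characterizes the boundary of prior misspecification. Approximation error bounds expressed in terms of the domain boundary complexity κ provide a theoretical basis for these empirical findings and a principled rule for architecture selection.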
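
The headline numbers are easier to interpret with the metric written out. The sketch below computes a relative L2 error in the usual form, the norm of the prediction error divided by the norm of the reference, and derives two quantities implied by the abstract. The exact normalization used by the benchmark is an assumption, as is reading "3.7× improvement" as a ratio of errors. The implied FNO error is an inference, not a figure reported in the abstract.

```python
import math
from typing import Sequence


def relative_l2_error(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Return ||pred - ref||_2 / ||ref||_2 for two equal-length vectors."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    numerator = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    denominator = math.sqrt(sum(r ** 2 for r in ref))
    if denominator == 0.0:
        raise ValueError("reference solution has zero norm")
    return numerator / denominator


# Figures quoted in the abstract.
msat_error = 0.0101          # relative L2 error on Heat2D-CG
improvement_factor = 3.7     # stated improvement over FNO
msat_seconds = 34
mamba_no_seconds = 120_812

# Assumption: "3.7x improvement" means the FNO error is 3.7 times larger.
implied_fno_error = msat_error * improvement_factor
inference_time_ratio = mamba_no_seconds / msat_seconds

print(f"implied FNO error: {implied_fno_error:.4f}")
print(f"Mamba-NO / MSAT inference time: {inference_time_ratio:.0f}x")
```

Under these assumptions the comparison amounts to an FNO error of roughly 0.037 and an inference-time gap of about three and a half thousand times relative to Mamba-NO.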

image_compression
Image Compression
1 paper

Only one image compression paper appears today. It sits at the intersection of autoregressive generation and tokenization: the image tokenizer is redesigned so that compression efficiency and generation quality improve together. The bootstrapped tokenization approach illustrates a design that treats image compression and generation jointly.

  • Balancing Image Compression and Generation with Bootstrapped Tokenization (2606.05552): trains the image tokenizer by self-bootstrapping to balance compression rate against generation quality, which is relevant to autoregressive image generation models.

Balancing Image Compression and Generation with Bootstrapped Tokenization

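2026-06-06T04:00:00 · cs.AI, cs.GR, cs.LG, image_compression · 2606.05552

Title (translated from Chinese): Balancing Image Compression and Generation with Bootstrapped Tokenization

Authors: Haozhe Chi, Jinghan Li, Hao Jiang, Wu Sheng, Yi Ma, Jing Wang, Yadong Mu

Abstract: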

Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, thus redundancy persists between tokens. The mix of information of different granularity also complicates the training of generators. This paper introduces SelfBootTok, a method that resolves this by cleanly decomposing information into global and local token groups. Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.

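Abstract (translated from Chinese):

Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, so redundancy remains between tokens. Mixing information of different granularities also complicates generator training. This paper proposes SelfBootTok, which resolves the problem by cleanly decomposing information into a global token group and a local token group. Through self-bootstrapped learning, the model predicts local details from the global tokens alone, moving the burden of visual detail from the generator to the tokenizer. As a result, the generator is more efficient: it needs only the global tokens, which cuts computation by about 40%, while delivering better reconstruction and generation. The paradigm also scales well: by using more data or parameters to self-supervise local representation learning, SelfBootTok reaches a state-of-the-art gFID of 1.56 with only 64 tokens.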

visual_tokenizer_1d
1D Visual Tokenizer
0 papers

No matching papers were found for this category today.

diffusion_visual_encoder
Diffusion Visual Encoder
0 papers

No matching papers were found for this category today.