Daily arXiv Paper Digest
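Today's arXiv papers follow two parallel tracks: diffusion models dominate, while autoregressive models probe new territory. The Diffusion track (28 papers) remains the larger one, covering 3D scene generation (Native3D), video animation (FreeAnimate), music co-generation (unified song generation and singing voice conversion), and reinforcement learning for robot control (CHDP). Together these show generative AI moving from images toward multimodal, embodied, and safe, controllable settings. The Autoregressive track (7 papers) focuses on narrower topics such as tokenization (AdaTok), robot policy learning (ActionMap), and spatially flexible generation (AsyncPatch). The two paradigms overlap on image and video generation tasks.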
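- Native3D: End-to-End 3D Scene Generation - unifies mesh-texture modeling with semantic alignment to generate 3D scenes end to end, a notable step for 3D content generation.
- FreeAnimate: Training-Free Human Image Animation - uses preview-guided denoising to animate human images without any training, which lowers the cost of entry considerably.
- CHDP: Cooperative Hybrid Diffusion Policies - applies hybrid diffusion policies to reinforcement learning in parameterized action spaces, offering a new approach to robot control.
- LUCID: Nighttime Photography - handles image deblurring and exposure control in a single framework, with clear relevance to night photography on mobile devices.
- D5P4: Partition Determinantal Point Process - brings determinantal point processes (DPPs) into discrete diffusion decoding to increase the diversity of parallel generation.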
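Today's Autoregressive papers: overview
Seven papers fall under Autoregressive today, spanning speech synthesis, image and video generation, and robot control. The overall trend is that autoregressive models are being combined closely with diffusion models, neural operators, and related techniques, producing a range of new architectures. Streaming generation and efficient tokenization are the two main themes, with several papers asking how to improve efficiency without giving up generation quality. Robot policy learning also shows a shift from conventional RL toward deep autoregressive policies.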
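- D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding - introduces a DPP to increase generation diversity, offering a new angle on decoding strategies for diffusion models.
- AsyncPatch Diffusion: spatially-flexible image generation - proposes spatially flexible image generation, removing the single shared noise level that conventional diffusion models impose on every region.
- AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens - uses self-budgeting tokenization to strike a good balance between quality and efficiency.
- Streaming Video Generation with Streaming Force Control - brings streaming force control to video generation, improving controllability and physical consistency.
- ActionMap: Robot Policy Learning via Voxel Action Heatmap - proposes robot policy learning based on voxel action heatmaps, a new way to decode actions in complex environments.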
dots.tts Technical Report
Authors: Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu
We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
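The WER figures quoted for dots.tts are word error rates. As a reminder of what that metric measures, here is a minimal sketch that computes it as the word-level edit distance divided by the reference length. Splitting on whitespace is an assumption made for illustration; the abstract does not say how the benchmark tokenizes text, and Chinese test sets are usually scored per character rather than per word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # (before the current word) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.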
D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding
Authors: Jonathan Lys, Vincent Gripon, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene, Bastien Pasdeloup
Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard autoregressive search procedures, such as beam search, do not directly apply to iterative denoising, where hypotheses are complete intermediate sequences rather than left-to-right prefixes. Furthermore, existing diffusion decoding procedures only provide limited control over the diversity and coverage of retained hypotheses. In this work, we introduce D5P4, a beam-style decoding method tailored to discrete diffusion models, which casts intermediate beam selection as MAP inference under a partitioned Determinantal Point Process. This yields a model-internal batch objective that balances quality and diversity without external verifiers. Experiments on open-ended generation, question answering, and mathematical reasoning show that D5P4 improves diversity and pass@$k$ coverage while matching or surpassing baseline quality and fidelity.
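To make the idea of beam selection as DPP MAP inference more concrete, the sketch below runs the standard greedy heuristic on a quality-diversity kernel L[i][j] = quality[i] * similarity[i][j] * quality[j]. It is not the authors' algorithm: the abstract does not describe the partition structure or the exact kernel, so both are left out, and the names `quality`, `similarity`, and `greedy_dpp_map` are hypothetical.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_dpp_map(quality, similarity, k):
    """Greedily pick up to k items that approximately maximise det(L_S)."""
    n = len(quality)
    kernel = [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
              for i in range(n)]
    selected = []
    for _ in range(k):
        best_item, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best_item, best_det = i, d
        if best_item is None:
            break  # every remaining candidate is numerically redundant
        selected.append(best_item)
    return selected


# Hypotheses 0 and 1 are near-duplicates; hypothesis 2 is weaker but different.
quality = [1.0, 0.95, 0.7]
similarity = [[1.0, 0.99, 0.1],
              [0.99, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
print(greedy_dpp_map(quality, similarity, 2))
```

Ranking by quality alone would keep hypotheses 0 and 1. The determinant penalizes their overlap, so the greedy step keeps 0 and 2 instead, which is the quality-diversity trade-off the abstract refers to.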
Autoregression-Free Neural Operators for Time-Dependent PDEs
Authors: Jiaquan Zhang, Caiyan Qin, Haoyu Bian, Libin Cai, Yi Lu, Chaoning Zhang, Wei Dong, Yuanfang Guo, Yang Yang, Heng Tao Shen
Neural operators learn mappings from function-dependent inputs to solutions, providing an effective framework for solving partial differential equations (PDEs). For time-dependent PDEs, existing methods typically perform long-horizon prediction through autoregressive rollout directly in high-dimensional physical field spaces, where each predicted state is recursively fed back as the input for the next step. Although effective for short-term prediction, this autoregressive rollout and the lack of continuous-time modeling lead to progressive error accumulation over long-horizon rollouts. In this work, we propose Autoregression-Free Neural Operators (AFNO), which map the time evolution of PDEs into a latent space and model continuous-time vector fields within it. AFNO uses flow matching to learn the latent vector field, thereby enabling continuous evolution over extended horizons, avoiding autoregressive rollout and capturing dynamics under varying parameter configurations through explicit conditioning on physical parameters. Theoretical analysis and extensive experiments on six PDEs demonstrate that AFNO improves long-horizon prediction stability and consistently reduces rollout errors compared with the baselines.
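The two ingredients named in the abstract, a flow-matching target and continuous-time evolution of a latent state, can be sketched in a few lines. This is an illustration under assumptions, not the AFNO implementation: the straight-line probability path, the explicit Euler integrator, and the function names are all choices made here, and the encoder, decoder, and physical-parameter conditioning are omitted.

```python
def flow_matching_pair(z0, z1, t):
    """Point on the straight path from z0 to z1 at time t, plus the target velocity."""
    z_t = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    velocity = [b - a for a, b in zip(z0, z1)]
    return z_t, velocity


def integrate(vector_field, z0, t_end, steps):
    """Explicit Euler integration of dz/dt = vector_field(z, t) from 0 to t_end."""
    z = list(z0)
    dt = t_end / steps
    for i in range(steps):
        t = i * dt
        v = vector_field(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


z0, z1 = [0.0, 1.0], [2.0, -1.0]

# A trained network would regress onto `velocity` at sampled (z_t, t) pairs.
z_t, velocity = flow_matching_pair(z0, z1, 0.25)


# Stand-in for a learned field: the exact constant velocity of the straight path.
def oracle_field(z, t):
    return [b - a for a, b in zip(z0, z1)]


print(integrate(oracle_field, z0, 0.5, 10))  # latent state at an intermediate time
print(integrate(oracle_field, z0, 1.0, 10))  # latent state at the end of the path
```

The state at any query time comes from integrating the latent field, so decoded physical fields are never fed back into the model as inputs.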
AsyncPatch Diffusion: spatially-flexible image generation
Authors: Samuele Papa, Valentin De Bortoli, Guillaume Couairon, Daniel Sýkora, Romuald Elie, Klaus Greff
Standard diffusion models corrupt an entire sample with a single shared noise level, forcing all spatial regions to follow the same denoising trajectory. We introduce AsyncPatch Diffusion, a joint-diffusion framework that assigns distinct noise levels to different input dimensions, such as image pixels or latent tokens. We show how this asynchronous corruption defines a valid generative process while supporting a richer family of spatially heterogeneous denoising trajectories, and prove the first valid ELBO for this process. We show that a single pretrained model can perform spatially adaptive generation, where different regions are denoised on different schedules. A key challenge is training: naive independent noise-level sampling overemphasizes highly heterogeneous configurations and underrepresents homogeneous noise levels, which are crucial during sampling. We address this with a controlled noise-level sampler that regulates both the average corruption level and its spatial variability. AsyncPatch achieves generation quality comparable to conventional diffusion on ImageNet 256 and LSUN, while being natively suited for inpainting without task-specific fine-tuning. We further introduce input guidance, which uses clean or partially corrupted regions to guide the generation of unknown regions, improving local consistency and texture matching. Finally, we demonstrate adaptive generation strategies including uncertainty-guided acceleration and autoregressive sampling.
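A small sketch can show what assigning distinct noise levels to different patches looks like in practice. Everything below is an assumption made for illustration: the linear blend used as the corruption, the uniform-offset sampler standing in for the paper's controlled noise-level sampler, and the names `sample_noise_levels` and `corrupt`.

```python
import random


def sample_noise_levels(num_patches, mean_level, spread, rng):
    """Per-patch noise levels in [0, 1] around a chosen mean with a bounded spread."""
    levels = []
    for _ in range(num_patches):
        offset = rng.uniform(-spread, spread)
        levels.append(min(1.0, max(0.0, mean_level + offset)))
    return levels


def corrupt(patches, levels, rng):
    """Blend each patch with Gaussian noise according to its own level."""
    noisy = []
    for patch, level in zip(patches, levels):
        noisy.append([(1.0 - level) * x + level * rng.gauss(0.0, 1.0) for x in patch])
    return noisy


rng = random.Random(0)

# Training-style draw: spread = 0 gives the homogeneous case, larger values
# give increasingly heterogeneous configurations.
print(sample_noise_levels(4, mean_level=0.5, spread=0.2, rng=rng))

# Inpainting-style configuration: the known patch stays clean (level 0),
# the unknown patch starts from pure noise (level 1).
patches = [[1.0, 2.0], [3.0, 4.0]]
print(corrupt(patches, [0.0, 1.0], rng))
```

Controlling the mean and the spread separately is what lets a sampler of this kind keep homogeneous configurations well represented, which the abstract identifies as the weakness of naive independent sampling.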
AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens
Authors: Xiaocheng Lu, Yuxi Chen, Jie Zhang, Jian Liu, Jingcai Guo, Fangqi Zhu, Tao Han, Song Guo
Image tokenizers, from 2D grids to recent 1D sequences, typically encode every image with the same fixed number of tokens. Yet visual complexity is highly heterogeneous, so a uniform budget overspends on simple inputs and underserves complex ones. Existing elastic tokenizers expose variable-length reconstructions, but often leave token length as a deployment-time operating point, a search target, or an external prediction rather than an output of the tokenizer itself. In this work, we ask whether a discrete visual tokenizer can budget itself in one pass. Our central finding is that actionable elasticity requires a representation-allocation co-design: prefixes must remain decodable across budgets, and the tokenizer must learn which prefix each image needs. We propose AdaTok, a self-budgeting discrete 1D tokenizer. AdaTok combines Prioritized Representation Learning, which orders tokens with nested tail masking and resolves budget-dependent semantic shift through Multi-Head LoRA decoder heads, with Adaptive Token Allocation, which trains a lightweight deterministic-group GRPO policy over candidate budgets. Dynamic Pareto Weighting balances fidelity and efficiency during policy training without manual trade-off sweeps. On ImageNet-1K, AdaTok-Full reaches rFID 1.31 at 256 tokens, while AdaTok-Adaptive attains rFID 1.50 using only ~118 tokens on average, outperforming discrete 1D baselines at comparable budgets. In autoregressive image generation, the shorter adaptive representation yields ~2.1x throughput over a fixed 256-token decode, suggesting that visual token count can be learned as a content-conditioned output rather than set as a fixed hyperparameter.
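The two requirements in the abstract, prefixes that stay decodable and a per-image choice of prefix length, can be illustrated with a toy sketch. The budget rule here is a plain error threshold, swapped in for the learned GRPO policy that the paper trains; `MASK`, `tail_mask`, `pick_budget`, and the error curves are hypothetical.

```python
MASK = -1  # hypothetical id for a masked token


def tail_mask(tokens, budget):
    """Keep the first `budget` tokens and mask the tail, so every prefix is a valid input."""
    return tokens[:budget] + [MASK] * (len(tokens) - budget)


def pick_budget(error_at_budget, budgets, tolerance):
    """Smallest candidate budget whose reconstruction error is within tolerance.

    Falls back to the largest budget if none qualifies.
    """
    for budget in sorted(budgets):
        if error_at_budget(budget) <= tolerance:
            return budget
    return max(budgets)


budgets = [32, 64, 128, 256]


# Toy error curves: a simple image is easy to reconstruct, a complex one is not.
def simple_image_error(budget):
    return 1.0 / budget


def complex_image_error(budget):
    return 4.0 / budget


print(tail_mask([5, 6, 7, 8], 2))
print(pick_budget(simple_image_error, budgets, tolerance=0.02))
print(pick_budget(complex_image_error, budgets, tolerance=0.02))
```

The reported numbers are consistent with a roughly linear relation between token count and decode cost: 256 / 118 is about 2.17, close to the ~2.1x throughput gain quoted in the abstract.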
Streaming Video Generation with Streaming Force Control
Authors: Hanhui Wang, Yiming Xie, Haiwen Feng, Zhaoyang Lv, Shenlong Wang, Huaizu Jiang
We introduce StreamForce, a streaming video generation framework that enables physically grounded control through continuous force inputs. Unlike prior video models that train separate models for different force types, assume fixed forces, or rely on non-causal processing, StreamForce is a causal and unified model that responds instantly and coherently to both local and global, time-varying forces. To achieve this, we design a unified force representation as a control signal and develop a distillation pipeline for force-controllable video generation. Our model combines autoregressive efficiency with force responsiveness, sustaining stable photometric and dynamic realism. StreamForce runs at up to 16.6 FPS on a single GPU, achieving state-of-the-art performance in both force adherence and motion realism. Project website: https://neu-vi.github.io/StreamForce/
ActionMap: Robot Policy Learning via Voxel Action Heatmap
Authors: Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai, Mike Zheng Shou
Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: https://github.com/showlab/ActionMap.
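To show how a voxel heatmap can stand in for a single-point action prediction, the sketch below decodes flat logits over a small 3D grid into a continuous action. The three-dimensional action, the [-1, 1] range, the grid size, and the argmax read-out are assumptions made for illustration; the abstract does not specify how ActionMap discretizes or decodes the action space.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def voxel_centers(low, high, bins):
    """Centers of `bins` equal-width cells covering [low, high]."""
    width = (high - low) / bins
    return [low + (i + 0.5) * width for i in range(bins)]


def decode_action(logits, bins, low=-1.0, high=1.0):
    """Map flat logits over a bins^3 grid to the action at the most likely voxel."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    ix, rem = divmod(best, bins * bins)
    iy, iz = divmod(rem, bins)
    centers = voxel_centers(low, high, bins)
    return (centers[ix], centers[iy], centers[iz]), probs[best]


bins = 4
logits = [0.0] * bins ** 3
logits[3 * bins * bins + 0 * bins + 2] = 5.0  # peak at voxel (ix=3, iy=0, iz=2)
action, confidence = decode_action(logits, bins)
print(action, confidence)
```

Because each voxel holds a probability, neighboring voxels can share probability mass during training, which is one way to use the geometric proximity of actions that the abstract says existing decoders ignore.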
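Today's Diffusion papers combine work across many application areas with structural innovation. On the application side, they extend beyond image and video generation to 3D scene reconstruction, radar perception, audio synthesis, human pose estimation, and reinforcement-learning decision making. On the method side, they focus on generation quality (for example consistency guidance for inverse problems and exposure control), interactivity (real-time video manipulation, spatially flexible generation), and safety (restricting unsafe information flow). The common thread is that diffusion models are shifting from general-purpose generators toward engines specialized for vertical domains, with designs tailored to the constraints of each task.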
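- Native3D (2606.07117): end-to-end 3D scene generation with unified mesh-texture modeling and semantic alignment; a new approach to 3D content creation that emphasizes semantic consistency.
- FreeAnimate (2606.06885): training-free, preview-guided human image animation that lowers the deployment barrier while keeping high-quality motion, which makes it friendly to content creators.
- Real-Time AttentionBender (2606.06497): granularity-controllable interactive manipulation of video Diffusion Transformers, reportedly the first to offer fine-grained control in real time, laying groundwork for interactive generation.
- LUCID (2606.06901): a joint learning framework that unifies image deblurring and exposure control for nighttime photography, addressing a core pain point of low-light imaging with strong practical value.
- Measurement-Consistent Langevin Corrector (2601.04791): stabilizes latent-diffusion inverse problem solvers, a methodological step toward tighter integration of inverse problems and diffusion models.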
DiBS: Diffusion-Informed Branch Selection
Authors: Bo Liu, Yuan Xie, Yuan Gao, Xiaolong Luo, Peng Ye, Tao Chen, Fujun Han
Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. Existing work on solving Sudoku focuses on two dominant approaches: traditional heuristic solvers and deep learning solvers. However, they suffer from two complementary limitations: learning-based solvers lack hard correctness guarantees, while complete symbolic solvers are still prone to long-tail search. To address these shortcomings, we propose a novel diffusion model-guided approach, termed DiBS, for the branch selection search process. Specifically, DiBS keeps the symbolic solver complete and uses the diffusion model as a branch-ordering guide. The core of the method is to rank candidate values given the current partial assignment and a lightweight consistency signal. Furthermore, we provide an in-depth theoretical proof to reveal how it works and why it works. Experiments on the challenging Royle 17-clue Sudoku benchmark show that DiBS substantially reduces search cost relative to strong heuristic baselines, especially in nodes, backtracks, and long-tail percentiles. These results also confirm that learned global guidance is effective on hard instances where branch-order mistakes are most expensive. All code is available at https://github.com/shanxierdan/DiBS.
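The division of labor described in the abstract, a complete symbolic search whose branch order comes from a learned guide, can be sketched with a small backtracking solver. This is an illustration, not the DiBS code: it uses a 4x4 grid for brevity, and the `score` callback is a hypothetical stand-in for the diffusion model (here an oracle that already knows one solution).

```python
def candidates(grid, row, col, box):
    """Values not yet used in the cell's row, column, or box."""
    size = box * box
    used = set(grid[row]) | {grid[r][col] for r in range(size)}
    top, left = row - row % box, col - col % box
    for r in range(top, top + box):
        used.update(grid[r][left:left + box])
    return [v for v in range(1, size + 1) if v not in used]


def solve(grid, box, score, stats):
    """Complete backtracking search; `score` only decides which value is tried first."""
    size = box * box
    empties = [(r, c) for r in range(size) for c in range(size) if grid[r][c] == 0]
    if not empties:
        return True
    # Most-constrained cell first, a standard symbolic heuristic.
    row, col = min(empties, key=lambda rc: len(candidates(grid, rc[0], rc[1], box)))
    options = candidates(grid, row, col, box)
    for value in sorted(options, key=lambda v: -score(grid, row, col, v)):
        grid[row][col] = value
        stats["nodes"] += 1
        if solve(grid, box, score, stats):
            return True
        grid[row][col] = 0  # undo and try the next-ranked value
        stats["backtracks"] += 1
    return False


solution = [[1, 2, 3, 4], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]
puzzle = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]


def oracle_score(grid, row, col, value):
    """Hypothetical guide: prefers the value from a known solution."""
    return 1.0 if solution[row][col] == value else 0.0


grid = [r[:] for r in puzzle]
stats = {"nodes": 0, "backtracks": 0}
print(solve(grid, 2, oracle_score, stats), stats)
```

A poor guide can only cost extra nodes and backtracks, never correctness, because every candidate value is still tried eventually. That is the sense in which the solver stays complete while the guide reduces search cost.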
Direct 3D-Aware Object Insertion via Decomposed Visual Proxies
Authors: Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, Chen Change Loy
Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object's 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.
ChronoForest: Closed-Loop Multi-Tree Diffusion Planning for Efficient Bridge Search and Route Composition
Authors: Jungmin Seo, Jaesik Park
How can we plan long-horizon routes that reach designated goals, visit required waypoints, and remain short when only short-horizon offline trajectories are available? This problem matters in offline navigation because collecting sufficiently rich long-horizon data is difficult, yet real agents must still solve long-range tasks with route-level efficiency rather than mere feasibility. The difficulty is twofold: at the microscopic level, composing many short-horizon segments creates a trade-off between search cost and path quality, while at the macroscopic level, waypoint ordering requires comparing pairwise travel costs among start, goal, and waypoint anchors that are unknown before planning and increasingly unreliable when estimated only from long-range temporal distance. In this paper, we propose ChronoForest, a closed-loop planning system that couples local bridge search and online route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator. ChronoForest uses temporal distance for short-range guidance and node evaluation, while using search-time bridge evidence to validate long-range anchor connectivity and repeatedly re-solve the route. On OGBench AntMaze-Stitch, ChronoForest achieves 99.8%, 99.3%, and 99.5% success on the medium, large, and giant splits and improves giant-stitch success by up to 34.5 points over prior reported diffusion-based results. On Hamiltonian route-composition benchmarks, online re-solving corrects poor temporal orderings and improves route quality while remaining substantially cheaper than exhaustive planning.
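The macroscopic half of the problem, ordering waypoints from pairwise travel costs and re-solving when search evidence changes an estimate, can be illustrated with a tiny example. The exhaustive enumeration below is only workable for a handful of waypoints and is not how ChronoForest plans (the abstract states the method is substantially cheaper than exhaustive planning); the anchor names and cost values are made up.

```python
from itertools import permutations


def best_route(start, goal, waypoints, cost):
    """Cheapest start -> all waypoints -> goal ordering by exhaustive enumeration."""
    best_order, best_cost = None, float("inf")
    for order in permutations(waypoints):
        stops = (start, *order, goal)
        total = sum(cost[(a, b)] for a, b in zip(stops, stops[1:]))
        if total < best_cost:
            best_order, best_cost = stops, total
    return best_order, best_cost


# Initial pairwise estimates, e.g. derived from a temporal-distance model.
cost = {
    ("S", "A"): 2, ("S", "B"): 5,
    ("A", "B"): 2, ("B", "A"): 2,
    ("A", "G"): 6, ("B", "G"): 2,
}
print(best_route("S", "G", ["A", "B"], cost))

# Suppose bridge search shows that A -> B is much longer than estimated.
cost[("A", "B")] = 12
print(best_route("S", "G", ["A", "B"], cost))
```

With the initial estimates the route visits A before B. Once the A to B cost is corrected, re-solving flips the order, which is the kind of correction the abstract attributes to online re-solving.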
EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation
Authors: Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang, QingMin Liao
Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at https://egopressdiff.github.io/.
FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising
Authors: Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao
Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at https://freeani.github.io/.
Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation
Authors: Ziyu Zhang, Chunyu Qiang, Xiaopeng Wang, Yuxin Guo, Kang Yin, Wenjie Tian, Jingbin Hu, Tianlun Zuo, Zhao Guo, Teng Ma, Yuzhe Liang, Chen Zhang, Lei Xie
While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space transferring speaker representation from SVC to song generation, endowing fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and complementary benefits between them, offering new possibilities for intelligent music production.
Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment
Authors: Yibo Liu, Ziwei Zhang, Haozhou Pang, Menghao Li, Lanshan He, Gan Qi
This paper presents Native3D, the first end-to-end 3D scene generation framework that completely bypasses 2D intermediate representations. Traditional approaches typically require adapting 3D representations to the 2D domain to leverage pre-trained diffusion models, which inevitably introduces domain adaptation issues including geometric structural distortion and texture detail degradation. To address these limitations, we design a unified mesh-texture joint representation that simultaneously models both geometric structures and texture features through a Transformer-based scene encoder, effectively maintaining spatial relationships and visual consistency among objects within scenes. We further propose the 3D Representation Alignment Loss (3D REPA Loss), which employs an improved contrastive learning mechanism to align multi-level semantic representations in the latent space, significantly enhancing geometric and textural fidelity. Experimental results demonstrate that Native3D outperforms existing methods in both generation quality and editing flexibility, providing a novel solution for 3D scene editing.
Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation
Authors: Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, and a navigator selects the best waypoint, with a low-level controller executing the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.
CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space
Authors: Bingyi Liu, Jinbo He, Haiyong Shi, Enshu Wang, Weizhen Han, Jingxiang Hao, Peixi Wang, Zhuangzhuang Zhang
Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action space remains a fundamental challenge, mainly due to limited policy expressiveness and poor scalability in high-dimensional settings. To address this challenge, we view the hybrid action space problem as a fully cooperative game and propose a Cooperative Hybrid Diffusion Policies (CHDP) framework to solve it. CHDP employs two cooperative agents that leverage a discrete and a continuous diffusion policy, respectively. The continuous policy is conditioned on the discrete action's representation, explicitly modeling the dependency between them. This cooperative design allows the diffusion policies to leverage their expressiveness to capture complex distributions in their respective action spaces. To mitigate the update conflicts arising from simultaneous policy updates in this cooperative setting, we employ a sequential update scheme that fosters co-adaptation. Moreover, to improve scalability when learning in high-dimensional discrete action space, we construct a codebook that embeds the action space into a low-dimensional latent space. This mapping enables the discrete policy to learn in a compact, structured space. Finally, we design a Q-function-based guidance mechanism to align the codebook's embeddings with the discrete policy's representation during training. On challenging hybrid action benchmarks, CHDP outperforms the state-of-the-art method by up to 19.3% in success rate.
混合动作空间结合了离散选择和连续参数,在机器人控制和游戏AI等领域普遍存在。然而,高效地建模和优化混合离散-连续动作空间仍然是一个根本性挑战,主要原因在于策略表达能力有限以及高维环境下的可扩展性差。为解决这一挑战,我们将混合动作空间问题视为一个完全合作博弈,并提出了合作式混合扩散策略(Cooperative Hybrid Diffusion Policies,CHDP)框架来求解。CHDP采用两个合作智能体,分别利用离散扩散策略和连续扩散策略。连续策略以离散动作的表示为条件,明确建模二者之间的依赖关系。这种合作设计使扩散策略能够发挥其强大的表达能力,捕捉各自动作空间中的复杂分布。为缓解该合作设定中同时策略更新所产生的更新冲突,我们采用顺序更新方案来促进协同适应。此外,为提高高维离散动作空间学习时的可扩展性,我们构建了一个将动作空间嵌入低维潜在空间的码本。这种映射使离散策略能够在紧凑、结构化的空间中学习。最后,我们设计了一种基于Q函数的引导机制,在训练过程中将码本的嵌入与离散策略的表示进行对齐。在具有挑战性的混合动作空间基准测试中,CHDP的成功率比最先进的方法最多高出19.3%。
D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding
中文标题:D5P4:用于并行离散扩散解码多样性的分区行列式点过程
作者:Jonathan Lys, Vincent Gripon, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene, Bastien Pasdeloup
Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard autoregressive search procedures, such as beam search, do not directly apply to iterative denoising, where hypotheses are complete intermediate sequences rather than left-to-right prefixes. Furthermore, existing diffusion decoding procedures only provide limited control over the diversity and coverage of retained hypotheses. In this work, we introduce D5P4, a beam-style decoding method tailored to discrete diffusion models, which casts intermediate beam selection as MAP inference under a partitioned Determinantal Point Process. This yields a model-internal batch objective that balances quality and diversity without external verifiers. Experiments on open-ended generation, question answering, and mathematical reasoning show that D5P4 improves diversity and pass@k coverage while matching or surpassing baseline quality and fidelity.
离散扩散模型是文本生成中自回归方法的有前景替代方案,但其解码方法仍缺乏深入研究。标准自回归搜索程序(如束搜索)无法直接应用于迭代去噪过程,因为假设是完整的中间序列而非从左到右的前缀。此外,现有的扩散解码程序对保留假设的多样性和覆盖率控制有限。本工作中,我们提出了D5P4,这是一种专为离散扩散模型设计的束式解码方法,将中间束选择表述为分区行列式点过程下的最大后验推理。这产生了一个模型内部的批目标函数,可在无需外部验证器的情况下平衡质量与多样性。在开放式生成、问答和数学推理任务上的实验表明,D5P4在匹配或超越基线质量和保真度的同时,提升了多样性和pass@k覆盖率。
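To make the idea of "beam selection as MAP inference under a DPP" concrete, here is a minimal sketch of greedy MAP selection under a plain (unpartitioned) DPP with a quality-times-similarity kernel. The abstract does not specify the partition structure, the kernel, or the inference routine used by D5P4, so the kernel form `L[i][j] = q_i * q_j * cos_sim(f_i, f_j)`, the greedy procedure, and all function names below are illustrative assumptions rather than the authors' method.

```python
import math


def log_det(matrix):
    """Log-determinant of a small symmetric matrix via Cholesky; -inf if not positive definite."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 1e-12:
                    return float("-inf")
                chol[i][i] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def dpp_kernel(qualities, features):
    """Assumed kernel: L[i][j] = q_i * q_j * cosine_similarity(f_i, f_j)."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    units = [unit(f) for f in features]
    n = len(qualities)
    return [
        [qualities[i] * qualities[j] * sum(a * b for a, b in zip(units[i], units[j]))
         for j in range(n)]
        for i in range(n)
    ]


def greedy_dpp_map(kernel, k):
    """Greedily add the candidate that most increases log det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_score = None, float("-inf")
        for cand in range(len(kernel)):
            if cand in selected:
                continue
            idx = selected + [cand]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            score = log_det(sub)
            if score > best_score:
                best, best_score = cand, score
        if best is None:  # no remaining candidate keeps the submatrix positive definite
            break
        selected.append(best)
    return selected
```

With qualities `[1.0, 0.95, 0.6]` and features `[[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]`, the second hypothesis is a near-duplicate of the first, so selecting two items keeps indices 0 and 2 even though index 1 has higher individual quality. This is the quality-diversity trade-off the determinant encodes.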
CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data
中文标题:CountsDiff:面向自然数的扩散模型及其在计数数据生成与插补中的应用
作者:Renzo G. Soatto, Anders Hoel, Greycen Ren, Shorna Alam, Stephen Bates, Nikolaos P. Daskalakis, Caroline Uhler, Maria Skoularidou
Diffusion models have excelled at generative tasks for both continuous and token-based domains, but their application to discrete ordinal data remains underdeveloped. We present CountsDiff, a diffusion framework designed to model distributions on the natural numbers. CountsDiff extends the Blackout diffusion framework by simplifying its formulation through a direct parameterization in terms of a survival probability schedule and an explicit loss weighting. This introduces flexibility through design parameters with direct analogues in existing diffusion modeling frameworks. Beyond this reparameterization, CountsDiff introduces features from modern diffusion models, previously absent in counts-based domains, including continuous-time training, classifier-free guidance, and churn/remasking reverse dynamics that allow non-monotone reverse trajectories. We propose an initial instantiation of CountsDiff and validate it on natural image datasets (CIFAR-10, CelebA), exploring the effects of the introduced design parameters in a complex, well-studied, and interpretable data domain. We then highlight biological count assays as a natural use case, evaluating CountsDiff on single-cell RNA-seq imputation in fetal and heart cell atlases. Remarkably, we find that even this simple instantiation matches or surpasses the performance of a state-of-the-art discrete generative model and leading scRNA-seq imputation methods, while leaving substantial headroom for further gains through optimized design choices in future work.
扩散模型在连续域和基于token的生成任务中表现出色,但其在离散有序数据上的应用仍欠成熟。本研究提出CountsDiff,一个专门设计用于建模自然数分布的扩散框架。CountsDiff通过直接参数化生存概率调度并引入显式损失加权,简化了Blackout扩散框架的表述形式。这使得该框架能够通过设计参数获得灵活性,这些参数与现有扩散建模框架中的参数具有直接的对应关系。除这一重新参数化外,CountsDiff还引入了现代扩散模型的特征,这些特征在基于计数的领域中此前并不存在,包括连续时间训练、无分类器引导以及允许非单调反向轨迹的churn/重掩码反向动力学。本研究提出了CountsDiff的初始实现,并在自然图像数据集(CIFAR-10、CelebA)上对其进行验证,在这一复杂、研究充分且可解释的数据领域中探索所引入的设计参数的效果。随后,本研究将生物计数分析作为其自然应用场景,在胎儿和心脏细胞图谱的单细胞RNA-seq插补任务上评估CountsDiff。值得注意的是,即使是这个简单的实现,也已达到或超越最先进的离散生成模型和主流scRNA-seq插补方法的性能,同时为未来通过优化设计选择实现进一步提升留下了相当大的空间。
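A "survival probability schedule" on count data can be illustrated with binomial thinning: each unit of a count independently survives the forward corruption with probability p(t), so counts can only shrink toward zero. The sketch below assumes an exponential schedule `p(t) = exp(-rate * t)`; the schedule, the function names, and the `rate` parameter are hypothetical choices for illustration, and the abstract does not say which schedule or loss weighting CountsDiff actually uses.

```python
import math
import random


def survival_probability(t, rate=1.0):
    """Assumed schedule: probability that a single unit is still alive at time t."""
    return math.exp(-rate * t)


def thin_counts(counts, p_survive, rng):
    """Binomial thinning: every unit of every count survives independently with p_survive."""
    return [sum(1 for _ in range(c) if rng.random() < p_survive) for c in counts]
```

For example, `thin_counts([5, 0, 12, 3], survival_probability(0.7), random.Random(0))` returns a vector whose entries never exceed the originals; at `p_survive = 1.0` the data are untouched and at `p_survive = 0.0` every count is blacked out to zero.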
Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows
中文标题:多模态扩散Transformer中基于限制不安全信息流的统一安全上下文图像生成
作者:Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Mi Wen, Min Yang
Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator (UVR), a training-free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a task-independent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accurately localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance by achieving 91% and 77% erase rate in image synthesis and editing tasks, while preserving visual quality and fidelity with minimal degradation. Code is available at https://github.com/deng12yx/UVR.
配备多模态注意力(MM-Attn)的扩散Transformer(DiT)已成为图像生成的主导范式。然而,防止生成有害内容仍然是一个关键挑战,特别是在图像到图像(I2I)编辑任务中。现有的安全机制主要针对文本到图像(T2I)合成或基于U-Net的架构设计,这限制了它们在基于DiT的统一安全缓解中的有效性。为了弥补这一差距,我们提出了统一视觉安全调节器(UVR),一种无需训练的安全生成框架,用于调节生成图像中的不安全语义。UVR基于从MM-Attn信息流角度对注意力动态的分析。我们识别出一个任务无关的启动阶段,在此阶段输出patch中的不安全语义迅速出现并可准确定位,随后是任务特定的语义放大和干扰阶段,在此阶段有害信号被进一步传播并与良性内容纠缠。基于这些观察,UVR通过统一的针对性注意力调节和对已识别的不安全输出patch的有害信息流的显式限制来缓解不安全生成。在各种概念上的实验表明,UVR通过在图像合成和编辑任务中分别实现91%和77%的擦除率,同时以最小的退化保持视觉质量和保真度,达到了最先进的安全性能。代码可访问https://github.com/deng12yx/UVR。
ARAPDiffusion: ARAP Regularization for Diffusion-Based Deformable Shape Space Learning
中文标题:ARAPDiffusion:基于ARAP正则化的扩散可变形形状空间学习
作者:Haibo Liu, Jinghan Ke, Haitao Yang, Xiangru Huang, Georgios Pavlakos, Qixing Huang
This paper introduces ARAPDiffusion, a latent diffusion model to learn the underlying continuous shape space of a deformation shape collection. The key innovation is in injecting the as-rigid-as-possible (ARAP) deformation model as regularization losses into latent diffusion (LD), releasing the requirement of having abundant 3D training data for learning generative models. In contrast to the standard LD, we show how the ARAP model can be used to improve both the encoder/decoder and the LD model. The training procedure alternates between using the synthetic distribution defined by the LD model to develop a regularization loss that enhances the shape encoder/decoder and using the shape decoder to develop a regularization loss to improve the LD model. We also show the benefit of the LD paradigm in combining a representation-free LD process and an implicit shape decoder that is applicable to unorganized point clouds. The experimental results of unconditional and conditional shape generation demonstrate the advantages of ARAPDiffusion over baseline approaches.
本文提出了ARAPDiffusion,一个用于学习变形形状集合底层连续形状空间的潜在扩散模型。其核心创新在于将尽可能刚性(ARAP)变形模型作为正则化损失注入潜在扩散(LD)模型中,从而降低了对大量3D训练数据的需求。与标准LD模型不同,本文展示了ARAP模型如何同时改进编码器/解码器和LD模型。训练过程在两种模式间交替进行:其一,利用LD模型定义的合成分布构建正则化损失以增强形状编码器/解码器;其二,利用形状解码器构建正则化损失以改进LD模型。本文还展示了LD范式在结合无表征约束的LD过程与适用于无组织点云的隐式形状解码器方面的优势。无条件和条件形状生成的实验结果表明,ARAPDiffusion相对于基线方法具有明显优势。
LUCID: Learning Unified Control for Image Deflaring and Exposure Mastery in Nighttime Photography
中文标题:LUCID:面向夜间摄影的图像去眩光与曝光控制统一学习框架
作者:Tingyu Yang, Yuan Cheng, Xiaoyun Yuan
Photography is the art of painting with light, yet nighttime scenes are shaped by competing degradations: intense flares obscure scene structure, while photon-limited regions collapse into noise. Conventional approaches address these factors in isolation, overlooking the fact that these degradations are fundamentally entangled. To bridge this gap, we introduce LUCID, a unified framework that reframes nighttime restoration as a continuous and controllable process rather than a fixed correction. We decompose nighttime restoration into two cooperative components: a flare disentanglement module that lifts the 'curtain' of optical artifacts to provide reliable structural guidance, and a diffusion-driven module that leverages generative priors to reconstruct clean and well-exposed imagery. Crucially, LUCID introduces explicit controllability through a novel four-mode training strategy, enabling users to steer the restoration process via classifier-free guidance (CFG) and allowing selective control over light sources and their associated flare and ghosting artifacts, while also supporting high dynamic range (HDR) reconstruction through continuous exposure control. Extensive experiments demonstrate that LUCID consistently outperforms state-of-the-art methods across diverse real-world nighttime scenarios.
摄影是光影绘画的艺术,然而夜间场景受到多种退化的共同影响:强烈的眩光遮挡场景结构,而光子受限区域则退化为噪声。传统方法孤立处理这些因素,忽略了这些退化从根本上相互纠缠的事实。为弥补这一差距,我们提出LUCID,一个将夜间恢复重新定义为连续且可控过程而非固定校正的统一框架。我们将夜间恢复分解为两个协同组件:一个眩光解耦模块,用于揭开光学伪影的“幕布”以提供可靠的结构指导;以及一个扩散驱动模块,利用生成先验重建清晰且曝光良好的图像。关键在于,LUCID通过创新的四模式训练策略引入了显式的可控性,使用户能够通过无分类器引导(CFG)来控制恢复过程,实现对光源及其相关眩光和重影伪影的选择性控制,同时支持通过连续曝光控制进行高动态范围(HDR)重建。大量实验表明,LUCID在多种真实夜间场景中始终优于现有最先进的方法。
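Classifier-free guidance, which the abstract names as the steering mechanism, combines a conditional and an unconditional noise prediction at sampling time. The sketch below shows only that standard combination rule on plain lists of floats; how LUCID's four training modes map onto the conditional and unconditional branches is not described in the abstract, so the function name and inputs are assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A scale of 0 reproduces the unconditional prediction, 1 reproduces the conditional one, and values above 1 push further in the direction of the condition, which is what gives the user a continuous control knob.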
ForensicConcept: Transferable Forensic Concepts for AIGI Detection
中文标题:ForensicConcept:用于AI生成图像检测的可迁移取证概念
作者:Menyanshu Zhou, Ziyin Zhou, Ke Sun, Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
AI-generated image detectors achieve high accuracy on in-distribution data but often fail on unseen generators. A key obstacle to understanding this failure is the black-box nature of current detectors: they do not reveal which evidence drives their decisions. We propose ForensicConcept, a framework that extracts explicit forensic concepts from detectors and enables their transfer across backbones. Our method localizes decision-critical patches via Transformer attribution, clusters them into a compact concept codebook, and uses a concept-aligned projection to produce auditable evidence readouts. Motivated by prior studies showing that DINO representations can guide diffusion generation and exhibit concept-level correspondence with diffusion features, we introduce a generation-trace reference based on CleanDIFT diffusion features and quantify backbone-trace alignment via neighborhood-structure consistency (CKNNA). We further propose concept codebook injection to transfer diffusion-derived concepts into target backbones. Experiments on GenImage, GAN-family, and Chameleon benchmarks show consistent improvements over prior methods. We also find that CKNNA alignment predicts transfer effectiveness, providing a principled explanation for why some backbones yield more transferable forensic evidence than others.
AI生成图像检测器在分布内数据上能达到较高的准确率,但在未见过的生成器上往往失效。理解这种失效的一个关键障碍在于当前检测器的黑箱性质:它们无法揭示驱动其决策的证据。我们提出了ForensicConcept,一个从检测器中提取显式取证概念并实现跨骨干网络迁移的框架。我们的方法通过Transformer归因定位决策关键补丁,将其聚类为紧凑的概念码本,并使用概念对齐投影生成可审计的证据读数。已有研究表明DINO表示可以引导扩散生成,并在概念层面与扩散特征存在对应关系;受此启发,我们引入了基于CleanDIFT扩散特征的生成痕迹参考,并通过邻域结构一致性(CKNNA)来量化骨干网络与生成痕迹之间的对齐程度。我们进一步提出概念码本注入,以将扩散衍生的概念迁移到目标骨干网络。在GenImage、GAN-family和Chameleon基准数据集上的实验表明,我们的方法相较于先前方法取得了一致的改进。我们还发现CKNNA对齐能够预测迁移效果,从而为某些骨干网络为何能产生更具可迁移性的取证证据提供了有原则的解释。
TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation
中文标题:TrioPose:用于姿态引导文本到图像生成的原生三流扩散变换器
作者:Dian Gu, Zhengyi Yang
Pose-guided text-to-image generation often suffers from limb distortions and feature crosstalk in complex multi-person scenarios. While existing UNet-based adapters struggle with long-range spatial dependencies, emerging Multimodal Diffusion Transformers (MM-DiTs) offer superior global modeling. However, naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions. To address this, we propose TrioPose, a native pose-driven framework built upon the SD3.5M architecture. Specifically, we introduce a Triple-Stream Pose-Aware DiT (TSPA-DiT) that treats pose as an independent modality. It employs layer-wise activation and zero-initialized dual-residual injection to smoothly enforce geometric constraints while preserving pre-trained latent stability. To resolve severe multi-instance occlusions, we design a Learnable Relational Bias Mask that categorizes topological connectivity into fine-grained physical states, mapping them into continuous attention soft constraints to effectively decouple inter-instance interference. Furthermore, a Pose-Guided Spatial Loss Weighting strategy modulates the native diffusion objective using heatmap-derived error maps, focusing anatomical supervision strictly on distortion-prone regions. Extensive experiments demonstrate that TrioPose achieves state-of-the-art performance across challenging benchmarks, including Human-Art, CrowdPose, and OCHuman. Notably, it attains an AP of 64.33 on Human-Art, representing a 30% improvement over prior arts, while setting new standards for visual fidelity and text-image semantic alignment in complex multi-human generation.
姿态引导的文本到图像生成在复杂多人场景中经常遭受肢体扭曲和特征串扰的困扰。尽管现有的基于UNet的适配器难以处理长距离空间依赖,但新兴的多模态扩散变换器(MM-DiTs)提供了更优越的全局建模能力。然而,MM-DiTs中的朴素信号连接严重扰乱了预训练的潜在分布。针对这一问题,我们提出了TrioPose,这是一个基于SD3.5M架构构建的原生姿态驱动框架。具体而言,我们引入了三流姿态感知扩散变换器(TSPA-DiT),将姿态视为独立模态。它采用分层激活和零初始化的双残差注入机制,在保持预训练潜在稳定性的同时平滑地强制执行几何约束。为了解决严重的多实例遮挡问题,我们设计了可学习关系偏置掩码,将拓扑连通性分类为细粒度的物理状态,并将其映射为连续的注意力软约束,以有效解耦实例间干扰。此外,姿态引导空间损失加权策略利用热图派生的误差图来调制原生扩散目标,将解剖学监督严格聚焦于易扭曲区域。大量实验表明,TrioPose在包括Human-Art、CrowdPose和OCHuman在内的挑战性基准测试中达到了最先进的性能。值得注意的是,它在Human-Art上达到了64.33的AP,比现有方法提高了30%,同时在复杂多人生成中树立了视觉保真度和文本图像语义对齐的新标准。
AsyncPatch Diffusion: spatially-flexible image generation
中文标题:AsyncPatch Diffusion:空间灵活图像生成
作者:Samuele Papa, Valentin De Bortoli, Guillaume Couairon, Daniel Sýkora, Romuald Elie, Klaus Greff
Standard diffusion models corrupt an entire sample with a single shared noise level, forcing all spatial regions to follow the same denoising trajectory. We introduce AsyncPatch Diffusion, a joint-diffusion framework that assigns distinct noise levels to different input dimensions, such as image pixels, or latent tokens. We show how this asynchronous corruption defines a valid generative process while supporting a richer family of spatially heterogeneous denoising trajectories, and prove the first valid ELBO for this process. We show that a single pretrained model can perform spatially adaptive generation, where different regions are denoised on different schedules. A key challenge is training: naive independent noise-level sampling overemphasizes highly heterogeneous configurations and underrepresents homogeneous noise levels, that are crucial during sampling. We address this with a controlled noise-level sampler that regulates both the average corruption level and its spatial variability. AsyncPatch achieves generation quality comparable to conventional diffusion on ImageNet 256 and LSUN, while being natively suited for inpainting without task-specific fine-tuning. We further introduce input guidance, which uses clean or partially corrupted regions to guide the generation of unknown regions, improving local consistency and texture matching. Finally, we demonstrate adaptive generation strategies including uncertainty-guided acceleration and autoregressive sampling.
标准扩散模型使用单一共享噪声水平破坏整个样本,迫使所有空间区域遵循相同的去噪轨迹。本研究提出AsyncPatch Diffusion,一种联合扩散框架,可为不同输入维度(如图像像素或潜在token)分配不同的噪声水平。本研究证明了这种异步破坏定义了一个有效的生成过程,同时支持更丰富的空间异构去噪轨迹族,并首次为该过程证明了有效的ELBO(证据下界)。本研究展示了单个预训练模型可执行空间自适应生成,其中不同区域以不同调度进行去噪。训练中的一个关键挑战是:朴素的独立噪声水平采样会过度强调高度异构的配置,而使采样阶段至关重要的同质噪声水平代表性不足。本研究通过一种受控噪声水平采样器来解决这一问题,该采样器同时调节平均破坏水平及其空间变异性。AsyncPatch在ImageNet 256和LSUN上实现了与常规扩散模型相当的生成质量,同时天然适用于图像修复而无需任务特定的微调。本研究进一步引入输入引导,利用干净或部分破坏的区域来引导未知区域的生成,从而改善局部一致性和纹理匹配。最后,本研究展示了自适应生成策略,包括不确定性引导加速和自回归采样。
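The contrast between one shared noise level and per-patch noise levels, and the idea of a sampler that controls both the average level and its spatial spread, can be sketched in a few lines. Everything here is a simplification under stated assumptions: the uniform perturbation around `mean_level`, the clipping to [0, 1], and the linear corruption `x_t = (1 - t) * x + t * noise` are illustrative choices, not the sampler or forward process defined in the paper.

```python
import random


def sample_noise_levels(num_patches, mean_level, spread, rng):
    """Per-patch noise levels centred on mean_level; spread=0 recovers one shared level."""
    levels = []
    for _ in range(num_patches):
        level = mean_level + spread * (2.0 * rng.random() - 1.0)
        levels.append(min(1.0, max(0.0, level)))  # keep levels inside [0, 1]
    return levels


def corrupt_patches(patches, levels, rng):
    """Corrupt each patch with its own level using an assumed linear interpolation to noise."""
    noisy = []
    for patch, t in zip(patches, levels):
        noisy.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in patch])
    return noisy
```

Setting `spread` to 0 yields the homogeneous configurations that conventional sampling relies on, while larger values produce the heterogeneous ones; a patch with level 0 stays clean, which is how known regions can be held fixed for inpainting-style use.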
Consistent-Inversion: Reverse Consistency Guidance for Structure-Preserving Visual Editing
中文标题:Consistent-Inversion:用于保持结构的视觉编辑的反向一致性指导框架
作者:Xiaocheng Lu, Jingcai Guo, Song Guo
Text-guided diffusion models have become effective tools for real-image visual editing, where the edited image must follow a target instruction while preserving editing-irrelevant structure. Most training-free editors rely on inversion: a source image is mapped to a noisy latent trajectory and the terminal latent is reused for target-prompt denoising. This reuse is useful for preservation, but it also couples source reconstruction and target editing. The resulting trajectory mismatch may either damage background/layout details or over-constrain the intended edit. This paper presents Consistent-Inversion, a training-free reverse consistency guidance framework for structure-preserving visual editing. Instead of treating the inverted source latent as a fixed initialization, Consistent-Inversion checks whether an intermediate target trajectory can be reversed toward the source inversion trajectory under the source prompt. To make this check well-defined, we construct an auxiliary target-side noise representation, perform source-guided reverse denoising, and use the resulting reverse consistency discrepancy as a correction signal for selected early target denoising steps. The method does not update model parameters, is compatible with inversion-based editors, and introduces only a small inference overhead when applied sparsely. Experiments on PIE-Bench show that Consistent-Inversion improves background and structural fidelity under a unified SD3.5 protocol while maintaining target-prompt alignment, and compatibility experiments further verify the same correction principle on classical Stable-Diffusion inversion pipelines.
文本引导的扩散模型已成为真实图像视觉编辑的有效工具,其中编辑后的图像必须遵循目标指令,同时保持与编辑无关的结构。大多数无需训练的编辑方法依赖于反演技术:将源图像映射到噪声潜在轨迹,并将终端潜在变量重用于目标提示的去噪。这种重用有助于保持原始结构,但它也将源图像重建与目标编辑耦合起来。产生的轨迹不匹配可能会破坏背景/布局细节,或过度约束预期的编辑效果。本文提出了 Consistent-Inversion,这是一个无需训练的反向一致性指导框架,用于保持结构的视觉编辑。Consistent-Inversion 不将反演的源潜在变量视为固定初始化,而是检查中间目标轨迹是否可以在源提示下反向追踪到源反演轨迹。为使这一检查具有良好的定义,我们构建了一个辅助的目标侧噪声表示,执行源引导的反向去噪,并将产生的反向一致性差异作为选定早期目标去噪步骤的校正信号。该方法不需要更新模型参数,与基于反演的编辑方法兼容,并且在稀疏应用时仅引入较小的推理开销。在 PIE-Bench 上的实验表明,Consistent-Inversion 在统一的 SD3.5 协议下提高了背景和结构保真度,同时保持了目标提示对齐,兼容性实验进一步验证了相同校正原则在经典 Stable-Diffusion 反演流程中的有效性。
DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation
中文标题:DisPOSE:用于自监督多视角3D人体姿态估计的投影多随机(polystochastic)扩散
作者:Tony Danjun Wang, Tolga Birdal, Nassir Navab
Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.
从不同相机视角恢复多人的3D人体姿态是分析交互行为的基础瓶颈。现有的自监督方法利用3D姿态的合成目录;然而,由于分布偏移,这导致在真实场景中的泛化能力较差。因此,我们提出了DisPOSE,一个自监督框架,将固有的离散多视角人物匹配问题近似为多随机(polystochastic)张量空间上的生成扩散过程。通过在去噪过程中采用可微Sinkhorn投影,我们的模型学习基于2D图像先验引导解向有效且可行的匹配方向收敛。随后,使用超图卷积解码器回归已定位个体的完整3D骨架,该解码器显式建模多视角间的关联结构和关节连接。所提方法在标准数据集上优于当前最先进的自监督方法,并在来自手术室高遮挡场景的新基准上展现出强劲性能。我们基于扩散的定位方法表现出极高的标签效率,仅使用10%的伪标签即可保留99%的性能。值得注意的是,在保持可微性的同时解耦匹配和根节点回归组件,使DisPOSE对不同的相机布置几乎不敏感。
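A Sinkhorn projection turns a matrix of assignment scores into a soft assignment whose rows and columns each sum to one, using only differentiable operations. The sketch below covers the two-view (matrix) case; polystochastic tensors generalise this to more than two modes, which is not shown. The function name, the `temperature` parameter, and the fixed iteration count are assumptions for illustration, not details taken from the paper.

```python
import math


def sinkhorn_project(scores, num_iters=50, temperature=1.0):
    """Approximately project a square score matrix onto the doubly stochastic matrices
    by alternating row and column normalisation."""
    n = len(scores)
    # Exponentiate (shifted by the max for numerical stability) so all entries are positive.
    peak = max(max(row) for row in scores)
    mat = [[math.exp((s - peak) / temperature) for s in row] for row in scores]
    for _ in range(num_iters):
        row_sums = [sum(row) for row in mat]
        mat = [[mat[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat
```

Lowering `temperature` sharpens the result toward a hard permutation, which is one common way to move from a relaxed assignment to a discrete one.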
Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers
中文标题:实时注意力弯曲器:视频扩散变换器的粒度化交互式网络弯曲
作者:Adam Cole, Mick Grierson
Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call "material intimacy" with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model's default representational space.
生成式视频模型已取得显著的视觉保真度,但其仅支持提示词的界面提供有限的创意控制力,并使艺术家难以洞察模型的物质化过程。我们提出实时注意力弯曲器(Real-Time AttentionBender),该工具将网络弯曲实践拓展至视频扩散变换器(DiT)的完整深度,并实现实时交互式生成。作为DayDream Scope生态系统内的插件构建,并封装开源的实时Wan流水线,该工具将自注意力、交叉注意力和前馈网络作为独立可操作的表面进行暴露,支持精细化至单个扩散步骤、DiT层、提示词标记和隐藏神经元的定向。实时操控的即时性赋予了我们所称的与模型的“物质亲密感”:一种响应式的、近乎机械性的感知,让人们理解特定层和神经元如何塑造生成的视频。我们将该工具定位为同时作为可解释人工智能与艺术(XAIxArts)探索变换器内部机制的探针,以及在模型默认表示空间之外发现美学表现的表达性乐器。
Semantic-Structural Alignment for Generative Pictorial Charts
中文标题:生成式图示图表的语义-结构对齐
作者:Zhida Sun, Yulin Zhang, Zheng Gu, Min Lu, Bongshin Lee, Daniel Cohen-Or, Hui Huang
Traditional statistical graphics are precise but often lack the visual appeal, memorability, and engagement of pictorial charts. We present a generative framework for the automated synthesis of pictorial charts that bridges the gap between semantic expression and structural faithfulness. Rather than treating charts merely as images to be stylized, we frame the problem as a dual-conditioned generation task guided by two parallel external control signals: a text prompt capturing the semantic context of the editing intent, and a context image providing the abstract statistical chart's global structure. To reinforce these controls within a Multi-Modal Diffusion Transformer, we introduce two complementary feature-level mechanisms: structural alignment to anchor spatial layouts to the input chart, and semantic alignment to transfer expressive textures from reference images. Generalizing across major visual channels (i.e., length, area, angle, and position) and diverse semantic domains, our method produces pictorial charts that are both artistically compelling and structurally consistent. Extensive quantitative evaluations and perceptual user studies demonstrate that our framework outperforms traditional controllable generation and image editing baselines, providing a foundation for high-fidelity, data-driven generative modeling in expressive visual storytelling. Project page: https://ssalign.github.io/.
传统统计图形虽然精确,但往往缺乏图示图表的视觉吸引力、可记忆性和参与感。本研究提出了一种用于自动合成图示图表的生成式框架,旨在弥合语义表达与结构保真度之间的鸿沟。我们不再将图表仅视为需要风格化的图像,而是将其定义为一个双条件生成任务,由两个并行的外部控制信号引导:一个文本提示用于捕捉编辑意图的语义上下文,一个上下文图像用于提供抽象统计图表的全局结构。为在多模态扩散Transformer中强化这些控制,我们引入了两种互补的特征级机制:结构对齐用于将空间布局锚定到输入图表,语义对齐用于从参考图像传递表达性纹理。我们的方法能够跨越主要视觉通道(即长度、面积、角度和位置)以及多样化的语义域进行泛化,生成既具有艺术吸引力又保持结构一致性的图示图表。广泛的定量评估和感知用户研究表明,我们的框架优于传统的可控生成和图像编辑基线方法,为表达性视觉叙事中的高保真、数据驱动生成式建模提供了基础。项目页面:https://ssalign.github.io/。
RISE: Single Static Radar-based Indoor Scene Understanding
中文标题:RISE:基于单静态毫米波雷达的室内场景理解
作者:Kaichen Zhou, Laura Dodds, Sayed Saad Afzal, Fadel Adib
Robust and privacy-preserving indoor scene understanding remains a fundamental open problem. While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments. In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult. We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection. RISE is built upon the key insight that multipath reflections, traditionally treated as noise, encode rich geometric cues. To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections and reveal invisible structures. On top of these enhanced observations, a simulation-to-reality Hierarchical Diffusion framework transforms fragmented radar responses into complete layout reconstruction and object detection. Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to single, static, radar-based indoor scene understanding. Extensive experiments show that RISE reduces the Chamfer Distance by 60% (down to 16 cm) compared to the state of the art in mmWave layout reconstruction, and delivers the first mmWave-based object detection, achieving 58% IoU. These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar. Our website and code are available at https://rise-cvpr.github.io.
鲁棒且保护隐私的室内场景理解仍然是一个基本的开放性问题。光学传感器(如RGB相机和LiDAR)虽然具有较高的空间保真度,但容易受到严重遮挡,并在室内环境中引入隐私风险。相比之下,毫米波(mmWave)雷达可以保护隐私并穿透障碍物,但其固有的低空间分辨率使得可靠的几何推理变得困难。我们提出了RISE,这是首个针对单静态雷达室内场景理解的基准测试和系统,同时实现布局重建和目标检测。RISE基于一个关键洞察构建:多径反射——传统上被视为噪声——蕴含着丰富的几何线索。为了利用这一点,我们提出了双角多径增强方法,明确建模到达角和发射角以恢复次级(伪影)反射并揭示不可见结构。在此增强观测的基础上,仿真到现实的分层扩散框架将碎片化的雷达响应转化为完整的布局重建和目标检测。我们的基准测试包含在100条真实室内轨迹上采集的50,000帧数据,形成了首个专门用于单静态雷达室内场景理解的大规模数据集。大量实验表明,与毫米波布局重建的最新技术相比,RISE将Chamfer距离降低了60%(降至16厘米),并实现了首个基于毫米波的目标检测,达到58%的IoU。这些结果使RISE成为使用单静态雷达进行几何感知和保护隐私的室内场景理解的新基础。我们的网站和代码可在 https://rise-cvpr.github.io 获取。
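The layout result above is reported as a Chamfer Distance in centimetres. As a reminder of what that metric measures, here is a minimal sketch of one common symmetric variant: the average nearest-neighbour Euclidean distance taken in both directions. Other variants use squared distances or sums instead of means, and the abstract does not say which one RISE reports, so treat the exact definition below as an assumption.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, averaged over both directions."""
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))
```

The result carries the units of the input coordinates, so point sets expressed in centimetres give a distance in centimetres.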
Measurement-Consistent Langevin Corrector for Stabilizing Latent Diffusion Inverse Problem Solvers
中文标题:用于稳定潜在扩散逆问题求解器的测量一致性朗之万校正器
作者:Lee Hyoseok, Sohwi Lim, Eunju Cha, Tae-Hyun Oh
While latent diffusion models (LDMs) have emerged as powerful priors for inverse problems, existing LDM-based solvers frequently suffer from instability. In this work, we first identify the instability as a discrepancy between the solver dynamics and stable reverse diffusion dynamics learned by the diffusion model, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play stabilization module that remedies the LDM-based inverse problem solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often fail to hold in latent space, MCLC provides a principled stabilization mechanism, leading to more stable and reliable behavior in latent space.
虽然潜在扩散模型已成为逆问题的强大先验,但现有的基于潜在扩散模型的求解器经常遭受不稳定问题的困扰。在本工作中,我们首先将这种不稳定性识别为求解器动力学与扩散模型学习的稳定逆向扩散动力学之间的差异,并表明缩小这种差距可以稳定求解器。在此基础上,我们引入了测量一致性朗之万校正器(Measurement-Consistent Langevin Corrector,MCLC),这是一种理论上合理且即插即用的稳定模块,通过测量一致的朗之万更新来改进基于潜在扩散模型的逆问题求解器。与先前依赖线性流形假设的方法(这些假设在潜在空间中往往不成立)相比,MCLC提供了一种原则性的稳定机制,从而在潜在空间中实现更稳定和更可靠的行为。
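To give a feel for what a Langevin update with a measurement term looks like, the toy sketch below runs generic Langevin dynamics on a one-dimensional problem: a standard normal prior stands in for the learned diffusion prior, and the measurement is `y = a * z` plus Gaussian noise of standard deviation `sigma`. This is not the MCLC algorithm; the analytic prior score, the scalar measurement model, and the `temperature` switch are all assumptions made so the example is self-contained.

```python
import math
import random


def langevin_step(z, y, a, sigma, step, rng, temperature=1.0):
    """One Langevin update on the toy posterior p(z | y) with prior N(0, 1)."""
    prior_score = -z                                  # gradient of log N(z; 0, 1)
    likelihood_score = a * (y - a * z) / sigma ** 2   # gradient of log N(y; a*z, sigma^2)
    noise = math.sqrt(2.0 * step * temperature) * rng.gauss(0.0, 1.0)
    return z + step * (prior_score + likelihood_score) + noise
```

With `a = 1`, `sigma = 0.5`, and `y = 2`, the posterior mean is `a * y / (a**2 + sigma**2) = 1.6`; setting `temperature=0` removes the noise and the iterates settle at that value, while `temperature=1` samples around it.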
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
中文标题:Rein3D:基于全景视频扩散模型的增强式3D室内场景生成
作者:Dehui Wang, Rong Wei, Yue Shi, Congsheng Xu, Shoufa Chen, Dingxiang Luo, Tianshuo Yang, Xiaokang Yang, Wei Sui, Yusen Qin, Rui Tang, Yao Mu
The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a "restore-and-refine" paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.
具身AI和VR应用日益增长的需求凸显了从稀疏输入合成高质量3D室内场景的重要性。然而,现有方法在推断大面积未知区域中大量缺失几何信息的同时难以保持全局一致性,往往产生局部合理但全局不一致的重建结果。我们提出了Rein3D框架,通过将显式3D高斯溅射(3DGS)与视频扩散模型的时间一致性先验相结合来重建完整的360度室内环境。我们的方法遵循“恢复-精炼”范式:我们采用径向探索策略,从原点开始沿轨迹渲染不完善的全景视频,从而有效地从粗粒度3DGS初始化中揭示被遮挡区域。这些序列通过全景视频到视频扩散模型进行恢复,并进一步通过视频超分辨率增强,以合成高保真几何和纹理。最终,这些精炼后的视频作为伪真实值来更新全局3D高斯场。为支持此任务,我们构建了PanoV2V-15K数据集,包含超过15000对清洁和降质的全景视频,用于基于扩散的场景恢复。实验表明,Rein3D能够生成逼真且全局一致的3D场景,并在长距离相机探索方面相较于现有基线方法有显著提升。
Physics Guided Conditional Diffusion Framework for Generative Inverse Design of Manufacturable Metasurface based Absorbers
中文标题:基于物理引导的条件扩散框架用于可制造超表面吸收器的生成式逆向设计
作者:Vineetha Joy, Jamshed Palai, Satwik Sahu, Anshuman Kumar, Amit Sethi, Hema Singh
Inverse design of metasurfaces under continuous electromagnetic (EM) constraints requires generating geometries that simultaneously satisfy stringent spectral specifications and remain manufacturable. Conventional approaches based on iterative full-wave simulations are computationally prohibitive for large design spaces, while existing generative models often suffer from poor conditional controllability and limited fabrication awareness. To address this, we propose a physics-guided, condition-quality-enhanced diffusion framework for the inverse design of metasurface-based absorbers. Fabrication-aware constraints are incorporated to ensure practical realizability of the generated designs. The framework introduces a conditioning mechanism for continuous spectral specifications, in which feature-wise linear modulation propagates the condition across the denoising hierarchy, enabling stable and accurate generation with improved spectral controllability. Further, to embed EM consistency directly into the generative learning process, a pre-trained surrogate EM simulator is integrated within the diffusion training pipeline. The proposed framework generates physically realizable metasurface designs for diverse reflection characteristics in the frequency range of 2 to 18 GHz, achieving a very low average spectral mean squared error of 0.0006 and a high band alignment accuracy of 0.958. The framework also addresses the fundamentally non-unique nature of inverse EM design by enabling structured multimodal generation of geometrically distinct yet spectrally consistent metasurface designs for the same target response. The proposed model produces a suitable design in approximately 30 seconds, whereas the conventional approach can take several months under comparable computational resources. The efficiency of the model is also established via experimental measurements.
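The feature-wise linear modulation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard FiLM form, in which a condition vector is mapped to a per-channel scale and shift that are applied to intermediate features. The helper names, tensor shapes, and toy spectrum values below are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of feature-wise linear modulation (FiLM); not the paper's code.
import random


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each feature channel with parameters predicted from the condition.

    features:  list of channels, each a list of activations
    condition: list of floats, e.g. samples of a target reflection spectrum (assumed)
    """
    gamma = linear(w_gamma, b_gamma, condition)  # one scale per channel
    beta = linear(w_beta, b_beta, condition)     # one shift per channel
    return [[g * a + b for a in channel]
            for channel, g, b in zip(features, gamma, beta)]


random.seed(0)
n_channels, n_cond = 3, 4
features = [[random.gauss(0, 1) for _ in range(5)] for _ in range(n_channels)]
condition = [0.9, 0.2, 0.1, 0.8]  # made-up spectrum samples

# Zero weights with a unit scale bias reduce FiLM to the identity.
zeros = [[0.0] * n_cond for _ in range(n_channels)]
identity = film(features, condition, zeros, [1.0] * n_channels, zeros, [0.0] * n_channels)

# Small random weights make the scale and shift depend on the condition.
w_g = [[random.gauss(0, 0.1) for _ in range(n_cond)] for _ in range(n_channels)]
w_b = [[random.gauss(0, 0.1) for _ in range(n_cond)] for _ in range(n_channels)]
modulated = film(features, condition, w_g, [1.0] * n_channels, w_b, [0.0] * n_channels)
```

In a denoising network, a layer like this would typically be applied at several depths so that the same spectral condition influences every resolution; how the paper places it is not stated in the abstract.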
Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction
Chinese title: Pixel Cube:基于扩散的逼真感人像视频重光照技术
Authors: Yufan Zhang, Yu Ji, Ayo Ajiboye, Rundi Wu, Yu Guo, Changxi Zheng, Jinwei Ye
We present a diffusion-based method for relighting dynamic portrait videos with photorealism and temporal consistency. Our method is fueled by a hybrid training dataset that consists of real-captured and rendered dynamic portrait videos with diverse subject appearances, facial motions, head poses, and known lighting conditions. Specifically, we construct an LED-based lighting system for realistic lighting emulation and high-speed video relighting data acquisition. By leveraging the image priors embedded in pre-trained video diffusion models, and using a per-frame high-dynamic-range (HDR) environment map as lighting control, we train a high-performance generative model for realistic and identity-preserving dynamic portrait video relighting. In addition to the environment map control, our model uses a synthesized background image to enable control over the camera's exposure level and color tone. Our model can produce temporally consistent relit portrait video that looks realistic and harmonious under a provided new environment and faithfully preserves the subject's expression and fine facial features, including skin tone, wrinkles, and facial hair. Our model generalizes well to unseen data in terms of subject appearance, motion, and lighting condition. We perform extensive experiments on relighting in-the-wild videos with various environment maps and demonstrate practical applications in portrait photography. Results show that our method achieves state-of-the-art performance in photorealism, lighting harmony, and temporal consistency.
Audio-Visual World Models: Grounding Multisensory Imagination for Embodied Agents
Chinese title: 视听世界模型:为具身智能体构建多感官想象
Authors: Jiahua Wang, Leqi Zheng, Jialong Wu, Yaoxin Mao, Shijie Cheng
World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains relatively underexplored. Prior work has not established a commonly adopted formulation for audio-visual world modeling under low-level action control or clarified how to jointly capture physically grounded binaural audio and visual dynamics. This work presents a unified formulation of Audio-Visual World Models (AVWM), casting multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. As a foundational step toward this problem, we construct AVW-4k, a controlled benchmark comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments on this benchmark demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in embodied navigation, demonstrating that AVWM improves a vision-language-model-guided agent in continuous audio-visual navigation.
Generalization of Diffusion Models Arises with a Balanced Representation Space
Chinese title: 平衡表示空间下扩散模型的泛化
Authors: Zekai Zhang, Xiao Li, Xiang Li, Lianghe Shi, Meng Wu, Molei Tao, Qing Qu
Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when overfit to the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized, spiky representations, whereas (ii) generalization arises when the model captures local data statistics, producing balanced representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models, with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.
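To make the two-layer ReLU denoising autoencoder concrete, here is a toy forward pass with a squared-error denoising loss. It assumes a bias-free model of the form x_hat = W_dec · relu(W_enc · x_noisy); the paper's exact parametrization is not given in the abstract. The weights are set by hand to store the training samples, purely to mimic the memorization regime described in (i). The balanced regime in (ii) is not modeled, and all names and values are hypothetical.

```python
# Toy two-layer ReLU denoising autoencoder (DAE); hand-set weights, not the paper's setup.


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def dae_forward(w_enc, w_dec, x_noisy):
    """Return the hidden code and the reconstruction of a noisy input."""
    hidden = relu(matvec(w_enc, x_noisy))
    return hidden, matvec(w_dec, hidden)


def denoising_loss(x_hat, x_clean):
    """Squared error between the reconstruction and the clean sample."""
    return sum((a - b) ** 2 for a, b in zip(x_hat, x_clean))


# Three made-up training samples in four dimensions.
train = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Assumption: encoder rows and decoder columns both store the raw training samples.
w_enc = [list(row) for row in train]
w_dec = [list(col) for col in zip(*train)]

x_clean = train[0]
noise = [-0.1, 0.1, -0.05, 0.02]  # fixed perturbation so the run is deterministic
x_noisy = [a + n for a, n in zip(x_clean, noise)]

hidden, x_hat = dae_forward(w_enc, w_dec, x_noisy)
loss = denoising_loss(x_hat, x_clean)
```

With these hand-set weights, the hidden code for a noisy copy of the first training sample is dominated by the unit that stores that sample, which is one simple way to picture a localized, spiky representation.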
No matching papers were found for this category today.