Ch8 · Multimodal Large Models · 动手学大模型 (Hands-On Large Models)

2026/05/19 13:48:30·2026/05/20 14:45:00

Chapter 8

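Multimodal Large Language Models

Large language models have made one thing clear to researchers: high-level intelligence shows up strongly in the language modality. In reasoning, planning, and creative generation, language models have performed well beyond earlier expectations. This raises a question: if language can carry this much intelligence, can it serve as a "pivot" through which vision, audio, and other modalities are integrated as well? That question is the core of research on multimodal large language models (MLLMs).

This chapter has a theory part and a practice part. The theory part surveys how MLLMs are categorized and which architectural paradigms they follow, and explains why almost all current MLLMs use an LLM as the core decision-making unit. The practice part uses NExT-GPT as a case study. Its authors present it as the first end-to-end "any-to-any" multimodal LLM: it accepts and produces arbitrary combinations of text, image, video, and audio. We look closely at its technology stack, its data construction pipeline, and its three-stage training procedure.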

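Premise

Why Is Language the Pivot?

To understand the design philosophy of MLLMs, start from one widely shared view: language intelligence is the pivot of multimodal intelligence. The claim has three parts: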

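First: language is a general-purpose representation

Language is the general symbolic system humans use to describe almost anything. We can describe the scene in an image, the emotion in a sound, the sequence of actions in a video, and even another modality itself. Because description is this general, a model that understands language deeply has a conceptual framework it can apply to other modalities. A photo can be described as "a cat dozing on a windowsill", and a piece of music as "the bass starts to swell at the 30-second mark". Such descriptions distill the content of a modality into language, and language models are naturally good at handling this kind of abstract semantic representation.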

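Second: scaling laws paid off first in the language modality

Text-only models such as GPT-2, GPT-3, and LLaMA have shown clear scaling behavior: as parameters, data, and compute grow, model capability improves markedly. The gains include "emergent" skills such as reasoning, in-context learning, and instruction following. These skills were acquired from language data alone, without images or audio. In other words, the language modality by itself is enough to train high-level cognitive abilities.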

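Third: LLMs can plan tasks and make decisions

A language model is more than a feature extractor. It can act as an agent that plans and decides: it understands complex instructions, breaks a task into steps, judges whether the current state meets the goal, and adjusts its behavior based on feedback. This role as a general problem solver is the foundation of MLLM architecture design. We do not need to train a separate "brain" for each modality. We only need to give the existing brain eyes (a visual encoder), ears (an audio encoder), and a voice (generators), and it can coordinate these abilities.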

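Taken together, these three points explain the mainstream MLLM design: a pretrained large language model is the core decision-making unit (like a brain or central processor), and external modality encoders and output decoders are attached to give the LLM multimodal perception and multimodal generation. This LLM-centric design lowers training cost and, more importantly, reuses the reasoning, planning, and decision-making abilities the LLM learned from language data.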

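Taxonomy

Types of MLLMs

Existing MLLMs can be classified along two dimensions: modality support (input × output) and functional scope. This taxonomy gives a map of the whole field.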

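Input modality × output modality

Modality support is the most direct way to classify MLLMs. The input can be text, image, audio, or video, and so can the output. Early MLLMs mainly mapped images to text (Image→Text), for example LLaVA and InstructBLIP. As research progressed, more combinations appeared: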

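T+I → T: text and image in, text out. Visual question answering and image captioning belong here.
T+I → T+I: text and image in, text and image out. The model must both understand image content and generate new images (for example GILL, Emu, MiniGPT-5, or DreamLLM; GPT-4V, by contrast, takes images as input but returns text only).
T+I → T+V/A: an image plus a description or instruction in, video or audio out, such as image-to-video generation in NExT-GPT.
T → T+I/V/A: text only in, multimodal content out. This is text-to-media generation.
T+I/V/A → T+I/V/A: any modality to any modality. NExT-GPT is a representative of this class.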
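The input × output notation above can be made concrete with a small sketch. The category names and the `parse_signature` helper are illustrative assumptions, not terminology fixed by the chapter; the code only shows how a signature such as `T+I -> T` maps to a coarse category.

```python
from dataclasses import dataclass

# Modality codes used in the list above: Text, Image, Video, Audio.
NON_TEXT = frozenset({"I", "V", "A"})
KNOWN = NON_TEXT | {"T"}


@dataclass(frozen=True)
class ModalitySignature:
    inputs: frozenset
    outputs: frozenset

    def category(self) -> str:
        """Name the coarse category implied by the input and output sets."""
        non_text_in = self.inputs & NON_TEXT
        non_text_out = self.outputs & NON_TEXT
        if non_text_in == NON_TEXT and non_text_out == NON_TEXT:
            return "any-to-any"
        if non_text_in and non_text_out:
            return "perceiving + generating"
        if non_text_out:
            return "generating"
        if non_text_in:
            return "perceiving"
        return "text-only"


def parse_signature(spec: str) -> ModalitySignature:
    """Parse a string such as 'T+I -> T+V' into a ModalitySignature."""
    left, right = (side.strip() for side in spec.split("->"))
    inputs = frozenset(left.split("+"))
    outputs = frozenset(right.split("+"))
    unknown = (inputs | outputs) - KNOWN
    if unknown:
        raise ValueError(f"unknown modality codes: {sorted(unknown)}")
    return ModalitySignature(inputs, outputs)


for spec in ["T+I -> T", "T+I -> T+I", "T -> T+I+V+A", "T+I+V+A -> T+I+V+A"]:
    print(spec, "=>", parse_signature(spec).category())
```

Treating modality support as two sets keeps the taxonomy open: adding a new modality code only changes the constants, not the classification logic.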

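Functional scope: generalist vs. specialist

By functional scope, MLLMs divide into generalists and specialists. Generalist models aim to generalize across as many modalities and tasks as possible, for example NExT-GPT, GPT-4V, and Gemini. Specialist models focus on one class of tasks, for example RadFM for medical image understanding, or long-video language models dedicated to video understanding. Each route has its strengths: generalists cover more ground but may lack depth, while specialists are more accurate on their target tasks but generalize less.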

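Trend: looking at 2023 through 2026, the generalist route is clearly the mainstream direction. As models grow and training data becomes richer, a single model can handle several modalities at once with good performance, and any-to-any modality support has become the goal of more and more work.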

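Architecture

Two Architectural Paradigms

At the implementation level, the community currently uses two main MLLM architectures. They differ in the role the LLM plays in the system.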

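Architecture I: LLM as Task Scheduler

In the first architecture the LLM acts as a discrete scheduler or controller. The LLM receives text (the user's instruction, plus textual descriptions of or references to any non-text inputs), interprets it, and emits plain-text commands that call downstream modules. Each module (visual generator, speech synthesizer, text generator, and so on) runs and returns its result to the LLM, which then decides which module to call next.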

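The defining feature is that all message passing inside the system uses plain-text commands as the medium. The functional modules never interact directly; they only receive instructions from the LLM and return results. The LLM is the commander, and the modality modules are executors that follow orders. Visual ChatGPT and HuggingGPT follow this design.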

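The advantage is a high degree of modularity: each modality module can be developed and replaced independently. The drawbacks are equally clear. Cross-modal information has to pass through language as a relay, which costs both efficiency and fidelity. In addition, the LLM's output is discrete text, so a parser is needed to turn it into concrete API calls or actions, and every such step can introduce errors that compound along the chain.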
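The scheduler loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in: the `CALL name(argument)` command syntax, the two toy modules, and `scripted_llm`, which replaces a real LLM with a fixed script. The point is only to show that every hop (LLM to module, module back to LLM) is plain text and needs a parser.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical downstream modules. Each one takes text and returns text,
# because in this architecture text is the only message format.
MODULES: Dict[str, Callable[[str], str]] = {
    "caption_image": lambda path: f"a cat in {path}",
    "generate_image": lambda prompt: f"image_{len(prompt)}.png",
}

# Assumed command syntax: CALL module_name(argument)
COMMAND = re.compile(r"CALL (\w+)\((.*)\)")
OBS = "OBSERVATION: "


def scripted_llm(history: List[str]) -> str:
    """Stand-in for a real LLM: caption first, then draw, then answer."""
    observations = [h[len(OBS):] for h in history if h.startswith(OBS)]
    if not observations:
        return "CALL caption_image(photo.jpg)"
    if len(observations) == 1:
        return f"CALL generate_image({observations[0]})"
    return f"FINAL: {observations[-1]}"


def run_scheduler(request: str, llm: Callable[[List[str]], str],
                  max_steps: int = 5) -> Tuple[str, List[str]]:
    """Loop: the LLM emits a text command, a parser dispatches it, and the
    module's text result is appended to the history for the next turn."""
    history = [f"USER: {request}"]
    for _ in range(max_steps):
        output = llm(history)
        history.append(f"LLM: {output}")
        match = COMMAND.fullmatch(output)
        if match is None:          # not a command, treat as the final answer
            return output, history
        name, argument = match.groups()
        module = MODULES.get(name)
        result = module(argument) if module else f"error: no module {name}"
        history.append(OBS + result)
    return "FINAL: step limit reached", history


answer, trace = run_scheduler("Redraw photo.jpg as a painting", scripted_llm)
print(answer)
```

Note how the image content reaches the second module only as the caption string: whatever the caption omits is lost, which is the relay cost discussed above.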

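Architecture II: Encoder-LLM-Decoder (LLM as joint part of the system)

The second architecture is currently the most popular. Here the LLM is no longer an external scheduler but the key joint part of the system: it perceives multimodal information directly from the encoders and delegates instructions directly to the decoders. Concretely, the encoders turn raw signals from different modalities (image, audio, video) into feature representations the LLM can consume; the LLM takes these features, performs semantic understanding and reasoning, and generates a text response together with signal tokens that indicate the desired output modality; the decoders take the LLM's output and generate the corresponding image, audio, or video.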

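The key difference from the first architecture is that the LLM receives multimodal information directly, as features, instead of indirectly through text commands. Information flows more smoothly: visual, audio, and text information are fused inside the LLM instead of being "translated" through text outside it. This is closer to how humans process multimodal information.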

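An everyday analogy: in the first architecture, a restaurant manager (the LLM) works only from written notes. Whatever language (modality) a customer uses, the request must first be translated into a note, and the manager's orders to each kitchen station are notes as well. In the second architecture, the manager has deputies who understand each language, so the information is combined directly in the manager's head and the orders go straight to the kitchen stations, with no translation step in between.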

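Practice

NExT-GPT: An Any-to-Any Multimodal System

NExT-GPT was proposed in 2023 by researchers at the National University of Singapore (NUS); see arXiv:2309.05519. Its core contribution, as described by its authors, is an any-to-any MLLM: the user can provide any combination of modalities as input (for example text plus image), the system can return any combination of modalities as output (for example text plus video plus audio), and it supports multi-turn interaction with context.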

Technology Stack Overview

The NExT-GPT stack has three main components:

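ImageBind (unified encoder): a multimodal encoder from Meta that embeds images, video, audio, text, and other modalities in a shared feature space. Its key contribution is showing that representations of different modalities can be aligned in one embedding space, so a single encoder can handle several input modalities.
Vicuna-7B (LLM backbone): a LLaMA-based chat model that serves as the system's "brain" and is responsible for semantic understanding, task planning, and multimodal coordination.
Diffusion decoders: Stable Diffusion v1-5 for images, ZeroScope v2_576w for video, and AudioLDM-l-full for audio. Each decoder is conditioned on what the LLM outputs at the signal-token positions (after the output projection layer) and generates the corresponding media.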

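The Signal Token Mechanism

NExT-GPT introduces a key design: signal tokens. These are specially defined tokens that mark, within the LLM's output, when and in which modality non-text content should be produced. When the LLM decides an image is needed, it emits the image signal tokens, and the image decoder is triggered. As the slides put it, the LLM produces modality signal tokens instead of textual instructions: the decoder is driven by the representations of these tokens, passed through the output projection layer, so no text parser sits between the LLM and the decoder and the connection can be trained with gradients.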

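An analogy: signal tokens are like stage directions in a script. "Lights dim" and "music starts" are not part of the performance itself, but they tell the technical crew when to do what. The LLM generates its text and emits signal tokens alongside it, and the signal tokens trigger the matching decoder.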
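The routing role of signal tokens can be illustrated with a short sketch. The token spellings `[IMG0]`, `[VID0]`, `[AUD0]` are an assumption made for this example, and in the real system the decoders consume hidden representations at these positions, not the token strings; the code shows only the "which decoder fires" decision.

```python
import re
from typing import List, Tuple

# Assumed token naming; the real vocabulary entries may differ.
SIGNAL_PATTERN = re.compile(r"\[(IMG|VID|AUD)\d+\]")
DECODER_FOR = {"IMG": "image", "VID": "video", "AUD": "audio"}


def split_response(llm_output: str) -> Tuple[str, List[str]]:
    """Return (visible_text, modalities_to_generate) for one LLM response."""
    requested: List[str] = []
    for match in SIGNAL_PATTERN.finditer(llm_output):
        modality = DECODER_FOR[match.group(1)]
        if modality not in requested:      # keep first-seen order, no repeats
            requested.append(modality)
    # Remove the signal tokens and collapse the whitespace they leave behind.
    visible_text = " ".join(SIGNAL_PATTERN.sub("", llm_output).split())
    return visible_text, requested


text, modalities = split_response(
    "Here is the sunset you asked for. [IMG0][IMG1] And some waves: [AUD0]"
)
print(text)
print(modalities)
```

A response with no signal tokens yields an empty modality list, which corresponds to a plain text reply with no decoder invoked.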

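Training

Three-Stage Progressive Training

NExT-GPT is trained in three stages, each with a clear objective and freezing strategy. The progressive schedule helps limit catastrophic forgetting, that is, losing existing language ability while learning new skills.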

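Stage 1: Encoding-side multimodal alignment (Enc-side Alignment)

Objective: let the LLM correctly "understand" the multimodal features produced by the encoder. ImageBind (the encoder), the LLM (Vicuna), and the decoders are all frozen; only the input projection layer between the encoder and the LLM is trained.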

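The core problem in this stage is the modality gap: the feature space of ImageBind's output is not naturally aligned with the text embedding space the LLM expects. For example, the vector ImageBind produces for a picture of a dog is not numerically close to the vectors obtained by tokenizing and embedding the text "a dog". A projection layer has to be trained to "translate" ImageBind's output into a form the LLM can read.

The training data are T-X pairs (Text-X pairs): image-text, video-text, and audio-text pairs. For example, given a picture of a dog and the text "a dog", the model learns to map the image encoding to the same meaning as the caption; in practice the LLM is asked to produce the caption from the projected features, and only the projection layer is updated. After training, the LLM can interpret the encoded representations of images, audio, and video much as it interprets text.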
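To give a feel for how small the trainable part of this stage is, here is a minimal sketch that assumes the input projection is a single linear layer mapping a 1024-dimensional encoder output to a 4096-dimensional LLM hidden state. Both sizes and the single-layer form are assumptions for illustration; the projection in the actual codebase may be structured differently.

```python
from typing import List


def linear_param_count(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Number of trainable values in one linear layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


def project(features: List[float], weight: List[List[float]],
            bias: List[float]) -> List[float]:
    """Apply y = W x + b, with W stored as out_dim rows of in_dim values."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]


# Assumed sizes: 1024-dim encoder output, 4096-dim LLM hidden state.
print(linear_param_count(1024, 4096))   # a few million values

# Toy example with tiny dimensions so the arithmetic is easy to follow.
toy_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # maps 2 dims to 3 dims
toy_bias = [0.0, 0.0, 0.5]
print(project([1.0, 2.0], toy_weight, toy_bias))
```

Under these assumptions the projection holds a few million values, which is tiny next to a 7B-parameter LLM; that is why freezing everything else keeps this stage cheap.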

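Stage 2: Decoding-side instruction-following alignment (Dec-side Alignment)

Objective: let the LLM's output correctly drive each modality's decoder. ImageBind, the LLM, and the input projection layer are frozen, and only the output projection layers between the LLM and the decoders are trained. The diffusion models, including their text encoders, stay frozen; the text encoders are used only to produce the target embeddings.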

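The core problem in this stage is the generation gap: what the LLM outputs does not match the conditioning input the decoders expect. Stable Diffusion, for example, is conditioned on the embedding that its own text encoder produces for a prompt such as "a beautiful sunset over the ocean, golden hour, cinematic lighting". The output projection layer therefore maps the LLM's representation at the signal-token positions into the conditioning space of each decoder, so the decoder receives something it can interpret as if it had come from a detailed prompt.

The training data are again T-X pairs, but used in the reverse direction: text-to-image, text-to-audio, and text-to-video pairs. Given a caption such as "a cat sitting on a windowsill", the projected signal-token representation is pulled toward the embedding that the decoder's text encoder produces for that caption, so the decoder can then generate the matching image (or audio, or video).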

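Stage 3: Instruction tuning

Objective: give the whole system instruction-following and multi-turn dialogue ability. The LLM is fine-tuned with LoRA (Low-Rank Adaptation), and the input and output projection layers are updated at the same time.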

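The first two stages solve alignment for perception and for generation, but they do not teach the system how to respond to a user's instructions. Stage 3 uses instruction-tuning data, including Alpaca (text instructions), LLaVA (visual instructions), VideoChat (video instructions), and MosIT (modality-switching instructions), to train the system to handle varied instruction formats and to stay consistent with context across turns.

MosIT is especially important because it trains the model to switch correctly between output modalities. For example, when the user says "tell me a joke and then draw it", the model should write the joke as text and, for the "draw it" part, emit signal tokens that trigger image generation.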

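Why three stages work: the "align first, tune later" strategy avoids asking the LLM to learn too much at once. Alignment for perception and for generation is a fairly mechanical process and suits separate training. Instruction following is a semantic skill that needs the LLM to take part but does not need the encoder or decoders to be updated. In a single end-to-end run, the noisy gradients from the freshly initialized projection layers could disturb the language ability the LLM already has.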
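The freezing strategy of the three stages can be summarized as a small lookup table. The component names below are illustrative labels, not identifiers from the codebase, and the sketch assumes that stage 2 updates only the output projection layers while the diffusion models stay frozen.

```python
from typing import Dict

COMPONENTS = (
    "imagebind_encoder",
    "input_projection",
    "llm",                  # base LLM weights
    "llm_lora",             # low-rank adapters attached to the LLM
    "output_projection",
    "diffusion_decoders",
)

# Which components receive gradient updates in each stage (assumed mapping).
TRAINABLE_BY_STAGE = {
    1: {"input_projection"},
    2: {"output_projection"},
    3: {"input_projection", "llm_lora", "output_projection"},
}


def requires_grad_plan(stage: int) -> Dict[str, bool]:
    """Return {component: trainable?} for the given training stage."""
    if stage not in TRAINABLE_BY_STAGE:
        raise ValueError(f"stage must be 1, 2, or 3, got {stage}")
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in COMPONENTS}


for stage in (1, 2, 3):
    plan = requires_grad_plan(stage)
    print(stage, [name for name, on in plan.items() if on])
```

In a PyTorch training script, a plan like this would typically be applied by setting `requires_grad` on each component's parameters before building the optimizer. Note that the base LLM weights are never updated directly; in stage 3 only the adapters change.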

Data

Data Preparation and Processing

Training NExT-GPT requires several kinds of data, and the preparation pipeline is fairly involved.

T-X pair data (modality pairs)

Used for the encoding-side and decoding-side alignment in the first two stages. It includes:

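CC3M: text-image pairs, about 3 million, used to align image understanding and generation.
WebVid: text-video pairs, about 10 million in the full WebVid-10M release, used to align video understanding and generation.
AudioCaps: text-audio pairs, used to align audio understanding and generation.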

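These data are simple (x, text) pairs. The training objective is for the input and output projection layers to establish a correspondence between each modality and text in semantic space.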

Instruction-tuning data

Used in the third stage to build high-quality instruction following. It includes:

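Alpaca: text-only instruction data, for basic instruction following.
LLaVA: image-text instruction data, so the model learns to handle visual input.
VideoChat: video-text instruction data, for video understanding.
MosIT: modality-switching instruction data, specifically for switching between output modalities.
T2M synthetic data: synthetic text-to-multimodal instruction data, for text-driven multimodal generation.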

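Precomputing embeddings

To make training more efficient, before decoding-side alignment NExT-GPT uses the diffusion models' text encoders to precompute embeddings for the caption text of each modality's data (for example images and videos). During training these embeddings serve as the decoder-side target: the loss minimizes the distance between the signal-token features and the caption-text features. This step is done by running the process_embeddings.py script.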
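The precompute-then-align idea can be sketched as follows. Mean squared error is used here as one plausible distance; the chapter does not state which distance the actual training uses. `toy_encoder` is a hypothetical stand-in for a diffusion model's text encoder, and the matrices are tiny so the arithmetic can be checked by hand.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]   # one row per token, one column per feature


def precompute_caption_embeddings(
    captions: List[str], encode: Callable[[str], Matrix]
) -> Dict[str, Matrix]:
    """Run the (frozen) text encoder once per caption and cache the result."""
    return {caption: encode(caption) for caption in captions}


def alignment_loss(projected: Matrix, target: Matrix) -> float:
    """Mean squared error between projected signal-token features and the
    cached caption embedding. Both matrices must have the same shape."""
    if len(projected) != len(target) or any(
        len(p) != len(t) for p, t in zip(projected, target)
    ):
        raise ValueError("shape mismatch between projected and target")
    squared = [
        (p - t) ** 2
        for p_row, t_row in zip(projected, target)
        for p, t in zip(p_row, t_row)
    ]
    return sum(squared) / len(squared)


def toy_encoder(caption: str) -> Matrix:
    """Hypothetical stand-in for a diffusion model's text encoder."""
    return [[float(len(word)), float(i)] for i, word in enumerate(caption.split())]


cache = precompute_caption_embeddings(["a dog", "a cat on a sofa"], toy_encoder)
prediction = [[1.0, 0.0], [3.0, 3.0]]   # pretend output-projection result
print(alignment_loss(prediction, cache["a dog"]))
```

Because the text encoder is frozen, its outputs never change during training, so caching them once removes a full encoder forward pass from every training step.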

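Code

Key Code Structure

Below is an overview of the core structure of the NExT-GPT codebase. For the full code see GitHub: https://github.com/NExT-GPT/NExT-GPT.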

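# NExT-GPT project structure (core parts)
├── code/
│   ├── config/
│   │   ├── base.yaml         # base system config (model paths, modality settings)
│   │   ├── stage_1.yaml      # encoding-side alignment training config
│   │   ├── stage_2.yaml      # decoding-side alignment training config
│   │   └── stage_3.yaml      # instruction-tuning training config
│   ├── dataset/
│   │   ├── cc3m_dataset.py    # image-text pair data loader
│   │   ├── audiocap_dataset.py # audio-text pair data loader
│   │   ├── webvid_dataset.py  # video-text pair data loader
│   │   ├── T+X-T_instruction_dataset.py  # multimodal-to-text instruction data
│   │   └── MosIT_instruction_dataset.py  # modality-switching instruction data
│   ├── model/
│   │   ├── ImageBind/        # ImageBind unified encoder implementation
│   │   ├── anyToImageVideoAudio.py  # main model file (core)
│   │   ├── layers.py         # input/output projection layer definitions
│   │   ├── custom_sd.py      # Stable Diffusion decoder
│   │   ├── custom_vd.py      # video diffusion decoder
│   │   └── custom_ad.py      # audio diffusion decoder
│   ├── train.py              # training entry point (switches between the three stages)
│   └── inference.py          # inference entry point
├── ckpt/
│   ├── delta_ckpt/           # trainable parameters (LoRA + projection layers)
│   └── pretrained_ckpt/       # frozen pretrained model weights
└── data/
    ├── T-X_pair_data/        # modality pair data
    └── IT_data/              # instruction-tuning data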

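Training is launched through the scripts/train.sh script. Key arguments include --stage (1, 2, or 3, selecting the training stage), --save_path (where the delta weights are saved), and --log_path (the log path). Training is distributed with DeepSpeed. According to these notes, the default configuration needs at least one GPU with 80 GB of memory.

Inference is started with demo_app.py, which launches a Gradio demo. Users can upload an image or enter text and observe the model's multimodal understanding and generation.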

Results

What NExT-GPT Can Do

After the three training stages, NExT-GPT can carry out a range of cross-modal tasks. A few typical cases:

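Case 1: T+I → T+A (image understanding + audio generation)

The input is a photo of a deer in a forest and the text "Describe what you see in a poetic voice". After interpreting the image, the model returns a poetic description of the deer in the forest together with generated audio. This is a typical cross-modal understanding-plus-generation task: the model perceives visual information and turns it into multimodal output.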

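Case 2: T+V → T+A (video understanding + audio generation)

The input is a video of a piano performance and the instruction "Describe the music style and create a similar melody". After interpreting the playing style in the video, the model writes a text analysis of the style and generates an audio clip with a similar melody.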

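Case 3: T+I → T+I+V (image to image + video)

The input is a landscape photo and the instruction "Imagine this as an animation and generate a short clip". The model first describes the image, then generates an animated-style version of the scene and a short video.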

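Case 4: T → T+I+V+A (text to all modalities)

The input is the text "Create a short scene of a rainy day in Tokyo with ambient sounds". From this scene description the model generates a written narration of the scene, an image of Tokyo in the rain, a short time-lapse-style video, and ambient rain audio. This case best shows what an any-to-any system can do: one text input drives coordinated output in four modalities.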

Challenges

Current Limitations and Future Directions

NExT-GPT pioneered any-to-any multimodality, but it still has limitations:

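Generation quality: the images, audio, and video produced by the diffusion decoders fall short of dedicated state-of-the-art generators. For example, Stable Diffusion v1-5 is generally considered to trail newer systems such as DALL-E 3 or Midjourney in output quality.
Number of modalities: only text, image, video, and audio are supported. The real world has many more, including touch, taste, smell, depth maps, and medical imaging.
Long context: the ability to handle long video and long audio is limited. More video frames sharply increase GPU memory use, and long audio also strains the attention mechanism.
Latency: multimodal generation currently takes from several seconds to tens of seconds, which is still far from real-time interaction.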

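Future directions include larger models (from 7B parameters toward 70B and beyond), finer-grained modality support (such as 3D scenes and volumetric video), more efficient multimodal attention mechanisms, and end-to-end architectures that fuse diffusion and the LLM (instead of the current encoder-decoder stitching).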

Summary

Chapter Summary

This chapter reviewed the theoretical basis and the practice of multimodal large language models. The key points:

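Language intelligence is the pivot: thanks to scaling laws and its general representational power, the language model is the core of multimodal intelligence. Almost all MLLMs are built on an LLM and use language as the unified cross-modal interface.
Two architectures: LLM as Task Scheduler (discrete scheduler) and Encoder-LLM-Decoder (joint part of the system). The latter is the current mainstream, with a more natural flow of information.
NExT-GPT: described by its authors as the first any-to-any multimodal system. ImageBind as the unified encoder, Vicuna as the LLM, diffusion models as decoders, and signal tokens for modality switching.
Three-stage training: encoding-side alignment → decoding-side alignment → instruction tuning. Progressive training helps limit catastrophic forgetting.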

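Multimodal ability turns the LLM from a master of text into an agent with many senses. In the next chapter we will see that when this multimodal perception is combined with an agent framework, an AI system can understand screenshots, plan operation steps, and carry out real-world GUI tasks. That is the core topic of Ch9, GUI agents.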

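Slides

Close Reading of the Original Slides

The following comes from the original PDF slides for this chapter (81 pages) and keeps the full structure of the handout for reference.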

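Taxonomy

The Input-Output Modality Matrix of MLLMs

Multimodal large models can be grouped by the input and output modalities they support. The core insight is that language is the pivot of multimodal intelligence: almost all MLLMs use an LLM as the core decision-making unit and attach external encoders for multimodal perception. The breadth of modality support is a key dimension separating specialists (models optimized for specific tasks) from generalists (models that aim for general capability).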

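Architecture

A Closer Look at the Two Architectural Paradigms

LLM as Task Scheduler: the LLM acts as a discrete scheduler and issues text commands to downstream modules. The functional modules do not interact directly, and all messages are passed as plain-text commands. Encoder-LLM-Decoder: the most popular paradigm. The LLM receives encoded multimodal signals directly from outside and delegates instructions to the decoders or generators more smoothly, which allows end-to-end joint optimization.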

The slide text follows in page order (81 pages in total), keeping the original wording and structure of the handout.

Page 1

Slide text

A Brief Intro to Multimodal LLM

Stay tuned to the MLLM tutorial series: https://mllm2024.github.io/COLING2024

Hao Fei, Research Fellow, National University of Singapore

http://haofei.vip/

Page 2

Slide text

Table of Contents

⊹1 Modality: Overview; Multimodal Perceiving; Multimodal Perceiving + Generation; Unified MLLM; Fine-grained MLLM
⊹2 Architecture: Overview; Multimodal Encoding; Tokenization; Input-side Projection; Backbone LLMs; Decoding-side Connection; Multimodal Generation

Page 3

Slide text

Intelligence in Multi-Sensory Data

Trends of MLLMs

[1] MM-LLMs: Recent Advances in MultiModal Large Language Models, 2023.

Page 4

Slide text

Intelligence in Multi-Sensory Data

Trends of MLLMs

[1] A Survey on Multimodal Large Language Models. https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models, 2023.

Page 5

Slide text

Modality and Functionality

What are MLLMs capable of?

Page 6

Slide text

Overview of Modality and Functionality

Modalities

Language + Vision

Page 7

Slide text

Overview of Modality and Functionality

Modality (w/ Language): MLLMs at A Glance

Input-side Perceiving
Image: Flamingo, Kosmos-1, Blip2, mPLUG-Owl, Mini-GPT4, LLaVA, InstructBLIP, VPGTrans, CogVLM, Monkey, Chameleon, Otter, Qwen-VL, GPT-4v, SPHINX, Yi-VL, Fuyu, …
Image [Pixel-wise]: GPT4RoI, LION, MiniGPT-v2, NExT-Chat, Kosmos-2, GLaMM, LISA, DetGPT, Osprey, PixelLM, …
Video: VideoChat, Video-ChatGPT, Video-LLaMA, PandaGPT, MovieChat, Video-LLaVA, LLaMA-VID, Momentor, …
Video [Pixel-wise]: PG-Video-LLaVA, Merlin, MotionEpic, …
Audio: AudioGPT, SpeechGPT, VIOLA, AudioPaLM, SALMONN, MU-LLaMA, …
3D: 3D-LLM, 3D-GPT, LL3DA, SpatialVLM, PointLLM, Point-Bind, …
Image + Video: Video-LLaVA, Chat-UniVi, LLaMA-VID
Multiple modalities: Panda-GPT, Video-LLaMA, AnyMAL, Macaw-LLM, Gemini, VideoPoet, ImageBind-LLM, LLMBind, LLaMA-Adapter, …

Perceiving + Generating
Image: GILL, EMU, MiniGPT-5, DreamLLM, LLaVA-Plus, InternLM-XComposer2, SEED-LLaMA, LaVIT, Mini-Gemini, …
Video: GPT4Video, Video-LaVIT, VideoPoet, …
Audio: AudioGPT, SpeechGPT, VIOLA, AudioPaLM, …
[Pixel-wise]: Vitron
Multiple modalities: NExT-GPT, Unified-IO 2, AnyGPT, CoDi-2, Modaverse, ViT-Lens, …

Page 8

Slide text

Multimodal Perceiving

Image-perceiving MLLM

Diagram labels: Text, Image, LLM

⊹ Flamingo, Kosmos-1, Blip2, mPLUG-Owl, Mini-GPT4, LLaVA, InstructBLIP, Otter, VPGTrans, Chameleon, Qwen-VL, GPT-4v, SPHINX, …

Encode input images with external image encoders, generating LLM-understandable visual feature, which is then fed into the LLM. LLM then interprets the input images based on the input text instructions and produces a textual response.

[1] Flamingo: a Visual Language Model for Few-Shot Learning. 2022
[2] Language Is Not All You Need: Aligning Perception with Language Models. 2023
[3] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. 2023
[4] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. 2024
…

Page 9

Slide text

Multimodal Perceiving

Image-perceiving MLLM

⊹ Blip2
⊹ LLaVA
⊹ Flamingo
⊹ Mini-GPT4

[1] Flamingo: a Visual Language Model for Few-Shot Learning. 2022
[2] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. 2023
[3] Visual Instruction Tuning. 2023
[4] A Survey on Multimodal Large Language Models. https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models, 2023.

Page 10

Slide text

Multimodal Perceiving

Video-perceiving MLLM

Diagram labels: Text, Video, LLM

⊹ VideoChat, Video-ChatGPT, Video-LLaMA, PandaGPT, MovieChat, Video-LLaVA, LLaMA-VID, Momentor, …

Encode input videos with external video encoders, generating LLM-understandable visual feature, feeding into LLM, which then interprets the input videos based on the input text instructions and produces a textual response.

[1] VideoChat: Chat-Centric Video Understanding. 2023
[2] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. 2023
[3] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. 2023
[4] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. 2023
[5] Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning. 2024
…

Page 11

Slide text

Multimodal Perceiving

Video-perceiving MLLM

⊹ Video-ChatGPT
⊹ Video-LLaVA

[1] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. 2023
[2] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. 2023
[3] Video Understanding with Large Language Models: A Survey. https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding, 2023

Page 12

Slide text

Multimodal Perceiving

3D-perceiving MLLM

Diagram labels: Text, 3D/Points, LLM

⊹ 3D-LLM, 3D-GPT, LL3DA, SpatialVLM, PointLLM, Point-Bind, …

Encode input 3D information with external encoders, generating LLM-understandable 3D feature, feeding into LLM, which then interprets the input 3D/points based on the input text instructions and produces a textual response.

[1] 3D-LLM: Injecting the 3D World into Large Language Models. 2023
[2] 3D-GPT: Procedural 3D Modeling with Large Language Models. 2023
[3] LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning. 2023
[4] PointLLM: Empowering Large Language Models to Understand Point Clouds. 2023
[5] SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. 2024
…

Page 13

Slide text

Multimodal Perceiving

3D-perceiving MLLM

⊹ 3D-LLM
⊹ PointLLM

[1] 3D-LLM: Injecting the 3D World into Large Language Models. 2023
[2] PointLLM: Empowering Large Language Models to Understand Point Clouds. 2023

Page 14

Slide text

Multimodal Perceiving

Audio-perceiving MLLM

Diagram labels: Text, Audio (speech, sound, music, …), LLM

⊹ AudioGPT, SpeechGPT, VIOLA, AudioPaLM, SALMONN, MU-LLaMA, …

Encode input audio signals with external encoders, generating LLM-understandable signal features, feeding into LLM, which then interprets the audio based on the input text instructions and produces a textual response.

[1] AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. 2023
[2] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. 2023
[3] VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation. 2023
[4] AudioPaLM: A Large Language Model That Can Speak and Listen. 2023
[5] SALMONN: Towards Generic Hearing Abilities for Large Language Models. 2023
…

Page 15

Slide text

Multimodal Perceiving

Audio-perceiving MLLM

⊹ SpeechGPT
⊹ SALMONN

[1] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. 2023
[2] SALMONN: Towards Generic Hearing Abilities for Large Language Models. 2023
[3] Sparks of Large Audio Models: A Survey and Outlook. https://github.com/EmulationAI/awesome-large-audio-models, 2023

Page 16

Slide text

Multimodal Perceiving

X-perceiving MLLM

Diagram labels: Text, biomedicine, LLM

⊹ Bio-/Medical & Healthcare: BioGPT, DrugGPT, BioMedLM, OphGLM, GatorTron, GatorTronGPT, MEDITRON, MedAlpaca, AlpaCare, Zhongjing, PMC-LLaMA, CPLLM, MedPaLM 2, BioMedGPT, DoctorGLM, BianQue, ClinicalGPT, Qilin-Med, ChatDoctor, BenTsao, HuatuoGPT

[1] BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining. 2022
[2] DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins. 2023
[3] MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. 2023
[4] HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge. 2023
[5] AlpaCare: Instruction-tuned Large Language Models for Medical Application. 2023
[6] A Survey of Large Language Models in Medicine: Progress, Application, and Challenge, https://github.com/AI-in-Health/MedLLMsPracticalGuide. 2023.
…

Page 17

Slide text

Multimodal Perceiving

X-perceiving MLLM

Diagram labels: Text, Chem/Graph, LLM

⊹ Molecule & Chemistry: ChemGPT, SPT, T5 Chem, ChemLLM, MolCA, MolXPT, MolSTM, GIMLET, …
⊹ Graph: StructGPT, GPT4Graph, GraphGPT, LLaGA, HiGPT, …
⊹ Geographical Information System (GIS): GeoGPT

[1] Neural Scaling of Deep Chemical Models. 2022
[2] ChemLLM: A Chemical Large Language Model. 2023
[3] MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter. 2023
[4] StructGPT: A General Framework for Large Language Model to Reason on Structured Data. 2023
[5] LLaGA: Large Language and Graph Assistant. 2023
[6] Awesome-Graph-LLM, https://github.com/XiaoxinHe/Awesome-Graph-LLM. 2023

Page 18

Slide text

Unified MLLM: Perceiving + Generation

Scenarios

Often, MLLMs need to not only understand the input multimodal information, but also to generate information in that modality.

⊹ Image Captioning
⊹ Visual Question Answering
⊹ Text-to-Vision Synthesis
⊹ Vision-to-Vision Translation
⊹ Scene Text Recognition
⊹ Scene Text Inpainting
⊹ …

Page 19

Slide text

Unified MLLM: Perceiving + Generation

Image

Diagram labels: Text, Image, LLM

⊹ GILL, EMU, MiniGPT-5, DreamLLM, LLaVA-Plus, LaVIT, …

Central LLMs take as input both texts and images, after semantics comprehension, and generate both texts and images.

[1] Generating Images with Multimodal Language Models. 2023
[2] Generative Pretraining in Multimodality. 2023
[3] MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens. 2023
[4] DreamLLM: Synergistic Multimodal Comprehension and Creation. 2023
[5] LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. 2023
…

Page 20

Slide text

Unified MLLM: Perceiving + Generation

Image

⊹ GILL
⊹ EMU

[1] Generating Images with Multimodal Language Models. 2023
[2] Generative Pretraining in Multimodality. 2023

Page 21

Slide text

Unified MLLM: Perceiving + Generation

Video

Diagram labels: Text, Video, LLM

⊹ GPT4Video, VideoPoet, Video-LaVIT, …

Central LLMs take as input both texts and videos, after semantics comprehension, and generate both texts and videos.

[1] GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation. 2023
[2] VideoPoet: A Large Language Model for Zero-Shot Video Generation. 2023
[3] Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization. 2024
…

Page 22

Slide text

Unified MLLM: Perceiving + Generation

Video

⊹ VideoPoet
⊹ GPT4Video

[1] GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation. 2023
[2] VideoPoet: A Large Language Model for Zero-Shot Video Generation. 2023

Page 23

Slide text

Unified MLLM: Perceiving + Generation

Audio

Diagram labels: Text, Audio, LLM

⊹ AudioGPT, SpeechGPT, VIOLA, AudioPaLM, …

Central LLMs take as input both texts and audio, after semantics comprehension, and generate both texts and audio.

[1] AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. 2023
[2] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. 2023
[3] VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation. 2023
[4] AudioPaLM: A Large Language Model That Can Speak and Listen. 2023
…

Page 24

Slide text

Unified MLLM: Perceiving + Generation

Audio

⊹ AudioGPT
⊹ SpeechGPT

[1] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. 2023
[2] AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. 2023

Page 25

Slide text

Unified MLLM: Harnessing Multi-Modalities

Scenarios:

In reality, modalities often have strong interconnections simultaneously. Thus, it is frequently necessary for MLLMs to handle the understanding of multiple non-textual modalities at once, rather than just one single (non-textual) modality.

⊹ Image+Video
⊹ Audio+Video
⊹ Image+Video+Audio
⊹ Any-to-Any
⊹ …

Page 26

Slide text

Unified MLLM: Harnessing Multi-Modalities

Text+Image+Video

Diagram labels: Text, Image, Video, LLM

⊹ Video-LLaVA, Chat-UniVi, LLaMA-VID, …

Central LLMs take as input texts, image and video, after semantics comprehension, and generate texts (maybe also image and video, or combination).

[1] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. 2023
[2] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding. 2023
[3] LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models. 2023
…

Page 27

Slide text

Unified MLLM: Harnessing Multi-Modalities

Text+Image+Video

⊹ Chat-UniVi

[1] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding. 2023

Page 28

Slide text

Unified MLLM: Harnessing Multi-Modalities

Text+Image+Video+Audio

Diagram labels: Text, Image, Audio, Video, LLM

⊹ Panda-GPT, Video-LLaMA, AnyMAL, Macaw-LLM, VideoPoet, ImageBind-LLM, LLMBind, LLaMA-Adapter, …

Central LLMs take as input texts, audio, image and video, and generate texts (maybe also audio, image and video, or combination).

[1] PandaGPT: One Model to Instruction-Follow Them All. 2023
[2] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. 2023
[3] AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model. 2023
[4] Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. 2023
…

Page 29

Slide text

Unified MLLM: Harnessing Multi-Modalities

Text+Image+Video+Audio

⊹ Macaw-LLM

[1] Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. 2023

Page 30

Slide text

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

Diagram labels: Text, Image, Audio, Video, LLM

⊹ NExT-GPT, Unified-IO 2 (w/o video), AnyGPT (w/o video), CoDi-2, Modaverse, …

Central LLMs take as input texts, audio, image and video, and freely generate texts, audio, image and video, or combination.

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023
[2] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling. 2023
[3] CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation. 2023
[4] ModaVerse: Efficiently Transforming Modalities with LLMs. 2023

Page 31

Slide text

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

⊹ NExT-GPT

Project: https://next-gpt.github.io
Paper: https://arxiv.org/pdf/2309.05519
Code: https://github.com/NExT-GPT/NExT-GPT

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Pages 32 and 33

Slide text (the same on both pages)

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

⊹ NExT-GPT

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 34

Slide text

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

⊹ NExT-GPT

Multimodal Encoding Stage
LLM Understanding and Reasoning Stage
Multimodal Generation Stage

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 35

Slide text

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

⊹ NExT-GPT

Taking ImageBind as the unified multimodal encoder
An input projection layer to connect multimodal encoder and LLM

Diagram labels: ImageBind, Input Projection

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 36

Slide text

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

⊹ NExT-GPT

➢ Leveraging the current SoTA diffusion-based Text-to-Image, Video, Audio generation model to generate multimodal content

Diagram labels: Text encoder (control the generation process), VAE, UNet

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 37

Slide text

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

⊹ NExT-GPT

Harnessing LLM as the core to decide whether & what modal content to output correspondingly.

Instead of generating textual instructions, LLM produces unique “modality signal” tokens that serve as instructions to guide the generation process.

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Pages 38 to 41

Slide text (the same on all four pages)

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

⊹ NExT-GPT

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 42

Slide text

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

⊹ NExT-GPT

➢ Key Aspect-I: Parameter-efficient Low-cost Training

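\[
\frac{\text{TunedParams}}{\text{Frozen} + \text{TunedParams}}
= \frac{4\text{M}+33\text{M}+31\text{M}+31\text{M}+32\text{M}}{(4\text{M}+33\text{M}+31\text{M}+31\text{M}+32\text{M}) + (1.2\text{B}+7\text{B}+1.3\text{B}+1.8\text{B}+0.975\text{B})}
= \frac{131\text{M}}{131\text{M} + 12.275\text{B}} \approx 0.01
\]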
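The ratio on this slide can be checked with a few lines of arithmetic. The numbers are copied from the slide; the variable names are ours, and the slide does not label which component each term belongs to.

```python
# Parameter counts as printed on the slide.
tuned_millions = [4, 33, 31, 31, 32]
frozen_billions = [1.2, 7, 1.3, 1.8, 0.975]

tuned = sum(tuned_millions) * 1e6
frozen = sum(frozen_billions) * 1e9
ratio = tuned / (tuned + frozen)

print(f"tuned  = {tuned / 1e6:.0f}M")
print(f"frozen = {frozen / 1e9:.3f}B")
print(f"ratio  = {ratio:.4f}")
```

The tuned share comes to roughly 1% of all parameters, which matches the slide's rounded value of 0.01.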

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 43

Slide text

Unified MLLM: Harnessing Multi-Modalities

Any-to-Any MLLM

⊹ NExT-GPT

➢ Key Aspect-II: Modality-switching Instruction Tuning

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 44

Slide text

Fine-grained Capability of MLLM

Pixel-level Vision MLLM

The vision MLLMs described above generally only support coarse-grained, instance-level visual understanding. This can lead to imprecise visual interpretations. Also due to the lack of visual grounding, these MLLMs will potentially produce hallucinations.

⊹ Visual Grounding
⊹ Visual Segmentation
⊹ Visual Editing
⊹ Visual Inpainting
⊹ …

Page 45

Slide text

Fine-grained Capability of MLLM

Image-oriented Pixel-wise Regional MLLM

Diagram labels: Text, Image, Region, LLM, Region/Pixels

⊹ GPT4RoI, NExT-Chat, MiniGPT-v2, Shikra, Kosmos-2, GLaMM, LISA, DetGPT, Osprey, PixelLM, LION, …

Users input an image (potentially specifying a region), and the LLM outputs content based on its understanding, grounding the visual content to specific pixel-level regions of the image.

[1] GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest. 2023
[2] NExT-Chat: An LMM for Chat, Detection and Segmentation. 2023
[3] MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. 2023
[4] Osprey: Pixel Understanding with Visual Instruction Tuning. 2023
[5] GLaMM: Pixel Grounding Large Multimodal Model. 2023
[6] Kosmos-2: Grounding Multimodal Large Language Models to the World. 2023
[7] DetGPT: Detect What You Need via Reasoning. 2023
[8] PixelLM: Pixel Reasoning with Large Multimodal Model. 2023
[9] Lisa: Reasoning segmentation via large language model. 2023
[10] Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. 2023
…

Page 46

Slide text

Fine-grained Capability of MLLM

Image-oriented Pixel-wise Regional MLLM

⊹ NExT-Chat
⊹ GLaMM

Page 47

Slide text

Fine-grained Capability of MLLM

Video-oriented Pixel-wise Regional MLLM

Diagram labels: Text, Video, Region, LLM, Region/Pixels

⊹ PG-Video-LLaVA, Merlin, MotionEpic, …

Users input a video (potentially specifying a region), and the LLM outputs content based on its understanding, grounding or tracking the content to specific pixel-level regions of the video.

[1] PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models. 2023
[2] Merlin: Empowering Multimodal LLMs with Foresight Minds. 2023
[3] Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition. 2024
…

Page 48

Slide text

Fine-grained Capability of MLLM

Video-oriented Pixel-wise Regional MLLM

⊹ PG-Video-LLaVA
⊹ MotionEpic

[1] PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models. 2023
[2] Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition. 2024

Page 49

Slide text

Fine-grained Capability of MLLM

Unified Pixel-wise MLLM

Diagram labels: Text, Image, Video, Region, LLM, Region/Pixels

⊹ Vitron

Users input either an image or video (potentially specifying a region), and the LLM outputs content based on its understanding, generating, grounding or tracking the content to specific pixel-level regions of the image, video.

[1] VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing. 2024

Pages 50 and 51

Slide text (the same on both pages)

Fine-grained Capability of MLLM

Unified Pixel-wise MLLM

⊹ Vitron

[1] VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing. 2024

Page 52

Slide text

Architecture of MLLM

How to design an MLLM?

Page 53

Slide text

Overview of MLLM Architecture

Preliminary Idea: Intelligence over Language

Due to the scaling law, emergent phenomena have already occurred extensively in language-based LLMs.

These LLMs now generally possess very powerful semantic understanding capabilities.

This also implies that language is a crucial modality for carrying intelligence.

Diagram label: language

Page 54

Slide text

Overview of MLLM Architecture

Preliminary Idea: Language Intelligence as Pivot

Given this premise, nearly all CURRENT MLLMs are built based on language-based LLMs as the core decision-making module (i.e., the brain or central processor).

By adding additional external non-textual modality modules or encoders, LLMs are enabled with multimodal perceptual/operation abilities.

Diagram labels: language, visual, video, sound, smell, touch

Page 55

Slide text

Overview of MLLM Architecture

Architecture-I: LLM as Discrete Scheduler/Controller

The role of the LLM is to receive textual signals and instruct textual commands to call downstream modules.

⊹ Key feature: All message passing within the system, such as “multimodal encoder to the LLM” or “LLM to downstream modules”, is facilitated through pure textual commands as the medium.

Diagram labels: User, Text, LLM, Text, downstream modules

Page 56

Slide text

Overview of MLLM Architecture

Architecture-I: LLM as Discrete Scheduler/Controller

⊹ Representative MLLMs: Visual-ChatGPT, HuggingGPT, MM-REACT, ViperGPT, AudioGPT, LLaVA-Plus, …

Diagram labels: User, Text, LLM, Text, downstream modules

Page 57

Slide text

Overview of MLLM Architecture

Architecture-I: LLM as Discrete Scheduler/Controller

⊹ Visual-ChatGPT
⊹ HuggingGPT

[1] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. 2023
[2] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. 2023

Page 58

Original page text

Overview of MLLM Architecture

Architecture-II: LLM as Joint Part of System

The role of the LLM is to perceive multimodal information and react by itself, in an Encoder-LLM-Decoder structure.

⊹ Key feature:

The LLM is the key joint part of the system: it receives multimodal information directly from outside and delegates instructions to decoders/generators more smoothly.

[Figure: labels Multimodal Signals, Text, Encoder, LLM (Core), Decoder, Multimodal Outputs]
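As a rough illustration of the Encoder-LLM-Decoder structure, the sketch below chains three toy functions. All three are stand-ins invented for this example, not real models; what matters is that the core consumes the encoder output directly as vectors rather than as text.

```python
from typing import List

Vector = List[float]

def toy_encoder(pixels: List[float]) -> Vector:
    # Stand-in for a modality encoder: summarize the input as [mean, max].
    return [sum(pixels) / len(pixels), max(pixels)]

def toy_llm(text_emb: Vector, modality_emb: Vector) -> Vector:
    # Stand-in for the LLM core: it combines both embeddings element-wise.
    return [t + m for t, m in zip(text_emb, modality_emb)]

def toy_decoder(state: Vector) -> str:
    # Stand-in for a generator conditioned on the core's output state.
    return "bright image" if state[0] > 1.0 else "dark image"

def encoder_llm_decoder(text_emb: Vector, pixels: List[float]) -> str:
    return toy_decoder(toy_llm(text_emb, toy_encoder(pixels)))

output = encoder_llm_decoder([0.5, 0.5], [0.9, 0.8, 1.0])
```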

Page 59

Original page text

Overview of MLLM Architecture

Architecture-II: LLM as Joint Part of System

More promising

⊹ ≈ 96% of MLLMs belong to this category.

[1] A Survey on Multimodal Large Language Models. https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models, 2023.

Page 60

Original page text

Multimodal Encoding

Visual (Image&Video) Encoder

⊹ CLIP-ViT is the most popular choice for vision-language models.

⊹ Advantages:

Provides image representations that are well aligned with the text space.

Scales well with respect to parameters and data.

Page 61

Original page text

Multimodal Encoding

Non-Visual Encoder

⊹ Audio: HuBERT, Whisper, BEATs

⊹ 3D Point: Point-BERT

Page 62

Original page text

Multimodal Encoding

Unified Multimodal Encoder

⊹ ImageBind:

Embeds all modalities into a joint representation space anchored on the image modality.

Well-aligned modality representations can benefit LLM understanding.

[1] ImageBind: One Embedding Space To Bind Them All. 2023

Page 63

Original page text

Multimodal Encoding

Unified Multimodal Encoder

⊹ LanguageBind:

Embeds all modalities into a joint representation space anchored on the language modality.

Well-aligned modality representations can benefit LLM understanding.

[1] LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. 2023
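What a joint representation space buys can be sketched with a tiny cross-modal retrieval example. The vectors below are made up and are simply assumed to come from already-aligned encoders; no real ImageBind or LanguageBind API is used.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings, assumed to already live in one joint space.
text_embeddings = {
    "a dog barking": [0.9, 0.1, 0.0],
    "rain on a roof": [0.0, 0.2, 0.9],
}
audio_embedding = [0.8, 0.2, 0.1]  # pretend output of an audio encoder

# Because both modalities share the space, nearest-neighbour search works across them.
best_caption = max(
    text_embeddings,
    key=lambda caption: cosine(text_embeddings[caption], audio_embedding),
)
```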

Page 64

Original page text

Multimodal Signal Tokenization

Tokenization

⊹ SEED

[1] Planting a SEED of Vision in Large Language Model. 2023

Page 65

Original page text

Multimodal Signal Tokenization

Tokenization

⊹ AnyGPT

[1] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling. 2024

Page 66

Original page text

Multimodal Signal Tokenization

Tokenization

⊹ VideoPoet

[1] VideoPoet: A Large Language Model for Zero-Shot Video Generation. 2023

Page 67

Original page text

Multimodal Signal Tokenization

Visual (Image&Video) Tokenization in Codebook

⊹ Represent multimodal signals as discrete tokens in a codebook

Advantages: supports unified multimodal signal understanding and generation in an auto-regressive next-token prediction framework

More commonly used in image synthesis

⬩ Parti

⬩ Muse (parallel)

⬩ MaskGIT (parallel)

Representative Multimodal LLMs

⬩ Gemini

⬩ CM3

⬩ VideoPoet
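Codebook tokenization boils down to a nearest-neighbour lookup: each continuous feature vector is replaced by the index of its closest codebook entry. The sketch below uses a tiny hand-written codebook and made-up patch features; real tokenizers learn the codebook jointly with an encoder and decoder.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(f, codebook[i]))
        for f in features
    ]

def dequantize(token_ids, codebook):
    """Look the discrete tokens back up to get (lossy) feature vectors."""
    return [codebook[i] for i in token_ids]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_features = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]

token_ids = quantize(patch_features, codebook)   # discrete tokens an LLM can model
approx = dequantize(token_ids, codebook)
```

Once signals are integer IDs, they can be appended to a text vocabulary and trained with the same next-token objective.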

Page 68

Original page text

Multimodal Signal Tokenization

Audio Tokenization

SpeechTokenizer + RVQ-VAE

SoundStream + RVQ-VAE

[1] SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models. 2023

[2] SoundStream: An End-to-End Neural Audio Codec. 2021
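Residual vector quantization (the "RVQ" in the slide) can be sketched as a stack of codebooks, where each stage quantizes what the previous stage left over. The two codebooks and the input frame below are toy values chosen for this illustration; they are not taken from SpeechTokenizer or SoundStream.

```python
def nearest(vec, codebook):
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )

def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage encodes the residual of the previous one."""
    residual, ids = list(vec), []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        ids.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return ids, residual

def rvq_decode(ids, codebooks):
    """Sum the selected entry of every stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(ids, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # coarse stage
    [[0.0, 0.0], [0.25, -0.25]],   # fine stage
]
frame = [1.25, 0.75]
ids, residual = rvq_encode(frame, codebooks)
reconstruction = rvq_decode(ids, codebooks)
```

Each frame thus becomes one token per stage, and adding stages refines the reconstruction without enlarging any single codebook.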

Page 69

Original page text

Input-side Projection

Methods to Connect Multimodal Representation with LLM

⊹ Projecting multimodal (e.g., image) representations into the LLM semantic space

Linear projection: LLaVA, MiniGPT-4, NExT-GPT

Two-layer MLP: LLaVA-1.5/NeXT, CogVLM, DeepSeek-VL, Yi-VL

Perceiver Resampler: Flamingo, Qwen-VL, MiniCPM-V, LLaVA-UHD

Q-Former: BLIP-2, InstructBLIP, VisCPM, VisualGLM

C-Abstractor: HoneyBee, MM1
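The two simplest connectors in this list, a linear projection and a two-layer MLP, can be sketched in a few lines of plain Python. The dimensions, the random weights, and the ReLU activation are all assumptions made for this illustration; they do not reproduce any specific model.

```python
import random

def matvec(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear_projector(w):
    # One matrix maps encoder features straight into the LLM embedding space.
    return lambda feat: matvec(w, feat)

def mlp_projector(w1, w2):
    # Two matrices with a non-linearity in between (ReLU here, as an assumption).
    def relu(x):
        return max(0.0, x)
    return lambda feat: matvec(w2, [relu(h) for h in matvec(w1, feat)])

rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16   # toy sizes chosen for the example
patch_tokens = [[rng.random() for _ in range(VISION_DIM)] for _ in range(4)]

linear = linear_projector(random_matrix(LLM_DIM, VISION_DIM, rng))
mlp = mlp_projector(
    random_matrix(LLM_DIM, VISION_DIM, rng),
    random_matrix(LLM_DIM, LLM_DIM, rng),
)

linear_out = [linear(t) for t in patch_tokens]  # 4 tokens, each of size LLM_DIM
mlp_out = [mlp(t) for t in patch_tokens]
```

Both connectors keep one output token per input patch; resampler-style connectors such as the Perceiver Resampler or Q-Former instead compress the patches into a fixed number of query tokens.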

Page 70

Original page text

Input-side Projection

Some Insights

⊹ Different papers reach different conclusions about the projection methods.

⊹ Two-layer MLP is better than linear projection. (LLaVA)

⊹ Linear projection is more useful than Q-Former layers. (MiniGPT-4)

[1] Improved Baselines with Visual Instruction Tuning. 2023

[2] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. 2023

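Page 71

Original page text

Backbone LLMs

Open-source Language-based LLMs

LLM | Size (B) | Data Scale (T) | Date | Language | Architecture

Flan-T5 | 3/11 | - | Oct-2022 | en, fr, de | Encoder-Decoder

LLaMA | 7/13 | 1.4 | Feb-2023 | - | Decoder

Alpaca | - | - | Mar-2023 | - | Decoder

Vicuna | 7/13 | 1.4 | Mar-2023 | - | Decoder

LLaMA-2 | 7/13 | - | Jul-2023 | - | Decoder

GLM | 2/10 | 0.4 | Oct-2022 | - | Decoder

Qwen | 1.8/7/14 | - | Sep-2023 | en, zh | Decoder

Skywork | - | 3.2 | Oct-2023 | - | Decoder

("-" marks a cell that is empty in the extracted slide.)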

[1] A Survey of Large Language Models. https://github.com/RUCAIBox/LLMSurvey, 2023

Page 72

Original page text

Decoding-side Connection

Message passing via 1) discrete token of language

⊹ Representative MLLMs:

⊹ Visual-ChatGPT

⊹ HuggingGPT

⊹ GPT4Video

⊹ MM-REACT

⊹ ViperGPT

⊹ ModaVerse

⊹ Vitron

⊹ …

[Figure: labels Text, LLM, Text Response, Multimodal Decoder, Content]

⊹ Pros:

⊹ High performance lower-bound

⊹ More efficient, i.e., without tuning

⊹ Cons:

⊹ Loss of end-to-end tuning capabilities

⊹ Performance upper-bound is limited, i.e., some multimodal signals cannot be optimally conveyed through text

[1] Visual-ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. 2023

[2] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. 2023

[3] ModaVerse: Efficiently Transforming Modalities with LLMs. 2024

[4] VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing. 2024

Page 73

Original page text

Decoding-side Connection

Message passing via 2) continuous embedding

Passing the message from the LLM to downstream decoders via soft embeddings, i.e., signal tokens.

[Figure: labels Text, LLM, embeddings of signal tokens, Multimodal Decoder, Multimodal Content]

⊹ Merits

⊹ Capable of end-to-end tuning, resulting in more efficient instruction transmission

⊹ More able to convey various multimodal signals that text alone cannot express, e.g.,

⊹ the numeration of visual objects

⊹ the visual-spatial relational semantics

[1] Generating Images with Multimodal Language Models. 2023

[2] NExT-GPT: Any-to-Any Multimodal LLM. 2023
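The continuous-embedding route can be sketched as follows: the hidden states at the positions where the LLM emitted signal tokens are collected and projected into the conditioning space of a decoder. The token string `[IMG]`, the 2-d hidden states, and the projection matrix are all hypothetical values for this illustration.

```python
SIGNAL_TOKEN = "[IMG]"  # hypothetical signal token string

def matvec(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def extract_signal_states(tokens, hidden_states):
    """Collect the LLM hidden states at positions where a signal token was emitted."""
    return [h for tok, h in zip(tokens, hidden_states) if tok == SIGNAL_TOKEN]

def to_decoder_condition(signal_states, projection):
    """Project signal-token states into the (assumed) conditioning space of a decoder."""
    return [matvec(projection, h) for h in signal_states]

tokens = ["Here", "is", "the", "picture", "[IMG]", "[IMG]"]
hidden_states = [
    [0.1, 0.2], [0.0, 0.3], [0.4, 0.1], [0.2, 0.2], [1.0, 2.0], [3.0, 4.0],
]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps 2-d states to 3-d conditions

condition = to_decoder_condition(
    extract_signal_states(tokens, hidden_states), projection
)
```

Since the decoder receives vectors rather than words, a loss on the decoder side can in principle be backpropagated through the projection into the LLM, which is the end-to-end tuning merit listed on the slide.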

Page 74

Original page text

Decoding-side Connection

Message passing via 1) discrete token of language

➢ Instruction: There were originally 7 apples on the table, but I ate one. Then, how many apples are left now? Please generate a picture to describe the result.

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 75

Original page text

Decoding-side Connection

Message passing via 2) continuous embedding

➢ Instruction: There were originally 7 apples on the table, but I ate one. Then, how many apples are left now? Please generate a picture to describe the result.

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 76

Original page text

Decoding-side Connection

Message passing via 3) codebooks

The LLM generates special token IDs, i.e., codebook indices, and passes them to downstream (visual) decoders.

[Figure: labels Text, LLM, multimodal embeddings, codebooks, de-tokenize, Multimodal Decoder, Multimodal Content]

⊹ Merits

⊹ Capable of end-to-end tuning for higher efficiency in command transmission

⊹ Better at expressing various multimodal signals that cannot be captured by text alone

⊹ Supports autoregressive multimodal token generation

[1] Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action. 2023

[2] LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models. 2023

[3] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling. 2024

[4] VideoPoet: A Large Language Model for Zero-Shot Video Generation. 2024
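One way to picture codebook-based message passing is a single vocabulary in which image codes sit after the text tokens. The sketch below splits a generated ID sequence back into text and image codes; the vocabulary, the `<boi>`/`<eoi>` markers, and the offset scheme are assumptions for illustration, not the layout of any model cited above.

```python
TEXT_VOCAB = ["<bos>", "a", "red", "apple", "<boi>", "<eoi>"]
NUM_IMAGE_CODES = 4
IMAGE_OFFSET = len(TEXT_VOCAB)  # image code k is stored as token id IMAGE_OFFSET + k

def split_generation(token_ids):
    """Separate an autoregressively generated id sequence into text and image codes."""
    words, image_codes, in_image = [], [], False
    for tid in token_ids:
        if tid < IMAGE_OFFSET:
            word = TEXT_VOCAB[tid]
            if word == "<boi>":
                in_image = True       # image span begins
            elif word == "<eoi>":
                in_image = False      # image span ends
            elif word != "<bos>":
                words.append(word)
        elif in_image and tid - IMAGE_OFFSET < NUM_IMAGE_CODES:
            image_codes.append(tid - IMAGE_OFFSET)
    return " ".join(words), image_codes

generated = [0, 1, 2, 3, 4, 6, 9, 7, 8, 5]
text, image_codes = split_generation(generated)
```

The recovered `image_codes` would then be handed to a de-tokenizer (for example a VQ decoder) to produce pixels.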

Page 77

Original page text

Multimodal Generation

Text Generation

⊹ LLMs naturally support direct text generation via e.g., BPE decoding, Beam search, …
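To make the decoding step concrete, here is a small sketch that contrasts greedy decoding with beam search on a toy bigram model. The probability table is invented for the example; a real LLM would supply next-token probabilities conditioned on the whole prefix.

```python
import math

# Toy next-token model: probabilities depend only on the previous token.
NEXT = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def greedy(max_len=3):
    seq = ["<s>"]
    for _ in range(max_len):
        options = NEXT[seq[-1]]
        seq.append(max(options, key=options.get))  # always take the local best
        if seq[-1] == "</s>":
            break
    return seq

def beam_search(beam_width=2, max_len=3):
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((score, seq))  # finished beams are carried over
                continue
            for tok, p in NEXT[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]

greedy_result = greedy()
beam_result = beam_search()
```

In this toy table greedy decoding commits to "a" early (sequence probability 0.30), while beam search keeps "the" alive and finds the higher-probability sequence (0.36).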

Page 78

Original page text

Multimodal Generation

Generation via Diffusion Models

⊹Visual (Image/Video) Generator

⊹Image Diffusion

⊹Video Diffusion

⊹Audio Generator

⊹Speech Diffusion

⊹Audio Diffusion

[1] NExT-GPT: Any-to-Any Multimodal LLM. 2023

Page 79

Original page text

Multimodal Generation

Generation via Codebooks

⊹Visual (Image/Video) Generator

⊹VQ-VAE + Codebooks

⊹VQ-GAN + Codebooks

⊹Audio Generator

⊹SpeechTokenizer + Residual Vector Quantizer

⊹SoundStream + Residual Vector Quantizer

[1] Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action. 2023

Page 80

Original page text

Multimodal Generation

Generation via Codebooks

⊹ VQ-GAN in Stable Diffusion: latent of size 64 × 64 × 3 or 32 × 32 × 4
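The two latent sizes on this slide can be related to an input image with simple arithmetic. The sketch assumes a 256 × 256 RGB input and downsampling factors of 4 and 8; the slide itself does not state the input resolution, so treat these as illustrative assumptions.

```python
def latent_shape(height, width, factor, channels):
    """Spatial size shrinks by `factor` per side; the channel count is set by the autoencoder."""
    return (height // factor, width // factor, channels)

def compression_ratio(height, width, factor, channels, image_channels=3):
    h, w, c = latent_shape(height, width, factor, channels)
    return (height * width * image_channels) / (h * w * c)

# Assumed settings: a 256 x 256 RGB image, downsampling factors 4 and 8.
shape_f4 = latent_shape(256, 256, factor=4, channels=3)
shape_f8 = latent_shape(256, 256, factor=8, channels=4)
ratio_f4 = compression_ratio(256, 256, factor=4, channels=3)
ratio_f8 = compression_ratio(256, 256, factor=8, channels=4)
```

Under these assumptions the generator works on 16 to 48 times fewer values than raw pixels, which is the practical reason for generating in a latent space.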

Page 81

Original page text

Thanks!

Any questions?

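Frontier

NExT-GPT and "Any-to-Any" Modality Systems

NExT-GPT is a pioneering work in this area and the first to introduce the concept of an "any-to-any" MLLM. Its core technology stack is ImageBind (a unified encoder that handles images, video, and audio) + Vicuna (the LLM backbone) + diffusion-model decoders (Stable Diffusion for image generation, AudioLDM for audio generation, ZeroScope for video generation). A signal token mechanism switches between output modalities.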
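The stack described above can be summarized as a small routing table. The component names come from the text; the signal-token strings `[IMG]`, `[AUD]`, and `[VID]` and the `route` method are hypothetical names for this sketch, not the project's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class AnyToAnyStack:
    encoder: str = "ImageBind"
    llm: str = "Vicuna"
    # Hypothetical signal-token names mapped to the decoders listed in the text.
    decoders: dict = field(default_factory=lambda: {
        "[IMG]": "Stable Diffusion",
        "[AUD]": "AudioLDM",
        "[VID]": "ZeroScope",
    })

    def route(self, llm_output_tokens):
        """Return the decoders triggered by signal tokens, in order of appearance."""
        return [self.decoders[t] for t in llm_output_tokens if t in self.decoders]

stack = AnyToAnyStack()
triggered = stack.route(["Sure", ",", "here", "it", "is", "[IMG]", "and", "[AUD]"])
text_only = stack.route(["Just", "a", "text", "answer"])
```

When the LLM emits no signal token, no decoder is invoked and the reply stays text only.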