Ch5 · Model Watermarking · 动手学大模型 (Hands-On Large Models)

2026/05/19 13:48:30·2026/05/20 14:45:00

Chapter 5

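Model Watermarking: Tracing the Origin of AI-Generated Content

Large language models (LLMs) have shown striking content generation ability, from news writing to code generation and from fiction to academic papers, and AI is reaching into every area of content creation. That capability brings hard questions: how do we prevent LLM misuse? How do we trace where AI-generated content came from? How do we give AI-generated content a verifiable "ID card"?

Text watermarking was developed to address these questions. Its core idea is to embed, while the LLM is generating, a "watermark" that humans cannot perceive but an algorithm can detect. The watermark can be a single bit (marking the text as AI-generated) or a longer code (identifying the specific model, the time, or even the user). By detecting the watermark, we can trace AI-generated content back to its source, which gives AI governance a technical foothold.

This chapter covers the foundational text watermarking algorithm KGW (named after its first three authors, Kirchenbauer, Geiping, and Wen), including its mathematics, implementation details, and detection method, and then examines how robust watermarks are under cross-lingual translation. The accompanying notebook is documents/chapter5/watermark.ipynb, and the experiments use the X-SIR code repository (zwhe99/X-SIR).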

5.1

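Why Do We Need Text Watermarks?

Consider a few scenarios: a sensitive academic paper leaks and investigators suspect it was AI-generated; a scam text message claims to come from a bank but was actually written by a scammer using AI; a piece of fake news spreads on social media and is traced back to an AI. In each case, being able to tell whether a text was AI-generated, and by which model, would go a long way toward assigning responsibility and keeping people safe.

Text watermarking is a technical means to that end. Unlike traditional watermarks for images, audio, and video, text watermarks face distinct challenges: text is a discrete, variable-length sequence with little redundancy, and readers are highly sensitive to small changes in wording. A text watermark must therefore satisfy two seemingly conflicting requirements: it has to embed information, yet remain "invisible" to humans (that is, preserve the naturalness and readability of the text).

A simple and effective framework is to adjust token sampling probabilities slightly during generation, so that watermarked and unwatermarked text differ in a statistically detectable way. A reader is unlikely to notice the difference, but statistical analysis can detect it with high confidence.

Core Properties of Text Watermarks

A text watermark has two key properties: it is "invisible to humans" and it "can be detected algorithmically". The first means the watermark should not hurt readability or naturalness; the second means it should produce a quantifiable statistical bias that lets a detector separate watermarked AI-generated text from human-written text. Together, these two properties are what make text watermarking feasible.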

5.2

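Formal Definitions and Notation

Before turning to the algorithms, we set up a formal framework. Let the language model be $M$, the vocabulary be $V$, and a token sequence be $x_{1:n} = (x_1, x_2, \ldots, x_n)$. At decoding step $n+1$, the model outputs the conditional probability distribution of the next token:

$$P_M(x_{n+1} \mid x_{1:n}) = \text{softmax}(z_{n+1})$$

where $z_{n+1} = M(x_{1:n}) \in \mathbb{R}^{|V|}$ is the logit vector output by the model. This distribution gives, for each token, the probability the model assigns to it coming next given the preceding text.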
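To make the notation concrete, here is a minimal sketch of the softmax step in plain Python. The four logit values are made up for illustration, and a real model would produce a vector with one entry per vocabulary token.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logit vector z_{n+1} over a 4-token vocabulary (illustrative values only)
z_next = [2.0, 1.0, 0.5, -1.0]
p_next = softmax(z_next)                         # P_M(x_{n+1} | x_{1:n})
```

The token with the largest logit gets the largest probability, and the probabilities sum to 1.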

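The watermark's job is to embed information in this sampling process while keeping the overall shape of the distribution: humans should not notice a change, but a statistical detector should be able to find the anomaly.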

5.3

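KGW: Core Idea

KGW (Kirchenbauer et al., "A Watermark for Large Language Models") is a landmark work in text watermarking and received an Outstanding Paper award at ICML 2023. Its core idea comes down to three points:

Vocabulary partition: at each decoding step, the vocabulary is randomly split, based on a hash of the preceding text, into two disjoint subsets, the "green list" $V_g$ and the "red list" $V_r$.
Seeded by a prefix hash: the randomness of the partition is determined by the hash of the preceding text rather than by an independent random number. Given the same preceding text, the partition is therefore deterministic and the watermark is reproducible.
Probability boost: during decoding, the logits of green-list tokens are raised slightly (by adding a bias $\delta$), so the model is more inclined to pick green-list tokens.

Together, these three points produce the watermark: watermarked text statistically contains more green-list tokens, a property that is very unlikely to arise in human-written text, so it can be identified by a statistical test.

Key point: KGW works with the language model's own distribution. It nudges logits instead of forcing a choice, which embeds a statistical bias while keeping the text natural.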

5.4

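KGW: Mathematical Formulation

KGW embeds the watermark in three steps, each with a precise mathematical definition.

Step 1: compute a hash

Given the preceding text $x_{1:n}$, compute a hash $h_{n+1} = H(x_{1:n})$. In practice $H$ uses only the last $k$ tokens $x_{n-k+1:n}$ (a window hash), which limits how much context the partition depends on while still varying the partition from step to step.

Step 2: partition the vocabulary

Seed a random number generator with $h_{n+1}$ and use it to split the vocabulary $V$ into the green list $V_g$ and the red list $V_r$. The partition satisfies $V_g \cap V_r = \emptyset$ and $V_g \cup V_r = V$.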
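Steps 1 and 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hashing scheme of any particular implementation: the SHA-256 hash, the `key` value, and the function name `green_list` are all hypothetical choices made for the example.

```python
import hashlib
import random

def green_list(prev_tokens, vocab_size, gamma=0.5, k=1, key=b"demo-key"):
    """Return the set of green token ids for the next decoding step.

    The seed depends only on the last k token ids (a window hash), so a
    detector can recompute exactly the same partition from the text alone.
    """
    window = prev_tokens[-k:]
    payload = key + b"|" + ",".join(map(str, window)).encode()
    seed = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")
    rng = random.Random(seed)                 # RNG seeded by the hash h_{n+1}
    ids = list(range(vocab_size))
    rng.shuffle(ids)                          # random permutation of the vocabulary
    return set(ids[: int(gamma * vocab_size)])  # first gamma*|V| ids are green

g1 = green_list([5, 17, 42], vocab_size=1000)
g2 = green_list([9, 9, 42], vocab_size=1000)   # same last token, so same partition when k=1
g3 = green_list([5, 17, 43], vocab_size=1000)  # different last token, different partition
```

With `k=1` only the most recent token matters, which is why `g1` and `g2` coincide. A larger window ties the partition to more context, which makes the watermark more sensitive to edits in the preceding tokens.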

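Step 3: adjust the logits

Adjust the logit vector as follows:

$$\tilde{z}_{n+1,i} = \begin{cases} z_{n+1,i} + \delta & \text{if } v_i \in V_g \\ z_{n+1,i} & \text{if } v_i \in V_r \end{cases}$$

where $\delta > 0$ is a tunable bias. A larger $\delta$ gives a stronger watermark (green tokens are chosen with higher probability) but distorts the text distribution more; a smaller $\delta$ keeps the text more natural but makes the watermark harder to detect.

The adjusted distribution is $P_{\text{watermark}}(x_{n+1} \mid x_{1:n}) = \text{softmax}(\tilde{z}_{n+1})$. Because green-list logits are raised, green-list tokens are sampled more often. Accumulated over many decoding steps, watermarked text ends up with a green-token fraction well above what random chance would give.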
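The effect of step 3 is easiest to see on a toy distribution. The sketch below assumes a flat 8-token vocabulary, a hand-picked green list, and $\delta = 2$; all three are illustrative values, and `apply_watermark` is a hypothetical helper name.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def apply_watermark(logits, green_ids, delta=2.0):
    """Add delta to green-list logits; leave red-list logits unchanged."""
    return [z + delta if i in green_ids else z for i, z in enumerate(logits)]

logits = [0.0] * 8                  # flat toy distribution over 8 tokens
green_ids = {0, 1, 2, 3}            # half the vocabulary, chosen by hand
p_plain = softmax(logits)
p_marked = softmax(apply_watermark(logits, green_ids, delta=2.0))

# Total probability mass on green tokens before and after the bias
green_mass_plain = sum(p_plain[i] for i in green_ids)
green_mass_marked = sum(p_marked[i] for i in green_ids)
```

For a flat distribution with half the tokens green, the green mass moves from 0.5 to $e^{\delta}/(e^{\delta}+1)$, about 0.88 for $\delta = 2$. When the model is already confident about one token, the same bias changes much less, which is how the scheme limits its impact on low-entropy text.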

5.5

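KGW Detection: the z-score Statistic

Detection mirrors embedding. Given a text $x$, we recompute the green list at every position with the same hashing and partition rule (steps 1 and 2), and count the number of green tokens, $|x|_g$.

Let $\gamma = \frac{|V_g|}{|V|}$ be the fraction of the vocabulary on the green list (for example 0.5), and let $|x|$ be the length of the text in tokens. (The course slides write the length as $|V|$ in this formula; it means the token count of the text, not the vocabulary size.) In text without a watermark, each token lands on the green list with probability $\gamma$, so the green-token count approximately follows a binomial distribution with mean $\gamma |x|$ and variance $|x|\gamma(1-\gamma)$.

KGW uses the z-score as its measure of watermark strength:

$$s = \frac{|x|_g - \gamma |x|}{\sqrt{|x| \gamma (1-\gamma)}}$$

The z-score measures how far the observed green-token count lies from its chance expectation, in units of standard deviations. The larger it is (watermarked text is typically above 10), the more green tokens the text contains relative to chance, and the more likely it is that the text was watermarked.

At detection time we fix a threshold (for example, z-score > 4): a text whose z-score exceeds it is classified as watermarked, otherwise as unwatermarked. Choosing the threshold trades off the false positive rate (human text flagged as watermarked) against the false negative rate (watermarked text that goes undetected).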
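The detection rule is short enough to write out directly. The sketch below takes the green-token count and the token count of the text as given (computing them requires replaying the partition at each position); the function names and the sample counts are illustrative, and the p-value uses the usual normal approximation to the binomial.

```python
import math

def z_score(num_green, num_tokens, gamma=0.5):
    """Standardized deviation of the green-token count from its chance expectation."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (num_green - expected) / std

def is_watermarked(num_green, num_tokens, gamma=0.5, threshold=4.0):
    """Flag the text when its z-score exceeds the threshold."""
    return z_score(num_green, num_tokens, gamma) > threshold

def one_sided_p_value(z):
    """P(Z >= z) for a standard normal Z (normal approximation)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

z_human = z_score(num_green=104, num_tokens=200)    # 52% green tokens
z_marked = z_score(num_green=160, num_tokens=200)   # 80% green tokens
```

Under the normal approximation, a threshold of 4 corresponds to a false positive rate of roughly 3 in 100,000, which is why a seemingly modest threshold already gives very few false alarms on human text.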

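An Intuitive Reading of the z-score

The z-score is a standardized measure of deviation. If the green-token fraction matches the chance expectation $\gamma$, the z-score is about 0; if green tokens are clearly over-represented (say 60% instead of 50%), the z-score is positive, and it grows with the length of the text. In the KGW experiments, watermarked text typically scores between 10 and 15, far above the noise level of human text (usually within ±3), so detection accuracy is very high.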
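As a worked example of the "60% instead of 50%" case, write $T$ for the number of tokens in the text and take $\gamma = 0.5$. The two lengths, 200 and 1000 tokens, are chosen only for illustration.

```latex
z = \frac{|x|_g - \gamma T}{\sqrt{T\gamma(1-\gamma)}}
  = \frac{(0.6 - 0.5)\,T}{\sqrt{0.25\,T}} = 0.2\sqrt{T}
\qquad\Rightarrow\qquad
z\big|_{T=200} = \frac{120 - 100}{\sqrt{50}} \approx 2.83,
\quad
z\big|_{T=1000} = \frac{600 - 500}{\sqrt{250}} \approx 6.32
```

The same green fraction gives a stronger signal on longer text because the z-score scales with $\sqrt{T}$. At a 60% green fraction, a z-score of 10 would need about 2500 tokens, so scores of 10 to 15 on short passages imply a green fraction well above 60%.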

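The reported results give KGW an area under the ROC curve (AUC) of 0.998, close to perfect detection. In most cases, then, watermarked text can be reliably told apart from human text.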

5.6

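SIR: Improving Robustness

Despite its strong detection performance, KGW has a notable weakness: it is not robust to text rewriting attacks. Once watermarked text has been paraphrased or translated, the carefully embedded green-token distribution is scrambled and detection performance drops sharply.

SIR (Semantic Invariant Robust watermark) was designed to address this. Logit adjustment can be viewed in general as adding a vector $\Delta(x_{1:n}) \in \mathbb{R}^{|V|}$ to the logits, $\tilde{z}_{n+1} = z_{n+1} + \Delta(x_{1:n})$. SIR's core idea is to make this adjustment follow the semantics of the text. Concretely, SIR trains a function $\Delta(x_{1:n})$ such that:

$$\text{Sim}(\Delta(x), \Delta(y)) \approx \text{Sim}(E(x), E(y))$$

where $E(\cdot)$ is a text embedding model and $\text{Sim}(\cdot)$ is a similarity function. Intuitively, semantically similar texts (such as two phrasings of the same sentence) should produce similar logit adjustments, so that part of the watermark survives even when the text is rewritten.

SIR's main training objective minimizes the gap between the similarity of the text embeddings and the similarity of the logit adjustments:

$$L = |\text{Sim}(E(x), E(y)) - \text{Sim}(\Delta(x), \Delta(y))|$$

In addition, SIR trains each $\Delta_i$ to be close to +1 or -1. SIR can therefore be seen as an improvement built on KGW, in which $\Delta_i > 0$ means that token $v_i$ is on the green list. Because it adds this semantic-consistency constraint on top of KGW, SIR holds up better under paraphrase attacks.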
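The objective can be evaluated by hand on toy vectors. The sketch below assumes cosine similarity for $\text{Sim}$ and uses made-up embeddings and adjustment vectors; in SIR itself, $\Delta$ is a trained network and $E$ is a pretrained embedding model, neither of which is reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sir_loss(emb_x, emb_y, delta_x, delta_y):
    """|Sim(E(x), E(y)) - Sim(Delta(x), Delta(y))| with cosine similarity."""
    return abs(cosine(emb_x, emb_y) - cosine(delta_x, delta_y))

def green_ids(delta):
    """Tokens with a positive adjustment play the role of the green list."""
    return {i for i, d in enumerate(delta) if d > 0}

# Toy embeddings of two similar prefixes (cosine similarity 0.8)
emb_x, emb_y = [1.0, 0.0], [0.8, 0.6]
# Toy adjustment vectors over a 5-token vocabulary (cosine similarity 0.6)
delta_x = [1.0, -1.0, 1.0, -1.0, 1.0]
delta_y = [1.0, -1.0, 1.0, -1.0, -1.0]

loss = sir_loss(emb_x, emb_y, delta_x, delta_y)
```

Here the two prefixes are close in embedding space and their green lists differ in only one token, so the loss is small (0.2). Training pushes this gap toward zero across many prefix pairs.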

5.7

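X-SIR: A Cross-Lingual Watermark Defense

When watermarked text is translated into another language, the detection performance of both KGW and SIR drops substantially. The reason is that the partition is defined over the model's vocabulary: translation breaks the token correspondence, so tokens that were green may no longer be recognized as green.

X-SIR proposes a defense for the cross-lingual setting. Its core idea is to introduce semantic clustering: tokens in the vocabulary with the same meaning are grouped together, and the watermark adjustment is applied per group rather than per token.

Specifically, X-SIR defines a semantic clustering as a partition of the vocabulary, $C = \{C_1, C_2, \ldots, C_{|C|}\}$, where each cluster $C_i$ consists of semantically equivalent tokens. The $\Delta$ function now outputs one bias per cluster ($\Delta \in \mathbb{R}^{|C|}$ rather than $\mathbb{R}^{|V|}$), and the logits are adjusted at the cluster level:

$$\tilde{z}_{n+1,i} = z_{n+1,i} + \Delta_{C(i)}$$

where $C(i)$ is the index of the cluster that token $v_i$ belongs to. The clusters are built from a bilingual dictionary: if two tokens are translations of each other in the dictionary, they are linked, and each connected group of linked tokens forms one cluster. When the text is translated, the token identities change, but because a source-language word and its target-language translation sit in the same cluster, the watermark carries over across languages.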
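The cluster construction amounts to finding connected components in a graph whose edges are dictionary entries. The sketch below uses a small union-find structure; the six-token vocabulary, the dictionary entries, and the cluster biases are invented for illustration and are not taken from the X-SIR repository.

```python
def semantic_clusters(vocab, dictionary):
    """Map each token to a cluster index; clusters are the connected
    components of the graph whose edges are in-vocabulary dictionary pairs."""
    parent = {tok: tok for tok in vocab}

    def find(tok):
        while parent[tok] != tok:
            parent[tok] = parent[parent[tok]]   # path halving
            tok = parent[tok]
        return tok

    for a, b in dictionary:
        if a in parent and b in parent:          # skip entries outside the vocabulary
            parent[find(a)] = find(b)

    cluster_of, roots = {}, {}
    for tok in vocab:
        root = find(tok)
        cluster_of[tok] = roots.setdefault(root, len(roots))
    return cluster_of

vocab = ["cat", "猫", "kitty", "dog", "狗", "the"]
dictionary = [("cat", "猫"), ("kitty", "猫"), ("dog", "狗"), ("bird", "鸟")]
cluster_of = semantic_clusters(vocab, dictionary)

# One bias per cluster (toy values near +1 / -1), applied to every token in it
cluster_bias = {0: 1.0, 1: -1.0, 2: 1.0}
logits = {tok: 0.0 for tok in vocab}
adjusted = {tok: z + cluster_bias[cluster_of[tok]] for tok, z in logits.items()}
```

Because "cat", "kitty", and "猫" share a cluster, they always receive the same bias and therefore sit on the same list. The entry ("bird", "鸟") is ignored since neither token is in the vocabulary, and "the" ends up alone in its own cluster.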

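Two Necessary Conditions for Cross-Lingual Consistency

The X-SIR study shows that cross-lingual watermark consistency requires two conditions. First, semantically similar tokens should be on the same list (green or red). Second, semantically similar prefixes should produce the same vocabulary partition. The watermark stays detectable after translation only when both hold at once. SIR's training objective already targets the second condition; X-SIR's semantic clustering adds the first.

Experiments show that X-SIR detects watermarks under translation attacks considerably better than KGW and SIR, but it has limitations: semantic clustering covers only the tokens shared between the model's vocabulary and the external dictionary (about 20.76% of the vocabulary), and it supports only the languages the model itself supports.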

5.8

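Experiment Workflow: Watermarking with X-SIR

This section walks through a complete watermarking pipeline using the X-SIR code repository. The repository includes three watermarking algorithms (KGW, SIR, X-SIR) and two attack methods (paraphrase, translation), which makes it a good resource for learning about watermarking.

Environment setup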

git clone https://github.com/zwhe99/X-SIR && cd X-SIR
conda create -n xsir python==3.10.10
conda activate xsir
pip3 install -r requirements.txt

Data preparation

Organize the prompts to process as a JSONL file, one JSON object per line, each with at least a prompt field:

{"prompt": "Ghost of Emmett Till: Based on Real Life Events"}
{"prompt": "Antique Cambridge Glass Pink Decagon Console Bowl Engraved Gold Highlights"}
{"prompt": "2009 > Information And Communication Technology Index statistics - Countries"}
...
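If your prompts start out as a Python list, a few lines of standard-library code are enough to produce a file in this shape. This is a minimal sketch: the temporary output path is an arbitrary choice for the example, and only the prompt field comes from the format described above.

```python
import json
import os
import tempfile

prompts = [
    "Ghost of Emmett Till: Based on Real Life Events",
    "Antique Cambridge Glass Pink Decagon Console Bowl Engraved Gold Highlights",
]

# Write one JSON object per line (path chosen only for this example)
path = os.path.join(tempfile.mkdtemp(), "prompts.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for p in prompts:
        f.write(json.dumps({"prompt": p}, ensure_ascii=False) + "\n")

# Read the file back to check that every line parses and has a prompt field
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
```

Using `ensure_ascii=False` keeps non-English prompts readable in the file, which is convenient when preparing data for the cross-lingual experiments.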

Watermark embedding

Generate watermarked text with the Baichuan-7B model and the KGW algorithm:

MODEL_NAME=baichuan-inc/Baichuan-7B
MODEL_ABBR=baichuan-7b
python3 gen.py \
    --base_model $MODEL_NAME \
    --fp16 \
    --batch_size 32 \
    --input_file data/dataset/mc4/mc4.en.jsonl \
    --output_file gen/$MODEL_ABBR/kgw/mc4.en.mod.jsonl \
    --watermark_method kgw

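The output file has two fields, prompt and response, where response is the model's watermarked output.

Watermark detection

Compute z-scores for the watermarked and the unwatermarked text: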

# z-scores of the watermarked text
python3 detect.py \
    --base_model $MODEL_NAME \
    --detect_file gen/$MODEL_ABBR/kgw/mc4.en.mod.jsonl \
    --output_file gen/$MODEL_ABBR/kgw/mc4.en.mod.z_score.jsonl \
    --watermark_method kgw

# z-scores of the unwatermarked text
python3 detect.py \
    --base_model $MODEL_NAME \
    --detect_file data/dataset/mc4/mc4.en.jsonl \
    --output_file gen/$MODEL_ABBR/kgw/mc4.en.hum.z_score.jsonl \
    --watermark_method kgw

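The output JSONL files contain the z-score of each sample. Watermarked text typically scores between 10 and 15, while the z-scores of unwatermarked text cluster around 0 (approximately standard normal).

Watermark evaluation

Compute detection accuracy and plot the ROC curve: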

python3 eval_detection.py \
    --hm_zscore gen/$MODEL_ABBR/kgw/mc4.en.hum.z_score.jsonl \
    --wm_zscore gen/$MODEL_ABBR/kgw/mc4.en.mod.z_score.jsonl \
    --roc_curve roc

Typical output:

AUC: 1.000
TPR@FPR=0.1: 0.998
TPR@FPR=0.01: 0.998
F1@FPR=0.1: 0.999
F1@FPR=0.01: 0.999

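An AUC close to 1.0 shows that KGW detects watermarks on clean (unattacked) text very reliably, separating watermarked text from human text almost perfectly.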
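To see what these metrics measure, the sketch below computes AUC and TPR at a fixed FPR from two lists of z-scores in plain Python. It is an independent illustration, not the logic of eval_detection.py, and the scores are made-up values.

```python
def auc(human_scores, watermarked_scores):
    """Probability that a random watermarked score exceeds a random
    human score (ties count as 0.5)."""
    wins = 0.0
    for w in watermarked_scores:
        for h in human_scores:
            if w > h:
                wins += 1.0
            elif w == h:
                wins += 0.5
    return wins / (len(human_scores) * len(watermarked_scores))

def tpr_at_fpr(human_scores, watermarked_scores, target_fpr):
    """TPR at the lowest threshold whose FPR on human text is <= target_fpr."""
    ranked = sorted(human_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))      # human texts we may flag
    # A text is flagged when score > threshold
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(s > threshold for s in watermarked_scores)
    return flagged / len(watermarked_scores)

human = [-1.2, 0.3, 0.8, -0.5, 1.9, 0.1, -2.0, 0.6, 1.1, -0.3]
marked = [11.2, 9.8, 13.5, 1.5, 12.0]            # one weakly watermarked sample

auc_value = auc(human, marked)
tpr_10 = tpr_at_fpr(human, marked, target_fpr=0.1)
tpr_01 = tpr_at_fpr(human, marked, target_fpr=0.01)
```

Tightening the allowed false positive rate raises the threshold, so the weakly watermarked sample (z-score 1.5) is caught at FPR 0.1 but missed at FPR 0.01. This is the trade-off the TPR@FPR rows in the output above report.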

5.9

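Robustness Evaluation: Translation Attacks

A watermark is only as useful as it is robust. Once watermarked text has been paraphrased or translated, can the watermark still be detected?

Run a translation attack on the watermarked text with GPT-3.5-Turbo: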

# Set the OpenAI API key
export OPENAI_API_KEY=xxxx

# Translate the English watermarked text into Chinese
python3 attack/translate.py \
    --input_file gen/$MODEL_ABBR/kgw/mc4.en.mod.jsonl \
    --output_file gen/$MODEL_ABBR/kgw/mc4.en-zh.mod.jsonl \
    --model gpt-3.5-turbo-1106 \
    --src_lang en \
    --tgt_lang zh

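Run watermark detection again on the translated text and you will find that detection performance drops markedly. Translation changes the token sequence: tokens that were green may now be red, and the original statistical bias is "washed out".

This points to an important research direction: how can we design watermarking algorithms that are robust to translation attacks? X-SIR addresses the problem to some extent through semantic clustering, but as noted above, its coverage leaves room for improvement.

There is also a gap with real scenarios around language switching: an attacker who wants to remove the watermark typically does not want to change the language of the response. The attacker can instead route through a pivot language, for example translating English text into Chinese (or another language) and then back into English, so that the rewording introduced by translation breaks the watermark while the output language stays the same.

Cross-Lingual Consistency Metrics

To quantify how translation affects a watermark, the authors define two metrics over the watermark strength before translation, $S$, and after translation, $\hat{S}$. The Pearson correlation coefficient (PCC) captures trend consistency, that is, how well the strengths before and after translation correlate. The relative error (RE) captures magnitude consistency, that is, how far the strength after translation deviates from the original. In the reported experiments, the PCCs of mainstream watermarking methods are generally below 0.2 and the REs are mostly above 80%, so cross-lingual robustness remains an open problem.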
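Both metrics are easy to compute from paired z-scores. The sketch below assumes RE is the mean of $|\hat{S} - S| / |S|$ expressed as a percentage, and uses made-up strengths; the function names are illustrative.

```python
import math

def pcc(s, s_hat):
    """Pearson correlation between strengths before and after translation."""
    n = len(s)
    mean_s = sum(s) / n
    mean_h = sum(s_hat) / n
    cov = sum((a - mean_s) * (b - mean_h) for a, b in zip(s, s_hat))
    var_s = sum((a - mean_s) ** 2 for a in s)
    var_h = sum((b - mean_h) ** 2 for b in s_hat)
    return cov / math.sqrt(var_s * var_h)

def relative_error(s, s_hat):
    """Mean of |s_hat - s| / |s|, as a percentage."""
    return 100.0 * sum(abs(b - a) / abs(a) for a, b in zip(s, s_hat)) / len(s)

before = [10.0, 12.0, 8.0, 14.0]       # z-scores of watermarked texts
after_good = [9.0, 11.0, 7.0, 13.0]    # a watermark that mostly survives
after_bad = [1.0, 0.5, 1.5, 0.0]       # a watermark that is washed out

pcc_good, re_good = pcc(before, after_good), relative_error(before, after_good)
pcc_bad, re_bad = pcc(before, after_bad), relative_error(before, after_bad)
```

A consistent watermark would look like the first case: high PCC and low RE. The second case, with low PCC and RE above 80%, matches the pattern reported for current methods.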

5.10

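Chapter Summary

This chapter gave a systematic introduction to text watermarking, from theory to engineering practice and from detection principles to robustness challenges. The key points are:

The goal of text watermarking is to embed in LLM-generated content a marker that is invisible to humans but detectable by an algorithm, so that the origin of AI-generated content can be traced.
KGW embeds the watermark through vocabulary partitioning (green list / red list) and logit adjustment. Green-list tokens get a slightly higher sampling probability, which produces a detectable statistical bias.
Detection uses the z-score statistic, which measures how far the observed green-token count deviates from its chance expectation. The larger the z-score, the higher the confidence that a watermark is present.
KGW detects watermarks on clean text very reliably, with an AUC of about 0.998, but it is not robust to paraphrase and translation attacks.
SIR improves robustness under rewriting attacks by adding a semantic-similarity constraint. X-SIR further improves detection under cross-lingual translation through semantic clustering.
A cross-lingual watermark must satisfy two conditions: semantically similar tokens are on the same list, and semantically similar prefixes produce the same partition.

Text watermarking is an important technical building block for AI governance. Current methods still face robustness challenges, but they offer a workable path to tracing the origin of AI-generated content. As research progresses, watermarking techniques should continue to mature and contribute to a more trustworthy AI ecosystem.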

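Slides

Close Reading of the Slides

The following is drawn from the text of this chapter's PDF slides (48 pages) and keeps the structure of the lecture notes for reference.

Motivation

Why watermark LLM-generated content?

Large language models have exhibited impressive content generation capabilities. Mitigating LLM misuse is important, and tagging and identifying LLM-generated content would help with governance. Watermarking is an emerging technique that embeds a detectable signal in generated text with little impact on output quality.

Method

The KGW Watermarking Algorithm

The KGW watermark (Kirchenbauer et al.) is an ICML 2023 Outstanding Paper. Its core idea is to embed the watermark as a slight statistical bias in the token sampling distribution during generation. Concretely, a small logit offset +δ is added to the tokens on the green list (a fraction γ of the vocabulary, for example γ = 0.25), which slightly raises the probability of sampling them.

Detection computes the watermark strength (z-score): watermarked text has a markedly higher z-score than unwatermarked text, and the AUC can approach 100%.

Robustness

SIR and X-SIR: Improving Robustness

SIR (Semantic Invariant Robust watermark), published at ICLR 2024, improves robustness to text rewriting attacks such as paraphrase and re-translation. It trains the logit adjustment to follow the semantics of the preceding text, so that semantically similar texts receive similar adjustments. X-SIR builds on SIR by adding semantic clustering of the vocabulary for cross-lingual robustness. The X-SIR code repository implements all three algorithms (KGW, SIR, X-SIR) and supports evaluation under two removal attacks, paraphrase and translation.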

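The following is the text of this chapter's PDF slides (48 pages in total), presented in page order and keeping the original wording and structure of the lecture notes.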

Page 1

Slide text

Motivation

Watermarking Method

Can Watermarks Survive Translation?

Limitations

Watermark for LLMs

Zhiwei He

School of Electronic Information and Electrical Engineering

Shanghai Jiao Tong University

April 14, 2024

Zhiwei He

Watermark for LLMs

1 / 44

Page 2

Slide text

Outline

Motivation

Watermarking Method: Intro, KGW (ICML 23 Outstanding), SIR (ICLR 24)

Can Watermarks Survive Translation?: Intro, Evaluation, Attack, Defense

Page 3

Slide text

Motivation

Large language models (LLMs) have exhibited impressive content generation capabilities.

Mitigating the misuse of LLMs is important.

Tagging and identifying LLM-generated content would help.


Page 7

Slide text

Intro

Text watermarking embeds a "message" into the LLM-generated content.

invisible to human

can be detected algorithmically

In the simplest form, the "message" can be a single bit indicating the presence of the watermark.


Page 9

Slide text

Notations

Language model: $M$

Vocab: $V$

A sequence of tokens: $x_{1:n} = (x_1, x_2, \ldots, x_n)$

Conditional probability of the next token: $P_M(x_{n+1} \mid x_{1:n})$

Logits of the next token: $z_{n+1} = M(x_{1:n}) \in \mathbb{R}^{|V|}$

Therefore, we have $P_M(x_{n+1} \mid x_{1:n}) = \text{softmax}(z_{n+1})$.

Page 10

Slide text

KGW (ICML 23 Outstanding)

Page 11

Slide text

Core idea

Vocab partition based on preceding text.

Vocab partition: in each step, randomly split the vocab $V$ into two disjoint subsets, the green list $V_g$ and the red list $V_r$.

Preceding-text-based: the randomness is seeded by the hash of the preceding text.

Increase probs for green tokens (tokens in $V_g$).

Page 12

Slide text

Watermark Ironing

In each step of decoding:

1. compute a hash of $x_{1:n}$: $h_{n+1} = H(x_{1:n})$ (note: $H(\cdot)$ can only use the last $k$ tokens $x_{n-k+1:n}$),

2. seed a random number generator with $h_{n+1}$ and randomly partition $V$ into two disjoint lists: the green list $V_g$ and the red list $V_r$,

3. adjust the logits $z_{n+1}$ by adding a constant bias $\delta$ ($\delta > 0$) for tokens in the green list:

$$\forall i \in \{1, 2, \ldots, |V|\}, \quad \tilde{z}_{n+1,i} = \begin{cases} z_{n+1,i} + \delta, & \text{if } v_i \in V_g, \\ z_{n+1,i}, & \text{if } v_i \in V_r. \end{cases} \tag{1}$$

Page 13

Slide text

Watermark Ironing

As a result, watermarked text will statistically contain more green tokens, an attribute unlikely to occur in human-written text.

Page 14

Slide text

Watermark Detection

When detecting, one can apply step (1) and (2), and calculate the z-score as the watermark strength of $x$:

$$s = \frac{|x|_g - \gamma |V|}{\sqrt{|V| \gamma (1-\gamma)}} \tag{2}$$

where $|x|_g$ is the number of green tokens in $x$ and $\gamma = \frac{|V_g|}{|V|}$. The presence of the watermark can be determined by comparing $s$ with an appropriate threshold.

Page 15

Slide text

Performance (ROC curves, AUC: 0.998)

Page 16

Slide text

Text quality

$$\forall i \in \{1, 2, \ldots, |V|\}, \quad \tilde{z}_{n+1,i} = \begin{cases} z_{n+1,i} + \delta, & \text{if } v_i \in V_g, \\ z_{n+1,i}, & \text{if } v_i \in V_r. \end{cases}$$


Page 18

Slide text

SIR (ICLR 24)

Page 19

Slide text

Intro

The robustness of watermarking, i.e., the ability to detect watermarked text even after it has been modified, is important.

Semantic invariant robust watermark (SIR) is designed to improve the robustness under text re-writing attack.

Text re-writing attack: modify the wording of the text without changing its semantics, such as re-translation and paraphrase.

Page 20

Slide text

A general view of logits adjustment

$$\forall i \in \{1, 2, \ldots, |V|\}, \quad \tilde{z}_{n+1,i} = \begin{cases} z_{n+1,i} + \delta, & \text{if } v_i \in V_g, \\ z_{n+1,i}, & \text{if } v_i \in V_r. \end{cases}$$

We can view the process of adjusting the logits as applying a $\Delta$ function ($\Delta \in \mathbb{R}^{|V|}$):

$$\tilde{z}_{n+1} = z_{n+1} + \Delta(x_{1:n}).$$

Page 21

Slide text

Method

Core idea

$\text{Sim}(\Delta(x), \Delta(y)) \approx \text{Sim}(E(x), E(y))$, where $E(\cdot)$ is an embedding model, and $\text{Sim}(\cdot)$ is a similarity function.

Given an embedding model $E$, SIR trains a $\Delta$ function with main objective:

$$L = |\text{Sim}(E(x), E(y)) - \text{Sim}(\Delta(x), \Delta(y))|. \tag{3}$$

Furthermore, $\forall i \in \{1, 2, \ldots, |V|\}$, $\Delta_i$ is trained to be close to $+1$ or $-1$. Therefore, SIR can be seen as an improvement based on KGW, where $\Delta_i > 0$ indicates that $v_i$ is a green token.

Page 22

Slide text

Performance (Our implementation)


Page 24

Slide text

Can Watermarks Survive Translation?

Existing works on watermark robustness focus mainly on English.

However, our world is multilingual.

What if we translate watermarked text into another language? Can watermarks survive translation?


Page 26

Slide text

Evaluation: cross-lingual consistency of text watermark

We define cross-lingual consistency to assess the ability of text watermarks to maintain their effectiveness after being translated into other languages.

Given original watermark strength $S$, and the strength after translation $\hat{S}$:

Pearson Correlation Coefficient (PCC) captures trend consistency:

$$\text{PCC}(S, \hat{S}) = \frac{\text{cov}(S, \hat{S})}{\sigma_S \sigma_{\hat{S}}} \tag{4}$$

Relative Error (RE) captures magnitude consistency:

$$\text{RE}(S, \hat{S}) = \mathbb{E}\left[\frac{|\hat{S} - S|}{|S|}\right] \times 100\%. \tag{5}$$

Page 27

Slide text

PCCs and REs

PCCs are generally less than 0.2, and the REs are predominantly above 80%.

SIR is slightly better than other methods.

Page 28

Slide text

Watermark strength vs Text length

Current text watermarking methods lack cross-lingual consistency.


Page 30

Slide text

Attack: the gaps between real scenarios

Language switching: An attacker who wants to remove the watermark typically does not want to change the language of the response.

Text quality: Translation might affect text quality, but we have not conducted evaluation because we change the language of response in the previous section.

Page 31

Slide text

Cross-lingual Watermark Removal Attack (CWRA)

Page 32

Slide text

Performance: watermark detection

Footnote 1: We fixed the paraphraser and translator used in all methods as gpt-3.5-turbo-0613.

Footnote 2: The base model is Baichuan, supporting English and Chinese.

Page 33

Slide text

Performance: text quality

These attack methods not only preserve text quality, but also bring slight improvements in most cases. This might be attributed to good translators and paraphrasers.

CWRA has the best overall results. We speculate that Baichuan performs even better in the pivot language (Chinese) than in the original language (English).


Page 35

Slide text

Defense: how to improve cross-lingual consistency?

KGW-based watermarking methods fundamentally depend on the partition of the vocab, i.e., the red and green lists, as discussed in Section 2.

Cross-lingual consistency: the green tokens in the watermarked text will still be recognized as green tokens after being translated into other languages.

Page 36

Slide text

A simplest case study - 1

✓ Factor 1: semantically similar tokens should be in the same list (either red or green).

✓ Factor 2: the vocab partitions for semantically similar prefixes should be the same.

Page 37

Slide text

A simplest case study - 2

✗ Factor 1: semantically similar tokens should be in the same list (either red or green).

✓ Factor 2: the vocab partitions for semantically similar prefixes should be the same.


Page 39

Slide text

A simplest case study - 3

✓ Factor 1: semantically similar tokens should be in the same list (either red or green).

✗ Factor 2: the vocab partitions for semantically similar prefixes should be the same.

Factor 1 & 2 must be satisfied simultaneously.

Page 40

Slide text

Defense Method

SIR has already optimized for Factor 2 since its objective is:

$$L = |\text{Sim}(E(x), E(y)) - \text{Sim}(\Delta(x), \Delta(y))|. \tag{6}$$

Based on SIR, we discuss how to achieve Factor 1 and name our method X-SIR.

Page 41

Slide text

Defense Method (X-SIR): adapting $\Delta$ function

We define semantic clustering as a partition $C$ of the vocabulary $V$:

$$C = \{C_1, C_2, \ldots, C_{|C|}\}, \tag{7}$$

where each cluster $C_i$ consists of semantically equivalent tokens.

We adapt the $\Delta$ function so that it yields biases to each cluster in $C$, i.e., $\Delta \in \mathbb{R}^{|C|}$ (rather than $\Delta \in \mathbb{R}^{|V|}$).

Thus, the process of adjusting the logits should be:

$$\forall i \in \{1, 2, \ldots, |V|\}, \quad \tilde{z}_{n+1,i} = z_{n+1,i} + \Delta_{C(i)}, \tag{8}$$

where $C(i)$ indicates the index of $v_i$'s cluster within $C$.

Page 42

Slide text

Defense Method (X-SIR): semantic clustering of vocab

Algorithm 1 Constructing semantic clusters

Require: A vocabulary V, a bilingual dictionary D

Ensure: Semantic clusters C

1: Initialize an empty graph G with nodes for each token in V

2: for each entry (v_i, v_j) in the bilingual dictionary D do

3:     if both v_i and v_j are in V then

4:         Add an edge (v_i, v_j) to G

5:     end if

6: end for

7: Initialize C to be an empty set

8: for each connected component c in G do

9:     Add c to C

10: end for

11: return C

Page 43

Slide text

Defense Method (X-SIR): semantic clustering of vocab

Annotation on lines 2-3 of the cluster construction algorithm: we only consider tokens shared by V and D, which results in limitations (discussed later).

Page 44

Slide text

Defense Method (X-SIR): semantic clustering of vocab

We also consider the meta symbol (U+2581) for sentencepiece.

Page 45

Slide text

Performance: watermark detection

Page 46

Slide text

Performance: text quality

Page 47

Slide text

Limitations

Semantic clustering only considers tokens shared by the vocab V of the model and the external dictionary D, which results in the following limitations.

Language coverage: only supports languages supported by the model. In a real scenario, the attacker can choose the original language and the pivot language at will.

Vocab coverage (20.76%): since the external dictionary D only contains whole words, subword units cannot be clustered. The Llama tokenizer tends to split a Chinese char into multiple bytes.

Page 48

Slide text

Paper & Code

https://arxiv.org/abs/2402.14007

https://github.com/zwhe99/X-SIR