Graph Embedded Pose Clustering for Anomaly Detection

2024/04/26 00:00:00·2026/05/19 10:23:00

AI Vision Models · 9 min read

Pose detection · Anomaly detection · Graph embedding · Paper notes

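Papers with Code leaderboard: HR-ShanghaiTech Benchmark (Video Anomaly Detection)
- Currently first on the leaderboard: TrajREC, 77.9
- Currently second on the leaderboard: MoCoDAD, 77.6
- github tianyu0207/RTFM, on surveillance video (how does that differ from ordinary video?): 98.14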

Questions

arxiv 2404.19760

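How are anomalous samples distinguished from normal ones?
1. See google: anomaly detection
How is the temporal structure of the video data used?
1. How are videos of different lengths handled?
  1. Split them into fixed-length, overlapping clips
How is the GCN used?
1. How is node position information preserved so that quantities such as angles can be recovered?
2. How is information such as motion speed obtained?
How is the clustering done?
1. K-means initializes the clusters
Which loss functions are used?
1. Pre-training uses a reconstruction loss; fine-tuning adds a clustering loss. The two are combined linearly.
Is the number of clusters fixed in advance?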

Methods

Innovation

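Map the images into a latent space, then cluster
Each action is represented by its soft-assignment to every cluster: a bag of words.
- Contrastive learning? Self-supervised learning?
Dirichlet process based mixture
Proposes coarse-grained anomaly detection
What does "layer" refer to here?
How is each of the three adjacency matrices obtained?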

advantage

wiki.zh Anomaly detection: this algorithm can handle both kinds of problem in the anomaly detection field

History

How did this field develop?

google Reconstructive model

Uses reconstruction and predictive branches to reconstruct past poses and predict future poses, and computes the anomaly score from them.

wiki.en GCN

This paper uses github yysijie/st-gcn to compute the temporal and the spatial graph convolutions separately.

Deep Clustering Models

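Deep clustering methods aim to provide useful cluster assignments by optimizing a deep model under a cluster-inducing objective.

Deep Embedded Clustering (DEC) proposes a two-step method. In the first step, a target distribution is computed from the current cluster assignments. In the next step, the model is optimized to produce cluster assignments similar to that target distribution. Recent extensions address DEC's susceptibility to degenerate solutions with regularization methods and various post-processing techniques.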

model architecture

Pasted image 20240421093833.png

key points

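Human pose estimation represents human actions as a compact graph
- Removes the influence of irrelevant features such as lighting
- Extracts the action keypoints of every person in every video frame
A google Autoencoder, specifically a spatio-temporal graph autoencoder, together with a clustering branch maps all training data into a latent space, where every sample is soft-clustered
- Each sample is represented by the clusters it belongs to
- A bag-of-words of actions
- each cluster is an action-word
For every sample and every cluster, compute the probability that the former belongs to the latter (could this be replaced by computing an energy?) JEPA D-questions
Fit a model for classification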

feature extraction

Pasted image 20240421101612.png

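Spatial attention
Temporal attention
Takes a spatio-temporal graph as input and outputs an embedding, which serves as the starting point for the clustering branch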

Spatial Attention Graph Convolution

Pasted image 20240421102121.png

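Three google GCN operators process three adjacency matrices.

A: hard-coded physical adjacency matrix encoding body-part connectivity. Fixed (so backpropagation does not change its parameters?) and identical across all layers. For example, the arm connects to the shoulder, and this holds for every body.
B: global adjacency matrix. Keypoint relations at the dataset level. Each layer learns its own. Identical for all samples in the forward pass. Presumably this captures information shared by all samples?
C: inferred, based on an attention mechanism. Sample-specific relations. Different for every sample in a batch. Attends to the global information of each individual sample, for example the relation between the arm and thigh nodes of the same person in one image?
The outputs of the three are stacked in the channel dimension.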

For example, for a batch of size N of graphs with V nodes, the inferred adjacency size is $$[N, V, V]$$, while other adjacencies are $$[V, V]$$ matrices.

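B: initialized as a fully connected layer that holds an adjacency matrix; because the number of nodes is limited, the computation does not become too expensive. C: a graph self-attention layer using a multiplicative attention mechanism. The attention mechanism can be swapped for another module.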
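A minimal NumPy sketch of how the three adjacency types could be combined for a single frame. Everything beyond the shapes and the channel stacking is an assumption of this sketch, not taken from the paper: the scaled dot-product form of the multiplicative attention, the softmax normalization, the weight names (`w_a`, `w_b`, `w_c`, `w_q`, `w_k`), the toy chain skeleton, and the omission of the temporal dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along one axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inferred_adjacency(x, w_q, w_k):
    """Multiplicative attention over nodes: x [N, V, C] -> adjacency [N, V, V]."""
    q = x @ w_q                      # [N, V, D]
    k = x @ w_k                      # [N, V, D]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)  # one adjacency per sample

def sagc_block(x, adj_a, adj_b, w_a, w_b, w_c, w_q, w_k):
    """Apply one graph convolution per adjacency and stack along channels."""
    adj_c = inferred_adjacency(x, w_q, w_k)
    out_a = adj_a @ x @ w_a          # fixed [V, V], broadcast over the batch
    out_b = adj_b @ x @ w_b          # learned [V, V], shared by all samples
    out_c = adj_c @ x @ w_c          # [N, V, V], different for every sample
    return np.concatenate([out_a, out_b, out_c], axis=-1)

rng = np.random.default_rng(0)
N, V, C, D, C_OUT = 2, 5, 3, 4, 6
x = rng.normal(size=(N, V, C))
# Hypothetical skeleton: a chain of 5 joints with self-loops.
adj_a = np.eye(V) + np.eye(V, k=1) + np.eye(V, k=-1)
adj_b = rng.normal(size=(V, V))      # stands in for the learned global matrix
w_a, w_b, w_c = (rng.normal(size=(C, C_OUT)) for _ in range(3))
w_q, w_k = rng.normal(size=(C, D)), rng.normal(size=(C, D))

adj_c = inferred_adjacency(x, w_q, w_k)
out = sagc_block(x, adj_a, adj_b, w_a, w_b, w_c, w_q, w_k)
```

Here `adj_c` has shape `[N, V, V]` while `adj_a` and `adj_b` are `[V, V]`, matching the sizes quoted above, and `out` carries three times the per-branch channel count.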

Deep Embedding Clustering

encoder
decoder
soft-cluster layer
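The encoder keeps the graph structure but compresses the input sequence into a latent vector using a large temporal stride and an increasing channel number. The decoder uses temporal up-sampling and additional graph convolutional blocks to gradually restore the original channel number and temporal dimension.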

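The output of the ST-GCAE is the starting point for clustering. The input is x, the latent vector produced by the embedding is z, and the result after clustering is y.

Compute the probability that a sample belongs to each cluster.

Compute the KL divergence between the current distribution and the target distribution. GAN D-math

The probabilities should be pushed toward 0 or 1.

Expectation maximization.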
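The steps above can be sketched with the formulation from the original DEC work: a Student's t kernel for the soft assignment and a squared, frequency-normalized target distribution. This is an assumption of the sketch; the paper's clustering layer may use a different assignment function. The toy embeddings and centroids are made up for illustration.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """q[i, k]: probability that embedding i belongs to cluster k (Student's t kernel)."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpen q: square it, divide by the soft cluster frequency, renormalize rows."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q, eps=1e-12):
    """KL(P || Q) summed over all samples and clusters."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy latent vectors z and two cluster centroids (illustrative values only).
z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

q = soft_assign(z, centroids)   # current distribution
p = target_distribution(q)      # target distribution, closer to 0/1
loss = kl_clustering_loss(p, q)
```

Minimizing this loss with respect to the encoder and the centroids pulls `q` toward the sharper `p`, which is the "push toward 0 or 1" effect noted above; recomputing `p` from the updated `q` gives the alternating, EM-like loop.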

Dirichlet Process Mixture Model(DPMM)

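Dirichlet process: two kinds of multimodal distribution:

Cluster assignment level: one action can be assigned to several clusters, which gives a multimodal soft-assignment vector.
soft-assignment level
Estimation (fitting) phase, during which a set of distribution parameters is evaluated,
Inference phase, in which the fitted model provides a score for every embedded sample.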
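The two phases can be illustrated with scikit-learn's `BayesianGaussianMixture`, which offers a truncated Dirichlet-process prior over mixture weights. This is a stand-in, not the authors' implementation; the 2-D toy vectors, the truncation level `n_components=5`, and the use of the log-likelihood as the normality score are all assumptions of this sketch.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Toy "embedded samples": two modes of normal behaviour (illustrative data only).
normal = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=(150, 2)),
    rng.normal(loc=6.0, scale=1.0, size=(150, 2)),
])
anomaly = np.array([[20.0, -20.0]])  # far from both modes

# Estimation (fitting) phase: the number of active components is inferred,
# n_components is only an upper bound on it.
dpmm = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
dpmm.fit(normal)

# Inference phase: per-sample log-likelihood under the fitted mixture.
# Lower values mean the sample is less like the training data.
normal_scores = dpmm.score_samples(normal)
anomaly_score = dpmm.score_samples(anomaly)[0]
```

Only normal samples are used for fitting, so a low score at inference time is read as evidence of an anomaly.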

Dirichlet Process Mixture Model (DPMM) | Datalearner official website

Xu Yida Machine Learning: Dirichlet Process (2015 edition, complete series), bilibili

训练

Two-stage training

pre-training for Autoencoder: clustering branch remains unchanged
fine-tuning stage: both the embedding and the clustering are optimized.
1. K-means initializes the clusters
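A minimal sketch of the K-means initialization step, run on the latent vectors produced by the pre-trained encoder. The deterministic farthest-first seeding, the iteration count, and the toy data are assumptions made so the example is reproducible; the notes only state that K-means is used.

```python
import numpy as np

def kmeans_init(z, k, n_iter=20):
    """Return (centroids, labels) from Lloyd's algorithm with farthest-first seeding."""
    # Seeding: start from the first point, then repeatedly take the point
    # farthest from all centroids chosen so far.
    centroids = [z[0]]
    for _ in range(1, k):
        d2 = np.min([((z - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(z[int(np.argmax(d2))])
    centroids = np.stack(centroids).astype(float)

    for _ in range(n_iter):
        # Assignment step: nearest centroid for every embedding.
        d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for j in range(k):
            members = z[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(0)
# Toy latent vectors forming two tight groups (illustrative only).
z = np.concatenate([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(10.0, 0.1, size=(20, 2)),
])
centroids, labels = kmeans_init(z, k=2)
```

The resulting `centroids` would serve as the starting cluster parameters for the fine-tuning stage.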

hyper-parameter

loss

$$L_{combined} = L_{rec} + \lambda \cdot L_{cluster}$$
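A small sketch of how the combined loss fits the two training stages described in these notes: reconstruction only during pre-training, and the weighted sum during fine-tuning. The function name, the `stage` strings, and the example values are hypothetical.

```python
def combined_loss(l_rec: float, l_cluster: float, lam: float, stage: str) -> float:
    """Loss for one training step, depending on the training stage."""
    if stage == "pretrain":
        # Clustering branch is untouched: reconstruction loss only.
        return l_rec
    if stage == "finetune":
        # L_combined = L_rec + lambda * L_cluster
        return l_rec + lam * l_cluster
    raise ValueError(f"unknown stage: {stage!r}")

pretrain_value = combined_loss(1.0, 2.0, lam=0.5, stage="pretrain")
finetune_value = combined_loss(1.0, 2.0, lam=0.5, stage="finetune")
```

The weight `lam` controls how strongly the clustering objective is allowed to reshape the embedding relative to reconstruction.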

volume and speed

experiment

fine-grained ShanghaiTech dataset

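training: normal examples only
test: normal and abnormal examples; ROC
Long videos are converted into multiple equal-length, overlapping clips with a sliding window.
When one frame contains several people, each person is scored separately.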
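The sliding-window split can be sketched as follows. The clip length and stride values are placeholders, not the paper's settings, and dropping a trailing remainder shorter than one clip is an assumption of this sketch.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def split_into_clips(frames: Sequence[T], clip_len: int, stride: int) -> List[Sequence[T]]:
    """Cut a per-frame pose sequence into fixed-length clips that overlap when stride < clip_len."""
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    # Start a clip every `stride` frames; stop when a full clip no longer fits.
    last_start = len(frames) - clip_len
    return [frames[s:s + clip_len] for s in range(0, last_start + 1, stride)]

# Ten frames stand in for one person's pose sequence.
clips = split_into_clips(list(range(10)), clip_len=4, stride=2)
# clips == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With several people in a video, the same function would be applied to each person's sequence separately, so that every person gets their own clips and their own scores.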

coarse-grained anomaly detection

random split
- split sample and non-split sample. No labels.
  - low vs many
  - many vs low
meaningful split

ablation study

Pasted image 20240421150412.png Add noisy data and see how much it affects the model's performance.

Efficiency and effectiveness of the model

Pasted image 20240421144457.png

advantage and disadvantage of the model

Pasted image 20240421145642.png Hard cases

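When several people overlap, each person is only partially detected. As a result, the same person's score varies widely from frame to frame. The anomalous element (a person riding a skateboard) is also frequently occluded.
Movement that is too fast blurs the person, so no pose can be extracted and no analysis can be done.
Occlusion by vehicles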