Paper Title

CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

Paper Authors

Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, Ngan Le

Paper Abstract

Video anomaly detection (VAD) -- commonly formulated as a multiple-instance learning problem in a weakly-supervised manner due to its labor-intensive nature -- is a challenging problem in video surveillance, where anomalous frames need to be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations in the novel technique. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and the ViT features. Extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly used benchmark datasets in the VAD problem (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.
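The pipeline the abstract describes (CLIP ViT features per snippet, temporal self-attention over the snippet axis, and a weakly-supervised multiple-instance learning objective on video-level labels) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TemporalSelfAttention` here is plain multi-head self-attention with a scoring head, the top-k MIL loss is a common formulation in the VAD literature, and the hyperparameters are placeholders. See the linked repository for the actual CLIP-TSA code.

```python
# Illustrative sketch only -- NOT the authors' CLIP-TSA implementation.
# Assumes PyTorch and the openai/CLIP package (pip install
# git+https://github.com/openai/CLIP.git); k, dim, and heads are placeholders.
import torch
import torch.nn as nn
import clip


class TemporalSelfAttention(nn.Module):
    """Simplified stand-in for the paper's TSA: standard multi-head
    self-attention over snippets, followed by a per-snippet anomaly score."""

    def __init__(self, dim: int = 512, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (batch, num_snippets, dim) CLIP features per video snippet
        attended, _ = self.attn(snippets, snippets, snippets)
        # Residual connection, then one logit per snippet: (batch, num_snippets)
        return self.score(attended + snippets).squeeze(-1)


@torch.no_grad()
def encode_snippets(frames: torch.Tensor, device: str = "cpu") -> torch.Tensor:
    """Encode CLIP-preprocessed frames, shape (num_snippets, 3, 224, 224),
    with the CLIP ViT image encoder -> (num_snippets, 512) features."""
    model, _ = clip.load("ViT-B/32", device=device)
    return model.encode_image(frames.to(device)).float()


def mil_topk_loss(scores: torch.Tensor, video_labels: torch.Tensor,
                  k: int = 3) -> torch.Tensor:
    """Weakly-supervised MIL objective (a common formulation in VAD, not
    necessarily the paper's exact loss): the mean of the top-k snippet scores
    in each video is trained against the video-level (bag) label."""
    topk = scores.topk(k, dim=1).values.mean(dim=1)  # (batch,)
    return nn.functional.binary_cross_entropy_with_logits(topk, video_labels)
```

As a usage sketch, `encode_snippets` would be run offline over each untrimmed video's snippets, and `TemporalSelfAttention` trained with `mil_topk_loss` using only video-level normal/anomalous labels, so no frame-level annotation is needed.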
