Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection

Paper title

Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection

Authors

Xuanjun Chen, Haibin Wu, Helen Meng, Hung-yi Lee, Jyh-Shing Roger Jang

Abstract

Audio-visual active speaker detection (AVASD) is well developed and is now an indispensable front-end for several multi-modal applications. However, to the best of our knowledge, the adversarial robustness of AVASD models has not been investigated, let alone effective defenses against such attacks. In this paper, we are the first to reveal, through extensive experiments, the vulnerability of AVASD models under audio-only, visual-only, and audio-visual adversarial attacks. Moreover, we propose a novel audio-visual interaction loss (AVIL) that makes it difficult for attackers to find feasible adversarial examples under an allocated attack budget. The loss pushes the inter-class embeddings, namely the non-speech and speech clusters, apart so that they are sufficiently disentangled, and pulls the intra-class embeddings as close together as possible to keep them compact. Experimental results show that AVIL outperforms adversarial training by 33.14 mAP (%) under multi-modal attacks.
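
The push-pull idea in the abstract can be illustrated with a small sketch. The abstract does not give the AVIL formula, so the code below is not the authors' loss: it is a generic, hypothetical center-based push-pull objective, assuming binary labels (0 for non-speech, 1 for speech), squared Euclidean distance for the pull term, and a hinge with a hypothetical `margin` parameter for the push term. The function name `push_pull_loss` and its helpers are illustrative only.

```python
import math


def centroid(vectors):
    """Return the mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def push_pull_loss(embeddings, labels, margin=1.0):
    """Hypothetical center-based push-pull loss (NOT the paper's AVIL).

    labels: 0 = non-speech, 1 = speech (an assumption of this sketch).
    """
    speech = [e for e, y in zip(embeddings, labels) if y == 1]
    non_speech = [e for e, y in zip(embeddings, labels) if y == 0]
    if not speech or not non_speech:
        raise ValueError("need at least one embedding per class")

    c_speech = centroid(speech)
    c_non = centroid(non_speech)

    # Pull term: mean squared distance of each embedding to its own
    # class centroid, which keeps intra-class embeddings compact.
    pull = (
        sum(sq_dist(e, c_speech) for e in speech)
        + sum(sq_dist(e, c_non) for e in non_speech)
    ) / len(embeddings)

    # Push term: hinge penalty when the two class centroids are closer
    # than `margin`, which disperses the inter-class clusters.
    push = max(0.0, margin - math.sqrt(sq_dist(c_speech, c_non)))

    return pull + push


# Compact, well-separated clusters incur no penalty.
separated = push_pull_loss([[0, 0], [0, 0], [3, 4], [3, 4]], [0, 0, 1, 1])
# Fully overlapping clusters are penalized by both terms.
entangled = push_pull_loss([[0, 0], [1, 0], [0, 0], [1, 0]], [0, 1, 1, 0])
```

In this sketch, a larger `margin` demands more separation between the two clusters before the push term vanishes. How the actual AVIL combines audio and visual embeddings is not stated in the abstract and is not modeled here.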
