Paper Title

Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift

Paper Authors

Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li

Paper Abstract


Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating robustness against distribution shifts is crucial before adopting them in real-world applications. In this work, we investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI for MultiModal Impact score and MOR for Missing Object Rate) for proper evaluations of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models. More details can be found on the project webpage: https://MMRobustness.github.io.
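To illustrate the kind of character-level text perturbation the abstract identifies as the most severe distribution shift for text, here is a minimal sketch. This is not the paper's implementation: the specific operations (delete, swap, substitute), the perturbation rate, and the function name are assumptions chosen for illustration; the actual benchmark applies 16 distinct text perturbation techniques.

```python
import random

def char_perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Toy character-level perturbation: randomly delete, swap, or
    substitute alphabetic characters at the given rate.

    Illustrative only; the benchmark's real perturbation suite is larger.
    """
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if chars[i].isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "swap", "substitute"])
            if op == "delete":
                i += 1               # drop this character
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])  # swap adjacent pair
                i += 2
                continue
            # substitute with a random lowercase letter
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
            i += 1
            continue
        out.append(chars[i])
        i += 1
    return "".join(out)

# Perturb a caption-style sentence (hypothetical example input)
original = "a man riding a horse on the beach"
print(char_perturb(original, rate=0.15, seed=1))
```

Evaluating a model on both `original` and its perturbed variant, then comparing the drop in task performance, is the general pattern behind robustness scores such as the paper's MMI metric.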
