Paper Title
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Paper Authors
Paper Abstract
Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on the relevant information in each modality. When a unimodal model achieves accuracy on a VL task similar to that of a multimodal one, this indicates that so-called unimodal collapse has occurred. However, accuracy-based tests fail to detect cases where, e.g., the model prediction is wrong even though the model used relevant information from a modality. Instead, we propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values that reliably quantifies the proportions in which a multimodal model uses individual modalities. We apply MM-SHAP in two ways: (1) to compare models by their average degree of multimodality, and (2) to measure, for individual models, the contribution of individual modalities on different tasks and datasets. Experiments with six VL models -- LXMERT, CLIP and four ALBEF variants -- on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided. Based on our results, we recommend MM-SHAP for analysing multimodal tasks, to diagnose and guide progress towards multimodal integration. Code available at \url{https://github.com/Heidelberg-NLP/MM-SHAP}.
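To make the idea of a modality-proportion score concrete, the following is a minimal sketch (not the authors' implementation) of how such proportions could be aggregated once per-token Shapley values have been estimated for a single model prediction: absolute Shapley values are summed per modality and normalised, yielding a textual share and a visual share. The function name mm_shap, the token ordering (text tokens first, image patches after), and the example values are hypothetical illustrations, not part of the paper.

import numpy as np

def mm_shap(shapley_values, num_text_tokens):
    # shapley_values: 1-D array of per-token Shapley values for one prediction,
    #                 with text tokens first and image patches after (assumed ordering).
    # num_text_tokens: number of leading entries that belong to the text modality.
    phi = np.abs(np.asarray(shapley_values, dtype=float))
    text_contrib = phi[:num_text_tokens].sum()
    image_contrib = phi[num_text_tokens:].sum()
    total = text_contrib + image_contrib
    if total == 0.0:
        return 0.5, 0.5  # degenerate case: no measurable contribution from either modality
    text_share = text_contrib / total
    return text_share, 1.0 - text_share  # (textual share, visual share)

# Hypothetical example: 4 text tokens and 4 image patches.
phi = [0.30, -0.10, 0.05, 0.15, 0.02, -0.01, 0.03, 0.04]
t_share, v_share = mm_shap(phi, num_text_tokens=4)
print(f"text share = {t_share:.2f}, visual share = {v_share:.2f}")

Under this reading, shares near 0.5 for both modalities would indicate balanced use, while a share close to 1.0 for one modality would signal collapse onto that modality, regardless of whether the prediction itself is correct.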