Paper Title


CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation

Authors

Zicheng Zhang, Yi Zhu, Jianzhuang Liu, Xiaodan Liang, Wei Ke

Abstract


Referring image segmentation aims at localizing all pixels of the visual objects described by a natural language sentence. Previous works learn to straightforwardly align the sentence embedding and pixel-level embedding for highlighting the referred objects, but ignore the semantic consistency of pixels within the same object, leading to incomplete masks and localization errors in predictions. To tackle this problem, we propose CoupAlign, a simple yet effective multi-level visual-semantic alignment method, to couple sentence-mask alignment with word-pixel alignment to enforce object mask constraint for achieving more accurate localization and segmentation. Specifically, the Word-Pixel Alignment (WPA) module performs early fusion of linguistic and pixel-level features in intermediate layers of the vision and language encoders. Based on the word-pixel aligned embedding, a set of mask proposals are generated to hypothesize possible objects. Then in the Sentence-Mask Alignment (SMA) module, the masks are weighted by the sentence embedding to localize the referred object, and finally projected back to aggregate the pixels for the target. To further enhance the learning of the two alignment modules, an auxiliary loss is designed to contrast the foreground and background pixels. By hierarchically aligning pixels and masks with linguistic features, our CoupAlign captures the pixel coherence at both visual and semantic levels, thus generating more accurate predictions. Extensive experiments on popular datasets (e.g., RefCOCO and G-Ref) show that our method achieves consistent improvements over state-of-the-art methods, e.g., about 2% oIoU increase on the validation and testing set of RefCOCO. Especially, CoupAlign has remarkable ability in distinguishing the target from multiple objects of the same class.
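To make the two alignment stages in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of how Word-Pixel Alignment and Sentence-Mask Alignment could be wired together: WPA is sketched as cross-attention from pixel features to word features, and SMA as scoring a set of learned mask proposals against the sentence embedding and projecting the weighted masks back onto the pixels. All module names, dimensions, and the number of mask proposals here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class WordPixelAlignment(nn.Module):
    """Sketch of the WPA idea (assumption): early fusion via cross-attention
    from pixel-level features to word-level features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pixel_feats, word_feats):
        # pixel_feats: (B, HW, C), word_feats: (B, L, C)
        fused, _ = self.attn(query=pixel_feats, key=word_feats, value=word_feats)
        return self.norm(pixel_feats + fused)  # residual fusion of language into pixels


class SentenceMaskAlignment(nn.Module):
    """Sketch of the SMA idea (assumption): generate mask proposals from learned
    queries, weight them by similarity to the sentence embedding, and project
    the weighted proposals back to a single pixel-level prediction."""
    def __init__(self, dim: int, num_masks: int = 16):
        super().__init__()
        self.mask_queries = nn.Embedding(num_masks, dim)  # one query per mask proposal

    def forward(self, pixel_feats, sent_embed, hw):
        # pixel_feats: (B, HW, C), sent_embed: (B, C)
        B, HW, C = pixel_feats.shape
        H, W = hw
        queries = self.mask_queries.weight.unsqueeze(0).expand(B, -1, -1)      # (B, N, C)
        proposals = torch.einsum('bnc,bpc->bnp', queries, pixel_feats)          # (B, N, HW) mask logits
        weights = torch.einsum('bnc,bc->bn', queries, sent_embed).softmax(-1)   # (B, N) sentence-mask scores
        fused = torch.einsum('bn,bnp->bp', weights, proposals)                  # aggregate pixels for the target
        return fused.view(B, 1, H, W)


# Toy usage with random tensors (shapes are illustrative).
B, C, H, W, L = 2, 256, 20, 20, 12
pixel_feats = torch.randn(B, H * W, C)
word_feats = torch.randn(B, L, C)
sent_embed = torch.randn(B, C)

wpa = WordPixelAlignment(C)
sma = SentenceMaskAlignment(C)
pred_mask = sma(wpa(pixel_feats, word_feats), sent_embed, (H, W))
print(pred_mask.shape)  # torch.Size([2, 1, 20, 20])
```

In this reading, the sentence-mask weighting acts as the object-level mask constraint described in the abstract, while the word-pixel cross-attention supplies the pixel-level grounding; the auxiliary foreground/background contrastive loss mentioned in the paper is omitted from the sketch.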
