Paper Title

Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation

Paper Authors

Xiangtai Li, Wenwei Zhang, Jiangmiao Pang, Kai Chen, Guangliang Cheng, Yunhai Tong, Chen Change Loy

Paper Abstract

This paper presents Video K-Net, a simple, strong, and unified framework for fully end-to-end video panoptic segmentation. The method is built upon K-Net, a method that unifies image segmentation via a group of learnable kernels. We observe that these learnable kernels from K-Net, which encode object appearances and contexts, can naturally associate identical instances across video frames. Motivated by this observation, Video K-Net learns to simultaneously segment and track "things" and "stuff" in a video with simple kernel-based appearance modeling and cross-temporal kernel interaction. Despite its simplicity, it achieves state-of-the-art video panoptic segmentation results on Cityscapes-VPS, KITTI-STEP, and VIPSeg without bells and whistles. In particular, on KITTI-STEP, this simple method achieves almost 12% relative improvement over previous methods. On VIPSeg, Video K-Net achieves almost 15% relative improvement and reaches 39.8% VPQ. We also validate its generalization on video semantic segmentation, where we boost various baselines by 2% on the VSPW dataset. Moreover, we extend K-Net into a clip-level video framework for video instance segmentation, where we obtain 40.5% mAP with a ResNet50 backbone and 54.1% mAP with a Swin-base backbone on the YouTube-VIS 2019 validation set. We hope this simple yet effective method can serve as a new, flexible baseline in unified video segmentation design. Both code and models are released at https://github.com/lxtGH/Video-K-Net.
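The abstract's central mechanism is that learnable kernel embeddings encoding appearance can be matched across frames to associate instances. The sketch below illustrates that idea only: it is a minimal toy under assumed names and shapes (the function associate_kernels, 256-dimensional embeddings, and cosine-similarity plus Hungarian matching are all illustrative assumptions), not the authors' released implementation; refer to the linked repository for the actual method.

```python
# Minimal sketch (assumptions, not the paper's code): associate instance
# kernels between two frames by matching their embeddings.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def associate_kernels(prev_kernels: torch.Tensor,
                      curr_kernels: torch.Tensor) -> list[tuple[int, int]]:
    """Match current-frame kernels to previous-frame kernels.

    prev_kernels: (N, C) kernel embeddings from frame t-1
    curr_kernels: (M, C) kernel embeddings from frame t
    Returns (prev_idx, curr_idx) pairs, one track per matched kernel.
    """
    # Cosine similarity between every pair of kernel embeddings.
    sim = F.normalize(prev_kernels, dim=1) @ F.normalize(curr_kernels, dim=1).T
    # Hungarian matching maximizes total similarity (minimize negated cost).
    prev_idx, curr_idx = linear_sum_assignment(-sim.detach().cpu().numpy())
    return list(zip(prev_idx.tolist(), curr_idx.tolist()))


# Toy usage: 4 kernels with 256-dim embeddings per frame; frame t is a
# permuted, slightly perturbed copy of frame t-1.
torch.manual_seed(0)
prev = torch.randn(4, 256)
curr = prev[[2, 0, 3, 1]] + 0.05 * torch.randn(4, 256)
print(associate_kernels(prev, curr))  # recovers the permutation as tracks
```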
