Paper Title
Learning Local and Global Temporal Contexts for Video Semantic Segmentation
Paper Authors
Paper Abstract
Contextual information plays a core role in video semantic segmentation (VSS). This paper categorizes the contexts for VSS into two types: local temporal contexts (LTC), which define the contexts from neighboring frames, and global temporal contexts (GTC), which represent the contexts from the whole video. LTC comprises static and motional contexts, corresponding to the static and moving content in neighboring frames, respectively. Static and motional contexts have each been studied before, but no prior work learns them simultaneously, even though they are highly complementary. Hence, we propose a Coarse-to-Fine Feature Mining (CFFM) technique to learn a unified representation of LTC. CFFM contains two parts: Coarse-to-Fine Feature Assembling (CFFA) and Cross-frame Feature Mining (CFM). CFFA abstracts the static and motional contexts, and CFM mines useful information from nearby frames to enhance the target features. To exploit further temporal contexts, we propose CFFM++, which additionally learns GTC from the whole video. Specifically, we uniformly sample frames from the video and extract global contextual prototypes via k-means; the information within those prototypes is then mined by CFM to refine the target features. Experimental results on popular benchmarks demonstrate that CFFM and CFFM++ perform favorably against state-of-the-art methods. Our code is available at https://github.com/GuoleiSun/VSS-CFFM
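The two mining steps described in the abstract can be illustrated with short sketches. First, a minimal sketch of cross-frame feature mining (CFM), assuming it is realized as standard cross-attention in PyTorch: target-frame tokens act as queries over assembled context tokens. The class name, tensor shapes, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's actual implementation (see https://github.com/GuoleiSun/VSS-CFFM for the official code).

```python
import torch
import torch.nn as nn

class CrossFrameFeatureMining(nn.Module):
    """Hypothetical CFM sketch: refine target-frame features by
    attending to context tokens (from nearby frames for LTC, or
    global prototypes for GTC)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # target:  (B, N_t, C) tokens of the frame being segmented
        # context: (B, N_c, C) assembled context tokens
        mined, _ = self.attn(query=target, key=context, value=context)
        return self.norm(target + mined)  # residual refinement of target features
```

Second, a sketch of the GTC step under the same caveat: uniformly sample frames, encode them, and cluster the resulting token features with k-means to obtain global contextual prototypes. The helper names (`encoder`, `num_prototypes`) and the use of scikit-learn's `KMeans` are assumptions for illustration.

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def extract_global_prototypes(video: torch.Tensor, encoder,
                              num_samples: int = 8,
                              num_prototypes: int = 32) -> torch.Tensor:
    # video: (T, 3, H, W); uniformly pick `num_samples` frame indices
    idx = torch.linspace(0, video.shape[0] - 1, num_samples).long()
    feats = encoder(video[idx])                # assumed output: (num_samples, C, h, w)
    tokens = feats.flatten(2).transpose(1, 2)  # (num_samples, h*w, C)
    tokens = tokens.reshape(-1, tokens.shape[-1]).cpu().numpy()
    km = KMeans(n_clusters=num_prototypes, n_init=10).fit(tokens)
    # Cluster centers serve as global contextual prototypes, shape (1, K, C)
    return torch.from_numpy(km.cluster_centers_).float().unsqueeze(0)
```

In this reading, the prototypes would be passed to CFM as the `context` tensor, so target-frame features attend to video-level semantics rather than only to neighboring frames.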