Paper title
Rendezvous in Time: An Attention-based Temporal Fusion approach for Surgical Triplet Recognition
Paper authors
Paper abstract
One of the recent advances in surgical AI is the recognition of surgical activities as triplets of (instrument, verb, target). Although they provide detailed information for computer-assisted intervention, current triplet recognition approaches rely only on single-frame features. Exploiting the temporal cues from earlier frames would improve the recognition of surgical action triplets from videos. In this paper, we propose Rendezvous in Time (RiT), a deep learning model that extends the state-of-the-art model, Rendezvous, with temporal modeling. Focusing more on the verbs, RiT explores the connectedness of current and past frames to learn temporal attention-based features for enhanced triplet recognition. We validate our proposal on the challenging surgical triplet dataset, CholecT45, demonstrating improved recognition of the verb and triplet, along with other interactions involving the verb such as (instrument, verb). Qualitative results show that RiT produces smoother predictions for most triplet instances than state-of-the-art methods. We present a novel attention-based approach that leverages the temporal fusion of video frames to model the evolution of surgical actions and exploit it for surgical triplet recognition.
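The abstract does not spell out how current and past frames are fused, so the following is only a minimal sketch of the general idea of attention-based temporal fusion, not the RiT architecture. It assumes each frame is already reduced to a feature vector, uses the current frame as the query in a plain scaled dot-product attention, and omits any learned projections. The function name `temporal_attention_fusion` and the toy clip are hypothetical.

```python
import math


def temporal_attention_fusion(frame_features):
    """Fuse per-frame feature vectors, using the last (current) frame as the query.

    frame_features: list of equal-length float lists, ordered oldest to current.
    Returns (fused_vector, attention_weights).
    Illustrative assumption: no learned query/key/value projections are applied.
    """
    query = frame_features[-1]
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the current frame and every frame in the clip.
    scores = [
        sum(q * k for q, k in zip(query, feat)) / scale
        for feat in frame_features
    ]
    # Numerically stable softmax over the temporal axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of frame features is the temporally fused representation.
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(len(query))
    ]
    return fused, weights


# Toy clip: two past frames followed by the current frame (hypothetical values).
clip = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
fused, weights = temporal_attention_fusion(clip)
```

In this toy clip the current frame receives the largest weight because it is most similar to itself, while the two past frames contribute equally. A trained model would instead learn which past frames are informative, for example for disambiguating the verb.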