Paper Title
An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention
Paper Authors
Paper Abstract
This study proposes an improved end-to-end multi-target tracking algorithm that adapts to multi-view, multi-scale scenes based on the self-attention mechanism of the transformer's encoder-decoder structure. A multi-dimensional feature extraction backbone network is combined with a self-built semantic raster map, which is stored in the encoder for correlation and used to generate target position encodings and multi-dimensional feature vectors. The decoder incorporates four methods: spatial clustering and semantic filtering of multi-view targets, dynamic matching of multi-dimensional features, space-time logic-based multi-target tracking, and space-time convergence network (STCN)-based parameter passing. Through the fusion of multiple decoding methods, multi-camera targets are tracked in three dimensions: temporal logic, spatial logic, and feature matching. On the MOT17 dataset, this study's method significantly outperforms the current state-of-the-art method MiniTrackV2 [49], improving the Multiple Object Tracking Accuracy (MOTA) by 2.2% to 0.836. Furthermore, this study proposes a retrospective mechanism for the first time and adopts a reverse-order processing method to correct historically mislabeled targets, improving the Identification F1-score (IDF1). On the self-built dataset OVIT-MOT01, the IDF1 improves from 0.948 to 0.967 and the Multi-camera Tracking Accuracy (MCTA) improves from 0.878 to 0.909, significantly improving continuous tracking accuracy and scene adaptability. This method introduces a new attention-based tracking paradigm that achieves state-of-the-art performance on multi-target tracking tasks (MOT17 and OVIT-MOT01).
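For reference, the MOTA and IDF1 figures quoted above follow the standard multi-object tracking metric definitions:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
```

where FN_t, FP_t, and IDSW_t are the false negatives, false positives, and identity switches at frame t, GT_t is the number of ground-truth objects in frame t, and IDTP, IDFP, and IDFN are the identity-level true positives, false positives, and false negatives.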
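The abstract does not detail the retrospective mechanism itself. The following is a minimal sketch of one plausible reading, in which a reverse-order pass propagates identity corrections discovered late in a sequence back onto earlier, mislabeled frames; the names `retrospective_relabel` and `merge_map` are illustrative assumptions, not the paper's actual interface.

```python
# A minimal sketch of a retrospective, reverse-order relabeling pass.
# All names here (retrospective_relabel, merge_map) are illustrative
# assumptions about the paper's mechanism, not its actual API.

def resolve(tid, merge_map):
    """Follow merge chains such as 7 -> 4 -> 2 to the final ID."""
    seen = set()
    while tid in merge_map and tid not in seen:
        seen.add(tid)
        tid = merge_map[tid]
    return tid

def retrospective_relabel(frames, merge_map):
    """frames: list of {track_id: detection} dicts in temporal order.
    merge_map: {old_id: new_id} corrections discovered late in the
    sequence, e.g. when a re-identified target exposes an earlier
    ID switch. Frames are rewritten latest-first, so the more
    reliable late labels overwrite the historical mistakes."""
    for frame in reversed(frames):
        for tid in list(frame):  # copy keys; the dict is mutated below
            final = resolve(tid, merge_map)
            if final != tid:
                frame[final] = frame.pop(tid)
    return frames

# Example: ID 7 was later found to be the same target as ID 2.
if __name__ == "__main__":
    frames = [{7: "det_a"}, {7: "det_b"}, {2: "det_c"}]
    print(retrospective_relabel(frames, {7: 2}))
    # -> [{2: 'det_a'}, {2: 'det_b'}, {2: 'det_c'}]
```

Under this reading, the reverse-order pass is what raises IDF1: identity fragments that the forward pass labeled inconsistently are merged once later evidence identifies them as the same target.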