Paper title
OmniPD: One-Step Person Detection in Top-View Omnidirectional Indoor Scenes
Paper authors
Paper abstract
We propose a one-step person detector for top-view omnidirectional indoor scenes based on convolutional neural networks (CNNs). While state-of-the-art person detectors reach competitive results on perspective images, missing CNN architectures as well as missing training data that follows the distortion of omnidirectional images make current approaches not applicable to our data. The method predicts bounding boxes of multiple persons directly in omnidirectional images without perspective transformation, which reduces the overhead of pre- and post-processing and enables real-time performance. The basic idea is to utilize transfer learning to fine-tune CNNs trained on perspective images with data augmentation techniques for detection in omnidirectional images. We fine-tune two variants of Single Shot MultiBox Detectors (SSDs). The first one uses MobileNet v1 FPN as feature extractor (moSSD). The second one uses ResNet50 v1 FPN (resSSD). Both models are pre-trained on the Microsoft Common Objects in Context (COCO) dataset. We fine-tune both models on the PASCAL VOC07 and VOC12 datasets, specifically on the class person. Random 90-degree rotation and random vertical flipping are used for data augmentation in addition to the methods proposed by the original SSD. We reach an average precision (AP) of 67.3 % with moSSD and 74.9 % with resSSD on the evaluation dataset. To enhance the fine-tuning process, we add a subset of the HDA Person dataset and a subset of the PIROPO database and reduce the perspective images to PASCAL VOC07. The AP rises to 83.2 % for moSSD and 86.3 % for resSSD, respectively. The average inference speed is 28 ms per image for moSSD and 38 ms per image for resSSD using an Nvidia Quadro P6000. Our method is applicable to other CNN-based object detectors and can potentially generalize for detecting other objects in omnidirectional images.
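The augmentation the abstract describes (random 90-degree rotation plus random vertical flipping, applied on top of the standard SSD augmentations) could be sketched as follows. This is a minimal illustration with NumPy, not the authors' implementation; the box layout (normalized `[xmin, ymin, xmax, ymax]`) and the helper names are assumptions made for the example.

```python
import random
import numpy as np

def rot90_boxes(boxes):
    # Rotate normalized [xmin, ymin, xmax, ymax] boxes by 90 degrees
    # counter-clockwise, matching np.rot90 on the image:
    # a point (x, y) maps to (y, 1 - x).
    x0, y0, x1, y1 = boxes.T
    return np.stack([y0, 1.0 - x1, y1, 1.0 - x0], axis=1)

def vflip_boxes(boxes):
    # Flip normalized boxes about the horizontal axis (up-down),
    # matching np.flipud on the image: y maps to 1 - y.
    x0, y0, x1, y1 = boxes.T
    return np.stack([x0, 1.0 - y1, x1, 1.0 - y0], axis=1)

def augment(image, boxes, rng=random):
    # Random multiple-of-90-degree rotation plus random vertical flip.
    k = rng.randrange(4)  # 0, 90, 180 or 270 degrees
    image = np.rot90(image, k)
    for _ in range(k):
        boxes = rot90_boxes(boxes)
    if rng.random() < 0.5:
        image = np.flipud(image)
        boxes = vflip_boxes(boxes)
    return image, boxes
```

In a real training pipeline this kind of transform would typically be expressed through the detection framework's own augmentation hooks rather than raw NumPy; the point here is only that the rotations and flips must be applied consistently to both the image and its ground-truth boxes.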