Bag of Tricks for Long-Tail Visual Recognition of Animal Species in Camera-Trap Images

Paper title

Bag of Tricks for Long-Tail Visual Recognition of Animal Species in Camera-Trap Images

Authors

Fagner Cunha, Eulanda M. dos Santos, Juan G. Colonna

Abstract

Camera traps are a method for monitoring wildlife and they collect a large number of pictures. The number of images collected of each species usually follows a long-tail distribution, i.e., a few classes have a large number of instances, while a lot of species have just a small percentage. Although in most cases these rare species are the ones of interest to ecologists, they are often neglected when using deep-learning models because these models require a large number of images for the training. In this work, a simple and effective framework called Square-Root Sampling Branch (SSB) is proposed, which combines two classification branches that are trained using square-root sampling and instance sampling to improve long-tail visual recognition, and this is compared to state-of-the-art methods for handling this task: square-root sampling, class-balanced focal loss, and balanced group softmax. To achieve a more general conclusion, the methods for handling long-tail visual recognition were systematically evaluated in four families of computer vision models (ResNet, MobileNetV3, EfficientNetV2, and Swin Transformer) and four camera-trap datasets with different characteristics. Initially, a robust baseline with the most recent training tricks was prepared and, then, the methods for improving long-tail recognition were applied. Our experiments show that square-root sampling was the method that most improved the performance for minority classes by around 15%; however, this was at the cost of reducing the majority classes' accuracy by at least 3%. Our proposed framework (SSB) demonstrated itself to be competitive with the other methods and achieved the best or the second-best results for most of the cases for the tail classes; but, unlike the square-root sampling, the loss in the performance of the head classes was minimal, thus achieving the best trade-off among all the evaluated methods.
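The abstract names square-root sampling and instance sampling without defining them. As a minimal sketch, assuming the common formulation in which class j with n_j training images is sampled with probability proportional to n_j^q (q = 1 for instance sampling, q = 0.5 for square-root sampling), the two strategies can be compared as follows. The function name and the image counts are hypothetical and chosen for illustration; they do not come from the paper.

```python
def sampling_probabilities(class_counts, q):
    """Return per-class sampling probabilities proportional to n_j ** q.

    q = 1.0 corresponds to instance sampling (each image equally likely);
    q = 0.5 corresponds to square-root sampling.
    """
    weights = [n ** q for n in class_counts]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical image counts for one head class and two tail classes
# (illustrative only, not taken from the paper's datasets).
counts = [10000, 400, 25]

instance_probs = sampling_probabilities(counts, q=1.0)
sqrt_probs = sampling_probabilities(counts, q=0.5)

for n, p_inst, p_sqrt in zip(counts, instance_probs, sqrt_probs):
    print(f"n={n:>6}  instance={p_inst:.4f}  square-root={p_sqrt:.4f}")
```

With these counts, the rarest class is drawn far more often under square-root sampling than under instance sampling, while the head class is drawn less often. This matches the trade-off the abstract reports between tail-class gains and head-class losses.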

Paper title