Paper Title
Can Deep Learning Recognize Subtle Human Activities?
Paper Authors
Paper Abstract
Deep Learning has driven recent and exciting progress in computer vision, instilling the belief that these algorithms could solve any visual task. Yet, datasets commonly used to train and test computer vision algorithms have pervasive confounding factors. Such biases make it difficult to truly estimate the performance of those algorithms and how well computer vision models can extrapolate outside the distribution in which they were trained. In this work, we propose a new action classification challenge that is performed well by humans, but poorly by state-of-the-art Deep Learning models. As a proof-of-principle, we consider three exemplary tasks: drinking, reading, and sitting. The best accuracies reached using state-of-the-art computer vision models were 61.7%, 62.8%, and 76.8%, respectively, while human participants scored above 90% accuracy on the three tasks. We propose a rigorous method to reduce confounds when creating datasets, and when comparing human versus computer vision performance. Source code and datasets are publicly available.
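To make the reported comparison concrete, below is a minimal sketch (not the authors' released code) of how per-task classification accuracy might be computed for the three binary action tasks and contrasted with the reported >90% human baseline. The prediction and label arrays are hypothetical placeholders, not data from the paper.

```python
# Minimal sketch: per-task accuracy for three binary action-classification tasks
# (drinking, reading, sitting) compared against a human baseline.
# All predictions/labels below are hypothetical placeholders.

from typing import Dict, List


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of examples where the prediction matches the label."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical model outputs and ground truth (1 = action present, 0 = absent).
results: Dict[str, Dict[str, List[int]]] = {
    "drinking": {"pred": [1, 0, 1, 1, 0, 0], "true": [1, 0, 0, 1, 1, 0]},
    "reading":  {"pred": [0, 0, 1, 1, 1, 0], "true": [0, 1, 1, 1, 0, 0]},
    "sitting":  {"pred": [1, 1, 0, 0, 1, 0], "true": [1, 1, 0, 1, 1, 0]},
}

HUMAN_BASELINE = 0.90  # the paper reports human accuracy above 90% on all three tasks

for task, r in results.items():
    acc = accuracy(r["pred"], r["true"])
    gap = HUMAN_BASELINE - acc
    print(f"{task:>8}: model accuracy = {acc:.1%}, gap to human baseline = {gap:+.1%}")
```

This only illustrates the accuracy comparison itself; the paper's contribution lies in constructing the datasets so that such comparisons are not driven by confounding factors.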