端到端的深度学习直接从地面图像中估算葡萄产量

论文标题

端到端的深度学习直接从地面图像中估算葡萄产量

End-to-end deep learning for directly estimating grape yield from ground-based imagery

论文作者

Olenskyj, Alexander G., Sams, Brent S., Fei, Zhenghao, Singh, Vishal, Raja, Pranav V., Bornhorst, Gail M., Earles, J. Mason

论文摘要

产量估计是葡萄园管理中的强大工具，因为它允许种植者微调实践以优化产量和质量。但是，目前使用手动抽样进行估计，这是耗时和不精确的。这项研究表明，近端成像的应用与深度学习相结合，以供葡萄园中的产量估计。使用车辆安装的传感套件进行连续数据收集，并使用商业收益率监控器结合了收集的地面真相收益数据，可以生成23,581个收益点和107,933张图像的大型数据集。此外，这项研究是在机械管理的商业葡萄园中进行的，代表了一个充满挑战的图像分析环境，但在加利福尼亚中央山谷中的一组常见条件。测试了三个模型架构：对象检测，CNN回归和变压器模型。对象检测模型在手工标记的图像上进行了训练以定位葡萄束，并将束数量或像素区域求和以与葡萄产量相关。相反，回归模型端到端训练，以预测图像数据中的葡萄产量，而无需手动标记。结果表明，在代表性的保留数据集上，具有相当的绝对百分比误差为18％和18.5％的变压器以及具有像素区域处理的对象检测模型。使用显着映射来证明CNN模型的注意力位于葡萄束的预测位置附近以及葡萄树冠的顶部。总体而言，这项研究表明，近端成像和深度学习对于大规模预测葡萄产量的适用性。此外，端到端的建模方法能够与对象检测方法相当地执行，同时消除了对手标记的需求。

Yield estimation is a powerful tool in vineyard management, as it allows growers to fine-tune practices to optimize yield and quality. However, yield estimation is currently performed using manual sampling, which is time-consuming and imprecise. This study demonstrates the application of proximal imaging combined with deep learning for yield estimation in vineyards. Continuous data collection using a vehicle-mounted sensing kit combined with collection of ground truth yield data at harvest using a commercial yield monitor allowed for the generation of a large dataset of 23,581 yield points and 107,933 images. Moreover, this study was conducted in a mechanically managed commercial vineyard, representing a challenging environment for image analysis but a common set of conditions in the California Central Valley. Three model architectures were tested: object detection, CNN regression, and transformer models. The object detection model was trained on hand-labeled images to localize grape bunches, and either bunch count or pixel area was summed to correlate with grape yield. Conversely, regression models were trained end-to-end to predict grape yield from image data without the need for hand labeling. Results demonstrated that both a transformer as well as the object detection model with pixel area processing performed comparably, with a mean absolute percent error of 18% and 18.5%, respectively on a representative holdout dataset. Saliency mapping was used to demonstrate the attention of the CNN model was localized near the predicted location of grape bunches, as well as on the top of the grapevine canopy. Overall, the study showed the applicability of proximal imaging and deep learning for prediction of grapevine yield on a large scale. Additionally, the end-to-end modeling approach was able to perform comparably to the object detection approach while eliminating the need for hand-labeling.

下载PDF全文

下载文献需遵守相关版权规定

论文标题