Paper Title
DePlot: One-shot visual language reasoning by plot-to-table translation
Paper Authors
Paper Abstract
Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples, and their reasoning capabilities remain quite limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key component in this method is a modality conversion module, named DePlot, which translates the image of a plot or chart into a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over the finetuned SOTA on human-written queries from the chart QA task.
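The abstract describes a two-step, plug-and-play pipeline: DePlot converts a chart image into a linearized text table, and that table is placed into a one-shot prompt for an off-the-shelf LLM. Below is a minimal sketch of this pipeline, assuming the publicly released `google/deplot` checkpoint available through Hugging Face Transformers; the image path, exemplar table, and question are placeholders for illustration, and the final LLM call is omitted since any few-shot-capable LLM can consume the prompt.

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Step 1: plot-to-table translation with DePlot.
processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image = Image.open("chart.png")  # placeholder: any chart or plot image
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
table_ids = model.generate(**inputs, max_new_tokens=512)
linearized_table = processor.decode(table_ids[0], skip_special_tokens=True)

# Step 2: reasoning over the translated text. Build a one-shot prompt
# pairing a (hypothetical) exemplar with the table DePlot just produced.
exemplar = (
    "Table:\nYear | Sales\n2020 | 10\n2021 | 15\n"
    "Question: How much did sales grow from 2020 to 2021?\n"
    "Answer: 5\n\n"
)
question = "Which category has the highest value?"  # placeholder query
prompt = (
    "Read the table and answer the question.\n\n"
    + exemplar
    + f"Table:\n{linearized_table}\nQuestion: {question}\nAnswer:"
)
# `prompt` can now be sent to any pretrained LLM for one-shot reasoning.
```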