Paper title
PatentMatch: A Dataset for Matching Patent Claims & Prior Art
Paper authors
Paper abstract
Patent examiners need to solve a complex information retrieval task when they assess the novelty and inventive step of claims made in a patent application. Given a claim, they search for prior art, which comprises all relevant publicly available information. This time-consuming task requires a deep understanding of the respective technical domain and of patent-domain-specific language. For these reasons, we address the computer-assisted search for prior art by creating a training dataset for supervised machine learning called PatentMatch. It contains pairs of claims from patent applications and text passages from cited patent documents with different degrees of semantic correspondence. Each pair has been labeled by technically skilled patent examiners from the European Patent Office. Accordingly, the label indicates the degree of semantic correspondence (matching), i.e., whether or not the text passage is prejudicial to the novelty of the claimed invention. Preliminary experiments using a baseline system show that PatentMatch can indeed be used for training a binary text pair classifier on this challenging information retrieval task. The dataset is available online: https://hpi.de/naumann/s/patentmatch.