Paper Title

Detecting Opportunities for Differential Maintenance of Extracted Views

Authors

Besat Kassaie, Frank Wm. Tompa

Abstract

Semi-structured and unstructured data management is challenging, but many of the problems encountered are analogous to problems already addressed in the relational context. In the area of information extraction, for example, the shift from engineering ad hoc, application-specific extraction rules towards using expressive languages such as CPSL and AQL creates opportunities to propose solutions that can be applied to a wide range of extraction programs. In this work, we focus on extracted view maintenance, a problem that is well-motivated and thoroughly addressed in the relational setting. In particular, we formalize and address the problem of keeping extracted relations consistent with source documents that can be arbitrarily updated. We formally characterize three classes of document updates, namely those that are irrelevant, autonomously computable, and pseudo-irrelevant with respect to a given extractor. Finally, we propose algorithms to detect pseudo-irrelevant document updates with respect to extractors that are expressed as document spanners, a model of information extraction inspired by SystemT.
