在编辑距离下更快的模式匹配

论文标题

在编辑距离下更快的模式匹配

Faster Pattern Matching under Edit Distance

论文作者

Charalampopoulos, Panagiotis, Kociumaka, Tomasz, Wellnitz, Philip

论文摘要

我们考虑在编辑距离下的近似模式匹配问题。鉴于文本$ t $ t $ length $ n $，图案$ p $的长度$ m $和一个阈值$ k $，其任务是找到所有可以将$ t $的子字符串的起始位置与最多可将$ k $ edits转换为$ p $。 20多年前，Cole和Hariharan [Soda'98，J。Comput.'02]给出了$ \ Mathcal {O}（N+K^4 \ CDOT N/ M）$ - 用于此经典问题的时间算法，此后此运行时没有得到改善。在这里，我们提出了一种算法，该算法在时间上运行$ \ Mathcal {o}（n+k^{3.5} \ sqrt {\ log m \ log k} \ cdot n/m）$，从而突破了这个长期存在的障碍。在某些情况下$ \ MATHCAL {O}（kn）$ - Landau和Vishkin的时间算法[Stoc'86，J。算法89]。我们观察到，替代性$ \ Mathcal {O}（N+K^4 \ CDOT N/M）$ - CHARALAMPOPOPOULOS，KOCIUMAKA和WELLNITZ [focs'20]的瓶颈（n+k^4 \ cdot n/m）$ - 时间算法是（几乎是（几乎））。我们的新算法将这种情况减少到了一个新的动态问题（动态拼图匹配），我们通过基于Tiskin [Soda'10，Algorithmica'15]开发的工具来解决该问题，以解决所谓的置换矩阵的海藻。我们的算法仅依赖于字符串上的一系列原始操作，因此也适用于完全压缩的设置（在其中给出文本和图案作为直线程序）和动态设置（在其中维护创建，拆分和置置的字符串集合），改善艺术状态。

We consider the approximate pattern matching problem under the edit distance. Given a text $T$ of length $n$, a pattern $P$ of length $m$, and a threshold $k$, the task is to find the starting positions of all substrings of $T$ that can be transformed to $P$ with at most $k$ edits. More than 20 years ago, Cole and Hariharan [SODA'98, J. Comput.'02] gave an $\mathcal{O}(n+k^4 \cdot n/ m)$-time algorithm for this classic problem, and this runtime has not been improved since. Here, we present an algorithm that runs in time $\mathcal{O}(n+k^{3.5} \sqrt{\log m \log k} \cdot n/m)$, thus breaking through this long-standing barrier. In the case where $n^{1/4+\varepsilon} \leq k \leq n^{2/5-\varepsilon}$ for some arbitrarily small positive constant $\varepsilon$, our algorithm improves over the state-of-the-art by polynomial factors: it is polynomially faster than both the algorithm of Cole and Hariharan and the classic $\mathcal{O}(kn)$-time algorithm of Landau and Vishkin [STOC'86, J. Algorithms'89]. We observe that the bottleneck case of the alternative $\mathcal{O}(n+k^4 \cdot n/m)$-time algorithm of Charalampopoulos, Kociumaka, and Wellnitz [FOCS'20] is when the text and the pattern are (almost) periodic. Our new algorithm reduces this case to a new dynamic problem (Dynamic Puzzle Matching), which we solve by building on tools developed by Tiskin [SODA'10, Algorithmica'15] for the so-called seaweed monoid of permutation matrices. Our algorithm relies only on a small set of primitive operations on strings and thus also applies to the fully-compressed setting (where text and pattern are given as straight-line programs) and to the dynamic setting (where we maintain a collection of strings under creation, splitting, and concatenation), improving over the state of the art.

下载PDF全文

下载文献需遵守相关版权规定

论文标题