论文标题
在编辑距离下更快的模式匹配
Faster Pattern Matching under Edit Distance
论文作者
论文摘要
我们考虑在编辑距离下的近似模式匹配问题。鉴于文本$ t $ t $ length $ n $,图案$ p $的长度$ m $和一个阈值$ k $,其任务是找到所有可以将$ t $的子字符串的起始位置与最多可将$ k $ edits转换为$ p $。 20多年前,Cole和Hariharan [Soda'98,J。Comput.'02]给出了$ \ Mathcal {O}(N+K^4 \ CDOT N/ M)$ - 用于此经典问题的时间算法,此后此运行时没有得到改善。 在这里,我们提出了一种算法,该算法在时间上运行$ \ Mathcal {o}(n+k^{3.5} \ sqrt {\ log m \ log k} \ cdot n/m)$,从而突破了这个长期存在的障碍。在某些情况下$ \ MATHCAL {O}(kn)$ - Landau和Vishkin的时间算法[Stoc'86,J。算法89]。 我们观察到,替代性$ \ Mathcal {O}(N+K^4 \ CDOT N/M)$ - CHARALAMPOPOPOULOS,KOCIUMAKA和WELLNITZ [focs'20]的瓶颈(n+k^4 \ cdot n/m)$ - 时间算法是(几乎是(几乎))。我们的新算法将这种情况减少到了一个新的动态问题(动态拼图匹配),我们通过基于Tiskin [Soda'10,Algorithmica'15]开发的工具来解决该问题,以解决所谓的置换矩阵的海藻。我们的算法仅依赖于字符串上的一系列原始操作,因此也适用于完全压缩的设置(在其中给出文本和图案作为直线程序)和动态设置(在其中维护创建,拆分和置置的字符串集合),改善艺术状态。
We consider the approximate pattern matching problem under the edit distance. Given a text $T$ of length $n$, a pattern $P$ of length $m$, and a threshold $k$, the task is to find the starting positions of all substrings of $T$ that can be transformed to $P$ with at most $k$ edits. More than 20 years ago, Cole and Hariharan [SODA'98, J. Comput.'02] gave an $\mathcal{O}(n+k^4 \cdot n/ m)$-time algorithm for this classic problem, and this runtime has not been improved since. Here, we present an algorithm that runs in time $\mathcal{O}(n+k^{3.5} \sqrt{\log m \log k} \cdot n/m)$, thus breaking through this long-standing barrier. In the case where $n^{1/4+\varepsilon} \leq k \leq n^{2/5-\varepsilon}$ for some arbitrarily small positive constant $\varepsilon$, our algorithm improves over the state-of-the-art by polynomial factors: it is polynomially faster than both the algorithm of Cole and Hariharan and the classic $\mathcal{O}(kn)$-time algorithm of Landau and Vishkin [STOC'86, J. Algorithms'89]. We observe that the bottleneck case of the alternative $\mathcal{O}(n+k^4 \cdot n/m)$-time algorithm of Charalampopoulos, Kociumaka, and Wellnitz [FOCS'20] is when the text and the pattern are (almost) periodic. Our new algorithm reduces this case to a new dynamic problem (Dynamic Puzzle Matching), which we solve by building on tools developed by Tiskin [SODA'10, Algorithmica'15] for the so-called seaweed monoid of permutation matrices. Our algorithm relies only on a small set of primitive operations on strings and thus also applies to the fully-compressed setting (where text and pattern are given as straight-line programs) and to the dynamic setting (where we maintain a collection of strings under creation, splitting, and concatenation), improving over the state of the art.