We assume the complexity of a substring search to be O(mn), where m is the needle length and n is the haystack length. Moreover, we assume that if a match exists at position k < n, then it is found in O(mk). Therefore, str_search needle haystack = k executes in O(mk).
An input search performs the substring search once and advances the input by this many characters. Advancing an input by k characters is done in O(k). Therefore, the input search has a runtime of O(mk) + O(k) = O(mk).
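As a sketch of these assumptions, a naive substring search together with the input advance can be written as follows (function names are illustrative, not the development's actual definitions):

```rust
// A naive substring search: when a match exists at position k, it is
// found after checking at most k + 1 candidate positions of up to m
// characters each, i.e. in O(mk).
fn str_search(needle: &[u8], haystack: &[u8]) -> Option<usize> {
    if needle.is_empty() || needle.len() > haystack.len() {
        return None;
    }
    (0..=haystack.len() - needle.len())
        .find(|&i| haystack[i..i + needle.len()] == *needle)
}

// An input search: one substring search, then advance the input by k
// characters (O(k)), for a total of O(mk) + O(k) = O(mk).
fn input_search<'a>(needle: &[u8], input: &'a [u8]) -> Option<&'a [u8]> {
    let k = str_search(needle, input)?;
    Some(&input[k..])
}

fn main() {
    assert_eq!(str_search(b"cd", b"abcdef"), Some(2));
    assert_eq!(input_search(b"cd", b"abcdef").unwrap(), b"cdef");
}
```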
Let us now consider any algorithm where the input search is called multiple times, but each call is performed on a strict suffix of the previous result of an input search. Let c be the total number of input search calls. Since each call is performed on a strict suffix, c is bounded above by n. Let k_i be the number of characters we advance by in call i. The complexity of all the calls is:
O(mk_1) + O(mk_2) + \cdots + O(mk_c) = O(m(k_1 + k_2 + \cdots + k_c))
But \sum_{i=1}^c k_i \le n. So O(m(k_1 + k_2 + \cdots + k_c)) = O(mn).
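The amortization can be sketched as a loop in which every call runs on a strict suffix of the previous result, so the advances k_1 + … + k_c sum to at most n (names are illustrative):

```rust
// Each iteration searches a strict suffix of the previous result, so
// the total number of characters advanced across all calls is at most
// the haystack length n, giving O(mn) overall.
fn count_matches(needle: &[u8], haystack: &[u8]) -> usize {
    let mut input = haystack;
    let mut calls = 0;
    while let Some(k) = find(needle, input) {
        calls += 1;
        // Resume on a strict suffix of this call's result.
        input = &input[k + 1..];
    }
    calls
}

fn find(needle: &[u8], haystack: &[u8]) -> Option<usize> {
    if needle.is_empty() || needle.len() > haystack.len() {
        return None;
    }
    (0..=haystack.len() - needle.len())
        .find(|&i| haystack[i..i + needle.len()] == *needle)
}

fn main() {
    assert_eq!(count_matches(b"ab", b"ababab"), 3);
}
```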
If we consider a base algorithm with some complexity O(Q), then the input search calls are only an additive runtime cost, so the entire algorithm's complexity is O(Q) + O(mn). In our case, the needle size is proportional to the regex size, m = O(|r|); the haystack size is n = |s|; and the base algorithm, in the case of a linear regex engine (the PikeVM), has complexity O(Q) = O(|r| \cdot |s|). Thus, a linear regex engine using input searches on consecutive strict suffixes has complexity
O(|r| \cdot |s|) + O(|r| \cdot |s|) = O(|r| \cdot |s|)
\square.
The general algorithm, which does not depend on the underlying engine, can be generalized to skip not only at the very start but also during iteration. We keep track of the next position at which the prefix matches and skip everything in between.
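A minimal sketch of this idea, assuming a simple find_prefix helper (the actual engine API differs): alongside the iteration we remember the next position where the prefix occurs and visit only those positions.

```rust
// Sketch: only positions where the prefix occurs need to be visited;
// everything in between is skipped.
fn positions_to_visit(prefix: &[u8], haystack: &[u8]) -> Vec<usize> {
    let mut visited = Vec::new();
    let mut pos = 0;
    while let Some(off) = find_prefix(prefix, &haystack[pos..]) {
        let next = pos + off;
        // No position in pos..next can start a match: skip them all.
        visited.push(next);
        pos = next + 1;
    }
    visited
}

fn find_prefix(prefix: &[u8], haystack: &[u8]) -> Option<usize> {
    if prefix.is_empty() || prefix.len() > haystack.len() {
        return None;
    }
    (0..=haystack.len() - prefix.len())
        .find(|&i| haystack[i..i + prefix.len()] == *prefix)
}

fn main() {
    // The prefix "ab" occurs at positions 2 and 6; 0, 1, 3, 4, 5 and 7
    // are never visited.
    assert_eq!(positions_to_visit(b"ab", b"xxabyyab"), vec![2, 6]);
}
```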
This can be useful for engines other than the PikeVM, which simulates the .*? itself. For the memoized backtracker, for this more advanced algorithm to be useful, we would need to figure out how the cache can be shared between executions after each acceleration.
We want to prove that the extracted prefix has length O(|R|), where R is the expanded regex r (e.g., r = /x{4,5}/ expands to R = /xxxxx?/).
We first define the size of a regex r.
Fixpoint regex_size (r: regex) : nat :=
  match r with
  | Epsilon => 1
  | Regex.Character _ => 1
  | Disjunction r1 r2 => 1 + regex_size r1 + regex_size r2
  | Sequence r1 r2 => 1 + regex_size r1 + regex_size r2
  | Quantified _ min delta r1 =>
      1 + min * regex_size r1 +
      (match delta with
       | NoI.Inf => regex_size r1
       | NoI.N n => n * regex_size r1
       end)
  | Lookaround _ r1 => 1 + regex_size r1
  | Group _ r1 => 1 + regex_size r1
  | Anchor _ => 1
  | Backreference _ => 1
  end.
Thus, |R| = regex_size r.
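For readers less familiar with Coq, the same measure can be transcribed in Rust; the Regex type below is a hypothetical simplification of the development's actual type (NoI.Inf becomes None, NoI.N n becomes Some(n)):

```rust
// A Rust transcription of regex_size. Quantified carries a greediness
// flag, a minimum repetition count, and an optional finite delta.
#[allow(dead_code)]
enum Regex {
    Epsilon,
    Character(char),
    Disjunction(Box<Regex>, Box<Regex>),
    Sequence(Box<Regex>, Box<Regex>),
    Quantified(bool, usize, Option<usize>, Box<Regex>),
    Lookaround(u8, Box<Regex>),
    Group(u32, Box<Regex>),
    Anchor(u8),
    Backreference(u32),
}

fn regex_size(r: &Regex) -> usize {
    use Regex::*;
    match r {
        Epsilon | Character(_) | Anchor(_) | Backreference(_) => 1,
        Disjunction(r1, r2) | Sequence(r1, r2) => 1 + regex_size(r1) + regex_size(r2),
        Quantified(_, min, delta, r1) => {
            let body = regex_size(r1);
            let extra = match delta {
                None => body,        // NoI.Inf
                Some(n) => n * body, // NoI.N n
            };
            1 + min * body + extra
        }
        Lookaround(_, r1) | Group(_, r1) => 1 + regex_size(r1),
    }
}

fn main() {
    // /x{4,5}/: min = 4, delta = N 1, body size 1 => 1 + 4 + 1 = 6.
    let r = Regex::Quantified(true, 4, Some(1), Box::new(Regex::Character('x')));
    assert_eq!(regex_size(&r), 6);
}
```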
We want to prove the following theorem:
Theorem extract_literal_size_bound:
  forall r,
    length (prefix (extract_literal r)) <= regex_size r.
This proves that we are bounded by O(|R|). With the help of a few lemmas about the lengths returned by chain_literals and merge_literals, the theorem is proven easily, mostly by lia.
Interestingly, if literal extraction handled backreferences properly, this theorem would not hold. Consider /(abc)\1\1/. The size of the regex is about 6, but the extracted prefix could be Exact abcabcabc, which has length 9. No constant factor on the regex size bound would help: the prefix length is not linear in the regex size.
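To see why no constant factor suffices, consider a hypothetical family /(c…c)\1…\1/ with a group of g characters and k backreferences: the regex size grows linearly in g + k, while a fully expanded Exact prefix would have length g·(k+1). A quick sketch of this arithmetic (the size formula is a deliberate simplification):

```rust
// Simplified size for a group of g characters followed by k
// backreferences: one group node, g character nodes, k backreference
// nodes (sequencing nodes would only add a constant factor).
fn approx_regex_size(g: usize, k: usize) -> usize {
    1 + g + k
}

// A fully expanded Exact prefix repeats the group k + 1 times.
fn expanded_prefix_len(g: usize, k: usize) -> usize {
    g * (k + 1)
}

fn main() {
    // /(abc)\1\1/: g = 3, k = 2 => size ~6 but prefix length 9.
    assert_eq!(approx_regex_size(3, 2), 6);
    assert_eq!(expanded_prefix_len(3, 2), 9);
    // The ratio grows without bound, so no constant factor helps.
    assert!(expanded_prefix_len(100, 100) > 10 * approx_regex_size(100, 100));
}
```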
We want to see how much prefix acceleration helps in the real world. We evaluate four strategies implemented in Rust using rebar.
The four strategies have been implemented in rust-regex (the first is just a matter of disabling prefix acceleration, and the third is the existing behavior). Changes can be found in the LindenRegex fork on the branch “mw/prefix-acc-cmp”. This was done by adding an additional configuration option to the PikeVM which allows specifying the prefiltering strategy:
pub enum PrefilterStrategy {
    /// Use the prefilter only at the start of the search.
    Once,
    /// Use the prefilter whenever the PikeVM runs out of active states.
    #[default]
    OnEmptyStates,
    /// Use the prefilter in advance to know which positions to skip.
    OneAhead,
}

When PrefilterStrategy::OneAhead is used, we always know
about the next position where a prefix is present. This allows us to
skip doing an epsilon closure to simulate .*? on positions
that we know have no matches.
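A rough sketch of the OneAhead idea (illustrative only, not rust-regex's actual internals): the main loop memoizes the next prefix occurrence and only pays for the epsilon closure that seeds a new start thread at exactly those positions.

```rust
// Return the positions at which a new ".*?" start thread would be
// seeded: only where the memoized next prefix occurrence lies.
fn one_ahead_positions(prefix: &[u8], haystack: &[u8]) -> Vec<usize> {
    let mut seeded = Vec::new();
    let mut next = find_at(prefix, haystack, 0);
    for pos in 0..haystack.len() {
        if next == Some(pos) {
            // Only here do we pay for the seeding epsilon closure.
            seeded.push(pos);
            next = find_at(prefix, haystack, pos + 1);
        }
        // Every other position is skipped: no prefix means no match.
    }
    seeded
}

fn find_at(prefix: &[u8], haystack: &[u8], from: usize) -> Option<usize> {
    if prefix.is_empty() || prefix.len() > haystack.len() {
        return None;
    }
    (from..=haystack.len() - prefix.len())
        .find(|&i| haystack[i..i + prefix.len()] == *prefix)
}

fn main() {
    assert_eq!(one_ahead_positions(b"ab", b"xxabab"), vec![2, 4]);
}
```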
Afterwards, rebar (the rust-regex benchmarking suite) was extended to include engines configured for our four strategies. Changes can be found in the LindenRegex fork on the branch “mw/prefix-acc-cmp”. This was done by extending the existing “regex-automata” engine, which benchmarks the individual regex engines of rust-regex (dfa, pikevm, onepass, backtracking, etc.). The rust-regex dependency was rewired to point to our fork with the implemented PrefilterStrategy. Additional targets (“pikevm/noAcc”, “pikevm/accOnce”, “pikevm/accEmptyStates”, “pikevm/accOneAhead”) were added. Then, these targets were added to all existing benchmarks and run on a silent system. The benchmarking results were committed to the rebar fork.
Results were generated by:

rebar build -e 'rust/regex/pikevm/(?:noAcc|accOnce|accEmptyStates|accOneAhead)'
rebar measure -e 'rust/regex/pikevm/(?:noAcc|accOnce|accEmptyStates|accOneAhead)' | tee prefilters.csv

(We will always exclude the compile benchmark model using -M compile, as we are not interested in comparing compilation times.)

rebar rank prefilters.csv -e 'rust/regex/pikevm/(?:accOnce|noAcc)' -M compile
rebar cmp prefilters.csv -e 'rust/regex/pikevm/(?:accOnce|noAcc)' -M compile

This reveals that doing prefix acceleration even just once is at least on par with doing no acceleration at all. accOnce is never slower than noAcc. Whenever accOnce is faster than noAcc, it is so by a very large margin (up to 600x faster). This speed-up can be attributed to skipping either the entire haystack or a large portion of it thanks to prefix acceleration.
rebar rank prefilters.csv -e 'rust/regex/pikevm/(?:accEmptyStates|accOnce)' -M compile
rebar cmp prefilters.csv -e 'rust/regex/pikevm/(?:accEmptyStates|accOnce)' -M compile

Similarly, accEmptyStates is strictly faster than accOnce. After all, accEmptyStates does everything accOnce does, plus more. accEmptyStates is resilient to degenerate cases where, for example, the prefix matches at the very start of the string. This can be seen in accOnce being slower by up to 80x in some cases.
rebar rank prefilters.csv -e 'rust/regex/pikevm/(?:accOneAhead|accEmptyStates)' -M compile
rebar cmp prefilters.csv -e 'rust/regex/pikevm/(?:accOneAhead|accEmptyStates)' -M compile

This time there are cases where accOneAhead is slower than accEmptyStates, but only by up to a factor of 2. On the other hand, accOneAhead is in some cases up to 20x faster than accEmptyStates.
It would be very interesting to contribute this to rust-regex. If I find time, it would be worth doing.
TODO: check whether not caching the one-ahead result is a problem. Since rebar does find-all searches, not caching the result means we have to repeat the substring search when resuming the search.
We want to prove correct a version of the PikeVM that performs prefix acceleration. To do so, we extend the PikeTree non-deterministic algorithm to skip branches that do not have a matching leaf. This does not change correctness: there were no matches there, so we are free to drop them. This approach allows the PikeVM to use strategies that drop execution paths known to contain no result (similar to the SeenSet), and it should make the relation between the PikeVM and the PikeTree easier to establish.
We still need to remember that the PikeVM simulates the .*? itself; that is what makes the relation with the PikeTree difficult, as the PikeTree only executes from a given start position. We still need to relate .*? with trying a match at every input position.