Collected notes

Complexity of prefix-acc

We assume the complexity of a substring search to be O(mn) where m is the needle length and n is the haystack length. Moreover, we assume that if a match exists at position k < n, then it will be found in O(mk). Therefore, str_search needle haystack = k executes in O(mk).

An input search performs the substring search once and advances the input by this many characters. Advancing an input by k characters is done in O(k). Therefore the input search has a runtime of O(mk) + O(k) = O(mk).
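A minimal sketch of this cost model (the names `str_search` and `input_search` are illustrative, not from the actual development):

```rust
/// Naive O(mn) substring search; a match at position k is found in O(mk).
fn str_search(needle: &str, haystack: &str) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack
        .as_bytes()
        .windows(needle.len())
        .position(|w| w == needle.as_bytes())
}

/// An input search: run the substring search once, then advance the input
/// by the match offset k. Advancing costs O(k), for O(mk) + O(k) = O(mk).
fn input_search<'a>(needle: &str, input: &'a str) -> Option<&'a str> {
    let k = str_search(needle, input)?;
    // Byte-offset slicing; fine for the ASCII examples here.
    Some(&input[k..])
}
```

For example, `input_search("bc", "aabcd")` yields `Some("bcd")`: the input advanced by k = 2 characters.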

Let us now consider any algorithm where the input search is called multiple times, but each call is performed on a strict suffix of the result of the previous call. Let c be the number of input search calls performed in total. Since each call is performed on a strict suffix, c is bounded above by n. Let k_i be the number of characters we advance by in call number i. The complexity of all the calls is:

O(mk_1) + O(mk_2) + \cdots + O(mk_c) = O(m(k_1 + k_2 + \cdots + k_c))

But \sum_{i=1}^c k_i \le n. So O(m(k_1 + k_2 + \cdots + k_c)) = O(mn).

If we consider a base algorithm with some complexity O(Q), then the input search calls are only an additive runtime cost, so the entire algorithm's complexity is O(Q) + O(mn). In our case the needle size is proportional to the regex size, m = O(|r|), the haystack length is n = |s|, and the base algorithm in the case of a linear regex engine (PikeVM) has complexity O(Q) = O(|r| \cdot |s|). Thus,

a linear regex engine using input searches on consecutive strict suffixes has complexity

O(|r| \cdot |s|) + O(|r| \cdot |s|) = O(|r| \cdot |s|)

\square.
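The amortization argument above can be sketched in code (hypothetical names; not the actual development):

```rust
/// Naive O(mn) substring search; a match at position k is found in O(mk).
fn str_search(needle: &str, haystack: &str) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack
        .as_bytes()
        .windows(needle.len())
        .position(|w| w == needle.as_bytes())
}

/// Runs input searches on consecutive strict suffixes and returns the total
/// number of characters advanced over all calls. This total is bounded by
/// `start.len()`, which is what bounds the overall search cost by O(mn).
fn total_advance(needle: &str, start: &str) -> usize {
    let mut input = start;
    let mut total = 0;
    while let Some(k) = str_search(needle, input) {
        total += k; // advanced k_i characters in call i
        if k + 1 > input.len() {
            break;
        }
        // Continue on a strict suffix: step at least one character past
        // the match position (byte-offset slicing; ASCII examples only).
        total += 1;
        input = &input[k + 1..];
    }
    total
}
```

For instance, `total_advance("ab", "ababab")` advances 5 characters over three calls, within the bound n = 6.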

More advanced general alg

The general alg that does not care about the underlying engine can be generalized to skip not only at the very start, but also during iteration: we keep track of the next position where the prefix matches and skip everything in between.
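A minimal sketch of that bookkeeping, with `next_prefix_pos` standing in for the real prefilter (all names hypothetical):

```rust
/// Next occurrence of the prefix at or after `from`, or None.
fn next_prefix_pos(prefix: &str, haystack: &str, from: usize) -> Option<usize> {
    haystack[from..].find(prefix).map(|k| from + k)
}

/// Yields only the candidate start positions where the prefix matches;
/// all positions in between are skipped without running the engine on them.
fn candidate_starts(prefix: &str, haystack: &str) -> Vec<usize> {
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some(p) = next_prefix_pos(prefix, haystack, pos) {
        out.push(p);
        pos = p + 1; // resume the scan just past this candidate
    }
    out
}
```

For example, with prefix "ab" and haystack "xxabyyab", only positions 2 and 6 are ever considered.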

This can be useful for engines other than the PikeVM, which simulates the .*? itself. For the memoized backtracker, this more advanced alg is only useful if we figure out cache sharing between executions after each acceleration.

Extracted prefix size

We want to prove that the extracted prefix has size O(|R|), where R is the expansion of the regex r (e.g., r = /x{4,5}/ but R = /xxxxx?/).

We first define the size of a regex r.

Fixpoint regex_size (r: regex) : nat :=
    match r with
    | Epsilon => 1
    | Regex.Character _ => 1
    | Disjunction r1 r2 => 1 + regex_size r1 + regex_size r2
    | Sequence r1 r2 => 1 + regex_size r1 + regex_size r2
    | Quantified _ min delta r1 =>
            1 + min * regex_size r1 +
                (match delta with
                | NoI.Inf => regex_size r1
                | NoI.N n => n * regex_size r1
                end)
    | Lookaround _ r1 => 1 + regex_size r1
    | Group _ r1 => 1 + regex_size r1
    | Anchor _ => 1
    | Backreference _ => 1
    end.

Thus, |R| = regex_size r.
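As a sanity check of the definition, consider r = /x{4,5}/, i.e. Quantified with min = 4 and delta = NoI.N 1 over Character x:

```latex
\mathrm{regex\_size}(r)
  = 1 + 4 \cdot \mathrm{regex\_size}(x) + 1 \cdot \mathrm{regex\_size}(x)
  = 1 + 4 + 1 = 6
```

so the quantifier node is charged for all five expanded copies of x.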

We want to prove the following theorem:

Theorem extract_literal_size_bound:
  forall r,
    length (prefix (extract_literal r)) <= regex_size r.

This proves that we are bounded by O(|R|).

With the help of a few lemmas about the lengths returned by chain_literals and merge_literals, the theorem is easily proven, mostly by lia.

Backreferences

Interestingly, if literal extraction handled backreferences properly, this theorem would not hold. Consider /(abc)\1\1/: the extracted prefix could be Exact abcabcabc, of length 9, even though the regex itself is small. In general, a group of length g repeated by k backreferences yields a prefix of length g(k+1) while the regex size is only O(g + k), so no constant factor on the regex size bound would help; the prefix length is not linear in the size of the regex.

Investigation of prefix acceleration strategies

We want to see how much prefix acceleration helps in the real world. We evaluate four strategies implemented in Rust using rebar:

  1. No prefix acceleration
  2. Prefix acceleration once at the beginning
  3. Prefix acceleration each time we are out of PikeVM states (this is what rust-regex does)
  4. Prefix acceleration done always one ahead, allowing us to know which start states are not needed

The four strategies have been implemented in rust-regex (the first is just a matter of disabling prefix acc, and the third is the existing behavior). Changes can be found in the LindenRegex fork on the branch “mw/prefix-acc-cmp”. This was done by adding an additional config option to the PikeVM which allows specifying the prefiltering strategy:

pub enum PrefilterStrategy {
    /// Use the prefilter only at the start of the search.
    Once,
    /// Use the prefilter whenever the PikeVM runs out of active states.
    #[default]
    OnEmptyStates,
    /// Use the prefilter in advance to know which positions to skip.
    OneAhead,
}

When PrefilterStrategy::OneAhead is used, we always know the next position where the prefix is present. This allows us to skip the epsilon closure simulating .*? at positions that we know cannot start a match.
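A minimal sketch of this idea, assuming a `find_prefix` prefilter (all names are hypothetical, not the actual rust-regex implementation):

```rust
/// Next occurrence of the prefix at or after `from`, or None.
fn find_prefix(prefix: &str, haystack: &str, from: usize) -> Option<usize> {
    haystack[from..].find(prefix).map(|k| from + k)
}

/// Returns, for each haystack position, whether the PikeVM would seed a new
/// start state there under the OneAhead strategy.
fn seeded_positions(prefix: &str, haystack: &str) -> Vec<bool> {
    let mut seeded = vec![false; haystack.len()];
    // Always stay one step ahead: the next known prefix position.
    let mut next = find_prefix(prefix, haystack, 0);
    for pos in 0..haystack.len() {
        if next == Some(pos) {
            seeded[pos] = true; // prefix here: run the epsilon closure
            next = find_prefix(prefix, haystack, pos + 1);
        }
        // Otherwise: skip the closure, no match can start at `pos`.
    }
    seeded
}
```

With prefix "ab" and haystack "xabx", only position 1 seeds a start state; the closure is skipped everywhere else.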

Afterwards rebar (the rust-regex benchmarking suite) was extended to include engines configured with our four strategies. Changes can be found in the LindenRegex fork on the branch “mw/prefix-acc-cmp”. This was done by extending the existing “regex-automata” engine, which benched the individual regex engines of rust-regex (dfa, pikevm, onepass, backtracking, etc.). The rust-regex dependency was rewired to point to our fork with the implemented PrefilterStrategy. Additional targets (“pikevm/noAcc”, “pikevm/accOnce”, “pikevm/accEmptyStates”, “pikevm/accOneAhead”) were added. Then, these targets were added to all existing benchmarks and run on a silent system. The benchmarking results were committed to the rebar fork.

Results were generated by:

rebar build -e 'rust/regex/pikevm/(?:noAcc|accOnce|accEmptyStates|accOneAhead)'
rebar measure -e 'rust/regex/pikevm/(?:noAcc|accOnce|accEmptyStates|accOneAhead)' | tee prefilters.csv

Result analysis

(We will always exclude the compile benchmark model using -M compile; we are not interested in comparing compilation times.)

rebar rank prefilters.csv -e 'rust/regex/pikevm/(?:accOnce|noAcc)' -M compile
rebar cmp prefilters.csv -e 'rust/regex/pikevm/(?:accOnce|noAcc)' -M compile

This reveals that doing prefix acceleration even just once is, at worst, on par with doing no acceleration at all: accOnce is never slower than noAcc. Whenever accOnce is faster than noAcc, it is so by a very large margin (up to 600x faster). This speed-up can be attributed to skipping the entire haystack, or a large portion of it, thanks to prefix acceleration.


rebar rank prefilters.csv -e 'rust/regex/pikevm/(?:accEmptyStates|accOnce)' -M compile
rebar cmp prefilters.csv -e 'rust/regex/pikevm/(?:accEmptyStates|accOnce)' -M compile

Similarly, accEmptyStates is strictly faster than accOnce; after all, accEmptyStates does everything accOnce does, plus more. It is also resilient to degenerate cases where, for example, the prefix matches at the very start of the string: this can be seen in accOnce being slower by up to 80x in some cases.


rebar rank prefilters.csv -e 'rust/regex/pikevm/(?:accOneAhead|accEmptyStates)' -M compile
rebar cmp prefilters.csv -e 'rust/regex/pikevm/(?:accOneAhead|accEmptyStates)' -M compile

This time there are cases where accOneAhead is slower than accEmptyStates, but only by up to a factor of 2. On the other hand, accOneAhead is in some cases up to 20x faster than accEmptyStates.

It would be very interesting to contribute this to rust-regex. If I find personal time, it would be worth doing.

TODO: check whether not caching the one-ahead result is a problem. Since rebar does find-all searches, not caching the result means we have to redo the string search when resuming the search.

Prefix accelerated PikeVM

We want to prove correct a version of the PikeVM that does prefix acceleration. To do so, we extend the PikeTree non-deterministic algorithm to skip branches that do not have a matching leaf. This should not change correctness: there were no matches there, so we are free to drop them. This approach allows the PikeVM to use strategies that drop some execution paths when there is no result there (similar to the SeenSet), and should make the relation between the PikeVM and the PikeTree easier to establish.

We still need to remember that the PikeVM simulates the .*? itself; that is what makes the relation with the PikeTree difficult, as the PikeTree only executes from a given start position. We still need to relate .*? with trying at every input position.

To discuss

Action items