For some regexes, we can immediately tell exactly which string they
will match. Consider /ab{2}c/. The only string it can
match is “abbc”. Instead of running an entire regex engine,
we can run a substring search for that string: the first
occurrence is the highest-priority match we are interested in.
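As a rough sketch of this idea in Rust (the `find_exact_literal` helper is hypothetical, standing in for the real engine's literal fast path):

```rust
/// If a regex like /ab{2}c/ matches exactly one string ("abbc"),
/// matching it against `haystack` reduces to a substring search:
/// the leftmost occurrence is the highest-priority match.
fn find_exact_literal(literal: &str, haystack: &str) -> Option<(usize, usize)> {
    haystack
        .find(literal) // first occurrence, scanning left to right
        .map(|start| (start, start + literal.len()))
}

fn main() {
    // /ab{2}c/ can only match "abbc".
    assert_eq!(find_exact_literal("abbc", "xxabbcyy"), Some((2, 6)));
    assert_eq!(find_exact_literal("abbc", "abc"), None);
    println!("ok");
}
```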
To prove that, we first prove a smaller lemma that only talks about string prefixes. This prefix-based reasoning is also used in many of the other proofs about literal-based optimizations.
Lemma exact_literal_result :
forall r tree inp gm p,
is_tree rer [Areg r] inp gm forward tree ->
has_asserts r = false ->
extract_literal r = Exact p ->
starts_with p (next_str inp) ->
(exists gm', tree_res tree gm inp forward = Some (advance_input_n inp (length p) forward, gm')).
If we have a regex whose exact literal we know, and that
literal is a prefix of the input, then we know a match exists at this
position. We don’t know what capturing groups may be produced:
consider /(ab)c/. Its literal is “abc” but it also produces
a capture group. In practice, we should be able to prove the exact
values for those capture groups too, but that is left for another time.
The base theorem for captures would look like this:
Lemma exact_literal_capture_value:
forall r r' i tree inp inp' gm gm' p,
r = Group i r' ->
is_tree rer [Areg r] inp gm forward tree ->
extract_literal r = Exact p ->
tree_res tree gm inp forward = Some (inp', gm') ->
Groups.GroupMap.find i gm' = Some (Groups.GroupMap.Range 0 (Some (length p))).
This could generalize to other regexes that are not exact themselves, but have exact captures.
Proving exact_literal_result is done by induction on the
is_tree inductive. We then generalize to finding
matches anywhere in the string via the lazy prefix. That proof is by
induction from the position of the match down to the start of the
input.
Lemma exact_literal_result_unanchored {strs:StrSearch} :
forall r inp p inp' tree,
has_asserts r = false ->
extract_literal r = Exact p ->
is_tree rer [Areg (lazy_prefix r)] inp Groups.GroupMap.empty forward tree ->
input_search p inp = Some inp' ->
exists gm', first_leaf tree inp = Some (advance_input_n inp' (length p) forward, gm').
In the wild, around 12% of regexes have exact literals. This optimization makes choosing a regex over a plain substring search for constant strings a zero-cost abstraction.
strict_suffix solver

During proofs we often need to reason about a relation between
inputs called strict_suffix, which relates two inputs when
one is a suffix of the other and they are not equal. Proving
such facts is usually trivial but can get quite tedious. I wrote a small
solver for this purpose.
Notice that the strict_suffix relation can be
represented as a DAG: the relation is irreflexive, asymmetric, and
transitive. To solve any goal of the form strict_suffix inp1 inp2,
we can construct a graph from the premises; if there exists a path
from inp1 to inp2, then the goal is provable.
To find a contradiction, we can look for cycles (self-loops included),
which are ruled out by asymmetry.
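A minimal sketch of the path-search idea, written in Rust rather than Ltac (the edge representation is illustrative: an edge (a, b) records the premise strict_suffix a b):

```rust
use std::collections::{HashMap, HashSet};

/// DFS over the premise graph. By transitivity, any path from `from`
/// to `to` proves the goal strict_suffix from to; a path from a node
/// back to itself is a cycle, i.e. a contradiction.
fn reachable(edges: &[(&str, &str)], from: &str, to: &str) -> bool {
    let mut adj: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(a, b) in edges {
        adj.entry(a).or_default().push(b);
    }
    let mut seen = HashSet::new();
    let mut stack = vec![from];
    while let Some(x) = stack.pop() {
        if !seen.insert(x) {
            continue;
        }
        for &y in adj.get(x).map(|v| v.as_slice()).unwrap_or(&[]) {
            if y == to {
                return true;
            }
            stack.push(y);
        }
    }
    false
}

fn main() {
    let premises = [("i1", "i2"), ("i2", "i3")];
    // Goal strict_suffix i1 i3 follows by transitivity.
    assert!(reachable(&premises, "i1", "i3"));
    // No path the other way: the goal is not provable from these premises.
    assert!(!reachable(&premises, "i3", "i1"));
    println!("ok");
}
```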
I don’t do that, however: I did not dare to spend time implementing this (probably complete?) solver in Ltac/Ltac2. Maybe with Dario’s OCaml tactic metaprogramming? Instead, I crudely search for a path by chaining transitivity, to either find the solution or a contradiction.
The first stage of solving is canonicalization, which tries to gather
as many strict_suffix premises as possible. For example, it turns
backward strict_suffix facts (the relation has a direction) into forward
ones, turns input_prefix into suffixes, unfolds
advance_input (the “successor” function on
inputs), etc.
StrictSuffix solver

The DAG idea from above generalizes to any relation that is transitive and irreflexive (a strict order). After nerd-sniping my office mates, an early version of a complete strict-order solver was produced. I hope to polish it and make it generic enough to be easily usable by anyone who wants to prove things about strict orders.
During my thesis we devised an alternative way to perform prefix acceleration. Prefix acceleration is a generic term for optimizations that use information about prefixes to skip exploring parts of the string during matching. The alternative strategy simply allows us to skip more positions of the string. I implemented it in Rust’s regex crate. The new implementation is strictly better than the existing one whenever we are trying to find all matches in a string (up to a 20× speedup). For ‘find-first’ it can still be beneficial, but it is less clear-cut.
Hopefully, next week I will send the PR. Currently, a 2% regression
in overall performance is observed. I have minimized the example, but I
still find it hard to explain where this slowdown is coming from.
Whenever I add a single extra condition to an if
(not_one_ahead &&) in a hot loop, the code slows
down. I suspect I run out of registers and spill to the stack. I use the
https://github.com/pacak/cargo-show-asm tool to inspect the
generated assembly, but even with that small change the diff is big and
hard to debug.
To implement lookarounds we will need to run regex engines on a regex and input that are both reversed. Since reversing an input (“abcde” becomes “edcba”) is expensive, we instead parametrize an engine (in our case the PikeVM) with a direction of reading. The only difference between a forward and backward direction is that characters are respectively read left-to-right and right-to-left. Then, we have a theorem which states that running in a backward direction is the same as running forward with the input reversed.
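A minimal sketch of direction-parametrized reading, assuming a simple byte-indexed input (the names `Dir` and `next_char` are illustrative, not the PikeVM's actual API):

```rust
#[derive(Clone, Copy)]
enum Dir {
    Forward,
    Backward,
}

/// Read the character adjacent to position `pos` in direction `dir`,
/// returning it together with the next position. Forward reads
/// left-to-right; backward reads right-to-left, so no reversed copy
/// of the input is ever materialized.
fn next_char(input: &[u8], pos: usize, dir: Dir) -> Option<(u8, usize)> {
    match dir {
        Dir::Forward => input.get(pos).map(|&c| (c, pos + 1)),
        Dir::Backward => pos.checked_sub(1).map(|p| (input[p], p)),
    }
}

fn main() {
    let s = b"abcde";
    // Reading backward from the end yields the same characters as
    // reading forward over the reversed string "edcba".
    let mut pos = s.len();
    let mut back = Vec::new();
    while let Some((c, p)) = next_char(s, pos, Dir::Backward) {
        back.push(c);
        pos = p;
    }
    assert_eq!(back, b"edcba".to_vec());
    println!("ok");
}
```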
One subtle aspect to consider is the regex /^/ (assert
that we are at the start of the string). When running in a backward
direction, it actually has to be interpreted as /$/
instead.
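Concretely, a compiler targeting backward execution has to swap the two anchors; a tiny sketch (the `Anchor` type is hypothetical, not the engine's actual AST):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Anchor {
    Start, // /^/: assert we are at the start of the string
    End,   // /$/: assert we are at the end of the string
}

/// When the engine reads right-to-left, "nothing left to read" means
/// we are at the *start* of the string, so the two anchors swap.
fn reverse_anchor(a: Anchor) -> Anchor {
    match a {
        Anchor::Start => Anchor::End,
        Anchor::End => Anchor::Start,
    }
}

fn main() {
    assert_eq!(reverse_anchor(Anchor::Start), Anchor::End);
    assert_eq!(reverse_anchor(Anchor::End), Anchor::Start);
    println!("ok");
}
```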
Since we have that correspondence between forward and backward directions, all proofs about the PikeVM will still be expressed in terms of the forward direction.