For some regexes, we can immediately tell exactly which string they
will match. Consider /ab{2}c/. The only string it can
match is “abbc”. Instead of running an entire regex engine,
we can run a substring search for that string: the first
occurrence is the highest-priority match we are interested in.
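As a rough sketch of this idea in Rust (the `find_exact_literal` helper is hypothetical, standing in for the real engine's literal fast path):

```rust
/// If a regex like /ab{2}c/ matches exactly one string ("abbc"),
/// matching it against `haystack` reduces to a substring search:
/// the leftmost occurrence is the highest-priority match.
fn find_exact_literal(literal: &str, haystack: &str) -> Option<(usize, usize)> {
    haystack
        .find(literal) // first occurrence, scanning left to right
        .map(|start| (start, start + literal.len()))
}

fn main() {
    // /ab{2}c/ can only match "abbc".
    assert_eq!(find_exact_literal("abbc", "xxabbcyy"), Some((2, 6)));
    assert_eq!(find_exact_literal("abbc", "abc"), None);
    println!("ok");
}
```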
To prove that, we first prove a smaller lemma that only talks about string prefixes. This prefix-based reasoning is also used in many of the other proofs about literal-based optimizations.
Lemma exact_literal_result :
forall r tree inp gm p,
is_tree rer [Areg r] inp gm forward tree ->
has_asserts r = false ->
extract_literal r = Exact p ->
starts_with p (next_str inp) ->
(exists gm', tree_res tree gm inp forward = Some (advance_input_n inp (length p) forward, gm')).
If we have a regex whose exact literal we know, and that
literal is a prefix of the input, then we know a match exists at this
position. We don’t know what capturing groups may be produced:
consider /(ab)c/. Its literal is “abc” but it also produces
a capture group. In practice, we should be able to prove the exact
values for those capture groups too, but that is left for another time.
The base theorem for captures would look like this:
Lemma exact_literal_capture_value:
forall r r' i tree inp inp' gm gm' p,
r = Group i r' ->
is_tree rer [Areg r] inp gm forward tree ->
extract_literal r = Exact p ->
tree_res tree gm inp forward = Some (inp', gm') ->
Groups.GroupMap.find i gm' = Some (Groups.GroupMap.Range 0 (Some (length p))).
This could generalize to other regexes that are not exact themselves, but have exact captures.
Proving exact_literal_result is done by induction on the
is_tree inductive. We then generalize to finding
matches anywhere in the string via the lazy prefix. That proof is by
induction from the position of the match down to the start of the
input.
Lemma exact_literal_result_unanchored {strs:StrSearch} :
forall r inp p inp' tree,
has_asserts r = false ->
extract_literal r = Exact p ->
is_tree rer [Areg (lazy_prefix r)] inp Groups.GroupMap.empty forward tree ->
input_search p inp = Some inp' ->
exists gm', first_leaf tree inp = Some (advance_input_n inp' (length p) forward, gm').
In the wild, around 12% of regexes have exact literals. This optimization makes choosing a regex over a plain substring search for constant strings a zero-cost abstraction.
strict_suffix solver

During proofs we often need to reason about a relation between
inputs called strict_suffix, which relates two inputs when
one is a suffix of the other and they are not equal. Proving
such facts is usually trivial but can get quite tedious. I wrote a small
solver for this purpose.
Notice that the strict_suffix relation can be
represented as a DAG: the relation is irreflexive, asymmetric, and
transitive. To solve any goal of the form strict_suffix inp1 inp2,
we can construct a graph from the premises; if there exists a path
from inp1 to inp2, then the goal is provable.
To find a contradiction, we can look for cycles (self-loops included),
which are ruled out by asymmetry.
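A minimal sketch of the path-search idea, written in Rust rather than Ltac (the edge representation is illustrative: an edge (a, b) records the premise strict_suffix a b):

```rust
use std::collections::{HashMap, HashSet};

/// DFS over the premise graph. By transitivity, any path from `from`
/// to `to` proves the goal strict_suffix from to; a path from a node
/// back to itself is a cycle, i.e. a contradiction.
fn reachable(edges: &[(&str, &str)], from: &str, to: &str) -> bool {
    let mut adj: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(a, b) in edges {
        adj.entry(a).or_default().push(b);
    }
    let mut seen = HashSet::new();
    let mut stack = vec![from];
    while let Some(x) = stack.pop() {
        if !seen.insert(x) {
            continue;
        }
        for &y in adj.get(x).map(|v| v.as_slice()).unwrap_or(&[]) {
            if y == to {
                return true;
            }
            stack.push(y);
        }
    }
    false
}

fn main() {
    let premises = [("i1", "i2"), ("i2", "i3")];
    // Goal strict_suffix i1 i3 follows by transitivity.
    assert!(reachable(&premises, "i1", "i3"));
    // No path the other way: the goal is not provable from these premises.
    assert!(!reachable(&premises, "i3", "i1"));
    println!("ok");
}
```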
I don’t do that, however: I did not dare to spend time implementing this (probably complete?) solver in Ltac/Ltac2. Maybe with Dario’s OCaml tactic metaprogramming? Instead, I crudely search for a path by chaining transitivity, to either find the solution or a contradiction.
The first stage of solving is canonicalization, which tries to gather
as many strict_suffix premises as possible. For example, it turns
backward strict_suffix facts (the relation has a direction) into forward
ones, turns input_prefix into suffixes, unfolds
advance_input (the “successor” function on
inputs), etc.
StrictSuffix solver

The DAG idea from above generalizes to any relation that is transitive and irreflexive (a strict order). After nerd-sniping my office mates, an early version of a complete strict-order solver was produced. I hope to polish it and make it generic enough to be easily usable by anyone who wants to prove things about strict orders.
During my thesis we devised an alternative way to perform prefix acceleration. Prefix acceleration is a generic term for optimizations that use information about prefixes to skip exploring parts of the string during matching. The alternative strategy simply allows us to skip more positions of the string. I implemented it in Rust’s regex crate. The new implementation is strictly better than the existing one whenever we are trying to find all matches in a string (up to a 20× speedup). For ‘find-first’ it can still be beneficial, but it is less clear-cut.
Hopefully, next week I will send the PR. Currently, a 2% regression
in overall performance is observed. I have minimized the example, but I
still find it hard to explain where this slowdown is coming from.
Whenever I add a single extra condition to an if
(not_one_ahead &&) in a hot loop, the code slows
down. I suspect I run out of registers and spill to the stack. I use the
https://github.com/pacak/cargo-show-asm tool to inspect the
generated assembly, but even with that small change the diff is big and
hard to debug.
To implement lookarounds we will need to run regex engines on a regex and input that are both reversed. Since reversing an input (“abcde” becomes “edcba”) is expensive, we instead parametrize an engine (in our case the PikeVM) with a direction of reading. The only difference between a forward and backward direction is that characters are respectively read left-to-right and right-to-left. Then, we have a theorem which states that running in a backward direction is the same as running forward with the input reversed.
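A minimal sketch of direction-parametrized reading, assuming a simple byte-indexed input (the names `Dir` and `next_char` are illustrative, not the PikeVM's actual API):

```rust
#[derive(Clone, Copy)]
enum Dir {
    Forward,
    Backward,
}

/// Read the character adjacent to position `pos` in direction `dir`,
/// returning it together with the next position. Forward reads
/// left-to-right; backward reads right-to-left, so no reversed copy
/// of the input is ever materialized.
fn next_char(input: &[u8], pos: usize, dir: Dir) -> Option<(u8, usize)> {
    match dir {
        Dir::Forward => input.get(pos).map(|&c| (c, pos + 1)),
        Dir::Backward => pos.checked_sub(1).map(|p| (input[p], p)),
    }
}

fn main() {
    let s = b"abcde";
    // Reading backward from the end yields the same characters as
    // reading forward over the reversed string "edcba".
    let mut pos = s.len();
    let mut back = Vec::new();
    while let Some((c, p)) = next_char(s, pos, Dir::Backward) {
        back.push(c);
        pos = p;
    }
    assert_eq!(back, b"edcba".to_vec());
    println!("ok");
}
```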
One subtle aspect to consider is the regex /^/ (assert
that we are at the start of the string). When running in a backward
direction, it actually has to be interpreted as /$/
instead.
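Concretely, a compiler targeting backward execution has to swap the two anchors; a tiny sketch (the `Anchor` type is hypothetical, not the engine's actual AST):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Anchor {
    Start, // /^/: assert we are at the start of the string
    End,   // /$/: assert we are at the end of the string
}

/// When the engine reads right-to-left, "nothing left to read" means
/// we are at the *start* of the string, so the two anchors swap.
fn reverse_anchor(a: Anchor) -> Anchor {
    match a {
        Anchor::Start => Anchor::End,
        Anchor::End => Anchor::Start,
    }
}

fn main() {
    assert_eq!(reverse_anchor(Anchor::Start), Anchor::End);
    assert_eq!(reverse_anchor(Anchor::End), Anchor::Start);
    println!("ok");
}
```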
Since we have that correspondence between forward and backward directions, all proofs about the PikeVM will still be expressed in terms of the forward direction.