We extend the PikeVM in a way that allows it to run searches that
simulate the .*? prefix as an opt-in feature. While
simulating .*? we also perform the prefix acceleration
heuristic. To do so, the PikeVM’s state gains an additional field, called
next_prefix, storing a counter that indicates after how many
characters the prefix of the regex next occurs in the input. To run the
standard PikeVM, we initialize next_prefix to
None. To run the .*? version, we initialize
next_prefix by computing where the prefix next appears
in the input. This gives rise to three new rules of the semantics of the
PikeVM:
| pvs_acc:
(* if there are no more active or blocked threads and we know where the next prefix matches, *)
(* we accelerate to that point *)
forall inp best n nextinp seen
(ADVANCE: advance_input_n inp (S n) forward = nextinp),
pike_vm_step c lit (PVS inp [] best [] (Some n) seen) (PVS nextinp [pike_vm_initial_thread] best [] (next_prefix_counter nextinp lit) initial_seenpcs)
Normally, once we are out of blocked and active threads, we would end
the search here. But if we know a position where the prefix occurs, we
jump to that point in the input. Then, we recompute the next
point in the input where the prefix is present. We had to adjust the
existing pvs_final rule (which ends the search) to apply
only if next_prefix is None. Notice, however,
that if next_prefix is None because
next_prefix_counter returned None (meaning the prefix
occurs nowhere else in the input), then the search can end here both for
the run of just r and for the simulation of .*?!
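The acceleration step can be sketched in Python terms. This is a simplified model, not the Rocq development: find_next_prefix stands in for next_prefix_counter, and the counter convention assumed here is that a counter of n at position pos means the prefix next occurs at pos + n + 1, so the acceleration advances by n + 1 characters (mirroring advance_input_n inp (S n)):

```python
def find_next_prefix(haystack, pos, prefix):
    # Stand-in for next_prefix_counter: counter value at position pos,
    # i.e. the prefix next occurs at pos + counter + 1, or None if the
    # prefix occurs nowhere strictly after pos.
    i = haystack.find(prefix, pos + 1)
    return None if i == -1 else i - (pos + 1)

def accelerate(pos, counter, haystack, prefix):
    # Models pvs_acc / pvs_final once no threads are active or blocked:
    # if the counter is None the search ends; otherwise jump the input
    # by counter + 1 characters, restart with a fresh initial thread
    # there, and recompute the counter.
    if counter is None:
        return None                      # pvs_final: the search ends
    pos += counter + 1                   # jump straight to the prefix
    return pos, find_next_prefix(haystack, pos, prefix)
```

For example, on haystack "abab" with prefix "b", accelerating from position 0 with counter 0 lands on position 1 with a recomputed counter, and once the counter is None the search stops.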
The two other new rules are the following:
| pvs_nextchar_generate:
forall inp1 inp2 best thr blocked seen
(ADVANCE: advance_input inp1 forward = Some inp2),
pike_vm_step c lit (PVS inp1 [] best (thr::blocked) (Some 0) seen) (PVS inp2 (thr::blocked ++ [pike_vm_initial_thread]) best [] (next_prefix_counter inp2 lit) initial_seenpcs)
| pvs_nextchar_filter:
forall inp1 inp2 best thr blocked n seen
(ADVANCE: advance_input inp1 forward = Some inp2),
pike_vm_step c lit (PVS inp1 [] best (thr::blocked) (Some (S n)) seen) (PVS inp2 (thr::blocked) best [] (Some n) initial_seenpcs)
Whenever we advance the input by one character and the
next_prefix is not yet zero (_filter rule), we
just move the blocked threads into the active ones and decrement the
next_prefix. If it is zero (_generate rule),
a match might start at this input position, so we start a new (lowest
priority) thread at this point and recompute
the next point in the input where the prefix is present. We later
argue in the proofs that filtering out the generation of the new threads
does not change the end result of the search. The existing
pvs_nextchar rule had to be adjusted to be used only when
next_prefix is None.
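A Python sketch of one such input advance (illustrative names, not the Rocq development; the counter convention assumed here is that a counter of n at position pos means the prefix next occurs at pos + n + 1):

```python
def advance_with_counter(blocked, counter, pos, haystack, prefix):
    # One character advance once the active list is empty.
    pos += 1
    if counter == 0:
        # pvs_nextchar_generate: spawn a new lowest-priority initial
        # thread at the new position and recompute the counter
        # (next_prefix_counter analogue via str.find).
        active = blocked + [("initial", pos)]
        i = haystack.find(prefix, pos + 1)
        counter = None if i == -1 else i - (pos + 1)
    else:
        # pvs_nextchar / pvs_nextchar_filter: just unblock the threads,
        # decrementing the counter if there is one.
        active = blocked
        counter = None if counter is None else counter - 1
    return active, counter, pos
```

On "abab" with prefix "b", starting with counter 0 at position 0, repeated advances spawn new initial threads exactly at the positions of the two occurrences of "b".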
Finally, the pvs_match rule had to be adjusted to set
next_prefix to None whenever we find a match.
Since we want leftmost semantics, we should not start any new threads in
the future. The need for this was discovered during the proofs, meaning
the semantics I defined a few weeks ago were incorrect. When I
implemented this prefix acceleration method in rust-regex I was not
aware of this, but looking at the implementation now, it seems I
accidentally got it right.
All existing proofs had to be adjusted for the new rules. It was
mostly very straightforward and the proofs did not require additional
lemmas. A new functional version of the small-step semantics was written
and proven equivalent to the relational one. Additionally, I kept the old
functional version to write a proof that when next_prefix
is None, both functional versions are
exactly the same! This reassures me that the extended
PikeVM still encodes the base semantics when next_prefix is
never initialized.
The new lemmas that were added are analogous to those that currently
argue something about the PikeVM’s initial state (such as some invariant
initialization). These lemmas had to be duplicated to argue the same
thing about the initial state that enables the .*?
simulation.
To still have an equivalence between the PikeTree and the PikeVM, the
PikeTree has to be analogously extended to simulate the
.*?. However, the PikeTree does not need to reason about
prefixes and can be non-deterministic. After all, the goal of the
PikeTree is only to prove the correctness of the PikeVM.
First, we extend the PikeTree state to store a nextt;
this will be our connection to next_prefix.
nextt will essentially be the tree for the
/.*?r/ regex. Such a regex boils down to a tree which is a
choice between trying r on the current input and advancing
the input by one character followed by the same kind of choice, until we
run out of input to explore.
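One plausible reading of that unfolding, sketched with Python tuples standing in for the Rocq tree constructors (the constructor names mirror those in the rules later in this section; this is an illustration, not the actual definition, and r_tree is a placeholder for the tree of r at each position):

```python
def lazy_star_tree(remaining, r_tree):
    # Unfold the backtracking tree of /.*?r/ over the remaining input:
    # at each position, a choice between trying r here (t1) and reading
    # one character before facing the same choice again (t2).
    if not remaining:
        return r_tree                    # out of input: only r remains
    c, rest = remaining[0], remaining[1:]
    return ("Read", c,
            ("Progress",
             ("Choice",
              r_tree,                                   # t1
              ("GroupAction", ("Reset", []),
               lazy_star_tree(rest, r_tree)))))         # t2
```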
The connection between backtracking trees and the PikeTree comes from
proving that the result computed by the PikeTree stays the same after
every step. To prove that the extended PikeTree can match
/.*?r/ we add that nextt is part of the
computed result.
What follows are the new semantic rules of the PikeTree. At the
moment we are missing something analogous to PikeVM’s _acc
rule. We imagine this would be a non-deterministic pruning of
nextt’s Choice branches whenever the branch
contains no result.
| pts_nextchar_generate:
(* when the list of active trees is empty and the next tree is a segment of a lazy star prefix, *)
(* restart from the blocked ones and the head iteration of the lazy star, proceeding to the next character *)
(* resetting the seen trees *)
forall inp c next pref best blocked tgm nextt t1 t2 seen
(INPUT: inp = Input (c::next) pref)
(NEXTT: nextt = Some (Read c
(Progress
(Choice
t1
(GroupAction (Reset []) t2))))
)
(SUBSET: pike_subtree t1),
pike_tree_step (PTS inp [] best (tgm::blocked) nextt seen) (PTS (Input next (c::pref)) ((tgm::blocked) ++ [pike_tree_initial_tree t1]) best [] (Some t2) initial_seentrees)
| pts_nextchar_filter:
(* when the list of active trees is empty and the next tree is a segment of a lazy star prefix, *)
(* and the head iteration of the lazy star contains no result, *)
(* restart from the blocked ones, proceeding to the next character *)
(* resetting the seen trees *)
forall inp c next pref best blocked tgm nextt t1 t2 seen
(INPUT: inp = Input (c::next) pref)
(NEXTT: nextt = Some (Read c
(Progress
(Choice
t1
(GroupAction (Reset []) t2))))
)
(LEAF: first_leaf t1 (Input next (c::pref)) = None),
pike_tree_step (PTS inp [] best (tgm::blocked) nextt seen) (PTS (Input next (c::pref)) (tgm::blocked) best [] (Some t2) initial_seentrees)
These rules allow the PikeTree to non-deterministically generate new
active trees from the nextt. If the branch of
nextt contains no result, we may skip it. The
NEXTT hypothesis enforces that nextt has
the form generated by the tree of /.*?r/. As before, some
existing rules need to be changed to take nextt correctly
into account.
Similarly to the PikeVM, the existing lemmas have to be adjusted in a
fairly mechanical way, and the lemmas talking about the initial state
have to be duplicated to talk about the .*? initial state
as well.
We can already state and prove the final theorem: a properly
initialized PikeVM returns the result of /.*?r/. The
proof goes through the PikeTree, and one theorem needs to be done more
carefully: invariant_preservation. This theorem states
that both algorithms return the same result. This is done
through an invariant relating the trees of the PikeTree and the threads
of PikeVM, as well as a few other things. Since we added rules for the
nextt/next_prefix, we need to relate those too
after each step of the algorithms. That is currently what I am working
on.
StrSearch instance

So far we had an axiomatization of what a string search is. Now we provide an instance which uses brute force search:
(* Search from position i onwards in string s for substring ss *)
Function brute_force_str_search (ss s: string) (i: nat) {measure (fun i => S (length s) - i) i} : option nat :=
match Nat.leb i (length s) with
| true => match starts_with_dec ss (List.skipn i s) with
| left _ => Some i
| right _ => brute_force_str_search ss s (S i)
end
| false => None
end.
Proof.
intros. apply Nat.leb_le in teq. lia.
Defined.
The definition uses Function to be able to use
functional induction, which removes the ugliness of the
measure. The definition was optimized for ease of proving.
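For intuition, the same search transliterated to Python (Some i / None become i / None; a sanity check, not the verified artifact):

```python
def brute_force_str_search(ss, s, i):
    # Search from position i onwards in s for the substring ss.
    while i <= len(s):                  # Nat.leb i (length s)
        if s[i:].startswith(ss):        # starts_with_dec ss (List.skipn i s)
            return i
        i += 1
    return None
```

Note that the i = len(s) iteration is still tried, matching the Nat.leb bound: an empty pattern is found at the end of the string.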
Additionally, we need a decidable version of the
starts_with predicate. We derive it from a proof of
decidability:
Lemma starts_with_dec:
forall s1 s2, { starts_with s1 s2 } + { ~ starts_with s1 s2 }.
So our str_search ss s is
brute_force_str_search ss s 0. The proofs of the axioms
follow nicely from functional induction on
brute_force_str_search. For each axiom, we need a
version generalized over i.
This proves the axiomatization is satisfiable.
In the past, I have tried dispatching LLM agents to prove lemmas that
seemed obviously true but that I did not want to waste my time proving.
It has not been very successful: the agents could prove really simple
lemmas but would often get stuck on slightly more involved ones. This
time, it was actually useful! I wrote the definition of
brute_force_str_search and proved one of the axioms,
leaving two for the LLM. To my surprise, it rewrote
brute_force_str_search to have i increase
rather than decrease (I originally wrote it as decreasing to avoid
having to prove termination, as it would stop at i = 0),
then it proved termination (not difficult), and proved the remaining two
axioms! It did so by realizing that i has to be
generalized (the first axiom did not need a generalization when
i was decreasing). That was nice. I used Copilot with
Claude Sonnet 4.5.
input_search proofs

str_search operates on strings, so we have an analogous
version that works on inputs. Under the hood it directly uses
str_search. The str_search axioms are then
transformed into lemmas about inputs which are provable from those
axioms. Until now, those lemmas were left unproven because it was not
entirely obvious how to prove them. I tackled them now, which made me
realize that one of the str_search axioms was wrong:
there was a < in place of a ≤ in the not_found axiom. The
proofs required an intermediate lemma:
Lemma strict_suffix_forward_skipn:
forall i2 i1,
strict_suffix i2 i1 forward ->
exists k, k > 0 /\ k <= length (next_str i1) /\ next_str i2 = List.skipn k (next_str i1).
This is essentially what connects strict_suffixes to the advancement of an input.
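In Python terms (taking next_str to be the not-yet-consumed part of an input and skipn as slicing), the lemma says such a witness k always exists; a hypothetical function computing it:

```python
def suffix_skip_count(next1, next2):
    # If next2 is a strict suffix of next1 (forward direction), return
    # the witness k with 0 < k <= len(next1) and next2 == next1[k:];
    # otherwise return None.
    if len(next2) < len(next1) and next1.endswith(next2):
        return len(next1) - len(next2)
    return None
```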
Find unused definitions (and maybe imports?) with https://github.com/rocq-community/coq-dpdgraph#dpdusage-find-unused-definitions-via-dpd-file
Require dpdgraph.dpdgraph.
From Linden Require Regex Chars Groups.
From Linden Require Tree Semantics BooleanSemantics.
From Linden Require NFA PikeTree PikeVM.
From Linden Require PikeSubset.
From Linden Require TreeRep SeenSets FunctionalPikeVM Complexity.
From Linden Require Parameters Prefix FunctionalUtils Correctness.
Print FileDependGraph PikeVM
Regex Chars Groups
Tree Semantics BooleanSemantics
NFA PikeTree PikeVM
PikeSubset
TreeRep SeenSets FunctionalPikeVM Complexity
Parameters Prefix FunctionalUtils Correctness.
Found nothing particularly interesting.