Firstly, I must say, it is very nice that “Coq” is not a substring of “Rocq”. Doing a grep of “coq” returns only things that need to be migrated.
It is quite unfortunate how big of an undertaking renaming Coq to
Rocq is. Anyway, to port Linden to Rocq (which should be a very small
effort), we first must port Warblre. I have sent a PR: https://github.com/LindenRegex/Warblre/pull/6. There are
a few blockers, alectryon using SerAPI (which does not work
with Rocq 9) being the main one. Rocq 9 seems to have fixed quite a few
bugs that were hit by Warblre which allowed to clear some
LATER’s.
Before we were missing a step in the PikeTree which would be
analogous to PikeVM’s _acc step (one that jumps to the next
input position where the prefix matches). This was now added and is the
following:
| pts_acc:
(* if there are no more active or blocked trees and we have some nextt, *)
(* we accelerate by non-deterministically skipping branches with no results *)
forall inp best seen nextinp nextt acc t
(ACC: pike_tree_acc inp nextt nextinp acc t),
pike_tree_step (PTS inp [] best [] (Some nextt) seen) (PTS nextinp [pike_tree_initial_tree t] best [] (Some acc) initial_seentrees)
pike_tree_acc is a non-deterministic skipping of head
branches that do not contain a match. The definition is the
following:
Inductive pike_tree_acc : input -> tree -> input -> tree -> tree -> Prop :=
| acc_keep:
forall inp c next pref nextt t1 t2
(INPUT: inp = Input (c::next) pref)
(NEXTT: nextt = Read c
(Progress
(Choice
t1
(GroupAction (Reset []) t2)))
),
pike_tree_acc inp nextt (Input next (c::pref)) t2 t1
| acc_skip:
forall inp c next pref nextt t1 t2 nextinp acc t
(INPUT: inp = Input (c::next) pref)
(NEXTT: nextt = Read c
(Progress
(Choice
t1
(GroupAction (Reset []) t2)))
)
(LEAF: first_leaf t1 (Input next (c::pref)) = None)
(TRANS: pike_tree_acc (Input next (c::pref)) t2 nextinp acc t),
pike_tree_acc inp nextt nextinp acc t.
The transitive rule exists only if we skipped a branch, we do not
want to skip some branches in the middle of the tree. So the final
nextinp is the input point to which we accelerated,
acc is the next nextt, and t is
the backtracking tree of r at nextinp.
The proof of preservation of the PikeTree invariant for the
_acc step requires induction over
pike_tree_acc. We relate pts_acc to the PikeVM
by skipping exactly nextprefix amount of head branches.
Since nextprefix indicates that there is no prefix at the
next nextprefix input positions, we know the
first_leaf will be None.
Additionally, the pts_final rule had to be updated:
| pts_final:
(* moving to a final state when there are no more active or blocked trees *)
forall inp best nextt seen
(LEAF: option_flat_map (fun t => first_leaf t inp) nextt = None),
pike_tree_step (PTS inp [] best [] nextt seen) (PTS_final best)
This option should be used only whenever nextt contains
no results. We cannot just say that we use this rule whenever
nextt = None as in the PikeVM. The reason is that
nextprefix = None -> nextt = None does not hold, more
details in the discussion of the PikeEquiv invariant.
We want to prove that for every execution of the PikeVM, there exists
one of the PikeTree which preserves some invariant. Until now it talked
about how we can related active threads to active trees, blocked threads
to blocked trees, etc. Now we must relate nextt to
nextprefix. The intuitive idea is that we want to say two
things:
nextprefix describes how many top branches of
nextt do not contain a resultnextt has a shape of
Read -> Choice (tree_of r) (next_nextt)The invariant is split into five cases:
When nextprefix = None,
nextt = None
When nextprefix = None,
nextt = Some Mismatch (we exhausted
nextt)
When nextprefix = None,
nextt = Some (Read -> Choice t1 t2) where
t1 contains no results
Case 2 and 3 is needed because the PikeTree never turns a
Some nextt into a None. It will
skip head branches one by one. On the other hand, if the PikeVM does not
find the next position of a prefix, nextprefix will turn
into None.
When nextprefix = Some 0,
nextt = Some (Read -> Choice t1 t2) where
t1 and t2 have the correct shape
When nextprefix = Some (S n),
nextt = Some (Read -> Choice t1 t2) where
t2 has the correct shape and t1 contains no
results
This invariant is proven to hold at the initial state and is preserved on steps. This concludes the proof of correctness of the prefix accelerated PikeVM!
A part of my thesis was supposed to be around formalizing look-arounds in the PikeVM. There is a bit less than 2 months left of my thesis. I will dedicate a day to think about what kind of lemmas we need (https://github.shilangyu.dev/pikevm-systemf/meeting-notes/2025_11_12.html#look-arounds-in-pikevm) and see if there is some self contained section that can be tackled. It is unlikely I can complete the entire formalization in the time that is left.
Instead, I will most likely focus on improving the meta engine. This will keep the topic of the thesis more contained. Here are possible next extensions:
Exploit Exact literals
If the regex has no groups, we can just run a substring search. This requires rechecking how we handle look-arounds during extraction.
Extract multiple prefixes
This is especially useful for case-insensitive searches. Then
something like /ab/i would extract 4 prefixes; “ab”, “Ab”,
“aB”, “AB” (all of them being a Prefix).
This is also generally useful for regexes allowing us to handle
disjunctions better: /a|b/ generates
Prefix "a" and Prefix "b".
There are specialized substring search algorithms that can use efficiently multiple needles.
This would require a bit of refactoring. Currently we have a theorem
which says “if the tree has a result at some input inp,
then inp starts with the extracted prefix”. We would either
have to rephrase it as “[…] starts with one of the extracted prefixes”.
While we are at it, a bigger generalization can be done (see next
point).
Instead of talking about substring searches, we can more generally talk about any strategy that advances our input
The only axiom it has to fulfill is to not skip input positions where there could potentially be a match. Then, substring searches (prefix acceleration) is one concrete instance of such a strategy. This is sound, but first it would be worth thinking about whether this abstraction is even needed (ie. is there any other useful instance that is not substring searching).
Extending the wrapper to do prefix acceleration multiple times
A backtracking engine does that, and Aurèle has recently completed a proof of a memoized backtracking engine
Actually have a meta engine
We could implement a generic “give me a regex and input and I will
give you a match” function that will dispatch it to different engines
depending on some heuristics. This should be eventually proven to be
equivalent to BuiltinExec.
Having finished the first big proof of a prefix accelerated PikeVM, we feel more secure about the changes I introduced and we can start merging my work into Linden. The work is split into three parts:
Impossible literal)First two have been now prepared and requested for merging:
Before the third part is merged we want to explore some possible refactorings of the proof. Mainly around the invariant in PikeEquiv.