2025_11

Collected notes

Porting Warblre to Rocq 9

Firstly, I must say, it is very nice that “Coq” is not a substring of “Rocq”. Doing a grep of “coq” returns only things that need to be migrated.

It is quite unfortunate how big of an undertaking renaming Coq to Rocq is. Anyway, to port Linden to Rocq (which should be a very small effort), we first must port Warblre. I have sent a PR: https://github.com/LindenRegex/Warblre/pull/6. There are a few blockers, alectryon using SerAPI (which does not work with Rocq 9) being the main one. Rocq 9 seems to have fixed quite a few bugs that were hit by Warblre which allowed to clear some LATER’s.

\square of the prefix accelerated PikeVM!

PikeTree acceleration

Before we were missing a step in the PikeTree which would be analogous to PikeVM’s _acc step (one that jumps to the next input position where the prefix matches). This was now added and is the following:

| pts_acc:
  (* if there are no more active or blocked trees and we have some nextt, *)
  (* we accelerate by non-deterministically skipping branches with no results *)
    forall inp best seen nextinp nextt acc t
      (ACC: pike_tree_acc inp nextt nextinp acc t),
      pike_tree_step (PTS inp [] best [] (Some nextt) seen) (PTS nextinp [pike_tree_initial_tree t] best [] (Some acc) initial_seentrees)

pike_tree_acc is a non-deterministic skipping of head branches that do not contain a match. The definition is the following:

Inductive pike_tree_acc : input -> tree -> input -> tree -> tree -> Prop :=
| acc_keep:
    forall inp c next pref nextt t1 t2
    (INPUT: inp = Input (c::next) pref)
    (NEXTT: nextt = Read c
      (Progress
        (Choice
          t1
          (GroupAction (Reset []) t2)))
    ),
    pike_tree_acc inp nextt (Input next (c::pref)) t2 t1
| acc_skip:
    forall inp c next pref nextt t1 t2 nextinp acc t
    (INPUT: inp = Input (c::next) pref)
    (NEXTT: nextt = Read c
      (Progress
        (Choice
          t1
          (GroupAction (Reset []) t2)))
    )
    (LEAF: first_leaf t1 (Input next (c::pref)) = None)
    (TRANS: pike_tree_acc (Input next (c::pref)) t2 nextinp acc t),
    pike_tree_acc inp nextt nextinp acc t.

The transitive rule exists only if we skipped a branch, we do not want to skip some branches in the middle of the tree. So the final nextinp is the input point to which we accelerated, acc is the next nextt, and t is the backtracking tree of r at nextinp.

The proof of preservation of the PikeTree invariant for the _acc step requires induction over pike_tree_acc. We relate pts_acc to the PikeVM by skipping exactly nextprefix amount of head branches. Since nextprefix indicates that there is no prefix at the next nextprefix input positions, we know the first_leaf will be None.

Additionally, the pts_final rule had to be updated:

| pts_final:
  (* moving to a final state when there are no more active or blocked trees *)
    forall inp best nextt seen
      (LEAF: option_flat_map (fun t => first_leaf t inp) nextt = None),
      pike_tree_step (PTS inp [] best [] nextt seen) (PTS_final best)

This option should be used only whenever nextt contains no results. We cannot just say that we use this rule whenever nextt = None as in the PikeVM. The reason is that nextprefix = None -> nextt = None does not hold, more details in the discussion of the PikeEquiv invariant.

PikeEquiv invariant

We want to prove that for every execution of the PikeVM, there exists one of the PikeTree which preserves some invariant. Until now it talked about how we can related active threads to active trees, blocked threads to blocked trees, etc. Now we must relate nextt to nextprefix. The intuitive idea is that we want to say two things:

nextprefix describes how many top branches of nextt do not contain a result
nextt has a shape of Read -> Choice (tree_of r) (next_nextt)

The invariant is split into five cases:

When nextprefix = None, nextt = None
When nextprefix = None, nextt = Some Mismatch (we exhausted nextt)
When nextprefix = None, nextt = Some (Read -> Choice t1 t2) where t1 contains no results

Case 2 and 3 is needed because the PikeTree never turns a Some nextt into a None. It will skip head branches one by one. On the other hand, if the PikeVM does not find the next position of a prefix, nextprefix will turn into None.
When nextprefix = Some 0, nextt = Some (Read -> Choice t1 t2) where t1 and t2 have the correct shape
When nextprefix = Some (S n), nextt = Some (Read -> Choice t1 t2) where t2 has the correct shape and t1 contains no results

This invariant is proven to hold at the initial state and is preserved on steps. This concludes the proof of correctness of the prefix accelerated PikeVM!

Look-arounds?

A part of my thesis was supposed to be around formalizing look-arounds in the PikeVM. There is a bit less than 2 months left of my thesis. I will dedicate a day to think about what kind of lemmas we need (https://github.shilangyu.dev/pikevm-systemf/meeting-notes/2025_11_12.html#look-arounds-in-pikevm) and see if there is some self contained section that can be tackled. It is unlikely I can complete the entire formalization in the time that is left.

Instead, I will most likely focus on improving the meta engine. This will keep the topic of the thesis more contained. Here are possible next extensions:

Exploit Exact literals

If the regex has no groups, we can just run a substring search. This requires rechecking how we handle look-arounds during extraction.
Extract multiple prefixes

This is especially useful for case-insensitive searches. Then something like /ab/i would extract 4 prefixes; “ab”, “Ab”, “aB”, “AB” (all of them being a Prefix).

This is also generally useful for regexes allowing us to handle disjunctions better: /a|b/ generates Prefix "a" and Prefix "b".

There are specialized substring search algorithms that can use efficiently multiple needles.

This would require a bit of refactoring. Currently we have a theorem which says “if the tree has a result at some input inp, then inp starts with the extracted prefix”. We would either have to rephrase it as “[…] starts with one of the extracted prefixes”. While we are at it, a bigger generalization can be done (see next point).
Instead of talking about substring searches, we can more generally talk about any strategy that advances our input

The only axiom it has to fulfill is to not skip input positions where there could potentially be a match. Then, substring searches (prefix acceleration) is one concrete instance of such a strategy. This is sound, but first it would be worth thinking about whether this abstraction is even needed (ie. is there any other useful instance that is not substring searching).
Extending the wrapper to do prefix acceleration multiple times

A backtracking engine does that, and Aurèle has recently completed a proof of a memoized backtracking engine
Actually have a meta engine

We could implement a generic “give me a regex and input and I will give you a match” function that will dispatch it to different engines depending on some heuristics. This should be eventually proven to be equivalent to BuiltinExec.

PRs to Linden

Having finished the first big proof of a prefix accelerated PikeVM, we feel more secure about the changes I introduced and we can start merging my work into Linden. The work is split into three parts:

Substring searches, literal extraction and their relation to regex matching
Small meta engine (running prefix acceleration once at the start, handling the Impossible literal)
Prefix accelerated PikeVM and its correctness proof

First two have been now prepared and requested for merging:

Before the third part is merged we want to explore some possible refactorings of the proof. Mainly around the invariant in PikeEquiv.

To discuss

Upgrade Linden/Warblre to Rocq 9? -> yes, I will work on it on the side

Action items

Open first two PRs to Linden
Start writing the report
Refactor PikeVM proof to be mergeable
Think about look-arounds lemmas