Collected notes

Extended semantics of the PikeVM

We extend the PikeVM so that it can run searches that simulate the .*? prefix as an opt-in feature. While simulating .*? we also perform the prefix acceleration heuristic. To do so, the PikeVM's state gains an additional field, next_prefix, an optional counter indicating after how many characters the remaining input next starts with the prefix of the regex. To run the standard PikeVM, we initialize next_prefix to None. To run the .*? version, we initialize next_prefix by computing where the prefix first appears in the input. This gives rise to three new rules in the semantics of the PikeVM:

| pvs_acc:
(* if there are no more active or blocked threads and we know where the next prefix matches, *)
(* we accelerate to that point *)
  forall inp best n nextinp seen
    (ADVANCE: advance_input_n inp (S n) forward = nextinp),
    pike_vm_step c lit (PVS inp [] best [] (Some n) seen) (PVS nextinp [pike_vm_initial_thread] best [] (next_prefix_counter nextinp lit) initial_seenpcs)

Normally, once we are out of blocked and active threads, we would end the search here. But if we know a position where the prefix next occurs, we jump to that point in the input and then recompute the next position at which the prefix is present. We had to adjust the existing pvs_final rule (which ends the search) to apply only when next_prefix is None. Notice, however, that if next_prefix is None because next_prefix_counter said so (meaning the prefix occurs nowhere else in the input), then the search can end here both for the run of just r and for the simulation of .*?.
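For reference, a sketch of what the adjusted pvs_final rule could look like. The result constructor and the exact shape of the conclusion are assumptions (the notes only say the rule is now guarded on next_prefix being None):

| pvs_final:
(* no active threads, no blocked threads, and no known next prefix position: *)
(* the search ends with the best result found so far *)
  forall inp best seen,
    pike_vm_step c lit (PVS inp [] best [] None seen) (Final best)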

The two other new rules are the following:

| pvs_nextchar_generate:
(* the prefix counter reached zero: the prefix matches at this position, *)
(* so besides advancing we spawn a new lowest-priority thread and recompute the counter *)
  forall inp1 inp2 best thr blocked seen
    (ADVANCE: advance_input inp1 forward = Some inp2),
    pike_vm_step c lit (PVS inp1 [] best (thr::blocked) (Some 0) seen) (PVS inp2 (thr::blocked ++ [pike_vm_initial_thread]) best [] (next_prefix_counter inp2 lit) initial_seenpcs)
| pvs_nextchar_filter:
(* the prefix counter is still positive: advance the input and decrement *)
(* the counter without spawning a new thread *)
  forall inp1 inp2 best thr blocked n seen
    (ADVANCE: advance_input inp1 forward = Some inp2),
    pike_vm_step c lit (PVS inp1 [] best (thr::blocked) (Some (S n)) seen) (PVS inp2 (thr::blocked) best [] (Some n) initial_seenpcs)

Whenever we advance the input by one character and next_prefix is not yet zero (the _filter rule), we just move the blocked threads into the active ones and decrement next_prefix. If it is zero (the _generate rule), we might find a match at this input position, so we must start a new (lowest-priority) thread here and recompute the next position at which the prefix is present. We later argue in the proofs that filtering out the generation of these new threads does not change the end result of the search. The existing pvs_nextchar rule had to be adjusted to apply only when next_prefix is None.
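The notes do not show next_prefix_counter itself. A plausible sketch, assuming it is built on the str_search primitive from the StrSearch section below and on a next_str function extracting the remaining input (both names appear later in these notes):

(* hypothetical sketch: the offset (relative to inp) at which the literal *)
(* prefix lit of the regex next occurs in the remaining input, if any *)
Definition next_prefix_counter (inp: input) (lit: string) : option nat :=
  str_search lit (next_str inp).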

Finally, the pvs_match rule had to be adjusted to set next_prefix to None whenever we find a match: since we want leftmost semantics, we should not start any new threads afterwards. The need for this was discovered during the proofs, meaning the semantics I defined a few weeks ago were incorrect. When implementing this prefix acceleration method in rust-regex I was not aware of this, but looking at the implementation now, it seems I accidentally did it correctly.
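A sketch of the adjusted pvs_match rule under these leftmost semantics. The match condition and the treatment of the thread's result are elided, and the exact shape is hypothetical; the point is only that the conclusion carries None in the next_prefix field:

| pvs_match:
(* the head thread found a match: record it, drop the lower-priority active *)
(* threads, and clear next_prefix so that no new thread is generated later *)
  forall inp thr active best blocked npref seen
    (MATCH: (* thr's pc is at a Match instruction *) True),
    pike_vm_step c lit (PVS inp (thr::active) best blocked npref seen)
                       (PVS inp [] (Some thr) blocked None seen)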

All existing proofs had to be adjusted for the new rules. This was mostly very straightforward, and the proofs did not require additional lemmas. A new functional version of the small-step semantics was written and proven equivalent to them. Additionally, I kept the old functional version in order to prove that when next_prefix is None, both functional versions are exactly the same! This reassures me that the extended PikeVM still encodes the base semantics when next_prefix is never initialized.

The new lemmas that were added are analogous to those that currently argue something about the PikeVM’s initial state (such as some invariant initialization). These lemmas had to be duplicated to argue the same thing about the initial state that enables the .*? simulation.

Extended semantics of the PikeTree

To still have an equivalence between the PikeTree and the PikeVM, the PikeTree has to be analogously extended to simulate the .*?. However, the PikeTree does not need to reason about prefixes and can be non-deterministic. After all, the goal of the PikeTree is only to prove the correctness of the PikeVM.

First, we extend the PikeTree state to store a nextt; this will be our connection to next_prefix. nextt is essentially the tree for the /.*?r/ regex. Such a regex boils down to a tree that is a choice between trying r on the current input and moving the input forward by one character, followed by the same kind of choice, until we run out of input to explore.
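One hypothetical way to generate such a tree over the remaining characters, purely to illustrate the shape that the NEXTT hypotheses in the rules enforce (tree_of_r, the char type, and the exact use of the constructors are assumptions; the first try of r is not included here since it is handed to the active trees separately):

(* each iteration reads one character, then chooses between trying r at the *)
(* new position and resetting the groups before the next iteration *)
Fixpoint lazy_star_tree (tree_of_r: list char -> list char -> tree)
                        (next pref: list char) : tree :=
  match next with
  | [] => tree_of_r [] pref
  | c :: rest =>
      Read c (Progress (Choice (tree_of_r rest (c::pref))
                               (GroupAction (Reset [])
                                  (lazy_star_tree tree_of_r rest (c::pref)))))
  end.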

The connection between backtracking trees and the PikeTree comes from proving that the result computed by the PikeTree stays the same after every step. To prove that the extended PikeTree can match /.*?r/ we add that nextt is part of the computed result.

What follows are the new semantic rules of the PikeTree. At the moment we are missing something analogous to PikeVM’s _acc rule. We imagine this would be a non-deterministic pruning of nextt’s Choice branches whenever the branch contains no result.

| pts_nextchar_generate:
  (* when the list of active trees is empty and the next tree is a segment of a lazy star prefix, *)
  (* restart from the blocked ones and the head iteration of the lazy star, proceeding to the next character *)
  (* resetting the seen trees *)
  forall inp c next pref best blocked tgm nextt t1 t2 seen
    (INPUT: inp = Input (c::next) pref)
    (NEXTT: nextt = Some (Read c
      (Progress
        (Choice
          t1
          (GroupAction (Reset []) t2))))
    )
    (SUBSET: pike_subtree t1),
    pike_tree_step (PTS inp [] best (tgm::blocked) nextt seen) (PTS (Input next (c::pref)) ((tgm::blocked) ++ [pike_tree_initial_tree t1]) best [] (Some t2) initial_seentrees)
| pts_nextchar_filter:
  (* when the list of active trees is empty and the next tree is a segment of a lazy star prefix, *)
  (* and the head iteration of the lazy star contains no result, *)
  (* restart from the blocked ones, proceeding to the next character *)
  (* resetting the seen trees *)
  forall inp c next pref best blocked tgm nextt t1 t2 seen
    (INPUT: inp = Input (c::next) pref)
    (NEXTT: nextt = Some (Read c
      (Progress
        (Choice
          t1
          (GroupAction (Reset []) t2))))
    )
    (LEAF: first_leaf t1 (Input next (c::pref)) = None),
    pike_tree_step (PTS inp [] best (tgm::blocked) nextt seen) (PTS (Input next (c::pref)) (tgm::blocked) best [] (Some t2) initial_seentrees)

These rules allow the PikeTree to non-deterministically generate new active trees from nextt. If the branch of nextt contains no result, we may skip it. The NEXTT hypothesis enforces that nextt has a form generated by the tree of /.*?r/. As before, some existing rules need to be changed to take nextt correctly into account.
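A rough sketch of what the missing acceleration analogue mentioned earlier could look like: a non-deterministic step that, when both active and blocked trees are empty, prunes a lazy-star iteration whose try-r branch holds no result. This rule is entirely hypothetical and only mirrors the pts_nextchar_filter pattern:

| pts_acc:
  (* when there are no active and no blocked trees, skip a lazy star iteration *)
  (* whose head branch contains no result *)
  forall inp c next pref best nextt t1 t2 seen
    (INPUT: inp = Input (c::next) pref)
    (NEXTT: nextt = Some (Read c
      (Progress
        (Choice
          t1
          (GroupAction (Reset []) t2))))
    )
    (LEAF: first_leaf t1 (Input next (c::pref)) = None),
    pike_tree_step (PTS inp [] best [] nextt seen) (PTS (Input next (c::pref)) [] best [] (Some t2) initial_seentrees)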

Similarly to the PikeVM, the existing lemmas have to be adjusted in a fairly mechanical way, and the lemmas talking about the initial state have to be duplicated to talk about the .*? initial state as well.

PikeTree-PikeVM equivalence

We can already state and prove the final theorem: a properly initialized PikeVM returns the result of /.*?r/. The proof goes through the PikeTree, and one theorem needs to be done more carefully: invariant_preservation. This theorem shows that both algorithms return the same result, via an invariant relating the trees of the PikeTree to the threads of the PikeVM, as well as a few other things. Since we added rules for nextt/next_prefix, we need to relate those too after each step of the algorithms. That is what I am currently working on.

StrSearch instance

So far we had an axiomatization of what a string search is. Now we provide an instance that uses brute-force search:

(* Search from position i onwards in string s for substring ss *)
Function brute_force_str_search (ss s: string) (i: nat) {measure (fun i => S (length s) - i) i} : option nat :=
  match Nat.leb i (length s) with
  | true => match starts_with_dec ss (List.skipn i s) with
    | left _ => Some i
    | right _ => brute_force_str_search ss s (S i)
    end
  | false => None
  end.
Proof.
  intros. apply Nat.leb_le in teq. lia.
Defined.

The definition uses Function so that we can use functional induction, which hides the ugliness of the measure. The definition was optimized for ease of proving. Additionally, we need a decidable version of the starts_with predicate, which we derive from a proof of decidability:

Lemma starts_with_dec:
  forall s1 s2, { starts_with s1 s2 } + { ~ starts_with s1 s2 }.

So our str_search ss s is brute_force_str_search ss s 0. Proofs of the axioms follow nicely from functional induction on brute_force_str_search; for each axiom we need a version generalized over i.
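Spelled out in Coq, the wrapper is simply (a sketch; the notes state this equation only in prose):

(* search the whole string by starting the brute-force search at position 0 *)
Definition str_search (ss s: string) : option nat :=
  brute_force_str_search ss s 0.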

This proves the axiomatization is satisfiable.

Use of LLMs

In the past I have tried to dispatch LLM agents to prove lemmas that seemed obviously true but whose proofs I did not want to waste time on. That has not been very successful: the agents could prove really simple lemmas but would often get stuck on slightly more involved ones. This time, it was actually useful! I wrote the definition of brute_force_str_search and proved one of the axioms, leaving two for the LLM. To my surprise, it rewrote brute_force_str_search so that i increases rather than decreases (I had written it as decreasing to avoid having to prove termination, since it would stop at i = 0), proved termination (not difficult), and then proved the remaining two axioms! It did so by realizing that i has to be generalized (the first axiom did not need a generalization when i was decreasing). That was nice. I used Copilot with Claude Sonnet 4.5.

input_search proofs

str_search operates on strings, so we have an analogous version that works on inputs. Under the hood it directly uses str_search. The str_search axioms are then transformed into lemmas about inputs, which are provable from those axioms. Until now, those lemmas were left unproven because it was not entirely obvious how to prove them. Tackling them now made me realize that one of the str_search axioms was wrong: there was a < in place of a \leq in the not_found axiom. The proofs required an intermediate lemma:

Lemma strict_suffix_forward_skipn:
  forall i2 i1,
    strict_suffix i2 i1 forward ->
    exists k, k > 0 /\ k <= length (next_str i1) /\ next_str i2 = List.skipn k (next_str i1).

This is essentially what connects strict_suffixes to the advancement of an input.
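For reference, the corrected not_found axiom plausibly has a shape like the following, with \leq where the old version had < (its actual name and statement are not shown in these notes, so this is an assumption; note that brute_force_str_search also probes position length s, where the suffix is empty):

(* hypothetical shape: a failed search means the pattern starts at no *)
(* position up to and including the end of the string *)
Axiom str_search_not_found:
  forall ss s,
    str_search ss s = None ->
    forall j, j <= length s -> ~ starts_with ss (List.skipn j s).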

To discuss

Action items