Collected notes

New pike_subset

To support lookarounds in the PikeVM need to add them to the pike_regex inductive indicating that the PikeVM is able to compute match results for regexes containing lookarounds. The algorithm for matching lookarounds is split into 3 phases (oracle creation, matching with oracles, group reconstruction). In the phase of matching with oracles, we will compute the result of a match for regexes with lookarounds. However, the match result will not have the correct values for capture groups, these are obtained separately in the last phase. That means we would have to alter all correctness theorems to say “the result of the PikeVM is correct modulo the capture values in lookarounds”. This would be quite tedious, so instead I modified pike_regex to accept only lookarounds without capture groups. We have an analogous pike_subtree saying what backtracking tree is supported by the PikeTree. They are the subtrees that have LK/LKFail nodes containing no GroupAction (Open _) nodes.

We might later need a theorem saying that if for a regex r, pike_regex (remove_captures r) holds, then the result of matching r and remove_captures r is the same if we disregard capture values.

For Exact literals I have already proven that if we are matching a regex without capture groups, the match result will not modify the group map assignments. This theorem was directly used here.

Lookarounds in PikeTree

The PikeTree reads LK tree nodes to extract the new group map if the lookaround subtree contains results, otherwise bails. Technically, we should not need to do that since if a tree node is LK, we know it has to contain a result. However, this is true only if we know the tree is well formed with respect to the is_tree inductive. This is not the case when reasoning about the PikeTree in isolation where operate on arbitrary trees.

This requires adjusting PikeTree’s lemmas about the nondeterministic results the PikeTree can produce and finally proving that they all lead to the same answer.

This reading of lookarounds happens in a single step in the PikeTree, which will simplify relating it to the PikeVM where it is also a single step (by querying the oracle).

Lookarounds in PikeVM

First, we add a new NFA bytecode instruction: OracleQuery (n: nat) (positive: bool). We compile all lookarounds to this instruction giving each lookaround a unique identifier n and whether we expect the response to be positive. An oracle has type input -> bool. When querying the oracle, if the response is equal to positive, we continue execution of the PikeVM. If not, the current thread dies. For lookarounds, positive = true for positive lookarounds ((?<=), (?=)) and positive = false for negative lookarounds ((?<!), (?!)). I am considering removing the positive argument and just proving oracles that flip the returned boolean for negative lookarounds.

For the PikeVM, I parametrize everything with a list of oracles. Their position in that list corresponds to the index in OracleQuery. The oracles are queried in epsilon_step, producing no new threads (dying) if the query returns an unwanted result. No new rules have to be added to PikeVM’s small step semantics.

Correctness

We must relate the querying of oracles to reading the subtree of a lookaround. To do so we must first relate the oracles to the individual OracleQuery instructions. I decided to not impose any restriction on what an oracle is other than having the type input -> bool in the PikeVM. Ideally we want to say that the oracle will correctly answer to some questions regarding the tree semantics. But just like the PikeTree does not know how a tree was produced, the PikeVM does not know about any trees let alone the regex it is matching. The PikeVM only knows about the bytecode it executes. To axiomatize the oracles, they would have to refer to the regex of the lookaround. Pushing this information down the PikeVM would litter many definitions. Instead, in the correctness proofs (where we are talking about specific trees and regexes) I will talk about a well-founded list of oracles that directly corresponds to the bytecode oracle queries. This unfortunately means the proof will not prove a generic oracle strategy for the PikeVM, but the specific one for lookarounds.

With this approach I have reduced everything to a single admit in generate_active. I need to work on an invariant that will capture the relation between oracles and OracleQuery, and then OracleQuery and LK/LKFail backtracking trees.

TreeRep

The proof of correctness relies on a lemma stating that the bytecode uniquely identifies backtracking trees. I don’t think it is true with the new changes. The regexes (?=a) and (?=b) generate different trees, but when compiled generate the same bytecode. I need to examine why this property is needed in the proofs and if I can weaken the lemma. If not, I need to think of a (reasonable) way of making the OracleQuery instruction unique (for example by adding to it the entire subregex of that lookaround).

To discuss

Action items