pike_subsetTo support lookarounds in the PikeVM need to add them to the
pike_regex inductive indicating that the PikeVM is able to
compute match results for regexes containing lookarounds. The algorithm
for matching lookarounds is split into 3 phases (oracle creation,
matching with oracles, group reconstruction). In the phase of matching
with oracles, we will compute the result of a match for regexes with
lookarounds. However, the match result will not have the correct values
for capture groups, these are obtained separately in the last phase.
That means we would have to alter all correctness theorems to say “the
result of the PikeVM is correct modulo the capture values in
lookarounds”. This would be quite tedious, so instead I modified
pike_regex to accept only lookarounds without capture
groups. We have an analogous pike_subtree saying what
backtracking tree is supported by the PikeTree. They are the subtrees
that have LK/LKFail nodes containing no
GroupAction (Open _) nodes.
We might later need a theorem saying that if for a regex
r, pike_regex (remove_captures r) holds, then
the result of matching r and remove_captures r
is the same if we disregard capture values.
For Exact literals I have already proven that if we are
matching a regex without capture groups, the match result will not
modify the group map assignments. This theorem was directly used
here.
The PikeTree reads LK tree nodes to extract the new
group map if the lookaround subtree contains results, otherwise bails.
Technically, we should not need to do that since if a tree node is
LK, we know it has to contain a result. However, this is
true only if we know the tree is well formed with respect to the
is_tree inductive. This is not the case when reasoning
about the PikeTree in isolation where operate on arbitrary trees.
This requires adjusting PikeTree’s lemmas about the nondeterministic results the PikeTree can produce and finally proving that they all lead to the same answer.
This reading of lookarounds happens in a single step in the PikeTree, which will simplify relating it to the PikeVM where it is also a single step (by querying the oracle).
First, we add a new NFA bytecode instruction:
OracleQuery (n: nat) (positive: bool). We compile all
lookarounds to this instruction giving each lookaround a unique
identifier n and whether we expect the response to be
positive. An oracle has type input -> bool. When
querying the oracle, if the response is equal to positive,
we continue execution of the PikeVM. If not, the current thread dies.
For lookarounds, positive = true for positive lookarounds
((?<=), (?=)) and
positive = false for negative lookarounds
((?<!), (?!)). I am considering removing
the positive argument and just proving oracles that flip
the returned boolean for negative lookarounds.
For the PikeVM, I parametrize everything with a list of oracles.
Their position in that list corresponds to the index in
OracleQuery. The oracles are queried in
epsilon_step, producing no new threads (dying) if the query
returns an unwanted result. No new rules have to be added to PikeVM’s
small step semantics.
We must relate the querying of oracles to reading the subtree of a
lookaround. To do so we must first relate the oracles to the individual
OracleQuery instructions. I decided to not impose
any restriction on what an oracle is other than having the type
input -> bool in the PikeVM. Ideally we want to say that
the oracle will correctly answer to some questions regarding the tree
semantics. But just like the PikeTree does not know how a tree was
produced, the PikeVM does not know about any trees let alone the regex
it is matching. The PikeVM only knows about the bytecode it executes. To
axiomatize the oracles, they would have to refer to the regex of the
lookaround. Pushing this information down the PikeVM would litter many
definitions. Instead, in the correctness proofs (where we are talking
about specific trees and regexes) I will talk about a well-founded list
of oracles that directly corresponds to the bytecode oracle queries.
This unfortunately means the proof will not prove a generic oracle
strategy for the PikeVM, but the specific one for lookarounds.
With this approach I have reduced everything to a single
admit in generate_active. I need to work on an
invariant that will capture the relation between oracles and
OracleQuery, and then OracleQuery and
LK/LKFail backtracking trees.
The proof of correctness relies on a lemma stating that the bytecode
uniquely identifies backtracking trees. I don’t think it is true with
the new changes. The regexes (?=a) and (?=b)
generate different trees, but when compiled generate the same bytecode.
I need to examine why this property is needed in the proofs and if I can
weaken the lemma. If not, I need to think of a (reasonable) way of
making the OracleQuery instruction unique (for example by
adding to it the entire subregex of that lookaround).