Collected notes

Oracle querying

I have QED’d the proof for oracle queries. This means, given a correct set of oracles which represent the lookarounds in the regex, querying the oracles instead of running the lookaround subregex yields the same final match result.

To do so, the PikeVM and PikeTree had to be modified. For the PikeVM, a new OracleQuery bytecode instruction was added. For the PikeTree, it just directly reads the lookaround tree.

Refactor

The current proof can be improved to address some issue. To recall, the pipeline of the proof is as follows:

  1. is_tree ~ bool_tree (neither of them have oracles)
  2. bool_tree ~ PikeTree (still no oracles)
  3. PikeTree ~ PikeVM (PikeVM has oracles)

The bulk of the proof is in the PikeTree ~ PikeVM relation because that is where we need to reason about the oracle correctness. This is fine, but this puts even more burden on the PikeEquiv.v file which already reasons about many other things. Additionally, for this proof we need to reason about the unicity of trees with regards to the bytecode: given some bytecode and two trees that represent that bytecode, the trees are propositionally equal. Oracles in the PikeVM break unicity if done naively. To address this, a “hack” was done by adding some metadata to the OracleQuery instruction which yields back the wanted unicity.

We are considering a few paths for a refactor, all of which would move that burden elsewhere (it will not simplify the proofs, just move them).

1. Modifying the bool_tree

We could add oracles directly to the bool_tree. This means that the PikeTree gets oracles too. Then, the proofs of bool_tree ~ PikeTree ~ PikeVM become trivial. The difficulty is moved to the is_tree ~ bool_tree which I imagine would follow the same proof approach as what I have currently done for PikeTree ~ PikeVM. But that means the tree on which bool_tree and is_tree operate are not the same (bool_tree’s tree has a OracleQuery node).

The last con is a very big one. So much so that this path is not going to be considered for now.

2. Modifying the PikeTree

We add oracles to the PikeTree. The PikeTree ~ PikeVM becomes trivial, burden is moved to the pts_preservation theorem. We can either make the PikeTree directly be using oracles, or have two versions (one that reads lookaround trees and one that queries oracles) and prove them equivalent. The only difference is whether we have the bulk of the proof at bool_tree ~ PikeTree or at PikeTree_oracle ~ PikeTree_lk.

To solve unicity we can weaken the statement. Instead of saying that two trees are propositionally equal, we can establish an equivalence between them that is strong enough for when we use the unicity theorem. Namely it would be equality modulo lookaround subtrees.

Oracle creation

Now that we can query oracles, we need to create them. To create them we must find out all positions where a lookaround’s match can possibly end at. To do so we only have to modify the small step rule concerning handling of the Accept bytecode instruction. Instead of it setting a best and discarding other lower priority active threads, it will remember that a match happened and keep all the lower priority threads.

Then, we must argue that these collected match points are enough to describe a lookaround oracle. To do so, I think four main theorems will be needed.

About the collected positions

Theorem all_leaves : forall pos tree, In pos (leaves tree) <-> In pos (match_positions (pike_tree tree))

This means that we indeed find all match positions.

About the reversal of regexes

Theorem reverse : forall inp inp' r, In inp' (leaves (tree_of r inp)) -> In (rev inp) (leaves (tree_of (rev r) (rev inp')))

Which means that matching in reverse from the end position succeeds.

About the backward tree

I am not yet sure what will be the exact statement. But this should talk about the backward direction of the tree semantics which we currently have.

About the correctness of the positions

Theorem collect_correct : forall r inp inp0, In inp (match_positions (pike_tree (tree_of /.*?r/ inp0))) -> NonEmpty (leaves (tree_of (rev r) inp))

Which means the positions we have found describe exactly all the positions a regex can match.

One thing that all of these definitions are not describing correctly is that the collected leaves are actually those that are suffixes of our start position. We must restrict every definition to suffixes, or just actually find all positions even for prefixes of the input string.

To discuss

Action items