Collected notes

Collecting all leaves

Normally, the implemented algorithms look for the highest priority match. We can however trick some of them into collecting all possible matches instead of just the best one.

Here I tricked the PikeVM. Normally when it would reach a state that indicates that we have found a match, it would save this and discard all other states which it knows produces lower priority matches. Now, instead of doing that it remembers that it found a match and continues looking for more matches (even if lower priority!). This is done through the occurrence variant which decides if we want to find only the best (highest priority) or all matches.

(* kinds of occurrence states *)
Variant occurrence :=
(* we only collect the highest priority result *)
| Best (best: option leaf)
(* we collect end positions of matches regardless of priority *)
| All (positions: list leaf).

Then we must react differently to a found match:

(* depending on how we collect occurrences, we handle the `Accept` instruction *)
(* we update the `occurrence` state and the list of active threads/trees *)
Definition accept {A} (occ: occurrence) (inp: input) (gm: group_map) (active: list A): list A * occurrence :=
  match occ with
  (* we kill all lower priority threads/trees *)
  | Best _ => ([], Best (Some (inp, gm)))
  (* we keep all threads/trees to produce lower priority results *)
  | All positions => (active, All ((inp, gm) :: positions))
  end.

We now must reason about the correctness of the PikeVM in two different ways.

  1. Best: the correctness statement is as before, the result of the PikeVM is the first_leaf of the backtracking tree
  2. All: here we want to say that the set of collected match positions (so we exclude group maps) is the same as the set of match positions from all leaves of the backtracking tree

We don’t consider group maps for All because the current implementation for it does not actually collect all leaves, but all of them modulo group maps. Luckily, in practice we indeed only care about these match positions.

Linear match-all

In a match-all we want to find all highest priority non-overlapping matches for a regex on some input. Doing so in quadratic time is easy: call repeatedly the “find one match” version until you collect all results. The new All mode cannot be used as-is to define a match-all search. But the idea to transform it into a match-all is possible and would be nice to investigate in the future, it was for example implemented in Nim’s regex library https://github.com/nitely/nim-regex/pull/71.

New PikeTree

All definitions in the PikeTree have been generalized. Before, they talked about the first_leaf of a tree. That was not sufficient to say anything about how All matching should be done. As stated, for All we want to say that the collected results are compatible with all the leaves of the backtracking tree. If the definitions always project the tree into its first leaf, we lose the needed information. Thus, all definitions are now expressed in terms of all leaves, and the projection happens much later when we want to either talk about the Best case or the All case.

The following was generalized:

  1. seqop replaced with ++. To combine two results of first_leaf (which returns an option leaf) we used seqop which says “if the first result is None, take the second one otherwise keep the first result”. Since now we want to talk about all leaves (tree_leaves) this operation generalizes to the concatenation of two lists
  2. tree_nd talks about all the possible non-deterministic leaves of a tree. It is essentially the tree_leaves but some branches can be skipped due to it being already seen before.
  3. list_nd talks about all the possible non-deterministic leaves of trees in a list.
  4. list_result talks about all leaves of trees in a list.

Finally, the PikeTree invariant was similarly generalized. After adjusting the proofs, everything neatly passes without bigger changes. What is left to prove is that the “we collect all match positions” is preserved when the PikeTree takes a step.

Different generalization

Generalizing to tree_leaves might not be enough. For the third phase of lookarounds, during matching we need to keep track when we last used an oracle successfully. This will be needed to then reconstruct groups in the correct place. If we just return all leaves, this might not be enough information to be able to reason about these last usages.

The suggestion is to instead parametrize everything (tree_nd/list_nd/list_results/state_nd etc) by the occurrence type and collect different things at different times depending on the kind. We could also then specialize all the definition back to collecting just first_leaf whenever doing Best. To me it does not yet seem clear that this would allow us to collect last usages or that this is in general preferable as a formalization. Since for now it is not clear, I will leave the existing generalization as-is and revisit it during group reconstruction.

4 PikeVM modes

In my master thesis I introduced a second mode of execution for the PikeVM: optimized unanchored search. Instead of finding a match at a specific place of the input, it can find the match anywhere in the input. Adding the new All mode duplicates the amount of execution modes of the PikeVM by 2 again. In practice we are only ever interested in running the All mode with the unanchored mode. Luckily, the proofs nicely compose and we get for free that we can find all match positions of a regex anywhere in the input!

To discuss

Action items