Normally, the implemented algorithms look for the highest priority match. We can however trick some of them into collecting all possible matches instead of just the best one.
Here I tricked the PikeVM. Normally when it would reach a state that
indicates that we have found a match, it would save this and discard all
other states which it knows produces lower priority matches. Now,
instead of doing that it remembers that it found a match and continues
looking for more matches (even if lower priority!). This is done through
the occurrence variant which decides if we want to find
only the best (highest priority) or all matches.
(* kinds of occurrence states *)
Variant occurrence :=
(* we only collect the highest priority result *)
| Best (best: option leaf)
(* we collect end positions of matches regardless of priority *)
| All (positions: list leaf).
Then we must react differently to a found match:
(* depending on how we collect occurrences, we handle the `Accept` instruction *)
(* we update the `occurrence` state and the list of active threads/trees *)
Definition accept {A} (occ: occurrence) (inp: input) (gm: group_map) (active: list A): list A * occurrence :=
match occ with
(* we kill all lower priority threads/trees *)
| Best _ => ([], Best (Some (inp, gm)))
(* we keep all threads/trees to produce lower priority results *)
| All positions => (active, All ((inp, gm) :: positions))
end.
We now must reason about the correctness of the PikeVM in two different ways.
first_leaf of the backtracking
treeWe don’t consider group maps for All because the current
implementation for it does not actually collect all leaves, but all of
them modulo group maps. Luckily, in practice we indeed only care about
these match positions.
In a match-all we want to find all highest priority non-overlapping
matches for a regex on some input. Doing so in quadratic time is easy:
call repeatedly the “find one match” version until you collect all
results. The new All mode cannot be used as-is to define a
match-all search. But the idea to transform it into a match-all is
possible and would be nice to investigate in the future, it was for
example implemented in Nim’s regex library https://github.com/nitely/nim-regex/pull/71.
All definitions in the PikeTree have been generalized. Before, they
talked about the first_leaf of a tree. That was not
sufficient to say anything about how All matching should be
done. As stated, for All we want to say that the collected
results are compatible with all the leaves of the backtracking tree. If
the definitions always project the tree into its first leaf, we lose the
needed information. Thus, all definitions are now expressed in terms of
all leaves, and the projection happens much later when we want to either
talk about the Best case or the All case.
The following was generalized:
seqop replaced with ++. To combine two
results of first_leaf (which returns an
option leaf) we used seqop which says “if the
first result is None, take the second one otherwise keep
the first result”. Since now we want to talk about all leaves
(tree_leaves) this operation generalizes to the
concatenation of two liststree_nd talks about all the possible non-deterministic
leaves of a tree. It is essentially the tree_leaves but
some branches can be skipped due to it being already seen before.list_nd talks about all the possible non-deterministic
leaves of trees in a list.list_result talks about all leaves of trees in a
list.Finally, the PikeTree invariant was similarly generalized. After adjusting the proofs, everything neatly passes without bigger changes. What is left to prove is that the “we collect all match positions” is preserved when the PikeTree takes a step.
Generalizing to tree_leaves might not be enough. For the
third phase of lookarounds, during matching we need to keep track when
we last used an oracle successfully. This will be needed to then
reconstruct groups in the correct place. If we just return all leaves,
this might not be enough information to be able to reason about these
last usages.
The suggestion is to instead parametrize everything
(tree_nd/list_nd/list_results/state_nd
etc) by the occurrence type and collect different things at
different times depending on the kind. We could also then specialize all
the definition back to collecting just first_leaf whenever
doing Best. To me it does not yet seem clear that this
would allow us to collect last usages or that this is in general
preferable as a formalization. Since for now it is not clear, I will
leave the existing generalization as-is and revisit it during group
reconstruction.
In my master thesis I introduced a second mode of execution for the
PikeVM: optimized unanchored search. Instead of finding a match at a
specific place of the input, it can find the match anywhere in the
input. Adding the new All mode duplicates the amount of
execution modes of the PikeVM by 2 again. In practice we are only ever
interested in running the All mode with the unanchored
mode. Luckily, the proofs nicely compose and we get for free that we can
find all match positions of a regex
anywhere in the input!