r* we do a ResetRegs. We can remove all
ResetRegs where r has no groups.r* when the minimum matching length of
r is 1 or larger, the BeginLoop and
EndLoop is not needed (the boolean check will always
succeed)We define what it means to be an engine. It simply should accept a
regex and an input, and return the result of the matching. The only
axiom it must follow is that the returned result from the execution is
the same as the one from the backtracking tree semantics
first_leaf. We additionally allow engines to constrain the
supported subset of regexes (for instance, the PikeVM does not support
backreferences).
Then, we prove that the generic algorithm of
for i in 0..len(input):
if match(r, input[i..]):
return result
gives exactly the same result as
start_pos = input_search(extract_literal(r), input)
if start_pos == None:
return None
for i in start_pos..len(input):
if match(r, input[i..]):
return result
Namely, this talks about allowing the skip some initial input, but also talks about if the literal is not present in the input, then we can skip the search all together.
This proof works for any engine having the match
function!
Future work: an additional proof that the for i
algorithm is equivalent to just running the .*?r regex
(which, might be significantly more performant in some cases) yields the
proof that we can run the engine just once.
The previous result is generic to any engine. But it means we can run prefix acceleration only a single time. To run it more than once, we must tightly integrate with the way the PikeVM works.
The idea is two fold:
.*? directly in the PikeVM rather than
actually compiling .*?(see my paper notes in the drawer for a more in depth exploration of this topic)
lit = Impossible means that
there is no match.*?r.
Then prove that there exists a tree of this sort that computes this
.*?r. At a given position when starts_with is
true, it should always try r. But if it is false, it should
non-deterministically try r or not. Every tree should
result in the same first_leaf result.