Collected notes

PikeVM bytecode optimizer

Prefix acceleration once with abstract engines

We define what it means to be an engine. It simply should accept a regex and an input, and return the result of the matching. The only axiom it must follow is that the returned result from the execution is the same as the one from the backtracking tree semantics first_leaf. We additionally allow engines to constrain the supported subset of regexes (for instance, the PikeVM does not support backreferences).

Then, we prove that the generic algorithm of

for i in 0..len(input):
    if match(r, input[i..]):
        return result

gives exactly the same result as

start_pos = input_search(extract_literal(r), input)
if start_pos == None:
    return None

for i in start_pos..len(input):
    if match(r, input[i..]):
        return result

Namely, this talks about allowing the skip some initial input, but also talks about if the literal is not present in the input, then we can skip the search all together.

This proof works for any engine having the match function!

Future work: an additional proof that the for i algorithm is equivalent to just running the .*?r regex (which, might be significantly more performant in some cases) yields the proof that we can run the engine just once.

Implementation of the PikeVM algorithm with prefix acceleration

The previous result is generic to any engine. But it means we can run prefix acceleration only a single time. To run it more than once, we must tightly integrate with the way the PikeVM works.

The idea is two fold:

  1. Simulate the .*? directly in the PikeVM rather than actually compiling .*?
  2. Whenever the active/blocked threads of the PikeVM are in a favorable arrangement, run the prefix acceleration

(see my paper notes in the drawer for a more in depth exploration of this topic)

To discuss

Action items