Agenda

Introduce myself
I want you to leave with an understanding of what the PikeVM is how prefix acceleration works
Give an intuition of how the proof was approached
Talk about my future work

PikeVM

A PikeVM executes an NFA against some input string
But only at a particular string position!
- we want to find anywhere in the string
Let’s look at how the PikeVM works and try to fix it

Order of states gives priority
But there is a match later in the string! Let’s try to find matches anywhere in the input

We cannot just run the engine at every input position: this will ruin our complexity

Lazy dot star, if we fail to find matches we will advance the input by one and try r again
- Lazy ? means we find the left-most match. Without it it would be the right-most match
Simulate it ourselves
- Better performance
- Allows for optimizations

Implemented in the rust-regex crate (Optim 2 already present, optim 1 implemented by me)
This is done by doing classical substring searches which are much faster than running an engine
- We keep a counter indicating in how many characters the prefix will match
- If the counter is non-zero, we do Optim 1
In rebar which is supposed to be a real world benchmarking suit
1. optim 2 can lead to 600x times speedups -> skipping large portions of the input
2. optim 1 can lead to 20x times speedups -> not keeping the PikeVM alive when not needed
Arguably the most important regex matching optimization for large inputs

Hopefully by now you have the intuition that this algorithm is correct. But how to approach the proof?

Proven to follow exactly the ECMAScript 2023 semantics
Materialization of matching a regex against a specific input (execution trace)
- This is opposed to the NFA, which is only a program
As we said, we want to essentially return the same result as for /.*?r/
The result is the leftmost accepting leaf

Operates on backtracking trees -> firstLeaf equivalence
- Each step of the PikeTree preserves the result
Explores like the PikeVM -> PikeVM equivalence
- There exists an execution of the PikeTree which corresponds to one from the PikeVM
The subtrees are \alpha at different positions
When we added \alpha in the PikeVM, here we add a subtree. But we optimize by not adding the subtree if it contains no results

Each engine has it’s own strengths and weaknesses
- Some support only a subset of regex features, some have high memory requirements
- OnePass: only NFAs that have one transition per character+state
- Lazy DFA: no groups support
- PikeVM: slow, supports everything
- Memoized backtracker: memory intensive
Everything is composed to find matches the fastest way
- Sometimes it is faster to use the Lazy DFA to find the bounds of the match, and then run an anchored PikeVM

To guide optimizations, I will analyze the corpus of regexes used in the wild