2025_12

Collected notes

Meta engine

External vs Internal heuristics

By internal heuristics we mean that each engine given a regex and an input can answer whether it can perform a search or not. Then the meta engine just has an ordered list of engines to try without any heuristics
By external we mean that Meta might make decisions where for some (r1, inp1) and (r2, inp2) that are supported by both engine X and Y, Meta will choose X for the first and Y for the second. So Meta has some heuristics about the engines that make the list of engines not necessarily strictly ordered.

Internal make the Meta engine easier and encapsulate the logic per engine. However, external allow for more fine grained control over which engine to use and can lead to better results.

TODO: construct an example (r1, inp1) and (r2, inp2) with two engines X and Y showcasing the benefit of external heuristics

Analyzing regexes in the wild

Using the corpus from https://dl.acm.org/doi/10.1145/3236024.3236027 and the RegElk regex parser, I extended the stats.ml file to collect statistics about patterns the Meta engine wants to optimize. It would be very useful to also evaluate on what haystacks these regexes are ran. This is unfortunately not available in the corpora. Neither are the regex flags available in the corpora.

These statistics will guide which optimizations are worth implementing in the Meta engine. The updated file can be found here.

Kind	Description	Occurrence
Front-only literal	If a regex has an extractable literal in the front but none in the back	9.47%
Back-only literal	If a regex has an extractable literal in the back but none in the front	9.28%
Front literal with offset	If a regex has an extractable literal in the front that is offset by some fixed amount of unknown characters	2.00%
Back literal with offset	If a regex has an extractable literal in the back that is offset by some fixed amount of unknown characters	2.47%
Front and back literal	If a regex has an extractable literal in the front and back	38.06%
No assert exact literal	If a regex matches only a constant string and it contains no assertions (anchors or lookarounds)	14.97%
No assert nor group exact literal	If a regex matches only a constant string and it contains no assertions (anchors or lookarounds) and no capture groups	14.97%
Front anchored	If a regex is anchored to the start of a string with `^`	23.05%
Back anchored	If a regex is anchored to the end of a string with `$`	21.51%
Double anchored	If a regex is anchored to the start and the end of a string	12.68%
Potentially accidental capture	If a regex uses a capture group potentially only for grouping	12.33%
No captures	If a regex has no captures	62.09%

Offset literals: this is about regexes such as /..abc/. We can run prefix acceleration using “abc” and then go back two characters.

Potentially accidental capture: this is not possible to tell what a user actually wanted so it cannot be optimized. But it is interesting to see. Matching captures is generally expensive. We believe users might accidentally create capture groups when they only meant to group a part of the regex (to for example quantify it).

Anchored: ^ and $ anchors depend on the m (multiline) flag. If m is enabled, ^/$ talk about the start/end of lines rather than start/end of the entire input. Luckily, m is by default disabled, and as we know, humans tend to leave the defaults on.

Offset literals

Computing them requires heuristics. For example, given a regex like /a..abcde/ we can either extract (0, Prefix "a") or (3, Exact "abcde").

TODO: before this is implemented one would have to think how this affects complexity

Lean formalization of extended regular expression matching with lookarounds

Paper: https://dl.acm.org/doi/pdf/10.1145/3636501.3636959

Semantics:

Defined with a function models producing a Prop. This is due to negative occurrences that would be cumbersome to deal with. This function only models if a regex matches an input span, not where and how. In fact, there is no “how” as there are no capture groups. And the “where” is essentially the entire input.

The input is modeled as a “span” which is a slice of a string with reference to the entire string too. It is a triple: past characters, characters we must match (exhaustively), future characters. A similar concept is a “location” which is a tuple of past characters and next characters. “location” is what an “input” is in Linden.

Derivatives is the decider for these semantics. It is proven to be equivalent.

Capture groups:

The paper does add them to the formalization. It discusses shortly that supporting them is a major undertaking:

Formalizing even the core aspect of the capture semantics and algorithms of the nonbacktracking engine in Lean, based on its open-source implementation [15], is therefore a major undertaking with many challenges.

llmatch:

Until now the semantics did not talk about how to find a match anywhere in an input. llmatch is a function that computes the leftmost-longest match of a regex on a string. This will find matches present anywhere in the input and return the span.

Regex equivalence:

The equivalence notion is “r q is equiv iff for all strings r matches that string iff q matches that string”. This has been known to not be the ideal definition in PCRE semantics as it is not strong enough to reason about rewrites of regex subexpressions. This is due to the non-commutativity of alternations and groups. In this paper, a few rewrites are proven to be correct: https://github.com/ezhuchko/extended-regexes/blob/194f2da778ab20c734dcfaa29808d0d2b726cfdd/Regex/Rewrites.lean.

Progress presentation

I made a script and graphics for the progress presentation that will take place on 9.12.2025: https://github.shilangyu.dev/pikevm-systemf/progress/script.html.

The goal of the presentation is to update SYSTEMF on what I have been working on, show a bit of the proof, and explain what I will be working on next. Additionally, Ekaterina Zhuchko is coming to visit us this week, so it is a good way to show her what I am working on. This will set solid grounds for a chat with her.

Report writing

The Typst report skeleton is done.

Presentation feedback

Mention what PikeVM is earlier
Do mention that that transitions have priority in the NFA
Mention why prefixes are needed before introducing them
A “match” means either of the prefix or of the regex. It would be better to distinguish them two
When relating PikeTree and PikeVM trees and states, talk about the PikeVM+ instead

To discuss

Action items

Prepare PikeVM proof for merging. Later we might change the invariant, but that will be after the merge.