Collected notes

Meta engine

External vs Internal heuristics

Internal make the Meta engine easier and encapsulate the logic per engine. However, external allow for more fine grained control over which engine to use and can lead to better results.

TODO: construct an example (r1, inp1) and (r2, inp2) with two engines X and Y showcasing the benefit of external heuristics

Analyzing regexes in the wild

Using the corpus from https://dl.acm.org/doi/10.1145/3236024.3236027 and the RegElk regex parser, I extended the stats.ml file to collect statistics about patterns the Meta engine wants to optimize. It would be very useful to also evaluate on what haystacks these regexes are ran. This is unfortunately not available in the corpora. Neither are the regex flags available in the corpora.

These statistics will guide which optimizations are worth implementing in the Meta engine. The updated file can be found here.

Kind Description Occurrence
Front-only literal If a regex has an extractable literal in the front but none in the back 9.47%
Back-only literal If a regex has an extractable literal in the back but none in the front 9.28%
Front literal with offset If a regex has an extractable literal in the front that is offset by some fixed amount of unknown characters 2.00%
Back literal with offset If a regex has an extractable literal in the back that is offset by some fixed amount of unknown characters 2.47%
Front and back literal If a regex has an extractable literal in the front and back 38.06%
No assert exact literal If a regex matches only a constant string and it contains no assertions (anchors or lookarounds) 14.97%
No assert nor group exact literal If a regex matches only a constant string and it contains no assertions (anchors or lookarounds) and no capture groups 14.97%
Front anchored If a regex is anchored to the start of a string with ^ 23.05%
Back anchored If a regex is anchored to the end of a string with $ 21.51%
Double anchored If a regex is anchored to the start and the end of a string 12.68%
Potentially accidental capture If a regex uses a capture group potentially only for grouping 12.33%
No captures If a regex has no captures 62.09%

Offset literals: this is about regexes such as /..abc/. We can run prefix acceleration using “abc” and then go back two characters.

Potentially accidental capture: this is not possible to tell what a user actually wanted so it cannot be optimized. But it is interesting to see. Matching captures is generally expensive. We believe users might accidentally create capture groups when they only meant to group a part of the regex (to for example quantify it).

Anchored: ^ and $ anchors depend on the m (multiline) flag. If m is enabled, ^/$ talk about the start/end of lines rather than start/end of the entire input. Luckily, m is by default disabled, and as we know, humans tend to leave the defaults on.

Offset literals

Computing them requires heuristics. For example, given a regex like /a..abcde/ we can either extract (0, Prefix "a") or (3, Exact "abcde").

TODO: before this is implemented one would have to think how this affects complexity

Lean formalization of extended regular expression matching with lookarounds

Paper: https://dl.acm.org/doi/pdf/10.1145/3636501.3636959

Semantics:

Defined with a function models producing a Prop. This is due to negative occurrences that would be cumbersome to deal with. This function only models if a regex matches an input span, not where and how. In fact, there is no “how” as there are no capture groups. And the “where” is essentially the entire input.

The input is modeled as a “span” which is a slice of a string with reference to the entire string too. It is a triple: past characters, characters we must match (exhaustively), future characters. A similar concept is a “location” which is a tuple of past characters and next characters. “location” is what an “input” is in Linden.

Derivatives is the decider for these semantics. It is proven to be equivalent.

Capture groups:

The paper does add them to the formalization. It discusses shortly that supporting them is a major undertaking:

Formalizing even the core aspect of the capture semantics and algorithms of the nonbacktracking engine in Lean, based on its open-source implementation [15], is therefore a major undertaking with many challenges.

llmatch:

Until now the semantics did not talk about how to find a match anywhere in an input. llmatch is a function that computes the leftmost-longest match of a regex on a string. This will find matches present anywhere in the input and return the span.

Regex equivalence:

The equivalence notion is “r q is equiv iff for all strings r matches that string iff q matches that string”. This has been known to not be the ideal definition in PCRE semantics as it is not strong enough to reason about rewrites of regex subexpressions. This is due to the non-commutativity of alternations and groups. In this paper, a few rewrites are proven to be correct: https://github.com/ezhuchko/extended-regexes/blob/194f2da778ab20c734dcfaa29808d0d2b726cfdd/Regex/Rewrites.lean.

Progress presentation

I made a script and graphics for the progress presentation that will take place on 9.12.2025: https://github.shilangyu.dev/pikevm-systemf/progress/script.html.

The goal of the presentation is to update SYSTEMF on what I have been working on, show a bit of the proof, and explain what I will be working on next. Additionally, Ekaterina Zhuchko is coming to visit us this week, so it is a good way to show her what I am working on. This will set solid grounds for a chat with her.

Report writing

The Typst report skeleton is done.

Presentation feedback

To discuss

Action items