r1, inp1) and
(r2, inp2) that are supported by both engine X
and Y, Meta will choose X for the first and Y for the second. So Meta
has some heuristics about the engines that make the list of engines not
necessarily strictly ordered.Internal make the Meta engine easier and encapsulate the logic per engine. However, external allow for more fine grained control over which engine to use and can lead to better results.
TODO: construct an example (r1, inp1) and
(r2, inp2) with two engines X and Y showcasing
the benefit of external heuristics
Using the corpus from https://dl.acm.org/doi/10.1145/3236024.3236027 and the
RegElk regex parser,
I extended the stats.ml
file to collect statistics about patterns the Meta engine wants to
optimize. It would be very useful to also evaluate on what haystacks
these regexes are ran. This is unfortunately not available in the
corpora. Neither are the regex flags available in the corpora.
These statistics will guide which optimizations are worth implementing in the Meta engine. The updated file can be found here.
| Kind | Description | Occurrence |
|---|---|---|
| Front-only literal | If a regex has an extractable literal in the front but none in the back | 9.47% |
| Back-only literal | If a regex has an extractable literal in the back but none in the front | 9.28% |
| Front literal with offset | If a regex has an extractable literal in the front that is offset by some fixed amount of unknown characters | 2.00% |
| Back literal with offset | If a regex has an extractable literal in the back that is offset by some fixed amount of unknown characters | 2.47% |
| Front and back literal | If a regex has an extractable literal in the front and back | 38.06% |
| No assert exact literal | If a regex matches only a constant string and it contains no assertions (anchors or lookarounds) | 14.97% |
| No assert nor group exact literal | If a regex matches only a constant string and it contains no assertions (anchors or lookarounds) and no capture groups | 14.97% |
| Front anchored | If a regex is anchored to the start of a string with
^ |
23.05% |
| Back anchored | If a regex is anchored to the end of a string with
$ |
21.51% |
| Double anchored | If a regex is anchored to the start and the end of a string | 12.68% |
| Potentially accidental capture | If a regex uses a capture group potentially only for grouping | 12.33% |
| No captures | If a regex has no captures | 62.09% |
Offset literals: this is about regexes such as
/..abc/. We can run prefix acceleration using “abc” and
then go back two characters.
Potentially accidental capture: this is not possible to tell what a user actually wanted so it cannot be optimized. But it is interesting to see. Matching captures is generally expensive. We believe users might accidentally create capture groups when they only meant to group a part of the regex (to for example quantify it).
Anchored: ^ and $ anchors
depend on the m (multiline) flag. If m is
enabled, ^/$ talk about the start/end of lines
rather than start/end of the entire input. Luckily, m is by
default disabled, and as we know, humans tend to leave the defaults
on.
Computing them requires heuristics. For example, given a regex like
/a..abcde/ we can either extract
(0, Prefix "a") or (3, Exact "abcde").
TODO: before this is implemented one would have to think how this affects complexity
Paper: https://dl.acm.org/doi/pdf/10.1145/3636501.3636959
Semantics:
Defined with a function models producing a
Prop. This is due to negative occurrences that would be
cumbersome to deal with. This function only models if a regex matches an
input span, not where and how. In fact, there is no “how” as there are
no capture groups. And the “where” is essentially the entire input.
The input is modeled as a “span” which is a slice of a string with reference to the entire string too. It is a triple: past characters, characters we must match (exhaustively), future characters. A similar concept is a “location” which is a tuple of past characters and next characters. “location” is what an “input” is in Linden.
Derivatives is the decider for these semantics. It is proven to be equivalent.
Capture groups:
The paper does add them to the formalization. It discusses shortly that supporting them is a major undertaking:
Formalizing even the core aspect of the capture semantics and algorithms of the nonbacktracking engine in Lean, based on its open-source implementation [15], is therefore a major undertaking with many challenges.
llmatch:
Until now the semantics did not talk about how to find a match
anywhere in an input. llmatch is a function that computes
the leftmost-longest match of a regex on a string. This will find
matches present anywhere in the input and return the span.
Regex equivalence:
The equivalence notion is “r q is equiv iff for all strings r matches that string iff q matches that string”. This has been known to not be the ideal definition in PCRE semantics as it is not strong enough to reason about rewrites of regex subexpressions. This is due to the non-commutativity of alternations and groups. In this paper, a few rewrites are proven to be correct: https://github.com/ezhuchko/extended-regexes/blob/194f2da778ab20c734dcfaa29808d0d2b726cfdd/Regex/Rewrites.lean.
I made a script and graphics for the progress presentation that will take place on 9.12.2025: https://github.shilangyu.dev/pikevm-systemf/progress/script.html.
The goal of the presentation is to update SYSTEMF on what I have been working on, show a bit of the proof, and explain what I will be working on next. Additionally, Ekaterina Zhuchko is coming to visit us this week, so it is a good way to show her what I am working on. This will set solid grounds for a chat with her.
The Typst report skeleton is done.