For some regexes, we know that they require to be at the start of the
input to match. This information comes from the /^/ anchor
which asserts that we are at the start of the input.
The anchor does not have to be necessarily at the start of the regex. All of the following regexes are anchored at the start:
/abc^//(^p|^t)//(?=.*yes[so])()(|)^later//(?=^)later/However, /(^a|b)/ is not anchored: there exists a branch
that does not require to be a the start of the input. Additionally,
notice how /abc^/ can never match. But this is not handled
by this optimization.
We therefore detect anchored regexes with the following:
Fixpoint is_anchored' (r:regex) : bool :=
match r with
| Anchor BeginInput => true
| Disjunction r1 r2 => is_anchored' r1 && is_anchored' r2
| Sequence r1 r2 => is_anchored' r1 || is_anchored' r2
| Group _ r1 => is_anchored' r1
| Quantified _ min _ r1 => (min != 0) && is_anchored' r1
| Lookaround LookAhead r1 | Lookbaround LookBehind r1 => is_anchored' r1
| Anchor _ | Lookaround _ _ | Epsilon | Regex.Character _ | Backreference _ => false
end.
Definition is_anchored (r:regex) : bool :=
if RegExpRecord.multiline rer then false
else is_anchored' r.
Once we have this we can characterize this function in the following way: running a match using an anchored regex will always fail if we are not at the start of the input. This is expressed in this theorem:
Theorem is_anchored_match_not_begin_regex:
forall r c next pref tree,
is_anchored r = true ->
is_tree rer [Areg r] (Input next (c::pref)) Groups.GroupMap.empty forward tree ->
first_leaf tree (Input next (c :: pref)) = None.
Given this, we know that when looking for a match of an anchored regex anywhere in the haystack, it is either at the very start of it or nowhere. This allows us to just run an anchored engine once. Therefore, for the following function:
Definition try_anchored_search {engine:AnchoredEngine rer} (r:regex) (inp:input) : option (option leaf) :=
if is_anchored r then
if pref_str inp == [] then
Some (exec rer r inp)
else
Some None
else None.
where exec is the execution of the anchored engine, we
have the following theorem:
Theorem try_anchored_search_correct {engine:AnchoredEngine rer}:
forall r inp leaf tree,
supported_regex rer r = true ->
is_tree rer [Areg (lazy_prefix r)] inp Groups.GroupMap.empty forward tree ->
try_anchored_search r inp = Some leaf ->
first_leaf tree inp = leaf.
So if we find a result, it is the result of an unanchored search. Approximately 23% of regexes found in the wild are anchored. This optimization allows us to run the engine potentially for a much smaller region of the input.
/^abc/ benefits from both the literal extraction and
anchored searches. Currently all optimizations are very disjoint. We
should interleave them