Collected notes

Prefix acceleration

  1. Formalize substring searches on string. The definition should be roughly: str_search : string -> string -> option nat.

    This function should follow some axioms. Given Definition starts_with (p s : string) := p = List.firstn (length p) s.

    1. Produced results are indeed pointing to that substring: forall s ss i, str_search s ss = Some i -> starts_with ss (List.skipn i s)
    2. Produced result is the very first substring encountered: forall s ss i, str_search s ss = Some i -> forall i', i' < i -> ~ (starts_with ss (List.skipn i' s))
    3. If no substring is found, it is not present in the string: forall s ss, str_search s ss = None -> forall i, i < length s -> ~ (starts_with ss (List.skipn i s))

    Connect these to operating on input. To do so, a wrapper function is defined in terms of str_search: input_search : input -> string -> option input where the result is an advanced input. The following can be proven about it using only the axiomatization of str_search:

    1. Produced results are strict suffixes: forall i1 i2 p, input_search i1 p = Some i2 -> strict_suffix i1 i2 forward
    2. Produced results are indeed pointing to that substring: forall a1 b1 a2 b2 p, input_search (a1, b1) p = Some (a2, b2) -> starts_with p a2
    3. Produced result is the very first substring encountered: forall a1 b1 a2 b2 p, input_search (a1, b1) p = Some (a2, b2) -> forall s1 s2, a1 = s1 ++ s2 ++ a2 -> ~ (starts_with p s2)
    4. If no substring is found, it is not present in the string: forall a1 b1 p, input_search (a1, b1) = None -> forall s1 s2, a1 = s1 ++ s2 -> ~ (starts_with p s2)

    Some of these axioms might be not needed for the proofs. They encompass everything about the substring search, but tells us more than what we need to know.

  2. Formalize literal extraction. The definition should be roughly: extract_literal : regex -> literal. Where

    Inductive literal : Type :=
    | Exact (s : string) (* the entire match is exactly `s` *)
    | Prefix (s : string) (* the match starts with `s` *)
    | None. (* this indicates the match cannot exist, as opposed to Prefix [] which means we do not know anything about the match *)
    
    Definition prefix (l : literal) := match l with
    | Exact s => s
    | Prefix s => s
    | None => []
    end.

    This function should follow these theorems:

    1. Every match starts with the extracted literal:
    forall r tree a b result,
        priotree_inp r (a, b) tree ->
        first_leaf tree (a, b) = Some result ->
        starts_with (prefix (extract_literal r)) a
    1. Every match is wholly Exact s:
    admit.

    TODO: how to specify it and is it useful?

    1. A None literal means there is no match:
    forall r tree inp,
        priotree_inp r inp tree ->
        extract_literal r = None ->
        first_leaf tree inp = None
  3. Write a function prefix_accelerated_pikevm which completes an entire (unanchored) search. Prove equivalence between the prefix accelerated version and the full search algorithm.

    1. Prove equivalence between [^]*?(r)_#0 and the for-loop algorithm of running r anchored at each index. This has to be done in Warblre, out of scope for this thesis, assume it is proven.
    2. For free we get that Linden’s [^]*?(r)_#0 where r is restricted to the PikeVM regex features is equivalent to Warblre’s regex.
    3. Prove equivalence of prefix_accelerated_pikevm to the [^]*?(r)_#0 regex.

    To do 3., we first can prove equivalence with respect to first_leaf of a backtracking tree and a “prefix tree” which skips branches that will never match due to lacking of a prefix.

    Instead of a prefix_accelerated_pikevm I think a more general full-search algorithm that will be continuously improved is better. Especially since we want to extract it to a function at some point. This can be done in the future.

    The accelerated version can skip running the pikevm completely if the prefix extraction returns an Exact.

To discuss

Action items