alphabet
\Sigma = \{a^{(1)}, \cdots,
a^{(\rho)}\} Non-empty!
words
Finite sequence of elements (letters)
Empty word: w = \varepsilon
The set of all words over an alphabet, \Sigma^*, is infinite and countable.
We can enumerate all words as follows (canonical order):
- shorter words precede longer ones
- words of equal length are sorted alphabetically
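A minimal Python sketch of this enumeration. The function name `canonical_order` and the use of Python strings as words are illustrative assumptions, not part of the notes.

```python
from itertools import count, islice, product

def canonical_order(alphabet):
    """Yield every word over `alphabet`: shorter words first, ties broken alphabetically."""
    letters = sorted(alphabet)
    for length in count(0):
        # product() over a sorted sequence yields tuples in lexicographic order
        for letters_tuple in product(letters, repeat=length):
            yield "".join(letters_tuple)

first_words = list(islice(canonical_order({"a", "b"}), 7))
print(first_words)  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Every word appears at a finite position in this sequence, which is why \Sigma^* is countable.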
language
A language over an alphabet is any subset of \Sigma^*. There are uncountably
many languages over a given alphabet
Notation conventions used for grammars below:
- A, B, C, \cdots - nonterminals
- a, b, c, \cdots, 0, 1, 2, \cdots, +, -, \cdots - terminals
- \alpha, \beta, \gamma, \cdots - strings of terminals and nonterminals
- z, y, x, w, v, \cdots - strings of terminals
- Z, Y, X, W, \cdots - a single symbol, either terminal or nonterminal
grammars
G = (V, T, P, S)
- V - finite set of symbols called
nonterminals
- T - finite set of symbols called
terminals
- P - finite set of productions
- S - nonterminal (distinguished)
called initial symbol of a grammar
A production is a pair (\alpha,
\beta), written \alpha \to \beta, where \alpha, \beta \in (V \cup
T)^* and \alpha \ne \varepsilon
By default S, E are the initial
symbols of a grammar
direct derivation
\gamma \alpha \delta \implies \gamma \beta
\delta, \alpha, \beta, \gamma, \delta
\in (V \cup T)^* and \alpha \to
\beta is a production
derivation
Application of direct derivation zero or more times
A language L(G) generated by a
grammar G is the set of all words over
T that can be derived from S
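A small sketch of how words of L(G) arise from repeated direct derivation. It assumes every symbol is a single character and bounds the number of derivation steps; the function name `derive` is hypothetical.

```python
def derive(productions, start, terminals, max_steps):
    """Return the words over `terminals` derivable from `start` in at most `max_steps` steps."""
    forms = {start}          # sentential forms reached after the current number of steps
    words = set()
    for _ in range(max_steps):
        next_forms = set()
        for form in forms:
            for lhs, rhs in productions:
                # apply lhs -> rhs at every position where lhs occurs
                pos = form.find(lhs)
                while pos != -1:
                    next_forms.add(form[:pos] + rhs + form[pos + len(lhs):])
                    pos = form.find(lhs, pos + 1)
        forms = next_forms
        words |= {f for f in forms if all(ch in terminals for ch in f)}
    return words

# Grammar S -> aSb | epsilon, written as (left-hand side, right-hand side) pairs.
grammar = [("S", "aSb"), ("S", "")]
print(sorted(derive(grammar, "S", {"a", "b"}, 3)))  # ['', 'aabb', 'ab']
```

Because the number of steps is bounded, this lists only a finite part of L(G).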
regular grammars
left linear
Productions of the form A \to Bw or
A \to w
right linear
Productions of the form A \to wB or
A \to w
operations
- set theoretical operations: union, intersection, complement
- concatenation
- of words: u \circ v
- of languages: L_1 \circ L_2
- * closure, Kleene closure: L^* = \bigcup_{k=0}^\infty L^k
\begin{cases}
L^0 = \{\varepsilon\} \\
L^{k+1} = L^k \circ L \\
\end{cases}
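For finite languages the definitions above can be tried out directly. This is a sketch using Python sets of strings; the names `concat`, `power` and `kleene_up_to` are illustrative. Since L^* is usually infinite, only the union of L^0, \cdots, L^k is computed.

```python
def concat(l1, l2):
    """Concatenation of languages: every word of l1 followed by every word of l2."""
    return {u + v for u in l1 for v in l2}

def power(lang, k):
    """L^k, following L^0 = {epsilon} and L^(k+1) = L^k concatenated with L."""
    result = {""}
    for _ in range(k):
        result = concat(result, lang)
    return result

def kleene_up_to(lang, max_k):
    """Finite approximation of L^*: the union of L^0, ..., L^max_k."""
    closure = set()
    for k in range(max_k + 1):
        closure |= power(lang, k)
    return closure

print(sorted(power({"a", "bb"}, 2)))  # ['aa', 'abb', 'bba', 'bbbb']
```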
Chomsky’s hierarchy of
languages
regular \subsetneq context free
\subsetneq context sensitive \subsetneq recursive \subsetneq recursively enumerable \subsetneq all
regular expressions
Finite strings of symbols over alphabet
- \emptyset, \varepsilon and \forall_{a \in \Sigma} a
- if r and s are regular expressions, then so are (r+s), (r \circ
s), (r^*) (the operators are listed in order of increasing precedence:
^* binds tightest, then \circ, then +)
- apply the rules above a finite number of times
Regular expressions generate languages
- \emptyset generates \emptyset
- \varepsilon generates \{\varepsilon\}
- \forall_{a \in \Sigma} a generates
\{a\}
- if r and s generate languages R and S then
- (r+s) generates R \cup S
- (r \circ s) generates R \circ S
- (r^*) generates R^*
Regular languages are those generated by regular
expressions and only those. Regular grammars generate regular languages
and only regular languages.
equivalent expressions
- \emptyset + r \equiv r
- r + s \equiv s + r
- (r+s) + t \equiv r + (s+t)
- \varepsilon r \equiv r\varepsilon \equiv
r
- r(s+t) \equiv rs + rt
- (r^*)^* \equiv r^*
- (r^*s^*)^* \equiv (r+s)^*
- (\varepsilon + r)^* \equiv r^*
- (s + r)^* \equiv (r^* + s^*)^*
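The identities above can be sanity-checked on short words with Python's `re` module, where the union written + in these notes is spelled `|`. This is only a bounded test, not a proof, and the helper name `first_disagreement` is an assumption made for illustration.

```python
import re
from itertools import product

def first_disagreement(pattern_1, pattern_2, alphabet, max_len):
    """Return the first word (in canonical order) accepted by exactly one pattern, or None."""
    for length in range(max_len + 1):
        for letters in product(sorted(alphabet), repeat=length):
            word = "".join(letters)
            in_1 = re.fullmatch(pattern_1, word) is not None
            in_2 = re.fullmatch(pattern_2, word) is not None
            if in_1 != in_2:
                return word
    return None

# (r^* s^*)^* and (r+s)^* with r = a, s = b agree on all words up to length 5.
print(first_disagreement("(a*b*)*", "(a|b)*", "ab", 5))  # None
# a^* b^* is a different language: it does not contain "ba".
print(first_disagreement("a*b*", "(a|b)*", "ab", 5))     # ba
```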
Tools for describing and analysing regular languages:
- regular grammars
- regular expressions
- pumping lemma
- Myhill-Nerode lemma
pumping lemma
If a language L is regular then
there exists a constant n_L s.t. for
any z \in L:
|z| \ge n_L \implies (\exists_{u,v,w \in \Sigma^*, z = uvw, |uv| \le
n_L, |v| \ge 1})(\forall_{i=0,1,2,\cdots})z_i= uv^iw\in L
contraposition
If for any constant N there exists
z \in L such that the following condition
holds:
|z| \ge N \land (\forall_{u,v,w \in \Sigma^*, z= uvw, |uv| \le N, |v|
\ge 1})(\exists_{i=0,1,2,\cdots})z_i=uv^iw \notin L
then L is not regular
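As an illustration of the contraposition, take L = \{a^n b^n\} and z = a^N b^N. The sketch below checks by brute force, for a few values of N, that every admissible split z = uvw has some i with uv^iw \notin L. The function names and the bound `max_i` on the tried exponents are assumptions of this sketch.

```python
def in_language(word):
    """Membership in L = { a^n b^n : n >= 0 }."""
    half = len(word) // 2
    return len(word) % 2 == 0 and word == "a" * half + "b" * half

def refutes_pumping(z, n, in_lang, max_i=3):
    """True if every split z = uvw with |uv| <= n and |v| >= 1 has some i <= max_i
    for which u v^i w is not in the language."""
    for uv_len in range(1, min(n, len(z)) + 1):
        for u_len in range(uv_len):
            u, v, w = z[:u_len], z[u_len:uv_len], z[uv_len:]
            if all(in_lang(u + v * i + w) for i in range(max_i + 1)):
                return False   # this split pumps successfully for all tried i
    return True

for n in range(1, 6):
    print(n, refutes_pumping("a" * n + "b" * n, n, in_language))  # True for each n
```

A result of `True` is conclusive for the given N, because a failing i was found for every split. The full argument still has to cover all N, which the code only samples.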
Myhill-Nerode
A language L is regular iff the
relation R_L induced by L has finite index (finite number of
equivalence classes)
Where uR_Lv \equiv \big [(\forall_{z\in
\Sigma^*})uz\in L \equiv vz \in L\big]
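The relation R_L can be explored experimentally by comparing words only on short suffixes z. The sketch below therefore gives a lower-bound style approximation of the classes, not the exact index. All names in it are illustrative.

```python
from itertools import product

def words_up_to(alphabet, max_len):
    """All words over `alphabet` of length at most `max_len`, in canonical order."""
    return ["".join(t) for n in range(max_len + 1)
            for t in product(sorted(alphabet), repeat=n)]

def approximate_classes(in_language, alphabet, max_prefix, max_suffix):
    """Group words u by the answers to 'is uz in L?' over all suffixes z up to max_suffix."""
    suffixes = words_up_to(alphabet, max_suffix)
    classes = {}
    for u in words_up_to(alphabet, max_prefix):
        signature = tuple(in_language(u + z) for z in suffixes)
        classes.setdefault(signature, []).append(u)
    return list(classes.values())

def even_number_of_a(word):
    return word.count("a") % 2 == 0

def a_n_b_n(word):
    half = len(word) // 2
    return len(word) % 2 == 0 and word == "a" * half + "b" * half

# A regular language: the number of classes stays at 2.
print(len(approximate_classes(even_number_of_a, "ab", 4, 4)))  # 2
# A non-regular language: longer prefixes keep revealing new classes.
print(len(approximate_classes(a_n_b_n, "ab", 2, 4)) < len(approximate_classes(a_n_b_n, "ab", 4, 4)))  # True
```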
let \rho \subset
\Sigma^*\times\Sigma^*, u,v\in
\Sigma^*, u\rho v \equiv |supp(u) \cap
supp(v)| = 2 where supp(u) = \text{the
set of letters in u}
context-free grammars and
languages
A grammar is context-free iff its productions are of the form A \mapsto \alpha where A \in V, \alpha \in
(V\cup T)^*
Context-free languages are languages generated by context-free
grammars and only those.
simplification
- removing useless symbols \to remove productions using those
symbols
- removing unit productions
- delete productions of the form A \mapsto
A
- replace A \mapsto B with A \mapsto \text{right hand side of
B-production}
- changing status of nullable symbols (nonterminals which can generate
an empty word)
identification of useless
symbols
not reachable
- V_r = \{S\}, T_r = \emptyset
- for each production whose left-hand side is in V_r, add the nonterminals
of its right-hand side to V_r and the terminals to T_r
- repeat step 2. as long as V_r,
T_r are being changed
- remove the symbols outside V_r \cup T_r and the productions that use them
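A sketch of the reachability computation as a fixed-point loop. It assumes productions are given as (left-hand side, right-hand side) pairs with single-character symbols, and it returns V_r and T_r as one combined set. The function name is hypothetical.

```python
def reachable_symbols(productions, start):
    """Return all symbols (nonterminals and terminals) reachable from `start`."""
    reachable = {start}
    changed = True
    while changed:                      # repeat as long as the set grows
        changed = False
        for lhs, rhs in productions:
            if lhs in reachable:
                new_symbols = set(rhs) - reachable
                if new_symbols:
                    reachable |= new_symbols
                    changed = True
    return reachable

# S -> aA, A -> b, B -> c: B and c cannot be reached from S.
grammar = [("S", "aA"), ("A", "b"), ("B", "c")]
print(sorted(reachable_symbols(grammar, "S")))  # ['A', 'S', 'a', 'b']
```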
not generative
- V_g = \emptyset, T_g = T
- for each production if its right hand side is empty or has only
generative symbols then add its left hand side to V_g
- repeat step 2. as long as V_g is
being changed
- remove non generative symbols and productions using those
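The same fixed-point pattern gives the generative nonterminals. The sketch assumes single-character symbols and (left-hand side, right-hand side) pairs; the function name is an assumption.

```python
def generative_nonterminals(productions, terminals):
    """Return V_g, the nonterminals from which some string of terminals can be derived."""
    generative = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            # an empty right-hand side satisfies all() trivially
            if lhs not in generative and all(
                s in terminals or s in generative for s in rhs
            ):
                generative.add(lhs)
                changed = True
    return generative

# S -> AB | a, A -> a, B -> Bb: B never terminates, so it is not generative.
grammar = [("S", "AB"), ("S", "a"), ("A", "a"), ("B", "Bb")]
print(sorted(generative_nonterminals(grammar, {"a", "b"})))  # ['A', 'S']
```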
nullables
- V_n = \emptyset the set of nullable
symbols
- for each production if its right hand side is empty or is a string
of nullable symbols then add its left hand side to V_n
- repeat step 2. as long as V_n is
being changed
- change status of nullable symbols and delete nullable
productions
For a production A \mapsto X_1 \cdots X_k \cdots
X_n,
if X_k is nullable then replace this
production with
- A \mapsto X_1 \cdots X_k \cdots
X_n
- A \mapsto X_1 \cdots X_{k-1} X_{k+1}
\cdots X_n
Finally, delete the productions whose right-hand side is the empty word.
If the grammar does not generate the empty word then the generated
language does not change
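The nullable-symbol procedure can be sketched as follows, assuming single-character symbols and productions given as (left-hand side, right-hand side) string pairs. Both function names are illustrative.

```python
from itertools import product

def nullable_nonterminals(productions):
    """Return V_n, the nonterminals that can derive the empty word."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

def remove_epsilon_productions(productions):
    """Keep or drop each nullable symbol in every right-hand side, then delete empty ones."""
    nullable = nullable_nonterminals(productions)
    result = set()
    for lhs, rhs in productions:
        # a nullable symbol may stay (s) or disappear (""); other symbols always stay
        options = [(s, "") if s in nullable else (s,) for s in rhs]
        for choice in product(*options):
            new_rhs = "".join(choice)
            if new_rhs:                 # drop productions to the empty word
                result.add((lhs, new_rhs))
    return result

# S -> AbA, A -> a | epsilon
grammar = [("S", "AbA"), ("A", "a"), ("A", "")]
print(sorted(remove_epsilon_productions(grammar)))
# [('A', 'a'), ('S', 'Ab'), ('S', 'AbA'), ('S', 'b'), ('S', 'bA')]
```

This step may introduce unit productions, so unit-production removal is typically run afterwards.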
Any context-free grammar that does not generate the empty word can be transformed
into the normal forms below
Chomsky
The productions are A \mapsto a or
A \mapsto BC
- simplify the grammar
- this yields productions A \mapsto a
or A \mapsto X_1 \cdots X_n where
n \ge 2
- in all productions A \mapsto X_1 \cdots X_n where n \ge 2 replace
each terminal a with a new nonterminal A_a and add the production A_a \mapsto a
- this yields productions A \mapsto
a, A \mapsto BC, and A \mapsto B_1 \cdots B_n where n > 2
- replace A \mapsto B_1 \cdots B_n
where n > 2 with A \mapsto B_1 \cdots B_{n-2} B_{n-1,n} and
B_{n-1,n} \mapsto B_{n-1}B_n, where B_{n-1,n} is a new nonterminal
- repeat step 5. until all productions are in Chomsky’s form
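Steps 3 and 5 above can be sketched in Python. The sketch assumes the grammar has already been simplified, represents right-hand sides as tuples of symbol names, and invents the names `A_a` and `<X,Y>` for new nonterminals (assumed not to clash with existing ones). It is an illustration, not a complete normalizer.

```python
def to_chomsky(productions, terminals):
    """Convert simplified productions (lhs, tuple_of_symbols) to Chomsky normal form."""
    lifted, terminal_names = [], {}
    # Step 3: in right-hand sides of length >= 2 replace each terminal a by A_a.
    for lhs, rhs in productions:
        if len(rhs) >= 2:
            for a in rhs:
                if a in terminals:
                    terminal_names[a] = f"A_{a}"
            rhs = tuple(terminal_names.get(s, s) for s in rhs)
        lifted.append((lhs, rhs))
    lifted += [(name, (a,)) for a, name in sorted(terminal_names.items())]
    # Step 5: shorten long right-hand sides by pairing their last two symbols.
    result = []
    for lhs, rhs in lifted:
        while len(rhs) > 2:
            pair_name = f"<{rhs[-2]},{rhs[-1]}>"
            pair_rule = (pair_name, rhs[-2:])
            if pair_rule not in result:
                result.append(pair_rule)
            rhs = rhs[:-2] + (pair_name,)
        result.append((lhs, rhs))
    return result

# S -> aSb | ab
cnf = to_chomsky([("S", ("a", "S", "b")), ("S", ("a", "b"))], {"a", "b"})
for lhs, rhs in cnf:
    print(lhs, "->", " ".join(rhs))
```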
Greibach
The productions are A \mapsto
a\alpha where \alpha \in V^*
- transform a grammar to Chomsky’s form (obligatory in automatic
transformation)
- enumerate all nonterminals A_1, A_2,
\cdots A_N
- all productions have to satisfy the star condition:
(*) \begin{cases}
A_i \mapsto a\alpha & a \in T, \alpha \in V^* \\
A_i \mapsto A_j\alpha & j > i, \alpha \in V^*, |\alpha| >
0\\
\end{cases}
- assume that all A_i productions
satisfy (*) for i < k and A_k productions do not, thus are A_k \mapsto a\alpha, A_k \mapsto A_i\alpha where i < k, and A_k
\mapsto A_k\alpha
- replace in A_k \mapsto A_i\alpha
where i < k the A_i with the right-hand sides of the A_i productions. If after this operation we
still have productions of the same form, their first nonterminal has an
index greater than i. Repeating this
step makes every A_k production either satisfy (*) or have the form A_k \mapsto A_k\alpha
- assume that we have productions A_k
\mapsto A_k\alpha_1 | A_k\alpha_2 | \cdots | A_k\alpha_m | \beta_1 |
\cdots | \beta_n, where no \beta_j begins with A_k. Add another nonterminal B_k and replace these productions with A_k \mapsto \beta_1 | \cdots | \beta_n | \beta_1B_k
| \cdots | \beta_nB_k and B_k \mapsto
\alpha_1 | \cdots | \alpha_m | \alpha_1B_k|\cdots | \alpha_mB_k.
B_k precedes all A_i nonterminals and the B nonterminals are arranged according to
their indices. Repeat for A_{k+1}, \cdots, A_N
- A_N productions are in the Greibach
form
- A_{N-1} productions are either
A_{N-1} \mapsto a\alpha or A_{N-1} \mapsto A_N \alpha. Replace A_N with the right-hand
sides of the A_N productions. Repeat for A_{N-l}, l = 2, \cdots, N-1.
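The step that introduces B_k is the standard removal of direct left recursion: A_k \mapsto A_k\alpha_j | \beta_j becomes A_k \mapsto \beta_j | \beta_jB_k and B_k \mapsto \alpha_j | \alpha_jB_k. A sketch of that single step follows, with right-hand sides as tuples of symbol names. The function name and the `helper` parameter are assumptions.

```python
def remove_direct_left_recursion(nonterminal, rhs_list, helper):
    """Rewrite the productions of `nonterminal` so that none starts with `nonterminal`.

    rhs_list: right-hand sides of all productions of `nonterminal`, as tuples.
    helper:   name of the fresh nonterminal (B_k in the notes).
    """
    alphas = [rhs[1:] for rhs in rhs_list if rhs[:1] == (nonterminal,)]
    betas = [rhs for rhs in rhs_list if rhs[:1] != (nonterminal,)]
    if not alphas:                       # nothing to do: no left recursion
        return {nonterminal: rhs_list}
    return {
        nonterminal: betas + [beta + (helper,) for beta in betas],
        helper: alphas + [alpha + (helper,) for alpha in alphas],
    }

# A -> A b | a   becomes   A -> a | a B,  B -> b | b B
print(remove_direct_left_recursion("A", [("A", "b"), ("a",)], "B"))
```

Both grammars generate a followed by any number of b, but the rewritten one begins every A production with a symbol other than A.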
derivation trees
A derivation tree of a word in a grammar is a tree with the following
properties:
- its internal nodes are labelled by nonterminals
- its leaves are labelled by an empty word or terminals
- its root is labelled by the initial symbol of the grammar
- for each internal node, its children are labelled by the symbols of the
right-hand side of a production whose left-hand side is this node’s label (the children
appear in the order given by that right-hand side)
- if a node has a child labelled by an empty word then it is the only
child of this node
- the crop of such a tree is the derived
word
unique derivation
We say that a word has unique derivation in a grammar if there is
exactly one derivation tree
pumping lemma
If a language L is context free then
there exists a constant n_L such that
for each z \in L the following
holds
|z| \ge n_L \implies (\exists_{u,v,w,x,y \in \Sigma^*, z=uvwxy, |vwx| \le
n_L, |vx| \ge 1})(\forall_{i=0,1,2,\cdots})z_i = uv^iwx^iy \in L
contraposition
If for any constant N there exists
z \in L such that the following condition
holds:
|z| \ge N \land (\forall_{u,v,w,x,y \in \Sigma^*, z = uvwxy, |vwx| \le
N, |vx| \ge 1})(\exists_{i=0,1,2,\cdots})z_i= uv^iwx^iy \notin L
then L is not context free
Cocke-Younger-Kasami
algorithm
Algorithm for checking whether a word belongs to the language
generated by a grammar in Chomsky’s normal form.
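A compact sketch of the usual dynamic-programming formulation of the algorithm. Productions are assumed to be (left-hand side, right-hand side) pairs whose right-hand side is a tuple holding either one terminal or two nonterminals. The function name `cyk` and the table layout are choices of this sketch.

```python
def cyk(word, productions, start):
    """Return True iff `word` is generated by the CNF grammar `productions` from `start`."""
    n = len(word)
    if n == 0:
        return False   # a grammar in Chomsky normal form cannot generate the empty word
    # table[i][l - 1] = set of nonterminals deriving the subword word[i:i + l]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, letter in enumerate(word):
        table[i][0] = {lhs for lhs, rhs in productions if rhs == (letter,)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, rhs in productions:
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

# CNF grammar for { a^n b^n : n >= 1 }: S -> AB | AT, T -> SB, A -> a, B -> b
grammar = [("S", ("A", "B")), ("S", ("A", "T")), ("T", ("S", "B")),
           ("A", ("a",)), ("B", ("b",))]
print(cyk("aabb", grammar, "S"), cyk("aab", grammar, "S"))  # True False
```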
translation grammars
G = (V, T, T_r, P, S) where T_r denotes the set of translation symbols
LL(1) grammars
Recall context free productions: A \mapsto
\alpha where \alpha \in (V \cup
T)^*
- FIRST(\alpha): the set of terminals that can begin a string of
terminals derived from \alpha
- FOLLOW(A): the set of terminals that can directly follow A in any intermediate word of any derivation
in this grammar
- SELECT(A \mapsto \alpha): FIRST(\alpha) if \alpha is not nullable, otherwise FIRST(\alpha) \cup FOLLOW(A)
A grammar is LL(1) iff for each
nonterminal A the SELECT sets of all A-productions are pairwise
disjoint
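A sketch of computing SELECT sets and testing the LL(1) condition. It assumes right-hand sides are tuples of symbols, treats every symbol without productions as a terminal, and adds a hypothetical end marker `$` to FOLLOW of the initial symbol. Neither the marker nor the function names come from the notes.

```python
def first_of_string(symbols, first, nullable):
    """Return (FIRST of the string, whether the string is nullable)."""
    result = set()
    for s in symbols:
        if s not in first:               # terminal: it starts the string, stop here
            result.add(s)
            return result, False
        result |= first[s]
        if s not in nullable:
            return result, False
    return result, True

def select_sets(productions, start, end_marker="$"):
    """Return a list of (lhs, rhs, SELECT set), one entry per production."""
    nonterminals = {lhs for lhs, _ in productions}
    nullable = set()
    first = {a: set() for a in nonterminals}
    follow = {a: set() for a in nonterminals}
    follow[start].add(end_marker)
    changed = True
    while changed:                       # joint fixed point of nullable, FIRST, FOLLOW
        changed = False
        for lhs, rhs in productions:
            rhs_first, rhs_nullable = first_of_string(rhs, first, nullable)
            if rhs_nullable and lhs not in nullable:
                nullable.add(lhs)
                changed = True
            if not rhs_first <= first[lhs]:
                first[lhs] |= rhs_first
                changed = True
            for i, s in enumerate(rhs):
                if s in nonterminals:
                    tail_first, tail_nullable = first_of_string(rhs[i + 1:], first, nullable)
                    new = tail_first | (follow[lhs] if tail_nullable else set())
                    if not new <= follow[s]:
                        follow[s] |= new
                        changed = True
    result = []
    for lhs, rhs in productions:
        rhs_first, rhs_nullable = first_of_string(rhs, first, nullable)
        result.append((lhs, rhs, rhs_first | (follow[lhs] if rhs_nullable else set())))
    return result

def is_ll1(productions, start):
    """True iff SELECT sets of productions with the same left-hand side are pairwise disjoint."""
    seen = {}
    for lhs, _, select in select_sets(productions, start):
        if seen.setdefault(lhs, set()) & select:
            return False
        seen[lhs] |= select
    return True

grammar = [("S", ("a", "S", "b")), ("S", ())]     # S -> aSb | epsilon
print(select_sets(grammar, "S"))
print(is_ll1(grammar, "S"))  # True
```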
Context-sensitive
grammars
Productions of the form:
- \alpha \mapsto \beta
- \alpha, \beta \in (V \cup T)^*
Where 1 \le |\alpha| \le |\beta|
So the empty word does not belong to context-sensitive languages
(CSL)
Equivalently, productions can be restricted to the form \delta_1 A \delta_2 \mapsto
\delta_1 \alpha \delta_2 where A \in
V, \alpha \in (V \cup T)^*,
|\alpha| \ge 1, and \delta_1, \delta_2 \in (V \cup T)^*
Recursive languages
Those that are accepted by Turing machines with the stop property (machines that halt on every input).
There is no class of grammars generating these languages.