Compilers

FIRST and FOLLOW sets

Can aid both top-down and bottom-up parsers.

FIRST

FIRST(α) is the set of terminals that can begin some string derived from α; if α can derive ε, then ε is in FIRST(α) as well.

FOLLOW

FOLLOW(A) is the set of terminals that can appear immediately to the right of the non-terminal A in some sentential form; $ is added to FOLLOW(A) if A can appear at the right end of a sentential form (in particular, $ ∈ FOLLOW of the start symbol).
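A minimal Python sketch of the usual fixed-point computation of both sets, using the expression grammar of Eg1 below. The grammar encoding (dict of non-terminal to productions, [] standing for ε), the helper names, and the ε/$ markers are assumptions made here for illustration, not anything prescribed by these notes.

# Grammar of Eg1: symbols that are not dict keys are terminals, [] means ε.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
EPSILON, END = "ε", "$"

def first_of(symbols, first):
    """FIRST of a sequence of symbols; contains ε iff the whole sequence can vanish."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPSILON}
        if EPSILON not in first[sym]:
            return result
    result.add(EPSILON)
    return result

def compute_first(grammar):
    # FIRST(a) = {a} for every terminal appearing in some production body
    first = {sym: {sym}
             for prods in grammar.values() for prod in prods for sym in prod
             if sym not in grammar}
    first.update({nt: set() for nt in grammar})
    changed = True
    while changed:                      # iterate until nothing new is added
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                add = first_of(prod, first)      # first_of([]) == {ε}
                if not add <= first[nt]:
                    first[nt] |= add
                    changed = True
    return first

def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in grammar:       # terminals have no FOLLOW
                        continue
                    rest = first_of(prod[i + 1:], first)
                    add = rest - {EPSILON}
                    if EPSILON in rest:          # sym can be the last thing derived
                        add |= follow[nt]
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow

first = compute_first(GRAMMAR)
follow = compute_follow(GRAMMAR, first, "E")
print(first["E"], follow["T'"])   # FIRST(E) = { (, id },  FOLLOW(T') = { +, ), $ }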

Parsing table construction

for each production A → α in the grammar:
  for each terminal x in FIRST(α), add T[A, x] = A → α
  if ε ∈ FIRST(α):
    for each terminal y in FOLLOW(A), add T[A, y] = A → α
    if $ ∈ FOLLOW(A), add T[A, $] = A → α

Examples

Eg1

E  → T E'
E' → + T E' | ε
T  → F T'
T' → * F T' | ε
F  → ( E ) | id

where id is a terminal.

FIRST set:

Symbol  FIRST set
E       FIRST(T) = { (, id }
E'      { +, ε }
T       FIRST(F) = { (, id }
T'      { *, ε }
F       { (, id }

FOLLOW set:

Symbol  FOLLOW set        Comment
E       { $, ) }          Start symbol (so $), and F → ( E ) puts ) after E
E'      { $, ) }          FOLLOW(E)
T       { +, $, ) }       (FIRST(E') \ {ε}) ∪ FOLLOW(E')
T'      { +, $, ) }       FOLLOW(T)
F       { *, +, $, ) }    (FIRST(T') \ {ε}) ∪ FOLLOW(T)

Parsing table:

    id           +            *            (          )         $
E   E → T E'                               E → T E'
E'             E' → + T E'                          E' → ε    E' → ε
T   T → F T'                               T → F T'
T'             T' → ε       T' → * F T'             T' → ε    T' → ε
F   F → id                                 F → ( E )
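The construction can also be mechanized. A small Python sketch that builds this table for Eg1; the FIRST-of-body and FOLLOW values below follow from the tables above, and the tuple/dict layout is an assumption made here for illustration.

PRODUCTIONS = [                      # (head, body, FIRST(body))
    ("E",  ["T", "E'"],      {"(", "id"}),
    ("E'", ["+", "T", "E'"], {"+"}),
    ("E'", [],               {"ε"}),
    ("T",  ["F", "T'"],      {"(", "id"}),
    ("T'", ["*", "F", "T'"], {"*"}),
    ("T'", [],               {"ε"}),
    ("F",  ["(", "E", ")"],  {"("}),
    ("F",  ["id"],           {"id"}),
]
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"*", "+", ")", "$"},
}

table = {}
for head, body, first in PRODUCTIONS:
    for x in first - {"ε"}:
        table[(head, x)] = body          # T[A, x] = A → α for every x in FIRST(α)
    if "ε" in first:
        for y in FOLLOW[head]:           # FOLLOW already contains $ where relevant
            table[(head, y)] = body

print(table[("E'", "$")])    # []  (i.e. E' → ε)
print(table[("T",  "(")])    # ['F', "T'"]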

Eg2

S -> ( L ) | a
L -> L , S | S

This grammar is left-recursive (L → L , S), which makes it unsuitable for top-down parsing, so we eliminate the left recursion first:

S  → ( L ) | a
L  → S L'
L' → , S L' | ε

using the general rule that

A → Aα | β

becomes

A  → βA'
A' → αA' | ε

FIRST set:

Symbol  FIRST set
S       { (, a }
L       FIRST(S) = { (, a }
L'      { ',', ε }

Top-down parsing

Can be thought of as finding a left-most derivation for a sentence.

Example: LL(k)

For example, for the grammar

E → T E'
E'→ + T E' | ε
T → F T'
T'→ * F T' | ε
F → ( E ) | id

a parse tree for the input string id + id * id is built from the top down: always expand the left-most remaining non-terminal, using one token of lookahead to choose the production.

E
⇒ T E'                   E  → T E'
⇒ F T' E'                T  → F T'
⇒ id T' E'               F  → id          (lookahead id)
⇒ id E'                  T' → ε           (lookahead +)
⇒ id + T E'              E' → + T E'      (lookahead +)
⇒ id + F T' E'           T  → F T'
⇒ id + id T' E'          F  → id          (lookahead id)
⇒ id + id * F T' E'      T' → * F T'      (lookahead *)
⇒ id + id * id T' E'     F  → id          (lookahead id)
⇒ id + id * id E'        T' → ε           (lookahead $)
⇒ id + id * id           E' → ε           (lookahead $)

Each step attaches the body of the chosen production as children of the expanded non-terminal, so the finished parse tree is:

              E
              |
      +-------+-------+
      |               |
      T               E'
      |               |
  +---+-+       +-----+---------------+
  |     |       |     |               |
  F     T'      +     T               E'
  |     |             |               |
  |     |          +--+---+           |
  |     |          |      |           |
  id    ε          F      T'          ε
                   |      |
                   |   +--+-+-----+
                   |   |    |     |
                   id  *    F     T'
                            |     |
                            id    ε

Recursive descent parsing
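Recursive descent implements one (possibly recursive) procedure per non-terminal. A minimal sketch for the expression grammar above; the token handling is an assumption: the input is a list of tokens ending in '$', and every identifier has already been lexed into the single token 'id'.

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def E(self):            # E  → T E'
        self.T()
        self.E_()

    def E_(self):           # E' → + T E' | ε
        if self.peek() == "+":
            self.expect("+")
            self.T()
            self.E_()       # otherwise take ε (FOLLOW(E') = { ), $ })

    def T(self):            # T  → F T'
        self.F()
        self.T_()

    def T_(self):           # T' → * F T' | ε
        if self.peek() == "*":
            self.expect("*")
            self.F()
            self.T_()

    def F(self):            # F  → ( E ) | id
        if self.peek() == "(":
            self.expect("(")
            self.E()
            self.expect(")")
        else:
            self.expect("id")

    def parse(self):
        self.E()
        self.expect("$")    # the whole input must be consumed

Parser(["id", "+", "id", "*", "id", "$"]).parse()   # succeeds silently; raises SyntaxError on bad input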

Predictive parsing
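A predictive parser chooses the production using only the next input token, with no backtracking. Below is a table-driven (non-recursive) sketch for the same grammar, driven by the LL(1) table built earlier; the data layout is an assumption for illustration.

TABLE = {
    ("E",  "id"): ["T", "E'"],   ("E",  "("): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", ")"):  [],            ("E'", "$"): [],
    ("T",  "id"): ["F", "T'"],   ("T",  "("): ["F", "T'"],
    ("T'", "+"):  [],            ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"):  [],            ("T'", "$"): [],
    ("F",  "id"): ["id"],        ("F",  "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens, start="E"):
    stack = ["$", start]                     # grammar stack, start symbol on top
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in NONTERMINALS:
            body = TABLE.get((top, lookahead))
            if body is None:
                raise SyntaxError(f"no rule for {top!r} on {lookahead!r}")
            stack.extend(reversed(body))     # push body, leftmost symbol on top
        else:
            if top != lookahead:
                raise SyntaxError(f"expected {top!r}, got {lookahead!r}")
            pos += 1                         # match terminal (including the final '$')
    return True

print(predictive_parse(["id", "+", "id", "*", "id", "$"]))   # True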

Bottom-up parsing

Examples:

For example, for the grammar

E → E + T | T
T → T * F | F
F → ( E ) | id

a parse tree for the input string id * id is built from the leaves up, reducing at each step a substring that matches the body of some production:

id * id

F * id
|
id

T * id
|
F
|
id

T * F
|   |
F   id
|
id

    T
    |
+---+---+
|   |   |
T   *   F
|       |
F       id
|
id

    E
    |
    T
    |
+---+---+
|   |   |
T   *   F
|       |
F       id
|
id

Bottom-up parsing may be thought of as a process of 'reducing' the input sentence to the start symbol of the grammar.

The goal of bottom-up parsing is to do a derivation in reverse: the reductions trace out a rightmost derivation backwards (hence the right-sentential forms in the table below).

For the example that we just saw:

Right-sentential form    Handle     Reducing production
id₁ * id₂                id₁        F → id
F * id₂                  F          T → F
T * id₂                  id₂        F → id
T * F                    T * F      T → T * F
T                        T          E → T

Shift reduce parsing

Parsing succeeds when the stack holds only the start symbol of the grammar (on top of $) and the input buffer is empty (only $ remains).

Shift means pushing the next input symbol onto the stack; reduce means replacing the production body on top of the stack with its head.

For our usual example grammar:

E → E + T | T
T → T * F | F
F → ( E ) | id

and input id₁ * id₂,

Stack          Input          Action
$              id₁ * id₂ $    shift
$ id₁          * id₂ $        reduce by F → id
$ F            * id₂ $        reduce by T → F
$ T            * id₂ $        shift
$ T *          id₂ $          shift
$ T * id₂      $              reduce by F → id
$ T * F        $              reduce by T → T * F
$ T            $              reduce by E → T
$ E            $              accept input

Conflicts in shift-reduce parsing

LR parsing

LR parsers

Usually, LR parser means LR(1) parser.

LR parsers may be classified, in increasing order of parsing power and complexity, into:

SLR < LALR < CLR (canonical LR)

For most practical purposes, LALR would suffice.

Shift-reduce parsing

Conflicts in shift-reduce parsing

From the purple dragon book:

The key decisions during bottom-up parsing are about when to reduce and about what production to apply, as the parse proceeds.

Reduce

Convert the RHS of a production to its LHS.

From the Purple dragon book:

At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of that production.

For example, if production is:

S -> ab

and input is at:

ab
  ↑

where ab has already been matched, we can reduce: the ab on top of the stack becomes an S. That is, we commit to this production.
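A tiny sketch of what a single reduce step does to the parser stack; the list representation and the reduce_step helper name are assumptions for illustration.

def reduce_step(stack, lhs, rhs):
    """Replace the production body `rhs` (non-ε) on top of the stack with its head `lhs`."""
    assert stack[-len(rhs):] == rhs, "the handle must be on top of the stack"
    del stack[-len(rhs):]
    stack.append(lhs)
    return stack

print(reduce_step(["$", "a", "b"], "S", ["a", "b"]))   # ['$', 'S']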

Shift

Delay the decision a bit more. Let's see more of the input before deciding. Leave options open.

Suppose there is this production:

S -> ab

and input is at

a b
 ↑

at this point no production is complete, so we need to see more input. In such a case, the current input symbol is 'shifted' onto the stack.

Shift-reduce conflict

Parser isn't sure whether it should shift or reduce.

Suppose we have the following productions:

S -> a
S -> ab

and input is at

a b
 ↑

Here the parser could reduce using S -> a, but it could also shift b hoping to complete S -> ab. Hence the conflict.

Consider a more realistic example:

stmt := if expr then stmt
      | if expr then stmt else stmt

When the parsing is at this stage:

Stack: if expr then stmt
Input: else ...

there is a shift-reduce conflict.

Shifting the else would work towards production 2, while reducing would apply production 1.

Reduce-reduce conflict

There is more than one production which can be used to do a reduce.

Suppose we have the following productions:

S -> a
E -> a

and input is at

a
 ↑

Here either production can be used to reduce the a; the parser cannot tell which one to pick. Hence the conflict.

Why no shift-shift conflict?

Because a shift always pushes the same current input symbol, no matter which production the parser is working towards, so there is nothing to choose between.

LR parsing

Reason for the name: the input is scanned from Left to right, and the parser constructs a Rightmost derivation in reverse.

An ambiguous grammar can never be LR.

Symbol table

Used to keep track of the identifiers in a program (variables, functions, etc.) and their attributes, such as type and scope. The symbol table needn't be in the binary executable; it can be included if we want to debug later.

Python has a module to examine its symbol table: https://docs.python.org/3/library/symtable.html

import symtable
table = symtable.symtable("def foo(): pass", "string", "exec")

print(list(table.get_identifiers()))
# ['foo']

Ambiguous CFGs

A grammar that can produce more than one parse tree for the same string.

For example, the following CFG is ambiguous:

E := E + E
   | E * E
   | id

because the string id + id * id allows for two different left-most derivations in this CFG:

E
⇒ E + E
⇒ id + E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id

        E
        |
  +-----+-----+
  |     |     |
  E     +     E
  |           |
  id      +---+---+
          |   |   |
          E   *   E
          |       |
          id      id

and

E
⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id

              E
              |
        +-----+-----+
        |     |     |
        E     *     E
        |           |
    +---+---+       id
    |   |   |
    E   +   E
    |       |
    id      id

(Is ambiguity a problem only for top-down parsing? Of course not; it affects bottom-up parsing as well.)

This grammar can be made unambiguous by introducing extra non-terminals that encode operator precedence (so * binds tighter than +):

E := E + P | P
P := P * Q | Q
Q := id

Not every ambiguous CFG has an unambiguous equivalent. A context-free language for which every grammar is ambiguous is said to be inherently ambiguous.

No algorithm exists that can tell whether an arbitrary grammar is ambiguous. Likewise, there is no algorithm that can produce an unambiguous counterpart of an ambiguous grammar. Both of these are undecidable problems.

Terms

Nullable non-terminal

A non-terminal is nullable if it can derive ε, even if the derivation takes several steps.
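A minimal sketch of computing the nullable set as a fixed point; the example grammar and its encoding (dict of non-terminal to productions, with [] standing for ε) are made-up assumptions for illustration.

GRAMMAR = {
    "S": [["A", "B"], ["a"]],
    "A": [["b"], []],
    "B": [["A"], ["c"]],
}

def nullable(grammar):
    null = set()
    changed = True
    while changed:                       # keep going until no new non-terminal is added
        changed = False
        for nt, prods in grammar.items():
            if nt in null:
                continue
            # nullable if some production body consists only of nullable symbols
            if any(all(sym in null for sym in prod) for prod in prods):
                null.add(nt)
                changed = True
    return null

print(nullable(GRAMMAR))   # {'A', 'B', 'S'}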

Chomsky Normal Form (CNF)

A grammar is in CNF if every rule is of one of the following forms:

A → BC
A → a

Every grammar in CNF is a context-free grammar (CNF grammars form a subset of CFGs), and every context-free language not containing ε has an equivalent grammar in CNF.

Greibach Normal Form (GNF)

A grammar is in GNF if every rule is of the form A → aα, where a is a terminal and α is a (possibly empty) string of non-terminals.

Left recursion

Eliminating left recursion

A → Aα | β

can be made into

A  → βA'
A' → αA' | ε

Or if there are multiple productions for the non-terminal,

A  → Aα₁ | Aα₂ | ... | Aαₘ | β₁ | β₂ | ... | βₙ

can become

A  → β₁A' | β₂A' | ... | βₙA' 
A' → α₁A' | α₂A' | ... | αₘA' | ε
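A rough Python sketch of this transformation for a single non-terminal with immediate left recursion (the grammar encoding is an assumption); applied to Eg2's L → L , S | S it yields L → S L' and L' → , S L' | ε.

def eliminate_immediate_left_recursion(nt, productions):
    """Split the productions of `nt` into left-recursive ones (A → Aα) and the rest (A → β)."""
    alphas = [p[1:] for p in productions if p and p[0] == nt]      # the α parts
    betas  = [p     for p in productions if not p or p[0] != nt]   # the β parts
    if not alphas:
        return {nt: productions}                                   # nothing to do
    new = nt + "'"
    return {
        nt:  [beta + [new] for beta in betas],                     # A  → βA'
        new: [alpha + [new] for alpha in alphas] + [[]],           # A' → αA' | ε  ([] is ε)
    }

# L → L , S | S   becomes   L → S L'   and   L' → , S L' | ε
print(eliminate_immediate_left_recursion("L", [["L", ",", "S"], ["S"]]))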

Left factoring

Making a grammar suitable for predictive, or top-down parsing.

If two productions of a non-terminal have a common prefix, a predictive parser cannot choose between them until enough of the input (more than what the common prefix matches) has been seen.

A -> αβ₁
A -> αβ₂

We can fix this by factoring out the common prefix and introducing a new non-terminal for what follows it:

A  -> αA'
A' -> β₁ | β₂
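A rough sketch of one left-factoring step for a pair of productions with a common prefix; the list encoding is an assumption, and [] stands for ε.

def left_factor_pair(nt, p1, p2):
    """Factor out the longest common prefix of two productions of `nt`."""
    prefix = []
    for x, y in zip(p1, p2):
        if x != y:
            break
        prefix.append(x)
    if not prefix:
        return {nt: [p1, p2]}                        # nothing in common, leave as-is
    new = nt + "'"
    return {
        nt:  [prefix + [new]],                       # A  → αA'
        new: [p1[len(prefix):], p2[len(prefix):]],   # A' → β₁ | β₂  ([] means ε)
    }

# stmt → if E then S | if E then S else S
print(left_factor_pair("stmt",
                       ["if", "E", "then", "S"],
                       ["if", "E", "then", "S", "else", "S"]))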

Single production

A production rule whose body consists solely of a single non-terminal. Like: A → B

ε-production

A production rule whose RHS (or body) is ε. Like: A → ε

ε-free grammar

A grammar without any ε-productions.

CYK algorithm
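CYK is a dynamic-programming algorithm that decides whether a string belongs to the language of a grammar in CNF. A short Python sketch; the example grammar here (S → AB | BC, A → BA | a, B → CC | b, C → AB | a) is a made-up illustration, not something from these notes.

CNF = {
    "S": [["A", "B"], ["B", "C"]],
    "A": [["B", "A"], ["a"]],
    "B": [["C", "C"], ["b"]],
    "C": [["A", "B"], ["a"]],
}

def cyk(word, grammar, start="S"):
    n = len(word)
    # table[i][j] = set of non-terminals deriving word[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):                    # length-1 substrings
        for nt, prods in grammar.items():
            if [ch] in prods:
                table[i][0].add(nt)
    for length in range(2, n + 1):                   # substring length
        for i in range(n - length + 1):              # start position
            for split in range(1, length):           # split point
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for nt, prods in grammar.items():
                    for prod in prods:
                        if len(prod) == 2 and prod[0] in left and prod[1] in right:
                            table[i][length - 1].add(nt)
    return start in table[0][n - 1]

print(cyk("baaba", CNF))   # True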

Derivations

For a grammar

E := E + E
   | E * E
   | - E
   | ( E )
   | id

E => -E means E derives -E.

And the sequence of replacements

E => -E => -(E) => -(id)

is a derivation of -(id) from E.

For a grammar G with start symbol S, if S =>* α, we say that α is a sentential form of G. Sentential form may contain terminals and non-terminals. May even be empty.

A sentence of G is a sentential form without non-terminals.

Language generated by a grammar is the set of its sentences.

Two grammars are equivalent if the languages generated by them are the same.

Left-most derivation

The left-most non-terminal in the sentential form is always the one chosen as the next non-terminal to be replaced.

E =>* -(id+id)

E => -E => -(E) => -(E + E) => -(id + E) => -(id + id)

If S =>* α by left-most derivation, we can say that α is a left-sentential form of G.
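A tiny sketch that performs left-most derivation steps mechanically for the grammar above; the list-of-symbols encoding and the leftmost_step helper are assumptions, and the chosen production indices reproduce E => -E => -(E) => -(E + E) => -(id + E) => -(id + id).

GRAMMAR = {"E": [["E", "+", "E"], ["E", "*", "E"], ["-", "E"],
                 ["(", "E", ")"], ["id"]]}

def leftmost_step(sentential, production_index):
    """Replace the left-most non-terminal using the chosen production."""
    for i, sym in enumerate(sentential):
        if sym in GRAMMAR:
            return sentential[:i] + GRAMMAR[sym][production_index] + sentential[i + 1:]
    return sentential                      # already a sentence; nothing to expand

form = ["E"]
for choice in [2, 3, 0, 4, 4]:             # -E, -(E), -(E+E), -(id+E), -(id+id)
    form = leftmost_step(form, choice)
print(form)                                # ['-', '(', 'id', '+', 'id', ')']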

Right-most derivation

The right-most non-terminal in the sentential form is always the one chosen as the next non-terminal to be replaced.

E =>* -(id+id)

E => -E => -(E) => -(E + E) => -(E + id) => -(id + id)

If S =>* α by right-most derivation, we can say that α is a right-sentential form of G.

Right-most derivations are also known as canonical derivations. (WHY??)

Parse tree

E =>* -(id+id)

   E
   |
+--+--+
|     |
-     E
   +--+--+
   |  |  |
   (  E  )
      |
   +--+--+
   |  |  |
   E  +  E
   |     |
  id    id

The parse tree is the same whether we use the left-most or the right-most derivation; the derivation order only changes the order in which the tree's nodes are expanded.

Ambiguous grammars

A grammar capable of producing more than one parse tree for the same sentence, i.e., more than one left-most derivation (equivalently, more than one right-most derivation) for the same sentence.

For the grammar

E := E + E
   | E * E
   | ( E )
   | id

id + id * id can have two parse trees.

Left-most derivation:

Right-most derivation:

Grammars

Ambiguity

Example CFGs

:TODO:

2 zeros: Z -> 001 | 010 | 100
1 one:   O -> 1

This is actually a regular language...

https://cs.stackexchange.com/questions/110821/context-free-grammar-for-language-with-even-number-of-0s-and-1s

Dbts

References