CS 541 Lecture          -*- Outline -*-

* Lexical (or micro-) syntax

** motivation is abstraction (working at the right level)
   want to describe the syntax of the language at a higher level than characters;
   the higher-level things are called tokens (words).
   also may want to have different sets of tokens (publication vs. working syntax),
   but to do that we have to define the syntax of tokens.
   token syntax is typically described as a regular language because it is
     - very simple
     - regularity guarantees it is not ambiguous
     - very fast to parse

** Lexical conventions

*** blanks, whitespace
    often used to separate tokens (words)

*** reserved words: if, then, etc.
    cannot be used as identifiers

*** keywords: (in Algol 60) are distinguished (by font, quotes, etc.)

*** keywords in context: keywords only recognized in certain contexts
      /* PL/I example */
      IF IF=THEN THEN THEN=ELSE; ELSE; IF=THEN=ELSE;

   (a small tokenizer sketch illustrating these conventions appears at the
    end of these notes)

** Regular expressions (Watt, section 2.2.2)
   go quickly...
---------------
Denotational semantics of REGULAR EXPRESSIONS

Syntax:
  re ::= char | \epsilon | re re | re '|' re | re * | ( re ) | re +

Semantics: for all c in char, r1, r2 in re
  M[[c]]        = {"c"}
  M[[\epsilon]] = {""}
  M[[r1 r2]]    = {st | s in M[[r1]], t in M[[r2]]}
  M[[r1 | r2]]  = M[[r1]] \union M[[r2]]
  M[[r1*]]      = {s1 s2 ... sn | n >= 0, each si in M[[r1]]}  (includes "")
  M[[(r1)]]     = M[[r1]]
  M[[r1+]]      = M[[r1 (r1*)]]
---------------
Examples:
  ab          denotes {"ab"}
  a|b         denotes {"a", "b"}
  a*          denotes {"", "a", "aa", ...}  (all sequences of a's)
  (ab)*       denotes {"", "ab", "abab", "ababab", ...}
  (a|b)*      denotes {"", "a", "b", "ab", "ba", "aaa", ...}  (all strings of a's and b's)
  (\epsilon|a)(a)*b   denotes the same language as a*b  -- many equivalent forms
  "(a|b|c|...|z|_|"")*"   -- string literals, with "" standing for " inside

Each regular language can be described by a regular expression (and vice versa).

Precedence: * has the highest priority, | the lowest
  (Exercise: rewrite the grammar above to show that.)

   (a Python sketch of the M[[.]] equations appears at the end of these notes)

** regular exps vs. regular grammars
   grammars help by giving names to things;
   better for syntax-directed documentation.
   res are closer to finite automata, and thus more convenient for compilers.

*** translation
    see Watt 2.2.3
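
For the translation just mentioned, here is a sketch of what the end product
looks like (not Watt's construction itself): a hand-written DFA transition
table for a*b, plus a driver loop that runs it in time linear in the input,
which is why regular token syntax is "very fast to parse". The table and
names are made up here for illustration.
---------------
# DFA for a*b, written by hand (state 0 = start, state 1 = accepting).
DFA = {
    (0, "a"): 0,   # loop on a's
    (0, "b"): 1,   # the final b
}
ACCEPTING = {1}

def matches(s):
    state = 0
    for ch in s:
        if (state, ch) not in DFA:
            return False          # no transition: reject
        state = DFA[(state, ch)]
    return state in ACCEPTING

# "b", "ab", "aaab" are in M[[a*b]]; "", "a", "abb" are not.
for s in ["b", "ab", "aaab", "", "a", "abb"]:
    print(repr(s), matches(s))
---------------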
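
The M[[.]] equations for regular expressions can also be read as a small
program. Below is a rough Python sketch under one assumption: since M[[r*]]
is an infinite set, we enumerate only the strings up to a length bound. The
function names are invented for these notes.
---------------
def chars(c):                 # M[[c]] = {"c"}
    return {c}

def epsilon():                # M[[\epsilon]] = {""}
    return {""}

def seq(L1, L2, bound):       # M[[r1 r2]] = {st | s in L1, t in L2}
    return {s + t for s in L1 for t in L2 if len(s + t) <= bound}

def alt(L1, L2):              # M[[r1 | r2]] = L1 union L2
    return L1 | L2

def star(L1, bound):          # M[[r1*]]: all concatenations of strings from L1
    result = {""}
    while True:
        bigger = result | seq(result, L1, bound)
        if bigger == result:  # no new strings under the length bound
            return result
        result = bigger

def plus(L1, bound):          # M[[r1+]] = M[[r1 (r1*)]]
    return seq(L1, star(L1, bound), bound)

# (a|b)* restricted to length <= 2:
print(sorted(star(alt(chars("a"), chars("b")), 2)))
# -> ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
---------------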
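
Finally, a minimal tokenizer sketch for the lexical conventions section (not
the course's scanner; the token set, names, and reserved words are made up
for illustration). It uses Python's re module, treats whitespace purely as a
separator, and checks identifiers against a reserved-word set.
---------------
import re

RESERVED = {"if", "then", "else"}

TOKEN_RE = re.compile(r"""
      (?P<NUM>   \d+)
    | (?P<ID>    [a-z_][a-z0-9_]*)
    | (?P<OP>    [=+\-*/;()])
    | (?P<SKIP>  \s+)
""", re.VERBOSE | re.IGNORECASE)

def tokenize(text):
    for m in TOKEN_RE.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                      # whitespace only separates tokens
        if kind == "ID" and lexeme.lower() in RESERVED:
            kind = lexeme.upper()         # reserved word, not an identifier
        yield (kind, lexeme)

# Example: "if x = 1 then y = 2;" yields
#   IF, ID x, OP =, NUM 1, THEN, ID y, OP =, NUM 2, OP ;
print(list(tokenize("if x = 1 then y = 2;")))
---------------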