I. The Nature of Program Analysis (1.1)
A. goals
What are the goals of program analysis?

GOALS OF PROGRAM ANALYSIS

B. ideas
What are the main ideas of program analysis?

MAIN IDEAS

What's the difference between a semanticsbased approach and
a semanticsdirected approach?
C. applications
What other areas, besides optimizing compilers, could the ideas
and goals of program analysis usefully be applied?
II. Setting the Scene (1.2)
A. The While Language
1. concrete syntax

CONCRETE SYNTAX FOR THE WHILE LANGUAGE
%  Regular (lexical) syntax 
::= \r  \n  \r\n
::= "any character except a "
::= %
::=  " "  \t  \f
::= !=  ==  <  >  <=  >=
::= +  
::= *  /
::= "any letter"  '  _
::= 0  1  2  3  4  5  6  7  8  9
::= (  )*
::= ()?()+
%  ContextFree syntax 
::=
::= :=
 skip

 while do
 if then else
::= '{' '}'
::=  ;
::=

::=

::=   ( )
::=
 or
::=
 and
::=
 not

 ( )
::= true  false


EXAMPLE
y := x;
z := 1;
while y>1
do { z := z*y;
y := y1 };
y := 0

2. abstract syntax

ABSTRACT SYNTAX OF THE WHILE LANGUAGE
a \in AExp "arithmetic expressions"
b \in BExp "Boolean expressions"
S \in Stmt "statements"
x,y \in Var "variables"
n \in Num "numeric literals"
l \in Lab "labels"
opa \in Op_a "arithmetic operators"
opb \in Op_b "Boolean operators"
opr \in Op_r "relational operators"
S ::= [x:= a]^l
 [skip]^l
 S1 ; S2
 if [b]^l then S1 else S2
 while [b]^l do S
a ::= x
 n
 a1 opa a2
b ::= true
 false
 not b
 b1 opb b2
 a1 opr a2

What's the difference between abstract and concrete syntax?

EXAMPLE
[y := x]^1;
[z := 1]^2;
while [y>1]^3
do ([z := z*y]^4;
[y := y1]^5);
[y := 0]^6

Why are the labels only attached to certain places in the
abstract syntax trees?
Which blocks are elementary?
What is used to disambiguate different parse trees?
3. semantics
What does the example above do?
What does skip do?
What kind of numeric literals are in this language?
What would be some reasonable arithmetic operators?
One would be some reasonable relational operators?
What would be some reasonable Boolean operators?
What's the type of the relational operators?
Can the labels be repeated?
What's missing?
4. variations

VARIATION 1: CONTROL FEATURES
S ::= ...
 break
 for x in a1 .. a2 do S
 throw
 try S1 catch S2
VARIATION 2: TAINTING and INFORMATION FLOW
S ::= ...
 read x
 sanitize x
 print a
VARIATION 3: SPECIFICATION FEATURES
S ::= ...
 assert b
 assume b
 choose S1 or S2
VARIATION 4: PARALLELISM
S ::= ...
 S1 `' S2
VARIATION 5: FUNCTIONS
a ::= ...
 fn x => a
 a1 a2
 let d in a
d ::= x = a
 d1; d2

Which of these need labels, and where do the labels go?
B. reaching definitions analysis
1. definition

REACHING DEFINITIONS (ASSIGNMENTS)
def: Let P be a program.
An assignment [x := a]^l
*may reach* a program point in P
if in some execution of P,
when execution reaches that point,
the last assignment to x was done at l.
RD(P,point) says what assignments may reach point in P.
What can we know for certain?
What kind of errors can this detect?

How to describe program points?
What can we tell for certain from this analysis's results?
What kind of errors could this analysis be used to detect/prevent?
2. examples

EXAMPLE
[y := x]^1;
[z := 1]^2;
while [y>1]^3
do ([z := z*y]^4;
[y := y1]^5);
[y := 0]^6
(y,1) reaches entry to 2

What definitions reach the entry for label 3?
What definitions can't reach the entry for label 3?
How do we talk about uninitialized variables?
In table 1.1, what if we add (y,1) to RDentry(5)?
In table 1.1, what if we remove (y,5) from RDentry(6)?

FOR YOU TO DO
Compute a table like table 1.1 for:
[t := x]^1;
[x := y]^2;
[y := t]^3
if [y>x]^4
then [r := y]^5
else [r := x]^6;
assert [r >= x and r >= y]^7

3. variations
What would the analysis look like if we only wanted to keep
track of which variables were assigned?
What would the analysis look like if we want to keep track of
what variables could influence the values of other variables?
III. Dataflow Analysis (1.3)
A. goals
B. idea

IDEA OF DATA FLOW ANALYSIS
What's the basic idea?
What is a data flow graph?
How is that used to model the semantics?

C. example

EXAMPLE
[y := 0]^1;
[print y]^2;
[read x]^3;
while [x < 0]^4
do ([y := y+1]^5;
[print y]^6;
[read x]^7);
[z := x]^8
What's the flow graph for this?

How would you handle Read? Print?
How would you handle if then else statements?
For loops?
How would you handle break?
How would you handle trycatch and throw?
How would you handle assert? Assume?
Choose?
Parallel composition?
1. the equational approach (1.3.1)

NODE AND EDGE EQUATIONS
FOR TAINT ANALYSIS
Taint analysis:
find the set of variables each program point
that may have a value derived from a value
previously read from the user ("tainted")
Tentry, Texit :
Lab* > Powerset(Var*)
where Lab* = set of labels in program
Var* = set of variables in prog
block Equation
=======================================
[x:=a]^l Texit(l) =
[skip]^l Texit(l) =
[b]^l Texit(l) =
[read x]^l Texit(l) =
[sanitize x]^l Texit(l) =
[print x]^l Texit(l) =
How are edges connected?

Why are these the right equations?
How does this work out for our example?
a. algorithm for solving the equations
What can we do to solve a set of simultaneous equations?

LEAST SOLUTION
F: (Powerset(Var*))^16
> (Powerset(Var*))^16
F is defined by:
F(T_1, ..., T_{16})
= (F_1(T_1, ..., T_{16}),
F_2(T_1, ..., T_{16}),
...,
F_{16}(T_1, ..., T_{16}))
where
F_1(T_1, ..., T_{16}) // Tentry(1)
= {}
F_2(T_1, ..., T_{16}) // Texit(1)
= (T_1  {y}) \cup {}
F_3(T_1, ..., T_{16}) // Tentry(2)
= T_2
F_4(T_1, ..., T_{16}) // Texit(2)
= T_3
F_5(T_1, ..., T_{16}) // Tentry(3)
= T_4
F_6(T_1, ..., T_{16}) // Texit(3)
= T_5 \cup {x}
F_7(T_1, ..., T_{16}) // Tentry(4)
=
F_8(T_1, ...., T_{16}) // Texit(4)
=
Solution, is a 16tuple of the form
(Tentry(1), Texit(1), ..., Tentry(8), Texit(8))
Such a 16tuple is a solution if:
(Tentry(1), Texit(1), ..., Tentry(8), Texit(8))
= F(Tentry(1), Texit(1), ..., Tentry(8), Texit(8))

Would this still be a 16tuple if there were seven
elementary blocks?
How to compare such tuples?
What does the lattice structure of (Powerset(Var*))^16
look like?
How can one find the fixed point using this information?
Why is the fixed point a solution?
Why do we want the least fixed point?
2. the constraint based approach
What are the constraints?

WHAT CONSTRAINTS FOR THIS EXAMPLE?
[y := 0]^1;
[print y]^2;
[read x]^3;
while [x < 0]^4
do ([y := y+1]^5;
[print y]^6;
[read x]^7);
[z := x]^8
What's the flow graph for this?

How do the flows work?
What's the connection to the equations?
In what direction do the subset constraints go?
IV. constraint based analysis (1.4)
A. goals
What's the difference between a data flow analysis and a control
flow analysis?

DATA FLOW VS. CONTROL FLOW ANALYSIS
Main difference?

Isn't control flow obvious in all languages?
B. setting

SETTING
1. Convert all control structures to
functions and function calls.
2. Analysis finds what functions
can be called from each point

Why switch to a functional language for this section?

CONTINUATION PASSING STYLE
An intermediate language
with one control structure
Idea: every expression takes a
"continuation"
to which it sends its result
Examples:
x < 0
==>
[fn k => [[[%< x] 0] k]]
if [x < 0]
then [y := 22]
else [z := 33]
==>
[fn k =>
[[[%< x] 0]
[%if [[y := 22] k]
[[z := 33] k]]]]


LANGUAGE (p. 140)
Work in a functional language:
e \in Exp
t \in Term
f,x \in Var
c \in Const
op \in Op
l \in Lab
e ::= t^l
t ::= c
 x
 fn x => e_0 "nonrecursive fun"
 fun f x => e_0 "recursive fun def"
 e_1 e_2
 if e_0 then e_1 else e_2
 let x = e_1 in e_2
 e_1 op e_2

What are the atomic blocks being labeled here?
C. idea
What are the main ideas in this approach?

IDEAS OF CONSTRAINT BASED ANALYSIS
 assume no side effects
==> associate information with labels
 use a pair of functions, (C,p):
C: Lab* > Powerset(Value)
C(l) contains possible values for
subexpression at label l
p: Var* > Powerset(Value)
p(x) constains possible values for
variable x

What's the alternative to associating information directly
with labels?
How could this information be useful in an objectoriented
program?
How is this different than a type system?

APPROACH
 collect constraints
for function abstractions:
e.g., given
[fn x => [x]^1]^2
get
{[fn x => [x]^1]} \subseteq C(2)

What's the value of a function definition?
Why do they just use the term instead of a closure?
What's the general pattern here?

for variables:
e.g., given
[x]^1
get
p(x) \subseteq C(1)
for applications:
e.g., given
[[f]^1 [e]^2]^3
get
{v  g \in C(1), a \in C(2), and
v = (g a)}
\subseteq C(3)

What's the general rule?
What happens in [[fn x => [x]^1]^2 [fn y => [y]^3]^4]^5 ?
What would be the constraints for an expression of the form
[[e1]^1 op [e2]^2]^3 ?
What would be the constraints for an ifthenelse expression?
What would be the constraints for a let expression,
like let x = e1 in e2 ?
V. Abstract Interpretation (1.5)
A. goals
B. idea
What's the basic idea?

IDEA OF ABSTRACT INTERPRETATION (1.5)

What's a collecting semantics?
How is that used to extract the analysis?
C. example

EXAMPLE
[y := 0]^1;
[print y]^2;
[read x]^3;
while [x < 0]^4
do ([y := y+1]^5;
[print y]^6;
[read x]^7);
[z := x]^8
For taint analysis we seek
sets of variables at each program point
that may have a value derived from a value
previously read from the user ("tainted")

What is a collecting semantics?
What would a collecting semantics look like for this example?
Why is CSentry(1) = {(x,?,{}),(y,?,{}),(z,?,{})} ?
How is this different than the reaching definitions analysis.
it's sets of ordered variable names instead of traces
Q: How would you handle if then else statements?
How would you handle if then else statements?
For loops?
How would you handle break?
1. solving the equations

SOLVING THE EQUATIONS
Traces = Powerset((Var x Lab? x Dependants)*)
G: Traces^{16} > Traces^{16}
G is defined by:
G(CS_1, ..., CS_{16})
= (G_1(CS_1, ..., CS_{16}),
G_2(CS_1, ..., CS_{16}),
...,
G_{16}(CS_1, ..., CS_{16}))
where
G_1(CS_1, ..., CS_{16}) // CSentry(1)
= {(x,?,{}),(y,?,{}),(z,?,{})}
G_2(CS_1, ..., CS_{16}) // CSexit(1)
= {tr : (y,1,{})  tr \in CS_1}
G_3(CS_1, ..., CS_{16})
= CS_2
...
Solution
(CSentry(1), CSexit(1), CSentry(2),
..., CSentry(8), CSexit(8))
is a solution if
G(CSentry(1), CSexit(1), CSentry(2),
..., CSentry(8), CSexit(8))
= (CSentry(1), CSexit(1), CSentry(2),
..., CSentry(8), CSexit(8))

What is G_1?
Why does it have 16 parameters?
What is G_2?
Why is G_3 = CS_2 in this example?
What's the ordering on the solution space?
What does it mean for G to be monotone?
2. Galois connections

ABSTRACTION AND CONCRETIZATION
abstraction function for Taint analysis:
a: Traces > Powerset(Var*)
a(trs) = {x  read \in depends(x,tr),
tr \in trs}
concretization function for Taint analysis:
g: Powerset(Var*) > Traces
g(Y) = {tr  x \in Y, read \in depends(x,tr)}
Adjunction, or Galois connection:
a(X) \subseteq Y <==> X \subseteq g(Y)

Would a and g be different for the RD analysis?

set of traces set of vars
 
  g  
 g(Y) < Y 
 U   U 
 X > a(X) 
  a  
_______________ _______________

3. calculating the analysis
Why do we care about the abstraction and concretization
functions?

CALCULATING THE ANALYSIS
Extend a and g pointwise to tuples:
a(TR_1, ..., TR_16)
= (a(TR_1), ..., a(TR_16))
g(Y_1, ..., Y_16)
= (g(Y_1), ..., g(Y_16))
Define the analysis by the function
a o G o g: Powerset(Var*)^16 > Powerset(Var*)^16
so for each i in {1..12}
(a o G_i o g): Powerset(Var*)^16 > Powerset(Var)
by
a(G_1(g(T_1, ..., T_16)))
= a({(x,?,{}),(y,?,{}),(z,?,{})})
= {}
a(G_2(g(T_1, ..., T_16)))
= a({tr : (y,1,{})  tr \in CS_1})
...
So a solution
(Tentry(1), ..., Texit(8))
has the property that
(Tentry(1), ..., Texit(8))
= (a o G o g)(Tentry(1), ..., Texit(8))

What's T_3?
How does this compare to the analysis we created by hand?
What's the benefit of doing things this way?
VI. Type and Effect Systems (1.6)
A. goals
B. idea
What's the basic idea?

TYPE AND EFFECT SYSTEMS (1.6)
Basic idea?

C. annotated type system example
1. annotated base types

ANNOTATED TYPE SYSTEM EXAMPLE
[asg] [x := a]^l : T1 > T2
if T2 = (T1  {x}) \cup {x  FV(a) \cap T1 \neq {} }
[skip] [skip]^l : T > T
S1 : T1 > T2, S2: T2 > T3
[seq] 
S1; S2 : T1 > T3
S1 : T1 > T2, S2: T1 > T2
[if] 
if [b]^l then S1 else S2 : T1 > T2
S : T1 > T1
[wh] 
while [b]^l then S : T1 > T1
[re] [read x]^l : T1 > T2
if T2 = T1 \cup {x}
[sa] [sanatize x]^l : T1 > T2
if T2 = T1  {x}
[pr] [print x]^l : T1 > T1
if x \not\in T1
S : T2 > T3
[sub]  if T1 \subseteq T2,
S : T1 > T4 T3 \subseteq T4

What's the meaning of the basic types?
What's the meaning of the arrow types, like T1 > T4?
How would you explain the assignment rule?
How would you explain the if rule?
Why don't we have to check the type of the condition in an if?
How would you explain the while rule?
How would you explain the sub rule?
a. type checking

EXAMPLE
[y := 0]^1;
[print y]^2;
[read x]^3;
while [x < 0]^4
do ([y := y+1]^5;
[print y]^6;
[read x]^7);
[z := x]^8


TYPE CHECKING
Idea: accumulate constraints.
[y := 0]^1: T1 > T2 [asg]
[print y]^2: T2 > T3 [pr]
___________________________________ [seq]
([y := 0]^1;[print y]^2)
: T1 > T3
if T2 = T1{y}
and y \not\in T2
and T3 = T2

So what are all the constraints?

CONSTRAINTS
T2 = T1{y}
y \not\in T2
T3 = T2
T4 = T3 \cup {x}
T6 = (T5{y}) \cup ({y} \cap T5)
T7 = T6
y \not\in T6
T8 = T7 \cup {x}
T9 \subseteq T5
T8 \subseteq T9
T4 = T9
T10 = T9{z} \cup ({x} \cap T9)
So, what's a solution?

2. annotated type constructors

EXAMPLE
Judgments:
XMust
S : Sigma > Sigma
YMay
Where XMust and YMay are sets of variables
(that S must assign and may assign).
{x}
[asg] [x := a]^l : Sigma > Sigma
{x}
{}
[skip] [skip]^l : Sigma > Sigma
{}
X1
S1 : Sigma > Sigma,
Y1
X2
S2 : Sigma > Sigma
Y2
[seq] 
X3
S1; S2 : Sigma > Sigma
Y3
if X3 = X1 \cup X2,
Y3 = Y1 \cup Y2
X1
S1 : Sigma > Sigma,
Y1
X2
S2 : Sigma > Sigma
Y2
[if] 
if [b]^l then S1 else S2
X3
: Sigma > Sigma
Y3
if X3 = X1 \cap X2,
Y3 = Y1 \cup Y2
X
S : Sigma > Sigma
Y
[wh] 
while [b]^l then S
{}
: Sigma > Sigma
Y
X
S : Sigma > Sigma
Y
[sub] 
X'
S : Sigma > Sigma
Y'
if X' \subseteq X,
Y \subseteq Y'

Why does the assignment rule make sense?
How do you explain the if rule?
How do you explain the while rule?
How would you deal with assert statements?
How would you deal with a try statement?

TYPE CHECKING EXAMPLE
TYPE CHECKING
Idea: accumulate constraints.
{q}
[q := 0]^1: Sigma > Sigma , [asg]
{q}
{r}
[r := x]^2: Sigma > Sigma [asg]
{r}
_________________________________ [seq]
([q := 0]^1;[r := x]^2)
{q,r}
: Sigma > Sigma
{q,r}
{r}
[r := ry]^4: Sigma > Sigma, [asg]
{r}
{q}
[q := q+1]^5: Sigma > Sigma [asg]
{q}
_________________________________ [seq]
([r := ry]^4;[q := q+1]^5)
{q,r}
: Sigma > Sigma
{q,r}
___________________________________[wh]
while [r >= y]^3
do ([r := ry]^4;[q := q+1]^5)
{}
: Sigma > Sigma
{q,r}
so by the seq rule,

How is this different than with annotated base types?
D. effect systems
1. example: call tracking analysis

EXAMPLE
Judgments:
Gamma  e : t & phi
where Gamma : Var > Type
e : Expression
t : Type
phi : Effect
phi
Type = int  bool  t1 > t2
phi : Powerset(FunName)
[var] Gamma  x : t & {}, if Gamma(x) = t
Gamma[x > tx]  e : t & phi
[fn] 
Gamma  fn_pi x => e
phi2
: tx > t & {}
if phi2 = phi \cup {pi}
phi
Gamma  e1 : t2 > t & phi1,
Gamma  e2 : t2 & phi2
[app] 
Gamma  e1 e2 : t & phi3
if phi3 = phi1 \cup phi2 \cup phi

What do the judgments mean?
What do the arrow types mean?
How would you explain the fn rule?
How would you explain the app rule?