I. The Nature of Program Analysis (1.1) A. goals What are the goals of program analysis? ------------------------------------------ GOALS OF PROGRAM ANALYSIS ------------------------------------------ B. ideas What are the main ideas of program analysis? ------------------------------------------ MAIN IDEAS ------------------------------------------ What's the difference between a semantics-based approach and a semantics-directed approach? C. applications What other areas, besides optimizing compilers, could the ideas and goals of program analysis usefully be applied? II. Setting the Scene (1.2) A. The While Language 1. concrete syntax ------------------------------------------ CONCRETE SYNTAX FOR THE WHILE LANGUAGE % ---- Regular (lexical) syntax ---- ::= \r | \n | \r\n ::= "any character except a " ::= % ::= | " " | \t | \f ::= != | == | < | > | <= | >= ::= + | - ::= * | / ::= "any letter" | ' | _ ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ::= ( | )* ::= (-)?()+ % ---- Context-Free syntax ---- ::= ::= := | skip | | while do | if then else ::= '{' '}' ::= | ; ::= | ::= | ::= | | ( ) ::= | or ::= | and ::= | not | | ( ) ::= true | false ------------------------------------------ ------------------------------------------ EXAMPLE y := x; z := 1; while y>1 do { z := z*y; y := y-1 }; y := 0 ------------------------------------------ 2. abstract syntax ------------------------------------------ ABSTRACT SYNTAX OF THE WHILE LANGUAGE a \in AExp "arithmetic expressions" b \in BExp "Boolean expressions" S \in Stmt "statements" x,y \in Var "variables" n \in Num "numeric literals" l \in Lab "labels" opa \in Op_a "arithmetic operators" opb \in Op_b "Boolean operators" opr \in Op_r "relational operators" S ::= [x:= a]^l | [skip]^l | S1 ; S2 | if [b]^l then S1 else S2 | while [b]^l do S a ::= x | n | a1 opa a2 b ::= true | false | not b | b1 opb b2 | a1 opr a2 ------------------------------------------ What's the difference between abstract and concrete syntax? ------------------------------------------ EXAMPLE [y := x]^1; [z := 1]^2; while [y>1]^3 do ([z := z*y]^4; [y := y-1]^5); [y := 0]^6 ------------------------------------------ Why are the labels only attached to certain places in the abstract syntax trees? Which blocks are elementary? What is used to disambiguate different parse trees? 3. semantics What does the example above do? What does skip do? What kind of numeric literals are in this language? What would be some reasonable arithmetic operators? One would be some reasonable relational operators? What would be some reasonable Boolean operators? What's the type of the relational operators? Can the labels be repeated? What's missing? 4. variations ------------------------------------------ VARIATION 1: CONTROL FEATURES S ::= ... | break | for x in a1 .. a2 do S | throw | try S1 catch S2 VARIATION 2: TAINTING and INFORMATION FLOW S ::= ... | read x | sanitize x | print a VARIATION 3: SPECIFICATION FEATURES S ::= ... | assert b | assume b | choose S1 or S2 VARIATION 4: PARALLELISM S ::= ... | S1 `||' S2 VARIATION 5: FUNCTIONS a ::= ... | fn x => a | a1 a2 | let d in a d ::= x = a | d1; d2 ------------------------------------------ Which of these need labels, and where do the labels go? B. reaching definitions analysis 1. definition ------------------------------------------ REACHING DEFINITIONS (ASSIGNMENTS) def: Let P be a program. An assignment [x := a]^l *may reach* a program point in P if in some execution of P, when execution reaches that point, the last assignment to x was done at l. RD(P,point) says what assignments may reach point in P. What can we know for certain? What kind of errors can this detect? ------------------------------------------ How to describe program points? What can we tell for certain from this analysis's results? What kind of errors could this analysis be used to detect/prevent? 2. examples ------------------------------------------ EXAMPLE [y := x]^1; [z := 1]^2; while [y>1]^3 do ([z := z*y]^4; [y := y-1]^5); [y := 0]^6 (y,1) reaches entry to 2 ------------------------------------------ What definitions reach the entry for label 3? What definitions can't reach the entry for label 3? How do we talk about uninitialized variables? In table 1.1, what if we add (y,1) to RDentry(5)? In table 1.1, what if we remove (y,5) from RDentry(6)? ------------------------------------------ FOR YOU TO DO Compute a table like table 1.1 for: [t := x]^1; [x := y]^2; [y := t]^3 if [y>x]^4 then [r := y]^5 else [r := x]^6; assert [r >= x and r >= y]^7 ------------------------------------------ 3. variations What would the analysis look like if we only wanted to keep track of which variables were assigned? What would the analysis look like if we want to keep track of what variables could influence the values of other variables? III. Dataflow Analysis (1.3) A. goals B. idea ------------------------------------------ IDEA OF DATA FLOW ANALYSIS What's the basic idea? What is a data flow graph? How is that used to model the semantics? ------------------------------------------ C. example ------------------------------------------ EXAMPLE [y := 0]^1; [print y]^2; [read x]^3; while [x < 0]^4 do ([y := y+1]^5; [print y]^6; [read x]^7); [z := x]^8 What's the flow graph for this? ------------------------------------------ How would you handle Read? Print? How would you handle if then else statements? For loops? How would you handle break? How would you handle try-catch and throw? How would you handle assert? Assume? Choose? Parallel composition? 1. the equational approach (1.3.1) ------------------------------------------ NODE AND EDGE EQUATIONS FOR TAINT ANALYSIS Taint analysis: find the set of variables each program point that may have a value derived from a value previously read from the user ("tainted") Tentry, Texit : Lab* -> Powerset(Var*) where Lab* = set of labels in program Var* = set of variables in prog block Equation ======================================= [x:=a]^l Texit(l) = [skip]^l Texit(l) = [b]^l Texit(l) = [read x]^l Texit(l) = [sanitize x]^l Texit(l) = [print x]^l Texit(l) = How are edges connected? ------------------------------------------ Why are these the right equations? How does this work out for our example? a. algorithm for solving the equations What can we do to solve a set of simultaneous equations? ------------------------------------------ LEAST SOLUTION F: (Powerset(Var*))^16 -> (Powerset(Var*))^16 F is defined by: F(T_1, ..., T_{16}) = (F_1(T_1, ..., T_{16}), F_2(T_1, ..., T_{16}), ..., F_{16}(T_1, ..., T_{16})) where F_1(T_1, ..., T_{16}) // Tentry(1) = {} F_2(T_1, ..., T_{16}) // Texit(1) = (T_1 - {y}) \cup {} F_3(T_1, ..., T_{16}) // Tentry(2) = T_2 F_4(T_1, ..., T_{16}) // Texit(2) = T_3 F_5(T_1, ..., T_{16}) // Tentry(3) = T_4 F_6(T_1, ..., T_{16}) // Texit(3) = T_5 \cup {x} F_7(T_1, ..., T_{16}) // Tentry(4) = F_8(T_1, ...., T_{16}) // Texit(4) = Solution, is a 16-tuple of the form (Tentry(1), Texit(1), ..., Tentry(8), Texit(8)) Such a 16-tuple is a solution if: (Tentry(1), Texit(1), ..., Tentry(8), Texit(8)) = F(Tentry(1), Texit(1), ..., Tentry(8), Texit(8)) ------------------------------------------ Would this still be a 16-tuple if there were seven elementary blocks? How to compare such tuples? What does the lattice structure of (Powerset(Var*))^16 look like? How can one find the fixed point using this information? Why is the fixed point a solution? Why do we want the least fixed point? 2. the constraint based approach What are the constraints? ------------------------------------------ WHAT CONSTRAINTS FOR THIS EXAMPLE? [y := 0]^1; [print y]^2; [read x]^3; while [x < 0]^4 do ([y := y+1]^5; [print y]^6; [read x]^7); [z := x]^8 What's the flow graph for this? ------------------------------------------ How do the flows work? What's the connection to the equations? In what direction do the subset constraints go? IV. constraint based analysis (1.4) A. goals What's the difference between a data flow analysis and a control flow analysis? ------------------------------------------ DATA FLOW VS. CONTROL FLOW ANALYSIS Main difference? ------------------------------------------ Isn't control flow obvious in all languages? B. setting ------------------------------------------ SETTING 1. Convert all control structures to functions and function calls. 2. Analysis finds what functions can be called from each point ------------------------------------------ Why switch to a functional language for this section? ------------------------------------------ CONTINUATION PASSING STYLE An intermediate language with one control structure Idea: every expression takes a "continuation" to which it sends its result Examples: x < 0 ==> [fn k => [[[%< x] 0] k]] if [x < 0] then [y := 22] else [z := 33] ==> [fn k => [[[%< x] 0] [%if [[y := 22] k] [[z := 33] k]]]] ------------------------------------------ ------------------------------------------ LANGUAGE (p. 140) Work in a functional language: e \in Exp t \in Term f,x \in Var c \in Const op \in Op l \in Lab e ::= t^l t ::= c | x | fn x => e_0 "non-recursive fun" | fun f x => e_0 "recursive fun def" | e_1 e_2 | if e_0 then e_1 else e_2 | let x = e_1 in e_2 | e_1 op e_2 ------------------------------------------ What are the atomic blocks being labeled here? C. idea What are the main ideas in this approach? ------------------------------------------ IDEAS OF CONSTRAINT BASED ANALYSIS - assume no side effects ==> associate information with labels - use a pair of functions, (C,p): C: Lab* -> Powerset(Value) C(l) contains possible values for subexpression at label l p: Var* -> Powerset(Value) p(x) constains possible values for variable x ------------------------------------------ What's the alternative to associating information directly with labels? How could this information be useful in an object-oriented program? How is this different than a type system? ------------------------------------------ APPROACH - collect constraints for function abstractions: e.g., given [fn x => [x]^1]^2 get {[fn x => [x]^1]} \subseteq C(2) ------------------------------------------ What's the value of a function definition? Why do they just use the term instead of a closure? What's the general pattern here? ------------------------------------------ for variables: e.g., given [x]^1 get p(x) \subseteq C(1) for applications: e.g., given [[f]^1 [e]^2]^3 get {v | g \in C(1), a \in C(2), and v = (g a)} \subseteq C(3) ------------------------------------------ What's the general rule? What happens in [[fn x => [x]^1]^2 [fn y => [y]^3]^4]^5 ? What would be the constraints for an expression of the form [[e1]^1 op [e2]^2]^3 ? What would be the constraints for an if-then-else expression? What would be the constraints for a let expression, like let x = e1 in e2 ? V. Abstract Interpretation (1.5) A. goals B. idea What's the basic idea? ------------------------------------------ IDEA OF ABSTRACT INTERPRETATION (1.5) ------------------------------------------ What's a collecting semantics? How is that used to extract the analysis? C. example ------------------------------------------ EXAMPLE [y := 0]^1; [print y]^2; [read x]^3; while [x < 0]^4 do ([y := y+1]^5; [print y]^6; [read x]^7); [z := x]^8 For taint analysis we seek sets of variables at each program point that may have a value derived from a value previously read from the user ("tainted") ------------------------------------------ What is a collecting semantics? What would a collecting semantics look like for this example? Why is CSentry(1) = {(x,?,{}),(y,?,{}),(z,?,{})} ? How is this different than the reaching definitions analysis. it's sets of ordered variable names instead of traces Q: How would you handle if then else statements? How would you handle if then else statements? For loops? How would you handle break? 1. solving the equations ------------------------------------------ SOLVING THE EQUATIONS Traces = Powerset((Var x Lab? x Dependants)*) G: Traces^{16} -> Traces^{16} G is defined by: G(CS_1, ..., CS_{16}) = (G_1(CS_1, ..., CS_{16}), G_2(CS_1, ..., CS_{16}), ..., G_{16}(CS_1, ..., CS_{16})) where G_1(CS_1, ..., CS_{16}) // CSentry(1) = {(x,?,{}),(y,?,{}),(z,?,{})} G_2(CS_1, ..., CS_{16}) // CSexit(1) = {tr : (y,1,{}) | tr \in CS_1} G_3(CS_1, ..., CS_{16}) = CS_2 ... Solution (CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8)) is a solution if G(CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8)) = (CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8)) ------------------------------------------ What is G_1? Why does it have 16 parameters? What is G_2? Why is G_3 = CS_2 in this example? What's the ordering on the solution space? What does it mean for G to be monotone? 2. Galois connections ------------------------------------------ ABSTRACTION AND CONCRETIZATION abstraction function for Taint analysis: a: Traces -> Powerset(Var*) a(trs) = {x | read \in depends(x,tr), tr \in trs} concretization function for Taint analysis: g: Powerset(Var*) -> Traces g(Y) = {tr | x \in Y, read \in depends(x,tr)} Adjunction, or Galois connection: a(X) \subseteq Y <==> X \subseteq g(Y) ------------------------------------------ Would a and g be different for the RD analysis? ------------------------------------------ set of traces set of vars |---------------| |---------------| | | g | | | g(Y) <---------------- Y | | U| | | U| | | X -----------------> a(X) | | | a | | |_______________| |_______________| ------------------------------------------ 3. calculating the analysis Why do we care about the abstraction and concretization functions? ------------------------------------------ CALCULATING THE ANALYSIS Extend a and g pointwise to tuples: a(TR_1, ..., TR_16) = (a(TR_1), ..., a(TR_16)) g(Y_1, ..., Y_16) = (g(Y_1), ..., g(Y_16)) Define the analysis by the function a o G o g: Powerset(Var*)^16 -> Powerset(Var*)^16 so for each i in {1..12} (a o G_i o g): Powerset(Var*)^16 -> Powerset(Var) by a(G_1(g(T_1, ..., T_16))) = a({(x,?,{}),(y,?,{}),(z,?,{})}) = {} a(G_2(g(T_1, ..., T_16))) = a({tr : (y,1,{}) | tr \in CS_1}) ... So a solution (Tentry(1), ..., Texit(8)) has the property that (Tentry(1), ..., Texit(8)) = (a o G o g)(Tentry(1), ..., Texit(8)) ------------------------------------------ What's T_3? How does this compare to the analysis we created by hand? What's the benefit of doing things this way? VI. Type and Effect Systems (1.6) A. goals B. idea What's the basic idea? ------------------------------------------ TYPE AND EFFECT SYSTEMS (1.6) Basic idea? ------------------------------------------ C. annotated type system example 1. annotated base types ------------------------------------------ ANNOTATED TYPE SYSTEM EXAMPLE [asg] [x := a]^l : T1 -> T2 if T2 = (T1 - {x}) \cup {x | FV(a) \cap T1 \neq {} } [skip] [skip]^l : T -> T S1 : T1 -> T2, S2: T2 -> T3 [seq] -------------------------------- S1; S2 : T1 -> T3 S1 : T1 -> T2, S2: T1 -> T2 [if] -------------------------------- if [b]^l then S1 else S2 : T1 -> T2 S : T1 -> T1 [wh] -------------------------------- while [b]^l then S : T1 -> T1 [re] [read x]^l : T1 -> T2 if T2 = T1 \cup {x} [sa] [sanatize x]^l : T1 -> T2 if T2 = T1 - {x} [pr] [print x]^l : T1 -> T1 if x \not\in T1 S : T2 -> T3 [sub] -------------- if T1 \subseteq T2, S : T1 -> T4 T3 \subseteq T4 ------------------------------------------ What's the meaning of the basic types? What's the meaning of the arrow types, like T1 -> T4? How would you explain the assignment rule? How would you explain the if rule? Why don't we have to check the type of the condition in an if? How would you explain the while rule? How would you explain the sub rule? a. type checking ------------------------------------------ EXAMPLE [y := 0]^1; [print y]^2; [read x]^3; while [x < 0]^4 do ([y := y+1]^5; [print y]^6; [read x]^7); [z := x]^8 ------------------------------------------ ------------------------------------------ TYPE CHECKING Idea: accumulate constraints. [y := 0]^1: T1 -> T2 [asg] [print y]^2: T2 -> T3 [pr] ___________________________________ [seq] ([y := 0]^1;[print y]^2) : T1 -> T3 if T2 = T1-{y} and y \not\in T2 and T3 = T2 ------------------------------------------ So what are all the constraints? ------------------------------------------ CONSTRAINTS T2 = T1-{y} y \not\in T2 T3 = T2 T4 = T3 \cup {x} T6 = (T5-{y}) \cup ({y} \cap T5) T7 = T6 y \not\in T6 T8 = T7 \cup {x} T9 \subseteq T5 T8 \subseteq T9 T4 = T9 T10 = T9-{z} \cup ({x} \cap T9) So, what's a solution? ------------------------------------------ 2. annotated type constructors ------------------------------------------ EXAMPLE Judgments: XMust S : Sigma -------------> Sigma YMay Where XMust and YMay are sets of variables (that S must assign and may assign). {x} [asg] [x := a]^l : Sigma ----> Sigma {x} {} [skip] [skip]^l : Sigma -----> Sigma {} X1 S1 : Sigma -----> Sigma, Y1 X2 S2 : Sigma -----> Sigma Y2 [seq] -------------------------------- X3 S1; S2 : Sigma ----> Sigma Y3 if X3 = X1 \cup X2, Y3 = Y1 \cup Y2 X1 S1 : Sigma -----> Sigma, Y1 X2 S2 : Sigma -----> Sigma Y2 [if] -------------------------------- if [b]^l then S1 else S2 X3 : Sigma ----> Sigma Y3 if X3 = X1 \cap X2, Y3 = Y1 \cup Y2 X S : Sigma -----> Sigma Y [wh] -------------------------------- while [b]^l then S {} : Sigma ----> Sigma Y X S : Sigma ----> Sigma Y [sub] ---------------------- X' S : Sigma -----> Sigma Y' if X' \subseteq X, Y \subseteq Y' ------------------------------------------ Why does the assignment rule make sense? How do you explain the if rule? How do you explain the while rule? How would you deal with assert statements? How would you deal with a try statement? ------------------------------------------ TYPE CHECKING EXAMPLE TYPE CHECKING Idea: accumulate constraints. {q} [q := 0]^1: Sigma ---> Sigma , [asg] {q} {r} [r := x]^2: Sigma ---> Sigma [asg] {r} _________________________________ [seq] ([q := 0]^1;[r := x]^2) {q,r} : Sigma ----> Sigma {q,r} {r} [r := r-y]^4: Sigma ---> Sigma, [asg] {r} {q} [q := q+1]^5: Sigma ---> Sigma [asg] {q} _________________________________ [seq] ([r := r-y]^4;[q := q+1]^5) {q,r} : Sigma ----> Sigma {q,r} ___________________________________[wh] while [r >= y]^3 do ([r := r-y]^4;[q := q+1]^5) {} : Sigma ----> Sigma {q,r} so by the seq rule, ------------------------------------------ How is this different than with annotated base types? D. effect systems 1. example: call tracking analysis ------------------------------------------ EXAMPLE Judgments: Gamma |- e : t & phi where Gamma : Var -> Type e : Expression t : Type phi : Effect phi Type = int | bool | t1 ---> t2 phi : Powerset(FunName) [var] Gamma |- x : t & {}, if Gamma(x) = t Gamma[x |-> tx] |- e : t & phi [fn] -------------------------------- Gamma |- fn_pi x => e phi2 : tx ------> t & {} if phi2 = phi \cup {pi} phi Gamma |- e1 : t2 ---> t & phi1, Gamma |- e2 : t2 & phi2 [app] -------------------------------- Gamma |- e1 e2 : t & phi3 if phi3 = phi1 \cup phi2 \cup phi ------------------------------------------ What do the judgments mean? What do the arrow types mean? How would you explain the fn rule? How would you explain the app rule?