I. The Nature of Program Analysis (1.1) A. goals What are the goals of program analysis? ------------------------------------------ GOALS OF PROGRAM ANALYSIS ------------------------------------------ B. ideas What are the main ideas of program analysis? ------------------------------------------ MAIN IDEAS ------------------------------------------ What's the difference between a semantics-based approach and a semantics-directed approach? C. applications What other areas, besides optimizing compilers, could the ideas and goals of program analysis usefully be applied? II. Setting the Scene (1.2) A. The While Language 1. concrete syntax ------------------------------------------ CONCRETE SYNTAX FOR THE WHILE LANGUAGE % ---- Regular (lexical) syntax ---- ::= \r | \n | \r\n ::= "any character except \r or \n" ::= % ::= | " " | \t | \f ::= != | == | < | > | <= | >= ::= + | - ::= * | / ::= "any letter" | ' | _ ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ::= ( | )* ::= (-)?()+ % ---- Context-Free syntax ---- ::= ::= := | skip | | while do | if then else ::= '{' '}' ::= | ; ::= | ::= | ::= | | ( ) ::= | or ::= | and ::= | not | | ( ) ::= true | false ------------------------------------------ ------------------------------------------ EXAMPLE y := x; z := 1; while y>1 do { z := z*y; y := y-1 }; y := 0 ------------------------------------------ 2. abstract syntax ------------------------------------------ ABSTRACT SYNTAX OF THE WHILE LANGUAGE a \in AExp "arithmetic expressions" b \in BExp "Boolean expressions" S \in Stmt "statements" x,y \in Var "variables" n \in Num "numeric literals" l \in Lab "labels" opa \in Op_a "arithmetic operators" opb \in Op_b "Boolean operators" opr \in Op_r "relational operators" S ::= [x:= a]^l | [skip]^l | S1 ; S2 | if [b]^l then S1 else S2 | while [b]^l do S a ::= x | n | a1 opa a2 b ::= true | false | not b | b1 opb b2 | a1 opr a2 ------------------------------------------ What's the difference between abstract and concrete syntax? ------------------------------------------ EXAMPLE [y := x]^1; [z := 1]^2; while [y>1]^3 do ([z := z*y]^4; [y := y-1]^5); [y := 0]^6 ------------------------------------------ Why are the labels only attached to certain places in the abstract syntax trees? Which blocks are elementary? What is used to disambiguate different parse trees? 3. semantics What does the example above do? What does skip do? What kind of numeric literals are in this language? What would be some reasonable arithmetic operators? One would be some reasonable relational operators? What would be some reasonable Boolean operators? What's the type of the relational operators? Can the labels be repeated? What's missing? 4. variations ------------------------------------------ VARIATION 1: CONTROL FEATURES S ::= ... | break | for x in a1 .. a2 do S | throw | try S1 catch S2 VARIATION 2: TAINTING and INFORMATION FLOW S ::= ... | read x | sanitize x | print a VARIATION 3: SPECIFICATION FEATURES S ::= ... | assert b | assume b | choose S1 or S2 VARIATION 4: PARALLELISM S ::= ... | S1 `||' S2 VARIATION 5: FUNCTIONS a ::= ... | fn x => a | a1 a2 | let d in a d ::= x = a | d1; d2 ------------------------------------------ Which of these need labels, and where do the labels go? B. reaching definitions analysis 1. definition ------------------------------------------ REACHING DEFINITIONS (ASSIGNMENTS) def: Let P be a program. An assignment [x := a]^l at label l *may reach* a program point in P if in some execution of P, when execution reaches that point, the last assignment to x was done at l. RD(P, point) says what (labels of) assignments may reach point in P. What can we know for certain? What kind of errors can this detect? ------------------------------------------ How to describe program points? What can we tell for certain from this analysis's results? What kind of errors could this analysis be used to detect/prevent? 2. examples ------------------------------------------ EXAMPLE [y := x]^1; [z := 1]^2; while [y>1]^3 do ([z := z*y]^4; [y := y-1]^5); [y := 0]^6 (y,1) reaches entry to 2 ------------------------------------------ What definitions reach the entry for label 3? What definitions can't reach the entry for label 3? How do we talk about uninitialized variables? In table 1.1, what if we add (y,1) to RDentry(5)? In table 1.1, what if we remove (y,5) from RDentry(6)? ------------------------------------------ FOR YOU TO DO Compute a table like table 1.1 for: [t := x]^1; [x := y]^2; [y := t]^3 if [y>x]^4 then [r := y]^5 else [r := x]^6; assert [r >= x and r >= y]^7 ------------------------------------------ 3. variations What would the analysis look like if we only wanted to keep track of which variables were assigned? What would the analysis look like if we want to keep track of what variables could influence the values of other variables? III. Dataflow Analysis (1.3) A. goals B. idea ------------------------------------------ IDEA OF DATA FLOW ANALYSIS What's the basic idea? What is a data flow graph? How is that used to model the semantics? ------------------------------------------ C. example ------------------------------------------ EXAMPLE [y := 0]^1; [print y]^2; [read x]^3; while [x < 0]^4 do ([y := y+1]^5; [print y]^6; [read x]^7); [z := x]^8 What's the flow graph for this? ------------------------------------------ How would you handle Read? Print? How would you handle if then else statements? For loops? How would you handle break? How would you handle try-catch and throw? How would you handle assert? Assume? Choose? Parallel composition? 1. the equational approach (1.3.1) ------------------------------------------ NODE AND EDGE EQUATIONS FOR TAINT ANALYSIS Taint analysis: at each program point, find the set of variables that may have a value derived from a value previously read from the user ("tainted") Tentry, Texit : Lab* -> Powerset(Var*) where Lab* = set of labels in program Var* = set of variables in prog block Equation ======================================= [x:=a]^l Texit(l) = [skip]^l Texit(l) = [b]^l Texit(l) = [read x]^l Texit(l) = [sanitize x]^l Texit(l) = [print x]^l Texit(l) = How are edges connected? ------------------------------------------ Why are these the right equations? How does this work out for our example? a. algorithm for solving the equations What can we do to solve a set of simultaneous equations? ------------------------------------------ WHAT IS A SOLUTION? Consider the program [read x]^1; [sanitize x]^2 We get the following equations: Tentry(1) = Tentry(1) // know nothing Texit(1) = union(Tentry(1), {x}) Tentry(2) = Texit(1) Texit(2) = subtract(Tentry(2), {x}) Suppose the program has variables x and y Then the possible solutions are: solution G: Tentry(1) = {x,y} Texit(1) = {x,y} Tentry(2) = {x,y} Texit(2) = {y} solution L: Tentry(1) = {} Texit(1) = {x} Tentry(2) = {x} Texit(2) = {} Why are these solutions? ------------------------------------------ ------------------------------------------ MATHEMATICAL TREATMENT OF SOLUTIONS Can think of the dataflow equations: Tentry(1) = Tentry(1) // know nothing Texit(1) = union(Tentry(1), {x}) Tentry(2) = Texit(1) Texit(2) = subtract(Tentry(2), {x}) as a functional transformation: F(T1, T2, T3, T4) = (T1, union(T1, {x}), T2, subtract(T3,{x})) so a solution is: ------------------------------------------ ------------------------------------------ WHICH SOLUTION IS BETTER? For the transformation: F(T1, T2, T3, T4) = (T1, // Tentry(1) union(T1, {x}), // Texit(1) T2, // Tentry(2) subtract(T3,{x})) // Texit(2) there are two solutions over {x,y}: solution G: ({x,y}, {x,y}, {x,y}, {y}) solution L: ({}, {x}, {x}, {}) Which is better? ------------------------------------------ ------------------------------------------ COMPARING SOLUTIONS Ordering on sets: subset or equal (written \subseteq) defined by s1 \subseteq s2 iff (\forall x \in s1 :: x \in s2) e.g., {} \subseteq {x} {} \subseteq {y} {x} \subseteq {x,y} {y} \subseteq {x,y} Hasse diagram: {x,y} / \ {x} {y} \ / {} Solutions are tuples of sets, so: e.g., ({},{}) \sqsubsetq ({x},{}) ({},{}) \sqsubsetq ({y},{}) ({},{}) \sqsubsetq ({},{x}) ({},{}) \sqsubsetq ({},{y}) ({x},{}) \sqsubsetq ({x,y},{}) ({x},{}) \sqsubsetq ({x},{y}) ({y},{}) \sqsubsetq ({x,y},{}) ({y},{}) \sqsubsetq ({y},{x}) ({},{x}) \sqsubsetq ({y},{x,y}) ({},{x}) \sqsubsetq ({},{x,y}) ({},{y}) \sqsubsetq ({x},{x,y}) ({},{y}) \sqsubsetq ({},{x,y}) ({x,y},{}) \sqsubsetq ({x,y},{y}) ({x},{y}) \sqsubsetq ({x,y},{y}) ({y},{x}) \sqsubsetq ({x,y},{y}) ... For precision we want the ------------------------------------------ ------------------------------------------ ALGORITHM IDEA Goal: find least fixed point of F fact: F is monotonic in \sqsubseteq because each operation used in defining F is monotonic Induction idea: Base. Start with ({},...,{}) Ind. Since F is monotonic v \sqsubseteq F(v) When F^n(v) = F(F^n(v)) then F^n(v) is a fixed point Reach least fixed point first, so stop with that solution ------------------------------------------ IV. constraint based analysis (1.3.2 and 1.4) A. goals What's the difference between a data flow analysis and a constraint-based analysis? ------------------------------------------ DATA FLOW VS. CONSTRAINT-BASED ANALYSIS Main differences? ------------------------------------------ B. example ------------------------------------------ REACHING DEFINITIONS USING CONSTRAINTS For the program: [y := x]^1; [z := 1]^2; while [y>1]^3 do ([z:=z*y]^4; [y:=y-1]^5); [y:= 0]^6 ------------------------------------------ C. Solutions V. control-flow analysis (1.4) A. setting ------------------------------------------ SETTING Goal: Approach: 1. Convert all control structures to functions and function calls. 2. Analysis finds what functions can be called from each point ------------------------------------------ Why switch to a functional language for this section? ------------------------------------------ CONTINUATION PASSING STYLE An intermediate language with one control structure Idea: every expression takes a "continuation" to which it sends its result Examples: x < 0 ==> [fn k => [[[%< x] 0] k]] if [x < 0] then [y := 22] else [z := 33] ==> [fn k => [[[%< x] 0] [%if [[y := 22] k] [[z := 33] k]]]] ------------------------------------------ ------------------------------------------ LANGUAGE (p. 140) Work in a functional language: e \in Exp t \in Term f,x \in Var c \in Const op \in Op l \in Lab e ::= t^l t ::= c | x | fn x => e_0 "non-recursive fun" | fun f x => e_0 "recursive fun def" | e_1 e_2 | if e_0 then e_1 else e_2 | let x = e_1 in e_2 | e_1 op e_2 ------------------------------------------ What are the atomic blocks being labeled here? B. idea What are the main ideas in this approach? ------------------------------------------ IDEAS OF CONSTRAINT BASED ANALYSIS - assume no side effects ==> associate information with labels - use a pair of functions, (C,p): C: Lab* -> Powerset(Value) C(l) contains possible values for subexpression at label l p: Var* -> Powerset(Value) p(x) constains possible values for variable x ------------------------------------------ What's the alternative to associating information directly with labels? How could this information be useful in an object-oriented program? How is this different than a type system? ------------------------------------------ APPROACH - collect constraints for function abstractions: e.g., given [fn x => [x]^1]^2 get {[fn x => [x]^1]} \subseteq C(2) ------------------------------------------ What's the value of a function definition? Why do they just use the term instead of a closure? What's the general pattern here? ------------------------------------------ for variables: e.g., given [x]^1 get p(x) \subseteq C(1) for applications: e.g., given [[f]^1 [e]^2]^3 get {v | g \in C(1), a \in C(2), and v = (g a)} \subseteq C(3) ------------------------------------------ What's the general rule? What happens in [[fn x => [x]^1]^2 [fn y => [y]^3]^4]^5 ? What would be the constraints for an expression of the form [[e1]^1 op [e2]^2]^3 ? What would be the constraints for an if-then-else expression? What would be the constraints for a let expression, like let x = e1 in e2 ? VI. Abstract Interpretation (1.5) A. goals B. idea What's the basic idea? ------------------------------------------ IDEA OF ABSTRACT INTERPRETATION (1.5) ------------------------------------------ What's a collecting semantics? How is that used to extract the analysis? C. example ------------------------------------------ EXAMPLE [y := 0]^1; [print y]^2; [read x]^3; while [x < 0]^4 do ([y := y+1]^5; [print y]^6; [read x]^7); [z := x]^8 For taint analysis we seek sets of variables at each program point that may have a value derived from a value previously read from the user ("tainted") ------------------------------------------ What is a collecting semantics? What would a collecting semantics look like for this example? Why is CSentry(1) = {(x,?,{}),(y,?,{}),(z,?,{})} ? How is this different than the reaching definitions analysis. it's sets of ordered variable names instead of traces Q: How would you handle if then else statements? How would you handle if then else statements? For loops? How would you handle break? 1. solving the equations ------------------------------------------ SOLVING THE EQUATIONS Traces = Powerset((Var x Lab? x Dependants)*) G: Traces^{16} -> Traces^{16} G is defined by: G(CS_1, ..., CS_{16}) = (G_1(CS_1, ..., CS_{16}), G_2(CS_1, ..., CS_{16}), ..., G_{16}(CS_1, ..., CS_{16})) where G_1(CS_1, ..., CS_{16}) // CSentry(1) = {(x,?,{}),(y,?,{}),(z,?,{})} G_2(CS_1, ..., CS_{16}) // CSexit(1) = {tr : (y,1,{}) | tr \in CS_1} G_3(CS_1, ..., CS_{16}) = CS_2 ... Solution (CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8)) is a solution if G(CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8)) = (CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8)) ------------------------------------------ What is G_1? Why does it have 16 parameters? What is G_2? Why is G_3 = CS_2 in this example? What's the ordering on the solution space? What does it mean for G to be monotone? 2. Galois connections ------------------------------------------ ABSTRACTION AND CONCRETIZATION abstraction function for Taint analysis: a: Traces -> Powerset(Var*) a(trs) = {x | read \in depends(x,tr), tr \in trs} concretization function for Taint analysis: g: Powerset(Var*) -> Traces g(Y) = {tr | x \in Y, read \in depends(x,tr)} Adjunction, or Galois connection: a(X) \subseteq Y <==> X \subseteq g(Y) ------------------------------------------ Would a and g be different for the RD analysis? ------------------------------------------ set of traces set of vars |---------------| |---------------| | | g | | | g(Y) <---------------- Y | | U| | | U| | | X -----------------> a(X) | | | a | | |_______________| |_______________| ------------------------------------------ 3. calculating the analysis Why do we care about the abstraction and concretization functions? ------------------------------------------ CALCULATING THE ANALYSIS Extend a and g pointwise to tuples: a(TR_1, ..., TR_16) = (a(TR_1), ..., a(TR_16)) g(Y_1, ..., Y_16) = (g(Y_1), ..., g(Y_16)) Define the analysis by the function a o G o g: Powerset(Var*)^16 -> Powerset(Var*)^16 so for each i in {1..12} (a o G_i o g): Powerset(Var*)^16 -> Powerset(Var) by a(G_1(g(T_1, ..., T_16))) = a({(x,?,{}),(y,?,{}),(z,?,{})}) = {} a(G_2(g(T_1, ..., T_16))) = a({tr : (y,1,{}) | tr \in CS_1}) ... So a solution (Tentry(1), ..., Texit(8)) has the property that (Tentry(1), ..., Texit(8)) = (a o G o g)(Tentry(1), ..., Texit(8)) ------------------------------------------ What's T_3? How does this compare to the analysis we created by hand? What's the benefit of doing things this way? VII. Type and Effect Systems (1.6) A. setting ------------------------------------------ SETTING Type checking is usually syntax-directed and compositional So can be implemented by: ------------------------------------------ B. goals C. idea What's the basic idea? ------------------------------------------ TYPE AND EFFECT SYSTEMS (1.6) Basic idea? ------------------------------------------ D. annotated type system example 1. type notation ------------------------------------------ NATURAL DEDUCTION STYLE NOTATION Type systems are written with rules of the form G |- f : T1 -> T, G |- e : T1 [e-rule] _________________ if C G |- f(e1) : T [int c] |- 0 : Int ------------------------------------------ 2. annotated type system example ------------------------------------------ ANNOTATED TYPE SYSTEM FOR TAINT ANALYSIS Types are sets of variables that may be tainted [asg] [x := a]^l : T1 -> T2 if T2 = (T1 - {x}) \cup {x | FV(a) \cap T1 \neq {} } [skip] [skip]^l : T -> T S1 : T1 -> T2, S2: T2 -> T3 [seq] ____________________________ S1; S2 : T1 -> T3 S1 : T1 -> T2, S2: T1 -> T2 [if] ___________________________________ if [b]^l then S1 else S2 : T1 -> T2 S : T1 -> T1 [wh] _____________________________ while [b]^l then S : T1 -> T1 [re] [read x]^l : T1 -> T2 if T2 = T1 \cup {x} [sa] [sanitize x]^l : T1 -> T2 if T2 = T1 - {x} [pr] [print x]^l : T1 -> T1 if x \not\in T1 S : T2 -> T3 [sub] _____________ if T1 \subseteq T2, S : T1 -> T4 T3 \subseteq T4 ------------------------------------------ What's the meaning of the basic types? What's the meaning of the arrow types, like T1 -> T4? How would you explain the assignment rule? How would you explain the if rule? Why don't we have to check the type of the condition in an if? How would you explain the while rule? How would you explain the sub rule? What would we need to change if we wanted to do an information flow security analysis instead of a taint analysis? a. type checking ------------------------------------------ EXAMPLE [y := 0]^1; [print y]^2; [read x]^3; while [x < 0]^4 do ([y := y+1]^5; [print y]^6; [read x]^7); [z := x]^8 ------------------------------------------ ------------------------------------------ TYPE CHECKING Idea: accumulate constraints. [y := 0]^1: T1 -> T2 [asg] [print y]^2: T2 -> T3 [pr] ___________________________________ [seq] ([y := 0]^1;[print y]^2) : T1 -> T3 if T2 = T1-{y} and y \not\in T2 and T3 = T2 ------------------------------------------ So what are all the constraints? ------------------------------------------ CONSTRAINTS T2 = T1-{y} y \not\in T2 T3 = T2 T4 = T3 \cup {x} T6 = (T5-{y}) \cup ({y} \cap T5) T7 = T6 y \not\in T6 T8 = T7 \cup {x} T9 \subseteq T5 T8 \subseteq T9 T4 = T9 T10 = T9-{z} \cup ({x} \cap T9) So, what's a solution? ------------------------------------------ 3. annotated type constructors ------------------------------------------ EXAMPLE Judgments: XMust S : Sigma -------------> Sigma YMay Where XMust and YMay are sets of variables (that S must assign and may assign). {x} [asg] [x := a]^l : Sigma ----> Sigma {x} {} [skip] [skip]^l : Sigma -----> Sigma {} X1 S1 : Sigma -----> Sigma, Y1 X2 S2 : Sigma -----> Sigma Y2 [seq] -------------------------------- X3 S1; S2 : Sigma ----> Sigma Y3 if X3 = X1 \cup X2, Y3 = Y1 \cup Y2 X1 S1 : Sigma -----> Sigma, Y1 X2 S2 : Sigma -----> Sigma Y2 [if] -------------------------------- if [b]^l then S1 else S2 X3 : Sigma ----> Sigma Y3 if X3 = X1 \cap X2, Y3 = Y1 \cup Y2 X S : Sigma -----> Sigma Y [wh] -------------------------------- while [b]^l then S {} : Sigma ----> Sigma Y X S : Sigma ----> Sigma Y [sub] ---------------------- X' S : Sigma -----> Sigma Y' if X' \subseteq X, Y \subseteq Y' ------------------------------------------ Why does the assignment rule make sense? How do you explain the if rule? How do you explain the while rule? How would you deal with assert statements? How would you deal with a try statement? ------------------------------------------ TYPE CHECKING EXAMPLE TYPE CHECKING Idea: accumulate constraints. {q} [q := 0]^1: Sigma ---> Sigma , [asg] {q} {r} [r := x]^2: Sigma ---> Sigma [asg] {r} _________________________________ [seq] ([q := 0]^1;[r := x]^2) {q,r} : Sigma ----> Sigma {q,r} {r} [r := r-y]^4: Sigma ---> Sigma, [asg] {r} {q} [q := q+1]^5: Sigma ---> Sigma [asg] {q} _________________________________ [seq] ([r := r-y]^4;[q := q+1]^5) {q,r} : Sigma ----> Sigma {q,r} ___________________________________[wh] while [r >= y]^3 do ([r := r-y]^4;[q := q+1]^5) {} : Sigma ----> Sigma {q,r} so by the seq rule, ------------------------------------------ How is this different than with annotated base types? E. effect systems 1. example: call tracking analysis ------------------------------------------ EXAMPLE Judgments: Gamma |- e : t & phi where Gamma : Var -> Type e : Expression t : Type phi : Effect phi Type = int | bool | t1 ---> t2 phi : Powerset(FunName) [var] Gamma |- x : t & {}, if Gamma(x) = t Gamma[x |-> tx] |- e : t & phi [fn] -------------------------------- Gamma |- fn_pi x => e phi2 : tx ------> t & {} if phi2 = phi \cup {pi} phi Gamma |- e1 : t2 ---> t & phi1, Gamma |- e2 : t2 & phi2 [app] -------------------------------- Gamma |- e1 e2 : t & phi3 if phi3 = phi1 \cup phi2 \cup phi ------------------------------------------ What do the judgments mean? What do the arrow types mean? How would you explain the fn rule? How would you explain the app rule?