GOALS OF PROGRAM ANALYSIS

MAIN IDEAS

CONCRETE SYNTAX FOR THE WHILE LANGUAGE

% ---- Regular (lexical) syntax ----
 ::= \r | \n | \r\n
 ::= "any character except \r or \n"
 ::= %
 ::= | " " | \t | \f
 ::= != | == | < | > | <= | >=
 ::= + | -
 ::= * | /
 ::= "any letter" | ' | _
 ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 ::= ( | )*
 ::= (-)?()+

% ---- Context-Free syntax ----
 ::=
 ::= := | skip | | while do | if then else
 ::= '{' '}'
 ::= | ;
 ::= |
 ::= |
 ::= | | ( )
 ::= | or
 ::= | and
 ::= | not | | ( )
 ::= true | false

EXAMPLE

y := x;
z := 1;
while y>1 do { z := z*y; y := y-1 };
y := 0

ABSTRACT SYNTAX OF THE WHILE LANGUAGE

a \in AExp      "arithmetic expressions"
b \in BExp      "Boolean expressions"
S \in Stmt      "statements"
x,y \in Var     "variables"
n \in Num       "numeric literals"
l \in Lab       "labels"
opa \in Op_a    "arithmetic operators"
opb \in Op_b    "Boolean operators"
opr \in Op_r    "relational operators"

S ::= [x:= a]^l | [skip]^l | S1 ; S2
    | if [b]^l then S1 else S2
    | while [b]^l do S
a ::= x | n | a1 opa a2
b ::= true | false | not b | b1 opb b2 | a1 opr a2

EXAMPLE

[y := x]^1;
[z := 1]^2;
while [y>1]^3 do
  ([z := z*y]^4; [y := y-1]^5);
[y := 0]^6

VARIATION 1: CONTROL FEATURES

S ::= ... | break | for x in a1 .. a2 do S
    | throw | try S1 catch S2

VARIATION 2: TAINTING and INFORMATION FLOW

S ::= ... | read x | sanitize x | print a

VARIATION 3: SPECIFICATION FEATURES

S ::= ... | assert b | assume b | choose S1 or S2

VARIATION 4: PARALLELISM

S ::= ... | S1 `||' S2

VARIATION 5: FUNCTIONS

a ::= ... | fn x => a | a1 a2 | let d in a
d ::= x = a | d1; d2

REACHING DEFINITIONS (ASSIGNMENTS)

def: Let P be a program. An assignment [x := a]^l
at label l *may reach* a program point in P
if in some execution of P, when execution reaches that point,
the last assignment to x was done at l.

RD(P, point) says what (labels of) assignments
may reach point in P.

What can we know for certain?

What kind of errors can this detect?
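
The definition above can be made concrete with a small sketch in Python,
applied to the labelled example program from the abstract syntax section.
It is only an illustration and not a definitive implementation: the names
ASSIGNS, FLOW, VARS, INIT and reaching_definitions are hypothetical, the
flow edges are written out by hand as an assumption about the program's
control flow, and "?" stands for "not yet assigned by the program".

```python
# Hypothetical encoding of
#   [y := x]^1; [z := 1]^2;
#   while [y>1]^3 do ([z := z*y]^4; [y := y-1]^5);
#   [y := 0]^6
# label -> variable assigned there (None for the test [y>1]^3)
ASSIGNS = {1: "y", 2: "z", 3: None, 4: "z", 5: "y", 6: "y"}
# Assumed control-flow edges (from label, to label)
FLOW = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 3), (3, 6)]
VARS = ["x", "y", "z"]
INIT = 1  # label where execution starts


def reaching_definitions():
    """Iterate from empty sets until nothing changes."""
    entry = {l: set() for l in ASSIGNS}
    exit_ = {l: set() for l in ASSIGNS}
    changed = True
    while changed:
        changed = False
        for l in sorted(ASSIGNS):
            # At the initial label every variable is "unassigned" (?)
            new_entry = {(v, "?") for v in VARS} if l == INIT else set()
            # Join the exits of all predecessors in the flow graph
            for (p, q) in FLOW:
                if q == l:
                    new_entry |= exit_[p]
            x = ASSIGNS[l]
            if x is None:
                # Tests do not change which assignments reach
                new_exit = set(new_entry)
            else:
                # Kill older assignments to x, then add (x, l)
                new_exit = {(v, d) for (v, d) in new_entry if v != x}
                new_exit.add((x, l))
            if new_entry != entry[l] or new_exit != exit_[l]:
                entry[l], exit_[l] = new_entry, new_exit
                changed = True
    return entry, exit_


if __name__ == "__main__":
    entry, exit_ = reaching_definitions()
    for l in sorted(ASSIGNS):
        print(l, sorted(entry[l], key=str), sorted(exit_[l], key=str))
```

Because the sets only grow from the empty starting point, the loop stops
at the smallest assignment of sets that satisfies the equations, which is
the most precise "may reach" answer for this encoding.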
EXAMPLE

[y := x]^1;
[z := 1]^2;
while [y>1]^3 do
  ([z := z*y]^4; [y := y-1]^5);
[y := 0]^6

(y,1) reaches entry to 2

FOR YOU TO DO

Compute a table like table 1.1 for:

[t := x]^1; [x := y]^2; [y := t]^3
if [y>x]^4 then [r := y]^5 else [r := x]^6;
assert [r >= x and r >= y]^7

IDEA OF DATA FLOW ANALYSIS

What's the basic idea?

What is a data flow graph?

How is that used to model the semantics?

EXAMPLE

[y := 0]^1;
[print y]^2;
[read x]^3;
while [x < 0]^4 do
  ([y := y+1]^5; [print y]^6; [read x]^7);
[z := x]^8

What's the flow graph for this?

NODE AND EDGE EQUATIONS FOR TAINT ANALYSIS

Taint analysis: at each program point, find the set of variables
that may have a value derived from a value
previously read from the user ("tainted")

Tentry, Texit : Lab* -> Powerset(Var*)
  where Lab* = set of labels in the program
        Var* = set of variables in the program

block             Equation
=======================================
[x:=a]^l          Texit(l) =
[skip]^l          Texit(l) =
[b]^l             Texit(l) =
[read x]^l        Texit(l) =
[sanitize x]^l    Texit(l) =
[print x]^l       Texit(l) =

How are edges connected?

WHAT IS A SOLUTION?

Consider the program

  [read x]^1; [sanitize x]^2

We get the following equations:

  Tentry(1) = Tentry(1)                  // know nothing
  Texit(1)  = union(Tentry(1), {x})
  Tentry(2) = Texit(1)
  Texit(2)  = subtract(Tentry(2), {x})

Suppose the program has variables x and y.
Then the possible solutions include:

  solution G:
    Tentry(1) = {x,y}
    Texit(1)  = {x,y}
    Tentry(2) = {x,y}
    Texit(2)  = {y}

  solution L:
    Tentry(1) = {}
    Texit(1)  = {x}
    Tentry(2) = {x}
    Texit(2)  = {}

Why are these solutions?

MATHEMATICAL TREATMENT OF SOLUTIONS

Can think of the dataflow equations:

  Tentry(1) = Tentry(1)                  // know nothing
  Texit(1)  = union(Tentry(1), {x})
  Tentry(2) = Texit(1)
  Texit(2)  = subtract(Tentry(2), {x})

as a functional transformation:

  F(T1, T2, T3, T4) = (T1,
                       union(T1, {x}),
                       T2,
                       subtract(T3,{x}))

so a solution is:

WHICH SOLUTION IS BETTER?
For the transformation:

  F(T1, T2, T3, T4) = (T1,                 // Tentry(1)
                       union(T1, {x}),     // Texit(1)
                       T2,                 // Tentry(2)
                       subtract(T3,{x}))   // Texit(2)

two of the solutions over {x,y} are:

  solution G: ({x,y}, {x,y}, {x,y}, {y})
  solution L: ({}, {x}, {x}, {})

Which is better?

COMPARING SOLUTIONS

Ordering on sets: subset or equal (written \subseteq)
defined by
  s1 \subseteq s2 iff (\forall x \in s1 :: x \in s2)

e.g., {} \subseteq {x}        {} \subseteq {y}
      {x} \subseteq {x,y}     {y} \subseteq {x,y}

Hasse diagram:

        {x,y}
        /   \
      {x}   {y}
        \   /
         {}

Solutions are tuples of sets, so they are ordered pointwise
(written \sqsubseteq):

e.g., ({},{}) \sqsubseteq ({x},{})        ({},{}) \sqsubseteq ({y},{})
      ({},{}) \sqsubseteq ({},{x})        ({},{}) \sqsubseteq ({},{y})
      ({x},{}) \sqsubseteq ({x,y},{})     ({x},{}) \sqsubseteq ({x},{y})
      ({y},{}) \sqsubseteq ({x,y},{})     ({y},{}) \sqsubseteq ({y},{x})
      ({},{x}) \sqsubseteq ({y},{x,y})    ({},{x}) \sqsubseteq ({},{x,y})
      ({},{y}) \sqsubseteq ({x},{x,y})    ({},{y}) \sqsubseteq ({},{x,y})
      ({x,y},{}) \sqsubseteq ({x,y},{y})  ({x},{y}) \sqsubseteq ({x,y},{y})
      ({y},{x}) \sqsubseteq ({x,y},{x})
      ...

For precision we want the

ALGORITHM IDEA

Goal: find least fixed point of F

fact: F is monotonic in \sqsubseteq
      because each operation used in defining F is monotonic

Induction idea:
  Base. Start with v = ({},...,{}), so v \sqsubseteq F(v)
  Ind.  Since F is monotonic,
        v \sqsubseteq F(v) implies F(v) \sqsubseteq F(F(v))

  When F^n(v) = F(F^n(v))
  then F^n(v) is a fixed point

  Reach least fixed point first, so stop with that solution

DATA FLOW VS. CONSTRAINT-BASED ANALYSIS

Main differences?

REACHING DEFINITIONS USING CONSTRAINTS

For the program:

[y := x]^1;
[z := 1]^2;
while [y>1]^3 do
  ([z:=z*y]^4; [y:=y-1]^5);
[y:= 0]^6

SETTING

Goal:

Approach:
  1. Convert all control structures to functions
     and function calls.
  2.
     Analysis finds what functions can be called from each point

CONTINUATION PASSING STYLE

An intermediate language with one control structure

Idea: every expression takes a "continuation"
      to which it sends its result

Examples:

  x < 0
  ==> [fn k => [[[%< x] 0] k]]

  if [x < 0] then [y := 22] else [z := 33]
  ==> [fn k => [[[%< x] 0]
                [%if [[y := 22] k] [[z := 33] k]]]]

LANGUAGE (p. 140)

Work in a functional language:

e \in Exp
t \in Term
f,x \in Var
c \in Const
op \in Op
l \in Lab

e ::= t^l
t ::= c | x
    | fn x => e_0          "non-recursive fun"
    | fun f x => e_0       "recursive fun def"
    | e_1 e_2
    | if e_0 then e_1 else e_2
    | let x = e_1 in e_2
    | e_1 op e_2

IDEAS OF CONSTRAINT BASED ANALYSIS

- assume no side effects
  ==> associate information with labels

- use a pair of functions, (C,p):

  C: Lab* -> Powerset(Value)
     C(l) contains possible values
     for subexpression at label l

  p: Var* -> Powerset(Value)
     p(x) contains possible values for variable x

APPROACH

- collect constraints

  for function abstractions:
    e.g., given [fn x => [x]^1]^2
          get {[fn x => [x]^1]} \subseteq C(2)

  for variables:
    e.g., given [x]^1
          get p(x) \subseteq C(1)

  for applications:
    e.g., given [[f]^1 [e]^2]^3
          get {v | g \in C(1), a \in C(2), and v = (g a)}
              \subseteq C(3)

IDEA OF ABSTRACT INTERPRETATION (1.5)

EXAMPLE

[y := 0]^1;
[print y]^2;
[read x]^3;
while [x < 0]^4 do
  ([y := y+1]^5; [print y]^6; [read x]^7);
[z := x]^8

For taint analysis we seek sets of variables at each program point
that may have a value derived from a value
previously read from the user ("tainted")

SOLVING THE EQUATIONS

Traces = Powerset((Var x Lab? x Dependants)*)

G: Traces^{16} -> Traces^{16}

G is defined by:

  G(CS_1, ..., CS_{16}) = (G_1(CS_1, ..., CS_{16}),
                           G_2(CS_1, ..., CS_{16}),
                           ...,
                           G_{16}(CS_1, ..., CS_{16}))
where
  G_1(CS_1, ..., CS_{16})     // CSentry(1)
      = {(x,?,{}),(y,?,{}),(z,?,{})}
  G_2(CS_1, ..., CS_{16})     // CSexit(1)
      = {tr : (y,1,{}) | tr \in CS_1}
  G_3(CS_1, ..., CS_{16})
      = CS_2
  ...
Solution

(CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8))
is a solution if

  G(CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8))
    = (CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8))

ABSTRACTION AND CONCRETIZATION

abstraction function for Taint analysis:

  a: Traces -> Powerset(Var*)
  a(trs) = {x | read \in depends(x,tr), tr \in trs}

concretization function for Taint analysis:

  g: Powerset(Var*) -> Traces
  g(Y) = {tr | x \in Y, read \in depends(x,tr)}

Adjunction, or Galois connection:

  a(X) \subseteq Y <==> X \subseteq g(Y)

    set of traces                set of vars
  |---------------|           |---------------|
  |               |     g     |               |
  |   g(Y) <---------------------- Y          |
  |    U|         |           |    U|         |
  |    X  ----------------------> a(X)        |
  |               |     a     |               |
  |_______________|           |_______________|

CALCULATING THE ANALYSIS

Extend a and g pointwise to tuples:

  a(TR_1, ..., TR_16) = (a(TR_1), ..., a(TR_16))
  g(Y_1, ..., Y_16)   = (g(Y_1), ..., g(Y_16))

Define the analysis by the function

  a o G o g: Powerset(Var*)^16 -> Powerset(Var*)^16

so for each i in {1..16}

  (a o G_i o g): Powerset(Var*)^16 -> Powerset(Var*)

by
  a(G_1(g(T_1, ..., T_16)))
      = a({(x,?,{}),(y,?,{}),(z,?,{})})
      = {}
  a(G_2(g(T_1, ..., T_16)))
      = a({tr : (y,1,{}) | tr \in g(T_1)})
  ...

So a solution (Tentry(1), ..., Texit(8)) has the property that

  (Tentry(1), ..., Texit(8))
    = (a o G o g)(Tentry(1), ..., Texit(8))

SETTING

Type checking is usually syntax-directed and compositional

So can be implemented by:

TYPE AND EFFECT SYSTEMS (1.6)

Basic idea?
NATURAL DEDUCTION STYLE NOTATION

Type systems are written with rules of the form

          G |- f : T1 -> T,  G |- e : T1
[e-rule]  ______________________________  if C
          G |- f(e) : T

[int c]   |- 0 : Int

ANNOTATED TYPE SYSTEM FOR TAINT ANALYSIS

Types are sets of variables that may be tainted

[asg]  [x := a]^l : T1 -> T2
       if T2 = (T1 - {x}) \cup {x | FV(a) \cap T1 \neq {} }

[skip] [skip]^l : T -> T

       S1 : T1 -> T2,  S2: T2 -> T3
[seq]  ____________________________
       S1; S2 : T1 -> T3

       S1 : T1 -> T2,  S2: T1 -> T2
[if]   ___________________________________
       if [b]^l then S1 else S2 : T1 -> T2

       S : T1 -> T1
[wh]   _____________________________
       while [b]^l do S : T1 -> T1

[re]   [read x]^l : T1 -> T2
       if T2 = T1 \cup {x}

[sa]   [sanitize x]^l : T1 -> T2
       if T2 = T1 - {x}

[pr]   [print x]^l : T1 -> T1
       if x \not\in T1

       S : T2 -> T3
[sub]  _____________  if T1 \subseteq T2,
       S : T1 -> T4      T3 \subseteq T4

EXAMPLE

[y := 0]^1;
[print y]^2;
[read x]^3;
while [x < 0]^4 do
  ([y := y+1]^5; [print y]^6; [read x]^7);
[z := x]^8

TYPE CHECKING

Idea: accumulate constraints.

[y := 0]^1: T1 -> T2 [asg]   [print y]^2: T2 -> T3 [pr]
___________________________________ [seq]
([y := 0]^1;[print y]^2) : T1 -> T3
   if T2 = T1-{y} and y \not\in T2 and T3 = T2

CONSTRAINTS

  T2 = T1-{y}
  y \not\in T2
  T3 = T2
  T4 = T3 \cup {x}
  T6 = (T5-{y}) \cup ({y} \cap T5)
  T7 = T6
  y \not\in T6
  T8 = T7 \cup {x}
  T9 \subseteq T5
  T8 \subseteq T9
  T4 = T9
  T10 = (T9-{z}) \cup {z | x \in T9}

So, what's a solution?

EXAMPLE

Judgments:

              XMust
  S : Sigma -------------> Sigma
              YMay

Where XMust and YMay are sets of variables
(that S must assign and may assign).
                            {x}
[asg]  [x := a]^l : Sigma ----> Sigma
                            {x}

                          {}
[skip] [skip]^l : Sigma -----> Sigma
                          {}

                  X1
       S1 : Sigma -----> Sigma,
                  Y1
                  X2
       S2 : Sigma -----> Sigma
                  Y2
[seq]  --------------------------------
                      X3
       S1; S2 : Sigma ----> Sigma
                      Y3
       if X3 = X1 \cup X2, Y3 = Y1 \cup Y2

                  X1
       S1 : Sigma -----> Sigma,
                  Y1
                  X2
       S2 : Sigma -----> Sigma
                  Y2
[if]   --------------------------------
       if [b]^l then S1 else S2
                  X3
          : Sigma ----> Sigma
                  Y3
       if X3 = X1 \cap X2, Y3 = Y1 \cup Y2

                 X
       S : Sigma -----> Sigma
                 Y
[wh]   --------------------------------
       while [b]^l do S
                  {}
          : Sigma ----> Sigma
                  Y

                 X
       S : Sigma ----> Sigma
                 Y
[sub]  ----------------------
                 X'
       S : Sigma -----> Sigma
                 Y'
       if X' \subseteq X, Y \subseteq Y'

TYPE CHECKING EXAMPLE

TYPE CHECKING

Idea: accumulate constraints.

                    {q}
  [q := 0]^1: Sigma ---> Sigma ,   [asg]
                    {q}
                    {r}
  [r := x]^2: Sigma ---> Sigma     [asg]
                    {r}
  _________________________________ [seq]
  ([q := 0]^1;[r := x]^2)
             {q,r}
     : Sigma ----> Sigma
             {q,r}

                      {r}
  [r := r-y]^4: Sigma ---> Sigma,  [asg]
                      {r}
                      {q}
  [q := q+1]^5: Sigma ---> Sigma   [asg]
                      {q}
  _________________________________ [seq]
  ([r := r-y]^4;[q := q+1]^5)
             {q,r}
     : Sigma ----> Sigma
             {q,r}
  ___________________________________ [wh]
  while [r >= y]^3 do ([r := r-y]^4;[q := q+1]^5)
             {}
     : Sigma ----> Sigma
             {q,r}

so by the seq rule,

EXAMPLE

Judgments: Gamma |- e : t & phi

where Gamma : Var -> Type
      e : Expression
      t : Type
      phi : Effect

                          phi
  Type = int | bool | t1 ---> t2
  phi : Powerset(FunName)

[var]  Gamma |- x : t & {},  if Gamma(x) = t

       Gamma[x |-> tx] |- e : t & phi
[fn]   --------------------------------
       Gamma |- fn_pi x => e
                phi2
          : tx ------> t & {}
       if phi2 = phi \cup {pi}

                        phi
       Gamma |- e1 : t2 ---> t & phi1,
       Gamma |- e2 : t2 & phi2
[app]  --------------------------------
       Gamma |- e1 e2 : t & phi3
       if phi3 = phi1 \cup phi2 \cup phi
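
The must/may assignment rules ([asg], [skip], [seq], [if], [wh]) given
earlier in this section can be read as a recursive function over
statements. The Python sketch below is one way to do that, under stated
assumptions: the classes Asg, Skip, Seq, If, While and the function
must_may are hypothetical names introduced only for illustration, tests
and right-hand sides are dropped because the rules ignore them, and the
[sub] rule is not applied, so the function returns the most precise
judgment the other rules allow.

```python
from dataclasses import dataclass


# Hypothetical AST; only what the must/may rules inspect is kept.
@dataclass(frozen=True)
class Asg:
    var: str            # [x := a]^l, the expression a is irrelevant here


@dataclass(frozen=True)
class Skip:
    pass                # [skip]^l


@dataclass(frozen=True)
class Seq:
    first: object       # S1 ; S2
    second: object


@dataclass(frozen=True)
class If:
    then_branch: object  # if [b]^l then S1 else S2
    else_branch: object


@dataclass(frozen=True)
class While:
    body: object        # while [b]^l do S


def must_may(stmt):
    """Return (X, Y): variables stmt must assign and may assign."""
    if isinstance(stmt, Asg):                      # [asg]
        return frozenset({stmt.var}), frozenset({stmt.var})
    if isinstance(stmt, Skip):                     # [skip]
        return frozenset(), frozenset()
    if isinstance(stmt, Seq):                      # [seq]: union, union
        x1, y1 = must_may(stmt.first)
        x2, y2 = must_may(stmt.second)
        return x1 | x2, y1 | y2
    if isinstance(stmt, If):                       # [if]: intersection, union
        x1, y1 = must_may(stmt.then_branch)
        x2, y2 = must_may(stmt.else_branch)
        return x1 & x2, y1 | y2
    if isinstance(stmt, While):                    # [wh]: body may run 0 times
        _, y = must_may(stmt.body)
        return frozenset(), y
    raise TypeError(f"unknown statement: {stmt!r}")


# [q := 0]^1; [r := x]^2;
# while [r >= y]^3 do ([r := r-y]^4; [q := q+1]^5)
division = Seq(Seq(Asg("q"), Asg("r")),
               While(Seq(Asg("r"), Asg("q"))))

if __name__ == "__main__":
    print(must_may(division))
```

Applying the function to the loop alone gives an empty must-set, which
is the point of the [wh] rule: nothing assigned only inside the body is
guaranteed to be assigned.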