Static Analysis
Here are some brief notes I took for the NJU PASCAL Static Program Analysis course. My homework solutions are here.
This course is well-organized and has an excellent reputation. There are course videos on Bilibili, a code repository for the assignments on GitHub, and a self-built online judge system. Great work!
(Lecture 1) A Naive Glance at Static Analysis
Background: In the last decade, language cores have changed little, but programs have become significantly larger and more complicated.
Challenge: How can we ensure the reliability, security, and other properties of large-scale, complex programs?
Static analysis analyzes a program P to reason about its behaviors and determine whether it satisfies some properties before running P. However, according to Rice's theorem, any non-trivial property of the behavior of programs in a recursively enumerable (r.e.) language is undecidable.
Soundness and Completeness
Complete $\subset$ Truth $\subset$ Sound (in terms of the set of reported program behaviors).
- Truth = what actually happens at runtime; an analysis that is both complete and sound reports exactly the truth
- Complete = under-approximation = may have false negatives
- Sound = over-approximation = may have false positives
For a practically useful static analysis, soundness is usually the property we insist on.
In a nutshell: static analysis = abstraction + over-approximation.
Abstraction
Abstraction maps values from the concrete domain into an abstract domain; transfer functions then define how statements operate on the abstract values.
Transfer functions are defined according to the specific analysis problem.
(Lecture 2) Intermediate Representation
This lecture talks about the relationship between ASTs and IRs.
3AC (3-address code): there is at most one operator on the right-hand side of an instruction.
An address can be one of: 1. a name; 2. a constant; 3. a compiler-generated temporary.
For example, `a = b + c * d` is lowered to `t1 = c * d; t2 = b + t1; a = t2`.
SSA (Static Single Assignment)
All assignments in SSA are to variables with distinct names
- Give each definition a fresh name
- Propagate fresh name to subsequent uses
- Every variable has exactly one definition
In SSA, flow information is indirectly encoded in the unique variable names, and def-use pairs are explicit; at control-flow merge points, a phi-function selects the incoming definition (e.g., `x2 = φ(x0, x1)`). However, SSA may introduce too many variables and phi-functions, causing inefficiency.
Control Flow Analysis
The CFG (Control Flow Graph) serves as the basic structure for static analysis; its nodes are BBs (Basic Blocks).
A BB can be entered only at its first instruction and left only at its last instruction. Basic blocks can be constructed by the following algorithm:
- Determine the leaders in the 3AC program P:
  - The first instruction in P is a leader
  - Any target instruction of a conditional or unconditional jump is a leader
  - Any instruction that immediately follows a conditional or unconditional jump is a leader
- Build the BBs for P: a BB consists of a leader and all subsequent instructions up to (but not including) the next leader
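Below is a minimal Java sketch of this partitioning, assuming a made-up `Instr` record for 3AC instructions (real IRs carry much more information than this):

```java
import java.util.*;

// A made-up 3AC instruction: 'jumpTarget' is the index of the jump target,
// or -1 if the instruction is not a jump.
record Instr(String text, int jumpTarget) {
    boolean isJump() { return jumpTarget >= 0; }
}

class BasicBlockBuilder {
    static List<List<Instr>> buildBlocks(List<Instr> program) {
        // 1. Determine the leaders.
        SortedSet<Integer> leaders = new TreeSet<>();
        leaders.add(0);                                         // first instruction is a leader
        for (int i = 0; i < program.size(); i++) {
            Instr in = program.get(i);
            if (in.isJump()) {
                leaders.add(in.jumpTarget());                   // any jump target is a leader
                if (i + 1 < program.size()) leaders.add(i + 1); // instruction right after a jump
            }
        }
        // 2. A BB consists of a leader and all instructions up to the next leader.
        List<List<Instr>> blocks = new ArrayList<>();
        List<Integer> starts = new ArrayList<>(leaders);
        for (int b = 0; b < starts.size(); b++) {
            int end = (b + 1 < starts.size()) ? starts.get(b + 1) : program.size();
            blocks.add(new ArrayList<>(program.subList(starts.get(b), end)));
        }
        return blocks;
    }
}
```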
(Lecture 3 & 4) Data Flow Analysis - Applications
DFA (Data Flow Analysis) is about how application-specific data flows through the nodes (basic blocks) and edges (control-flow edges) of a CFG (the program). Here, "application-specific" corresponds to abstraction, and "flows" corresponds to over-approximation.
Preliminaries
- Each execution of an IR statement transforms an input state (IN[s]) to a new output state (OUT[s]).
- The input/output state is associated with the program point before/after the statement
- Forward analysis: OUT[s] = f_s(IN[s])
- Backward analysis: IN[s] = f_s(OUT[s])
Reaching Definitions Analysis
A definition d at program point p reaches a point q if there is a path from p to q such that d is not “killed” along that path.
Reaching definitions can be used to detect possibly undefined variables:
introduce a dummy definition for each variable v at the entry of the CFG; if the dummy definition of v reaches a point p where v is used, then v may be used before it is defined (an undefined value reaches the use).
A statement `D: v = x op y` generates a definition D of variable v and "kills" all the other definitions in the program that define variable v.
The transfer function: OUT[B] = gen_B ∪ (IN[B] - kill_B)
The control flow (merging): IN[B] = ∪_P OUT[P] (P is a predecessor of B)
Algorithm
INPUT: CFG
OUTPUT: IN[B] and OUT[B] for each basic block B
METHOD
1 OUT[entry] = ∅
2
3 for (each basic block B excluding entry)
4 OUT[B] = ∅
5
6 while (changes to any OUT occur)
7 for (each basic block B excluding entry)
8 IN[B] = ∪ OUT[P]
9 OUT[B] = gen ∪ (IN[B] - kill)
The first line is the boundary condition. In general, a may-analysis initializes the remaining nodes to bottom (the empty set), while a must-analysis initializes them to top (the universal set).
Note that line 8 is the control flow (meet), while line 9 is the transfer function.
In the lecture's running example, the data facts are encoded as bit vectors: each 1 means that the corresponding definition can reach that program point.
This algorithm halts: the gen and kill sets never change, and each definition is either killed or stays in the flow value forever, so OUT[B] never shrinks; since the set of definitions is finite, the OUT sets must eventually stop changing.
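As a concrete illustration of the iterative algorithm above, here is a minimal Java sketch over bit vectors. The `Block` class and its fields are made up; the list passed to `solve` contains every block except entry, and the entry block may still appear as a predecessor with its OUT left empty (the boundary condition):

```java
import java.util.*;

// Made-up CFG node: 'gen'/'kill' are bit vectors over all definitions,
// 'preds' are the predecessor blocks, 'in'/'out' are the computed flow values.
class Block {
    BitSet gen = new BitSet(), kill = new BitSet();
    List<Block> preds = new ArrayList<>();
    BitSet in = new BitSet(), out = new BitSet();
}

class ReachingDefinitions {
    static void solve(List<Block> blocks) {
        for (Block b : blocks) b.out = new BitSet();      // initialization: OUT[B] = ∅
        boolean changed = true;
        while (changed) {                                 // iterate until a fixed point
            changed = false;
            for (Block b : blocks) {
                BitSet in = new BitSet();
                for (Block p : b.preds) in.or(p.out);     // meet: IN[B] = ∪ OUT[P]
                BitSet out = (BitSet) in.clone();         // transfer: OUT[B] = gen ∪ (IN - kill)
                out.andNot(b.kill);
                out.or(b.gen);
                if (!out.equals(b.out)) changed = true;
                b.in = in;
                b.out = out;
            }
        }
    }
}
```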
Live Variables Analysis
Live variables analysis tells whether the value of variable v at program point p could be used along some path in the CFG starting at p (and v is not redefined before that use). If so, v is live at p; otherwise, v is dead at p.
Live variables analysis can be used for register allocation: e.g., when all registers are full and one must be reused, prefer a register holding a dead value.
In general, live variables analysis is formulated as a backward analysis.
If a variable is used before being redefined within a BB, it is still live at the entry of that BB (the use can still be reached from the entry).
Therefore, the transfer function: IN[B] = use_B ∪ (OUT[B] - def_B), and the control flow (merging): OUT[B] = ∪_S IN[S] (S is a successor of B).
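A tiny worked example of the use/def sets of one block (the block below is hypothetical; assume OUT[B] = {z} comes from B's successors):

```java
// Hypothetical block B for live-variable analysis.
// use_B = {y} (used before any definition in B), def_B = {x, y, z}.
// IN[B] = use_B ∪ (OUT[B] - def_B) = {y} ∪ ({z} - {x, y, z}) = {y}.
class UseDefExample {
    static int block(int y, int z) {
        int x = y + 1;   // y is used before being redefined  ->  y ∈ use_B
        y = 2;           // y ∈ def_B (its earlier use already put it in use_B)
        z = x * y;       // x, z ∈ def_B
        return z;
    }
}
```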
Algorithm
INPUT: CFG
OUTPUT: IN[B] and OUT[B] for each basic block B
METHOD
1 IN[exit] = ∅
2
3 for (each basic block B excluding exit)
4 IN[B] = ∅
5
6 while (changes to any IN occur)
7 for (each basic block B excluding exit)
8 OUT[B] = ∪ IN[S]
9 IN[B] = use ∪ (OUT[B] - def)
Available Expressions Analysis
An expression `x op y` is available at program point p if:
- all paths from the entry to p must pass through the evaluation of `x op y`
- after the last evaluation of `x op y`, there is no redefinition of `x` or `y`
At point p, we can replace the expression `x op y` by the result of its last evaluation. Available expressions can be used for detecting global common subexpressions.
The transfer function: OUT[B] = gen_B ∪ (IN[B] - kill_B)
However, the control flow (merging): IN[B] = ∩_P OUT[P] (P is a predecessor of B)
For safety of the analysis, it may report an expression as unavailable even if it is truly available (must analysis → under-approximation).
Algorithm
INPUT: CFG
OUTPUT: IN[B] and OUT[B] for each basic block B
METHOD
1 OUT[entry] = ∅
2
3 for (each basic block B excluding entry)
4 OUT[B] = U (the universal set of all expressions)
5
6 while (changes to any OUT occur)
7 for (each basic block B excluding entry)
8 IN[B] = ∩ OUT[P]
9 OUT[B] = gen ∪ (IN[B] - kill)
In line 4, OUT[B] is initialized to U, the universal set of all expressions (the top element), as required for a must analysis.
⭐️ Conclusion
| | Reaching Definitions | Live Variables | Available Expressions |
|---|---|---|---|
| Domain | Set of definitions | Set of variables | Set of expressions |
| Direction | Forward | Backward | Forward |
| May/Must | May | May | Must |
| Boundary | OUT[entry] = ∅ | IN[exit] = ∅ | OUT[entry] = ∅ |
| Initialization | OUT[B] = ∅ | IN[B] = ∅ | OUT[B] = U |
| Transfer Function | OUT[B] = gen_B ∪ (IN[B] - kill_B) | IN[B] = use_B ∪ (OUT[B] - def_B) | OUT[B] = gen_B ∪ (IN[B] - kill_B) |
| Meet | ∪ | ∪ | ∩ |
- Understand the three data flow analyses:
- reaching definitions
- live variables
- available expressions
- Can tell the differences and similarities of the three data flow analyses
- Understand the iterative algorithm and can tell why it is able to terminate (fixed point)
(Lecture 5 & 6) Data Flow Analysis - Foundations
Iterative Algorithm
Insights:
- X is a fixed point of function F if X = F(X)
- The iterative algorithm reaches a fixed point.
Questions:
- Is the algorithm guaranteed to terminate (i.e., to reach a fixed point)? Does it always have a solution?
- If so, is there only one solution / one fixed point? If there is more than one, is ours the best (most precise) one?
- When does the algorithm reach the fixed point, i.e., how many iterations are needed to get the solution?
Partial Order
Lattice
However, a complete lattice need not be a finite lattice: consider the poset of all real numbers between 0 and 1 under ≤. It has a lub (1) and a glb (0), and in fact every subset does, yet it is infinite.
Data Flow Analysis Framework via Lattice
Then the remaining issue is to prove that function F is monotonic.
Bit-vector or Gen/Kill problems (set union /intersection for join/meet) are distributive
But some analyses are not distributive
Constant Propagation
Given a variable `x` at program point `p`, determine whether `x` is guaranteed to hold a constant value at `p`.
The OUT of each node in the CFG includes a set of pairs `(x, v)`, where `x` is a variable and `v` is the value held by `x` after that node.
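A minimal Java sketch of the value lattice and its meet operator for constant propagation (the class is made up; the UNDEF/NAC naming follows the usual convention):

```java
// Value lattice for constant propagation: UNDEF (top) ⊒ constants ⊒ NAC (bottom).
class CPValue {
    static final CPValue UNDEF = new CPValue(true, false, 0);
    static final CPValue NAC   = new CPValue(false, true, 0);
    final boolean isUndef, isNAC;
    final int constant;   // meaningful only when the value is a constant

    private CPValue(boolean isUndef, boolean isNAC, int constant) {
        this.isUndef = isUndef; this.isNAC = isNAC; this.constant = constant;
    }
    static CPValue of(int c) { return new CPValue(false, false, c); }

    // Meet: UNDEF ⊓ v = v;  NAC ⊓ v = NAC;  c ⊓ c = c;  c1 ⊓ c2 = NAC when c1 ≠ c2.
    static CPValue meet(CPValue a, CPValue b) {
        if (a.isUndef) return b;
        if (b.isUndef) return a;
        if (a.isNAC || b.isNAC) return NAC;
        return a.constant == b.constant ? a : NAC;
    }
}
```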
Worklist Algorithm
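The worklist algorithm only re-processes blocks whose predecessors' OUT values have changed. Below is a minimal generic sketch for a forward analysis, with made-up `Node`/`Fact` abstractions; the `meet` and `transfer` parameters could, for instance, be instantiated with the constant-propagation lattice sketched above:

```java
import java.util.*;
import java.util.function.BiFunction;

// Minimal sketch of the worklist algorithm for a forward data-flow analysis.
class WorklistSolver<Fact> {
    static class Node<Fact> {
        List<Node<Fact>> preds = new ArrayList<>(), succs = new ArrayList<>();
        Fact in, out;
    }

    void solve(List<Node<Fact>> nodes,
               Fact initial,                                   // initial OUT (also the meet identity)
               BiFunction<Fact, Fact, Fact> meet,              // merges facts from predecessors
               BiFunction<Node<Fact>, Fact, Fact> transfer) {  // OUT[B] = f_B(IN[B])
        for (Node<Fact> n : nodes) n.out = initial;
        Deque<Node<Fact>> worklist = new ArrayDeque<>(nodes);  // initially: all blocks
        while (!worklist.isEmpty()) {
            Node<Fact> b = worklist.poll();
            Fact in = initial;
            for (Node<Fact> p : b.preds) in = meet.apply(in, p.out);  // IN[B] = meet of OUT[P]
            Fact out = transfer.apply(b, in);
            b.in = in;
            if (!out.equals(b.out)) {            // OUT[B] changed:
                b.out = out;
                worklist.addAll(b.succs);        // only B's successors need revisiting
            }
        }
    }
}
```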
(Lecture 7) Interprocedural Analysis
Interprocedural analysis: propagate data-flow information along interprocedural control-flow edges.
Essentially, a call graph is a set of call edges from call-sites to their target methods (callees).
Method Calls
In Java, a virtual call may have multiple possible target methods, which makes call graph construction challenging.
Method Dispatch
Class Hierarchy Analysis (CHA)
Algorithm:
Resolve(cs)
T = {}
m = method signature at cs
if cs is a static call then
T = { m }
if cs is a special call then
cm = class type of m
T = { Dispatch(cm, m) }
if cs is a virtual call then
c = declared type of receiver variable at cs
foreach c' that is a subclass of c or c itself do
add Dispatch(c', m) to T
return T
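Resolve relies on the method-dispatch procedure Dispatch(c, m): find the non-abstract method in class c whose name and descriptor match m, otherwise keep looking in c's superclass. A minimal sketch, with made-up `Method`/`ClassType` stand-ins for a real class-hierarchy model:

```java
import java.util.*;

// Made-up, minimal stand-ins for a real class-hierarchy model.
record Method(String name, String descriptor, boolean isAbstract) {}

record ClassType(String name, ClassType superClass, List<Method> declaredMethods) {

    // Dispatch(c, m): the non-abstract method in c (or c's closest superclass)
    // whose name and descriptor match m; null if no such method exists.
    static Method dispatch(ClassType c, Method m) {
        for (ClassType k = c; k != null; k = k.superClass()) {
            for (Method cand : k.declaredMethods()) {
                if (!cand.isAbstract()
                        && cand.name().equals(m.name())
                        && cand.descriptor().equals(m.descriptor())) {
                    return cand;
                }
            }
        }
        return null;
    }
}
```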
An example:
class A {
void foo() {...}
}
class B extends A {}
class C extends B {
void foo() {...}
}
class D extends B {
void foo() {...}
}
void resolve() {
C c = ...
c.foo();
A a = ...
a.foo();
B b = ...
b.foo();
}
- Resolve(c.foo()) = {C.foo()}
- Resolve(a.foo()) = {A.foo(), C.foo(), D.foo()}
- Resolve(b.foo()) = {A.foo(), C.foo(), D.foo()} ⭐
Features:
- Fast
- Only consider the declared type of receiver variable at the call-site, and its inheritance hierarchy
- Ignore data- and control-flow information
- Imprecise
- Easily introduce spurious target methods
- Addressed in next lectures
A common use case: IDEs (e.g., quickly showing the possible targets of a call).
Call Graph Construction
BuildCallGraph(entry)
worklist = [entry], callgraph = {}, RM = {}
while worklist is not empty do
remove m from worklist
if m ∉ RM then
add m to RM
foreach call site cs in m do
T = Resolve(cs)
foreach target method m' in T do
add cs -> m' to callgraph
add m' to worklist
return callgraph
RM is a set of reachable methods
Inter-procedural CFG
An inter-procedural CFG of a program consists of the CFGs of the methods in the program, plus two kinds of additional edges:
- Call edges: from call sites to the entry nodes of their callees
- Return edges: from exit nodes of the callees to the statements following their call sites (i.e., return sites)
Inter-procedural CFG = CFGs + call & return edges
Edges from call sites to return sites are called call-to-return edges
With call-to-return edges, the analysis can propagate caller-local data flow directly across the call site on the inter-procedural CFG, instead of routing it through the callee, which is more efficient.
| | Intra-procedural | Inter-procedural |
|---|---|---|
| Program representation | CFG | Inter-procedural CFG = CFGs + call & return edges |
| Transfer functions | Node transfer | Node transfer + edge transfer |
Edge Transfer:
- Call edge transfer: transfer data flow from call site to the entry node of callee (along call edges)
- Return edge transfer: transfer data flow from exit node of the callee to the return site (along return edges)
Node transfer: the same as intra-procedural constant propagation, except that for call nodes the transfer function is the identity function.
For call-to-return edges, kill the data-flow value of the LHS variable of the call site; its value will flow to the return site along the return edges. Keeping the old value as well would cause imprecision.
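A tiny hypothetical example of why this kill matters for inter-procedural constant propagation:

```java
// Hypothetical example: killing the LHS of a call on the call-to-return edge.
class CallToReturnKill {
    static int ten() { return 10; }

    static void client() {
        int a = 1;      // the fact {a = 1} is propagated along the call-to-return edge
        int b = 0;      // the old fact {b = 0} is killed on that edge
        b = ten();      // b's value arrives via the return edge as {b = 10};
                        // without the kill, 0 and 10 would merge into NAC
        int c = a + b;  // with the kill, constant propagation derives c = 11
    }
}
```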
(Lecture 8) Pointer Analysis
CHA only considers class hierarchies, thus being imprecise.
Pointer analysis computes which memory locations a pointer can point to.
Pointer analysis is a may-analysis, computing an over-approximation of the set of objects that a pointer can point to.
Alias analysis is related but different: it asks whether two pointers can point to the same object. Alias information can be derived from points-to results.
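A tiny hypothetical snippet showing how alias facts can be read off points-to results:

```java
// Hypothetical snippet: points-to information vs. alias information.
class PointsToVsAlias {
    static void example() {
        Object o = new Object();   // allocation site o1:  pt(o) = {o1}
        Object p = o;              //                      pt(p) = {o1}
        Object q = new Object();   // allocation site o2:  pt(q) = {o2}
        // pt(o) ∩ pt(p) ≠ ∅  ->  o and p may alias;
        // pt(o) ∩ pt(q) = ∅  ->  o and q can never alias.
        // Alias information can thus be derived from points-to results.
    }
}
```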
Key Factors in Pointer Analysis
Concerned Statements
We only focus on pointer-affecting statements.
Pointers in Java:
- local variable: x
- static field: C.f
- instance field: x.f
- array element: array[i]
Pointer-Affecting Statements:
- new: `x = new T()`
- assign: `x = y`
- store: `x.f = y`
- load: `y = x.f`
- call: `r = x.k(a, ...)` (virtual call)
(Lecture 9 & 10) Pointer Analysis - Foundations
How to Implement Pointer Analysis
Andersen-style analysis computes inclusion (subset) constraints for pointers.
However, pointer analysis is still involved, because the PFG (Pointer Flow Graph) is updated dynamically during the analysis.
Pointer Analysis - Algorithm
In practice, Δ is usually small compared with the original set, so propagating only the new points-to information (Δ) improves efficiency.
Besides, Δ is also important for efficiency when handling stores, loads, and method calls.
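A minimal Java sketch of the Δ-based propagation step, with made-up `Pointer`/worklist types (abstract objects are just strings here):

```java
import java.util.*;

// Minimal sketch of Δ-based propagation in an Andersen-style pointer analysis.
class DeltaPropagation {
    static class Pointer {
        Set<String> pts = new HashSet<>();          // pt(n): the points-to set of this pointer
        List<Pointer> pfgSuccs = new ArrayList<>(); // outgoing PFG edges n -> s
    }
    record Entry(Pointer pointer, Set<String> pts) {}  // a worklist entry <n, pts>

    static void processEntry(Deque<Entry> worklist, Entry e) {
        Pointer n = e.pointer();
        Set<String> delta = new HashSet<>(e.pts());
        delta.removeAll(n.pts);                     // Δ = pts - pt(n): keep only the new objects
        propagate(worklist, n, delta);
        // ... stores, loads and calls are then handled using Δ only (omitted) ...
    }

    static void propagate(Deque<Entry> worklist, Pointer n, Set<String> delta) {
        if (delta.isEmpty()) return;
        n.pts.addAll(delta);                        // pt(n) ∪= Δ
        for (Pointer s : n.pfgSuccs) {
            worklist.add(new Entry(s, delta));      // push only Δ along every PFG edge n -> s
        }
    }
}
```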
Pointer Analysis with Method Calls
The receiver object should only flow to the `this` variable of the corresponding target method, so it is passed to `m_this` directly; adding a PFG edge `x → m_this` instead would introduce spurious points-to relations for `this` variables.
Pointer Analysis with Method Calls - Algorithms
AddReachable only processes `new` and `assign` statements, because `load` and `store` statements need points-to information, which is not yet available for a newly added reachable method.
(Lecture 11 & 12) Pointer Analysis - Context Sensitivity
Context-insensitive analysis does not distinguish different calling contexts; merging them introduces spurious data flows.
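A tiny hypothetical example of such a spurious flow, phrased in terms of constant propagation:

```java
// Hypothetical example: spurious flows under context insensitivity.
class ContextExample {
    static int id(int n) { return n; }

    static void caller() {
        int x = id(1);   // call site 1: n receives 1
        int y = id(2);   // call site 2: n receives 2
        // Context-insensitively both calls share one copy of n, so n becomes NAC
        // and so do x and y. With 1-call-site sensitivity, the clones id[cs1] and
        // id[cs2] keep n = 1 and n = 2 apart, so x = 1 and y = 2 are preserved.
    }
}
```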
Cloning-Based Context Sensitivity
This is the most straightforward approach: each method is qualified by one or more contexts, and its variables are qualified by the same contexts. Essentially, each method and its variables are cloned, one clone per context.
Context-Sensitive Heap
OO programs are typically heap-intensive. In practice, to improve precision, context sensitivity should also be applied to heap abstraction.
Algorithm
Context-Sensitivity Variants
- Call-site sensitivity: uses call sites (call strings) as contexts.
- Object sensitivity: uses receiver objects as contexts; it mainly targets object-oriented languages and generally outperforms call-site sensitivity for them.
- Type sensitivity: a coarser abstraction of object sensitivity; less precise than object sensitivity but faster.
(Lecture 13) Static Analysis for Security
- Explicit information flow: secret data flows directly through assignments, similar to ordinary data flow.
- Implicit information flow: the control flow is affected by secret information (e.g., branching on a secret), so information can leak even without direct assignments.
Taint Analysis
Taint analysis answers "which tainted data a pointer (at a sink) can point to".
Taint analysis and pointer analysis are similar:
- Treats tainted data as (artificial) objects
- Treats sources as allocation sites (of tainted data)
- Leverages pointer analysis to propagate tainted data
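A tiny hypothetical source/sink pair (the method names are made up) illustrating this view:

```java
// Hypothetical source/sink pair for taint analysis.
class TaintExample {
    static String source() { return "secret"; }            // treated as an allocation site of
                                                            // a tainted object t_secret
    static void sink(String s) { System.out.println(s); }  // tainted data reaching here is reported

    static void leak() {
        String t = source();   // pt(t) ⊇ {t_secret}
        String u = t;          // pointer analysis propagates the tainted object to u
        sink(u);               // taint analysis reports the flow from source() to sink()
    }
}
```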
(Lecture 14) Datalog-Based Program Analysis
Motivation: a Datalog-based implementation is more succinct and readable than an imperative one.
Datalog = Data + Logic
- No side-effects
- No control flows
- No functions
- Not Turing-complete
However, with Datalog one cannot fully control performance, and some logic is impossible or inconvenient to express.
Datalog Syntax
- Predicate (data): represents a relation
- Atom: the basic element of Datalog; consists of a predicate name and arguments, like `P(X1, X2, ..., Xn)`
- Rule: expresses a logical inference; consists of a head (an atom) and a body, like `H <- B1, B2, ..., Bn`
- EDB (extensional database): immutable, given as input; can be seen as the input relations
- IDB (intensional database): inferred by the rules; can be seen as the output relations
- Logical or: write multiple rules with the same head, or use the operator `;`
- Negation: `!`
- Recursion
Datalog Attributes
Rule Safety: a rule is safe if every variable appears in at least one non-negated relational atom
Recursion and negation: an atom must not be defined recursively through its own negation (recursion and negation of the same atom must be separated).
Execution of Datalog Programs
EDB + Rules -> Datalog engine -> IDB
Datalog engine deduces facts by given rules and EDB predicates until no new facts can be deduced.
Monotonicity: Datalog is monotone as facts cannot be deleted.
Termination: a Datalog program always terminates as 1) Datalog is monotone; 2) Possible values of IDB predicates are finite.
This also reflects that Datalog is not Turing-complete (otherwise we could not guarantee that every program halts).
Pointer Analysis via Datalog
Taint Analysis via Datalog
(Lecture 15) CFL-Reachability and IFDS
Infeasible Paths: Paths in CFG that do not correspond to actual executions.
Realizable Paths: The paths in which "returns" are matched with corresponding "calls"
Realizable paths may not be executable, but unrealizable paths must not be executable. So our goal is to recognize realizable paths so that we could avoid polluting analysis results along unrealizable paths.
`0 -> 0` is the glue edge; without glue edges, IFDS cannot produce correct solutions.
Distributivity of IFDS
Given a statement S, besides S itself, if we need to consider multiple input data facts to create correct outputs, then the analysis is not distributive and should not be expressed in IFDS.
In IFDS, each data fact (circle) and its propagation (edges) could be handled independently, and doing so will not affect the correctness of the final results.
Regardless of the infinite domain issue:
- IFDS cannot solve constant propagation.
- IFDS can do linear constant propagation (e.g. y = 2x + 3) or copy constant propagation (e.g. x = 2, y = 3)
- IFDS cannot do pointer analysis.
Note that alias information in IFDS, say `alias(x, y)`, involves multiple input data facts that must be considered together; thus pointer analysis is non-distributive. (The sketch below illustrates the same issue for full constant propagation.)
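A tiny hypothetical example of why full constant propagation needs to look at several input facts at once:

```java
// Hypothetical example: a statement that is not distributive over single data facts.
class NonDistributive {
    static void example(boolean flag) {
        int x, y;
        if (flag) { x = 1; y = 9; }
        else      { x = 9; y = 1; }
        int z = x + y;
        // The fact about z depends on the facts about x and y *together*.
        // IFDS propagates each data fact independently, so (beyond the
        // infinite-domain issue) it cannot express this transfer function.
    }
}
```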
(Lecture 16) Soundness and Soundiness
Soundness is conservative approximation. The analysis captures all program behaviors, or the analysis result models all possible executions of the program.
Hard language features for static analysis:
- Java: reflection, native code, dynamic class loading
- JavaScript: eval, document object model (DOM)
- C/C++: pointer arithmetic, function pointers
(CACM 2015) In Defense of Soundiness: A Manifesto points out: A soundy analysis typically means that the analysis is mostly sound, with well-identified unsound treatments to hard/specific language features.
Java Reflection
- String Constant analysis + Pointer Analysis
- (APLAS 2005) Reflection Analysis for Java
- Type Inference + String Analysis + Pointer Analysis
- (ECOOP 2014) Self-Inferencing Reflection Resolution for Java
- Assisted by Dynamic Analysis
- (ICSE 2011) Taming reflection: Aiding static analysis in the presence of reflection and custom class loaders
Native Code
Current practice: manually model the critical native code. Recent work: (ISSTA 2020) Identifying Java Calls in Native Code via Binary Scanning.