Static Analysis
Here are some brief notes I took for the NJU PASCAL Static Program Analysis course. My homework solutions are here.
This course is well-organized and has an excellent reputation. There are course videos on Bilibili, a code repository for the assignments on GitHub, and a self-built online judge system. Great work!
(Lecture 1) A Naive Glance at Static Analysis
Background: In the last decade, language cores have changed little, but programs have become significantly larger and more complicated.
Challenge: How can we ensure the reliability, security, and other properties of large-scale, complex programs?
Static analysis analyzes a program P to reason about its behaviors and determine whether it satisfies some properties before running P. However, according to Rice's theorem, any non-trivial property of the behavior of programs in a recursively enumerable (r.e.) language is undecidable.
Soundness and Completeness
Complete $\subset$ Truth $\subset$ Sound (in terms of the set of reported program behaviors).
- Truth = what actually happens at runtime; an analysis that is both complete and sound reports exactly the truth
- Complete = under-approximation = may have false negatives
- Sound = over-approximation = may have false positives
For a practically useful static analysis, soundness is usually the property we insist on.
In a nutshell: static analysis = abstraction + over-approximation.
Abstraction
Abstraction maps values from the concrete domain into an abstract domain; transfer functions then define how statements operate on the abstract values.
Transfer functions are defined according to the specific analysis problem.
(Lecture 2) Intermediate Representation
This lecture talks about the relationship between ASTs and IRs.
3AC (3-address code): there is at most one operator on the right-hand side of an instruction.
An address can be one of: 1. a name; 2. a constant; 3. a compiler-generated temporary.
For example, `a = b + c * d` is lowered to `t1 = c * d; t2 = b + t1; a = t2`.
SSA (Static Single Assignment)
All assignments in SSA are to variables with distinct names
- Give each definition a fresh name
- Propagate fresh name to subsequent uses
- Every variable has exactly one definition
In SSA, flow information is indirectly encoded in the unique variable names, and def-use pairs are explicit; at control-flow merge points, a phi-function selects the incoming definition (e.g., `x2 = φ(x0, x1)`). However, SSA may introduce too many variables and phi-functions, causing inefficiency.
Control Flow Analysis
The CFG (Control Flow Graph) serves as the basic structure for static analysis; its nodes are BBs (Basic Blocks).
A BB can be entered only at its first instruction and left only at its last instruction. Basic blocks can be constructed by the following algorithm:
- Determine the leaders in the 3AC program P:
  - The first instruction in P is a leader
  - Any target instruction of a conditional or unconditional jump is a leader
  - Any instruction that immediately follows a conditional or unconditional jump is a leader
- Build the BBs for P: a BB consists of a leader and all subsequent instructions up to (but not including) the next leader
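Below is a minimal Java sketch of this partitioning, assuming a made-up `Instr` record for 3AC instructions (real IRs carry much more information than this):

```java
import java.util.*;

// A made-up 3AC instruction: 'jumpTarget' is the index of the jump target,
// or -1 if the instruction is not a jump.
record Instr(String text, int jumpTarget) {
    boolean isJump() { return jumpTarget >= 0; }
}

class BasicBlockBuilder {
    static List<List<Instr>> buildBlocks(List<Instr> program) {
        // 1. Determine the leaders.
        SortedSet<Integer> leaders = new TreeSet<>();
        leaders.add(0);                                         // first instruction is a leader
        for (int i = 0; i < program.size(); i++) {
            Instr in = program.get(i);
            if (in.isJump()) {
                leaders.add(in.jumpTarget());                   // any jump target is a leader
                if (i + 1 < program.size()) leaders.add(i + 1); // instruction right after a jump
            }
        }
        // 2. A BB consists of a leader and all instructions up to the next leader.
        List<List<Instr>> blocks = new ArrayList<>();
        List<Integer> starts = new ArrayList<>(leaders);
        for (int b = 0; b < starts.size(); b++) {
            int end = (b + 1 < starts.size()) ? starts.get(b + 1) : program.size();
            blocks.add(new ArrayList<>(program.subList(starts.get(b), end)));
        }
        return blocks;
    }
}
```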
(Lecture 3 & 4) Data Flow Analysis - Applications
DFA (Data Flow Analysis) is about how application-specific data flows through the nodes (basic blocks) and edges (control-flow edges) of a CFG (the program). Here, "application-specific" corresponds to abstraction, and "flows" corresponds to over-approximation.
Preliminaries
- Each execution of an IR statement transforms an input state (IN[s]) to a new output state (OUT[s]).
- The input/output state is associated with the program point before/after the statement
- Forward analysis: OUT[s] = f_s(IN[s])
- Backward analysis: IN[s] = f_s(OUT[s])
Reaching Definitions Analysis
A definition d at program point p reaches a point q if there is a path from p to q such that d is not “killed” along that path.
Reaching definitions can be used to detect possibly undefined variables:
introduce a dummy definition for each variable v at the entry of the CFG; if the dummy definition of v reaches a point p where v is used, then v may be used before it is defined (an undefined value reaches the use).
A statement `D: v = x op y` generates a definition D of variable v and "kills" all the other definitions in the program that define variable v.
The transfer function: OUT[B] = gen_B ∪ (IN[B] - kill_B)
The control flow (merging): IN[B] = ∪_P OUT[P] (P is a predecessor of B)
Algorithm
INPUT: CFG
OUTPUT: IN[B] and OUT[B] for each basic block B
METHOD
1 OUT[entry] = ∅
2
3 for (each basic block B excluding entry)
4 OUT[B] = ∅
5
6 while (changes to any OUT occur)
7 for (each basic block B excluding entry)
8 IN[B] = ∪ OUT[P]
9 OUT[B] = gen ∪ (IN[B] - kill)
The first line is the boundary condition. In general, a may-analysis initializes the remaining nodes to bottom (the empty set), while a must-analysis initializes them to top (the universal set).
Note that line 8 is the control flow (meet), while line 9 is the transfer function.
In the lecture's running example, the data facts are encoded as bit vectors: each 1 means that the corresponding definition can reach that program point.
This algorithm halts: the gen and kill sets never change, and each definition is either killed or stays in the flow value forever, so OUT[B] never shrinks; since the set of definitions is finite, the OUT sets must eventually stop changing.
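As a concrete illustration of the iterative algorithm above, here is a minimal Java sketch over bit vectors. The `Block` class and its fields are made up; the list passed to `solve` contains every block except entry, and the entry block may still appear as a predecessor with its OUT left empty (the boundary condition):

```java
import java.util.*;

// Made-up CFG node: 'gen'/'kill' are bit vectors over all definitions,
// 'preds' are the predecessor blocks, 'in'/'out' are the computed flow values.
class Block {
    BitSet gen = new BitSet(), kill = new BitSet();
    List<Block> preds = new ArrayList<>();
    BitSet in = new BitSet(), out = new BitSet();
}

class ReachingDefinitions {
    static void solve(List<Block> blocks) {
        for (Block b : blocks) b.out = new BitSet();      // initialization: OUT[B] = ∅
        boolean changed = true;
        while (changed) {                                 // iterate until a fixed point
            changed = false;
            for (Block b : blocks) {
                BitSet in = new BitSet();
                for (Block p : b.preds) in.or(p.out);     // meet: IN[B] = ∪ OUT[P]
                BitSet out = (BitSet) in.clone();         // transfer: OUT[B] = gen ∪ (IN - kill)
                out.andNot(b.kill);
                out.or(b.gen);
                if (!out.equals(b.out)) changed = true;
                b.in = in;
                b.out = out;
            }
        }
    }
}
```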
Live Variables Analysis
Live variables analysis tells whether the value of variable v at program point p could be used along some path in the CFG starting at p (and v is not redefined before that use). If so, v is live at p; otherwise, v is dead at p.
Live variables analysis can be used for register allocation: e.g., when all registers are full and one must be reused, prefer a register holding a dead value.
In general, live variables analysis is formulated as a backward analysis.
If a variable is used before being redefined within a BB, it is still live at the entry of that BB (the use can still be reached from the entry).
Therefore, the transfer function: IN[B] = use_B ∪ (OUT[B] - def_B), and the control flow (merging): OUT[B] = ∪_S IN[S] (S is a successor of B).
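A tiny worked example of the use/def sets of one block (the block below is hypothetical; assume OUT[B] = {z} comes from B's successors):

```java
// Hypothetical block B for live-variable analysis.
// use_B = {y} (used before any definition in B), def_B = {x, y, z}.
// IN[B] = use_B ∪ (OUT[B] - def_B) = {y} ∪ ({z} - {x, y, z}) = {y}.
class UseDefExample {
    static int block(int y, int z) {
        int x = y + 1;   // y is used before being redefined  ->  y ∈ use_B
        y = 2;           // y ∈ def_B (its earlier use already put it in use_B)
        z = x * y;       // x, z ∈ def_B
        return z;
    }
}
```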
Algorithm
INPUT: CFG
OUTPUT: IN[B] and OUT[B] for each basic block B
METHOD
1 IN[exit] = ∅
2
3 for (each basic block B excluding exit)
4 IN[B] = ∅
5
6 while (changes to any IN occur)
7 for (each basic block B excluding exit)
8 OUT[B] = ∪ IN[S]
9 IN[B] = use ∪ (OUT[B] - def)
Available Expressions Analysis
An expression `x op y` is available at program point p if:
- all paths from the entry to p must pass through the evaluation of `x op y`
- after the last evaluation of `x op y`, there is no redefinition of `x` or `y`
At point p, we can replace the expression `x op y` by the result of its last evaluation. Available expressions can be used for detecting global common subexpressions.
The transfer function: OUT[B] = gen_B ∪ (IN[B] - kill_B)
However, the control flow (merging): IN[B] = ∩_P OUT[P] (P is a predecessor of B)
For safety of the analysis, it may report an expression as unavailable even if it is truly available (must analysis → under-approximation).
Algorithm
INPUT: CFG
OUTPUT: IN[B] and OUT[B] for each basic block B
METHOD
1 OUT[entry] = ∅
2
3 for (each basic block B excluding entry)
4 OUT[B] = U (the universal set of all expressions)
5
6 while (changes to any OUT occur)
7 for (each basic block B excluding entry)
8 IN[B] = ∩ OUT[P]
9 OUT[B] = gen ∪ (IN[B] - kill)
In line 4, OUT[B] is initialized to U, the universal set of all expressions (the top element), as required for a must analysis.
⭐️ Conclusion
| | Reaching Definitions | Live Variables | Available Expressions |
|---|---|---|---|
| Domain | Set of definitions | Set of variables | Set of expressions |
| Direction | Forward | Backward | Forward |
| May/Must | May | May | Must |
| Boundary | OUT[entry] = ∅ | IN[exit] = ∅ | OUT[entry] = ∅ |
| Initialization | OUT[B] = ∅ | IN[B] = ∅ | OUT[B] = U |
| Transfer Function | OUT[B] = gen_B ∪ (IN[B] - kill_B) | IN[B] = use_B ∪ (OUT[B] - def_B) | OUT[B] = gen_B ∪ (IN[B] - kill_B) |
| Meet | ∪ | ∪ | ∩ |
- Understand the three data flow analyses:
- reaching definitions
- live variables
- available expressions
- Can tell the differences and similarities of the three data flow analyses
- Understand the iterative algorithm and can tell why it is able to terminate (fixed point)
(Lecture 5 & 6) Data Flow Analysis - Foundations
Iterative Algorithm
Insights:
- X is a fixed point of function F if X = F(X)
- The iterative algorithm reaches a fixed point.
Questions:
- Is the algorithm guaranteed to terminate (i.e., to reach a fixed point)? Does it always have a solution?
- If so, is there only one solution / one fixed point? If there is more than one, is ours the best (most precise) one?
- When does the algorithm reach the fixed point, i.e., how many iterations are needed to get the solution?
Partial Order
Lattice
However, a complete lattice need not be a finite lattice: consider the poset of all real numbers between 0 and 1 under ≤. It has a lub (1) and a glb (0), and in fact every subset does, yet it is infinite.
Data Flow Analysis Framework via Lattice
Then the remaining issue is to prove that function F is monotonic.
Bit-vector or Gen/Kill problems (set union /intersection for join/meet) are distributive
But some analyses are not distributive
Constant Propagation
Given a variable `x` at program point `p`, determine whether `x` is guaranteed to hold a constant value at `p`.
The OUT of each node in the CFG includes a set of pairs `(x, v)`, where `x` is a variable and `v` is the value held by `x` after that node.
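A minimal Java sketch of the value lattice and its meet operator for constant propagation (the class is made up; the UNDEF/NAC naming follows the usual convention):

```java
// Value lattice for constant propagation: UNDEF (top) ⊒ constants ⊒ NAC (bottom).
class CPValue {
    static final CPValue UNDEF = new CPValue(true, false, 0);
    static final CPValue NAC   = new CPValue(false, true, 0);
    final boolean isUndef, isNAC;
    final int constant;   // meaningful only when the value is a constant

    private CPValue(boolean isUndef, boolean isNAC, int constant) {
        this.isUndef = isUndef; this.isNAC = isNAC; this.constant = constant;
    }
    static CPValue of(int c) { return new CPValue(false, false, c); }

    // Meet: UNDEF ⊓ v = v;  NAC ⊓ v = NAC;  c ⊓ c = c;  c1 ⊓ c2 = NAC when c1 ≠ c2.
    static CPValue meet(CPValue a, CPValue b) {
        if (a.isUndef) return b;
        if (b.isUndef) return a;
        if (a.isNAC || b.isNAC) return NAC;
        return a.constant == b.constant ? a : NAC;
    }
}
```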
Worklist Algorithm
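The worklist algorithm only re-processes blocks whose predecessors' OUT values have changed. Below is a minimal generic sketch for a forward analysis, with made-up `Node`/`Fact` abstractions; the `meet` and `transfer` parameters could, for instance, be instantiated with the constant-propagation lattice sketched above:

```java
import java.util.*;
import java.util.function.BiFunction;

// Minimal sketch of the worklist algorithm for a forward data-flow analysis.
class WorklistSolver<Fact> {
    static class Node<Fact> {
        List<Node<Fact>> preds = new ArrayList<>(), succs = new ArrayList<>();
        Fact in, out;
    }

    void solve(List<Node<Fact>> nodes,
               Fact initial,                                   // initial OUT (also the meet identity)
               BiFunction<Fact, Fact, Fact> meet,              // merges facts from predecessors
               BiFunction<Node<Fact>, Fact, Fact> transfer) {  // OUT[B] = f_B(IN[B])
        for (Node<Fact> n : nodes) n.out = initial;
        Deque<Node<Fact>> worklist = new ArrayDeque<>(nodes);  // initially: all blocks
        while (!worklist.isEmpty()) {
            Node<Fact> b = worklist.poll();
            Fact in = initial;
            for (Node<Fact> p : b.preds) in = meet.apply(in, p.out);  // IN[B] = meet of OUT[P]
            Fact out = transfer.apply(b, in);
            b.in = in;
            if (!out.equals(b.out)) {            // OUT[B] changed:
                b.out = out;
                worklist.addAll(b.succs);        // only B's successors need revisiting
            }
        }
    }
}
```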
(Lecture 7) Interprocedural Analysis
Interprocedural analysis: propagate data-flow information along interprocedural control-flow edges.
Essentially, a call graph is a set of call edges from call-sites to their target methods (callees).
Method Calls
In Java, a virtual call may have multiple possible target methods, which makes call graph construction challenging.
Method Dispatch
Class Hierarchy Analysis (CHA)
Algorithm:
Resolve(cs)
T = {}
m = method signature at cs
if cs is a static call then
T = { m }
if cs is a special call then
cm = class type of m
T = { Dispatch(cm, m) }
if cs is a virtual call then
c = declared type of receiver variable at cs
foreach c' that is a subclass of c or c itself do
add Dispatch(c', m) to T
return T
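Resolve relies on the method-dispatch procedure Dispatch(c, m): find the non-abstract method in class c whose name and descriptor match m, otherwise keep looking in c's superclass. A minimal sketch, with made-up `Method`/`ClassType` stand-ins for a real class-hierarchy model:

```java
import java.util.*;

// Made-up, minimal stand-ins for a real class-hierarchy model.
record Method(String name, String descriptor, boolean isAbstract) {}

record ClassType(String name, ClassType superClass, List<Method> declaredMethods) {

    // Dispatch(c, m): the non-abstract method in c (or c's closest superclass)
    // whose name and descriptor match m; null if no such method exists.
    static Method dispatch(ClassType c, Method m) {
        for (ClassType k = c; k != null; k = k.superClass()) {
            for (Method cand : k.declaredMethods()) {
                if (!cand.isAbstract()
                        && cand.name().equals(m.name())
                        && cand.descriptor().equals(m.descriptor())) {
                    return cand;
                }
            }
        }
        return null;
    }
}
```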
An example:
class A {
void foo() {...}
}
class B extends A {}
class C extends B {
void foo() {...}
}
class D extends B {
void foo() {...}
}
void resolve() {
C c = ...
c.foo();
A a = ...
a.foo();
B b = ...
b.foo();
}
- Resolve(c.foo()) = {C.foo()}
- Resolve(a.foo()) = {A.foo(), C.foo(), D.foo()}
- Resolve(b.foo()) = {A.foo(), C.foo(), D.foo()} ⭐
Features:
- Fast
- Only consider the declared type of receiver variable at the call-site, and its inheritance hierarchy
- Ignore data- and control-flow information
- Imprecise
- Easily introduce spurious target methods
- Addressed in next lectures
A common use case: IDEs (e.g., quickly showing the possible targets of a call).
Call Graph Construction
BuildCallGraph(entry)
worklist = [entry], callgraph = {}, RM = {}
while worklist is not empty do
remove m from worklist
if m ∉ RM then
add m to RM
foreach call site cs in m do
T = Resolve(cs)
foreach target method m' in T do
add cs -> m' to callgraph
add m' to worklist
return callgraph
RM is a set of reachable methods
Inter-procedural CFG
An inter-procedural CFG of a program consists of the CFGs of the methods in the program, plus two kinds of additional edges:
- Call edges: from call sites to the entry nodes of their callees
- Return edges: from exit nodes of the callees to the statements following their call sites (i.e., return sites)
Inter-procedural CFG = CFGs + call & return edges
Edges from call sites to return sites are called call-to-return edges
With call-to-return edges, the analysis can propagate caller-local data flow directly across the call site on the inter-procedural CFG, instead of routing it through the callee, which is more efficient.
| | Intra-procedural | Inter-procedural |
|---|---|---|
| Program representation | CFG | Inter-procedural CFG = CFGs + call & return edges |
| Transfer functions | Node transfer | Node transfer + edge transfer |
Edge Transfer:
- Call edge transfer: transfer data flow from call site to the entry node of callee (along call edges)
- Return edge transfer: transfer data flow from exit node of the callee to the return site (along return edges)
Node transfer: the same as intra-procedural constant propagation, except that for call nodes the transfer function is the identity function.
For call-to-return edges, kill the data-flow value of the LHS variable of the call site; its value will flow to the return site along the return edges. Keeping the old value as well would cause imprecision.
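A tiny hypothetical example of why this kill matters for inter-procedural constant propagation:

```java
// Hypothetical example: killing the LHS of a call on the call-to-return edge.
class CallToReturnKill {
    static int ten() { return 10; }

    static void client() {
        int a = 1;      // the fact {a = 1} is propagated along the call-to-return edge
        int b = 0;      // the old fact {b = 0} is killed on that edge
        b = ten();      // b's value arrives via the return edge as {b = 10};
                        // without the kill, 0 and 10 would merge into NAC
        int c = a + b;  // with the kill, constant propagation derives c = 11
    }
}
```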
(Lecture 8) Pointer Analysis
CHA only considers class hierarchies, thus being imprecise.
Pointer analysis computes which memory locations a pointer can point to.
Pointer analysis is a may-analysis, computing an over-approximation of the set of objects that a pointer can point to.
Alias analysis is related but different: it asks whether two pointers can point to the same object. Alias information can be derived from points-to results.
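A tiny hypothetical snippet showing how alias facts can be read off points-to results:

```java
// Hypothetical snippet: points-to information vs. alias information.
class PointsToVsAlias {
    static void example() {
        Object o = new Object();   // allocation site o1:  pt(o) = {o1}
        Object p = o;              //                      pt(p) = {o1}
        Object q = new Object();   // allocation site o2:  pt(q) = {o2}
        // pt(o) ∩ pt(p) ≠ ∅  ->  o and p may alias;
        // pt(o) ∩ pt(q) = ∅  ->  o and q can never alias.
        // Alias information can thus be derived from points-to results.
    }
}
```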
Key Factors in Pointer Analysis
Concerned Statements
We only focus on pointer-affecting statements.
Pointers in Java:
- local variable: x
- static field: C.f
- instance field: x.f
- array element: array[i]
Pointer-Affecting Statements:
- new: `x = new T()`
- assign: `x = y`
- store: `x.f = y`
- load: `y = x.f`
- call: `r = x.k(a, ...)` (virtual call)
(Lecture 9 & 10) Pointer Analysis - Foundations
How to Implement Pointer Analysis
Andersen-style analysis computes inclusion (subset) constraints for pointers.
However, pointer analysis is still involved, because the PFG (Pointer Flow Graph) is updated dynamically during the analysis.
Pointer Analysis - Algorithm
In practice, Δ is usually small compared with the original set, so propagating only the new points-to information (Δ) improves efficiency.
Besides, Δ is also important for efficiency when handling stores, loads, and method calls.
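A minimal Java sketch of the Δ-based propagation step, with made-up `Pointer`/worklist types (abstract objects are just strings here):

```java
import java.util.*;

// Minimal sketch of Δ-based propagation in an Andersen-style pointer analysis.
class DeltaPropagation {
    static class Pointer {
        Set<String> pts = new HashSet<>();          // pt(n): the points-to set of this pointer
        List<Pointer> pfgSuccs = new ArrayList<>(); // outgoing PFG edges n -> s
    }
    record Entry(Pointer pointer, Set<String> pts) {}  // a worklist entry <n, pts>

    static void processEntry(Deque<Entry> worklist, Entry e) {
        Pointer n = e.pointer();
        Set<String> delta = new HashSet<>(e.pts());
        delta.removeAll(n.pts);                     // Δ = pts - pt(n): keep only the new objects
        propagate(worklist, n, delta);
        // ... stores, loads and calls are then handled using Δ only (omitted) ...
    }

    static void propagate(Deque<Entry> worklist, Pointer n, Set<String> delta) {
        if (delta.isEmpty()) return;
        n.pts.addAll(delta);                        // pt(n) ∪= Δ
        for (Pointer s : n.pfgSuccs) {
            worklist.add(new Entry(s, delta));      // push only Δ along every PFG edge n -> s
        }
    }
}
```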
Pointer Analysis with Method Calls
The receiver object should only flow to the `this` variable of the corresponding target method, so it is passed to `m_this` directly; adding a PFG edge `x → m_this` instead would introduce spurious points-to relations for `this` variables.
Pointer Analysis with Method Calls - Algorithms
AddReachable only processes `new` and `assign` statements, because `load` and `store` statements need points-to information, which is not yet available for a newly added reachable method.
(Lecture 11 & 12) Pointer Analysis - Context Sensitivity
Context-insensitive analysis does not distinguish different calling contexts; merging them introduces spurious data flows.
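A tiny hypothetical example of such a spurious flow, phrased in terms of constant propagation:

```java
// Hypothetical example: spurious flows under context insensitivity.
class ContextExample {
    static int id(int n) { return n; }

    static void caller() {
        int x = id(1);   // call site 1: n receives 1
        int y = id(2);   // call site 2: n receives 2
        // Context-insensitively both calls share one copy of n, so n becomes NAC
        // and so do x and y. With 1-call-site sensitivity, the clones id[cs1] and
        // id[cs2] keep n = 1 and n = 2 apart, so x = 1 and y = 2 are preserved.
    }
}
```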
Cloning-Based Context Sensitivity
This is the most straightforward approach: each method is qualified by one or more contexts, and its variables are qualified by the same contexts. Essentially, each method and its variables are cloned, one clone per context.
Context-Sensitive Heap
OO programs are typically heap-intensive. In practice, to improve precision, context sensitivity should also be applied to heap abstraction.
Algorithm
Context-Sensitivity Variants
- Call-site sensitivity: uses call sites (call strings) as contexts.
- Object sensitivity: uses receiver objects as contexts; it mainly targets object-oriented languages and generally outperforms call-site sensitivity for them.
- Type sensitivity: a coarser abstraction of object sensitivity; less precise than object sensitivity but faster.
(Lecture 13) Static Analysis for Security
- Explicit information flow: secret data flows directly through assignments, similar to ordinary data flow.
- Implicit information flow: the control flow is affected by secret information (e.g., branching on a secret), so information can leak even without direct assignments.
Taint Analysis
Taint analysis answers "which tainted data a pointer (at a sink) can point to".
Taint analysis and pointer analysis are similar:
- Treats tainted data as (artificial) objects
- Treats sources as allocation sites (of tainted data)
- Leverages pointer analysis to propagate tainted data
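A tiny hypothetical source/sink pair (the method names are made up) illustrating this view:

```java
// Hypothetical source/sink pair for taint analysis.
class TaintExample {
    static String source() { return "secret"; }            // treated as an allocation site of
                                                            // a tainted object t_secret
    static void sink(String s) { System.out.println(s); }  // tainted data reaching here is reported

    static void leak() {
        String t = source();   // pt(t) ⊇ {t_secret}
        String u = t;          // pointer analysis propagates the tainted object to u
        sink(u);               // taint analysis reports the flow from source() to sink()
    }
}
```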
(Lecture 14) Datalog-Based Program Analysis
Motivation: a Datalog-based implementation is more succinct and readable than an imperative one.
Datalog = Data + Logic
- No side-effects
- No control flows
- No functions
- Not Turing-complete
However, with Datalog one cannot fully control performance, and some logic is impossible or inconvenient to express.
Datalog Syntax
- Predicate (data): represents a relation
- Atom: the basic element of Datalog; consists of a predicate name and arguments, like `P(X1, X2, ..., Xn)`
- Rule: expresses a logical inference; consists of a head (an atom) and a body, like `H <- B1, B2, ..., Bn`
- EDB (extensional database): immutable, given as input; can be seen as the input relations
- IDB (intensional database): inferred by the rules; can be seen as the output relations
- Logical or: write multiple rules with the same head, or use the operator `;`
- Negation: `!`
- Recursion
Datalog Attributes
Rule Safety: a rule is safe if every variable appears in at least one non-negated relational atom
Recursion and negation: an atom must not be defined recursively through its own negation (recursion and negation of the same atom must be separated).
Execution of Datalog Programs
EDB + Rules -> Datalog engine -> IDB
Datalog engine deduces facts by given rules and EDB predicates until no new facts can be deduced.
Monotonicity: Datalog is monotone as facts cannot be deleted.
Termination: a Datalog program always terminates as 1) Datalog is monotone; 2) Possible values of IDB predicates are finite.
This also reflects that Datalog is not Turing-complete (otherwise we could not guarantee that every program halts).
Pointer Analysis via Datalog
Taint Analysis via Datalog
(Lecture 15) CFL-Reachability and IFDS
Infeasible Paths: Paths in CFG that do not correspond to actual executions.
Realizable Paths: The paths in which "returns" are matched with corresponding "calls"
Realizable paths may not be executable, but unrealizable paths must not be executable. So our goal is to recognize realizable paths so that we could avoid polluting analysis results along unrealizable paths.
`0 -> 0` is the glue edge; without glue edges, IFDS cannot produce correct solutions.
Distributivity of IFDS
Given a statement S, besides S itself, if we need to consider multiple input data facts to create correct outputs, then the analysis is not distributive and should not be expressed in IFDS.
In IFDS, each data fact (circle) and its propagation (edges) could be handled independently, and doing so will not affect the correctness of the final results.
Regardless of the infinite domain issue:
- IFDS cannot solve constant propagation.
- IFDS can do linear constant propagation (e.g. y = 2x + 3) or copy constant propagation (e.g. x = 2, y = 3)
- IFDS cannot do pointer analysis.
Note that alias information in IFDS, say `alias(x, y)`, involves multiple input data facts that must be considered together; thus pointer analysis is non-distributive. (The sketch below illustrates the same issue for full constant propagation.)
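A tiny hypothetical example of why full constant propagation needs to look at several input facts at once:

```java
// Hypothetical example: a statement that is not distributive over single data facts.
class NonDistributive {
    static void example(boolean flag) {
        int x, y;
        if (flag) { x = 1; y = 9; }
        else      { x = 9; y = 1; }
        int z = x + y;
        // The fact about z depends on the facts about x and y *together*.
        // IFDS propagates each data fact independently, so (beyond the
        // infinite-domain issue) it cannot express this transfer function.
    }
}
```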
(Lecture 16) Soundness and Soundiness
Soundness is conservative approximation. The analysis captures all program behaviors, or the analysis result models all possible executions of the program.
Hard language features for static analysis:
- Java: reflection, native code, dynamic class loading
- JavaScript: eval, document object model (DOM)
- C/C++: pointer arithmetic, function pointers
(CACM 2015) In Defense of Soundiness: A Manifesto points out: A soundy analysis typically means that the analysis is mostly sound, with well-identified unsound treatments to hard/specific language features.
Java Reflection
- String Constant analysis + Pointer Analysis
- (APLAS 2005) Reflection Analysis for Java
- Type Inference + String Analysis + Pointer Analysis
- (ECOOP 2014) Self-Inferencing Reflection Resolution for Java
- Assisted by Dynamic Analysis
- (ICSE 2011) Taming reflection: Aiding static analysis in the presence of reflection and custom class loaders
Native Code
Current practice: manually model the critical native code. Recent work: (ISSTA 2020) Identifying Java Calls in Native Code via Binary Scanning.