This is the fourth part of a series on the Chrome browser and its JavaScript engine, V8. This part's main focus is Turbofan, Chrome's just-in-time compiler. Turbofan is in charge of producing highly optimized machine code for the architecture that the browser runs on. To fully understand how all of this works, we will start by talking about the Sea of Nodes graph representation that V8 uses for its optimizations and then walk through various optimization passes performed by V8, using Turbolizer to visualize the Sea of Nodes graph at each step.
When a function is regarded as 'hot', it is passed on to Turbofan, which then optimizes it into machine code. Functions are considered hot when they are called very frequently; this usually takes around 1,000-10,000 function calls, although the exact number depends on various factors. Once a function has been jitted by Turbofan, the machine code can be run directly the next time it is called, skipping all of the intermediate steps the interpreter would otherwise have to perform.
As discussed in an earlier part, the Ignition interpreter maintains inline caches that store information about the types and maps of function arguments. Once a function is regarded as 'hot', this inline cache information is passed on to Turbofan alongside the actual bytecode. Turbofan starts by handing this bytecode to its graph builder, which produces a Sea of Nodes graph. Once that graph is complete, Turbofan starts inlining various functions into it. This inlining enables it to perform many more optimizations than it could otherwise. At this point, the OptimizeGraph() function is called. This function walks the Sea of Nodes graph and performs various optimization passes on it. Once all optimizations are complete, the Sea of Nodes graph is transformed into a control flow graph, which is then finally lowered into the actual machine code.
Below you can see an overview of Turbofan's optimization pipeline and some of the optimizations it performs. Note, however, that many more optimization passes exist than the ones shown here.
As mentioned earlier, we will start with the Sea of Nodes graph. Representations such as source code, abstract syntax trees, or even bytecode are not well suited for heavy optimizations. Instead, V8 uses a specialized representation called a Sea of Nodes graph.
Earlier, when we talked about the parser and Ignition, we went over the abstract syntax tree representation. ASTs are great for parsing, but when it comes to optimizations they fall short. The AST representation stores only a limited amount of information about variables, and that information is spread throughout the tree. For example, reasoning that a certain variable does not change during a certain section of code, and optimizing accordingly, requires a lot of work with an AST. To combat this problem, more advanced graph representations are used.
Sea of Nodes is an intermediate program representation that combines the features of data-flow and control-flow graphs. Traditionally, optimizing compilers used a combination of a DFG and a CFG to represent code. The DFG was used to perform optimizations regarding data dependencies in the code, while the CFG was used to maintain the actual control flow. The SON graph combines these two graphs into one: every single node/element/instruction within the graph carries both data-flow and control-flow information.
Since the SON graph combines both the DFG and CFG representations, let's start by talking about those two representations first to more easily understand the more complicated SON graph afterwards.
Data Flow Graph
This graph is focused on how data flows through the program. It does not concern itself with any form of control flow and can thus look a little messy. It makes use of def-use (definition and use) relationships between different values and an SSA (static single assignment) form. SSA requires each variable to be assigned exactly once, so an operation such as a += 3 would instead be written as a2 = a1 + 3.
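To illustrate, here is a small snippet next to a hand-written SSA version of it (illustrative only; V8 builds its SSA form on the graph, not on source code):

// Original code: the variable a is assigned three times.
// let a = 1;
// a = a + 3;
// a = a * 2;

// Equivalent SSA form: every variable is assigned exactly once.
let a1 = 1;
let a2 = a1 + 3;
let a3 = a2 * 2;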
While data-flow analysis allows us to make safe assumptions about how certain aspects of the program can be optimized, it tracks nothing but data flow. Using only a DFG, we lose all information about control flow, which would make it very hard to actually generate code from this representation. The DFG does not enforce any ordering of nodes at all; it only serves to track dependencies between data.
Control Flow Graph
This representation groups instructions into basic blocks: straight-line sequences of instructions executed between branches. This is very similar to how the CPU handles branching and thus makes code generation a lot simpler. While this graph does track data-flow relations, performing data-flow analysis on it quickly becomes very complex, which is why this representation is generally used together with a DFG.
Below you can see examples of both a DFG and a CFG for the following code.
function f(a, b) {
  b++;
  if (a > 0) {
    c = a + b;
  } else {
    c = a * b;
  }
  a = c;
}
Now, let's finally talk about the actual Sea of Nodes graph. The SON representation uses three different types of edges.
1. Control Edges
These edges are very similar to edges in a CFG. They enable branches and loops.
2. Value Edges
These edges are similar to edges in a DFG. They represent value dependencies and connect inputs and outputs.
3. Effect Edges
These edges make sure that operations are scheduled correctly.
For example, for x = x + arr[1];, both x and arr[1] need to be read before the add operation can be performed, and the result needs to be written back afterwards. The effect chain in this case would be load -> add -> store.
The SON graph also uses the SSA representation. Edges in this representation are direct, unlabeled pointers. Data and control are both represented in this graph, and whenever possible, data is strongly typed (this is where the typer optimization phase comes in). Below you can see an example of the SON representation of the previous code example. This graph was generated using Turbolizer, a tool created by the V8 team to visualize the SON graph. If you want to replicate these examples, make sure to use either the helper functions mentioned in part 1 or a large for-loop (10k+ iterations) to call the function repeatedly and actually trigger the optimization; otherwise, the SON representation is never generated. You will also need the --trace-turbo option to generate the JSON file that Turbolizer uses as its input.
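As a reference, generating the Turbolizer input for a function could look something like this (the %-prefixed runtime functions require the --allow-natives-syntax flag; the file and function names here are just placeholders):

// Run with: d8 --trace-turbo --allow-natives-syntax example.js
// This produces turbo-*.json files that Turbolizer can load.
function f(a, b) {
  return a + b;
}

// Option 1: call the function often enough for it to become hot.
for (let i = 0; i < 100000; i++) f(1, 2);

// Option 2: force the optimization directly via the natives syntax.
// %PrepareFunctionForOptimization(f);
// f(1, 2);
// %OptimizeFunctionOnNextCall(f);
// f(1, 2);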
As you can see, this graph is incredibly complex, even for such a simple function. For our purposes, we will be ignoring most of it; however, as you will soon see, the graph is very useful when trying to understand the various optimization stages performed by Turbofan. A few of the nodes, such as 'NumberConstant', 'Branch', or 'SpeculativeSafeIntegerAdd', should already be understandable to you. Future graph screenshots will be manually edited to only include the relevant information.
This concludes our introduction to the SON representation. In the rest of this article, we will be working with these graphs extensively using the Turbolizer tool, which lets us interactively select the various optimization stages in Turbofan's pipeline and see how the graph changes at each step.
The Turbofan JavaScript pipeline works by first creating an (unoptimized) Sea of Nodes graph from the bytecode and passing that graph to the so-called inlining phase. This phase is extremely important for all further optimizations: by inlining functions, the compiler has a much easier time recognizing code that can be optimized together.
Turbofan has a limited inlining budget (it can't just inline everything recursively), so it needs to carefully choose what it wants to inline. Inlining the wrong functions wastes this budget and compilation time while yielding little optimization benefit, especially if big or rarely called functions are inlined. The candidates are chosen using a heuristic that takes the invocation count, the size of the function, and other information into account.
The inlining heuristic runs as part of this phase, and whenever it sees a call or construct site where it can somehow infer the target (i.e. by looking at the feedback that native context specialization introduces into the graph), it remembers this call/construct site as a potential candidate for inlining. Once this process reaches a fixpoint, the inlining heuristic goes through the list of candidates looking for a function to inline, and if it finds one (and there is still inlining budget left), it inlines it by expanding the overall graph and restarts the fixpoint iteration with the newly added subgraph. This terminates once the inlining budget is exhausted or there are no more potential candidates left for inlining.
In the below example, the add() function is inlined into the three() function. This alone already replaces a function call with a simple add operation, which is a big improvement. More importantly, though, it enables later optimizations such as constant folding to realize that the entire set of functions in this example can be reduced to a simple 'return 3'. Other optimizations such as strength reduction, redundancy elimination, and escape analysis also strongly benefit from inlining, which is why this is the first phase in Turbofan, upon which all other optimizations build.
function add(x, y) {
  return x + y;
}

function three() {
  return add(1, 2);
}
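Conceptually, the result of inlining and the subsequent constant folding looks something like this (illustrative only; the compiler performs these transformations on the graph, not on source code):

// After inlining add() into three():
function three_inlined() {
  return 1 + 2;
}

// After constant folding on top of that:
function three_folded() {
  return 3;
}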
We have finally gotten to the function that takes care of actually starting the optimization passes. It is defined in src/compiler/pipeline.cc, and a simplified version of it is shown below. It calls various optimization phases, each of which calls about 10-20 specific optimization functions. Using the Turbolizer tool, you can look at the SON graph and the changes made to it throughout all of these optimization phases. The specific optimization functions include optimizations such as constant folding, dead code elimination, escape analysis, and much more; we will cover these in more depth later.
Each of these individual optimization passes traverses the graph through a set of functions defined in src/compiler/graph-reducer.cc. In this file, ReduceNode() and ReduceTop() traverse the graph: nodes are queued up onto a stack and iterated through, with Reduce() applying the optimization to each node. If the optimization makes any changes during a pass through the graph, all nodes are queued up again and the process repeats. This is done until a full pass results in no further changes, at which point the function returns.
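A heavily simplified sketch of this fixpoint iteration could look as follows (written in JavaScript for readability; the real C++ implementation uses the revisit stack described above rather than naive full passes over all nodes):

// Repeatedly apply all reducers to all nodes until a full pass over the
// graph produces no more changes (the fixpoint).
function reduceGraph(nodes, reducers) {
  let changed = true;
  while (changed) {
    changed = false;
    for (const node of nodes) {
      for (const reducer of reducers) {
        // A reducer may replace a node with a simpler equivalent one.
        if (reducer.reduce(node)) {
          changed = true;
        }
      }
    }
  }
}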
bool PipelineImpl::OptimizeGraph(Linkage* linkage) {
  PipelineData* data = this->data_;

  data->BeginPhaseKind("V8.TFLowering");

  Run<EarlyGraphTrimmingPhase>();
  Run<TyperPhase>(data->CreateTyper());
  Run<TypedLoweringPhase>();

  if (data->info()->loop_peeling()) {
    Run<LoopPeelingPhase>();
  } else {
    Run<LoopExitEliminationPhase>();
  }

  if (FLAG_turbo_load_elimination) {
    Run<LoadEliminationPhase>();
  }

  data->DeleteTyper();

  if (FLAG_assert_types) {
    Run<TypeAssertionsPhase>();
  }

  Run<SimplifiedLoweringPhase>(linkage);
  Run<GenericLoweringPhase>();

  data->BeginPhaseKind("V8.TFBlockBuilding");
  data->InitializeFrameData(linkage->GetIncomingDescriptor());

  Run<EarlyOptimizationPhase>();
  Run<EffectControlLinearizationPhase>();

  if (FLAG_turbo_store_elimination) {
    Run<StoreStoreEliminationPhase>();
  }

  if (FLAG_turbo_cf_optimization) {
    Run<ControlFlowOptimizationPhase>();
  }

  Run<LateOptimizationPhase>();
  Run<MemoryOptimizationPhase>();
  Run<MachineOperatorOptimizationPhase>();
  Run<DecompressionOptimizationPhase>();

  ComputeScheduledGraph();
  return SelectInstructions(linkage);
}
The typer phase is one of the first stages initiated by the OptimizeGraph() function. It analyzes the possible types and ranges of operations. In JavaScript, all basic operators are overloaded for many different types (+, for example, works on both numbers and strings), so type analysis is very important for optimization.
The range analysis assigns each node a range describing the values it can take. If a value is a NumberConstant n, the range of the node is displayed as Range(n, n). Applying an operation to it, such as multiplying it by 2, would then result in the updated range Range(2n, 2n).
The type analysis gives every value a specific type. These types can be Number, Boolean, or even something like NonInternal if nothing about the type can be determined.
Below you can see the SON representation of the shown function. For the constants, the typer was able to determine sensible ranges; similarly, it was able to type the conditional expression as a Boolean. The passed-in parameter is typed as NonInternal, since it is impossible to know what the function will be called with. The range of the addition operation is set to [-2^53, 2^53], since the result can take on any possible value. The "Hello" string is simply left as a HeapConstant.
function typer(x) {
  x = x + 1;
  let y;
  if (x > 3) {
    y = 5;
  } else {
    y = "Hello";
  }
  return y;
}
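To make the mapping concrete, here is the same function annotated with the types and ranges from the graph (annotations paraphrased from the Turbolizer output above):

function typer(x) { // x: NonInternal (nothing is known about it)
  x = x + 1;        // SpeculativeSafeIntegerAdd: Range(-2^53, 2^53)
  let y;
  if (x > 3) {      // NumberConstant 3: Range(3, 3); x > 3: Boolean
    y = 5;          // NumberConstant 5: Range(5, 5)
  } else {
    y = "Hello";    // left as a HeapConstant
  }
  return y;
}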
Speculative optimization makes use of the information gathered by Ignition's inline caches. It assumes that future function arguments will be of the same types as previous ones and performs optimizations based on this collected feedback.
Assuming the below function is called with x=smi and y=smi (small integer) every time for several iterations, Turbofan might optimize it for arguments of type smi. In this case, checks are put in place that make sure that x and y are both smis. If such a check fails, the deoptimizer is called and control is transferred back to the interpreter. This way, Turbofan never has to worry about the other types that might be used in this function; it only needs to check for smi vs. not smi.
function add(x, y) {
  return x + y;
}
The feedback collected by Ignition is stored in the feedback vector, a data structure that is linked from the closure and contains slots that provide different kinds of feedback (e.g. bitsets, closures, maps). In the above add function example, the feedback vector has one interesting slot: BinaryOp, which records feedback about the inputs and outputs that have been seen so far. The feedback vector can be manually inspected using %DebugPrint(). To ensure that its speculations hold true, the compiler needs to insert a number of speculation guards: lightweight runtime checks that ensure the data is in the expected form.
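For example, the feedback for the add function above can be inspected like this (run with d8 --allow-natives-syntax; the exact output format varies between V8 versions):

function add(x, y) {
  return x + y;
}

add(1, 2);        // the BinaryOp slot now records feedback for small integers
add(3, 4);

%DebugPrint(add); // dumps the closure, including its feedback vector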
Earlier we mentioned that the jitted code may have to be deoptimized in certain scenarios when the speculated types no longer hold true. This is simply supposed to exit the JIT and transfer control back to Ignition, but that is not as easy in practice. We would like to just restart the entire function in the interpreter, but what if code that already executed in the JIT had side effects (such as modifications to a global variable)? Turbofan makes use of On-Stack Replacement for this: for any given bytecode instruction, it creates a mapping between JIT locations/registers and the Ignition memory state, which allows transitions from the JIT back to Ignition. The --trace-deopt compiler flag can be used to trace deoptimizations, similarly to optimizations.
Below you can once again see a small code snippet alongside its corresponding SON graph after speculative optimization has been applied. After the function was optimized using feedback from calls like add_on_cond(7, 3, 10);, speculative optimization inferred that all 3 arguments will be Int32s. It makes this assumption and then uses CheckedTaggedSignedToInt32() on all 3 arguments to confirm it. If the assumptions hold true, execution continues as expected; if they do not, control is transferred to Deoptimize(), which returns control back to Ignition. This may not look like a huge optimization here, but not having to perform checks for 20 different types due to JavaScript's weakly typed nature leads to huge performance increases.
function add_on_cond(a, b, c) {
  var x = a + 2;
  if (x < c) {
    return a + b;
  } else if (x > c) {
    return a - b;
  } else {
    return 5;
  }
}
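Once the function has been optimized under this speculation, calling it with arguments that match the feedback stays on the fast path, while anything else fails a guard and deoptimizes. For example (assuming the function was optimized with the Int32 feedback described above):

add_on_cond(7, 3, 10);   // matches the recorded feedback, stays in the JIT
add_on_cond(7.5, 3, 10); // not an Int32: the guard fails and Deoptimize()
                         // hands control back to Ignition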
Constant Folding is an optimization technique that evaluates expressions at compile time and replaces them with their results. These are typically expressions that contain only hardcoded/constant values.
Branch Folding is very similar to constant folding, except that instead of folding constants into a single value, it eliminates entire branches whose outcome can be determined at compile time.
Strength Reduction takes care of simplifying operations. x * 0, for example, might be transformed into 0 by this optimization.
Below you can see a small example that uses these optimizations. The let a = 1 + 1; is optimized to the NumberConstant 2 via constant folding. The branch is entirely removed by Branch Folding, since if (true) will always evaluate to the true part. I would have expected Strength Reduction to optimize the x + 0 to just x, however, this did not happen, likely because nothing is known about the type of x: for a string or for -0, x + 0 is not equivalent to x.
function fold(x) {
  let a = 1 + 1;
  let b = x + 0;
  if (true) {
    return b * a;
  } else {
    return 1;
  }
}
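Conceptually, after constant folding and branch folding, the function behaves as if it had been written like this (illustrative only, not actual compiler output):

function fold_optimized(x) {
  let b = x + 0; // not reduced to x, see the note above
  return b * 2;  // 1 + 1 was folded into the NumberConstant 2, and the
                 // if (true) branch was folded away entirely
}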
The goal of Escape Analysis is to avoid allocating local objects on the heap. It does this by first attempting to prove that the lifetime of an object is restricted to the current function. If it succeeds, Scalar Replacement comes in: the object is broken up into its individual fields (scalars), which can then be kept in CPU registers instead of memory, thus reducing memory overhead.
The main conditions for Escape Analysis are that the allocation is done within the function and that all property accesses happen without a reference to the object ever being passed around. It also needs to make sure that the object is not returned from the function or referenced from an escaping object.
Inlining makes Escape Analysis much more powerful, since functions suddenly have a much larger reach, making it more likely that a variable never escapes them.
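Here is a small (hypothetical) example where all of these conditions hold: o is allocated inside the function, only its properties are read, and no reference to it ever leaves the function, so escape analysis can apply scalar replacement and avoid the heap allocation entirely.

function len_sq(x, y) {
  const o = { x: x, y: y };     // allocation is local to this function
  return o.x * o.x + o.y * o.y; // only property reads; o never escapes
}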
Redundancy elimination focuses on removing safety checks from the emitted machine code if they are deemed unnecessary. This can obviously be very dangerous: removing the wrong check in this optimization pass can very easily result in type confusions or out-of-bounds memory accesses.
In the below example, the second pair of CheckHeapObject and CheckMap nodes was optimized out, since there is no way for the map to change between the map check for o.a and the one for o.b. The map is instead checked just once before the add operation is performed. The compiler recognizes this by tracking the potential side effects of operations and determining whether any of them could change the object; if not, it may eliminate certain checks. Since JavaScript is a pretty complicated language, correctly recognizing all side effects is a very difficult task, and this has led to many bugs in the past.
function redundancy_elim(o) {
  return o.a + o.b;
}
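Conceptually, the optimized version performs only a single map check that covers both property loads (illustrative pseudocode written as comments, not actual compiler output):

// CheckHeapObject(o)
// CheckMap(o)        // one map check now guards both loads below
// t1 = Load(o.a)
// t2 = Load(o.b)
// return t1 + t2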
While the SON graph does keep track of control flow, it still permits many legal orderings of the code, both in terms of the order of code blocks and the order of the individual operations inside them. The scheduling phase is one of the final phases before code generation; it attempts to determine the most efficient order and placement and generates a more traditional CFG from the SON graph.
Below you can see the machine code produced by Turbofan for the above example (x86-64). As you can see at :53, the map comparison still exists to make sure that the map remains as expected. Afterwards, it either calls the deoptimize function if the comparison failed, or continues on until it eventually performs the addition at :85.
There are still many more optimizations performed by V8, but I believe this blog post is already long enough. You should now have a good understanding of the structure of Turbofan's optimization pipeline and some of the main optimizations it performs. At this point, I would encourage readers to start looking at V8's codebase to become familiar with how some of the optimizations are implemented. The Turbolizer tool will also be very helpful when trying to properly understand some of them.
At this point, we have covered all of the major building blocks of V8. In the next and final part of this series, we will finally cover some exploitation topics.