CacheIR: A new approach to Inline Caching in Firefox

25 Jan 2017 Mozilla

The past months we have been working on CacheIR, an overhaul of the IC code in SpiderMonkey's JITs. CacheIR has allowed us to remove thousands of lines of boilerplate and code duplication. It has also made it much easier to add new optimizations. This post describes how CacheIR works, why it's much better than what we had before, and some plans we have for the future.

IC stubs

Both the Baseline JIT and IonMonkey optimizing JIT can use Inline Cache stubs to optimize things like property accesses. Each IC has a linked list of stubs. When we want to optimize a certain operation, like object.foo, the IC adds a new stub for this particular case to the list. Stubs usually have one or more guards and if a guard fails we jump to the next stub in the list.

CacheIR

CacheIR is a very simple bytecode that makes it easy to generate new IC stubs to optimize particular cases. Consider the following JS code:

var point = {x: 1, y: 2};
var x = point.x;

The GetProp IC we use for the point.x lookup uses the GetPropIRGenerator to emit CacheIR for this lookup. The IR we emit for it looks like this:

GuardIsObject Op0
GuardShape Op0 Field0
LoadFixedSlotResult Op0 Field1

As you can see, CacheIR is a very simple, linear IR. It has no explicit branches, no loops, just guards. This makes it easy to compile CacheIR to machine code.

Here Op0 is the IC's input operand (point in this case). We first guard it's an object, then we guard on its Shape (an object's Shape determines its layout and properties, other engines may use the terms 'map' or 'hidden class'), and then we load a value from the object's fixed slot into the IC's output register.

In addition to the IR, the IR emitter also generates a list of stub fields. Field0 and Field1 above refer to fields in the stub data, in this case we have the following fields:

Field 0: Shape* pointer: 0xeb39f900
Field 1: offset: 16

The IR itself does not contain any pointers or slot offsets, we will see below why this matters.

A GetProp IC has a single input value, but other ICs like GetElem (used for x[y] in JS) can have multiple inputs:

var prop = "x";
var x = point[prop];

Now the IR generator will insert extra guards to check prop is the string "x":

GuardIsObject Op0
GuardIsString Op1              <---
GuardSpecificAtom Op1 Field0   <---
GuardShape Op0 Field1
LoadFixedSlotResult Op0 Field2

We have two extra guards now (GuardIsString and GuardSpecificAtom) and an extra stub field that stores the atom we're guarding on ("x" in this case), but other than that the IR looks the same as before. The same trick can be used in a lot of other cases: for example, if point would be a proxy/wrapper to our Point object, all we need to do is emit some CacheIR instructions to unwrap it and then we can emit the same IR as before.

We emit the same CacheIR for our Baseline and Ion ICs (but Ion ICs may NOP some type guards, for instance if we know the input is always an object). This is a big improvement over what we had before: our Baseline and Ion ICs used to share some code but not much, because they're quite different. This meant there was a lot of code duplication and many cases we optimized in Ion but not (or differently) in Baseline. CacheIR fixes all this.

Instead of having register allocation, code generation, and high level decisions (which guards to emit) in the same code, we now have a very clear separation of concerns: when we emit the IR, we don't have to worry about registers, stack slots, or boxing formats. When we compile the IR to machine code, we can simply compile one instruction at a time and don't have to worry about what to emit next. This separation has made the code much more readable and maintainable.

Our Baseline JIT can use the same machine code for many different stubs, because the IC inputs are always boxed Values stored in fixed registers and the IC stub reads things like Shape* pointers and slot offsets from the stub data, they are not baked into the code. Before CacheIR, each Baseline stub had to define an int32 key for this purpose and stubs with the same key could share code. This worked well, but it was tedious to maintain and easy to screw up when making changes to IC code. With CacheIR, we simply use the IR as key: stubs with the same IR can share code. This means the code sharing works automatically and we no longer have to think about it. It eliminates a lot of boilerplate and a whole class of potential bugs.

Removing boilerplate has been a recurring theme, we have seen the same thing with GC tracing for instance. Each Baseline stub that contained a Shape guard had to trace this Shape for GC purposes. Now there's a single place where we trace Shapes stored in stubs. Less boilerplate and much harder to get wrong.

Compiling CacheIR to machine code

After emitting the IR, we can compile it to native code. The code generation for most CacheIR instructions is shared between Baseline and Ion. As mentioned above, Baseline stub code is shared, so some CacheIR instructions are compiled differently for Baseline and Ion: Ion can bake in values directly into the code, whereas the Baseline code will read them from the stub data.

The CacheIR compiler uses a very simple register allocator to track where each operand lives (register, stack slot, etc) and which registers are available. When we compile an instruction, the first thing we do is request registers for the operands we need, and the register allocator takes care of spilling and restoring values as needed.

Furthermore, the register allocator allows us to generate failure paths automatically. A failure path is the code we emit when a guard fails: it restores the input registers and jumps to the next stub. This works because the register allocator knows the register state at the start of the stub and the current register state.

This makes it much easier to write IC code: you no longer have to remember where all values are stored, which registers are available, and which ones have to be restored. It eliminates another class of subtle bugs.

Improvements

Tom Schuster (evilpie) has been working on a logging mechamism to find cases our ICs fail to optimize currently. He already fixed numerous performance issues that came up on websites like Facebook and Google Docs. We also had some performance issues on file that used to be hard to fix, but were much easier to optimize with CacheIR. In some cases, fixing an issue found on real-world websites turned out to (unexpectedly) improve benchmarks as well :)

Because the CacheIR we emit is the same for Baseline and Ion, we no longer have cases where Ion has IC support for something but Baseline doesn't. Whenever we add a new optimization to our IC code, it works in both Baseline and Ion. As a result, especially our Baseline JIT now optimizes a lot more things than before.

CacheIR stubs require much less boilerplate and code duplication. The IR instructions are our IC stub building blocks: many instructions can be reused for other things. Every time we converted a stub to CacheIR we saw big improvements in code size and maintainability. This has been incredibly satisfying.

Future plans

We are still working on converting the remaining Baseline and Ion IC stubs to CacheIR, and we will continue to use our new CacheIR tools to speed up a lot of other operations.

Our Ion optimizing JIT is currently able to pattern match the CacheIR we emitted for Baseline and uses it optimize certain cases. Once we are done converting IC stubs to CacheIR, the next step is to add a generic CacheIR to MIR compiler. This will bring some significant performance wins and will let us remove even more code: whenever we optimize a new case in the CacheIR emitter, we will not only generate Baseline IC code and Ion IC code, but also Ion inline paths from it.

CacheIR will also make it much easier to optimize the ES2015 super property accesses used in derived classes. super.x involves 2 objects: the receiver and the super object. With CacheIR, we can simply add a new input operand to our GetProp IR generator and the IR compiler will do the right thing.

Conclusion

We've seen how CacheIR is used in SpiderMonkey to simplify/improve IC code generation and remove code duplication and boilerplate. Firefox Nightly builds now optimize more cases than ever, and we are working hard to improve our IC coverage even more. We will also see some serious performance wins from compiling CacheIR to MIR.

Next: W^X JIT-code enabled in Firefox
Previous: Some SpiderMonkey optimizations in Firefox Quantum