diff options
Diffstat (limited to 'html/notes/fastcons.md')
| -rw-r--r-- | html/notes/fastcons.md | 206 |
1 files changed, 0 insertions, 206 deletions
diff --git a/html/notes/fastcons.md b/html/notes/fastcons.md deleted file mode 100644 index 3dd4c3a..0000000 --- a/html/notes/fastcons.md +++ /dev/null @@ -1,206 +0,0 @@ -# Cons cell optimization? - -Assume that your Lisp implementation uses tagged pointers (possibly -with NaN packing; doesn't matter), and there's enough information on -pointers to tell you that the destination is the contents of a pair, -aka cons cell. - -For example, Zisp uses NaN packing, and all values are represented as -a `Value` union, which is either a double, or a special kind of NaN -value that may, among other possibilities, contain a pointer with a -3-bit type tag within it. One of the 8 possible types is a pair. - -You would implement the actual contents of a pair as `[2]Value` (in -Zig notation), i.e., an array of two consecutive `Value` objects on -the heap; you already know from the type tag on the pointer itself -that the destination is pair contents, so you don't need any further -information on the heap indicating that those two consecutive `Value` -objects represent the car and cdr of a cons cell. They are "naked" -within the heap. - -This seems optimal; you have zero meta-data on the heap itself. But -pairs are ubiquitously used to implement linked lists in Lisp, and -that includes a lot of inherent overhead: every single element must -keep a pointer to the next element. So, if you have a list of five -elements, you have five `[2]Value` on the heap, i.e., ten values! - -Can't we improve on that, somehow? - -In Zisp's NaN-packing strategy, there's an entire unused bit on heap -pointers. (There's also a whole other set of possible pointer values -with a 50-bit payload, so really, we have a *ton* of unused pointer -space.) - -What if that bit indicated that the pointer, well, let's say "has to -do" with pairs (I'll explain). The three type tag bits are now free, -since the bit being *unset* means the pointer is for one of the other -types. So, we can encode eight "sub-types" of this pointer. - -The actual pointer would go to a `[8]Value` instead of `[2]Value`, -with the "type" tag being something akin to an index. (It's not an -actual index; read on.) - -When you first use cons, say on a symbol `x` and nil, to get a list of -one element, it uses two of the eight slots: - - [ _ _ _ _ _ _ x () ] (tag = 0) - -(Let's call the operation that does this, i.e., allocates an array of -eight and puts the arguments into positions 6 and 7, `alloc-cons`. -We will use this term later.) - -The pointer points to the start of the array, the "type tag" is zero, -and car and cdr are implemented as: - - (car p) = p[6 - tag] - - (cdr p) = p[7 - tag] ??? (read on) - -When you call cons again but the cdr is a pointer such as the above, -then instead of allocating a new array, it puts the car into it and -returns a modified pointer; say we did `(cons y pair)`: - - [ _ _ _ _ _ y x () ] (tag = 1) - -(The operation that does this, i.e., puts the first argument to cons -into ptr[5 - tag] and returns ptr with tag + 1, is `shift-cons`.) - -The tag on the pointer now says 1. You may notice that if cdr were -truly implemented as the above, it wouldn't actually give us the cdr -of the list any more; it would give us `(car (cdr list))` so we need -to make a change: - - (cdr p) = if tag = 0: p[7]; else: p with tag - 1 - -We introduced a branch instruction, which may come back to bite us, -but our car and cdr are now functional again. - -OK, but what if we run out of space in the array? Cons also needs a -branch, or multiple if we count the one already implied above: - - (cons a b) = if b is pair and b.tag < 6: - (shift-cons a b) - else if b is pair: - (alloc-cons a b) - else: - (alloc-cons a b) - -Oh, would you look at that: the last two branches are identical, so -actually just one branch instruction is enough after all: - - if b is pair and b.tag < 6: - (shift-cons a b) - else: - (alloc-cons a b) - -I *think* this pseudo-code is complete now? If cons, car, and cdr -were to be implemented this way, I think we get an API that's fully -compatible with traditional cons/car/cdr and the optimization is -completely transparent. - -Note that we haven't used tag value 7. Maybe that's fine, or maybe -it's possible to exploit that somehow; let's ignore it for now. - -(We could simply use `[9]Value` and then all 8 tag values would be -used, but that's probably a bad idea due to cache line shenanigans; -more on that later.) - - -## Comparison - -Let's compare the in-memory representation of traditional pairs and -our optimized "pairs" when both are used to implement a linked list -with ten elements: - - ;; traditional - list = [ e0 cdr1 ] - cdr1 = [ e1 cdr2 ] - cdr2 = [ e2 cdr3 ] - cdr3 = [ e3 cdr4 ] - cdr4 = [ e4 cdr5 ] - cdr5 = [ e5 cdr6 ] - cdr6 = [ e6 cdr7 ] - cdr7 = [ e7 cdr8 ] - cdr8 = [ e8 cdr9 ] - cdr9 = [ e9 () ] - - ;; optimized - list = [ _ _ _ _ e0 e1 e2 cdr1 ] (tag = 2) - cdr1 = [ e3 e4 e5 e6 e7 e8 e9 () ] (tag = 6) - -Let's look at the pros and cons. - -<style> -td:first-child { font-weight: bold; } -td:not(:first-child) { font-family: mono; } -</style> - -| | Traditional | Optimized | -|----------|--------------------------|------------------------------| -| cons | alloc, set car & cdr | see above (branches) | -| car | ptr[0] | ptr[6 - tag] | -| cdr | ptr[1] | see above (branches) | -| Memory¹ | n * 2 | (floor((n - 1) / 7) + 1) * 8 | -| Locality | Bad | Better (?) | -| Derefs² | n | n/7 (I think) | - -¹ Counted in `Value`-sized slots on the heap. - -² Pointer dereferences when looping through the whole list. - -I haven't thought too hard about some of the numbers, but it's not -terribly important; this is to get a general idea. - -Traditional cons always allocates; ours has a branch instruction but -this allows it to elide an allocation most of the time; I think it's -every 6 out of 7 cons calls that *don't* require allocation, when -building up a linked list. - -The implementation of car became ever so slightly slower, requiring a -small arithmetic operation. Cdr became even more complicated, with a -branch instruction, but most of the time it can actually elide the -pointer dereference; I think this time it's every 7 out of 8 calls. - -The strange formula for memory use is because we alloc in chunks of 8, -but use one slot for a final cdr. Anyhow, in the worst case, we will -be using 8 rather than 2 slots for a single-list element; the break -even point is 4 elements where both implementations use 8 slots. -(Then there's the 8-element case where both use 16 slots, but after -that our implementation always wins.) - -The fewer pointer derefs should help a bit, although it's likely that -the pairs making up a traditional linked list were all allocated in -close proximity, and CPUs nowadays fetch more memory from main RAM -into a CPU cache than you ask for, if you've asked for less than the -cache line size. The most common size nowadays is 64 bytes, which is -exactly `[8]Value`. This means that, using traditional cons cells, -you'd be fetching 4 of them at once every time you deref a pointer, -assuming they were all allocated next to each other. - -Memory locality is better on paper, but only if you don't think too -hard about it. We will always have seven elements in one contiguous -chunk, whereas the traditional implementation could suffer from a lot -of fragmentation: Each pair could end up in a different cache line -sized segment on the heap, meaning every element requires a main -memory fetch. But that's unlikely; they are much more likely to be -adjacent on the heap. Further, what if we have tons of tiny lists, -with one to four elements per list? Now our strategy forces every -list to be in its own cache line sized segment, whereas some of the -lists using the traditional implementation could end up in a single -64-byte segment and thus be fetched together. A cursory glance at -some Scheme code of mine indicates that there's indeed tons of lists -with fewer than 5 elements! - - -## And that was just linked lists - -Pairs are not only used for linked lists. They could be the slots of -an alist, they could be nodes in a tree data structure, and so on. - -Forcing every cons cell to allocate 64 bytes could thrash locality in -these cases, and actually increase the total memory use as well, as -most of the `Value`-sized slots may end up being wasted. - -I won't implement the strategy mentioned above yet, but if I do it -eventually, it will be very interesting to see its performance in a -variety of benchmarks, reflecting common usage patterns. |
