summaryrefslogtreecommitdiff
path: root/html/notes/fastcons.md
diff options
context:
space:
mode:
Diffstat (limited to 'html/notes/fastcons.md')
-rw-r--r--html/notes/fastcons.md206
1 files changed, 0 insertions, 206 deletions
diff --git a/html/notes/fastcons.md b/html/notes/fastcons.md
deleted file mode 100644
index 3dd4c3a..0000000
--- a/html/notes/fastcons.md
+++ /dev/null
@@ -1,206 +0,0 @@
-# Cons cell optimization?
-
-Assume that your Lisp implementation uses tagged pointers (possibly
-with NaN packing; doesn't matter), and there's enough information on
-pointers to tell you that the destination is the contents of a pair,
-aka cons cell.
-
-For example, Zisp uses NaN packing, and all values are represented as
-a `Value` union, which is either a double, or a special kind of NaN
-value that may, among other possibilities, contain a pointer with a
-3-bit type tag within it. One of the 8 possible types is a pair.
-
-You would implement the actual contents of a pair as `[2]Value` (in
-Zig notation), i.e., an array of two consecutive `Value` objects on
-the heap; you already know from the type tag on the pointer itself
-that the destination is pair contents, so you don't need any further
-information on the heap indicating that those two consecutive `Value`
-objects represent the car and cdr of a cons cell. They are "naked"
-within the heap.
-
-This seems optimal; you have zero meta-data on the heap itself. But
-pairs are ubiquitously used to implement linked lists in Lisp, and
-that includes a lot of inherent overhead: every single element must
-keep a pointer to the next element. So, if you have a list of five
-elements, you have five `[2]Value` on the heap, i.e., ten values!
-
-Can't we improve on that, somehow?
-
-In Zisp's NaN-packing strategy, there's an entire unused bit on heap
-pointers. (There's also a whole other set of possible pointer values
-with a 50-bit payload, so really, we have a *ton* of unused pointer
-space.)
-
-What if that bit indicated that the pointer, well, let's say "has to
-do" with pairs (I'll explain). The three type tag bits are now free,
-since the bit being *unset* means the pointer is for one of the other
-types. So, we can encode eight "sub-types" of this pointer.
-
-The actual pointer would go to a `[8]Value` instead of `[2]Value`,
-with the "type" tag being something akin to an index. (It's not an
-actual index; read on.)
-
-When you first use cons, say on a symbol `x` and nil, to get a list of
-one element, it uses two of the eight slots:
-
- [ _ _ _ _ _ _ x () ] (tag = 0)
-
-(Let's call the operation that does this, i.e., allocates an array of
-eight and puts the arguments into positions 6 and 7, `alloc-cons`.
-We will use this term later.)
-
-The pointer points to the start of the array, the "type tag" is zero,
-and car and cdr are implemented as:
-
- (car p) = p[6 - tag]
-
- (cdr p) = p[7 - tag] ??? (read on)
-
-When you call cons again but the cdr is a pointer such as the above,
-then instead of allocating a new array, it puts the car into it and
-returns a modified pointer; say we did `(cons y pair)`:
-
- [ _ _ _ _ _ y x () ] (tag = 1)
-
-(The operation that does this, i.e., puts the first argument to cons
-into ptr[5 - tag] and returns ptr with tag + 1, is `shift-cons`.)
-
-The tag on the pointer now says 1. You may notice that if cdr were
-truly implemented as the above, it wouldn't actually give us the cdr
-of the list any more; it would give us `(car (cdr list))` so we need
-to make a change:
-
- (cdr p) = if tag = 0: p[7]; else: p with tag - 1
-
-We introduced a branch instruction, which may come back to bite us,
-but our car and cdr are now functional again.
-
-OK, but what if we run out of space in the array? Cons also needs a
-branch, or multiple if we count the one already implied above:
-
- (cons a b) = if b is pair and b.tag < 6:
- (shift-cons a b)
- else if b is pair:
- (alloc-cons a b)
- else:
- (alloc-cons a b)
-
-Oh, would you look at that: the last two branches are identical, so
-actually just one branch instruction is enough after all:
-
- if b is pair and b.tag < 6:
- (shift-cons a b)
- else:
- (alloc-cons a b)
-
-I *think* this pseudo-code is complete now? If cons, car, and cdr
-were to be implemented this way, I think we get an API that's fully
-compatible with traditional cons/car/cdr and the optimization is
-completely transparent.
-
-Note that we haven't used tag value 7. Maybe that's fine, or maybe
-it's possible to exploit that somehow; let's ignore it for now.
-
-(We could simply use `[9]Value` and then all 8 tag values would be
-used, but that's probably a bad idea due to cache line shenanigans;
-more on that later.)
-
-
-## Comparison
-
-Let's compare the in-memory representation of traditional pairs and
-our optimized "pairs" when both are used to implement a linked list
-with ten elements:
-
- ;; traditional
- list = [ e0 cdr1 ]
- cdr1 = [ e1 cdr2 ]
- cdr2 = [ e2 cdr3 ]
- cdr3 = [ e3 cdr4 ]
- cdr4 = [ e4 cdr5 ]
- cdr5 = [ e5 cdr6 ]
- cdr6 = [ e6 cdr7 ]
- cdr7 = [ e7 cdr8 ]
- cdr8 = [ e8 cdr9 ]
- cdr9 = [ e9 () ]
-
- ;; optimized
- list = [ _ _ _ _ e0 e1 e2 cdr1 ] (tag = 2)
- cdr1 = [ e3 e4 e5 e6 e7 e8 e9 () ] (tag = 6)
-
-Let's look at the pros and cons.
-
-<style>
-td:first-child { font-weight: bold; }
-td:not(:first-child) { font-family: mono; }
-</style>
-
-| | Traditional | Optimized |
-|----------|--------------------------|------------------------------|
-| cons | alloc, set car & cdr | see above (branches) |
-| car | ptr[0] | ptr[6 - tag] |
-| cdr | ptr[1] | see above (branches) |
-| Memory¹ | n * 2 | (floor((n - 1) / 7) + 1) * 8 |
-| Locality | Bad | Better (?) |
-| Derefs² | n | n/7 (I think) |
-
-¹ Counted in `Value`-sized slots on the heap.
-
-² Pointer dereferences when looping through the whole list.
-
-I haven't thought too hard about some of the numbers, but it's not
-terribly important; this is to get a general idea.
-
-Traditional cons always allocates; ours has a branch instruction but
-this allows it to elide an allocation most of the time; I think it's
-every 6 out of 7 cons calls that *don't* require allocation, when
-building up a linked list.
-
-The implementation of car became ever so slightly slower, requiring a
-small arithmetic operation. Cdr became even more complicated, with a
-branch instruction, but most of the time it can actually elide the
-pointer dereference; I think this time it's every 7 out of 8 calls.
-
-The strange formula for memory use is because we alloc in chunks of 8,
-but use one slot for a final cdr. Anyhow, in the worst case, we will
-be using 8 rather than 2 slots for a single-list element; the break
-even point is 4 elements where both implementations use 8 slots.
-(Then there's the 8-element case where both use 16 slots, but after
-that our implementation always wins.)
-
-The fewer pointer derefs should help a bit, although it's likely that
-the pairs making up a traditional linked list were all allocated in
-close proximity, and CPUs nowadays fetch more memory from main RAM
-into a CPU cache than you ask for, if you've asked for less than the
-cache line size. The most common size nowadays is 64 bytes, which is
-exactly `[8]Value`. This means that, using traditional cons cells,
-you'd be fetching 4 of them at once every time you deref a pointer,
-assuming they were all allocated next to each other.
-
-Memory locality is better on paper, but only if you don't think too
-hard about it. We will always have seven elements in one contiguous
-chunk, whereas the traditional implementation could suffer from a lot
-of fragmentation: Each pair could end up in a different cache line
-sized segment on the heap, meaning every element requires a main
-memory fetch. But that's unlikely; they are much more likely to be
-adjacent on the heap. Further, what if we have tons of tiny lists,
-with one to four elements per list? Now our strategy forces every
-list to be in its own cache line sized segment, whereas some of the
-lists using the traditional implementation could end up in a single
-64-byte segment and thus be fetched together. A cursory glance at
-some Scheme code of mine indicates that there's indeed tons of lists
-with fewer than 5 elements!
-
-
-## And that was just linked lists
-
-Pairs are not only used for linked lists. They could be the slots of
-an alist, they could be nodes in a tree data structure, and so on.
-
-Forcing every cons cell to allocate 64 bytes could thrash locality in
-these cases, and actually increase the total memory use as well, as
-most of the `Value`-sized slots may end up being wasted.
-
-I won't implement the strategy mentioned above yet, but if I do it
-eventually, it will be very interesting to see its performance in a
-variety of benchmarks, reflecting common usage patterns.