diff options
| author | Taylan Kammer <taylan.kammer@gmail.com> | 2025-03-17 11:55:55 +0100 |
|---|---|---|
| committer | Taylan Kammer <taylan.kammer@gmail.com> | 2025-03-17 11:55:55 +0100 |
| commit | bff3e84e7b4083285d5a0a871663db57430401b6 (patch) | |
| tree | 88d69c365def136232aa1b4349b6a1755e51af97 | |
| parent | fbcedd74510b02a8a56acfbb3fb115394fcbdfd8 (diff) | |
Add fastcons.md
| -rwxr-xr-x | html/gen.sh | 2 | ||||
| -rw-r--r-- | html/index.md | 1 | ||||
| -rw-r--r-- | html/notes/fastcons.md | 206 | ||||
| -rw-r--r-- | html/style.css | 13 |
4 files changed, 220 insertions, 2 deletions
diff --git a/html/gen.sh b/html/gen.sh index d14286e..3ed9482 100755 --- a/html/gen.sh +++ b/html/gen.sh @@ -19,7 +19,7 @@ do title=$(sed '/^# / { s/# //; q }' "$file") sed "s/__TITLE__/$title/" prelude.html echo "<body>" - markdown2 "$file" -x fenced-code-blocks,highlightjs-lang + markdown2 "$file" -x fenced-code-blocks,highlightjs-lang,tables echo "</body>" } > "${file%.md}".html done diff --git a/html/index.md b/html/index.md index e7c5ff2..bf03784 100644 --- a/html/index.md +++ b/html/index.md @@ -33,3 +33,4 @@ because writing the code often gives you yet another perspective. * [Reader? Decoder? I barely know 'er!](notes/reader.html) * [Does the decoder implement macros?](notes/macros.html) * [Better syntax-rules?](notes/sr.html) +* [Cons cell optimization?](notes/fastcons.html) diff --git a/html/notes/fastcons.md b/html/notes/fastcons.md new file mode 100644 index 0000000..3dd4c3a --- /dev/null +++ b/html/notes/fastcons.md @@ -0,0 +1,206 @@ +# Cons cell optimization? + +Assume that your Lisp implementation uses tagged pointers (possibly +with NaN packing; doesn't matter), and there's enough information on +pointers to tell you that the destination is the contents of a pair, +aka cons cell. + +For example, Zisp uses NaN packing, and all values are represented as +a `Value` union, which is either a double, or a special kind of NaN +value that may, among other possibilities, contain a pointer with a +3-bit type tag within it. One of the 8 possible types is a pair. + +You would implement the actual contents of a pair as `[2]Value` (in +Zig notation), i.e., an array of two consecutive `Value` objects on +the heap; you already know from the type tag on the pointer itself +that the destination is pair contents, so you don't need any further +information on the heap indicating that those two consecutive `Value` +objects represent the car and cdr of a cons cell. They are "naked" +within the heap. + +This seems optimal; you have zero meta-data on the heap itself. But +pairs are ubiquitously used to implement linked lists in Lisp, and +that includes a lot of inherent overhead: every single element must +keep a pointer to the next element. So, if you have a list of five +elements, you have five `[2]Value` on the heap, i.e., ten values! + +Can't we improve on that, somehow? + +In Zisp's NaN-packing strategy, there's an entire unused bit on heap +pointers. (There's also a whole other set of possible pointer values +with a 50-bit payload, so really, we have a *ton* of unused pointer +space.) + +What if that bit indicated that the pointer, well, let's say "has to +do" with pairs (I'll explain). The three type tag bits are now free, +since the bit being *unset* means the pointer is for one of the other +types. So, we can encode eight "sub-types" of this pointer. + +The actual pointer would go to a `[8]Value` instead of `[2]Value`, +with the "type" tag being something akin to an index. (It's not an +actual index; read on.) + +When you first use cons, say on a symbol `x` and nil, to get a list of +one element, it uses two of the eight slots: + + [ _ _ _ _ _ _ x () ] (tag = 0) + +(Let's call the operation that does this, i.e., allocates an array of +eight and puts the arguments into positions 6 and 7, `alloc-cons`. +We will use this term later.) + +The pointer points to the start of the array, the "type tag" is zero, +and car and cdr are implemented as: + + (car p) = p[6 - tag] + + (cdr p) = p[7 - tag] ??? (read on) + +When you call cons again but the cdr is a pointer such as the above, +then instead of allocating a new array, it puts the car into it and +returns a modified pointer; say we did `(cons y pair)`: + + [ _ _ _ _ _ y x () ] (tag = 1) + +(The operation that does this, i.e., puts the first argument to cons +into ptr[5 - tag] and returns ptr with tag + 1, is `shift-cons`.) + +The tag on the pointer now says 1. You may notice that if cdr were +truly implemented as the above, it wouldn't actually give us the cdr +of the list any more; it would give us `(car (cdr list))` so we need +to make a change: + + (cdr p) = if tag = 0: p[7]; else: p with tag - 1 + +We introduced a branch instruction, which may come back to bite us, +but our car and cdr are now functional again. + +OK, but what if we run out of space in the array? Cons also needs a +branch, or multiple if we count the one already implied above: + + (cons a b) = if b is pair and b.tag < 6: + (shift-cons a b) + else if b is pair: + (alloc-cons a b) + else: + (alloc-cons a b) + +Oh, would you look at that: the last two branches are identical, so +actually just one branch instruction is enough after all: + + if b is pair and b.tag < 6: + (shift-cons a b) + else: + (alloc-cons a b) + +I *think* this pseudo-code is complete now? If cons, car, and cdr +were to be implemented this way, I think we get an API that's fully +compatible with traditional cons/car/cdr and the optimization is +completely transparent. + +Note that we haven't used tag value 7. Maybe that's fine, or maybe +it's possible to exploit that somehow; let's ignore it for now. + +(We could simply use `[9]Value` and then all 8 tag values would be +used, but that's probably a bad idea due to cache line shenanigans; +more on that later.) + + +## Comparison + +Let's compare the in-memory representation of traditional pairs and +our optimized "pairs" when both are used to implement a linked list +with ten elements: + + ;; traditional + list = [ e0 cdr1 ] + cdr1 = [ e1 cdr2 ] + cdr2 = [ e2 cdr3 ] + cdr3 = [ e3 cdr4 ] + cdr4 = [ e4 cdr5 ] + cdr5 = [ e5 cdr6 ] + cdr6 = [ e6 cdr7 ] + cdr7 = [ e7 cdr8 ] + cdr8 = [ e8 cdr9 ] + cdr9 = [ e9 () ] + + ;; optimized + list = [ _ _ _ _ e0 e1 e2 cdr1 ] (tag = 2) + cdr1 = [ e3 e4 e5 e6 e7 e8 e9 () ] (tag = 6) + +Let's look at the pros and cons. + +<style> +td:first-child { font-weight: bold; } +td:not(:first-child) { font-family: mono; } +</style> + +| | Traditional | Optimized | +|----------|--------------------------|------------------------------| +| cons | alloc, set car & cdr | see above (branches) | +| car | ptr[0] | ptr[6 - tag] | +| cdr | ptr[1] | see above (branches) | +| Memory¹ | n * 2 | (floor((n - 1) / 7) + 1) * 8 | +| Locality | Bad | Better (?) | +| Derefs² | n | n/7 (I think) | + +¹ Counted in `Value`-sized slots on the heap. + +² Pointer dereferences when looping through the whole list. + +I haven't thought too hard about some of the numbers, but it's not +terribly important; this is to get a general idea. + +Traditional cons always allocates; ours has a branch instruction but +this allows it to elide an allocation most of the time; I think it's +every 6 out of 7 cons calls that *don't* require allocation, when +building up a linked list. + +The implementation of car became ever so slightly slower, requiring a +small arithmetic operation. Cdr became even more complicated, with a +branch instruction, but most of the time it can actually elide the +pointer dereference; I think this time it's every 7 out of 8 calls. + +The strange formula for memory use is because we alloc in chunks of 8, +but use one slot for a final cdr. Anyhow, in the worst case, we will +be using 8 rather than 2 slots for a single-list element; the break +even point is 4 elements where both implementations use 8 slots. +(Then there's the 8-element case where both use 16 slots, but after +that our implementation always wins.) + +The fewer pointer derefs should help a bit, although it's likely that +the pairs making up a traditional linked list were all allocated in +close proximity, and CPUs nowadays fetch more memory from main RAM +into a CPU cache than you ask for, if you've asked for less than the +cache line size. The most common size nowadays is 64 bytes, which is +exactly `[8]Value`. This means that, using traditional cons cells, +you'd be fetching 4 of them at once every time you deref a pointer, +assuming they were all allocated next to each other. + +Memory locality is better on paper, but only if you don't think too +hard about it. We will always have seven elements in one contiguous +chunk, whereas the traditional implementation could suffer from a lot +of fragmentation: Each pair could end up in a different cache line +sized segment on the heap, meaning every element requires a main +memory fetch. But that's unlikely; they are much more likely to be +adjacent on the heap. Further, what if we have tons of tiny lists, +with one to four elements per list? Now our strategy forces every +list to be in its own cache line sized segment, whereas some of the +lists using the traditional implementation could end up in a single +64-byte segment and thus be fetched together. A cursory glance at +some Scheme code of mine indicates that there's indeed tons of lists +with fewer than 5 elements! + + +## And that was just linked lists + +Pairs are not only used for linked lists. They could be the slots of +an alist, they could be nodes in a tree data structure, and so on. + +Forcing every cons cell to allocate 64 bytes could thrash locality in +these cases, and actually increase the total memory use as well, as +most of the `Value`-sized slots may end up being wasted. + +I won't implement the strategy mentioned above yet, but if I do it +eventually, it will be very interesting to see its performance in a +variety of benchmarks, reflecting common usage patterns. diff --git a/html/style.css b/html/style.css index 4725089..1c91903 100644 --- a/html/style.css +++ b/html/style.css @@ -6,7 +6,7 @@ body { background: #eee; color: #333; - font-family: sans; + font-family: sans-serif; } h1, h2, h3 { @@ -21,3 +21,14 @@ p { code { font-size: 1.2em; } + +table { + width: 100%; + border-collapse: collapse; + border: 1px solid black; +} + +td, th { + border: 1px solid black; + padding: 4px; +} |
