diff options
| author | Taylan Kammer <taylan.kammer@gmail.com> | 2026-06-08 17:55:47 +0200 |
|---|---|---|
| committer | Taylan Kammer <taylan.kammer@gmail.com> | 2026-06-08 17:55:47 +0200 |
| commit | a215b80479690ad422cedffba00f04a6ad5df351 (patch) | |
| tree | 1cc24f12fa2f5d07249c52c03605f69f1b9ee849 | |
| parent | 627a09b58f404b1f690d27869b15b3197207c0af (diff) | |
Add a note.
| -rw-r--r-- | notes/260608-fastcons2.md | 195 | ||||
| -rw-r--r-- | notes/index.md | 1 |
2 files changed, 196 insertions, 0 deletions
diff --git a/notes/260608-fastcons2.md b/notes/260608-fastcons2.md new file mode 100644 index 0000000..bee97ac --- /dev/null +++ b/notes/260608-fastcons2.md @@ -0,0 +1,195 @@ +# Cons cell optimization! + +_2026 June_ + +I've found a significant improvement over my idea in: + +* [Cons cell optimization?](250317-fastcons.html) + +There, the idea was to use the heap representation `[8]Value` for +"cons cells" and a 3-bit tag telling you where in that array you're +looking at. If the tag is zero, then `car` is `values[6]` and `cdr` +is `values[7]`. If the tag is greater than zero, then `car` is like +`values[7 - tag]` and `cdr` gives you the same pointer but with the +tag decremented. The `cons` operation puts a value into an earlier +slot and increments the tag (you fill the array from end to start) +unless it's full in which case it actually conses a new `[8]Value` +chaining it to this one. It's like jumbo cons cells of 8 not 2. + +A big problem with this is that code often has short lists: + + (let ((a (+ x y)) + (b (* m n)) + (c (foo z bar))) + (blah blub) + (bleb blob blip)) + +How many lists do you spot in there with at least 4 elements? That's +the break-even point of memory use for the 8-value array strategy, yet +we have most lists shorter than that here, wasting memory. In this +sample snippet there actually isn't a single list of 5+ elements. + +## Making it variable-length + +I don't know why I didn't think of this sooner. Simply let the tag +value indicate how many elements are left before the tail pointer, +where `cdr` can increment the pointer value. E.g. consider: + + (define my-list '(a b c d e f)) + + my-list = ptr + tag=5; where ptr -> [ a b c d e f () ] + +An array `[7]Value` has been allocated for the list, with the pointer +value pointing to its start and the tag value being 5. Calling `car` +on this simply dereferences the pointer i.e. `ptr[0]`. Calling `cdr` +does: + + if tag = 0: + return ptr[1] + else: + return { ptr + 8, tag - 1 } + +Note: So far we're not considering the consequences of iteratively +building up a list by calling `cons` repeatedly on the head, because +I've mainly been thinking about this from the perspective of direct +AST interpretation; the parser can check the length of lists it's +building and return array-backed lists of the right size up to the +maximum where they're split up but still saving a lot of space and +significantly reducing the burden of pointer chasing. + +## 16-byte aligned heap + +Furthermore, I've realized earlier today that Zisp heap objects could +be aligned to 16 bytes, since most things that can fit within 8 could +as well be packed into a NaN, with few exceptions. One exception is +interned strings of exactly length 7, which fit into exactly 8 bytes +on the heap since they have a 1-byte length prefix. Any shorter, and +the string fits into the 48-bit area of a NaN; any longer, and it'll +be padded to take up 16 bytes anyway. There's also strings of length +16 to 23, that would normally fit in three 64-bit words, and now need +padding to 4 words, and so on for different "size classes"... + +Thinking about it a bit more, we could let some heap objects have +smaller alignment through some pointer tag related trickery, but it's +not so important right now; we can think more about this later and in +worst case I think 16-byte aligned strings are acceptable since their +actual value is rarely dereferenced when they're used as identifiers. + +What that 16-byte alignment means is we have not 3 but 4 low bits on +pointers usable for tagging. That allows up to `[17]Value` which is +very nice; almost any list you are likely to encounter in Lisp/Scheme +code will fit into that. If you're wondering why the maximum is not +16, it's because even a zero tag means you're looking at `[2]Value` +and tags go up to 15, so yeah. + +To be clear: The individual array elements still align at 8 bytes, +being 64-bit values; it's just the head of the array that's always +aligned at 16. This can actually lead to small amounts of memory +waste due to odd sized lists. If 50% of lists have odd length, we +waste 8 bytes for every second list. I think that's not too bad. +Worst case, I can walk back on this and go with 3 tag bits only. + +## Finding the head + +There's another problem though, which is losing access to the head +pointer. If we allocated a `[7]Value` and ran `cdr` a couple times, +there may be no pointer left to the actual `[7]Value`, and pointers +we're left with don't have enough information to walk back to the +start of the array; they only tell how much longer it goes. + +A solution is putting a sentinel value at the head. This needs to be +an otherwise invalid bit pattern that cannot be any valid Zisp value, +not even the all-zero 64-bit value, because that's just the IEEE 754 +positive zero which could appear in a list. Fortunately, there's a +great many unused bit patterns in our NaN packing strategy, so it's +not a problem defining a special one for this purpose alone. + +This means that we're wasting 8 bytes, but it's only a net loss for +lists of one element or lone non-list pairs. The latter actually +appear frequently in alist type structures, but I really think it's +not a big deal; nobody should be working with alists of hundreds of +entries. Moreover, the "backbone" of the alist actually becomes an +array and thus faster to traverse. (Or at least a chain of arrays, +each up to 16 elements plus the tail pointer or nil.) + +Actually, given the 16-byte alignment, we could just always alloc a +`[4]u64` as a minimum, filling the first two slots with the sentinel, +and then a subsequent `cons` call could reuse one of them to elide an +allocation. More generally, perhaps the total array size could be +always made a multiple of 2. In other words, instead of padding at +the end, we pad at the start with an extra sentinel to achieve the +16-byte alignment. + +Let's see what happens if `cons` is called repeatedly to build a list +of a handful of elements; here `_` is a sentinel and `*` a pointer to +another `[4]u64`-backed list: + + (define ls ()) ; ls = () + + (set! ls (cons 1 ls)) ; ls = [ _ _ 1 () ] + + (set! ls (cons 2 ls)) ; ls = [ _ 2 1 () ] + + (set! ls (cons 3 ls)) ; ls = [ _ _ 3 * ] + [ _ 2 1 () ] + + (set! ls (cons 4 ls)) ; ls = [ _ 4 3 * ] + [ _ 2 1 () ] + + (set! ls (cons 5 ls)) ; ls = [ _ _ 5 * ] + [ _ 4 3 * ] + [ _ 2 1 () ] + +If you think hard and compare this to what would happen under regular +`[2]Value` cons cells, you will arrive at an interesting realization, +which is that as elements are added, it goes back and forth between +wasting one `u64`-sized slot, and using exactly the same number of +slots, compared to regular cons cells. So, the total memory waste +remains bounded to 8 bytes per list constructed in this way! + +(The per-block sentinel value is always "wasted" yes, but this pays +for itself in having fewer pointers, which are also "waste" if you +think about it.) + +That's a very nice graceful degradation for a strategy that will +massively improve the result of things like `(list x y z a b c)`, +which will actually provide contiguous arrays in a trench coat. + +The only thing I'm worried about is branch prediction shenanigans; +perhaps the conditional logic added to cons/car/cdr will make this +worse in some surprising situations. + +## Closing up + +The garbage collector will need to deal with this complexity, but I +think it shouldn't be too bad. For any "cons cell" pointer it finds, +it'll do a scan leftwards to find the sentinel, which is the head of +the real allocated memory. The cell right to the sentinel then tells +how large the array is: tag value + 3. (Why it's + 3 is left as an +exercise to the reader.) Oh wait, there could be two sentinels; well +it just has to find a 16-byte boundary. It can just scan in steps of +two CPU words' worth of memory. + +There may be some uses of cons cells where our "optimization" really +makes things worse, but if you're going for high efficiency, you'll +typically want to avoid cons cells either way, using a more compact +data representation instead. Also, in absolute worst case, we could +just implement yet another heap type that is a real cons cell; with +4-bit pointer tags you have 16 heap types you can point to! (That's +not including this "fat cons cell/array" type which gets a dedicated +bit pattern in higher NaN bits, using the 4-bit tag as explained +herein.) + +## Closing up even more + +I'm quite fond of Google Gemini's current version, especially with my +personal pre-prompt instructions cutting a lot of the BS, so I asked +it to look at this idea. It kindly reminded me that `set-cdr!` is a +thing. But for immutable lists, it produced some calculations that +ultimately claim that this should be a significant improvement even +with branch mis-prediction issues. I'm a little skeptical though, +given that calls to `list` and invoking the parser should generally +yield lists made up of cons cells that are allocated right next to +each other, significantly mitigating the pointer chasing issue. + +Once I get to implement this, I'll just have to benchmark it. diff --git a/notes/index.md b/notes/index.md index f9e4f19..a4f753e 100644 --- a/notes/index.md +++ b/notes/index.md @@ -30,3 +30,4 @@ * [JOKE RANT](260516-joke-rant.html) * [Interpreter](260522-interpreter.html) * [Static analysis to optimize GC](260524-gc-opt.html) +* [Cons cell optimization!](260608-fastcons2.html) |
