# Cons cell optimization! _2026 June_ I've found a significant improvement over my idea in: * [Cons cell optimization?](250317-fastcons.html) There, the idea was to use the heap representation `[8]Value` for "cons cells" and a 3-bit tag telling you where in that array you're looking at. If the tag is zero, then `car` is `values[6]` and `cdr` is `values[7]`. If the tag is greater than zero, then `car` is like `values[7 - tag]` and `cdr` gives you the same pointer but with the tag decremented. The `cons` operation puts a value into an earlier slot and increments the tag (you fill the array from end to start) unless it's full in which case it actually conses a new `[8]Value` chaining it to this one. It's like jumbo cons cells of 8 not 2. A big problem with this is that code often has short lists: (let ((a (+ x y)) (b (* m n)) (c (foo z bar))) (blah blub) (bleb blob blip)) How many lists do you spot in there with at least 4 elements? That's the break-even point of memory use for the 8-value array strategy, yet we have most lists shorter than that here, wasting memory. In this sample snippet there actually isn't a single list of 5+ elements. ## Making it variable-length I don't know why I didn't think of this sooner. Simply let the tag value indicate how many elements are left before the tail pointer, where `cdr` can increment the pointer value. E.g. consider: (define my-list '(a b c d e f)) my-list = ptr + tag=5; where ptr -> [ a b c d e f () ] An array `[7]Value` has been allocated for the list, with the pointer value pointing to its start and the tag value being 5. Calling `car` on this simply dereferences the pointer i.e. `ptr[0]`. Calling `cdr` does: if tag = 0: return ptr[1] else: return { ptr + 8, tag - 1 } Note: So far we're not considering the consequences of iteratively building up a list by calling `cons` repeatedly on the head, because I've mainly been thinking about this from the perspective of direct AST interpretation; the parser can check the length of lists it's building and return array-backed lists of the right size up to the maximum where they're split up but still saving a lot of space and significantly reducing the burden of pointer chasing. ## 16-byte aligned heap Furthermore, I've realized earlier today that Zisp heap objects could be aligned to 16 bytes, since most things that can fit within 8 could as well be packed into a NaN, with few exceptions. One exception is interned strings of exactly length 7, which fit into exactly 8 bytes on the heap since they have a 1-byte length prefix. Any shorter, and the string fits into the 48-bit area of a NaN; any longer, and it'll be padded to take up 16 bytes anyway. There's also strings of length 16 to 23, that would normally fit in three 64-bit words, and now need padding to 4 words, and so on for different "size classes"... Thinking about it a bit more, we could let some heap objects have smaller alignment through some pointer tag related trickery, but it's not so important right now; we can think more about this later and in worst case I think 16-byte aligned strings are acceptable since their actual value is rarely dereferenced when they're used as identifiers. What that 16-byte alignment means is we have not 3 but 4 low bits on pointers usable for tagging. That allows up to `[17]Value` which is very nice; almost any list you are likely to encounter in Lisp/Scheme code will fit into that. If you're wondering why the maximum is not 16, it's because even a zero tag means you're looking at `[2]Value` and tags go up to 15, so yeah. To be clear: The individual array elements still align at 8 bytes, being 64-bit values; it's just the head of the array that's always aligned at 16. This can actually lead to small amounts of memory waste due to odd sized lists. If 50% of lists have odd length, we waste 8 bytes for every second list. I think that's not too bad. Worst case, I can walk back on this and go with 3 tag bits only. ## Finding the head There's another problem though, which is losing access to the head pointer. If we allocated a `[7]Value` and ran `cdr` a couple times, there may be no pointer left to the actual `[7]Value`, and pointers we're left with don't have enough information to walk back to the start of the array; they only tell how much longer it goes. A solution is putting a sentinel value at the head. This needs to be an otherwise invalid bit pattern that cannot be any valid Zisp value, not even the all-zero 64-bit value, because that's just the IEEE 754 positive zero which could appear in a list. Fortunately, there's a great many unused bit patterns in our NaN packing strategy, so it's not a problem defining a special one for this purpose alone. This means that we're wasting 8 bytes, but it's only a net loss for lists of one element or lone non-list pairs. The latter actually appear frequently in alist type structures, but I really think it's not a big deal; nobody should be working with alists of hundreds of entries. Moreover, the "backbone" of the alist actually becomes an array and thus faster to traverse. (Or at least a chain of arrays, each up to 16 elements plus the tail pointer or nil.) Actually, given the 16-byte alignment, we could just always alloc a `[4]u64` as a minimum, filling the first two slots with the sentinel, and then a subsequent `cons` call could reuse one of them to elide an allocation. More generally, perhaps the total array size could be always made a multiple of 2. In other words, instead of padding at the end, we pad at the start with an extra sentinel to achieve the 16-byte alignment. Let's see what happens if `cons` is called repeatedly to build a list of a handful of elements; here `_` is a sentinel and `*` a pointer to another `[4]u64`-backed list: (define ls ()) ; ls = () (set! ls (cons 1 ls)) ; ls = [ _ _ 1 () ] (set! ls (cons 2 ls)) ; ls = [ _ 2 1 () ] (set! ls (cons 3 ls)) ; ls = [ _ _ 3 * ] [ _ 2 1 () ] (set! ls (cons 4 ls)) ; ls = [ _ 4 3 * ] [ _ 2 1 () ] (set! ls (cons 5 ls)) ; ls = [ _ _ 5 * ] [ _ 4 3 * ] [ _ 2 1 () ] If you think hard and compare this to what would happen under regular `[2]Value` cons cells, you will arrive at an interesting realization, which is that as elements are added, it goes back and forth between wasting one `u64`-sized slot, and using exactly the same number of slots, compared to regular cons cells. So, the total memory waste remains bounded to 8 bytes per list constructed in this way! (The per-block sentinel value is always "wasted" yes, but this pays for itself in having fewer pointers, which are also "waste" if you think about it.) That's a very nice graceful degradation for a strategy that will massively improve the result of things like `(list x y z a b c)`, which will actually provide contiguous arrays in a trench coat. The only thing I'm worried about is branch prediction shenanigans; perhaps the conditional logic added to cons/car/cdr will make this worse in some surprising situations. ## Closing up The garbage collector will need to deal with this complexity, but I think it shouldn't be too bad. For any "cons cell" pointer it finds, it'll do a scan leftwards to find the sentinel, which is the head of the real allocated memory. The cell right to the sentinel then tells how large the array is: tag value + 3. (Why it's + 3 is left as an exercise to the reader.) Oh wait, there could be two sentinels; well it just has to find a 16-byte boundary. It can just scan in steps of two CPU words' worth of memory. There may be some uses of cons cells where our "optimization" really makes things worse, but if you're going for high efficiency, you'll typically want to avoid cons cells either way, using a more compact data representation instead. Also, in absolute worst case, we could just implement yet another heap type that is a real cons cell; with 4-bit pointer tags you have 16 heap types you can point to! (That's not including this "fat cons cell/array" type which gets a dedicated bit pattern in higher NaN bits, using the 4-bit tag as explained herein.) ## Closing up even more I'm quite fond of Google Gemini's current version, especially with my personal pre-prompt instructions cutting a lot of the BS, so I asked it to look at this idea. It kindly reminded me that `set-cdr!` is a thing. But for immutable lists, it produced some calculations that ultimately claim that this should be a significant improvement even with branch mis-prediction issues. I'm a little skeptical though, given that calls to `list` and invoking the parser should generally yield lists made up of cons cells that are allocated right next to each other, significantly mitigating the pointer chasing issue. Once I get to implement this, I'll just have to benchmark it.