summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorTaylan Kammer <taylan.kammer@gmail.com>2026-06-08 17:55:47 +0200
committerTaylan Kammer <taylan.kammer@gmail.com>2026-06-08 17:55:47 +0200
commita215b80479690ad422cedffba00f04a6ad5df351 (patch)
tree1cc24f12fa2f5d07249c52c03605f69f1b9ee849
parent627a09b58f404b1f690d27869b15b3197207c0af (diff)
Add a note.
-rw-r--r--notes/260608-fastcons2.md195
-rw-r--r--notes/index.md1
2 files changed, 196 insertions, 0 deletions
diff --git a/notes/260608-fastcons2.md b/notes/260608-fastcons2.md
new file mode 100644
index 0000000..bee97ac
--- /dev/null
+++ b/notes/260608-fastcons2.md
@@ -0,0 +1,195 @@
+# Cons cell optimization!
+
+_2026 June_
+
+I've found a significant improvement over my idea in:
+
+* [Cons cell optimization?](250317-fastcons.html)
+
+There, the idea was to use the heap representation `[8]Value` for
+"cons cells" and a 3-bit tag telling you where in that array you're
+looking at. If the tag is zero, then `car` is `values[6]` and `cdr`
+is `values[7]`. If the tag is greater than zero, then `car` is like
+`values[7 - tag]` and `cdr` gives you the same pointer but with the
+tag decremented. The `cons` operation puts a value into an earlier
+slot and increments the tag (you fill the array from end to start)
+unless it's full in which case it actually conses a new `[8]Value`
+chaining it to this one. It's like jumbo cons cells of 8 not 2.
+
+A big problem with this is that code often has short lists:
+
+ (let ((a (+ x y))
+ (b (* m n))
+ (c (foo z bar)))
+ (blah blub)
+ (bleb blob blip))
+
+How many lists do you spot in there with at least 4 elements? That's
+the break-even point of memory use for the 8-value array strategy, yet
+we have most lists shorter than that here, wasting memory. In this
+sample snippet there actually isn't a single list of 5+ elements.
+
+## Making it variable-length
+
+I don't know why I didn't think of this sooner. Simply let the tag
+value indicate how many elements are left before the tail pointer,
+where `cdr` can increment the pointer value. E.g. consider:
+
+ (define my-list '(a b c d e f))
+
+ my-list = ptr + tag=5; where ptr -> [ a b c d e f () ]
+
+An array `[7]Value` has been allocated for the list, with the pointer
+value pointing to its start and the tag value being 5. Calling `car`
+on this simply dereferences the pointer i.e. `ptr[0]`. Calling `cdr`
+does:
+
+ if tag = 0:
+ return ptr[1]
+ else:
+ return { ptr + 8, tag - 1 }
+
+Note: So far we're not considering the consequences of iteratively
+building up a list by calling `cons` repeatedly on the head, because
+I've mainly been thinking about this from the perspective of direct
+AST interpretation; the parser can check the length of lists it's
+building and return array-backed lists of the right size up to the
+maximum where they're split up but still saving a lot of space and
+significantly reducing the burden of pointer chasing.
+
+## 16-byte aligned heap
+
+Furthermore, I've realized earlier today that Zisp heap objects could
+be aligned to 16 bytes, since most things that can fit within 8 could
+as well be packed into a NaN, with few exceptions. One exception is
+interned strings of exactly length 7, which fit into exactly 8 bytes
+on the heap since they have a 1-byte length prefix. Any shorter, and
+the string fits into the 48-bit area of a NaN; any longer, and it'll
+be padded to take up 16 bytes anyway. There's also strings of length
+16 to 23, that would normally fit in three 64-bit words, and now need
+padding to 4 words, and so on for different "size classes"...
+
+Thinking about it a bit more, we could let some heap objects have
+smaller alignment through some pointer tag related trickery, but it's
+not so important right now; we can think more about this later and in
+worst case I think 16-byte aligned strings are acceptable since their
+actual value is rarely dereferenced when they're used as identifiers.
+
+What that 16-byte alignment means is we have not 3 but 4 low bits on
+pointers usable for tagging. That allows up to `[17]Value` which is
+very nice; almost any list you are likely to encounter in Lisp/Scheme
+code will fit into that. If you're wondering why the maximum is not
+16, it's because even a zero tag means you're looking at `[2]Value`
+and tags go up to 15, so yeah.
+
+To be clear: The individual array elements still align at 8 bytes,
+being 64-bit values; it's just the head of the array that's always
+aligned at 16. This can actually lead to small amounts of memory
+waste due to odd sized lists. If 50% of lists have odd length, we
+waste 8 bytes for every second list. I think that's not too bad.
+Worst case, I can walk back on this and go with 3 tag bits only.
+
+## Finding the head
+
+There's another problem though, which is losing access to the head
+pointer. If we allocated a `[7]Value` and ran `cdr` a couple times,
+there may be no pointer left to the actual `[7]Value`, and pointers
+we're left with don't have enough information to walk back to the
+start of the array; they only tell how much longer it goes.
+
+A solution is putting a sentinel value at the head. This needs to be
+an otherwise invalid bit pattern that cannot be any valid Zisp value,
+not even the all-zero 64-bit value, because that's just the IEEE 754
+positive zero which could appear in a list. Fortunately, there's a
+great many unused bit patterns in our NaN packing strategy, so it's
+not a problem defining a special one for this purpose alone.
+
+This means that we're wasting 8 bytes, but it's only a net loss for
+lists of one element or lone non-list pairs. The latter actually
+appear frequently in alist type structures, but I really think it's
+not a big deal; nobody should be working with alists of hundreds of
+entries. Moreover, the "backbone" of the alist actually becomes an
+array and thus faster to traverse. (Or at least a chain of arrays,
+each up to 16 elements plus the tail pointer or nil.)
+
+Actually, given the 16-byte alignment, we could just always alloc a
+`[4]u64` as a minimum, filling the first two slots with the sentinel,
+and then a subsequent `cons` call could reuse one of them to elide an
+allocation. More generally, perhaps the total array size could be
+always made a multiple of 2. In other words, instead of padding at
+the end, we pad at the start with an extra sentinel to achieve the
+16-byte alignment.
+
+Let's see what happens if `cons` is called repeatedly to build a list
+of a handful of elements; here `_` is a sentinel and `*` a pointer to
+another `[4]u64`-backed list:
+
+ (define ls ()) ; ls = ()
+
+ (set! ls (cons 1 ls)) ; ls = [ _ _ 1 () ]
+
+ (set! ls (cons 2 ls)) ; ls = [ _ 2 1 () ]
+
+ (set! ls (cons 3 ls)) ; ls = [ _ _ 3 * ]
+ [ _ 2 1 () ]
+
+ (set! ls (cons 4 ls)) ; ls = [ _ 4 3 * ]
+ [ _ 2 1 () ]
+
+ (set! ls (cons 5 ls)) ; ls = [ _ _ 5 * ]
+ [ _ 4 3 * ]
+ [ _ 2 1 () ]
+
+If you think hard and compare this to what would happen under regular
+`[2]Value` cons cells, you will arrive at an interesting realization,
+which is that as elements are added, it goes back and forth between
+wasting one `u64`-sized slot, and using exactly the same number of
+slots, compared to regular cons cells. So, the total memory waste
+remains bounded to 8 bytes per list constructed in this way!
+
+(The per-block sentinel value is always "wasted" yes, but this pays
+for itself in having fewer pointers, which are also "waste" if you
+think about it.)
+
+That's a very nice graceful degradation for a strategy that will
+massively improve the result of things like `(list x y z a b c)`,
+which will actually provide contiguous arrays in a trench coat.
+
+The only thing I'm worried about is branch prediction shenanigans;
+perhaps the conditional logic added to cons/car/cdr will make this
+worse in some surprising situations.
+
+## Closing up
+
+The garbage collector will need to deal with this complexity, but I
+think it shouldn't be too bad. For any "cons cell" pointer it finds,
+it'll do a scan leftwards to find the sentinel, which is the head of
+the real allocated memory. The cell right to the sentinel then tells
+how large the array is: tag value + 3. (Why it's + 3 is left as an
+exercise to the reader.) Oh wait, there could be two sentinels; well
+it just has to find a 16-byte boundary. It can just scan in steps of
+two CPU words' worth of memory.
+
+There may be some uses of cons cells where our "optimization" really
+makes things worse, but if you're going for high efficiency, you'll
+typically want to avoid cons cells either way, using a more compact
+data representation instead. Also, in absolute worst case, we could
+just implement yet another heap type that is a real cons cell; with
+4-bit pointer tags you have 16 heap types you can point to! (That's
+not including this "fat cons cell/array" type which gets a dedicated
+bit pattern in higher NaN bits, using the 4-bit tag as explained
+herein.)
+
+## Closing up even more
+
+I'm quite fond of Google Gemini's current version, especially with my
+personal pre-prompt instructions cutting a lot of the BS, so I asked
+it to look at this idea. It kindly reminded me that `set-cdr!` is a
+thing. But for immutable lists, it produced some calculations that
+ultimately claim that this should be a significant improvement even
+with branch mis-prediction issues. I'm a little skeptical though,
+given that calls to `list` and invoking the parser should generally
+yield lists made up of cons cells that are allocated right next to
+each other, significantly mitigating the pointer chasing issue.
+
+Once I get to implement this, I'll just have to benchmark it.
diff --git a/notes/index.md b/notes/index.md
index f9e4f19..a4f753e 100644
--- a/notes/index.md
+++ b/notes/index.md
@@ -30,3 +30,4 @@
* [JOKE RANT](260516-joke-rant.html)
* [Interpreter](260522-interpreter.html)
* [Static analysis to optimize GC](260524-gc-opt.html)
+* [Cons cell optimization!](260608-fastcons2.html)