Add fastcons.md

author: Taylan Kammer <taylan.kammer@gmail.com> 2025-03-17 11:55:55 +0100
committer: Taylan Kammer <taylan.kammer@gmail.com> 2025-03-17 11:55:55 +0100
commit: bff3e84e7b4083285d5a0a871663db57430401b6 (patch)
tree: 88d69c365def136232aa1b4349b6a1755e51af97
parent: fbcedd74510b02a8a56acfbb3fb115394fcbdfd8 (diff)
4 files changed, 220 insertions, 2 deletions
diff --git a/html/gen.sh b/html/gen.sh
index d14286e..3ed9482 100755
--- a/html/gen.sh
+++ b/html/gen.sh
@@ -19,7 +19,7 @@ do
         title=$(sed '/^# / { s/# //; q }' "$file")
         sed "s/__TITLE__/$title/" prelude.html
         echo "<body>"
-        markdown2 "$file" -x fenced-code-blocks,highlightjs-lang
+        markdown2 "$file" -x fenced-code-blocks,highlightjs-lang,tables
         echo "</body>"
     } > "${file%.md}".html
 done
diff --git a/html/index.md b/html/index.md
index e7c5ff2..bf03784 100644
--- a/html/index.md
+++ b/html/index.md
@@ -33,3 +33,4 @@ because writing the code often gives you yet another perspective.
 * [Reader? Decoder? I barely know 'er!](notes/reader.html)
 * [Does the decoder implement macros?](notes/macros.html)
 * [Better syntax-rules?](notes/sr.html)
+* [Cons cell optimization?](notes/fastcons.html)
diff --git a/html/notes/fastcons.md b/html/notes/fastcons.md
new file mode 100644
index 0000000..3dd4c3a
--- /dev/null
+++ b/html/notes/fastcons.md
@@ -0,0 +1,206 @@
+# Cons cell optimization?
+
+Assume that your Lisp implementation uses tagged pointers (possibly
+with NaN packing; doesn't matter), and there's enough information on
+pointers to tell you that the destination is the contents of a pair,
+aka cons cell.
+
+For example, Zisp uses NaN packing, and all values are represented as
+a `Value` union, which is either a double, or a special kind of NaN
+value that may, among other possibilities, contain a pointer with a
+3-bit type tag within it.  One of the 8 possible types is a pair.
+
+You would implement the actual contents of a pair as `[2]Value` (in
+Zig notation), i.e., an array of two consecutive `Value` objects on
+the heap; you already know from the type tag on the pointer itself
+that the destination is pair contents, so you don't need any further
+information on the heap indicating that those two consecutive `Value`
+objects represent the car and cdr of a cons cell.  They are "naked"
+within the heap.
+
+This seems optimal; you have zero meta-data on the heap itself.  But
+pairs are ubiquitously used to implement linked lists in Lisp, and
+that includes a lot of inherent overhead: every single element must
+keep a pointer to the next element.  So, if you have a list of five
+elements, you have five `[2]Value` on the heap, i.e., ten values!
+
+Can't we improve on that, somehow?
+
+In Zisp's NaN-packing strategy, there's an entire unused bit on heap
+pointers.  (There's also a whole other set of possible pointer values
+with a 50-bit payload, so really, we have a *ton* of unused pointer
+space.)
+
+What if that bit indicated that the pointer, well, let's say "has to
+do" with pairs (I'll explain).  The three type tag bits are now free,
+since the bit being *unset* means the pointer is for one of the other
+types.  So, we can encode eight "sub-types" of this pointer.
+
+The actual pointer would go to a `[8]Value` instead of `[2]Value`,
+with the "type" tag being something akin to an index.  (It's not an
+actual index; read on.)
+
+When you first use cons, say on a symbol `x` and nil, to get a list of
+one element, it uses two of the eight slots:
+
+    [ _ _ _ _ _ _ x () ]  (tag = 0)
+
+(Let's call the operation that does this, i.e., allocates an array of
+eight and puts the arguments into positions 6 and 7, `alloc-cons`.
+We will use this term later.)
+
+The pointer points to the start of the array, the "type tag" is zero,
+and car and cdr are implemented as:
+
+    (car p)  =  p[6 - tag]
+
+    (cdr p)  =  p[7 - tag]  ??? (read on)
+
+When you call cons again but the cdr is a pointer such as the above,
+then instead of allocating a new array, it puts the car into it and
+returns a modified pointer; say we did `(cons y pair)`:
+
+    [ _ _ _ _ _ y x () ]  (tag = 1)
+
+(The operation that does this, i.e., puts the first argument to cons
+into ptr[5 - tag] and returns ptr with tag + 1, is `shift-cons`.)
+
+The tag on the pointer now says 1.  You may notice that if cdr were
+truly implemented as the above, it wouldn't actually give us the cdr
+of the list any more; it would give us `(car (cdr list))` so we need
+to make a change:
+
+    (cdr p)  =  if tag = 0: p[7]; else: p with tag - 1
+
+We introduced a branch instruction, which may come back to bite us,
+but our car and cdr are now functional again.
+
+OK, but what if we run out of space in the array?  Cons also needs a
+branch, or multiple if we count the one already implied above:
+
+    (cons a b)  =  if b is pair and b.tag < 6:
+                     (shift-cons a b)
+                   else if b is pair:
+                     (alloc-cons a b)
+                   else:
+                     (alloc-cons a b)
+
+Oh, would you look at that: the last two branches are identical, so
+actually just one branch instruction is enough after all:
+
+    if b is pair and b.tag < 6:
+      (shift-cons a b)
+    else:
+      (alloc-cons a b)
+
+I *think* this pseudo-code is complete now?  If cons, car, and cdr
+were to be implemented this way, I think we get an API that's fully
+compatible with traditional cons/car/cdr and the optimization is
+completely transparent.
+
+Note that we haven't used tag value 7.  Maybe that's fine, or maybe
+it's possible to exploit that somehow; let's ignore it for now.
+
+(We could simply use `[9]Value` and then all 8 tag values would be
+used, but that's probably a bad idea due to cache line shenanigans;
+more on that later.)
+
+
+## Comparison
+
+Let's compare the in-memory representation of traditional pairs and
+our optimized "pairs" when both are used to implement a linked list
+with ten elements:
+
+    ;; traditional
+    list = [ e0 cdr1 ]
+    cdr1 = [ e1 cdr2 ]
+    cdr2 = [ e2 cdr3 ]
+    cdr3 = [ e3 cdr4 ]
+    cdr4 = [ e4 cdr5 ]
+    cdr5 = [ e5 cdr6 ]
+    cdr6 = [ e6 cdr7 ]
+    cdr7 = [ e7 cdr8 ]
+    cdr8 = [ e8 cdr9 ]
+    cdr9 = [ e9  ()  ]
+
+    ;; optimized
+    list = [ _  _  _  _  e0 e1 e2 cdr1 ] (tag = 2)
+    cdr1 = [ e3 e4 e5 e6 e7 e8 e9  ()  ] (tag = 6)
+
+Let's look at the pros and cons.
+
+<style>
+td:first-child { font-weight: bold; }
+td:not(:first-child) { font-family: mono; }
+</style>
+
+|          | Traditional              | Optimized                    |
+|----------|--------------------------|------------------------------|
+| cons     | alloc, set car & cdr     | see above (branches)         |
+| car      | ptr[0]                   | ptr[6 - tag]                 |
+| cdr      | ptr[1]                   | see above (branches)         |
+| Memory¹  | n * 2                    | (floor((n - 1) / 7) + 1) * 8 |
+| Locality | Bad                      | Better (?)                   |
+| Derefs²  | n                        | n/7 (I think)                |
+
+¹ Counted in `Value`-sized slots on the heap.
+
+² Pointer dereferences when looping through the whole list.
+
+I haven't thought too hard about some of the numbers, but it's not
+terribly important; this is to get a general idea.
+
+Traditional cons always allocates; ours has a branch instruction but
+this allows it to elide an allocation most of the time; I think it's
+every 6 out of 7 cons calls that *don't* require allocation, when
+building up a linked list.
+
+The implementation of car became ever so slightly slower, requiring a
+small arithmetic operation.  Cdr became even more complicated, with a
+branch instruction, but most of the time it can actually elide the
+pointer dereference; I think this time it's every 7 out of 8 calls.
+
+The strange formula for memory use is because we alloc in chunks of 8,
+but use one slot for a final cdr.  Anyhow, in the worst case, we will
+be using 8 rather than 2 slots for a single-list element; the break
+even point is 4 elements where both implementations use 8 slots.
+(Then there's the 8-element case where both use 16 slots, but after
+that our implementation always wins.)
+
+The fewer pointer derefs should help a bit, although it's likely that
+the pairs making up a traditional linked list were all allocated in
+close proximity, and CPUs nowadays fetch more memory from main RAM
+into a CPU cache than you ask for, if you've asked for less than the
+cache line size.  The most common size nowadays is 64 bytes, which is
+exactly `[8]Value`.  This means that, using traditional cons cells,
+you'd be fetching 4 of them at once every time you deref a pointer,
+assuming they were all allocated next to each other.
+
+Memory locality is better on paper, but only if you don't think too
+hard about it.  We will always have seven elements in one contiguous
+chunk, whereas the traditional implementation could suffer from a lot
+of fragmentation: Each pair could end up in a different cache line
+sized segment on the heap, meaning every element requires a main
+memory fetch.  But that's unlikely; they are much more likely to be
+adjacent on the heap.  Further, what if we have tons of tiny lists,
+with one to four elements per list?  Now our strategy forces every
+list to be in its own cache line sized segment, whereas some of the
+lists using the traditional implementation could end up in a single
+64-byte segment and thus be fetched together.  A cursory glance at
+some Scheme code of mine indicates that there's indeed tons of lists
+with fewer than 5 elements!
+
+
+## And that was just linked lists
+
+Pairs are not only used for linked lists.  They could be the slots of
+an alist, they could be nodes in a tree data structure, and so on.
+
+Forcing every cons cell to allocate 64 bytes could thrash locality in
+these cases, and actually increase the total memory use as well, as
+most of the `Value`-sized slots may end up being wasted.
+
+I won't implement the strategy mentioned above yet, but if I do it
+eventually, it will be very interesting to see its performance in a
+variety of benchmarks, reflecting common usage patterns.
diff --git a/html/style.css b/html/style.css
index 4725089..1c91903 100644
--- a/html/style.css
+++ b/html/style.css
@@ -6,7 +6,7 @@ body {
     background: #eee;
     color: #333;
 
-    font-family: sans;
+    font-family: sans-serif;
 }
 
 h1, h2, h3 {
@@ -21,3 +21,14 @@ p {
 code {
     font-size: 1.2em;
 }
+
+table {
+    width: 100%;
+    border-collapse: collapse;
+    border: 1px solid black;
+}
+
+td, th {
+    border: 1px solid black;
+    padding: 4px;
+}
author	Taylan Kammer <taylan.kammer@gmail.com>	2025-03-17 11:55:55 +0100
committer	Taylan Kammer <taylan.kammer@gmail.com>	2025-03-17 11:55:55 +0100
commit	bff3e84e7b4083285d5a0a871663db57430401b6 (patch)
tree	88d69c365def136232aa1b4349b6a1755e51af97
parent	fbcedd74510b02a8a56acfbb3fb115394fcbdfd8 (diff)