summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorTaylan Kammer <taylan.kammer@gmail.com>2025-03-17 11:55:55 +0100
committerTaylan Kammer <taylan.kammer@gmail.com>2025-03-17 11:55:55 +0100
commitbff3e84e7b4083285d5a0a871663db57430401b6 (patch)
tree88d69c365def136232aa1b4349b6a1755e51af97
parentfbcedd74510b02a8a56acfbb3fb115394fcbdfd8 (diff)
Add fastcons.md
-rwxr-xr-xhtml/gen.sh2
-rw-r--r--html/index.md1
-rw-r--r--html/notes/fastcons.md206
-rw-r--r--html/style.css13
4 files changed, 220 insertions, 2 deletions
diff --git a/html/gen.sh b/html/gen.sh
index d14286e..3ed9482 100755
--- a/html/gen.sh
+++ b/html/gen.sh
@@ -19,7 +19,7 @@ do
title=$(sed '/^# / { s/# //; q }' "$file")
sed "s/__TITLE__/$title/" prelude.html
echo "<body>"
- markdown2 "$file" -x fenced-code-blocks,highlightjs-lang
+ markdown2 "$file" -x fenced-code-blocks,highlightjs-lang,tables
echo "</body>"
} > "${file%.md}".html
done
diff --git a/html/index.md b/html/index.md
index e7c5ff2..bf03784 100644
--- a/html/index.md
+++ b/html/index.md
@@ -33,3 +33,4 @@ because writing the code often gives you yet another perspective.
* [Reader? Decoder? I barely know 'er!](notes/reader.html)
* [Does the decoder implement macros?](notes/macros.html)
* [Better syntax-rules?](notes/sr.html)
+* [Cons cell optimization?](notes/fastcons.html)
diff --git a/html/notes/fastcons.md b/html/notes/fastcons.md
new file mode 100644
index 0000000..3dd4c3a
--- /dev/null
+++ b/html/notes/fastcons.md
@@ -0,0 +1,206 @@
+# Cons cell optimization?
+
+Assume that your Lisp implementation uses tagged pointers (possibly
+with NaN packing; doesn't matter), and there's enough information on
+pointers to tell you that the destination is the contents of a pair,
+aka cons cell.
+
+For example, Zisp uses NaN packing, and all values are represented as
+a `Value` union, which is either a double, or a special kind of NaN
+value that may, among other possibilities, contain a pointer with a
+3-bit type tag within it. One of the 8 possible types is a pair.
+
+You would implement the actual contents of a pair as `[2]Value` (in
+Zig notation), i.e., an array of two consecutive `Value` objects on
+the heap; you already know from the type tag on the pointer itself
+that the destination is pair contents, so you don't need any further
+information on the heap indicating that those two consecutive `Value`
+objects represent the car and cdr of a cons cell. They are "naked"
+within the heap.
+
+This seems optimal; you have zero meta-data on the heap itself. But
+pairs are ubiquitously used to implement linked lists in Lisp, and
+that includes a lot of inherent overhead: every single element must
+keep a pointer to the next element. So, if you have a list of five
+elements, you have five `[2]Value` on the heap, i.e., ten values!
+
+Can't we improve on that, somehow?
+
+In Zisp's NaN-packing strategy, there's an entire unused bit on heap
+pointers. (There's also a whole other set of possible pointer values
+with a 50-bit payload, so really, we have a *ton* of unused pointer
+space.)
+
+What if that bit indicated that the pointer, well, let's say "has to
+do" with pairs (I'll explain). The three type tag bits are now free,
+since the bit being *unset* means the pointer is for one of the other
+types. So, we can encode eight "sub-types" of this pointer.
+
+The actual pointer would go to a `[8]Value` instead of `[2]Value`,
+with the "type" tag being something akin to an index. (It's not an
+actual index; read on.)
+
+When you first use cons, say on a symbol `x` and nil, to get a list of
+one element, it uses two of the eight slots:
+
+ [ _ _ _ _ _ _ x () ] (tag = 0)
+
+(Let's call the operation that does this, i.e., allocates an array of
+eight and puts the arguments into positions 6 and 7, `alloc-cons`.
+We will use this term later.)
+
+The pointer points to the start of the array, the "type tag" is zero,
+and car and cdr are implemented as:
+
+ (car p) = p[6 - tag]
+
+ (cdr p) = p[7 - tag] ??? (read on)
+
+When you call cons again but the cdr is a pointer such as the above,
+then instead of allocating a new array, it puts the car into it and
+returns a modified pointer; say we did `(cons y pair)`:
+
+ [ _ _ _ _ _ y x () ] (tag = 1)
+
+(The operation that does this, i.e., puts the first argument to cons
+into ptr[5 - tag] and returns ptr with tag + 1, is `shift-cons`.)
+
+The tag on the pointer now says 1. You may notice that if cdr were
+truly implemented as the above, it wouldn't actually give us the cdr
+of the list any more; it would give us `(car (cdr list))` so we need
+to make a change:
+
+ (cdr p) = if tag = 0: p[7]; else: p with tag - 1
+
+We introduced a branch instruction, which may come back to bite us,
+but our car and cdr are now functional again.
+
+OK, but what if we run out of space in the array? Cons also needs a
+branch, or multiple if we count the one already implied above:
+
+ (cons a b) = if b is pair and b.tag < 6:
+ (shift-cons a b)
+ else if b is pair:
+ (alloc-cons a b)
+ else:
+ (alloc-cons a b)
+
+Oh, would you look at that: the last two branches are identical, so
+actually just one branch instruction is enough after all:
+
+ if b is pair and b.tag < 6:
+ (shift-cons a b)
+ else:
+ (alloc-cons a b)
+
+I *think* this pseudo-code is complete now? If cons, car, and cdr
+were to be implemented this way, I think we get an API that's fully
+compatible with traditional cons/car/cdr and the optimization is
+completely transparent.
+
+Note that we haven't used tag value 7. Maybe that's fine, or maybe
+it's possible to exploit that somehow; let's ignore it for now.
+
+(We could simply use `[9]Value` and then all 8 tag values would be
+used, but that's probably a bad idea due to cache line shenanigans;
+more on that later.)
+
+
+## Comparison
+
+Let's compare the in-memory representation of traditional pairs and
+our optimized "pairs" when both are used to implement a linked list
+with ten elements:
+
+ ;; traditional
+ list = [ e0 cdr1 ]
+ cdr1 = [ e1 cdr2 ]
+ cdr2 = [ e2 cdr3 ]
+ cdr3 = [ e3 cdr4 ]
+ cdr4 = [ e4 cdr5 ]
+ cdr5 = [ e5 cdr6 ]
+ cdr6 = [ e6 cdr7 ]
+ cdr7 = [ e7 cdr8 ]
+ cdr8 = [ e8 cdr9 ]
+ cdr9 = [ e9 () ]
+
+ ;; optimized
+ list = [ _ _ _ _ e0 e1 e2 cdr1 ] (tag = 2)
+ cdr1 = [ e3 e4 e5 e6 e7 e8 e9 () ] (tag = 6)
+
+Let's look at the pros and cons.
+
+<style>
+td:first-child { font-weight: bold; }
+td:not(:first-child) { font-family: mono; }
+</style>
+
+| | Traditional | Optimized |
+|----------|--------------------------|------------------------------|
+| cons | alloc, set car & cdr | see above (branches) |
+| car | ptr[0] | ptr[6 - tag] |
+| cdr | ptr[1] | see above (branches) |
+| Memory¹ | n * 2 | (floor((n - 1) / 7) + 1) * 8 |
+| Locality | Bad | Better (?) |
+| Derefs² | n | n/7 (I think) |
+
+¹ Counted in `Value`-sized slots on the heap.
+
+² Pointer dereferences when looping through the whole list.
+
+I haven't thought too hard about some of the numbers, but it's not
+terribly important; this is to get a general idea.
+
+Traditional cons always allocates; ours has a branch instruction but
+this allows it to elide an allocation most of the time; I think it's
+every 6 out of 7 cons calls that *don't* require allocation, when
+building up a linked list.
+
+The implementation of car became ever so slightly slower, requiring a
+small arithmetic operation. Cdr became even more complicated, with a
+branch instruction, but most of the time it can actually elide the
+pointer dereference; I think this time it's every 7 out of 8 calls.
+
+The strange formula for memory use is because we alloc in chunks of 8,
+but use one slot for a final cdr. Anyhow, in the worst case, we will
+be using 8 rather than 2 slots for a single-list element; the break
+even point is 4 elements where both implementations use 8 slots.
+(Then there's the 8-element case where both use 16 slots, but after
+that our implementation always wins.)
+
+The fewer pointer derefs should help a bit, although it's likely that
+the pairs making up a traditional linked list were all allocated in
+close proximity, and CPUs nowadays fetch more memory from main RAM
+into a CPU cache than you ask for, if you've asked for less than the
+cache line size. The most common size nowadays is 64 bytes, which is
+exactly `[8]Value`. This means that, using traditional cons cells,
+you'd be fetching 4 of them at once every time you deref a pointer,
+assuming they were all allocated next to each other.
+
+Memory locality is better on paper, but only if you don't think too
+hard about it. We will always have seven elements in one contiguous
+chunk, whereas the traditional implementation could suffer from a lot
+of fragmentation: Each pair could end up in a different cache line
+sized segment on the heap, meaning every element requires a main
+memory fetch. But that's unlikely; they are much more likely to be
+adjacent on the heap. Further, what if we have tons of tiny lists,
+with one to four elements per list? Now our strategy forces every
+list to be in its own cache line sized segment, whereas some of the
+lists using the traditional implementation could end up in a single
+64-byte segment and thus be fetched together. A cursory glance at
+some Scheme code of mine indicates that there's indeed tons of lists
+with fewer than 5 elements!
+
+
+## And that was just linked lists
+
+Pairs are not only used for linked lists. They could be the slots of
+an alist, they could be nodes in a tree data structure, and so on.
+
+Forcing every cons cell to allocate 64 bytes could thrash locality in
+these cases, and actually increase the total memory use as well, as
+most of the `Value`-sized slots may end up being wasted.
+
+I won't implement the strategy mentioned above yet, but if I do it
+eventually, it will be very interesting to see its performance in a
+variety of benchmarks, reflecting common usage patterns.
diff --git a/html/style.css b/html/style.css
index 4725089..1c91903 100644
--- a/html/style.css
+++ b/html/style.css
@@ -6,7 +6,7 @@ body {
background: #eee;
color: #333;
- font-family: sans;
+ font-family: sans-serif;
}
h1, h2, h3 {
@@ -21,3 +21,14 @@ p {
code {
font-size: 1.2em;
}
+
+table {
+ width: 100%;
+ border-collapse: collapse;
+ border: 1px solid black;
+}
+
+td, th {
+ border: 1px solid black;
+ padding: 4px;
+}