notes/260608-fastcons2.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195

# Cons cell optimization!

_2026 June_

I've found a significant improvement over my idea in:

* [Cons cell optimization?](250317-fastcons.html)

There, the idea was to use the heap representation `[8]Value` for
"cons cells" and a 3-bit tag telling you where in that array you're
looking at.  If the tag is zero, then `car` is `values[6]` and `cdr`
is `values[7]`.  If the tag is greater than zero, then `car` is like
`values[7 - tag]` and `cdr` gives you the same pointer but with the
tag decremented.  The `cons` operation puts a value into an earlier
slot and increments the tag (you fill the array from end to start)
unless it's full in which case it actually conses a new `[8]Value`
chaining it to this one.  It's like jumbo cons cells of 8 not 2.

A big problem with this is that code often has short lists:

    (let ((a (+ x y))
          (b (* m n))
          (c (foo z bar)))
      (blah blub)
      (bleb blob blip))

How many lists do you spot in there with at least 4 elements?  That's
the break-even point of memory use for the 8-value array strategy, yet
we have most lists shorter than that here, wasting memory.  In this
sample snippet there actually isn't a single list of 5+ elements.

## Making it variable-length

I don't know why I didn't think of this sooner.  Simply let the tag
value indicate how many elements are left before the tail pointer,
where `cdr` can increment the pointer value.  E.g. consider:

    (define my-list '(a b c d e f))

    my-list  =  ptr + tag=5;  where ptr -> [ a b c d e f () ]

An array `[7]Value` has been allocated for the list, with the pointer
value pointing to its start and the tag value being 5.  Calling `car`
on this simply dereferences the pointer i.e. `ptr[0]`.  Calling `cdr`
does:

    if tag = 0:
      return ptr[1]
    else:
      return { ptr + 8, tag - 1 }

Note: So far we're not considering the consequences of iteratively
building up a list by calling `cons` repeatedly on the head, because
I've mainly been thinking about this from the perspective of direct
AST interpretation; the parser can check the length of lists it's
building and return array-backed lists of the right size up to the
maximum where they're split up but still saving a lot of space and
significantly reducing the burden of pointer chasing.

## 16-byte aligned heap

Furthermore, I've realized earlier today that Zisp heap objects could
be aligned to 16 bytes, since most things that can fit within 8 could
as well be packed into a NaN, with few exceptions.  One exception is
interned strings of exactly length 7, which fit into exactly 8 bytes
on the heap since they have a 1-byte length prefix.  Any shorter, and
the string fits into the 48-bit area of a NaN; any longer, and it'll
be padded to take up 16 bytes anyway.  There's also strings of length
16 to 23, that would normally fit in three 64-bit words, and now need
padding to 4 words, and so on for different "size classes"...

Thinking about it a bit more, we could let some heap objects have
smaller alignment through some pointer tag related trickery, but it's
not so important right now; we can think more about this later and in
worst case I think 16-byte aligned strings are acceptable since their
actual value is rarely dereferenced when they're used as identifiers.

What that 16-byte alignment means is we have not 3 but 4 low bits on
pointers usable for tagging.  That allows up to `[17]Value` which is
very nice; almost any list you are likely to encounter in Lisp/Scheme
code will fit into that.  If you're wondering why the maximum is not
16, it's because even a zero tag means you're looking at `[2]Value`
and tags go up to 15, so yeah.

To be clear: The individual array elements still align at 8 bytes,
being 64-bit values; it's just the head of the array that's always
aligned at 16.  This can actually lead to small amounts of memory
waste due to odd sized lists.  If 50% of lists have odd length, we
waste 8 bytes for every second list.  I think that's not too bad.
Worst case, I can walk back on this and go with 3 tag bits only.

## Finding the head

There's another problem though, which is losing access to the head
pointer.  If we allocated a `[7]Value` and ran `cdr` a couple times,
there may be no pointer left to the actual `[7]Value`, and pointers
we're left with don't have enough information to walk back to the
start of the array; they only tell how much longer it goes.

A solution is putting a sentinel value at the head.  This needs to be
an otherwise invalid bit pattern that cannot be any valid Zisp value,
not even the all-zero 64-bit value, because that's just the IEEE 754
positive zero which could appear in a list.  Fortunately, there's a
great many unused bit patterns in our NaN packing strategy, so it's
not a problem defining a special one for this purpose alone.

This means that we're wasting 8 bytes, but it's only a net loss for
lists of one element or lone non-list pairs.  The latter actually
appear frequently in alist type structures, but I really think it's
not a big deal; nobody should be working with alists of hundreds of
entries.  Moreover, the "backbone" of the alist actually becomes an
array and thus faster to traverse.  (Or at least a chain of arrays,
each up to 16 elements plus the tail pointer or nil.)

Actually, given the 16-byte alignment, we could just always alloc a
`[4]u64` as a minimum, filling the first two slots with the sentinel,
and then a subsequent `cons` call could reuse one of them to elide an
allocation.  More generally, perhaps the total array size could be
always made a multiple of 2.  In other words, instead of padding at
the end, we pad at the start with an extra sentinel to achieve the
16-byte alignment.

Let's see what happens if `cons` is called repeatedly to build a list
of a handful of elements; here `_` is a sentinel and `*` a pointer to
another `[4]u64`-backed list:

    (define ls ())         ; ls = ()

    (set! ls (cons 1 ls))  ; ls = [ _ _ 1 () ]

    (set! ls (cons 2 ls))  ; ls = [ _ 2 1 () ]

    (set! ls (cons 3 ls))  ; ls = [ _ _ 3 *  ]
                                          [ _ 2 1 () ]

    (set! ls (cons 4 ls))  ; ls = [ _ 4 3 *  ]
                                          [ _ 2 1 () ]

    (set! ls (cons 5 ls))  ; ls = [ _ _ 5 *  ]
                                          [ _ 4 3 *  ]
                                                  [ _ 2 1 () ]

If you think hard and compare this to what would happen under regular
`[2]Value` cons cells, you will arrive at an interesting realization,
which is that as elements are added, it goes back and forth between
wasting two `u64`-sized slot, and using exactly the same number of
slots, compared to regular cons cells.  So, the total memory waste
remains bounded to 16 bytes per list constructed in this way!

(The per-block sentinel value is always "wasted" yes, but this pays
for itself in having fewer pointers, which are also "waste" if you
think about it.)

That's a very nice graceful degradation for a strategy that will
massively improve the result of things like `(list x y z a b c)`,
which will actually provide contiguous arrays in a trench coat.

The only thing I'm worried about is branch prediction shenanigans;
perhaps the conditional logic added to cons/car/cdr will make this
worse in some surprising situations.

## Closing up

The garbage collector will need to deal with this complexity, but I
think it shouldn't be too bad.  For any "cons cell" pointer it finds,
it'll do a scan leftwards to find the sentinel, which is the head of
the real allocated memory.  The cell right to the sentinel then tells
how large the array is: tag value + 3.  (Why it's + 3 is left as an
exercise to the reader.)  Oh wait, there could be two sentinels; well
it just has to find a 16-byte boundary.  It can just scan in steps of
two CPU words' worth of memory.

There may be some uses of cons cells where our "optimization" really
makes things worse, but if you're going for high efficiency, you'll
typically want to avoid cons cells either way, using a more compact
data representation instead.  Also, in absolute worst case, we could
just implement yet another heap type that is a real cons cell; with
4-bit pointer tags you have 16 heap types you can point to!  (That's
not including this "fat cons cell/array" type which gets a dedicated
bit pattern in higher NaN bits, using the 4-bit tag as explained
herein.)

## Closing up even more

I'm quite fond of Google Gemini's current version, especially with my
personal pre-prompt instructions cutting a lot of the BS, so I asked
it to look at this idea.  It kindly reminded me that `set-cdr!` is a
thing.  But for immutable lists, it produced some calculations that
ultimately claim that this should be a significant improvement even
with branch mis-prediction issues.  I'm a little skeptical though,
given that calls to `list` and invoking the parser should generally
yield lists made up of cons cells that are allocated right next to
each other, significantly mitigating the pointer chasing issue.

Once I get to implement this, I'll just have to benchmark it.