# Cons cell optimization?

Assume that your Lisp implementation uses tagged pointers (possibly
with NaN packing; doesn't matter), and there's enough information on
pointers to tell you that the destination is the contents of a pair,
aka cons cell.

For example, Zisp uses NaN packing, and all values are represented as
a `Value` union, which is either a double, or a special kind of NaN
value that may, among other possibilities, contain a pointer with a
3-bit type tag within it.  One of the 8 possible types is a pair.

You would implement the actual contents of a pair as `[2]Value` (in
Zig notation), i.e., an array of two consecutive `Value` objects on
the heap; you already know from the type tag on the pointer itself
that the destination is pair contents, so you don't need any further
information on the heap indicating that those two consecutive `Value`
objects represent the car and cdr of a cons cell.  They are "naked"
within the heap.
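
For illustration, here's a rough C sketch of one way such a
representation could look.  Everything below (`QNAN`, `TAG_PAIR`, the
exact bit layout) is made up for this note and is not Zisp's actual
encoding; a real implementation would also have to keep genuine NaN
doubles from colliding with the boxed bit patterns.

    #include <stdint.h>
    #include <string.h>

    /* A value is a 64-bit word: either a plain double, or a quiet-NaN
       bit pattern whose payload carries a pointer plus a 3-bit type
       tag in the low bits (heap objects are at least 8-byte aligned,
       so those bits are free).  Assumes a 48-bit address space. */
    typedef struct { uint64_t bits; } Value;

    #define QNAN     0x7ff8000000000000ull  /* quiet-NaN prefix        */
    #define PTR_MASK 0x0000fffffffffff8ull  /* aligned 48-bit pointer  */
    #define TAG_MASK 0x0000000000000007ull  /* low 3 bits: type tag    */
    #define TAG_PAIR 0x1ull                 /* one of 8 possible types */

    static Value box_double(double d) {
        Value v;
        memcpy(&v.bits, &d, sizeof d);      /* doubles pass through    */
        return v;
    }

    static Value box_pair(Value *contents) {  /* contents: a [2]Value  */
        Value v;
        v.bits = QNAN | ((uint64_t)(uintptr_t)contents & PTR_MASK) | TAG_PAIR;
        return v;
    }

    static int is_pair(Value v) {
        return (v.bits & QNAN) == QNAN && (v.bits & TAG_MASK) == TAG_PAIR;
    }

    static Value *pair_contents(Value v) {
        return (Value *)(uintptr_t)(v.bits & PTR_MASK);
    }

With this, car and cdr are just `pair_contents(v)[0]` and
`pair_contents(v)[1]`; no header word is needed on the heap.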

This seems optimal; you have zero meta-data on the heap itself.  But
pairs are ubiquitously used to implement linked lists in Lisp, and
that includes a lot of inherent overhead: every single element must
keep a pointer to the next element.  So, if you have a list of five
elements, you have five `[2]Value` on the heap, i.e., ten values!

Can't we improve on that, somehow?

In Zisp's NaN-packing strategy, there's an entire unused bit on heap
pointers.  (There's also a whole other set of possible pointer values
with a 50-bit payload, so really, we have a *ton* of unused pointer
space.)

What if that bit indicated that the pointer, well, let's say "has to
do" with pairs?  (I'll explain.)  When that bit is set, the three type
tag bits become free for reuse, since the bit being *unset* is what
marks the pointer as belonging to one of the other types.  So, we can
encode eight "sub-types" of this pair-related pointer.

The actual pointer would point to an `[8]Value` instead of a
`[2]Value`, with the "type" tag acting as something akin to an index.
(It's not an actual index; read on.)

When you first use cons, say on a symbol `x` and nil, to get a list of
one element, it uses two of the eight slots:

    [ _ _ _ _ _ _ x () ]  (tag = 0)

(Let's call the operation that does this, i.e., allocates an array of
eight and puts the arguments into positions 6 and 7, `alloc-cons`.
We will use this term later.)

The pointer points to the start of the array, the "type tag" is zero,
and car and cdr are implemented as:

    (car p)  =  p[6 - tag]

    (cdr p)  =  p[7 - tag]  ??? (read on)

When you call cons again and the second argument (the new cdr) is a
pointer like the above, then instead of allocating a new array, cons
puts the new car into the existing array and returns a modified
pointer; say we did `(cons y pair)`:

    [ _ _ _ _ _ y x () ]  (tag = 1)

(The operation that does this, i.e., puts the first argument to cons
into ptr[5 - tag] and returns ptr with tag + 1, is `shift-cons`.)

The tag on the pointer now says 1.  You may notice that if cdr were
truly implemented as above, it wouldn't actually give us the cdr of
the list any more; it would give us `(car (cdr list))`, so we need to
make a change:

    (cdr p)  =  if tag = 0: p[7]; else: p with tag - 1

We introduced a branch instruction, which may come back to bite us,
but our car and cdr are now functional again.

OK, but what if we run out of space in the array?  Cons also needs a
branch, or multiple if we count the one already implied above:

    (cons a b)  =  if b is pair and b.tag < 6:
                     (shift-cons a b)
                   else if b is pair:
                     (alloc-cons a b)
                   else:
                     (alloc-cons a b)

Oh, would you look at that: the last two branches are identical, so
actually just one branch instruction is enough after all:

    if b is pair and b.tag < 6:
      (shift-cons a b)
    else:
      (alloc-cons a b)

I *think* this pseudo-code is complete now?  If cons, car, and cdr
were to be implemented this way, I think we get an API that's fully
compatible with traditional cons/car/cdr and the optimization is
completely transparent.

Note that we haven't used tag value 7.  Maybe that's fine, or maybe
it's possible to exploit that somehow; let's ignore it for now.

(We could simply use `[9]Value` and then all 8 tag values would be
used, but that's probably a bad idea due to cache line shenanigans;
more on that later.)


## Comparison

Let's compare the in-memory representation of traditional pairs and
our optimized "pairs" when both are used to implement a linked list
with ten elements:

    ;; traditional
    list = [ e0 cdr1 ]
    cdr1 = [ e1 cdr2 ]
    cdr2 = [ e2 cdr3 ]
    cdr3 = [ e3 cdr4 ]
    cdr4 = [ e4 cdr5 ]
    cdr5 = [ e5 cdr6 ]
    cdr6 = [ e6 cdr7 ]
    cdr7 = [ e7 cdr8 ]
    cdr8 = [ e8 cdr9 ]
    cdr9 = [ e9  ()  ]

    ;; optimized
    list = [ _  _  _  _  e0 e1 e2 cdr1 ] (tag = 2)
    cdr1 = [ e3 e4 e5 e6 e7 e8 e9  ()  ] (tag = 6)
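
For reference, consing the list up from its last element produces
exactly that layout; only two of the ten cons calls allocate:

    (cons e9 '())   ->  alloc-cons: cdr1 = [ _ _ _ _ _ _ e9 () ]    (tag = 0)
    (cons e8 cdr1)  ->  shift-cons: tag = 1
        ...
    (cons e3 cdr1)  ->  shift-cons: tag = 6, cdr1 is now full
    (cons e2 cdr1)  ->  alloc-cons: list = [ _ _ _ _ _ _ e2 cdr1 ]  (tag = 0)
    (cons e1 list)  ->  shift-cons: tag = 1
    (cons e0 list)  ->  shift-cons: tag = 2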

Let's look at the pros and cons.

<style>
td:first-child { font-weight: bold; }
td:not(:first-child) { font-family: monospace; }
</style>

|          | Traditional              | Optimized                    |
|----------|--------------------------|------------------------------|
| cons     | alloc, set car & cdr     | see above (branches)         |
| car      | ptr[0]                   | ptr[6 - tag]                 |
| cdr      | ptr[1]                   | see above (branches)         |
| Memory¹  | n * 2                    | (floor((n - 1) / 7) + 1) * 8 |
| Locality | Bad                      | Better (?)                   |
| Derefs²  | n                        | n/7 (I think)                |

¹ Counted in `Value`-sized slots on the heap.

² Pointer dereferences when looping through the whole list.

I haven't thought too hard about some of the numbers, but it's not
terribly important; this is to get a general idea.

Traditional cons always allocates; ours has a branch instruction, but
in exchange it can elide the allocation most of the time: when
building up a linked list, I think 6 out of every 7 cons calls *don't*
require an allocation.

The implementation of car became ever so slightly slower, requiring a
small arithmetic operation.  Cdr became even more complicated, with a
branch instruction, but most of the time it can actually elide the
pointer dereference; I think it's again 6 out of every 7 calls, since
only the tag = 0 case has to touch the heap.

The strange formula for memory use is because we alloc in chunks of 8,
but use one slot for the final cdr.  Anyhow, in the worst case, we
will be using 8 rather than 2 slots for a single-element list; the
break-even point is 4 elements, where both implementations use 8
slots.  (Then there's the 8-element case where both use 16 slots, but
after that our implementation always wins.)
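
The formula is just `ceil(n / 7) * 8`.  As a quick sanity check, here
are the slot counts for a few list lengths:

    n = 1:    traditional   2,  optimized  8
    n = 4:    traditional   8,  optimized  8   (break-even)
    n = 8:    traditional  16,  optimized 16   (tie again)
    n = 10:   traditional  20,  optimized 16
    n = 70:   traditional 140,  optimized 80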

The fewer pointer derefs should help a bit, although it's likely that
the pairs making up a traditional linked list were all allocated in
close proximity, and CPUs fetch memory from main RAM into the cache a
full cache line at a time, no matter how little of it you actually
asked for.  The most common cache line size nowadays is 64 bytes,
which is exactly `[8]Value`.  This means that, using traditional cons
cells, you'd be fetching up to 4 of them at once every time you deref
a pointer, assuming they were all allocated next to each other.
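
In other words, assuming 8-byte `Value`s and 64-byte cache lines:

    sizeof(Value)            =  8 bytes
    [2]Value (one cons cell) = 16 bytes  ->  4 cells per cache line
    [8]Value (one chunk)     = 64 bytes  ->  exactly 1 cache line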

Memory locality is better on paper, but only if you don't think too
hard about it.  We pack up to seven elements into one contiguous
chunk, whereas the traditional implementation could suffer from a lot
of fragmentation: each pair could end up in a different cache-line
sized segment on the heap, meaning every element requires a main
memory fetch.  But that's unlikely; the pairs are much more likely to
be adjacent on the heap.  Further, what if we have tons of tiny lists,
with one to four elements per list?  Now our strategy forces every
list into its own cache-line sized segment, whereas some of the lists
using the traditional implementation could end up in a single 64-byte
segment and thus be fetched together.  A cursory glance at some Scheme
code of mine indicates that there are indeed tons of lists with fewer
than 5 elements!


## And that was just linked lists

Pairs are not only used for linked lists.  They could be the slots of
an alist, they could be nodes in a tree data structure, and so on.

Forcing every cons cell to allocate 64 bytes could thrash the cache in
these cases, and actually increase the total memory use as well, as
most of the `Value`-sized slots may end up being wasted.
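
For a rough count, take an alist with m entries, and assume none of
the values are themselves pairs: each `(key . value)` entry is a cons
whose cdr is not a pair, so it always gets a fresh 8-slot chunk, and
only the spine of the alist benefits from packing:

    traditional:  m * (2 + 2)            =  4m slots
    optimized:    m * 8 + ceil(m/7) * 8  ~  9.1m slots

That's more than twice the memory, with most of the slots left empty.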

I won't implement the strategy mentioned above yet, but if I do it
eventually, it will be very interesting to see its performance in a
variety of benchmarks, reflecting common usage patterns.