# Cons cell optimization again

_2026 June_

Last episode: [Cons cell optimization!](260608-fastcons2.html)

Given the sub-optimal situation with these kinds of optimizations for
manual uses of `cons` in program code, perhaps it's best to limit the
scope of this optimization:

* Regular `cons` creates normal pairs; `car` and `cdr` only work on
  normal pairs.

* Programmers can explicitly use the array-backed list data type if
  they want to; it's simply another data structure available in the
  standard library, alongside things like dynamic arrays that grow
  upon saturation and move their entire contents.

  Note: Those dynamic regrowing arrays are what are typically called
  an ArrayList data structure, like in Zig or Java.  In this article
  I'm using terms like array-backed list or array-list for my jumbo
  cons cell concept instead.  I believe some call this SegmentedList.

* We break from Lisp tradition and *don't* return regular pairs from
  s-expression parsing; we return array-backed lists.  This is a sane
  choice because:

  1. A parser knows the length of any list it's allocating; why use a
     pair-based linked list structure?

  2. One of the primary users of the parser is the interpreter; it's
     greatly beneficial for its performance to use contiguous arrays
     for code forms to evaluate.

  It just means that when you parse some data, you need to make sure
  to use the array-list API rather than the cons cell API to traverse
  the data you've parsed.  I think that's fine.

Given that these array-lists don't pretend to be cons cells anymore,
the whole "sentinel bit pattern at the start" thing isn't necessary
anymore; programmers using this data type simply need to understand
that it's a different data structure and they can't simply acquire a
pointer to the "rest" of the list from any arbitrary point.

We don't need a `cdr` equivalent for this data type at all; we offer
other functions to traverse it, segment by segment.  You can't get a
pointer to the middle of a segment; you can only get a pointer to the
head of the next segment.

So, the 3-bit or 4-bit tag on the pointer gets a slightly different
meaning: It's simply the length of the current segment.  Well, it's
actually the same exact meaning as before, and technically you could
still construct a pointer to the middle of a segment with the length
tag indicating the remaining length of that segment, but it's simply
not going to be supported; the GC could collect the segment, leaving
you with a pointer to invalid memory.

Also, this means that any length other than the maximum implies that
the list ends there; you'd never put a pointer to another segment to
the end of an unsaturated segment, because why would you.

Let's look at a few examples of such array-lists in memory.  We'll
assume a 3-bit tag for simplicity; you can imagine what a 4-bit tag
would do.

Single-element proper list:

    tag = 0

    0       1       2       3       4       5       6       7

    foo     nil

(Earlier I wrote that the tag is "simply the length of the current
segment" which was a lie: The length is tag + 2 and the last slot in
the segment is nil or a pointer to another segment or the end of an
improper list.)

An improper list with 3 elements and non-nil tail:

    tag = 2

    0       1       2       3       4       5       6       7

    foo     bar     baz     bat

A proper list filing up a cache line perfectly:

    tag = 6

    0       1       2       3       4       5       6       7

    foo     bar     baz     bat     qux     fux     tux     nil

Tag value 7 means the list *might* be longer; this is a bit awkward
because the tail sits on the next cache line:

    tag = 7

    0       1       2       3       4       5       6       7       8

    a       b       c       d       e       f       g       h     ptr

However, if the list actually is longer, the elements could follow
immediately after that pointer, because hopefully the whole structure
was allocated in one contiguous array.  These continuation pointers
are, in a sense, an artifact resulting from the fact that we can't
otherwise represent longer lists, since the length bits are scarce.

This is fine, especially if the length bits are extended to 4, since
the point is using this for the kinds of short lists that are very
common in Lisp/Scheme code.  How often does source code contain an
s-expression longer than 15 elements?  (That's tag value 14, with
trailing nil to end it as a proper list; 16 slots total filling two
cache lines perfectly.)

## Why improper lists at all?

I've asked myself whether it would make sense to ditch improper lists
and make the terminating nil implicit for non-maximum length segments
but I don't like that idea.

Improper lists kinda have their place in Lisp/Scheme code, such as to
represent rest arguments.  And besides, the semantics get quite wonky
if the terminating nil is implicit on non-max length; you still need
it on max length if the list actually does end after that, and then
what do you do if it's not nil and also not a segment pointer?  Now
you have improper lists again, but they can only have a non-tail
element count equal to a multiple of 8 or 16, which is weird.

Plus, there may actually be valid use-cases where you create segments
of non-max length, and then put a pointer to another segment at the
end anyway.  The parser has no need for this, but someone using this
segment-array-linked-list structure for some special purpose could
find a use for this, perhaps.

## Closing up

I think this will be a very fine optimization with respect to the
operation of the interpreter.  I've lots of ideas in mind on how to
make it almost like a bytecode VM despite taking tree data.  As an
example, after the first time a list is evaluated as a function call,
verifying that it's a proper list with an identifier referring to a
function in the first position, it can be modified as follows:

The head of the list is replaced with a special NaN-packed value of
type "call instruction" that encodes a pointer to a function as a
payload directly in the NaN.  This instruction implies that the list
has been verified to be proper, so the length tag on the pointer that
led us to this list can be relied on directly, not requiring another
verification that the call expression (list) is well-formed (proper).

Bytecode is inherently linear, whereas a tree structure inherently
requires some pointer jumping.  Consider the expression:

    (+ x (* y (/ z t)))

As bytecode, this would be turned into instructions that first call
the division operator, put the result in a register or the stack,
continue with the multiplication, and so on.

As a tree, even if you did some pre-verification and turned symbols
into direct call instructions embedding a pointer and so on, you're
still dependent on recursion to evaluate the innermost form first,
store the result, walk back up a level, and so on.

Other than by a complete transformation of the code, extracting inner
expressions and putting their results into local variables, you can't
make this as nice to execute as bytecode.  I wonder how much overhead
that really causes though, if the interpreter is otherwise optimized
well and rewrites s-expressions in-place to save future work, and the
various array-lists the code is made up of all fit neatly into L1.

You may be wondering why I don't just go the bytecode VM route, but
I've already wandered off from the topic far enough, so maybe I'll
write a separate note about that.