summaryrefslogtreecommitdiff
path: root/doc/0/0-value.md
blob: 4bb7e0c6e75936fc981336fca081079af0c0f1fd (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
# NaN-packed Value representation

The format of a binary64 floating-point number, in big-endian notation:

    { sign: 1 bit, exponent: 11 bits, fraction: 52 bits }

When the 11 exponent bits are all set, it's either a NaN or an Infinity.

For value packing, the remaining 53 bits are available, giving us `2^53` values
minus the following four bit patterns:

    *** FORBIDDEN BIT-PATTERNS ***

    1. Negative cqNaN    :: { sign = 1, exponent = MAX, fraction = 10000... }

    2. Negative Infinity :: { sign = 1, exponent = MAX, fraction = 00000... }

    3. Positive cqNaN    :: { sign = 0, exponent = MAX, fraction = 10000... }

    4. Positive Infinity :: { sign = 0, exponent = MAX, fraction = 00000... }

The abbreviation "cqNaN" stands for canonical quiet NaN.

The MSb of the fraction is also called the `is_quiet` flag, because it marks a
NaN as being "quiet" rather than signaling.  The rest of the fraction being all
zero makes it the *canonical* quiet NaN for the given sign value.

The positive and negative cqNaN are the *only* NaN values that can actually be
returned by FP operations.  This is convenient, because it means we can simply
use them to represent themselves in Zisp.

Infinity values may also be returned by FP operations, and we want them to also
exit in Zisp, so they also represent themselves.

Beyond those four bit patterns, all values with a maximum exponent (all bits
set) are fair game for representing other values, so `2^53 - 4` possibilities.

We split those `2^53 - 4` available values into four groups, each allowing for
`2^51 - 1` different values to be encoded.  (51-bit values excluding zero.)

    sign = 1, quiet = 1 :: Negative Fixnum from -1 to -2^51+1

    sign = 1, quiet = 0 :: Positive Fixnum from 0 to 2^51-2

    sign = 0, quiet = 1 :: Pointers and various immediates

    sign = 0, quiet = 0 :: Internal use by interpreter


## Fixnums

Negative fixnums actually represent themselves, without needing to go through
any transformation.  Only the smallest 52-bit signed negative, `-2^51`, cannot
be represented, as it would step on Forbidden Value #1, Negative cqNaN.

Positive fixnums go through a bitsiwe NOT (which can be implemented as an XOR
mask combining it with removal of NaN-related high bits) to avoid the all-zero
payload value, which would step on Forbidden Value #2, Negative Infinity.


## Pointers and immediates

This region of 51-bit values is divided as follows, based on the three highest
bits, providing a payload value of 48 bits for each.

    000   :: Pointer to heap object  (type-tagged)

    001   :: Pointer to list values  (length-tagged)

    010   :: Pointer to istr object

    011   :: Immediate short string

    100   :: Immediate small rational  (sign bit 0)

    101   :: Immediate small rational  (sign bit 1)

    110   :: Undefined

    111   :: Immediate types further subdivided as follows:

             0....... 0....... 0....... (etc.)  :: Rune

             1.......                           :: 128 40-bit types

             0....... 1.......                  :: 16384 32-bit types

             0....... 0....... 1.......         :: 2097152 24-bit types

             (etc.)

Forbidden Value #3, Positive cqNaN, is avoided by not using 0 as a valid type
tag value for heap pointers.

### Type-tagged heap pointers

Regular heap objects are allocated with 16-byte alignment, meaning the lowest
four bits are naturally zero.  We exploit this by shifting down the address by
four bits, making room for more tag bits immediately following the 16 high bits
that mark the value as a heap pointer.

Thus, a value can be checked against a specific heap object type by comparing
the 20 high bits to a combined constant value: the 16 highest bits indicating
that it's a heap pointer, and the 4 bits after that denoting the heap type.

### Length-tagged list pointers

Lists are arrays of Value objects, allocated without padding (8-byte alignment)
for efficient source code representation and traversal.  Since the lowest three
bits are naturally zero, we use them as a 3-bit length information tag.

A length tag value of 1 to 7 means there are exactly that many Value objects
starting at the address, while 0 means there is an array of at least eight,
terminated with a special 64-bit sentinel bit-pattern that is not otherwise
valid as a Zisp value.

### Interned string pointers

Interned string (istr) objects may be unaligned, so the low bits of the pointer
are not used for any special purpose.  Having a separate category for this type
of pointer also streamlines the interpreter implementation

### Short strings

This 48-bit range is used for strings of zero to six bytes.  These are NUL
terminated unless exactly six bytes, meaning that a literal NUL byte cannot
appear in them, but otherwise they allow arbitrary byte values.

### Small rationals

We use a 49-bit space for small rational numbers, with a signed 25-bit two's
complement integer numerator, and 24-bit unsigned integer denominator.

### Runes and other small values

Runes are symbols of up to 6 ASCII characters in length, used to implement
extensible reader syntax.  (See Zisp decoder.)  They cannot contain the NUL
byte, as they are NUL-terminated unless exactly six ASCII bytes in length.

NOTE: The order in which the characters of the rune are encoded depends on
endianness.  On little-endian systems (i.e. most modern architectures) the
characters will be in "reverse" order, with the first character in lowest
position, so the terminating NUL has to be searched from low to high.

The fact that runes are limited to ASCII bytes, whose MSb is unset, opens up
some space for other small values to co-inhabit the same 48-bit value range.

We divide this space into increasingly many potential types, with smaller and
smaller payloads, where the highest byte with a non-zero MSb determines which
size category we're in: If the highest byte has its MSb set, then the other
seven bits are a type tag, and each type has a 40-bit payload; if the second
highest byte has its MSb set, then the 14 non-MSb bits of the two high bytes
define the type, and each has a 32-bit payload; and so on.

Unicode code points need 21 bits, so we use a 24-bit type for the Character
type.  Miscellaneous values like True, False, EOF, etc. are placed in an 8-bit
type, since there will never be that many of them.

A virtually unlimited number of user-defined enum types can fit into the types
with small payload values here: There is room for over 268 Million 16-bit types
(28-bit type tag) and over 34 Billion 8-bit types (35-bit type tag).


## Internal use values

The final 51-bit range is used for various internal purposes by the Zisp
interpreter, mostly related to transparent code optimization.

  000   :: Pointer to heap object as constant

  001   :: Pointer to list values as constant

  010   :: Pointer to istr object as constant

  011   :: Immediate short string as constant

  100   :: Local variable reference by index

  101   :: Pointer to constant function-call expression

  110   :: Pointer to variable function-call expression

  111   :: Pointer to special-form or macro-call expression

Forbidden Value #4, Positive Infinity, is avoided thanks to the fact that heap
pointers always have a non-zero heap-type tag.  (See further above.)

The first four categories simply mirror those of the previous 51-bit range, but
mark the values as being constants rather than code to evaluate.

The remaining four categories are somewhat similar to VM instructions.



<!--
;; Local Variables:
;; fill-column: 80
;; End:
-->