From b84ed4f563b3536365f7d3cc4d068407e98685b3 Mon Sep 17 00:00:00 2001 From: Taylan Kammer Date: Sat, 20 Jun 2026 22:53:50 +0200 Subject: It's a revolution baby. --- doc/0/0-value.md | 199 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 199 insertions(+) create mode 100644 doc/0/0-value.md (limited to 'doc/0/0-value.md') diff --git a/doc/0/0-value.md b/doc/0/0-value.md new file mode 100644 index 0000000..4bb7e0c --- /dev/null +++ b/doc/0/0-value.md @@ -0,0 +1,199 @@ +# NaN-packed Value representation + +The format of a binary64 floating-point number, in big-endian notation: + + { sign: 1 bit, exponent: 11 bits, fraction: 52 bits } + +When the 11 exponent bits are all set, it's either a NaN or an Infinity. + +For value packing, the remaining 53 bits are available, giving us `2^53` values +minus the following four bit patterns: + + *** FORBIDDEN BIT-PATTERNS *** + + 1. Negative cqNaN :: { sign = 1, exponent = MAX, fraction = 10000... } + + 2. Negative Infinity :: { sign = 1, exponent = MAX, fraction = 00000... } + + 3. Positive cqNaN :: { sign = 0, exponent = MAX, fraction = 10000... } + + 4. Positive Infinity :: { sign = 0, exponent = MAX, fraction = 00000... } + +The abbreviation "cqNaN" stands for canonical quiet NaN. + +The MSb of the fraction is also called the `is_quiet` flag, because it marks a +NaN as being "quiet" rather than signaling. The rest of the fraction being all +zero makes it the *canonical* quiet NaN for the given sign value. + +The positive and negative cqNaN are the *only* NaN values that can actually be +returned by FP operations. This is convenient, because it means we can simply +use them to represent themselves in Zisp. + +Infinity values may also be returned by FP operations, and we want them to also +exit in Zisp, so they also represent themselves. + +Beyond those four bit patterns, all values with a maximum exponent (all bits +set) are fair game for representing other values, so `2^53 - 4` possibilities. + +We split those `2^53 - 4` available values into four groups, each allowing for +`2^51 - 1` different values to be encoded. (51-bit values excluding zero.) + + sign = 1, quiet = 1 :: Negative Fixnum from -1 to -2^51+1 + + sign = 1, quiet = 0 :: Positive Fixnum from 0 to 2^51-2 + + sign = 0, quiet = 1 :: Pointers and various immediates + + sign = 0, quiet = 0 :: Internal use by interpreter + + +## Fixnums + +Negative fixnums actually represent themselves, without needing to go through +any transformation. Only the smallest 52-bit signed negative, `-2^51`, cannot +be represented, as it would step on Forbidden Value #1, Negative cqNaN. + +Positive fixnums go through a bitsiwe NOT (which can be implemented as an XOR +mask combining it with removal of NaN-related high bits) to avoid the all-zero +payload value, which would step on Forbidden Value #2, Negative Infinity. + + +## Pointers and immediates + +This region of 51-bit values is divided as follows, based on the three highest +bits, providing a payload value of 48 bits for each. + + 000 :: Pointer to heap object (type-tagged) + + 001 :: Pointer to list values (length-tagged) + + 010 :: Pointer to istr object + + 011 :: Immediate short string + + 100 :: Immediate small rational (sign bit 0) + + 101 :: Immediate small rational (sign bit 1) + + 110 :: Undefined + + 111 :: Immediate types further subdivided as follows: + + 0....... 0....... 0....... (etc.) :: Rune + + 1....... :: 128 40-bit types + + 0....... 1....... :: 16384 32-bit types + + 0....... 0....... 1....... :: 2097152 24-bit types + + (etc.) + +Forbidden Value #3, Positive cqNaN, is avoided by not using 0 as a valid type +tag value for heap pointers. + +### Type-tagged heap pointers + +Regular heap objects are allocated with 16-byte alignment, meaning the lowest +four bits are naturally zero. We exploit this by shifting down the address by +four bits, making room for more tag bits immediately following the 16 high bits +that mark the value as a heap pointer. + +Thus, a value can be checked against a specific heap object type by comparing +the 20 high bits to a combined constant value: the 16 highest bits indicating +that it's a heap pointer, and the 4 bits after that denoting the heap type. + +### Length-tagged list pointers + +Lists are arrays of Value objects, allocated without padding (8-byte alignment) +for efficient source code representation and traversal. Since the lowest three +bits are naturally zero, we use them as a 3-bit length information tag. + +A length tag value of 1 to 7 means there are exactly that many Value objects +starting at the address, while 0 means there is an array of at least eight, +terminated with a special 64-bit sentinel bit-pattern that is not otherwise +valid as a Zisp value. + +### Interned string pointers + +Interned string (istr) objects may be unaligned, so the low bits of the pointer +are not used for any special purpose. Having a separate category for this type +of pointer also streamlines the interpreter implementation + +### Short strings + +This 48-bit range is used for strings of zero to six bytes. These are NUL +terminated unless exactly six bytes, meaning that a literal NUL byte cannot +appear in them, but otherwise they allow arbitrary byte values. + +### Small rationals + +We use a 49-bit space for small rational numbers, with a signed 25-bit two's +complement integer numerator, and 24-bit unsigned integer denominator. + +### Runes and other small values + +Runes are symbols of up to 6 ASCII characters in length, used to implement +extensible reader syntax. (See Zisp decoder.) They cannot contain the NUL +byte, as they are NUL-terminated unless exactly six ASCII bytes in length. + +NOTE: The order in which the characters of the rune are encoded depends on +endianness. On little-endian systems (i.e. most modern architectures) the +characters will be in "reverse" order, with the first character in lowest +position, so the terminating NUL has to be searched from low to high. + +The fact that runes are limited to ASCII bytes, whose MSb is unset, opens up +some space for other small values to co-inhabit the same 48-bit value range. + +We divide this space into increasingly many potential types, with smaller and +smaller payloads, where the highest byte with a non-zero MSb determines which +size category we're in: If the highest byte has its MSb set, then the other +seven bits are a type tag, and each type has a 40-bit payload; if the second +highest byte has its MSb set, then the 14 non-MSb bits of the two high bytes +define the type, and each has a 32-bit payload; and so on. + +Unicode code points need 21 bits, so we use a 24-bit type for the Character +type. Miscellaneous values like True, False, EOF, etc. are placed in an 8-bit +type, since there will never be that many of them. + +A virtually unlimited number of user-defined enum types can fit into the types +with small payload values here: There is room for over 268 Million 16-bit types +(28-bit type tag) and over 34 Billion 8-bit types (35-bit type tag). + + +## Internal use values + +The final 51-bit range is used for various internal purposes by the Zisp +interpreter, mostly related to transparent code optimization. + + 000 :: Pointer to heap object as constant + + 001 :: Pointer to list values as constant + + 010 :: Pointer to istr object as constant + + 011 :: Immediate short string as constant + + 100 :: Local variable reference by index + + 101 :: Pointer to constant function-call expression + + 110 :: Pointer to variable function-call expression + + 111 :: Pointer to special-form or macro-call expression + +Forbidden Value #4, Positive Infinity, is avoided thanks to the fact that heap +pointers always have a non-zero heap-type tag. (See further above.) + +The first four categories simply mirror those of the previous 51-bit range, but +mark the values as being constants rather than code to evaluate. + +The remaining four categories are somewhat similar to VM instructions. + + + + -- cgit v1.2.3