From b84ed4f563b3536365f7d3cc4d068407e98685b3 Mon Sep 17 00:00:00 2001 From: Taylan Kammer Date: Sat, 20 Jun 2026 22:53:50 +0200 Subject: It's a revolution baby. --- doc/0/0-value.md | 199 +++++++++++++++++ doc/0/1-parse.md | 595 +++++++++++++++++++++++++++++++++++++++++++++++++ doc/0/2-decode.md | 44 ++++ doc/0/grammar/abnf.txt | 139 ++++++++++++ doc/0/grammar/index.md | 115 ++++++++++ doc/0/grammar/peg.txt | 91 ++++++++ doc/0/grammar/zbnf.txt | 75 +++++++ doc/0/index.md | 34 +++ 8 files changed, 1292 insertions(+) create mode 100644 doc/0/0-value.md create mode 100644 doc/0/1-parse.md create mode 100644 doc/0/2-decode.md create mode 100644 doc/0/grammar/abnf.txt create mode 100644 doc/0/grammar/index.md create mode 100644 doc/0/grammar/peg.txt create mode 100644 doc/0/grammar/zbnf.txt create mode 100644 doc/0/index.md (limited to 'doc/0') diff --git a/doc/0/0-value.md b/doc/0/0-value.md new file mode 100644 index 0000000..4bb7e0c --- /dev/null +++ b/doc/0/0-value.md @@ -0,0 +1,199 @@ +# NaN-packed Value representation + +The format of a binary64 floating-point number, in big-endian notation: + + { sign: 1 bit, exponent: 11 bits, fraction: 52 bits } + +When the 11 exponent bits are all set, it's either a NaN or an Infinity. + +For value packing, the remaining 53 bits are available, giving us `2^53` values +minus the following four bit patterns: + + *** FORBIDDEN BIT-PATTERNS *** + + 1. Negative cqNaN :: { sign = 1, exponent = MAX, fraction = 10000... } + + 2. Negative Infinity :: { sign = 1, exponent = MAX, fraction = 00000... } + + 3. Positive cqNaN :: { sign = 0, exponent = MAX, fraction = 10000... } + + 4. Positive Infinity :: { sign = 0, exponent = MAX, fraction = 00000... } + +The abbreviation "cqNaN" stands for canonical quiet NaN. + +The MSb of the fraction is also called the `is_quiet` flag, because it marks a +NaN as being "quiet" rather than signaling. The rest of the fraction being all +zero makes it the *canonical* quiet NaN for the given sign value. + +The positive and negative cqNaN are the *only* NaN values that can actually be +returned by FP operations. This is convenient, because it means we can simply +use them to represent themselves in Zisp. + +Infinity values may also be returned by FP operations, and we want them to also +exit in Zisp, so they also represent themselves. + +Beyond those four bit patterns, all values with a maximum exponent (all bits +set) are fair game for representing other values, so `2^53 - 4` possibilities. + +We split those `2^53 - 4` available values into four groups, each allowing for +`2^51 - 1` different values to be encoded. (51-bit values excluding zero.) + + sign = 1, quiet = 1 :: Negative Fixnum from -1 to -2^51+1 + + sign = 1, quiet = 0 :: Positive Fixnum from 0 to 2^51-2 + + sign = 0, quiet = 1 :: Pointers and various immediates + + sign = 0, quiet = 0 :: Internal use by interpreter + + +## Fixnums + +Negative fixnums actually represent themselves, without needing to go through +any transformation. Only the smallest 52-bit signed negative, `-2^51`, cannot +be represented, as it would step on Forbidden Value #1, Negative cqNaN. + +Positive fixnums go through a bitsiwe NOT (which can be implemented as an XOR +mask combining it with removal of NaN-related high bits) to avoid the all-zero +payload value, which would step on Forbidden Value #2, Negative Infinity. + + +## Pointers and immediates + +This region of 51-bit values is divided as follows, based on the three highest +bits, providing a payload value of 48 bits for each. + + 000 :: Pointer to heap object (type-tagged) + + 001 :: Pointer to list values (length-tagged) + + 010 :: Pointer to istr object + + 011 :: Immediate short string + + 100 :: Immediate small rational (sign bit 0) + + 101 :: Immediate small rational (sign bit 1) + + 110 :: Undefined + + 111 :: Immediate types further subdivided as follows: + + 0....... 0....... 0....... (etc.) :: Rune + + 1....... :: 128 40-bit types + + 0....... 1....... :: 16384 32-bit types + + 0....... 0....... 1....... :: 2097152 24-bit types + + (etc.) + +Forbidden Value #3, Positive cqNaN, is avoided by not using 0 as a valid type +tag value for heap pointers. + +### Type-tagged heap pointers + +Regular heap objects are allocated with 16-byte alignment, meaning the lowest +four bits are naturally zero. We exploit this by shifting down the address by +four bits, making room for more tag bits immediately following the 16 high bits +that mark the value as a heap pointer. + +Thus, a value can be checked against a specific heap object type by comparing +the 20 high bits to a combined constant value: the 16 highest bits indicating +that it's a heap pointer, and the 4 bits after that denoting the heap type. + +### Length-tagged list pointers + +Lists are arrays of Value objects, allocated without padding (8-byte alignment) +for efficient source code representation and traversal. Since the lowest three +bits are naturally zero, we use them as a 3-bit length information tag. + +A length tag value of 1 to 7 means there are exactly that many Value objects +starting at the address, while 0 means there is an array of at least eight, +terminated with a special 64-bit sentinel bit-pattern that is not otherwise +valid as a Zisp value. + +### Interned string pointers + +Interned string (istr) objects may be unaligned, so the low bits of the pointer +are not used for any special purpose. Having a separate category for this type +of pointer also streamlines the interpreter implementation + +### Short strings + +This 48-bit range is used for strings of zero to six bytes. These are NUL +terminated unless exactly six bytes, meaning that a literal NUL byte cannot +appear in them, but otherwise they allow arbitrary byte values. + +### Small rationals + +We use a 49-bit space for small rational numbers, with a signed 25-bit two's +complement integer numerator, and 24-bit unsigned integer denominator. + +### Runes and other small values + +Runes are symbols of up to 6 ASCII characters in length, used to implement +extensible reader syntax. (See Zisp decoder.) They cannot contain the NUL +byte, as they are NUL-terminated unless exactly six ASCII bytes in length. + +NOTE: The order in which the characters of the rune are encoded depends on +endianness. On little-endian systems (i.e. most modern architectures) the +characters will be in "reverse" order, with the first character in lowest +position, so the terminating NUL has to be searched from low to high. + +The fact that runes are limited to ASCII bytes, whose MSb is unset, opens up +some space for other small values to co-inhabit the same 48-bit value range. + +We divide this space into increasingly many potential types, with smaller and +smaller payloads, where the highest byte with a non-zero MSb determines which +size category we're in: If the highest byte has its MSb set, then the other +seven bits are a type tag, and each type has a 40-bit payload; if the second +highest byte has its MSb set, then the 14 non-MSb bits of the two high bytes +define the type, and each has a 32-bit payload; and so on. + +Unicode code points need 21 bits, so we use a 24-bit type for the Character +type. Miscellaneous values like True, False, EOF, etc. are placed in an 8-bit +type, since there will never be that many of them. + +A virtually unlimited number of user-defined enum types can fit into the types +with small payload values here: There is room for over 268 Million 16-bit types +(28-bit type tag) and over 34 Billion 8-bit types (35-bit type tag). + + +## Internal use values + +The final 51-bit range is used for various internal purposes by the Zisp +interpreter, mostly related to transparent code optimization. + + 000 :: Pointer to heap object as constant + + 001 :: Pointer to list values as constant + + 010 :: Pointer to istr object as constant + + 011 :: Immediate short string as constant + + 100 :: Local variable reference by index + + 101 :: Pointer to constant function-call expression + + 110 :: Pointer to variable function-call expression + + 111 :: Pointer to special-form or macro-call expression + +Forbidden Value #4, Positive Infinity, is avoided thanks to the fact that heap +pointers always have a non-zero heap-type tag. (See further above.) + +The first four categories simply mirror those of the previous 51-bit range, but +mark the values as being constants rather than code to evaluate. + +The remaining four categories are somewhat similar to VM instructions. + + + + diff --git a/doc/0/1-parse.md b/doc/0/1-parse.md new file mode 100644 index 0000000..101a3b6 --- /dev/null +++ b/doc/0/1-parse.md @@ -0,0 +1,595 @@ +# Parser for Code & Data + + + +Zisp s-expressions represent an extremely minimal set of data types; only that +which is necessary to strategically construct more complex values: + + +---------+--------+----------+------+ + | String | Rune | List | Nil | + +---------+--------+----------+------+ + | foobar | #name | (X ...) | () | + +---------+--------+----------+------+ + +The parser recognizes various *syntax sugar* which abbreviates verbose syntax, +and may result in special data structures (typically, a list with a rune in its +first position) which another Zisp component called the *decoder* can transform +into a rich set of value types. + +More details about syntax sugar, and the decoder, are explained later. + + +## Character Encoding + +The parser does not consume Unicode characters; it consumes bytes. Grammar is +generally constructed by bytes corresponding to ASCII characters. + +Some elements of the grammar, such as comments and quoted strings, may contain +arbitrary byte sequences, until terminated. These sequences may happen to be +valid UTF-8 text. This way, quoted strings and comments may contain Unicode +text encoded in UTF-8, but the parser does not check these for validity. + +Since comments and quoted strings may contain arbitrary byte sequences, a text +editor or other program displaying Zisp s-expressions may need to use a special +visual representation for bytes that don't represent valid text. + +The parser working on bytes rather than Unicode characters is not a limitation, +but rather a feature: It allows Zisp s-expressions to be used as a structured +data exchange format, which may contain binary data elements, without the need +to encode these in Base64 or other such text representations of binary data. +Consider the example: + + ((image.webp "") + (video.webm "")) + +All that needs to be done for this to work, is that any incidental occurrences +of the double-quote sign, and the backslash sign, are escaped with a backslash +within the `` data; all other bytes can appear verbatim in the strings. + + +## Stream Parsing + +The parser can be repeatedly invoked on a byte stream to consume the next datum +within. This does not require "unreading" or back-seeking within the stream; +the parser always reads a full datum, and stops after some byte which cleanly +terminates the currently parsed datum. + +This means Zisp s-expressions can be safely intermixed with other data within +the same byte stream. So long as the other data is consumed by some parser +which similarly stops reading at a clear boundary, the Zisp parser can then +continue operating on the same stream. Consider the example: + + ("image.webp" 8273) + + << 8273 bytes >> + + ("video.webm" 736) + + << 736 bytes >> + +The "header" for each file in this stream is a Zisp s-expression containing +information about how many bytes should be read after the header, before the +next file header appears. (The header data need to be terminated with a blank +ASCII character such as a newline; the closing parenthesis does not act as a +terminator unto itself due to the "join" syntax sugar.) + +To enable this stream parsing strategy, the parser does not use any automatic +buffering. If it did, it might inadvertently consume some bytes beyond the +currently parsed datum, leaving the stream inconsistent. + +If the parser is meant to be used on an input stream associated with expensive +system calls, such as a file handle or network socket, it's best to wrap that +stream in some intermediate object which asks the system for large chunks of +data at once, and stores the data in a buffer. + + +## Comments + +Two types of comment are supported: datum comments and line comments. + +* A semicolon followed by a tilde instructs the parser to consume one datum and + discard it. Whitespace may appear between the tilde and the datum to discard. + +* A semicolon, followed by a non-tilde byte, instructs the parser to consume and + discard bytes until a newline (ASCII Line Feed) is encountered. + + +## Value vs. Datum + +A Zisp *value* that has an *external representation* in the form of a sequence +of bytes is called a *datum*. Every datum is a value, but not every value is a +datum. In other words, a datum is a value that can be printed out as a byte +sequence which the parser can turn back into an equivalent datum. + +A value that is not a datum may nevertheless be *encoded* into one, allowing it +to have an external representation. After parsing, it needs to be *decoded* to +actually become the expected value. + +One may speak of an *external representation of a value* where the value is not +itself a datum, but has an encoding as one. The more strictly correct term for +this is: "The external representation of the datum encoding the value." + +### Syntax sugar + +The parser recognizes various *syntax sugar* to abbreviate an equivalent datum +construction, or express a datum that encodes a more complex value. + +As an example, the expression `#(x y z)` is an abbreviation for the equivalent +`(#HASH x y z)`. These are two external representations for the same datum; +after parsing, both will yield values that are indistinguishable in all but +their memory address. + +An example of syntax sugar that is not a mere abbreviation is a quoted string +which contains bytes that could not appear in a *bare* string: + + "foo bar" -> (#DQUOTE ) + +In this example, the visual token `` represents the actual string value +in program memory, which has no direct external representation in bytes because +it contains a space character. + +Those familiar with Lisp and Scheme may expect bare strings to be parsed into a +separate type called *symbol* while quoted strings are parsed directly into a +string type, but this is not the case in Zisp. + +### Decoder + +The *decoder* transforms Zisp data into values of more complex types, including +values that are not of a datum type. + +Combined with syntax sugar, this allows Zisp to offer familiar syntax elements. +For example, the expression `#(x y z)` which parses into `(#HASH x y z)` can be +decoded into an array, so the result is similar to the vector syntax of Scheme. + +Decoding also resolves datum labels, goes over bare strings to find ones that +represent a number literal, and takes care of a number of other transforms. +This offloads complexity, allowing the parser to remain extremely simple. + +See the dedicated documentation of the [decoder](2-decode.html) for more. + + +## Data types + +Following is a more in-depth explanation of each data type constructed by the +Zisp s-expression parser. + +These are in fact value types, though the term "data type" is often used due to +familiarity. A Zisp value that is a member of one of the following value types +is only a *datum* if it adheres to additional constraints as explained below. + +### String + +Strings can appear *bare* or be quoted in various ways. A quoted string is in +fact parsed into a list value with a rune in the first position to identify the +quotation variant that was parsed, and the string value in the second position; +or, in case of at-quoted strings, a special construct we will look at later. + + +-----------+-------------------------------+ + | Syntax | Parse output | + +-----------+-------------------------------+ + | |bytes| | (#PQSTR ) | + +-----------+-------------------------------+ + | "bytes" | (#DQSTR ) | + +-----------+-------------------------------+ + | @_bytes_ | (#ATSTR ) | + +-----------+-------------------------------+ + +The visual token `` denotes the actual string, as a Zisp value, in the +second position of the list. The visual token `` stands for a Zisp +integer value between 0 and 254. + +These external representations of strings will be explained in more detail +further below, including backslash escape sequences allowed within, and how +exactly at-quoted strings work. + +Strings have a fixed length, counted in bytes. Each byte can have any value, +including zero (ASCII NUL). The parser reads bytes, not Unicode characters; a +string may contain UTF-8 byte sequences, but these are not tested for validity. + +A string that is up to 255 bytes long is automatically *interned*, meaning any +occurrence of the same string -- equal in length and containing the same byte +values -- ends up being represented by the same bit-pattern; either a memory +address, or an immediate representation within a CPU word for short strings. +The quotation method is inconsequential to this process; for example, while +`|foo bar|` and `"foo bar"` will parse into different list values, the actual +string they hold a reference to will be the same one in program memory. This +behavior is however configurable and can be disabled entirely for cases where +large numbers of arbitrary binary strings are being parsed. + +Strings of length greater than 255 bytes are stored separately in memory, even +if they are equal in length and content. + +### Rune + +A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must +begin with a letter, and may only contain letters and digits. This character +sequence of letters and digits is called the *name* of the rune. A rune that +follows this constraint is valid as a datum. + +Zisp code may explicitly construct values of the rune type that violate the +above constraints. Such runes are not valid data and cannot be printed or +parsed. + +Runes are case-sensitive, and the parser always emits runes using upper-case +letters when expressing syntax sugar. Uppercase rune names are reserved for +Zisp's internal use and standard library; users can use lowercase runes with +custom meaning without worrying about clashes, with the exception of a small +number of lowercase runes such as `#true` and `#false` that are part of the +default decoder settings and documented explicitly as such. + +Runes are always stored directly in a CPU word; never by memory address. + +### List + +A list is a contiguous array of one or more values in memory, whose length may +be encoded directly within the pointer to the head of the array, or else the +array is terminated with a special sentinel bit-pattern that is not otherwise +valid as a Zisp value. + +The parser allocates a unique array in program memory for every list, and the +list as a value is then represented by the memory address of that array, with +either an exact length tag or a tag indicating that it's sentinel-terminated. + +Lists are valid data if one of the following holds true: + +* The list encodes a quoted string, datum label, or shebang line. + +* All values in the list are a valid datum. + +Further, a structure of nested list values may not contain cyclic references +back up in the structure (which would make the above definition diverge into +infinity). Such cycles must be broken up with datum labels, or else the list +cannot be considered a datum, since it cannot be printed or parsed. + +### Nil + +The Zisp nil value is a singleton and a datum. There is exactly one nil value, +used in lieu of a list of zero length; it has the external representation `()`. + + +## Quoted strings + +Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted. +This section goes into the details of each variant. + +### Pipe-quoted + +Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers +the parser to generate a list with the structure: + + (#PQSTR ) ;; is visual aid, not syntax + +The decoder, using default settings, would emit this string verbatim as a value. +Then, during code evaluation, this would be seen as an identifier. In this way, +pipe-quoted strings are equivalent to bare strings in functionality. + +It is important to understand that the decoder sits between the parser and the +[evaluator](3-eval.html), and in opposition to Lisp and Scheme tradition, it is +common for the evaluator to receive values that are not valid as a datum; here, +a string unto itself that may not be a valid datum. Yet, it is valid as an +identifier for the purposes of the evaluator. + +### Double-quoted + +Strings wrapped in the double-quote symbol parse into: + + (#DQSTR ) ;; is visual aid, not syntax + +Under default settings, the decoder would transform this into a value which, +when evaluated as code, simply yields the contained string as a value. + +### At-quoted + +This is a special type of syntax for "raw" strings, meaning that no backslash +escapes nor any other kind of escape sequence are recognized within them. + +The syntax begins with an at sign, followed by any byte. That byte becomes a +termination marker, and the string cannot contain an occurrence of it, since +there are no escape sequences. The byte value 255 has a special meaning; see +further below. + + @"foo \ bar" -> (#ATSTR ) + +The visual tokens `` and `` represent an integer and string +value, respectively. Here, the integer would be 34, which is the ASCII value +for a double-quote sign. The string contains a literal backslash, since there +is no backslash escape parsing. + +This style of quoting can be useful, for instance, when representing regular +expressions as strings in code: + + ;; Matches e.g. foo\bar.["blah"] + + @/^foo\\(bar|baz)\.\[".*"\]$/ + +Were it not for this syntax, this regular expression would only be possible to +represent through a quoted string such as the following: + + ;; Same as above, but so many backslashes + + "^foo\\\\(bar|baz)\\t\\[\".*\"\\]$" + +The byte that follows the at sign need not be a printable character or even a +valid ASCII byte; it can be absolutely any byte value, even NUL. This can be +useful to easily encode binary data which is known to not contain a specific +byte; an example would be C strings which cannot contain NUL. + +If however the byte value is 255, then it does not stand for a sentinel, but +rather indicates that 6 more bytes follow, interpreted as a big-endian 48-bit +integer, which is the count of bytes making up the contents of the string. + +Example sequence of bytes, represented as a mixture of ASCII and raw integers: + + '@' 255 0 0 0 0 2 100 <612 bytes> -> (#ATSTR ) + +One may ask why the length is not included in the list. This is unnecessary, +since strings in Zisp already carry length information in their own metadata +structure. + + +### Backslash escapes + +In pipe-quoted and double-quoted strings, the following ASCII characters may +follow a backslash to insert a certain character. + + +-------+----------------------------+ + | Char | Meaning | + +-------+----------------------------+ + | \ | Literal backslash | + +-------+----------------------------+ + | | | Literal pipe symbol | + +-------+----------------------------+ + | " | Literal double-quote | + +-------+----------------------------+ + | 0 | ASCII NUL | + +-------+----------------------------+ + | a | ASCII Alert | + +-------+----------------------------+ + | b | ASCII Backspace | + +-------+----------------------------+ + | t | ASCII Tab (Horizontal) | + +-------+----------------------------+ + | n | ASCII Newline (Line Feed) | + +-------+----------------------------+ + | v | ASCII Vertical Tab | + +-------+----------------------------+ + | f | ASCII Form Feed | + +-------+----------------------------+ + | r | ASCII Carriage Return | + +-------+----------------------------+ + | e | ASCII Escape | + +-------+----------------------------+ + +In words: + +* A backslash, followed by a backslash, pipe, or double-quote character, is + substituted with a literal occurrence of that character. + +* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the + C programming language, representing common ASCII control characters. + +Further, the following Regular Expression patterns following a backslash have +special meaning. + + +---------------------+-----------------------+ + | Regular Expression | Meaning | + +---------------------+-----------------------+ + | [\t ]*\n[\t ]* | Discarded | + +---------------------+-----------------------+ + | x([0-9a-fA-F]{2})*; | Arbitrary bytes | + +---------------------+-----------------------+ + | u[0-9a-fA-F]+; | Unicode Scalar Value | + +---------------------+-----------------------+ + +Explanations: + +* A backslash followed by any number of blanks (space or tab), a newline, and + again any number of blanks, is substituted with nothing. This is to allow + splitting a string into multiple lines for human readability. + + (define p "This paragraph has been visually split into multiple \ + lines, but the newline is escaped, so it's one line.") + +* An x, followed by pairs of hexadecimal digits (case insensitive), terminated + by a semicolon, is substituted with the sequence of bytes represented by the + corresponding pairs of hexadecimal digits. E.g.: `"foo\xDEADBEEF;bar"` + +* A u, followed by a hexadecimal digit sequence (case insensitive), terminated + by a semicolon, is substituted with the canonical UTF-8 byte sequence for the + Unicode Scalar Value represented by that hexadecimal number. The number must + be in the range `0` to `10FFFF`. E.g.: `"foo\u00A0;bar"` + +### Newlines in strings + +Normally, a newline in a string has no special meaning and simply becomes part +of the string. However, newlines can be backslash-escaped, which simple erases +them; the escaped newline can also be preceded or followed by any number of tab +and space characters, which are all stripped as well. (Note: It's not blanks +preceding the backslash that are stripped, but blanks following the backslash +and preceding the newline; i.e., blanks at the end of the line.) + +Following are some examples of how multi-line strings can appear in source code +with different intentions and meanings: + + (define paragraph "This paragraph has been visually split into multiple \ + lines, but the newlines are escaped, so it's one line.") + + (define json-object '| ;; use '|| so double-quotes need no escaping + { + "key": "value" + } + |) + +The second example is actually slightly problematic. It begins with a newline, +which may be undesirable, but escaping that newline would cause the first line +to have no indentation, thus the opening `{` would not line up with the closing +`}` when this string is printed out. Further, if the entire block of code is +indented, then the string contents may be more indented than intended. (No pun +or rhyme intended.) Consider: + + (let ((foo one)) + (let ((bar two)) + (let ((json-object '| + { + "key": "value" + } + |)) + (do-whatever)))) + +The string bound to `json-object` has redundant indentation. Should the parser +attempt to solve this issue? + +Thankfully, we have the decoder to handle such complexities. Under the default +settings, the rune `#HASH` is bound to a decoder rule which detects a payload +value that is a string literal, and implements the same algorithm as seen in +Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378) + +Thus, we can do the following: + + (let ((foo one)) + (let ((bar two)) + (let ((json-object #| + ........... { + ........... "key": "value" + ........... } + ...........|)) + (do-whatever)))) + +(Dots represent whitespace that is deleted. The initial newline is, as well.) + +The only feature Zisp does not offer is a way to fence off multi-line strings +with a longer token such as `"""` as seen in Python and Java, or an arbitrary +word as seen in Bourne shell and PHP "here doc" syntax. + +However, if a programmer truly wanted to have arbitrary text blocks in code, +without needing to escape anything in them, it's possible to abuse at-quoted +string syntax, using it with an ASCII control character which is displayed +visibly by a text editor. In the following, the characters `^\` are meant to +represent a literal ASCII File Separator character in the source code: + + (define json-object #@^\ + { + "key": "value" + } + ^\) + +It works fine in Emacs, so why not? Use `C-q C-\` to insert the `^\`. + +This is indeed quite an eldritch syntax, but hopefully most programs would not +need to use it. + + +## Other syntax + +The following table summarizes commonly useful syntax abbreviations: + + [...] -> (#SQUARE ...) #datum -> (#HASH datum) + + {...} -> (#BRACE ...) #rune(...) -> (#rune ...) + + 'datum -> (#QUOTE datum) dat1dat2 -> (#JOIN dat1 dat2) + + `datum -> (#GRAVE datum) dat1.dat2 -> (#DOT dat1 dat2) + + ,datum -> (#COMMA datum) dat1:dat2 -> (#COLON dat1 dat2) + +Notes: + +* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis + means zero or more data. + +* The `#datum` form only applies when the datum following the hash sign is + anything other than a bare string, since otherwise this would be ambiguous + with a rune literal. A bare string can nevertheless follow the hash sign by + separating the two with a backslash: + + #\string -> (#HASH string) + +* Though not represented in the table due to notational difficulty, the form + `#rune(...)` doesn't require a list in the second position; any datum that + works with the `#datum` syntax also works with `#rune`. + + #rune1#rune2 -> (#rune1 #rune2) + + #rune\string -> (#rune string) + + #rune'string -> (#rune (#QUOTE string)) + + #rune"string" -> (#rune (#DQSTR |string|)) + + As a counter-example, following a rune immediately with a bare string isn't + possible without the delimiting backslash, since that would be ambiguous: + + #abcdefgh ;Could be (#abcdef gh) or (#abcde fgh) or ... + +* Syntax sugar can combine arbitrarily. Some examples follow. Any of these may + or may not actually have a meaning in code; some might simply end up producing + an error during decoding, or later evaluation of code. + + #{...} -> (#HASH (#BRACE ...)) + + #'foo -> (#HASH (#QUOTE foo)) + + ##'[...] -> (#HASH (#HASH (#QUOTE (#SQUARE ...)))) + + {x y}[i j] -> (#JOIN (#BRACE x y) (#SQUARE i j)) + + foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo bar) baz) (#BRACE x y)) + +* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts + further decoding of enclosed data. This is not so, since quoting is related + to code evaluation, not decoding. + +### Datum labels + +Valid data cannot be cyclic, since that would mean it has infinite length in +bytes. To externally represent a value with cyclic structure, one uses datum +labels in the data encoding of the value. + +A datum label either wraps another datum to assign a number to it, or contains +just a reference to a previous assignment. + + +------------------+----------------------------+ + | Syntax | Internal datum structure | + +------------------+----------------------------+ + | #%= | (#LABEL ) | + +------------------+----------------------------+ + | #%% | (#LABEL ) | + +------------------+----------------------------+ + +In this visual, the token `` stands for a hexadecimal digit sequence, the +token `` stands for any other datum, and `` is a stand-in for a +number value; that which is represented by ``. + +For clarity, concrete examples follow: + + +-------------------+------------------------------+ + | Byte sequence | Parse result | + +-------------------+------------------------------+ + | #%1234abcd=(foo) | (#LABEL <0x1234abcd> (foo)) | + +-------------------+------------------------------+ + | #%1234abcd% | (#LABEL <0x1234abcd>) | + +-------------------+------------------------------+ + +Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type +with an integer value. Note that the decoder may not accept a bare string here, +meaning this syntax sugar is not merely an abbreviation. + +### Shebang + +Finally, the parser recognizes the Unix *shebang* syntax and outputs a datum to +hold the string values found within: + + #!interpreter -> (#SHBANG interpreter) + + #!interpreter argline -> (#SHBANG interpreter argline) + +When executing a script file, Zisp simply stores this into a global value that +may be inspected if desired. + + + diff --git a/doc/0/2-decode.md b/doc/0/2-decode.md new file mode 100644 index 0000000..1a45824 --- /dev/null +++ b/doc/0/2-decode.md @@ -0,0 +1,44 @@ +# Decoding + +A separate process called "decoding" can transform simple data structures, +consisting of only the base datum types, into a richer set of Zisp types. + +For example, the decoder may turn `(#HASH ...)` into a vector, as one would +expect a vector literal like `#(...)` to work in Scheme. Bytevector syntax +could use a custom rune as a list prefix, like: `#u8(...)` + +Runes may be decoded in isolation as well, rather than transforming a list +whose head they appear in. This can implement Boolean constants as `#true` +and `#false` or `#t` and `#f`. + +The decoder recognizes `(#QUOTE ...)` to aid in implementing the traditional +quoting mechanism of Lisp/Scheme, but with a significant difference: + +Traditional quote is "unhygienic" in Scheme terms. An expression such as +`'(foo bar)` will always be read as `(quote (foo bar))` regardless of what +lexical context it appears in, so the semantics will depend on whatever the +identifier `quote` is bound to, meaning that the expression may end up +evaluating to something other than the list `(foo bar)`. + +The Zisp decoder, which transforms not datum to datum, but object to object, +can turn `#QUOTE` into an object which encapsulates the notion of quoting, +which the Zisp evaluator can recognize and act upon, ensuring that an +expression like `'(foo bar)` always turns into the list `(foo bar)`. + +One way to think about this, in Scheme (R6RS / syntax-case) terms, is that +expressions like `'(foo bar)` turn directly into a syntax object when read, +and the created syntax object begins with an identifier bound to `quote` in +the standard library. + +The decoder is, of course, configurable and extensible. The transformations +mentioned above would be performed only when it's told to decode data which +represents Zisp code. The decoder may be given a different configuration, +telling it to decode, for example, data which represents a different kind of +domain-specific data, such as application settings, build system commands, +complex data records with non-standard data types, and so on. + + diff --git a/doc/0/grammar/abnf.txt b/doc/0/grammar/abnf.txt new file mode 100644 index 0000000..5ab3c89 --- /dev/null +++ b/doc/0/grammar/abnf.txt @@ -0,0 +1,139 @@ +; Standards-compliant ABNF (RFC 5234, RFC 7405) + +; Compatible with: https://www.quut.com/abnfgen/ + +; Unlike PEG, grammar rules in BNF are non-deterministic, which makes +; it much more challenging to express our naive parse logic. Whether +; this ABNF file is truly accurate is difficult to assess. + +; The abnfgen(1) tool linked above can be used to generate arbitrary +; strings matching the grammar in this file. These can be fed into +; the Zisp parser to reveal some potential bugs; either in the parser +; itself, or this ABNF grammar. + +; Note that the tool may generate Zisp string literals with Unicode +; escape sequences corresponding to surrogate code points; the parser +; may reject these. This is expected; it's difficult to rewrite this +; ABNF grammar to exclude those Unicode values. + +; Other minor inaccuracies that aren't important include: This ABNF +; forces line comments to be terminated with an LF character, when in +; fact the end-of-file may also terminate them; the same applies to +; hash-bang parsing which doesn't actually have to end in LF. These +; discrepancies won't make abnfgen(1) generate invalid strings; they +; only make this ABNF more strict than the Zisp parser, so it won't +; generate some strings that the parser would actually accept. + + +Stream = [ Unit *( Blank Unit ) ] *Blank [Trail] + + +Unit = *Blank Datum + +Blank = HTAB / LF / %x0b / %x0c / CR / SP / Comment + +Trail = SkipLine / SkipUnit / ";" "~" *Blank + + +Datum = BareString / SpecialStr / CladDatum / Rune / RuneStr + / RuneDotStr / RuneClad / LabelRef / LabelDef / HashStr + / HashDotStr / HashClad / QuoteExpr / JoinExpr + +Comment = SkipLine LF / SkipUnit Blank + +SkipLine = ";" [ SkipLStart *AnyButLF ] + +SkipUnit = ";" "~" Unit + +SkipLStart = %x00-09 / %x0b-7d / %x7f-ff ; any but LF or "~" + +AnyButLF = %x00-09 / %x0b-ff + + +BareString = BareChar *( BareChar / Numeric ) + +SpecialStr = SpecStrChar *( SpecStrChar / BareChar ) + +CladDatum = "|" *( PipeStrChar / "\" StringEsc ) "|" + / DQUOTE *( QuotStrChar / "\" StringEsc ) DQUOTE + / "(" List ")" + / "[" List "]" + / "{" List "}" + +Rune = "#" RuneName + +RuneStr = "#" RuneName "\" BareString + +RuneDotStr = "#" RuneName "\" SpecialStr + +RuneClad = "#" RuneName CladDatum + +HashBang = "#" "!" *( SP / HTAB ) HBLine LF + +LabelRef = "#" "%" Label "%" + +LabelDef = "#" "%" Label "=" Datum + +HashStr = "#" "\" BareString + +HashDotStr = "#" "\" SpecialStr + +HashClad = "#" CladDatum + +QuoteExpr = "'" Datum + / "`" Datum + / "," Datum + +JoinExpr = Datum RJoinDatum + / LJoinDatum NoStartDot + / Datum ":" Datum + / NoEndDot "." Datum + + +BareChar = "!" / "$" / "%" / "&" / "*" / "/" / "<" / "=" / ">" + / "?" / "^" / "_" / "~" / ALPHA + +Numeric = "+" / "-" / DIGIT + +SpecStrChar = "." / ":" / Numeric + +PipeStrChar = %x00-5b / %x5d-7b / %x7d-ff ; any but "|" or "\" + +QuotStrChar = %x00-21 / %x23-5b / %x5d-ff ; any but DQUOTE or "\" + +StringEsc = "\" / "|" / DQUOTE / *( HTAB / SP ) LF *( HTAB / SP ) + / %s"a" / %s"b" / %s"t" / %s"n" + / %s"v" / %s"f" / %s"r" / %s"e" + / %s"x" *( 2HEXDIG ) ";" + / %s"u" ["0"] 1*5HEXDIG ";" + / %s"u" "1" "0" 4HEXDIG ";" + +List = [ Unit *( Blank Unit ) ] *Blank [SkipUnit] + + +RuneName = ALPHA *5( ALPHA / DIGIT ) + +Label = 1*12( HEXDIG ) + +HBLine = 1*HBChar [ 1*( SP / HTAB ) *HBChar ] + +HBChar = %x00-08 / %x0b-1f / %x21-ff ; any but HT, LF, SP + + +RJoinDatum = CladDatum / Rune / RuneStr / RuneDotStr / RuneClad + / LabelRef / LabelDef / HashStr / HashDotStr / HashClad + / QuoteExpr + +LJoinDatum = CladDatum / RuneClad / LabelRef / HashClad + +NoStartDot = BareString / CladDatum / Rune / RuneStr / RuneDotStr + / RuneClad / LabelRef / LabelDef / HashStr / HashDotStr + / HashClad / QuoteExpr + +NoEndDot = BareString / Rune / RuneStr / RuneClad / LabelRef + / HashStr / HashClad + + +;; Local Variables: +;; eval: (flyspell-mode -1) +;; End: diff --git a/doc/0/grammar/index.md b/doc/0/grammar/index.md new file mode 100644 index 0000000..e3716ea --- /dev/null +++ b/doc/0/grammar/index.md @@ -0,0 +1,115 @@ +# Zisp S-Expression Grammar + +The grammar is available in several different formats: + +* [ZBNF](zbnf.txt): See below for the rules of this notation +* [ABNF](abnf.txt): Compatible with the `abnfgen` tool +* [PEG](peg.txt): Compatible with `peg/leg` tool + + +## ZBNF notation + +The ZBNF grammar specification uses a BNF-like notation with PEG-like +semantics: + +* Concatenation of expressions is implicit: `foo bar` means `foo` + followed by `bar`. + +* Parentheses are used for grouping, and the pipe symbol `|` is used + for alternatives. + +* The suffixes `?`, `*`, and `+` have the same meaning as in regular + expressions, although `[foo]` is used in place of `(foo)?`. + +* The syntax is defined in terms of bytes, not characters. Terminals + `'c'` and `"c"` refer to the ASCII value of the given character `c`. + Standard C escape sequences are supported. + +* The prefix `~` means NOT. It only applies to rules that match one + byte, and negates them. For example, `~( 'a' | 'b' )` matches any + byte other than 'a' and 'b'. + +* Ranges of terminal values are expressed as `x...y` (inclusive). + +* ABNF "core rules" like `ALPHA` and `HEXDIG` are supported. + +* There is no ambiguity, or look-ahead / backtracking beyond one byte. + Rules match left to right, depth-first, and greedy. As soon as the + input matches the first terminal of a rule --explicit or implied by + recursively descending into the first non-terminal-- it must match + that rule to the end or a syntax error is reported. + +The last point makes the notation simple to translate to code. + + +## Limitations outside the grammar + +The following limits are not represented in the grammar: + +* A `UnicodeSV` is the hexadecimal representation of a Unicode scalar + value; it must represent a value in the range 0 to D7FF, or E000 to + 10FFFF, inclusive. Any other value signals an error. Valid values + are converted into a UTF-8 byte sequence encoding the value. + +* A `Rune` longer than 6 bytes is grammatical, but signals an error. + This is important because runes are not self-terminating; defining + their grammar as ending after a maximum of 6 bytes would allow + another datum beginning with an alphabetic character to follow a + rune immediately without any visual delineation, which would be + terribly confusing for a human reader. Consider: `#foobarbaz`. + This would parse as a `Datum` joining `#foobar` and `baz`. + + (The ABNF does not suffer from this issue, since it explicitly + enumerates the join possibilities anyway.) + +* A `Label` is the hexadecimal representation of a 48-bit integer, + meaning it allows for a maximum of 12 hexadecimal digits. Longer + values are grammatical, but signal an out-of-range error, so as to + avoid signaling a confusing "invalid character" error on input that + appears grammatical. Consider: `#%123456789abcd=foo`. This would + signal an invalid character error at the letter `d` if the grammar + limited a `Label` to 12 hexadecimal digits. + + (As above, the ABNF doesn't care about this. You probably don't + want to use the ABNF to generate a parser anyway.) + + +## At-quoted strings + +The mechanism of at-quoted strings is not represented in any of the +grammars, since it essentially has 256 variants. Representing it +sanely in a grammar requires the ability to save and reference +variables. + + +## Stream-parsing strategy + +The parser consumes one `Unit` from the input stream every time it's +called; it returns the `Datum` therein if found, or else it returns +the Zisp EOF token. + +Since a `Datum` is not self-terminating, the parser must read beyond +it to realize that it has ended (if not followed by the EOF). Thus, +it will consume one more `Blank` following the `Unit` that it parsed. +If this `Blank` is a comment, it will be consumed entirely, ensuring +that parsing resumes properly on a subsequent parser call on the same +input stream, without needing to store any state in between. + +Since comments of type `SkipUnit` are likewise not self-terminating, +an arbitrary number of chained `SkipUnit` comments may need to be +consumed before the parser is finally allowed to return. + +The following illustration shows the positions at which the parser +will stop consuming input when called repeatedly on the same input +stream. The dots represent the extent of each `Unit` being parsed, +while the caret points at the last byte the parser will consume in +that parse cycle. + +``` +foo (bar)[baz] foo;~bar foo;~bar;~baz;~bat foobar +...^..........^... ^... ^......^ +``` + +Notice how, in the fourth cycle, the parser is forced to consume all +commented-out units before it can return, since it would otherwise +leave the stream in an inappropriate state. diff --git a/doc/0/grammar/peg.txt b/doc/0/grammar/peg.txt new file mode 100644 index 0000000..1541da6 --- /dev/null +++ b/doc/0/grammar/peg.txt @@ -0,0 +1,91 @@ +# Standard PEG notation + +Stream <- Unit ( Blank Unit )* !. + + +Unit <- Blank* Datum + +Blank <- [\t-\r ] / Comment + + +Datum <- OneDatum ( JoinChar? OneDatum )* + +JoinChar <- '.' / ':' + + +Comment <- ';' ( SkipUnit / SkipLine ) + +SkipUnit <- '~' Unit + +SkipLine <- (!'\n' .)* '\n'? + + +OneDatum <- BareString / CladDatum + + +BareString <- SpecBareChar ( BareChar / JoinChar )* + / BareChar+ + +SpecBareChar <- '+' / '-' / JoinChar / DIGIT + +BareChar <- ALPHA / DIGIT + / '!' / '$' / '%' / '&' / '*' / '+' / '-' / '/' + / '<' / '=' / '>' / '?' / '^' / '_' / '~' + + +CladDatum <- PipeStr / QuoteStr / HashExpr / QuoteExpr / List + +PipeStr <- '|' ( PipeStrChar / '\' StringEsc )* '|' +QuoteStr <- '"' ( QuotStrChar / '\' StringEsc )* '"' +HashExpr <- '#' HashExprs +QuoteExpr <- "'" Datum / '`' Datum / ',' Datum +List <- ParenList / SquareList / BraceList + + +PipeStrChar <- (![|\\] .) +QuotStrChar <- (!["\\] .) + +StringEsc <- '\' / '|' / '"' / ( HTAB / SP )* LF ( HTAB / SP )* + / '0' / 'a' / 'b' / 't' / 'n' / 'v' / 'f' / 'r' / 'e' + / 'x' HexByte* ';' + / 'u' UnicodeSV ';' + +HexByte <- HEXDIG HEXDIG +UnicodeSV <- HEXDIG+ + + +HashExprs <- '!' [\t ]* HBangLine '\n'? + / '%' Label ( '%' / '=' Datum ) + / '\' BareString / CladDatum + / Rune ( '\' BareString / CladDatum )? + +HBangLine <- HBChars+ [\t ]* ( HBChars+ )? +HBChars <- (![\t\n ] .) +Label <- HEXDIG+ +Rune <- ALPHA ( ALPHA / DIGIT )* + + +ParenList <- '(' Unit* ')' +SquareList <- '[' Unit* ']' +BraceList <- '{' Unit* '}' + + +DIGIT <- [0-9] +ALPHA <- [a-zA-Z] +HEXDIG <- [0-9a-fA-F] + + +# Keep this in sync line-for-line with the ZBNF grammar for easy +# comparison between the two. + +# This file is meant to be compatible with: +# https://piumarta.com/software/peg + +# Due to a quirk in the peg tool this file is used with, the grammar +# must not allow an empty stream. Therefore, the Unit rule has its +# Datum declared as mandatory rather than optional. + + +# Local Variables: +# eval: (flyspell-mode -1) +# End: diff --git a/doc/0/grammar/zbnf.txt b/doc/0/grammar/zbnf.txt new file mode 100644 index 0000000..c04b813 --- /dev/null +++ b/doc/0/grammar/zbnf.txt @@ -0,0 +1,75 @@ +; Custom notation with PEG semantics + +Stream : Unit ( Blank Unit )* + + +Unit : Blank* [Datum] + +Blank : '\t'...'\r' | SP | Comment + + +Datum : OneDatum ( [JoinChar] OneDatum )* + +JoinChar : '.' | ':' + + +Comment : ';' ( SkipUnit | SkipLine ) + +SkipUnit : '~' Unit + +SkipLine : ( ~LF )* [LF] + + +OneDatum : BareString | CladDatum + + +BareString : SpecBareChar ( BareChar | JoinChar )* + | BareChar+ + +SpecBareChar : '+' | '-' | JoinChar | DIGIT + +BareChar : ALPHA | DIGIT + | '!' | '$' | '%' | '&' | '*' | '+' | '-' | '/' + | '<' | '=' | '>' | '?' | '^' | '_' | '~' + + +CladDatum : PipeStr | QuoteStr | HashExpr | QuoteExpr | List + +PipeStr : '|' ( PipeStrChar | '\' StringEsc )* '|' +QuoteStr : '"' ( QuotStrChar | '\' StringEsc )* '"' +HashExpr : '#' HashExprs +QuoteExpr : "'" Datum | '`' Datum | ',' Datum +List : ParenList | SquareList | BraceList + + +PipeStrChar : ~( '|' | '\' ) +QuotStrChar : ~( '"' | '\' ) + +StringEsc : '\' | '|' | '"' | ( HTAB | SP )* LF ( HTAB | SP )* + | '0' | 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e' + | 'x' HexByte* ';' + | 'u' UnicodeSV ';' + +HexByte : HEXDIG HEXDIG +UnicodeSV : HEXDIG+ + + +HashExprs : '!' ( SP | HTAB )* HBangLine [ LF ] + | '%' Label ( '%' | '=' Datum ) + | '\' BareString | CladDatum + | Rune [ '\' BareString | CladDatum ] + +HBangLine : HBChars+ ( SP | HTAB )* [ HBChars+ ] +HBChars : ~( SP | HTAB | LF ) +Label : HEXDIG+ +Rune : ALPHA ( ALPHA | DIGIT )* + + +ParenList : '(' Unit* ')' +SquareList : '[' Unit* ']' +BraceList : '{' Unit* '}' + + +;; Local Variables: +;; eval: (flyspell-mode -1) +;; End: diff --git a/doc/0/index.md b/doc/0/index.md new file mode 100644 index 0000000..f0da216 --- /dev/null +++ b/doc/0/index.md @@ -0,0 +1,34 @@ +# Chapter 0: Genesis + +This chapter explains the core value representation of Zisp, and goes +through the processes of parsing, decoding, running, and optionally +compiling code. + +0. [Value](0-value.html) + + Zisp uses a uniform 64-bit representation for all values, densely + packed into signaling or non-canonical NaN values. + +1. [Parse](1-parse.html) (see also [grammar](grammar/)) + + The parser receives a stream of bytes and transforms them into a + minimal set of data types with very little processing. + +2. [Decode](2-decode.html) + + The decoder runs configurable and extensible pre-processing steps + over data received from the parser, enriching it with more complex + value types, and handling primitive source code transforms. + +3. [Eval](3-eval.html) + + Code is evaluated within a mutable module context, on which it can + have side effects such as creating new definitions or establishing + links to other modules. + +4. [Compile](4-compile.html) + + Procedures from within the compiler module can be used to demand + the compilation of other modules, with various options, yielding + static or dynamic object files. These may be loaded immediately, + replacing the previously uncompiled module code in memory. -- cgit v1.2.3