Improve parser doc.

author: Taylan Kammer <taylan.kammer@gmail.com> 2026-06-02 22:21:32 +0200
committer: Taylan Kammer <taylan.kammer@gmail.com> 2026-06-02 22:21:32 +0200
commit: 6d1eb51c0f0ecf0bd4084aa4d8985ab3001ab0e1 (patch)
tree: 9645addbedc726507b012e5c63d3eb1b3a8be01e /doc/c1
parent: d993104e86f2e2ec8ff1036648e34eebdca0d58d (diff)
1 files changed, 256 insertions, 262 deletions
diff --git a/doc/c1/1-parse.md b/doc/c1/1-parse.md
index cb3db82..8932481 100644
--- a/doc/c1/1-parse.md
+++ b/doc/c1/1-parse.md
@@ -1,85 +1,24 @@
-# Parser for Data
+# Parser for Code and Data
 
 Zisp s-expressions represent an extremely minimal set of data types; only that
 which is necessary to strategically construct more complex values:
 
-    +--------+-----------------+--------+----------+------+
-    | TYPE   | String          | Rune   | Pair     | Nil  |
-    +--------+-----------------+--------+----------+------+
-    | E.G.   | foobar          | #name  | (X & Y)  | ()   |
-    |        | |foo bar|       |        |          |      |
-    |        | "foo bar"       |        |          |      |
-    |        | @_foo bar_      |        |          |      |
-    +--------+-----------------+--------+----------+------+
+    +-------+---------+--------+----------+
+    | TYPE  | String  | Rune   | Pair     |
+    +-------+---------+--------+----------+
+    | E.G.  | foobar  | #name  | (X & Y)  |
+    +-------+---------+--------+----------+
 
-Datum comments and line comments are supported:
+The parser also recognizes various *syntax sugar* which typically results in a
+pair beginning with a specific rune.  A separate component called the *decoder*
+transforms such data into a rich set of value types.  See below for details.
 
-* A semicolon followed by a tilde instructs the parser to consume one datum and
-  discard it.  Whitespace may appear between the tilde and the datum to discard.
-
-* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
-  discard bytes until a newline (ASCII Line Feed) is encountered.
-
-The parser can also output non-negative integers, but this is only used for
-datum labels; number literals are handled by the decoder instead; see below.
-
-
-## Overview
 
-This section explains a few core concepts and features related to the parser.
+## Charset and Stream Handling
 
+The parser does not consume Unicode characters; it consumes bytes.  Grammar is
+generally constructed from bytes corresponding to ASCII characters.
 
-### Value vs. Datum
-
-A Zisp *value* that has an *external representation* in the form of a sequence
-of bytes is called a *datum*.  Every datum is a value, but not all values are
-data.  A datum is a value that can be printed out as a byte sequence which the
-parser can recognize and turn back into an equivalent datum.
-
-One may speak of an *external representation of a value* where the value is not
-itself a datum, but can be encoded as a datum.  The more strictly correct term
-for this is: "The external representation of a datum encoding the value."
-
-
-### Syntax sugar
-
-The parser recognizes various "syntax sugar" and transforms it into uses of the
-above listed primitive data types.  As an example, the expression `#(x y z)` is
-parsed into the structure `(#HASH x y z)`.  These are two completely equivalent
-external representations for the same compound datum; after parsing, both byte
-sequences will yield data values that are indistinguishable in all but their
-memory address.
-
-The most ubiquitously used syntax sugar is the list, which stands for a chain of
-pairs, terminated with nil:
-
-    (x y z)  ->  (x & (y & (z & ())))
-
-The full syntax sugar table is listed and explained further below.
-
-
-### Decoder
-
-*The decoder has nothing to do with the concept of text or character encoding.*
-
-A separate process called *decoding* can transform Zisp data into values of more
-complex types, including values that are not of a datum type.
-
-For example, the datum `(#HASH x y z)` could be decoded into an array, so the
-expression `#(x y z)` could work like in Scheme.
-
-Decoding also resolves datum labels, goes over bare strings to find ones that
-represent a number literal, and takes care of a number of other transforms.
-This offloads complexity, allowing the parser to remain extremely simple.
-
-See the dedicated documentation of the [decoder](2-decode.html) for more.
-
-
-### Character encoding
-
-The parser does not consume characters; it consumes bytes.
-
-Grammar is generally constructed by bytes corresponding to ASCII characters.
 Some elements of the grammar, such as comments and quoted strings, may contain
 arbitrary byte sequences, until terminated.  These sequences may happen to be
 valid UTF-8 text.  This way, quoted strings and comments may contain Unicode
@@ -89,21 +28,20 @@ Since comments and quoted strings may contain arbitrary byte sequences, a text
 editor or other program displaying Zisp s-expressions may need to use a special
 visual representation for bytes that don't represent valid text.
 
-The parser being based on bytes rather than characters is not a limitation but
-rather a feature: It allows for Zisp s-expressions to be used as a structured
-data exchange format that may contain binary data elements without the need to
-encode these in Base64 or other such text representations of binary data.
+The parser working on bytes rather than Unicode characters is not a limitation,
+but rather a feature: It allows Zisp s-expressions to be used as a structured
+data exchange format, which may contain binary data elements, without the need
+to encode these in Base64 or other such text representations of binary data.
 Consider the example:
 
-    ((image.webp "<< binary data >>")
-     (video.webm "<< binary data >>"))
+    ((image.webp "<BINARY>")
+     (video.webm "<BINARY>"))
 
 All that needs to be done for this to work, is that any incidental occurrences
 of the double-quote sign, and the backslash sign, are escaped with a backslash
-within the binary data; all other bytes can appear verbatim in the strings.
+within the `<BINARY>` data; all other bytes can appear verbatim in the strings.
 
-
-### Stream parsing
+### Buffering
 
 The parser can be repeatedly invoked on a byte stream to consume the next datum
 within.  This does not require "unreading" or back-seeking within the stream;
@@ -126,114 +64,148 @@ continue operating on the same stream.  Consider the example:
 The "header" for each file in this stream is a Zisp s-expression containing
 information about how many bytes should be read after the header, before the
 next file header appears.  (The header data need to be terminated with a blank
-ASCII character such as a newline.  The reason why the closing parenthesis does
-not act as a terminator unto itself will become apparent later.)
+ASCII character such as a newline; the closing parenthesis does not act as a
+terminator unto itself due to the "join" syntax sugar; see later.)
 
-#### Buffering
+To enable this stream parsing strategy, the parser does not use any automatic
+buffering.  If it did, it might inadvertently consume some bytes beyond the
+currently parsed datum, leaving the stream inconsistent.
 
-To enable the aforementioned stream parsing strategy, the parser does not use
-automatic buffering.  If it did, it might inadvertently consume some bytes
-beyond the currently parsed datum, leaving the stream inconsistent.
+If the parser is meant to be used on an input stream associated with expensive
+system calls, such as a file handle or network socket, it's best to wrap that
+stream in some intermediate object which asks the system for large chunks of
+data at once, and stores the data in a buffer.
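
As a rough illustration of such an intermediate object, the following Python sketch wraps a raw source, asks it for large chunks, and hands out one byte at a time.  The class name `ChunkedByteSource`, its methods, the chunk size, and the `(file 3)` header shape are all assumptions made for this example; none of them are part of Zisp.

```python
import io


class ChunkedByteSource:
    """Hypothetical wrapper: fetches large chunks, serves single bytes."""

    def __init__(self, raw, chunk_size=4096):
        self.raw = raw                # any object with a read(n) method
        self.chunk_size = chunk_size
        self.buf = b""
        self.pos = 0
        self.raw_reads = 0            # counts calls made to the raw source

    def _fill(self):
        self.buf = self.raw.read(self.chunk_size)
        self.pos = 0
        self.raw_reads += 1

    def next_byte(self):
        """Return the next byte as an int, or None at end of stream."""
        if self.pos >= len(self.buf):
            self._fill()
            if not self.buf:
                return None
        byte = self.buf[self.pos]
        self.pos += 1
        return byte

    def read_exact(self, n):
        """Return the next n bytes (fewer only at end of stream)."""
        out = bytearray()
        while len(out) < n:
            byte = self.next_byte()
            if byte is None:
                break
            out.append(byte)
        return bytes(out)


# Made-up stream: a header line, then the number of bytes it announces.
source = ChunkedByteSource(io.BytesIO(b"(file 3)\nabc(file 2)\nxy"))
header = bytearray()
while (b := source.next_byte()) not in (None, ord("\n")):
    header.append(b)
body = source.read_exact(3)
```

Because a byte-at-a-time consumer and the code reading the payload share the same wrapper, bytes fetched ahead of time stay available to both, while the raw source is only asked for data once in this example.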
 
-The parser could provide access to its buffer, such that one could access the
-unused bytes, but it's simpler and more flexible to let buffering be handled
-externally from the parser.
 
-In other words: If the parser is meant to be used on an I/O stream connected to
-expensive system calls, such as a file handle or network socket, it's best to
-wrap that stream in some intermediate object which asks the system for large
-chunks of data at once, and stores the data in a buffer.
+## Comments
 
+Two types of comment are supported: datum comments and line comments.
 
-### Datum labels
+* A semicolon followed by a tilde instructs the parser to consume one datum and
+  discard it.  Whitespace may appear between the tilde and the datum to discard.
 
-Valid data cannot be cyclic, since that would mean it has infinite length in
-bytes.  To externally represent a value with cyclic structure, one uses datum
-labels in the data encoding of the value.
+* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
+  discard bytes until a newline (ASCII Line Feed) is encountered.
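
The two comment rules can be sketched in Python as follows.  This is only an illustration: the function names are made up, a lone semicolon at the end of input is treated as a line comment, and the terminating Line Feed is assumed to be consumed along with the comment.

```python
def comment_kind(data, i):
    """Classify the comment starting at data[i], or return None."""
    if data[i:i + 1] != b";":
        return None
    if data[i + 1:i + 2] == b"~":
        return "datum"   # the caller would now parse one datum and drop it
    return "line"


def skip_line_comment(data, i):
    """Return the index just past the Line Feed that ends a line comment."""
    end = data.find(b"\n", i)
    return len(data) if end == -1 else end + 1
```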
 
-A datum label either wraps another datum to assign a number to it, or contains
-just a reference to a previous assignment.
 
-    +----------------------------------+---------------------------------+
-    | Internal structure               | External representation         |
-    +----------------------------------+---------------------------------+
-    | (#LABEL & (<NUMBER> & <DATUM>))  | #%<HEX>=<DATUM>                 |
-    +----------------------------------+---------------------------------+
-    | (#LABEL & <NUMBER>)              | #%<HEX>%                        |
-    +----------------------------------+---------------------------------+
+## Value vs. Datum
 
-In this visual, the token `<NUMBER>` stands for an actual number value that
-doesn't have its own external representation.  It's printed as a sequence of
-hexadecimal digits, denoted by `<HEX>` in the external representation.
+A Zisp *value* that has an *external representation* in the form of a sequence
+of bytes is called a *datum*.  Every datum is a value, but not every value is a
+datum.  In other words, a datum is a value that can be printed out as a byte
+sequence which the parser can turn back into an equivalent datum.
 
-For clarity, concrete examples follow:
+A value that is not a datum may nevertheless be *encoded* into one, allowing it
+to have an external representation.  After parsing, it needs to be *decoded*.
+
+One may speak of an *external representation of a value* where the value is not
+itself a datum, but can be encoded as one.  The more strictly correct term for
+this is: "The external representation of a datum that encodes the value."
 
-    #%1234abcd=(foo bar)  ->  (#LABEL & (<0x1234abcd> & (foo bar)))
+### Syntax sugar
 
-    #%1234abcd%           ->  (#LABEL & <0x1234abcd>)
+The parser recognizes various *syntax sugar* to abbreviate an equivalent datum
+construction, or express a datum that encodes a more complex value.
 
-Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
-with an integer value.
+As an example, the expression `#(x y z)` is an abbreviation for the equivalent
+`(#HASH x y z)`.  These are two external representations for the same datum;
+after parsing, both will yield values that are indistinguishable in all but
+their memory address.
 
-Datum labels may look like "syntax sugar" but the fact that integers don't have
-a direct external representation means that datum labels are a fundamental type
-of syntax that has no "desugared" equivalent in external representation.  The
-decoder will not accept a bare string encoding of an integer here.
+The most ubiquitous syntax sugar is the list, which abbreviates a sequence of
+tail-linked pairs, terminated with a special nil value represented as `()`:
 
+    (x)      ->  (x & ())
 
-## Data types
+    (x y)    ->  (x & (y & ()))
 
-Following is an explanation of the four core data types constructed by the Zisp
-s-expression parser.
+    (x y z)  ->  (x & (y & (z & ())))
 
-A Zisp value that is a member of one of these types is also called a *datum* if
-it adheres to additional constraints as explained for each type.
+There are also so-called *improper lists* which are chains of pairs that end in
+a value other than nil:
 
+    (x y & z)    ->  (x & (y & z))
 
-### String
+    (x y z & t)  ->  (x & (y & (z & t)))
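
The desugaring of proper and improper lists amounts to a right fold over the elements.  A minimal Python sketch, modeling pairs as 2-tuples and nil as the empty tuple (both modeling choices, and the name `desugar_list`, are assumptions for illustration):

```python
NIL = ()   # stand-in for the Zisp nil value


def desugar_list(items, tail=NIL):
    """Fold a sequence of items into right-nested pairs ending in `tail`."""
    result = tail
    for item in reversed(items):
        result = (item, result)
    return result
```

Passing a `tail` other than `NIL` models an improper list such as `(x y & z)`.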
 
-Strings can appear "bare" or be quoted in various ways.
+An example of "syntax sugar" that is not a mere abbreviation is a quoted string
+which contains bytes that could not appear in a *bare* string:
 
-A string, as a stand-alone Zisp value, is only a valid datum if it can be
-represented as a bare string.  If it contains bytes that prevent the bare
-representation, then the string must be wrapped in one of the following
-structures to become a valid datum, each of which has its own external
-representation:
+    "foo bar"  ->  (#DQSTR & <STRING>)
 
-    +-------------------------------+-------------------------------+
-    | Internal structure            | External representation       |
-    +-------------------------------+-------------------------------+
-    | (#PQSTR & <STRING>)           | |contents|                    |
-    +-------------------------------+-------------------------------+
-    | (#DQSTR & <STRING>)           | "contents"                    |
-    +-------------------------------+-------------------------------+
-    | (#ATSTR & <STRING>)           | @_contents_                   |
-    +-------------------------------+-------------------------------+
+In this example, the visual token `<STRING>` represents the actual string value
+in program memory.  It may seem contrived to refer to this as syntax sugar, but
+we are using the term uniformly for any situation in which the parser generates
+a pair with a rune in its first position, intended for the decoder to handle.
 
-The visual token `<STRING>` is meant to denote the actual string, as a Zisp
-value, occupying the second position in the pair.  It is not actual syntax.
+Those familiar with Lisp and Scheme may expect *bare* strings to be parsed into
+a separate data type called a *symbol*, but no such type exists in Zisp.
+Quoted strings instead parse into this internal representation to differentiate
+them from bare strings, which may represent identifiers in code.
 
-Note that, while conceptually similar, this internal encoding of string data is
-not syntax sugar, since the internal datum representation using runes cannot be
-printed out verbatim, due to the attached string being impossible to represent
-externally without quotation.  As such, quoted strings are fundamental syntax.
+Other syntax sugar is explained further below.
+
+### Decoder
+
+The *decoder* transforms Zisp data into values of more complex types, including
+values that are not of a datum type.
+
+Combined with syntax sugar, this allows Zisp to offer familiar syntax elements.
+For example, the expression `#(x y z)` which parses into `(#HASH x y z)` can be
+decoded into an array, so the result is similar to the vector syntax of Scheme.
+
+Decoding also resolves datum labels, goes over bare strings to find ones that
+represent a number literal, and takes care of a number of other transforms.
+This offloads complexity, allowing the parser to remain extremely simple.
+
+See the dedicated documentation of the [decoder](2-decode.html) for more.
+
+
+## Data types
+
+Following is a more detailed explanation of the four core data types
+constructed by the Zisp s-expression parser.
+
+These are in fact value types, though the term "data type" is often used due to
+familiarity.  A Zisp value that is a member of one of the following value types
+is only a *datum* if it adheres to additional constraints as explained below.
+
+### String
+
+Strings can appear *bare* or be quoted in various ways.  A quoted string is in
+fact parsed into a pair value (see below) with a rune in the first position to
+identify the quotation category, and the string value in the second position.
+
+    +-----------+----------------------+
+    | Syntax    | Parse output         |
+    +-----------+----------------------+
+    | |bytes|   | (#PQSTR & <STRING>)  |
+    +-----------+----------------------+
+    | "bytes"   | (#DQSTR & <STRING>)  |
+    +-----------+----------------------+
+    | @_bytes_  | (#ATSTR & <STRING>)  |
+    +-----------+----------------------+
+
+The visual token `<STRING>` denotes the actual string, as a Zisp value, in the
+second position of the pair.
 
 These external representations of strings will be explained in more detail
 further below, including backslash escape sequences allowed within.
 
 Strings have a fixed length, counted in bytes.  Each byte can have any value,
-including zero (aka ASCII NULL).  The parser reads bytes, not characters, and
-has no concept of a character encoding, which means that a string can contain
-UTF-8 byte sequences, but these are not tested for validity.
+including zero (ASCII NUL).  The parser reads bytes, not Unicode characters; a
+string may contain UTF-8 byte sequences, but these are not tested for validity.
 
 A string that is up to 255 bytes long is automatically *interned*, meaning any
 occurrence of the same string -- equal in length and containing the same byte
 values -- ends up being represented by the same bit-pattern; either a memory
 address, or an immediate representation within a CPU word for short strings.
+The quotation method is inconsequential to this process; for example, while
+`|foobar|` and `"foobar"` will parse into different pair values, the actual
+string they hold will be the same one in program memory.
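
The interning rule can be pictured with a small table keyed by string contents.  This Python sketch is only an analogy: the dictionary and the name `intern_string` are assumptions, and the real representation (memory addresses or immediates in a CPU word) is not modeled.

```python
INTERN_LIMIT = 255   # strings up to this many bytes are interned
_table = {}


def intern_string(s):
    """Return the canonical object for short strings; long ones pass through."""
    if len(s) > INTERN_LIMIT:
        return s
    return _table.setdefault(s, s)
```

Two equal short strings therefore end up as the very same object, whereas two equal strings longer than the limit remain distinct objects.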
 
-Strings with a length greater than 255 bytes end up being represented by a
-distinct memory address, even if they are equal in length and content.
-
+Strings of length greater than 255 bytes are stored separately in memory, even
+if they are equal in length and content.
 
 ### Rune
 
@@ -244,7 +216,7 @@ follows this constraint is valid as a datum.
 
 Zisp code may explicitly construct values of the rune type that violate the
 above constraints.  Such runes are not valid data and cannot be printed or
-parsed in any way.
+parsed.
 
 Runes are case-sensitive, and the parser always emits runes using upper-case
 letters when expressing syntax sugar.  Uppercase rune names are reserved for
@@ -255,31 +227,30 @@ default decoder settings.
 
 Runes are always stored directly in a CPU word; never by memory address.
 
-
 ### Pair
 
-A pair is a tuple of two values: the first value and the second value.
+A pair is a tuple of two values: the first value and the second value.  In Lisp
+tradition, these are also called the `car` and `cdr` of the pair, respectively.
 
-The parser allocates a unique two-word cell in the process heap for every pair,
-and represents that pair through the memory address of that cell.
+The parser allocates a unique two-word cell in program memory for every pair,
+and represents that pair through the memory address of the cell.
 
-Pairs are valid as a datum if one of the following holds true for the pair:
+Pairs are valid data if one of the following holds true:
 
-* It encodes one of the quoted string variants.
+* The pair encodes a quoted string, datum label, or shebang line. (See below.)
 
-* It encodes a datum label (assignment or reference).
-
-* Both the first and second value in the pair is itself a valid datum.
-
-An additional constraint is that a hierarchy of pairs containing pairs must not
-form cycles; if they do, the cycles must be broken up by use of datum labels or
-else none of the pairs within the cyclic structure are a valid datum.
+* Both the first and second values in the pair are valid data.
 
+Further, a structure of nested pair values may not contain cyclic references
+back up in the structure (which would make the above definition diverge into
+infinity).  Such cycles must be broken up with datum labels, or else the pair
+cannot be considered a datum, since it cannot be printed or parsed.
 
 ### Nil
 
 The Zisp nil value is a singleton and a datum.  There is exactly one nil value
-and it is used to terminate a chain of pairs representing a list of values.
+and it is used to terminate a chain of pairs representing a list of values; it
+has the external representation `()`.
 
 
 ## Quoted strings
@@ -287,13 +258,12 @@ and it is used to terminate a chain of pairs representing a list of values.
 Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted.
 This section goes into the details of each variant.
 
-
 ### Pipe-quoted
 
 Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers
 the parser to generate a pair with the structure:
 
-    (#PQSTR & <STRING>)                 ;; <STRING> is visual aid, not syntax
+    (#PQSTR & <STRING>)   ;; <STRING> is visual aid, not syntax
 
 The decoder, using default settings, would emit this string verbatim as a value.
 Then, during code evaluation, this would be seen as an identifier.  In this way,
@@ -306,101 +276,105 @@ this case, a string unto itself that may not be a valid datum, due to not being
 possible to be represented as a bare string.  Yet, it is valid as an identifier
 for the purposes of the evaluator, since it is a string *value* like any other.
 
-
 ### Double-quoted
 
 Strings wrapped in the double-quote symbol parse into:
 
-    (#DQSTR & <STRING>)                 ;; <STRING> is visual aid, not syntax
+    (#DQSTR & <STRING>)   ;; <STRING> is visual aid, not syntax
 
 Under default settings, the decoder would transform this into a value which,
-when evaluated, yields back the string as a value.  Typically, this would be
-achieved by simply transforming it into `(#QUOTE & <STRING>)`.  (Note that,
-unlike `(#PQSTR & <STRING>)`, this would not be decoded into a string unto
-itself, as that would make the evaluator see it as an identifier.)
+when evaluated as code, simply yields the contained string as a value.
 
+### At-quoted
 
-### At-quoted strings AKA raw strings
-
-There is a special type of syntax for "raw" strings, meaning that no backslash
+This is a special type of syntax for "raw" strings, meaning that no backslash
 escapes nor any other kind of escape sequence are recognized within them.
 
-This raw string syntax begins with an at sign, followed by any byte.  That byte
-becomes the termination marker, and the string cannot contain an occurrence of
-it, since there are no escape sequences.
+The syntax begins with an at sign, followed by any byte.  That byte becomes a
+termination marker, and the string cannot contain an occurrence of it, since
+there are no escape sequences.
 
-    @"foo \ bar"  ->  (#ATSTR & <STRING>)
+    @"foo \ bar"  ->  (#ATSTR <BYTE> & <STRING>)
 
-In the above, the visual token `<STRING>` is not part of datum syntax but a
-stand-in for the actual string value, which is, literally: `foo \ bar`
+In the above, the visual tokens `<BYTE>` and `<STRING>` represent an integer
+value and a string value, respectively.  In this example, the integer value
+would be 34; the ASCII value for the double-quote sign.  The string value
+contains a literal backslash, since there is no backslash escape parsing.
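
Reading such a raw string needs no escape handling at all, as this Python sketch suggests.  The function name and the returned tuple are assumptions for illustration; the sketch reports the delimiter byte, the contents, and the index at which parsing would resume.

```python
def parse_at_string(data, i=0):
    """Parse an at-quoted string at data[i]; return (delimiter, contents, next)."""
    if data[i:i + 1] != b"@":
        raise ValueError("expected an at sign")
    if i + 1 >= len(data):
        raise ValueError("missing delimiter byte")
    delim = data[i + 1]               # any byte value may act as the delimiter
    start = i + 2
    end = data.find(bytes([delim]), start)
    if end == -1:
        raise ValueError("unterminated raw string")
    return delim, data[start:end], end + 1
```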
 
 This style of quoting can be useful, for instance, when representing regular
 expressions as strings in code:
 
-    @/^foo\\(bar|baz)\.\[".*"\]$/         ;; matches e.g. foo\bar.["blah"]
+    ;; Matches e.g. foo\bar.["blah"]
+
+    @/^foo\\(bar|baz)\.\[".*"\]$/
 
 Were it not for this syntax, this regular expression would only be possible to
 represent through a quoted string such as the following:
 
-    "^foo\\\\(bar|baz)\\t\\[\".*\"\\]$"   ;; many backslashes
-
-Alternatively, imagine searching for certain MS Windows file paths:
+    ;; Same as above, but so many backslashes
 
-    @_C:\\\\Users\\([a-z]+)_              ;; matches C:\\User\foo
-
-That's already ugly.  Without raw strings, it would need to look even worse:
-
-    "C:\\\\\\\\Users\\\\([a-z]+)"         ;; MANY backslashes
+    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"
 
 The byte that follows the at sign need not be a printable character or even a
-valid ASCII byte; it can be absolutely any byte value, even NULL.  This can be
+valid ASCII byte; it can be absolutely any byte value, even NUL.  This can be
 useful to easily encode binary data which is known to not contain a specific
-byte; an example would be C strings which cannot contain NULL.
-
-
-### Backslash escape sequences in strings
-
-The following backslash escapes are supported in pipe-quoted and double-quoted
-strings.  (Some rows use Regular Expression notation.)
-
-    +-----------------------------------+------------------------------+
-    | Character(s) following backslash  | Meaning                      |
-    +-----------------------------------+------------------------------+
-    | \                                 | Literal backslash            |
-    +-----------------------------------+------------------------------+
-    | |                                 | Literal pipe symbol          |
-    +-----------------------------------+------------------------------+
-    | "                                 | Literal double-quote         |
-    +-----------------------------------+------------------------------+
-    | RE: /[\t ]*\n[\t ]*/              | Discarded                    |
-    +-----------------------------------+------------------------------+
-    | 0                                 | ASCII NULL                   |
-    +-----------------------------------+------------------------------+
-    | a                                 | ASCII Alert                  |
-    +-----------------------------------+------------------------------+
-    | b                                 | ASCII Backspace              |
-    +-----------------------------------+------------------------------+
-    | t                                 | ASCII Tab (Horizontal)       |
-    +-----------------------------------+------------------------------+
-    | n                                 | ASCII Newline (Line Feed)    |
-    +-----------------------------------+------------------------------+
-    | v                                 | ASCII Vertical Tab           |
-    +-----------------------------------+------------------------------+
-    | f                                 | ASCII Form Feed              |
-    +-----------------------------------+------------------------------+
-    | r                                 | ASCII Carriage Return        |
-    +-----------------------------------+------------------------------+
-    | e                                 | ASCII Escape                 |
-    +-----------------------------------+------------------------------+
-    | RE: /x([0-9a-fA-F]{2})*;/         | Arbitrary bytes in hex       |
-    +-----------------------------------+------------------------------+
-    | RE: /u[0-9a-fA-F]+;/              | Unicode scalar as UTF-8      |
-    +-----------------------------------+------------------------------+
-     
-To clarify:
+byte; an example would be C strings which cannot contain NUL.
+
+### Backslash escape sequences
+
+In pipe-quoted and double-quoted strings, the following ASCII characters may
+follow a backslash to insert a certain character.
+
+    +-------+----------------------------+
+    | Char  | Meaning                    |
+    +-------+----------------------------+
+    | \     | Literal backslash          |
+    +-------+----------------------------+
+    | |     | Literal pipe symbol        |
+    +-------+----------------------------+
+    | "     | Literal double-quote       |
+    +-------+----------------------------+
+    | 0     | ASCII NUL                  |
+    +-------+----------------------------+
+    | a     | ASCII Alert                |
+    +-------+----------------------------+
+    | b     | ASCII Backspace            |
+    +-------+----------------------------+
+    | t     | ASCII Tab (Horizontal)     |
+    +-------+----------------------------+
+    | n     | ASCII Newline (Line Feed)  |
+    +-------+----------------------------+
+    | v     | ASCII Vertical Tab         |
+    +-------+----------------------------+
+    | f     | ASCII Form Feed            |
+    +-------+----------------------------+
+    | r     | ASCII Carriage Return      |
+    +-------+----------------------------+
+    | e     | ASCII Escape               |
+    +-------+----------------------------+
+
+In words:
 
 * A backslash followed by a backslash, pipe, or double-quote character is
-  substituted with a literal occurrence of the corresponding character.
+  substituted with a literal occurrence of that character.
+
+* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the
+  C programming language, representing common ASCII control characters.
+
+Further, a backslash followed by text matching one of the following Regular
+Expression patterns has special meaning.
+
+    +---------------------+-----------------------+
+    | Regular Expression  | Meaning               |
+    +---------------------+-----------------------+
+    | [\t ]*\n[\t ]*      | Discarded             |
+    +---------------------+-----------------------+
+    | x([0-9a-fA-F]{2})*; | Arbitrary bytes       |
+    +---------------------+-----------------------+
+    | u[0-9a-fA-F]+;      | Unicode Scalar Value  |
+    +---------------------+-----------------------+
+
+Explanations:
 
 * A backslash followed by any number of blanks (space or tab), a newline, and
   again any number of blanks, is substituted with nothing.  This is to allow
@@ -409,9 +383,6 @@ To clarify:
       (define paragraph "This paragraph has been visually split into multiple \
                          lines, but the newline is escaped, so it's one line.")
 
-* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the
-  C programming language, representing common unprintable ASCII bytes.
-
 * An x, followed by pairs of hexadecimal digits (case insensitive), terminated
   by a semicolon, is substituted with the sequence of bytes represented by the
   corresponding pairs of hexadecimal digits.  E.g.: `"foo\xDEADBEEF;bar"`
@@ -421,7 +392,6 @@ To clarify:
   Unicode Scalar Value represented by that hexadecimal number.  The number must
   be in the range `0` to `10FFFF`.  E.g.: `"foo\u00A0;bar"`
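
The escape rules above can be put together into a small Python sketch operating on the bytes between the quotes.  The function name `unescape`, the error handling, and emitting the Unicode Scalar Value as UTF-8 are assumptions made for illustration.

```python
import re

SIMPLE = {
    ord("\\"): b"\\", ord("|"): b"|", ord('"'): b'"',
    ord("0"): b"\x00", ord("a"): b"\x07", ord("b"): b"\x08",
    ord("t"): b"\t", ord("n"): b"\n", ord("v"): b"\x0b",
    ord("f"): b"\x0c", ord("r"): b"\r", ord("e"): b"\x1b",
}
LINE_JOIN = re.compile(rb"[\t ]*\n[\t ]*")
HEX_BYTES = re.compile(rb"x((?:[0-9a-fA-F]{2})*);")
SCALAR = re.compile(rb"u([0-9a-fA-F]+);")


def unescape(raw):
    """Expand backslash escapes in the contents of a quoted string."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] != ord("\\"):
            out.append(raw[i])
            i += 1
            continue
        i += 1  # skip the backslash itself
        if i < len(raw) and raw[i] in SIMPLE:
            out += SIMPLE[raw[i]]
            i += 1
        elif m := LINE_JOIN.match(raw, i):
            i = m.end()                       # escaped newline: emit nothing
        elif m := HEX_BYTES.match(raw, i):
            out += bytes.fromhex(m.group(1).decode("ascii"))
            i = m.end()
        elif m := SCALAR.match(raw, i):
            code = int(m.group(1).decode("ascii"), 16)
            if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
                raise ValueError("not a Unicode Scalar Value")
            out += chr(code).encode("utf-8")  # assumption: emitted as UTF-8
            i = m.end()
        else:
            raise ValueError(f"unknown escape at byte {i}")
    return bytes(out)
```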
 
-
 ### Newlines in strings
 
 Normally, a newline in a string has no special meaning and simply becomes part
@@ -437,7 +407,7 @@ with different intentions and meanings:
     (define paragraph "This paragraph has been visually split into multiple \
                        lines, but the newlines are escaped, so it's one line.")
 
-    (define json-object '|         ;; use '|| so double-quotes need no escaping
+    (define json-object '|   ;; use '|| so double-quotes need no escaping
       {
         "key": "value"
       }
@@ -496,25 +466,15 @@ represent a literal ASCII File Separator character in the source code:
       }
       ^\)
 
-Hey, it works fine in Emacs, so why not?  Use `C-q C-\` to insert the `^\`.
+It works fine in Emacs, so why not?  Use `C-q C-\` to insert the `^\`.
 
 This is indeed quite an eldritch syntax, but hopefully most programs would not
-need to use it anyway.
+need to use it.
 
 
 ## Syntax sugar
 
-The parser recognizes various "syntax sugar" and transforms it into equivalent
-datum constructions.  The most ubiquitous example of this is the list, which is
-transformed into a chain of pairs, terminated with nil:
-
-    (datum1 datum2 ...)  ->  (datum1 & (datum2 & (... & ())))
-
-This is so ubiquitous as to be hardly considered "syntax sugar" but is counted
-as such, since any list could just as well be written as a chain of pairs; both
-would result in an equivalent datum when parsed.
-
-The following table summarizes the other available transformations:
+The following table summarizes commonly useful syntax abbreviations:
 
     [...]   -> (#SQUARE ...)          #datum       -> (#HASH & datum)
 
@@ -588,18 +548,52 @@ Notes:
   further decoding of enclosed data.  This is not so, since quoting is related
   to code evaluation, not decoding.
 
+### Datum labels
+
+Valid data cannot be cyclic, since that would mean it has infinite length in
+bytes.  To externally represent a value with cyclic structure, one uses datum
+labels in the data encoding of the value.
+
+A datum label either wraps another datum to assign a number to it, or contains
+just a reference to a previous assignment.
+
+    +------------------+------------------------------+
+    | Syntax           | Internal datum structure     |
+    +------------------+------------------------------+
+    | #%<HEX>=<DATUM>  | (#LABEL <NUMBER> & <DATUM>)  |
+    +------------------+------------------------------+
+    | #%<HEX>%         | (#LABEL & <NUMBER>)          |
+    +------------------+------------------------------+
+
+In this visual, the token `<HEX>` stands for a hexadecimal digit sequence, the
+token `<DATUM>` stands for any other datum, and `<NUMBER>` is a stand-in for a
+number value; that which is represented by `<HEX>`.
+
+For clarity, concrete examples follow:
+
+    +-------------------+-------------------------------+
+    | Byte sequence     | Parse result                  |
+    +-------------------+-------------------------------+
+    | #%1234abcd=(foo)  | (#LABEL <0x1234abcd> & (foo)) |
+    +-------------------+-------------------------------+
+    | #%1234abcd%       | (#LABEL & <0x1234abcd>)       |
+    +-------------------+-------------------------------+
+
+Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
+with an integer value.  Note that the decoder may not accept a bare string here,
+meaning this syntax sugar is not merely an abbreviation.
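
Recognizing the two label forms comes down to reading `#%`, a run of hexadecimal digits, and then either `=` or `%`.  A minimal Python sketch follows; the function name, the returned tuple, and accepting both upper-case and lower-case digits are assumptions for illustration.

```python
import re

LABEL = re.compile(rb"#%([0-9a-fA-F]+)([=%])")


def parse_label_prefix(data, i=0):
    """Return (kind, number, next_index) for a datum label at data[i]."""
    m = LABEL.match(data, i)
    if m is None:
        raise ValueError("not a datum label")
    number = int(m.group(1).decode("ascii"), 16)
    kind = "assign" if m.group(2) == b"=" else "reference"
    return kind, number, m.end()
```

For the `assign` kind, the labeled datum would be parsed next, starting at the returned index.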
 
-## Shebang
+### Shebang
 
-There is one final "syntax sugar" translation whose sole purpose is to allow a
-shebang line at the start of a file:
+Finally, the parser recognizes the Unix *shebang* syntax and outputs a datum to
+hold the string values found within:
 
     #!interpreter          ->  (#SHBANG & interpreter)
 
     #!interpreter argline  ->  (#SHBANG interpreter & argline)
 
-Under default settings, the decoder will allow this datum to appear once at the
-beginning of a per-file decoding sequence, and simply discard it.
+When executing a script file, Zisp simply stores this into a global value that
+may be inspected if desired.
 
 
 <!--