Grammar, parser, and doc improvements.

author: Taylan Kammer <taylan.kammer@gmail.com> 2026-05-25 20:48:36 +0200
committer: Taylan Kammer <taylan.kammer@gmail.com> 2026-05-26 18:41:27 +0200
commit: fa5db8e89225622a1ee7a5d802f253d07884b13e
tree: d7b25178deac71dff00728134555c75f088ec101 /docs/c1
parent: 0f0cb85026406356e16310044b4d09bd316b0747
5 files changed, 509 insertions, 123 deletions
diff --git a/docs/c1/1-parse.md b/docs/c1/1-parse.md
index 6484cab..7df2225 100644
--- a/docs/c1/1-parse.md
+++ b/docs/c1/1-parse.md
@@ -1,169 +1,415 @@
-# Parser for Code & Data
+# Parser for Data
 
 *For an exact specification of the grammar, see [grammar](grammar/).*
 
-Zisp S-Expressions represent an extremely minimal set of data types; only that
-which is necessary to strategically construct more complex code and data:
+Zisp s-expressions represent an extremely minimal set of data types; only that
+which is necessary to strategically construct more complex values:
 
     +--------+-----------------+--------+----------+------+
     | TYPE   | String          | Rune   | Pair     | Nil  |
     +--------+-----------------+--------+----------+------+
-    | E.G.   | foo, |foo bar|  | #name  | (X & Y)  | ()   |
+    | E.G.   | foobar          | #name  | (X & Y)  | ()   |
+    |        | |foo bar|       |        |          |      |
+    |        | "foo bar"       |        |          |      |
+    |        | @_foo bar_      |        |          |      |
     +--------+-----------------+--------+----------+------+
 
+Datum comments and line comments are supported:
+
+* A semicolon followed by a tilde instructs the parser to consume one datum and
+  discard it.  Whitespace may appear between the tilde and the datum to discard.
+
+* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
+  discard bytes until a newline (ASCII Line Feed) is encountered.
+
 The parser can also output non-negative integers, but this is only used for
-datum labels; number literals are handled by the *decoder* instead.
+datum labels.  Number literals are handled by the decoder instead; see below.
 
 
-## Decoder
+## Overview
 
-A separate process called *decoding* can transform such data into more complex
-types.  For example, `(#HASH x y z)` could be decoded into an array, so the
-expression `#(x y z)` could work like in Scheme; or `(#SQUARE x y z)` could be
-decoded into a function call expression that will, at run-time, allocate and
-initialize a dynamic array with three elements, so the expression `[x y z]`
-would work like in JavaScript.
+This section explains a few core concepts and features related to the parser.
 
-Decoding also resolves datum labels, goes over strings to find ones that are
-actually a number literal, and takes care of a number of other transformations.
-This offloads complexity, allowing the parser to remain extremely simple.  See
-the dedicated documentation of the decoder for more.
 
+### Value vs. Datum
 
-## Syntax sugar
+A Zisp *value* that has an *external representation* in the form of a sequence
+of bytes is called a *datum*.  Every datum is a value, but not all values are
+data.  A datum is a value that can be printed out as a byte sequence which the
+parser can recognize and turn back into an equivalent datum.
+
+One may speak of an *external representation of a value* where the value is not
+itself a datum, but can be encoded as a datum.  The more strictly correct term
+for this is: "The external representation of a datum encoding the value."
+
+
+### Syntax sugar
 
 The parser recognizes various "syntax sugar" and transforms it into uses of the
-above listed minimal data types.  The most ubiquitous example is the list:
+above listed primitive data types.  As an example, the expression `#(x y z)` is
+parsed into the structure `(#HASH x y z)`.  These are two completely equivalent
+external representations for the same compound datum; after parsing, both byte
+sequences will yield data values that are indistinguishable in all but their
+memory address.
 
-    (datum1 datum2 ...)  ->  (datum1 & (datum2 & (... & ())))
+The most ubiquitously used syntax sugar is the list, which stands for a chain of
+pairs, terminated with nil:
 
-The following table summarizes the other transformations available:
+    (x y z)  ->  (x & (y & (z & ())))
 
-    "xyz"   -> (#QUOTE & |xyz|)       #datum       -> (#HASH & datum)
+The full syntax sugar table is listed and explained further below.
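
As a rough illustration of the list transformation shown above, the following Python sketch folds a sequence into a chain of pairs terminated with nil. The `Pair` class, the `NIL` stand-in, and the helper names are hypothetical and exist only for this example; they are not part of Zisp.

```python
from dataclasses import dataclass
from typing import Any

NIL = ()  # assumption: an empty tuple stands in for the Zisp nil value


@dataclass(frozen=True)
class Pair:
    first: Any
    second: Any


def list_to_pairs(items):
    """Fold a sequence into a chain of pairs terminated with NIL."""
    chain = NIL
    for item in reversed(items):
        chain = Pair(item, chain)
    return chain


def show(value):
    """Render a value in the fully desugared (x & y) notation."""
    if isinstance(value, Pair):
        return f"({show(value.first)} & {show(value.second)})"
    if value == NIL:
        return "()"
    return str(value)
```

Calling `show(list_to_pairs(["x", "y", "z"]))` renders the same nested structure as the right-hand side of the example above.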
 
-    ~_xyz_  -> (#TILDE & |xyz|)       #rune(...)   -> (#rune ...)
 
-    [...]   -> (#SQUARE ...)          dat1dat2     -> (#JOIN dat1 & dat2)
-                                 
-    {...}   -> (#BRACE ...)           dat1.dat2    -> (#DOT dat1 & dat2)
-                                 
-    'datum  -> (#QUOTE & datum)       dat1:dat2    -> (#COLON dat1 & dat2)
-                                 
-    `datum  -> (#GRAVE & datum)       #%hex=datum  -> (#LABEL hex & datum)
-                                 
-    ,datum  -> (#COMMA & datum)       #%hex%       -> (#LABEL & hex)
+### Decoder
 
-Notes about the table and examples:
+*The decoder has nothing to do with the concept of text or character encoding.*
 
-* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
-  means zero or more data; hex is a hexadecimal number of up to 12 digits.
+A separate process called *decoding* can transform Zisp data into values of more
+complex types, including values that are not of a datum type.
 
-* Strings can be quoted with pipes, like symbols in Scheme.  This is the "real"
-  string literal syntax, whereas using double quotes is syntax sugar for a
-  quoted string literal.
+For example, the datum `(#HASH x y z)` could be decoded into an array, so the
+expression `#(x y z)` could work like in Scheme.
 
-      |foo bar baz|  -> |foo bar baz|
+Decoding also resolves datum labels, goes over bare strings to find ones that
+represent a number literal, and takes care of a number of other transforms.
+This offloads complexity, allowing the parser to remain extremely simple.
 
-      "foo bar baz"  -> (#QUOTE & |foo bar baz|)
+See the dedicated documentation of the [decoder](2-decode.html) for more.
 
-* See the next section for an explanation of the tilde syntax, which implements
-  "raw" string literals.
 
-* The `#datum` form only applies when the datum following the hash sign is
-  anything other than a bare string (unquoted, without pipe symbol) since
-  otherwise this would be ambiguous with a rune literal.  A bare string can
-  nevertheless follow the hash sign by separating the two with a backslash:
+### Character encoding
 
-      #\string  ->  (#HASH & string)
+The parser does not consume characters; it consumes bytes.
 
-* Though not represented in the table due to notational difficulty, the form
-  `#rune(...)` doesn't require a list in the second position; any datum that
-  works with the `#datum` syntax also works with `#rune<DATUM>`.
+The grammar is generally built from bytes corresponding to ASCII characters.
+Some elements of the grammar, such as comments and quoted strings, may contain
+arbitrary byte sequences, until terminated.  These sequences may happen to be
+valid UTF-8 text.  This way, quoted strings and comments may contain Unicode
+text encoded in UTF-8, but the parser does not check these for validity.
 
-      #rune1#rune2  -> (#rune1 & #rune2)
+Since comments and quoted strings may contain arbitrary byte sequences, a text
+editor or other program displaying Zisp s-expressions may need to use a special
+visual representation for bytes that don't represent valid text.
 
-      #rune"text"   -> (#rune & "text")
+The parser being based on bytes rather than characters is not a limitation but
+rather a feature: It allows for Zisp s-expressions to be used as a structured
+data exchange format that may contain binary data elements without the need to
+encode these in Base64 or other such text representations of binary data.
+Consider the example:
 
-      #rune\string  -> (rune & string)
+    ((image.webp "<< binary data >>")
+     (video.webm "<< binary data >>"))
 
-      #rune'string  -> (#rune #QUOTE & string)
+All that needs to be done for this to work is to escape any incidental
+occurrences of the double-quote sign and the backslash sign with a backslash
+within the binary data; all other bytes can appear verbatim in the strings.
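
The escaping step described above could be sketched in Python roughly as follows. The function names are made up for illustration, and the sketch assumes that backslash and double-quote are the only two bytes that need escaping inside a double-quoted string, as the paragraph states.

```python
def escape_for_double_quotes(data: bytes) -> bytes:
    """Escape backslash and double-quote bytes; copy all other bytes verbatim."""
    out = bytearray()
    for byte in data:
        if byte in (0x22, 0x5C):  # '"' or '\'
            out.append(0x5C)      # prefix with a backslash
        out.append(byte)
    return bytes(out)


def quote_binary(data: bytes) -> bytes:
    """Wrap arbitrary bytes in double quotes (hypothetical helper)."""
    return b'"' + escape_for_double_quotes(data) + b'"'
```

Every other byte value, including NULL and bytes above 0x7F, passes through unchanged.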
 
-  As a counter-example, following a rune immediately with a bare string isn't
-  possible without the delimiting backslash, since that would be ambiguous:
 
-      #abcdefgh  ;Could be (#abcdef & gh) or (#abcde & fgh) or ...
+### Stream parsing
 
-* Syntax sugar can combine arbitrarily.  Some examples follow.  Any of these may
-  or may not actually have a meaning in code; many could simply end up producing
-  an error during decoding, or later interpretation of code.
+The parser can be repeatedly invoked on a byte stream to consume the next datum
+within.  This does not require "unreading" or back-seeking within the stream;
+the parser always reads a full datum, and stops after some byte which cleanly
+terminates the currently parsed datum.
 
-      #{...}            -> (#HASH #BRACE ...)
+This means Zisp s-expressions can be safely intermixed with other data within
+the same byte stream.  So long as the other data is consumed by some parser
+which similarly stops reading at a clear boundary, the Zisp parser can then
+continue operating on the same stream.  Consider the example:
 
-      #'foo             -> (#HASH #QUOTE & foo)
+    ("image.webp" 8273)
 
-      ##'[...]          -> (#HASH #HASH #QUOTE #SQUARE ...)
+    << 8273 bytes >>
 
-      {x y}[i j]        -> (#JOIN (#BRACE x y) #SQUARE i j)
+    ("video.webm" 736)
 
-      foo.bar.baz{x y}  -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)
+    << 736 bytes >>
 
-* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses
-  as `(#QUOTE & foo)` instead; the operand of `#QUOTE` is the entire cdr.
+The "header" for each file in this stream is a Zisp s-expression containing
+information about how many bytes should be read after the header, before the
+next file header appears.  (The header data need to be terminated with a blank
+ASCII character such as a newline.  The reason why the closing parenthesis does
+not act as a terminator unto itself will become apparent later.)
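
A consumer of such a stream might look roughly like the Python sketch below. It is only an approximation: a regular expression stands in for the real Zisp parser, each header is assumed to sit on one line ending in a Line Feed, the size is assumed to be written in decimal, and blank lines between records are skipped. The name `read_entries` is hypothetical.

```python
import io
import re

# Assumed header shape: ("name" SIZE) followed by a newline byte.
HEADER = re.compile(rb'\("([^"\\]*)"\s+(\d+)\)\s*\Z')


def read_entries(stream):
    """Yield (name, payload) pairs from a stream of header/payload records."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # skip blank separator lines
        match = HEADER.match(line)
        if match is None:
            raise ValueError(f"malformed header: {line!r}")
        name, size = match.group(1), int(match.group(2))
        payload = stream.read(size)     # consume exactly SIZE payload bytes
        if len(payload) != size:
            raise ValueError("truncated payload")
        yield name.decode("ascii"), payload
```

The key property mirrored here is that the header reader stops right after its terminating blank byte, so the payload reader can pick up at a known position in the same stream.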
 
-  The same principle is used when parsing other sugar; some examples follow:
 
-      Incorrect                              Correct
+### Datum labels
 
-      #(x y z) -> (#HASH (x y z))            #(x y z) -> (#HASH x y z)
+Valid data cannot be cyclic, since that would mean it has infinite length in
+bytes.  To externally represent a value with cyclic structure, one uses datum
+labels in the data encoding of the value.
 
-      [x y z]  -> (#SQUARE (x y z))          [x y z]  -> (#SQUARE x y z)
+A datum label either wraps another datum to assign a number to it, or contains
+just a reference to a previous assignment.
 
-      #{x}     -> (#HASH (#BRACE (x)))       #{x}     -> (#HASH #BRACE x)
+    +----------------------------------+---------------------------------+
+    | Internal structure               | External representation         |
+    +----------------------------------+---------------------------------+
+    | (#LABEL & (<NUMBER> & <DATUM>))  | #%<HEX>=<DATUM>                 |
+    +----------------------------------+---------------------------------+
+    | (#LABEL & <NUMBER>)              | #%<HEX>%                        |
+    +----------------------------------+---------------------------------+
 
-      foo(x y) -> (#JOIN foo (x y))          foo(x y) -> (#JOIN foo x y)
+In this visual, the token `<NUMBER>` stands for an actual number value that
+doesn't have its own external representation.  It's printed as a sequence of
+hexadecimal digits, denoted by `<HEX>` in the external representation.
 
-* Runes are case-sensitive, and the parser always emits runes using upper-case
-  letters when expressing syntax sugar.  Uppercase rune names are reserved for
-  Zisp's internal use and standard library; users can use lowercase runes with
-  custom meaning without worrying about clashes, with the exception of a small
-  number of lowercase runes such as `#true` and `#false` that are part of the
-  default decoder settings.
+For clarity, concrete examples follow:
 
+    #%1234abcd=(foo bar)  ->  (#LABEL & (<0x1234abcd> & (foo bar)))
 
-## Tilde strings
+    #%1234abcd%           ->  (#LABEL & <0x1234abcd>)
 
-There is a special type of syntax sugar for "raw" strings, meaning that no
-backslash escapes nor any other kind of escape sequence are recognized.
+Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
+with an integer value.
 
-This raw string syntax begins with a tilde, followed by any byte.  That byte
-becomes the termination marker, and the string cannot represent a literal
-occurrence of it, since there are no escape sequences.
+Datum labels may look like "syntax sugar" but the fact that integers don't have
+a direct external representation means that datum labels are a fundamental type
+of syntax that has no "desugared" equivalent in external representation.  The
+decoder will not accept a bare string encoding of an integer here.
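
To make the two label forms concrete, here is a small Python sketch that recognizes a label prefix at the start of a byte string. It is not the actual parser; `read_label` is a hypothetical name, and accepting both upper-case and lower-case hexadecimal digits is an assumption.

```python
import re

# Assumption: hex digits may be upper or lower case.
LABEL = re.compile(rb"#%([0-9a-fA-F]+)([=%])")


def read_label(source: bytes):
    """Return (kind, number, rest) for a leading datum label, else None."""
    match = LABEL.match(source)
    if match is None:
        return None
    number = int(match.group(1).decode("ascii"), 16)
    kind = "definition" if match.group(2) == b"=" else "reference"
    return kind, number, source[match.end():]
```

For a definition, the returned `rest` is the external representation of the wrapped datum, which a real parser would go on to parse.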
 
-    ~%foo \ bar%  ->  (#TILDE |foo \\ bar|)
 
-This can be useful, for instance, when representing regular expressions as
-quoted string literals in code:
+## Data types
 
-    ~/^foo\\(bar|baz)\.\[".*"\]$/     ;; matches e.g. foo\bar.["blah"]
+Following is an explanation of the four core data types constructed by the Zisp
+s-expression parser.
 
-Were it not for this syntax, this regular expression would need to be
-represented by the following quoted string literal in Zisp code:
+A Zisp value that is a member of one of these types is also called a *datum* if
+it adheres to additional constraints as explained for each type.
 
-    "^foo\\\\(bar|baz)\\t\\[\".*\"\\]$"
 
-Alternatively, imagine searching for certain MS Windows file paths:
+### String
+
+Strings can appear "bare" or be quoted in various ways.
+
+A string, as a stand-alone Zisp value, is only a valid datum if it can be
+represented as a bare string.  If it contains bytes that prevent the bare
+representation, then the string must be wrapped in one of the following
+structures to become a valid datum, each of which has its own external
+representation:
+
+    +-------------------------------+-------------------------------+
+    | Internal structure            | External representation       |
+    +-------------------------------+-------------------------------+
+    | (#PQSTR & <STRING>)           | |contents|                    |
+    +-------------------------------+-------------------------------+
+    | (#DQSTR & <STRING>)           | "contents"                    |
+    +-------------------------------+-------------------------------+
+    | (#ATSTR & <STRING>)           | @_contents_                   |
+    +-------------------------------+-------------------------------+
+
+The visual token `<STRING>` is meant to denote the actual string, as a Zisp
+value, occupying the second position in the pair.  It is not actual syntax.
+
+Note that, while conceptually similar, this internal encoding of string data is
+not syntax sugar, since the internal datum representation using runes cannot be
+printed out verbatim, due to the attached string being impossible to represent
+externally without quotation.  As such, quoted strings are fundamental syntax.
+
+These external representations of strings will be explained in more detail
+further below, including backslash escape sequences allowed within.
+
+Strings have a fixed length, counted in bytes.  Each byte can have any value,
+including zero (aka ASCII NULL).  The parser reads bytes, not characters, and
+has no concept of a character encoding, which means that a string can contain
+UTF-8 byte sequences, but these are not tested for validity.
+
+A string that is up to 64 bytes long is automatically *interned*, meaning any
+occurrence of the same string -- equal in length and containing the same byte
+values -- ends up being represented by the same bit-pattern; either a memory
+address, or an immediate representation within a CPU word for short strings.
+
+Strings with a length greater than 64 bytes end up being represented by a
+distinct memory address, even if they are equal in length and content.
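
The interning rule can be pictured with the following Python sketch, in which object identity plays the role of the bit-pattern. The `ByteString` class and `make_string` function are hypothetical, and a plain dictionary stands in for whatever table the real implementation uses.

```python
INTERN_LIMIT = 64  # bytes; strings up to this length are interned


class ByteString:
    """Hypothetical heap object standing in for a Zisp string value."""

    def __init__(self, data: bytes):
        self.data = data


_interned = {}


def make_string(data: bytes) -> ByteString:
    """Return a shared object for short strings, a fresh one for long strings."""
    if len(data) > INTERN_LIMIT:
        return ByteString(data)          # always a distinct object
    existing = _interned.get(data)
    if existing is None:
        existing = _interned[data] = ByteString(data)
    return existing
```

Under this model, comparing two short strings for equality reduces to an identity check, while long strings still need a byte-wise comparison.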
+
+
+### Rune
+
+A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must
+begin with a letter, and may only contain letters and digits.  This character
+sequence of letters and digits is called the *name* of the rune.  A rune that
+follows this constraint is valid as a datum.
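
The naming constraint above amounts to a short pattern. A minimal Python check, with the hypothetical name `is_valid_rune_name` and assuming that "letter" means an ASCII letter, might read:

```python
import re

# 1 to 6 bytes: one ASCII letter, then up to five ASCII letters or digits.
RUNE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,5}")


def is_valid_rune_name(name: str) -> bool:
    """Check whether `name` satisfies the rune naming constraint."""
    return RUNE_NAME.fullmatch(name) is not None
```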
+
+Zisp code may explicitly construct values of the rune type that violate the
+above constraints.  Such runes are not valid data and cannot be printed or
+parsed in any way.
+
+Runes are case-sensitive, and the parser always emits runes using upper-case
+letters when expressing syntax sugar.  Uppercase rune names are reserved for
+Zisp's internal use and standard library; users can use lowercase runes with
+custom meaning without worrying about clashes, with the exception of a small
+number of lowercase runes such as `#true` and `#false` that are part of the
+default decoder settings.
+
+Runes are always stored directly in a CPU word; never by memory address.
+
+
+### Pair
+
+A pair is a tuple of two values: the first value and the second value.
+
+The parser allocates a unique two-word cell in the process heap for every pair,
+and represents that pair through the memory address of that cell.
+
+Pairs are valid as a datum if one of the following holds true for the pair:
+
+* It encodes one of the quoted string variants.
+
+* It encodes a datum label (assignment or reference).
+
+* Both the first and the second value in the pair are themselves valid data.
+
+An additional constraint is that a hierarchy of pairs containing pairs must not
+form cycles; if they do, the cycles must be broken up by use of datum labels or
+else none of the pairs within the cyclic structure are a valid datum.
+
+
+### Nil
+
+The Zisp nil value is a singleton and a datum.  There is exactly one nil value
+and it is used to terminate a chain of pairs representing a list of values.
+
+
+## Quoted strings
+
+Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted.
+This section goes into the details of each variant.
+
+
+### Pipe-quoted
+
+Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers
+the parser to generate a pair with the structure:
+
+    (#PQSTR & <STRING>)                 ;; <STRING> is visual aid, not syntax
+
+The decoder, using default settings, would emit this string verbatim as a value.
+Then, during code evaluation, this would be seen as an identifier.  In this way,
+pipe-quoted strings are equivalent to bare strings in functionality.
+
+It is important to understand that the decoder sits between the parser and the
+[evaluator](3-execute.html), and in opposition to Lisp and Scheme tradition, it
+is common for the evaluator to receive values that are not valid as a datum.
+In this case, it receives a string unto itself that may not be a valid datum,
+because it may not be representable as a bare string.  Yet, it is valid as an
+identifier for the purposes of the evaluator, since it is a string *value* like
+any other.
+
+
+### Double-quoted
+
+Strings wrapped in the double-quote symbol parse into:
+
+    (#DQSTR & <STRING>)                 ;; <STRING> is visual aid, not syntax
+
+Under default settings, the decoder would transform this into a value which,
+when evaluated, yields back the string as a value.  Typically, this would be
+achieved by simply transforming it into `(#QUOTE & <STRING>)`.  (Note that,
+unlike `(#PQSTR & <STRING>)`, this would not be decoded into a string unto
+itself, as that would make the evaluator see it as an identifier.)
+
 
-    ~_C:\\\\User\\foo_                ;; matches C:\\User\foo
+### At-quoted strings AKA raw strings
 
-That's already ugly.  Without raw strings, it would need to look like this:
+There is a special type of syntax for "raw" strings, meaning that no backslash
+escapes nor any other kind of escape sequence are recognized within them.
 
-    "C:\\\\\\\\User\\\\foo"
+This raw string syntax begins with an at sign, followed by any byte.  That byte
+becomes the termination marker, and the string cannot contain an occurrence of
+it, since there are no escape sequences.
 
-Typically, the rune `#TILDE` would be treated as a synonym to `#QUOTE` by the
-decoder, though creative programmers could repurpose it.
+    @"foo \ bar"  ->  (#ATSTR & <STRING>)
 
+In the above, the visual token `<STRING>` is not part of datum syntax but a
+stand-in for the actual string value, which is, literally: `foo \ bar`
 
-## Newlines in strings
+This style of quoting can be useful, for instance, when representing regular
+expressions as strings in code:
+
+    @/^foo\\(bar|baz)\.\[".*"\]$/         ;; matches e.g. foo\bar.["blah"]
+
+Were it not for this syntax, this regular expression would only be possible to
+represent through a quoted string such as the following:
+
+    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"   ;; many backslashes
+
+Alternatively, imagine searching for certain MS Windows file paths:
+
+    @_C:\\\\Users\\([a-z]+)_              ;; matches C:\\Users\foo
+
+That's already ugly.  Without raw strings, it would need to look even worse:
+
+    "C:\\\\\\\\Users\\\\([a-z]+)"         ;; MANY backslashes
+
+The byte that follows the at sign need not be a printable character or even a
+valid ASCII byte; it can be absolutely any byte value, even NULL.  This can be
+useful to easily encode binary data which is known to not contain a specific
+byte; an example would be C strings which cannot contain NULL.
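
Reading an at-quoted string is simple enough to sketch in a few lines of Python. The function below is an illustration only; `read_at_string` and its return convention are assumptions, not the actual parser interface.

```python
def read_at_string(source: bytes, start: int = 0):
    """Parse an at-quoted string at `start`; return (contents, next_index)."""
    if source[start:start + 1] != b"@":
        raise ValueError("expected '@'")
    if start + 1 >= len(source):
        raise ValueError("missing terminator byte")
    terminator = source[start + 1:start + 2]   # any byte value at all
    begin = start + 2
    end = source.find(terminator, begin)       # no escapes: first match ends it
    if end == -1:
        raise ValueError("unterminated at-quoted string")
    return source[begin:end], end + 1
```

Because no escape sequences exist, a single forward search for the terminator byte is all that is needed.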
+
+
+### Backslash escape sequences in strings
+
+The following backslash escapes are supported in pipe-quoted and double-quoted
+strings.  (Some rows use Regular Expression notation.)
+
+    +-----------------------------------+------------------------------+
+    | Character(s) following backslash  | Meaning                      |
+    +-----------------------------------+------------------------------+
+    | \                                 | Literal backslash            |
+    +-----------------------------------+------------------------------+
+    | |                                 | Literal pipe symbol          |
+    +-----------------------------------+------------------------------+
+    | "                                 | Literal double-quote         |
+    +-----------------------------------+------------------------------+
+    | RE: /[\t ]*\n[\t ]*/              | Discarded                    |
+    +-----------------------------------+------------------------------+
+    | 0                                 | ASCII NULL                   |
+    +-----------------------------------+------------------------------+
+    | a                                 | ASCII Alert                  |
+    +-----------------------------------+------------------------------+
+    | b                                 | ASCII Backspace              |
+    +-----------------------------------+------------------------------+
+    | t                                 | ASCII Tab (Horizontal)       |
+    +-----------------------------------+------------------------------+
+    | n                                 | ASCII Newline (Line Feed)    |
+    +-----------------------------------+------------------------------+
+    | v                                 | ASCII Vertical Tab           |
+    +-----------------------------------+------------------------------+
+    | f                                 | ASCII Form Feed              |
+    +-----------------------------------+------------------------------+
+    | r                                 | ASCII Carriage Return        |
+    +-----------------------------------+------------------------------+
+    | e                                 | ASCII Escape                 |
+    +-----------------------------------+------------------------------+
+    | RE: /x([0-9a-fA-F]{2})+;/         | Arbitrary bytes in hex       |
+    +-----------------------------------+------------------------------+
+    | RE: /u[0-9a-fA-F]+;/              | Unicode scalar as UTF-8      |
+    +-----------------------------------+------------------------------+
+     
+To clarify:
+
+* A backslash followed by a backslash, pipe, or double-quote character is
+  substituted with a literal occurrence of the corresponding character.
+
+* A backslash followed by any number of blanks (space or tab), a newline, and
+  again any number of blanks, is substituted with nothing.  This is to allow
+  splitting a string into multiple lines for human readability.
+
+      (define paragraph "This paragraph has been visually split into multiple \
+                         lines, but the newline is escaped, so it's one line.")
+
+* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the
+  C programming language, representing common unprintable ASCII bytes.
+
+* An x, followed by pairs of hexadecimal digits (case insensitive), terminated
+  by a semicolon, is substituted with the sequence of bytes represented by the
+  corresponding pairs of hexadecimal digits.  E.g.: `"foo\xDEADBEEF;bar"`
+
+* A u, followed by a hexadecimal digit sequence (case insensitive), terminated
+  by a semicolon, is substituted with the canonical UTF-8 byte sequence for the
+  Unicode Scalar Value represented by that hexadecimal number.  The number must
+  be in the range `0` to `10FFFF`.  E.g.: `"foo\u00A0;bar"`
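
The escape rules listed above can be sketched as a small Python function operating on the contents of a quoted string (the bytes between the delimiters). This is an illustrative approximation, not the parser's implementation: the name `unescape` is hypothetical, raising an error on an unknown escape is an assumption, and surrogate code points are rejected on the reading that they are not Unicode scalar values.

```python
import re

SIMPLE = {
    ord("\\"): b"\\", ord("|"): b"|", ord('"'): b'"',
    ord("0"): b"\x00", ord("a"): b"\x07", ord("b"): b"\x08",
    ord("t"): b"\t", ord("n"): b"\n", ord("v"): b"\x0b",
    ord("f"): b"\x0c", ord("r"): b"\r", ord("e"): b"\x1b",
}
NEWLINE = re.compile(rb"[\t ]*\n[\t ]*")
HEX_BYTES = re.compile(rb"x((?:[0-9a-fA-F]{2})+);")
UNICODE = re.compile(rb"u([0-9a-fA-F]+);")


def unescape(body: bytes) -> bytes:
    """Expand backslash escapes in the contents of a quoted string."""
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != 0x5C:               # not a backslash: copy verbatim
            out.append(body[i])
            i += 1
            continue
        i += 1                            # skip the backslash
        if i < len(body) and body[i] in SIMPLE:
            out += SIMPLE[body[i]]
            i += 1
        elif m := NEWLINE.match(body, i):
            i = m.end()                   # escaped line break: emit nothing
        elif m := HEX_BYTES.match(body, i):
            out += bytes.fromhex(m.group(1).decode("ascii"))
            i = m.end()
        elif m := UNICODE.match(body, i):
            scalar = int(m.group(1).decode("ascii"), 16)
            if scalar > 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
                raise ValueError("not a Unicode scalar value")
            out += chr(scalar).encode("utf-8")
            i = m.end()
        else:
            raise ValueError(f"invalid escape at byte {i}")
    return bytes(out)
```

The two examples from the list above, `foo\xDEADBEEF;bar` and `foo\u00A0;bar`, expand to four raw bytes and to the two-byte UTF-8 encoding of U+00A0 respectively.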
+
+
+### Newlines in strings
 
 Normally, a newline in a string has no special meaning and simply becomes part
 of the string.  However, newlines can be backslash-escaped, which simply erases
@@ -178,7 +424,7 @@ with different intentions and meanings:
     (define paragraph "This paragraph has been visually split into multiple \
                        lines, but the newlines are escaped, so it's one line.")
 
-    (define json-object '|   ;; use '|| so we needn't escape "key" etc.
+    (define json-object '|         ;; use '|| so double-quotes need no escaping
       {
         "key": "value"
       }
@@ -200,31 +446,134 @@ or rhyme intended.)  Consider:
                |))
           (do-whatever))))
 
-The string bound to `json-object` has way more indentation than the programmer
-intended.  Should the parser attempt to solve this issue?
+The string bound to `json-object` has redundant indentation.  Should the parser
+attempt to solve this issue?
+
+Thankfully, we have the decoder to handle such complexities.  Under the default
+settings, the rune `#HASH` is bound to a decoder rule which detects a payload
+value that is a string literal, and implements the same algorithm as seen in
+Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
 
-Thankfully, we have the decoder.  The implementation of `#QUOTE` can simply
-implement a post-processing algorithm such as the one used for Java 15 text
-blocks feature: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
+Thus, we can do the following:
 
-The only feature Zisp cannot offer here is a way to fence off multi-line strings
-with a longer token such as `"""` as seen in Python or Java, or an arbitrary
-word as seen in Bourne shell and PHP "here doc" syntax.  For simplicity, the
-Zisp parser omits such features.
+    (let ((foo one))
+      (let ((bar two))
+        (let ((json-object #|
+    ...........  {
+    ...........    "key": "value"
+    ...........  }
+    ...........|))
+          (do-whatever))))
 
-That said, if a programmer truly wanted to have arbitrary text blocks in code,
-without needing to escape anything in them, it's possible to abuse the tilde
-string syntax by using it with an ASCII control character which is displayed
+(Dots represent whitespace that is deleted.  The initial newline is, as well.)
+
+The only feature Zisp does not offer is a way to fence off multi-line strings
+with a longer token such as `"""` as seen in Python and Java, or an arbitrary
+word as seen in Bourne shell and PHP "here doc" syntax.
+
+However, if a programmer truly wanted to have arbitrary text blocks in code,
+without needing to escape anything in them, it's possible to abuse at-quoted
+string syntax, using it with an ASCII control character which is displayed
 visibly by a text editor.  In the following, the characters `^\` are meant to
 represent a literal ASCII File Separator character in the source code:
 
-    (define json-object ~^\
+    (define json-object #@^\
       {
         "key": "value"
       }
       ^\)
 
-Hey, it works fine in Emacs, so why not??  (`C-q C-\` to insert the `^\`.)
+Hey, it works fine in Emacs, so why not?  Use `C-q C-\` to insert the `^\`.
+
+This is indeed quite an eldritch syntax, but hopefully most programs would not
+need to use it anyway.
+
+
+## Syntax sugar
+
+The parser recognizes various "syntax sugar" and transforms it into equivalent
+datum constructions.  The most ubiquitous example of this is the list, which is
+transformed into a chain of pairs, terminated with nil:
+
+    (datum1 datum2 ...)  ->  (datum1 & (datum2 & (... & ())))
+
+This is so ubiquitous as to be hardly considered "syntax sugar" but is counted
+as such, since any list could just as well be written as a chain of pairs; both
+would result in an equivalent datum when parsed.
+
+The following table summarizes the other available transformations:
+
+    [...]   -> (#SQUARE ...)          #datum       -> (#HASH & datum)
+
+    {...}   -> (#BRACE ...)           #rune(...)   -> (#rune ...)
+
+    'datum  -> (#QUOTE & datum)       dat1dat2     -> (#JOIN dat1 & dat2)
+
+    `datum  -> (#GRAVE & datum)       dat1.dat2    -> (#DOT dat1 & dat2)
+
+    ,datum  -> (#COMMA & datum)       dat1:dat2    -> (#COLON dat1 & dat2)
+
+Notes:
+
+* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
+  means zero or more data.
+
+* The `#datum` form only applies when the datum following the hash sign is
+  anything other than a bare string, since otherwise this would be ambiguous
+  with a rune literal.  A bare string can nevertheless follow the hash sign by
+  separating the two with a backslash:
+
+      #\string  ->  (#HASH & string)
+
+* Though not represented in the table due to notational difficulty, the form
+  `#rune(...)` doesn't require a list in the second position; any datum that
+  works with the `#datum` syntax also works with `#rune<DATUM>`.
+
+      #rune1#rune2  -> (#rune1 & #rune2)
+
+      #rune\string  -> (#rune & string)
+
+      #rune'string  -> (#rune #QUOTE & string)
+
+      #rune"string" -> (#rune #DQSTR & |string|)
+
+  As a counter-example, following a rune immediately with a bare string isn't
+  possible without the delimiting backslash, since that would be ambiguous:
+
+      #abcdefgh  ;Could be (#abcdef & gh) or (#abcde & fgh) or ...
+
+* Syntax sugar can combine arbitrarily.  Some examples follow.  Any of these may
+  or may not actually have a meaning in code; many could simply end up producing
+  an error during decoding, or later evaluation of code.
+
+      #{...}            -> (#HASH #BRACE ...)
+
+      #'foo             -> (#HASH #QUOTE & foo)
+
+      ##'[...]          -> (#HASH #HASH #QUOTE #SQUARE ...)
+
+      {x y}[i j]        -> (#JOIN (#BRACE x y) #SQUARE i j)
+
+      foo.bar.baz{x y}  -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)
+
+* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as
+  `(#QUOTE & foo)`; a single pair with the quoted datum in the second position.
+
+  The same principle is used when parsing other sugar; some examples follow:
+
+      Incorrect                              Correct
+
+      #(x y z) -> (#HASH (x y z))            #(x y z) -> (#HASH x y z)
+
+      [x y z]  -> (#SQUARE (x y z))          [x y z]  -> (#SQUARE x y z)
+
+      #{x}     -> (#HASH (#BRACE (x)))       #{x}     -> (#HASH #BRACE x)
+
+      foo(x y) -> (#JOIN foo (x y))          foo(x y) -> (#JOIN foo x y)
+
+* Those accustomed to Lisp and Scheme may expect `(#QUOTE ...)` to halt further
+  decoding of the enclosed data.  This is not so, since quoting concerns code
+  evaluation, not decoding.
 
 <!--
 ;; Local Variables:
diff --git a/docs/c1/grammar/abnf.txt b/docs/c1/grammar/abnf.txt
index 6daaceb..7424f41 100644
--- a/docs/c1/grammar/abnf.txt
+++ b/docs/c1/grammar/abnf.txt
@@ -19,7 +19,7 @@ Blank         = HTAB / LF / %x0b / %x0c / CR / SP / Comment
 Trail         = SkipLine / SkipUnit / ";" "~" *Blank
 
 
-Datum         = BareString / DottedStr / CladDatum / Rune / RuneStr
+Datum         = BareString / SpecialStr / CladDatum / Rune / RuneStr
               / RuneDotStr / RuneClad / LabelRef / LabelDef / HashStr
               / HashDotStr / HashClad / QuoteExpr / JoinExpr
 
@@ -36,7 +36,7 @@ AnyButLF      = %x00-09 / %x0b-ff
 
 BareString    = BareChar *( BareChar / Numeric )
 
-DottedStr     = ( "." / Numeric ) *( "." / Numeric / BareChar )
+SpecialStr    = SpecStrChar *( SpecStrChar / BareChar )
 
 CladDatum     = "|" *( PipeStrChar / "\" StringEsc ) "|"
               / DQUOTE *( QuotStrChar / "\" StringEsc ) DQUOTE
@@ -48,7 +48,7 @@ Rune          = "#" RuneName
 
 RuneStr       = "#" RuneName "\" BareString
 
-RuneDotStr    = "#" RuneName "\" DottedStr
+RuneDotStr    = "#" RuneName "\" SpecialStr
 
 RuneClad      = "#" RuneName CladDatum
 
@@ -58,7 +58,7 @@ LabelDef      = "#" "%" Label "=" Datum
 
 HashStr       = "#" "\" BareString
 
-HashDotStr    = "#" "\" DottedStr
+HashDotStr    = "#" "\" SpecialStr
 
 HashClad      = "#" CladDatum
 
@@ -73,10 +73,12 @@ JoinExpr      = Datum RJoinDatum
 
 
 BareChar      = "!" / "$" / "%" / "*" / "/" / "<" / "=" / ">"
-              / "?" / "@" / "^" / "_" / "~" / ALPHA
+              / "?" / "^" / "_" / "~" / ALPHA
 
 Numeric       = "+" / "-" / DIGIT
 
+SpecStrChar   = "." / ":" / Numeric
+
 PipeStrChar   = %x00-5b / %x5d-7b / %x7d-ff ; any but "|" or "\"
 
 QuotStrChar   = %x00-21 / %x23-5b / %x5d-ff ; any but DQUOTE or "\"
diff --git a/docs/c1/grammar/index.md b/docs/c1/grammar/index.md
index d70021a..8fefe0e 100644
--- a/docs/c1/grammar/index.md
+++ b/docs/c1/grammar/index.md
@@ -74,6 +74,12 @@ The following limits are not represented in the grammar:
    want to use the ABNF to generate a parser anyway.)
 
 
+## At-quoted strings
+
+The mechanism of at-quoted strings is not represented in any of the
+grammars, since it essentially has 256 variants.
+
+
 ## Stream-parsing strategy
 
 The parser consumes one `Unit` from the input stream every time it's
diff --git a/docs/c1/grammar/zbnf.txt b/docs/c1/grammar/zbnf.txt
index 551c319..002e027 100644
--- a/docs/c1/grammar/zbnf.txt
+++ b/docs/c1/grammar/zbnf.txt
@@ -22,7 +22,7 @@ SkipLine      : ( ~LF )* [LF]
 OneDatum      : BareString | CladDatum
 
 
-BareString    : ( '.' | '+' | '-' | DIGIT ) ( BareChar | '.' )*
+BareString    : SpecBareChar ( BareChar | JoinChar )*
               | BareChar+
 
 CladDatum     : PipeStr | QuoteStr | HashExpr | QuoteExpr | List
@@ -33,16 +33,17 @@ HashExpr      : '#' ( RuneExpr | LabelExpr | HashDatum )
 QuoteExpr     : "'" Datum | '`' Datum | ',' Datum
 List          : ParenList | SquareList | BraceList
 
+SpecBareChar  : '+' | '-' | JoinChar | DIGIT
+
 BareChar      : ALPHA | DIGIT
-              | '!' | '$' | '%' | '*' | '+'
-              | '-' | '/' | '<' | '=' | '>'
-              | '?' | '@' | '^' | '_' | '~'
+              | '!' | '$' | '%' | '*' | '+' | '-' | '/'
+              | '<' | '=' | '>' | '?' | '^' | '_' | '~'
 
 PipeStrChar   : ~( '|' | '\' )
 QuotStrChar   : ~( '"' | '\' )
 
 StringEsc     : '\' | '|' | '"' | ( HTAB | SP )* LF ( HTAB | SP )*
-              | 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e'
+              | '0' | 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e'
               | 'x' HexByte+ ';'
               | 'u' UnicodeSV ';'
 
diff --git a/docs/c1/index.md b/docs/c1/index.md
index f306e11..af01cea 100644
--- a/docs/c1/index.md
+++ b/docs/c1/index.md
@@ -1,2 +1,30 @@
 # Chapter 1: Genesis
 
+This chapter goes through the processes involved in reading source
+code, running it, and optionally compiling it.
+
+1. [Parse](1-parse.html)
+
+   The parser receives a stream of bytes and transforms them into a
+   minimal set of data types with very little processing.
+
+2. [Decode](2-decode.html)
+
+   The decoder runs configurable and extensible pre-processing steps
+   over data received from the parser, enriching it with more complex
+   data types, and handling primitive source code transforms.  It's
+   comparable to the C pre-processor or Lisp's `DEFMACRO` mechanism,
+   with a few more responsibilities, such as number literal parsing.
+
+3. [Execute](3-execute.html)
+
+   Code is executed (or interpreted, or evaluated) in an environment,
+   also called a module, which may be mutated, and linked with other
+   modules.  Execution is immediate, without any pre-compilation.
+
+4. [Compile](4-compile.html)
+
+   Procedures from within the compiler module can be used to demand
+   the compilation of other modules, with various options, yielding
+   static or dynamic object files.  These may be loaded immediately,
+   replacing the previously uncompiled module code in memory.
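
The pair notation used in the `1-parse.md` notes of this patch can be checked with a small model. The following is a minimal Python sketch, not part of the patch or of Zisp itself: `Pair`, `make_list`, `show`, and `sugar` are hypothetical names chosen for illustration, and strings and runes are modelled as plain Python strings. It only shows why `#(x y z)` reads as `(#HASH x y z)` rather than `(#HASH (x y z))`: the sugar builds one pair whose second position holds the datum, and a pair whose second position is a list prints as a longer list.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Pair:
    car: Any
    cdr: Any


NIL = None  # stands in for the empty list ()


def make_list(*items):
    """Build a proper list: nested pairs terminated by NIL."""
    result = NIL
    for item in reversed(items):
        result = Pair(item, result)
    return result


def show(datum):
    """Render a datum using the (X & Y) pair notation from the docs."""
    if datum is NIL:
        return "()"
    if not isinstance(datum, Pair):
        return str(datum)  # strings and runes are plain Python strings here
    parts = []
    while isinstance(datum, Pair):
        parts.append(show(datum.car))
        datum = datum.cdr
    if datum is not NIL:
        parts += ["&", show(datum)]  # improper tail, e.g. (#QUOTE & foo)
    return "(" + " ".join(parts) + ")"


def sugar(rune, datum):
    """Assumed model of prefix sugar: ONE pair, datum in the second position."""
    return Pair(rune, datum)


xyz = make_list("x", "y", "z")

print(show(sugar("#QUOTE", "foo")))                         # 'foo
print(show(sugar("#HASH", xyz)))                            # #(x y z), the "Correct" column
print(show(make_list("#HASH", xyz)))                        # the "Incorrect" column
print(show(sugar("#HASH", sugar("#BRACE", make_list("x")))))  # #{x}
```

Under these assumptions the four calls print `(#QUOTE & foo)`, `(#HASH x y z)`, `(#HASH (x y z))`, and `(#HASH #BRACE x)`, matching the table in the notes.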