author Taylan Kammer <taylan.kammer@gmail.com> 2026-06-01 21:49:37 +0200
committer Taylan Kammer <taylan.kammer@gmail.com> 2026-06-01 21:49:37 +0200
commit 724ac8ae394675a78c2977c6e35555b210256e01
tree d7f5574b49ec71341ea8079f18a33b9c17b60221 /doc/c1
parent 9ce0aa66cedc985322e06db4bac130910610c113
docs -> doc
Diffstat (limited to 'doc/c1')
-rw-r--r-- doc/c1/1-parse.md 611
-rw-r--r-- doc/c1/2-decode.md 44
-rw-r--r-- doc/c1/grammar/abnf.txt 141
-rw-r--r-- doc/c1/grammar/index.md 115
-rw-r--r-- doc/c1/grammar/peg.txt 93
-rw-r--r-- doc/c1/grammar/zbnf.txt 77
-rw-r--r-- doc/c1/index.md 30
7 files changed, 1111 insertions, 0 deletions
diff --git a/doc/c1/1-parse.md b/doc/c1/1-parse.md
new file mode 100644
index 0000000..4eb5776
--- /dev/null
+++ b/doc/c1/1-parse.md
@@ -0,0 +1,611 @@
+# Parser for Data
+
+*For an exact specification of the grammar, see [grammar](grammar/).*
+
+Zisp s-expressions represent an extremely minimal set of data types; only that
+which is necessary to strategically construct more complex values:
+
+    +--------+-----------------+--------+----------+------+
+    | TYPE   | String          | Rune   | Pair     | Nil  |
+    +--------+-----------------+--------+----------+------+
+    | E.G.   | foobar          | #name  | (X & Y)  | ()   |
+    |        | |foo bar|       |        |          |      |
+    |        | "foo bar"       |        |          |      |
+    |        | @_foo bar_      |        |          |      |
+    +--------+-----------------+--------+----------+------+
+
+Datum comments and line comments are supported:
+
+* A semicolon followed by a tilde instructs the parser to consume one datum and
+ discard it. Whitespace may appear between the tilde and the datum to discard.
+
+* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
+ discard bytes until a newline (ASCII Line Feed) is encountered.
+
+The parser can also output non-negative integers, but this is only used for
+datum labels; number literals are handled by the decoder instead; see below.
+
+
+## Overview
+
+This section explains a few core concepts and features related to the parser.
+
+
+### Value vs. Datum
+
+A Zisp *value* that has an *external representation* in the form of a sequence
+of bytes is called a *datum*. Every datum is a value, but not all values are
+data. A datum is a value that can be printed out as a byte sequence which the
+parser can recognize and turn back into an equivalent datum.
+
+One may speak of an *external representation of a value* where the value is not
+itself a datum, but can be encoded as a datum. The more strictly correct term
+for this is: "The external representation of a datum encoding the value."
+
+
+### Syntax sugar
+
+The parser recognizes various "syntax sugar" and transforms it into uses of the
+above listed primitive data types. As an example, the expression `#(x y z)` is
+parsed into the structure `(#HASH x y z)`. These are two completely equivalent
+external representations for the same compound datum; after parsing, both byte
+sequences will yield data values that are indistinguishable in all but their
+memory address.
+
+The most ubiquitously used syntax sugar is the list, which stands for a chain of
+pairs, terminated with nil:
+
+ (x y z) -> (x & (y & (z & ())))
+
+The full syntax sugar table is listed and explained further below.
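
The list desugaring above can be sketched in a few lines of Python. This is an illustration only, not Zisp's implementation: the names `NIL`, `list_to_pairs`, and `show` are hypothetical, pairs are modelled as 2-tuples, and nil as the empty tuple.

```python
NIL = ()  # hypothetical stand-in for the Zisp nil value


def list_to_pairs(items, tail=NIL):
    """Fold a sequence into right-nested pairs (2-tuples) ending in `tail`."""
    result = tail
    for item in reversed(items):
        result = (item, result)
    return result


def show(datum):
    """Print a datum in the fully desugared `(X & Y)` pair notation."""
    if datum == NIL:
        return "()"
    if isinstance(datum, tuple):
        return f"({show(datum[0])} & {show(datum[1])})"
    return str(datum)


print(show(list_to_pairs(["x", "y", "z"])))
```

Building the chain from the right means each element is wrapped exactly once, which mirrors how the chain is terminated with nil.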
+
+
+### Decoder
+
+*The decoder has nothing to do with the concept of text or character encoding.*
+
+A separate process called *decoding* can transform Zisp data into values of more
+complex types, including values that are not of a datum type.
+
+For example, the datum `(#HASH x y z)` could be decoded into an array, so the
+expression `#(x y z)` could work like in Scheme.
+
+Decoding also resolves datum labels, goes over bare strings to find ones that
+represent a number literal, and takes care of a number of other transforms.
+This offloads complexity, allowing the parser to remain extremely simple.
+
+See the dedicated documentation of the [decoder](2-decode.html) for more.
+
+
+### Character encoding
+
+The parser does not consume characters; it consumes bytes.
+
+The grammar is generally built from bytes corresponding to ASCII characters.
+Some elements of the grammar, such as comments and quoted strings, may contain
+arbitrary byte sequences, until terminated. These sequences may happen to be
+valid UTF-8 text. This way, quoted strings and comments may contain Unicode
+text encoded in UTF-8, but the parser does not check these for validity.
+
+Since comments and quoted strings may contain arbitrary byte sequences, a text
+editor or other program displaying Zisp s-expressions may need to use a special
+visual representation for bytes that don't represent valid text.
+
+The parser being based on bytes rather than characters is not a limitation but
+rather a feature: It allows for Zisp s-expressions to be used as a structured
+data exchange format that may contain binary data elements without the need to
+encode these in Base64 or other such text representations of binary data.
+Consider the example:
+
+ ((image.webp "<< binary data >>")
+ (video.webm "<< binary data >>"))
+
+For this to work, every incidental occurrence of the double-quote byte or the
+backslash byte within the binary data must be escaped with a backslash; all
+other bytes can appear verbatim in the strings.
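
A writer for such binary payloads could look roughly like the following Python sketch. The function name `escape_for_dqstr` is an assumption made for illustration; it only applies the escaping rule described above and is not taken from Zisp.

```python
def escape_for_dqstr(payload: bytes) -> bytes:
    """Wrap arbitrary bytes as a double-quoted string literal.

    Only the double-quote (0x22) and backslash (0x5C) bytes are escaped;
    every other byte, including NUL, is copied verbatim.
    """
    out = bytearray(b'"')
    for byte in payload:
        if byte in (0x22, 0x5C):
            out.append(0x5C)  # prefix with a backslash
        out.append(byte)
    out.append(0x22)
    return bytes(out)


print(escape_for_dqstr(b'RIFF"\\\x00'))
```

In the worst case (a payload made only of quotes and backslashes) the output is twice the input size plus two bytes, which is still a fixed, predictable bound, unlike a text re-encoding of the whole payload.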
+
+
+### Stream parsing
+
+The parser can be repeatedly invoked on a byte stream to consume the next datum
+within. This does not require "unreading" or back-seeking within the stream;
+the parser always reads a full datum, and stops after some byte which cleanly
+terminates the currently parsed datum.
+
+This means Zisp s-expressions can be safely intermixed with other data within
+the same byte stream. So long as the other data is consumed by some parser
+which similarly stops reading at a clear boundary, the Zisp parser can then
+continue operating on the same stream. Consider the example:
+
+ ("image.webp" 8273)
+
+ << 8273 bytes >>
+
+ ("video.webm" 736)
+
+ << 736 bytes >>
+
+The "header" for each file in this stream is a Zisp s-expression containing
+information about how many bytes should be read after the header, before the
+next file header appears. (The header data need to be terminated with a blank
+ASCII character such as a newline. The reason why the closing parenthesis does
+not act as a terminator unto itself will become apparent later.)
+
+#### Buffering
+
+To enable the aforementioned stream parsing strategy, the parser does not use
+automatic buffering. If it did, it might inadvertently consume some bytes
+beyond the currently parsed datum, leaving the stream inconsistent.
+
+The parser could provide access to its buffer, such that one could access the
+unused bytes, but it's simpler and more flexible to let buffering be handled
+externally from the parser.
+
+In other words: If the parser is meant to be used on an I/O stream connected to
+expensive system calls, such as a file handle or network socket, it's best to
+wrap that stream in some intermediate object which asks the system for large
+chunks of data at once, and stores the data in a buffer.
+
+
+### Datum labels
+
+Valid data cannot be cyclic, since that would mean it has infinite length in
+bytes. To externally represent a value with cyclic structure, one uses datum
+labels in the data encoding of the value.
+
+A datum label either wraps another datum to assign a number to it, or contains
+just a reference to a previous assignment.
+
+    +----------------------------------+---------------------------------+
+    | Internal structure               | External representation         |
+    +----------------------------------+---------------------------------+
+    | (#LABEL & (<NUMBER> & <DATUM>))  | #%<HEX>=<DATUM>                 |
+    +----------------------------------+---------------------------------+
+    | (#LABEL & <NUMBER>)              | #%<HEX>%                        |
+    +----------------------------------+---------------------------------+
+
+In this visual, the token `<NUMBER>` stands for an actual number value that
+doesn't have its own external representation. It's printed as a sequence of
+hexadecimal digits, denoted by `<HEX>` in the external representation.
+
+For clarity, concrete examples follow:
+
+ #%1234abcd=(foo bar) -> (#LABEL & (<0x1234abcd> & (foo bar)))
+
+ #%1234abcd% -> (#LABEL & <0x1234abcd>)
+
+Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
+with an integer value.
+
+Datum labels may look like "syntax sugar" but the fact that integers don't have
+a direct external representation means that datum labels are a fundamental type
+of syntax that has no "desugared" equivalent in external representation. The
+decoder will not accept a bare string encoding of an integer here.
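
As an illustration of the two label forms, the following Python sketch recognizes a label prefix at the start of a byte string. The helper `read_label` and its return shape are hypothetical; the limit of 1 to 12 hexadecimal digits is taken from the `Label` rule in `grammar/abnf.txt`. It does not parse the datum that follows an assignment.

```python
import re

# "#%" then 1..12 hex digits, then "=" (assignment) or "%" (reference).
LABEL_RE = re.compile(rb"#%([0-9A-Fa-f]{1,12})([=%])")


def read_label(data: bytes):
    """Return (kind, number, rest) for a leading datum label, else None."""
    match = LABEL_RE.match(data)
    if match is None:
        return None
    number = int(match.group(1).decode("ascii"), 16)
    kind = "assignment" if match.group(2) == b"=" else "reference"
    return kind, number, data[match.end():]


print(read_label(b"#%1234abcd=(foo bar)"))
print(read_label(b"#%1234abcd%"))
```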
+
+
+## Data types
+
+Following is an explanation of the four core data types constructed by the Zisp
+s-expression parser.
+
+A Zisp value that is a member of one of these types is also called a *datum* if
+it adheres to additional constraints as explained for each type.
+
+
+### String
+
+Strings can appear "bare" or be quoted in various ways.
+
+A string, as a stand-alone Zisp value, is only a valid datum if it can be
+represented as a bare string. If it contains bytes that prevent the bare
+representation, then the string must be wrapped in one of the following
+structures to become a valid datum, each of which has its own external
+representation:
+
+    +-------------------------------+-------------------------------+
+    | Internal structure            | External representation       |
+    +-------------------------------+-------------------------------+
+    | (#PQSTR & <STRING>)           | |contents|                    |
+    +-------------------------------+-------------------------------+
+    | (#DQSTR & <STRING>)           | "contents"                    |
+    +-------------------------------+-------------------------------+
+    | (#ATSTR & <STRING>)           | @_contents_                   |
+    +-------------------------------+-------------------------------+
+
+The visual token `<STRING>` is meant to denote the actual string, as a Zisp
+value, occupying the second position in the pair. It is not actual syntax.
+
+Note that, while conceptually similar, this internal encoding of string data is
+not syntax sugar, since the internal datum representation using runes cannot be
+printed out verbatim, due to the attached string being impossible to represent
+externally without quotation. As such, quoted strings are fundamental syntax.
+
+These external representations of strings will be explained in more detail
+further below, including backslash escape sequences allowed within.
+
+Strings have a fixed length, counted in bytes. Each byte can have any value,
+including zero (aka ASCII NULL). The parser reads bytes, not characters, and
+has no concept of a character encoding, which means that a string can contain
+UTF-8 byte sequences, but these are not tested for validity.
+
+A string that is up to 255 bytes long is automatically *interned*, meaning any
+occurrence of the same string -- equal in length and containing the same byte
+values -- ends up being represented by the same bit-pattern; either a memory
+address, or an immediate representation within a CPU word for short strings.
+
+Strings with a length greater than 255 bytes end up being represented by a
+distinct memory address, even if they are equal in length and content.
+
+
+### Rune
+
+A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must
+begin with a letter, and may only contain letters and digits. This character
+sequence of letters and digits is called the *name* of the rune. A rune that
+follows this constraint is valid as a datum.
+
+Zisp code may explicitly construct values of the rune type that violate the
+above constraints. Such runes are not valid data and cannot be printed or
+parsed in any way.
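
The naming constraint can be expressed as a single pattern. The Python sketch below is only an illustration; `is_valid_rune_name` is a hypothetical helper, and the pattern mirrors the `RuneName` rule in `grammar/abnf.txt` (an ASCII letter followed by up to five ASCII letters or digits).

```python
import re

# One ASCII letter, then up to five more ASCII letters or digits.
RUNE_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{0,5}")


def is_valid_rune_name(name: str) -> bool:
    """True if `name` (without the leading '#') is a valid datum rune name."""
    return RUNE_NAME_RE.fullmatch(name) is not None


for candidate in ["HASH", "u8", "abcdef", "abcdefg", "8u", ""]:
    print(repr(candidate), is_valid_rune_name(candidate))
```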
+
+Runes are case-sensitive, and the parser always emits runes using upper-case
+letters when expressing syntax sugar. Uppercase rune names are reserved for
+Zisp's internal use and standard library; users can use lowercase runes with
+custom meaning without worrying about clashes, with the exception of a small
+number of lowercase runes such as `#true` and `#false` that are part of the
+default decoder settings.
+
+Runes are always stored directly in a CPU word; never by memory address.
+
+
+### Pair
+
+A pair is a tuple of two values: the first value and the second value.
+
+The parser allocates a unique two-word cell in the process heap for every pair,
+and represents that pair through the memory address of that cell.
+
+Pairs are valid as a datum if one of the following holds true for the pair:
+
+* It encodes one of the quoted string variants.
+
+* It encodes a datum label (assignment or reference).
+
+* Both the first and the second value in the pair are themselves valid data.
+
+An additional constraint is that a hierarchy of pairs containing pairs must not
+form cycles; if they do, the cycles must be broken up by use of datum labels or
+else none of the pairs within the cyclic structure is a valid datum.
+
+
+### Nil
+
+The Zisp nil value is a singleton and a datum. There is exactly one nil value
+and it is used to terminate a chain of pairs representing a list of values.
+
+
+## Quoted strings
+
+Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted.
+This section goes into the details of each variant.
+
+
+### Pipe-quoted
+
+Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers
+the parser to generate a pair with the structure:
+
+ (#PQSTR & <STRING>) ;; <STRING> is visual aid, not syntax
+
+The decoder, using default settings, would emit this string verbatim as a value.
+Then, during code evaluation, this would be seen as an identifier. In this way,
+pipe-quoted strings are equivalent to bare strings in functionality.
+
+It is important to understand that the decoder sits between the parser and the
+[evaluator](3-execute.html). Contrary to Lisp and Scheme tradition, it is
+common for the evaluator to receive values that are not valid as a datum; in
+this case, a string unto itself that may not be a valid datum because it cannot
+be represented as a bare string. Yet, it is valid as an identifier for the
+purposes of the evaluator, since it is a string *value* like any other.
+
+
+### Double-quoted
+
+Strings wrapped in the double-quote symbol parse into:
+
+ (#DQSTR & <STRING>) ;; <STRING> is visual aid, not syntax
+
+Under default settings, the decoder would transform this into a value which,
+when evaluated, yields back the string as a value. Typically, this would be
+achieved by simply transforming it into `(#QUOTE & <STRING>)`. (Note that,
+unlike `(#PQSTR & <STRING>)`, this would not be decoded into a string unto
+itself, as that would make the evaluator see it as an identifier.)
+
+
+### At-quoted strings AKA raw strings
+
+There is a special type of syntax for "raw" strings, meaning that neither
+backslash escapes nor any other kind of escape sequence is recognized in them.
+
+This raw string syntax begins with an at sign, followed by any byte. That byte
+becomes the termination marker, and the string cannot contain an occurrence of
+it, since there are no escape sequences.
+
+ @"foo \ bar" -> (#ATSTR & <STRING>)
+
+In the above, the visual token `<STRING>` is not part of datum syntax but a
+stand-in for the actual string value, which is, literally: `foo \ bar`
+
+This style of quoting can be useful, for instance, when representing regular
+expressions as strings in code:
+
+ @/^foo\\(bar|baz)\.\[".*"\]$/ ;; matches e.g. foo\bar.["blah"]
+
+Were it not for this syntax, this regular expression would only be possible to
+represent through a quoted string such as the following:
+
+    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"   ;; many backslashes
+
+Alternatively, imagine searching for certain MS Windows file paths:
+
+    @_C:\\\\Users\\([a-z]+)_   ;; matches C:\\Users\foo
+
+That's already ugly. Without raw strings, it would need to look even worse:
+
+ "C:\\\\\\\\Users\\\\([a-z]+)" ;; MANY backslashes
+
+The byte that follows the at sign need not be a printable character or even a
+valid ASCII byte; it can be absolutely any byte value, even NULL. This can be
+useful to easily encode binary data which is known to not contain a specific
+byte; an example would be C strings which cannot contain NULL.
+
+
+### Backslash escape sequences in strings
+
+The following backslash escapes are supported in pipe-quoted and double-quoted
+strings. (Some rows use Regular Expression notation.)
+
+    +-----------------------------------+------------------------------+
+    | Character(s) following backslash  | Meaning                      |
+    +-----------------------------------+------------------------------+
+    | \                                 | Literal backslash            |
+    +-----------------------------------+------------------------------+
+    | |                                 | Literal pipe symbol          |
+    +-----------------------------------+------------------------------+
+    | "                                 | Literal double-quote         |
+    +-----------------------------------+------------------------------+
+    | RE: /[\t ]*\n[\t ]*/              | Discarded                    |
+    +-----------------------------------+------------------------------+
+    | 0                                 | ASCII NULL                   |
+    +-----------------------------------+------------------------------+
+    | a                                 | ASCII Alert                  |
+    +-----------------------------------+------------------------------+
+    | b                                 | ASCII Backspace              |
+    +-----------------------------------+------------------------------+
+    | t                                 | ASCII Tab (Horizontal)       |
+    +-----------------------------------+------------------------------+
+    | n                                 | ASCII Newline (Line Feed)    |
+    +-----------------------------------+------------------------------+
+    | v                                 | ASCII Vertical Tab           |
+    +-----------------------------------+------------------------------+
+    | f                                 | ASCII Form Feed              |
+    +-----------------------------------+------------------------------+
+    | r                                 | ASCII Carriage Return        |
+    +-----------------------------------+------------------------------+
+    | e                                 | ASCII Escape                 |
+    +-----------------------------------+------------------------------+
+    | RE: /x([0-9a-fA-F]{2})*;/         | Arbitrary bytes in hex       |
+    +-----------------------------------+------------------------------+
+    | RE: /u[0-9a-fA-F]+;/              | Unicode scalar as UTF-8      |
+    +-----------------------------------+------------------------------+
+
+To clarify:
+
+* A backslash followed by a backslash, pipe, or double-quote character is
+ substituted with a literal occurrence of the corresponding character.
+
+* A backslash followed by any number of blanks (space or tab), a newline, and
+ again any number of blanks, is substituted with nothing. This is to allow
+ splitting a string into multiple lines for human readability.
+
+ (define paragraph "This paragraph has been visually split into multiple \
+ lines, but the newline is escaped, so it's one line.")
+
+* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the
+  C programming language (where `\e` is a common compiler extension rather than
+  standard C), representing common unprintable ASCII bytes.
+
+* An x, followed by pairs of hexadecimal digits (case insensitive), terminated
+ by a semicolon, is substituted with the sequence of bytes represented by the
+ corresponding pairs of hexadecimal digits. E.g.: `"foo\xDEADBEEF;bar"`
+
+* A u, followed by a hexadecimal digit sequence (case insensitive), terminated
+ by a semicolon, is substituted with the canonical UTF-8 byte sequence for the
+ Unicode Scalar Value represented by that hexadecimal number. The number must
+ be in the range `0` to `10FFFF`. E.g.: `"foo\u00A0;bar"`
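
The rules above can be condensed into a small decoder. The following Python sketch is an illustration under stated assumptions, not the Zisp parser: `unescape` is a hypothetical name, it takes the bytes between the quotes, and it rejects surrogate code points only because Python's UTF-8 encoder does so.

```python
import re

SIMPLE = {
    ord("\\"): b"\\", ord("|"): b"|", ord('"'): b'"',
    ord("0"): b"\x00", ord("a"): b"\x07", ord("b"): b"\x08",
    ord("t"): b"\t", ord("n"): b"\n", ord("v"): b"\x0b",
    ord("f"): b"\x0c", ord("r"): b"\r", ord("e"): b"\x1b",
}
BLANK = b" \t"


def unescape(body: bytes) -> bytes:
    """Resolve backslash escapes in the contents of a quoted string."""
    out = bytearray()
    i, n = 0, len(body)
    while i < n:
        byte = body[i]
        i += 1
        if byte != 0x5C:              # not a backslash: copy verbatim
            out.append(byte)
            continue
        if i >= n:
            raise ValueError("dangling backslash")
        code = body[i]
        if code in SIMPLE:            # single-character escapes
            out += SIMPLE[code]
            i += 1
        elif code in (0x78, 0x75):    # 'x' or 'u', terminated by ';'
            end = body.find(b";", i)
            if end < 0:
                raise ValueError("unterminated hex escape")
            digits = body[i + 1:end].decode("ascii")
            if code == 0x78:
                if not re.fullmatch(r"(?:[0-9a-fA-F]{2})*", digits):
                    raise ValueError("bad \\x escape")
                out += bytes.fromhex(digits)
            else:
                if not re.fullmatch(r"[0-9a-fA-F]+", digits):
                    raise ValueError("bad \\u escape")
                scalar = int(digits, 16)
                if scalar > 0x10FFFF:
                    raise ValueError("scalar out of range")
                out += chr(scalar).encode("utf-8")
            i = end + 1
        else:                         # blanks, newline, blanks: discarded
            j = i
            while j < n and body[j] in BLANK:
                j += 1
            if j >= n or body[j] != 0x0A:
                raise ValueError("unknown escape")
            j += 1
            while j < n and body[j] in BLANK:
                j += 1
            i = j
    return bytes(out)


print(unescape(rb"foo\xDEADBEEF;bar"))
```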
+
+
+### Newlines in strings
+
+Normally, a newline in a string has no special meaning and simply becomes part
+of the string. However, newlines can be backslash-escaped, which simply erases
+them; the escaped newline can also be preceded or followed by any number of tab
+and space characters, which are all stripped as well. (Note: It's not blanks
+preceding the backslash that are stripped, but blanks following the backslash
+and preceding the newline; i.e., blanks at the end of the line.)
+
+Following are some examples of how multi-line strings can appear in source code
+with different intentions and meanings:
+
+ (define paragraph "This paragraph has been visually split into multiple \
+ lines, but the newlines are escaped, so it's one line.")
+
+    ;; use '|| so double-quotes need no escaping
+    (define json-object '|
+      {
+        "key": "value"
+      }
+    |)
+
+The second example is actually slightly problematic. It begins with a newline,
+which may be undesirable, but escaping that newline would cause the first line
+to have no indentation, thus the opening `{` would not line up with the closing
+`}` when this string is printed out. Further, if the entire block of code is
+indented, then the string contents may be more indented than intended. (No pun
+or rhyme intended.) Consider:
+
+    (let ((foo one))
+      (let ((bar two))
+        (let ((json-object '|
+          {
+            "key": "value"
+          }
+        |))
+          (do-whatever))))
+
+The string bound to `json-object` has redundant indentation. Should the parser
+attempt to solve this issue?
+
+Thankfully, we have the decoder to handle such complexities. Under the default
+settings, the rune `#HASH` is bound to a decoder rule which detects a payload
+value that is a string literal, and implements the same algorithm as seen in
+Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
+
+Thus, we can do the following:
+
+ (let ((foo one))
+ (let ((bar two))
+ (let ((json-object #|
+ ........... {
+ ........... "key": "value"
+ ........... }
+ ...........|))
+ (do-whatever))))
+
+(Dots represent whitespace that is deleted. The initial newline is, as well.)
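
The indentation handling can be approximated as follows. This Python sketch is a simplified reading of the JEP 378 algorithm (it ignores escape processing and line-terminator normalization), operates on `str` rather than bytes for brevity, and uses the hypothetical name `strip_block_indent`; the actual decoder rule may differ.

```python
def strip_block_indent(text: str) -> str:
    """Remove the initial newline and the common leading indentation.

    Significant lines are all non-blank lines plus the last line, so the
    position of the closing delimiter also limits how much is stripped.
    """
    lines = text.split("\n")
    if lines and lines[0].strip(" \t") == "":
        lines = lines[1:]  # drop the newline right after the opening quote
    if not lines:
        return ""
    significant = [ln for ln in lines[:-1] if ln.strip(" \t")] + [lines[-1]]
    indent = min(len(ln) - len(ln.lstrip(" \t")) for ln in significant)
    # Strip the common indentation and any trailing blanks on each line.
    return "\n".join(ln[indent:].rstrip(" \t") for ln in lines)


block = (
    "\n"
    + " " * 11 + "{\n"
    + " " * 13 + '"key": "value"\n'
    + " " * 11 + "}\n"
    + " " * 11
)
print(strip_block_indent(block))
```

Because the last line counts as significant even when it is blank, moving the closing delimiter further left keeps some indentation in the resulting string.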
+
+The only feature Zisp does not offer is a way to fence off multi-line strings
+with a longer token such as `"""` as seen in Python and Java, or an arbitrary
+word as seen in Bourne shell and PHP "here doc" syntax.
+
+However, if a programmer truly wanted to have arbitrary text blocks in code,
+without needing to escape anything in them, it's possible to abuse at-quoted
+string syntax, using it with an ASCII control character which is displayed
+visibly by a text editor. In the following, the characters `^\` are meant to
+represent a literal ASCII File Separator character in the source code:
+
+ (define json-object #@^\
+ {
+ "key": "value"
+ }
+ ^\)
+
+Hey, it works fine in Emacs, so why not? Use `C-q C-\` to insert the `^\`.
+
+This is indeed quite an eldritch syntax, but hopefully most programs would not
+need to use it anyway.
+
+
+## Syntax sugar
+
+The parser recognizes various "syntax sugar" and transforms it into equivalent
+datum constructions. The most ubiquitous example of this is the list, which is
+transformed into a chain of pairs, terminated with nil:
+
+ (datum1 datum2 ...) -> (datum1 & (datum2 & (... & ())))
+
+This is so ubiquitous as to be hardly considered "syntax sugar" but is counted
+as such, since any list could just as well be written as a chain of pairs; both
+would result in an equivalent datum when parsed.
+
+The following table summarizes the other available transformations:
+
+    [...]   ->  (#SQUARE ...)         #datum      ->  (#HASH & datum)
+
+    {...}   ->  (#BRACE ...)          #rune(...)  ->  (#rune ...)
+
+    'datum  ->  (#QUOTE & datum)      dat1dat2    ->  (#JOIN dat1 & dat2)
+
+    `datum  ->  (#GRAVE & datum)      dat1.dat2   ->  (#DOT dat1 & dat2)
+
+    ,datum  ->  (#COMMA & datum)      dat1:dat2   ->  (#COLON dat1 & dat2)
+
+Notes:
+
+* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
+ means zero or more data.
+
+* The `#datum` form only applies when the datum following the hash sign is
+ anything other than a bare string, since otherwise this would be ambiguous
+ with a rune literal. A bare string can nevertheless follow the hash sign by
+ separating the two with a backslash:
+
+ #\string -> (#HASH & string)
+
+* Though not represented in the table due to notational difficulty, the form
+ `#rune(...)` doesn't require a list in the second position; any datum that
+ works with the `#datum` syntax also works with `#rune<DATUM>`.
+
+ #rune1#rune2 -> (#rune1 & #rune2)
+
+      #rune\string -> (#rune & string)
+
+ #rune'string -> (#rune #QUOTE & string)
+
+      #rune"string" -> (#rune #DQSTR & <STRING>)
+
+ As a counter-example, following a rune immediately with a bare string isn't
+ possible without the delimiting backslash, since that would be ambiguous:
+
+ #abcdefgh ;Could be (#abcdef & gh) or (#abcde & fgh) or ...
+
+* Syntax sugar can combine arbitrarily. Some examples follow. Any of these may
+ or may not actually have a meaning in code; many could simply end up producing
+ an error during decoding, or later evaluation of code.
+
+ #{...} -> (#HASH #BRACE ...)
+
+ #'foo -> (#HASH #QUOTE & foo)
+
+ ##'[...] -> (#HASH #HASH #QUOTE #SQUARE ...)
+
+ {x y}[i j] -> (#JOIN (#BRACE x y) #SQUARE i j)
+
+ foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)
+
+* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as
+ `(#QUOTE & foo)`; a single pair with the quoted datum in the second position.
+
+ The same principle is used when parsing other sugar; some examples follow:
+
+      Incorrect                          Correct
+
+      #(x y z) -> (#HASH (x y z))        #(x y z) -> (#HASH x y z)
+
+      [x y z]  -> (#SQUARE (x y z))      [x y z]  -> (#SQUARE x y z)
+
+      #{x}     -> (#HASH (#BRACE (x)))   #{x}     -> (#HASH #BRACE x)
+
+      foo(x y) -> (#JOIN foo (x y))      foo(x y) -> (#JOIN foo x y)
+
+* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts
+ further decoding of enclosed data. This is not so, since quoting is related
+ to code evaluation, not decoding.
+
+
+## Shebang
+
+There is one final "syntax sugar" translation whose sole purpose is to allow a
+shebang line at the start of a file:
+
+ #!interpreter -> (#SHBANG & interpreter)
+
+ #!interpreter argline -> (#SHBANG interpreter & argline)
+
+Under default settings, the decoder will allow this datum to appear once at the
+beginning of a per-file decoding sequence, and simply discard it.
+
+
+<!--
+;; Local Variables:
+;; fill-column: 80
+;; End:
+-->
diff --git a/doc/c1/2-decode.md b/doc/c1/2-decode.md
new file mode 100644
index 0000000..379c74b
--- /dev/null
+++ b/doc/c1/2-decode.md
@@ -0,0 +1,44 @@
+# Decoding
+
+A separate process called "decoding" can transform simple data structures,
+consisting of only the base datum types, into a richer set of Zisp types.
+
+For example, the decoder may turn `(#HASH ...)` into a vector, as one would
+expect a vector literal like `#(...)` to work in Scheme. Bytevector syntax
+could use a custom rune as a list prefix, like: `#u8(...)`
+
+Runes may be decoded in isolation as well, rather than transforming a list
+whose head they appear in. This can implement Boolean constants as `#true`
+and `#false` or `#t` and `#f`.
+
+The decoder recognizes `(#QUOTE ...)` to aid in implementing the traditional
+quoting mechanism of Lisp/Scheme, but with a significant difference:
+
+Traditional quote is "unhygienic" in Scheme terms. An expression such as
+`'(foo bar)` will always be read as `(quote (foo bar))` regardless of what
+lexical context it appears in, so the semantics will depend on whatever the
+identifier `quote` is bound to, meaning that the expression may end up
+evaluating to something other than the list `(foo bar)`.
+
+The Zisp decoder, which transforms not datum to datum, but object to object,
+can turn `#QUOTE` into an object which encapsulates the notion of quoting,
+which the Zisp evaluator can recognize and act upon, ensuring that an
+expression like `'(foo bar)` always turns into the list `(foo bar)`.
+
+One way to think about this, in Scheme (R6RS / syntax-case) terms, is that
+expressions like `'(foo bar)` turn directly into a syntax object when read,
+and the created syntax object begins with an identifier bound to `quote` in
+the standard library.
+
+The decoder is, of course, configurable and extensible. The transformations
+mentioned above would be performed only when it's told to decode data which
+represents Zisp code. The decoder may be given a different configuration,
+telling it to decode, for example, data which represents a different kind of
+domain-specific data, such as application settings, build system commands,
+complex data records with non-standard data types, and so on.
+
+<!--
+;; Local Variables:
+;; fill-column: 77
+;; End:
+-->
diff --git a/doc/c1/grammar/abnf.txt b/doc/c1/grammar/abnf.txt
new file mode 100644
index 0000000..aa67646
--- /dev/null
+++ b/doc/c1/grammar/abnf.txt
@@ -0,0 +1,141 @@
+; Standards-compliant ABNF (RFC 5234, RFC 7405)
+
+; Compatible with: https://www.quut.com/abnfgen/
+
+; Unlike PEG, grammar rules in BNF are non-deterministic, which makes
+; it much more challenging to express our naive parse logic. Whether
+; this ABNF file is truly accurate is difficult to assess.
+
+; The abnfgen(1) tool linked above can be used to generate arbitrary
+; strings matching the grammar in this file. These can be fed into
+; the Zisp parser to reveal some potential bugs; either in the parser
+; itself, or this ABNF grammar.
+
+; Note that the tool may generate Zisp string literals with Unicode
+; escape sequences corresponding to surrogate code points; the parser
+; may reject these. This is expected; it's difficult to rewrite this
+; ABNF grammar to exclude those Unicode values.
+
+; Other minor inaccuracies that aren't important include: This ABNF
+; forces line comments to be terminated with an LF character, when in
+; fact the end-of-file may also terminate them; the same applies to
+; hash-bang parsing which doesn't actually have to end in LF. These
+; discrepancies won't make abnfgen(1) generate invalid strings; they
+; only make this ABNF more strict than the Zisp parser, so it won't
+; generate some strings that the parser would actually accept.
+
+
+Stream = [ Unit *( Blank Unit ) ] *Blank [Trail]
+
+
+Unit = *Blank Datum
+
+Blank = HTAB / LF / %x0b / %x0c / CR / SP / Comment
+
+Trail = SkipLine / SkipUnit / ";" "~" *Blank
+
+
+Datum = BareString / SpecialStr / CladDatum / Rune / RuneStr
+ / RuneDotStr / RuneClad / LabelRef / LabelDef / HashStr
+ / HashDotStr / HashClad / QuoteExpr / JoinExpr
+
+Comment = SkipLine LF / SkipUnit Blank
+
+SkipLine = ";" [ SkipLStart *AnyButLF ]
+
+SkipUnit = ";" "~" Unit
+
+SkipLStart = %x00-09 / %x0b-7d / %x7f-ff ; any but LF or "~"
+
+AnyButLF = %x00-09 / %x0b-ff
+
+
+BareString = BareChar *( BareChar / Numeric )
+
+SpecialStr = SpecStrChar *( SpecStrChar / BareChar )
+
+CladDatum = "|" *( PipeStrChar / "\" StringEsc ) "|"
+ / DQUOTE *( QuotStrChar / "\" StringEsc ) DQUOTE
+ / "(" List ")"
+ / "[" List "]"
+ / "{" List "}"
+
+Rune = "#" RuneName
+
+RuneStr = "#" RuneName "\" BareString
+
+RuneDotStr = "#" RuneName "\" SpecialStr
+
+RuneClad = "#" RuneName CladDatum
+
+HashBang = "#" "!" *( SP / HTAB ) HBLine LF
+
+LabelRef = "#" "%" Label "%"
+
+LabelDef = "#" "%" Label "=" Datum
+
+HashStr = "#" "\" BareString
+
+HashDotStr = "#" "\" SpecialStr
+
+HashClad = "#" CladDatum
+
+QuoteExpr = "'" Datum
+ / "`" Datum
+ / "," Datum
+
+JoinExpr = Datum RJoinDatum
+ / LJoinDatum NoStartDot
+ / Datum ":" Datum
+ / NoEndDot "." Datum
+
+
+BareChar = "!" / "$" / "%" / "*" / "/" / "<" / "=" / ">"
+ / "?" / "^" / "_" / "~" / ALPHA
+
+Numeric = "+" / "-" / DIGIT
+
+SpecStrChar = "." / ":" / Numeric
+
+PipeStrChar = %x00-5b / %x5d-7b / %x7d-ff ; any but "|" or "\"
+
+QuotStrChar = %x00-21 / %x23-5b / %x5d-ff ; any but DQUOTE or "\"
+
+StringEsc = "\" / "|" / DQUOTE / *( HTAB / SP ) LF *( HTAB / SP )
+ / %s"a" / %s"b" / %s"t" / %s"n"
+ / %s"v" / %s"f" / %s"r" / %s"e"
+ / %s"x" *( 2HEXDIG ) ";"
+ / %s"u" ["0"] 1*5HEXDIG ";"
+ / %s"u" "1" "0" 4HEXDIG ";"
+
+List = [ Unit *( Blank Unit ) ] *Blank [Tail] [SkipUnit]
+
+Tail = "&" Unit *Blank
+
+
+RuneName = ALPHA *5( ALPHA / DIGIT )
+
+Label = 1*12( HEXDIG )
+
+HBLine = 1*HBChar [ 1*( SP / HTAB ) *HBChar ]
+
+HBChar = %x00-08 / %x0b-1f / %x21-ff ; any but HT, LF, SP
+
+
+RJoinDatum = CladDatum / Rune / RuneStr / RuneDotStr / RuneClad
+ / LabelRef / LabelDef / HashStr / HashDotStr / HashClad
+ / QuoteExpr
+
+LJoinDatum = CladDatum / RuneClad / LabelRef / HashClad
+
+NoStartDot = BareString / CladDatum / Rune / RuneStr / RuneDotStr
+ / RuneClad / LabelRef / LabelDef / HashStr / HashDotStr
+ / HashClad / QuoteExpr
+
+NoEndDot = BareString / Rune / RuneStr / RuneClad / LabelRef
+ / HashStr / HashClad
+
+
+;; Local Variables:
+;; eval: (flyspell-mode -1)
+;; End:
diff --git a/doc/c1/grammar/index.md b/doc/c1/grammar/index.md
new file mode 100644
index 0000000..e3716ea
--- /dev/null
+++ b/doc/c1/grammar/index.md
@@ -0,0 +1,115 @@
+# Zisp S-Expression Grammar
+
+The grammar is available in several different formats:
+
+* [ZBNF](zbnf.txt): See below for the rules of this notation
+* [ABNF](abnf.txt): Compatible with the `abnfgen` tool
+* [PEG](peg.txt): Compatible with the `peg/leg` tool
+
+
+## ZBNF notation
+
+The ZBNF grammar specification uses a BNF-like notation with PEG-like
+semantics:
+
+* Concatenation of expressions is implicit: `foo bar` means `foo`
+ followed by `bar`.
+
+* Parentheses are used for grouping, and the pipe symbol `|` is used
+ for alternatives.
+
+* The suffixes `?`, `*`, and `+` have the same meaning as in regular
+ expressions, although `[foo]` is used in place of `(foo)?`.
+
+* The syntax is defined in terms of bytes, not characters. Terminals
+ `'c'` and `"c"` refer to the ASCII value of the given character `c`.
+ Standard C escape sequences are supported.
+
+* The prefix `~` means NOT. It only applies to rules that match one
+ byte, and negates them. For example, `~( 'a' | 'b' )` matches any
+ byte other than 'a' and 'b'.
+
+* Ranges of terminal values are expressed as `x...y` (inclusive).
+
+* ABNF "core rules" like `ALPHA` and `HEXDIG` are supported.
+
+* There is no ambiguity, or look-ahead / backtracking beyond one byte.
+ Rules match left to right, depth-first, and greedy. As soon as the
+ input matches the first terminal of a rule --explicit or implied by
+ recursively descending into the first non-terminal-- it must match
+ that rule to the end or a syntax error is reported.
+
+The last point makes the notation simple to translate to code.
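+
+As a rough illustration of that translation, the sketch below turns
+the `'x' HexByte* ';'` alternative of `StringEsc` into Python. It is
+not the actual Zisp parser; `Reader`, `ZispSyntaxError`, `hex_escape`
+and `string_esc` are hypothetical names, and only a subset of the
+escapes is handled. The alternative is chosen from a single byte, and
+once chosen it must match to the end or an error is raised.
+
+```python
+class ZispSyntaxError(Exception):
+    """Hypothetical error type for this sketch."""
+
+
+class Reader:
+    """Byte reader with exactly one byte of look-ahead."""
+
+    def __init__(self, data: bytes):
+        self.data = data
+        self.pos = 0
+
+    def peek(self):
+        return self.data[self.pos] if self.pos < len(self.data) else None
+
+    def take(self):
+        b = self.peek()
+        if b is not None:
+            self.pos += 1
+        return b
+
+
+HEXDIG = b"0123456789abcdefABCDEF"
+
+
+def hex_digit(r: Reader) -> int:
+    b = r.take()
+    if b is None or b not in HEXDIG:
+        raise ZispSyntaxError(f"expected hex digit near byte {r.pos}")
+    return int(chr(b), 16)
+
+
+def hex_escape(r: Reader) -> bytes:
+    """'x' HexByte* ';' where the caller already consumed the 'x'."""
+    out = bytearray()
+    while r.peek() != ord(";"):
+        hi = hex_digit(r)  # committed: anything else is an error
+        lo = hex_digit(r)
+        out.append(hi * 16 + lo)
+    r.take()  # the terminating ';'
+    return bytes(out)
+
+
+def string_esc(r: Reader) -> bytes:
+    """Subset of StringEsc: dispatch on the next byte alone."""
+    b = r.take()
+    if b in (ord("\\"), ord("|"), ord('"')):
+        return bytes([b])
+    if b == ord("x"):
+        return hex_escape(r)
+    raise ZispSyntaxError("escape not covered by this sketch")
+```
+
+With these definitions, `string_esc(Reader(b"x48656c6c6f;"))` yields
+`b"Hello"`, while `b"x4g;"` raises an error at the `g` instead of
+backtracking to try another alternative.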
+
+
+## Limitations outside the grammar
+
+The following limits are not represented in the grammar:
+
+* A `UnicodeSV` is the hexadecimal representation of a Unicode scalar
+ value; it must represent a value in the range 0 to D7FF, or E000 to
+ 10FFFF, inclusive. Any other value signals an error. Valid values
+ are converted into a UTF-8 byte sequence encoding the value.
+
+* A `Rune` longer than 6 bytes is grammatical, but signals an error.
+ This is important because runes are not self-terminating; defining
+ their grammar as ending after a maximum of 6 bytes would allow
+ another datum beginning with an alphabetic character to follow a
+ rune immediately without any visual delineation, which would be
+ terribly confusing for a human reader. Consider: `#foobarbaz`.
+ This would parse as a `Datum` joining `#foobar` and `baz`.
+
+ (The ABNF does not suffer from this issue, since it explicitly
+ enumerates the join possibilities anyway.)
+
+* A `Label` is the hexadecimal representation of a 48-bit integer,
+ meaning it allows for a maximum of 12 hexadecimal digits. Longer
+ values are grammatical, but signal an out-of-range error, so as to
+ avoid signaling a confusing "invalid character" error on input that
+ appears grammatical. Consider: `#%123456789abcd=foo`. This would
+ signal an invalid character error at the letter `d` if the grammar
+ limited a `Label` to 12 hexadecimal digits.
+
+ (As above, the ABNF doesn't care about this. You probably don't
+ want to use the ABNF to generate a parser anyway.)
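+
+The three limits above could be checked after the grammar has matched,
+roughly as follows. This is a sketch with hypothetical function names,
+not the actual implementation. It assumes the label limit is enforced
+by digit count, so leading zeros count toward the 12 digits; the text
+above does not settle that detail.
+
+```python
+RUNE_MAX = 6    # maximum bytes in a rune name
+LABEL_MAX = 12  # maximum hex digits in a label (48 bits)
+
+
+def unicode_sv_to_utf8(hex_digits: str) -> bytes:
+    """Validate a UnicodeSV and encode it as UTF-8."""
+    value = int(hex_digits, 16)
+    # Reject values past U+10FFFF and the surrogate range D800..DFFF.
+    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
+        raise ValueError(f"not a Unicode scalar value: {hex_digits}")
+    return chr(value).encode("utf-8")
+
+
+def check_rune(name: str) -> str:
+    """Accept a grammatical rune name only if it is short enough."""
+    if len(name) > RUNE_MAX:
+        raise ValueError(f"rune name too long: {name}")
+    return name
+
+
+def parse_label(hex_digits: str) -> int:
+    """Convert a grammatical label, reporting a range error if too long."""
+    if len(hex_digits) > LABEL_MAX:
+        raise ValueError(f"label out of range: {hex_digits}")
+    return int(hex_digits, 16)
+```
+
+Here `check_rune("foobarbaz")` and `parse_label("123456789abcd")` both
+raise a range-style error for the whole token, rather than a syntax
+error in the middle of it.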
+
+
+## At-quoted strings
+
+The mechanism of at-quoted strings is not represented in any of the
+grammars, since it essentially has 256 variants. Representing it
+sanely in a grammar requires the ability to save and reference
+variables.
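+
+A minimal sketch of why a saved variable is needed. It assumes, based
+only on the `@_foo bar_` example, that the byte right after `@` acts
+as the delimiter and that the string runs up to the next occurrence
+of that byte. `read_at_string` is a hypothetical name, and any escape
+handling the real parser may perform is ignored.
+
+```python
+def read_at_string(data: bytes, pos: int = 0):
+    """Return (content, next_pos) for an at-quoted string at `pos`."""
+    if data[pos:pos + 1] != b"@":
+        raise ValueError("not an at-quoted string")
+    if pos + 1 >= len(data):
+        raise ValueError("missing delimiter byte")
+    delim = data[pos + 1]  # the saved variable: any of 256 byte values
+    end = data.find(bytes([delim]), pos + 2)
+    if end < 0:
+        raise ValueError("unterminated at-quoted string")
+    return data[pos + 2:end], end + 1
+```
+
+Under that assumption, `read_at_string(b"@_foo bar_ rest")` returns
+`(b"foo bar", 10)`. A grammar without variables would need one such
+rule per possible delimiter byte.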
+
+
+## Stream-parsing strategy
+
+The parser consumes one `Unit` from the input stream every time it's
+called; it returns the `Datum` therein if found, or else it returns
+the Zisp EOF token.
+
+Since a `Datum` is not self-terminating, the parser must read beyond
+it to realize that it has ended (if not followed by the EOF). Thus,
+it will consume one more `Blank` following the `Unit` that it parsed.
+If this `Blank` is a comment, it will be consumed entirely, ensuring
+that parsing resumes properly on a subsequent parser call on the same
+input stream, without needing to store any state in between.
+
+Since comments of type `SkipUnit` are likewise not self-terminating,
+an arbitrary number of chained `SkipUnit` comments may need to be
+consumed before the parser is finally allowed to return.
+
+The following illustration shows the positions at which the parser
+will stop consuming input when called repeatedly on the same input
+stream. The dots represent the extent of each `Unit` being parsed,
+while the caret points at the last byte the parser will consume in
+that parse cycle.
+
+```
+foo (bar)[baz] foo;~bar foo;~bar;~baz;~bat foobar
+...^..........^...     ^...               ^......^
+```
+
+Notice how, in the fourth cycle, the parser is forced to consume all
+commented-out units before it can return, since it would otherwise
+leave the stream in an inappropriate state.
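+
+The strategy can be sketched with a toy parser. This is a heavy
+simplification, not the real implementation: a datum is assumed to be
+a plain run of ASCII letters, and `Stream`, `skip_blank`, `parse_unit`
+and the `EOF` object are hypothetical stand-ins. The point it shows is
+that consuming one trailing `Blank` recurses through chained
+`SkipUnit` comments on its own.
+
+```python
+EOF = object()               # stand-in for the Zisp EOF token
+WHITESPACE = b"\t\n\v\f\r "  # bytes '\t'...'\r' plus SP
+
+
+class Stream:
+    def __init__(self, data: bytes):
+        self.data = data
+        self.pos = 0
+
+    def peek(self):
+        return self.data[self.pos] if self.pos < len(self.data) else None
+
+
+def is_alpha(b) -> bool:
+    return b is not None and (65 <= b <= 90 or 97 <= b <= 122)
+
+
+def skip_blank(s: Stream) -> bool:
+    """Consume one Blank if present; report whether one was consumed."""
+    b = s.peek()
+    if b is None:
+        return False
+    if b in WHITESPACE:
+        s.pos += 1
+        return True
+    if b == ord(";"):
+        s.pos += 1
+        if s.peek() == ord("~"):
+            s.pos += 1
+            parse_unit(s)  # SkipUnit: parse one Unit and discard it
+        else:
+            while s.peek() is not None:  # SkipLine: up to and incl. LF
+                s.pos += 1
+                if s.data[s.pos - 1] == 10:
+                    break
+        return True
+    return False
+
+
+def parse_unit(s: Stream):
+    while skip_blank(s):  # leading Blank*
+        pass
+    start = s.pos
+    while is_alpha(s.peek()):
+        s.pos += 1
+    if s.pos == start:
+        if s.peek() is None:
+            return EOF
+        raise ValueError(f"unexpected byte at offset {s.pos}")
+    datum = s.data[start:s.pos]
+    skip_blank(s)  # one trailing Blank; a comment is consumed entirely
+    return datum
+```
+
+Called repeatedly on `foo foo;~bar foo;~bar;~baz;~bat foobar`, this
+toy returns `foo`, `foo`, `foo` and `foobar`, leaving the stream at
+offsets 4, 13, 32 and 38. The third call runs through all three
+commented-out units before returning.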
diff --git a/doc/c1/grammar/peg.txt b/doc/c1/grammar/peg.txt
new file mode 100644
index 0000000..7b28a99
--- /dev/null
+++ b/doc/c1/grammar/peg.txt
@@ -0,0 +1,93 @@
+# Standard PEG notation
+
+Stream <- Unit ( Blank Unit )* !.
+
+
+Unit <- Blank* Datum
+
+Blank <- [\t-\r ] / Comment
+
+
+Datum <- OneDatum ( JoinChar? OneDatum )*
+
+JoinChar <- '.' / ':'
+
+
+Comment <- ';' ( SkipUnit / SkipLine )
+
+SkipUnit <- '~' Unit
+
+SkipLine <- (!'\n' .)* '\n'?
+
+
+OneDatum <- BareString / CladDatum
+
+
+BareString <- SpecBareChar ( BareChar / JoinChar )*
+ / BareChar+
+
+SpecBareChar <- '+' / '-' / JoinChar / DIGIT
+
+BareChar <- ALPHA / DIGIT
+ / '!' / '$' / '%' / '*' / '+' / '-' / '/'
+ / '<' / '=' / '>' / '?' / '^' / '_' / '~'
+
+
+CladDatum <- PipeStr / QuoteStr / HashExpr / QuoteExpr / List
+
+PipeStr <- '|' ( PipeStrChar / '\\' StringEsc )* '|'
+QuoteStr <- '"' ( QuotStrChar / '\\' StringEsc )* '"'
+HashExpr <- '#' HashExprs
+QuoteExpr <- "'" Datum / '`' Datum / ',' Datum
+List <- ParenList / SquareList / BraceList
+
+
+PipeStrChar <- (![|\\] .)
+QuotStrChar <- (!["\\] .)
+
+StringEsc <- '\\' / '|' / '"' / [\t ]* '\n' [\t ]*
+ / '0' / 'a' / 'b' / 't' / 'n' / 'v' / 'f' / 'r' / 'e'
+ / 'x' HexByte* ';'
+ / 'u' UnicodeSV ';'
+
+HexByte <- HEXDIG HEXDIG
+UnicodeSV <- HEXDIG+
+
+
+HashExprs <- '!' [\t ]* HBangLine '\n'?
+ / '%' Label ( '%' / '=' Datum )
+ / '\\' BareString / CladDatum
+ / Rune ( '\\' BareString / CladDatum )?
+
+HBangLine <- HBChars+ [\t ]* ( HBChars+ )?
+HBChars <- (![\t\n ] .)
+Label <- HEXDIG+
+Rune <- ALPHA ( ALPHA / DIGIT )*
+
+
+ParenList <- '(' ListBody ')'
+SquareList <- '[' ListBody ']'
+BraceList <- '{' ListBody '}'
+
+ListBody <- Unit* ( Blank* '&' Unit )? Blank*
+
+
+DIGIT <- [0-9]
+ALPHA <- [a-zA-Z]
+HEXDIG <- [0-9a-fA-F]
+
+
+# Keep this in sync line-for-line with the ZBNF grammar for easy
+# comparison between the two.
+
+# This file is meant to be compatible with:
+# https://piumarta.com/software/peg
+
+# Due to a quirk in the peg tool this file is used with, the grammar
+# must not allow an empty stream. Therefore, the Unit rule has its
+# Datum declared as mandatory rather than optional.
+
+
+# Local Variables:
+# eval: (flyspell-mode -1)
+# End:
diff --git a/doc/c1/grammar/zbnf.txt b/doc/c1/grammar/zbnf.txt
new file mode 100644
index 0000000..923ac83
--- /dev/null
+++ b/doc/c1/grammar/zbnf.txt
@@ -0,0 +1,77 @@
+; Custom notation with PEG semantics
+
+Stream : Unit ( Blank Unit )*
+
+
+Unit : Blank* [Datum]
+
+Blank : '\t'...'\r' | SP | Comment
+
+
+Datum : OneDatum ( [JoinChar] OneDatum )*
+
+JoinChar : '.' | ':'
+
+
+Comment : ';' ( SkipUnit | SkipLine )
+
+SkipUnit : '~' Unit
+
+SkipLine : ( ~LF )* [LF]
+
+
+OneDatum : BareString | CladDatum
+
+
+BareString : SpecBareChar ( BareChar | JoinChar )*
+ | BareChar+
+
+SpecBareChar : '+' | '-' | JoinChar | DIGIT
+
+BareChar : ALPHA | DIGIT
+ | '!' | '$' | '%' | '*' | '+' | '-' | '/'
+ | '<' | '=' | '>' | '?' | '^' | '_' | '~'
+
+
+CladDatum : PipeStr | QuoteStr | HashExpr | QuoteExpr | List
+
+PipeStr : '|' ( PipeStrChar | '\' StringEsc )* '|'
+QuoteStr : '"' ( QuotStrChar | '\' StringEsc )* '"'
+HashExpr : '#' HashExprs
+QuoteExpr : "'" Datum | '`' Datum | ',' Datum
+List : ParenList | SquareList | BraceList
+
+
+PipeStrChar : ~( '|' | '\' )
+QuotStrChar : ~( '"' | '\' )
+
+StringEsc : '\' | '|' | '"' | ( HTAB | SP )* LF ( HTAB | SP )*
+ | '0' | 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e'
+ | 'x' HexByte* ';'
+ | 'u' UnicodeSV ';'
+
+HexByte : HEXDIG HEXDIG
+UnicodeSV : HEXDIG+
+
+
+HashExprs : '!' ( SP | HTAB )* HBangLine [ LF ]
+ | '%' Label ( '%' | '=' Datum )
+ | '\' BareString | CladDatum
+ | Rune [ '\' BareString | CladDatum ]
+
+HBangLine : HBChars+ ( SP | HTAB )* [ HBChars+ ]
+HBChars : ~( SP | HTAB | LF )
+Label : HEXDIG+
+Rune : ALPHA ( ALPHA | DIGIT )*
+
+
+ParenList : '(' ListBody ')'
+SquareList : '[' ListBody ']'
+BraceList : '{' ListBody '}'
+
+ListBody : Unit* [ Blank* '&' Unit ] Blank*
+
+
+;; Local Variables:
+;; eval: (flyspell-mode -1)
+;; End:
diff --git a/doc/c1/index.md b/doc/c1/index.md
new file mode 100644
index 0000000..af01cea
--- /dev/null
+++ b/doc/c1/index.md
@@ -0,0 +1,30 @@
+# Chapter 1: Genesis
+
+This chapter goes through the processes involved in reading source
+code, running it, and optionally compiling it.
+
+1. [Parse](1-parse.html)
+
+ The parser receives a stream of bytes and transforms them into a
+ minimal set of data types with very little processing.
+
+2. [Decode](2-decode.html)
+
+ The decoder runs configurable and extensible pre-processing steps
+ over data received from the parser, enriching it with more complex
+ data types, and handling primitive source code transforms. It's
+ comparable to the C pre-processor or Lisp's `DEFMACRO` mechanism,
+ with a few more responsibilities, such as number literal parsing.
+
+3. [Execute](3-execute.html)
+
+ Code is executed (or interpreted, or evaluated) in an environment,
+ also called a module, which may be mutated, and linked with other
+ modules. Execution is immediate, without any pre-compilation.
+
+4. [Compile](4-compile.html)
+
+ Procedures from within the compiler module can be used to demand
+ the compilation of other modules, with various options, yielding
+ static or dynamic object files. These may be loaded immediately,
+ replacing the previously uncompiled module code in memory.