From 724ac8ae394675a78c2977c6e35555b210256e01 Mon Sep 17 00:00:00 2001
From: Taylan Kammer <taylan.kammer@gmail.com>
Date: Mon, 1 Jun 2026 21:49:37 +0200
Subject: docs -> doc

---
 docs/c1/1-parse.md | 611 -----------------------------------------------------
 1 file changed, 611 deletions(-)
 delete mode 100644 docs/c1/1-parse.md

(limited to 'docs/c1/1-parse.md')

diff --git a/docs/c1/1-parse.md b/docs/c1/1-parse.md
deleted file mode 100644
index 4eb5776..0000000
--- a/docs/c1/1-parse.md
+++ /dev/null
@@ -1,611 +0,0 @@
-# Parser for Data
-
-*For an exact specification of the grammar, see [grammar](grammar/).*
-
-Zisp s-expressions represent an extremely minimal set of data types; only that
-which is necessary to strategically construct more complex values:
-
-    +--------+-----------------+--------+----------+------+
-    | TYPE   | String          | Rune   | Pair     | Nil  |
-    +--------+-----------------+--------+----------+------+
-    | E.G.   | foobar          | #name  | (X & Y)  | ()   |
-    |        | |foo bar|       |        |          |      |
-    |        | "foo bar"       |        |          |      |
-    |        | @_foo bar_      |        |          |      |
-    +--------+-----------------+--------+----------+------+
-
-Datum comments and line comments are supported:
-
-* A semicolon followed by a tilde instructs the parser to consume one datum and
-  discard it.  Whitespace may appear between the tilde and the datum to discard.
-
-* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
-  discard bytes until a newline (ASCII Line Feed) is encountered.
-
-The parser can also output non-negative integers, but only for datum labels.
-Number literals are handled by the decoder instead; see below.
-
-
-## Overview
-
-This section explains a few core concepts and features related to the parser.
-
-
-### Value vs. Datum
-
-A Zisp *value* that has an *external representation* in the form of a sequence
-of bytes is called a *datum*.  Every datum is a value, but not all values are
-data.  A datum is a value that can be printed out as a byte sequence which the
-parser can recognize and turn back into an equivalent datum.
-
-One may speak of an *external representation of a value* where the value is not
-itself a datum, but can be encoded as a datum.  The more strictly correct term
-for this is: "The external representation of a datum encoding the value."
-
-
-### Syntax sugar
-
-The parser recognizes various "syntax sugar" and transforms it into uses of the
-above listed primitive data types.  As an example, the expression `#(x y z)` is
-parsed into the structure `(#HASH x y z)`.  These are two completely equivalent
-external representations for the same compound datum; after parsing, both byte
-sequences will yield data values that are indistinguishable in all but their
-memory address.
-
-The most ubiquitously used syntax sugar is the list, which stands for a chain of
-pairs, terminated with nil:
-
-    (x y z)  ->  (x & (y & (z & ())))
-
-The full syntax sugar table is listed and explained further below.
-
-
-### Decoder
-
-*The decoder has nothing to do with the concept of text or character encoding.*
-
-A separate process called *decoding* can transform Zisp data into values of more
-complex types, including values that are not of a datum type.
-
-For example, the datum `(#HASH x y z)` could be decoded into an array, so the
-expression `#(x y z)` could work like in Scheme.
-
-Decoding also resolves datum labels, scans bare strings for ones that represent
-number literals, and takes care of several other transforms.  This offloads
-complexity from the parser, allowing it to remain extremely simple.
-
-See the dedicated documentation of the [decoder](2-decode.html) for more.
-
-
-### Character encoding
-
-The parser does not consume characters; it consumes bytes.
-
-The grammar is generally built from bytes corresponding to ASCII characters.
-Some elements of the grammar, such as comments and quoted strings, may contain
-arbitrary byte sequences, until terminated.  These sequences may happen to be
-valid UTF-8 text.  This way, quoted strings and comments may contain Unicode
-text encoded in UTF-8, but the parser does not check these for validity.
-
-Since comments and quoted strings may contain arbitrary byte sequences, a text
-editor or other program displaying Zisp s-expressions may need to use a special
-visual representation for bytes that don't represent valid text.
-
-The parser being based on bytes rather than characters is not a limitation but
-rather a feature: It allows for Zisp s-expressions to be used as a structured
-data exchange format that may contain binary data elements without the need to
-encode these in Base64 or other such text representations of binary data.
-Consider the example:
-
-    ((image.webp "<< binary data >>")
-     (video.webm "<< binary data >>"))
-
-For this to work, any incidental occurrences of the double-quote and backslash
-bytes within the binary data must be escaped with a backslash; all other bytes
-can appear verbatim in the strings.
-
-
-### Stream parsing
-
-The parser can be repeatedly invoked on a byte stream to consume the next datum
-within.  This does not require "unreading" or back-seeking within the stream;
-the parser always reads a full datum, and stops after some byte which cleanly
-terminates the currently parsed datum.
-
-This means Zisp s-expressions can be safely intermixed with other data within
-the same byte stream.  So long as the other data is consumed by some parser
-which similarly stops reading at a clear boundary, the Zisp parser can then
-continue operating on the same stream.  Consider the example:
-
-    ("image.webp" 8273)
-
-    << 8273 bytes >>
-
-    ("video.webm" 736)
-
-    << 736 bytes >>
-
-The "header" for each file in this stream is a Zisp s-expression containing
-information about how many bytes should be read after the header, before the
-next file header appears.  (Each header datum must be terminated by an ASCII
-whitespace character such as a newline.  The reason why the closing parenthesis
-does not act as a terminator by itself will become apparent later.)
-
-#### Buffering
-
-To enable the aforementioned stream parsing strategy, the parser does not use
-automatic buffering.  If it did, it might inadvertently consume some bytes
-beyond the currently parsed datum, leaving the stream inconsistent.
-
-The parser could provide access to its buffer, such that one could access the
-unused bytes, but it's simpler and more flexible to let buffering be handled
-externally from the parser.
-
-In other words: If the parser is meant to be used on an I/O stream connected to
-expensive system calls, such as a file handle or network socket, it's best to
-wrap that stream in some intermediate object which asks the system for large
-chunks of data at once, and stores the data in a buffer.
-
-
-### Datum labels
-
-A valid datum cannot be cyclic, since its external representation would then
-have infinite length in bytes.  To externally represent a value with cyclic
-structure, one uses datum labels in the data encoding of the value.
-
-A datum label either wraps another datum to assign a number to it, or contains
-just a reference to a previous assignment.
-
-    +----------------------------------+---------------------------------+
-    | Internal structure               | External representation         |
-    +----------------------------------+---------------------------------+
-    | (#LABEL & (<NUMBER> & <DATUM>))  | #%<HEX>=<DATUM>                 |
-    +----------------------------------+---------------------------------+
-    | (#LABEL & <NUMBER>)              | #%<HEX>%                        |
-    +----------------------------------+---------------------------------+
-
-In this visual, the token `<NUMBER>` stands for an actual number value that
-doesn't have its own external representation.  It's printed as a sequence of
-hexadecimal digits, denoted by `<HEX>` in the external representation.
-
-For clarity, concrete examples follow:
-
-    #%1234abcd=(foo bar)  ->  (#LABEL & (<0x1234abcd> & (foo bar)))
-
-    #%1234abcd%           ->  (#LABEL & <0x1234abcd>)
-
-Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
-with an integer value.
-
-Datum labels may look like "syntax sugar" but the fact that integers don't have
-a direct external representation means that datum labels are a fundamental type
-of syntax that has no "desugared" equivalent in external representation.  The
-decoder will not accept a bare string encoding of an integer here.
-
-
-## Data types
-
-Following is an explanation of the four core data types constructed by the Zisp
-s-expression parser.
-
-A Zisp value that is a member of one of these types is also called a *datum* if
-it adheres to additional constraints as explained for each type.
-
-
-### String
-
-Strings can appear "bare" or be quoted in various ways.
-
-A string, as a stand-alone Zisp value, is only a valid datum if it can be
-represented as a bare string.  If it contains bytes that prevent the bare
-representation, then the string must be wrapped in one of the following
-structures to become a valid datum, each of which has its own external
-representation:
-
-    +-------------------------------+-------------------------------+
-    | Internal structure            | External representation       |
-    +-------------------------------+-------------------------------+
-    | (#PQSTR & <STRING>)           | |contents|                    |
-    +-------------------------------+-------------------------------+
-    | (#DQSTR & <STRING>)           | "contents"                    |
-    +-------------------------------+-------------------------------+
-    | (#ATSTR & <STRING>)           | @_contents_                   |
-    +-------------------------------+-------------------------------+
-
-The visual token `<STRING>` is meant to denote the actual string, as a Zisp
-value, occupying the second position in the pair.  It is not actual syntax.
-
-Note that, while conceptually similar, this internal encoding of string data is
-not syntax sugar, since the internal datum representation using runes cannot be
-printed out verbatim, due to the attached string being impossible to represent
-externally without quotation.  As such, quoted strings are fundamental syntax.
-
-These external representations of strings will be explained in more detail
-further below, including backslash escape sequences allowed within.
-
-Strings have a fixed length, counted in bytes.  Each byte can have any value,
-including zero (aka ASCII NULL).  The parser reads bytes, not characters, and
-has no concept of a character encoding, which means that a string can contain
-UTF-8 byte sequences, but these are not tested for validity.
-
-A string that is up to 255 bytes long is automatically *interned*, meaning any
-occurrence of the same string -- equal in length and containing the same byte
-values -- ends up being represented by the same bit-pattern; either a memory
-address, or an immediate representation within a CPU word for short strings.
-
-Strings with a length greater than 255 bytes end up being represented by a
-distinct memory address, even if they are equal in length and content.
-
-
-### Rune
-
-A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must
-begin with a letter, and may only contain letters and digits.  This character
-sequence of letters and digits is called the *name* of the rune.  A rune that
-follows this constraint is valid as a datum.
-
-Zisp code may explicitly construct values of the rune type that violate the
-above constraints.  Such runes are not valid data and cannot be printed or
-parsed in any way.
-
-Runes are case-sensitive, and the parser always emits runes using upper-case
-letters when expressing syntax sugar.  Uppercase rune names are reserved for
-Zisp's internal use and standard library; users can use lowercase runes with
-custom meaning without worrying about clashes, with the exception of a small
-number of lowercase runes such as `#true` and `#false` that are part of the
-default decoder settings.
-
-Runes are always stored directly in a CPU word; never by memory address.
-
-
-### Pair
-
-A pair is a tuple of two values: the first value and the second value.
-
-The parser allocates a unique two-word cell in the process heap for every pair,
-and represents that pair through the memory address of that cell.
-
-Pairs are valid as a datum if one of the following holds true for the pair:
-
-* It encodes one of the quoted string variants.
-
-* It encodes a datum label (assignment or reference).
-
-* The first and second values in the pair are each a valid datum.
-
-An additional constraint is that a hierarchy of pairs containing pairs must not
-form cycles; if it does, the cycles must be broken up by use of datum labels, or
-else none of the pairs within the cyclic structure is a valid datum.
-
-
-### Nil
-
-The Zisp nil value is a singleton and a datum.  There is exactly one nil value
-and it is used to terminate a chain of pairs representing a list of values.
-
-
-## Quoted strings
-
-Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted.
-This section goes into the details of each variant.
-
-
-### Pipe-quoted
-
-Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers
-the parser to generate a pair with the structure:
-
-    (#PQSTR & <STRING>)                 ;; <STRING> is visual aid, not syntax
-
-The decoder, using default settings, would emit this string verbatim as a value.
-Then, during code evaluation, this would be seen as an identifier.  In this way,
-pipe-quoted strings are equivalent to bare strings in functionality.
-
-It is important to understand that the decoder sits between the parser and the
-[evaluator](3-execute.html).  Contrary to Lisp and Scheme tradition, it is
-common for the evaluator to receive values that are not valid as a datum; in
-this case, a stand-alone string that may not be a valid datum because it cannot
-be represented as a bare string.  Yet, it is valid as an identifier for the
-purposes of the evaluator, since it is a string *value* like any other.
-
-
-### Double-quoted
-
-Strings wrapped in the double-quote symbol parse into:
-
-    (#DQSTR & <STRING>)                 ;; <STRING> is visual aid, not syntax
-
-Under default settings, the decoder would transform this into a value which,
-when evaluated, yields back the string as a value.  Typically, this would be
-achieved by simply transforming it into `(#QUOTE & <STRING>)`.  (Note that,
-unlike `(#PQSTR & <STRING>)`, this would not be decoded into a string unto
-itself, as that would make the evaluator see it as an identifier.)
-
-
-### At-quoted strings AKA raw strings
-
-There is a special type of syntax for "raw" strings, meaning that no backslash
-escapes nor any other kind of escape sequence are recognized within them.
-
-This raw string syntax begins with an at sign, followed by any byte.  That byte
-becomes the termination marker, and the string cannot contain an occurrence of
-it, since there are no escape sequences.
-
-    @"foo \ bar"  ->  (#ATSTR & <STRING>)
-
-In the above, the visual token `<STRING>` is not part of datum syntax but a
-stand-in for the actual string value, which is, literally: `foo \ bar`
-
-This style of quoting can be useful, for instance, when representing regular
-expressions as strings in code:
-
-    @/^foo\\(bar|baz)\.\[".*"\]$/         ;; matches e.g. foo\bar.["blah"]
-
-Were it not for this syntax, this regular expression would only be possible to
-represent through a quoted string such as the following:
-
-    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"   ;; many backslashes
-
-Alternatively, imagine searching for certain MS Windows file paths:
-
-    @_C:\\\\Users\\([a-z]+)_              ;; matches C:\\Users\foo
-
-That's already ugly.  Without raw strings, it would need to look even worse:
-
-    "C:\\\\\\\\Users\\\\([a-z]+)"         ;; MANY backslashes
-
-The byte that follows the at sign need not be a printable character or even a
-valid ASCII byte; it can be absolutely any byte value, even NULL.  This can be
-useful to easily encode binary data which is known to not contain a specific
-byte; an example would be C strings which cannot contain NULL.
-
-
-### Backslash escape sequences in strings
-
-The following backslash escapes are supported in pipe-quoted and double-quoted
-strings.  (Some rows use Regular Expression notation.)
-
-    +-----------------------------------+------------------------------+
-    | Character(s) following backslash  | Meaning                      |
-    +-----------------------------------+------------------------------+
-    | \                                 | Literal backslash            |
-    +-----------------------------------+------------------------------+
-    | |                                 | Literal pipe symbol          |
-    +-----------------------------------+------------------------------+
-    | "                                 | Literal double-quote         |
-    +-----------------------------------+------------------------------+
-    | RE: /[\t ]*\n[\t ]*/              | Discarded                    |
-    +-----------------------------------+------------------------------+
-    | 0                                 | ASCII NULL                   |
-    +-----------------------------------+------------------------------+
-    | a                                 | ASCII Alert                  |
-    +-----------------------------------+------------------------------+
-    | b                                 | ASCII Backspace              |
-    +-----------------------------------+------------------------------+
-    | t                                 | ASCII Tab (Horizontal)       |
-    +-----------------------------------+------------------------------+
-    | n                                 | ASCII Newline (Line Feed)    |
-    +-----------------------------------+------------------------------+
-    | v                                 | ASCII Vertical Tab           |
-    +-----------------------------------+------------------------------+
-    | f                                 | ASCII Form Feed              |
-    +-----------------------------------+------------------------------+
-    | r                                 | ASCII Carriage Return        |
-    +-----------------------------------+------------------------------+
-    | e                                 | ASCII Escape                 |
-    +-----------------------------------+------------------------------+
-    | RE: /x([0-9a-fA-F]{2})*;/         | Arbitrary bytes in hex       |
-    +-----------------------------------+------------------------------+
-    | RE: /u[0-9a-fA-F]+;/              | Unicode scalar as UTF-8      |
-    +-----------------------------------+------------------------------+
-
-To clarify:
-
-* A backslash followed by a backslash, pipe, or double-quote character is
-  substituted with a literal occurrence of the corresponding character.
-
-* A backslash followed by any number of blanks (space or tab), a newline, and
-  again any number of blanks, is substituted with nothing.  This is to allow
-  splitting a string into multiple lines for human readability.
-
-      (define paragraph "This paragraph has been visually split into multiple \
-                         lines, but the newline is escaped, so it's one line.")
-
-* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in C
-  (where `\e` is a common compiler extension), denoting unprintable ASCII bytes.
-
-* An x, followed by pairs of hexadecimal digits (case insensitive), terminated
-  by a semicolon, is substituted with the sequence of bytes represented by the
-  corresponding pairs of hexadecimal digits.  E.g.: `"foo\xDEADBEEF;bar"`
-
-* A u, followed by a hexadecimal digit sequence (case insensitive), terminated
-  by a semicolon, is substituted with the canonical UTF-8 byte sequence for the
-  Unicode Scalar Value represented by that hexadecimal number.  The number must
-  be in the range `0` to `10FFFF`.  E.g.: `"foo\u00A0;bar"`
-
-
-### Newlines in strings
-
-Normally, a newline in a string has no special meaning and simply becomes part
-of the string.  However, newlines can be backslash-escaped, which simply erases
-them; the escaped newline can also be preceded or followed by any number of tab
-and space characters, which are all stripped as well.  (Note: It's not blanks
-preceding the backslash that are stripped, but blanks following the backslash
-and preceding the newline; i.e., blanks at the end of the line.)
-
-Following are some examples of how multi-line strings can appear in source code
-with different intentions and meanings:
-
-    (define paragraph "This paragraph has been visually split into multiple \
-                       lines, but the newlines are escaped, so it's one line.")
-
-    (define json-object '|         ;; use '|| so double-quotes need no escaping
-      {
-        "key": "value"
-      }
-    |)
-
-The second example is actually slightly problematic.  It begins with a newline,
-which may be undesirable, but escaping that newline would cause the first line
-to have no indentation, thus the opening `{` would not line up with the closing
-`}` when this string is printed out.  Further, if the entire block of code is
-indented, then the string contents may be more indented than intended.  (No pun
-or rhyme intended.)  Consider:
-
-    (let ((foo one))
-      (let ((bar two))
-        (let ((json-object '|
-                 {
-                   "key": "value"
-                 }
-               |))
-          (do-whatever))))
-
-The string bound to `json-object` has redundant indentation.  Should the parser
-attempt to solve this issue?
-
-Thankfully, we have the decoder to handle such complexities.  Under the default
-settings, the rune `#HASH` is bound to a decoder rule which detects a payload
-value that is a string literal, and implements the same algorithm as seen in
-Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
-
-Thus, we can do the following:
-
-    (let ((foo one))
-      (let ((bar two))
-        (let ((json-object #|
-    ...........  {
-    ...........    "key": "value"
-    ...........  }
-    ...........|))
-          (do-whatever))))
-
-(Dots represent whitespace that is deleted.  The initial newline is, as well.)
-
-The only feature Zisp does not offer is a way to fence off multi-line strings
-with a longer token such as `"""` as seen in Python and Java, or an arbitrary
-word as seen in Bourne shell and PHP "here doc" syntax.
-
-However, if a programmer truly wanted to have arbitrary text blocks in code,
-without needing to escape anything in them, it's possible to abuse at-quoted
-string syntax, using it with an ASCII control character which is displayed
-visibly by a text editor.  In the following, the characters `^\` are meant to
-represent a literal ASCII File Separator character in the source code:
-
-    (define json-object #@^\
-      {
-        "key": "value"
-      }
-      ^\)
-
-Hey, it works fine in Emacs, so why not?  Use `C-q C-\` to insert the `^\`.
-
-This is indeed quite an eldritch syntax, but hopefully most programs would not
-need to use it anyway.
-
-
-## Syntax sugar
-
-The parser recognizes various "syntax sugar" and transforms it into equivalent
-datum constructions.  The most ubiquitous example of this is the list, which is
-transformed into a chain of pairs, terminated with nil:
-
-    (datum1 datum2 ...)  ->  (datum1 & (datum2 & (... & ())))
-
-This is so ubiquitous as to be hardly considered "syntax sugar" but is counted
-as such, since any list could just as well be written as a chain of pairs; both
-would result in an equivalent datum when parsed.
-
-The following table summarizes the other available transformations:
-
-    [...]   -> (#SQUARE ...)          #datum       -> (#HASH & datum)
-
-    {...}   -> (#BRACE ...)           #rune(...)   -> (#rune ...)
-
-    'datum  -> (#QUOTE & datum)       dat1dat2     -> (#JOIN dat1 & dat2)
-
-    `datum  -> (#GRAVE & datum)       dat1.dat2    -> (#DOT dat1 & dat2)
-
-    ,datum  -> (#COMMA & datum)       dat1:dat2    -> (#COLON dat1 & dat2)
-
-Notes:
-
-* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
-  means zero or more data.
-
-* The `#datum` form only applies when the datum following the hash sign is
-  anything other than a bare string, since otherwise this would be ambiguous
-  with a rune literal.  A bare string can nevertheless follow the hash sign by
-  separating the two with a backslash:
-
-      #\string  ->  (#HASH & string)
-
-* Though not represented in the table due to notational difficulty, the form
-  `#rune(...)` doesn't require a list in the second position; any datum that
-  works with the `#datum` syntax also works with `#rune<DATUM>`.
-
-      #rune1#rune2  -> (#rune1 & #rune2)
-
-      #rune\string  -> (#rune & string)
-
-      #rune'string  -> (#rune #QUOTE & string)
-
-      #rune"string" -> (#rune #DQSTR & |string|)
-
-  As a counter-example, following a rune immediately with a bare string isn't
-  possible without the delimiting backslash, since that would be ambiguous:
-
-      #abcdefgh  ;Could be (#abcdef & gh) or (#abcde & fgh) or ...
-
-* Syntax sugar can combine arbitrarily.  Some examples follow.  Any of these may
-  or may not actually have a meaning in code; many could simply end up producing
-  an error during decoding, or later evaluation of code.
-
-      #{...}            -> (#HASH #BRACE ...)
-
-      #'foo             -> (#HASH #QUOTE & foo)
-
-      ##'[...]          -> (#HASH #HASH #QUOTE #SQUARE ...)
-
-      {x y}[i j]        -> (#JOIN (#BRACE x y) #SQUARE i j)
-
-      foo.bar.baz{x y}  -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)
-
-* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as
-  `(#QUOTE & foo)`; a single pair with the quoted datum in the second position.
-
-  The same principle is used when parsing other sugar; some examples follow:
-
-      Incorrect                              Correct
-
-      #(x y z) -> (#HASH (x y z))            #(x y z) -> (#HASH x y z)
-
-      [x y z]  -> (#SQUARE (x y z))          [x y z]  -> (#SQUARE x y z)
-
-      #{x}     -> (#HASH (#BRACE (x)))       #{x}     -> (#HASH #BRACE x)
-
-      foo(x y) -> (#JOIN foo (x y))          foo(x y) -> (#JOIN foo x y)
-
-* Those used to thinking in Lisp and Scheme may assume that `(#QUOTE ...)` halts
-  further decoding of enclosed data.  This is not so, since quoting is related
-  to code evaluation, not decoding.
-
-
-## Shebang
-
-There is one final "syntax sugar" translation whose sole purpose is to allow a
-shebang line at the start of a file:
-
-    #!interpreter          ->  (#SHBANG & interpreter)
-
-    #!interpreter argline  ->  (#SHBANG interpreter & argline)
-
-Under default settings, the decoder will allow this datum to appear once at the
-beginning of a per-file decoding sequence, and simply discard it.
-
-
-<!--
-;; Local Variables:
-;; fill-column: 80
-;; End:
--->
-- 
cgit v1.2.3