author Taylan Kammer <taylan.kammer@gmail.com> 2026-06-20 22:53:50 +0200
committer Taylan Kammer <taylan.kammer@gmail.com> 2026-06-20 22:53:50 +0200
commit b84ed4f563b3536365f7d3cc4d068407e98685b3
tree 9ab7b18d712db1329b6230cb45520e7c85dc46fd /doc/0/1-parse.md
parent bfaa74b19fc81dbe071d55566a78a8e329237eff
It's a revolution baby. (HEAD, master)
Diffstat (limited to 'doc/0/1-parse.md')
-rw-r--r-- doc/0/1-parse.md 595
1 files changed, 595 insertions, 0 deletions
diff --git a/doc/0/1-parse.md b/doc/0/1-parse.md
new file mode 100644
index 0000000..101a3b6
--- /dev/null
+++ b/doc/0/1-parse.md
@@ -0,0 +1,595 @@
+# Parser for Code & Data
+
+<!--TOC-->
+
+Zisp s-expressions represent an extremely minimal set of data types; only that
+which is necessary to strategically construct more complex values:
+
+ +---------+--------+----------+------+
+ | String | Rune | List | Nil |
+ +---------+--------+----------+------+
+ | foobar | #name | (X ...) | () |
+ +---------+--------+----------+------+
+
+The parser recognizes various *syntax sugar* which abbreviates verbose syntax,
+and may result in special data structures (typically, a list with a rune in its
+first position) which another Zisp component called the *decoder* can transform
+into a rich set of value types.
+
+More details about syntax sugar, and the decoder, are explained later.
+
+
+## Character Encoding
+
+The parser does not consume Unicode characters; it consumes bytes. The grammar
+is generally built from bytes that correspond to ASCII characters.
+
+Some elements of the grammar, such as comments and quoted strings, may contain
+arbitrary byte sequences, until terminated. These sequences may happen to be
+valid UTF-8 text. This way, quoted strings and comments may contain Unicode
+text encoded in UTF-8, but the parser does not check these for validity.
+
+Since comments and quoted strings may contain arbitrary byte sequences, a text
+editor or other program displaying Zisp s-expressions may need to use a special
+visual representation for bytes that don't represent valid text.
+
+The parser working on bytes rather than Unicode characters is not a limitation,
+but rather a feature: It allows Zisp s-expressions to be used as a structured
+data exchange format, which may contain binary data elements, without the need
+to encode these in Base64 or other such text representations of binary data.
+Consider the example:
+
+ ((image.webp "<BINARY>")
+ (video.webm "<BINARY>"))
+
+For this to work, any incidental occurrences of the double-quote sign and the
+backslash sign within the `<BINARY>` data must be escaped with a backslash; all
+other bytes can appear verbatim in the strings.
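+
+The escaping rule above can be sketched in a few lines of Python. This is only
+an illustration, not part of Zisp; the function name `escape_for_dquote` is
+made up for this example.

```python
def escape_for_dquote(data: bytes) -> bytes:
    """Wrap arbitrary bytes in double quotes, escaping only '"' and '\\'.

    Hypothetical helper: it applies the rule described in the text and
    leaves every other byte (including NUL and non-UTF-8 bytes) verbatim.
    """
    # Escape backslashes first, so that the backslashes added for the
    # double quotes are not escaped a second time.
    escaped = data.replace(b"\\", b"\\\\").replace(b'"', b'\\"')
    return b'"' + escaped + b'"'
```

+
+Escaping backslashes before double quotes matters: in the opposite order, the
+backslash inserted in front of each quote would itself get doubled.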
+
+
+## Stream Parsing
+
+The parser can be repeatedly invoked on a byte stream to consume the next datum
+within. This does not require "unreading" or back-seeking within the stream;
+the parser always reads a full datum, and stops after some byte which cleanly
+terminates the currently parsed datum.
+
+This means Zisp s-expressions can be safely intermixed with other data within
+the same byte stream. So long as the other data is consumed by some parser
+which similarly stops reading at a clear boundary, the Zisp parser can then
+continue operating on the same stream. Consider the example:
+
+ ("image.webp" 8273)
+
+ << 8273 bytes >>
+
+ ("video.webm" 736)
+
+ << 736 bytes >>
+
+The "header" for each file in this stream is a Zisp s-expression containing
+information about how many bytes should be read after the header, before the
+next file header appears. (The header data need to be terminated with a blank
+ASCII character such as a newline; the closing parenthesis does not act as a
+terminator unto itself due to the "join" syntax sugar.)
+
+To enable this stream parsing strategy, the parser does not use any automatic
+buffering. If it did, it might inadvertently consume some bytes beyond the
+currently parsed datum, leaving the stream inconsistent.
+
+If the parser is meant to be used on an input stream associated with expensive
+system calls, such as a file handle or network socket, it's best to wrap that
+stream in some intermediate object which asks the system for large chunks of
+data at once, and stores the data in a buffer.
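+
+As a rough illustration of the framing shown above, the following Python
+sketch reads header/payload pairs without ever reading past a boundary. It is
+not the Zisp parser: `read_header` and `read_entries` are hypothetical names,
+and the header "parser" is a stand-in that only understands the exact shape
+`("name" N)` terminated by a newline.

```python
import io
import re

# Stand-in for a real parser: matches only headers of the form ("name" N).
HEADER = re.compile(rb'\(\s*"([^"\\]*)"\s+(\d+)\s*\)')

def read_header(stream):
    """Read one header byte by byte, stopping right after its newline."""
    line = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            return None  # end of stream
        if byte == b"\n":
            if line.strip():
                break  # a non-blank line is complete
            line.clear()  # skip blank lines between entries
            continue
        line += byte
    match = HEADER.fullmatch(bytes(line).strip())
    if match is None:
        raise ValueError(f"unsupported header: {bytes(line)!r}")
    return match.group(1), int(match.group(2))

def read_entries(stream):
    """Yield (name, payload) pairs from a stream of header/payload frames."""
    while (header := read_header(stream)) is not None:
        name, size = header
        payload = stream.read(size)  # exactly the announced byte count
        if len(payload) != size:
            raise EOFError("stream ended inside a payload")
        yield name, payload

demo = io.BytesIO(b'("image.webp" 5)\n\x00\x01"\\\n\n("video.webm" 3)\nabc')
entries = list(read_entries(demo))
```

+
+Because the header reader pulls one byte at a time, the payload reader always
+starts at the first payload byte. With a real file or socket, the same code
+would sit on top of a buffering wrapper as described above, so that the
+byte-wise reads do not each turn into a system call.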
+
+
+## Comments
+
+Two types of comment are supported: datum comments and line comments.
+
+* A semicolon followed by a tilde instructs the parser to consume one datum and
+ discard it. Whitespace may appear between the tilde and the datum to discard.
+
+* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
+ discard bytes until a newline (ASCII Line Feed) is encountered.
+
+
+## Value vs. Datum
+
+A Zisp *value* that has an *external representation* in the form of a sequence
+of bytes is called a *datum*. Every datum is a value, but not every value is a
+datum. In other words, a datum is a value that can be printed out as a byte
+sequence which the parser can turn back into an equivalent datum.
+
+A value that is not a datum may nevertheless be *encoded* into one, allowing it
+to have an external representation. After parsing, it needs to be *decoded* to
+actually become the expected value.
+
+One may speak of an *external representation of a value* where the value is not
+itself a datum, but has an encoding as one. The more strictly correct term for
+this is: "The external representation of the datum encoding the value."
+
+### Syntax sugar
+
+The parser recognizes various *syntax sugar* to abbreviate an equivalent datum
+construction, or express a datum that encodes a more complex value.
+
+As an example, the expression `#(x y z)` is an abbreviation for the equivalent
+`(#HASH x y z)`. These are two external representations for the same datum;
+after parsing, both will yield values that are indistinguishable in all but
+their memory address.
+
+An example of syntax sugar that is not a mere abbreviation is a quoted string
+which contains bytes that could not appear in a *bare* string:
+
+ "foo bar" -> (#DQSTR <STRING>)
+
+In this example, the visual token `<STRING>` represents the actual string value
+in program memory, which has no direct external representation in bytes because
+it contains a space character.
+
+Those familiar with Lisp and Scheme may expect bare strings to be parsed into a
+separate type called *symbol* while quoted strings are parsed directly into a
+string type, but this is not the case in Zisp.
+
+### Decoder
+
+The *decoder* transforms Zisp data into values of more complex types, including
+values that are not of a datum type.
+
+Combined with syntax sugar, this allows Zisp to offer familiar syntax elements.
+For example, the expression `#(x y z)` which parses into `(#HASH x y z)` can be
+decoded into an array, so the result is similar to the vector syntax of Scheme.
+
+Decoding also resolves datum labels, goes over bare strings to find ones that
+represent a number literal, and takes care of a number of other transforms.
+This offloads complexity, allowing the parser to remain extremely simple.
+
+See the dedicated documentation of the [decoder](2-decode.html) for more.
+
+
+## Data types
+
+Following is a more in-depth explanation of each data type constructed by the
+Zisp s-expression parser.
+
+These are in fact value types, though the term "data type" is often used due to
+familiarity. A Zisp value that is a member of one of the following value types
+is only a *datum* if it adheres to additional constraints as explained below.
+
+### String
+
+Strings can appear *bare* or be quoted in various ways. A quoted string is in
+fact parsed into a list value with a rune in the first position to identify the
+quotation variant that was parsed, and the string value in the second position;
+or, in case of at-quoted strings, a special construct we will look at later.
+
+ +-----------+-------------------------------+
+ | Syntax | Parse output |
+ +-----------+-------------------------------+
+ | |bytes| | (#PQSTR <STRING>) |
+ +-----------+-------------------------------+
+ | "bytes" | (#DQSTR <STRING>) |
+ +-----------+-------------------------------+
+ | @_bytes_ | (#ATSTR <SENTINEL> <STRING>) |
+ +-----------+-------------------------------+
+
+The visual token `<STRING>` denotes the actual string, as a Zisp value, in the
+second position of the list. The visual token `<SENTINEL>` stands for a Zisp
+integer value between 0 and 254.
+
+These external representations of strings will be explained in more detail
+further below, including backslash escape sequences allowed within, and how
+exactly at-quoted strings work.
+
+Strings have a fixed length, counted in bytes. Each byte can have any value,
+including zero (ASCII NUL). The parser reads bytes, not Unicode characters; a
+string may contain UTF-8 byte sequences, but these are not tested for validity.
+
+A string that is up to 255 bytes long is automatically *interned*, meaning any
+occurrence of the same string -- equal in length and containing the same byte
+values -- ends up being represented by the same bit-pattern; either a memory
+address, or an immediate representation within a CPU word for short strings.
+The quotation method is inconsequential to this process; for example, while
+`|foo bar|` and `"foo bar"` will parse into different list values, the actual
+string they hold a reference to will be the same one in program memory. This
+behavior is however configurable and can be disabled entirely for cases where
+large numbers of arbitrary binary strings are being parsed.
+
+Strings of length greater than 255 bytes are stored separately in memory, even
+if they are equal in length and content.
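+
+The interning behavior can be pictured with a small Python sketch. This is an
+analogy only: the names `INTERN_LIMIT` and `intern_string` are hypothetical,
+and a dictionary stands in for whatever table the real implementation uses.

```python
INTERN_LIMIT = 255  # longest string, in bytes, that gets interned

_interned = {}  # maps string contents to the one shared object

def intern_string(data: bytes) -> bytes:
    """Return a shared object for short strings; pass long ones through."""
    if len(data) > INTERN_LIMIT:
        return data  # long strings are stored separately, never shared
    # setdefault stores data on first sight and returns the stored object
    # on every later call with equal contents.
    return _interned.setdefault(data, data)
```

+
+Two short strings with equal contents come back as the same object, so they
+can be compared by identity; two long strings remain distinct objects even
+when their contents are equal.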
+
+### Rune
+
+A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must
+begin with a letter, and may only contain letters and digits. This character
+sequence of letters and digits is called the *name* of the rune. A rune that
+follows this constraint is valid as a datum.
+
+Zisp code may explicitly construct values of the rune type that violate the
+above constraints. Such runes are not valid data and cannot be printed or
+parsed.
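+
+The naming constraint is small enough to express as a single pattern. The
+Python below is a sketch of that check; `is_datum_rune_name` is a made-up
+name, and the input is the rune name without its leading hash sign.

```python
import re

# One ASCII letter, followed by up to five ASCII letters or digits.
RUNE_NAME = re.compile(rb"[A-Za-z][A-Za-z0-9]{0,5}")

def is_datum_rune_name(name: bytes) -> bool:
    """Check whether a rune name satisfies the datum constraints."""
    return RUNE_NAME.fullmatch(name) is not None
```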
+
+Runes are case-sensitive, and the parser always emits runes using upper-case
+letters when expressing syntax sugar. Uppercase rune names are reserved for
+Zisp's internal use and standard library; users can use lowercase runes with
+custom meaning without worrying about clashes, with the exception of a small
+number of lowercase runes such as `#true` and `#false` that are part of the
+default decoder settings and documented explicitly as such.
+
+Runes are always stored directly in a CPU word; never by memory address.
+
+### List
+
+A list is a contiguous array of one or more values in memory, whose length may
+be encoded directly within the pointer to the head of the array, or else the
+array is terminated with a special sentinel bit-pattern that is not otherwise
+valid as a Zisp value.
+
+The parser allocates a unique array in program memory for every list, and the
+list as a value is then represented by the memory address of that array, with
+either an exact length tag or a tag indicating that it's sentinel-terminated.
+
+Lists are valid data if one of the following holds true:
+
+* The list encodes a quoted string, datum label, or shebang line.
+
+* Every value in the list is a valid datum.
+
+Further, a structure of nested list values may not contain cyclic references
+back up in the structure (which would make the above definition diverge into
+infinity). Such cycles must be broken up with datum labels, or else the list
+cannot be considered a datum, since it cannot be printed or parsed.
+
+### Nil
+
+The Zisp nil value is a singleton and a datum. There is exactly one nil value,
+used in lieu of a list of zero length; it has the external representation `()`.
+
+
+## Quoted strings
+
+Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted.
+This section goes into the details of each variant.
+
+### Pipe-quoted
+
+Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers
+the parser to generate a list with the structure:
+
+ (#PQSTR <STRING>) ;; <STRING> is visual aid, not syntax
+
+The decoder, using default settings, would emit this string verbatim as a value.
+Then, during code evaluation, this would be seen as an identifier. In this way,
+pipe-quoted strings are equivalent to bare strings in functionality.
+
+It is important to understand that the decoder sits between the parser and the
+[evaluator](3-eval.html), and in opposition to Lisp and Scheme tradition, it is
+common for the evaluator to receive values that are not valid as a datum; here,
+a string unto itself that may not be a valid datum. Yet, it is valid as an
+identifier for the purposes of the evaluator.
+
+### Double-quoted
+
+Strings wrapped in the double-quote symbol parse into:
+
+ (#DQSTR <STRING>) ;; <STRING> is visual aid, not syntax
+
+Under default settings, the decoder would transform this into a value which,
+when evaluated as code, simply yields the contained string as a value.
+
+### At-quoted
+
+This is a special type of syntax for "raw" strings, meaning that no backslash
+escapes nor any other kind of escape sequence are recognized within them.
+
+The syntax begins with an at sign, followed by any byte. That byte becomes a
+termination marker, and the string cannot contain an occurrence of it, since
+there are no escape sequences. The byte value 255 has a special meaning; see
+further below.
+
+ @"foo \ bar" -> (#ATSTR <SENTINEL> <STRING>)
+
+The visual tokens `<SENTINEL>` and `<STRING>` represent an integer and string
+value, respectively. Here, the integer would be 34, which is the ASCII value
+for a double-quote sign. The string contains a literal backslash, since there
+is no backslash escape parsing.
+
+This style of quoting can be useful, for instance, when representing regular
+expressions as strings in code:
+
+ ;; Matches e.g. foo\bar.["blah"]
+
+ @/^foo\\(bar|baz)\.\[".*"\]$/
+
+Were it not for this syntax, this regular expression would only be possible to
+represent through a quoted string such as the following:
+
+ ;; Same as above, but so many backslashes
+
+ "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"
+
+The byte that follows the at sign need not be a printable character or even a
+valid ASCII byte; it can be absolutely any byte value, even NUL. This can be
+useful to easily encode binary data which is known to not contain a specific
+byte; an example would be C strings which cannot contain NUL.
+
+If however the byte value is 255, then it does not stand for a sentinel, but
+rather indicates that 6 more bytes follow, interpreted as a big-endian 48-bit
+integer, which is the count of bytes making up the contents of the string.
+
+Example sequence of bytes, represented as a mixture of ASCII and raw integers:
+
+ '@' 255 0 0 0 0 2 100 <612 bytes> -> (#ATSTR <STRING>)
+
+One may ask why the length is not included in the list. This is unnecessary,
+since strings in Zisp already carry length information in their own metadata
+structure.
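+
+Both forms of at-quoted string can be read with a short routine. The Python
+below is a sketch under stated assumptions, not the actual parser: the name
+`read_at_string` is hypothetical, the leading at sign is assumed to have been
+consumed already, and the return value is a plain tuple rather than a Zisp
+list.

```python
import io

def read_at_string(stream):
    """Read one at-quoted string body from a binary stream.

    Returns (sentinel, contents). The sentinel is None for the
    length-prefixed form, mirroring the list without a sentinel.
    """
    marker = stream.read(1)
    if not marker:
        raise EOFError("missing marker byte after '@'")
    if marker[0] == 255:
        # Length-prefixed form: 6 bytes, big-endian, then the contents.
        raw = stream.read(6)
        if len(raw) != 6:
            raise EOFError("truncated length field")
        length = int.from_bytes(raw, "big")
        contents = stream.read(length)
        if len(contents) != length:
            raise EOFError("truncated string contents")
        return None, contents
    # Sentinel form: copy bytes verbatim until the marker byte recurs.
    contents = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unterminated at-quoted string")
        if byte == marker:
            return marker[0], bytes(contents)
        contents += byte
```

+
+In the sentinel form no byte is ever interpreted, so a backslash is just
+another byte of the contents. In both forms the routine stops exactly at the
+end of the string, leaving the rest of the stream untouched.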
+
+
+### Backslash escapes
+
+In pipe-quoted and double-quoted strings, the following ASCII characters may
+follow a backslash to insert a certain character.
+
+ +-------+----------------------------+
+ | Char | Meaning |
+ +-------+----------------------------+
+ | \ | Literal backslash |
+ +-------+----------------------------+
+ | | | Literal pipe symbol |
+ +-------+----------------------------+
+ | " | Literal double-quote |
+ +-------+----------------------------+
+ | 0 | ASCII NUL |
+ +-------+----------------------------+
+ | a | ASCII Alert |
+ +-------+----------------------------+
+ | b | ASCII Backspace |
+ +-------+----------------------------+
+ | t | ASCII Tab (Horizontal) |
+ +-------+----------------------------+
+ | n | ASCII Newline (Line Feed) |
+ +-------+----------------------------+
+ | v | ASCII Vertical Tab |
+ +-------+----------------------------+
+ | f | ASCII Form Feed |
+ +-------+----------------------------+
+ | r | ASCII Carriage Return |
+ +-------+----------------------------+
+ | e | ASCII Escape |
+ +-------+----------------------------+
+
+In words:
+
+* A backslash, followed by a backslash, pipe, or double-quote character, is
+ substituted with a literal occurrence of that character.
+
+* The characters 0, a, b, t, n, v, f, and r have the same meanings as in the C
+ programming language, representing common ASCII control characters; e stands
+ for ASCII Escape, as in the common compiler extension to C.
+
+Further, the following Regular Expression patterns following a backslash have
+special meaning.
+
+ +---------------------+-----------------------+
+ | Regular Expression | Meaning |
+ +---------------------+-----------------------+
+ | [\t ]*\n[\t ]* | Discarded |
+ +---------------------+-----------------------+
+ | x([0-9a-fA-F]{2})*; | Arbitrary bytes |
+ +---------------------+-----------------------+
+ | u[0-9a-fA-F]+; | Unicode Scalar Value |
+ +---------------------+-----------------------+
+
+Explanations:
+
+* A backslash followed by any number of blanks (space or tab), a newline, and
+ again any number of blanks, is substituted with nothing. This is to allow
+ splitting a string into multiple lines for human readability.
+
+ (define p "This paragraph has been visually split into multiple \
+ lines, but the newline is escaped, so it's one line.")
+
+* An x, followed by pairs of hexadecimal digits (case insensitive), terminated
+ by a semicolon, is substituted with the sequence of bytes represented by the
+ corresponding pairs of hexadecimal digits. E.g.: `"foo\xDEADBEEF;bar"`
+
+* A u, followed by a hexadecimal digit sequence (case insensitive), terminated
+ by a semicolon, is substituted with the canonical UTF-8 byte sequence for the
+ Unicode Scalar Value represented by that hexadecimal number. The number must
+ be in the range `0` to `10FFFF`. E.g.: `"foo\u00A0;bar"`
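+
+The two tables above can be combined into one decoding routine. The Python
+below is a sketch, not the parser's actual code: `unescape` is a hypothetical
+name, it works on the bytes found between the quotes rather than on a stream,
+and rejecting surrogate code points is an assumption based on the term
+"Unicode Scalar Value".

```python
import re

# Single-character escapes, keyed by the byte that follows the backslash.
SIMPLE = {
    ord("\\"): b"\\", ord("|"): b"|", ord('"'): b'"',
    ord("0"): b"\x00", ord("a"): b"\x07", ord("b"): b"\x08",
    ord("t"): b"\t", ord("n"): b"\n", ord("v"): b"\x0b",
    ord("f"): b"\x0c", ord("r"): b"\r", ord("e"): b"\x1b",
}
CONTINUATION = re.compile(rb"[\t ]*\n[\t ]*")
HEX_BYTES = re.compile(rb"x((?:[0-9a-fA-F]{2})*);")
SCALAR = re.compile(rb"u([0-9a-fA-F]+);")

def unescape(body: bytes) -> bytes:
    """Resolve backslash escapes in the body of a quoted string."""
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != 0x5C:  # any byte other than a backslash is literal
            out.append(body[i])
            i += 1
            continue
        i += 1  # step over the backslash
        if i < len(body) and body[i] in SIMPLE:
            out += SIMPLE[body[i]]
            i += 1
        elif m := CONTINUATION.match(body, i):
            i = m.end()  # escaped newline and surrounding blanks vanish
        elif m := HEX_BYTES.match(body, i):
            out += bytes.fromhex(m.group(1).decode("ascii"))
            i = m.end()
        elif m := SCALAR.match(body, i):
            code = int(m.group(1).decode("ascii"), 16)
            # Assumption: surrogates are rejected, as they are not scalar values.
            if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
                raise ValueError(f"not a Unicode scalar value: {code:X}")
            out += chr(code).encode("utf-8")
            i = m.end()
        else:
            raise ValueError(f"invalid escape at offset {i - 1}")
    return bytes(out)
```

+
+The single-character escapes are checked first, so the x and u forms are only
+tried when the byte after the backslash is not one of the table entries.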
+
+### Newlines in strings
+
+Normally, a newline in a string has no special meaning and simply becomes part
+of the string. However, newlines can be backslash-escaped, which simply erases
+them; the escaped newline can also be preceded or followed by any number of tab
+and space characters, which are all stripped as well. (Note: It's not blanks
+preceding the backslash that are stripped, but blanks following the backslash
+and preceding the newline; i.e., blanks at the end of the line.)
+
+Following are some examples of how multi-line strings can appear in source code
+with different intentions and meanings:
+
+ (define paragraph "This paragraph has been visually split into multiple \
+ lines, but the newlines are escaped, so it's one line.")
+
+ (define json-object '| ;; use '|| so double-quotes need no escaping
+ {
+ "key": "value"
+ }
+ |)
+
+The second example is actually slightly problematic. It begins with a newline,
+which may be undesirable, but escaping that newline would cause the first line
+to have no indentation, thus the opening `{` would not line up with the closing
+`}` when this string is printed out. Further, if the entire block of code is
+indented, then the string contents may be more indented than intended. (No pun
+or rhyme intended.) Consider:
+
+ (let ((foo one))
+ (let ((bar two))
+ (let ((json-object '|
+ {
+ "key": "value"
+ }
+ |))
+ (do-whatever))))
+
+The string bound to `json-object` has redundant indentation. Should the parser
+attempt to solve this issue?
+
+Thankfully, we have the decoder to handle such complexities. Under the default
+settings, the rune `#HASH` is bound to a decoder rule which detects a payload
+value that is a string literal, and implements the same algorithm as seen in
+Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
+
+Thus, we can do the following:
+
+ (let ((foo one))
+ (let ((bar two))
+ (let ((json-object #|
+ ........... {
+ ........... "key": "value"
+ ........... }
+ ...........|))
+ (do-whatever))))
+
+(Dots represent whitespace that is deleted. The initial newline is, as well.)
+
+The only feature Zisp does not offer is a way to fence off multi-line strings
+with a longer token such as `"""` as seen in Python and Java, or an arbitrary
+word as seen in Bourne shell and PHP "here doc" syntax.
+
+However, if a programmer truly wanted to have arbitrary text blocks in code,
+without needing to escape anything in them, it's possible to abuse at-quoted
+string syntax, using it with an ASCII control character which is displayed
+visibly by a text editor. In the following, the characters `^\` are meant to
+represent a literal ASCII File Separator character in the source code:
+
+ (define json-object #@^\
+ {
+ "key": "value"
+ }
+ ^\)
+
+It works fine in Emacs, so why not? Use `C-q C-\` to insert the `^\`.
+
+This is indeed quite an eldritch syntax, but hopefully most programs would not
+need to use it.
+
+
+## Other syntax
+
+The following table summarizes commonly useful syntax abbreviations:
+
+ [...] -> (#SQUARE ...) #datum -> (#HASH datum)
+
+ {...} -> (#BRACE ...) #rune(...) -> (#rune ...)
+
+ 'datum -> (#QUOTE datum) dat1dat2 -> (#JOIN dat1 dat2)
+
+ `datum -> (#GRAVE datum) dat1.dat2 -> (#DOT dat1 dat2)
+
+ ,datum -> (#COMMA datum) dat1:dat2 -> (#COLON dat1 dat2)
+
+Notes:
+
+* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
+ means zero or more data.
+
+* The `#datum` form only applies when the datum following the hash sign is
+ anything other than a bare string, since otherwise this would be ambiguous
+ with a rune literal. A bare string can nevertheless follow the hash sign by
+ separating the two with a backslash:
+
+ #\string -> (#HASH string)
+
+* Though not represented in the table due to notational difficulty, the form
+ `#rune(...)` doesn't require a list in the second position; any datum that
+ works with the `#datum` syntax also works with `#rune<DATUM>`.
+
+ #rune1#rune2 -> (#rune1 #rune2)
+
+ #rune\string -> (#rune string)
+
+ #rune'string -> (#rune (#QUOTE string))
+
+ #rune"string" -> (#rune (#DQSTR |string|))
+
+ As a counter-example, following a rune immediately with a bare string isn't
+ possible without the delimiting backslash, since that would be ambiguous:
+
+ #abcdefgh ;Could be (#abcdef gh) or (#abcde fgh) or ...
+
+* Syntax sugar can combine arbitrarily. Some examples follow. Any of these may
+ or may not actually have a meaning in code; some might simply end up producing
+ an error during decoding, or later evaluation of code.
+
+ #{...} -> (#HASH (#BRACE ...))
+
+ #'foo -> (#HASH (#QUOTE foo))
+
+ ##'[...] -> (#HASH (#HASH (#QUOTE (#SQUARE ...))))
+
+ {x y}[i j] -> (#JOIN (#BRACE x y) (#SQUARE i j))
+
+ foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo bar) baz) (#BRACE x y))
+
+* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts
+ further decoding of enclosed data. This is not so, since quoting is related
+ to code evaluation, not decoding.
+
+### Datum labels
+
+Valid data cannot be cyclic, since that would mean it has infinite length in
+bytes. To externally represent a value with cyclic structure, one uses datum
+labels in the data encoding of the value.
+
+A datum label either wraps another datum to assign a number to it, or contains
+just a reference to a previous assignment.
+
+ +------------------+----------------------------+
+ | Syntax | Internal datum structure |
+ +------------------+----------------------------+
+ | #%<HEX>=<DATUM> | (#LABEL <NUMBER> <DATUM>) |
+ +------------------+----------------------------+
+ | #%<HEX>% | (#LABEL <NUMBER>) |
+ +------------------+----------------------------+
+
+In this visual, the token `<HEX>` stands for a hexadecimal digit sequence, the
+token `<DATUM>` stands for any other datum, and `<NUMBER>` is a stand-in for a
+number value; that which is represented by `<HEX>`.
+
+For clarity, concrete examples follow:
+
+ +-------------------+------------------------------+
+ | Byte sequence | Parse result |
+ +-------------------+------------------------------+
+ | #%1234abcd=(foo) | (#LABEL <0x1234abcd> (foo)) |
+ +-------------------+------------------------------+
+ | #%1234abcd% | (#LABEL <0x1234abcd>) |
+ +-------------------+------------------------------+
+
+Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
+with an integer value. Note that the decoder may not accept a bare string here,
+meaning this syntax sugar is not merely an abbreviation.
+
+### Shebang
+
+Finally, the parser recognizes the Unix *shebang* syntax and outputs a datum to
+hold the string values found within:
+
+ #!interpreter -> (#SHBANG interpreter)
+
+ #!interpreter argline -> (#SHBANG interpreter argline)
+
+When executing a script file, Zisp simply stores this into a global value that
+may be inspected if desired.
+
+
+<!--
+;; Local Variables:
+;; fill-column: 80
+;; End:
+-->