| author | Taylan Kammer <taylan.kammer@gmail.com> | 2026-05-25 20:48:36 +0200 |
|---|---|---|
| committer | Taylan Kammer <taylan.kammer@gmail.com> | 2026-05-26 18:41:27 +0200 |
| commit | fa5db8e89225622a1ee7a5d802f253d07884b13e | |
| tree | d7b25178deac71dff00728134555c75f088ec101 /docs/c1 | |
| parent | 0f0cb85026406356e16310044b4d09bd316b0747 | |
Diffstat (limited to 'docs/c1')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | docs/c1/1-parse.md | 575 |
| -rw-r--r-- | docs/c1/grammar/abnf.txt | 12 |
| -rw-r--r-- | docs/c1/grammar/index.md | 6 |
| -rw-r--r-- | docs/c1/grammar/zbnf.txt | 11 |
| -rw-r--r-- | docs/c1/index.md | 28 |
5 files changed, 509 insertions, 123 deletions
diff --git a/docs/c1/1-parse.md b/docs/c1/1-parse.md index 6484cab..7df2225 100644 --- a/docs/c1/1-parse.md +++ b/docs/c1/1-parse.md @@ -1,169 +1,415 @@ -# Parser for Code & Data +# Parser for Data *For an exact specification of the grammar, see [grammar](grammar/).* -Zisp S-Expressions represent an extremely minimal set of data types; only that -which is necessary to strategically construct more complex code and data: +Zisp s-expressions represent an extremely minimal set of data types; only that +which is necessary to strategically construct more complex values: +--------+-----------------+--------+----------+------+ | TYPE | String | Rune | Pair | Nil | +--------+-----------------+--------+----------+------+ - | E.G. | foo, |foo bar| | #name | (X & Y) | () | + | E.G. | foobar | #name | (X & Y) | () | + | | |foo bar| | | | | + | | "foo bar" | | | | + | | @_foo bar_ | | | | +--------+-----------------+--------+----------+------+ +Datum comments and line comments are supported: + +* A semicolon followed by a tilde instructs the parser to consume one datum and + discard it. Whitespace may appear between the tilde and the datum to discard. + +* A semicolon, followed by a non-tilde byte, instructs the parser to consume and + discard bytes until a newline (ASCII Line Feed) is encountered. + The parser can also output non-negative integers, but this is only used for -datum labels; number literals are handled by the *decoder* instead. +datum labels; number literals are handled by the decoder instead; see below. -## Decoder +## Overview -A separate process called *decoding* can transform such data into more complex -types. For example, `(#HASH x y z)` could be decoded into an array, so the -expression `#(x y z)` could work like in Scheme; or `(#SQUARE x y z)` could be -decoded into a function call expression that will, at run-time, allocate and -initialize a dynamic array with three elements, so the expression `[x y z]` -would work like in JavaScript. 
+This section explains a few core concepts and features related to the parser. -Decoding also resolves datum labels, goes over strings to find ones that are -actually a number literal, and takes care of a number of other transformations. -This offloads complexity, allowing the parser to remain extremely simple. See -the dedicated documentation of the decoder for more. +### Value vs. Datum -## Syntax sugar +A Zisp *value* that has an *external representation* in the form of a sequence +of bytes is called a *datum*. Every datum is a value, but not all values are +data. A datum is a value that can be printed out as a byte sequence which the +parser can recognize and turn back into an equivalent datum. + +One may speak of an *external representation of a value* where the value is not +itself a datum, but can be encoded as a datum. The more strictly correct term +for this is: "The external representation of a datum encoding the value." + + +### Syntax sugar The parser recognizes various "syntax sugar" and transforms it into uses of the -above listed minimal data types. The most ubiquitous example is the list: +above listed primitive data types. As an example, the expression `#(x y z)` is +parsed into the structure `(#HASH x y z)`. These are two completely equivalent +external representations for the same compound datum; after parsing, both byte +sequences will yield data values that are indistinguishable in all but their +memory address. - (datum1 datum2 ...) -> (datum1 & (datum2 & (... & ()))) +The most ubiquitously used syntax sugar is the list, which stands for a chain of +pairs, terminated with nil: -The following table summarizes the other transformations available: + (x y z) -> (x & (y & (z & ()))) - "xyz" -> (#QUOTE & |xyz|) #datum -> (#HASH & datum) +The full syntax sugar table is listed and explained further below. - ~_xyz_ -> (#TILDE & |xyz|) #rune(...) -> (#rune ...) - [...] -> (#SQUARE ...) dat1dat2 -> (#JOIN dat1 & dat2) - - {...} -> (#BRACE ...) 
dat1.dat2 -> (#DOT dat1 & dat2) - - 'datum -> (#QUOTE & datum) dat1:dat2 -> (#COLON dat1 & dat2) - - `datum -> (#GRAVE & datum) #%hex=datum -> (#LABEL hex & datum) - - ,datum -> (#COMMA & datum) #%hex% -> (#LABEL & hex) +### Decoder -Notes about the table and examples: +*The decoder has nothing to do with the concept of text or character encoding.* -* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis - means zero or more data; hex is a hexadecimal number of up to 12 digits. +A separate process called *decoding* can transform Zisp data into values of more +complex types, including values that are not of a datum type. -* Strings can be quoted with pipes, like symbols in Scheme. This is the "real" - string literal syntax, whereas using double quotes is syntax sugar for a - quoted string literal. +For example, the datum `(#HASH x y z)` could be decoded into an array, so the +expression `#(x y z)` could work like in Scheme. - |foo bar baz| -> |foo bar baz| +Decoding also resolves datum labels, goes over bare strings to find ones that +represent a number literal, and takes care of a number of other transforms. +This offloads complexity, allowing the parser to remain extremely simple. - "foo bar baz" -> (#QUOTE & |foo bar baz|) +See the dedicated documentation of the [decoder](2-decode.html) for more. -* See the next section for an explanation of the tilde syntax, which implements - "raw" string literals. -* The `#datum` form only applies when the datum following the hash sign is - anything other than a bare string (unquoted, without pipe symbol) since - otherwise this would be ambiguous with a rune literal. A bare string can - nevertheless follow the hash sign by separating the two with a backslash: +### Character encoding - #\string -> (#HASH & string) +The parser does not consume characters; it consumes bytes. 
-* Though not represented in the table due to notational difficulty, the form - `#rune(...)` doesn't require a list in the second position; any datum that - works with the `#datum` syntax also works with `#rune<DATUM>`. +Grammar is generally constructed by bytes corresponding to ASCII characters. +Some elements of the grammar, such as comments and quoted strings, may contain +arbitrary byte sequences, until terminated. These sequences may happen to be +valid UTF-8 text. This way, quoted strings and comments may contain Unicode +text encoded in UTF-8, but the parser does not check these for validity. - #rune1#rune2 -> (#rune1 & #rune2) +Since comments and quoted strings may contain arbitrary byte sequences, a text +editor or other program displaying Zisp s-expressions may need to use a special +visual representation for bytes that don't represent valid text. - #rune"text" -> (#rune & "text") +The parser being based on bytes rather than characters is not a limitation but +rather a feature: It allows for Zisp s-expressions to be used as a structured +data exchange format that may contain binary data elements without the need to +encode these in Base64 or other such text representations of binary data. +Consider the example: - #rune\string -> (rune & string) + ((image.webp "<< binary data >>") + (video.webm "<< binary data >>")) - #rune'string -> (#rune #QUOTE & string) +All that needs to be done for this to work, is that any incidental occurrences +of the double-quote sign, and the backslash sign, are escaped with a backslash +within the binary data; all other bytes can appear verbatim in the strings. - As a counter-example, following a rune immediately with a bare string isn't - possible without the delimiting backslash, since that would be ambiguous: - #abcdefgh ;Could be (#abcdef & gh) or (#abcde & fgh) or ... +### Stream parsing -* Syntax sugar can combine arbitrarily. Some examples follow. 
Any of these may - or may not actually have a meaning in code; many could simply end up producing - an error during decoding, or later interpretation of code. +The parser can be repeatedly invoked on a byte stream to consume the next datum +within. This does not require "unreading" or back-seeking within the stream; +the parser always reads a full datum, and stops after some byte which cleanly +terminates the currently parsed datum. - #{...} -> (#HASH #BRACE ...) +This means Zisp s-expressions can be safely intermixed with other data within +the same byte stream. So long as the other data is consumed by some parser +which similarly stops reading at a clear boundary, the Zisp parser can then +continue operating on the same stream. Consider the example: - #'foo -> (#HASH #QUOTE & foo) + ("image.webp" 8273) - ##'[...] -> (#HASH #HASH #QUOTE #SQUARE ...) + << 8273 bytes >> - {x y}[i j] -> (#JOIN (#BRACE x y) #SQUARE i j) + ("video.webm" 736) - foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y) + << 736 bytes >> -* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses - as `(#QUOTE & foo)` instead; the operand of `#QUOTE` is the entire cdr. +The "header" for each file in this stream is a Zisp s-expression containing +information about how many bytes should be read after the header, before the +next file header appears. (The header data need to be terminated with a blank +ASCII character such as a newline. The reason why the closing parenthesis does +not act as a terminator unto itself will become apparent later.) - The same principle is used when parsing other sugar; some examples follow: - Incorrect Correct +### Datum labels - #(x y z) -> (#HASH (x y z)) #(x y z) -> (#HASH x y z) +Valid data cannot be cyclic, since that would mean it has infinite length in +bytes. To externally represent a value with cyclic structure, one uses datum +labels in the data encoding of the value. 
- [x y z] -> (#SQUARE (x y z)) [x y z] -> (#SQUARE x y z) +A datum label either wraps another datum to assign a number to it, or contains +just a reference to a previous assignment. - #{x} -> (#HASH (#BRACE (x))) #{x} -> (#HASH #BRACE x) + +----------------------------------+---------------------------------+ + | Internal structure | External representation | + +----------------------------------+---------------------------------+ + | (#LABEL & (<NUMBER> & <DATUM>)) | #%<HEX>=<DATUM> | + +----------------------------------+---------------------------------+ + | (#LABEL & <NUMBER>) | #%<HEX>% | + +----------------------------------+---------------------------------+ - foo(x y) -> (#JOIN foo (x y)) foo(x y) -> (#JOIN foo x y) +In this visual, the token `<NUMBER>` stands for an actual number value that +doesn't have its own external representation. It's printed as a sequence of +hexadecimal digits, denoted by `<HEX>` in the external representation. -* Runes are case-sensitive, and the parser always emits runes using upper-case - letters when expressing syntax sugar. Uppercase rune names are reserved for - Zisp's internal use and standard library; users can use lowercase runes with - custom meaning without worrying about clashes, with the exception of a small - number of lowercase runes such as `#true` and `#false` that are part of the - default decoder settings. +For clarity, concrete examples follow: + #%1234abcd=(foo bar) -> (#LABEL & (<0x1234abcd> & (foo bar))) -## Tilde strings + #%1234abcd% -> (#LABEL & <0x1234abcd>) -There is a special type of syntax sugar for "raw" strings, meaning that no -backslash escapes nor any other kind of escape sequence are recognized. +Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type +with an integer value. -This raw string syntax begins with a tilde, followed by any byte. 
That byte -becomes the termination marker, and the string cannot represent a literal -occurrence of it, since there are no escape sequences. +Datum labels may look like "syntax sugar" but the fact that integers don't have +a direct external representation means that datum labels are a fundamental type +of syntax that has no "desugared" equivalent in external representation. The +decoder will not accept a bare string encoding of an integer here. - ~%foo \ bar% -> (#TILDE |foo \\ bar|) -This can be useful, for instance, when representing regular expressions as -quoted string literals in code: +## Data types - ~/^foo\\(bar|baz)\.\[".*"\]$/ ;; matches e.g. foo\bar.["blah"] +Following is an explanation of the four core data types constructed by the Zisp +s-expression parser. -Were it not for this syntax, this regular expression would need to be -represented by the following quoted string literal in Zisp code: +A Zisp value that is a member of one of these types is also called a *datum* if +it adheres to additional constraints as explained for each type. - "^foo\\\\(bar|baz)\\t\\[\".*\"\\]$" -Alternatively, imagine searching for certain MS Windows file paths: +### String + +Strings can appear "bare" or be quoted in various ways. + +A string, as a stand-alone Zisp value, is only a valid datum if it can be +represented as a bare string. 
If it contains bytes that prevent the bare +representation, then the string must be wrapped in one of the following +structures to become a valid datum, each of which has its own external +representation: + + +-------------------------------+-------------------------------+ + | Internal structure | External representation | + +-------------------------------+-------------------------------+ + | (#PQSTR & <STRING>) | |contents| | + +-------------------------------+-------------------------------+ + | (#DQSTR & <STRING>) | "contents" | + +-------------------------------+-------------------------------+ + | (#ATSTR & <STRING>) | @_contents_ | + +-------------------------------+-------------------------------+ + +The visual token `<STRING>` is meant to denote the actual string, as a Zisp +value, occupying the second position in the pair. It is not actual syntax. + +Note that, while conceptually similar, this internal encoding of string data is +not syntax sugar, since the internal datum representation using runes cannot be +printed out verbatim, due to the attached string being impossible to represent +externally without quotation. As such, quoted strings are fundamental syntax. + +These external representations of strings will be explained in more detail +further below, including backslash escape sequences allowed within. + +Strings have a fixed length, counted in bytes. Each byte can have any value, +including zero (aka ASCII NULL). The parser reads bytes, not characters, and +has no concept of a character encoding, which means that a string can contain +UTF-8 byte sequences, but these are not tested for validity. + +A string that is up to 64 bytes long is automatically *interned*, meaning any +occurrence of the same string -- equal in length and containing the same byte +values -- ends up being represented by the same bit-pattern; either a memory +address, or an immediate representation within a CPU word for short strings. 
+ +Strings with a length greater than 64 bytes end up being represented by a +distinct memory address, even if they are equal in length and content. + + +### Rune + +A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must +begin with a letter, and may only contain letters and digits. This character +sequence of letters and digits is called the *name* of the rune. A rune that +follows this constraint is valid as a datum. + +Zisp code may explicitly construct values of the rune type that violate the +above constraints. Such runes are not valid data and cannot be printed or +parsed in any way. + +Runes are case-sensitive, and the parser always emits runes using upper-case +letters when expressing syntax sugar. Uppercase rune names are reserved for +Zisp's internal use and standard library; users can use lowercase runes with +custom meaning without worrying about clashes, with the exception of a small +number of lowercase runes such as `#true` and `#false` that are part of the +default decoder settings. + +Runes are always stored directly in a CPU word; never by memory address. + + +### Pair + +A pair is a tuple of two values: the first value and the second value. + +The parser allocates a unique two-word cell in the process heap for every pair, +and represents that pair through the memory address of that cell. + +Pairs are valid as a datum if one of the following holds true for the pair: + +* It encodes one of the quoted string variants. + +* It encodes a datum label (assignment or reference). + +* Both the first and second value in the pair is itself a valid datum. + +An additional constraint is that a hierarchy of pairs containing pairs must not +form cycles; if they do, the cycles must be broken up by use of datum labels or +else none of the pairs within the cyclic structure are a valid datum. + + +### Nil + +The Zisp nil value is a singleton and a datum. 
There is exactly one nil value +and it is used to terminate a chain of pairs representing a list of values. + + +## Quoted strings + +Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted. +This section goes into the details of each variant. + + +### Pipe-quoted + +Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers +the parser to generate a pair with the structure: + + (#PQSTR & <STRING>) ;; <STRING> is visual aid, not syntax + +The decoder, using default settings, would emit this string verbatim as a value. +Then, during code evaluation, this would be seen as an identifier. In this way, +pipe-quoted strings are equivalent to bare strings in functionality. + +It is important to understand that the decoder sits between the parser and the +[evaluator](3-execute.html), and in opposition to Lisp and Scheme tradition, it +is common for the evaluator to receive values that are not valid as a datum; in +this case, a string unto itself that may not be a valid datum, due to not being +possible to be represented as a bare string. Yet, it is valid as an identifier +for the purposes of the evaluator, since it is a string *value* like any other. + + +### Double-quoted + +Strings wrapped in the double-quote symbol parse into: + + (#DQSTR & <STRING>) ;; <STRING> is visual aid, not syntax + +Under default settings, the decoder would transform this into a value which, +when evaluated, yields back the string as a value. Typically, this would be +achieved by simply transforming it into `(#QUOTE & <STRING>)`. (Note that, +unlike `(#PQSTR & <STRING>)`, this would not be decoded into a string unto +itself, as that would make the evaluator see it as an identifier.) + - ~_C:\\\\User\\foo_ ;; matches C:\\User\foo +### At-quoted strings AKA raw strings -That's already ugly. 
Without raw strings, it would need to look like this: +There is a special type of syntax for "raw" strings, meaning that no backslash +escapes nor any other kind of escape sequence are recognized within them. - "C:\\\\\\\\User\\\\foo" +This raw string syntax begins with an at sign, followed by any byte. That byte +becomes the termination marker, and the string cannot contain an occurrence of +it, since there are no escape sequences. -Typically, the rune `#TILDE` would be treated as a synonym to `#QUOTE` by the -decoder, though creative programmers could repurpose it. + @"foo \ bar" -> (#ATSTR & <STRING>) +In the above, the visual token `<STRING>` is not part of datum syntax but a +stand-in for the actual string value, which is, literally: `foo \ bar` -## Newlines in strings +This style of quoting can be useful, for instance, when representing regular +expressions as strings in code: + + @/^foo\\(bar|baz)\.\[".*"\]$/ ;; matches e.g. foo\bar.["blah"] + +Were it not for this syntax, this regular expression would only be possible to +represent through a quoted string such as the following: + + "^foo\\\\(bar|baz)\\t\\[\".*\"\\]$" ;; many backslashes + +Alternatively, imagine searching for certain MS Windows file paths: + + @_C:\\\\Users\\([a-z]+)_ ;; matches C:\\User\foo + +That's already ugly. Without raw strings, it would need to look even worse: + + "C:\\\\\\\\Users\\\\([a-z]+)" ;; MANY backslashes + +The byte that follows the at sign need not be a printable character or even a +valid ASCII byte; it can be absolutely any byte value, even NULL. This can be +useful to easily encode binary data which is known to not contain a specific +byte; an example would be C strings which cannot contain NULL. + + +### Backslash escape sequences in strings + +The following backslash escapes are supported in pipe-quoted and double-quoted +strings. (Some rows use Regular Expression notation.) 
+ + +-----------------------------------+------------------------------+ + | Character(s) following backslash | Meaning | + +-----------------------------------+------------------------------+ + | \ | Literal backslash | + +-----------------------------------+------------------------------+ + | | | Literal pipe symbol | + +-----------------------------------+------------------------------+ + | " | Literal double-quote | + +-----------------------------------+------------------------------+ + | RE: /[\t ]*\n[\t ]*/ | Discarded | + +-----------------------------------+------------------------------+ + | 0 | ASCII NULL | + +-----------------------------------+------------------------------+ + | a | ASCII Alert | + +-----------------------------------+------------------------------+ + | b | ASCII Backspace | + +-----------------------------------+------------------------------+ + | t | ASCII Tab (Horizontal) | + +-----------------------------------+------------------------------+ + | n | ASCII Newline (Line Feed) | + +-----------------------------------+------------------------------+ + | v | ASCII Vertical Tab | + +-----------------------------------+------------------------------+ + | f | ASCII Form Feed | + +-----------------------------------+------------------------------+ + | r | ASCII Carriage Return | + +-----------------------------------+------------------------------+ + | e | ASCII Escape | + +-----------------------------------+------------------------------+ + | RE: /x([0-9a-fA-F]{2})+;/ | Arbitrary bytes in hex | + +-----------------------------------+------------------------------+ + | RE: /u[0-9a-fA-F]+;/ | Unicode scalar as UTF-8 | + +-----------------------------------+------------------------------+ + +To clarify: + +* A backslash followed by a backslash, pipe, or double-quote character is + substituted with a literal occurrence of the corresponding character. 
+ +* A backslash followed by any number of blanks (space or tab), a newline, and + again any number of blanks, is substituted with nothing. This is to allow + splitting a string into multiple lines for human readability. + + (define paragraph "This paragraph has been visually split into multiple \ + lines, but the newline is escaped, so it's one line.") + +* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the + C programming language, representing common unprintable ASCII bytes. + +* An x, followed by pairs of hexadecimal digits (case insensitive), terminated + by a semicolon, is substituted with the sequence of bytes represented by the + corresponding pairs of hexadecimal digits. E.g.: `"foo\xDEADBEEF;bar"` + +* A u, followed by a hexadecimal digit sequence (case insensitive), terminated + by a semicolon, is substituted with the canonical UTF-8 byte sequence for the + Unicode Scalar Value represented by that hexadecimal number. The number must + be in the range `0` to `10FFFF`. E.g.: `"foo\u00A0;bar"` + + +### Newlines in strings Normally, a newline in a string has no special meaning and simply becomes part of the string. However, newlines can be backslash-escaped, which simple erases @@ -178,7 +424,7 @@ with different intentions and meanings: (define paragraph "This paragraph has been visually split into multiple \ lines, but the newlines are escaped, so it's one line.") - (define json-object '| ;; use '|| so we needn't escape "key" etc. + (define json-object '| ;; use '|| so double-quotes need no escaping { "key": "value" } @@ -200,31 +446,134 @@ or rhyme intended.) Consider: |)) (do-whatever)))) -The string bound to `json-object` has way more indentation than the programmer -intended. Should the parser attempt to solve this issue? +The string bound to `json-object` has redundant indentation. Should the parser +attempt to solve this issue? + +Thankfully, we have the decoder to handle such complexities. 
Under the default +settings, the rune `#HASH` is bound to a decoder rule which detects a payload +value that is a string literal, and implements the same algorithm as seen in +Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378) -Thankfully, we have the decoder. The implementation of `#QUOTE` can simply -implement a post-processing algorithm such as the one used for Java 15 text -blocks feature: [JEP 378: Text Blocks](https://openjdk.org/jeps/378) +Thus, we can do the following: -The only feature Zisp cannot offer here is a way to fence off multi-line strings -with a longer token such as `"""` as seen in Python or Java, or an arbitrary -word as seen in Bourne shell and PHP "here doc" syntax. For simplicity, the -Zisp parser omits such features. + (let ((foo one)) + (let ((bar two)) + (let ((json-object #| + ........... { + ........... "key": "value" + ........... } + ...........|)) + (do-whatever)))) -That said, if a programmer truly wanted to have arbitrary text blocks in code, -without needing to escape anything in them, it's possible to abuse the tilde -string syntax by using it with an ASCII control character which is displayed +(Dots represent whitespace that is deleted. The initial newline is, as well.) + +The only feature Zisp does not offer is a way to fence off multi-line strings +with a longer token such as `"""` as seen in Python and Java, or an arbitrary +word as seen in Bourne shell and PHP "here doc" syntax. + +However, if a programmer truly wanted to have arbitrary text blocks in code, +without needing to escape anything in them, it's possible to abuse at-quoted +string syntax, using it with an ASCII control character which is displayed visibly by a text editor. In the following, the characters `^\` are meant to represent a literal ASCII File Separator character in the source code: - (define json-object ~^\ + (define json-object #@^\ { "key": "value" } ^\) -Hey, it works fine in Emacs, so why not?? (`C-q C-\` to insert the `^\`.) 
+Hey, it works fine in Emacs, so why not? Use `C-q C-\` to insert the `^\`. + +This is indeed quite an eldritch syntax, but hopefully most programs would not +need to use it anyway. + + +## Syntax sugar + +The parser recognizes various "syntax sugar" and transforms it into equivalent +datum constructions. The most ubiquitous example of this is the list, which is +transformed into a chain of pairs, terminated with nil: + + (datum1 datum2 ...) -> (datum1 & (datum2 & (... & ()))) + +This is so ubiquitous as to be hardly considered "syntax sugar" but is counted +as such, since any list could just as well be written as a chain of pairs; both +would result in an equivalent datum when parsed. + +The following table summarizes the other available transformations: + + [...] -> (#SQUARE ...) #datum -> (#HASH & datum) + + {...} -> (#BRACE ...) #rune(...) -> (#rune ...) + + 'datum -> (#QUOTE & datum) dat1dat2 -> (#JOIN dat1 & dat2) + + `datum -> (#GRAVE & datum) dat1.dat2 -> (#DOT dat1 & dat2) + + ,datum -> (#COMMA & datum) dat1:dat2 -> (#COLON dat1 & dat2) + +Notes: + +* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis + means zero or more data. + +* The `#datum` form only applies when the datum following the hash sign is + anything other than a bare string, since otherwise this would be ambiguous + with a rune literal. A bare string can nevertheless follow the hash sign by + separating the two with a backslash: + + #\string -> (#HASH & string) + +* Though not represented in the table due to notational difficulty, the form + `#rune(...)` doesn't require a list in the second position; any datum that + works with the `#datum` syntax also works with `#rune<DATUM>`. 
+ + #rune1#rune2 -> (#rune1 & #rune2) + + #rune\string -> (rune & string) + + #rune'string -> (#rune #QUOTE & string) + + #rune"string" -> (#rune #DQSTR & |string|) + + As a counter-example, following a rune immediately with a bare string isn't + possible without the delimiting backslash, since that would be ambiguous: + + #abcdefgh ;Could be (#abcdef & gh) or (#abcde & fgh) or ... + +* Syntax sugar can combine arbitrarily. Some examples follow. Any of these may + or may not actually have a meaning in code; many could simply end up producing + an error during decoding, or later evaluation of code. + + #{...} -> (#HASH #BRACE ...) + + #'foo -> (#HASH #QUOTE & foo) + + ##'[...] -> (#HASH #HASH #QUOTE #SQUARE ...) + + {x y}[i j] -> (#JOIN (#BRACE x y) #SQUARE i j) + + foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y) + +* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as + `(#QUOTE & foo)`; a single pair with the quoted datum in the second position. + + The same principle is used when parsing other sugar; some examples follow: + + Incorrect Correct + + #(x y z) -> (#HASH (x y z)) #(x y z) -> (#HASH x y z) + + [x y z] -> (#SQUARE (x y z)) [x y z] -> (#SQUARE x y z) + + #{x} -> (#HASH (#BRACE (x))) #{x} -> (#HASH #BRACE x) + + foo(x y) -> (#JOIN foo (x y)) foo(x y) -> (#JOIN foo x y) + +* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts + further decoding of enclosed data. This is not so, since quoting is related + to code evaluation, not decoding. 
<!-- ;; Local Variables: diff --git a/docs/c1/grammar/abnf.txt b/docs/c1/grammar/abnf.txt index 6daaceb..7424f41 100644 --- a/docs/c1/grammar/abnf.txt +++ b/docs/c1/grammar/abnf.txt @@ -19,7 +19,7 @@ Blank = HTAB / LF / %x0b / %x0c / CR / SP / Comment Trail = SkipLine / SkipUnit / ";" "~" *Blank -Datum = BareString / DottedStr / CladDatum / Rune / RuneStr +Datum = BareString / SpecialStr / CladDatum / Rune / RuneStr / RuneDotStr / RuneClad / LabelRef / LabelDef / HashStr / HashDotStr / HashClad / QuoteExpr / JoinExpr @@ -36,7 +36,7 @@ AnyButLF = %x00-09 / %x0b-ff BareString = BareChar *( BareChar / Numeric ) -DottedStr = ( "." / Numeric ) *( "." / Numeric / BareChar ) +SpecialStr = SpecStrChar *( SpecStrChar / BareChar ) CladDatum = "|" *( PipeStrChar / "\" StringEsc ) "|" / DQUOTE *( QuotStrChar / "\" StringEsc ) DQUOTE @@ -48,7 +48,7 @@ Rune = "#" RuneName RuneStr = "#" RuneName "\" BareString -RuneDotStr = "#" RuneName "\" DottedStr +RuneDotStr = "#" RuneName "\" SpecialStr RuneClad = "#" RuneName CladDatum @@ -58,7 +58,7 @@ LabelDef = "#" "%" Label "=" Datum HashStr = "#" "\" BareString -HashDotStr = "#" "\" DottedStr +HashDotStr = "#" "\" SpecialStr HashClad = "#" CladDatum @@ -73,10 +73,12 @@ JoinExpr = Datum RJoinDatum BareChar = "!" / "$" / "%" / "*" / "/" / "<" / "=" / ">" - / "?" / "@" / "^" / "_" / "~" / ALPHA + / "?" / "^" / "_" / "~" / ALPHA Numeric = "+" / "-" / DIGIT +SpecStrChar = "." / ":" / Numeric + PipeStrChar = %x00-5b / %x5d-7b / %x7d-ff ; any but "|" or "\" QuotStrChar = %x00-21 / %x23-5b / %x5d-ff ; any but DQUOTE or "\" diff --git a/docs/c1/grammar/index.md b/docs/c1/grammar/index.md index d70021a..8fefe0e 100644 --- a/docs/c1/grammar/index.md +++ b/docs/c1/grammar/index.md @@ -74,6 +74,12 @@ The following limits are not represented in the grammar: want to use the ABNF to generate a parser anyway.) +## At-quoted strings + +The mechanism of at-quoted strings is not represented in any of the +grammars, since it essentially has 256 variants. 
+ + ## Stream-parsing strategy The parser consumes one `Unit` from the input stream every time it's diff --git a/docs/c1/grammar/zbnf.txt b/docs/c1/grammar/zbnf.txt index 551c319..002e027 100644 --- a/docs/c1/grammar/zbnf.txt +++ b/docs/c1/grammar/zbnf.txt @@ -22,7 +22,7 @@ SkipLine : ( ~LF )* [LF] OneDatum : BareString | CladDatum -BareString : ( '.' | '+' | '-' | DIGIT ) ( BareChar | '.' )* +BareString : SpecBareChar ( BareChar | JoinChar )* | BareChar+ CladDatum : PipeStr | QuoteStr | HashExpr | QuoteExpr | List @@ -33,16 +33,17 @@ HashExpr : '#' ( RuneExpr | LabelExpr | HashDatum ) QuoteExpr : "'" Datum | '`' Datum | ',' Datum List : ParenList | SquareList | BraceList +SpecBareChar : '+' | '-' | JoinChar | DIGIT + BareChar : ALPHA | DIGIT - | '!' | '$' | '%' | '*' | '+' - | '-' | '/' | '<' | '=' | '>' - | '?' | '@' | '^' | '_' | '~' + | '!' | '$' | '%' | '*' | '+' | '-' | '/' + | '<' | '=' | '>' | '?' | '^' | '_' | '~' PipeStrChar : ~( '|' | '\' ) QuotStrChar : ~( '"' | '\' ) StringEsc : '\' | '|' | '"' | ( HTAB | SP )* LF ( HTAB | SP )* - | 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e' + | '0' | 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e' | 'x' HexByte+ ';' | 'u' UnicodeSV ';' diff --git a/docs/c1/index.md b/docs/c1/index.md index f306e11..af01cea 100644 --- a/docs/c1/index.md +++ b/docs/c1/index.md @@ -1,2 +1,30 @@ # Chapter 1: Genesis +This chapter goes through the processes involved in reading source +code, running it, and optionally compiling it. + +1. [Parse](1-parse.html) + + The parser receives a stream of bytes and transforms them into a + minimal set of data types with very little processing. + +2. [Decode](2-decode.html) + + The decoder runs configurable and extensible pre-processing steps + over data received from the parser, enriching it with more complex + data types, and handling primitive source code transforms. 
It's + comparable to the C pre-processor or Lisp's `DEFMACRO` mechanism, + with a few more responsibilities, such as number literal parsing. + +3. [Execute](3-execute.html) + + Code is executed (or interpreted, or evaluated) in an environment, + also called a module, which may be mutated, and linked with other + modules. Execution is immediate, without any pre-compilation. + +4. [Compile](4-compile.html) + + Procedures from within the compiler module can be used to demand + the compilation of other modules, with various options, yielding + static or dynamic object files. These may be loaded immediately, + replacing the previously uncompiled module code in memory. |
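
Note on the escape table added to `docs/c1/1-parse.md` in this commit: the rules for pipe-quoted and double-quoted strings can be sketched as a small byte-level routine. This is only an illustration written in Python, not code from the Zisp sources. The name `decode_escapes`, the use of `ValueError` for malformed input, and the choice to receive the string body without its surrounding quotes are all assumptions made for the sketch. Surrogate code points are rejected because they are not Unicode scalar values; the document itself only states the range `0` to `10FFFF`.

```python
import re

# Single-byte escapes from the table: the byte after the backslash maps to
# the bytes it stands for.
SIMPLE = {
    ord("\\"): b"\\", ord("|"): b"|", ord('"'): b'"',
    ord("0"): b"\x00", ord("a"): b"\x07", ord("b"): b"\x08",
    ord("t"): b"\t", ord("n"): b"\n", ord("v"): b"\x0b",
    ord("f"): b"\x0c", ord("r"): b"\r", ord("e"): b"\x1b",
}
LINE_JOIN = re.compile(rb"[\t ]*\n[\t ]*")         # escaped newline, discarded
HEX_BYTES = re.compile(rb"x((?:[0-9a-fA-F]{2})+);")  # arbitrary bytes in hex
SCALAR = re.compile(rb"u([0-9a-fA-F]+);")          # Unicode scalar as UTF-8


def decode_escapes(body: bytes) -> bytes:
    """Expand backslash escapes in the body of a quoted string (hypothetical helper)."""
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != 0x5C:  # not a backslash: copy the byte verbatim
            out.append(body[i])
            i += 1
            continue
        i += 1  # step over the backslash
        if i < len(body) and body[i] in SIMPLE:
            out += SIMPLE[body[i]]
            i += 1
        elif m := LINE_JOIN.match(body, i):
            i = m.end()  # blanks, newline, blanks: substituted with nothing
        elif m := HEX_BYTES.match(body, i):
            out += bytes.fromhex(m.group(1).decode("ascii"))
            i = m.end()
        elif m := SCALAR.match(body, i):
            cp = int(m.group(1), 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError(f"not a Unicode scalar value: {cp:X}")
            out += chr(cp).encode("utf-8")
            i = m.end()
        else:
            raise ValueError(f"invalid escape at byte offset {i}")
    return bytes(out)
```

Applied to the two examples given in the new text, `decode_escapes(b"foo\\xDEADBEEF;bar")` yields the four bytes `DE AD BE EF` between `foo` and `bar`, and `decode_escapes(b"foo\\u00A0;bar")` yields the two-byte UTF-8 encoding of U+00A0 in the same position.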

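
Note on the at-quoted ("raw") strings introduced by this commit: the rule that the byte after the at sign becomes the termination marker can be sketched as follows. Again this is an illustrative Python sketch under stated assumptions, not the actual parser: `read_at_string` is a hypothetical name, the input is assumed to be a binary stream positioned at the `@`, and the sketch returns only the raw string bytes rather than building the `(#ATSTR & <STRING>)` pair. It does not model how the real parser treats the bytes that follow the closing marker.

```python
import io


def read_at_string(stream) -> bytes:
    """Read one at-quoted string from a binary stream (hypothetical helper)."""
    if stream.read(1) != b"@":
        raise ValueError("expected an at sign")
    marker = stream.read(1)  # any byte value at all, even NUL
    if not marker:
        raise ValueError("missing termination marker")
    out = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("unterminated at-quoted string")
        if byte == marker:
            return bytes(out)  # no escapes: the first marker byte ends it
        out += byte


# The example from the new text: the backslash is kept verbatim.
raw = read_at_string(io.BytesIO(b'@"foo \\ bar"'))

# A NUL marker, as suggested for data known not to contain NUL.
stream = io.BytesIO(b"@\x00abc\x00rest")
payload = read_at_string(stream)
remainder = stream.read()  # bytes after the marker are left in the stream
```

Because the reader consumes one byte at a time and stops at the closing marker, it never needs to seek backwards, which matches the stream-parsing behaviour described in the new "Stream parsing" section.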