From 378f8598a5a57b731948241e41f584f5172dc2a2 Mon Sep 17 00:00:00 2001
From: Taylan Kammer
Date: Sat, 23 May 2026 22:22:57 +0200
Subject: An update of sorts.

---
 docs/c1/1-parse.md       | 167 +++++++++++++++++++++++++++++++++++++++--------
 docs/c1/grammar/index.md |   6 ++
 docs/index.md            |   7 +-
 3 files changed, 149 insertions(+), 31 deletions(-)

(limited to 'docs')

diff --git a/docs/c1/1-parse.md b/docs/c1/1-parse.md
index 73b8d8a..6484cab 100644
--- a/docs/c1/1-parse.md
+++ b/docs/c1/1-parse.md
@@ -12,48 +12,52 @@ which is necessary to strategically construct more complex code and data:
     +--------+-----------------+--------+----------+------+

 The parser can also output non-negative integers, but this is only used for
-datum labels; number literals are handled by the *decoder* (see next).
+datum labels; number literals are handled by the *decoder* instead.

-The parser recognizes various "syntax sugar" and transforms it into uses of the
-above data types. The most ubiquitous example is of course the list:

-    (datum1 datum2 ...) -> (datum1 & (datum2 & (... & ())))
+## Decoder

-The following table summarizes the other supported transformations:
+A separate process called *decoding* can transform such data into more complex
+types. For example, `(#HASH x y z)` could be decoded into an array, so the
+expression `#(x y z)` could work like in Scheme; or `(#SQUARE x y z)` could be
+decoded into a function call expression that will, at run-time, allocate and
+initialize a dynamic array with three elements, so the expression `[x y z]`
+would work like in JavaScript.

-    "xyz"   -> (#QUOTE & |xyz|)      #datum      -> (#HASH & datum)
+Decoding also resolves datum labels, goes over strings to find ones that are
+actually a number literal, and takes care of a number of other transformations.
+This offloads complexity, allowing the parser to remain extremely simple. See
+the dedicated documentation of the decoder for more.

-    [...]   -> (#SQUARE ...)         #rune(...)  -> (#rune ...)

-    {...}   -> (#BRACE ...)          dat1dat2    -> (#JOIN dat1 & dat2)
+## Syntax sugar

-    'datum  -> (#QUOTE & datum)      dat1.dat2   -> (#DOT dat1 & dat2)
+The parser recognizes various "syntax sugar" and transforms it into uses of the
+above listed minimal data types. The most ubiquitous example is the list:

-    `datum  -> (#GRAVE & datum)      dat1:dat2   -> (#COLON dat1 & dat2)
+    (datum1 datum2 ...) -> (datum1 & (datum2 & (... & ())))

-    ,datum  -> (#COMMA & datum)      #%hex%      -> (#LABEL & hex)
+The following table summarizes the other transformations available:

-                                     #%hex=datum -> (#LABEL hex & datum)
+    "xyz"   -> (#QUOTE & |xyz|)      #datum      -> (#HASH & datum)

-A separate process called *decoding* can transform such data into more complex
-types. For example, `(#HASH x y z)` could be decoded into a vector, so the
-expression `#(x y z)` works just like in Scheme.
+    ~_xyz_  -> (#TILDE & |xyz|)      #rune(...)  -> (#rune ...)

-Decoding also resolves datum labels, goes over strings to find ones that are
-actually a number literal, and takes care of a number of other transformations.
-This offloads complexity, allowing the parser to remain extremely simple. See
-the dedicated documentation of the decoder for more.
+    [...]   -> (#SQUARE ...)         dat1dat2    -> (#JOIN dat1 & dat2)
+
+    {...}   -> (#BRACE ...)          dat1.dat2   -> (#DOT dat1 & dat2)
+
+    'datum  -> (#QUOTE & datum)      dat1:dat2   -> (#COLON dat1 & dat2)
+
+    `datum  -> (#GRAVE & datum)      #%hex=datum -> (#LABEL hex & datum)
+
+    ,datum  -> (#COMMA & datum)      #%hex%      -> (#LABEL & hex)

-Further notes about the syntax sugar table and examples above:
+Notes about the table and examples:

 * The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
   means zero or more data; hex is a hexadecimal number of up to 12 digits.

-* The `#datum` form only applies when the datum following the hash sign is a
-  list, quoted string, quote expression, another expression starting with the
-  hash sign, or a pipe-quoted string (see next). A bare string can follow the
-  hash sign by separating the two with a backslash: `#\string`
-
 * Strings can be quoted with pipes, like symbols in Scheme. This is the "real"
   string literal syntax, whereas using double quotes is syntax sugar for a
   quoted string literal.
@@ -62,6 +66,16 @@ Further notes about the syntax sugar table and examples above:

       "foo bar baz" -> (#QUOTE & |foo bar baz|)

+* See the next section for an explanation of the tilde syntax, which implements
+  "raw" string literals.
+
+* The `#datum` form only applies when the datum following the hash sign is
+  anything other than a bare string (unquoted, without pipe symbol) since
+  otherwise this would be ambiguous with a rune literal. A bare string can
+  nevertheless follow the hash sign by separating the two with a backslash:
+
+      #\string -> (#HASH & string)
+
 * Though not represented in the table due to notational difficulty, the form
   `#rune(...)` doesn't require a list in the second position; any datum that
   works with the `#datum` syntax also works with `#rune`.
@@ -81,7 +95,7 @@ Further notes about the syntax sugar table and examples above:

 * Syntax sugar can combine arbitrarily. Some examples follow. Any of these may
   or may not actually have a meaning in code; many could simply end up producing
-  a syntax error at the macro-expand stage.
+  an error during decoding, or later interpretation of code.

       #{...} -> (#HASH #BRACE ...)

@@ -111,7 +125,106 @@ Further notes about the syntax sugar table and examples above:
 * Runes are case-sensitive, and the parser always emits runes using upper-case
   letters when expressing syntax sugar. Uppercase rune names are reserved for
   Zisp's internal use and standard library; users can use lowercase runes with
-  custom meaning without worrying about clashes.
+  custom meaning without worrying about clashes, with the exception of a small
+  number of lowercase runes such as `#true` and `#false` that are part of the
+  default decoder settings.
+
+
+## Tilde strings
+
+There is a special type of syntax sugar for "raw" strings, meaning that neither
+backslash escapes nor any other kind of escape sequence is recognized.
+
+This raw string syntax begins with a tilde, followed by any byte. That byte
+becomes the termination marker, and the string cannot represent a literal
+occurrence of it, since there are no escape sequences.
+
+    ~%foo \ bar% -> (#TILDE & |foo \\ bar|)
+
+This can be useful, for instance, when representing regular expressions as
+quoted string literals in code:
+
+    ~/^foo\\(bar|baz)\.\[".*"\]$/ ;; matches e.g. foo\bar.["blah"]
+
+Were it not for this syntax, this regular expression would need to be
+represented by the following quoted string literal in Zisp code:
+
+    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"
+
+Alternatively, imagine searching for certain MS Windows file paths:
+
+    ~_C:\\\\User\\foo_ ;; matches C:\\User\foo
+
+That's already ugly. Without raw strings, it would need to look like this:
+
+    "C:\\\\\\\\User\\\\foo"
+
+Typically, the rune `#TILDE` would be treated as a synonym for `#QUOTE` by the
+decoder, though creative programmers could repurpose it.
+
+
+## Newlines in strings
+
+Normally, a newline in a string has no special meaning and simply becomes part
+of the string. However, newlines can be backslash-escaped, which simply erases
+them; the escaped newline can also be preceded or followed by any number of tab
+and space characters, which are all stripped as well. (Note: It's not blanks
+preceding the backslash that are stripped, but blanks following the backslash
+and preceding the newline; i.e., blanks at the end of the line.)
+
+Following are some examples of how multi-line strings can appear in source code
+with different intentions and meanings:
+
+    (define paragraph "This paragraph has been visually split into multiple \
+                       lines, but the newlines are escaped, so it's one line.")
+
+    (define json-object '| ;; use '|| so we needn't escape "key" etc.
+      {
+        "key": "value"
+      }
+    |)
+
+The second example is actually slightly problematic. It begins with a newline,
+which may be undesirable, but escaping that newline would cause the first line
+to have no indentation, thus the opening `{` would not line up with the closing
+`}` when this string is printed out. Further, if the entire block of code is
+indented, then the string contents may be more indented than intended. (No pun
+or rhyme intended.) Consider:
+
+    (let ((foo one))
+      (let ((bar two))
+        (let ((json-object '|
+                {
+                  "key": "value"
+                }
+              |))
+          (do-whatever))))
+
+The string bound to `json-object` has way more indentation than the programmer
+intended. Should the parser attempt to solve this issue?
+
+Thankfully, we have the decoder. The implementation of `#QUOTE` can simply
+apply a post-processing algorithm such as the one used by the Java 15 text
+blocks feature: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
+
+The only feature Zisp cannot offer here is a way to fence off multi-line strings
+with a longer token such as `"""` as seen in Python or Java, or an arbitrary
+word as seen in Bourne shell and PHP "here doc" syntax. For simplicity, the
+Zisp parser omits such features.
+
+That said, if a programmer truly wanted to have arbitrary text blocks in code,
+without needing to escape anything in them, it's possible to abuse the tilde
+string syntax by using it with an ASCII control character which is displayed
+visibly by a text editor. In the following, the characters `^\` are meant to
+represent a literal ASCII File Separator character in the source code:
+
+    (define json-object ~^\
+      {
+        "key": "value"
+      }
+    ^\)
+
+Hey, it works fine in Emacs, so why not?? (`C-q C-\` to insert the `^\`.)
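
Review note, not part of the patch: the indentation post-processing that the text above delegates to the decoder's `#QUOTE` handling could look roughly like the following. This is only a sketch in Python, not Zisp's actual decoder. The function name `strip_incidental_indent` is hypothetical, and the logic is a simplified reading of the JEP 378 idea (it ignores escape sequences and counts a tab as one column).

```python
def strip_incidental_indent(text: str) -> str:
    """Remove indentation that only exists because the literal was indented
    along with the surrounding code (simplified, JEP 378-inspired sketch)."""
    lines = text.split("\n")

    # Drop the line break that directly follows the opening delimiter.
    if lines and lines[0].strip() == "":
        lines = lines[1:]

    # If the final line is blank, it is just the whitespace in front of the
    # closing delimiter. It takes part in measuring the indentation, but it
    # is not kept as content.
    closing = lines.pop() if lines and lines[-1].strip() == "" else None

    # Blank lines in the middle of the text do not influence the measurement.
    significant = [ln for ln in lines if ln.strip()]
    if closing is not None:
        significant.append(closing)

    indent = min(
        (len(ln) - len(ln.lstrip(" \t")) for ln in significant),
        default=0,
    )

    # Cut the shared indentation and any trailing blanks from every line.
    body = [ln[indent:].rstrip() for ln in lines]
    return "\n".join(body) + ("\n" if closing is not None else "")
```

With this approach the position of the closing delimiter decides how much indentation survives: placing it in the same column as the content removes all shared indentation, while placing it further left keeps the difference.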