author Taylan Kammer <taylan.kammer@gmail.com> 2026-05-23 22:22:57 +0200
committer Taylan Kammer <taylan.kammer@gmail.com> 2026-05-23 22:22:57 +0200
commit 378f8598a5a57b731948241e41f584f5172dc2a2
tree e9352110efe5b204a5abe7e00693be2004aab4e5 /docs/c1/1-parse.md
parent f1f134d072e375335be5c1203095115fef1db253
An update of sorts.
Diffstat (limited to 'docs/c1/1-parse.md')
-rw-r--r-- docs/c1/1-parse.md 167
1 files changed, 140 insertions, 27 deletions
diff --git a/docs/c1/1-parse.md b/docs/c1/1-parse.md
index 73b8d8a..6484cab 100644
--- a/docs/c1/1-parse.md
+++ b/docs/c1/1-parse.md
@@ -12,48 +12,52 @@ which is necessary to strategically construct more complex code and data:
+--------+-----------------+--------+----------+------+
The parser can also output non-negative integers, but this is only used for
-datum labels; number literals are handled by the *decoder* (see next).
+datum labels; number literals are handled by the *decoder* instead.
-The parser recognizes various "syntax sugar" and transforms it into uses of the
-above data types. The most ubiquitous example is of course the list:
- (datum1 datum2 ...) -> (datum1 & (datum2 & (... & ())))
+## Decoder
-The following table summarizes the other supported transformations:
+A separate process called *decoding* can transform such data into more complex
+types. For example, `(#HASH x y z)` could be decoded into an array, so the
+expression `#(x y z)` could work like in Scheme; or `(#SQUARE x y z)` could be
+decoded into a function call expression that will, at run-time, allocate and
+initialize a dynamic array with three elements, so the expression `[x y z]`
+would work like in JavaScript.
- "xyz" -> (#QUOTE & |xyz|) #datum -> (#HASH & datum)
+Decoding also resolves datum labels, goes over strings to find ones that are
+actually a number literal, and takes care of a number of other transformations.
+This offloads complexity, allowing the parser to remain extremely simple. See
+the dedicated documentation of the decoder for more.
- [...] -> (#SQUARE ...) #rune(...) -> (#rune ...)
- {...} -> (#BRACE ...) dat1dat2 -> (#JOIN dat1 & dat2)
+## Syntax sugar
- 'datum -> (#QUOTE & datum) dat1.dat2 -> (#DOT dat1 & dat2)
+The parser recognizes various "syntax sugar" and transforms it into uses of the
+above listed minimal data types. The most ubiquitous example is the list:
- `datum -> (#GRAVE & datum) dat1:dat2 -> (#COLON dat1 & dat2)
+ (datum1 datum2 ...) -> (datum1 & (datum2 & (... & ())))
- ,datum -> (#COMMA & datum) #%hex% -> (#LABEL & hex)
+The following table summarizes the other transformations available:
- #%hex=datum -> (#LABEL hex & datum)
+ "xyz" -> (#QUOTE & |xyz|) #datum -> (#HASH & datum)
-A separate process called *decoding* can transform such data into more complex
-types. For example, `(#HASH x y z)` could be decoded into a vector, so the
-expression `#(x y z)` works just like in Scheme.
+ ~_xyz_ -> (#TILDE & |xyz|) #rune(...) -> (#rune ...)
-Decoding also resolves datum labels, goes over strings to find ones that are
-actually a number literal, and takes care of a number of other transformations.
-This offloads complexity, allowing the parser to remain extremely simple. See
-the dedicated documentation of the decoder for more.
+ [...] -> (#SQUARE ...) dat1dat2 -> (#JOIN dat1 & dat2)
+
+ {...} -> (#BRACE ...) dat1.dat2 -> (#DOT dat1 & dat2)
+
+ 'datum -> (#QUOTE & datum) dat1:dat2 -> (#COLON dat1 & dat2)
+
+ `datum -> (#GRAVE & datum) #%hex=datum -> (#LABEL hex & datum)
+
+ ,datum -> (#COMMA & datum) #%hex% -> (#LABEL & hex)
-Further notes about the syntax sugar table and examples above:
+Notes about the table and examples:
* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
means zero or more data; hex is a hexadecimal number of up to 12 digits.
-* The `#datum` form only applies when the datum following the hash sign is a
- list, quoted string, quote expression, another expression starting with the
- hash sign, or a pipe-quoted string (see next). A bare string can follow the
- hash sign by separating the two with a backslash: `#\string`
-
* Strings can be quoted with pipes, like symbols in Scheme. This is the "real"
string literal syntax, whereas using double quotes is syntax sugar for a
quoted string literal.
@@ -62,6 +66,16 @@ Further notes about the syntax sugar table and examples above:
"foo bar baz" -> (#QUOTE & |foo bar baz|)
+* See the next section for an explanation of the tilde syntax, which implements
+ "raw" string literals.
+
+* The `#datum` form only applies when the datum following the hash sign is
+ anything other than a bare string (unquoted, without pipe symbol) since
+ otherwise this would be ambiguous with a rune literal. A bare string can
+ nevertheless follow the hash sign by separating the two with a backslash:
+
+ #\string -> (#HASH & string)
+
* Though not represented in the table due to notational difficulty, the form
`#rune(...)` doesn't require a list in the second position; any datum that
works with the `#datum` syntax also works with `#rune<DATUM>`.
@@ -81,7 +95,7 @@ Further notes about the syntax sugar table and examples above:
* Syntax sugar can combine arbitrarily. Some examples follow. Any of these may
or may not actually have a meaning in code; many could simply end up producing
- a syntax error at the macro-expand stage.
+ an error during decoding, or later interpretation of code.
#{...} -> (#HASH #BRACE ...)
@@ -111,7 +125,106 @@ Further notes about the syntax sugar table and examples above:
* Runes are case-sensitive, and the parser always emits runes using upper-case
letters when expressing syntax sugar. Uppercase rune names are reserved for
Zisp's internal use and standard library; users can use lowercase runes with
- custom meaning without worrying about clashes.
+ custom meaning without worrying about clashes, with the exception of a small
+ number of lowercase runes such as `#true` and `#false` that are part of the
+ default decoder settings.
+
+
+## Tilde strings
+
+There is a special type of syntax sugar for "raw" strings, meaning that neither
+backslash escapes nor any other kind of escape sequence are recognized.
+
+This raw string syntax begins with a tilde, followed by any byte. That byte
+becomes the termination marker, and the string cannot represent a literal
+occurrence of it, since there are no escape sequences.
+
+ ~%foo \ bar% -> (#TILDE & |foo \\ bar|)
+
+This can be useful, for instance, when representing regular expressions as
+quoted string literals in code:
+
+ ~/^foo\\(bar|baz)\.\[".*"\]$/ ;; matches e.g. foo\bar.["blah"]
+
+Were it not for this syntax, this regular expression would need to be
+represented by the following quoted string literal in Zisp code:
+
+ "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"
+
+Alternatively, imagine searching for certain MS Windows file paths:
+
+ ~_C:\\\\User\\foo_ ;; matches C:\\User\foo
+
+That's already ugly. Without raw strings, it would need to look like this:
+
+ "C:\\\\\\\\User\\\\foo"
+
+Typically, the rune `#TILDE` would be treated as a synonym for `#QUOTE` by the
+decoder, though creative programmers could repurpose it.
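
The tilde rule above is small enough to sketch in a few lines. The following Python is only an illustration of the described behavior, not Zisp's actual parser; the function name `read_tilde_string` and its `(content, next_position)` return convention are assumptions made for this example.

```python
def read_tilde_string(src: bytes, pos: int = 0):
    """Read a raw "tilde string" starting at src[pos].

    Hypothetical helper: returns (content, position after the terminator).
    """
    if src[pos:pos + 1] != b"~":
        raise ValueError("expected '~' at start of tilde string")
    if pos + 1 >= len(src):
        raise ValueError("missing terminator byte after '~'")

    # The byte right after the tilde becomes the termination marker.
    delim = src[pos + 1:pos + 2]

    # No escape sequences exist, so the first recurrence of the marker
    # always ends the string.
    end = src.find(delim, pos + 2)
    if end == -1:
        raise ValueError("unterminated tilde string")

    return src[pos + 2:end], end + 1


content, rest = read_tilde_string(b"~%foo \\ bar% tail")
print(content)  # the raw bytes between the two '%' markers
```

Because the scan is a plain search for one byte, the content is taken verbatim; this is also why the chosen marker can never appear inside the string.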
+
+
+## Newlines in strings
+
+Normally, a newline in a string has no special meaning and simply becomes part
+of the string. However, newlines can be backslash-escaped, which simply erases
+them; the escaped newline can also be preceded or followed by any number of tab
+and space characters, which are all stripped as well. (Note: It's not blanks
+preceding the backslash that are stripped, but blanks following the backslash
+and preceding the newline; i.e., blanks at the end of the line.)
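
As a rough illustration of the rule just described, the effect of an escaped newline can be modelled with a single regular expression. This is a sketch in Python, not Zisp's implementation; the name `join_escaped_newlines` is made up for this example, and it assumes the input contains no escaped backslash directly before a newline (a real parser would process escapes left to right rather than with one regex).

```python
import re

# A backslash, optional blanks up to the end of the line, the newline itself,
# and any blanks that begin the next line.
_ESCAPED_NEWLINE = re.compile(r"\\[ \t]*\n[ \t]*")


def join_escaped_newlines(text: str) -> str:
    """Erase backslash-escaped newlines together with the surrounding blanks."""
    return _ESCAPED_NEWLINE.sub("", text)


source = "split into multiple \\\n    lines, but it's one line."
print(join_escaped_newlines(source))
```

Blanks before the backslash are left alone, which is why the space after "multiple" survives and the words stay separated.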
+
+Following are some examples of how multi-line strings can appear in source code
+with different intentions and meanings:
+
+ (define paragraph "This paragraph has been visually split into multiple \
+ lines, but the newlines are escaped, so it's one line.")
+
+ (define json-object '| ;; use '|| so we needn't escape "key" etc.
+ {
+ "key": "value"
+ }
+ |)
+
+The second example is actually slightly problematic. It begins with a newline,
+which may be undesirable, but escaping that newline would cause the first line
+to have no indentation, thus the opening `{` would not line up with the closing
+`}` when this string is printed out. Further, if the entire block of code is
+indented, then the string contents may be more indented than intended. (No pun
+or rhyme intended.) Consider:
+
+ (let ((foo one))
+ (let ((bar two))
+ (let ((json-object '|
+ {
+ "key": "value"
+ }
+ |))
+ (do-whatever))))
+
+The string bound to `json-object` has way more indentation than the programmer
+intended. Should the parser attempt to solve this issue?
+
+Thankfully, we have the decoder. The decoder's handling of `#QUOTE` can simply
+apply a post-processing algorithm such as the one used for the Java 15 text
+blocks feature: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
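
To give a feel for what such post-processing could look like, here is a simplified Python sketch. It is not the JEP 378 algorithm itself (which, among other things, also takes the position of the closing delimiter into account) and not anything the Zisp decoder is confirmed to do; the name `strip_incidental_indent` is hypothetical.

```python
def strip_incidental_indent(text: str) -> str:
    """Remove the leading newline and the common indentation of a text block."""
    lines = text.split("\n")

    # Drop the blank first line that directly follows the opening delimiter.
    if len(lines) > 1 and lines[0].strip() == "":
        lines = lines[1:]

    # The smallest indentation among non-blank lines is treated as incidental.
    indents = [len(l) - len(l.lstrip(" \t")) for l in lines if l.strip()]
    margin = min(indents, default=0)

    # Strip the margin and trailing blanks; whitespace-only lines become empty.
    return "\n".join(l[margin:].rstrip() if l.strip() else "" for l in lines)


block = "\n      {\n        \"key\": \"value\"\n      }\n    "
print(strip_incidental_indent(block))
```

With this kind of rule, the deeply nested `json-object` string would come out with `{` and `}` in column zero and only the inner line indented, regardless of how far the surrounding code is indented.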
+
+The only feature Zisp cannot offer here is a way to fence off multi-line strings
+with a longer token such as `"""` as seen in Python or Java, or an arbitrary
+word as seen in Bourne shell and PHP "here doc" syntax. For simplicity, the
+Zisp parser omits such features.
+
+That said, if a programmer truly wanted to have arbitrary text blocks in code,
+without needing to escape anything in them, it's possible to abuse the tilde
+string syntax by using it with an ASCII control character which is displayed
+visibly by a text editor. In the following, the characters `^\` are meant to
+represent a literal ASCII File Separator character in the source code:
+
+ (define json-object ~^\
+ {
+ "key": "value"
+ }
+ ^\)
+
+Hey, it works fine in Emacs, so why not?? (`C-q C-\` to insert the `^\`.)
<!--
;; Local Variables: