An update of sorts.

author: Taylan Kammer <taylan.kammer@gmail.com> 2026-05-23 22:22:57 +0200
committer: Taylan Kammer <taylan.kammer@gmail.com> 2026-05-23 22:22:57 +0200
commit: 378f8598a5a57b731948241e41f584f5172dc2a2 (patch)
tree: e9352110efe5b204a5abe7e00693be2004aab4e5 /docs/c1/1-parse.md
parent: f1f134d072e375335be5c1203095115fef1db253 (diff)
1 files changed, 140 insertions, 27 deletions
diff --git a/docs/c1/1-parse.md b/docs/c1/1-parse.md
index 73b8d8a..6484cab 100644
--- a/docs/c1/1-parse.md
+++ b/docs/c1/1-parse.md
@@ -12,48 +12,52 @@ which is necessary to strategically construct more complex code and data:
     +--------+-----------------+--------+----------+------+
 
 The parser can also output non-negative integers, but this is only used for
-datum labels; number literals are handled by the *decoder* (see next).
+datum labels; number literals are handled by the *decoder* instead.
 
-The parser recognizes various "syntax sugar" and transforms it into uses of the
-above data types.  The most ubiquitous example is of course the list:
 
-    (datum1 datum2 ...)  ->  (datum1 & (datum2 & (... & ())))
+## Decoder
 
-The following table summarizes the other supported transformations:
+A separate process called *decoding* can transform such data into more complex
+types.  For example, `(#HASH x y z)` could be decoded into an array, so the
+expression `#(x y z)` could work like in Scheme; or `(#SQUARE x y z)` could be
+decoded into a function call expression that will, at run-time, allocate and
+initialize a dynamic array with three elements, so the expression `[x y z]`
+would work like in JavaScript.
 
-    "xyz"   -> (#QUOTE & |xyz|)       #datum       -> (#HASH & datum)
+Decoding also resolves datum labels, goes over strings to find ones that are
+actually a number literal, and takes care of a number of other transformations.
+This offloads complexity, allowing the parser to remain extremely simple.  See
+the dedicated documentation of the decoder for more.
 
-    [...]   -> (#SQUARE ...)          #rune(...)   -> (#rune ...)
 
-    {...}   -> (#BRACE ...)           dat1dat2     -> (#JOIN dat1 & dat2)
+## Syntax sugar
 
-    'datum  -> (#QUOTE & datum)       dat1.dat2    -> (#DOT dat1 & dat2)
+The parser recognizes various "syntax sugar" and transforms it into uses of the
+minimal data types listed above.  The most ubiquitous example is the list:
 
-    `datum  -> (#GRAVE & datum)       dat1:dat2    -> (#COLON dat1 & dat2)
+    (datum1 datum2 ...)  ->  (datum1 & (datum2 & (... & ())))
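
The list desugaring above can be illustrated with a minimal Python sketch.  It is not Zisp's implementation: the name `list_to_pairs` is hypothetical, a pair `(a & b)` is modeled as a two-element tuple, and the empty list `()` as an empty tuple.

```python
NIL = ()  # stands in for the empty list ()


def list_to_pairs(items):
    """Fold a flat sequence into right-nested pairs ending in NIL."""
    result = NIL
    # Build from the tail so the first item ends up outermost.
    for item in reversed(items):
        result = (item, result)
    return result


print(list_to_pairs(["datum1", "datum2", "datum3"]))
```

Building from the tail mirrors how the written form nests: each element is paired with the remainder of the list.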
 
-    ,datum  -> (#COMMA & datum)       #%hex%       -> (#LABEL & hex)
+The following table summarizes the other transformations available:
 
-                                      #%hex=datum  -> (#LABEL hex & datum)
+    "xyz"   -> (#QUOTE & |xyz|)       #datum       -> (#HASH & datum)
 
-A separate process called *decoding* can transform such data into more complex
-types.  For example, `(#HASH x y z)` could be decoded into a vector, so the
-expression `#(x y z)` works just like in Scheme.
+    ~_xyz_  -> (#TILDE & |xyz|)       #rune(...)   -> (#rune ...)
 
-Decoding also resolves datum labels, goes over strings to find ones that are
-actually a number literal, and takes care of a number of other transformations.
-This offloads complexity, allowing the parser to remain extremely simple.  See
-the dedicated documentation of the decoder for more.
+    [...]   -> (#SQUARE ...)          dat1dat2     -> (#JOIN dat1 & dat2)
+                                 
+    {...}   -> (#BRACE ...)           dat1.dat2    -> (#DOT dat1 & dat2)
+                                 
+    'datum  -> (#QUOTE & datum)       dat1:dat2    -> (#COLON dat1 & dat2)
+                                 
+    `datum  -> (#GRAVE & datum)       #%hex=datum  -> (#LABEL hex & datum)
+                                 
+    ,datum  -> (#COMMA & datum)       #%hex%       -> (#LABEL & hex)
 
-Further notes about the syntax sugar table and examples above:
+Notes about the table and examples:
 
 * The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
   means zero or more data; hex is a hexadecimal number of up to 12 digits.
 
-* The `#datum` form only applies when the datum following the hash sign is a
-  list, quoted string, quote expression, another expression starting with the
-  hash sign, or a pipe-quoted string (see next).  A bare string can follow the
-  hash sign by separating the two with a backslash: `#\string`
-
 * Strings can be quoted with pipes, like symbols in Scheme.  This is the "real"
   string literal syntax, whereas using double quotes is syntax sugar for a
   quoted string literal.
@@ -62,6 +66,16 @@ Further notes about the syntax sugar table and examples above:
 
       "foo bar baz"  -> (#QUOTE & |foo bar baz|)
 
+* See the next section for an explanation of the tilde syntax, which implements
+  "raw" string literals.
+
+* The `#datum` form only applies when the datum following the hash sign is
+  anything other than a bare string (unquoted, without pipe symbol) since
+  otherwise this would be ambiguous with a rune literal.  A bare string can
+  nevertheless follow the hash sign by separating the two with a backslash:
+
+      #\string  ->  (#HASH & string)
+
 * Though not represented in the table due to notational difficulty, the form
   `#rune(...)` doesn't require a list in the second position; any datum that
   works with the `#datum` syntax also works with `#rune<DATUM>`.
@@ -81,7 +95,7 @@ Further notes about the syntax sugar table and examples above:
 
 * Syntax sugar can combine arbitrarily.  Some examples follow.  Any of these may
   or may not actually have a meaning in code; many could simply end up producing
-  a syntax error at the macro-expand stage.
+  an error during decoding, or later interpretation of code.
 
       #{...}            -> (#HASH #BRACE ...)
 
@@ -111,7 +125,106 @@ Further notes about the syntax sugar table and examples above:
 * Runes are case-sensitive, and the parser always emits runes using upper-case
   letters when expressing syntax sugar.  Uppercase rune names are reserved for
   Zisp's internal use and standard library; users can use lowercase runes with
-  custom meaning without worrying about clashes.
+  custom meaning without worrying about clashes, with the exception of a small
+  number of lowercase runes such as `#true` and `#false` that are part of the
+  default decoder settings.
+
+
+## Tilde strings
+
+There is a special type of syntax sugar for "raw" strings, meaning that neither
+backslash escapes nor any other kind of escape sequence is recognized.
+
+This raw string syntax begins with a tilde, followed by any byte.  That byte
+becomes the termination marker, and the string cannot represent a literal
+occurrence of it, since there are no escape sequences.
+
+    ~%foo \ bar%  ->  (#TILDE & |foo \\ bar|)
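
As a rough illustration of the rule described above, the following Python sketch reads one tilde string from a source text.  The function name `read_tilde_string` and its return shape are assumptions made for this example, and it works on Python characters rather than raw bytes; the actual Zisp parser may differ.

```python
def read_tilde_string(src, pos=0):
    """Read a raw string starting at src[pos], which must be a tilde.

    Returns (contents, index just past the closing marker).
    """
    if pos >= len(src) or src[pos] != "~":
        raise ValueError("expected '~' at position %d" % pos)
    if pos + 1 >= len(src):
        raise ValueError("missing termination marker after '~'")
    marker = src[pos + 1]  # whatever follows the tilde ends the string
    end = src.find(marker, pos + 2)
    if end == -1:
        raise ValueError("unterminated tilde string")
    # No escape processing at all: the contents are taken verbatim.
    return src[pos + 2:end], end + 1


contents, next_pos = read_tilde_string(r"~%foo \ bar%")
print(contents)
```

Because the first occurrence of the marker always ends the string, the contents can never include the marker itself, which matches the limitation stated above.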
+
+This can be useful, for instance, when representing regular expressions as
+quoted string literals in code:
+
+    ~/^foo\\(bar|baz)\.\[".*"\]$/     ;; matches e.g. foo\bar.["blah"]
+
+Were it not for this syntax, this regular expression would need to be
+represented by the following quoted string literal in Zisp code:
+
+    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"
+
+Alternatively, imagine searching for certain MS Windows file paths:
+
+    ~_C:\\\\User\\foo_                ;; matches C:\\User\foo
+
+That's already ugly.  Without raw strings, it would need to look like this:
+
+    "C:\\\\\\\\User\\\\foo"
+
+Typically, the rune `#TILDE` would be treated as a synonym of `#QUOTE` by the
+decoder, though creative programmers could repurpose it.
+
+
+## Newlines in strings
+
+Normally, a newline in a string has no special meaning and simply becomes part
+of the string.  However, newlines can be backslash-escaped, which simply erases
+them; the escaped newline can also be preceded or followed by any number of tab
+and space characters, which are all stripped as well.  (Note: It's not blanks
+preceding the backslash that are stripped, but blanks following the backslash
+and preceding the newline; i.e., blanks at the end of the line.)
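
The escaped-newline rule can be sketched in Python as follows.  This is only an illustration under stated assumptions: the name `erase_escaped_newlines` is hypothetical, a backslash is assumed to escape exactly the one character after it, and every escape other than an escaped newline is copied through untouched.

```python
def erase_escaped_newlines(text):
    """Remove backslash-escaped newlines and the blanks around them."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        if text[i] != "\\":
            out.append(text[i])
            i += 1
            continue
        # Skip blanks between the backslash and a possible newline.
        j = i + 1
        while j < n and text[j] in " \t":
            j += 1
        if j < n and text[j] == "\n":
            # Escaped newline: also drop leading blanks on the next line.
            j += 1
            while j < n and text[j] in " \t":
                j += 1
            i = j
        else:
            # Some other escape: keep the backslash and the next character.
            out.append(text[i:i + 2])
            i += 2
    return "".join(out)


print(erase_escaped_newlines("split into multiple \\\n      lines"))
```

Blanks before the backslash are kept, so the single space after "multiple" in the example survives and the words stay separated.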
+
+Following are some examples of how multi-line strings can appear in source code
+with different intentions and meanings:
+
+    (define paragraph "This paragraph has been visually split into multiple \
+                       lines, but the newlines are escaped, so it's one line.")
+
+    (define json-object '|   ;; use '|| so we needn't escape "key" etc.
+      {
+        "key": "value"
+      }
+    |)
+
+The second example is actually slightly problematic.  It begins with a newline,
+which may be undesirable, but escaping that newline would cause the first line
+to have no indentation, thus the opening `{` would not line up with the closing
+`}` when this string is printed out.  Further, if the entire block of code is
+indented, then the string contents may be more indented than intended.  (No pun
+or rhyme intended.)  Consider:
+
+    (let ((foo one))
+      (let ((bar two))
+        (let ((json-object '|
+                 {
+                   "key": "value"
+                 }
+               |))
+          (do-whatever))))
+
+The string bound to `json-object` has way more indentation than the programmer
+intended.  Should the parser attempt to solve this issue?
+
+Thankfully, we have the decoder.  The implementation of `#QUOTE` can simply
+apply a post-processing algorithm such as the one used for the text blocks
+feature of Java 15: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
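
To make the idea concrete, here is a simplified Python sketch of such a post-processing step.  It is loosely modeled on the indentation stripping described in JEP 378 and does not reproduce it in full; the name `strip_incidental_indent` is hypothetical, and nothing here is taken from the Zisp decoder.

```python
def strip_incidental_indent(raw):
    """Remove indentation shared by all lines of a multi-line string."""
    lines = raw.split("\n")
    # A blank first line is the newline right after the opening delimiter.
    if lines and lines[0].strip(" \t") == "":
        lines = lines[1:]
    # Non-blank lines count, and so does the last line, where the closing
    # delimiter sits, even when it holds nothing but blanks.
    significant = [ln for ln in lines if ln.strip(" \t")]
    if lines and not lines[-1].strip(" \t"):
        significant.append(lines[-1])
    indent = min(
        (len(ln) - len(ln.lstrip(" \t")) for ln in significant),
        default=0,
    )
    # Cut the shared indentation and any trailing blanks from each line.
    cleaned = [ln[indent:].rstrip(" \t") if ln.strip(" \t") else ""
               for ln in lines]
    return "\n".join(cleaned)


raw = ("\n" + " " * 17 + "{\n"
       + " " * 19 + '"key": "value"\n'
       + " " * 17 + "}\n"
       + " " * 15)
print(strip_incidental_indent(raw))
```

Letting the closing delimiter's line take part in the minimum means the programmer can choose how much indentation to keep by moving the closing delimiter left or right.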
+
+The only feature Zisp cannot offer here is a way to fence off multi-line strings
+with a longer token such as `"""` as seen in Python or Java, or an arbitrary
+word as seen in Bourne shell and PHP "here doc" syntax.  For simplicity, the
+Zisp parser omits such features.
+
+That said, if a programmer truly wanted to have arbitrary text blocks in code,
+without needing to escape anything in them, it's possible to abuse the tilde
+string syntax by using it with an ASCII control character which is displayed
+visibly by a text editor.  In the following, the characters `^\` are meant to
+represent a literal ASCII File Separator character in the source code:
+
+    (define json-object ~^\
+      {
+        "key": "value"
+      }
+      ^\)
+
+Hey, it works fine in Emacs, so why not??  (`C-q C-\` to insert the `^\`.)
 
 <!--
 ;; Local Variables: