# Parser for Code & Data

*For an exact specification of the grammar, see [grammar](grammar/).*

Zisp S-Expressions represent an extremely minimal set of data types;
only that which is necessary to strategically construct more complex
code and data:

    +--------+-----------------+--------+----------+------+
    | TYPE   | String          | Rune   | Pair     | Nil  |
    +--------+-----------------+--------+----------+------+
    | E.G.   | foo, |foo bar|  | #name  | (X & Y)  | ()   |
    +--------+-----------------+--------+----------+------+

The parser can also output non-negative integers, but this is only
used for datum labels; number literals are handled by the *decoder*
instead.

## Decoder

A separate process called *decoding* can transform such data into more
complex types. For example, `(#HASH x y z)` could be decoded into an
array, so the expression `#(x y z)` could work like in Scheme; or
`(#SQUARE x y z)` could be decoded into a function call expression
that will, at run-time, allocate and initialize a dynamic array with
three elements, so the expression `[x y z]` would work like in
JavaScript.

Decoding also resolves datum labels, goes over strings to find ones
that are actually a number literal, and takes care of a number of
other transformations. This offloads complexity, allowing the parser
to remain extremely simple.

See the dedicated documentation of the decoder for more.

## Syntax sugar

The parser recognizes various "syntax sugar" and transforms it into
uses of the above listed minimal data types. The most ubiquitous
example is the list:

    (datum1 datum2 ...) -> (datum1 & (datum2 & (... & ())))

The following table summarizes the other transformations available:

    "xyz"   -> (#QUOTE & |xyz|)      #datum      -> (#HASH & datum)
    ~_xyz_  -> (#TILDE & |xyz|)      #rune(...)  -> (#rune ...)
    [...]   -> (#SQUARE ...)         dat1dat2    -> (#JOIN dat1 & dat2)
    {...}   -> (#BRACE ...)          dat1.dat2   -> (#DOT dat1 & dat2)
    'datum  -> (#QUOTE & datum)      dat1:dat2   -> (#COLON dat1 & dat2)
    `datum  -> (#GRAVE & datum)      #%hex=datum -> (#LABEL hex & datum)
    ,datum  -> (#COMMA & datum)      #%hex%      -> (#LABEL & hex)

Notes about the table and examples:

* The terms datum, dat1, and dat2 each refer to an arbitrary datum;
  ellipsis means zero or more data; hex is a hexadecimal number of up
  to 12 digits.

* Strings can be quoted with pipes, like symbols in Scheme. This is
  the "real" string literal syntax, whereas using double quotes is
  syntax sugar for a quoted string literal.

      |foo bar baz| -> |foo bar baz|
      "foo bar baz" -> (#QUOTE & |foo bar baz|)

* See the next section for an explanation of the tilde syntax, which
  implements "raw" string literals.

* The `#datum` form only applies when the datum following the hash
  sign is anything other than a bare string (unquoted, without pipe
  symbol) since otherwise this would be ambiguous with a rune literal.
  A bare string can nevertheless follow the hash sign by separating
  the two with a backslash:

      #\string -> (#HASH & string)

* Though not represented in the table due to notational difficulty,
  the form `#rune(...)` doesn't require a list in the second position;
  any datum that works with the `#datum` syntax also works with
  `#rune`.

      #rune1#rune2 -> (#rune1 & #rune2)
      #rune"text"  -> (#rune & "text")
      #rune\string -> (#rune & string)
      #rune'string -> (#rune #QUOTE & string)

  As a counter-example, following a rune immediately with a bare
  string isn't possible without the delimiting backslash, since that
  would be ambiguous:

      #abcdefgh ;Could be (#abcdef & gh) or (#abcde & fgh) or ...

* Syntax sugar can combine arbitrarily. Some examples follow. Any of
  these may or may not actually have a meaning in code; many could
  simply end up producing an error during decoding, or later
  interpretation of code.

      #{...}           -> (#HASH #BRACE ...)
      #'foo            -> (#HASH #QUOTE & foo)
      ##'[...]         -> (#HASH #HASH #QUOTE #SQUARE ...)
      {x y}[i j]       -> (#JOIN (#BRACE x y) #SQUARE i j)
      foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)

* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it
  parses as `(#QUOTE & foo)` instead; the operand of `#QUOTE` is the
  entire cdr. The same principle is used when parsing other sugar;
  some examples follow:

      Incorrect                          Correct

      #(x y z) -> (#HASH (x y z))        #(x y z) -> (#HASH x y z)
      [x y z]  -> (#SQUARE (x y z))      [x y z]  -> (#SQUARE x y z)
      #{x}     -> (#HASH (#BRACE (x)))   #{x}     -> (#HASH #BRACE x)
      foo(x y) -> (#JOIN foo (x y))      foo(x y) -> (#JOIN foo x y)

* Runes are case-sensitive, and the parser always emits runes using
  upper-case letters when expressing syntax sugar. Uppercase rune
  names are reserved for Zisp's internal use and standard library;
  users can use lowercase runes with custom meaning without worrying
  about clashes, with the exception of a small number of lowercase
  runes such as `#true` and `#false` that are part of the default
  decoder settings.

## Tilde strings

There is a special type of syntax sugar for "raw" strings, meaning
that neither backslash escapes nor any other kind of escape sequence
is recognized.

This raw string syntax begins with a tilde, followed by any byte. That
byte becomes the termination marker, and the string cannot represent a
literal occurrence of it, since there are no escape sequences.

    ~%foo \ bar% -> (#TILDE & |foo \\ bar|)

This can be useful, for instance, when representing regular
expressions as quoted string literals in code:

    ~/^foo\\(bar|baz)\.\[".*"\]$/  ;; matches e.g. foo\bar.["blah"]

Were it not for this syntax, this regular expression would need to be
represented by the following quoted string literal in Zisp code:

    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"

Alternatively, imagine searching for certain MS Windows file paths:

    ~_C:\\\\User\\foo_  ;; matches C:\\User\foo

That's already ugly.
Without raw strings, it would need to look like this:

    "C:\\\\\\\\User\\\\foo"

Typically, the rune `#TILDE` would be treated as a synonym of `#QUOTE`
by the decoder, though creative programmers could repurpose it.

## Newlines in strings

Normally, a newline in a string has no special meaning and simply
becomes part of the string. However, newlines can be
backslash-escaped, which simply erases them; the escaped newline can
also be preceded or followed by any number of tab and space
characters, which are all stripped as well. (Note: It's not blanks
preceding the backslash that are stripped, but blanks following the
backslash and preceding the newline; i.e., blanks at the end of the
line.)

Following are some examples of how multi-line strings can appear in
source code with different intentions and meanings:

    (define paragraph
      "This paragraph has been visually split into multiple \
       lines, but the newlines are escaped, so it's one line.")

    (define json-object '|   ;; use '|| so we needn't escape "key" etc.
      {
        "key": "value"
      }
    |)

The second example is actually slightly problematic. It begins with a
newline, which may be undesirable, but escaping that newline would
cause the first line to have no indentation, thus the opening `{`
would not line up with the closing `}` when this string is printed
out.

Further, if the entire block of code is indented, then the string
contents may be more indented than intended. (No pun or rhyme
intended.) Consider:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object '|
                {
                  "key": "value"
                }
              |))
          (do-whatever))))

The string bound to `json-object` has way more indentation than the
programmer intended. Should the parser attempt to solve this issue?
Thankfully, we have the decoder.
The implementation of `#QUOTE` can simply implement a post-processing
algorithm such as the one used by the Java 15 text blocks feature:

[JEP 378: Text Blocks](https://openjdk.org/jeps/378)

The only feature Zisp cannot offer here is a way to fence off
multi-line strings with a longer token such as `"""` as seen in Python
or Java, or an arbitrary word as seen in Bourne shell and PHP "here
doc" syntax. For simplicity, the Zisp parser omits such features.

That said, if a programmer truly wanted to have arbitrary text blocks
in code, without needing to escape anything in them, it's possible to
abuse the tilde string syntax by using it with an ASCII control
character which is displayed visibly by a text editor. In the
following, the characters `^\` are meant to represent a literal ASCII
File Separator character in the source code:

    (define json-object ~^\
    {
      "key": "value"
    }
    ^\)

Hey, it works fine in Emacs, so why not?? (`C-q C-\` to insert the
`^\`.)
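
To make the indentation post-processing idea for `#QUOTE` a little
more concrete, here is a minimal sketch in Python. It is only loosely
modeled on the indentation-stripping step of JEP 378 and is not a
faithful port of it, nor is it taken from the Zisp decoder. The
function name `strip_incidental_indent`, the choice to drop a blank
first line, and the treatment of a tab as a single column are all
assumptions made for this illustration.

```python
def strip_incidental_indent(text: str) -> str:
    """Remove indentation shared by all significant lines of a string.

    Hypothetical helper, not part of Zisp. Only spaces and tabs count
    as indentation, and a tab counts as one column (an assumption).
    """
    lines = text.split("\n")

    # Assumption: a blank first line is just the newline that follows
    # the opening delimiter, so it is dropped.
    if len(lines) > 1 and not lines[0].strip(" \t"):
        lines = lines[1:]

    # Significant lines are the non-blank ones, plus the last line even
    # when it is blank, since it shows where the closing delimiter sits.
    significant = [ln for ln in lines[:-1] if ln.strip(" \t")]
    significant.append(lines[-1])

    def indent_of(line: str) -> int:
        return len(line) - len(line.lstrip(" \t"))

    common = min(indent_of(ln) for ln in significant)

    # Blank lines become empty; other lines lose the common prefix and
    # any trailing blanks.
    stripped = [
        ln[common:].rstrip(" \t") if ln.strip(" \t") else ""
        for ln in lines
    ]
    return "\n".join(stripped)


# The closing delimiter is aligned with the braces here, so the JSON
# object ends up starting at column 0.
example = '\n        {\n          "key": "value"\n        }\n        '
print(strip_incidental_indent(example))
```

In this sketch the position of the closing delimiter controls how much
indentation survives: placing the closing `|` further left than the
content keeps part of the content's indentation, while aligning it
with the content removes all of it.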