# Parser for Data

*For an exact specification of the grammar, see [grammar](grammar/).*

Zisp s-expressions represent an extremely minimal set of data types; only that which is necessary to strategically construct more complex values:

    +--------+-----------------+--------+----------+------+
    | TYPE   | String          | Rune   | Pair     | Nil  |
    +--------+-----------------+--------+----------+------+
    | E.G.   | foobar          | #name  | (X & Y)  | ()   |
    |        | |foo bar|       |        |          |      |
    |        | "foo bar"       |        |          |      |
    |        | @_foo bar_      |        |          |      |
    +--------+-----------------+--------+----------+------+

Datum comments and line comments are supported:

* A semicolon followed by a tilde instructs the parser to consume one datum and discard it. Whitespace may appear between the tilde and the datum to discard.

* A semicolon, followed by a non-tilde byte, instructs the parser to consume and discard bytes until a newline (ASCII Line Feed) is encountered.

The parser can also output non-negative integers, but this is only used for datum labels; number literals are handled by the decoder instead; see below.

## Overview

This section explains a few core concepts and features related to the parser.

### Value vs. Datum

A Zisp *value* that has an *external representation* in the form of a sequence of bytes is called a *datum*.

Every datum is a value, but not all values are data. A datum is a value that can be printed out as a byte sequence which the parser can recognize and turn back into an equivalent datum.

One may speak of an *external representation of a value* where the value is not itself a datum, but can be encoded as a datum. The more strictly correct term for this is: "The external representation of a datum encoding the value."

### Syntax sugar

The parser recognizes various "syntax sugar" and transforms it into uses of the above listed primitive data types. As an example, the expression `#(x y z)` is parsed into the structure `(#HASH x y z)`.
These are two completely equivalent external representations for the same compound datum; after parsing, both byte sequences will yield data values that are indistinguishable in all but their memory address.

The most ubiquitously used syntax sugar is the list, which stands for a chain of pairs, terminated with nil:

    (x y z) -> (x & (y & (z & ())))

The full syntax sugar table is listed and explained further below.

### Decoder

*The decoder has nothing to do with the concept of text or character encoding.*

A separate process called *decoding* can transform Zisp data into values of more complex types, including values that are not of a datum type.

For example, the datum `(#HASH x y z)` could be decoded into an array, so the expression `#(x y z)` could work like in Scheme.

Decoding also resolves datum labels, scans bare strings for ones that represent a number literal, and takes care of a number of other transforms. This offloads complexity, allowing the parser to remain extremely simple.

See the dedicated documentation of the [decoder](2-decode.html) for more.

### Character encoding

The parser does not consume characters; it consumes bytes. The grammar is generally built from bytes corresponding to ASCII characters.

Some elements of the grammar, such as comments and quoted strings, may contain arbitrary byte sequences until terminated. These sequences may happen to be valid UTF-8 text. This way, quoted strings and comments may contain Unicode text encoded in UTF-8, but the parser does not check these for validity.

Since comments and quoted strings may contain arbitrary byte sequences, a text editor or other program displaying Zisp s-expressions may need to use a special visual representation for bytes that don't represent valid text.
The parser being based on bytes rather than characters is not a limitation but rather a feature: It allows for Zisp s-expressions to be used as a structured data exchange format that may contain binary data elements without the need to encode these in Base64 or other such text representations of binary data. Consider the example:

    ((image.webp "<< binary data >>")
     (video.webm "<< binary data >>"))

For this to work, any incidental occurrences of the double-quote sign and the backslash sign within the binary data must be escaped with a backslash; all other bytes can appear verbatim in the strings.

### Stream parsing

The parser can be repeatedly invoked on a byte stream to consume the next datum within. This does not require "unreading" or back-seeking within the stream; the parser always reads a full datum, and stops after some byte which cleanly terminates the currently parsed datum.

This means Zisp s-expressions can be safely intermixed with other data within the same byte stream. So long as the other data is consumed by some parser which similarly stops reading at a clear boundary, the Zisp parser can then continue operating on the same stream. Consider the example:

    ("image.webp" 8273)
    << 8273 bytes >>
    ("video.webm" 736)
    << 736 bytes >>

The "header" for each file in this stream is a Zisp s-expression containing information about how many bytes should be read after the header, before the next file header appears.

(The header data need to be terminated with a blank ASCII character such as a newline. The reason why the closing parenthesis does not act as a terminator unto itself will become apparent later.)

### Datum labels

Valid data cannot be cyclic, since that would mean it has infinite length in bytes. To externally represent a value with cyclic structure, one uses datum labels in the data encoding of the value.

A datum label either wraps another datum to assign a number to it, or contains just a reference to a previous assignment.

    +----------------------------------+---------------------------------+
    | Internal structure               | External representation         |
    +----------------------------------+---------------------------------+
    | (#LABEL & (<NUM> & <DATUM>))     | #%<HEX>=<DATUM>                 |
    +----------------------------------+---------------------------------+
    | (#LABEL & <NUM>)                 | #%<HEX>%                        |
    +----------------------------------+---------------------------------+

In this visual, the token `<NUM>` stands for an actual number value that doesn't have its own external representation. It's printed as a sequence of hexadecimal digits, denoted by `<HEX>` in the external representation. For clarity, concrete examples follow:

    #%1234abcd=(foo bar)  ->  (#LABEL & (<0x1234abcd> & (foo bar)))
    #%1234abcd%           ->  (#LABEL & <0x1234abcd>)

Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type with an integer value.

Datum labels may look like "syntax sugar" but the fact that integers don't have a direct external representation means that datum labels are a fundamental type of syntax that has no "desugared" equivalent in external representation. The decoder will not accept a bare string encoding of an integer here.

## Data types

Following is an explanation of the four core data types constructed by the Zisp s-expression parser. A Zisp value that is a member of one of these types is also called a *datum* if it adheres to additional constraints as explained for each type.

### String

Strings can appear "bare" or be quoted in various ways. A string, as a stand-alone Zisp value, is only a valid datum if it can be represented as a bare string.
If it contains bytes that prevent the bare representation, then the string must be wrapped in one of the following structures to become a valid datum, each of which has its own external representation:

    +-------------------------------+-------------------------------+
    | Internal structure            | External representation       |
    +-------------------------------+-------------------------------+
    | (#PQSTR & <string>)           | |contents|                    |
    +-------------------------------+-------------------------------+
    | (#DQSTR & <string>)           | "contents"                    |
    +-------------------------------+-------------------------------+
    | (#ATSTR & <string>)           | @_contents_                   |
    +-------------------------------+-------------------------------+

The visual token `<string>` is meant to denote the actual string, as a Zisp value, occupying the second position in the pair. It is not actual syntax.

Note that, while conceptually similar, this internal encoding of string data is not syntax sugar, since the internal datum representation using runes cannot be printed out verbatim, due to the attached string being impossible to represent externally without quotation. As such, quoted strings are fundamental syntax.

These external representations of strings will be explained in more detail further below, including backslash escape sequences allowed within.

Strings have a fixed length, counted in bytes. Each byte can have any value, including zero (aka ASCII NULL). The parser reads bytes, not characters, and has no concept of a character encoding, which means that a string can contain UTF-8 byte sequences, but these are not tested for validity.

A string that is up to 64 bytes long is automatically *interned*, meaning any occurrence of the same string -- equal in length and containing the same byte values -- ends up being represented by the same bit-pattern; either a memory address, or an immediate representation within a CPU word for short strings.
Strings with a length greater than 64 bytes end up being represented by a distinct memory address, even if they are equal in length and content.

### Rune

A rune is represented by an ASCII character sequence of 1 to 6 bytes that must begin with a letter and may only contain letters and digits. This character sequence of letters and digits is called the *name* of the rune. A rune that follows this constraint is valid as a datum.

Zisp code may explicitly construct values of the rune type that violate the above constraints. Such runes are not valid data and cannot be printed or parsed in any way.

Runes are case-sensitive, and the parser always emits runes using upper-case letters when expressing syntax sugar. Uppercase rune names are reserved for Zisp's internal use and standard library; users can use lowercase runes with custom meaning without worrying about clashes, with the exception of a small number of lowercase runes such as `#true` and `#false` that are part of the default decoder settings.

Runes are always stored directly in a CPU word; never by memory address.

### Pair

A pair is a tuple of two values: the first value and the second value. The parser allocates a unique two-word cell in the process heap for every pair, and represents that pair through the memory address of that cell.

Pairs are valid as a datum if one of the following holds true for the pair:

* It encodes one of the quoted string variants.

* It encodes a datum label (assignment or reference).

* Both the first and the second value in the pair are themselves valid data.

An additional constraint is that a hierarchy of pairs containing pairs must not form cycles; if it does, the cycles must be broken up by use of datum labels, or else none of the pairs within the cyclic structure is a valid datum.

### Nil

The Zisp nil value is a singleton and a datum. There is exactly one nil value and it is used to terminate a chain of pairs representing a list of values.
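
To make the relationship between pairs, nil, and lists concrete, here is a minimal Python sketch. It is only an illustration, not how Zisp itself is implemented: the names `Nil`, `NIL`, `Pair`, `list_to_pairs`, and `show` are hypothetical, and plain Python strings stand in for bare Zisp strings.

```python
from dataclasses import dataclass
from typing import Any


class Nil:
    """Hypothetical stand-in for the Zisp nil value; there is exactly one."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


NIL = Nil()


@dataclass(eq=False)  # compare by identity, like distinct heap cells
class Pair:
    first: Any
    second: Any


def list_to_pairs(items):
    """Build the nil-terminated pair chain that list notation stands for."""
    chain = NIL
    for item in reversed(items):
        chain = Pair(item, chain)
    return chain


def show(value):
    """Print a value using explicit pair notation only."""
    if value is NIL:
        return "()"
    if isinstance(value, Pair):
        return f"({show(value.first)} & {show(value.second)})"
    return str(value)


print(show(list_to_pairs(["x", "y", "z"])))  # (x & (y & (z & ())))
```

Note that two pairs built from the same contents are still distinct objects here, which mirrors the statement that every pair gets its own heap cell.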
## Quoted strings

Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted. This section goes into the details of each variant.

### Pipe-quoted

Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers the parser to generate a pair with the structure:

    (#PQSTR & <string>)  ;; <string> is visual aid, not syntax

The decoder, using default settings, would emit this string verbatim as a value. Then, during code evaluation, this would be seen as an identifier. In this way, pipe-quoted strings are equivalent to bare strings in functionality.

It is important to understand that the decoder sits between the parser and the [evaluator](3-execute.html), and contrary to Lisp and Scheme tradition, it is common for the evaluator to receive values that are not valid as a datum; in this case, a string unto itself that may not be a valid datum because it cannot be represented as a bare string. Yet, it is valid as an identifier for the purposes of the evaluator, since it is a string *value* like any other.

### Double-quoted

Strings wrapped in the double-quote symbol parse into:

    (#DQSTR & <string>)  ;; <string> is visual aid, not syntax

Under default settings, the decoder would transform this into a value which, when evaluated, yields back the string as a value. Typically, this would be achieved by simply transforming it into `(#QUOTE & <string>)`.

(Note that, unlike `(#PQSTR & <string>)`, this would not be decoded into a string unto itself, as that would make the evaluator see it as an identifier.)

### At-quoted strings AKA raw strings

There is a special type of syntax for "raw" strings, meaning that no backslash escapes nor any other kind of escape sequence are recognized within them.

This raw string syntax begins with an at sign, followed by any byte. That byte becomes the termination marker, and the string cannot contain an occurrence of it, since there are no escape sequences.
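
The rule just described (at sign, then an arbitrary delimiter byte, then raw contents up to the next occurrence of that byte) can be sketched in Python as follows. This is an illustrative sketch only, not the actual Zisp parser: the function name `parse_at_quoted` and its return shape are assumptions, and it returns the raw contents rather than building the `#ATSTR` pair.

```python
def parse_at_quoted(data: bytes, pos: int = 0):
    """Read one at-quoted string starting at data[pos].

    Returns (contents, next_pos), where next_pos is the index just
    past the closing delimiter byte.
    """
    if data[pos:pos + 1] != b"@":
        raise ValueError("expected an at sign")
    if pos + 1 >= len(data):
        raise ValueError("missing delimiter byte after the at sign")
    delimiter = data[pos + 1:pos + 2]  # any byte value at all, even NUL
    start = pos + 2
    end = data.find(delimiter, start)  # no escapes: first match terminates
    if end == -1:
        raise ValueError("unterminated at-quoted string")
    return data[start:end], end + 1


contents, next_pos = parse_at_quoted(b'@"foo \\ bar"')
print(contents)  # b'foo \\ bar'
```

Because no escape processing takes place, the scan is a single search for the delimiter byte, and the backslash in the contents comes through untouched.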
    @"foo \ bar" -> (#ATSTR & <string>)

In the above, the visual token `<string>` is not part of datum syntax but a stand-in for the actual string value, which is, literally: `foo \ bar`

This style of quoting can be useful, for instance, when representing regular expressions as strings in code:

    @/^foo\\(bar|baz)\.\[".*"\]$/  ;; matches e.g. foo\bar.["blah"]

Were it not for this syntax, this regular expression could only be represented through a quoted string such as the following:

    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"  ;; many backslashes

Alternatively, imagine searching for certain MS Windows file paths:

    @_C:\\\\Users\\([a-z]+)_  ;; matches C:\\Users\foo

That's already ugly. Without raw strings, it would need to look even worse:

    "C:\\\\\\\\Users\\\\([a-z]+)"  ;; MANY backslashes

The byte that follows the at sign need not be a printable character or even a valid ASCII byte; it can be absolutely any byte value, even NULL. This can be useful to easily encode binary data which is known to not contain a specific byte; an example would be C strings which cannot contain NULL.

### Backslash escape sequences in strings

The following backslash escapes are supported in pipe-quoted and double-quoted strings. (Some rows use Regular Expression notation.)

    +-----------------------------------+------------------------------+
    | Character(s) following backslash  | Meaning                      |
    +-----------------------------------+------------------------------+
    | \                                 | Literal backslash            |
    +-----------------------------------+------------------------------+
    | |                                 | Literal pipe symbol          |
    +-----------------------------------+------------------------------+
    | "                                 | Literal double-quote         |
    +-----------------------------------+------------------------------+
    | RE: /[\t ]*\n[\t ]*/              | Discarded                    |
    +-----------------------------------+------------------------------+
    | 0                                 | ASCII NULL                   |
    +-----------------------------------+------------------------------+
    | a                                 | ASCII Alert                  |
    +-----------------------------------+------------------------------+
    | b                                 | ASCII Backspace              |
    +-----------------------------------+------------------------------+
    | t                                 | ASCII Tab (Horizontal)       |
    +-----------------------------------+------------------------------+
    | n                                 | ASCII Newline (Line Feed)    |
    +-----------------------------------+------------------------------+
    | v                                 | ASCII Vertical Tab           |
    +-----------------------------------+------------------------------+
    | f                                 | ASCII Form Feed              |
    +-----------------------------------+------------------------------+
    | r                                 | ASCII Carriage Return        |
    +-----------------------------------+------------------------------+
    | e                                 | ASCII Escape                 |
    +-----------------------------------+------------------------------+
    | RE: /x([0-9a-fA-F]{2})+;/         | Arbitrary bytes in hex       |
    +-----------------------------------+------------------------------+
    | RE: /u[0-9a-fA-F]+;/              | Unicode scalar as UTF-8      |
    +-----------------------------------+------------------------------+

To clarify:

* A backslash followed by a backslash, pipe, or double-quote character is substituted with a literal occurrence of the corresponding character.

* A backslash followed by any number of blanks (space or tab), a newline, and again any number of blanks, is substituted with nothing.
  This is to allow splitting a string into multiple lines for human readability.

      (define paragraph
        "This paragraph has been visually split into multiple \
         lines, but the newline is escaped, so it's one line.")

* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the C programming language, representing common unprintable ASCII bytes (`\e` being a widespread compiler extension rather than standard C).

* An x, followed by pairs of hexadecimal digits (case insensitive), terminated by a semicolon, is substituted with the sequence of bytes represented by the corresponding pairs of hexadecimal digits. E.g.: `"foo\xDEADBEEF;bar"`

* A u, followed by a hexadecimal digit sequence (case insensitive), terminated by a semicolon, is substituted with the canonical UTF-8 byte sequence for the Unicode Scalar Value represented by that hexadecimal number. The number must be in the range `0` to `10FFFF`. E.g.: `"foo\u00A0;bar"`

### Newlines in strings

Normally, a newline in a string has no special meaning and simply becomes part of the string. However, newlines can be backslash-escaped, which simply erases them; the escaped newline can also be preceded or followed by any number of tab and space characters, which are all stripped as well.

(Note: It's not blanks preceding the backslash that are stripped, but blanks following the backslash and preceding the newline; i.e., blanks at the end of the line.)

Following are some examples of how multi-line strings can appear in source code with different intentions and meanings:

    (define paragraph
      "This paragraph has been visually split into multiple \
       lines, but the newlines are escaped, so it's one line.")

    ;; use '|| so double-quotes need no escaping
    (define json-object '|
      {
        "key": "value"
      }
    |)

The second example is actually slightly problematic. It begins with a newline, which may be undesirable, but escaping that newline would cause the first line to have no indentation, thus the opening `{` would not line up with the closing `}` when this string is printed out.
Further, if the entire block of code is indented, then the string contents may be more indented than intended. (No pun or rhyme intended.) Consider:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object '|
                {
                  "key": "value"
                }
               |))
          (do-whatever))))

The string bound to `json-object` has redundant indentation. Should the parser attempt to solve this issue?

Thankfully, we have the decoder to handle such complexities. Under the default settings, the rune `#HASH` is bound to a decoder rule which detects a payload value that is a string literal, and implements the same algorithm as seen in Java 15 Text Blocks:

[JEP 378: Text Blocks](https://openjdk.org/jeps/378)

Thus, we can do the following:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object #|
    ........... {
    ...........   "key": "value"
    ........... }
    ...........|))
          (do-whatever))))

(Dots represent whitespace that is deleted. The initial newline is, as well.)

The only feature Zisp does not offer is a way to fence off multi-line strings with a longer token such as `"""` as seen in Python and Java, or an arbitrary word as seen in Bourne shell and PHP "here doc" syntax.

However, if a programmer truly wanted to have arbitrary text blocks in code, without needing to escape anything in them, it's possible to abuse at-quoted string syntax, using it with an ASCII control character which is displayed visibly by a text editor. In the following, the characters `^\` are meant to represent a literal ASCII File Separator character in the source code:

    (define json-object #@^\
    {
      "key": "value"
    }
    ^\)

Hey, it works fine in Emacs, so why not? Use `C-q C-\` to insert the `^\`.

This is indeed quite an eldritch syntax, but hopefully most programs would not need to use it anyway.

## Syntax sugar

The parser recognizes various "syntax sugar" and transforms it into equivalent datum constructions. The most ubiquitous example of this is the list, which is transformed into a chain of pairs, terminated with nil:

    (datum1 datum2 ...)
    -> (datum1 & (datum2 & (... & ())))

This is so ubiquitous as to be hardly considered "syntax sugar" but is counted as such, since any list could just as well be written as a chain of pairs; both would result in an equivalent datum when parsed.

The following table summarizes the other available transformations:

    [...]   -> (#SQUARE ...)        #datum     -> (#HASH & datum)
    {...}   -> (#BRACE ...)         #rune(...) -> (#rune ...)
    'datum  -> (#QUOTE & datum)     dat1dat2   -> (#JOIN dat1 & dat2)
    `datum  -> (#GRAVE & datum)     dat1.dat2  -> (#DOT dat1 & dat2)
    ,datum  -> (#COMMA & datum)     dat1:dat2  -> (#COLON dat1 & dat2)

Notes:

* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis means zero or more data.

* The `#datum` form only applies when the datum following the hash sign is anything other than a bare string, since otherwise this would be ambiguous with a rune literal. A bare string can nevertheless follow the hash sign by separating the two with a backslash:

      #\string -> (#HASH & string)

* Though not represented in the table due to notational difficulty, the form `#rune(...)` doesn't require a list in the second position; any datum that works with the `#datum` syntax also works with `#rune`.

      #rune1#rune2   -> (#rune1 & #rune2)
      #rune\string   -> (#rune & string)
      #rune'string   -> (#rune #QUOTE & string)
      #rune"string"  -> (#rune #DQSTR & |string|)

  As a counter-example, following a rune immediately with a bare string isn't possible without the delimiting backslash, since that would be ambiguous:

      #abcdefgh  ;Could be (#abcdef & gh) or (#abcde & fgh) or ...

* Syntax sugar can combine arbitrarily. Some examples follow. Any of these may or may not actually have a meaning in code; many could simply end up producing an error during decoding, or later evaluation of code.

      #{...}           -> (#HASH #BRACE ...)
      #'foo            -> (#HASH #QUOTE & foo)
      ##'[...]         -> (#HASH #HASH #QUOTE #SQUARE ...)
      {x y}[i j]       -> (#JOIN (#BRACE x y) #SQUARE i j)
      foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)

* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as `(#QUOTE & foo)`; a single pair with the quoted datum in the second position. The same principle is used when parsing other sugar; some examples follow:

      Incorrect                            Correct

      #(x y z) -> (#HASH (x y z))          #(x y z) -> (#HASH x y z)
      [x y z]  -> (#SQUARE (x y z))        [x y z]  -> (#SQUARE x y z)
      #{x}     -> (#HASH (#BRACE (x)))     #{x}     -> (#HASH #BRACE x)
      foo(x y) -> (#JOIN foo (x y))        foo(x y) -> (#JOIN foo x y)

* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts further decoding of enclosed data. This is not so, since quoting is related to code evaluation, not decoding.
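
The difference between the "Incorrect" and "Correct" columns above follows directly from how pairs print. The Python sketch below illustrates this under stated assumptions: pairs are modelled as 2-tuples, nil as the empty tuple, runes and bare strings as plain Python strings, and the helper names (`pair`, `from_list`, `prefix_sugar`, `join_sugar`, `show`) are hypothetical rather than part of Zisp.

```python
NIL = ()  # stand-in for Zisp nil


def pair(first, second):
    return (first, second)


def from_list(*items):
    """List notation: a chain of pairs terminated with nil."""
    chain = NIL
    for item in reversed(items):
        chain = pair(item, chain)
    return chain


def prefix_sugar(rune, datum):
    """Prefix sugar such as 'datum or #datum: one pair, datum in second position."""
    return pair(rune, datum)


def join_sugar(dat1, dat2):
    """dat1dat2 -> (#JOIN dat1 & dat2)"""
    return pair("#JOIN", pair(dat1, dat2))


def show(value):
    """Print using list notation wherever a pair chain allows it."""
    if value == NIL:
        return "()"
    if not isinstance(value, tuple):
        return str(value)
    parts = []
    while isinstance(value, tuple) and value != NIL:
        parts.append(show(value[0]))
        value = value[1]
    if value != NIL:  # improper tail: fall back to explicit pair notation
        parts.append("& " + show(value))
    return "(" + " ".join(parts) + ")"


print(show(prefix_sugar("#QUOTE", "foo")))                   # (#QUOTE & foo)
print(show(prefix_sugar("#HASH", from_list("x", "y", "z")))) # (#HASH x y z)
print(show(from_list("#HASH", from_list("x", "y", "z"))))    # (#HASH (x y z))
```

The last line shows the Lisp-style two-element list for contrast; it is a different structure from the single pair that the Zisp sugar produces.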