# Parser for Code & Data

Zisp s-expressions represent an extremely minimal set of data types;
only that which is necessary to strategically construct more complex
values:

    +-------+---------+--------+----------+------+
    | TYPE  | String  | Rune   | Pair     | Nil  |
    +-------+---------+--------+----------+------+
    | E.G.  | foobar  | #name  | (X & Y)  | ()   |
    +-------+---------+--------+----------+------+

The parser also recognizes various *syntax sugar* which typically
results in a pair beginning with a specific rune. A separate
component called the *decoder* transforms such data into a rich set
of value types.

## Character Encoding

The parser does not consume Unicode characters; it consumes bytes.
Grammar is generally constructed by bytes corresponding to ASCII
characters.

Some elements of the grammar, such as comments and quoted strings,
may contain arbitrary byte sequences, until terminated. These
sequences may happen to be valid UTF-8 text. This way, quoted strings
and comments may contain Unicode text encoded in UTF-8, but the parser
does not check these for validity.

Since comments and quoted strings may contain arbitrary byte
sequences, a text editor or other program displaying Zisp
s-expressions may need to use a special visual representation for
bytes that don't represent valid text.

The parser working on bytes rather than Unicode characters is not a
limitation, but rather a feature: It allows Zisp s-expressions to be
used as a structured data exchange format, which may contain binary
data elements, without the need to encode these in Base64 or other
such text representations of binary data. Consider the example:

    ((image.webp "<bytes>") (video.webm "<bytes>"))

All that needs to be done for this to work, is that any incidental
occurrences of the double-quote sign, and the backslash sign, are
escaped with a backslash within the `<bytes>` data; all other bytes
can appear verbatim in the strings.
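The escaping rule for binary payloads can be sketched in a few lines. The following Python snippet is only an illustration of the rule as described above, not part of Zisp; the function name `quote_bytes` and the sample payload are hypothetical.

```python
DQUOTE = 0x22     # ASCII double-quote
BACKSLASH = 0x5C  # ASCII backslash


def quote_bytes(data: bytes) -> bytes:
    """Wrap raw bytes in double quotes for embedding in an s-expression.

    Only double-quote and backslash bytes are escaped (hypothetical
    helper); every other byte value, including NUL and bytes that are
    not valid UTF-8, is copied through verbatim.
    """
    out = bytearray([DQUOTE])
    for byte in data:
        if byte in (DQUOTE, BACKSLASH):
            out.append(BACKSLASH)  # prefix the byte with a backslash
        out.append(byte)
    out.append(DQUOTE)
    return bytes(out)


# A made-up payload containing NUL, a non-UTF-8 byte, both bytes that
# need escaping, and a newline.
payload = bytes([0x00, 0xFF, DQUOTE, BACKSLASH, 0x0A])
quoted = quote_bytes(payload)
```

The output is at most twice the size of the input plus two bytes, and for typical binary data it is far smaller than a Base64 encoding.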
## Stream Parsing

The parser can be repeatedly invoked on a byte stream to consume the
next datum within. This does not require "unreading" or back-seeking
within the stream; the parser always reads a full datum, and stops
after some byte which cleanly terminates the currently parsed datum.

This means Zisp s-expressions can be safely intermixed with other data
within the same byte stream. So long as the other data is consumed by
some parser which similarly stops reading at a clear boundary, the
Zisp parser can then continue operating on the same stream. Consider
the example:

    ("image.webp" 8273)
    << 8273 bytes >>
    ("video.webm" 736)
    << 736 bytes >>

The "header" for each file in this stream is a Zisp s-expression
containing information about how many bytes should be read after the
header, before the next file header appears.

(The header data need to be terminated with a blank ASCII character
such as a newline; the closing parenthesis does not act as a
terminator unto itself due to the "join" syntax sugar.)

To enable this stream parsing strategy, the parser does not use any
automatic buffering. If it did, it might inadvertently consume some
bytes beyond the currently parsed datum, leaving the stream
inconsistent.

If the parser is meant to be used on an input stream associated with
expensive system calls, such as a file handle or network socket, it's
best to wrap that stream in some intermediate object which asks the
system for large chunks of data at once, and stores the data in a
buffer.

## Comments

Two types of comment are supported: datum comments and line comments.

* A semicolon followed by a tilde instructs the parser to consume one
  datum and discard it. Whitespace may appear between the tilde and
  the datum to discard.

* A semicolon, followed by a non-tilde byte, instructs the parser to
  consume and discard bytes until a newline (ASCII Line Feed) is
  encountered.

## Value vs. Datum

A Zisp *value* that has an *external representation* in the form of a
sequence of bytes is called a *datum*. Every datum is a value, but
not every value is a datum.

In other words, a datum is a value that can be printed out as a byte
sequence which the parser can turn back into an equivalent datum.

A value that is not a datum may nevertheless be *encoded* into one,
allowing it to have an external representation. After parsing, it
needs to be *decoded*.

One may speak of an *external representation of a value* where the
value is not itself a datum, but can be encoded as one. The more
strictly correct term for this is: "The external representation of a
datum that encodes the value."

### Syntax sugar

The parser recognizes various *syntax sugar* to abbreviate an
equivalent datum construction, or express a datum that encodes a more
complex value.

As an example, the expression `#(x y z)` is an abbreviation for the
equivalent `(#HASH x y z)`. These are two external representations
for the same datum; after parsing, both will yield values that are
indistinguishable in all but their memory address.

The most ubiquitous syntax sugar is the list, which abbreviates a
sequence of tail-linked pairs, terminated with a special nil value
represented as `()`:

    (x)          -> (x & ())
    (x y)        -> (x & (y & ()))
    (x y z)      -> (x & (y & (z & ())))

There are also so-called *improper lists* which are chains of pairs
that end in a value other than nil:

    (x y & z)    -> (x & (y & z))
    (x y z & t)  -> (x & (y & (z & t)))

An example of "syntax sugar" that is not a mere abbreviation is a
quoted string which contains bytes that could not appear in a *bare*
string:

    "foo bar"    -> (#DQSTR & <string>)

In this example, the visual token `<string>` represents the actual
string value in program memory, which has no direct external
representation in bytes because it contains a space character.
Those familiar with Lisp and Scheme may expect bare strings to be
parsed into a separate type called *symbol* while quoted strings are
parsed directly into a string type, but this is not the case in Zisp.

### Decoder

The *decoder* transforms Zisp data into values of more complex types,
including values that are not of a datum type. Combined with syntax
sugar, this allows Zisp to offer familiar syntax elements.

For example, the expression `#(x y z)` which parses into
`(#HASH x y z)` can be decoded into an array, so the result is similar
to the vector syntax of Scheme.

Decoding also resolves datum labels, goes over bare strings to find
ones that represent a number literal, and takes care of a number of
other transforms. This offloads complexity, allowing the parser to
remain extremely simple.

See the dedicated documentation of the [decoder](2-decode.html) for
more.

## Data types

Following is a more in-depth explanation of each data type constructed
by the Zisp s-expression parser. These are in fact value types,
though the term "data type" is often used due to familiarity.

A Zisp value that is a member of one of the following value types is
only a *datum* if it adheres to additional constraints as explained
below.

### String

Strings can appear *bare* or be quoted in various ways. A quoted
string is in fact parsed into a pair value with a rune in the first
position to identify the quotation variant that was parsed, and the
string value in the second position.

    +-----------+----------------------+
    | Syntax    | Parse output         |
    +-----------+----------------------+
    | |bytes|   | (#PQSTR & <string>)  |
    +-----------+----------------------+
    | "bytes"   | (#DQSTR & <string>)  |
    +-----------+----------------------+
    | @_bytes_  | (#ATSTR & <string>)  |
    +-----------+----------------------+

The visual token `<string>` denotes the actual string, as a Zisp
value, in the second position of the pair.
These external representations of strings will be explained in more
detail further below, including backslash escape sequences allowed
within.

Strings have a fixed length, counted in bytes. Each byte can have any
value, including zero (ASCII NUL). The parser reads bytes, not
Unicode characters; a string may contain UTF-8 byte sequences, but
these are not tested for validity.

A string that is up to 255 bytes long is automatically *interned*,
meaning any occurrence of the same string -- equal in length and
containing the same byte values -- ends up being represented by the
same bit-pattern; either a memory address, or an immediate
representation within a CPU word for short strings.

The quotation method is inconsequential to this process; for example,
while `|foobar|` and `"foobar"` will parse into different pair values,
the actual string they hold will be the same one in program memory.

Strings of length greater than 255 bytes are stored separately in
memory, even if they are equal in length and content.

### Rune

A rune is represented by an ASCII character sequence of 1 to 6 bytes,
that must begin with a letter, and may only contain letters and
digits. This character sequence of letters and digits is called the
*name* of the rune. A rune that follows this constraint is valid as a
datum.

Zisp code may explicitly construct values of the rune type that
violate the above constraints. Such runes are not valid data and
cannot be printed or parsed.

Runes are case-sensitive, and the parser always emits runes using
upper-case letters when expressing syntax sugar.

Uppercase rune names are reserved for Zisp's internal use and standard
library; users can use lowercase runes with custom meaning without
worrying about clashes, with the exception of a small number of
lowercase runes such as `#true` and `#false` that are part of the
default decoder settings and documented explicitly as such.

Runes are always stored directly in a CPU word; never by memory
address.
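The constraint on rune names is easy to express as a check. The Python sketch below only illustrates the stated rule (1 to 6 ASCII bytes, a leading letter, then letters and digits); the names `RUNE_NAME` and `is_valid_rune_name` are hypothetical and are not taken from the Zisp implementation.

```python
import re

# A letter, followed by up to five more letters or digits.  The pattern
# works on bytes, since the parser consumes bytes rather than characters.
RUNE_NAME = re.compile(rb"[A-Za-z][A-Za-z0-9]{0,5}")


def is_valid_rune_name(name: bytes) -> bool:
    """Return True if `name` could be the name of a rune that is a datum."""
    return RUNE_NAME.fullmatch(name) is not None


# Names used elsewhere in this document, plus a few invalid ones.
examples = [b"HASH", b"DQSTR", b"true", b"abcdefg", b"1abc", b"a-b", b""]
results = {name: is_valid_rune_name(name) for name in examples}
```

Here `abcdefg` fails because it is seven bytes long, `1abc` because it begins with a digit, and `a-b` because it contains a byte that is neither a letter nor a digit.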
### Pair

A pair is a tuple of two values: the first value and the second
value. In Lisp tradition, these are also called the `car` and `cdr`
of the pair, respectively.

The parser allocates a unique two-word cell in program memory for
every pair, and represents that pair through the memory address of the
cell.

Pairs are valid data if one of the following holds true:

* The pair encodes a quoted string, datum label, or shebang line.

* Both the first and the second value in the pair are valid data.

Further, a structure of nested pair values may not contain cyclic
references back up in the structure (which would make the above
definition diverge into infinity). Such cycles must be broken up with
datum labels, or else the pair cannot be considered a datum, since it
cannot be printed or parsed.

### Nil

The Zisp nil value is a singleton and a datum. There is exactly one
nil value and it is used to terminate a chain of pairs representing a
list of values; it has the external representation `()`.

## Quoted strings

Three quoted string types exist: Pipe-quoted, double-quoted, and
at-quoted. This section goes into the details of each variant.

### Pipe-quoted

Strings can be quoted with pipes, like symbols in R7RS Scheme, which
triggers the parser to generate a pair with the structure:

    (#PQSTR & <string>)  ;; <string> is visual aid, not syntax

The decoder, using default settings, would emit this string verbatim
as a value. Then, during code evaluation, this would be seen as an
identifier. In this way, pipe-quoted strings are equivalent to bare
strings in functionality.

It is important to understand that the decoder sits between the parser
and the [evaluator](3-execute.html), and in opposition to Lisp and
Scheme tradition, it is common for the evaluator to receive values
that are not valid as a datum; in this case, a string unto itself that
may not be a valid datum, because it cannot be represented as a bare
string.
Yet, it is valid as an identifier for the purposes of the evaluator,
since it is a string *value* like any other.

### Double-quoted

Strings wrapped in the double-quote symbol parse into:

    (#DQSTR & <string>)  ;; <string> is visual aid, not syntax

Under default settings, the decoder would transform this into a value
which, when evaluated as code, simply yields the contained string as a
value.

### At-quoted

This is a special type of syntax for "raw" strings, meaning that no
backslash escapes nor any other kind of escape sequence are recognized
within them.

The syntax begins with an at sign, followed by any byte. That byte
becomes a termination marker, and the string cannot contain an
occurrence of it, since there are no escape sequences.

    @"foo \ bar"  ->  (#ATSTR <int> & <string>)

In the above, the visual tokens `<int>` and `<string>` represent an
integer value and a string value, respectively. In this example, the
integer value would be 34; the ASCII value for the double-quote sign.
The string value contains a literal backslash, since there is no
backslash escape parsing.

This style of quoting can be useful, for instance, when representing
regular expressions as strings in code:

    ;; Matches e.g. foo\bar.["blah"]
    @/^foo\\(bar|baz)\.\[".*"\]$/

Were it not for this syntax, this regular expression would only be
possible to represent through a quoted string such as the following:

    ;; Same as above, but so many backslashes
    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"

The byte that follows the at sign need not be a printable character or
even a valid ASCII byte; it can be absolutely any byte value, even
NUL. This can be useful to easily encode binary data which is known
to not contain a specific byte; an example would be C strings which
cannot contain NUL.

### Backslash escapes

In pipe-quoted and double-quoted strings, the following ASCII
characters may follow a backslash to insert a certain character.

    +-------+----------------------------+
    | Char  | Meaning                    |
    +-------+----------------------------+
    | \     | Literal backslash          |
    +-------+----------------------------+
    | |     | Literal pipe symbol        |
    +-------+----------------------------+
    | "     | Literal double-quote       |
    +-------+----------------------------+
    | 0     | ASCII NUL                  |
    +-------+----------------------------+
    | a     | ASCII Alert                |
    +-------+----------------------------+
    | b     | ASCII Backspace            |
    +-------+----------------------------+
    | t     | ASCII Tab (Horizontal)     |
    +-------+----------------------------+
    | n     | ASCII Newline (Line Feed)  |
    +-------+----------------------------+
    | v     | ASCII Vertical Tab         |
    +-------+----------------------------+
    | f     | ASCII Form Feed            |
    +-------+----------------------------+
    | r     | ASCII Carriage Return      |
    +-------+----------------------------+
    | e     | ASCII Escape               |
    +-------+----------------------------+

In words:

* A backslash followed by a backslash, pipe, or double-quote character
  is substituted with a literal occurrence of that character.

* The characters 0, a, b, t, n, v, f, r, and e have the same meanings
  as in the C programming language, representing common ASCII control
  characters.

Further, the following Regular Expression patterns following a
backslash have special meaning.

    +---------------------+-----------------------+
    | Regular Expression  | Meaning               |
    +---------------------+-----------------------+
    | [\t ]*\n[\t ]*      | Discarded             |
    +---------------------+-----------------------+
    | x([0-9a-fA-F]{2})*; | Arbitrary bytes       |
    +---------------------+-----------------------+
    | u[0-9a-fA-F]+;      | Unicode Scalar Value  |
    +---------------------+-----------------------+

Explanations:

* A backslash followed by any number of blanks (space or tab), a
  newline, and again any number of blanks, is substituted with
  nothing. This is to allow splitting a string into multiple lines
  for human readability.

      (define p "This paragraph has been visually split into multiple \
                 lines, but the newline is escaped, so it's one line.")

* An x, followed by pairs of hexadecimal digits (case insensitive),
  terminated by a semicolon, is substituted with the sequence of bytes
  represented by the corresponding pairs of hexadecimal digits.

  E.g.: `"foo\xDEADBEEF;bar"`

* A u, followed by a hexadecimal digit sequence (case insensitive),
  terminated by a semicolon, is substituted with the canonical UTF-8
  byte sequence for the Unicode Scalar Value represented by that
  hexadecimal number. The number must be in the range `0` to
  `10FFFF`.

  E.g.: `"foo\u00A0;bar"`

### Newlines in strings

Normally, a newline in a string has no special meaning and simply
becomes part of the string. However, newlines can be
backslash-escaped, which simply erases them; the escaped newline can
also be preceded or followed by any number of tab and space
characters, which are all stripped as well.

(Note: It's not blanks preceding the backslash that are stripped, but
blanks following the backslash and preceding the newline; i.e., blanks
at the end of the line.)

Following are some examples of how multi-line strings can appear in
source code with different intentions and meanings:

    (define paragraph
      "This paragraph has been visually split into multiple \
       lines, but the newlines are escaped, so it's one line.")

    ;; use '|| so double-quotes need no escaping
    (define json-object '|
      {
        "key": "value"
      }
    |)

The second example is actually slightly problematic. It begins with a
newline, which may be undesirable, but escaping that newline would
cause the first line to have no indentation, thus the opening `{`
would not line up with the closing `}` when this string is printed
out.

Further, if the entire block of code is indented, then the string
contents may be more indented than intended. (No pun or rhyme
intended.)
Consider:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object '|
          {
            "key": "value"
          }
          |))
          (do-whatever))))

The string bound to `json-object` has redundant indentation. Should
the parser attempt to solve this issue?

Thankfully, we have the decoder to handle such complexities. Under
the default settings, the rune `#HASH` is bound to a decoder rule
which detects a payload value that is a string literal, and implements
the same algorithm as seen in Java 15 Text Blocks:

[JEP 378: Text Blocks](https://openjdk.org/jeps/378)

Thus, we can do the following:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object #|
    ........... {
    ...........   "key": "value"
    ........... }
    ...........|))
          (do-whatever))))

(Dots represent whitespace that is deleted. The initial newline is,
as well.)

The only feature Zisp does not offer is a way to fence off multi-line
strings with a longer token such as `"""` as seen in Python and Java,
or an arbitrary word as seen in Bourne shell and PHP "here doc"
syntax.

However, if a programmer truly wanted to have arbitrary text blocks in
code, without needing to escape anything in them, it's possible to
abuse at-quoted string syntax, using it with an ASCII control
character which is displayed visibly by a text editor.

In the following, the characters `^\` are meant to represent a literal
ASCII File Separator character in the source code:

    (define json-object #@^\
      {
        "key": "value"
      }
      ^\)

It works fine in Emacs, so why not? Use `C-q C-\` to insert the `^\`.

This is indeed quite an eldritch syntax, but hopefully most programs
would not need to use it.

## Other syntax

The following table summarizes commonly useful syntax abbreviations:

    [...]   -> (#SQUARE ...)         #datum     -> (#HASH & datum)
    {...}   -> (#BRACE ...)          #rune(...) -> (#rune ...)
    'datum  -> (#QUOTE & datum)      dat1dat2   -> (#JOIN dat1 & dat2)
    `datum  -> (#GRAVE & datum)      dat1.dat2  -> (#DOT dat1 & dat2)
    ,datum  -> (#COMMA & datum)      dat1:dat2  -> (#COLON dat1 & dat2)

Notes:

* The terms datum, dat1, and dat2 each refer to an arbitrary datum;
  ellipsis means zero or more data.

* The `#datum` form only applies when the datum following the hash
  sign is anything other than a bare string, since otherwise this
  would be ambiguous with a rune literal. A bare string can
  nevertheless follow the hash sign by separating the two with a
  backslash:

      #\string -> (#HASH & string)

* Though not represented in the table due to notational difficulty,
  the form `#rune(...)` doesn't require a list in the second position;
  any datum that works with the `#datum` syntax also works with
  `#rune`.

      #rune1#rune2   -> (#rune1 & #rune2)
      #rune\string   -> (#rune & string)
      #rune'string   -> (#rune #QUOTE & string)
      #rune"string"  -> (#rune #DQSTR & |string|)

  As a counter-example, following a rune immediately with a bare
  string isn't possible without the delimiting backslash, since that
  would be ambiguous:

      #abcdefgh  ;Could be (#abcdef & gh) or (#abcde & fgh) or ...

* Syntax sugar can combine arbitrarily. Some examples follow. Any of
  these may or may not actually have a meaning in code; many could
  simply end up producing an error during decoding, or later
  evaluation of code.

      #{...}            -> (#HASH #BRACE ...)
      #'foo             -> (#HASH #QUOTE & foo)
      ##'[...]          -> (#HASH #HASH #QUOTE #SQUARE ...)
      {x y}[i j]        -> (#JOIN (#BRACE x y) #SQUARE i j)
      foo.bar.baz{x y}  -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)

* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it
  parses as `(#QUOTE & foo)`; a single pair with the quoted datum in
  the second position.
  The same principle is used when parsing other sugar; some examples
  follow:

      Incorrect                          Correct

      #(x y z) -> (#HASH (x y z))        #(x y z) -> (#HASH x y z)
      [x y z]  -> (#SQUARE (x y z))      [x y z]  -> (#SQUARE x y z)
      #{x}     -> (#HASH (#BRACE (x)))   #{x}     -> (#HASH #BRACE x)
      foo(x y) -> (#JOIN foo (x y))      foo(x y) -> (#JOIN foo x y)

* Those used to thinking in Lisp and Scheme may think that
  `(#QUOTE ...)` halts further decoding of enclosed data. This is not
  so, since quoting is related to code evaluation, not decoding.

### Datum labels

Valid data cannot be cyclic, since that would mean it has infinite
length in bytes. To externally represent a value with cyclic
structure, one uses datum labels in the data encoding of the value.

A datum label either wraps another datum to assign a number to it, or
contains just a reference to a previous assignment.

    +------------------+------------------------------+
    | Syntax           | Internal datum structure     |
    +------------------+------------------------------+
    | #%<hex>=<datum>  | (#LABEL <num> & <datum>)     |
    +------------------+------------------------------+
    | #%<hex>%         | (#LABEL & <num>)             |
    +------------------+------------------------------+

In this visual, the token `<hex>` stands for a hexadecimal digit
sequence, the token `<datum>` stands for any other datum, and `<num>`
is a stand-in for a number value; that which is represented by
`<hex>`.

For clarity, concrete examples follow:

    +-------------------+-------------------------------+
    | Byte sequence     | Parse result                  |
    +-------------------+-------------------------------+
    | #%1234abcd=(foo)  | (#LABEL <0x1234abcd> & (foo)) |
    +-------------------+-------------------------------+
    | #%1234abcd%       | (#LABEL & <0x1234abcd>)       |
    +-------------------+-------------------------------+

Here, the visual token `<0x1234abcd>` stands for a Zisp value of a
numeric type with an integer value. Note that the decoder may not
accept a bare string here, meaning this syntax sugar is not merely an
abbreviation.
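To make the two label forms concrete, here is a small Python sketch that recognizes a label prefix and converts its hexadecimal digits to an integer. It is an illustration under stated assumptions rather than the actual parser: the function name `read_label`, the `"define"` / `"reference"` tags, and the tuple it returns are all hypothetical, and the datum after a `=` is left unparsed.

```python
import re

# "#%", one or more hexadecimal digits, then "=" (assignment) or "%" (reference).
LABEL = re.compile(rb"#%([0-9a-fA-F]+)([=%])")


def read_label(src: bytes):
    """Recognize a datum label at the start of `src`.

    Returns (kind, number, rest), where `rest` holds the bytes that
    follow the label, or None if `src` does not begin with a label.
    """
    match = LABEL.match(src)
    if match is None:
        return None
    number = int(match.group(1), 16)  # the numeric value the hex digits denote
    kind = "define" if match.group(2) == b"=" else "reference"
    return kind, number, src[match.end():]


defined = read_label(b"#%1234abcd=(foo)")
referenced = read_label(b"#%1234abcd%")
```

In the first call the remaining bytes `(foo)` would be handed back to the datum parser, and the result would end up in the second position of the `#LABEL` pair.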
### Shebang

Finally, the parser recognizes the Unix *shebang* syntax and outputs a
datum to hold the string values found within:

    #!interpreter          -> (#SHBANG & interpreter)
    #!interpreter argline  -> (#SHBANG interpreter & argline)

When executing a script file, Zisp simply stores this into a global
value that may be inspected if desired.
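The two shapes of shebang datum can be modeled with nested tuples. The Python sketch below is a rough illustration only: the function name `parse_shebang` is hypothetical, pairs are modeled as 2-tuples, and splitting at the first space is an assumption borrowed from common Unix behavior that the text above does not spell out. The interpreter paths in the usage lines are made up.

```python
def parse_shebang(line: bytes):
    """Model the datum produced for a shebang line.

    A pair (X & Y) is represented as the tuple (X, Y), so
    (#SHBANG interpreter & argline) becomes
    ("#SHBANG", (interpreter, argline)).
    """
    if not line.startswith(b"#!"):
        return None
    body = line[2:].rstrip(b"\n")
    # Assumption: the first space separates interpreter from argument line.
    interpreter, separator, argline = body.partition(b" ")
    if not separator:
        return ("#SHBANG", interpreter)         # (#SHBANG & interpreter)
    return ("#SHBANG", (interpreter, argline))  # (#SHBANG interpreter & argline)


plain = parse_shebang(b"#!/usr/bin/zisp\n")
with_args = parse_shebang(b"#!/usr/bin/env zisp --flag\n")
```

Note that the argument line is kept as a single string and not split further, matching the `argline` term in the table above.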