From 724ac8ae394675a78c2977c6e35555b210256e01 Mon Sep 17 00:00:00 2001 From: Taylan Kammer Date: Mon, 1 Jun 2026 21:49:37 +0200 Subject: docs -> doc --- docs/c1/1-parse.md | 611 ----------------------------------------------------- 1 file changed, 611 deletions(-) delete mode 100644 docs/c1/1-parse.md (limited to 'docs/c1/1-parse.md') diff --git a/docs/c1/1-parse.md b/docs/c1/1-parse.md deleted file mode 100644 index 4eb5776..0000000 --- a/docs/c1/1-parse.md +++ /dev/null @@ -1,611 +0,0 @@ -# Parser for Data - -*For an exact specification of the grammar, see [grammar](grammar/).* - -Zisp s-expressions represent an extremely minimal set of data types; only that -which is necessary to strategically construct more complex values: - - +--------+-----------------+--------+----------+------+ - | TYPE | String | Rune | Pair | Nil | - +--------+-----------------+--------+----------+------+ - | E.G. | foobar | #name | (X & Y) | () | - | | |foo bar| | | | | - | | "foo bar" | | | | - | | @_foo bar_ | | | | - +--------+-----------------+--------+----------+------+ - -Datum comments and line comments are supported: - -* A semicolon followed by a tilde instructs the parser to consume one datum and - discard it. Whitespace may appear between the tilde and the datum to discard. - -* A semicolon, followed by a non-tilde byte, instructs the parser to consume and - discard bytes until a newline (ASCII Line Feed) is encountered. - -The parser can also output non-negative integers, but this is only used for -datum labels; number literals are handled by the decoder instead; see below. - - -## Overview - -This section explains a few core concepts and features related to the parser. - - -### Value vs. Datum - -A Zisp *value* that has an *external representation* in the form of a sequence -of bytes is called a *datum*. Every datum is a value, but not all values are -data. 
A datum is a value that can be printed out as a byte sequence which the -parser can recognize and turn back into an equivalent datum. - -One may speak of an *external representation of a value* where the value is not -itself a datum, but can be encoded as a datum. The more strictly correct term -for this is: "The external representation of a datum encoding the value." - - -### Syntax sugar - -The parser recognizes various "syntax sugar" and transforms it into uses of the -above listed primitive data types. As an example, the expression `#(x y z)` is -parsed into the structure `(#HASH x y z)`. These are two completely equivalent -external representations for the same compound datum; after parsing, both byte -sequences will yield data values that are indistinguishable in all but their -memory address. - -The most ubiquitously used syntax sugar is the list, which stands for a chain of -pairs, terminated with nil: - - (x y z) -> (x & (y & (z & ()))) - -The full syntax sugar table is listed and explained further below. - - -### Decoder - -*The decoder has nothing to do with the concept of text or character encoding.* - -A separate process called *decoding* can transform Zisp data into values of more -complex types, including values that are not of a datum type. - -For example, the datum `(#HASH x y z)` could be decoded into an array, so the -expression `#(x y z)` could work like in Scheme. - -Decoding also resolves datum labels, goes over bare strings to find ones that -represent a number literal, and takes care of a number of other transforms. -This offloads complexity, allowing the parser to remain extremely simple. - -See the dedicated documentation of the [decoder](2-decode.html) for more. - - -### Character encoding - -The parser does not consume characters; it consumes bytes. - -Grammar is generally constructed by bytes corresponding to ASCII characters. 
-Some elements of the grammar, such as comments and quoted strings, may contain -arbitrary byte sequences, until terminated. These sequences may happen to be -valid UTF-8 text. This way, quoted strings and comments may contain Unicode -text encoded in UTF-8, but the parser does not check these for validity. - -Since comments and quoted strings may contain arbitrary byte sequences, a text -editor or other program displaying Zisp s-expressions may need to use a special -visual representation for bytes that don't represent valid text. - -The parser being based on bytes rather than characters is not a limitation but -rather a feature: It allows for Zisp s-expressions to be used as a structured -data exchange format that may contain binary data elements without the need to -encode these in Base64 or other such text representations of binary data. -Consider the example: - - ((image.webp "<< binary data >>") - (video.webm "<< binary data >>")) - -All that needs to be done for this to work, is that any incidental occurrences -of the double-quote sign, and the backslash sign, are escaped with a backslash -within the binary data; all other bytes can appear verbatim in the strings. - - -### Stream parsing - -The parser can be repeatedly invoked on a byte stream to consume the next datum -within. This does not require "unreading" or back-seeking within the stream; -the parser always reads a full datum, and stops after some byte which cleanly -terminates the currently parsed datum. - -This means Zisp s-expressions can be safely intermixed with other data within -the same byte stream. So long as the other data is consumed by some parser -which similarly stops reading at a clear boundary, the Zisp parser can then -continue operating on the same stream. 
Consider the example: - - ("image.webp" 8273) - - << 8273 bytes >> - - ("video.webm" 736) - - << 736 bytes >> - -The "header" for each file in this stream is a Zisp s-expression containing -information about how many bytes should be read after the header, before the -next file header appears. (The header data need to be terminated with a blank -ASCII character such as a newline. The reason why the closing parenthesis does -not act as a terminator unto itself will become apparent later.) - -#### Buffering - -To enable the aforementioned stream parsing strategy, the parser does not use -automatic buffering. If it did, it might inadvertently consume some bytes -beyond the currently parsed datum, leaving the stream inconsistent. - -The parser could provide access to its buffer, such that one could access the -unused bytes, but it's simpler and more flexible to let buffering be handled -externally from the parser. - -In other words: If the parser is meant to be used on an I/O stream connected to -expensive system calls, such as a file handle or network socket, it's best to -wrap that stream in some intermediate object which asks the system for large -chunks of data at once, and stores the data in a buffer. - - -### Datum labels - -Valid data cannot be cyclic, since that would mean it has infinite length in -bytes. To externally represent a value with cyclic structure, one uses datum -labels in the data encoding of the value. - -A datum label either wraps another datum to assign a number to it, or contains -just a reference to a previous assignment. 
- - +----------------------------------+---------------------------------+ - | Internal structure | External representation | - +----------------------------------+---------------------------------+ - | (#LABEL & (<number> & <datum>)) | #%<hex>=<datum> | - +----------------------------------+---------------------------------+ - | (#LABEL & <number>) | #%<hex>% | - +----------------------------------+---------------------------------+ - -In this visual, the token `<number>` stands for an actual number value that -doesn't have its own external representation. It's printed as a sequence of -hexadecimal digits, denoted by `<hex>` in the external representation. - -For clarity, concrete examples follow: - - #%1234abcd=(foo bar) -> (#LABEL & (<0x1234abcd> & (foo bar))) - - #%1234abcd% -> (#LABEL & <0x1234abcd>) - -Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type -with an integer value. - -Datum labels may look like "syntax sugar" but the fact that integers don't have -a direct external representation means that datum labels are a fundamental type -of syntax that has no "desugared" equivalent in external representation. The -decoder will not accept a bare string encoding of an integer here. - - -## Data types - -Following is an explanation of the four core data types constructed by the Zisp -s-expression parser. - -A Zisp value that is a member of one of these types is also called a *datum* if -it adheres to additional constraints as explained for each type. - - -### String - -Strings can appear "bare" or be quoted in various ways. - -A string, as a stand-alone Zisp value, is only a valid datum if it can be -represented as a bare string. 
If it contains bytes that prevent the bare -representation, then the string must be wrapped in one of the following -structures to become a valid datum, each of which has its own external -representation: - - +-------------------------------+-------------------------------+ - | Internal structure | External representation | - +-------------------------------+-------------------------------+ - | (#PQSTR & <string>) | |contents| | - +-------------------------------+-------------------------------+ - | (#DQSTR & <string>) | "contents" | - +-------------------------------+-------------------------------+ - | (#ATSTR & <string>) | @_contents_ | - +-------------------------------+-------------------------------+ - -The visual token `<string>` is meant to denote the actual string, as a Zisp -value, occupying the second position in the pair. It is not actual syntax. - -Note that, while conceptually similar, this internal encoding of string data is -not syntax sugar, since the internal datum representation using runes cannot be -printed out verbatim, due to the attached string being impossible to represent -externally without quotation. As such, quoted strings are fundamental syntax. - -These external representations of strings will be explained in more detail -further below, including backslash escape sequences allowed within. - -Strings have a fixed length, counted in bytes. Each byte can have any value, -including zero (aka ASCII NULL). The parser reads bytes, not characters, and -has no concept of a character encoding, which means that a string can contain -UTF-8 byte sequences, but these are not tested for validity. - -A string that is up to 255 bytes long is automatically *interned*, meaning any -occurrence of the same string -- equal in length and containing the same byte -values -- ends up being represented by the same bit-pattern; either a memory -address, or an immediate representation within a CPU word for short strings. 
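The interning rule described for strings can be sketched roughly as follows. This is a minimal illustration in Python, not the actual Zisp implementation: the names `ZispString`, `make_string`, and `INTERN_LIMIT` are hypothetical, and Python object identity stands in for "same bit-pattern" (the immediate in-word representation of short strings is not modelled).

```python
# Hypothetical model of string interning; identity (`is`) stands in for
# "represented by the same bit-pattern" in the text.

INTERN_LIMIT = 255  # per the text: strings of up to 255 bytes are interned


class ZispString:
    """Illustrative stand-in for a Zisp string value (not a real API)."""

    __slots__ = ("data",)

    def __init__(self, data):
        self.data = bytes(data)


# Maps byte content to the single shared value for that content.
_intern_table = {}


def make_string(data):
    """Return a string value, sharing one object per distinct short content."""
    data = bytes(data)
    if len(data) <= INTERN_LIMIT:
        value = _intern_table.get(data)
        if value is None:
            value = ZispString(data)
            _intern_table[data] = value
        return value
    # Longer strings always get a fresh, distinct object.
    return ZispString(data)
```

Under this model, two short strings with equal bytes can be compared by identity alone, while long strings need a byte-wise comparison.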
- -Strings with a length greater than 255 bytes end up being represented by a -distinct memory address, even if they are equal in length and content. - - -### Rune - -A rune is represented by an ASCII character sequence of 1 to 6 bytes that must -begin with a letter and may only contain letters and digits. This character -sequence of letters and digits is called the *name* of the rune. A rune that -follows this constraint is valid as a datum. - -Zisp code may explicitly construct values of the rune type that violate the -above constraints. Such runes are not valid data and cannot be printed or -parsed in any way. - -Runes are case-sensitive, and the parser always emits runes using uppercase -letters when expressing syntax sugar. Uppercase rune names are reserved for -Zisp's internal use and standard library; users can use lowercase runes with -custom meaning without worrying about clashes, with the exception of a small -number of lowercase runes such as `#true` and `#false` that are part of the -default decoder settings. - -Runes are always stored directly in a CPU word; never by memory address. - - -### Pair - -A pair is a tuple of two values: the first value and the second value. - -The parser allocates a unique two-word cell in the process heap for every pair, -and represents that pair through the memory address of that cell. - -Pairs are valid as a datum if one of the following holds true for the pair: - -* It encodes one of the quoted string variants. - -* It encodes a datum label (assignment or reference). - -* Both the first and second values in the pair are themselves valid data. - -An additional constraint is that a hierarchy of pairs containing pairs must not -form cycles; if they do, the cycles must be broken up by use of datum labels or -else none of the pairs within the cyclic structure is a valid datum. - - -### Nil - -The Zisp nil value is a singleton and a datum. 
There is exactly one nil value -and it is used to terminate a chain of pairs representing a list of values. - - -## Quoted strings - -Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted. -This section goes into the details of each variant. - - -### Pipe-quoted - -Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers -the parser to generate a pair with the structure: - - (#PQSTR & <string>) ;; <string> is visual aid, not syntax - -The decoder, using default settings, would emit this string verbatim as a value. -Then, during code evaluation, this would be seen as an identifier. In this way, -pipe-quoted strings are equivalent to bare strings in functionality. - -It is important to understand that the decoder sits between the parser and the -[evaluator](3-execute.html), and in opposition to Lisp and Scheme tradition, it -is common for the evaluator to receive values that are not valid as a datum; in -this case, a string unto itself that may not be a valid datum, because it -cannot be represented as a bare string. Yet, it is valid as an identifier -for the purposes of the evaluator, since it is a string *value* like any other. - - -### Double-quoted - -Strings wrapped in the double-quote symbol parse into: - - (#DQSTR & <string>) ;; <string> is visual aid, not syntax - -Under default settings, the decoder would transform this into a value which, -when evaluated, yields back the string as a value. Typically, this would be -achieved by simply transforming it into `(#QUOTE & <string>)`. (Note that, -unlike `(#PQSTR & <string>)`, this would not be decoded into a string unto -itself, as that would make the evaluator see it as an identifier.) - - -### At-quoted strings AKA raw strings - -There is a special type of syntax for "raw" strings, meaning that no backslash -escapes nor any other kind of escape sequence are recognized within them. - -This raw string syntax begins with an at sign, followed by any byte. 
That byte -becomes the termination marker, and the string cannot contain an occurrence of -it, since there are no escape sequences. - - @"foo \ bar" -> (#ATSTR & <string>) - -In the above, the visual token `<string>` is not part of datum syntax but a -stand-in for the actual string value, which is, literally: `foo \ bar` - -This style of quoting can be useful, for instance, when representing regular -expressions as strings in code: - - @/^foo\\(bar|baz)\.\[".*"\]$/ ;; matches e.g. foo\bar.["blah"] - -Were it not for this syntax, this regular expression would only be possible to -represent through a quoted string such as the following: - - "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$" ;; many backslashes - -Alternatively, imagine searching for certain MS Windows file paths: - - @_C:\\\\Users\\([a-z]+)_ ;; matches C:\\Users\foo - -That's already ugly. Without raw strings, it would need to look even worse: - - "C:\\\\\\\\Users\\\\([a-z]+)" ;; MANY backslashes - -The byte that follows the at sign need not be a printable character or even a -valid ASCII byte; it can be absolutely any byte value, even NULL. This can be -useful to easily encode binary data which is known to not contain a specific -byte; an example would be C strings which cannot contain NULL. - - -### Backslash escape sequences in strings - -The following backslash escapes are supported in pipe-quoted and double-quoted -strings. (Some rows use Regular Expression notation.) 
- - +-----------------------------------+------------------------------+ - | Character(s) following backslash | Meaning | - +-----------------------------------+------------------------------+ - | \ | Literal backslash | - +-----------------------------------+------------------------------+ - | | | Literal pipe symbol | - +-----------------------------------+------------------------------+ - | " | Literal double-quote | - +-----------------------------------+------------------------------+ - | RE: /[\t ]*\n[\t ]*/ | Discarded | - +-----------------------------------+------------------------------+ - | 0 | ASCII NULL | - +-----------------------------------+------------------------------+ - | a | ASCII Alert | - +-----------------------------------+------------------------------+ - | b | ASCII Backspace | - +-----------------------------------+------------------------------+ - | t | ASCII Tab (Horizontal) | - +-----------------------------------+------------------------------+ - | n | ASCII Newline (Line Feed) | - +-----------------------------------+------------------------------+ - | v | ASCII Vertical Tab | - +-----------------------------------+------------------------------+ - | f | ASCII Form Feed | - +-----------------------------------+------------------------------+ - | r | ASCII Carriage Return | - +-----------------------------------+------------------------------+ - | e | ASCII Escape | - +-----------------------------------+------------------------------+ - | RE: /x([0-9a-fA-F]{2})*;/ | Arbitrary bytes in hex | - +-----------------------------------+------------------------------+ - | RE: /u[0-9a-fA-F]+;/ | Unicode scalar as UTF-8 | - +-----------------------------------+------------------------------+ - -To clarify: - -* A backslash followed by a backslash, pipe, or double-quote character is - substituted with a literal occurrence of the corresponding character. 
- -* A backslash followed by any number of blanks (space or tab), a newline, and - again any number of blanks, is substituted with nothing. This is to allow - splitting a string into multiple lines for human readability. - - (define paragraph "This paragraph has been visually split into multiple \ - lines, but the newline is escaped, so it's one line.") - -* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the - C programming language (where `\e` is a common compiler extension rather than - standard C), representing common unprintable ASCII bytes. - -* An x, followed by pairs of hexadecimal digits (case insensitive), terminated - by a semicolon, is substituted with the sequence of bytes represented by the - corresponding pairs of hexadecimal digits. E.g.: `"foo\xDEADBEEF;bar"` - -* A u, followed by a hexadecimal digit sequence (case insensitive), terminated - by a semicolon, is substituted with the canonical UTF-8 byte sequence for the - Unicode Scalar Value represented by that hexadecimal number. The number must - be in the range `0` to `10FFFF`. E.g.: `"foo\u00A0;bar"` - - -### Newlines in strings - -Normally, a newline in a string has no special meaning and simply becomes part -of the string. However, newlines can be backslash-escaped, which simply erases -them; the escaped newline can also be preceded or followed by any number of tab -and space characters, which are all stripped as well. (Note: It's not blanks -preceding the backslash that are stripped, but blanks following the backslash -and preceding the newline; i.e., blanks at the end of the line.) - -Following are some examples of how multi-line strings can appear in source code -with different intentions and meanings: - - (define paragraph "This paragraph has been visually split into multiple \ - lines, but the newlines are escaped, so it's one line.") - - (define json-object '| ;; use '|| so double-quotes need no escaping - { - "key": "value" - } - |) - -The second example is actually slightly problematic. 
It begins with a newline, -which may be undesirable, but escaping that newline would cause the first line -to have no indentation, thus the opening `{` would not line up with the closing -`}` when this string is printed out. Further, if the entire block of code is -indented, then the string contents may be more indented than intended. (No pun -or rhyme intended.) Consider: - - (let ((foo one)) - (let ((bar two)) - (let ((json-object '| - { - "key": "value" - } - |)) - (do-whatever)))) - -The string bound to `json-object` has redundant indentation. Should the parser -attempt to solve this issue? - -Thankfully, we have the decoder to handle such complexities. Under the default -settings, the rune `#HASH` is bound to a decoder rule which detects a payload -value that is a string literal, and implements the same algorithm as seen in -Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378) - -Thus, we can do the following: - - (let ((foo one)) - (let ((bar two)) - (let ((json-object #| - ........... { - ........... "key": "value" - ........... } - ...........|)) - (do-whatever)))) - -(Dots represent whitespace that is deleted. The initial newline is, as well.) - -The only feature Zisp does not offer is a way to fence off multi-line strings -with a longer token such as `"""` as seen in Python and Java, or an arbitrary -word as seen in Bourne shell and PHP "here doc" syntax. - -However, if a programmer truly wanted to have arbitrary text blocks in code, -without needing to escape anything in them, it's possible to abuse at-quoted -string syntax, using it with an ASCII control character which is displayed -visibly by a text editor. In the following, the characters `^\` are meant to -represent a literal ASCII File Separator character in the source code: - - (define json-object #@^\ - { - "key": "value" - } - ^\) - -Hey, it works fine in Emacs, so why not? Use `C-q C-\` to insert the `^\`. 
- -This is indeed quite an eldritch syntax, but hopefully most programs would not -need to use it anyway. - - -## Syntax sugar - -The parser recognizes various "syntax sugar" and transforms it into equivalent -datum constructions. The most ubiquitous example of this is the list, which is -transformed into a chain of pairs, terminated with nil: - - (datum1 datum2 ...) -> (datum1 & (datum2 & (... & ()))) - -This is so ubiquitous as to be hardly considered "syntax sugar" but is counted -as such, since any list could just as well be written as a chain of pairs; both -would result in an equivalent datum when parsed. - -The following table summarizes the other available transformations: - - [...] -> (#SQUARE ...) #datum -> (#HASH & datum) - - {...} -> (#BRACE ...) #rune(...) -> (#rune ...) - - 'datum -> (#QUOTE & datum) dat1dat2 -> (#JOIN dat1 & dat2) - - `datum -> (#GRAVE & datum) dat1.dat2 -> (#DOT dat1 & dat2) - - ,datum -> (#COMMA & datum) dat1:dat2 -> (#COLON dat1 & dat2) - -Notes: - -* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis - means zero or more data. - -* The `#datum` form only applies when the datum following the hash sign is - anything other than a bare string, since otherwise this would be ambiguous - with a rune literal. A bare string can nevertheless follow the hash sign by - separating the two with a backslash: - - #\string -> (#HASH & string) - -* Though not represented in the table due to notational difficulty, the form - `#rune(...)` doesn't require a list in the second position; any datum that - works with the `#datum` syntax also works with `#rune`. 
- - #rune1#rune2 -> (#rune1 & #rune2) - - #rune\string -> (#rune & string) - - #rune'string -> (#rune #QUOTE & string) - - #rune"string" -> (#rune #DQSTR & |string|) - - As a counter-example, following a rune immediately with a bare string isn't - possible without the delimiting backslash, since that would be ambiguous: - - #abcdefgh ;Could be (#abcdef & gh) or (#abcde & fgh) or ... - -* Syntax sugar can combine arbitrarily. Some examples follow. Any of these may - or may not actually have a meaning in code; many could simply end up producing - an error during decoding, or later evaluation of code. - - #{...} -> (#HASH #BRACE ...) - - #'foo -> (#HASH #QUOTE & foo) - - ##'[...] -> (#HASH #HASH #QUOTE #SQUARE ...) - - {x y}[i j] -> (#JOIN (#BRACE x y) #SQUARE i j) - - foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y) - -* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as - `(#QUOTE & foo)`; a single pair with the quoted datum in the second position. - - The same principle is used when parsing other sugar; some examples follow: - - Incorrect Correct - - #(x y z) -> (#HASH (x y z)) #(x y z) -> (#HASH x y z) - - [x y z] -> (#SQUARE (x y z)) [x y z] -> (#SQUARE x y z) - - #{x} -> (#HASH (#BRACE (x))) #{x} -> (#HASH #BRACE x) - - foo(x y) -> (#JOIN foo (x y)) foo(x y) -> (#JOIN foo x y) - -* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts - further decoding of enclosed data. This is not so, since quoting is related - to code evaluation, not decoding. - - -## Shebang - -There is one final "syntax sugar" translation whose sole purpose is to allow a -shebang line at the start of a file: - - #!interpreter -> (#SHBANG & interpreter) - - #!interpreter argline -> (#SHBANG interpreter & argline) - -Under default settings, the decoder will allow this datum to appear once at the -beginning of a per-file decoding sequence, and simply discard it. - - - -- cgit v1.2.3