Diffstat (limited to 'doc/c1/1-parse.md')
-rw-r--r-- doc/c1/1-parse.md 608
1 files changed, 0 insertions, 608 deletions
diff --git a/doc/c1/1-parse.md b/doc/c1/1-parse.md
deleted file mode 100644
index d4c4c2e..0000000
--- a/doc/c1/1-parse.md
+++ /dev/null
@@ -1,608 +0,0 @@
-# Parser for Code & Data
-
-<!--TOC-->
-
-Zisp s-expressions represent an extremely minimal set of data types; only that
-which is necessary to strategically construct more complex values:
-
- +-------+---------+--------+----------+------+
- | TYPE | String | Rune | Pair | Nil |
- +-------+---------+--------+----------+------+
- | E.G. | foobar | #name | (X & Y) | () |
- +-------+---------+--------+----------+------+
-
-The parser recognizes various *syntax sugar* which abbreviates verbose syntax,
-and may result in special data structures (typically, a pair with a rune in its
-first, and payload in its second position) which another Zisp component called
-the *decoder* can transform into a rich set of value types.
-
-The most ubiquitous syntax sugar is the list, which abbreviates a sequence of
-tail-linked pairs, terminated with a special nil value represented as `()`:
-
- (x) -> (x & ())
-
- (x y) -> (x & (y & ()))
-
- (x y z) -> (x & (y & (z & ())))
-
-The following are so-called *improper lists*, ending in a non-nil value:
-
- (x y & z) -> (x & (y & z))
-
- (x y z & t) -> (x & (y & (z & t)))
-
-More details about syntax sugar, and the decoder, are explained later.
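The list expansions above can be reproduced with a short sketch. It is written in Python purely for illustration: modelling pairs as 2-tuples, and the names `NIL`, `desugar`, and `show`, are assumptions of this example and not part of Zisp.

```python
from typing import Any

NIL = ()  # stand-in for the Zisp nil value (assumption of this sketch)


def desugar(items: list, tail: Any = NIL) -> Any:
    """Fold a list of items into tail-linked pairs, modelled as 2-tuples."""
    result = tail
    for item in reversed(items):
        result = (item, result)
    return result


def show(value: Any) -> str:
    """Render a value in the explicit (X & Y) pair notation."""
    if value == NIL:
        return "()"
    if isinstance(value, tuple):
        return f"({show(value[0])} & {show(value[1])})"
    return str(value)


print(show(desugar(["x", "y", "z"])))       # proper list
print(show(desugar(["x", "y", "z"], "t")))  # improper list ending in t
```

Passing a non-nil `tail` gives the improper-list form; leaving it out gives the ordinary nil-terminated list.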
-
-
-## Character Encoding
-
-The parser does not consume Unicode characters; it consumes bytes. The grammar
-is generally built from bytes corresponding to ASCII characters.
-
-Some elements of the grammar, such as comments and quoted strings, may contain
-arbitrary byte sequences, until terminated. These sequences may happen to be
-valid UTF-8 text. This way, quoted strings and comments may contain Unicode
-text encoded in UTF-8, but the parser does not check these for validity.
-
-Since comments and quoted strings may contain arbitrary byte sequences, a text
-editor or other program displaying Zisp s-expressions may need to use a special
-visual representation for bytes that don't represent valid text.
-
-The parser working on bytes rather than Unicode characters is not a limitation,
-but rather a feature: It allows Zisp s-expressions to be used as a structured
-data exchange format, which may contain binary data elements, without the need
-to encode these in Base64 or other such text representations of binary data.
-Consider the example:
-
- ((image.webp "<BINARY>")
- (video.webm "<BINARY>"))
-
-For this to work, any incidental occurrence of the double-quote sign or the
-backslash sign within the `<BINARY>` data must be escaped with a backslash; all
-other bytes can appear verbatim in the strings.
-
-
-## Stream Parsing
-
-The parser can be repeatedly invoked on a byte stream to consume the next datum
-within. This does not require "unreading" or back-seeking within the stream;
-the parser always reads a full datum, and stops after some byte which cleanly
-terminates the currently parsed datum.
-
-This means Zisp s-expressions can be safely intermixed with other data within
-the same byte stream. So long as the other data is consumed by some parser
-which similarly stops reading at a clear boundary, the Zisp parser can then
-continue operating on the same stream. Consider the example:
-
- ("image.webp" 8273)
-
- << 8273 bytes >>
-
- ("video.webm" 736)
-
- << 736 bytes >>
-
-The "header" for each file in this stream is a Zisp s-expression containing
-information about how many bytes should be read after the header, before the
-next file header appears. (The header data need to be terminated with a blank
-ASCII character such as a newline; the closing parenthesis does not act as a
-terminator unto itself due to the "join" syntax sugar.)
-
-To enable this stream parsing strategy, the parser does not use any automatic
-buffering. If it did, it might inadvertently consume some bytes beyond the
-currently parsed datum, leaving the stream inconsistent.
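The header-then-payload layout can be sketched as follows. This is a Python illustration under stated assumptions, not the Zisp parser: the header is matched with a hypothetical regular expression instead of a real s-expression parser, exactly one newline terminates each header, and the payload follows immediately. The names `read_header` and `read_files` are invented for this example.

```python
import io
import re

# Hypothetical stand-in for parsing the header datum: ("name" size)
HEADER = re.compile(rb'\("([^"]*)" (\d+)\)')


def read_header(stream):
    """Read byte by byte up to the terminating newline; None at clean EOF."""
    header = bytearray()
    while True:
        byte = stream.read(1)  # never read past the terminator
        if not byte:
            if header:
                raise EOFError("stream ended inside a header")
            return None
        if byte == b"\n":
            return bytes(header)
        header += byte


def read_files(stream):
    """Yield (name, payload) for each header/payload record in the stream."""
    while (header := read_header(stream)) is not None:
        match = HEADER.fullmatch(header)
        if match is None:
            raise ValueError(f"unexpected header: {header!r}")
        name, size = match.group(1).decode("ascii"), int(match.group(2))
        # In-memory streams return all requested bytes; a socket may not.
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a payload")
        yield name, payload


demo = io.BytesIO(b'("a.bin" 3)\n\x00\n"("b.bin" 2)\n()')
for name, payload in read_files(demo):
    print(name, payload)
```

Because the header reader stops exactly at its terminator, the payload may contain newlines, quotes, or parentheses without confusing the next header read.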
-
-If the parser is meant to be used on an input stream associated with expensive
-system calls, such as a file handle or network socket, it's best to wrap that
-stream in some intermediate object which asks the system for large chunks of
-data at once, and stores the data in a buffer.
-
-
-## Comments
-
-Two types of comment are supported: datum comments and line comments.
-
-* A semicolon followed by a tilde instructs the parser to consume one datum and
- discard it. Whitespace may appear between the tilde and the datum to discard.
-
-* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
- discard bytes until a newline (ASCII Line Feed) is encountered.
-
-
-## Value vs. Datum
-
-A Zisp *value* that has an *external representation* in the form of a sequence
-of bytes is called a *datum*. Every datum is a value, but not every value is a
-datum. In other words, a datum is a value that can be printed out as a byte
-sequence which the parser can turn back into an equivalent datum.
-
-A value that is not a datum may nevertheless be *encoded* into one, allowing it
-to have an external representation. After parsing, it needs to be *decoded* to
-actually become the expected value.
-
-One may speak of an *external representation of a value* where the value is not
-itself a datum, but can be encoded as one. The more strictly correct term for
-this is: "The external representation of a datum that encodes the value."
-
-### Syntax sugar
-
-The parser recognizes various *syntax sugar* to abbreviate an equivalent datum
-construction, or express a datum that encodes a more complex value.
-
-As an example, the expression `#(x y z)` is an abbreviation for the equivalent
-`(#HASH x y z)`. These are two external representations for the same datum;
-after parsing, both will yield values that are indistinguishable in all but
-their memory address.
-
-An example of syntax sugar that is not a mere abbreviation is a quoted string
-which contains bytes that could not appear in a *bare* string:
-
- "foo bar" -> (#DQSTR & <STRING>)
-
-In this example, the visual token `<STRING>` represents the actual string value
-in program memory, which has no direct external representation in bytes because
-it contains a space character.
-
-Those familiar with Lisp and Scheme may expect bare strings to be parsed into a
-separate type called *symbol* while quoted strings are parsed directly into a
-string type, but this is not the case in Zisp.
-
-### Decoder
-
-The *decoder* transforms Zisp data into values of more complex types, including
-values that are not of a datum type.
-
-Combined with syntax sugar, this allows Zisp to offer familiar syntax elements.
-For example, the expression `#(x y z)` which parses into `(#HASH x y z)` can be
-decoded into an array, so the result is similar to the vector syntax of Scheme.
-
-Decoding also resolves datum labels, goes over bare strings to find ones that
-represent a number literal, and takes care of a number of other transforms.
-This offloads complexity, allowing the parser to remain extremely simple.
-
-See the dedicated documentation of the [decoder](2-decode.html) for more.
-
-
-## Data types
-
-Following is a more in-depth explanation of each data type constructed by the
-Zisp s-expression parser.
-
-These are in fact value types, though the term "data type" is often used due to
-familiarity. A Zisp value that is a member of one of the following value types
-is only a *datum* if it adheres to additional constraints as explained below.
-
-### String
-
-Strings can appear *bare* or be quoted in various ways. A quoted string is in
-fact parsed into a pair value with a rune in the first position to identify the
-quotation variant that was parsed, and the string value in the second position;
-or, in case of at-quoted strings, a special construct we will look at later.
-
- +-----------+-----------------------------+
- | Syntax | Parse output |
- +-----------+-----------------------------+
- | |bytes| | (#PQSTR & <STRING>) |
- +-----------+-----------------------------+
- | "bytes" | (#DQSTR & <STRING>) |
- +-----------+-----------------------------+
- | @_bytes_ | (#ATSTR <BYTE> & <STRING>) |
- +-----------+-----------------------------+
-
-The visual token `<STRING>` denotes the actual string, as a Zisp value, in the
-second position of the pair. The visual token `<BYTE>` stands for an integer
-Zisp value between 0 and 255.
-
-These external representations of strings will be explained in more detail
-further below, including backslash escape sequences allowed within, and how
-exactly at-quoted strings work.
-
-Strings have a fixed length, counted in bytes. Each byte can have any value,
-including zero (ASCII NUL). The parser reads bytes, not Unicode characters; a
-string may contain UTF-8 byte sequences, but these are not tested for validity.
-
-A string that is up to 255 bytes long is automatically *interned*, meaning any
-occurrence of the same string -- equal in length and containing the same byte
-values -- ends up being represented by the same bit-pattern; either a memory
-address, or an immediate representation within a CPU word for short strings.
-The quotation method is inconsequential to this process; for example, while
-`|foobar|` and `"foobar"` will parse into different pair values, the actual
-string they hold will be the same one in program memory.
-
-Strings of length greater than 255 bytes are stored separately in memory, even
-if they are equal in length and content.
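The interning rule can be modelled with a small table. This is a Python sketch for illustration only: the class `ZispString`, the function `make_string`, and the use of object identity in place of a bit-pattern are assumptions of the example, not a description of the actual implementation.

```python
INTERN_LIMIT = 255  # strings up to this many bytes are interned
_table = {}         # maps byte content to its single shared string object


class ZispString:
    """Stand-in for a string value; identity models the bit-pattern."""

    def __init__(self, data: bytes):
        self.data = data


def make_string(data: bytes) -> ZispString:
    """Return the shared object for short strings, a fresh one otherwise."""
    if len(data) > INTERN_LIMIT:
        return ZispString(data)  # long strings are stored separately
    if data not in _table:
        _table[data] = ZispString(data)
    return _table[data]


a = make_string(b"foobar")
b = make_string(b"foo" + b"bar")
print(a is b)  # same object for equal short strings

long_bytes = b"x" * 256
print(make_string(long_bytes) is make_string(long_bytes))  # distinct objects
```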
-
-### Rune
-
-A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must
-begin with a letter, and may only contain letters and digits. This character
-sequence of letters and digits is called the *name* of the rune. A rune that
-follows this constraint is valid as a datum.
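The naming constraint is compact enough to state as a pattern. A minimal Python sketch follows; the helper name `is_valid_rune_name` is hypothetical, and the check covers the name only, without the leading hash sign.

```python
import re

# A letter, then up to five more letters or digits: 1 to 6 bytes in total.
RUNE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,5}")


def is_valid_rune_name(name: str) -> bool:
    """Check a rune name against the datum constraint described above."""
    return RUNE_NAME.fullmatch(name) is not None


for candidate in ["HASH", "SHBANG", "true", "abcdefg", "1abc", "x-y"]:
    print(candidate, is_valid_rune_name(candidate))
```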
-
-Zisp code may explicitly construct values of the rune type that violate the
-above constraints. Such runes are not valid data and cannot be printed or
-parsed.
-
-Runes are case-sensitive, and the parser always emits runes using upper-case
-letters when expressing syntax sugar. Uppercase rune names are reserved for
-Zisp's internal use and standard library; users can use lowercase runes with
-custom meaning without worrying about clashes, with the exception of a small
-number of lowercase runes such as `#true` and `#false` that are part of the
-default decoder settings and documented explicitly as such.
-
-Runes are always stored directly in a CPU word; never by memory address.
-
-### Pair
-
-A pair is a tuple of two values: the first value and the second value. In Lisp
-tradition, these are also called the `car` and `cdr` of the pair, respectively.
-
-The parser allocates a unique two-word cell in program memory for every pair,
-and represents that pair through the memory address of the cell.
-
-Pairs are valid data if one of the following holds true:
-
-* The pair encodes a quoted string, datum label, or shebang line.
-
-* Both the first and second values in the pair are valid data.
-
-Further, a structure of nested pair values may not contain cyclic references
-back up in the structure (which would make the above definition diverge into
-infinity). Such cycles must be broken up with datum labels, or else the pair
-cannot be considered a datum, since it cannot be printed or parsed.
-
-### Nil
-
-The Zisp nil value is a singleton and a datum. There is exactly one nil value
-and it is used to terminate a chain of pairs representing a list of values; it
-has the external representation `()`.
-
-
-## Quoted strings
-
-Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted.
-This section goes into the details of each variant.
-
-### Pipe-quoted
-
-Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers
-the parser to generate a pair with the structure:
-
- (#PQSTR & <STRING>) ;; <STRING> is visual aid, not syntax
-
-The decoder, using default settings, would emit this string verbatim as a value.
-Then, during code evaluation, this would be seen as an identifier. In this way,
-pipe-quoted strings are equivalent to bare strings in functionality.
-
-It is important to understand that the decoder sits between the parser and the
-[evaluator](3-execute.html). Contrary to Lisp and Scheme tradition, it is common
-for the evaluator to receive values that are not valid data. Here, the value is
-a standalone string that may not be a valid datum, because it may not be
-representable as a bare string. Yet it is valid as an identifier for the
-purposes of the evaluator, since it is a string *value* like any other.
-
-### Double-quoted
-
-Strings wrapped in the double-quote symbol parse into:
-
- (#DQSTR & <STRING>) ;; <STRING> is visual aid, not syntax
-
-Under default settings, the decoder would transform this into a value which,
-when evaluated as code, simply yields the contained string as a value.
-
-### At-quoted
-
-This is a special type of syntax for "raw" strings, meaning that no backslash
-escapes nor any other kind of escape sequence are recognized within them.
-
-The syntax begins with an at sign, followed by any byte. That byte becomes a
-termination marker, and the string cannot contain an occurrence of it, since
-there are no escape sequences.
-
- @"foo \ bar" -> (#ATSTR <BYTE> & <STRING>)
-
-In the above, the visual tokens `<BYTE>` and `<STRING>` represent an integer
-value and a string value, respectively. In this example, the integer value
-would be 34; the ASCII value for the double-quote sign. The string value
-contains a literal backslash, since there is no backslash escape parsing.
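Since there are no escapes, reading an at-quoted string amounts to a search for the next occurrence of the marker byte. The following Python sketch illustrates this; the function name `parse_at_quoted` and its return shape are assumptions of the example, not the parser's actual interface.

```python
def parse_at_quoted(data: bytes, pos: int = 0):
    """Parse an at-quoted string at data[pos].

    Returns (marker, string, next_pos), mirroring the <BYTE> and <STRING>
    parts of the parse output plus the offset just past the terminator.
    """
    if data[pos:pos + 1] != b"@":
        raise ValueError("expected '@'")
    if pos + 1 >= len(data):
        raise ValueError("missing terminator byte")
    marker = data[pos + 1]                      # any byte value, even NUL
    end = data.find(bytes([marker]), pos + 2)   # no escapes: first hit ends it
    if end < 0:
        raise ValueError("unterminated at-quoted string")
    return marker, data[pos + 2:end], end + 1


print(parse_at_quoted(b'@"foo \\ bar" rest'))
```

For the input shown, the marker is 34 and the string keeps its backslash untouched.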
-
-This style of quoting can be useful, for instance, when representing regular
-expressions as strings in code:
-
- ;; Matches e.g. foo\bar.["blah"]
-
- @/^foo\\(bar|baz)\.\[".*"\]$/
-
-Were it not for this syntax, this regular expression would only be possible to
-represent through a quoted string such as the following:
-
- ;; Same as above, but so many backslashes
-
- "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"
-
-The byte that follows the at sign need not be a printable character or even a
-valid ASCII byte; it can be absolutely any byte value, even NUL. This can be
-useful to easily encode binary data which is known to not contain a specific
-byte; an example would be C strings which cannot contain NUL.
-
-### Backslash escapes
-
-In pipe-quoted and double-quoted strings, the following ASCII characters may
-follow a backslash to insert a certain character.
-
- +-------+----------------------------+
- | Char | Meaning |
- +-------+----------------------------+
- | \ | Literal backslash |
- +-------+----------------------------+
- | | | Literal pipe symbol |
- +-------+----------------------------+
- | " | Literal double-quote |
- +-------+----------------------------+
- | 0 | ASCII NUL |
- +-------+----------------------------+
- | a | ASCII Alert |
- +-------+----------------------------+
- | b | ASCII Backspace |
- +-------+----------------------------+
- | t | ASCII Tab (Horizontal) |
- +-------+----------------------------+
- | n | ASCII Newline (Line Feed) |
- +-------+----------------------------+
- | v | ASCII Vertical Tab |
- +-------+----------------------------+
- | f | ASCII Form Feed |
- +-------+----------------------------+
- | r | ASCII Carriage Return |
- +-------+----------------------------+
- | e | ASCII Escape |
- +-------+----------------------------+
-
-In words:
-
-* A backslash followed by a backslash, pipe, or double-quote character is
- substituted with a literal occurrence of that character.
-
-* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the
- C programming language, representing common ASCII control characters.
-
-Further, the following Regular Expression patterns following a backslash have
-special meaning.
-
- +---------------------+-----------------------+
- | Regular Expression | Meaning |
- +---------------------+-----------------------+
- | [\t ]*\n[\t ]* | Discarded |
- +---------------------+-----------------------+
- | x([0-9a-fA-F]{2})*; | Arbitrary bytes |
- +---------------------+-----------------------+
- | u[0-9a-fA-F]+; | Unicode Scalar Value |
- +---------------------+-----------------------+
-
-Explanations:
-
-* A backslash followed by any number of blanks (space or tab), a newline, and
- again any number of blanks, is substituted with nothing. This is to allow
- splitting a string into multiple lines for human readability.
-
- (define p "This paragraph has been visually split into multiple \
- lines, but the newline is escaped, so it's one line.")
-
-* An x, followed by pairs of hexadecimal digits (case insensitive), terminated
- by a semicolon, is substituted with the sequence of bytes represented by the
- corresponding pairs of hexadecimal digits. E.g.: `"foo\xDEADBEEF;bar"`
-
-* A u, followed by a hexadecimal digit sequence (case insensitive), terminated
- by a semicolon, is substituted with the canonical UTF-8 byte sequence for the
- Unicode Scalar Value represented by that hexadecimal number. The number must
- be in the range `0` to `10FFFF`. E.g.: `"foo\u00A0;bar"`
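The escape rules above can be combined into one decoding loop. The Python sketch below is an illustration under assumptions: it operates on the bytes between the quotes, the name `unescape` is invented, and rejecting unknown escapes with an error is a choice of this example, since the text does not say how the parser treats them.

```python
import re

SIMPLE = {
    ord("\\"): b"\\", ord("|"): b"|", ord('"'): b'"',
    ord("0"): b"\x00", ord("a"): b"\x07", ord("b"): b"\x08",
    ord("t"): b"\t", ord("n"): b"\n", ord("v"): b"\x0b",
    ord("f"): b"\x0c", ord("r"): b"\r", ord("e"): b"\x1b",
}
# Patterns that may follow a backslash, taken from the table above.
CONTINUATION = re.compile(rb"[\t ]*\n[\t ]*")
HEX_BYTES = re.compile(rb"x((?:[0-9a-fA-F]{2})*);")
SCALAR = re.compile(rb"u([0-9a-fA-F]+);")


def unescape(body: bytes) -> bytes:
    """Decode the backslash escapes in the body of a quoted string."""
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != 0x5C:  # not a backslash: copy the byte verbatim
            out.append(body[i])
            i += 1
            continue
        i += 1  # skip the backslash itself
        if i < len(body) and body[i] in SIMPLE:
            out += SIMPLE[body[i]]
            i += 1
        elif m := CONTINUATION.match(body, i):
            i = m.end()  # escaped newline and surrounding blanks vanish
        elif m := HEX_BYTES.match(body, i):
            out += bytes.fromhex(m.group(1).decode("ascii"))
            i = m.end()
        elif m := SCALAR.match(body, i):
            code = int(m.group(1), 16)
            if code > 0x10FFFF:
                raise ValueError("scalar value out of range")
            out += chr(code).encode("utf-8")
            i = m.end()
        else:
            raise ValueError(f"invalid escape at offset {i - 1}")
    return bytes(out)


print(unescape(rb"foo\xDEADBEEF;bar"))
print(unescape(rb"foo\u00A0;bar"))
```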
-
-### Newlines in strings
-
-Normally, a newline in a string has no special meaning and simply becomes part
-of the string. However, newlines can be backslash-escaped, which simply erases
-them; the escaped newline can also be preceded or followed by any number of tab
-and space characters, which are all stripped as well. (Note: It's not blanks
-preceding the backslash that are stripped, but blanks following the backslash
-and preceding the newline; i.e., blanks at the end of the line.)
-
-Following are some examples of how multi-line strings can appear in source code
-with different intentions and meanings:
-
- (define paragraph "This paragraph has been visually split into multiple \
- lines, but the newlines are escaped, so it's one line.")
-
- (define json-object '| ;; use '|| so double-quotes need no escaping
- {
- "key": "value"
- }
- |)
-
-The second example is actually slightly problematic. It begins with a newline,
-which may be undesirable, but escaping that newline would cause the first line
-to have no indentation, thus the opening `{` would not line up with the closing
-`}` when this string is printed out. Further, if the entire block of code is
-indented, then the string contents may be more indented than intended. (No pun
-or rhyme intended.) Consider:
-
- (let ((foo one))
- (let ((bar two))
- (let ((json-object '|
- {
- "key": "value"
- }
- |))
- (do-whatever))))
-
-The string bound to `json-object` has redundant indentation. Should the parser
-attempt to solve this issue?
-
-Thankfully, we have the decoder to handle such complexities. Under the default
-settings, the rune `#HASH` is bound to a decoder rule which detects a payload
-value that is a string literal, and implements the same algorithm as seen in
-Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
-
-Thus, we can do the following:
-
- (let ((foo one))
- (let ((bar two))
- (let ((json-object #|
- ........... {
- ........... "key": "value"
- ........... }
- ...........|))
- (do-whatever))))
-
-(Dots represent whitespace that is deleted. The initial newline is, as well.)
-
-The only feature Zisp does not offer is a way to fence off multi-line strings
-with a longer token such as `"""` as seen in Python and Java, or an arbitrary
-word as seen in Bourne shell and PHP "here doc" syntax.
-
-However, if a programmer truly wanted to have arbitrary text blocks in code,
-without needing to escape anything in them, it's possible to abuse at-quoted
-string syntax, using it with an ASCII control character which is displayed
-visibly by a text editor. In the following, the characters `^\` are meant to
-represent a literal ASCII File Separator character in the source code:
-
- (define json-object #@^\
- {
- "key": "value"
- }
- ^\)
-
-It works fine in Emacs, so why not? Use `C-q C-\` to insert the `^\`.
-
-This is indeed quite an eldritch syntax, but hopefully most programs would not
-need to use it.
-
-
-## Other syntax
-
-The following table summarizes commonly useful syntax abbreviations:
-
- [...]  -> (#SQUARE ...)        #datum     -> (#HASH & datum)
-
- {...}  -> (#BRACE ...)         #rune(...) -> (#rune ...)
-
- 'datum -> (#QUOTE & datum)     dat1dat2   -> (#JOIN dat1 & dat2)
-
- `datum -> (#GRAVE & datum)     dat1.dat2  -> (#DOT dat1 & dat2)
-
- ,datum -> (#COMMA & datum)     dat1:dat2  -> (#COLON dat1 & dat2)
-
-Notes:
-
-* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
- means zero or more data.
-
-* The `#datum` form only applies when the datum following the hash sign is
- anything other than a bare string, since otherwise this would be ambiguous
- with a rune literal. A bare string can nevertheless follow the hash sign by
- separating the two with a backslash:
-
- #\string -> (#HASH & string)
-
-* Though not represented in the table due to notational difficulty, the form
- `#rune(...)` doesn't require a list in the second position; any datum that
- works with the `#datum` syntax also works with `#rune<DATUM>`.
-
- #rune1#rune2 -> (#rune1 & #rune2)
-
- #rune\string -> (#rune & string)
-
- #rune'string -> (#rune #QUOTE & string)
-
- #rune"string" -> (#rune #DQSTR & |string|)
-
- As a counter-example, following a rune immediately with a bare string isn't
- possible without the delimiting backslash, since that would be ambiguous:
-
- #abcdefgh ;Could be (#abcdef & gh) or (#abcde & fgh) or ...
-
-* Syntax sugar can combine arbitrarily. Some examples follow. Any of these may
- or may not actually have a meaning in code; many could simply end up producing
- an error during decoding, or later evaluation of code.
-
- #{...} -> (#HASH #BRACE ...)
-
- #'foo -> (#HASH #QUOTE & foo)
-
- ##'[...] -> (#HASH #HASH #QUOTE #SQUARE ...)
-
- {x y}[i j] -> (#JOIN (#BRACE x y) #SQUARE i j)
-
- foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)
-
-* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as
- `(#QUOTE & foo)`; a single pair with the quoted datum in the second position.
-
- The same principle is used when parsing other sugar; some examples follow:
-
- Incorrect                          Correct
-
- #(x y z) -> (#HASH (x y z))        #(x y z) -> (#HASH x y z)
-
- [x y z]  -> (#SQUARE (x y z))      [x y z]  -> (#SQUARE x y z)
-
- #{x}     -> (#HASH (#BRACE (x)))   #{x}     -> (#HASH #BRACE x)
-
- foo(x y) -> (#JOIN foo (x y))      foo(x y) -> (#JOIN foo x y)
-
-* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts
- further decoding of enclosed data. This is not so, since quoting is related
- to code evaluation, not decoding.
-
-### Datum labels
-
-Valid data cannot be cyclic, since that would mean it has infinite length in
-bytes. To externally represent a value with cyclic structure, one uses datum
-labels in the data encoding of the value.
-
-A datum label either wraps another datum to assign a number to it, or contains
-just a reference to a previous assignment.
-
- +------------------+------------------------------+
- | Syntax | Internal datum structure |
- +------------------+------------------------------+
- | #%<HEX>=<DATUM> | (#LABEL <NUMBER> & <DATUM>) |
- +------------------+------------------------------+
- | #%<HEX>% | (#LABEL & <NUMBER>) |
- +------------------+------------------------------+
-
-In this visual, the token `<HEX>` stands for a hexadecimal digit sequence, the
-token `<DATUM>` stands for any other datum, and `<NUMBER>` is a stand-in for a
-number value; that which is represented by `<HEX>`.
-
-For clarity, concrete examples follow:
-
- +-------------------+-------------------------------+
- | Byte sequence | Parse result |
- +-------------------+-------------------------------+
- | #%1234abcd=(foo) | (#LABEL <0x1234abcd> & (foo)) |
- +-------------------+-------------------------------+
- | #%1234abcd% | (#LABEL & <0x1234abcd>) |
- +-------------------+-------------------------------+
-
-Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
-with an integer value. Note that the decoder may not accept a bare string here,
-meaning this syntax sugar is not merely an abbreviation.
-
-### Shebang
-
-Finally, the parser recognizes the Unix *shebang* syntax and outputs a datum to
-hold the string values found within:
-
- #!interpreter -> (#SHBANG & interpreter)
-
- #!interpreter argline -> (#SHBANG interpreter & argline)
-
-When executing a script file, Zisp simply stores this into a global value that
-may be inspected if desired.
-
-
-<!--
-;; Local Variables:
-;; fill-column: 80
-;; End:
--->