Diffstat (limited to 'doc/c1')
| -rw-r--r-- | doc/c1/1-parse.md | 608 | ||||
| -rw-r--r-- | doc/c1/2-decode.md | 44 | ||||
| -rw-r--r-- | doc/c1/grammar/abnf.txt | 141 | ||||
| -rw-r--r-- | doc/c1/grammar/index.md | 115 | ||||
| -rw-r--r-- | doc/c1/grammar/peg.txt | 93 | ||||
| -rw-r--r-- | doc/c1/grammar/zbnf.txt | 77 | ||||
| -rw-r--r-- | doc/c1/index.md | 30 |
7 files changed, 0 insertions, 1108 deletions
diff --git a/doc/c1/1-parse.md b/doc/c1/1-parse.md deleted file mode 100644 index d4c4c2e..0000000 --- a/doc/c1/1-parse.md +++ /dev/null @@ -1,608 +0,0 @@ -# Parser for Code & Data - -<!--TOC--> - -Zisp s-expressions represent an extremely minimal set of data types; only that -which is necessary to strategically construct more complex values: - - +-------+---------+--------+----------+------+ - | TYPE | String | Rune | Pair | Nil | - +-------+---------+--------+----------+------+ - | E.G. | foobar | #name | (X & Y) | () | - +-------+---------+--------+----------+------+ - -The parser recognizes various *syntax sugar* which abbreviates verbose syntax, -and may result in special data structures (typically, a pair with a rune in its -first, and payload in its second position) which another Zisp component called -the *decoder* can transform into a rich set of value types. - -The most ubiquitous syntax sugar is the list, which abbreviates a sequence of -tail-linked pairs, terminated with a special nil value represented as `()`: - - (x) -> (x & ()) - - (x y) -> (x & (y & ())) - - (x y z) -> (x & (y & (z & ()))) - -The following are so-called *improper lists*, ending in a non-nil value: - - (x y & z) -> (x & (y & z)) - - (x y z & t) -> (x & (y & (z & t))) - -More details about syntax sugar, and the decoder, are explained later. - - -## Character Encoding - -The parser does not consume Unicode characters; it consumes bytes. Grammar is -generally constructed by bytes corresponding to ASCII characters. - -Some elements of the grammar, such as comments and quoted strings, may contain -arbitrary byte sequences, until terminated. These sequences may happen to be -valid UTF-8 text. This way, quoted strings and comments may contain Unicode -text encoded in UTF-8, but the parser does not check these for validity. 
- -Since comments and quoted strings may contain arbitrary byte sequences, a text -editor or other program displaying Zisp s-expressions may need to use a special -visual representation for bytes that don't represent valid text. - -The parser working on bytes rather than Unicode characters is not a limitation, -but rather a feature: It allows Zisp s-expressions to be used as a structured -data exchange format, which may contain binary data elements, without the need -to encode these in Base64 or other such text representations of binary data. -Consider the example: - - ((image.webp "<BINARY>") - (video.webm "<BINARY>")) - -All that needs to be done for this to work, is that any incidental occurrences -of the double-quote sign, and the backslash sign, are escaped with a backslash -within the `<BINARY>` data; all other bytes can appear verbatim in the strings. - - -## Stream Parsing - -The parser can be repeatedly invoked on a byte stream to consume the next datum -within. This does not require "unreading" or back-seeking within the stream; -the parser always reads a full datum, and stops after some byte which cleanly -terminates the currently parsed datum. - -This means Zisp s-expressions can be safely intermixed with other data within -the same byte stream. So long as the other data is consumed by some parser -which similarly stops reading at a clear boundary, the Zisp parser can then -continue operating on the same stream. Consider the example: - - ("image.webp" 8273) - - << 8273 bytes >> - - ("video.webm" 736) - - << 736 bytes >> - -The "header" for each file in this stream is a Zisp s-expression containing -information about how many bytes should be read after the header, before the -next file header appears. (The header data need to be terminated with a blank -ASCII character such as a newline; the closing parenthesis does not act as a -terminator unto itself due to the "join" syntax sugar.) 
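The interleaving of headers and raw payloads described above can be sketched in Python. This is an illustration only, not Zisp's parser: it assumes a toy header of the exact form `("name" N)` followed by a single blank, and the names `read_header` and `read_files` are hypothetical. It reads the header one byte at a time, so nothing past the terminating blank is consumed before the payload is read.

```python
import io
import re

# Toy header shape assumed for this sketch only: ("name" N)
HEADER = re.compile(rb'\("([^"\\]*)"[ \t]+([0-9]+)\)')


def read_header(stream):
    """Read one toy header byte by byte; return (name, size), or None at EOF."""
    buf = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            if buf:
                raise ValueError("stream ended inside a header")
            return None
        if not buf and b.isspace():
            continue  # skip blanks that precede the header
        buf += b
        if b == b")":
            break  # simplification: a ')' inside the name is not supported
    # The closing parenthesis does not terminate the datum on its own,
    # so exactly one blank is consumed after it.
    if stream.read(1) not in (b" ", b"\t", b"\n", b"\r"):
        raise ValueError("header must be followed by a blank")
    match = HEADER.fullmatch(bytes(buf))
    if match is None:
        raise ValueError("unsupported header in this sketch")
    return match.group(1).decode("ascii"), int(match.group(2))


def read_files(stream):
    """Alternate between header parsing and raw payload reads."""
    files = []
    while (header := read_header(stream)) is not None:
        name, size = header
        payload = stream.read(size)
        if len(payload) != size:
            raise ValueError("payload shorter than announced")
        files.append((name, payload))
    return files


data = b'("a.bin" 3)\n\x00)\xff\n("b.bin" 2)\n()'
files = read_files(io.BytesIO(data))
```

The payload bytes are never inspected by the header reader, so they may contain parentheses or bytes that are not valid text.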
- -To enable this stream parsing strategy, the parser does not use any automatic -buffering. If it did, it might inadvertently consume some bytes beyond the -currently parsed datum, leaving the stream inconsistent. - -If the parser is meant to be used on an input stream associated with expensive -system calls, such as a file handle or network socket, it's best to wrap that -stream in some intermediate object which asks the system for large chunks of -data at once, and stores the data in a buffer. - - -## Comments - -Two types of comment are supported: datum comments and line comments. - -* A semicolon followed by a tilde instructs the parser to consume one datum and - discard it. Whitespace may appear between the tilde and the datum to discard. - -* A semicolon, followed by a non-tilde byte, instructs the parser to consume and - discard bytes until a newline (ASCII Line Feed) is encountered. - - -## Value vs. Datum - -A Zisp *value* that has an *external representation* in the form of a sequence -of bytes is called a *datum*. Every datum is a value, but not every value is a -datum. In other words, a datum is a value that can be printed out as a byte -sequence which the parser can turn back into an equivalent datum. - -A value that is not a datum may nevertheless be *encoded* into one, allowing it -to have an external representation. After parsing, it needs to be *decoded* to -actually become the expected value. - -One may speak of an *external representation of a value* where the value is not -itself a datum, but can be encoded as one. The more strictly correct term for -this is: "The external representation of a datum that encodes the value." - -### Syntax sugar - -The parser recognizes various *syntax sugar* to abbreviate an equivalent datum -construction, or express a datum that encodes a more complex value. - -As an example, the expression `#(x y z)` is an abbreviation for the equivalent -`(#HASH x y z)`. 
These are two external representations for the same datum; -after parsing, both will yield values that are indistinguishable in all but -their memory address. - -An example of syntax sugar that is not a mere abbreviation is a quoted string -which contains bytes that could not appear in a *bare* string: - - "foo bar" -> (#DQUOTE & <STRING>) - -In this example, the visual token `<STRING>` represents the actual string value -in program memory, which has no direct external representation in bytes because -it contains a space character. - -Those familiar with Lisp and Scheme may expect bare strings to be parsed into a -separate type called *symbol* while quoted strings are parsed directly into a -string type, but this is not the case in Zisp. - -### Decoder - -The *decoder* transforms Zisp data into values of more complex types, including -values that are not of a datum type. - -Combined with syntax sugar, this allows Zisp to offer familiar syntax elements. -For example, the expression `#(x y z)` which parses into `(#HASH x y z)` can be -decoded into an array, so the result is similar to the vector syntax of Scheme. - -Decoding also resolves datum labels, goes over bare strings to find ones that -represent a number literal, and takes care of a number of other transforms. -This offloads complexity, allowing the parser to remain extremely simple. - -See the dedicated documentation of the [decoder](2-decode.html) for more. - - -## Data types - -Following is a more in-depth explanation of each data type constructed by the -Zisp s-expression parser. - -These are in fact value types, though the term "data type" is often used due to -familiarity. A Zisp value that is a member of one of the following value types -is only a *datum* if it adheres to additional constraints as explained below. - -### String - -Strings can appear *bare* or be quoted in various ways. 
A quoted string is in -fact parsed into a pair value with a rune in the first position to identify the -quotation variant that was parsed, and the string value in the second position; -or, in case of at-quoted strings, a special construct we will look at later. - - +-----------+-----------------------------+ - | Syntax | Parse output | - +-----------+-----------------------------+ - | |bytes| | (#PQSTR & <STRING>) | - +-----------+-----------------------------+ - | "bytes" | (#DQSTR & <STRING>) | - +-----------+-----------------------------+ - | @_bytes_ | (#ATSTR <BYTE> & <STRING>) | - +-----------+-----------------------------+ - -The visual token `<STRING>` denotes the actual string, as a Zisp value, in the -second position of the pair. The visual token `<BYTE>` stands for an integer -Zisp value between 0 and 255. - -These external representations of strings will be explained in more detail -further below, including backslash escape sequences allowed within, and how -exactly at-quoted strings work. - -Strings have a fixed length, counted in bytes. Each byte can have any value, -including zero (ASCII NUL). The parser reads bytes, not Unicode characters; a -string may contain UTF-8 byte sequences, but these are not tested for validity. - -A string that is up to 255 bytes long is automatically *interned*, meaning any -occurrence of the same string -- equal in length and containing the same byte -values -- ends up being represented by the same bit-pattern; either a memory -address, or an immediate representation within a CPU word for short strings. -The quotation method is inconsequential to this process; for example, while -`|foobar|` and `"foobar"` will parse into different pair values, the actual -string they hold will be the same one in program memory. - -Strings of length greater than 255 bytes are stored separately in memory, even -if they are equal in length and content. 
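The interning rule can be illustrated with a small Python sketch. It is a model under stated assumptions, not Zisp's implementation: object identity stands in for "the same bit-pattern", the immediate in-word representation of short strings is not modeled, and the names `ZString` and `StringTable` are hypothetical.

```python
INTERN_LIMIT = 255  # bytes; strings up to this length are interned, per the text


class ZString:
    """Stand-in for a Zisp string value; identity models the bit-pattern."""

    def __init__(self, data: bytes):
        self.data = data


class StringTable:
    """Returns one shared object per distinct short byte string."""

    def __init__(self):
        self._interned = {}

    def make(self, data: bytes) -> ZString:
        if len(data) > INTERN_LIMIT:
            # Long strings are stored separately, even when equal in content.
            return ZString(data)
        found = self._interned.get(data)
        if found is None:
            found = self._interned[data] = ZString(data)
        return found


table = StringTable()
short_a = table.make(b"foobar")   # e.g. from |foobar|
short_b = table.make(b"foobar")   # e.g. from "foobar"
long_a = table.make(b"x" * 256)
long_b = table.make(b"x" * 256)
```

In this model `short_a` and `short_b` are the same object regardless of how the string was quoted, while `long_a` and `long_b` are distinct objects with equal contents.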
- -### Rune - -A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must -begin with a letter, and may only contain letters and digits. This character -sequence of letters and digits is called the *name* of the rune. A rune that -follows this constraint is valid as a datum. - -Zisp code may explicitly construct values of the rune type that violate the -above constraints. Such runes are not valid data and cannot be printed or -parsed. - -Runes are case-sensitive, and the parser always emits runes using upper-case -letters when expressing syntax sugar. Uppercase rune names are reserved for -Zisp's internal use and standard library; users can use lowercase runes with -custom meaning without worrying about clashes, with the exception of a small -number of lowercase runes such as `#true` and `#false` that are part of the -default decoder settings and documented explicitly as such. - -Runes are always stored directly in a CPU word; never by memory address. - -### Pair - -A pair is a tuple of two values: the first value and the second value. In Lisp -tradition, these are also called the `car` and `cdr` of the pair, respectively. - -The parser allocates a unique two-word cell in program memory for every pair, -and represents that pair through the memory address of the cell. - -Pairs are valid data if one of the following holds true: - -* The pair encodes a quoted string, datum label, or shebang line. - -* Both the first and second value in the pair is a valid datum. - -Further, a structure of nested pair values may not contain cyclic references -back up in the structure (which would make the above definition diverge into -infinity). Such cycles must be broken up with datum labels, or else the pair -cannot be considered a datum, since it cannot be printed or parsed. - -### Nil - -The Zisp nil value is a singleton and a datum. 
There is exactly one nil value -and it is used to terminate a chain of pairs representing a list of values; it -has the external representation `()`. - - -## Quoted strings - -Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted. -This section goes into the details of each variant. - -### Pipe-quoted - -Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers -the parser to generate a pair with the structure: - - (#PQSTR & <STRING>) ;; <STRING> is visual aid, not syntax - -The decoder, using default settings, would emit this string verbatim as a value. -Then, during code evaluation, this would be seen as an identifier. In this way, -pipe-quoted strings are equivalent to bare strings in functionality. - -It is important to understand that the decoder sits between the parser and the -[evaluator](3-execute.html), and in opposition to Lisp and Scheme tradition, it -is common for the evaluator to receive values that are not valid as a datum; in -this case, a string unto itself that may not be a valid datum, due to not being -possible to be represented as a bare string. Yet, it is valid as an identifier -for the purposes of the evaluator, since it is a string *value* like any other. - -### Double-quoted - -Strings wrapped in the double-quote symbol parse into: - - (#DQSTR & <STRING>) ;; <STRING> is visual aid, not syntax - -Under default settings, the decoder would transform this into a value which, -when evaluated as code, simply yields the contained string as a value. - -### At-quoted - -This is a special type of syntax for "raw" strings, meaning that no backslash -escapes nor any other kind of escape sequence are recognized within them. - -The syntax begins with an at sign, followed by any byte. That byte becomes a -termination marker, and the string cannot contain an occurrence of it, since -there are no escape sequences. 
- - @"foo \ bar" -> (#ATSTR <BYTE> & <STRING>) - -In the above, the visual tokens `<BYTE>` and `<STRING>` represent an integer -value and a string value, respectively. In this example, the integer value -would be 34; the ASCII value for the double-quote sign. The string value -contains a literal backslash, since there is no backslash escape parsing. - -This style of quoting can be useful, for instance, when representing regular -expressions as strings in code: - - ;; Matches e.g. foo\bar.["blah"] - - @/^foo\\(bar|baz)\.\[".*"\]$/ - -Were it not for this syntax, this regular expression would only be possible to -represent through a quoted string such as the following: - - ;; Same as above, but so many backslashes - - "^foo\\\\(bar|baz)\\t\\[\".*\"\\]$" - -The byte that follows the at sign need not be a printable character or even a -valid ASCII byte; it can be absolutely any byte value, even NUL. This can be -useful to easily encode binary data which is known to not contain a specific -byte; an example would be C strings which cannot contain NUL. - -### Backslash escapes - -In pipe-quoted and double-quoted strings, the following ASCII characters may -follow a backslash to insert a certain character. 
- - +-------+----------------------------+ - | Char | Meaning | - +-------+----------------------------+ - | \ | Literal backslash | - +-------+----------------------------+ - | | | Literal pipe symbol | - +-------+----------------------------+ - | " | Literal double-quote | - +-------+----------------------------+ - | 0 | ASCII NUL | - +-------+----------------------------+ - | a | ASCII Alert | - +-------+----------------------------+ - | b | ASCII Backspace | - +-------+----------------------------+ - | t | ASCII Tab (Horizontal) | - +-------+----------------------------+ - | n | ASCII Newline (Line Feed) | - +-------+----------------------------+ - | v | ASCII Vertical Tab | - +-------+----------------------------+ - | f | ASCII Form Feed | - +-------+----------------------------+ - | r | ASCII Carriage Return | - +-------+----------------------------+ - | e | ASCII Escape | - +-------+----------------------------+ - -In words: - -* A backslash followed by a backslash, pipe, or double-quote character is - substituted with a literal occurrence of that character. - -* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the - C programming language, representing common ASCII control characters. - -Further, the following Regular Expression patterns following a backslash have -special meaning. - - +---------------------+-----------------------+ - | Regular Expression | Meaning | - +---------------------+-----------------------+ - | [\t ]*\n[\t ]* | Discarded | - +---------------------+-----------------------+ - | x([0-9a-fA-F]{2})*; | Arbitrary bytes | - +---------------------+-----------------------+ - | u[0-9a-fA-F]+; | Unicode Scalar Value | - +---------------------+-----------------------+ - -Explanations: - -* A backslash followed by any number of blanks (space or tab), a newline, and - again any number of blanks, is substituted with nothing. This is to allow - splitting a string into multiple lines for human readability. 
- - (define p "This paragraph has been visually split into multiple \ - lines, but the newline is escaped, so it's one line.") - -* An x, followed by pairs of hexadecimal digits (case insensitive), terminated - by a semicolon, is substituted with the sequence of bytes represented by the - corresponding pairs of hexadecimal digits. E.g.: `"foo\xDEADBEEF;bar"` - -* A u, followed by a hexadecimal digit sequence (case insensitive), terminated - by a semicolon, is substituted with the canonical UTF-8 byte sequence for the - Unicode Scalar Value represented by that hexadecimal number. The number must - be in the range `0` to `10FFFF`. E.g.: `"foo\u00A0;bar"` - -### Newlines in strings - -Normally, a newline in a string has no special meaning and simply becomes part -of the string. However, newlines can be backslash-escaped, which simple erases -them; the escaped newline can also be preceded or followed by any number of tab -and space characters, which are all stripped as well. (Note: It's not blanks -preceding the backslash that are stripped, but blanks following the backslash -and preceding the newline; i.e., blanks at the end of the line.) - -Following are some examples of how multi-line strings can appear in source code -with different intentions and meanings: - - (define paragraph "This paragraph has been visually split into multiple \ - lines, but the newlines are escaped, so it's one line.") - - (define json-object '| ;; use '|| so double-quotes need no escaping - { - "key": "value" - } - |) - -The second example is actually slightly problematic. It begins with a newline, -which may be undesirable, but escaping that newline would cause the first line -to have no indentation, thus the opening `{` would not line up with the closing -`}` when this string is printed out. Further, if the entire block of code is -indented, then the string contents may be more indented than intended. (No pun -or rhyme intended.) 
Consider: - - (let ((foo one)) - (let ((bar two)) - (let ((json-object '| - { - "key": "value" - } - |)) - (do-whatever)))) - -The string bound to `json-object` has redundant indentation. Should the parser -attempt to solve this issue? - -Thankfully, we have the decoder to handle such complexities. Under the default -settings, the rune `#HASH` is bound to a decoder rule which detects a payload -value that is a string literal, and implements the same algorithm as seen in -Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378) - -Thus, we can do the following: - - (let ((foo one)) - (let ((bar two)) - (let ((json-object #| - ........... { - ........... "key": "value" - ........... } - ...........|)) - (do-whatever)))) - -(Dots represent whitespace that is deleted. The initial newline is, as well.) - -The only feature Zisp does not offer is a way to fence off multi-line strings -with a longer token such as `"""` as seen in Python and Java, or an arbitrary -word as seen in Bourne shell and PHP "here doc" syntax. - -However, if a programmer truly wanted to have arbitrary text blocks in code, -without needing to escape anything in them, it's possible to abuse at-quoted -string syntax, using it with an ASCII control character which is displayed -visibly by a text editor. In the following, the characters `^\` are meant to -represent a literal ASCII File Separator character in the source code: - - (define json-object #@^\ - { - "key": "value" - } - ^\) - -It works fine in Emacs, so why not? Use `C-q C-\` to insert the `^\`. - -This is indeed quite an eldritch syntax, but hopefully most programs would not -need to use it. - - -## Other syntax - -The following table summarizes commonly useful syntax abbreviations: - - [...] -> (#SQUARE ...) #datum -> (#HASH & datum) - - {...} -> (#BRACE ...) #rune(...) -> (#rune ...) 
- - 'datum -> (#QUOTE & datum) dat1dat2 -> (#JOIN dat1 & dat2) - - `datum -> (#GRAVE & datum) dat1.dat2 -> (#DOT dat1 & dat2) - - ,datum -> (#COMMA & datum) dat1:dat2 -> (#COLON dat1 & dat2) - -Notes: - -* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis - means zero or more data. - -* The `#datum` form only applies when the datum following the hash sign is - anything other than a bare string, since otherwise this would be ambiguous - with a rune literal. A bare string can nevertheless follow the hash sign by - separating the two with a backslash: - - #\string -> (#HASH & string) - -* Though not represented in the table due to notational difficulty, the form - `#rune(...)` doesn't require a list in the second position; any datum that - works with the `#datum` syntax also works with `#rune<DATUM>`. - - #rune1#rune2 -> (#rune1 & #rune2) - - #rune\string -> (rune & string) - - #rune'string -> (#rune #QUOTE & string) - - #rune"string" -> (#rune #DQSTR & |string|) - - As a counter-example, following a rune immediately with a bare string isn't - possible without the delimiting backslash, since that would be ambiguous: - - #abcdefgh ;Could be (#abcdef & gh) or (#abcde & fgh) or ... - -* Syntax sugar can combine arbitrarily. Some examples follow. Any of these may - or may not actually have a meaning in code; many could simply end up producing - an error during decoding, or later evaluation of code. - - #{...} -> (#HASH #BRACE ...) - - #'foo -> (#HASH #QUOTE & foo) - - ##'[...] -> (#HASH #HASH #QUOTE #SQUARE ...) - - {x y}[i j] -> (#JOIN (#BRACE x y) #SQUARE i j) - - foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y) - -* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as - `(#QUOTE & foo)`; a single pair with the quoted datum in the second position. 
- - The same principle is used when parsing other sugar; some examples follow: - - Incorrect Correct - - #(x y z) -> (#HASH (x y z)) #(x y z) -> (#HASH x y z) - - [x y z] -> (#SQUARE (x y z)) [x y z] -> (#SQUARE x y z) - - #{x} -> (#HASH (#BRACE (x))) #{x} -> (#HASH #BRACE x) - - foo(x y) -> (#JOIN foo (x y)) foo(x y) -> (#JOIN foo x y) - -* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts - further decoding of enclosed data. This is not so, since quoting is related - to code evaluation, not decoding. - -### Datum labels - -Valid data cannot be cyclic, since that would mean it has infinite length in -bytes. To externally represent a value with cyclic structure, one uses datum -labels in the data encoding of the value. - -A datum label either wraps another datum to assign a number to it, or contains -just a reference to a previous assignment. - - +------------------+------------------------------+ - | Syntax | Internal datum structure | - +------------------+------------------------------+ - | #%<HEX>=<DATUM> | (#LABEL <NUMBER> & <DATUM>) | - +------------------+------------------------------+ - | #%<HEX>% | (#LABEL & <NUMBER>) | - +------------------+------------------------------+ - -In this visual, the token `<HEX>` stands for a hexadecimal digit sequence, the -token `<DATUM>` stands for any other datum, and `<NUMBER>` is a stand-in for a -number value; that which is represented by `<HEX>`. - -For clarity, concrete examples follow: - - +-------------------+-------------------------------+ - | Byte sequence | Parse result | - +-------------------+-------------------------------+ - | #%1234abcd=(foo) | (#LABEL <0x1234abcd> & (foo)) | - +-------------------+-------------------------------+ - | #%1234abcd% | (#LABEL & <0x1234abcd>) | - +-------------------+-------------------------------+ - -Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type -with an integer value. 
Note that the decoder may not accept a bare string here, -meaning this syntax sugar is not merely an abbreviation. - -### Shebang - -Finally, the parser recognizes the Unix *shebang* syntax and outputs a datum to -hold the string values found within: - - #!interpreter -> (#SHBANG & interpreter) - - #!interpreter argline -> (#SHBANG interpreter & argline) - -When executing a script file, Zisp simply stores this into a global value that -may be inspected if desired. - - -<!-- -;; Local Variables: -;; fill-column: 80 -;; End: ---> diff --git a/doc/c1/2-decode.md b/doc/c1/2-decode.md deleted file mode 100644 index 379c74b..0000000 --- a/doc/c1/2-decode.md +++ /dev/null @@ -1,44 +0,0 @@ -# Decoding - -A separate process called "decoding" can transform simple data structures, -consisting of only the base datum types, into a richer set of Zisp types. - -For example, the decoder may turn `(#HASH ...)` into a vector, as one would -expect a vector literal like `#(...)` to work in Scheme. Bytevector syntax -could use a custom rune as a list prefix, like: `#u8(...)` - -Runes may be decoded in isolation as well, rather than transforming a list -whose head they appear in. This can implement Boolean constants as `#true` -and `#false` or `#t` and `#f`. - -The decoder recognizes `(#QUOTE ...)` to aid in implementing the traditional -quoting mechanism of Lisp/Scheme, but with a significant difference: - -Traditional quote is "unhygienic" in Scheme terms. An expression such as -`'(foo bar)` will always be read as `(quote (foo bar))` regardless of what -lexical context it appears in, so the semantics will depend on whatever the -identifier `quote` is bound to, meaning that the expression may end up -evaluating to something other than the list `(foo bar)`. 
- -The Zisp decoder, which transforms not datum to datum, but object to object, -can turn `#QUOTE` into an object which encapsulates the notion of quoting, -which the Zisp evaluator can recognize and act upon, ensuring that an -expression like `'(foo bar)` always turns into the list `(foo bar)`. - -One way to think about this, in Scheme (R6RS / syntax-case) terms, is that -expressions like `'(foo bar)` turn directly into a syntax object when read, -and the created syntax object begins with an identifier bound to `quote` in -the standard library. - -The decoder is, of course, configurable and extensible. The transformations -mentioned above would be performed only when it's told to decode data which -represents Zisp code. The decoder may be given a different configuration, -telling it to decode, for example, data which represents a different kind of -domain-specific data, such as application settings, build system commands, -complex data records with non-standard data types, and so on. - -<!-- -;; Local Variables: -;; fill-column: 77 -;; End: ---> diff --git a/doc/c1/grammar/abnf.txt b/doc/c1/grammar/abnf.txt deleted file mode 100644 index aa67646..0000000 --- a/doc/c1/grammar/abnf.txt +++ /dev/null @@ -1,141 +0,0 @@ -; Standards-compliant ABNF (RFC 5234, RFC 7405) - -; Compatible with: https://www.quut.com/abnfgen/ - -; Unlike PEG, grammar rules in BNF are non-deterministic, which makes -; it much more challenging to express our naive parse logic. Whether -; this ABNF file is truly accurate is difficult to assess. - -; The abnfgen(1) tool linked above can be used to generate arbitrary -; strings matching the grammar in this file. These can be fed into -; the Zisp parser to reveal some potential bugs; either in the parser -; itself, or this ABNF grammar. - -; Note that the tool may generate Zisp string literals with Unicode -; escape sequences corresponding to surrogate code points; the parser -; may reject these. 
This is expected; it's difficult to rewrite this -; ABNF grammar to exclude those Unicode values. - -; Other minor inaccuracies that aren't important include: This ABNF -; forces line comments to be terminated with an LF character, when in -; fact the end-of-file may also terminate them; the same applies to -; hash-bang parsing which doesn't actually have to end in LF. These -; discrepancies won't make abnfgen(1) generate invalid strings; they -; only make this ABNF more strict than the Zisp parser, so it won't -; generate some strings that the parser would actually accept. - - -Stream = [ Unit *( Blank Unit ) ] *Blank [Trail] - - -Unit = *Blank Datum - -Blank = HTAB / LF / %x0b / %x0c / CR / SP / Comment - -Trail = SkipLine / SkipUnit / ";" "~" *Blank - - -Datum = BareString / SpecialStr / CladDatum / Rune / RuneStr - / RuneDotStr / RuneClad / LabelRef / LabelDef / HashStr - / HashDotStr / HashClad / QuoteExpr / JoinExpr - -Comment = SkipLine LF / SkipUnit Blank - -SkipLine = ";" [ SkipLStart *AnyButLF ] - -SkipUnit = ";" "~" Unit - -SkipLStart = %x00-09 / %x0b-7d / %x7f-ff ; any but LF or "~" - -AnyButLF = %x00-09 / %x0b-ff - - -BareString = BareChar *( BareChar / Numeric ) - -SpecialStr = SpecStrChar *( SpecStrChar / BareChar ) - -CladDatum = "|" *( PipeStrChar / "\" StringEsc ) "|" - / DQUOTE *( QuotStrChar / "\" StringEsc ) DQUOTE - / "(" List ")" - / "[" List "]" - / "{" List "}" - -Rune = "#" RuneName - -RuneStr = "#" RuneName "\" BareString - -RuneDotStr = "#" RuneName "\" SpecialStr - -RuneClad = "#" RuneName CladDatum - -HashBang = "#" "!" *( SP / HTAB ) HBLine LF - -LabelRef = "#" "%" Label "%" - -LabelDef = "#" "%" Label "=" Datum - -HashStr = "#" "\" BareString - -HashDotStr = "#" "\" SpecialStr - -HashClad = "#" CladDatum - -QuoteExpr = "'" Datum - / "`" Datum - / "," Datum - -JoinExpr = Datum RJoinDatum - / LJoinDatum NoStartDot - / Datum ":" Datum - / NoEndDot "." Datum - - -BareChar = "!" / "$" / "%" / "*" / "/" / "<" / "=" / ">" - / "?" 
/ "^" / "_" / "~" / ALPHA - -Numeric = "+" / "-" / DIGIT - -SpecStrChar = "." / ":" / Numeric - -PipeStrChar = %x00-5b / %x5d-7b / %x7d-ff ; any but "|" or "\" - -QuotStrChar = %x00-21 / %x23-5b / %x5d-ff ; any but DQUOTE or "\" - -StringEsc = "\" / "|" / DQUOTE / *( HTAB / SP ) LF *( HTAB / SP ) - / %s"a" / %s"b" / %s"t" / %s"n" - / %s"v" / %s"f" / %s"r" / %s"e" - / %s"x" *( 2HEXDIG ) ";" - / %s"u" ["0"] 1*5HEXDIG ";" - / %s"u" "1" "0" 4HEXDIG ";" - -List = [ Unit *( Blank Unit ) ] *Blank [Tail] [SkipUnit] - -Tail = "&" Unit *Blank - - -RuneName = ALPHA *5( ALPHA / DIGIT ) - -Label = 1*12( HEXDIG ) - -HBLine = 1*HBChar [ 1*( SP / HTAB ) *HBChar ] - -HBChar = %x00-08 / %x0b-1f / %x21-ff ; any but HT, LF, SP - - -RJoinDatum = CladDatum / Rune / RuneStr / RuneDotStr / RuneClad - / LabelRef / LabelDef / HashStr / HashDotStr / HashClad - / QuoteExpr - -LJoinDatum = CladDatum / RuneClad / LabelRef / HashClad - -NoStartDot = BareString / CladDatum / Rune / RuneStr / RuneDotStr - / RuneClad / LabelRef / LabelDef / HashStr / HashDotStr - / HashClad / QuoteExpr - -NoEndDot = BareString / Rune / RuneStr / RuneClad / LabelRef - / HashStr / HashClad - - -;; Local Variables: -;; eval: (flyspell-mode -1) -;; End: diff --git a/doc/c1/grammar/index.md b/doc/c1/grammar/index.md deleted file mode 100644 index e3716ea..0000000 --- a/doc/c1/grammar/index.md +++ /dev/null @@ -1,115 +0,0 @@ -# Zisp S-Expression Grammar - -The grammar is available in several different formats: - -* [ZBNF](zbnf.txt): See below for the rules of this notation -* [ABNF](abnf.txt): Compatible with the `abnfgen` tool -* [PEG](peg.txt): Compatible with `peg/leg` tool - - -## ZBNF notation - -The ZBNF grammar specification uses a BNF-like notation with PEG-like -semantics: - -* Concatenation of expressions is implicit: `foo bar` means `foo` - followed by `bar`. - -* Parentheses are used for grouping, and the pipe symbol `|` is used - for alternatives. 
- -* The suffixes `?`, `*`, and `+` have the same meaning as in regular - expressions, although `[foo]` is used in place of `(foo)?`. - -* The syntax is defined in terms of bytes, not characters. Terminals - `'c'` and `"c"` refer to the ASCII value of the given character `c`. - Standard C escape sequences are supported. - -* The prefix `~` means NOT. It only applies to rules that match one - byte, and negates them. For example, `~( 'a' | 'b' )` matches any - byte other than 'a' and 'b'. - -* Ranges of terminal values are expressed as `x...y` (inclusive). - -* ABNF "core rules" like `ALPHA` and `HEXDIG` are supported. - -* There is no ambiguity, or look-ahead / backtracking beyond one byte. - Rules match left to right, depth-first, and greedy. As soon as the - input matches the first terminal of a rule --explicit or implied by - recursively descending into the first non-terminal-- it must match - that rule to the end or a syntax error is reported. - -The last point makes the notation simple to translate to code. - - -## Limitations outside the grammar - -The following limits are not represented in the grammar: - -* A `UnicodeSV` is the hexadecimal representation of a Unicode scalar - value; it must represent a value in the range 0 to D7FF, or E000 to - 10FFFF, inclusive. Any other value signals an error. Valid values - are converted into a UTF-8 byte sequence encoding the value. - -* A `Rune` longer than 6 bytes is grammatical, but signals an error. - This is important because runes are not self-terminating; defining - their grammar as ending after a maximum of 6 bytes would allow - another datum beginning with an alphabetic character to follow a - rune immediately without any visual delineation, which would be - terribly confusing for a human reader. Consider: `#foobarbaz`. - This would parse as a `Datum` joining `#foobar` and `baz`. - - (The ABNF does not suffer from this issue, since it explicitly - enumerates the join possibilities anyway.) 
- -* A `Label` is the hexadecimal representation of a 48-bit integer, - meaning it allows for a maximum of 12 hexadecimal digits. Longer - values are grammatical, but signal an out-of-range error, so as to - avoid signaling a confusing "invalid character" error on input that - appears grammatical. Consider: `#%123456789abcd=foo`. This would - signal an invalid character error at the letter `d` if the grammar - limited a `Label` to 12 hexadecimal digits. - - (As above, the ABNF doesn't care about this. You probably don't - want to use the ABNF to generate a parser anyway.) - - -## At-quoted strings - -The mechanism of at-quoted strings is not represented in any of the -grammars, since it essentially has 256 variants. Representing it -sanely in a grammar requires the ability to save and reference -variables. - - -## Stream-parsing strategy - -The parser consumes one `Unit` from the input stream every time it's -called; it returns the `Datum` therein if found, or else it returns -the Zisp EOF token. - -Since a `Datum` is not self-terminating, the parser must read beyond -it to realize that it has ended (if not followed by the EOF). Thus, -it will consume one more `Blank` following the `Unit` that it parsed. -If this `Blank` is a comment, it will be consumed entirely, ensuring -that parsing resumes properly on a subsequent parser call on the same -input stream, without needing to store any state in between. - -Since comments of type `SkipUnit` are likewise not self-terminating, -an arbitrary number of chained `SkipUnit` comments may need to be -consumed before the parser is finally allowed to return. - -The following illustration shows the positions at which the parser -will stop consuming input when called repeatedly on the same input -stream. The dots represent the extent of each `Unit` being parsed, -while the caret points at the last byte the parser will consume in -that parse cycle. - -``` -foo (bar)[baz] foo;~bar foo;~bar;~baz;~bat foobar -...^..........^... ^... 
^......^ -``` - -Notice how, in the fourth cycle, the parser is forced to consume all -commented-out units before it can return, since it would otherwise -leave the stream in an inappropriate state. diff --git a/doc/c1/grammar/peg.txt b/doc/c1/grammar/peg.txt deleted file mode 100644 index 7b28a99..0000000 --- a/doc/c1/grammar/peg.txt +++ /dev/null @@ -1,93 +0,0 @@ -# Standard PEG notation - -Stream <- Unit ( Blank Unit )* !. - - -Unit <- Blank* Datum - -Blank <- [\t-\r ] / Comment - - -Datum <- OneDatum ( JoinChar? OneDatum )* - -JoinChar <- '.' / ':' - - -Comment <- ';' ( SkipUnit / SkipLine ) - -SkipUnit <- '~' Unit - -SkipLine <- (!'\n' .)* '\n'? - - -OneDatum <- BareString / CladDatum - - -BareString <- SpecBareChar ( BareChar / JoinChar )* - / BareChar+ - -SpecBareChar <- '+' / '-' / JoinChar / DIGIT - -BareChar <- ALPHA / DIGIT - / '!' / '$' / '%' / '*' / '+' / '-' / '/' - / '<' / '=' / '>' / '?' / '^' / '_' / '~' - - -CladDatum <- PipeStr / QuoteStr / HashExpr / QuoteExpr / List - -PipeStr <- '|' ( PipeStrChar / '\' StringEsc )* '|' -QuoteStr <- '"' ( QuotStrChar / '\' StringEsc )* '"' -HashExpr <- '#' HashExprs -QuoteExpr <- "'" Datum / '`' Datum / ',' Datum -List <- ParenList / SquareList / BraceList - - -PipeStrChar <- (![|\\] .) -QuotStrChar <- (!["\\] .) - -StringEsc <- '\' / '|' / '"' / ( HTAB / SP )* LF ( HTAB / SP )* - / '0' / 'a' / 'b' / 't' / 'n' / 'v' / 'f' / 'r' / 'e' - / 'x' HexByte* ';' - / 'u' UnicodeSV ';' - -HexByte <- HEXDIG HEXDIG -UnicodeSV <- HEXDIG+ - - -HashExprs <- '!' [\t ]* HBangLine '\n'? - / '%' Label ( '%' / '=' Datum ) - / '\' BareString / CladDatum - / Rune ( '\' BareString / CladDatum )? - -HBangLine <- HBChars+ [\t ]* ( HBChars+ )? -HBChars <- (![\t\n ] .) -Label <- HEXDIG+ -Rune <- ALPHA ( ALPHA / DIGIT )* - - -ParenList <- '(' ListBody ')' -SquareList <- '[' ListBody ']' -BraceList <- '{' ListBody '}' - -ListBody <- Unit* ( Blank* '&' Unit )? 
Blank* - - -DIGIT <- [0-9] -ALPHA <- [a-zA-Z] -HEXDIG <- [0-9a-fA-F] - - -# Keep this in sync line-for-line with the ZBNF grammar for easy -# comparison between the two. - -# This file is meant to be compatible with: -# https://piumarta.com/software/peg - -# Due to a quirk in the peg tool this file is used with, the grammar -# must not allow an empty stream. Therefore, the Unit rule has its -# Datum declared as mandatory rather than optional. - - -# Local Variables: -# eval: (flyspell-mode -1) -# End: diff --git a/doc/c1/grammar/zbnf.txt b/doc/c1/grammar/zbnf.txt deleted file mode 100644 index 923ac83..0000000 --- a/doc/c1/grammar/zbnf.txt +++ /dev/null @@ -1,77 +0,0 @@ -; Custom notation with PEG semantics - -Stream : Unit ( Blank Unit )* - - -Unit : Blank* [Datum] - -Blank : '\t'...'\r' | SP | Comment - - -Datum : OneDatum ( [JoinChar] OneDatum )* - -JoinChar : '.' | ':' - - -Comment : ';' ( SkipUnit | SkipLine ) - -SkipUnit : '~' Unit - -SkipLine : ( ~LF )* [LF] - - -OneDatum : BareString | CladDatum - - -BareString : SpecBareChar ( BareChar | JoinChar )* - | BareChar+ - -SpecBareChar : '+' | '-' | JoinChar | DIGIT - -BareChar : ALPHA | DIGIT - | '!' | '$' | '%' | '*' | '+' | '-' | '/' - | '<' | '=' | '>' | '?' | '^' | '_' | '~' - - -CladDatum : PipeStr | QuoteStr | HashExpr | QuoteExpr | List - -PipeStr : '|' ( PipeStrChar | '\' StringEsc )* '|' -QuoteStr : '"' ( QuotStrChar | '\' StringEsc )* '"' -HashExpr : '#' HashExprs -QuoteExpr : "'" Datum | '`' Datum | ',' Datum -List : ParenList | SquareList | BraceList - - -PipeStrChar : ~( '|' | '\' ) -QuotStrChar : ~( '"' | '\' ) - -StringEsc : '\' | '|' | '"' | ( HTAB | SP )* LF ( HTAB | SP )* - | '0' | 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e' - | 'x' HexByte* ';' - | 'u' UnicodeSV ';' - -HexByte : HEXDIG HEXDIG -UnicodeSV : HEXDIG+ - - -HashExprs : '!' 
( SP | HTAB )* HBangLine [ LF ] - | '%' Label ( '%' | '=' Datum ) - | '\' BareString | CladDatum - | Rune [ '\' BareString | CladDatum ] - -HBangLine : HBChars+ ( SP | HTAB )* [ HBChars+ ] -HBChars : ~( SP | HTAB | LF ) -Label : HEXDIG+ -Rune : ALPHA ( ALPHA | DIGIT )* - - -ParenList : '(' ListBody ')' -SquareList : '[' ListBody ']' -BraceList : '{' ListBody '}' - -ListBody : Unit* [ Blank* '&' Unit ] Blank* - - -;; Local Variables: -;; eval: (flyspell-mode -1) -;; End: diff --git a/doc/c1/index.md b/doc/c1/index.md deleted file mode 100644 index 6cec369..0000000 --- a/doc/c1/index.md +++ /dev/null @@ -1,30 +0,0 @@ -# Chapter 1: Genesis - -This chapter goes through the processes involved in reading source -code, running it, and optionally compiling it. - -1. [Parse](1-parse.html) (see also [grammar](grammar/)) - - The parser receives a stream of bytes and transforms them into a - minimal set of data types with very little processing. - -2. [Decode](2-decode.html) - - The decoder runs configurable and extensible pre-processing steps - over data received from the parser, enriching it with more complex - data types, and handling primitive source code transforms. It's - comparable to the C pre-processor or Lisp's `DEFMACRO` mechanism, - with a few more responsibilities, such as number literal parsing. - -3. [Execute](3-execute.html) - - Code is executed (or interpreted, or evaluated) in an environment, - also called a module, which may be mutated, and linked with other - modules. Execution is immediate, without any pre-compilation. - -4. [Compile](4-compile.html) - - Procedures from within the compiler module can be used to demand - the compilation of other modules, with various options, yielding - static or dynamic object files. These may be loaded immediately, - replacing the previously uncompiled module code in memory. |
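The "Limitations outside the grammar" section of the deleted `grammar/index.md` describes three checks that happen after a token has matched grammatically: the range of a `UnicodeSV`, the length of a `Rune`, and the size of a `Label`. The following is a minimal Python sketch of those checks, not code from the Zisp source. The function names, the use of `ValueError` to signal an error, and the choice to count leading zeros toward the 12-digit `Label` limit are all assumptions made for illustration.

```python
import string

ALPHA = set(string.ascii_letters)
DIGIT = set(string.digits)
HEXDIG = set(string.hexdigits)


def decode_unicode_sv(hex_text: str) -> bytes:
    """Hypothetical helper: UnicodeSV payload (HEXDIG+) to UTF-8 bytes."""
    if not hex_text or any(c not in HEXDIG for c in hex_text):
        raise ValueError("UnicodeSV must be one or more hex digits")
    value = int(hex_text, 16)
    # Scalar values are 0..D7FF and E000..10FFFF (surrogates excluded).
    if not (value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF):
        raise ValueError("not a Unicode scalar value")
    return chr(value).encode("utf-8")


def check_rune_name(name: str) -> str:
    """Hypothetical helper: validate the name that follows '#'."""
    if not name or name[0] not in ALPHA:
        raise ValueError("rune name must start with a letter")
    if any(c not in ALPHA | DIGIT for c in name[1:]):
        raise ValueError("rune name must be alphanumeric")
    # Grammatical at any length, but longer than 6 bytes is an error.
    if len(name) > 6:
        raise ValueError("rune name longer than 6 bytes")
    return name


def parse_label(hex_text: str) -> int:
    """Hypothetical helper: Label payload (HEXDIG+) to a 48-bit integer."""
    if not hex_text or any(c not in HEXDIG for c in hex_text):
        raise ValueError("label must be one or more hex digits")
    # Assumption: the limit is applied to the digit count, so leading
    # zeros count too. Twelve hex digits always fit in 48 bits.
    if len(hex_text) > 12:
        raise ValueError("label out of range (more than 48 bits)")
    return int(hex_text, 16)
```

With these definitions, `check_rune_name("foobarbaz")` and `parse_label("123456789abcd")` both raise an error after consuming the whole token, which mirrors the document's reasoning for keeping over-long runes and labels grammatical rather than cutting them off at the limit.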
