# Zisp S-Expression Grammar

The grammar is available in several different formats:

* [ZBNF](grammar.zbnf.txt): See below for the rules of this notation
* [ABNF](grammar.abnf.txt): Compatible with the `abnfgen` tool
* [PEG](grammar.peg.txt): Compatible with `peg/leg` tool

## ZBNF notation

The ZBNF grammar specification uses a BNF-like notation with PEG-like
semantics:

* Concatenation of expressions is implicit: `foo bar` means `foo`
  followed by `bar`.
* Parentheses are used for grouping, and the pipe symbol `|` is used
  for alternatives.
* The suffixes `?`, `*`, and `+` have the same meaning as in regular
  expressions, although `[foo]` is used in place of `(foo)?`.
* The syntax is defined in terms of bytes, not characters. Terminals
  `'c'` and `"c"` refer to the ASCII value of the given character `c`.
  Standard C escape sequences are supported.
* The prefix `~` means NOT. It only applies to rules that match one
  byte, and negates them. For example, `~( 'a' | 'b' )` matches any
  byte other than 'a' and 'b'.
* Ranges of terminal values are expressed as `x...y` (inclusive).
* ABNF "core rules" like `ALPHA` and `HEXDIG` are supported.
* There is no ambiguity, or look-ahead / backtracking beyond one byte.
  Rules match left to right, depth-first, and greedy. As soon as the
  input matches the first terminal of a rule --explicit or implied by
  recursively descending into the first non-terminal-- it must match
  that rule to the end or a syntax error is reported.

The last point makes the notation simple to translate to code.

## Limitations outside the grammar

The following limits are not represented in the grammar:

* A `UnicodeSV` is the hexadecimal representation of a Unicode scalar
  value; it must represent a value in the range 0 to D7FF, or E000 to
  10FFFF, inclusive. Any other value signals an error. Valid values
  are converted into a UTF-8 byte sequence encoding the value.
* A `Rune` longer than 6 bytes is grammatical, but signals an error.
  This is important because runes are not self-terminating; defining
  their grammar as ending after a maximum of 6 bytes would allow
  another datum beginning with an alphabetic character to follow a
  rune immediately without any visual delineation, which would be
  terribly confusing for a human reader. Consider: `#foobarbaz`. This
  would parse as a `Datum` joining `#foobar` and `baz`.
* A `Label` is the hexadecimal representation of a 48-bit integer,
  meaning it allows for a maximum of 12 hexadecimal digits. Longer
  values are grammatical, but signal an out-of-range error, so as to
  avoid signaling a confusing "invalid character" error on input that
  appears grammatical. Consider: `#%123456789abcd=foo`. This would
  signal an invalid character error at the letter `d` if the grammar
  limited a `Label` to 12 hexadecimal digits.

## Stream-parsing strategy

The parser consumes one `Unit` from the input stream every time it's
called; it returns the `Datum` therein if found, or else it returns
the Zisp EOF token.

Since a `Datum` is not self-terminating, the parser must read beyond
it to realize that it has ended (if not followed by the EOF). Thus,
it will consume one more `Blank` following the `Unit` that it parsed.
If this `Blank` is a comment, it will be consumed entirely, ensuring
that parsing resumes properly on a subsequent parser call on the same
input stream, without needing to store any state in between.

Since comments of type `SkipUnit` are likewise not self-terminating,
an arbitrary number of chained `SkipUnit` comments may need to be
consumed before the parser is finally allowed to return.

The following illustration shows the positions at which the parser
will stop consuming input when called repeatedly on the same input
stream. The dots represent the extent of each `Unit` being parsed,
while the caret points at the last byte the parser will consume in
that parse cycle.

```
foo (bar)[baz] foo;~bar foo;~bar;~baz;~bat foobar
...^..........^...     ^...               ^......^
```

Notice how, in the fourth cycle, the parser is forced to consume all
commented-out units before it can return, since it would otherwise
leave the stream in an inappropriate state.
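
The stop positions in the illustration can be reproduced with a small
sketch. This is not the Zisp parser: it assumes a toy grammar in which
a datum is any run of bytes other than whitespace and `;`, and in
which `;~` followed by a datum is the only comment form (inferred from
the illustration). The names `parse_unit`, `read_datum`, `parse_all`,
and `EOF` are hypothetical.

```python
EOF = object()  # hypothetical stand-in for the Zisp EOF token


def is_space(b: int) -> bool:
    return b in b" \t\r\n"


def read_datum(data: bytes, pos: int) -> int:
    """Return the offset just past a toy datum starting at pos."""
    while pos < len(data) and not is_space(data[pos]) and data[pos] != ord(";"):
        pos += 1
    return pos


def parse_unit(data: bytes, pos: int):
    """Parse one unit at pos; return (datum or EOF, offset to resume from)."""
    # Skip leading blanks: whitespace bytes and ";~" comments.
    while pos < len(data):
        if is_space(data[pos]):
            pos += 1
        elif data.startswith(b";~", pos):
            pos = read_datum(data, pos + 2)
        else:
            break
    if pos == len(data):
        return EOF, pos
    end = read_datum(data, pos)
    if end == pos:
        # A ';' not followed by '~': other comment forms are not modeled.
        raise ValueError(f"unsupported input at offset {pos}")
    datum = data[pos:end]
    pos = end
    # A ";~" comment hides a datum that is not self-terminating either,
    # so the whole chain is consumed before the closing blank.
    while data.startswith(b";~", pos):
        pos = read_datum(data, pos + 2)
    # Consume the single whitespace byte that ends the unit, if any.
    if pos < len(data) and is_space(data[pos]):
        pos += 1
    return datum, pos


def parse_all(data: bytes):
    """Call parse_unit repeatedly, keeping no state except the offset."""
    results, pos = [], 0
    while True:
        datum, pos = parse_unit(data, pos)
        if datum is EOF:
            return results
        results.append((datum, pos))
```

Running `parse_all` on the illustration's input yields the resume
offsets 4, 15, 24, 43, and 49. The first four are one past each caret;
the last equals the input length, where the caret marks the parser
reading EOF. Only the offset is carried between calls, which is the
property the strategy above is meant to guarantee.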