More grammar shite.

author: Taylan Kammer <taylan.kammer@gmail.com> 2026-01-10 17:33:14 +0100
committer: Taylan Kammer <taylan.kammer@gmail.com> 2026-01-10 17:33:14 +0100
commit: b737130c059e8e5566caa7aa3144f910d43999ae (patch)
tree: 5dfbfface09aed028efbeacd119e6a78e7530f61 /docs/c1/grammar.md
parent: 29dfeb175122d66bf71fc67ecd80f8df90a9a462 (diff)
1 files changed, 101 insertions, 0 deletions
diff --git a/docs/c1/grammar.md b/docs/c1/grammar.md
new file mode 100644
index 0000000..3364150
--- /dev/null
+++ b/docs/c1/grammar.md
@@ -0,0 +1,101 @@
+# Zisp S-Expression Grammar
+
+The grammar is available in several different formats:
+
+* [ZBNF](grammar.zbnf.txt): See below for the rules of this notation
+* [ABNF](grammar.abnf.txt): Compatible with the `abnfgen` tool
+* [PEG](grammar.peg.txt): Compatible with the `peg/leg` tool
+
+
+## ZBNF notation
+
+The ZBNF grammar specification uses a BNF-like notation with PEG-like
+semantics:
+
+* Concatenation of expressions is implicit: `foo bar` means `foo`
+  followed by `bar`.
+
+* Parentheses are used for grouping, and the pipe symbol `|` is used
+  for alternatives.
+
+* The suffixes `?`, `*`, and `+` have the same meaning as in regular
+  expressions, although `[foo]` is used in place of `(foo)?`.
+
+* The syntax is defined in terms of bytes, not characters.  Terminals
+  `'c'` and `"c"` refer to the ASCII value of the given character `c`.
+  Standard C escape sequences are supported.
+
+* The prefix `~` means NOT.  It only applies to rules that match one
+  byte, and negates them.  For example, `~( 'a' | 'b' )` matches any
+  byte other than `'a'` and `'b'`.
+
+* Ranges of terminal values are expressed as `x...y` (inclusive).
+
+* ABNF "core rules" like `ALPHA` and `HEXDIG` are supported.
+
+* There is no ambiguity, and no look-ahead or backtracking beyond one
+  byte.  Rules match left to right, depth-first, and greedily.  As soon
+  as the input matches the first terminal of a rule (explicit, or
+  implied by recursively descending into the first non-terminal), it
+  must match that rule to the end or a syntax error is reported.
+
+The last point makes the notation simple to translate to code.
+
+
+## Limitations outside the grammar
+
+The following limits are not represented in the grammar:
+
+* A `UnicodeSV` is the hexadecimal representation of a Unicode scalar
+  value; it must represent a value in the range 0 to D7FF, or E000 to
+  10FFFF, inclusive.  Any other value signals an error.  Valid values
+  are converted into a UTF-8 byte sequence encoding the value.
+
+* A `Rune` longer than 6 bytes is grammatical, but signals an error.
+  This is important because runes are not self-terminating; defining
+  their grammar as ending after a maximum of 6 bytes would allow
+  another datum beginning with an alphabetic character to follow a
+  rune immediately without any visual delineation, which would be
+  terribly confusing for a human reader.  Consider `#foobarbaz`: with
+  such a limit, it would parse as a `Datum` joining `#foobar` and `baz`.
+
+* A `Label` is the hexadecimal representation of a 48-bit integer,
+  meaning it allows for a maximum of 12 hexadecimal digits.  Longer
+  values are grammatical, but signal an out-of-range error, so as to
+  avoid signaling a confusing "invalid character" error on input that
+  appears grammatical.  Consider: `#%123456789abcd=foo`.  This would
+  signal an invalid character error at the letter `d` if the grammar
+  limited a `Label` to 12 hexadecimal digits.
+
+
+## Stream-parsing strategy
+
+The parser consumes one `Unit` from the input stream every time it's
+called; it returns the `Datum` therein if found, or else it returns
+the Zisp EOF token.
+
+Since a `Datum` is not self-terminating, the parser must read beyond
+it to realize that it has ended (if not followed by the EOF).  Thus,
+it will consume one more `Blank` following the `Unit` that it parsed.
+If this `Blank` is a comment, it will be consumed entirely, ensuring
+that parsing resumes properly on a subsequent parser call on the same
+input stream, without needing to store any state in between.
+
+Since comments of type `SkipUnit` are likewise not self-terminating,
+an arbitrary number of chained `SkipUnit` comments may need to be
+consumed before the parser is finally allowed to return.
+
+The following illustration shows the positions at which the parser
+will stop consuming input when called repeatedly on the same input
+stream.  The dots represent the extent of each `Unit` being parsed,
+while the caret points at the last byte the parser will consume in
+that parse cycle.
+
+```
+foo (bar)[baz] foo;~bar foo;~bar;~baz;~bat foobar
+...^..........^...     ^...               ^......^
+```
+
+Notice how, in the fourth cycle, the parser is forced to consume all
+commented-out units before it can return, since it would otherwise
+leave the stream in an inappropriate state.
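As a reading aid for the "Limitations outside the grammar" section added by this patch, here is a minimal Python sketch of the three checks a parser would apply after the grammar has matched. It is not taken from the Zisp sources: the function names and error types are hypothetical, the rune length is assumed to count the bytes after the `#`, and the label limit is assumed to be enforced by digit count.

```python
import string

MAX_RUNE_BYTES = 6      # from the doc: longer runes are grammatical but rejected
MAX_LABEL_DIGITS = 12   # from the doc: 48 bits = 12 hexadecimal digits


def _require_hex(digits: str) -> None:
    # Hypothetical helper: the grammar would already guarantee this,
    # but int(..., 16) alone also accepts signs, "0x" and underscores.
    if not digits or any(c not in string.hexdigits for c in digits):
        raise ValueError(f"not a hexadecimal number: {digits!r}")


def decode_unicode_sv(digits: str) -> bytes:
    """Check the scalar-value range, then encode the value as UTF-8."""
    _require_hex(digits)
    value = int(digits, 16)
    # Valid ranges per the doc: 0..D7FF and E000..10FFFF, inclusive.
    if not (value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF):
        raise ValueError(f"not a Unicode scalar value: {digits}")
    return chr(value).encode("utf-8")


def check_rune(name: bytes) -> bytes:
    """Reject a rune name longer than 6 bytes (assumed to exclude '#')."""
    if len(name) > MAX_RUNE_BYTES:
        raise ValueError(f"rune too long: {name!r}")
    return name


def parse_label(digits: str) -> int:
    """Report an out-of-range error instead of an invalid-character one."""
    _require_hex(digits)
    if len(digits) > MAX_LABEL_DIGITS:
        raise OverflowError(f"label out of range: {digits}")
    return int(digits, 16)
```

With these definitions, `parse_label("123456789abcd")` from the doc's example raises the out-of-range error for the whole label rather than complaining about the letter `d`, and `check_rune(b"foobarbaz")` is rejected as a whole instead of being split into `#foobar` and `baz`.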