docs/c1/grammar/index.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101

# Zisp S-Expression Grammar

The grammar is available in several different formats:

* [ZBNF](zbnf.txt): See below for the rules of this notation
* [ABNF](abnf.txt): Compatible with the `abnfgen` tool
* [PEG](peg.txt): Compatible with `peg/leg` tool


## ZBNF notation

The ZBNF grammar specification uses a BNF-like notation with PEG-like
semantics:

* Concatenation of expressions is implicit: `foo bar` means `foo`
  followed by `bar`.

* Parentheses are used for grouping, and the pipe symbol `|` is used
  for alternatives.

* The suffixes `?`, `*`, and `+` have the same meaning as in regular
  expressions, although `[foo]` is used in place of `(foo)?`.

* The syntax is defined in terms of bytes, not characters.  Terminals
  `'c'` and `"c"` refer to the ASCII value of the given character `c`.
  Standard C escape sequences are supported.

* The prefix `~` means NOT.  It only applies to rules that match one
  byte, and negates them.  For example, `~( 'a' | 'b' )` matches any
  byte other than 'a' and 'b'.

* Ranges of terminal values are expressed as `x...y` (inclusive).

* ABNF "core rules" like `ALPHA` and `HEXDIG` are supported.

* There is no ambiguity, or look-ahead / backtracking beyond one byte.
  Rules match left to right, depth-first, and greedy.  As soon as the
  input matches the first terminal of a rule --explicit or implied by
  recursively descending into the first non-terminal-- it must match
  that rule to the end or a syntax error is reported.

The last point makes the notation simple to translate to code.


## Limitations outside the grammar

The following limits are not represented in the grammar:

* A `UnicodeSV` is the hexadecimal representation of a Unicode scalar
  value; it must represent a value in the range 0 to D7FF, or E000 to
  10FFFF, inclusive.  Any other value signals an error.  Valid values
  are converted into a UTF-8 byte sequence encoding the value.

* A `Rune` longer than 6 bytes is grammatical, but signals an error.
  This is important because runes are not self-terminating; defining
  their grammar as ending after a maximum of 6 bytes would allow
  another datum beginning with an alphabetic character to follow a
  rune immediately without any visual delineation, which would be
  terribly confusing for a human reader.  Consider: `#foobarbaz`.
  This would parse as a `Datum` joining `#foobar` and `baz`.

* A `Label` is the hexadecimal representation of a 48-bit integer,
  meaning it allows for a maximum of 12 hexadecimal digits.  Longer
  values are grammatical, but signal an out-of-range error, so as to
  avoid signaling a confusing "invalid character" error on input that
  appears grammatical.  Consider: `#%123456789abcd=foo`.  This would
  signal an invalid character error at the letter `d` if the grammar
  limited a `Label` to 12 hexadecimal digits.


## Stream-parsing strategy

The parser consumes one `Unit` from the input stream every time it's
called; it returns the `Datum` therein if found, or else it returns
the Zisp EOF token.

Since a `Datum` is not self-terminating, the parser must read beyond
it to realize that it has ended (if not followed by the EOF).  Thus,
it will consume one more `Blank` following the `Unit` that it parsed.
If this `Blank` is a comment, it will be consumed entirely, ensuring
that parsing resumes properly on a subsequent parser call on the same
input stream, without needing to store any state in between.

Since comments of type `SkipUnit` are likewise not self-terminating,
an arbitrary number of chained `SkipUnit` comments may need to be
consumed before the parser is finally allowed to return.

The following illustration shows the positions at which the parser
will stop consuming input when called repeatedly on the same input
stream.  The dots represent the extent of each `Unit` being parsed,
while the caret points at the last byte the parser will consume in
that parse cycle.

```
foo (bar)[baz] foo;~bar foo;~bar;~baz;~bat foobar
...^..........^...     ^...               ^......^
```

Notice how, in the fourth cycle, the parser is forced to consume all
commented-out units before it can return, since it would otherwise
leave the stream in an inappropriate state.