# Zisp S-Expression Syntax

We use a BNF-like grammar notation with the following rules:

* Concatenation of expressions is implicit: `foo bar` means `foo`
  followed by `bar`.

* The suffixes `?`, `*`, and `+` have the same meaning as in regular
  expressions, although `[foo]` is used in place of `(foo)?`.

* The syntax is defined in terms of bytes, not characters.  Terminals
  `'c'` and `"c"` refer to the ASCII value of the given character `c`.
  Numbers are in decimal and refer to a byte with the given value.

* The prefix `~` means NOT.  It only applies to rules that match one
  byte, and negates them.  For example, `~( 'a' | 'b' )` matches any
  byte other than 97 and 98.

* Ranges of terminal values are expressed as `x...y` (inclusive).

* ABNF "core rules" like `ALPHA` and `HEXDIG` are supported.

* There is no ambiguity, and no look-ahead or backtracking beyond one byte.
  Rules match left to right, depth-first, and greedy.  As soon as the
  input matches the first terminal of a rule (explicit or implied by
  recursively descending into the first non-terminal), it must match
  that rule to the end, or it is considered a syntax error.

The last rule makes the notation simple to translate to code.
It also ostensibly makes the notation equivalent to PEG in expressive power.
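As an illustration of that translation, here is a minimal Python sketch of a recognizer for the `BareString` rule defined in the grammar below.  The function name `match_bare_string` and the choice to raise `SyntaxError` are assumptions made for this example, not part of the specification.  The first byte alone selects the alternative, after which the rule is committed to and matched greedily.

```python
import string

# BareChar : ALPHA | DIGIT | '!' | '$' | ... | '~'
BARE_CHARS = frozenset(
    (string.ascii_letters + string.digits + "!$%*+-/<=>?@^_~").encode("ascii")
)
# First byte of the first BareString alternative: '.' | '+' | '-' | DIGIT
NUMERIC_START = frozenset((".+-" + string.digits).encode("ascii"))
DOT = ord(".")


def match_bare_string(buf: bytes, pos: int = 0) -> int:
    """Return the index just past the BareString that starts at `pos`.

    Hypothetical helper: raises SyntaxError if no BareString starts there.
    """
    if pos >= len(buf):
        raise SyntaxError("unexpected end of input")
    first = buf[pos]
    if first in NUMERIC_START:
        # First alternative: the tail may also contain '.'
        allowed = BARE_CHARS | {DOT}
    elif first in BARE_CHARS:
        # Second alternative: BareChar+ (no '.')
        allowed = BARE_CHARS
    else:
        raise SyntaxError(f"byte {first} cannot start a BareString")
    pos += 1
    # Greedy repetition, deciding on one byte at a time.
    while pos < len(buf) and buf[pos] in allowed:
        pos += 1
    return pos
```

With this sketch, `match_bare_string(b"foo.bar")` stops before the `.`, because the second alternative does not admit it, while `match_bare_string(b"-1.5 x")` consumes `-1.5` in full.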

The parser consumes one `Unit` from an input stream every time it is
called; it returns the `Datum` therein, or EOF.  The final optional
`Blank` represents the fact that the parser will consume one more
blank after the datum if it finds one; this is because a `Datum` is not
self-terminating, so the parser has to check whether it continues.
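The trailing-blank behaviour can be sketched in Python under heavy simplifying assumptions: only whitespace counts as `Blank` (comments are ignored), and only `BareString` data joined by a `JoinChar` are recognized.  The names `read_unit` and `skip_bare_string` are hypothetical, and `None` stands in for EOF.

```python
import string

WHITESPACE = frozenset(range(9, 14)) | {32}  # 9...13 | SP
JOIN_CHARS = frozenset(b".:")
BARE_CHARS = frozenset(
    (string.ascii_letters + string.digits + "!$%*+-/<=>?@^_~").encode("ascii")
)
NUMERIC_START = frozenset((".+-" + string.digits).encode("ascii"))


def skip_bare_string(buf: bytes, pos: int) -> int:
    """Return the index past the BareString at `pos`, or raise SyntaxError."""
    if pos >= len(buf) or buf[pos] not in (BARE_CHARS | NUMERIC_START):
        raise SyntaxError("expected a BareString")
    # A numeric-looking first byte also allows '.' in the tail.
    allowed = BARE_CHARS | {ord(".")} if buf[pos] in NUMERIC_START else BARE_CHARS
    pos += 1
    while pos < len(buf) and buf[pos] in allowed:
        pos += 1
    return pos


def read_unit(buf: bytes, pos: int = 0):
    """Return (datum bytes, new position); the datum is None at end of input."""
    while pos < len(buf) and buf[pos] in WHITESPACE:  # Blank*
        pos += 1
    if pos >= len(buf):
        return None, pos
    start = pos
    pos = skip_bare_string(buf, pos)
    # The datum is not self-terminating: check whether it goes on.
    while pos < len(buf) and buf[pos] in JOIN_CHARS:
        pos = skip_bare_string(buf, pos + 1)
    end = pos
    if pos < len(buf) and buf[pos] in WHITESPACE:  # the final [Blank]
        pos += 1
    return buf[start:end], pos
```

Calling `read_unit` repeatedly on `b"  foo.bar  baz"` yields `b"foo.bar"`, then `b"baz"`, then `None`; after the first call the position has advanced past exactly one of the two spaces that follow `foo.bar`.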

The following limits are not represented in the grammar:

* A `UnicodeSV` is the hexadecimal representation of a Unicode scalar
  value; it must represent a value in the range 0 to D7FF, or E000 to
  10FFFF, inclusive.  Any other value signals an error.  Valid values
  are converted into a UTF-8 byte sequence encoding the value.

* A `Rune` longer than 6 bytes is grammatical, but signals an error.
  This is important because runes are not self-terminating; defining
  their grammar as ending after a maximum of 6 bytes would allow
  another datum beginning with an alphabetic character to follow a
  rune immediately without any visual delineation, which would be
  terribly confusing for a human reader.  Consider `#foo123bar`: under
  such a grammar, it would parse as `#foo123` followed by `bar`.

* A `Label` is the hexadecimal representation of a 48-bit integer,
  meaning it allows for a maximum of 12 hexadecimal digits.  Longer
  values are grammatical, but signal an out-of-range error.
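These limits can be checked after the grammar has matched.  The following Python sketch shows one way to do so; the function names and the use of `ValueError` are assumptions for illustration, and counting leading zeros toward the 12-digit `Label` limit is one reading of the text rather than something it states.

```python
MAX_RUNE_BYTES = 6
MAX_LABEL_DIGITS = 12  # 12 hexadecimal digits = 48 bits


def decode_unicode_sv(hex_digits: bytes) -> bytes:
    """Convert the HEXDIG+ of a UnicodeSV into its UTF-8 byte sequence."""
    value = int(hex_digits.decode("ascii"), 16)
    # Valid scalar values: 0..D7FF and E000..10FFFF (surrogates excluded).
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {value:X}")
    return chr(value).encode("utf-8")


def check_rune(rune: bytes) -> bytes:
    """Reject a grammatical Rune that is longer than 6 bytes."""
    if len(rune) > MAX_RUNE_BYTES:
        raise ValueError("rune longer than 6 bytes")
    return rune


def parse_label(hex_digits: bytes) -> int:
    """Convert the HEXDIG+ of a Label into an integer below 2**48."""
    # Assumption: the limit is applied to the digit count, zeros included.
    if len(hex_digits) > MAX_LABEL_DIGITS:
        raise ValueError("label out of range")
    return int(hex_digits.decode("ascii"), 16)
```

For example, `decode_unicode_sv(b"20AC")` produces the three UTF-8 bytes of the euro sign, while `decode_unicode_sv(b"D800")` and `check_rune(b"foo123bar")` both raise.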

```
Unit          : Blank* [ Datum [Blank] ]


Blank         : 9...13 | SP | Comment

Datum         : OneDatum ( [JoinChar] OneDatum )*

JoinChar      : '.' | ':'


Comment       : ';' ( SkipUnit | SkipLine )

SkipUnit      : '~' Unit

SkipLine      : ( ~LF )* [LF]


OneDatum      : BareString | CladDatum


BareString    : ( '.' | '+' | '-' | DIGIT ) ( BareChar | '.' )*
              | BareChar+

CladDatum     : PipeStr | QuoteStr | HashExpr | QuoteExpr | List

PipeStr       : '|' ( PipeStrChar | '\' StringEsc )* '|'
QuoteStr      : '"' ( QuotStrChar | '\' StringEsc )* '"'
HashExpr      : '#' ( RuneExpr | LabelExpr | HashDatum )
QuoteExpr     : "'" Datum | '`' Datum | ',' Datum
List          : ParenList | SquareList | BraceList

BareChar      : ALPHA | DIGIT
              | '!' | '$' | '%' | '*' | '+'
              | '-' | '/' | '<' | '=' | '>'
              | '?' | '@' | '^' | '_' | '~'

PipeStrChar   : ~( '|' | '\' )
QuotStrChar   : ~( '"' | '\' )

StringEsc     : '\' | '|' | '"' | ( HTAB | SP )* LF ( HTAB | SP )*
              | 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e'
              | 'x' HexByte+ ';'
              | 'u' UnicodeSV ';'

HexByte       : HEXDIG HEXDIG
UnicodeSV     : HEXDIG+

RuneExpr      : Rune [ '\' BareString | CladDatum ]
LabelExpr     : '%' Label ( '%' | '=' Datum )
HashDatum     : '\' BareString | CladDatum

Rune          : ALPHA ( ALPHA | DIGIT )*
Label         : HEXDIG+

ParenList     : '(' ListBody ')'
SquareList    : '[' ListBody ']'
BraceList     : '{' ListBody '}'

ListBody      : Unit* [ Blank* '&' Unit ] Blank*
```
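To show how the string rules read in practice, here is a hedged Python sketch of a `QuoteStr` parser that decodes `StringEsc`.  The grammar does not say what the escapes produce, so the values below are assumptions: the single-letter escapes follow their conventional C meanings (with `e` as ESC), and an escaped line break contributes nothing to the string.  The names `parse_quote_str` and `read_hex_run` are hypothetical.

```python
HEXDIG = frozenset(b"0123456789ABCDEFabcdef")
HTAB, LF, SP = 9, 10, 32
# Assumed meanings; the grammar only lists the escape characters.
SIMPLE_ESCAPES = {
    ord("\\"): b"\\", ord("|"): b"|", ord('"'): b'"',
    ord("a"): b"\x07", ord("b"): b"\x08", ord("t"): b"\t", ord("n"): b"\n",
    ord("v"): b"\x0b", ord("f"): b"\x0c", ord("r"): b"\r", ord("e"): b"\x1b",
}


def read_hex_run(buf: bytes, pos: int):
    """Consume HEXDIG+ and ';'; return (digits, index past the ';')."""
    start = pos
    while pos < len(buf) and buf[pos] in HEXDIG:
        pos += 1
    if pos == start or pos >= len(buf) or buf[pos] != ord(";"):
        raise SyntaxError("expected hex digits terminated by ';'")
    return buf[start:pos], pos + 1


def parse_quote_str(buf: bytes, pos: int = 0):
    """Parse a QuoteStr at `pos`; return (decoded bytes, index past it)."""
    if pos >= len(buf) or buf[pos] != ord('"'):
        raise SyntaxError("expected opening quote")
    pos += 1
    out = bytearray()
    while True:
        if pos >= len(buf):
            raise SyntaxError("unterminated string")
        byte = buf[pos]
        pos += 1
        if byte == ord('"'):
            return bytes(out), pos
        if byte != ord("\\"):
            out.append(byte)  # QuotStrChar
            continue
        if pos >= len(buf):
            raise SyntaxError("unterminated escape")
        esc = buf[pos]
        if esc in SIMPLE_ESCAPES:
            out += SIMPLE_ESCAPES[esc]
            pos += 1
        elif esc == ord("x"):  # 'x' HexByte+ ';'
            digits, pos = read_hex_run(buf, pos + 1)
            if len(digits) % 2:
                raise SyntaxError("\\x takes pairs of hex digits")
            out += bytes.fromhex(digits.decode("ascii"))
        elif esc == ord("u"):  # 'u' UnicodeSV ';'
            digits, pos = read_hex_run(buf, pos + 1)
            value = int(digits.decode("ascii"), 16)
            if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
                raise ValueError("not a Unicode scalar value")
            out += chr(value).encode("utf-8")
        elif esc in (HTAB, SP, LF):  # ( HTAB | SP )* LF ( HTAB | SP )*
            while pos < len(buf) and buf[pos] in (HTAB, SP):
                pos += 1
            if pos >= len(buf) or buf[pos] != LF:
                raise SyntaxError("expected line break in escape")
            pos += 1
            while pos < len(buf) and buf[pos] in (HTAB, SP):
                pos += 1
        else:
            raise SyntaxError(f"unknown escape byte {esc}")
```

A `PipeStr` parser would differ only in the delimiter.  Note that every branch is chosen by inspecting a single byte, in line with the one-byte look-ahead rule of the notation.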