summaryrefslogtreecommitdiff
path: root/spec/syntax.md
diff options
context:
space:
mode:
authorTaylan Kammer <taylan.kammer@gmail.com>2026-01-09 18:09:59 +0100
committerTaylan Kammer <taylan.kammer@gmail.com>2026-01-09 18:09:59 +0100
commit2d72a1aa64a66c486a2329999123c14afcddeb32 (patch)
tree4eba98eb1240d3d445e2d35c61bad63d352e413b /spec/syntax.md
parenta2ece405cc61341122fc075d499420e894c56909 (diff)
More grammar fuckery. BNF is horrible!
Diffstat (limited to 'spec/syntax.md')
-rw-r--r--spec/syntax.md81
1 files changed, 51 insertions, 30 deletions
diff --git a/spec/syntax.md b/spec/syntax.md
index 7f3561c..d1a17ad 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -1,20 +1,18 @@
# Zisp S-Expression Syntax
-We use a BNF notation with the following rules:
+We use a BNF-like grammar notation with the following rules:
* Concatenation of expressions is implicit: `foo bar` means `foo`
followed by `bar`.
-* Expressions may be followed by `?`, `*`, `+`, `{N}`, or `{N,M}`,
- which have the same meanings as in regular expressions.
-
-* The syntax `[foo]` is shorthand for `(foo)?`.
+* The suffixes `?`, `*`, and `+` have the same meaning as in regular
+ expressions, although `[foo]` is used in place of `(foo)?`.
* The syntax is defined in terms of bytes, not characters. Terminals
`'c'` and `"c"` refer to the ASCII value of the given character `c`.
Numbers are in decimal and refer to a byte with the given value.
-* The `~` prefix means NOT. It only applies to rules that match one
+* The prefix `~` means NOT. It only applies to rules that match one
byte, and negates them. For example, `~( 'a' | 'b' )` matches any
byte other than 97 and 98.
@@ -24,11 +22,12 @@ We use a BNF notation with the following rules:
* There is no ambiguity, or look-ahead / backtracking beyond one byte.
Rules match left to right, depth-first, and greedy. As soon as the
- input matches the first terminal of a rule, it must match that rule
- to the end or it is considered a syntax error.
+ input matches the first terminal of a rule (explicit or implied by
+ recursively descending into the first non-terminal), it must match
+ that rule to the end, or it is considered a syntax error.
-The last rule means that the BNF is very simple to translate to code.
-It also probably makes it equivalent to PEG.
+The last rule means that the notation is simple to translate to code.
+It ostensibly makes the notation equivalent to PEG in expression.
The parser consumes one `Unit` from an input stream every time it's
called; it returns the `Datum` therein, or EOF. The final optional
@@ -36,11 +35,30 @@ called; it returns the `Datum` therein, or EOF. The final optional
blank at the end if it finds one; this is because `Datum` is not
self-closing so the parser has to check if it goes on.
+The following limits are not represented in the grammar:
+
+* A `UnicodeSV` is the hexadecimal representation of a Unicode scalar
+ value; it must represent a value in the range 0 to D7FF, or E000 to
+ 10FFFF, inclusive. Any other value signals an error. Valid values
+ are converted into a UTF-8 byte sequence encoding the value.
+
+* A `Rune` longer than 6 bytes is grammatical, but signals an error.
+ This is important because runes are not self-terminating; defining
+ their grammar as ending after a maximum of 6 bytes would allow
+ another datum beginning with an alphabetic character to follow a
+ rune immediately without any visual delineation, which would be
+ terribly confusing for a human reader. Consider: `#foo123bar`.
+ This would parse as a concatenation of `#foo123` and `bar`.
+
+* A `Label` is the hexadecimal representation of a 48-bit integer,
+ meaning it allows for a maximum of 12 hexadecimal digits. Longer
+ values are grammatical, but signal an out-of-range error.
+
```
Unit : Blank* [ Datum [Blank] ]
-Blank : 9...13 | Comment
+Blank : 9...13 | SP | Comment
Datum : OneDatum ( [JoinChar] OneDatum )*
@@ -56,41 +74,44 @@ SkipLine : ( ~LF )* [LF]
OneDatum : BareString | CladDatum
+
BareString : ( '.' | '+' | '-' | DIGIT ) ( BareChar | '.' )*
| BareChar+
-CladDatum : '|' ( PipeStrChar | '\' StringEsc )* '|'
- | '"' ( QuotStrChar | '\' StringEsc )* '"'
- | '#' HashExpr
- | '(' List ')' | '[' List ']' | '{' List '}'
- | "'" Datum | '`' Datum | ',' Datum
+CladDatum : PipeStr | QuoteStr | HashExpr | QuoteExpr | List
+PipeStr : '|' ( PipeStrChar | '\' StringEsc )* '|'
+QuoteStr : '"' ( QuotStrChar | '\' StringEsc )* '"'
+HashExpr : '#' ( RuneExpr | LabelExpr | HashDatum )
+QuoteExpr : "'" Datum | '`' Datum | ',' Datum
+List : ParenList | SquareList | BraceList
BareChar : ALPHA | DIGIT
| '!' | '$' | '%' | '*' | '+'
| '-' | '/' | '<' | '=' | '>'
| '?' | '@' | '^' | '_' | '~'
-
PipeStrChar : ~( '|' | '\' )
-
QuotStrChar : ~( '"' | '\' )
-HashExpr : Rune [ '\' BareString | CladDatum ]
- | '\' BareString
- | '%' Label ( '%' | '=' Datum )
- | CladDatum
-
-List : Unit* [ Blank* '&' Unit ] Blank*
-
-
StringEsc : '\' | '|' | '"' | ( HTAB | SP )* LF ( HTAB | SP )*
| 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e'
- | 'x' ( HEXDIG{2} )+ ';'
- | 'u' HEXDIG{1,6} ';'
+ | 'x' HexByte+ ';'
+ | 'u' UnicodeSV ';'
+
+HexByte : HEXDIG HEXDIG
+UnicodeSV : HEXDIG+
+
+RuneExpr : Rune [ '\' BareString | CladDatum ]
+LabelExpr : '%' Label ( '%' | '=' Datum )
+HashDatum : '\' BareString | CladDatum
+Rune : ALPHA ( ALPHA | DIGIT )*
+Label : HEXDIG+
-Rune : ALPHA ( ALPHA | DIGIT ){0,5}
+ParenList : '(' ListBody ')'
+SquareList : '[' ListBody ']'
+BraceList : '{' ListBody '}'
-Label : HEXDIG{1,12}
+ListBody : Unit* [ Blank* '&' Unit ] Blank*
```