| -rw-r--r-- | notes/250219-reader.md | 53 |
| -rw-r--r-- | spec/syntax.md | 68 |
2 files changed, 115 insertions, 6 deletions
diff --git a/notes/250219-reader.md b/notes/250219-reader.md
index de71b4e..503d402 100644
--- a/notes/250219-reader.md
+++ b/notes/250219-reader.md
@@ -7,6 +7,10 @@ article:*

 [Symbols are strings are symbols](250210-symbols.html)

+*This whole article is me rambling, and the actual implementation of
+the parser that I settled on is slightly different from all the ideas
+that are wildly explored here. See late addition at the bottom.*
+
 OK but hear me out... What if there were different reader modes, for
 code and (pure) data?
@@ -463,10 +467,57 @@ from the apostrophe if needed.)
 Also, all those would work without a rune as well, to allow a file to
 change the meaning of some of the default syntax sugar if desired:

-  "foo" -> (#string . foo)
+  "foo" -> (#string . foo)
   [foo bar] -> (#square foo bar)
   {foo bar} -> (#braces foo bar)

 Or something like that. I'm making this all up as I go.
+
+## Actual implementation
+
+_2026 January_
+
+Just to summarize what I actually ended up implementing in the end:
+
+- There is only one parser, not separate data and code parsers.
+
+- It simply desugars `"foo bar"` into `(#QUOTE . |foo bar|)`, i.e.,
+  these expressions are equivalent, and indistinguishable once they
+  have been parsed into data. (The syntax `|foo bar|` represents a
+  string literal in its purest form.) Another equivalent expression
+  would be `'|foo bar|` that also parses into `(#QUOTE . |foo bar|)`.
+  All three parse into the exact same data in memory.
+
+- If you want to use Zisp expressions for something like config files
+  and want to type `"foo bar"` instead of `|foo bar|` but don't want
+  to deal with `(#QUOTE . |foo bar|)` then just run a decoder on the
+  data before using it. You'll need to run a decoder on it anyway if
+  you want to support vectors, mappings, and other such data types in
+  your config file that don't have a *direct* data representation.
+
+- The decoder is not implemented yet, but it will be configurable and
+  may have default configurations for "code" and "data" where the data
+  configuration would presumably just strip `(#QUOTE . foo)` down to
+  `foo` just to make `"foo"` and `|foo|` totally equivalent in data
+  contexts like config files. In the code configuration, it would
+  decode `(#QUOTE . foo)` into a macro call expression object which,
+  when evaluated, results in `foo`.
+
+- If you wanted to have a config file with code snippets in it, and
+  don't want e.g. `(code (string-append "foo" x))` to be decoded into
+  `(code (string-append foo x))` thus changing the meaning of the
+  embedded code, you have two options:
+
+  1. Make your entire config file be Zisp code written in a DSL.
+
+  2. Wrap code snippets in one layer of quoting like `'(...)` which
+     will effectively protect nested uses of `#QUOTE` from the data
+     decoder, since decoding is a breadth-first operation.
+
+See here for full documentation of Zisp expressions as implemented:
+
+- [Informal docs](https://git.tkammer.de/zisp/tree/docs/parser.md)
+- [Formal spec](https://git.tkammer.de/zisp/tree/spec/syntax.md)
+- [ABNF](https://git.tkammer.de/zisp/tree/spec/syntax.abnf)
diff --git a/spec/syntax.md b/spec/syntax.md
index b85ed78..91e5495 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -6,7 +6,9 @@ We use a BNF notation with the following rules:
   followed by `bar`.

 * Expressions may be followed by `?`, `*`, `+`, `{N}`, or `{N,M}`,
-  which have the meanings they have in regular expressions.
+  which have meanings analogous to regular expressions.
+
+* The syntax `[foo]` is shorthand for `(foo)?`.

 * The syntax is defined in terms of bytes, not characters. Terminals
   `'c'` and `"c"` refer to the ASCII value of the given character `c`.
@@ -18,10 +20,13 @@ We use a BNF notation with the following rules:

 * Ranges of terminal values are expressed as `x...y` (inclusive).

-* There is no ambiguity, backtracking, or look-ahead beyond the byte
-  currently being matched. Rules match left to right, depth-first,
-  and greedy. As soon as the input matches the first terminal of a
-  rule, it must match that rule to the end.
+* ABNF "core rules" like `ALPHA` and `HEXDIG` are supported, with the
+  addition of EOF to explicitly demarcate the end of the byte stream.
+
+* There is no ambiguity, backtracking, or look-ahead beyond one byte.
+  Rules match left to right, depth-first, and greedy. As soon as the
+  input matches the first terminal of a rule, it must match that rule
+  to the end or it is considered a syntax error.

 The last rule means that the BNF is very simple to translate to code.

@@ -29,6 +34,59 @@ The parser consumes one `unit` from an input stream every time it's
 called; it returns the `datum` therein, or EOF.

 ```
+Unit : Blank* ( Datum [Blank] | EOF )
+
+
+Blank : 9...13 | Comment
+
+Datum : OneDatum ( [JoinChar] OneDatum )*
+
+JoinChar : '.' | ':'
+
+
+Comment : ';' ( SkipUnit | SkipLine )
+
+SkipUnit : '~' Unit
+
+SkipLine : ( ~LF )* [LF]
+
+
+OneDatum : BareString | CladDatum
+
+BareString : ( '.' | '+' | '-' | DIGIT ) ( BareChar | '.' )*
+           | BareChar+
+
+CladDatum : '|' PipeStrElt* '|'
+          | '"' QuotStrElt* '"'
+          | '#' HashExpr
+          | '(' List ')' | '[' List ']' | '{' List '}'
+          | "'" Datum | '`' Datum | ',' Datum
+
+
+BareChar : ALPHA | DIGIT
+         | '!' | '$' | '%' | '&' | '*' | '+' | '-' | '/'
+         | '<' | '=' | '>' | '?' | '@' | '^' | '_' | '~'
+
+
+PipeStrElt : ~( '|' | '\' ) | '\' StringEsc
+
+QuotStrElt : ~( '"' | '\' ) | '\' StringEsc
+
+HashExpr : Rune [ '\' BareString | CladDatum ]
+         | '\' BareString
+         | '%' Label ( '%' | '=' Datum )
+         | CladDatum
+
+List : Unit* [ '.' Unit ] Blank*
+
+
+StringEsc : '\' | '|' | '"' | ( HTAB | SP )* LF ( HTAB | SP )*
+          | 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' | 'e'
+          | 'x' ( HEXDIG{2} )+ ';'
+          | 'u' HEXDIG{1,6} ';'
+
+Rune : ALPHA ( ALPHA | DIGIT ){0,5}
+Label : HEXDIG{1,12}
 ```
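Note on the decoder described in the added "Actual implementation" section: the notes say it is not implemented yet, so the following is only a rough Python model of the "data" configuration as the notes describe it, not Zisp code and not the author's design. Everything in it is an assumption made for illustration: pairs are modelled as 2-tuples, the empty list as `None`, strings as `str`, and the names `Rune`, `QUOTE`, `cons`, `lst`, and `decode_data` are hypothetical. It uses plain top-down recursion to get the effect the notes attribute to breadth-first decoding: one layer of `#QUOTE` is stripped and the payload under it is left untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rune:
    """Hypothetical stand-in for a Zisp rune such as #QUOTE."""
    name: str


QUOTE = Rune("QUOTE")
NIL = None  # assumed representation of the empty list


def cons(car, cdr):
    """Model a pair as a plain 2-tuple (an assumption, not Zisp's layout)."""
    return (car, cdr)


def lst(*items):
    """Build a proper list (a chain of pairs ending in NIL) from the items."""
    result = NIL
    for item in reversed(items):
        result = cons(item, result)
    return result


def decode_data(datum):
    """Strip one layer of (#QUOTE . x) wherever it is reached from the top.

    The payload of a stripped #QUOTE is returned as-is and never
    revisited, so nested #QUOTE pairs inside it survive.
    """
    if isinstance(datum, tuple):
        car, cdr = datum
        if car == QUOTE:
            return cdr  # unwrap, but do not descend into the payload
        return cons(decode_data(car), decode_data(cdr))
    return datum  # strings, runes, and NIL pass through unchanged
```

Under this model, the parsed form of `(code (string-append "foo" x))` decodes to the equivalent of `(code (string-append foo x))`, which is the meaning change the notes warn about. Wrapping the snippet as `'(string-append "foo" x)` makes the decoder consume only the outer quote, so the inner `(#QUOTE . foo)` is preserved.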
