diff options
Diffstat (limited to 'html/notes/reader.md')
| -rw-r--r-- | html/notes/reader.md | 470 |
1 files changed, 0 insertions, 470 deletions
diff --git a/html/notes/reader.md b/html/notes/reader.md deleted file mode 100644 index ebbe1ea..0000000 --- a/html/notes/reader.md +++ /dev/null @@ -1,470 +0,0 @@ -# Reader? Decoder? I barely know 'er! - -*This started from an expansion to the following, then became its own -article:* - -[Symbols are strings are symbols](symbols.html) - -OK but hear me out... What if there were different reader modes, for -code and (pure) data? - -I want Zisp to have various neat [syntactic extensions](sugar.html) -for programming purposes anyway, like the lambda shorthand, and these -shouldn't apply to configuration files, either. (Although they seem -unlikely to occur by accident.) - -So what if the transform from string literal to quoted string literal -only occurred in code reading mode? - -At least one problem remains, which is that `'(foo "bar")` would turn -into `(quote (foo (quote bar)))` because the reader would be in code -mode while reading it... - -This reminds me of the long-standing annoyance in Scheme that "quote -is unhygienic" and maybe we can tackle this problem as well now. - -Also, expressions like `'(foo '(bar))` always seemed weird to me, and -probably have no place in Scheme, because we don't generate code via -quote; we generate it with macros that operate on explicit syntax -objects rather than pure data. - -I want to experiment with an idea like this: - - ;; "code reader mode" transformations - - '(foo bar) -> (#quote foo bar) - - '(foo 'bar) -> ERROR - - "foo" -> (#quote . foo) - - "foo bar" -> (#quote . "foo bar") - - '(foo "bar") -> (#quote foo bar) - - '(foo "x y") -> (#quote foo "x y") - -The right-hand side shows what you would get if you read the form on -the left in code reader mode, then wrote it back out in data mode. - -The writer could also have a code writer mode, which applies the -reverse transformations. There should be a one to one mapping, -unambiguous, so this always works. A "hygienic" way to quote is -imperative here, since the writer could otherwise not know whether -some `quote` keyword in a list is *the* quote special form, or just -part of some data. - -We've made quote into a special token, `#quote`, to solve that. -Instead of adding a separate symbol data type that's a subtype of -strings, I think I'll add something called a "rune" or such that's -represented like `#foo` and allows for custom reader extensions, or -rather, writer extensions. - -Essentially, these runes would be bound to a pair of the following: - -1. A procedure that accepts a datum and returns some type of value. - -2. A procedure that takes values of that type, and turns them back - into written format. - -For `#quote`, the reader procedure would be the identity function. -The writer procedure would need to be a little more sophisticated. - -Note that the first procedure would not actually be called during -reading of data. Somewhat confusingly, it would only be called in -evaluation of code. - -Let's recap. Starting with pure data reading and writing: - -1. There is no special reader syntax. This s-expression format is a -bare minimum of what's needed to represent sequential data i.e. lists -and lists are the only compound data type recognized by the reader. -Anything that isn't a list is either an atomic value, or a string -which may or may not be considered atomic depending on how pedantic -you want to be. Oh and runes are allowed. - - A. This is basically "classic" s-expressions with runes added. - Only lists, numbers, and strings/symbols are recognized. - - B. Heck, numbers may not be recognized. Or maybe they will be - limited to integers and floats, but no rationals or such, and - reading a float will guarantee no loss of precision? - -2. Writing data returned by the data reader back out, in data form, -will produce exactly what was read, with the sole exception being -whitespace differences. The data is not allowed to contain any -non-atomic values other than proper lists. - - A. It's important not to allow floats that IEEE 754 doubles can't - represent, since then differences between input and output would - occur. But what about numbers like "10.00"? That would also - become something else when written back out. - - B. OK, maybe numbers represented in a non-canonical way are a - second source of difference between reading and writing back out, - but let's at least guarantee there's no loss of precision. - -(I've not considered comments. Maybe they will be preserved? Maybe -they should be implemented as code reader syntax sugar as well??) - -And now code reading and writing: - -1. Various syntax sugar is internally transformed into runes, with -non-list compound data literals (vectors, hash tables, etc.) needing -this type of representation to appear in code. - - A. Writing that data back out in data mode will reveal the inner - workings of the language, producing output containing runes. - - B. Direct use of runes may be forbidden; not sure about this. - - C. Evaluating this data containing runes will produce, in-memory, - the actual values being represented. The "reader procedure" tied - to the rune is responsible for this, though the fact that it's - evaluation and not reading that calls that procedure makes it - confusing so a better name is needed. Maybe just "decoder." - -2. For every data type that falls outside the pure data syntax, there -is a procedure that turns it into a canonical data representation -based on lists and atomics, always using the format `(#rune ...)`. - - A. Another procedure is capable of turning that back into reader - sugar, but this is not terribly important. Although it would be - neat to be able to write out code that looks like hand-written - program code, this really is just a bonus feature. - - B. For some types, turning them back into code without any runes - may be highly complicated; procedures, in particular, would need - decompilation to make this work. - - -## Recap (or not?) - -Wow, that was a long "recap." I actually came up with new ideas in -writing that. Let's recap the recap. I'll represent the mechanisms -as different pipelines that can happen using the various features. - -Typical pipeline when reading and evaluating code: - - code-file --[code-reader]--> code-data --[eval]--> values - ^^^^^^^^^^^ ^^^^ - turns sugar into calls rune decoders - rune calls to produce values - i.e. desugars code & compiles code - -Reading in a [serialized program](compile.html): - - data-file --[data-reader]--> data --[eval]--> values - ^^^^ - fairly trivial - (no lambdas, only runes) - -Reading pure and simple data like a config file: - - data-file --[data-reader]--> data (no runes to eval) - -Note that "data" is a subset of "values" basically. And the term -"code-data" which was used above just means data that is meant to be -evaluated as code, but is totally valid as pure data. This is not to -be confused with the "data" that existed in the intermediate step -while we were reading a serialized program; that was absent of any -forms like lambdas that need compilation. - -OK, that last bit was a bit confusing, and I realize it stems from -conflating rune decoding with code compilation, so let's split that -further up. Above, "eval" is "decode + compile" basically, but it's -possible to separate them, for example if we want to read a file of -serialized values that should not contain any code: - - values-file --[data-reader]--> values-data --[decode]--> values - -This is a secure way to read complex data even if it comes from an -untrusted source. It may contain runes that represent code, such as -in the form of `(#program "binary")` (compiled procedure) or even -`(#lambda (x) (do-things))` but so long as you don't actually call -those things after having decoded them, they can't do anything. -Decoding runes can't define macros or register new rune decoders, -meaning there's no way to achieve arbitrary code execution. - -Heck, although `#lambda` exists to represent the desugaring of the -`{...}` convenience syntax, it wouldn't actually work here because -decoding runes would happen in a null-environment without any bound -identifiers, meaning that e.g. `(#lambda (x) (+ x x))` would just -raise an error during decoding, because the compiler would consider -`+` unbound. - -Alternatively, instead of calling the compiler, the `#lambda` decoder -could just be a no-op that returns the same form back, but without the -rune, like `(lambda (x) (+ x x))`, because the compiler will take care -of that later. Yeah, I think this makes more sense. Why doesn't the -code reader directly give `(lambda ...)` for the `{...}` sugar? Well, -actually, the `#lambda` decoder may yield a syntax object where the -first element specifically refers to the binding of `lambda` in the -default environment, so you could use `{...}` in an environment where -`lambda` is bound to something else, and you would still hygienically -get the default lambda behavior from `{...}`. Yay! - -(Wow, it's rabbit hole after rabbit hole today. This is good though. -I'm coming up with some crazy stuff.) - -It would be possible to decode "code-data" and get an internal memory -representation of an uncompiled program which however already has -various data structure literals turned into values. This is super -obscure but for sake of completeness: - - code-file --[code-reader]--> code-data --[decode]--> code-values - -(These so-called "code-values" would only ever be useful for piping -them into the compiler. By the way, I initially used "eval" in the -example of reading a serialized program, but "decode" would have been -sufficient there.) - - -## Here's a well-deserved break - -(There wasn't a new header in a while. This seemed a good spot.) - -Now writing pipelines. Let's reverse the above pipelines, from the -bottom back towards eventually the first... - -The reverse of the super obscure thing above: - - code-values --[encode]--> code-data --[code-writer]--> code-file - -That would only ever be useful for debugging things. Now writing a -data structure into a serialized file, without unnecessarily invoking -the decompiler: - - values --[encode]--> values-data --[data-writer]--> data-file - -That gives you a file containing only data, but the data is the -encoded format of various data structures Zisp recognizes... -Actually, that may include compiled procedures as well. - -Now the simple config file case, being serialized: - - data -[data writer]-> data-file - -Now serializing a compiled program to a file, without decompilation: - - values --[encode]--> values-data --[data-writer]--> data-file - ^^^^^^ ^^^^^^^^^^^ - data structures no decompilation - become rune calls or "re-sugaring" - -Oh, look at that. It's the same as writing out data structures, as -we've already seen previously... This recap of a recap will need -another recap for sure. - -And now, the full decompiler: - - values --[uneval]--> code-data --[code-writer]--> code-file - ^^^^^^ - decompilation - -Actually, just like "eval" is "decode + compile", the "uneval" here -really is "decompile + encode". - - -## The Revised Recap of the Recap - -The following exist: - -1. Readers: - - 1. Data reader: Reads lists, strings/symbols, runes, integers, and - IEEE 754 double-precision floats without loss of precision. - - 2. Code reader: Reads code that can contain various syntax sugar, - all of which has an equivalent representation with runes. - -2. In-memory transformers: - - 1. Decoder: Calls decoders for runes in data, to yield values. - - 2. Evaluator: [Executes aka compiles](compile.html) decoded values - into other values.[*] - -3. Reverse in-memory transformers: - - 1. Encoder: Reverse of the decoder. (Lossless?) - - 2. Unevaluator: Reverse of the evaluator. (Lossy.) - -4. Writers: - - 1. Data writer: Reverse of data reader. (Lossless.) - - 2. Code writer: Reverse of code reader. (Lossy?) - -(*) This needs decoding to run first, because otherwise it wouldn't - realize that you're e.g. calling `+` on a pair of rational number - constants represented through runes, so constant folding wouldn't - work. Same with `vector-ref` on a vector literal represented as a - rune, and so on. - - -## How in the seven hells did I arrive at this point? - -Jesus Christ! - -This was about symbols and strings being the same thing. - -But I love these rabbit holes. They're mind expanding and you find -out so many new things you never thought about. - -Did you notice, by the way, that the code reader/writer above is -essentially a parser (and unparser) you would have in a regular -programming language, where syntax becomes an AST? The pure data -format is basically our AST! - -But this doesn't mean we lost homoiconicity. No, we merely expanded -upon it by providing a more detailed explanation of the relationship -between textual representation of code and in-memory data that exists -at various stages before ultimate compilation. - -Oh, and did we achieve our strategy of strings = symbols now, or does -it need to be dropped? I think we achieved it. The code reader, as -described all the way up where the section "Reconsidering AGAIN" -begins --in the original article; see top-- will desugar string -literals into: - - "foo" -> (#quote foo) - -(As already described originally...) - -And the `#quote` rune? Well, it will not actually just return its -operand verbatim, no! It will return a syntax object that's a list -with the first element specifically refers to the binding of `quote` -from the standard library. In other words, it's the evaluator that -actually implements quote, not the decoder. - -Oh yes, this is very satisfying. Everything is coming together. - -Syntax objects, by the way, will also have a rune-based external -representation, so you can inspect the result of macro expansion. - -And yes, I think using runes directly in code mode should be illegal, -because it allows referring to bindings in the standard library, or -even bindings in arbitrary libraries by crafting syntax objects -represented via runes, to bypass environment limits. - -That bug actually existed in Guile at some point, where one could -craft syntax objects, represented as vector literals, to refer to -bindings in other modules, making it impossible to run code in a -sandboxed environment. (It was fixed long ago, I believe.) - -Oh, but what about `#true` and `#false`? OK, maybe there will be a -whitelist of runes that are allowed in code. That makes sense. - -We will see. Still more details to be fleshed out. - -In any case, some runes must be able to declare that they don't take -arguments, in which case `(#rune ...)` isn't decoded by passing the -entire form to the decoder of `#rune`, but rather treated as a normal -list whose first element is decoded as a nullary rune. That's how -boolean literals in code will be implemented. - - -## Looking at more of the initial problems - -What happened to `'(quote "foo")` in code mode being weird? Well, -encountering an apostrophe tells the code reader that the next -expression is a datum, so it switches to data mode for that. - -Wow, that was easy. - -This also means you can't use syntax sugar inside it, which is good -because as we said previously, we don't want to use quoting to create -code; we want to use syntax objects for that. - -This is really orthogonal to the whole runes issue, and could have -been solved without that mechanism, but I'm happy I came up with it -because it resolves hygiene issues. - -The syntax `#'(quote "foo")` would be sugar for a different rune, and -the reader would remain in code mode, further desugaring any sugar -found within, so this works: `#'{x (+ x x)}` - -Oh and I mentioned reader extensions (for code mode) but then didn't -expand on that. Well, whenever the code reader encounters this: - - #foo(blah blah blah) - -It will turn that into: - - (#foo blah blah blah) - -After which the decoder for `#foo` will be invoked, which could have -been registered by the programmer. - -Can that registration be done in the same file though? Normally, the -execution step comes after decoding, and we decided that we don't want -to allow arbitrary code execution to happen just when reading a data -file and decoding it. So something exceptional would need to happen -for this to work. Or maybe not. - -Remember that [compilation is execution](compile.html) in Zisp, -meaning that compiling a file looks like this in pseudo-Scheme: - - (define env (null-environment)) ;start out empty - - (while (not (eof? input)) - (let* ((datum (read-code input)) ;desugar - (value (decode datum))) ;decode - (eval! value env))) ;eval in mutable env - - (write (env-lookup env 'main)) ;serialize - -I've called eval `eval!` to indicate that it can mutate the env it -receives, which is what import statements and defines would do. - -Let's modify that a little further to indicate the fact that reader -macros, or in our terms, custom rune decoders, can be defined in the -middle of the code file by affecting the environment: - - (define env (null-environment)) ;start out empty - - (while (not (eof? input)) - (let* ((datum (read-code input)) ;desugar - (value (decode datum env))) ;decode in env - (eval! value env))) ;eval in mutable env - - (write (env-lookup env 'main)) ;serialize - -Since the `decode` procedure is given an environment, it will look up -decoders from therein. So, after the evaluation of each top-level -expression, the expressions coming after it could be using a custom -decoder. - -What our reader macros cannot do is completely affect the lexical -syntax of the language, as in, add more sugar. You must rely on the -global desugaring feature of `#x(...) -> (#x ...)` which, now that I -think about it, is completely useless because a regular macro could -have achieved exactly the same thing. - -OK, let's try that again. The global desugaring wouldn't work on -lists only, it would work on a number of things: - - #x"foo" -> (#x #string . foo) - - #x[foo] -> (#x #square . foo) - - #x{foo} -> (#x #braces . foo) - -You get the idea! - -(I've changed my mind that `"foo"` should desugar into a call to the -regular `#quote` rune; it should be `#string` instead to disambiguate -from the apostrophe if needed.) - -Also, all those would work without a rune as well, to allow a file to -change the meaning of some of the default syntax sugar if desired: - - "foo" -> (#string . foo) - - [foo bar] -> (#square foo bar) - - {foo bar} -> (#braces foo bar) - -Or something like that. I'm making this all up as I go. |
