author Taylan Kammer <taylan.kammer@gmail.com> 2025-03-29 11:10:24 +0100
committer Taylan Kammer <taylan.kammer@gmail.com> 2025-03-29 11:10:24 +0100
commit 451aa92846b5fd5c8a0739336de3aa26d741d750
tree 21e51213bf1d39c2a8677060c51d83a656873786 /html/notes/reader.md
parent 5025f9acf31cd880bbff62ff47ed03b69a0025ee
Relocate MD sources for HTML notes.
Diffstat (limited to 'html/notes/reader.md')
-rw-r--r-- html/notes/reader.md 470
1 files changed, 0 insertions, 470 deletions
diff --git a/html/notes/reader.md b/html/notes/reader.md
deleted file mode 100644
index ebbe1ea..0000000
--- a/html/notes/reader.md
+++ /dev/null
@@ -1,470 +0,0 @@
-# Reader? Decoder? I barely know 'er!
-
-*This started as an expansion of the following, then became its own
-article:*
-
-[Symbols are strings are symbols](symbols.html)
-
-OK but hear me out... What if there were different reader modes, for
-code and (pure) data?
-
-I want Zisp to have various neat [syntactic extensions](sugar.html)
-for programming purposes anyway, like the lambda shorthand, and these
-shouldn't apply to configuration files, either. (Although they seem
-unlikely to occur by accident.)
-
-So what if the transform from string literal to quoted string literal
-only occurred in code reading mode?
-
-At least one problem remains, which is that `'(foo "bar")` would turn
-into `(quote (foo (quote bar)))` because the reader would be in code
-mode while reading it...
-
-This reminds me of the long-standing annoyance in Scheme that "quote
-is unhygienic" and maybe we can tackle this problem as well now.
-
-Also, expressions like `'(foo '(bar))` always seemed weird to me, and
-probably have no place in Scheme, because we don't generate code via
-quote; we generate it with macros that operate on explicit syntax
-objects rather than pure data.
-
-I want to experiment with an idea like this:
-
- ;; "code reader mode" transformations
-
- '(foo bar) -> (#quote foo bar)
-
- '(foo 'bar) -> ERROR
-
- "foo" -> (#quote . foo)
-
- "foo bar" -> (#quote . "foo bar")
-
- '(foo "bar") -> (#quote foo bar)
-
- '(foo "x y") -> (#quote foo "x y")
-
-The right-hand side shows what you would get if you read the form on
-the left in code reader mode, then wrote it back out in data mode.
-
-The writer could also have a code writer mode, which applies the
-reverse transformations. The mapping should be one-to-one and
-unambiguous, so this always works. A "hygienic" way to quote is
-imperative here, since the writer could otherwise not know whether
-some `quote` keyword in a list is *the* quote special form, or just
-part of some data.
-
-We've made quote into a special token, `#quote`, to solve that.
-Instead of adding a separate symbol data type that's a subtype of
-strings, I think I'll add something called a "rune" or such that's
-represented like `#foo` and allows for custom reader extensions, or
-rather, writer extensions.
-
-Essentially, these runes would be bound to a pair of the following:
-
-1. A procedure that accepts a datum and returns some type of value.
-
-2. A procedure that takes values of that type, and turns them back
- into written format.
-
-For `#quote`, the reader procedure would be the identity function.
-The writer procedure would need to be a little more sophisticated.
-
-Note that the first procedure would not actually be called during
-reading of data. Somewhat confusingly, it would only be called in
-evaluation of code.
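A minimal sketch of such a decoder/encoder pair, written in R7RS Scheme as a stand-in for Zisp. Everything named here is an assumption made for illustration: runes are modelled as ordinary symbols whose names start with `#` (built by a hypothetical `rune` helper), and `rune-table`, `decode-form`, `encode-value` and the `#vector` rune are invented names, not part of any real Zisp API.

```scheme
(import (scheme base) (scheme write))

;; Assumption: a rune is modelled as a symbol whose name starts with
;; "#", because standard Scheme cannot read `#vector` as a symbol.
(define (rune name)
  (string->symbol (string-append "#" name)))

;; Hypothetical rune table: rune -> (decoder . encoder).
;; The decoder receives the datum that follows the rune; the encoder
;; receives a value and returns the datum that follows the rune.
(define rune-table
  (list (cons (rune "vector") (cons list->vector vector->list))
        (cons (rune "quote")  (cons (lambda (datum) datum)
                                    (lambda (value) value)))))

(define (rune-entry r)
  (let ((entry (assq r rune-table)))
    (if entry
        (cdr entry)
        (error "unknown rune" r))))

;; (#rune . datum) -> value
(define (decode-form form)
  ((car (rune-entry (car form))) (cdr form)))

;; rune + value -> (#rune . datum)
(define (encode-value r value)
  (cons r ((cdr (rune-entry r)) value)))

(define form (list (rune "vector") 1 2 3))  ; the datum (#vector 1 2 3)
(define value (decode-form form))           ; an in-memory vector

;; Encoding the decoded value gives back the original datum.
(write (equal? form (encode-value (rune "vector") value)))
(newline)
```

The `#quote` entry uses the identity function on both sides, matching the first idea above; a real encoder for it would have to do more.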
-
-Let's recap. Starting with pure data reading and writing:
-
-1. There is no special reader syntax. This s-expression format is a
-bare minimum of what's needed to represent sequential data, i.e. lists,
-and lists are the only compound data type recognized by the reader.
-Anything that isn't a list is either an atomic value, or a string
-which may or may not be considered atomic depending on how pedantic
-you want to be. Oh and runes are allowed.
-
- A. This is basically "classic" s-expressions with runes added.
- Only lists, numbers, and strings/symbols are recognized.
-
- B. Heck, numbers may not be recognized. Or maybe they will be
- limited to integers and floats, but no rationals or such, and
- reading a float will guarantee no loss of precision?
-
-2. Writing data returned by the data reader back out, in data form,
-will produce exactly what was read, with the sole exception being
-whitespace differences. The data is not allowed to contain any
-non-atomic values other than proper lists.
-
- A. It's important not to allow floats that IEEE 754 doubles can't
- represent, since then differences between input and output would
- occur. But what about numbers like "10.00"? That would also
- become something else when written back out.
-
- B. OK, maybe numbers represented in a non-canonical way are a
- second source of difference between reading and writing back out,
- but let's at least guarantee there's no loss of precision.
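The distinction between "text changes" and "value changes" can be checked with standard Scheme procedures. This is only a sketch: `rewrite-number` and `value-preserved?` are hypothetical helper names, and `string->number`/`number->string` stand in for whatever the Zisp data reader and writer would actually do.

```scheme
(import (scheme base) (scheme write))

;; Reads a numeric token and writes it back out, the way a data
;; reader/writer pair would. Returns the re-written text.
(define (rewrite-number text)
  (number->string (string->number text)))

;; The text may change (a token like "10.00" is not canonical), but
;; the value must survive the trip. R7RS requires that a number
;; written with number->string reads back as the same number.
(define (value-preserved? text)
  (let ((n (string->number text)))
    (and n (= n (string->number (rewrite-number text))))))

;; Prints each token, its re-written form, and whether the value
;; survived. The exact re-written text is implementation-dependent.
(for-each
 (lambda (text)
   (write (list text (rewrite-number text) (value-preserved? text)))
   (newline))
 '("10.00" "0.1" "42" "1e21"))
```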
-
-(I've not considered comments. Maybe they will be preserved? Maybe
-they should be implemented as code reader syntax sugar as well??)
-
-And now code reading and writing:
-
-1. Various syntax sugar is internally transformed into runes, with
-non-list compound data literals (vectors, hash tables, etc.) needing
-this type of representation to appear in code.
-
- A. Writing that data back out in data mode will reveal the inner
- workings of the language, producing output containing runes.
-
- B. Direct use of runes may be forbidden; not sure about this.
-
- C. Evaluating this data containing runes will produce, in-memory,
- the actual values being represented. The "reader procedure" tied
- to the rune is responsible for this, though the fact that it's
- evaluation and not reading that calls that procedure makes it
- confusing so a better name is needed. Maybe just "decoder."
-
-2. For every data type that falls outside the pure data syntax, there
-is a procedure that turns it into a canonical data representation
-based on lists and atomics, always using the format `(#rune ...)`.
-
- A. Another procedure is capable of turning that back into reader
- sugar, but this is not terribly important. Although it would be
- neat to be able to write out code that looks like hand-written
- program code, this really is just a bonus feature.
-
- B. For some types, turning them back into code without any runes
- may be highly complicated; procedures, in particular, would need
- decompilation to make this work.
-
-
-## Recap (or not?)
-
-Wow, that was a long "recap." I actually came up with new ideas in
-writing that. Let's recap the recap. I'll represent the mechanisms
-as different pipelines that can happen using the various features.
-
-Typical pipeline when reading and evaluating code:
-
- code-file --[code-reader]--> code-data --[eval]--> values
-              ^^^^^^^^^^^                  ^^^^
-              turns sugar into             calls rune decoders
-              rune calls                   to produce values
-              i.e. desugars code           & compiles code
-
-Reading in a [serialized program](compile.html):
-
- data-file --[data-reader]--> data --[eval]--> values
-                                      ^^^^
-                                      fairly trivial
-                                      (no lambdas, only runes)
-
-Reading pure and simple data like a config file:
-
- data-file --[data-reader]--> data (no runes to eval)
-
-Note that "data" is a subset of "values" basically. And the term
-"code-data" which was used above just means data that is meant to be
-evaluated as code, but is totally valid as pure data. This is not to
-be confused with the "data" that existed in the intermediate step
-while we were reading a serialized program; that was free of any
-forms like lambdas that need compilation.
-
-OK, that last bit was a bit confusing, and I realize it stems from
-conflating rune decoding with code compilation, so let's split that
-further up. Above, "eval" is "decode + compile" basically, but it's
-possible to separate them, for example if we want to read a file of
-serialized values that should not contain any code:
-
- values-file --[data-reader]--> values-data --[decode]--> values
-
-This is a secure way to read complex data even if it comes from an
-untrusted source. It may contain runes that represent code, such as
-in the form of `(#program "binary")` (compiled procedure) or even
-`(#lambda (x) (do-things))` but so long as you don't actually call
-those things after having decoded them, they can't do anything.
-Decoding runes can't define macros or register new rune decoders,
-meaning there's no way to achieve arbitrary code execution.
-
-Heck, although `#lambda` exists to represent the desugaring of the
-`{...}` convenience syntax, it wouldn't actually work here because
-decoding runes would happen in a null-environment without any bound
-identifiers, meaning that e.g. `(#lambda (x) (+ x x))` would just
-raise an error during decoding, because the compiler would consider
-`+` unbound.
-
-Alternatively, instead of calling the compiler, the `#lambda` decoder
-could just be a no-op that returns the same form back, but without the
-rune, like `(lambda (x) (+ x x))`, because the compiler will take care
-of that later. Yeah, I think this makes more sense. Why doesn't the
-code reader directly give `(lambda ...)` for the `{...}` sugar? Well,
-actually, the `#lambda` decoder may yield a syntax object where the
-first element specifically refers to the binding of `lambda` in the
-default environment, so you could use `{...}` in an environment where
-`lambda` is bound to something else, and you would still hygienically
-get the default lambda behavior from `{...}`. Yay!
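A rough sketch of a decoder that walks data without ever evaluating it, again in R7RS Scheme as a stand-in for Zisp. The names `rune`, `decoders` and `decode` are hypothetical, runes are modelled as symbols starting with `#`, and the `#lambda` decoder returns a plain `lambda` symbol where the note calls for a hygienic syntax object, which plain Scheme data cannot express.

```scheme
(import (scheme base) (scheme write))

;; Assumption: runes are modelled as symbols whose names start with "#".
(define (rune name)
  (string->symbol (string-append "#" name)))

;; Hypothetical decoder table. Each decoder maps the datum after the
;; rune to a value. None of them calls eval or touches an environment.
(define decoders
  (list (cons (rune "vector") list->vector)
        ;; Simplification: a plain `lambda` symbol stands in for the
        ;; syntax object that would refer to the default binding.
        (cons (rune "lambda") (lambda (datum) (cons 'lambda datum)))))

;; Walks a datum bottom-up: decodes the elements of every proper list
;; first, then applies a decoder if the list starts with a known rune.
;; Atoms and improper lists are returned unchanged.
(define (decode datum)
  (if (list? datum)
      (let* ((items (map decode datum))
             (entry (and (pair? items) (assq (car items) decoders))))
        (if entry
            ((cdr entry) (cdr items))
            items))
      datum))

;; The datum (point (#vector 1 2) (#lambda (x) (+ x x)))
(define input
  (list 'point
        (list (rune "vector") 1 2)
        (list (rune "lambda") '(x) '(+ x x))))

(write (decode input))
(newline)
```

The `#lambda` form comes out as inert list data; nothing is compiled or called, so `+` never needs to be bound during decoding.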
-
-(Wow, it's rabbit hole after rabbit hole today. This is good though.
-I'm coming up with some crazy stuff.)
-
-It would be possible to decode "code-data" and get an internal memory
-representation of an uncompiled program which however already has
-various data structure literals turned into values. This is super
-obscure, but for the sake of completeness:
-
- code-file --[code-reader]--> code-data --[decode]--> code-values
-
-(These so-called "code-values" would only ever be useful for piping
-them into the compiler. By the way, I initially used "eval" in the
-example of reading a serialized program, but "decode" would have been
-sufficient there.)
-
-
-## Here's a well-deserved break
-
-(There wasn't a new header in a while. This seemed a good spot.)
-
-Now writing pipelines. Let's reverse the above pipelines, starting
-from the bottom and working back towards the first...
-
-The reverse of the super obscure thing above:
-
- code-values --[encode]--> code-data --[code-writer]--> code-file
-
-That would only ever be useful for debugging things. Now writing a
-data structure into a serialized file, without unnecessarily invoking
-the decompiler:
-
- values --[encode]--> values-data --[data-writer]--> data-file
-
-That gives you a file containing only data, but the data is the
-encoded format of various data structures Zisp recognizes...
-Actually, that may include compiled procedures as well.
-
-Now the simple config file case, being serialized:
-
- data --[data-writer]--> data-file
-
-Now serializing a compiled program to a file, without decompilation:
-
- values --[encode]--> values-data --[data-writer]--> data-file
-           ^^^^^^                    ^^^^^^^^^^^
-           data structures           no decompilation
-           become rune calls         or "re-sugaring"
-
-Oh, look at that. It's the same as writing out data structures, as
-we've already seen... This recap of a recap will need
-another recap for sure.
-
-And now, the full decompiler:
-
- values --[uneval]--> code-data --[code-writer]--> code-file
-           ^^^^^^
-           decompilation
-
-Actually, just like "eval" is "decode + compile", the "uneval" here
-really is "decompile + encode".
-
-
-## The Revised Recap of the Recap
-
-The following exist:
-
-1. Readers:
-
- 1. Data reader: Reads lists, strings/symbols, runes, integers, and
- IEEE 754 double-precision floats without loss of precision.
-
- 2. Code reader: Reads code that can contain various syntax sugar,
- all of which has an equivalent representation with runes.
-
-2. In-memory transformers:
-
- 1. Decoder: Calls decoders for runes in data, to yield values.
-
- 2. Evaluator: [Executes aka compiles](compile.html) decoded values
- into other values.[*]
-
-3. Reverse in-memory transformers:
-
- 1. Encoder: Reverse of the decoder. (Lossless?)
-
- 2. Unevaluator: Reverse of the evaluator. (Lossy.)
-
-4. Writers:
-
- 1. Data writer: Reverse of data reader. (Lossless.)
-
- 2. Code writer: Reverse of code reader. (Lossy?)
-
-[*] This needs decoding to run first, because otherwise it wouldn't
- realize that you're e.g. calling `+` on a pair of rational number
- constants represented through runes, so constant folding wouldn't
- work. Same with `vector-ref` on a vector literal represented as a
- rune, and so on.
-
-
-## How in the seven hells did I arrive at this point?
-
-Jesus Christ!
-
-This was about symbols and strings being the same thing.
-
-But I love these rabbit holes. They're mind expanding and you find
-out so many new things you never thought about.
-
-Did you notice, by the way, that the code reader/writer above is
-essentially a parser (and unparser) you would have in a regular
-programming language, where syntax becomes an AST? The pure data
-format is basically our AST!
-
-But this doesn't mean we lost homoiconicity. No, we merely expanded
-upon it by providing a more detailed explanation of the relationship
-between textual representation of code and in-memory data that exists
-at various stages before ultimate compilation.
-
-Oh, and did we achieve our strategy of strings = symbols now, or does
-it need to be dropped? I think we achieved it. The code reader, as
-described all the way up where the section "Reconsidering AGAIN"
-begins --in the original article; see top-- will desugar string
-literals into:
-
- "foo" -> (#quote . foo)
-
-(As already described originally...)
-
-And the `#quote` rune? Well, it will not actually just return its
-operand verbatim, no! It will return a syntax object that's a list
-whose first element specifically refers to the binding of `quote`
-from the standard library. In other words, it's the evaluator that
-actually implements quote, not the decoder.
-
-Oh yes, this is very satisfying. Everything is coming together.
-
-Syntax objects, by the way, will also have a rune-based external
-representation, so you can inspect the result of macro expansion.
-
-And yes, I think using runes directly in code mode should be illegal,
-because it allows referring to bindings in the standard library, or
-even bindings in arbitrary libraries by crafting syntax objects
-represented via runes, to bypass environment limits.
-
-That bug actually existed in Guile at some point, where one could
-craft syntax objects, represented as vector literals, to refer to
-bindings in other modules, making it impossible to run code in a
-sandboxed environment. (It was fixed long ago, I believe.)
-
-Oh, but what about `#true` and `#false`? OK, maybe there will be a
-whitelist of runes that are allowed in code. That makes sense.
-
-We will see. Still more details to be fleshed out.
-
-In any case, some runes must be able to declare that they don't take
-arguments, in which case `(#rune ...)` isn't decoded by passing the
-entire form to the decoder of `#rune`, but rather treated as a normal
-list whose first element is decoded as a nullary rune. That's how
-boolean literals in code will be implemented.
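One way to picture the nullary distinction, as a sketch in R7RS Scheme standing in for Zisp. The record type `rune-info`, the `registry`, and the procedures `lookup`, `decode-atom` and `decode-form` are all hypothetical names; runes are modelled as symbols starting with `#`, and only flat (non-nested, non-empty) list forms are handled to keep the example short.

```scheme
(import (scheme base) (scheme write))

;; Assumption: runes are modelled as symbols whose names start with "#".
(define (rune name)
  (string->symbol (string-append "#" name)))

;; Hypothetical registry entry: a decoder plus a flag that says
;; whether the rune takes arguments.
(define-record-type rune-info
  (make-rune-info nullary? decoder)
  rune-info?
  (nullary? rune-nullary?)
  (decoder rune-decoder))

(define registry
  (list (cons (rune "true")   (make-rune-info #t (lambda () #t)))
        (cons (rune "false")  (make-rune-info #t (lambda () #f)))
        (cons (rune "vector") (make-rune-info #f list->vector))))

(define (lookup x)
  (let ((entry (and (symbol? x) (assq x registry))))
    (and entry (cdr entry))))

;; A nullary rune becomes its value; anything else is left alone.
(define (decode-atom x)
  (let ((info (lookup x)))
    (if (and info (rune-nullary? info))
        ((rune-decoder info))
        x)))

;; Decodes one flat, non-empty list form.
(define (decode-form form)
  (let ((info (lookup (car form))))
    (if (and info (not (rune-nullary? info)))
        ;; Rune with arguments: the rest of the form goes to its decoder.
        ((rune-decoder info) (map decode-atom (cdr form)))
        ;; Nullary rune or no rune at all: an ordinary list of atoms.
        (map decode-atom form))))

;; (#true #false) is a plain two-element list of booleans ...
(write (decode-form (list (rune "true") (rune "false"))))
(newline)
;; ... while (#vector 1 #true) is handed to the #vector decoder.
(write (decode-form (list (rune "vector") 1 (rune "true"))))
(newline)
```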
-
-
-## Looking at more of the initial problems
-
-What happened to `'(quote "foo")` in code mode being weird? Well,
-encountering an apostrophe tells the code reader that the next
-expression is a datum, so it switches to data mode for that.
-
-Wow, that was easy.
-
-This also means you can't use syntax sugar inside it, which is good
-because as we said previously, we don't want to use quoting to create
-code; we want to use syntax objects for that.
-
-This is really orthogonal to the whole runes issue, and could have
-been solved without that mechanism, but I'm happy I came up with it
-because it resolves hygiene issues.
-
-The syntax `#'(quote "foo")` would be sugar for a different rune, and
-the reader would remain in code mode, further desugaring any sugar
-found within, so this works: `#'{x (+ x x)}`
-
-Oh and I mentioned reader extensions (for code mode) but then didn't
-expand on that. Well, whenever the code reader encounters this:
-
- #foo(blah blah blah)
-
-It will turn that into:
-
- (#foo blah blah blah)
-
-After which the decoder for `#foo` will be invoked, which could have
-been registered by the programmer.
-
-Can that registration be done in the same file though? Normally, the
-execution step comes after decoding, and we decided that we don't want
-to allow arbitrary code execution to happen just when reading a data
-file and decoding it. So something exceptional would need to happen
-for this to work. Or maybe not.
-
-Remember that [compilation is execution](compile.html) in Zisp,
-meaning that compiling a file looks like this in pseudo-Scheme:
-
- (define env (null-environment))        ;start out empty
-
- (while (not (eof? input))
-   (let* ((datum (read-code input))     ;desugar
-          (value (decode datum)))       ;decode
-     (eval! value env)))                ;eval in mutable env
-
- (write (env-lookup env 'main))         ;serialize
-
-I've called eval `eval!` to indicate that it can mutate the env it
-receives, which is what import statements and defines would do.
-
-Let's modify that a little further to indicate the fact that reader
-macros, or in our terms, custom rune decoders, can be defined in the
-middle of the code file by affecting the environment:
-
- (define env (null-environment))        ;start out empty
-
- (while (not (eof? input))
-   (let* ((datum (read-code input))     ;desugar
-          (value (decode datum env)))   ;decode in env
-     (eval! value env)))                ;eval in mutable env
-
- (write (env-lookup env 'main))         ;serialize
-
-Since the `decode` procedure is given an environment, it will look up
-decoders from therein. So, after the evaluation of each top-level
-expression, the expressions coming after it could be using a custom
-decoder.
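The effect of threading the environment through `decode` can be tried out with a toy version of the loop in R7RS Scheme. This is a sketch under heavy assumptions: the environment only holds rune decoders, `eval!` is a stand-in that understands a single made-up form `(register-decoder! name proc)`, the input is a list of already-read datums instead of a port, and all helper names (`rune`, `make-env`, `env-add-decoder!`, `process-all`) are hypothetical.

```scheme
(import (scheme base) (scheme write))

;; Assumption: runes are modelled as symbols whose names start with "#".
(define (rune name)
  (string->symbol (string-append "#" name)))

;; Toy environment: an association list of rune -> decoder, kept in a
;; one-element vector so that eval! can update it in place.
(define (make-env) (vector '()))
(define (env-decoders env) (vector-ref env 0))
(define (env-add-decoder! env r proc)
  (vector-set! env 0 (cons (cons r proc) (env-decoders env))))

;; decode looks decoders up in env, so its result depends on what
;; earlier top-level forms have registered. Only top-level rune forms
;; are handled here; unknown forms pass through unchanged.
(define (decode datum env)
  (let ((entry (and (pair? datum)
                    (assq (car datum) (env-decoders env)))))
    (if entry
        ((cdr entry) (cdr datum))
        datum)))

;; Stand-in for the real evaluator: it only understands the made-up
;; form (register-decoder! name proc) and returns everything else as-is.
(define (eval! value env)
  (if (and (pair? value) (eq? (car value) 'register-decoder!))
      (begin
        (env-add-decoder! env (rune (cadr value)) (car (cddr value)))
        'registered)
      value))

;; The loop from the pseudo-code, over a list instead of a port.
;; An explicit loop is used because the forms must run in order.
(define (process-all forms)
  (let ((env (make-env)))
    (let loop ((forms forms) (results '()))
      (if (null? forms)
          (reverse results)
          (let* ((value (decode (car forms) env))
                 (result (eval! value env)))
            (loop (cdr forms) (cons result results)))))))

(define point (list (rune "point") 1 2))    ; the datum (#point 1 2)

(define results
  (process-all
   (list point                                       ; no decoder yet
         (list 'register-decoder! "point" list->vector)
         point)))                                    ; decoder now in env

(write results)
(newline)
```

The same `(#point 1 2)` datum is left untouched the first time and decoded the second time, because the registration in between changed the environment. A real implementation would presumably reject the unknown rune instead of passing it through.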
-
-What our reader macros cannot do is completely affect the lexical
-syntax of the language, as in, add more sugar. You must rely on the
-global desugaring feature of `#x(...) -> (#x ...)` which, now that I
-think about it, is completely useless because a regular macro could
-have achieved exactly the same thing.
-
-OK, let's try that again. The global desugaring wouldn't work on
-lists only, it would work on a number of things:
-
- #x"foo" -> (#x #string . foo)
-
- #x[foo] -> (#x #square . foo)
-
- #x{foo} -> (#x #braces . foo)
-
-You get the idea!
-
-(I've changed my mind that `"foo"` should desugar into a call to the
-regular `#quote` rune; it should be `#string` instead to disambiguate
-from the apostrophe if needed.)
-
-Also, all those would work without a rune as well, to allow a file to
-change the meaning of some of the default syntax sugar if desired:
-
- "foo" -> (#string . foo)
-
- [foo bar] -> (#square foo bar)
-
- {foo bar} -> (#braces foo bar)
-
-Or something like that. I'm making this all up as I go.