author Taylan Kammer <taylan.kammer@gmail.com> 2025-02-19 23:29:26 +0100
committer Taylan Kammer <taylan.kammer@gmail.com> 2025-02-19 23:29:26 +0100
commit 4e88891235664917a2db44b84c0bbeeb13dd71ad
tree 7ed8ac2272ce92054fdf2f4e5e09b156dfc5a4d1 /html
parent 4d0db1a1065f18d879b3ff90da6ecb14e9e1ae31
update
Diffstat (limited to 'html')
-rw-r--r-- html/notes/compile.md (renamed from html/notes/compilation.md) 0
-rw-r--r-- html/notes/format.md 12
-rw-r--r-- html/notes/reader.md 470
-rw-r--r-- html/notes/serialize.md 3
-rw-r--r-- html/notes/symbols.md 44
-rw-r--r-- html/style.css 2
6 files changed, 528 insertions, 3 deletions
diff --git a/html/notes/compilation.md b/html/notes/compile.md
index 4d5fc6d..4d5fc6d 100644
--- a/html/notes/compilation.md
+++ b/html/notes/compile.md
diff --git a/html/notes/format.md b/html/notes/format.md
new file mode 100644
index 0000000..39da84a
--- /dev/null
+++ b/html/notes/format.md
@@ -0,0 +1,12 @@
+WIP WIP WIP
+
+(format "template" arg ...) ;sprintf
+(format obj) ;like write but returns string
+(format! "template" arg ...) ;printf
+(format! arg) ;write
+
+The ones with a string template are special forms and process the
+template string at compile time and ensure correct number of args.
+
+Need a way to let a special form's name also appear as an identifier
+like Guile does it with record accessors and shit.
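
Note on format.md: the compile-time argument check it describes could look roughly like the following. This is a minimal sketch in Python, not Zisp. The note does not specify a directive syntax, so `~a`/`~s` with `~~` as a literal tilde is borrowed from Guile's `format` purely as an assumption, and the names `count_directives` and `check_format_call` are hypothetical.

```python
import re

# Assumption: Guile-style directives. "~a" and "~s" each consume one
# argument; "~~" is a literal tilde and consumes none.
DIRECTIVE = re.compile(r"~(.)")


def count_directives(template):
    """Count the directives in template that consume an argument."""
    return sum(1 for m in DIRECTIVE.finditer(template) if m.group(1) in "as")


def check_format_call(template, args):
    """Reject a (format "template" arg ...) form with the wrong arg count."""
    expected = count_directives(template)
    if expected != len(args):
        raise SyntaxError(
            "format: template expects %d args, got %d" % (expected, len(args))
        )


check_format_call("~a is ~s", ["x", 42])  # accepted: two directives, two args
```

Because the template is a literal in the special form, a check like this can run once during compilation instead of on every call.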
diff --git a/html/notes/reader.md b/html/notes/reader.md
new file mode 100644
index 0000000..ebbe1ea
--- /dev/null
+++ b/html/notes/reader.md
@@ -0,0 +1,470 @@
+# Reader? Decoder? I barely know 'er!
+
+*This started from an expansion to the following, then became its own
+article:*
+
+[Symbols are strings are symbols](symbols.html)
+
+OK but hear me out... What if there were different reader modes, for
+code and (pure) data?
+
+I want Zisp to have various neat [syntactic extensions](sugar.html)
+for programming purposes anyway, like the lambda shorthand, and these
+shouldn't apply to configuration files, either. (Although they seem
+unlikely to occur by accident.)
+
+So what if the transform from string literal to quoted string literal
+only occurred in code reading mode?
+
+At least one problem remains, which is that `'(foo "bar")` would turn
+into `(quote (foo (quote bar)))` because the reader would be in code
+mode while reading it...
+
+This reminds me of the long-standing annoyance in Scheme that "quote
+is unhygienic" and maybe we can tackle this problem as well now.
+
+Also, expressions like `'(foo '(bar))` always seemed weird to me, and
+probably have no place in Scheme, because we don't generate code via
+quote; we generate it with macros that operate on explicit syntax
+objects rather than pure data.
+
+I want to experiment with an idea like this:
+
+ ;; "code reader mode" transformations
+
+ '(foo bar) -> (#quote foo bar)
+
+ '(foo 'bar) -> ERROR
+
+ "foo" -> (#quote . foo)
+
+ "foo bar" -> (#quote . "foo bar")
+
+ '(foo "bar") -> (#quote foo bar)
+
+ '(foo "x y") -> (#quote foo "x y")
+
+The right-hand side shows what you would get if you read the form on
+the left in code reader mode, then wrote it back out in data mode.
+
+The writer could also have a code writer mode, which applies the
+reverse transformations. There should be a one to one mapping,
+unambiguous, so this always works. A "hygienic" way to quote is
+imperative here, since the writer could otherwise not know whether
+some `quote` keyword in a list is *the* quote special form, or just
+part of some data.
+
+We've made quote into a special token, `#quote`, to solve that.
+Instead of adding a separate symbol data type that's a subtype of
+strings, I think I'll add something called a "rune" or such that's
+represented like `#foo` and allows for custom reader extensions, or
+rather, writer extensions.
+
+Essentially, these runes would be bound to a pair of the following:
+
+1. A procedure that accepts a datum and returns some type of value.
+
+2. A procedure that takes values of that type, and turns them back
+ into written format.
+
+For `#quote`, the reader procedure would be the identity function.
+The writer procedure would need to be a little more sophisticated.
+
+Note that the first procedure would not actually be called during
+reading of data. Somewhat confusingly, it would only be called in
+evaluation of code.
+
+Let's recap. Starting with pure data reading and writing:
+
+1. There is no special reader syntax. This s-expression format is a
+bare minimum of what's needed to represent sequential data, i.e. lists,
+and lists are the only compound data type recognized by the reader.
+Anything that isn't a list is either an atomic value, or a string
+which may or may not be considered atomic depending on how pedantic
+you want to be. Oh and runes are allowed.
+
+ A. This is basically "classic" s-expressions with runes added.
+ Only lists, numbers, and strings/symbols are recognized.
+
+ B. Heck, numbers may not be recognized. Or maybe they will be
+ limited to integers and floats, but no rationals or such, and
+ reading a float will guarantee no loss of precision?
+
+2. Writing data returned by the data reader back out, in data form,
+will produce exactly what was read, with the sole exception being
+whitespace differences. The data is not allowed to contain any
+non-atomic values other than proper lists.
+
+ A. It's important not to allow floats that IEEE 754 doubles can't
+ represent, since then differences between input and output would
+ occur. But what about numbers like "10.00"? That would also
+ become something else when written back out.
+
+ B. OK, maybe numbers represented in a non-canonical way are a
+ second source of difference between reading and writing back out,
+ but let's at least guarantee there's no loss of precision.
+
+(I've not considered comments. Maybe they will be preserved? Maybe
+they should be implemented as code reader syntax sugar as well??)
+
+And now code reading and writing:
+
+1. Various syntax sugar is internally transformed into runes, with
+non-list compound data literals (vectors, hash tables, etc.) needing
+this type of representation to appear in code.
+
+ A. Writing that data back out in data mode will reveal the inner
+ workings of the language, producing output containing runes.
+
+ B. Direct use of runes may be forbidden; not sure about this.
+
+ C. Evaluating this data containing runes will produce, in-memory,
+ the actual values being represented. The "reader procedure" tied
+ to the rune is responsible for this, though the fact that it's
+ evaluation and not reading that calls that procedure makes it
+ confusing so a better name is needed. Maybe just "decoder."
+
+2. For every data type that falls outside the pure data syntax, there
+is a procedure that turns it into a canonical data representation
+based on lists and atomics, always using the format `(#rune ...)`.
+
+ A. Another procedure is capable of turning that back into reader
+ sugar, but this is not terribly important. Although it would be
+ neat to be able to write out code that looks like hand-written
+ program code, this really is just a bonus feature.
+
+ B. For some types, turning them back into code without any runes
+ may be highly complicated; procedures, in particular, would need
+ decompilation to make this work.
+
+
+## Recap (or not?)
+
+Wow, that was a long "recap." I actually came up with new ideas in
+writing that. Let's recap the recap. I'll represent the mechanisms
+as different pipelines that can happen using the various features.
+
+Typical pipeline when reading and evaluating code:
+
+ code-file --[code-reader]--> code-data --[eval]--> values
+              ^^^^^^^^^^^                  ^^^^
+          turns sugar into           calls rune decoders
+          rune calls                 to produce values
+          i.e. desugars code         & compiles code
+
+Reading in a [serialized program](compile.html):
+
+ data-file --[data-reader]--> data --[eval]--> values
+                                      ^^^^
+                                 fairly trivial
+                            (no lambdas, only runes)
+
+Reading pure and simple data like a config file:
+
+ data-file --[data-reader]--> data (no runes to eval)
+
+Note that "data" is a subset of "values" basically. And the term
+"code-data" which was used above just means data that is meant to be
+evaluated as code, but is totally valid as pure data. This is not to
+be confused with the "data" that existed in the intermediate step
+while we were reading a serialized program; that was devoid of any
+forms like lambdas that need compilation.
+
+OK, that last bit was a bit confusing, and I realize it stems from
+conflating rune decoding with code compilation, so let's split that
+up further. Above, "eval" is "decode + compile" basically, but it's
+possible to separate them, for example if we want to read a file of
+serialized values that should not contain any code:
+
+ values-file --[data-reader]--> values-data --[decode]--> values
+
+This is a secure way to read complex data even if it comes from an
+untrusted source. It may contain runes that represent code, such as
+in the form of `(#program "binary")` (compiled procedure) or even
+`(#lambda (x) (do-things))` but so long as you don't actually call
+those things after having decoded them, they can't do anything.
+Decoding runes can't define macros or register new rune decoders,
+meaning there's no way to achieve arbitrary code execution.
+
+Heck, although `#lambda` exists to represent the desugaring of the
+`{...}` convenience syntax, it wouldn't actually work here because
+decoding runes would happen in a null-environment without any bound
+identifiers, meaning that e.g. `(#lambda (x) (+ x x))` would just
+raise an error during decoding, because the compiler would consider
+`+` unbound.
+
+Alternatively, instead of calling the compiler, the `#lambda` decoder
+could just be a no-op that returns the same form back, but without the
+rune, like `(lambda (x) (+ x x))`, because the compiler will take care
+of that later. Yeah, I think this makes more sense. Why doesn't the
+code reader directly give `(lambda ...)` for the `{...}` sugar? Well,
+actually, the `#lambda` decoder may yield a syntax object where the
+first element specifically refers to the binding of `lambda` in the
+default environment, so you could use `{...}` in an environment where
+`lambda` is bound to something else, and you would still hygienically
+get the default lambda behavior from `{...}`. Yay!
+
+(Wow, it's rabbit hole after rabbit hole today. This is good though.
+I'm coming up with some crazy stuff.)
+
+It would be possible to decode "code-data" and get an internal memory
+representation of an uncompiled program which however already has
+various data structure literals turned into values. This is super
+obscure but for the sake of completeness:
+
+ code-file --[code-reader]--> code-data --[decode]--> code-values
+
+(These so-called "code-values" would only ever be useful for piping
+them into the compiler. By the way, I initially used "eval" in the
+example of reading a serialized program, but "decode" would have been
+sufficient there.)
+
+
+## Here's a well-deserved break
+
+(There wasn't a new header in a while. This seemed a good spot.)
+
+Now writing pipelines. Let's reverse the above pipelines, from the
+bottom back towards eventually the first...
+
+The reverse of the super obscure thing above:
+
+ code-values --[encode]--> code-data --[code-writer]--> code-file
+
+That would only ever be useful for debugging things. Now writing a
+data structure into a serialized file, without unnecessarily invoking
+the decompiler:
+
+ values --[encode]--> values-data --[data-writer]--> data-file
+
+That gives you a file containing only data, but the data is the
+encoded format of various data structures Zisp recognizes...
+Actually, that may include compiled procedures as well.
+
+Now the simple config file case, being serialized:
+
+ data --[data-writer]--> data-file
+
+Now serializing a compiled program to a file, without decompilation:
+
+ values --[encode]--> values-data --[data-writer]--> data-file
+           ^^^^^^                    ^^^^^^^^^^^
+      data structures             no decompilation
+      become rune calls           or "re-sugaring"
+
+Oh, look at that. It's the same as writing out data structures, as
+we've already seen previously... This recap of a recap will need
+another recap for sure.
+
+And now, the full decompiler:
+
+ values --[uneval]--> code-data --[code-writer]--> code-file
+           ^^^^^^
+       decompilation
+
+Actually, just like "eval" is "decode + compile", the "uneval" here
+really is "decompile + encode".
+
+
+## The Revised Recap of the Recap
+
+The following exist:
+
+1. Readers:
+
+ 1. Data reader: Reads lists, strings/symbols, runes, integers, and
+ IEEE 754 double-precision floats without loss of precision.
+
+ 2. Code reader: Reads code that can contain various syntax sugar,
+ all of which has an equivalent representation with runes.
+
+2. In-memory transformers:
+
+ 1. Decoder: Calls decoders for runes in data, to yield values.
+
+ 2. Evaluator: [Executes aka compiles](compile.html) decoded values
+ into other values.(*)
+
+3. Reverse in-memory transformers:
+
+ 1. Encoder: Reverse of the decoder. (Lossless?)
+
+ 2. Unevaluator: Reverse of the evaluator. (Lossy.)
+
+4. Writers:
+
+ 1. Data writer: Reverse of data reader. (Lossless.)
+
+ 2. Code writer: Reverse of code reader. (Lossy?)
+
+(*) This needs decoding to run first, because otherwise it wouldn't
+ realize that you're e.g. calling `+` on a pair of rational number
+ constants represented through runes, so constant folding wouldn't
+ work. Same with `vector-ref` on a vector literal represented as a
+ rune, and so on.
+
+
+## How in the seven hells did I arrive at this point?
+
+Jesus Christ!
+
+This was about symbols and strings being the same thing.
+
+But I love these rabbit holes. They're mind expanding and you find
+out so many new things you never thought about.
+
+Did you notice, by the way, that the code reader/writer above is
+essentially a parser (and unparser) you would have in a regular
+programming language, where syntax becomes an AST? The pure data
+format is basically our AST!
+
+But this doesn't mean we lost homoiconicity. No, we merely expanded
+upon it by providing a more detailed explanation of the relationship
+between textual representation of code and in-memory data that exists
+at various stages before ultimate compilation.
+
+Oh, and did we achieve our strategy of strings = symbols now, or does
+it need to be dropped? I think we achieved it. The code reader, as
+described all the way up where the section "Reconsidering AGAIN"
+begins --in the original article; see top-- will desugar string
+literals into:
+
+ "foo" -> (#quote foo)
+
+(As already described originally...)
+
+And the `#quote` rune? Well, it will not actually just return its
+operand verbatim, no! It will return a syntax object that's a list
+whose first element specifically refers to the binding of `quote`
+from the standard library. In other words, it's the evaluator that
+actually implements quote, not the decoder.
+
+Oh yes, this is very satisfying. Everything is coming together.
+
+Syntax objects, by the way, will also have a rune-based external
+representation, so you can inspect the result of macro expansion.
+
+And yes, I think using runes directly in code mode should be illegal,
+because it allows referring to bindings in the standard library, or
+even bindings in arbitrary libraries by crafting syntax objects
+represented via runes, to bypass environment limits.
+
+That bug actually existed in Guile at some point, where one could
+craft syntax objects, represented as vector literals, to refer to
+bindings in other modules, making it impossible to run code in a
+sandboxed environment. (It was fixed long ago, I believe.)
+
+Oh, but what about `#true` and `#false`? OK, maybe there will be a
+whitelist of runes that are allowed in code. That makes sense.
+
+We will see. Still more details to be fleshed out.
+
+In any case, some runes must be able to declare that they don't take
+arguments, in which case `(#rune ...)` isn't decoded by passing the
+entire form to the decoder of `#rune`, but rather treated as a normal
+list whose first element is decoded as a nullary rune. That's how
+boolean literals in code will be implemented.
+
+
+## Looking at more of the initial problems
+
+What happened to `'(quote "foo")` in code mode being weird? Well,
+encountering an apostrophe tells the code reader that the next
+expression is a datum, so it switches to data mode for that.
+
+Wow, that was easy.
+
+This also means you can't use syntax sugar inside it, which is good
+because as we said previously, we don't want to use quoting to create
+code; we want to use syntax objects for that.
+
+This is really orthogonal to the whole runes issue, and could have
+been solved without that mechanism, but I'm happy I came up with it
+because it resolves hygiene issues.
+
+The syntax `#'(quote "foo")` would be sugar for a different rune, and
+the reader would remain in code mode, further desugaring any sugar
+found within, so this works: `#'{x (+ x x)}`
+
+Oh and I mentioned reader extensions (for code mode) but then didn't
+expand on that. Well, whenever the code reader encounters this:
+
+ #foo(blah blah blah)
+
+It will turn that into:
+
+ (#foo blah blah blah)
+
+After which the decoder for `#foo` will be invoked, which could have
+been registered by the programmer.
+
+Can that registration be done in the same file though? Normally, the
+execution step comes after decoding, and we decided that we don't want
+to allow arbitrary code execution to happen just when reading a data
+file and decoding it. So something exceptional would need to happen
+for this to work. Or maybe not.
+
+Remember that [compilation is execution](compile.html) in Zisp,
+meaning that compiling a file looks like this in pseudo-Scheme:
+
+ (define env (null-environment)) ;start out empty
+
+ (while (not (eof? input))
+ (let* ((datum (read-code input)) ;desugar
+ (value (decode datum))) ;decode
+ (eval! value env))) ;eval in mutable env
+
+ (write (env-lookup env 'main)) ;serialize
+
+I've called eval `eval!` to indicate that it can mutate the env it
+receives, which is what import statements and defines would do.
+
+Let's modify that a little further to indicate the fact that reader
+macros, or in our terms, custom rune decoders, can be defined in the
+middle of the code file by affecting the environment:
+
+ (define env (null-environment)) ;start out empty
+
+ (while (not (eof? input))
+ (let* ((datum (read-code input)) ;desugar
+ (value (decode datum env))) ;decode in env
+ (eval! value env))) ;eval in mutable env
+
+ (write (env-lookup env 'main)) ;serialize
+
+Since the `decode` procedure is given an environment, it will look up
+decoders from therein. So, after the evaluation of each top-level
+expression, the expressions coming after it could be using a custom
+decoder.
+
+What our reader macros cannot do is completely affect the lexical
+syntax of the language, as in, add more sugar. You must rely on the
+global desugaring feature of `#x(...) -> (#x ...)` which, now that I
+think about it, is completely useless because a regular macro could
+have achieved exactly the same thing.
+
+OK, let's try that again. The global desugaring wouldn't work on
+lists only; it would work on a number of things:
+
+ #x"foo" -> (#x #string . foo)
+
+ #x[foo] -> (#x #square . foo)
+
+ #x{foo} -> (#x #braces . foo)
+
+You get the idea!
+
+(I've changed my mind that `"foo"` should desugar into a call to the
+regular `#quote` rune; it should be `#string` instead to disambiguate
+from the apostrophe if needed.)
+
+Also, all those would work without a rune as well, to allow a file to
+change the meaning of some of the default syntax sugar if desired:
+
+ "foo" -> (#string . foo)
+
+ [foo bar] -> (#square foo bar)
+
+ {foo bar} -> (#braces foo bar)
+
+Or something like that. I'm making this all up as I go.
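
Note on reader.md: the second compile loop there (decode in env, then eval in the same mutable env) implies that a decoder registered by one top-level form is visible to the forms after it, and only to those. Below is a toy model of that ordering in Python, not Zisp. Nested Python lists stand in for data. `Rune`, `decode`, `eval_toplevel`, `compile_forms` and the two top-level forms `define-decoder` and `define` are all hypothetical names made up for this sketch. Nullary runes are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rune:
    """Hypothetical in-memory form of a rune such as #point."""
    name: str


def decode(datum, env):
    """Replace each (#rune ...) list with the value its decoder returns."""
    if not isinstance(datum, list):
        return datum  # atoms and bare runes pass through untouched
    items = [decode(item, env) for item in datum]  # inner forms first
    if items and isinstance(items[0], Rune):
        decoder = env["decoders"].get(items[0].name)
        if decoder is None:
            raise LookupError("no decoder registered for " + items[0].name)
        return decoder(items[1:])
    return items


def eval_toplevel(value, env):
    """Toy stand-in for eval!: it only understands two made-up forms."""
    head = value[0]
    if head == "define-decoder":  # (define-decoder #rune proc)
        env["decoders"][value[1].name] = value[2]
    elif head == "define":  # (define name value)
        env["vars"][value[1]] = value[2]
    else:
        raise ValueError("unknown top-level form: " + repr(head))


def compile_forms(forms):
    """Decode and evaluate top-level forms one at a time, in order."""
    env = {"decoders": {}, "vars": {}}  # start out empty
    for datum in forms:  # stands in for repeated read-code calls
        eval_toplevel(decode(datum, env), env)
    return env


program = [
    ["define-decoder", Rune("#point"), lambda args: tuple(args)],
    ["define", "main", [Rune("#point"), 1, 2]],
]
env = compile_forms(program)
print(env["vars"]["main"])  # (1, 2)
```

Swapping the two forms in `program` raises `LookupError`, which is the point: decoding a form never sees registrations made by forms that come later in the file.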
diff --git a/html/notes/serialize.md b/html/notes/serialize.md
index e35177e..fb9963a 100644
--- a/html/notes/serialize.md
+++ b/html/notes/serialize.md
@@ -1,7 +1,6 @@
# Everything can be serialized
-Let's look at the code mentioned in [compilation](compilation.html)
-again:
+Let's look at the code mentioned in [compilation](compile.html) again:
```scheme
diff --git a/html/notes/symbols.md b/html/notes/symbols.md
index aa3c448..f45f9cf 100644
--- a/html/notes/symbols.md
+++ b/html/notes/symbols.md
@@ -18,6 +18,7 @@ Instead of `string->symbol` we will have `string-intern` which
basically does the same thing. Dynamically generated strings that
aren't passed to this function will be uninterned.
+
## But but but
(Late addition because I didn't even notice this problem at first.
@@ -66,3 +67,46 @@ That prints: "base to us"
I'm not married to the syntax `#"string"` and may end up using the
simpler `|foo|` in the end. It doesn't really matter.
+
+
+## More problems
+
+Dangit, couldn't have been so easy, could it?
+
+What if you have a configuration file with these contents:
+
+ ((job-name "Clear /tmp directory")
+ (interval reboot)
+ (command "find /tmp -mindepth 1 -delete"))
+
+Now all those "string constants" will turn into lists with the string
+"quote" at their head. Terrible. One could write it with the explicit
+string literal syntax `#"foo"` for strings, but that's also terrible.
+
+
+## Salvageable
+
+I'm not yet done with this idea. What if strings simply have a flag
+that says whether they are intended as a symbol or not?
+
+While reading, it would be set automatically. Instead of `intern`,
+one would call a function like `symbol`, which would return a string
+with the flag set, after interning it if necessary; it would simply
+return the original string if it already had the flag set.
+
+Another way to look at this is that strings and symbols are sort of
+"polymorphic" and can be used interchangeably. I don't want to get
+into JavaScript style automatic type conversions (yuck) but this may
+simply be a flag that's set on a string, which makes it a subtype of
+regular strings.
+
+Yes yes, I think that's good. I even still have enough space left in
+the NaN-packing strategy to put a tag on "short strings" which are our
+6-byte immediate strings.
+
+
+## Reconsidering AGAIN
+
+*This got too long and off-topic so it continues here:*
+
+[Reader? Decoder? I barely know 'er!](reader.html)
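
Note on symbols.md: the "Salvageable" idea (a flag on strings, plus a `symbol` function that interns on demand and returns already-flagged strings unchanged) can be modelled in a few lines. This is a sketch in Python, not Zisp. The flag is modelled as a subclass, matching the "subtype of regular strings" wording; `ZString`, `ZSymbol` and `_interned` are hypothetical names, and the NaN-packing representation is not modelled.

```python
class ZString(str):
    """Plain string; the flag says whether it is intended as a symbol."""
    is_symbol = False


class ZSymbol(ZString):
    """Flagged string: a subtype of regular strings, always interned."""
    is_symbol = True


_interned = {}  # text -> the unique ZSymbol carrying that text


def symbol(s):
    """Return s itself if already flagged, else the interned flagged copy."""
    if isinstance(s, ZSymbol):
        return s
    return _interned.setdefault(str(s), ZSymbol(s))


a = symbol(ZString("foo"))
b = symbol("foo")
print(a is b, a == "foo", symbol(a) is a)  # True True True
```

Flagged and unflagged strings still compare equal by content here, which is the interchangeable-use property the note asks for; only identity and the flag differ.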
diff --git a/html/style.css b/html/style.css
index f1b474b..4725089 100644
--- a/html/style.css
+++ b/html/style.css
@@ -1,6 +1,6 @@
body {
margin: 20px auto;
- padding: 0 20px;
+ padding: 10px 20px;
max-width: 80ch;
background: #eee;