summaryrefslogtreecommitdiff
path: root/notes/reader.md
blob: ebbe1ea5fa9e83bd8cc85a24086b45315f787d1f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
# Reader? Decoder? I barely know 'er!

*This started from an expansion to the following, then became its own
article:*

[Symbols are strings are symbols](symbols.html)

OK but hear me out... What if there were different reader modes, for
code and (pure) data?

I want Zisp to have various neat [syntactic extensions](sugar.html)
for programming purposes anyway, like the lambda shorthand, and these
shouldn't apply to configuration files, either.  (Although they seem
unlikely to occur by accident.)

So what if the transform from string literal to quoted string literal
only occurred in code reading mode?

At least one problem remains, which is that `'(foo "bar")` would turn
into `(quote (foo (quote bar)))` because the reader would be in code
mode while reading it...

This reminds me of the long-standing annoyance in Scheme that "quote
is unhygienic" and maybe we can tackle this problem as well now.

Also, expressions like `'(foo '(bar))` always seemed weird to me, and
probably have no place in Scheme, because we don't generate code via
quote; we generate it with macros that operate on explicit syntax
objects rather than pure data.

I want to experiment with an idea like this:

    ;; "code reader mode" transformations

    '(foo bar)     -> (#quote foo bar)
    
    '(foo 'bar)    -> ERROR

    "foo"          -> (#quote . foo)

    "foo bar"      -> (#quote . "foo bar")

    '(foo "bar")   -> (#quote foo bar)

    '(foo "x y")   -> (#quote foo "x y")

The right-hand side shows what you would get if you read the form on
the left in code reader mode, then wrote it back out in data mode.

The writer could also have a code writer mode, which applies the
reverse transformations.  There should be a one to one mapping,
unambiguous, so this always works.  A "hygienic" way to quote is
imperative here, since the writer could otherwise not know whether
some `quote` keyword in a list is *the* quote special form, or just
part of some data.

We've made quote into a special token, `#quote`, to solve that.
Instead of adding a separate symbol data type that's a subtype of
strings, I think I'll add something called a "rune" or such that's
represented like `#foo` and allows for custom reader extensions, or
rather, writer extensions.

Essentially, these runes would be bound to a pair of the following:

1. A procedure that accepts a datum and returns some type of value.

2. A procedure that takes values of that type, and turns them back
   into written format.

For `#quote`, the reader procedure would be the identity function.
The writer procedure would need to be a little more sophisticated.

Note that the first procedure would not actually be called during
reading of data.  Somewhat confusingly, it would only be called in
evaluation of code.

Let's recap.  Starting with pure data reading and writing:

1. There is no special reader syntax.  This s-expression format is a
bare minimum of what's needed to represent sequential data i.e. lists
and lists are the only compound data type recognized by the reader.
Anything that isn't a list is either an atomic value, or a string
which may or may not be considered atomic depending on how pedantic
you want to be.  Oh and runes are allowed.

   A. This is basically "classic" s-expressions with runes added.
   Only lists, numbers, and strings/symbols are recognized.

   B. Heck, numbers may not be recognized.  Or maybe they will be
   limited to integers and floats, but no rationals or such, and
   reading a float will guarantee no loss of precision?

2. Writing data returned by the data reader back out, in data form,
will produce exactly what was read, with the sole exception being
whitespace differences.  The data is not allowed to contain any
non-atomic values other than proper lists.

   A. It's important not to allow floats that IEEE 754 doubles can't
   represent, since then differences between input and output would
   occur.  But what about numbers like "10.00"?  That would also
   become something else when written back out.

   B. OK, maybe numbers represented in a non-canonical way are a
   second source of difference between reading and writing back out,
   but let's at least guarantee there's no loss of precision.

(I've not considered comments.  Maybe they will be preserved?  Maybe
they should be implemented as code reader syntax sugar as well??)

And now code reading and writing:

1. Various syntax sugar is internally transformed into runes, with
non-list compound data literals (vectors, hash tables, etc.) needing
this type of representation to appear in code.

   A. Writing that data back out in data mode will reveal the inner
   workings of the language, producing output containing runes.

   B. Direct use of runes may be forbidden; not sure about this.

   C. Evaluating this data containing runes will produce, in-memory,
   the actual values being represented.  The "reader procedure" tied
   to the rune is responsible for this, though the fact that it's
   evaluation and not reading that calls that procedure makes it
   confusing so a better name is needed.  Maybe just "decoder."

2. For every data type that falls outside the pure data syntax, there
is a procedure that turns it into a canonical data representation
based on lists and atomics, always using the format `(#rune ...)`.

   A. Another procedure is capable of turning that back into reader
   sugar, but this is not terribly important.  Although it would be
   neat to be able to write out code that looks like hand-written
   program code, this really is just a bonus feature.

   B. For some types, turning them back into code without any runes
   may be highly complicated; procedures, in particular, would need
   decompilation to make this work.


## Recap (or not?)

Wow, that was a long "recap."  I actually came up with new ideas in
writing that.  Let's recap the recap.  I'll represent the mechanisms
as different pipelines that can happen using the various features.

Typical pipeline when reading and evaluating code:

    code-file --[code-reader]--> code-data --[eval]--> values
                 ^^^^^^^^^^^                  ^^^^
              turns sugar into        calls rune decoders
                 rune calls            to produce values
            i.e. desugars code          & compiles code

Reading in a [serialized program](compile.html):

    data-file --[data-reader]--> data --[eval]--> values
                                         ^^^^
                                    fairly trivial
                                (no lambdas, only runes)

Reading pure and simple data like a config file:

    data-file --[data-reader]--> data (no runes to eval)

Note that "data" is a subset of "values" basically.  And the term
"code-data" which was used above just means data that is meant to be
evaluated as code, but is totally valid as pure data.  This is not to
be confused with the "data" that existed in the intermediate step
while we were reading a serialized program; that was absent of any
forms like lambdas that need compilation.

OK, that last bit was a bit confusing, and I realize it stems from
conflating rune decoding with code compilation, so let's split that
further up.  Above, "eval" is "decode + compile" basically, but it's
possible to separate them, for example if we want to read a file of
serialized values that should not contain any code:

    values-file --[data-reader]--> values-data --[decode]--> values

This is a secure way to read complex data even if it comes from an
untrusted source.  It may contain runes that represent code, such as
in the form of `(#program "binary")` (compiled procedure) or even
`(#lambda (x) (do-things))` but so long as you don't actually call
those things after having decoded them, they can't do anything.
Decoding runes can't define macros or register new rune decoders,
meaning there's no way to achieve arbitrary code execution.

Heck, although `#lambda` exists to represent the desugaring of the
`{...}` convenience syntax, it wouldn't actually work here because
decoding runes would happen in a null-environment without any bound
identifiers, meaning that e.g. `(#lambda (x) (+ x x))` would just
raise an error during decoding, because the compiler would consider
`+` unbound.

Alternatively, instead of calling the compiler, the `#lambda` decoder
could just be a no-op that returns the same form back, but without the
rune, like `(lambda (x) (+ x x))`, because the compiler will take care
of that later.  Yeah, I think this makes more sense.  Why doesn't the
code reader directly give `(lambda ...)` for the `{...}` sugar?  Well,
actually, the `#lambda` decoder may yield a syntax object where the
first element specifically refers to the binding of `lambda` in the
default environment, so you could use `{...}` in an environment where
`lambda` is bound to something else, and you would still hygienically
get the default lambda behavior from `{...}`.  Yay!

(Wow, it's rabbit hole after rabbit hole today.  This is good though.
I'm coming up with some crazy stuff.)

It would be possible to decode "code-data" and get an internal memory
representation of an uncompiled program which however already has
various data structure literals turned into values.  This is super
obscure but for sake of completeness:

    code-file --[code-reader]--> code-data --[decode]--> code-values

(These so-called "code-values" would only ever be useful for piping
them into the compiler.  By the way, I initially used "eval" in the
example of reading a serialized program, but "decode" would have been
sufficient there.)


## Here's a well-deserved break

(There wasn't a new header in a while.  This seemed a good spot.)

Now writing pipelines.  Let's reverse the above pipelines, from the
bottom back towards eventually the first...

The reverse of the super obscure thing above:

    code-values --[encode]--> code-data --[code-writer]--> code-file

That would only ever be useful for debugging things.  Now writing a
data structure into a serialized file, without unnecessarily invoking
the decompiler:

    values --[encode]--> values-data --[data-writer]--> data-file

That gives you a file containing only data, but the data is the
encoded format of various data structures Zisp recognizes...
Actually, that may include compiled procedures as well.

Now the simple config file case, being serialized:

    data -[data writer]-> data-file

Now serializing a compiled program to a file, without decompilation:

    values --[encode]--> values-data --[data-writer]--> data-file
              ^^^^^^                    ^^^^^^^^^^^
         data structures             no decompilation
        become rune calls            or "re-sugaring"

Oh, look at that.  It's the same as writing out data structures, as
we've already seen previously...  This recap of a recap will need
another recap for sure.

And now, the full decompiler:

    values --[uneval]--> code-data --[code-writer]--> code-file
              ^^^^^^
          decompilation

Actually, just like "eval" is "decode + compile", the "uneval" here
really is "decompile + encode".


## The Revised Recap of the Recap

The following exist:

1. Readers:

   1. Data reader: Reads lists, strings/symbols, runes, integers, and
   IEEE 754 double-precision floats without loss of precision.

   2. Code reader: Reads code that can contain various syntax sugar,
   all of which has an equivalent representation with runes.

2. In-memory transformers:

   1. Decoder: Calls decoders for runes in data, to yield values.

   2. Evaluator: [Executes aka compiles](compile.html) decoded values
   into other values.[*]

3. Reverse in-memory transformers:

   1. Encoder: Reverse of the decoder. (Lossless?)

   2. Unevaluator: Reverse of the evaluator. (Lossy.)

4. Writers:

   1. Data writer: Reverse of data reader. (Lossless.)

   2. Code writer: Reverse of code reader. (Lossy?)

(*) This needs decoding to run first, because otherwise it wouldn't
    realize that you're e.g. calling `+` on a pair of rational number
    constants represented through runes, so constant folding wouldn't
    work.  Same with `vector-ref` on a vector literal represented as a
    rune, and so on.


## How in the seven hells did I arrive at this point?

Jesus Christ!

This was about symbols and strings being the same thing.

But I love these rabbit holes.  They're mind expanding and you find
out so many new things you never thought about.

Did you notice, by the way, that the code reader/writer above is
essentially a parser (and unparser) you would have in a regular
programming language, where syntax becomes an AST?  The pure data
format is basically our AST!

But this doesn't mean we lost homoiconicity.  No, we merely expanded
upon it by providing a more detailed explanation of the relationship
between textual representation of code and in-memory data that exists
at various stages before ultimate compilation.

Oh, and did we achieve our strategy of strings = symbols now, or does
it need to be dropped?  I think we achieved it.  The code reader, as
described all the way up where the section "Reconsidering AGAIN"
begins --in the original article; see top-- will desugar string
literals into:

    "foo" -> (#quote foo)

(As already described originally...)

And the `#quote` rune?  Well, it will not actually just return its
operand verbatim, no!  It will return a syntax object that's a list
with the first element specifically refers to the binding of `quote`
from the standard library.  In other words, it's the evaluator that
actually implements quote, not the decoder.

Oh yes, this is very satisfying.  Everything is coming together.

Syntax objects, by the way, will also have a rune-based external
representation, so you can inspect the result of macro expansion.

And yes, I think using runes directly in code mode should be illegal,
because it allows referring to bindings in the standard library, or
even bindings in arbitrary libraries by crafting syntax objects
represented via runes, to bypass environment limits.

That bug actually existed in Guile at some point, where one could
craft syntax objects, represented as vector literals, to refer to
bindings in other modules, making it impossible to run code in a
sandboxed environment.  (It was fixed long ago, I believe.)

Oh, but what about `#true` and `#false`?  OK, maybe there will be a
whitelist of runes that are allowed in code.  That makes sense.

We will see.  Still more details to be fleshed out.

In any case, some runes must be able to declare that they don't take
arguments, in which case `(#rune ...)` isn't decoded by passing the
entire form to the decoder of `#rune`, but rather treated as a normal
list whose first element is decoded as a nullary rune.  That's how
boolean literals in code will be implemented.


## Looking at more of the initial problems

What happened to `'(quote "foo")` in code mode being weird?  Well,
encountering an apostrophe tells the code reader that the next
expression is a datum, so it switches to data mode for that.

Wow, that was easy.

This also means you can't use syntax sugar inside it, which is good
because as we said previously, we don't want to use quoting to create
code; we want to use syntax objects for that.

This is really orthogonal to the whole runes issue, and could have
been solved without that mechanism, but I'm happy I came up with it
because it resolves hygiene issues.

The syntax `#'(quote "foo")` would be sugar for a different rune, and
the reader would remain in code mode, further desugaring any sugar
found within, so this works: `#'{x (+ x x)}`

Oh and I mentioned reader extensions (for code mode) but then didn't
expand on that.  Well, whenever the code reader encounters this:

    #foo(blah blah blah)

It will turn that into:

    (#foo blah blah blah)

After which the decoder for `#foo` will be invoked, which could have
been registered by the programmer.

Can that registration be done in the same file though?  Normally, the
execution step comes after decoding, and we decided that we don't want
to allow arbitrary code execution to happen just when reading a data
file and decoding it.  So something exceptional would need to happen
for this to work.  Or maybe not.

Remember that [compilation is execution](compile.html) in Zisp,
meaning that compiling a file looks like this in pseudo-Scheme:

    (define env (null-environment))    ;start out empty
    
    (while (not (eof? input))
      (let* ((datum (read-code input)) ;desugar
             (value (decode datum)))   ;decode
        (eval! value env)))            ;eval in mutable env

    (write (env-lookup env 'main))     ;serialize

I've called eval `eval!` to indicate that it can mutate the env it
receives, which is what import statements and defines would do.

Let's modify that a little further to indicate the fact that reader
macros, or in our terms, custom rune decoders, can be defined in the
middle of the code file by affecting the environment:

    (define env (null-environment))      ;start out empty
    
    (while (not (eof? input))
      (let* ((datum (read-code input))   ;desugar
             (value (decode datum env))) ;decode in env
        (eval! value env)))              ;eval in mutable env

    (write (env-lookup env 'main))       ;serialize

Since the `decode` procedure is given an environment, it will look up
decoders from therein.  So, after the evaluation of each top-level
expression, the expressions coming after it could be using a custom
decoder.

What our reader macros cannot do is completely affect the lexical
syntax of the language, as in, add more sugar.  You must rely on the
global desugaring feature of `#x(...) -> (#x ...)` which, now that I
think about it, is completely useless because a regular macro could
have achieved exactly the same thing.

OK, let's try that again.  The global desugaring wouldn't work on
lists only, it would work on a number of things:

    #x"foo"    -> (#x #string . foo)

    #x[foo]    -> (#x #square . foo)

    #x{foo}    -> (#x #braces . foo)

You get the idea!

(I've changed my mind that `"foo"` should desugar into a call to the
regular `#quote` rune; it should be `#string` instead to disambiguate
from the apostrophe if needed.)

Also, all those would work without a rune as well, to allow a file to
change the meaning of some of the default syntax sugar if desired:

    "foo"    -> (#string . foo)

    [foo bar]    -> (#square foo bar)

    {foo bar}    -> (#braces foo bar)

Or something like that.  I'm making this all up as I go.