| author | Taylan Kammer <taylan.kammer@gmail.com> | 2025-03-29 11:10:24 +0100 |
|---|---|---|
| committer | Taylan Kammer <taylan.kammer@gmail.com> | 2025-03-29 11:10:24 +0100 |
| commit | 451aa92846b5fd5c8a0739336de3aa26d741d750 (patch) | |
| tree | 21e51213bf1d39c2a8677060c51d83a656873786 /html | |
| parent | 5025f9acf31cd880bbff62ff47ed03b69a0025ee (diff) | |
Relocate MD sources for HTML notes.
Diffstat (limited to 'html')
| -rwxr-xr-x | html/gen.sh | 31 |
| -rw-r--r-- | html/notes/booleans.md | 32 |
| -rw-r--r-- | html/notes/compile.md | 115 |
| -rw-r--r-- | html/notes/cons.md | 178 |
| -rw-r--r-- | html/notes/equal.md | 255 |
| -rw-r--r-- | html/notes/fastcons.md | 206 |
| -rw-r--r-- | html/notes/format.md | 14 |
| -rw-r--r-- | html/notes/immutable.md | 56 |
| -rw-r--r-- | html/notes/let.md | 41 |
| -rw-r--r-- | html/notes/macros.md | 151 |
| -rw-r--r-- | html/notes/nan.md | 458 |
| -rw-r--r-- | html/notes/oop.md | 3 |
| -rw-r--r-- | html/notes/reader.md | 470 |
| -rw-r--r-- | html/notes/records.md | 54 |
| -rw-r--r-- | html/notes/serialize.md | 66 |
| -rw-r--r-- | html/notes/sr.md | 368 |
| -rw-r--r-- | html/notes/strict-mode.md | 16 |
| -rw-r--r-- | html/notes/sugar.md | 85 |
| -rw-r--r-- | html/notes/symbols.md | 112 |
| -rw-r--r-- | html/notes/zero-values.md | 11 |
20 files changed, 18 insertions, 2704 deletions
diff --git a/html/gen.sh b/html/gen.sh
index 3ed9482..df78d25 100755
--- a/html/gen.sh
+++ b/html/gen.sh
@@ -2,24 +2,29 @@
 set -euo pipefail
 
-if [ $# -eq 0 ]
-then
-    exec find . -name \*.md -exec "$0" {} +
-fi
-
-for file
-do
-    if ! [ -f "$file" ]
+md2ht() {
+    src=$1
+    dst=$2
+    if ! [ -f "$src" ]
     then
-        echo >&2 "File not found: $file"
+        echo >&2 "File not found: $src"
         continue
     fi
-    echo "$file"
+    echo "$src -> $dst"
     {
-        title=$(sed '/^# / { s/# //; q }' "$file")
+        title=$(sed 's/# //; q' "$src")
         sed "s/__TITLE__/$title/" prelude.html
         echo "<body>"
-        markdown2 "$file" -x fenced-code-blocks,highlightjs-lang,tables
+        markdown2 "$src" -x fenced-code-blocks,highlightjs-lang,tables
         echo "</body>"
-    } > "${file%.md}".html
+    } > "$dst"
+}
+
+md2ht index.md index.html
+
+for note in ../notes/*.md
+do
+    name=${note#../notes/}
+    name=${name%.md}
+    md2ht "$note" "notes/$name.html"
 
 done
diff --git a/html/notes/booleans.md b/html/notes/booleans.md
deleted file mode 100644
index 61b9d7e..0000000
--- a/html/notes/booleans.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# Only Booleans have truthiness
-
-Like in Java, there should be no implicit conversion of values to a
-Boolean. This leads to sloppy code and subtle bugs.
-
-I believe the code base of Guix contained an example of this at some
-point: Build phases ending in a call to `system*` would return 0 or 1
-which would pass as "true" regardless.
-
-In most cases, you should know the actual type of the value you're
-receiving, and do an appropriate check, be it `zero?`, `null?`, or
-some other check. If you truly want to check if something is "any
-value other than false" then you can always do:
-
-```scheme
-
-(if (not (eq? #f value))
-    (do something))
-
-```
-
-No, you cannot use `(not (not x))` because `not` obviously expects a
-Boolean argument! Duh.
-
-I'm actually serious about this.
-Scheme went all the way to make null
-separate from false, but then refused to go all the way and decided to
-allow non-Boolean values to function as Booleans anyway.
-
-Of course, performing a type-check on every single conditional may
-incur a serious performance penalty. If so, then the same flag that
-determines [whether returned values can be ignored](strict-mode.html)
-may also determine whether non-Booleans can be coerced into Booleans.
diff --git a/html/notes/compile.md b/html/notes/compile.md
deleted file mode 100644
index 4d5fc6d..0000000
--- a/html/notes/compile.md
+++ /dev/null
@@ -1,115 +0,0 @@
-# Compilation is execution
-
-Any Scheme implementation with support for procedural macros allows
-arbitrary code execution at compile-time. However, this is slightly
-awkward:
-
-```scheme
-
-(define-syntax comptime
-  (lambda (stx)
-    (syntax-case stx ()
-      ((_)
-       (begin
-         (display "foo\n"))))))
-
-(comptime)
-
-```
-
-Compiling this with, for example, `guild compile foo.scm` will lead to
-the line "foo" being printed. (The actual program is a no-op and will
-do nothing when run after compilation.)
-
-One can of course implement a macro such as `eval-when-compile` to
-make this slightly less awkward. Using R6RS:
-
-```scheme
-
-(import (rnrs eval))
-
-(define-syntax eval-when-compile
-  (lambda (stx)
-    (syntax-case stx ()
-      ((_ imports body ...)
-       (eval
-        (syntax->datum #'(begin body ...))
-        (apply environment (syntax->datum #'imports)))))))
-
-(eval-when-compile
- ((rnrs))
- (display "foo\n"))
-
-```
-
-An implementation may of course contain such a macro in its standard
-library, but it's unclear why the language should put such a hurdle in
-our way. There are problems beyond this little hurdle as well.
-
-Top-level forms in Scheme are semantically executed at run-time, not
-compile-time.
-
-(Actually, the Scheme standards don't explicitly define a run-time or
-compile-time stage, but it's arguably implicit in the fact that macros
-are *not* first-class, and are defined by the language in such a way
-that they can be executed entirely at compile-time if ahead-of-time
-compilation is supported by an implementation.)
-
-Consider the case where the programmer wants to perform a relatively
-costly calculation at compile-time and store the result as part of the
-compiled program. Say, a lookup-table. Naively, we may attempt the
-following:
-
-```scheme
-
-;; The fictional file `lookup-table.dat` would be in the source code
-;; repository, and the fictional procedure `process-data` would read
-;; it and return a data structure.
-(define lookup-table (process-data "lookup-table.dat"))
-
-```
-
-This will not work. Compiling a Scheme file containing such a form
-will produce a program that calls `process-data` at run-time and not
-at compile-time as intended.
-
-One can of course resolve this with an explicit use of a procedural
-macro. In fact, all one needs to do is redefine `process-data` as a
-macro, but I find this to be an unnecessary complication.
-
-Further, any sufficiently intelligent implementation *will* actually
-execute such top-level definitions at compile-time, given that they
-only make use of compile-time constants and pure functions.
-
-```scheme
-
-;; Guile will compute the value at compile-time.
-(define seconds-per-day (number->string (* 24 60 60)))
-
-```
-
-This is easily observed by running `guild compile test.scm -o test.go`
-and then `strings test.go` which will contain the string 86400 in its
-output. (If we didn't use `number->string`, it would be harder to
-locate the number 86400 in the output, since it would be binary.)
-
-This works because Guile implements the "partial evaluation" strategy
-for program optimization. This requires the optimizer to know which
-procedures are pure.
-A limitation in the implementation may cause
-some such opportunities to be missed, and the compiled program to
-execute unnecessary code at run-time.
-
-To recap: The top-level of a Scheme file is conceptually executed at
-run-time. But an optimizer may execute some of it at compile-time
-anyway. However, we're at the mercy of the implementation quality for
-this to happen consistently. We can use procedural macros to force
-some execution to definitely happen at compile-time. What a mess!
-
-It would be so much simpler if compiling a program meant, at the
-language semantics level, that the top-level is executed.
-
-Any run-time initialization of a program or module should be explicit,
-such as by putting it into a `main` function, or having the module
-export an initialization function. (The language may, if I feel like
-it, allow for declaring a module initializer function, which would be
-invoked automatically when a module is loaded.)
diff --git a/html/notes/cons.md b/html/notes/cons.md
deleted file mode 100644
index 29bb2d6..0000000
--- a/html/notes/cons.md
+++ /dev/null
@@ -1,178 +0,0 @@
-# Stop the "cons" madness!
-
-Lists are neat, but they aren't the best representation for sequences
-of fixed length. An array/vector is a better choice for this.
-
-R6RS already uses vectors in some places where a traditional lisper
-may have expected to see lists. Namely, in the procedural layer of
-record types, where the fields of a record type are represented by
-vectors. Another example is `hashtable-keys` and `hashtable-entries`
-which both return vectors. There may be more places in Scheme where
-this makes sense.
-
-In the following, we discuss a better handling of rest-argument lists
-and a change to `apply`. In short, rest arguments are not actually
-lists anymore, but rather a special kind of identifier that refers to
-multiple values.
-And `apply`, which becomes a special form, expects
-its last argument not to be a list but rather an expression that may
-evaluate to multiple values.
-
-
-## Better handling of rest args
-
-Initially, I was thinking of using either immutable vectors, or
-immutable lists transparently backed by arrays, to represent rest
-arguments. But the following paper offers a very compelling
-alternative:
-
-[A New Approach to Procedures with Variable Arity](https://legacy.cs.indiana.edu/~dyb/pubs/LaSC-3-3-pp229-244.pdf)
-
-Let's first summarize the paper, and then see how we can adapt its
-ideas for Zisp.
-
-Long story short, rest arguments are received through a parameter list
-such as `(arg1 ... argn & rest)` and the identifier `rest` is special
-in that it can only be passed on to another procedure using the same
-syntax. For example, to explicitly put the rest args into a list and
-map over them:
-
-```scheme
-
-(define (map* proc & args)
-  (map proc (list & args)))
-
-(map* square 1 2 3) ;=> (1 4 9)
-
-```
-
-Recursive functions that directly consume an arbitrary number of args,
-without needing to allocate any data structure, can be implemented by
-combining this feature with what is today known as `case-lambda`:
-
-```scheme
-
-(define combine*
-  (case-lambda
-    ((x) x)
-    ((x y & rest) (combine* (combine x y) & rest))))
-
-```
-
-Though the paper proposes the use of `&` so as to differentiate it
-from the regular rest-argument mechanism of Scheme, I intend to make
-Zisp use only this mechanism, so we can use the dot notation for it.
-Rewriting the above examples in this style gives us:
-
-```scheme
-
-(define (map* proc . args)
-  (map proc (list . args)))
-
-(define combine*
-  (case-lambda
-    ((x) x)
-    ((x y . rest) (combine* (combine x y) . rest))))
-
-```
-
-I find this very pleasing on the eyes, and a very elegant way to use
-improper lists in evaluation context, which isn't allowed in Scheme.
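The forwarding discipline above has a close analogue in languages with argument splatting. As a rough executable model (Python here is purely illustrative, since Zisp and its `&`/dot syntax are hypothetical; Python's `*args` stands in for the special rest identifier):

```python
# Illustrative model only: a rest parameter is either forwarded with
# the same splat syntax (*rest) or reified explicitly with list(...),
# mirroring (list . args) in the note above.

def map_star(proc, *args):
    # (define (map* proc . args) (map proc (list . args)))
    return list(map(proc, list(args)))

def combine_star(combine, x, *rest):
    # (define combine* (case-lambda ((x) x)
    #                               ((x y . rest) (combine* ... . rest))))
    if not rest:
        return x
    y, *more = rest
    return combine_star(combine, combine(x, y), *more)

print(map_star(lambda n: n * n, 1, 2, 3))         # [1, 4, 9]
print(combine_star(lambda a, b: a + b, 1, 2, 3))  # 6
```

Note that in Python `args` really is an allocated tuple; the comparison only models the surface discipline, not the allocation behavior the paper's mechanism is designed to avoid.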
-
-
-## More ergonomic multiple-values
-
-The paper linked above proposes to reuse the rest args syntax for an
-elegant solution to consuming multiple values:
-
-```scheme
-
-(proc x y & <expr>)
-
-```
-
-In the above, `<expr>` may evaluate to multiple values, and the values
-will be passed to `proc` as additional arguments.
-
-Essentially, this means that the special rest-arg identifier is itself
-a representation of multiple values, or in other words, evaluating it
-results in multiple values even though it's just an identifier!
-
-Demonstration, using Zisp notation for multiple-value rest args:
-
-```scheme
-
-(define (foobar . rest)
-  (let-values (((x y z) rest))
-    ;; This is a meaningless example for demonstration, since we
-    ;; could have just made the function accept three parameters.
-    ;; The `let-values` here is the one from SRFI 11.
-    ))
-
-```
-
-The paper also proposes terminating lambda bodies with `& <expr>` to
-return multiple values, but this is obsolete as of R5RS, which allows
-the final expression in a body to evaluate to multiple values anyway.
-
-However, the use of & to pass multiple values as arguments is very
-convenient and much cleaner than R5RS's clumsy `call-with-values`:
-
-```scheme
-
-;; Returns {bool, obj} where bool indicates success/failure and obj
-;; is meaningless if bool is false; allows differentiating between
-;; the case where #f is found as the value vs. nothing being found.
-(define (lookup alist key)
-  (if (null? alist)
-      (values #f #f)
-      (if (eqv? key (caar alist))
-          (values #t (cdar alist))
-          (lookup (cdr alist) key))))
-
-(define (display-if-found found? obj)
-  (when found? (display obj)))
-
-;; Incredibly ugly `call-with-values`:
-(call-with-values
-  (lambda () (lookup '((x . y)) 'x))
-  display-if-found)
-
-;; (Up until here is valid R5RS code, by the way.)
-
-;; So much cleaner:
-(display-if-found & (lookup '((x . y)) 'x)) ;; displays y
-
-```
-
-Unfortunately, we can't reuse the improper list syntax in the last
-example, since the following s-expressions are equivalent:
-
-```scheme
-
-(foo . (bar baz))
-(foo bar baz)
-
-```
-
-In Zisp, this will be solved by making `apply` a special-form where
-the last operand is expected to evaluate to multiple values rather
-than a list:
-
-```scheme
-
-;; (apply <proc-expr> <argn-expr> ... <restargs-expr>)
-
-(apply display-if-found (lookup '((x . y)) 'x))
-
-```
-
-Note that this means the forms `(apply foo rest)` and `(foo . rest)`
-are equivalent if `rest` is an identifier and not a pair/list, while
-`(apply foo (x ...))` is of course different from `(foo x ...)`.
-
-I find this all incredibly pleasing. Lists never had any business in
-representing arguments in the first place; it should always have been
-multiple values!
-
-(The phrase "argument list" is probably going to stick around forever
-though, even if it's technically wrong in Zisp.)
diff --git a/html/notes/equal.md b/html/notes/equal.md
deleted file mode 100644
index 8c55faa..0000000
--- a/html/notes/equal.md
+++ /dev/null
@@ -1,255 +0,0 @@
-# A novel approach to object equivalence
-
-## Story time
-
-In my past 5 years of developing a warehouse management application,
-I've frequently found myself implementing equality the following way.
-
-(Code massively reduced to get to the point. Implementations of the
-`hashCode` method, which must be compatible with `equals`, have been
-left out for brevity.)
-
-```java
-
-class Warehouse {
-    String code;
-    String name;
-    // ... more fields, many mutable
-
-    public boolean equals(Object obj) {
-        if (obj == this) {
-            return true;
-        }
-        if (obj instanceof Warehouse w) {
-            return code.equals(w.code);
-        }
-        return false;
-    }
-}
-
-class Product {
-    int id;
-    LocalizedString name;
-    LocalizedString description;
-    // ...
more fields, many mutable - - public boolean equals(Object obj) { - if (obj == this) { - return true; - } - if (obj instanceof Product p) { - return id == p.id; - } - return false; - } -} - -``` - -And so on. A type may have a code that is a String, an id that is an -int, or some other field that uniquely identifies members of it within -our application domain. - -I'm speaking of types and their members, not classes and instances, -because I mean the application domain. The user of our program has a -number of warehouses and a number of products, in reality, which we -represent via these classes, but there may be multiple instances of a -class representing the same real entity. - -Oftentimes, there will be a database that contains unique entries for -these. For example, an SQL table for products where the id is the -primary key, and an SQL table for warehouses where the code is the -primary key. - -This way of implementing `equals()` and `hashCode()` may seem wrong to -some developers. Since the classes have mutable fields, we may end up -with two `Product` instances which are not equal in their state, yet -`equals()` returns true for them since they represent the same real -life entity, just with different state in our program. Perhaps one -has just been fetched from the database, whereas the other has been -modified to represent changes that are yet to be committed. - -I've never found this strategy to lead to any problems. Never have I -had a need to know whether two `Product` instances have the same state -right now. What I care about is which product is being represented. -This is useful in various ways. - -### Map keys, sets, and so on - -First of all, it allows us to create maps where the keys are instances -of Product or Warehouse. 
For example, the application may want to -create a map of which products each warehouse contains: - -```java - -Map<Warehouse, List<Product>> productsByWarehouse; - -``` - -Given the `equals()` implementation of the Warehouse class, this map -can now be given any instance of Warehouse as a key, and we need not -ensure that we have one canonical instance used as the key. - -There are of course many alternatives to this. One may not want this -map to keep Warehouse instances alive, in which case it would be more -sensible to use the codes (Strings) for the keys. One could also add -a field to Warehouse such as `List<Product> products;`. However, in -some circumstances, a map such as the above may be most natural. - -Other data structures and operations depend on equality checks as -well. For example: finding duplicates in a collection, checking -whether a collection already contains a given item, managing a set -data type, and so on. The way we implement equality is also useful -for such purposes. - -### Elegance and encapsulation - -Secondly, we may have subsystems in the application that communicate -by passing Product or Warehouse instances to each other. For example, -the user may be looking at a user interface displaying the products in -a warehouse with their current quantities. The user may then click on -an "update quantities" button which opens a sub-interface where it's -possible to add an arbitrary number of entries (products) with the -desired quantity change for each. This sub-interface may then return -a list of Product instances with the associated quantity change. The -code of the warehouse overview interface can now use `equals()` to -determine which instance of Product it received corresponds to which -Product instance it's already holding on to. - -Again, there are of course many alternative strategies which don't -require such a use of `equals()`. 
-One may explicitly compare the id
-fields, or the sub-interface may return ids associated with quantity
-changes. However, the `equals()` based implementation may offer the
-cleanest possible code:
-
-```java
-
-public void receiveQuantityChanges(Map<Product, Integer> changes) {
-    for (Row<Product> row : this.productRows) {
-        Integer change = changes.get(row.product);
-        row.quantity += change;
-    }
-}
-
-```
-
-As far as readability and elegance are concerned, this can hardly be
-improved on, as long as one is familiar with Java. Although using a
-`Map<Integer, Integer>` would hardly make the code any more verbose,
-it immediately causes it to lose out on self-documentation:
-
-```java
-
-// Quantity changes of what? Should we rename this to the awfully
-// long name receiveProductQuantityChanges to make it clear?
-public void receiveQuantityChanges(Map<Integer, Integer> changes) {
-    for (Row<Product> row : this.productRows) {
-        Integer change = changes.get(row.product.id);
-        row.quantity += change;
-    }
-}
-
-```
-
-This is a minor change as far as readability is concerned, but it also
-decreases encapsulation. The code needs to be aware that products are
-represented uniquely by an id that is an integer, and changing this
-implementation detail of the class may have far-reaching consequences
-throughout the code base.
-
-The self-documenting property may not apply to a Scheme-based language
-without static types, but it makes the encapsulation aspect all the
-more significant, since we won't have a compiler or static analyzer
-which immediately tells us about a change in the type of `id`. We
-must hope that the field `id` has been removed and replaced with a
-different one, not reusing the `id` identifier. (For example, this
-would mean a procedure like `product-id` stops existing, so a static
-analyzer or compiler can immediately warn us about its absence.
Had -we reused the field but changed its type, we would have a very hard -time finding all the places in the code we now need to fix.) - -## Get to the point! - -I've explained all this to arrive at one simple conclusion: - -Perhaps it would be a good idea for a language to have a mechanism in -which compound types like records or classes can simply declare that -one of the defined fields is the "unique identifier" of the type. - -Imitating SRFI 9: - -```scheme - -(define-record-type <product> - (make-product id name description ...) - product? - (identity: id) ; <- see here - (id product-id) - (name product-name) - (description product-description) - ...) - -(define-record-type <warehouse> - (make-warehouse code name ...) - warehouse? - (identity: code) ; <- see here - (code warehouse-code) - (name warehouse-name) - ...) - -``` - -Now the million dollar question is whether it should be `equal?` that -makes use of this information, or `eqv?`, or both. - -Although `eqv?` is intended to approximate operational equivalence, -and should therefore not consider separate mutable objects to be the -same, it's not clear to me how useful this is on user-defined types. - -The rationale for defining `eqv?` in terms of operational equivalence -is that this allows implementing memoization. If a memoized function -depends on one of the mutable fields of a record, yet `eqv?` returns -true on separate instances with different state, then the function -will return an incorrect result. But I'm skeptical as to whether a -function like that would see much practical use. It seems to me like -a memoized function that is used to compute some property of an entity -in our application domain should probably depend only on immutable -properties of that object. - -Another way to look at `eqv?` is that it implements "mostly constant -time" equality. (Strictly speaking, it's a linear time operation on -numbers, assuming the implementation offers arbitrary-size numbers. 
-This rarely matters, as few programs make use of numbers larger than -what fits in a few dozen bytes.) Conversely, `equal?` is responsible -for deep equality testing, which is inherently linear time, though -record types were overlooked in its definition in R7RS-small. - -## "Why not neither?" - -(I'm coining this as a new meme, in contrast to "why not both?") - -I think I want Zisp to support the following: - -* `eq?`: As in Scheme, but maybe allowed to fail on procedures. -* `eqv?`: As in Scheme. -* `equiv?`: An even closer approximation of operational equivalence, - by diving into immutable compound data types that `eqv?` doesn't - handle due to trying to be constant-time. I believe this is the - same as `equal-always?` in Racket. -* `equal?`: As in Scheme, but also dives into records because why not? - -The reason `eq?` is allowed to fail on procedures is an optimization -strategy that duplicates/inlines procedures. (citation needed) - -And which one should make use of the `identity` field of user types? -Only `equiv?`, because the identity object may be a string or other -immutable but compound object. (Thinking back to my Java programs, -I've had an extremely common type that was identified by a pair of -integers, for example: document and line number. One can also -conceive of N-dimensional coordinates and other such examples.) - -We want `eqv?` to remain "pretty much constant time" with the only -exception being numbers, which seem highly unlikely to reach such a -large size that it would matter. Recursive data structures, and -uninterned strings, on the other hand, could become quite large and -costly to compare. diff --git a/html/notes/fastcons.md b/html/notes/fastcons.md deleted file mode 100644 index 3dd4c3a..0000000 --- a/html/notes/fastcons.md +++ /dev/null @@ -1,206 +0,0 @@ -# Cons cell optimization? 
- -Assume that your Lisp implementation uses tagged pointers (possibly -with NaN packing; doesn't matter), and there's enough information on -pointers to tell you that the destination is the contents of a pair, -aka cons cell. - -For example, Zisp uses NaN packing, and all values are represented as -a `Value` union, which is either a double, or a special kind of NaN -value that may, among other possibilities, contain a pointer with a -3-bit type tag within it. One of the 8 possible types is a pair. - -You would implement the actual contents of a pair as `[2]Value` (in -Zig notation), i.e., an array of two consecutive `Value` objects on -the heap; you already know from the type tag on the pointer itself -that the destination is pair contents, so you don't need any further -information on the heap indicating that those two consecutive `Value` -objects represent the car and cdr of a cons cell. They are "naked" -within the heap. - -This seems optimal; you have zero meta-data on the heap itself. But -pairs are ubiquitously used to implement linked lists in Lisp, and -that includes a lot of inherent overhead: every single element must -keep a pointer to the next element. So, if you have a list of five -elements, you have five `[2]Value` on the heap, i.e., ten values! - -Can't we improve on that, somehow? - -In Zisp's NaN-packing strategy, there's an entire unused bit on heap -pointers. (There's also a whole other set of possible pointer values -with a 50-bit payload, so really, we have a *ton* of unused pointer -space.) - -What if that bit indicated that the pointer, well, let's say "has to -do" with pairs (I'll explain). The three type tag bits are now free, -since the bit being *unset* means the pointer is for one of the other -types. So, we can encode eight "sub-types" of this pointer. - -The actual pointer would go to a `[8]Value` instead of `[2]Value`, -with the "type" tag being something akin to an index. (It's not an -actual index; read on.) 
- -When you first use cons, say on a symbol `x` and nil, to get a list of -one element, it uses two of the eight slots: - - [ _ _ _ _ _ _ x () ] (tag = 0) - -(Let's call the operation that does this, i.e., allocates an array of -eight and puts the arguments into positions 6 and 7, `alloc-cons`. -We will use this term later.) - -The pointer points to the start of the array, the "type tag" is zero, -and car and cdr are implemented as: - - (car p) = p[6 - tag] - - (cdr p) = p[7 - tag] ??? (read on) - -When you call cons again but the cdr is a pointer such as the above, -then instead of allocating a new array, it puts the car into it and -returns a modified pointer; say we did `(cons y pair)`: - - [ _ _ _ _ _ y x () ] (tag = 1) - -(The operation that does this, i.e., puts the first argument to cons -into ptr[5 - tag] and returns ptr with tag + 1, is `shift-cons`.) - -The tag on the pointer now says 1. You may notice that if cdr were -truly implemented as the above, it wouldn't actually give us the cdr -of the list any more; it would give us `(car (cdr list))` so we need -to make a change: - - (cdr p) = if tag = 0: p[7]; else: p with tag - 1 - -We introduced a branch instruction, which may come back to bite us, -but our car and cdr are now functional again. - -OK, but what if we run out of space in the array? Cons also needs a -branch, or multiple if we count the one already implied above: - - (cons a b) = if b is pair and b.tag < 6: - (shift-cons a b) - else if b is pair: - (alloc-cons a b) - else: - (alloc-cons a b) - -Oh, would you look at that: the last two branches are identical, so -actually just one branch instruction is enough after all: - - if b is pair and b.tag < 6: - (shift-cons a b) - else: - (alloc-cons a b) - -I *think* this pseudo-code is complete now? If cons, car, and cdr -were to be implemented this way, I think we get an API that's fully -compatible with traditional cons/car/cdr and the optimization is -completely transparent. 
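The pseudo-code above can be sanity-checked with a small executable model. This is an illustrative Python sketch, not the actual Zig implementation: `(cells, tag)` tuples stand in for tagged pointers, the names `alloc_cons`/`shift_cons` come from the text, and chunk sharing between lists is ignored here, as it is in the note:

```python
# Hypothetical model: a "pair pointer" is (cells, tag), where cells is
# the 8-slot chunk and tag counts how many extra cars were packed in
# front of the base pair living at slots 6 and 7.
NIL = object()

def alloc_cons(a, b):
    cells = [None] * 8
    cells[6], cells[7] = a, b
    return (cells, 0)

def shift_cons(a, p):
    cells, tag = p
    cells[5 - tag] = a          # pack the new car one slot to the left
    return (cells, tag + 1)

def cons(a, b):
    if isinstance(b, tuple) and b[1] < 6:   # b is a pair with room left
        return shift_cons(a, b)
    return alloc_cons(a, b)

def car(p):
    cells, tag = p
    return cells[6 - tag]

def cdr(p):
    cells, tag = p
    return cells[7] if tag == 0 else (cells, tag - 1)

# Build (1 2 3) as (cons 1 (cons 2 (cons 3 '()))): one allocation total.
lst = cons(1, cons(2, cons(3, NIL)))
print(car(lst), car(cdr(lst)), car(cdr(cdr(lst))))  # 1 2 3
print(cdr(cdr(cdr(lst))) is NIL)                    # True
```

Building the three-element list allocates a single chunk, and `car`/`cdr` behave exactly like the traditional operations, which is the transparency property claimed above.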
- -Note that we haven't used tag value 7. Maybe that's fine, or maybe -it's possible to exploit that somehow; let's ignore it for now. - -(We could simply use `[9]Value` and then all 8 tag values would be -used, but that's probably a bad idea due to cache line shenanigans; -more on that later.) - - -## Comparison - -Let's compare the in-memory representation of traditional pairs and -our optimized "pairs" when both are used to implement a linked list -with ten elements: - - ;; traditional - list = [ e0 cdr1 ] - cdr1 = [ e1 cdr2 ] - cdr2 = [ e2 cdr3 ] - cdr3 = [ e3 cdr4 ] - cdr4 = [ e4 cdr5 ] - cdr5 = [ e5 cdr6 ] - cdr6 = [ e6 cdr7 ] - cdr7 = [ e7 cdr8 ] - cdr8 = [ e8 cdr9 ] - cdr9 = [ e9 () ] - - ;; optimized - list = [ _ _ _ _ e0 e1 e2 cdr1 ] (tag = 2) - cdr1 = [ e3 e4 e5 e6 e7 e8 e9 () ] (tag = 6) - -Let's look at the pros and cons. - -<style> -td:first-child { font-weight: bold; } -td:not(:first-child) { font-family: mono; } -</style> - -| | Traditional | Optimized | -|----------|--------------------------|------------------------------| -| cons | alloc, set car & cdr | see above (branches) | -| car | ptr[0] | ptr[6 - tag] | -| cdr | ptr[1] | see above (branches) | -| Memory¹ | n * 2 | (floor((n - 1) / 7) + 1) * 8 | -| Locality | Bad | Better (?) | -| Derefs² | n | n/7 (I think) | - -¹ Counted in `Value`-sized slots on the heap. - -² Pointer dereferences when looping through the whole list. - -I haven't thought too hard about some of the numbers, but it's not -terribly important; this is to get a general idea. - -Traditional cons always allocates; ours has a branch instruction but -this allows it to elide an allocation most of the time; I think it's -every 6 out of 7 cons calls that *don't* require allocation, when -building up a linked list. - -The implementation of car became ever so slightly slower, requiring a -small arithmetic operation. 
Cdr became even more complicated, with a -branch instruction, but most of the time it can actually elide the -pointer dereference; I think this time it's every 7 out of 8 calls. - -The strange formula for memory use is because we alloc in chunks of 8, -but use one slot for a final cdr. Anyhow, in the worst case, we will -be using 8 rather than 2 slots for a single-list element; the break -even point is 4 elements where both implementations use 8 slots. -(Then there's the 8-element case where both use 16 slots, but after -that our implementation always wins.) - -The fewer pointer derefs should help a bit, although it's likely that -the pairs making up a traditional linked list were all allocated in -close proximity, and CPUs nowadays fetch more memory from main RAM -into a CPU cache than you ask for, if you've asked for less than the -cache line size. The most common size nowadays is 64 bytes, which is -exactly `[8]Value`. This means that, using traditional cons cells, -you'd be fetching 4 of them at once every time you deref a pointer, -assuming they were all allocated next to each other. - -Memory locality is better on paper, but only if you don't think too -hard about it. We will always have seven elements in one contiguous -chunk, whereas the traditional implementation could suffer from a lot -of fragmentation: Each pair could end up in a different cache line -sized segment on the heap, meaning every element requires a main -memory fetch. But that's unlikely; they are much more likely to be -adjacent on the heap. Further, what if we have tons of tiny lists, -with one to four elements per list? Now our strategy forces every -list to be in its own cache line sized segment, whereas some of the -lists using the traditional implementation could end up in a single -64-byte segment and thus be fetched together. A cursory glance at -some Scheme code of mine indicates that there's indeed tons of lists -with fewer than 5 elements! 
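The break-even arithmetic from the table can be checked directly (illustrative Python; the formulas are the ones given above):

```python
def traditional_slots(n):
    # n cons cells of [2]Value each
    return n * 2

def optimized_slots(n):
    # chunks of [8]Value: up to 7 payload slots plus one final cdr slot
    return ((n - 1) // 7 + 1) * 8

for n in (1, 4, 5, 8, 9, 10):
    print(n, traditional_slots(n), optimized_slots(n))
# Worst case is n=1 (8 vs 2 slots); the two tie at n=4 and n=8; the
# chunked layout uses strictly fewer slots for n=5..7 and all n>8.
```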
- - -## And that was just linked lists - -Pairs are not only used for linked lists. They could be the slots of -an alist, they could be nodes in a tree data structure, and so on. - -Forcing every cons cell to allocate 64 bytes could thrash locality in -these cases, and actually increase the total memory use as well, as -most of the `Value`-sized slots may end up being wasted. - -I won't implement the strategy mentioned above yet, but if I do it -eventually, it will be very interesting to see its performance in a -variety of benchmarks, reflecting common usage patterns. diff --git a/html/notes/format.md b/html/notes/format.md deleted file mode 100644 index f757736..0000000 --- a/html/notes/format.md +++ /dev/null @@ -1,14 +0,0 @@ -# I hate 'display' - -WIP WIP WIP - -(format "template" arg ...) ;sprintf -(format obj) ;like write but returns string -(format! "template" arg ...) ;printf -(format! arg) ;write - -The ones with a string template are special forms and process the -template string at compile time and ensure correct number of args. - -Need a way to let a special form's name also appear as an identifier -like Guile does it with record accessors and shit. diff --git a/html/notes/immutable.md b/html/notes/immutable.md deleted file mode 100644 index 78652e9..0000000 --- a/html/notes/immutable.md +++ /dev/null @@ -1,56 +0,0 @@ -# More immutability - -I see no reason to have mutable variables in the language. - -Usually, code is analyzed to distinguish between mutable and immutable -variables, because this aids in optimization. This means you end up -with two types of variables, and whether a variable is of one or the -other type is determined solely from how it's used. Ugly! - -An explicit box data structure can trivially replicate the features of -a mutable variable, so let's just use that instead. - -Our `set!` can assume a box when a plain identifier is used. But to -get the value, we call `get`. - -```scheme - -(let ((x (box 0))) - (while foo - (set! 
     x (+ (get x) 1)))
  (get x))
```

I've not yet made up my mind on whether pairs should be immutable by default, but they probably should. Strings, as also mentioned in [symbols](symbols.html), will be immutable, since string constants will be the same thing as symbols.

## Late additions

It now occurs to me that, if you want your explicitly boxed value to not be heap-allocated, your compiler will need to analyze its use and potentially unbox it.

So, in terms of code analysis complexity, it may not actually make a difference, but I still like the more explicit demarcation of mutable variables. Perhaps the syntax and semantics could be changed to:

```scheme
(let ((x (mutable 0)))
  (while foo
    (set! x (+ x 1)))
  x)
```

This is different in that passing around `x` will not actually pass around a box whose contents can be mutated; rather, it's a regular variable like in Scheme, but mutable unlike normal Zisp variables. The `mutable` identifier would be part of the `let` syntax and not possible to use anywhere else. (Probably not even with `define`.)

It's really just to make code more explicit and easier to grasp, without any effects on compiler complexity, probably.

diff --git a/html/notes/let.md b/html/notes/let.md
deleted file mode 100644
index 4af41bd..0000000
--- a/html/notes/let.md
+++ /dev/null
@@ -1,41 +0,0 @@

# No shadowing (shock!) and a reduced set of `let` forms

This may be shocking for Schemers, but I believe shadowing variables is evil. I've already written a bug once that would have been prevented had it not been possible to shadow variables, and I don't even write that much Scheme code.

And let's face it: the presence of four different forms by the name of `let`, `let*`, `letrec`, and `letrec*` is daunting when you're coming from another language. Lack of shadowing allows us to reduce this without losing out on much functionality.
In the absence of shadowing, `let` becomes nearly useless, because you can always use `letrec` to fulfill the same need; it's strictly more powerful. (The only thing `let` can do that `letrec` can't is refer to the previous binding of a variable before shadowing it.)

Further, the value of vanilla `letrec` is dubious when `letrec*` is strictly more powerful. So, in Zisp, `let` is the ultimate binding form that does what `letrec*` does in Scheme.

Except it does more! We haven't looked at the whole `let-values` family yet. In Zisp, these are also merged into the ultimate `let`, using the SRFI 71 syntax:

```scheme
(let ((a (one-value))
      (b c (two-values))
      ((values d e . rest) (arbitrary-values)))
  (do-things))
```

You may be wondering whether it also supports the "let loop" syntax of vanilla Scheme `let`, and the answer is no, because I hate that syntax. It has too high a risk of leading to absolute spaghetti code with no clear indication as to how the loop variables are being updated.

If you want to loop, use a dedicated looping syntax! Even `do` is better than "let loop" shenanigans, if you ask me.

diff --git a/html/notes/macros.md b/html/notes/macros.md
deleted file mode 100644
index 3169c49..0000000
--- a/html/notes/macros.md
+++ /dev/null
@@ -1,151 +0,0 @@

# Does the decoder implement macros?

I've written about the [parser/decoder dualism](reader.html) in a previous article. Long story short: the parser takes care of syntax sugar, like turning `#(...)` into `(#HASH ...)`, and the decoder takes care of turning that into a vector or whatever.

Now, since the job of the decoder seems superficially quite similar to that of a macro expander, I've been agonizing for the past two days or so over whether it *is* the macro expander.

(Warning: This post is probably going to be very rambly, as I'm trying to gather my thoughts by writing it.)
On one hand, sure:

    (define-syntax #HASH
      (syntax-rules ()
        ((#HASH <element> ...)
         (vector '<element> ...))))

Or something like that. You know what I mean? I mean, in Scheme you can't return a vector from a macro, but in Zisp the idea is that you can very well do that if you want, because why not.

It's very much possible that I will eventually realize that this is a bad idea in some way, but we'll see. So far I really like the idea of a macro just returning objects, like a procedure, rather than having to return a syntax object that has a binding to that procedure.

This may be similar to John Shutt's "vau calculus" from his language Kernel. Maybe Zisp will even end up being an implementation of the vau calculus. But I don't know; I've never fully grokked the vau calculus, so if I end up implementing it, it will be by accident.

In any case, I want the user to be able to bind transformers to runes, and doing so feels like it's pretty much the same thing as defining a macro, so maybe the decoder should also be the macro expander.

But then there's an issue with quoting. Consider the following:

    (define stuff '(foo #(0 1 2)))

In Zisp, this would first of all be parsed into:

    (define stuff (#QUOTE foo (#HASH 0 1 2)))

Now, if #QUOTE didn't decode its operand, we'd end up seeing #HASH in the result, never creating the vector we meant to create.

But if #QUOTE calls decode on its operand, and the decoder is also the macro expander, whoops:

    (let-syntax ((foo (syntax-rules () ((_ x) (bar x)))))
      '(foo #(0 1 2)))

    ;; => (bar #(0 1 2))

I mean... MAYBE that should happen, actually?! Probably not, though. What Scheme does isn't gospel; Zisp isn't Scheme, and it will do some things differently, but we *probably* don't want anything inside a quoted expression to be macro expanded. Probably.
The thought that I might actually want that to happen sent me down a whole rabbit hole, and made me question "runes" altogether. If they just make the decoder invoke a predefined macro, well, why not ditch runes and have the parser emit macro calls?

So instead of:

    #(x y z) -> (#HASH x y z)

(which is then "decoded" into a vector), why not just:

    #(x y z) -> (VECTOR x y z)

And then `VECTOR` is, I don't know, a macro in the standard library, I guess. If the decoder is the macro expander, then sure, it will know about the standard library; it will have a full-blown environment that it uses to macro expand, to look up macro names.

But no, I think this conflates everything too much. Even just on the level of comprehensibility of code containing literals, I think it's good for there to be something that you just know will turn into an object of some type, no matter what; that's what a literal is.

(In Zisp, it's not the reader that immediately turns the literal into an object of the correct type, but the decoder still runs before the evaluator, so it's almost the same.)

Then again, maybe this intuition just comes from having worked with Scheme for such a long time, and maybe it's not good. Perhaps it's more elegant if everything is a macro. Don't pile feature on top of feature, remember?

Booleans, by the way, would just be identifier syntax then. Just `true` and `false` without the hash sign. In Zisp, you can't shadow identifiers anyway, so now they're like keywords in other languages, also a bit like `t` and `nil` in CL and Elisp.

IF we are fine with the quote issue described above, then I *think* everything being a macro would be the right thing to do. Although I've said the decoder could be used for things other than code, like for configuration files containing user-defined data types, you could still do that by defining macros and calling the macro expander on the config file.
It's just that you would either not be able to have stuff like vectors in a quoted list (you'd just get a list like `(VECTOR ...)` in it if you tried), or you'd have to be expanding any macros encountered within the quoted list. Either both, or neither.

Not getting a choice, you say... That's not very expressive. That seems like a limitation in the language. Remember: remove the limitations that make additional features seem necessary.

Next thing we know, we'll have two variants of quote: one which quotes for real, and one that expands macros. Or maybe some mechanism to mark macros as being meant to be run inside a quote or not, but then we've re-invented runes in a different way.

Which brings me back to runes, and how `#QUOTE` could handle them, even if the decoder is the macro expander.

Encountering `#QUOTE` could tell the decoder that, while decoding the operand, it should only honor runes, not macros bound to identifiers.

That would probably be a fine way to solve the quote problem, should the decoder also be the macro expander: macros are bound to runes or identifiers, and the rune-bound macros are those that are expanded even inside a quote.

I think that would be the same as having completely separate decode and macro-expand phases.

(The reason we would want them merged, by the way, is that it would presumably prevent duplication of code, since what they do is so similar.)

It's possible that I'm agonizing for no reason at all, because maybe the decoder cannot be the macro expander anyway.

We will see.

For now, I think it's best to proceed by implementing the decoder, and once I've come to the macro expander I can see if it makes sense to merge the two or not.

But I'll probably keep runes one way or another, since they're a nice way of marking things that should be processed "no matter what", such that they can function as object literals within code.
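The resolution sketched above (rune-bound macros expand even inside a quote, identifier-bound macros only outside) can be modeled concretely. Here is a toy sketch in Python, with s-expressions as nested lists; all names and the rune/macro tables are made up for illustration:

```python
# Rune-bound transformers: always applied, even inside #QUOTE.
RUNES = {"#HASH": lambda args: tuple(args)}   # model vectors as tuples

# Identifier-bound macros: only applied outside of #QUOTE.
MACROS = {"foo": lambda args: ["bar"] + args}

def decode(form, quoted=False):
    if not isinstance(form, list) or not form:
        return form
    head, rest = form[0], form[1:]
    if head == "#QUOTE":
        # Inside the quoted operand, only runes are honored.
        inner = [decode(x, quoted=True) for x in rest]
        return inner[0] if len(inner) == 1 else inner
    if isinstance(head, str):
        if head in RUNES:
            return RUNES[head]([decode(x, quoted) for x in rest])
        if not quoted and head in MACROS:
            return decode(MACROS[head](rest), quoted)
    return [decode(x, quoted) for x in form]

# '(foo #(0 1 2)) parses to (#QUOTE foo (#HASH 0 1 2)):
print(decode(["#QUOTE", "foo", ["#HASH", 0, 1, 2]]))
# ['foo', (0, 1, 2)] -- the vector literal is built, foo stays untouched
```

Outside a quote, `foo` would expand as usual, so the two behaviors coexist without two variants of quote.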
diff --git a/html/notes/nan.md b/html/notes/nan.md
deleted file mode 100644
index f8f3f80..0000000
--- a/html/notes/nan.md
+++ /dev/null
@@ -1,458 +0,0 @@

# NaN-Packing

NaN-packing (also called NaN-boxing) is a strategy involving the use of NaN bit patterns, which are otherwise unused, to store various values in them.

In the implementation of a dynamically typed language, this can be used to ensure that all types in the language can be represented by a single 64-bit value, which is either a valid double, an actual NaN value, or one of the other NaN bit patterns that represent some other type, potentially in the form of a pointer to a heap object containing further data.

This works because pointers only need 48 bits in practice, and the range of unused NaN bit patterns contains an astounding `2^53 - 4` different values.

IMPORTANT NOTE: All illustrations of data structures and bit patterns use big-endian. When implementing the strategies described herein, it may be necessary to reorder the elements. For example, the elements of packed structs in Zig are ordered least to most significant.

## The double format

The IEEE 754 double-precision binary floating-point aka binary64 format is:

    { sign: u1, exponent: u11, fraction: u52 }

Possible types of values a double can encode include:

    { sign == any, exponent != 0x7ff, fraction == any } :: Real (finite)
    { sign == any, exponent == 0x7ff, fraction == 0x0 } :: Infinity
    { sign == any, exponent == 0x7ff, fraction != 0x0 } :: NaN

Note:

    0x7ff = u11 with all bits set (0b11111111111)

In other words:

    all exponent bits set, fraction bits all zero :: Infinity
    all exponent bits set, fraction part non-zero :: NaN

## Details of NaN values

There are two different NaN types: signaling and quiet. Quiet NaNs may be returned by FP operations to denote invalid results, whereas signaling NaNs are never returned by FP operations and serve other purposes.
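That classification is easy to verify by inspecting raw bits. A quick sketch (Python used purely for illustration):

```python
import math
import struct

def bits(x: float) -> int:
    # Reinterpret the double's bytes as a 64-bit unsigned integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def classify(x: float) -> str:
    b = bits(x)
    exponent = (b >> 52) & 0x7FF
    fraction = b & 0xF_FFFF_FFFF_FFFF
    if exponent != 0x7FF:
        return "finite"
    return "infinity" if fraction == 0 else "nan"

print(classify(1.5))       # finite
print(classify(math.inf))  # infinity
print(classify(math.nan))  # nan
```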
Modern hardware sets the MSB of the fraction to indicate that the NaN is a quiet one, so let's refine our definition for denoting NaN values:

    { sign: u1, exp: u11, quiet: u1, rest: u51 }

Variants of NaN:

    { sign == any, exp == 0x7ff, quiet == 0, rest >= 0x1 } :: sNaN
    { sign == any, exp == 0x7ff, quiet == 1, rest == any } :: qNaN

Note that in the case of the signaling NaN, the rest of the fraction must be non-zero, since otherwise the entire fraction part would be zero and thus denote an infinity rather than a NaN.

Most systems have a "canonical" quiet NaN that they use:

    { sign == any, exp == 0x7ff, quiet == 1, rest == 0x0 } :: cqNaN

The sign bit of the canonical quiet NaN is undefined and may differ from operation to operation or depending on the platform.

It's useful to see a few common examples expressed in hex:

    0x7ff8000000000000 :: cqNaN, sign bit nil
    0xfff8000000000000 :: cqNaN, sign bit set

    0x7ff8000000000001 :: smallest non-canon qNaN, sign bit nil
    0xfff8000000000001 :: smallest non-canon qNaN, sign bit set

    0x7fffffffffffffff :: largest non-canon qNaN, sign bit nil
    0xffffffffffffffff :: largest non-canon qNaN, sign bit set

    0x7ff0000000000001 :: smallest sNaN, sign bit nil
    0xfff0000000000001 :: smallest sNaN, sign bit set

    0x7ff7ffffffffffff :: largest sNaN, sign bit nil
    0xfff7ffffffffffff :: largest sNaN, sign bit set

## Unused NaN bit patterns

Let's start with the quiet NaN values.

Theoretically, there only needs to be one canonical quiet NaN, so we would have `2^52 - 1` unused bit patterns in the quiet NaN range. In practice, however, the sign bit may differ from one operation to the next.

For example, the fabs function may simply clear the sign of the argument, without minding it being a NaN. In that case, if the platform's regular canonical NaN is the one with the sign bit set, we would end up getting another, "semi-canonical" quiet NaN bit pattern, with the sign bit nil.
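On a typical IEEE 754 platform you can observe both canonical variants directly; here is a sketch (Python for illustration, assuming common hardware behavior where negation flips only the sign bit):

```python
import math
import struct

def bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

n = bits(math.nan)
assert (n >> 52) & 0x7FF == 0x7FF   # exponent all ones: it's a NaN
assert (n >> 51) & 1 == 1           # quiet bit set: a quiet NaN

# Negating flips only the sign bit, yielding the other canonical variant.
assert bits(-math.nan) == n ^ (1 << 63)
print(hex(n), hex(bits(-math.nan)))
```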
So, both variants of the canonical quiet NaN are in use.

This leaves `2^52 - 2` definitely-unused quiet NaN bit patterns:

    { sign == any, exp == 0x7ff, quiet == 1, rest >= 0x1 } :: Unused qNaN

Remember that signaling NaNs are defined in a very similar way:

    { sign == any, exp == 0x7ff, quiet == 0, rest >= 0x1 } :: sNaN

Since none of those can be returned by FP operations, they could all be seen as unused, giving us another `2^52 - 2` bit patterns.

In total, this gives us `2^53 - 4` definitely-unused NaN bit patterns.

## Representing Zisp values and pointers as unused NaN bit patterns

Zisp wants to store two different things in unused NaN patterns:

1. Pointers (to anything, in principle)

2. Non-double primitive aka "immediate" values

It may seem intuitive to use signaling NaNs for one, and quiet NaNs for the other. However, this would fragment our "payload" bits, since we would be using the sign bit as its MSB and the remaining 51 bits of the fraction as the rest of the payload.

Further, we want to use as many bit patterns as possible for fixnums, so we can have a nice large fixnum range. To this end, it would be nice if we could, for example, use all bit patterns where the sign bit is set for our representation of fixnums, and then the range of bit patterns with the sign bit unset can be shared among the remaining values, and pointers.

Then let's do exactly that, and use the sign as the first major distinction between fixnums and other values, using it as a sort of `is_int` flag:

    { sign == 0x0, exp == 0x7ff, payload == ??? } :: Non-Fixnum
    { sign == 0x1, exp == 0x7ff, payload == ??? } :: Fixnum

It will become apparent in a moment why we haven't defined the payload yet.

Given that our payload is the entire fraction part of the IEEE 754 double format, we must be careful not to use the following two payload values, regardless of the sign bit:

1.
Zero: This would make the bit pattern represent an infinity, since the payload is the entire fraction, and a zero fraction indicates infinity.

2. `0x8000000000000` (aka only the MSB is set): This would make the bit pattern a canonical quiet NaN, since the payload MSB is the quiet bit.

This means that in each category (sign bit set or nil) we have `2^52 - 2` possible bit patterns, and the payload has a rather strange definition:

    0x0 < payload < 0x8000000000000 < payload < 0xfffffffffffff

Can we really fit a continuous range of fixnum values into that payload without significantly complicating things? Yes, we can! Observe.

## Fixnum representation

We will store positive and negative fixnums as separate value ranges, using the quiet bit to differentiate between them.

Let's go back to considering the quiet bit a separate field:

    { sign == 0x1, exp == 0x7ff, quiet == 0x0, rest >= 0x1 } :: Positive
    { sign == 0x1, exp == 0x7ff, quiet == 0x1, rest >= 0x1 } :: Negative

But, I hear you say, the positive range is missing zero! Worry not, for maths is wizardry. We will actually store positive values as their ones' complement (bitwise NOT), meaning that all bits being set is our zero, and only the LSB being set is the highest possible value.

Since the bitwise NOT of a 51-bit value also sets all 13 high bits of the 64-bit word, this must be combined with a bitwise AND mask that clears the quiet bit again, leaving the 12 highest bits set: the correct starting bit pattern for a positive fixnum.
Unpacking it is just as simple: take the ones' complement (bitwise NOT) and then use an AND mask to unset the 13 highest bits:

    POS_INT_PACK(x) = ~x & 0xfff7ffffffffffff

    POS_INT_UNPACK(x) = ~x & 0x0007ffffffffffff

If you've been paying very close attention, you may notice something: given that we know the 13 highest bits must always have a certain respective value in the packed and unpacked representations (12 highest set when packed, none set when unpacked), we can use an XOR to flip between the two, and the same XOR can take care of flipping the remaining 51 bits at the same time!

This also means packing and unpacking is the same operation:

    POS_INT_PACK(x) = x ^ 0xfff7ffffffffffff

    POS_INT_UNPACK(x) = x ^ 0xfff7ffffffffffff

There we go; packing and unpacking 51-bit positive fixnums with one XOR! Amazing, isn't it?

As for the negative values, it's even simpler. This time, the correct NaN starting pattern has all 13 bits set, since the quiet bit being set is what we use to determine the number being negative. And would you believe it; this means the packed negative fixnum already represents itself!

    NEG_INT_PACK(x) = x

    NEG_INT_UNPACK(x) = x

Isn't that unbelievable? I need to verify this strategy further, but based on all information I can find about NaN values, it should work just fine.

The only disappointing thing is that it's positive integers that need an XOR to pack and unpack, rather than negative ones. One would expect positive values to occur much more frequently in typical code. But I think we can live with a single XOR instruction!
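A quick sanity check of the fixnum scheme, using the XOR mask from the formulas above (Python for illustration; negative values are viewed as 64-bit two's complement):

```python
XOR_MASK = 0xFFF7_FFFF_FFFF_FFFF   # turns on the top 12 bits, flips the low 51
U64 = (1 << 64) - 1

def pos_pack(n):     # 0 <= n <= 2**51 - 2
    return n ^ XOR_MASK

def pos_unpack(p):   # same operation as packing
    return p ^ XOR_MASK

def neg_pack(n):     # -(2**51 - 1) <= n <= -1; identity on the bit pattern
    return n & U64

for n in (0, 1, 42, 2**51 - 2):
    p = pos_pack(n)
    assert pos_unpack(p) == n
    assert (p >> 52) & 0x7FF == 0x7FF    # exponent all ones: in NaN space
    assert p & 0xF_FFFF_FFFF_FFFF != 0   # fraction non-zero: not an infinity
    assert (p >> 51) & 1 == 0            # quiet bit clear: positive range

for n in (-1, -42, -(2**51 - 1)):
    p = neg_pack(n)
    assert (p >> 52) & 0x7FF == 0x7FF and (p >> 51) & 1 == 1
print("ok")
```

Note that `pos_pack(0)` gives `0xfff7ffffffffffff` (all `rest` bits set) and `pos_pack(2**51 - 2)` gives `0xfff0000000000001` (only the LSB set), matching the hex table of NaN examples earlier in the article.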
## Pointers & Others

We still want to represent the following, which must share space within the `2^52 - 2` bit patterns that can be packed into a NaN with the sign bit unset:

- Pointers of various kinds
- Unicode code points (21-bit values)
- False, true, null, end-of-file, and maybe a few more singletons

It seems sensible to split this into two main categories: pointers and other values. Let's use the quiet bit as a `pointer` flag:

    { sign == 0x0, exp == 0x7ff, quiet == 0x0, rest >= 0x1 } :: Other
    { sign == 0x0, exp == 0x7ff, quiet == 0x1, rest >= 0x1 } :: Pointer

Note how neither type is allowed to have a zero payload, since in the case of an unset quiet bit, this would make our value an infinity, and in the case of a set quiet bit it would give us a canonical quiet NaN. Each of them is allowed any payload other than zero.

## Pointers

It would seem that we have 51 bits left to represent a pointer (though we need to avoid the value zero). But we only need 48 bits... or even less! Since allocations happen at 8-byte boundaries on 64-bit machines, we only really need 45 of the 48 bits, given that the least significant 3 will never be set. This gives us a whole 6 free bits to tag pointers with! If we have that much play room, we can do some crazy things.

### Foreign pointers

Firstly, let's introduce the concept of a "foreign" pointer. This means the pointer doesn't necessarily point to a Zisp object, and may not be 8-byte aligned. As it may point to anything, there's no point in defining further bits as tagging additional information, so we have all 50 bits available.

Let's cut out the 12 high bits of our double, since we already know what they must contain, and look at the definition of our 52-bit payload.

I will also mix up the notation a bit, to indicate that some fields are only defined if a previous field has a given value.

    { pointer == 0x1, foreign: u1, rest: u50 }

(The `pointer` field is the `quiet` bit, i.e. the
MSB of the 52-bit fraction.)

If the foreign bit is set, then the entire `rest` field shall be seen as opaque and may contain any value. Another way to look at this is that we have essentially defined another fixnum range of 50 bits. This can include the value zero, since the foreign bit being set ensures we don't step on the forbidden all-zero payload value.

### Zisp pointers

Now let's look at what we can do with "native" Zisp pointers.

Wouldn't it be nice if our language had an explicit "pointer" data type that didn't require any additional heap allocation? So let's decide that one bit is dedicated to distinguishing between an explicit pointer object, and regular pointers that stand in for the object being pointed to.

Perhaps it would be good to show some Zisp pseudo-code to explain what that means, since it may be a strange concept:

    ;; In memory, vec is represented as a regular/direct vector pointer.
    (define vec (vector 1 2 3))

    ;; We can of course use this variable as a vector.
    (vector? vec)       ;=> #t
    (vector-ref vec 0)  ;=> 1

    ;; Now we create an explicit pointer object pointing to that vector.
    ;; Distinguished by a special bit in the in-memory value of vec-ptr.
    (define vec-ptr (pointer vec))

    ;; This variable is *not* a vector; it's a vector-pointer.
    (vector? vec-ptr)       ;=> #f
    (vector-ref vec-ptr 0)  ;ERROR
    (pointer? vec-ptr)      ;=> #t
    (pointer-ref vec-ptr)   ;=> #(1 2 3)

This is *not* the same thing as a box, because it can *only* refer to heap-allocated objects, not immediates, whereas a box would be able to hold an immediate value like an integer or double as well.

    (pointer 42)  ;ERROR
    (box 42)      ;=> #<box:42>

A box would necessarily need heap allocation, whereas a pointer doesn't.

It's *also not* the same thing as a foreign pointer, because those can be anything, whereas these pointer objects definitely point to Zisp objects.

Pointers may or may not be mutable; I've not made up my mind yet.
It may seem like a pointless data type, but it adds a little bit of expressive strength to our language, for example when working with an FFI. And there's really not much else we can do with all our bits.

Let's use the term "indirect" for this tag, since "pointer" is already used:

    { pointer == 0x1, foreign == 0x0, indirect: u1, rest: u49 }

Should these indirect pointer objects be mutable, they may contain a null pointer; the forbidden zero value is avoided through the fact that the indirect bit is set.

Hmm, indirect pointers may instead become weak pointers at some point! This would fit perfectly, since they can contain null.

Direct or indirect makes no difference to the fact that the pointer value will be 8-byte aligned, so we still have 4 bits for more information about what's being pointed to. Also, since the actual pointer value can never be zero (all non-foreign pointers must point to a valid Zisp object), we avoid the forbidden zero pattern. Thus, we can indicate 16 different values with our 4 remaining bits.

It would have been nice to avoid fragmentation of these remaining tag bits. However, we want to avoid shifting, so let's go with this definition for the remaining 49 bits:

    { tag_high: u1, pointer_value: u45, tag_low: u3 }

The pointer value is extracted by masking the entire bit sequence, so it actually becomes a 48-bit value without further shifting.

(This part of the article is kinda obsolete. Implementation details are up for debate, and we may or may not use bit shifting. It's not that expensive of an operation, after all.)

The tag can be used to tell us what we're pointing to, so that type checks often don't require following the pointer. The memory location that's being pointed to may duplicate this information, since we may want to ensure that any Zisp object on the heap carries its type information within itself, but I'm not yet decided on that.
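That shift-free extraction can be sketched as follows (Python for illustration; the address and tag values below are made up, and the article later notes this layout may not survive):

```python
# 49-bit Zisp pointer payload: { tag_high: u1, pointer_value: u45, tag_low: u3 }
PTR_MASK = 0x0000_FFFF_FFFF_FFF8   # bits 47..3: the 8-byte-aligned address
TAG_MASK = 0x0001_0000_0000_0007   # bit 48 plus bits 2..0: the 4-bit type tag

def split(payload):
    # The address comes out as a ready-to-use 48-bit value: no shifting needed.
    return payload & PTR_MASK, payload & TAG_MASK

addr = 0x0000_7F00_1234_5678        # hypothetical 8-byte-aligned heap address
payload = addr | 0x1_0000_0000_0002 # tag_high=1, tag_low=2 ("Procedure" below)
p, t = split(payload)
assert p == addr
assert t == 0x1_0000_0000_0002
```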
In any case, let's list some common heap data types that our 4-bit tag can represent, making sure to have an "other" wildcard for future extensions.

The right side shows the value of the type tag when it's acquired by masking the 49-bit Zisp pointer payload.

     0. String (Symbol) ... 0x0000000000000
     1. Pair (List) ....... 0x0000000000001
     2. Vector ............ 0x0000000000002
     3. Map (Hash-table) .. 0x0000000000003
     4. Box ............... 0x0000000000004
     5. Record ............ 0x0000000000005
     6. Class ............. 0x0000000000006
     7. Instance .......... 0x0000000000007
     8. Text .............. 0x1000000000000
     9. Byte-vector ....... 0x1000000000001
    10. Procedure ......... 0x1000000000002
    11. Continuation ...... 0x1000000000003
    12. Port .............. 0x1000000000004
    13. Error ............. 0x1000000000005
    14. Enum .............. 0x1000000000006
    15. Other ............. 0x1000000000007

This list is likely to change; for example, errors should probably be class instances, continuations could be merged with procedures, and so on. But this gives us a rough picture, and demonstrates that 16 distinct values is quite sufficient for avoiding a pointer dereference in type checking.

(Why is it so important to avoid following a pointer when checking a type? Who knows? Did I say it was important? Why look at me like that??)

(Since I wrote this, I decided to use bit shifting after all, and the tags are straightforward values from 0 to 15.)

## Other values

We still have one entire `2^51 - 1` value range left. We will split it the following way, using a very simple partitioning scheme:

    { tag: u3, payload: u48 }

The following tags are defined:

    001 = short string
    010 = char (Unicode code point)
    100 = singletons (false, true, etc.)

Other tags are undefined and reserved for the future. Note that 000 is missing, so we automatically avoid the forbidden zero payload.

### What the heck is a "short string"?

Remember that [strings are immutable](symbols.html) in Zisp.
This allows us to use an amazing optimization where short strings can be represented as immediate values.

We can't get to 56 bits (7 bytes), but 48 bits (6 bytes) fits perfectly into our payload! So any interned string (equivalent to a Scheme symbol) in Zisp will in fact be an immediate value if it's 6 bytes or shorter, and doesn't need any heap allocation. Awesome!

There can still be uninterned strings that are 6 bytes or shorter, and calling intern on them would return the canonical, immediate version.

### Unicode code points

This is an easy one. We have 48 bits, and only need 21. Just write the Unicode code point into the payload: done.

This value range may be split in the future to fit other things in it, as we've wasted a ton of bits here.

### Singletons

This 48-bit value range contains various singletons like the Boolean values, the empty list aka null, and so on.

This is even more wasteful than using 48 bits for Unicode, so again, this value range may be partitioned further at some point.

### Undefined ranges

We have a whole 48-bit value range (sans one forbidden value) that's still unused, plus another 50-bit range (or two 49-bit ranges, or three 48-bit ones).

It's incredible just how much stuff you can cram into a NaN. I would have never thought it possible.

Ours may just be the most sophisticated NaN-packing strategy ever devised, because I couldn't find any information on the web about the possibility of using both signaling and quiet NaNs. All articles I've stumbled upon either claim that you must avoid signaling NaNs or quiet NaNs, or they take a naive approach to the subdivision of the available bit patterns and end up wasting tons of bit real estate.

Stay tuned for the development of Zisp, because this is getting serious!
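As a closing illustration, the short-string immediate described above is simple to sketch (Python for illustration; this assumes the `{ tag: u3, payload: u48 }` layout with tag `001`, and ignores embedded NUL bytes):

```python
SHORT_STRING_TAG = 0b001 << 48   # 3-bit tag sits above the 48-bit payload

def pack_short_string(s: str) -> int:
    data = s.encode("utf-8")
    assert len(data) <= 6, "short strings hold at most 6 bytes"
    return SHORT_STRING_TAG | int.from_bytes(data, "little")

def unpack_short_string(v: int) -> str:
    raw = (v & 0xFFFF_FFFF_FFFF).to_bytes(6, "little")
    # Unused high bytes are zero; strip them (assumes no embedded NULs).
    return raw.rstrip(b"\x00").decode("utf-8")

# "lambda" is exactly 6 bytes, so even it fits without heap allocation.
assert unpack_short_string(pack_short_string("lambda")) == "lambda"
```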
<!--
;; Local Variables:
;; fill-column: 77
;; End:
-->

diff --git a/html/notes/oop.md b/html/notes/oop.md
deleted file mode 100644
index 0821b42..0000000
--- a/html/notes/oop.md
+++ /dev/null
@@ -1,3 +0,0 @@

# Object-oriented programming isn't that bad

WIP

diff --git a/html/notes/reader.md b/html/notes/reader.md
deleted file mode 100644
index ebbe1ea..0000000
--- a/html/notes/reader.md
+++ /dev/null
@@ -1,470 +0,0 @@

# Reader? Decoder? I barely know 'er!

*This started from an expansion to the following, then became its own article:*

[Symbols are strings are symbols](symbols.html)

OK, but hear me out... What if there were different reader modes for code and (pure) data?

I want Zisp to have various neat [syntactic extensions](sugar.html) for programming purposes anyway, like the lambda shorthand, and these shouldn't apply to configuration files either. (Although they seem unlikely to occur by accident.)

So what if the transform from string literal to quoted string literal only occurred in code reading mode?

At least one problem remains, which is that `'(foo "bar")` would turn into `(quote (foo (quote bar)))`, because the reader would be in code mode while reading it...

This reminds me of the long-standing annoyance in Scheme that "quote is unhygienic", and maybe we can tackle this problem as well now.

Also, expressions like `'(foo '(bar))` always seemed weird to me, and probably have no place in Scheme, because we don't generate code via quote; we generate it with macros that operate on explicit syntax objects rather than pure data.

I want to experiment with an idea like this:

    ;; "code reader mode" transformations

    '(foo bar) -> (#quote foo bar)

    '(foo 'bar) -> ERROR

    "foo" -> (#quote . foo)

    "foo bar" -> (#quote .
"foo bar") - - '(foo "bar") -> (#quote foo bar) - - '(foo "x y") -> (#quote foo "x y") - -The right-hand side shows what you would get if you read the form on -the left in code reader mode, then wrote it back out in data mode. - -The writer could also have a code writer mode, which applies the -reverse transformations. There should be a one to one mapping, -unambiguous, so this always works. A "hygienic" way to quote is -imperative here, since the writer could otherwise not know whether -some `quote` keyword in a list is *the* quote special form, or just -part of some data. - -We've made quote into a special token, `#quote`, to solve that. -Instead of adding a separate symbol data type that's a subtype of -strings, I think I'll add something called a "rune" or such that's -represented like `#foo` and allows for custom reader extensions, or -rather, writer extensions. - -Essentially, these runes would be bound to a pair of the following: - -1. A procedure that accepts a datum and returns some type of value. - -2. A procedure that takes values of that type, and turns them back - into written format. - -For `#quote`, the reader procedure would be the identity function. -The writer procedure would need to be a little more sophisticated. - -Note that the first procedure would not actually be called during -reading of data. Somewhat confusingly, it would only be called in -evaluation of code. - -Let's recap. Starting with pure data reading and writing: - -1. There is no special reader syntax. This s-expression format is a -bare minimum of what's needed to represent sequential data i.e. lists -and lists are the only compound data type recognized by the reader. -Anything that isn't a list is either an atomic value, or a string -which may or may not be considered atomic depending on how pedantic -you want to be. Oh and runes are allowed. - - A. This is basically "classic" s-expressions with runes added. - Only lists, numbers, and strings/symbols are recognized. - - B. 
Heck, numbers may not be recognized. Or maybe they will be - limited to integers and floats, but no rationals or such, and - reading a float will guarantee no loss of precision? - -2. Writing data returned by the data reader back out, in data form, -will produce exactly what was read, with the sole exception being -whitespace differences. The data is not allowed to contain any -non-atomic values other than proper lists. - - A. It's important not to allow floats that IEEE 754 doubles can't - represent, since then differences between input and output would - occur. But what about numbers like "10.00"? That would also - become something else when written back out. - - B. OK, maybe numbers represented in a non-canonical way are a - second source of difference between reading and writing back out, - but let's at least guarantee there's no loss of precision. - -(I've not considered comments. Maybe they will be preserved? Maybe -they should be implemented as code reader syntax sugar as well??) - -And now code reading and writing: - -1. Various syntax sugar is internally transformed into runes, with -non-list compound data literals (vectors, hash tables, etc.) needing -this type of representation to appear in code. - - A. Writing that data back out in data mode will reveal the inner - workings of the language, producing output containing runes. - - B. Direct use of runes may be forbidden; not sure about this. - - C. Evaluating this data containing runes will produce, in-memory, - the actual values being represented. The "reader procedure" tied - to the rune is responsible for this, though the fact that it's - evaluation and not reading that calls that procedure makes it - confusing so a better name is needed. Maybe just "decoder." - -2. For every data type that falls outside the pure data syntax, there -is a procedure that turns it into a canonical data representation -based on lists and atomics, always using the format `(#rune ...)`. - - A. 
Another procedure is capable of turning that back into reader - sugar, but this is not terribly important. Although it would be - neat to be able to write out code that looks like hand-written - program code, this really is just a bonus feature. - - B. For some types, turning them back into code without any runes - may be highly complicated; procedures, in particular, would need - decompilation to make this work. - - -## Recap (or not?) - -Wow, that was a long "recap." I actually came up with new ideas in -writing that. Let's recap the recap. I'll represent the mechanisms -as different pipelines that can happen using the various features. - -Typical pipeline when reading and evaluating code: - - code-file --[code-reader]--> code-data --[eval]--> values - ^^^^^^^^^^^ ^^^^ - turns sugar into calls rune decoders - rune calls to produce values - i.e. desugars code & compiles code - -Reading in a [serialized program](compile.html): - - data-file --[data-reader]--> data --[eval]--> values - ^^^^ - fairly trivial - (no lambdas, only runes) - -Reading pure and simple data like a config file: - - data-file --[data-reader]--> data (no runes to eval) - -Note that "data" is a subset of "values" basically. And the term -"code-data" which was used above just means data that is meant to be -evaluated as code, but is totally valid as pure data. This is not to -be confused with the "data" that existed in the intermediate step -while we were reading a serialized program; that was absent of any -forms like lambdas that need compilation. - -OK, that last bit was a bit confusing, and I realize it stems from -conflating rune decoding with code compilation, so let's split that -further up. 
Above, "eval" is "decode + compile" basically, but it's -possible to separate them, for example if we want to read a file of -serialized values that should not contain any code: - - values-file --[data-reader]--> values-data --[decode]--> values - -This is a secure way to read complex data even if it comes from an -untrusted source. It may contain runes that represent code, such as -in the form of `(#program "binary")` (compiled procedure) or even -`(#lambda (x) (do-things))` but so long as you don't actually call -those things after having decoded them, they can't do anything. -Decoding runes can't define macros or register new rune decoders, -meaning there's no way to achieve arbitrary code execution. - -Heck, although `#lambda` exists to represent the desugaring of the -`{...}` convenience syntax, it wouldn't actually work here because -decoding runes would happen in a null-environment without any bound -identifiers, meaning that e.g. `(#lambda (x) (+ x x))` would just -raise an error during decoding, because the compiler would consider -`+` unbound. - -Alternatively, instead of calling the compiler, the `#lambda` decoder -could just be a no-op that returns the same form back, but without the -rune, like `(lambda (x) (+ x x))`, because the compiler will take care -of that later. Yeah, I think this makes more sense. Why doesn't the -code reader directly give `(lambda ...)` for the `{...}` sugar? Well, -actually, the `#lambda` decoder may yield a syntax object where the -first element specifically refers to the binding of `lambda` in the -default environment, so you could use `{...}` in an environment where -`lambda` is bound to something else, and you would still hygienically -get the default lambda behavior from `{...}`. Yay! - -(Wow, it's rabbit hole after rabbit hole today. This is good though. -I'm coming up with some crazy stuff.) 
- -It would be possible to decode "code-data" and get an internal memory -representation of an uncompiled program which, however, already has -various data structure literals turned into values. This is super -obscure but for sake of completeness: - - code-file --[code-reader]--> code-data --[decode]--> code-values - -(These so-called "code-values" would only ever be useful for piping -them into the compiler. By the way, I initially used "eval" in the -example of reading a serialized program, but "decode" would have been -sufficient there.) - - -## Here's a well-deserved break - -(There wasn't a new header in a while. This seemed a good spot.) - -Now writing pipelines. Let's reverse the above pipelines, from the -bottom back towards the first... - -The reverse of the super obscure thing above: - - code-values --[encode]--> code-data --[code-writer]--> code-file - -That would only ever be useful for debugging things. Now writing a -data structure into a serialized file, without unnecessarily invoking -the decompiler: - - values --[encode]--> values-data --[data-writer]--> data-file - -That gives you a file containing only data, but the data is the -encoded format of various data structures Zisp recognizes... -Actually, that may include compiled procedures as well. - -Now the simple config file case, being serialized: - - data --[data-writer]--> data-file - -Now serializing a compiled program to a file, without decompilation: - - values --[encode]--> values-data --[data-writer]--> data-file - ^^^^^^ ^^^^^^^^^^^ - data structures no decompilation - become rune calls or "re-sugaring" - -Oh, look at that. It's the same as writing out data structures, as -we've already seen previously... This recap of a recap will need -another recap for sure. 
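For completeness, the encode step could be sketched as the mirror image of decoding. Rune tokens are written as `|#...|` symbols below only because standard Scheme can't read a bare `#vector`, and `procedure->binary` is a made-up stand-in for whatever serializes compiled code:

```scheme
;; Hypothetical sketch: non-list compound values become (#rune ...)
;; forms; lists recurse; everything else passes through.
(define (encode value)
  (cond
   ((vector? value)
    (cons '|#vector| (map encode (vector->list value))))
   ((procedure? value)
    (list '|#program| (procedure->binary value))) ; invented name
   ((pair? value)
    (cons (encode (car value)) (encode (cdr value))))
   (else value)))
```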
- -And now, the full decompiler: - - values --[uneval]--> code-data --[code-writer]--> code-file - ^^^^^^ - decompilation - -Actually, just like "eval" is "decode + compile", the "uneval" here -really is "decompile + encode". - - -## The Revised Recap of the Recap - -The following exist: - -1. Readers: - - 1. Data reader: Reads lists, strings/symbols, runes, integers, and - IEEE 754 double-precision floats without loss of precision. - - 2. Code reader: Reads code that can contain various syntax sugar, - all of which has an equivalent representation with runes. - -2. In-memory transformers: - - 1. Decoder: Calls decoders for runes in data, to yield values. - - 2. Evaluator: [Executes aka compiles](compile.html) decoded values - into other values.[*] - -3. Reverse in-memory transformers: - - 1. Encoder: Reverse of the decoder. (Lossless?) - - 2. Unevaluator: Reverse of the evaluator. (Lossy.) - -4. Writers: - - 1. Data writer: Reverse of data reader. (Lossless.) - - 2. Code writer: Reverse of code reader. (Lossy?) - -(*) This needs decoding to run first, because otherwise it wouldn't - realize that you're e.g. calling `+` on a pair of rational number - constants represented through runes, so constant folding wouldn't - work. Same with `vector-ref` on a vector literal represented as a - rune, and so on. - - -## How in the seven hells did I arrive at this point? - -Jesus Christ! - -This was about symbols and strings being the same thing. - -But I love these rabbit holes. They're mind expanding and you find -out so many new things you never thought about. - -Did you notice, by the way, that the code reader/writer above is -essentially a parser (and unparser) you would have in a regular -programming language, where syntax becomes an AST? The pure data -format is basically our AST! - -But this doesn't mean we lost homoiconicity. 
No, we merely expanded -upon it by providing a more detailed explanation of the relationship -between textual representation of code and in-memory data that exists -at various stages before ultimate compilation. - -Oh, and did we achieve our strategy of strings = symbols now, or does -it need to be dropped? I think we achieved it. The code reader, as -described all the way up where the section "Reconsidering AGAIN" -begins --in the original article; see top-- will desugar string -literals into: - - "foo" -> (#quote foo) - -(As already described originally...) - -And the `#quote` rune? Well, it will not actually just return its -operand verbatim, no! It will return a syntax object that's a list -whose first element specifically refers to the binding of `quote` -from the standard library. In other words, it's the evaluator that -actually implements quote, not the decoder. - -Oh yes, this is very satisfying. Everything is coming together. - -Syntax objects, by the way, will also have a rune-based external -representation, so you can inspect the result of macro expansion. - -And yes, I think using runes directly in code mode should be illegal, -because it allows referring to bindings in the standard library, or -even bindings in arbitrary libraries by crafting syntax objects -represented via runes, to bypass environment limits. - -That bug actually existed in Guile at some point, where one could -craft syntax objects, represented as vector literals, to refer to -bindings in other modules, making it impossible to safely run -untrusted code in a sandboxed environment. (It was fixed long ago, -I believe.) - -Oh, but what about `#true` and `#false`? OK, maybe there will be a -whitelist of runes that are allowed in code. That makes sense. - -We will see. Still more details to be fleshed out. 
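A sketch of what that whitelist might look like; the rune names (again written as `|#...|` symbols only so standard Scheme can read them) and the check itself are purely illustrative:

```scheme
;; Hypothetical: the runes permitted to appear directly in code.
(define code-rune-whitelist
  '(|#true| |#false| |#string| |#square| |#braces|))

(define (check-code-rune! rune)
  (unless (memq rune code-rune-whitelist)
    (error "rune not allowed directly in code" rune)))
```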
- -In any case, some runes must be able to declare that they don't take -arguments, in which case `(#rune ...)` isn't decoded by passing the -entire form to the decoder of `#rune`, but rather treated as a normal -list whose first element is decoded as a nullary rune. That's how -boolean literals in code will be implemented. - - -## Looking at more of the initial problems - -What happened to `'(quote "foo")` in code mode being weird? Well, -encountering an apostrophe tells the code reader that the next -expression is a datum, so it switches to data mode for that. - -Wow, that was easy. - -This also means you can't use syntax sugar inside it, which is good -because as we said previously, we don't want to use quoting to create -code; we want to use syntax objects for that. - -This is really orthogonal to the whole runes issue, and could have -been solved without that mechanism, but I'm happy I came up with it -because it resolves hygiene issues. - -The syntax `#'(quote "foo")` would be sugar for a different rune, and -the reader would remain in code mode, further desugaring any sugar -found within, so this works: `#'{x (+ x x)}` - -Oh and I mentioned reader extensions (for code mode) but then didn't -expand on that. Well, whenever the code reader encounters this: - - #foo(blah blah blah) - -It will turn that into: - - (#foo blah blah blah) - -After which the decoder for `#foo` will be invoked, which could have -been registered by the programmer. - -Can that registration be done in the same file though? Normally, the -execution step comes after decoding, and we decided that we don't want -to allow arbitrary code execution to happen just when reading a data -file and decoding it. So something exceptional would need to happen -for this to work. Or maybe not. - -Remember that [compilation is execution](compile.html) in Zisp, -meaning that compiling a file looks like this in pseudo-Scheme: - - (define env (null-environment)) ;start out empty - - (while (not (eof? 
input)) - (let* ((datum (read-code input)) ;desugar - (value (decode datum))) ;decode - (eval! value env))) ;eval in mutable env - - (write (env-lookup env 'main)) ;serialize - -I've called eval `eval!` to indicate that it can mutate the env it -receives, which is what import statements and defines would do. - -Let's modify that a little further to indicate the fact that reader -macros, or in our terms, custom rune decoders, can be defined in the -middle of the code file by affecting the environment: - - (define env (null-environment)) ;start out empty - - (while (not (eof? input)) - (let* ((datum (read-code input)) ;desugar - (value (decode datum env))) ;decode in env - (eval! value env))) ;eval in mutable env - - (write (env-lookup env 'main)) ;serialize - -Since the `decode` procedure is given an environment, it will look up -decoders therein. So, after the evaluation of each top-level -expression, the expressions coming after it could be using a custom -decoder. - -What our reader macros cannot do is completely affect the lexical -syntax of the language, as in, add more sugar. You must rely on the -global desugaring feature of `#x(...) -> (#x ...)` which, now that I -think about it, is completely useless because a regular macro could -have achieved exactly the same thing. - -OK, let's try that again. The global desugaring wouldn't work on -lists only; it would work on a number of things: - - #x"foo" -> (#x #string . foo) - - #x[foo] -> (#x #square . foo) - - #x{foo} -> (#x #braces . foo) - -You get the idea! - -(I've changed my mind that `"foo"` should desugar into a call to the -regular `#quote` rune; it should be `#string` instead to disambiguate -from the apostrophe if needed.) - -Also, all those would work without a rune as well, to allow a file to -change the meaning of some of the default syntax sugar if desired: - - "foo" -> (#string . foo) - - [foo bar] -> (#square foo bar) - - {foo bar} -> (#braces foo bar) - -Or something like that. 
I'm making this all up as I go. diff --git a/html/notes/records.md b/html/notes/records.md deleted file mode 100644 index b93e0c3..0000000 --- a/html/notes/records.md +++ /dev/null @@ -1,54 +0,0 @@ -# Subtyping of record types - -It's a serious limitation of SRFI 9 that it doesn't allow creating -subtypes with additional fields. This is an invaluable strategy for -representing a hierarchy of types, which are ubiquitous in real life -and thus in programming. - -Sadly, this brings with it some significant complications if records -are to be initialized with default values to ensure invariants. The -R6RS solves this with an incredibly sophisticated system, which we -might need to adopt. (Search for "protocol" and "record-constructor -descriptor" in the R6RS.) - -However, we may be able to get away with a simpler approach... - -UNDER CONSTRUCTION - -Are constructor protocols really that important? Consider that all we -can do is add additional fields in the subtype. What if we separated -allocation from initialization: - -```scheme -(define-record r1 - (parent #f) - (fields a b)) - -(define (init-r1! r a b) - (set-r1-a! r a) - (set-r1-b! r b)) - - -(define-record r2 - (parent r1) - (fields c d)) - -(define (init-r2! r a b c d) - (init-r1! r a b) - (set-r2-c! r c) - (set-r2-d! r d)) - - -(define-record r3 - (parent r2) - (fields e f)) - -(define (init-r3! r a b c d e f) - (init-r2! r a b c d) - (set-r3-e! r e) - (set-r3-f! r f)) - - -(define r (make-r3)) -(init-r3! r 1 2 3 4 5 6) -``` diff --git a/html/notes/serialize.md b/html/notes/serialize.md deleted file mode 100644 index 8e5d49b..0000000 --- a/html/notes/serialize.md +++ /dev/null @@ -1,66 +0,0 @@ -# Everything can be serialized - -Let's look at the code mentioned in [compilation](compile.html) again: - -```scheme - -;; The fictional file `lookup-table.dat` would be in the source code -;; repository, and the fictional procedure `process-data` would read -;; it and return a data structure. 
-(define lookup-table (process-data "lookup-table.dat")) - -``` - -(In Zisp, this would be executed at compile-time, so the lookup table -becomes part of the compiled program.) - -If you're familiar with Guile --and I suspect most implementations of -Scheme have a similar limitation-- then you may have noticed an issue. - -Not all Scheme objects can be serialized. This not only applies to -the `write` procedure, but also the compiler's ability to put objects -into the binary representation of a compiled program. (For example, -the .data section of an ELF file in case of Guile.) - -This can be demonstrated as follows: - -```scheme - -(define-syntax process-data - (lambda (stx) - (syntax-case stx () - ((_ file) - ;; Ignore `file`; this is just an example! - (let ((ht (make-eqv-hashtable))) - (hashtable-set! ht 1 2) - ht))))) - -(define lookup-table (process-data "lookup-table.dat")) - -``` - -Compiling this with `guild` will yield an error, complaining about an -"unhandled constant" represented as #<r6rs:hashtable ...> in the -error message. What it's actually trying to say is that hash tables -aren't constants, and the compiler doesn't know how to put them into -the ELF file it's writing. - -(At least, this is the case as of February 2025, using Guile 3.0.10; -who knows what the future will provide!) - -In Zisp, I want absolutely everything to be possible to serialize, and -the compiler should simply be using this capability of the language to -write out compiled binaries. - -For example, given that any Zisp program has to declare a `main` entry -point function, all the compiler would do is execute the file and then -call `(write main)`. How elegant! - -This serialization of a function would, of course, involve traversing -all references therein, and including them in the output somehow. The -same will apply to writing out any data structure. This means that -serializing a module is *not* a matter of invoking `write` on each of -its exported definitions. 
This would lead to lots of duplicate data -between the outputs, and `eq?` relations would be lost after reading -them back in. We probably want a first-class `module` type that can -be serialized as one object. diff --git a/html/notes/sr.md b/html/notes/sr.md deleted file mode 100644 index 0fa9e06..0000000 --- a/html/notes/sr.md +++ /dev/null @@ -1,368 +0,0 @@ -# Better syntax-rules? - -Yesterday, someone on IRC asked for help in improving the following -syntax-rules (s-r) macro: - -```scheme - -(define-syntax alist-let* - (syntax-rules () - - ;; uses subpattern to avoid fender - ;; alist-expr is evaluated only once - ((_ alist-expr ((key alias) ...) body body* ...) - (let ((alist alist-expr)) - (let ((alias (assq-ref alist 'key)) ...) - body body* ...))) - - ((_ alist-expr (key ...) body body* ...) - (let ((alist alist-expr)) - (let ((key (assq-ref alist 'key)) ...) - body body* ...))) - -)) - -;; Example uses: - -(define alist '((foo . 1) (bar . 2))) - -(alist-let* alist (foo bar) - (+ foo bar)) ;=> 3 - -(alist-let* alist ((foo x) (bar y)) - (+ x y)) ;=> 3 - -;; Problem: Can't mix plain key with (key alias) forms: - -(alist-let* alist ((foo x) bar) - (+ x bar)) ;ERROR - -``` - -How do we make it accept a mix of plain keys and `(key alias)` pairs? -Oh boy, it's more difficult than you may think if you're new to s-r -macros. Basically, there's no "obvious" solution, and all we have is -various hacks we can apply. - -Let's look at two fairly straightforward hacks, and their problems. - -## Option 1 - -```scheme - -;; Solution 1: Internal helper patterns using a dummy constant. - -(define-syntax alist-let* - (syntax-rules () - - ((_ "1" alist ((key alias) rest ...) body body* ...) - (let ((alias (assq-ref alist 'key))) - (alist-let* "1" alist (rest ...) body body* ...))) - - ((_ "1" alist (key rest ...) body body* ...) - (let ((key (assq-ref alist 'key))) - (alist-let* "1" alist (rest ...) body body* ...))) - - ((_ "1" alist () body body* ...) 
- (begin body body* ...)) - - ;; dispatch, ensuring alist-expr only eval'd once - ((_ <alist> <bindings> <body> <body*> ...) - (let ((alist <alist>)) - (alist-let* "1" alist <bindings> <body> <body*> ...))) - -)) - -``` - -(I've switched to my `<foo>` notation for pattern variables in the -"dispatcher" part. Don't let it distract you. I strongly endorse -that convention for s-r pattern variables, to make it clear that -they're like "empty slots" where *any* expression can match, but -that's a topic for another day.) - -What the solution above does, is "dispatch" actual uses of the macro, -which obviously won't have the string literal `"1"` in first position, -onto internal sub-macros, which can call each other recursively, so -each layer only handles either a stand-alone `key` or a `(key alias)` -couple. - -There's some nuances to this implementation. First, if you're not -familiar with s-r macros, you may mistakenly worry that this solution -could mask a programmer error: What if we accidentally call the macro -with a variable bound to the string "1"? Would this lead to a very -annoying bug that's hard to find? No; remember that syntax-rules -patterns match *unevaluated* operands, so the internal sub-patterns -are only triggered by the appearance of a literal string constant of -`"1"` in the first position; a mistake that would be very apparent in -code you're reading, and is extremely unlikely to occur by accident. - -As for a real pitfall of this implementation: The dispatcher pattern -*must* be in the final position; otherwise it will actually catch our -recursive calls starting with `"1"` and bind that string literal to -the `alist` pattern variable! (Kind of the "reverse" of the fake -problem described in the previous paragraph, in a sense?) If the -dispatcher pattern is in the first position, it will keep calling -itself with an increasing number of `"1"`s at the start, in an -infinite loop, until you forcibly stop it or it crashes. 
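Here's a miniature demonstration of the point about string literals in patterns; it should work in any standard syntax-rules implementation (`probe` is just a throwaway name):

```scheme
(define-syntax probe
  (syntax-rules ()
    ((_ "1" x) '(internal x))  ; only a literal "1" triggers this rule
    ((_ x) '(public x))))

(probe "1" foo) ;=> (internal foo)
(probe bar)     ;=> (public bar)
```

A variable that happens to be bound to the string `"1"` at run time can't trigger the first rule, since matching happens on the unevaluated form.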
- -As a side note, this brings me to a general s-r pitfall that applies -to the original implementation as well in this case: Since patterns -are matched top to bottom, a simple `key` pattern variable *could* -actually match the form `(key alias)`, so you have to make sure that -the pattern for matching those key-alias couples comes before the one -matching plain keys. - -Oh, and by the way, if you're questioning whether we even need those -internal helper patterns at all: Yes, it's the only way to ensure the -initial `<alist>` expression is only evaluated once, in an outermost -`let` wrapping everything. - -Let's summarize the issues we've faced: - -1. It's easy to forget that pattern variables can match arbitrary - expressions, not just identifiers, and there's no way to say it - should only match identifiers. - -2. When an arbitrary expression is matched by the pattern variable, - using it means repeating that expression every time, unless you - explicitly use `let` to take care of that, which may require - dispatching to another pattern immediately if you wanted to use - recursive patterns. - -3. You may accidentally put a more generic pattern first, causing it - to match an input that was meant to be matched by a subsequent - pattern with deeper destructuring. - -It may be interesting to try solving 3 by specifying some way of -measuring the "specificity" of a pattern, and saying that those with -the highest specificity match first, but that may prove difficult. -Besides, solving 1 would basically solve 3 anyway. - -Racket has syntax-parse, which solves the first problem through an -incredibly sophisticated specification of "syntax patterns" that take -the place of the humble generic pattern variable of syntax-rules. -It's cool and all, but the charm of s-r is the simplicity. Can't we -use some of the ideas of syntax-parse patterns and add them to s-r? 
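For reference, here's roughly what the identifier restriction looks like in Racket itself; an untested sketch, so consult the Racket docs for the details:

```scheme
;; Racket, not portable Scheme.
(require (for-syntax syntax/parse))

(define-syntax swap!
  (syntax-parser
    [(_ a:id b:id)   ; a and b may only match identifiers
     #'(let ([tmp a]) (set! a b) (set! b tmp))]))

;; (swap! x y)    -- fine
;; (swap! x (f))  -- compile-time error, roughly "expected identifier"
```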
- -In Racket, there's the concept of "syntax classes," and a pattern can -be a variable with `:syntax-class-id` appended to its name, which is -how you make it only match inputs of that syntax class, such as for -example, only identifiers. Trying to find out what syntax class ids -are supported may send you down a rabbit hole of how you can actually -define your own syntax classes, but that just seems to be a weak spot -of the Racket online documentation; looking a bit closer, you should -find the list of built-in classes that are supported. They are just -called "library" syntax classes for some reason: - -[Library Syntax Classes and Literal Sets -- Racket Documentation](https://docs.racket-lang.org/syntax/Library_Syntax_Classes_and_Literal_Sets.html) - -It would be great if there were classes for atoms (anything that's not -a list) and lists, though; then we could do this: - -```scheme - -(define-syntax alist-let* - (syntax-rules () - - ((_ <alist>:list bindings body body* ...) - (let ((alist <alist>)) - (alist-let* alist bindings body body* ...))) - - ((_ alist (key:id ...) body body* ...) - (let ((key (assq-ref alist 'key)) ...) - body body* ...)) - - ((_ alist ((key:atom alias:id) ...) body body* ...) - (let ((alias (assq-ref alist 'key)) ...) - body body* ...)) - -)) - -``` - -(The key could also be a non-symbol immediate value, like a fixnum, -boolean, etc.; anything that `assq-ref` can compare via `eq?`. One -could also just not quote the key, and instead let it be an arbitrary -expression, which would probably make for a more useful macro, but -that's a different topic.) - -Isn't that really neat? But let's go one step further. I believe -this strategy of binding an expression via `let` to ensure it's only -evaluated once is probably so common that it warrants a shortcut: - -```scheme - -(define-syntax alist-let* - (syntax-rules () - - ((_ alist:bind (key:id ...) body body* ...) - (let ((key (assq-ref alist 'key)) ...) 
- body body* ...)) - - ((_ alist:bind ((key:atom alias:id) ...) body body* ...) - (let ((alias (assq-ref alist 'key)) ...) - body body* ...)) - -)) - -``` - -The idea here is: All pattern variables marked with `:bind` are first -collected, and if there is at least one that is not an identifier, -then the whole template (the part that produces the output of the s-r -macro) is wrapped in a `let` which binds those expressions to the name -of the pattern variable, and uses of that pattern variable within the -template refer to that binding. - -I'm not entirely sure yet if this is an ingenious idea, or a hacky fix -for just one arbitrary issue you can face while using syntax-rules, -but I suspect it's a common enough pattern to make it desirable. - -## Option 2 - -I said there were various hacks to solve the original problem; here's -the second variant. It's actually almost the same thing, but we put -the helper patterns into a separate macro. - -```scheme - -;; Solution 2: Separate helper macro - -(define-syntax alist-let* - (syntax-rules () - - ;; dispatch, ensuring alist-expr only eval'd once - ((_ <alist> <bindings> <body> <body*> ...) - (let ((alist <alist>)) - (%alist-let-helper alist <bindings> <body> <body*> ...))) - -)) - -(define-syntax %alist-let-helper - (syntax-rules () - - ;; basically do here what the internal helpers did in solution 1, - ;; but without the need for the "1" string literal hack - -)) - -``` - -That's cleaner in terms of the patterns we have to write, but we had -to define a second top-level macro, which feels wrong. It should be -properly encapsulated as part of the first. - -This is where another improvement to s-r could come in handy, and -that's not making it evaluate to a syntax transformer (i.e., lambda) -directly, but rather making it more like syntax-case in that regard. -However, the additional lambda wrapping always really annoyed me, so -the following syntax may be desirable. - -```scheme - -(define-syntax (alist-let* . 
s) - - (define-syntax (helper . s) - (syntax-rules s () - ((alist ((key alias) rest ...) body body* ...) - (let ((alias (assq-ref alist 'key))) - (helper alist (rest ...) body body* ...))) - - ((alist (key rest ...) body body* ...) - (let ((key (assq-ref alist 'key))) - (helper alist (rest ...) body body* ...))) - - ((alist () body body* ...) - (begin body body* ...)) - )) - - (syntax-rules s () - ((<alist> <bindings> <body> <body*> ...) - (let ((alist <alist>)) - (helper alist <bindings> <body> <body*> ...))))) - -``` - -That looks a bit confusing at first sight, but we can actually do -something a lot better now, since we already get one stand-alone -pattern at the start, which fits our intention perfectly here: - -```scheme - -(define-syntax (alist-let* <alist> <bindings> <body> <body*> ...) - - (define-syntax (helper . s) - (syntax-rules s () - ((alist ((key alias) rest ...) body body* ...) - (let ((alias (assq-ref alist 'key))) - (helper alist (rest ...) body body* ...))) - - ((alist (key rest ...) body body* ...) - (let ((key (assq-ref alist 'key))) - (helper alist (rest ...) body body* ...))) - - ((alist () body body* ...) - (begin body body* ...)) - )) - - #'(let ((alist <alist>)) - (helper alist <bindings> <body> <body*> ...))) - -``` - -To be honest, I don't like this solution nearly as much as the first, -and I now realize that there wouldn't be much point in keeping s-r if -it's going to be so close to syntax-case. (The only difference, at -this point, would be that s-r implicitly puts `#'` in front of the -templates. That's literally all it would do, if I'm not mistaken.) - -## Or just implement syntax-parse? - -Racket can actually give you the implicit lambda when you want it, by -offering `syntax-parser` as an alternative to `syntax-parse`: - -```scheme - -;; The following two are equivalent. 
- -(define-syntax foo - (lambda (s) - (syntax-parse s ...))) - -(define-syntax foo - (syntax-parser ...)) - -``` - -(At least, I'm pretty sure that's how it's supposed to work; the docs -just bind the result of `syntax-parser` to an identifier via `define` -and call it as a procedure to showcase it, for whatever reason.) - -Yes, syntax-parse is a lot more complex than syntax-rules, but to be -honest it seems mainly the fault of the documentation that it doesn't -showcase the simplest ways of using it, which look essentially the -same as using syntax-rules, so it's not clear why s-r should stay if -you have syntax-parse. - -Maybe I would just make one change, which is to allow the following -syntax and thus make the additional `syntax-parser` unnecessary: - -```scheme - -(define-syntax (foo s) - (syntax-parse s ...)) - -``` - -Note that this is different from my previous idea of making the first -operand to `define-syntax` a pattern. The only thing I don't like -about this variant is that there will never be more than one argument, -but maybe that's fine? - -In any case, I guess the only innovation I came up with here is the -special `:bind` syntax class id, assuming there isn't already a -similar thing in Racket or elsewhere. - -Oh and this made me realize I should add `foo:bar` as reader syntax to -Zisp, turning it into `(#COLON foo . bar)` or such. diff --git a/html/notes/strict-mode.md b/html/notes/strict-mode.md deleted file mode 100644 index 5b99386..0000000 --- a/html/notes/strict-mode.md +++ /dev/null @@ -1,16 +0,0 @@ -# Strict mode to disallow ignoring returned values - -This ties in to the last point. In Scheme, a non-tail expression in a -body can return an arbitrary number of values, which will be silently -ignored. - -This can lead to bugs, where a procedure actually returns some kind of -success or failure indicator (instead of raising an error) and the -programmer forgets to handle it. 
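A sketch of the bug class, with made-up procedure names:

```scheme
;; `try-send` is imagined to return #t/#f instead of raising on
;; failure; in a non-tail body position the result is silently
;; dropped, so failures go unnoticed.
(define (notify-all users)
  (for-each
   (lambda (user)
     (try-send user "wake up!")  ; return value ignored -- the bug
     (log-notified user))        ; `log-notified` is also made up
   users))
```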

Though it may be too inefficient to enable globally, there should at
least be a mode of compilation that emits code which checks, at every
single function return, whether any values are being ignored, and
raises an error if so.

There would of course be a form to explicitly ignore values.

diff --git a/html/notes/sugar.md b/html/notes/sugar.md
deleted file mode 100644
index 217b0d3..0000000
--- a/html/notes/sugar.md
+++ /dev/null
@@ -1,85 +0,0 @@

# A little bit of syntax sugar never hurt anyone

## Lambda shorthands

We could benefit from a more minimal syntax to express lambda, as an
alternative to the written-out form:

```scheme

;; {x ... expr} = (lambda (x ...) expr)

(map {x y (+ x y)} xs ys)

(for-each {x (display x)} objects)

```

## Vector and map references

Furthermore, `foo[bar]` could be sugar for `(vector-ref foo bar)` and
`foo{bar}` could be sugar for `(map-ref foo bar)` or the like. (Not
sure yet whether to use the word "map" for a data type, due to the
overlap with the `map` function. Perhaps `mapping`. In any case
there must be a map kind of data type in the standard library.)

## Records

Why not make `foo.bar` reader syntax for `(record-ref foo bar)`, where
`record-ref` is a macro that interprets `bar` as a constant?

Admittedly, this loses us some elegance compared to the widely used
SRFI 9 (standardized in R7RS-small), where field getters only exist as
procedures, so making a field "private" is simply a matter of leaving
the getter out of your module exports.

A general `record-ref` that accepts field names would need to be aware
of the concept of private fields and block access to them, except when
invoked in a context where the private fields are accessible.

I think the nice syntax is worth the added complexity. It doesn't
seem terribly difficult to expand the concept of "lexical scope" to
include field accessibility information.
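
The intended contrast can be sketched as follows; the record
definition is standard SRFI 9 Scheme, while the dot notation is the
hypothetical Zisp sugar:

```scheme

;; SRFI 9 style: field access goes through generated procedures.
(define-record-type <person>
  (make-person name age)
  person?
  (name person-name)
  (age person-age))

(person-age (make-person "Ada" 36)) ; => 36

;; Proposed Zisp sugar (hypothetical): the reader would turn
;;   person.age
;; into
;;   (record-ref person age)
;; where age is a field name constant, not an identifier.
```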

Alternatively, we could just not have private fields, like in Python.
Fields that are meant to be private could be named in a special way by
convention, such as `%foo` or whatever. I have to check whether real
encapsulation would provide us with any substantial benefits, such as
in optimization.

Oh, speaking of performance, of course `(record-ref x y)` has a big
problem: it would require dynamic dispatch, since the type of `x` is
not statically known. (And no, we don't want to write
`person.person-age`; we want `person.age`, where `age` is not treated
as an identifier but merely as a symbol for lookup.)

It may be that we want to add static typing to Zisp!

We may also add OOP-style objects and only use the dot notation for
their methods, but not fields. The reader syntax for `foo.bar` may
then expand to `(method-dispatch foo bar)`. It would also work for
fields if we really did add static typing, of course. The reason it
would only work on methods is that those need dynamic dispatch anyway.

THIS POINT NEEDS A LOT MORE CONSIDERATION!

## Built-in SRFI 17

The functionality of SRFI 17 should be a core aspect of the language,
so the following all work:

```scheme

;; Vector
(define vec (vector 1 2 3))
(set! vec[n] value)

;; Some kind of mapping
(define table (make-table))
(set! table{key} value)

;; Record type
(define rec (make-foo))
(set! rec.field value)

```

diff --git a/html/notes/symbols.md b/html/notes/symbols.md
deleted file mode 100644
index 280fd9f..0000000
--- a/html/notes/symbols.md
+++ /dev/null
@@ -1,112 +0,0 @@

# Symbols are strings are symbols

In Scheme, symbols are literally just interned and immutable strings.
They can contain any character a string can, constructed either via
`string->symbol` or the modern `|foo bar baz|` syntax for quoted
symbols. Why not just embrace the fact that they are strings?
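
In standard Scheme terms, the observation is:

```scheme

(string->symbol "foo bar")    ; => the symbol |foo bar|
(symbol->string '|foo bar|)   ; => "foo bar"
(eq? (string->symbol "x") 'x) ; => #t, because symbols are interned
```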

Scheme strings are mutable, but they are a terrible choice for text
manipulation, because they are constant-length. They are literally
just vectors of characters. If you wanted a vector of characters,
well, use a vector of characters!

Zisp won't differentiate between symbols and strings. All strings
will be immutable, string constants will be automatically interned,
and bare symbols will just be reader syntax for a string constant.

Instead of `string->symbol` we will have `string-intern`, which
basically does the same thing. Dynamically generated strings that
aren't passed to this function will be uninterned.

## But but but

(Late addition, because I didn't even notice this problem at first.
How embarrassing!)

But if symbols and strings are the same thing at the reader level,
then how on earth would you have a string literal in source code,
without it being evaluated as a variable?

    (display "bar") ;should this look up the variable 'bar'?!
    (display bar)   ;should this display the string 'bar'?!

There's actually a simple solution.

The syntax `"string"`, with double-quotes and nothing else, becomes
reader syntax akin to the apostrophe, and expands to:

    (quote #"string")

And `#"string"` is the real syntax for string literals, which are
always treated as identifiers by the evaluator.

Bare identifiers like `foo` instead directly become `#"foo"`, without
the wrapping `(quote ...)`, and are thus evaluated.

This also means that manually writing `#"string"` in your source code
allows that to be used as an identifier regardless of whether it has
illegal characters in it, essentially doing what `|string|` does in
R7RS-small.

Let's sum it up; here are the reader transformations:

    foo        -> #"foo"
    "foo"      -> (quote #"foo")
    "foo bar"  -> (quote #"foo bar")
    #"foo bar" -> #"foo bar"

Some pseudo-code based on Scheme:

    (let ((#"all your" "base ")
          (#"are belong" "to us"))
      (display
       (string-append #"all your" #"are belong")))

That prints: "base to us"

I'm not married to the syntax `#"string"` and may end up using the
simpler `|foo|` in the end. It doesn't really matter.

## More problems

Dangit, couldn't have been so easy, could it?

What if you have a configuration file with these contents:

    ((job-name "Clear /tmp directory")
     (interval reboot)
     (command "find /tmp -mindepth 1 -delete"))

Now all those "string constants" will turn into lists with the string
"quote" at their head. Terrible. One could write it with the explicit
string literal syntax `#"foo"` for strings, but that's also terrible.

## Salvageable

I'm not yet done with this idea. What if strings simply had a flag
that says whether they are intended as a symbol or not?

While reading, it would be set automatically. Instead of `intern`,
one would call a function like `symbol`, which would return a string
with the flag set, after interning it if necessary; it would simply
return the original string if it already had the flag set.

Another way to look at this is that strings and symbols are sort of
"polymorphic" and can be used interchangeably. I don't want to get
into JavaScript-style automatic type conversions (yuck), but this may
simply be a flag that's set on a string, which makes it a subtype of
regular strings.

Yes yes, I think that's good. I even still have enough space left in
the NaN-packing strategy to put a tag on "short strings", which are
our 6-byte immediate strings.

## Reconsidering AGAIN

*This got too long and off-topic, so it continues here:*

[Reader? Decoder?
I barely know 'er!](reader.html)

diff --git a/html/notes/zero-values.md b/html/notes/zero-values.md
deleted file mode 100644
index 3a1eecb..0000000
--- a/html/notes/zero-values.md
+++ /dev/null
@@ -1,11 +0,0 @@

# Return zero values when there's nothing to return

This is only a minor point:

It's a long-running pet peeve of mine that R5RS Scheme specifies "an
unspecified value" to be returned when there's nothing meaningful to
return. It's a remnant from before we had the ability to return
multiple values, and should be eliminated.

Any operation that has nothing meaningful to return will return zero
values in Zisp, and no more.
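
Standard Scheme can already express this convention with `(values)`;
a minimal sketch:

```scheme

;; In standard Scheme, (values) returns zero values.
(define (log-message msg)
  (display msg)
  (newline)
  (values)) ; nothing meaningful to return, so return nothing

;; A caller can observe that exactly zero values came back:
(call-with-values
  (lambda () (log-message "hello"))
  (lambda args (length args))) ; => 0
```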
