diff options
| author | Taylan Kammer <taylan.kammer@gmail.com> | 2025-03-29 23:56:22 +0100 |
|---|---|---|
| committer | Taylan Kammer <taylan.kammer@gmail.com> | 2025-03-29 23:56:22 +0100 |
| commit | d6e50e7a631d0dfe8d41438be89f8b00dfc9a4df (patch) | |
| tree | 741717b08aafac370ce416f901c4698c62b39bfa | |
| parent | fc23b42c6e2183c8ca8b6c42dc4e90d8061f835d (diff) | |
add some unfinished notes and docs
| -rw-r--r-- | docs/decoder.md | 44 | ||||
| -rw-r--r-- | docs/parser.md | 122 | ||||
| -rw-r--r-- | notes/boot.md | 17 | ||||
| -rw-r--r-- | notes/numbers.md | 45 | ||||
| -rw-r--r-- | notes/strings.md | 57 | ||||
| -rw-r--r-- | notes/unread.md | 36 |
6 files changed, 321 insertions, 0 deletions
diff --git a/docs/decoder.md b/docs/decoder.md new file mode 100644 index 0000000..0b34204 --- /dev/null +++ b/docs/decoder.md @@ -0,0 +1,44 @@ +# Decoding + +A separate process called "decoding" can transform simple data structures, +consisting of only the datum types, into a richer set of Zisp types. + +For example, the decoder may turn `(#HASH ...)` into a vector, as one would +expect a vector literal like `#(...)` to work in Scheme. Bytevector syntax +could use a custom rune as a list prefix, like: `#u8(...)` + +Runes may be decoded in isolation as well, rather than transforming a list +whose head they appear in. This can implement Boolean constants as `#true` +and `#false` or `#t` and `#f`. + +The decoder recognizes `(#QUOTE ...)` to aid in implementing the traditional +quoting mechanism of Lisp/Scheme, but with a significant difference: + +Traditional quote is "unhygienic" in Scheme terms. An expression such as +`'(foo bar)` will always be read as `(quote (foo bar))` regardless of what +lexical context it appears in, so the semantics will depend on whatever the +identifier `quote` is bound to, meaning that the expression may end up +evaluating to something other than the list `(foo bar)`. + +The Zisp decoder, which transforms not datum to datum, but object to object, +can turn `#QUOTE` into an object which encapsulates the notion of quoting, +which the Zisp evaluator can recognize and act upon, ensuring that an +expression like `'(foo bar)` always turns into the list `(foo bar)`. + +One way to think about this, in Scheme (R6RS / syntax-case) terms, is that +expressions like `'(foo bar)` turn directly into a syntax object when read, +and the created syntax object begins with an identifier bound to `quote` in +the standard library. + +The decoder is, of course, configurable and extensible. The transformations +mentioned above would be performed only when it's told to decode data which +represents Zisp code. The decoder may be given a different configuration, +telling it to decode, for example, data which represents a different kind of +domain-specific data, such as application settings, build system commands, +complex data records with non-standard data types, and so on. + +<!-- +;; Local Variables: +;; fill-column: 77 +;; End: +--> diff --git a/docs/parser.md b/docs/parser.md new file mode 100644 index 0000000..a4e5d78 --- /dev/null +++ b/docs/parser.md @@ -0,0 +1,122 @@ +# Parser for Code & Data + +Zisp s-expressions are defined in terms of an extremely minimal set of data +types; only that which is necessary to build representations of more complex +expressions and data types: + + +--------+-----------------+---------------+--------+----------+------+ + | TYPE | Bare String | Quoted String | Rune | Pair | Nil | + +--------+-----------------+---------------+--------+----------+------+ + | E.G. | foo, |foo bar| | "foo bar" | #name | (X . Y) | () | + +--------+-----------------+---------------+--------+----------+------+ + +Bare strings and quoted strings are polymorphic sub-types of the generic +string type. Bare strings are implicitly interned. + +The parser can also output non-negative integers, but this is only used for +datum labels; number literals are handled by the decoder (see next section). + +The parser recognizes various "syntax sugar" and transforms it into uses of +the above data types. The most ubiquitous example is of course the list: + + (datum1 datum2 ...) -> (datum1 . (datum2 . (... . ()))) + +The following table summarizes the other supported transformations: + + #datum -> (#HASH . datum) #rune(...) -> (#rune ...) + + [...] -> (#SQUARE ...) dat1dat2 -> (#JOIN dat1 . dat2) + + {...} -> (#BRACE ...) dat1.dat2 -> (#DOT dat1 . dat2) + + 'datum -> (#QUOTE . datum) dat1:dat2 -> (#COLON dat1 . dat2) + + `datum -> (#GRAVE . datum) #%hex% -> (#LABEL . hex) + + ,datum -> (#COMMA . datum) #%hex=datum -> (#LABEL hex . datum) + +A separate process called "decoding" can transform these objects into other +data types. For example, `(#HASH x y z)` could become a vector, so that the +expression `#(x y z)` works just like in Scheme. See the next section for +details about the decoder. + +Decoding also resolves datum labels, and goes over bare strings to find ones +that are actually a number literal. This lets us offload the complexity of +number parsing elsewhere, so the parser remains extremely simple. + +Further notes about the syntax sugar table and examples above: + +* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis + means zero or more data; hex is a hexadecimal number of up to 12 digits. + +* The `#datum` form only applies when the datum following the hash sign is a + list, quoted string, quote expression, another expression starting with a + hash sign, a bare string starting with a backslash escape (see next), or a + pipe-quoted bare string (see next). + +* A backslash causes the immediately following character to lose any special + meaning it would have, and be considered as part of a bare string instead. + (This does not apply to space or control characters.) For example, the + following three character sequences are each a valid bare string: + + foo\(bar\) \]blah \#\'xyz + + Bare strings can also be "quoted" with pipes as in Scheme; it should be + noted that this still produces a "bare string" in terms of data type: + + |foo bar baz| + +* Though not represented in the table due to notational difficulty, the form + `#rune(...)` doesn't require a list in the second position; any datum that + works with the `#datum` syntax also works with `#rune<DATUM>`. + + #rune1#rune2 -> (#rune1 . #rune2) + + #rune"text" -> (#rune . "text") + + #rune\string -> (rune . string) + + #rune'string -> (#rune #QUOTE . string) + + As a counter-example, following a rune immediately with a bare string isn't + possible, since it's ambiguous: + + #abcdefgh ;Could be (#abcdef . gh) or (#abcde . fgh) or ... + +* Syntax sugar can combine arbitrarily; some examples follow: + + #{...} -> (#HASH #BRACE ...) + + #'foo -> (#HASH #QUOTE . foo) + + ##'[...] -> (#HASH #HASH #QUOTE #SQUARE ...) + + {x y}[i j] -> (#JOIN (#BRACE x y) #SQUARE i j) + + foo.bar.baz{x y} -> (#JOIN (#DOT (#DOT foo . bar) . baz) #BRACE x y) + +* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses + as `(#QUOTE . foo)` instead; the operand of `#QUOTE` is the entire cdr. + + The same principle is used when parsing other sugar; some examples follow: + + Incorrect Correct + + #(x y z) -> (#HASH (x y z)) #(x y z) -> (#HASH x y z) + + [x y z] -> (#SQUARE (x y z)) [x y z] -> (#SQUARE x y z) + + #{x} -> (#HASH (#BRACE (x))) #{x} -> (#HASH #BRACE x) + + foo(x y) -> (#JOIN foo (x y)) foo(x y) -> (#JOIN foo x y) + +* Runes are case-sensitive, and the parser only emits runes using upper-case + letters when expressing syntax sugar. This way, there can be no accidental + clash with runes that appear verbatim in code, as long as only lower-case + letters are used for rune literals in code. + +<!-- +;; Local Variables: +;; fill-column: 77 +;; End: +--> diff --git a/notes/boot.md b/notes/boot.md new file mode 100644 index 0000000..758d264 --- /dev/null +++ b/notes/boot.md @@ -0,0 +1,17 @@ +# Bootstrapping Zisp + +In my opinion, any serious programming language must have a serious +bootstrapping strategy that addresses the "Trusting Trust" issue aka +the Thompson Hack. The easiest way to do that is making sure that +your language can be bootstrapped from an existing language, which +itself has some solution to the problem. + +Currently, I'm thinking of implementing Zisp in Zig. (That's not the +entire reason Zisp is called Zisp, and I might choose a different +language eventually, and/or rename Zisp, but anyway.) + +Zig, in turn, will *hopefully* be possible to bootstrap from C in the +future, or some language implemented in C. For C, there are some ways +to bootstrap it from scratch. + +*** WIP *** diff --git a/notes/numbers.md b/notes/numbers.md new file mode 100644 index 0000000..6507a67 --- /dev/null +++ b/notes/numbers.md @@ -0,0 +1,45 @@ + +exacts: + + uint : 0...n + + sint : -n...-1 | uint + + ratn : ( p: sint, q: sint ) + + comp : ( r: ratn, i: ratn ) + + +inexacts: + + double : ieee754 double with +inf, -inf, +nan, -nan + + cmp128 : ( r: double , i: double ) + + +exact operations: + + uint + uint = uint + + sint + uint = sint + + ratn + uint = ratn [ ratn + ( p = uint , q = 0 ) ] + + ratn + sint = ratn [ ratn + ( p = sint , q = 0 ) ] + + ratn + ratn = ratn + + comp + uint = comp [ comp + ( r = ( p = uint , q = 0 ) , i = 0 ) ] + + comp + sint = comp [ comp + ( r = ( p = sint , q = 0 ) , i = 0 ) ] + + comp + ratn = comp [ comp + ( r = ratn , i = 0 ) ] + + comp + comp = comp + + +inexact operations: + + double + double = double + + cmp128 + double = cmp128 [ cmp128 + ( r = double , i = 0 ) ] diff --git a/notes/strings.md b/notes/strings.md new file mode 100644 index 0000000..6f01944 --- /dev/null +++ b/notes/strings.md @@ -0,0 +1,57 @@ +# Symbols and strings, revisited + +My [original plan](symbols.html) was to make strings and symbols one +and the same. Then I realized this introduced ambiguity between bare +strings meant as identifiers, and quoted strings representing a string +literal in code. + +After a bunch of back-and-forth, I came up with the idea of the Zisp +[decoder](reader.html) with which I'm very happy overall, but I still +decided to ditch the idea of using an intermediate representation for +quoted string literals like `(#STRING . "foo")` after all. + +The idea was that the reader would have a data mode and a code mode +and that quoted strings would become `(#STRING . "foo")` or such in +code mode, but not in data mode. This way, reading a configuration +file (in data mode) that uses quoted strings would not end up giving +you this wonky thing with `#STRING`. + +It was an exciting idea at first, but eventually I realized that the +above was the *only* substantial reason to have separate modes for +reading s-expressions. It also annoyed me a bit that every single +quoted string in code would be wrapped in a cons cell... + +So, ultimately I've decided to simply make quoted strings a proper +sub-type of strings. (Or make symbols a sub-type of strings; which +ever way you want to look at it.) + +Also, my [NaN-packing strategy](nan.html) has so much extra room that +I've decided to put up-to-6-byte strings into NaNs as an optimization +hack, and this applies to both quoted and bare strings. + +So we have two different string types, and two different in-memory +representations for each. Let's summarize and give them names: + +* sstr: Short string (symbol, up to 6 bytes) + +* qstr: Quoted short string (non-symbol, up to 6 bytes) + +* istr: Interned string (symbol, greater than 6 bytes) + +* ustr: Uninterned string (non-symbol, greater than 6 bytes) + +Don't get hung up on the short four-letter names; they aren't fully +descriptive. The "qstr" isn't the only one representing a quoted +string literal; a "ustr" may also represent one. + +Here's how the parser uses these types: + +* Encountered an unquoted string of up to 6 bytes? Make a sstr. + +* Encountered a quoted string of up to 6 bytes? Make a qstr. + +* Unquoted string of more than 6 bytes? Intern it to make an istr. + +* Quoted string of more than 6 bytes? Uninterned string. + +*** WIP *** diff --git a/notes/unread.md b/notes/unread.md new file mode 100644 index 0000000..31b2f91 --- /dev/null +++ b/notes/unread.md @@ -0,0 +1,36 @@ +# Must ports support seeking? + +With traditional s-expressions, it's not always possible to stop +reading bytes as soon as the end of the current datum is reached, +because some data don't have a terminating character. Consider a +sequence of s-expressions such as: + + foo(bar) + +After reading the second 'o', the parser has no way of knowing that +the symbol has ended. It must read another byte. + +If the underlying input stream doesn't support "unreading" or seeking +back, this is troublesome: The opening parenthesis is consumed by the +first call to the parser, and then discarded, since it's not part of +the symbol it's reading. The second call to the parser cannot know +that the "read head" is already within a list. + +I assume that traditional lisps work around this issue by requiring +all streams (ports) to have seeking or unreading functionality, which +isn't too bad. Assuming you only need to look ahead by one character, +any port without this feature can be wrapped in a port that adds it +via a simple one-character buffer. If more than one character of +look-ahead is needed, a small circular buffer could be used. + +Thankfully, Zisp s-expressions are all self-terminating. This is +because a datum followed immediately by another datum, without any +blanks in between, is a "joined datum" expression. Any number of +additional data can be joined like this, yielding a more and more +deeply nested compound datum. Only a blank or EOF can end this, +meaning that disjoint data within a stream are necessarily delimited +by blanks. + + + +*** WIP *** |
