add some unfinished notes and docs

author: Taylan Kammer <taylan.kammer@gmail.com> 2025-03-29 23:56:22 +0100
committer: Taylan Kammer <taylan.kammer@gmail.com> 2025-03-29 23:56:22 +0100
commit: d6e50e7a631d0dfe8d41438be89f8b00dfc9a4df (patch)
tree: 741717b08aafac370ce416f901c4698c62b39bfa
parent: fc23b42c6e2183c8ca8b6c42dc4e90d8061f835d (diff)
6 files changed, 321 insertions, 0 deletions
diff --git a/docs/decoder.md b/docs/decoder.md
new file mode 100644
index 0000000..0b34204
--- /dev/null
+++ b/docs/decoder.md
@@ -0,0 +1,44 @@
+# Decoding
+
+A separate process called "decoding" can transform simple data structures,
+consisting of only the datum types, into a richer set of Zisp types.
+
+For example, the decoder may turn `(#HASH ...)` into a vector, as one would
+expect a vector literal like `#(...)` to work in Scheme.  Bytevector syntax
+could use a custom rune as a list prefix, like: `#u8(...)`
+
+Runes may be decoded in isolation as well, rather than transforming a list
+whose head they appear in.  This can implement Boolean constants as `#true`
+and `#false` or `#t` and `#f`.
+
+The decoder recognizes `(#QUOTE ...)` to aid in implementing the traditional
+quoting mechanism of Lisp/Scheme, but with a significant difference:
+
+Traditional quote is "unhygienic" in Scheme terms.  An expression such as
+`'(foo bar)` will always be read as `(quote (foo bar))` regardless of what
+lexical context it appears in, so the semantics will depend on whatever the
+identifier `quote` is bound to, meaning that the expression may end up
+evaluating to something other than the list `(foo bar)`.
+
+The Zisp decoder, which transforms not datum to datum, but object to object,
+can turn `#QUOTE` into an object which encapsulates the notion of quoting,
+which the Zisp evaluator can recognize and act upon, ensuring that an
+expression like `'(foo bar)` always turns into the list `(foo bar)`.
+
+One way to think about this, in Scheme (R6RS / syntax-case) terms, is that
+expressions like `'(foo bar)` turn directly into a syntax object when read,
+and the created syntax object begins with an identifier bound to `quote` in
+the standard library.
+
+The decoder is, of course, configurable and extensible.  The transformations
+mentioned above would be performed only when it's told to decode data which
+represents Zisp code.  The decoder may be given a different configuration,
+telling it to decode, for example, data which represents a different kind of
+domain-specific data, such as application settings, build system commands,
+complex data records with non-standard data types, and so on.
+
+<!--
+;; Local Variables:
+;; fill-column: 77
+;; End:
+-->
diff --git a/docs/parser.md b/docs/parser.md
new file mode 100644
index 0000000..a4e5d78
--- /dev/null
+++ b/docs/parser.md
@@ -0,0 +1,122 @@
+# Parser for Code & Data
+
+Zisp s-expressions are defined in terms of an extremely minimal set of data
+types; only that which is necessary to build representations of more complex
+expressions and data types:
+
+    +--------+-----------------+---------------+--------+----------+------+
+    | TYPE   | Bare String     | Quoted String | Rune   | Pair     | Nil  |
+    +--------+-----------------+---------------+--------+----------+------+
+    | E.G.   | foo, |foo bar|  | "foo bar"     | #name  | (X . Y)  | ()   |
+    +--------+-----------------+---------------+--------+----------+------+
+
+Bare strings and quoted strings are polymorphic sub-types of the generic
+string type.  Bare strings are implicitly interned.
+
+The parser can also output non-negative integers, but this is only used for
+datum labels; number literals are handled by the decoder (see next section).
+
+The parser recognizes various "syntax sugar" and transforms it into uses of
+the above data types.  The most ubiquitous example is of course the list:
+
+    (datum1 datum2 ...)  ->  (datum1 . (datum2 . (... . ())))
+
+The following table summarizes the other supported transformations:
+
+    #datum  -> (#HASH . datum)        #rune(...)   -> (#rune ...)
+
+    [...]   -> (#SQUARE ...)          dat1dat2     -> (#JOIN dat1 . dat2)
+
+    {...}   -> (#BRACE ...)           dat1.dat2    -> (#DOT dat1 . dat2)
+
+    'datum  -> (#QUOTE . datum)       dat1:dat2    -> (#COLON dat1 . dat2)
+
+    `datum  -> (#GRAVE . datum)       #%hex%       -> (#LABEL . hex)
+
+    ,datum  -> (#COMMA . datum)       #%hex=datum  -> (#LABEL hex . datum)
+
+A separate process called "decoding" can transform these objects into other
+data types.  For example, `(#HASH x y z)` could become a vector, so that the
+expression `#(x y z)` works just like in Scheme.  See the next section for
+details about the decoder.
+
+Decoding also resolves datum labels, and goes over bare strings to find ones
+that are actually a number literal.  This lets us offload the complexity of
+number parsing elsewhere, so the parser remains extremely simple.
+
+Further notes about the syntax sugar table and examples above:
+
+* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
+  means zero or more data; hex is a hexadecimal number of up to 12 digits.
+
+* The `#datum` form only applies when the datum following the hash sign is a
+  list, quoted string, quote expression, another expression starting with a
+  hash sign, a bare string starting with a backslash escape (see next), or a
+  pipe-quoted bare string (see next).
+
+* A backslash causes the immediately following character to lose any special
+  meaning it would have, and be considered as part of a bare string instead.
+  (This does not apply to space or control characters.)  For example, the
+  following three character sequences are each a valid bare string:
+
+      foo\(bar\)  \]blah  \#\'xyz
+
+  Bare strings can also be "quoted" with pipes as in Scheme; it should be
+  noted that this still produces a "bare string" in terms of data type:
+
+      |foo bar baz|
+
+* Though not represented in the table due to notational difficulty, the form
+  `#rune(...)` doesn't require a list in the second position; any datum that
+  works with the `#datum` syntax also works with `#rune<DATUM>`.
+
+      #rune1#rune2  -> (#rune1 . #rune2)
+
+      #rune"text"   -> (#rune . "text")
+
+      #rune\string  -> (rune . string)
+
+      #rune'string  -> (#rune #QUOTE . string)
+
+  As a counter-example, following a rune immediately with a bare string isn't
+  possible, since it's ambiguous:
+
+      #abcdefgh  ;Could be (#abcdef . gh) or (#abcde . fgh) or ...
+
+* Syntax sugar can combine arbitrarily; some examples follow:
+
+      #{...}            -> (#HASH #BRACE ...)
+
+      #'foo             -> (#HASH #QUOTE . foo)
+
+      ##'[...]          -> (#HASH #HASH #QUOTE #SQUARE ...)
+
+      {x y}[i j]        -> (#JOIN (#BRACE x y) #SQUARE i j)
+
+      foo.bar.baz{x y}  -> (#JOIN (#DOT (#DOT foo . bar) . baz) #BRACE x y)
+
+* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses
+  as `(#QUOTE . foo)` instead; the operand of `#QUOTE` is the entire cdr.
+
+  The same principle is used when parsing other sugar; some examples follow:
+
+      Incorrect                              Correct
+
+      #(x y z) -> (#HASH (x y z))            #(x y z) -> (#HASH x y z)
+
+      [x y z]  -> (#SQUARE (x y z))          [x y z]  -> (#SQUARE x y z)
+
+      #{x}     -> (#HASH (#BRACE (x)))       #{x}     -> (#HASH #BRACE x)
+
+      foo(x y) -> (#JOIN foo (x y))          foo(x y) -> (#JOIN foo x y)
+
+* Runes are case-sensitive, and the parser only emits runes using upper-case
+  letters when expressing syntax sugar.  This way, there can be no accidental
+  clash with runes that appear verbatim in code, as long as only lower-case
+  letters are used for rune literals in code.
+
+<!--
+;; Local Variables:
+;; fill-column: 77
+;; End:
+-->
diff --git a/notes/boot.md b/notes/boot.md
new file mode 100644
index 0000000..758d264
--- /dev/null
+++ b/notes/boot.md
@@ -0,0 +1,17 @@
+# Bootstrapping Zisp
+
+In my opinion, any serious programming language must have a serious
+bootstrapping strategy that addresses the "Trusting Trust" issue aka
+the Thompson Hack.  The easiest way to do that is making sure that
+your language can be bootstrapped from an existing language, which
+itself has some solution to the problem.
+
+Currently, I'm thinking of implementing Zisp in Zig.  (That's not the
+entire reason Zisp is called Zisp, and I might choose a different
+language eventually, and/or rename Zisp, but anyway.)
+
+Zig, in turn, will *hopefully* be possible to bootstrap from C in the
+future, or some language implemented in C.  For C, there are some ways
+to bootstrap it from scratch.
+
+*** WIP ***
diff --git a/notes/numbers.md b/notes/numbers.md
new file mode 100644
index 0000000..6507a67
--- /dev/null
+++ b/notes/numbers.md
@@ -0,0 +1,45 @@
+
+exacts:
+
+  uint : 0...n
+
+  sint : -n...-1 | uint
+
+  ratn : ( p: sint, q: sint )
+
+  comp : ( r: ratn, i: ratn )
+
+
+inexacts:
+
+  double : ieee754 double with +inf, -inf, +nan, -nan
+
+  cmp128 : ( r: double , i: double )
+
+
+exact operations:
+
+  uint + uint = uint
+
+  sint + uint = sint
+
+  ratn + uint = ratn  [ ratn + ( p = uint , q = 0 ) ]
+
+  ratn + sint = ratn  [ ratn + ( p = sint , q = 0 ) ]
+
+  ratn + ratn = ratn
+
+  comp + uint = comp  [ comp + ( r = ( p = uint , q = 0 ) , i = 0 ) ]
+
+  comp + sint = comp  [ comp + ( r = ( p = sint , q = 0 ) , i = 0 ) ]
+
+  comp + ratn = comp  [ comp + ( r = ratn , i = 0 ) ]
+
+  comp + comp = comp
+
+
+inexact operations:
+
+  double + double = double
+
+  cmp128 + double = cmp128  [ cmp128 + ( r = double , i = 0 ) ]
diff --git a/notes/strings.md b/notes/strings.md
new file mode 100644
index 0000000..6f01944
--- /dev/null
+++ b/notes/strings.md
@@ -0,0 +1,57 @@
+# Symbols and strings, revisited
+
+My [original plan](symbols.html) was to make strings and symbols one
+and the same.  Then I realized this introduced ambiguity between bare
+strings meant as identifiers, and quoted strings representing a string
+literal in code.
+
+After a bunch of back-and-forth, I came up with the idea of the Zisp
+[decoder](reader.html) with which I'm very happy overall, but I still
+decided to ditch the idea of using an intermediate representation for
+quoted string literals like `(#STRING . "foo")` after all.
+
+The idea was that the reader would have a data mode and a code mode
+and that quoted strings would become `(#STRING . "foo")` or such in
+code mode, but not in data mode.  This way, reading a configuration
+file (in data mode) that uses quoted strings would not end up giving
+you this wonky thing with `#STRING`.
+
+It was an exciting idea at first, but eventually I realized that the
+above was the *only* substantial reason to have separate modes for
+reading s-expressions.  It also annoyed me a bit that every single
+quoted string in code would be wrapped in a cons cell...
+
+So, ultimately I've decided to simply make quoted strings a proper
+sub-type of strings.  (Or make symbols a sub-type of strings; which
+ever way you want to look at it.)
+
+Also, my [NaN-packing strategy](nan.html) has so much extra room that
+I've decided to put up-to-6-byte strings into NaNs as an optimization
+hack, and this applies to both quoted and bare strings.
+
+So we have two different string types, and two different in-memory
+representations for each.  Let's summarize and give them names:
+
+* sstr: Short string (symbol, up to 6 bytes)
+
+* qstr: Quoted short string (non-symbol, up to 6 bytes)
+
+* istr: Interned string (symbol, greater than 6 bytes)
+
+* ustr: Uninterned string (non-symbol, greater than 6 bytes)
+
+Don't get hung up on the short four-letter names; they aren't fully
+descriptive.  The "qstr" isn't the only one representing a quoted
+string literal; a "ustr" may also represent one.
+
+Here's how the parser uses these types:
+
+* Encountered an unquoted string of up to 6 bytes?  Make a sstr.
+
+* Encountered a quoted string of up to 6 bytes?  Make a qstr.
+
+* Unquoted string of more than 6 bytes?  Intern it to make an istr.
+
+* Quoted string of more than 6 bytes?  Uninterned string.
+
+*** WIP ***
diff --git a/notes/unread.md b/notes/unread.md
new file mode 100644
index 0000000..31b2f91
--- /dev/null
+++ b/notes/unread.md
@@ -0,0 +1,36 @@
+# Must ports support seeking?
+
+With traditional s-expressions, it's not always possible to stop
+reading bytes as soon as the end of the current datum is reached,
+because some data don't have a terminating character.  Consider a
+sequence of s-expressions such as:
+
+    foo(bar)
+
+After reading the second 'o', the parser has no way of knowing that
+the symbol has ended.  It must read another byte.
+
+If the underlying input stream doesn't support "unreading" or seeking
+back, this is troublesome: The opening parenthesis is consumed by the
+first call to the parser, and then discarded, since it's not part of
+the symbol it's reading.  The second call to the parser cannot know
+that the "read head" is already within a list.
+
+I assume that traditional lisps work around this issue by requiring
+all streams (ports) to have seeking or unreading functionality, which
+isn't too bad.  Assuming you only need to look ahead by one character,
+any port without this feature can be wrapped in a port that adds it
+via a simple one-character buffer.  If more than one character of
+look-ahead is needed, a small circular buffer could be used.
+
+Thankfully, Zisp s-expressions are all self-terminating.  This is
+because a datum followed immediately by another datum, without any
+blanks in between, is a "joined datum" expression.  Any number of
+additional data can be joined like this, yielding a more and more
+deeply nested compound datum.  Only a blank or EOF can end this,
+meaning that disjoint data within a stream are necessarily delimited
+by blanks.
+
+
+
+*** WIP ***
author	Taylan Kammer <taylan.kammer@gmail.com>	2025-03-29 23:56:22 +0100
committer	Taylan Kammer <taylan.kammer@gmail.com>	2025-03-29 23:56:22 +0100
commit	d6e50e7a631d0dfe8d41438be89f8b00dfc9a4df (patch)
tree	741717b08aafac370ce416f901c4698c62b39bfa
parent	fc23b42c6e2183c8ca8b6c42dc4e90d8061f835d (diff)