author    Taylan Kammer <taylan.kammer@gmail.com>  2026-01-09 11:06:25 +0100
committer Taylan Kammer <taylan.kammer@gmail.com>  2026-01-09 11:06:25 +0100
commit    a2ece405cc61341122fc075d499420e894c56909 (patch)
tree      133f54000c6cbc5dadca14e6bb22249b544acc6f /notes
parent    255e2f98680457b611e3e2b93e54da32052e6e55 (diff)
Add note.
Diffstat (limited to 'notes')
-rw-r--r--  notes/260109-uninterned.md  181
-rw-r--r--  notes/index.md                1
2 files changed, 182 insertions, 0 deletions
diff --git a/notes/260109-uninterned.md b/notes/260109-uninterned.md
new file mode 100644
index 0000000..0d82e6d
--- /dev/null
+++ b/notes/260109-uninterned.md
@@ -0,0 +1,181 @@
+# What strings NOT to intern?
+
+Although it's become fairly normal these days to define grammars in
+terms of Unicode, there's a certain simplicity to just defining them
+in terms of bytes instead.
+
+The Zisp s-expression grammar is defined in terms of bytes. Regular
+unquoted strings, such as identifiers, accept only characters in the
+ASCII range. But within quoted strings and line comments, *any* byte
+can appear, so long as it doesn't terminate the string or line
+comment.
+
+The implementation is very simple. When the parser encounters a
+double quote that opens a string, it begins consuming every byte it
+encounters and adding it to the string, until it reaches the closing
+double quote.
+
+Backslash escapes make it slightly more complicated, but so long as
+the byte you've read isn't a double-quote or backslash, you just add
+it to the string. You don't try to verify that it's part of a valid
+UTF-8 byte sequence or anything.
+
+Line comments are even simpler, since there are no backslash escapes
+there; you just skip bytes until you encounter the Line Feed byte.
+It's really that simple of a loop.
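
To make those loops concrete, here's a rough Python sketch of both scanners. The function names and the return conventions are my own invention for illustration, not the actual Zisp parser:

```python
def scan_string(data: bytes, pos: int) -> tuple[bytes, int]:
    """Scan a quoted string, starting just past the opening '"'.

    Returns the raw contents and the position just past the closing
    quote. Bytes are copied verbatim; no UTF-8 validation happens.
    """
    out = bytearray()
    while True:
        b = data[pos]
        pos += 1
        if b == ord('"'):   # closing quote terminates the string
            return bytes(out), pos
        if b == ord("\\"):  # escape: take the next byte literally
            b = data[pos]
            pos += 1
        out.append(b)

def skip_line_comment(data: bytes, pos: int) -> int:
    """Skip bytes until just past the next Line Feed (0x0A)."""
    while data[pos] != 0x0A:
        pos += 1
    return pos + 1
```

Note that neither function ever inspects a byte beyond checking it against the two or three delimiter values, which is what keeps arbitrary binary content safe inside strings and comments.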
+
+This way, string literals and comments can contain UTF-8 encoded
+Unicode text, which is highly desirable. The parser just doesn't
+care whether they're valid UTF-8 or something else.
+
+It may seem a bit "dirty" that comments and string literals can thus
+also contain arbitrary byte sequences that don't decode as valid
+UTF-8. You can have some garbled binary data right there in the
+middle of your source code.
+
+But I think this is not only an acceptable price to pay for the ease
+of implementation (and just plain conceptual simplicity) it offers,
+but even arguably a benefit:
+
+*It means you can use Zisp s-expressions to transmit binary data.*
+
+Consider two simple programs that talk over the network, and want to
+exchange binary data in a structured fashion.
+
+The best way to transmit binary, I suppose, is to send a header which
+declares the byte count, and then send the bytes. The receiver won't
+have to scan the data for some termination token, and the sender can
+also forego the need to ensure that the termination token doesn't
+accidentally appear within the payload.
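
Such a length-prefixed scheme is trivial to implement. Here's an illustrative Python sketch; the `write`/`read` callables are stand-ins for whatever socket wrapper you use, and the 4-byte big-endian header is just one reasonable choice:

```python
import struct

def send_blob(write, payload: bytes) -> None:
    """Send a 4-byte big-endian length header, then the raw payload."""
    write(struct.pack(">I", len(payload)))
    write(payload)

def recv_blob(read) -> bytes:
    """Read the header, then exactly that many payload bytes.

    Assumes read(n) returns exactly n bytes (retry logic omitted).
    """
    (length,) = struct.unpack(">I", read(4))
    return read(length)
```

The receiver never scans the payload at all; it just reads the declared number of bytes.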
+
+But if you're in a pinch, and need neither the complexity nor the
+higher throughput of such a length-prefixed binary protocol, you
+might go for a simplistic strategy like stuffing base64 encoded data
+into strings in a JSON object:
+
+```js
+{ "key1": "<base64 of value 1>"
+, "key2": "<base64 of value 2>"
+}
+```
+
+(After all, any programming language under the sun has a JSON library,
+be it built in or third-party.)
+
+It may not matter in many cases, but the base64 transform adds some
+processing overhead and inflates the size by about 33%. Compare that
+to Zisp s-expressions with binary data in strings:
+
+```scheme
+((key1 "<value 1 with backslash escapes>")
+ (key2 "<value 2 with backslash escapes>")
+)
+```
+
+The backslash escapes in question are extremely simple:
+
+1. Any byte with the value `"` (ASCII double quote) becomes the
+   two-byte sequence `\"` instead.
+
+2. Any byte with the value `\` (ASCII backslash) becomes the
+   two-byte sequence `\\` instead.
+
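As a sketch, an encoder and decoder for these two rules fit in a few lines of Python. Note that the encoder must double the backslashes *before* escaping quotes, or it would re-escape its own output; `escape`/`unescape` are illustrative names, not Zisp API:

```python
def escape(payload: bytes) -> bytes:
    # Double backslashes first, then escape quotes; the other order
    # would corrupt the backslashes introduced for the quotes.
    return payload.replace(b"\\", b"\\\\").replace(b'"', b'\\"')

def unescape(escaped: bytes) -> bytes:
    out = bytearray()
    it = iter(escaped)
    for b in it:
        if b == ord("\\"):
            b = next(it)  # the byte after a backslash is literal
        out.append(b)
    return bytes(out)
```

The round trip `unescape(escape(x)) == x` holds for any byte string, including arbitrary binary.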
+This has a worst case of doubling the size, but in more typical
+cases it will be much smaller than base64. CPU throughput is an
+interesting comparison, because base64 can be branchless and heavily
+SIMD-optimized. I'm lazy, and it's a silly comparison anyway (since,
+as mentioned, you should be using a proper binary format with length
+prefixes if you need optimal efficiency), so I just asked a bunch of
+popular LLMs like ChatGPT, and they actually disagreed with each
+other. Some claimed that base64 would outperform; others claimed it
+couldn't compete with the simple byte substitution algorithm, so
+long as the latter is also SIMD-optimized and the data isn't a
+pathological case requiring tons of backslash escapes.
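
The size claim, at least, is easy to check empirically. A quick sketch; random bytes are close to the average case for the escape scheme, since only 2 of 256 byte values need escaping:

```python
import base64
import os

payload = os.urandom(100_000)  # random binary payload

b64_size = len(base64.b64encode(payload))
escaped = payload.replace(b"\\", b"\\\\").replace(b'"', b'\\"')

print(b64_size / len(payload))      # ~1.33: base64's fixed overhead
print(len(escaped) / len(payload))  # ~1.008: only 2/256 values escaped
```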
+
+Even if base64 had higher CPU throughput, however, it still produces
+larger output (in most cases). Besides, a JSON parser needs to scan
+the entire string looking for backslash escapes anyway, even though
+base64 can't contain any, so the total overhead of the Zisp parser
+would probably beat the combined cost of JSON parsing and base64
+decoding.
+
+On top of that, with base64 you typically want to compress before
+encoding, whereas the simple backslash escape strategy lets you
+comfortably rely on transparent compression at the network level,
+for example if the applications communicate via HTTP with gzip or
+brotli compression enabled.
+
+I'm not saying this is some kind of killer feature, obviously. But I
+find it neat how the simplistic and "un-opinionated" nature of the
+Zisp s-expression format (not coercing you into UTF-8) allows for
+little hacks like this.
+
+```scheme
+;; pseudo-code of sending application
+(define (send-data port)
+ (port.write-object `((key1 ,binary1) (key2 ,binary2))))
+
+;; pseudo-code of receiving application
+(define (receive-data port)
+ (let ((data (port.read-object)))
+ (values (alist-get data 'key1)
+ (alist-get data 'key2))))
+```
+
+There, you just sent and received two pieces of binary data in a
+structured fashion. Does it get any simpler than that?
+
+## Uhh, what was that about interning?
+
+Oh right, I began writing this article with the intention of talking
+about string interning. Really thought I was finished for a second,
+before I remembered that.
+
+The parser interns strings by default. If you have potentially huge
+chunks of data transmitted as strings, that's a bit of an issue.
+
+Another example would be the use of the `(#STRING file-path)` decoder
+rule that I wrote about in the previous note. Actually, since that
+one will be handled by the decoder, not the parser, there's not as
+much of an issue there: Just document the fact that this decoder rule
+results in an uninterned string, rather than an interned one.
+
+But back to the binary data example. Or to be more specific, the
+general issue of massive strings appearing in Zisp data, whatever
+their content. There should be a way to avoid interning them.
+
+I think I might settle for a very simple rule: The parser has a hard
+cap on the length of strings that it will intern.
+
+If the string we just parsed is under 80 bytes, intern it. Otherwise,
+return it uninterned. Why 80 bytes? Because I think that's a fairly
+reasonable cutoff for how long an identifier could ever be. I mean,
+I'm using 80 columns as the limit when I write code! An identifier
+would need to span an entire line to be rejected.
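
A sketch of what that cap could look like inside a parser, in Python as a stand-in; `INTERN_CAP`, the table, and `maybe_intern` are my own names, not Zisp internals:

```python
INTERN_CAP = 80  # strings this long or longer are never interned

_intern_table: dict[bytes, bytes] = {}

def maybe_intern(s: bytes) -> bytes:
    """Return a canonical object for short strings, s itself otherwise."""
    if len(s) >= INTERN_CAP:
        return s  # uninterned: equal contents, but distinct objects
    return _intern_table.setdefault(s, s)
```

Short strings with equal contents come back as the same object, so hashing happens once; long strings skip the table entirely.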
+
+What do I mean by rejected? The parser won't care (it will silently
+return uninterned strings) and the decoder probably won't see a
+difference either. But once the data begins actually being treated
+as Zisp code, and an uninterned string appears in the position of an
+identifier, it will raise an error. Generally, strings carry with
+them the information of whether they are interned or not, so that's
+not an issue.
+
+In practice, this means that trying to run or compile Zisp code
+containing an identifier of length 80 or greater will cleanly
+display an error message saying you can't do that. No cryptic
+errors or silent failures.
+
+If you're instead using Zisp s-expressions for data serialization and
+deserialization, then you need to be cognizant of the fact that long
+strings with equal content won't be pointer-equal, but that is not a
+common expectation of a deserialization procedure. For example, I'm
+not sure if JSON parsers typically offer such an option, but I think
+they probably don't, in the general case.
+
+The automatic interning of short-to-medium-length strings actually
+makes Zisp parsing inherently a bit slower. Perhaps the parser could
+offer an option to forgo interning entirely, but I'm getting ahead
+of myself here. For now, let's just ensure that parsing data with
+tons of large strings doesn't lead to tons of redundant hashing.
diff --git a/notes/index.md b/notes/index.md
index 5d02a60..0550055 100644
--- a/notes/index.md
+++ b/notes/index.md
@@ -24,3 +24,4 @@
* [A full-stack programming language](260102-full-stack.html)
* [Simplifying S-Expression Grammar](260106-simpler-grammar.html)
* [Decoder](260107-decoder.html)
+* [What not to intern?](260109-uninterned.html)