From a2ece405cc61341122fc075d499420e894c56909 Mon Sep 17 00:00:00 2001
From: Taylan Kammer
Date: Fri, 9 Jan 2026 11:06:25 +0100
Subject: Add note.

---
 notes/260109-uninterned.md | 181 +++++++++++++++++++++++++++++++++++++++++++++
 notes/index.md             |   1 +
 2 files changed, 182 insertions(+)
 create mode 100644 notes/260109-uninterned.md

diff --git a/notes/260109-uninterned.md b/notes/260109-uninterned.md
new file mode 100644
index 0000000..0d82e6d
--- /dev/null
+++ b/notes/260109-uninterned.md
@@ -0,0 +1,181 @@

# What strings NOT to intern?

Although it's become fairly normal these days to define grammars in
terms of Unicode, there's a certain simplicity to defining them in
terms of bytes instead.

The Zisp s-expression grammar is defined in terms of bytes. Regular,
unquoted strings, like identifiers, accept only characters in the
ASCII range. But within quoted strings and line comments, *any* byte
can appear, so long as it doesn't terminate the string or line
comment.

The implementation is super simple. A double quote that opens a
string means the parser starts consuming every byte it encounters and
appending it to the string, until it reaches the closing double quote.

Backslash escapes make it slightly more complicated, but as long as
the byte you've just read isn't a double quote or a backslash, you
simply add it to the string. You don't try to verify that it's part
of a valid UTF-8 byte sequence or anything.

Line comments are even simpler, since there are no backslash escapes
there; you just skip bytes until you encounter the Line Feed byte.
It really is that simple a loop.

This way, string literals and comments can contain UTF-8 encoded
Unicode text, which is highly desirable. The parser just doesn't care
whether they're valid UTF-8 or something else.
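To make the two loops concrete, here is a minimal Python sketch of a
byte-oriented string reader and comment skipper as described above.
The function names and signatures are my own illustration, not the
actual Zisp parser:

```python
def read_quoted_string(data: bytes, pos: int) -> tuple[bytes, int]:
    # `pos` points just past the opening double quote.  Returns the
    # string's raw bytes and the position just past the closing quote.
    out = bytearray()
    while True:
        b = data[pos]
        pos += 1
        if b == 0x22:              # closing double quote: done
            return bytes(out), pos
        if b == 0x5C:              # backslash: take the next byte literally
            out.append(data[pos])
            pos += 1
        else:                      # any other byte, valid UTF-8 or not
            out.append(b)

def skip_line_comment(data: bytes, pos: int) -> int:
    # Skip bytes until (and including) the Line Feed byte.
    while data[pos] != 0x0A:
        pos += 1
    return pos + 1
```

Note that neither loop inspects bytes beyond the two special
characters, which is exactly what lets arbitrary non-UTF-8 bytes pass
through untouched.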
It may seem a bit "dirty" that comments and string literals can
thereby also contain random byte sequences that don't decode as valid
UTF-8. You can have garbled binary data right there in the middle of
your source code.

But I think this is not only an acceptable price to pay for the ease
of implementation (and plain conceptual simplicity) it offers, but
arguably even a benefit:

*It means you can use Zisp s-expressions to transmit binary data.*

Consider two simple programs that talk over the network and want to
exchange binary data in a structured fashion.

The best way to transmit binary data, I suppose, is to send a header
that declares the byte count, and then send the bytes. The receiver
doesn't have to scan the data for a termination token, and the sender
is spared the need to ensure that the termination token doesn't
accidentally appear within the payload.

But if you're in a pinch, and need neither the complexity nor the
higher throughput of such a length-prefixed binary protocol, you might
go for a simplistic strategy like stuffing base64 encoded data into
strings in a JSON object:

```js
{ "key1": "<base64 of first payload>"
, "key2": "<base64 of second payload>"
}
```

(After all, every programming language under the sun has a JSON
library, be it built-in or third-party.)

It may not matter in many cases, but the base64 transform adds
processing overhead and bloats the size by about 33%. Compare that to
Zisp s-expressions with binary data in strings:

```scheme
((key1 "<raw bytes of first payload>")
 (key2 "<raw bytes of second payload>")
)
```

The backslash escapes in question are extremely simple:

1. Any byte with the value of `"` (ASCII double quote) becomes the
   byte sequence `\"` instead.

2. Any byte with the value of `\` (ASCII backslash) becomes the byte
   sequence `\\` instead.

This has a worst case of doubling the size, but in typical cases the
result will be much smaller than base64.
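As a sanity check on those size claims, here is a hedged Python sketch
of the two escape rules and a comparison against base64. This is
illustrative only; the real Zisp writer may differ:

```python
import base64

def zisp_escape(payload: bytes) -> bytes:
    # Rules 1 and 2 above.  Backslashes must be doubled *before*
    # quotes are escaped, or the added backslashes would be escaped again.
    return payload.replace(b'\\', b'\\\\').replace(b'"', b'\\"')

def zisp_unescape(escaped: bytes) -> bytes:
    out = bytearray()
    it = iter(escaped)
    for b in it:
        # A backslash means the next byte is taken literally.
        out.append(next(it) if b == 0x5C else b)
    return bytes(out)

blob = bytes(range(256)) * 4              # 1024 bytes of sample binary
assert zisp_unescape(zisp_escape(blob)) == blob
print(len(zisp_escape(blob)))             # 1032 (only 8 bytes needed escaping)
print(len(base64.b64encode(blob)))        # 1368 (~33% bloat)
print(len(zisp_escape(b'\\' * 1024)))     # 2048: the pathological worst case
```

For uniformly random bytes, only 2 of 256 values need escaping, so the
expected overhead is under one percent.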
When it comes to CPU throughput, it's an interesting comparison,
because base64 can be branchless and heavily SIMD-optimized. I'm
lazy, and it's a silly comparison anyway (since, as mentioned, you
should be using a proper length-prefixed binary format if you need
optimal efficiency), so I just asked a bunch of popular LLMs like
ChatGPT, and they actually disagreed with each other. Some claimed
base64 would outperform; others claimed it couldn't compete with the
simple byte substitution algorithm, so long as the latter is also
SIMD-optimized and the data isn't a pathological case requiring tons
of backslash escapes.

Even if base64 had higher CPU throughput, however, it still has the
larger data size issue (in most cases), and besides, a JSON parser
needs to scan the entire string for backslash escapes anyway, even
though base64 can't contain any, so the total overhead of the Zisp
parser would probably beat the combined cost of JSON parsing plus
base64 decoding with ease.

On top of that, with base64 you typically want to compress *before*
the transformation, whereas the simple backslash escape strategy lets
you comfortably rely on transparent compression at the network level,
for example gzip or brotli compression in HTTP, if that's how the
applications communicate.

I'm not saying this is some kind of killer feature, obviously. But I
find it neat how the simplistic and "un-opinionated" nature of the
Zisp s-expression format (it doesn't coerce you into UTF-8) allows for
little hacks like this.

```scheme
;; pseudo-code of sending application
(define (send-data port)
  (port.write-object `((key1 ,binary1) (key2 ,binary2))))

;; pseudo-code of receiving application
(define (receive-data port)
  (let ((data (port.read-object)))
    (values (alist-get data 'key1)
            (alist-get data 'key2))))
```

There, you just sent and received two pieces of binary data in a
structured fashion.
Does it get any simpler than that?

## Uhh, what was that about interning?

Oh right, I began writing this article intending to talk about string
interning. I really thought I was finished for a second, before I
remembered that.

The parser interns strings by default. If you have potentially huge
chunks of data transmitted as strings, that's a bit of an issue.

Another example would be the `(#STRING file-path)` decoder rule that I
wrote about in the previous note. Actually, since that one is handled
by the decoder, not the parser, it's less of an issue: just document
the fact that this decoder rule results in an uninterned string rather
than an interned one.

But back to the binary data example. Or, more specifically, the
general issue of massive strings appearing in Zisp data, whatever
their content. There should be a way to avoid interning them.

I think I might settle for a very simple rule: the parser has a hard
cap on the length of strings that it will intern.

If the string we just parsed is under 80 bytes, intern it. Otherwise,
return it uninterned. Why 80 bytes? Because I think that's a fairly
reasonable cutoff for how long an identifier could ever be. I mean, I
use 80 columns as the limit when I write code! An identifier would
need to span an entire line to be rejected.

What do I mean by rejected? The parser won't care (it will silently
return uninterned strings), and the decoder probably won't see a
difference either. But once the data is actually treated as Zisp
code, and an uninterned string appears in the position of an
identifier, an error is raised. Strings generally carry with them the
information of whether they are interned or not, so that's not a
problem.

In practice, this means that trying to run or compile Zisp code
containing an identifier of 80 bytes or more will cleanly display an
error message saying you can't do that.
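The cutoff rule can be illustrated with CPython's own interning
machinery. This is just an analogy in Python, not the Zisp
implementation; `INTERN_CAP` is the hypothetical 80-byte cutoff
discussed above:

```python
import sys

INTERN_CAP = 80  # bytes; the hypothetical cutoff discussed above

def maybe_intern(s: str) -> str:
    # Short strings are likely identifiers: intern them, so equal
    # content shares one object.  Big payloads pass through untouched.
    return sys.intern(s) if len(s.encode('utf-8')) < INTERN_CAP else s

a = maybe_intern('some-identifier')
b = maybe_intern('some-identifier')
print(a is b)                    # True: both refer to one interned object

big = 'x' * 200
print(maybe_intern(big) is big)  # True: returned as-is, uninterned
```

Measuring the UTF-8 byte length rather than the character count keeps
the rule consistent with a grammar defined in terms of bytes.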
No cryptic error messages or silent failures.

If you're instead using Zisp s-expressions for data serialization and
deserialization, you need to be cognizant of the fact that long
strings with equal content won't be pointer-equal, but that is not a
common expectation of a deserialization procedure anyway. I'm not
sure whether JSON parsers typically offer such a guarantee, but I
suspect they don't in the general case.

The automatic interning of short to medium length strings actually
makes Zisp parsing inherently a bit slower. Perhaps the parser could
offer an option to forego interning entirely, but I'm getting ahead of
myself here. For now, let's just ensure that parsing data with tons
of large strings doesn't lead to tons of redundant hashing.

diff --git a/notes/index.md b/notes/index.md
index 5d02a60..0550055 100644
--- a/notes/index.md
+++ b/notes/index.md
@@ -24,3 +24,4 @@
 * [A full-stack programming language](260102-full-stack.html)
 * [Simplifying S-Expression Grammar](260106-simpler-grammar.html)
 * [Decoder](260107-decoder.html)
+* [What not to intern?](260109-uninterned.html)
--
cgit v1.2.3