# What strings NOT to intern?

Although it's become fairly normal these days to define grammars in terms of Unicode, there's a certain simplicity to just defining them in terms of bytes instead. The Zisp s-expression grammar is defined in terms of bytes. Regular, unquoted strings, like identifiers, accept only characters in the ASCII range. Within quoted strings and line comments, however, *any* byte can appear, so long as it doesn't terminate the string or line comment.

It's a super simple implementation. When the parser encounters a double-quote that opens a string, it simply consumes every byte it encounters and adds it to the string, until it reaches the closing double-quote. Backslash escapes make it slightly more complicated, but as long as the byte you've read isn't a double-quote or a backslash, you just add it to the string. You don't try to verify that it's part of a valid UTF-8 byte sequence or anything. Line comments are even simpler, since there are no backslash escapes there; you just skip bytes until you encounter the Line Feed byte. It's really that simple of a loop.

This way, string literals and comments can contain UTF-8 encoded Unicode text, which is highly desirable. The parser just doesn't care whether it's valid UTF-8 or something else.

It may seem a bit "dirty" that this means comments and string literals can also contain arbitrary byte sequences that don't decode as valid UTF-8. You can have garbled binary data right there in the middle of your source code. But I think this is not only an acceptable price to pay for the ease of implementation (and just plain conceptual simplicity) but arguably even a benefit: *It means you can use Zisp s-expressions to transmit binary data.*

Consider two simple programs that talk over the network and want to exchange binary data in a structured fashion.
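The loop described above is short enough to sketch in full. Here's a rough Python rendition, purely illustrative (the function names and return conventions are my own, not the actual Zisp parser):

```python
def parse_quoted_string(data: bytes, pos: int) -> tuple[bytes, int]:
    """Parse a quoted string starting just after the opening '"'.

    Consumes raw bytes (no UTF-8 validation) until the closing '"'.
    Returns the unescaped bytes and the position after the close quote.
    """
    out = bytearray()
    while True:
        b = data[pos]
        pos += 1
        if b == ord('"'):           # closing quote terminates the string
            return bytes(out), pos
        if b == ord('\\'):          # escape: take the next byte literally
            out.append(data[pos])
            pos += 1
        else:
            out.append(b)           # any other byte, valid UTF-8 or not

def skip_line_comment(data: bytes, pos: int) -> int:
    """Skip bytes until just after the Line Feed byte."""
    while pos < len(data) and data[pos] != ord('\n'):
        pos += 1
    return pos + 1
```

Note that `parse_quoted_string(b'\xff\xfe"', 0)` happily returns the two invalid-UTF-8 bytes; the loop never inspects byte values beyond the two special characters.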
The best way to transmit binary data, I suppose, is to send a header that declares the byte count, and then send the bytes. The receiver doesn't have to scan the data for a termination token, and the sender doesn't have to ensure that the termination token never accidentally appears within the payload.

But if you're in a pinch, and need neither the complexity nor the throughput of such a length-prefixed binary protocol, you might go for a simplistic strategy like stuffing base64 encoded data into strings in a JSON object:

```js
{
  "key1": "",
  "key2": ""
}
```

(After all, every programming language under the sun has a JSON library, be it built in or third-party.)

It may not matter in many cases, but the base64 transform adds some processing overhead and bloats the size by 33%. Compare that to Zisp s-expressions with binary data in strings:

```scheme
((key1 "")
 (key2 ""))
```

The backslash escapes in question are extremely simple:

1. Any byte with the value of `"` (ASCII double quote) becomes the byte sequence `\"` instead.
2. Any byte with the value of `\` (ASCII backslash) becomes the byte sequence `\\` instead.

This has a worst-case scenario of doubling the size, but in typical cases it produces much smaller output than base64.

When it comes to CPU throughput, it's an interesting comparison, because base64 can be branchless and heavily SIMD-optimized. I'm lazy, and it's a silly comparison anyway (since, as mentioned, you should be using a proper binary format with length prefixes if you need optimal efficiency), so I just asked a bunch of popular LLMs like ChatGPT, and they actually disagreed with each other. Some claimed that base64 would outperform; others claimed it couldn't compete with the simple byte substitution algorithm, so long as the latter is also optimized with SIMD and the data isn't a pathological case requiring tons of backslash escapes.
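As a quick sanity check on those size claims, here's a minimal Python sketch of the two-byte escape transform next to base64 (the `zisp_escape`/`zisp_unescape` names are my own, not part of any real Zisp library):

```python
import base64

def zisp_escape(payload: bytes) -> bytes:
    # Only two byte values ever need escaping: backslash first,
    # then double quote (the order prevents double-escaping).
    return payload.replace(b'\\', b'\\\\').replace(b'"', b'\\"')

def zisp_unescape(escaped: bytes) -> bytes:
    out = bytearray()
    it = iter(escaped)
    for b in it:
        # A backslash means the next byte is literal.
        out.append(next(it) if b == ord('\\') else b)
    return bytes(out)

data = bytes(range(256))   # every possible byte value, once each
esc = zisp_escape(data)
b64 = base64.b64encode(data)
assert zisp_unescape(esc) == data
print(len(data), len(esc), len(b64))  # 256, 258 (two escapes), 344 (+33%)
```

For this uniformly distributed input only 2 of 256 bytes need escaping, while base64 pays its one-third overhead regardless of content.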
Even if base64 were to have higher CPU throughput, however, it still has the larger data size issue (in most cases). Besides, a JSON parser needs to scan the entire string looking for backslash escapes anyway, even though base64 can't contain any, so the total overhead of the Zisp parser would probably easily beat the combined JSON parsing and base64 decoding.

On top of that, with base64 you typically want to compress *before* the transformation, whereas the simple backslash escape strategy lets you comfortably rely on transparent compression at the network level, for example if the applications communicate via HTTP and implement gzip or brotli compression as part of that.

I'm not saying this is some kind of killer feature, obviously. But I find it neat how the simplistic and "un-opinionated" nature of the Zisp s-expression format (not coercing you into UTF-8) allows for little hacks like this.

```scheme
;; pseudo-code of sending application
(define (send-data port)
  (port.write-object `((key1 ,binary1)
                       (key2 ,binary2))))

;; pseudo-code of receiving application
(define (receive-data port)
  (let ((data (port.read-object)))
    (values (alist-get data 'key1)
            (alist-get data 'key2))))
```

There, you just sent and received two pieces of binary data in a structured fashion. Does it get any simpler than that?

## Uhh, what was that about interning?

Oh right, I began writing this article with the intention of talking about string interning. I really thought I was finished for a second, before I remembered that.

The parser interns strings by default. If you have potentially huge chunks of data transmitted as strings, that's a bit of an issue.

Another example would be the use of the `(#STRING file-path)` decoder rule that I wrote about in the previous note. Actually, since that one is handled by the decoder, not the parser, there's less of an issue there: just document the fact that this decoder rule results in an uninterned string, rather than an interned one.
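To make the round trip concrete, here's a rough Python equivalent of that pseudo-code using an in-memory stream. This is a toy stand-in, not a real Zisp writer/reader: it only handles this exact `((key "payload") ...)` shape, and all names are my own.

```python
import io

def write_record(port, pairs):
    """Write ((key "payload") ...) using the two-byte escape scheme."""
    port.write(b'(')
    for key, payload in pairs:
        esc = payload.replace(b'\\', b'\\\\').replace(b'"', b'\\"')
        port.write(b'(' + key + b' "' + esc + b'")')
    port.write(b')')

def read_record(port):
    """Read back a record written above into a dict of raw bytes."""
    data = port.read()
    result, pos = {}, 1                  # skip the outer '('
    while data[pos] != ord(')'):
        end = data.index(b' "', pos)     # key runs up to ' "'
        key, pos = data[pos + 1:end], end + 2
        out = bytearray()
        while data[pos] != ord('"'):
            if data[pos] == ord('\\'):   # escaped byte: take next literally
                pos += 1
            out.append(data[pos])
            pos += 1
        result[key] = bytes(out)
        pos += 2                         # skip closing '"' and ')'
    return result

port = io.BytesIO()
write_record(port, [(b'key1', b'\x00\xff"\\'), (b'key2', b'hello')])
port.seek(0)
print(read_record(port))
```

The first payload deliberately contains a NUL byte, an invalid-UTF-8 byte, a quote, and a backslash, and it survives the round trip untouched.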
But back to the binary data example. Or, to be more specific, the general issue of massive strings appearing in Zisp data, whatever their content. There should be a way to avoid interning them.

I think I might settle for a very simple rule: the parser has a hard cap on the length of strings that it will intern. If the string we just parsed is under 80 bytes, intern it. Otherwise, return it uninterned.

Why 80 bytes? Because I think that's a fairly reasonable cutoff for how long an identifier could ever be. I mean, I'm using 80 columns as the limit when I write code! An identifier would need to span an entire line to be rejected.

What do I mean by rejected? The parser won't care (it will return uninterned strings silently) and the decoder probably won't see a difference either. But once the data begins actually being treated as Zisp code, and an uninterned string appears in the position of an identifier, it will raise an error. Strings generally carry with them the information of whether they are interned or not, so detecting this is not an issue. In practice, this means that trying to run or compile Zisp code containing an identifier of length 80 or greater will cleanly display an error message saying you can't do that. No cryptic error message or silent failure or such.

If you're instead using Zisp s-expressions for data serialization and deserialization, then you need to be cognizant of the fact that long strings with equal content won't be pointer-equal. But that is not a common expectation of a deserialization procedure anyway; I'm not sure whether JSON parsers typically offer interning as an option, but I suspect they generally don't.

The automatic interning of short to medium length strings actually makes Zisp parsing inherently a bit slower. Perhaps the parser could offer an option to forego interning entirely, but I'm getting ahead of myself here.
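The cap rule itself is a one-branch check in front of the intern table. Here's the idea sketched in Python; the 80-byte threshold comes from the text above, but the table shape and the `(string, interned?)` return convention are my own assumptions:

```python
INTERN_CAP = 80  # strings this long or longer are returned uninterned

_intern_table: dict[bytes, bytes] = {}

def maybe_intern(s: bytes) -> tuple[bytes, bool]:
    """Return (string, interned?). Only short strings touch the hash table."""
    if len(s) >= INTERN_CAP:
        return s, False                      # no hashing, no table lookup
    return _intern_table.setdefault(s, s), True

short, interned = maybe_intern(b"some-identifier")
again, _ = maybe_intern("some-identifier".encode())
assert interned and again is short           # same object after interning

n = 100
long1, i1 = maybe_intern(b"x" * n)
long2, i2 = maybe_intern(b"x" * n)
assert not i1 and not i2
assert long1 == long2 and long1 is not long2  # equal content, distinct objects
```

The last assertion is exactly the caveat for serialization users: past the cap, equal content no longer implies pointer equality, but the parser also does no redundant hashing on big payloads.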
For now, let's just ensure that parsing data with tons of large strings doesn't lead to tons of redundant hashing.