path: root/notes/260109-uninterned.md

# What strings NOT to intern?

Although it's become fairly normal these days to define grammars in
terms of Unicode, there's a certain simplicity to just defining them
in terms of bytes instead.

The Zisp s-expression grammar is defined in terms of bytes.  Regular,
unquoted strings, like identifiers, are defined as only accepting
characters in the ASCII range.  But within quoted strings, and line
comments, *any* byte can appear so long as it doesn't terminate the
string or line comment.

It's a super simple implementation.  After a double-quote opens a
string, the parser just consumes every byte it sees and appends it to
the string, until it reaches a closing double-quote character.

Backslash escapes make it slightly more complicated, but so long as
the byte you've read isn't a double-quote or backslash, you just add
it to the string.  You don't try to verify that it's part of a valid
UTF-8 byte sequence or anything.

Line comments are even simpler, since there are no backslash escapes
there; you just skip bytes until you encounter the Line Feed byte.
It's really that simple a loop.
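
The loop described above can be sketched in Python (illustrative
only; `read_string` and `skip_line_comment` are hypothetical names,
and bounds checking for malformed input is omitted):

```python
def read_string(data: bytes, i: int) -> tuple[bytes, int]:
    """Read a quoted string starting just after the opening '"'.

    Returns the string's raw bytes (escapes resolved) and the index
    just past the closing quote.
    """
    out = bytearray()
    while True:
        b = data[i]
        if b == 0x22:              # '"' closes the string
            return bytes(out), i + 1
        if b == 0x5C:              # '\' escapes the next byte
            i += 1
            b = data[i]
        out.append(b)              # any other byte is taken verbatim
        i += 1

def skip_line_comment(data: bytes, i: int) -> int:
    """Skip bytes until (and including) the next Line Feed."""
    while i < len(data) and data[i] != 0x0A:
        i += 1
    return i + 1
```

Note that neither loop ever asks whether the bytes form valid UTF-8.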

This way, string literals and comments can contain UTF-8 encoded
Unicode text, which is highly desirable.  The parser just doesn't
care whether they're valid UTF-8 or something else.

It may seem a bit "dirty" that comments and string literals can thus
also contain arbitrary byte sequences that don't decode as valid
UTF-8.  You can have some garbled binary data right there in the
middle of your source code.

But I think this is not only an acceptable price to pay for the ease
of implementation (and just plain conceptual simplicity) but even
arguably a benefit:

*It means you can use Zisp s-expressions to transmit binary data.*

Consider two simple programs that talk over the network, and want to
exchange binary data in a structured fashion.

The best way to transmit binary, I suppose, is to send a header which
declares the byte count, and then send the bytes.  The receiver won't
have to scan the data for some termination token, and the sender can
also forego the need to ensure that the termination token doesn't
accidentally appear within the payload.

But if you're in a pinch, and don't need the higher throughput of
such a length-prefixed binary protocol badly enough to justify its
complexity, you might go for a simplistic strategy like stuffing
base64 encoded data into strings in a JSON object:

```js
{ "key1": "<base64 of value 1>"
, "key2": "<base64 of value 2>"
}
```

(After all, any programming language under the sun has a JSON library,
be it built in or third-party.)
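
For illustration, here's what that workaround looks like in Python,
round-tripping a hypothetical payload through base64-in-JSON:

```python
import base64
import json

# Hypothetical payload: two chunks of raw binary data.
payload = {"key1": b"\x00\xff binary", "key2": b"more\x00bytes"}

# Sender: base64-encode each value so it's valid JSON string content.
msg = json.dumps({k: base64.b64encode(v).decode("ascii")
                  for k, v in payload.items()})

# Receiver: decode each value back into raw bytes.
decoded = {k: base64.b64decode(v) for k, v in json.loads(msg).items()}
assert decoded == payload
```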

It may not matter in many cases, but the base64 transform will add
some processing overhead and bloat the size by roughly 33%.  Compare
that to Zisp s-expressions with binary data in strings:

```scheme
((key1 "<value 1 with backslash escapes>")
 (key2 "<value 2 with backslash escapes>")
)
```

The backslash escapes in question are extremely simple:

1. Any byte with value of `"` (ASCII double quote) becomes the byte
   sequence `\"` instead.

2. Any byte with value of `\` (ASCII backslash) becomes the byte
   sequence `\\` instead.
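
These two rules can be sketched in Python as a pair of byte-level
transforms (`escape` and `unescape` are hypothetical helper names,
not part of any Zisp implementation):

```python
def escape(data: bytes) -> bytes:
    """Prefix '"' (0x22) and '\\' (0x5C) bytes with a backslash."""
    out = bytearray()
    for b in data:
        if b in (0x22, 0x5C):
            out.append(0x5C)       # insert the escaping backslash
        out.append(b)
    return bytes(out)

def unescape(data: bytes) -> bytes:
    """Reverse the transform: backslash takes the next byte verbatim."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == 0x5C:
            b = next(it)
        out.append(b)
    return bytes(out)
```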

This has a worst-case scenario of doubling the size, but in more
typical cases the result will be much smaller than base64.  CPU
throughput is a more interesting comparison, because base64 can be
branchless and heavily SIMD-optimized.  I'm lazy, and it's a silly
comparison anyway (since, as mentioned, you should be using a proper
binary format with length prefixes if you need optimal efficiency),
so I just asked a bunch of popular LLMs like ChatGPT, and they
actually disagreed with each other.  Some claimed that base64 would
outperform; others claimed it couldn't compete with the simple byte
substitution algorithm, so long as the latter is also optimized with
SIMD and the data is not a pathological case making tons of backslash
escapes necessary.
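
A quick back-of-envelope check of the size claims: on data where
every byte value appears equally often, only 2 of 256 values need
escaping, so the overhead is under 1%, versus base64's fixed ~33%
(plus padding):

```python
import base64

data = bytes(range(256)) * 4       # 1024 bytes, every value present
extra = sum(b in (0x22, 0x5C) for b in data)  # bytes needing escapes
escaped_len = len(data) + extra    # backslash-escaped size
b64_len = len(base64.b64encode(data))

assert escaped_len == 1032         # ~0.8% overhead
assert b64_len == 1368             # ~33.6% overhead
```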

Even if base64 were to have higher CPU throughput, however, it still
produces larger data (in most cases).  Besides, a JSON parser needs
to scan the entire string looking for backslash escapes anyway, even
though base64 can't contain any, so the total overhead of the Zisp
parser would probably easily beat the combined cost of JSON parsing
and base64 decoding.

On top of that, with base64 you typically want to compress before
encoding, since base64 output compresses poorly, whereas the simple
backslash escape strategy lets you comfortably rely on transparent
compression at the network level, for instance if the applications
communicate via HTTP with gzip or brotli compression enabled.

I'm not saying this is some kind of killer feature, obviously.  But I
find it neat how the simplistic and "un-opinionated" nature of the
Zisp s-expression format (not coercing you into UTF-8) allows for
little hacks like this.

```scheme
;; pseudo-code of sending application
(define (send-data port)
  (port.write-object `((key1 ,binary1) (key2 ,binary2))))

;; pseudo-code of receiving application
(define (receive-data port)
  (let ((data (port.read-object)))
    (values (alist-get data 'key1)
            (alist-get data 'key2))))
```

There, you just sent and received two pieces of binary data in a
structured fashion.  Does it get any simpler than that?

## Uhh, what was that about interning?

Oh right, I began writing this article with the intention of talking
about string interning.  Really thought I was finished for a second,
before I remembered that.

The parser interns strings by default.  If you have potentially huge
chunks of data transmitted as strings, that's a bit of an issue.

Another example would be the use of the `(#STRING file-path)` decoder
rule that I wrote about in the previous note.  Actually, since that
one will be handled by the decoder, not the parser, there's not as
much of an issue there: Just document the fact that this decoder rule
results in an uninterned string, rather than an interned one.

But back to the binary data example.  Or to be more specific, the
general issue of massive strings appearing in Zisp data, whatever
their content.  There should be a way to avoid interning them.

I think I might settle for a very simple rule: The parser has a hard
cap on the length of strings that it will intern.

If the string we just parsed is under 80 bytes, intern it.  Otherwise,
return it uninterned.  Why 80 bytes?  Because I think that's a fairly
reasonable cutoff for how long an identifier could ever be.  I mean,
I'm using 80 columns as the limit when I write code!  An identifier
would need to span an entire line to be rejected.
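
A minimal sketch of that rule in Python (the table, the helper name,
and the identity-based check are all illustrative; a real parser
would also tag strings with an interned flag):

```python
INTERN_CAP = 80  # bytes; the cutoff discussed above

_interned: dict[bytes, bytes] = {}

def maybe_intern(s: bytes) -> bytes:
    """Intern strings under the cap; return longer ones as-is."""
    if len(s) >= INTERN_CAP:
        return s                   # too long: skip hashing entirely
    return _interned.setdefault(s, s)
```

Two short equal strings come back as the same object, while a long
string is returned untouched, without ever being hashed.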

What do I mean by rejected?  The parser won't care (it will return
uninterned strings silently) and the decoder probably won't see a
difference either.  But once the data is actually interpreted as Zisp
code, an uninterned string appearing in the position of an identifier
will raise an error.  Strings generally carry the information of
whether they are interned or not, so detecting this is not an issue.

In practice, this means that trying to run or compile Zisp code
containing an identifier of length 80 or greater will cleanly display
an error message saying you can't do that.  No cryptic errors, no
silent failures.

If you're instead using Zisp s-expressions for data serialization and
deserialization, you need to be aware that long strings with equal
content won't be pointer-equal.  But that's not a common expectation
to have of a deserialization procedure anyway; I doubt typical JSON
parsers even offer such an option.

The automatic interning of short to medium length strings actually
makes Zisp parsing inherently a bit slower.  Perhaps the parser could
offer an option to forego interning entirely, but I'm getting ahead of
myself here.  For now, let's just ensure that parsing data with tons
of large strings doesn't lead to tons of redundant hashing.