
# Symbols and strings, revisited

_2025 March_

My [original plan](250210-symbols.html) was to make strings and
symbols one and the same.  Then I realized this introduced an
ambiguity between bare strings meant as identifiers and quoted
strings representing string literals in code.

After a bunch of back-and-forth, I came up with the idea of the Zisp
[decoder](250219-reader.html), with which I'm very happy overall.
Still, I decided to ditch the idea of representing quoted string
literals as `(#STRING . "foo")` after all.

The idea was that the reader would have a data mode and a code mode
and that quoted strings would become `(#STRING . "foo")` or such in
code mode, but not in data mode.  This way, reading a configuration
file (in data mode) that uses quoted strings would not end up giving
you this wonky thing with `#STRING`.
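
To make the abandoned two-mode idea concrete, here is a minimal
sketch in Python.  The function name `read_quoted` and the mode names
are made up for illustration, and a 2-tuple stands in for the cons
cell; none of this is actual Zisp code.

```python
def read_quoted(text, mode):
    """Return what a quoted string literal would have read as, per mode."""
    if mode == "code":
        # Stands in for the cons cell (#STRING . "foo")
        return ("#STRING", text)
    if mode == "data":
        # Data mode hands back the plain string, no wrapper
        return text
    raise ValueError("mode must be 'code' or 'data'")
```

The same input text yields two different data structures depending on
the mode, which is exactly the kind of reader state the rest of this
note argues against.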

It was an exciting idea at first, but eventually I realized that the
above was the *only* substantial reason to have separate modes for
reading s-expressions.  It also annoyed me a bit that every single
quoted string in code would be wrapped in a cons cell...

So, ultimately I've decided to simply make quoted strings a proper
sub-type of strings.  (Or make symbols a sub-type of strings;
whichever way you want to look at it.)

Also, my [NaN-packing strategy](250210-nan.html) has so much extra
room that I've decided to put up-to-6-byte strings into NaNs as an
optimization hack, and this applies to both quoted and bare strings.
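
As a rough illustration of why 6 bytes fit, here is a sketch in
Python that works on raw 64-bit patterns.  The layout is an
assumption made up for this example (a 3-bit tag just below the quiet
bit, a 48-bit payload, zero padding, big-endian byte order); the
layout Zisp actually uses may well differ.

```python
import math
import struct

# Hypothetical layout, not Zisp's real one.
QNAN = 0x7FF8_0000_0000_0000    # exponent all ones plus the quiet bit
TAG_SHIFT = 48                  # tag occupies bits 48..50
TAG_BARE = 0b001                # assumed tag: short bare string
TAG_QUOTED = 0b010              # assumed tag: short quoted string
PAYLOAD_MASK = (1 << 48) - 1    # low 48 bits hold up to 6 bytes


def pack_short(data, quoted):
    """Pack up to 6 bytes into the payload of a NaN bit pattern."""
    if len(data) > 6:
        raise ValueError("short strings hold at most 6 bytes")
    if 0 in data:
        raise ValueError("this sketch pads with zeros, so no NUL bytes")
    tag = TAG_QUOTED if quoted else TAG_BARE
    payload = int.from_bytes(data.ljust(6, b"\0"), "big")
    return QNAN | (tag << TAG_SHIFT) | payload


def unpack_short(bits):
    """Inverse of pack_short: return (bytes, quoted)."""
    tag = (bits >> TAG_SHIFT) & 0b111
    if (bits & QNAN) != QNAN or tag not in (TAG_BARE, TAG_QUOTED):
        raise ValueError("not a packed short string")
    data = (bits & PAYLOAD_MASK).to_bytes(6, "big").rstrip(b"\0")
    return data, tag == TAG_QUOTED


def as_float(bits):
    """Reinterpret a 64-bit pattern as a double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]
```

For example, `unpack_short(pack_short(b"lambda", False))` gives back
`(b"lambda", False)`, and `math.isnan(as_float(...))` holds for every
packed value, since the exponent bits are all ones and the mantissa
is non-zero.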

So we have two different string types, and two different in-memory
representations for each.  Let's summarize and give them names:

* sstr: Short string (symbol, up to 6 bytes)

* qstr: Quoted short string (non-symbol, up to 6 bytes)

* istr: Interned string (symbol, greater than 6 bytes)

* ustr: Uninterned string (non-symbol, greater than 6 bytes)

Don't get hung up on the short four-letter names; they aren't fully
descriptive.  The "qstr" isn't the only one representing a quoted
string literal; a "ustr" may also represent one.

Here's how the parser uses these types:

* Encountered an unquoted string of up to 6 bytes?  Make an sstr.

* Encountered a quoted string of up to 6 bytes?  Make a qstr.

* Unquoted string of more than 6 bytes?  Intern it to make an istr.

* Quoted string of more than 6 bytes?  Leave it uninterned to make a
  ustr.
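
The four rules above can be sketched as a small classifier.  This is
only an illustration in Python, not the Zisp parser: the names
`classify`, `SHORT_MAX`, and `_interned` are made up, and I'm
assuming here that lengths are counted in UTF-8 bytes.

```python
SHORT_MAX = 6    # bytes; the cutoff from the list above

_interned = {}   # hypothetical intern table: bytes -> the one shared object


def classify(text, quoted):
    """Return (kind, value) for one string token, following the four rules."""
    data = text.encode("utf-8")   # assumption: length is measured in UTF-8 bytes
    if len(data) <= SHORT_MAX:
        # Short strings need no heap object; only the type differs
        return ("qstr" if quoted else "sstr"), data
    if quoted:
        # Long quoted string: never entered into the intern table
        return "ustr", data
    # Long bare string: reuse the object already stored for these bytes
    return "istr", _interned.setdefault(data, data)
```

Reading the same long bare string twice returns the very same object
from the table, which is what makes an istr comparable by identity.
Note that under the byte-length assumption, `héllo!` counts as 7
bytes and therefore lands on the long side of the cutoff.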

*** WIP ***

_2026 January_

Currently, the Zisp parser does, after all, conflate strings and
symbols, with string literals simply being quoted symbols.  There
aren't going to be separate data types because they turned out to be
unnecessary.  The syntax `"foo bar"` parses into
`(#QUOTE . |foo bar|)` and I'll leave it at that.
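
A toy model of that behavior in Python, for illustration only.  The
names `Symbol`, `parse_atom`, and `show` are made up, `#QUOTE` is
modeled as an ordinary symbol, a 2-tuple stands in for a cons cell,
escape sequences are ignored, and the rule for when to print bars is
my assumption rather than the actual Zisp printer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str

    def __str__(self):
        # Assumed rule: print bare unless the name could not be read back bare
        special = any(c.isspace() or c in '()"|' for c in self.name)
        if self.name and not special:
            return self.name
        return f"|{self.name}|"


def parse_atom(token):
    """Parse one token; a quoted literal becomes (#QUOTE . symbol)."""
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return (Symbol("#QUOTE"), Symbol(token[1:-1]))
    return Symbol(token)


def show(datum):
    """Print a symbol or a pair in dotted notation."""
    if isinstance(datum, tuple):
        head, tail = datum
        return f"({show(head)} . {show(tail)})"
    return str(datum)
```

With these definitions, `show(parse_atom('"foo bar"'))` produces
`(#QUOTE . |foo bar|)`, while a bare `foo` stays the plain symbol
`foo`.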