
# Symbols and strings, revisited

My [original plan](symbols.html) was to make strings and symbols one
and the same.  Then I realized this introduced ambiguity between bare
strings meant as identifiers and quoted strings meant as string
literals in code.

After a bunch of back-and-forth, I came up with the idea of the Zisp
[decoder](reader.html), with which I'm very happy overall.  Even so,
I've decided to ditch the idea of an intermediate representation like
`(#STRING . "foo")` for quoted string literals after all.

The idea was that the reader would have a data mode and a code mode
and that quoted strings would become `(#STRING . "foo")` or such in
code mode, but not in data mode.  This way, reading a configuration
file (in data mode) that uses quoted strings would not end up giving
you this wonky thing with `#STRING`.

It was an exciting idea at first, but eventually I realized that the
above was the *only* substantial reason to have separate modes for
reading s-expressions.  It also annoyed me a bit that every single
quoted string in code would be wrapped in a cons cell...

So, ultimately I've decided to simply make quoted strings a proper
sub-type of strings.  (Or make symbols a sub-type of strings;
whichever way you want to look at it.)

Also, my [NaN-packing strategy](nan.html) has so much extra room that
I've decided to put up-to-6-byte strings into NaNs as an optimization
hack, and this applies to both quoted and bare strings.
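
To make the "6 bytes in a NaN" idea concrete, here's a minimal sketch
in Python.  The bit layout is made up for illustration and is not the
actual layout from the NaN-packing notes: I'm assuming a quiet NaN, a
hypothetical `STR_TAG` bit, a hypothetical `QUOTED_BIT` to tell quoted
from bare, and NUL padding for strings shorter than 6 bytes (so short
strings can't contain NUL bytes in this sketch).  The 64-bit word is
kept as a plain integer rather than an actual double.

```python
# All constants below are assumptions for this sketch, not Zisp's layout.
QNAN = 0x7FF8_0000_0000_0000   # exponent all ones plus the quiet bit
STR_TAG = 1 << 50              # hypothetical: marks "short string"
QUOTED_BIT = 1 << 48           # hypothetical: 1 = quoted, 0 = bare
PAYLOAD_MASK = (1 << 48) - 1   # low 48 bits hold up to 6 bytes


def pack_short(text: str, quoted: bool) -> int:
    """Pack a string of up to 6 bytes into a NaN-shaped 64-bit word."""
    data = text.encode("utf-8")
    if len(data) > 6 or b"\0" in data:
        raise ValueError("need at most 6 bytes and no NUL bytes")
    # Pad on the right with NUL bytes so the payload is always 48 bits.
    payload = int.from_bytes(data.ljust(6, b"\0"), "big")
    return QNAN | STR_TAG | (QUOTED_BIT if quoted else 0) | payload


def unpack_short(word: int) -> tuple[str, bool]:
    """Recover (text, quoted) from a word made by pack_short."""
    data = (word & PAYLOAD_MASK).to_bytes(6, "big").rstrip(b"\0")
    return data.decode("utf-8"), bool(word & QUOTED_BIT)
```

Note that the limit is 6 *bytes*, not 6 characters: under this sketch,
`pack_short("héllo!", False)` raises, because the UTF-8 encoding is 7
bytes long.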

So we have two different string types, and two different in-memory
representations for each.  Let's summarize and give them names:

* sstr: Short string (symbol, up to 6 bytes)

* qstr: Quoted short string (non-symbol, up to 6 bytes)

* istr: Interned string (symbol, greater than 6 bytes)

* ustr: Uninterned string (non-symbol, greater than 6 bytes)

Don't get hung up on the short four-letter names; they aren't fully
descriptive.  The "qstr" isn't the only one representing a quoted
string literal; a "ustr" may also represent one.

Here's how the parser uses these types:

* Encountered an unquoted string of up to 6 bytes?  Make an sstr.

* Encountered a quoted string of up to 6 bytes?  Make a qstr.

* Unquoted string of more than 6 bytes?  Intern it to make an istr.

* Quoted string of more than 6 bytes?  Make a ustr (not interned).
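
The four rules above boil down to two yes/no questions: is it quoted,
and does it fit in 6 bytes?  Here's a rough sketch of that decision in
Python.  The names `classify`, `make_long`, `HeapString`, and the
intern table are all hypothetical; the real parser surely looks
different.

```python
SHORT_MAX = 6  # bytes; the short-string limit from the text


def classify(data: bytes, quoted: bool) -> str:
    """Return which of the four string types the parser would make."""
    if len(data) <= SHORT_MAX:
        return "qstr" if quoted else "sstr"
    return "ustr" if quoted else "istr"


class HeapString:
    """Hypothetical stand-in for a heap-allocated long string."""

    def __init__(self, data: bytes, interned: bool):
        self.data = data
        self.interned = interned


_intern_table: dict[bytes, HeapString] = {}  # hypothetical intern table


def make_long(data: bytes, quoted: bool) -> HeapString:
    """Build an istr (interned) or ustr (fresh) for a long string."""
    if len(data) <= SHORT_MAX:
        raise ValueError("short strings don't go on the heap")
    if quoted:
        # ustr: every quoted literal gets its own object.
        return HeapString(data, interned=False)
    # istr: equal contents always yield the same object.
    existing = _intern_table.get(data)
    if existing is None:
        existing = HeapString(data, interned=True)
        _intern_table[data] = existing
    return existing
```

With this sketch, two long bare strings with the same contents are the
same object and can be compared by identity, while two long quoted
strings with the same contents are distinct objects.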

*** WIP ***