Diffstat (limited to 'notes/strings.md')
-rw-r--r-- notes/strings.md | 57
1 files changed, 57 insertions, 0 deletions
diff --git a/notes/strings.md b/notes/strings.md
new file mode 100644
index 0000000..6f01944
--- /dev/null
+++ b/notes/strings.md
@@ -0,0 +1,57 @@
+# Symbols and strings, revisited
+
+My [original plan](symbols.html) was to make strings and symbols one
+and the same. Then I realized this introduced ambiguity between bare
+strings meant as identifiers, and quoted strings representing a string
+literal in code.
+
+After a bunch of back-and-forth, I came up with the idea of the Zisp
+[decoder](reader.html) with which I'm very happy overall, but I still
+decided to ditch the idea of using an intermediate representation for
+quoted string literals like `(#STRING . "foo")` after all.
+
+The idea was that the reader would have a data mode and a code mode
+and that quoted strings would become `(#STRING . "foo")` or such in
+code mode, but not in data mode. This way, reading a configuration
+file (in data mode) that uses quoted strings would not end up giving
+you this wonky thing with `#STRING`.
+
+It was an exciting idea at first, but eventually I realized that the
+above was the *only* substantial reason to have separate modes for
+reading s-expressions. It also annoyed me a bit that every single
+quoted string in code would be wrapped in a cons cell...
+
+So, ultimately I've decided to simply make quoted strings a proper
+sub-type of strings. (Or make symbols a sub-type of strings; whichever
+way you want to look at it.)
+
+Also, my [NaN-packing strategy](nan.html) has so much extra room that
+I've decided to put up-to-6-byte strings into NaNs as an optimization
+hack, and this applies to both quoted and bare strings.
+
+So we have two different string types, and two different in-memory
+representations for each. Let's summarize and give them names:
+
+* sstr: Short string (symbol, up to 6 bytes)
+
+* qstr: Quoted short string (non-symbol, up to 6 bytes)
+
+* istr: Interned string (symbol, greater than 6 bytes)
+
+* ustr: Uninterned string (non-symbol, greater than 6 bytes)
+
+Don't get hung up on the short four-letter names; they aren't fully
+descriptive. The "qstr" isn't the only one representing a quoted
+string literal; a "ustr" may also represent one.
+
+Here's how the parser uses these types:
+
+* Encountered an unquoted string of up to 6 bytes? Make an sstr.
+
+* Encountered a quoted string of up to 6 bytes? Make a qstr.
+
+* Unquoted string of more than 6 bytes? Intern it to make an istr.
+
+* Quoted string of more than 6 bytes? Leave it uninterned: a ustr.
+
+*** WIP ***
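
The four parser rules in the note can be read as a two-by-two decision on "quoted or bare" and "fits in 6 bytes or not". The following is a rough Python sketch of that decision, plus one possible way a short string could sit inside a quiet NaN. It is not Zisp's code: `StrKind`, `classify`, `pack_short`, `as_double` and `QUOTED_BIT` are hypothetical names, the bit layout is invented for illustration and is not taken from the NaN-packing note, and counting the limit in UTF-8 bytes is an assumption.

```python
import math
import struct
from enum import Enum

SHORT_MAX = 6  # byte limit for the NaN-packed forms, as stated in the note


class StrKind(Enum):
    SSTR = "sstr"  # short string: symbol, up to 6 bytes
    QSTR = "qstr"  # quoted short string: non-symbol, up to 6 bytes
    ISTR = "istr"  # interned string: symbol, more than 6 bytes
    USTR = "ustr"  # uninterned string: non-symbol, more than 6 bytes


def classify(text: str, quoted: bool) -> StrKind:
    """Pick the string type the parser rules describe."""
    size = len(text.encode("utf-8"))  # assumption: the limit is in UTF-8 bytes
    if size <= SHORT_MAX:
        return StrKind.QSTR if quoted else StrKind.SSTR
    return StrKind.USTR if quoted else StrKind.ISTR


# Hypothetical layout, for illustration only:
#   bits 52-62: exponent, all ones    bit 51: quiet bit
#   bit 48: "quoted" tag              bits 0-47: up to 6 bytes, zero-padded
QNAN = 0x7FF8_0000_0000_0000
QUOTED_BIT = 1 << 48


def pack_short(text: str, quoted: bool) -> int:
    """Return the 64-bit pattern for a short string under the layout above."""
    raw = text.encode("utf-8")
    if len(raw) > SHORT_MAX:
        raise ValueError("too long for a NaN payload")
    payload = int.from_bytes(raw.ljust(SHORT_MAX, b"\0"), "big")  # 48 bits
    return QNAN | (QUOTED_BIT if quoted else 0) | payload


def as_double(bits: int) -> float:
    """Reinterpret a 64-bit pattern as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


print(classify("foo", quoted=False))               # StrKind.SSTR
print(classify("hello world", quoted=True))        # StrKind.USTR
print(math.isnan(as_double(pack_short("foo", True))))  # True
```

The arithmetic behind the 6-byte limit is that a double has 52 fraction bits, a quiet NaN uses one of them, and 6 bytes take 48 of the remaining 51, which leaves a few bits over for tags such as the quoted flag assumed here.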