# Parser for Code & Data

*For an exact specification of the grammar, see [grammar](grammar/).*

Zisp S-Expressions represent an extremely minimal set of data types; only that
which is necessary to strategically construct more complex code and data:

    +--------+-----------------+--------+----------+------+
    | TYPE   | String          | Rune   | Pair     | Nil  |
    +--------+-----------------+--------+----------+------+
    | E.G.   | foo, |foo bar|  | #name  | (X & Y)  | ()   |
    +--------+-----------------+--------+----------+------+

The parser can also output non-negative integers, but this is only used for
datum labels; number literals are handled by the *decoder* instead.


## Decoder

A separate process called *decoding* can transform such data into more complex
types.  For example, `(#HASH x y z)` could be decoded into an array, so the
expression `#(x y z)` could work like in Scheme; or `(#SQUARE x y z)` could be
decoded into a function call expression that will, at run-time, allocate and
initialize a dynamic array with three elements, so the expression `[x y z]`
would work like in JavaScript.

Decoding also resolves datum labels, scans strings to find those that are
actually number literals, and takes care of a number of other transformations.
This offloads complexity, allowing the parser to remain extremely simple.  See
the dedicated documentation of the decoder for more.


## Syntax sugar

The parser recognizes various "syntax sugar" and transforms it into uses of the
above listed minimal data types.  The most ubiquitous example is the list:

    (datum1 datum2 ...)  ->  (datum1 & (datum2 & (... & ())))
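
To make the data model and the list sugar concrete, here is a minimal sketch in
Python.  It is not how the Zisp parser is implemented; the names `Rune`,
`Pair`, `Nil`, and `make_list` are made up for illustration, and strings are
simply modelled as Python `str`.

```python
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass(frozen=True)
class Rune:
    name: str          # e.g. Rune("name") stands for #name


@dataclass(frozen=True)
class Nil:
    pass               # the empty list, ()


@dataclass(frozen=True)
class Pair:
    car: "Datum"
    cdr: "Datum"       # (X & Y) is Pair(X, Y)


# Hypothetical model: Zisp strings are represented as plain Python str.
Datum = Union[str, Rune, Pair, Nil]
NIL = Nil()


def make_list(items: Iterable[Datum]) -> Datum:
    """Build (a b c) as the nested pairs (a & (b & (c & ())))."""
    result: Datum = NIL
    for item in reversed(list(items)):
        result = Pair(item, result)
    return result
```

With these definitions, `make_list(["datum1", "datum2"])` produces the same
value as `Pair("datum1", Pair("datum2", NIL))`.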

The following table summarizes the other transformations available:

    "xyz"   -> (#QUOTE & |xyz|)       #datum       -> (#HASH & datum)

    ~_xyz_  -> (#TILDE & |xyz|)       #rune(...)   -> (#rune ...)

    [...]   -> (#SQUARE ...)          dat1dat2     -> (#JOIN dat1 & dat2)
                                 
    {...}   -> (#BRACE ...)           dat1.dat2    -> (#DOT dat1 & dat2)
                                 
    'datum  -> (#QUOTE & datum)       dat1:dat2    -> (#COLON dat1 & dat2)
                                 
    `datum  -> (#GRAVE & datum)       #%hex=datum  -> (#LABEL hex & datum)
                                 
    ,datum  -> (#COMMA & datum)       #%hex%       -> (#LABEL & hex)

Notes about the table and examples:

* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
  means zero or more data; hex is a hexadecimal number of up to 12 digits.

* Strings can be quoted with pipes, like symbols in Scheme.  This is the "real"
  string literal syntax, whereas using double quotes is syntax sugar for a
  quoted string literal.

      |foo bar baz|  -> |foo bar baz|

      "foo bar baz"  -> (#QUOTE & |foo bar baz|)

* See the next section for an explanation of the tilde syntax, which implements
  "raw" string literals.

* The `#datum` form only applies when the datum following the hash sign is
  anything other than a bare string (unquoted, without pipe symbol) since
  otherwise this would be ambiguous with a rune literal.  A bare string can
  nevertheless follow the hash sign by separating the two with a backslash:

      #\string  ->  (#HASH & string)

* Though not represented in the table due to notational difficulty, the form
  `#rune(...)` doesn't require a list in the second position; any datum that
  works with the `#datum` syntax also works with `#rune<DATUM>`.

      #rune1#rune2  -> (#rune1 & #rune2)

      #rune"text"   -> (#rune & "text")

      #rune\string  -> (#rune & string)

      #rune'string  -> (#rune #QUOTE & string)

  As a counter-example, following a rune immediately with a bare string isn't
  possible without the delimiting backslash, since that would be ambiguous:

      #abcdefgh  ;Could be (#abcdef & gh) or (#abcde & fgh) or ...

* Syntax sugar can combine arbitrarily.  Some examples follow.  Any of these may
  or may not actually have a meaning in code; many could simply end up producing
  an error during decoding, or later interpretation of code.

      #{...}            -> (#HASH #BRACE ...)

      #'foo             -> (#HASH #QUOTE & foo)

      ##'[...]          -> (#HASH #HASH #QUOTE #SQUARE ...)

      {x y}[i j]        -> (#JOIN (#BRACE x y) #SQUARE i j)

      foo.bar.baz{x y}  -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)

* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses
  as `(#QUOTE & foo)` instead; the operand of `#QUOTE` is the entire cdr.

  The same principle is used when parsing other sugar; some examples follow:

      Incorrect                              Correct

      #(x y z) -> (#HASH (x y z))            #(x y z) -> (#HASH x y z)

      [x y z]  -> (#SQUARE (x y z))          [x y z]  -> (#SQUARE x y z)

      #{x}     -> (#HASH (#BRACE (x)))       #{x}     -> (#HASH #BRACE x)

      foo(x y) -> (#JOIN foo (x y))          foo(x y) -> (#JOIN foo x y)

* Runes are case-sensitive, and the parser always emits runes using uppercase
  letters when expressing syntax sugar.  Uppercase rune names are reserved for
  Zisp's internal use and standard library; users can use lowercase runes with
  custom meaning without worrying about clashes, with the exception of a small
  number of lowercase runes such as `#true` and `#false` that are part of the
  default decoder settings.
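
The "operand is the entire cdr" principle from the notes above can be checked
with a small Python sketch.  Everything here is an assumption made for
illustration: the class names, the `make_list` helper, and the `show` printer,
which only handles bare strings and prints pairs using the `&` notation of this
document.

```python
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass(frozen=True)
class Rune:
    name: str


@dataclass(frozen=True)
class Nil:
    pass


@dataclass(frozen=True)
class Pair:
    car: "Datum"
    cdr: "Datum"


Datum = Union[str, Rune, Pair, Nil]
NIL = Nil()


def make_list(items: Iterable[Datum]) -> Datum:
    """Build a proper list out of nested pairs."""
    result: Datum = NIL
    for item in reversed(list(items)):
        result = Pair(item, result)
    return result


def show(d: Datum) -> str:
    """Print a datum; a chain of pairs ending in () prints as a list."""
    if isinstance(d, Nil):
        return "()"
    if isinstance(d, Rune):
        return "#" + d.name
    if isinstance(d, str):
        return d  # simplification: bare strings only, no |pipe| quoting
    parts = []
    while isinstance(d, Pair):
        parts.append(show(d.car))
        d = d.cdr
    if not isinstance(d, Nil):
        parts += ["&", show(d)]  # improper tail
    return "(" + " ".join(parts) + ")"


# 'foo: the operand foo is the cdr itself, not a one-element list.
quoted = Pair(Rune("QUOTE"), "foo")

# [x y z]: the list (x y z) is the cdr, so it merges into the outer list.
square = Pair(Rune("SQUARE"), make_list(["x", "y", "z"]))

# The "incorrect" reading would wrap the operand in another list.
wrapped = make_list([Rune("SQUARE"), make_list(["x", "y", "z"])])
```

Here `show(quoted)` gives `(#QUOTE & foo)` and `show(square)` gives
`(#SQUARE x y z)`, whereas `show(wrapped)` gives `(#SQUARE (x y z))`, which is
the shape listed under "Incorrect".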


## Tilde strings

There is a special type of syntax sugar for "raw" strings, in which neither
backslash escapes nor any other kind of escape sequence is recognized.

This raw string syntax begins with a tilde, followed by any byte.  That byte
becomes the termination marker, and the string cannot represent a literal
occurrence of it, since there are no escape sequences.

    ~%foo \ bar%  ->  (#TILDE & |foo \\ bar|)

This can be useful, for instance, when representing regular expressions as
quoted string literals in code:

    ~/^foo\\(bar|baz)\.\[".*"\]$/     ;; matches e.g. foo\bar.["blah"]

Were it not for this syntax, this regular expression would need to be
represented by the following quoted string literal in Zisp code:

    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"

Alternatively, imagine searching for certain MS Windows file paths:

    ~_C:\\\\User\\foo_                ;; matches C:\\User\foo

That's already ugly.  Without raw strings, it would need to look like this:

    "C:\\\\\\\\User\\\\foo"

Typically, the rune `#TILDE` would be treated as a synonym to `#QUOTE` by the
decoder, though creative programmers could repurpose it.
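
The reading rule for tilde strings is simple enough to sketch in a few lines of
Python.  This is only an illustration of the rule as described in this section,
not the actual parser; the function name `read_tilde_string` and its error
handling are assumptions.

```python
from typing import Tuple


def read_tilde_string(src: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Read a tilde string starting at src[pos].

    Returns the raw contents and the position just past the closing marker.
    """
    if src[pos:pos + 1] != b"~":
        raise ValueError("expected '~'")
    marker = src[pos + 1:pos + 2]  # the byte after the tilde ends the string
    if not marker:
        raise ValueError("missing termination marker")
    end = src.find(marker, pos + 2)
    if end < 0:
        raise ValueError("unterminated tilde string")
    # No escape processing at all: the bytes in between are taken verbatim.
    return src[pos + 2:end], end + 1
```

For instance, reading the bytes of `~%foo \ bar%` yields the nine bytes
`foo \ bar`, with the backslash kept as-is.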


## Newlines in strings

Normally, a newline in a string has no special meaning and simply becomes part
of the string.  However, a newline can be backslash-escaped, which simply erases
it.  Any tab and space characters between the backslash and the newline, as
well as any that follow the newline, are stripped too.  (Note: Blanks preceding
the backslash are not stripped; only blanks at the end of the line, after the
backslash, and blanks at the start of the next line.)
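
As a rough illustration of the escaped-newline rule, the following Python
sketch removes a backslash, the blanks after it, the newline, and the blanks
that follow.  It deliberately ignores every other escape sequence (so a doubled
backslash right before a newline would be handled wrongly), and the name
`join_escaped_newlines` is made up.

```python
import re

# Backslash, optional blanks, newline, optional blanks on the next line.
_ESCAPED_NEWLINE = re.compile(r"\\[ \t]*\n[ \t]*")


def join_escaped_newlines(raw: str) -> str:
    """Erase escaped newlines together with the blanks around them."""
    return _ESCAPED_NEWLINE.sub("", raw)
```

Note that a space written before the backslash survives, which is what keeps
the words on either side of the line break apart.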

Following are some examples of how multi-line strings can appear in source code
with different intentions and meanings:

    (define paragraph "This paragraph has been visually split into multiple \
                       lines, but the newlines are escaped, so it's one line.")

    (define json-object '|   ;; use '|| so we needn't escape "key" etc.
      {
        "key": "value"
      }
    |)

The second example is actually slightly problematic.  It begins with a newline,
which may be undesirable, but escaping that newline would cause the first line
to have no indentation, thus the opening `{` would not line up with the closing
`}` when this string is printed out.  Further, if the entire block of code is
indented, then the string contents may be more indented than intended.  (No pun
or rhyme intended.)  Consider:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object '|
                 {
                   "key": "value"
                 }
               |))
          (do-whatever))))

The string bound to `json-object` has way more indentation than the programmer
intended.  Should the parser attempt to solve this issue?

Thankfully, we have the decoder.  The implementation of `#QUOTE` can simply
apply a post-processing algorithm such as the one used by the text blocks
feature of Java 15: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)
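
A simplified version of such a post-processing step might look like the
following Python sketch.  It is loosely modelled on the indentation stripping
described in JEP 378 and leaves out parts of it (escape handling, for one); it
is not something the Zisp decoder is known to do, and the function name
`strip_common_indent` is an assumption.

```python
def strip_common_indent(text: str) -> str:
    """Remove the indentation shared by all lines of a multi-line string."""
    lines = text.split("\n")
    # The content starts on the line after the opening delimiter.
    if lines[0].strip() == "":
        lines = lines[1:]
    if not lines:
        return ""

    def indent(line: str) -> int:
        return len(line) - len(line.lstrip(" \t"))

    # Blank lines are ignored, except for the last line: it holds the closing
    # delimiter, whose position also limits how much gets stripped.
    significant = [ln for ln in lines[:-1] if ln.strip()] + [lines[-1]]
    margin = min(indent(ln) for ln in significant)
    # Cut the shared margin and drop trailing blanks on every line.
    return "\n".join(ln[margin:].rstrip(" \t") for ln in lines)
```

With this approach the programmer controls the remaining indentation through
the position of the closing delimiter: lining it up with the content strips
everything, while moving it further left keeps some indentation.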

The only feature Zisp cannot offer here is a way to fence off multi-line strings
with a longer token such as `"""` as seen in Python or Java, or an arbitrary
word as seen in Bourne shell and PHP "here doc" syntax.  For simplicity, the
Zisp parser omits such features.

That said, if a programmer truly wanted to have arbitrary text blocks in code,
without needing to escape anything in them, it's possible to abuse the tilde
string syntax by using it with an ASCII control character which is displayed
visibly by a text editor.  In the following, the characters `^\` are meant to
represent a literal ASCII File Separator character in the source code:

    (define json-object ~^\
      {
        "key": "value"
      }
      ^\)

Hey, it works fine in Emacs, so why not??  (`C-q C-\` to insert the `^\`.)

<!--
;; Local Variables:
;; fill-column: 80
;; End:
-->