# Parser for Code & Data

<!--TOC-->

Zisp s-expressions represent an extremely minimal set of data types, limited to
what is necessary to construct more complex values:

    +---------+--------+----------+------+
    | String  | Rune   | List     | Nil  |
    +---------+--------+----------+------+
    | foobar  | #name  | (X ...)  | ()   |
    +---------+--------+----------+------+

The parser recognizes various *syntax sugar* which abbreviates verbose syntax,
and may result in special data structures (typically, a list with a rune in its
first position) which another Zisp component called the *decoder* can transform
into a rich set of value types.

More details about syntax sugar, and the decoder, are explained later.


## Character Encoding

The parser does not consume Unicode characters; it consumes bytes.  The grammar
itself is built from bytes that correspond to ASCII characters.

Some elements of the grammar, such as comments and quoted strings, may contain
arbitrary byte sequences, until terminated.  These sequences may happen to be
valid UTF-8 text.  This way, quoted strings and comments may contain Unicode
text encoded in UTF-8, but the parser does not check these for validity.

Since comments and quoted strings may contain arbitrary byte sequences, a text
editor or other program displaying Zisp s-expressions may need to use a special
visual representation for bytes that don't represent valid text.

The parser working on bytes rather than Unicode characters is not a limitation,
but rather a feature: It allows Zisp s-expressions to be used as a structured
data exchange format, which may contain binary data elements, without the need
to encode these in Base64 or other such text representations of binary data.
Consider the example:

    ((image.webp "<BINARY>")
     (video.webm "<BINARY>"))

For this to work, any incidental occurrences of the double-quote sign and the
backslash sign within the `<BINARY>` data must be escaped with a backslash; all
other bytes can appear verbatim in the strings.
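
As a rough illustration, the escaping rule above could be sketched in Python as
follows.  The function name `quote_bytes` is hypothetical and not part of Zisp;
the sketch only assumes what is stated above, namely that the double-quote and
backslash bytes are the only ones that need a backslash in front of them.

```python
def quote_bytes(payload: bytes) -> bytes:
    """Wrap arbitrary bytes in double quotes, escaping only '"' and '\\'."""
    out = bytearray(b'"')
    for byte in payload:
        if byte in (0x22, 0x5C):  # double-quote or backslash
            out.append(0x5C)      # put a backslash in front of it
        out.append(byte)          # every other byte is copied verbatim
    out.append(0x22)
    return bytes(out)


quoted = quote_bytes(b'a"b\\c\x00\xff')
```

Every other byte, including NUL and bytes above 127, passes through unchanged.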


## Stream Parsing

The parser can be repeatedly invoked on a byte stream to consume the next datum
within.  This does not require "unreading" or back-seeking within the stream;
the parser always reads a full datum, and stops after some byte which cleanly
terminates the currently parsed datum.

This means Zisp s-expressions can be safely intermixed with other data within
the same byte stream.  So long as the other data is consumed by some parser
which similarly stops reading at a clear boundary, the Zisp parser can then
continue operating on the same stream.  Consider the example:

    ("image.webp" 8273)

    << 8273 bytes >>

    ("video.webm" 736)

    << 736 bytes >>

The "header" for each file in this stream is a Zisp s-expression containing
information about how many bytes should be read after the header, before the
next file header appears.  (The header data need to be terminated with a blank
ASCII character such as a newline; the closing parenthesis does not act as a
terminator unto itself due to the "join" syntax sugar.)

To enable this stream parsing strategy, the parser does not use any automatic
buffering.  If it did, it might inadvertently consume some bytes beyond the
currently parsed datum, leaving the stream inconsistent.

If the parser is meant to be used on an input stream associated with expensive
system calls, such as a file handle or network socket, it's best to wrap that
stream in some intermediate object which asks the system for large chunks of
data at once, and stores the data in a buffer.
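
The header-then-payload layout shown earlier could be consumed along these
lines.  This is a minimal Python sketch and not the Zisp parser: the names
`read_header` and `read_files` are hypothetical, headers are assumed to have
exactly the shape `("name" size)`, and the payload is assumed to begin right
after the single blank byte that terminates the header.

```python
import io
import re

# Assumed header shape for this sketch: ("name" size)
HEADER = re.compile(rb'\("([^"\\]*)"[ \t]+([0-9]+)\)')
BLANKS = b" \t\r\n"


def read_header(stream):
    """Read one header, a byte at a time; return (name, size) or None at EOF."""
    buf = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            if buf.strip():
                raise ValueError("stream ended inside a header")
            return None
        # Simplification: a real parser would track quoted-string state here.
        if byte in BLANKS and buf.endswith(b")"):
            break  # the blank after ")" terminates the datum
        buf += byte
    match = HEADER.fullmatch(bytes(buf).strip())
    if match is None:
        raise ValueError(f"unexpected header: {bytes(buf)!r}")
    return match.group(1), int(match.group(2))


def read_files(stream):
    """Collect (name, payload) pairs until the stream is exhausted."""
    files = []
    while (header := read_header(stream)) is not None:
        name, size = header
        payload = stream.read(size)
        if len(payload) != size:
            raise ValueError("stream ended inside a payload")
        files.append((name, payload))
    return files


demo = io.BytesIO(b'("a.bin" 3)\n\x00)\n\n("b.bin" 2)\nhi')
files = read_files(demo)
```

Because the header reader never looks past the terminating blank, the payload
bytes (which here even contain a `)` and a newline) are left untouched for the
caller.  Any buffering would live inside the `stream` object itself, so that
both readers draw from the same buffer.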


## Comments

Two types of comment are supported: datum comments and line comments.

* A semicolon followed by a tilde instructs the parser to consume one datum and
  discard it.  Whitespace may appear between the tilde and the datum to discard.

* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
  discard bytes until a newline (ASCII Line Feed) is encountered.


## Value vs. Datum

A Zisp *value* that has an *external representation* in the form of a sequence
of bytes is called a *datum*.  Every datum is a value, but not every value is a
datum.  In other words, a datum is a value that can be printed out as a byte
sequence which the parser can turn back into an equivalent datum.

A value that is not a datum may nevertheless be *encoded* into one, allowing it
to have an external representation.  After parsing, it needs to be *decoded* to
actually become the expected value.

One may speak of an *external representation of a value* where the value is not
itself a datum, but has an encoding as one.  The more strictly correct term for
this is: "The external representation of the datum encoding the value."

### Syntax sugar

The parser recognizes various *syntax sugar* to abbreviate an equivalent datum
construction, or express a datum that encodes a more complex value.

As an example, the expression `#(x y z)` is an abbreviation for the equivalent
`(#HASH x y z)`.  These are two external representations for the same datum;
after parsing, both will yield values that are indistinguishable in all but
their memory address.

An example of syntax sugar that is not a mere abbreviation is a quoted string
which contains bytes that could not appear in a *bare* string:

    "foo bar"  ->  (#DQSTR <STRING>)

In this example, the visual token `<STRING>` represents the actual string value
in program memory, which has no direct external representation in bytes because
it contains a space character.

Those familiar with Lisp and Scheme may expect bare strings to be parsed into a
separate type called *symbol* while quoted strings are parsed directly into a
string type, but this is not the case in Zisp.

### Decoder

The *decoder* transforms Zisp data into values of more complex types, including
values that are not of a datum type.

Combined with syntax sugar, this allows Zisp to offer familiar syntax elements.
For example, the expression `#(x y z)` which parses into `(#HASH x y z)` can be
decoded into an array, so the result is similar to the vector syntax of Scheme.

Decoding also resolves datum labels, goes over bare strings to find ones that
represent a number literal, and takes care of a number of other transforms.
This offloads complexity, allowing the parser to remain extremely simple.

See the dedicated documentation of the [decoder](2-decode.html) for more.


## Data types

Following is a more in-depth explanation of each data type constructed by the
Zisp s-expression parser.

These are in fact value types, though the term "data type" is often used due to
familiarity.  A Zisp value that is a member of one of the following value types
is only a *datum* if it adheres to additional constraints as explained below.

### String

Strings can appear *bare* or be quoted in various ways.  A quoted string is in
fact parsed into a list value with a rune in the first position to identify the
quotation variant that was parsed, and the string value in the second position;
or, in case of at-quoted strings, a special construct we will look at later.

    +-----------+-------------------------------+
    | Syntax    | Parse output                  |
    +-----------+-------------------------------+
    | |bytes|   | (#PQSTR <STRING>)             |
    +-----------+-------------------------------+
    | "bytes"   | (#DQSTR <STRING>)             |
    +-----------+-------------------------------+
    | @_bytes_  | (#ATSTR <SENTINEL> <STRING>)  |
    +-----------+-------------------------------+

The visual token `<STRING>` denotes the actual string, as a Zisp value, in the
second position of the list.  The visual token `<SENTINEL>` stands for a Zisp
integer value between 0 and 254.

These external representations of strings will be explained in more detail
further below, including backslash escape sequences allowed within, and how
exactly at-quoted strings work.

Strings have a fixed length, counted in bytes.  Each byte can have any value,
including zero (ASCII NUL).  The parser reads bytes, not Unicode characters; a
string may contain UTF-8 byte sequences, but these are not tested for validity.

A string that is up to 255 bytes long is automatically *interned*, meaning any
occurrence of the same string -- equal in length and containing the same byte
values -- ends up being represented by the same bit-pattern; either a memory
address, or an immediate representation within a CPU word for short strings.
The quotation method is inconsequential to this process; for example, while
`|foo bar|` and `"foo bar"` will parse into different list values, the actual
string they hold a reference to will be the same one in program memory.  This
behavior is however configurable and can be disabled entirely for cases where
large numbers of arbitrary binary strings are being parsed.

Strings of length greater than 255 bytes are stored separately in memory, even
if they are equal in length and content.
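
The interning behavior can be pictured with a small Python model.  This is only
a sketch under assumptions: the class names `ZString` and `Interner` are made
up for illustration, and a dictionary stands in for whatever table the real
implementation uses.

```python
class ZString:
    """Stand-in for a Zisp string value; identity models the bit-pattern."""
    __slots__ = ("data",)

    def __init__(self, data: bytes):
        self.data = data


class Interner:
    def __init__(self, limit: int = 255, enabled: bool = True):
        self.limit = limit      # longest string that is still interned
        self.enabled = enabled  # interning can be switched off entirely
        self._table = {}

    def make(self, data: bytes) -> ZString:
        if not self.enabled or len(data) > self.limit:
            return ZString(data)  # always a fresh, separate object
        existing = self._table.get(data)
        if existing is None:
            existing = self._table[data] = ZString(data)
        return existing


table = Interner()
same = table.make(b"foo bar") is table.make(b"foo bar")
separate = table.make(b"x" * 256) is not table.make(b"x" * 256)
```

Here `same` and `separate` both end up true: equal short strings share one
object, while equal long strings do not.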

### Rune

A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must
begin with a letter, and may only contain letters and digits.  This character
sequence of letters and digits is called the *name* of the rune.  A rune that
follows this constraint is valid as a datum.

Zisp code may explicitly construct values of the rune type that violate the
above constraints.  Such runes are not valid data and cannot be printed or
parsed.
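
The constraint on rune names is simple enough to express as a regular
expression.  The following Python sketch assumes nothing beyond the rule stated
above; the helper name `is_datum_rune` is hypothetical.

```python
import re

# 1 to 6 ASCII bytes: a letter first, then letters or digits.
RUNE_NAME = re.compile(rb"[A-Za-z][A-Za-z0-9]{0,5}")


def is_datum_rune(name: bytes) -> bool:
    """Return True if `name` is a rune name that is valid as a datum."""
    return RUNE_NAME.fullmatch(name) is not None


checks = [is_datum_rune(n) for n in (b"HASH", b"true", b"abcdefg", b"1abc")]
```

The first two names pass; `abcdefg` is one byte too long, and `1abc` does not
begin with a letter.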

Runes are case-sensitive, and the parser always emits runes using upper-case
letters when expressing syntax sugar.  Uppercase rune names are reserved for
Zisp's internal use and standard library; users can use lowercase runes with
custom meaning without worrying about clashes, with the exception of a small
number of lowercase runes such as `#true` and `#false` that are part of the
default decoder settings and documented explicitly as such.

Runes are always stored directly in a CPU word; never by memory address.

### List

A list is a contiguous array of one or more values in memory, whose length may
be encoded directly within the pointer to the head of the array, or else the
array is terminated with a special sentinel bit-pattern that is not otherwise
valid as a Zisp value.

The parser allocates a unique array in program memory for every list, and the
list as a value is then represented by the memory address of that array, with
either an exact length tag or a tag indicating that it's sentinel-terminated.

Lists are valid data if one of the following holds true:

* The list encodes a quoted string, datum label, or shebang line.

* All values in the list are a valid datum.

Further, a structure of nested list values may not contain cyclic references
back up in the structure (which would make the above definition diverge into
infinity).  Such cycles must be broken up with datum labels, or else the list
cannot be considered a datum, since it cannot be printed or parsed.

### Nil

The Zisp nil value is a singleton and a datum.  There is exactly one nil value,
used in lieu of a list of zero length; it has the external representation `()`.


## Quoted strings

Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted.
This section goes into the details of each variant.

### Pipe-quoted

Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers
the parser to generate a list with the structure:

    (#PQSTR <STRING>)   ;; <STRING> is visual aid, not syntax

The decoder, using default settings, would emit this string verbatim as a value.
Then, during code evaluation, this would be seen as an identifier.  In this way,
pipe-quoted strings are equivalent to bare strings in functionality.

It is important to understand that the decoder sits between the parser and the
[evaluator](3-eval.html), and in opposition to Lisp and Scheme tradition, it is
common for the evaluator to receive values that are not valid as a datum; here,
a string unto itself that may not be a valid datum.  Yet, it is valid as an
identifier for the purposes of the evaluator.

### Double-quoted

Strings wrapped in the double-quote symbol parse into:

    (#DQSTR <STRING>)   ;; <STRING> is visual aid, not syntax

Under default settings, the decoder would transform this into a value which,
when evaluated as code, simply yields the contained string as a value.

### At-quoted

This is a special type of syntax for "raw" strings, meaning that no backslash
escapes nor any other kind of escape sequence are recognized within them.

The syntax begins with an at sign, followed by any byte.  That byte becomes a
termination marker, and the string cannot contain an occurrence of it, since
there are no escape sequences.  The byte value 255 has a special meaning; see
further below.

    @"foo \ bar"  ->  (#ATSTR <SENTINEL> <STRING>)

The visual tokens `<SENTINEL>` and `<STRING>` represent an integer and string
value, respectively.  Here, the integer would be 34, which is the ASCII value
for a double-quote sign.  The string contains a literal backslash, since there
is no backslash escape parsing.

This style of quoting can be useful, for instance, when representing regular
expressions as strings in code:

    ;; Matches e.g. foo\bar.["blah"]

    @/^foo\\(bar|baz)\.\[".*"\]$/

Were it not for this syntax, this regular expression would only be possible to
represent through a quoted string such as the following:

    ;; Same as above, but so many backslashes

    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"

The byte that follows the at sign need not be a printable character or even a
valid ASCII byte; it can be absolutely any byte value, even NUL.  This can be
useful to easily encode binary data which is known to not contain a specific
byte; an example would be C strings which cannot contain NUL.

If however the byte value is 255, then it does not stand for a sentinel, but
rather indicates that 6 more bytes follow, interpreted as a big-endian 48-bit
integer, which is the count of bytes making up the contents of the string.

Example sequence of bytes, represented as a mixture of ASCII and raw integers:

    '@' 255 0 0 0 0 2 100 <612 bytes>  ->  (#ATSTR <STRING>)

One may ask why the length is not included in the list.  This is unnecessary,
since strings in Zisp already carry length information in their own metadata
structure.
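
Both forms of the at-quoted syntax could be read roughly as follows.  This is a
Python sketch under assumptions, not the actual parser: the function name
`read_at_string` is hypothetical, the leading `@` is assumed to have been
consumed already, and tuples stand in for the resulting Zisp lists.

```python
import io


def read_at_string(stream):
    """Read an at-quoted string whose leading '@' was already consumed."""
    marker = stream.read(1)
    if not marker:
        raise ValueError("stream ended after '@'")
    if marker[0] == 255:
        # Length-prefixed form: 6 bytes, big-endian, then that many bytes.
        raw = stream.read(6)
        if len(raw) != 6:
            raise ValueError("stream ended inside the length field")
        length = int.from_bytes(raw, "big")
        data = stream.read(length)
        if len(data) != length:
            raise ValueError("stream ended inside the string")
        return ("ATSTR", data)
    # Sentinel form: read until the marker byte appears again.
    out = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("unterminated at-quoted string")
        if byte == marker:
            return ("ATSTR", marker[0], bytes(out))
        out += byte


by_sentinel = read_at_string(io.BytesIO(b'"foo \\ bar"'))
by_length = read_at_string(io.BytesIO(bytes([255, 0, 0, 0, 0, 2, 100]) + b"x" * 612))
```

The length bytes `0 0 0 0 2 100` decode to 2 * 256 + 100 = 612, matching the
byte sequence shown above.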

### Backslash escapes

In pipe-quoted and double-quoted strings, the following ASCII characters may
follow a backslash to insert a certain character.

    +-------+----------------------------+
    | Char  | Meaning                    |
    +-------+----------------------------+
    | \     | Literal backslash          |
    +-------+----------------------------+
    | |     | Literal pipe symbol        |
    +-------+----------------------------+
    | "     | Literal double-quote       |
    +-------+----------------------------+
    | 0     | ASCII NUL                  |
    +-------+----------------------------+
    | a     | ASCII Alert                |
    +-------+----------------------------+
    | b     | ASCII Backspace            |
    +-------+----------------------------+
    | t     | ASCII Tab (Horizontal)     |
    +-------+----------------------------+
    | n     | ASCII Newline (Line Feed)  |
    +-------+----------------------------+
    | v     | ASCII Vertical Tab         |
    +-------+----------------------------+
    | f     | ASCII Form Feed            |
    +-------+----------------------------+
    | r     | ASCII Carriage Return      |
    +-------+----------------------------+
    | e     | ASCII Escape               |
    +-------+----------------------------+

In words:

* A backslash, followed by a backslash, pipe, or double-quote character, is
  substituted with a literal occurrence of that character.

* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the
  C programming language, representing common ASCII control characters.

Further, the following Regular Expression patterns following a backslash have
special meaning.

    +---------------------+-----------------------+
    | Regular Expression  | Meaning               |
    +---------------------+-----------------------+
    | [\t ]*\n[\t ]*      | Discarded             |
    +---------------------+-----------------------+
    | x([0-9a-fA-F]{2})*; | Arbitrary bytes       |
    +---------------------+-----------------------+
    | u[0-9a-fA-F]+;      | Unicode Scalar Value  |
    +---------------------+-----------------------+

Explanations:

* A backslash followed by any number of blanks (space or tab), a newline, and
  again any number of blanks, is substituted with nothing.  This is to allow
  splitting a string into multiple lines for human readability.

      (define p "This paragraph has been visually split into multiple \
                 lines, but the newline is escaped, so it's one line.")

* An x, followed by pairs of hexadecimal digits (case insensitive), terminated
  by a semicolon, is substituted with the sequence of bytes represented by the
  corresponding pairs of hexadecimal digits.  E.g.: `"foo\xDEADBEEF;bar"`

* A u, followed by a hexadecimal digit sequence (case insensitive), terminated
  by a semicolon, is substituted with the canonical UTF-8 byte sequence for the
  Unicode Scalar Value represented by that hexadecimal number.  The number must
  be in the range `0` to `10FFFF`.  E.g.: `"foo\u00A0;bar"`
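
The escape rules above can be summarized in a short Python sketch.  It works on
the already-delimited body of a quoted string, and makes two assumptions that
the text does not spell out: unknown escapes are treated as errors, and `\u`
rejects surrogate code points (which are not Unicode Scalar Values).  The name
`decode_escapes` is hypothetical.

```python
import re

# Single-character escapes from the table above, mapped to their byte values.
SIMPLE = {
    ord("\\"): 0x5C, ord("|"): 0x7C, ord('"'): 0x22,
    ord("0"): 0x00, ord("a"): 0x07, ord("b"): 0x08, ord("t"): 0x09,
    ord("n"): 0x0A, ord("v"): 0x0B, ord("f"): 0x0C, ord("r"): 0x0D,
    ord("e"): 0x1B,
}
CONTINUATION = re.compile(rb"[\t ]*\n[\t ]*")
HEX_PAIRS = re.compile(rb"(?:[0-9a-fA-F]{2})*")
HEX_DIGITS = re.compile(rb"[0-9a-fA-F]+")


def decode_escapes(body: bytes) -> bytes:
    """Expand backslash escapes in the body of a pipe- or double-quoted string."""
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        i += 1
        if byte != 0x5C:  # not a backslash: copy verbatim
            out.append(byte)
            continue
        if i < len(body) and body[i] in SIMPLE:
            out.append(SIMPLE[body[i]])
            i += 1
        elif i < len(body) and body[i] in b"xu":
            end = body.find(b";", i)
            if end < 0:
                raise ValueError("escape is missing its terminating ';'")
            digits = body[i + 1:end]
            if body[i] == ord("x"):
                if not HEX_PAIRS.fullmatch(digits):
                    raise ValueError("\\x needs pairs of hex digits")
                out += bytes.fromhex(digits.decode("ascii"))
            else:
                if not HEX_DIGITS.fullmatch(digits):
                    raise ValueError("\\u needs hex digits")
                value = int(digits, 16)
                if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
                    raise ValueError("not a Unicode Scalar Value")
                out += chr(value).encode("utf-8")
            i = end + 1
        else:
            # Only an escaped newline (with optional blanks) remains valid.
            match = CONTINUATION.match(body, i)
            if match is None:
                raise ValueError("unknown escape sequence")
            i = match.end()
    return bytes(out)


raw_bytes = decode_escapes(rb"foo\xDEADBEEF;bar")
nbsp = decode_escapes(rb"foo\u00A0;bar")
```

In a real parser the escapes would be expanded while scanning for the closing
quote; separating the two steps here just keeps the sketch short.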

### Newlines in strings

Normally, a newline in a string has no special meaning and simply becomes part
of the string.  However, newlines can be backslash-escaped, which simply erases
them; the escaped newline can also be preceded or followed by any number of tab
and space characters, which are all stripped as well.  (Note: It's not blanks
preceding the backslash that are stripped, but blanks following the backslash
and preceding the newline; i.e., blanks at the end of the line.)

Following are some examples of how multi-line strings can appear in source code
with different intentions and meanings:

    (define paragraph "This paragraph has been visually split into multiple \
                       lines, but the newlines are escaped, so it's one line.")

    ;; use '|| so double-quotes need no escaping
    (define json-object '|
      {
        "key": "value"
      }
    |)

The second example is actually slightly problematic.  It begins with a newline,
which may be undesirable, but escaping that newline would cause the first line
to have no indentation, thus the opening `{` would not line up with the closing
`}` when this string is printed out.  Further, if the entire block of code is
indented, then the string contents may be more indented than intended.  (No pun
or rhyme intended.)  Consider:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object '|
                 {
                   "key": "value"
                 }
               |))
          (do-whatever))))

The string bound to `json-object` has redundant indentation.  Should the parser
attempt to solve this issue?

Thankfully, we have the decoder to handle such complexities.  Under the default
settings, the rune `#HASH` is bound to a decoder rule which detects a payload
value that is a string literal, and implements the same algorithm as seen in
Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)

Thus, we can do the following:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object #|
    ...........  {
    ...........    "key": "value"
    ...........  }
    ...........|))
          (do-whatever))))

(Dots represent whitespace that is deleted.  The initial newline is, as well.)
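
To make the effect concrete, here is a simplified Python sketch of the
indentation stripping.  It is not the decoder's implementation and does not
cover every detail of JEP 378 (for instance, it ignores escape processing and
treats tabs and spaces alike); the name `strip_indent` is hypothetical.

```python
def strip_indent(text: bytes) -> bytes:
    """Remove the leading newline and the common indentation of a text block."""
    lines = text.split(b"\n")
    if lines[0].strip(b" \t") == b"":
        lines = lines[1:]  # drop the blank first line after the opening quote
    if not lines:
        return b""

    def indent(line):
        return len(line) - len(line.lstrip(b" \t"))

    # Blank lines are ignored, except the last one: its indentation is where
    # the closing quote sits, so it takes part in the minimum as well.
    significant = [ln for ln in lines[:-1] if ln.strip(b" \t")] + [lines[-1]]
    margin = min(indent(ln) for ln in significant)
    stripped = [ln[margin:].rstrip(b" \t") if ln.strip(b" \t") else b""
                for ln in lines]
    return b"\n".join(stripped)


pad = b" " * 11
block = (b"\n" + pad + b"  {\n" + pad + b'    "key": "value"\n'
         + pad + b"  }\n" + pad)
result = strip_indent(block)
```

With the closing quote indented by 11 columns, as in the example above, the
braces keep their two leading spaces and the result ends with a newline.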

The only feature Zisp does not offer is a way to fence off multi-line strings
with a longer token such as `"""` as seen in Python and Java, or an arbitrary
word as seen in Bourne shell and PHP "here doc" syntax.

However, if a programmer truly wanted to have arbitrary text blocks in code,
without needing to escape anything in them, it's possible to abuse at-quoted
string syntax, using it with an ASCII control character which is displayed
visibly by a text editor.  In the following, the characters `^\` are meant to
represent a literal ASCII File Separator character in the source code:

    (define json-object #@^\
      {
        "key": "value"
      }
      ^\)

It works fine in Emacs, so why not?  Use `C-q C-\` to insert the `^\`.

This is indeed quite an eldritch syntax, but hopefully most programs would not
need to use it.


## Other syntax

The following table summarizes commonly useful syntax abbreviations:

    [...]   -> (#SQUARE ...)        #datum       -> (#HASH datum)

    {...}   -> (#BRACE ...)         #rune(...)   -> (#rune ...)

    'datum  -> (#QUOTE datum)       dat1dat2     -> (#JOIN dat1 dat2)

    `datum  -> (#GRAVE datum)       dat1.dat2    -> (#DOT dat1 dat2)

    ,datum  -> (#COMMA datum)       dat1:dat2    -> (#COLON dat1 dat2)

Notes:

* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
  means zero or more data.

* The `#datum` form only applies when the datum following the hash sign is
  anything other than a bare string, since otherwise this would be ambiguous
  with a rune literal.  A bare string can nevertheless follow the hash sign by
  separating the two with a backslash:

      #\string  ->  (#HASH string)

* Though not represented in the table due to notational difficulty, the form
  `#rune(...)` doesn't require a list in the second position; any datum that
  works with the `#datum` syntax also works with `#rune<DATUM>`.

      #rune1#rune2  -> (#rune1 #rune2)

      #rune\string  -> (#rune string)

      #rune'string  -> (#rune (#QUOTE string))

      #rune"string" -> (#rune (#DQSTR |string|))

  As a counter-example, following a rune immediately with a bare string isn't
  possible without the delimiting backslash, since that would be ambiguous:

      #abcdefgh  ;Could be (#abcdef gh) or (#abcde fgh) or ...

* Syntax sugar can combine arbitrarily.  Some examples follow.  Any of these may
  or may not actually have a meaning in code; some might simply end up producing
  an error during decoding, or later evaluation of code.

      #{...}            -> (#HASH (#BRACE ...))

      #'foo             -> (#HASH (#QUOTE foo))

      ##'[...]          -> (#HASH (#HASH (#QUOTE (#SQUARE ...))))

      {x y}[i j]        -> (#JOIN (#BRACE x y) (#SQUARE i j))

      foo.bar.baz{x y}  -> (#JOIN (#DOT (#DOT foo bar) baz) (#BRACE x y))

* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts
  further decoding of enclosed data.  This is not so, since quoting is related
  to code evaluation, not decoding.

### Datum labels

Valid data cannot be cyclic, since that would mean it has infinite length in
bytes.  To externally represent a value with cyclic structure, one uses datum
labels in the data encoding of the value.

A datum label either wraps another datum to assign a number to it, or contains
just a reference to a previous assignment.

    +------------------+----------------------------+
    | Syntax           | Internal datum structure   |
    +------------------+----------------------------+
    | #%<HEX>=<DATUM>  | (#LABEL <NUMBER> <DATUM>)  |
    +------------------+----------------------------+
    | #%<HEX>%         | (#LABEL <NUMBER>)          |
    +------------------+----------------------------+

In this visual, the token `<HEX>` stands for a hexadecimal digit sequence, the
token `<DATUM>` stands for any other datum, and `<NUMBER>` is a stand-in for a
number value; that which is represented by `<HEX>`.

For clarity, concrete examples follow:

    +-------------------+------------------------------+
    | Byte sequence     | Parse result                 |
    +-------------------+------------------------------+
    | #%1234abcd=(foo)  | (#LABEL <0x1234abcd> (foo))  |
    +-------------------+------------------------------+
    | #%1234abcd%       | (#LABEL <0x1234abcd>)        |
    +-------------------+------------------------------+

Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
with an integer value.  Note that the decoder may not accept a bare string here,
meaning this syntax sugar is not merely an abbreviation.
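
Recognizing the two label forms could look like the following Python sketch.
The function name `read_label` and the `"define"` and `"reference"` tags are
made up for illustration, and accepting both upper- and lowercase hexadecimal
digits is an assumption.

```python
import re

# "#%" + hex digits + "=" (definition) or "%" (reference)
LABEL = re.compile(rb"#%([0-9a-fA-F]+)([=%])")


def read_label(text: bytes):
    """Return (kind, number, rest) for a datum label prefix, or None."""
    match = LABEL.match(text)
    if match is None:
        return None
    number = int(match.group(1), 16)
    kind = "define" if match.group(2) == b"=" else "reference"
    return kind, number, text[match.end():]


definition = read_label(b"#%1234abcd=(foo)")
reference = read_label(b"#%1234abcd%")
```

For a definition, the remaining bytes (here `(foo)`) would then be handed back
to the parser to read the labeled datum.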

### Shebang

Finally, the parser recognizes the Unix *shebang* syntax and outputs a datum to
hold the string values found within:

    #!interpreter          ->  (#SHBANG interpreter)

    #!interpreter argline  ->  (#SHBANG interpreter argline)

When executing a script file, Zisp simply stores this into a global value that
may be inspected if desired.


<!--
;; Local Variables:
;; fill-column: 80
;; End:
-->