# Parser for Code & Data

<!--TOC-->

Zisp s-expressions represent an extremely minimal set of data types; only that
which is necessary to strategically construct more complex values:

    +-------+---------+--------+----------+------+
    | TYPE  | String  | Rune   | Pair     | Nil  |
    +-------+---------+--------+----------+------+
    | E.G.  | foobar  | #name  | (X & Y)  | ()   |
    +-------+---------+--------+----------+------+

The parser also recognizes various *syntax sugar* which typically results in a
pair beginning with a specific rune.  A separate component called the *decoder*
transforms such data into a rich set of value types.


## Character Encoding

The parser does not consume Unicode characters; it consumes bytes.  The grammar
is generally built from bytes that correspond to ASCII characters.

Some elements of the grammar, such as comments and quoted strings, may contain
arbitrary byte sequences, until terminated.  These sequences may happen to be
valid UTF-8 text.  This way, quoted strings and comments may contain Unicode
text encoded in UTF-8, but the parser does not check these for validity.

Since comments and quoted strings may contain arbitrary byte sequences, a text
editor or other program displaying Zisp s-expressions may need to use a special
visual representation for bytes that don't represent valid text.

The parser working on bytes rather than Unicode characters is not a limitation,
but rather a feature: It allows Zisp s-expressions to be used as a structured
data exchange format, which may contain binary data elements, without the need
to encode these in Base64 or other such text representations of binary data.
Consider the example:

    ((image.webp "<BINARY>")
     (video.webm "<BINARY>"))

For this to work, every incidental occurrence of the double-quote sign or the
backslash sign within the `<BINARY>` data must be escaped with a backslash; all
other bytes can appear verbatim in the strings.
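
As a rough illustration, the escaping could be done by whatever program writes
the data.  The following Python sketch is not part of Zisp, and its function
names are made up for this example; it only applies the rule stated above.

```python
def escape_for_double_quotes(payload: bytes) -> bytes:
    """Backslash-escape the two bytes that are special inside "..."."""
    out = bytearray()
    for byte in payload:
        if byte in (0x22, 0x5C):  # double-quote or backslash
            out.append(0x5C)      # prefix it with a backslash
        out.append(byte)          # every other byte is copied verbatim
    return bytes(out)


def quote_binary(payload: bytes) -> bytes:
    """Wrap arbitrary bytes in a double-quoted string."""
    return b'"' + escape_for_double_quotes(payload) + b'"'
```

The output grows only by one byte per escaped occurrence, unlike Base64, which
always adds about a third to the size.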


## Stream Parsing

The parser can be repeatedly invoked on a byte stream to consume the next datum
within.  This does not require "unreading" or back-seeking within the stream;
the parser always reads a full datum, and stops after some byte which cleanly
terminates the currently parsed datum.

This means Zisp s-expressions can be safely intermixed with other data within
the same byte stream.  So long as the other data is consumed by some parser
which similarly stops reading at a clear boundary, the Zisp parser can then
continue operating on the same stream.  Consider the example:

    ("image.webp" 8273)

    << 8273 bytes >>

    ("video.webm" 736)

    << 736 bytes >>

The "header" for each file in this stream is a Zisp s-expression containing
information about how many bytes should be read after the header, before the
next file header appears.  (The header data need to be terminated with a blank
ASCII character such as a newline; the closing parenthesis does not act as a
terminator unto itself due to the "join" syntax sugar.)

To enable this stream parsing strategy, the parser does not use any automatic
buffering.  If it did, it might inadvertently consume some bytes beyond the
currently parsed datum, leaving the stream inconsistent.

If the parser is meant to be used on an input stream associated with expensive
system calls, such as a file handle or network socket, it's best to wrap that
stream in some intermediate object which asks the system for large chunks of
data at once, and stores the data in a buffer.
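
The interleaving described in this section can be sketched in Python.  This is
not a Zisp parser: `read_header` is a hypothetical stand-in that understands
only the toy header shape from the example, but it shows the important
property of stopping right after the byte that terminates the datum.

```python
import io
import re

BLANKS = b" \t\r\n"
# Toy header shape from the example above: ("name" size)
HEADER = re.compile(rb'\("([^"\\]*)"\s+(\d+)\)')


def read_header(stream):
    """Read one header, one byte at a time; None at end of stream."""
    buf = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            break
        if byte in BLANKS:
            if buf.endswith(b")"):
                break        # a blank right after ')' terminates the datum
            if not buf:
                continue     # skip blanks in front of the header
        buf += byte
    if not buf:
        return None
    match = HEADER.fullmatch(bytes(buf))
    if match is None:
        raise ValueError(f"unsupported header: {bytes(buf)!r}")
    return match.group(1).decode("ascii"), int(match.group(2))


def read_files(stream):
    """Alternate between header parsing and raw payload reads."""
    files = {}
    while (header := read_header(stream)) is not None:
        name, size = header
        files[name] = stream.read(size)  # payload is not parsed as Zisp
    return files


# Example: two files back to back in one stream.
demo = io.BytesIO(b'("a.bin" 3)\nabc("b.bin" 2)\nxy')
files = read_files(demo)
```

When the stream is a file or socket, wrapping the raw stream in a single
`io.BufferedReader` and passing that wrapper to both readers keeps the one-byte
reads cheap, because the buffer is then shared rather than owned by the parser.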


## Comments

Two types of comment are supported: datum comments and line comments.

* A semicolon followed by a tilde instructs the parser to consume one datum and
  discard it.  Whitespace may appear between the tilde and the datum to discard.

* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
  discard bytes until a newline (ASCII Line Feed) is encountered.


## Value vs. Datum

A Zisp *value* that has an *external representation* in the form of a sequence
of bytes is called a *datum*.  Every datum is a value, but not every value is a
datum.  In other words, a datum is a value that can be printed out as a byte
sequence which the parser can turn back into an equivalent datum.

A value that is not a datum may nevertheless be *encoded* into one, allowing it
to have an external representation.  After parsing, it needs to be *decoded*.

One may speak of an *external representation of a value* where the value is not
itself a datum, but can be encoded as one.  The more strictly correct term for
this is: "The external representation of a datum that encodes the value."

### Syntax sugar

The parser recognizes various *syntax sugar* to abbreviate an equivalent datum
construction, or express a datum that encodes a more complex value.

As an example, the expression `#(x y z)` is an abbreviation for the equivalent
`(#HASH x y z)`.  These are two external representations for the same datum;
after parsing, both will yield values that are indistinguishable in all but
their memory address.

The most ubiquitous syntax sugar is the list, which abbreviates a sequence of
tail-linked pairs, terminated with a special nil value represented as `()`:

    (x)      ->  (x & ())

    (x y)    ->  (x & (y & ()))

    (x y z)  ->  (x & (y & (z & ())))

There are also so-called *improper lists* which are chains of pairs that end in
a value other than nil:

    (x y & z)    ->  (x & (y & z))

    (x y z & t)  ->  (x & (y & (z & t)))
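
The expansion of list notation into pairs is mechanical, as the following
Python sketch suggests.  It works on the external notation only, as plain text,
and the function name `expand` is made up for this example.

```python
def expand(items, tail="()"):
    """Spell out list sugar as explicit pairs, in external notation.

    `items` are the list elements, already in external form; `tail` is
    what follows the `&` in an improper list, or `()` for a proper one.
    """
    result = tail
    for item in reversed(items):   # build from the innermost pair outward
        result = f"({item} & {result})"
    return result
```

For instance, `expand(["x", "y", "z"])` spells out a proper list, whereas
`expand(["x", "y"], tail="z")` spells out the improper list `(x y & z)`.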

An example of "syntax sugar" that is not a mere abbreviation is a quoted string
which contains bytes that could not appear in a *bare* string:

    "foo bar"  ->  (#DQSTR & <STRING>)

In this example, the visual token `<STRING>` represents the actual string value
in program memory, which has no direct external representation in bytes because
it contains a space character.

Those familiar with Lisp and Scheme may expect bare strings to be parsed into a
separate type called *symbol* while quoted strings are parsed directly into a
string type, but this is not the case in Zisp.

### Decoder

The *decoder* transforms Zisp data into values of more complex types, including
values that are not of a datum type.

Combined with syntax sugar, this allows Zisp to offer familiar syntax elements.
For example, the expression `#(x y z)` which parses into `(#HASH x y z)` can be
decoded into an array, so the result is similar to the vector syntax of Scheme.

Decoding also resolves datum labels, goes over bare strings to find ones that
represent a number literal, and takes care of a number of other transforms.
This offloads complexity, allowing the parser to remain extremely simple.

See the dedicated documentation of the [decoder](2-decode.html) for more.


## Data types

Following is a more in-depth explanation of each data type constructed by the
Zisp s-expression parser.

These are in fact value types, though the term "data type" is often used due to
familiarity.  A Zisp value that is a member of one of the following value types
is only a *datum* if it adheres to additional constraints as explained below.

### String

Strings can appear *bare* or be quoted in various ways.  A quoted string is in
fact parsed into a pair value with a rune in the first position to identify the
quotation variant that was parsed, and the string value in the second position.

    +-----------+----------------------+
    | Syntax    | Parse output         |
    +-----------+----------------------+
    | |bytes|   | (#PQSTR & <STRING>)  |
    +-----------+----------------------+
    | "bytes"   | (#DQSTR & <STRING>)  |
    +-----------+----------------------+
    | @_bytes_  | (#ATSTR & <STRING>)  |
    +-----------+----------------------+

The visual token `<STRING>` denotes the actual string, as a Zisp value, in the
second position of the pair.

These external representations of strings will be explained in more detail
further below, including backslash escape sequences allowed within.

Strings have a fixed length, counted in bytes.  Each byte can have any value,
including zero (ASCII NUL).  The parser reads bytes, not Unicode characters; a
string may contain UTF-8 byte sequences, but these are not tested for validity.

A string that is up to 255 bytes long is automatically *interned*, meaning any
occurrence of the same string -- equal in length and containing the same byte
values -- ends up being represented by the same bit-pattern; either a memory
address, or an immediate representation within a CPU word for short strings.
The quotation method is inconsequential to this process; for example, while
`|foobar|` and `"foobar"` will parse into different pair values, the actual
string they hold will be the same one in program memory.

Strings of length greater than 255 bytes are stored separately in memory, even
if they are equal in length and content.
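
The observable effect of interning can be modeled with a lookup table.  The
following Python sketch is only a model: `ZString` and `make_string` are
hypothetical names, object identity stands in for "same bit-pattern", and the
immediate representation of short strings in a CPU word is not modeled.

```python
INTERN_LIMIT = 255  # longest string that is interned, per the text above


class ZString:
    """Hypothetical stand-in for a Zisp string value."""

    def __init__(self, data: bytes):
        self.data = data


_intern_table: dict[bytes, ZString] = {}


def make_string(data: bytes) -> ZString:
    """Return the shared value for short strings, a fresh one otherwise."""
    if len(data) > INTERN_LIMIT:
        return ZString(data)           # long strings are stored separately
    value = _intern_table.get(data)
    if value is None:
        value = _intern_table[data] = ZString(data)
    return value
```

With this model, comparing two short strings for equality reduces to comparing
identities, while long strings still require a byte-wise comparison.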

### Rune

A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must
begin with a letter, and may only contain letters and digits.  This character
sequence of letters and digits is called the *name* of the rune.  A rune that
follows this constraint is valid as a datum.

Zisp code may explicitly construct values of the rune type that violate the
above constraints.  Such runes are not valid data and cannot be printed or
parsed.
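
The constraint on rune names fits in a single regular expression.  The Python
sketch below merely restates the rule given above; the function name is made
up, and the name is checked without the leading hash sign.

```python
import re

# One ASCII letter, followed by up to five ASCII letters or digits.
RUNE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,5}")


def is_datum_rune_name(name: str) -> bool:
    """True if a rune with this name is valid as a datum."""
    return RUNE_NAME.fullmatch(name) is not None
```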

Runes are case-sensitive, and the parser always emits runes using upper-case
letters when expressing syntax sugar.  Uppercase rune names are reserved for
Zisp's internal use and standard library; users can use lowercase runes with
custom meaning without worrying about clashes, with the exception of a small
number of lowercase runes such as `#true` and `#false` that are part of the
default decoder settings and documented explicitly as such.

Runes are always stored directly in a CPU word; never by memory address.

### Pair

A pair is a tuple of two values: the first value and the second value.  In Lisp
tradition, these are also called the `car` and `cdr` of the pair, respectively.

The parser allocates a unique two-word cell in program memory for every pair,
and represents that pair through the memory address of the cell.

Pairs are valid data if one of the following holds true:

* The pair encodes a quoted string, datum label, or shebang line.

* Both the first and the second value in the pair are valid data.

Further, a structure of nested pair values may not contain cyclic references
back up in the structure (which would make the above definition diverge into
infinity).  Such cycles must be broken up with datum labels, or else the pair
cannot be considered a datum, since it cannot be printed or parsed.
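
A printer would have to detect such cycles before writing anything out.  A
minimal Python sketch follows; the `Pair` class and `has_cycle` function are
hypothetical and say nothing about how Zisp itself performs this check.  Note
that a pair which is merely shared, appearing twice without containing itself,
is not a cycle.

```python
class Pair:
    """Hypothetical stand-in for a Zisp pair."""

    def __init__(self, first, second):
        self.first = first
        self.second = second


def has_cycle(value, _path=None):
    """True if some pair is reachable from itself.

    `_path` holds the identities of the pairs between the root and the
    current position; meeting one of them again means we looped back up.
    The function is recursive, so very deep structures would need an
    explicit stack instead.
    """
    if not isinstance(value, Pair):
        return False
    path = set() if _path is None else _path
    if id(value) in path:
        return True
    path.add(id(value))
    found = has_cycle(value.first, path) or has_cycle(value.second, path)
    path.discard(id(value))   # leaving this pair; siblings may share it
    return found
```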

### Nil

The Zisp nil value is a singleton and a datum.  There is exactly one nil value
and it is used to terminate a chain of pairs representing a list of values; it
has the external representation `()`.


## Quoted strings

Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted.
This section goes into the details of each variant.

### Pipe-quoted

Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers
the parser to generate a pair with the structure:

    (#PQSTR & <STRING>)   ;; <STRING> is visual aid, not syntax

The decoder, using default settings, would emit this string verbatim as a value.
Then, during code evaluation, this would be seen as an identifier.  In this way,
pipe-quoted strings are equivalent to bare strings in functionality.

It is important to understand that the decoder sits between the parser and the
[evaluator](3-execute.html).  Contrary to Lisp and Scheme tradition, it is
common for the evaluator to receive values that are not valid data; in this
case, a string that may not be a valid datum by itself, because it cannot be
represented as a bare string.  Yet it is valid as an identifier for the
purposes of the evaluator, since it is a string *value* like any other.

### Double-quoted

Strings wrapped in the double-quote symbol parse into:

    (#DQSTR & <STRING>)   ;; <STRING> is visual aid, not syntax

Under default settings, the decoder would transform this into a value which,
when evaluated as code, simply yields the contained string as a value.

### At-quoted

This is a special type of syntax for "raw" strings, meaning that no backslash
escapes nor any other kind of escape sequence are recognized within them.

The syntax begins with an at sign, followed by any byte.  That byte becomes a
termination marker, and the string cannot contain an occurrence of it, since
there are no escape sequences.

    @"foo \ bar"  ->  (#ATSTR <BYTE> & <STRING>)

In the above, the visual tokens `<BYTE>` and `<STRING>` represent an integer
value and a string value, respectively.  In this example, the integer value
would be 34; the ASCII value for the double-quote sign.  The string value
contains a literal backslash, since there is no backslash escape parsing.

This style of quoting can be useful, for instance, when representing regular
expressions as strings in code:

    ;; Matches e.g. foo\bar.["blah"]

    @/^foo\\(bar|baz)\.\[".*"\]$/

Were it not for this syntax, this regular expression would only be possible to
represent through a quoted string such as the following:

    ;; Same as above, but so many backslashes

    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"

The byte that follows the at sign need not be a printable character or even a
valid ASCII byte; it can be absolutely any byte value, even NUL.  This can be
useful to easily encode binary data which is known to not contain a specific
byte; an example would be C strings which cannot contain NUL.
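
Since there are no escape sequences, reading an at-quoted string amounts to
searching for the next occurrence of the marker byte.  The Python sketch below
works on an in-memory buffer for brevity, whereas the real parser reads from a
stream; the function name and return shape are made up for this example.

```python
def parse_at_quoted(source: bytes, pos: int = 0):
    """Parse an at-quoted string that starts at source[pos].

    Returns (marker_byte, string_bytes, position_after_closing_marker).
    """
    if source[pos:pos + 1] != b"@":
        raise ValueError("expected an at sign")
    if pos + 1 >= len(source):
        raise ValueError("missing termination marker")
    marker = source[pos + 1]                       # any byte value, even NUL
    end = source.find(bytes([marker]), pos + 2)    # next occurrence ends it
    if end < 0:
        raise ValueError("unterminated at-quoted string")
    return marker, source[pos + 2:end], end + 1
```

Applied to the bytes of `@"foo \ bar"`, this yields the marker 34 and a string
that still contains the backslash.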

### Backslash escapes

In pipe-quoted and double-quoted strings, the following ASCII characters may
follow a backslash to insert a certain character.

    +-------+----------------------------+
    | Char  | Meaning                    |
    +-------+----------------------------+
    | \     | Literal backslash          |
    +-------+----------------------------+
    | |     | Literal pipe symbol        |
    +-------+----------------------------+
    | "     | Literal double-quote       |
    +-------+----------------------------+
    | 0     | ASCII NUL                  |
    +-------+----------------------------+
    | a     | ASCII Alert                |
    +-------+----------------------------+
    | b     | ASCII Backspace            |
    +-------+----------------------------+
    | t     | ASCII Tab (Horizontal)     |
    +-------+----------------------------+
    | n     | ASCII Newline (Line Feed)  |
    +-------+----------------------------+
    | v     | ASCII Vertical Tab         |
    +-------+----------------------------+
    | f     | ASCII Form Feed            |
    +-------+----------------------------+
    | r     | ASCII Carriage Return      |
    +-------+----------------------------+
    | e     | ASCII Escape               |
    +-------+----------------------------+

In words:

* A backslash followed by a backslash, pipe, or double-quote character is
  substituted with a literal occurrence of that character.

* The characters 0, a, b, t, n, v, f, r, and e represent common ASCII control
  characters, as in the C programming language (where `\e` is a widespread
  compiler extension rather than part of the standard).

Further, a backslash followed by text matching one of the following regular
expressions has special meaning.

    +---------------------+-----------------------+
    | Regular Expression  | Meaning               |
    +---------------------+-----------------------+
    | [\t ]*\n[\t ]*      | Discarded             |
    +---------------------+-----------------------+
    | x([0-9a-fA-F]{2})*; | Arbitrary bytes       |
    +---------------------+-----------------------+
    | u[0-9a-fA-F]+;      | Unicode Scalar Value  |
    +---------------------+-----------------------+

Explanations:

* A backslash followed by any number of blanks (space or tab), a newline, and
  again any number of blanks, is substituted with nothing.  This is to allow
  splitting a string into multiple lines for human readability.

      (define p "This paragraph has been visually split into multiple \
                 lines, but the newline is escaped, so it's one line.")

* An x, followed by pairs of hexadecimal digits (case insensitive), terminated
  by a semicolon, is substituted with the sequence of bytes represented by the
  corresponding pairs of hexadecimal digits.  E.g.: `"foo\xDEADBEEF;bar"`

* A u, followed by a hexadecimal digit sequence (case insensitive), terminated
  by a semicolon, is substituted with the canonical UTF-8 byte sequence for the
  Unicode Scalar Value represented by that hexadecimal number.  The number must
  be in the range `0` to `10FFFF`.  E.g.: `"foo\u00A0;bar"`
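
Taken together, the escape rules could be resolved roughly as in the following
Python sketch.  It operates on the bytes between the quotes and is not the
parser's actual code; the names are made up, and rejecting surrogate code
points (which are not Unicode Scalar Values by definition) is an assumption
about the parser's behavior.

```python
import re

SIMPLE = {
    ord("\\"): 0x5C, ord("|"): 0x7C, ord('"'): 0x22,
    ord("0"): 0x00, ord("a"): 0x07, ord("b"): 0x08, ord("t"): 0x09,
    ord("n"): 0x0A, ord("v"): 0x0B, ord("f"): 0x0C, ord("r"): 0x0D,
    ord("e"): 0x1B,
}
HEX_BYTES = re.compile(rb"x((?:[0-9a-fA-F]{2})*);")
HEX_SCALAR = re.compile(rb"u([0-9a-fA-F]+);")
CONTINUATION = re.compile(rb"[\t ]*\n[\t ]*")


def unescape(body: bytes) -> bytes:
    """Resolve backslash escapes in a pipe- or double-quoted string body."""
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != 0x5C:            # ordinary byte: copy verbatim
            out.append(body[i])
            i += 1
            continue
        i += 1                         # step over the backslash
        if i < len(body) and body[i] in SIMPLE:
            out.append(SIMPLE[body[i]])
            i += 1
        elif m := HEX_BYTES.match(body, i):
            out += bytes.fromhex(m.group(1).decode("ascii"))
            i = m.end()
        elif m := HEX_SCALAR.match(body, i):
            scalar = int(m.group(1), 16)
            if scalar > 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
                raise ValueError("not a Unicode scalar value")
            out += chr(scalar).encode("utf-8")
            i = m.end()
        elif m := CONTINUATION.match(body, i):
            i = m.end()                # escaped newline: emit nothing
        else:
            raise ValueError(f"invalid escape at byte {i}")
    return bytes(out)
```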

### Newlines in strings

Normally, a newline in a string has no special meaning and simply becomes part
of the string.  However, newlines can be backslash-escaped, which simply erases
them; the escaped newline can also be preceded or followed by any number of tab
and space characters, which are all stripped as well.  (Note: It's not blanks
preceding the backslash that are stripped, but blanks following the backslash
and preceding the newline; i.e., blanks at the end of the line.)

Following are some examples of how multi-line strings can appear in source code
with different intentions and meanings:

    (define paragraph "This paragraph has been visually split into multiple \
                       lines, but the newlines are escaped, so it's one line.")

    ;; Pipe-quoted, so the double-quotes inside need no escaping.
    (define json-object '|
      {
        "key": "value"
      }
    |)

The second example is actually slightly problematic.  It begins with a newline,
which may be undesirable, but escaping that newline would cause the first line
to have no indentation, thus the opening `{` would not line up with the closing
`}` when this string is printed out.  Further, if the entire block of code is
indented, then the string contents may be more indented than intended.  (No pun
or rhyme intended.)  Consider:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object '|
                 {
                   "key": "value"
                 }
               |))
          (do-whatever))))

The string bound to `json-object` has redundant indentation.  Should the parser
attempt to solve this issue?

Thankfully, we have the decoder to handle such complexities.  Under the default
settings, the rune `#HASH` is bound to a decoder rule which detects a payload
value that is a string literal, and implements the same algorithm as seen in
Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)

Thus, we can do the following:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object #|
    ...........  {
    ...........    "key": "value"
    ...........  }
    ...........|))
          (do-whatever))))

(Dots represent whitespace that is deleted.  The initial newline is, as well.)
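
The effect on the example can be approximated with the following Python sketch.
It is a simplified reading of the JEP 378 idea, not the decoder's actual code:
the function name is made up, and details such as escape processing are left
out.  As in JEP 378, the line holding the closing delimiter takes part in
determining how much indentation is removed.

```python
def strip_text_block(raw: str) -> str:
    """Remove incidental indentation from a multi-line string."""
    lines = raw.split("\n")
    # The newline right after the opening delimiter is dropped.
    if lines and lines[0].strip(" \t") == "":
        lines = lines[1:]
    if not lines:
        return ""
    # Significant lines: every non-blank line, plus the last line, which
    # is the one that holds the closing delimiter.
    significant = [ln for ln in lines[:-1] if ln.strip(" \t")] + lines[-1:]
    indent = min(len(ln) - len(ln.lstrip(" \t")) for ln in significant)
    # Strip the common indentation and any trailing blanks from each line.
    return "\n".join(ln[indent:].rstrip(" \t") for ln in lines)
```

For the example above, the closing pipe is indented by 11 columns and the
braces by 13, so two columns of indentation remain in front of the braces.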

The one feature Zisp does not offer here is a way to fence off multi-line
strings with a longer token such as `"""` as seen in Python and Java, or an
arbitrary word as seen in Bourne shell and PHP "here doc" syntax.

However, if a programmer truly wanted to have arbitrary text blocks in code,
without needing to escape anything in them, it's possible to abuse at-quoted
string syntax, using it with an ASCII control character which is displayed
visibly by a text editor.  In the following, the characters `^\` are meant to
represent a literal ASCII File Separator character in the source code:

    (define json-object #@^\
      {
        "key": "value"
      }
      ^\)

It works fine in Emacs, so why not?  Use `C-q C-\` to insert the `^\`.

This is indeed quite an eldritch syntax, but hopefully most programs would not
need to use it.


## Other syntax

The following table summarizes commonly useful syntax abbreviations:

    [...]   -> (#SQUARE ...)          #datum       -> (#HASH & datum)

    {...}   -> (#BRACE ...)           #rune(...)   -> (#rune ...)

    'datum  -> (#QUOTE & datum)       dat1dat2     -> (#JOIN dat1 & dat2)

    `datum  -> (#GRAVE & datum)       dat1.dat2    -> (#DOT dat1 & dat2)

    ,datum  -> (#COMMA & datum)       dat1:dat2    -> (#COLON dat1 & dat2)

Notes:

* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
  means zero or more data.

* The `#datum` form only applies when the datum following the hash sign is
  anything other than a bare string, since otherwise this would be ambiguous
  with a rune literal.  A bare string can nevertheless follow the hash sign by
  separating the two with a backslash:

      #\string  ->  (#HASH & string)

* Though not represented in the table due to notational difficulty, the form
  `#rune(...)` doesn't require a list in the second position; any datum that
  works with the `#datum` syntax also works with `#rune<DATUM>`.

      #rune1#rune2  -> (#rune1 & #rune2)

      #rune\string  -> (#rune & string)

      #rune'string  -> (#rune #QUOTE & string)

      #rune"string" -> (#rune #DQSTR & <STRING>)

  As a counter-example, following a rune immediately with a bare string isn't
  possible without the delimiting backslash, since that would be ambiguous:

      #abcdefgh  ;Could be (#abcdef & gh) or (#abcde & fgh) or ...

* Syntax sugar can combine arbitrarily.  Some examples follow.  Any of these may
  or may not actually have a meaning in code; many could simply end up producing
  an error during decoding, or later evaluation of code.

      #{...}            -> (#HASH #BRACE ...)

      #'foo             -> (#HASH #QUOTE & foo)

      ##'[...]          -> (#HASH #HASH #QUOTE #SQUARE ...)

      {x y}[i j]        -> (#JOIN (#BRACE x y) #SQUARE i j)

      foo.bar.baz{x y}  -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)

* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as
  `(#QUOTE & foo)`; a single pair with the quoted datum in the second position.

  The same principle is used when parsing other sugar; some examples follow:

      Incorrect                              Correct

      #(x y z) -> (#HASH (x y z))            #(x y z) -> (#HASH x y z)

      [x y z]  -> (#SQUARE (x y z))          [x y z]  -> (#SQUARE x y z)

      #{x}     -> (#HASH (#BRACE (x)))       #{x}     -> (#HASH #BRACE x)

      foo(x y) -> (#JOIN foo (x y))          foo(x y) -> (#JOIN foo x y)

* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts
  further decoding of enclosed data.  This is not so, since quoting is related
  to code evaluation, not decoding.
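
The "correct" forms above follow directly from list notation: whenever the
second position of a pair holds another pair or nil, a printer can omit the
ampersand and the inner parentheses.  The Python sketch below illustrates this
with a hypothetical `Pair` class, atoms given as strings in external form, and
`None` standing in for nil.

```python
class Pair:
    """Hypothetical stand-in for a Zisp pair."""

    def __init__(self, first, second):
        self.first = first
        self.second = second


def write(value) -> str:
    """Print a value, using list notation wherever possible."""
    if value is None:
        return "()"
    if not isinstance(value, Pair):
        return value                  # atom, already in external form
    parts = []
    while isinstance(value, Pair):    # walk the chain of second positions
        parts.append(write(value.first))
        value = value.second
    if value is not None:             # improper tail needs the ampersand
        parts += ["&", write(value)]
    return "(" + " ".join(parts) + ")"
```

For example, `#{x}` is a pair of `#HASH` and the datum `(#BRACE x)`, which
prints as `(#HASH #BRACE x)` rather than with nested parentheses.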

### Datum labels

Valid data cannot be cyclic, since that would mean it has infinite length in
bytes.  To externally represent a value with cyclic structure, one uses datum
labels in the data encoding of the value.

A datum label either wraps another datum to assign a number to it, or contains
just a reference to a previous assignment.

    +------------------+------------------------------+
    | Syntax           | Internal datum structure     |
    +------------------+------------------------------+
    | #%<HEX>=<DATUM>  | (#LABEL <NUMBER> & <DATUM>)  |
    +------------------+------------------------------+
    | #%<HEX>%         | (#LABEL & <NUMBER>)          |
    +------------------+------------------------------+

In this visual, the token `<HEX>` stands for a hexadecimal digit sequence, the
token `<DATUM>` stands for any other datum, and `<NUMBER>` is a stand-in for a
number value; that which is represented by `<HEX>`.

For clarity, concrete examples follow:

    +-------------------+-------------------------------+
    | Byte sequence     | Parse result                  |
    +-------------------+-------------------------------+
    | #%1234abcd=(foo)  | (#LABEL <0x1234abcd> & (foo)) |
    +-------------------+-------------------------------+
    | #%1234abcd%       | (#LABEL & <0x1234abcd>)       |
    +-------------------+-------------------------------+

Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
with an integer value.  Note that the decoder may not accept a bare string here,
meaning this syntax sugar is not merely an abbreviation.
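
Recognizing the two label forms could look like the following Python sketch.
The function name and return shape are made up for this example, and accepting
both upper-case and lower-case hexadecimal digits is an assumption.

```python
import re

# "#%", hexadecimal digits, then "=" (assignment) or "%" (reference).
LABEL = re.compile(rb"#%([0-9a-fA-F]+)([=%])")


def parse_label_prefix(source: bytes, pos: int = 0):
    """Parse the label syntax that starts at source[pos].

    Returns (kind, number, next_pos), where kind is "assign" or
    "reference".  After an assignment, the labeled datum starts at
    next_pos and is left for the caller to parse.
    """
    match = LABEL.match(source, pos)
    if match is None:
        raise ValueError("not a datum label")
    number = int(match.group(1), 16)   # a number value, not a bare string
    kind = "assign" if match.group(2) == b"=" else "reference"
    return kind, number, match.end()
```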

### Shebang

Finally, the parser recognizes the Unix *shebang* syntax and outputs a datum to
hold the string values found within:

    #!interpreter          ->  (#SHBANG & interpreter)

    #!interpreter argline  ->  (#SHBANG interpreter & argline)

When executing a script file, Zisp simply stores this into a global value that
may be inspected if desired.


<!--
;; Local Variables:
;; fill-column: 80
;; End:
-->