# Parser for Data

*For an exact specification of the grammar, see [grammar](grammar/).*

Zisp s-expressions represent an extremely minimal set of data types, limited to
what is necessary to construct more complex values:

    +--------+-----------------+--------+----------+------+
    | TYPE   | String          | Rune   | Pair     | Nil  |
    +--------+-----------------+--------+----------+------+
    | E.G.   | foobar          | #name  | (X & Y)  | ()   |
    |        | |foo bar|       |        |          |      |
    |        | "foo bar"       |        |          |      |
    |        | @_foo bar_      |        |          |      |
    +--------+-----------------+--------+----------+------+

Datum comments and line comments are supported:

* A semicolon followed by a tilde instructs the parser to consume one datum and
  discard it.  Whitespace may appear between the tilde and the datum to discard.

* A semicolon, followed by a non-tilde byte, instructs the parser to consume and
  discard bytes until a newline (ASCII Line Feed) is encountered.
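
As a rough illustration of the two comment forms, the following Python sketch
classifies a comment and skips a line comment.  The function names
`comment_kind` and `skip_line_comment` are hypothetical, and treating a
semicolon at the very end of the input as a line comment is an assumption; this
is not the actual Zisp parser.

```python
def comment_kind(data: bytes, pos: int) -> str:
    """Classify the comment starting at data[pos], which must be a semicolon."""
    assert data[pos:pos + 1] == b";"
    # ";~" starts a datum comment; any other following byte starts a line
    # comment.  (Assumption: a semicolon at end of input counts as "line".)
    return "datum" if data[pos + 1:pos + 2] == b"~" else "line"


def skip_line_comment(data: bytes, pos: int) -> int:
    """Return the index just past the Line Feed that ends a line comment.

    If the input ends before a newline is seen, return len(data).
    """
    newline = data.find(b"\n", pos)
    return len(data) if newline == -1 else newline + 1
```

Skipping a datum comment is not shown, since it requires parsing one full datum
before discarding it.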

The parser can also output non-negative integers, but these are only used for
datum labels.  Number literals are handled by the decoder instead; see below.


## Overview

This section explains a few core concepts and features related to the parser.


### Value vs. Datum

A Zisp *value* that has an *external representation* in the form of a sequence
of bytes is called a *datum*.  Every datum is a value, but not all values are
data.  A datum is a value that can be printed out as a byte sequence which the
parser can recognize and turn back into an equivalent datum.

One may speak of an *external representation of a value* where the value is not
itself a datum, but can be encoded as a datum.  The more strictly correct term
for this is: "The external representation of a datum encoding the value."


### Syntax sugar

The parser recognizes various "syntax sugar" and transforms it into uses of the
above listed primitive data types.  As an example, the expression `#(x y z)` is
parsed into the structure `(#HASH x y z)`.  These are two completely equivalent
external representations for the same compound datum; after parsing, both byte
sequences will yield data values that are indistinguishable in all but their
memory address.

The most ubiquitously used syntax sugar is the list, which stands for a chain of
pairs, terminated with nil:

    (x y z)  ->  (x & (y & (z & ())))

The full syntax sugar table is listed and explained further below.
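
The list expansion can be sketched in Python as follows.  Modelling a pair as a
2-tuple and nil as the empty tuple is purely an assumption of this sketch, as
are the names `cons` and `list_to_pairs`.

```python
from typing import Any

NIL = ()  # stand-in for the Zisp nil value


def cons(first: Any, second: Any) -> tuple:
    """Build a pair, modelled here as a 2-tuple."""
    return (first, second)


def list_to_pairs(*items: Any) -> Any:
    """Expand (x y z) into (x & (y & (z & ())))."""
    result = NIL
    # Build from the tail so that the last item is paired with nil.
    for item in reversed(items):
        result = cons(item, result)
    return result
```

Calling `list_to_pairs("x", "y", "z")` yields the nested structure
`("x", ("y", ("z", ())))`, mirroring the chain of pairs shown above.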


### Decoder

*The decoder has nothing to do with the concept of text or character encoding.*

A separate process called *decoding* can transform Zisp data into values of more
complex types, including values that are not of a datum type.

For example, the datum `(#HASH x y z)` could be decoded into an array, so the
expression `#(x y z)` could work like in Scheme.

Decoding also resolves datum labels, goes over bare strings to find ones that
represent a number literal, and takes care of a number of other transforms.
This offloads complexity, allowing the parser to remain extremely simple.

See the dedicated documentation of the [decoder](2-decode.html) for more.


### Character encoding

The parser does not consume characters; it consumes bytes.

The grammar is generally built from bytes corresponding to ASCII characters.
Some elements of the grammar, such as comments and quoted strings, may contain
arbitrary byte sequences, until terminated.  These sequences may happen to be
valid UTF-8 text.  This way, quoted strings and comments may contain Unicode
text encoded in UTF-8, but the parser does not check these for validity.

Since comments and quoted strings may contain arbitrary byte sequences, a text
editor or other program displaying Zisp s-expressions may need to use a special
visual representation for bytes that don't represent valid text.

The parser being based on bytes rather than characters is not a limitation but
rather a feature: It allows for Zisp s-expressions to be used as a structured
data exchange format that may contain binary data elements without the need to
encode these in Base64 or other such text representations of binary data.
Consider the example:

    ((image.webp "<< binary data >>")
     (video.webm "<< binary data >>"))

For this to work, any incidental occurrences of the double-quote and backslash
bytes within the binary data must be escaped with a backslash; all other bytes
can appear verbatim in the strings.
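
A producer of such data might escape a binary payload along these lines.  This
is a minimal Python sketch; the function name `escape_for_double_quotes` is
made up for illustration.

```python
def escape_for_double_quotes(payload: bytes) -> bytes:
    """Wrap payload in double quotes, escaping only '"' and backslash."""
    out = bytearray(b'"')
    for byte in payload:
        if byte in (0x22, 0x5C):  # double quote or backslash
            out.append(0x5C)      # prefix it with a backslash
        out.append(byte)          # every other byte is copied verbatim
    out.append(0x22)
    return bytes(out)
```

Note that bytes such as NULL or `0xFF` pass through unchanged, which is what
makes this cheaper than a Base64 encoding.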


### Stream parsing

The parser can be repeatedly invoked on a byte stream to consume the next datum
within.  This does not require "unreading" or back-seeking within the stream;
the parser always reads a full datum, and stops after some byte which cleanly
terminates the currently parsed datum.

This means Zisp s-expressions can be safely intermixed with other data within
the same byte stream.  So long as the other data is consumed by some parser
which similarly stops reading at a clear boundary, the Zisp parser can then
continue operating on the same stream.  Consider the example:

    ("image.webp" 8273)

    << 8273 bytes >>

    ("video.webm" 736)

    << 736 bytes >>

The "header" for each file in this stream is a Zisp s-expression containing
information about how many bytes should be read after the header, before the
next file header appears.  (The header data need to be terminated with a blank
ASCII character such as a newline.  The reason why the closing parenthesis does
not act as a terminator unto itself will become apparent later.)
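
The following Python sketch shows how a consumer might walk such a stream.  It
does not use a real Zisp parser: the regular expression `HEADER` is a toy
stand-in that only accepts headers of the exact shape `("name" size)` followed
by a single newline, and `read_files` is a hypothetical name.

```python
import io
import re
from typing import BinaryIO, Iterator, Tuple

# Toy stand-in for the real parser: accepts only headers of the exact form
# ("name" size) followed by a single Line Feed.
HEADER = re.compile(rb'\("([^"\\]*)" ([0-9]+)\)\n')


def read_files(stream: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (name, payload) for each header/payload pair in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        match = HEADER.fullmatch(line)
        if match is None:
            raise ValueError(f"unexpected header: {line!r}")
        name, size = match.group(1), int(match.group(2))
        # The payload is read as raw bytes; it is never seen by the "parser".
        payload = stream.read(size)
        if len(payload) != size:
            raise ValueError("stream ended inside a payload")
        yield name, payload
```

Because the header reader stops right after the terminating newline, the
payload can contain any bytes at all, including parentheses and newlines.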


### Datum labels

A valid datum cannot be cyclic, since its external representation would then be
infinitely long.  To externally represent a value with cyclic structure, one
uses datum labels in the data encoding of the value.

A datum label either wraps another datum to assign a number to it, or contains
just a reference to a previous assignment.

    +----------------------------------+---------------------------------+
    | Internal structure               | External representation         |
    +----------------------------------+---------------------------------+
    | (#LABEL & (<NUMBER> & <DATUM>))  | #%<HEX>=<DATUM>                 |
    +----------------------------------+---------------------------------+
    | (#LABEL & <NUMBER>)              | #%<HEX>%                        |
    +----------------------------------+---------------------------------+

In this visual, the token `<NUMBER>` stands for an actual number value that
doesn't have its own external representation.  It's printed as a sequence of
hexadecimal digits, denoted by `<HEX>` in the external representation.

For clarity, concrete examples follow:

    #%1234abcd=(foo bar)  ->  (#LABEL & (<0x1234abcd> & (foo bar)))

    #%1234abcd%           ->  (#LABEL & <0x1234abcd>)

Here, the visual token `<0x1234abcd>` stands for a Zisp value of a numeric type
with an integer value.
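
To make the two label forms concrete, here is a small Python sketch that splits
a label prefix off a byte sequence.  The name `parse_label` is hypothetical, and
accepting upper-case hexadecimal digits is an assumption; the text above only
shows lower-case ones.

```python
import re
from typing import Tuple

# Assumption: both lower-case and upper-case hex digits are accepted.
LABEL = re.compile(rb"#%([0-9a-fA-F]+)([=%])")


def parse_label(data: bytes) -> Tuple[str, int, bytes]:
    """Split a datum label prefix off the front of data.

    Returns (kind, number, rest), where kind is "assign" for #%<HEX>= and
    "ref" for #%<HEX>%.  For an assignment, rest still holds the labelled
    datum, which a real parser would go on to parse.
    """
    match = LABEL.match(data)
    if match is None:
        raise ValueError("not a datum label")
    kind = "assign" if match.group(2) == b"=" else "ref"
    return kind, int(match.group(1), 16), data[match.end():]
```

The integer returned here corresponds to the `<NUMBER>` token in the table: a
numeric value produced directly by the parser, not a bare string.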

Datum labels may look like "syntax sugar" but the fact that integers don't have
a direct external representation means that datum labels are a fundamental type
of syntax that has no "desugared" equivalent in external representation.  The
decoder will not accept a bare string encoding of an integer here.


## Data types

Following is an explanation of the four core data types constructed by the Zisp
s-expression parser.

A Zisp value that is a member of one of these types is also called a *datum* if
it adheres to additional constraints as explained for each type.


### String

Strings can appear "bare" or be quoted in various ways.

A string, as a stand-alone Zisp value, is only a valid datum if it can be
represented as a bare string.  If it contains bytes that prevent the bare
representation, then the string must be wrapped in one of the following
structures to become a valid datum, each of which has its own external
representation:

    +-------------------------------+-------------------------------+
    | Internal structure            | External representation       |
    +-------------------------------+-------------------------------+
    | (#PQSTR & <STRING>)           | |contents|                    |
    +-------------------------------+-------------------------------+
    | (#DQSTR & <STRING>)           | "contents"                    |
    +-------------------------------+-------------------------------+
    | (#ATSTR & <STRING>)           | @_contents_                   |
    +-------------------------------+-------------------------------+

The visual token `<STRING>` is meant to denote the actual string, as a Zisp
value, occupying the second position in the pair.  It is not actual syntax.

Note that, while conceptually similar to syntax sugar, this internal encoding of
string data is not sugar: the internal datum representation using runes cannot
be printed out verbatim, because the attached string cannot be represented
externally without quotation.  As such, quoted strings are fundamental syntax.

These external representations of strings will be explained in more detail
further below, including backslash escape sequences allowed within.

Strings have a fixed length, counted in bytes.  Each byte can have any value,
including zero (aka ASCII NULL).  The parser reads bytes, not characters, and
has no concept of a character encoding, which means that a string can contain
UTF-8 byte sequences, but these are not tested for validity.

A string that is up to 64 bytes long is automatically *interned*, meaning any
occurrence of the same string -- equal in length and containing the same byte
values -- ends up being represented by the same bit-pattern; either a memory
address, or an immediate representation within a CPU word for short strings.

Strings with a length greater than 64 bytes end up being represented by a
distinct memory address, even if they are equal in length and content.
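
The observable effect of interning can be imitated in Python, where object
identity (`is`) plays the role of the bit-pattern comparison.  This is only an
analogy under that assumption; `make_string` and the table are hypothetical and
say nothing about how Zisp actually stores strings.

```python
from typing import Dict

INTERN_LIMIT = 64  # bytes; the threshold named in the text above
_table: Dict[bytes, bytes] = {}


def make_string(content: bytes) -> bytes:
    """Return the shared object for short strings, a fresh one otherwise."""
    if len(content) > INTERN_LIMIT:
        # Copy through a bytearray so that a distinct object is created.
        return bytes(bytearray(content))
    # The first occurrence is stored; later equal strings reuse it.
    return _table.setdefault(content, content)
```

Two short strings with equal contents compare identical, while two long strings
with equal contents are equal in value but remain distinct objects.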


### Rune

A rune is represented by an ASCII character sequence of 1 to 6 bytes, that must
begin with a letter, and may only contain letters and digits.  This character
sequence of letters and digits is called the *name* of the rune.  A rune that
follows this constraint is valid as a datum.
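
The constraint on rune names maps directly onto a regular expression.  A
minimal Python sketch, with the hypothetical helper `is_valid_rune_name`
operating on the name without its leading hash sign:

```python
import re

# One ASCII letter, followed by up to five ASCII letters or digits.
RUNE_NAME = re.compile(rb"[A-Za-z][A-Za-z0-9]{0,5}")


def is_valid_rune_name(name: bytes) -> bool:
    """Check whether name is usable as the name of a rune datum."""
    return RUNE_NAME.fullmatch(name) is not None
```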

Zisp code may explicitly construct values of the rune type that violate the
above constraints.  Such runes are not valid data and cannot be printed or
parsed in any way.

Runes are case-sensitive, and the parser always emits runes using upper-case
letters when expressing syntax sugar.  Uppercase rune names are reserved for
Zisp's internal use and standard library; users can use lowercase runes with
custom meaning without worrying about clashes, with the exception of a small
number of lowercase runes such as `#true` and `#false` that are part of the
default decoder settings.

Runes are always stored directly in a CPU word; never by memory address.


### Pair

A pair is a tuple of two values: the first value and the second value.

The parser allocates a unique two-word cell in the process heap for every pair,
and represents that pair through the memory address of that cell.

Pairs are valid as a datum if one of the following holds true for the pair:

* It encodes one of the quoted string variants.

* It encodes a datum label (assignment or reference).

* Both the first and the second value in the pair are themselves valid data.

An additional constraint is that a hierarchy of pairs containing pairs must not
form cycles; if it does, the cycles must be broken up by use of datum labels, or
else none of the pairs within the cyclic structure is a valid datum.


### Nil

The Zisp nil value is a singleton and a datum.  There is exactly one nil value
and it is used to terminate a chain of pairs representing a list of values.


## Quoted strings

Three quoted string types exist: Pipe-quoted, double-quoted, and at-quoted.
This section goes into the details of each variant.


### Pipe-quoted

Strings can be quoted with pipes, like symbols in R7RS Scheme, which triggers
the parser to generate a pair with the structure:

    (#PQSTR & <STRING>)                 ;; <STRING> is visual aid, not syntax

The decoder, using default settings, would emit this string verbatim as a value.
Then, during code evaluation, this would be seen as an identifier.  In this way,
pipe-quoted strings are equivalent to bare strings in functionality.

It is important to understand that the decoder sits between the parser and the
[evaluator](3-execute.html).  In contrast to Lisp and Scheme tradition, it is
common for the evaluator to receive values that are not valid as a datum.  In
this case, it receives a string that may not be a valid datum by itself,
because it cannot be represented as a bare string.  Yet it is valid as an
identifier for the purposes of the evaluator, since it is a string *value* like
any other.


### Double-quoted

Strings wrapped in the double-quote symbol parse into:

    (#DQSTR & <STRING>)                 ;; <STRING> is visual aid, not syntax

Under default settings, the decoder would transform this into a value which,
when evaluated, yields back the string as a value.  Typically, this would be
achieved by simply transforming it into `(#QUOTE & <STRING>)`.  (Note that,
unlike `(#PQSTR & <STRING>)`, this would not be decoded into a string unto
itself, as that would make the evaluator see it as an identifier.)


### At-quoted strings AKA raw strings

There is a special type of syntax for "raw" strings, meaning that no backslash
escapes nor any other kind of escape sequence are recognized within them.

This raw string syntax begins with an at sign, followed by any byte.  That byte
becomes the termination marker, and the string cannot contain an occurrence of
it, since there are no escape sequences.

    @"foo \ bar"  ->  (#ATSTR & <STRING>)

In the above, the visual token `<STRING>` is not part of datum syntax but a
stand-in for the actual string value, which is, literally: `foo \ bar`

This style of quoting can be useful, for instance, when representing regular
expressions as strings in code:

    @/^foo\\(bar|baz)\.\[".*"\]$/         ;; matches e.g. foo\bar.["blah"]

Were it not for this syntax, this regular expression would only be possible to
represent through a quoted string such as the following:

    "^foo\\\\(bar|baz)\\.\\[\".*\"\\]$"   ;; many backslashes

Alternatively, imagine searching for certain MS Windows file paths:

    @_C:\\\\Users\\([a-z]+)_              ;; matches C:\\Users\foo

That's already ugly.  Without raw strings, it would need to look even worse:

    "C:\\\\\\\\Users\\\\([a-z]+)"         ;; MANY backslashes

The byte that follows the at sign need not be a printable character or even a
valid ASCII byte; it can be absolutely any byte value, even NULL.  This can be
useful to easily encode binary data which is known to not contain a specific
byte; an example would be C strings which cannot contain NULL.
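
Since there are no escapes, reading an at-quoted string amounts to a single
search for the delimiter byte.  A minimal Python sketch, where `parse_at_string`
is a hypothetical name and not part of Zisp:

```python
from typing import Tuple


def parse_at_string(data: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Parse an at-quoted string starting at data[pos].

    Returns (contents, index just past the closing delimiter).
    """
    if data[pos:pos + 1] != b"@" or pos + 1 >= len(data):
        raise ValueError("expected '@' followed by a delimiter byte")
    delimiter = data[pos + 1:pos + 2]  # may be any byte value, even NULL
    end = data.find(delimiter, pos + 2)
    if end == -1:
        raise ValueError("unterminated at-quoted string")
    return data[pos + 2:end], end + 1
```

For example, the input `@"foo \ bar"` yields the nine bytes `foo \ bar`, with
the backslash kept as an ordinary byte.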


### Backslash escape sequences in strings

The following backslash escapes are supported in pipe-quoted and double-quoted
strings.  (Some rows use Regular Expression notation.)

    +-----------------------------------+------------------------------+
    | Character(s) following backslash  | Meaning                      |
    +-----------------------------------+------------------------------+
    | \                                 | Literal backslash            |
    +-----------------------------------+------------------------------+
    | |                                 | Literal pipe symbol          |
    +-----------------------------------+------------------------------+
    | "                                 | Literal double-quote         |
    +-----------------------------------+------------------------------+
    | RE: /[\t ]*\n[\t ]*/              | Discarded                    |
    +-----------------------------------+------------------------------+
    | 0                                 | ASCII NULL                   |
    +-----------------------------------+------------------------------+
    | a                                 | ASCII Alert                  |
    +-----------------------------------+------------------------------+
    | b                                 | ASCII Backspace              |
    +-----------------------------------+------------------------------+
    | t                                 | ASCII Tab (Horizontal)       |
    +-----------------------------------+------------------------------+
    | n                                 | ASCII Newline (Line Feed)    |
    +-----------------------------------+------------------------------+
    | v                                 | ASCII Vertical Tab           |
    +-----------------------------------+------------------------------+
    | f                                 | ASCII Form Feed              |
    +-----------------------------------+------------------------------+
    | r                                 | ASCII Carriage Return        |
    +-----------------------------------+------------------------------+
    | e                                 | ASCII Escape                 |
    +-----------------------------------+------------------------------+
    | RE: /x([0-9a-fA-F]{2})+;/         | Arbitrary bytes in hex       |
    +-----------------------------------+------------------------------+
    | RE: /u[0-9a-fA-F]+;/              | Unicode scalar as UTF-8      |
    +-----------------------------------+------------------------------+
     
To clarify:

* A backslash followed by a backslash, pipe, or double-quote character is
  substituted with a literal occurrence of the corresponding character.

* A backslash followed by any number of blanks (space or tab), a newline, and
  again any number of blanks, is substituted with nothing.  This is to allow
  splitting a string into multiple lines for human readability.

      (define paragraph "This paragraph has been visually split into multiple \
                         lines, but the newline is escaped, so it's one line.")

* The characters 0, a, b, t, n, v, f, r, and e have the same meanings as in the
  C programming language (where `\e` is a common compiler extension rather than
  part of the standard), representing common unprintable ASCII bytes.

* An x, followed by pairs of hexadecimal digits (case insensitive), terminated
  by a semicolon, is substituted with the sequence of bytes represented by the
  corresponding pairs of hexadecimal digits.  E.g.: `"foo\xDEADBEEF;bar"`

* A u, followed by a hexadecimal digit sequence (case insensitive), terminated
  by a semicolon, is substituted with the canonical UTF-8 byte sequence for the
  Unicode Scalar Value represented by that hexadecimal number.  The number must
  be in the range `0` to `10FFFF`.  E.g.: `"foo\u00A0;bar"`
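
The rules above can be summarized in a Python sketch that decodes the body of a
pipe-quoted or double-quoted string, i.e. the bytes between the quotes.  The
name `unescape` is hypothetical, and rejecting unknown escapes with an error is
an assumption, since the text does not say how the parser treats them.

```python
import re

SIMPLE = {
    ord("\\"): b"\\", ord("|"): b"|", ord('"'): b'"',
    ord("0"): b"\x00", ord("a"): b"\x07", ord("b"): b"\x08",
    ord("t"): b"\t", ord("n"): b"\n", ord("v"): b"\x0b",
    ord("f"): b"\x0c", ord("r"): b"\r", ord("e"): b"\x1b",
}
LINE_JOIN = re.compile(rb"[\t ]*\n[\t ]*")
HEX_BYTES = re.compile(rb"x((?:[0-9a-fA-F]{2})+);")
UNICODE = re.compile(rb"u([0-9a-fA-F]+);")


def unescape(body: bytes) -> bytes:
    """Decode the backslash escapes in the body of a quoted string."""
    out = bytearray()
    pos = 0
    while pos < len(body):
        if body[pos] != 0x5C:  # not a backslash: copy the byte verbatim
            out.append(body[pos])
            pos += 1
            continue
        pos += 1  # step over the backslash
        if pos < len(body) and body[pos] in SIMPLE:
            out += SIMPLE[body[pos]]
            pos += 1
        elif (m := LINE_JOIN.match(body, pos)):
            pos = m.end()  # escaped newline and surrounding blanks vanish
        elif (m := HEX_BYTES.match(body, pos)):
            out += bytes.fromhex(m.group(1).decode("ascii"))
            pos = m.end()
        elif (m := UNICODE.match(body, pos)):
            code = int(m.group(1), 16)
            if code > 0x10FFFF:
                raise ValueError("code point out of range")
            out += chr(code).encode("utf-8")
            pos = m.end()
        else:
            # Assumption: unknown or truncated escapes are an error.
            raise ValueError(f"invalid escape at byte {pos}")
    return bytes(out)
```

With this sketch, the body `foo\xDEADBEEF;bar` decodes to `foo`, the four bytes
`DE AD BE EF`, and `bar`, while `foo\u00A0;bar` yields the two-byte UTF-8
encoding of U+00A0 between `foo` and `bar`.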


### Newlines in strings

Normally, a newline in a string has no special meaning and simply becomes part
of the string.  However, newlines can be backslash-escaped, which simply erases
them; the escaped newline can also be preceded or followed by any number of tab
and space characters, which are all stripped as well.  (Note: It's not blanks
preceding the backslash that are stripped, but blanks following the backslash
and preceding the newline; i.e., blanks at the end of the line.)

Following are some examples of how multi-line strings can appear in source code
with different intentions and meanings:

    (define paragraph "This paragraph has been visually split into multiple \
                       lines, but the newlines are escaped, so it's one line.")

    ;; use '|| so double-quotes need no escaping
    (define json-object '|
      {
        "key": "value"
      }
    |)

The second example is actually slightly problematic.  It begins with a newline,
which may be undesirable, but escaping that newline would cause the first line
to have no indentation, thus the opening `{` would not line up with the closing
`}` when this string is printed out.  Further, if the entire block of code is
indented, then the string contents may be more indented than intended.  (No pun
or rhyme intended.)  Consider:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object '|
                 {
                   "key": "value"
                 }
               |))
          (do-whatever))))

The string bound to `json-object` has redundant indentation.  Should the parser
attempt to solve this issue?

Thankfully, we have the decoder to handle such complexities.  Under the default
settings, the rune `#HASH` is bound to a decoder rule which detects a payload
value that is a string literal, and implements the same algorithm as seen in
Java 15 Text Blocks: [JEP 378: Text Blocks](https://openjdk.org/jeps/378)

Thus, we can do the following:

    (let ((foo one))
      (let ((bar two))
        (let ((json-object #|
    ...........  {
    ...........    "key": "value"
    ...........  }
    ...........|))
          (do-whatever))))

(Dots represent whitespace that is deleted.  The initial newline is, as well.)
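
The indentation handling can be approximated as follows.  This Python sketch is
a simplified reading of the JEP 378 stripping step, not the actual decoder rule;
the name `strip_indentation` is hypothetical, and details such as escape
processing are left out.

```python
def strip_indentation(raw: bytes) -> bytes:
    """Remove the initial newline and the common leading indentation."""
    lines = raw.split(b"\n")
    # Drop the (blank) remainder of the line holding the opening delimiter.
    if len(lines) > 1 and not lines[0].strip(b" \t"):
        lines = lines[1:]
    # Blank lines don't count toward the common indentation, except the last
    # line, whose indentation is set by the position of the closing delimiter.
    significant = [ln for ln in lines[:-1] if ln.strip(b" \t")] + [lines[-1]]
    indent = min(len(ln) - len(ln.lstrip(b" \t")) for ln in significant)
    # Cut the common indentation and any trailing blanks from every line.
    return b"\n".join(ln[indent:].rstrip(b" \t") for ln in lines)
```

Applied to the string in the example above, the eleven columns marked with dots
are removed from every line, leaving the braces indented by two spaces and the
key by four, followed by a final newline.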

The only feature Zisp does not offer is a way to fence off multi-line strings
with a longer token such as `"""` as seen in Python and Java, or an arbitrary
word as seen in Bourne shell and PHP "here doc" syntax.

However, if a programmer truly wanted to have arbitrary text blocks in code,
without needing to escape anything in them, it's possible to abuse at-quoted
string syntax, using it with an ASCII control character which is displayed
visibly by a text editor.  In the following, the characters `^\` are meant to
represent a literal ASCII File Separator character in the source code:

    (define json-object #@^\
      {
        "key": "value"
      }
      ^\)

Hey, it works fine in Emacs, so why not?  Use `C-q C-\` to insert the `^\`.

This is indeed quite an eldritch syntax, but hopefully most programs would not
need to use it anyway.


## Syntax sugar

The parser recognizes various "syntax sugar" and transforms it into equivalent
datum constructions.  The most ubiquitous example of this is the list, which is
transformed into a chain of pairs, terminated with nil:

    (datum1 datum2 ...)  ->  (datum1 & (datum2 & (... & ())))

This is so ubiquitous as to be hardly considered "syntax sugar" but is counted
as such, since any list could just as well be written as a chain of pairs; both
would result in an equivalent datum when parsed.

The following table summarizes the other available transformations:

    [...]   -> (#SQUARE ...)          #datum       -> (#HASH & datum)

    {...}   -> (#BRACE ...)           #rune(...)   -> (#rune ...)

    'datum  -> (#QUOTE & datum)       dat1dat2     -> (#JOIN dat1 & dat2)

    `datum  -> (#GRAVE & datum)       dat1.dat2    -> (#DOT dat1 & dat2)

    ,datum  -> (#COMMA & datum)       dat1:dat2    -> (#COLON dat1 & dat2)

Notes:

* The terms datum, dat1, and dat2 each refer to an arbitrary datum; ellipsis
  means zero or more data.

* The `#datum` form only applies when the datum following the hash sign is
  anything other than a bare string, since otherwise this would be ambiguous
  with a rune literal.  A bare string can nevertheless follow the hash sign by
  separating the two with a backslash:

      #\string  ->  (#HASH & string)

* Though not represented in the table due to notational difficulty, the form
  `#rune(...)` doesn't require a list in the second position; any datum that
  works with the `#datum` syntax also works with `#rune<DATUM>`.

      #rune1#rune2  -> (#rune1 & #rune2)

      #rune\string  -> (#rune & string)

      #rune'string  -> (#rune #QUOTE & string)

      #rune"string" -> (#rune #DQSTR & |string|)

  As a counter-example, following a rune immediately with a bare string isn't
  possible without the delimiting backslash, since that would be ambiguous:

      #abcdefgh  ;Could be (#abcdef & gh) or (#abcde & fgh) or ...

* Syntax sugar can combine arbitrarily.  Some examples follow.  Any of these may
  or may not actually have a meaning in code; many could simply end up producing
  an error during decoding, or later evaluation of code.

      #{...}            -> (#HASH #BRACE ...)

      #'foo             -> (#HASH #QUOTE & foo)

      ##'[...]          -> (#HASH #HASH #QUOTE #SQUARE ...)

      {x y}[i j]        -> (#JOIN (#BRACE x y) #SQUARE i j)

      foo.bar.baz{x y}  -> (#JOIN (#DOT (#DOT foo & bar) & baz) #BRACE x y)

* While in Lisp and Scheme `'foo` parses as `(quote foo)`, in Zisp it parses as
  `(#QUOTE & foo)`; a single pair with the quoted datum in the second position.

  The same principle is used when parsing other sugar; some examples follow:

      Incorrect                              Correct

      #(x y z) -> (#HASH (x y z))            #(x y z) -> (#HASH x y z)

      [x y z]  -> (#SQUARE (x y z))          [x y z]  -> (#SQUARE x y z)

      #{x}     -> (#HASH (#BRACE (x)))       #{x}     -> (#HASH #BRACE x)

      foo(x y) -> (#JOIN foo (x y))          foo(x y) -> (#JOIN foo x y)

* Those used to thinking in Lisp and Scheme may think that `(#QUOTE ...)` halts
  further decoding of enclosed data.  This is not so, since quoting is related
  to code evaluation, not decoding.
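
The "Correct" column in the comparison above follows from sugar always building
a single pair of a rune and the following datum.  The Python sketch below models
pairs as 2-tuples and nil as the empty tuple (both assumptions of this sketch)
and prints the result in list notation; `cons`, `make_list`, and `show` are
hypothetical helpers.

```python
from typing import Any

NIL = ()  # stand-in for the Zisp nil value


def cons(first: Any, second: Any) -> tuple:
    """Build a pair, modelled here as a 2-tuple."""
    return (first, second)


def make_list(*items: Any) -> Any:
    """Build a nil-terminated chain of pairs."""
    result = NIL
    for item in reversed(items):
        result = cons(item, result)
    return result


def show(datum: Any) -> str:
    """Print a datum, using list notation wherever the pair chain allows it."""
    if datum == NIL:
        return "()"
    if not isinstance(datum, tuple):
        return str(datum)  # bare string or rune, modelled as a Python str
    parts = []
    while isinstance(datum, tuple) and datum != NIL:
        parts.append(show(datum[0]))
        datum = datum[1]
    if datum != NIL:  # improper tail: fall back to explicit pair notation
        parts += ["&", show(datum)]
    return "(" + " ".join(parts) + ")"
```

Pairing `#HASH` with the list `(x y z)` prints as `(#HASH x y z)`, because the
list simply becomes the tail of the new chain, whereas pairing `#QUOTE` with the
bare string `foo` prints as `(#QUOTE & foo)`.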

<!--
;; Local Variables:
;; fill-column: 80
;; End:
-->