Diffstat (limited to 'doc/c1/1-parse.md')
-rw-r--r-- doc/c1/1-parse.md | 53
1 files changed, 25 insertions, 28 deletions
diff --git a/doc/c1/1-parse.md b/doc/c1/1-parse.md
index 8932481..e396ca5 100644
--- a/doc/c1/1-parse.md
+++ b/doc/c1/1-parse.md
@@ -1,20 +1,20 @@
-# Parser for Code and Data
+# Parser for Code & Data
Zisp s-expressions represent an extremely minimal set of data types; only that
which is necessary to strategically construct more complex values:
- +-------+---------+--------+----------+
- | TYPE | String | Rune | Pair |
- +-------+---------+--------+----------+
- | E.G. | foobar | #name | (X & Y) |
- +-------+---------+--------+----------+
+ +-------+---------+--------+----------+------+
+ | TYPE | String | Rune | Pair | Nil |
+ +-------+---------+--------+----------+------+
+ | E.G. | foobar | #name | (X & Y) | () |
+ +-------+---------+--------+----------+------+
The parser also recognizes various *syntax sugar* which typically results in a
pair beginning with a specific rune. A separate component called the *decoder*
-transforms such data into a rich set of value types. See below for details.
+transforms such data into a rich set of value types.
-## Charset and Stream Handling
+## Character Encoding
The parser does not consume Unicode characters; it consumes bytes. Grammar is
generally constructed by bytes corresponding to ASCII characters.
@@ -41,7 +41,8 @@ All that needs to be done for this to work, is that any incidental occurrences
of the double-quote sign, and the backslash sign, are escaped with a backslash
within the `<BINARY>` data; all other bytes can appear verbatim in the strings.
-### Buffering
+
+## Stream Parsing
The parser can be repeatedly invoked on a byte stream to consume the next datum
within. This does not require "unreading" or back-seeking within the stream;
@@ -65,7 +66,7 @@ The "header" for each file in this stream is a Zisp s-expression containing
information about how many bytes should be read after the header, before the
next file header appears. (The header data need to be terminated with a blank
ASCII character such as a newline; the closing parenthesis does not act as a
-terminator unto itself due to the "join" syntax sugar; see later.)
+terminator unto itself due to the "join" syntax sugar.)
To enable this stream parsing strategy, the parser does not use any automatic
buffering. If it did, it might inadvertently consume some bytes beyond the
@@ -134,16 +135,12 @@ which contains bytes that could not appear in a *bare* string:
"foo bar" -> (#DQUOTE & <STRING>)
In this example, the visual token `<STRING>` represents the actual string value
-in program memory. It may seem contrived to refer to this as syntax sugar, but
-we are using the term uniformly for any situation in which the parser generates
-a pair with a rune in its first position, intended for the decoder to handle.
-
-Those familiar with Lisp and Scheme may expect *bare* strings to be parsed into
-a separate data type called a *symbol* but this does not exist in Zisp. Quoted
-strings instead parse into this internal representation to differentiate them
-from bare strings which may represent identifiers in code.
+in program memory, which has no direct external representation in bytes because
+it contains a space character.
-Other syntax sugar is explained further below.
+Those familiar with Lisp and Scheme may expect bare strings to be parsed into a
+separate type called *symbol* while quoted strings are parsed directly into a
+string type, but this is not the case in Zisp.
### Decoder
@@ -163,7 +160,7 @@ See the dedicated documentation of the [decoder](2-decode.html) for more.
## Data types
-Following is a more explanation of the four core data types constructed by the
+Following is a more in-depth explanation of each data type constructed by the
Zisp s-expression parser.
These are in fact value types, though the term "data type" is often used due to
@@ -173,8 +170,8 @@ is only a *datum* if it adheres to additional constraints as explained below.
### String
Strings can appear *bare* or be quoted in various ways. A quoted string is in
-fact parsed into a pair value (see below) with a rune in the first position to
-identify the quotation category, and the string value in the second position.
+fact parsed into a pair value with a rune in the first position to identify the
+quotation variant that was parsed, and the string value in the second position.
+-----------+----------------------+
| Syntax | Parse output |
@@ -223,7 +220,7 @@ letters when expressing syntax sugar. Uppercase rune names are reserved for
Zisp's internal use and standard library; users can use lowercase runes with
custom meaning without worrying about clashes, with the exception of a small
number of lowercase runes such as `#true` and `#false` that are part of the
-default decoder settings.
+default decoder settings and documented explicitly as such.
Runes are always stored directly in a CPU word; never by memory address.
@@ -237,7 +234,7 @@ and represents that pair through the memory address of the cell.
Pairs are valid data if one of the following holds true:
-* The pair encodes a quoted string, datum label, or shebang line. (See below.)
+* The pair encodes a quoted string, datum label, or shebang line.
* Both the first and second value in the pair is a valid datum.
@@ -320,7 +317,7 @@ valid ASCII byte; it can be absolutely any byte value, even NUL. This can be
useful to easily encode binary data which is known to not contain a specific
byte; an example would be C strings which cannot contain NUL.
-### Backslash escape sequences
+### Backslash escapes
In pipe-quoted and double-quoted strings, the following ASCII characters may
follow a backslash to insert a certain character.
@@ -380,8 +377,8 @@ Explanations:
again any number of blanks, is substituted with nothing. This is to allow
splitting a string into multiple lines for human readability.
- (define paragraph "This paragraph has been visually split into multiple \
- lines, but the newline is escaped, so it's one line.")
+ (define p "This paragraph has been visually split into multiple \
+ lines, but the newline is escaped, so it's one line.")
* An x, followed by pairs of hexadecimal digits (case insensitive), terminated
by a semicolon, is substituted with the sequence of bytes represented by the
@@ -472,7 +469,7 @@ This is indeed quite an eldritch syntax, but hopefully most programs would not
need to use it.
-## Syntax sugar
+## Other syntax
The following table summarizes commonly useful syntax abbreviations: