| author | Taylan Kammer <taylan.kammer@gmail.com> | 2026-06-02 23:56:10 +0200 |
|---|---|---|
| committer | Taylan Kammer <taylan.kammer@gmail.com> | 2026-06-02 23:56:10 +0200 |
| commit | dca76cd7955573cc537933c7beb93d2d9ee2b1d2 | |
| tree | 1f082c2f2d6036019b28a72d146709fbcc32cc0c /doc | |
| parent | af6f48ff079fc8067b564adeaa73caed8cbf5438 | |
More doc and style improvements.
Diffstat (limited to 'doc')
| -rw-r--r-- | doc/c1/1-parse.md | 53 |
1 files changed, 25 insertions, 28 deletions
diff --git a/doc/c1/1-parse.md b/doc/c1/1-parse.md index 8932481..e396ca5 100644 --- a/doc/c1/1-parse.md +++ b/doc/c1/1-parse.md @@ -1,20 +1,20 @@ -# Parser for Code and Data +# Parser for Code & Data Zisp s-expressions represent an extremely minimal set of data types; only that which is necessary to strategically construct more complex values: - +-------+---------+--------+----------+ - | TYPE | String | Rune | Pair | - +-------+---------+--------+----------+ - | E.G. | foobar | #name | (X & Y) | - +-------+---------+--------+----------+ + +-------+---------+--------+----------+------+ + | TYPE | String | Rune | Pair | Nil | + +-------+---------+--------+----------+------+ + | E.G. | foobar | #name | (X & Y) | () | + +-------+---------+--------+----------+------+ The parser also recognizes various *syntax sugar* which typically results in a pair beginning with a specific rune. A separate component called the *decoder* -transforms such data into a rich set of value types. See below for details. +transforms such data into a rich set of value types. -## Charset and Stream Handling +## Character Encoding The parser does not consume Unicode characters; it consumes bytes. Grammar is generally constructed by bytes corresponding to ASCII characters. @@ -41,7 +41,8 @@ All that needs to be done for this to work, is that any incidental occurrences of the double-quote sign, and the backslash sign, are escaped with a backslash within the `<BINARY>` data; all other bytes can appear verbatim in the strings. -### Buffering + +## Stream Parsing The parser can be repeatedly invoked on a byte stream to consume the next datum within. This does not require "unreading" or back-seeking within the stream; @@ -65,7 +66,7 @@ The "header" for each file in this stream is a Zisp s-expression containing information about how many bytes should be read after the header, before the next file header appears. 
(The header data need to be terminated with a blank ASCII character such as a newline; the closing parenthesis does not act as a -terminator unto itself due to the "join" syntax sugar; see later.) +terminator unto itself due to the "join" syntax sugar.) To enable this stream parsing strategy, the parser does not use any automatic buffering. If it did, it might inadvertently consume some bytes beyond the @@ -134,16 +135,12 @@ which contains bytes that could not appear in a *bare* string: "foo bar" -> (#DQUOTE & <STRING>) In this example, the visual token `<STRING>` represents the actual string value -in program memory. It may seem contrived to refer to this as syntax sugar, but -we are using the term uniformly for any situation in which the parser generates -a pair with a rune in its first position, intended for the decoder to handle. - -Those familiar with Lisp and Scheme may expect *bare* strings to be parsed into -a separate data type called a *symbol* but this does not exist in Zisp. Quoted -strings instead parse into this internal representation to differentiate them -from bare strings which may represent identifiers in code. +in program memory, which has no direct external representation in bytes because +it contains a space character. -Other syntax sugar is explained further below. +Those familiar with Lisp and Scheme may expect bare strings to be parsed into a +separate type called *symbol* while quoted strings are parsed directly into a +string type, but this is not the case in Zisp. ### Decoder @@ -163,7 +160,7 @@ See the dedicated documentation of the [decoder](2-decode.html) for more. ## Data types -Following is a more explanation of the four core data types constructed by the +Following is a more in-depth explanation of each data type constructed by the Zisp s-expression parser. These are in fact value types, though the term "data type" is often used due to @@ -173,8 +170,8 @@ is only a *datum* if it adheres to additional constraints as explained below. 
### String Strings can appear *bare* or be quoted in various ways. A quoted string is in -fact parsed into a pair value (see below) with a rune in the first position to -identify the quotation category, and the string value in the second position. +fact parsed into a pair value with a rune in the first position to identify the +quotation variant that was parsed, and the string value in the second position. +-----------+----------------------+ | Syntax | Parse output | @@ -223,7 +220,7 @@ letters when expressing syntax sugar. Uppercase rune names are reserved for Zisp's internal use and standard library; users can use lowercase runes with custom meaning without worrying about clashes, with the exception of a small number of lowercase runes such as `#true` and `#false` that are part of the -default decoder settings. +default decoder settings and documented explicitly as such. Runes are always stored directly in a CPU word; never by memory address. @@ -237,7 +234,7 @@ and represents that pair through the memory address of the cell. Pairs are valid data if one of the following holds true: -* The pair encodes a quoted string, datum label, or shebang line. (See below.) +* The pair encodes a quoted string, datum label, or shebang line. * Both the first and second value in the pair is a valid datum. @@ -320,7 +317,7 @@ valid ASCII byte; it can be absolutely any byte value, even NUL. This can be useful to easily encode binary data which is known to not contain a specific byte; an example would be C strings which cannot contain NUL. -### Backslash escape sequences +### Backslash escapes In pipe-quoted and double-quoted strings, the following ASCII characters may follow a backslash to insert a certain character. @@ -380,8 +377,8 @@ Explanations: again any number of blanks, is substituted with nothing. This is to allow splitting a string into multiple lines for human readability. 
- (define paragraph "This paragraph has been visually split into multiple \ - lines, but the newline is escaped, so it's one line.") + (define p "This paragraph has been visually split into multiple \ + lines, but the newline is escaped, so it's one line.") * An x, followed by pairs of hexadecimal digits (case insensitive), terminated by a semicolon, is substituted with the sequence of bytes represented by the @@ -472,7 +469,7 @@ This is indeed quite an eldritch syntax, but hopefully most programs would not need to use it. -## Syntax sugar +## Other syntax The following table summarizes commonly useful syntax abbreviations: |
