Pycnolog: Syntax
================

Encoding
--------

Most programming languages use either Unicode encoded as UTF-8, or a codepage consisting of up to 256 distinct characters (in which each character is assigned an 8-bit codepoint). However, Pycnolog uses only 64 distinct characters, each of which has a 6-bit codepoint; Pycnolog characters are thus only 6 bits wide.

In order to store these 6-bit characters on an 8-bit disk, "inverse base 64 encoding" is used; base 64 encoding is equivalent to Pycnolog decoding, and vice versa. For example, the byte sequence formed by encoding the text `abc` in ASCII is, in base 64, written as `YWJj`; thus, the Pycnolog program `YWJj` has the same encoding as `abc` would in ASCII.

Pycnolog's character set is (intentionally) identical to the most commonly used character set for base 64, and uses the same codepoints:

* `A` to `Z`: 0 to 25 respectively
* `a` to `z`: 26 to 51 respectively
* `0` to `9`: 52 to 61 respectively
* `+`: 62
* `/`: 63

(Note that in some cases where Pycnolog uses codepoints, 0 is not a valid codepoint, and 64 is. In this situation, `A` is treated as being codepoint 64, with the other characters unchanged.)

One thing to note is that sometimes padding is needed at the end of a program to pad it up to a full byte (as not all multiples of 6 are divisible by 8). At most 6 bits of padding are necessary, and these should be a (partial or full) `F` character. (It isn't useful to end a Pycnolog program with `F`, so the extra character can be recognised as padding and removed before running the program.)

Alphabet
--------

Pycnolog's 64 characters are broken down into four groups:

* *Uppercase letters*, `A` to `Z`
* *Lowercase letters*, `a` to `z`
* *Digits*, `0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `+`
* and the *punctuation mark* `/`.

Digits are, unsurprisingly, used to write numbers. Pycnolog generally uses base 11, thus 11 digits are needed; `+` is the digit with the value 10.
Enigma names are also formed of strings of digits, despite not being numbers.

Most letters refer to commands. In most situations, the uppercase and lowercase versions of a letter are conceptually similar commands, for which the only difference is the types of argument they can take. However, sometimes certain types of argument make no sense for certain commands; thus, some letters can refer to multiple different commands, with the argument used to distinguish between them.

`y`/`Y` (and to some extent `F`/`R`) are exceptions. Although `F` and `R` normally start commands, they're also used for special-case syntax around the edge of stanzas. `y`/`Y` are not commands in their own right, but rather modifiers to commands, written immediately before the command to change the way it takes its arguments.

The punctuation mark is used to mark where blocks end (in cases where automatic block construction would end the block in the wrong place). (The places where blocks would start are automatically determined by looking for commands that require a block argument; the block starts immediately after the command.) The punctuation mark is also used to mark the start of a base 64 literal, and as part of the `F/` digraph that marks the boundary between stanzas.

Commands and arguments
----------------------

There are five main patterns that commands can take:

* No arguments: the command is a lowercase letter, and is not followed by a digit.
* Constant argument: the command is an uppercase letter, and is followed by a number, specified either in base 11 (using one or more digits), or in base 64 (starting with a punctuation mark).
* Enigma argument: the command is a lowercase letter and is followed by an enigma name. (Enigma names are strings of digits, but are not interpreted as numbers, e.g. `01` and `1` are different enigmas.)
* Block argument: the command is an uppercase letter, and is followed by a letter. That letter will be the first command inside the block.
* Block and constant: this is similar to giving the command a block argument, but the closing punctuation mark on the block must be given explicitly. A base 11 number is specified immediately after the closing punctuation mark.

In addition to its argument (if any), a command is connected to the commands to its left and its right; each command takes a value (the *input*) from the command to its left and gives a value (the *output*) to the command to its right. (So one way to think about this is that commands are mutating an "accumulator" or similar temporary value.) This is why commands typically only need to be able to take one argument; binary operations like "add" (`A`) will add their argument to their input (taken from the command to the left), and give the resulting output to the command to their right; so two numbers are being added despite the fact that the addition only had one *argument*.

In a few cases, a combination of two commands will obviously be redundant, e.g. `UL2/` should mean "the input equals the output and assert that the output has length 2", but that's clearly equivalent to `L2`, "the output equals the input and assert that the input has length 2". Such "redundant combinations" of commands may thus be assigned entirely different meanings to increase the density of the language's encoding, either now or in the future.

`y`/`Y` are not commands, but rather command modifiers. These parse much the same way as commands themselves, e.g. `Y` followed by a digit is parsed as "a `Y` modifier taking an integer argument". However, they're merged together with the following command into a single command; for example, `e` indexes lists, and `Ye` effectively does multidimensional indexing on a list, whereas `y0e` will index each list in a list of lists via taking the index from the corresponding element of enigma 0. When `Y` is given a block as its argument, it modifies the first command inside the block (not the command that comes after the block).
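
As a rough illustration of the argument patterns listed above, the following Python sketch guesses a command's pattern from the single character that follows it. The name `classify` and its return strings are invented for this example and are not part of Pycnolog. The sketch deliberately ignores the `F/` digraph, comment stanzas, letters at the end of a stanza, and the merging of `y`/`Y` modifiers with the following command. It also cannot tell "block" from "block and constant", since that depends on what follows the block's closing punctuation mark.

```python
DIGITS = set("0123456789+")  # the 11 Pycnolog digits


def classify(command: str, following: str) -> str:
    """Guess the argument pattern of `command` from the next character.

    Both parameters must be single characters; `command` must be a letter.
    This is a simplified sketch, not a full Pycnolog parser.
    """
    if len(command) != 1 or not command.isascii() or not command.isalpha():
        raise ValueError("command must be a single ASCII letter")
    if len(following) != 1:
        raise ValueError("end-of-stanza letters are outside this sketch")

    if command.islower():
        # Lowercase: enigma argument if a digit follows, otherwise none.
        return "enigma" if following in DIGITS else "none"
    if following in DIGITS:
        return "constant (base 11)"
    if following == "/":
        return "constant (base 64)"
    # Uppercase followed by a letter: that letter starts a block.
    return "block"


if __name__ == "__main__":
    for pair in ("hh", "e0", "A1", "A/", "EF"):
        print(pair, "->", classify(pair[0], pair[1]))
```

Looking only one character ahead is enough here because each pattern in the list is determined by the case of the letter and the kind of character that follows it.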

The above explanation gives almost all possible patterns for commands, but there's one remaining possibility: if a letter appears at the end of a stanza (most commonly, at the end of a program). With a lowercase letter at the end of a stanza, this is just the no-argument command, as you might expect. With an uppercase letter at the end of the stanza, though, this is equivalent to enclosing the entire rest of the stanza in a block and giving it as an argument to that command. So for example, the (full program) `A1X2E` is equivalent to `EA1X2/`. Note that this should only be done in cases where the same effect cannot be achieved with automatic block construction (as "redundant" uses of this mechanism may be given a different meaning in the future); `EA1X2` would become `EA1/X2` after automatic block construction, so `A1X2E` is acceptable, but `A1E` is not a valid program, as it has the same meaning as `EA1` (both become `EA1/` after the missing punctuation mark is inserted). The mechanism can be used with multiple characters: `A1X2ES` is equivalent to `ESA1X2//` (which in turn could be written as `ESA1X2/`, but not `ESA1X2`, so the transformation is allowed).

If a stanza starts with a slash or digit, which would otherwise be syntactically incorrect, a leading `R` is assumed (i.e. the program starts with the given number rather than with the input that's provided to it).

Blocks and Scope
----------------

Pycnolog is a block-structured language: commands can take blocks as arguments, and a block has a similar structure to the program as a whole. Blocks can be used either simply to show what a command applies to, or as miniature self-contained programs of their own which are isolated from the program outside them. These two uses for a block are distinguished by *scope*: some blocks create a new scope, some don't.

If a block doesn't create a scope, it's treated as though it were inlined into the containing block; the enigmas inside the block are the same as those in the containing block (exception: enigmas `8` and `9`), nondeterminism in the containing block will extend into the interior block, and so on.

If a block does create a scope, then it has an entirely separate set of enigmas (except for enigmas like `+` which are defined in terms of those in the containing scope), and even if the containing block is nondeterministic, the separate scope it contains will get a deterministic input (effectively, a separate copy of the scope is run for each possible input it could have).

In general, a block that's the argument to a command that loops (or otherwise can run the same block multiple times) is a new scope of its own; in most other cases, no new scope is created. (The exception is `F`, which always creates a scope, as that's one of its jobs.)

Blocks cannot be empty, nor can they end with a `V` command that takes a non-block argument. (Blocks which are the argument to `F` have an additional restriction: they cannot consist of a single command, unless that command itself takes an argument.)

Stanzas
-------

There's one level of structure beyond that of the block. A *stanza* is a self-contained piece of code that's mostly independent of other stanzas; although stanzas can refer to each other, they can't refer to blocks contained inside other stanzas, or the like. When a program is run as a full program (rather than just being treated as a collection of stanzas for other programs to reference), the last stanza in the program is the one that will actually run when the program starts; other stanzas will not run unless run explicitly via the `F` command.

Stanzas are separated by the `F/` digraph, anywhere but raw base 64 data.
(This overrides the normal meaning, which would be an `F` command that takes base 64 data as an argument; it's highly unlikely that a program would contain enough blocks to make base 64 notation efficient in referencing them!)

If the first two characters of a stanza are `R/` (i.e. either the program starts with `R/`, or stanzas are separated via `F/R/`), the stanza is a *comment stanza* and is entirely ignored, not even being parsed (except to look for `F/` digraphs which would end the stanza and start a new one). This rule, combined with the previous one, means that stanzas and base 64 literals have to be parsed at the same time, reading the program from left to right (as `F/` in a base 64 literal won't end a stanza, but, e.g., `A/` in a comment stanza won't start a base 64 literal even though it normally would).

Automatic Block Construction
----------------------------

In cases where a stanza has more start-of-block sequences (i.e. an uppercase letter followed by any letter) than end-of-block sequences (i.e. a punctuation mark that's not immediately preceded by an uppercase letter), extra punctuation marks are automatically inserted into the stanza to balance the blocks. Here's an example of such an unbalanced stanza (forming a full program):

    EFhhEA1f/

The block construction algorithm starts by identifying all the positions where blocks start, and all the positions where the end of a block is explicitly marked:

    E{F{hhE{A1f}
    ^ ^ unmatched

Then, after each *unmatched* start-of-block location (scanning right to left), a punctuation mark is inserted as far to the left as possible to match it, creating a block. This is subject to the rules that blocks can't be empty, that blocks can't end with `y`, that blocks must nest properly, and that arguments to `F` cannot consist of a single no-argument command:

    E{F{hh}}E{A1f}, i.e.
    EFhh//EA1f/

(Note here that the closing punctuation mark for the initial `E` appeared after that for the `F` following it, as that's the furthest left it could be placed to balance the structure.)

Block numbering
---------------

From the point of view of each stanza, the blocks making up that stanza, together with each non-comment stanza of the program, are assigned non-negative numbers. The numbering is calculated as follows:

* The very lowest numbers belong to (non-comment) stanzas that appear earlier in the program.
* Blocks that are the argument to `F` get the next-lowest numbers.
* After that comes the stanza itself.
* Blocks that exist explicitly within the stanza (but are not the argument to `F`) have the next-lowest numbers.
* Stanzas that appear later in the program have the second-highest numbers.
* Automatically constructed blocks come last. (The virtual block or blocks created to wrap the rest of the stanza when the stanza ends with a capital letter count as automatically constructed blocks for this purpose, and count as starting at the start of the stanza.)
* Within a category of blocks in the above list, blocks which start further to the left within the stanza get lower numbers.
* Stanzas follow a different rule: stanzas that appear earlier are numbered in reverse order of their position within the program (i.e. the previous stanza gets the number 0, the stanza before that the number 1, and so on), whereas the stanzas that appear later in the program are numbered in forwards order of their position in the program (so the next stanza gets the lowest number out of these).

These numbers are used by the `F`-number and `f` commands to identify which blocks/stanzas should be run.

Numbers
-------

### Base 11

Numbers from 0 to 1459 inclusive are written in base 11, as a string of digits:

* From 0 to 10, a single digit is enough.
* From 11 to 128, 11 is subtracted from the number, and the resulting number written as two digits (thus `00` is 0+11=11, `+7` is 117+11=128).
* From 129 to 1459, 129 is subtracted from the number, and the resulting number is written as three digits (thus `000` is 0+129=129, `+++` is 1330+129=1459).

The sequences `+8`, `+9`, and `++` do not fall into this main sequence, and are used as shortcut encodings for the numbers 256, 1000, and 1000000 respectively. (The redundant sequences `106` for 256 and `722` for 1000 are reserved for future expansion and must not be used.)

In most contexts, numbers from 1460 upwards would be written in base 64, so base 11 numbers usually cap out at three digits. There are two cases, however, where base 64 syntax for numbers is not allowed: when giving an integer argument to the `F` command (`F/` would be interpreted as a stanza break), or when a command takes both an integer and a block as argument (`/` after the block ends would be interpreted as ending an additional block). In these circumstances, base 11 numbers can be extended above three digits: 4 digits gives 14641 possibilities ranging from 1460 to 16100, 5 digits gives 161051 possibilities ranging from 16101 to 177151, and so on.

### Base 64

Numbers from 1460 upwards are normally written in base 64 notation. These all start with a `/` and contain at least two base 64 digits, which are interpreted as bijective base 64 (i.e. `A` represents not 0, but a digit with value 64). 1395 is subtracted from the integer before encoding it (because the smallest number that uses two digits in bijective base 64 is 65).

The simplest case consists of *just* the leading `/` and two base 64 digits. These three-character base 64 constants range from `/BB` (i.e. (1×64+1)+1395=1460) to `/AA` (i.e. (64×64+64)+1395=5555).
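
The arithmetic for the short forms can be checked with a small Python sketch. The function names `encode_base11` and `encode_base64_short` are made up for this illustration. The sketch covers only base 11 constants of up to three digits (plus the three shortcuts) and three-character base 64 constants; the extended base 11 forms and linking digits are left out.

```python
DIGITS = "0123456789+"  # base 11 digits; `+` has the value 10
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)
SHORTCUTS = {256: "+8", 1000: "+9", 1_000_000: "++"}


def _fixed_width_base11(value: int, width: int) -> str:
    """Write `value` in base 11 using exactly `width` digits."""
    out = ""
    for _ in range(width):
        value, digit = divmod(value, 11)
        out = DIGITS[digit] + out
    return out


def encode_base11(n: int) -> str:
    """Encode n as a one-, two- or three-digit base 11 constant."""
    if n in SHORTCUTS:  # shortcuts replace the redundant main-sequence forms
        return SHORTCUTS[n]
    if 0 <= n <= 10:
        return DIGITS[n]
    if 11 <= n <= 128:
        return _fixed_width_base11(n - 11, 2)
    if 129 <= n <= 1459:
        return _fixed_width_base11(n - 129, 3)
    raise ValueError("outside the range covered by this sketch")


def encode_base64_short(n: int) -> str:
    """Encode 1460..5555 as `/` plus two bijective base 64 digits."""
    if not 1460 <= n <= 5555:
        raise ValueError("needs linking digits or base 11")
    m = n - 1395
    low = m % 64 or 64           # bijective digits run from 1 to 64
    high = (m - low) // 64
    # A digit of value 64 is written `A`, which sits at index 0.
    return "/" + ALPHABET[high % 64] + ALPHABET[low % 64]


if __name__ == "__main__":
    for n in (0, 10, 11, 128, 129, 256, 1000, 1459):
        print(n, encode_base11(n))
    for n in (1460, 5555):
        print(n, encode_base64_short(n))
```

Taking the index modulo 64 is what maps the bijective digit value 64 onto `A`, matching the note in the Encoding section that `A` stands for 64 where 0 is not a valid codepoint.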

To write constants that need more than two base 64 digits, you use *linking digits*; first you write a `/` and two base 64 digits, then you write a (base 11) digit specifying how many more base 64 digits follow (`0` here means an extra 11 digits follow, leading to a total of 13 base 64 digits plus the leading `/`). If that doesn't give enough digits, you can then follow this with an extra linking digit, whose effect is 11 times stronger. (The first linking digit thus has to be chosen so that the second linking digit will be able to produce a number of the right length. For example, if you have a total of 67 base 64 digits in your number, the linking digits would need to be `+` and `5` to give a total of 2+10+(5×11) base 64 digits; the notation would therefore be `/`, the first 2 digits, `+`, the next 10 digits, `5`, and the last 55 digits.) A third linking digit is multiplied by 121 to give the number of base 64 digits that follow it, a fourth by 1331, and so on.
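
The choice of linking digits can be sketched in Python as follows. The names `linking_plan` and `write_constant` are hypothetical. The sketch also rests on one assumption not stated in the text above: that `0` stands for 11 at every linking level, not just the first. Under that assumption, the linking digits are simply the bijective base 11 representation of the number of base 64 digits beyond the first two, least significant digit first.

```python
DIGITS = "0123456789+"  # base 11 digits; `+` has the value 10


def linking_plan(total_digits: int) -> list[tuple[str, int]]:
    """Return (linking digit, number of base 64 digits after it) pairs
    for a constant with `total_digits` base 64 digits in total."""
    if total_digits < 2:
        raise ValueError("a base 64 constant has at least two digits")
    remaining = total_digits - 2  # the first two digits need no link
    weight = 1                    # 1, 11, 121, ... digits per unit
    plan = []
    while remaining > 0:
        # ASSUMPTION: `0` means 11 at every level (stated only for the first).
        units = remaining % 11 or 11
        plan.append((DIGITS[units % 11], units * weight))
        remaining = (remaining - units) // 11
        weight *= 11
    return plan


def write_constant(base64_digits: str) -> str:
    """Interleave already-computed base 64 digits with linking digits."""
    out = "/" + base64_digits[:2]
    pos = 2
    for link, run in linking_plan(len(base64_digits)):
        out += link + base64_digits[pos:pos + run]
        pos += run
    return out


if __name__ == "__main__":
    print(linking_plan(13))  # one linking digit, `0`, covering 11 digits
    print(linking_plan(67))  # the 67-digit example from the text
```

For the 67-digit example, the plan is `+` followed by 10 digits and then `5` followed by 55 digits, which agrees with the layout described in the text.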