Pycnolog: Syntax
================

Encoding
--------

Most programming languages use either Unicode encoded as UTF-8, or a
codepage consisting of up to 256 distinct characters (in which each
character is assigned an 8-bit codepoint).  However, Pycnolog uses
only 64 distinct characters, each of which has a 6-bit codepoint;
Pycnolog characters are thus only 6 bits wide.

In order to store these 6-bit characters on an 8-bit disk, "inverse
base 64 encoding" is used; base 64 encoding is equivalent to Pycnolog
decoding, and vice versa.  For example, the byte sequence formed by
encoding the text `abc` in ASCII is, in base 64, written as `YWJj`;
thus, the Pycnolog program `YWJj` has the same encoding as `abc` would
in ASCII.  Pycnolog's character set is (intentionally) identical to
the most commonly used character set for base 64, and uses the same
codepoints:

  * `A` to `Z`: 0 to 25 respectively
  * `a` to `z`: 26 to 51 respectively
  * `0` to `9`: 52 to 61 respectively
  * `+`: 62
  * `/`: 63

(Note that in some cases where Pycnolog uses codepoints, 0 is not a
valid codepoint, and 64 is.  In this situation, `A` is treated as
being codepoint 64, with the other characters unchanged.)
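
As a rough illustration, the codepoint table above can be written out
in Python.  The names `ALPHABET` and `codepoint`, and the `bijective`
flag, are invented for this sketch and are not part of Pycnolog
itself.

```python
import string

# The 64 characters in codepoint order: A-Z, a-z, 0-9, +, /
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def codepoint(ch, bijective=False):
    """Return the codepoint of a single Pycnolog character.

    With bijective=True, `A` counts as 64 rather than 0, which models
    the contexts where 0 is not a valid codepoint.
    """
    value = ALPHABET.index(ch)  # raises ValueError for other characters
    if bijective and value == 0:
        return 64
    return value
```

For instance, `codepoint("a")` gives 26, and `codepoint("A",
bijective=True)` gives 64.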

Padding is sometimes needed at the end of a program to bring it up to
a whole number of bytes (as not all multiples of 6 are divisible by
8).  At most 6 bits of padding are necessary, and these should be a
(partial or full) `F` character.  (It isn't useful to end a Pycnolog
program with `F`, so the extra character can be recognised as padding
and removed before running the program.)
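
A minimal sketch of the packing step might look like the following.
It assumes that bits are packed most-significant-bit first (as in
standard base 64) and that a "partial `F`" means the leading bits of
`F`'s codepoint; neither detail is spelled out above, and the function
name `pack` is invented for the example.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def pack(program):
    """Pack a string of Pycnolog characters into bytes, 6 bits per character."""
    bits = "".join(format(ALPHABET.index(ch), "06b") for ch in program)
    # Assumption: padding is taken from the leading bits of `F` (000101).
    pad = -len(bits) % 8  # 0, 2, 4 or 6 bits
    bits += format(ALPHABET.index("F"), "06b")[:pad]
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

With these assumptions, `pack("YWJj")` produces the same three bytes
as the ASCII text `abc`, matching the example given earlier.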


Alphabet
--------

Pycnolog's 64 characters are broken down into four groups:

  * *Uppercase letters*, `A` to `Z`
  * *Lowercase letters*, `a` to `z`
  * *Digits*, `0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`, `+`
  * and the *punctuation mark* `/`.

Digits are, unsurprisingly, used to write numbers.  Pycnolog generally
uses base 11, thus 11 digits are needed; `+` is the digit with the
value 10.  Enigma names are also formed of strings of digits, despite
not being numbers.

Most letters refer to commands.  In most situations, the uppercase and
lowercase versions of a letter are conceptually similar commands, for
which the only difference is the types of argument they can take.
However, sometimes certain types of argument make no sense for certain
commands; thus, some letters can refer to multiple different commands,
with the argument used to distinguish between them.

`y`/`Y` (and to some extent `F`/`R`) are exceptions.  Although `F` and
`R` normally start commands, they're also used for special-case syntax
around the edge of stanzas.  `y`/`Y` are not commands in their own
right, but rather modifiers to commands, written immediately before
the command to change the way it takes its arguments.

The punctuation mark is used to mark where blocks end (in cases where
automatic block construction would place the block in the wrong
place).  (The places where blocks would start are automatically
determined by looking for commands that require a block argument; the
block starts immediately after the command.)

The punctuation mark is also used to mark the start of a base 64
literal, and as part of the `F/` digraph that marks the boundary
between stanzas.


Commands and arguments
----------------------

There are five main patterns that commands can take:

  * No arguments: the command is a lowercase letter, and is not
    followed by a digit.
  * Constant argument: the command is an uppercase letter, and is
    followed by a number, specified either in base 11 (using one or more
    digits), or in base 64 (starting with a punctuation mark).
  * Enigma argument: the command is a lowercase letter and is followed
    by an enigma name.  (Enigma names are strings of digits, but are
    not interpreted as numbers, e.g. `01` and `1` are different
    enigmas.)
  * Block argument: the command is an uppercase letter, and is
    followed by a letter.  That letter will be the first command
    inside the block.
  * Block and constant: this is similar to giving the command a block
    argument, but the closing punctuation mark on the block must be
    given explicitly.  A base 11 number is specified immediately after
    the closing punctuation mark.
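
The first four patterns can be told apart by looking at a letter and
the character that follows it.  The sketch below shows this; the
function name and the returned labels are invented for illustration,
and it deliberately reports "other" for anything the list above does
not cover (such as the `F/` digraph, the `y`/`Y` modifiers, or an
uppercase letter with nothing after it).

```python
DIGITS = "0123456789+"  # the 11 base 11 digits


def classify(letter, following):
    """Name the pattern formed by `letter` and the next character.

    `following` is the empty string if nothing comes after the letter.
    """
    if letter in "yY" or letter + following == "F/":
        return "other"
    if letter.islower():
        if following != "" and following in DIGITS:
            return "enigma argument"
        return "no arguments"
    if following != "" and (following in DIGITS or following == "/"):
        return "constant argument"
    if following.isalpha():
        # Could also be "block and constant"; that depends on what
        # follows the block's closing punctuation mark.
        return "block argument"
    return "other"
```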

In addition to its argument (if any), a command is connected to the
commands to its left and its right; each command takes a value (the
*input*) from the command to its left and gives a value (the *output*)
to the command to its right.  (So one way to think about this is that
commands are mutating an "accumulator" or similar temporary value.)
This is why commands typically only need to be able to take one
argument; binary operations like "add" (`A`) will add their argument
to their input (taken from the command to the left), and give the
resulting output to the command to their right; so two numbers are
being added despite the fact that the addition only had one
*argument*.
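
The data flow can be modelled loosely in Python by treating each
command as a function from input to output.  This is only an analogy:
it ignores assertions, nondeterminism and enigmas, and the helper
names `add` and `run` are invented for the sketch.

```python
from functools import reduce


def add(argument):
    """Model of an `A`-style command: adds its one argument to its input."""
    return lambda value: value + argument


def run(commands, value):
    """Feed `value` through the commands from left to right."""
    return reduce(lambda accumulator, command: command(accumulator), commands, value)


# Two additions in a row, starting from the input 5.
result = run([add(1), add(2)], 5)
```

Here `result` is 8: each command saw only one explicit argument, with
the other operand arriving from its left.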

In a few cases, a combination of two commands will obviously be
redundant, e.g. `UL2/` should mean "the input equals the output and
assert that the output has length 2", but that's clearly equivalent to
`L2`, "the output equals the input and assert that the input has
length 2".  Such "redundant combinations" of commands may thus be
assigned entirely different meanings to increase the density of the
language's encoding, either now or in the future.

`y`/`Y` are not commands, but rather command modifiers.  These parse
much the same way as commands themselves, e.g. `Y` followed by a digit
is parsed as "a `Y` modifier taking an integer argument".  However,
they're merged together with the following command into a single
command; for example, `e` indexes lists, and `Ye` effectively does
multidimensional indexing on a list, whereas `y0e` will index each
list in a list of lists via taking the index from the corresponding
element of enigma 0.  When `Y` is given a block as its argument, it
modifies the first command inside the block (not the command that
comes after the block).

The above explanation gives almost all possible patterns for commands,
but there's one remaining possibility: a letter appearing at the very
end of a stanza (most commonly, at the end of a program).  With a
lowercase letter at the end of a stanza, this is just the no-argument
command, as you might expect.  With an uppercase letter at the end of
the stanza, though, this is equivalent to enclosing the entire rest of
the stanza in a block and giving it as an argument to that command.
So for example, the (full program) `A1X2E` is equivalent to `EA1X2/`.
Note that this should only be done in cases where the same effect
cannot be achieved with automatic block construction (as "redundant"
uses of this mechanism may be given a different meaning in the
future); `EA1X2` would become `EA1/X2` after automatic block
construction, so `A1X2E` is acceptable, but `A1E` is not a valid
program, as it has the same meaning as `EA1` (both become `EA1/` after
the missing punctuation mark is inserted).  The mechanism can be used
with multiple characters: `A1X2ES` is equivalent to `ESA1X2//` (which
in turn could be written as `ESA1X2/`, but not `ESA1X2`, so the
transformation is allowed).

If a stanza starts with a slash or digit, which would otherwise be
syntactically incorrect, a leading `R` is assumed (i.e. the program
starts with the given number rather than with the input that's
provided to it).
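
The two stanza-edge rules above amount to simple textual rewrites,
sketched below.  Both function names are invented; the first assumes
the stanza does not end inside a base 64 literal, and it does not
check whether the rewrite is one of the "redundant" uses that are
disallowed.

```python
import string


def expand_trailing_uppercase(stanza):
    """Move trailing uppercase letters to the front, wrapping the rest in blocks."""
    body = stanza.rstrip(string.ascii_uppercase)
    tail = stanza[len(body):]
    if not tail or not body:
        return stanza  # nothing to wrap, or nothing to wrap it around
    return tail + body + "/" * len(tail)


def add_implied_r(stanza):
    """Prefix `R` when the stanza starts with a slash or a digit."""
    if stanza and stanza[0] in "/0123456789+":
        return "R" + stanza
    return stanza
```

For example, `expand_trailing_uppercase("A1X2ES")` gives `ESA1X2//`,
and `add_implied_r("1A2")` gives `R1A2`.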


Blocks and Scope
----------------

Pycnolog is a block-structured language: commands can take blocks as
arguments, and a block has a similar structure to the program as a
whole.  Blocks can be used either simply to show what a command
applies to, or as miniature self-contained programs of their own which
are isolated from the program outside them.

These two uses for a block are distinguished by *scope*: some blocks
create a new scope, some don't.  If a block doesn't create a scope,
it's treated as though it were inlined into the containing block; the
enigmas inside the block are the same as those in the containing block
(exception: enigmas `8` and `9`), nondeterminism in the containing
block will extend into the interior block, and so on.  If a block does
create a scope, then it has an entirely separate set of enigmas
(except for enigmas like `+` which are defined in terms of those in
the containing scope), and even if the containing block is
nondeterministic, the separate scope it contains will get a
deterministic input (effectively, a separate copy of the scope is run
for each possible input it could have).

In general, a block that's the argument to a command that loops (or
otherwise can run the same block multiple times) is a new scope of its
own; in most other cases, no new scope is created.  (The exception is
`F`, which always creates a scope, as that's one of its jobs.)

Blocks cannot be empty, nor can they end with a `y`/`Y` modifier that takes
a non-block argument.  (Blocks which are the argument to `F` have an
additional restriction: they cannot consist of a single command,
unless that command itself takes an argument.)


Stanzas
-------

There's one level of structure beyond that of the block.  A *stanza*
is a self-contained piece of code that's mostly independent of other
stanzas; although stanzas can refer to each other, they can't refer to
blocks contained inside other stanzas, or the like.  When a program is
run as a full program (rather than just being treated as a collection
of stanzas for other programs to reference), the last stanza in the
program is the one that will actually run when the program starts;
other stanzas will not run unless run explicitly via the `F` command.

Stanzas are separated by the `F/` digraph, anywhere but raw base 64
data.  (This overrides the normal meaning, which would be an `F`
command that takes base 64 data as an argument; it's highly unlikely
that a program would contain enough blocks to make base 64 notation
efficient in referencing them!)

If the first two characters of a stanza are `R/` (i.e. either the
program starts with `R/`, or stanzas are separated via `F/R/`), the
stanza is a *comment stanza* and is entirely ignored, not even being
parsed (except to look for `F/` digraphs which would end the stanza
and start a new one).  This rule, combined with the previous one, means
that stanzas and base 64 literals have to be parsed at the same time,
reading the program from left to right (as `F/` in a base 64 literal
won't end a stanza, but, e.g., `A/` in a comment stanza won't start a
base 64 literal even though it normally would).
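
A left-to-right scan along these lines could be sketched as follows.
The function names are invented, the literal length is worked out
using the linking-digit rule from the section on numbers, and the
treatment of a `0` linking digit beyond the first is an assumption.

```python
DIGITS = "0123456789+"  # base 11 digits, which double as linking digits


def literal_end(src, slash):
    """Return the index just past the base 64 literal whose `/` is at `slash`."""
    pos = slash + 3  # the `/` plus two base 64 digits
    weight = 1
    while pos < len(src) and src[pos] in DIGITS:
        # Assumption: a linking digit `0` stands for 11 at every level.
        count = (DIGITS.index(src[pos]) or 11) * weight
        pos += 1 + count
        weight *= 11
    return min(pos, len(src))


def split_stanzas(src):
    """Split a program into stanzas (comment stanzas are kept in the result)."""
    stanzas, start, pos = [], 0, 0
    comment = src.startswith("R/")
    while pos < len(src):
        if src.startswith("F/", pos):
            stanzas.append(src[start:pos])
            start = pos = pos + 2
            comment = src.startswith("R/", start)
        elif comment:
            pos += 1  # comment stanzas are only searched for `F/`
        elif src[pos] == "/" and pos == start:
            pos = literal_end(src, pos)  # literal after an implied `R`
        elif "A" <= src[pos] <= "Z" and src.startswith("/", pos + 1):
            pos = literal_end(src, pos + 1)  # skip over raw base 64 data
        else:
            pos += 1
    stanzas.append(src[start:])
    return stanzas
```

Under these assumptions, `split_stanzas("A/F/aF/b")` yields `A/F/a`
and `b` (the first `F/` is raw base 64 data), whereas
`split_stanzas("R/A/F/b")` yields the comment stanza `R/A/` and `b`.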


Automatic Block Construction
----------------------------

In cases where a stanza has more start-of-block sequences (i.e. an
uppercase letter followed by any letter) than end-of-block sequences
(i.e. a punctuation mark that's not immediately preceded by an
uppercase letter), extra punctuation marks are automatically inserted
into the stanza to balance the blocks.  Here's an example of such an
unbalanced stanza (forming a full program):

    EFhhEA1f/

The block construction algorithm starts by identifying all the
positions where blocks start, and all the positions where the end of a
block is explicitly marked:

    E{F{hhE{A1f}
     ^ ^         unmatched

Then, after each *unmatched* start-of-block location (scanning right
to left), a punctuation mark is inserted as far to the left as
possible to match it, creating a block.  This is subject to the rules
that blocks can't be empty, that blocks can't end with `y`, that
blocks must nest properly, and that arguments to `F` cannot consist of
a single no-argument command:

    E{F{hh}}E{A1f}, i.e. EFhh//EA1f/

(Note here that the closing punctuation mark for the initial `E`
appeared after that for the `F` following it, as that's the furthest
left it could be placed to balance the structure.)
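
The first step of the algorithm (finding the unmatched starts) can be
sketched in Python as below.  This covers only the identification
step, not the placement of the inserted punctuation marks; it assumes
the stanza contains no base 64 literals, and that each explicit
punctuation mark closes the innermost open block.  The function name
is invented.

```python
def unmatched_block_starts(stanza):
    """Return the indexes of block-starting letters that have no closing `/`."""
    open_starts = []
    for i, ch in enumerate(stanza):
        following = stanza[i + 1:i + 2]
        if ch.isupper() and following.isalpha():
            open_starts.append(i)  # uppercase letter followed by a letter
        elif ch == "/" and i > 0 and not stanza[i - 1].isupper():
            if open_starts:
                open_starts.pop()  # closes the innermost open block
    return open_starts
```

For the example stanza, `unmatched_block_starts("EFhhEA1f/")` returns
`[0, 1]`, i.e. the initial `E` and the `F` that follows it.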


Block numbering
---------------

From the point of view of each stanza, the blocks making up that
stanza, together with each non-comment stanza of the program, are
assigned non-negative numbers.  The numbering is calculated as
follows:

  * The very lowest numbers belong to (non-comment) stanzas that
    appear earlier in the program.
  * Blocks that are the argument to `F` get the next-lowest numbers.
  * After that comes the stanza itself.
  * Blocks that exist explicitly within the stanza (but are not the
    argument to `F`) have the next-lowest numbers.
  * Stanzas that appear later in the program have the second-highest
    numbers.
  * Automatically constructed blocks come last.  (The virtual block or
    blocks created to wrap the rest of the stanza when the stanza ends
    with a capital letter count as an automatically constructed block
    for this purpose, and count as starting at the start of the
    stanza.)
  * Within a category of blocks in the above list, blocks which start
    further to the left within the stanza get lower numbers.
  * Stanzas follow a different rule: stanzas that appear earlier are
    numbered in reverse order of their position within the program
    (i.e. the previous stanza gets the number 0, the stanza before
    that gets the number 1, and so on), whereas the stanzas that
    appear later in the program are numbered in forwards order of
    their position in the program (so the next stanza gets the lowest
    number out of these).

These numbers are used by the `F`-number and `f` commands to identify
which blocks/stanzas should be run.
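
Assuming the blocks and stanzas have already been sorted into the
categories above, the numbering itself is just a concatenation in the
right order.  In this sketch the function name, the parameter names
and the placeholder label for the stanza itself are all invented; each
list is expected in left-to-right program order, with comment stanzas
already removed.

```python
def number_targets(earlier_stanzas, f_blocks, explicit_blocks,
                   later_stanzas, automatic_blocks):
    """Map each number to the stanza or block it refers to."""
    ordered = (
        list(reversed(earlier_stanzas))  # previous stanza gets number 0
        + list(f_blocks)                 # blocks that are arguments to `F`
        + ["<this stanza>"]
        + list(explicit_blocks)          # explicit blocks not given to `F`
        + list(later_stanzas)            # next stanza comes first
        + list(automatic_blocks)         # automatically constructed blocks
    )
    return dict(enumerate(ordered))
```

For a stanza with two earlier stanzas `s0` and `s1`, this assigns 0 to
`s1` and 1 to `s0`, then carries on through the remaining categories.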


Numbers
-------

### Base 11

Numbers from 0 to 1459 inclusive are written in base 11, as a string
of digits:

  * From 0 to 10, a single digit is enough.
  * From 11 to 128, 11 is subtracted from the number, and the
    resulting number written as two digits (thus `00` is 0+11=11, `+7`
    is 117+11=128).
  * From 129 to 1459, 129 is subtracted from the number, and the
    resulting number is written as three digits (thus `000` is
    0+129=129, `+++` is 1330+129=1459).

The sequences `+8`, `+9`, and `++` do not fall into this main
sequence, and are used as shortcut encodings for the numbers 256,
1000, and 1000000 respectively.  (The redundant sequences `106` for
256 and `722` for 1000 are reserved for future expansion and must not
be used.)

In most contexts, numbers from 1460 upwards would be written in base
64, so base 11 numbers usually cap out at three digits.  There are two
cases, however, where base 64 syntax for numbers is not allowed: when
giving an integer argument to the `F` command (`F/` would be
interpreted as a stanza break), or when a command takes both an
integer and a block as argument (`/` after the block ends would be
interpreted as ending an additional block).  In these circumstances,
base 11 numbers can be extended above three digits: 4 digits give
14641 possibilities ranging from 1460 to 16100, 5 digits give 161051
possibilities ranging from 16101 to 177151, and so on.
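
The base 11 rules can be summarised in a short Python sketch.  The
function names and the `extended` flag are invented for illustration,
and the code assumes that each extra length simply continues from
where the previous one stopped.

```python
DIGITS = "0123456789+"  # `+` is the digit with value 10
SHORTCUTS = {"+8": 256, "+9": 1000, "++": 1000000}


def offset(length):
    """Smallest number written with `length` digits (0, 11, 129, ...)."""
    total = 0
    for n in range(1, length):
        total += 11 ** n - (len(SHORTCUTS) if n == 2 else 0)
    return total


def decode_base11(text):
    """Convert a string of digits to the number it represents."""
    if text in SHORTCUTS:
        return SHORTCUTS[text]
    value = 0
    for ch in text:
        value = value * 11 + DIGITS.index(ch)
    return value + offset(len(text))


def encode_base11(number, extended=False):
    """Write a number as digits, using a shortcut encoding if one exists."""
    for text, shortcut in SHORTCUTS.items():
        if shortcut == number:
            return text
    length = 1
    while number >= offset(length + 1):
        length += 1
    if length > 3 and not extended:
        raise ValueError("needs base 64, or the extended base 11 form")
    rest, chars = number - offset(length), []
    for _ in range(length):
        rest, digit = divmod(rest, 11)
        chars.append(DIGITS[digit])
    return "".join(reversed(chars))
```

For example, `encode_base11(128)` gives `+7` and `decode_base11("000")`
gives 129, matching the ranges listed above.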

### Base 64

Numbers from 1460 upwards are normally written in base 64 notation.
These all start with a `/` and contain at least two base 64 digits,
which are interpreted as bijective base 64 (i.e. `A` represents not 0,
but a digit with value 64).  1395 is subtracted from the integer
before encoding it (because the smallest number that uses two digits
in bijective base 64 is 65).

The simplest case consists of *just* the leading `/` and two base 64
digits.  These three-character base 64 constants range from `/BB`
(i.e. (1×64+1)+1395=1460) to `/AA` (i.e. (64×64+64)+1395=5555).

To write constants that need more than two base 64 digits, you use
*linking digits*; first you write a `/` and two base 64 digits, then
you write a (base 11) digit specifying how many more digits there are
coming (`0` here means an extra 11 digits follow, leading to a total
of 13 base 64 digits plus the leading `/`).  If that doesn't give
enough digits, you can then follow this with an extra linking digit,
whose effect is 11 times stronger.  (The first linking digit thus has
to be chosen so that the second linking digit will be able to produce
a number of the right length.  For example, if you have a total of 67
base 64 digits in your number, the linking digits would need to be `+`
and `5` to give a total of 2+10+(5×11) base 64 digits; the notation
would therefore be `/`, the first 2 digits, `+`, the next 10 digits,
`5`, and the last 55 digits.)  A third linking digit is multiplied by
121 to give the number of base 64 digits that follow it, a fourth by
1331, and so on.
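
A decoder for this notation could be sketched as follows.  The
function names are invented, and two details are assumptions rather
than anything stated above: that the 1395 offset also applies to
literals longer than two digits (which makes them continue directly
after 5555), and that a linking digit `0` stands for 11 at every
level, not just the first.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
LINKS = "0123456789+"  # linking digits are base 11 digits


def literal_digits(literal):
    """Strip the leading `/` and the linking digits, leaving the base 64 digits."""
    if len(literal) < 3 or literal[0] != "/":
        raise ValueError("a literal is `/` followed by at least two digits")
    digits = literal[1:3]
    pos, weight = 3, 1
    while pos < len(literal):
        link = LINKS.index(literal[pos])  # ValueError if not a linking digit
        count = (link or 11) * weight     # assumption: `0` always means 11
        chunk = literal[pos + 1:pos + 1 + count]
        if len(chunk) != count:
            raise ValueError("literal is shorter than its linking digits say")
        digits += chunk
        pos += 1 + count
        weight *= 11
    return digits


def decode_base64_literal(literal):
    """Return the integer written by a complete base 64 literal."""
    value = 0
    for ch in literal_digits(literal):
        value = value * 64 + (ALPHABET.index(ch) or 64)  # bijective: `A` is 64
    return value + 1395
```

With these assumptions, `decode_base64_literal("/BB")` is 1460 and
`decode_base64_literal("/AA")` is 5555, as in the two-digit examples
above.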