Elly's encoded representation
=============================

This document discusses the details behind how Elly is encoded into
its encoded representation.  You normally won't need to know this to
be able to program in Elly; the Guide (`Guide.md`) gives information
on how many bytes are required to express most elements, and you
typically care about how many bytes are used, not *which* bytes.  It
also doesn't list the details of encoding literals (see
`Constants.md`), or the exact byte sequence to use for each builtin
(see the list of builtins).  Rather, this talks about the higher-level
syntactical elements, such as how generators and tacit variables are
encoded.


The encoding space
------------------

The 256 bytes each correspond to builtins as follows (byte values
specified below are in hexadecimal):

  * 00–2E, 30–3F: "Builtins A".  These normally correspond to
    0-argument non-generator builtins.  In contexts where those aren't
    allowed, it provides a secondary encoding for 2-argument builtins
    (by flipping the 40s bit).  3E and 3F are dedicated as leading
    bytes for two-byte builtin A names.

  * 40–6E, 7F: "Builtins D"; 70–7E: "Builtins C".
  
    These two groups are collectively known as "Builtins B".
    
    7E and 7F are dedicated as leading bytes for two-byte builtin C
    and D names respectively.
  
    These normally correspond to 2-argument non-generator builtins.
    In contexts where those aren't allowed, it provides a secondary
    encoding for 0-argument builtins (by flipping the 40s bit).

    2-argument builtins that don't have a swapped version use the
    builtins C range.  If there is a swapped version, the builtins D
    range is used.  (The swapped and unswapped versions of a builtin
    share the same byte, and are distinguished by changing how their
    argument is encoded.)

  * 80–BE, C2–CF: "Builtins F".  These correspond to 1-argument
    non-generator builtins.  CE and CF are dedicated as leading bytes
    for two-byte builtin F names.

  * D0–DF: "Builtins I"; E0-EF: "Builtins E". 
  
    These two groups are collectively known as "Builtins G".
  
    These correspond to 1-argument generator builtins with a tacit
    start-of-scope marker.  The split between I and E depends on
    context; in statement-argument context, the I and E ranges are
    both used for `each` (because no other builtin can have a
    start-of-scope marker here), giving it separate I and E encodings;
    and in other contexts, the builtin E range is still used for
    `each`, but the builtin I range for other 1-argument generators
    with tacit start-of-scope markers (with DF dedicated as a leading
    byte for two-byte builtin I names).
  
  * F0–FF: "Builtins H".  These normally correspond to generator
    builtins that have 0 (FA–FE) or 2 (F0–F9) arguments ("builtins HA"
    and "builtins HB" respectively), and a tacit start-of-scope
    marker (with FF dedicated as the first byte of two-byte builtin H
    names).  In contexts where those aren't allowed, these are instead
    encodings of the full control marker (no context allows both full
    control markers and 0/2-byte generators with tacit start-of-scope
    markers).  However, in either situation, the encodings can be
    repurposed to mean:

  * F1–FE: Non-tacit scope variables.  There are up to 7 generator
    scope variables in existence at any given time, and each of these
    replaces one or two encodings from FE, FD, FC, etc. (with FE being
    for the scope that ends latest); normally, only one encoding per
    scope variable is required (and is considered part of the
    "builtins HA" group), but two (the "HA" and "HB" encodings) are
    required in the case where scope variable is the argument to a
    statement formed from a group D builtin.  (When two encodings are
    needed, the HA encodings will have the highest code points, and
    the HB encodings will have the code points immediately before
    them.)  This means that fewer encodings of the full control marker
    will be available, and some generators will be unavailable tacitly
    (you can still use the explicit form).
    
  * FA–FE: The possibilities for FA–FE listed above still fail to
    assign a meaning in one given context (statement context, where
    none of full control markers, 0-argument builtins, and scope
    variables can exist).  These are therefore repurposed for
    miscellaneous syntax that's only legal in statement context.
  
  * BF: This is considered part of the builtins I range, but with an
    explicit start-of-scope marker.  The BF itself acts as the marker,
    and the builtin name comes in the next byte.  Because builtins D
    occupy 1/16th of the encoding space, the byte after BF thus allows
    for 16 encodings for each explicitly scoped builtin I (by
    replacing the leading D with any of the 16 hexadecimal digits).
  
  * 2F, 6F: These are considered part of the builtins HA and HB ranges
    respectively, but with an explicit start-of-scope marker.  This
    works identically to BF (with the leading F being replaced to
    provide 16 encodings).  Two encodings for this are provided,
    because every context but full control context needs one spare bit
    of information (over and above the 16 possible encodings) when
    this sort of builtin is used.  In a full control expression, the
    distinction between 2F and 6F is not useful; at present, 2F is
    used for 0-argument builtins H and 6F for 2-argument builtins H,
    and using the "wrong encoding" is reserved for future expansion.

  * C1: This has two possible meanings.  In most contexts, this is an
    explicit end-of-scope marker.  Directly as the argument of a
    statement, it insteads acts as the E encoding of 1-argument
    generator builtins with an explicit start-of-scope marker, in a
    similar way to BF, thus allowing these builtins to also have
    separate E and I encodings in this context.
    
  * C0: The "end of function" marker, `)` or `}`.  This can appear in
    almost any context; if there are still arguments required to
    finish off the function, it will fill them with tacit variables.

Apart from C0 which is special (and means the same thing in all
contexts, except when it's a second or subsequent byte of a two-byte
builtin name, a literal, etc.), the bytes thus fall into the
categories ACDEFHI (with B meaning "C or D", and G meaning "E or I").
Parsing the encoded representation works by deciding which category
each builtin belongs to, then matching sequences of categories to the
appropriate builtin and tacit stage shape.

It should be noted that there's no difference between explicit and
implicit representations of generator start markers in how they
decode, e.g. BF 4C is a builtin I, as is DC, and they both refer to
the same builtin.  The difference is that the long form (represented
in the source code with a non-tacit start-of-scope marker) provides 16
different representations (the choice of representation is known as
the generator's *range*), whereas the tacit/short form doesn't.
`each` also has 16 possible E encodings (and in statement-argument
context, another 16 possible I encodings), meaning that it also has a
range.  The final language construct which has a range is the
full control marker; this can have up to 16 encodings depending on how
many scopes it is nested inside.


Payloads
--------

Some builtins aren't uniquely defined by their leading byte, e.g. the
"common integer" builtin is encoded at 3B (when as a builtin A) or 7B
(when used as a builtin B), but it needs another byte to represent
*which* common integer.  For builtins with multiple variants like
this, extra data (known as a "data part" in the guide, and a "payload"
internally within the Elly implementation) appears in the program
immediately after the builtin *code* (the one or two bytes that define
what sort of builtin we're using).  Payloads are considered completely
separate from the rest of the encoding space, e.g. a `C0` byte inside
a payload is not an end-of-function marker but is just a byte of data
like any other byte would be.

"Lexing"/"tokenising" an Elly encoding requires the lexer to be able
to calculate whether it has a payload, which requires it to be able to
calculate which builtin is in use before seeing the payload.  In order
to accomplish this, we rely on the fact that there are at most two
possible builtins for each encoding (e.g. a builtin B must be either a
specific 0-argument builtin or a specific 2-argument builtin, a
builtin G must be either a specific 1-argument builtin or else
`each`); and which of those two possibilities applies will always be
disambiguated by the context to the left of the builtin.  So we can
always know which specific builtin we're looking at after parsing
everything to its left in the program, and thus whether or not it
needs a payload.

Additionally, the lexer/tokeniser needs to be able to find the end of
the payload.  For any builtin, there must be some rule that makes the
payload self-delimiting, i.e. after reading all the bytes of the
payload we'll know when we've reached the end.  This could be very
simple (e.g. the "uncommon integer" builtin always has a payload
that's exactly two bytes long) or very complex (e.g. the builtin `1{}`
has a payload that's a valid Elly program with no trailing bytestring,
followed by a `C0` end-of-function marker, thus finding the end
requires running a full Elly decoder; but the end is nonetheless
always going to be uniquely determined because it ends at the same
point the trailing bytestring would start in a non-nested Elly
program).  However, the rule is uniquely defined by the builtin in
use.


Contexts, and shape-based tacit variables
-----------------------------------------

Decoding an Elly program works using what's effectively a state
machine; the states are called *contexts*, and determine what's legal
at each point in the encoding.

There are five contexts:

  * Full control context (0-argument functions start here)
  * Initial expression context (1- and 2-argument functions start here)
  * Statement context
  * Statement-argument context (after a group C builtin)
  * Statement-argument context (after a group D builtin)


### Full control context

Full control context is the simplest.  0-argument functions start in
full control mode (because the leading `@` is tacit, and in fact can't be
encoded; the encoder deletes it if it sees one), and it can be entered
explicitly from statement-argument context.

The various areas of the encoding space have the following meanings in
full control context:

  * A: interpreted as a 0-argument nongenerator builtin
  * B (i.e. C or D): interpreted as a 2-argument nongenerator builtin
  * G (i.e. E or I): interpreted as a 1-argument generator builtin
  * F: interpreted as a 1-argument nongenerator builtin
  * H: interpreted as a 0- or 2-argument generator builtin (unless it
    is needed as a scope variable – each scope variable only needs one
    encoding here)

In other words, we're just encoding builtins directly, with the whole
encoding space (other than the special cases C0 and C1) dedicated to
them.  Any sort of builtin is allowed to appear in full control
context, in any order.

Full control context ends either at a C0, or when we have a "properly
matched" expression.

There are no tacit variables within a full control expression other
than those added by a C0, and those added by a range (see "range-based
tacit variables" below).


### Initial expression context and statement context

Initial expression context and statement context are very similar.
Initial expression context is used only at the start of a 1- or
2-argument function (which usually starts with a tacit `y` or `x`
respectively followed by a statement, but could start with a
full control marker).  Statement context is used after the initial
expresssion ends, and after each statement ends.

The only difference is therefore whether full control markers are a
possibility or not, and thus the interpretation of the bytes in the
builtins H range; initial expression context interprets these as
full control markers, whereas statement context interprets F0–F9 as
2-argument generators with tacit scope markers, and FA–FF as special
syntax that's only legal in statement position.  (This difference is
the reason for the restriction in the Guide that 1-statement scope is
disallowed for generators that take a second argument, and are the
first statement in a function whose initial expression is a tacit `x`
or `y`; such generators would need to be encoded using the F0–F9
range, but it's being used for full control markers in this
situation.)

Outside the F0–FF range, the two contexts are the same, and interpret
the various areas of the encoding space as follows:

  * A: interpreted as a 2-argument nongenerator builtin (by flipping
    the 40s bit), followed by a tacit `y`
  * B (i.e. C or D): interpreted as a 2-argument nongenerator builtin,
    followed by a specified argument (this enters statement argument
    context)
  * G (i.e. E or I): interpreted as a 1-argument generator builtin
  * F: interpreted as a 1-argument nongenerator builtin
  * H (if encoded with 2F/6F; the case of F0–FF was discussed above):
    interpreted as a 2-argument generator builtin; 2F-encoded builtins
    are followed by a tacit `y` (as though it were a builtin A), and
    6F-encoded builtins are followed by an explicitly given statement
    argument (as though it were a builtin B)

In initial expression context, use of *any* of the above possibilities
will insert a tacit `y` or `x` as appropriate for the initial
expression; the only way to get something else as an initial
expression is to start with a full control marker (F0–FF).

In Elly, statements formed from 2-argument builtins in group D
(corresponding to builtins D in the encoding space) and which have an
argument other than tacit `y` permit swapping of their arguments.
This is not specified by the statement itself, but rather by its
argument.


### Statement-argument context

Statement-argument context is a bit different from full control
context; although both are used for expressions, statement-argument
context does not allow tacit start-of-scope markers, and often
introduces tacit variables.  It works differently depending on whether
the statement itself is a group C or D builtin (it must be either a
2-argument builtin, or else a user-defined function which is treated
as though it were the `2()` builtin with a payload, and all such
builtins are assigned to group C or D even though the generators might
be a builtin H rather than a builtin C/D).

Statement arguments are split into three main groups:

  * **The primary group**

      * A: interpreted as a 0-argument nongenerator builtin
      * B (sometimes, and only when the introducing statement is group
        C): interpreted as a 2-argument nongenerator bultin
      * E: interpreted as a 1-argument generator builtin
      * F: interpreted as a 1-argument nongenerator builtin
      * H (encoded with 2F): interpreted as a 0/2-argument generator
        builtin
      * H (F1–FE): interpreted as a generator scope variable, if it
        corresponds to a generator in scope

  * **The secondary group**

      * B (usually): interpreted as a 2- (if the introducing statement
        is group C) or 0- (if the introducing statement is group D)
        argument nongenerator builtin
      * H (encoded with 6F): interpreted as a 0/2-argument generator
        builtin
      * H (F1–FE): interpreted as a generator scope variable, if it
        corresponds to a generator in scope, and the introducing
        statement is in group D
      * I: interpreted as a 1-argument generator builtin

  * **The full control group**

      * H (F0–FF): interpreted as a full control marker, if it is not
        needed as a scope variable

The primary group is used for situations where the statement ends
here, and does not have swapped arguments; if extra arguments are
needed to complete the statement, they are filled out with tacit `y`
or `x y` as usual.

If the statement is a group D builtin, then the secondary group swaps
the arguments to the builtin.  The statement is still filled out with
tacit `y` if any more arguments are needed (this only happens in the
sequence `D I`, corresponding to the tacit stage shape `D' G y`, or at
the end of the function).

If the statement is a group C builtin, then the secondary group
instead means that there is an explicit argument to the argument
(i.e. not just a tacit `y`), which is itself specified in
statement-argument context.  If the function ends immediately after
this point, then a tacit `x` is used as the argument (not a tacit `y`,
as usual, as that could have bene produced by using the primary
group).

The full control group is of course used for full control markers; in
such a case, the statement's argument can be anything, and is
specified in full control context rather than statement-argument
context.  For builtins in group D, the encoding of the full control
marker determines whether the builtin has its arguments swapped.

There are two ambiguous cases.  One is when we see a scope variable.
If the parent statement is a group C builtin, then we only have one
encoding per scope variables (the primary encodings).  If it's a group
D builtin, we have two encodings per scope variable (the primary and
secondary encodings), consuming twice as much of the
full-control-marker space; the primary encoding has the codepoint 1
lower than the secondary encoding.

The other case is more complex: in `C B` (with the `C` in statement
context and `B` in statement-argument context), the `B` can be
interpreted as either part of the primary group (which would complete
the statement as `C B x y`), or as part of the secondary group (which
means that the statement starts `C B ` and the next byte provides is
in statement-argument and provides the argument).  This case is
disambiguated by determining whether it's necessarily the case that
`X` and `Y` are the same variable (i.e. it's a 0- or 1-argument
function and no builtins that change `X` nor `Y` have been seen yet).
If they are different, the primary group (`C B x y`) interpretation is
always the interpretation used.  If they are the same, the `C B x y`
interpretation is used only in the case where `B y y` is not trivially
equivalent (nor mostly equivalent, e.g. equivalent except on `null`)
to any 0-argument builtin, nor 1-argument builtin applied to `y`.  The
big list of builtins stores information about which builtins can
usefully have a repeated argument, and thus which builtins will use
the primary rather than secondary parsing when `x` = `y`.

The Guide contains a list of "tacit pipeline stage shapes" which
reflect the rules for statement and statement-argument context,
listing (most of) the combinations of builtins and tacit variables
that can be used to form a statement.  Not all possibilities that can
be generated by the above rules are currently listed in the Guide
(e.g. `C C C …`, `C D' K y`, `C D @ …`), partly to make the language
less overwhelming to learn, and partly because it's unclear whether
this is the best use for the encoding space.  The encoder will not
currently produce nor recognise these encodings, in order to avoid
breaking future programs.


Ranges, and range-based tacit variables / tacit scope markers
-------------------------------------------------------------

Full control markers, explicit start-of-scope markers, and the `each`
builtin each have a *range* associated with them; this is a number in
the 0–15 range that's encoded by using one of 16 possibilities for the
encoding of the marker or builtin, by changing the value of the nybble
that varies.  (In some cases, full control markers won't be able to
encode all 16 possibilities for the range because some of them are in
use for tacit variables.  In this case, the FF encoding is used for
the lowest positive integer that could not otherwise be encoded,
rather than representing 15.)

Ranges can be thought of numbers from which information is repeatedly
extracted as the program is parsed, reducing their value, until they
reaches 0; the initial value of the range is chosen so that the
correct information will be extracted from it.  The information
encoded is:

  * which order the end-of-scope markers are in, relative to this
    start-of-scope marker (only if the range is on the start of a
    generator scope, i.e. a non-tacit start-of-scope marker or an
    `each`);
  * whether the parent statement has its arguments swapped (only for
    full control marker ranges, when the full control marker is an
    argument to a statement);
  * whether a tacit scope variable is being used in place of a tacit
    `y`;
  * which scope variable is being used for `NI` and `ZI`;
  * whether there's a tacit end-of-scope marker, and where it is (only
    if the range is on the start of a generator scope);
  * where to introduce a tacit `y` (only for full control marker
    ranges; in the initial expression of a 2-argument function, this
    is a tacit `x` instead).

The only pieces of information here that are absolutely vital (in that
they can't be encoded without using ranges) are the information about
which order the end-of-scope markers are in, and the information about
whether the parent statement has swapped arguments, and these can
never require extracting more than 6 possibilities.  Everything else
is either related to tacit variables or tacit end-of-scope markers,
and could be encoded explicitly if necessary; or is related to
"secondary" scope variables (`NI`, `ZI` and friends), which are not
required to work except near the start of a scope or full control
marker.  Thus, the limited space of a range (16 possibilities) is
always sufficient to encode everything that's absolutely necessary,
and the remaining possibilities are used to encode the nice-to-haves
(such as tacit variables).

The ranges are thus used in two ways:

  * Whenever something which is range-encoded and has multiple
    possibilities is encountered, the range is divided by the number
    of possibilities; the quotient is used as the new value of the
    range when parsing the rest of the program, and the remainder
    specifies which possibility to use.
  
  * After every location where a scope could end (for scope ranges),
    or a tacit `y` could be inserted (for full control ranges), if the
    range is 1, the scope ends or tacit `y` is inserted there (and the
    range becomes 0); then the range is decremented if it is not 0.

And here are the specifics:

  * end-of-scope marker order: multiple possibilities; the position of
    an end-of-scope marker relative to the end-of-scope markers for
    scopes that started before its start-of-scope marker is encoded on
    its start-of-scope marker's range, with 0 meaning appearing as
    soon as possible, 1 a little later, and so on;
  
  * whether a statement has its arguments swapped: multiple
    possibilities, using the argument's full control marker's range (0
    = not swapped, 1 = swapped);

  * end-of-scope marker position: location; encoded on the matching
    start-of-scope marker, with a range of 1 ending the scope there;
    note that because we know the order in which the scopes end,
    scopes which aren't the next to end will not "count down" their
    ranges unil they become the next to end

  * whether a tacit `y` is actually a tacit generator scope variable:
    multiple possibilities; if we're in a full control expression, it
    can only contain one tacit variable (except at EOF) and we record
    this together with its location; otherwise, we use the ranges of
    every open scope (2 possibilities: 0 = not my scope variable, 1 =
    my scope variable), checking from the first to end inwards until
    one of them claims it
    
  * which scope variable to use for `NI` and `ZI`: multiple
    possibilities; if we're in an expression with a full control
    marker, this uses the full control marker's range (0 = next scope
    to end, 1 = next-but-one scope to end, etc.); otherwise, we use
    the range of every open scope (2 possibilities: 0 = my scope
    variable, 1 = not my scope variable; this is reversed from the
    previous case), checking from the first to end inwards until one
    of them claims it
    
  * where to introduce a tacit `y` (or tacit `x` if this is the
    initial expression of a 2-argument function): a combination of the
    two possibilities; encoded on the full control marker, with a
    range of 1 inserting a `y`/`x` there, 2 inserting the scope
    variable for the next scope to end, 3 inserting the scope variable
    for the next-but-one, and so on; then the range is reduced by the
    number of scope variables plus 1; at EOF, the `y` possibility does
    not exist, so the range for a tacit variable at EOF is 1 lower

Under these rules, a range of 0 therefore produces the "default"
outcome (i.e. end-of-scope markers needing to be given explicitly if
they aren't at EOF, `Y` in full control expressions needing to be
given explicitly, `NI` and `ZI` defaulting to the scope variable for
the next scope to end, tacit `y` being tacit `y` rather than a scope
variable).

When encoding, it's possible to run these rules backwards (scanning
the program from right to left) to work out what range value would
need to be used to get all the right tacit variables and scope markers
in the right places.  This enables the encoder to work out whether a
tacit variable is disallowed due to being "too far away" from the
location which encodes it; simply see whether the range has become too
high to be encoded (16 for a scope marker, 16 minus the number of
generator scopes we're inside for a full control marker), and complain
if it is.

Ranges normally correspond to nybbles in the encoding directly.
However, in the case of full-control markers, they're rotated by -1
(with a range of 0 being encoded as FF, of 1 as F0, of 2 as F1, etc.)
so that the set of valid ranges (i.e. those that don't clash with
generator scope variables) is contiguous.


Implicit scope
--------------

Generators can have either an explicit or tacit start-of-scope marker.
When the marker is explicit, then it has a range, and that's used to
encode everything that's tacitly associated with its scope.  However,
when the start-of-scope marker is implict, and we aren't using `each`
(which has a range despite being only 1 byte wide), then we need to
use defaults for everything, rather than using a range.

The Guide lists four situations in which the start-of-scope marker can
be made tacit.  The first is when `each` is used; that has a range, so
nothing special needs to be done.  The other three are
1-statement/1-builtin scopes, all-other-statement scopes `:y`, and
initial-expression-root scopes `:x`.  These each have rules of their
own to specialise what choices to make for `i`-versus-`y` and the
like, and tacit start-of-scope markers are encoded quite differently
from explicit start-of-scope markers, so the only hard part when
encoding is to determine which of the possibilities for a tacit
start-of-scope marker is in use.

In some cases, this is easy. 1-builtin scopes are only allowed in a
full control expression and not at the root, and none of the other
implicit scopes are allowed there.  Likewise, initial-expression-root
scopes are only allowed at the root of the initial expression (which
is always in full control mode if it's more complex than a trivial `x`
or `y`), and none of the other implicit scopes are allowed there.  So
the only difficult part is distinguishing between 1-statement scope
and all-other-statement scopes.

This distinguishing is done by rearranging the order in which the
builtins appear in the function.  For 1-statement scope, the generator
appears in its "usual" place, before the statement that the scope is
placed around.  However, for all-other-statement scope (in which the
generator applies to all the statements in the function apart from
itself), although the generator appears as the first statement of the
function in the source code, it appears in a different place in the
encoding: right at the end of the function.  This is not a valid
location for a generator with 1-statement scope (because there are no
statements for it to apply to), so the decoder will know (when it sees
a tacit-`:` generator at the end of a function) that the generator has
been moved and has an all-other-statements scope, and will be able to
put it back into the right place.

When using nested all-other-statements generators, they appear in the
same order at the end of the function as they previously did at the
start (i.e. we repeatedly rotate them from the start to the end when
encoding, and the end to the start when decoding).

The end-of-scope marker can also be tacit even when the start-of-scope
marker is explicit (or `each`).  The position of the end-of-scope
marker is generally encoded in the range of the start-of-scope marker.
If it's too far away, then a value is used that will cause no tacit
marker to be inserted;  the end-of-scope marker can *still* be tacit
in this case, but it will have to come at the end of the function
(allowing a tacit marker to be inserted because we haven't seen an
explicit marker yet).