Elly's encoded representation ============================= This document discusses the details behind how Elly is encoded into its encoded representation. You normally won't need to know this to be able to program in Elly; the Guide (`Guide.md`) gives information on how many bytes are required to express most elements, and you typically care about how many bytes are used, not *which* bytes. It also doesn't list the details of encoding literals (see `Constants.md`), or the exact byte sequence to use for each builtin (see the list of builtins). Rather, this talks about the higher-level syntactical elements, such as how generators and tacit variables are encoded. The encoding space ------------------ The 256 bytes each correspond to builtins as follows (byte values specified below are in hexadecimal): * 00–2E, 30–3F: "Builtins A". These normally correspond to 0-argument non-generator builtins. In contexts where those aren't allowed, it provides a secondary encoding for 2-argument builtins (by flipping the 40s bit). 3E and 3F are dedicated as leading bytes for two-byte builtin A names. * 40–6E, 7F: "Builtins D"; 70–7E: "Builtins C". These two groups are collectively known as "Builtins B". 7E and 7F are dedicated as leading bytes for two-byte builtin C and D names respectively. These normally correspond to 2-argument non-generator builtins. In contexts where those aren't allowed, it provides a secondary encoding for 0-argument builtins (by flipping the 40s bit). 2-argument builtins that don't have a swapped version use the builtins C range. If there is a swapped version, the builtins D range is used. (The swapped and unswapped versions of a builtin share the same byte, and are distinguished by changing how their argument is encoded.) * 80–BE, C2–CF: "Builtins F". These correspond to 1-argument non-generator builtins. CE and CF are dedicated as leading bytes for two-byte builtin F names. * D0–DF: "Builtins I"; E0-EF: "Builtins E". These two groups are collectively known as "Builtins G". These correspond to 1-argument generator builtins with a tacit start-of-scope marker. The split between I and E depends on context; in statement-argument context, the I and E ranges are both used for `each` (because no other builtin can have a start-of-scope marker here), giving it separate I and E encodings; and in other contexts, the builtin E range is still used for `each`, but the builtin I range for other 1-argument generators with tacit start-of-scope markers (with DF dedicated as a leading byte for two-byte builtin I names). * F0–FF: "Builtins H". These normally correspond to generator builtins that have 0 (FA–FE) or 2 (F0–F9) arguments ("builtins HA" and "builtins HB" respectively), and a tacit start-of-scope marker (with FF dedicated as the first byte of two-byte builtin H names). In contexts where those aren't allowed, these are instead encodings of the full control marker (no context allows both full control markers and 0/2-byte generators with tacit start-of-scope markers). However, in either situation, the encodings can be repurposed to mean: * F1–FE: Non-tacit scope variables. There are up to 7 generator scope variables in existence at any given time, and each of these replaces one or two encodings from FE, FD, FC, etc. (with FE being for the scope that ends latest); normally, only one encoding per scope variable is required (and is considered part of the "builtins HA" group), but two (the "HA" and "HB" encodings) are required in the case where scope variable is the argument to a statement formed from a group D builtin. (When two encodings are needed, the HA encodings will have the highest code points, and the HB encodings will have the code points immediately before them.) This means that fewer encodings of the full control marker will be available, and some generators will be unavailable tacitly (you can still use the explicit form). * FA–FE: The possibilities for FA–FE listed above still fail to assign a meaning in one given context (statement context, where none of full control markers, 0-argument builtins, and scope variables can exist). These are therefore repurposed for miscellaneous syntax that's only legal in statement context. * BF: This is considered part of the builtins I range, but with an explicit start-of-scope marker. The BF itself acts as the marker, and the builtin name comes in the next byte. Because builtins D occupy 1/16th of the encoding space, the byte after BF thus allows for 16 encodings for each explicitly scoped builtin I (by replacing the leading D with any of the 16 hexadecimal digits). * 2F, 6F: These are considered part of the builtins HA and HB ranges respectively, but with an explicit start-of-scope marker. This works identically to BF (with the leading F being replaced to provide 16 encodings). Two encodings for this are provided, because every context but full control context needs one spare bit of information (over and above the 16 possible encodings) when this sort of builtin is used. In a full control expression, the distinction between 2F and 6F is not useful; at present, 2F is used for 0-argument builtins H and 6F for 2-argument builtins H, and using the "wrong encoding" is reserved for future expansion. * C1: This has two possible meanings. In most contexts, this is an explicit end-of-scope marker. Directly as the argument of a statement, it insteads acts as the E encoding of 1-argument generator builtins with an explicit start-of-scope marker, in a similar way to BF, thus allowing these builtins to also have separate E and I encodings in this context. * C0: The "end of function" marker, `)` or `}`. This can appear in almost any context; if there are still arguments required to finish off the function, it will fill them with tacit variables. Apart from C0 which is special (and means the same thing in all contexts, except when it's a second or subsequent byte of a two-byte builtin name, a literal, etc.), the bytes thus fall into the categories ACDEFHI (with B meaning "C or D", and G meaning "E or I"). Parsing the encoded representation works by deciding which category each builtin belongs to, then matching sequences of categories to the appropriate builtin and tacit stage shape. It should be noted that there's no difference between explicit and implicit representations of generator start markers in how they decode, e.g. BF 4C is a builtin I, as is DC, and they both refer to the same builtin. The difference is that the long form (represented in the source code with a non-tacit start-of-scope marker) provides 16 different representations (the choice of representation is known as the generator's *range*), whereas the tacit/short form doesn't. `each` also has 16 possible E encodings (and in statement-argument context, another 16 possible I encodings), meaning that it also has a range. The final language construct which has a range is the full control marker; this can have up to 16 encodings depending on how many scopes it is nested inside. Payloads -------- Some builtins aren't uniquely defined by their leading byte, e.g. the "common integer" builtin is encoded at 3B (when as a builtin A) or 7B (when used as a builtin B), but it needs another byte to represent *which* common integer. For builtins with multiple variants like this, extra data (known as a "data part" in the guide, and a "payload" internally within the Elly implementation) appears in the program immediately after the builtin *code* (the one or two bytes that define what sort of builtin we're using). Payloads are considered completely separate from the rest of the encoding space, e.g. a `C0` byte inside a payload is not an end-of-function marker but is just a byte of data like any other byte would be. "Lexing"/"tokenising" an Elly encoding requires the lexer to be able to calculate whether it has a payload, which requires it to be able to calculate which builtin is in use before seeing the payload. In order to accomplish this, we rely on the fact that there are at most two possible builtins for each encoding (e.g. a builtin B must be either a specific 0-argument builtin or a specific 2-argument builtin, a builtin G must be either a specific 1-argument builtin or else `each`); and which of those two possibilities applies will always be disambiguated by the context to the left of the builtin. So we can always know which specific builtin we're looking at after parsing everything to its left in the program, and thus whether or not it needs a payload. Additionally, the lexer/tokeniser needs to be able to find the end of the payload. For any builtin, there must be some rule that makes the payload self-delimiting, i.e. after reading all the bytes of the payload we'll know when we've reached the end. This could be very simple (e.g. the "uncommon integer" builtin always has a payload that's exactly two bytes long) or very complex (e.g. the builtin `1{}` has a payload that's a valid Elly program with no trailing bytestring, followed by a `C0` end-of-function marker, thus finding the end requires running a full Elly decoder; but the end is nonetheless always going to be uniquely determined because it ends at the same point the trailing bytestring would start in a non-nested Elly program). However, the rule is uniquely defined by the builtin in use. Contexts, and shape-based tacit variables ----------------------------------------- Decoding an Elly program works using what's effectively a state machine; the states are called *contexts*, and determine what's legal at each point in the encoding. There are five contexts: * Full control context (0-argument functions start here) * Initial expression context (1- and 2-argument functions start here) * Statement context * Statement-argument context (after a group C builtin) * Statement-argument context (after a group D builtin) ### Full control context Full control context is the simplest. 0-argument functions start in full control mode (because the leading `@` is tacit, and in fact can't be encoded; the encoder deletes it if it sees one), and it can be entered explicitly from statement-argument context. The various areas of the encoding space have the following meanings in full control context: * A: interpreted as a 0-argument nongenerator builtin * B (i.e. C or D): interpreted as a 2-argument nongenerator builtin * G (i.e. E or I): interpreted as a 1-argument generator builtin * F: interpreted as a 1-argument nongenerator builtin * H: interpreted as a 0- or 2-argument generator builtin (unless it is needed as a scope variable – each scope variable only needs one encoding here) In other words, we're just encoding builtins directly, with the whole encoding space (other than the special cases C0 and C1) dedicated to them. Any sort of builtin is allowed to appear in full control context, in any order. Full control context ends either at a C0, or when we have a "properly matched" expression. There are no tacit variables within a full control expression other than those added by a C0, and those added by a range (see "range-based tacit variables" below). ### Initial expression context and statement context Initial expression context and statement context are very similar. Initial expression context is used only at the start of a 1- or 2-argument function (which usually starts with a tacit `y` or `x` respectively followed by a statement, but could start with a full control marker). Statement context is used after the initial expresssion ends, and after each statement ends. The only difference is therefore whether full control markers are a possibility or not, and thus the interpretation of the bytes in the builtins H range; initial expression context interprets these as full control markers, whereas statement context interprets F0–F9 as 2-argument generators with tacit scope markers, and FA–FF as special syntax that's only legal in statement position. (This difference is the reason for the restriction in the Guide that 1-statement scope is disallowed for generators that take a second argument, and are the first statement in a function whose initial expression is a tacit `x` or `y`; such generators would need to be encoded using the F0–F9 range, but it's being used for full control markers in this situation.) Outside the F0–FF range, the two contexts are the same, and interpret the various areas of the encoding space as follows: * A: interpreted as a 2-argument nongenerator builtin (by flipping the 40s bit), followed by a tacit `y` * B (i.e. C or D): interpreted as a 2-argument nongenerator builtin, followed by a specified argument (this enters statement argument context) * G (i.e. E or I): interpreted as a 1-argument generator builtin * F: interpreted as a 1-argument nongenerator builtin * H (if encoded with 2F/6F; the case of F0–FF was discussed above): interpreted as a 2-argument generator builtin; 2F-encoded builtins are followed by a tacit `y` (as though it were a builtin A), and 6F-encoded builtins are followed by an explicitly given statement argument (as though it were a builtin B) In initial expression context, use of *any* of the above possibilities will insert a tacit `y` or `x` as appropriate for the initial expression; the only way to get something else as an initial expression is to start with a full control marker (F0–FF). In Elly, statements formed from 2-argument builtins in group D (corresponding to builtins D in the encoding space) and which have an argument other than tacit `y` permit swapping of their arguments. This is not specified by the statement itself, but rather by its argument. ### Statement-argument context Statement-argument context is a bit different from full control context; although both are used for expressions, statement-argument context does not allow tacit start-of-scope markers, and often introduces tacit variables. It works differently depending on whether the statement itself is a group C or D builtin (it must be either a 2-argument builtin, or else a user-defined function which is treated as though it were the `2()` builtin with a payload, and all such builtins are assigned to group C or D even though the generators might be a builtin H rather than a builtin C/D). Statement arguments are split into three main groups: * **The primary group** * A: interpreted as a 0-argument nongenerator builtin * B (sometimes, and only when the introducing statement is group C): interpreted as a 2-argument nongenerator bultin * E: interpreted as a 1-argument generator builtin * F: interpreted as a 1-argument nongenerator builtin * H (encoded with 2F): interpreted as a 0/2-argument generator builtin * H (F1–FE): interpreted as a generator scope variable, if it corresponds to a generator in scope * **The secondary group** * B (usually): interpreted as a 2- (if the introducing statement is group C) or 0- (if the introducing statement is group D) argument nongenerator builtin * H (encoded with 6F): interpreted as a 0/2-argument generator builtin * H (F1–FE): interpreted as a generator scope variable, if it corresponds to a generator in scope, and the introducing statement is in group D * I: interpreted as a 1-argument generator builtin * **The full control group** * H (F0–FF): interpreted as a full control marker, if it is not needed as a scope variable The primary group is used for situations where the statement ends here, and does not have swapped arguments; if extra arguments are needed to complete the statement, they are filled out with tacit `y` or `x y` as usual. If the statement is a group D builtin, then the secondary group swaps the arguments to the builtin. The statement is still filled out with tacit `y` if any more arguments are needed (this only happens in the sequence `D I`, corresponding to the tacit stage shape `D' G y`, or at the end of the function). If the statement is a group C builtin, then the secondary group instead means that there is an explicit argument to the argument (i.e. not just a tacit `y`), which is itself specified in statement-argument context. If the function ends immediately after this point, then a tacit `x` is used as the argument (not a tacit `y`, as usual, as that could have bene produced by using the primary group). The full control group is of course used for full control markers; in such a case, the statement's argument can be anything, and is specified in full control context rather than statement-argument context. For builtins in group D, the encoding of the full control marker determines whether the builtin has its arguments swapped. There are two ambiguous cases. One is when we see a scope variable. If the parent statement is a group C builtin, then we only have one encoding per scope variables (the primary encodings). If it's a group D builtin, we have two encodings per scope variable (the primary and secondary encodings), consuming twice as much of the full-control-marker space; the primary encoding has the codepoint 1 lower than the secondary encoding. The other case is more complex: in `C B` (with the `C` in statement context and `B` in statement-argument context), the `B` can be interpreted as either part of the primary group (which would complete the statement as `C B x y`), or as part of the secondary group (which means that the statement starts `C B ` and the next byte provides is in statement-argument and provides the argument). This case is disambiguated by determining whether it's necessarily the case that `X` and `Y` are the same variable (i.e. it's a 0- or 1-argument function and no builtins that change `X` nor `Y` have been seen yet). If they are different, the primary group (`C B x y`) interpretation is always the interpretation used. If they are the same, the `C B x y` interpretation is used only in the case where `B y y` is not trivially equivalent (nor mostly equivalent, e.g. equivalent except on `null`) to any 0-argument builtin, nor 1-argument builtin applied to `y`. The big list of builtins stores information about which builtins can usefully have a repeated argument, and thus which builtins will use the primary rather than secondary parsing when `x` = `y`. The Guide contains a list of "tacit pipeline stage shapes" which reflect the rules for statement and statement-argument context, listing (most of) the combinations of builtins and tacit variables that can be used to form a statement. Not all possibilities that can be generated by the above rules are currently listed in the Guide (e.g. `C C C …`, `C D' K y`, `C D @ …`), partly to make the language less overwhelming to learn, and partly because it's unclear whether this is the best use for the encoding space. The encoder will not currently produce nor recognise these encodings, in order to avoid breaking future programs. Ranges, and range-based tacit variables / tacit scope markers ------------------------------------------------------------- Full control markers, explicit start-of-scope markers, and the `each` builtin each have a *range* associated with them; this is a number in the 0–15 range that's encoded by using one of 16 possibilities for the encoding of the marker or builtin, by changing the value of the nybble that varies. (In some cases, full control markers won't be able to encode all 16 possibilities for the range because some of them are in use for tacit variables. In this case, the FF encoding is used for the lowest positive integer that could not otherwise be encoded, rather than representing 15.) Ranges can be thought of numbers from which information is repeatedly extracted as the program is parsed, reducing their value, until they reaches 0; the initial value of the range is chosen so that the correct information will be extracted from it. The information encoded is: * which order the end-of-scope markers are in, relative to this start-of-scope marker (only if the range is on the start of a generator scope, i.e. a non-tacit start-of-scope marker or an `each`); * whether the parent statement has its arguments swapped (only for full control marker ranges, when the full control marker is an argument to a statement); * whether a tacit scope variable is being used in place of a tacit `y`; * which scope variable is being used for `NI` and `ZI`; * whether there's a tacit end-of-scope marker, and where it is (only if the range is on the start of a generator scope); * where to introduce a tacit `y` (only for full control marker ranges; in the initial expression of a 2-argument function, this is a tacit `x` instead). The only pieces of information here that are absolutely vital (in that they can't be encoded without using ranges) are the information about which order the end-of-scope markers are in, and the information about whether the parent statement has swapped arguments, and these can never require extracting more than 6 possibilities. Everything else is either related to tacit variables or tacit end-of-scope markers, and could be encoded explicitly if necessary; or is related to "secondary" scope variables (`NI`, `ZI` and friends), which are not required to work except near the start of a scope or full control marker. Thus, the limited space of a range (16 possibilities) is always sufficient to encode everything that's absolutely necessary, and the remaining possibilities are used to encode the nice-to-haves (such as tacit variables). The ranges are thus used in two ways: * Whenever something which is range-encoded and has multiple possibilities is encountered, the range is divided by the number of possibilities; the quotient is used as the new value of the range when parsing the rest of the program, and the remainder specifies which possibility to use. * After every location where a scope could end (for scope ranges), or a tacit `y` could be inserted (for full control ranges), if the range is 1, the scope ends or tacit `y` is inserted there (and the range becomes 0); then the range is decremented if it is not 0. And here are the specifics: * end-of-scope marker order: multiple possibilities; the position of an end-of-scope marker relative to the end-of-scope markers for scopes that started before its start-of-scope marker is encoded on its start-of-scope marker's range, with 0 meaning appearing as soon as possible, 1 a little later, and so on; * whether a statement has its arguments swapped: multiple possibilities, using the argument's full control marker's range (0 = not swapped, 1 = swapped); * end-of-scope marker position: location; encoded on the matching start-of-scope marker, with a range of 1 ending the scope there; note that because we know the order in which the scopes end, scopes which aren't the next to end will not "count down" their ranges unil they become the next to end * whether a tacit `y` is actually a tacit generator scope variable: multiple possibilities; if we're in a full control expression, it can only contain one tacit variable (except at EOF) and we record this together with its location; otherwise, we use the ranges of every open scope (2 possibilities: 0 = not my scope variable, 1 = my scope variable), checking from the first to end inwards until one of them claims it * which scope variable to use for `NI` and `ZI`: multiple possibilities; if we're in an expression with a full control marker, this uses the full control marker's range (0 = next scope to end, 1 = next-but-one scope to end, etc.); otherwise, we use the range of every open scope (2 possibilities: 0 = my scope variable, 1 = not my scope variable; this is reversed from the previous case), checking from the first to end inwards until one of them claims it * where to introduce a tacit `y` (or tacit `x` if this is the initial expression of a 2-argument function): a combination of the two possibilities; encoded on the full control marker, with a range of 1 inserting a `y`/`x` there, 2 inserting the scope variable for the next scope to end, 3 inserting the scope variable for the next-but-one, and so on; then the range is reduced by the number of scope variables plus 1; at EOF, the `y` possibility does not exist, so the range for a tacit variable at EOF is 1 lower Under these rules, a range of 0 therefore produces the "default" outcome (i.e. end-of-scope markers needing to be given explicitly if they aren't at EOF, `Y` in full control expressions needing to be given explicitly, `NI` and `ZI` defaulting to the scope variable for the next scope to end, tacit `y` being tacit `y` rather than a scope variable). When encoding, it's possible to run these rules backwards (scanning the program from right to left) to work out what range value would need to be used to get all the right tacit variables and scope markers in the right places. This enables the encoder to work out whether a tacit variable is disallowed due to being "too far away" from the location which encodes it; simply see whether the range has become too high to be encoded (16 for a scope marker, 16 minus the number of generator scopes we're inside for a full control marker), and complain if it is. Ranges normally correspond to nybbles in the encoding directly. However, in the case of full-control markers, they're rotated by -1 (with a range of 0 being encoded as FF, of 1 as F0, of 2 as F1, etc.) so that the set of valid ranges (i.e. those that don't clash with generator scope variables) is contiguous. Implicit scope -------------- Generators can have either an explicit or tacit start-of-scope marker. When the marker is explicit, then it has a range, and that's used to encode everything that's tacitly associated with its scope. However, when the start-of-scope marker is implict, and we aren't using `each` (which has a range despite being only 1 byte wide), then we need to use defaults for everything, rather than using a range. The Guide lists four situations in which the start-of-scope marker can be made tacit. The first is when `each` is used; that has a range, so nothing special needs to be done. The other three are 1-statement/1-builtin scopes, all-other-statement scopes `:y`, and initial-expression-root scopes `:x`. These each have rules of their own to specialise what choices to make for `i`-versus-`y` and the like, and tacit start-of-scope markers are encoded quite differently from explicit start-of-scope markers, so the only hard part when encoding is to determine which of the possibilities for a tacit start-of-scope marker is in use. In some cases, this is easy. 1-builtin scopes are only allowed in a full control expression and not at the root, and none of the other implicit scopes are allowed there. Likewise, initial-expression-root scopes are only allowed at the root of the initial expression (which is always in full control mode if it's more complex than a trivial `x` or `y`), and none of the other implicit scopes are allowed there. So the only difficult part is distinguishing between 1-statement scope and all-other-statement scopes. This distinguishing is done by rearranging the order in which the builtins appear in the function. For 1-statement scope, the generator appears in its "usual" place, before the statement that the scope is placed around. However, for all-other-statement scope (in which the generator applies to all the statements in the function apart from itself), although the generator appears as the first statement of the function in the source code, it appears in a different place in the encoding: right at the end of the function. This is not a valid location for a generator with 1-statement scope (because there are no statements for it to apply to), so the decoder will know (when it sees a tacit-`:` generator at the end of a function) that the generator has been moved and has an all-other-statements scope, and will be able to put it back into the right place. When using nested all-other-statements generators, they appear in the same order at the end of the function as they previously did at the start (i.e. we repeatedly rotate them from the start to the end when encoding, and the end to the start when decoding). The end-of-scope marker can also be tacit even when the start-of-scope marker is explicit (or `each`). The position of the end-of-scope marker is generally encoded in the range of the start-of-scope marker. If it's too far away, then a value is used that will cause no tacit marker to be inserted; the end-of-scope marker can *still* be tacit in this case, but it will have to come at the end of the function (allowing a tacit marker to be inserted because we haven't seen an explicit marker yet).