A Guide to Elly =============== This guide describes the Elly programming language. Elly is designed as a "golfing" language: it aims to be able to express programs using a minimal number of bytes. It also aims to be fairly easy to read and particularly fast to write: programs are written and read in a readable "source code" form. However, the source code must be converted into a heavily compressed "encoded" form before it can be used by the interpreter (this conversion can be done automatically, and the encoded form can also be expanded back out into valid source code automatically, although it might not use exactly the same syntax as the original; it will at least have the same functionality). Elly was inspired by the programming language [Jelly](https://github.com/DennisMitchell/jellylanguage) (which is also a golfing language), but (in addition to being easier to read and write) aims to save space over Jelly by sorting out parsing ambiguities more efficiently (especially in loops) and using better representations for the constants. This guide is intended to be usable both as a tutorial, and as a reference guide. Every feature of the language is discussed, but the order in which features are discussed is chosen so that reading the guide in order will avoid introducing too many features at once and form a sensible progression for learning the language. Full control mode ----------------- Like most programming lmnguages, Elly allows you to write arbitrary expressions by combining constants, functions and builtins in any combination you like. Unlike most languages, this is *not* the usual way to write an expression in Elly; most expressions can be abbreviated, allowing them to be encoded in fewer bytes. However, it's the most general-purpose method, and a good place to start when learning the language. If you want to write an arbitrary expression, you need to start it with a "full control marker", which is an at sign `@`; this lets an Elly implementation know that it should be looking at the code as-is, rather than trying to implictly insert variables. You can then follow this with the expression you want. Although it is unidiomatic to use `@` where unnecessary in Elly, this guide will start by placing an `@` on every expression; later on, it will explain the rules for where you can omit one, and then omit them from then on. An Elly *expression* uses *prefix notation*: constants are just written normally, but to call a function or use a built-in function (a "builtin"), you write the name of the function, followed by each of the arguments in turn. (Each function has a known number of arguments, so it's always possible to unambiguously match a function with its arguments, even without parentheses.) The various constants/functions/builtins that make up an expression are, in the source code form, separated by spaces. For example, if we wanted to write the expression "square root of (5 squared minus 4 squared)", we could write it like this: @ sqrt minus square 5 square 4 Elly is forgiving with names of built-in functions: most builtins have multiple possible names that can be used in the source code. So you could also write it like this: @ √ - ² 5 ² 4 These programs are equivalent; after being converted into the encoded form that's used as input to the interpreter, they will become the same sequence of bytes (the exact name you use for the builtin doesn't matter, the encoded representation only cares about which builtin you're using). This mechanism is designed to make it easier to remember the builtin names; you can guess at the name, and normally it'll be a valid name for the builtin. (It's also designed to make it easier to *type* the builtin names; names like `√` are easy to read, but not so easy to type.) Evaluation order in Elly is chosen to produce the same results as if the arguments of a builtin were evaluated before the builtin itself (from left to right), with the exception of builtins like `and` that are specifically designed to skip evaluating the right argument if the left argument has a specific value. (This detail normally doesn't matter unless calculating the arguments has side effects, but it means that you can, e.g., assign to a variable while calculating the first argument of a builtin, then read the assigned value while calculating the second.) When using a golfing language, it's usually important to know how many bytes your program will encode to (in order to know what sort of rearrangements will save bytes). For full control mode, this is very simple: the `@` costs a byte, and then you simply pay the costs for each builtin/function/constant you use individually (which is normally a single byte for each builtin and constant, except for the case of larger literals like `20` or `12345678` or `"Hello, World!"` which typically cost multiple bytes, and builtins whose names start with `@` which cost two bytes). Simple functions and programs ----------------------------- In order to write an actual Elly program, you have to start by writing the definition of a function; this will be the *main function* that defines what the program does. A function in Elly can take 0, 1 or 2 arguments. (A program can take more, but the extra arguments become global variables rather than being used as arguments to the function.) It is possible to specify the number of arguments explicity with an I/O specification (see below), but normally this would be automatically determined by looking at how many arguments the program was actually given. A function has two local variables; these are initially used to hold its arguments, but can be assigned to while the function is running (with no effect on the actual arguments). These variables are called `X` and `Y`. * If the function has two arguments, initially `X` holds the first argument and `Y` the second argument. * If the function has only one argument, this is initially stored in `Y`; `X` will start out undefined, but when the function's initial expression has been evaluated, `X` will automatically be given the value that the expression evaluated to if it did not gain a value as a side effect of evaluating the expression. * If the function has no arguments, `X` and `Y` both start out undefined. When the function's initial expression has been evaluated, its value will automatically be stored in both `X` and `Y` (although if either variable gained a value during the evaluation of that expression, that value will not be overwritten). The simplest form of a function (thus, of an Elly program) is a single expression; the value that the expression evaluates to will be the value returned from the function. (Except when an I/O specification is used to override this default rule, a full program will automatically print the return value from the main function to standard output, letting you see what value it computed.) As a running example through this guide, we'll try to write a program which takes a number *n* as input, and calculates (*n* + 2) × (*n* + 3). (In mathematical notation, this program would be "λ*n*.(*n*+2)(*n*+3)".) Using the language constructs implemented so far, we can implement this easily (but not very tersely) using a main function consisting of a single expression: @ × + Y 2 + Y 3 (If we weren't sure how to type `×`, we could have written it as `multiply` instead.) This program will be 8 bytes long in encoded form: 1 for the full control marker, 3 for the 3 builtin usages, 2 for the 2 small integer constants, and 2 for the two uses of `Y`. If we wanted to write this as a separate function within a larger program, we'd need to explicitly mark it as a function (the program as a whole is automatically interpreted as a function body, but if you have other functions, their bodies need to be marked as functions explicitly). That's done using parentheses around the function, and a number of arguments before them: 1( @ × + Y 2 + Y 3 ) A function is automatically called at the location where it's defined, so you generally define a function at the first location where you'd want to use it. Subsequent calls to the same function are done via using builtins whose purpose is to call user-defined functions. Functions are not allowed to be completely empty; you must have something between the `1(` (or whatever) and `)`, and the content you place there is not allowed to abbreviate down to 0 bytes. The exception is when you have no closing `)` (i.e. you're in the main program); in this case, you are allowed to simply just end the file to produce an empty function. If a function's closing `)` would appear at the end of the program, you may omit it. Pipelines --------- Although simple functions are easy to understand, a function in Elly doesn't need to consist of just a single expression. Instead, functions more commonly consist of a *pipeline*: an expression followed by a list of *statements*. Elly statements are sort-of like statements in most other programming languages, in that they could just be bits of code that you're calling for their side effects. However, they also have an input and a return value; the input of each statement is the return value of the previous statement or expression (and the return value of a function is the return value of the last statement in it, or of its initial expression if there are no statements), and statements are more commonly used to create an appropriate return value rather than for side effects. Any function or builtin that takes 1 or 2 arguments can be used as a statement: the argument to the statement will be used as the first argument to the function or builtin, and if it requires a second argument, an expression to produce that argument needs to be specified just after it in the source code. This gives us a couple of alternative ways to write our (*n* + 2) × (*n* + 3) program: @ + Y 2 × @ + Y 3 @ Y + @ 2 × @ + Y 3 (There is no requirement to add additional whitespace between the "pipeline stages" of an Elly program, but it won't cost bytes after encoding and makes it more readable, so it's generally worth adding some. The normal whitespace convention in Elly is to place one space between tokens, and one additional space before each statement.) Statements do not *have* to be formed from builtins that are usable in expressions; a few statements are usable only as statements. This sort of statement could potentially take any number of extra arguments (which are all written just after the statement name, prefix-notation style), in addition to their input. Sometimes, when writing a pipeline, it is useful to swap the arguments of a function or builtin; for example, the mathematical function λ*n*.(2 − (3 × (4 + *n*))) would be most clearly written as a pipeline that starts with *n*, adds 4, multiplies by 3, and subtracts *from* 2. In Elly, you can express this form of argument swapping by adding a `'` to the end of a function or builtin usage: @ Y + @ 4 × @ 3 −' @ 2 Although reversed builtins like `−'` don't have an encoding of their own, the `'` nonetheless never costs a byte; if it's followed by a full control expression, it's encoded via using a different encoding of the full control marker `@`, and otherwise it's encoded as part of the tacit pipeline stage shape (see below). You cannot use a reversed builtin within a full control expression (but you can typically just swap the arguments manually). I personally find the pipeline form of a program to be closest to the way I think about programs in golfing languages. Although it typically comes out longer when using nothing but full control mode, it also tends to be the tersest way to write a program when you take full advantage of the various abbreviation mechanisms (which are explained below), so it should normally be your first approach when trying to write an Elly program (with complex expressions used only for expressions that are too "branchy" to fit into a linear pipeline, or in cases where you hardly mention variables and thus could gain no advantage from abbreviating them). Implicit variables, and how to count bytes in Elly -------------------------------------------------- Elly is a token-based language. Although there are more than 256 spellings of tokens in total, many of them are equivalent (e.g. `*` versus `×` versus `multiply`) and thus encoded the same way, and many won't be legal in any given context. As such, there are fewer than 256 possible one-byte tokens at any given point in the code, so each can be unambiguously encoded as a single byte. The whitespace between tokens is just used to make the program easier to read (there is always some whitespace between tokens and the exact amount or nature of the whitespace doesn't matter, so the whitespace doesn't need to be encoded at all). Some tokens are rarely used, and intentionally encoded with multiple bytes. For example, there are a lot of possible literals, and (apart from small integers) each individual literal is rarely used; so a string, integer or list literal will often have to be longer than one byte. For literals, there isn't a clear visual indication of the extra bytes, so it helps to use an Elly encoding tool if you care about the exact number of bytes there. There are also some rarely-used builtins (such as `@first_last`, which returns the first and last elements of a list); despite being a single token each, these tokens are encoded as two bytes in order to save encoding space for more commonly used tokens. Two-byte builtins all have names starting with `@`, as a warning that an extra byte is being used (this is intentionally reminiscent of the full control marker `@`, which acts as a warning/reminder that if an expression isn't being abbreviated, it needs an extra byte to represent). Sometimes, though, a token will cost no bytes at all, allowing an Elly program to be shorter than it looks; a few tokens can be represented with no bytes at all (these tokens can be recognised in source code because they contain lowercase variable names). This section explains how this is possible, and how you can save bytes with implicit variables in your own programs. ### Implicit variables in full control expressions Implicit (tacit) variables are the main advantage/twist that Elly has over other golfing languages, and are often able to make programs much shorter without making them substantially less clear. Functions are generally built around their two main variables, `X` and `Y`. When written as capital letters like this, use of the variables costs a byte each. However, it's also often possible to specify that these variables should be used, but without spending any bytes to do so. Requesting an implicit variable is very simple: you just write the variable name in lowercase. So we can save bytes in the first two versions of our example program we've seen so far simply by making the first `y` in each expression implicit, leading to the following programs (which are each 7 bytes long when encoded): @ × + y 2 + Y 3 @ + y 2 × @ + y 3 Of course, it isn't always possible to "magically" save bytes like this; the fact that a `y` is used, and its location, needs to be specified somehow. Elly implements this by allocating more than one possible encoding to various tokens (most notably `@`, `:`, `each`); one possible encoding of `@` will simply just represent an `@` on its own, but another encoding will represent an `@` followed by a `y` as soon as possible, another will represent an `@` followed by a `y` a little later, and so on. ("As soon as possible" is generally after one builtin has been seen: you are not allowed to write a full-control expression that consists of nothing but a single tacit variable, thus `@ y` is illegal.) The Elly encoder is thus able to represent the exact program you requested, in the exact number of bytes you requested, via modifying the encoding of the `@` so that it encodes both itself, and the implicit `y`. In practice, this typically means that the cost of an `@` will "refund" itself as long as you use an implicitly representable variable fairly early in that expression. The variable that can be represented implicitly is usually either `y` or a generator scope variable (see the discussion of generators below); however, in the initial expression of a 2-argument function, it is `x` that can be used implicitly. (This section does not include details on the exact way that, e.g., the encoding of the `@` is adjusted. If you care about them, see the separate guide on Encoding for information on how Elly's encoding process works. However, those details shouldn't matter for learning the language; you just need to know that this saves a byte, and why it can save a byte, rather than the exact details on what numbers go where in the resulting encoded file.) ### Incomplete full control expressions There's another way that full control expressions can be abbreviated: if the function (or program) ends in the middle of one. In this situation, sufficiently many implicit variables are implied into the end of the encoding to complete the incomplete expression. We can take advantage of this by placing implicit variables at the end of the program in our source code. The last variable placed will be a `y`; if more than one variable is missing, all but the last will be an `x`. So our example program: @ × + Y 2 + Y 3 could be written like this, by rearranging it to end with a `y`: @ × + y 2 + 3 y in order to allow us to make both `y`s implicit, giving us a 6-byte program. (The first `y` is being encoded in the full control marker, the second `y` at EOF.) We're making progress! ### Tacit pipeline stages You may be starting to get sick of the `@` signs everywhere (especially because they often cost bytes). As explained above, the full control marker is necessary whenever you want to be able to write an *arbitrary* expression, which can have any shape you want and which puts the variables wherever you want. However, most expressions aren't arbitrary, and have one of a few common shapes and use variables in one of a few common ways. If you don't use a full control marker, therefore, you can still write expressions, and statements containing expressions. To keep things simple, the meaning is exactly the same as it would be without the full control marker. However, rather than being able to write arbitrary expressions, you are limited to using a few common shapes for the expression; the code is a syntax error if you try to go outside the set of common shapes and don't spend a byte on a full control marker to permit this (because most of the available non-full-control encoding space is used to encode *which* tacit shape is in use, in addition to the builtins/constants/functions that make up the statement or expression). But because these shapes are so common in practice, you can often write much of the program without resorting to full control. The big advantage is that the common shapes are often packed full of implicit variables, saving you bytes not only on omitting the `@` signs, but also by lowercasing your `x`s and `y`s. For the purpose of determining tacit pipeline stage shapes, the 2-argument functions are split into two groups, C (which includes builtins which there is never or rarely a reason to swap the arguments of); and D (all other 2-argument builtins, and user-defined 2-argument functions). Here are the various tacit pipeline stage shapes available (here, `K` represents any constant / 0-arg function, `F` any 1-arg function except generators, `G` any 1-arg function that is a generator (see the section on generators below), and `C` and `D` represent 2-arg functions in groups C and D respectively): * As an initial expression: * `x` (only in the initial expression of a 2-arg function) * `y` (only in the initial expression of a 1-arg function) * for a 0-arg function, *any* initial expression may omit the `@` (as a tradeoff for this, no implicit `x`/`y` will be available) * As a statement: * `F` / `G` * `C y` / `D y` * `C K` / `D K` * `C F y` / `D F y` / `C G y` / `D G y` * `D' K` * `D' G y` * `C G K` * `C G F y` * `C C x y` / `C D x y` * `C C K y` / `C D K y` (only in cases in which for the `C`/`D` in the argument, `C x y` / `D x y` is trivially equivalent to something of the form `F y` or `K`, typically because neither `X` nor `Y` have changed since `Y` was implicitly assigned to `X`) * `C G C x y` / `C G D x y` * `C G C K y` / `C G D K y` (under the same circumstances in which `C C K y` / `C D K y` are allowed) * For the last statement in the function, `D' y` and `C x` are also permitted tacit shapes (these are not allowed for earlier statements), except when `y` is in use as a generator scope variable. Above, we looked at the following pipeline for calculating our running example: @ Y + @ 2 × @ + Y 3 Under full control, this looks pretty horrible (and is 10 bytes long!). However, we can easily convert it entirely to tacit pipeline stages: y + 2 × + 3 y This is not only more readable than what we've seen before, it's also shorter: because the two uses of `y` are implicit, this comes to only five bytes. Note that we had to swap the arguments of the `+` at the end so that the last pipeline stage has a valid tacit shape; addition produces the same result regardless of its argument order (which is why it's in group C), so this swap doesn't change the meaning of the program. The `× + K y` shape is legal because if we had hypothetically written `× + x y` instead, we could alternatively have written it as `× twice y`, i.e. in the form `C F y`, making the encoding that would otherwise have been used for `× + x y` available for a different shape. The Elly decoder takes into account whether or not we necessarily have `X` equal to `Y` when working out whether expressions like `+ x y` could be useful or not. Normally, you will want to write an Elly program almost entirely with tacit pipeline stages, both to save on full control markers and to make as many variables implicit as possible. Variables --------- As mentioned above, the most important varilables in Elly are `X` and `Y` (especially `Y`, which is the most commonly available in tacit shapes). `X` and `Y` are local variables, local to the function that contains them. There are a few other variables, too; most of these are global variables that hold their value throughout the whole program, with the most important being `R` and `M`. In order to assign to a variable, we normally use the `→` (or `->`) syntax. Each of the four main variables has its own assignment operator as a builtin, e.g. `->Y` is a 1-argument builtin that assigns its argument to `Y` and returns the value assigned. When these builtins are used inside an expression (rather than as a statement on their own), they automatically turn on full control mode (thus, for example, you can't write `+ →X y` except at the end of a function, but can write `+ →X × Y 3`); this is based on the reasoning that there isn't normally much point in assigning one variable directly to another as a side effect. `→X` and friends have only a single encoding, so you can't place tacit variables inside this sort of full control expression (except at the end of the function). As a side note, functions with fewer than 2 arguments have an automatic assignment to `X` after the initial expression is evaluated (unless it was explicitly assigned within the initial expression). To make programs more readable, you should make this implict assignment explicit by writing `→x` (or `->x`) at this point in the program, unless the value of `X` is not used beyond that point. (Likewise, functions with 0 arguments allow an explicit `→y`.) These cost no bytes because they exist implicitly anyway (thus the lowercase variable names), and the variable will actually be assigned to even if you forget them. ### Saving bytes via using variables Frequently, functions will discover that they don't need to remember the value of `X` and/or `Y` all the way through. So it's often worth thinking about whether using a variable to remember a partial result will lead to shorter code to recalculate it. Let's think about our running example, the λ*n*.(*n*+2)(*n*+3) program, which we have so far "golfed down" to 5 bytes. Suppose we wanted to implement a slightly different program instead: λ*n*.(*n*+2)÷(*n*+3). We can't simply just write `y + 2 ÷ + 3 y`, because `÷ + 3 y` is not a valid tacit pipeline stage (unlike `×`, the argument order to `÷` matters, so the encoding space is needed to remember which order the arguments are supposed to come in). However, by using a variable assignment, it's nonetheless possible to write this program in 5 bytes. The trick is to notice that "increment" (i.e. "plus 1") is a byte cheaper to express than "plus 2" or "plus 3". That means that instead of adding to our argument `Y` twice, we can remember the value of `y + 2` and increment it: y + 2 →Y ÷ inc y Incidentally, the same program can be encoded in 4 bytes by using a mathematical rearrangement: (*n*+2)÷(*n*+3) = (*n*+3−1)÷(*n*+3) = (*n*+3)÷(*n*+3)−1÷(*n*+3) = 1−1÷(*n*+3), which we can write as y + 3 reciprocal complement When golfing, it's important to look for rearrangements that could help you express the same algorithm in fewer bytes. But it's also important to look for a simpler algorithm, because maybe that algorithm will come out terser than your original algorithm did. ### Which variables are available? `X` and `Y` are general-purpose local variables (local to the containing function), which initially hold the arguments but which you can subsequently use how you like. `X` has an additional purpose, being automatically overwritten by a few builtins that calculate a related value to the value you're calculating, e.g. when doing floor-division of two numbers, the quotient will be output, and the remainder will be stored in `X`. Additionally, if a function contains an explicit assignment to `X` (`→X`), but does not use `X` anywhere after the assignment, the last value assigned to `X` will be used as the function's return value (rather than the return value of the last statement in it). This return value override is based on whether `X` actually *was* read, not on whether it *could have been* read. `Z` is a special-purpose local variable; it refers to the return value of the previous statement (or of the initial expression if this is the first statement). Normally this isn't very useful (as that value is typically being used implicitly anyway), but you might sometimes want to specify it as `Z` in a complex full control expression. You cannot specify it tacitly. `I`, `J`, `K`, `L` (and `I0`, `I1`, … `I9`) are temporary, read-only, local variables used to store the "scope variable" of an active generator. They can be named tacitly (e.g. as `i`) in the same contexts as `y`, as long as they are used sufficiently close to the start of the generator's scope, or of a full control expression; at longer distances, they may need to be named explicitly as 1 byte (`I`). See "Generators" below for more information. `NI`, `NJ`, `NK`, etc. are *numerical* loop counters, associated with the generators whose scope variables are `I`, `J`, `K`, etc.; for example, when looping over a list using scope variable `I`, `I` will hold the list element and `NI` will hold the index of the location within the list from which that element was taken. These variables can never be tacit, and numerical loop counters associated with generators other than the next to be collected can be named only near the start of a generator's scope or the start of a full control expression (otherwise, there is nowhere to encode the choice of which loop counter you mean); additionally, if they appear inside a full control expression, that expression cannot contain tacit variables to the left of the non-next-to-be-collected numerical loop counter. `ZI`, `ZJ`, `ZK` are effectively the "working space" for parallel generators; they are lists, holding the values of `Z` on every parallel stream of the generators at once. In `each`-like generators, collecting the generator with scope variable `I` (`I.`) will return the value of `ZI`. `ZI` and friends have the same encoding restrictions as `NI` and friends (i.e. when referring to a generator other than the next to be collected, you're very restricted in where you can put them). `R` is a general-purpose *global* variable (whose initial value is a falsey empty list). You can assign to it in one function, then read the value in a different function (whereas assigning to `X` or `Y` won't change the value of the corresponding variable in the calling function). Similarly to `X`, if a program assigns to `R` and yet the value of `R` is not read at any time during the execution of the program (except by builtins that modify `R` in-place without exposing its value to the calling code), the last value assigned to `R` will be the output from the program (as opposed to the output of the main function). This is slightly different from the rule for `X` because it affects the program as a whole, not just a single function; and because reads of `R` before the assignment turn off the special handling. `M` is a bit of a special case: it can be thought of as a global map/dictionary, or as a set of global variables that are named after Elly values. For example, `M 3` is a specific variable, that can be read from and assigned to, and so is `M [10]`, and `M "abc"`, and even `M null` (note that although nulls are not normally considered equal to each other, as discussed below, they are considered equivalent for the purpose of looking up values in `M`). This means that `M` itself is not a 0-argument builtin, but rather a 1-argument builtin (the argument is the key of the `M` entry to read); and `→M` is a 2-argument builtin (the value to assign is the first argument, the map key is the second). Every value within `M` is initially a falsey zero. `$` is a list that holds the command-line arguments to the program as a whole. Separately, there are variables `$1`, `$2`, `$3`, `$4`, `$5` that hold the first 5 command-line arguments to the program, if they exist. If there are fewer than 5 arguments, then the unused arguments in `$2`…`$5` are initialized to certain constants that would otherwise require two bytes to represent: * `$2`: 16 * `$3`: 32 (the codepoint of a space character, in ASCII/Unicode) * `$4`: 256 * `$5`: 100 `$`, and `$1`…`$5`, are read-only; they don't have assignment builtins or statements of their own. `$1` is also a bit of a special case. In cases where some unusual encoding is in use (see "Unusual encodings" below), then the variable `$1` will contain information specific to the encoding, allowing the program to introspect into the way it was encoded. (This does not change the actual first input to the program, which will still be used as the main function's initial `X` or `Y`, and which will still be usually accessible as `@ first_nonnull $`, or `@ index_into 0 $` if your program accepts nulls as arguments.) There are no other variables available. For most small programs, `X`/`Y`/`R` will be sufficient (in fact, even `R` is often unnecessary), but if your program is large and you need a lot of variables, you can create an unlimited supply by using entries within `M`. Generators ---------- Unlike imperative languages, where control flow is typically accomplished using control flow statements, the usual (and idiomatic) way to perform control flow in Elly is to use *generators* (the only other forms of control flow are recursion and `when`). Generators can be thought of as control flow builtins whose scope is determined by the way that values flow through the program; they return a value multiple times (in series or in parallel), and then are *collected* at some subsequent point (which either starts the next iteration by returning control flow to the generator and allowing it to returning a new value; or else stops looping and outputs some value based on the result of all the computations so far). For example, let's look at the generator `each_prefix`, which runs some code on each prefix of a list (with the outputs being listed in order from shortest input to longest input). Here are a 0-argument program, and an expression, that calculate the squared sum of the prefixes of a specific list, in order from longest to shortest: [1,3,5] each_prefix :I sum square I. reverse @ reverse I. square sum :I each_prefix [1,3,5] The program returns, and the expression evaluates to, [81,16,1,0]. In addition to the generator builtin itself, the important syntax here is the `:I … I.`. This specifies the scope of the generator. The `:I` at the start of the scope specifies what the scope is named (it can be named `I`, `J`, `K`, `L` or `I0`, `I1`, `I2`, etc.; the name is not encoded in the program so your choice of name doesn't matter). The `I.` at the end shows the end of the generator's scope, where the generator is collected. So in this example, the scope is around two builtins (`square` and `sum`); while in the scope of the generator, they will run once for every prefix of the list that was the input to `each_prefix`, looking at a different prefix each time, so three different numbers have their squared sum calculated. When an `each_prefix` generator (in general, any prefix whose name starts `each`) is collected, the various results at the point of collection get combined back into a single list. The `reverse` is outside the scope of the generator, so it affects the entire list of results as a whole, running only once. Within the scope of a generator, the variable with the same name as the scope holds the value that was returned from the generator. You can name this value tacitly in the same contexts as you would normally be able to tacitly name `y` (as long as you are close enough to the `:` start-of-scope marker, because the choice of `y` versus `i`, etc., is encoded using a choice of multiple encodings for `:`). You can also name it tacitly in a full control expression, if close enough to the leading `@` (in this case, it's the full control marker that encodes `y` versus `i`). If we wanted to take the sum of the squares of each prefix of a list, rather than the square of the sum, one approach would be to take the dot product of each prefix with itself (dot product is "multiply corresponding elements, then sum"): y each_prefix :I dot_product i I. `dot_product y` is a legal tacit shape, so we can use `dot_product i` here instead. You are not allowed to nest generators more than 7 deep in a single function (placing a limit on this reduces the amount of encoding space that needs to be dedicated to highly nested generators, freeing up encoding space for encoding other constructs). ### Tacit scope markers In our examples so far, we've been using explicit scope markers; the start-of-scope marker `:` costs a byte, and the end-of-scope marker `.` costs an additional byte. As usual for Elly, that's the general but inefficient way of doing things, and there are some cases in which we can make one or both scope markers tacit, saving a byte or even two bytes. The notation for a tacit scope marker is, unsurprisingly, to write the name of the scope variable in lowercase. The `.` marker can be made tacit in most situations; colons have multiple possible encodings, and so if the `.` is close enough to the matching `:`, the exact position of a tacit `.` marker can be specified by varying the encoding of the `:`. The exact distance you can reach depends on how many other things we're trying to encode in the colon (a lot of implicit `i`s and `y`s in between will really shorten the distance, for example, as we need to encode which is in use; and heavily nesting generators will make the reachable distance a little shorter); as a special case, an end-of-scope marker can always be tacit if it appears right at the end of the function (with nothing but other tacit scope markers after it). The `.` marker can also be made tacit (in fact must be made tacit) if the `:` marker is tacit and the generator is not `each`. However, if the scope of a generator is empty (i.e. there are no builtins between the start and end of the scope), the end marker may never be tacit (often this implies that the start marker cannot be tacit either). As for the `:` marker, there are four situations in which this can be made tacit. The first is if we're using the `each` builtin, which runs the statements in its scope once for each element of the original list (and combines the results back into a list when collected at the end of its scope, as usual for `each` generators). `each` is sufficiently commonly used that it always gets an implicit colon (and may not use an explicit colon); it has just as many different encodings as the colon does, so the exact sort of colon we're using is implicitly included within the `each`. Our running example, the λ*n*.(*n* + 2)(*n* + 3) program, can be written in terms of an `each` generator: y + :i each [2,3] i. product The program is 5 bytes long, tying our other attempts so far. (Normally we'd need a full control marker to write a statement of shape `C F K`, such as `+ each [2,3]`, but there is a special case for 1-argument generators and `C G K` is allowed.) However, it generalises in a different way from the other attempts (you could easily modify it to calculate λ*n*.(*n* + 2)(*n* + 3)(*n* + 5), for example). The second situation is a "1-statement scope"; if a scope consists of exactly one statement, then the `:` at its start can usually be tacit. (An exception: you cannot make it tacit if the generator is the first statement in the function, the generator takes a second argument, and the function's initial expression is a tacit `x` or `y`.) In this situation, if the statement doesn't have a full control marker, there's nowhere which can encode tacit `y` versus tacit scope variables; the rule is that within a minimum size 1-statement scope, a tacit `y` is accepted only if the statement is of the form `2_argument_function y` exactly (you cannot use a tacit scope variable with that sort of context), and any other tacit use of a variable within the statement must be of the scope variable. Related to this is a "1-builtin scope" within a full control expression, which also allows an implicit `:` (in this case, the scope variable cannot be used tacitly); note that this is only allowed in full control expressions, not when the generator is part of the expression of a tacit statement, and may occasionally be disallowed within deeply nested scopes (because some of the encodings normally used for this are instead needed for the scope variables `I`, `J`, `K`, etc.). The third situation is the opposite extreme: when the scope of a generator that is a statement covers all the other statements of a function (not just all the *remaining* statements; the builtin creating the generator has to be the first statement after the initial expression). In this case, the scope variable entirely replaces `Y`; in fact, you can think of `Y` being used as the scope variable. The syntax for this uses `:y` as the start-of-scope marker; the end-of-scope marker would be `y.`, but you can omit it if you like. (As an exception to the "cover the entire function" rule, you may nest two or more whole-function scopes, along the lines of `y each :y each :y reverse y. y.`. Only the innermost `y` will be visible, with the others being shadowed.) The fourth situation is similar: a generator that forms the root of the initial expression of a 0- or 1-argument function. In this situation, the generator's output is implicitly being assigned to `X`. You may therefore use `X` as the scope variable if you wish. In this situation, the colon may be implicit if the matching `.x` comes at the end of the function (you may still use `X` as the scope variable if the `.x` comes elsewhere, but will need to spend a byte on `:X` in that case). For example, it's legal to write this as a 0-argument program: :x each_prefix [1,2,3] →y pair reverse y x. This would produce the output [[[],[]],[[1],[1]],[[1,2],[2,1]],[[1,2,3],[3,2,1]]]. (Note that there's no cheap way to specify the scope variable `x` directly, but the implicit `→y` in a 0-argument function means that we can just use `y` instead.) It is worth noting that this situation uses the same encoding that would otherwise be used for a 1-statement scope; therefore, if you want to write a generator that starts in a function's initial expression and is scoped to only its first statement, you will need to use an explicit `:I` or the like for its scope. Our "sum of squares of each prefix" example covers both a single statement, and the entire function, so we could write it in any of the following forms (explicit colon, 1-statement tacit colon, from-initial-expression tacit colon, all-other-statements tacit colon, with the end of scope being tacit in each case): y each_prefix :I dot_product i i. y each_prefix :i dot_product I i. @ :x each_prefix y dot_product X x. y each_prefix :y dot_product y y. In this case, the last form is preferable; the first form costs a byte for the colon, the second and third cost a byte for the scope variable (which can't be tacit because the default for a tacit variable in this context is `y`), and the third costs an additional byte for the full control marker. The last, however, allows both the variable and the colon to be tacit, giving us a 2-byte program. The various tacit forms exist because they make different trade-offs; some (such as the all-other-statements form) are very terse when they work but often won't fit your program; others (such as explicit start of scope with tacit end of scope) will be usable for most programs but are slightly longer. It's often worth trying them all out to see which will give the shortest code for any particular problem you might be having. ### Generator control flow patterns There are two main sorts of generator in Elly, in terms of how they affect the program's control flow. One possibility is to create an explicit loop, with separate iterations (a *serial* generator). For example, the `first_n_ints` builtin creates a generator that loops through successive integers as input, stopping when it finds a given number of truthy results (and outputting the inputs that lead to those results). In this situation, it's collecting the iterator that causes the next iteration to occur. For example, look at this 0-argument program to find the first 5 square numbers that are 1 less than a prime number (because there are no arguments, we can write the leading `5` tacitly with no `@`): 5 first_n_ints :y is_square and inc y is_prime y. Here, the loop starts at `first_n_ints :y` and ends at `y.`, so all three statements in between will be run once on each iteration. If we added a statement with a side effect there, it would run every iteration; if we added two statements with side effects there, they would alternate with each other, as both would run on every iteration. `first_n_ints` behaves like this because the number of iterations to run is not known when the generator starts to generate possibilities. This iterative sort of generator must be entirely well-nested; any generator whose scope starts outside its must have a scope that finishes outside its, and any generator whose scope starts inside its must have a scope that finishes inside its. Most generators, however, are like `each`: there's a fixed number of times the statements inside can run, and each of the executions are independent of each other. This sort of generator (a *parallel* generator) does not place a single loop around all the statements inside it, but rather runs each of them in the usual sequence; only when a value derived from the generator is used is a loop needed, and the loop will run just around a single function or builtin. If we look for squares that are 1 less than a prime in an explicit list (we can do this using the `filter` generator, which is like `each` but creates a list of the inputs that mapped to truthy values when the generator was collected, rather than a list of the values that were collected themselves), the program will look very similar: [6,9,16,36] filter :y is_square and inc y is_prime y. but now the evaluation order is different, and all four of our hardcoded numbers will be checked for squareness first, then incremented, then checked for primality. If we had placed two builtins with side effects inside this `filter` scope, then we'd see all the side effects from the first builtin, followed by all the side effects from the second bultin – even if they were part of the same expression. The main reason for this modified evaluation order is that it allows generators to be non-well-nested. It's entirely legal to do something like this: [[1,2,3],[4,5]] each :i each :j double i. j. `each :i each :j double j. i.` would simply just double each variable of the nested structure – `each :j` runs twice, so there are two `J` generators (that get collected simultaneously), but they just reconstruct the lists they were given. This mis-nested version, however, means that the `I` generator has to be collected before the `J` generator. This collection happens within the scope of the two `J` generators, so we end up with one list entry for each value from each `J` generator, producing the six lists [2,8], [2,10], [4,8], [4,10], [6,8], [6,10] (which end up being collected into a single list by the `J` collector). Note that the two `J` generators have merged with each other; there were two `J` generators while the `I` generator still existed, but they merged when the generator that was keeping them separate (here the `I` generator) was collected. (In general, you can always look at where you are relative to the scopes to see how many copies of something there are, e.g. something that is copied by an `I` scope will have multiple parallel copies only inside the `I` scope and will always be merged when leaving it.) Mis-nesting generators is allowed only when both generators are either `each`, or have an explicit start-of-scope marker; tacit start-of-scope markers cannot be mis-nested, except on `each`. (And of course, both generators must be parallel; serial generators cannot meaningfully be mis-nested.) In a parallel generator, it's also possible to peek at the output from the "other" streams that are running in parallel. The variables `ZI`, `ZJ`, etc., hold the lists that would be produced if the generator with the corresponding scope variable (`I`, `J`, etc.) were collected right now (accessing the variables does not actually collect the generator, though, merely lets you know what would happen if you did); in other words, they contain all possible variations of `Z`, in all the streams visible from the generator. Using these variables sometimes saves the need to collect a generator, look at `Z`, and restart the generator again (perhaps forgetting the value of its scope variable in the process). ### Two-variable generators Although the vast majority of generators take one *argument*, in the sense that one input is provided to the builtin, a couple of generators effectively have two scope variables. For example, `fold` evaluates some code on the first two elements of a given list, then on the result and the third, then on that result and the fourth, etc.; this means that two pieces of information need to be provided from the generator to its scope. In Elly, generators like `fold` nonetheless only have one scope variable named – the second of the two variables. This is the variable that can be referred to by name, and used implicitly in the same contexts that `y` can; and it is written at the ends of the scope in the same way that normal scope variables are. The other is the "output from the generator builtin", which will be the input to the first expression or statement inside the generator. (You can think of this mechanism as being similar to the two arguments of a two-argument function; `x` is the default for the initial expression, but most tacit mentions inside the function will be of `y`.) As a simple example, if we wanted to write a function that sums a list without using the `sum` builtin, we could fold addition over the list: y fold :j + j j. If you need to refer to the "first" scope variable more than once, you will need to assign it to some other variable for safekeeping, or (if you don't want to overwrite any of your variables) use a temporary function call so that its parameters act as temporary variables. Here are three ways you could fold λ*i*λ*j*.*i*+*ij* over a list: y fold :j 2( x + × x y ) j j. y fold :J →X + × x j j. y fold :y →X + × x y y. However, often you can avoid the need for this by either using `Z`: y fold :j + × Z j j. or rearranging the equation: y fold :j × inc j j. ### `while` and `until` [TODO: need magic for the RHS of these] ### `when` Apart from generators and recursion, Elly has one other way to do control flow: `when`, typically used to produce an early return from the loops created by generators. `when` is not itself a generator, but rather a 2-argument builtin. If its second argument is truthy, it simply just returns its first argument; but if the second argument is falsey, it effectively "cancels" the current code execution. That means that the executions of a parallel generator that contain a failing `when` are removed entirely (so they won't be collected), and an iterative generator or `while` loop will stop at the point it had reached at the end of the previous iteration. If a failing `when` is used outside a loop, it causes the currently running function to stop executing and immediately return `null`. For example, this program will output [1,5,7]: [1,2,5,6,7] each :i when is_odd i i. `when` belongs to group C, so you can use fairly complex tacit shapes with it (at the, generally irrelevant, cost of not being able to write its reversed form `when'`; in many but not all of these situations you can use `and` instead). Data types and autovectorisation -------------------------------- At this point, it's worth looking at what data types can exist within an Elly program. There are basically only three types of data: * **Numbers**. Most Elly programs use only integers for their numbers. However, Elly does transparently support other types of numbers (such as decimal, rational and complex numbers); these can be freely mixed with integers and will produce the mathematically expected results. Additionally, they will be used to represent the mathematically expected results when necessary, e.g. the square root of −1 is an imaginary number not an error, and `÷ 2 3` is the exact number ⅔ rather than being rounded to 1 or truncated to 0, even though the inputs were integers. There is no maximum value for a number; Elly uses an unbounded (bignum) representation for integers, and for the numerator and denominator of a rational number, and these representations are used for the real and complex parts of a complex number. Irrational numbers do, however, have to be approximated. Numeric literals are limited to integers. In source code, the normal form of a numeric literal is written in decimal. Negative literals start with a hyphen-minus or Unicode minus sign, and (except in the case of −1) cost a byte more than the corresponding positive number. Infinities, NaN, etc., are not considered to be possible values for numbers. Operations that should construct an infinite or impossible result, such as division by 0, return null in Elly. Elly does not have a separate character type. Rather, each number has a "display flag" that determines whether it should be displayed as a number, byte or character when doing I/O. This flag has no impact on any calculation – values with the flag set act the same way in calculations as values without the flag set – but will affect whether, e.g., a number with value of 65 is output as `65` or as `'A'`. The flag is preserved by builtins that just copy a number around without inspecting it, and automatically set back to "number" in most other cases (with the exception that addition, and some addition-based operations like increments, will keep the display flag from the additon's left argument unless the resulting value would be inconsistent with the display flag, such as a byte with value 259, or a character code with value 4½). To support display-flagged numbers, there is a second form of numeric literal available in the source code: the byte literal, consisting of a backslash followed by any byte (this is a token on its own, so it must be followed by a space as usual), and encoding a number display-flagged as "byte". For example, `\d` is an encoding of the number 100, but flagged to display as a byte. This can also be spelled as `byte.` followed by two hexadecimal digits, e.g. `byte.64`. When doing I/O on numbers that requires an unambiguous representation (e.g. printing them as a list element, or parsing them from a string), character-flagged numbers are written as a character in single quotes (with special cases of `'\''` for a character-flagged 39, and `'\u{…}'` for unprintable characters), byte-flagged numbers are written in backslash-byte or `byte.xx` format (depending on whether they are printable or not), a number-flagged 0 is written as `0`, and other number-flagged numbers are written in the form (a+bi)/c or (a-bi)/c (with a and a possible subsequent + omitted if it is 0, b omitted if it is 1 or −1, +bi omitted if *b* is 0, /c omitted if *c* is 1, and the parentheses omitted if any of *a*, *b*, and *c* are). If a number is printed on its own, and is not number-flagged, the character or byte it represents will be printed directly (e.g. a character-flagged 100 prints as `d`). * **Lists**. A list has a specific length (which is a nonnegative integer), and a number of elements equal to that length, e.g. a length 3 list has a first, a second, and a third element. Lists are 0-indexed, i.e. an index of 0 corresponds to the first element, of 1 to the second element, and so on. (Many builtins that operate on lists will "wrap" indexes beyond the list, e.g. element −1 is the last element, and element 4 of a 3-element list is the second element. But this is part of the behaviour of the builtin, not part of the definition of a list.) List elements can be of any type supported by Elly, i.e. they can be numbers, other lists, or even null. There is no requirement for all the elements of a list to have the same type, e.g. a list like `[1, [2, []], null]` is perfectly legal. Elly does not have a separate string type; a string is just a list of its characters. For example, the string `"Elly"` is actually just the list `[69, 108, 108, 121]` (the display flag would typically be set to "character" on each of the elements, but it is almost invisible to Elly and matters only when formatting output, and calculations will produce the same value regardless). A list may be output using string or bytestring notation if every element of the list has a "character" or "byte" display flag respectively, or if the list contains a mix of "character" and "byte"-flagged elements but they all correspond to ASCII characters. As with numbers, if a list is printed on its own (rather than being an element of another list, or used in a context where an unambiguous representation is reuqired), a character string or bytestring is just printed directly by printing the characters or bytes it contains. List literals are limited to lists of nonnegative integers, but other sorts of lists can be created at runtime via running code or parsing strings. In the encoded form, there are four types of list literal: one which is specialised to encode the sort of lists that contain opaque integers, one which can only encode ranges from 0 or 1 to a small-to-medium integer, one which is specialised to encode lists of character codes and which sets the display flag to "character" on the elements, and one which can only encode lists that are valid encodings of some Elly program (which have no trailing bytestring). In source code, list literals are either written as a sequence of decimal integers, separated by commas and enclosed in square brackets; or as a character string in double quotes `""` (this allows escape codes like `\\`, `\"`, `\n`, and `\u{…}`); or as an Elly program surrounded by (depending on the number of arguments it takes) `@0{ … }`, `1{ … }`, or `2{ … }`, to represent the list that contains the bytes that encode that program (you may omit the `}` at the end of the file if you wish, but this doesn't save any bytes). The `@0{` delimiter costs an extra byte compared to the other two, thus the `@` at the start of its name. In I/O, the square-bracketed and double-quoted forms are allowed, as is a bytestring form (which is the same as the string form but with an extra "b" at the start and with `\x` escapes rather than `\u{…}` escapes, e.g. `b"a\x01b"`). * **Null**. A null is a special value, distinct from all valid values; to obtain a null, you can use `null`, a 0-argument builtin that returns one. A comparison involving a null is treated as meaningless; all such comparisions will return false, even when comparing two nulls for equality (e.g. it is not true that `[1, null]` is the same list as `[1, null]`), although some builtins (such as `→M`) treat two different nulls as being interchangeable. Bear in mind that the rule is not "null is not equal to null"; two nulls are neither equal to each other, nor unequal to each other, they're impossible to compare at all. A null is used for two purposes. One is as the main representation of Boolean false (most values in Elly can be either truthy or falsey depending on how they were produced, but nulls are always false). The other is as a sentinel value to specify that part of a data structure is inapplicable / has errored / has been determined not to be the part that we're looking for / should be abandoned. Generally speaking, any calculation that takes a null as input will produce a null as output, unless the calculation is specifically designed to operate on nulls or on Booleans. If a calculation takes a list including a null as input, or takes a null as input and produces a list, this will only null out any parts of the output which would depend on the null value specifically. Additionally, if a calculation takes a list as input and inspects its elements, but interprets it as an unordered collection of elements rather than as an ordered list, generally any nulls within the list are ignored (and only the non-null elements are used); for example, `lex_maximum` will give you the maximum non-null element of a list. In I/O, a null is represented as `null`. Builtins will generally try their best to operate on data that has the "wrong type". In the case of nulls, this is easy (if the builtin isn't specialised for operating on nulls, you get a null result). The rules for unexpected mixes of lists and integers are more interesting, and are known as *autovectorisation*: * If a builtin/statement expects a list, and is given an integer, the integer will be treated as a list of the nonnegative integers strictly below that integer. So, e.g., `@ i. square :i each 4` is an expression that evaluates to the list `[0,1,4,9]`. This happens regardless of the other inputs. At present, the behaviour when a non-integer number attempts to autovectorise into a list in this way is not defined in general. However, some builtins will special-case this to generalise what their behaviour on the integers is to other sorts of numbers; for example, `index_into` (which indexes into a list, wrapping if necessary) effectively computes a modulus operation when indexing into an integer, and thus using `index_into` to "index into" a non-integer number will also calculate the value of the "index" modulus the "list". * If a builtin/statement has exactly one argument that is expected to be a number, but is actually a list, an implicit `each` will be added onto that argument (with its scope covering the builtin but nothing else). So, e.g., `@ + [1,2] 3` is an expression that evaluates to the list `[4,5]`; we've added `3` to each element of the list `[1,2]`. * If a builtin/statement has two or more arguments that are expected to be numbers, but are actually lists, then it will run in a loop over the first elements of the lists in question, the second elements of the lists in question, and so on. (This is not the same as adding an `each` to every argument; `@ + [1,2] [3,5]` is `[4,7]`, whereas `@ j. i. + :j each [1,2] :i each [3,5]` is `[[4,6],[5,7]]`. The generator with the same behaviour as the autovectorisation is `corresponding`.) If the lists have different lengths, then the shorter list will be padded to the length of the longer using falsey integer zeroes: `@ + [1,2] [3,4,5]` is `[4,6,5]`. * The above rules apply recursively. For example, a list of lists can end up creating two nested loops: `@ + [[1],[2,3]] 4` is `[[5],[6,7]]`. Likewise, `tighten` (which expects a list of lists and concatenates all those lists together) may end up expanding integers to lists twice: `@ tighten 4` is `[0,0,1,0,1,2]` (the tightening of `[[],[0],[0,1],[0,1,2]]`). In the previous section, we saw how we could write our running example as `y + @ :i each [2,3] i. product`. Using autovectorisation, we can see that the `each` is unnecessary; the `+` (which is designed to add numbers) will autovectorise when given a list as an argument. So we can write the program as `y + [2,3] product`. This encodes to only 4 bytes, just half the size of the naive full control version, and is probably the shortest way to express the program in Elly. ### Comparisons One thing to be aware of is that Elly has two basic families of comparison builtins. One of these families are the *numerical* comparison builtins. The basic numerical comparison builtins are `≤` and `≥`. As the names suggest, they are only used to compare numbers; if you try to compare lists using them, they will autovectorise. So for example, the expression @ ≤ [1, 2, 4] [1, 2, 3] will output a list of two truthy numbers and a falsey number. The other family is the *lexicographic* comparison builtins. These compare arbitrary values, and never autovectorise. For the purpose of lexicographic comparisons, any comparison involving `null` returns false (regardless of what sort of comparison it is: equality, inequality, less-than, greater-than…), a comparison of two numbers compares their numerical values, and a comparison of two lists compares the elements in corresponding positions of the lists, with the first differing element determining the result of the comparison. Lexicographic comparisons can try to compare things of different types from each other. When comparing two lists of different lengths, the common elements are compared first: if they are the same, then the shorter list is considered lexicographically lower. When comparing a list to a number, numbers whose real part is in the range *n* exclusive to *n*+1 inclusive are considered to be marginally greater than the list [0, 1, …, *n*] (i.e. if compared to this list, they will be considered greater than it; and if compared to any other list, the result will be the same as when comparing this list to that list). To directly lexicographically compare two values, you can use `lex_greater`; in general, builtins for lexicographic comparisons have names starting `lex_`. The selection of 1-byte builtins for comparisons is different between lexicographic and numerical comparisons, e.g. you can sort lexicographically in 1 byte but not numerically, and lexicographic comparisons have a greater-than whereas numerical comparisons have a greater-than-or-equal. Complex numbers also present some difficulty for comparison functions, because they aren't normally considered to be ordered. These are also handled differently between lexicographic and numerical comparisons. Numerical comparisons treat complex numbers (i.e. numbers whose imaginary part is not zero) very similarly to `null`s: complex numbers will compare equal to themselves, and unequal to any other number, but anything that cares about the order (`≤`, `≥` and friends) will return false. Meanwhile, lexicographic comparisons mostly only look at the real part of a number; the imaginary part will be used as a tiebreak when comparing two numbers which different but identical in their real parts (the number whose imaginary part is larger upon dividing by `i` will be considered greater). ### Truthiness and falsiness Some builtins care about whether a value is logically "true" or "false". In Elly, this doesn't have much relationship to the actual value; rather, each value is associated with an extra boolean that determines whether it's currently truthy or falsey. (The main exception to this is that nulls are always falsey.) Generally speaking, builtins that are designed to handle data in an opaque way (without examining it) will leave its truthiness alone. Builtins that are designed to return booleans will return null for false, and some useful value related to the input value for true (often but not always the input itself); in this case, the output will be converted to its truthy version (even if it was previously falsey). Builtins that have functionality unrelated to booleans (e.g. arithmetic) will generally mark 0s in the output as falsey, whilst marking other elements as truthy. When an autovectorisation assembles its results into a list, the resulting list wrapper (as opposed to its elements) will be falsey if it is empty or contains any falsey elements, and truthy otherwise; builtins that generate lists might also use this scheme (but will mark any nonempty list as truthy if they could plausibly generate an empty output from nonempty input). In some cases, two builtins with only vaguely related functionality will be merged together, because one of them conceptually returns only a boolean and the other conceptually returns only a value. For example, `≤` (which determines whether its left argument is numerically less than or equal to its right argument) and `smaller` (which takes two arguments and returns the numerically smaller) are actually the same builtin, implemented as taking two arguments, returning the smaller, and marking it as truthy or falsey depending on which argument was smaller. This sort of merge can be a little confusing, but helps to make programs terser by increasing the number of builtins that can be given single-byte names (by allowing some of them to share). Note that falsiness or truthiness is generally only important when using a builtin that specifically checks for it, so most of the time, you won't have to worry about the truth values of your data (and besides, it tends to get overwritten all the time anyway). The truthiness or falsiness of the output of the whole program will affect the program's exit code, but typically not anything else. I/O specifications ------------------ [TODO: Where do these things go? If we put them at the end, they can't adapt argument count, but if we put them at the start, we can't parse them correctly. Current plan: encode these by starting a function with a C1 byte, which shouldn't be needed for other meanings, followed by a self-delimiting specification.] I/O specifications are a domain-specific language, embedded within Elly, for parsing input and formatting output, and specifying where a function/program takes its input from and where it writes it to. Placing an I/O specification at the start of the main program turns off any attempt to adapt its argument count to the number of command-line arguments given; it will always take exactly the number of inputs that the I/O specification implies. (The place to take those inputs from can be included in the specification itself; if it isn't, they're taken from arguments first, then from a trailing bytestring (see below), then from standard input.) As an exception to this, if the I/O specification lists no inputs at all, command-line inputs will be taken via the normal mechanism. Functions other than the main program can also start with an I/O specification. Again, this can affect the number of arguments the function is given. The `0(`/`1(`/`2(` specifies the number of arguments the *caller* must specify when calling the function; the number of arguments the function *receives* is determined by the I/O specification, and might be more (or occasionally even fewer). The I/O specification can read the function arguments; any which it does not read at all will be provided to the function after the arguments produced by the I/O specification. It's possible that an I/O specification will specify that a function takes 3 or more arguments. In this case, all but the last will be grouped together into a list. To write an I/O specification, you start by writing I/O specification commands, followed by a `;` (which counts against the bytecount). [TODO: is this obsolete?] This is always distinguishable from the `;` used by control flow statements, because no control flow statements have appeared yet in the function. The commands can be explicitly marked to say whether they appear to inputs or outputs; but if they aren't, they apply first to the program/function's return value, and then to its inputs in order. There are three sorts of I/O specification commands, explained in the next three sections. ### Formatting commands Formatting commands specify how a single value (normally a number) should be converted to (or from) a string for I/O. For example, this includes `hexadecimal` which outputs the number in hexadecimal, and `fractional` which specifies that non-integer numbers should be output as a fraction (rather than a decimal). A run of consecutive formatting commands will all be interpreted as applying to the same value. For example, `hexadecimal uppercase` will output the value in uppercase hexadecimal. The order can matter; formatting commands will only affect formatting commands to their left in the run. So for example, `hexadecimal uppercase parseable` would format the number 121 as `0x7B`, whereas `parseable uppercase hexadecimal` would format it as `0X7b`; `uppercase` applies to `hexadecimal` in the former case, and to `parseable` (which in this case prefixes a `0x` so that the number is identifiable as hexadecimal) in the latter case. Note, however, that `parseable` still prepends a `0x` in the latter case, despite appearing to the left of `hexadecimal`, because otherwise it would do nothing at all. Sometimes you want to break up a run of consecutive formatting commands, to apply them to two different values. A run will automatically break upon seeing a redundant or contradictory command; `hexadecimal decimal` can't apply to a single value, so it would end up applying to two values. For example, this program: hexadecimal decimal ; y double will read a decimal integer (from command-line argument or standard input), double it, and output the doubled value in hexadecimal. However, there may be some cases in which the formatting commands are consistent with each other, but nonetheless you want to apply them to different values. For example, if we wanted to input a number in hexadecimal, and output a number with an explicit decimal point, we can't use `show_point hexadecimal` because this is internally consistent (producing numbers like `0x4.`). Instead, we need to insert an additional `;` in the middle of the run to break it up; here's how we read a number in hexadecimal, and output it in (decimal, by default) with an explicit decimal point: show_point ; hexadecimal ; y Most formatting commands take no arguments. A few, however, do need an argument; as usual in Elly, this is written directly after the command itself. You don't have many tacit shapes available for formatting command arguments; your only options are to use a single 0-argument builtin/function (which would typically be a literal or variable), or else to enter full control mode. Two examples of formatting commands with arguments are `precision` and `fieldwidth`. `precision` determines the number of decimal places (or significant figures under `scientific`, hexadecimal places under `hexadecimal`, etc.) to display; the number will be rounded if there are more and padded out with trailing zeroes if there are fewer. `fieldwidth` determines the preferred number of characters to use to write the value; if the value naturally comes out shorter, it will be padded out to the given width, and if it naturally comes out longer, it will be rounded or converted to scientific notation if possible. (However, sometimes this isn't possible, and the value will take up more space than the given `fieldwidth`.) Incidentally, the nature of the padding can be changed by reordering the commands; something like `fieldwidth 10 decimal` will attempt to pad the number using decimal digits (so we end up with leading zeroes in this case), whereas if we reverse this to `decimal fieldwidth 10`, we get the default padding of spaces. `precision` and `fieldwidth` also have special cases that allow them to be joined together with their argument into a single token; instead of `fieldwidth 10` (two bytes), you can use `fieldwidth_10` (one byte). Joining into a single-byte token is allowed up to `precision_9` and `fieldwidth_19`. You can also join into a double-byte token (with leading `@` because it's a double-byte token), going as high as `@precision_99` and `@fieldwidth_399`. The double-byte versions are encoded by using two consecutive `precision` or `fieldwidth` commands; thus, you'll need to separate them with a `;` if they were meant to apply to different values. Formatting commands are mostly only used for output, because often it's possible to automatically determine the format of input anyway. Sometimes, though, you want to restrict what inputs are legal, or enforce an unusual interpretation (e.g. parsing octal numbers, which typically look like decimal if they don't have a leading `0o`). So you can use formatting command on inputs too. An input that doesn't fit the required format will parse as `null` and produce a warning to stderr. ### Composition commands Composition commands allow the various pieces of a value to be formatted separately and combined back together again. One example is the `join ","` command, which will treat the value being output as a list, and output the elements separated by commas (with no whitespace; if you wanted whitespace that would be the `join ", "` command). Composition commands effectively consume the value being formatted and give you a new value to format (in this case, "an element"; there is an implicit loop over the elements of the list). In this way, they act a little like generators do (although the syntax is very different). When composition commands have arguments (and they usually do), they're compressed together with the argument. For example, the literal `"and"` encodes to three bytes, and `" and "` to five. However, `join "and"` is still three bytes, and `join " and "` is also three bytes, because there is a special case for `join` followed by a dictionary word possibly surrounded by spaces. `join` in particular has a huge number of special cases because it's so common; very common combinations like `join ", "` can encode to just a single byte. `join` also has two variants, `ojoin` and `rojoin`, that allow you to special-case the behaviour of the start or end of the list. `ojoin` will specify how to join the first element of the list to the rest of the list; the "new value" produced to format will be the remainder of the list (the first element itself will be formatted the same way as an *element* of that list). Likewise, `rojoin` will specify how to join the last element of the list to the rest of the list. If the list happens to be only one element long, both `ojoin` and `rojoin` are no-ops. Here's how we'd format a list as `1, 2, 3 and 4`: rojoin " and " join ", " ; [1,2,3,4] `join` also has a variant `map`. This applies to all inputs and outputs simultaneously, and effectively runs the entire function (or entire main program) once for every item in the input, creating the same sort of item in the output. As an example, we can reverse every line of the input, without needing an explicit loop: map "\n" ; y reverse Aside from the very common `join` and its variants, composition commands can also be used to split apart values in various ways and format the parts separately. For example, `num,denom` replaces a number given as input with a 2-element list consisting of its numerator and denominator. Most of these come in comma and semicolon versions: `num;denom` replaces a number given as input with *two* numbers, its numerator and a denominator, and must be followed with a format for each individually. Generally speaking, irrelevant parts of the format are just ignored (`num,denom` will convert an integer to a singleton list, and `num;denom` will format an integer using the numerator format and disregard the denominator format), although there are sometimes variants like `num;denom1` which force the denominator to exist (in this case, with value 1). The remaining family of composition commands is typified by, e.g., `surround "()"`. These don't change the value to be formatted at all, but do change the output that it formats to (in this case, placing parentheses around the value that would be formatted otherwise). Composition commands are written as though they are being used to output a value, but they can also be used to input a value; for example, placing `join ","` on an input would attempt to input a comma-separated list. ### Location commands Location commands specify where a value is read from or written to. You write these before the other formatting commands related to the value. For example, `read_line byte` as an I/O specification will read a line from standard input, and interpret it as a bytestring; as another example, `write_stderr binary` will format the function's output in binary and write it to the standard error stream. Location commands don't exist within the "standard sequence" of locations to read from or write to; they just give you extra places to read from or write to. If you use a location command to redirect the output of the main program, it doesn't get implicitly written to standard output (like it normally would be); however, using a location command to redirect the output of a function other than the main program will not prevent it returning its normal return value. You can use multiple location commands referencing outputs to output the same output multiple times. There are three special location commands, `return_value`, `return_bytes`, `return_chars`, which overwrite the output of the currently running function (causing it to return a different value). `return_bytes` and `return_chars` format the output as a string (as though it were about to be output), but then return a string (of bytes or codepoints respectively) from the function rather than outputting it. `return_value` just outputs the value from the function directly, as an Elly data type, without formatting it at all. Note that in addition to being placed before the description of the format of a value as a whole, you can place a location command at the start of the formatting of a part of a value (which was created by splitting a value with a composition command). This will send just that part of the value to a different location. You could, for example, write `real;complex write_stdout write_stderr` to write the real part of a number to standard output, and the complex part to standard error. Miscellaneous features ---------------------- ### Features that only exist in source code Elly is primarily focused on optimising the encoded representation (which is the representation on which the interpreter works). However, the encoded representation is normally produced by compiling the source code representation. In most cases, the source code notation is designed to make it obvious how large the encoded representation would be, so there's a very close correspondence between them. However, there are some features that exist only at the source code level, and don't appear in the encoded representation. The most obvious of these is source-code-level comments. These start with a `#` and last to the end of the line. The content of a source-code-level comment will be entirely ignored by the compiler; and these comments will be entirely omitted from the encoded representation (and thus will have no effect when the program actually runs). Comments are allowed anywhere except inside a literal. Another option you have is to specify builtins via specifying the raw bytes that represent them when encoded. This might be useful if you want to force a specific encoding of a builtin, e.g. to obey some source layout rule, but it mostly exists to give a method of referring to nonexistent builtins in error messages. The syntax involves a period `.` followed by the bytes in big-endian hexadecimal. If the builtin is a literal, you need to place an additional `.` between the part that specifies what sort of literal it is, and the part that specifies what value the literal has (e.g. `−128` would not be `.3F3B80`, but rather `.3F3B.80`). The same "specify via raw bytes" mechanism can be used to specify just the data part of a literal. For example, you could write the builtin `280` as `common_integer.0A`. This syntax is used by interactive Elly programming environments when they want to visualise how a literal is encoded and/or how many bytes it takes up. Another feature that's often useful is the `function_by_cases` builtin. At the source code level, it looks something like this: y function_by_cases { "Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3, "Friday": 4, "Saturday": 5, "Sunday": 6, } 7 At the source code level, the builtin takes two arguments and a literal dictionary. If the argument is one of the keys in the dictionary (which can be arbitrary Elly data; they don't need to be strings, but could be nested lists or numbers or even a null), then the return value will be calculated by finding the value corresponding to the key, then indexing into the right-hand argument. If the argument is not one of the keys in the dictionary, the return value will be an element of the right-hand argument that's arbitrary, but consistent for any given input. In order for the mechanism to work, the number of elements in (or for numbers, the value of) the right hand argument must be the smallest prime power that's strictly greater than any of the values in the dictionary (in this case, 7 is the smallest prime power that's strictly greater than 6). The reason that `function_by_cases` is useful is that there's no need to encode the entire literal dictionary when producing the encoded representation. Rather, the translation will look for a "magic formula" that happens to produce the right results, and encode that. In the program above, there's a 1 in 77 = 823543 chance that any given random formula would produce the correct output (out of 7 possibilities) for each of the 7 inputs; so it's quite likely that we'll find an appropriate formula within the first 165 = 1048576 attempts and thus be able to represent which specific formula is in use with a five-nybble number. (For short magic formulas, only one nybble is required to record the length of the payload.) Encoding and then decoding the program above would therefore produce a program that looks something like this (although the last five hexadecimal digits would almost certainly be different; the leading `2` encodes the length): y function_by_cases.2391DF 7 In other words, our day-of-the-week-to-index function can be encoded a lot more tersely than it might first look; it's only give bytes long (one for `function_by_cases`, three for the payload's length and value, and one for the `7`). Generally speaking, whenever you have a function with only a small number of valid inputs and no real pattern, using `function_by_cases` will produce terser code than you'd get trying to produce a magic formula by hand. ### Trailing bytestrings A trailing bytestring is a method of embedding arbitrary literal binary data directly into the program, without escaping, length-prefixing or delimters. This has two main uses; one is to save bytes when you need to embed a large amount of incompressible (or already-compressed) data into your program (because there's no need to spend any encoding space on specifying where the data ends); and the other is for writing quines and similar programs where you care about which particular characters exist within the source code or encoding, because trailing bytestrings are never compressed. Trailing bytestrings exist in both the source code and encoded representations, and work the same way; they start immediately after an unmatched `)` token (which is normally used to end a function definition), and last until the end of the file. (In the source code definition, this is a literal `)` character followed by a single space, with the trailing bytestring starting just after the space; in the encoded representation, this is an umatched use of the octet 0xC0, which encodes the `)` and `}` tokens.) The value of a trailing bytestring is used to determine the values of `X` and `Y` before they are defined (e.g. in the initial expression of a function that takes fewer than 2 arguments), or to specify a value that an I/O specification reads. If neither of these constructs exist in the program (and thus the trailing bytestring would otherwise be unused), the value of the trailing bytestring replaces the last implicit `y` in the program. As the name suggests, trailing bytestrings are lists of numbers in the range 0 to 255 inclusive, representing the raw bytes that make up the bytestring (no attempt is made to, e.g., do UTF-8 decoding on them); the numbers in question will have a display flag of "byte". A good example of trailing bytestrings is writing a quine in Elly (that is, a program which prints itself). `repeat 2 y )` encodes to the bytes `43 32 C0`, so the encoding `43 32 C0 43 32 C0` will, when executed, print itself. (We could make this program shorter still by instead appending `append x y )` to itself, i.e. `48 C0 48 C0`.) Likewise, if we wanted the program to print a source representation of itself (rather than the encoded representation that actually gets executed), we could do this: repeat 2 y ) repeat 2 y ) One thing to watch out for: because functions cannot consist entirely of tacit expressions, and must encode to at least one byte, a trailing bytestring is not allowed at the start of a program. If you want to write a program that prints a literal trailing bytestring, add a no-op to it, e.g. restart x y ) Hello, world! Bear in mind, though, that you will almost certainly end up with a shorter program if you allow Elly to compress the string: "Hello, world!" This encodes down to a 9 byte encoding (as opposed to 15 for the trailing bytestring version); the only reason you might want to use the 15-byte version is that the string "Hello, world!" will appear literally in the encoded program, rather than being compressed. ### Unusual encodings [...]