Each uses a different form of virtually the same elliptic curve (Curve25519), which is defined over the prime field of order $latex 2^{255} – 19$.

Operations on elliptic curves are built up from arithmetic operations on the underlying field. Furthermore, complex field operations (like inversion and square root) are built up from simpler ones. A single invocation of X25519 or Ed25519 will involve hundreds of field multiplications (and squarings), and these turn out to be the bottleneck when it comes to performance.

Any field element can be encoded using 255 bits. A popular method of implementing arithmetic uses the fact that

$latex 255 = 5 \times 51 = 5 \times (26 + 25) = 26 + 25 + 26 + 25 + \ldots + 26 + 25$

and thus represents a general element as $latex x_0 + x_1 2^{26} + x_2 2^{51} + \ldots + x_9 2^{230}$, where each ‘limb’ $latex x_i$ comprises alternately 25 or 26 bits *or a small amount more*. Normally we only have to worry about the representation being strictly canonical at the very end of a long calculation, so in the interests of efficiency we tolerate a few extra bits at the intermediate stages and ‘tidy up’ at the end.

Splitting the field elements into 10 parts like this means that we can multiply two of them by performing 100 elementary multiplications, followed by (or interleaved with) the necessary reductions (modulo the prime) and additions.

The $latex x_i$ are stored as 64-bit integers; and provided they have no more than 32 significant bits the products $latex x_i x_j$ are also 64-bit integers. By performing enough reduction steps we can ensure that if all $latex x_i$ are within certain bounds (not exceeding their nominal length by too much) on input to an arithmetic operation, then the output is correct and also within those bounds, so that we can compose operations freely and put the result in canonical form at the end.

I wondered if it was possible to do better than this, using a shorter partition of 255. In particular, the factorization

$latex 255 = 3 \times 85 = 3 \times (29 + 28 + 28)$

allows us to split the element as $latex x_0 + x_1 2^{29} + x_2 2^{57} + \ldots + x_8 2^{227}$, where the $latex x_i$ have 28 or 29 bits plus a bit of wiggle room. If we could make this work, we would only have to do 81 elementary multiplications per field multiplication: a saving of 19%.

(Obviously, this is as far as one can hope to go with this technique. A partition of length 8 would require seven 32-bit limbs, leaving no wiggle room.)

The difficulty, of course, is in keeping sufficient control over the bounds. Because 28- and 29-bit integers have less room to expand than 25- and 26-bit integers before they hit the fatal 33-bit barrier, more reduction steps are required to guarantee correctness. Would the cost of these necessarily cancel out the apparent gain in efficiency?

I donʼt know the answer to this question for a *general-purpose* arithmetic library over this field. I scratched my head over the equations for some time, but couldnʼt see a good way to set the bounds so that they were preserved by all functions without an excessive amount of reduction.

However, I was not writing a general-purpose library. My functions would only ever be called in a small number of specific sequences: namely, those of the higher-level X25519 and Ed25519 implementations.

To prove correctness, therefore, I inserted some instrumentation code into all the field functions, and an additional array of 64-bit numbers that got carried around with the limbs of each field element, representing upper bounds on their values. When a field element was initialized, the bounds were set to their maximum values (i.e. $latex 2^{29} – 1$ or $latex 2^{28} -1$). Within each function, I updated the bounds according to the functionʼs internal workings.

For example, every time two limbs are added, the corresponding upper bounds are added. Every time they are multiplied, the bounds are multiplied. The ‘conditional swap’ function (which swaps two elements conditional on a secret bit) replaces each of the corresponding upper bounds with their maximum. When a limb is shifted right, so is the bound. When it is masked with $latex 2^n – 1$, the bound $latex m$ is replaced by $latex \min(m, 2^n-1)$. And so on.

All addition and multiplication of upper bounds is performed with overflow checking. Any overflow triggers an assertion, stopping the program.

The advantage of this approach is that I only need to run a small number of tests (exercising all the code paths followed by the external API functions) to verify that no overflow is possible. One of the nice properties of these algorithms is that they lend themselves to straight-line code, where there are no branches based on secret data. Of course, the main point of this is to resist side-channel attacks: nothing you might be able to observe about the code path tells you anything useful about the data. But a bonus is that it allowed me easily to verify my field-arithmetic code for bounds-correctness *in all contexts where it would ever be called*.

The upshot was that I had to make a small tweak to the subtraction operation and add a single reduction step (carrying excess bits in a loop around the limbs) in the middle of the multiplication operation and another at the end of the subtraction, in order to guarantee correctness. This was a small price to pay for the saving in multiplications.

The downside is that if any changes are made to the higher-level X25519 or Ed25519 code, or if another algorithm wants to use this field code, then everything has to be revalidated: the field arithmetic cannot be treated as a black box.