# [−][src]Module curve25519_dalek::backend::avx2

A vectorized implementation of group operations on the twisted Edwards form of Curve25519, using a modification of the 4-way parallel formulas of Hisil, Wong, Carter, and Dawson.

# Overview

The 2008 paper Twisted Edwards Curves Revisited by Hisil, Wong, Carter, and Dawson (HWCD) introduced the “extended coordinates” and mixed-model representations which are used by most Edwards curve implementations.

However, they also describe 4-way parallel formulas for point addition and doubling: a unified addition algorithm taking an effective $$2\mathbf M + 1\mathbf D$$, a doubling algorithm taking an effective $$1\mathbf M + 1\mathbf S$$, and a dedicated (i.e., for distinct points) addition algorithm taking an effective $$2 \mathbf M$$. They compare these formulas with a 2-way parallel variant of the Montgomery ladder.

Unlike their serial formulas, which are used widely, their parallel formulas do not seem to have been implemented in software before. The 2-way parallel Montgomery ladder was used in 2015 by Tung Chou's sandy2x implementation. Curiously, however, although the sandy2x paper also implements Edwards arithmetic, and cites HWCD08, it doesn't mention their parallel Edwards formulas. A 2015 paper by Hernández and López describes an AVX2 implementation of X25519. Neither the paper nor the code are publicly available, but it apparently gives only a slight speedup, suggesting that it uses a 4-way parallel Montgomery ladder rather than parallel Edwards formulas.

The reason may be that HWCD08 describe their formulas as operating on four independent processors, which would make a software implementation impractical: all of the operations are too low-latency to effectively synchronize. But a closer inspection reveals that the (more expensive) multiplication and squaring steps are uniform, while the instruction divergence occurs in the (much cheaper) addition and subtraction steps. This means that a SIMD implementation can perform the expensive steps uniformly, and handle divergence in the inexpensive steps using masking.

These notes describe modifications to the original parallel formulas to allow a SIMD implementation, and this module contains an implementation of the modified formulas using 256-bit AVX2 vector operations.

# Parallel formulas in HWCD'08

The doubling formula is presented in the HWCD paper as follows:

Cost Processor 1 Processor 2 Processor 3 Processor 4
idle idle idle $$R_1 \gets X_1 + Y_1$$
$$1\mathbf S$$ $$R_2 \gets X_1^2$$ $$R_3 \gets Y_1^2$$ $$R_4 \gets Z_1^2$$ $$R_5 \gets R_1^2$$
$$R_6 \gets R_2 + R_3$$ $$R_7 \gets R_2 - R_3$$ $$R_4 \gets 2 R_4$$ idle
idle $$R_1 \gets R_4 + R_7$$ idle $$R_2 \gets R_6 - R_5$$
$$1\mathbf M$$ $$X_3 \gets R_1 R_2$$ $$Y_3 \gets R_6 R_7$$ $$T_3 \gets R_2 R_6$$ $$Z_3 \gets R_1 R_7$$

and the unified addition algorithm is presented as follows:

Cost Processor 1 Processor 2 Processor 3 Processor 4
$$R_1 \gets Y_1 - X_1$$ $$R_2 \gets Y_2 - X_2$$ $$R_3 \gets Y_1 + X_1$$ $$R_4 \gets Y_2 + X_2$$
$$1\mathbf M$$ $$R_5 \gets R_1 R_2$$ $$R_6 \gets R_3 R_4$$ $$R_7 \gets T_1 T_2$$ $$R_8 \gets Z_1 Z_2$$
$$1\mathbf D$$ idle idle $$R_7 \gets k R_7$$ $$R_8 \gets 2 R_8$$
$$R_1 \gets R_6 - R_5$$ $$R_2 \gets R_8 - R_7$$ $$R_3 \gets R_8 + R_7$$ $$R_4 \gets R_6 + R_5$$
$$1\mathbf M$$ $$X_3 \gets R_1 R_2$$ $$Y_3 \gets R_3 R_4$$ $$T_3 \gets R_1 R_4$$ $$Z_3 \gets R_2 R_3$$

Here $$\mathbf M$$ and $$\mathbf S$$ represent the cost of multiplication and squaring of generic field elements, $$\mathbf D$$ represents the cost of multiplication by a curve constant (in this case $$k = 2d$$).

Notice that the $$1\mathbf M$$ and $$1\mathbf S$$ steps are uniform. The non-uniform steps are all inexpensive additions or subtractions, with the exception of the multiplication by the curve constant $$k = 2d$$: $$R_7 \gets 2 d R_7.$$

HWCD suggest parallelising this step by breaking $$k = 2d$$ into four parts as $$k = k_0 + 2^n k_1 + 2^{2n} k_2 + 2^{3n} k_3$$ and computing $$k_i R_7$$ in parallel. This is quite awkward, but if the curve constant is a ratio $$d = d_1/d_2$$, then projective coordinates allow us to instead compute $$(R_5, R_6, R_7, R_8) \gets (d_2 R_5, d_2 R_6, 2d_1 R_7, d_2 R_8).$$ This can be performed as a uniform multiplication by a vector of constants, and if $$d_1, d_2$$ are small, it is relatively inexpensive. (This trick was suggested by Mike Hamburg). In the Curve25519 case, we have $$d = \frac{d_1}{d_2} = \frac{-121665}{121666};$$ Since $$2 \cdot 121666 < 2^{18}$$, all the constants above fit (up to sign) in 32 bits, so this can be done in parallel as four multiplications by small constants $$(121666, 121666, 2\cdot 121665, 2\cdot 121666)$$, followed by a negation to compute $$- 2\cdot 121665$$.

# Modified parallel formulas

Using the modifications sketched above, we can write SIMD-friendly versions of the parallel formulas as follows. To avoid confusion with the original formulas, temporary variables are named $$S$$ instead of $$R$$ and are in static single-assignment form.

To add points $$P_1 = (X_1 : Y_1 : Z_1 : T_1)$$ and $$P_2 = (X_2 : Y_2 : Z_2 : T_2 )$$, we compute \begin{aligned} (S_0 &&,&& S_1 &&,&& S_2 &&,&& S_3 ) &\gets (Y_1 - X_1&&,&& Y_1 + X_1&&,&& Y_2 - X_2&&,&& Y_2 + X_2) \\ (S_4 &&,&& S_5 &&,&& S_6 &&,&& S_7 ) &\gets (S_0 \cdot S_2&&,&& S_1 \cdot S_3&&,&& Z_1 \cdot Z_2&&,&& T_1 \cdot T_2) \\ (S_8 &&,&& S_9 &&,&& S_{10} &&,&& S_{11} ) &\gets (d_2 \cdot S_4 &&,&& d_2 \cdot S_5 &&,&& 2 d_2 \cdot S_6 &&,&& 2 d_1 \cdot S_7 ) \\ (S_{12} &&,&& S_{13} &&,&& S_{14} &&,&& S_{15}) &\gets (S_9 - S_8&&,&& S_9 + S_8&&,&& S_{10} - S_{11}&&,&& S_{10} + S_{11}) \\ (X_3&&,&& Y_3&&,&& Z_3&&,&& T_3) &\gets (S_{12} \cdot S_{14}&&,&& S_{15} \cdot S_{13}&&,&& S_{15} \cdot S_{14}&&,&& S_{12} \cdot S_{13}) \end{aligned} to obtain $$P_3 = (X_3 : Y_3 : Z_3 : T_3) = P_1 + P_2$$. This costs $$2\mathbf M + 1 \mathbf D$$.

If the point $$P_2 = (X_2 : Y_2 : Z_2 : T_2)$$ is fixed, we can cache the multiplication of the curve constants by computing \begin{aligned} (S_2' &&,&& S_3' &&,&& Z_2' &&,&& T_2' ) &\gets (d_2 \cdot (Y_2 - X_2)&&,&& d_2 \cdot (Y_1 + X_1)&&,&& 2d_2 \cdot Z_2 &&,&& 2d_1 \cdot T_2). \end{aligned} This costs $$1\mathbf D$$; with $$(S_2', S_3', Z_2', T_2')$$ in hand, the addition formulas above become \begin{aligned} (S_0 &&,&& S_1 &&,&& Z_1 &&,&& T_1 ) &\gets (Y_1 - X_1&&,&& Y_1 + X_1&&,&& Z_1 &&,&& T_1) \\ (S_8 &&,&& S_9 &&,&& S_{10} &&,&& S_{11} ) &\gets (S_0 \cdot S_2' &&,&& S_1 \cdot S_3'&&,&& Z_1 \cdot Z_2' &&,&& T_1 \cdot T_2') \\ (S_{12} &&,&& S_{13} &&,&& S_{14} &&,&& S_{15}) &\gets (S_9 - S_8&&,&& S_9 + S_8&&,&& S_{10} - S_{11}&&,&& S_{10} + S_{11}) \\ (X_3&&,&& Y_3&&,&& Z_3&&,&& T_3) &\gets (S_{12} \cdot S_{14}&&,&& S_{15} \cdot S_{13}&&,&& S_{15} \cdot S_{14}&&,&& S_{12} \cdot S_{13}) \end{aligned} which costs only $$2\mathbf M$$. This precomputation is essentially similar to the precomputation that HWCD suggest for their serial formulas. Because the cost of precomputation and then readdition is the same as addition, it's sufficient to only implement caching and readdition.

## Doubling

The non-uniform portions of the (re)addition formulas have a fairly regular structure. Unfortunately, this is not the case for the doubling formulas, which are much less nice.

To double a point $$P = (X_1 : Y_1 : Z_1 : T_1)$$, we compute \begin{aligned} (X_1 &&,&& Y_1 &&,&& Z_1 &&,&& S_0) &\gets (X_1 &&,&& Y_1 &&,&& Z_1 &&,&& X_1 + Y_1) \\ (S_1 &&,&& S_2 &&,&& S_3 &&,&& S_4 ) &\gets (X_1^2 &&,&& Y_1^2&&,&& Z_1^2 &&,&& S_0^2) \\ (S_5 &&,&& S_6 &&,&& S_8 &&,&& S_9 ) &\gets (S_1 + S_2 &&,&& S_1 - S_2 &&,&& S_1 + 2S_3 - S_2 &&,&& S_1 + S_2 - S_4) \\ (X_3 &&,&& Y_3 &&,&& Z_3 &&,&& T_3 ) &\gets (S_8 \cdot S_9 &&,&& S_5 \cdot S_6 &&,&& S_8 \cdot S_6 &&,&& S_5 \cdot S_9) \end{aligned} to obtain $$P_3 = (X_3 : Y_3 : Z_3 : T_3) = [2]P_1$$.

The intermediate step between the squaring and multiplication requires a long chain of additions, but with some care and finesse, described below, it is possible (in our case) to arrange this computation without requiring an intermediate reduction.

However, it does mean that the doubling formulas have proportionately more vectorization overhead than the (re)addition formulas. The effects of this are discussed in the comparison section below.

# Field element representation

Our strategy is to implement 4-wide multiplication and squaring by wordslicing, using one 64-bit AVX2 lane for each field element. Field elements are represented in the usual way as 10 u32 limbs in radix $$25.5$$ (i.e., alternating between $$2^{26}$$ for even limbs and $$2^{25}$$ for odd limbs). This has the effect that passing between the parallel 32-bit AVX2 representation and the serial 64-bit representation (which uses radix $$2^{51}$$) amounts to regrouping digits.

The field element representation is oriented around the AVX2 vpmuluqdq instruction, which multiplies the low 32 bits of each 64-bit lane of each operand to produce a 64-bit result.

(a1 ?? b1 ?? c1 ?? d1 ??)
(a2 ?? b2 ?? c2 ?? d2 ??)

(a1*a2 b1*b2 c1*c2 d1*d2)


To unpack 32-bit values into 64-bit lanes for use in multiplication it would be convenient to use the vpunpck[lh]dq instructions, which unpack and interleave the low and high 32-bit lanes of two source vectors. However, the AVX2 versions of these instructions are designed to operate only within 128-bit lanes of the 256-bit vectors, so that interleaving the low lanes of (a0 b0 c0 d0 a1 b1 c1 d1) with zero gives (a0 00 b0 00 a1 00 b1 00). Instead, we pre-shuffle the data layout as (a0 b0 a1 b1 c0 d0 c1 d1) so that we can unpack the "low" and "high" parts as

(a0 00 b0 00 c0 00 d0 00)
(a1 00 b1 00 c1 00 d1 00)


The data layout for a vector of four field elements $$(a,b,c,d)$$ with limbs $$a_0, a_1, \ldots, a_9$$ is as [u32x8; 5] in the form

(a0 b0 a1 b1 c0 d0 c1 d1)
(a2 b2 a3 b3 c2 d2 c3 d3)
(a4 b4 a5 b5 c4 d4 c5 d5)
(a6 b6 a7 b7 c6 d6 c7 d7)
(a8 b8 a9 b9 c8 d8 c9 d9)


Since this breaks cleanly into two 128-bit lanes, it may be possible to adapt it to 128-bit vector instructions such as NEON without too much difficulty. Going the other direction, to extend this to AVX512, we could either run two point operations in parallel in lower and upper halves of the registers, or use 2-way parallelism within a field operation.

# Avoiding Overflow in Doubling

To analyze the size of the field element coefficients during the computations, we can parameterize the bounds on the limbs of each field element by $$b \in \mathbb R$$ representing the excess bits above that limb's radix, so that each limb is bounded by either $$2^{25+b}$$ or $$2^{26+b}$$, as appropriate.

The multiplication routine requires that its inputs are bounded with $$b < 1.75$$, in order to fit a multiplication by $$19$$ into 32 bits. Since $$\lg 19 < 4.25$$, $$19x < 2^{32}$$ when $$x < 2^{27.75} = 2^{26 + 1.75}$$. However, this is only required for one of the inputs; the other can grow up to $$b < 2.5$$.

In addition, the multiplication and squaring routines do not canonically reduce their outputs, but can leave some small uncarried excesses, so that their reduced outputs are bounded with $$b < 0.007$$.

The non-parallel portion of the doubling formulas is \begin{aligned} (S_5 &&,&& S_6 &&,&& S_8 &&,&& S_9 ) &\gets (S_1 + S_2 &&,&& S_1 - S_2 &&,&& S_1 + 2S_3 - S_2 &&,&& S_1 + S_2 - S_4) \end{aligned}

Computing $$(S_5, S_6, S_8, S_9 )$$ as $$\begin{matrix} & S_1 & S_1 & S_1 & S_1 \\ +& S_2 & & & S_2 \\ +& & & S_3 & \\ +& & & S_3 & \\ +& & 2p & 2p & 2p \\ -& & S_2 & S_2 & \\ -& & & & S_4 \\ =& S_5 & S_6 & S_8 & S_9 \end{matrix}$$ results in bit-excesses $$< (1.01, 1.60, 2.33, 2.01)$$ for $$(S_5, S_6, S_8, S_9 )$$. The products we want to compute are then \begin{aligned} X_3 &\gets S_8 S_9 \leftrightarrow (2.33, 2.01) \\ Y_3 &\gets S_5 S_6 \leftrightarrow (1.01, 1.60) \\ Z_3 &\gets S_8 S_6 \leftrightarrow (2.33, 1.60) \\ T_3 &\gets S_5 S_9 \leftrightarrow (1.01, 2.01) \end{aligned} which are too large: it's not possible to arrange the multiplicands so that one vector has $$b < 2.5$$ and the other has $$b < 1.75$$. However, if we flip the sign of $$S_4 = S_0^2$$ during squaring, so that we output $$S_4' = -S_4 \pmod p$$, then we can compute $$\begin{matrix} & S_1 & S_1 & S_1 & S_1 \\ +& S_2 & & & S_2 \\ +& & & S_3 & \\ +& & & S_3 & \\ +& & & & S_4' \\ +& & 2p & 2p & \\ -& & S_2 & S_2 & \\ =& S_5 & S_6 & S_8 & S_9 \end{matrix}$$ resulting in bit-excesses $$< (1.01, 1.60, 2.33, 1.60)$$ for $$(S_5, S_6, S_8, S_9 )$$. The products we want to compute are then \begin{aligned} X_3 &\gets S_8 S_9 \leftrightarrow (2.33, 1.60) \\ Y_3 &\gets S_5 S_6 \leftrightarrow (1.01, 1.60) \\ Z_3 &\gets S_8 S_6 \leftrightarrow (2.33, 1.60) \\ T_3 &\gets S_5 S_9 \leftrightarrow (1.01, 1.60) \end{aligned} whose right-hand sides are all bounded with $$b < 1.75$$ and whose left-hand sides are all bounded with $$b < 2.5$$, so that we can avoid any intermediate reductions.

# Comparison to non-vectorized formulas

In theory, the parallel Edwards formulas seem to allow a $$4$$-way speedup from parallelism. However, an actual vectorized implementation has several slowdowns that cut into this speedup.

First, the parallel formulas can only use a $$32 \times 32 \rightarrow 64$$-bit integer multiplier, so the speedup from vectorization must overcome the disadvantage of losing the $$64 \times 64 \rightarrow 128$$-bit (serial) integer multiplier. The effect of this slowdown is microarchitecture-dependent, since it requires accounting for the total number of multiplications and additions and their relative costs. In the future, it will probably be possible to avoid this slowdown by using the IFMA52 instructions, whose parallelism is perfectly suited to these formulas.

Second, the parallel doubling formulas incur both a theoretical and practical slowdown. The parallel formulas described above work on the $$\mathbb P^3$$ “extended” coordinates. The $$\mathbb P^2$$ model introduced earlier by Bernstein, Birkner, Joye, Lange, and Peters allows slightly faster doublings, so HWCD suggest mixing coordinate systems while performing scalar multiplication (attributing the idea to a 1998 paper by Cohen, Miyagi, and Ono). The $$T$$ coordinate is not required for doublings, so when doublings are followed by doublings, its computation can be skipped. More details on this approach and the different coordinate systems can be found in the curve_models module documentation.

Unfortunately, this optimization is not compatible with the parallel formulas, which cannot save time by skipping a single variable, so the parallel doubling formulas do slightly more work when counting the total number of field multiplications and squarings.

In addition, the parallel doubling formulas have a less regular pattern of additions and subtractions than the parallel addition formulas, so the vectorization overhead is proportionately greater. Both the parallel addition and parallel doubling formulas also require some shuffling to rearrange data within the vectors, which places more pressure on the shuffle unit than is desirable.

This means that the speedup from using a vectorized implementation of parallel Edwards formulas is likely to be greatest in applications that do fewer doublings and more additions (like a large multiscalar multiplication) rather than applications that do fewer additions and more doublings (like a double-base scalar multiplication).

Third, current Intel CPUs perform thermal throttling when using wide vector instructions. A detailed description can be found in §15.26 of the Intel Optimization Manual, but using wide vector instructions prevents the core from operating at higher frequencies. The core can return to the higher-frequency state after 2 milliseconds, but this timer is reset every time high-power instructions are used.

Any speedup from vectorization therefore has to be weighed against a slowdown for the next few million instructions. For a mixed workload, where point operations are interspersed with other tasks, this can reduce overall performance. This implementation is therefore probably not suitable for basic applications, like signatures, but is worthwhile for complex applications, like zero-knowledge proofs, which do sustained work.

For this reason, the AVX2 backend is not enabled by default, but can be selected using the avx2_backend feature.

# Future work

There are several directions for future improvement:

• Using the vectorized field arithmetic code to parallelize across point operations rather than within a single point operation. This is less flexible, but would give a speedup both from allowing use of the faster mixed-model arithmetic and from reducing shuffle pressure. One approach in this direction would be to implement batched scalar-point operations using vectors of points (AoSoA layout). This less generally useful but would give a speedup for Bulletproofs.

• Extending the implementation to use the full width of AVX512, either handling the extra parallelism internally to a single point operation (by using a 2-way parallel implementation of field arithmetic instead of a wordsliced one), or externally, parallelizing across point operations. Internal parallelism would be preferable but might require too much shuffle pressure.

• Generalizing the implementation to non-AVX2 instructions, particularly NEON. The current point arithmetic code is written in terms of field element vectors, which are in turn implemented using platform SIMD vectors. It should be possible to write an alternate implementation of the FieldElement32x4 using NEON without changing the point arithmetic. NEON has 128-bit vectors rather than 256-bit vectors, but this may still be worthwhile compared to a serial implementation.

## Modules

 constants This module contains constants used by the AVX2 backend. edwards Parallel Edwards Arithmetic for Curve25519. field An implementation of 4-way vectorized 32bit field arithmetic using AVX2. scalar_mul