[−][src]Module curve25519_dalek::backend::avx2
A vectorized implementation of group operations on the twisted Edwards form of Curve25519, using a modification of the 4way parallel formulas of Hisil, Wong, Carter, and Dawson.
Overview
The 2008 paper Twisted Edwards Curves Revisited by Hisil, Wong, Carter, and Dawson (HWCD) introduced the “extended coordinates” and mixedmodel representations which are used by most Edwards curve implementations.
However, they also describe 4way parallel formulas for point addition and doubling: a unified addition algorithm taking an effective \(2\mathbf M + 1\mathbf D\), a doubling algorithm taking an effective \(1\mathbf M + 1\mathbf S\), and a dedicated (i.e., for distinct points) addition algorithm taking an effective \(2 \mathbf M \). They compare these formulas with a 2way parallel variant of the Montgomery ladder.
Unlike their serial formulas, which are used widely, their parallel
formulas do not seem to have been implemented in software before. The
2way parallel Montgomery ladder was used in 2015 by Tung Chou's
sandy2x
implementation. Curiously, however, although the sandy2x
paper also implements Edwards arithmetic, and cites HWCD08,
it doesn't mention their parallel Edwards formulas.
A 2015 paper by Hernández and López describes an AVX2 implementation
of X25519. Neither the paper nor the code are publicly available, but
it apparently gives only a slight speedup, suggesting that
it uses a 4way parallel Montgomery ladder rather than parallel
Edwards formulas.
The reason may be that HWCD08 describe their formulas as operating on four independent processors, which would make a software implementation impractical: all of the operations are too lowlatency to effectively synchronize. But a closer inspection reveals that the (more expensive) multiplication and squaring steps are uniform, while the instruction divergence occurs in the (much cheaper) addition and subtraction steps. This means that a SIMD implementation can perform the expensive steps uniformly, and handle divergence in the inexpensive steps using masking.
These notes describe modifications to the original parallel formulas to allow a SIMD implementation, and this module contains an implementation of the modified formulas using 256bit AVX2 vector operations.
Parallel formulas in HWCD'08
The doubling formula is presented in the HWCD paper as follows:
Cost  Processor 1  Processor 2  Processor 3  Processor 4 

idle  idle  idle  \( R_1 \gets X_1 + Y_1 \)  
\(1\mathbf S\)  \( R_2 \gets X_1^2 \)  \( R_3 \gets Y_1^2 \)  \( R_4 \gets Z_1^2 \)  \( R_5 \gets R_1^2 \) 
\( R_6 \gets R_2 + R_3 \)  \( R_7 \gets R_2  R_3 \)  \( R_4 \gets 2 R_4 \)  idle  
idle  \( R_1 \gets R_4 + R_7 \)  idle  \( R_2 \gets R_6  R_5 \)  
\(1\mathbf M\)  \( X_3 \gets R_1 R_2 \)  \( Y_3 \gets R_6 R_7 \)  \( T_3 \gets R_2 R_6 \)  \( Z_3 \gets R_1 R_7 \) 
and the unified addition algorithm is presented as follows:
Cost  Processor 1  Processor 2  Processor 3  Processor 4 

\( R_1 \gets Y_1  X_1 \)  \( R_2 \gets Y_2  X_2 \)  \( R_3 \gets Y_1 + X_1 \)  \( R_4 \gets Y_2 + X_2 \)  
\(1\mathbf M\)  \( R_5 \gets R_1 R_2 \)  \( R_6 \gets R_3 R_4 \)  \( R_7 \gets T_1 T_2 \)  \( R_8 \gets Z_1 Z_2 \) 
\(1\mathbf D\)  idle  idle  \( R_7 \gets k R_7 \)  \( R_8 \gets 2 R_8 \) 
\( R_1 \gets R_6  R_5 \)  \( R_2 \gets R_8  R_7 \)  \( R_3 \gets R_8 + R_7 \)  \( R_4 \gets R_6 + R_5 \)  
\(1\mathbf M\)  \( X_3 \gets R_1 R_2 \)  \( Y_3 \gets R_3 R_4 \)  \( T_3 \gets R_1 R_4 \)  \( Z_3 \gets R_2 R_3 \) 
Here \(\mathbf M\) and \(\mathbf S\) represent the cost of multiplication and squaring of generic field elements, \(\mathbf D\) represents the cost of multiplication by a curve constant (in this case \( k = 2d \)).
Notice that the \(1\mathbf M\) and \(1\mathbf S\) steps are uniform. The nonuniform steps are all inexpensive additions or subtractions, with the exception of the multiplication by the curve constant \(k = 2d\): $$ R_7 \gets 2 d R_7. $$
HWCD suggest parallelising this step by breaking \(k = 2d\) into four parts as \(k = k_0 + 2^n k_1 + 2^{2n} k_2 + 2^{3n} k_3 \) and computing \(k_i R_7 \) in parallel. This is quite awkward, but if the curve constant is a ratio \( d = d_1/d_2 \), then projective coordinates allow us to instead compute $$ (R_5, R_6, R_7, R_8) \gets (d_2 R_5, d_2 R_6, 2d_1 R_7, d_2 R_8). $$ This can be performed as a uniform multiplication by a vector of constants, and if \(d_1, d_2\) are small, it is relatively inexpensive. (This trick was suggested by Mike Hamburg). In the Curve25519 case, we have $$ d = \frac{d_1}{d_2} = \frac{121665}{121666}; $$ Since \(2 \cdot 121666 < 2^{18}\), all the constants above fit (up to sign) in 32 bits, so this can be done in parallel as four multiplications by small constants \( (121666, 121666, 2\cdot 121665, 2\cdot 121666) \), followed by a negation to compute \(  2\cdot 121665\).
Modified parallel formulas
Using the modifications sketched above, we can write SIMDfriendly versions of the parallel formulas as follows. To avoid confusion with the original formulas, temporary variables are named \(S\) instead of \(R\) and are in static singleassignment form.
Addition
To add points \(P_1 = (X_1 : Y_1 : Z_1 : T_1) \) and \(P_2 = (X_2 : Y_2 : Z_2 : T_2 ) \), we compute $$ \begin{aligned} (S_0 &&,&& S_1 &&,&& S_2 &&,&& S_3 ) &\gets (Y_1  X_1&&,&& Y_1 + X_1&&,&& Y_2  X_2&&,&& Y_2 + X_2) \\ (S_4 &&,&& S_5 &&,&& S_6 &&,&& S_7 ) &\gets (S_0 \cdot S_2&&,&& S_1 \cdot S_3&&,&& Z_1 \cdot Z_2&&,&& T_1 \cdot T_2) \\ (S_8 &&,&& S_9 &&,&& S_{10} &&,&& S_{11} ) &\gets (d_2 \cdot S_4 &&,&& d_2 \cdot S_5 &&,&& 2 d_2 \cdot S_6 &&,&& 2 d_1 \cdot S_7 ) \\ (S_{12} &&,&& S_{13} &&,&& S_{14} &&,&& S_{15}) &\gets (S_9  S_8&&,&& S_9 + S_8&&,&& S_{10}  S_{11}&&,&& S_{10} + S_{11}) \\ (X_3&&,&& Y_3&&,&& Z_3&&,&& T_3) &\gets (S_{12} \cdot S_{14}&&,&& S_{15} \cdot S_{13}&&,&& S_{15} \cdot S_{14}&&,&& S_{12} \cdot S_{13}) \end{aligned} $$ to obtain \( P_3 = (X_3 : Y_3 : Z_3 : T_3) = P_1 + P_2 \). This costs \( 2\mathbf M + 1 \mathbf D\).
Readdition
If the point \( P_2 = (X_2 : Y_2 : Z_2 : T_2) \) is fixed, we can cache the multiplication of the curve constants by computing $$ \begin{aligned} (S_2' &&,&& S_3' &&,&& Z_2' &&,&& T_2' ) &\gets (d_2 \cdot (Y_2  X_2)&&,&& d_2 \cdot (Y_1 + X_1)&&,&& 2d_2 \cdot Z_2 &&,&& 2d_1 \cdot T_2). \end{aligned} $$ This costs \( 1\mathbf D\); with \( (S_2', S_3', Z_2', T_2')\) in hand, the addition formulas above become $$ \begin{aligned} (S_0 &&,&& S_1 &&,&& Z_1 &&,&& T_1 ) &\gets (Y_1  X_1&&,&& Y_1 + X_1&&,&& Z_1 &&,&& T_1) \\ (S_8 &&,&& S_9 &&,&& S_{10} &&,&& S_{11} ) &\gets (S_0 \cdot S_2' &&,&& S_1 \cdot S_3'&&,&& Z_1 \cdot Z_2' &&,&& T_1 \cdot T_2') \\ (S_{12} &&,&& S_{13} &&,&& S_{14} &&,&& S_{15}) &\gets (S_9  S_8&&,&& S_9 + S_8&&,&& S_{10}  S_{11}&&,&& S_{10} + S_{11}) \\ (X_3&&,&& Y_3&&,&& Z_3&&,&& T_3) &\gets (S_{12} \cdot S_{14}&&,&& S_{15} \cdot S_{13}&&,&& S_{15} \cdot S_{14}&&,&& S_{12} \cdot S_{13}) \end{aligned} $$ which costs only \( 2\mathbf M \). This precomputation is essentially similar to the precomputation that HWCD suggest for their serial formulas. Because the cost of precomputation and then readdition is the same as addition, it's sufficient to only implement caching and readdition.
Doubling
The nonuniform portions of the (re)addition formulas have a fairly regular structure. Unfortunately, this is not the case for the doubling formulas, which are much less nice.
To double a point \( P = (X_1 : Y_1 : Z_1 : T_1) \), we compute $$ \begin{aligned} (X_1 &&,&& Y_1 &&,&& Z_1 &&,&& S_0) &\gets (X_1 &&,&& Y_1 &&,&& Z_1 &&,&& X_1 + Y_1) \\ (S_1 &&,&& S_2 &&,&& S_3 &&,&& S_4 ) &\gets (X_1^2 &&,&& Y_1^2&&,&& Z_1^2 &&,&& S_0^2) \\ (S_5 &&,&& S_6 &&,&& S_8 &&,&& S_9 ) &\gets (S_1 + S_2 &&,&& S_1  S_2 &&,&& S_1 + 2S_3  S_2 &&,&& S_1 + S_2  S_4) \\ (X_3 &&,&& Y_3 &&,&& Z_3 &&,&& T_3 ) &\gets (S_8 \cdot S_9 &&,&& S_5 \cdot S_6 &&,&& S_8 \cdot S_6 &&,&& S_5 \cdot S_9) \end{aligned} $$ to obtain \( P_3 = (X_3 : Y_3 : Z_3 : T_3) = [2]P_1 \).
The intermediate step between the squaring and multiplication requires a long chain of additions, but with some care and finesse, described below, it is possible (in our case) to arrange this computation without requiring an intermediate reduction.
However, it does mean that the doubling formulas have proportionately more vectorization overhead than the (re)addition formulas. The effects of this are discussed in the comparison section below.
Field element representation
Our strategy is to implement 4wide multiplication and squaring by
wordslicing, using one 64bit AVX2 lane for each field element. Field
elements are represented in the usual way as 10 u32
limbs in radix
\(25.5\) (i.e., alternating between \(2^{26}\) for even limbs and
\(2^{25}\) for odd limbs). This has the effect that passing between
the parallel 32bit AVX2 representation and the serial 64bit
representation (which uses radix \(2^{51}\)) amounts to regrouping
digits.
The field element representation is oriented around the AVX2
vpmuluqdq
instruction, which multiplies the low 32 bits of each
64bit lane of each operand to produce a 64bit result.
(a1 ?? b1 ?? c1 ?? d1 ??)
(a2 ?? b2 ?? c2 ?? d2 ??)
(a1*a2 b1*b2 c1*c2 d1*d2)
To unpack 32bit values into 64bit lanes for use in multiplication
it would be convenient to use the vpunpck[lh]dq
instructions,
which unpack and interleave the low and high 32bit lanes of two
source vectors.
However, the AVX2 versions of these instructions are designed to
operate only within 128bit lanes of the 256bit vectors, so that
interleaving the low lanes of (a0 b0 c0 d0 a1 b1 c1 d1)
with zero
gives (a0 00 b0 00 a1 00 b1 00)
. Instead, we preshuffle the data
layout as (a0 b0 a1 b1 c0 d0 c1 d1)
so that we can unpack the
"low" and "high" parts as
(a0 00 b0 00 c0 00 d0 00)
(a1 00 b1 00 c1 00 d1 00)
The data layout for a vector of four field elements \( (a,b,c,d)
\) with limbs \( a_0, a_1, \ldots, a_9 \) is as [u32x8; 5]
in
the form
(a0 b0 a1 b1 c0 d0 c1 d1)
(a2 b2 a3 b3 c2 d2 c3 d3)
(a4 b4 a5 b5 c4 d4 c5 d5)
(a6 b6 a7 b7 c6 d6 c7 d7)
(a8 b8 a9 b9 c8 d8 c9 d9)
Since this breaks cleanly into two 128bit lanes, it may be possible to adapt it to 128bit vector instructions such as NEON without too much difficulty. Going the other direction, to extend this to AVX512, we could either run two point operations in parallel in lower and upper halves of the registers, or use 2way parallelism within a field operation.
Avoiding Overflow in Doubling
To analyze the size of the field element coefficients during the computations, we can parameterize the bounds on the limbs of each field element by \( b \in \mathbb R \) representing the excess bits above that limb's radix, so that each limb is bounded by either \(2^{25+b} \) or \( 2^{26+b} \), as appropriate.
The multiplication routine requires that its inputs are bounded with \( b < 1.75 \), in order to fit a multiplication by \( 19 \) into 32 bits. Since \( \lg 19 < 4.25 \), \( 19x < 2^{32} \) when \( x < 2^{27.75} = 2^{26 + 1.75} \). However, this is only required for one of the inputs; the other can grow up to \( b < 2.5 \).
In addition, the multiplication and squaring routines do not canonically reduce their outputs, but can leave some small uncarried excesses, so that their reduced outputs are bounded with \( b < 0.007 \).
The nonparallel portion of the doubling formulas is $$ \begin{aligned} (S_5 &&,&& S_6 &&,&& S_8 &&,&& S_9 ) &\gets (S_1 + S_2 &&,&& S_1  S_2 &&,&& S_1 + 2S_3  S_2 &&,&& S_1 + S_2  S_4) \end{aligned} $$
Computing \( (S_5, S_6, S_8, S_9 ) \) as $$ \begin{matrix} & S_1 & S_1 & S_1 & S_1 \\ +& S_2 & & & S_2 \\ +& & & S_3 & \\ +& & & S_3 & \\ +& & 2p & 2p & 2p \\ & & S_2 & S_2 & \\ & & & & S_4 \\ =& S_5 & S_6 & S_8 & S_9 \end{matrix} $$ results in bitexcesses \( < (1.01, 1.60, 2.33, 2.01)\) for \( (S_5, S_6, S_8, S_9 ) \). The products we want to compute are then $$ \begin{aligned} X_3 &\gets S_8 S_9 \leftrightarrow (2.33, 2.01) \\ Y_3 &\gets S_5 S_6 \leftrightarrow (1.01, 1.60) \\ Z_3 &\gets S_8 S_6 \leftrightarrow (2.33, 1.60) \\ T_3 &\gets S_5 S_9 \leftrightarrow (1.01, 2.01) \end{aligned} $$ which are too large: it's not possible to arrange the multiplicands so that one vector has \(b < 2.5\) and the other has \( b < 1.75 \). However, if we flip the sign of \( S_4 = S_0^2 \) during squaring, so that we output \(S_4' = S_4 \pmod p\), then we can compute $$ \begin{matrix} & S_1 & S_1 & S_1 & S_1 \\ +& S_2 & & & S_2 \\ +& & & S_3 & \\ +& & & S_3 & \\ +& & & & S_4' \\ +& & 2p & 2p & \\ & & S_2 & S_2 & \\ =& S_5 & S_6 & S_8 & S_9 \end{matrix} $$ resulting in bitexcesses \( < (1.01, 1.60, 2.33, 1.60)\) for \( (S_5, S_6, S_8, S_9 ) \). The products we want to compute are then $$ \begin{aligned} X_3 &\gets S_8 S_9 \leftrightarrow (2.33, 1.60) \\ Y_3 &\gets S_5 S_6 \leftrightarrow (1.01, 1.60) \\ Z_3 &\gets S_8 S_6 \leftrightarrow (2.33, 1.60) \\ T_3 &\gets S_5 S_9 \leftrightarrow (1.01, 1.60) \end{aligned} $$ whose righthand sides are all bounded with \( b < 1.75 \) and whose lefthand sides are all bounded with \( b < 2.5 \), so that we can avoid any intermediate reductions.
Comparison to nonvectorized formulas
In theory, the parallel Edwards formulas seem to allow a \(4\)way speedup from parallelism. However, an actual vectorized implementation has several slowdowns that cut into this speedup.
First, the parallel formulas can only use a \( 32 \times 32
\rightarrow 64 \)bit integer multiplier, so the speedup from
vectorization must overcome the disadvantage of losing the \( 64
\times 64 \rightarrow 128\)bit (serial) integer multiplier. The
effect of this slowdown is microarchitecturedependent, since it
requires accounting for the total number of multiplications and
additions and their relative costs. In the future, it will probably
be possible to avoid this slowdown by using the IFMA52
instructions,
whose parallelism is perfectly suited to these formulas.
Second, the parallel doubling formulas incur both a theoretical and
practical slowdown. The parallel formulas described above work on the
\( \mathbb P^3 \) “extended” coordinates. The \( \mathbb P^2 \)
model introduced earlier by Bernstein, Birkner, Joye, Lange, and
Peters allows slightly faster doublings, so HWCD suggest
mixing coordinate systems while performing scalar multiplication
(attributing the idea to a 1998 paper by Cohen, Miyagi, and
Ono). The \( T \) coordinate is not required for doublings, so when
doublings are followed by doublings, its computation can be skipped.
More details on this approach and the different coordinate systems can
be found in the curve_models
module documentation.
Unfortunately, this optimization is not compatible with the parallel formulas, which cannot save time by skipping a single variable, so the parallel doubling formulas do slightly more work when counting the total number of field multiplications and squarings.
In addition, the parallel doubling formulas have a less regular pattern of additions and subtractions than the parallel addition formulas, so the vectorization overhead is proportionately greater. Both the parallel addition and parallel doubling formulas also require some shuffling to rearrange data within the vectors, which places more pressure on the shuffle unit than is desirable.
This means that the speedup from using a vectorized implementation of parallel Edwards formulas is likely to be greatest in applications that do fewer doublings and more additions (like a large multiscalar multiplication) rather than applications that do fewer additions and more doublings (like a doublebase scalar multiplication).
Third, current Intel CPUs perform thermal throttling when using wide vector instructions. A detailed description can be found in §15.26 of the Intel Optimization Manual, but using wide vector instructions prevents the core from operating at higher frequencies. The core can return to the higherfrequency state after 2 milliseconds, but this timer is reset every time highpower instructions are used.
Any speedup from vectorization therefore has to be weighed against a slowdown for the next few million instructions. For a mixed workload, where point operations are interspersed with other tasks, this can reduce overall performance. This implementation is therefore probably not suitable for basic applications, like signatures, but is worthwhile for complex applications, like zeroknowledge proofs, which do sustained work.
For this reason, the AVX2 backend is not enabled by default, but can
be selected using the avx2_backend
feature.
Future work
There are several directions for future improvement:

Using the vectorized field arithmetic code to parallelize across point operations rather than within a single point operation. This is less flexible, but would give a speedup both from allowing use of the faster mixedmodel arithmetic and from reducing shuffle pressure. One approach in this direction would be to implement batched scalarpoint operations using vectors of points (AoSoA layout). This less generally useful but would give a speedup for Bulletproofs.

Extending the implementation to use the full width of AVX512, either handling the extra parallelism internally to a single point operation (by using a 2way parallel implementation of field arithmetic instead of a wordsliced one), or externally, parallelizing across point operations. Internal parallelism would be preferable but might require too much shuffle pressure.

Generalizing the implementation to nonAVX2 instructions, particularly NEON. The current point arithmetic code is written in terms of field element vectors, which are in turn implemented using platform SIMD vectors. It should be possible to write an alternate implementation of the
FieldElement32x4
using NEON without changing the point arithmetic. NEON has 128bit vectors rather than 256bit vectors, but this may still be worthwhile compared to a serial implementation.
Modules
constants 
This module contains constants used by the AVX2 backend. 
edwards 
Parallel Edwards Arithmetic for Curve25519. 
field 
An implementation of 4way vectorized 32bit field arithmetic using AVX2. 
scalar_mul 