/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Implementation of POLYVAL using ARMv8 Crypto Extensions.
 *
 * Copyright 2021 Google LLC
 */
/*
 * This is an efficient implementation of POLYVAL using ARMv8 Crypto Extensions.
 * It works on 8 blocks at a time, by precomputing the first 8 key powers h^8,
 * ..., h^1 in the POLYVAL finite field. This precomputation allows us to split
 * finite field multiplication into two steps.
 *
 * In the first step, we consider h^i, m_i as normal polynomials of degree less
 * than 128. We then compute p(x) = h^8*m_0 + ... + h^1*m_7 where multiplication
 * is simply polynomial multiplication.
 *
 * In the second step, we compute the reduction of p(x) modulo the finite field
 * modulus g(x) = x^128 + x^127 + x^126 + x^121 + 1.
 *
 * This two-step process is equivalent to computing h^8*m_0 + ... + h^1*m_7
 * where multiplication is finite field multiplication. The advantage is that
 * the two-step process only requires 1 finite field reduction for every 8
 * polynomial multiplications. Further parallelism is gained by interleaving the
 * multiplications and polynomial reductions.
 */

#include <linux/linkage.h>
#define STRIDE_BLOCKS 8

ACCUMULATOR	.req	x0
KEY_POWERS	.req	x1
MSG	.req	x2
BLOCKS_LEFT	.req	x3
KEY_START	.req	x10
EXTRA_BYTES	.req	x11
TMP	.req	x13

M0	.req	v0
M1	.req	v1
M2	.req	v2
M3	.req	v3
M4	.req	v4
M5	.req	v5
M6	.req	v6
M7	.req	v7
KEY8	.req	v8
KEY7	.req	v9
KEY6	.req	v10
KEY5	.req	v11
KEY4	.req	v12
KEY3	.req	v13
KEY2	.req	v14
KEY1	.req	v15
PL	.req	v16
PH	.req	v17
TMP_V	.req	v18
LO	.req	v20
MI	.req	v21
HI	.req	v22
SUM	.req	v23
GSTAR	.req	v24

	.text

	.arch	armv8-a+crypto
	.align	4

.Lgstar:
	.quad	0xc200000000000000, 0xc200000000000000

/*
 * Computes the product of two 128-bit polynomials in X and Y and XORs the
 * components of the 256-bit product into LO, MI, HI.
 *
 * Given:
 *   X = [X_1 : X_0]
 *   Y = [Y_1 : Y_0]
 *
 * We compute:
 *   LO += X_0 * Y_0
 *   MI += (X_0 + X_1) * (Y_0 + Y_1)
 *   HI += X_1 * Y_1
 *
 * Later, the 256-bit result can be extracted as:
 *   [HI_1 : HI_0 + HI_1 + MI_1 + LO_1 : LO_1 + HI_0 + MI_0 + LO_0 : LO_0]
 * This step is done when computing the polynomial reduction for efficiency
 * reasons.
 *
 * Karatsuba multiplication is used instead of Schoolbook multiplication because
 * it was found to be slightly faster on ARM64 CPUs.
 */
.macro karatsuba1 X Y
	X .req \X
	Y .req \Y
	ext	v25.16b, X.16b, X.16b, #8
	ext	v26.16b, Y.16b, Y.16b, #8
	eor	v25.16b, v25.16b, X.16b
	eor	v26.16b, v26.16b, Y.16b
	pmull2	v28.1q, X.2d, Y.2d
	pmull	v29.1q, X.1d, Y.1d
	pmull	v27.1q, v25.1d, v26.1d
	eor	HI.16b, HI.16b, v28.16b
	eor	LO.16b, LO.16b, v29.16b
	eor	MI.16b, MI.16b, v27.16b
	.unreq X
	.unreq Y
.endm

/*
 * Same as karatsuba1, except overwrites HI, LO, MI rather than XORing into
 * them.
 */
.macro karatsuba1_store X Y
	X .req \X
	Y .req \Y
	ext	v25.16b, X.16b, X.16b, #8
	ext	v26.16b, Y.16b, Y.16b, #8
	eor	v25.16b, v25.16b, X.16b
	eor	v26.16b, v26.16b, Y.16b
	pmull2	HI.1q, X.2d, Y.2d
	pmull	LO.1q, X.1d, Y.1d
	pmull	MI.1q, v25.1d, v26.1d
	.unreq X
	.unreq Y
.endm

/*
 * Computes the 256-bit polynomial represented by LO, HI, MI. Stores
 * the result in PL, PH.
 * [PH : PL] =
 *   [HI_1 : HI_1 + HI_0 + MI_1 + LO_1 : HI_0 + MI_0 + LO_1 + LO_0 : LO_0]
 */
.macro karatsuba2
	// v4 = [HI_1 + MI_1 : HI_0 + MI_0]
	eor	v4.16b, HI.16b, MI.16b
	// v4 = [HI_1 + MI_1 + LO_1 : HI_0 + MI_0 + LO_0]
	eor	v4.16b, v4.16b, LO.16b
	// v5 = [HI_0 : LO_1]
	ext	v5.16b, LO.16b, HI.16b, #8
	// v4 = [HI_1 + HI_0 + MI_1 + LO_1 : HI_0 + MI_0 + LO_1 + LO_0]
	eor	v4.16b, v4.16b, v5.16b
	// HI = [HI_0 : HI_1]
	ext	HI.16b, HI.16b, HI.16b, #8
	// LO = [LO_0 : LO_1]
	ext	LO.16b, LO.16b, LO.16b, #8
	// PH = [HI_1 : HI_1 + HI_0 + MI_1 + LO_1]
	ext	PH.16b, v4.16b, HI.16b, #8
	// PL = [HI_0 + MI_0 + LO_1 + LO_0 : LO_0]
	ext	PL.16b, LO.16b, v4.16b, #8
.endm

/*
 * Computes the 128-bit reduction of PH : PL. Stores the result in dest.
 *
 * This macro computes p(x) mod g(x) where p(x) is in Montgomery form and
 * g(x) = x^128 + x^127 + x^126 + x^121 + 1.
 *
 * We have a 256-bit polynomial PH : PL = P_3 : P_2 : P_1 : P_0 that is the
 * product of two 128-bit polynomials in Montgomery form. We need to reduce it
 * mod g(x). Also, since polynomials in Montgomery form have an "extra" factor
 * of x^128, this product has two extra factors of x^128. To get it back into
 * Montgomery form, we need to remove one of these factors by dividing by x^128.
 *
 * To accomplish both of these goals, we add multiples of g(x) that cancel out
 * the low 128 bits P_1 : P_0, leaving just the high 128 bits. Since the low
 * bits are zero, the polynomial division by x^128 can be done by right
 * shifting.
 *
 * Since the only nonzero term in the low 64 bits of g(x) is the constant term,
 * the multiple of g(x) needed to cancel out P_0 is P_0 * g(x). The CPU can
 * only do 64x64 bit multiplications, so split P_0 * g(x) into
 * x^128 * P_0 + x^64 * g*(x) * P_0 + P_0, where g*(x) is bits 64-127 of g(x).
 * Adding this to the original polynomial gives
 * P_3 : P_2 + P_0 + T_1 : P_1 + T_0 : 0, where T = T_1 : T_0 = g*(x) * P_0.
 * Thus, bits 0-63 got "folded" into bits 64-191.
 *
 * Repeating this same process on the next 64 bits "folds" bits 64-127 into
 * bits 128-255, giving the answer in bits 128-255. This time, we need to
 * cancel P_1 + T_0 in bits 64-127. The multiple of g(x) required is
 * (P_1 + T_0) * g(x) * x^64. Adding this to our previous computation gives
 * P_3 + P_1 + T_0 + V_1 : P_2 + P_0 + T_1 + V_0 : 0 : 0, where
 * V = V_1 : V_0 = g*(x) * (P_1 + T_0).
 *
 * So our final computation is:
 *   T = T_1 : T_0 = g*(x) * P_0
 *   V = V_1 : V_0 = g*(x) * (P_1 + T_0)
 *   p(x) / x^{128} mod g(x) = P_3 + P_1 + T_0 + V_1 : P_2 + P_0 + T_1 + V_0
 *
 * The implementation below saves a XOR instruction by computing
 * P_1 + T_0 : P_0 + T_1 and XORing into dest, rather than separately XORing
 * P_1 : P_0 and T_0 : T_1 into dest. This allows us to reuse P_1 + T_0 when
 * computing V.
 */
.macro montgomery_reduction dest
	DEST .req \dest
	// TMP_V = T_1 : T_0 = P_0 * g*(x)
	pmull	TMP_V.1q, PL.1d, GSTAR.1d
	// TMP_V = T_0 : T_1
	ext	TMP_V.16b, TMP_V.16b, TMP_V.16b, #8
	// TMP_V = P_1 + T_0 : P_0 + T_1
	eor	TMP_V.16b, PL.16b, TMP_V.16b
	// PH = P_3 + P_1 + T_0 : P_2 + P_0 + T_1
	eor	PH.16b, PH.16b, TMP_V.16b
	// TMP_V = V_1 : V_0 = (P_1 + T_0) * g*(x)
	pmull2	TMP_V.1q, TMP_V.2d, GSTAR.2d
	eor	DEST.16b, PH.16b, TMP_V.16b
	.unreq DEST
.endm

/*
 * Compute Polyval on 8 blocks.
 *
 * If reduce is set, also computes the Montgomery reduction of the previous
 * full_stride call and XORs it with the first message block, computing:
 *   (m_0 + REDUCE(PL, PH)) * h^8 + ... + m_7 * h^1
 * I.e., the first multiplication uses m_0 + REDUCE(PL, PH) instead of m_0.
 *
 * Sets PL, PH.
 */
.macro full_stride reduce
	eor	LO.16b, LO.16b, LO.16b
	eor	MI.16b, MI.16b, MI.16b
	eor	HI.16b, HI.16b, HI.16b

	ld1	{M0.16b, M1.16b, M2.16b, M3.16b}, [MSG], #64
	ld1	{M4.16b, M5.16b, M6.16b, M7.16b}, [MSG], #64

	karatsuba1 M7 KEY1
	.if \reduce
	pmull	TMP_V.1q, PL.1d, GSTAR.1d
	.endif

	karatsuba1 M6 KEY2
	.if \reduce
	ext	TMP_V.16b, TMP_V.16b, TMP_V.16b, #8
	.endif

	karatsuba1 M5 KEY3
	.if \reduce
	eor	TMP_V.16b, PL.16b, TMP_V.16b
	.endif

	karatsuba1 M4 KEY4
	.if \reduce
	eor	PH.16b, PH.16b, TMP_V.16b
	.endif

	karatsuba1 M3 KEY5
	.if \reduce
	pmull2	TMP_V.1q, TMP_V.2d, GSTAR.2d
	.endif

	karatsuba1 M2 KEY6
	.if \reduce
	eor	SUM.16b, PH.16b, TMP_V.16b
	.endif

	karatsuba1 M1 KEY7
	eor	M0.16b, M0.16b, SUM.16b

	karatsuba1 M0 KEY8
	karatsuba2
.endm

/*
 * Handle any extra blocks after full_stride loop.
 */
.macro partial_stride
	add	KEY_POWERS, KEY_START, #(STRIDE_BLOCKS << 4)
	sub	KEY_POWERS, KEY_POWERS, BLOCKS_LEFT, lsl #4
	ld1	{KEY1.16b}, [KEY_POWERS], #16

	ld1	{TMP_V.16b}, [MSG], #16
	eor	SUM.16b, SUM.16b, TMP_V.16b
	karatsuba1_store KEY1 SUM
	sub	BLOCKS_LEFT, BLOCKS_LEFT, #1

	tst	BLOCKS_LEFT, #4
	beq	.Lpartial4BlocksDone
	ld1	{M0.16b, M1.16b, M2.16b, M3.16b}, [MSG], #64
	ld1	{KEY8.16b, KEY7.16b, KEY6.16b, KEY5.16b}, [KEY_POWERS], #64
	karatsuba1 M0 KEY8
	karatsuba1 M1 KEY7
	karatsuba1 M2 KEY6
	karatsuba1 M3 KEY5
.Lpartial4BlocksDone:
	tst	BLOCKS_LEFT, #2
	beq	.Lpartial2BlocksDone
	ld1	{M0.16b, M1.16b}, [MSG], #32
	ld1	{KEY8.16b, KEY7.16b}, [KEY_POWERS], #32
	karatsuba1 M0 KEY8
	karatsuba1 M1 KEY7
.Lpartial2BlocksDone:
	tst	BLOCKS_LEFT, #1
	beq	.LpartialDone
	ld1	{M0.16b}, [MSG], #16
	ld1	{KEY8.16b}, [KEY_POWERS], #16
	karatsuba1 M0 KEY8
.LpartialDone:
	karatsuba2
	montgomery_reduction SUM
.endm

/*
 * Computes a = a * b * x^{-128} mod x^128 + x^127 + x^126 + x^121 + 1.
 *
 * void polyval_mul_pmull(struct polyval_elem *a,
 *			  const struct polyval_elem *b);
 */
SYM_FUNC_START(polyval_mul_pmull)
	adr	TMP, .Lgstar
	ld1	{GSTAR.2d}, [TMP]
	ld1	{v0.16b}, [x0]
	ld1	{v1.16b}, [x1]
	karatsuba1_store v0 v1
	karatsuba2
	montgomery_reduction SUM
	st1	{SUM.16b}, [x0]
	ret
SYM_FUNC_END(polyval_mul_pmull)

/*
 * Perform polynomial evaluation as specified by POLYVAL. This computes:
 *   h^n * accumulator + h^n * m_0 + ... + h^1 * m_{n-1}
 * where n=nblocks, h is the hash key, and m_i are the message blocks.
 *
 * x0 - pointer to accumulator
 * x1 - pointer to precomputed key powers h^8 ... h^1
 * x2 - pointer to message blocks
 * x3 - number of blocks to hash
 *
 * void polyval_blocks_pmull(struct polyval_elem *acc,
 *			     const struct polyval_key *key,
 *			     const u8 *data, size_t nblocks);
 */
SYM_FUNC_START(polyval_blocks_pmull)
	adr	TMP, .Lgstar
	mov	KEY_START, KEY_POWERS
	ld1	{GSTAR.2d}, [TMP]
	ld1	{SUM.16b}, [ACCUMULATOR]
	subs	BLOCKS_LEFT, BLOCKS_LEFT, #STRIDE_BLOCKS
	blt	.LstrideLoopExit
	ld1	{KEY8.16b, KEY7.16b, KEY6.16b, KEY5.16b}, [KEY_POWERS], #64
	ld1	{KEY4.16b, KEY3.16b, KEY2.16b, KEY1.16b}, [KEY_POWERS], #64
	full_stride 0
	subs	BLOCKS_LEFT, BLOCKS_LEFT, #STRIDE_BLOCKS
	blt	.LstrideLoopExitReduce
.LstrideLoop:
	full_stride 1
	subs	BLOCKS_LEFT, BLOCKS_LEFT, #STRIDE_BLOCKS
	bge	.LstrideLoop
.LstrideLoopExitReduce:
	montgomery_reduction SUM
.LstrideLoopExit:
	adds	BLOCKS_LEFT, BLOCKS_LEFT, #STRIDE_BLOCKS
	beq	.LskipPartial
	partial_stride
.LskipPartial:
	st1	{SUM.16b}, [ACCUMULATOR]
	ret
SYM_FUNC_END(polyval_blocks_pmull)