Lines Matching +full:speed +full:- +full:grade
2 * xxHash - Fast Hash algorithm
6 * - xxHash homepage: http://www.xxhash.com
7 * - xxHash source repository : https://github.com/Cyan4973/xxHash
9 * This source code is licensed under both the BSD-style license (found in the
12 * You may select, at your option, one of the above-listed licenses.
33 xxHash is an extremely fast hash algorithm, running at RAM speed limits.
38   Name            Speed        Q.Score   Author
49   MD5-32          0.33 GB/s     10       Ronald L. Rivest
50   SHA1-32         0.28 GB/s     10
57 Other speed-oriented implementations can be faster,
59 https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c349009…
61 A 64-bit version, named XXH64, is available since r35.
62 It offers much better speed, but for 64-bit applications only.
63   Name            Speed on 64 bits    Speed on 32 bits
79 * expressed as a compile-time constant:
81 * https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
335 /*-**********************************************************************
336 * 32-bit hash
340 * @brief An unsigned 32-bit integer.
360 # error "unsupported platform: need a 32-bit type"
370 * Contains functions used in the classic 32-bit xxHash algorithm.
373 * XXH32 is useful for older platforms, with no or poor 64-bit performance.
374 * Note that @ref xxh3_family provides competitive speed
375 * for both 32-bit and 64-bit systems, and offers true 64/128 bit hash results.
383 * @brief Calculates the 32-bit hash of @p input using xxHash32.
385 * Speed on Core 2 Duo @ 3 GHz (single thread, SMHasher benchmark): 5.4 GB/s
389 * @param seed The 32-bit seed to alter the hash's output predictably.
396 * @return The calculated 32-bit hash value.
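A minimal one-shot usage sketch, assuming the plain (un-namespaced) xxHash API and that "xxhash.h" is on the include path:

    #include <stdio.h>
    #include <string.h>
    #include "xxhash.h"

    int main(void)
    {
        const char* data = "hello world";
        /* One call hashes the whole buffer; the last argument is the seed. */
        XXH32_hash_t const h = XXH32(data, strlen(data), 0);
        printf("XXH32 = 0x%08x\n", (unsigned)h);
        return 0;
    }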
408 * This method is slower than single-call functions, due to state management.
421 * This function returns the nn-bit hash as an int or long long.
495 * @param seed The 32-bit seed to alter the hash result predictably.
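A sketch of the streaming (multi-update) pattern under the same assumptions, using the documented create/reset/update/digest/free set:

    #include <string.h>
    #include "xxhash.h"

    /* Feed input in chunks; the final digest equals the one-shot hash of the
     * concatenated input. */
    static XXH32_hash_t hash_two_chunks(const char* a, const char* b, XXH32_hash_t seed)
    {
        XXH32_state_t* const st = XXH32_createState();   /* heap-allocated state */
        XXH32_hash_t h = 0;
        if (st == NULL) return 0;                         /* allocation failure */
        if (XXH32_reset(st, seed) == XXH_OK) {
            XXH32_update(st, a, strlen(a));
            XXH32_update(st, b, strlen(b));
            h = XXH32_digest(st);                         /* state stays usable afterwards */
        }
        XXH32_freeState(st);
        return h;
    }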
545 * This is the simplest and fastest format for further post-processing.
550 * The canonical representation settles this issue by mandating big-endian
551 * convention, the same convention as human-readable numbers (large digits first).
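A short sketch of the canonical round-trip (big-endian byte order on every platform):

    #include "xxhash.h"

    /* Serialize a hash in canonical form, then read it back; the round-trip
     * returns the original value regardless of host endianness. */
    static XXH32_hash_t canonical_roundtrip(XXH32_hash_t h)
    {
        XXH32_canonical_t canon;                 /* 4 raw bytes, big-endian */
        XXH32_canonicalFromHash(&canon, h);      /* suitable for files or the network */
        return XXH32_hashFromCanonical(&canon);
    }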
598 /* C-language Attributes are added in C23. */
634 /*-**********************************************************************
635 * 64-bit hash
639 * @brief An unsigned 64-bit integer.
655 /* the following type must have a width of 64-bit */
666 * Contains functions used in the classic 64-bit xxHash algorithm.
669 * XXH3 provides competitive speed for both 32-bit and 64-bit systems,
671 * It provides better speed for systems with vector processing capabilities.
676 * @brief Calculates the 64-bit hash of @p input using xxHash64.
678 * This function usually runs faster on 64-bit systems, but slower on 32-bit
683 * @param seed The 64-bit seed to alter the hash's output predictably.
690 * @return The calculated 64-bit hash.
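The calling convention mirrors XXH32(), with a 64-bit seed and result; a brief sketch:

    #include <string.h>
    #include "xxhash.h"

    static XXH64_hash_t hash64(const void* data, size_t len)
    {
        return XXH64(data, len, 0 /* seed */);
    }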
698 /* Begin FreeBSD - This symbol is needed by dll-linked CLI zstd(1). */
732 * - Improved speed for both small and large inputs
733 * - True 64-bit and 128-bit outputs
734 * - SIMD acceleration
735 * - Improved 32-bit viability
737 * Speed analysis methodology is explained here:
739 * https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html
745 * XXH3's speed benefits greatly from SIMD and 64-bit arithmetic,
747 * Any 32-bit and 64-bit targets that can run XXH32 smoothly
763 * reduces the amount of mixing, resulting in faster speed on small inputs.
766 * The API supports one-shot hashing, streaming mode, and custom secrets.
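A sketch of the XXH3 64-bit one-shot calls (default secret, with and without a seed):

    #include "xxhash.h"

    static void xxh3_oneshot(const void* data, size_t len)
    {
        XXH64_hash_t const h0 = XXH3_64bits(data, len);               /* seedless: fastest path */
        XXH64_hash_t const h1 = XXH3_64bits_withSeed(data, len, 42);  /* seeded variant */
        (void)h0; (void)h1;
    }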
769 /*-**********************************************************************
770 * XXH3 64-bit variant
774 * default 64-bit variant, using default secret and default seed of 0.
820 * As a consequence, streaming is slower than one-shot hashing.
821 * For better performance, prefer one-shot functions whenever applicable.
849 * Similar to the one-shot API, `secretSize` must be >= `XXH3_SECRET_SIZE_MIN`,
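A streaming sketch for the 64-bit variant; the 128-bit and secret-based resets follow the same shape:

    #include <string.h>
    #include "xxhash.h"

    static XXH64_hash_t xxh3_stream(const char* a, const char* b)
    {
        XXH64_hash_t h = 0;
        XXH3_state_t* const st = XXH3_createState();
        if (st == NULL) return 0;
        if (XXH3_64bits_reset(st) == XXH_OK) {   /* default secret, seed 0 */
            XXH3_64bits_update(st, a, strlen(a));
            XXH3_64bits_update(st, b, strlen(b));
            h = XXH3_64bits_digest(st);
        }
        XXH3_freeState(st);
        return h;
    }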
864 /*-**********************************************************************
865 * XXH3 128-bit variant
869 * @brief The return value from 128-bit hashes.
887 * As a consequence, streaming is slower than one-shot hashing.
888 * For better performance, prefer one-shot functions whenever applicable.
893 * All reset and streaming functions have the same meaning as their 64-bit counterparts.
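A sketch of the 128-bit one-shot call; the result is returned as two 64-bit halves:

    #include <stdio.h>
    #include "xxhash.h"

    static void print_xxh128(const void* data, size_t len)
    {
        XXH128_hash_t const h = XXH3_128bits(data, len);
        /* Print the high half first, matching the canonical (big-endian) ordering. */
        printf("%016llx%016llx\n",
               (unsigned long long)h.high64, (unsigned long long)h.low64);
    }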
979 #ifndef XXH_NO_LONG_LONG /* defined when there is no 64-bit support */
994 XXH64_hash_t total_len; /*!< Total length hashed. This is always 64-bit. */
1078 /*!< Reserved field. Needed for padding on 64-bit. */
1082 /*!< Total length hashed. 64-bit even on 32-bit targets. */
1100 * @brief Initializes a stack-allocated `XXH3_state_s`.
1110 #define XXH3_INITSTATE(XXH3_state_ptr) { (XXH3_state_ptr)->seed = 0; }
1114 * simple alias to pre-selected XXH3_128bits variant
1125 * Derive a high-entropy secret from any user-defined content, named customSeed.
1127 * The `_withSecret()` variants are useful to provide a higher level of protection than a 64-bit seed,
1131 * and derives from it a high-entropy secret of length @secretSize
1140 * _and_ feature very high entropy (consist of random-looking bytes).
1174 * This generally benefits speed, compared to `_withSeed()` or `_withSecret()`.
1183 * hence offering only a pure speed benefit on "large" input,
1186 * Another usage scenario is to hash the secret to a 64-bit hash value,
1189 * On top of speed, an added benefit is that each bit in the secret
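A sketch of the custom-secret workflow, assuming the XXH3_generateSecret() signature described above (secretBuffer, secretSize, customSeed, customSeedSize); in recent releases this function is exposed behind XXH_STATIC_LINKING_ONLY:

    #define XXH_STATIC_LINKING_ONLY
    #include <string.h>
    #include "xxhash.h"

    static XXH64_hash_t hash_with_custom_secret(const void* data, size_t len)
    {
        static unsigned char secret[192];   /* >= XXH3_SECRET_SIZE_MIN (136 bytes) */
        const char* const seedMaterial = "any application-specific content";
        /* Derive a high-entropy secret from low-entropy user content... */
        if (XXH3_generateSecret(secret, sizeof(secret),
                                seedMaterial, strlen(seedMaterial)) != XXH_OK)
            return 0;
        /* ...then hash with it. */
        return XXH3_64bits_withSecret(data, len, secret, sizeof(secret));
    }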
1230 /*-**********************************************************************
1232 *-**********************************************************************
1268 * @brief Define this to disable 64-bit code.
1281 * is sub-optimal.
1288 * - `XXH_FORCE_MEMORY_ACCESS=0` (default): `memcpy`
1293 * - `XXH_FORCE_MEMORY_ACCESS=1`: `__attribute__((packed))`
1299 * - `XXH_FORCE_MEMORY_ACCESS=2`: Direct cast
1307 * - `XXH_FORCE_MEMORY_ACCESS=3`: Byteshift
1310 * inline small `memcpy()` calls, and it might also be faster on big-endian
1316 * Methods 1 and 2 rely on implementation-defined behavior. Use these with
1320 * See http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html for details.
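A build-configuration sketch; these tuning macros take effect where the implementation is compiled (for example together with XXH_INLINE_ALL, or as -D flags when building xxhash.c):

    /* Keep the portable default: all unaligned reads go through memcpy(). */
    #define XXH_FORCE_MEMORY_ACCESS 0
    #define XXH_INLINE_ALL              /* pull the implementation into this unit */
    #include "xxhash.h"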
1328 * @brief If defined to non-zero, adds a special path for aligned inputs (XXH32()
1335 * path" employing direct 32-bit/64-bit reads, resulting in _dramatically
1336 * faster_ read speed.
1356 * @brief When non-zero, sets all functions to `static`.
1371 * When not optimizing (-O0), optimizing for size (-Os, -Oz), or using
1372 * -fno-inline with GCC or Clang, this will automatically be defined.
1432 # if defined(__OPTIMIZE_SIZE__) /* -Os, -Oz */ \
1433 || defined(__NO_INLINE__) /* -O0, -fno-inline */
1529 # define XXH_STATIC_ASSERT_WITH_MESSAGE(c,m) do { struct xxh_sa { char x[(c) ? 1 : -1]; }; } whi…
1580 * @brief Reads an unaligned 32-bit integer from @p ptr in native endianness.
1585 * @return The 32-bit native endian integer from the bytes at @p ptr.
1591 * @brief Reads an unaligned 32-bit little endian integer from @p ptr.
1596 * @return The 32-bit little endian integer from the bytes at @p ptr.
1602 * @brief Reads an unaligned 32-bit big endian integer from @p ptr.
1607 * @return The 32-bit big endian integer from the bytes at @p ptr.
1624 * @return The 32-bit little endian integer from the bytes at @p ptr.
1654 return ((const xxh_unalign*)ptr)->u32; in XXH_read32()
1661 * see: http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html
1713 * Portable and well-defined behavior. in XXH_isLittleEndian()
1727 * Compiler-specific Functions and Macros
1740 * @brief 32-bit rotate left.
1742 * @param x The 32-bit integer to be rotated.
1759 # define XXH_rotl32(x,r) (((x) << (r)) | ((x) >> (32 - (r))))
1760 # define XXH_rotl64(x,r) (((x) << (r)) | ((x) >> (64 - (r))))
1766 * @brief A 32-bit byteswap.
1768 * @param x The 32-bit integer to byteswap.
1800 * XXH_FORCE_MEMORY_ACCESS==3 is an endian-independent byteshift load.
1855 * 32-bit hash functions
1904 * - There's a ridiculous amount of lag from pmulld (10 cycles of latency on in XXH32_round()
1910 * - Four instructions are required to rotate, in XXH32_round()
1919 * - Instruction level parallelism is actually more beneficial here because in XXH32_round()
1926 * same speed as scalar for XXH32. in XXH32_round()
1957 * @brief Processes the last 0-15 bytes of @p ptr.
1990 len -= 4; in XXH32_finalize()
1994 --len; in XXH32_finalize()
1998 switch(len&15) /* or switch(bEnd - p) */ { in XXH32_finalize()
2067 const xxh_u8* const limit = bEnd - 15; in XXH32_endian_align()
2071 xxh_u32 v4 = seed - XXH_PRIME32_1; in XXH32_endian_align()
2102 … if ((((size_t)input) & 3) == 0) { /* Input is 4-bytes aligned, leverage the speed benefit */ in XXH32()
2138 statePtr->v[0] = seed + XXH_PRIME32_1 + XXH_PRIME32_2; in XXH32_reset()
2139 statePtr->v[1] = seed + XXH_PRIME32_2; in XXH32_reset()
2140 statePtr->v[2] = seed + 0; in XXH32_reset()
2141 statePtr->v[3] = seed - XXH_PRIME32_1; in XXH32_reset()
2158 state->total_len_32 += (XXH32_hash_t)len; in XXH32_update()
2159 state->large_len |= (XXH32_hash_t)((len>=16) | (state->total_len_32>=16)); in XXH32_update()
2161 if (state->memsize + len < 16) { /* fill in tmp buffer */ in XXH32_update()
2162 XXH_memcpy((xxh_u8*)(state->mem32) + state->memsize, input, len); in XXH32_update()
2163 state->memsize += (XXH32_hash_t)len; in XXH32_update()
2167 if (state->memsize) { /* some data left from previous update */ in XXH32_update()
2168 XXH_memcpy((xxh_u8*)(state->mem32) + state->memsize, input, 16-state->memsize); in XXH32_update()
2169 { const xxh_u32* p32 = state->mem32; in XXH32_update()
2170 state->v[0] = XXH32_round(state->v[0], XXH_readLE32(p32)); p32++; in XXH32_update()
2171 state->v[1] = XXH32_round(state->v[1], XXH_readLE32(p32)); p32++; in XXH32_update()
2172 state->v[2] = XXH32_round(state->v[2], XXH_readLE32(p32)); p32++; in XXH32_update()
2173 state->v[3] = XXH32_round(state->v[3], XXH_readLE32(p32)); in XXH32_update()
2175 p += 16-state->memsize; in XXH32_update()
2176 state->memsize = 0; in XXH32_update()
2179 if (p <= bEnd-16) { in XXH32_update()
2180 const xxh_u8* const limit = bEnd - 16; in XXH32_update()
2183 state->v[0] = XXH32_round(state->v[0], XXH_readLE32(p)); p+=4; in XXH32_update()
2184 state->v[1] = XXH32_round(state->v[1], XXH_readLE32(p)); p+=4; in XXH32_update()
2185 state->v[2] = XXH32_round(state->v[2], XXH_readLE32(p)); p+=4; in XXH32_update()
2186 state->v[3] = XXH32_round(state->v[3], XXH_readLE32(p)); p+=4; in XXH32_update()
2192 XXH_memcpy(state->mem32, p, (size_t)(bEnd-p)); in XXH32_update()
2193 state->memsize = (unsigned)(bEnd-p); in XXH32_update()
2206 if (state->large_len) { in XXH32_digest()
2207 h32 = XXH_rotl32(state->v[0], 1) in XXH32_digest()
2208 + XXH_rotl32(state->v[1], 7) in XXH32_digest()
2209 + XXH_rotl32(state->v[2], 12) in XXH32_digest()
2210 + XXH_rotl32(state->v[3], 18); in XXH32_digest()
2212 h32 = state->v[2] /* == seed */ + XXH_PRIME32_5; in XXH32_digest()
2215 h32 += state->total_len_32; in XXH32_digest()
2217 return XXH32_finalize(h32, (const xxh_u8*)state->mem32, state->memsize, XXH_aligned); in XXH32_digest()
2229 * as human-readable numbers (large digits first).
2253 * 64-bit hash functions
2295 return ((const xxh_unalign64*)ptr)->u64; in XXH_read64()
2302 * see: http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html
2332 /* XXH_FORCE_MEMORY_ACCESS==3 is an endian-independent byteshift load. */
2444 len -= 8; in XXH64_finalize()
2450 len -= 4; in XXH64_finalize()
2455 --len; in XXH64_finalize()
2478 const xxh_u8* const limit = bEnd - 31; in XXH64_endian_align()
2482 xxh_u64 v4 = seed - XXH_PRIME64_1; in XXH64_endian_align()
2518 if ((((size_t)input) & 7)==0) { /* Input is aligned, let's leverage the speed advantage */ in XXH64()
2552 statePtr->v[0] = seed + XXH_PRIME64_1 + XXH_PRIME64_2; in XXH64_reset()
2553 statePtr->v[1] = seed + XXH_PRIME64_2; in XXH64_reset()
2554 statePtr->v[2] = seed + 0; in XXH64_reset()
2555 statePtr->v[3] = seed - XXH_PRIME64_1; in XXH64_reset()
2571 state->total_len += len; in XXH64_update()
2573 if (state->memsize + len < 32) { /* fill in tmp buffer */ in XXH64_update()
2574 XXH_memcpy(((xxh_u8*)state->mem64) + state->memsize, input, len); in XXH64_update()
2575 state->memsize += (xxh_u32)len; in XXH64_update()
2579 if (state->memsize) { /* tmp buffer is full */ in XXH64_update()
2580 XXH_memcpy(((xxh_u8*)state->mem64) + state->memsize, input, 32-state->memsize); in XXH64_update()
2581 state->v[0] = XXH64_round(state->v[0], XXH_readLE64(state->mem64+0)); in XXH64_update()
2582 state->v[1] = XXH64_round(state->v[1], XXH_readLE64(state->mem64+1)); in XXH64_update()
2583 state->v[2] = XXH64_round(state->v[2], XXH_readLE64(state->mem64+2)); in XXH64_update()
2584 state->v[3] = XXH64_round(state->v[3], XXH_readLE64(state->mem64+3)); in XXH64_update()
2585 p += 32 - state->memsize; in XXH64_update()
2586 state->memsize = 0; in XXH64_update()
2590 const xxh_u8* const limit = bEnd - 32; in XXH64_update()
2593 state->v[0] = XXH64_round(state->v[0], XXH_readLE64(p)); p+=8; in XXH64_update()
2594 state->v[1] = XXH64_round(state->v[1], XXH_readLE64(p)); p+=8; in XXH64_update()
2595 state->v[2] = XXH64_round(state->v[2], XXH_readLE64(p)); p+=8; in XXH64_update()
2596 state->v[3] = XXH64_round(state->v[3], XXH_readLE64(p)); p+=8; in XXH64_update()
2602 XXH_memcpy(state->mem64, p, (size_t)(bEnd-p)); in XXH64_update()
2603 state->memsize = (unsigned)(bEnd-p); in XXH64_update()
2616 if (state->total_len >= 32) { in XXH64_digest()
2617 …h64 = XXH_rotl64(state->v[0], 1) + XXH_rotl64(state->v[1], 7) + XXH_rotl64(state->v[2], 12) + XXH_… in XXH64_digest()
2618 h64 = XXH64_mergeRound(h64, state->v[0]); in XXH64_digest()
2619 h64 = XXH64_mergeRound(h64, state->v[1]); in XXH64_digest()
2620 h64 = XXH64_mergeRound(h64, state->v[2]); in XXH64_digest()
2621 h64 = XXH64_mergeRound(h64, state->v[3]); in XXH64_digest()
2623 h64 = state->v[2] /*seed*/ + XXH_PRIME64_5; in XXH64_digest()
2626 h64 += (xxh_u64) state->total_len; in XXH64_digest()
2628 return XXH64_finalize(h64, (const xxh_u8*)state->mem64, (size_t)state->total_len, XXH_aligned); in XXH64_digest()
2652 * New generation hash designed for speed on small keys and vectorization
2701 * One goal of XXH3 is to make it fast on both 32-bit and 64-bit, while
2702 * remaining a true 64-bit/128-bit hash function.
2704 * This is done by prioritizing a subset of 64-bit operations that can be
2705 * emulated without too many steps on the average 32-bit machine.
2707 * For example, these two lines seem similar, and run equally fast on 64-bit:
2713 * However, to a 32-bit machine, there is a major difference.
2717 * x.lo ^= (x.hi >> (47 - 32));
2722 * x.lo ^= (x.lo >> 13) | (x.hi << (32 - 13));
2727 * - All the bits we need are in the upper 32 bits, so we can ignore the lower
2729 * - The shift result will always fit in the lower 32 bits, and therefore,
2734 * - Usable unaligned access
2735 * - A 32-bit or 64-bit ALU
2736 * - If 32-bit, a decent ADC instruction
2737 * - A 32 or 64-bit multiply with a 64-bit result
2738 * - For the 128-bit variant, a decent byteswap helps short inputs.
2740 * The first two are already required by XXH32, and almost all 32-bit and 64-bit
2743 * Thumb-1, the classic 16-bit only subset of ARM's instruction set, is one
2746 * First of all, Thumb-1 lacks support for the UMULL instruction which
2755 * do a 32->64 multiply with UMULL, and the flexible operand allowing free
2760 * If compiling Thumb-1 for a target which supports ARM instructions, we will
2764 * to specify -march, as you likely meant to compile for a newer architecture.
2770 # warning "XXH3 is highly inefficient without ARM or Thumb-2."
2808 XXH_NEON = 4, /*!< NEON for most ARMv7-A and all AArch64 */
2809 XXH_VSX = 5, /*!< VSX and ZVector for POWER8/z13 (64-bit) */
2888 * GCC usually generates the best code with -O3 for xxHash.
2891 * in code roughly 3/4 the speed of Clang.
2898 * -O2 -mavx2 -march=haswell
2900 * -O2 -mavx2 -mno-avx256-split-unaligned-load
2904 * -O2, but the other one we can't control without "failed to inline always
2909 && defined(__OPTIMIZE__) && !defined(__OPTIMIZE_SIZE__) /* respect -O0 and -Os */
2911 # pragma GCC optimize("-O2")
2922 * To do the same operation, the 128-bit 'Q' register needs to be split into
2923 * two 64-bit 'D' registers, performing this operation::
2926 * | '---------. .--------' |
2928 * | .---------' '--------. |
2932 * completely different than the fastest method for ARMv7-A.
2934 * ARMv7-A treats D registers as unions overlaying Q registers, so modifying
2936 * will only affect bits 8-15 of AX on x86.
2941 * On ARMv7-A, this strangely modifies both parameters in place instead of
2942 * taking the usual 3-operand form.
2944 * Therefore, if we want to do this, we can simply use a D-form VZIP.32 on the
2946 * halves where we want - all in one instruction.
2961 * aarch64 cannot access the high bits of a Q-form register, and writes to a
2962 * D-form register zero the high bits, similar to how writes to W-form scalar
2984 * This is available on ARMv7-A, but is less efficient than a single VZIP.32.
2988 * Function-like macro:
3002 /* https://github.com/gcc-mirror/gcc/blob/38cf91e5/gcc/config/arm/arm.c#L22486 */ \
3003 … /* https://github.com/llvm-mirror/llvm/blob/2c4ca683/lib/Target/ARM/ARMAsmPrinter.cpp#L399 */ \
3024 * emulated 64-bit arithmetic is too slow.
3028 * For example, the Cortex-A73 can dispatch 3 micro-ops per cycle, but it can't
3029 * have more than 2 NEON (F0/F1) micro-ops. If you are only using NEON instructions,
3033 * can dispatch 8 micro-ops per cycle, but still only 2 NEON micro-ops at once.
3040 * This change benefits CPUs with large micro-op buffers without negatively affecting
3044 * |:----------------------|:--------------------|----------:|-----------:|------:|
3045 * | Snapdragon 730 (A76) | 2 NEON/8 micro-ops | 8.8 GB/s | 10.1 GB/s | ~16% |
3046 * | Snapdragon 835 (A73) | 2 NEON/3 micro-ops | 5.1 GB/s | 5.3 GB/s | ~5% |
3047 * | Marvell PXA1928 (A53) | In-order dual-issue | 1.9 GB/s | 1.9 GB/s | 0% |
3098 # warning "-maltivec=be is not recommended. Please use native endianness."
3173 # include <mmintrin.h> /* https://msdn.microsoft.com/fr-fr/library/84szxsww(v=vs.90).aspx */
3216 * @brief Calculates a 32-bit to 64-bit long multiply.
3225 * If you are compiling for platforms like Thumb-1 and don't have a better option,
3229 * @return 64-bit product of the low 32 bits of @p x and @p y.
3241 * GCC 4.2 (especially 32-bit ones), all without affecting newer compilers.
3244 * and perform a full 64x64 multiply -- entirely redundant on 32-bit.
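The portable form of this helper is just a widening multiply; a sketch with standard types:

    #include <stdint.h>

    /* Truncate both operands to 32 bits, then widen so the product of the two
     * low halves is computed in 64 bits. */
    static uint64_t mult32to64(uint64_t x, uint64_t y)
    {
        return (uint64_t)(uint32_t)x * (uint64_t)(uint32_t)y;
    }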
3250 * @brief Calculates a 64->128-bit long multiply.
3255 * @param lhs , rhs The 64-bit integers to be multiplied
3256 * @return The 128-bit result represented in an @ref XXH128_hash_t.
3264 * On most 64-bit targets, GCC and Clang define a __uint128_t type. in XXH_mult64to128()
3265 * This is usually the best way as it usually uses a native long 64-bit in XXH_mult64to128()
3270 * Despite being a 32-bit platform, Clang (and emscripten) define this type in XXH_mult64to128()
3272 * compiler builtin call which calculates a full 128-bit multiply. in XXH_mult64to128()
3274 * https://github.com/Cyan4973/xxHash/issues/211#issuecomment-515575677 in XXH_mult64to128()
3322 * Portable scalar method. Optimized for 32-bit and 64-bit ALUs. in XXH_mult64to128()
3324 * This is a fast and simple grade school multiply, which is shown below in XXH_mult64to128()
3329 * ---------- in XXH_mult64to128()
3334 * --------- in XXH_mult64to128()
3337 * --------- in XXH_mult64to128()
3347 * in 32-bit ARMv6 and later, which is shown below: in XXH_mult64to128()
3358 * comparable to some 64-bit ALUs. in XXH_mult64to128()
3361 * of 32-bit ADD/ADCs. in XXH_mult64to128()
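A portable sketch of that grade-school 64x64->128 multiply, built from four 32x32->64 partial products (standard types, no compiler intrinsics):

    #include <stdint.h>

    typedef struct { uint64_t low64; uint64_t high64; } u128;

    /* Split each operand into 32-bit halves, multiply the four pairs, then
     * recombine; the cross-term sum is arranged so it can never overflow. */
    static u128 mult64to128(uint64_t lhs, uint64_t rhs)
    {
        uint64_t const lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
        uint64_t const hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
        uint64_t const lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
        uint64_t const hi_hi = (lhs >> 32)        * (rhs >> 32);
        uint64_t const cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
        u128 r;
        r.high64 = (hi_lo >> 32) + (cross >> 32) + hi_hi;
        r.low64  = (cross << 32) | (lo_lo & 0xFFFFFFFF);
        return r;
    }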
3383 * @brief Calculates a 64-bit to 128-bit multiply, then XOR folds it.
3388 * @param lhs , rhs The 64-bit integers to multiply
3438 * sub-optimal on short lengths. It used an iterative algorithm which strongly
3445 * reduces latency, especially when emulating 64-bit multiplies on 32-bit.
3480 xxh_u8 const c3 = input[len - 1]; in XXH3_len_1to3_64b()
3497 xxh_u32 const input2 = XXH_readLE32(input + len - 4); in XXH3_len_4to8_64b()
3498 xxh_u64 const bitflip = (XXH_readLE64(secret+8) ^ XXH_readLE64(secret+16)) - seed; in XXH3_len_4to8_64b()
3512 xxh_u64 const bitflip2 = (XXH_readLE64(secret+40) ^ XXH_readLE64(secret+48)) - seed; in XXH3_len_9to16_64b()
3514 xxh_u64 const input_hi = XXH_readLE64(input + len - 8) ^ bitflip2; in XXH3_len_9to16_64b()
3534 * DISCLAIMER: There are known *seed-dependent* multicollisions here due to
3540 * unseeded non-cryptographic hashes, it does not attempt to defend itself
3552 * This is not too bad for a non-cryptographic hash function, especially with
3555 * The 128-bit variant (which trades some speed for strength) is NOT affected
3567 * GCC for x86 tends to autovectorize the 128-bit multiply, resulting in in XXH3_mix16B()
3573 * FIXME: Clang's output is still _much_ faster -- On an AMD Ryzen 3600, in XXH3_mix16B()
3586 input_hi ^ (XXH_readLE64(secret+8) - seed64) in XXH3_mix16B()
3591 /* For mid-range keys, XXH3 uses a Mum-hash variant. */
3605 acc += XXH3_mix16B(input+len-64, secret+112, seed); in XXH3_len_17to128_64b()
3608 acc += XXH3_mix16B(input+len-48, secret+80, seed); in XXH3_len_17to128_64b()
3611 acc += XXH3_mix16B(input+len-32, secret+48, seed); in XXH3_len_17to128_64b()
3614 acc += XXH3_mix16B(input+len-16, secret+16, seed); in XXH3_len_17to128_64b()
3646 * Clang for ARMv7-A tries to vectorize this loop, similar to GCC x86. in XXH3_len_129to240_64b()
3649 * For 64->128-bit multiplies, even if the NEON was 100% optimal, it in XXH3_len_129to240_64b()
3667 acc += XXH3_mix16B(input+(16*i), secret+(16*(i-8)) + XXH3_MIDSIZE_STARTOFFSET, seed); in XXH3_len_129to240_64b()
3670 …acc += XXH3_mix16B(input + len - 16, secret + XXH3_SECRET_SIZE_MIN - XXH3_MIDSIZE_LASTOFFSET, seed… in XXH3_len_129to240_64b()
3703 /* the following type must have a width of 64-bit */
3713 * This was chosen because it adapts quite well to 32-bit, 64-bit, and SIMD
3721 * On 128-bit inputs, we swap 64-bit pairs when we add the input to improve
3722 * cross-pollination, as otherwise the upper and lower halves would be
3725 * This doesn't matter on 64-bit hashes since they all get merged together in
3771 * // Multiplication mixes/scrambles bytes 0-7 of the 64-bit result to
3819 …st seed = _mm512_mask_set1_epi64(_mm512_set1_epi64((xxh_i64)seed64), 0xAA, (xxh_i64)(0U - seed64)); in XXH3_initCustomSecret_avx512()
3917 … __m256i const seed = _mm256_set_epi64x((xxh_i64)(0U - seed64), (xxh_i64)seed64, (xxh_i64)(0U - … in XXH3_initCustomSecret_avx2()
3925 * - do not extract the secret from sse registers in the internal loop in XXH3_initCustomSecret_avx2()
3926 * - use less common registers, and avoid pushing these reg into stack in XXH3_initCustomSecret_avx2()
3933 /* GCC -O2 needs the loop unrolled manually */ in XXH3_initCustomSecret_avx2()
3957 /* SSE2 is just a half-scale version of the AVX2 version. */ in XXH3_accumulate_512_sse2()
4024 XXH_ALIGN(16) const xxh_i64 seed64x2[2] = { (xxh_i64)seed64, (xxh_i64)(0U - seed64) }; in XXH3_initCustomSecret_sse2()
4027 __m128i const seed = _mm_set_epi64x((xxh_i64)(0U - seed64), (xxh_i64)seed64); in XXH3_initCustomSecret_sse2()
4036 * - do not extract the secret from sse registers in the internal loop in XXH3_initCustomSecret_sse2()
4037 * - use less common registers, and avoid pushing these reg into stack in XXH3_initCustomSecret_sse2()
4153 * However, unlike SSE, Clang lacks a 64-bit multiply routine in XXH3_scrambleAcc_neon()
4154 * for NEON, and it scalarizes two 64-bit multiplies instead. in XXH3_scrambleAcc_neon()
4246 /* scalar variants - universal */
4265 XXH_ASSERT(((size_t)acc & (XXH_ACC_ALIGN-1)) == 0); in XXH3_scalarRound()
4303 XXH_ASSERT((((size_t)acc) & (XXH_ACC_ALIGN-1)) == 0); in XXH3_scalarScrambleRound()
4333 * which requires a non-const pointer. in XXH3_initCustomSecret_scalar()
4345 * While MOVK is great for generating constants (2 cycles for a 64-bit in XXH3_initCustomSecret_scalar()
4390 xxh_u64 hi = XXH_readLE64(kSecretPtr + 16*i + 8) - seed64; in XXH3_initCustomSecret_scalar()
4483 size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE; in XXH3_hashLong_internal_loop()
4485 size_t const nb_blocks = (len - 1) / block_len; in XXH3_hashLong_internal_loop()
4493 f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN); in XXH3_hashLong_internal_loop()
4498 { size_t const nbStripes = ((len - 1) - (block_len * nb_blocks)) / XXH_STRIPE_LEN; in XXH3_hashLong_internal_loop()
4503 { const xxh_u8* const p = input + len - XXH_STRIPE_LEN; in XXH3_hashLong_internal_loop()
4505 f_acc512(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START); in XXH3_hashLong_internal_loop()
4531 * Prevent autovectorization on Clang ARMv7-a. Exact same problem as in XXH3_mergeAccs()
4646 * For now, it's a contract pre-condition. in XXH3_64bits_internal()
4698 * malloc typically guarantees 16 byte alignment on 64-bit systems and 8 byte
4699 * alignment on 32-bit. This isn't enough for the 32 byte aligned loads in AVX2
4700 * or on 32-bit, the 16 byte aligned loads in SSE2 and NEON.
4719 XXH_ASSERT((align & (align-1)) == 0); /* power of 2 */ in XXH_alignedMalloc()
4730 size_t offset = align - ((size_t)base & (align - 1)); /* base % align */ in XXH_alignedMalloc()
4731 /* Add the offset for the now-aligned pointer */ in XXH_alignedMalloc()
4737 ptr[-1] = (xxh_u8)offset; in XXH_alignedMalloc()
4752 xxh_u8 offset = ptr[-1]; in XXH_alignedFree()
4754 xxh_u8* base = ptr - offset; in XXH_alignedFree()
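A sketch of the over-allocate-and-store-offset technique described above; `align` must be a power of two no larger than 128 so the offset fits in one byte:

    #include <stdlib.h>

    static void* aligned_malloc(size_t size, size_t align)
    {
        unsigned char* const base = (unsigned char*)malloc(size + align);
        if (base == NULL) return NULL;
        {   /* Round up to the next aligned address (offset is 1..align). */
            size_t const offset = align - ((size_t)base & (align - 1));
            unsigned char* const ptr = base + offset;
            ptr[-1] = (unsigned char)offset;   /* remember how far we moved */
            return ptr;
        }
    }

    static void aligned_free(void* p)
    {
        if (p != NULL) {
            unsigned char* const ptr = (unsigned char*)p;
            free(ptr - ptr[-1]);               /* recover the original allocation */
        }
    }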
4787 size_t const initLength = offsetof(XXH3_state_t, nbStripesPerBlock) - initStart; in XXH3_reset_internal()
4792 statePtr->acc[0] = XXH_PRIME32_3; in XXH3_reset_internal()
4793 statePtr->acc[1] = XXH_PRIME64_1; in XXH3_reset_internal()
4794 statePtr->acc[2] = XXH_PRIME64_2; in XXH3_reset_internal()
4795 statePtr->acc[3] = XXH_PRIME64_3; in XXH3_reset_internal()
4796 statePtr->acc[4] = XXH_PRIME64_4; in XXH3_reset_internal()
4797 statePtr->acc[5] = XXH_PRIME32_2; in XXH3_reset_internal()
4798 statePtr->acc[6] = XXH_PRIME64_5; in XXH3_reset_internal()
4799 statePtr->acc[7] = XXH_PRIME32_1; in XXH3_reset_internal()
4800 statePtr->seed = seed; in XXH3_reset_internal()
4801 statePtr->useSeed = (seed != 0); in XXH3_reset_internal()
4802 statePtr->extSecret = (const unsigned char*)secret; in XXH3_reset_internal()
4804 statePtr->secretLimit = secretSize - XXH_STRIPE_LEN; in XXH3_reset_internal()
4805 statePtr->nbStripesPerBlock = statePtr->secretLimit / XXH_SECRET_CONSUME_RATE; in XXH3_reset_internal()
4834 if ((seed != statePtr->seed) || (statePtr->extSecret != NULL)) in XXH3_64bits_reset_withSeed()
4835 XXH3_initCustomSecret(statePtr->customSecret, seed); in XXH3_64bits_reset_withSeed()
4848 statePtr->useSeed = 1; /* always, even if seed64==0 */ in XXH3_64bits_reset_withSecretandSeed()
4865 if (nbStripesPerBlock - *nbStripesSoFarPtr <= nbStripes) { in XXH3_consumeStripes()
4867 size_t const nbStripesToEndofBlock = nbStripesPerBlock - *nbStripesSoFarPtr; in XXH3_consumeStripes()
4868 size_t const nbStripesAfterBlock = nbStripes - nbStripesToEndofBlock; in XXH3_consumeStripes()
4900 …const unsigned char* const secret = (state->extSecret == NULL) ? state->customSecret : state->extS… in XXH3_update()
4906 XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64 acc[8]; memcpy(acc, state->acc, sizeof(acc)); in XXH3_update()
4908 xxh_u64* XXH_RESTRICT const acc = state->acc; in XXH3_update()
4910 state->totalLen += len; in XXH3_update()
4911 XXH_ASSERT(state->bufferedSize <= XXH3_INTERNALBUFFER_SIZE); in XXH3_update()
4914 if (state->bufferedSize + len <= XXH3_INTERNALBUFFER_SIZE) { in XXH3_update()
4915 XXH_memcpy(state->buffer + state->bufferedSize, input, len); in XXH3_update()
4916 state->bufferedSize += (XXH32_hash_t)len; in XXH3_update()
4928 if (state->bufferedSize) { in XXH3_update()
4929 size_t const loadSize = XXH3_INTERNALBUFFER_SIZE - state->bufferedSize; in XXH3_update()
4930 XXH_memcpy(state->buffer + state->bufferedSize, input, loadSize); in XXH3_update()
4933 &state->nbStripesSoFar, state->nbStripesPerBlock, in XXH3_update()
4934 state->buffer, XXH3_INTERNALBUFFER_STRIPES, in XXH3_update()
4935 secret, state->secretLimit, in XXH3_update()
4937 state->bufferedSize = 0; in XXH3_update()
4942 if ((size_t)(bEnd - input) > state->nbStripesPerBlock * XXH_STRIPE_LEN) { in XXH3_update()
4943 size_t nbStripes = (size_t)(bEnd - 1 - input) / XXH_STRIPE_LEN; in XXH3_update()
4944 XXH_ASSERT(state->nbStripesPerBlock >= state->nbStripesSoFar); in XXH3_update()
4946 { size_t const nbStripesToEnd = state->nbStripesPerBlock - state->nbStripesSoFar; in XXH3_update()
4948 …XXH3_accumulate(acc, input, secret + state->nbStripesSoFar * XXH_SECRET_CONSUME_RATE, nbStripesToE… in XXH3_update()
4949 f_scramble(acc, secret + state->secretLimit); in XXH3_update()
4950 state->nbStripesSoFar = 0; in XXH3_update()
4952 nbStripes -= nbStripesToEnd; in XXH3_update()
4955 while(nbStripes >= state->nbStripesPerBlock) { in XXH3_update()
4956 XXH3_accumulate(acc, input, secret, state->nbStripesPerBlock, f_acc512); in XXH3_update()
4957 f_scramble(acc, secret + state->secretLimit); in XXH3_update()
4958 input += state->nbStripesPerBlock * XXH_STRIPE_LEN; in XXH3_update()
4959 nbStripes -= state->nbStripesPerBlock; in XXH3_update()
4965 state->nbStripesSoFar = nbStripes; in XXH3_update()
4967 …XXH_memcpy(state->buffer + sizeof(state->buffer) - XXH_STRIPE_LEN, input - XXH_STRIPE_LEN, XXH_STR… in XXH3_update()
4968 XXH_ASSERT(bEnd - input <= XXH_STRIPE_LEN); in XXH3_update()
4972 if (bEnd - input > XXH3_INTERNALBUFFER_SIZE) { in XXH3_update()
4973 const xxh_u8* const limit = bEnd - XXH3_INTERNALBUFFER_SIZE; in XXH3_update()
4976 &state->nbStripesSoFar, state->nbStripesPerBlock, in XXH3_update()
4978 secret, state->secretLimit, in XXH3_update()
4983 …XXH_memcpy(state->buffer + sizeof(state->buffer) - XXH_STRIPE_LEN, input - XXH_STRIPE_LEN, XXH_STR… in XXH3_update()
4989 XXH_ASSERT(bEnd - input <= XXH3_INTERNALBUFFER_SIZE); in XXH3_update()
4990 XXH_ASSERT(state->bufferedSize == 0); in XXH3_update()
4991 XXH_memcpy(state->buffer, input, (size_t)(bEnd-input)); in XXH3_update()
4992 state->bufferedSize = (XXH32_hash_t)(bEnd-input); in XXH3_update()
4995 memcpy(state->acc, acc, sizeof(acc)); in XXH3_update()
5020 XXH_memcpy(acc, state->acc, sizeof(state->acc)); in XXH3_digest_long()
5021 if (state->bufferedSize >= XXH_STRIPE_LEN) { in XXH3_digest_long()
5022 size_t const nbStripes = (state->bufferedSize - 1) / XXH_STRIPE_LEN; in XXH3_digest_long()
5023 size_t nbStripesSoFar = state->nbStripesSoFar; in XXH3_digest_long()
5025 &nbStripesSoFar, state->nbStripesPerBlock, in XXH3_digest_long()
5026 state->buffer, nbStripes, in XXH3_digest_long()
5027 secret, state->secretLimit, in XXH3_digest_long()
5031 state->buffer + state->bufferedSize - XXH_STRIPE_LEN, in XXH3_digest_long()
5032 secret + state->secretLimit - XXH_SECRET_LASTACC_START); in XXH3_digest_long()
5035 size_t const catchupSize = XXH_STRIPE_LEN - state->bufferedSize; in XXH3_digest_long()
5036 XXH_ASSERT(state->bufferedSize > 0); /* there is always some input buffered */ in XXH3_digest_long()
5037 XXH_memcpy(lastStripe, state->buffer + sizeof(state->buffer) - catchupSize, catchupSize); in XXH3_digest_long()
5038 XXH_memcpy(lastStripe + catchupSize, state->buffer, state->bufferedSize); in XXH3_digest_long()
5041 secret + state->secretLimit - XXH_SECRET_LASTACC_START); in XXH3_digest_long()
5048 …const unsigned char* const secret = (state->extSecret == NULL) ? state->customSecret : state->extS… in XXH3_64bits_digest()
5049 if (state->totalLen > XXH3_MIDSIZE_MAX) { in XXH3_64bits_digest()
5054 (xxh_u64)state->totalLen * XXH_PRIME64_1); in XXH3_64bits_digest()
5057 if (state->useSeed) in XXH3_64bits_digest()
5058 return XXH3_64bits_withSeed(state->buffer, (size_t)state->totalLen, state->seed); in XXH3_64bits_digest()
5059 return XXH3_64bits_withSecret(state->buffer, (size_t)(state->totalLen), in XXH3_64bits_digest()
5060 secret, state->secretLimit + XXH_STRIPE_LEN); in XXH3_64bits_digest()
5068 * XXH3's 128-bit variant has better mixing and strength than the 64-bit variant,
5071 * For example, extra steps are taken to avoid the seed-dependent collisions
5072 * in 17-240 byte inputs (See XXH3_mix16B and XXH128_mix32B).
5074 * This strength naturally comes at the cost of some speed, especially on short
5075 * lengths. Note that longer hashes are about as fast as the 64-bit version
5076 * due to it using only a slight modification of the 64-bit loop.
5078 * XXH128 is also more oriented towards 64-bit machines. It is still extremely
5079 * fast for a _128-bit_ hash on 32-bit (it usually clears XXH64).
5096 xxh_u8 const c3 = input[len - 1]; in XXH3_len_1to3_128b()
5101 xxh_u64 const bitfliph = (XXH_readLE32(secret+8) ^ XXH_readLE32(secret+12)) - seed; in XXH3_len_1to3_128b()
5119 xxh_u32 const input_hi = XXH_readLE32(input + len - 4); in XXH3_len_4to8_128b()
5144 { xxh_u64 const bitflipl = (XXH_readLE64(secret+32) ^ XXH_readLE64(secret+40)) - seed; in XXH3_len_9to16_128b()
5147 xxh_u64 input_hi = XXH_readLE64(input + len - 8); in XXH3_len_9to16_128b()
5153 m128.low64 += (xxh_u64)(len - 1) << 54; in XXH3_len_9to16_128b()
5160 * The best approach to this operation is different on 32-bit and 64-bit. in XXH3_len_9to16_128b()
5162 if (sizeof(void *) < sizeof(xxh_u64)) { /* 32-bit */ in XXH3_len_9to16_128b()
5164 * 32-bit optimized version, which is more readable. in XXH3_len_9to16_128b()
5166 * On 32-bit, it removes an ADC and delays a dependency between the two in XXH3_len_9to16_128b()
5167 * halves of m128.high64, but it generates an extra mask on 64-bit. in XXH3_len_9to16_128b()
5172 * 64-bit optimized (albeit more confusing) version. in XXH3_len_9to16_128b()
5182 * Inverse Property: x + y - x == y in XXH3_len_9to16_128b()
5183 * a + (b * (1 + c - 1)) in XXH3_len_9to16_128b()
5185 * a + (b * 1) + (b * (c - 1)) in XXH3_len_9to16_128b()
5187 * a + b + (b * (c - 1)) in XXH3_len_9to16_128b()
5190 * input_hi.hi + input_hi.lo + ((xxh_u64)input_hi.lo * (XXH_PRIME32_2 - 1)) in XXH3_len_9to16_128b()
5193 * input_hi + ((xxh_u64)input_hi.lo * (XXH_PRIME32_2 - 1)) in XXH3_len_9to16_128b()
5195 m128.high64 += input_hi + XXH_mult32to64((xxh_u32)input_hi, XXH_PRIME32_2 - 1); in XXH3_len_9to16_128b()
5258 acc = XXH128_mix32B(acc, input+48, input+len-64, secret+96, seed); in XXH3_len_17to128_128b()
5260 acc = XXH128_mix32B(acc, input+32, input+len-48, secret+64, seed); in XXH3_len_17to128_128b()
5262 acc = XXH128_mix32B(acc, input+16, input+len-32, secret+32, seed); in XXH3_len_17to128_128b()
5264 acc = XXH128_mix32B(acc, input, input+len-16, secret, seed); in XXH3_len_17to128_128b()
5269 + ((len - seed) * XXH_PRIME64_2); in XXH3_len_17to128_128b()
5271 h128.high64 = (XXH64_hash_t)0 - XXH3_avalanche(h128.high64); in XXH3_len_17to128_128b()
5304 secret + XXH3_MIDSIZE_STARTOFFSET + (32 * (i - 4)), in XXH3_len_129to240_128b()
5309 input + len - 16, in XXH3_len_129to240_128b()
5310 input + len - 32, in XXH3_len_129to240_128b()
5311 secret + XXH3_SECRET_SIZE_MIN - XXH3_MIDSIZE_LASTOFFSET - 16, in XXH3_len_129to240_128b()
5312 0ULL - seed); in XXH3_len_129to240_128b()
5318 + ((len - seed) * XXH_PRIME64_2); in XXH3_len_129to240_128b()
5320 h128.high64 = (XXH64_hash_t)0 - XXH3_avalanche(h128.high64); in XXH3_len_129to240_128b()
5345 - sizeof(acc) - XXH_SECRET_MERGEACCS_START, in XXH3_hashLong_128b_internal()
5420 * For now, it's a contract pre-condition. in XXH3_128bits_internal()
5478 /* === XXH3 128-bit streaming === */
5481 * All initialization and update functions are identical to the 64-bit streaming variant.
5524 …const unsigned char* const secret = (state->extSecret == NULL) ? state->customSecret : state->extS… in XXH3_128bits_digest()
5525 if (state->totalLen > XXH3_MIDSIZE_MAX) { in XXH3_128bits_digest()
5528 XXH_ASSERT(state->secretLimit + XXH_STRIPE_LEN >= sizeof(acc) + XXH_SECRET_MERGEACCS_START); in XXH3_128bits_digest()
5532 (xxh_u64)state->totalLen * XXH_PRIME64_1); in XXH3_128bits_digest()
5534 secret + state->secretLimit + XXH_STRIPE_LEN in XXH3_128bits_digest()
5535 - sizeof(acc) - XXH_SECRET_MERGEACCS_START, in XXH3_128bits_digest()
5536 ~((xxh_u64)state->totalLen * XXH_PRIME64_2)); in XXH3_128bits_digest()
5541 if (state->seed) in XXH3_128bits_digest()
5542 return XXH3_128bits_withSeed(state->buffer, (size_t)state->totalLen, state->seed); in XXH3_128bits_digest()
5543 return XXH3_128bits_withSecret(state->buffer, (size_t)(state->totalLen), in XXH3_128bits_digest()
5544 secret, state->secretLimit + XXH_STRIPE_LEN); in XXH3_128bits_digest()
5547 /* 128-bit utility functions */
5568 int const hcmp = (h1.high64 > h2.high64) - (h2.high64 > h1.high64); in XXH128_cmp()
5571 return (h1.low64 > h2.low64) - (h2.low64 > h1.low64); in XXH128_cmp()
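A sketch of the comparison helpers; XXH128_cmp takes void pointers so it can be passed straight to qsort():

    #include <stdlib.h>
    #include "xxhash.h"

    static int same_hash(XXH128_hash_t a, XXH128_hash_t b)
    {
        return XXH128_isEqual(a, b);   /* 1 if equal, 0 otherwise */
    }

    static void sort_hashes(XXH128_hash_t* table, size_t count)
    {
        /* Total ordering: high word first, then low word. */
        qsort(table, count, sizeof(*table), XXH128_cmp);
    }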
5595 h.low64 = XXH_readBE64(src->digest + 8); in XXH128_hashFromCanonical()
5636 /* Fill secretBuffer with a copy of customSeed - repeat as needed */ in XXH3_generateSecret()
5639 size_t const toCopy = XXH_MIN((secretSize - pos), customSeedSize); in XXH3_generateSecret()
5653 XXH3_combine16((char*)secretBuffer + secretSize - 16, XXH128_hashFromCanonical(&scrambler)); in XXH3_generateSecret()
5673 && defined(__OPTIMIZE__) && !defined(__OPTIMIZE_SIZE__) /* respect -O0 and -Os */