Lines Matching +full:speed +full:- +full:grade
2 * xxHash - Fast Hash algorithm
6 * - xxHash homepage: http://www.xxhash.com
7 * - xxHash source repository : https://github.com/Cyan4973/xxHash
9 * This source code is licensed under both the BSD-style license (found in the
12 * You may select, at your option, one of the above-listed licenses.
33 xxHash is an extremely fast hash algorithm, running at RAM speed limits.
38   Name            Speed        Q.Score   Author
49   MD5-32          0.33 GB/s     10       Ronald L. Rivest
50   SHA1-32         0.28 GB/s     10
57 Other speed-oriented implementations can be faster,
59 https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c349009…
61 A 64-bit version, named XXH64, is available since r35.
62 It offers much better speed, but for 64-bit applications only.
63   Name            Speed on 64 bits    Speed on 32 bits
79 * expressed as a compile-time constant:
81 * https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
335 /*-**********************************************************************
336 * 32-bit hash
340 * @brief An unsigned 32-bit integer.
360 # error "unsupported platform: need a 32-bit type"
370 * Contains functions used in the classic 32-bit xxHash algorithm.
373 * XXH32 is useful for older platforms, with no or poor 64-bit performance.
374 * Note that @ref xxh3_family provides competitive speed
375 * for both 32-bit and 64-bit systems, and offers true 64/128 bit hash results.
383 * @brief Calculates the 32-bit hash of @p input using xxHash32.
385 * Speed on Core 2 Duo @ 3 GHz (single thread, SMHasher benchmark): 5.4 GB/s
389 * @param seed The 32-bit seed to alter the hash's output predictably.
396 * @return The calculated 32-bit hash value.
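A minimal one-shot usage sketch, assuming the plain (un-namespaced) xxHash API and that "xxhash.h" is on the include path:

    #include <stdio.h>
    #include <string.h>
    #include "xxhash.h"

    int main(void)
    {
        const char* data = "hello world";
        /* One call hashes the whole buffer; the last argument is the seed. */
        XXH32_hash_t const h = XXH32(data, strlen(data), 0);
        printf("XXH32 = 0x%08x\n", (unsigned)h);
        return 0;
    }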
408 * This method is slower than single-call functions, due to state management.
421 * This function returns the nn-bit hash as an int or long long.
495 * @param seed The 32-bit seed to alter the hash result predictably.
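A sketch of the streaming (multi-update) pattern under the same assumptions, using the documented create/reset/update/digest/free set:

    #include <string.h>
    #include "xxhash.h"

    /* Feed input in chunks; the final digest equals the one-shot hash of the
     * concatenated input. */
    static XXH32_hash_t hash_two_chunks(const char* a, const char* b, XXH32_hash_t seed)
    {
        XXH32_state_t* const st = XXH32_createState();   /* heap-allocated state */
        XXH32_hash_t h = 0;
        if (st == NULL) return 0;                         /* allocation failure */
        if (XXH32_reset(st, seed) == XXH_OK) {
            XXH32_update(st, a, strlen(a));
            XXH32_update(st, b, strlen(b));
            h = XXH32_digest(st);                         /* state stays usable afterwards */
        }
        XXH32_freeState(st);
        return h;
    }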
545 * This is the simplest and fastest format for further post-processing.
550 * The canonical representation settles this issue by mandating big-endian
551 * convention, the same convention as human-readable numbers (large digits first).
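A short sketch of the canonical round-trip (big-endian byte order on every platform):

    #include "xxhash.h"

    /* Serialize a hash in canonical form, then read it back; the round-trip
     * returns the original value regardless of host endianness. */
    static XXH32_hash_t canonical_roundtrip(XXH32_hash_t h)
    {
        XXH32_canonical_t canon;                 /* 4 raw bytes, big-endian */
        XXH32_canonicalFromHash(&canon, h);      /* suitable for files or the network */
        return XXH32_hashFromCanonical(&canon);
    }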
598 /* C-language Attributes are added in C23. */
634 /*-**********************************************************************
635 * 64-bit hash
639 * @brief An unsigned 64-bit integer.
655 /* the following type must have a width of 64-bit */
666 * Contains functions used in the classic 64-bit xxHash algorithm.
669 * XXH3 provides competitive speed for both 32-bit and 64-bit systems,
671 * It provides better speed for systems with vector processing capabilities.
676 * @brief Calculates the 64-bit hash of @p input using xxHash64.
678 * This function usually runs faster on 64-bit systems, but slower on 32-bit
683 * @param seed The 64-bit seed to alter the hash's output predictably.
690 * @return The calculated 64-bit hash.
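The calling convention mirrors XXH32(), with a 64-bit seed and result; a brief sketch:

    #include <string.h>
    #include "xxhash.h"

    static XXH64_hash_t hash64(const void* data, size_t len)
    {
        return XXH64(data, len, 0 /* seed */);
    }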
698 /* Begin FreeBSD - This symbol is needed by dll-linked CLI zstd(1). */
732 * - Improved speed for both small and large inputs
733 * - True 64-bit and 128-bit outputs
734 * - SIMD acceleration
735 * - Improved 32-bit viability
737 * Speed analysis methodology is explained here:
739 * https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html
745 * XXH3's speed benefits greatly from SIMD and 64-bit arithmetic,
747 * Any 32-bit and 64-bit targets that can run XXH32 smoothly
763 * reduces the amount of mixing, resulting in faster speed on small inputs.
766 * The API supports one-shot hashing, streaming mode, and custom secrets.
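A sketch of the XXH3 64-bit one-shot calls (default secret, with and without a seed):

    #include "xxhash.h"

    static void xxh3_oneshot(const void* data, size_t len)
    {
        XXH64_hash_t const h0 = XXH3_64bits(data, len);               /* seedless: fastest path */
        XXH64_hash_t const h1 = XXH3_64bits_withSeed(data, len, 42);  /* seeded variant */
        (void)h0; (void)h1;
    }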
769 /*-**********************************************************************
770 * XXH3 64-bit variant
774 * default 64-bit variant, using default secret and default seed of 0.
820 * As a consequence, streaming is slower than one-shot hashing.
821 * For better performance, prefer one-shot functions whenever applicable.
849 * Similar to the one-shot API, `secretSize` must be >= `XXH3_SECRET_SIZE_MIN`,
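A streaming sketch for the 64-bit variant; the 128-bit and secret-based resets follow the same shape:

    #include <string.h>
    #include "xxhash.h"

    static XXH64_hash_t xxh3_stream(const char* a, const char* b)
    {
        XXH64_hash_t h = 0;
        XXH3_state_t* const st = XXH3_createState();
        if (st == NULL) return 0;
        if (XXH3_64bits_reset(st) == XXH_OK) {   /* default secret, seed 0 */
            XXH3_64bits_update(st, a, strlen(a));
            XXH3_64bits_update(st, b, strlen(b));
            h = XXH3_64bits_digest(st);
        }
        XXH3_freeState(st);
        return h;
    }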
864 /*-**********************************************************************
865 * XXH3 128-bit variant
869 * @brief The return value from 128-bit hashes.
887 * As a consequence, streaming is slower than one-shot hashing.
888 * For better performance, prefer one-shot functions whenever applicable.
893 * All reset and streaming functions have the same meaning as their 64-bit counterparts.
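A sketch of the 128-bit one-shot call; the result is returned as two 64-bit halves:

    #include <stdio.h>
    #include "xxhash.h"

    static void print_xxh128(const void* data, size_t len)
    {
        XXH128_hash_t const h = XXH3_128bits(data, len);
        /* Print the high half first, matching the canonical (big-endian) ordering. */
        printf("%016llx%016llx\n",
               (unsigned long long)h.high64, (unsigned long long)h.low64);
    }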
979 #ifndef XXH_NO_LONG_LONG /* defined when there is no 64-bit support */
994 XXH64_hash_t total_len; /*!< Total length hashed. This is always 64-bit. */
1078 /*!< Reserved field. Needed for padding on 64-bit. */
1082 /*!< Total length hashed. 64-bit even on 32-bit targets. */
1100 * @brief Initializes a stack-allocated `XXH3_state_s`.
1110 #define XXH3_INITSTATE(XXH3_state_ptr) { (XXH3_state_ptr)->seed = 0; }
1114 * simple alias to pre-selected XXH3_128bits variant
1125 * Derive a high-entropy secret from any user-defined content, named customSeed.
1127 * The `_withSecret()` variants are useful to provide a higher level of protection than a 64-bit seed,
1131 * and derives from it a high-entropy secret of length @secretSize
1140 * _and_ feature very high entropy (consist of random-looking bytes).
1174 * This generally benefits speed, compared to `_withSeed()` or `_withSecret()`.
1183 * hence offering only a pure speed benefit on "large" input,
1186 * Another usage scenario is to hash the secret to a 64-bit hash value,
1189 * On top of speed, an added benefit is that each bit in the secret
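A sketch of the custom-secret workflow, assuming the XXH3_generateSecret() signature described above (secretBuffer, secretSize, customSeed, customSeedSize); in recent releases this function is exposed behind XXH_STATIC_LINKING_ONLY:

    #define XXH_STATIC_LINKING_ONLY
    #include <string.h>
    #include "xxhash.h"

    static XXH64_hash_t hash_with_custom_secret(const void* data, size_t len)
    {
        static unsigned char secret[192];   /* >= XXH3_SECRET_SIZE_MIN (136 bytes) */
        const char* const seedMaterial = "any application-specific content";
        /* Derive a high-entropy secret from low-entropy user content... */
        if (XXH3_generateSecret(secret, sizeof(secret),
                                seedMaterial, strlen(seedMaterial)) != XXH_OK)
            return 0;
        /* ...then hash with it. */
        return XXH3_64bits_withSecret(data, len, secret, sizeof(secret));
    }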
1230 /*-**********************************************************************
1232 *-**********************************************************************
1268 * @brief Define this to disable 64-bit code.
1281 * is sub-optimal.
1288 * - `XXH_FORCE_MEMORY_ACCESS=0` (default): `memcpy`
1293 * - `XXH_FORCE_MEMORY_ACCESS=1`: `__attribute__((packed))`
1299 * - `XXH_FORCE_MEMORY_ACCESS=2`: Direct cast
1307 * - `XXH_FORCE_MEMORY_ACCESS=3`: Byteshift
1310 * inline small `memcpy()` calls, and it might also be faster on big-endian
1316 * Methods 1 and 2 rely on implementation-defined behavior. Use these with
1320 * See http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html for details.
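A build-configuration sketch; these tuning macros take effect where the implementation is compiled (for example together with XXH_INLINE_ALL, or as -D flags when building xxhash.c):

    /* Keep the portable default: all unaligned reads go through memcpy(). */
    #define XXH_FORCE_MEMORY_ACCESS 0
    #define XXH_INLINE_ALL              /* pull the implementation into this unit */
    #include "xxhash.h"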
1328 * @brief If defined to non-zero, adds a special path for aligned inputs (XXH32()
1335 * path" employing direct 32-bit/64-bit reads, resulting in _dramatically
1336 * faster_ read speed.
1356 * @brief When non-zero, sets all functions to `static`.
1371 * When not optimizing (-O0), optimizing for size (-Os, -Oz), or using
1372 * -fno-inline with GCC or Clang, this will automatically be defined.
1432 # if defined(__OPTIMIZE_SIZE__) /* -Os, -Oz */ \
1433 || defined(__NO_INLINE__) /* -O0, -fno-inline */
1529 # define XXH_STATIC_ASSERT_WITH_MESSAGE(c,m) do { struct xxh_sa { char x[(c) ? 1 : -1]; }; } whi…
1580 * @brief Reads an unaligned 32-bit integer from @p ptr in native endianness.
1585 * @return The 32-bit native endian integer from the bytes at @p ptr.
1591 * @brief Reads an unaligned 32-bit little endian integer from @p ptr.
1596 * @return The 32-bit little endian integer from the bytes at @p ptr.
1602 * @brief Reads an unaligned 32-bit big endian integer from @p ptr.
1607 * @return The 32-bit big endian integer from the bytes at @p ptr.
1624 * @return The 32-bit little endian integer from the bytes at @p ptr.
1654 return ((const xxh_unalign*)ptr)->u32; in XXH_read32()
1661 * see: http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html
1713 * Portable and well-defined behavior. in XXH_isLittleEndian()
1727 * Compiler-specific Functions and Macros
1740 * @brief 32-bit rotate left.
1742 * @param x The 32-bit integer to be rotated.
1759 # define XXH_rotl32(x,r) (((x) << (r)) | ((x) >> (32 - (r))))
1760 # define XXH_rotl64(x,r) (((x) << (r)) | ((x) >> (64 - (r))))
1766 * @brief A 32-bit byteswap.
1768 * @param x The 32-bit integer to byteswap.
1800 * XXH_FORCE_MEMORY_ACCESS==3 is an endian-independent byteshift load.
1855 * 32-bit hash functions
1904 * - There's a ridiculous amount of lag from pmulld (10 cycles of latency on in XXH32_round()
1910 * - Four instructions are required to rotate, in XXH32_round()
1919 * - Instruction level parallelism is actually more beneficial here because in XXH32_round()
1926 * same speed as scalar for XXH32. in XXH32_round()
1957 * @brief Processes the last 0-15 bytes of @p ptr.
1990 len -= 4; in XXH32_finalize()
1994 --len; in XXH32_finalize()
1998 switch(len&15) /* or switch(bEnd - p) */ { in XXH32_finalize()
2067 const xxh_u8* const limit = bEnd - 15; in XXH32_endian_align()
2071 xxh_u32 v4 = seed - XXH_PRIME32_1; in XXH32_endian_align()
2102 … if ((((size_t)input) & 3) == 0) { /* Input is 4-bytes aligned, leverage the speed benefit */ in XXH32()
2138 statePtr->v[0] = seed + XXH_PRIME32_1 + XXH_PRIME32_2; in XXH32_reset()
2139 statePtr->v[1] = seed + XXH_PRIME32_2; in XXH32_reset()
2140 statePtr->v[2] = seed + 0; in XXH32_reset()
2141 statePtr->v[3] = seed - XXH_PRIME32_1; in XXH32_reset()
2158 state->total_len_32 += (XXH32_hash_t)len; in XXH32_update()
2159 state->large_len |= (XXH32_hash_t)((len>=16) | (state->total_len_32>=16)); in XXH32_update()
2161 if (state->memsize + len < 16) { /* fill in tmp buffer */ in XXH32_update()
2162 XXH_memcpy((xxh_u8*)(state->mem32) + state->memsize, input, len); in XXH32_update()
2163 state->memsize += (XXH32_hash_t)len; in XXH32_update()
2167 if (state->memsize) { /* some data left from previous update */ in XXH32_update()
2168 XXH_memcpy((xxh_u8*)(state->mem32) + state->memsize, input, 16-state->memsize); in XXH32_update()
2169 { const xxh_u32* p32 = state->mem32; in XXH32_update()
2170 state->v[0] = XXH32_round(state->v[0], XXH_readLE32(p32)); p32++; in XXH32_update()
2171 state->v[1] = XXH32_round(state->v[1], XXH_readLE32(p32)); p32++; in XXH32_update()
2172 state->v[2] = XXH32_round(state->v[2], XXH_readLE32(p32)); p32++; in XXH32_update()
2173 state->v[3] = XXH32_round(state->v[3], XXH_readLE32(p32)); in XXH32_update()
2175 p += 16-state->memsize; in XXH32_update()
2176 state->memsize = 0; in XXH32_update()
2179 if (p <= bEnd-16) { in XXH32_update()
2180 const xxh_u8* const limit = bEnd - 16; in XXH32_update()
2183 state->v[0] = XXH32_round(state->v[0], XXH_readLE32(p)); p+=4; in XXH32_update()
2184 state->v[1] = XXH32_round(state->v[1], XXH_readLE32(p)); p+=4; in XXH32_update()
2185 state->v[2] = XXH32_round(state->v[2], XXH_readLE32(p)); p+=4; in XXH32_update()
2186 state->v[3] = XXH32_round(state->v[3], XXH_readLE32(p)); p+=4; in XXH32_update()
2192 XXH_memcpy(state->mem32, p, (size_t)(bEnd-p)); in XXH32_update()
2193 state->memsize = (unsigned)(bEnd-p); in XXH32_update()
2206 if (state->large_len) { in XXH32_digest()
2207 h32 = XXH_rotl32(state->v[0], 1) in XXH32_digest()
2208 + XXH_rotl32(state->v[1], 7) in XXH32_digest()
2209 + XXH_rotl32(state->v[2], 12) in XXH32_digest()
2210 + XXH_rotl32(state->v[3], 18); in XXH32_digest()
2212 h32 = state->v[2] /* == seed */ + XXH_PRIME32_5; in XXH32_digest()
2215 h32 += state->total_len_32; in XXH32_digest()
2217 return XXH32_finalize(h32, (const xxh_u8*)state->mem32, state->memsize, XXH_aligned); in XXH32_digest()
2229 * as human-readable numbers (large digits first).
2253 * 64-bit hash functions
2295 return ((const xxh_unalign64*)ptr)->u64; in XXH_read64()
2302 * see: http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html
2332 /* XXH_FORCE_MEMORY_ACCESS==3 is an endian-independent byteshift load. */
2444 len -= 8; in XXH64_finalize()
2450 len -= 4; in XXH64_finalize()
2455 --len; in XXH64_finalize()
2478 const xxh_u8* const limit = bEnd - 31; in XXH64_endian_align()
2482 xxh_u64 v4 = seed - XXH_PRIME64_1; in XXH64_endian_align()
2518 if ((((size_t)input) & 7)==0) { /* Input is aligned, let's leverage the speed advantage */ in XXH64()
2552 statePtr->v[0] = seed + XXH_PRIME64_1 + XXH_PRIME64_2; in XXH64_reset()
2553 statePtr->v[1] = seed + XXH_PRIME64_2; in XXH64_reset()
2554 statePtr->v[2] = seed + 0; in XXH64_reset()
2555 statePtr->v[3] = seed - XXH_PRIME64_1; in XXH64_reset()
2571 state->total_len += len; in XXH64_update()
2573 if (state->memsize + len < 32) { /* fill in tmp buffer */ in XXH64_update()
2574 XXH_memcpy(((xxh_u8*)state->mem64) + state->memsize, input, len); in XXH64_update()
2575 state->memsize += (xxh_u32)len; in XXH64_update()
2579 if (state->memsize) { /* tmp buffer is full */ in XXH64_update()
2580 XXH_memcpy(((xxh_u8*)state->mem64) + state->memsize, input, 32-state->memsize); in XXH64_update()
2581 state->v[0] = XXH64_round(state->v[0], XXH_readLE64(state->mem64+0)); in XXH64_update()
2582 state->v[1] = XXH64_round(state->v[1], XXH_readLE64(state->mem64+1)); in XXH64_update()
2583 state->v[2] = XXH64_round(state->v[2], XXH_readLE64(state->mem64+2)); in XXH64_update()
2584 state->v[3] = XXH64_round(state->v[3], XXH_readLE64(state->mem64+3)); in XXH64_update()
2585 p += 32 - state->memsize; in XXH64_update()
2586 state->memsize = 0; in XXH64_update()
2590 const xxh_u8* const limit = bEnd - 32; in XXH64_update()
2593 state->v[0] = XXH64_round(state->v[0], XXH_readLE64(p)); p+=8; in XXH64_update()
2594 state->v[1] = XXH64_round(state->v[1], XXH_readLE64(p)); p+=8; in XXH64_update()
2595 state->v[2] = XXH64_round(state->v[2], XXH_readLE64(p)); p+=8; in XXH64_update()
2596 state->v[3] = XXH64_round(state->v[3], XXH_readLE64(p)); p+=8; in XXH64_update()
2602 XXH_memcpy(state->mem64, p, (size_t)(bEnd-p)); in XXH64_update()
2603 state->memsize = (unsigned)(bEnd-p); in XXH64_update()
2616 if (state->total_len >= 32) { in XXH64_digest()
2617 …h64 = XXH_rotl64(state->v[0], 1) + XXH_rotl64(state->v[1], 7) + XXH_rotl64(state->v[2], 12) + XXH_… in XXH64_digest()
2618 h64 = XXH64_mergeRound(h64, state->v[0]); in XXH64_digest()
2619 h64 = XXH64_mergeRound(h64, state->v[1]); in XXH64_digest()
2620 h64 = XXH64_mergeRound(h64, state->v[2]); in XXH64_digest()
2621 h64 = XXH64_mergeRound(h64, state->v[3]); in XXH64_digest()
2623 h64 = state->v[2] /*seed*/ + XXH_PRIME64_5; in XXH64_digest()
2626 h64 += (xxh_u64) state->total_len; in XXH64_digest()
2628 return XXH64_finalize(h64, (const xxh_u8*)state->mem64, (size_t)state->total_len, XXH_aligned); in XXH64_digest()
2652 * New generation hash designed for speed on small keys and vectorization
2701 * One goal of XXH3 is to make it fast on both 32-bit and 64-bit, while
2702 * remaining a true 64-bit/128-bit hash function.
2704 * This is done by prioritizing a subset of 64-bit operations that can be
2705 * emulated without too many steps on the average 32-bit machine.
2707 * For example, these two lines seem similar, and run equally fast on 64-bit:
2713 * However, to a 32-bit machine, there is a major difference.
2717 * x.lo ^= (x.hi >> (47 - 32));
2722 * x.lo ^= (x.lo >> 13) | (x.hi << (32 - 13));
2727 * - All the bits we need are in the upper 32 bits, so we can ignore the lower
2729 * - The shift result will always fit in the lower 32 bits, and therefore,
2734 * - Usable unaligned access
2735 * - A 32-bit or 64-bit ALU
2736 * - If 32-bit, a decent ADC instruction
2737 * - A 32 or 64-bit multiply with a 64-bit result
2738 * - For the 128-bit variant, a decent byteswap helps short inputs.
2740 * The first two are already required by XXH32, and almost all 32-bit and 64-bit
2743 * Thumb-1, the classic 16-bit only subset of ARM's instruction set, is one
2746 * First of all, Thumb-1 lacks support for the UMULL instruction which
2755 * do a 32->64 multiply with UMULL, and the flexible operand allowing free
2760 * If compiling Thumb-1 for a target which supports ARM instructions, we will
2764 * to specify -march, as you likely meant to compile for a newer architecture.
2770 # warning "XXH3 is highly inefficient without ARM or Thumb-2."
2808 XXH_NEON = 4, /*!< NEON for most ARMv7-A and all AArch64 */
2809 XXH_VSX = 5, /*!< VSX and ZVector for POWER8/z13 (64-bit) */
2888 * GCC usually generates the best code with -O3 for xxHash.
2891 * in code roughly 3/4 the speed of Clang.
2898 * -O2 -mavx2 -march=haswell
2900 * -O2 -mavx2 -mno-avx256-split-unaligned-load
2904 * -O2, but the other one we can't control without "failed to inline always
2909 && defined(__OPTIMIZE__) && !defined(__OPTIMIZE_SIZE__) /* respect -O0 and -Os */
2911 # pragma GCC optimize("-O2")
2922 * To do the same operation, the 128-bit 'Q' register needs to be split into
2923 * two 64-bit 'D' registers, performing this operation::
2926 * | '---------. .--------' |
2928 * | .---------' '--------. |
2932 * completely different than the fastest method for ARMv7-A.
2934 * ARMv7-A treats D registers as unions overlaying Q registers, so modifying
2936 * will only affect bits 8-15 of AX on x86.
2941 * On ARMv7-A, this strangely modifies both parameters in place instead of
2942 * taking the usual 3-operand form.
2944 * Therefore, if we want to do this, we can simply use a D-form VZIP.32 on the
2946 * halves where we want - all in one instruction.
2961 * aarch64 cannot access the high bits of a Q-form register, and writes to a
2962 * D-form register zero the high bits, similar to how writes to W-form scalar
2984 * This is available on ARMv7-A, but is less efficient than a single VZIP.32.
2988 * Function-like macro:
3002 /* https://github.com/gcc-mirror/gcc/blob/38cf91e5/gcc/config/arm/arm.c#L22486 */ \
3003 … /* https://github.com/llvm-mirror/llvm/blob/2c4ca683/lib/Target/ARM/ARMAsmPrinter.cpp#L399 */ \
3024 * emulated 64-bit arithmetic is too slow.
3028 * For example, the Cortex-A73 can dispatch 3 micro-ops per cycle, but it can't
3029 * have more than 2 NEON (F0/F1) micro-ops. If you are only using NEON instructions,
3033 * can dispatch 8 micro-ops per cycle, but still only 2 NEON micro-ops at once.
3040 * This change benefits CPUs with large micro-op buffers without negatively affecting
3044 * |:----------------------|:--------------------|----------:|-----------:|------:|
3045 * | Snapdragon 730 (A76) | 2 NEON/8 micro-ops | 8.8 GB/s | 10.1 GB/s | ~16% |
3046 * | Snapdragon 835 (A73) | 2 NEON/3 micro-ops | 5.1 GB/s | 5.3 GB/s | ~5% |
3047 * | Marvell PXA1928 (A53) | In-order dual-issue | 1.9 GB/s | 1.9 GB/s | 0% |
3098 # warning "-maltivec=be is not recommended. Please use native endianness."
3173 # include <mmintrin.h> /* https://msdn.microsoft.com/fr-fr/library/84szxsww(v=vs.90).aspx */
3216 * @brief Calculates a 32-bit to 64-bit long multiply.
3225 * If you are compiling for platforms like Thumb-1 and don't have a better option,
3229 * @return 64-bit product of the low 32 bits of @p x and @p y.
3241 * GCC 4.2 (especially 32-bit ones), all without affecting newer compilers.
3244 * and perform a full 64x64 multiply -- entirely redundant on 32-bit.
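The portable form of this helper is just a widening multiply; a sketch with standard types:

    #include <stdint.h>

    /* Truncate both operands to 32 bits, then widen so the product of the two
     * low halves is computed in 64 bits. */
    static uint64_t mult32to64(uint64_t x, uint64_t y)
    {
        return (uint64_t)(uint32_t)x * (uint64_t)(uint32_t)y;
    }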
3250 * @brief Calculates a 64->128-bit long multiply.
3255 * @param lhs , rhs The 64-bit integers to be multiplied
3256 * @return The 128-bit result represented in an @ref XXH128_hash_t.
3264 * On most 64-bit targets, GCC and Clang define a __uint128_t type. in XXH_mult64to128()
3265 * This is usually the best way as it usually uses a native long 64-bit in XXH_mult64to128()
3270 * Despite being a 32-bit platform, Clang (and emscripten) define this type in XXH_mult64to128()
3272 * compiler builtin call which calculates a full 128-bit multiply. in XXH_mult64to128()
3274 * https://github.com/Cyan4973/xxHash/issues/211#issuecomment-515575677 in XXH_mult64to128()
3322 * Portable scalar method. Optimized for 32-bit and 64-bit ALUs. in XXH_mult64to128()
3324 * This is a fast and simple grade school multiply, which is shown below in XXH_mult64to128()
3329 * ---------- in XXH_mult64to128()
3334 * --------- in XXH_mult64to128()
3337 * --------- in XXH_mult64to128()
3347 * in 32-bit ARMv6 and later, which is shown below: in XXH_mult64to128()
3358 * comparable to some 64-bit ALUs. in XXH_mult64to128()
3361 * of 32-bit ADD/ADCs. in XXH_mult64to128()
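A portable sketch of that grade-school 64x64->128 multiply, built from four 32x32->64 partial products (standard types, no compiler intrinsics):

    #include <stdint.h>

    typedef struct { uint64_t low64; uint64_t high64; } u128;

    /* Split each operand into 32-bit halves, multiply the four pairs, then
     * recombine; the cross-term sum is arranged so it can never overflow. */
    static u128 mult64to128(uint64_t lhs, uint64_t rhs)
    {
        uint64_t const lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
        uint64_t const hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
        uint64_t const lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
        uint64_t const hi_hi = (lhs >> 32)        * (rhs >> 32);
        uint64_t const cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
        u128 r;
        r.high64 = (hi_lo >> 32) + (cross >> 32) + hi_hi;
        r.low64  = (cross << 32) | (lo_lo & 0xFFFFFFFF);
        return r;
    }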
3383 * @brief Calculates a 64-bit to 128-bit multiply, then XOR folds it.
3388 * @param lhs , rhs The 64-bit integers to multiply
3438 * sub-optimal on short lengths. It used an iterative algorithm which strongly
3445 * reduces latency, especially when emulating 64-bit multiplies on 32-bit.
3480 xxh_u8 const c3 = input[len - 1]; in XXH3_len_1to3_64b()
3497 xxh_u32 const input2 = XXH_readLE32(input + len - 4); in XXH3_len_4to8_64b()
3498 xxh_u64 const bitflip = (XXH_readLE64(secret+8) ^ XXH_readLE64(secret+16)) - seed; in XXH3_len_4to8_64b()
3512 xxh_u64 const bitflip2 = (XXH_readLE64(secret+40) ^ XXH_readLE64(secret+48)) - seed; in XXH3_len_9to16_64b()
3514 xxh_u64 const input_hi = XXH_readLE64(input + len - 8) ^ bitflip2; in XXH3_len_9to16_64b()
3534 * DISCLAIMER: There are known *seed-dependent* multicollisions here due to
3540 * unseeded non-cryptographic hashes, it does not attempt to defend itself
3552 * This is not too bad for a non-cryptographic hash function, especially with
3555 * The 128-bit variant (which trades some speed for strength) is NOT affected
3567 * GCC for x86 tends to autovectorize the 128-bit multiply, resulting in in XXH3_mix16B()
3573 * FIXME: Clang's output is still _much_ faster -- On an AMD Ryzen 3600, in XXH3_mix16B()
3586 input_hi ^ (XXH_readLE64(secret+8) - seed64) in XXH3_mix16B()
3591 /* For mid-range keys, XXH3 uses a Mum-hash variant. */
3605 acc += XXH3_mix16B(input+len-64, secret+112, seed); in XXH3_len_17to128_64b()
3608 acc += XXH3_mix16B(input+len-48, secret+80, seed); in XXH3_len_17to128_64b()
3611 acc += XXH3_mix16B(input+len-32, secret+48, seed); in XXH3_len_17to128_64b()
3614 acc += XXH3_mix16B(input+len-16, secret+16, seed); in XXH3_len_17to128_64b()
3646 * Clang for ARMv7-A tries to vectorize this loop, similar to GCC x86. in XXH3_len_129to240_64b()
3649 * For 64->128-bit multiplies, even if the NEON was 100% optimal, it in XXH3_len_129to240_64b()
3667 acc += XXH3_mix16B(input+(16*i), secret+(16*(i-8)) + XXH3_MIDSIZE_STARTOFFSET, seed); in XXH3_len_129to240_64b()
3670 …acc += XXH3_mix16B(input + len - 16, secret + XXH3_SECRET_SIZE_MIN - XXH3_MIDSIZE_LASTOFFSET, seed… in XXH3_len_129to240_64b()
3703 /* the following type must have a width of 64-bit */
3713 * This was chosen because it adapts quite well to 32-bit, 64-bit, and SIMD
3721 * On 128-bit inputs, we swap 64-bit pairs when we add the input to improve
3722 * cross-pollination, as otherwise the upper and lower halves would be
3725 * This doesn't matter on 64-bit hashes since they all get merged together in
3771 * // Multiplication mixes/scrambles bytes 0-7 of the 64-bit result to
3819 …st seed = _mm512_mask_set1_epi64(_mm512_set1_epi64((xxh_i64)seed64), 0xAA, (xxh_i64)(0U - seed64)); in XXH3_initCustomSecret_avx512()
3917 … __m256i const seed = _mm256_set_epi64x((xxh_i64)(0U - seed64), (xxh_i64)seed64, (xxh_i64)(0U - … in XXH3_initCustomSecret_avx2()
3925 * - do not extract the secret from sse registers in the internal loop in XXH3_initCustomSecret_avx2()
3926 * - use less common registers, and avoid pushing these reg into stack in XXH3_initCustomSecret_avx2()
3933 /* GCC -O2 needs the loop unrolled manually */ in XXH3_initCustomSecret_avx2()
3957 /* SSE2 is just a half-scale version of the AVX2 version. */ in XXH3_accumulate_512_sse2()
4024 XXH_ALIGN(16) const xxh_i64 seed64x2[2] = { (xxh_i64)seed64, (xxh_i64)(0U - seed64) }; in XXH3_initCustomSecret_sse2()
4027 __m128i const seed = _mm_set_epi64x((xxh_i64)(0U - seed64), (xxh_i64)seed64); in XXH3_initCustomSecret_sse2()
4036 * - do not extract the secret from sse registers in the internal loop in XXH3_initCustomSecret_sse2()
4037 * - use less common registers, and avoid pushing these reg into stack in XXH3_initCustomSecret_sse2()
4153 * However, unlike SSE, Clang lacks a 64-bit multiply routine in XXH3_scrambleAcc_neon()
4154 * for NEON, and it scalarizes two 64-bit multiplies instead. in XXH3_scrambleAcc_neon()
4246 /* scalar variants - universal */
4265 XXH_ASSERT(((size_t)acc & (XXH_ACC_ALIGN-1)) == 0); in XXH3_scalarRound()
4303 XXH_ASSERT((((size_t)acc) & (XXH_ACC_ALIGN-1)) == 0); in XXH3_scalarScrambleRound()
4333 * which requires a non-const pointer. in XXH3_initCustomSecret_scalar()
4345 * While MOVK is great for generating constants (2 cycles for a 64-bit in XXH3_initCustomSecret_scalar()
4390 xxh_u64 hi = XXH_readLE64(kSecretPtr + 16*i + 8) - seed64; in XXH3_initCustomSecret_scalar()
4483 size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE; in XXH3_hashLong_internal_loop()
4485 size_t const nb_blocks = (len - 1) / block_len; in XXH3_hashLong_internal_loop()
4493 f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN); in XXH3_hashLong_internal_loop()
4498 { size_t const nbStripes = ((len - 1) - (block_len * nb_blocks)) / XXH_STRIPE_LEN; in XXH3_hashLong_internal_loop()
4503 { const xxh_u8* const p = input + len - XXH_STRIPE_LEN; in XXH3_hashLong_internal_loop()
4505 f_acc512(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START); in XXH3_hashLong_internal_loop()
4531 * Prevent autovectorization on Clang ARMv7-a. Exact same problem as in XXH3_mergeAccs()
4646 * For now, it's a contract pre-condition. in XXH3_64bits_internal()
4698 * malloc typically guarantees 16 byte alignment on 64-bit systems and 8 byte
4699 * alignment on 32-bit. This isn't enough for the 32 byte aligned loads in AVX2
4700 * or on 32-bit, the 16 byte aligned loads in SSE2 and NEON.
4719 XXH_ASSERT((align & (align-1)) == 0); /* power of 2 */ in XXH_alignedMalloc()
4730 size_t offset = align - ((size_t)base & (align - 1)); /* base % align */ in XXH_alignedMalloc()
4731 /* Add the offset for the now-aligned pointer */ in XXH_alignedMalloc()
4737 ptr[-1] = (xxh_u8)offset; in XXH_alignedMalloc()
4752 xxh_u8 offset = ptr[-1]; in XXH_alignedFree()
4754 xxh_u8* base = ptr - offset; in XXH_alignedFree()
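A sketch of the over-allocate-and-store-offset technique described above; `align` must be a power of two no larger than 128 so the offset fits in one byte:

    #include <stdlib.h>

    static void* aligned_malloc(size_t size, size_t align)
    {
        unsigned char* const base = (unsigned char*)malloc(size + align);
        if (base == NULL) return NULL;
        {   /* Round up to the next aligned address (offset is 1..align). */
            size_t const offset = align - ((size_t)base & (align - 1));
            unsigned char* const ptr = base + offset;
            ptr[-1] = (unsigned char)offset;   /* remember how far we moved */
            return ptr;
        }
    }

    static void aligned_free(void* p)
    {
        if (p != NULL) {
            unsigned char* const ptr = (unsigned char*)p;
            free(ptr - ptr[-1]);               /* recover the original allocation */
        }
    }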
4787 size_t const initLength = offsetof(XXH3_state_t, nbStripesPerBlock) - initStart; in XXH3_reset_internal()
4792 statePtr->acc[0] = XXH_PRIME32_3; in XXH3_reset_internal()
4793 statePtr->acc[1] = XXH_PRIME64_1; in XXH3_reset_internal()
4794 statePtr->acc[2] = XXH_PRIME64_2; in XXH3_reset_internal()
4795 statePtr->acc[3] = XXH_PRIME64_3; in XXH3_reset_internal()
4796 statePtr->acc[4] = XXH_PRIME64_4; in XXH3_reset_internal()
4797 statePtr->acc[5] = XXH_PRIME32_2; in XXH3_reset_internal()
4798 statePtr->acc[6] = XXH_PRIME64_5; in XXH3_reset_internal()
4799 statePtr->acc[7] = XXH_PRIME32_1; in XXH3_reset_internal()
4800 statePtr->seed = seed; in XXH3_reset_internal()
4801 statePtr->useSeed = (seed != 0); in XXH3_reset_internal()
4802 statePtr->extSecret = (const unsigned char*)secret; in XXH3_reset_internal()
4804 statePtr->secretLimit = secretSize - XXH_STRIPE_LEN; in XXH3_reset_internal()
4805 statePtr->nbStripesPerBlock = statePtr->secretLimit / XXH_SECRET_CONSUME_RATE; in XXH3_reset_internal()
4834 if ((seed != statePtr->seed) || (statePtr->extSecret != NULL)) in XXH3_64bits_reset_withSeed()
4835 XXH3_initCustomSecret(statePtr->customSecret, seed); in XXH3_64bits_reset_withSeed()
4848 statePtr->useSeed = 1; /* always, even if seed64==0 */ in XXH3_64bits_reset_withSecretandSeed()
4865 if (nbStripesPerBlock - *nbStripesSoFarPtr <= nbStripes) { in XXH3_consumeStripes()
4867 size_t const nbStripesToEndofBlock = nbStripesPerBlock - *nbStripesSoFarPtr; in XXH3_consumeStripes()
4868 size_t const nbStripesAfterBlock = nbStripes - nbStripesToEndofBlock; in XXH3_consumeStripes()
4900 …const unsigned char* const secret = (state->extSecret == NULL) ? state->customSecret : state->extS… in XXH3_update()
4906 XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64 acc[8]; memcpy(acc, state->acc, sizeof(acc)); in XXH3_update()
4908 xxh_u64* XXH_RESTRICT const acc = state->acc; in XXH3_update()
4910 state->totalLen += len; in XXH3_update()
4911 XXH_ASSERT(state->bufferedSize <= XXH3_INTERNALBUFFER_SIZE); in XXH3_update()
4914 if (state->bufferedSize + len <= XXH3_INTERNALBUFFER_SIZE) { in XXH3_update()
4915 XXH_memcpy(state->buffer + state->bufferedSize, input, len); in XXH3_update()
4916 state->bufferedSize += (XXH32_hash_t)len; in XXH3_update()
4928 if (state->bufferedSize) { in XXH3_update()
4929 size_t const loadSize = XXH3_INTERNALBUFFER_SIZE - state->bufferedSize; in XXH3_update()
4930 XXH_memcpy(state->buffer + state->bufferedSize, input, loadSize); in XXH3_update()
4933 &state->nbStripesSoFar, state->nbStripesPerBlock, in XXH3_update()
4934 state->buffer, XXH3_INTERNALBUFFER_STRIPES, in XXH3_update()
4935 secret, state->secretLimit, in XXH3_update()
4937 state->bufferedSize = 0; in XXH3_update()
4942 if ((size_t)(bEnd - input) > state->nbStripesPerBlock * XXH_STRIPE_LEN) { in XXH3_update()
4943 size_t nbStripes = (size_t)(bEnd - 1 - input) / XXH_STRIPE_LEN; in XXH3_update()
4944 XXH_ASSERT(state->nbStripesPerBlock >= state->nbStripesSoFar); in XXH3_update()
4946 { size_t const nbStripesToEnd = state->nbStripesPerBlock - state->nbStripesSoFar; in XXH3_update()
4948 …XXH3_accumulate(acc, input, secret + state->nbStripesSoFar * XXH_SECRET_CONSUME_RATE, nbStripesToE… in XXH3_update()
4949 f_scramble(acc, secret + state->secretLimit); in XXH3_update()
4950 state->nbStripesSoFar = 0; in XXH3_update()
4952 nbStripes -= nbStripesToEnd; in XXH3_update()
4955 while(nbStripes >= state->nbStripesPerBlock) { in XXH3_update()
4956 XXH3_accumulate(acc, input, secret, state->nbStripesPerBlock, f_acc512); in XXH3_update()
4957 f_scramble(acc, secret + state->secretLimit); in XXH3_update()
4958 input += state->nbStripesPerBlock * XXH_STRIPE_LEN; in XXH3_update()
4959 nbStripes -= state->nbStripesPerBlock; in XXH3_update()
4965 state->nbStripesSoFar = nbStripes; in XXH3_update()
4967 …XXH_memcpy(state->buffer + sizeof(state->buffer) - XXH_STRIPE_LEN, input - XXH_STRIPE_LEN, XXH_STR… in XXH3_update()
4968 XXH_ASSERT(bEnd - input <= XXH_STRIPE_LEN); in XXH3_update()
4972 if (bEnd - input > XXH3_INTERNALBUFFER_SIZE) { in XXH3_update()
4973 const xxh_u8* const limit = bEnd - XXH3_INTERNALBUFFER_SIZE; in XXH3_update()
4976 &state->nbStripesSoFar, state->nbStripesPerBlock, in XXH3_update()
4978 secret, state->secretLimit, in XXH3_update()
4983 …XXH_memcpy(state->buffer + sizeof(state->buffer) - XXH_STRIPE_LEN, input - XXH_STRIPE_LEN, XXH_STR… in XXH3_update()
4989 XXH_ASSERT(bEnd - input <= XXH3_INTERNALBUFFER_SIZE); in XXH3_update()
4990 XXH_ASSERT(state->bufferedSize == 0); in XXH3_update()
4991 XXH_memcpy(state->buffer, input, (size_t)(bEnd-input)); in XXH3_update()
4992 state->bufferedSize = (XXH32_hash_t)(bEnd-input); in XXH3_update()
4995 memcpy(state->acc, acc, sizeof(acc)); in XXH3_update()
5020 XXH_memcpy(acc, state->acc, sizeof(state->acc)); in XXH3_digest_long()
5021 if (state->bufferedSize >= XXH_STRIPE_LEN) { in XXH3_digest_long()
5022 size_t const nbStripes = (state->bufferedSize - 1) / XXH_STRIPE_LEN; in XXH3_digest_long()
5023 size_t nbStripesSoFar = state->nbStripesSoFar; in XXH3_digest_long()
5025 &nbStripesSoFar, state->nbStripesPerBlock, in XXH3_digest_long()
5026 state->buffer, nbStripes, in XXH3_digest_long()
5027 secret, state->secretLimit, in XXH3_digest_long()
5031 state->buffer + state->bufferedSize - XXH_STRIPE_LEN, in XXH3_digest_long()
5032 secret + state->secretLimit - XXH_SECRET_LASTACC_START); in XXH3_digest_long()
5035 size_t const catchupSize = XXH_STRIPE_LEN - state->bufferedSize; in XXH3_digest_long()
5036 XXH_ASSERT(state->bufferedSize > 0); /* there is always some input buffered */ in XXH3_digest_long()
5037 XXH_memcpy(lastStripe, state->buffer + sizeof(state->buffer) - catchupSize, catchupSize); in XXH3_digest_long()
5038 XXH_memcpy(lastStripe + catchupSize, state->buffer, state->bufferedSize); in XXH3_digest_long()
5041 secret + state->secretLimit - XXH_SECRET_LASTACC_START); in XXH3_digest_long()
5048 …const unsigned char* const secret = (state->extSecret == NULL) ? state->customSecret : state->extS… in XXH3_64bits_digest()
5049 if (state->totalLen > XXH3_MIDSIZE_MAX) { in XXH3_64bits_digest()
5054 (xxh_u64)state->totalLen * XXH_PRIME64_1); in XXH3_64bits_digest()
5057 if (state->useSeed) in XXH3_64bits_digest()
5058 return XXH3_64bits_withSeed(state->buffer, (size_t)state->totalLen, state->seed); in XXH3_64bits_digest()
5059 return XXH3_64bits_withSecret(state->buffer, (size_t)(state->totalLen), in XXH3_64bits_digest()
5060 secret, state->secretLimit + XXH_STRIPE_LEN); in XXH3_64bits_digest()
5068 * XXH3's 128-bit variant has better mixing and strength than the 64-bit variant,
5071 * For example, extra steps are taken to avoid the seed-dependent collisions
5072 * in 17-240 byte inputs (See XXH3_mix16B and XXH128_mix32B).
5074 * This strength naturally comes at the cost of some speed, especially on short
5075 * lengths. Note that longer hashes are about as fast as the 64-bit version
5076 * due to it using only a slight modification of the 64-bit loop.
5078 * XXH128 is also more oriented towards 64-bit machines. It is still extremely
5079 * fast for a _128-bit_ hash on 32-bit (it usually clears XXH64).
5096 xxh_u8 const c3 = input[len - 1]; in XXH3_len_1to3_128b()
5101 xxh_u64 const bitfliph = (XXH_readLE32(secret+8) ^ XXH_readLE32(secret+12)) - seed; in XXH3_len_1to3_128b()
5119 xxh_u32 const input_hi = XXH_readLE32(input + len - 4); in XXH3_len_4to8_128b()
5144 { xxh_u64 const bitflipl = (XXH_readLE64(secret+32) ^ XXH_readLE64(secret+40)) - seed; in XXH3_len_9to16_128b()
5147 xxh_u64 input_hi = XXH_readLE64(input + len - 8); in XXH3_len_9to16_128b()
5153 m128.low64 += (xxh_u64)(len - 1) << 54; in XXH3_len_9to16_128b()
5160 * The best approach to this operation is different on 32-bit and 64-bit. in XXH3_len_9to16_128b()
5162 if (sizeof(void *) < sizeof(xxh_u64)) { /* 32-bit */ in XXH3_len_9to16_128b()
5164 * 32-bit optimized version, which is more readable. in XXH3_len_9to16_128b()
5166 * On 32-bit, it removes an ADC and delays a dependency between the two in XXH3_len_9to16_128b()
5167 * halves of m128.high64, but it generates an extra mask on 64-bit. in XXH3_len_9to16_128b()
5172 * 64-bit optimized (albeit more confusing) version. in XXH3_len_9to16_128b()
5182 * Inverse Property: x + y - x == y in XXH3_len_9to16_128b()
5183 * a + (b * (1 + c - 1)) in XXH3_len_9to16_128b()
5185 * a + (b * 1) + (b * (c - 1)) in XXH3_len_9to16_128b()
5187 * a + b + (b * (c - 1)) in XXH3_len_9to16_128b()
5190 * input_hi.hi + input_hi.lo + ((xxh_u64)input_hi.lo * (XXH_PRIME32_2 - 1)) in XXH3_len_9to16_128b()
5193 * input_hi + ((xxh_u64)input_hi.lo * (XXH_PRIME32_2 - 1)) in XXH3_len_9to16_128b()
5195 m128.high64 += input_hi + XXH_mult32to64((xxh_u32)input_hi, XXH_PRIME32_2 - 1); in XXH3_len_9to16_128b()
5258 acc = XXH128_mix32B(acc, input+48, input+len-64, secret+96, seed); in XXH3_len_17to128_128b()
5260 acc = XXH128_mix32B(acc, input+32, input+len-48, secret+64, seed); in XXH3_len_17to128_128b()
5262 acc = XXH128_mix32B(acc, input+16, input+len-32, secret+32, seed); in XXH3_len_17to128_128b()
5264 acc = XXH128_mix32B(acc, input, input+len-16, secret, seed); in XXH3_len_17to128_128b()
5269 + ((len - seed) * XXH_PRIME64_2); in XXH3_len_17to128_128b()
5271 h128.high64 = (XXH64_hash_t)0 - XXH3_avalanche(h128.high64); in XXH3_len_17to128_128b()
5304 secret + XXH3_MIDSIZE_STARTOFFSET + (32 * (i - 4)), in XXH3_len_129to240_128b()
5309 input + len - 16, in XXH3_len_129to240_128b()
5310 input + len - 32, in XXH3_len_129to240_128b()
5311 secret + XXH3_SECRET_SIZE_MIN - XXH3_MIDSIZE_LASTOFFSET - 16, in XXH3_len_129to240_128b()
5312 0ULL - seed); in XXH3_len_129to240_128b()
5318 + ((len - seed) * XXH_PRIME64_2); in XXH3_len_129to240_128b()
5320 h128.high64 = (XXH64_hash_t)0 - XXH3_avalanche(h128.high64); in XXH3_len_129to240_128b()
5345 - sizeof(acc) - XXH_SECRET_MERGEACCS_START, in XXH3_hashLong_128b_internal()
5420 * For now, it's a contract pre-condition. in XXH3_128bits_internal()
5478 /* === XXH3 128-bit streaming === */
5481 * All initialization and update functions are identical to the 64-bit streaming variant.
5524 …const unsigned char* const secret = (state->extSecret == NULL) ? state->customSecret : state->extS… in XXH3_128bits_digest()
5525 if (state->totalLen > XXH3_MIDSIZE_MAX) { in XXH3_128bits_digest()
5528 XXH_ASSERT(state->secretLimit + XXH_STRIPE_LEN >= sizeof(acc) + XXH_SECRET_MERGEACCS_START); in XXH3_128bits_digest()
5532 (xxh_u64)state->totalLen * XXH_PRIME64_1); in XXH3_128bits_digest()
5534 secret + state->secretLimit + XXH_STRIPE_LEN in XXH3_128bits_digest()
5535 - sizeof(acc) - XXH_SECRET_MERGEACCS_START, in XXH3_128bits_digest()
5536 ~((xxh_u64)state->totalLen * XXH_PRIME64_2)); in XXH3_128bits_digest()
5541 if (state->seed) in XXH3_128bits_digest()
5542 return XXH3_128bits_withSeed(state->buffer, (size_t)state->totalLen, state->seed); in XXH3_128bits_digest()
5543 return XXH3_128bits_withSecret(state->buffer, (size_t)(state->totalLen), in XXH3_128bits_digest()
5544 secret, state->secretLimit + XXH_STRIPE_LEN); in XXH3_128bits_digest()
5547 /* 128-bit utility functions */
5568 int const hcmp = (h1.high64 > h2.high64) - (h2.high64 > h1.high64); in XXH128_cmp()
5571 return (h1.low64 > h2.low64) - (h2.low64 > h1.low64); in XXH128_cmp()
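A sketch of the comparison helpers; XXH128_cmp takes void pointers so it can be passed straight to qsort():

    #include <stdlib.h>
    #include "xxhash.h"

    static int same_hash(XXH128_hash_t a, XXH128_hash_t b)
    {
        return XXH128_isEqual(a, b);   /* 1 if equal, 0 otherwise */
    }

    static void sort_hashes(XXH128_hash_t* table, size_t count)
    {
        /* Total ordering: high word first, then low word. */
        qsort(table, count, sizeof(*table), XXH128_cmp);
    }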
5595 h.low64 = XXH_readBE64(src->digest + 8); in XXH128_hashFromCanonical()
5636 /* Fill secretBuffer with a copy of customSeed - repeat as needed */ in XXH3_generateSecret()
5639 size_t const toCopy = XXH_MIN((secretSize - pos), customSeedSize); in XXH3_generateSecret()
5653 XXH3_combine16((char*)secretBuffer + secretSize - 16, XXH128_hashFromCanonical(&scrambler)); in XXH3_generateSecret()
5673 && defined(__OPTIMIZE__) && !defined(__OPTIMIZE_SIZE__) /* respect -O0 and -Os */