Revision tags: v6.3

# e1f2750e | 20-Apr-2023 | Linus Torvalds <torvalds@linux-foundation.org>
x86: remove 'zerorest' argument from __copy_user_nocache()
Every caller passes in zero, meaning they don't want any partial copy to zero the remainder of the destination buffer.
Which is just as well, because the implementation of that function didn't actually even look at that argument, and wasn't even aware it existed, although some misleading comments did mention it still.
The 'zerorest' thing is a historical artifact of how "copy_from_user()" worked, in that it would zero the rest of the kernel buffer that it copied into.
That zeroing still exists, but it's long since been moved to generic code, and the raw architecture-specific code doesn't do it. See _copy_from_user() in lib/usercopy.c for this all.
However, while __copy_user_nocache() shares some history and superficial other similarities with copy_from_user(), it is in many ways also very different.
In particular, while the code makes it *look* similar to the generic user copy functions that can copy both to and from user space, and take faults on both reads and writes as a result, __copy_user_nocache() does no such thing at all.
__copy_user_nocache() always copies to kernel space, and will never take a page fault on the destination. What *can* happen, though, is that the non-temporal stores take a machine check because one of the use cases is for writing to stable memory, and any memory errors would then take synchronous faults.
So __copy_user_nocache() does look a lot like copy_from_user(), but has faulting behavior that is more akin to our old copy_in_user() (which no longer exists, but copied from user space to user space and could fault on both source and destination).
And it very much does not have the "zero the end of the destination buffer" semantics, since a problem with the destination buffer is very possibly the very source of the partial copy.
So this whole thing was just a confusing historical artifact from having shared some code with a completely different function with completely different use cases.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
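
[Editor's sketch: the interface change amounts to dropping the dead parameter. Prototypes paraphrased; only the 'zerorest' removal is from the commit itself.]

    /* before: 'zerorest' was accepted but never read by the implementation */
    long __copy_user_nocache(void *dst, const void __user *src,
                             unsigned size, int zerorest);

    /* after: callers just pass dst/src/size */
    long __copy_user_nocache(void *dst, const void __user *src,
                             unsigned size);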

# 427fda2c | 17-Apr-2023 | Linus Torvalds <torvalds@linux-foundation.org>
x86: improve on the non-rep 'copy_user' function
The old 'copy_user_generic_unrolled' function was oddly implemented for largely historical reasons: it had been largely based on the uncached copy case, which has some other concerns.
For example, the __copy_user_nocache() function uses 'movnti' for the destination stores, and those want the destination to be aligned. In contrast, the regular copy function doesn't really care, and trying to align things only complicates matters.
Also, like the clear_user function, the copy function had some odd handling of the repeat counts, complicating the exception handling for no really good reason. So as with clear_user, just write it to keep all the byte counts in the %rcx register, exactly like the 'rep movs' functionality that this replaces.
Unlike a real 'rep movs', we do allow for this to trash a few temporary registers to not have to unnecessarily save/restore registers on the stack.
And like the clearing case, rename this to what it now clearly is: 'rep_movs_alternative', and make it one coherent function, so that it shows up as such in profiles (instead of the odd split between "copy_user_generic_unrolled" and "copy_user_short_string", the latter of which was not about strings at all, and which was shared with the uncached case).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
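
[Editor's sketch of the register convention the rewrite settles on. Grossly simplified: a byte-at-a-time loop with none of the unrolling or exception handling of the real rep_movs_alternative; the function name is hypothetical.]

    /*
     * Calling convention, matching 'rep movs':
     *   %rdi = destination, %rsi = source, %rcx = byte count.
     * On return, %rcx holds the number of bytes NOT copied (0 on success).
     */
    SYM_FUNC_START(movs_fallback_sketch)
    .Lcopy_byte:
    	testq %rcx, %rcx
    	jz .Lcopy_done
    	movb (%rsi), %al	/* copy one byte... */
    	movb %al, (%rdi)
    	incq %rsi		/* ...advance both pointers... */
    	incq %rdi
    	decq %rcx		/* ...and count down in %rcx, like 'rep movs' */
    	jmp .Lcopy_byte
    .Lcopy_done:
    	RET
    SYM_FUNC_END(movs_fallback_sketch)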

Revision tags: v6.3-rc7

# 8c9b6a88 | 16-Apr-2023 | Linus Torvalds <torvalds@linux-foundation.org>
x86: improve on the non-rep 'clear_user' function
The old version was oddly written to have the repeat count in multiple registers. So instead of taking advantage of %rax being zero, it had some sub-counts in it. All just for a "single word clearing" loop, which isn't even efficient to begin with.
So get rid of those games, and just keep all the state in the same registers we got it in (and that we should return things in). That not only makes this act much more like 'rep stos' (which this function is replacing), but makes it much easier to actually do the obvious loop unrolling.
Also rename the function from the now nonsensical 'clear_user_original' to what it now clearly is: 'rep_stos_alternative'.
End result: if we don't have a fast 'rep stosb', at least we can have a fast fallback for it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
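
[Editor's sketch: schematically, the state now stays where 'rep stos' wants it. Again simplified, without the real unrolled quadword loops or fault handling; the function name is hypothetical.]

    /*
     * Calling convention, matching 'rep stos':
     *   %rdi = destination, %rcx = byte count, %rax = 0 (the fill value).
     * On return, %rcx holds the number of bytes NOT cleared.
     */
    SYM_FUNC_START(stos_fallback_sketch)
    .Lclear_byte:
    	testq %rcx, %rcx
    	jz .Lclear_done
    	movb %al, (%rdi)	/* store a zero byte from %al */
    	incq %rdi
    	decq %rcx		/* count stays in %rcx throughout */
    	jmp .Lclear_byte
    .Lclear_done:
    	RET
    SYM_FUNC_END(stos_fallback_sketch)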

# 577e6a7f | 16-Apr-2023 | Linus Torvalds <torvalds@linux-foundation.org>
x86: inline the 'rep movs' in user copies for the FSRM case
This does the same thing for the user copies as commit 0db7058e8e23 ("x86/clear_user: Make it faster") did for clear_user(). In other words, it inlines the "rep movs" case when X86_FEATURE_FSRM is set, avoiding the function call entirely.
In order to do that, it makes the calling convention for the out-of-line case ("copy_user_generic_unrolled") match the 'rep movs' calling convention, although it does also end up clobbering a number of additional registers.
Also, to simplify code sharing in the low-level assembly with the __copy_user_nocache() function (that uses the normal C calling convention), we end up with a kind of mixed return value for the low-level asm code: it will return the result in both %rcx (to work as an alternative for the 'rep movs' case), _and_ in %rax (for the nocache case).
We could avoid this by wrapping __copy_user_nocache() callers in an inline asm, but since the cost is just an extra register copy, it's probably not worth it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
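
[Editor's sketch of the resulting inline call site, paraphrased from the commit; the exact constraint and clobber lists may differ.]

    static __always_inline __must_check unsigned long
    copy_user_generic(void *to, const void *from, unsigned long len)
    {
    	stac();
    	/*
    	 * With X86_FEATURE_FSRM, 'rep movsb' runs inline; otherwise the
    	 * out-of-line fallback is called with the same %rdi/%rsi/%rcx
    	 * convention, returning the uncopied count in %rcx.
    	 */
    	asm volatile(
    		"1:\n\t"
    		ALTERNATIVE("rep movsb",
    			    "call copy_user_generic_unrolled",
    			    ALT_NOT(X86_FEATURE_FSRM))
    		"2:\n"
    		_ASM_EXTABLE_UA(1b, 2b)
    		: "+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
    		: : "memory", "rax", "rdx", "r8");
    	clac();
    	return len;	/* bytes NOT copied */
    }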

# 3639a535 | 15-Apr-2023 | Linus Torvalds <torvalds@linux-foundation.org>
x86: move stac/clac from user copy routines into callers
This is preparatory work for inlining the 'rep movs' case, but also a cleanup. The __copy_user_nocache() function was mis-used by the rdma code to do uncached kernel copies that don't actually want user copies at all, and as a result don't want the stac/clac either.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
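
[Editor's sketch of the resulting split. Illustrative only; every name other than __copy_user_nocache(), stac() and clac() is made up.]

    /* user-source copy: the caller now brackets the raw copy with SMAP */
    static inline unsigned long
    user_nocache_copy(void *dst, const void __user *src, unsigned len)
    {
    	unsigned long ret;

    	stac();		/* open the user-access window */
    	ret = __copy_user_nocache(dst, src, len);
    	clac();		/* close it again */
    	return ret;
    }

    /* kernel-to-kernel uncached copy (the rdma case): no stac/clac at all,
     * since __copy_user_nocache() itself no longer touches SMAP state */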

# d2c95f9d | 15-Apr-2023 | Linus Torvalds <torvalds@linux-foundation.org>
x86: don't use REP_GOOD or ERMS for user memory clearing
The modern target to use is FSRS (Fast Short REP STOS), and the other cases should only be used for bigger areas (ie mainly things like page clearing).
Note! This changes the conditional for the inlining from FSRM ("fast short rep movs") to FSRS ("fast short rep stos").
We'll have a separate fixup for AMD microarchitectures that have a good 'rep stosb' yet do not set the new Intel-specific FSRS bit (because FSRM was there first).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
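
[Editor's sketch: the user-memory clearing path now keys off FSRS. Paraphrased; at this point the out-of-line fallback was still named clear_user_original.]

    static __always_inline __must_check unsigned long
    __clear_user(void __user *addr, unsigned long size)
    {
    	stac();
    	asm volatile(
    		"1:\n\t"
    		ALTERNATIVE("rep stosb",
    			    "call clear_user_original",
    			    ALT_NOT(X86_FEATURE_FSRS))
    		"2:\n"
    		_ASM_EXTABLE_UA(1b, 2b)
    		: "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
    		: "a" (0)	/* %rax = 0, the fill value */
    		: "memory");
    	clac();
    	return size;	/* bytes NOT cleared */
    }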

# adfcf423 | 15-Apr-2023 | Linus Torvalds <torvalds@linux-foundation.org>
x86: don't use REP_GOOD or ERMS for user memory copies
The modern target to use is FSRM (Fast Short REP MOVS), and the other cases should only be used for bigger areas (ie mainly things like page clearing).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revision tags: v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3

# b9da86e3 | 16-Mar-2023 | Ira Weiny <ira.weiny@intel.com>
x86/uaccess: Remove memcpy_page_flushcache()
Commit 21b56c847753 ("iov_iter: get rid of separate bvec and xarray callbacks") removed the calls to memcpy_page_flushcache().
In addition, memcpy_page_flushcache() uses the deprecated kmap_atomic().
Remove the unused x86 memcpy_page_flushcache() implementation and also get rid of one more kmap_atomic() user.
[ dhansen: tweak changelog ]
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221230-kmap-x86-v1-1-15f1ecccab50%40intel.com
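
[Editor's note: for context, converting a kmap_atomic() user to the preferred kmap_local_page() generally looks like the generic sketch below. This is not code from the commit; the function names are invented.]

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* old style: kmap_atomic() disables preemption/migration as a side effect */
    static void copy_to_page_old(struct page *page, size_t off,
    			     const void *src, size_t len)
    {
    	char *to = kmap_atomic(page);
    	memcpy_flushcache(to + off, src, len);
    	kunmap_atomic(to);
    }

    /* preferred: kmap_local_page() keeps the mapping CPU-local, no such side effects */
    static void copy_to_page_new(struct page *page, size_t off,
    			     const void *src, size_t len)
    {
    	char *to = kmap_local_page(page);
    	memcpy_flushcache(to + off, src, len);
    	kunmap_local(to);
    }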

Revision tags: v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1

# 4f2c0a4a | 14-Dec-2022 | Nick Terrell <terrelln@fb.com>
Merge branch 'main' into zstd-linus

# e291c116 | 12-Dec-2022 | Dmitry Torokhov <dmitry.torokhov@gmail.com>
Merge branch 'next' into for-linus
Prepare input updates for 6.2 merge window.

Revision tags: v6.1, v6.1-rc8, v6.1-rc7

# 29583dfc | 21-Nov-2022 | Thomas Zimmermann <tzimmermann@suse.de>
Merge drm/drm-next into drm-misc-next-fixes
Backmerging to update drm-misc-next-fixes for the final phase of the release cycle.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>

Revision tags: v6.1-rc6

# 002c6ca7 | 14-Nov-2022 | Rodrigo Vivi <rodrigo.vivi@intel.com>
Merge drm/drm-next into drm-intel-next
Catch up on 6.1-rc cycle in order to solve the intel_backlight conflict on linux-next.
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Revision tags: v6.1-rc5, v6.1-rc4

# d93618da | 04-Nov-2022 | Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Merge drm/drm-next into drm-intel-gt-next
Needed to bring in v6.1-rc1 which contains commit f683b9d61319 ("i915: use the VMA iterator") which is needed for series https://patchwork.freedesktop.org/series/110083/.
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Revision tags: v6.1-rc3, v6.1-rc2

# 14e77332 | 22-Oct-2022 | Nick Terrell <terrelln@fb.com>
Merge branch 'main' into zstd-next

# 1aca5ce0 | 20-Oct-2022 | Thomas Zimmermann <tzimmermann@suse.de>
Merge drm/drm-fixes into drm-misc-fixes
Backmerging to get v6.1-rc1.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>

# 008f05a7 | 19-Oct-2022 | Mark Brown <broonie@kernel.org>
ASoC: jz4752b: Capture fixes
Merge series from Siarhei Volkau <lis8215@gmail.com>:
The patchset fixes:
 - Line In path stays powered off during capturing or bypass to mixer.
 - incorrectly represented dB values in alsamixer, et al.
 - incorrectly represented Capture input selector in alsamixer's Playback tab.
 - wrong control selected as Capture Master.

# a140a6a2 | 18-Oct-2022 | Maxime Ripard <maxime@cerno.tech>
Merge drm/drm-next into drm-misc-next
Let's kick-off this release cycle.
Signed-off-by: Maxime Ripard <maxime@cerno.tech>

# c29a017f | 17-Oct-2022 | Dmitry Torokhov <dmitry.torokhov@gmail.com>
Merge tag 'v6.1-rc1' into next
Merge with mainline to bring in the latest changes to twl4030 driver.

# 8048b835 | 17-Oct-2022 | Andrew Morton <akpm@linux-foundation.org>
Merge branch 'master' into mm-hotfixes-stable

Revision tags: v6.1-rc1

# dfd2d876 | 10-Oct-2022 | Johannes Berg <johannes.berg@intel.com>
Merge remote-tracking branch 'wireless/main' into wireless-next
Pull in wireless/main content since some new code would otherwise conflict with it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

# 7db99f01 | 04-Oct-2022 | Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'x86_cpu_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- Print the CPU number at segfault time.
The number printed is not always accurate (preemption is enabled at that time) but the print string contains "likely" and after a lot of back'n'forth on this, this was the consensus that was reached. See thread at [1].
- After a *lot* of testing and polishing, finally the clear_user() improvements to inline REP; STOSB by default
Link: https://lore.kernel.org/r/5d62c1d0-7425-d5bb-ecb5-1dc3b4d7d245@intel.com [1]
* tag 'x86_cpu_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Print likely CPU at segfault time
  x86/clear_user: Make it faster

Revision tags: v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1

# 0db7058e | 24-May-2022 | Borislav Petkov <bp@suse.de>
x86/clear_user: Make it faster
Based on a patch by Mark Hemment <markhemm@googlemail.com> and incorporating very sane suggestions from Linus.
The point here is to have the default case with FSRM - which is supposed to be the majority of x86 hw out there - if not now then soon - be directly inlined into the instruction stream so that no function call overhead is taking place.
Drop the early clobbers from the @size and @addr operands as those are not needed anymore since we have single instruction alternatives.
The benchmarks I ran would show very small improvements and a PF benchmark would even show weird things like slowdowns with higher core counts.
So for a ~6m running the git test suite, the function gets called under 700K times, all from padzero():
  <...>-2536 [006] ..... 261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
  <...>-2536 [006] ..... 261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
  <...>-2537 [008] ..... 261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
  <...>-2537 [008] ..... 261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
  ...
which is around 1%-ish of the total time and which is consistent with the benchmark numbers.
So Mel gave me the idea to simply measure how fast the function becomes. I.e.:
    start = rdtsc_ordered();
    ret   = __clear_user(to, n);
    end   = rdtsc_ordered();
Computing the mean average of all the samples collected during the test suite run then shows some improvement:
clear_user_original: Amean: 9219.71 (Sum: 6340154910, samples: 687674)
fsrm: Amean: 8030.63 (Sum: 5522277720, samples: 687652)
That's on Zen3.
The situation looks a lot more confusing on Intel:

Icelake:

clear_user_original:
  Amean: 19679.4 (Sum: 13652560764, samples: 693750)
  Amean: 19743.7 (Sum: 13693470604, samples: 693562)

(I ran it twice just to be sure.)

ERMS:
  Amean: 20374.3 (Sum: 13910601024, samples: 682752)
  Amean: 20453.7 (Sum: 14186223606, samples: 693576)

FSRM:
  Amean: 20458.2 (Sum: 13918381386, samples: 680331)
The original microbenchmark which people were complaining about:
for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
  32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
  37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
  62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
  60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
  53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
  55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
  55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
  54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
  50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
  58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
  68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s
Now the same thing with smaller buffers:
for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
  8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s
is also not conclusive because it all depends on the buffer sizes, their alignments and when the microcode detects that cachelines can be aggregated properly and copied in bigger sizes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
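
[Editor's sketch: a hypothetical reconstruction of the measurement harness described above, producing trace lines like the padzero samples quoted earlier. Names are invented; the commit does not include the instrumentation itself.]

    static noinline unsigned long timed_clear_user(void __user *to, unsigned long n)
    {
    	u64 start, end;
    	unsigned long ret;

    	start = rdtsc_ordered();
    	ret = __clear_user(to, n);
    	end = rdtsc_ordered();

    	/* one sample per call; the mean is computed over the whole run */
    	trace_printk("to: 0x%px, size: %lu, cycles: %llu\n",
    		     to, n, end - start);
    	return ret;
    }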

Revision tags: v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1

# 762f99f4 | 15-Jan-2022 | Dmitry Torokhov <dmitry.torokhov@gmail.com>
Merge branch 'next' into for-linus
Prepare input updates for 5.17 merge window.

Revision tags: v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5

# 86329873 | 09-Dec-2021 | Daniel Lezcano <daniel.lezcano@linaro.org>
Merge branch 'reset/of-get-optional-exclusive' of git://git.pengutronix.de/pza/linux into timers/drivers/next
"Add optional variant of of_reset_control_get_exclusive(). If the requested reset is not
Merge branch 'reset/of-get-optional-exclusive' of git://git.pengutronix.de/pza/linux into timers/drivers/next
"Add optional variant of of_reset_control_get_exclusive(). If the requested reset is not specified in the device tree, this function returns NULL instead of an error."
This dependency is needed for the Generic Timer Module (a.k.a OSTM) support for RZ/G2L.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>

# 5d8dfaa7 | 09-Dec-2021 | Dmitry Torokhov <dmitry.torokhov@gmail.com>
Merge tag 'v5.15' into next
Sync up with the mainline to get the latest APIs and DT bindings.