entry-percpu.h - OpenGrok history log for /linux/arch/s390/include/asm/entry-percpu.h

Revision		Date	Author	Comments
# a737737c		26-May-2026	Heiko Carstens <hca@linux.ibm.com>	s390/percpu: Infrastructure for more efficient this_cpu operations

s390/percpu: Infrastructure for more efficient this_cpu operations

With the intended removal of PREEMPT_NONE, this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
become more expensive: the preempt_disable()/preempt_enable() pairs are
no longer optimized away at compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.

E.g. this simple C code sequence

DEFINE_PER_CPU(long, foo);
long bar(long a) { return this_cpu_add_return(foo, a); }

generates this code:

11a976: eb af f0 68 00 24  stmg   %r10,%r15,104(%r15)
11a97c: b9 04 00 ef        lgr    %r14,%r15
11a980: b9 04 00 b2        lgr    %r11,%r2
11a984: e3 f0 ff c8 ff 71  lay    %r15,-56(%r15)
11a98a: e3 e0 f0 98 00 24  stg    %r14,152(%r15)
11a990: eb 01 03 a8 00 6a  asi    936,1                <- __preempt_count_add(1)
11a996: c0 10 00 d2 ac b5  larl   %r1,1b70300          <- address of percpu var
11a9a0: e3 10 23 b8 00 08  ag     %r1,952              <- add percpu offset
11a9a6: eb ab 10 00 00 e8  laag   %r10,%r11,0(%r1)     <- atomic op
11a9ac: eb ff 03 a8 00 6e  alsi   936,-1               <- __preempt_count_dec_and_test()
11a9b2: a7 54 00 05        jnhe   11a9bc <bar+0x4c>
11a9b6: c0 e5 00 76 d1 bd  brasl  %r14,ff4d30 <preempt_schedule_notrace>
11a9bc: b9 e8 b0 2a        agrk   %r2,%r10,%r11
11a9c0: eb af f0 a0 00 04  lmg    %r10,%r15,160(%r15)
11a9c6: 07 fe              br     %r14

The above example is more or less the worst case, since the call to
preempt_schedule_notrace() requires a stack frame which otherwise wouldn't
be necessary. Even without the stack frame, however, the conditional jnhe
branch instruction remains.
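
The sequence above can be approximated with a small user-space model in C.
This is only a sketch under simplifying assumptions: all model_* names are
hypothetical, the preempt count and the need-resched state are kept in two
separate variables, and the percpu offset is left out. It is not the kernel
implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for state the kernel keeps in lowcore. */
static int model_preempt_count;
static int model_need_resched;
static int model_resched_calls;

/* Stand-in for preempt_schedule_notrace(). */
static void model_preempt_schedule(void)
{
	model_need_resched = 0;
	model_resched_calls++;
}

/* Model of a preempt-guarded this_cpu_add_return(). */
static long model_this_cpu_add_return(_Atomic long *var, long val)
{
	long old;

	model_preempt_count++;			/* asi: __preempt_count_add(1) */
	old = atomic_fetch_add(var, val);	/* laag: atomic op */
	assert(model_preempt_count > 0);
	/* alsi + jnhe + brasl: the conditional call that costs a stack frame */
	if (--model_preempt_count == 0 && model_need_resched)
		model_preempt_schedule();
	return old + val;			/* agrk: old value plus addend */
}
```

The conditional call at the end is what forces the compiler to set up a
stack frame and spill registers in the generated code shown earlier.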

Get rid of the conditional branch with the following code sequence:

11a8e6: c0 30 00 d0 c5 0d  larl   %r3,1b33300
11a8ec: b9 04 00 43        lgr    %r4,%r3
11a8f0: eb 00 43 c0 00 52  mviy   960,4
11a8f6: e3 40 03 b8 00 08  ag     %r4,952
11a8fc: eb 52 40 00 00 e8  laag   %r5,%r2,0(%r4)
11a902: eb 00 03 c0 00 52  mviy   960,0
11a908: b9 08 00 25        agr    %r2,%r5
11a90c: 07 fe              br     %r14

The general idea is that this_cpu operations based on atomic instructions
are guarded with mviy instructions:

- The first mviy instruction writes to lowcore the number of the register
which contains the address of the percpu variable. This also indicates
that a percpu code section is being executed.

- The first instruction following the mviy instruction must be the ag
instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mviy instruction writes a zero to lowcore, which indicates
the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
written to lowcore is copied to the exception frame (pt_regs), and a zero
is written to lowcore.

- On return to the previous context it is checked if a percpu code section
was executed (saved register number not zero), and if the process was
migrated to a different cpu. If the percpu offset was already added to
the percpu address register (instruction address does _not_ point to the
ag instruction), the content of the percpu address register is adjusted so
it points to the percpu variable of the new cpu.
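
The fixup on return to the previous context can be sketched as a user-space
model in C. Everything here is an assumption made for illustration: struct
model_regs is a reduced, hypothetical stand-in for pt_regs, and the address
of the ag instruction and both percpu offsets are passed in as parameters,
since this description does not say how the real code obtains them.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_GPRS 16

/* Hypothetical, reduced model of the saved context (pt_regs). */
struct model_regs {
	uint64_t gprs[MODEL_NR_GPRS];
	uint64_t psw_addr;	/* address of the next instruction to execute */
	uint8_t percpu_reg;	/* copied from lowcore on entry, 0 = no section */
};

/*
 * Model of the check on return to the interrupted context. ag_addr is the
 * address of the ag instruction following the first mviy; old_off and
 * new_off are the percpu offsets of the cpu before and after a possible
 * migration.
 */
static void model_percpu_fixup(struct model_regs *regs, uint64_t ag_addr,
			       uint64_t old_off, uint64_t new_off)
{
	uint8_t reg = regs->percpu_reg;

	if (!reg)			/* not inside a percpu code section */
		return;
	assert(reg < MODEL_NR_GPRS);
	if (old_off == new_off)		/* not migrated to a different cpu */
		return;
	if (regs->psw_addr == ag_addr)	/* ag not executed yet, nothing to fix */
		return;
	/* Offset was already added: retarget the register to the new cpu. */
	regs->gprs[reg] += new_off - old_off;
}
```

With the addresses from the listing above, an interrupt taken at 11a8fc
(the laag instruction) with register number 4 saved would lead to an
adjustment of %r4, while an interrupt taken at 11a8f6 (the ag instruction)
would leave %r4 untouched.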

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>