69a14d14 | 02-Jun-2025 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
futex: Verify under the lock if hash can be replaced
Once the global hash is requested there is no way to switch back to the per-task private hash. This is checked at the beginning of the function.
It is possible that two threads simultaneously request the global hash: both pass the initial check and block later on mm::futex_hash_lock. In this case the first thread performs the switch to the global hash. The second thread then also attempts to switch to the global hash and, while doing so, accesses the nonexistent slot 1 of struct futex_private_hash. The same applies if the hash is made immutable: there is no reference counting and the hash must not be replaced.
Verify under mm_struct::futex_hash_lock that neither the global hash nor an immutable hash is in use.
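This is the classic double-checked pattern. Below is a minimal userspace model of the fix, with a pthread mutex standing in for mm_struct::futex_hash_lock; all names are illustrative stand-ins, not the kernel's code.

  #include <pthread.h>
  #include <stdbool.h>

  struct mm_model {
      pthread_mutex_t futex_hash_lock;
      bool global_hash;    /* set once, never cleared */
      bool immutable;      /* set once, never cleared */
  };

  static bool set_global_hash(struct mm_model *mm)
  {
      /* Unlocked fast path -- both racing threads may pass this. */
      if (mm->global_hash || mm->immutable)
          return false;

      pthread_mutex_lock(&mm->futex_hash_lock);
      /* Re-check under the lock: another thread may have won the race. */
      if (mm->global_hash || mm->immutable) {
          pthread_mutex_unlock(&mm->futex_hash_lock);
          return false;
      }
      mm->global_hash = true;    /* the switch happens exactly once */
      pthread_mutex_unlock(&mm->futex_hash_lock);
      return true;
  }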
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aDwDw9Aygqo6oAx+@ly-workstation/
Fixes: bd54df5ea7cad ("futex: Allow to resize the private local hash")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250610104400.1077266-5-bigeasy@linutronix.de/
|
c042c505 | 16-Apr-2025 |
Peter Zijlstra <peterz@infradead.org> |
futex: Implement FUTEX2_MPOL
Extend the futex2 interface to be aware of mempolicy.
When FUTEX2_MPOL is specified and there is an MPOL_PREFERRED or home_node specified covering the futex address, use that hash-map.
Notably, in this case the futex will go to the global node hashtable, even if it is a PRIVATE futex.
When FUTEX2_NUMA|FUTEX2_MPOL is specified and the user-specified node value is FUTEX_NO_NODE, the MPOL lookup (as described above) will be tried first before falling back to setting the node to the local node.
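A hedged userspace sketch of that lookup order; the syscall number, flag values and FUTEX_NO_NODE are assumptions taken from this series (hence the fallback defines) and may not match other kernels.

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef __NR_futex_wait
  #define __NR_futex_wait 455
  #endif
  #ifndef FUTEX2_SIZE_U32
  #define FUTEX2_SIZE_U32 0x02
  #endif
  #ifndef FUTEX2_NUMA
  #define FUTEX2_NUMA 0x04
  #endif
  #ifndef FUTEX2_MPOL
  #define FUTEX2_MPOL 0x08
  #endif
  #ifndef FUTEX_NO_NODE
  #define FUTEX_NO_NODE (-1)
  #endif

  /* With FUTEX2_NUMA the futex word grows to two u32s: value + node. */
  struct futex_numa_32 {
      uint32_t val;
      uint32_t node;
  } __attribute__((aligned(8)));

  static long futex2_wait_mpol(struct futex_numa_32 *f, uint32_t expected)
  {
      /*
       * node == FUTEX_NO_NODE: the kernel consults the mempolicy
       * (MPOL_PREFERRED/home_node) covering 'f' first and only then
       * falls back to the local node.
       */
      f->node = (uint32_t)FUTEX_NO_NODE;
      return syscall(__NR_futex_wait, f, (unsigned long)expected,
                     ~0UL /* mask */,
                     FUTEX2_SIZE_U32 | FUTEX2_NUMA | FUTEX2_MPOL,
                     NULL /* no timeout */, 0 /* clockid, unused */);
  }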
[bigeasy: add CONFIG_FUTEX_MPOL, add MPOL to FUTEX2_VALID_MASK, write the node only to user if FUTEX_NO_NODE was supplied]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-18-bigeasy@linutronix.de
|
cec199c5 | 16-Apr-2025 |
Peter Zijlstra <peterz@infradead.org> |
futex: Implement FUTEX2_NUMA
Extend the futex2 interface to be numa aware.
When FUTEX2_NUMA is specified for a futex, the user value is extended to two words (of the same size). The first is the user value we all know; the second is the node to place this futex on.
  struct futex_numa_32 {
      u32 val;
      u32 node;
  };
When node is set to ~0, WAIT will set it to the current node_id such that WAKE knows where to find it. If userspace corrupts the node value between WAIT and WAKE, the futex will not be found and no wakeup will happen.
When FUTEX2_NUMA is not set, the node is simply an extension of the hash, such that traditional futexes are still interleaved over the nodes.
This is done to avoid having to have a separate !numa hash-table.
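A hedged sketch of that WAIT/WAKE contract; the syscall numbers and flag values are assumptions taken from this series (fallback-defined below) and the helpers are illustrative.

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef __NR_futex_wait
  #define __NR_futex_wait 455
  #endif
  #ifndef __NR_futex_wake
  #define __NR_futex_wake 454
  #endif
  #ifndef FUTEX2_SIZE_U32
  #define FUTEX2_SIZE_U32 0x02
  #endif
  #ifndef FUTEX2_NUMA
  #define FUTEX2_NUMA 0x04
  #endif

  struct futex_numa_32 {
      uint32_t val;
      uint32_t node;
  } __attribute__((aligned(8)));

  static struct futex_numa_32 f;

  static void waiter(void)
  {
      f.node = ~0u;    /* WAIT fills in the node id it queued on */
      syscall(__NR_futex_wait, &f, 0UL /* expected val */, ~0UL /* mask */,
              FUTEX2_SIZE_U32 | FUTEX2_NUMA, NULL, 0);
  }

  static void waker(void)
  {
      /* f.node must still hold the value WAIT wrote, or no wakeup. */
      syscall(__NR_futex_wake, &f, ~0UL /* mask */, 1 /* nr_wake */,
              FUTEX2_SIZE_U32 | FUTEX2_NUMA);
  }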
[bigeasy: ensure to have at least hashsize of 4 in futex_init(), add pr_info() for size and allocation information. Cast the naddr math to void*]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-17-bigeasy@linutronix.de
|
63e8595c | 16-Apr-2025 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
futex: Allow to make the private hash immutable
My initial testing showed that:
perf bench futex hash
reported fewer operations/sec with the private hash. After using the same number of buckets in the private hash as used by the global hash, the operations/sec were about the same.
This changed once the private hash became resizable. This feature added an RCU section and reference counting via an atomic inc+dec operation into the hot path. The reference counting can be avoided if the private hash is made immutable.
Extend PR_FUTEX_HASH_SET_SLOTS by a fourth argument which denotes if the private hash should be made immutable. Once set (to true), a further resize is not allowed (the same applies if set to the global hash). Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash can not be changed.
Update the "perf bench" suite.
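A hedged userspace example of the extended interface; the PR_* values are fallbacks taken from this series' uapi additions and may differ elsewhere.

  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_FUTEX_HASH
  #define PR_FUTEX_HASH 78
  #define PR_FUTEX_HASH_SET_SLOTS 1
  #define PR_FUTEX_HASH_GET_SLOTS 2
  #define PR_FUTEX_HASH_GET_IMMUTABLE 3
  #endif

  int main(void)
  {
      /* 128 slots; the fourth argument != 0 makes the hash immutable. */
      if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 128, 1))
          perror("PR_FUTEX_HASH_SET_SLOTS");

      /* Returns 1 once the hash can no longer be resized or replaced. */
      printf("immutable: %d\n",
             (int)prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE));
      return 0;
  }

The -I runs in the numbers below presumably exercise this immutable mode through the updated "perf bench" suite.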
For comparison, results of "perf bench futex hash -s":

- Xeon CPU E5-2650, 2 NUMA nodes, total 32 CPUs:
  - Before introducing the task local hash:
    shared    Averaged 1.487.148 operations/sec (+- 0,53%), total secs = 10
    private   Averaged 2.192.405 operations/sec (+- 0,07%), total secs = 10
  - With the series:
    shared    Averaged 1.326.342 operations/sec (+- 0,41%), total secs = 10
    -b128     Averaged   141.394 operations/sec (+- 1,15%), total secs = 10
    -Ib128    Averaged   851.490 operations/sec (+- 0,67%), total secs = 10
    -b8192    Averaged   131.321 operations/sec (+- 2,13%), total secs = 10
    -Ib8192   Averaged 1.923.077 operations/sec (+- 0,61%), total secs = 10
    128 is the default allocation of hash buckets.
    8192 was the previous amount of allocated hash buckets.

- Xeon(R) CPU E7-8890 v3, 4 NUMA nodes, total 144 CPUs:
  - Before introducing the task local hash:
    shared    Averaged 1.810.936 operations/sec (+- 0,26%), total secs = 20
    private   Averaged 2.505.801 operations/sec (+- 0,05%), total secs = 20
  - With the series:
    shared    Averaged 1.589.002 operations/sec (+- 0,25%), total secs = 20
    -b1024    Averaged    42.410 operations/sec (+- 0,20%), total secs = 20
    -Ib1024   Averaged   740.638 operations/sec (+- 1,51%), total secs = 20
    -b65536   Averaged    48.811 operations/sec (+- 1,35%), total secs = 20
    -Ib65536  Averaged 1.963.165 operations/sec (+- 0,18%), total secs = 20
    1024 is the default allocation of hash buckets.
    65536 was the previous amount of allocated hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20250416162921.513656-16-bigeasy@linutronix.de
|
bd54df5e | 16-Apr-2025 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
futex: Allow to resize the private local hash
The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/replacement. The futex_hash_allocate()/PR_FUTEX_HASH_SET_SLOTS operation can now be invoked at runtime and resize an already existing internal private futex_hash_bucket to another size.
The reallocation is based on an idea by Thomas Gleixner: The initial allocation of struct futex_private_hash sets the reference count to one. Every user acquires a reference on the local hash before using it and drops it after it enqueued itself on the hash bucket. There is no reference held while the task is scheduled out while waiting for the wake up.
The resize process allocates a new struct futex_private_hash and drops the initial reference. Synchronized with mm_struct::futex_hash_lock, it is checked whether the reference counter of the currently used mm_struct::futex_phash is marked as DEAD. If so, all users enqueued on the current private hash are requeued on the new private hash and the new private hash is set as mm_struct::futex_phash. Otherwise the newly allocated private hash is saved as mm_struct::futex_phash_new, and the rehashing and reassigning is delayed to the futex_hash() caller once the reference counter is marked DEAD.
The replacement is not performed at rcuref_put() time because certain callers, such as futex_wait_queue(), drop their reference after changing the task state; that task-state change would be destroyed once the futex_hash_lock is acquired.
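A minimal userspace model of that scheme, using C11 atomics and a mutex in place of rcuref and mm_struct::futex_hash_lock. The RCU protection that makes the load+increment in hash_get() safe in the kernel is deliberately elided here; all names are illustrative.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdlib.h>

  struct private_hash {
      atomic_int users;    /* the initial allocation holds one reference */
      /* ... hash buckets ... */
  };

  static struct private_hash *_Atomic futex_phash;
  static struct private_hash *futex_phash_new;
  static pthread_mutex_t futex_hash_lock = PTHREAD_MUTEX_INITIALIZER;

  static struct private_hash *hash_get(void)
  {
      struct private_hash *ph = atomic_load(&futex_phash);

      atomic_fetch_add(&ph->users, 1);    /* rcuref_get() stand-in */
      return ph;
  }

  static void hash_put(struct private_hash *ph)
  {
      atomic_fetch_sub(&ph->users, 1);    /* reaching 0 models "DEAD" */
  }

  /* futex_hash() slow path; called with futex_hash_lock held. */
  static void maybe_replace_hash(void)
  {
      struct private_hash *old = atomic_load(&futex_phash);

      if (futex_phash_new && atomic_load(&old->users) == 0) {
          /* requeue every waiter from 'old' onto the new hash here */
          atomic_store(&futex_phash, futex_phash_new);
          futex_phash_new = NULL;
          free(old);
      }
  }

  /* PR_FUTEX_HASH_SET_SLOTS path: stage the new hash, drop the initial ref. */
  static void resize_hash(struct private_hash *new_hash)
  {
      pthread_mutex_lock(&futex_hash_lock);
      futex_phash_new = new_hash;
      hash_put(atomic_load(&futex_phash));
      maybe_replace_hash();    /* may already succeed here */
      pthread_mutex_unlock(&futex_hash_lock);
  }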
The user can change the number of slots with PR_FUTEX_HASH_SET_SLOTS multiple times. Both an increase and a decrease are allowed and the request blocks until the assignment is done.
The private hash allocated at thread creation is changed from a fixed 16 to 16 <= 4 * number_of_threads <= global_hash_size, where number_of_threads can not exceed the number of online CPUs. Should the user set the size via PR_FUTEX_HASH_SET_SLOTS, the auto scaling is disabled.
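As arithmetic, the rule above clamps four slots per thread into [16, global_hash_size]; a hypothetical helper mirroring it (not the kernel's code):

  static unsigned int auto_hash_slots(unsigned int threads,
                                      unsigned int online_cpus,
                                      unsigned int global_hash_size)
  {
      unsigned int n = threads < online_cpus ? threads : online_cpus;
      unsigned int slots = 4 * n;

      if (slots < 16)
          slots = 16;
      if (slots > global_hash_size)
          slots = global_hash_size;
      return slots;
  }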
[peterz: reorganize the code to avoid state tracking and simplify new object handling, block the user until changes are in effect, allow increase and decrease of the hash].
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-15-bigeasy@linutronix.de
|
80367ad0 | 16-Apr-2025 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
futex: Add basic infrastructure for local task local hash
The futex hash is system wide and shared by all tasks. Each slot is hashed based on the futex address and the VMA of the thread. Due to randomized VMAs (and memory allocations), the same logical lock (pointer) can end up in a different hash bucket on each invocation of the application. This in turn means that different applications may share a hash bucket on the first invocation but not on the second, and it is not always clear which applications will be involved. This can result in high latencies to acquire futex_hash_bucket::lock, especially if the lock owner is limited to a CPU and can not be effectively PI boosted.
Introduce the basic infrastructure for a process local hash which is shared by all threads of a process. This hash will only be used for a PROCESS_PRIVATE FUTEX operation.
The hashmap can be allocated via:
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num);
A `num' of 0 means that the global hash is used instead of a private hash. Other values for `num' specify the number of slots for the hash and the number must be a power of two, starting with two. The prctl() returns zero on success. This function can only be used before a thread is created.
The current status for the private hash can be queried via:
num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which returns the current number of slots. The value 0 means that the global hash is used. Values greater than 0 indicate the number of slots that are used. A negative number indicates an error.
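A hedged example of querying the current state; the PR_* fallback values are assumptions taken from this series' uapi additions.

  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_FUTEX_HASH
  #define PR_FUTEX_HASH 78
  #define PR_FUTEX_HASH_GET_SLOTS 2
  #endif

  int main(void)
  {
      int num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);

      if (num < 0)
          perror("PR_FUTEX_HASH_GET_SLOTS");
      else if (num == 0)
          printf("using the global hash\n");
      else
          printf("private hash with %d slots\n", num);
      return 0;
  }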
As an optimisation, jhash2() for the private hash uses only two arguments: the address and the offset. This omits the VMA, which is always the same.
[peterz: Use 0 for global hash. A bit shuffling and renaming. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de
|
b04b8f30 | 16-Apr-2025 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
futex: Introduce futex_q_lockptr_lock()
futex_lock_pi() and __fixup_pi_state_owner() acquire the futex_q::lock_ptr without holding a reference, assuming the previously obtained hash bucket and the assigned lock_ptr are still valid. This is no longer the case once the private hash can be resized: it becomes invalid after the reference drop.
Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in futex_q::lock_ptr. The lock pointer is read in an RCU section to ensure that it does not go away if the hash bucket has been replaced while the old pointer is still being observed. After locking, the pointer needs to be compared to check if it changed. If so, the hash bucket has been replaced, the user has been moved to the new one, and lock_ptr has been updated. The lock operation needs to be redone in this case.
The locked hash bucket is not returned.
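A minimal userspace model of that loop, with an atomic mutex pointer standing in for futex_q::lock_ptr; the kernel's RCU section is elided and, as stated above, nothing is returned. All names are illustrative.

  #include <pthread.h>
  #include <stdatomic.h>

  struct futex_q_model {
      pthread_mutex_t *_Atomic lock_ptr;    /* updated on hash replacement */
  };

  static void futex_q_lockptr_lock_model(struct futex_q_model *q)
  {
      pthread_mutex_t *lp;

      for (;;) {
          lp = atomic_load(&q->lock_ptr);    /* RCU read section in the kernel */
          pthread_mutex_lock(lp);
          if (lp == atomic_load(&q->lock_ptr))
              break;    /* still the current bucket lock; keep holding it */
          /* The hash was replaced while we blocked: drop the stale lock, redo. */
          pthread_mutex_unlock(lp);
      }
  }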
A special case is an early return in futex_lock_pi() (due to signal or timeout) and a successful futex_wait_requeue_pi(). In both cases a valid futex_q::lock_ptr is expected (and its matching hash bucket), but since the waiter has been removed from the hash this can no longer be guaranteed. Therefore a reference is acquired before the waiter is removed and is later dropped by the waiter to avoid a resize.
Add futex_q_lockptr_lock() and use it. Acquire an additional reference in requeue_pi_wake_futex() and futex_unlock_pi() while the futex_q is removed, denote this extra reference in futex_q::drop_hb_ref and let the waiter drop the reference in this case.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-11-bigeasy@linutronix.de
|
fe00e88d | 16-Apr-2025 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
futex: Decrease the waiter count before the unlock operation
To support runtime resizing of the process private hash, it's required to not use the obtained hash bucket once the reference count has been dropped. The reference will be dropped after the unlock of the hash bucket. The number of waiters is currently decremented after the unlock operation, but there is no requirement that this needs to happen after the unlock. The increment happens before acquiring the lock to signal early that there will be a waiter; the waker can avoid taking the lock if it is known that there will be no waiter. There is no difference in terms of ordering if the decrement happens before or after the unlock.
Decrease the waiter count before the unlock operation.
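A minimal model of the reordering, with a pthread mutex and a C11 atomic standing in for the bucket lock and waiter count; names are illustrative.

  #include <pthread.h>
  #include <stdatomic.h>

  struct hash_bucket_model {
      pthread_mutex_t lock;
      atomic_int waiters;
  };

  static void unqueue_model(struct hash_bucket_model *hb)
  {
      pthread_mutex_lock(&hb->lock);
      /* ... remove the waiter from the bucket's list ... */
      atomic_fetch_sub(&hb->waiters, 1);    /* previously done after the unlock */
      pthread_mutex_unlock(&hb->lock);
      /* hb is not touched past this point; the reference can be dropped
       * and the hash safely resized. */
  }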
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-10-bigeasy@linutronix.de
|
2fb29209 | 16-Apr-2025 |
Peter Zijlstra <peterz@infradead.org> |
futex: Pull futex_hash() out of futex_q_lock()
Getting the hash bucket and queuing it are two distinct actions. In light of wanting to add a put hash bucket function later, untangle them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-5-bigeasy@linutronix.de
|