| 1e4ee513 | 27-Oct-2025 |
Tiwei Bie <tiwei.btw@antgroup.com> |
um: Add initial SMP support
Add initial symmetric multi-processing (SMP) support to UML. With this support enabled, users can tell UML to start multiple virtual processors, each represented as a sep
um: Add initial SMP support
Add initial symmetric multi-processing (SMP) support to UML. With this support enabled, users can tell UML to start multiple virtual processors, each represented as a separate host thread.
In UML, kthreads and normal threads (when running in kernel mode) can be scheduled and executed simultaneously on different virtual processors. However, the userspace code of normal threads still runs within their respective single-threaded stubs.
That is, SMP support is currently available both within the kernel and across different processes, but still remains limited within threads of the same process in userspace.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20251027001815.1666872-6-tiwei.bie@linux.dev Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| b3fb0eb5 | 11-Jul-2025 |
Tiwei Bie <tiwei.btw@antgroup.com> |
um: Remove the pid parameter of handle_trap()
It's no longer used. Remove it.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250711065021.2535362-3-tiwei.bie@lin
um: Remove the pid parameter of handle_trap()
It's no longer used. Remove it.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250711065021.2535362-3-tiwei.bie@linux.dev Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| cba737fa | 11-Jul-2025 |
Tiwei Bie <tiwei.btw@antgroup.com> |
um: Use err consistently in userspace()
Avoid declaring a new variable 'ret' inside the 'if (using_seccomp)' block, as the existing 'err' variable declared at the top of the function already serves
um: Use err consistently in userspace()
Avoid declaring a new variable 'ret' inside the 'if (using_seccomp)' block, as the existing 'err' variable declared at the top of the function already serves the same purpose.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250711065021.2535362-2-tiwei.bie@linux.dev Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| e92e2552 | 02-Jun-2025 |
Benjamin Berg <benjamin.berg@intel.com> |
um: pass FD for memory operations when needed
Instead of always sharing the FDs with the userspace process, only hand over the FDs needed for mmap when required. The idea is that userspace might be
um: pass FD for memory operations when needed
Instead of always sharing the FDs with the userspace process, only hand over the FDs needed for mmap when required. The idea is that userspace might be able to force the stub into executing an mmap syscall, however, it will not be able to manipulate the control flow sufficiently to have access to an FD that would allow mapping arbitrary memory.
Security wise, we need to be sure that only the expected syscalls are executed after the kernel sends FDs through the socket. This is currently not the case, as userspace can trivially jump to the rt_sigreturn syscall instruction to execute any syscall that the stub is permitted to do. With this, it can trick the kernel to send the FD, which in turn allows userspace to freely map any physical memory.
As such, this is currently *not* secure. However, in principle the approach should be fine with a more strict SECCOMP filter and a careful review of the stub control flow (as userspace can prepare a stack). With some care, it is likely possible to extend the security model to SMP if desired.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-8-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| 8420e08f | 02-Jun-2025 |
Benjamin Berg <benjamin@sipsolutions.net> |
um: Track userspace children dying in SECCOMP mode
When in seccomp mode, we would hang forever on the futex if a child has died unexpectedly. In contrast, ptrace mode will notice it and kill the cor
um: Track userspace children dying in SECCOMP mode
When in seccomp mode, we would hang forever on the futex if a child has died unexpectedly. In contrast, ptrace mode will notice it and kill the corresponding thread when it fails to run it.
Fix this issue using a new IRQ that is fired after a SIGCHLD and keeping an (internal) list of all MMs. In the IRQ handler, find the affected MM and set its PID to -1 as well as the futex variable to FUTEX_IN_KERN.
This, together with futex returning -EINTR after the signal is sufficient to implement a race-free detection of a child dying.
Note that this also enables IRQ handling while starting a userspace process. This should be safe and SECCOMP requires the IRQ in case the process does not come up properly.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-5-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| 0b8b2668 | 10-Oct-2024 |
Benjamin Berg <benjamin.berg@intel.com> |
um: insert scheduler ticks when userspace does not yield
In time-travel mode userspace can do a lot of work without any time passing. Unfortunately, this can result in OOM situations as the RCU core
um: insert scheduler ticks when userspace does not yield
In time-travel mode userspace can do a lot of work without any time passing. Unfortunately, this can result in OOM situations as the RCU core code will never be run.
Work around this by keeping track of userspace processes that do not yield for a lot of operations. When this happens, insert a jiffie into the sched_clock clock to account time against the process and cause the bookkeeping to run.
As sched_clock is used for tracing, it is useful to keep it in sync between the different VMs. As such, try to remove added ticks again when the actual clock ticks.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241010142537.1134685-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| 2717c6b6 | 11-Oct-2024 |
Tiwei Bie <tiwei.btw@antgroup.com> |
um: Abandon the _PAGE_NEWPROT bit
When a PTE is updated in the page table, the _PAGE_NEWPAGE bit will always be set. And the corresponding page will always be mapped or unmapped depending on whether
um: Abandon the _PAGE_NEWPROT bit
When a PTE is updated in the page table, the _PAGE_NEWPAGE bit will always be set. And the corresponding page will always be mapped or unmapped depending on whether the PTE is present or not. The check on the _PAGE_NEWPROT bit is not really reachable. Abandoning it will allow us to simplify the code and remove the unreachable code.
Reviewed-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241011102354.1682626-2-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| 32e8eaf2 | 19-Sep-2024 |
Benjamin Berg <benjamin.berg@intel.com> |
um: use execveat to create userspace MMs
Using clone will not undo features that have been enabled by libc. An example of this already happening is rseq, which could cause the kernel to read/write m
um: use execveat to create userspace MMs
Using clone will not undo features that have been enabled by libc. An example of this already happening is rseq, which could cause the kernel to read/write memory of the userspace process. In the future the standard library might also use mseal by default to protect itself, which would also thwart our attempts at unmapping everything.
Solve all this by taking a step back and doing an execve into a tiny static binary that sets up the minimal environment required for the stub without using any standard library. That way we have a clean execution environment that is fully under the control of UML.
Note that this changes things a bit as the FDs are not anymore shared with the kernel. Instead, we explicitly share the FDs for the physical memory and all existing iomem regions. Doing this is fine, as iomem regions cannot be added at runtime.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240919124511.282088-3-benjamin@sipsolutions.net [use pipe() instead of pipe2(), remove unneeded close() calls] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| bcf3d957 | 03-Jul-2024 |
Benjamin Berg <benjamin.berg@intel.com> |
um: refactor TLB update handling
Conceptually, we want the memory mappings to always be up to date and represent whatever is in the TLB. To ensure that, we need to sync them over in the userspace ca
um: refactor TLB update handling
Conceptually, we want the memory mappings to always be up to date and represent whatever is in the TLB. To ensure that, we need to sync them over in the userspace case and for the kernel we need to process the mappings.
The kernel will call flush_tlb_* if page table entries that were valid before become invalid. Unfortunately, this is not the case if entries are added.
As such, change both flush_tlb_* and set_ptes to track the memory range that has to be synchronized. For the kernel, we need to execute a flush_tlb_kern_* immediately but we can wait for the first page fault in case of set_ptes. For userspace in contrast we only store that a range of memory needs to be synced and do so whenever we switch to that process.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240703134536.1161108-13-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| 573a446f | 03-Jul-2024 |
Benjamin Berg <benjamin.berg@intel.com> |
um: simplify and consolidate TLB updates
The HVC update was mostly used to compress consecutive calls into one. This is mostly relevant for userspace where it is already handled by the syscall stub
um: simplify and consolidate TLB updates
The HVC update was mostly used to compress consecutive calls into one. This is mostly relevant for userspace where it is already handled by the syscall stub code.
Simplify the whole logic and consolidate it for both kernel and userspace. This does remove the sequential syscall compression for the kernel, however that shouldn't be the main factor in most runs.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240703134536.1161108-12-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| 3c83170d | 03-Jul-2024 |
Benjamin Berg <benjamin@sipsolutions.net> |
um: Delay flushing syscalls until the thread is restarted
As running the syscalls is expensive due to context switches, we should do so as late as possible in case more syscalls need to be queued la
um: Delay flushing syscalls until the thread is restarted
As running the syscalls is expensive due to context switches, we should do so as late as possible in case more syscalls need to be queued later on. This will also benefit a later move to a SECCOMP enabled userspace as in that case the need for extra context switches is removed entirely.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Link: https://patch.msgid.link/20240703134536.1161108-9-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|
| 6d8992e4 | 03-Jul-2024 |
Benjamin Berg <benjamin.berg@intel.com> |
um: compress memory related stub syscalls while adding them
To keep the number of syscalls that the stub has to do lower, compress two consecutive syscalls of the same type if the second is just a c
um: compress memory related stub syscalls while adding them
To keep the number of syscalls that the stub has to do lower, compress two consecutive syscalls of the same type if the second is just a continuation of the first.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240703134536.1161108-6-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
show more ...
|