#
3efc5736 |
| 28-Sep-2024 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull x86 kvm updates from Paolo Bonzini: "x86:
- KVM currently invalidates the entirety of the page tables, not just thos
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull x86 kvm updates from Paolo Bonzini: "x86:
- KVM currently invalidates the entirety of the page tables, not just those for the memslot being touched, when a memslot is moved or deleted.
This does not traditionally have particularly noticeable overhead, but Intel's TDX will require the guest to re-accept private pages if they are dropped from the secure EPT, which is a non starter.
Actually, the only reason why this is not already being done is a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs, so allow userspace to opt into the new behavior.
- Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon)
- Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs
This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work)
- Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset
- Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler
- Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit
- Fix a WARN caused by RSM entering a nested guest from SMM with invalid guest state, by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the nested guest)
- Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop
- Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU
- Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding multi generation LRU support in KVM
- Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB
- Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area
- Remove unnecessary accounting of temporary nested VMCB related allocations
- Set FINAL/PAGE in the page fault error code for EPT violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical
- Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2
- Add a lockdep assertion to detect unsafe accesses of vmcs12 structures
- Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible)
- Minor SGX fix and a cleanup
- Misc cleanups
Generic:
- Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed
- Enable virtualization when KVM is loaded, not right before the first VM is created
Together with the previous change, this simplifies a lot the logic of the callbacks, because their very existence implies virtualization is enabled
- Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase
- Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs
Selftests:
- Fix a goof that caused some Hyper-V tests to be skipped when run on bare metal, i.e. NOT in a VM
- Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES guest
- Explicitly include one-off assets in .gitignore. Past Sean was completely wrong about not being able to detect missing .gitignore entries
- Verify userspace single-stepping works when KVM happens to handle a VM-Exit in its fastpath
- Misc cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) Documentation: KVM: fix warning in "make htmldocs" s390: Enable KVM_S390_UCONTROL config in debug_defconfig selftests: kvm: s390: Add VM run test case KVM: SVM: let alternatives handle the cases when RSB filling is required KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() KVM: x86: Update retry protection fields when forcing retry on emulation failure KVM: x86: Apply retry protection to "unprotect on failure" path KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn ...
show more ...
|
Revision tags: v6.11 |
|
#
3f8df628 |
| 14-Sep-2024 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.12:
- Set FINAL/PAGE in the page fault error code for EPT Violations if and only if the GVA is v
Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.12:
- Set FINAL/PAGE in the page fault error code for EPT Violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical.
- Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2.
- Add a lockdep assertion to detect unsafe accesses of vmcs12 structures.
- Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible).
- Minor SGX fix and a cleanup.
show more ...
|
#
55f50b2f |
| 12-Sep-2024 |
Paolo Bonzini <pbonzini@redhat.com> |
Merge branch 'kvm-memslot-zap-quirk' into HEAD
Today whenever a memslot is moved or deleted, KVM invalidates the entire page tables and generates fresh ones based on the new memslot layout.
This be
Merge branch 'kvm-memslot-zap-quirk' into HEAD
Today whenever a memslot is moved or deleted, KVM invalidates the entire page tables and generates fresh ones based on the new memslot layout.
This behavior traditionally was kept because of a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs. It generally does not have a huge overhead, because the old MMU is able to reuse cached page tables and the new one is more scalabale and can resolve EPT violations/nested page faults in parallel, but it has worse performance if the guest frequently deletes and adds small memslots, and it's entirely not viable for TDX. This is because TDX requires re-accepting of private pages after page dropping.
For non-TDX VMs, this series therefore introduces the KVM_X86_QUIRK_SLOT_ZAP_ALL quirk, enabling users to control the behavior of memslot zapping when a memslot is moved/deleted. The quirk is turned on by default, leading to the zapping of all SPTEs when a memslot is moved/deleted; users however have the option to turn off the quirk, which limits the zapping only to those SPTEs hat lie within the range of memslot being moved/deleted.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
#
653ea448 |
| 23-Jul-2024 |
Sean Christopherson <seanjc@google.com> |
KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if L1 attempts to load/store an MSR via the VMCS MSR l
KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has disallowed access to via an MSR filter. Intel already disallows including a handful of "special" MSRs in the VMCS lists, so denying access isn't completely without precedent.
More importantly, the behavior is well-defined _and_ can be communicated the end user, e.g. to the customer that owns a VM running as L1 on top of KVM. On the other hand, ignoring userspace MSR filters is all but guaranteed to result in unexpected behavior as the access will hit KVM's internal state, which is likely not up-to-date.
Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS fields, the MSRs in the VMCS load/store lists are 100% guest controlled, thus making it all but impossible to reason about the correctness of ignoring the MSR filter. And if userspace *really* wants to deny access to MSRs via the aforementioned scenarios, userspace can hide the associated feature from the guest, e.g. by disabling the PMU to prevent accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM is blindly processing MSRs; the MSR filters are the _only_ way for userspace to deny access.
This partially reverts commit ac8d6cad3c7b ("KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr").
Cc: Hou Wenlong <houwenlong.hwl@antgroup.com> Cc: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
show more ...
|
Revision tags: v6.10, v6.10-rc7 |
|
#
aa8d1f48 |
| 03-Jul-2024 |
Yan Zhao <yan.y.zhao@intel.com> |
KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select KVM's behavior when a memslot is moved or deleted for KVM_X86_
KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM VMs. Make sure KVM behave as if the quirk is always disabled for non-KVM_X86_DEFAULT_VM VMs.
The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options: - when enabled: Invalidate/zap all SPTEs ("zap-all"), - when disabled: Precisely zap only the leaf SPTEs within the range of the moving/deleting memory slot ("zap-slot-leafs-only").
"zap-all" is today's KVM behavior to work around a bug [1] where the changing the zapping behavior of memslot move/deletion would cause VM instability for VMs with an Nvidia GPU assigned; while "zap-slot-leafs-only" allows for more precise zapping of SPTEs within the memory slot range, improving performance in certain scenarios [2], and meeting the functional requirements for TDX.
Previous attempts to select "zap-slot-leafs-only" include a per-VM capability approach [3] (which was not preferred because the root cause of the bug remained unidentified) and a per-memslot flag approach [4]. Sean and Paolo finally recommended the implementation of this quirk and explained that it's the least bad option [5].
By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use "zap-all". Users have the option to disable the quirk to select "zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are unaffected by this bug.
For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is always selected without user's opt-in, regardless of if the user opts for "zap-all". This is because it is assumed until proven otherwise that non- KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most importantly, it's because TDX must have "zap-slot-leafs-only" always selected. In TDX's case a memslot's GPA range can be a mixture of "private" or "shared" memory. Shared is roughly analogous to how EPT is handled for normal VMs, but private GPAs need lots of special treatment: 1) "zap-all" would require to zap private root page or non-leaf entries or at least leaf-entries beyond the deleting memslot scope. However, TDX demands that the root page of the private page table remains unchanged, with leaf entries being zapped before non-leaf entries, and any dropped private guest pages must be re-accepted by the guest. 2) if "zap-all" zaps only shared page tables, it would result in private pages still being mapped when the memslot is gone. This may affect even other processes if later the gmem fd was whole punched, causing the pages being freed on the host while still mapped in the TD, because there's no pgoff to the gfn information to zap the private page table after memslot is gone.
So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or complicating quirk disabling interface (current quirk disabling interface is limited, no way to query quirks, or force them to be disabled).
Add a new function kvm_mmu_zap_memslot_leafs() to implement "zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(), bypassing special handling to APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as 1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can it be moved. It is only deleted by KVM when APICv is permanently inhibited. 2) kvm_vcpu_reload_apic_access_page() effectively does nothing when APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted. 3) Avoid making all cpus request of KVM_REQ_APIC_PAGE_RELOAD can save on costly IPIs.
Suggested-by: Kai Huang <kai.huang@intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1] Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2] Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3] Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4] Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5] Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6] Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20240703021043.13881-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
show more ...
|
#
87ee9981 |
| 19-Aug-2024 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Merge 6.11-rc4 into driver-core-next
We need the driver core build fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
0c80bdfc |
| 12-Aug-2024 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Merge 6.11-rc3 into driver-core-next
We need the driver core fixes in here as well to build on top of.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
ebbe30f4 |
| 19-Aug-2024 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Merge 6.11-rc4 into tty-next
We need the tty/serial fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
ca7df2c7 |
| 19-Aug-2024 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Merge 6.11-rc4 into usb-next
We need the usb / thunderbolt fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
10c8d1bd |
| 19-Aug-2024 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Merge 6.11-rc4 into char-misc-next
We need the char/misc fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
38343be0 |
| 12-Aug-2024 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Merge 6.11-rc3 into usb-next
We need the usb fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
9f3eb413 |
| 12-Aug-2024 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Merge 6.11-rc3 into tty-next
We need the tty/serial fixes in here to build on top of.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
9ca12e50 |
| 12-Aug-2024 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Merge 6.11-rc3 into char-misc-next
We need the char/misc fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
42b16d3a |
| 17-Sep-2024 |
Jens Axboe <axboe@kernel.dk> |
Merge tag 'v6.11' into for-6.12/block
Merge in 6.11 final to get the fix for preventing deadlocks on an elevator switch, as there's a fixup for that patch.
* tag 'v6.11': (1788 commits) Linux 6.1
Merge tag 'v6.11' into for-6.12/block
Merge in 6.11 final to get the fix for preventing deadlocks on an elevator switch, as there's a fixup for that patch.
* tag 'v6.11': (1788 commits) Linux 6.11 Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop" pinctrl: pinctrl-cy8c95x0: Fix regcache cifs: Fix signature miscalculation mm: avoid leaving partial pfn mappings around in error case drm/xe/client: add missing bo locking in show_meminfo() drm/xe/client: fix deadlock in show_meminfo() drm/xe/oa: Enable Xe2+ PES disaggregation drm/xe/display: fix compat IS_DISPLAY_STEP() range end drm/xe: Fix access_ok check in user_fence_create drm/xe: Fix possible UAF in guc_exec_queue_process_msg drm/xe: Remove fence check from send_tlb_invalidation drm/xe/gt: Remove double include net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init() PCI: Fix potential deadlock in pcim_intx() workqueue: Clear worker->pool in the worker thread context net: tighten bad gso csum offset check in virtio_net_hdr netlink: specs: mptcp: fix port endianness net: dpaa: Pad packets to ETH_ZLEN mptcp: pm: Fix uaf in __timer_delete_sync ...
show more ...
|
#
36ec807b |
| 20-Sep-2024 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'next' into for-linus
Prepare input updates for 6.12 merge window.
|
#
9ea7b92b |
| 12-Sep-2024 |
Palmer Dabbelt <palmer@rivosinc.com> |
Merge patch series "remove size limit on XIP kernel"
Nam Cao <namcao@linutronix.de> says:
Hi,
For XIP kernel, the writable data section is always at offset specified in XIP_OFFSET, which is hard-c
Merge patch series "remove size limit on XIP kernel"
Nam Cao <namcao@linutronix.de> says:
Hi,
For XIP kernel, the writable data section is always at offset specified in XIP_OFFSET, which is hard-coded to 32MB.
Unfortunately, this means the read-only section (placed before the writable section) is restricted in size. This causes build failure if the kernel gets too large.
This series remove the use of XIP_OFFSET one by one, then remove this macro entirely at the end, with the goal of lifting this size restriction.
Also some cleanup and documentation along the way.
* b4-shazam-merge riscv: remove limit on the size of read-only section for XIP kernel riscv: drop the use of XIP_OFFSET in create_kernel_page_table() riscv: drop the use of XIP_OFFSET in kernel_mapping_va_to_pa() riscv: drop the use of XIP_OFFSET in XIP_FIXUP_FLASH_OFFSET riscv: drop the use of XIP_OFFSET in XIP_FIXUP_OFFSET riscv: replace misleading va_kernel_pa_offset on XIP kernel riscv: don't export va_kernel_pa_offset in vmcoreinfo for XIP kernel riscv: cleanup XIP_FIXUP macro riscv: change XIP's kernel_map.size to be size of the entire kernel ...
Link: https://lore.kernel.org/r/cover.1717789719.git.namcao@linutronix.de Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
show more ...
|
#
f057b572 |
| 06-Sep-2024 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'ib/6.11-rc6-matrix-keypad-spitz' into next
Bring in changes removing support for platform data from matrix-keypad driver.
|
#
34cd1928 |
| 27-Aug-2024 |
Jason Gunthorpe <jgg@nvidia.com> |
Merge branch 'bnxt_re_variable_wqes' into rdma.git for-next
Selvin Xavier says:
============= Enable the Variable size Work Queue entry support for Gen P7 adapters. This would help in the better ut
Merge branch 'bnxt_re_variable_wqes' into rdma.git for-next
Selvin Xavier says:
============= Enable the Variable size Work Queue entry support for Gen P7 adapters. This would help in the better utilization of the queue memory and pci bandwidth due to the smaller send queue Work entries. =============
Based on v6.11-rc5 for dependencies.
* bnxt_re_variable_wqes: (829 commits) RDMA/bnxt_re: Enable variable size WQEs for user space applications RDMA/bnxt_re: Handle variable WQE support for user applications RDMA/bnxt_re: Fix the table size for PSN/MSN entries RDMA/bnxt_re: Get the WQE index from slot index while completing the WQEs RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters Linux 6.11-rc5 ...
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
show more ...
|
#
76889bba |
| 27-Aug-2024 |
Jason Gunthorpe <jgg@nvidia.com> |
Merge branch 'nesting_reserved_regions' into iommufd.git for-next
Nicolin Chen says:
========= IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it is eventually used by a dev
Merge branch 'nesting_reserved_regions' into iommufd.git for-next
Nicolin Chen says:
========= IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it is eventually used by a device for address translation to an MSI location (including nested cases), practically it is a universal region across all domains allocated for the IOMMU that defines it.
Currently IOMMUFD core fetches and reserves the region during an attach to an hwpt_paging. It works with a hwpt_paging-only case, but might not work with a nested case where a device could directly attach to a hwpt_nested, bypassing the hwpt_paging attachment.
Move the enforcement forward, to the hwpt_paging allocation function. Then clean up all the SW_MSI related things in the attach/replace routine. =========
Based on v6.11-rc5 for dependencies.
* nesting_reserved_regions: (562 commits) iommufd/device: Enforce reserved IOVA also when attached to hwpt_nested Linux 6.11-rc5 ...
show more ...
|
#
a18eb864 |
| 08-Aug-2024 |
Leon Romanovsky <leon@kernel.org> |
Introducing Multi-Path DMA Support for mlx5 RDMA Driver
From Yishai,
Overview -------- This patch series aims to enable multi-path DMA support, allowing an mlx5 RDMA device to issue DMA commands th
Introducing Multi-Path DMA Support for mlx5 RDMA Driver
From Yishai,
Overview -------- This patch series aims to enable multi-path DMA support, allowing an mlx5 RDMA device to issue DMA commands through multiple paths. This feature is critical for improving performance and reaching line rate in certain environments where issuing PCI transactions over one path may be significantly faster than over another. These differences can arise from various PCI generations in the system or the specific system topology.
To achieve this functionality, we introduced a data direct DMA device that can serve the RDMA device by issuing DMA transactions on its behalf.
The main key features and changes are described below.
Multi-Path Discovery -------------------- API Implementation: * Introduced an API to discover multiple paths for a given mlx5 RDMA device. IOCTL Command: * Added a new IOCTL command, MLX5_IB_METHOD_GET_DATA_DIRECT_SYSFS_PATH, to the DEVICE object. When an affiliated Data-Direct/DMA device is present, its sysfs path is returned.
Feature Activation by mlx5 RDMA Application ------------------------------------------- UVERBS Extension: * Extended UVERBS_METHOD_REG_DMABUF_MR over UVERBS_OBJECT_MR to include mlx5 extended flags. Access Flag: * Introduced the MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT flag, allowing applications to request the use of the affiliated DMA device for DMABUF registration.
Data-Direct/DMA Device ---------------------- New Driver: * Introduced a new driver to manage the new DMA PF device ID (0x2100). Its registration/un-registration is handled as part of the mlx5_ib init/exit flows, with mlx5 IB devices as its clients. Functionality: * The driver does not interface directly with the firmware (no command interface, no caps, etc.) but works over PCI to activate its DMA functionality. It serves as the DMA device for efficiently accessing other PCI devices (e.g., GPU PF) and reads its VUID over PCI to handle NICs registrations with the same VUID.
mlx5 IB RDMA Device --------------------------- VUID Query: * Reads its affiliated DMA PF VUID via the QUERY_VUID command with the data_direct bit set. Driver Registration: * Registers with the DMA PF driver to be notified upon bind/unbind. Application Request Handling: * Uses the DMA PF device upon application request as described above.
DMABUF over Umem ---------------- Introduced an option to obtain a DMABUF UMEM using a different DMA device instead of the IB device, allowing the device to register over IOMMU with the expected DMA device for a given buffer registration.
Further details are provided in the commit logs of the patches in this series.
Thanks
Link: https://lore.kernel.org/all/cover.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
show more ...
|
#
3daee2e4 |
| 16-Jul-2024 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge tag 'v6.10' into next
Sync up with mainline to bring in device_for_each_child_node_scoped() and other newer APIs.
|
#
c24999e6 |
| 21-Sep-2024 |
Wolfram Sang <wsa+renesas@sang-engineering.com> |
Merge tag 'i2c-host-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-mergewindow
The DesignWare and the Renesas I2C drivers have received most of the changes in t
Merge tag 'i2c-host-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-mergewindow
The DesignWare and the Renesas I2C drivers have received most of the changes in this pull request.
The first has has undergone through a series of cleanups that have been sent to the mailing list a year ago for the first time and finally get merged in this pull request. They are many, from typos (e.g. i2/i2c), to cosmetics, to refactoring (e.g. move inline functions to librarieas) and many others.
Besides that, all the DesignWare Kconfig options have been grouped under the I2C_DESIGNWARE_CORE and this required some adaptation in many of the kernel configuration files for different arm and mips boards.
Follows the list of the rest of the changes grouped by type of change.
Cleanups -------- The Qualcomm Geni platform improves the exit path in the runtime resume function.
The Intel LJCA driver loses "target_addr" parameter in ljca_i2c_stop() because it was unused.
The MediaTek controller intializes the restart_flag in the transfer function using the ternary conditional operator ("? :") instead of initializing it in different parts.
Constified a few global data structures in the virtio driver.
The Renesas driver simplifies the bus speed handling in the init function making it more readable.
Improved an if/else statement in probe function of the Renesas R-Car driver.
The iMX/MXC driver switches to using the RUNTIME_PM_OPS() instead of SET_RUNTIME_PM_OPS().
Still in the iMX/MXC driver a comma ',' has been replaced by a semicolon ';', while in different drivers the ',' has been removed from the '{ }' delimiters.
Finally three devm_clk_get_enabled() have been used to simplify the devm_clk_get/clk_prepare_enable tuple in the Renesas EMEV2, Ingenic and MPC drivers.
Refactors --------- The Nuvoton fixes a potential out of boundary array access. This is not a bug fix because the issue could never occur due to hardware not having the properties listed in the array. The change makes the driver more future proof and, at the same time, silences code analyzers.
Improvements ------------ The Renesas I2C (riic) driver undergoes several patches improving the runtime power management handling.
The Intel i801 driver uses a more descriptive adapter's name to show the presence of the IDF feature.
In the Intel Denverton (ismt) adapter the pending transactions are killed when irq's can't complete their handling, triggering a timeout. This could have been considered as a bug fix, but because, standing to Vasily, it's very sporadic, I preferred considering the patch rather as an improvement.
New Feature ----------- The Renesas I2C (riic) driver now supports the fast mode plus.
New support ----------- Added support for:
- Renesas R9A08G045 - Rockchip RK3576 - KEBA I2C - Theobroma Systems Mule Multiplexer.
The Keba comes with a new driver, i2c-keba.c. The Mule is an i2c multiplexer and it also comes with a new driver, mux/i2c-mux-mule.c.
Core patch ---------- This pull request includes also a patch in the I2C framework, in i2c-core-base.c where the runtime PM functions have been replaced in order to allow to be accessed during the device add.
Devicetree ---------- Some cleanups in the devicetree, as well. nVidia and Qualcomm bindings improve their "if:then:" blocks. While the aspeed binding loses the "multi-master" property because it was redundant.
The i2c-sprd binding has been converted to YAML.
show more ...
|
#
2c25dcc2 |
| 05-Aug-2024 |
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> |
Merge tag 'v6.11-rc2' into media_stage
Linux 6.11-rc2
* tag 'v6.11-rc2': (283 commits) Linux 6.11-rc2 profiling: remove profile=sleep support arm: dts: arm: versatile-ab: Fix duplicate clock
Merge tag 'v6.11-rc2' into media_stage
Linux 6.11-rc2
* tag 'v6.11-rc2': (283 commits) Linux 6.11-rc2 profiling: remove profile=sleep support arm: dts: arm: versatile-ab: Fix duplicate clock node name runtime constants: deal with old decrepit linkers clocksource: Fix brown-bag boolean thinko in cs_watchdog_read() cifs: update internal version number smb: client: fix FSCTL_GET_REPARSE_POINT against NetApp smb3: add dynamic tracepoints for shutdown ioctl cifs: Remove cifs_aio_ctx smb: client: handle lack of FSCTL_GET_REPARSE_POINT support arm64: jump_label: Ensure patched jump_labels are visible to all CPUs syscalls: fix syscall macros for newfstat/newfstatat uretprobe: change syscall number, again thermal: core: Update thermal zone registration documentation Revert "nouveau: rip out busy fence waits" protect the fetch of ->fd[fd] in do_dup2() from mispredictions x86/uaccess: Zero the 8-byte get_range case on failure on 32-bit riscv: Fix linear mapping checks for non-contiguous memory regions KVM: x86/mmu: fix determination of max NPT mapping level for private pages PCI: pciehp: Retain Power Indicator bits for userspace indicators ...
show more ...
|
#
66e72a01 |
| 29-Jul-2024 |
Jerome Brunet <jbrunet@baylibre.com> |
Merge tag 'v6.11-rc1' into clk-meson-next
Linux 6.11-rc1
|
#
3bce87eb |
| 17-Aug-2024 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
Merge remote-tracking branch 'torvalds/master' into perf-tools-next
To pick up the latest perf-tools merge for 6.11, i.e. to have the current perf tools branch that is getting into 6.11 with the per
Merge remote-tracking branch 'torvalds/master' into perf-tools-next
To pick up the latest perf-tools merge for 6.11, i.e. to have the current perf tools branch that is getting into 6.11 with the perf-tools-next that is geared towards 6.12.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
show more ...
|