cc51a668 | 08-Jul-2025 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Fix release of uninitialized resources on error path
The commit in the fixes tag made sure that mlx5_vdpa_free() is the single entrypoint for removing the vdpa device resources added in mlx5_vdpa_dev_add(), even in the cleanup path of mlx5_vdpa_dev_add().
This means that all functions from mlx5_vdpa_free() should be able to handle uninitialized resources. This was not the case though: mlx5_vdpa_destroy_mr_resources() and mlx5_cmd_cleanup_async_ctx() were not able to do so. This caused the splat below when adding a vdpa device without a MAC address.
This patch fixes these remaining issues:
- Makes mlx5_vdpa_destroy_mr_resources() return early if called on uninitialized resources.
- Moves mlx5_cmd_init_async_ctx() early on during device addition because it can't fail. This means that mlx5_cmd_cleanup_async_ctx() also can't fail. To mirror this, move the call site of mlx5_cmd_cleanup_async_ctx() in mlx5_vdpa_free().
An additional comment was added in mlx5_vdpa_free() to document the expectations of functions called from this context.
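The "safe on uninitialized resources" requirement can be illustrated with a small userspace sketch. Everything here is hypothetical (the struct, the `initialized` flag, and the function names are not the driver's); the real driver may key its early return off a different field. It only shows the pattern of a destroy function that returns early when init never ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's MR resource state. */
struct mr_resources {
	bool initialized;	/* set only after init fully succeeded */
	int *table;		/* placeholder for the real resources */
};

/* Returns 0 on success, -1 on allocation failure. */
static int mr_resources_init(struct mr_resources *res)
{
	assert(res != NULL);
	res->table = calloc(16, sizeof(*res->table));
	if (!res->table)
		return -1;
	res->initialized = true;
	return 0;
}

/* Safe to call on a zeroed struct: returns early if init never ran. */
static void mr_resources_destroy(struct mr_resources *res)
{
	assert(res != NULL);
	if (!res->initialized)
		return;
	free(res->table);
	res->table = NULL;
	res->initialized = false;
}
```

With this shape, a single free function can be used both for normal teardown and for the error path of device addition, regardless of how far initialization got.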
Splat:
mlx5_core 0000:b5:03.2: mlx5_vdpa_dev_add:3950:(pid 2306) warning: No mac address provisioned?
------------[ cut here ]------------
WARNING: CPU: 13 PID: 2306 at kernel/workqueue.c:4207 __flush_work+0x9a/0xb0
[...]
Call Trace:
 <TASK>
 ? __try_to_del_timer_sync+0x61/0x90
 ? __timer_delete_sync+0x2b/0x40
 mlx5_vdpa_destroy_mr_resources+0x1c/0x40 [mlx5_vdpa]
 mlx5_vdpa_free+0x45/0x160 [mlx5_vdpa]
 vdpa_release_dev+0x1e/0x50 [vdpa]
 device_release+0x31/0x90
 kobject_cleanup+0x37/0x130
 mlx5_vdpa_dev_add+0x327/0x890 [mlx5_vdpa]
 vdpa_nl_cmd_dev_add_set_doit+0x2c1/0x4d0 [vdpa]
 genl_family_rcv_msg_doit+0xd8/0x130
 genl_family_rcv_msg+0x14b/0x220
 ? __pfx_vdpa_nl_cmd_dev_add_set_doit+0x10/0x10 [vdpa]
 genl_rcv_msg+0x47/0xa0
 ? __pfx_genl_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x53/0x100
 genl_rcv+0x24/0x40
 netlink_unicast+0x27b/0x3b0
 netlink_sendmsg+0x1f7/0x430
 __sys_sendto+0x1fa/0x210
 ? ___pte_offset_map+0x17/0x160
 ? next_uptodate_folio+0x85/0x2b0
 ? percpu_counter_add_batch+0x51/0x90
 ? filemap_map_pages+0x515/0x660
 __x64_sys_sendto+0x20/0x30
 do_syscall_64+0x7b/0x2c0
 ? do_read_fault+0x108/0x220
 ? do_pte_missing+0x14a/0x3e0
 ? __handle_mm_fault+0x321/0x730
 ? count_memcg_events+0x13f/0x180
 ? handle_mm_fault+0x1fb/0x2d0
 ? do_user_addr_fault+0x20c/0x700
 ? syscall_exit_work+0x104/0x140
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f0c25b0feca
[...]
---[ end trace 0000000000000000 ]---
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Fixes: 83e445e64f48 ("vdpa/mlx5: Fix error path during device add")
Reported-by: Wenli Quan <wquan@redhat.com>
Closes: https://lore.kernel.org/virtualization/CADZSLS0r78HhZAStBaN1evCSoPqRJU95Lt8AqZNJ6+wwYQ6vPQ@mail.gmail.com/
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20250708120424.2363354-2-dtatulea@nvidia.com>
Tested-by: Wenli Quan <wquan@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
62111654 | 30-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Postpone MR deletion
Currently, when a new MR is set up, the old MR is deleted. MR deletion takes about 30-40% of the time of MR creation. As deleting the old MR is not important for the process of setting up the new MR, this operation can be postponed.
This series adds a workqueue that does MR garbage collection at a later point. If the MR lock is taken, the handler will back off and reschedule. The exception is during shutdown, when the handler must not postpone the work.
Note that this is only a speculative optimization: if some mapping operation is triggered while the garbage collector handler holds the lock, that operation will have to wait for the handler to finish.
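The back-off behaviour described above can be sketched in userspace C. This is not the driver's code: the struct, the function names, and the use of a C11 `atomic_flag` as a try-lock in place of the MR mutex and workqueue are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define GC_MAX 8

/* Hypothetical garbage-collection state for old MRs. */
struct mr_gc {
	atomic_flag lock;	/* stands in for the MR lock */
	int pending[GC_MAX];	/* ids of old MRs awaiting deletion */
	size_t npending;
	int deleted;		/* number of MRs actually deleted */
	bool shutting_down;
};

static void mr_gc_init(struct mr_gc *gc)
{
	assert(gc != NULL);
	atomic_flag_clear(&gc->lock);
	gc->npending = 0;
	gc->deleted = 0;
	gc->shutting_down = false;
}

/* Queue an old MR for later deletion instead of deleting it inline. */
static bool mr_gc_defer(struct mr_gc *gc, int mr_id)
{
	if (gc->npending == GC_MAX)
		return false;
	gc->pending[gc->npending++] = mr_id;
	return true;
}

/*
 * One run of the GC handler. Returns true if the pending MRs were
 * deleted, false if the lock was busy and the caller should reschedule.
 */
static bool mr_gc_run(struct mr_gc *gc)
{
	if (atomic_flag_test_and_set(&gc->lock)) {
		if (!gc->shutting_down)
			return false;	/* back off, try again later */
		/* Shutdown: the work must not be postponed, so wait. */
		while (atomic_flag_test_and_set(&gc->lock))
			;
	}
	gc->deleted += (int)gc->npending;
	gc->npending = 0;
	atomic_flag_clear(&gc->lock);
	return true;
}
```

A caller that gets `false` back would re-queue the handler; in the sketch that is left to the caller rather than modelled with a real work queue.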
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240830105838.2666587-9-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
f30a1232 | 30-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Introduce init/destroy for MR resources
There's currently not a lot of action happening during the init/destroy of MR resources. But more will be added in the upcoming patches.
As the mr mutex lock init/destroy has been moved to these new functions, the lifetime has now shifted away from mlx5_vdpa_alloc_resources() / mlx5_vdpa_free_resources() into these new functions. However, the lifetime at the outer scope remains the same: mlx5_vdpa_dev_add() / mlx5_vdpa_dev_free().
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240830105838.2666587-8-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
58d4d50e | 30-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Rename mr_mtx -> lock
Now that the mr resources have their own namespace in the struct, give the lock a clearer name.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20240830105838.2666587-7-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
9dba4195 | 16-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command
change_num_qps() is still suspending/resuming VQs one by one. This change switches to parallel suspend/resume.
When increasing the number of queues the flow has changed a bit for simplicity: the setup_vq() function will always be called before resume_vqs(). If the VQ is initialized, setup_vq() will exit early. If the VQ is not initialized, setup_vq() will create it and resume_vqs() will resume it.
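The "always call setup_vq(), then resume_vqs()" flow can be sketched as follows. The names `setup_vq()` and `resume_vqs()` come from the commit message, but the struct layout, the simplified signatures, and the `grow_vqs()` helper are assumptions for illustration only; the real functions talk to firmware and can fail.

```c
#include <assert.h>
#include <stdbool.h>

enum vq_state { VQ_SUSPENDED, VQ_READY };

/* Hypothetical, heavily simplified virtqueue bookkeeping. */
struct vq {
	bool initialized;
	enum vq_state state;
	int setup_calls;	/* counts real creations, for illustration */
};

/* Idempotent: exits early if the VQ already exists. */
static void setup_vq(struct vq *vq)
{
	if (vq->initialized)
		return;
	vq->initialized = true;
	vq->state = VQ_SUSPENDED;
	vq->setup_calls++;
}

/* Resume a contiguous range of VQs. */
static void resume_vqs(struct vq *vqs, int start, int count)
{
	for (int i = start; i < start + count; i++)
		vqs[i].state = VQ_READY;
}

/* Grow the number of active VQs from cur_vqs to new_vqs. */
static void grow_vqs(struct vq *vqs, int cur_vqs, int new_vqs)
{
	assert(new_vqs >= cur_vqs);
	for (int i = cur_vqs; i < new_vqs; i++)
		setup_vq(&vqs[i]);
	resume_vqs(vqs, cur_vqs, new_vqs - cur_vqs);
}
```

Because `setup_vq()` is idempotent, the caller does not need to branch on whether a VQ was pre-created; both cases end in the same resume step.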
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Message-Id: <20240816090159.1967650-11-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
|
74c89072 | 16-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Small improvement for change_num_qps()
change_num_qps() has a lot of multiplications by 2 to convert the number of VQ pairs to number of VQs. This patch simplifies the code by doing the VQP -> VQ count conversion at the beginning in a variable.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Message-Id: <20240816090159.1967650-10-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
|
55a7cb05 | 16-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Keep notifiers during suspend but ignore
Unregistering notifiers is a costly operation. Instead of removing the notifiers during device suspend and adding them back at resume, simply ignore the call when the device is suspended.
At resume time call queue_link_work() to make sure that the device state is propagated in case there were changes.
For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device suspend time is reduced from ~13 ms to ~2.5 ms.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20240816090159.1967650-9-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
|
5eb8c7eb | 16-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Parallelize device resume
Currently, device resume works on vqs serially. Building on previous changes that converted vq operations to the async API, this patch parallelizes the device resume.
For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device resume time is reduced from ~16 ms to ~4.5 ms.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20240816090159.1967650-8-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
|
dcf3eac0 | 16-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Parallelize device suspend
Currently, device suspend works on vqs serially. Building on previous changes that converted vq operations to the async API, this patch parallelizes the device suspend:
1) Suspend all active vqs in parallel.
2) Query suspended vqs in parallel.
For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device suspend time is reduced from ~37 ms to ~13 ms.
A later patch will remove the link unregister operation which will make it even faster.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20240816090159.1967650-7-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
|
61674c15 | 16-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Use async API for vq modify commands
Switch firmware vq modify command to be issued via the async API to allow future parallelization. The new refactored function applies the modify on a range of vqs and waits for their execution to complete.
For now the command is still used in a serial fashion. A later patch will switch to modifying multiple vqs in parallel.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Message-Id: <20240816090159.1967650-6-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
|
1fcdf43e | 16-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Use async API for vq query command
Switch firmware vq query command to be issued via the async API to allow future parallelization.
For now the command is still serial but the infrastructure is there to issue commands in parallel, including ratelimiting the number of issued async commands to firmware.
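The idea of rate-limiting in-flight async commands can be sketched with a simple window counter. This is a single-threaded simulation, not the mlx5 async command API: the struct, `complete_one()`, and `issue_all()` are hypothetical names, and "waiting for a completion" is modelled by completing one command synchronously.

```c
#include <assert.h>

/* Hypothetical window that caps outstanding firmware commands. */
struct async_window {
	int max_inflight;	/* cap on outstanding commands */
	int inflight;		/* currently outstanding */
	int peak;		/* highest inflight value observed */
	int completed;		/* total completions */
};

/* Pretend the firmware completed one outstanding command. */
static void complete_one(struct async_window *w)
{
	assert(w->inflight > 0);
	w->inflight--;
	w->completed++;
}

/* Issue ncmds commands without ever exceeding max_inflight. */
static void issue_all(struct async_window *w, int ncmds)
{
	assert(w->max_inflight > 0);
	for (int i = 0; i < ncmds; i++) {
		/* Window full: wait for a completion before issuing more. */
		if (w->inflight == w->max_inflight)
			complete_one(w);
		w->inflight++;
		if (w->inflight > w->peak)
			w->peak = w->inflight;
	}
	/* Drain whatever is still outstanding. */
	while (w->inflight > 0)
		complete_one(w);
}
```

With a window of 1 this degenerates to the serial behaviour described above; a larger window allows more commands to be outstanding at once.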
A later patch will switch to issuing more commands at a time.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Message-Id: <20240816090159.1967650-5-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
|
de2cd39f | 16-Aug-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Introduce error logging function
mlx5_vdpa_err() was missing. This patch adds it and uses it in the necessary places.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20240816090159.1967650-3-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
|
8e0751af | 26-Jun-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready()
VQ indices in the range [cur_num_qps, max_vqs) represent queues that have not yet been activated. .set_vq_ready should not activate these VQs.
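A minimal sketch of the index check, under assumptions: the struct, the `num_active_vqs` field, and `set_vq_ready_sketch()` are hypothetical, and whether the real driver still records the requested ready flag for inactive VQs is an assumption of this sketch, not something the message states.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_VQS 64

/* Hypothetical device state; not the driver's struct. */
struct vdev_sketch {
	int num_active_vqs;		/* VQs currently in use */
	int max_vqs;			/* VQs the device was created with */
	bool ready[SKETCH_MAX_VQS];	/* what the guest asked for */
	bool hw_running[SKETCH_MAX_VQS];/* what the hardware is doing */
};

/* Returns false for an out-of-range index, true otherwise. */
static bool set_vq_ready_sketch(struct vdev_sketch *d, int idx, bool ready)
{
	assert(d->max_vqs <= SKETCH_MAX_VQS);
	if (idx < 0 || idx >= d->max_vqs)
		return false;

	d->ready[idx] = ready;
	if (!ready)
		d->hw_running[idx] = false;
	else if (idx < d->num_active_vqs)
		d->hw_running[idx] = true;
	/* idx in [num_active_vqs, max_vqs): leave the hardware VQ alone. */
	return true;
}
```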
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-24-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
2638134f | 26-Jun-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Don't reset VQs more than necessary
The vdpa device can be reset many times in sequence without any significant state changes in between. Previously this was not a problem: VQs were torn down only on first reset. But after VQ pre-creation was introduced, each reset will delete and re-create the hardware VQs and their associated resources.
To solve this problem, avoid resetting hardware VQs if the VQs are still in a blank state.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-23-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
0fe963d6 | 26-Jun-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Re-create HW VQs under certain conditions
There are a few conditions under which the hardware VQs need a full teardown and setup:
- VQ size changed to something other than the default value. Hardware VQ size modification is not supported.
- User turns off certain device features: mergeable buffers, checksum, virtio 1.0 compliance. In these cases, the TIR and RQT need to be re-created.
Add a needs_teardown configuration variable and set it when detecting the above scenarios. On next DRIVER_OK, the resources will be torn down first.
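The deferred-teardown idea can be sketched like this. Only the `needs_teardown` name comes from the commit message; the struct, the default VQ size of 256, the feature bit values, and the function names are placeholders chosen for the example and do not reflect the driver or the virtio spec.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_DEFAULT_VQ_SIZE 256u	/* placeholder default */
#define SKETCH_F_MRG_RXBUF	(1u << 0)	/* placeholder bit values */
#define SKETCH_F_CSUM		(1u << 1)
#define SKETCH_F_VERSION_1	(1u << 2)
#define SKETCH_TEARDOWN_MASK \
	(SKETCH_F_MRG_RXBUF | SKETCH_F_CSUM | SKETCH_F_VERSION_1)

/* Hypothetical device configuration state. */
struct dev_cfg {
	unsigned int vq_size;
	unsigned int features;
	bool needs_teardown;
	int teardowns;		/* counts full teardowns, for illustration */
};

static void cfg_set_vq_size(struct dev_cfg *cfg, unsigned int size)
{
	assert(size > 0);
	cfg->vq_size = size;
	if (size != SKETCH_DEFAULT_VQ_SIZE)
		cfg->needs_teardown = true;
}

static void cfg_set_features(struct dev_cfg *cfg, unsigned int features)
{
	cfg->features = features;
	/* Any of the tracked features being off forces a re-create. */
	if ((features & SKETCH_TEARDOWN_MASK) != SKETCH_TEARDOWN_MASK)
		cfg->needs_teardown = true;
}

/* On DRIVER_OK: tear down first if flagged, then clear the flag. */
static void cfg_driver_ok(struct dev_cfg *cfg)
{
	if (cfg->needs_teardown) {
		cfg->teardowns++;
		cfg->needs_teardown = false;
	}
	/* Resource setup and VQ configuration would follow here. */
}
```

Setting a flag and acting on it at DRIVER_OK keeps the expensive teardown off the configuration calls themselves, which may be issued several times before the device is started.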
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-22-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
ffb1aae4 | 26-Jun-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Pre-create hardware VQs at vdpa .dev_add time
Currently, hardware VQs are created right when the vdpa device gets into DRIVER_OK state. That is easier because most of the VQ state is known by then.
This patch switches to creating all VQs and their associated resources at device creation time. The motivation is to reduce the vdpa device live migration downtime by moving the expensive operation of creating all the hardware VQs and their associated resources out of downtime on the destination VM.
The VQs are now created in a blank state. The VQ configuration will happen later, on DRIVER_OK. Then the configuration will be applied when the VQs are moved to the Ready state.
When .set_vq_ready() is called on a VQ before DRIVER_OK, special care is needed: now that the VQ is already created a resume_vq() will be triggered too early when no mr has been configured yet. Skip calling resume_vq() in this case, let it be handled during DRIVER_OK.
For virtio-vdpa, the device configuration is done earlier during .vdpa_dev_add() by vdpa_register_device(). Avoid calling setup_vq_resources() a second time in that case.
On a 64 CPU, 256 GB VM with 1 vDPA device of 16 VQPs, the full VQ resource creation + resume time was ~370 ms. Now it's down to ~60 ms (only VQ config and resume). The measurements were done on a ConnectX6DX based vDPA device.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-21-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
3b3adb3b | 26-Jun-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Use suspend/resume during VQP change
Resume a VQ if it is already created when the number of VQ pairs increases. This is done in preparation for VQ pre-creation which is coming in a later patch. It is necessary because calling setup_vq() on an already created VQ will return early and will not enable the queue.
For symmetry, suspend a VQ instead of tearing it down when the number of VQ pairs decreases, but only if the resume operation is supported.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-20-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
ac85cd90 | 26-Jun-2024 |
Dragos Tatulea <dtatulea@nvidia.com> |
vdpa/mlx5: Forward error in suspend/resume device
Start using the suspend/resume_vq() error return codes previously added.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-19-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
|