# 0e6ebfd1 | 09-Apr-2024 | Ingo Molnar <mingo@kernel.org>
Merge tag 'v6.9-rc3' into x86/cpu, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
# 9b4e5285 | 03-Apr-2024 | Ingo Molnar <mingo@kernel.org>
Merge tag 'v6.9-rc2' into perf/core, to pick up dependent commits
Pick up fixes that followup patches are going to depend on.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
# 6a2bcf92 | 03-Apr-2024 | Ingo Molnar <mingo@kernel.org>
Merge tag 'v6.9-rc2' into x86/percpu, to pick up fixes and resolve conflict
Conflicts: arch/x86/Kconfig
Signed-off-by: Ingo Molnar <mingo@kernel.org>
# f4566a1e | 25-Mar-2024 | Ingo Molnar <mingo@kernel.org>
Merge tag 'v6.9-rc1' into sched/core, to pick up fixes and to refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
# 9961a785 | 13-May-2024 | Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Greatly improve send zerocopy performance, by enabling coalescing of sent buffers.
MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well.
This feature relies on a shared branch with net-next, which was pulled into both branches. (A minimal zero-copy send sketch follows after this list.)
- Unification of how async preparation is done across opcodes.
Previously, opcodes that required extra memory for async retry would allocate it only when needed, using on-stack state until then. If async retry was needed, the on-stack state was adjusted appropriately for a retry and then copied to the allocated memory.
This led to some fragile and ugly code, particularly for read/write handling, and made storage retries more difficult than they needed to be. Allocate the memory upfront, as it's cheap from our pools, and use that state consistently both initially and also from the retry side.
- Move away from using remap_pfn_range() for mapping the rings.
This is really not the right interface to use and can cause lifetime issues or leaks. Additionally, it means the ring sq/cq arrays need to be physically contiguous, which can cause problems in production with larger rings when services are restarted, as memory can be very fragmented at that point.
Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply the same treatment to mapped ring provided buffers. This also helps unify the code we have dealing with allocating and mapping memory.
Hard to see in the diffstat as we're adding a few features as well, but this removes about 400 lines of code from the codebase.
- Add support for bundles for send/recv.
When used with provided buffers, bundles support sending or receiving more than one buffer at a time, improving efficiency by only needing to call into the networking stack once for multiple sends or receives.
- Tweaks for our accept operations, supporting both a DONTWAIT flag for skipping poll arm and retry if we can, and a POLLFIRST flag that the application can use to skip the initial accept attempt and rely purely on poll for triggering the operation. Both of these have identical flags on the receive side already.
- Make the task_work ctx locking unconditional.
We had various code paths here that would do a mix of lock/trylock and set the task_work state to whether or not it was locked. All of that goes away, we lock it unconditionally and get rid of the state flag indicating whether it's locked or not.
The state struct still exists as an empty type, can go away in the future.
- Add support for specifying NOP completion values, allowing it to be used for error handling testing.
- Use set/test bit for io-wq worker flags. Not strictly needed, but also doesn't hurt and helps silence a KCSAN warning.
- Cleanups for io-wq locking and work assignments, closing a tiny race where cancelations would not be able to find the work item reliably.
- Misc fixes, cleanups, and improvements
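The zero-copy send improvement in the first item is exposed through liburing's io_uring_prep_send_zc(). Below is a minimal, hedged sketch (assuming liburing >= 2.3 and an already-connected socket `sockfd`; the function and variable names are illustrative) showing the two CQEs a zero-copy send produces: the completion itself, and a later notification that the buffer may be reused.

    /* Minimal zero-copy send sketch; error handling trimmed. */
    #include <liburing.h>

    static int send_zc_once(struct io_uring *ring, int sockfd,
                            const void *buf, size_t len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        unsigned more;
        int res;

        io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
        io_uring_submit(ring);

        /* First CQE: the send result (bytes sent or -errno). */
        if (io_uring_wait_cqe(ring, &cqe))
            return -1;
        res = cqe->res;
        more = cqe->flags & IORING_CQE_F_MORE;
        io_uring_cqe_seen(ring, cqe);

        /* Second CQE (IORING_CQE_F_NOTIF): buf may now be reused. */
        if (more && !io_uring_wait_cqe(ring, &cqe))
            io_uring_cqe_seen(ring, cqe);

        return res;
    }
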
* tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits)
  io_uring: support to inject result for NOP
  io_uring: fail NOP if non-zero op flags is passed in
  io_uring/net: add IORING_ACCEPT_POLL_FIRST flag
  io_uring/net: add IORING_ACCEPT_DONTWAIT flag
  io_uring/filetable: don't unnecessarily clear/reset bitmap
  io_uring/io-wq: Use set_bit() and test_bit() at worker->flags
  io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring
  io_uring: Require zeroed sqe->len on provided-buffers send
  io_uring/notif: disable LAZY_WAKE for linked notifs
  io_uring/net: fix sendzc lazy wake polling
  io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it
  io_uring/rw: reinstate thread check for retries
  io_uring/notif: implement notification stacking
  io_uring/notif: simplify io_notif_flush()
  net: add callback for setting a ubuf_info to skb
  net: extend ubuf_info callback to ops structure
  io_uring/net: support bundles for recv
  io_uring/net: support bundles for send
  io_uring/kbuf: add helpers for getting/peeking multiple buffers
  io_uring/net: add provided buffer support for IORING_OP_SEND
  ...
# d3da8e98 | 08-May-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: add IORING_ACCEPT_POLL_FIRST flag
Similarly to how polling first is supported for receive, it makes sense to provide the same for accept. An accept operation does a lot of expensive setup, like allocating an fd, a socket/inode, etc. If no connection request is already pending, this is wasted and will just be cleaned up and freed, only to retry via the usual poll trigger.
Add IORING_ACCEPT_POLL_FIRST, which tells accept to only initiate the accept request if poll says we have something to accept.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
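A hedged usage sketch of the new flag (illustrative only): it assumes kernel 6.10 uapi headers and that, mirroring IORING_RECVSEND_POLL_FIRST on the receive side, the io_uring-specific accept flags are carried in sqe->ioprio.

    #include <liburing.h>

    static void queue_accept_poll_first(struct io_uring *ring, int listen_fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* addr/addrlen NULL: only the new fd is wanted here. */
        io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
        /* Skip the eager accept attempt; rely purely on poll to trigger it. */
        sqe->ioprio |= IORING_ACCEPT_POLL_FIRST;
    }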
# 7dcc758c | 07-May-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: add IORING_ACCEPT_DONTWAIT flag
This allows the caller to perform a non-blocking attempt, similarly to how recvmsg has MSG_DONTWAIT. If set, and we get -EAGAIN on a connection attempt, propagate the result to userspace rather than arm poll and wait for a retry.
Suggested-by: Norman Maurer <norman_maurer@apple.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
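And a hedged sketch of the non-blocking variant, under the same assumption that the flag travels in sqe->ioprio: with IORING_ACCEPT_DONTWAIT set, -EAGAIN shows up in the CQE rather than the kernel arming poll and retrying internally.

    #include <liburing.h>

    static int try_accept_once(struct io_uring *ring, int listen_fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int res;

        io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
        sqe->ioprio |= IORING_ACCEPT_DONTWAIT;   /* don't arm poll on -EAGAIN */
        io_uring_submit(ring);

        if (io_uring_wait_cqe(ring, &cqe))
            return -1;
        res = cqe->res;          /* new fd, or -EAGAIN if nothing was pending */
        io_uring_cqe_seen(ring, cqe);
        return res;
    }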
# 79996b45 | 01-May-2024 | Gabriel Krisman Bertazi <krisman@suse.de>
io_uring: Require zeroed sqe->len on provided-buffers send
When sending from a provided buffer, we set sr->len to be the smallest between the actual buffer size and sqe->len. But, now that we disconnect the buffer from the submission request, we can get in a situation where the buffers and requests mismatch, and only part of a buffer gets sent. Assume:
* buf[1]->len = 128; buf[2]->len = 256
* sqe[1]->len = 128; sqe[2]->len = 256
If sqe[1] runs first, it picks buf[1] and all is good. But if sqe[2] runs first, sqe[1] picks buf[2], and the last half of buf[2] is never sent.
While arguably the use-case of different-length sends is questionable, it has already raised confusion with potential users of this feature. Let's make the interface less tricky by forcing the length to only come from the buffer ring entry itself.
Fixes: ac5f71a3d9d7 ("io_uring/net: add provided buffer support for IORING_OP_SEND")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
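A hedged sketch of the rule this enforces (the buffer group id and helper name are illustrative; group BGID is assumed to be registered already, e.g. via io_uring_setup_buf_ring()): with IOSQE_BUFFER_SELECT on a send, sqe->len is left at zero and the length comes from the selected buffer ring entry.

    #include <liburing.h>

    #define BGID 7   /* illustrative, already-registered buffer group id */

    static void queue_provided_send(struct io_uring *ring, int sockfd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* buf NULL, len 0: both come from the selected ring entry. */
        io_uring_prep_send(sqe, sockfd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = BGID;
    }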
# ef42b85a | 30-Apr-2024 | Pavel Begunkov <asml.silence@gmail.com>
io_uring/net: fix sendzc lazy wake polling
SEND[MSG]_ZC produces multiple CQEs via notifications, which LAZY_WAKE doesn't handle, so disable LAZY_WAKE for sendzc polling. It should be fine, as sends are not likely to be polled in the first place.
Fixes: 6ce4a93dbb5b ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b360fb352d91e3aec751d75c87dfb4753a084ee.1714488419.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 2f9c9515 | 06-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: support bundles for recv
If IORING_OP_RECV is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs available buffers and receives into them, posting a single completion for all of it.
This can be used with multishot receive as well, or without it.
Now that both send and receive support bundles, add a feature flag for it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the ring, then the kernel supports bundles for recv and send.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
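A hedged sketch of a bundled receive (illustrative names; assumes a registered buffer ring for group BGID and that IORING_RECVSEND_BUNDLE is passed via sqe->ioprio as on the send side). Support can be probed by checking IORING_FEAT_RECVSEND_BUNDLE in io_uring_params.features after ring setup.

    #include <liburing.h>

    #define BGID 7   /* illustrative buffer group id */

    static void queue_bundle_recv(struct io_uring *ring, int sockfd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_recv(sqe, sockfd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = BGID;
        /* One completion may now cover data received into several buffers. */
        sqe->ioprio |= IORING_RECVSEND_BUNDLE;
    }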
# a05d1f62 | 05-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: support bundles for send
If IORING_OP_SEND is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer send. The idea is that an application can fill outgoing buffers in a provided buffer group, and then arm a single send that will service them all. Once there are no more buffers to send, or if the requested length has been sent, the request posts a single completion for all the buffers.
This only enables it for IORING_OP_SEND, IORING_OP_SENDMSG is coming in a separate patch. However, this patch does do a lot of the prep work that makes wiring up the sendmsg variant pretty trivial. They share the prep side.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
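A hedged sketch of the send side, under the same assumptions as the receive example above (illustrative group id, bundle flag carried in sqe->ioprio): buffers staged in group BGID are drained by a single bundled send.

    #include <liburing.h>

    #define BGID 7   /* illustrative buffer group id */

    static void queue_bundle_send(struct io_uring *ring, int sockfd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_send(sqe, sockfd, NULL, 0, 0);   /* data comes from the ring */
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = BGID;
        sqe->ioprio |= IORING_RECVSEND_BUNDLE;   /* one CQE may cover several buffers */
    }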
Revision tags: v6.8-rc6
# ac5f71a3 | 19-Feb-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: add provided buffer support for IORING_OP_SEND
It's pretty trivial to wire up provided buffer support for the send side, just like how it's done on the receive side. This enables setting up a buffer ring that an application can use to push pending sends to, and then have a send pick a buffer from that ring.
One of the challenges with async IO and networking sends is that you can get into reordering conditions if you have more than one inflight at the same time. Consider the following scenario where everything is fine:
1) App queues sendA for socket1
2) App queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, completes successfully, posts CQE
5) sendB is issued, completes successfully, posts CQE
All is fine. Requests are always issued in-order, and both complete inline as most sends do.
However, if we're flooding socket1 with sends, the following could also result from the same sequence:
1) App queues sendA for socket1
2) App queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, socket1 is full, poll is armed for retry
5) Space frees up in socket1, this triggers sendA retry via task_work
6) sendB is issued, completes successfully, posts CQE
7) sendA is retried, completes successfully, posts CQE
Now we've sent sendB before sendA, which can make things unhappy. If both sendA and sendB had been using provided buffers, then it would look as follows instead:
1) App queues dataA for sendA, queues sendA for socket1
2) App queues dataB for sendB, queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, socket1 is full, poll is armed for retry
5) Space frees up in socket1, this triggers sendA retry via task_work
6) sendB is issued, picks first buffer (dataA), completes successfully, posts CQE (which says "I sent dataA")
7) sendA is retried, picks first buffer (dataB), completes successfully, posts CQE (which says "I sent dataB")
Now we've sent the data in order, and everybody is happy.
It's worth noting that this also opens the door for supporting multishot sends, as provided buffers would be a prerequisite for that. Those can trigger either when new buffers are added to the outgoing ring, or (if stalled due to lack of space) when space frees up in the socket.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
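A hedged sketch of the ordering idea described above (illustrative names; assumes a buffer ring for group BGID was created with 8 entries via io_uring_setup_buf_ring()): dataA and dataB are staged in ring order, and each send picks the head buffer when it actually runs, so the wire order matches the ring order even if the first send has to be retried.

    #include <liburing.h>

    #define BGID 7   /* illustrative buffer group id */

    static void queue_ordered_sends(struct io_uring *ring,
                                    struct io_uring_buf_ring *br,
                                    int sockfd,
                                    void *dataA, unsigned lenA,
                                    void *dataB, unsigned lenB)
    {
        int mask = io_uring_buf_ring_mask(8);   /* ring was created with 8 entries */
        struct io_uring_sqe *sqe;
        int i;

        /* Stage the payloads in the order they must hit the wire. */
        io_uring_buf_ring_add(br, dataA, lenA, 0, mask, 0);
        io_uring_buf_ring_add(br, dataB, lenB, 1, mask, 1);
        io_uring_buf_ring_advance(br, 2);

        /* sendA and sendB: each picks the current head buffer at issue time. */
        for (i = 0; i < 2; i++) {
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_send(sqe, sockfd, NULL, 0, 0);
            sqe->flags |= IOSQE_BUFFER_SELECT;
            sqe->buf_group = BGID;
        }
        io_uring_submit(ring);
    }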
# 3e747ded | 25-Feb-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: add generic multishot retry helper
This is just moving io_recv_prep_retry() higher up so it can get used for sends as well, and renaming it to be generically useful for both sends and receives.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d285da7d | 08-Apr-2024 | Pavel Begunkov <asml.silence@gmail.com>
io_uring/net: set MSG_ZEROCOPY for sendzc in advance
We can set MSG_ZEROCOPY at the preparation step; do it there so we don't have to care about it later in the issue callback.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c2c22aaa577624977f045979a6db2b9fb2e5648c.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6b7f864b | 08-Apr-2024 | Pavel Begunkov <asml.silence@gmail.com>
io_uring/net: get rid of io_notif_complete_tw_ext
io_notif_complete_tw_ext() can be removed and combined with io_notif_complete_tw to make it simpler without sacrificing anything.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/025a124a5e20e2474a57e2f04f16c422eb83063c.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 414d0f45 | 20-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/alloc_cache: switch to array based caching
Currently lists are being used to manage this, but best practice is usually to have these in an array instead, as that is cheaper to manage.
Outside of that detail, games are also played with KASAN as the list is inside the cached entry itself.
Finally, all users of this need a struct io_cache_entry embedded in their struct, which is union'ized with something else in there that isn't used across the free -> realloc cycle.
Get rid of all of that, and simply have it be an array. This will not change the memory used, as we're just trading an 8-byte member entry for the per-elem array size.
This reduces the overhead of the recycled allocations, and it reduces the amount of code needed to support recycling to about half of what it currently is.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
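For illustration only (this is not the kernel's actual io_alloc_cache, just the general pattern the commit describes): recycled objects are kept as an array of pointers rather than being linked through a node embedded in each entry, so nothing inside the cached object needs to survive the free -> realloc cycle.

    #include <stdlib.h>

    struct obj_cache {
        void   **entries;    /* array of cached object pointers */
        unsigned nr;         /* currently cached */
        unsigned max;        /* array capacity */
        size_t   obj_size;
    };

    static int cache_init(struct obj_cache *c, unsigned max, size_t obj_size)
    {
        c->entries = calloc(max, sizeof(void *));
        c->nr = 0;
        c->max = max;
        c->obj_size = obj_size;
        return c->entries ? 0 : -1;
    }

    static void *cache_get(struct obj_cache *c)
    {
        if (c->nr)
            return c->entries[--c->nr];
        return malloc(c->obj_size);       /* miss: fall back to an allocation */
    }

    static void cache_put(struct obj_cache *c, void *obj)
    {
        if (c->nr < c->max)
            c->entries[c->nr++] = obj;    /* recycle */
        else
            free(obj);                    /* cache full: actually free */
    }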
# e2ea5a70 | 19-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: move connect to always using async data
While doing that, get rid of io_async_connect and just use the generic io_async_msghdr. Both of them have a struct sockaddr_storage in there, and while io_async_msghdr is bigger, if the same type can be used then the netmsg_cache can get reused for connect as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# d80f9407 | 18-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: drop 'kmsg' parameter from io_req_msg_cleanup()
Now that iovec recycling is being done, the iovec is no longer being freed in there. Hence the kmsg parameter is now useless.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 75191341 | 16-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: add iovec recycling
Right now the io_async_msghdr is recycled to avoid the overhead of allocating+freeing it for every request. But the iovec is not included, hence that will be allocated and freed for each transfer regardless. This commit enables recycling of the iovec between io_async_msghdr recycles. This avoids alloc+free for each one if an iovec is used, and on top of that, it extends the cache hot nature of msg to the iovec as well.
Also enables KASAN for the iovec entries, so that reuse can be detected even while they are in the cache.
The io_async_msghdr also shrinks from 376 -> 288 bytes, an 88 byte saving (or ~23% smaller), as the fast_iovec entry is dropped from 8 entries to a single entry. There's no point keeping a big fast iovec entry, if iovecs aren't being allocated and freed continually.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 9f8539fe | 21-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: remove (now) dead code in io_netmsg_recycle()
All net commands have async data at this point, so there's no reason to check whether that's the case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 6498c5c9 | 18-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring: kill io_msg_alloc_async_prep()
We now ONLY call io_msg_alloc_async() from inside prep handling, which is always locked. No need for this helper anymore, or the check in io_msg_alloc_async() on whether the ring is locked or not.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 50220d6a | 18-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: get rid of ->prep_async() for send side
Move the io_async_msghdr out of the issue path and into prep handling, since it's now done unconditionally and hence does not need to be part of the issue path. This means any usage of io_sendrecv_prep_async() and io_sendmsg_prep_async() goes away, and hence the forced async setup path is now unified with the normal prep setup.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# c6f32c7d | 18-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: get rid of ->prep_async() for receive side
Move the io_async_msghdr out of the issue path and into prep handling, since it's now done unconditionally and hence does not need to be part of the issue path. This reduces the footprint of the multishot fast path of multiple invocations of ->issue() per prep, and also means that using ->prep_async() can be dropped for recvmsg, as this is now done via setup on the prep side.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 3ba8345a | 12-Apr-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: always set kmsg->msg.msg_control_user before issue
We currently set this separately for async/sync entry, but let's just move it to a generic pre-issue spot and eliminate the difference between the two.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
# 790b68b3 | 17-Mar-2024 | Jens Axboe <axboe@kernel.dk>
io_uring/net: always setup an io_async_msghdr
Rather than use an on-stack one and then need to allocate and copy if async execution is required, always grab one upfront. This should be very cheap, and potentially even have cache hotness benefits for back-to-back send/recv requests.
For any recv type of request, this is probably a good choice in general, as it's expected that no data is available initially. For send this is not necessarily the case, as space in the socket buffer is expected to be available. However, getting a cached io_async_msghdr is very cheap, and as it should be cache hot, probably the difference here is negligible, if any.
A nice side benefit is that io_setup_async_msg can get killed completely, which has some nasty iovec manipulation code.
Signed-off-by: Jens Axboe <axboe@kernel.dk>