.. SPDX-License-Identifier: GPL-2.0

==================
AF_XDP TX Metadata
==================

This document describes how to enable offloads when transmitting packets
via :doc:`af_xdp`. Refer to :doc:`xdp-rx-metadata` on how to access similar
metadata on the receive side.

General Design
==============

The headroom for the metadata is reserved via the ``tx_metadata_len`` field
and the ``XDP_UMEM_TX_METADATA_LEN`` flag in ``struct xdp_umem_reg``. The
metadata length is therefore the same for every socket that shares the same
umem. The metadata layout is a fixed UAPI; refer to ``struct xsk_tx_metadata``
in ``include/uapi/linux/if_xdp.h``. Thus, generally, the ``tx_metadata_len``
field above should contain ``sizeof(struct xsk_tx_metadata)``.

Note that in the original implementation the ``XDP_UMEM_TX_METADATA_LEN``
flag was not required. Applications might attempt to create a umem
with the flag first and, if that fails, retry without the flag.

The headroom and the metadata itself should be located right before
``xdp_desc->addr`` in the umem frame. Within a frame, the metadata
layout is as follows::

           tx_metadata_len
     /                         \
    +-----------------+---------+----------------------------+
    | xsk_tx_metadata | padding |          payload           |
    +-----------------+---------+----------------------------+
                                ^
                                |
                          xdp_desc->addr

An AF_XDP application can request a headroom larger than
``sizeof(struct xsk_tx_metadata)``. The kernel will ignore the padding (and
will still use ``xdp_desc->addr - tx_metadata_len`` to locate the
``xsk_tx_metadata``). For frames that should not carry any metadata (i.e.,
the ones that don't have the ``XDP_TX_METADATA`` option set), the metadata
area is ignored by the kernel as well.

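The address arithmetic described above can be sketched as follows. This is a
minimal illustration, not kernel or library code: ``tx_metadata_ptr()`` and its
parameters are hypothetical names, ``umem_area`` is assumed to be the start of
the umem mapping as seen by the application, and ``desc_addr`` is assumed to be
a plain offset into the umem (aligned chunk mode).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel or libxdp API): return a pointer to the
 * start of the TX metadata area of a frame.
 *
 * umem_area       - start of the umem mapping (assumption: aligned mode,
 *                   so desc_addr is a plain byte offset into it)
 * desc_addr       - value that will be written to xdp_desc->addr
 * tx_metadata_len - value passed in struct xdp_umem_reg at registration
 *
 * Returns NULL when there is not enough room in front of the payload.
 */
static inline void *tx_metadata_ptr(void *umem_area, uint64_t desc_addr,
				    uint32_t tx_metadata_len)
{
	if (desc_addr < tx_metadata_len)
		return NULL;

	/* The metadata starts tx_metadata_len bytes before the payload. */
	return (uint8_t *)umem_area + desc_addr - tx_metadata_len;
}
```

An application would typically cast the returned pointer to
``struct xsk_tx_metadata *`` and fill it in before placing the descriptor on
the TX ring.
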
The flags field enables the particular offload:

- ``XDP_TXMD_FLAGS_TIMESTAMP``: requests the device to put the transmission
  timestamp into the ``tx_timestamp`` field of ``struct xsk_tx_metadata``.
- ``XDP_TXMD_FLAGS_CHECKSUM``: requests the device to calculate the L4
  checksum. ``csum_start`` specifies the byte offset where the checksumming
  should start and ``csum_offset`` specifies the byte offset where the
  device should store the computed checksum.
- ``XDP_TXMD_FLAGS_LAUNCH_TIME``: requests the device to schedule the
  packet for transmission at a pre-determined time called launch time. The
  value of the launch time is indicated by the ``launch_time`` field of
  ``struct xsk_tx_metadata``.

Besides the flags above, in order to trigger the offloads, the packet's
first ``struct xdp_desc`` descriptor should set the ``XDP_TX_METADATA``
bit in the ``options`` field. Also note that in a multi-buffer packet
only the first chunk should carry the metadata.

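As an illustration of the steps above, the following sketch requests checksum
offload for a UDP over IPv4 frame. It is not taken from the kernel sources.
All ``sketch_`` and ``SKETCH_`` names are local stand-ins so that the example
compiles on its own; their values are assumed to mirror
``include/uapi/linux/if_xdp.h``, and a real program should include
``<linux/if_xdp.h>`` and use the UAPI definitions instead. The sketch also
assumes that ``csum_offset`` is counted from ``csum_start``, no VLAN tag and
no IPv4 options.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins, assumed to match the UAPI values; use <linux/if_xdp.h> instead. */
#define SKETCH_TXMD_FLAGS_CHECKSUM	(1u << 1)	/* XDP_TXMD_FLAGS_CHECKSUM */
#define SKETCH_TX_METADATA		(1u << 1)	/* XDP_TX_METADATA */

/* Simplified stand-in for the request part of struct xsk_tx_metadata. */
struct sketch_tx_metadata {
	uint64_t flags;
	struct {
		uint16_t csum_start;
		uint16_t csum_offset;
		uint64_t launch_time;
	} request;
};

/* Minimal UDP header layout, only used for offsetof(). */
struct sketch_udphdr {
	uint16_t source;
	uint16_t dest;
	uint16_t len;
	uint16_t check;
};

#define SKETCH_ETH_HLEN		14	/* Ethernet header, no VLAN tag */
#define SKETCH_IPV4_HLEN	20	/* IPv4 header, no options */

/*
 * Fill the metadata area and mark the descriptor of the first chunk.
 * desc_options points at the options field of that descriptor.
 */
static void request_udp4_csum(struct sketch_tx_metadata *meta,
			      uint32_t *desc_options)
{
	memset(meta, 0, sizeof(*meta));

	meta->flags = SKETCH_TXMD_FLAGS_CHECKSUM;
	/* Checksumming starts at the UDP header. */
	meta->request.csum_start = SKETCH_ETH_HLEN + SKETCH_IPV4_HLEN;
	/* Assumption: offset of the checksum field relative to csum_start. */
	meta->request.csum_offset = offsetof(struct sketch_udphdr, check);

	/* Without this bit the kernel ignores the metadata area. */
	*desc_options |= SKETCH_TX_METADATA;
}
```

The other offloads follow the same pattern: set the corresponding flag, fill
the matching field, and keep the ``XDP_TX_METADATA`` bit on the first
descriptor only.
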
Software TX Checksum
====================

For development and testing purposes it's possible to pass the
``XDP_UMEM_TX_SW_CSUM`` flag to the ``XDP_UMEM_REG`` UMEM registration call.
In this case, when running in ``XDP_COPY`` mode, the TX checksum
is calculated on the CPU. Do not enable this option in production because
it will negatively affect performance.

Launch Time
===========

The value of the requested launch time should be based on the device's PTP
Hardware Clock (PHC) to ensure accuracy. AF_XDP takes a different data path
compared to the ETF queuing discipline, which organizes packets and delays
their transmission. Instead, AF_XDP immediately hands off the packets to
the device driver without rearranging their order or holding them prior to
transmission. Since the driver maintains FIFO behavior and does not perform
packet reordering, a packet with a launch time request will block other
packets in the same Tx Queue until it is sent. Therefore, it is recommended
to allocate a separate queue for scheduling traffic that is intended for
future transmission.

In scenarios where the launch time offload feature is disabled, the device
driver is expected to disregard the launch time request. For correct
interpretation and meaningful operation, the launch time should never be
set to a value larger than the farthest programmable time in the future
(the horizon). Different devices have different hardware limitations on the
launch time offload feature.

stmmac driver
-------------

For stmmac, TSO and launch time (TBS) features are mutually exclusive for
each individual Tx Queue. By default, the driver configures Tx Queue 0 to
support TSO and the rest of the Tx Queues to support TBS. The launch time
hardware offload feature can be enabled or disabled by using the tc-etf
command to call the driver's ndo_setup_tc() callback.

The value of the launch time that is programmed in the Enhanced Normal
Transmit Descriptors is a 32-bit value, where the most significant 8 bits
represent the time in seconds and the remaining 24 bits represent the time
in 256 ns increments. The programmed launch time is compared against the
PTP time (bits[39:8]) and rolls over after 256 seconds. Therefore, the
horizon of the launch time for dwmac4 and dwxlgmac2 is 128 seconds in the
future.

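The 32-bit layout described above can be illustrated with a small helper. This
is a sketch of the encoding as this section describes it, not the stmmac
driver's code: the function names are hypothetical, and the launch time is
assumed to be given as nanoseconds of PHC time.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC	1000000000ULL
#define SKETCH_HORIZON_SEC	128ULL	/* horizon stated in this section */

/*
 * Hypothetical helper: pack a launch time into the 32-bit layout described
 * above. The top 8 bits hold the seconds (which wrap every 256 s) and the
 * low 24 bits hold the sub-second part in units of 256 ns.
 */
static uint32_t pack_launch_time(uint64_t launch_ns)
{
	uint32_t sec = (uint32_t)((launch_ns / SKETCH_NSEC_PER_SEC) & 0xff);
	uint32_t sub = (uint32_t)((launch_ns % SKETCH_NSEC_PER_SEC) >> 8);

	return (sec << 24) | (sub & 0xffffff);
}

/*
 * Hypothetical helper: true only if the launch time is not in the past and
 * is less than 128 seconds ahead of the current PHC time.
 */
static bool launch_time_within_horizon(uint64_t now_ns, uint64_t launch_ns)
{
	return launch_ns >= now_ns &&
	       launch_ns - now_ns < SKETCH_HORIZON_SEC * SKETCH_NSEC_PER_SEC;
}
```

Because the seconds field wraps every 256 seconds, two launch times that are
exactly 256 seconds apart pack to the same value, which is why a request
beyond the horizon cannot be interpreted correctly.
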
igc driver
----------

For igc, all four Tx Queues support the launch time feature. The launch
time hardware offload feature can be enabled or disabled by using the
tc-etf command to call the driver's ndo_setup_tc() callback. When entering
TSN mode, the igc driver will reset the device and create a default Qbv
schedule with a 1-second cycle time, with all Tx Queues open at all times.

The value of the launch time that is programmed in the Advanced Transmit
Context Descriptor is a relative offset to the starting time of the Qbv
transmission window of the queue. The Frst flag of the descriptor can be
set to schedule the packet for the next Qbv cycle. Therefore, the horizon
of the launch time for i225 and i226 is the ending time of the next cycle
of the Qbv transmission window of the queue. For example, when the Qbv
cycle time is set to 1 second, the horizon of the launch time ranges
from 1 second to 2 seconds, depending on where the Qbv cycle is currently
running.

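The 1 to 2 second range in the example above follows from simple arithmetic,
sketched below. This is an illustration only, not igc driver code:
``qbv_horizon_ns()`` is a hypothetical helper, and it assumes the default
schedule described above, where the queue's transmission window spans the
whole cycle and the cycles start at ``base_time_ns``.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how far into the future (in ns, relative to now_ns)
 * the launch time horizon lies, i.e. the distance to the end of the next
 * Qbv cycle.
 *
 * Assumptions: the queue is open for the whole cycle, cycles start at
 * base_time_ns, and now_ns >= base_time_ns. Returns 0 on invalid input.
 */
static uint64_t qbv_horizon_ns(uint64_t now_ns, uint64_t base_time_ns,
			       uint64_t cycle_ns)
{
	uint64_t elapsed, remaining;

	if (cycle_ns == 0 || now_ns < base_time_ns)
		return 0;

	/* Position inside the currently running cycle. */
	elapsed = (now_ns - base_time_ns) % cycle_ns;
	/* Time left until the current cycle ends. */
	remaining = cycle_ns - elapsed;

	/* The horizon is the end of the next cycle. */
	return remaining + cycle_ns;
}
```

With a 1-second cycle, the result is 2 seconds at the very start of a cycle
and approaches 1 second as the current cycle nears its end.
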
Querying Device Capabilities
============================

Every device exports its offload capabilities via the netlink netdev family.
Refer to the ``xsk-flags`` features bitmask in
``Documentation/netlink/specs/netdev.yaml``.

- ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
- ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
- ``tx-launch-time-fifo``: device supports ``XDP_TXMD_FLAGS_LAUNCH_TIME``

See ``tools/net/ynl/samples/netdev.c`` on how to query this information.

Example
=======

See ``tools/testing/selftests/bpf/xdp_hw_metadata.c`` for an example
program that handles TX metadata. Also see https://github.com/fomichev/xskgen
for a more bare-bones example.