.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, many of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` offers a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in a way that the user
should not be able to distinguish between the hardware and software
implementations. In this process, hardware specifics are neglected. In
reality those details can be highly significant and should be exposed in
some standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) to look like
the Level Path Compression trie (LPC-trie) in hardware.

In many situations, trying to analyze a system failure based solely on the
kernel's dump may not be enough. Combining this data with complementary
information about the underlying hardware makes debugging easier;
additionally, the information can be useful when debugging performance
issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, first being used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it is quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM can be divided
into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand, hardware
TCAM regions have a predefined lookup key. Offloading the TC filter rules
using the TCAM engine can result in multiple TCAM regions being
interconnected in a chain (which may affect the data path latency). In
response to a new TC filter, new tables should be created describing those
regions.
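
As a rough illustration of why one TC filter can fan out into several TCAM
regions, the following Python sketch groups rules by the set of fields they
match on. The rule list, the ``build_regions`` helper and the field names are
hypothetical and are not part of any driver or of the ``devlink`` API. Real
hardware must also preserve rule priority across regions, which this sketch
does not attempt.

```python
# Hypothetical TC filter rules: (priority, {field name: value}).
rules = [
    (10, {"ipv4.src_addr": "10.0.0.1", "ipv4.dst_addr": "10.0.0.2"}),
    (20, {"ipv4.src_addr": "10.0.0.3", "ipv4.dst_addr": "10.0.0.4"}),
    (30, {"tcp.dst_port": 80}),
]


def build_regions(rules):
    """Group rules so that every region has a single, fixed lookup key."""
    regions = {}
    for priority, match in sorted(rules, key=lambda rule: rule[0]):
        # The set of matched fields plays the role of the region's lookup key.
        key = tuple(sorted(match))
        regions.setdefault(key, []).append((priority, match))
    return regions


regions = build_regions(rules)
for index, (key, entries) in enumerate(regions.items()):
    print(f"region {index}: key={key}, entries={len(entries)}")
```

Under these assumptions the three rules end up in two regions, one per
distinct lookup key, and each region would be exposed as its own table.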

Model
=====

The ``DPIPE`` model introduces several objects:

  * headers
  * tables
  * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

  * ``table_get``: Receive a table's description.
  * ``headers_get``: Receive a device's supported headers.
  * ``entries_get``: Receive a table's current entries.
  * ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

  * ``matches_dump``: Dump the supported matches.
  * ``actions_dump``: Dump the supported actions.
  * ``entries_dump``: Dump the actual content of the table.
  * ``counters_set_update``: Synchronize hardware with counters enabled or
    disabled.

Header/Field
------------

In a similar way to P4, headers and fields are used to describe a table's
behavior. There is a slight difference between the standard protocol headers
and specific ASIC metadata. The protocol headers should be declared in the
``devlink`` core API. On the other hand, ASIC metadata is driver specific
and should be defined in the driver. Additionally, each driver-specific
devlink documentation file should document the driver-specific ``dpipe``
headers it implements. The headers and fields are identified by enumeration.

In order to provide further visibility, some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported because this is exactly the process we wish to
describe in full detail. Examples of matches:

  * ``field_exact``: Exact match on a specific field.
  * ``field_exact_mask``: Exact match on a specific field after masking.
  * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).
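
The three match primitives, and the role of the header index, can be
sketched in Python as follows. This is only a model for illustration: the
``FieldRef`` class, the packet representation as a list of
``(header, fields)`` pairs and the inclusive range bounds are assumptions
made here, not definitions taken from the ``devlink`` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRef:
    header_id: str          # e.g. "ipv4"
    field_id: str           # e.g. "dst_addr"
    header_index: int = 0   # 0 = outermost header of that type, 1 = next one


def lookup(packet, ref):
    """Return the referenced field from a list of (header name, fields)."""
    same_type = [fields for name, fields in packet if name == ref.header_id]
    return same_type[ref.header_index][ref.field_id]


def field_exact(packet, ref, value):
    return lookup(packet, ref) == value


def field_exact_mask(packet, ref, value, mask):
    return (lookup(packet, ref) & mask) == (value & mask)


def field_range(packet, ref, low, high):
    # Inclusive bounds are an assumption of this sketch.
    return low <= lookup(packet, ref) <= high


# A tunneled packet carrying two IPv4 headers.
packet = [
    ("ipv4", {"dst_addr": 0xC0A80001, "ttl": 64}),  # outer, 192.168.0.1
    ("ipv4", {"dst_addr": 0x0A000105, "ttl": 32}),  # inner, 10.0.1.5
]
outer = FieldRef("ipv4", "dst_addr", header_index=0)
inner = FieldRef("ipv4", "dst_addr", header_index=1)

print(field_exact(packet, outer, 0xC0A80001))
print(field_exact_mask(packet, inner, 0x0A000100, 0xFFFFFF00))
print(field_range(packet, FieldRef("ipv4", "ttl", 1), 1, 63))
```

Without the header index, the outer and inner ``ipv4.dst_addr`` fields of
the tunneled packet would be indistinguishable.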

Action
------

As with matches, the actions are kept primitive and close to hardware
operation. For example:

  * ``field_modify``: Modify the field value.
  * ``field_inc``: Increment the field value.
  * ``push_header``: Add a header.
  * ``pop_header``: Remove a header.
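
A minimal Python model of these four actions is shown below. The packet
representation, the function signatures and the choice to push and pop at
the outermost position are assumptions of this sketch; the ``tunnel`` header
is a made-up name used only for the example.

```python
def _fields(packet, header, index=0):
    """Return the field dict of the index-th header with the given name."""
    matches = [fields for name, fields in packet if name == header]
    return matches[index]


def field_modify(packet, header, field, value, index=0):
    _fields(packet, header, index)[field] = value


def field_inc(packet, header, field, index=0):
    _fields(packet, header, index)[field] += 1


def push_header(packet, header, fields):
    # Assumption: a pushed header becomes the new outermost header.
    packet.insert(0, (header, dict(fields)))


def pop_header(packet):
    # Assumption: the outermost header is the one removed.
    return packet.pop(0)


packet = [("ethernet", {"daddr": "00:00:00:00:00:01"}), ("ipv4", {"ttl": 64})]
field_modify(packet, "ethernet", "daddr", "aa:bb:cc:dd:ee:ff")
field_inc(packet, "ipv4", "ttl")
push_header(packet, "tunnel", {"id": 7})
popped = pop_header(packet)
print(packet, popped)
```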

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content
the interactions between tables can be resolved.
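
The sketch below models entries and tables in Python and shows one way the
interactions between two dumped tables could be resolved: an entry in an
upstream table is linked to a downstream entry when the values it sets equal
the values the downstream entry matches on. The ``Entry`` and ``Table``
classes and the ``linked_entries`` helper are hypothetical and do not mirror
the kernel's data structures.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    index: int
    match_values: dict
    action_values: dict
    counter: Optional[int] = None   # None while counters are disabled


@dataclass
class Table:
    name: str
    size: int
    counters_enabled: bool = False
    entries: list = field(default_factory=list)

    def counters_set(self, enabled):
        self.counters_enabled = enabled
        for entry in self.entries:
            entry.counter = 0 if enabled else None


def linked_entries(upstream, downstream):
    """Pair entries whose action values equal the downstream match values."""
    links = []
    for up in upstream.entries:
        for down in downstream.entries:
            shared = set(up.action_values) & set(down.match_values)
            if shared and all(up.action_values[k] == down.match_values[k]
                              for k in shared):
                links.append((up.index, down.index))
    return links


lpm = Table("lpm_prefix_16", 4096, entries=[
    Entry(0, {"meta.vr_id": 0},
          {"meta.adj_index": 8, "meta.adj_group_size": 2}),
])
adjacency = Table("adjacency", 4096, entries=[
    Entry(8, {"meta.adj_index": 8, "meta.adj_group_size": 2,
              "meta.packet_hash_index": 0}, {"meta.erif": 3}),
    Entry(9, {"meta.adj_index": 8, "meta.adj_group_size": 2,
              "meta.packet_hash_index": 1}, {"meta.erif": 4}),
    Entry(10, {"meta.adj_index": 10, "meta.adj_group_size": 1,
               "meta.packet_hash_index": 0}, {"meta.erif": 5}),
])
adjacency.counters_set(True)
links = linked_entries(lpm, adjacency)
print(links)
```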

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit the entry contains information about the next stage of the
pipeline, which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }
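
The chain of per-prefix-length hash tables can be sketched in Python as
follows. This is a simplified IPv4-only model written for illustration, not
the Spectrum implementation: the ``HashLpm`` class is hypothetical, and the
chain here contains only the prefix lengths that actually hold routes.

```python
import ipaddress


class HashLpm:
    def __init__(self):
        # prefix length -> {network address as int: lookup result}
        self.tables = {}

    def add_route(self, prefix, result):
        net = ipaddress.ip_network(prefix)
        table = self.tables.setdefault(net.prefixlen, {})
        table[int(net.network_address)] = result

    def lookup(self, address):
        """Return (result, number of hash tables visited)."""
        addr = int(ipaddress.ip_address(address))
        # Start from the longest prefix and continue to the next table on a miss.
        chain = sorted(self.tables, reverse=True)
        for depth, prefixlen in enumerate(chain, start=1):
            mask = (0xFFFFFFFF << (32 - prefixlen)) & 0xFFFFFFFF
            result = self.tables[prefixlen].get(addr & mask)
            if result is not None:
                return result, depth
        return None, len(chain)


lpm = HashLpm()
lpm.add_route("10.1.1.1/32", "local_host")
lpm.add_route("10.1.0.0/16", "adjacency group A")
lpm.add_route("10.0.0.0/8", "adjacency group B")

for address in ("10.1.1.1", "10.1.2.3", "10.9.9.9", "192.0.2.1"):
    print(address, lpm.lookup(address))
```

The second element of the returned tuple is the search depth, which is the
quantity the text above relates to data path latency.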

Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table keyed by the output interface ID combined with
the destination IP address. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

Adjacency
---------

For remote routes, this table performs equal-cost multi-path (ECMP)
selection. The LPM lookup yields the ECMP group size and an index that
serves as a global offset into this table. Concurrently, a hash of the
packet is generated. Based on the ECMP group size and the packet's hash, a
local offset is generated. Multiple LPM entries can point to the same
adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }
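
How the global and local offsets might combine is sketched below. The use
of CRC32 as the packet hash and of a modulo to derive the local offset are
assumptions chosen for a runnable example; the actual hardware hash and
reduction are not described here.

```python
import zlib


def adjacency_index(adj_index, adj_group_size, flow_fields):
    """Pick one adjacency entry from an ECMP group for a given flow."""
    flow = "|".join(str(value) for value in flow_fields).encode()
    packet_hash = zlib.crc32(flow)               # stand-in for the hardware hash
    local_offset = packet_hash % adj_group_size  # position inside the group
    return adj_index + local_offset              # global offset + local offset


# Hypothetical flow: source IP, destination IP, protocol, source and
# destination ports.
flow = ("10.0.0.1", "10.0.1.5", 6, 1234, 80)
print(adjacency_index(8, 4, flow))
```

Packets of the same flow hash to the same value and therefore always select
the same entry within the group, while a group of size one always resolves
to ``adj_index`` itself.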

ERIF
----

Once the egress RIF and destination MAC have been resolved by the previous
tables, this table performs multiple operations such as TTL decrement and
MTU check. Then the forward/drop decision is taken and the port L3
statistics are updated based on the packet's type (broadcast, unicast,
multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
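
The ERIF stage can be modeled roughly as below. The order of the checks,
the drop conditions and the shape of the statistics are assumptions of this
sketch, and the packet is a plain dictionary invented for the example.

```python
from collections import Counter


def erif_process(packet, rif_mtu, stats):
    """Decrement the TTL, check the MTU, decide and count the result."""
    ttl = packet["ttl"] - 1
    if ttl <= 0 or packet["length"] > rif_mtu:
        decision = "drop"
    else:
        packet["ttl"] = ttl
        decision = "forward"
    # Statistics are kept per packet type and per decision.
    stats[(packet["l3_type"], decision)] += 1
    return decision


stats = Counter()
ok = {"ttl": 64, "length": 1500, "l3_type": "unicast"}
expired = {"ttl": 1, "length": 100, "l3_type": "unicast"}
too_big = {"ttl": 64, "length": 9000, "l3_type": "multicast"}

for pkt in (ok, expired, too_big):
    print(erif_process(pkt, rif_mtu=1500, stats=stats))
print(dict(stats))
```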