=========================
NXP SJA1105 switch driver
=========================

Overview
========

The NXP SJA1105 is a family of 6 devices:

- SJA1105E: First generation, no TTEthernet
- SJA1105T: First generation, TTEthernet
- SJA1105P: Second generation, no TTEthernet, no SGMII
- SJA1105Q: Second generation, TTEthernet, no SGMII
- SJA1105R: Second generation, no TTEthernet, SGMII
- SJA1105S: Second generation, TTEthernet, SGMII

These are SPI-managed automotive switches, with all ports being gigabit
capable, and supporting MII/RMII/RGMII and optionally SGMII on one port.

Being automotive parts, their configuration interface is geared towards
set-and-forget use, with minimal dynamic interaction at runtime. They
require a static configuration to be composed by software and packed
with CRC and table headers, and sent over SPI.

The static configuration is composed of several configuration tables. Each
table takes a number of entries. Some configuration tables can be (partially)
reconfigured at runtime, some not.
Some tables are mandatory, some not:

============================= ================== =============================
Table                         Mandatory          Reconfigurable
============================= ================== =============================
Schedule                      no                 no
Schedule entry points         if Scheduling      no
VL Lookup                     no                 no
VL Policing                   if VL Lookup       no
VL Forwarding                 if VL Lookup       no
L2 Lookup                     no                 no
L2 Policing                   yes                no
VLAN Lookup                   yes                yes
L2 Forwarding                 yes                partially (fully on P/Q/R/S)
MAC Config                    yes                partially (fully on P/Q/R/S)
Schedule Params               if Scheduling      no
Schedule Entry Points Params  if Scheduling      no
VL Forwarding Params          if VL Forwarding   no
L2 Lookup Params              no                 partially (fully on P/Q/R/S)
L2 Forwarding Params          yes                no
Clock Sync Params             no                 no
AVB Params                    no                 no
General Params                yes                partially
Retagging                     no                 yes
xMII Params                   yes                no
SGMII                         no                 yes
============================= ================== =============================


Also the configuration is write-only (software cannot read it back from the
switch except for very few exceptions).

The driver creates a static configuration at probe time, and keeps it at
all times in memory, as a shadow for the hardware state. When required to
change a hardware setting, the static configuration is also updated.
If that changed setting can be transmitted to the switch through the dynamic
reconfiguration interface, it is; otherwise the switch is reset and
reprogrammed with the updated static configuration.

Traffic support
===============

The switches do not support switch tagging in hardware. But they do support
customizing the TPID by which VLAN traffic is identified as such. The switch
driver is leveraging ``CONFIG_NET_DSA_TAG_8021Q`` by requesting that special
VLANs (with a custom TPID of ``ETH_P_EDSA`` instead of ``ETH_P_8021Q``) are
installed on its ports when not in ``vlan_filtering`` mode.
This does not
interfere with the reception and transmission of real 802.1Q-tagged traffic,
because the switch no longer parses those packets as VLAN after the TPID
change.
The TPID is restored when ``vlan_filtering`` is requested by the user through
the bridge layer, and general IP termination is no longer possible through
the switch netdevices in this mode.

The switches have two programmable filters for link-local destination MACs.
These are used to trap BPDUs and PTP traffic to the master netdevice, and are
further used to support STP and 1588 ordinary clock/boundary clock
functionality.

The following traffic modes are supported over the switch netdevices:

+--------------------+------------+------------------+------------------+
|                    | Standalone | Bridged with     | Bridged with     |
|                    | ports      | vlan_filtering 0 | vlan_filtering 1 |
+====================+============+==================+==================+
| Regular traffic    | Yes        | Yes              | No (use master)  |
+--------------------+------------+------------------+------------------+
| Management traffic | Yes        | Yes              | Yes              |
| (BPDU, PTP)        |            |                  |                  |
+--------------------+------------+------------------+------------------+

Switching features
==================

The driver supports the configuration of L2 forwarding rules in hardware for
port bridging. The forwarding, broadcast and flooding domain between ports can
be restricted through two methods: either at the L2 forwarding level (isolate
one bridge's ports from another's) or at the VLAN port membership level
(isolate ports within the same bridge). The final forwarding decision taken by
the hardware is a logical AND of these two sets of rules.

The hardware tags all traffic internally with a port-based VLAN (pvid), or it
decodes the VLAN information from the 802.1Q tag. Advanced VLAN classification
is not possible.
Once attributed a VLAN tag, frames are checked against the
port's membership rules and dropped at ingress if they don't match any VLAN.
This behavior is available when switch ports are enslaved to a bridge with
``vlan_filtering 1``.

Normally the hardware is not configurable with respect to VLAN awareness, but
by changing what TPID the switch searches 802.1Q tags for, the semantics of a
bridge with ``vlan_filtering 0`` can be kept (accept all traffic, tagged or
untagged), and therefore this mode is also supported.

Segregating the switch ports in multiple bridges is supported (e.g. 2 + 2), but
all bridges should have the same level of VLAN awareness (either both have
``vlan_filtering`` 0, or both 1). Also an inevitable limitation of the fact
that VLAN awareness is global at the switch level is that once a bridge with
``vlan_filtering`` enslaves at least one switch port, the other un-bridged
ports are no longer available for standalone traffic termination.

Topology and loop detection through STP is supported.

L2 FDB manipulation (add/delete/dump) is currently possible for the first
generation devices. Aging time of FDB entries, as well as enabling fully static
management (no address learning and no flooding of unknown traffic) is not yet
configurable in the driver.

A special comment about bridging with other netdevices (illustrated with an
example):

A board has eth0, eth1, swp0@eth1, swp1@eth1, swp2@eth1, swp3@eth1.
The switch ports (swp0-3) are under br0.
It is desired that eth0 is turned into another switched port that communicates
with swp0-3.

If br0 has vlan_filtering 0, then eth0 can simply be added to br0 with the
intended results.
If br0 has vlan_filtering 1, then a new br1 interface needs to be created that
enslaves eth0 and eth1 (the DSA master of the switch ports).
This is because in
this mode, the switch ports beneath br0 are not capable of regular traffic, and
are only used as a conduit for switchdev operations.

Offloads
========

Time-aware scheduling
---------------------

The switch supports a variation of the enhancements for scheduled traffic
specified in IEEE 802.1Q-2018 (formerly 802.1Qbv). This means it can be used to
ensure deterministic latency for priority traffic that is sent in-band with its
gate-open event in the network schedule.

This capability can be managed through the tc-taprio offload ('flags 2'). The
difference compared to the software implementation of taprio is that the latter
would only be able to shape traffic originated from the CPU, but not
autonomously forwarded flows.

The device has 8 traffic classes, and maps incoming frames to one of them based
on the VLAN PCP bits (if no VLAN is present, the port-based default is used).
As described in the previous sections, depending on the value of
``vlan_filtering``, the EtherType recognized by the switch as being VLAN can
either be the typical 0x8100 or a custom value used internally by the driver
for tagging. Therefore, the switch ignores the VLAN PCP if used in standalone
or bridge mode with ``vlan_filtering=0``, as it will not recognize the 0x8100
EtherType. In these modes, injecting into a particular TX queue can only be
done by the DSA net devices, which populate the PCP field of the tagging header
on egress. Using ``vlan_filtering=1``, the behavior is the other way around:
offloaded flows can be steered to TX queues based on the VLAN PCP, but the DSA
net devices are no longer able to do that. To inject frames into a hardware TX
queue with VLAN awareness active, it is necessary to create a VLAN
sub-interface on the DSA master port, and send normal (0x8100) VLAN-tagged
frames towards the switch, with the VLAN PCP bits set appropriately.
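
As a rough sketch of the VLAN sub-interface approach described above, the
snippet below prints (rather than runs) the iproute2 commands that would create
such a sub-interface. The names are assumptions for illustration only: ``eth1``
is taken to be the DSA master, VID 100 is a hypothetical VLAN that is assumed
to be already allowed on the relevant switch ports, and ``qos_map`` is a
made-up helper. The identity mapping from socket priority to PCP is also just
one possible choice.

```shell
# Hypothetical values: adjust to the actual DSA master and VLAN in use.
MASTER=eth1
VID=100

# Build an identity "skb priority -> VLAN PCP" map: "0:0 1:1 ... 7:7".
qos_map() {
    local prio map=""
    for prio in 0 1 2 3 4 5 6 7; do
        map="${map:+${map} }${prio}:${prio}"
    done
    echo "${map}"
}

# Dry run: print the commands so they can be reviewed before running as root.
echo "ip link add link ${MASTER} name ${MASTER}.${VID} type vlan id ${VID} egress-qos-map $(qos_map)"
echo "ip link set dev ${MASTER}.${VID} up"
```

With such a map in place, an application sending through the sub-interface
would select the PCP (and therefore the hardware TX queue) by setting the
socket priority, for example with the ``SO_PRIORITY`` socket option.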

Management traffic (having DMAC 01-80-C2-xx-xx-xx or 01-19-1B-xx-xx-xx) is the
notable exception: the switch always treats it with a fixed priority and
disregards any VLAN PCP bits even if present. The traffic class for management
traffic has a value of 7 (highest priority) at the moment, which is not
configurable in the driver.

Below is an example of configuring a 500 us cyclic schedule on egress port
``swp5``. The traffic class gate for management traffic (7) is open for 100 us,
and the gates for all other traffic classes are open for 400 us::

    #!/bin/bash

    set -e -u -o pipefail

    NSEC_PER_SEC="1000000000"

    # Convert a space-separated list of traffic classes into a hex gate mask.
    gatemask() {
        local tc_list="$1"
        local mask=0

        for tc in ${tc_list}; do
            mask=$((${mask} | (1 << ${tc})))
        done

        printf "%02x" ${mask}
    }

    if ! systemctl is-active --quiet ptp4l; then
        echo "Please start the ptp4l service"
        exit 1
    fi

    now=$(phc_ctl /dev/ptp1 get | gawk '/clock time is/ { print $5; }')
    # Phase-align the base time to the start of the next second.
    sec=$(echo "${now}" | gawk -F. '{ print $1; }')
    base_time="$(((${sec} + 1) * ${NSEC_PER_SEC}))"

    tc qdisc add dev swp5 parent root handle 100 taprio \
        num_tc 8 \
        map 0 1 2 3 4 5 6 7 \
        queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
        base-time ${base_time} \
        sched-entry S $(gatemask 7) 100000 \
        sched-entry S $(gatemask "0 1 2 3 4 5 6") 400000 \
        flags 2

It is possible to apply the tc-taprio offload on multiple egress ports. There
are hardware restrictions related to the fact that no gate event may trigger
simultaneously on two ports. The driver checks the consistency of the schedules
against this restriction and errors out when appropriate. Schedule analysis is
needed to avoid this, which is outside the scope of this document.
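
To make the restriction more concrete, the following is a minimal sketch of a
pre-flight check that could be run before installing two schedules. It is not
the driver's algorithm: the helper names (``gcd``, ``cycle_len``,
``gate_slots``, ``schedules_conflict``) are hypothetical, each schedule is
described as a base-time offset followed by its interval durations in
nanoseconds (at least one non-zero interval is assumed), and two gate events
are treated as simultaneous when they fall into the same 200 ns slot, which is
an assumption about the hardware granularity.

```shell
SLOT_NS=200    # assumed granularity of a gate event, in nanoseconds

# Greatest common divisor of two positive integers.
gcd() {
    local a=$1 b=$2 t
    while [ "$b" -ne 0 ]; do
        t=$((a % b)); a=$b; b=$t
    done
    echo "$a"
}

# Cycle length of a schedule: cycle_len <offset> <interval>...
cycle_len() {
    shift
    local sum=0 d
    for d in "$@"; do sum=$((sum + d)); done
    echo "$sum"
}

# Slot index of every gate event within one hyperperiod:
# gate_slots <hyperperiod> <offset> <interval>...
gate_slots() {
    local hyper=$1 t=$2 end d
    shift 2
    end=$((t + hyper))
    while [ "$t" -lt "$end" ]; do
        for d in "$@"; do
            echo $(( (t % hyper) / SLOT_NS ))
            t=$((t + d))
        done
    done
}

# Succeeds (returns 0) if the two schedules share at least one slot.
# schedules_conflict "<offset> <interval>..." "<offset> <interval>..."
schedules_conflict() {
    local ca cb hyper slots_b s
    ca=$(cycle_len $1)
    cb=$(cycle_len $2)
    # The hyperperiod is the least common multiple of both cycle lengths.
    hyper=$((ca / $(gcd "$ca" "$cb") * cb))
    slots_b=" $(echo $(gate_slots "$hyper" $2)) "
    for s in $(gate_slots "$hyper" $1); do
        case "$slots_b" in
            *" $s "*) return 0 ;;
        esac
    done
    return 1
}

# Two ports using the 100 us / 400 us schedule with identical base times.
if schedules_conflict "0 100000 400000" "0 100000 400000"; then
    echo "conflict: stagger one of the base times"
fi
```

In this sketch, shifting the second schedule by 50 us (``"50000 100000
400000"``) makes the check pass, because its gate events then land in
different slots than those of the first port.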

Routing actions (redirect, trap, drop)
--------------------------------------

The switch is able to offload flow-based redirection of packets to a set of
destination ports specified by the user. Internally, this is implemented by
making use of Virtual Links, a TTEthernet concept.

The driver supports 2 types of keys for Virtual Links:

- VLAN-aware virtual links: these match on destination MAC address, VLAN ID and
  VLAN PCP.
- VLAN-unaware virtual links: these match on destination MAC address only.

The VLAN awareness state of the bridge (vlan_filtering) cannot be changed while
there are virtual link rules installed.

Composing multiple actions inside the same rule is supported. When only routing
actions are requested, the driver creates a "non-critical" virtual link. When
the action list also contains tc-gate (more details below), the virtual link
becomes "time-critical" (draws frame buffers from a reserved memory partition,
etc).

The 3 routing actions that are supported are "trap", "drop" and "redirect".

Example 1: send frames received on swp2 with a DA of 42:be:24:9b:76:20 to the
CPU and to swp3. This type of key (DA only) is used when the port's VLAN
awareness state is off::

    tc qdisc add dev swp2 clsact
    tc filter add dev swp2 ingress flower skip_sw dst_mac 42:be:24:9b:76:20 \
        action mirred egress redirect dev swp3 \
        action trap

Example 2: drop frames received on swp2 with a DA of 42:be:24:9b:76:20, a VID
of 100 and a PCP of 0::

    tc filter add dev swp2 ingress protocol 802.1Q flower skip_sw \
        dst_mac 42:be:24:9b:76:20 vlan_id 100 vlan_prio 0 action drop

Time-based ingress policing
---------------------------

The TTEthernet hardware abilities of the switch can be constrained to act
similarly to the Per-Stream Filtering and Policing (PSFP) clause specified in
IEEE 802.1Q-2018 (formerly 802.1Qci).
This means it can be used to perform
tight timing-based admission control for up to 1024 flows (identified by a
tuple composed of destination MAC address, VLAN ID and VLAN PCP). Packets which
are received outside their expected reception window are dropped.

This capability can be managed through the offload of the tc-gate action. As
routing actions are intrinsic to virtual links in TTEthernet (which performs
explicit routing of time-critical traffic and does not leave that in the hands
of the FDB, flooding etc), the tc-gate action may never appear alone when
asking sja1105 to offload it. One (or more) redirect or trap actions must also
follow along.

Example: create a tc-taprio schedule that is phase-aligned with a tc-gate
schedule (the clocks must be synchronized by a 1588 application stack, which is
outside the scope of this document). No packet delivered by the sender will be
dropped. Note that the reception window is larger than the transmission window
(and much more so, in this example) to compensate for the packet propagation
delay of the link (which can be determined by the 1588 application stack).

Receiver (sja1105)::

    tc qdisc add dev swp2 clsact
    now=$(phc_ctl /dev/ptp1 get | awk '/clock time is/ {print $5}') && \
        sec=$(echo $now | awk -F. '{print $1}') && \
        base_time="$(((sec + 2) * 1000000000))" && \
        echo "base time ${base_time}"
    tc filter add dev swp2 ingress flower skip_sw \
        dst_mac 42:be:24:9b:76:20 \
        action gate base-time ${base_time} \
        sched-entry OPEN  60000 -1 -1 \
        sched-entry CLOSE 40000 -1 -1 \
        action trap

Sender::

    now=$(phc_ctl /dev/ptp0 get | awk '/clock time is/ {print $5}') && \
        sec=$(echo $now | awk -F. '{print $1}') && \
        base_time="$(((sec + 2) * 1000000000))" && \
        echo "base time ${base_time}"
    tc qdisc add dev eno0 parent root taprio \
        num_tc 8 \
        map 0 1 2 3 4 5 6 7 \
        queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
        base-time ${base_time} \
        sched-entry S 01 50000 \
        sched-entry S 00 50000 \
        flags 2

The engine used to schedule the ingress gate operations is the same as the
one used for the tc-taprio offload. Therefore, the restrictions regarding the
fact that no two gate actions (either tc-gate or tc-taprio gates) may fire at
the same time (during the same 200 ns slot) still apply.

Conveniently, it is possible to share time-triggered virtual links across
more than 1 ingress port, via flow blocks. In this case, the restriction of
firing at the same time does not apply because there is a single schedule in
the system, that of the shared virtual link::

    tc qdisc add dev swp2 ingress_block 1 clsact
    tc qdisc add dev swp3 ingress_block 1 clsact
    tc filter add block 1 flower skip_sw dst_mac 42:be:24:9b:76:20 \
        action gate index 2 \
        base-time 0 \
        sched-entry OPEN 50000000 -1 -1 \
        sched-entry CLOSE 50000000 -1 -1 \
        action trap

Hardware statistics for each flow are also available ("pkts" counts the number
of dropped frames, which is a sum of frames dropped due to timing violations,
lack of destination ports and MTU enforcement checks). Byte-level counters are
not available.

Device Tree bindings and board design
=====================================

This section references ``Documentation/devicetree/bindings/net/dsa/sja1105.txt``
and aims to showcase some potential switch caveats.

RMII PHY role and out-of-band signaling
---------------------------------------

In the RMII spec, the 50 MHz clock signals are either driven by the MAC or by
an external oscillator (but not by the PHY).
But the spec is rather loose and devices go outside it in several ways.
Some PHYs go against the spec and may provide an output pin where they source
the 50 MHz clock themselves, in an attempt to be helpful.
On the other hand, the SJA1105 is only binary configurable - when in the RMII
MAC role it will also attempt to drive the clock signal. To prevent this from
happening it must be put in RMII PHY role.
But doing so has some unintended consequences.
In the RMII spec, the PHY can transmit extra out-of-band signals via RXD[1:0].
These are practically some extra code words (/J/ and /K/) sent prior to the
preamble of each frame. The MAC does not have this out-of-band signaling
mechanism defined by the RMII spec.
So when the SJA1105 port is put in PHY role to avoid having 2 drivers on the
clock signal, inevitably an RMII PHY-to-PHY connection is created. The SJA1105
emulates a PHY interface fully and generates the /J/ and /K/ symbols prior to
frame preambles, which the real PHY is not expected to understand. So the PHY
simply encodes the extra symbols received from the SJA1105-as-PHY onto the
100BASE-TX wire.
On the other side of the wire, some link partners might discard these extra
symbols, while others might choke on them and discard the entire Ethernet
frames that follow along. This looks like packet loss with some link partners
but not with others.
The take-away is that in RMII mode, the SJA1105 must be allowed to drive the
reference clock if connected to a PHY.

RGMII fixed-link and internal delays
------------------------------------

As mentioned in the bindings document, the second generation of devices has
tunable delay lines as part of the MAC, which can be used to establish the
correct RGMII timing budget.
When powered up, these can shift the Rx and Tx clocks with a phase difference
between 73.8 and 101.7 degrees.
The catch is that the delay lines need to lock onto a clock signal with a
stable frequency. This means that there must be at least 2 microseconds of
silence between the clock at the old vs at the new frequency. Otherwise the
lock is lost and the delay lines must be reset (powered down and back up).
In RGMII the clock frequency changes with link speed (125 MHz at 1000 Mbps, 25
MHz at 100 Mbps and 2.5 MHz at 10 Mbps), and link speed might change during the
AN process.
In the situation where the switch port is connected through an RGMII fixed-link
to a link partner whose link state life cycle is outside the control of Linux
(such as a different SoC), then the delay lines would remain unlocked (and
inactive) until there is manual intervention (ifdown/ifup on the switch port).
The take-away is that in RGMII mode, the switch's internal delays are only
reliable if the link partner never changes link speeds, or if it does, it does
so in a way that is coordinated with the switch port (practically, both ends of
the fixed-link are under control of the same Linux system).
As to why a fixed-link interface would ever change link speeds: there are
Ethernet controllers out there which come out of reset in 100 Mbps mode, and
their driver inevitably needs to change the speed and clock frequency if it's
required to work at gigabit.

MDIO bus and PHY management
---------------------------

The SJA1105 does not have an MDIO bus and does not perform in-band AN either.
Therefore there is no link state notification coming from the switch device.
A board would need to hook up the PHYs connected to the switch to any other
MDIO bus available to Linux within the system (e.g. to the DSA master's MDIO
bus). Link state management then works by the driver manually keeping in sync
(over SPI commands) the MAC link speed with the settings negotiated by the PHY.