acfcdb78 | 10-Dec-2024 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
With this port schedule:
tc qdisc replace dev $send_if parent root handle 100 taprio \ num_tc 8 queues 1@0 1@1 1@2 1@3 1@4
net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
With this port schedule:
tc qdisc replace dev $send_if parent root handle 100 taprio \ num_tc 8 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ map 0 1 2 3 4 5 6 7 \ base-time 0 cycle-time 10000 \ sched-entry S 01 1250 \ sched-entry S 02 1250 \ sched-entry S 04 1250 \ sched-entry S 08 1250 \ sched-entry S 10 1250 \ sched-entry S 20 1250 \ sched-entry S 40 1250 \ sched-entry S 80 1250 \ flags 2
ptp4l would fail to take TX timestamps of Pdelay_Resp messages like this:
increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug ptp4l[4134.168]: port 2: send peer delay response failed
It turns out that the driver can't take their TX timestamps because it can't transmit them in the first place. And there's nothing special about the Pdelay_Resp packets - they're just regular 68 byte packets. But with this taprio configuration, the switch would refuse to send even the ETH_ZLEN minimum packet size.
This should have definitely not been the case. When applying the taprio config, the driver prints:
mscc_felix 0000:00:00.5: port 0 tc 0 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 1 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 2 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 3 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 4 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 5 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 6 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
and thus, everything under 132 bytes - ETH_FCS_LEN should have been sent without problems. Yet it's not.
For the forwarding path, the configuration is fine, yet packets injected from Linux get stuck with this schedule no matter what.
The first hint that the static guard bands are the cause of the problem is that reverting Michael Walle's commit 297c4de6f780 ("net: dsa: felix: re-enable TAS guard band mode") made things work. It must be that the guard bands are calculated incorrectly.
I remembered that there is a magic constant in the driver, set to 33 ns for no logical reason other than experimentation, which says "never let the static guard bands get so large as to leave less than this amount of remaining space in the time slot, because the queue system will refuse to schedule packets otherwise, and they will get stuck". I had a hunch that my previous experimentally-determined value was only good for packets coming from the forwarding path, and that the CPU injection path needed more.
I came to the new value of 35 ns through binary search, after seeing that with 544 ns (the bit time required to send the Pdelay_Resp packet at gigabit) it works. Again, this is purely experimental, there's no logic and the manual doesn't say anything.
The new driver prints for this schedule look like this:
mscc_felix 0000:00:00.5: port 0 tc 0 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 1 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 2 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 3 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 4 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 5 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 6 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
So yes, the maximum MTU is now even smaller by 1 byte than before. This is maybe counter-intuitive, but makes more sense with a diagram of one time slot.
Before:
Gate open Gate close | | v 1250 ns total time slot duration v <----------------------------------------------------> <----><----------------------------------------------> 33 ns 1217 ns static guard band useful
Gate open Gate close | | v 1250 ns total time slot duration v <----------------------------------------------------> <-----><---------------------------------------------> 35 ns 1215 ns static guard band useful
The static guard band implemented by this switch hardware directly determines the maximum allowable MTU for that traffic class. The larger it is, the earlier the switch will stop scheduling frames for transmission, because otherwise they might overrun the gate close time (and avoiding that is the entire purpose of Michael's patch). So, we now have guard bands smaller by 2 ns, thus, in this particular case, we lose a byte of the maximum MTU.
Fixes: 11afdc6526de ("net: dsa: felix: tc-taprio intervals smaller than MTU should send at least one packet") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Michael Walle <mwalle@kernel.org> Link: https://patch.msgid.link/20241210132640.3426788-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
f1288fd7 | 15-Aug-2024 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: felix: fix VLAN tag loss on CPU reception with ocelot-8021q
There is a major design bug with ocelot-8021q, which is that it expects more of the hardware than the hardware can actually do.
net: dsa: felix: fix VLAN tag loss on CPU reception with ocelot-8021q
There is a major design bug with ocelot-8021q, which is that it expects more of the hardware than the hardware can actually do. The short summary of the issue is that when a port is under a VLAN-aware bridge and we use this tagging protocol, VLAN upper interfaces of this port do not see RX traffic.
We use VCAP ES0 (egress rewriter) rules towards the tag_8021q CPU port to encapsulate packets with an outer tag, later stripped by software, that depends on the source user port. We do this so that packets can be identified in ocelot_rcv(). To be precise, we create rules with push_outer_tag = OCELOT_ES0_TAG and push_inner_tag = 0.
With this configuration, we expect the switch to keep the inner tag configuration as found in the packet (if it was untagged on user port ingress, keep it untagged, otherwise preserve the VLAN tag unmodified as the inner tag towards the tag_8021q CPU port). But this is not what happens.
Instead, table "Tagging Combinations" from the user manual suggests that when the ES0 action is "PUSH_OUTER_TAG=1 and PUSH_INNER_TAG=0", there will be "no inner tag". Experimentation further clarifies what this means.
It appears that this "inner tag" which is not pushed into the packet on its egress towards the CPU is none other than the classified VLAN.
When the ingress user port is standalone or under a VLAN-unaware bridge, the classified VLAN is a discardable quantity: it is a fixed value - the result of ocelot_vlan_unaware_pvid()'s configuration, and actually independent of the VID from any 802.1Q header that may be in the frame. It is actually preferable to discard the "inner tag" in this case.
The problem is when the ingress port is under a VLAN-aware bridge. Then, the classified VLAN is taken from the frame's 802.1Q header, with a fallback on the bridge port's PVID. It would be very good to not discard the "inner tag" here, because if we do, we break communication with any 8021q VLAN uppers that the port might have. These have a processing path outside the bridge.
There seems to be nothing else we can do except to change the configuration for VCAP ES0 rules, to actually push the inner VLAN into the frame. There are 2 options for that, first is to push a fixed value specified in the rule, and second is to push a fixed value, plus (aka arithmetic +) the classified VLAN. We choose the second option, and we select that fixed value as 0. Thus, what is pushed in the inner tag is just the classified VLAN.
From there, we need to perform software untagging, in the receive path, of stuff that was untagged on the wire.
Fixes: 7c83a7c539ab ("net: dsa: add a second tagger for Ocelot switches based on tag_8021q") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
a4303941 | 30-May-2024 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: ocelot: unexport felix_phylink_mac_ops and felix_switch_ops
Now that the common felix_register_switch() from the umbrella driver is the only entity that accesses these data structures, we
net: dsa: ocelot: unexport felix_phylink_mac_ops and felix_switch_ops
Now that the common felix_register_switch() from the umbrella driver is the only entity that accesses these data structures, we can remove them from the list of the exported symbols.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Sai Krishna <saikrishnag@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
efdbee7d | 30-May-2024 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: ocelot: common probing code
Russell King suggested that felix_vsc9959, seville_vsc9953 and ocelot_ext have a large portion of duplicated init code, which could be made common [1].
[1]: ht
net: dsa: ocelot: common probing code
Russell King suggested that felix_vsc9959, seville_vsc9953 and ocelot_ext have a large portion of duplicated init code, which could be made common [1].
[1]: https://lore.kernel.org/all/Zh1GvcOTXqb7CpQt@shell.armlinux.org.uk/
Here, we take the following common steps: - "felix" and "ds" structure allocation - "felix", "ocelot" and "ds" basic structure initialization - dsa_register_switch() call
and we make a common function out of them.
For every driver except felix_vsc9959, this is also the entire probing procedure. For felix_vsc9959, we also need to do some PCI-specific stuff, which can easily be reordered to be done before, and unwound on failure.
We also have to convert the bus-specific platform_set_drvdata() and pci_set_drvdata() calls into dev_set_drvdata(). But this should have no impact on the behavior.
Suggested-by: "Russell King (Oracle)" <linux@armlinux.org.uk> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Colin Foster <colin.foster@in-advantage.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
4ca54dd9 | 30-May-2024 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: ocelot: use ds->num_tx_queues = OCELOT_NUM_TC for all models
Russell King points out that seville_vsc9953 populates felix->info->num_tx_queues = 8, but this doesn't make it all the way int
net: dsa: ocelot: use ds->num_tx_queues = OCELOT_NUM_TC for all models
Russell King points out that seville_vsc9953 populates felix->info->num_tx_queues = 8, but this doesn't make it all the way into ds->num_tx_queues (which is how the user interface netdev queues get allocated) [1].
[1]: https://lore.kernel.org/all/20240415160150.yejcazpjqvn7vhxu@skbuf/
When num_tx_queues=0 for seville, this is implicitly converted to 1 by dsa_user_create(), and this is good enough for basic operation for a switch port. The tc qdisc offload layer works with netdev TX queues, so for QoS offload we need to pretend we have multiple TX queues. The VSC9953, like ocelot_ext, doesn't export QoS offload, so it doesn't really matter. But we can definitely set num_tx_queues=8 for all switches.
The felix->info->num_tx_queues construct itself seems unnecessary. It was introduced by commit de143c0e274b ("net: dsa: felix: Configure Time-Aware Scheduler via taprio offload") at a time when vsc9959 (LS1028A) was the only switch supported by the driver.
8 traffic classes, and 1 queue per traffic class, is a common architectural feature of all switches in the family. So they could all just set OCELOT_NUM_TC and be fine.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Colin Foster <colin.foster@in-advantage.com> Tested-by: Colin Foster <colin.foster@in-advantage.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
0367a177 | 30-May-2024 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: ocelot: move devm_request_threaded_irq() to felix_setup()
The current placement of devm_request_threaded_irq() is inconvenient. It is between the allocation of the "felix" structure and ds
net: dsa: ocelot: move devm_request_threaded_irq() to felix_setup()
The current placement of devm_request_threaded_irq() is inconvenient. It is between the allocation of the "felix" structure and dsa_register_switch(), both of which we'd like to refactor into a function that's common for all switches. But the IRQ is specific to felix_vsc9959.
A closer inspection of the felix_irq_handler() code suggests that it does things that depend on the data structures having been fully initialized. For example, ocelot_get_txtstamp() takes &port->tx_skbs.lock, which has only been initialized in ocelot_init_port() which has not run yet.
It is not one of those IRQF_SHARED IRQs, so CONFIG_DEBUG_SHIRQ_FIXME shouldn't apply here, and thus, it doesn't really matter, because in practice, the IRQ will not be triggered so early. Nonetheless, it is a good practice for the driver to be prepared for it to fire as soon as it is requested.
Create a new felix->info method for running custom code for vsc9959 from within felix_setup(), and move the request_irq() call there. The ocelot_ext should have an IRQ as well, so this should be a step in the right direction for that model (VSC7512) as well.
Some minor changes are made while moving the code. Casts from void * aren't necessary, so drop them, and rename felix_irq_handler() to the more specific vsc9959_irq_handler().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
4510bbd3 | 30-May-2024 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: ocelot: consistently use devres in felix_pci_probe()
Russell King suggested that felix_vsc9959, seville_vsc9953 and ocelot_ext have a large portion of duplicated init and teardown code, wh
net: dsa: ocelot: consistently use devres in felix_pci_probe()
Russell King suggested that felix_vsc9959, seville_vsc9953 and ocelot_ext have a large portion of duplicated init and teardown code, which could be made common [1]. The teardown code could even be simplified away if we made use of devres, something which is used here and there in the felix driver, just not very consistently.
[1] https://lore.kernel.org/all/Zh1GvcOTXqb7CpQt@shell.armlinux.org.uk/
Prepare the ground in the felix_vsc9959 driver, by allocating the data structures using devres and deleting the kfree() calls. This also deletes the "Failed to allocate ..." message, since memory allocation errors are extremely loud anyway, and it's hard to miss them.
Suggested-by: "Russell King (Oracle)" <linux@armlinux.org.uk> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
cc711c52 | 30-May-2024 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: ocelot: delete open coded status = "disabled" parsing
Since commit 6fffbc7ae137 ("PCI: Honor firmware's device disabled status"), PCI device drivers with OF bindings no longer need this ch
net: dsa: ocelot: delete open coded status = "disabled" parsing
Since commit 6fffbc7ae137 ("PCI: Honor firmware's device disabled status"), PCI device drivers with OF bindings no longer need this check.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
90ee9a5b | 30-May-2024 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: ocelot: use devres in seville_probe()
Russell King suggested that felix_vsc9959, seville_vsc9953 and ocelot_ext have a large portion of duplicated init and teardown code, which could be ma
net: dsa: ocelot: use devres in seville_probe()
Russell King suggested that felix_vsc9959, seville_vsc9953 and ocelot_ext have a large portion of duplicated init and teardown code, which could be made common [1]. The teardown code could even be simplified away if we made use of devres, something which is used here and there in the felix driver, just not very consistently.
[1] https://lore.kernel.org/all/Zh1GvcOTXqb7CpQt@shell.armlinux.org.uk/
Prepare the ground in the seville_vsc9953 driver, by allocating the data structures using devres and deleting the kfree() calls. This also deletes the "Failed to allocate ..." message, since memory allocation errors are extremely loud anyway, and it's hard to miss them.
Suggested-by: "Russell King (Oracle)" <linux@armlinux.org.uk> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|