.. SPDX-License-Identifier: GPL-2.0

===============
fwctl subsystem
===============

:Author: Jason Gunthorpe

Overview
========

Modern devices contain extensive amounts of FW, and in many cases, are largely
software-defined pieces of hardware. The evolution of this approach is largely a
reaction to Moore's Law where a chip tape out is now highly expensive, and the
chip design is extremely large. Replacing fixed HW logic with a flexible and
tightly coupled FW/HW combination is an effective risk mitigation against chip
respin. Problems in the HW design can be counteracted in device FW. This is
especially true for devices which present a stable and backwards compatible
interface to the operating system driver (such as NVMe).

The FW layer in devices has grown to incredible size and devices frequently
integrate clusters of fast processors to run it. For example, mlx5 devices have
over 30MB of FW code, and big configurations operate with over 1GB of FW managed
runtime state.

The availability of such a flexible layer has created quite a variety in the
industry where single pieces of silicon are now configurable software-defined
devices and can operate in substantially different ways depending on the need.
Further, we often see cases where specific sites wish to operate devices in ways
that are highly specialized and require applications that have been tailored to
their unique configuration.

In addition, devices have become multi-functional and integrated to the point
they no longer fit neatly into the kernel's division of subsystems. Modern
multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
subsystems while sharing the underlying hardware using the auxiliary device
system.

Altogether this creates a challenge for the operating system, where devices
have an expansive FW environment that needs robust device-specific debugging
support, and FW-driven functionality that is not well suited to “generic”
interfaces. fwctl seeks to allow access to the full device functionality from
user space in the areas of debuggability, management, and first-boot/nth-boot
provisioning.

fwctl is aimed at the common device design pattern where the OS and FW
communicate via an RPC message layer constructed with a queue or mailbox scheme.
In this case the driver will typically have some layer to deliver RPC messages
and collect RPC responses from device FW.
The in-kernel subsystem drivers that
operate the device for its primary purposes will use these RPCs to build their
drivers, but devices also usually have a set of ancillary RPCs that don't really
fit into any specific subsystem. For example, a HW RAID controller is primarily
operated by the block layer but also comes with a set of RPCs to administer the
construction of drives within the HW RAID.

In the past, when devices were more single function, individual subsystems would
grow different approaches to solving some of these common problems. For
instance, monitoring device health, manipulating its FLASH, debugging the FW,
and provisioning all have various unique interfaces across the kernel.

fwctl's purpose is to define a common set of limited rules, described below,
that allow user space to securely construct and execute RPCs inside device FW.
The rules serve as an agreement between the operating system and FW on how to
correctly design the RPC interface. As a uAPI the subsystem provides a thin
layer of discovery and a generic uAPI to deliver the RPCs and collect the
response. It supports a system of user space libraries and tools which will
use this interface to control the device using the device native protocols.

Scope of Action
---------------

fwctl drivers are strictly restricted to being a way to operate the device FW.
They are not an avenue to access random kernel internals, or other operating
system SW states.

fwctl instances must operate on a well-defined device function, and the device
should have a well-defined security model for what scope within the physical
device the function is permitted to access. For instance, the most complex PCIe
device today may broadly have several function-level scopes:

 1. A privileged function with full access to the on-device global state and
    configuration

 2. Multiple hypervisor functions, each with control over itself and the child
    functions used with VMs

 3. Multiple VM functions tightly scoped within the VM

The device may create a logical parent/child relationship between these scopes.
For instance, a child VM's FW may be within the scope of the hypervisor FW. It is
quite common in the VFIO world that the hypervisor environment has a complex
provisioning/profiling/configuration responsibility for the function VFIO
assigns to the VM.

Further, within the function, devices often have RPC commands that fall within
some general scopes of action (see enum fwctl_rpc_scope):

 1. Access to function & child configuration, FLASH, etc. that becomes live at a
    function reset.
    Access to function & child runtime configuration that is
    transparent or non-disruptive to any driver or VM.

 2. Read-only access to function debug information that may report on FW objects
    in the function & child, including FW objects owned by other kernel
    subsystems.

 3. Write access to function & child debug information strictly compatible with
    the principles of kernel lockdown and kernel integrity protection. Triggers
    a kernel Taint.

 4. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.

User space will provide a scope label on each RPC and the kernel must enforce the
above CAPs and taints based on that scope. A combination of kernel and FW can
enforce that RPCs are placed in the correct scope by user space.

Denied behavior
---------------

There are many things this interface must not allow user space to do (without a
Taint or CAP), broadly derived from the principles of kernel lockdown. Some
examples:

 1. DMA to/from arbitrary memory, hang the system, compromise FW integrity with
    untrusted code, or otherwise compromise device or system security and
    integrity.

 2. Provide an abnormal “back door” to kernel drivers.
    No manipulation of kernel
    objects owned by kernel drivers.

 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
    driver can react to the device configuration at function reset/driver load
    time, but otherwise must not be coupled to fwctl.

 4. Operate the HW in a way that overlaps with the core purpose of another
    primary kernel subsystem, such as read/write to LBAs, send/receive of
    network packets, or operate an accelerator's data plane.

fwctl is not a replacement for device direct access subsystems like uacce or
VFIO.

Operations exposed through fwctl's non-tainting interfaces should be fully
sharable with other users of the device. For instance, exposing an RPC through
fwctl should never prevent a kernel subsystem from also concurrently using that
same RPC or hardware unit down the road. In such cases fwctl will be less
important than proper kernel subsystems that eventually emerge. Mistakes in this
area resulting in clashes will be resolved in favour of a kernel implementation.

fwctl User API
==============

.. kernel-doc:: include/uapi/fwctl/fwctl.h
.. kernel-doc:: include/uapi/fwctl/mlx5.h
.. kernel-doc:: include/uapi/fwctl/pds.h

sysfs Class
-----------

fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
operates the ioctl uAPI described above.

fwctl devices can be related to driver components in other subsystems through
sysfs::

    $ ls /sys/class/fwctl/fwctl0/device/infiniband/
    ibp0s10f0

    $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
    fwctl0/

    $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
    dev  device  power  subsystem  uevent

User space Community
--------------------

Drawing inspiration from nvme-cli, participating in the kernel side must come
with a user space in a common TBD git tree, at a minimum to usefully operate the
kernel driver. Providing such an implementation is a pre-condition to merging a
kernel driver.

The goal is to build user space community around some of the shared problems
we all have, and ideally develop some common user space programs with some
starting themes of:

 - Device in-field debugging

 - HW provisioning

 - VFIO child device profiling before VM boot

 - Confidential Compute topics (attestation, secure provisioning)

that stretch across all subsystems in the kernel. fwupd is a great example of
how an excellent user space experience can emerge out of kernel-side diversity.

fwctl Kernel API
================

.. kernel-doc:: drivers/fwctl/main.c
   :export:
.. kernel-doc:: include/linux/fwctl.h

fwctl Driver design
-------------------

In many cases a fwctl driver is going to be part of a larger cross-subsystem
device possibly using the auxiliary_device mechanism. In that case several
subsystems are going to be sharing the same device and FW interface layer so the
device design must already provide for isolation and cooperation between kernel
subsystems. fwctl should fit into that same model.

Part of the driver should include a description of how its scope restrictions
and security model work. The driver and FW together must ensure that RPCs
provided by user space are mapped to the appropriate scope. If the validation is
done in the driver then the validation can read a 'command effects' report from
the device, or hardwire the enforcement. If the validation is done in the FW,
then the driver should pass the fwctl_rpc_scope to the FW along with the command.

The driver and FW must cooperate to ensure that either fwctl cannot allocate
any FW resources, or any resources it does allocate are freed on FD closure. A
driver primarily constructed around FW RPCs may find that its core PCI function
and RPC layer belong under fwctl with auxiliary devices connecting to other
subsystems.

Each device type must be mindful of Linux's philosophy for stable ABI. The FW
RPC interface does not have to meet a strictly stable ABI, but it does need to
meet an expectation that user space tools that are deployed and in significant
use don't needlessly break. FW upgrade and kernel upgrade should keep widely
deployed tooling working.

Development and debugging focused RPCs under more permissive scopes can have
less stability if the tools using them are only run under exceptional
circumstances and not for everyday use of the device.
Debugging tools may even
require exact version matching as they may require something similar to DWARF
debug information from the FW binary.

Security Response
=================

The kernel remains the gatekeeper for this interface. If violations of the
scopes, security or isolation principles are found, we have options to let
devices fix them with a FW update, push a kernel patch to parse and block RPC
commands, or push a kernel patch to block entire firmware versions/devices.

While the kernel can always directly parse and restrict RPCs, it is expected
that the existing kernel pattern of allowing drivers to delegate validation to
FW will be a useful design.

Existing Similar Examples
=========================

The approach described in this document is not a new idea. Direct, or near
direct device access has been offered by the kernel in different areas for
decades. With more devices wanting to follow this design pattern it is becoming
clear that it is not entirely well understood and, more importantly, the
security considerations are not well defined or agreed upon.

Some examples:

 - HW RAID controllers.
   This includes RPCs to do things like compose drives into
   a RAID volume, configure RAID parameters, monitor the HW and more.

 - Baseboard managers. RPCs for configuring settings in the device and more.

 - NVMe vendor command capsules. nvme-cli provides access to some monitoring
   functions that different products have defined, but more exist.

 - CXL also has an NVMe-like vendor command system.

 - DRM allows user space drivers to send commands to the device via kernel
   mediation.

 - RDMA allows user space drivers to directly push commands to the device
   without kernel involvement.

 - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc.

The first four are examples of areas that fwctl intends to cover. The latter
three are examples of denied behavior as they fully overlap with the primary
purpose of a kernel subsystem.

A key lesson learned from these past efforts is the importance of having a
common user space project to use as a pre-condition for obtaining a kernel
driver. Developing good community around useful software in user space is key to
getting companies to fund participation to enable their products.
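
As an illustration of the scope rules listed under "Scope of Action", the
following is a minimal user space model of how a per-RPC scope label could map
to the capability and taint requirements. It is a sketch only, not the
subsystem's implementation: the ``example_`` names and the choice of error
codes are assumptions made for this example, and the real scope definition
is enum fwctl_rpc_scope in include/uapi/fwctl/fwctl.h.

.. code-block:: c

    #include <assert.h>
    #include <errno.h>
    #include <stdbool.h>

    /* Hypothetical mirror of the four scopes of action described above. */
    enum example_rpc_scope {
        EXAMPLE_SCOPE_CONFIGURATION,    /* 1. configuration access */
        EXAMPLE_SCOPE_DEBUG_READ_ONLY,  /* 2. read-only debug information */
        EXAMPLE_SCOPE_DEBUG_WRITE,      /* 3. lockdown-compatible debug write */
        EXAMPLE_SCOPE_DEBUG_WRITE_FULL, /* 4. full debug device access */
    };

    /*
     * Hypothetical check: returns 0 if the RPC may proceed, or a negative
     * errno. *taint reports whether the scope requires a kernel taint.
     * has_rawio stands in for a CAP_SYS_RAWIO check on the caller.
     */
    static int example_check_scope(enum example_rpc_scope scope,
                                   bool has_rawio, bool *taint)
    {
        *taint = false;

        switch (scope) {
        case EXAMPLE_SCOPE_CONFIGURATION:
        case EXAMPLE_SCOPE_DEBUG_READ_ONLY:
            return 0;
        case EXAMPLE_SCOPE_DEBUG_WRITE:
            *taint = true;
            return 0;
        case EXAMPLE_SCOPE_DEBUG_WRITE_FULL:
            if (!has_rawio)
                return -EPERM;
            *taint = true;
            return 0;
        }
        /* Unknown scope label supplied by user space. */
        return -EINVAL;
    }

    /* Usage: exercise each scope with and without the capability. */
    static void example_self_check(void)
    {
        bool taint;

        assert(example_check_scope(EXAMPLE_SCOPE_CONFIGURATION, false, &taint) == 0);
        assert(!taint);
        assert(example_check_scope(EXAMPLE_SCOPE_DEBUG_WRITE, false, &taint) == 0);
        assert(taint);
        assert(example_check_scope(EXAMPLE_SCOPE_DEBUG_WRITE_FULL, false, &taint) == -EPERM);
        assert(example_check_scope(EXAMPLE_SCOPE_DEBUG_WRITE_FULL, true, &taint) == 0);
        assert(taint);
    }

The model covers only the capability and taint decision. Checking that a given
RPC actually belongs to the scope user space labelled it with remains the job
of the driver and FW, as described under "fwctl Driver design".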