==================================
vfio-ccw: the basic infrastructure
==================================

Introduction
------------

Here we describe the vfio support for I/O subchannel devices for
Linux/s390. The motivation for vfio-ccw is to pass subchannels through
to a virtual machine, with vfio as the means.

Unlike other hardware architectures, s390 has defined a unified I/O
access method, the so-called Channel I/O. It has its own access
patterns:

- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.

Thus when we introduce vfio support for these devices, we realize it
with a mediated device (mdev) implementation. The vfio mdev will be
added to an iommu group, so that it can be managed by the vfio
framework. And we add read/write callbacks for special vfio I/O
regions to pass the channel programs from the mdev to its parent device
(the real I/O subchannel device) to do further address translation and
to perform I/O instructions.

This document does not intend to explain the s390 I/O architecture in
every detail. More information and references can be found here:

- A good start to know Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing QEMU code which implements a simple emulated channel
  subsystem could also be a good reference. It makes it easier to follow
  the flow.
  qemu/hw/s390x/css.c

For vfio mediated device framework:

- Documentation/driver-api/vfio-mediated-device.rst

Motivation of vfio-ccw
----------------------

Typically, a guest virtualized via QEMU/KVM on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport.
This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However this is not enough. On s390, the majority of devices use the
standard Channel I/O based mechanism, and we also need to be able to
pass them through to a QEMU virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. We implement this vfio support for channel
devices via the vfio mediated device framework and the subchannel device
driver "vfio_ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture implements a so-called channel subsystem, which
provides a unified view of the devices physically attached to the
systems. Although the s390 hardware platform knows about a huge variety
of different peripheral attachments like disk devices (aka. DASDs),
tapes, communication controllers, etc., they can all be accessed by a
well defined access method and they present I/O completion in a unified
way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program is
a sequence of CCWs which are executed by the I/O channel subsystem. To
issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which specifies the format of
the CCWs and other control information to the system. The
operating system signals the I/O channel subsystem to begin executing
the channel program with a SSCH (start sub-channel) instruction. The
central processor is then free to proceed with non-I/O instructions
until interrupted.
The I/O completion result is received by the
interrupt handler in the form of an interrupt response block (IRB).

Back to vfio-ccw, in short:

- ORBs and channel programs are built in the guest kernel (with guest
  physical addresses).
- ORBs and channel programs are passed to the host kernel.
- The host kernel translates the guest physical addresses to real
  addresses and starts the I/O by issuing a privileged Channel I/O
  instruction (e.g. SSCH).
- Channel programs run asynchronously on a separate processor.
- I/O completion will be signaled to the host with I/O interruptions.
  And it will be copied as IRB to user space to pass it back to the
  guest.

Physical vfio ccw device and its child mdev
-------------------------------------------

As mentioned above, we realize vfio-ccw with a mdev implementation.

Channel I/O does not have IOMMU hardware support, so the physical
vfio-ccw device does not have an IOMMU level translation or isolation.

Subchannel I/O instructions are all privileged instructions. When
handling an intercepted I/O instruction, vfio-ccw polices and
translates the channel program in software before it gets sent to
hardware.

Within this implementation, we have two drivers for two types of
devices:

- The vfio_ccw driver for the physical subchannel device.
  This is an I/O subchannel driver for the real subchannel device. It
  realizes a group of callbacks and registers to the mdev framework as a
  parent (physical) device. As a consequence, mdev provides vfio_ccw a
  generic interface (sysfs) to create mdev devices. A vfio mdev could be
  created by vfio_ccw then and added to the mediated bus. It is the vfio
  device that is added to an IOMMU group and a vfio group.
  vfio_ccw also provides an I/O region to accept channel program
  requests from user space and store I/O interrupt results for user
  space to retrieve.
  To notify user space of an I/O completion, it offers
  an interface to set up an eventfd for asynchronous signaling.

- The vfio_mdev driver for the mediated vfio ccw device.
  This is provided by the mdev framework. It is a vfio device driver for
  the mdev that is created by vfio_ccw.
  It realizes a group of vfio device driver callbacks, adds itself to a
  vfio group, and registers itself to the mdev framework as a mdev
  driver.
  It uses a vfio iommu backend that uses the existing map and unmap
  ioctls, but rather than programming them into an IOMMU for a device,
  it simply stores the translations for use by later requests. This
  means that a device programmed in a VM with guest physical addresses
  can have the vfio kernel convert that address to process virtual
  address, pin the page and program the hardware with the host physical
  address in one step.
  For a mdev, the vfio iommu backend will not pin the pages during the
  VFIO_IOMMU_MAP_DMA ioctl. The mdev framework will only maintain a
  database of the iova<->vaddr mappings in this operation. It also
  exports vfio_pin_pages and vfio_unpin_pages interfaces from the vfio
  iommu backend for the physical devices to pin and unpin pages on
  demand.

Below is a high-level block diagram::

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_parent() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+

The process of how these work together.

1.
   vfio_ccw.ko drives the physical I/O subchannel, and registers the
   physical device (with callbacks) to the mdev framework.
   When vfio_ccw probes the subchannel device, it registers the device
   pointer and callbacks to the mdev framework. Mdev related file nodes
   under the device node in sysfs would be created for the subchannel
   device, namely 'mdev_create', 'mdev_destroy' and
   'mdev_supported_types'.
2. Create a mediated vfio ccw device.
   Using the 'mdev_create' sysfs file, we need to manually create one
   (and only one for our case) mediated device.
3. vfio_mdev.ko drives the mediated ccw device.
   vfio_mdev is also the vfio device driver. It will probe the mdev and
   add it to an iommu_group and a vfio_group. Then we could pass through
   the mdev to a guest.


VFIO-CCW Regions
----------------

The vfio-ccw driver exposes MMIO regions to accept requests from and return
results to userspace.

vfio-ccw I/O region
-------------------

An I/O region is used to accept channel program requests from user
space and store I/O interrupt results for user space to retrieve. The
definition of the region is::

  struct ccw_io_region {
  #define ORB_AREA_SIZE 12
          __u8    orb_area[ORB_AREA_SIZE];
  #define SCSW_AREA_SIZE 12
          __u8    scsw_area[SCSW_AREA_SIZE];
  #define IRB_AREA_SIZE 96
          __u8    irb_area[IRB_AREA_SIZE];
          __u32   ret_code;
  } __packed;

This region is always available.

While starting an I/O request, orb_area should be filled with the
guest ORB, and scsw_area should be filled with the SCSW of the Virtual
Subchannel.

irb_area stores the I/O result.

ret_code stores a return code for each access of the region. The following
values may occur:

``0``
  The operation was successful.

``-EOPNOTSUPP``
  The ORB specified transport mode or the
  SCSW specified a function other than the start function.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests, or an internal error occurred.

``-EBUSY``
  The subchannel was status pending or busy, or a request is already active.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EACCES``
  The channel path(s) used for the I/O were found to be not operational.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  The orb specified a chain longer than 255 ccws, or an internal error
  occurred.


vfio-ccw cmd region
-------------------

The vfio-ccw cmd region is used to accept asynchronous instructions
from userspace::

  #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
  #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)
  struct ccw_cmd_region {
         __u32 command;
         __u32 ret_code;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD.

Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region.

command specifies the command to be issued; ret_code stores a return code
for each access of the region. The following values may occur:

``0``
  The operation was successful.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  A command other than halt or clear was specified.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EBUSY``
  The subchannel was status pending or busy while processing a halt request.

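
As a rough illustration of the cmd region return codes listed above, the
following sketch shows how a user space program might prepare a request
and classify the ret_code it reads back. It is only a sketch under stated
assumptions: the struct is redeclared with stdint types and a GCC/Clang
packed attribute so that the example is self-contained, and the enum
cmd_outcome and the helpers cmd_region_prepare() and cmd_region_outcome()
are hypothetical names that are not part of the vfio-ccw interface.
Writing the region to the vfio device and reading it back is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
#define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)

/* Same layout as the region above, redeclared with stdint types. */
struct ccw_cmd_region {
    uint32_t command;
    uint32_t ret_code;
} __attribute__((packed));

static_assert(sizeof(struct ccw_cmd_region) == 8,
              "cmd region is two packed 32-bit words");

/* Hypothetical classification of ret_code, for illustration only. */
enum cmd_outcome {
    CMD_OK,               /* 0 */
    CMD_RETRY,            /* -EAGAIN: caller should retry */
    CMD_BUSY,             /* -EBUSY: status pending or busy during halt */
    CMD_NOT_OPERATIONAL,  /* -ENODEV: device not operational */
    CMD_ERROR             /* -EINVAL, -EIO or anything unexpected */
};

/* Fill in a request for halt or clear before it is written out. */
static inline void cmd_region_prepare(struct ccw_cmd_region *r,
                                      uint32_t command)
{
    r->command = command;
    r->ret_code = 0;
}

/* Map the negative errno stored in the unsigned ret_code field. */
static inline enum cmd_outcome
cmd_region_outcome(const struct ccw_cmd_region *r)
{
    switch (r->ret_code) {
    case 0:
        return CMD_OK;
    case (uint32_t)-EAGAIN:
        return CMD_RETRY;
    case (uint32_t)-EBUSY:
        return CMD_BUSY;
    case (uint32_t)-ENODEV:
        return CMD_NOT_OPERATIONAL;
    default:
        return CMD_ERROR;
    }
}
```

A caller would typically loop on CMD_RETRY and treat the remaining
outcomes as final; how often to retry is a policy decision that this
document does not prescribe.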

vfio-ccw schib region
---------------------

The vfio-ccw schib region is used to return Subchannel-Information
Block (SCHIB) data to userspace::

  struct ccw_schib_region {
  #define SCHIB_AREA_SIZE 52
         __u8 schib_area[SCHIB_AREA_SIZE];
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_SCHIB.

Reading this region triggers a STORE SUBCHANNEL to be issued to the
associated hardware.

vfio-ccw crw region
-------------------

The vfio-ccw crw region is used to return Channel Report Word (CRW)
data to userspace::

  struct ccw_crw_region {
         __u32 crw;
         __u32 pad;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_CRW.

Reading this region returns a CRW if one that is relevant for this
subchannel (e.g. one reporting changes in channel path state) is
pending, or all zeroes if not. If multiple CRWs are pending (including
possibly chained CRWs), reading this region again will return the next
one, until no more CRWs are pending and zeroes are returned. This is
similar to how STORE CHANNEL REPORT WORD works.

vfio-ccw operation details
--------------------------

vfio-ccw follows what vfio-pci did on the s390 platform and uses
vfio-iommu-type1 as the vfio iommu backend.

* CCW translation APIs
  A group of APIs (starting with `cp_`) to do CCW translation. The CCWs
  passed in by a user space program are organized with their guest
  physical memory addresses. These APIs will copy the CCWs into kernel
  space, and assemble a runnable kernel channel program by updating the
  guest physical addresses with their corresponding host physical addresses.
  Note that we have to use IDALs even for direct-access CCWs, as the
  referenced memory can be located anywhere, including above 2G.

* vfio_ccw device driver
  This driver utilizes the CCW translation APIs and introduces
  vfio_ccw, which is the driver for the I/O subchannel devices you want
  to pass through.
  vfio_ccw implements the following vfio ioctls::

    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS

  This provides an I/O region, so that the user space program can pass a
  channel program to the kernel, to do further CCW translation before
  issuing them to a real device.
  This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event
  notifier to notify the user space program of the I/O completion in an
  asynchronous way.

The use of vfio-ccw is not limited to QEMU, while QEMU is definitely a
good example for understanding how these patches work. Here is a little
bit more detail on how an I/O request triggered by the QEMU guest will
be handled (without error handling).

Explanation:

- Q1-Q7: QEMU side process.
- K1-K6: Kernel side process.

Q1.
    Get I/O region info during initialization.

Q2.
    Setup event notifier and handler to handle I/O completion.

... ...

Q3.
    Intercept a ssch instruction.
Q4.
    Write the guest channel program and ORB to the I/O region.

    K1.
        Copy from guest to kernel.
    K2.
        Translate the guest channel program to a host kernel space
        channel program, which becomes runnable for a real device.
    K3.
        With the necessary information contained in the orb passed in
        by QEMU, issue the ccwchain to the device.
    K4.
        Return the ssch CC code.
Q5.
    Return the CC code to the guest.

... ...

    K5.
        Interrupt handler gets the I/O result and writes the result to
        the I/O region.
    K6.
        Signal QEMU to retrieve the result.

Q6.
    Get the signal and event handler reads out the result from the I/O
    region.
Q7.
    Update the irb for the guest.

Limitations
-----------

The current vfio-ccw implementation focuses on supporting basic commands
needed to implement block device functionality (read/write) of DASD/ECKD
devices only. Some commands may need special handling in the future, for
example, anything related to path grouping.

DASD is a kind of storage device, while ECKD is a data recording format.
More information on DASD and ECKD can be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in QEMU, we can bring the passed
through DASD/ECKD device online in a guest now and use it as a block
device.

The current code allows the guest to start channel programs via
START SUBCHANNEL, and to issue HALT SUBCHANNEL, CLEAR SUBCHANNEL,
and STORE SUBCHANNEL.

Currently all channel programs are prefetched, regardless of the
p-bit setting in the ORB. As a result, self modifying channel
programs are not supported. For this reason, IPL has to be handled as
a special case by a userspace/guest program; this has been implemented
in QEMU's s390-ccw bios as of QEMU 4.1.

vfio-ccw supports classic (command mode) channel I/O only. Transport
mode (HPF) is not supported.

QDIO subchannels are currently not supported. Classic devices other than
DASD/ECKD might work, but have not been tested.

Reference
---------

1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
3. https://en.wikipedia.org/wiki/Channel_I/O
4. Documentation/arch/s390/cds.rst
5. Documentation/driver-api/vfio.rst
6. Documentation/driver-api/vfio-mediated-device.rst