.. SPDX-License-Identifier: GPL-2.0

==============================
How To Write Linux PCI Drivers
==============================

:Authors: - Martin Mares <mj@ucw.cz>
          - Grant Grundler <grundler@parisc-linux.org>

The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is that PCI support
in the Linux kernel is not as trivial as one would wish. This short paper
tries to introduce all potential driver authors to Linux APIs for
PCI device drivers.

A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under Creative Commons License) from:
https://lwn.net/Kernel/LDD3/.

However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here.

Please send questions/comments/patches about Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.


Structure of PCI drivers
========================
PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around. When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified.
Details on this below.

pci_register_driver() leaves most of the probing for devices to
the PCI layer and supports online insertion/removal of devices [thus
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
The pci_register_driver() call requires passing in a table of function
pointers and thus dictates the high level structure of a driver.

Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (pci_allocate_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
  - Enable DMA/processing engines

When done using the device, and perhaps the module needs to be unloaded,
the driver needs to take the following steps:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and coherent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Release MMIO/IOP resources
  - Disable the device

Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h>.

If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
the PCI functions described below are defined as inline functions that are
either completely empty or just return an appropriate error code, to avoid
lots of ifdefs in the drivers.


pci_register_driver() call
==========================

PCI device drivers call ``pci_register_driver()`` during their
initialization with a pointer to a structure describing the driver
(``struct pci_driver``):

.. kernel-doc:: include/linux/pci.h
   :functions: pci_driver

The ID table is an array of ``struct pci_device_id`` entries ending with an
all-zero entry. Definitions with static const are generally preferred.

.. kernel-doc:: include/linux/mod_devicetable.h
   :functions: pci_device_id

Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up
a pci_device_id table.
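
To make the role of the wildcard ID and of the class mask concrete, here is
a minimal userspace sketch of how one ID table entry might be compared
against a device. It is not kernel code: ``struct demo_pci_id``,
``struct demo_pci_dev``, ``demo_match_one()`` and ``demo_match_table()`` are
hypothetical names invented for this illustration, and the matching rule
reflects our reading of the PCI core rather than an authoritative copy of
it. Refer to the kernel source for the real logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; the real types live in the kernel. */
#define DEMO_PCI_ANY_ID 0xffffffffu

struct demo_pci_id {
	uint32_t vendor, device;         /* IDs to match, or DEMO_PCI_ANY_ID */
	uint32_t subvendor, subdevice;   /* subsystem IDs, or DEMO_PCI_ANY_ID */
	uint32_t class_code, class_mask; /* class value and the bits that count */
};

struct demo_pci_dev {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t class_code;
};

/* Return 1 if one table entry matches the device, 0 otherwise. */
int demo_match_one(const struct demo_pci_id *id,
		   const struct demo_pci_dev *dev)
{
	assert(id != NULL && dev != NULL);

	return (id->vendor == DEMO_PCI_ANY_ID || id->vendor == dev->vendor) &&
	       (id->device == DEMO_PCI_ANY_ID || id->device == dev->device) &&
	       (id->subvendor == DEMO_PCI_ANY_ID ||
		id->subvendor == dev->subvendor) &&
	       (id->subdevice == DEMO_PCI_ANY_ID ||
		id->subdevice == dev->subdevice) &&
	       /* Only the class bits selected by class_mask are compared. */
	       !((id->class_code ^ dev->class_code) & id->class_mask);
}

/* Walk a table that ends with an all-zero entry; return the first match. */
const struct demo_pci_id *demo_match_table(const struct demo_pci_id *table,
					   const struct demo_pci_dev *dev)
{
	for (; table->vendor || table->subvendor || table->class_mask; table++)
		if (demo_match_one(table, dev))
			return table;
	return NULL;
}
```

In this model an entry with a class_mask of 0 ignores the device class
entirely, while an entry with wildcard IDs and a non-zero class_mask matches
purely by class. This is the distinction between the two kinds of table
entries mentioned above.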

New PCI IDs may be added to a device driver pci_ids table at runtime
as shown below::

  echo "vendor device subvendor subdevice class class_mask driver_data" > \
  /sys/bus/pci/drivers/{driver}/new_id

All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory, the others are optional. Users
need pass only as many optional fields as necessary:

  - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
  - class and class_mask fields default to 0
  - driver_data defaults to 0UL.

Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver. This makes the driver_data field mandatory
if all the pci_device_id entries have a non-zero driver_data value.

Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.

When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver.


"Attributes" for driver functions/data
--------------------------------------

Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):

  ======  =================================================
  __init  Initialization code. Thrown away after the driver
          initializes.
  __exit  Exit code. Ignored for non-modular drivers.
  ======  =================================================

Tips on when/where to use the above attributes:
  - The module_init()/module_exit() functions (and all
    initialization functions called _only_ from these)
    should be marked __init/__exit.

  - Do not mark the struct pci_driver.

  - Do NOT mark a function if you are not sure which mark to use.
    Better to not mark the function than mark the function wrong.


How to find PCI devices manually
================================

PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices.
The main reason PCI devices are controlled by multiple drivers
is that one PCI device implements several different HW services.
E.g. combined serial/parallel port/floppy controller.

A manual search may be performed using the following constructs:

Searching by vendor and device ID::

    struct pci_dev *dev = NULL;

    while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)))
            configure_device(dev);

Searching by class ID (iterate in a similar way)::

    pci_get_class(CLASS_ID, dev)

Searching by both vendor/device and subsystem vendor/device ID::

    pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev)

You can use the constant PCI_ANY_ID as a wildcard replacement for
VENDOR_ID or DEVICE_ID. This allows searching for any device from a
specific vendor, for example.

These functions are hotplug-safe. They increment the reference count on
the pci_dev that they return. You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put().


Device Initialization Steps
===========================

As noted in the introduction, most PCI drivers need the following steps
for device initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (pci_allocate_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
  - Enable DMA/processing engines.

The driver can access PCI config space registers at any time.
(Well, almost.
When running BIST, config space can go away...but
that will just result in a PCI Bus Master Abort and config reads
will return garbage).


Enable the PCI device
---------------------
Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device(). This will:

  - wake up the device if it was in suspended state,
  - allocate I/O and memory regions of the device (if BIOS did not),
  - allocate an IRQ (if BIOS did not).

.. note::
   pci_enable_device() can fail! Check the return value.

.. warning::
   OS BUG: we don't check resource allocations before enabling those
   resources. The sequence would make more sense if we called
   pci_request_resources() before calling pci_enable_device().
   Currently, the device drivers can't detect the bug when two
   devices have been allocated the same range. This is not a common
   problem and unlikely to get fixed soon.

   This has been discussed before but not changed as of 2.6.19:
   https://lore.kernel.org/r/20060302180025.GC28895@flint.arm.linux.org.uk/


pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register. It also fixes the latency timer value if
it's set to something bogus by the BIOS. pci_clear_master() will
disable DMA by clearing the bus master bit.

If the PCI device can use the PCI Memory-Write-Invalidate transaction,
call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval
and also ensures that the cache line size register is set correctly.
Check the return value of pci_set_mwi() as not all architectures
or chip-sets may support Memory-Write-Invalidate. Alternatively,
if Mem-Wr-Inval would be nice to have but is not required, call
pci_try_set_mwi() to have the system do its best effort at enabling
Mem-Wr-Inval.
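
The bit manipulation behind the calls above can be pictured with a small
userspace sketch that treats the command register as a plain 16-bit word.
The ``DEMO_CMD_*`` names and the ``demo_cmd_*()`` helpers are hypothetical;
the bit values follow the standard PCI command register layout as we
understand it. A real driver should call pci_set_master(), pci_set_mwi() and
friends rather than editing the register by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Bits of the standard PCI command register (demo names, not kernel ones). */
#define DEMO_CMD_IO         0x0001u /* respond to I/O port accesses */
#define DEMO_CMD_MEMORY     0x0002u /* respond to memory (MMIO) accesses */
#define DEMO_CMD_MASTER     0x0004u /* device may act as bus master (DMA) */
#define DEMO_CMD_INVALIDATE 0x0010u /* use Memory-Write-Invalidate */
#define DEMO_CMD_KNOWN      (DEMO_CMD_IO | DEMO_CMD_MEMORY | \
			     DEMO_CMD_MASTER | DEMO_CMD_INVALIDATE)

/* Return the command word with the requested bits set. */
uint16_t demo_cmd_set(uint16_t cmd, uint16_t bits)
{
	assert((bits & ~DEMO_CMD_KNOWN) == 0); /* only bits this sketch models */
	return (uint16_t)(cmd | bits);
}

/* Return the command word with the requested bits cleared. */
uint16_t demo_cmd_clear(uint16_t cmd, uint16_t bits)
{
	assert((bits & ~DEMO_CMD_KNOWN) == 0);
	return (uint16_t)(cmd & (uint16_t)~bits);
}

/* Return 1 if bus mastering (DMA) is enabled in the command word. */
int demo_cmd_dma_enabled(uint16_t cmd)
{
	return (cmd & DEMO_CMD_MASTER) != 0;
}
```

In this model, setting ``DEMO_CMD_MASTER`` corresponds to what
pci_set_master() does to the bus master bit, and clearing it corresponds to
pci_clear_master(). The latency timer and cache line size handling mentioned
above is not modelled.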


Request MMIO/IOP resources
--------------------------
Memory (MMIO) and I/O port addresses should NOT be read directly
from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.

See Documentation/driver-api/io-mapping.rst for how to access device registers
or device memory.

The device driver needs to call pci_request_region() to verify
no other device is already using the same address resource.
Conversely, drivers should call pci_release_region() AFTER
calling pci_disable_device().
The idea is to prevent two devices colliding on the same address range.

.. tip::
   See OS BUG comment above. Currently (2.6.19), the driver can only
   determine MMIO and IO Port resource availability _after_ calling
   pci_enable_device().

Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges).
Use these for address resources that are not described by "normal" PCI
BARs.

Also see pci_request_selected_regions() below.


Set the DMA mask size
---------------------
.. note::
   If anything below doesn't make sense, please refer to
   Documentation/core-api/dma-api.rst. This section is just a reminder that
   drivers need to indicate DMA capabilities of the device and is not
   an authoritative source for DMA interfaces.

While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
32-bit bus master capability for streaming data need the driver
to "register" this capability by calling pci_set_dma_mask() with
appropriate parameters. In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.
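
Why the mask matters for memory above 4G can be shown with a little
arithmetic. The userspace sketch below builds an n-bit mask and checks
whether a buffer lies entirely within the range a device can address.
``demo_dma_bit_mask()`` and ``demo_dma_reachable()`` are hypothetical helpers
written for this document; the kernel has its own mask macro and
reachability checks, which this sketch only approximates.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `bits` bits set, e.g. 32 -> 0xffffffff. */
uint64_t demo_dma_bit_mask(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	return bits == 64 ? ~0ULL : (1ULL << bits) - 1;
}

/* Return 1 if the whole buffer [addr, addr + len) is reachable under mask. */
int demo_dma_reachable(uint64_t addr, uint64_t len, uint64_t mask)
{
	assert(len > 0);
	/* Reject buffers whose last byte would wrap past 2^64. */
	if (addr + (len - 1) < addr)
		return 0;
	return addr + (len - 1) <= mask;
}
```

With a 32-bit mask, a page that ends exactly at the 4G boundary is still
reachable, while a page that starts at 4G is not. A device limited this way
typically needs such buffers to be bounced or remapped, which is the
inefficiency a wider mask avoids.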

Drivers for all PCI-X and PCIe compliant devices must call
pci_set_dma_mask() as they are 64-bit DMA devices.

Similarly, drivers must also "register" this capability if the device
can directly address "consistent memory" in System RAM above 4G physical
address by calling pci_set_consistent_dma_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
("consistent") data.


Setup shared control data
-------------------------
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory. See Documentation/core-api/dma-api.rst for a full description of
the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device.


Initialize device registers
---------------------------
Some drivers will need specific "capability" fields programmed
or other "vendor specific" registers initialized or reset.
E.g. clearing pending interrupts.


Register IRQ handler
--------------------
While calling request_irq() is the last step described here,
this is often just another intermediate step to initialize a device.
This step can often be deferred until the device is opened for use.

All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
can be shared).

request_irq() will associate an interrupt handler and device handle
with an interrupt number. Historically interrupt numbers represent
IRQ lines which run from the PCI device to the Interrupt controller.
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".

request_irq() also enables the interrupt.
Make sure the device is
quiesced and does not have any interrupts pending before registering
the interrupt handler.

MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts"
which deliver interrupts to the CPU via a DMA write to a Local APIC.
The fundamental difference between MSI and MSI-X is how multiple
"vectors" get allocated. MSI requires contiguous blocks of vectors
while MSI-X can allocate several individual ones.

MSI capability can be enabled by calling pci_alloc_irq_vectors() with the
PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This
causes the PCI support to program CPU vector data into the PCI device
capability registers. Many architectures, chip-sets, or BIOSes do NOT
support MSI or MSI-X and a call to pci_alloc_irq_vectors() with just
the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always
specify PCI_IRQ_LEGACY as well.

Drivers that have different interrupt handlers for MSI/MSI-X and
legacy INTx should choose the right one based on the msi_enabled
and msix_enabled flags in the pci_dev structure after calling
pci_alloc_irq_vectors().

There are (at least) two really good reasons for using MSI:

1) MSI is an exclusive interrupt vector by definition.
   This means the interrupt handler doesn't have to verify
   its device caused the interrupt.

2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed
   to be visible to the host CPU(s) when the MSI is delivered. This
   is important for both data coherency and avoiding stale control data.
   This guarantee allows the driver to omit MMIO reads to flush
   the DMA stream.

See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples
of MSI/MSI-X usage.


PCI device shutdown
===================

When a PCI device driver is being unloaded, most of the following
steps need to be performed:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and consistent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Disable device from responding to MMIO/IO Port addresses
  - Release MMIO/IO Port resource(s)


Stop IRQs on the device
-----------------------
How to do this is chip/device specific. If it's not done, it opens
the possibility of a "screaming interrupt" if (and only if)
the IRQ is shared with another device.

When the shared IRQ handler is "unhooked", the remaining devices
using the same IRQ line will still need the IRQ enabled. Thus if the
"unhooked" device asserts the IRQ line, the system will respond assuming
that one of the remaining devices asserted the IRQ line. Since none
of the other devices will handle the IRQ, the system will "hang" until
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
iterations later). Once the shared IRQ is masked, the remaining devices
will stop functioning properly. Not a nice situation.

This is another reason to use MSI or MSI-X if it's available.
MSI and MSI-X are defined to be exclusive interrupts and thus
are not susceptible to the "screaming interrupt" problem.


Release the IRQ
---------------
Once the device is quiesced (no more IRQs), one can call free_irq().
This function will return control once any pending IRQs are handled,
"unhook" the driver's IRQ handler from that IRQ, and finally release
the IRQ if no one else is using it.


Stop all DMA activity
---------------------
It's extremely important to stop all DMA operations BEFORE attempting
to deallocate DMA control data.
Failure to do so can result in memory
corruption, hangs, and on some chip-sets a hard crash.

Stopping DMA after stopping the IRQs can avoid races where the
IRQ handler might restart DMA engines.

While this step sounds obvious and trivial, several "mature" drivers
didn't get this step right in the past.


Release DMA buffers
-------------------
Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream"
owners if there are any.

Then clean up "consistent" buffers which contain the control data.

See Documentation/core-api/dma-api.rst for details on unmapping interfaces.


Unregister from other subsystems
--------------------------------
Most low level PCI device drivers support some other subsystem
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
driver isn't losing resources from that other subsystem.
If this happens, typically the symptom is an Oops (panic) when
the subsystem attempts to call into a driver that has been unloaded.


Disable Device from responding to MMIO/IO Port addresses
--------------------------------------------------------
iounmap() MMIO or IO Port resources and then call pci_disable_device().
This is the symmetric opposite of pci_enable_device().
Do not access device registers after calling pci_disable_device().


Release MMIO/IO Port Resource(s)
--------------------------------
Call pci_release_region() to mark the MMIO or IO Port range as available.
Failure to do so usually results in the inability to reload the driver.


How to access PCI config space
==============================

You can use `pci_(read|write)_config_(byte|word|dword)` to access the config
space of a device represented by `struct pci_dev *`. All these functions return
All these functions return 4540 when successful or an error code (`PCIBIOS_...`) which can be translated to a 455text string by pcibios_strerror. Most drivers expect that accesses to valid PCI 456devices don't fail. 457 458If you don't have a struct pci_dev available, you can call 459`pci_bus_(read|write)_config_(byte|word|dword)` to access a given device 460and function on that bus. 461 462If you access fields in the standard portion of the config header, please 463use symbolic names of locations and bits declared in <linux/pci.h>. 464 465If you need to access Extended PCI Capability registers, just call 466pci_find_capability() for the particular capability and it will find the 467corresponding register block for you. 468 469 470Other interesting functions 471=========================== 472 473============================= ================================================ 474pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain, 475 bus and slot and number. If the device is 476 found, its reference count is increased. 477pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) 478pci_find_capability() Find specified capability in device's capability 479 list. 480pci_resource_start() Returns bus start address for a given PCI region 481pci_resource_end() Returns bus end address for a given PCI region 482pci_resource_len() Returns the byte length of a PCI region 483pci_set_drvdata() Set private driver data pointer for a pci_dev 484pci_get_drvdata() Return private driver data pointer for a pci_dev 485pci_set_mwi() Enable Memory-Write-Invalidate transactions. 486pci_clear_mwi() Disable Memory-Write-Invalidate transactions. 487============================= ================================================ 488 489 490Miscellaneous hints 491=================== 492 493When displaying PCI device names to the user (for example when a driver wants 494to tell the user what card has it found), please use pci_name(pci_dev). 

Always refer to the PCI devices by a pointer to the pci_dev structure.
All PCI layer functions use this identification and it's the only
reasonable one. Don't use bus/slot/function numbers except for very
special purposes -- on systems with multiple primary buses their semantics
can be pretty complex.

Don't try to turn on Fast Back to Back writes in your driver. All devices
on the bus need to be capable of doing it, so this is something which needs
to be handled by platform and generic code, not individual drivers.


Vendor and device identifications
=================================

Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
are shared across multiple drivers. You can add private definitions in
your driver if they're helpful, or just use plain hex constants.

The device IDs are arbitrary hex numbers (vendor controlled) and normally used
only in a single location, the pci_device_id table.

Please DO submit new vendor/device IDs to https://pci-ids.ucw.cz/.
There's a mirror of the pci.ids file at https://github.com/pciutils/pciids.


Obsolete functions
==================

There are several functions which you might come across when trying to
port an old driver to the new PCI interface. They are no longer present
in the kernel as they aren't compatible with hotplug or PCI domains or
having sane locking.

================= ===========================================
pci_find_device() Superseded by pci_get_device()
pci_find_subsys() Superseded by pci_get_subsys()
pci_find_slot()   Superseded by pci_get_domain_bus_and_slot()
pci_get_slot()    Superseded by pci_get_domain_bus_and_slot()
================= ===========================================

The alternative is the traditional PCI device driver that walks PCI
device lists. This is still possible but discouraged.


MMIO Space and "Write Posting"
==============================

Converting a driver from using I/O Port space to using MMIO space
often requires some additional changes. Specifically, "write posting"
needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
already do this. I/O Port space guarantees write transactions reach the PCI
device before the CPU can continue. Writes to MMIO space allow the CPU
to continue before the transaction reaches the PCI device. HW weenies
call this "Write Posting" because the write completion is "posted" to
the CPU before the transaction has reached its destination.

Thus, timing sensitive code should add readl() where the CPU is
expected to wait before doing other work. The classic "bit banging"
sequence works fine for I/O Port space::

    for (i = 0; i < 8; i++, val >>= 1) {
            outb(val & 1, ioport_reg);      /* write bit */
            udelay(10);
    }

The same sequence for MMIO space should be::

    for (i = 0; i < 8; i++, val >>= 1) {
            writeb(val & 1, mmio_reg);      /* write bit */
            readb(safe_mmio_reg);           /* flush posted write */
            udelay(10);
    }

It is important that "safe_mmio_reg" not have any side effects that
interfere with the correct operation of the device.

Another case to watch out for is when resetting a PCI device. Use PCI
Configuration space reads to flush the writel(). This will gracefully
handle the PCI master abort on all platforms if the PCI device is
expected to not respond to a readl(). Most x86 platforms will allow
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
(e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").
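
The effect of write posting, and why a read flushes it, can be pictured with
a toy userspace model. Everything here is hypothetical and heavily
simplified: ``struct demo_bridge`` stands in for whatever buffering sits
between the CPU and the device, and ``demo_writeb()``/``demo_readb()`` only
imitate the ordering behaviour described above. Real hardware is more
involved, so treat this as an illustration of the idea and nothing more.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NREGS 4 /* number of device registers in the toy model */
#define DEMO_QUEUE 8 /* capacity of the posted-write buffer */

/* Toy bridge: writes wait in a queue until something forces them out. */
struct demo_bridge {
	uint8_t regs[DEMO_NREGS]; /* what the device has actually seen */
	struct {
		unsigned int reg;
		uint8_t val;
	} pending[DEMO_QUEUE];
	unsigned int npending;
};

/* Deliver every queued write to the device, oldest first. */
void demo_flush(struct demo_bridge *b)
{
	for (unsigned int i = 0; i < b->npending; i++)
		b->regs[b->pending[i].reg] = b->pending[i].val;
	b->npending = 0;
}

/* Posted write: returns at once, the device has not seen the value yet. */
void demo_writeb(struct demo_bridge *b, uint8_t val, unsigned int reg)
{
	assert(reg < DEMO_NREGS);
	if (b->npending == DEMO_QUEUE)
		demo_flush(b); /* buffer full: drain before queueing */
	b->pending[b->npending].reg = reg;
	b->pending[b->npending].val = val;
	b->npending++;
}

/* A read may not pass earlier writes, so it drains the queue first. */
uint8_t demo_readb(struct demo_bridge *b, unsigned int reg)
{
	assert(reg < DEMO_NREGS);
	demo_flush(b);
	return b->regs[reg];
}
```

In this model, a ``demo_writeb()`` followed directly by a delay would time
the delay from the moment the write was queued, not from the moment the
device saw it. Reading any harmless register first forces the write out,
which is the role of ``readb(safe_mmio_reg)`` in the MMIO loop above.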