===========================================
Automatically bind swap device to numa node
===========================================

If the system has more than one swap device and the swap devices carry node
information, we can use that information to decide which swap device to use
in get_swap_pages() to get better performance.


How to use this feature
=======================

Each swap device has a priority, which decides the order in which the devices
are used. To make use of automatic binding, there is no need to manipulate the
priority settings of swap devices. For example, on a 2 node machine, assume 2
swap devices, swapA and swapB, are going to be swapped on, with swapA attached
to node 0 and swapB attached to node 1. Simply swap them on by doing::

	# swapon /dev/swapA
	# swapon /dev/swapB

Then node 0 will use the two swap devices in the order of swapA then swapB and
node 1 will use the two swap devices in the order of swapB then swapA. Note
that the order of them being swapped on doesn't matter.

A more complex example on a 4 node machine. Assume 6 swap devices are going to
be swapped on: swapA and swapB are attached to node 0, swapC is attached to
node 1, swapD and swapE are attached to node 2 and swapF is attached to node 3.
The way to swap them on is the same as above::

	# swapon /dev/swapA
	# swapon /dev/swapB
	# swapon /dev/swapC
	# swapon /dev/swapD
	# swapon /dev/swapE
	# swapon /dev/swapF

Then node 0 will use them in the order of::

	swapA/swapB -> swapC -> swapD -> swapE -> swapF

swapA and swapB will be used in a round robin mode before any other swap device.

node 1 will use them in the order of::

	swapC -> swapA -> swapB -> swapD -> swapE -> swapF

node 2 will use them in the order of::

	swapD/swapE -> swapA -> swapB -> swapC -> swapF

Similarly, swapD and swapE will be used in a round robin mode before any
other swap devices.

node 3 will use them in the order of::

	swapF -> swapA -> swapB -> swapC -> swapD -> swapE


Implementation details
======================

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use. If multiple swap devices share the same
priority, they are used round robin. This change replaces the single
global swap_avail_list with a per-numa-node list, i.e. each numa node
sees its own priority based list of available swap devices. A swap
device's priority can be promoted on its matching node's swap_avail_list.

Currently, a swap device's priority is set as follows: the user can set a
value >= 0, or the system picks one, starting from -1 and going downwards.
The priority value in the swap_avail_list is the negated value of the swap
device's priority, because plist is sorted from low to high. The new policy
doesn't change the semantics for priority >= 0 cases. The automatic
assignment that previously started from -1 and went downwards now starts
from -2, and -1 is reserved as the promoted value. So if multiple swap
devices are attached to the same node, they are all promoted to priority -1
on that node's plist and are used round robin before any other swap devices.
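
As a rough illustration of the policy described above, the following Python
sketch models how each node could derive its own ordering. It is not kernel
code: ``SwapDevice``, ``assign_priorities`` and ``node_order`` are hypothetical
names introduced here. It assumes that automatic priorities are handed out in
swapon order starting from -2, and that only automatically prioritized
(negative) devices are promoted to -1 on their own node. For simplicity it
sorts by priority directly instead of storing negated values in a plist.

```python
from typing import Dict, List, NamedTuple, Optional


class SwapDevice(NamedTuple):
    """Hypothetical model of a swap device; not a kernel structure."""
    name: str
    node: int                        # NUMA node the device is attached to
    user_prio: Optional[int] = None  # >= 0 if set by the user, None if automatic


PROMOTED_PRIO = -1  # reserved for an auto-prioritized device on its own node


def assign_priorities(devices: List[SwapDevice]) -> Dict[str, int]:
    """Give each device its base priority.

    User-set priorities (>= 0) are kept. Automatic priorities start at -2
    and go downwards in swapon order (assumption of this sketch).
    """
    next_auto = -2
    prios: Dict[str, int] = {}
    for dev in devices:
        if dev.user_prio is not None:
            prios[dev.name] = dev.user_prio
        else:
            prios[dev.name] = next_auto
            next_auto -= 1
    return prios


def node_order(devices: List[SwapDevice], node: int) -> List[List[str]]:
    """Return the groups of devices in the order `node` would use them.

    Devices in the same inner list share a priority on this node and
    would be used round robin.
    """
    prios = assign_priorities(devices)
    groups: Dict[int, List[str]] = {}
    for dev in devices:
        prio = prios[dev.name]
        # Promote only auto-prioritized devices attached to this node.
        if prio < 0 and dev.node == node:
            prio = PROMOTED_PRIO
        groups.setdefault(prio, []).append(dev.name)
    # Higher priority value is used first.
    return [groups[p] for p in sorted(groups, reverse=True)]


# The 4 node example: six devices swapped on in alphabetical order.
DEVICES = [
    SwapDevice("swapA", 0), SwapDevice("swapB", 0), SwapDevice("swapC", 1),
    SwapDevice("swapD", 2), SwapDevice("swapE", 2), SwapDevice("swapF", 3),
]

for n in range(4):
    order = " -> ".join("/".join(group) for group in node_order(DEVICES, n))
    print(f"node {n}: {order}")
```

Under these assumptions, the sketch reproduces the orderings listed for the
4 node example: each node sees its local devices first as one round robin
group, followed by the remaining devices in their automatic priority order.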