.\"
.\" Copyright (C) 2022 Alexander Chernikov <melifaro@FreeBSD.org>.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd November 30, 2022
.Dt NETLINK 4
.Os
.Sh NAME
.Nm Netlink
.Nd Kernel network configuration protocol
.Sh SYNOPSIS
.In netlink/netlink.h
.In netlink/netlink_route.h
.Ft int
.Fn socket AF_NETLINK SOCK_RAW "int family"
.Sh DESCRIPTION
Netlink is a user-kernel message-based communication protocol primarily used
for network stack configuration.
Netlink is easily extendable and supports large dumps and event
notifications, all via a single socket.
The protocol is fully asynchronous, allowing one to issue and track multiple
requests at once.
Netlink consists of multiple families, each of which typically groups the
commands belonging to a particular kernel subsystem.
Currently, the supported families are:
.Pp
.Bd -literal -offset indent -compact
NETLINK_ROUTE	network configuration,
NETLINK_GENERIC	"container" family
.Ed
.Pp
The
.Dv NETLINK_ROUTE
family handles the configuration of interfaces, addresses, neighbors, routes,
and VNETs.
More details can be found in
.Xr rtnetlink 4 .
The
.Dv NETLINK_GENERIC
family serves as a
.Do container Dc ,
allowing other families to be registered under the
.Dv NETLINK_GENERIC
umbrella.
This approach allows using a single netlink socket to interact with
multiple netlink families at once.
More details can be found in
.Xr genetlink 4 .
.Pp
Netlink has its own sockaddr structure:
.Bd -literal
struct sockaddr_nl {
	uint8_t		nl_len;		/* sizeof(sockaddr_nl) */
	sa_family_t	nl_family;	/* netlink family */
	uint16_t	nl_pad;		/* reserved, set to 0 */
	uint32_t	nl_pid;		/* automatically selected, set to 0 */
	uint32_t	nl_groups;	/* multicast groups mask to bind to */
};
.Ed
.Pp
Typically, filling this structure is not required for socket operations.
It is presented here for completeness.
.Sh PROTOCOL DESCRIPTION
The protocol is message-based.
Each message starts with the mandatory
.Va nlmsghdr
header, followed by the family-specific header and the list of
type-length-value pairs (TLVs).
TLVs can be nested.
All headers and TLVs are padded to 4-byte boundaries.
Each
.Xr send 2
or
.Xr recv 2
system call may contain multiple messages.
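.Pp
The framing rules above can be illustrated with a minimal sketch that is not
taken from the implementation.
It counts the messages held in one receive buffer by stepping from header to
header, rounding each message length up to 4 bytes.
The
.Vt "struct ex_nlmsghdr"
type and the
.Fn ex_align4
and
.Fn ex_count_messages
helpers are hypothetical names declared locally for illustration.
The structure only mirrors the base header layout given in the BASE HEADER
subsection; real code would presumably use the definitions from
.In netlink/netlink.h .
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirror of the base header, for illustration only. */
struct ex_nlmsghdr {
	uint32_t	nlmsg_len;	/* Length of message including header */
	uint16_t	nlmsg_type;	/* Message type identifier */
	uint16_t	nlmsg_flags;	/* Flags */
	uint32_t	nlmsg_seq;	/* Sequence number */
	uint32_t	nlmsg_pid;	/* Sending process port ID */
};

/* Round a length up to the next 4-byte boundary. */
static size_t
ex_align4(size_t len)
{
	return ((len + 3) & ~(size_t)3);
}

/*
 * Count the complete messages stored in buf[0..buflen).
 * Returns -1 if a header is truncated or carries an impossible length.
 */
static int
ex_count_messages(const void *buf, size_t buflen)
{
	const unsigned char *p = buf;
	size_t off = 0;
	int count = 0;

	assert(buf != NULL);
	while (off < buflen) {
		struct ex_nlmsghdr hdr;

		if (buflen - off < sizeof(hdr))
			return (-1);
		/* Copy the header out to avoid unaligned access. */
		memcpy(&hdr, p + off, sizeof(hdr));
		if (hdr.nlmsg_len < sizeof(hdr) ||
		    hdr.nlmsg_len > buflen - off)
			return (-1);
		count++;
		/* The next header starts at the aligned end of this one. */
		off += ex_align4(hdr.nlmsg_len);
	}
	return (count);
}
```
.Ed
.Pp
A caller would pass the buffer and the byte count returned by
.Xr recv 2 ;
a negative result means that the buffer should not be parsed any further.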
.Ss BASE HEADER
.Bd -literal
struct nlmsghdr {
	uint32_t	nlmsg_len;	/* Length of message including header */
	uint16_t	nlmsg_type;	/* Message type identifier */
	uint16_t	nlmsg_flags;	/* Flags (NLM_F_) */
	uint32_t	nlmsg_seq;	/* Sequence number */
	uint32_t	nlmsg_pid;	/* Sending process port ID */
};
.Ed
.Pp
The
.Va nlmsg_len
field stores the whole message length, in bytes, including the header.
This length has to be rounded up to the nearest 4-byte boundary when
iterating over messages.
The
.Va nlmsg_type
field represents the command/request type.
This value is family-specific.
The list of supported commands can be found in the relevant family
header file.
The
.Va nlmsg_seq
field is a user-provided request identifier.
An application can track the result of an operation by matching the
.Va nlmsg_seq
value of the returned
.Dv NLMSG_ERROR
message against the one sent in the request.
The
.Va nlmsg_pid
field is the message sender id.
This field is optional for userland.
The kernel sender id is zero.
The
.Va nlmsg_flags
field contains the message-specific flags.
The following generic flags are defined:
.Pp
.Bd -literal -offset indent -compact
NLM_F_REQUEST	Indicates that the message is an actual request to the kernel
NLM_F_ACK	Request an explicit ACK message with an operation result
.Ed
.Pp
The following generic flags are defined for the "GET" request types:
.Pp
.Bd -literal -offset indent -compact
NLM_F_ROOT	Return the whole dataset
NLM_F_MATCH	Return all entries matching the criteria
.Ed
These two flags are typically used together, aliased to
.Dv NLM_F_DUMP .
.Pp
The following generic flags are defined for the "NEW" request types:
.Pp
.Bd -literal -offset indent -compact
NLM_F_CREATE	Create an object if none exists
NLM_F_EXCL	Don't replace an object if it exists
NLM_F_REPLACE	Replace an existing matching object
NLM_F_APPEND	Append to an existing object
.Ed
.Pp
The following generic flags are defined for the replies:
.Pp
.Bd -literal -offset indent -compact
NLM_F_MULTI	Indicates that the message is part of the message group
NLM_F_DUMP_INTR	Indicates that the state dump was not completed
NLM_F_DUMP_FILTERED	Indicates that the dump was filtered per request
NLM_F_CAPPED	Indicates the original message was capped to its header
NLM_F_ACK_TLVS	Indicates that extended ACK TLVs were included
.Ed
.Ss TLVs
Most messages encode their attributes as type-length-value pairs (TLVs).
The base TLV header:
.Bd -literal
struct nlattr {
	uint16_t	nla_len;	/* Total attribute length */
	uint16_t	nla_type;	/* Attribute type */
};
.Ed
The scope of the TLV type
.Pq Va nla_type
is typically the message type or a group of message types within a family.
For example, the
.Dv RTN_MULTICAST
type value is only valid for
.Dv RTM_NEWROUTE ,
.Dv RTM_DELROUTE
and
.Dv RTM_GETROUTE
messages.
TLVs can be nested; in that case internal TLVs may have their own sub-types.
All TLVs are packed with 4-byte padding.
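.Pp
The TLV layout described above can be walked as in the following minimal
sketch, which is not taken from the implementation.
It assumes that
.Va nla_len
counts the 4-byte TLV header plus the payload, but not the trailing padding.
The
.Vt "struct ex_nlattr"
type and the
.Fn ex_tlv_align
and
.Fn ex_find_tlv
helpers are hypothetical names declared locally for illustration.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirror of the base TLV header. */
struct ex_nlattr {
	uint16_t	nla_len;	/* Total attribute length */
	uint16_t	nla_type;	/* Attribute type */
};

/* Round a length up to the next 4-byte boundary. */
static size_t
ex_tlv_align(size_t len)
{
	return ((len + 3) & ~(size_t)3);
}

/*
 * Find the first TLV of the given type in buf[0..buflen).
 * On success, store the payload length in *paylen and return the
 * payload offset within buf.  Return -1 if no such TLV exists or
 * if a TLV header is malformed.
 */
static int
ex_find_tlv(const void *buf, size_t buflen, uint16_t type, size_t *paylen)
{
	const unsigned char *p = buf;
	size_t off = 0;

	assert(buf != NULL && paylen != NULL);
	while (off < buflen && buflen - off >= sizeof(struct ex_nlattr)) {
		struct ex_nlattr nla;

		/* Copy the header out to avoid unaligned access. */
		memcpy(&nla, p + off, sizeof(nla));
		if (nla.nla_len < sizeof(nla) || nla.nla_len > buflen - off)
			return (-1);
		if (nla.nla_type == type) {
			*paylen = nla.nla_len - sizeof(nla);
			return ((int)(off + sizeof(nla)));
		}
		/* Skip the payload and its padding. */
		off += ex_tlv_align(nla.nla_len);
	}
	return (-1);
}
```
.Ed
.Pp
Since nested TLVs use the same layout, the same helper could be applied to
the payload of an enclosing TLV.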
.Ss CONTROL MESSAGES
A number of generic control messages are reserved in each family.
.Pp
.Dv NLMSG_ERROR
reports the operation result if requested, optionally followed by
the metadata TLVs.
The value of
.Va nlmsg_seq
is set to its value in the original message, while
.Va nlmsg_pid
is set to the socket pid of the original socket.
The operation result is reported via
.Vt "struct nlmsgerr" :
.Bd -literal
struct nlmsgerr {
	int		error;	/* Standard errno */
	struct nlmsghdr	msg;	/* Original message header */
};
.Ed
If the
.Dv NETLINK_CAP_ACK
socket option is not set, the remainder of the original message will follow.
If the
.Dv NETLINK_EXT_ACK
socket option is set, the kernel may add a
.Dv NLMSGERR_ATTR_MSG
string TLV with the textual error description, optionally followed by the
.Dv NLMSGERR_ATTR_OFFS
TLV, indicating the offset from the message start that triggered an error.
Some operations may return additional metadata encapsulated in the
.Dv NLMSGERR_ATTR_COOKIE
TLV.
The metadata format is specific to the operation.
If the operation reply is a multipart message, then no
.Dv NLMSG_ERROR
reply is generated, only a
.Dv NLMSG_DONE
message, closing the multipart sequence.
.Pp
.Dv NLMSG_DONE
indicates the end of the message group: typically, the end of the dump.
It contains a single
.Vt int
field, describing the dump result as a standard errno value.
.Sh SOCKET OPTIONS
Netlink supports a number of custom socket options, which can be set with
.Xr setsockopt 2
with the
.Dv SOL_NETLINK
.Fa level :
.Bl -tag -width indent
.It Dv NETLINK_ADD_MEMBERSHIP
Subscribes to the notifications for the specific group (int).
.It Dv NETLINK_DROP_MEMBERSHIP
Unsubscribes from the notifications for the specific group (int).
.It Dv NETLINK_LIST_MEMBERSHIPS
Lists the memberships as a bitmask.
.It Dv NETLINK_CAP_ACK
Instructs the kernel to send the original message header in the reply
without the message body.
.It Dv NETLINK_EXT_ACK
Acknowledges ability to receive additional TLVs in the ACK message.
.El
.Pp
Additionally, netlink overrides the following socket options from the
.Dv SOL_SOCKET
.Fa level :
.Bl -tag -width indent
.It Dv SO_RCVBUF
Sets the maximum size of the socket receive buffer.
If the caller has
.Dv PRIV_NET_ROUTE
permission, the value can exceed the currently-set
.Va kern.ipc.maxsockbuf
value.
.El
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables is available to tweak run-time parameters:
.Bl -tag -width indent
.It Va net.netlink.sendspace
Default send buffer for the netlink socket.
Note that the socket sendspace has to be at least as long as the longest
message that can be transmitted via this socket.
.It Va net.netlink.recvspace
Default receive buffer for the netlink socket.
Note that the socket recvspace has to be at least as long as the longest
message that can be received from this socket.
.It Va net.netlink.nl_maxsockbuf
Maximum receive buffer for the netlink socket that can be set via the
.Dv SO_RCVBUF
socket option.
.El
.Sh DEBUGGING
Netlink implements per-functional-unit debugging, with different severities
controllable via the
.Va net.netlink.debug
branch.
These messages are logged in the kernel message buffer and can be seen in
.Xr dmesg 8 .
The following severity levels are defined:
.Bl -tag -width indent
.It Dv LOG_DEBUG(7)
Rare events or per-socket errors are reported here.
This is the default level, not impacting production performance.
.It Dv LOG_DEBUG2(8)
Socket events such as group memberships, privilege checks, commands and dumps
are logged.
This level does not incur significant performance overhead.
.It Dv LOG_DEBUG3(9)
All socket events and each dumped or modified entity are logged.
Turning it on may result in significant performance overhead.
.El
.Sh ERRORS
Netlink reports operation results, including errors and error metadata, by
sending a
.Dv NLMSG_ERROR
message for each request message.
The following errors can be returned:
.Bl -tag -width Er
.It Bq Er EPERM
when the current privileges are insufficient to perform the required operation;
.It Bo Er ENOBUFS Bc or Bo Er ENOMEM Bc
when the system runs out of memory for
an internal data structure;
.It Bq Er ENOTSUP
when the requested command is not supported by the family or
the family is not supported;
.It Bq Er EINVAL
when some necessary TLVs are missing or invalid; detailed info
may be provided in the
.Dv NLMSGERR_ATTR_MSG
and
.Dv NLMSGERR_ATTR_OFFS
TLVs;
.It Bq Er ENOENT
when trying to delete a non-existent object.
.El
.Pp
Additionally, a socket operation itself may fail with one of the errors
specified in
.Xr socket 2 ,
.Xr recv 2
or
.Xr send 2 .
.Sh SEE ALSO
.Xr genetlink 4 ,
.Xr rtnetlink 4
.Rs
.%A "J. Salim"
.%A "H. Khosravi"
.%A "A. Kleen"
.%A "A. Kuznetsov"
.%T "Linux Netlink as an IP Services Protocol"
.%O "RFC 3549"
.Re
.Sh HISTORY
The netlink protocol appeared in
.Fx 13.2 .
.Sh AUTHORS
Netlink was implemented by
.An -nosplit
.An Alexander Chernikov Aq Mt melifaro@FreeBSD.org .
It was derived from the Google Summer of Code 2021 project by
.An Ng Peng Nam Sean .