.\"
.\" Copyright (C) 2022 Alexander Chernikov <melifaro@FreeBSD.org>.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd November 30, 2022
.Dt NETLINK 4
.Os
.Sh NAME
.Nm Netlink
.Nd Kernel network configuration protocol
.Sh SYNOPSIS
.In netlink/netlink.h
.In netlink/netlink_route.h
.Ft int
.Fn socket AF_NETLINK SOCK_RAW "int family"
.Sh DESCRIPTION
Netlink is a user-kernel message-based communication protocol primarily used
for network stack configuration.
Netlink is easily extendable and supports large dumps and event
notifications, all via a single socket.
The protocol is fully asynchronous, allowing one to issue and track multiple
requests at once.
Netlink consists of multiple families, which commonly group the commands
belonging to a particular kernel subsystem.
Currently, the supported families are:
.Pp
.Bd -literal -offset indent -compact
NETLINK_ROUTE     network configuration
NETLINK_GENERIC   "container" family
.Ed
.Pp
The
.Dv NETLINK_ROUTE
family handles the configuration of interfaces, addresses, neighbors, routes,
and VNETs.
More details can be found in
.Xr rtnetlink 4 .
The
.Dv NETLINK_GENERIC
family serves as a
.Do container Dc ,
allowing other families to be registered under the
.Dv NETLINK_GENERIC
umbrella.
This approach allows using a single netlink socket to interact with
multiple netlink families at once.
More details can be found in
.Xr genetlink 4 .
.Pp
Netlink has its own sockaddr structure:
.Bd -literal
struct sockaddr_nl {
	uint8_t		nl_len;		/* sizeof(sockaddr_nl) */
	sa_family_t	nl_family;	/* netlink family */
	uint16_t	nl_pad;		/* reserved, set to 0 */
	uint32_t	nl_pid;		/* automatically selected, set to 0 */
	uint32_t	nl_groups;	/* multicast groups mask to bind to */
};
.Ed
.Pp
Typically, filling this structure is not required for socket operations.
It is presented here for completeness.
.Sh PROTOCOL DESCRIPTION
The protocol is message-based.
Each message starts with the mandatory
.Va nlmsghdr
header, followed by the family-specific header and the list of
type-length-value pairs (TLVs).
TLVs can be nested.
All headers and TLVs are padded to 4-byte boundaries.
Each
.Xr send 2
or
.Xr recv 2
system call may contain multiple messages.
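.Pp
The 4-byte padding rule above can be illustrated with a short sketch.
The helper names
.Fn nl_align4
and
.Fn nl_padding
are hypothetical and are not taken from the netlink headers; they only show
the rounding arithmetic used when stepping from one header or TLV to the next.
.Bd -literal
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round len up to the next multiple of 4. */
uint32_t
nl_align4(uint32_t len)
{
	/* Guard against wrap-around when adding the rounding constant. */
	assert(len <= UINT32_MAX - 3U);
	return ((len + 3U) & ~3U);
}

/* Hypothetical helper: number of padding bytes that follow len bytes. */
uint32_t
nl_padding(uint32_t len)
{
	return (nl_align4(len) - len);
}
```
.Ed
.Pp
Under these assumptions, a 17-byte element occupies 20 bytes in the buffer,
with 3 bytes of padding before the next element starts.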
.Ss BASE HEADER
.Bd -literal
struct nlmsghdr {
	uint32_t nlmsg_len;	/* Length of message including header */
	uint16_t nlmsg_type;	/* Message type identifier */
	uint16_t nlmsg_flags;	/* Flags (NLM_F_) */
	uint32_t nlmsg_seq;	/* Sequence number */
	uint32_t nlmsg_pid;	/* Sending process port ID */
};
.Ed
.Pp
The
.Va nlmsg_len
field stores the whole message length, in bytes, including the header.
This length has to be rounded up to the nearest 4-byte boundary when
iterating over messages.
The
.Va nlmsg_type
field represents the command/request type.
This value is family-specific.
The list of supported commands can be found in the relevant family
header file.
The
.Va nlmsg_seq
field is a user-provided request identifier.
An application can track the operation result using the
.Dv NLMSG_ERROR
messages and matching the
.Va nlmsg_seq .
The
.Va nlmsg_pid
field is the message sender id.
This field is optional for userland.
The kernel sender id is zero.
The
.Va nlmsg_flags
field contains the message-specific flags.
The following generic flags are defined:
.Pp
.Bd -literal -offset indent -compact
NLM_F_REQUEST  Indicates that the message is an actual request to the kernel
NLM_F_ACK      Request an explicit ACK message with an operation result
.Ed
.Pp
The following generic flags are defined for the "GET" request types:
.Pp
.Bd -literal -offset indent -compact
NLM_F_ROOT     Return the whole dataset
NLM_F_MATCH    Return all entries matching the criteria
.Ed
.Pp
These two flags are typically used together, aliased to
.Dv NLM_F_DUMP .
.Pp
The following generic flags are defined for the "NEW" request types:
.Pp
.Bd -literal -offset indent -compact
NLM_F_CREATE   Create an object if none exists
NLM_F_EXCL     Don't replace an object if it exists
NLM_F_REPLACE  Replace an existing matching object
NLM_F_APPEND   Append to an existing object
.Ed
.Pp
The following generic flags are defined for the replies:
.Pp
.Bd -literal -offset indent -compact
NLM_F_MULTI          Indicates that the message is part of the message group
NLM_F_DUMP_INTR      Indicates that the state dump was not completed
NLM_F_DUMP_FILTERED  Indicates that the dump was filtered per request
NLM_F_CAPPED         Indicates the original message was capped to its header
NLM_F_ACK_TLVS       Indicates that extended ACK TLVs were included
.Ed
.Ss TLVs
Most messages encode their attributes as type-length-value pairs (TLVs).
The base TLV header:
.Bd -literal
struct nlattr {
	uint16_t nla_len;	/* Total attribute length */
	uint16_t nla_type;	/* Attribute type */
};
.Ed
.Pp
The TLV type
.Pq Va nla_type
scope is typically the message type or group within a family.
For example, the
.Dv RTN_MULTICAST
type value is only valid for
.Dv RTM_NEWROUTE ,
.Dv RTM_DELROUTE
and
.Dv RTM_GETROUTE
messages.
TLVs can be nested; in that case internal TLVs may have their own sub-types.
All TLVs are packed with 4-byte padding.
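.Pp
A minimal sketch of walking a flat TLV buffer, as described above, follows.
The function name
.Fn nl_tlv_find
is hypothetical, and
.Vt "struct nlattr"
is redeclared locally so that the example builds without the netlink headers.
The sketch assumes that
.Va nla_len
counts the 4-byte header plus the payload, but not the trailing padding.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the TLV header shown in the text. */
struct nlattr {
	uint16_t nla_len;	/* Total attribute length */
	uint16_t nla_type;	/* Attribute type */
};

/*
 * Hypothetical helper: return a pointer to the payload of the first TLV
 * of the given type and store the payload size in *payload_len.
 * Returns NULL if no such TLV exists or the buffer is malformed.
 */
const void *
nl_tlv_find(const void *buf, size_t buflen, uint16_t type,
    size_t *payload_len)
{
	const uint8_t *p = buf;
	size_t off = 0;

	assert(buf != NULL && payload_len != NULL);
	while (off <= buflen && buflen - off >= sizeof(struct nlattr)) {
		struct nlattr nla;

		/* Copy the header out to avoid unaligned access. */
		memcpy(&nla, p + off, sizeof(nla));
		/* Reject TLVs shorter than a header or past the buffer. */
		if (nla.nla_len < sizeof(nla) || nla.nla_len > buflen - off)
			return (NULL);
		if (nla.nla_type == type) {
			*payload_len = nla.nla_len - sizeof(nla);
			return (p + off + sizeof(nla));
		}
		/* The next TLV starts on the next 4-byte boundary. */
		off += ((size_t)nla.nla_len + 3) & ~(size_t)3;
	}
	return (NULL);
}
```
.Ed
.Pp
Under the same assumptions, a nested TLV could be searched by calling the
function again on the returned payload and its length.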
.Ss CONTROL MESSAGES
A number of generic control messages are reserved in each family.
.Pp
.Dv NLMSG_ERROR
reports the operation result if requested, optionally followed by
the metadata TLVs.
The value of
.Va nlmsg_seq
is set to its value in the original message, while
.Va nlmsg_pid
is set to the socket pid of the original socket.
The operation result is reported via
.Vt "struct nlmsgerr" :
.Bd -literal
struct nlmsgerr {
	int	error;		/* Standard errno */
	struct	nlmsghdr msg;	/* Original message header */
};
.Ed
.Pp
If the
.Dv NETLINK_CAP_ACK
socket option is not set, the remainder of the original message will follow.
If the
.Dv NETLINK_EXT_ACK
socket option is set, the kernel may add a
.Dv NLMSGERR_ATTR_MSG
string TLV with the textual error description, optionally followed by the
.Dv NLMSGERR_ATTR_OFFS
TLV, indicating the offset from the message start that triggered an error.
Some operations may return additional metadata encapsulated in the
.Dv NLMSGERR_ATTR_COOKIE
TLV.
The metadata format is specific to the operation.
If the operation reply is a multipart message, then no
.Dv NLMSG_ERROR
reply is generated, only a
.Dv NLMSG_DONE
message, closing the multipart sequence.
.Pp
.Dv NLMSG_DONE
indicates the end of the message group: typically, the end of the dump.
It contains a single
.Vt int
field, describing the dump result as a standard errno value.
.Sh SOCKET OPTIONS
Netlink supports a number of custom socket options, which can be set with
.Xr setsockopt 2
with the
.Dv SOL_NETLINK
.Fa level :
.Bl -tag -width indent
.It Dv NETLINK_ADD_MEMBERSHIP
Subscribes to the notifications for the specific group (int).
.It Dv NETLINK_DROP_MEMBERSHIP
Unsubscribes from the notifications for the specific group (int).
.It Dv NETLINK_LIST_MEMBERSHIPS
Lists the memberships as a bitmask.
.It Dv NETLINK_CAP_ACK
Instructs the kernel to send the original message header in the reply
without the message body.
.It Dv NETLINK_EXT_ACK
Acknowledges the ability to receive additional TLVs in the ACK message.
.El
.Pp
Additionally, netlink overrides the following socket options from the
.Dv SOL_SOCKET
.Fa level :
.Bl -tag -width indent
.It Dv SO_RCVBUF
Sets the maximum size of the socket receive buffer.
If the caller has
.Dv PRIV_NET_ROUTE
permission, the value can exceed the currently-set
.Va kern.ipc.maxsockbuf
value.
.El
.Sh SYSCTL VARIABLES
A set of
.Xr sysctl 8
variables is available to tweak run-time parameters:
.Bl -tag -width indent
.It Va net.netlink.sendspace
Default send buffer for the netlink socket.
Note that the socket sendspace has to be at least as long as the longest
message that can be transmitted via this socket.
.It Va net.netlink.recvspace
Default receive buffer for the netlink socket.
Note that the socket recvspace has to be at least as long as the longest
message that can be received from this socket.
.El
.Sh DEBUGGING
Netlink implements per-functional-unit debugging, with different severities
controllable via the
.Va net.netlink.debug
branch.
These messages are logged in the kernel message buffer and can be seen in
.Xr dmesg 8 .
The following severity levels are defined:
.Bl -tag -width indent
.It Dv LOG_DEBUG(7)
Rare events or per-socket errors are reported here.
This is the default level, not impacting production performance.
.It Dv LOG_DEBUG2(8)
Socket events such as group memberships, privilege checks, commands and dumps
are logged.
This level does not incur significant performance overhead.
.It Dv LOG_DEBUG3(9)
All socket events and each dumped or modified entity are logged.
Turning it on may result in significant performance overhead.
.El
.Sh ERRORS
Netlink reports operation results, including errors and error metadata, by
sending a
.Dv NLMSG_ERROR
message for each request message.
The following errors can be returned:
.Bl -tag -width Er
.It Bq Er EPERM
when the current privileges are insufficient to perform the required operation;
.It Bo Er ENOBUFS Bc or Bo Er ENOMEM Bc
when the system runs out of memory for
an internal data structure;
.It Bq Er ENOTSUP
when the requested command is not supported by the family or
the family is not supported;
.It Bq Er EINVAL
when some necessary TLVs are missing or invalid; detailed information
may be provided in the
.Dv NLMSGERR_ATTR_MSG
and
.Dv NLMSGERR_ATTR_OFFS
TLVs;
.It Bq Er ENOENT
when trying to delete a non-existent object.
.El
.Pp
Additionally, a socket operation itself may fail with one of the errors
specified in
.Xr socket 2 ,
.Xr recv 2
or
.Xr send 2 .
.Sh SEE ALSO
.Xr genetlink 4 ,
.Xr rtnetlink 4
.Rs
.%A "J. Salim"
.%A "H. Khosravi"
.%A "A. Kleen"
.%A "A. Kuznetsov"
.%T "Linux Netlink as an IP Services Protocol"
.%O "RFC 3549"
.Re
.Sh HISTORY
The netlink protocol appeared in
.Fx 14.0 .
.Sh AUTHORS
Netlink was implemented by
.An -nosplit
.An Alexander Chernikov Aq Mt melifaro@FreeBSD.org .
It was derived from the Google Summer of Code 2021 project by
.An Ng Peng Nam Sean .