README
****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high speed transport library
which provides support to establish an optimal number of connections
between client and server machines using RDMA (InfiniBand, RoCE, iWARP)
transport. It is optimized to transfer (read/write) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an rtrs
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for rdma transfer. A session
consists of multiple paths, each representing a separate physical link
between client and server. Those are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are cpus on
the client.

When processing an incoming write or read request, the rtrs client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to report the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.
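
The two uses of the immediate field described above can be sketched as plain
bit packing. This is a minimal Python illustration, not the driver's encoding:
the bit widths, the chunk shift and the type tags are all assumptions chosen
for the example.

```python
# Illustrative only: every bit width and type tag below is an assumption
# made for this sketch, not a value stated in this document.
TYPE_BITS = 4                          # assumed width of the message-type tag
PAYLOAD_BITS = 32 - TYPE_BITS          # rest of the 32-bit immediate
ERRNO_BITS = 9                         # assumed width for |errno| in a response
ID_BITS = PAYLOAD_BITS - ERRNO_BITS    # remaining bits identify the request
CHUNK_SHIFT = 17                       # assumed log2 of the chunk size

IO_REQ, IO_RSP = 0, 1                  # hypothetical type tags


def to_imm(msg_type, payload):
    """Combine a type tag and a payload into one 32-bit immediate."""
    assert 0 <= msg_type < (1 << TYPE_BITS)
    assert 0 <= payload < (1 << PAYLOAD_BITS)
    return (msg_type << PAYLOAD_BITS) | payload


def from_imm(imm):
    """Split a 32-bit immediate back into (type tag, payload)."""
    return imm >> PAYLOAD_BITS, imm & ((1 << PAYLOAD_BITS) - 1)


def io_req_payload(chunk_id, offset):
    """Client to server: which chunk was accessed and where the message is."""
    assert 0 <= offset < (1 << CHUNK_SHIFT)
    return (chunk_id << CHUNK_SHIFT) | offset


def parse_io_req(payload):
    """Recover (chunk id, message offset) from a request payload."""
    return payload >> CHUNK_SHIFT, payload & ((1 << CHUNK_SHIFT) - 1)


def io_rsp_payload(req_id, errno):
    """Server to client: which request is acknowledged, plus its errno."""
    assert 0 <= req_id < (1 << ID_BITS)
    assert 0 <= abs(errno) < (1 << ERRNO_BITS)
    return (abs(errno) << ID_BITS) | req_id


def parse_io_rsp(payload):
    """Recover (request id, negative errno or 0) from a response payload."""
    req_id = payload & ((1 << ID_BITS) - 1)
    errno = -(payload >> ID_BITS)
    return req_id, errno
```

Packing both values into one 32 bit word is what lets an otherwise "empty"
rdma message carry a complete acknowledgement.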

The module parameter always_invalidate was introduced to address the security
problem discussed at the LPC RDMA MC 2019. When always_invalidate=Y, the
server invalidates each rdma buffer before handing it over to the RNBD server,
which then passes it to the block layer. A new rkey is generated and
registered for the buffer after it returns from the block layer and the RNBD
server. The new rkey is sent back to the client along with the IO result.
This procedure is the default behaviour of the driver. The invalidation and
registration on each IO causes a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N) if they understand and accept the risk of a malicious
client being able to corrupt the memory of a server it is connected to. This
might be a reasonable option in a scenario where all the clients and all the
servers are located within a secure datacenter.
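
The effect of the invalidate/re-register cycle can be shown with a toy model.
This is a minimal Python sketch under stated assumptions: ServerBuffer and
serve_io are hypothetical names, and the "block layer" is a stand-in that only
measures the data length.

```python
import secrets


class ServerBuffer:
    """Toy model of one server-side rdma buffer and its remote key."""

    def __init__(self):
        self.rkey = secrets.randbits(32)
        self.valid = True
        self.data = b""

    def remote_write(self, rkey, payload):
        """A client access succeeds only with the current, valid rkey."""
        if not self.valid or rkey != self.rkey:
            raise PermissionError("stale or invalidated rkey")
        self.data = payload

    def invalidate(self):
        self.valid = False

    def reregister(self):
        """Generate a key that differs from the previous one."""
        old = self.rkey
        while self.rkey == old:
            self.rkey = secrets.randbits(32)
        self.valid = True
        return self.rkey


def serve_io(buf, always_invalidate):
    """Process one IO; return (result, new rkey or None)."""
    if always_invalidate:
        buf.invalidate()          # client can no longer touch the buffer
    result = len(buf.data)        # stand-in for the block layer
    new_rkey = buf.reregister() if always_invalidate else None
    return result, new_rkey
```

With always_invalidate=N the old key keeps working during and after the IO,
which is the window a malicious client could use to modify the buffer while
the server is still processing it.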


Connection establishment
------------------------

1. Client starts establishing connections belonging to a path of a session one
by one by attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
Those include the uuid of the session and the uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, the total number of connections per
session (as many as cpus on the client), the id of the current connection and
the reconnect counter, which is used to resolve situations where the
client is trying to reconnect a path while the server is still destroying the
old one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and
protocol version, the messages include the error code, the queue depth
supported by the server (number of memory chunks which are going to be
allocated for that session) and the maximum size of one IO. The
RTRS_MSG_NEW_RKEY_F flag is set when always_invalidate=Y.

3. After all connections of a path are established, the client sends the
RTRS_MSG_INFO_REQ message to the server, containing the name of the session.
This message requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty rdma
messages with an immediate field) which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IOs are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.
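
Steps 6 and 7 can be sketched as a periodic check over all paths. This is a
minimal Python illustration with an explicit clock. Path, on_heartbeat and
check_paths are hypothetical names, and picking the first healthy path is an
arbitrary choice made for the example, not the driver's policy.

```python
class Path:
    """Toy path: connection state, last heartbeat time, inflight IOs."""

    def __init__(self, name, now):
        self.name = name
        self.connected = True
        self.last_seen = now
        self.inflight = []


def on_heartbeat(path, now):
    """Record that the remote side was heard from at time `now`."""
    path.last_seen = now


def check_paths(paths, now, timeout):
    """Disconnect silent paths and fail their IO over to a healthy path.

    Returns the paths that were disconnected and need a reconnect.
    """
    lost = []
    for p in paths:
        if p.connected and now - p.last_seen > timeout:
            p.connected = False
            lost.append(p)
    healthy = [p for p in paths if p.connected]
    for p in lost:
        if healthy:                       # fail over only if a path is left
            healthy[0].inflight.extend(p.inflight)
            p.inflight.clear()
    return lost
```

If no healthy path remains, the inflight IOs stay on the disconnected path in
this model, which matches the "if any" qualifier in step 7.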

CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                  <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                  <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                  -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                  -------------------> [RTRS_HB_MSG_ACK]
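
The server side of steps 1-5 can be sketched as a find-or-create lookup keyed
by the two uuids. This is a minimal Python illustration: the field names,
PROTO_VERSION, the error code and the fake addresses and keys are assumptions,
and the reconnect counter is carried but not modelled.

```python
from dataclasses import dataclass, field

PROTO_VERSION = 2   # hypothetical value, for illustration only


@dataclass
class ConReq:
    """Fields the text lists for RTRS_MSG_CON_REQ (names are assumptions)."""
    sess_uuid: str
    path_uuid: str
    version: int
    con_num: int     # total number of connections
    con_id: int      # id of the current connection
    recon_cnt: int   # reconnect counter (carried here, not modelled)


@dataclass
class SrvPath:
    con_num: int
    established: set = field(default_factory=set)
    info_done: bool = False


class Server:
    def __init__(self, queue_depth=4):
        self.queue_depth = queue_depth
        self.sessions = {}   # sess_uuid -> {path_uuid: SrvPath}

    def on_con_req(self, req):
        """Steps 1-2: find or create the session/path, return an error code."""
        if req.version != PROTO_VERSION:
            return -1
        paths = self.sessions.setdefault(req.sess_uuid, {})
        path = paths.setdefault(req.path_uuid, SrvPath(req.con_num))
        path.established.add(req.con_id)
        return 0

    def on_info_req(self, sess_uuid, path_uuid):
        """Steps 3-4: reply with (address, key) pairs for the buffers."""
        path = self.sessions[sess_uuid][path_uuid]
        if len(path.established) != path.con_num:
            raise RuntimeError("not all connections of the path are established")
        path.info_done = True
        # Fake addresses and keys standing in for the real RDMA buffers.
        return [(0x1000 * i, 100 + i) for i in range(self.queue_depth)]

    def session_connected(self, sess_uuid):
        """Step 5: connected once every requested path finished steps 1-4."""
        paths = self.sessions.get(sess_uuid, {})
        return bool(paths) and all(p.info_done for p in paths.values())
```

Keying the lookup by uuid rather than by network address is what allows a
reconnecting client to be matched to a session that still exists on the server.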

IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request, the client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains the size of the user header. The client tells the server which chunk
has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field.

2. When confirming a write request, the server sends an "empty" rdma message
with an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)
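
The write request above can be sketched as filling one chunk and reporting the
message offset. This is a minimal Python illustration: the wire layout in
MSG_FMT and the type code are hypothetical, and the placement order within the
chunk simply follows the diagram (user data, user header, message), which is
an assumption about the real layout.

```python
import struct

MSG_WRITE = 2        # hypothetical type code
MSG_FMT = "<HH"      # hypothetical wire layout: type, size of user header
MSG_SIZE = struct.calcsize(MSG_FMT)


def client_write(chunk, usr_data, usr_hdr):
    """Fill the chosen chunk; return the offset of the message for the IMM."""
    msg = struct.pack(MSG_FMT, MSG_WRITE, len(usr_hdr))
    payload = usr_data + usr_hdr + msg
    if len(payload) > len(chunk):
        raise ValueError("request does not fit into one chunk")
    chunk[:len(payload)] = payload          # stands in for the rdma write
    return len(usr_data) + len(usr_hdr)     # where the message starts


def server_parse(chunk, msg_off):
    """Locate the message via the offset, then recover data and header."""
    msg_type, usr_len = struct.unpack_from(MSG_FMT, chunk, msg_off)
    if msg_type != MSG_WRITE:
        raise ValueError("unexpected message type")
    hdr_off = msg_off - usr_len
    return bytes(chunk[:hdr_off]), bytes(chunk[hdr_off:msg_off])
```

Because the message only carries the size of the user header, the server needs
the offset from the immediate field to find everything else in the chunk.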

* Write (always_invalidate=Y) *

1. When processing a write request, the client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains the size of the user header. The client tells the server which chunk
has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field. The server first invalidates the rkey associated with the
memory chunk and, once that completes, passes the IO to the RNBD server module.

2. When confirming a write request, the server sends an "empty" rdma message
with an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer and finishes the IO, then posts
the recv buffer back for later use.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)


* Read (always_invalidate=N) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), the size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" rdma message with an immediate field. The 32 bit field is used to
specify the outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
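
The server side of a read can be sketched as scattering the data into the
buffers named in the request and then choosing the response type. This is a
minimal Python illustration: serve_read, the (key, length) descriptor shape
and the dictionary standing in for client memory are all assumptions.

```python
def serve_read(regions, valid_keys, descs, data, need_inv):
    """Place data into the client's buffers and pick the response type.

    regions:    key -> bytearray, toy stand-in for client memory
    valid_keys: set of keys the server may currently use
    descs:      list of (key, length) taken from the read message
    need_inv:   flag from the read message requesting invalidation
    """
    pos = 0
    for key, length in descs:
        if key not in valid_keys:
            raise PermissionError("invalid key")
        part = data[pos:pos + length]
        regions[key][:len(part)] = part   # stands in for the data transfer
        pos += len(part)
    if need_inv:
        for key, _ in descs:
            valid_keys.discard(key)       # keys are unusable from now on
        return "RTRS_IO_RSP_IMM_W_INV"
    return "RTRS_IO_RSP_IMM"
```

The data is transferred before the response is chosen, mirroring the order in
step 2: data first, then the optional invalidation, then the acknowledgement.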

* Read (always_invalidate=Y) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), the size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.
The server first invalidates the rkey associated with the memory chunk and,
once that completes, passes the IO to the RNBD server module.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" rdma message with an immediate field. The 32 bit field is used to
specify the outstanding inflight IO and the error code. The new rkey is sent
back using a SEND_WITH_IMM WR. When the client receives the new rkey message,
it validates the message, updates the rkey for the rbuffer and finishes the
IO, then posts the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>

214