****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high-speed transport library that
establishes an optimal number of connections between client and server
machines over an RDMA (InfiniBand, RoCE, iWARP) transport. It is optimized
for transferring (reading/writing) IO blocks.

In its core interface it follows BIO semantics: a caller can either write
data from an sg list to the remote side or request ("read") a data transfer
from the remote side into a given sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see the "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

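To make the multipath idea concrete, here is a minimal sketch of how a
client could spread requests over several paths and skip the ones that are
down. It assumes a plain round-robin policy; the structure and function
names are hypothetical and are not taken from the RTRS sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-path state; a real path carries far more. */
struct sketch_path {
	bool connected;
};

/* Hypothetical session: an array of paths plus a round-robin cursor. */
struct sketch_sess {
	struct sketch_path *paths;
	size_t nr_paths;
	size_t next;	/* index at which the next search starts */
};

/*
 * Pick the next connected path in round-robin order.
 * Returns the path index, or -1 if no path is currently connected.
 */
int select_path_rr(struct sketch_sess *s)
{
	assert(s != NULL);

	for (size_t i = 0; i < s->nr_paths; i++) {
		size_t idx = (s->next + i) % s->nr_paths;

		if (s->paths[idx].connected) {
			s->next = (idx + 1) % s->nr_paths;
			return (int)idx;
		}
	}
	return -1;
}
```

Load balancing falls out of the moving cursor, and fail-over falls out of
skipping disconnected paths: a request that would have gone to a dead path
simply lands on the next healthy one.
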
==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an rtrs
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for RDMA transfers. A session consists of
multiple paths, each representing a separate physical link between client
and server. The paths are used for load balancing and failover. Each path
consists of as many connections (QPs) as there are CPUs on the client.

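The session/path/connection hierarchy described above can be pictured with
a small userspace sketch. All type and function names below are
hypothetical; the only property carried over from the text is that every
path owns one connection per client CPU.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_con {
	unsigned int cid;		/* connection id within its path */
};

struct sketch_path {
	struct sketch_con *cons;	/* one connection (QP) per client CPU */
	unsigned int nr_cons;
};

struct sketch_sess {
	struct sketch_path *paths;	/* one entry per physical link */
	unsigned int nr_paths;
};

void sess_free(struct sketch_sess *s)
{
	if (!s)
		return;
	for (unsigned int p = 0; s->paths && p < s->nr_paths; p++)
		free(s->paths[p].cons);
	free(s->paths);
	free(s);
}

/* Allocate a session with nr_paths paths of nr_cpus connections each. */
struct sketch_sess *sess_alloc(unsigned int nr_paths, unsigned int nr_cpus)
{
	struct sketch_sess *s;

	assert(nr_paths > 0 && nr_cpus > 0);

	s = calloc(1, sizeof(*s));
	if (!s)
		return NULL;
	s->paths = calloc(nr_paths, sizeof(*s->paths));
	if (!s->paths) {
		free(s);
		return NULL;
	}
	s->nr_paths = nr_paths;

	for (unsigned int p = 0; p < nr_paths; p++) {
		s->paths[p].cons = calloc(nr_cpus, sizeof(struct sketch_con));
		if (!s->paths[p].cons) {
			sess_free(s);
			return NULL;
		}
		s->paths[p].nr_cons = nr_cpus;
		for (unsigned int c = 0; c < nr_cpus; c++)
			s->paths[p].cons[c].cid = c;
	}
	return s;
}

/* Total number of connections in the session. */
unsigned int sess_total_cons(const struct sketch_sess *s)
{
	unsigned int total = 0;

	for (unsigned int p = 0; p < s->nr_paths; p++)
		total += s->paths[p].nr_cons;
	return total;
}
```

For example, a session with two paths on a four-CPU client would hold eight
connections in this model.
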
When processing an incoming write or read request, the rtrs client uses the
memory chunks reserved for it on the server side. Their number, size and
addresses need to be exchanged between client and server during the
connection establishment phase. Apart from the memory-related information,
the client needs to inform the server about the session name and identify
each path and connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to carry the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.

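The two uses of the 32-bit immediate field can be sketched as simple bit
packing. The layout below (4 type bits, 28 payload bits, 9 bits for the
errno, a caller-supplied number of offset bits) is an assumption made for
illustration; the text above only states which pieces of information travel
in the field, not how they are arranged.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed split of the 32-bit immediate: 4 type bits + 28 payload bits. */
#define IMM_PAYL_BITS	28
#define IMM_PAYL_MASK	((1u << IMM_PAYL_BITS) - 1)
#define IMM_ERRNO_BITS	9	/* assumed width of the errno field */
#define IMM_ERRNO_MASK	((1u << IMM_ERRNO_BITS) - 1)

enum imm_kind { IMM_IO_REQ = 0, IMM_IO_RSP = 1 };	/* placeholder values */

uint32_t imm_pack(enum imm_kind kind, uint32_t payload)
{
	return ((uint32_t)kind << IMM_PAYL_BITS) | (payload & IMM_PAYL_MASK);
}

enum imm_kind imm_get_kind(uint32_t imm)
{
	return (enum imm_kind)(imm >> IMM_PAYL_BITS);
}

/* Client -> server: which chunk was written and where the message sits. */
uint32_t imm_io_req(uint32_t chunk_id, uint32_t off, unsigned int off_bits)
{
	assert(off_bits < IMM_PAYL_BITS);
	assert(off < (1u << off_bits));
	assert(chunk_id < (1u << (IMM_PAYL_BITS - off_bits)));
	return imm_pack(IMM_IO_REQ, (chunk_id << off_bits) | off);
}

void imm_io_req_decode(uint32_t imm, unsigned int off_bits,
		       uint32_t *chunk_id, uint32_t *off)
{
	uint32_t payload = imm & IMM_PAYL_MASK;

	*chunk_id = payload >> off_bits;
	*off = payload & ((1u << off_bits) - 1);
}

/* Server -> client: which request completed and with which errno (<= 0). */
uint32_t imm_io_rsp(uint32_t req_id, int err)
{
	uint32_t e = (uint32_t)(err < 0 ? -err : err) & IMM_ERRNO_MASK;

	assert(req_id < (1u << (IMM_PAYL_BITS - IMM_ERRNO_BITS)));
	return imm_pack(IMM_IO_RSP, (req_id << IMM_ERRNO_BITS) | e);
}

void imm_io_rsp_decode(uint32_t imm, uint32_t *req_id, int *err)
{
	uint32_t payload = imm & IMM_PAYL_MASK;

	*req_id = payload >> IMM_ERRNO_BITS;
	*err = -(int)(payload & IMM_ERRNO_MASK);
}
```

With 17 offset bits, for instance, a request touching chunk 5 at offset 4096
and a response for request 42 failing with -5 both survive a round trip
through a single 32-bit value.
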
The module parameter always_invalidate was introduced to address the
security problem discussed at the LPC RDMA MC 2019. When
always_invalidate=Y, the server invalidates each RDMA buffer before handing
it over to the RNBD server, which then passes it to the block layer. A new
rkey is generated and registered for the buffer after it returns from the
block layer and the RNBD server. The new rkey is sent back to the client
along with the IO result. This procedure is the default behaviour of the
driver. The invalidation and registration on each IO cause a performance
drop of up to 20%. A user of the driver may choose to load the modules with
this mechanism switched off (always_invalidate=N) if they understand and
accept the risk of a malicious client being able to corrupt the memory of a
server it is connected to. This might be a reasonable option in a scenario
where all the clients and all the servers are located within a secure
datacenter.


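The buffer lifecycle under always_invalidate=Y can be modelled in a few
lines of userspace C. This is only a sketch: the structure and function
names are hypothetical, and rotating just the low 8 bits of the key is an
assumption modelled on how the kernel's ib_inc_rkey() helper derives a new
key, not a statement about what the RTRS server does internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one server-side RDMA buffer. */
struct srv_buf {
	uint32_t rkey;
	bool valid;	/* may the client access the buffer remotely? */
};

/* Bump only the low 8 bits of the key; the upper 24 bits stay as they are. */
uint32_t next_rkey(uint32_t rkey)
{
	const uint32_t mask = 0x000000ff;

	return ((rkey + 1) & mask) | (rkey & ~mask);
}

/* Before the IO is handed to the upper layers: revoke remote access. */
void buf_invalidate(struct srv_buf *b)
{
	assert(b->valid);
	b->valid = false;
}

/*
 * After the IO came back: register the buffer under a fresh key and
 * return that key so it can be reported to the client with the IO result.
 */
uint32_t buf_reregister(struct srv_buf *b)
{
	assert(!b->valid);
	b->rkey = next_rkey(b->rkey);
	b->valid = true;
	return b->rkey;
}

/* Would a remote access presenting `rkey` be accepted right now? */
bool buf_remote_access_ok(const struct srv_buf *b, uint32_t rkey)
{
	return b->valid && b->rkey == rkey;
}
```

The point of the model is that the key the client used for the request is
useless both while the IO is in flight and after it completes, so a client
cannot scribble over a buffer the block layer is still working on.
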
Connection establishment
------------------------

1. The client starts establishing the connections belonging to a path of a
session one by one by attaching RTRS_MSG_CON_REQ messages to the
rdma_connect requests. Those include the uuid of the session and the uuid of
the path to be established. They are used by the server to find a persisting
session/path or to create a new one when necessary. The message also
contains the protocol version and magic for compatibility, the total number
of connections per session (as many as there are CPUs on the client), the id
of the current connection and the reconnect counter, which is used to
resolve situations where the client is trying to reconnect a path while the
server is still destroying the old one.

2. The server accepts the connection requests one by one and attaches
RTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and protocol
version, the messages include an error code, the queue depth supported by
the server (the number of memory chunks which are going to be allocated for
that session) and the maximum size of one IO. The RTRS_MSG_NEW_RKEY_F flag
is set when always_invalidate=Y.

3. After all connections of a path are established, the client sends the
RTRS_MSG_INFO_REQ message, containing the name of the session, to the
server. This message requests the address information from the server.

4. The server replies to the session info request message with
RTRS_MSG_INFO_RSP, which contains the addresses and keys of the RDMA buffers
allocated for that session.

5. The session becomes connected after all paths to be established are
connected (i.e. steps 1-4 have finished for all paths requested for a
session).

6. The server and client periodically exchange heartbeat messages (empty
rdma messages with an immediate field), which are used to detect a crash on
the remote side or a network outage in the absence of IO.

7. On any RDMA-related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IOs are failed over to
a healthy path, if any, and the reconnect mechanism is triggered.

CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                   <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                   <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                   -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                   -------------------> [RTRS_HB_MSG_ACK]

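As an illustration of step 1, the following sketch shows the kind of checks
a server could run on an incoming connection request, including the use of
the reconnect counter. Field widths, the magic and version values and the
result codes are all hypothetical; only the list of fields follows the
description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAGIC	0xABCDu	/* placeholder, not the real magic */
#define SKETCH_VERSION	1u	/* placeholder protocol version */

/* Hypothetical layout of the fields a connection request carries. */
struct con_req {
	uint16_t magic;
	uint16_t version;
	uint16_t cid;		/* id of this connection */
	uint16_t cid_num;	/* total connections (client CPUs) */
	uint16_t recon_cnt;	/* reconnect counter */
	uint8_t sess_uuid[16];
	uint8_t path_uuid[16];
};

/* What the server remembers about a path it already knows. */
struct srv_path {
	uint8_t path_uuid[16];
	uint16_t cid_num;
	uint16_t recon_cnt;
};

enum con_verdict {
	CON_NEW_PATH,		/* unknown uuid: create a new path */
	CON_JOIN_PATH,		/* further connection of a known path */
	CON_RETRY_LATER,	/* old path is still being torn down */
	CON_BAD_MAGIC,
	CON_BAD_VERSION,
	CON_BAD_CID,
};

/* `known` is the path found by uuid lookup, or NULL if there is none. */
enum con_verdict con_req_check(const struct srv_path *known,
			       const struct con_req *req)
{
	assert(req != NULL);

	if (req->magic != SKETCH_MAGIC)
		return CON_BAD_MAGIC;
	if (req->version != SKETCH_VERSION)
		return CON_BAD_VERSION;
	if (req->cid >= req->cid_num)
		return CON_BAD_CID;
	if (!known || memcmp(known->path_uuid, req->path_uuid,
			     sizeof(req->path_uuid)) != 0)
		return CON_NEW_PATH;
	if (req->cid_num != known->cid_num)
		return CON_BAD_CID;
	/* A different counter means the client already started over. */
	if (req->recon_cnt != known->recon_cnt)
		return CON_RETRY_LATER;
	return CON_JOIN_PATH;
}
```

The interesting branch is the last one: a request that carries a known path
uuid but a different reconnect counter is the reconnect-while-destroying
situation the counter exists to resolve.
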
IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request, the client selects one of the memory
chunks on the server side and rdma-writes the user data, the user header and
the RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the
message only contains the size of the user header. The client uses the IMM
field to tell the server which chunk has been accessed and at what offset
the RTRS_MSG_RDMA_WRITE can be found.

2. When confirming a write request, the server sends an "empty" rdma message
with an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

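The chunk layout implied by the diagram (user data, then user header, then
the write message) can be sketched as follows. memcpy() stands in for the
RDMA write, and the message layout, the type value and the function names
are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MSG_TYPE_WRITE	2	/* placeholder type value */

/* Hypothetical write message: the type plus the user header size. */
struct msg_write {
	uint16_t type;
	uint16_t usr_len;
};

/*
 * Client side: lay out [usr_data][usr_hdr][msg] in the chunk.
 * Returns the offset of the message (the value that would travel in the
 * immediate field), or -1 if the request does not fit into the chunk.
 */
long chunk_fill_write(uint8_t *chunk, size_t chunk_size,
		      const void *data, size_t data_len,
		      const void *hdr, uint16_t hdr_len)
{
	struct msg_write msg = { MSG_TYPE_WRITE, hdr_len };
	size_t off = data_len + hdr_len;

	assert(chunk != NULL);
	if (data_len > chunk_size || hdr_len > chunk_size - data_len ||
	    sizeof(msg) > chunk_size - off)
		return -1;

	memcpy(chunk, data, data_len);
	memcpy(chunk + data_len, hdr, hdr_len);
	memcpy(chunk + off, &msg, sizeof(msg));
	return (long)off;
}

/*
 * Server side: given the offset from the immediate field, find the
 * message, then derive where the header and the data are.
 */
int chunk_parse_write(const uint8_t *chunk, size_t chunk_size, size_t off,
		      const uint8_t **data, size_t *data_len,
		      const uint8_t **hdr, uint16_t *hdr_len)
{
	struct msg_write msg;

	if (off > chunk_size || sizeof(msg) > chunk_size - off)
		return -1;
	memcpy(&msg, chunk + off, sizeof(msg));
	if (msg.type != MSG_TYPE_WRITE || msg.usr_len > off)
		return -1;

	*hdr_len = msg.usr_len;
	*hdr = chunk + off - msg.usr_len;
	*data = chunk;
	*data_len = off - msg.usr_len;
	return 0;
}
```

Because the message only stores the header size, the server recovers the
data length from the offset alone: everything before the header is data.
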
* Write (always_invalidate=Y) *

1. When processing a write request, the client selects one of the memory
chunks on the server side and rdma-writes the user data, the user header and
the RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the
message only contains the size of the user header. The client uses the IMM
field to tell the server which chunk has been accessed and at what offset
the RTRS_MSG_RDMA_WRITE can be found. The server first invalidates the rkey
associated with the memory chunk; when the invalidation finishes, it passes
the IO to the RNBD server module.

2. When confirming a write request, the server sends an "empty" rdma message
with an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it
validates the message, updates the rkey for the rbuffer, finishes the IO and
then posts the recv buffer back for later use.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)


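On the client side, handling the new rkey message boils down to validating
it and updating the stored key of one remote buffer. The sketch below
assumes a message made of a type, a buffer id and the new key; that layout,
the type value and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_TYPE_RKEY_RSP	4	/* placeholder type value */

/* Hypothetical layout of the new-rkey message. */
struct rkey_rsp {
	uint16_t type;
	uint16_t buf_id;	/* which remote buffer the key belongs to */
	uint32_t rkey;		/* the freshly registered key */
};

/* What the client knows about one remote buffer. */
struct clt_rbuf {
	uint64_t addr;
	uint32_t rkey;
};

struct clt_path {
	struct clt_rbuf *rbufs;
	uint16_t queue_depth;	/* number of remote buffers */
};

/*
 * Validate a received new-rkey message and update the matching remote
 * buffer. Returns 0 on success and -1 if the message is malformed; only
 * after a successful update would the IO be completed and the receive
 * buffer be posted again.
 */
int clt_handle_rkey_rsp(struct clt_path *p, const struct rkey_rsp *msg,
			size_t len)
{
	assert(p != NULL && msg != NULL);

	if (len != sizeof(*msg))
		return -1;
	if (msg->type != MSG_TYPE_RKEY_RSP)
		return -1;
	if (msg->buf_id >= p->queue_depth)
		return -1;

	p->rbufs[msg->buf_id].rkey = msg->rkey;
	return 0;
}
```

The bounds check on buf_id matters because the id arrives from the remote
side and is used as an array index.
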
* Read (always_invalidate=N) *

1. When processing a read request, the client selects one of the memory
chunks on the server side and rdma-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying whether memory invalidation is
necessary) and the list of addresses along with keys for the data to be read
into.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" rdma message with an immediate field. The 32-bit field is used to
specify the outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

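The read message differs from the write message mainly in that it carries a
list of client-side buffers for the server to fill. A minimal sketch of
building such a message is shown below; the field widths, the flag value
and the names are assumptions, only the set of fields follows the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MSG_TYPE_READ		3	/* placeholder type value */
#define MSG_F_NEED_INVAL	0x1	/* placeholder flag: invalidate please */

/* One client-side buffer the server should write the data into. */
struct sg_desc {
	uint64_t addr;
	uint32_t key;
	uint32_t len;
};

/* Hypothetical read message with a trailing descriptor list. */
struct msg_read {
	uint16_t type;
	uint16_t usr_len;	/* size of the user header */
	uint16_t flags;
	uint16_t sg_cnt;	/* number of entries in desc[] */
	struct sg_desc desc[];
};

/* Build a read message; the caller releases it with free(). */
struct msg_read *msg_read_alloc(uint16_t usr_len, int need_inval,
				const struct sg_desc *sg, uint16_t sg_cnt)
{
	struct msg_read *m;

	assert(sg != NULL || sg_cnt == 0);

	m = malloc(sizeof(*m) + (size_t)sg_cnt * sizeof(m->desc[0]));
	if (!m)
		return NULL;

	m->type = MSG_TYPE_READ;
	m->usr_len = usr_len;
	m->flags = need_inval ? MSG_F_NEED_INVAL : 0;
	m->sg_cnt = sg_cnt;
	if (sg_cnt)
		memcpy(m->desc, sg, (size_t)sg_cnt * sizeof(*sg));
	return m;
}

/* Number of bytes the server is allowed to transfer for this request. */
uint64_t msg_read_total_len(const struct msg_read *m)
{
	uint64_t total = 0;

	for (uint16_t i = 0; i < m->sg_cnt; i++)
		total += m->desc[i].len;
	return total;
}
```

Since the descriptors name client memory by address and key, the invalidate
flag gives the client a way to have those keys revoked as part of the
response instead of in a separate step.
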
* Read (always_invalidate=Y) *

1. When processing a read request, the client selects one of the memory
chunks on the server side and rdma-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying whether memory invalidation is
necessary) and the list of addresses along with keys for the data to be read
into. The server first invalidates the rkey associated with the memory
chunk; when the invalidation finishes, it passes the IO to the RNBD server
module.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" rdma message with an immediate field. The 32-bit field is used to
specify the outstanding inflight IO and the error code. The new rkey is sent
back using a SEND_WITH_IMM WR. When the client receives the new rkey
message, it validates the message, updates the rkey for the rbuffer,
finishes the IO and then posts the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>