|This readme tries to provide some background on the hows and whys of RDS,
|and will hopefully help you find your way around the code.
|In addition, please see this email about RDS origins:
|RDS provides reliable, ordered datagram delivery by using a single
|reliable connection between any two nodes in the cluster. This allows
|applications to use a single socket to talk to any other process in the
|cluster - so in a cluster with N processes you need N sockets, in contrast
|to N*N if you use a connection-oriented socket transport like TCP.
|RDS is not InfiniBand-specific; it was designed to support different
|transports. The current implementation used to support RDS over TCP as well
|as IB.
|The high-level semantics of RDS from the application's point of view are:
|* Addressing
|RDS uses IPv4 addresses and 16-bit port numbers to identify
|the end point of a connection. All socket operations that involve
|passing addresses between kernel and user space generally
|use a struct sockaddr_in.
|The fact that IPv4 addresses are used does not mean the underlying
|transport has to be IP-based. In fact, RDS over IB uses a
|reliable IB connection; the IP address is used exclusively to
|locate the remote node's GID (by ARPing for the given IP).
|The port space is entirely independent of UDP, TCP or any other
|protocol.
|* Socket interface
|RDS sockets work *mostly* as you would expect from a BSD
|socket. The next section will cover the details. At any rate,
|all I/O is performed through the standard BSD socket API.
|Some additions like zerocopy support are implemented through
|control messages, while other extensions use the getsockopt/
|setsockopt calls.
|Sockets must be bound before you can send or receive data.
|This is needed because binding also selects a transport and
|attaches it to the socket. Once bound, the transport assignment
|does not change. RDS will tolerate IPs moving around (e.g. in
|an active-active HA scenario), but only as long as the address
|doesn't move to a different transport.
|RDS supports a number of sysctls in /proc/sys/net/rds
|AF_RDS, PF_RDS, SOL_RDS
|AF_RDS and PF_RDS are the domain type to be used with socket(2)
|to create RDS sockets. SOL_RDS is the socket-level to be used
|with setsockopt(2) and getsockopt(2) for RDS-specific socket
|options.
|fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
|This creates a new, unbound RDS socket.
|setsockopt(SOL_SOCKET): send and receive buffer size
|RDS honors the send and receive buffer size socket options.
|You are not allowed to queue more than SO_SNDBUF bytes to
|a socket. A message is queued when sendmsg is called, and
|it leaves the queue when the remote system acknowledges
|its arrival.
|The SO_RCVBUF option controls the maximum receive queue length.
|This is a soft limit rather than a hard limit - RDS will
|continue to accept and queue incoming messages, even if that
|takes the queue length over the limit. However, it will also
|mark the port as "congested" and send a congestion update to
|the source node. The source node is supposed to throttle any
|processes sending to this congested port.
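|A minimal sketch of the soft receive limit described above, written as a
|toy userspace model rather than kernel code. The struct rx_port type and
|the rx_enqueue()/rx_dequeue() helpers are hypothetical names chosen for
|illustration, and the exact thresholds are an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one bound port's receive queue (not kernel code). */
struct rx_port {
    size_t rcvbuf;     /* receive buffer size, used as the soft limit */
    size_t queued;     /* payload bytes currently queued */
    bool   congested;  /* true once peers were told the port is congested */
};

/* Queue an incoming message. It is always accepted (soft limit).
 * Returns true if this call flipped the port to "congested", which is
 * when a congestion update would be sent to the source node. */
bool rx_enqueue(struct rx_port *p, size_t len)
{
    p->queued += len;
    if (!p->congested && p->queued >= p->rcvbuf) {
        p->congested = true;
        return true;
    }
    return false;
}

/* Hand len bytes to the application. Returns true if the port became
 * uncongested, which is when an update would go out to the peers. */
bool rx_dequeue(struct rx_port *p, size_t len)
{
    p->queued -= (len > p->queued) ? p->queued : len;
    if (p->congested && p->queued < p->rcvbuf) {
        p->congested = false;
        return true;
    }
    return false;
}
```

|The point of the model is that rx_enqueue() never rejects a message; going
|over the limit only changes the congestion state.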
|bind(fd, &sockaddr_in, ...)
|This binds the socket to a local IP address and port, and a
|transport, if one has not already been selected via the
|SO_RDS_TRANSPORT socket option.
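|The create-then-bind sequence can be sketched as follows. The helper names
|rds_make_addr() and rds_open_bound() are hypothetical, and the fallback
|value for PF_RDS is an assumption for C libraries that do not define it.
|The call simply fails when the rds module is not available.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef PF_RDS
#define PF_RDS 21   /* assumption: Linux value, used only if libc lacks it */
#endif

/* Fill a sockaddr_in for use with RDS; returns 0, or -1 with errno set
 * to EINVAL when ip is not a valid IPv4 address. */
int rds_make_addr(const char *ip, uint16_t port, struct sockaddr_in *sin)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &sin->sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}

/* Create an RDS socket and bind it to ip:port.
 * Returns the fd, or -1 with errno set. */
int rds_open_bound(const char *ip, uint16_t port)
{
    struct sockaddr_in sin;
    int fd, saved;

    if (rds_make_addr(ip, port, &sin) < 0)
        return -1;
    fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
    if (fd < 0)
        return -1;               /* e.g. RDS not supported on this host */
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        saved = errno;           /* keep bind()'s error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```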
|sendmsg(fd, ...)
|Sends a message to the indicated recipient. The kernel will
|transparently establish the underlying reliable connection
|if it isn't up yet.
|An attempt to send a message that exceeds SO_SNDBUF will
|return with -EMSGSIZE.
|An attempt to send a message that would take the total number
|of queued bytes over the SO_SNDBUF threshold will return
|EAGAIN.
|An attempt to send a message to a destination that is marked
|as "congested" will return ENOBUFS.
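|One way an application might send a datagram and react to the errors listed
|above is sketched below. The names rds_send_to(), rds_classify_send() and
|enum send_advice are hypothetical, and treating EAGAIN as the "send queue
|full" error is an assumption.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

enum send_advice {
    SEND_OK,          /* message was queued */
    SEND_TOO_BIG,     /* EMSGSIZE: message larger than the send buffer */
    SEND_QUEUE_FULL,  /* EAGAIN (assumed): wait for POLLOUT, then retry */
    SEND_CONGESTED,   /* ENOBUFS: destination port is congested */
    SEND_FAILED       /* anything else */
};

/* Send one datagram to dest; returns whatever sendmsg() returns. */
ssize_t rds_send_to(int fd, const struct sockaddr_in *dest,
                    const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = (void *)dest;
    msg.msg_namelen = sizeof(*dest);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    return sendmsg(fd, &msg, 0);
}

/* Map a sendmsg() result plus errno value to a suggested reaction. */
enum send_advice rds_classify_send(ssize_t ret, int err)
{
    if (ret >= 0)
        return SEND_OK;
    switch (err) {
    case EMSGSIZE: return SEND_TOO_BIG;
    case EAGAIN:   return SEND_QUEUE_FULL;
    case ENOBUFS:  return SEND_CONGESTED;
    default:       return SEND_FAILED;
    }
}
```

|A caller would pass errno straight in, for example
|rds_classify_send(ret, errno), and only wait on poll for SEND_QUEUE_FULL.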
|recvmsg(fd, ...)
|Receives a message that was queued to this socket. The socket's
|recv queue accounting is adjusted, and if the queue length
|drops below SO_RCVBUF, the port is marked uncongested, and
|a congestion update is sent to all peers.
|Applications can ask the RDS kernel module to receive
|notifications via control messages (for instance, there is a
|notification when a congestion update arrived, or when an RDMA
|operation completes). These notifications are received through
|the msg.msg_control buffer of struct msghdr. The format of the
|messages is described in the manpages.
|poll(fd)
|RDS supports the poll interface to allow the application
|to implement async I/O.
|POLLIN handling is pretty straightforward. When there's an
|incoming message queued to the socket, or a pending notification,
|we signal POLLIN.
|POLLOUT is a little harder. Since you can essentially send
|to any destination, RDS will always signal POLLOUT as long as
|there's room on the send queue (i.e. the number of bytes queued
|is less than the sendbuf size).
|However, the kernel will refuse to accept messages to
|a destination marked congested - in this case you will loop
|forever if you rely on poll to tell you what to do.
|This isn't a trivial problem, but applications can deal with
|this - by using congestion notifications, and by checking for
|ENOBUFS errors returned by sendmsg.
|setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
|This allows the application to discard all messages queued to a
|specific destination on this particular socket.
|This allows the application to cancel outstanding messages if
|it detects a timeout. For instance, if it tried to send a message,
|and the remote host is unreachable, RDS will keep trying forever.
|The application may decide it's not worth it, and cancel the
|operation. In this case, it would use RDS_CANCEL_SENT_TO to
|nuke any pending messages.
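|A thin wrapper around this option might look like the sketch below. The
|function name rds_cancel_sent_to() is hypothetical, and the fallback values
|for SOL_RDS and RDS_CANCEL_SENT_TO are assumptions that apply only when the
|system headers in use do not define them.

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276            /* assumption: Linux socket level for RDS */
#endif
#ifndef RDS_CANCEL_SENT_TO
#define RDS_CANCEL_SENT_TO 1   /* assumption: value from <linux/rds.h> */
#endif

/* Discard every message this socket still has queued for dest.
 * Returns 0 on success, -1 with errno set on failure. */
int rds_cancel_sent_to(int fd, const struct sockaddr_in *dest)
{
    return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
                      dest, sizeof(*dest));
}
```

|An application would typically call this after its own timeout for a
|destination has expired.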
|setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
|getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
|Set or read an integer defining the underlying
|encapsulating transport to be used for RDS packets on the
|socket. When setting the option, the integer argument may be
|one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
|value, RDS_TRANS_NONE will be returned on an unbound socket.
|This socket option may only be set exactly once on the socket,
|prior to binding it via the bind(2) system call. Attempts to
|set SO_RDS_TRANSPORT on a socket for which the transport has
|been previously attached explicitly (by SO_RDS_TRANSPORT) or
|implicitly (via bind(2)) will return an error of EOPNOTSUPP.
|An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
|always return EINVAL.
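|Selecting a transport before bind(2) could be wrapped as shown below. The
|helper names are hypothetical, and the numeric fallbacks are assumptions
|that apply only when the headers in use do not provide the constants. The
|early EINVAL check mirrors the rule stated above, so the caller gets the
|same error without a system call.

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276             /* assumption: Linux socket level for RDS */
#endif
#ifndef SO_RDS_TRANSPORT
#define SO_RDS_TRANSPORT 8      /* assumption: values from <linux/rds.h> */
#define RDS_TRANS_IB     0
#define RDS_TRANS_TCP    2
#define RDS_TRANS_NONE   (~0)
#endif

/* Pick the transport for an unbound socket. Returns 0, or -1 with errno. */
int rds_set_transport(int fd, int transport)
{
    if (transport == RDS_TRANS_NONE) {
        errno = EINVAL;         /* the kernel would reject this as well */
        return -1;
    }
    return setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
                      &transport, sizeof(transport));
}

/* Read back the transport attached to the socket. Returns 0, or -1. */
int rds_get_transport(int fd, int *transport)
{
    socklen_t len = sizeof(*transport);

    return getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, transport, &len);
}
```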
|RDMA for RDS
|see rds-rdma(7) manpage (available in rds-tools)
|see rds(7) manpage
|The message header is a 'struct rds_header' (see rds.h):
|h_sequence:
|per-packet sequence number
|h_ack:
|piggybacked acknowledgment of last packet received
|h_len:
|length of data, not including header
|h_sport:
|source port
|h_dport:
|destination port
|h_flags:
|CONG_BITMAP - this is a congestion update bitmap
|ACK_REQUIRED - receiver must ack this packet
|RETRANSMITTED - packet has previously been sent
|h_credit:
|indicate to other end of connection that
|it has more credits available (i.e. there is
|more send room)
|h_padding[4]:
|unused, for future use
|h_csum:
|header checksum
|h_exthdr:
|optional data can be passed here. This is currently used for
|passing RDMA-related information.
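|For orientation, the header can be pictured as the C struct below. This is
|a sketch recalled from the kernel's rds.h, not a copy of it: the struct is
|renamed rds_header_sketch, the field names, flag values and the 16-byte
|extension space are assumptions to be checked against the real header, and
|multi-byte fields are big-endian on the wire.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed flag bits carried in h_flags. */
#define RDS_FLAG_CONG_BITMAP   0x01  /* payload is a congestion bitmap */
#define RDS_FLAG_ACK_REQUIRED  0x02  /* receiver must ack this packet */
#define RDS_FLAG_RETRANSMITTED 0x04  /* packet has been sent before */

#define RDS_HEADER_EXT_SPACE   16    /* assumed size of the exthdr area */

struct rds_header_sketch {
    uint64_t h_sequence;    /* per-packet sequence number */
    uint64_t h_ack;         /* piggybacked ack of last packet received */
    uint32_t h_len;         /* length of data, not including header */
    uint16_t h_sport;       /* source port */
    uint16_t h_dport;       /* destination port */
    uint8_t  h_flags;       /* RDS_FLAG_* bits */
    uint8_t  h_credit;      /* new send credits granted to the peer */
    uint8_t  h_padding[4];  /* unused, for future use */
    uint16_t h_csum;        /* header checksum */
    uint8_t  h_exthdr[RDS_HEADER_EXT_SPACE];  /* optional extension data */
};

/* True when the sender asked for an explicit acknowledgment. */
int rds_hdr_needs_ack(const struct rds_header_sketch *h)
{
    return (h->h_flags & RDS_FLAG_ACK_REQUIRED) != 0;
}
```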
|ACK and retransmit handling
|One might think that with reliable IB connections you wouldn't need
|to ack messages that have been received. The problem is that IB
|hardware generates an ack message before it has DMAed the message
|into memory. This creates a potential message loss if the HCA is
|disabled for any reason between when it sends the ack and before
|the message is DMAed and processed. This is only a potential issue
|if another HCA is available for fail-over.
|Sending an ack immediately would allow the sender to free the sent
|message from their send queue quickly, but could cause excessive
|traffic to be used for acks. RDS piggybacks acks on sent data
|packets. Ack-only packets are reduced by only allowing one to be
|in flight at a time, and by the sender only asking for acks when
|its send buffers start to fill up. All retransmissions are also
|acked.
|RDS's IB transport uses a credit-based mechanism to verify that
|there is space in the peer's receive buffers for more data. This
|eliminates the need for hardware retries on the connection.
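|The idea behind credit-based flow control can be illustrated with a toy
|model. This is not the IB transport's code: struct credit_peer and both
|helpers are hypothetical, and the model assumes one credit per posted
|receive buffer on the peer.

```c
#include <stdint.h>

/* Sender-side view of how many receive buffers the peer has free. */
struct credit_peer {
    unsigned int send_credits;
};

/* Consume one credit before posting a send.
 * Returns 1 if the send may go out, 0 if the sender has to wait. */
int credit_try_send(struct credit_peer *p)
{
    if (p->send_credits == 0)
        return 0;
    p->send_credits--;
    return 1;
}

/* Apply the credits advertised in an incoming header's credit field. */
void credit_grant(struct credit_peer *p, uint8_t advertised)
{
    p->send_credits += advertised;
}
```

|Because the sender never posts more sends than the peer has buffers for,
|the hardware never has to retry for lack of a receive buffer.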
|Messages waiting in the receive queue on the receiving socket
|are accounted against the socket's SO_RCVBUF option value. Only
|the payload bytes in the message are accounted for. If the
|number of bytes queued equals or exceeds rcvbuf then the socket
|is congested. All sends attempted to this socket's address
|should block or return -EWOULDBLOCK.
|Applications are expected to be reasonably tuned such that this
|situation very rarely occurs. An application encountering this
|"back-pressure" is considered a bug.
|This is implemented by having each node maintain bitmaps which
|indicate which ports on bound addresses are congested. As the
|bitmap changes it is sent through all the connections which
|terminate in the local address of the bitmap which changed.
|The bitmaps are allocated as connections are brought up. This
|avoids allocation in the interrupt handling path which queues
|messages on sockets. The dense bitmaps let transports send the
|entire bitmap on any bitmap change reasonably efficiently. This
|is much easier to implement than some finer-grained
|communication of per-port congestion. The sender does a very
|inexpensive bit test to test if the port it's about to send to
|is congested or not.
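|The per-port bit test can be sketched in userspace as follows. struct
|cong_map and its helpers are hypothetical stand-ins for the kernel's
|congestion map, assuming one bit for each of the 65536 possible ports and
|a port number already in host byte order.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define CONG_PORTS     65536u   /* one bit per 16-bit port number */
#define CONG_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

struct cong_map {
    unsigned long bits[CONG_PORTS / (sizeof(unsigned long) * CHAR_BIT)];
};

/* Mark a port congested. */
void cong_set(struct cong_map *m, uint16_t port)
{
    m->bits[port / CONG_WORD_BITS] |= 1UL << (port % CONG_WORD_BITS);
}

/* Mark a port uncongested. */
void cong_clear(struct cong_map *m, uint16_t port)
{
    m->bits[port / CONG_WORD_BITS] &= ~(1UL << (port % CONG_WORD_BITS));
}

/* The cheap check a sender performs before queueing to a port. */
bool cong_test(const struct cong_map *m, uint16_t port)
{
    return (m->bits[port / CONG_WORD_BITS] >> (port % CONG_WORD_BITS)) & 1UL;
}
```

|The whole map is 8 KB, which is why sending it in full on every change
|stays reasonably cheap.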
|RDS Transport Layer
|As mentioned above, RDS is not IB-specific. Its code is divided
|into a general RDS layer and a transport layer.
|The general layer handles the socket API, congestion handling,
|loopback, stats, usermem pinning, and the connection state machine.
|The transport layer handles the details of the transport. The IB
|transport, for example, handles all the queue pairs, work requests,
|CM event handlers, and other InfiniBand details.
|RDS Kernel Structures
|struct rds_message
|aka possibly "rds_outgoing", the generic RDS layer copies data to
|be sent and sets header fields as needed, based on the socket API.
|This is then queued for the individual connection and sent by the
|connection's transport.
|struct rds_incoming
|a generic struct referring to incoming data that can be handed from
|the transport to the general code and queued by the general code
|while the socket is awoken. It is then passed back to the transport
|code to handle the actual copy-to-user.
|struct rds_transport
|pointers to transport-specific functions
|struct rds_cong_map
|wraps the raw congestion bitmap, contains rbnode, waitq, etc.
|Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
|ERROR states.
|The first time an attempt is made by an RDS socket to send data to
|a node, a connection is allocated and connected. That connection is
|then maintained forever -- if there are transport errors, the
|connection will be dropped and re-established.
|Dropping a connection while packets are queued will cause queued or
|partially-sent datagrams to be retransmitted when the connection is
|re-established.
|The send path
|struct rds_message built from incoming data
|CMSGs parsed (e.g. RDMA ops)
|transport connection alloced and connected if not already
|rds_message placed on send queue
|send worker awoken
|calls rds_send_xmit() until queue is empty
|transmits congestion map if one is pending
|may set ACK_REQUIRED
|calls transport to send either non-RDMA or RDMA message
|(RDMA ops never retransmitted)
|allocs work requests from send ring
|adds any new send credits available to peer (h_credit)
|maps the rds_message's sg list
|populates work requests
|post send to connection's queue pair
|The recv path
|looks at write completions
|unmaps recv buffer from device
|no errors, call rds_ib_process_recv()
|refill recv ring
|validate header checksum
|copy header to rds_ib_incoming struct if start of a new datagram
|add to ibinc's fraglist
|if completed datagram:
|update cong map if datagram was cong update
|call rds_recv_incoming() otherwise
|note if ack is required
|drop duplicate packets
|respond to pings
|find the sock associated with this datagram
|add to sock queue
|wake up sock
|do some congestion calculations
|copy data into user iovec
|return to application
|Multipath RDS (mprds)
|Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
|(though the concept can be extended to other transports). The classical
|implementation of RDS-over-TCP multiplexes multiple PF_RDS sockets
|between any 2 endpoints (where endpoint == [IP address, port]) over a
|single TCP socket between the 2 IP addresses involved. This has the
|limitation that it ends up funneling multiple RDS flows over a single
|TCP flow, so it
|(a) is upper-bounded to the single-flow bandwidth, and
|(b) suffers from head-of-line blocking for all the RDS sockets.
|Better throughput (for a fixed small packet size, MTU) can be achieved
|by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
|RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
|connection. RDS sockets will be attached to a path based on some hash
|(e.g., of local address and RDS port number) and packets for that RDS
|socket will be sent over the attached path using TCP to segment/reassemble
|RDS datagrams on that path.
|Multipathed RDS is implemented by splitting the struct rds_connection into
|a common (to all paths) part, and a per-path struct rds_conn_path. All
|I/O workqs and reconnect threads are driven from the rds_conn_path.
|Transports such as TCP that are multipath capable may then set up a
|TCP socket per rds_conn_path, and this is managed by the transport via
|the transport private cp_transport_data pointer.
|Transports announce themselves as multipath capable by setting the
|t_mp_capable bit during registration with the rds core module. When the
|transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
|across multiple paths. The outgoing hash is computed based on the
|local address and port that the PF_RDS socket is bound to.
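|Path selection by hashing the bound address and port can be sketched as
|below. The function mprds_pick_path() and its mixing step are hypothetical;
|the text does not say which hash the kernel uses, so only the overall shape
|(hash, then reduce modulo the number of paths) is meant to carry over.

```c
#include <stdint.h>

/* Map a socket's bound (address, port) pair to a path index in
 * [0, npaths). Illustrative mixing only, not the kernel's hash. */
unsigned int mprds_pick_path(uint32_t local_addr, uint16_t local_port,
                             unsigned int npaths)
{
    uint32_t h = local_addr ^ ((uint32_t)local_port * 2654435761u);

    h ^= h >> 16;                    /* fold high bits into low bits */
    return npaths ? h % npaths : 0;  /* treat 0 paths as a single path */
}
```

|Since the inputs are fixed once the socket is bound, every message from
|that socket lands on the same path, which keeps its datagrams ordered.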
|Additionally, even if the transport is MP capable, we may be
|peering with some node that does not support mprds, or supports
|a different number of paths. As a result, the peering nodes need
|to agree on the number of paths to be used for the connection.
|This is done by sending out a control packet exchange before the
|first data packet. The control packet exchange must have completed
|prior to outgoing hash completion in rds_sendmsg() when the transport
|is multipath capable.
|The control packet is an RDS ping packet (i.e., packet to rds dest
|port 0) with the ping packet having a rds extension header option of
|type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
|number of paths supported by the sender. The "probe" ping packet will
|get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
|The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
|be able to compute the min(sender_paths, rcvr_paths). The pong
|sent in response to a probe-ping should contain the rcvr's npaths
|when the rcvr is mprds-capable.
|If the rcvr is not mprds-capable, the exthdr in the ping will be
|ignored. In this case the pong will not have any exthdrs, so the sender
|of the probe-ping can default to single-path mprds.
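|The outcome of the probe exchange reduces to a small calculation, sketched
|here. The function mprds_negotiate() and its parameters are hypothetical;
|the 16-bit peer value reflects the 2-byte RDS_EXTHDR_NPATHS option
|described above.

```c
#include <stdint.h>

/* Number of paths to use for a connection after the ping/pong exchange.
 * peer_has_exthdr is 0 when the peer's packet carried no npaths option,
 * in which case the connection falls back to a single path. */
unsigned int mprds_negotiate(unsigned int local_npaths,
                             int peer_has_exthdr, uint16_t peer_npaths)
{
    if (!peer_has_exthdr || peer_npaths == 0 || local_npaths == 0)
        return 1;
    return local_npaths < peer_npaths ? local_npaths : peer_npaths;
}
```

|Both sides compute the same minimum, so they agree on the path count
|without a further round trip.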