[Pkg-ofed-commits] r433 - in /branches/ofed-1.4.2/rds-tools/trunk: ./ debian/ docs/
gmpc-guest at alioth.debian.org
Fri Aug 7 15:50:44 UTC 2009
Author: gmpc-guest
Date: Fri Aug 7 15:50:44 2009
New Revision: 433
URL: http://svn.debian.org/wsvn/pkg-ofed/?sc=1&rev=433
Log:
OFED 1.4.2 release
Modified:
branches/ofed-1.4.2/rds-tools/trunk/debian/changelog
branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt
branches/ofed-1.4.2/rds-tools/trunk/pfhack.c
branches/ofed-1.4.2/rds-tools/trunk/pfhack.h
branches/ofed-1.4.2/rds-tools/trunk/rds-info.c
branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c
Modified: branches/ofed-1.4.2/rds-tools/trunk/debian/changelog
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/debian/changelog?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/debian/changelog (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/debian/changelog Fri Aug 7 15:50:44 2009
@@ -1,3 +1,9 @@
+rds-tools (1.4.1-OFED-1.4.2-1) unstable; urgency=low
+
+ * New upstream release
+
+ -- Guy Coates <gmpc at sanger.ac.uk> Fri, 07 Aug 2009 16:34:06 +0100
+
rds-tools (1.4-1) unstable; urgency=low
* Fix manpage
Modified: branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt Fri Aug 7 15:50:44 2009
@@ -5,202 +5,352 @@
This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.
-There is a *little* bit of extra documentation available. The rds-tools
-package has two manpages rds(7) and rds-rdma(7) that describe the interface
-a little. If you search the rds-devel archives, Rick Frank posted
-a bunch of messages discussing the ideas behing RDS between early to
-mid November 2007. Not all of that material still applies 100% to the
-current code - for instance we no longer have RDMA barriers. But it may
-be helpful.
-
-In particular, there's a message dated Nov 15, subject "What is RDS and
-why did we build it?" which contains a doc file with some motivation on
-the design.
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
RDS Architecture
================
RDS provides reliable, ordered datagram delivery by using a single
-reliably connection between any two nodes in the cluster. This allows
+reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.
RDS is not Infiniband-specific; it was designed to support different
-transports. The current implementation supports RDS over IB as well as
-TCP. Work is in progress to support RDS over iWARP.
+transports. The current implementation used to support RDS over TCP as well
+as IB. Work is in progress to support RDS over iWARP, and using DCE (Data
+Center Ethernet) to guarantee no dropped packets on Ethernet, it may be
+possible to use RDS over UDP in the future.
The high-level semantics of RDS from the application's point of view are
* Addressing
- RDS uses IPv4 addresses and 16bit port numbers to identify
- the end point of a connection. All socket operations that involve
- passing addresses between kernel and user space generally
- use a struct sockaddr_in.
-
- The fact that IPv4 addresses are used does not mean the underlying
- transport has to be IP-based. In fact, RDS over IB uses a
- reliable IB connection; the IP address is used exclusively to
- locate the remote node's GID (by ARPing for the given IP).
-
- The port space is entirely independent of UDP, TCP or any other
- protocol.
+ RDS uses IPv4 addresses and 16bit port numbers to identify
+ the end point of a connection. All socket operations that involve
+ passing addresses between kernel and user space generally
+ use a struct sockaddr_in.
+
+ The fact that IPv4 addresses are used does not mean the underlying
+ transport has to be IP-based. In fact, RDS over IB uses a
+ reliable IB connection; the IP address is used exclusively to
+ locate the remote node's GID (by ARPing for the given IP).
+
+ The port space is entirely independent of UDP, TCP or any other
+ protocol.
* Socket interface
- RDS sockets work *mostly* as you would expect from a BSD
- socket. The next section will cover the details. At any rate,
- all I/O is performed through the standard BSD socket API.
- Some additions like zerocopy support are implemented through
- control messages, while other extensions use the getsockopt/
- setsockopt calls.
-
- Sockets must be bound before you can send or receive data.
- This is needed because binding also selects a transport and
- attaches it to the socket. Once bound, the transport assignment
- does not change. RDS will tolerate IPs moving around (eg in
- a active-active HA scenario), but only as long as the address
- doesn't move to a different transport.
+ RDS sockets work *mostly* as you would expect from a BSD
+ socket. The next section will cover the details. At any rate,
+ all I/O is performed through the standard BSD socket API.
+ Some additions like zerocopy support are implemented through
+ control messages, while other extensions use the getsockopt/
+ setsockopt calls.
+
+ Sockets must be bound before you can send or receive data.
+ This is needed because binding also selects a transport and
+ attaches it to the socket. Once bound, the transport assignment
+ does not change. RDS will tolerate IPs moving around (e.g. in
+ an active-active HA scenario), but only as long as the address
+ doesn't move to a different transport.
* sysctls
- RDS supports a number of sysctls in /proc/sys/net/rds
+ RDS supports a number of sysctls in /proc/sys/net/rds
+
Socket Interface
================
-AF_RDS, PF_RDS, SOL_RDS
- These constants haven't been assigned yet, because RDS isn't in
- mainline yet. Currently, the kernel module assigns some constant
- and publishes it to user space through two sysctl files
- /proc/sys/net/rds/pf_rds
- /proc/sys/net/rds/sol_rds
-
-fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
- This creates a new, unbound RDS socket.
-
-setsockopt(SOL_SOCKET): send and receive buffer size
- RDS honors the send and receive buffer size socket options.
- You are not allowed to queue more than SO_SNDSIZE bytes to
- a socket. A message is queueud when you call sendmsg, and
- it leaves the queue when the remote system acknowledges
- its arrival.
-
- The SO_RCVSIZE option controls the maximum receive queue length.
- This is a soft limit rather than a hard limit - RDS will
- continue to accept and queue incoming messages, even if that
- takes the queue length over the limit. However, it will also
- mark the port as "congested" and send a congestion update to
- the source node. The source node is supposed to throttle any
- processes sending to this congested port.
-
-bind(fd, &sockaddr_in, ...)
- This binds the socket to a local IP address and port, and a
- transport.
-
-sendmsg(fd, ...)
- Sends a message to the indicated recipient. The kernel will
- transparently establish the underlying reliable connection
- if it isn't up yet.
-
- An attempt to send a message that exceeds SO_SNDSIZE will
- return with -EMSGSIZE
-
- An attempt to send a message that would take the total number
- of queued bytes over the SO_SNDSIZE threshold will return
- EAGAIN.
-
- An attempt to send a message to a destination that is marked
- as "congested" will return ENOBUFS.
-
-recvmsg(fd, ...)
- Receives a message that was queued to this socket. The sockets
- recv queue accounting is adjusted, and if the queue length
- drops below SO_SNDSIZE, the port is marked uncongested, and
- a congestion update is sent to all peers.
-
- Applications can ask the RDS kernel module to receive
- notifications via control messages (for instance, there is a
- notification when a congestion update arrived, or when a RDMA
- operation completes). These notifications are received through
- the msg.msg_control buffer of struct msghdr. The format of the
- messages is described in manpages.
-
-poll(fd)
- RDS supports the poll interface to allow the application
- to implement async I/O.
-
- POLLIN handling is pretty straightforward. When there's an
- incoming message queued to the socket, or a pending notification,
- we signal POLLIN.
-
- POLLOUT is a little harder. Since you can essentially send
- to any destination, RDS will always signal POLLOUT as long as
- there's room on the send queue (ie the number of bytes queued
- is less than the sendbuf size).
-
- However, the kernel will refuse to accept messages to
- a destination marked congested - in this case you will loop
- forever if you rely on poll to tell you what to do.
- This isn't a trivial problem, but applications can deal with
- this - by using congestion notifications, and by checking for
- ENOBUFS errors returned by sendmsg.
-
-setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
- This allows the application to discard all messages queued to a
- specific destination on this particular socket.
-
- This allows the application to cancel outstanding messages if
- it detects a timeout. For instance, if it tried to send a message,
- and the remote host is unreachable, RDS will keep trying forever.
- The application may decide it's not worth it, and cancel the
- operation. In this case, it would use RDS_CANCEL_SENT_TO to
- nuke any pending messages.
+ AF_RDS, PF_RDS, SOL_RDS
+ These constants haven't been assigned yet, because RDS isn't in
+ mainline yet. Currently, the kernel module assigns some constant
+ and publishes it to user space through two sysctl files
+ /proc/sys/net/rds/pf_rds
+ /proc/sys/net/rds/sol_rds
+
+ fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ This creates a new, unbound RDS socket.
+
+ setsockopt(SOL_SOCKET): send and receive buffer size
+ RDS honors the send and receive buffer size socket options.
+ You are not allowed to queue more than SO_SNDSIZE bytes to
+ a socket. A message is queued when sendmsg is called, and
+ it leaves the queue when the remote system acknowledges
+ its arrival.
+
+ The SO_RCVSIZE option controls the maximum receive queue length.
+ This is a soft limit rather than a hard limit - RDS will
+ continue to accept and queue incoming messages, even if that
+ takes the queue length over the limit. However, it will also
+ mark the port as "congested" and send a congestion update to
+ the source node. The source node is supposed to throttle any
+ processes sending to this congested port.
+
+ bind(fd, &sockaddr_in, ...)
+ This binds the socket to a local IP address and port, and a
+ transport.
+
+ sendmsg(fd, ...)
+ Sends a message to the indicated recipient. The kernel will
+ transparently establish the underlying reliable connection
+ if it isn't up yet.
+
+ An attempt to send a message that exceeds SO_SNDSIZE will
+ return with -EMSGSIZE.
+
+ An attempt to send a message that would take the total number
+ of queued bytes over the SO_SNDSIZE threshold will return
+ EAGAIN.
+
+ An attempt to send a message to a destination that is marked
+ as "congested" will return ENOBUFS.
+
+ recvmsg(fd, ...)
+ Receives a message that was queued to this socket. The socket's
+ recv queue accounting is adjusted, and if the queue length
+ drops below SO_RCVSIZE, the port is marked uncongested, and
+ a congestion update is sent to all peers.
+
+ Applications can ask the RDS kernel module to receive
+ notifications via control messages (for instance, there is a
+ notification when a congestion update arrives, or when an RDMA
+ operation completes). These notifications are received through
+ the msg.msg_control buffer of struct msghdr. The format of the
+ messages is described in manpages.
+
+ poll(fd)
+ RDS supports the poll interface to allow the application
+ to implement async I/O.
+
+ POLLIN handling is pretty straightforward. When there's an
+ incoming message queued to the socket, or a pending notification,
+ we signal POLLIN.
+
+ POLLOUT is a little harder. Since you can essentially send
+ to any destination, RDS will always signal POLLOUT as long as
+ there's room on the send queue (ie the number of bytes queued
+ is less than the sendbuf size).
+
+ However, the kernel will refuse to accept messages to
+ a destination marked congested - in this case you will loop
+ forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with
+ this - by using congestion notifications, and by checking for
+ ENOBUFS errors returned by sendmsg.
+
+ setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+ This allows the application to discard all messages queued to a
+ specific destination on this particular socket.
+
+ This allows the application to cancel outstanding messages if
+ it detects a timeout. For instance, if it tried to send a message,
+ and the remote host is unreachable, RDS will keep trying forever.
+ The application may decide it's not worth it, and cancel the
+ operation. In this case, it would use RDS_CANCEL_SENT_TO to
+ nuke any pending messages.
+
RDMA for RDS
============
-see manpage for now
+ see rds-rdma(7) manpage (available in rds-tools)
+
Congestion Notifications
========================
-see manpage
+ see rds(7) manpage
+
RDS Protocol
============
Message header
+
+ The message header is a 'struct rds_header' (see rds.h):
+ Fields:
+ h_sequence:
+ per-packet sequence number
+ h_ack:
+ piggybacked acknowledgment of last packet received
+ h_len:
+ length of data, not including header
+ h_sport:
+ source port
+ h_dport:
+ destination port
+ h_flags:
+ CONG_BITMAP - this is a congestion update bitmap
+ ACK_REQUIRED - receiver must ack this packet
+ RETRANSMITTED - packet has previously been sent
+ h_credit:
+ indicate to other end of connection that
+ it has more credits available (i.e. there is
+ more send room)
+ h_padding[4]:
+ unused, for future use
+ h_csum:
+ header checksum
+ h_exthdr:
+ optional data can be passed here. This is currently used for
+ passing RDMA-related information.
+
ACK and retransmit handling
- Cancellation
- Congestion Control
+
+ One might think that with reliable IB connections you wouldn't need
+ to ack messages that have been received. The problem is that IB
+ hardware generates an ack message before it has DMAed the message
+ into memory. This creates a potential message loss if the HCA is
+ disabled for any reason between when it sends the ack and before
+ the message is DMAed and processed. This is only a potential issue
+ if another HCA is available for fail-over.
+
+ Sending an ack immediately would allow the sender to free the sent
+ message from its send queue quickly, but could cause excessive
+ traffic to be used for acks. RDS piggybacks acks on sent data
+ packets. Ack-only packets are reduced by only allowing one to be
+ in flight at a time, and by the sender only asking for acks when
+ its send buffers start to fill up. All retransmissions are also
+ acked.
+
+ Flow Control
+
+ RDS's IB transport uses a credit-based mechanism to verify that
+ there is space in the peer's receive buffers for more data. This
+ eliminates the need for hardware retries on the connection.
+
+ Congestion
+
+ Messages waiting in the receive queue on the receiving socket
+ are accounted against the socket's SO_RCVBUF option value. Only
+ the payload bytes in the message are accounted for. If the
+ number of bytes queued equals or exceeds rcvbuf then the socket
+ is congested. All sends attempted to this socket's address
+ should block or return -EWOULDBLOCK.
+
+ Applications are expected to be reasonably tuned such that this
+ situation very rarely occurs. An application encountering this
+ "back-pressure" is considered buggy.
+
+ This is implemented by having each node maintain bitmaps which
+ indicate which ports on bound addresses are congested. As the
+ bitmap changes it is sent through all the connections which
+ terminate in the local address of the bitmap which changed.
+
+ The bitmaps are allocated as connections are brought up. This
+ avoids allocation in the interrupt handling path which queues
+ messages on sockets. The dense bitmaps let transports send the
+ entire bitmap on any bitmap change reasonably efficiently. This
+ is much easier to implement than some finer-grained
+ communication of per-port congestion. The sender does a very
+ inexpensive bit test to test if the port it's about to send to
+ is congested or not.
+
+
+RDS Transport Layer
+===================
+
+ As mentioned above, RDS is not IB-specific. Its code is divided
+ into a general RDS layer and a transport layer.
+
+ The general layer handles the socket API, congestion handling,
+ loopback, stats, usermem pinning, and the connection state machine.
+
+ The transport layer handles the details of the transport. The IB
+ transport, for example, handles all the queue pairs, work requests,
+ CM event handlers, and other Infiniband details.
+
RDS Kernel Structures
=====================
+ struct rds_message
+ aka possibly "rds_outgoing", the generic RDS layer copies data to
+ be sent and sets header fields as needed, based on the socket API.
+ This is then queued for the individual connection and sent by the
+ connection's transport.
+ struct rds_incoming
+ a generic struct referring to incoming data that can be handed from
+ the transport to the general code and queued by the general code
+ while the socket is awoken. It is then passed back to the transport
+ code to handle the actual copy-to-user.
struct rds_socket
+ per-socket information
struct rds_connection
+ per-connection information
struct rds_transport
- rds work structs: send, recv, conn
+ pointers to transport-specific functions
+ struct rds_statistics
+ non-transport-specific statistics
+ struct rds_cong_map
+ wraps the raw congestion bitmap, contains rbnode, waitq, etc.
Connection management
=====================
- Connection states
- taking connection up and down
+ Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+ ERROR states.
+
+ The first time an attempt is made by an RDS socket to send data to
+ a node, a connection is allocated and connected. That connection is
+ then maintained forever -- if there are transport errors, the
+ connection will be dropped and re-established.
+
+ Dropping a connection while packets are queued will cause queued or
+ partially-sent datagrams to be retransmitted when the connection is
+ re-established.
+
The send path
=============
- Zero-wait send path
- Using trans->xmit
+ rds_sendmsg()
+ struct rds_message built from incoming data
+ CMSGs parsed (e.g. RDMA ops)
+ transport connection allocated and connected if not already
+ rds_message placed on send queue
+ send worker awoken
+ rds_send_worker()
+ calls rds_send_xmit() until queue is empty
+ rds_send_xmit()
+ transmits congestion map if one is pending
+ may set ACK_REQUIRED
+ calls transport to send either non-RDMA or RDMA message
+ (RDMA ops never retransmitted)
+ rds_ib_xmit()
+ allocs work requests from send ring
+ adds any new send credits available to peer (h_credits)
+ maps the rds_message's sg list
+ piggybacks ack
+ populates work requests
+ post send to connection's queue pair
The recv path
=============
- Receiving congestion updates
-
-RDS over Infiniband
-===================
-
- ib_cm
- rds_ib_xmit
- ib_rdma
+ rds_ib_recv_cq_comp_handler()
+ looks at write completions
+ unmaps recv buffer from device
+ no errors, call rds_ib_process_recv()
+ refill recv ring
+ rds_ib_process_recv()
+ validate header checksum
+ copy header to rds_ib_incoming struct if start of a new datagram
+ add to ibinc's fraglist
+ if completed datagram:
+ update cong map if datagram was cong update
+ call rds_recv_incoming() otherwise
+ note if ack is required
+ rds_recv_incoming()
+ drop duplicate packets
+ respond to pings
+ find the sock associated with this datagram
+ add to sock queue
+ wake up sock
+ do some congestion calculations
+ rds_recvmsg()
+ copy data into user iovec
+ handle CMSGs
+ return to application
+
+
Modified: branches/ofed-1.4.2/rds-tools/trunk/pfhack.c
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/pfhack.c?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/pfhack.c (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/pfhack.c Fri Aug 7 15:50:44 2009
@@ -80,8 +80,13 @@
return *found;
fd = open(path, O_RDONLY);
- if (fd < 0)
- explode("Can't open address constant");
+ if (fd < 0) {
+ /* hmm, no more constants in /proc. we must not need it anymore
+ * so use official values.
+ */
+ *found = official;
+ return official;
+ }
while (total < sizeof(buf)) {
ret = read(fd, buf + total, sizeof(buf) - total);
Modified: branches/ofed-1.4.2/rds-tools/trunk/pfhack.h
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/pfhack.h?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/pfhack.h (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/pfhack.h Fri Aug 7 15:50:44 2009
@@ -44,8 +44,8 @@
#ifndef __PF_HACK_H
#define __PF_HACK_H
-#define OFFICIAL_PF_RDS 32
-#define OFFICIAL_SOL_RDS 272
+#define OFFICIAL_PF_RDS 21
+#define OFFICIAL_SOL_RDS 276
#ifdef DYNAMIC_PF_RDS
Modified: branches/ofed-1.4.2/rds-tools/trunk/rds-info.c
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/rds-info.c?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/rds-info.c (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/rds-info.c Fri Aug 7 15:50:44 2009
@@ -180,6 +180,26 @@
ipv4addr(msg.faddr),
ntohs(msg.fport),
msg.seq, msg.len);
+ }
+}
+
+static void print_tcp_socks(void *data, int each, socklen_t len, void *extra)
+{
+ struct rds_info_tcp_socket ts;
+
+ printf("\nTCP Connections:\n"
+ "%15s %5s %15s %5s %10s %10s %10s %10s %10s\n",
+ "LocalAddr", "LPort", "RemoteAddr", "RPort",
+ "HdrRemain", "DataRemain", "SentNxt", "ExpectUna", "SeenUna");
+
+ for_each(ts, data, each, len) {
+ printf("%15s %5u %15s %5u %10"PRIu64" %10"PRIu64" %10u %10u %10u\n",
+ ipv4addr(ts.local_addr),
+ ntohs(ts.local_port),
+ ipv4addr(ts.peer_addr),
+ ntohs(ts.peer_port),
+ ts.hdr_rem, ts.data_rem, ts.last_sent_nxt,
+ ts.last_expected_una, ts.last_seen_una);
}
}
@@ -230,6 +250,8 @@
print_msgs, "Send", 0 },
['t'] = { RDS_INFO_RETRANS_MESSAGES, "retransmit queue messages",
print_msgs, "Retransmit", 0 },
+ ['T'] = { RDS_INFO_TCP_SOCKETS, "TCP transport sockets",
+ print_tcp_socks, NULL, 0 },
['I'] = { RDS_INFO_IB_CONNECTIONS, "IB transport connections",
print_ib_conns, NULL, 0 },
};
Modified: branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c Fri Aug 7 15:50:44 2009
@@ -1511,6 +1511,9 @@
sin.sin_family = AF_INET;
sin.sin_port = htons(opts->starting_port + 1 + id);
sin.sin_addr.s_addr = htonl(opts->receive_addr);
+
+ /* give main display thread a little edge? */
+ nice(5);
memset(tasks, 0, sizeof(tasks));
for (i = 0; i < opts->nr_tasks; i++) {