[Pkg-ofed-commits] r433 - in /branches/ofed-1.4.2/rds-tools/trunk: ./ debian/ docs/
gmpc-guest at alioth.debian.org
Fri Aug 7 15:50:44 UTC 2009
Author: gmpc-guest
Date: Fri Aug 7 15:50:44 2009
New Revision: 433
URL: http://svn.debian.org/wsvn/pkg-ofed/?sc=1&rev=433
Log:
OFED 1.4.2 release
Modified:
branches/ofed-1.4.2/rds-tools/trunk/debian/changelog
branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt
branches/ofed-1.4.2/rds-tools/trunk/pfhack.c
branches/ofed-1.4.2/rds-tools/trunk/pfhack.h
branches/ofed-1.4.2/rds-tools/trunk/rds-info.c
branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c
Modified: branches/ofed-1.4.2/rds-tools/trunk/debian/changelog
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/debian/changelog?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/debian/changelog (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/debian/changelog Fri Aug 7 15:50:44 2009
@@ -1,3 +1,9 @@
+rds-tools (1.4.1-OFED-1.4.2-1) unstable; urgency=low
+
+ * New upstream release
+
+ -- Guy Coates <gmpc at sanger.ac.uk> Fri, 07 Aug 2009 16:34:06 +0100
+
rds-tools (1.4-1) unstable; urgency=low
* Fix manpage
Modified: branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/docs/rds-architecture.txt Fri Aug 7 15:50:44 2009
@@ -5,202 +5,352 @@
This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.
-There is a *little* bit of extra documentation available. The rds-tools
-package has two manpages rds(7) and rds-rdma(7) that describe the interface
-a little. If you search the rds-devel archives, Rick Frank posted
-a bunch of messages discussing the ideas behing RDS between early to
-mid November 2007. Not all of that material still applies 100% to the
-current code - for instance we no longer have RDMA barriers. But it may
-be helpful.
-
-In particular, there's a message dated Nov 15, subject "What is RDS and
-why did we build it?" which contains a doc file with some motivation on
-the design.
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
RDS Architecture
================
RDS provides reliable, ordered datagram delivery by using a single
-reliably connection between any two nodes in the cluster. This allows
+reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.
RDS is not Infiniband-specific; it was designed to support different
-transports. The current implementation supports RDS over IB as well as
-TCP. Work is in progress to support RDS over iWARP.
+transports. The current implementation used to support RDS over TCP as well
+as IB. Work is in progress to support RDS over iWARP, and using DCE (Data
+Center Ethernet) to guarantee no dropped packets on Ethernet, it may be
+possible to use RDS over UDP in the future.
The high-level semantics of RDS from the application's point of view are
* Addressing
- RDS uses IPv4 addresses and 16bit port numbers to identify
- the end point of a connection. All socket operations that involve
- passing addresses between kernel and user space generally
- use a struct sockaddr_in.
-
- The fact that IPv4 addresses are used does not mean the underlying
- transport has to be IP-based. In fact, RDS over IB uses a
- reliable IB connection; the IP address is used exclusively to
- locate the remote node's GID (by ARPing for the given IP).
-
- The port space is entirely independent of UDP, TCP or any other
- protocol.
+ RDS uses IPv4 addresses and 16bit port numbers to identify
+ the end point of a connection. All socket operations that involve
+ passing addresses between kernel and user space generally
+ use a struct sockaddr_in.
+
+ The fact that IPv4 addresses are used does not mean the underlying
+ transport has to be IP-based. In fact, RDS over IB uses a
+ reliable IB connection; the IP address is used exclusively to
+ locate the remote node's GID (by ARPing for the given IP).
+
+ The port space is entirely independent of UDP, TCP or any other
+ protocol.
* Socket interface
- RDS sockets work *mostly* as you would expect from a BSD
- socket. The next section will cover the details. At any rate,
- all I/O is performed through the standard BSD socket API.
- Some additions like zerocopy support are implemented through
- control messages, while other extensions use the getsockopt/
- setsockopt calls.
-
- Sockets must be bound before you can send or receive data.
- This is needed because binding also selects a transport and
- attaches it to the socket. Once bound, the transport assignment
- does not change. RDS will tolerate IPs moving around (eg in
- a active-active HA scenario), but only as long as the address
- doesn't move to a different transport.
+ RDS sockets work *mostly* as you would expect from a BSD
+ socket. The next section will cover the details. At any rate,
+ all I/O is performed through the standard BSD socket API.
+ Some additions like zerocopy support are implemented through
+ control messages, while other extensions use the getsockopt/
+ setsockopt calls.
+
+ Sockets must be bound before you can send or receive data.
+ This is needed because binding also selects a transport and
+ attaches it to the socket. Once bound, the transport assignment
+ does not change. RDS will tolerate IPs moving around (e.g. in
+ an active-active HA scenario), but only as long as the address
+ doesn't move to a different transport.
* sysctls
- RDS supports a number of sysctls in /proc/sys/net/rds
+ RDS supports a number of sysctls in /proc/sys/net/rds
+
Socket Interface
================
-AF_RDS, PF_RDS, SOL_RDS
- These constants haven't been assigned yet, because RDS isn't in
- mainline yet. Currently, the kernel module assigns some constant
- and publishes it to user space through two sysctl files
- /proc/sys/net/rds/pf_rds
- /proc/sys/net/rds/sol_rds
-
-fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
- This creates a new, unbound RDS socket.
-
-setsockopt(SOL_SOCKET): send and receive buffer size
- RDS honors the send and receive buffer size socket options.
- You are not allowed to queue more than SO_SNDSIZE bytes to
- a socket. A message is queueud when you call sendmsg, and
- it leaves the queue when the remote system acknowledges
- its arrival.
-
- The SO_RCVSIZE option controls the maximum receive queue length.
- This is a soft limit rather than a hard limit - RDS will
- continue to accept and queue incoming messages, even if that
- takes the queue length over the limit. However, it will also
- mark the port as "congested" and send a congestion update to
- the source node. The source node is supposed to throttle any
- processes sending to this congested port.
-
-bind(fd, &sockaddr_in, ...)
- This binds the socket to a local IP address and port, and a
- transport.
-
-sendmsg(fd, ...)
- Sends a message to the indicated recipient. The kernel will
- transparently establish the underlying reliable connection
- if it isn't up yet.
-
- An attempt to send a message that exceeds SO_SNDSIZE will
- return with -EMSGSIZE
-
- An attempt to send a message that would take the total number
- of queued bytes over the SO_SNDSIZE threshold will return
- EAGAIN.
-
- An attempt to send a message to a destination that is marked
- as "congested" will return ENOBUFS.
-
-recvmsg(fd, ...)
- Receives a message that was queued to this socket. The sockets
- recv queue accounting is adjusted, and if the queue length
- drops below SO_SNDSIZE, the port is marked uncongested, and
- a congestion update is sent to all peers.
-
- Applications can ask the RDS kernel module to receive
- notifications via control messages (for instance, there is a
- notification when a congestion update arrived, or when a RDMA
- operation completes). These notifications are received through
- the msg.msg_control buffer of struct msghdr. The format of the
- messages is described in manpages.
-
-poll(fd)
- RDS supports the poll interface to allow the application
- to implement async I/O.
-
- POLLIN handling is pretty straightforward. When there's an
- incoming message queued to the socket, or a pending notification,
- we signal POLLIN.
-
- POLLOUT is a little harder. Since you can essentially send
- to any destination, RDS will always signal POLLOUT as long as
- there's room on the send queue (ie the number of bytes queued
- is less than the sendbuf size).
-
- However, the kernel will refuse to accept messages to
- a destination marked congested - in this case you will loop
- forever if you rely on poll to tell you what to do.
- This isn't a trivial problem, but applications can deal with
- this - by using congestion notifications, and by checking for
- ENOBUFS errors returned by sendmsg.
-
-setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
- This allows the application to discard all messages queued to a
- specific destination on this particular socket.
-
- This allows the application to cancel outstanding messages if
- it detects a timeout. For instance, if it tried to send a message,
- and the remote host is unreachable, RDS will keep trying forever.
- The application may decide it's not worth it, and cancel the
- operation. In this case, it would use RDS_CANCEL_SENT_TO to
- nuke any pending messages.
+ AF_RDS, PF_RDS, SOL_RDS
+ These constants haven't been assigned yet, because RDS isn't in
+ mainline yet. Currently, the kernel module assigns some constant
+ and publishes it to user space through two sysctl files
+ /proc/sys/net/rds/pf_rds
+ /proc/sys/net/rds/sol_rds
+
+ fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ This creates a new, unbound RDS socket.
+
+ setsockopt(SOL_SOCKET): send and receive buffer size
+ RDS honors the send and receive buffer size socket options.
+ You are not allowed to queue more than SO_SNDSIZE bytes to
+ a socket. A message is queued when sendmsg is called, and
+ it leaves the queue when the remote system acknowledges
+ its arrival.
+
+ The SO_RCVSIZE option controls the maximum receive queue length.
+ This is a soft limit rather than a hard limit - RDS will
+ continue to accept and queue incoming messages, even if that
+ takes the queue length over the limit. However, it will also
+ mark the port as "congested" and send a congestion update to
+ the source node. The source node is supposed to throttle any
+ processes sending to this congested port.
+
+ bind(fd, &sockaddr_in, ...)
+ This binds the socket to a local IP address and port, and a
+ transport.
+
+ sendmsg(fd, ...)
+ Sends a message to the indicated recipient. The kernel will
+ transparently establish the underlying reliable connection
+ if it isn't up yet.
+
+ An attempt to send a message that exceeds SO_SNDSIZE will
+ return with -EMSGSIZE.
+
+ An attempt to send a message that would take the total number
+ of queued bytes over the SO_SNDSIZE threshold will return
+ EAGAIN.
+
+ An attempt to send a message to a destination that is marked
+ as "congested" will return ENOBUFS.
+
+ recvmsg(fd, ...)
+ Receives a message that was queued to this socket. The socket's
+ recv queue accounting is adjusted, and if the queue length
+ drops below SO_RCVSIZE, the port is marked uncongested, and
+ a congestion update is sent to all peers.
+
+ Applications can ask the RDS kernel module to receive
+ notifications via control messages (for instance, there is a
+ notification when a congestion update arrives, or when an RDMA
+ operation completes). These notifications are received through
+ the msg.msg_control buffer of struct msghdr. The format of the
+ messages is described in manpages.
+
+ poll(fd)
+ RDS supports the poll interface to allow the application
+ to implement async I/O.
+
+ POLLIN handling is pretty straightforward. When there's an
+ incoming message queued to the socket, or a pending notification,
+ we signal POLLIN.
+
+ POLLOUT is a little harder. Since you can essentially send
+ to any destination, RDS will always signal POLLOUT as long as
+ there's room on the send queue (ie the number of bytes queued
+ is less than the sendbuf size).
+
+ However, the kernel will refuse to accept messages to
+ a destination marked congested - in this case you will loop
+ forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with
+ this - by using congestion notifications, and by checking for
+ ENOBUFS errors returned by sendmsg.
+
+ setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+ This allows the application to discard all messages queued to a
+ specific destination on this particular socket.
+
+ This allows the application to cancel outstanding messages if
+ it detects a timeout. For instance, if it tried to send a message,
+ and the remote host is unreachable, RDS will keep trying forever.
+ The application may decide it's not worth it, and cancel the
+ operation. In this case, it would use RDS_CANCEL_SENT_TO to
+ nuke any pending messages.
+
RDMA for RDS
============
-see manpage for now
+ see rds-rdma(7) manpage (available in rds-tools)
+
Congestion Notifications
========================
-see manpage
+ see rds(7) manpage
+
RDS Protocol
============
Message header
+
+ The message header is a 'struct rds_header' (see rds.h):
+ Fields:
+ h_sequence:
+ per-packet sequence number
+ h_ack:
+ piggybacked acknowledgment of last packet received
+ h_len:
+ length of data, not including header
+ h_sport:
+ source port
+ h_dport:
+ destination port
+ h_flags:
+ CONG_BITMAP - this is a congestion update bitmap
+ ACK_REQUIRED - receiver must ack this packet
+ RETRANSMITTED - packet has previously been sent
+ h_credit:
+ indicate to other end of connection that
+ it has more credits available (i.e. there is
+ more send room)
+ h_padding[4]:
+ unused, for future use
+ h_csum:
+ header checksum
+ h_exthdr:
+ optional data can be passed here. This is currently used for
+ passing RDMA-related information.
+
ACK and retransmit handling
- Cancellation
- Congestion Control
+
+ One might think that with reliable IB connections you wouldn't need
+ to ack messages that have been received. The problem is that IB
+ hardware generates an ack message before it has DMAed the message
+ into memory. This creates a potential message loss if the HCA is
+ disabled for any reason between when it sends the ack and before
+ the message is DMAed and processed. This is only a potential issue
+ if another HCA is available for fail-over.
+
+ Sending an ack immediately would allow the sender to free the sent
+ message from its send queue quickly, but could cause excessive
+ traffic to be used for acks. RDS piggybacks acks on sent data
+ packets. Ack-only packets are reduced by only allowing one to be
+ in flight at a time, and by the sender only asking for acks when
+ its send buffers start to fill up. All retransmissions are also
+ acked.
+
+ Flow Control
+
+ RDS's IB transport uses a credit-based mechanism to verify that
+ there is space in the peer's receive buffers for more data. This
+ eliminates the need for hardware retries on the connection.
+
+ Congestion
+
+ Messages waiting in the receive queue on the receiving socket
+ are accounted against the socket's SO_RCVBUF option value. Only
+ the payload bytes in the message are accounted for. If the
+ number of bytes queued equals or exceeds rcvbuf then the socket
+ is congested. All sends attempted to this socket's address
+ should block or return -EWOULDBLOCK.
+
+ Applications are expected to be reasonably tuned such that this
+ situation very rarely occurs. An application encountering this
+ "back-pressure" is considered buggy.
+
+ This is implemented by having each node maintain bitmaps which
+ indicate which ports on bound addresses are congested. As the
+ bitmap changes it is sent through all the connections which
+ terminate in the local address of the bitmap which changed.
+
+ The bitmaps are allocated as connections are brought up. This
+ avoids allocation in the interrupt handling path which queues
+ messages on sockets. The dense bitmaps let transports send the
+ entire bitmap on any bitmap change reasonably efficiently. This
+ is much easier to implement than some finer-grained
+ communication of per-port congestion. The sender does a very
+ inexpensive bit test to test if the port it's about to send to
+ is congested or not.
+
+
+RDS Transport Layer
+===================
+
+ As mentioned above, RDS is not IB-specific. Its code is divided
+ into a general RDS layer and a transport layer.
+
+ The general layer handles the socket API, congestion handling,
+ loopback, stats, usermem pinning, and the connection state machine.
+
+ The transport layer handles the details of the transport. The IB
+ transport, for example, handles all the queue pairs, work requests,
+ CM event handlers, and other Infiniband details.
+
RDS Kernel Structures
=====================
+ struct rds_message
+ aka possibly "rds_outgoing", the generic RDS layer copies data to
+ be sent and sets header fields as needed, based on the socket API.
+ This is then queued for the individual connection and sent by the
+ connection's transport.
+ struct rds_incoming
+ a generic struct referring to incoming data that can be handed from
+ the transport to the general code and queued by the general code
+ while the socket is awoken. It is then passed back to the transport
+ code to handle the actual copy-to-user.
struct rds_socket
+ per-socket information
struct rds_connection
+ per-connection information
struct rds_transport
- rds work structs: send, recv, conn
+ pointers to transport-specific functions
+ struct rds_statistics
+ non-transport-specific statistics
+ struct rds_cong_map
+ wraps the raw congestion bitmap, contains rbnode, waitq, etc.
Connection management
=====================
- Connection states
- taking connection up and down
+ Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+ ERROR states.
+
+ The first time an attempt is made by an RDS socket to send data to
+ a node, a connection is allocated and connected. That connection is
+ then maintained forever -- if there are transport errors, the
+ connection will be dropped and re-established.
+
+ Dropping a connection while packets are queued will cause queued or
+ partially-sent datagrams to be retransmitted when the connection is
+ re-established.
+
The send path
=============
- Zero-wait send path
- Using trans->xmit
+ rds_sendmsg()
+ struct rds_message built from incoming data
+ CMSGs parsed (e.g. RDMA ops)
+ transport connection allocated and connected if not already
+ rds_message placed on send queue
+ send worker awoken
+ rds_send_worker()
+ calls rds_send_xmit() until queue is empty
+ rds_send_xmit()
+ transmits congestion map if one is pending
+ may set ACK_REQUIRED
+ calls transport to send either non-RDMA or RDMA message
+ (RDMA ops never retransmitted)
+ rds_ib_xmit()
+ allocs work requests from send ring
+ adds any new send credits available to peer (h_credits)
+ maps the rds_message's sg list
+ piggybacks ack
+ populates work requests
+ post send to connection's queue pair
The recv path
=============
- Receiving congestion updates
-
-RDS over Infiniband
-===================
-
- ib_cm
- rds_ib_xmit
- ib_rdma
+ rds_ib_recv_cq_comp_handler()
+ looks at write completions
+ unmaps recv buffer from device
+ no errors, call rds_ib_process_recv()
+ refill recv ring
+ rds_ib_process_recv()
+ validate header checksum
+ copy header to rds_ib_incoming struct if start of a new datagram
+ add to ibinc's fraglist
+ if completed datagram:
+ update cong map if datagram was cong update
+ call rds_recv_incoming() otherwise
+ note if ack is required
+ rds_recv_incoming()
+ drop duplicate packets
+ respond to pings
+ find the sock associated with this datagram
+ add to sock queue
+ wake up sock
+ do some congestion calculations
+ rds_recvmsg()
+ copy data into user iovec
+ handle CMSGs
+ return to application
+
+
Modified: branches/ofed-1.4.2/rds-tools/trunk/pfhack.c
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/pfhack.c?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/pfhack.c (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/pfhack.c Fri Aug 7 15:50:44 2009
@@ -80,8 +80,13 @@
return *found;
fd = open(path, O_RDONLY);
- if (fd < 0)
- explode("Can't open address constant");
+ if (fd < 0) {
+ /* hmm, no more constants in /proc. we must not need it anymore
+ * so use official values.
+ */
+ *found = official;
+ return official;
+ }
while (total < sizeof(buf)) {
ret = read(fd, buf + total, sizeof(buf) - total);
Modified: branches/ofed-1.4.2/rds-tools/trunk/pfhack.h
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/pfhack.h?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/pfhack.h (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/pfhack.h Fri Aug 7 15:50:44 2009
@@ -44,8 +44,8 @@
#ifndef __PF_HACK_H
#define __PF_HACK_H
-#define OFFICIAL_PF_RDS 32
-#define OFFICIAL_SOL_RDS 272
+#define OFFICIAL_PF_RDS 21
+#define OFFICIAL_SOL_RDS 276
#ifdef DYNAMIC_PF_RDS
Modified: branches/ofed-1.4.2/rds-tools/trunk/rds-info.c
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/rds-info.c?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/rds-info.c (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/rds-info.c Fri Aug 7 15:50:44 2009
@@ -180,6 +180,26 @@
ipv4addr(msg.faddr),
ntohs(msg.fport),
msg.seq, msg.len);
+ }
+}
+
+static void print_tcp_socks(void *data, int each, socklen_t len, void *extra)
+{
+ struct rds_info_tcp_socket ts;
+
+ printf("\nTCP Connections:\n"
+ "%15s %5s %15s %5s %10s %10s %10s %10s %10s\n",
+ "LocalAddr", "LPort", "RemoteAddr", "RPort",
+ "HdrRemain", "DataRemain", "SentNxt", "ExpectUna", "SeenUna");
+
+ for_each(ts, data, each, len) {
+ printf("%15s %5u %15s %5u %10"PRIu64" %10"PRIu64" %10u %10u %10u\n",
+ ipv4addr(ts.local_addr),
+ ntohs(ts.local_port),
+ ipv4addr(ts.peer_addr),
+ ntohs(ts.peer_port),
+ ts.hdr_rem, ts.data_rem, ts.last_sent_nxt,
+ ts.last_expected_una, ts.last_seen_una);
}
}
@@ -230,6 +250,8 @@
print_msgs, "Send", 0 },
['t'] = { RDS_INFO_RETRANS_MESSAGES, "retransmit queue messages",
print_msgs, "Retransmit", 0 },
+ ['T'] = { RDS_INFO_TCP_SOCKETS, "TCP transport sockets",
+ print_tcp_socks, NULL, 0 },
['I'] = { RDS_INFO_IB_CONNECTIONS, "IB transport connections",
print_ib_conns, NULL, 0 },
};
Modified: branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c
URL: http://svn.debian.org/wsvn/pkg-ofed/branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c?rev=433&op=diff
==============================================================================
--- branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c (original)
+++ branches/ofed-1.4.2/rds-tools/trunk/rds-stress.c Fri Aug 7 15:50:44 2009
@@ -1511,6 +1511,9 @@
sin.sin_family = AF_INET;
sin.sin_port = htons(opts->starting_port + 1 + id);
sin.sin_addr.s_addr = htonl(opts->receive_addr);
+
+ /* give main display thread a little edge? */
+ nice(5);
memset(tasks, 0, sizeof(tasks));
for (i = 0; i < opts->nr_tasks; i++) {