InfiniBand is a technology that was developed to address the performance problems associated with data movement between computer input/output (I/O) devices and associated protocol stack processing. Although InfiniBand was developed to address I/O performance, InfiniBand is widely deployed within high performance compute (HPC) clusters due to the high bandwidth and low latency transport characteristics it offers.
This document is an introduction to InfiniBand concepts and technologies and how these relate to HPC. It also provides a brief overview of the IP-over-IB, SDP, uDAPL, and MPI protocols that are commonly encountered when using InfiniBand.
INTRODUCTION
Computers are made up of a number of addressable elements (CPU, memory, screen, hard disks, LAN and SAN interfaces, etc.) that use a system bus for communications. As these elements have become faster, the system bus and the overhead associated with data movement between devices, commonly referred to as I/O, have become a gating factor in computer performance.
To address the problem of server performance with respect to I/O in particular, InfiniBand was developed as a standards-based protocol to provide data movement offload from the CPU to dedicated hardware, thus allowing more CPU resources to be dedicated to application processing. As a result, InfiniBand, by leveraging networking technologies and principles, provides scalable, high-bandwidth transport for efficient communications between InfiniBand-attached devices.
INFINIBAND ARCHITECTURE
InfiniBand defines an architecture that leverages networking principles-switching and routing-to provide a scalable, high-performance server I/O fabric. InfiniBand provides transport services for upper-layer protocols and supports flow control and quality of service (QoS) to provide ordered, guaranteed packet delivery across the fabric. An InfiniBand fabric may comprise a number of InfiniBand subnets that are interconnected using InfiniBand routers, where each subnet may consist of one or more InfiniBand switches (Figure 1) and InfiniBand-attached devices. Each point-to-point connection within an InfiniBand subnet is referred to as a link, and may be copper, optical, or even a printed circuit board.
Figure 1. InfiniBand Functional Elements
Channel adapters (CAs) provide the device with an InfiniBand interface and protocol stack to provide communications between InfiniBand-connected devices. InfiniBand supports two types of channel adapter: a host channel adapter (HCA) and a target channel adapter (TCA). An HCA supports the full InfiniBand protocol stack and may initiate or accept connections to or from other InfiniBand-attached devices. By contrast, a TCA supports a subset of the InfiniBand protocol stack. InfiniBand-attached disk arrays are examples of InfiniBand TCAs.
InfiniBand switches and routers support unicast and multicast packet forwarding between InfiniBand-attached hosts. InfiniBand switches forward packets between hosts attached to the same InfiniBand subnet using the destination local ID (LID) within the local routing header (LRH). InfiniBand switches are also responsible for enforcing QoS and flow control within the InfiniBand network.
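The destination-LID lookup described above can be sketched as a simple table lookup. This is a minimal illustration only: the table contents, the port numbers, and the `forward` function name are hypothetical, and a real switch performs this lookup in hardware using tables programmed by the subnet manager.

```python
from typing import Dict, Optional

# Hypothetical forwarding table: destination LID -> egress port.
# In a real fabric the subnet manager programs this table.
forwarding_table: Dict[int, int] = {0x0011: 1, 0x0012: 2, 0x0013: 2}


def forward(dest_lid: int, table: Dict[int, int]) -> Optional[int]:
    """Return the egress port for a destination LID, or None if it is unknown."""
    return table.get(dest_lid)


print(forward(0x0012, forwarding_table))
```

Several LIDs may map to the same egress port, which is how a switch reaches hosts that sit behind another switch in the same subnet.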
InfiniBand routers forward packets based on the Internet Protocol, version 6 (IPv6) format destination global ID within the global route header (GRH). As with IP routers, InfiniBand routers rewrite the LRH and decrement the hop limit of the frame as it is forwarded between subnets. It should be noted that although the InfiniBand standard defines a router, currently there are no commercially available InfiniBand routers.
The InfiniBand subnet manager (SM) provides active management of the operational characteristics of the InfiniBand fabric and, consequently, the SM is critical to the operation of the InfiniBand fabric. Within an InfiniBand network, the SM is responsible for:
• Discovery of the InfiniBand fabric topology
• Discovery of InfiniBand-attached nodes
• Activation of new links
• Path calculation and distribution
• Configuration of attached switches and end nodes with local and global IDs, partition keys, etc.
• Configuring switch forwarding tables, virtual lane-to-service level mappings, etc.
• Receiving and reacting to subnet management agent events
An InfiniBand SM communicates with SM agents that are embedded within each element of the InfiniBand fabric, including HCAs, to learn the topology of the network. When all of the devices that are attached to the InfiniBand subnet have been discovered, the SM distributes local and global IDs, configuration parameters, path routing information, and so on, to the InfiniBand nodes to ensure configuration and policy consistency.
Given that the SM controls all operational aspects of the InfiniBand network, SM redundancy is critical. Although only one SM can actively manage the InfiniBand network at any time, the InfiniBand architecture allows the definition of multiple SMs that provide backup services in the event of an active SM failure. To establish which SM has the active management role, the SMs negotiate which SM will be active, and what the standby hierarchy is in the event of an active SM failure. If an SM with a higher priority joins an InfiniBand fabric that has an established active SM, the SMs will negotiate a graceful handover of active status to the higher-priority SM. The SMs also exchange information periodically to maintain database consistency and ensure continuous operation in the event of a primary SM failure.
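The priority-based negotiation described above can be sketched as a sort over the known SMs. This is an illustration under stated assumptions: the `SubnetManager` class and `standby_order` function are hypothetical names, and breaking priority ties on the lower GUID is an assumption of this sketch rather than something stated in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubnetManager:
    name: str
    priority: int  # higher value wins the active role
    guid: int      # assumed tie-breaker: the lower GUID wins


def standby_order(sms: List[SubnetManager]) -> List[SubnetManager]:
    """Return SMs ordered so that the first is active and the rest are standbys."""
    return sorted(sms, key=lambda sm: (-sm.priority, sm.guid))


sms = [SubnetManager("sm-a", 5, 0x20), SubnetManager("sm-b", 5, 0x10)]
order = standby_order(sms)
print([sm.name for sm in order])

# A higher-priority SM joins the fabric; re-running the election models the
# graceful handover of the active role.
sms.append(SubnetManager("sm-c", 9, 0x30))
order_after = standby_order(sms)
print(order_after[0].name)
```

The remainder of the sorted list gives the standby hierarchy that applies if the active SM fails.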
The InfiniBand standard does not mandate whether the SM function is embedded within the InfiniBand switch elements, hosted within a workstation as a standalone function, or a mixture of both. The Cisco Systems® InfiniBand Server Switching products offer both embedded and standalone SMs to provide flexible deployment and performance options.
INFINIBAND PROTOCOL STACK
From a protocol perspective, the InfiniBand architecture consists of four layers: physical, logical, network, and transport. These layers are analogous to Layers 1 through 4 of the OSI protocol stack (Figure 2). As with the seven-layer OSI model, each layer in the InfiniBand architecture abstracts its operation and provides services to adjacent layers. One important aspect of the InfiniBand model is that all InfiniBand functions are implemented in hardware. The host CPU is not interrupted or used for InfiniBand transport and consequently InfiniBand is an efficient and high-performance transport protocol.
Figure 2. InfiniBand Protocol Stack
InfiniBand provides, in hardware, all services required to move data between hosts. To leverage these services, the application must be able to interact with the InfiniBand hardware either by writing the application to use the native InfiniBand verbs interface, or by leveraging an intermediate layer such as the Message Passing Interface (MPI) or User Direct Access Programming Library (uDAPL). If the application is a sockets-based protocol, the Sockets Direct Protocol (SDP) allows the application to leverage InfiniBand capabilities without costly application software changes. The operation of IPoIB, MPI, uDAPL, and SDP is discussed later in this document.
InfiniBand supports Ethernet functionality using the raw datagram transport that tunnels the Ethernet encapsulated packets within the InfiniBand payload. IP over InfiniBand (IPoIB) enables IP communications between Ethernet and InfiniBand-attached hosts using either IPv4 or IPv6. It should be noted that InfiniBand is a server I/O technology that is not intended to replace LAN or WAN technologies.
InfiniBand Physical Layer
The InfiniBand physical layer specification supports three data rates, designated 1X, 4X and 12X, over both copper and fiber-optic cables. The base data rate, 1X, is clocked at 2.5 Gbps, is transmitted over two pairs of wires (transmit and receive), and yields an effective data rate of 2 Gbps full duplex (2 Gbps transmit, 2 Gbps receive). The InfiniBand 4X and 12X interfaces use the same base clock rate but use multiple transmit and receive pairs, where each transmit/receive pair set is commonly referred to as a lane. This enables an InfiniBand 4X interface to realize a signaling rate of 10 Gbps (8 Gbps data rate) using 4 lanes, and an InfiniBand 12X interface to realize a signaling rate of 30 Gbps (24 Gbps data rate) using 12 lanes.
Table 1. InfiniBand Link Comparison
| InfiniBand Link | Signal Pairs | Signaling Rate | Data Rate | Full-Duplex Data Rate |
| --- | --- | --- | --- | --- |
| 1X | 2 | 2.5 Gbps | 2.0 Gbps | 4.0 Gbps |
| 4X | 8 | 10 Gbps (4*2.5 Gbps) | 8 Gbps (4*2 Gbps) | 16 Gbps |
| 12X | 24 | 30 Gbps (12*2.5 Gbps) | 24 Gbps (12*2 Gbps) | 48 Gbps |
Note: Although the signaling rate is 2.5 Gbps, the effective data rate is limited to 2 Gbps, due to the 8B/10B encoding scheme; i.e., (2.5*8)/10 = 2 Gbps
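The arithmetic behind Table 1 can be reproduced with a few lines of code. This is a minimal sketch; the function and constant names are illustrative, and the only inputs are the 2.5 Gbps lane rate and the 8B/10B encoding ratio given above.

```python
LANE_SIGNALING_GBPS = 2.5  # base (1X) signaling rate per lane


def link_rates(lanes: int, lane_gbps: float = LANE_SIGNALING_GBPS):
    """Return (signaling, data, full-duplex data) rates in Gbps for a link."""
    signaling = lanes * lane_gbps
    data = signaling * 8 / 10  # 8B/10B: 8 data bits carried per 10 line bits
    return signaling, data, 2 * data


for lanes in (1, 4, 12):
    signaling, data, full_duplex = link_rates(lanes)
    print(f"{lanes}X: signaling {signaling} Gbps, data {data} Gbps, "
          f"full duplex {full_duplex} Gbps")
```

Passing a different `lane_gbps` value models a faster lane clock with the same encoding overhead.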
Table 2. InfiniBand Cables and Transmission Distance
| Cable Type | Link Rate | Distance | Notes |
| --- | --- | --- | --- |
| CX-4 copper | 1X | 0-20 m | |
| CX-4 copper | 4X | 0-15 m | |
| CX-4 copper | 12X | 0-10 m | |
| Optical fiber, 62.5 micron multimode | 4X | 2-75 m | 12-core ribbon |
| Optical fiber, 50 micron @ 500 MHz·km | 4X | 2-125 m | 12-core ribbon, Cisco supported |
| Optical fiber, 50 micron @ 2000 MHz·km | 4X | 2-200 m | 12-core ribbon |
Note (CX-4 copper): Although the InfiniBand specification calls for 20 m, Cisco supports up to 15 m due to bit-error ratio (BER) degradation.
Note (optical fiber): InfiniBand specification: 2-200 m.
Note: The 12-core ribbon cable uses the first four fibers as the transmit path. The center four fibers are unused, and the last four fibers are used for the receive path. Each fiber strand supports 2.5-Gbps transmission.
The structure of bandwidth offered by InfiniBand is similar to multi-path routing or EtherChannel in that multiple 2.5 Gbps links are bonded together to form higher speed interfaces. This makes switch design and buffering much simpler because packets do not need to be buffered to accommodate differences in link speeds, and it enables cut-through switching to be used. As an example, if a packet is received from a 2.5 Gbps 1X InfiniBand link, the packet can be switched to a 2.5 Gbps subchannel within a 30 Gbps 12X link without buffering the packet. Cut-through switching is a switching technique that enables a switch to forward a packet before the whole packet has been received, typically as soon as the destination address is received. The benefit of cut-through switching is that it reduces switching latency, although the switch cannot check the complete packet for errors before it begins forwarding.
InfiniBand supports double data rate (DDR) and quad data rate (QDR) transmission, which enables InfiniBand 4X links to transmit at 20 Gbps and 40 Gbps respectively. To achieve this, each InfiniBand lane is clocked at 5 Gbps (DDR) or 10 Gbps (QDR) instead of 2.5 Gbps. This increases the throughput of the physical links and also reduces network serialization delay. An additional benefit is that because packets are received faster, the InfiniBand switches can perform the destination lookup and switch the packets faster, which reduces switch latency.
It should, however, be noted that packets cannot be cut-through switched between links that have different data rates. Although single data rate (SDR), DDR, and QDR links can coexist in the same network, a packet that is received on an SDR link and switched to a DDR or QDR link must be store-and-forward switched. In practice this means that if DDR or QDR is used within the network, either all links must be DDR or QDR enabled, or the subnet manager must be aware of the different link rates so that packets are switched between DDR-enabled HCAs using DDR-only links, and between QDR-enabled HCAs using QDR-only links. The complexity introduced by mixing data rates has ramifications for future upgrades to the InfiniBand network, so any decision to mix technologies must be considered carefully.
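The rate-matching rule and the effect of the lane clock on serialization delay can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the per-lane rates are those quoted in the text, and 8B/10B encoding is assumed at all three rates.

```python
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}  # per-lane signaling rates


def switching_mode(ingress: str, egress: str) -> str:
    """Cut-through is possible only when both links use the same lane rate."""
    if LANE_GBPS[ingress] == LANE_GBPS[egress]:
        return "cut-through"
    return "store-and-forward"


def serialization_delay_us(packet_bytes: int, rate: str, lanes: int = 4) -> float:
    """Microseconds needed to clock one packet onto a link of the given rate."""
    data_gbps = LANE_GBPS[rate] * lanes * 8 / 10  # usable rate after 8B/10B
    return packet_bytes * 8 / (data_gbps * 1000)  # 1 Gbps = 1000 bits per microsecond


print(switching_mode("SDR", "SDR"), switching_mode("SDR", "DDR"))
for rate in ("SDR", "DDR", "QDR"):
    print(rate, serialization_delay_us(2048, rate))
```

The comparison is made on the lane rate rather than the aggregate link rate, which matches the earlier example of a 1X link feeding a subchannel of a 12X link without buffering.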
Logical Link and Network Headers
InfiniBand defines a Layer 2 local routing header (LRH) and a Layer 3 global routing header (GRH) for switching and routing packets between InfiniBand-attached devices. Each InfiniBand device is assigned and identified by unique local and global device identifiers that are used within the LRH and GRH to forward packets within the InfiniBand fabric. Note that if a packet is destined for a host within the same subnet, the packet may or may not have a GRH, depending upon the packet type.
Figure 3. InfiniBand LRH Header
InfiniBand switches forward packets between InfiniBand-attached devices using the 2-byte destination LID contained within the LRH. The LRH provides local ID for source and destination, virtual lane (VLane), service level, a link next header (LNH) field that indicates which upper-layer headers follow, and the length of the payload. The LRH service level and VLane fields are discussed further in the QoS section.
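To make the field list above concrete, the following sketch unpacks an 8-byte LRH. The bit layout used here (4-bit VLane, 4-bit link version, 4-bit service level, 2 reserved bits, 2-bit LNH, 16-bit destination LID, 5 reserved bits, 11-bit packet length, 16-bit source LID) is an assumption of this example and is not reproduced from Figure 3; the function name and sample bytes are also hypothetical.

```python
import struct


def parse_lrh(header: bytes) -> dict:
    """Unpack an 8-byte local routing header under the assumed layout."""
    b0, b1, dest_lid, length_word, src_lid = struct.unpack(">BBHHH", header[:8])
    return {
        "vlane": b0 >> 4,
        "link_version": b0 & 0x0F,
        "service_level": b1 >> 4,
        "lnh": b1 & 0x03,                     # which upper-layer headers follow
        "dest_lid": dest_lid,                 # the field a switch forwards on
        "packet_length": length_word & 0x07FF,
        "src_lid": src_lid,
    }


sample = bytes([0x10, 0x32, 0x00, 0x12, 0x00, 0x40, 0x00, 0x11])
fields = parse_lrh(sample)
print(fields)
```

A switch needs only `dest_lid`, `service_level`, and `vlane` from this header to make its forwarding and queuing decisions.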
The InfiniBand global routing header (GRH) is used to route frames between subnets based on the destination Global ID (GID), which is an IPv6-format, 128-bit address. An InfiniBand router operates in a manner similar to that of an IP router in that it rewrites the LRH LID as the frame passes between subnets; it also decrements the hop limit value.
Figure 4. InfiniBand GRH Header
As with IPv6, InfiniBand supports the concept of a flow label that identifies the packet as belonging to a particular stream that may require special handling, and a hop limit. The GRH also supports traffic classes, a capability that indicates the priority that the frame should receive when traversing a router.
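The router behavior described in this section (rewrite the LRH, decrement the hop limit) can be sketched as below. The `Packet` class and `route` function are hypothetical, and discarding a packet whose hop limit is exhausted is an assumption borrowed from IP routing rather than a rule stated in this document.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src_lid: int    # LRH field, valid only within the current subnet
    dest_lid: int   # LRH field, valid only within the current subnet
    dest_gid: str   # GRH field, unchanged end to end
    hop_limit: int  # GRH field, decremented by each router


def route(packet: Packet, router_lid: int, next_hop_lid: int) -> Optional[Packet]:
    """Forward a packet into the next subnet, or drop it if the hop limit is spent."""
    if packet.hop_limit <= 1:
        return None  # assumed behavior, by analogy with IP routers
    return replace(packet, src_lid=router_lid, dest_lid=next_hop_lid,
                   hop_limit=packet.hop_limit - 1)


original = Packet(src_lid=0x11, dest_lid=0x01, dest_gid="fe80::2", hop_limit=3)
routed = route(original, router_lid=0x01, next_hop_lid=0x22)
print(routed)
```

The destination GID is the only addressing field that survives the hop unchanged, which is why it, and not the LID, identifies the final destination.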
Although multicast transmission is an optional capability within the InfiniBand specification, most vendors support multicast capabilities. It should be noted that InfiniBand does not support broadcast functionality, which has ramifications for IPoIB operation that are discussed later.
Quality of Service: Service Levels and Flow Control
InfiniBand presents a number of transport services that provide different characteristics. To ensure reliable, sequenced packet delivery, InfiniBand uses flow control and service levels in conjunction with VLanes to achieve end-to-end QoS. InfiniBand VLanes are logical channels that share a common physical link, where VLane 15 has the highest priority and is used exclusively for management traffic, and VLane 0 has the lowest. The concept of a VLane is similar to that of the hardware queues found in routers and switches.
For applications that require reliable delivery, InfiniBand supports reliable delivery of packets using flow control. Within an InfiniBand network, the receivers on a point-to-point link periodically transmit information to the upstream transmitter to specify the amount of data that can be transmitted without data loss, on a per-VLane basis. The transmitter can then transmit data up to the amount of credits that are advertised by the receiver. If no buffer credits exist, data cannot be transmitted. The use of credit-based flow control prevents packet loss that might result from congestion. Furthermore, it enhances application performance, because it avoids packet retransmission. For applications that do not require reliable delivery, InfiniBand also supports unreliable delivery of packets that are not subject to flow control; such packets may be dropped with little or no consequence. Some management traffic, for example, does not require reliable delivery.
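The credit mechanism can be modeled with a per-VLane counter. This is a simplified sketch: the `CreditedLink` class is hypothetical, credits are counted in abstract buffer blocks, and the receiver adds credits incrementally, which is a modeling shortcut rather than a description of the wire protocol.

```python
from typing import Dict


class CreditedLink:
    """One direction of a link, with an independent credit count per VLane."""

    def __init__(self, credits_per_vlane: Dict[int, int]):
        self.credits = dict(credits_per_vlane)

    def try_send(self, vlane: int, blocks: int) -> bool:
        """Transmit only if the receiver has advertised enough credits."""
        if self.credits.get(vlane, 0) < blocks:
            return False  # the transmitter waits; nothing is sent or dropped
        self.credits[vlane] -= blocks
        return True

    def advertise(self, vlane: int, blocks: int) -> None:
        """The receiver frees buffer space and advertises more credits."""
        self.credits[vlane] = self.credits.get(vlane, 0) + blocks


link = CreditedLink({0: 2, 1: 4})
first = link.try_send(0, 2)    # consumes all credits on VLane 0
second = link.try_send(0, 1)   # blocked: no credits left on VLane 0
other = link.try_send(1, 1)    # VLane 1 is unaffected
link.advertise(0, 1)
third = link.try_send(0, 1)    # succeeds after the new advertisement
print(first, second, other, third)
```

Because the sender stalls instead of overrunning the receive buffer, congestion shows up as back-pressure rather than as packet loss.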
At the InfiniBand network layer, the GRH contains an 8-bit traffic class field. This value is mapped to a 4-bit service level field within the LRH to indicate the service level that the packet is requesting from the InfiniBand network. As each packet is transmitted, the HCA matches the packet's service level against a service level-to-VLane table, which has been populated by the subnet manager. The HCA then transmits the packet on the VLane associated with that service level. As the packet traverses the network, each switch matches the service level against the packet's egress port to identify the VLane within which the packet should be transported.
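The per-port service level lookup performed by each switch can be sketched as a nested table. The table contents, the `select_vlane` name, and the fallback to VLane 0 for unmapped entries are assumptions made for illustration; in a real fabric the subnet manager populates these tables.

```python
from typing import Dict

# Hypothetical tables: egress port -> {service level -> VLane}.
sl_to_vlane_by_port: Dict[int, Dict[int, int]] = {
    1: {0: 0, 3: 1},
    2: {0: 0, 3: 2},
}


def select_vlane(egress_port: int, service_level: int,
                 tables: Dict[int, Dict[int, int]], default_vlane: int = 0) -> int:
    """Pick the VLane for a packet from its service level and egress port."""
    return tables.get(egress_port, {}).get(service_level, default_vlane)


print(select_vlane(1, 3, sl_to_vlane_by_port))
print(select_vlane(2, 3, sl_to_vlane_by_port))
```

The same service level can land on different VLanes at different ports, so the service level travels with the packet while the VLane is chosen hop by hop.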
InfiniBand Subnet Management and QoS
InfiniBand supports two levels of management packets: subnet management and the general services interface (GSI). High-priority subnet management packets (SMP) are used to discover the topology of the network, attached nodes, and so on, and are transported within the high-priority VLane (which is not subject to flow control). The low-priority GSI management packets handle management functions such as chassis management and other functions not associated with subnet management. These services are not critical to subnet management, so GSI management packets are not transported within the high-priority VLane and are subject to flow control.
INFINIBAND TRANSPORT LAYER
As with the OSI model's transport layer, the InfiniBand transport layer is responsible for reliable delivery and flow control. Because not all applications require ordered and sequenced packet delivery, the InfiniBand transport layer supports different connection types that have different reliability and packet sequencing characteristics. Although remote direct memory access (RDMA) isn't strictly part of the InfiniBand transport layer, InfiniBand HCAs provide RDMA support, which offloads data movement as well as the multiplexing and demultiplexing of different streams from the CPU into the HCA.
Remote Direct Memory Access (RDMA)
One of the key problems with server I/O is the CPU overhead associated with data movement between memory and I/O devices such as LAN and SAN interfaces. InfiniBand solves this problem by using RDMA to offload data movement from the server CPU to the InfiniBand host channel adapter (HCA). RDMA is an extension of hardware-based Direct Memory Access (DMA) capabilities that allows the CPU to delegate data movement within the computer to the DMA hardware. The CPU informs the DMA hardware of the memory location where data that is associated with a particular process resides and the memory location the data is to be moved to. Once the DMA instructions are sent, the CPU can process other threads while the DMA hardware moves the data. RDMA enables data to be moved from one memory location to another, even if that memory resides on another device.
The problem that RDMA addresses is that when traditional network interface cards (NICs) receive data, the NIC invokes the following process to transfer data from the NIC to the CPU user application memory space:
1. The NIC interrupts the CPU.
2. The CPU suspends the current thread and either switches context to a suspended TCP thread, or starts a new TCP thread.
3. The CPU instructs the DMA engine where to copy the data to in I/O memory, and switches context back to any previous threads to resume processing.
4. As data is copied to I/O memory, the DMA engine periodically interrupts the CPU for TCP stack processing, etc.
5. The CPU switches context and then processes the TCP stack until all TCP segments are received and reconstructed into their original format.
6. The CPU searches and associates the data with an application and then instructs the DMA engine to copy the data into the application memory space (Figure 5).
This process is extremely inefficient, because it results in multiple copies of the same data traversing the memory system bus, and also incurs multiple CPU interrupts and context switches.
Figure 5. Traditional Server I/O
Figure 6. RDMA-Enabled Server I/O
By contrast, RDMA, an embedded hardware function of the InfiniBand CA, handles all communications operations without interrupting the CPU (Figure 6). Using RDMA, the sending device either reads data from or writes data to the target device's user-space memory, thereby avoiding CPU interrupts and multiple data copies on the memory bus. This enables RDMA to significantly reduce the CPU overhead associated with data movement between nodes.
To exploit RDMA, the application needs to be capable of leveraging the native RDMA capabilities of the InfiniBand HCA. Because most applications are unaware of the underlying hardware, they cannot realize the performance gains offered by RDMA unless the application is rewritten or intermediary software leverages RDMA capabilities on behalf of the application. Because most applications are sockets based, the Sockets Direct Protocol (SDP) and Small Computer System Interface over IP (iSCSI), using the SCSI RDMA protocol (SRP), enable sockets-based applications to exploit the performance advantages of RDMA without requiring costly application software customization.
Connection Types and Queue Pairs
Legacy NICs rely upon the CPU to process TCP for traffic multiplexing and demultiplexing, data movement, and identification of the application that the data is intended for, which incurs significant CPU overhead. By contrast, the InfiniBand HCA provides all transport services, such as reliable data delivery and multiplexing of streams, in hardware. This improves CPU efficiency by offloading data movement, protocol stack processing, and multiplexing operations from the CPU.
As described earlier, RDMA solves the problem of data movement, but does not address multiplexing, sequenced delivery and reliable delivery. To address this InfiniBand uses the concept of queue pairs (QPs) that are virtual interfaces associated with the HCA hardware to send and receive data (Figure 7). When an application is invoked, the application creates one or more QPs and registers memory where data can be written to and from, including read/write permissions, with the HCA hardware. An application may have a single or multiple QPs, depending on the application. The application also creates a completion queue to indicate whether an operation was successfully completed. A completion queue may be associated with a single or multiple QPs, or completion queues may be created for send and receive operations.
When one application needs to communicate with another application, the initiating application posts a work request to the QP that specifies the operation (either write or read) that it requires to transfer data. If the data resides in local memory, the request will include a local key (L_key). If the memory location is remote, a remote key (R_key) must be included. The HCA hardware takes the work request at the head of the queue and reads or writes the data as requested within the work queue entry. If data is to be sent, the InfiniBand HCA fetches the data from local memory, fragments the data into packets as required, and transmits it to the target by appending transport headers to the data and sending the packets within a VLane appropriate to the specified service class.
Figure 7. Queue Pairs
If the sending application is using RDMA, the receiving HCA copies the data directly to memory using the R_key to verify that the sending application is authorized to write data to the memory location identified within the RDMA header. If the sending application is not using RDMA, the HCA takes a receive-work queue entry from the receive queue and writes the received data into the memory space identified within the work queue entry. Once the data is written, the HCA places a completion queue entry in the completion queue to inform the receiving application that data has been received and where in memory it resides. The receiving HCA then acknowledges the receipt of the data to the sending application. Upon receipt of the acknowledgment, the sending HCA hardware places a work-complete entry for that work request in the completion queue to inform the requesting application that the operation is complete.
Because the InfiniBand protocol is responsible for managing all data transport between the nodes, data is copied directly from or to application memory without CPU or kernel intervention at either the sending or receiving computer, which frees the CPU to perform other tasks.
QPs are interconnected using a connection that is used to transfer data. InfiniBand supports a number of different connection types that have different characteristics depending upon the application's communication requirements. Because InfiniBand is responsible for data transmission that may not have an upper layer protocol to detect or recover from lost or corrupted frames, it supports mechanisms that provide reliable, in-order delivery of data. The InfiniBand transport layer is also responsible for fragmentation and reassembly of data and supports error checking and recovery that allows the HCA to detect errors and request the retransmission of corrupt frames in hardware independently of the CPU.
InfiniBand supports two different cyclic redundancy checks (CRC) to detect errors: a variant CRC and an invariant CRC. The invariant CRC (ICRC) is used to detect errors within the fields that are not subject to change as data traverses the InfiniBand network and provides end-to-end data integrity protection. The variant CRC (VCRC) is used to detect errors that occur within those parts of the packet that may change as it traverses the InfiniBand network and provides hop-by-hop data integrity checking. The use of ICRC and VCRC provides a mechanism whereby even if bit errors are caused by an intermediate switch or router, the HCA will be able to detect the error and re-request the data.
The attributes of the InfiniBand channel types are detailed below, with brief descriptions of the transport attributes and behavior, as well as the application characteristics that can map to specific services.
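The receive-side behavior of queue pairs can be illustrated with a small software model. This is a toy sketch, not the verbs API: the `ToyHca` class and its method names are hypothetical, memory regions are plain byte arrays, and posting a completion only on the send path (not for RDMA writes) is a modeling choice of this example.

```python
from collections import deque


class ToyHca:
    """A software stand-in for the receive side of an HCA."""

    def __init__(self):
        self.regions = {}                # R_key -> registered memory region
        self.receive_queue = deque()     # posted receive work queue entries
        self.completion_queue = deque()  # completions visible to the application

    def register_memory(self, r_key: int, size: int) -> None:
        self.regions[r_key] = bytearray(size)

    def post_receive(self, buffer_id: str) -> None:
        self.receive_queue.append(buffer_id)

    def rdma_write(self, r_key: int, offset: int, data: bytes) -> bool:
        """Write directly into registered memory if the R_key authorizes it."""
        region = self.regions.get(r_key)
        if region is None or offset + len(data) > len(region):
            return False
        region[offset:offset + len(data)] = data
        return True

    def deliver_send(self, data: bytes) -> bool:
        """Consume a receive work queue entry and post a completion."""
        if not self.receive_queue:
            return False
        buffer_id = self.receive_queue.popleft()
        self.completion_queue.append((buffer_id, bytes(data)))
        return True


hca = ToyHca()
hca.register_memory(r_key=0xABCD, size=16)
authorized = hca.rdma_write(0xABCD, 0, b"hello")
rejected = hca.rdma_write(0x1234, 0, b"hello")   # unknown R_key
unposted = hca.deliver_send(b"msg")              # no receive entry posted yet
hca.post_receive("buf-1")
delivered = hca.deliver_send(b"msg")
print(authorized, rejected, unposted, delivered, list(hca.completion_queue))
```

None of these steps involves the host CPU in the real hardware; the application only posts work requests and later polls the completion queue.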
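The division of labor between the two checks can be demonstrated with a small model. This sketch makes several assumptions: `zlib.crc32` stands in for both CRC algorithms (with the variant check truncated to 16 bits), a packet is a plain dictionary, and all function names are hypothetical. It shows an error introduced inside a switch passing the hop-by-hop check but failing the end-to-end check.

```python
import zlib


def icrc(invariant: bytes) -> int:
    """End-to-end check over the fields that never change in flight."""
    return zlib.crc32(invariant)


def vcrc(variant: bytes, invariant: bytes, icrc_value: int) -> int:
    """Hop-by-hop check over the whole packet, regenerated at each hop."""
    return zlib.crc32(variant + invariant + icrc_value.to_bytes(4, "big")) & 0xFFFF


def build_packet(variant: bytes, invariant: bytes) -> dict:
    check = icrc(invariant)
    return {"variant": variant, "invariant": invariant,
            "icrc": check, "vcrc": vcrc(variant, invariant, check)}


def vcrc_ok(packet: dict) -> bool:
    return vcrc(packet["variant"], packet["invariant"], packet["icrc"]) == packet["vcrc"]


def icrc_ok(packet: dict) -> bool:
    return icrc(packet["invariant"]) == packet["icrc"]


def forward(packet: dict, new_variant: bytes, corrupt_inside_switch: bool = False):
    """Check the VCRC, rewrite the variant fields, and regenerate the VCRC."""
    if not vcrc_ok(packet):
        return None
    invariant = packet["invariant"]
    if corrupt_inside_switch:  # simulate a bit flip after the ingress check
        invariant = bytes([invariant[0] ^ 0x01]) + invariant[1:]
    return {"variant": new_variant, "invariant": invariant, "icrc": packet["icrc"],
            "vcrc": vcrc(new_variant, invariant, packet["icrc"])}


sent = build_packet(b"\x00\x12", b"payload")
clean = forward(sent, b"\x00\x34")
damaged = forward(sent, b"\x00\x34", corrupt_inside_switch=True)
print(vcrc_ok(clean), icrc_ok(clean), vcrc_ok(damaged), icrc_ok(damaged))
```

Because the switch never regenerates the invariant check, the receiving HCA can still catch corruption that the per-hop check alone would have hidden.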
Table 3. InfiniBand Transport Types and Characteristics
| Connection Type | Description | Message Size (Max) |
| --- | --- | --- |
| Reliable Connection | Acknowledged, connection oriented | 2 GB |
| Reliable Datagram | Acknowledged, multiplexed | 2 GB |
| Unreliable Connection | Unacknowledged, connection oriented | 2 GB |
| Unreliable Datagram | Unacknowledged, multiplexed | 256 B-4 KB |
| Raw Datagram | Unacknowledged, connectionless | 256 B-4 KB |
Note: Raw Datagram is not an InfiniBand transport type and is used for "legacy protocol" operation. The operation of Raw Datagram is similar to the unreliable datagram (UD) transport and uses a special QP.
In practice, most applications leverage the reliable connection (RC) and unreliable datagram (UD) transport that are described below.
Reliable Connection Attributes
The InfiniBand RC service sends or receives messages between two QPs only and provides "once, and once only" delivery semantics where the RC transport expects each packet to be explicitly acknowledged by the receiving QP. The RC service provides ordered delivery using sequence numbers to detect whether a packet has been dropped and allows the RC service to request lost data. The reliability characteristics of the RC service enable it to support all InfiniBand operation types, including send, RDMA-read, RDMA-write, and atomic operations. Applications that benefit from RC service are those that require reliable communications between two nodes and cannot tolerate packet loss, data corruption, or out-of-sequence delivery.
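The use of sequence numbers to detect loss can be sketched with a toy receiver. The `RcReceiver` class, the `("ACK", n)` and `("NAK", n)` tuples, and the choice to re-acknowledge duplicates are assumptions of this example; sequence number wraparound is ignored for brevity.

```python
class RcReceiver:
    """Toy receiver that delivers each packet once, in order."""

    def __init__(self, first_psn: int = 0):
        self.expected_psn = first_psn
        self.delivered = []

    def receive(self, psn: int, payload: str):
        if psn == self.expected_psn:
            self.delivered.append(payload)
            self.expected_psn += 1
            return ("ACK", psn)
        if psn < self.expected_psn:
            return ("ACK", psn)            # duplicate: acknowledge, do not redeliver
        return ("NAK", self.expected_psn)  # gap: ask for the missing packet


receiver = RcReceiver()
replies = [
    receiver.receive(0, "a"),
    receiver.receive(2, "c"),  # packet 1 was lost, so this is out of sequence
    receiver.receive(1, "b"),  # retransmission of the missing packet
    receiver.receive(1, "b"),  # duplicate of an already delivered packet
    receiver.receive(2, "c"),
]
print(replies, receiver.delivered)
```

The delivered list never contains a gap, a duplicate, or a reordering, which is the "once, and once only" behavior described above.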
Unreliable Datagram Attributes
The UD service is a connectionless, unreliable transport method that may send to, and receive from, any other UD QP that shares a specific Q_key. The job of the UD service Q_key is to prevent spurious data access by validating that the sending QP is authorized to access the queue. This validation is performed by checking that the received Q_key matches the configured Q_key. If the Q_key does not match, the packet is dropped. With UD service, the receiving QP does not acknowledge receipt of the packets, nor does it retry lost or corrupted packets. Furthermore, the UD service cannot detect out-of-sequence or duplicate packets. In light of these characteristics, the UD service supports the InfiniBand send operation only.
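For contrast with the reliable service, the Q_key check and the lack of duplicate detection can be shown in a few lines. The `UdQueuePair` class and its method are hypothetical names used only for this illustration.

```python
class UdQueuePair:
    """Toy UD receive side: validates the Q_key and nothing else."""

    def __init__(self, q_key: int):
        self.q_key = q_key
        self.received = []

    def receive(self, packet_q_key: int, payload: str) -> bool:
        if packet_q_key != self.q_key:
            return False  # silently dropped; the sender is not informed
        self.received.append(payload)  # no sequence tracking, no acknowledgment
        return True


qp = UdQueuePair(q_key=0x1EE7)
accepted = qp.receive(0x1EE7, "request")
duplicate = qp.receive(0x1EE7, "request")  # delivered again: UD cannot tell
dropped = qp.receive(0xBAD, "request")
print(accepted, duplicate, dropped, qp.received)
```

Any recovery from a lost or repeated datagram is therefore left to the application that uses the UD service.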
Figure 8. IP over InfiniBand LRH and Unreliable Datagram Format
UD transport is used for transport of IP over InfiniBand and for applications that may receive asynchronous requests from a number of remote processes and where lost or corrupted data can be resent after a timeout period by the initiating process.
INFINIBAND SUMMARY
InfiniBand is an interconnect technology that provides high throughput and low-latency transport for efficient data transfer between server memory and I/O devices, without CPU intervention. By leveraging techniques such as RDMA, InfiniBand increases computer CPU efficiency by enabling more resources to be dedicated to processing other tasks. Although I/O is not an issue for most computers, I/O is problematic for high-end server platforms and computationally intensive applications such as those found in high performance computing (HPC) environments.
INFINIBAND IN HIGH PERFORMANCE COMPUTING
High-performance computing (HPC) using parallel applications running on supercomputers has been used for many years to solve large and complex computations such as modeling aerodynamics, simulation of nuclear reactions, or weather prediction. To provide the performance required for these compute-intensive applications, many supercomputers have used the concept of massively parallel processing (MPP). With the advent of faster CPUs and network interconnects, computer scientists realized that the techniques used for MPP could also be applied to industry-standard servers, using software to enable message passing between nodes and to perform the same parallel processing operations. This approach has come to be known as "cluster computing."
HPC and Server Input/Output
HPC clusters use much the same MPP principles first developed within supercomputers, where an individual compute node (or processor) within the HPC cluster calculates the result of a small subset of data, exchanges the result with other processors in the form of messages, and recalculates using the exchanged results iteratively until the data set is reduced to a final result. The exchange of messages between processors can incur significant CPU I/O overhead resulting from data movement, protocol stack processing, and the CPU interrupts that are generated to move data between interface cards and memory spaces.
As a result, although the CPUs were more than capable of performing the required calculations, the CPU utilization associated with data movement, as well as its associated software stack processing, was reducing the CPU efficiency of the cluster nodes dramatically. This inefficiency, caused by the imbalance between the system bus performance, data movement, and CPU stack processing within the server, meant that the full potential of the CPU could not be realized.
InfiniBand provides an ideal solution for HPC cluster communications as it enables data movement to be offloaded from the CPU to InfiniBand hardware, which enables more CPU time to be devoted to application processing. This, and the high-bandwidth and low latency network characteristics of InfiniBand enable very large and CPU efficient HPC clusters to be built using industry standard computers to solve complex computationally intensive problems.
InfiniBand and Upper-Layer Protocols
For applications to fully leverage the RDMA capabilities offered by the InfiniBand HCA, the upper-layer protocols (ULPs) need to be able to interact with the InfiniBand hardware. Most applications are generally unaware of the underlying hardware and use the sockets API to make socket calls down to the transport layer. The socket call is intercepted within the operating system kernel, and the appropriate protocol stack, typically TCP/IP, is invoked to enable data exchange between devices. The layer of abstraction for the application is the sockets API; therefore, the application is not aware of the InfiniBand HCA's RDMA capabilities that could be used to offload data movement.
Although most applications cannot directly address the InfiniBand hardware, these applications need not be rewritten to take advantage of InfiniBand functions. Because most enterprise applications are sockets based, SDP gives these applications access to InfiniBand hardware features without requiring adaptations to the application. For applications that are not sockets-based, IP-over-InfiniBand provides a mechanism that enables the transport of IP over an InfiniBand fabric, albeit without the ability to leverage InfiniBand RDMA hardware.
Although there are several protocols that can be used over InfiniBand, some are more efficient in leveraging InfiniBand than others. Protocols such as User Direct Access Programming Library (uDAPL) and the Message Passing Interface (MPI) can leverage InfiniBand's transport services. They provide lightweight access to the InfiniBand hardware, which enables these protocols to achieve very low CPU overhead and low stack latencies. A brief overview of each protocol is provided below.
IP-over-InfiniBand
IP over InfiniBand (IPoIB) allows TCP/IP and UDP/IP applications to run over the InfiniBand transport and enables IP communications between InfiniBand-attached servers or other IP devices. IPoIB also enables standard, sockets-based IP applications to be accessed on InfiniBand-attached servers when used in conjunction with Ethernet-to-InfiniBand gateways that are using the UD transport service. An important point to remember is that InfiniBand does not have native broadcast support, so IPoIB must provide a mechanism to support broadcast-dependent protocols such as the Address Resolution Protocol (ARP) and the Dynamic Host Configuration Protocol (DHCP).
To provide broadcast capabilities, the IPoIB service uses a multicast UD transport to distribute broadcast packets, such as DHCP leases and ARP, to all members of the IP subnet. This multicast is achieved by explicitly configuring a partition with which the IP subnet is associated.
The InfiniBand ARP process differs slightly from ARP over Ethernet, in that the ARP response for the target IP address returns the target device's GID, which is then cached along with the target device's IP address. The host then follows the standard GID-to-LID lookup, using the subnet manager to find the LID of the target device. This operation enables the IP application to work transparently across switched and routed InfiniBand networks, because the address resolution process resolves both local and global addresses.
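The two-step resolution described above (IP address to GID through ARP, then GID to LID through the subnet manager) can be sketched as a toy model. The class names, the dictionary-based lookups, and the example addresses are all hypothetical; a real implementation exchanges ARP packets over the multicast group and queries the subnet manager over the fabric.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SubnetManager:
    """Toy stand-in for the subnet manager's GID-to-LID lookup."""
    gid_to_lid: Dict[str, int]

    def resolve_lid(self, gid: str) -> int:
        return self.gid_to_lid[gid]


@dataclass
class IPoIBHost:
    """Toy host that caches ARP results as IP address -> GID."""
    subnet_manager: SubnetManager
    # What an ARP reply would return for each IP (hypothetical data).
    arp_replies: Dict[str, str]
    arp_cache: Dict[str, str] = field(default_factory=dict)

    def resolve(self, ip: str) -> Tuple[str, int]:
        # Step 1: ARP returns the target's GID, which is cached with the IP.
        if ip not in self.arp_cache:
            self.arp_cache[ip] = self.arp_replies[ip]
        gid = self.arp_cache[ip]
        # Step 2: the standard GID-to-LID lookup via the subnet manager.
        return gid, self.subnet_manager.resolve_lid(gid)


# Example values are made up for illustration.
sm = SubnetManager(gid_to_lid={"fe80::2:c903:0:1": 7})
host = IPoIBHost(subnet_manager=sm,
                 arp_replies={"10.0.0.2": "fe80::2:c903:0:1"})
gid, lid = host.resolve("10.0.0.2")
```

After the first call, the IP-to-GID mapping is served from the cache, while the LID is still obtained through the subnet manager lookup.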
Figure 9. Protocol Stacks and Libraries
Much as in a standard Ethernet environment, after the IP packet has been received by the InfiniBand hardware, it is encapsulated in a standard UD InfiniBand packet by the HCA and sent off to its destination. If that destination is an Ethernet-attached host being accessed across a gateway, then the gateway strips off the UD packet headers and encapsulates the IP packet in an Ethernet packet for further delivery.
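The encapsulation and gateway behavior described above can be modeled in a few lines. This is a simplified sketch: the class and function names are hypothetical, each header is reduced to a single addressing field, and the MAC table is an assumed lookup that a real gateway would populate through its own address resolution.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class IPPacket:
    src_ip: str
    dst_ip: str
    data: bytes


@dataclass(frozen=True)
class UDPacket:
    """IP packet wrapped in a (greatly simplified) UD InfiniBand packet."""
    dest_lid: int
    payload: IPPacket


@dataclass(frozen=True)
class EthernetFrame:
    """IP packet wrapped in a (greatly simplified) Ethernet frame."""
    dest_mac: str
    payload: IPPacket


def hca_encapsulate(packet: IPPacket, dest_lid: int) -> UDPacket:
    # The HCA wraps the IP packet in a UD packet addressed by LID.
    return UDPacket(dest_lid=dest_lid, payload=packet)


def gateway_forward(ud: UDPacket, mac_table: Dict[str, str]) -> EthernetFrame:
    # The gateway strips the UD headers and re-encapsulates the untouched
    # IP packet in an Ethernet frame for the Ethernet-attached host.
    ip_packet = ud.payload
    return EthernetFrame(dest_mac=mac_table[ip_packet.dst_ip],
                         payload=ip_packet)


# Example values are made up for illustration.
ip_packet = IPPacket("10.0.0.1", "192.168.1.5", b"payload")
frame = gateway_forward(hca_encapsulate(ip_packet, dest_lid=3),
                        {"192.168.1.5": "00:11:22:33:44:55"})
```

The IP packet itself passes through the gateway unchanged; only the link-level wrapper differs on each side.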
Figure 10. IP-over-InfiniBand Internetworking
It should be noted that IPoIB does not leverage the inherent hardware capabilities of the InfiniBand HCA, namely RDMA offload, and the CPU is responsible for TCP stack processing and movement of data. When the receiving HCA receives the encapsulated packet, the HCA strips the InfiniBand headers from the IP packet and invokes the I/O resource manager (I/ORM). The I/ORM then interrupts the CPU and the received data is copied to I/O memory, whereupon the CPU is responsible for all associated TCP stack processing and session management (including ACK, FIN, window sizing, etc.).
Sockets Direct Protocol
Another approach that allows TCP/IP4 applications to leverage InfiniBand RDMA capabilities is SDP. Given that many modern applications are written using the sockets API, SDP can intercept the socket calls at the kernel level and map them to an InfiniBand RC transport service that uses RDMA operations to offload data movement from the CPU to the HCA hardware. Because the RC transport provides reliable, in-order delivery of data, TCP processing is not required.
Applications may leverage SDP by either natively calling the SDP protocol, or by using socket intercept within the operating system kernel. Although SDP does not use TCP/IP for communications between InfiniBand-attached hosts, TCP and IP parameters may be required to determine which sockets are intercepted by SDP. When SDP intercepts a socket call, it encodes the TCP port as well as source and destination IP addresses within the SDP connection setup message to identify the application at the target host. It should be noted that as packets are transferred between the two devices, even though the application may be calling a TCP/IP socket, TCP/IP headers are not used.
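The intercept decision and connection setup described above can be sketched as follows. The rule format, class names, and message fields are hypothetical and do not represent actual SDP configuration syntax or the SDP wire format. The sketch only shows that TCP/IP parameters select which sockets are intercepted and are then carried in the setup message to identify the application at the target.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InterceptRule:
    """Hypothetical rule: intercept sockets to this network and port."""
    network: ipaddress.IPv4Network
    port: int


@dataclass(frozen=True)
class SetupMessage:
    """Hypothetical setup message carrying the socket's identity."""
    src_ip: str
    dst_ip: str
    tcp_port: int


def intercept(rules: List[InterceptRule], src_ip: str, dst_ip: str,
              port: int) -> Optional[SetupMessage]:
    """Return a setup message if the socket matches a rule, else None.

    None means the call is left to the regular TCP/IP stack.
    """
    dst = ipaddress.ip_address(dst_ip)
    for rule in rules:
        if dst in rule.network and port == rule.port:
            return SetupMessage(src_ip, dst_ip, port)
    return None


# Example values are made up for illustration.
rules = [InterceptRule(ipaddress.ip_network("10.0.0.0/24"), 5432)]
matched = intercept(rules, "10.0.0.1", "10.0.0.9", 5432)
unmatched = intercept(rules, "10.0.0.1", "192.168.1.9", 5432)
```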
SDP supports three modes of operation (Bcopy, Zcopy, and transaction mode), depending on whether the data transfer is short- or long-lived, or transactional in nature. Using Bcopy, data is copied into local SDP buffers that are then transferred to the target SDP buffers, where they are copied to the target application buffers. Bcopy supports the concept of a sliding window and is useful for situations where small amounts of data are transferred, or where the application requires buffering. Note that SDP Bcopy mode does not require any application rewrites, but it cannot utilize the full InfiniBand RDMA capability. It does, however, leverage the RC transport service, which enables data to be transferred without incurring TCP processing overhead. SDP Zcopy uses RDMA read and write semantics to transfer data and is ideally suited to larger data transfers. Zcopy requires some application rework to provide the dedicated application-space memory that the RDMA engine requires.
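The difference between the Bcopy and Zcopy data paths can be sketched by counting the CPU copies each one implies. This is a single-process toy model with hypothetical function names. The SDP buffer size is an arbitrary assumption, and the slice assignment in `zcopy_send` merely stands in for an RDMA write that the HCA would perform without CPU involvement.

```python
from typing import Tuple


def bcopy_send(app_data: bytes, sdp_buffer_size: int) -> Tuple[bytes, int]:
    """Model Bcopy: return (data received by target app, CPU copies)."""
    target_app = bytearray()
    cpu_copies = 0
    for offset in range(0, len(app_data), sdp_buffer_size):
        # Copy 1: application buffer -> local SDP buffer.
        local_sdp = bytes(app_data[offset:offset + sdp_buffer_size])
        cpu_copies += 1
        # Transfer: local SDP buffer -> target SDP buffer (done by the HCA).
        target_sdp = local_sdp
        # Copy 2: target SDP buffer -> target application buffer.
        target_app += target_sdp
        cpu_copies += 1
    return bytes(target_app), cpu_copies


def zcopy_send(app_data: bytes, registered_target: bytearray) -> int:
    """Model Zcopy: place data in pre-registered target memory.

    Returns the number of CPU copies, which is zero in this model because
    the placement below represents work done by the RDMA engine.
    """
    view = memoryview(registered_target)
    view[:len(app_data)] = app_data  # stands in for an RDMA write
    return 0


data = b"abcdefghij"
received, bcopy_copies = bcopy_send(data, sdp_buffer_size=4)

# Zcopy needs dedicated application memory set aside in advance.
target_memory = bytearray(len(data))
zcopy_copies = zcopy_send(data, target_memory)
```

With 10 bytes and a 4-byte SDP buffer, the Bcopy model performs two CPU copies for each of three chunks. The Zcopy model requires the target buffer to exist before the transfer, which mirrors the application rework noted above.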
For transactional applications, such as database applications, the source typically transmits a small command message and expects to receive a larger response. SDP transaction mode optimizes the transactional application request/response process by piggybacking buffer availability with the request messages. This streamlining enables the source to identify to the target what information is required and to which memory location to post the results, thus optimizing the transaction by reducing communications.
User Direct Access Programming Library
The User Direct Access Programming Library (uDAPL) protocol enables an application to bypass the standard TCP/IP protocol stack and use the native transport to communicate between hosts in a cluster of servers and workstations on the fabric. uDAPL and its kernel-level equivalent, kDAPL, were designed to be transport-agnostic, and therefore can work on any RDMA-capable fabric as long as a Direct Access Transport (DAT) library exists to support it. This characteristic also enables applications to take advantage of the underlying transport service provided by InfiniBand to permit direct I/O between user-mode processes.
The primary role of uDAPL is to provide transport-independent connection management and transport-independent, low-latency, zero-copy data transfer and completion across InfiniBand. Although uDAPL is a published specification, it is most commonly deployed within high-performance database clusters.
Message Passing Interface
The Message Passing Interface (MPI) is the most widely deployed protocol within HPC environments today, due primarily to the rich library of functions that can be used to construct parallel applications. MPI provides native access to the InfiniBand protocol stack that enables applications to leverage InfiniBand RDMA for efficient interprocess communications.
MPI is a message-passing protocol that enables messages containing different instructions to be passed between nodes within a parallel compute environment. MPI supports the notion of sending data to all processes, or to subgroups of processes, through the use of communicator groups and collectives. All communication between processes relies upon point-to-point communications between nodes. However, MPI defines a number of different communications models, such as scatter, gather, and broadcast, and exposes these abstract patterns to the programmer as collectives, which obviate the need to construct complex communications patterns. Collectives are then used to distribute or scatter data sets for calculation and then gather the results so that they can be combined to provide the final result. Due to the rich functionality offered by the MPI library, MPI has become the de facto standard for developing parallel compute applications.
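The way collectives are layered on point-to-point communication can be sketched with a single-process toy model. The `ToyComm` class and its methods are hypothetical and are not the MPI API. A real MPI program runs one process per rank and would call the library's own collectives, such as MPI_Scatter and MPI_Gather.

```python
from collections import deque
from typing import Any, Dict, List, Tuple


class ToyComm:
    """Toy communicator: ranks exchange messages through mailboxes."""

    def __init__(self, size: int) -> None:
        self.size = size
        # One FIFO mailbox per (source rank, destination rank) pair.
        self.mailboxes: Dict[Tuple[int, int], deque] = {
            (s, d): deque() for s in range(size) for d in range(size)
        }

    def send(self, src: int, dst: int, msg: Any) -> None:
        self.mailboxes[(src, dst)].append(msg)

    def recv(self, src: int, dst: int) -> Any:
        return self.mailboxes[(src, dst)].popleft()

    def scatter(self, root: int, chunks: List[Any]) -> List[Any]:
        # Root sends one chunk to each rank using point-to-point sends.
        for rank, chunk in enumerate(chunks):
            self.send(root, rank, chunk)
        return [self.recv(root, rank) for rank in range(self.size)]

    def gather(self, root: int, values: List[Any]) -> List[Any]:
        # Each rank sends its value to root, which collects them in order.
        for rank, value in enumerate(values):
            self.send(rank, root, value)
        return [self.recv(rank, root) for rank in range(self.size)]


# Scatter a data set, compute a partial result per rank, gather, combine.
comm = ToyComm(size=4)
data = list(range(8))
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]
per_rank = comm.scatter(root=0, chunks=chunks)
partials = [sum(x * x for x in chunk) for chunk in per_rank]
total = sum(comm.gather(root=0, values=partials))
```

The programmer calls only `scatter` and `gather`; the individual sends and receives stay hidden inside the collective, which is the convenience the text attributes to MPI.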
INFINIBAND SUMMARY
InfiniBand is an interconnect technology that provides high-throughput, low-latency transport for efficient data transfer between server memory and I/O devices, without CPU intervention. These characteristics, plus the use of efficient software libraries and protocols such as MPI, enable clusters of commodity servers to cooperate in executing large, complex calculations that are the basis of many high-performance applications. SDP enables legacy applications to leverage the facilities of the InfiniBand hardware, improving the efficiency and performance of the system and applications.
The development of InfiniBand, which leverages RDMA for efficient data transport, and of cluster software has moved high-performance computing from the realm of expensive supercomputers available to only a few enterprises to an economically viable proposition for many enterprises.
Cisco Systems, the world leader in networking, provides industry-leading InfiniBand technologies that power many of the world's TOP500 supercomputers. The Cisco® InfiniBand portfolio includes the Server Fabric Switching 7000 series InfiniBand switches, Server Fabric Switching 3000 Ethernet and Fibre Channel gateways, InfiniBand Subnet Manager, and InfiniBand PCI 4X HCAs. Cisco is also active in the development and promotion of InfiniBand through the OpenIB consortia and development of OpenMPI software libraries.
1. Although most Gigabit Ethernet NICs support the interrupt coalescing feature and not all inbound packets cause interrupts to be issued, significant CPU overhead can still be incurred.
2. The concept of Keys is used extensively within InfiniBand as it provides a simple and effective mechanism to control access to resources within the InfiniBand network.
3. Connection is a slight misnomer, because the service type may be connectionless.