
Cisco SFS 7000 Series InfiniBand Server Switches

HPC Networking: The Foundation for Networked Supercomputers

White Paper

The term "cluster" is used to describe computer systems where two or more computers collaborate to increase application performance or availability. High-performance computing (HPC) is a relatively new class of cluster that is used to support a wide variety of performance-intensive parallel applications that are widely recognized as having the most demanding performance characteristics. This document discusses the different classes of clusters that are available, and takes a closer look at HPC clusters in terms of network connectivity and the different considerations that are necessary to evaluate network technologies when building HPC clusters.

INTRODUCTION

Many enterprises now use high performance compute (HPC) clusters to run commercial HPC applications to provide faster "time to enlightenment" that can affect an enterprise's profitability and competitiveness. These applications offer significant advantages, especially when the amount of time saved to generate a result is considered, as this may reduce the risk of development, or enable investments to be better aligned and spent, or bring products to market faster.
Traditionally, parallel applications have run on monolithic supercomputers that have been prohibitively expensive for many companies to acquire and operate. A recent development that uses much the same principles as traditional supercomputers is the HPC cluster. HPC clusters are made up of multiple (sometimes many thousands of) industry-standard computers that use cluster software and high-performance network interconnects to run parallel applications at a fraction of the cost of traditional supercomputers.
A key element of HPC clusters is the network. At the heart of parallel computing is the ability to exchange messages with other nodes within the cluster, referred to as interprocess communications (IPC), which requires a high-performance network. However, other communications within the HPC cluster, such as how files are accessed and managed, are also required and are often overlooked. Additionally, HPC applications differ in how frequently and how much data they exchange during execution, and understanding these requirements is critical when choosing a particular HPC solution.

Clusters, HPC Clusters, GRIDS, and Supercomputers Taxonomy

The term cluster is used to describe a multitude of different strategies for improving the performance or availability of an application. Within HPC, GRIDs and clusters are often used interchangeably to describe multiple computers providing a specific function even though they are not the same thing. The following terminology is used throughout this document to provide a common baseline and also provide definitions for terminology that is often used within the context of HPC and supercomputing.

Supercomputers, Federations, and Constellations

The term supercomputer can be applied to several forms of computers that, from an external perspective, run a single application, fast. A supercomputer can be a single, highly tuned symmetric multiprocessor (SMP) server; it can be a server that consists of a number of processors, often referred to as a massively parallel processor (MPP); or it can consist of a number of interconnected supercomputers commonly referred to as a federation or constellation. A relatively recent development is the use of HPC clusters of industry-standard computers interconnected using a high-performance network to provide scalable supercomputer performance. Examples of supercomputers can be found at the Supercomputer Top 500 Website http://www.top500.org/.

Figure 1. Supercomputer to Networked Supercomputer

Clusters

HPC clusters are an evolution of the federation concept, except that the processor nodes are industry-standard servers interconnected using a high-performance network and standards-based message passing and transport protocols. Because of their price/performance and scalability, clusters are the fastest growing segment of the supercomputer market.
The term cluster, loosely defined, generally refers to a collection of multiple compute resources under the control of a single administrative authority. Although the cluster is made up of multiple compute resources, from an external perspective, the cluster appears to be a single system. Clusters are typically geographically bound due to latency or administrative reasons.
The term cluster is used to describe many different types of collaborative operations between servers. In this respect, clusters can be generically broken down into the four (4) types described below:

High-Availability Cluster-An HA cluster provides non-stop, or as near non-stop as is possible, application availability. There are several strategies that may be employed to achieve HA, but in general one server is the master node and one or more servers provide back-up services if the master node fails. This type of cluster typically uses interserver heartbeats that periodically exchange node health information. HA clusters are not inherently CPU or network-bandwidth intensive, and are not latency sensitive.

Load-Balanced Cluster-A load-balanced cluster may be used to scale the performance of an application by dividing user sessions among servers within the server farm. In the event that the resources in the server farm become exhausted, additional capacity can be added without the users being aware of the changes. Again, there are various strategies that can be employed; for example, the Cisco® Content Services Module may be used to load-balance sessions across multiple servers. It should be noted that load-balanced clusters can also provide high availability if application-level persistence (for example, cookies or Java session persistence) is used. Load-balanced clusters are not inherently CPU or network-bandwidth intensive, as adding more servers can address capacity problems, and are not latency sensitive.

High Performance Computing Cluster-An HPC cluster refers to two or more computers that are used to solve a single problem. HPC clusters consist of a single master node and multiple slave nodes that are interconnected using a high-performance network. Depending upon the application, HPC clusters can be divided into three sub-categories:

– Parametric execution-Parametric execution is used for applications that cannot be parallelized, and consequently the application runs on a single compute node. What parametric execution leverages is that, although the application cannot share information between nodes, if different input data is sent to different nodes running the same application, the nodes can compute their information in parallel, thereby speeding up the operation. As parametric execution does not exchange messages between nodes, a parametric execution HPC cluster is latency insensitive, but may be bandwidth intensive depending upon the application.

– Loosely coupled applications-Loosely coupled applications require minimal interaction with other cluster nodes as they can perform the entire computation autonomously. Loosely coupled clusters are generally latency insensitive and CPU intensive, and require moderate to high bandwidth network interconnects. Other commonly encountered classifications are massively parallel, embarrassingly parallel, and nearly embarrassingly parallel. These applications share much the same characteristics as loosely coupled applications, although there are variations in sensitivity to latency and in bandwidth consumed.

– Tightly coupled applications-Tightly coupled applications require information to be periodically exchanged with other nodes (either all nodes or subsets of nodes) in the form of messages. Tightly coupled applications are generally latency sensitive and CPU intensive, and generate bursty, unpredictable traffic patterns as data is exchanged. Because time spent communicating is time spent not processing data, tightly coupled applications require low latency and high bandwidth network interconnects.

Transactional applications-Transactional applications such as decision support systems typically refer to database clusters. Clustering improves transactional performance by dividing database queries among several database servers. The database servers use some form of middleware-for example Oracle RAC-to distribute the query to perform the database lookup. It should be noted that transactional clusters are neither HPC nor HA/load-balancing clusters, even though most implementations have inherent load-balancing and HA capabilities. Transactional clusters are not inherently CPU or bandwidth intensive and adding more servers can address capacity problems; however, low latency and high bandwidth are required to maintain database record consistency (cache synchronization, etc) and to accommodate the size of database response, respectively.
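As an illustration of the parametric execution model described above, the following minimal Python sketch runs the same serial function against different input values in parallel, with no messages exchanged between runs. The names simulate and parametric_sweep are hypothetical, and worker threads on a single machine stand in for cluster nodes; a real cluster would dispatch the runs through its job scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(parameter):
    """Stand-in for a serial application run with one input value (hypothetical)."""
    return sum(i * parameter for i in range(1000))


def parametric_sweep(parameters, workers=4):
    """Run the same function on different inputs; the runs never exchange messages."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(simulate, parameters))


results = parametric_sweep([1, 2, 3, 4])
print(results)  # [499500, 999000, 1498500, 1998000]
```

Because each run is independent, adding workers shortens the sweep without placing any latency demands on the interconnect, which is why this class of cluster is latency insensitive.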

GRID

The concept of GRID computing has been interpreted in many ways. A GRID generally refers to a pool of heterogeneous compute resources that may be under the control of several administrative authorities. A GRID may be made up of several pools of resources-storage, compute, etc.-and may be formed of multiple geographically distributed clusters. From an external perspective, GRID middleware allows resources within the GRID to be scheduled to perform compute tasks.
Many enterprises are looking towards GRID computing to enable sharing of information, such as part of an extended supply chain management agreement, or for research and development purposes. In this context GRID is often associated with utility computing, where compute resources can be scaled up or down according to demand, and where compute power can be purchased on a $-per-CPU, per-hour basis. GRIDs are also associated with scavenger-type applications such as Folding@home and Seti@home, which use spare CPU cycles on distributed PCs to perform computations.

HPC NETWORKING

As stated above, HPC clusters are an evolution of the federation concept in which the processor nodes are industry-standard servers interconnected using a standards-based, high-performance network and communications protocols. A key aspect of HPC performance is the characteristics and performance of the network that provides communications between nodes within the HPC cluster. Although several proprietary interconnects exist, InfiniBand- and Ethernet-based deployments are growing in the Supercomputer Top 500 list (http://www.top500.org).
Within HPC, each node within the cluster needs to be able to communicate with different resources-storage, for example-and to other nodes for control and inter-process communications. Generically, communications within a cluster can be broken down into four operations:

Access network-The access network provides user access to the cluster to allow job scheduling and viewing of graphical data. The access network may also provide connectivity to remote resources such as network attached storage (NAS) or other clusters within the context of a GRID.

Management network-The management network is the cluster's command and control network that enables the master node to schedule, start, checkpoint, and stop work that is executed on the cluster, and also allows the nodes to be monitored for troubleshooting purposes.

Storage or I/O network-In most HPC environments the cluster nodes download data from an external NAS or SAN into their local disk and then perform the necessary calculations before writing the result back to the NAS or SAN. This requires high-speed access between the NAS/SAN systems and the cluster nodes.

Interprocess communications (IPC) network-The IPC network provides high-speed connectivity between cluster nodes such that IPC messages can be exchanged. Because the IPC network characteristics have the most effect on application performance, the IPC network uses high-bandwidth and low latency network technologies.

Figure 2. HPC Connectivity

Typically, smaller HPC clusters collapse all connectivity onto a single Gigabit Ethernet switch. However, larger clusters require careful consideration of the requirements of each network, and many of the Top 500 clusters use different networks for IPC, access, management, and storage. This can, however, have significant ramifications, as larger servers may be required to accommodate the number of interfaces. A discussion of the considerations for selecting the technology that addresses the connectivity requirements for an HPC cluster follows.

Access Network

The access network provides access to and from resources that are external to the HPC cluster nodes. Within HPC, typically only the master or head node communicates with an external user. All other nodes are slave nodes that are controlled and managed by the master node. The programming interface for the master node is typically based upon a remote shell command line environment using Telnet, SSH, or BPROC (Beowulf) to initiate a particular job.
Although the management access protocols (SSH, etc.) only consume a few Kbytes of bandwidth, other connectivity requirements may require higher bandwidth. For GRID applications, where multicluster jobs are executed, depending upon the design of the message passing interface (MPI) collectives, slave nodes may be required to communicate with other nodes in remote clusters. Additionally, if graphical, real-time representation of the data is required, considerable bandwidth may be consumed. As an example, real-time visualization at 25 frames per second on a 1,024 x 1,024 resolution screen will consume about 600Mbps of bandwidth.
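The figure of about 600Mbps can be reproduced with a short calculation. The sketch below assumes uncompressed frames at 24 bits per pixel; the color depth is not stated above and is an assumption made for illustration, as is the function name.

```python
def visualization_bandwidth_mbps(width, height, frames_per_second, bits_per_pixel=24):
    """Uncompressed video bandwidth in megabits per second (1 Mbps = 10**6 bit/s).

    bits_per_pixel=24 is an assumed color depth, not a value from the text.
    """
    bits_per_frame = width * height * bits_per_pixel
    return bits_per_frame * frames_per_second / 1_000_000


rate = visualization_bandwidth_mbps(1024, 1024, 25)
print(f"{rate:.0f} Mbps")  # 629 Mbps
```

At a higher color depth or frame rate the requirement grows proportionally, so a single visualization stream can approach the capacity of a Gigabit Ethernet link.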
The overriding characteristic of the access network is that the resources accessing, or being accessed by, the cluster may be geographically remote and, although bandwidth may be a consideration, low latency typically isn't required. However, as external, potentially untrusted, devices may be accessing the cluster, security, QoS, and availability become considerations as they directly impact the user's experience of the "service". Given these attributes, Gigabit Ethernet and TCP/IP are ideal candidates for access network connectivity as they provide high bandwidth, ubiquitous transport that supports robust services such as QoS, security, and IP multicast.

Management Network

The management network provides communications between the master and slave nodes that enable the master node to determine the operational status of the slave nodes, schedule work for the slaves, and start, checkpoint, and stop jobs as necessary. The management network also provides mechanisms whereby each node can periodically report its health and operational status to the master node using heartbeat messages. Many management tools are available, for example Platform Rocks, Ganglia, Scali's Manage, and IBM's cluster systems management (CSM) and extreme cluster administration toolkit (xCAT).
The overriding characteristics of the management network are that the resources being accessed are generally local to the cluster and neither high bandwidth nor low latency is a consideration, and a degree of over-subscription can be tolerated. Given these attributes, Gigabit Ethernet and IP are ideal candidates for management network connectivity. Additionally, IP multicast may be a requirement if it is used for reporting statistics to a master node*.

Note: It is fairly common practice in small to medium-size HPC clusters to consolidate the access and management networks as bandwidth and latency are rarely an issue, and traffic patterns are such that they do not cause congestion. If problems with congestion are encountered, QoS can easily be implemented to address this problem as it is relatively simple to classify HPC management traffic and user access traffic accordingly. Security may also be a concern if the access and management network are combined. Again, security can be fairly easily provided using access-control lists (ACL) and/or firewalls to control cluster node access.

* Reporting node statistics can reduce cluster efficiency, as the slave node must suspend the active thread to report information to the management station. In many HPC environments, the timers are set to report status and activity at relatively long intervals-30 minutes or more-to reduce CPU overhead associated with context switches and network communications.

Storage Network

The storage network provides access to the data that is to be computed by the master and slave nodes within the HPC cluster. Within HPC, there are several strategies that may be employed with respect to how data is stored and accessed by the HPC nodes. At the lowest level, data may be accessed in one of two ways: at the file level using an external file system, commonly referred to as network-attached storage (NAS), or at the block level using either direct attached storage (DAS), which includes the server physical hard drive, or a storage area network (SAN) using either Fibre Channel or InfiniBand attached storage, with the SCSI or SCSI RDMA protocol (SRP) respectively.
File access is familiar to most people and is relatively simple. A user of an application requests a particular file from a NAS. The NAS is responsible for retrieving files, file locking, and determining where the file is saved to on the physical hard disks. By contrast, block-level storage requires the application to manage where the file is physically stored on the hard disks. Most applications use file access and file managers to manage how the files are saved to hard drive. For example, desktop applications such as Microsoft Word save files using a meaningful name and the Windows operating system NTFS writes the file to the physical disk.
Block-level access is used for DAS and SAN attached storage using SCSI as an access protocol. Block-level access can achieve the higher transaction rates that are typical in decision support system (database) clusters, but can present problems with file security and locking, and with making the file content available to other nodes.
Although Fibre Channel is the most widely deployed storage area networking (SAN) technology, it is not commonly implemented on the HPC cluster slave nodes, as it requires another interface type to be supported that may require larger servers to accommodate the Fibre Channel host bus adapter (HBA). Block access can also be used to access InfiniBand storage using the SCSI RDMA Protocol (SRP). InfiniBand attached storage provides high-speed block access to storage, which offers significant advantages for transactional applications such as database clusters. Another option for block-level access to remote storage is to use iSCSI over IP, although this has not been widely adopted.
By contrast, file-level access enables an application to simply request a file, or sub-set of a file, from the NAS, which then returns the file to the requestor. This is a simpler model for application developers as the application does not need to manage how data is written to the storage disk arrays and simplifies file sharing, locking and secure file access. NAS can also be used as a consolidation point for file storage and can be deployed with different storage redundancy options, such as RAID, to provide data protection and disaster recovery. NAS also provides a central point at which files can be managed to provide archival services in the event that data needs to be retrieved for future analysis. Another advantage is that because NAS protocols are based upon IP, the file system can be placed anywhere within the network.

Table 1. Storage Access Protocols and Technologies

Storage Type         | Block or File Access | User Access Technology | Storage Access
NAS                  | File                 | Ethernet: GE or 10GE   | Fibre Channel or InfiniBand
SAN                  | Block                | N/A                    | Fibre Channel, InfiniBand, or iSCSI
Parallel File System | File                 | Ethernet: GE           | DAS, Fibre Channel, or InfiniBand

However, if a large number of HPC nodes are accessing the NAS, this can present a significant problem in terms of data input/output (I/O) and bandwidth if large volumes of data are required to be served. To enable the I/O throughput of the NAS to be scaled, several strategies may be employed. One solution is to utilize multiple NAS I/O interfaces (or I/O nodes) and load-balancing techniques to improve performance. Another solution is the use of parallel file systems. Using a parallel file system, the file data, which may be in the order of petabytes, is broken down into discrete chunks and distributed across a number of servers called I/O nodes. A meta-data server manages the distribution and location of the chunks, and also manages data redundancy by striping the data across multiple disks. When a file is requested, the meta-data server responds with the location of the file and the requestor then reads the file from the location specified within the response. Parallel file systems provide a relatively low-cost, high throughput file sharing solution for HPC and non-HPC file systems by using multiple standards-based servers as I/O nodes. Although NFS is widely used for NAS access, many HPC clusters utilize parallel file systems such as parallel virtual file system (PVFS), IBM's general parallel file system (GPFS), Lustre, or iBrix to scale storage I/O performance.
Although it is fairly simple to characterize the access and management networks, the storage network presents a large number of choices that, depending upon the application, require different solutions. Consequently, the choice of network technology can only be made in the context of understanding the application requirements. In this respect, InfiniBand, Ethernet, and Fibre Channel may all play a role as part of an HPC storage solution, either as access technologies or as part of a NAS or SAN solution.
The most common HPC storage solution is NAS that may be accessed by the HPC nodes across the management network or using a dedicated Ethernet interface. The NAS in and of itself may implement a SAN using Fibre Channel to manage disk attachment, although this is entirely transparent to the HPC nodes. Fibre Channel is the most widely adopted SAN network fabric due to the maturity of the products and broad industry support. InfiniBand attached storage is a relatively new product offering that delivers high-speed access to disk resources and is well suited to high-performance file servers and database clusters.
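The division of labor between the meta-data server and the I/O nodes can be sketched in a few lines of Python. This is a simplified illustration rather than the design of any of the file systems named above: the class and function names are hypothetical, chunks are placed round-robin, in-memory dictionaries stand in for the I/O nodes, and redundancy is not modeled.

```python
class MetadataServer:
    """Tracks which I/O node holds each chunk; it never stores file data itself."""

    def __init__(self, node_names, chunk_size):
        self.node_names = list(node_names)
        self.chunk_size = chunk_size
        self.layouts = {}  # filename -> list of node names, one per chunk

    def allocate(self, filename, size):
        chunk_count = -(-size // self.chunk_size)  # ceiling division
        # Round-robin placement is an assumption made for this sketch.
        layout = [self.node_names[i % len(self.node_names)] for i in range(chunk_count)]
        self.layouts[filename] = layout
        return layout

    def locate(self, filename):
        return self.layouts[filename]


def write_file(meta, io_nodes, filename, data):
    """Ask the meta-data server for a layout, then write chunks to the I/O nodes."""
    layout = meta.allocate(filename, len(data))
    for index, node in enumerate(layout):
        start = index * meta.chunk_size
        io_nodes[node][(filename, index)] = data[start:start + meta.chunk_size]


def read_file(meta, io_nodes, filename):
    """Ask the meta-data server where the chunks are, then read them directly."""
    layout = meta.locate(filename)
    return b"".join(io_nodes[node][(filename, index)] for index, node in enumerate(layout))


io_nodes = {"io0": {}, "io1": {}, "io2": {}}  # hypothetical I/O nodes
meta = MetadataServer(io_nodes, chunk_size=4)
write_file(meta, io_nodes, "result.dat", b"0123456789")
print(meta.locate("result.dat"))                 # ['io0', 'io1', 'io2']
print(read_file(meta, io_nodes, "result.dat"))   # b'0123456789'
```

Because the file data flows between the requestor and the I/O nodes rather than through the meta-data server, aggregate throughput can grow as I/O nodes are added.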

Interprocess Communications (IPC) Network

The IPC network provides the connectivity required to transfer information between the HPC cluster nodes during run time. The amount of information transferred between the nodes using messages, in terms of bandwidth and frequency, is entirely dependent upon the application type and the application communication patterns. A good understanding of the application traffic patterns, and of how those patterns may be affected by a different message-passing model, can change the choice of IPC technology, over-subscription rates, and the design of the IPC network.
In the context of HPC, end-to-end message latency is a key metric in determining the efficiency of a particular technology. Message latency is defined as the time it takes to transfer a single zero-payload message from one processor to another processor, and includes all elements within the transmission path: communications stack latency, interface card latency, serialization delay, network switching latency, etc.
Although it is widely assumed that network latency is the biggest delay component, the latency introduced by communications stack processing and data movement within the servers can be significantly greater than network switch latencies. For most applications, the effects of the host protocol stack latency are masked by the application processing delays; however, within HPC these latencies can significantly affect the efficiency of the cluster and add hours, or even days, to large computations.
Within most servers, I/O operations require data to be moved between the ingress interface to mapped I/O memory that is then transferred to user space memory, all under the control of the CPU. This is highly inefficient, as the CPU must also process all TCP operations-acknowledgements, TCP windowing, resequencing of packets, checksum calculation, etc.-and then move the received data from the I/O memory to user-space memory. This also requires the server to suspend active threads and perform context switches to ensure timely responses to other processes, which can be very inefficient.
The effect of this operation is that the CPU, instead of processing the application, must suspend the active thread and switch context to process the communications thread. For IPC within HPC environments, time spent communicating is time spent not processing, and even short interrupts (in the 10s of microseconds) can add significant additional delays if job execution times are measured in days.
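A rough calculation shows how interrupts of this size accumulate. The sketch below uses purely hypothetical values (20 microseconds of host overhead per message, 5,000 messages per second per node, and a 48-hour job); none of these figures come from a measurement, and the function name is illustrative.

```python
def communication_overhead_hours(per_message_us, messages_per_second, job_hours):
    """Hours of a job's wall-clock time consumed by per-message host overhead."""
    # Fraction of every second spent handling messages instead of computing.
    fraction = per_message_us * 1e-6 * messages_per_second
    return fraction * job_hours


# Hypothetical inputs: 20 us per message, 5,000 messages/s, 48-hour job.
lost = communication_overhead_hours(20, 5000, 48)
print(f"{lost:.1f} hours")  # 4.8 hours
```

Under these assumptions one tenth of the run is spent on communications processing, which is the overhead that offloading data movement from the CPU is intended to reduce.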

Figure 3. Non-RDMA vs RDMA Communications

RDMA changes this by enabling a server to write or read data directly to user space memory, thereby eliminating multiple copies of the same data being transferred between different memory spaces and the associated CPU interrupts. Because RDMA offloads data movement overhead from the CPU, the CPU can execute other tasks and greater efficiency can be realized. When used in combination with InfiniBand, or Ethernet NICs that have RDMA and TCP offload engines (TOE), nearly all transport protocol processing and data movement can be offloaded from the central CPU to the interface hardware, thereby realizing significant performance gains.
Another area of focus, especially within HPC and database cluster environments, is the communications protocol. As many database and HPC applications require low latencies to provide optimal performance, specific protocols such as the user direct access programming library (uDAPL) and the message passing interface (MPI) provide low latency protocol stack implementations that allow fast IPC communications. The most widely adopted protocol within HPC is MPI, due in part to the open nature of the protocol and the rich functionality contained within the implementation.
Another consideration is how the network is provisioned with respect to bandwidth. For many campus environments, bandwidth over-subscription of 20:1 to 30:1 is acceptable due to the bursty nature of user-application communications. For data centers, the level of over-subscription drops to between 4:1 and 2:1, as most traffic is destined for application servers hosted within the data center. In the event that bursts of traffic coincide and exceed the available bandwidth, interface buffering absorbs the burst to prevent packet loss. Depending upon the depth of the buffers, the resulting delay can vary from 10s to several 100s of microseconds, which, for most applications, is masked by the overall network delay (which may be in the order of 10s of milliseconds) or by application processing delays.
Within HPC, however, additional delays can lead to increased communications times and lower CPU efficiency, and very low over-subscription rates (for example, 1:1) may be required. However, depending upon the application, computations may be divided into smaller computational tasks that can be used to localize the IPC traffic, and over-subscription rates can vary significantly. This factor can have a significant bearing on the cost of a solution, as different technologies and network topologies may become viable alternatives.
In summary, understanding whether the application is tightly or loosely coupled, and understanding the application communications patterns, are critical considerations when selecting an IPC network technology. Additionally, each component of the path, including interface cards, server bus architecture, protocol stacks, buffering, and application traffic patterns, must be considered to reduce end-to-end latency. Although RDMA is most often associated with InfiniBand, Gigabit and 10 Gigabit Ethernet RDMA-enabled NICs approach the transport efficiency offered by InfiniBand. Lastly, how the HPC environment will scale to meet future requirements and the operational characteristics of the network are also critical considerations. Although collapsing connectivity into larger chassis is often advocated, considerations with respect to inter-rack cabling, ease of management, impact of software or hardware upgrades, ease of expansion, etc., need to be taken into account.
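The over-subscription ratios and buffering delays discussed above can be estimated with simple arithmetic. The sketch below is illustrative only: the switch port counts and the 64 KB backlog are hypothetical values, and the function names are not taken from any product.

```python
def oversubscription_ratio(host_ports, host_gbps, uplink_ports, uplink_gbps):
    """Host-facing bandwidth divided by uplink bandwidth (1.0 means 1:1)."""
    return (host_ports * host_gbps) / (uplink_ports * uplink_gbps)


def buffer_drain_time_us(buffered_bytes, link_gbps):
    """Time to transmit a backlog of buffered bytes, in microseconds."""
    return buffered_bytes * 8 / (link_gbps * 1000)


# Hypothetical access switch: 48 x 1 Gbps host ports, 2 x 10 Gbps uplinks.
print(oversubscription_ratio(48, 1, 2, 10))   # 2.4, that is 2.4:1
# Hypothetical 64 KB backlog draining onto a 1 Gbps link.
print(buffer_drain_time_us(64 * 1024, 1))     # 524.288
```

A delay of roughly half a millisecond is negligible against campus application response times, but it is large relative to the message latencies that tightly coupled applications depend on.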

Table 2. Comparison of Ethernet and InfiniBand

Gigabit and 10 Gigabit Ethernet

Pros
– Low latency, high bandwidth switching
– Standards-based, well-known protocol
– Inexpensive Gigabit Ethernet technology (free LOM)
– RDMA, iWARP, TOE, and iSCSI technology available
– Robust and mature product availability
– Sophisticated traffic management (QoS), security, HA, etc.
– Simple cabling
– Available management and support tools

Cons
– Slightly higher latencies than specialized transport protocols, e.g., InfiniBand, Myrinet
– Relatively new and expensive 10GE RNICs
– High CPU load if TOE and RDMA technology is not supported
– No standard implementation for RDMA over Ethernet currently available

InfiniBand

Pros
– Ultra-low latency, high-bandwidth switching
– Sophisticated traffic management: QoS and flow control
– Relatively inexpensive 10G support
– Native storage now available
– Native RDMA and SRP support
– SDP enables legacy IP to leverage InfiniBand RDMA
– Industry standards based

Cons
– Expensive, distance limited (17m), and bulky cabling limits topology
– Centralized InfiniBand subnet manager slower to converge than distributed protocols (IEEE)
– Relatively new and complex protocol requires training, etc.
– Limited management and support tools
– Tightly coupled switch and HCA implementation requires careful upgrade and interoperability testing

As with the storage network, the choice of IPC network technology and design is not clear cut. Several different factors need to be considered before deciding which technology is best for a particular implementation. For many applications, Gigabit Ethernet is a good choice for IPC network connectivity as it is a relatively inexpensive high-speed network technology. However, it is worth investing in Gigabit Ethernet switches that have low-latency characteristics, such as the Cisco Catalyst 4948 or Catalyst 6500, as they provide tangible benefits in terms of application run-time. As the HPC environment grows, or if applications that require ultra-low latency IPC are deployed, InfiniBand technology that offers high density 10G, very low latency switching becomes a viable technology. The pros and cons of Gigabit Ethernet and InfiniBand are shown in Table 2.

Choosing an Interconnect

The choice of network to meet the application requirements for access, management, storage, and IPC are ultimately dictated by performance. Of real interest to the HPC user is the time to complete a particular operation, the efficiency of the cluster nodes, and the required time to completion. However, if a user is able to accept slightly lower CPU efficiency, or slightly longer run times, then different decisions can be made with respect to technology.
As an example, for loosely coupled applications Gigabit Ethernet is a good IPC network solution, albeit at a slightly lower CPU efficiency than InfiniBand, caused by slightly higher network switching delays (1Gbps vs. 10Gbps** serialization). As a hypothetical example, if 32 nodes using InfiniBand as an IPC network completes a computation in four hours, but would require 36 nodes to complete the same computation in the same time using Gigabit Ethernet as an IPC network, this may be an acceptable trade off in terms of cost, ease of use, and familiarity to the user. However, if a computation ran for 12 hours on 1024 InfiniBand attached nodes and required an additional 400 nodes to complete the same computation using a Gigabit Ethernet IPC network, the additional cost in terms of compute nodes and attendant power and cooling may outweigh the cost benefits of Ethernet.
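The two hypothetical cases above can be compared by expressing the additional nodes as a percentage of the baseline and as additional node-hours. The sketch below uses only the node counts and run times given in the example; the function name is illustrative.

```python
def extra_capacity(baseline_nodes, extra_nodes, run_hours):
    """Return (percent more nodes, additional node-hours) for the slower interconnect."""
    percent = 100.0 * extra_nodes / baseline_nodes
    node_hours = extra_nodes * run_hours
    return percent, node_hours


print(extra_capacity(32, 4, 4))        # (12.5, 16)
print(extra_capacity(1024, 400, 12))   # (39.0625, 4800)
```

In the small case the penalty is 12.5 percent more nodes, whereas in the large case it is about 39 percent, which is why the balance between interconnect cost and node cost shifts as the cluster grows.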
An additional consideration is the size of the cluster. If the HPC cluster is small, a single Ethernet switch may be adequate to support the performance and connectivity requirements of a particular application. For larger clusters, separate network interfaces and networks may be needed to deliver the performance that the application and business requirements demand. Ultimately, HPC cluster technology design decisions are based upon the performance required for the particular application. The following sections describe the considerations for determining the best technology to address HPC connectivity requirements.
** Note that the InfiniBand data rate is 8 Gbps, not 10 Gbps, because of 8B/10B encoding.
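The effect of 8B/10B encoding and of link speed on serialization delay can be worked through numerically. The sketch below is illustrative only: the 1500-byte frame size is an assumed value chosen for the example, and the calculation ignores framing overhead, switch latency, and host stack processing.

```python
def data_rate_bps(signaling_rate_bps: float,
                  payload_bits: int = 8,
                  coded_bits: int = 10) -> float:
    """Usable data rate of a link that sends coded_bits for every payload_bits."""
    return signaling_rate_bps * payload_bits / coded_bits


def serialization_delay_us(frame_bytes: int, rate_bps: float) -> float:
    """Time in microseconds to clock frame_bytes onto a link at rate_bps."""
    return frame_bytes * 8 * 1e6 / rate_bps


INFINIBAND_SIGNALING_BPS = 10e9   # 10 Gbps signaling rate, as in the text
GIGABIT_ETHERNET_BPS = 1e9        # 1 Gbps data rate, as in the text
FRAME_BYTES = 1500                # assumed frame size, for illustration only

infiniband_rate = data_rate_bps(INFINIBAND_SIGNALING_BPS)
gige_delay = serialization_delay_us(FRAME_BYTES, GIGABIT_ETHERNET_BPS)
infiniband_delay = serialization_delay_us(FRAME_BYTES, infiniband_rate)

print(f"InfiniBand data rate: {infiniband_rate / 1e9:.0f} Gbps")
print(f"Gigabit Ethernet serialization: {gige_delay:.1f} microseconds")
print(f"InfiniBand serialization: {infiniband_delay:.1f} microseconds")
```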

Access and Management Network

The choice of technology for the access and management networks is simple: Ethernet. These networks have no specific latency requirements and are used for intermittent bursts of traffic that are not particularly bandwidth intensive. Because at least one Gigabit Ethernet interface is included at no extra cost on most server platforms, making use of this interface makes good economic and practical sense.
The Cisco Catalyst® family of multilayer switches provides rich services such as security and QoS that can be enabled without affecting the high performance and low latencies offered by these devices. The Cisco Catalyst 3750 Series switches and Cisco Catalyst 4948 Switch are ideal for top-of-rack deployments where high performance and a low cooling and power footprint are critical. For higher density deployments, the Cisco Catalyst 6500 multilayer switch offers unparalleled services and connectivity options that enable very dense and highly available connectivity.

Storage

For storage access, several choices are available, including NAS, Fiber Channel SAN, iSCSI, and InfiniBand attached storage, which are discussed below.

Network Attached Storage: Network attached storage (NAS) is used extensively within HPC as it provides centralized management and high-bandwidth access to files. Another popular strategy is parallel file systems, which deliver highly distributed NAS functionality, albeit in a slightly different fashion, by leveraging the hard disks of several nodes to provide high-performance file systems. As NAS protocols are IP centric (NFS, FTP, and so on), Ethernet connectivity is the de facto standard for this type of storage access. In some cases the NAS may be accessed using the management and user access network, and in others through a dedicated interface; the key factor in deciding whether to consolidate is the amount of bandwidth that will be consumed by data retrieval. It should be noted that data retrieval is not time bounded, so combining the management and storage networks is an acceptable strategy. If management traffic response times are a concern, QoS, which is relatively simple to implement, allows management and storage traffic to be classified as high and low priority, respectively, and scheduled accordingly.

Fiber Channel: In the context of HPC, Fiber Channel SAN interfaces incur a "slot tax": provisioning a Fiber Channel HBA may make the system more expensive, as the server may need to be a larger form factor (4RU instead of 1RU), and may also have ramifications for performance if PCI-X servers are used. In practice, Fiber Channel is not widely used for HPC node connectivity, although it is used for scaling NAS disk capacity and reliability.

iSCSI: Although iSCSI is a viable technology, it is rarely, if ever, used within HPC environments; however, iSCSI can easily be incorporated into HPC using the Ethernet management network or storage network.

InfiniBand Attached Storage: InfiniBand attached storage is a recent development that holds the promise of high-bandwidth, low-overhead access to storage. For applications that need to move large volumes of data, an Ethernet attached parallel file system in combination with InfiniBand attached storage and SDP can deliver very high performance file transfer. An added advantage of this approach is that the inherent storage mechanism is NAS (file based), so it combines the benefits of NAS with the high bandwidth and low stack overhead of InfiniBand.

Ethernet provides an ideal technology for access to NAS as it provides high bandwidth (for example, 10 Gigabit Ethernet attached NAS) and location-independent services. For less demanding HPC clusters, storage access may be combined with the access and management networks. The use of distributed parallel file systems also enables InfiniBand-attached storage to be used to deliver high-throughput data retrieval.

Figure 4. Cisco HPC Storage Solutions

For Fiber Channel environments, the Cisco MDS 9000 multilayer SAN switches deliver intelligent network services such as virtual storage-area networks (VSANs), comprehensive security, advanced traffic management, sophisticated diagnostics, and unified SAN management. The MDS 9000 family of products also supports advanced features such as iSCSI gateway functionality, write acceleration, and network-accelerated serverless backup.

Inter-Process Communications Network

The IPC network is the essence of an HPC cluster as it provides the connectivity required for the slave nodes to exchange messages that enable parallel processing of complex computations. In general, the IPC network requires low latency and high bandwidth to enable optimum IPC, although the terms "low" and "high" need to be carefully defined in the context of the application.
Referring to the taxonomy detailed at the beginning of this document, applications can be characterized as parametric execution, loosely coupled, tightly coupled, and transactional.

Table 3. Loosely Coupled Applications

| Enterprise Vertical | Requirement | Applications | Notable Characteristics |
| --- | --- | --- | --- |
| Energy | Seismic and geophysical modeling | Geoquest, Geodepth | Applications have little or no IPC traffic and although latency is not generally an issue, bandwidth is a key consideration. |
| Financial | Financial Analytics (Monte Carlo) | Barra, Sungard, RMG | |
| Digital Media | Digital image rendering, animation | Discreet, Pixar Renderman | |

In general, these characterizations enable some aspects of the application to be determined, such as sensitivity to latency. For example, loosely coupled applications and parametric execution exchange few, if any, IPC messages and are consequently not sensitive to IPC latencies. At the opposite end of the scale are the tightly coupled applications, which exchange information through frequent IPC messages and are very latency sensitive, as time spent communicating is time spent not processing.
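The statement that time spent communicating is time spent not processing can be illustrated with a deliberately simple model. The sketch below assumes that communication and computation never overlap and that every message costs the same fixed latency; the message counts and latencies are hypothetical values chosen only to show the trend, not measurements of any product.

```python
def cpu_efficiency(compute_seconds: float,
                   messages: int,
                   latency_seconds: float) -> float:
    """Fraction of wall-clock time spent computing, assuming the node
    sits idle for latency_seconds on every IPC message it exchanges."""
    communication_seconds = messages * latency_seconds
    return compute_seconds / (compute_seconds + communication_seconds)


COMPUTE_SECONDS = 1.0  # hypothetical compute time per node

# Loosely coupled: very few messages, so latency barely matters.
loose_slow = cpu_efficiency(COMPUTE_SECONDS, messages=10, latency_seconds=50e-6)

# Tightly coupled: frequent messages, so latency dominates the result.
tight_slow = cpu_efficiency(COMPUTE_SECONDS, messages=10_000, latency_seconds=50e-6)
tight_fast = cpu_efficiency(COMPUTE_SECONDS, messages=10_000, latency_seconds=5e-6)

print(f"Loosely coupled, 50 us per message: {loose_slow:.1%}")
print(f"Tightly coupled, 50 us per message: {tight_slow:.1%}")
print(f"Tightly coupled,  5 us per message: {tight_fast:.1%}")
```

Under these assumptions the loosely coupled case stays close to full efficiency regardless of latency, while the tightly coupled case gains substantially from a tenfold latency reduction.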

Table 4. Tightly Coupled Applications

| Enterprise Vertical | Requirement | Applications | Notable Characteristics |
| --- | --- | --- | --- |
| Manufacturing | Crash simulation, Aerodynamics | Fluent, Powerflow | Low latency is important to minimize IPC communications overhead. |
| Life Sciences | Disease research, drug design, finite element analysis, etc. | Amber, Blast, Charmm | |

There are several areas that need to be addressed to reduce latency. In the context of the network, low-latency switches and high-bandwidth connections reduce switch and network serialization delays. For many applications and smaller HPC clusters, Gigabit Ethernet is a good choice for the IPC network, as modern Gigabit Ethernet switches deliver low switching latencies and high bandwidth.
Careful consideration of the actual Gigabit Ethernet switch used is required, as not all Gigabit Ethernet switches have low-latency characteristics. An example of a low-latency Gigabit Ethernet switch is the Cisco Catalyst 4948 Multilayer Switch, which has a measured latency of 4.8 microseconds. While this class of low-latency switches may command a premium when compared to other Gigabit Ethernet switches, the performance gain in terms of HPC efficiency more than compensates for the additional cost.

Figure 5. Cisco Gigabit and 10 Gigabit Ethernet HPC Solution

For larger HPC clusters and more demanding applications, InfiniBand provides an ideal solution, combining ultra-low latency and high bandwidth using 10G InfiniBand 4X host connectivity and uplinks. Switch latencies for the Cisco SFS 7000 Series InfiniBand Server Switch product family are on the order of hundreds of nanoseconds, which provides significant application performance gains in terms of time to result.

Figure 6. Cisco InfiniBand HPC Solution

Although it is tempting to eliminate as many switch hops as possible from a design and collapse switch layers into fewer and larger switches, the effect of a switch failure, cabling complexity and volumes, and power and cooling need to be taken into account. As an HPC cluster grows in size, top-of-rack solutions using central aggregation switches may make installation, scaling, operations, and maintenance simpler. Additionally, using a different technology, such as InfiniBand instead of Gigabit Ethernet, may reduce latency while retaining the benefits that top-of-rack solutions provide in terms of fault isolation, management, and so on. As most HPC server vendor solutions consist of a rack of servers, a top-of-rack switch provides a foundation for large, scalable HPC clusters using 10GE or 4X InfiniBand uplinks to aggregation layer switches.
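The trade-off between switch hops and per-switch latency can be sketched with simple arithmetic. The example below assumes a two-tier design in which traffic between racks crosses a top-of-rack switch, an aggregation switch, and a second top-of-rack switch, and it assumes that every switch in a path has the same latency. The 4.8 microsecond figure is the Catalyst 4948 latency quoted earlier; the 0.2 microsecond InfiniBand figure is a hypothetical value standing in for "hundreds of nanoseconds".

```python
def path_switch_latency_us(switch_hops: int, per_switch_latency_us: float) -> float:
    """Total switching latency along a path, ignoring serialization and cabling."""
    return switch_hops * per_switch_latency_us


SAME_RACK_HOPS = 1    # both nodes attach to the same top-of-rack switch
CROSS_RACK_HOPS = 3   # top-of-rack -> aggregation -> top-of-rack (assumed design)

GIGE_SWITCH_LATENCY_US = 4.8         # Catalyst 4948 figure quoted in the text
INFINIBAND_SWITCH_LATENCY_US = 0.2   # hypothetical: "hundreds of nanoseconds"

gige_cross_rack = path_switch_latency_us(CROSS_RACK_HOPS, GIGE_SWITCH_LATENCY_US)
ib_cross_rack = path_switch_latency_us(CROSS_RACK_HOPS, INFINIBAND_SWITCH_LATENCY_US)

for name, latency in (("Gigabit Ethernet", GIGE_SWITCH_LATENCY_US),
                      ("InfiniBand", INFINIBAND_SWITCH_LATENCY_US)):
    same = path_switch_latency_us(SAME_RACK_HOPS, latency)
    cross = path_switch_latency_us(CROSS_RACK_HOPS, latency)
    print(f"{name}: same rack {same:.1f} us, cross rack {cross:.1f} us")
```

Under these assumptions, a three-hop InfiniBand path still has less switching latency than a single Gigabit Ethernet hop, which is why a lower-latency technology can make a multi-tier top-of-rack design practical.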
By far the largest reduction in latency is gained within the HPC compute nodes, by offloading communications protocol processing to the IPC network interface cards and by using Remote Direct Memory Access (RDMA). RDMA provides a mechanism that relieves the CPU of any requirement to move data between user and I/O memory. Within HPC clusters, MPI is able to leverage the underlying data movement offload capabilities of the InfiniBand HCA or RDMA-capable NICs. RDMA also supports robust security and kernel bypass and is very efficient in terms of CPU overhead.
As a technology, RDMA is transport independent and can be applied to Ethernet HPC environments. However, Ethernet is slightly more complex because TCP/IP processing also needs to be taken into consideration. In this respect, TCP offload engines (TOE) embedded on the NIC provide a way of offloading all or some parts of TCP processing from the CPU to the TOE NIC.
If performance is a consideration for Ethernet-based HPC, investing in TOE or TOE + RDMA-capable NICs provides CPU efficiency approaching that offered by InfiniBand technology. Commercially available 10/100/1000 and 10GE RDMA+TOE enabled NICs are shown in Table 5.

Table 5. RDMA-Capable Ethernet Network Interface Cards (RNIC)

| RNIC Vendors | 10/100/1000 | 10GE | LOM Chip Sets |
| --- | --- | --- | --- |
| Broadcom (Siliquent) | Yes | Yes | Yes |
| Chelsio | 1000Base-T | Yes | No |
| Neterion | No | Yes | No |

Currently, InfiniBand has an advantage over Ethernet because 10G (4X) InfiniBand HCAs and high-density 4X InfiniBand switches are widely available at lower prices than 10GE RNICs and 10GE switches. This dynamic is likely to change with volume production of RNICs and as RNIC-enabled LAN-on-Motherboard becomes available.
So, which IPC network technology is right for a given application? This question can only be answered with a good understanding of application characteristics, the size of the HPC cluster, the application traffic patterns, target CPU efficiencies, and so on. In general, Gigabit Ethernet is a good choice for many HPC environments where the applications are not particularly latency sensitive and where some small loss of node CPU efficiency, or the use of more nodes, is acceptable for that application. If the application is sensitive to latency, or benefits from 10G connectivity, InfiniBand is a good choice; it is also appropriate where HPC node densities are likely to be high, or where a mix of HPC applications may be encountered.
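The general guidance above can be written down as a simple rule of thumb. The sketch below is only an illustration of that guidance: the ClusterProfile type, its field names, and the suggest_ipc_network function are hypothetical, and a real decision would also weigh cost, traffic patterns, and target CPU efficiency, which this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class ClusterProfile:
    """Hypothetical summary of the characteristics named in the text."""
    latency_sensitive: bool     # application is sensitive to IPC latency
    benefits_from_10g: bool     # application benefits from 10G connectivity
    high_node_density: bool     # HPC node densities are likely to be high
    mixed_applications: bool    # a mix of HPC applications is expected


def suggest_ipc_network(profile: ClusterProfile) -> str:
    """Return a starting point for the IPC network choice, not a final answer."""
    if (profile.latency_sensitive
            or profile.benefits_from_10g
            or profile.high_node_density
            or profile.mixed_applications):
        return "InfiniBand"
    return "Gigabit Ethernet"


rendering_farm = ClusterProfile(False, False, False, False)
crash_simulation = ClusterProfile(True, False, False, False)

print(suggest_ipc_network(rendering_farm))
print(suggest_ipc_network(crash_simulation))
```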

SUMMARY

The network is a critical component of HPC. Although HPC design is often focused on the requirements for the IPC component of the HPC cluster, other connectivity must be taken into consideration, such as storage, management, and user access. Additionally, as HPC has now moved into mainstream enterprise deployments, the security, scalability, and availability of the HPC network components must all be considered.
Cisco® Systems, the world leader in networking, powers many of the Top 500 HPC supercomputers by providing a complete network solution that meets the requirements for high-performance computing (HPC) and grid applications.
The Cisco SFS 7000 server switch family of industry-standards-based InfiniBand switches and InfiniBand HCAs delivers the performance, availability, and scalability to support the most demanding HPC IPC network environments.
The award-winning Cisco Catalyst® 6500 and Catalyst 4948 multilayer switches deliver unprecedented performance, availability, and functionality required for the most demanding HPC environments. Their proven high-performance, low-latency GE and 10GE switching enables the Cisco Catalyst switches to support IPC, storage, and management networks. Additionally, the Cisco Catalyst switches support sophisticated security features that enable the Catalyst family of products to provide secure access to the HPC cluster.
The Cisco MDS 9000 multilayer SAN switches deliver highly available, scalable storage services by combining a robust and flexible hardware architecture with multiple layers of network and storage-management intelligence. These attributes enable the Cisco MDS 9000 to deliver advanced security and unified management, and lower the total cost of ownership for the most demanding storage environments.
Corporate Headquarters: Cisco Systems, Inc., 170 West Tasman Drive, San Jose, CA 95134-1706, USA. www.cisco.com. Tel: 408 526-4000, 800 553-NETS (6387). Fax: 408 526-4100.
European Headquarters: Cisco Systems International BV, Haarlerbergpark, Haarlerbergweg 13-19, 1101 CH Amsterdam, The Netherlands. www-europe.cisco.com. Tel: 31 0 20 357 1000. Fax: 31 0 20 357 1100.
Americas Headquarters: Cisco Systems, Inc., 170 West Tasman Drive, San Jose, CA 95134-1706, USA. www.cisco.com. Tel: 408 526-7660. Fax: 408 527-0883.
Asia Pacific Headquarters: Cisco Systems, Inc., 168 Robinson Road, #28-01 Capital Tower, Singapore 068912. www.cisco.com. Tel: +65 6317 7777. Fax: +65 6317 7799.
Cisco Systems has more than 200 offices in the following countries and regions. Addresses, phone numbers, and fax numbers are listed on the Cisco Website at www.cisco.com/go/offices.
Argentina · Australia · Austria · Belgium · Brazil · Bulgaria · Canada · Chile · China PRC · Colombia · Costa Rica · Croatia · Cyprus · Czech Republic · Denmark · Dubai, UAE · Finland · France · Germany · Greece · Hong Kong SAR · Hungary · India · Indonesia · Ireland · Israel · Italy · Japan · Korea · Luxembourg · Malaysia · Mexico · The Netherlands · New Zealand · Norway · Peru · Philippines · Poland · Portugal · Puerto Rico · Romania · Russia · Saudi Arabia · Scotland · Singapore · Slovakia · Slovenia · South Africa · Spain · Sweden · Switzerland · Taiwan · Thailand · Turkey · Ukraine · United Kingdom · United States · Venezuela · Vietnam · Zimbabwe
Copyright © 2006 Cisco Systems, Inc. All rights reserved.
CCSP, CCVP, the Cisco Square Bridge logo, Follow Me Browsing, and StackWise are trademarks of Cisco Systems, Inc.; Changing the Way We Work, Live, Play, and Learn, and iQuick Study are service marks of Cisco Systems, Inc.; and Access Registrar, Aironet, BPX, Catalyst, CCDA, CCDP, CCIE, CCIP, CCNA, CCNP, Cisco, the Cisco Certified Internetwork Expert logo, Cisco IOS, Cisco Press, Cisco Systems, Cisco Systems Capital, the Cisco Systems logo, Cisco Unity, Enterprise/Solver, EtherChannel, EtherFast, EtherSwitch, Fast Step, FormShare, GigaDrive, GigaStack, HomeLink, Internet Quotient, IOS, IP/TV, iQ Expertise, the iQ logo, iQ Net Readiness Scorecard, LightStream, Linksys, MeetingPlace, MGX, the Networkers logo, Networking Academy, Network Registrar, Packet, PIX, Post-Routing, Pre-Routing, ProConnect, RateMUX, ScriptShare, SlideCast, SMARTnet, The Fastest Way to Increase Your Internet Quotient, and TransPath are registered trademarks of Cisco Systems, Inc. and/or its affiliates in the United States and certain other countries.All other trademarks mentioned in this document or Website are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. 
(0601R) Printed in the USA C11-333340-00 03/06