
Clustering and High-Performance Computing Solution

Comparing Cluster Interconnects: Fluent CFD over Ethernet and InfiniBand

White Paper

Measuring and evaluating the performance of the Fluent CFD application over Gigabit Ethernet, RDMA Gigabit Ethernet, and InfiniBand interconnect networks.

Adoption of commodity compute clusters has grown beyond the traditional high-performance computing market and is entering the enterprise data center. Microsoft's release of a Compute Cluster edition of Windows Server 2003 underscores this trend. Designing cluster interconnect networks efficiently requires network professionals and cluster architects to select the appropriate technology. This decision is complicated because applications have varying requirements, and the performance of a particular application can vary with the computational dataset and the application's inherent scalability.
This paper presents application performance measurements using Fluent, a popular Computational Fluid Dynamics (CFD) application, and three Cisco® technology-based interconnect design options:

• Standard on-the-motherboard Gigabit Ethernet network interface cards (NICs)

• Remote Direct Memory Access (RDMA)-enhanced PCI-X Gigabit Ethernet NICs (RNICs)

• InfiniBand RDMA-enabled PCI Express host channel adapters (HCAs)

Customers can use this information to make more informed design decisions when building cluster interconnects.

CHALLENGES

Cluster interconnect network design is complex. A cluster's application and business requirements will dictate the technology and product choices for cluster interconnect.
Historically, most high-performance computing systems have been designed by a single "system integrator" vendor who provided and supported the entire system. This changed significantly with the adoption of clusters. A modern compute cluster represents a highly complex system with multiple layers of abstraction. This paper will focus on interconnect technology options, but the following are common factors in the design choices faced by cluster architects:

Hardware

• CPU vendor, cache and threading technologies (AMD, Intel, IBM-Power, etc.)

• Motherboard vendors and system bus technologies (chipsets, PCI-X, PCI Express, HTX, etc.)

• Interconnect technologies (Ethernet, InfiniBand, Myrinet, Quadrics, etc.)

• Interconnect switching topologies and over-subscription ratios (star, ring, CLOS, 2:1)

• Storage technologies (local disks, Fibre Channel, RAID, etc.)

• Cabling plant and rack design (copper vs fiber, bundles, patch panels, nodes per rack, layout)

• Power and cooling infrastructure (power strips, 220V/110V, efficient cooling)

Software

• OS vendors (Linux, BSD, Microsoft, commercial vs free, kernel versions)

• Driver choices and tuning (stock or vendor supplied)

• Programming languages (C, C++, Fortran, Java, etc.)

• Communication stacks (Message Passing Interface [MPI], Parallel Virtual Machine [PVM], etc., commercial or open)

• Interconnect communication optimizations (RDMA, user-space TCP, offload engines)

• Interconnect switching protocols (Spanning Tree Protocol, subnet managers, Layer 3 designs, gateways)

• File systems (cluster file systems, scratch space, swap space, checkpointing, backup)

• Job scheduling and initialization mechanisms (Remote Shell [RSH] and Secure Shell [SSH] Protocol, BProc, grid, cycle harvesting, etc.)

• "Middleware" (management and monitoring, intelligent job distribution, process migration)

• Compilers (commercial or open, supporting Fortran, optimized for processor choice)

METHODOLOGY

The wide variety of cluster interconnects and options has resulted in a marketing focus on extreme-condition benchmarks such as:

• Lowest single pair host-to-host interconnect latency for 0-Byte message transfer (no payload)

• Highest bidirectional interconnect throughput (maximum payload)

• Performance of the linear algebra application, Linpack (small set of linear algebra calculations)

These limited benchmarks, as provided on most vendor data sheets, are not necessarily indicative of general application performance. Having 10x the bandwidth or 1/10 the latency will not translate to a 10x application speedup.
This paper instead focuses on actual application performance metrics by measuring the performance of an application as the number of nodes in the cluster scales. For this paper, Fluent Inc.'s commercial Computational Fluid Dynamics (CFD) application, Fluent, was chosen.
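As a rough illustration of why a 10x better interconnect does not yield a 10x faster application, the following sketch uses a toy model in which run time is compute time plus per-message communication time. The model, the function name run_time, and every number in it (10 s of compute, 100,000 messages of 1 KB, the latency and bandwidth values) are assumptions made up for illustration; they are not measurements from these tests.

```python
def run_time(compute_s, n_messages, msg_bytes, latency_s, bandwidth_bps):
    """Toy model: total time = compute time + time spent sending messages."""
    per_message_s = latency_s + (msg_bytes * 8) / bandwidth_bps
    return compute_s + n_messages * per_message_s

# Hypothetical workload: 10 s of computation and 100,000 messages of 1 KB each.
baseline = run_time(10.0, 100_000, 1024, latency_s=50e-6, bandwidth_bps=1e9)

# Same workload on an interconnect with 1/10 the latency and 10x the bandwidth.
faster = run_time(10.0, 100_000, 1024, latency_s=5e-6, bandwidth_bps=10e9)

speedup = baseline / faster
print(f"baseline {baseline:.2f} s, faster interconnect {faster:.2f} s, "
      f"speedup {speedup:.2f}x")
```

In this toy case only the communication portion shrinks, so the overall gain stays far below 10x. Real applications also overlap communication with computation and synchronize between nodes, which this model ignores.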

CISCO SOLUTIONS

Cisco Systems® provides a wide variety of network interconnect options for cluster architects:

• Gigabit Ethernet and 10 Gigabit Ethernet core networks using the Cisco Catalyst 6500 Series switches

• Gigabit Ethernet rack switches using the Catalyst 6500 Series and Catalyst 4948 switches

• Integrated blade server switches for a variety of blade server vendors

• InfiniBand core and rack distribution networks using the Cisco SFS 7000 Series InfiniBand server switches

• InfiniBand gateway capabilities using the Cisco SFS 3000 Series multifabric server switches

• Fibre Channel and iSCSI storage capabilities using the Cisco MDS 9000 Series multilayer SAN switches

• Optical transport networks for connecting metro data centers using the Cisco ONS Family platform

• Core routing capabilities for connecting to research networks using the Cisco CRS-1 carrier routing systems and Cisco 12000 Series routers

Cisco is the only network hardware vendor to offer both Ethernet and InfiniBand interconnect fabric options. This helps Cisco provide cluster architects with unbiased technology comparisons.

TEST APPLICATION: FLUENT

Fluent is the world's largest provider of CFD software and consulting services. Fluent is used for simulation, visualization, and analysis of fluid flow, heat and mass transfer, and chemical reactions (http://www.fluent.com).
Fluent is a vital part of the computer-aided engineering (CAE) process for companies worldwide, and is deployed in nearly every manufacturing industry. Cisco hardware engineers use Fluent in thermal modeling of some Cisco products.
Fluent uses the Message Passing Interface (MPI) libraries and protocol to communicate and synchronize the computation during a run.

TESTING SETUP

Hardware

Thirty-two Altus 2100 servers from Penguin Computing (http://www.penguincomputing.com) were used as compute nodes in the cluster. Each was configured as follows:

• Dual 2.0 GHz AMD Opteron 246 processors

• 4 GB of PC3200 ECC DDR RAM

• Onboard Broadcom BCM5721 Gigabit Ethernet NICs

• Ammasso 1100 Gigabit Ethernet RDMA-enabled RNICs using a 64-bit PCI-X 133 MHz bus

• Cisco InfiniBand PCI Express Host Channel Adapter using x8 PCI Express slots

A Cisco Catalyst 6509 Ethernet Switch was used for Gigabit Ethernet switching. It was configured as follows:

• Cisco Catalyst 6500 Series Supervisor Engine 720

• Cisco Catalyst 6500 Series 48-Port 10/100/1000 WC Ethernet Module with DFC-3B distributed forwarding

A Cisco SFS 7008 InfiniBand Server Switch was used for InfiniBand switching.

Software

• Scyld Linux Clustering Software package version 29cz-5 (kernel based on kernel.org 2.4.29, packages based on RHEL3).

• Broadcom BCM5700 Linux kernel drivers, as included with Scyld, for the Broadcom Gigabit Ethernet NICs.

• Ammasso RNIC driver version 1.2u1-ga and Ammasso's modified RDMA MPI implementation, mpich version 1.2.5.

• Scyld provided InfiniBand and MPICH-VAPI drivers. These were derived from Topspin driver version 3.0.0.179 (MPICH Version 1.2.3).

• Fluent version 6.2.16. It includes built-in driver support for stock mpich and InfiniBand mpich. (Scyld mpirun for Gigabit Ethernet and mpirun for InfiniBand were not used. Ammasso directly provided modular drivers for Fluent.)

• Fluent requires RSH to be enabled on the compute nodes. Methods to enable RSH were provided by Scyld.

• Cisco IOS® Software Release 12.2(18)SXD was used on the Catalyst switch.

• TopspinOS 2.2.0 was used on the Cisco SFS 7008 InfiniBand Server Switch.

TESTING PROCEDURE

Fluent provides a suite of benchmarks for performance testing (http://www.fluent.com/software/fluent/fl5bench/). The test cases are meant to represent typical jobs run by Fluent customers. Jobs are initiated from a user logged into the head node. In parallel runs, Fluent pushes jobs out to the compute nodes. During each test run, all the compute nodes share state via interprocess communication (IPC) using MPI to complete the calculations via the cluster fabric. In this case, the tests used either the Cisco Catalyst 6509 Ethernet Switch or the Cisco SFS 7008 InfiniBand Switch.

Figure 1. Cluster System Setup with Two Fabric Options

Each test case is broken down into a set of cells for calculation. Simple test cases contain approximately 32,000 cells; the most complex test case contains 3.6 million cells. Larger problems consisting of higher cell counts are generally more scalable because they can be broken down more efficiently. Table 1 outlines the test cases used.
Each test case was run serially as a single process on a single processor, and then in parallel with increasing numbers of processors. Note that each server node in the cluster has two processors in Symmetric Multiprocessing (SMP) configuration. Jobs were run with a 1-to-1 ratio of process to processor. Each test case was run three times and the results were averaged to improve accuracy.

Table 1. Fluent Benchmark Test Cases

Test Case         Details                                                                        Serial Runtime (Approximate)  Cell Count
Small 1 (fl5s1)   Accelerating turbulent flow in an elbow duct using segregated implicit solver  45 s                          32,000
Small 2 (fl5s2)   Accelerating turbulent flow in an elbow duct using coupled implicit solver     45 s                          32,000
Small 3 (fl5s3)   Transonic flow in rotating fan                                                 75 s                          90,000
Medium 1 (fl5m1)  Coal combustion in a boiler                                                    200 s                         155,000
Medium 2 (fl5m2)  Turbulent flow in an engine valveport                                          100 s                         243,000
Medium 3 (fl5m3)  Combustion in a high-velocity burner                                           450 s                         353,000
Large 1 (fl5l1)   Transonic flow around a fighter aircraft                                       730 s                         848,000
Large 2 (fl5l2)   Exterior flow around a passenger sedan                                         1060 s                        3,618,000

Note: A larger test case (fl5l3) exists, but it requires more RAM than was available to complete a single node serial run.

UNDERSTANDING OUTPUT DATA

Figure 2 shows a simplified example test run plotted on a graph. Each run is labeled with the shorthand name provided in Table 1 (for example, "fl5m2" refers to "Medium 2"). The X-axis represents the number of processes (and thus processors) used in the test. The Y-axis represents the resulting speedup or performance improvement. For example, a speedup of 6 represents a 6x performance improvement or job run in 1/6 the time of a single process. An ideal response would follow the ideal (X = Y) line. (X processors would perform X times as fast as a single processor.)

Figure 2. Example Output Graph
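The speedup plotted on the Y-axis, and the efficiency implied by the distance from the ideal line, can be computed as in the following minimal sketch. It assumes speedup is the averaged serial runtime divided by the averaged parallel runtime, and efficiency is speedup divided by process count. The runtimes are hypothetical values chosen to give the "speedup of 6" used as an example above; they are not measured results.

```python
from statistics import mean

def speedup(serial_s, parallel_s):
    """Speedup relative to a single process: T_serial / T_parallel."""
    return serial_s / parallel_s

def efficiency(serial_s, parallel_s, n_processes):
    """Fraction of the ideal (X = Y) speedup actually achieved."""
    return speedup(serial_s, parallel_s) / n_processes

# Hypothetical runtimes in seconds; each configuration is run three times
# and averaged, as in the testing procedure.
serial_runs = [100.0, 101.0, 99.0]
parallel_runs = [16.5, 16.8, 16.7]   # assumed 8-process runs
n_processes = 8

serial_s = mean(serial_runs)
parallel_s = mean(parallel_runs)

s = speedup(serial_s, parallel_s)
e = efficiency(serial_s, parallel_s, n_processes)
print(f"speedup {s:.1f}x on {n_processes} processes, efficiency {e:.0%}")
```

An efficiency of 100 percent would place the point exactly on the ideal line; lower values fall below it.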

However, Fluent, like almost all compute cluster applications, does not scale linearly. The distance between the ideal line and the output result data represents the efficiency of the run. Amdahl's law describes this behavior of diminishing returns. "Embarrassingly parallel" applications (often with little to no internode I/O) are capable of scaling nearly linearly, but an overhead cost of distributing processing always exists.
This is an important consideration for shared-use clusters with many concurrent users. Optimizing for a single user's response time may not result in optimal cluster use. As an example, given a job that needs to be calculated three times, you could run the jobs back-to-back on 24 processors three times, or you could run them all at the same time using eight processors per job. The parallel jobs will complete faster, because this job runs much more efficiently at eight processors.
Another point of interest is that for the example benchmark and stock Gigabit Ethernet interconnect, after 32 nodes the system degrades, resulting in a negative speedup. This is an example where adding more processors does not result in faster performance. This is an important data point if other technologies continue to scale when one technology degrades.
Knowing the scaling properties for a given application and work load is critical for optimizing compute cluster performance.
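The back-to-back versus concurrent comparison above can be sketched using Amdahl's law, S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that parallelizes. The parallel fraction (0.95) and the 100-second serial runtime are assumptions chosen for illustration, not values measured in these tests.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: speedup on n processors for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

SERIAL_RUNTIME_S = 100.0   # hypothetical single-process runtime
PARALLEL_FRACTION = 0.95   # hypothetical; real values depend on the job
JOBS = 3

# Option 1: run the three jobs back-to-back, each using all 24 processors.
back_to_back_s = JOBS * SERIAL_RUNTIME_S / amdahl_speedup(PARALLEL_FRACTION, 24)

# Option 2: run the three jobs at the same time, each using 8 processors.
concurrent_s = SERIAL_RUNTIME_S / amdahl_speedup(PARALLEL_FRACTION, 8)

print(f"back-to-back on 24 processors: {back_to_back_s:.1f} s")
print(f"concurrent on 8 processors each: {concurrent_s:.1f} s")
```

Under these assumptions the concurrent option finishes all three jobs sooner, because each job runs at higher efficiency on eight processors. Note that Amdahl's law only models diminishing returns; it does not capture the outright degradation seen when communication overhead grows with process count.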

OUTPUT DATA-SCALABILITY

Figure 3 presents scalability response curves that were obtained for the test cases run. A brief analysis follows.

Figure 3. Fluent Benchmark Test Results

Note: Only 30 nodes were equipped with working Ammasso RNICs, thus the maximum process count for RNIC test runs was limited to 60 instead of 64.

All jobs tend to show a performance improvement using RNICs and InfiniBand. See the following section for more detail.
All small runs eventually degrade: beyond the peak, adding processes reduces the speedup.

fl5s1:

Speedup peaks: Gigabit Ethernet and RNIC at 8 processes, InfiniBand at 16 processes.

fl5s2:

Speedup peaks: Gigabit Ethernet and RNIC at 16 processes, InfiniBand at 24 processes. Note that all interconnects follow a particular curve pattern.

fl5s3:

• Speedup peaks: All at 24 processes.
• Gigabit Ethernet outperforms the RNIC for <= 24 processes and InfiniBand for <= 4 processes.

Medium runs show RNICs and InfiniBand providing greater scalability and show InfiniBand providing increased efficiency.

fl5m1:

• Speedup peaks: Gigabit Ethernet at 32 processes, RNIC at 48, InfiniBand at 64 (note that efficiency drops considerably for larger node counts).
• For processor counts <= 8 Gigabit Ethernet outperforms RNICs and InfiniBand.

fl5m2:

Speedup peaks: Gigabit Ethernet at 32 processes, RNIC at 48, InfiniBand at 64.

fl5m3:

Speedup peaks: Gigabit Ethernet at 24 processes, RNIC at 24, InfiniBand at 64 (note possible speedup above 48 processes).

fl5l1:

Speedup peaks: All at 48 processes. Gigabit Ethernet and RNIC fairly similar, with InfiniBand outperforming.

fl5l2:

Speedup peaks: All continue to scale at 64/60 processes. RNIC and InfiniBand closer in performance.

Note: The ideal line is very close to response curves, implying a very scalable test case.

OUTPUT ANALYSIS-PERFORMANCE IMPROVEMENT

Performance improvement numbers can give an idea of how much better one technology might be than another. The numbers in Table 2 compare the RNIC and InfiniBand technologies to a Gigabit Ethernet baseline.
To make these measurements fairer, the performance improvement columns labeled "RNIC %" and "InfiniBand %" are taken at the Gigabit Ethernet peak speedup. The "Average" columns average the speedup at the Gigabit Ethernet peak and below. (After the point where Gigabit Ethernet degrades, doing a comparison against the degraded number makes little sense.) This table ignores the fact that the RNIC and InfiniBand responses often scaled higher than Gigabit Ethernet, but is more representative of speedup expectations for a Gigabit Ethernet system at maximum capacity.
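One plausible way to compute such figures is sketched below. The paper does not state its exact formula, so this assumes the improvement is the relative difference in speedup at the same process count, and that the "Average" figure is the mean of those improvements at the Gigabit Ethernet peak and below. The speedup curves are hypothetical, not the measured data.

```python
def improvement_pct(baseline_speedup, other_speedup):
    """Relative improvement of another interconnect over the baseline, in percent."""
    return (other_speedup / baseline_speedup - 1.0) * 100.0

# Hypothetical speedup curves keyed by process count (not measured data).
ge = {2: 1.9, 4: 3.6, 8: 6.0, 16: 5.0}     # Gigabit Ethernet peaks at 8
ib = {2: 1.9, 4: 3.8, 8: 7.5, 16: 10.0}    # InfiniBand keeps scaling

# Process count at which the Gigabit Ethernet baseline peaks.
ge_peak = max(ge, key=ge.get)

# Improvement at the Gigabit Ethernet peak.
at_peak = improvement_pct(ge[ge_peak], ib[ge_peak])

# Average improvement over process counts at and below the peak.
counts = [n for n in sorted(ge) if n <= ge_peak]
average = sum(improvement_pct(ge[n], ib[n]) for n in counts) / len(counts)

print(f"GE peak at {ge_peak} processes: {at_peak:.0f}% at peak, "
      f"{average:.0f}% average")
```

The 16-process point is deliberately excluded from both figures, mirroring the choice not to compare against a degraded Gigabit Ethernet result.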

Table 2. Performance Improvements at Gigabit Ethernet Maximum Scalability Peak

Test Case  GE Peak (Processes)  RNIC %  RNIC Average %  InfiniBand %  InfiniBand Average %
fl5s1      8                    24%     19%             77%           48%
fl5s2      16                   19%     10%             84%           40%
fl5s3      24                   -1%     -7%             30%           8%
fl5m1      32                   11%     -1%             40%           8%
fl5m2      32                   18%     14%             51%           23%
fl5m3      24                   7%      10%             46%           24%
fl5l1      48                   4%      4%              24%           12%
fl5l2      64                   30%     21%             47%           30%
AVERAGE    -                    14%     9%              50%           24%

Note: Because of the 60-processor limit for RNIC tests, the RNIC comparison is made against a 64-processor Gigabit Ethernet run. (This direct comparison is done only on the fl5l2 run.)
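The AVERAGE row of Table 2 can be reproduced from the per-test-case rows. The short sketch below assumes the row is a simple arithmetic mean of each column, rounded to the nearest percent; the values are copied from the table.

```python
# Per-test-case values from Table 2, in percent:
# (RNIC %, RNIC Average %, InfiniBand %, InfiniBand Average %)
table2 = {
    "fl5s1": (24, 19, 77, 48),
    "fl5s2": (19, 10, 84, 40),
    "fl5s3": (-1, -7, 30, 8),
    "fl5m1": (11, -1, 40, 8),
    "fl5m2": (18, 14, 51, 23),
    "fl5m3": (7, 10, 46, 24),
    "fl5l1": (4, 4, 24, 12),
    "fl5l2": (30, 21, 47, 30),
}
columns = ("RNIC %", "RNIC Average %", "InfiniBand %", "InfiniBand Average %")

# Arithmetic mean of each column across the eight test cases, rounded.
averages = [
    round(sum(row[i] for row in table2.values()) / len(table2))
    for i in range(len(columns))
]

for name, value in zip(columns, averages):
    print(f"{name}: {value}%")
```

The unrounded means are 14.0, 8.75, 49.875, and 24.125, which round to the published 14, 9, 50, and 24 percent.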

ESTIMATING INTERCONNECT VALUE

As an alternative to the above approach of measuring variable performance for a given process count, consider measuring the variable process count required to achieve a given performance level. This method ignores the fact that InfiniBand and RNICs have higher overall scalability and assumes node count is the primary constraint. While this method requires a priori knowledge of how an application scales, it does answer some practical questions.
As a sample pseudo-financial exercise, consider the graph of the Medium 2 benchmark (fl5m2) in Figure 3. This graph was chosen for its simplicity, not because it represents overall performance. This calculation will vary depending on specific application scaling.
For example, assume an 11x speedup on the Medium 2 benchmark is a design requirement for a hypothetical cluster:

• A Gigabit Ethernet network with 24 processors provides a speedup of approximately 11x.

• Comparable performance using an RNIC interconnect can be obtained using approximately 20 processors (interpolated).

• Comparable performance using an InfiniBand interconnect can be obtained using approximately 16 processors.

Thus the "value" of the RNIC network in this limited example is approximately four processors. The "value" of the InfiniBand network in this example is approximately eight processors. This method of valuation is rarely done. Node counts are often decided well in advance of network interconnects, but regardless, it can provide a sense of the value the interconnects provide.
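The valuation above amounts to reading each scaling curve "sideways": finding the processor count at which a target speedup is reached, interpolating between measured points where needed. A minimal sketch follows. The curve data points are hypothetical, chosen only so that the results match the approximate figures quoted above (24, 20, and 16 processors for an 11x speedup); they are not read from Figure 3, and the function name processes_for_speedup is illustrative.

```python
def processes_for_speedup(curve, target):
    """Linearly interpolate the process count at which `target` speedup is reached.

    `curve` is a list of (process_count, speedup) points sorted by process
    count. Returns None if the target is not reached on a rising segment.
    """
    for (n0, s0), (n1, s1) in zip(curve, curve[1:]):
        if s1 > s0 and s0 <= target <= s1:
            return n0 + (target - s0) * (n1 - n0) / (s1 - s0)
    return None

# Hypothetical (process_count, speedup) points for the three interconnects.
gige = [(8, 5.5), (16, 9.0), (24, 11.0), (32, 12.0)]
rnic = [(8, 6.0), (16, 10.0), (24, 12.0), (32, 13.5)]
infiniband = [(8, 6.5), (16, 11.0), (24, 14.0), (32, 17.0)]

TARGET_SPEEDUP = 11.0
n_gige = processes_for_speedup(gige, TARGET_SPEEDUP)
n_rnic = processes_for_speedup(rnic, TARGET_SPEEDUP)
n_ib = processes_for_speedup(infiniband, TARGET_SPEEDUP)

# "Value" of each interconnect, expressed in processors saved versus GigE.
rnic_value = n_gige - n_rnic
ib_value = n_gige - n_ib
print(f"RNIC saves about {rnic_value:.0f} processors, "
      f"InfiniBand about {ib_value:.0f}")
```

With real data, the curves would come from measured speedups such as those behind Figure 3, and the result is only as good as the interpolation between measured process counts.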

NON-PERFORMANCE CONSIDERATIONS

While application performance is usually paramount, there are several outside factors to consider when comparing the stock Gigabit Ethernet solution to the RNIC and InfiniBand solutions.
Both RNICs and InfiniBand require specialized card upgrades, which involve physically installing cards into every node. It is often possible to purchase new machines with the cards preinstalled, but when done onsite, this process can be very time-consuming.
Gigabit Ethernet RNICs often require a second run of CAT-5 UTP cabling per server. While not always necessary (some RNICs can PXE boot and be used as a primary interface), oftentimes the IPC network will be cabled up separately from a management network. This consumes more ports on the fabric interconnect design and requires more cabling to be installed.
InfiniBand requires new cabling and switches. InfiniBand cables are significantly heavier than Gigabit Ethernet CAT-5 UTP. Like RNICs, InfiniBand can be configured as the only active interface, but a Gigabit Ethernet link is commonly used as well for management. InfiniBand switches differ from Ethernet switches in their management and configuration and, as with the adoption of any new technology, InfiniBand comes with its own challenges.

CONCLUSIONS

Matching application performance and scalability is critical when evaluating cluster interconnect designs. The following conclusions were derived from running these Fluent benchmarks. (They may not be indicative of other application scaling.)

• Scaling bandwidth and latency does not result in direct linear scaling of application performance

• RNICs and InfiniBand improve application scalability over Gigabit Ethernet (10 Gigabit Ethernet solutions have not yet been evaluated)

• InfiniBand is more scalable than RNICs or Gigabit Ethernet at high processor counts (likely influenced by low latency)

• InfiniBand provides better performance improvements and scaling than RNICs (overall 24 percent improvement versus 9 percent)

• Larger job runs scale more favorably and efficiently than smaller job runs on all technologies

• With a priori understanding of application-scaling properties, the "value" of an interconnect can be estimated

• Other non-performance factors can influence design

CISCO CLUSTERLAB

The Cisco Clusterlab evaluates application performance as it relates to cluster interconnect design. Investigations include:

• Developing a general method to characterize applications and infer probable scaling properties

• Scaling thresholds with large node counts

• Performance of blocking interconnect architectures

• 10 Gigabit Ethernet performance (RNICs and TOE)

• OS and MPI stack overheads

• Process-and-processor job assignments (SMP versus UP jobs, Federated clusters)

• Storage impact (cluster file systems and other storage solutions)

• Compiler impact on application performance

For more information, e-mail clusterlab@cisco.com.
Corporate Headquarters: Cisco Systems, Inc., 170 West Tasman Drive, San Jose, CA 95134-1706, USA. www.cisco.com. Tel: 408 526-4000, 800 553-NETS (6387). Fax: 408 526-4100.
European Headquarters: Cisco Systems International BV, Haarlerbergpark, Haarlerbergweg 13-19, 1101 CH Amsterdam, The Netherlands. www-europe.cisco.com. Tel: 31 0 20 357 1000. Fax: 31 0 20 357 1100.
Americas Headquarters: Cisco Systems, Inc., 170 West Tasman Drive, San Jose, CA 95134-1706, USA. www.cisco.com. Tel: 408 526-7660. Fax: 408 527-0883.
Asia Pacific Headquarters: Cisco Systems, Inc., 168 Robinson Road, #28-01 Capital Tower, Singapore 068912. www.cisco.com. Tel: +65 6317 7777. Fax: +65 6317 7799.
Cisco Systems has more than 200 offices in the following countries and regions. Addresses, phone numbers, and fax numbers are listed on the Cisco Website at www.cisco.com/go/offices.
Argentina · Australia · Austria · Belgium · Brazil · Bulgaria · Canada · Chile · China PRC · Colombia · Costa Rica · Croatia · Cyprus · Czech Republic · Denmark · Dubai, UAE · Finland · France · Germany · Greece · Hong Kong SAR · Hungary · India · Indonesia · Ireland · Israel · Italy · Japan · Korea · Luxembourg · Malaysia · Mexico · The Netherlands · New Zealand · Norway · Peru · Philippines · Poland · Portugal · Puerto Rico · Romania · Russia · Saudi Arabia · Scotland · Singapore · Slovakia · Slovenia · South Africa · Spain · Sweden · Switzerland · Taiwan · Thailand · Turkey · Ukraine · United Kingdom · United States · Venezuela · Vietnam · Zimbabwe
Copyright © 2005 Cisco Systems, Inc. All rights reserved.
CCSP, CCVP, the Cisco Square Bridge logo, Follow Me Browsing, and StackWise are trademarks of Cisco Systems, Inc.; Changing the Way We Work, Live, Play, and Learn, and iQuick Study are service marks of Cisco Systems, Inc.; and Access Registrar, Aironet, ASIST, BPX, Catalyst, CCDA, CCDP, CCIE, CCIP, CCNA, CCNP, Cisco, the Cisco Certified Internetwork Expert logo, Cisco IOS, Cisco Press, Cisco Systems, Cisco Systems Capital, the Cisco Systems logo, Cisco Unity, Empowering the Internet Generation, Enterprise/Solver, EtherChannel, EtherFast, EtherSwitch, Fast Step, FormShare, GigaDrive, GigaStack, HomeLink, Internet Quotient, IOS, IP/TV, iQ Expertise, the iQ logo, iQ Net Readiness Scorecard, LightStream, Linksys, MeetingPlace, MGX, the Networkers logo, Networking Academy, Network Registrar, Packet, PIX, Post-Routing, Pre-Routing, ProConnect, RateMUX, ScriptShare, SlideCast, SMARTnet, StrataView Plus, TeleRouter, The Fastest Way to Increase Your Internet Quotient, and TransPath are registered trademarks of Cisco Systems, Inc. and/or its affiliates in the United States and certain other countries. All other trademarks mentioned in this document or Website are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company.
(0502R) 205482.BJ_ETMG_CC_11.05 Printed in the USA