Application performance is one of the most significant factors in the use of clusters in high-performance computing (HPC) environments. A number of variables affect performance, including node configuration, interconnect and switching fabric types and configuration, protocol offload technologies, storage and file system types, and the application characteristics.
Many application types can be defined as HPC, and the characteristics of each influence the choice of interconnect and fabric.
This document presents a comparative analysis of the performance of LS-DYNA, a finite element analysis application, using two interconnect and fabric technologies: Gigabit Ethernet and Cisco® InfiniBand.
Finite Element Analysis
Materials analysis has become increasingly important in the development of most physical products produced today. To better use resources such as plastics in a cell phone handset, for example, detailed simulations of the injection molding process are performed to understand how well the plastic will meet the flexibility and durability requirements of the final product, without air pockets or other materials failures.
In the automotive industry, the same sort of analysis is performed, but more important to most people, crash-safety features such as front-fender crumple zones and main-body safety-cage deformation in a simulated crash are also validated with finite element analysis and simulation.
In general, two types of simulations and analysis that are used in industry: two-dimensional (2-D) modeling and three-dimensional (3-D) modeling. Engineers, in the past as well as today, perform finite element analyses primarily on dedicated computers using finite element analysis software packages. Given the complexities of the materials and objects being modeled, this approach has certain limitations.
2-D modeling provides a simple computer analysis that can be run on a relatively standard computer, but it tends to yield less accurate results than 3-D modeling. 3-D modeling produces more accurate results but usually can be run only on a very fast computer.
Today's finite element analysis applications allow critical data to be gathered, and this information can potentially minimize the risk of inaccurate decisions and shorten the time to market for products. However, new systems and technologies are required to use this data efficiently enough to maximize both computing and human hours and to stay ahead of market competition.
HPC emerged specifically to meet the demands of these specialized applications. HPC accelerates a variety of complex and intensive computations, the results of which allow a company to make faster, more accurate business decisions and shorten the time to market.
Finite Element Analysis in a Cluster
With the advances of computing facilities and the development of higher-speed communication networks, finite element analysis computing has migrated from a single-system computing environment to a locally clustered or geographically distributed computing environment with software development and communications capabilities. This document focuses on the performance of LS-DYNA on a local cluster using various interconnects and fabric combinations.
HPC Network
A crucial component of an HPC cluster is the network that provides the communications between nodes. The performance and characteristics of this network affect the overall cluster performance. In the cluster architecture, multiple nodes must communicate with resources such as storage and other nodes necessary for control and interprocess communication. Generically, communications within a cluster can be divided into four categories:
• Access network-The access network is the cluster access point for users. It allows the cluster user to schedule jobs and view job results and graphical data.
• Management network-The management network is the cluster command and control network that allows the master node to schedule, start, check, and stop work that is performed on the cluster.
• Storage or I/O network-In most HPC environments, the cluster nodes download data from an external network attached storage (NAS) or storage area network (SAN) to their local disk and then perform the necessary calculations before writing the result back to the NAS or SAN. This process requires high-speed access between the NAS and SAN systems and the cluster nodes.
• Interprocessor communication (IPC) network-The IPC network provides high-speed connectivity between cluster nodes so that IPC messages can be exchanged. Because the IPC network characteristics have the most effect on application performance, the IPC network uses high-bandwidth and low-latency network technologies.
Of these four components, the IPC network can have the most significant effect on applications with large amounts of processor-to-processor traffic: the higher the message latency, the poorer the application performance. This component is the focus for the testing discussed in this document.
IPC and Interconnects
Today's computing environments use several different interconnects and technologies for IPC traffic. Two of the more popular are Gigabit Ethernet and a new technology called InfiniBand. These are the interconnects and fabrics used in the application performance and scalability testing of LS-DYNA discussed here.
IPC in parallel clusters primarily uses the Message Passing Interface (MPI). Both open-source and commercial versions of MPI are available. Because this document focuses on the performance and scalability of the interconnect and fabric, a single MPI stack was used for the Gigabit Ethernet testing: the Local Area Multicomputer (LAM) MPI. For the Cisco InfiniBand testing, the MVAPICH MPI stack included as part of Cisco InfiniBand was used.
Challenges
Choosing the correct interconnect, switching fabric, and cluster design can be a challenge. Cluster designers must choose the processor type and amount of memory and decide whether to use local storage or network attached storage. In the end, however, the application and business requirements will dictate what is used for all areas of the cluster design.
Understanding application performance requires an understanding of the characteristics of the cluster that have the greatest effects on the performance and scalability of the application.
Application Performance
Cluster performance is judged by the total run time, or wall-clock time, of the application. The shorter the wall-clock time, the more application processing can be performed, and the lower the cost per application process.
The main areas affecting application performance are message latency, CPU use and contention, and I/O latency for data set access.
Message Latency
Network latency has always been a primary factor when selecting a switching technology or even a switching vendor. In HPC, the concern is message latency: the time it takes to move a message from one processor to another. Message latency consists of the aggregate latency incurred at each element within the cluster network, including within the cluster nodes themselves. The protocol processing latency of MPI and TCP processes within the host itself has the highest effect on message latency.
Message latency has a significant effect on the scalability and performance of the application. The higher the message latency, the longer the application wall-clock time.
CPU Use and Contention
In application processing, the CPU usually handles a multitude of processes, so the goal is to have the processes use the CPU resources as efficiently as possible. However, computing resources often are not used effectively. The ultimate goal of application programmers is to have the application use the full CPU resources from start to finish after the application is invoked. Wall-clock time is crucial: the shorter the run time, the better.
The more CPU cycles committed to application processing, the lower the application wall-clock time. The number of cycles devoted to application processing is affected by the protocol processing and I/O processing that occurs during the application processing. As packets are received, headers and service information are stripped to provide the data to the application. This stripping takes CPU cycles away from the application processing. The more protocol processing that occurs, the fewer the CPU cycles devoted to the application and the greater the wall-clock time. The CPU use also affects scalability; the more time spent communicating between nodes, the longer the time needed to finish the application processing.
Interconnects such as Cisco InfiniBand with its implementation of Remote Direct Memory Access (RDMA) and optimized Gigabit Ethernet with TCP offloading or a new Gigabit Ethernet offloading technology called Internet Warp (iWARP) Remote Direct Data Placement (RDMA/RDDP over TCP) can improve CPU use.
These technologies allow protocol processing to occur in a host adapter, bypassing the CPU, kernel processes, and context switching and moving the data directly to application memory. This process frees CPU cycles for the application and significantly reduces end-to-end message latency.
I/O Latency
In many applications such as LS-DYNA, large data sets are used for input, crash or failure information, etc. as well as for output generated by the computing processes. I/O wait time on either side of data set reads or writes affects application wall-clock time. Storage methods and decisions significant effects on I/O latency and wait times.
Storage solutions can include the use of a network file system (NFS), or other parallel file systems such as Lustre, TeraGrid, general parallel file system (GPFS), parallel NFS, or parallel virtual file system (PVFS).
The goal of the testing was to show the performance and scalability characteristics of the Gigabit Ethernet and Cisco InfiniBand interconnects and fabrics, so the data sets were placed on the local storage of each node, thereby eliminating the effect of I/O wait time and helping ensure a constant disk access performance variable for the test.
Maximizing Resources
Companies that understand the importance of time to completion in crash-simulation tests generally make every effort to design clusters that maximize efficiency and to obtain the best tools available. However, costly purchases and careful planning can be undermined by a failure to maximize the communication between computing resources.
Gigabit Ethernet and LAM MPI
Gigabit Ethernet is a well-known networking technology often used as a high-speed LAN backbone. Gigabit Ethernet operates at 1000 Mbps (1 gigabit per second).
LAM is an open-source implementation of the MPI standard. Implementations of MPI (such as LAM) provide an API of library calls that allow users to pass messages between nodes of a parallel application.
For these tests, LAM MPI was used to maximize performance; for more information, see http://www.lam-mpi.org/. MPI is a library specification for message-passing, proposed as a standard by a broad based committee; for more information, see http://www.mpi-forum.org/docs/docs.html.
Cisco InfiniBand and RDMA
InfiniBand is a technology that was developed to address the performance problems associated with data movement between computer I/O devices and associated protocol stack processing. The Cisco InfiniBand physical layer supports three data rates, designated 1X, 4X, and 12X. For these tests, the Cisco InfiniBand 4X interface was used; the 4x interface has a signaling rate of 10 Gbps and a data throughput rate of 8 Gbps.
Cisco InfiniBand uses a hardware-based RDMA capability. RDMA allows the CPU to delegate data movement within the computer to the direct memory access (DMA) hardware. The CPU informs the hardware of the memory location where data that is associated with a particular process resides and the memory location that the data is to be moved to. After the DMA instructions are sent, the CPU can process other threads while the DMA hardware moves the data. RDMA allows data to be moved from one memory location to another, even if that memory resides on another device.
Test Methodology
This document focuses on actual application performance metrics by measuring the performance of an application as the number of nodes in the cluster scales. For this test, a crash simulation application known as LS-DYNA was used.
This document contains real application performance data for clusters scaling to 128 CPUs, comparing the widely used Gigabit Ethernet and Cisco InfiniBand interconnects.
The hardware used in this test is not intended to achieve the fastest possible result, but instead to show comparison results on a customer-realistic set of equipment. Customers can certainly obtain better results than those depicted here by upgrading the hardware of their clusters.
Figure 1 shows the cluster setup used in this test.
Figure 1. Cluster System Setup with Two Fabric Options
Each test case was run serially as a single process on a single processor and then in parallel with increasing numbers of processors. Note that each server node in the cluster has two processors in a symmetric multiprocessing (SMP) configuration. Jobs were run with a 1:1 ratio of process to processor. Each test case was run three times, and the results were averaged to improve accuracy.
LS-DYNA
LS-DYNA (http://www.lstc.com/) is a recognized commercial finite element analysis application that is used for a number of different types of analysis. Some of many common uses include automobile crash analysis, building collapse analysis, material formation and failure analysis, and heat transfer analysis.
For the analysis in this test, two recognized vehicle crash benchmarks were used: the Three-Car Crash Test simulation and the Neon-Refined Crash Test simulation. The latter is specific to the Dodge Neon.
Information on the use of LS-DYNA by various organizations can be found at http://www.dynalook.com/.
Testing Application: LS-DYNA
LS-DYNA is a multifunctional explicit and implicit finite-element program used to simulate and analyze highly nonlinear physical phenomena obtained in real-world problems. Usually those phenomena are subjected to large deformations within a short time duration, such as in crashworthiness simulations.
LS-DYNA is used extensively in vehicle crashworthiness analysis. The ability of LS-DYNA to model contacts and the wide range of material models available allows occupants and barriers to be modeled with ease.
Testing Setup
Up to 64 Altus 2100 servers from Penguin Computing (http://www.penguincomputing.com) were used as computing nodes in the cluster. Each was configured as follows:
Hardware for Gigabit Ethernet Tests
• Single and dual 2.0 GHz AMD Opteron 246 processors
• Cisco SFS 7008 InfiniBand Server Switch for InfiniBand switching
Software for Cisco InfiniBand Tests
• SFS host drivers - Linux 3.2.0_build 82
• SFS on Cisco SFS 7000P InfiniBand Server Switch, Release 2.7.14
Operating System on Computing Nodes for Gigabit Ethernet and Cisco InfiniBand Tests
• RedHat Enterprise Linux 4.3
Testing Procedure
The figures and tables that follow show the results for the standard 3-Car Crash test and Neon Refined test simulations.
The 3-Car Crash test analyzes a van crashing into the rear of a compact car. The compact car, in turn, crashes into a midsize car. Vehicle models were created by the National Crash Analysis Center (NCAC) and assembled into the input file by Mike Berger, consultant to LSTC. For more information, see http://www.topcrunch.org/benchmark_details.sfe?query=3&id=40.
The Neon Refined test analyzes a frontal crash with an initial speed of 31.5 miles per hour. The model size is 535k elements. The simulation length is 150 milliseconds (ms). The model was created by NCAC at George Washington University. It is one of the few publicly available models for vehicle crash analysis and is based on the 1996 Dodge Neon. For more information, see http://www.topcrunch.org/benchmark_details.sfe?query=3&id=60.
Of the two tests, the 3-Car Crash test causes the lower amount of IPC. Tests with higher IPC, such as the Neon Refined test, would be expected to show more substantial benefits from a low-latency interconnect.
The Website http://topcrunch.org/ contains benchmark results for several problems. All graphed data is available at this Website.
The following lists defines some of the terms used in the tests.
• Computational efficiency-Computational efficiency is the measure of the efficiency for a given job on a set of computing nodes. The base measure for this metric is operations per day on the lowest possible configuration (two CPUs in SMP mode and one CPU in non-SMP mode). For example, if the number of operations per day for a given job on two CPUs (in SMP mode) is 2, then that measure is the base measure. This value can be used to predict linear scalability. If the same job on four CPUs yields 3.2 operations per day, then computation efficiency is calculated as follows: (3.2 * 100) / (2 * 2) = 80%. In the preceding example above, this result would indicate a deviation from linear scalability of 20%.
• Operations per day-Operations per day is the quotient of the number of seconds in a day (86,400 seconds) divided by elapsed time. This value determines the number of iterations of a given operation that can be performed in a single day.
• Elapsed time-Elapsed time is the measure of the time taken to complete a single iteration of a given operation. For example, in the following tests, values are given for the total time taken to finish an iteration of the 3-Car Crash test simulation using two CPUs.
• CPU time-CPU time is the measure of time used by the system CPU to complete a single iteration of a given operation.
• Network processing overhead-Network processing overhead is the percentage difference between elapsed time and CPU time. This difference may indicate the percentage of processing time spent to accommodate communication between computing nodes.
Understanding the Output Data
Figure 2 shows a simplified sample test run plotted on a graph. The linear line represents the ideal scalability rate. The distance between the ideal line and the output result data represents the efficiency of the run.
The x-axis represents the number of processes (and thus processors) used in the test. The y-axis represents the resulting number of operations that can be run in a day.
For example, at four CPUs, the Cisco InfiniBand and Gigabit Ethernet interconnects closely match the ideal scaling rate. However, as the number of CPUs increases, the efficiency of the Gigabit Ethernet nodes drops off noticeably compared to Cisco InfiniBand.
Figure 2. 3-Car Crash Test-Side-by-Side Scalability Performance on Cisco InfiniBand and Gigabit Ethernet
Understanding the Output Data
Figure 3 combines the Cisco InfiniBand and Gigabit Ethernet performance tests. The graph compares the number of operations per day for both interconnects.
Figure 3 also shows the computational effect on the computing nodes for both Cisco InfiniBand and Gigabit Ethernet. The scaling tests show that Cisco InfiniBand has less effect on the system, allowing the system to maintain high efficiency even as IPC increases.
Figure 3. 3-Car Crash Test-Operations per Day and Computational Efficiency for Cisco InfiniBand and Gigabit Ethernet
Understanding the Output Data
Figure 4 measures the efficiency of a given job on a set of Cisco InfiniBand and Gigabit Ethernet computing nodes. The graph for this 3-Car Crash test shows that Cisco InfiniBand maintains a higher rate of computational efficiency as the number of CPUs increases.
Figure 4. 3-Car Crash Test-Computational Efficiency of Cisco InfiniBand and Gigabit Ethernet
Understanding the Output Data
Figure 5 shows the percentage difference between elapsed time (the time taken to complete a single iteration of a given operation) and CPU time (time used by the system CPU to complete a single iteration of a given operation) for both Cisco InfiniBand and Gigabit Ethernet.
A significant difference between elapsed time and CPU time may indicate processing power spent to accommodate communication between computing nodes, as depicted in Figure 3.
Figure 5. 3-Car Crash Test-Contrasting Elapsed Time and CPU Time for Cisco InfiniBand and Gigabit Ethernet
Understanding the Output Data
Figure 6 uses the 3-Car Crash test to compare the performance of SMP and uniprocessor (UP) hardware when using Gigabit Ethernet and when using Cisco InfiniBand as interconnects.
Figure 6. 3-Car Crash Test-SMP Compared to Uniprocessor on Gigabit Ethernet and Cisco InfiniBand
Understanding the Output Data
Tables 1 and 2 show the results of the LS-DYNA 3-Car Crash test on dual-CPU systems. The performance measurements reveal the increase in computational overhead and the decrease in the speed of jobs as the number of CPUs increases.
The ideal performance numbers that could be reached for this test are shown in the 1 CPU column; the goal is to scale the number of nodes (increasing IPC) while maintaining performance that is as close as possible to those ideal numbers.
Table 1. 3-Car Crash Test A-Gigabit Ethernet, TPC/IP, and LAM MPI
Dual CPU per Host (SMP)
Number of CPUs
1
2
4
8
16
32
64
128
Elapsed time (in seconds)
186,063
111,205
55,233
29,593
17,849
10,968
6272
-
CPU time (in seconds)
186,028
109,952
53,136
27,121
13,645
7619
3892
-
Network processing overhead
0.02%
1.13%
3.80%
8.36%
23.56%
30.54%
37.95%
-
Computational efficiency
100.00%
83.69%
84.78%
79.07%
65.76%
53.46%
46.77%
-
Operations per day
0.46
0.77
1.56
2.91
4.84
7.87
13.77
Linear scaling
0.46
0.92
1.84
3.68
7.36
14.72
29.44
Table 2. 3-Car Crash Test B-Cisco InfiniBand, RDMA, and MVAPICH MPI
Dual CPU per Host (SMP)
Number of CPUs
1
2
4
8
16
32
64
128
Elapsed time (in seconds)
97,080
49,613
25,691
14,315
7807
4424
-
CPU time (in seconds)
97,066
49,605
25,685
14,312
7805
4420
-
Network processing overhead
0.02%
0.02%
0.03%
0.03%
0.03%
0.10%
-
Computational efficiency
95.65%
94.56%
91.30%
81.92%
75.13%
66.30%
-
Operations per day
0.88
1.74
3.36
6.03
11.06
19.52
Linear scaling
0.46
0.92
1.84
3.68
7.36
14.72
29.44
Understanding the Output Data
Tables 3 and 4 show the same tests as Tables 1 and 2, except that these results were performed on single-CPU systems. The performance measurements reveal the increase in computational overhead and the decrease in the speed of jobs as the number of CPUs increases.
Table 3. 3-Car Crash Test C- Gigabit Ethernet, TPC/IP, and LAM MPI
Single CPU per Host (UP)
Number of CPUs
1
2
4
8
16
32
64
128
3-Car Crash test
Gigabit Ethernet, TCP/IP, and LAM MPI
Elapsed time (in seconds)
186,063
111,205
55,233
29,593
17,849
10,968
6272
-
CPU time (in seconds)
186,028
109,952
53,136
27,121
13,645
7619
3892
-
Network processing overhead
0.02%
1.13%
3.80%
8.36%
23.56%
30.54%
37.95%
-
Computational efficiency
100.00%
83.69%
84.78%
79.07%
65.76%
53.46%
46.77%
-
Operations per day
0.46
0.77
1.56
2.91
4.84
7.87
13.77
Linear scaling
0.46
0.92
1.84
3.68
7.36
14.72
29.44
Table 4. 3-Car Crash Test D-Cisco InfiniBand, RDMA, and MVAPICH MPI