I/O management is a key concern for system administrators deploying clustered servers. Many clusters have three disparate I/O requirements: LAN/WAN, SAN, and server-to-server communication. The conventional solution used to address these requirements is the installation of three separate networks. The cost and administrative burden of scaling and managing multiple networks are significant.
The Cisco Server Fabric Switch offers an alternative that scales better with lower costs. The server switch connects servers within a cluster with a unified high-bandwidth, low-latency InfiniBand fabric and then creates a central pool of LAN/WAN and SAN resources that can be shared by all servers connected to the fabric. Both IP and storage traffic are carried to the server over a single InfiniBand connection, reducing complexity.
Even though each server is only physically connected to one fabric, Cisco's Transparent Topology Architecture ensures that servers appear direct-attached to the SAN and LAN. Using 10Gbps of bandwidth for each server via the InfiniBand connection, the administrator can scale shared Ethernet and Fibre Channel capacity by simply adding new gateways to the centralized server switch.
By breaking the 1-to-1 physical binding between I/O and servers, the server switch creates a centrally managed I/O resource, with several benefits:
• On-demand, Scalable I/O: Servers access the resources they need, when they need them.
• Simplified Management: I/O can be centrally administered and scaled, significantly cutting costs and management complexity.
• Reduced Downtime: Failure rates are lowered by reducing managed cards and ports in the network infrastructure.
• Lower Costs from I/O Consolidation: By matching I/O requirements to performance rather than number of servers, customers can see a 50 percent reduction in I/O costs due to fewer cards, cables, and switch ports.
• Mobile virtual I/O subsystems: I/O identities previously bound to individual server hardware, such as World-Wide Node Names and MAC addresses, are virtualized and stored in the fabric, enabling rapid change. In conjunction with VFrame Server Virtualization software, this enables server virtualization.
This paper highlights the features of Cisco's Virtual I/O architecture and discusses how virtualization is performed with Ethernet and Fibre Channel networking technologies.
VIRTUAL I/O: HOW IT WORKS
When I/O resources are located in the server switch, they are accessible by any server in the cluster, and can be assigned to one or more servers dynamically. Figure 1 shows how each clustered server shares access to the common gateway in the server switch.
Figure 1. I/O Virtualization
Virtual I/O reduces network complexity by creating a unified, high-performance InfiniBand fabric that carries all types of network traffic. Without virtual I/O, a server can have dedicated cards for each network type (doubled for high availability). A typical server might have two NICs for LAN connectivity, two HBAs for SAN connectivity, and two cards for cluster communications. In contrast, each server in a cluster with virtual I/O has only one 10Gbps network port (two for redundancy) and shares remote Fibre Channel and Ethernet ports on a central server switch. Figure 2 illustrates this unified fabric.
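The adapter counts in the preceding paragraph can be turned into a quick comparison. The following is a minimal sketch: the function name and the 32-server cluster size are illustrative assumptions, while the per-server counts come from the typical server described above. It counts only server-side adapters and ports, not the shared gateway ports on the server switch.

```python
# Per-server counts taken from the text: two NICs, two HBAs, and two cluster
# interconnect cards without virtual I/O; two InfiniBand ports (one plus one
# for redundancy) with virtual I/O.
CONVENTIONAL_ADAPTERS_PER_SERVER = 2 + 2 + 2
VIRTUAL_IO_PORTS_PER_SERVER = 2


def server_side_adapters(servers: int, per_server: int) -> int:
    """Return the total number of server-side adapters or ports in a cluster."""
    if servers < 0:
        raise ValueError("servers must be non-negative")
    return servers * per_server


if __name__ == "__main__":
    cluster_size = 32  # hypothetical cluster size, not a figure from the paper
    before = server_side_adapters(cluster_size, CONVENTIONAL_ADAPTERS_PER_SERVER)
    after = server_side_adapters(cluster_size, VIRTUAL_IO_PORTS_PER_SERVER)
    print(f"conventional: {before} adapters, virtual I/O: {after} ports")
```

The comparison is deliberately one-sided: a full cost model would also include the gateways and server switch ports that the servers share.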
The server switch allows all servers connected to InfiniBand to communicate with Fibre Channel and Ethernet networks via shared I/O gateways. In Figure 2, storage traffic travels across InfiniBand and is converted to Fibre Channel at the Fibre Channel gateway. In this example, a single Fibre Channel port on the Cisco SFS 3000 Series InfiniBand to Fibre Channel gateway replaces the ports in each server.
Using a "wire-once" model, administrators can centrally manage a pool of I/O, similar to how storage resources can be centrally managed. Instead of physically touching servers by installing and configuring additional ports, administrators simply add a new gateway or another server switch. The Server Fabric Switch can then add that bandwidth to the portion of the cluster that requires it. Administrators can now build I/O infrastructure based on average load, instead of dedicated peak bandwidth per server.
VIRTUALIZED IP
Using IP over InfiniBand, standard IP-based applications work transparently over InfiniBand. The InfiniBand Architecture Specification defines a standard IP-over-InfiniBand (IPoIB) encapsulation mechanism. The 3000 Series InfiniBand to Ethernet gateway uses IPoIB to extract IP packets from InfiniBand and transmit them via Ethernet. Similarly, incoming Ethernet frames with IP payloads are converted to InfiniBand. IPoIB uses standard Berkeley Sockets libraries, so existing applications communicate using existing APIs, with no change required. When an administrator configures a host, he or she configures an IP interface tied to an InfiniBand port (e.g., ib0 or ib1) and also associates an InfiniBand partition with it, which is mapped to a VLAN via the gateway, as described below.
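Because IPoIB presents an ordinary IP interface, application code is the same as on any other network. The sketch below is a plain TCP echo written with Python's standard socket module; it contains nothing specific to InfiniBand, which is the point. The function names are illustrative, and the loopback address in the usage note stands in for whatever IP address an administrator would assign to ib0 or ib1.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one connection and echo back whatever the client sends."""
    conn, _ = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def echo_roundtrip(bind_addr: str, payload: bytes) -> bytes:
    """Send a small payload to a local echo server and return the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.settimeout(5)
        listener.bind((bind_addr, 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(listener,))
        worker.start()
        with socket.create_connection((bind_addr, port), timeout=5) as client:
            client.sendall(payload)
            reply = client.recv(1024)
        worker.join()
    return reply


if __name__ == "__main__":
    # 127.0.0.1 is a stand-in; on an IPoIB host this would be the address
    # configured on the InfiniBand-backed interface.
    print(echo_roundtrip("127.0.0.1", b"hello"))
```

Running the same code against the address bound to an IPoIB interface would require no source changes, only a different address string.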
Cisco's gateway architecture provides IP interfaces to the server through two mechanisms:
• Service mapping with standard ARP protocol
• Bandwidth scaling with EtherChannel and multiple gateways
First, the Ethernet gateway dynamically associates IP services with physical Ethernet ports. Second, the Ethernet gateway supports bandwidth scaling with link aggregation and multiple gateway support. Scalability ensures that the administrator can install sufficient Ethernet bandwidth to meet the cluster's demand.
SERVICE MAPPING
Cisco's Ethernet gateway maps services automatically using the standard IP address resolution protocol. Each gateway in the Server I/O fabric participates in address resolution with clustered servers. Gateways learn which IP addresses are connected to which ports. The server switch directs an IP packet to the gateway that has access to the destination IP address.
The system administrator may configure partitions within the Server I/O fabric that map directly to IP subnets and Ethernet virtual LANs (VLANs). Gateways translate InfiniBand partitions into IP subnets and VLAN tags as set by the system configuration. Partition, subnet, and VLAN association is a two-way process. Incoming frames from the external Ethernet network are translated to InfiniBand partitions based on their subnet and VLAN information.
Transparent service mapping enables existing IP services to execute without modification. Clustered servers access Ethernet networks without explicitly selecting a physical Ethernet port on a gateway. What matters is that IP services are delivered seamlessly by the Server I/O fabric.
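The two-way association between partitions, subnets, and VLANs can be pictured as a pair of lookup tables. The following is a minimal sketch and not the gateway's actual data structure: the class names, the partition key values, and the choice to check the destination address of incoming frames against the subnet are all assumptions made for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PartitionMapping:
    partition_key: int  # InfiniBand partition key; example values are made up
    vlan_id: int
    subnet: ipaddress.IPv4Network


class GatewayMap:
    """Two-way lookup between InfiniBand partitions and (VLAN, subnet) pairs."""

    def __init__(self, mappings: Iterable[PartitionMapping]) -> None:
        mappings = list(mappings)
        self._by_partition = {m.partition_key: m for m in mappings}
        self._by_vlan = {m.vlan_id: m for m in mappings}

    def to_ethernet(self, partition_key: int) -> Optional[PartitionMapping]:
        """Outbound direction: find the VLAN and subnet for a partition."""
        return self._by_partition.get(partition_key)

    def to_infiniband(self, vlan_id: int, dst_ip: str) -> Optional[int]:
        """Inbound direction: find the partition for a tagged frame.

        Returns None when the VLAN is unknown or the destination address
        falls outside the subnet configured for that VLAN.
        """
        mapping = self._by_vlan.get(vlan_id)
        if mapping is None:
            return None
        if ipaddress.ip_address(dst_ip) not in mapping.subnet:
            return None
        return mapping.partition_key


if __name__ == "__main__":
    table = GatewayMap(
        [PartitionMapping(0x8001, 10, ipaddress.ip_network("10.0.10.0/24"))]
    )
    print(table.to_ethernet(0x8001))
    print(table.to_infiniband(10, "10.0.10.7"))
```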
BANDWIDTH SCALING
A cluster's Ethernet bandwidth must be scalable to provide the necessary level of IP services to applications. Scalability techniques differ based on how a network is configured. The 3000 Series InfiniBand to Ethernet Gateway scales in two ways. Link aggregation provides a higher-bandwidth logical Ethernet port into a single IP subnet or VLAN. Adding multiple gateways provides Ethernet connectivity to more subnets and VLANs.
Link Aggregation
The InfiniBand to Ethernet gateway supports the IEEE 802.3ad Link Aggregation Control Protocol (LACP). Multiple Gigabit Ethernet ports on the same gateway module may be combined into a single link aggregation group. The gateway distributes outgoing Ethernet traffic across all ports in the aggregation group. In effect, this creates a single logical port with higher bandwidth. Aggregated ports connect to the same IP subnet or are assigned to the same Ethernet VLAN.
The gateway automatically handles changes to the aggregation group such as dynamically adding or removing a port. A new port may be assigned to a link aggregation group on the same gateway. The gateway interoperates with Cisco Ethernet switches on the other end of the link to rebalance traffic across the expanded aggregation group. When a port in an aggregation group is disconnected or fails, the gateway automatically stops sending Ethernet traffic across this port.
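The behavior described above, spreading traffic across the group and reacting to port changes, can be sketched in a few lines. This is an illustration only: the paper does not state how the gateway chooses a port for a given frame, so the hash of source and destination MAC addresses used here is an assumption, as are the class and port names.

```python
import zlib
from typing import Iterable, List


class AggregationGroup:
    """A link aggregation group that spreads flows across its active ports."""

    def __init__(self, ports: Iterable[str]) -> None:
        self._ports: List[str] = list(ports)

    @property
    def ports(self) -> List[str]:
        return list(self._ports)

    def add_port(self, port: str) -> None:
        """Add a port; later selections include it automatically."""
        if port not in self._ports:
            self._ports.append(port)

    def remove_port(self, port: str) -> None:
        """Drop a failed or disconnected port so no traffic is sent to it."""
        if port in self._ports:
            self._ports.remove(port)

    def select_port(self, src_mac: str, dst_mac: str) -> str:
        """Pick a port for a flow; the same flow maps to the same port
        for as long as group membership is unchanged."""
        if not self._ports:
            raise RuntimeError("aggregation group has no active ports")
        key = f"{src_mac}->{dst_mac}".encode("ascii")
        return self._ports[zlib.crc32(key) % len(self._ports)]


if __name__ == "__main__":
    group = AggregationGroup(["ge1", "ge2", "ge3"])
    flow = ("00:11:22:33:44:55", "66:77:88:99:aa:bb")
    print(group.select_port(*flow))
    group.remove_port("ge2")  # simulate a link failure
    print(group.select_port(*flow))
```

Hashing per flow keeps frames of one conversation on one link, which avoids reordering, at the cost of uneven load when a few flows dominate.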
Multiple Gateways
When a cluster is attached to a larger Ethernet network, multiple IP subnets or VLANs are typically used to segment and manage the network. The system administrator can add multiple gateways to the server fabric. New gateways may be added to the Server I/O fabric with partitions that correspond to the IP subnets and VLANs. A single gateway may connect to multiple subnets and VLANs, or it may be dedicated to a single subnet or VLAN for greater bandwidth.
Traffic can be load balanced across gateways and across switches using source, destination, MAC, or round-robin algorithms, in either active/active or active/passive topologies.
VIRTUALIZED FIBRE CHANNEL
Using the Transparent Topology Architecture, InfiniBand-attached hosts appear direct-attached to the Fibre Channel storage area network. Administrators install gateways to translate between Fibre Channel Protocol (FCP) on the SAN and SCSI RDMA Protocol (SRP) on the InfiniBand network. The server switch centrally manages storage connections and provides multiple paths through the InfiniBand network for load balancing and redundancy. Server administrators then load the SRP SCSI driver on each host, and each server is assigned a unique World-Wide Node Name (WWNN). Because each server is uniquely discoverable, administrators can also run multipathing software on the host (e.g., EMC PowerPath, Veritas DMP). Unique naming also allows zoning and host-based access controls to work seamlessly.
The SCSI RDMA Protocol (SRP) is a standard for encapsulating SCSI over InfiniBand. SRP was standardized by the INCITS T10 committee, the same committee that is responsible for the overall SCSI family of standards.
Cisco's 3000 Series InfiniBand to Fibre Channel gateway uses SCSI RDMA to extract SCSI commands from InfiniBand and transmit them via Fibre Channel. The gateway performs all SCSI command processing on behalf of the server and communicates with SAN storage devices.
Cisco's gateway architecture provides virtualized Fibre Channel bandwidth through three mechanisms:
• SAN resource discovery with InfiniBand device management
• Service mapping via automatic session management
• Bandwidth scaling with multiple gateways and load balancing across gateway ports
RESOURCE DISCOVERY
Resource discovery is a central aspect of SAN operation. The Fibre Channel name server maintains registration information for all ports and devices connected to the SAN. When a new device logs into a Fibre Channel switch, it queries the name server for available resources. The device also registers with the name server so that other devices can learn of its presence.
Cisco's Fibre Channel gateway interacts with the name server by issuing query and registration requests and by accepting state change notifications. The name server sends a state change notification when a new device logs on to the SAN.
The gateway maintains a table of SAN resources and passes this information to clustered servers via InfiniBand's standard device management protocol. Servers detect storage resources as if they are directly attached to the SAN.
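The discovery flow described in this section (query, register, then track state change notifications) can be modeled with a small publish/subscribe sketch. It is a toy model under stated assumptions: the class names, method names, and example World-Wide Names are hypothetical, and real Fibre Channel name server exchanges carry far more detail than a name and a device type.

```python
from typing import Callable, Dict, List


class NameServer:
    """Toy model of a Fibre Channel name server registry."""

    def __init__(self) -> None:
        self._devices: Dict[str, str] = {}
        self._listeners: List[Callable[[str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        """Ask to be notified when a new device registers."""
        self._listeners.append(callback)

    def query(self) -> Dict[str, str]:
        """Return a snapshot of all currently registered devices."""
        return dict(self._devices)

    def register(self, wwn: str, kind: str) -> None:
        """Record a device and send a state change notification."""
        self._devices[wwn] = kind
        for callback in list(self._listeners):
            callback(wwn, kind)


class FibreChannelGateway:
    """Keeps a table of SAN resources to pass on to clustered servers."""

    def __init__(self, name_server: NameServer) -> None:
        self.san_table: Dict[str, str] = name_server.query()
        name_server.subscribe(self._on_state_change)

    def _on_state_change(self, wwn: str, kind: str) -> None:
        self.san_table[wwn] = kind

    def advertise(self) -> List[str]:
        """Names of storage devices the gateway would expose to servers."""
        return sorted(w for w, k in self.san_table.items() if k == "storage")


if __name__ == "__main__":
    ns = NameServer()
    ns.register("50:06:01:60:00:00:00:01", "storage")  # made-up WWN
    gw = FibreChannelGateway(ns)
    ns.register("50:06:01:60:00:00:00:02", "storage")  # arrives after login
    print(gw.advertise())
```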
SERVICE MAPPING
The 3000 Series InfiniBand to Fibre Channel Gateway maps SCSI services to Fibre Channel ports by dynamically assigning sessions. The SCSI RDMA Protocol is session-oriented. Each session binds a single server to a single storage device. A session is created when a server logs in to a storage device after performing resource discovery.
The server issues a login using InfiniBand's standard connection management protocol. The connection manager (CM) dynamically assigns the session to a particular gateway based on gateway utilization and access controls. Gateway utilization enables the CM to select the gateway that is being used least, thereby spreading the cluster's storage load across gateways. Access controls enable the system administrator to partition storage services and servers according to individual system requirements. Service mapping extends to the port level in each gateway. The administrator may enable some or all gateway ports for each server and storage device.
Once session mapping occurs at login, all transfers between that server and storage device occur via the selected gateway and ports. The mapping is transparent to the server and application.
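The login-time assignment described above can be sketched as a least-utilized selection filtered by access controls. This is a simplified illustration, not the connection manager's implementation: utilization is approximated by the number of sessions on each gateway, and all class, field, and device names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass
class FcGateway:
    name: str
    allowed_servers: FrozenSet[str]  # access control list for this gateway
    sessions: int = 0  # stand-in for gateway utilization


class ConnectionManager:
    """Assigns each server-to-storage session to one gateway at login."""

    def __init__(self, gateways: Iterable[FcGateway]) -> None:
        self._gateways: List[FcGateway] = list(gateways)
        self._sessions: Dict[Tuple[str, str], str] = {}

    def login(self, server: str, target: str) -> str:
        """Return the gateway name for this session, creating it if needed."""
        key = (server, target)
        if key in self._sessions:
            # The mapping is fixed at login and reused for later transfers.
            return self._sessions[key]
        candidates = [g for g in self._gateways if server in g.allowed_servers]
        if not candidates:
            raise PermissionError(f"{server} may not use any gateway")
        chosen = min(candidates, key=lambda g: g.sessions)
        chosen.sessions += 1
        self._sessions[key] = chosen.name
        return chosen.name


if __name__ == "__main__":
    servers = frozenset({"srv1", "srv2"})
    cm = ConnectionManager([FcGateway("gw-a", servers), FcGateway("gw-b", servers)])
    print(cm.login("srv1", "lun0"))
    print(cm.login("srv2", "lun0"))
```

A production connection manager would presumably weigh measured bandwidth rather than session counts; session counts keep the sketch short.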
BANDWIDTH SCALING
The 3000 Series InfiniBand to Fibre Channel Gateway scales storage bandwidth by using multiple ports on each gateway and by installing multiple gateways in the Server I/O fabric. Dynamic load balancing operates across the ports on a single gateway to deliver efficient utilization of all ports. Connection management spreads server-to-storage sessions across multiple gateways to create even traffic distribution across the server fabric. This makes it possible to hot-plug additional bandwidth without powering down a server. For example, when a new gateway is added, existing servers are not affected, and administrators have the option of redistributing load across all the gateways, including the newly inserted expansion modules.
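The optional redistribution step can be sketched as moving sessions from the busiest gateway to the least busy one until the counts differ by at most one. This is one possible policy chosen for illustration, not a description of how the server switch redistributes load; the function name and the session and gateway labels are hypothetical.

```python
from typing import Dict, List


def rebalance(assignments: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy of the gateway-to-sessions map with sessions evened out.

    Sessions move one at a time from the most loaded gateway to the least
    loaded one, so sessions that do not need to move stay where they are.
    """
    result = {gateway: list(sessions) for gateway, sessions in assignments.items()}
    if not result:
        return result
    while True:
        busiest = max(result, key=lambda g: len(result[g]))
        idlest = min(result, key=lambda g: len(result[g]))
        if len(result[busiest]) - len(result[idlest]) <= 1:
            return result
        result[idlest].append(result[busiest].pop())


if __name__ == "__main__":
    before = {
        "gw1": ["s1", "s2", "s3"],
        "gw2": ["s4", "s5", "s6"],
        "gw3": [],  # newly inserted gateway with no sessions yet
    }
    print(rebalance(before))
```

Leaving the input unchanged and returning a new map mirrors the text's point that redistribution is an option the administrator exercises, not something a new gateway forces on existing sessions.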
CISCO'S SCALABLE, SEAMLESS I/O VIRTUALIZATION
This paper has explained how Cisco delivers scalable and seamless virtual I/O for server fabrics and grids. I/O resources are separated from the server chassis and aggregated to form shared resource pools that are managed using industry standards. Each server has access to I/O resources in one or more server switches, which allows administrators to build I/O more intelligently, without overprovisioning for peak load per server.
With Virtual I/O, existing applications and management packages work without changes. Adding and managing I/O bandwidth is dynamic and transparent to the applications. IP and storage drivers interact with applications and the operating system the same way they did without Virtual I/O, but with the added advantage of centralized management and scale.
Virtual I/O also builds a foundation for server virtualization. Once the I/O infrastructure is virtualized and can be moved independently of the server, the server infrastructure itself becomes simpler and more modular. In combination with Cisco VFrame Server Virtualization, servers can be dynamically assigned and unassigned, enabling a new model of deploying diskless servers as peripherals to the network.