Rapidly growing AQR Capital Management invested in a Cisco SAN to protect critical financial information and ensure disaster recovery.
BUSINESS CHALLENGE
Founded in 1998, AQR Capital Management is a dynamic investment management firm providing financial services to a variety of pensions and other funds for corporate clients. Today, the company oversees approximately $20 billion in assets. With more than 100 employees, AQR runs a research-based business model built on a partially automated global research, forecasting, and trading process, with products ranging from aggressive high-volatility hedge funds to low-volatility traditional products. The result is an enormous amount of data for the company's size: more than 6 TB and growing.
"The company started out in offices in New York, and it rapidly became apparent that we were going to outgrow our physical storage capacity," says Jerry Levine, Vice President of IT at AQR. "The company was depending entirely on standalone, host-attached RAID (Redundant Array of Independent Disks) technology for storage and backup, and I must have bought half a dozen RAID arrays just during my short time with the company in New York."
IT staff also found that they were spending far too much time trying to manage this increasingly complex storage system. "Traditional RAID arrays just do not scale well for technology administration, and they do not scale well for the data center," Levine says. "It is easy to get caught up in adding more and larger arrays to accommodate temporary projects, and end up with a lot of empty disk space." By relying on these RAID technologies, AQR saw its capital costs rise because of ongoing technology investment, and its operational costs rise because of growing administration requirements. "These arrays required much more monitoring and administration than was reasonable for us," he says. "They also generated too much heat and did not make efficient use of data center space."
The power blackout of 2003 had also served as a corporate wake-up call: AQR knew that it needed to improve its disaster recovery (DR) plan. However, reproducing its labyrinth of RAID arrays for a DR center was what Levine simply described as "an ugly solution." AQR seized the opportunity offered by its recent move to a larger facility in Greenwich, Connecticut, to start planning for a flexible storage area network (SAN) solution to simplify storage allocation and provisioning, and to support data replication at a new disaster recovery site.
NETWORK SOLUTION
AQR had recently selected Cisco Systems as its IP phone vendor, and this familiarity made Cisco an easy choice as its SAN switching vendor as well. Over the past few years, SAN architectures have become more modular and management software more user-friendly, making SANs a much better fit for small- to medium-sized businesses.
AQR's new data center SAN is based on two Cisco MDS 9216i multilayer fabric switches, which form a dual redundant SAN. Each switch provides 14 Fibre Channel and two Gigabit Ethernet fixed ports that can be expanded, with a variety of optional switching modules, to as many as 48 ports in a single device for maximum configuration flexibility.
The dual redundant SAN design and dual power supplies are critical for protecting AQR's data. "In our new headquarters, even the e-mail servers have redundant connections in themselves, two physically separate host bus adapters in the servers going to two separate switches," Levine says. "With all this redundancy, there is not a single point of failure here locally. In the event of any kind of disaster, all the data on our SAN at headquarters is being replicated to our DR site over a pair of OC-3 circuits. This essentially ensures the protection of our most precious commodity, our data."
At the DR site, an additional 9216i switch supports a SAN for real-time and weekly data replication. Replication is managed with EMC MirrorView software, with PowerPath to connect one host redundantly to the SAN in the event that a failover is required. AQR also uses EMC VMware with the Cisco-based SAN to replicate instances of each virtual server over to the DR site. This means that, with the flip of a switch, a server can be up and running in the DR site, adding a powerful recovery capability.
To help manage performance and throughput more effectively, AQR has deployed the SAN in a tiered architecture that prioritizes connection to the SAN in terms of information value. "If the server is mission-critical, we put two connections to the two separate switches in the SAN," Levine says. "In some cases, if it is not mission-critical, but needs to be directly attached to the SAN for performance and data protection, we have one connection. For low-end needs, we use a host-based model. I do not know if this is obvious to most people when they consider investing in a SAN, but not every single computer needs to be connected directly to the SAN fabric. One master server or cluster can be connected to the SAN and offer shares to other servers that may not have high data access performance needs. Understanding the requirements of your systems allows you to design customized SAN access that maximizes your investment."
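The tiering rule Levine describes can be written down as a small decision function. The following is only an illustrative sketch: the names `Tier`, `Server`, `assign_tier`, and `fabric_ports_needed` are hypothetical, and the two boolean flags are an assumed simplification of how AQR classifies its servers.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Connection tiers, paraphrasing the three cases in the quote (labels are assumptions)."""
    DUAL_FABRIC = "two connections, one to each SAN switch"
    SINGLE_FABRIC = "one direct connection to the SAN"
    HOST_BASED = "shares offered by a SAN-attached server or cluster"


@dataclass(frozen=True)
class Server:
    name: str
    mission_critical: bool
    needs_direct_san: bool  # needs SAN performance or data protection


def assign_tier(server: Server) -> Tier:
    """Pick the SAN connection tier for one server."""
    if server.mission_critical:
        return Tier.DUAL_FABRIC
    if server.needs_direct_san:
        return Tier.SINGLE_FABRIC
    return Tier.HOST_BASED


# Switch ports consumed per server in each tier.
PORTS_PER_TIER = {Tier.DUAL_FABRIC: 2, Tier.SINGLE_FABRIC: 1, Tier.HOST_BASED: 0}


def fabric_ports_needed(servers) -> int:
    """Total switch ports the given servers would occupy across the fabric."""
    return sum(PORTS_PER_TIER[assign_tier(s)] for s in servers)


if __name__ == "__main__":
    fleet = [
        Server("mail", mission_critical=True, needs_direct_san=True),
        Server("research-db", mission_critical=False, needs_direct_san=True),
        Server("print", mission_critical=False, needs_direct_san=False),
    ]
    for s in fleet:
        print(s.name, "->", assign_tier(s).name)
    print("ports needed:", fabric_ports_needed(fleet))
```

Counting ports this way shows why the tiered design stretches a fixed-size switch: only the servers that truly need redundancy consume two ports each.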
Levine recalls that he prepared a one-page financial analysis when requesting funds for the new SAN from AQR's principals. "One of them commented, 'We would be idiots not to do this.' That is what we were trying to show: that we will save a lot of money, a lot of effort, by centralizing it all. Best of all, we would build an infrastructure capable of scaling with the firm's anticipated growth."
"Initially there was apprehension about putting all our eggs in one basket. But through proper redundant and resilient design with the Cisco SAN solution, we concluded this was a significant improvement over the existing infrastructure and decided to move all our data on to it. As a result, we've achieved the highest level of data protection."
-Jerry Levine, Vice President, AQR Capital Management
BUSINESS RESULTS
With its new solution in place, AQR has recognized a number of important benefits to its operations. Storage is now much easier to provision, manage, and administer, and can be dynamically reallocated as needed. It is also more cost-effective. "With the SAN, we are able to buy and allocate just enough storage for what we need right now, and then easily add more later without having to invest in huge amounts of storage that will not be immediately used," Levine says. The SAN is also much more efficient for supporting failover clusters, which can be built by simply connecting the designated server pairs into the network without having to buy additional storage. The new SAN also:
• Ensures scalability for a rapidly growing company. AQR plans to significantly increase its number of employees in the next year and has doubled the size of its SAN to accommodate expansion.
• Reduces the cost of disaster recovery due to significant data consolidation at headquarters, with robust data protection.
• Reduces cost of centralized management. Cisco's any-to-any storage and server connectivity allows AQR to unify its UNIX, Linux, and Windows boxes into a single SAN, and repurpose the former RAID arrays for backup purposes only.
• Is far easier and faster to administer and manage, saving on additional staffing requirements. Storage can be dynamically allocated by project to set up testing environments, temporary databases, etc., ensuring the most efficient use of space.
• Improves throughput performance. AQR has increased data access speeds thanks to the 2-Gbps fiber connections from hosts to the SAN.
• Provides higher data availability, since volumes can be attached to any host, along with reduced LAN congestion due to backups being removed from the production environment. The LTO-3 tape library and backup servers are attached to the SAN, ensuring maximum backup performance.
• Makes far more efficient use of physical space than the traditional RAID array system.
NEXT STEPS
Due to corporate growth, AQR is already working to double the capacity of its new SAN. In an innovative deployment, AQR is also implementing Multiprotocol Label Switching across new Cisco Catalyst® 6500 switches in order to extend headquarters VLANs to its DR site. AQR is also excited about taking advantage of new technology in which the system regularly takes a logical quick picture of the pointers to all the data, providing an instantaneous backup of the system at a certain point in time. "We actually had a situation just recently where the database of an e-mail server became corrupted," Levine says. "Our systems administrators commented, 'If we were using the SAN and we had a picture that was only a few hours old, we could have reverted back to the picture, rolled in some log files and been up-to-date in moments.' Instead, the recovery took us much longer as we retrieved a tape from our offsite tape storage facility, restored the database, and applied a day's worth of logs. Today, all our e-mail servers are stored on the SAN. Redundancy without resiliency is just not good enough."
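The "quick picture of the pointers" idea, and the revert-then-roll-in-logs recovery the administrators describe, can be illustrated with a toy model. This is a sketch under stated assumptions, not how any EMC or Cisco product is implemented: the `Volume` class, its methods, and `replay_log` are hypothetical names, and a Python dictionary stands in for the block pointer table.

```python
class Volume:
    """Toy volume: a table mapping block numbers to immutable data blocks."""

    def __init__(self):
        self._blocks = {}      # block number -> bytes (the "pointers")
        self._snapshots = {}   # label -> saved pointer table

    def write(self, block_no: int, data: bytes) -> None:
        # A write repoints the block at new data; old data stays untouched,
        # so any snapshot that references it remains valid.
        self._blocks[block_no] = bytes(data)

    def read(self, block_no: int):
        return self._blocks.get(block_no)

    def snapshot(self, label: str) -> None:
        # Shallow copy: only the pointer table is duplicated, not the data,
        # which is why taking the picture is near-instantaneous.
        self._snapshots[label] = dict(self._blocks)

    def revert(self, label: str) -> None:
        # Restore the pointer table as it was when the picture was taken.
        self._blocks = dict(self._snapshots[label])


def replay_log(volume: Volume, log) -> None:
    """Roll forward by reapplying logged writes made after the snapshot."""
    for block_no, data in log:
        volume.write(block_no, data)


if __name__ == "__main__":
    vol = Volume()
    vol.write(0, b"mailbox index v1")
    vol.snapshot("few-hours-old")

    vol.write(0, b"\x00corrupted\x00")          # the database becomes corrupted
    vol.revert("few-hours-old")                  # revert to the picture
    replay_log(vol, [(1, b"message received after snapshot")])

    print(vol.read(0))
    print(vol.read(1))
```

The recovery path touches only the pointer table and a short log, which is the contrast Levine draws with retrieving a tape and applying a full day of logs.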
In an industry where technology change is sometimes slow, AQR has found itself blazing a new trail for its financial colleagues. "When we deployed our SAN, we were surprised to learn that our work attracted attention from other financial services groups," Levine says. "But it is actually very easy. You do not have to buy monolithic systems; you can start small, like we did, so it is not as complicated or expensive as some people think. A key to our success is having the staff to learn, implement, and manage our SAN infrastructure. This investment in internal resources is essential for providing the service levels that our internal and external customers demand."
FOR MORE INFORMATION
To find out more about Cisco SAN and storage solutions, go to: http://www.cisco.com