The range of eventualities to be considered in planning for business continuity has grown even wider in recent months. Given the growing use of Internet-based business applications, disaster recovery planning must give considerable attention to the wide-area network (WAN) environment, as well as to traditional topics such as redundant storage of data. Enterprise planners can take advantage of Cisco® AVVID (Architecture for Voice, Video and Integrated Data) to increase network and application resiliency, and must understand the technologies and WAN services that network service providers offer.
INTRODUCTION
Recent events have underscored the importance of disaster recovery plans, and made clear the need for a much broader scope than these plans have historically encompassed.
When mainframe computers were the norm, backup computer centers were the main area of focus for disaster recovery. Today an increasing number of companies conduct e-business, with many internal employee functions handled on intranets and relationships with key customers and suppliers managed over extranets. This Internet-centric approach places new demands on the information technology (IT) infrastructure, which must now operate 24 hours a day, 7 days a week, and not just internally.
The IT resources many businesses rely on today include not only the data center, but also the company's LAN infrastructure and the WAN interconnecting all of the offices, applications, and users. The successful operation of the business depends on the continuity of all of these elements. In a survey of readers, however, Information Week magazine recently noted that although 50 percent of respondents had formal plans for backing up and recovering data, less than half of the plans addressed continuity of critical distributed business processes, key supplier extranets, and so on.
The inability of users to access needed data poses as great a problem as the loss of the data itself. Because the goal of disaster recovery planning is the continuity of the business, effective planning must include business protection (guarding against attacks, viruses, worms, and so on), and business agility (decentralizing and mobilizing resources for maximum productivity under all circumstances).
Gartner Group defines a resilient business as one that can "bounce back from any kind of setback, whether a natural disaster, a hostile economic change, a competitive onslaught, cyber-espionage, or a terrorist attack." In a January 28, 2002 report, Gartner stated that only about a third of the enterprises that they had surveyed had plans in place to cope with a complete loss of physical assets and employee workspace.
This paper provides an overview of disaster recovery planning, with an emphasis on the WAN aspects of the plan.
What Would You Do If Your...
· Headquarters, data center, and main PBX were destroyed?
· Network supporting 5000 desktops and servers was ruined?
· Branch offices in 45 cities were cut off from mission-critical applications?
After such a disaster, Lehman Bros. reopened for business one day later thanks to ...
· Decentralized data centers with synchronized data files connected by a metropolitan-area network
· Branch office connectivity re-established via Internet virtual private network (VPN) connections
· Instant offices in hotel rooms linked to the corporate network via access VPN technology
· Voice traffic rerouted over IP to alternate public network gateways in distant cities
WHAT CONSTITUTES A DISASTER?
Before addressing the key elements of a disaster recovery plan, it is worth defining what constitutes a disaster. Table 1 shows the most frequent causes of declared disasters.
Table 1. Most Frequent Causes of Declared Disasters
1. Fire
2. Storm (tornado, hurricane, etc.)
3. Flood or other water-related
4. Extremely high or low temperature, humidity, etc.
5. Earthquake, mudslide, or other land movement
6. Automobile or airplane crash
Fires are the most frequent cause of disaster. The U.S. National Fire Protection Association reports that 43 percent of companies never resume business following a major fire, and 78 percent of companies experiencing such a disaster are out of business within 3 years.
Other causes of disaster can also produce major losses to business. In 1992, a flood in the Chicago area shut off electric power to more than 200 buildings in the downtown Loop area for over a week, affecting as many as 10,000 companies. In 1998, a severe ice storm in Quebec, Canada caused business losses estimated at over US$1 billion.
Fortunately, an Information Week Magazine reader survey conducted in early 2002 showed 55 percent of respondents were planning an increase in information security spending in 2002, and almost 30 percent planned an increase in spending for business continuity planning or preparations.
A COMPREHENSIVE PLAN
In any type of disaster, there are various types of losses to consider:
· Physical facilities (destroyed buildings, work spaces, computers, inventory)
· Access to facilities (condemned buildings)
· Information (corrupt disk drives, damaged computers)
· Access to information (no remote database access)
· People (production, support, managers)
The comprehensive disaster recovery plan must address everything necessary to support the ongoing successful operation of the business. This means that every physical element, every software element, every human resources element, and every business process must be studied and addressed, and the acceptable degree of risk determined for each. Financial and operations issues must be included. Effective plans address all potential disasters, from acts of nature to terrorist events to cyber-disasters. (See Appendix B for information on preparing for and managing cyber-disasters.) In addition, the transition to the planned "backup" mode must be considered.
A "supply chain" analysis is a useful technique to address recovery of the physical assets of the company. This part of the plan should address how to deal with unavailable manufacturing and storage facilities, order entry systems, shipping, accounts receivable and payable systems, spare parts, and customer service. Time is a key element as well. Gartner Group recently recommended that companies reduce their recovery time for critical processes and applications to 24 hours or less, and for non-critical applications to four days.
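The recovery-time targets cited above can be turned into a simple check against a process inventory. The following is a minimal Python sketch, not a prescribed method: the process names and recovery estimates are hypothetical, and only the two thresholds (24 hours for critical processes, four days for non-critical ones) come from the Gartner recommendation quoted above.

```python
from dataclasses import dataclass

# Targets from the Gartner recommendation cited in the text.
CRITICAL_TARGET_HOURS = 24
NON_CRITICAL_TARGET_HOURS = 4 * 24  # four days


@dataclass
class BusinessProcess:
    name: str                        # hypothetical process name
    critical: bool                   # classified as critical by the planner
    estimated_recovery_hours: float  # planner's estimate, not a measured value


def target_hours(process: BusinessProcess) -> int:
    """Return the recovery-time target that applies to this process."""
    return CRITICAL_TARGET_HOURS if process.critical else NON_CRITICAL_TARGET_HOURS


def processes_missing_target(processes):
    """List the processes whose estimated recovery time exceeds the target."""
    return [p.name for p in processes if p.estimated_recovery_hours > target_hours(p)]


# Illustrative inventory (all names and numbers are assumptions).
inventory = [
    BusinessProcess("order entry", critical=True, estimated_recovery_hours=36),
    BusinessProcess("customer service", critical=True, estimated_recovery_hours=12),
    BusinessProcess("internal training", critical=False, estimated_recovery_hours=72),
]

print(processes_missing_target(inventory))
```

A list like this gives planners a starting point for deciding where redundancy, contracted capacity, or insurance is most needed.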
Three types of solutions should be considered as part of the planning process. A company can (1) build its own redundancy (for example, have two separate factories, each running at partial capacity), (2) contract in advance for emergency-use capacity (for example, a hot-site data center owned and operated by a disaster recovery services company), or (3) carry insurance to cover the costs likely to be incurred in responding to a disaster (for example, to cover costs of leasing facilities or purchasing products or parts to meet emergency needs). For most companies, no single approach is best; a combination of these three broad strategies is most effective.
Key equipment vendors are an essential part of any plan. Ensure vendors have adequate parts, staff, and financial resources to help out quickly in case of a major disaster.
Disaster recovery plans must be communicated and practiced to be effective. Communicating with employees before a disaster strikes is critical, as is practicing emergency procedures. A Citigroup leader was quoted in Information Week late in 2001 saying, "If you don't get this right, nothing else is going to matter because you're just going to have chaos in the company. Not being prepared will put your company out of business."
If developing an effective disaster recovery/business continuity plan seems daunting, especially if in-house expertise is limited, there are various forms and software tools and templates available from the commercial marketplace. Appendix A lists numerous resources to assist you.
THE IT PERSPECTIVE
The IT aspects of a comprehensive disaster recovery plan must cover network resilience, communications resilience, and business applications resilience.
A resilient network starts with an effective design and architecture, provides for mobility and security, and is built with platforms engineered for high availability. In design, redundancy eliminates single points of failure, while fast and automatic failover ensures quick recovery. Attention to traffic engineering, load balancing, and quality of service (QoS) helps the network handle poorly behaved or unexpected traffic loads, which can block access to business applications even when no failure is present.
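The effect of redundancy on availability can be estimated with simple arithmetic. The sketch below assumes that redundant paths fail independently of one another (an assumption that does not hold if the paths share a physical element), and the 99 percent single-path figure is purely illustrative, not a value from this paper.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(single_path_availability: float, paths: int) -> float:
    """Availability of `paths` redundant paths, assuming independent failures."""
    if not 0.0 <= single_path_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if paths < 1:
        raise ValueError("at least one path is required")
    # The service is down only when every path is down at the same time.
    return 1.0 - (1.0 - single_path_availability) ** paths


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of outage per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


for paths in (1, 2):
    a = parallel_availability(0.99, paths)  # 0.99 is an illustrative figure
    print(f"{paths} path(s): availability {a:.4%}, "
          f"about {expected_downtime_hours(a):.1f} hours of downtime per year")
```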
The communications perspective considers voice and PBX traffic, as well as data flows. IP Telephony can be either the primary or the backup mode of voice communications, while IP contact centers increase the firm's ability to maintain contact with key customers and suppliers. Recent press reports have noted instances where voice-over-IP links were the only means of communication when PBX systems and telephone exchanges were out of service. An IP-based voice communications network enhances mobility, and facilitates the rapid relocation of employees, whether to pre-planned backup locations or to "instant offices" in conference centers and hotel rooms.
At the applications level, important business applications must remain available, and the rapid recovery of critical corporate and customer information is essential. This is where backup data centers, as well as off-site data backup and storage capabilities, are required. (The WAN aspects of connecting the data centers, as well as the end users, are addressed later in this paper.)
To be successful, the IT aspects of a disaster recovery plan must cover considerably more than the company's data center. At a minimum, an effective plan addresses:
· Data center environments, including servers, storage, power, and HVAC
· User environments (PCs, LANs, application and client software)
· External communications facilities (service provider services and circuits)
· Operations (operations centers, help desks, specialized skills)
In a report about disaster recovery following the September 11th events in New York City, Comdisco, a disaster services company, noted, "the lion's share of the recovery effort was actually felt at the business end user level, the end points of computing. Often these business end user environments did not have a similar level of contingency planning" as the data centers enjoyed.
In addition to the IT considerations listed above, plans should take into account that usage levels of e-mail, Websites, telephones, and tie lines in the period immediately after a disaster are likely to be much higher than normal. In addition, the usual patterns of network traffic will probably vary, due to the new locations involved.
The plan should also identify sources for the many miscellaneous services that are likely to be needed immediately following a disaster, in addition to the replacement of physical assets lost. These might include:
· Guard and security services
· Debris removal and cleanup
· Pumping of water and associated cleaning
· Cleaning and decontaminating HVAC systems, ducting, etc.
· Data recovery from damaged media
· Catering services for employees
Network complexity can make it difficult to create business resilience. In developing a plan for business continuity, protection, and agility, the less complexity, the better. Minimizing the number of vendors providing equipment and removing unused older equipment are key steps in achieving this network simplicity.
WAN CONSIDERATIONS
Modern companies rely upon network communications for critical aspects of their business, and both LAN and WAN environments must be in place for employees to carry out their responsibilities. The components of the disaster recovery plan that address workspaces should be sure to include the equipment necessary for the LANs and for WAN access.
Keeping WANs available to support the business means starting with highly available, fault-tolerant systems and platforms, using the most reliable and resilient software possible, using careful network design, and following best practices from design through to daily operations.
A successful WAN design doesn't just focus on connectivity. One of the principles of business resiliency is the idea of distributing people and information assets as a means of reducing risk. Call centers need not be centralized, and data can be replicated, while all employees retain access to critical business applications such as order entry and customer service. A resilient WAN design needs to incorporate redundancy to eliminate single points of failure, traffic load balancing to ensure continuous service and acceptable response, fast automatic failover for quick recovery, and of course security measures appropriate to each situation.
There are several challenges. There is a significant difference between the bandwidth practically available in the LAN or campus environment and that which is economically available from service providers. While T1 lines are reasonably priced and widely available, a major budget increase is required for T3 or OC-3 services, for example. Some of the newer services available in metropolitan areas, typically based on fiber optic technologies, significantly improve this situation, with recent offers as low as US$1000 per month for 100 Mbps Fast Ethernet service. Of course, designing the WAN to carry voice, data, and video traffic simultaneously, using QoS techniques, is an effective cost-reducing approach.
WAN requirements planning needs to encompass connecting existing data centers with existing employee work locations, existing data centers with backup employee work locations, and backup data centers with both existing and backup employee work locations. If offsite data storage is provided at different locations, say at a storage service provider's location, this connectivity must be included as well. All scenarios must contain provision for connections to the Internet, both for general use, and for the operation of extranets to key suppliers and important customers.
Regardless of the network design and technology choices, it is vitally important to have different physical routes for facilities and circuits. Many a supposedly redundant network has failed because both of the fibers or circuits went through the same conduit, manhole, or central office.
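Checking for shared physical elements is easy to automate once the route of each circuit is known. The following is a minimal sketch, assuming the service provider can supply a list of the conduits, manholes, and central offices each circuit traverses; the circuit names and element identifiers are hypothetical.

```python
# Hypothetical circuit records: each circuit lists the physical elements it
# traverses (conduits, manholes, central offices). All identifiers are made up.
circuits = {
    "primary": {"conduit-A", "manhole-17", "central-office-north"},
    "backup": {"conduit-B", "manhole-17", "central-office-south"},
}


def shared_elements(route_a: set, route_b: set) -> set:
    """Return the physical elements that both routes depend on."""
    return route_a & route_b


def is_physically_diverse(route_a: set, route_b: set) -> bool:
    """Two routes are diverse only if they have no physical element in common."""
    return not shared_elements(route_a, route_b)


common = shared_elements(circuits["primary"], circuits["backup"])
if common:
    print("Not diverse; shared elements:", sorted(common))
else:
    print("Routes are physically diverse")
```

In this illustrative data the two circuits share a manhole, so a single incident at that point would take down both.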
The choices for creating the WAN generally fall into three categories. Each has advantages and disadvantages, and requires differing amounts of work (and equipment) from the enterprise user. The options (Figure 1) are to:
· Build it yourself, using leased lines (or "pipes") such as T1, T3, SONET, or fiber wavelengths provided by carriers or other network service providers to construct point-to-point circuits
· Use Frame Relay (FR) or ATM services, which provide circuit-oriented "virtual" pipes from site to site
· Use high-level connectionless optical network services, such as metro Ethernet or metro IP
Figure 1. WAN Procurement Alternatives
In the first option, the most common approach is to contract for SONET/SDH (Synchronous Optical Network/Synchronous Digital Hierarchy) channels. Frequently used SONET transmission levels are OC-3 (155 megabits per second, or Mb/s), OC-12 (622 Mb/s), and OC-48 (2488 Mb/s, but often referred to as 2.4 Gb/s). Table 2 below illustrates the SONET/SDH hierarchy.
A variation on this approach is to either lease "dark" fiber, or to contract for a wavelength or two on a service provider's fiber network, and to install next-generation SONET/SDH equipment (such as a Cisco ONS 15454) at each location. Dark fiber, sometimes called un-lit fiber, is a strand of fiber with no electronic equipment on the ends. This approach is effective in certain metropolitan areas where fiber is readily available at attractive prices and the Enterprise IT staff has the needed expertise. Wavelength services are also fiber-based, but include the provision of wavelength division multiplexing (WDM) equipment by the carrier.
Table 2. The SONET/SDH Hierarchy
Fiber Optic Signal   Synchronous Transport   SONET/SDH Line   Equivalent Channels
(OC Level)           Signal (STS Level)      Rate (Mbps)      DS3s    DS1s    DS0s
OC-3                 STS-3                   155.52           3       84      2016
OC-12                STS-12                  622.08           12      336     8064
OC-48                STS-48                  2488.32          48      1344    32256
OC-96                STS-96                  4976.64          96      2688    64512
Note: STS-1 (or OC-1) at 51.84 Mbps is only used inside equipment such as multiplexors, etc.
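The entries in Table 2 all follow from the STS-1 rate given in the note: an OC-n signal runs at n times 51.84 Mbps and carries n DS3-equivalent channels. The sketch below reproduces that arithmetic using the standard digital hierarchy ratios (28 DS1s per DS3, 24 DS0s per DS1); the function name and the dictionary layout are illustrative choices only.

```python
STS1_KBPS = 51_840   # STS-1 / OC-1 line rate, 51.84 Mbps, held in kbps
DS1_PER_DS3 = 28     # a DS3 carries 28 DS1s
DS0_PER_DS1 = 24     # a DS1 carries 24 DS0s


def sonet_level(n: int) -> dict:
    """Return the line rate and equivalent channel counts for OC-n / STS-n."""
    if n < 1:
        raise ValueError("OC level must be a positive integer")
    ds3 = n  # one DS3-equivalent channel per STS-1
    ds1 = ds3 * DS1_PER_DS3
    return {
        "level": f"OC-{n}",
        "line_rate_mbps": n * STS1_KBPS / 1000,
        "ds3": ds3,
        "ds1": ds1,
        "ds0": ds1 * DS0_PER_DS1,
    }


for n in (3, 12, 48):
    print(sonet_level(n))
```

Keeping the base rate as an integer number of kbps avoids floating-point rounding noise when the rate is scaled.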
In a disaster recovery scenario, multiple data centers can be easily linked. The equipment and services used will depend on the particular application requirements (for example, synchronous mirroring or remote tape backup).
The Cisco 15530 and 15540 platforms are particularly useful in these types of applications, as they support the protocols required in solutions offered by leading storage systems vendors such as IBM and EMC, for example Enterprise System Connection (ESCON), Sysplex External Timer Reference, Fibre Channel, Fibre Connection (FICON), Fiber Distributed Data Interface (FDDI), and Gigabit Ethernet.
Figure 2 illustrates this application, including an example of deployment of storage arrays from a Cisco solutions partner. In these situations, minimizing latency and complexity are key goals, and this solution provides adequate support.
Figure 2. Data Center Backup Implementation
As an alternative to a circuit-oriented approach, by using products like the Cisco 10720 Series Internet routers at each location, an enterprise can construct a metro IP network over dark fiber between multiple locations. This implementation typically runs over dual fiber rings connecting all sites, but optionally can operate over carrier-provided SONET-based leased lines or wavelength services. This approach is optimized for applications such as IP multicast, for use in, for example, an internal staff training application.
Although SONET is known for its fault-detection and traffic re-routing abilities, the Dynamic Packet Transport (DPT) capabilities of the Cisco 10720 Series provide similar functionality but with much greater efficiency. (Note: DPT is the Cisco technology for the emerging IEEE Resilient Packet Ring [RPR] standard.) For example, both SONET and RPR offer sub-50-millisecond failure detection times, but RPR can accommodate up to 254 nodes per ring, compared to SONET's maximum of 16 nodes per ring. Topology discovery is automatic with RPR, but requires manual effort in SONET. Bandwidth is provisioned dynamically in RPR, but manually in SONET. RPR offers up to eight classes of differentiated service, while SONET has no service-class awareness and thus provides only one level of service (Figure 3).
Figure 3. Metro Ethernet Using DPT/RPR
The best WAN designs provide a logical mesh between all sites, whether this is implemented physically via rings or by point-to-point circuits. In many metropolitan areas, fiber may connect several sites in a physical ring, but the WAN can still be designed as a logical mesh network. Protection in SONET-based physical rings is usually provided via a technique known as bidirectional line switched ring (BLSR). This can be implemented with two fibers around the physical ring, but using four fibers provides the ultimate in survivability with a logical mesh design. For further details on how this works, see the Cisco Application Note referenced in the Resources Section.
VIRTUAL "PIPES"
Traditional WAN services such as Frame Relay and ATM, sometimes called Layer 2 VPN services, are packet-switched, connection-oriented services that provide "logical" private-line-like services between two end-points via a permanent virtual circuit (PVC). They are ideal for hub-and-spoke site-to-site architectures.
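One reason PVC-based services suit hub-and-spoke designs is the number of circuits involved: connecting every site to a single hub needs one PVC per remote site, whereas a full logical mesh needs one PVC per pair of sites. The sketch below illustrates the arithmetic; the site counts are illustrative, and the function names are hypothetical.

```python
def hub_and_spoke_pvcs(sites: int) -> int:
    """PVCs needed when every remote site connects only to one hub site."""
    if sites < 1:
        raise ValueError("at least one site is required")
    return sites - 1


def full_mesh_pvcs(sites: int) -> int:
    """PVCs needed when every site has a direct circuit to every other site."""
    if sites < 1:
        raise ValueError("at least one site is required")
    return sites * (sites - 1) // 2


for sites in (5, 10, 46):  # illustrative site counts
    print(f"{sites} sites: hub-and-spoke {hub_and_spoke_pvcs(sites)} PVCs, "
          f"full mesh {full_mesh_pvcs(sites)} PVCs")
```

The mesh count grows with the square of the number of sites, which is why provisioning cost and operational complexity rise quickly as a meshed PVC network expands.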
Frame Relay (FR) services use PVCs that are capable of carrying variable length frames up to 4096 bytes per frame. FR provides multi-protocol LAN inter-connections for building private networks. Levels of performance (for example, bandwidth) can be set, and the security of PVCs is widely perceived as adequately strong.
ATM carries fixed-length (53-byte) cells, and is designed to support a wide variety of traffic, including native ATM, FR, Switched Multimegabit Data Service (SMDS), and circuit emulation. It can provide large amounts of bandwidth economically and on-demand. The asynchronous and multimedia characteristics of ATM permit both circuit and packet types of traffic to be carried simultaneously with complete transparency to the applications.
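The fixed cell size has a measurable cost when ATM carries variable-length frames, because each 53-byte cell includes a 5-byte header and the last cell of a frame is padded. The following sketch estimates how many cells a frame occupies. It is a simplification that ignores adaptation-layer overhead (for example, an AAL5 trailer), so real cell counts can be slightly higher.

```python
ATM_CELL_BYTES = 53      # fixed cell size, as stated in the text
ATM_HEADER_BYTES = 5     # standard ATM cell header
ATM_PAYLOAD_BYTES = ATM_CELL_BYTES - ATM_HEADER_BYTES  # 48 bytes of payload


def cells_for_frame(frame_bytes: int) -> int:
    """Number of ATM cells needed to carry a frame of the given size.

    Simplification: adaptation-layer overhead is ignored.
    """
    if frame_bytes < 0:
        raise ValueError("frame size cannot be negative")
    # Integer ceiling division: a partly filled final cell still counts.
    return (frame_bytes + ATM_PAYLOAD_BYTES - 1) // ATM_PAYLOAD_BYTES


def wire_efficiency(frame_bytes: int) -> float:
    """Fraction of transmitted bytes that are frame data."""
    cells = cells_for_frame(frame_bytes)
    return frame_bytes / (cells * ATM_CELL_BYTES) if cells else 0.0


frame = 4096  # the maximum Frame Relay frame size mentioned in the text
print(cells_for_frame(frame), "cells,", f"{wire_efficiency(frame):.1%} efficient")
```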
Both FR and ATM are connection-oriented services with simple demarcation points and are relatively easy to troubleshoot (Figure 4).
Figure 4. Traditional WAN Services
USING METRO SERVICES
A growing assortment of metropolitan-area network (MAN) services is now available, including the metro Ethernet or metro IP services offered by many service providers recommended by Cisco in the Cisco Powered Network program.
IP VPNs, also called Layer 3 VPNs, deliver enterprise-scale connectivity deployed on a shared infrastructure. IP VPNs allow the end customer to realize the cost advantages of a shared network while enjoying the same security, QoS, reliability, and manageability as they do in their own private networks. An IP VPN can be built using the Internet and IP security (IPSec) technology, or on a service provider's IP infrastructure using Multiprotocol Label Switching (MPLS) technology.
After deciding to sign up for a metro Ethernet or metro IP service, there are further options to consider. The network could be configured as a hub-and-spoke design, with different virtual LANs (VLANs) linking each remote office or work location to a central site (Figure 5). An Ethernet-based Transparent LAN service could be used, making the entire Enterprise in the metro area appear to be on a single shared Ethernet segment (Figure 6). Because a major factor in choosing between these options is scalability, understanding the growth plans of your company is essential.
Figure 5. Point-to-Point Metro Ethernet Design
Figure 6. Metro Area Transparent LAN Service
MAKING THE CHOICE
As already suggested, there is no "always best" approach to the WAN aspects of a disaster recovery plan. Best practices, however, include avoiding campus-wide VLANs, using Layer 3 as the demarcation point of choice, including redundancy as often as feasible, and using point-to-point links (or virtual circuits) whenever possible.
Ever-increasing difficulties in hiring and retaining qualified IT staff are pushing more and more companies toward outsourcing, and issues such as the expertise and location of the enterprise IT staff, as well as the available budget, are important criteria.
In any case, it is critically important for the enterprise to understand the technologies that the service provider is using. When enterprise networks are designed with Cisco products and follow Cisco AVVID, and when Cisco Powered Network-designated WAN services are utilized, the WAN aspects of the overall plan become both simpler and less labor intensive. A more resilient and cost-effective enterprise-wide network is the result.
APPENDIX B: DEALING WITH "CYBER DISASTERS"
Although most people immediately think of the physical sort of disaster, a cyber-disaster can be equally devastating to a company's operations. Such an event, whether caused by external hackers or by a disgruntled employee, can leave computers and other IT equipment in place, but effectively inoperable. As businesses rely more and more on networked applications such as e-commerce, supply chain management, and customer relationship management, these non-physical threats may pose a greater risk, and be more likely to occur, than fires or other traditional types of disasters.
Following are some ideas on defending against non-physical network threats:
· Develop a comprehensive information security policy and update it at least twice a year, or when the IT environment experiences significant change. Communicate it quarterly to all employees, and include it in new hire information packets.
· Assign a manager to be explicitly responsible for information security; tasks that "everyone" is responsible for often end up being handled by no one.
· Remember that not all network service providers take security equally seriously. (For example, WorldCom has a Director in charge of Business Continuity Solutions, and some other service providers have a similar emphasis, but most do not.) Service providers with the Cisco Powered Network-Managed Security Services designation have a particular emphasis on security, and have security-knowledgeable staff. Look for ISPs that have implemented key IETF recommendations in their network such as private address filtering (RFC 1918), filtering of IP source addresses (RFC 2827), and traffic-type rate limiting (again RFC 2827).
· Do not assume that you are not vulnerable because you don't do business in "trouble spots" around the world. A study by security firm Riptech found that 30 percent of attacks came from within the United States, with the next highest countries (South Korea, China, and Russia) much further down the list. Hardly any network attacks came from countries such as Iraq, Libya, Afghanistan, or Myanmar (Burma).
· When new equipment is installed, immediately review all manufacturer-furnished configurations and passwords for appropriateness to the corporate security policies, and change all default passwords.
· Be rigorous in reviewing and applying software patches and updates from vendors of installed equipment, as appropriate.
· Remove the accounts of all former employees and contractors immediately.
· Position any wireless access points away from exterior building walls to minimize the possibility of signal interception.
· Periodically (annually or perhaps twice a year) review data center security policies (for example, outside signage, visible badges, physical access controls, visitor logs and escorts, and video surveillance).
· Back up critical data regularly (monthly, daily, hourly, or in real time, depending on the business process the data supports) at off-site locations.
· Ensure that System Event Logging is on, at least at the minimum level, in all equipment.
· Install one or more intrusion detection systems (IDS), including both network-based and host-based varieties, and routinely analyze their output.
· Contract for a periodic (at least annually, and more often for networks with frequent installation or change activity) risk assessment and network vulnerability audit from "friendly hackers" (for example, the Cisco Security Posture Assessment).
· Test security plans and procedures with periodic drills so all staff have a clear understanding of what to do during a security violation, and reinforce training as required.
· Conduct background checks on all employees who have access to critical data.
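The private address filtering mentioned in the list above rests on a simple rule: packets arriving from the public Internet should never carry a source address from one of the three private blocks reserved by RFC 1918. The sketch below shows the test itself using the Python standard library. It illustrates the rule only and is not a router or firewall configuration; the function names are hypothetical.

```python
import ipaddress

# The three private address blocks defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address falls inside an RFC 1918 private block."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def should_drop_at_internet_edge(source_address: str) -> bool:
    """Private source addresses should never arrive from the public Internet."""
    return is_rfc1918(source_address)


for source in ("10.1.2.3", "172.32.0.1", "192.168.1.1", "8.8.8.8"):
    action = "drop" if should_drop_at_internet_edge(source) else "forward"
    print(f"{source}: {action}")
```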