Disaster recovery (DR) is a management practice that addresses the contingencies that need to be in place prior to major negative events that could take down a company’s infrastructure, hindering or stopping operations altogether. Such disasters can include:

  • Natural disasters: earthquakes, hurricanes, tornadoes, floods, wildfires
  • Infrastructure failures: equipment malfunctions, building fires, power outages
  • IT security breaches: DDoS attacks, data theft, system infiltration, various passive and active cyber attacks

Key to disaster recovery efforts is the disaster recovery plan (DRP). A company’s DRP outlines all procedures for dealing with large-scale threats to its IT systems and employees. Departmental DR plans outline departmental DR procedures and are drafted in support of the master DRP. Employees must review and be trained in DRP procedures, and, like fire drills, those procedures must be practiced to be effective.

Disaster recovery metrics are integral to maintaining and improving baseline standards for recovery efforts. A company may use hundreds of key performance metrics (KPMs) to measure its disaster recovery efforts. Some of the most common include:

  • Mean Time Between Failures (MTBF) - The average operating time between one failure of a repairable device and the next.
  • Mean Time To Repair (MTTR) - The average time to repair a failed component.
  • Self-Monitoring, Analysis, and Reporting Technology (SMART) - A set of over 100 metrics used to monitor HDDs and predict their health status and likelihood of failure.
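
As a rough illustration of the first two metrics, the sketch below computes MTTR and MTBF from a list of outage records and derives availability with the common formula MTBF / (MTBF + MTTR). The `Incident` record format, the function names, and the sample data are assumptions made for this example, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One outage of a device (hypothetical record format)."""
    failed_at: datetime
    restored_at: datetime


def total_downtime(incidents: list[Incident]) -> timedelta:
    """Sum the time between failure and restoration over all incidents."""
    return sum((i.restored_at - i.failed_at for i in incidents), timedelta())


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time To Repair: average downtime per incident (needs >= 1 incident)."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: list[Incident], window_start: datetime, window_end: datetime) -> timedelta:
    """Mean Time Between Failures: uptime in the window divided by the failure count."""
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(mtbf_value: timedelta, mttr_value: timedelta) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_value / (mtbf_value + mttr_value)


# Made-up sample data: two outages in a 30-day observation window.
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 31)
incidents = [
    Incident(datetime(2024, 1, 5, 10), datetime(2024, 1, 5, 12)),   # 2 hours down
    Incident(datetime(2024, 1, 20, 8), datetime(2024, 1, 20, 12)),  # 4 hours down
]

repair_time = mttr(incidents)
time_between_failures = mtbf(incidents, window_start, window_end)
print("MTTR:", repair_time)
print("MTBF:", time_between_failures)
print("Availability:", round(availability(time_between_failures, repair_time), 4))
```

With this sample data the MTTR is 3 hours and the MTBF is 357 hours, which gives an availability of roughly 99.2%.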

Monitoring and reviewing KPMs systematically and over time can help teams better understand the nature of their IT systems, find bottlenecks, and make performance optimizations that can prevent potentially costly disasters later.

In the larger context of business continuity (BC), disaster recovery supports a company's efforts to keep its services available to users in two ways. First, replication and storage backups are deployed as data protection solutions that mitigate data loss and can quickly return data to its pre-disaster state. Second, disaster recovery planning and preparation help ensure that emergencies are resolved quickly and services return to normal.

For example, consider a data center provider serving several tenants that loses connectivity after an earthquake damages its facility. Having planned for such an event, the provider included a hot site in its business continuity plans: a duplicate data center, located in a different region, that can begin delivering services to users immediately.

In today’s cloud landscape, companies can choose to implement a disaster recovery software solution themselves or subscribe to disaster recovery as a service (DRaaS). Both options help companies rapidly and efficiently recover their applications, settings, and data to a state prior to the disaster. This is particularly important to digital businesses, where disaster recovery is key to infrastructure resilience.

Generally, disaster recovery is good business: it helps ensure that business investments are protected and that clients continue to have access to services. DR also helps providers honor their service level agreements (SLAs) by keeping potentially catastrophic problems from halting operations. When disaster recovery is planned and practiced, employees are better prepared to stay calm and work together to return operations to normal. Disaster recovery has the following specific benefits:

  • Minimizes Interruptions - Many threats can slow or even bring services down. DRPs prepare teams and reduce the negative impacts of interruptions by getting services back up quickly.
  • Prevents and Limits Damages - DRPs help teams think through potential threats and prepare accordingly. Timely and directed responses to disasters can limit damage and keep it from spreading to other critical areas.
  • Empowers and Builds Employee Confidence - Being prepared for an emergency helps reduce stress and makes for a higher-functioning team. Disaster recovery plans prepare and organize employees so that they know exactly what to do, and when, in response to a disaster.
  • Restores Service - DR aims to restore an organization to full service functionality after even the worst setbacks. Using a metrics-defined DRP, teams can systematically work towards anticipating threats and contingencies, and reduce their recovery time objective (RTO), the allowable time from service failure to restoration.

Business continuity (BC) and disaster recovery (DR) are two related but separate concepts that deal with keeping a service available.

Business continuity takes the broadest view of service availability and encompasses all the business systems that keep operations moving despite setbacks. Specifically, BC is a set of plans, procedures, and technologies employed to ensure the organization resolves incidents, technology failures, and errors quickly and effectively. From this viewpoint, BC is holistic and proactive.

Disaster recovery supports BC plans with contingencies for the likely and not-so-likely major events that can cripple a company’s services. Also a set of plans, procedures, and technologies, DR specifies the steps to take in an emergency and calls for controls to be in place for events that can be anticipated. DR is reactive, and is less likely to be invoked when BC efforts are done well.

Backup and recovery is a set of data practices that supports the mission of disaster recovery (DR). A backup is a copy of an organization’s data used to replace original data in the event of data loss.

Automatic backup solutions have lessened the burden of managing data storage and recovery, but the planning and design of backup operations still require attention. A proper disaster recovery plan (DRP) and data retention plan outline where and how data is to be stored, how many copies are kept and for how long, and how data is to be recovered. Other considerations include whether data is stored off-site, on-premise, or in the cloud. Choosing cloud backups can also lessen the management burden.
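
The questions a retention plan answers (where, how many copies, for how long) can be written down as data and checked automatically. The following is a minimal sketch under assumed names: `RetentionPolicy`, its fields, and `backups_to_delete` are hypothetical and do not correspond to any specific backup product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical retention plan for one data set."""
    min_copies: int              # newest copies that are always kept, whatever their age
    retention_days: int          # how long a backup is kept before it may be deleted
    locations: tuple[str, ...]   # where copies live, e.g. ("on-premise", "cloud")


def backups_to_delete(backup_dates: list[date], policy: RetentionPolicy, today: date) -> list[date]:
    """Return backups older than the retention period, excluding protected copies."""
    newest_first = sorted(backup_dates, reverse=True)
    protected = set(newest_first[:policy.min_copies])
    cutoff = today - timedelta(days=policy.retention_days)
    return [d for d in newest_first if d < cutoff and d not in protected]


# Made-up example: keep backups for 7 days, but never fewer than 2 copies.
policy = RetentionPolicy(min_copies=2, retention_days=7, locations=("on-premise", "cloud"))
backups = [date(2024, 3, 9), date(2024, 3, 1), date(2024, 2, 20), date(2024, 2, 10)]
expired = backups_to_delete(backups, policy, today=date(2024, 3, 10))
print(expired)
```

Here the 1 March backup is older than seven days but is retained because it is one of the two newest copies; only the two February backups are marked for deletion.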

An essential part of business continuity is the use of alternative sites. Using remote sites can provide more robust disaster recovery solutions for businesses. Multi-site business continuity plans rely on backup sites where companies can quickly relocate their IT infrastructure if the primary site goes dark. For this purpose there are three types of data backup sites an organization can put into place.

  1. Cold Sites - The least expensive of the backup site options, cold sites are essentially buildings with connectivity but without hardware. They provide a location to set up interim operations while the primary site is restored. This means operations at the new site need to be “warmed up”: hardware must be installed, configured, and brought online before the site can serve users.
  2. Hot Sites - Hot sites are essentially duplicates of production systems, running in parallel and waiting to take over should a disaster hobble the main system. These systems synchronize in real time, making them expensive yet highly resilient. For critical industries like healthcare and finance, hot sites are essential for business continuity.
  3. Warm Sites - Warm sites sit between hot and cold sites. They duplicate production systems on a schedule: to balance expense and administration, backups from the primary site to the warm site are performed on a cycle, perhaps daily or weekly. If disaster occurs, systems can be restored to the most recent backup.

Hot sites provide the fastest recovery time, while cold sites are the least expensive to set up initially (though potentially costly in time when responding to a disaster). Warm sites offer a middle ground. The choice is typically based on an organization's acceptable recovery time objective (RTO), the duration allowed between an outage and the restoration of main systems.
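
The trade-off between cost and recovery time can be sketched as a simple lookup: pick the cheapest site type whose activation time still fits within the RTO. The activation times below are placeholder assumptions chosen only to make the example run; real figures depend entirely on the organization and its providers.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class SiteType(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


# ASSUMED activation times, for illustration only.
ASSUMED_ACTIVATION_TIME = {
    SiteType.COLD: timedelta(days=7),     # hardware must be installed and configured
    SiteType.WARM: timedelta(hours=12),   # restore from the most recent scheduled backup
    SiteType.HOT: timedelta(minutes=5),   # already running in parallel
}

# Ordered from least to most expensive, as described in the text.
CHEAPEST_FIRST = [SiteType.COLD, SiteType.WARM, SiteType.HOT]


def cheapest_site_meeting_rto(rto: timedelta) -> Optional[SiteType]:
    """Return the least expensive site type that can be active within the RTO."""
    for site in CHEAPEST_FIRST:
        if ASSUMED_ACTIVATION_TIME[site] <= rto:
            return site
    return None  # no site type in this table is fast enough


print(cheapest_site_meeting_rto(timedelta(hours=1)))
print(cheapest_site_meeting_rto(timedelta(days=1)))
print(cheapest_site_meeting_rto(timedelta(days=30)))
```

Under these assumed numbers, a one-hour RTO can only be met by a hot site, a one-day RTO by a warm site, and a thirty-day RTO by a cold site.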

The overarching benefit of disaster recovery software is to provide efficient failover and recovery capabilities ensuring that downed services are quickly restored. Within this context, disaster recovery can take many forms, but the following beneficial features are typical.

  • Failover and recovery capabilities for on-site and cloud data
  • Unified local management of data restoration
  • Recovery of systems to backup points prior to failure
  • Integration with an ecosystem of business continuity solutions
  • Compatibility with on-premise, cloud, and virtual infrastructure

In addition to the benefits above, cloud-based solutions, namely disaster recovery as a service (DRaaS), bring benefits of their own. They include:

  • Time Savings - DRaaS frees up time otherwise spent administering DR tasks in-house, so IT teams can focus more on valuable core operations.
  • Reduced Costs - DR responsibilities offloaded to the cloud are reduced to a budget line item for consumers. Cloud providers can leverage economies of scale to help reduce fees even more.
  • Expertise at the Ready - Reputable DRaaS providers maintain expert staff 24/7, ready to support in-house teams in the event of disaster.
  • Rapid Recovery - DRaaS providers are practiced at recovery and can return services to normal, often within minutes.
  • Data Security - The technical requirements of duplicating and protecting data can be handled within the cloud, while providers typically offer strong physical and virtual security for their data centers.

An organization's disaster recovery plan (DRP) establishes its capabilities and readiness to cope with potential disasters. The DRP essentially defines how an organization will move to a secondary location and resume operations with different resources while primary resources are being restored. To this end, a DRP contains the following general sections:

  • Business Risk Evaluation
  • Impact Analysis
  • Roles and Responsibilities
  • Damage Assessment Process
  • Disaster Declaration Process
  • RTO and RPO Thresholds
  • Call Trees
  • Communication Protocols
  • Prearranged Communication Templates
  • Testing Methods
  • Training Methods
  • Impact/Probability Disaster Scenarios
  • Recovery Procedures
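
Of the sections listed, the call tree lends itself most directly to a data structure: each person is responsible for calling a small number of others, so notification fans out level by level. The sketch below walks such a tree breadth-first. The role names and the `notification_order` function are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical call tree: each key calls the people listed under it.
CALL_TREE = {
    "incident_commander": ["it_lead", "communications_lead"],
    "it_lead": ["network_engineer", "database_admin"],
    "communications_lead": ["customer_support"],
}


def notification_order(tree: dict[str, list[str]], root: str) -> list[tuple[str, str]]:
    """Return (caller, callee) pairs level by level, starting from the root."""
    calls = []
    seen = {root}          # guards against someone being listed twice
    queue = deque([root])
    while queue:
        caller = queue.popleft()
        for callee in tree.get(caller, []):
            if callee not in seen:
                seen.add(callee)
                calls.append((caller, callee))
                queue.append(callee)
    return calls


for caller, callee in notification_order(CALL_TREE, "incident_commander"):
    print(f"{caller} calls {callee}")
```

Keeping the tree as data rather than prose makes it easier to review during plan updates and to check that every role is reachable from whoever declares the disaster.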

Planning your disaster recovery strategy begins with answering essential questions. A DRP is a living document, and the answers should be reviewed periodically and adjusted as circumstances change. The following steps broadly outline the main points to consider.

Step 1. Major goals: Outline the major goals of the DRP.

Step 2. Personnel: Include a record of your data processing personnel.

Step 3. Application profile: List applications by criticality.

Step 4. Inventory profile: Create an inventory, including information on manufacturer, model, serial number, cost and ownership.

Step 5. Information services backup procedures: Explicit procedures on creating, managing, and maintaining backups.

Step 6. Disaster recovery procedures: DRPs must include explicit procedures for:

  • Step a. Emergency responses: appropriate emergency response to a fire, natural disaster, or any other emergency in order to protect lives and limit damages.
  • Step b. Backup operations: essential data processing operational tasks to be conducted after disruption.
  • Step c. Recovery actions: rapid restoration of data processing systems after disaster.

Step 7. DRP for mobile site: Explicit procedures addressing recovery of mobile sites.

Step 8. DRP for hot site: Explicit procedures for switching over to an alternative hot site.

Step 9. Restoring the entire system: Backup and recovery procedures.

Step 10. Rebuilding process: Damage assessment and the start of reconstruction of a new data center.

Step 11. Testing the disaster recovery and cyber recovery plan: Plans must be tested periodically.

Step 12. Disaster site rebuilding: Full specifications of site rebuild.

Step 13. Record of plan changes: Keep your DRP current.
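
Some of the steps above, such as the application profile in Step 3, amount to keeping a structured list. As a minimal sketch, the snippet below orders applications for recovery by criticality and then by RTO. The `Application` fields, the 1-to-3 criticality scale, and the sample entries are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Application:
    """One entry in a hypothetical application profile."""
    name: str
    criticality: int   # assumed scale: 1 = most critical, 3 = least critical
    rto: timedelta     # allowable time from service failure to restoration


def recovery_order(apps: list[Application]) -> list[Application]:
    """Most critical first; ties are broken by the shorter RTO."""
    return sorted(apps, key=lambda app: (app.criticality, app.rto))


# Made-up sample profile.
profile = [
    Application("internal-wiki", criticality=3, rto=timedelta(days=2)),
    Application("payments-api", criticality=1, rto=timedelta(minutes=15)),
    Application("customer-portal", criticality=1, rto=timedelta(hours=1)),
    Application("reporting", criticality=2, rto=timedelta(hours=8)),
]

for app in recovery_order(profile):
    print(app.criticality, app.name, app.rto)
```

The resulting order (payments-api, customer-portal, reporting, internal-wiki) is the sequence in which recovery procedures would be applied under these assumptions.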