A-S-S-U-M-E: we all know the cautionary tale of what happens when we assume. In the context of Disaster Recovery, making assumptions is both the easiest and the least advisable thing to do, and the most efficient way to derail an otherwise well-intentioned plan. As a recap, our last installment of Disaster Recovery: A Quick and Dirty Guide explored the key components of a DR plan, including cataloging resources, backup methodology, and technological considerations. In Part II, we’ll look at the most common (and damaging) assumptions made when building a DR Plan. Not only will we point out key missteps that can come back to bite you in Disaster Recovery, we’ll also cover some very simple steps you can take to avoid them. Much like preventative dental work, which takes a little bit of time every six months but will spare you a root canal in the end, we’ll look at DR plans as a series of measures that require attention to detail, but will be invaluable if and when your network goes down. Many thanks to TRUE engineer and Disaster Recovery expert Gary Noto, who has graciously shared his experience and knowledge for this installment.
Audit Asset Catalog and Acceptable Downtime
Jumping right in, Gary encourages us to review what’s been done to date. Looking at the initial steps of building a solid Disaster Recovery plan, the first two (and rather obvious) places a mistake can be made are simply omissions. The first is leaving one or more key components out of your asset catalog. Your DR plan should clearly list and define all the people, departments, functions, and systems that will be recovered. When you review your existing catalog, ask yourself: in looking at everything it takes to complete a day’s work, did you include all hardware? People? Roles? What about internet access? If a storm takes out your local internet provider, your employees can’t just pop on over to Starbucks to send a few emails. (In fact, do you really want them using an unsecured public connection?) So you’ll want to be inclusive in that list.
The second place for a disastrous assumption is in misjudging how much downtime your organization can accept. Did you take the time to really calculate the total cost to your organization per hour of downtime, and did you select a backup solution that is in line with your timing needs for getting back up and running? For more on how to determine your acceptable recovery time (i.e., how long your business can keep its doors open in the event of operational downtime), see Part I.
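As a rough illustration of that calculation, here is a minimal Python sketch. It is not the method from Part I; it simply assumes that an hour of downtime costs lost revenue plus idle payroll, and every figure and name in it is hypothetical.

```python
# Illustrative sketch only: every figure and name below is a made-up assumption.

def hourly_downtime_cost(lost_revenue_per_hour, idle_staff, loaded_hourly_wage):
    """Rough cost of one hour of downtime: lost revenue plus idle payroll."""
    return lost_revenue_per_hour + idle_staff * loaded_hourly_wage


def meets_recovery_target(solution_restore_hours, acceptable_downtime_hours):
    """True if the backup solution restores within the acceptable downtime."""
    return solution_restore_hours <= acceptable_downtime_hours


cost_per_hour = hourly_downtime_cost(
    lost_revenue_per_hour=2_000, idle_staff=25, loaded_hourly_wage=40
)
acceptable_downtime_hours = 4  # hypothetical acceptable recovery time
solution_restore_hours = 6     # hypothetical restore time of a backup solution

print(f"Estimated cost per hour of downtime: ${cost_per_hour:,}")
print(f"Exposure if a restore takes {solution_restore_hours} hours: "
      f"${cost_per_hour * solution_restore_hours:,}")
print("Solution fits recovery target:",
      meets_recovery_target(solution_restore_hours, acceptable_downtime_hours))
```

Swapping in your own numbers makes the gap between what a backup solution delivers and what the business can tolerate hard to overlook.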
Avoid Assumptions in Written DR Procedures
Once you have audited your asset catalog and backup solution, the next step is to write a list of instructions. Many of us have seen the classroom exercise where one person writes out the steps for making a sandwich, and another person may execute only those actions that are precisely written down. It’s easy to leave out a step. When making a sandwich, you don’t want to assume that your reader knows to open the bread sack, for example. Skipping steps leaves the technicians who will be executing your DR Plan at a loss. So when establishing a comprehensive and concise list of procedures to spin up your recovery systems, it’s good to be careful with expectations. As Noto points out, you shouldn’t expect a system that was undefined in the DR plan to be fully functional after the execution of the plan. This seems simple, but it’s one of those cases where one might assume everything will be fine, until one day it isn’t. Auditing and filling in your procedures until they are inclusive will help prevent this common DR Plan mishap.
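One simple way to audit for undefined systems is to compare the asset catalog against the list of written procedures. The sketch below assumes both are kept as plain lists of names; the asset names and the function are hypothetical, for illustration only.

```python
# Hypothetical asset catalog and procedure list; names are illustrative only.
asset_catalog = {"email", "file server", "accounting system",
                 "internet access", "phones"}
documented_procedures = {"email", "file server", "accounting system"}


def undocumented_assets(catalog, procedures):
    """Return cataloged assets with no written recovery procedure, sorted."""
    return sorted(set(catalog) - set(procedures))


missing = undocumented_assets(asset_catalog, documented_procedures)
for name in missing:
    print(f"No recovery procedure written for: {name}")
```

Anything this prints is a system you should not expect to be functional after the plan is executed.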
Test Your Plan
Another key assumption you can avoid is expecting all systems to be available at the same point in time. How can you know which will spin up first, and which may need a little longer (or which may need better recovery processes), until you test the plan? Testing shows you the exact order in which each system becomes available, so you know what is going to happen and can plan or revise accordingly. When systems go down and your executive steering committee members want to know, "What now?", you’ll have solid and reliable answers for them.
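If you record how long each system took to come back during a test, a few lines of Python can turn those notes into an availability order and a list of systems that need attention. This is only a sketch: the system names, timings, and targets below are invented, and your own test log will look different.

```python
# Hypothetical results from one DR test: minutes from activation until each
# system was usable. All names and numbers are made up for illustration.
test_results_minutes = {
    "internet access": 15,
    "email": 45,
    "file server": 120,
    "accounting system": 240,
}
# Hypothetical targets: acceptable downtime per system, in minutes.
targets_minutes = {
    "internet access": 30,
    "email": 60,
    "file server": 90,
    "accounting system": 480,
}


def availability_order(results):
    """Systems sorted from first available to last."""
    return sorted(results, key=results.get)


def over_target(results, targets):
    """Systems whose measured recovery time exceeded the target."""
    return [s for s in availability_order(results) if results[s] > targets[s]]


for system in availability_order(test_results_minutes):
    print(f"{test_results_minutes[system]:>4} min  {system}")
print("Needs a better recovery process:",
      over_target(test_results_minutes, targets_minutes))
```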
Audit Providers’ Manuals
Next, since some of your procedures will involve copying technical instructions from your providers for your recovery personnel to execute, you’ll want to take a look at the cloud, hardware, backup, and other infrastructure documentation required to effectively spin up failover. Do the instruction manuals from your providers make it VERY clear what to do, when, and how? If not, you’ll want to supplement or simplify explanations so that the appropriate personnel can understand how to follow each step with absolute clarity. You might be surprised by how confusing or incomplete these manuals can be, unless you are the person who has been reading and using them in recent years. Noto points out that TRUE has been rewriting and supplementing technical instruction manuals for clients for years as part of our managed and co-managed DR services. If a particular set of steps is too complicated, or too sparse, for internal IT teams to execute, that can lead to serious issues when those teams are attempting to implement recovery. Again, simple, but also simple to overlook, and potentially very frustrating to team members dealing with an emergency.
Evaluate 3 Phases for Goal, Common Mistakes, and Fixes
Once you have your procedures well documented, you’ll want to take a step back and look at your overall plan in three basic phases. Gary's reasoning behind breaking this into stages is simply that it can help you identify gaps in the flow.
1. Activation and Notification Phase
a. PHASE 1 GOAL: Activation of the DR Plan after a disruption has occurred
b. KEY MISTAKE: Apparently, it’s not uncommon for people to actually miss writing down (and, therefore, executing) some of these steps entirely, due to their simplicity and basic assumptions about who will do what.
c. HOW TO FIX THE PROBLEM: In this context, think through the most basic of observations for each phase. For example, how will you identify a disruption? Who is responsible for keeping track of availability and disruptions to your systems (during and after business hours)? Then, who will get the first notification? The second? (…and so on…) Whether it is a disruption that goes unnoticed or a key player who is not notified, a very serious oversight can easily be avoided by auditing and testing the DR plan through each step of each phase on a regular basis. Clearly, you don’t want to derail an otherwise excellent plan from the outset.
2. Recovery Phase
a. PHASE 2 GOAL: Activities and procedures are written at a level that an appropriately skilled technician can recover the system without intimate system knowledge. They have been tested and undergo routine review for updates. Further, the plan includes how to stop the DR Plan and return to normal service.
b. KEY MISTAKES:
- Requiring that a technician be intimately familiar with the internal workings of various systems.
- Assuming that, because you once developed a DR Plan, no procedural updates are ever required for the Recovery Phase.
- Not knowing how to deactivate the DR Plan and return to normal service.
c. HOW TO FIX THE PROBLEM: First, according to Noto, the DR Plan should be tested at regular intervals for any gaps, changes to the environment, failures, or outdated processes, then updated accordingly. The best way to know everything is working is to test monthly. Some organizations are lucky to test their DR Plan annually, but this leaves so much to chance. Honestly, how many changes does your organization undergo in the course of a year? Have you rolled out any new hardware? Hired new employees? Phased out an old system? Implemented a new solution or technology disruptor? Added vendors to your supply chain? In the end, demonstrating efficacy of your plan may feel inconvenient once a month, but as we have said historically at TRUE, a crisis is no time to improvise. Further, while you are testing, that is the perfect time to update. You already have the plan out and will be able to see, in real time, any gaps or outdated information that need to be adjusted. There is almost no point in having a DR Plan if you can’t know beyond the shadow of a doubt that a) it will work whenever you need it, b) you have recent data to prove it works, c) each person in the process knows exactly what to do without question, and d) all changes to systems, personnel, hardware, etc. since the previous month have been factored into the plan.
3. Reconstitution Phase
a. GOALS: The Reconstitution phase defines the actions taken to test and validate system capability and functionality at the original or new permanent location. This phase needs to validate 1) successful reconstitution and 2) deactivation of the plan.
b. KEY MISTAKES:
- Assuming that, because you have a DR Plan, it will work in the event of a disaster.
- Assuming you can test your DR Plan effectively in the production environment that exists.
- Assuming you can restore a system from backup without actually testing it.
- Skipping Tabletop Exercises, assuming a single monthly test of the DR Plan will suffice.
- Testing only one disaster scenario.
c. HOW TO FIX THE PROBLEM: First, with regard to testing, not only should the plan be spun up and updated monthly, but the DR Plan should also be tested in a real test environment that exists independently of the current production systems. The reason for this is to prevent unnecessary disruptions to your business processes. After all, you are trying to avoid downtime, not create it. On the other hand, you also don’t want to rely on hypotheticals. The test environment should mimic reality. Actual system backups should be tested by doing a full restore in the test environment. Are you able to access all data and systems you expected to see?
Second, your DR Plan should be subjected to Tabletop Exercises (a real-life form of testing). That includes testing all of the people, processes, and systems as a walk-through, responding to simulations of the various disruption types your organization could face. Until you walk it through, how will you know if your instructions make perfect sense to all personnel involved? How can you identify gaps in the plan due to change or omission?
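For the full-restore check in the test environment, one way to answer "can you see everything you expected to see?" is to compare the restored files against a list of what should be there. The sketch below is a minimal, assumed approach: the function name and file paths are hypothetical, a temporary directory stands in for the test environment, and it only checks that files exist, not that their contents are intact.

```python
import tempfile
from pathlib import Path


def missing_after_restore(restore_root, expected_paths):
    """Return expected relative paths that are absent under restore_root."""
    root = Path(restore_root)
    return [p for p in expected_paths if not (root / p).exists()]


# Hypothetical list of files the business expects to get back.
expected = ["finance/ledger.db", "hr/employees.csv"]

with tempfile.TemporaryDirectory() as tmp:
    # Simulate a restore that brought back only one of the two expected files.
    (Path(tmp) / "finance").mkdir()
    (Path(tmp) / "finance" / "ledger.db").write_text("restored")
    missing = missing_after_restore(tmp, expected)

print("Missing after restore:", missing)
```

In a real test you would point the check at the restored location in your test environment and follow it with the human question the tabletop exercise asks: can the people who need these systems actually use them?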
Keep it Simple, Literal, and Straightforward
In short, don’t make assumptions. If these steps seem so simple that you are tempted to skip them, you are on the road to embarrassing and frustrating omissions that will affect your recovery performance and efficacy when faced with real downtime. That downtime could come from a malware attack, a weather event, or even something as simple as hardware failure or the accidental deletion of files or data that are vital to business processes. If you are too strapped for time and internal personnel to perform the necessary steps involved in building, testing, and updating a reliable DR Plan, one option is to outsource some or all of your DR Plan development and maintenance. TRUE has been successfully creating, testing, and updating DR Plans on behalf of and alongside our customers for decades. If you would like an initial consultation with one of our Disaster Recovery experts, free of charge, to talk through your organization’s current plan, or to begin building one, please reach out to us.
We look forward to hearing from you.