Avoiding the Doomsday Scenario: Why Incident Response Plans Fall Short

October 04, 2023 4 Min Read By: Harry Zarek

Having watched, monitored, and participated in many cybersecurity incidents over the years, it is becoming more and more clear to me that organizations are missing a few key opportunities to improve the outcome and speed of recovery following an attack. Seizing those opportunities can make all the difference between restoring business operations and shuttering for good.

I am so focused on this topic because I have seen the magnitude of effort involved in restarting operations after a significant incident has occurred. More disconcerting is the lack of a documented recovery plan in many situations. My goal in writing this article is to highlight some thoughts as to why we as an industry have failed in this regard, and in a subsequent note, I will identify ways to plan appropriately.

Building an Incident Command Centre

When a cybersecurity incident has been detected, an Incident Command Centre is set up to respond to and manage the impact of the malware. Forensic experts are brought in to trace what happened, determine whether data has been exfiltrated, and confirm that the malware has been eradicated. After the experts conclude all is safe, they hand the environment back to the business. A root cause analysis report is generated, and the incident is considered closed, leaving behind a faint hope that the malware has been eradicated and that systems will be restored in a relatively short period of time.

Most often, the environment has been so badly impacted, with programs corrupted and data infected, that the only recourse customers have is to rebuild their entire environment from scratch.

If you know the outcome will be a total rebuild, then the preparation work should begin at the same time as the Incident Command Centre is set up. Unfortunately, this does not happen.

This is when three questions must be asked:

How long has the malware been infecting your environment?
Is your network segmented, and do you know which segments are free of malware? (i.e., any air-gapped or immutable systems)
Are you certain that your backup systems are free of the malware?

If you cannot answer these questions with high confidence, the only safe measure to take is to perform a ’bare metal’ restart. Termed the “Doomsday Scenario,” this is a decision organizations try to avoid at all costs.
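The decision rule behind these three questions can be written down as a short sketch. Everything named here is an assumption made for illustration: the `Assessment` fields, the `recommended_action` function, and the two outcome labels are hypothetical, and a real decision would weigh far more evidence than three yes/no answers.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Hypothetical record of the three questions; field names are illustrative."""
    dwell_time_known: bool         # Do we know how long the malware has been present?
    clean_segments_verified: bool  # Are segmented, air-gapped, or immutable systems known clean?
    backups_verified_clean: bool   # Are the backup systems known to be free of the malware?


def recommended_action(a: Assessment) -> str:
    """Return a targeted restore only when all three answers are confident."""
    if a.dwell_time_known and a.clean_segments_verified and a.backups_verified_clean:
        return "targeted restore"
    # Any doubt on any question falls back to the full rebuild.
    return "bare-metal restart"


if __name__ == "__main__":
    doubtful = Assessment(dwell_time_known=True,
                          clean_segments_verified=True,
                          backups_verified_clean=False)
    print(recommended_action(doubtful))
```

The point of the sketch is that the rule is conjunctive: a single unanswered question is enough to push the organization toward the rebuild.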

As modern businesses continue to invest in technology, significant effort goes into actively protecting and managing the security of our technology environment. The technology industry has done a very good job highlighting the need for security protection with clear explanations around the impact of a security failure. To mitigate these risks, the security industry has developed a cornucopia of software tools to protect the infrastructure and detect malware. (Gartner has identified over 3,000 products in this space.)

The Recovery Phase

The NIST Cybersecurity Framework (CSF) is the industry methodology used to align security functions on the expectation that you will encounter an incident. The last function in the framework is Recover, but it is focused more on recovery from the security incident than on recovery of the business.

Ultimately, restoring operations to a normal state is the overriding goal of incident response, and the Recovery phase offers the greatest opportunity to do just that. But in far too many events, this effort is the most difficult to achieve and takes the longest time. All the while, the business struggles to operate, resulting in lost revenue and diminished credibility with customers, which can have long-term repercussions. With the stakes so high, you might conclude that the priority focus would be on the recovery process. Unfortunately, that is not the case. There are many examples of public companies reporting hundreds of millions of dollars in direct costs as they struggle to recover business operations.

The most important decision to make when confronted with an attack is to invoke the recovery plan at the same moment the incident is declared.

Recovery at the Onset of Attack

This is separate from the Incident Response process. Working under the category of Business Continuity or Disaster Recovery, many customers have built restoration plans for data and applications as the basis for their recovery process. But in most cases, that does not work. Why not?

Because underlying the data and applications are extensive and integrated layers of infrastructure consisting of operating systems, network components, virtualization, identity and authentication services, and an array of others, both on-premises and in various cloud environments. It is not enough to restore this menu of components; they must be restarted in a systematic sequence.
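One way to picture a systematic restart sequence is as a dependency ordering. The sketch below is illustrative only: the `DEPENDS_ON` map is a hypothetical, heavily simplified inventory (a real environment would have many more components and different relationships), and `restart_order` is a made-up helper name. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be running first.
# The relationships are assumptions for illustration, not a reference design.
DEPENDS_ON = {
    "network": set(),
    "virtualization": {"network"},
    "identity": {"network", "virtualization"},
    "operating_systems": {"virtualization", "identity"},
    "data": {"operating_systems"},
    "applications": {"data", "identity"},
}


def restart_order(depends_on):
    """Order components so every dependency is restarted before its dependents."""
    # static_order() raises graphlib.CycleError if the map contains a cycle.
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, component in enumerate(restart_order(DEPENDS_ON), start=1):
        print(step, component)
```

Writing the dependencies down ahead of time is the useful part. If two components depend on each other, the sorter reports a cycle, which is exactly the kind of circular dependency that is better discovered in planning than in the middle of a rebuild.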

An example of a response failure is the inability to communicate via corporate email after an incident has occurred. Most frequently, that system has been brought down to protect against further infection. But without a reliable secondary communications method, the ability to mobilize resources is delayed.

I am convinced that if organizations make the decision to invoke the Recovery Plan immediately, the elapsed time to restore the business to a functioning capability will be dramatically shortened.

So why is this not happening?

Here are five roadblocks that prevent leaders from taking this critical step, or delay them in taking it:

The belief that the malware can be controlled or eradicated quickly.
Thinking it will be simple enough to reload the affected application(s) and restore data.
Misunderstanding how intertwined systems are today. This includes single sign-on systems, authentication, and authorization systems.
Complexity around performing a complete restart of the entire environment.
Lack of planning for this contingency.

Missing the Mark

So what’s the real challenge in recovery? Ultimately, the skills to implement a restart program do not reside within the cybersecurity team. Nor should they. Herein lies, in my opinion, the primary reason we have failed. We have not involved the Enterprise Architects, the IT Operations staff, and the other technical specialists whose skills are required to restart the entire environment. Nor have we included the leaders whose business functions depend on their applications running. They have a role to play in prioritizing which applications are most critical to support the business.

In conclusion, we must expand our thinking about Incident Response and include Recovery as a key responsibility from the moment an incident is invoked.

In the following article, I will outline how to go about setting this up.

I’m interested to know what your thoughts are. Do you agree, disagree, or have a different perspective to share on this highly important topic?