The importance of resilience

The recent CrowdStrike incident highlights a crucial lesson: a single point of failure can have far-reaching consequences. Lee shares his thoughts on the critical need for organizations to prioritize resilience in their IT infrastructure and business operations.
Written by
Lee Conlin
Published on
July 22, 2024

The massive IT outage that occurred on July 19, 2024, serves as a stark reminder of the vulnerabilities inherent in our increasingly digitized world and the importance of building robust, resilient systems.

The 2024 CrowdStrike Incident: A Wake-Up Call

It now appears that a faulty update to CrowdStrike's security software caused an estimated 8.5 million computers running Microsoft Windows to crash and left them unable to restart properly. The event, described by cybersecurity consultant Troy Hunt as the largest outage in the history of information technology, disrupted businesses and governments worldwide, affecting industries ranging from aviation and finance to healthcare and manufacturing.

The scale of the disruption was unprecedented. Airlines grounded flights, hospitals struggled to access patient records, stock markets faltered, and government services, including emergency response systems, were severely compromised. The financial impact was staggering, with damages estimated at around $10 billion.

This incident underscores a crucial lesson: in our interconnected world, a single point of failure can have far-reaching consequences. It highlights the critical need for organizations to prioritize resilience in their IT infrastructure and business operations.

Understanding Resilience in the Business Context

Resilience, in the context of business and IT, refers to an organization's ability to quickly adapt to disruptions while maintaining continuous business operations and safeguarding people, assets, and overall brand equity. It's about more than just disaster recovery—it's about creating systems and processes that can withstand, adapt to, and recover from adverse conditions.

Building Resilience into Business Infrastructure

Diversification of Systems and Providers

One of the key lessons from the CrowdStrike incident is the danger of over-reliance on a single system or provider. Organizations should consider diversifying their IT infrastructure across multiple providers and platforms. This approach, while potentially more complex to manage, provides a safety net in case one system fails.

Regular Risk Assessments and Scenario Planning

Conduct regular risk assessments to identify potential vulnerabilities in your infrastructure. Engage in scenario planning exercises that consider a wide range of potential disruptions, from cyberattacks to natural disasters. These exercises can help you identify weaknesses in your systems and develop strategies to address them.

Implementing Robust Backup and Recovery Systems

Ensure that you have comprehensive backup systems in place, with data stored in geographically diverse locations. Regularly test your backup and recovery processes to ensure they function as expected when needed.
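
One way to make "regularly test your backup and recovery processes" concrete is to restore a file and compare its checksum with the original. The following is a minimal sketch in Python, not a complete backup tool. The function names `sha256_of` and `restore_matches` are hypothetical and chosen for illustration; only the standard-library `hashlib` module is used.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """Return True if the restored file has the same contents as the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

A scheduled job could restore a sample of files to a scratch location and call `restore_matches` on each one, raising an alert on any mismatch. A matching checksum only shows that the bytes survived the round trip; it does not show that a full system can be rebuilt from them, so a check like this complements full recovery drills rather than replacing them.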

Embracing Cloud Technologies

Cloud technologies can offer enhanced resilience through their distributed nature and built-in redundancies. However, it's crucial to understand that moving to the cloud doesn't automatically guarantee resilience. Carefully consider your cloud strategy, potentially adopting a multi-cloud approach to further mitigate risks.

Investing in Cybersecurity

As the CrowdStrike incident demonstrates, even security software can be a point of vulnerability. Invest in robust cybersecurity measures, including regular security audits, employee training, and up-to-date security protocols.

Disaster Recovery Policies: Beyond Backup Plans

While having a disaster recovery plan is crucial, true resilience requires going beyond simple backup and restore procedures. Consider the following elements when developing your disaster recovery policies:

Clear Communication Protocols

Establish clear communication channels and protocols for use during a crisis. This should include both internal communication among staff and external communication with customers, partners, and the public.

Regular Testing and Drills

Regularly test your disaster recovery procedures through drills and simulations. These exercises can help identify gaps in your plans and ensure that your team is prepared to act effectively in a crisis.

Prioritization of Critical Systems

Identify your most critical systems and prioritize their recovery in your disaster response plans. This may involve creating a tiered system of recovery, ensuring that the most crucial operations are restored first.
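
A tiered system of recovery can be captured as simple data. The sketch below, in Python, assumes each system has been assigned a tier number where 1 is the most critical. The system names and the `recovery_plan` function are hypothetical examples, not a prescribed scheme.

```python
from collections import defaultdict


def recovery_plan(systems):
    """Group systems by tier and return (tier, names) pairs, most critical tier first.

    `systems` maps a system name to its tier number (1 = restore first).
    """
    tiers = defaultdict(list)
    for name, tier in systems.items():
        tiers[tier].append(name)
    # Sort tiers ascending, and names within a tier alphabetically for a stable plan.
    return [(tier, sorted(names)) for tier, names in sorted(tiers.items())]


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = {"payments": 1, "auth": 1, "email": 2, "reporting": 3}
    for tier, names in recovery_plan(inventory):
        print(f"Tier {tier}: {', '.join(names)}")
```

Keeping the tier assignments in a reviewed, version-controlled file means the recovery order is agreed before a crisis rather than argued about during one.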

Cross-Training of Staff

Ensure that multiple team members are trained in critical recovery procedures. This redundancy in human resources can be crucial if key personnel are unavailable during a crisis.

Continuous Improvement

Treat your disaster recovery policy as a living document. Regularly review and update it based on lessons learned from drills, minor incidents, and changes in your business environment.

Ensuring Resilience in Digital Products and Websites

In today's digital-first business environment, your online presence is often the primary interface between your organization and your customers. Ensuring the resilience of your digital products and websites is crucial for maintaining business continuity and customer trust.

Implement Content Delivery Networks (CDNs)

CDNs can improve the resilience of your websites by distributing content across multiple, geographically diverse servers. This not only improves performance but also provides redundancy in case of server failures.

Use Load Balancing

Implement load balancing to distribute traffic across multiple servers. This can help prevent overloads and ensure that your digital services remain available even if some servers fail.
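
To show how a load balancer keeps a service available when some servers fail, here is a minimal round-robin sketch in Python that skips servers marked as unhealthy. It is an illustration under simplifying assumptions: the class `RoundRobinBalancer` and its methods are hypothetical, and a production load balancer would run its own health checks rather than rely on manual `mark_down` calls.

```python
import itertools


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Look at each server at most once per call so we never loop forever.
        for _ in range(len(self.servers)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")
```

With servers `a`, `b` and `c`, marking `b` as down leaves traffic alternating between `a` and `c`, and `b` rejoins the rotation once `mark_up("b")` is called.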

Adopt a Microservices Architecture

Consider adopting a microservices architecture, a core element of composable or MACH (Microservices, API-first, Cloud-native, Headless) approaches, for your digital products. This approach, which breaks applications into smaller, independently deployable services, can enhance resilience by limiting the impact of failures to specific services rather than entire applications.

Implement Circuit Breakers

Use the circuit breaker pattern in your applications to prevent cascading failures. This design pattern can help isolate problems and maintain overall system stability when individual components fail.
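
The circuit breaker pattern can be sketched in a few lines of Python. This is a simplified, single-threaded illustration, not a production implementation. The class names, the `failure_threshold` and `reset_timeout` parameters, and their default values are assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of calling a dependency known to be failing.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: "half-open" state, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a remote dependency as `breaker.call(fetch_prices)` (where `fetch_prices` is a hypothetical function) means that after repeated failures the caller gets an immediate `CircuitOpenError` and can fall back to cached data or a degraded response, rather than tying up resources waiting on a service that is down.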

Regular Performance Testing

Conduct regular performance testing of your digital products and websites. This can help identify potential bottlenecks and vulnerabilities before they lead to outages.

Implement Robust Monitoring and Alerting Systems

Deploy comprehensive monitoring and alerting systems that can quickly identify and notify you of issues with your digital products. Early detection can be crucial in minimizing the impact of potential disruptions.
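
As one illustration of alerting logic, the Python sketch below raises an alert only after several consecutive failed health checks, which avoids notifying on a single transient blip. The class name `ConsecutiveFailureAlert` and the default threshold of 3 are hypothetical, and how the health check itself is performed is left out.

```python
class ConsecutiveFailureAlert:
    """Track health-check results and signal when to alert or clear."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, check_passed):
        """Record one check result; return "alert", "recovered", or None."""
        if check_passed:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return None
        self.consecutive_failures += 1
        # Fire once when the threshold is reached, not on every later failure.
        if self.consecutive_failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "alert"
        return None
```

The threshold is a trade-off: a lower value detects outages sooner, while a higher value reduces false alarms. The right setting depends on how often checks run and how costly a missed minute of downtime is.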

The Human Factor in Resilience

While much of our discussion has focused on technical aspects, it's crucial to remember that resilience is as much about people as it is about technology. Foster a culture of resilience within your organization by:

  1. Promoting a proactive mindset towards risk management
  2. Encouraging open communication about potential vulnerabilities
  3. Rewarding innovative solutions to enhance resilience
  4. Providing ongoing training and development in resilience-related skills

Conclusion: Resilience as a Competitive Advantage

In an era where digital disruptions are becoming increasingly common and impactful, resilience is no longer just about risk mitigation—it's a potential source of competitive advantage. Organizations that can demonstrate robust resilience in the face of adversity are likely to earn greater trust from customers, partners, and stakeholders.

The 2024 CrowdStrike incident serves as a powerful reminder of the interconnected nature of our digital world and the cascading effects that can result from a single point of failure. By prioritizing resilience in our systems, processes, and organizational culture, we can better prepare ourselves to navigate the uncertainties of the digital age.

As business and IT leaders, the responsibility falls on us to champion resilience within our organizations. It requires investment, both in terms of resources and mindset, but the payoff—in terms of business continuity, stakeholder trust, and long-term sustainability—is immeasurable. In a world where the next major disruption may be just around the corner, resilience isn't just important—it's essential.
