Operational Resilience Starts With ITSM Processes

Organizations have never been more reliant on information technology. It runs the business. But we’ve never been more exposed. Cyberattacks, ransomware, operational mistakes, and supply chain dependencies — the risks are everywhere.

Those risks translate directly into outages, reputational damage, and financial loss. Improving operational resilience means going back to basics: your ITSM processes.

At its core, operational resilience in IT is determined by how well those processes actually work.

What Is Operational Resilience?

Operational resilience is an organization’s ability to detect, withstand, operate through, and recover from a wide range of threats, particularly those that impact critical business services. For the purpose of this article, we’re focusing on information technology and the systems that support critical business services.

It starts with understanding the threats facing the organization. Why? Because you can’t protect yourself from what you don’t understand. Those threats are many — from nefarious actors looking to exploit vulnerabilities, to natural events like pandemics, fires, floods, and other disasters, to emerging technologies such as AI introducing new risks, to internal risks such as ineffective processes, fragile systems, or poorly trained staff.

To be resilient, organizations need an ongoing process for identifying, evaluating, and mitigating those risks — whether through proactive planning, transferring risk (to suppliers or insurance), or accepting it where no better option exists.

Withstanding those risks means putting the right safeguards in place. This includes designing and deploying secure systems, building redundant infrastructure, protecting against cyber threats, and establishing operational practices that allow you to continue operating during disruption and detect when things go wrong.

Then there’s recovery: understanding the business impact of each system and having the strategies and processes in place to restore services in an order that reflects that impact.

At its foundation, operational resilience starts with consistent, well-governed ITSM processes.

What Most Organizations Get Wrong About Operational Resilience

In our opinion, there are two critical things most organizations get wrong about operational resilience (OpRes).

Mistake one: Thinking operational resilience comes from tools and technology

Yes, tools are a critical part of having a resilient organization. But technology is only as good as how it is implemented and operated. Every year we see major outages, breaches, and privacy incidents at some of the world’s largest organizations. So they didn’t have tools? Of course they did — but that wasn’t enough to prevent an outage.

What good are monitoring systems if teams aren’t monitoring the right things or reviewing the logs? How effective are firewalls if patches aren’t applied? And how strong is access management if you don’t, well, manage it?

OpRes tools are not silver bullets. They still require the right combination of people and processes to make them effective. Tools don’t create resilience. Execution does.

Mistake two: Treating operational resilience as a governance, risk, and compliance exercise

Far too many organizations treat operational resilience as something to be documented, assessed, and reported on — checking the right boxes. But resilience isn’t achieved through controls on paper or audit findings in a report.

You can have a fully populated risk register, pass every audit, and still experience a major outage. Why? Because resilience doesn’t come from identifying risks — it comes from how work is actually executed day to day.

And as a vendor, I can tell you many of the assessments organizations rely on are taken at face value, with limited validation or follow-up.

GRC plays an important role, but it doesn’t create resilience. It defines expectations. Real resilience is built through consistent, well-governed operational processes.

These misconceptions lead to a deeper issue — a misunderstanding of what operational resilience actually is.

Operational Resilience Is an Outcome

If there is one takeaway from this article, it’s this: operational resilience isn’t a department, a tool, or a checklist — it’s what you get when your ITSM processes actually work. Operational resilience is an outcome of execution.

Here’s a story that reinforces this.

Earlier in my career, working for a hardware vendor, we rolled out a large number of storage devices across North America. As the months progressed, we started seeing sporadic failures — some of them resulting in customer outages.

Individually, technicians in different regions weren’t seeing enough incidents to raise concern. But when viewed collectively, a clear pattern emerged.

The good news is that our organization took processes seriously. Configuration Management tracked each device and its specifications. Incident Management captured every failure. Problem Management identified a trend — devices from a specific manufacturing run. And Change Management enabled us to proactively replace those devices before they failed.

The result: we prevented further outages, avoided reputational damage, and reduced potential financial penalties.

That wasn’t luck. That was operational resilience — driven by ITSM processes that worked.
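
The trend analysis in the story above can be sketched in a few lines. This is a minimal illustration, not the tooling we actually used: the serial numbers, manufacturing run labels, regions, and the 0.75 threshold are all hypothetical. It joins incident records to configuration records and computes a failure rate per manufacturing run, which is the view no single region had on its own.

```python
from collections import Counter

# Hypothetical configuration records: device serial -> manufacturing run.
devices = {
    "SN-1001": "RUN-A",
    "SN-1002": "RUN-A",
    "SN-1003": "RUN-B",
    "SN-1004": "RUN-B",
    "SN-1005": "RUN-B",
    "SN-1006": "RUN-C",
}

# Hypothetical incident records logged by technicians in different regions.
incidents = [
    {"region": "East", "serial": "SN-1003"},
    {"region": "West", "serial": "SN-1004"},
    {"region": "Central", "serial": "SN-1005"},
    {"region": "East", "serial": "SN-1001"},
]


def failure_rate_by_run(devices, incidents):
    """Return {run: failed devices / deployed devices} by joining incidents to config data."""
    deployed = Counter(devices.values())
    # Count each failed device once, even if it has several incidents.
    failed_serials = {i["serial"] for i in incidents if i["serial"] in devices}
    failed = Counter(devices[s] for s in failed_serials)
    return {run: failed[run] / deployed[run] for run in deployed}


def suspect_runs(rates, threshold=0.75):
    """Runs whose failure rate meets or exceeds an illustrative threshold."""
    return sorted(run for run, rate in rates.items() if rate >= threshold)


rates = failure_rate_by_run(devices, incidents)
print(rates)
print(suspect_runs(rates))
```

No region in this sample logs more than two incidents, yet every device from one run has failed. That collective view is what turns scattered incidents into a problem record and a proactive change.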

Operational Resilience Is Broader Than Cybersecurity

Many organizations conflate operational resilience with cybersecurity. You can have strong cybersecurity and still lack operational resilience. The reverse isn’t true: without sound cybersecurity, you cannot be operationally resilient.

Cybersecurity consistently shows up at the top of CEO and CIO concern lists in surveys from PwC, Gartner, and Deloitte. In reality, they’re all pointing to the same underlying issue: operational resilience.

Significant investments are being made in security frameworks, tools, and consulting — when the root causes of breaches are often failures in people and process.

We would argue that these investments in cybersecurity can create a false sense of security — one where controls exist, but execution falls short.

A strong security posture, supported by frameworks such as the NIST CSF and implemented through modern monitoring and threat prevention technologies, is critical. But so are the underlying processes that ensure patches are applied, systems are hardened, and people are trained.

The ITSM Foundation of Operational Resilience

Operational resilience doesn’t operate in a vacuum — it is built on a strong foundation of IT Service Management (ITSM) processes.

IT Service Management has its roots in the earliest days of mainframe computing. The technology was incredibly expensive and required rigorous processes to ensure reliability and availability.

While computing technology may be more reliable today, its complexity — and its proliferation across every aspect of business and society — has made it far more difficult to manage, even as our dependence on it has increased.

The same ITSM processes that enabled “lights-out” mainframe data centers with “five nines” of reliability are even more critical today, where there are far more moving parts and interdependencies.

To put it simply, you cannot achieve operational resilience without a foundation of well-implemented, consistently applied, and well-governed processes.

So what does that look like in practice?

The Core ITSM Processes Driving Resilience

Depending on the framework you reference — whether ISO/IEC 20000, ITIL, or COBIT — there are dozens of ITSM processes and practices. These span the full IT lifecycle, from strategy and design through build, implementation, operation, and governance.

All of these processes play a role in IT management. However, we believe there are 14 “first-order” ITSM processes that are critical to cybersecurity outcomes and operational resilience.

These are not theoretical — they directly map to how resilience is executed day to day:

Asset Management — Ensures critical assets are known, controlled, and prioritized so operational resilience is maintained across the services they support.
Change Management — Reduces the risk of outages by ensuring changes are assessed, approved, and implemented in a controlled way that protects operational resilience.
Incident Management — Restores services quickly when disruptions occur, minimizing business impact and maintaining operational resilience.
Information Security Management — Protects systems and data from threats, forming a critical layer of operational resilience against cyber risk.
Infrastructure and Platform Management — Maintains the stability, capacity, and performance of core systems to ensure services remain resilient under load and stress.
Measurement and Reporting — Provides the visibility needed to assess performance, identify weaknesses, and continuously improve operational resilience.
Monitoring and Event Management — Detects issues early and enables rapid response, helping sustain operational resilience during emerging disruptions.
Problem Management — Eliminates root causes of recurring issues to strengthen long-term operational resilience and reduce repeat failures.
Release Management — Ensures deployments are coordinated and controlled, reducing disruption and preserving operational resilience.
Risk Management — Identifies and mitigates risks to services and operations, enabling informed decisions that protect operational resilience.
Service Continuity Management — Ensures critical services can continue or be recovered within acceptable timeframes during major disruptions.
Service Validation and Testing — Confirms that changes will not negatively impact services, protecting operational resilience before deployment.
Software Development — Builds and maintains applications in a way that supports stable, secure, and resilient service delivery.
Supplier Management — Ensures third-party services and dependencies meet performance and resilience requirements to avoid external points of failure.

To better understand how these processes drive operational resilience, it helps to group them into six capabilities:

Understand the Risks

Resilience starts with knowing what matters and where you’re exposed.

You can’t protect what you don’t understand. Operational resilience starts with a clear view of both the risks you face and the assets those risks impact. Two ITSM processes are essential to this: Risk Management and Asset Management.

Risk Management provides the structure for continuously identifying, assessing, and mitigating threats that could impact services and operations. It ensures risks are understood in context, prioritized based on business impact, and addressed through a combination of controls, process discipline, and informed decision-making. For a more detailed breakdown, see our guide to Building an ITSM Risk Management Process. At the same time, Asset Management provides complete visibility into the assets that support those services — from infrastructure and applications to cloud and SaaS — ensuring they are properly tracked, configured, and managed throughout their lifecycle.

Working together, these processes establish a clear understanding of what matters most and where the organization is most exposed. Without this foundation, it’s impossible to prioritize controls, allocate resources effectively, or build a meaningful operational resilience capability.
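
As a rough illustration of how Risk Management and Asset Management feed each other, the sketch below scores each risk as likelihood times impact and ranks risks on assets that support critical services first. The 1 to 5 scales, the asset names, and the sample risks are assumptions made for the example, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    asset: str
    likelihood: int  # 1 (rare) to 5 (almost certain), illustrative scale
    impact: int      # 1 (minor) to 5 (severe), illustrative scale

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


# Hypothetical asset inventory: asset -> supports a critical business service?
critical_assets = {"payments-db": True, "intranet-wiki": False, "auth-service": True}

risks = [
    Risk("Unpatched database engine", "payments-db", likelihood=4, impact=5),
    Risk("Stale wiki content", "intranet-wiki", likelihood=5, impact=1),
    Risk("Single auth node, no failover", "auth-service", likelihood=2, impact=5),
]


def prioritize(risks, critical_assets):
    """Sort risks: those on critical assets first, then by descending score."""
    return sorted(
        risks,
        key=lambda r: (not critical_assets.get(r.asset, False), -r.score),
    )


for r in prioritize(risks, critical_assets):
    print(f"{r.score:>2}  {r.asset:<14} {r.name}")
```

The point of the design is that the ranking needs both inputs: without the asset inventory, a frequent but trivial risk can crowd out a rarer one that threatens a critical service.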

Protect Your Services

Reduce the likelihood of failure through disciplined change, security, and design.

Once you understand the risks, the next step is protection. Operational resilience depends on putting the right controls and practices in place to prevent disruption before it occurs. Six ITSM processes play a key role in building and maintaining those defenses: Change Management, Release Management, Information Security Management, Infrastructure and Platform Management, Service Validation and Testing, and Software Development.

Change Management ensures that changes are introduced in a controlled and coordinated way, reducing one of the leading causes of outages. Release Management complements this by coordinating and deploying changes in a structured manner, ensuring they are introduced safely and consistently into the production environment. Information Security Management protects systems and data from threats by embedding security into day-to-day operations and aligning controls to business risk. Infrastructure and Platform Management maintains the stability, capacity, and performance of the underlying environment, ensuring services can operate reliably under normal and stressed conditions.

At the same time, Service Validation and Testing confirms that changes will not negatively impact services before they are deployed, while Software Development ensures that applications are built and maintained in a way that supports stability, security, and operational reliability from the outset.

Together, these processes form the protective layer of operational resilience — reducing the likelihood of failure and ensuring that systems are designed, built, and operated to withstand disruption.

Detect Threats to Resilience

See issues early before they become outages.

Even with strong protections in place, disruptions will occur. Operational resilience depends on the ability to detect issues early — before they escalate into outages or broader business impact. Two ITSM processes are critical to this capability: Monitoring and Event Management and Incident Management.

Monitoring and Event Management provides real-time visibility into system behavior, detecting anomalies, failures, and emerging threats as they happen. It ensures that the right signals are captured, correlated, and acted upon quickly, enabling timely response to maintain service stability. Incident Management builds on this by ensuring that identified issues are logged, prioritized, and acted upon quickly, preventing escalation and minimizing business impact. For more detail, see our guide to Incident Management Best Practices.

Together, these processes provide the visibility and responsiveness needed to sustain operational resilience — enabling organizations to detect, understand, and act on threats before they impact critical services.
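
To make "captured, correlated, and acted upon" concrete, here is a minimal sketch of one simple correlation rule: raise an alert only when the same check fires repeatedly on the same host within a short window. The event data, the 60-second window, and the three-event minimum are hypothetical values chosen for illustration. Real monitoring platforms offer far richer correlation.

```python
from collections import defaultdict

# Hypothetical raw events: (timestamp in seconds, host, check name).
events = [
    (0, "db-01", "disk_latency"),
    (20, "db-01", "disk_latency"),
    (45, "db-01", "disk_latency"),
    (50, "web-02", "http_5xx"),
    (400, "db-01", "disk_latency"),
]


def correlate(events, window=60, min_count=3):
    """Flag (host, check) pairs with at least min_count events inside window seconds."""
    by_key = defaultdict(list)
    for ts, host, check in sorted(events):
        by_key[(host, check)].append(ts)

    alerts = []
    for key, stamps in by_key.items():
        # Slide over the sorted timestamps looking for a dense burst.
        for i in range(len(stamps) - min_count + 1):
            if stamps[i + min_count - 1] - stamps[i] <= window:
                alerts.append((key, stamps[i]))  # record when the burst began
                break  # one alert per key is enough for this sketch
    return alerts


print(correlate(events))
```

A single stray event raises nothing, while a burst on one host becomes one alert that Incident Management can log and prioritize.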

Respond to Disruption

Stabilize quickly and minimize impact.

Even with strong detection in place, disruptions will occur. Operational resilience depends on the ability to respond quickly and effectively to stabilize services and minimize business impact. Three ITSM processes are critical to this capability: Incident Management, Problem Management, and Service Continuity Management.

Incident Management focuses on restoring service as quickly as possible, ensuring that disruptions are contained and business impact is minimized. Problem Management works alongside this by identifying root causes and eliminating recurring issues, strengthening resilience over time. For more detail, see our guide to Problem Management. When disruptions are significant or widespread, Service Continuity Management ensures that critical services can continue or be restored within acceptable timeframes through predefined recovery strategies.

Together, these processes enable organizations to respond to disruption in a structured and coordinated way — restoring stability, reducing impact, and preventing future failures.

Recover Services

Restore fully and prevent recurrence.

Even with effective response, restoring full service and preventing further disruption requires a structured recovery capability. Operational resilience depends on the ability to recover services quickly and completely, while addressing underlying issues to avoid recurrence. Two ITSM processes are critical to this capability: Problem Management and Service Continuity Management.

Problem Management ensures that the root causes of incidents are identified and addressed, preventing repeat failures and strengthening resilience over time. It moves the organization beyond short-term fixes to long-term stability. Service Continuity Management complements this by ensuring that critical services can continue or be restored within acceptable timeframes during major disruptions, using predefined recovery strategies and plans.

Together, these processes ensure that organizations not only recover from disruption, but do so in a way that reduces future risk and strengthens overall operational resilience.

Govern and Improve

Sustain discipline and continuously strengthen resilience.

Operational resilience is not a one-time achievement. It requires continuous oversight, accountability, and improvement. Effective governance ensures that processes are consistently applied, controls remain aligned to risk, and performance is measured and improved over time. Three ITSM processes are critical to this capability: Information Security Management, Supplier Management, and Measurement and Reporting.

Information Security Management provides the policies, standards, and controls needed to protect systems and data, while ensuring those controls are aligned to business risk and consistently enforced. Supplier Management ensures that third-party providers meet performance, security, and resilience expectations, reducing the risk introduced by external dependencies. Measurement and Reporting provides visibility into process performance and outcomes, enabling informed decision-making, oversight, and continuous improvement.

Together, these processes ensure that operational resilience is governed, measured, and continuously strengthened — embedding discipline and accountability into how services are managed and delivered.

Where Operational Resilience Breaks Down

Operational resilience doesn’t fail because organizations lack frameworks — it fails because governance erodes. Even with the right processes in place, operational resilience ultimately depends on how well they are governed.

Over time, processes that were once followed closely become inconsistent. Oversight weakens, metrics stop being reviewed, and small shortcuts become normal. Nothing breaks immediately. Until it does. A major outage, a security incident, or a failed audit forces a return to discipline. This isn’t a one-time failure — it’s a predictable pattern.

But the problem was never the process. It was the lack of sustained governance.

In my experience, a basic process that is consistently followed, measured, and improved will outperform a “perfect” process that isn’t. Frameworks provide guidance, but they don’t make processes work — governance does.

I call this pattern the Shark Fin Cycle, where governance erodes over time until a crisis forces correction, only for the cycle to repeat.

For a deeper look at this concept and the experience behind it, see our article on Practical Governance Beats Best-Practice Frameworks Every Time.

Why Compliance Doesn’t Equal Resilience

There’s an old saying: the proof of the pudding is in the eating. In other words, it’s about outcomes.

When it comes to operational resilience, it doesn’t matter how good things look on paper if you’re still experiencing outages, failures, or breaches.

You can pass audits, meet compliance requirements, and have every control documented and still lack resilience. Why? Because compliance measures whether controls exist, not whether they are consistently executed.

Resilience is different.

It shows up in how well you operate day to day — how incidents are handled, how changes are controlled, how risks are managed, and how quickly services are restored when things go wrong.

Compliance may tell you that you should be resilient.

Operational reality tells you whether you actually are.

What Resilient Organizations Do Differently

Resilient organizations start with foundational ITSM processes. Not a Visio diagram sitting on a shelf.

They have processes that are clearly defined, communicated, and actually used. They define them, run them, measure them, and improve them.

That’s the difference. Resilience is not built through documentation. It is built through disciplined execution.

That starts with leadership understanding that ITSM processes matter more than the tools.

Resilient organizations also establish clear ownership and accountability for those processes. One way to reinforce this is through a Service Management team made up of process owners and practitioners with the skills to design, implement, and govern them.

But be careful.

Too often, “Service Management” gets reduced to the ITSM tool implementation team.

That is not what this is.

A true Service Management function, often called a Service Management Office, is responsible for improving the quality and effectiveness of ITSM practices across the organization. It owns the processes, not just the platform.

This may sound like overhead, but in practice it is what sustains discipline and prevents the slow erosion of governance that leads to the Shark Fin Cycle — a predictable pattern in most organizations.

In smaller organizations, this can be a virtual team of key stakeholders working together to ensure processes run consistently. Larger organizations may formalize this as a dedicated Service Management Office.

For more detail, see our article on What is a Service Management Office? Everything You Need to Know.

Resilient organizations don’t just define processes. They run them.

How to Get Started with Operational Resilience

The first step is simple: understand where you currently stand.

Start with an assessment of the core 14 IT Service Management processes that directly support operational resilience. The goal is a clear, fact-based view of how these processes actually perform in practice.

Most organizations are surprised by what they find.

This is also a valuable opportunity to engage stakeholders across IT and the business. Operational resilience is not owned by one team. It depends on how consistently processes are executed across the organization.

Start with a Baseline

A quick baseline can be established through structured surveys, like those used in Navvia’s Operational Resilience Assessment. When sent to the right stakeholder groups, these surveys provide valuable insight into:

What is working well
Where gaps exist
Where perceptions differ across teams.
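
The three insights listed above can be approximated with a simple analysis of survey scores. The sketch below is not how any particular assessment product works. The stakeholder groups, the 1 to 5 scale, and the sample responses are hypothetical. It averages scores per process and per group, then reports the spread between groups as a "perception gap".

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey responses: (stakeholder group, process, score on a 1-5 scale).
responses = [
    ("IT Operations", "Change Management", 4),
    ("IT Operations", "Change Management", 5),
    ("Business", "Change Management", 2),
    ("Business", "Change Management", 3),
    ("IT Operations", "Incident Management", 4),
    ("Business", "Incident Management", 4),
]


def summarize(responses):
    """Return {process: {group: mean score}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for group, process, score in responses:
        buckets[process][group].append(score)
    return {
        process: {group: mean(scores) for group, scores in groups.items()}
        for process, groups in buckets.items()
    }


def perception_gaps(summary):
    """Return {process: highest group mean minus lowest group mean}."""
    return {p: max(g.values()) - min(g.values()) for p, g in summary.items()}


summary = summarize(responses)
gaps = perception_gaps(summary)
# Largest disagreement first: these are the processes worth discussing.
for process, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(process, summary[process], f"gap={gap}")
```

In this sample, both groups agree on Incident Management, while IT Operations rates Change Management two points higher than the business does. A gap like that is often a more useful finding than the average score itself.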

Go Deeper (optional)

For a more detailed view, complement surveys with:

Stakeholder interviews
A review of existing processes, procedures, and tools.

This adds context, validates findings, and helps uncover root causes behind the gaps.

Focus on What Matters

The goal is not just to score processes. It is to identify practical, actionable improvement opportunities.

This approach results in:

Quick wins that can be addressed immediately
A prioritized roadmap of improvements that strengthen operational resilience over time.

Think of It Like a Building Assessment

It is no different than assessing the condition of a building before starting renovations. You need to understand:

What is solid
What is at risk
Where to focus first.

Without this baseline, improvement efforts are often misdirected or ineffective.

Not sure how to get started? Navvia can help, either through our assessment platform or through a guided consulting engagement.

Final Thought: Operational Resilience Follows Execution

Operational resilience is not something you implement. It is not a framework, a tool, or a checklist.

It is the result of how well your organization executes day to day.

You can invest in the best technologies, adopt the most respected frameworks, and pass every audit. But if your processes are inconsistent, poorly governed, or not followed, resilience will break down.

Resilient organizations focus on the fundamentals.

They understand their risks. They protect what matters. They detect issues early, respond quickly, recover effectively, and continuously improve. Most importantly, they execute consistently.

That execution is driven by well-defined, well-governed ITSM processes.

If you want to improve operational resilience, don’t start with tools or frameworks.

Start with how work actually gets done.

Because operational resilience doesn’t come from what you design. It comes from what you run.