Change Management has again returned to the headlines, though you may not have heard it that way.
Two years ago, in “Change Management in the age of DevOps”, I wrote about the profound business impact of failed changes when Facebook and Target both had significant change-related outages.
Financial markets reacted immediately, with significant financial impact to company stakeholders.
Recently (October 4, 2021), Facebook (and Instagram, WhatsApp, Messenger, etc.) suffered another catastrophic, multi-hour outage.
The next day, Facebook Engineering explained: “During … routine maintenance …, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” adding, “This change caused a complete disconnection of our server connections between our data centers and the internet.”
Investor news site MarketWatch summarized the impact: “Outage cost company about $164,000 a minute in revenue, while stock’s decline wiped away more than $40 billion in market cap and cost Mark Zuckerberg roughly $6 billion personally.”
In the wake of that, Twitch also suffered a “data leakage” owing to a “… error in a Twitch server configuration change”.
As if to get the last word, Facebook finished out the rough week with a late Friday (October 8, 2021) outage. A brief Fortune.com article cites a “configuration error” as the cause of the second incident.
In this article, I’m going to take a look at Change Management from the vantage point of Governance, Risk and Compliance.
Change Management and Governance
Let’s cut to the chase. While I’m completely convinced that none of these incidents is the result of negligence, incompetence, or poor practices, what remains is the enormity of the business impact of these change-related incidents.
One of the core aspects of governance is decision-making. Who can make what decisions on behalf of the company and what limitations apply?
Good governance ensures that decision-making authority is both clearly understood and consistently followed. This includes documenting who makes key decisions in the organization and how those decisions are made.
In the case above, was the change-related decision made by someone with the right decision-making authority (or under appropriate and documented delegation)? Was that decision/approval documented?
Change and risk management
With a $40 *billion* impact, this is no small matter. Careers and livelihoods are made and broken over far lesser impacts.
The risk here goes far beyond IT/operational risk. This is true-blue, boardroom-level organizational risk.
One of the key roles of the CIO is to operationalize the board’s appetite for risk in day-to-day operational practices.
It’s not my place to presume what Facebook’s risk appetite is or should be. What is absolutely true, however, is that Facebook, Twitch and every other organization must ensure that their IT operational practices uphold their organization’s governance objectives for risk management.
Do stakeholders care that a failed change had $40 billion financial impact on the company?
The Facebook Engineering article was instructive in that it revealed that safeguards were in place to mitigate this particular failure mode (automated assessment of commands for error before execution). This is a great example of risk mitigation controls operationalized in daily practice.
It’s a means of risk management/mitigation.
It’s one of, I’m sure, many ways that governed risk appetite is operationalized.
But it wasn’t enough. In this case, the impact was severe, and measures to mitigate the impact of this particular risk failed.
One of the tools of risk management is the formalization of risk scenarios – what could go wrong, and what would be the impact if it did. From that, based on the likelihood of that particular scenario actually happening, compensating controls and risk mitigation measures are implemented.
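To make the idea of a formalized risk scenario concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `RiskScenario` class, the likelihood-times-impact scoring, the appetite threshold, and all of the numbers are made up, and none of it reflects how Facebook or any other organization actually scores risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskScenario:
    """One entry in a hypothetical risk register."""
    name: str
    likelihood: float  # assumed: estimated probability per year, 0.0 to 1.0
    impact_usd: float  # assumed: estimated cost if the scenario occurs

    @property
    def expected_loss(self) -> float:
        # A common simplification: expected loss = likelihood x impact.
        return self.likelihood * self.impact_usd


def scenarios_needing_controls(scenarios, appetite_usd):
    """Return scenarios whose expected loss exceeds the appetite, worst first."""
    over = [s for s in scenarios if s.expected_loss > appetite_usd]
    return sorted(over, key=lambda s: s.expected_loss, reverse=True)


# Illustrative figures only, not real data.
register = [
    RiskScenario("Backbone command disconnects data centers", 0.02, 1_000_000_000),
    RiskScenario("Routine patch briefly degrades a service", 0.50, 200_000),
]

flagged = scenarios_needing_controls(register, appetite_usd=1_000_000)
for scenario in flagged:
    print(f"{scenario.name}: expected loss ${scenario.expected_loss:,.0f}")
```

With these invented numbers, only the backbone scenario exceeds the assumed appetite, so that is where mitigation measures would be prioritized. A real register would likely use richer scales than a single dollar figure.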
When it comes to managing changes – this is precisely the point. Change Management must instantiate operational controls in alignment with governed risk appetite.
Change Management should also oversee the Change Outcome Performance of the various value streams that produce changes. Where value streams are consistently achieving the governing body’s change outcome expectations, no additional action is required.
Where value streams fall short of achieving change outcome expectations, Change Management must work with the value stream to (re)align operational practices with governed expectations.
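As a rough illustration of monitoring Change Outcome Performance, the sketch below computes a change failure rate per value stream and lists the streams that exceed a governed expectation. The function names, the choice of failure rate as the metric, the 25% threshold, and the sample data are all assumptions of mine; an organization would substitute whatever outcome measures its governing body actually cares about.

```python
from collections import defaultdict


def change_failure_rates(changes):
    """Compute the failure rate per value stream.

    `changes` is an iterable of (value_stream, succeeded) tuples,
    a hypothetical record shape chosen for this sketch.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for stream, succeeded in changes:
        totals[stream] += 1
        if not succeeded:
            failures[stream] += 1
    return {stream: failures[stream] / totals[stream] for stream in totals}


def streams_needing_realignment(changes, max_failure_rate):
    """Return value streams whose failure rate exceeds the expectation."""
    rates = change_failure_rates(changes)
    return sorted(s for s, rate in rates.items() if rate > max_failure_rate)


# Made-up sample data: four changes per value stream.
sample_changes = [
    ("payments", True), ("payments", True), ("payments", True), ("payments", False),
    ("network", True), ("network", False), ("network", False), ("network", True),
    ("web", True), ("web", True), ("web", True), ("web", True),
]

rates = change_failure_rates(sample_changes)
needs_work = streams_needing_realignment(sample_changes, max_failure_rate=0.25)
print(needs_work)
```

In this sample, only the "network" stream falls short. The other streams meet the assumed expectation, so no additional action would be required of them.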
Change Management and Compliance
Which brings us to compliance: the need to demonstrate that certain controls are in place and consistently achieved. Most organizations have some level of external, regulatory compliance expectations. Depending on the country and industry, a great many compliance frameworks may apply: Sarbanes-Oxley, PCI DSS, GDPR, HIPAA, NIST SP 800-53 and many more.
Each organization must understand what compliance standards apply to its situation, and must be able to demonstrate compliance. To ‘demonstrate compliance’, achievement of the specified controls must be auditable. In other words, where controls require certain actions to be taken (e.g. “Changes must be approved”), evidence of these actions (“artifacts”) must be maintained and auditable.
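A simple way to picture “auditable evidence” is a check that every change record carries its approval artifact. The sketch below assumes a hypothetical record format (the `change_id`, `approved_by`, and `approved_at` fields are my own invention, not taken from any particular tool or framework) and reports the changes an auditor could not trace to an approval.

```python
def changes_missing_approval_evidence(change_records):
    """Return IDs of changes that lack a recorded approver or approval time.

    Each record is a dict in a hypothetical shape assumed for this sketch.
    """
    missing = []
    for record in change_records:
        approver = record.get("approved_by")
        approved_at = record.get("approved_at")
        if not approver or not approved_at:
            missing.append(record["change_id"])
    return missing


# Made-up records for illustration.
records = [
    {"change_id": "CHG-001", "approved_by": "j.doe", "approved_at": "2021-10-01T14:02:00Z"},
    {"change_id": "CHG-002", "approved_by": None, "approved_at": None},
    {"change_id": "CHG-003", "approved_by": "a.lee"},  # approval time never recorded
]

unaudited = changes_missing_approval_evidence(records)
print(unaudited)
```

Note that the approval itself could come from an automated pipeline gate rather than a person in a meeting. The point is only that the evidence exists and can be produced on request.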
There’s an entire body of knowledge around compliance and audit that I won’t attempt to address here. But, suffice it to say that these are very serious matters to organizations, especially publicly traded companies, where legal actions can be taken against companies and company leaders who fail to meet their regulatory obligations.
Which helps put Change Management in its more somber light: Change Management controls are called for in many compliance frameworks, and must be operationalized in daily practice.
Change Management is not the same as CAB
Notice that I haven’t said “CAB”, or external review.
For those that don’t differentiate the two, please listen closely – Change Management is about aligning operational practices with governed change outcome expectations – consistent with the organization’s governance, risk and compliance requirements.
It is NOT about external reviews. It’s not about inspecting each-and-every change. And it’s certainly not about Change Advisory Boards.
As mentioned above, Facebook had a control in place to address this very specific form of failure. However, the engineering article also says “Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.”
As an outsider, it’s impossible to know the inner working of Facebook, but we have some solid clues in what’s been publicly revealed.
Facebook had identified this particular risk scenario and had implemented operational controls to mitigate it. (This is why they implemented the command audit tool.)
Clearly, they understood the impact of such a failure. A massive failure of this nature was NOT within the governing body’s risk appetite, so they put those controls in place.
This is Change Management in a modern context.
Based on decades of IT Operations experience, I guarantee that this incident didn’t get swept under the rug. Nor do I believe that they held a blamestorming session to pin blame on the network engineer who made the change.
What I believe is that they went to work on updating the risk scenario, adding what they learned from this particular outage.
This is what learning organizations do.
In the end, the goal is to ensure the governing body’s appetite for risk is consistently realized in daily operations.
Continual improvement and continual alignment.
But I don’t work for Facebook
Most of us don’t work for a tech giant like Facebook. What does all this mean to me, in my organization?
I talk with a lot of people around the world who work with Change Management in some form, and so much of what I hear about is a very process-heavy emphasis on inspecting every change.
The lesson of these Facebook outages isn’t that we should do more inspection. On the contrary, we’d do well to instead seek to understand the organization’s governance, risk and compliance expectations.
Modern organizations implement far too many changes daily to rely solely on any form of CAB or external inspection.
Instead, Change Management must work with change-producing value streams to identify and establish controls that operationalize the governing body’s change outcome expectations.
The emphasis must shift from process compliance to (change) outcomes. Change Management must monitor the results of how changes are handled in the individual value streams, and where they fall short, work in partnership to improve.
As new risk scenarios are identified, or new learnings are added to existing ones (as in the Facebook example), Change Management must work with the value streams to ensure the updated risk is adequately addressed – in the value stream.
In this way, Change Management must become dynamic and adaptive. The organization’s appetite for risk is anything but static.
As much as practical, oversight of individual changes is delegated to the value streams, and Change Management is elevated to a governance and oversight function.
Far from obsolete, Change Management is once again in the spotlight and rightfully so. The stakes have never been higher, and the same ole, same ole just won’t cut it.
It’s a whole new game, and a challenging and exciting time to be in Change Management.