“IT is becoming more integrated into the business of organizations, and more integrated with the work of their staff and customers. As a result, the pace and frequency of change – changing requirements and IT service changes to address the requirements – is increasing. We are also in a period of rapidly changing technology. At the same time business users are expecting and needing high-quality IT services. Therefore, organizations need to cope with change without adversely impacting on the quality of IT services”
This is what we all face in our daily jobs in the world of IT. Its why agile and DevOps, and along with it, Site Reliability Engineering have emerged as the at-speed practices of choice for fast-moving organizations.
The quote above resonates with modern IT challenges and the radically focused DevOps and agile approaches to IT value delivery. But, it may shock you that the quote actually dates back to 1989 from, of all unlikely places, the inaugural version of ITIL in the Change Management volume.
The challenge that hasn’t gone away is to this very day: a great many operational issues (read; downtime) are caused by changes.
At this point, nearly no one would argue that a traditional CAB-based change program works well in organizations fully embracing these modern approaches. That’s just a silly argument to make.
Equally silly, I submit, is the idea that “we don’t need no stinkin’ change management”. In March of this year (2019), Facebook and Instagram were completely off line for 24 hours, potentially impacting an estimated 3 billion users worldwide.
The reported cause was “a server configuration change…“. But it wasn’t just users who couldn’t keep up with the latest gossip from friends and family. “The Facebook outage also affected the Ads Manager tool, which is infuriating to the 6 million advertisers who spend more than $90 million per day to reach potential customers…”
More recently (June 15, 2019), Target stores experienced a couple hours of nation-wide outage, owing to an “…error made during routine maintenance.” Fortune reports that analysts estimate the financial loss of the Target outage at $50 million.
These organizations are highly regarded in the industry and their track record for systems availability speaks for itself (How many times can you recall Facebook, Instagram or Target being down for any noticeable length of time?) These companies were recently list by TechBeacon among 10 companies killing it at DevOp. These are adopters of, and innovators in modern methods and are widely viewed as Exhibit A of what modern IT delivery looks like.
Down time. Customer impact. Financial impact.
At one point, Target stock price was down as much as 2% as investors tried to “…gauge how much the retailer could have lost in revenues and store traffic.” For its part, Facebook’s shares tumbled over 7% following the outage and other unrelated bad-news.
So, for the thousands of changes these organizations make every day, some don’t go as planned. And contrary to popular belief, all the modern tools and automation in the world, can’t protect an organization from the devasting, customer-impacting effects when changes go badly wrong.
Risk management and organization governance
What modern testing and Continuous Integration/Deployment tools do is radically change the risk profile associated with managing changes. For the most part, modern technology can facilitate changes such that, even if it goes badly, is little more than a blip. The ‘blast radius’ (how much of the organization is at risk if/when a change goes badly) can be reduced to a single application/system and even a single feature within that system.
We’ve also engineered recovery from failed changes down to seconds or minutes. Changes that go badly can, at least in theory, be instantly rolled back; no harm, no foul. This, I believe, is why Facebook and Target and a great many high performing organizations, aren’t well known for downtime.
But despite all this, as we all recently experienced, changes do fail, and in some cases, recovery isn’t a transparent, instantaneous rollback, and the impact to the company is high. Very high.
Publicly-traded companies are obligated to operate in the best interests of their stockholders. Corporate officers are held personally liable for practices that are either not consistent with regulatory requirements or stakeholder interests. Corporate officers and board members can, at least in theory, go to jail for failing to fulfill their duty to the corporation.
The stakes are so high, I’m left pondering – if executives at either of these organization were aware that these particular changes were being implemented, and that at stake was extended and significant impact to the company. I’m quite sure the answer is ‘no’, which means (stay with me here) that risk-based decisions, impacting billions in market capitalization, are being made by operational staff.
What’s the point?
Perhaps I should pause to clarify a bit. I’ve been in IT operations all my professional life – the first 25 years at a Fortune 10 company. I’m no stranger to major incidents, and I would never disparage my IT colleagues at Facebook, Target or anywhere else. They are excellent, in a league all their own.
I’m also not throwing shade on agile, DevOps and SRE. They have proven their value to organizations all over the world.
My point is simply this. Changes continue to cause significant business impact, and I don’t see that going away any time soon. Stuff happens, and companies like Facebook and Target aren’t country bumpkins. They know what they’re doing, even on the risk and compliance fronts.
But that’s not true for every organization.
One size doesn’t fit all
Agile, DevOps, or any other methodology isn’t a one-size-fits-all. There’s always a context in which they’re adopted. If the top companies, those early adopters and innovators of modern IT methodologies, as well funded, staffed and organizationally supported as they are. If they aren’t immune to catastrophic change-related failures, then neither is your organization.
The question isn’t whether or not you’re doing what the big players are. They don’t have your challenges and circumstances, and you don’t have theirs. If you’re like many IT teams around the world, you likely don’t have their level of technology investment and resources either. The real question is are you meeting your organization’s expectations for managing change-related risk.
Don’t buy off on ‘doing what the cool kids are doing’. There’s much that any organization can learn from the big players. Modern at-speed approaches have changed the pace of the game. But they haven’t eliminated the need for effective Change Management.
Learn from the best, apply and use what helps your organization better achieve its goals.