Resources

Here’s What Goes Catastrophically Wrong When You Don’t Follow Your IT Process

IT Process

On November 18, 2014, Microsoft Azure went dark. Thousands of the cloud computing service’s customers experienced downtime on their sites for over 9 hours. When they flooded Microsoft customer support to ask what the hell had gone wrong, customers learned that it wasn’t some glitch, natural disaster, or devious hacking scheme.

It was pure human error.

Microsoft deployed an update without running through the standard operating guidelines specifically laid out for this scenario. Instead of rigorously checking that the update was good to go, engineers shipped it on the assumption it was bug-free. This wasn’t just this one-time incident—engineers at Azure regularly violated standard operating procedure because 99.9% of the time, it was a total waste of their resources.

And it’s not just Azure that does this—habitual violation of process leads to a ton of mistakes in all kinds of IT arenas.

When I was 16, I landed my first “real” job which so happened to be in IT. I had just passed my CCNA with the help of my dad (apparently the youngest person in Australia at the time) and was hungry to get my hands on some real technology.

But once I started, I found there was rigid process everywhere, and configuring a live Windows 2000 server was a totally different experience to tinkering with my fathers machines at home, with serious consequences for not following the process.

As Microsoft hardware designer Dan Luu writes,

How many technical postmortems start off with “someone skipped some steps because they’re inefficient”, e.g., “the programmer force pushed a bad config or bad code because they were sure nothing could go wrong and skipped staging/testing?”

His answer: a lot. In fact, a lot of major IT mess-ups—the high-profile Sony hacks, cloud computing outages, downtime on major websites—are totally preventable. They happen because IT people don’t follow the processes put in place to prevent them from happening. What’s more, they don’t even care, until it becomes a catastrophe.

This behavior is actually a widely-studied sociological phenomenon called the “normalization of deviance”—and it has huge implications for your computing security.

Why Deviance Becomes the Norm

The term “normalization of deviance” was coined by Columbia University sociology professor Diane Vaughan. She describes it as: “a gradual process that leads to a situation where unacceptable practices or standards become acceptable, and flagrant violations of procedure become normal—despite that fact that everyone involved knows better.”

Essentially, it’s when negligence becomes the norm. In IT scenarios, it’s often most noticeable when a new person joins the team. Here’s common scenario, in Dan Luu’s words:

[New person joins the team, discovers bad IT processes]
New person: WTF WTF WTF WTF WTF
Old hands: Yeah we know, we’re concerned about it.
New person: WTF WTF wTF wtf wtf w…
[New person gets used to bad IT processes]

[New person #2 joins]
New person #2: WTF WTF WTF WTF
New person #1: Yeah we know. We’re concerned about it.

“The thing that’s really insidious here,” Dan writes, “is that people will really buy into the WTF idea, and they can spread it elsewhere for the duration of their career.” Things that should be eradicated are totally normalized, whether it’s not wearing gloves in a hospital or not double-checking that you have the right inventory inside PlayStation packaging.

As security expert Bruce Schneier argues, normalization of deviance has huge implications in the field of IT, especially because it’s easy for problems to go unnoticed until they reach a large scale.

Here are a couple scenarios of what can happen when IT deviates from processes.

1. Hackers Steal Your Customers’ Data

For a companies like the social sharing tool Buffer, trust is a crucial. Customers trust Buffer with their passwords and login info to essentially post things to Facebook and Twitter for them. And when Buffer was hacked in 2013, customers experienced perhaps the most flagrant violation of that trust: Buffer used their logins to post spam.

Confirmation-About-Hacking-Of-Buffer-In-Tweet-800x474

Buffer was extremely open about the security breach, sending customers regular updates about the app’s functionality and ultimately, a post-mortem explaining what had happened—including a list of security measures they put in place after the fact.

The real problem with the hack was that once hackers got into Buffer’s system, the data was right there and totally usable since Buffer hadn’t encrypted any of the data.

This was of the most important measures Buffer implemented after the fact. As they told customers, “we have added encryption of OAuth access tokens and we have changed all API calls to use an added security parameter.” Problem was, it was too late.

Deviant Behavior: Not Encrypting Data

It’s the classic “out of sight, out of mind” problem, the same way that you don’t need health insurance until you’re diagnosed with a rare skin disease. If you’re planning on getting hacked, it’s not worth the effort to encrypt data. But then again, who plans on getting hacked, or getting a rare skin disease?

It’s basic common sense that every IT department needs to follow. They really need regularly check up on and enforce these processes.

As Auth0 says, normalization of deviance in security is totally normal, especially when companies are focused on other things. “Moreover, security gets de-prioritized on your roadmap because it’s hard to build and isn’t immediately urgent.”

They table security to tackle more pressing issues. By consistently doing so, they get accustomed to sweeping this deviant behavior under the rug to the extent that they don’t even consider it deviant anymore. It’s totally normalized, even though any outsider would recognize that it’s bad practice and needs to be fixed immediately.

What You Can Do

This de-prioritization is why a lot of companies elect to opt out of important security steps like multi-factor authentication or encrypting data, which saved the day in the Patreon hack. Because the data was encrypted, no one’s credit card information was stolen.

An easy way to circumvent this deviant behavior is to outsource the issue to another provider. Tools like Auth0 take care of your security needs, providing code for features like multi-factor authentication, so you don’t have to worry about it.

But even if your company has these in place, it’s important to regularly run through checklists to make sure everything is up to date. Try running Process Street‘s checklist on network security management every couple weeks, to provide accountability on a company-wide scale.

2. You Literally Lose Your Data

In 2010, Zurich Insurance was fined a record 2.28 million pounds for losing personal details on 46,000 policyholders, which happened because they lost an unencrypted data tape in a routine transfer.

iron_mountain_tape_vault

The real problem in this case—the one that caused the Financial Services Authority real concern—was that they weren’t adhering to guidelines that are in place. As the Telegraph reported, Zurich Insurance’s big blunder was that it didn’t have “controls in place to prevent the lost data being used for financial crime.”

It’s sloppy behavior, but more common than you might think. Gartner estimates 28 percent of corporate data is stored only on endpoint devices. You lose the device, you lose the data.

Deviant Behavior: Relying Solely on Endpoint Devices

The company certainly knew they should have had these “controls in place,” but somewhere along the line they had grown accustomed to not having them. The deviant behavior—only relying on endpoint devices—was normalized over time.

Zurich Insurance became so used to doing this the wrong way—the way that a new hire might say “WTF WTF WTF”—but everyone became accustomed to it.

The company didn’t do anything about it because it had fallen off their radar. Since the bad behavior was normalized, there wasn’t a pressing need to change their processes (or so they thought). The bad process only became a problem once it was too late.

What You Can Do

Changing deviant behavior is hard, but sometimes it’s just a matter of constantly reminding teams about values and reinforcing your IT processes. Take a look at Process Street’s Client Data Backup Best Practices checklist—having teams run through this every couple of weeks puts your company’s processes back on everyone’s radar.

This checklist in particular is a reminder that even the small stuff that you might think is irrelevant, like verifying that you have the right kind of equipment, labeling tapes properly, or setting up recurring reminders to automatically perform a differential backup, can make a huge difference.

Deviant behavior will actually feel deviant, which makes teams want to address it.

3. Your Server Crashes When You Need it Most

In 2011, Italian designer Margherita Maccapini Missoni, whose clothes often put you back for upwards of $4,000, developed a limited-edition line at Target. The release of the line was highly anticipated, and while it made long lines outside brick-and-mortar stores, it caused Target’s website to come crashing down.

DSC_0328

“It’s a little bit embarrassing for one of the nation’s largest retailers to have a Web site that can’t support a rush — it’s not like they’re any strangers to rushes,” said Ian Schafer, chief executive of the digital marketing firm Deep Focus.

They managed to handle a lot of other rushes, but not this one. Why? That testing went out the window. They figured it wouldn’t be a big problem, so they neglected to perform the tests they normally would for a big event like Cyber Monday.

There are a lot of reasons servers crash, but it’s no excuse for not preparing or preventing the scenario from happening. It’s a big problem with customer retention. 74% of visitors leave your site after hitting a 404 error page—and they’re much less likely to return in the future.

Deviant Behavior: Neglecting Server Maintenance

The problem here is that IT department knew that they should be doing better, but they ignored it. It’s really common. As CEO Steve Klein of StatusPage.io says, there’s a good reason people put off doing server maintenance, or creating an external status page:

The idea of creating and maintaining a status page outside of your infrastructure is like a cold that won’t go away. You should probably see the doctor, but keep putting it off until suddenly you’re on your way to the ER coughing up a lung. Similarly, a status page is one of those things you should create before you experience downtime, but end up building at 4am in the morning when your hosting provider goes down.

The behavior is deviant, but everyone is so accustomed to it, that they don’t take action. You don’t care that behavior is deviant until it becomes a problem. And oftentimes, like in the case of the Target crash, they don’t prepare for those emergencies.

What You Can Do

IT should regularly run through a checklist to make sure they’re not missing any steps of routine server maintenance. Check out Process Street’s server maintenance checklist for a template you can customize to your own needs. Setting regular reminders will help your team do the important stuff like setting up an external status page so that if your site goes down, your visitors won’t encounter that dreaded 404 error.

It’s especially important to run through this checklist if you’re anticipating any changes to your server, like a surge in traffic.

When Paper magazine released nude photos of Kim Kardashian, they knew they ran a risk of their servers practically melting. Their response? Anything but deviance—they strictly adhered to processes, rigorously checking that the site would stay functional. You can test the server load capacity of your store with tools like LoadImpact.com or Blitz.io.

275c8de01343b6cd1ea78fb4c018ddba

Bottom Line: Battle the Culture of Complacency

It’s really easy to get complacent, especially when we’re so reliant upon technology to do the work for us. It’s easy to let things slide, but it can have huge repercussions. There’s no excuse for human error, especially when there are so many automation tools at your disposal to prevent them.

Checklists are a powerful tool to hedge against human error and the normalization of deviance. They’re a tool surgeon Atul Gawande uses to combat the very same problem he’s witnessed in his operating room. Essentially, deploying checklists is a reminder that our deviant behavior is actually deviant, and not normal. It prevents us from getting accustomed to bad behavior.

This changes the very culture behind adhering to processes. As Atul writes in the Checklist Manifesto, checklists help fundamentally change the mindset teams enter when they do their work. Rather than just getting complacent, and normalizing deviant behavior, they are reminded to be vigilant and disciplined.

That’s because checklists are about more than just adhering to rules. As Atul writes, “Just ticking boxes is not the ultimate goal here. Embracing a culture of teamwork and discipline is.” Checklists restrain our natural instinct to normalize deviance. In the case of IT, it’s easy to get complacent, especially when everything appears to be running just fine.

So if you don’t want to see your company’s name in the headlines of the next high-profile hack, checklists might be the best tool in your belt.

Get our posts & product updates earlier by simply subscribing

Vinay Patankar

CEO and Co-Founder of Process Street. Find him on Twitter and LinkedIn.

Leave a Reply

Your email address will not be published. Required fields are marked *

Take control of your workflows today