Automation for Incident Resolution
Responding to incidents and alerts also presents opportunities for automation in some environments. While automation is common in the creation steps of technical products, it is less common in the longer-term maintenance of those systems. Automation improves the health of our running systems and helps us better manage issues that arise. When considering automation for incident remediation, keep in mind that in complex systems, failure is inevitable. Our automation goals at this stage aren’t to prevent failures, but to deal with failure swiftly when it happens and to optimize our response to it as much as possible.
Creating monitoring and health checks for production systems is fairly ubiquitous, since even organizations that don’t treat IT as a customer-facing product still rely heavily on IT services to be functional and performant. As long as all services are up and running, everything is fine. When something goes wrong, what happens next could be chaos, or it could be a well-managed practice of contacting responders and remediating issues. Automation can be employed from the first blip or hiccup, including how the correct team is contacted, how they are able to respond, and whether there is additional infrastructure to support troubleshooting and remediation. Greater efficiency and the intentional use of automation, even in these early phases of remediation, reduce the time it takes to acknowledge and repair issues that arise.
Making changes safely is particularly important when a team is attempting to fix an incident, but unplanned changes during troubleshooting are often made manually. When a service is unavailable or not performing as expected, a mistake caused by a manual process can delay the resolution. It could even make matters worse. If an incident responder makes a copy-and-paste error, skips a step in a runbook, or executes a command in the wrong terminal, any number of unpredictable things could happen. So, we look to employ automation in our remediation processes to mitigate this unpredictable risk.
We want automated remediation for many of the same reasons we want automation in general: as systems increase in number and complexity, the amount of information needed to run and maintain them effectively also increases. The decision to auto-remediate alarms from certain inputs might rest on several considerations:
- How often the alarm triggers.
  - Reducing the noisiest alarms first yields the greatest gain.
- How often the alarm is non-impacting to end users at first instance.
  - Early warnings, such as disk usage, can be dealt with by automation.
- Whether the first step in manual remediation is always the same for the alarm.
  - If a human typically restarts a service to see whether that fixes the issue, the automation should perform that step.
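The considerations above can be sketched as a simple screening function. This is only an illustration under stated assumptions: the `AlarmStats` record, its field names, and the thresholds (20 triggers, 80 percent) are hypothetical and would need to come from your own alerting history and judgment.

```python
from dataclasses import dataclass


@dataclass
class AlarmStats:
    """Hypothetical summary of one alarm's history over a review period."""
    name: str
    triggers: int         # how many times the alarm fired
    non_impacting: int    # firings with no end-user impact at first instance
    same_first_step: int  # firings where responders began with the same manual step


def is_candidate(stats: AlarmStats,
                 min_triggers: int = 20,
                 min_ratio: float = 0.8) -> bool:
    """Return True when the alarm is noisy, usually harmless at first, and
    usually handled with the same first step. Thresholds are illustrative."""
    if stats.triggers < min_triggers:
        return False
    non_impacting_ratio = stats.non_impacting / stats.triggers
    same_step_ratio = stats.same_first_step / stats.triggers
    return non_impacting_ratio >= min_ratio and same_step_ratio >= min_ratio


def rank_candidates(alarms: list[AlarmStats]) -> list[str]:
    """List qualifying alarms, noisiest first, since those offer the greatest gain."""
    qualifying = [a for a in alarms if is_candidate(a)]
    ranked = sorted(qualifying, key=lambda a: a.triggers, reverse=True)
    return [a.name for a in ranked]
```

A team might run something like this over a month of alert history to produce a short list for discussion, rather than treating the output as a final decision.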
Teams may find that their alarms have a consistent set of solutions that can be automated. Creating this automation, with any of a number of tools, removes these alerts from the immediate attention of the team and lessens the potential for what we refer to as “alert fatigue.”
Alert fatigue occurs in a number of industries where workers are exposed to alerts and alarms so regularly that the alerts lose meaning. Large numbers of alarms, or alarms with high frequency, can cause responders to become desensitized over time. As responders become desensitized, their response times grow longer and the potential for mistakes increases when an important alarm does arrive. We see this in IT when a preponderance of low-urgency alerts is passed to responders in real time, 24 hours a day, instead of being added to a work queue, delayed to working hours, or otherwise managed.
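One way to keep low-urgency alerts from reaching responders around the clock is to route them by urgency and time of day. The following is a minimal sketch, not a description of any particular alerting product: the `Route` names, the `"high"`/`"low"` urgency labels, and the 09:00 to 17:00 weekday window are all assumptions made for illustration.

```python
from datetime import datetime, time
from enum import Enum


class Route(Enum):
    PAGE_NOW = "page_now"                   # interrupt a responder immediately
    WORK_QUEUE = "work_queue"               # add to the team's normal queue
    DEFER = "defer_to_working_hours"        # hold until the next working day


# Assumed working hours; adjust to the team's actual schedule.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def in_working_hours(now: datetime) -> bool:
    """True on Monday through Friday between WORK_START and WORK_END."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def route_alert(urgency: str, now: datetime) -> Route:
    """High-urgency alerts always page; low-urgency alerts never do."""
    if urgency == "high":
        return Route.PAGE_NOW
    if in_working_hours(now):
        return Route.WORK_QUEUE
    return Route.DEFER
```

With this rule, a low-urgency alert at 03:00 is held until working hours, while a high-urgency alert at the same time still pages a responder.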
IT teams can deploy automation to combat the factors that contribute to alert fatigue. While a dashboard may seem like a good idea because it can help eliminate the cacophony of beeps, chimes, chirps, and buzzes from alerts, a screen full of red status reports or flashing issues can be difficult to act on in a timely manner. When using a dashboard, teams that categorize their alarms by severity and urgency can also categorize them as targets for future automation. When everything else has been mitigated by automated processes, the team will have more capacity to deal with the alerts that do need human attention.
We also want to use automation when the response needs to be faster than a human could be expected to act. This might include production activities like autoscaling when a service is under heavy load or blocking IP addresses that repeatedly send bad requests. Depending on your use of IaaS platforms, you might already be making use of some of these functions that are built into the service.
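To make the IP-blocking case concrete, here is a small sketch of a sliding-window counter that flags an address after too many bad requests in a short period. It is an in-memory illustration only: the `BadRequestTracker` class, its thresholds, and the idea of feeding it timestamps by hand are assumptions, and a real deployment would more likely rely on a firewall, load balancer, or platform feature to enforce the block.

```python
from collections import defaultdict, deque


class BadRequestTracker:
    """Hypothetical tracker: block an IP after max_bad bad requests
    within window_seconds. Thresholds are illustrative."""

    def __init__(self, max_bad: int = 5, window_seconds: float = 60.0):
        self.max_bad = max_bad
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # ip -> timestamps of recent bad requests
        self.blocked = set()

    def record_bad_request(self, ip: str, timestamp: float) -> bool:
        """Record one bad request; return True if the IP is now blocked."""
        if ip in self.blocked:
            return True
        events = self._events[ip]
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        if len(events) >= self.max_bad:
            self.blocked.add(ip)
            del self._events[ip]  # history is no longer needed once blocked
        return ip in self.blocked
```

The point of the sketch is the speed of the decision: the check runs on every request, which is not something a human responder could do by reading logs.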
Machines are faster than humans at some tasks, and they don’t mind work that is boring and repetitive. As we build automation, we focus on the tasks with the most toil (i.e., those that require humans to do a lot of work of relatively low value). Those are the tasks that can be completed by automated processes.
Automation can help a team respond to incidents in a predictable and defined way. Your team may already be using documentation or guides like runbooks that prescribe the steps to take to remediate an issue. When runbooks can be performed using automation, fewer distractions and alerts will go to the human responders. Particularly for remediation tasks that are low value, like restarting services or clearing disk space, this work is better allocated to automation. The automation can then also be applied to multiple sets of similar systems.
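As a rough illustration of a runbook performed by automation, the sketch below runs an ordered list of steps and stops as soon as a health check passes. Everything here is hypothetical: the `Step` and `RunbookResult` types, the `run_runbook` function, and the idea that each step is a plain Python callable. In practice the actions would wrap whatever tooling the team already uses to restart services or clear disk space.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One runbook step: a human-readable description and the action to run."""
    description: str
    action: Callable[[], None]


@dataclass
class RunbookResult:
    resolved: bool        # did the health check pass by the end?
    steps_run: list[str]  # descriptions of the steps that were executed


def run_runbook(steps: list[Step], is_healthy: Callable[[], bool]) -> RunbookResult:
    """Run steps in order, re-checking health before each one.

    Stops early once the system is healthy, so later (often more
    disruptive) steps are skipped when an earlier one fixes the issue.
    """
    steps_run: list[str] = []
    for step in steps:
        if is_healthy():
            return RunbookResult(True, steps_run)
        step.action()
        steps_run.append(step.description)
    return RunbookResult(is_healthy(), steps_run)
```

If the result comes back with `resolved` set to `False`, that is the point at which the alert should go to a human responder, ideally with `steps_run` attached so they know what has already been tried. Because the steps are data, the same runbook can be pointed at multiple sets of similar systems.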