Getting Started
We can take a multistage approach to implementing automated remediation in our production systems. Automation should be reliable and consistent, so as we build it out, we’ll also want to keep these goals in mind:
- The procedure should be testable
- The procedure should be flexible and built to accommodate future improvements
- The procedure should be reviewable by someone else as a check
- The procedure should be put under version control
- The procedure should be applied to all related resources in the same way; “one-off” changes should be discouraged
- The procedure should be repeatable and auditable
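One way to make a procedure repeatable, auditable, and testable is to wrap it so that every run is recorded in the same shape and can be rehearsed without changing anything. The following is a minimal sketch under stated assumptions, not a prescribed implementation: the `AuditedProcedure` class, its field names, and the in-memory `audit_log` list are all hypothetical and stand in for whatever tooling and log storage your team already uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class AuditedProcedure:
    """Hypothetical wrapper: applies one action to every resource the same way."""
    name: str
    action: Callable[[str], str]  # takes a resource ID, returns a short result summary
    audit_log: list = field(default_factory=list)

    def run(self, resources, dry_run=True):
        """Apply the action to each resource and record what happened.

        With dry_run=True (the default) nothing is changed, which makes the
        procedure safe to exercise in tests and reviews.
        """
        for resource in resources:
            result = "dry run: no change made" if dry_run else self.action(resource)
            self.audit_log.append({
                "time": datetime.now(timezone.utc).isoformat(),
                "procedure": self.name,
                "resource": resource,
                "dry_run": dry_run,
                "result": result,
            })
        return self.audit_log
```

Because the wrapper loops over a list of resources, the same action is applied to all related resources rather than to a single one-off target, and the file itself can live in version control and go through review like any other code.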
Reducing Noise
Before even thinking about automating processes, take a long, hard look at the alerts being generated by your systems and ask yourself these questions:
- Are there unactionable alerts?
- Are there alerts that are overly complex?
- Are there alerts that should be fixed in engineering?
Round 1 of automating incident response is to ensure that the alerts coming through are useful and can be fixed in the production environment. You can find more guidance on creating useful alerts in our Ops Guide on Incident Response. Good alerts contain an appropriate amount of useful information about the impacted system, are rated in line with their impact on users, and can be remediated in production under normal conditions. If your systems generate alerts that can’t be fixed via changes to the production environment, send them back to engineering. For example, when moving to a distributed services model in a cloud, you might see alerts that point to a need to increase the timeouts for requests to remote services. This is an expected performance tradeoff for the architectural change, and the timeout for those types of requests often needs to be increased, which is an engineering change rather than a production fix.
Identify Candidate Workflows
Once you have the alerts cleaned up, take a look at the data you have on the number of alerts that fire, when they occur, and what their impacts are. This will give you candidates that will have the most impact on your responders when you create automated remediation. Create a list of potential alerts that can be automatically remediated based on their volume or simplicity. Potential candidates for your first round of automation might be alerts for single subsystem issues, like disk space warnings, or alerts that already have a manual runbook that can be automated.
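The selection described above can be sketched as a small ranking over alert history. This is only an illustration under assumptions: the `AlertEvent` fields, the `min_count` threshold, and the rule "frequent and either single-subsystem or already covered by a runbook" are hypothetical choices, and your own data will likely suggest different ones.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    """Hypothetical record of one alert firing."""
    alert_name: str
    has_runbook: bool          # a manual runbook already exists
    subsystems_involved: int   # 1 means a single-subsystem issue


def rank_candidates(events, min_count=2):
    """Return (alert_name, count) pairs, most frequent first.

    An alert qualifies if it fired at least min_count times and is "simple":
    it touches a single subsystem or already has a manual runbook.
    """
    counts = Counter(e.alert_name for e in events)
    latest = {e.alert_name: e for e in events}  # last event seen per alert name
    candidates = []
    for name, count in counts.items():
        event = latest[name]
        simple = event.subsystems_involved == 1 or event.has_runbook
        if count >= min_count and simple:
            candidates.append((name, count))
    # Sort by descending count, then by name so the order is stable.
    return sorted(candidates, key=lambda c: (-c[1], c[0]))
```

Feeding this a few weeks of alert history gives a short, ordered list to discuss with responders before committing to any automation.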
Evolving Automation Components
Your first automated remediation targets should be well defined and well contained. Part of building up trust in your automation tools will come from creating cumulative successes, so start with a small collection of alerts to automate. Keeping the first set all within a single team of responders can help with training and communication.
Use your data set to determine the performance of your automation efforts. Ask yourself questions like:
- Is the team seeing a reduction in alerts?
- Are incidents still getting resolved in a timely and correct manner?
- Has there been any negative impact to the customer?
- Have you reduced the amount of toil the team is required to do on a daily basis to support the services?
As the illustration above shows, you might want your team to implement automation in phases, allowing the automation to run but still alerting a human responder as a check. Even before that, you can build trust in the automation by alerting a team member and having that person initiate the automation process. Over time, this confirms that the automation runs as expected and also gives the team background knowledge of what the automation is intended to do. Unfamiliar automation can have a negative impact on responder teams who aren’t sure what behaviors might have triggered the automation or what its side effects are.
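The phased approach can be expressed as a small dispatch on the current phase. The sketch below is an assumption-laden illustration: the `Phase` names, the `handle_alert` function, and the `remediate` and `notify` callables are hypothetical, and the third phase (running without notifying anyone) is an assumed end state that the text above implies rather than spells out.

```python
from enum import Enum


class Phase(Enum):
    HUMAN_INITIATED = 1   # a responder is alerted and starts the automation
    AUTO_WITH_NOTIFY = 2  # automation runs, responder is still alerted as a check
    FULLY_AUTOMATED = 3   # assumed end state: automation runs without paging anyone


def handle_alert(phase, remediate, notify, approved=False):
    """Run or defer a remediation depending on the rollout phase.

    remediate: zero-argument callable that performs the fix.
    notify: callable taking a message string for the responder.
    approved: in HUMAN_INITIATED, True once a responder has asked to proceed.
    Returns True if the remediation ran.
    """
    if phase is Phase.HUMAN_INITIATED:
        if not approved:
            notify("Remediation is available; waiting for a responder to start it.")
            return False
        remediate()
        return True

    remediate()
    if phase is Phase.AUTO_WITH_NOTIFY:
        notify("Automated remediation ran; please verify the result.")
    return True
```

Keeping the phase as an explicit setting lets a team move one alert forward at a time, and move it back if the automation misbehaves.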
Some teams might also want a place for the automation to report a status for later tracking and trend determination. For example, your automation might be clearing unused files out of a cache directory to free disk space, but if this starts happening more and more often, your team will want to engage and find the underlying cause. The automation can only do so much.
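The cache example can be sketched in two parts: the cleanup itself, and a check on how often it has been running. Everything here is an assumption made for illustration: "unused" is approximated as "not modified within `max_age_seconds`", and the function names, the window, and the run-count threshold are hypothetical values your team would choose for itself.

```python
import time
from pathlib import Path


def clear_stale_files(cache_dir, max_age_seconds, now=None):
    """Delete files in cache_dir not modified within max_age_seconds.

    Returns the number of bytes freed so the run can be reported as a status.
    Only regular files directly inside cache_dir are considered.
    """
    now = time.time() if now is None else now
    freed = 0
    for path in Path(cache_dir).iterdir():
        if path.is_file():
            stat = path.stat()
            if now - stat.st_mtime > max_age_seconds:
                freed += stat.st_size
                path.unlink()
    return freed


def needs_investigation(run_times, window_seconds, max_runs, now=None):
    """Return True if the cleanup ran more than max_runs times in the window.

    run_times is a list of Unix timestamps, one per past cleanup run.
    """
    now = time.time() if now is None else now
    recent = [t for t in run_times if 0 <= now - t <= window_seconds]
    return len(recent) > max_runs
```

When `needs_investigation` returns True, the right response is to involve the team rather than to run the cleanup yet again, since the rising frequency points to an underlying cause the automation cannot fix.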
This is a good point to report your efforts to other teams to highlight what you’ve learned about the process and how the automation is making the operation of systems and services better.
Maintaining Automation
Adding an automation component to your production incident response means tracking it for updates whenever the services it works on are updated. Downstream activities might be impacted by changes to things like service names or command options. Remediation automation components, if they meet the goals mentioned above, will be testable and checked into version control. They travel the software development lifecycle with the services that they support, either in their own repository or in the project repository. Make sure you have a plan for how they are updated, tested, and released when new versions of your services are deployed.
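One small check that can run alongside a service release is to confirm that every remediation still points at a service that exists. This is a hedged sketch only: the mapping of remediation names to service names and the `find_stale_targets` helper are hypothetical, and a real check would read both sides from wherever your team keeps them.

```python
def find_stale_targets(remediations, deployed_services):
    """Return remediations whose target service is no longer deployed.

    remediations: dict mapping a remediation name to the service name it acts on.
    deployed_services: set of service names in the current release.
    """
    return {
        name: service
        for name, service in remediations.items()
        if service not in deployed_services
    }
```

Run as part of the release pipeline, an empty result means no remediation was orphaned by a rename; a non-empty result lists the remediations to update before the new version ships.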