Automation is a key component in the management of complex, modern IT systems. Automation helps teams avoid mistakes, increase reliability, and reduce toil in their day-to-day tasks. While building a production environment might rely on a number of automation tools, the lifecycle of that environment will include unplanned incidents and other work that is often performed manually.
Human mistakes during an incident can increase the time to resolution and even make the problem worse. When our systems experience incidents during non-working hours, our team might be away from their computers, unavailable, or even asleep. We want to minimize the number of incidents that require human intervention and alert responder teams only when an incident needs a human element.
Who Is This For?#
This resource is for teams that develop or operate software applications and want to make effective use of automation tools during their incident response process.
What is Covered?#
Automation Use Cases in IT#
Many teams already use a lot of automation to help get their tasks accomplished in a reliable, repeatable way. This section touches on some examples in:
Automating the Incident Response Process#
Automation can help your team respond to incidents more effectively and efficiently. This section covers the workflows of responding to an incident:
- Team Alerting and Orchestration
- Triage and State Analysis
- Business Communication
- Automation of Remediation, including Self-Healing Systems and Runbooks
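To make the workflow above a little more concrete, here is a minimal sketch of how triage, automated remediation, and team alerting might fit together. It is not taken from any particular tool: the `Alert` and `Outcome` types, the `RUNBOOKS` registry, and the two runbook functions are all hypothetical names invented for illustration, and the runbooks only simulate success or failure rather than touching a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    service: str
    kind: str  # hypothetical alert type, e.g. "disk_full"


@dataclass
class Outcome:
    alert: Alert
    remediated: bool = False
    paged: bool = False
    notes: List[str] = field(default_factory=list)


# A runbook takes an alert and reports whether it fixed the problem.
Runbook = Callable[[Alert], bool]


def clear_temp_files(alert: Alert) -> bool:
    # Placeholder: a real runbook would act on the affected host.
    return True


def restart_service(alert: Alert) -> bool:
    # Placeholder: simulates a remediation attempt that does not work.
    return False


# Hypothetical registry mapping alert kinds to automated runbooks.
RUNBOOKS: Dict[str, Runbook] = {
    "disk_full": clear_temp_files,
    "process_down": restart_service,
}


def handle_alert(alert: Alert, runbooks: Dict[str, Runbook] = RUNBOOKS) -> Outcome:
    outcome = Outcome(alert)

    # Triage: do we have an automated response for this kind of alert?
    runbook = runbooks.get(alert.kind)
    if runbook is None:
        outcome.notes.append("no runbook registered; needs a human")
        outcome.paged = True
        return outcome

    # Remediation: try the runbook, treating any exception as a failure.
    try:
        outcome.remediated = runbook(alert)
    except Exception as exc:
        outcome.notes.append(f"runbook raised {exc!r}")

    # Alerting: only page the responder team when automation did not fix it.
    if outcome.remediated:
        outcome.notes.append("remediated automatically; no page sent")
    else:
        outcome.notes.append("automated remediation failed; paging responders")
        outcome.paged = True
    return outcome
```

In this sketch, `handle_alert(Alert("web", "disk_full"))` is resolved without paging anyone, while an alert with an unknown kind, or one whose runbook fails, is escalated to a human. The notes collected on the outcome are the kind of record a team could reuse for business communication afterwards.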
Automation for Incident Remediation#
Actually fixing issues after getting alerted is the next step in your team's journey to uninterrupted sleep. This section is a more in-depth discussion of managing automation for incident remediation processes.
Getting Started with Automated Incident Resolution#
Some things to keep in mind when you are working on automation for Incident Remediation:
Automation in Regulated Environments#
Regulation can present unique and interesting challenges when teams are automating workflows.
Challenges to Automation#
Not everyone will be enthusiastic about the prospect of automating parts of their job, even if they don't particularly like some of the tasks. There are challenges to introducing automation goals to already established teams. Some of these challenges are well understood and others are more abstract. We can refer to decades of research on systems automation for some tips and guidance.
References and Further Reading#
These are some of the references we used to create this document. If you have suggestions for additions to this list, let us know!
This documentation is provided under the Apache License 2.0. In plain English, that means you can use and modify this documentation and use it both commercially and for private use. However, you must include any original copyright notices and the original LICENSE file.
Whether you are a PagerDuty customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account. Feel free to fork the repository and use it as a base for your own internal documentation.