Early in my career, I worked on several massive systems that generated numerous alerts from different mechanisms throughout the day. The noise became so frequent and numbing that people came to treat these systems like “the boy who cried wolf.” Likewise, federal IT managers have become so accustomed to receiving frequent alerts that they may be ignoring the ones that signify critical issues.
In a perfect world, federal IT managers would receive only root-cause alerts for the highest-priority items, which could then be handled quickly and easily. We do not live in a perfect world. With so many alerting mechanisms deployed, nearly every system within an agency’s IT environment issues pings for every hiccup. Without an adequate framework, this can lead to alert overload that puts the integrity of your data and the security posture of your agency at risk.
Yes, there is a way to cut through the alert clutter quickly and easily, consolidate and prioritize the information you receive, and focus on the alerts that truly matter. Use the following three steps as a guide for your data center; your agency’s security and your stress level will both be better off.
Step One: Set a Goal
The most important step in increasing your signal-to-noise ratio is to measure and analyze the number and type of alerts you’re getting so that you can set specific goals for reducing the volume. From there, you should be able to group them into categories: for example, false positives (no alert should have been sent), informational (system was taken down for maintenance), and critical (something is broken and needs immediate attention).
Once you have categorized your alerts, try to automate the categorization as new alerts come in. Most email clients have rule processing that is sufficient for the job, as long as your current alerts follow a consistent structure.
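As a rough illustration of automated categorization, the sketch below sorts alert subject lines into categories using simple pattern rules and then counts them. The rule patterns, category names, and sample subjects are assumptions made for this example and are not taken from any particular monitoring tool. Subject-line rules cannot detect false positives on their own, so anything that matches no rule is left as "uncategorized" for manual review.

```python
import re
from collections import Counter

# Hypothetical rules: (category, pattern matched against the alert subject).
# The patterns are illustrative; adapt them to your own alert naming scheme.
RULES = [
    ("informational", re.compile(r"maintenance|scheduled|info", re.IGNORECASE)),
    ("critical", re.compile(r"down|failed|critical", re.IGNORECASE)),
]


def categorize(subject: str) -> str:
    """Return the first matching category, or 'uncategorized' for manual review."""
    for category, pattern in RULES:
        if pattern.search(subject):
            return category
    return "uncategorized"


def count_by_category(subjects):
    """Count how many alert subjects fall into each category."""
    return Counter(categorize(subject) for subject in subjects)


# Made-up sample subjects, for illustration only.
subjects = [
    "[CRITICAL] db01 disk failed",
    "Scheduled maintenance: web03 offline",
    "web07 CPU above 80%",
]
counts = count_by_category(subjects)
for category, total in sorted(counts.items()):
    print(f"{category}: {total}")
```

The per-category totals are the baseline numbers you need before you can set a reduction goal.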
Next, with a general structure and some data about your alerts (the number by category), you can start to set goals to reduce them. For example, start with a 20% reduction in weekly false positive alerts. Common sources of false positives in a data center include forgetting to set a node or device to maintenance mode before taking it down, and applying thresholds too broadly. If 20% is too large, decrease the goal to a smaller number, such as 10% or even 5%. The point is to set a goal, measure your progress toward that goal, and continue to improve.
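Measuring progress toward a reduction goal is simple arithmetic, sketched below. The function names and the weekly counts are hypothetical; the 20% default mirrors the example goal above.

```python
def percent_reduction(baseline: int, current: int) -> float:
    """Percentage drop from the baseline weekly count to the current one."""
    if baseline <= 0:
        raise ValueError("baseline must be a positive count")
    return 100 * (baseline - current) / baseline


def goal_met(baseline: int, current: int, target_percent: float = 20.0) -> bool:
    """True when the reduction reaches the target percentage."""
    return percent_reduction(baseline, current) >= target_percent


# Hypothetical counts of weekly false positive alerts.
baseline_week = 50
this_week = 38

print(f"Reduction: {percent_reduction(baseline_week, this_week):.1f}%")
print("Goal met" if goal_met(baseline_week, this_week) else "Goal not met yet")
```

If the target proves too aggressive, pass a smaller `target_percent` (for example 10 or 5) and keep measuring week over week.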
Step Two: Get the Right Information
The second step for reducing the noise from your alerts is to make sure they include useful information. Too often, federal IT pros receive alert emails without knowing which rule generated the alert.
Most tools allow you to specify the alert name in the rule. This simple change will have a profound impact on your ability to process, organize, and respond to alerts. In addition to including the alert name with the notification, try also to include:
- the device or system name
- criticality of the alert
- a direct link in your monitoring system to the affected system
- current vitals such as CPU, memory, and disk usage
- location (city or data center name, as well as the actual rack number and placement in the rack)
- local contact information if you need to work with a colleague on-site
- related systems and infrastructure (if possible)
With all of the first-level information in the alert, you can take direct action to begin service restoration or mitigation as soon as you receive the notification.
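One way to picture an alert that carries all of the fields listed above is the sketch below. The `Alert` class, its field names, the message layout, and the example values (including the URL) are assumptions made for illustration; most monitoring tools provide their own template variables for the same information.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical container for the first-level information in an alert."""
    rule_name: str      # name of the rule that generated the alert
    device: str         # device or system name
    criticality: str    # for example "critical" or "informational"
    link: str           # direct link to the system in the monitoring tool
    vitals: dict        # current CPU, memory, and disk readings
    location: str       # data center, rack number, and rack position
    contact: str        # local on-site contact
    related: list = field(default_factory=list)  # related systems, if known


def format_alert(alert: Alert) -> str:
    """Render the alert as a plain-text message, rule name first."""
    vitals = ", ".join(f"{name}={value}" for name, value in alert.vitals.items())
    related = ", ".join(alert.related) if alert.related else "none listed"
    return "\n".join([
        f"[{alert.criticality.upper()}] {alert.rule_name}: {alert.device}",
        f"Link: {alert.link}",
        f"Vitals: {vitals}",
        f"Location: {alert.location}",
        f"On-site contact: {alert.contact}",
        f"Related systems: {related}",
    ])


# Made-up example values, for illustration only.
example = Alert(
    rule_name="Disk space low",
    device="db01",
    criticality="critical",
    link="https://monitoring.example.com/nodes/db01",
    vitals={"cpu": "35%", "memory": "62%", "disk": "97%"},
    location="Example data center, rack 12, position 4",
    contact="on-site operations desk",
    related=["app01", "app02"],
)
message = format_alert(example)
print(message)
```

Putting the criticality and rule name on the first line also gives email rules a consistent structure to match against.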
Step Three: Consolidate
Informational alerts can be an important part of your monitoring process as they’re often a good way to identify changes in your environment before there is a problem. Rather than having these come through with other, higher-priority alerts, consider consolidating these into a daily or weekly report of informational activities.
This approach will allow you to examine more closely the context of the information rather than treating it like something that requires action. You may, in turn, notice something unique that you did not notice before, since you’re considering this information differently from other types of alerts.
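A consolidated report can be as simple as grouping the day's informational alerts by device, as in the sketch below. The tuple layout of each alert, the `build_digest` name, and the sample data are assumptions made for this example; in practice the input would come from wherever your categorized alerts are stored.

```python
from collections import defaultdict
from datetime import date


def build_digest(alerts, day: date) -> str:
    """Build a plain-text digest of informational alerts, grouped by device.

    `alerts` is an iterable of (category, device, message) tuples.
    Alerts in any other category are skipped, since they are handled separately.
    """
    by_device = defaultdict(list)
    for category, device, message in alerts:
        if category == "informational":
            by_device[device].append(message)

    lines = [f"Informational digest for {day.isoformat()}"]
    for device in sorted(by_device):
        messages = by_device[device]
        lines.append(f"{device}: {len(messages)} event(s)")
        lines.extend(f"  - {message}" for message in messages)
    return "\n".join(lines)


# Made-up sample alerts, for illustration only.
sample = [
    ("informational", "web03", "Entered maintenance mode"),
    ("critical", "db01", "Disk failed"),
    ("informational", "web03", "Left maintenance mode"),
    ("informational", "sw-core-1", "Firmware updated"),
]
digest = build_digest(sample, date(2024, 1, 15))
print(digest)
```

Seeing all events for one device side by side is what makes patterns visible that individual notifications would hide.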
These steps may seem daunting given the volume of alerts you receive on a daily basis.
The idea is to take it slowly. Start small and take a measured approach. A little goes a long way when it comes to alerts.
Most importantly, the value you get from your alerts will increase dramatically if you manage them properly. You will be far less likely to miss a serious threat. If you receive only a few alerts a week, it becomes much more practical to investigate threats and quickly address operational issues.