SCOM Alert Management Pack: Alert Storm Mitigation

This is the third of four related articles discussing the features and capabilities of the SCOM Alert Management Pack. A common pain point for Operations Manager administrators is the dreaded alert storm. It may be caused by a network outage, unplanned maintenance, a poorly tuned environment or other factors. A single incident will cause OpsMgr to generate many alerts and trigger an avalanche of notifications.

Introduction

My favorite story on this topic involves a customer whose firewall glitch caused about 600 agents to lose connectivity with OpsMgr. The issue was real: the firewall was blocking other traffic as well. However, while they were working through the very real firewall issue, the Windows admins were being flooded with critical notifications telling them that the “Health Service Heartbeat” had failed. The admins were unhappy about the flood of messages, and the SMTP relay server was overloaded as well.

Screenshot from SCOM Console showing several related alerts.
Figure 1: An Alert Storm Caused by a Network Outage

Business Requirements

Our solution to the problem was driven by requirements generated by our customer. In summary, the customer wanted:

  1. To suppress individual notifications when there were 10 or more alerts of the same name within a 5 minute window;
  2. To tag the individual alerts to indicate that they were related;
  3. To create a “master” alert to indicate that there was an Alert Storm in progress;
  4. To send a notification for the “master” alert;
  5. To be able to filter Alert Views to exclude individual alerts that were part of an Alert Storm.

NOTE: An important limitation in our solution is that all notifications are now delayed by at least 5 minutes (but not more than 10). In addition, we no longer generate notifications based on “New” resolution state. Instead, in our solution, all notifications are generated based on the “Verified” Resolution State.
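
If you want to see which of your existing subscriptions will be affected by this change, a rough, hedged check like the one below (using the OperationsManager PowerShell module) can help. The string match is a heuristic, not an exact parser; it simply flags subscriptions whose criteria appear to reference the “New” (0) resolution state.

# Hedged sketch: list notification subscriptions whose alert criteria appear to
# reference resolution state 0 ("New"); these are candidates to retarget at "Verified".
Import-Module OperationsManager
Get-SCOMNotificationSubscription |
    Where-Object { $_.Configuration.Criteria -match '<Value>0</Value>' } |
    Select-Object -Property DisplayName, Enabled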

“Verified” is the New “New”

Our first step was to introduce a queuing mechanism that would let us evaluate incoming alerts to see whether they were part of an alert storm. We also had to accept that simply delaying notifications for these alerts was not enough: we don’t know how long a particular incident will last. Instead, we decided to place these alerts into a different resolution state, one that does not generate a notification.

There is a rule in the SCOM Alert Management Pack that handles both Alert Storm mitigation and basic Alert Workflows. By enabling the “Escalate SCOM Alerts” rule, you gain access to these features. When enabled, this rule evaluates the following:

  1. Is there an Alert Storm in progress?
  2. Are individual alerts “transient”?
  3. If neither (1) nor (2) is True, then we assign a Resolution State of “Verified”.

We’ll discuss the first phase at length here. In the next article in this series, we’ll discuss the second and third phases of the “Escalate SCOM Alerts” rule.
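
Conceptually, the evaluation order looks something like the sketch below. This is not the management pack’s actual code: the Test-AlertStorm and Test-TransientAlert helpers are hypothetical placeholders, and the assumption that queued alerts sit in the “Assigned” (5) state reflects the “Assign SCOM Alerts” rule discussed later.

# Hedged sketch of the three-phase evaluation; not the pack's real implementation.
foreach ($alert in Get-SCOMAlert -ResolutionState 5) {       # alerts queued in "Assigned" (assumption)
    if (Test-AlertStorm -Alert $alert) {                     # phase 1 (hypothetical helper)
        Set-SCOMAlert -Alert $alert -ResolutionState 18      # "Alert Storm"
    }
    elseif (Test-TransientAlert -Alert $alert) {             # phase 2 (hypothetical helper)
        # transient alerts are handled separately (covered in the next article)
    }
    else {
        Set-SCOMAlert -Alert $alert -ResolutionState 15      # phase 3: "Verified"
    }
}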

Getting Started

STOP!!!!

  1. WARNING! If you have subscriptions that are tied to the “New” Resolution State, you will want to pause before you enable either the “Assign SCOM Alerts” or the “Escalate SCOM Alerts” rule. In our solution, the intent is to send notifications only when an alert is in the “Verified” Resolution State.
  2. WARNING! If you have already implemented custom resolution states, be aware that we use the following custom Resolution States (a quick way to check for collisions is sketched after this list):
    1. Resolution State 5 (Assigned)
    2. Resolution State 15 (Verified)
    3. Resolution State 18 (Alert Storm)
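
Before you import the pack, a quick check like this (using the OperationsManager module) will show whether any of those codes are already defined in your Management Group:

# List any existing resolution states that would collide with the codes the pack uses
Import-Module OperationsManager
Get-SCOMAlertResolutionState | Where-Object { $_.ResolutionState -in 5, 15, 18 }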

Installation

You can download the Alert Management MP from Releases · hmscott4/AlertManagement (github.com). There are detailed instructions, which you can read here: Home · hmscott4/AlertManagement Wiki (github.com).

A basic overview of the steps (read the warnings above!):

  1. Import the Management Pack
  2. Add custom resolution states
  3. Deploy the Configuration Files
    1. Edit the Alert Storm Rules to suit your environment
    2. Enable/Disable Alert Storm Rules to suit your environment
  4. Enable the “Assign SCOM Alerts” rule
  5. Enable the “Escalate SCOM Alerts” rule

Note: We assume at this point that you have enabled the “Assign SCOM Alerts” rule in this Management Pack. In theory, Alert Storm mitigation should work without that rule enabled, but we have not tested that scenario.
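
If you prefer to script steps 1 and 2, a rough sketch might look like the following. The management pack file name and path are assumptions; substitute the bundle you downloaded from the Releases page.

# Hedged sketch of steps 1 and 2 above (file path and name are assumptions)
Import-Module OperationsManager
Import-SCOMManagementPack -Fullname 'C:\Temp\SCOM.Alert.Management.mpb'

# Add the custom resolution states the pack expects (see the warnings above)
Add-SCOMAlertResolutionState -Name 'Assigned'    -ResolutionStateCode 5
Add-SCOMAlertResolutionState -Name 'Verified'    -ResolutionStateCode 15
Add-SCOMAlertResolutionState -Name 'Alert Storm' -ResolutionStateCode 18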

Configuration Files

We generate two Configuration Files as part of the Management Pack. The default configuration files are meant to serve as a “starting point”. You may edit them as needed to suit your operational requirements.

IMPORTANT: The files should be stored on a file share that is accessible to all the management servers in your Management Group.
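
A hedged way to confirm that requirement, assuming PowerShell remoting is available and using a placeholder UNC path, is to test the share from each management server:

# Confirm every management server can reach the config share (path is an assumption)
$sharePath = '\\fileshare\AlertManagement'
$servers   = (Get-SCOMManagementServer).DisplayName
Invoke-Command -ComputerName $servers -ScriptBlock {
    param($path)
    [pscustomobject]@{ Server = $env:COMPUTERNAME; CanReach = (Test-Path -Path $path) }
} -ArgumentList $sharePath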

We looked at the assign.alert.config in the first post. This time, we will focus on the escalate.alert.config file. Let’s look at an excerpt from this file, focusing on just the alertStormRules element.

<alertStormRules>   
    <stormRule name="Alert Count by Name" enabled="true">
        <Sequence>100</Sequence>
        <Property>Name</Property>
        <Criteria><![CDATA[ResolutionState<>255]]></Criteria>
        <Count>10</Count>
        <Window>5</Window>
        <NewResolutionState>18</NewResolutionState>
        <Comment><![CDATA[Alert updated by the alert automation: Alert Storm]]></Comment>
    </stormRule>
</alertStormRules>

Currently, there is one stormRule in alertStormRules. The key elements to focus on in this rule are:

  1. Property: The property of the alert that we use to group by
  2. Count: The threshold count; at or above this number, we have an alert storm
  3. Window: The look-back period (in minutes)
  4. Criteria: The criteria we use to form the pool of alerts to evaluate; in this case, all open alerts

In plain language, what this alertStormRule says is: “Group all open alerts by Name; if there are ten (10) or more alerts with the same Name within the last five (5) minutes, then this is an Alert Storm.”

In PowerShell, this might look like:

# Look back 5 minutes (UTC); format the timestamp so the criteria string is culture-safe
$DateTime = (Get-Date).ToUniversalTime().AddMinutes(-5).ToString('yyyy-MM-dd HH:mm:ss')
# Open alerts created in the window, grouped by Name; 10 or more with one name = alert storm
Get-SCOMAlert -Criteria "ResolutionState <> 255 AND TimeCreated > '$DateTime'" | Group-Object Name | Where-Object {$_.Count -ge 10}

If you want to:

  1. Increase the threshold count, then update the Count element;
  2. Change the look-back window, then update the Window element;
  3. Change the property of the alert that is used to group the objects, then update the Property element (these edits are sketched in PowerShell after this list).
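
If you would rather script those edits than open the file by hand, here is a hedged sketch. The UNC path and the new values are placeholders, and MonitoringObjectPath is only an example of an alternative grouping property, not a documented list of supported values.

# Hedged sketch: adjust the storm rule in escalate.alert.config (path and values are assumptions)
$configPath = '\\fileshare\AlertManagement\escalate.alert.config'
[xml]$config = Get-Content -Path $configPath -Raw

$rule = $config.SelectSingleNode("//alertStormRules/stormRule[@name='Alert Count by Name']")
$rule.SelectSingleNode('Count').InnerText    = '20'                    # raise the threshold count
$rule.SelectSingleNode('Window').InnerText   = '10'                    # widen the look-back window (minutes)
$rule.SelectSingleNode('Property').InnerText = 'MonitoringObjectPath'  # example alternative grouping property

$config.Save($configPath)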

For now, we’re going to leave the default configuration in place and see what happens in a lab environment.

Enable Alert Storm Mitigation

Let’s go ahead and enable the “Escalate SCOM Alerts” rule for the management pack.

HINT!!!! If you only want to test Alert Storm mitigation, then disable all the other rules in the configuration file: do a find/replace on enabled="true" and change "true" to "false" (except for the stormRule).
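
If find/replace in a text editor feels error-prone, a hedged alternative is to flip the attribute programmatically; again, the path is a placeholder.

# Hedged sketch: set enabled="false" on every rule except the stormRule (path is an assumption)
$configPath = '\\fileshare\AlertManagement\escalate.alert.config'
[xml]$config = Get-Content -Path $configPath -Raw

$config.SelectNodes("//*[@enabled='true']") |
    Where-Object { $_.LocalName -ne 'stormRule' } |
    ForEach-Object { $_.SetAttribute('enabled', 'false') }

$config.Save($configPath)
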

Screenshot from SCOM showing how to Enable the "Escalate SCOM Alerts" rule.
Figure 2: Enable the Escalate SCOM Alerts Rule

There are two overrides that are required to enable the rule:

  1. Enabled: Set to True
  2. Configuration File: Set to the full UNC path for the escalate.alert.config file.

Optionally, you can adjust the following overrides:

  1. Storm Ticket Date: Alerts that are part of a storm will be assigned a Ticket ID. The body of this ID is (by default) a date stamp.
  2. Storm Ticket Prefix: The Ticket ID will also have a string prefix. Here, I have changed it to “INC” (see the illustrative example after this list).
  3. Debug Logging: If enabled, this will log additional information to the Operations Manager Event log.
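
Purely as an illustration of how a prefix and a date-stamp body combine (the exact format the management pack generates may differ), the resulting Ticket ID looks something like this:

# Illustrative only; the real format is controlled by the Storm Ticket Date/Prefix overrides
$prefix   = 'INC'
$ticketId = '{0}{1}' -f $prefix, (Get-Date -Format 'yyyyMMddHHmmss')
$ticketId   # e.g. INC20250114093045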

Results

I’m going to shut down 13 servers in my lab environment. This will generate 13 “Health Service Heartbeat Failure” alerts (along with some others). Since 13 exceeds the threshold value of 10, we would expect to see an Alert Storm.

Initially, we will see 13 alerts in the “New” resolution state with no owner assigned. This is how SCOM works out of the box:

Screenshot from SCOM showing new related alerts.
Figure 3: New Alerts Flowing into SCOM

Since I have also enabled the “Assign SCOM Alerts” rule, we’ll see these alerts move to the “Assigned” resolution state, with the Owner field populated.

Screenshot from SCOM showing new alerts being updated with Owner and Resolution State Assigned.
Figure 4: Alert Ownership updated and Resolution State set to “Assigned”

Finally, the “Escalate SCOM Alerts” rule will run. It will detect that the alerts are related (by Name), that they fall within the look-back window (5 minutes), and that the count of related alerts meets the threshold count of 10. The alerts will be updated to the “Alert Storm” resolution state:

Screenshot from SCOM showing Resolution State updated to "Alert Storm"
Figure 5: Alert Resolution State updated to “Alert Storm”

In addition, a new alert will appear. Its name begins with “Alert Storm Detected:”, followed by the name of the alerts that make up the Alert Storm. In this case, we see: “Alert Storm Detected: Health Service Heartbeat Failure”:

Screenshot from SCOM showing the master alert for an Alert Storm.
Figure 6: “Master” Alert Generated

In the details of this “master” alert, you will see the objects that are included in the alert, as well as the Alert Storm Internal ID:

Screenshot from SCOM showing the details of the master alert. The Description field contains the Internal Ticket Id as well as details for the objects affected.
Figure 7: Details of the “Master” Alert. Objects related to the Alert Storm are listed in the Description.

If we go back and open the details for one of the Alert Storm alerts, we’ll see that Internal Ticket Id. Note that its format matches the format we specified in the rule overrides.

Screenshot from SCOM showing details of an individual Alert Storm alert.  The Ticket Id field has been populated with an internal identifier.
Figure 8: Details on an individual member alert of an Alert Storm. Ticket Id is populated and resolution state is set to “Alert Storm”.

In the History tab for the Alert, we will also see the actions that the SCOM Alert Management Pack has taken with the alert:

Screenshot from SCOM showing the Alert History tab.  This includes information showing that the Alert was set to "Alert Storm" resolution state.
Figure 9: Alert History tab for an Alert Storm Alert.

If I go to the “Active Alerts” view and I use the “Look For:” field, I can type in the Internal ticket number and see all the related alerts:

Screenshot from SCOM showing the "Look for:" feature.  User has filtered alerts by the internal Ticket Id.
Figure 10: Search for Alerts by Internal Incident Id
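
The same lookup can be done from PowerShell. This is a hedged sketch, and the ticket id pattern is illustrative:

# Query alerts by Ticket Id instead of using the console's "Look For:" box
Get-SCOMAlert -Criteria "TicketId LIKE 'INC%' AND ResolutionState <> 255" |
    Sort-Object -Property TicketId |
    Format-Table -Property TicketId, Name, MonitoringObjectDisplayName, ResolutionState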

I can also see all open Alert Storm Alerts by opening the “Alert Storm” view in the “Alert Management” folder:

Screenshot from SCOM showing the Alert Storm view in the Alert Management MP folder.
Figure 11: Find all open Alert Storm Alerts

Maybe that’s not enough? You want to see if this will work for other Alerts? Okay, we’ll use Kevin Holman’s SCOM Admin Management Pack to generate Test Events on a different group of servers. I won’t go step-by-step through each screen, but here’s the end result:

Screenshot from SCOM showing related Alerts that are part of an Alert Storm along with the Master Alert.
Figure 12: Another demonstration of the Alert Storm Mitigation feature with a different alert

Note that the “Master” alert is the same severity as the constituent alerts.

Benefits

With the Alert Storm member alerts now all in Resolution State “Alert Storm” (and never having entered Resolution State “Verified”), we avoid sending out notifications for each individual alert. The master alert does get placed into the “Verified” Resolution State, so we will send out a notification for it.

In this manner, you will retain the history for each alert, but you won’t spam your admin teams with unnecessary notifications.

A planned future enhancement is to track the individual alerts in an Alert Storm; when all of them have been closed, we will update the master alert to “Resolved”.

Acknowledgements

There are a lot of people who have helped make this management pack a reality. It would be impossible to thank all of them, but I would like to specifically acknowledge:

  • Dan Reist
  • Shane Hutchens
  • Tyson Paul
