This is the fourth of four related articles discussing the features and capabilities of the SCOM Alert Management Pack. In the previous articles, we talked about assigning Alert Ownership and mitigating Alert Storms. In this final topic, we’ll introduce basic Alert Workflows.
Introduction
The SCOM Alert Management Pack contains some basic alert workflows that, out of the box, will help you reduce the number of notifications generated by OpsMgr and will enable you to create custom Alert Views, making it easier for your users to find and act on the Alerts that impact the systems they are responsible for.
Getting Started
STOP!!!!
- WARNING! If you have subscriptions that are tied to the “New” Resolution State, you will want to pause before you enable either the “Assign SCOM Alerts” or the “Escalate SCOM Alerts” rules. In our solution, the intent is to send notifications only when an alert is in “Verified” Resolution State.
- WARNING! If you have already implemented custom resolution states, be aware that we use the following custom Resolution States (a short PowerShell sketch for adding them follows this list):
- Resolution State 5 (Assigned)
- Resolution State 15 (Verified)
- Resolution State 18 (Alert Storm)
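If these states do not already exist in your Management Group, they can be added from the console or from PowerShell. Here is a minimal sketch of the PowerShell route; it assumes the OperationsManager module is available and that none of these state codes are already in use:

# Add the custom resolution states used by the Alert Management MP
Import-Module OperationsManager

Add-SCOMAlertResolutionState -Name 'Assigned'    -ResolutionStateCode 5
Add-SCOMAlertResolutionState -Name 'Verified'    -ResolutionStateCode 15
Add-SCOMAlertResolutionState -Name 'Alert Storm' -ResolutionStateCode 18

# Review the full list of resolution states afterwards
Get-SCOMAlertResolutionState | Sort-Object ResolutionState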
Installation
You can download the Alert Management MP from Releases · hmscott4/AlertManagement (github.com). There are detailed instructions, which you can read here: Home · hmscott4/AlertManagement Wiki (github.com).
A basic overview of the steps (read the warnings above!):
- Import the Management Pack
- Add custom resolution states
- Deploy the Configuration Files
- Edit the Alert Storm Rules to suit your environment
- Enable/Disable Alert Storm Rules to suit your environment
- Enable the “Assign SCOM Alerts” rule
- Enable the “Escalate SCOM Alerts” rule
Configuration Files
We generate two Configuration Files as part of the Management Pack. The default configuration files are meant to serve as a “starting point”. You may edit them as needed to suit your operational requirements.
IMPORTANT: The files should be stored on a file share that is accessible to all the management servers in your Management Group.
We looked at the assign.alert.config file in the first post. In the second post, we looked at the alertStormRules element of the escalate.alert.config file. In this post, we’ll look more closely at the rules and exceptions elements of the escalate.alert.config file. There are quite a few entries in the file, but we’ll focus on just a handful.
<rule name="Update Queue Assigned: Verified" enabled="true">
<Category></Category>
<Description><![CDATA[Update Monitor-based alerts to Verifed]]></Description>
<Sequence>5</Sequence>
<Criteria><![CDATA[ResolutionState=5 AND TimeRaised < '__TimeRaised__' AND IsMonitorAlert=1]]></Criteria>
<NewResolutionState>15</NewResolutionState>
<TimeRaisedAge>10</TimeRaisedAge>
<PostPipelineFilter></PostPipelineFilter>
<Comment><![CDATA[Alert updated by the alert automation: Verified]]></Comment>
</rule>
<rule name="Update Awaiting Evidence: Verified" enabled="true">
<Category></Category>
<Description><![CDATA[Update alerts from Awaiting Evidence to Verified]]></Description>
<Sequence>6</Sequence>
<Criteria><![CDATA[ResolutionState=247 AND LastModified < '__LastModified__' AND IsMonitorAlert=0]]></Criteria>
<NewResolutionState>15</NewResolutionState>
<LastModifiedAge>10</LastModifiedAge>
<PostPipelineFilter>$_.RepeatCount -gt 0</PostPipelineFilter>
<Comment><![CDATA[Alert updated by the alert automation: Verified]]></Comment>
</rule>
<rule name="Update Queue Assigned: Awaiting Evidence" enabled="true">
<Category></Category>
<Description><![CDATA[Update Rule-based alerts to Awaiting Evidence]]></Description>
<Sequence>7</Sequence>
<Criteria><![CDATA[ResolutionState=5 AND TimeRaised < '__TimeRaised__' AND IsMonitorAlert=0]]></Criteria>
<NewResolutionState>247</NewResolutionState>
<TimeRaisedAge>10</TimeRaisedAge>
<PostPipelineFilter></PostPipelineFilter>
<Comment><![CDATA[Alert updated by the alert automation: Awaiting Evidence]]></Comment>
</rule>
First off, let’s acknowledge the obvious: this isn’t pretty to look at! We chose XML because, as OpsMgr admins, we’ve all been forced to read through more XML than we’d like, and it’s also easy to work with in PowerShell, as the short sketch below shows. In early versions with customers, this was all done with System Center Orchestrator and PowerShell; here, we’re incorporating that core design into an easy-to-deploy Management Pack.
The above snippet contains three rules: “Update Queue Assigned: Verified“, “Update Queue Assigned: Awaiting Evidence” and “Update Awaiting Evidence: Verified“. There are several other entries in the file, but we’ll focus on just these three for now.
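Before we walk through the rules, here is the PowerShell sketch mentioned above: it loads the escalate configuration from a share (the path is just a placeholder) and lists the enabled rules in sequence order.

# Load the escalate configuration and list the enabled rules
[xml]$config = Get-Content -Path '\\fileshare\SCOM\escalate.alert.config'

$config.SelectNodes('//rule[@enabled="true"]') |
    Sort-Object { [int]$_.Sequence } |
    Select-Object name, Sequence, NewResolutionState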
Update Queue Assigned: Verified
This entry places monitor-based alerts (IsMonitorAlert=1) that have been open for more than 10 minutes and that are currently in the “Assigned” state (5) into the “Verified” resolution state (15).
The selection criteria can be seen in the Criteria element:
ResolutionState=5 AND TimeRaised < '__TimeRaised__' AND IsMonitorAlert=1
At runtime, we replace the ‘__TimeRaised__’ token with the current UTC time, offset by TimeRaisedAge minutes. In PowerShell, that calculation might look like:
(Get-Date).ToUniversalTime().AddMinutes(-10)
The element NewResolutionState tells OpsMgr to update the alert to “Verified“. We also add a comment to the Alert History so that we know what happened.
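Putting those pieces together, the processing for this rule is roughly equivalent to the following PowerShell. This is a simplified sketch rather than the management pack’s actual script, and the date format passed into the criteria is an assumption:

# Approximate the "Update Queue Assigned: Verified" rule
$timeRaisedAge = 10

# Replace the __TimeRaised__ token with "now minus TimeRaisedAge minutes" (UTC)
$cutoff   = (Get-Date).ToUniversalTime().AddMinutes(-$timeRaisedAge).ToString('yyyy-MM-dd HH:mm:ss')
$criteria = "ResolutionState=5 AND TimeRaised < '__TimeRaised__' AND IsMonitorAlert=1"
$criteria = $criteria.Replace('__TimeRaised__', $cutoff)

# Move the matching alerts to Verified (15) and record why in the Alert History
Get-SCOMAlert -Criteria $criteria |
    Set-SCOMAlert -ResolutionState 15 -Comment 'Alert updated by the alert automation: Verified'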
Update Queue Assigned: Awaiting Evidence
This entry places rule-based alerts (IsMonitorAlert=0) that have been open for more than 10 minutes and that are currently in the “Assigned” state (5) into the “Awaiting Evidence” resolution state (247).
Again, we see the selection criteria in the Criteria element:
ResolutionState=5 AND TimeRaised < '__TimeRaised__' AND IsMonitorAlert=0
We do the same math on ‘__TimeRaised__’ as previously discussed. The NewResolutionState element tells OpsMgr to place these alerts into the “Awaiting Evidence” resolution state and we again add a comment to the Alert History.
Update Awaiting Evidence: Verified
This entry examines rule-based alerts that are in “Awaiting Evidence” and checks whether the RepeatCount is greater than 0. If it is, the alert is moved into the “Verified” state.
This is an interesting entry because it makes use of the PostPipelineFilter element. It is essentially the equivalent of the following PowerShell:
Get-SCOMAlert -Criteria "ResolutionState=247 AND LastModified < '__LastModified__' AND IsMonitorAlert=0" | Where-Object {$_.RepeatCount -gt 0}
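Because PostPipelineFilter is stored as plain text in the config file, one way to apply it at runtime is to convert the string into a script block and hand it to Where-Object. Again, this is just a sketch of the general idea, not the pack’s actual implementation:

# Apply a PostPipelineFilter string from the config file at runtime
$postPipelineFilter = '$_.RepeatCount -gt 0'
$filter = [scriptblock]::Create($postPipelineFilter)

# __LastModified__ would be substituted the same way as __TimeRaised__ above
Get-SCOMAlert -Criteria "ResolutionState=247 AND LastModified < '__LastModified__' AND IsMonitorAlert=0" |
    Where-Object -FilterScript $filter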
Summary
In summary, the impact of these three rules is to:
- Introduce a slight delay between the time a monitor alert is raised and the time that alert is placed into “Verified” state.
- Automatically move rule-based alerts into “Awaiting Evidence“
- Move rule-based alerts from “Awaiting Evidence” to “Verified” when they are re-triggered (that is, when the RepeatCount is incremented)
Now let’s take a look at how this solution changes how admins interact with OpsMgr:
Viewing Alerts
Before
Prior to implementing this Management Pack, the “Active Alerts” view looks similar to:
We see that there’s no real differentiation among the alerts. All alerts are in the “New” resolution state. And since the Owner field isn’t populated, it can’t be used to get a sense of who should “own” a specific alert (or, conversely, how many alerts any given team owns).
After
After implementing the Management Pack, the default “Active Alerts” view shows alert ownership as well as different alert Resolution States based on where an alert is in the Alert Workflow process.
The general “Active Alerts” view is still pretty busy. There are a lot of alerts to look at. But we’re starting to see some patterns emerge:
- We can use the “Look for:” feature and, by entering our team name, filter for the alerts assigned to our team.
- If we then sort by the Resolution State column, we can organize those alerts and focus on just the ones in the “Verified” state.
More importantly, we can create a custom view that includes only alerts in the “Verified” Resolution State where the Owner matches our team’s name:
Which results in the following view:
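If you prefer to check the same thing outside the console, the view’s filter is easy to reproduce ad hoc in PowerShell. The team name below is just an example, and the LIKE match assumes the Owner field contains the team name:

# List open "Verified" alerts owned by a specific team
Get-SCOMAlert -Criteria "ResolutionState=15 AND Owner LIKE '%DBA Team%'" |
    Sort-Object TimeRaised |
    Select-Object TimeRaised, Name, MonitoringObjectDisplayName, Owner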
Let’s have a look at some specific alerts to see how the Management Pack routed them through the alert workflows described above.
Monitor Alert : Verified
In the History tab for this alert, we see that:
- The alert was activated by the system at 12:08pm
- The alert was assigned to the DBA Team at 12:09pm
- The alert was updated to “Verified” at 12:20pm
If we configured our notification rules to send notifications when an alert is set to “Verified”, then the DBA Team would have received a notification 12 minutes after the alert was triggered.
Monitor Alert: Closed
In this case, we see a “transient” alert. The alert is a monitor-based alert for AD Trust health. However, the circumstances leading to the alert are resolved quickly, so no notification is sent:
In the History tab for this alert, we see:
- The system activated the alert at 11:00am
- The alert is assigned to the Monitoring Team at 11:01am (note to self, adjust this assignment rule!)
- The alert is closed at 11:06am.
Since the alert never reaches the “Verified” state, no notification is sent.
Rule-based Alert : Verified
In this example, we have a rule-based alert that recurs because the underlying issue persists. This alert warns us that the discovery check is failing for a database because the account lacks permissions.
We can see from the History tab that:
- The alert was activated by the system
- The alert was assigned to the DBA Team
- Since this is a rule-based alert, it was routed to “Awaiting Evidence“
- Because the repeat count incremented, the alert was sent on to “Verified“
Note: The time stamps are off on this example because I was enabling and disabling the “Escalate SCOM Alerts” rule for testing.
Alert Storm Member Alert
In this example, we see an alert that was part of a larger alert storm. As a member of an alert storm, this alert was set to “Alert Storm” resolution state and no notification was sent.
In the Alert History tab, we see that:
- The alert was activated by the system at 12:37pm
- The alert was assigned to the EFG Team at 12:39pm
- The alert was classified as being part of a larger alert storm at 12:30pm and set to resolution state “Alert Storm“
- At 1:07pm the alert was closed
Benefits
There are several benefits to implementing this solution for customers, among them:
- By decreasing the number of outbound notifications sent by OpsMgr, we reduce “alert fatigue” and increase the relevance of individual notifications that systems administrators see;
- By creating “focused” views (filtered to “Verified” alerts that belong to specific teams), we increase the relevance of the user experience.
- By aggressively closing out certain rule-based alerts, we have fewer open alerts in the console.
There are some side benefits as well. Using PowerShell, we can quickly gain an overview of how many alerts are in each queue:
Get-SCOMAlert -Criteria "ResolutionState < 255" | Group-Object Owner
We can also see how many are in each Resolution State:
Get-SCOMAlert -Criteria "ResolutionState < 255" | Group-Object ResolutionState
It gets messier, but it’s still readable if you combine the two:
Get-SCOMAlert -Criteria "ResolutionState < 255" | Group-Object Owner, ResolutionState -NoElement
This would result in:
Acknowledgements
There are a lot of people who have helped make this management pack a reality. It would be impossible to thank all of them, but I would like to specifically acknowledge:
- Dan Reist
- Shane Hutchens
- Tyson Paul