SCOM Unit Monitor Health Recalculate and OnDemand Detection

If you’ve ever used Health Explorer, it’s likely that you’ve seen the Recalculate button. If you’ve ever clicked it, you know that it isn’t very exciting. Oftentimes it doesn’t respond or provide any feedback.

If you’re lucky you will see a confirmation window, but that doesn’t mean anything actually happened. OnDemand detection must be baked into the MonitorType for the Recalculate button to do anything. Typically you would find it included for a scheduled unit monitor.

There are several reasons why it may not exist in the MonitorType:

The author didn’t know how to implement it
It’s not appropriate for the monitor type (example: an event detection monitor)
The author was lazy or forgot
The author omitted it on purpose

How does OnDemand detection work?

Normally a scheduler module tells the DataSource when to retrieve the data (property bag): on a synchronized interval, on a non-synchronized interval, or on specific days/times. OnDemand detection exists to provide a way to trigger the data collection (where data is harvested) for the MonitorType (where the data is used to calculate health state) without waiting for a scheduled trigger, as in RIGHT NOW. Think of this like someone standing at a bus stop, waiting for the bus to arrive at a specific time in the future. OnDemand is a mechanism to make that bus appear instantly so you can step onto it. The bus will either take you to HealthyTown or BrokenVille.
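
To make the scheduled path concrete, here is a minimal sketch of what a scheduled DataSource can look like: a composite module in which a scheduler fires a probe on an interval. This is not the actual SCOMAgentProxy.PortCheck.DS from the management pack. The module ID, the script name and the script body are hypothetical, and the System! and Windows! aliases assume the usual references to System.Library and Microsoft.Windows.Library.

```xml
<!-- Hypothetical example: a scheduled property bag DataSource (names are illustrative only) -->
<DataSourceModuleType ID="Example.PropertyBag.DS" Accessibility="Internal">
  <Configuration>
    <xsd:element minOccurs="1" name="IntervalSeconds" type="xsd:integer" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
    <xsd:element minOccurs="0" name="SyncTime" type="xsd:string" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
  </Configuration>
  <ModuleImplementation>
    <Composite>
      <MemberModules>
        <!-- The scheduler decides WHEN the workflow runs -->
        <DataSource ID="Scheduler" TypeID="System!System.SimpleScheduler">
          <IntervalSeconds>$Config/IntervalSeconds$</IntervalSeconds>
          <SyncTime>$Config/SyncTime$</SyncTime>
        </DataSource>
        <!-- The probe does the actual work and returns a property bag -->
        <ProbeAction ID="Probe" TypeID="Windows!Microsoft.Windows.PowerShellPropertyBagProbe">
          <ScriptName>Example.PropertyBag.ps1</ScriptName>
          <ScriptBody><![CDATA[
$api = New-Object -ComObject 'MOM.ScriptAPI'
$bag = $api.CreatePropertyBag()
$bag.AddValue('Result', 'OK')
$bag
          ]]></ScriptBody>
          <TimeoutSeconds>60</TimeoutSeconds>
        </ProbeAction>
      </MemberModules>
      <Composition>
        <!-- Read inside out: Scheduler triggers Probe -->
        <Node ID="Probe">
          <Node ID="Scheduler" />
        </Node>
      </Composition>
    </Composite>
  </ModuleImplementation>
  <OutputType>System!System.PropertyBagData</OutputType>
</DataSourceModuleType>
```

In a layout like this the probe only ever runs when the scheduler fires. An OnDemand detection gets around that by wiring the same kind of probe into the MonitorType directly, with no scheduler in front of it.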

Here’s an example from the SCOM Agent Proxy management pack. In the code snippet below you can see the HostUnreachable monitor type with its RegularDetections and, additionally, its OnDemandDetections. The RegularDetections rely on the scheduler included in the DataSource to determine when to retrieve the data for the condition detections. The OnDemandDetections, however, allow the user to initiate the ProbeAction immediately when needed: the PassThroughProbe (PASSTHRU) acts as the trigger point, and the ONDEMAND probe performs the same port check without waiting for the scheduler.

      <UnitMonitorType ID="SCOMAgentProxy.PortCheck.HostUnreachable.MT" Accessibility="Public">
        <MonitorTypeStates>
          <MonitorTypeState ID="HostUnreachableFailure" NoDetection="false" />
          <MonitorTypeState ID="NoHostUnreachableFailure" NoDetection="false" />
        </MonitorTypeStates>
        <Configuration>
          <IncludeSchemaTypes>
            <SchemaType>System!System.ExpressionEvaluatorSchema</SchemaType>
          </IncludeSchemaTypes>
          <xsd:element minOccurs="1" name="IntervalSeconds" type="xsd:integer" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element minOccurs="1" name="Port" type="xsd:integer" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element minOccurs="1" name="ServerName" type="xsd:string" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element minOccurs="1" name="SpreadInitializationOverInterval" type="xsd:integer" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
        </Configuration>
        <OverrideableParameters>
          <OverrideableParameter ID="IntervalSeconds" Selector="$Config/IntervalSeconds$" ParameterType="int" />
          <OverrideableParameter ID="Port" Selector="$Config/Port$" ParameterType="int" />
          <OverrideableParameter ID="ServerName" Selector="$Config/ServerName$" ParameterType="string" />
          <OverrideableParameter ID="SpreadInitializationOverInterval" Selector="$Config/SpreadInitializationOverInterval$" ParameterType="int" />
        </OverrideableParameters>
        <MonitorImplementation>
          <MemberModules>
            <DataSource ID="DS" TypeID="SCOMAgentProxy.PortCheck.DS">
              <IntervalSeconds>$Config/IntervalSeconds$</IntervalSeconds>
              <Port>$Config/Port$</Port>
              <ServerName>$Config/ServerName$</ServerName>
              <SpreadInitializationOverInterval>$Config/SpreadInitializationOverInterval$</SpreadInitializationOverInterval>
            </DataSource>
            <ProbeAction ID="PASSTHRU" TypeID="System!System.PassThroughProbe" />
            <ProbeAction ID="ONDEMAND" TypeID="MSSL!Microsoft.SystemCenter.SyntheticTransactions.TCPPortCheckProbe">
              <ServerName>$Config/ServerName$</ServerName>
              <Port>$Config/Port$</Port>
            </ProbeAction>
            <!-- 2147952465 = 0x80072751 (Winsock error 10065, WSAEHOSTUNREACH): the host is unreachable -->
            <ConditionDetection ID="CDHostUnreachableFailure" TypeID="System!System.ExpressionFilter">
              <Expression>
                <SimpleExpression>
                  <ValueExpression>
                    <XPathQuery Type="UnsignedInteger">StatusCode</XPathQuery>
                  </ValueExpression>
                  <Operator>Equal</Operator>
                  <ValueExpression>
                    <Value Type="UnsignedInteger">2147952465</Value>
                  </ValueExpression>
                </SimpleExpression>
              </Expression>
            </ConditionDetection>
            <ConditionDetection ID="CDNoHostUnreachableFailure" TypeID="System!System.ExpressionFilter">
              <Expression>
                <SimpleExpression>
                  <ValueExpression>
                    <XPathQuery Type="UnsignedInteger">StatusCode</XPathQuery>
                  </ValueExpression>
                  <Operator>NotEqual</Operator>
                  <ValueExpression>
                    <Value Type="UnsignedInteger">2147952465</Value>
                  </ValueExpression>
                </SimpleExpression>
              </Expression>
            </ConditionDetection>
          </MemberModules>
          <RegularDetections>
            <RegularDetection MonitorTypeStateID="HostUnreachableFailure">
              <Node ID="CDHostUnreachableFailure">
                <Node ID="DS" />
              </Node>
            </RegularDetection>
            <RegularDetection MonitorTypeStateID="NoHostUnreachableFailure">
              <Node ID="CDNoHostUnreachableFailure">
                <Node ID="DS" />
              </Node>
            </RegularDetection>
          </RegularDetections>
			 
          <OnDemandDetections>
            <OnDemandDetection MonitorTypeStateID="HostUnreachableFailure">
              <Node ID="CDHostUnreachableFailure">
                <Node ID="ONDEMAND">
                  <Node ID="PASSTHRU" />
                </Node>
              </Node>
            </OnDemandDetection>
            <OnDemandDetection MonitorTypeStateID="NoHostUnreachableFailure">
              <Node ID="CDNoHostUnreachableFailure">
                <Node ID="ONDEMAND">
                  <Node ID="PASSTHRU" />
                </Node>
              </Node>
            </OnDemandDetection>
          </OnDemandDetections>
			 
        </MonitorImplementation>
      </UnitMonitorType>

When/How is the OnDemand detection used?

(Assuming it is added to the MonitorType)
1) OnDemand is used by the Recalculate button to run the monitoring workflow immediately (without waiting for the scheduler) to retrieve fresh data, then use that data to determine the health state.
2) Upon initialization of a unit monitor, an instance of the monitoring workflow (including the datasource probe) is executed for every applicable target instance. That means health is calculated immediately for every instance of the monitor's target class.
If OnDemand detection has not been defined, the monitor is initialized as healthy automatically, without actually running the workflow to verify the health. This can be misleading and can hide workflow problems or failures.

What if OnDemand detection does not exist in the MonitorType?

If OnDemand detection does not exist in the MonitorType, the unit monitor will automatically initialize to healthy, and there will be no state change context data available in Health Explorer -> State Change Events.

The monitor will next run on its configured schedule (assuming it’s a scheduled workflow) and calculate health normally at that time, provided the workflow terminates gracefully and a condition detection within the MT is matched to a health state (healthy/warning/critical). Cookdown will be leveraged by the datasource if possible. If the workflow crashes and no data item is produced, or if the data item does not match a condition detection, the state will not change. It will remain healthy and there will be little evidence of failure.
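
The link between a matched condition detection and a health state is made in the unit monitor that consumes the MonitorType. As a rough sketch, a monitor built on the HostUnreachable monitor type shown earlier might map the two MonitorTypeStates as follows. The monitor ID, the target class, the Health! alias (assumed to reference System.Health.Library) and all configuration values are hypothetical.

```xml
<!-- Hypothetical unit monitor: IDs, target class, and values are illustrative only -->
<UnitMonitor ID="Example.PortCheck.HostUnreachable.Monitor" Accessibility="Public" Enabled="true"
             Target="Example.PortCheck.Watcher" ParentMonitorID="Health!System.Health.AvailabilityState"
             Remotable="true" Priority="Normal"
             TypeID="SCOMAgentProxy.PortCheck.HostUnreachable.MT" ConfirmDelivery="false">
  <Category>AvailabilityHealth</Category>
  <OperationalStates>
    <!-- Each MonitorTypeState from the MT is mapped to exactly one health state -->
    <OperationalState ID="Unreachable" MonitorTypeStateID="HostUnreachableFailure" HealthState="Error" />
    <OperationalState ID="Reachable" MonitorTypeStateID="NoHostUnreachableFailure" HealthState="Success" />
  </OperationalStates>
  <Configuration>
    <!-- Element order follows the Configuration schema of the monitor type -->
    <IntervalSeconds>300</IntervalSeconds>
    <Port>5723</Port>
    <ServerName>ms01.example.com</ServerName>
    <SpreadInitializationOverInterval>60</SpreadInitializationOverInterval>
  </Configuration>
</UnitMonitor>
```

If a data item matches neither condition detection, neither OperationalState is selected and the monitor simply keeps whatever state it already had.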

Why would the OnDemand detection be omitted on purpose, you ask?

Upon initialization of a unit monitor, OnDemand (if included in the MT) will calculate health for every instance of the target class, but it WILL NOT use cookdown. The agent will run a separate instance of the datasource for every single target instance. This can be devastating for target types with many instances. Think of IIS sites or SQL agent jobs, with hundreds or potentially thousands of instances on an agent. OnDemand is certainly helpful at times, but its potentially harmful effect on an agent for multi-instance classes during unit monitor initialization means it may be better to leave it out.

The Monitoring Guys