I ran into a situation last month where my customer noticed that numerous Windows Failover Cluster role server names (aka ‘network name’ aka ‘virtual name’) were not getting discovered by SCOM. When I looked closer, it became clear that the problem occurred on cluster ‘roles’ with more than one virtual name attached. In the screenshot below you will notice there are two Server Names: a ‘V’ name and a ‘C’ name. (This is the naming convention used by the customer. The customer was only interested in the ‘V’ names and wasn’t really using the ‘C’ names at the time, so they didn’t care about discovering the ‘C’ names.)

For most of the roles, only the ‘V’ name was being discovered, not the ‘C’ name. However, for a small minority of the roles, only the ‘C’ name was being discovered. There didn’t seem to be any consistency or obvious logic for which one the discovery found: not the IP network/subnet, not the order in which they appeared in Failover Cluster Manager, not the name (spelling, alphabetical order). My only guess was that it was related to the order in which the names were created and added to the cluster role by whoever set up the WFC, and I don’t have a time machine or a magic portal to go gather that intel. Therefore the enumeration order remains a mystery.
I concluded that discovery was technically working but only recognizing the “first” virtual name on each role. Now what?
I started digging into the discovery. I eventually found the cause and solution. I’m going to describe my journey here so that hopefully my process and methods will help someone else down the road.
I have a two-node Windows cluster in my lab, “DBC1”. On that cluster I added a SQL Server Always On Availability Group (AOAG) to the Roles. The AOAG has a virtual name, “DBC1AGListener”.
I deployed the SCOM (2019) agent to my first SQL server only (db01.contoso.com). Some time later (10-15 minutes?), both the Windows cluster name and the AG listener name appeared.
There are some things to note here. My lab differs from the customer’s environment in a number of ways.
| |Customer|Lab|
|---|---|---|
|Windows Server Ver.|2016|2019|
|Cluster Role(s) Virtual Names|2|1|
In the end, the two environments were similar enough to research the problem.
I needed to dig into the discovery process to figure out what it was doing. Before I could do that, I needed to identify which discovery to focus on. And before I could do that, I needed to identify what was being discovered: what are ‘DBC1.Contoso.com’ and ‘DBC1AGListener.Contoso.com’? That is to say, what type of objects are they? Once I knew their class types, I could identify the discoveries that are capable of discovering them.
I safely assumed that they were either a “virtual” or “cluster” type class (name) so I used the following code to show me all class instances related to the AGListener name.
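A lookup of this kind can be sketched as follows. This is a minimal sketch, not necessarily the exact code used at the time. It assumes the OperationsManager module is installed and a management group connection already exists. The function name `Find-ScomInstanceByName` is hypothetical, and the wildcard match on `-DisplayName` is an assumption about how the search was done.

```powershell
# Hypothetical helper (not part of SCOM): list class instances whose
# display name contains the given text.
function Find-ScomInstanceByName {
    param(
        [Parameter(Mandatory)]
        [string]$NameFragment
    )

    # Assumes the OperationsManager module is installed and that a
    # management group connection already exists in this session.
    Import-Module OperationsManager -ErrorAction Stop

    # Wildcards on either side match any instance whose display name
    # contains the fragment.
    Get-SCOMClassInstance -DisplayName "*$NameFragment*" |
        Select-Object DisplayName, FullName, Id
}

# Example usage for the lab described above:
# Find-ScomInstanceByName -NameFragment 'DBC1AGListener'
```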
Hit! I got the object, now let’s look at its type(s).
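One way to list an object's types from PowerShell is to ask the instance itself. The sketch below assumes the object returned by `Get-SCOMClassInstance` exposes the SDK method `GetClasses()`. The function name `Get-ScomInstanceClassName` is hypothetical.

```powershell
# Hypothetical helper (not part of SCOM): show the classes an instance
# belongs to.
function Get-ScomInstanceClassName {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        $Instance
    )
    process {
        # GetClasses() is assumed to be available on the SDK object that
        # Get-SCOMClassInstance returns; it yields the instance's classes.
        $Instance.GetClasses() | Select-Object Name, DisplayName
    }
}

# Example usage for the lab described above:
# Get-SCOMClassInstance -DisplayName 'DBC1AGListener.Contoso.com' |
#     Get-ScomInstanceClassName
```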
From the screenshot above you can see that there are 3 classes in the family tree of the object. Let’s find out what those classes are.
Here I use the New-SCOMClassGraph function to show me the entire family tree for the 3 related classes.
Success! “DBC1AGListener.Contoso.com” is a “Microsoft.Windows.Cluster.VirtualServer” or “Virtual Server” (DisplayName) shown at the bottom of the graph. From the graph you can see that the only discovery (blue box) that is linked to that class type is: “Microsoft.Windows.Cluster.Classes.Discovery” and it is defined in this management pack: “Microsoft.Windows.Cluster.Library”.
MP DisplayName: Windows Cluster Library
MP Name: Microsoft.Windows.Cluster.Library
MP Version: 7.0.8437.16
Now we know what type of object “DBC1AGListener.Contoso.com” is, which discovery workflow is responsible for discovering it, and where to find that discovery.
I export the MP with PowerShell and locate the discovery in the .xml file.
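The export step can be sketched roughly like this, assuming the OperationsManager module and an existing management group connection. The function name `Export-ClusterLibraryMp` and the default folder `C:\Temp` are hypothetical, and the output file name assumes the export is written as `<MP name>.xml`.

```powershell
# Hypothetical helper (not part of SCOM): export the Windows Cluster
# Library MP to XML and return the expected file path.
function Export-ClusterLibraryMp {
    param(
        # Hypothetical export folder; it must already exist.
        [string]$Path = 'C:\Temp'
    )

    Import-Module OperationsManager -ErrorAction Stop

    # MP name taken from the details listed above.
    $mpName = 'Microsoft.Windows.Cluster.Library'

    Get-SCOMManagementPack -Name $mpName |
        Export-SCOMManagementPack -Path $Path

    # Assumption: the exported file is named after the MP.
    Join-Path $Path "$mpName.xml"
}

# Example usage: export, then find the lines that mention the discovery.
# $file = Export-ClusterLibraryMp
# Select-String -Path $file -SimpleMatch 'Microsoft.Windows.Cluster.Classes.Discovery'
```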
Notice the “DiscoverMultipleVirtualServers” setting in the discovery code above.
This implies that the discovery has the ability to discover more than one virtual server at a time, but the setting is “false” by default. That would explain the symptoms we are seeing in the customer’s environment. Can we override it to “true”? A quick look at the DataSource will tell us yes or no.
This is the datasource:
<DataSource ID="DiscoveryDataSource" TypeID="Microsoft.Windows.Cluster.Classes.Discovery.ModuleType">
In the datasource module type we can see that this parameter, “DiscoverMultipleVirtualServers”, can be overridden.
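The same check can be scripted against the exported XML. In the sketch below, the function name `Get-OverrideableParameterId` is hypothetical, and the `$sample` fragment is a trimmed illustration of the management pack schema, not a copy of the real module type definition.

```powershell
# Hypothetical helper: list the overridable parameter IDs declared by a
# data source module type in management pack XML.
function Get-OverrideableParameterId {
    param(
        [Parameter(Mandatory)]
        [xml]$ManagementPackXml,

        [Parameter(Mandatory)]
        [string]$ModuleTypeId
    )

    # Overridable parameters are declared under the module type's
    # OverrideableParameters element.
    $xpath = "//DataSourceModuleType[@ID='$ModuleTypeId']" +
             "/OverrideableParameters/OverrideableParameter"

    $ManagementPackXml.SelectNodes($xpath) |
        ForEach-Object { $_.GetAttribute('ID') }
}

# Illustrative fragment only (an assumption about the shape of the XML).
$sample = @'
<ManagementPack>
  <DataSourceModuleType ID="Microsoft.Windows.Cluster.Classes.Discovery.ModuleType">
    <OverrideableParameters>
      <OverrideableParameter ID="DiscoverMultipleVirtualServers" Selector="$Config/DiscoverMultipleVirtualServers$" ParameterType="bool" />
    </OverrideableParameters>
  </DataSourceModuleType>
</ManagementPack>
'@

Get-OverrideableParameterId -ManagementPackXml $sample `
    -ModuleTypeId 'Microsoft.Windows.Cluster.Classes.Discovery.ModuleType'
```

Against a real export, the XML could be loaded with `[xml](Get-Content -Raw <path to the exported .xml>)` and passed in the same way.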
This is what it looks like in the Console:
After the customer enabled this parameter, SCOM instantly started to discover all of the missing virtual server names from the cluster roles.