Discovery Data Purge and Snapshot Synchronization Agent Tasks for SCOM



*NEW* 2023.08.25: v1.0.0.50
Added tasks (ref):

  • Show StateChange Age
  • Clean StateChange Data


If you do any management pack authoring then you are probably familiar with the headaches that can result from this sequence of activity:

  1. importing a management pack (which includes class discovery)
  2. removing that management pack
  3. changing the service model (classes and/or their existing properties and/or taxonomy)
  4. importing that management pack again

The agents can appear to get confused and refuse to discover your instances again or if they do get discovered, the rules/monitors may refuse to run for the instances (at least until the agents get their marching orders corrected by the next snapshot synchronization). This has to do with how the class instances get flagged as deleted when the MP is removed but when the MP is imported again the classes may be technically different because of the taxonomy changes but it shouldn’t matter because it’s a new MP but it does matter because obviously something is wrong but, but, but…

…headache.

I won’t attempt to explain the nerd science behind this behavior because I will probably butcher the details.
What I will do is provide a management pack with agent tasks to allow you to run these normally scheduled maintenance activities manually:

  • Discovery Data Purge – Force removal of items flagged as ‘IsDeleted=1’
  • Snapshot Synchronization – A config update for everything in the management group

These are maintenance activities that your management servers already perform automatically on an unchangeable schedule. However, the discovery purge normally happens at 2am and when you are in the middle of management pack development you may not want to wait around until the next day to test your MP. You may need those cobwebs cleaned out immediately to continue your testing/dev.

If my memory serves me correctly, the snapshot synchronization procedure occurs daily but on its own synch schedule. It can take 30 seconds in small environments but up to 30+ minutes in large environments.

Upon installation of this management pack, a new folder called “Maintenance” will appear in your Monitoring workspace. In this folder, there is a single state view called “OpsDB Watcher” which will display a single entity (the Operational Database Watcher is a system class). This class instance has four new tasks associated which will appear in the “Tasks” pane (screenshot below). These tasks will all use the existing “Operational Database Account” security profile. By default, this profile is empty and therefore the tasks will typically use the “Management Server Action Account“.


The Tasks

The tasks leverage a PowerShell datasource to run the SQL queries. They get executed on the management server where the OpsDB Watcher instance lives; in this lab: ms02.contoso.com.

OpsDB – 1) Execute DiscoveryDataPurge

This will purge entries from the OpsDB BaseManagedEntity table where “IsDeleted = 1”.

/* 
Script: DiscoveryDataPurge.SQL
Author: Tyson Paul (this is a fancier version of the classic purge procedure)
Version History:
2021.11.23.1703 - v1
*/

DECLARE @Err int
DECLARE @Ret int
DECLARE @Command nvarchar(MAX)
DECLARE @GroomHistoryId bigint
DECLARE @ItemsToDelete	INT
DECLARE @ItemsRemaining	INT
DECLARE @ItemsDeleted INT

-- Custom job memo
SET @Command = N'Manual Purge: dbo.p_DiscoveryDataPurgingByTypedManagedEntityInBatches, dbo.p_DiscoveryDataPurgingByRelationshipInBatches, dbo.p_DiscoveryDataPurgingByBaseManagedEntityInBatches'

-- Call the grooming history insert sproc. This will add a default job timestamp with 'Failed' Status. Only upon successful completion will the Status become updated to "Success"
EXEC @Ret = dbo.p_InternalJobHistoryInsert @Command, @GroomHistoryId OUT

SET @ItemsToDelete = (
	SELECT count(*) FROM BaseManagedEntity WHERE IsDeleted = 1
)
-- This only purges 10000 records.  If you have more it will require multiple runs
-- If it fails with an error set BatchSize to 500 and run multiple times
DECLARE @TimeGenerated DATETIME, @BatchSize INT, @RowCount INT
SET @BatchSize = 10000
-- These 3 procedures below require a DateTime param which is basically a watermark to delete items OLDER than the timestamp. 
-- Normally this timestamp is 2-ish days. Here we are setting @TimeGenerated to RIGHT NOW so all entries will be older than right now and will be deleted.
SET @TimeGenerated = GETUTCDATE()

EXEC @Err = p_DiscoveryDataPurgingByRelationship @TimeGenerated, @BatchSize, @RowCount
IF (@Err <> 0)
    GOTO Error_Exit

EXEC @Err = p_DiscoveryDataPurgingByTypedManagedEntity @TimeGenerated, @BatchSize, @RowCount
IF (@Err <> 0)
    GOTO Error_Exit

EXEC @Err = p_DiscoveryDataPurgingByBaseManagedEntity @TimeGenerated, @BatchSize, @RowCount 
IF (@Err <> 0)
    GOTO Error_Exit

-- Call the grooming history insert sproc to update status to "Success"
EXEC @Ret = dbo.p_InternalJobHistoryUpdate @GroomHistoryId, 1

SET @ItemsRemaining = (
	SELECT count(*) FROM BaseManagedEntity WHERE IsDeleted = 1
)
SET @ItemsDeleted = (
	SELECT @ItemsToDelete - @ItemsRemaining
)

Select @ItemsToDelete as 'ItemsToPurge'
	,@ItemsDeleted as 'ItemsDeleted'
	, @ItemsRemaining as 'ItemsRemaining'

GOTO _Exit

Error_Exit:
    Select 'Failed' as 'Status'

_Exit:


OpsDB – 2) Show DiscoveryDataPurgeHistory

This will show the discovery purge history; both scheduled and manual executions.

/* 
Script: ShowDiscoveryDataPurgeHistory.SQL
Author: Tyson Paul
Version History:
2021.11.23.1703 - v1
*/

-- build table list to contain friendly names for Status codes
IF (OBJECT_ID('tempdb..#tmpStatus') IS NOT NULL)
DROP TABLE #tmpStatus

CREATE TABLE #tmpStatus (
    StatusCode          int     NOT NULL
	,StatusName         sysname NOT NULL
)
      
INSERT #tmpStatus VALUES ( 0, 'Failed')
INSERT #tmpStatus VALUES ( 1, 'Success')
INSERT #tmpStatus VALUES ( 2, 'FailedOther')

select top 20 InternalJobHistoryId
,DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), TimeStarted) AS TimeStartedInLocalTime
,DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), TimeFinished) AS TimeFinishedInLocalTime
,DATEDIFF(second, (DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), TimeStarted) ), (DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), TimeFinished) )) as JobDurationSeconds
,InternalJobHistory.StatusCode
,StatusName
,Command
from InternalJobHistory
INNER JOIN #tmpStatus ON InternalJobHistory.StatusCode = #tmpStatus.StatusCode
--Where Command like '%p_DataPurging%'
order by InternalJobHistoryId DESC

IF (OBJECT_ID('tempdb..#RegularTableToGroom') IS NOT NULL)
DROP TABLE #RegularTableToGroom


OpsDB – 3) Execute Snapshot Synchronization

This triggers a configuration update for everything in the management group.

/* 
Script: SnapshotSynchForce.SQL
Author: Tyson Paul (this is a fancy wrapper for CS.SnapshotSynchronizationForce )
Version History:
2021.11.29.1457 - Improved output
2021.11.23.1703 - v1
*/

DECLARE 
     @ErrorInd        bit
    ,@ErrorMessage    nvarchar(max)
    ,@ErrorNumber     int
    ,@ErrorSeverity   int
    ,@ErrorState      int
    ,@ErrorLine       int
    ,@ErrorProcedure  nvarchar(256)
    ,@ErrorMessageText nvarchar(max)


-- Calculate the average time required to run the procedure. This is good info for the user to help manage expectations. 
DECLARE @AvgRunTime	INT
SET @AvgRunTime = (
	SELECT AVG([DurationSeconds]) as AverageRunTime
	FROM CS.WorkItem
	INNER JOIN CS.WorkItemState on WorkItem.WorkItemStateId = WorkItemState.WorkItemStateId
	WHERE WorkItemName like '%snap%'
)

BEGIN TRY
	-- Try to detect if the job is already running
	IF NOT EXISTS (
		SELECT TOP 1 1
		FROM CS.WorkItem
		INNER JOIN CS.WorkItemState on WorkItem.WorkItemStateId = WorkItemState.WorkItemStateId
		WHERE WorkItemName like '%snap%'
		AND WorkItemStateName like 'Running'
		ORDER BY WorkItemRowId DESC
	)
		BEGIN
			-- If not already running, run the procedure
			EXEC CS.SnapshotSynchronizationForce
			-- It often takes a few seconds for the job status to appear in the WorkItem table.
			WAITFOR DELAY '00:00:5';
		END 

	--snapshot sync
	SELECT [WorkItemRowId]
			,[WorkItemName]
			,CS.WorkItem.WorkItemStateId
			,WorkItemState.WorkItemStateName
			,[ServerName]
			,[InstanceName]
			,[StartedDateTimeUtc]
			,[LastActivityDateTimeUtc]
			,[CompletedDateTimeUtc]
			,[DurationSeconds]
			,@AvgRunTime AS 'Average Runtime (seconds)'
			,[ErrorMessage]
	FROM CS.WorkItem
	INNER JOIN CS.WorkItemState on WorkItem.WorkItemStateId = WorkItemState.WorkItemStateId
	WHERE WorkItemName like '%snap%'
	ORDER BY WorkItemRowId DESC

END TRY

BEGIN CATCH
	SELECT
    @ErrorNumber = ERROR_NUMBER()
    ,@ErrorSeverity = ERROR_SEVERITY()
    ,@ErrorState = ERROR_STATE()
    ,@ErrorLine = ERROR_LINE()
    ,@ErrorProcedure = ISNULL(ERROR_PROCEDURE(), '-')
    ,@ErrorMessageText = ERROR_MESSAGE()

    SET @ErrorInd = 1
END CATCH

  -- report error if any
IF (@ErrorInd = 1)
BEGIN
	DECLARE @AdjustedErrorSeverity int

	SET @AdjustedErrorSeverity = CASE
		WHEN @ErrorSeverity > 18 THEN 18
		ELSE @ErrorSeverity
		END
    
	RAISERROR ('Sql execution failed. Error %d, Level %d, State %d, Procedure %s, Line %d, Message: %s', @AdjustedErrorSeverity, 1
		,@ErrorNumber
		,@ErrorSeverity
		,@ErrorState
		,@ErrorProcedure
		,@ErrorLine
		,@ErrorMessageText
	)
END


OpsDB – 4) Show Snapshot Synchronization History

This will show the snapshot synch history; both scheduled and manual executions.

/* 
Script: ShowSnapshotSynchHistory.SQL
Author: Tyson Paul
Version History:
2021.11.23.1703 - v1
*/

 --Snapshot sync history
SELECT [WorkItemRowId]
      ,[WorkItemName]
      ,CS.WorkItem.WorkItemStateId
	  ,WorkItemState.WorkItemStateName
      ,[ServerName]
      ,[InstanceName]
      ,[DurationSeconds]
      ,[StartedDateTimeUtc]   
	  ,DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), [StartedDateTimeUtc]) AS TimeStartedLocalTime
      ,[LastActivityDateTimeUtc]
      ,[CompletedDateTimeUtc]
	  ,DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), [CompletedDateTimeUtc]) AS TimeStartedLocalTime
      ,[ErrorMessage]
FROM CS.WorkItem
INNER JOIN CS.WorkItemState on WorkItem.WorkItemStateId = WorkItemState.WorkItemStateId
WHERE WorkItemName like '%snap%'
ORDER BY WorkItemRowId DESC


OpsDB – 5) Show StateChange Age

-- Reveal StateChanges and some stats on age
DECLARE @statedaystokeep INT

SELECT @statedaystokeep = DaysToKeep
FROM PartitionAndGroomingSettings
WHERE ObjectName = 'StateChangeEvent'

SELECT COUNT(*) AS 'Total StateChanges'
	,count(CASE 
		WHEN sce.TimeGenerated > dateadd(dd, - @statedaystokeep, getutcdate())
			THEN sce.TimeGenerated
		ELSE NULL
		END) AS 'Within Grooming Retention'
	,count(CASE 
		WHEN sce.TimeGenerated < dateadd(dd, - @statedaystokeep, getutcdate())
			THEN sce.TimeGenerated
		ELSE NULL
		END) AS '> Grooming Retention'
	,count(CASE 
		WHEN sce.TimeGenerated < dateadd(dd, - 30, getutcdate())
			THEN sce.TimeGenerated
		ELSE NULL
		END) AS '> 30 Days'
	,count(CASE 
		WHEN sce.TimeGenerated < dateadd(dd, - 90, getutcdate())
			THEN sce.TimeGenerated
		ELSE NULL
		END) AS '> 90 Days'
	,count(CASE 
		WHEN sce.TimeGenerated < dateadd(dd, - 365, getutcdate())
			THEN sce.TimeGenerated
		ELSE NULL
		END) AS '> 365 Days'
	FROM StateChangeEvent sce

OpsDB – 6) Clean StateChange Data

USE [OperationsManager]
SET ANSI_NULLS ON

SET QUOTED_IDENTIFIER ON

BEGIN
	SET NOCOUNT ON

	DECLARE @Err INT
	DECLARE @Ret INT
	DECLARE @DaysToKeep TINYINT
	DECLARE @GroomingThresholdLocal DATETIME
	DECLARE @GroomingThresholdUTC DATETIME
	DECLARE @TimeGroomingRan DATETIME
	DECLARE @MaxTimeGroomed DATETIME
	DECLARE @RowCount INT

	SET @TimeGroomingRan = getutcdate()

	SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())
	FROM dbo.PartitionAndGroomingSettings
	WHERE ObjectName = 'StateChangeEvent'

	EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal
		,@GroomingThresholdUTC OUT

	SET @Err = @@ERROR

	IF (@Err <> 0)
	BEGIN
		GOTO Error_Exit
	END

	SET @RowCount = 1

	-- This is to update the settings table 
	-- with the max groomed data
	SELECT @MaxTimeGroomed = MAX(TimeGenerated)
	FROM dbo.StateChangeEvent
	WHERE TimeGenerated < @GroomingThresholdUTC

	IF @MaxTimeGroomed IS NULL
		GOTO Success_Exit

	-- Instead of the FK DELETE CASCADE handling the deletion of the rows from 
	-- the MJS table, do it explicitly. Performance is much better this way.
	DELETE MJS
	FROM dbo.MonitoringJobStatus MJS
	JOIN dbo.StateChangeEvent SCE ON SCE.StateChangeEventId = MJS.StateChangeEventId
	JOIN dbo.STATE S
	WITH (NOLOCK) ON SCE.[StateId] = S.[StateId]
	WHERE SCE.TimeGenerated < @GroomingThresholdUTC
		AND S.[HealthState] IN (
			0
			,1
			,2
			,3
			)

	SELECT @Err = @@ERROR

	IF (@Err <> 0)
	BEGIN
		GOTO Error_Exit
	END

	-- create temp table to hold StateChangeEvent row ids to be deleted
	IF (OBJECT_ID('tempdb..#StateChangeEventToGroom') IS NOT NULL)
		DROP TABLE #StateChangeEventToGroom

	SELECT @Err = @@ERROR

	IF (@Err <> 0)
		GOTO Error_Exit

	CREATE TABLE #StateChangeEventToGroom (StateChangeEventId UNIQUEIDENTIFIER NOT NULL)

	SELECT @Err = @@ERROR

	IF (@Err <> 0)
		GOTO Error_Exit

	WHILE (@RowCount > 0)
	BEGIN
		-- Delete StateChangeEvents that are older than @GroomingThresholdUTC
		-- We are doing this in chunks in separate transactions on 
		-- purpose: to avoid the transaction log to grow too large.
		-- NOTE: The method when we select ids into temp table
		-- and then groom in a separate statement works faster than just DELETE TOP...
		INSERT #StateChangeEventToGroom (StateChangeEventId)
		SELECT TOP (10000) SCE.StateChangeEventId
		FROM dbo.StateChangeEvent SCE
		JOIN dbo.STATE S
		WITH (NOLOCK) ON SCE.[StateId] = S.[StateId]
		WHERE TimeGenerated < @GroomingThresholdUTC
			AND S.[HealthState] IN (
				0
				,1
				,2
				,3
				)

		SELECT @Err = @@ERROR
			,@RowCount = @@ROWCOUNT

		WAITFOR DELAY '00:00:01' -- waiting 1 seconds

		IF (@Err <> 0)
		BEGIN
			GOTO Error_Exit
		END

		SET STATISTICS TIME ON;

		DELETE dbo.StateChangeEvent
		WHERE StateChangeEventId IN (
				SELECT StateChangeEventId
				FROM #StateChangeEventToGroom
				)

		SET STATISTICS TIME OFF

		SELECT @Err = @@ERROR

		IF (@Err <> 0)
			GOTO Error_Exit

		TRUNCATE TABLE #StateChangeEventToGroom

		SELECT @Err = @@ERROR

		IF (@Err <> 0)
			GOTO Error_Exit
	END

	UPDATE dbo.PartitionAndGroomingSettings
	SET GroomingRunTime = @TimeGroomingRan
		,DataGroomedMaxTime = @MaxTimeGroomed
	WHERE ObjectName = 'StateChangeEvent'

	SELECT @Err = @@ERROR
		,@RowCount = @@ROWCOUNT

	IF (@Err <> 0)
	BEGIN
		GOTO Error_Exit
	END

	Success_Exit:

	IF (OBJECT_ID('tempdb..#StateChangeEventToGroom') IS NOT NULL)
		DROP TABLE #StateChangeEventToGroom

	Error_Exit:
END


New Tasks featured here: https://monitoringguys.com/?p=7950


Examples

In this example I removed a management pack which resulted in several class instances discovered by that pack being marked for deletion. 30 entities were purged. The stored procedures used to do the heavy lifting will remove flagged items in batches of 10,000. If more flagged items exist than the batch limit, the remaining number of flagged items will be shown in the task result. If items remain, simply run the task again until no items remain.



In this example, the Snapshot Synchronization task triggers the “CS.SnapshotSynchronizationForce” procedure and returns the procedure history which includes the current “running” procedure status.


Extra Info

A stored procedure is a prepared SQL snippet (sortof like a function) that you can save, so the code can be reused over and over again. CS.SnapshotSynchronizationForce is a stored procedure found in folder shown below.


Download

ALWAYS CREATE (AND VERIFY) A FULL BACKUP OF YOUR DATABASES BEFORE ATTEMPTING TO MODIFY YOUR DATABASE(S) MANUALLY.

The terms you agree to by using this offering:
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Download “SCOM Maintenance” SCOMMaintenance_1.0.0.50.zip – Downloaded 11277 times – 24.35 KB

Leave a Reply

Your email address will not be published. Required fields are marked *