Cleaning old health state change data made easy



The SCOM Maintenance MP that I published in 2022 made purging discovery data and config synch effortless. Newly added to the MP are the following agent tasks:

  • Show StateChange Age
  • Clean StateChange Data

State change history is recorded for every monitored object when the health state for any monitor changes between Healthy, Warning, and Critical. If the object is in an unhealthy state when the nightly grooming procedure occurs, the state data doesn’t get groomed and this can build up forever, filling the OperationsManager database. This condition is documented pretty well here. Years ago, Tao Yang created the Self Maintenance MP (now maintained by Cookdown here) with a rule to detect when stale data exists. This is great for awareness but it doesn’t solve the issue. You can now use this agent task to run the SQL script to clean up the data.

Examples

In the example screenshot above you can see by the results of the “Show StateChange Age” task that I have a tremendous amount of stale state change data and also my OperationsManager database is unhealthy due to low space. In my case I happen to know that the low free space condition is largely due to aggressive performance collections but in some cases this can be due to accumulation of old state change.

Below you can see the result of a cleaning moved the needle over 4%. The DB is still in Warning state so I’ve got more tuning to do.


Below is another example of a management group with over 86,000,000 stale state change records that was recently cleaned. You can see how the space measurements shifted dramatically by removing the old records. This doesn’t affect the current health state or any active monitors or alerts for the objects; it merely cleans out old state change history that is outside of your configured grooming window.

Below is the query used in the task. It’s similar to the version found here but with a few modifications to make it slightly faster, at least according to the nerd who modified it.

USE [OperationsManager]
SET ANSI_NULLS ON

SET QUOTED_IDENTIFIER ON

BEGIN
	SET NOCOUNT ON

	DECLARE @Err INT
	DECLARE @Ret INT
	DECLARE @DaysToKeep TINYINT
	DECLARE @GroomingThresholdLocal DATETIME
	DECLARE @GroomingThresholdUTC DATETIME
	DECLARE @TimeGroomingRan DATETIME
	DECLARE @MaxTimeGroomed DATETIME
	DECLARE @RowCount INT

	SET @TimeGroomingRan = getutcdate()

	SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())
	FROM dbo.PartitionAndGroomingSettings
	WHERE ObjectName = 'StateChangeEvent'

	EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal
		,@GroomingThresholdUTC OUT

	SET @Err = @@ERROR

	IF (@Err <> 0)
	BEGIN
		GOTO Error_Exit
	END

	SET @RowCount = 1

	-- This is to update the settings table 
	-- with the max groomed data
	SELECT @MaxTimeGroomed = MAX(TimeGenerated)
	FROM dbo.StateChangeEvent
	WHERE TimeGenerated < @GroomingThresholdUTC

	IF @MaxTimeGroomed IS NULL
		GOTO Success_Exit

	-- Instead of the FK DELETE CASCADE handling the deletion of the rows from 
	-- the MJS table, do it explicitly. Performance is much better this way.
	DELETE MJS
	FROM dbo.MonitoringJobStatus MJS
	JOIN dbo.StateChangeEvent SCE ON SCE.StateChangeEventId = MJS.StateChangeEventId
	JOIN dbo.STATE S
	WITH (NOLOCK) ON SCE.[StateId] = S.[StateId]
	WHERE SCE.TimeGenerated < @GroomingThresholdUTC
		AND S.[HealthState] IN (
			0
			,1
			,2
			,3
			)

	SELECT @Err = @@ERROR

	IF (@Err <> 0)
	BEGIN
		GOTO Error_Exit
	END

	-- create temp table to hold StateChangeEvent row ids to be deleted
	IF (OBJECT_ID('tempdb..#StateChangeEventToGroom') IS NOT NULL)
		DROP TABLE #StateChangeEventToGroom

	SELECT @Err = @@ERROR

	IF (@Err <> 0)
		GOTO Error_Exit

	CREATE TABLE #StateChangeEventToGroom (StateChangeEventId UNIQUEIDENTIFIER NOT NULL)

	SELECT @Err = @@ERROR

	IF (@Err <> 0)
		GOTO Error_Exit

	WHILE (@RowCount > 0)
	BEGIN
		-- Delete StateChangeEvents that are older than @GroomingThresholdUTC
		-- We are doing this in chunks in separate transactions on 
		-- purpose: to avoid the transaction log to grow too large.
		-- NOTE: The method when we select ids into temp table
		-- and then groom in a separate statement works faster than just DELETE TOP...
		INSERT #StateChangeEventToGroom (StateChangeEventId)
		SELECT TOP (10000) SCE.StateChangeEventId
		FROM dbo.StateChangeEvent SCE
		JOIN dbo.STATE S
		WITH (NOLOCK) ON SCE.[StateId] = S.[StateId]
		WHERE TimeGenerated < @GroomingThresholdUTC
			AND S.[HealthState] IN (
				0
				,1
				,2
				,3
				)

		SELECT @Err = @@ERROR
			,@RowCount = @@ROWCOUNT

		WAITFOR DELAY '00:00:01' -- waiting 1 seconds

		IF (@Err <> 0)
		BEGIN
			GOTO Error_Exit
		END

		SET STATISTICS TIME ON;

		DELETE dbo.StateChangeEvent
		WHERE StateChangeEventId IN (
				SELECT StateChangeEventId
				FROM #StateChangeEventToGroom
				)

		SET STATISTICS TIME OFF

		SELECT @Err = @@ERROR

		IF (@Err <> 0)
			GOTO Error_Exit

		TRUNCATE TABLE #StateChangeEventToGroom

		SELECT @Err = @@ERROR

		IF (@Err <> 0)
			GOTO Error_Exit
	END

	UPDATE dbo.PartitionAndGroomingSettings
	SET GroomingRunTime = @TimeGroomingRan
		,DataGroomedMaxTime = @MaxTimeGroomed
	WHERE ObjectName = 'StateChangeEvent'

	SELECT @Err = @@ERROR
		,@RowCount = @@ROWCOUNT

	IF (@Err <> 0)
	BEGIN
		GOTO Error_Exit
	END

	Success_Exit:

	IF (OBJECT_ID('tempdb..#StateChangeEventToGroom') IS NOT NULL)
		DROP TABLE #StateChangeEventToGroom

	Error_Exit:
END

DOWNLOAD

Leave a Reply

Your email address will not be published. Required fields are marked *