*NEW* 2023.08.25: v1.0.0.50
Added tasks (ref):
- Show StateChange Age
- Clean StateChange Data
If you do any management pack authoring then you are probably familiar with the headaches that can result from this sequence of activity:
- importing a management pack (which includes class discovery)
- removing that management pack
- changing the service model (classes and/or their existing properties and/or taxonomy)
- importing that management pack again
The agents can appear to get confused and refuse to discover your instances again or if they do get discovered, the rules/monitors may refuse to run for the instances (at least until the agents get their marching orders corrected by the next snapshot synchronization). This has to do with how the class instances get flagged as deleted when the MP is removed but when the MP is imported again the classes may be technically different because of the taxonomy changes but it shouldn’t matter because it’s a new MP but it does matter because obviously something is wrong but, but, but…
…headache.
I won’t attempt to explain the nerd science behind this behavior because I will probably butcher the details.
What I will do is provide a management pack with agent tasks to allow you to run these normally scheduled maintenance activities manually:
- Discovery Data Purge – Force removal of items flagged as ‘IsDeleted=1’
- Snapshot Synchronization – A config update for everything in the management group
These are maintenance activities that your management servers already perform automatically on an unchangeable schedule. However, the discovery purge normally happens at 2am and when you are in the middle of management pack development you may not want to wait around until the next day to test your MP. You may need those cobwebs cleaned out immediately to continue your testing/dev.
If my memory serves me correctly, the snapshot synchronization procedure occurs daily but on its own synch schedule. It can take 30 seconds in small environments but up to 30+ minutes in large environments.
Upon installation of this management pack, a new folder called “Maintenance” will appear in your Monitoring workspace. In this folder, there is a single state view called “OpsDB Watcher” which will display a single entity (the Operational Database Watcher is a system class). This class instance has four new tasks associated which will appear in the “Tasks” pane (screenshot below). These tasks will all use the existing “Operational Database Account” security profile. By default, this profile is empty and therefore the tasks will typically use the “Management Server Action Account“.
The Tasks
The tasks leverage a PowerShell datasource to run the SQL queries. They get executed on the management server where the OpsDB Watcher instance lives; in this lab: ms02.contoso.com.
OpsDB – 1) Execute DiscoveryDataPurge
This will purge entries from the OpsDB BaseManagedEntity table where “IsDeleted = 1”.
/*
Script: DiscoveryDataPurge.SQL
Author: Tyson Paul (this is a fancier version of the classic purge procedure)
Version History:
2021.11.23.1703 - v1
*/
DECLARE @Err int
DECLARE @Ret int
DECLARE @Command nvarchar(MAX)
DECLARE @GroomHistoryId bigint
DECLARE @ItemsToDelete INT
DECLARE @ItemsRemaining INT
DECLARE @ItemsDeleted INT
-- Custom job memo
SET @Command = N'Manual Purge: dbo.p_DiscoveryDataPurgingByTypedManagedEntityInBatches, dbo.p_DiscoveryDataPurgingByRelationshipInBatches, dbo.p_DiscoveryDataPurgingByBaseManagedEntityInBatches'
-- Call the grooming history insert sproc. This will add a default job timestamp with 'Failed' Status. Only upon successful completion will the Status become updated to "Success"
EXEC @Ret = dbo.p_InternalJobHistoryInsert @Command, @GroomHistoryId OUT
SET @ItemsToDelete = (
SELECT count(*) FROM BaseManagedEntity WHERE IsDeleted = 1
)
-- This only purges 10000 records. If you have more it will require multiple runs
-- If it fails with an error set BatchSize to 500 and run multiple times
DECLARE @TimeGenerated DATETIME, @BatchSize INT, @RowCount INT
SET @BatchSize = 10000
-- These 3 procedures below require a DateTime param which is basically a watermark to delete items OLDER than the timestamp.
-- Normally this timestamp is 2-ish days. Here we are setting @TimeGenerated to RIGHT NOW so all entries will be older than right now and will be deleted.
SET @TimeGenerated = GETUTCDATE()
EXEC @Err = p_DiscoveryDataPurgingByRelationship @TimeGenerated, @BatchSize, @RowCount
IF (@Err <> 0)
GOTO Error_Exit
EXEC @Err = p_DiscoveryDataPurgingByTypedManagedEntity @TimeGenerated, @BatchSize, @RowCount
IF (@Err <> 0)
GOTO Error_Exit
EXEC @Err = p_DiscoveryDataPurgingByBaseManagedEntity @TimeGenerated, @BatchSize, @RowCount
IF (@Err <> 0)
GOTO Error_Exit
-- Call the grooming history insert sproc to update status to "Success"
EXEC @Ret = dbo.p_InternalJobHistoryUpdate @GroomHistoryId, 1
SET @ItemsRemaining = (
SELECT count(*) FROM BaseManagedEntity WHERE IsDeleted = 1
)
SET @ItemsDeleted = (
SELECT @ItemsToDelete - @ItemsRemaining
)
Select @ItemsToDelete as 'ItemsToPurge'
,@ItemsDeleted as 'ItemsDeleted'
, @ItemsRemaining as 'ItemsRemaining'
GOTO _Exit
Error_Exit:
Select 'Failed' as 'Status'
_Exit:
OpsDB – 2) Show DiscoveryDataPurgeHistory
This will show the discovery purge history; both scheduled and manual executions.
/*
Script: ShowDiscoveryDataPurgeHistory.SQL
Author: Tyson Paul
Version History:
2021.11.23.1703 - v1
*/
-- build table list to contain friendly names for Status codes
IF (OBJECT_ID('tempdb..#tmpStatus') IS NOT NULL)
DROP TABLE #tmpStatus
CREATE TABLE #tmpStatus (
StatusCode int NOT NULL
,StatusName sysname NOT NULL
)
INSERT #tmpStatus VALUES ( 0, 'Failed')
INSERT #tmpStatus VALUES ( 1, 'Success')
INSERT #tmpStatus VALUES ( 2, 'FailedOther')
select top 20 InternalJobHistoryId
,DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), TimeStarted) AS TimeStartedInLocalTime
,DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), TimeFinished) AS TimeFinishedInLocalTime
,DATEDIFF(second, (DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), TimeStarted) ), (DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), TimeFinished) )) as JobDurationSeconds
,InternalJobHistory.StatusCode
,StatusName
,Command
from InternalJobHistory
INNER JOIN #tmpStatus ON InternalJobHistory.StatusCode = #tmpStatus.StatusCode
--Where Command like '%p_DataPurging%'
order by InternalJobHistoryId DESC
IF (OBJECT_ID('tempdb..#RegularTableToGroom') IS NOT NULL)
DROP TABLE #RegularTableToGroom
OpsDB – 3) Execute Snapshot Synchronization
This triggers a configuration update for everything in the management group.
/*
Script: SnapshotSynchForce.SQL
Author: Tyson Paul (this is a fancy wrapper for CS.SnapshotSynchronizationForce )
Version History:
2021.11.29.1457 - Improved output
2021.11.23.1703 - v1
*/
DECLARE
@ErrorInd bit
,@ErrorMessage nvarchar(max)
,@ErrorNumber int
,@ErrorSeverity int
,@ErrorState int
,@ErrorLine int
,@ErrorProcedure nvarchar(256)
,@ErrorMessageText nvarchar(max)
-- Calculate the average time required to run the procedure. This is good info for the user to help manage expectations.
DECLARE @AvgRunTime INT
SET @AvgRunTime = (
SELECT AVG([DurationSeconds]) as AverageRunTime
FROM CS.WorkItem
INNER JOIN CS.WorkItemState on WorkItem.WorkItemStateId = WorkItemState.WorkItemStateId
WHERE WorkItemName like '%snap%'
)
BEGIN TRY
-- Try to detect if the job is already running
IF NOT EXISTS (
SELECT TOP 1 1
FROM CS.WorkItem
INNER JOIN CS.WorkItemState on WorkItem.WorkItemStateId = WorkItemState.WorkItemStateId
WHERE WorkItemName like '%snap%'
AND WorkItemStateName like 'Running'
ORDER BY WorkItemRowId DESC
)
BEGIN
-- If not already running, run the procedure
EXEC CS.SnapshotSynchronizationForce
-- It often takes a few seconds for the job status to appear in the WorkItem table.
WAITFOR DELAY '00:00:5';
END
--snapshot sync
SELECT [WorkItemRowId]
,[WorkItemName]
,CS.WorkItem.WorkItemStateId
,WorkItemState.WorkItemStateName
,[ServerName]
,[InstanceName]
,[StartedDateTimeUtc]
,[LastActivityDateTimeUtc]
,[CompletedDateTimeUtc]
,[DurationSeconds]
,@AvgRunTime AS 'Average Runtime (seconds)'
,[ErrorMessage]
FROM CS.WorkItem
INNER JOIN CS.WorkItemState on WorkItem.WorkItemStateId = WorkItemState.WorkItemStateId
WHERE WorkItemName like '%snap%'
ORDER BY WorkItemRowId DESC
END TRY
BEGIN CATCH
SELECT
@ErrorNumber = ERROR_NUMBER()
,@ErrorSeverity = ERROR_SEVERITY()
,@ErrorState = ERROR_STATE()
,@ErrorLine = ERROR_LINE()
,@ErrorProcedure = ISNULL(ERROR_PROCEDURE(), '-')
,@ErrorMessageText = ERROR_MESSAGE()
SET @ErrorInd = 1
END CATCH
-- report error if any
IF (@ErrorInd = 1)
BEGIN
DECLARE @AdjustedErrorSeverity int
SET @AdjustedErrorSeverity = CASE
WHEN @ErrorSeverity > 18 THEN 18
ELSE @ErrorSeverity
END
RAISERROR ('Sql execution failed. Error %d, Level %d, State %d, Procedure %s, Line %d, Message: %s', @AdjustedErrorSeverity, 1
,@ErrorNumber
,@ErrorSeverity
,@ErrorState
,@ErrorProcedure
,@ErrorLine
,@ErrorMessageText
)
END
OpsDB – 4) Show Snapshot Synchronization History
This will show the snapshot synch history; both scheduled and manual executions.
/*
Script: ShowSnapshotSynchHistory.SQL
Author: Tyson Paul
Version History:
2021.11.23.1703 - v1
*/
--Snapshot sync history
SELECT [WorkItemRowId]
,[WorkItemName]
,CS.WorkItem.WorkItemStateId
,WorkItemState.WorkItemStateName
,[ServerName]
,[InstanceName]
,[DurationSeconds]
,[StartedDateTimeUtc]
,DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), [StartedDateTimeUtc]) AS TimeStartedLocalTime
,[LastActivityDateTimeUtc]
,[CompletedDateTimeUtc]
,DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), [CompletedDateTimeUtc]) AS TimeStartedLocalTime
,[ErrorMessage]
FROM CS.WorkItem
INNER JOIN CS.WorkItemState on WorkItem.WorkItemStateId = WorkItemState.WorkItemStateId
WHERE WorkItemName like '%snap%'
ORDER BY WorkItemRowId DESC
OpsDB – 5) Show StateChange Age
-- Reveal StateChanges and some stats on age
DECLARE @statedaystokeep INT
SELECT @statedaystokeep = DaysToKeep
FROM PartitionAndGroomingSettings
WHERE ObjectName = 'StateChangeEvent'
SELECT COUNT(*) AS 'Total StateChanges'
,count(CASE
WHEN sce.TimeGenerated > dateadd(dd, - @statedaystokeep, getutcdate())
THEN sce.TimeGenerated
ELSE NULL
END) AS 'Within Grooming Retention'
,count(CASE
WHEN sce.TimeGenerated < dateadd(dd, - @statedaystokeep, getutcdate())
THEN sce.TimeGenerated
ELSE NULL
END) AS '> Grooming Retention'
,count(CASE
WHEN sce.TimeGenerated < dateadd(dd, - 30, getutcdate())
THEN sce.TimeGenerated
ELSE NULL
END) AS '> 30 Days'
,count(CASE
WHEN sce.TimeGenerated < dateadd(dd, - 90, getutcdate())
THEN sce.TimeGenerated
ELSE NULL
END) AS '> 90 Days'
,count(CASE
WHEN sce.TimeGenerated < dateadd(dd, - 365, getutcdate())
THEN sce.TimeGenerated
ELSE NULL
END) AS '> 365 Days'
FROM StateChangeEvent sce
OpsDB – 6) Clean StateChange Data
USE [OperationsManager]
SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
BEGIN
SET NOCOUNT ON
DECLARE @Err INT
DECLARE @Ret INT
DECLARE @DaysToKeep TINYINT
DECLARE @GroomingThresholdLocal DATETIME
DECLARE @GroomingThresholdUTC DATETIME
DECLARE @TimeGroomingRan DATETIME
DECLARE @MaxTimeGroomed DATETIME
DECLARE @RowCount INT
SET @TimeGroomingRan = getutcdate()
SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())
FROM dbo.PartitionAndGroomingSettings
WHERE ObjectName = 'StateChangeEvent'
EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal
,@GroomingThresholdUTC OUT
SET @Err = @@ERROR
IF (@Err <> 0)
BEGIN
GOTO Error_Exit
END
SET @RowCount = 1
-- This is to update the settings table
-- with the max groomed data
SELECT @MaxTimeGroomed = MAX(TimeGenerated)
FROM dbo.StateChangeEvent
WHERE TimeGenerated < @GroomingThresholdUTC
IF @MaxTimeGroomed IS NULL
GOTO Success_Exit
-- Instead of the FK DELETE CASCADE handling the deletion of the rows from
-- the MJS table, do it explicitly. Performance is much better this way.
DELETE MJS
FROM dbo.MonitoringJobStatus MJS
JOIN dbo.StateChangeEvent SCE ON SCE.StateChangeEventId = MJS.StateChangeEventId
JOIN dbo.STATE S
WITH (NOLOCK) ON SCE.[StateId] = S.[StateId]
WHERE SCE.TimeGenerated < @GroomingThresholdUTC
AND S.[HealthState] IN (
0
,1
,2
,3
)
SELECT @Err = @@ERROR
IF (@Err <> 0)
BEGIN
GOTO Error_Exit
END
-- create temp table to hold StateChangeEvent row ids to be deleted
IF (OBJECT_ID('tempdb..#StateChangeEventToGroom') IS NOT NULL)
DROP TABLE #StateChangeEventToGroom
SELECT @Err = @@ERROR
IF (@Err <> 0)
GOTO Error_Exit
CREATE TABLE #StateChangeEventToGroom (StateChangeEventId UNIQUEIDENTIFIER NOT NULL)
SELECT @Err = @@ERROR
IF (@Err <> 0)
GOTO Error_Exit
WHILE (@RowCount > 0)
BEGIN
-- Delete StateChangeEvents that are older than @GroomingThresholdUTC
-- We are doing this in chunks in separate transactions on
-- purpose: to avoid the transaction log to grow too large.
-- NOTE: The method when we select ids into temp table
-- and then groom in a separate statement works faster than just DELETE TOP...
INSERT #StateChangeEventToGroom (StateChangeEventId)
SELECT TOP (10000) SCE.StateChangeEventId
FROM dbo.StateChangeEvent SCE
JOIN dbo.STATE S
WITH (NOLOCK) ON SCE.[StateId] = S.[StateId]
WHERE TimeGenerated < @GroomingThresholdUTC
AND S.[HealthState] IN (
0
,1
,2
,3
)
SELECT @Err = @@ERROR
,@RowCount = @@ROWCOUNT
WAITFOR DELAY '00:00:01' -- waiting 1 seconds
IF (@Err <> 0)
BEGIN
GOTO Error_Exit
END
SET STATISTICS TIME ON;
DELETE dbo.StateChangeEvent
WHERE StateChangeEventId IN (
SELECT StateChangeEventId
FROM #StateChangeEventToGroom
)
SET STATISTICS TIME OFF
SELECT @Err = @@ERROR
IF (@Err <> 0)
GOTO Error_Exit
TRUNCATE TABLE #StateChangeEventToGroom
SELECT @Err = @@ERROR
IF (@Err <> 0)
GOTO Error_Exit
END
UPDATE dbo.PartitionAndGroomingSettings
SET GroomingRunTime = @TimeGroomingRan
,DataGroomedMaxTime = @MaxTimeGroomed
WHERE ObjectName = 'StateChangeEvent'
SELECT @Err = @@ERROR
,@RowCount = @@ROWCOUNT
IF (@Err <> 0)
BEGIN
GOTO Error_Exit
END
Success_Exit:
IF (OBJECT_ID('tempdb..#StateChangeEventToGroom') IS NOT NULL)
DROP TABLE #StateChangeEventToGroom
Error_Exit:
END
New Tasks featured here: https://monitoringguys.com/?p=7950
Examples
In this example I removed a management pack which resulted in several class instances discovered by that pack being marked for deletion. 30 entities were purged. The stored procedures used to do the heavy lifting will remove flagged items in batches of 10,000. If more flagged items exist than the batch limit, the remaining number of flagged items will be shown in the task result. If items remain, simply run the task again until no items remain.
In this example, the Snapshot Synchronization task triggers the “CS.SnapshotSynchronizationForce” procedure and returns the procedure history which includes the current “running” procedure status.
Extra Info
A stored procedure is a prepared SQL snippet (sortof like a function) that you can save, so the code can be reused over and over again. CS.SnapshotSynchronizationForce is a stored procedure found in folder shown below.
Download
ALWAYS CREATE (AND VERIFY) A FULL BACKUP OF YOUR DATABASES BEFORE ATTEMPTING TO MODIFY YOUR DATABASE(S) MANUALLY.
The terms you agree to by using this offering:
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.