Most of the time Config Churn is the culprit here. A discovery has run wild, an override has been set wrongly, or a new rule/monitor has been created which is far too noisy. Or a bad MP is the culprit, like I blogged about long ago.
But still, every time such an EventID pops up on the RMS it needs to be thoroughly investigated. Never presume. Ever!
Yesterday at a customer's site EventID 5300 and 5304 started popping up out of nowhere. No new MPs had been imported, nor had any been adjusted in any way. The RMS just started dying on me. Every 30 minutes! It took me a long time to find the culprit in this case.
First I presumed Config Churn was at play here. So I ran some queries in order to get to the bottom of it. I even unloaded some MPs which turned up in these queries, but still the RMS kept on dying on me.
So no Config Churn here. What else?
Again, like I have blogged many times before, the OpsMgr event log is a VERY good starting point for troubleshooting. SCOM logs a lot and I am happy about it, because this log led me to the culprit and thus enabled me to solve it!
This is what the OpsMgr event log told me some seconds after I restarted the Health Service on the RMS. EventID 33350, DataAccessLayer:
I have grayed out the user name, but in this case the server name of the RMS was displayed.
This is no good. Time to check the SQL server…
Log Name: Application
Date: 5/28/2010 9:33:53 AM
Event ID: 18456
Task Category: Logon
Keywords: Classic,Audit Failure
User: <DOMAIN>\<RMS SERVERNAME>
Computer: SQL SERVER
Login failed for user '<DOMAIN>\<RMS SERVERNAME>'. Reason: Token-based server access validation failed with an infrastructure error. Check for previous errors. [CLIENT: <IP ADDRESS RMS>]
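When these 18456 events come in by the dozen, it can help to pull the account, the reason and the client address out of the message text. Below is a minimal sketch in Python, assuming the message has the same shape as the one shown above. The function name parse_login_failure and the sample account and IP address are my own illustration, not something taken from SQL Server or SCOM.

```python
import re

# Matches the "Login failed" text of a SQL Server Event ID 18456 entry,
# assuming the layout shown in the event above.
LOGIN_FAILED = re.compile(
    r"Login failed for user '(?P<user>[^']*)'\.\s*"
    r"Reason:\s*(?P<reason>.*?)\s*"
    r"\[CLIENT:\s*(?P<client>[^\]]*)\]"
)


def parse_login_failure(message):
    """Return a dict with user, reason and client, or None if no match."""
    match = LOGIN_FAILED.search(message)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


if __name__ == "__main__":
    # Hypothetical sample values; the real event had them masked.
    sample = (
        "Login failed for user 'CONTOSO\\RMS01$'. Reason: Token-based server "
        "access validation failed with an infrastructure error. "
        "Check for previous errors. [CLIENT: 10.0.0.5]"
    )
    print(parse_login_failure(sample))
```

A login failure for a computer account (the RMS itself) rather than a user or service account is the detail to look for here, since it points at AD rather than at SCOM.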
I had some contact with Graham Davies from the UK and he sent me a link to KB321044. It is all about duplicate SPNs. Time to speak with the AD and Infra guys/girls.
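To make the idea of a duplicate SPN concrete: the same SPN is registered on more than one account in AD, so Kerberos cannot tell which account owns the service. Here is a small sketch in Python of that check, assuming you already have a list of (account, SPN) pairs exported from AD by whatever means you prefer. The function name find_duplicate_spns and all account and SPN values are hypothetical, and treating SPNs as case-insensitive is an assumption of this sketch.

```python
from collections import defaultdict


def find_duplicate_spns(registrations):
    """Return SPNs that are registered on more than one account.

    registrations: iterable of (account, spn) pairs.
    Names are compared case-insensitively (an assumption of this sketch).
    """
    owners = defaultdict(set)
    for account, spn in registrations:
        owners[spn.lower()].add(account.lower())
    # Keep only the SPNs claimed by two or more distinct accounts.
    return {
        spn: sorted(accounts)
        for spn, accounts in owners.items()
        if len(accounts) > 1
    }


if __name__ == "__main__":
    # Hypothetical export; not the data from the customer's site.
    exported = [
        ("CONTOSO\\RMS01$", "MSOMHSvc/rms01.contoso.local"),
        ("CONTOSO\\OLDRMS$", "msomhsvc/RMS01.contoso.local"),
        ("CONTOSO\\SQL01$", "MSSQLSvc/sql01.contoso.local:1433"),
    ]
    for spn, accounts in find_duplicate_spns(exported).items():
        print(spn, "->", ", ".join(accounts))
```

This only reports the duplicates. Deciding which registration is the right one, and cleaning up the other, is still a job for the AD people.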
It turned out the computer name and the SPN had become corrupt in AD. Even though there are ways to reset/remove/rebuild/reregister an SPN with some tooling, I preferred to take another approach which would assure me all would be OK again for a longer time and guarantee me that the issue was 100% solved. This is what I did (with the help of a much appreciated system administrator):
- Made a backup of the OpsMgr DB;
- Promoted a MS to new RMS;
- Demoted the old RMS to MS;
- Checked the new RMS in order to see whether it stayed functional: IT DID! Yeah!;
- Removed the computer account of the old RMS from AD;
- Disjoined the old RMS from the AD domain;
- Rejoined the old RMS to the AD domain;
- SCOM on this server became problematic so I removed it completely;
- Deleted the old RMS from SCOM;
- Reinstalled SCOM on the old RMS;
- Applied CU#2 on that server;
- Promoted the old RMS back to RMS;
- Demoted the new RMS back to MS;
- All Agents were now reporting to the original RMS, so they had to be set back to their original MS.
And now all is well again. Phew!
The cause of the corrupt SPN and computer name has been addressed as well. So this is likely not to happen again.
Time to start the weekend!