Monday, November 29, 2010

SCOM R2 Gateway Server not communicating with the SCOM Management Group: EventID 20070 on the GW server and EventID 20000 on the RMS

Normally when a SCOM Gateway is installed and all prereqs are met, things run like clockwork. In the years I have worked with SCOM I have installed many SCOM GWs, all without any real issues whatsoever. And when something was amiss, it turned out to be something simple like a firewall blocking some traffic, an incorrect certificate or a missing certificate chain. With just a few mouse clicks, all was fine and life was good again.

Until last week that is. I bumped into a GW that wouldn’t work. AT ALL! I could reproduce it as well with another GW, installed in a totally different environment. Strangest thing was that another SCOM R2 GW server was already installed and fully functional. So what was happening? And more importantly, how to solve it?

The Situation:
The SCOM R2 GW is installed and everything is in place (certs, SCOM GW Approval Tool has been run, firewalls have been configured and the lot). So there is a connection from the GW to the MG.
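
A quick way to confirm that kind of basic reachability from the gateway side is a plain TCP connect to the management server. The sketch below is only an illustration: it assumes the default SCOM port 5723, the host name in the usage comment is a placeholder, and can_connect is a hypothetical helper. A successful connect only proves the port is reachable. It says nothing about certificates or approval.

```python
import socket


def can_connect(host, port=5723, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    5723 is assumed here as the default SCOM agent/gateway port;
    adjust it if your environment uses a different one.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage (placeholder host name, run from the gateway server):
# print(can_connect("rms.example.local"))
```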

However, the GW throws EventID 20070 with the message ‘…Check the event log on the server for the presence of 20000 events, indicating that the agents which are not approved are attempting to connect’:
[Screenshot: EventID 20070 on the GW server]

On the RMS side of things, EventID 20000 is shown, stating that the SCOM R2 GW tries to connect but isn’t recognized as part of this Management Group (A device which is not part of this management group has attempted to access this Health Service. Requesting Device Name : <GW SERVER NAME>…):
[Screenshot: EventID 20000 on the RMS]

Things we tried:
Wow! We did many things in order to get it all up & running:

  1. Of course, we checked the firewalls, routers and switches;
  2. Even installed Network Monitor on the RMS;
  3. Renewed the certs on the GW side of it all, reinstalled the SCOM GW;
  4. Reran the GW Approval Tool many times;
  5. Flushed the Health Service State on the RMS and the MS which the GW should report to in order to get a fresh config file (~:\Program Files\System Center Operations Manager 2007\Health Service State\Connector Configuration Cache\<NAME OF MG>\OpsMgrConnector.Config.xml);
  6. Installed the SCOM GW on a totally new server;
  7. Renamed the SCOM GW to see whether the computer name was causing it all;
  8. Ran some verbose logging on the RMS, MS and GWs which only showed EventID 20000 happening and nothing more;
  9. Deleted the SCOM GW and its SITE entry from the SCOM DB, waited until they were groomed out and started all over totally CLEAN;
  10. Ran some good tracing on the firewalls involved as well, showing us the connection was closed by the RMS (EventID 20000).

All to no avail. Nothing solid came out of it.

So I installed a new SCOM GW in a totally different forest. And experienced the same issue! And all that time, the GW server which was installed some weeks ago was running just fine.

Dive Dive!:
So it was time for a deep deep dive. We copied the file OpsMgrConnector.Config.xml from both the RMS and the MS to another location and started to dig into them. Soon we noticed a difference: the file from the RMS contained the Connector information for the fully functional GW server, while the file from the MS didn’t.
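
For anyone who wants to repeat this comparison without eyeballing two large XML files, here is a minimal Python sketch. It makes no assumption about the layout of OpsMgrConnector.Config.xml: it simply walks every element and reports whether the gateway's name shows up anywhere in a text node or attribute value. The function names, the labels and the file paths in the usage comment are all hypothetical, so treat it as a starting point rather than a finished tool.

```python
import xml.etree.ElementTree as ET


def mentions_host(xml_data, host):
    """Return True if host appears (case-insensitively) in any text node
    or attribute value of the XML document.

    xml_data may be str or bytes. No particular schema is assumed.
    """
    needle = host.lower()
    root = ET.fromstring(xml_data)
    for elem in root.iter():
        # Text inside the element and text trailing after it.
        for text in (elem.text, elem.tail):
            if text and needle in text.lower():
                return True
        # Attribute values such as names or addresses.
        for value in elem.attrib.values():
            if needle in value.lower():
                return True
    return False


def compare_configs(configs, host):
    """Map each label (e.g. 'RMS', 'MS') to whether its config mentions host."""
    return {label: mentions_host(data, host) for label, data in configs.items()}


# Usage with copies of the two files (paths and GW name are placeholders):
# from pathlib import Path
# configs = {
#     "RMS": Path(r"C:\temp\rms\OpsMgrConnector.Config.xml").read_bytes(),
#     "MS": Path(r"C:\temp\ms\OpsMgrConnector.Config.xml").read_bytes(),
# }
# print(compare_configs(configs, "gw01.example.local"))
```

Reading the files as bytes lets the XML parser honour whatever encoding the file declares. In the situation described above, the result would show the functional GW only in the copy taken from the RMS.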

That’s strange! I had installed that GW server myself and used the GW Approval Tool to tell SCOM that the GW server should report to the MS, not the RMS. So this entry should be found in the file located on the MS, not the RMS! I checked my installation document for that particular environment and indeed, I referred to the MS, not the RMS….

Time to run a PS-cmdlet which shows WHICH MS the GW server is primarily talking to:
Get-GatewayManagementServer | where {$_.Name -like '<GW SERVER NAME>'} | Get-PrimaryManagementServer

And the output really puzzled me: the functional GW Server wasn’t talking to the MS but to the RMS. Also the people running the firewall (TMG) told me that ONLY the RMS was being published, not the MS!

Now it all hit home! Wow!

The Solution:
I stopped the Health Service on the problematic test GW server, removed the GW server from the SCOM R2 Console and reran the GW Approval Tool, this time referring to the RMS as the Management Server. Then I adjusted the registry on the GW server to reflect the RMS instead of the MS, and restarted the Health Service on the GW.

BINGO!

All was working now!

Did the same for the problematic production GW server and hit the jackpot there as well!

However, some additional work still needs to be done, which will be planned for the days to come:

  1. Publish the MS instead of the RMS on the TMG;
  2. Reconfigure the GWs to talk to the MS and not the RMS (some simple PS-cmdlets will do the trick here);
  3. Adjust the registry entries on the GWs in order to reflect the changes.

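Why? It is generally not good practice to have GWs reporting to the RMS, since the RMS already carries the workload that is unique to its role in the Management Group.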

Puzzled:
Yes, I am still puzzled. WHY does the first functional GW server talk to the RMS instead of the MS, while I ran the GW Approval Tool in such a manner that it should talk to the MS? I have the screen dumps showing it. I really felt stupid and taken by surprise. I also learned a valuable lesson: how to troubleshoot SCOM R2…

Credits:
While troubleshooting this issue many colleagues (Peer, Tim, Wim, Pieter-Jan and Maarten) tuned in. Also got some serious aid from the SCOM MVPs like Pete, Graham, Alexandre, Paul and Simon. Even KH assisted! A good experience it was!

Without their help, effort and time I would not have cracked it! Thank you guys! Much appreciated!

2 comments:

Dan said...

I know this is a fairly old post but we were having this exact issue ourselves.
Unfortunately for us, the registry was already correctly configured. However, we resolved this by clearing the cache on the Management Server we were pointing the new GW at:
1. Stop all “System Center Operations Management” services
2. Delete the ‘c:\program files\system center operations manager 2007\health service state\health service store’ folder
3. Restart the Management Server
4. Wait about 15 to 20 minutes for all services and collections to fully initialize.

Kris said...

I've recently had a similar problem when replacing a 2012 management server; the OpsMgrConnector.config file was missing the entry for my gateway servers (these were domain-joined, not workgroup, so certs weren't to blame).
Flushing the cache on the MS and on the affected gateways and running the following resolved the issue:
$primaryMS = Get-SCOMManagementServer -Name ""
$gatewayMS = Get-SCOMGatewayManagementServer -Name ""
Set-SCOMParentManagementServer -Gateway $gatewayMS -PrimaryServer $primaryMS