Configuration Manager Update not progressing

Technical Preview 1708 was released yesterday and I fired up the lab virtual machines to give it a try. ***Several Hours Later*** I am still trying to install. 1st I had a few SQL issues to sort out, but they mainly came down to letting the database start before opening the console. Next the download would not start. It turns out the sms_site_component_manager was crashing. Luckily a quick search found that it was a known issue: https://social.technet.microsoft.com/Forums/msonline/en-US/54d5a139-a4b1-4cb5-9644-2b826c4b56eb/site-component-manager-crashing-once-per-hour-after-upgrade-to-1706?forum=ConfigMgrCBGeneral

OK, now that the component manager is running, the download completes and I can start the install. Everything looks normal and I head off to bed to let it cook overnight. Imagine my surprise to find that this morning the status has not changed. The next step in the update would be to run the prerequisite check and it has no status. Off to the logs, and I find that CMUpdate.log has not been updated in a while. One of the last entries is “CONFIGURATION_MANAGER_UPDATE service is signalled to stop…” So I check the Windows services and sure enough the CONFIGURATION_MANAGER_UPDATE service is not running. After starting that service, everything is progressing again. I am adding this to my growing list of upgrade checks and making a pot of coffee while I watch the logs for this upgrade.
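A quick way to check for that on the site server (just a sketch; the service name is exactly as it appears in the log entry above):

    # Confirm the update service is running and start it if it is not
    Get-Service -Name CONFIGURATION_MANAGER_UPDATE
    Start-Service -Name CONFIGURATION_MANAGER_UPDATE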

Everybody out of the pool, the Application Pool

Hi, my name is MrBoDean and I need to confess that I am not running a supported version of SCCM. Yes, I am migrating to Current Branch, but the majority of my systems are still on SCCM 2012 R2 without SP1. The reason why is quite boring and repeated at many companies I am sure, but it takes a while to tell. So for the past year it feels like duct tape and baling wire are all that is keeping the 2012 environment up while we try to upgrade between a string of crises. It is a shame when the Premier Support engineers are on a 1st name basis with you.

So Monday night I was called at 2AM because OSD builds were failing. It happens, and most times a quick review of the log files points you in the right direction. Not so much this time around. The builds were failing to even start; every one was failing with the error

I have 4 Management Points and they can not all be down. They are up and responding when I test them with
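(The exact check isn't shown in the post; one common way to confirm each MP is responding is to request its MPLIST endpoint, sketched here with placeholder server names.)

    foreach ($mp in 'MP1','MP2','MP3','MP4') {
        # A 200 response with XML content means the management point answered
        $r = Invoke-WebRequest -Uri "http://$mp/sms_mp/.sms_aut?mplist" -UseBasicParsing
        '{0} : {1}' -f $mp, $r.StatusCode
    }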

Ok, back to the smsts.log for the client that is failing. It starts up fine and even does its initial communication for MPLocation and gets a response

It picks the 1st MP in the list and sends a Client Identity Request. That fails quickly with a timeout error.

While it does retry, it only submits the request to one MP. The retry fails with the same error and the build fails before it even starts. Nothing stands out initially and, being a little groggy, I go for the old standby of turning it off and back on again. MP1 was the one getting the timeouts so it gets the reboot. After the reboot we try again and get the same error. At this point a couple of hours have passed, the overnight OSD builds are canceled, and I grab a quick nap to start again 1st thing in the morning. Well, that was the plan until the day crew starts trying to do OSD builds and everything everywhere is failing. So I open a critical case with Microsoft.

While waiting for the engineer to call I keep looking at logs, trying to identify what is going on. I RDP into MP1 to check the IIS configuration and notice that the system is slow to launch applications. I take a peek at task manager and see that RPC requests were consuming 75% of the available memory. To reset those connections and get the system responsive quickly, down it went for another reboot. Once it came back up, I took a chance and tried to start an OSD build. This time it worked. So the good news goes out to the field techs. Now I just need to figure out what happened and explain why. Management always needs to know why and what you are doing to not let it happen again.

About this time the Microsoft engineer calls and we lower the case to a normal severity. I capture some logs for him and, to his credit, he quickly finds that MP1 was returning a 503.2 IIS status when the overnight builds were failing. To reduce the risk of this occurring again we set the connection pool limit to 2000 for the application pool “CCM Server Framework Pool” on the management points. I get the task of monitoring to make sure the issue does not return and we agree to touch base the next day.

Well, I am curious about what led to this and how long it has been going on. Going back over the past couple of days I see a clear spike in the 503 errors Monday evening, starting with a few thousand and ramping up to over 300,000 by Tuesday morning. While I recommend using Log Parser to analyze the IIS logs, if you are just looking for a count of a single status code you can get it with PowerShell. This will give you the count of the 503 status with a subcode of 2. (Just be sure to update the log file name to the date you are checking.)
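The snippet isn't included above, but counting 503.2 lines in a single IIS log can be as simple as this (the site ID and log file name are placeholders; point it at the W3SVC folder and date you are checking):

    # IIS W3C logs record "sc-status sc-substatus" as " 503 2 " for a 503.2 response
    $log = 'C:\inetpub\logs\LogFiles\W3SVC1\u_ex170815.log'
    (Select-String -Path $log -Pattern ' 503 2 ').Count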

While I still have not found out why, at least I know what was causing the timeout error. While that knowledge, I finally get some sleep. Surprised that no one called to wake me up because the issue was occurring again, I manage to get into the office early and start looking at the logs again and see another large spike in the 503 errors. I do a quick test to be sure OSD is working and it is. A quick email to the Microsoft engineer and some more log captures leads to an interesting conversation.

We check to make sure that the clients are using all the management points with this sql query
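The query we used isn't reproduced here; a hedged approximation that gives a client count per management point, using the v_CH_ClientSummary client-health view (your schema and column names may differ by version), looks like this:

    # Requires the SqlServer module (Install-Module SqlServer).
    # 'CM01' and 'CM_ABC' are placeholders for your site database server and database.
    Invoke-Sqlcmd -ServerInstance 'CM01' -Database 'CM_ABC' -Query @"
    SELECT LastMPServerName, COUNT(*) AS Clients
    FROM v_CH_ClientSummary
    GROUP BY LastMPServerName
    ORDER BY Clients DESC
    "@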

And we see that the clients are using all the management points, but MP1 and MP4 have about twice as many clients as the other two management points. Next we check the number of web connections both of these servers have with netstat in a command prompt.

*Just in case you try to run this command in PowerShell, you will find that the PowerShell parser will evaluate the quotes and cause the find command to fail. To run the command in PowerShell, escape the quotes.
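The original command isn't shown; a count of established connections along these lines is probably what you want. From a command prompt the quotes pass straight through to find.exe, while in PowerShell wrapping the double-quoted search string in single quotes is one way to get it past the parser:

    # cmd.exe version:
    #   netstat -an | find /c "ESTABLISHED"
    # PowerShell version - protect the double quotes from the parser:
    netstat -an | find /c '"ESTABLISHED"'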

This showed that MP1 and MP4 were maintaining around 2000 connections each. With an app pool connection limit of 2000, any delay in processing requests can quickly lead to the limit being exceeded and lots of 503 errors will result. So this time the connection pool limit was set to 5000. But a word of caution before you do this in your environment: when a request is waiting in the queue, by default it must complete within 5 minutes or it is kicked out and the request will have to be retried. Be sure that your servers have the CPU and memory resources to handle the additional load that this may cause.
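Assuming the "connection pool limit" here maps to the application pool's Queue Length in IIS (my reading of the change, so verify against your own notes before copying it), the bump can be scripted on each MP:

    # Raise the queue length on the CCM Server Framework Pool application pool
    Import-Module WebAdministration
    Set-ItemProperty -Path 'IIS:\AppPools\CCM Server Framework Pool' -Name queueLength -Value 5000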

In SCCM 2012 R2 pre SP1 there is no preferred management point. Preferred management points were added in SP1 and improved in Current Branch to be preferred by Boundary Groups. In 2012 your 1st management point is the preferred MP until the client location process rotates the MP or the client is unable to communicate with an MP for 40 minutes. In this case MP1 is the initial MP for all OSD builds because it is always 1st in an alphabetically sorted list. MP4 is the default MP for the script used for manual client installs. If my migration to Current Branch was done I would be able to assign management points to boundary groups and better balance out the load. But until then I am tweaking the connection limit on the application pool to keep things working. Hopefully you are not in the same boat, but if you are, maybe this can help.

 

TP 1707 Run Scripts Thoughts

I just finished kicking the tires on the Configuration Manager Technical Preview 1707 Run Scripts feature. This is a post on my personal thoughts about the functionality, and as time moves on and more improvements are made all these thoughts may (and hopefully should) become irrelevant.

This has great potential but also poses several challenges.

  • Currently the results of the script execution are hard to report on for most users. For example, someone important wants you to run a script and report which systems returned XYZ. To do that you have to query SQL and then parse the results. But if you are not the SCCM Admin you may not have any SQL rights to the database. So building custom reports is in your future until there are some canned ones.
  • Releasing this in a large enterprise is almost a no go until the ability to apply scopes and RBAC goodness is available. I can just imagine explaining this to compliance officers and auditors. Oh the meetings we will have.
  • There seem to be plans to support revisions of scripts, but for the moment you have to be perfect or start over. As my wife often reminds me, I am good but not perfect. I need to correct mistakes. As admins we get to update all the other objects if needed; we will need to update scripts too.
  • A kill switch. One bad script deployed to everything could suck up resources or do worse on every system. Good review and use of the approval workflow should help, but you know someone will do it.

Configuration Manager TP 1707 – Run Scripts

I want to talk a bit about the new Run Script feature that was added in 1706. In Technical Preview 1707 it gained the option to add parameters to a script. This has the potential to be a huge benefit to many users of Config Manager and is a great example of SaaS quickly delivering functionality.

Creating a script is very straightforward. For this example it is just a query of WMI for Win32_ComputerSystem.
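The script body itself isn't shown above; something as simple as this (a hedged example) is all it takes:

    # Return basic computer system details from WMI
    Get-WmiObject -Class Win32_ComputerSystem | Select-Object Manufacturer, Model, Name, Domain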

After the script is created, you must approve it. (There is a hierarchy setting to allow\stop authors from approving their own scripts. This should only be allowed in a test environment.) After the script has been approved it can be run. To run a script, go to a collection with the systems you would like to target. You can run the script against the collection as a whole or individual systems in the collection. (You must show the collection membership to target individual systems. The Run Script option is not available via the default device view.)

Next select the script to run

To view the results of the script execution you will need to use Script Status in the Monitoring view.

 Any output from the script is stored in Script Output. For a good peek at what is going on behind the scenes, check out this great write-up from the 1706 TP by Tom Degreef

Now for the new stuff. Parameters!! Create a new script using the same simple WMI query, this time with a parameter.
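A hedged version of that parameterized script: the class name becomes a parameter, with the original query as its default.

    param(
        [string]$ClassName = 'Win32_ComputerSystem'
    )
    # Query whichever WMI class is passed in at run time
    Get-WmiObject -Class $ClassName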

If you click next you will be able to set the default value for the parameter.

BUG… errr feature alert… If you click next or back without editing the parameter value the edit button is no longer present.

Not to worry you will be able to edit the parameter at run time.

When you run a script with a parameter you get a new dialog that allows you to edit the parameter values.

If you were not able to set the value when creating the script, or chose not to, click on the parameter name and click edit. Be sure the parameter name is highlighted or the edit button will not do anything. I spent a bit thinking how silly it was to not be able to edit a parameter more than once. Rechecking my steps proved that was not the case.

Set the parameter value and let the script run.

Hopefully this will get you started with running scripts with parameters.

Cleaning Up WSUS based on what you are not deploying in Configuration Manager

Let me start with this statement: I wish I had something other than WSUS stuff to talk about. It has been another long week and more issues related to patching. Even with all the other tips I have shared, we experienced major issues getting patches applied. In case you are not aware, the Windows Update agent can have a memory allocation error. The good news is that if you keep your systems patched there is a hotfix to address the issue on most systems. The bad news is that the patch for the issue was not made available for the Standard editions of Windows Server 2008 and 2012. If you have these operating systems installed and they are 64-bit versions with plenty of memory, you may not see the issue or it may just be transitory and clear up on the next update scan. I am not that lucky and have lots of Windows 2012 Standard servers with 2GB of memory.

The strange part of this is that it seemed like some systems would complete a scan and report a success, only to then report corruption of the Windows Update data store. This would cause the next update scan to be a full scan, it would rebuild the local data store, and the cycle of issues would start again. The fun part is that while this is occurring, if you deploy patches via Configuration Manager the client will fail to identify any patches to apply and report that it is compliant for the updates in the deployment. The next successful software update scan would then find the patches missing and the system will return to a non-compliant state. (This is justification for external verification of patch installs from whatever product you use to install patches. But that is a story for another day.) So back to the post from Microsoft on the issue: basically, if you can not apply the hotfix you have 2 options.

  1. Move wuauserv (Windows Update Agent) to its own process, as sketched after this list. (But on systems with less than 4GB of memory this will not gain you much and can be counterproductive and impact applications running on the server.)
  2. Clean up WSUS.
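Option 1 is a service configuration change; from an elevated prompt it is just (note the required space after type=):

    # Move the Windows Update service into its own svchost.exe process, then restart it
    sc.exe config wuauserv type= own
    Restart-Service -Name wuauserv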

For my issue, adding memory to the clients was recommended, with the Server team to make the change. But one of the joys of working in a large enterprise is that this will take a while (not weeks… months at least). In the interim, I need to do everything possible to decline updates in WSUS to reduce the catalog size. At the start of these steps I had ~6200 un-declined updates in WSUS. The guidance I got from Microsoft was to target between 4000-5000 updates in the catalog, but the lower the number the better off we would be.

Step one: review the products and categories that we sync. This was easy because we already review this routinely. There was not much to change, but I did trim a few and could decline 100 or so updates. Not much, but everything helps.

Step two: review the superseded updates. Due to earlier patching issues our patching team had requested that we keep superseded updates for 60 days. Now, this was before the updates had moved to the cumulative model, and at that point ensuring the current security patches were applying was critical. (Thank you WannaCry and NotPetya.) So I checked to see which updates had been superseded for 30 days. I found ~1300; checking for less than 30 days only found 1 more. Big win there, so after declining those the WSUS catalog was down to ~4700 updates. That got us under the upper limit of the suggested target. After triggering scans on the systems having issues and reviewing the status, it did help, but not enough to call it a significant improvement.
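I did this check with the WSUS API; a hedged sketch of step two (using ArrivalDate as a rough stand-in for "superseded for 30 days", since the API does not expose a supersedence date directly) looks like this:

    # Adjust the server name, SSL setting and port for your environment
    [void][Reflection.Assembly]::LoadWithPartialName('Microsoft.UpdateServices.Administration')
    $Wsus   = [Microsoft.UpdateServices.Administration.AdminProxy]::GetUpdateServer('WSUS01', $false, 8530)
    $cutoff = (Get-Date).AddDays(-30)
    $superseded = $Wsus.GetUpdates() |
        Where-Object { -not $_.IsDeclined -and $_.IsSuperseded -and $_.ArrivalDate -lt $cutoff }
    # Review first, then uncomment the decline
    $superseded | Select-Object Title, ArrivalDate | Out-GridView -Title 'Superseded updates'
    # $superseded | ForEach-Object { $_.Decline() }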

Step three: break out the coffee and dig in. Wouldn’t it be great to see which patches had not been declined and are not deployed in Configuration Manager? It is easy enough to see what is not deployed in the SCCM console, but you have to look up the update in WSUS to see if it has been declined. At this point I am on the hook to stay up, monitor the patch installs, and help the patching team; there are a couple of hours to kill between the end of the office day and when the bulk of our patch installs occur. So I started poking around to see what I could do to automate the comparison between Configuration Manager and WSUS. Our good friend PowerShell to the rescue. First thing is to get the patches from SCCM.
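My exact snippet isn't reproduced here, but a hedged equivalent using the SMS provider over WMI (swap in your own site server and site code) is:

    $SiteServer = 'CM01'   # your site server / SMS provider
    $SiteCode   = 'ABC'    # your site code
    $AllPatches = Get-WmiObject -ComputerName $SiteServer -Namespace "root\sms\site_$SiteCode" -Class SMS_SoftwareUpdate
    # Look at the first one to see what properties are available
    $AllPatches | Select-Object -First 1 | Format-List *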

This connects to your server and gets all the patches listed in the console and selects the first one so you can take a look at all the properties. I am excluding a few with identifying information but you will see something similar.

Looks great, and lots of things to use to select patches to check on. However, if you use a query or filter you will find that a lot of those properties are lazy properties. If you pull all the properties for the 1000s of patches, the script will run a looooong time. However, if you do a select on the object you will get the value reported from the query, and you can select what you want using a Where-Object in PowerShell. I decided that the following properties would allow me to evaluate the patches: LocalizedDisplayName, CI_UniqueID, IsDeployed, NumMissing
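So the working set becomes a plain select of just those properties:

    $Patches = $AllPatches | Select-Object LocalizedDisplayName, CI_UniqueID, IsDeployed, NumMissing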

Now to get patches that are not deployed and are not required
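A hedged example of that filter (not deployed and no systems reporting it missing):

    $NotDeployedNotRequired = $Patches | Where-Object { -not $_.IsDeployed -and $_.NumMissing -eq 0 }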

And patches that are not deployed and are required
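And the inverse, updates that are needed somewhere but have no deployment:

    $NotDeployedRequired = $Patches | Where-Object { -not $_.IsDeployed -and $_.NumMissing -gt 0 }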

Using this information you can determine criteria to select the patches to decline. I settled on patches that are not required, not deployed, and have been available for more than 30 days. You can download the script from https://gallery.technet.microsoft.com/Decline-Update-in-WSUS-e934565f
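The linked gallery script is the real deal; the core idea (hedged sketch only) is matching CI_UniqueID to the WSUS update GUID and declining it:

    # Assumes the WSUS administration assemblies are installed; adjust server, SSL and port
    [void][Reflection.Assembly]::LoadWithPartialName('Microsoft.UpdateServices.Administration')
    $Wsus = [Microsoft.UpdateServices.Administration.AdminProxy]::GetUpdateServer('WSUS01', $false, 8530)
    foreach ($Patch in $NotDeployedNotRequired) {
        $Update = $Wsus.GetUpdate([guid]$Patch.CI_UniqueID)
        if (-not $Update.IsDeclined) { $Update.Decline() }
    }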

 

Another ~2500ish declined and now the WSUS catalog is down to ~2200  patches. This did help improve the scans and patch deployments for all but the servers with 2GB of memory. But the patches for them can be delivered via a software distribution package until all the memory upgrades are completed.

 

 

WSUS Error Codes

I have found that troubleshooting WSUS is like peeling an Onion. Fix one thing only to find another problem. It is enough to make you cry\scream\drink\etc…

This post is how I approach two common issues.  The error codes below come from the client logs and\or SQL. If you need some help pulling the error codes from SQL see http://www.mrbodean.net/2017/06/25/software-update-troubleshooting-finding-the-problem-children/


0x80072EE2 – The operation timed out

This can be caused by anything that impacts communication between  the client and the WSUS server. Here is my list to check before asking the network guys what changed:

  • Ensure that the WSUS IIS application pool is running on the WSUS server the client is communicating with.
  • Check the CPU & memory utilization on the WSUS server. High utilization by the WSUS IIS application pool can cause timeouts for the clients. This is also a sign that you may need to do some cleanup or reindex the WSUS database, see http://www.mrbodean.net/2017/06/04/wsus-the-redheaded-step-child-of-configuration-manager/
  • Check the event logs on the WSUS server for WSUS IIS application pool crashes. This is a definite sign that you need to do some cleanup and reindex the WSUS database, see http://www.mrbodean.net/2017/06/04/wsus-the-redheaded-step-child-of-configuration-manager/
  • Make sure the WSUS server is up. Yes, I know that this should be 1st. But if you follow directions like me, it is right where it should be.
  • Ensure that the client can communicate with the WSUS server over the correct port. Use this URL and replace the server name and port to match your environment: http://<yourservernamehere>:8530/ClientWebService/susserverversion.xml (a quick PowerShell test is sketched after this list).
    • If the xml request fails you may have a new firewall and\or ACL blocking communication. Bake some cookies and ask the network team what happened. Withhold the cookies until everything works or they prove it is not the network.
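A quick way to run that test from an affected client (hedged; swap in your server name and port):

    # A 200 response with XML content means the client can reach the WSUS client web service
    Invoke-WebRequest -Uri 'http://yourservernamehere:8530/ClientWebService/susserverversion.xml' -UseBasicParsing |
        Select-Object StatusCode, Content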

0x80244010 – Exceeded max server round trips

This is a long standing issue with WSUS, see https://blogs.technet.microsoft.com/sus/2008/09/18/wsus-clients-fail-with-warning-syncserverupdatesinternal-failed-0x80244010/

1st step is to decline unused updates. Make sure you only sync what you are patching and decline what is not being used, see http://www.mrbodean.net/2017/06/04/wsus-the-redheaded-step-child-of-configuration-manager/ (It feels like I am beating a dead horse, but you have no idea how many times that has been the resolution.)

After doing the cleanup you may find that you need to increase the Max XML per Request. By default the XML response is capped at 5MB and limited to 200 exchanges (round trips), see the Microsoft blog post above. The SQL query below will allow for an unlimited response size. (BE AWARE THIS CAN HAVE NEGATIVE IMPACTS! – Your network team may come find you and withhold cookies until you stop saturating all the WAN links.) You may need to turn this on and off to address issues. If you have a large population of clients on the other side of a slow link and need to frequently enable this, you may need to rethink your design for WSUS or the SUP for SCCM.
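Based on the Microsoft post linked above, the change is made in SUSDB; shown here through Invoke-Sqlcmd (for WID-based WSUS installs use np:\\.\pipe\MICROSOFT##WID\tsql\query as the instance name):

    # Setting MaxXMLPerRequest to 0 removes the per-request size cap - use sparingly
    Invoke-Sqlcmd -ServerInstance 'WSUSSQL01' -Database 'SUSDB' -Query @"
    UPDATE tbConfigurationC SET MaxXMLPerRequest = 0
    "@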

To return this to the default setting
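To my understanding the default is 5 MB, so treat the value below as an assumption to verify against the blog post before running it:

    Invoke-Sqlcmd -ServerInstance 'WSUSSQL01' -Database 'SUSDB' -Query @"
    UPDATE tbConfigurationC SET MaxXMLPerRequest = 5242880
    "@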

 

Software Update Troubleshooting – Finding the Problem Children

It can seem like a never-ending struggle to keep Configuration Manager clients healthy and ready to install software and patches. After fighting with WSUS the past few patch cycles, I have been spending time drilling into the client-side issues. Eswar Koneti has a post with a great SQL query to help identify clients that are not successfully completing a software update scan. Eswar’s query reports the last error code as it is stored in SQL, as a decimal; I find it helpful to convert it to hex as that is what you will see in the client log files. (This makes your googlefu more efficient.) Using Eswar’s query as a base, I created this query to help focus on the problem areas.
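My exact query isn't shown above; a hedged reconstruction that groups failed scans by error code, with a hex conversion for easier searching, is below. It uses the v_UpdateScanStatus view; adjust the server and database names to your site.

    Invoke-Sqlcmd -ServerInstance 'CM01' -Database 'CM_ABC' -Query @"
    SELECT uss.LastErrorCode,
           '0x' + CONVERT(varchar(8), CAST(uss.LastErrorCode AS varbinary(4)), 2) AS HexErrorCode,
           COUNT(*) AS NumberOfSystems
    FROM v_UpdateScanStatus uss
    WHERE uss.LastErrorCode <> 0
    GROUP BY uss.LastErrorCode
    ORDER BY NumberOfSystems DESC
    "@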

This gives you a report of the number of systems that are experiencing the same error. A small modification allows you to focus in on specific client populations, for example to just report on servers.
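For example, joining to v_R_System and filtering on the operating system name (hedged; the column is Operating_System_Name_and0 in my database) limits the report to servers:

    Invoke-Sqlcmd -ServerInstance 'CM01' -Database 'CM_ABC' -Query @"
    SELECT uss.LastErrorCode, COUNT(*) AS NumberOfSystems
    FROM v_UpdateScanStatus uss
    JOIN v_R_System sys ON sys.ResourceID = uss.ResourceID
    WHERE uss.LastErrorCode <> 0
      AND sys.Operating_System_Name_and0 LIKE '%Server%'
    GROUP BY uss.LastErrorCode
    ORDER BY NumberOfSystems DESC
    "@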

Using the results you can then query for the systems that are experiencing the same error
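A hedged example, listing the systems hitting one specific error code (the one used in the next paragraph):

    Invoke-Sqlcmd -ServerInstance 'CM01' -Database 'CM_ABC' -Query @"
    SELECT sys.Name0, uss.LastErrorCode, uss.LastScanTime
    FROM v_UpdateScanStatus uss
    JOIN v_R_System sys ON sys.ResourceID = uss.ResourceID
    WHERE uss.LastErrorCode = -2145107952
    ORDER BY sys.Name0
    "@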

In this example the error code -2145107952 has a hex value of 0x80244010, which translates to “Exceeded max server round trips.”
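If you just need the decimal-to-hex translation for a single code, PowerShell's format operator does it in one line:

    '0x{0:X8}' -f -2145107952   # returns 0x80244010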

Armed with this info I can begin tackling the largest group of systems with the same error. While the root cause and resolution can be different depending on the environment, these steps will help identify what to focus on.

 

 

WSUS the Redheaded step child of Configuration Manager

So like a lot of people I drank the kool aid for WSUS and Config Manager. Install the feature, let SCCM configure it, and forget the WSUS console exists. As long as you do some occasional maintenance it just works. Then the cumulative patches came along, and every month this year has had 5-6 days devoted to “fixing” the WSUS\SUP servers. I know I am not alone in fighting high CPU spikes while patching. I added more memory and CPU to the servers; it helped, but the next month the issue returned. I opened a case with Microsoft and got a very intensive lesson on how to do maintenance the right way. If you need to learn that, check out The complete guide to Microsoft WSUS and Configuration Manager SUP maintenance and then follow it. But even after all that, the issue started to reoccur as my patching team was dealing with the stragglers from the last patching round.

So I opened another case with the wonderful folks at premier support and we started looking. This time around I was just getting spikes in CPU that would clear up after an hour or six. As we checked and rechecked everything, we were seeing that as few as 50ish connections to the WSUS site would spike the CPU utilization up to 80-90%. Prior to all these issues I would see an average CPU utilization on these servers of 30-40%. While there would be spikes during heavy patching periods, they were also accompanied by large numbers of connections to the WSUS site. Using this as justification to finally clean out some obsolete products from the catalog (yes, Server 2003 was still in there), I unchecked a few products and synced. After running the cleanup, reindex, and decline process: still no improvement.

After looking at the calendar and seeing the next Patch Tuesday coming quickly, I thought, well, if it is going to be another crappy patch cycle, let's try just doing security patches and kick everything else out of the pool. The Updates classification has the largest number of updates in my environment. (This may not be the case in yours.) So I unchecked the classification and synced. Wow, performance dropped back to normal. To be sure, I triggered a couple of thousand update scans. I was able to get several hundred active connections and the CPU never spiked over 60%, averaging ~30% utilization. To double check that this was truly the cause, I added the Updates classification back and synced. The sync took about 2 hours to finish and the CPU utilization started spiking again, this time 90-100%. Time to dig in and look for the root cause.

So I start searching through the updates in WSUS and comparing to what is being deployed via Config Manager. WSUS still has lots of Server 2003 updates, and I just removed that product, so why are they still approved? I even found some XP and 2000 updates approved in WSUS, and those have been long gone. But the updates were approved, and the WSUS server was diligently evaluating them to see if they applied and updating the status for them as well. So based on the assumption that all those old products were increasing the catalog to the point that performance was suffering, I started looking for a way to clean up. *While I am going to talk about my script and hope you use it, full credit to the Decline-SupersededUpdates.ps1 linked in The complete guide to Microsoft WSUS and Configuration Manager SUP maintenance and to Automatically Declining Itanium Updates in WSUS as the basis for how to do all this cleanup via PowerShell.

Now back to the investigation. I still wanted to figure out why all these updates had been approved. After lots of checking and comparing between various sites I found that my top-level WSUS server for SCCM had the default auto approval rule enabled in WSUS. Well, that explains the why, but now for the cleanup. To help identify the updates I wanted to decline I used this PowerShell:
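A hedged reconstruction of the idea described below; adjust the variables for your environment and run it on the WSUS server or anywhere the WSUS administration console is installed:

    [void][Reflection.Assembly]::LoadWithPartialName('Microsoft.UpdateServices.Administration')
    $WsusServer = $env:COMPUTERNAME   # or the remote WSUS server name
    $UseSSL     = $false
    $Port       = 8530
    $Wsus = [Microsoft.UpdateServices.Administration.AdminProxy]::GetUpdateServer($WsusServer, $UseSSL, $Port)
    # Everything that has not been declined, ready to browse in a GridView
    $allupdates = $Wsus.GetUpdates() | Where-Object { -not $_.IsDeclined }
    $allupdates |
        Select-Object Title, ProductTitles, UpdateClassificationTitle, IsSuperseded, CreationDate |
        Out-GridView -Title 'Un-declined WSUS updates'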

This will grab all updates that are not declined and send them to a GridView window. I like this because when WSUS is overworked the console can time out frequently, and I find it easier to search through all the updates this way. A few things to remember about this: the code assumes you are running on the WSUS server you are checking, but it can be run remotely on any system that has the management tools installed. You will need to adjust the variables to match your environment.

If you get timeouts with this then your wsus server needs some love. You can retry but if you get timeouts 2 or 3 times stop and go read the complete guide to Microsoft WSUS and Configuration Manager SUP maintenance. Follow those steps and come back and try again.

Once you get the GridView window, start searching for updates that can be declined. For example, search for XP and see what you get. Here is what I found on one of my servers: lots and lots of XP updates. What I found is that even when you stop syncing a product, the updates already in the catalog stay until you decline them. Why does that matter, you ask? While the clients do not get them sent to them, the WSUS server has to process the updates in queries when a client requests a scan. In my case a server with plenty of CPU and memory, using a full SQL install, could only handle ~50 scan requests before getting overworked. After declining all the old unwanted updates, performance returned to normal.

Using the variable $allupdates from the PowerShell above, I created several rules to identify and decline updates. Now, this is what could be declined in my environment. YOU MUST EVALUATE WHAT CAN BE DECLINED IN YOUR ENVIRONMENT. I am posting these as examples of what I did and how I cleaned up my environment. If you copy what I did and find that you need the updates, all is not lost; just approve the update again and it will be available again.
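For illustration only (these title matches are assumptions about what is unused in a given environment, not a recommendation), the rules looked something like this:

    # Build candidate lists from $allupdates, review them, then decline
    $xpUpdates      = $allupdates | Where-Object { $_.Title -match 'Windows XP' }
    $win2003Updates = $allupdates | Where-Object { $_.Title -match 'Server 2003' }
    $itaniumUpdates = $allupdates | Where-Object { $_.Title -match 'Itanium|IA64' }

    # Review before declining anything
    $xpUpdates | Select-Object Title | Out-GridView -Title 'XP updates to decline'

    # Once you are sure, decline them
    $xpUpdates | ForEach-Object { $_.Decline() }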

Now with all that being said, I wish that I could give you a definitive recommendation on what number of un-declined updates will cause you issues, but I can't because every environment is different. What I can say is that we now monitor the WSUS catalog and our maintenance processes ensure that unused and unwanted updates are declined.

Here is the full cleanup script I used to get back to normal

 

Config Migration Tip – Use PowerShell to export and import Security Roles

I have been doing a lot of migration prep work and wanted to share a big time saver for moving security roles: you can use PowerShell to export and import them. If you have lots of custom roles this is huge.

To export all of the custom roles
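A hedged sketch using the ConfigMgr console cmdlets (run from a machine with the console; 'ABC' is a placeholder site code and the export path is an example):

    # Load the ConfigMgr module and switch to the site drive
    Import-Module "$($env:SMS_ADMIN_UI_PATH)\..\ConfigurationManager.psd1"
    New-Item -ItemType Directory -Path 'C:\Temp\SecurityRoles' -Force | Out-Null
    Set-Location 'ABC:'
    # Export every role that is not built in to its own XML file
    Get-CMSecurityRole | Where-Object { -not $_.IsBuiltIn } | ForEach-Object {
        Export-CMSecurityRole -InputObject $_ -Path "C:\Temp\SecurityRoles\$($_.RoleName).xml"
    }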

After you collect all the XML files for the roles and are ready to import them, use this:
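On the destination hierarchy, with the console cmdlets loaded and the site drive selected, loop through the files (the -XmlFileName parameter is what my console version uses; check Get-Help Import-CMSecurityRole in yours):

    Get-ChildItem 'C:\Temp\SecurityRoles\*.xml' | ForEach-Object {
        Import-CMSecurityRole -XmlFileName $_.FullName
    }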

 

Restoring SMS Registry from the SCCM Site Backup

Well, I had an interesting morning. For the past couple of weeks I have had to repair several site servers where key CCM and SMS registry keys had been deleted. At 1st it appeared that a client repair had gone bad and killed the keys. But this morning it was tracked down to someone running a client repair script incorrectly. They were targeting a remote client but the script was removing local registry keys. However, today it happened on a primary site server and we were looking at a site recovery to fix it. What follows may not be supported, but it worked for me; if you are already looking at a site recovery, worst case is this does not work and you will need to do the recovery anyway.

On this system the script attempted to delete the HKLM\Software\Microsoft\SMS key and all sub keys. Most were still present because the SCCM services and components had them open and the delete failed. But a lot were missing! So we went looking for possible backups. I attempted to load the backup copy of the Software hive from windows\system32\config\regback, but that was unsuccessful. Next I turned to the system backups, but the recovery plan for this server was to rebuild and then restore the application drives, so the OS drive was not backed up. The site recovery was looking more and more like the solution. As I checked the backup from the site maintenance process, the \SiteServer\SMSbkSiteRegSMS.dat file reminded me that the backup includes the HKLM\Software\Microsoft\SMS key. So I took a peek at the DAT file in notepad and sure enough it had the registry info.

After loading the DAT file as a custom hive in regedit, I exported the custom hive and the existing SMS key. (Always remember to back up the registry you are about to change. Got to remember to explain this to the script author 🙂 ) In the reg file for the custom hive I updated the paths so that all of the keys were under HKLM\Software\Microsoft\SMS. After ensuring that all of the SMS services were stopped, the custom hive reg file was imported into the registry. Some checking to ensure things like server names and site codes were correct, the SMS services were restarted, and after celebrating the lack of red in the server logs, the site was declared functional and I snuck off for a nap.
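For reference, the hive-load-and-export steps look roughly like this from an elevated PowerShell prompt (a hedged sketch; the backup path is a placeholder for your site backup location):

    # Load the site backup's registry snapshot as a temporary hive, export it, then unload it
    reg load HKLM\SMSRestore "D:\SCCMBackup\SiteServer\SMSbkSiteRegSMS.dat"
    reg export HKLM\SMSRestore "C:\Temp\SMSRestore.reg" /y
    reg unload HKLM\SMSRestore
    # Edit C:\Temp\SMSRestore.reg so the paths read HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\SMS,
    # stop the SMS services, import the .reg file, then restart the services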