Notes from the field: vCenter cannot validate SSO domain

I came across a peculiar issue when adding a second vCenter to the same SSO domain and enabling Enhanced Linked Mode (ELM).
The first deployment worked like a charm, but the second one errored out with the following error:

It turns out there is a known bug when using an uppercase FQDN in the configuration wizard; the solution is to enter it all in lowercase.
See the following link for reference: https://kb.vmware.com/s/article/56812
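
If you want to double-check how the existing node registered its FQDN (the PNID) before joining the second one, a quick check from the appliance shell looks like this (a minimal sketch, assuming a standard VCSA; the hostname in the wizard example is hypothetical):

# Show the PNID/FQDN as registered on the existing vCenter appliance
/usr/lib/vmware-vmafd/bin/vmafd-cli get-pnid --server-name localhost
# In the deployment wizard of the second node, enter the SSO FQDN in lowercase,
# e.g. vcsa01.corp.local instead of VCSA01.CORP.LOCAL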

Notes from the field: UEM/vIDM integration caveats

Not too long ago I encountered some issues when configuring the UEM and vIDM integration. When providing the vIDM URL in UEM to configure the integration, it would error out with the error below:

After some troubleshooting it appeared that the access policies were not properly configured: the last rule in the default access policy set was blocking access. The resolution was editing the default policy so it ends with the Password authentication method, which is associated with the built-in Workspace IDP. After that, the integration worked as expected.
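
As an illustration, the end result in the default access policy set looked roughly like this (a sketch only, not a copy of the actual tenant configuration; your rule order and methods will differ):

Rules 1..n-1: existing device-type specific rules (e.g. Mobile SSO, Certificate)
Final rule: All device types, all network ranges -> Password (built-in Workspace IDP) as the fallback method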

Another configuration task that caught me by surprise: after the configuration between UEM and vIDM was set up, the following errors occurred:

It turned out that the integration between UEM and vIDM depends on Active Directory integration. The basic system domain accounts (even full admins) won't work in this scenario. The resolution is configuring a domain account with the necessary admin rights in both tenants, after which it works as expected.

Hope this helps!

Notes from the field: vIDM and o365 modern authentication delay

Just a quick-win blog to give a heads-up: when configuring vIDM and O365, you might encounter native clients prompting for authentication and a hefty delay when you flip the authentication for the requested domain from managed to federated with vIDM. This can take up to eight hours! Thanks to the #community #vExpert I got an answer quite fast, because I recalled Laurens van Duijn mentioning in the vExpert Slack group that he had seen this kind of behavior.

So in summary, do it on a Friday and inform your users.

Big shout-out to Laurens van Duijn, and be sure to follow him on Twitter and his blog:

Twitter: @LaurensvanDuijn

Blog: https://vdr.one/

Notes from the field: VMware vCenter /dev/mapper/core_vg-core full

Not too long ago I encountered a vCenter instance blowing up the /dev/mapper/core_vg-core partition with gigabytes of Java dumps. For reference, the customer's setup is a dual SDDC with a vCenter 6.5 U2 at each site and Embedded Linked Mode enabled.
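
To confirm what is filling the partition, a quick check from the appliance shell helps (a minimal sketch; /storage/core is the mount point backed by core_vg-core on the VCSA):

# Check utilisation of the core partition
df -h /storage/core
# List the largest files to confirm it is the Java dumps eating the space
ls -lhS /storage/core | head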

While troubleshooting I came across the following two articles:

https://kb.vmware.com/s/article/2150731

https://kb.vmware.com/s/article/60161

Afterwards I decided to open a support case. This resulted in a session in which support stated they had seen this sort of issue arising in 6.7 U1 and higher, root-caused to virtual hardware level 13 on the appliance in combination with WIA (Integrated Windows Authentication) Active Directory integration.

The customer setup had hardware level 13 deployments on both sites (with only one experiencing the problem) and used Active Directory over LDAP integration.

The resolution of the issue was downgrading the VCSA virtual hardware level to version 10.

Support's way is restoring the VCSA with a VAMI backup/restore; my way was re-registering the appliance with the VMX file downgraded to the level needed, see https://communities.vmware.com/thread/517825
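
For reference, the manual route boils down to something like this (a sketch only; the datastore path and VM name are hypothetical, and the appliance must be powered off and unregistered from the host inventory first):

# Edit the .vmx of the powered-off, unregistered VCSA and lower the hardware level
vi /vmfs/volumes/datastore1/vcsa01/vcsa01.vmx
#   virtualHW.version = "13"  ->  virtualHW.version = "10"
# Register the appliance again and power it on
vim-cmd solo/registervm /vmfs/volumes/datastore1/vcsa01/vcsa01.vmx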

Hope this helps!

Notes from the field: VMware vCloud Usage Meter vROps cleanup not working

If you are ever in the process of cleaning up your vRealize Operations Manager instances and are also using vCloud Usage Meter, you might find yourself in a situation where Usage Meter keeps referencing an old node that has been deleted.

There is a nice explanatory blog available from VMware that resolves most of this: https://blogs.vmware.com/vcloud/2018/01/updating-vrops-instance-vcloud-usage-meter.html

But if the old node is still there in Usage Meter without referencing a vCenter, that blog won't help.

Should this happen, do the following on Usage Meter (a consolidated sketch of the commands follows the list):

  1. Log in to the Usage Meter CLI as root
  2. Run sql to enter the database
  3. Run: select * from "VcopsServer";
  4. Identify the unwanted vROps node in the table and note its 'id' and 'active' values
  5. Run: update "VcopsServer" set "active" = 'f' where id = [id];
  6. Run the query from step 3 again to verify that the server has been deactivated
  7. Restart the Tomcat service with: service tomcat restart
  8. Log back into the Usage Meter web portal
  9. Delete and reactivate the relevant vCenter Server endpoint to refresh the connection
  10. Force a data collection by changing the minute hand on the 'Collections' tab to validate the fix
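
As a consolidated sketch of steps 1 through 7 (the id value of 2 is hypothetical, use the id you noted in step 4; the sql helper is assumed to drop you into a psql prompt):

sql
select * from "VcopsServer";
update "VcopsServer" set "active" = 'f' where id = 2;
select * from "VcopsServer";
\q
service tomcat restart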

You might need to reboot Usage Meter as well, but after that the problem will be resolved.

Hope this helps!

Notes from the field: vCenter VCSA 6.5u2 SEAT cleanup

This is a quick blog to show how a SEAT (Stats, Events, Alarms and Tasks) database failure can be cleared after sporadic growth of the events portion of the SEAT DB in the VCSA. I'll explain the origin of the issue in an upcoming blog, but in a nutshell the 20 GB limit was reached within six days and crashed the vCenter of a secondary site.

SSH into the VCSA, enable the shell, and go to the vpostgres directory to complete the tasks. See the entries below for reference and testing:

shell.set --enabled true
shell

cd /opt/vmware/vpostgres/current/bin

./psql -d VCDB -U postgres

SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
AND C.relkind <> 'i'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;

After these commands you will see the top twenty relations in the database; in my case there were entries of 800 MB to 1 GB+ that needed to be truncated.

For example, you pick out the largest one and truncate that table:

truncate table vc.vpx_event_1 cascade;

Do this as many times as needed until the largest sets of events are gone. I did all of this together with a sev-1 support case engineer, so this is not something to do out of the blue. Hope this helps you out as well.
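
To verify the truncates actually freed up space, a quick before/after check of the SEAT partition does the trick (a minimal sketch; /storage/seat is the default SEAT mount point on the VCSA):

# Check SEAT partition utilisation before and after truncating the event tables
df -h /storage/seat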

For reference, see the following article on vPostgres DB interaction:

https://kb.vmware.com/s/article/2147285

Notes from the field: vSphere 6 NVIDIA vGPU not working

Quite recently I deployed a POC setup for a customer who wanted to leverage NVIDIA vGPU for their XenDesktop environment. Even with all the prerequisites met, the VMs wouldn't boot when testing this on the base build of vSphere 6 (the latest version that could be downloaded from the site) and the dedicated hardware.

After some time troubleshooting, the issue turned out to be in the base build downloaded from the site: it included a hotfix which in turn broke the vGPU support/integration. The resolution was updating the host to the latest patch level. (The host was initially standalone, being prepped for insertion into the customer's cluster; once joined, Update Manager could do the rest.)
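
For a quick sanity check of the host build and the vGPU driver state (a sketch only, assuming the NVIDIA host VIB is already installed):

# Show the running ESXi version and build number
vmware -vl
# Confirm the NVIDIA host driver VIB is present
esxcli software vib list | grep -i nvidia
# Verify the GPU(s) are visible to the driver
nvidia-smi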

For reference see the following articles:
https://kb.vmware.com/s/article/2150498
https://kb.vmware.com/s/article/2143832

Hope this helps!

Notes from the lab: vRealize Log Insight Cluster Upgrade 1-2-3

I must say I'm very impressed by the simplicity and stability of the upgrade process VMware put in place for vRealize Log Insight.

First, a little background on my deployment:
– Three node vRealize Log Insight 4.6.0 cluster
– Integrated Load Balancer (ILB) configured
– vSphere 6.7 as hypervisor platform

I had the deployment running for a while and saw that 4.6.1 was available. I downloaded the upgrade .pak file from myVMware, logged in to my Log Insight cluster address, started the upgrade, and got redirected to the master node to follow the upgrade progress. Simple as that, nothing else to do! Either it works and every node gets rebooted automatically, or it fails and all nodes are rolled back.

For reference I’ve taken some screenshots of the process:

Hope this helps.

Notes from the field: Ghost NIC on VMware

Quite recently I encountered an issue/question from a customer who complained that two virtual machines had ghost NICs attached. Well, it doesn't always have to be hard in our line of work 😊. After a quick look it was clear that there were snapshots in place for those VMs that still referenced old, deleted NICs.

After removing the snapshots, the ghost NICs were no more.

See the following reference screenshot of the ghost NIC and the distributed port group NIC:

 

Hope this helps.

Notes from the field: NetScaler VPX & Intel Xeon Gold

Quite recently I came across an issue when deploying a VPX instance on VMware vSphere 6.5, which turned out to be a bug between the VPX image and the underlying physical hardware.
For reference the following hardware was backing the hypervisor:
Supermicro SYS-2029U-E1CR25M
Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
VMware ESXi, 6.5.0, 7967591 with vSAN
NetScaler VPX 12.0 57.24nc

When deploying the VPX appliance, it gets the default VM hardware version 7, which needs to be upgraded to version 11/13 to support VMXNET3 NIC interfaces. Easily said and done: I configured the setup, booted the appliance, and got stumped with the following error:

Right... that's a beauty. After some troubleshooting and migrating the VM to another host/cluster it started booting. Migrated it back and it crashed. Reimported the appliance and left it at hardware level 7: fine. Upgraded it to 11: still works. Went to 13 again: crash. OK, so this is not a user error :).

I logged a case with Citrix support and after some time got the reply that this relates to a known issue with Intel Xeon Gold processors. The fix should be in the NetScaler 12.1 build, which is expected to release in Q3 this year.

For now the workaround is to keep it at hardware level 11; case# 77116209 can be used as a reference if you need to log the same issue for your setup.
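
To double-check which hardware level the VPX currently runs at, a quick look at the .vmx from the ESXi host shell works (a sketch; the datastore and VM names are hypothetical):

grep -i "virtualHW.version" /vmfs/volumes/datastore1/ns-vpx01/ns-vpx01.vmx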

UPDATE:

With the release of NS 12.1 49.23.nc for NetScaler and NMAS, you can now upgrade to the latest hardware level and everything works as expected. The default dashboard view also no longer shows the 'undefined' error message that used to pop up.