On one of my latest projects, a new Windows Server 2019 deployment on VMware using Storage Replica in a server-to-server setup to replicate home drives and profiles, I came across a random lock-up of the VM and, as a result, inaccessible shares.
The setup was all working until the failover part. There seems to be a delay of some sort: the failover isn't instant, and while it completes the server is unresponsive and drops every form of management to the VM in question (VMware Tools stops responding as well, and console login does not work during this failover window). When I retried the Storage Replica failover, I got a BSOD on the VM stating: HAL INITIALIZATION FAILED.

I had tried all of this in a separate test setup and had it working without any problems on both Server 2016 and Server 2019; only this time did it show this strange behavior. The differences: my own setup uses hardware level 14 while this new one is on hardware level 15, and the new hosts run 6.7 build 13981272 while my own setup runs 6.7 build 14320388 (older builds have also worked fine for me).
After some troubleshooting and providing the BSOD dump findings to VMware GSS support, it became clear that VMware Tools build 10341 was the troublemaker. The solution was to upgrade to the latest build, 10346. The vmm2core tool provided me with the means to create a dump file from the VM in question.
On one of my last projects I needed to convert Hyper-V VMs to VMware. The conversion itself went fine with the offline capability of vCenter Converter, and the migration succeeded. Only afterwards, when trying to install VMware Tools, the installation would hang on starting the VGAuth service and several other dependencies. For reference, the VMs in question are a mixture of 2008 R2 / 2012 R2. After some troubleshooting and searching the knowledge base I stumbled across this article: https://kb.vmware.com/s/article/55798
For this project I didn't have approval to patch the servers (that was out of scope for this one), so the mitigation was to install an older version of VMware Tools (10.2.5 to be exact); afterwards the tools installed fine.
On a side note: when finalizing the converted VM, don't forget to delete the hidden old Hyper-V network adapter, as it can still cause conflicts if not removed.
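A quick way to surface that hidden adapter is the classic Microsoft trick of exposing non-present devices; a sketch for an elevated Windows command prompt (not something this blog's exact steps, just the common approach):

```shell
rem Windows cmd sketch: expose non-present devices in Device Manager so the
rem stale Hyper-V NIC shows up and can be uninstalled.
set devmgr_show_nonpresent_devices=1
start devmgmt.msc
rem In Device Manager: View > Show hidden devices, then uninstall the old NIC.
```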
While doing some lab work I came across an issue where Domain Admin accounts could not register on the manageotp site while Domain Users could. This got me digging.
To use Native OTP on the ADC we need a bind account for Active Directory with the appropriate write permissions on the userParameters attribute of the users.
When we delegate control of just the write permission on userParameters, everything is fine for normal users, but administrator accounts won't work. When we use a service account with full-blown Domain Administrator permissions as the bind account, it works.
After some research I came across an old article which explained the behavior:
Long story short: if a user is also a member of a highly privileged group, the AdminSDHolder protection prevents the delegated write. There is a way to re-enable inheritance on such accounts, but this is generally not recommended as you will open up a whole lot of extra security risks.
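If you want to see which accounts are affected, one way (a sketch for a Windows command prompt with the AD DS tools installed) is to query the adminCount attribute that the SDProp process stamps on protected accounts:

```shell
rem Windows cmd sketch: list accounts stamped by AdminSDHolder/SDProp.
rem Note: adminCount=1 can linger on accounts that have since been removed
rem from protected groups, so treat the output as an indicator, not proof.
dsquery * -filter "(adminCount=1)" -attr sAMAccountName
```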
If it isn't needed, just delegate control of the required permissions; otherwise use a bind account with Domain Admin permissions.
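As a sketch of the delegation itself, it can also be granted with dsacls from an elevated prompt; the OU path and service account below are hypothetical names, so adjust them to your own domain:

```shell
rem Windows cmd sketch: grant the bind account write access (WP) to the
rem userParameters attribute on user objects, inherited down the OU.
rem "OU=Staff,DC=lab,DC=local" and LAB\svc-nsotp are hypothetical names.
dsacls "OU=Staff,DC=lab,DC=local" /I:S /G "LAB\svc-nsotp:WP;userParameters;user"
```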
For some in-depth knowledge of AdminSDHolder and its workings, see the following article:
I came across a peculiar issue when adding a second vCenter to the same SSO domain and enabling ELM.
The first deployment worked like a charm, but the second errored out with the following error:
It turns out there is a known bug when using an uppercase FQDN in the configuration wizard; the solution is to enter it all in lowercase.
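As a quick sanity check before pasting the FQDN into the wizard, you can normalize it to lowercase; the hostname below is a made-up example:

```shell
# Normalize a vCenter FQDN to lowercase before feeding it to the wizard.
# "VCENTER02.Lab.LOCAL" is a hypothetical example name.
echo "VCENTER02.Lab.LOCAL" | tr '[:upper:]' '[:lower:]'
# -> vcenter02.lab.local
```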
See the following link for reference: https://kb.vmware.com/s/article/56812
Not too long ago I encountered some issues when configuring the UEM and vIDM integration. When providing the vIDM URL in UEM to configure the integration, it would error out with the error below:
After some troubleshooting it appeared that the access policies were not properly configured: the last rule in the default access application ruleset was blocking access. The resolution was editing the default policy so that it ends with the Password method, which is associated with the built-in Workspace IDP. After that, the integration part worked as expected.
Another configuration task that caught me by surprise: after the configuration was set up between UEM and vIDM, the following errors occurred:
It turned out that the integration between UEM and vIDM depends on Active Directory integration. The basic System Domain accounts (even full admins) won't work in this scenario. The resolution is configuring a domain account with the necessary admin rights in both tenants; then it works as expected.
Hope this helps!
A quick win blog to mention and give a heads-up: when you are in the process of configuring vIDM and Office 365, you might encounter native clients prompting for authentication, plus a big delay, when you flip the requested domain's authentication from managed to federated with vIDM. This might take up to eight hours! Thanks to the #community #vExpert I got this answer quite fast, because I recalled that Laurens van Duijn had posted something similar in the vExpert Slack group mentioning that he saw this kind of behavior.

In summary: do it on a Friday and inform your users.

Shout-out to Laurens van Duijn, and be sure to follow him on Twitter and his blog.
Not long ago I encountered a vCenter instance blowing up the /dev/mapper/core_vg-core partition with gigabytes of Java dump errors. Just for reference, the customer's setup is a dual SDDC with a vCenter at each site, comprising vCenter 6.5 U2 with embedded linked mode enabled.
While researching this I encountered the following two articles:

I decided to open up a support case. This resulted in a session in which support stated that they had seen this sort of issue arising in 6.7 U1 and higher, root-caused against hardware level 13 for the appliance combined with WIA Active Directory integration. This customer's setup had a hardware level 13 deployment on both sites with only one site experiencing the problem, and it uses Active Directory over LDAP integration.
The resolution of the issue was downgrading the VCSA hardware level to version 10. One way is restoring the VCSA from a VAMI back-up restore; my way was re-registering the appliance with the VMX file downgraded to the level needed, see https://communities.vmware.com/thread/517825
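The VMX edit itself is a one-liner once the file has been copied off the datastore; below is a minimal sketch in which the file name and contents are illustrative stand-ins, with the 13-to-10 downgrade from this case:

```shell
# Sketch: downgrade the virtual hardware level in a copied-down .vmx file.
# "vcsa.vmx" and its contents are a minimal illustrative stand-in for the
# real VCSA .vmx pulled from the datastore.
cat > vcsa.vmx <<'EOF'
.encoding = "UTF-8"
virtualHW.version = "13"
guestOS = "other3xlinux-64"
EOF

# Drop the hardware level from 13 to 10, then verify before re-registering.
sed -i 's/virtualHW.version = "13"/virtualHW.version = "10"/' vcsa.vmx
grep 'virtualHW.version' vcsa.vmx
```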
If you are ever in the process of cleaning up your vRealize Operations Manager instances and are using vCloud Usage Meter as well, you might find yourself in a situation where Usage Meter keeps referencing an old node which has been deleted.
There is a nice explanatory blog available from VMware that resolves most of this: https://blogs.vmware.com/vcloud/2018/01/updating-vrops-instance-vcloud-usage-meter.html
But if you find yourself in the situation where the old node is still present in Usage Meter but no longer referencing a vCenter, this won't help.
Should this happen, we need to do the following on Usage Meter:
- Log in to the Usage Meter CLI as root
- Run sql to enter the DB
- Run: select * from "VcopsServer";
- Identify the unwanted vROps node in the table, and note its 'active' status and 'id' from the associated columns
- Run: update "VcopsServer" set "active" = 'f' where id = [id];
- Run the same select query again to verify that the server has been deactivated
- Restart the Tomcat service with: service tomcat restart
- Log back in to the Usage Meter web portal
- Delete and reactivate the relevant vCenter server endpoint to refresh the connection
- Force a data collection by changing the minute hand on the 'Collections' tab to validate the fix
You might need to reboot Usage Meter as well, but after that the problem will be resolved.
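Put together, the console part of the steps above looks roughly like this; a sketch only, where the id value 3 is hypothetical and must come from your own select output, and which assumes the sql wrapper drops you into a psql-style prompt:

```shell
sql                                                    # enter the Usage Meter DB
select * from "VcopsServer";                           # find the stale vROps row, note its id
update "VcopsServer" set "active" = 'f' where id = 3;  # 3 is a hypothetical id
select * from "VcopsServer";                           # confirm active = f
\q                                                     # leave the DB shell
service tomcat restart                                 # restart Tomcat to pick up the change
```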
Hope this helps!
This is a quick blog to show how a SEAT database failure can be cleared after sporadic growth of the events part of the SEAT DB in the VCSA. I'll explain the origin of the issue in an upcoming blog, but in a nutshell: the 20 GB limit was reached within six days and crashed the vCenter of a secondary site.
SSH into the vCenter VCSA, enable the shell, and then go to the vpostgres directory to complete the tasks. See the entries below for reference and testing:
shell.set --enabled true
./psql -d VCDB -U postgres
SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
AND C.relkind <> 'i'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
After these commands you will see the top twenty entries of the database; in my case there were tables of 800 MB to 1 GB+ that needed to be truncated.
For example, filter out the largest table and truncate it:
truncate table vc.vpx_event_1 cascade;
Do this as many times as needed until the largest sets of events are gone. I did all of this together with a severity-1 support case engineer, so this is not something to do out of the blue. Hope this helps you out as well.
For reference, see the following article on vPostgres DB interaction:
A fun quick fact I encountered when deploying an ADC Gateway GSLB setup for a customer: you only have to enroll once with nFactor/Native OTP on one of the ADCs (when you have an Active Directory domain spanning multiple datacenter sites).
The setup of choice:
- Two ADC appliances in an HA set on each site
- GSLB enabled in active/passive mode for the Gateway across both sites
- Native OTP enabled and active as the way for authentication
- Active Directory Domain across two sites
There is no difference in configuration whatsoever, because the magic of Native OTP depends on Active Directory.
Configure each ADC identically with the nFactor/Native OTP setup, enable GSLB, and you're done. I must admit that at first I thought I would need to enroll at both gateways independently, but happily this is not the case.
For the configuration steps, see common examples such as those below: