Not long ago I encountered a vCenter instance filling up
/dev/mapper/core_vg-core with gigabytes of Java dump errors. For
reference, the customer’s setup is a dual SDDC with a vCenter at each
site, running vCenter 6.5 U2 with embedded linked mode enabled.
mode I’ve encountered the following two articles:
decided to open a support case. This resulted in a session in which support stated that
they had seen this sort of issue arising in 6.7u1 and higher, which was root-caused
against hardware level 13 for the appliance and WIA Active Directory
setup had a hardware level 13 deployment on both sites and only one
experiencing the problem, and using Active Directory over LDAP integration.
resolution of the issue was downgrading the VCSA hardware level to version 10.
way is restoring the VCSA with a VAMI backup restore; my way was to re-register
the appliance with the VMX file downgraded to the level needed, see https://communities.vmware.com/thread/517825
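Downgrading the level in a VMX file generally comes down to the virtualHW.version entry. The snippet below is only a minimal sketch of that text edit, assuming the appliance is powered off and unregistered and that you work on a copy of the file. The sample content and the function name are made up for illustration, and whether your appliance build runs on the lower level is something to confirm with support first.

```python
import re

# Matches a line such as: virtualHW.version = "13"
HW_VERSION_LINE = re.compile(
    r'^([ \t]*virtualHW\.version[ \t]*=[ \t]*)"\d+"[ \t]*$',
    re.IGNORECASE | re.MULTILINE,
)


def set_hardware_version(vmx_text, version):
    """Return the VMX text with virtualHW.version set to the given level."""
    new_text, count = HW_VERSION_LINE.subn(
        lambda m: f'{m.group(1)}"{version}"', vmx_text
    )
    if count != 1:
        # Refuse to guess when the entry is missing or duplicated.
        raise ValueError(f"expected one virtualHW.version line, found {count}")
    return new_text


# Hypothetical VMX fragment, not taken from a real appliance.
SAMPLE_VMX = (
    '.encoding = "UTF-8"\n'
    'config.version = "8"\n'
    'virtualHW.version = "13"\n'
    'displayName = "vcsa-site-a"\n'
)

print(set_hardware_version(SAMPLE_VMX, 10))
```

Keeping the original file next to the edited copy makes it easy to go back if the appliance refuses to boot after re-registering.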
If you are ever in the process of cleaning up your vRealize Operations Manager instances and are using vCloud Usage Meter as well, you might find that Usage Meter keeps referencing an old node that has been deleted.
VMware has a good explanatory blog that resolves most of this: https://blogs.vmware.com/vcloud/2018/01/updating-vrops-instance-vcloud-usage-meter.html
But if the old node is still present in Usage Meter and no longer references a vCenter, that article won’t help.
Should this happen, do the following on Usage Meter:
- Log in to the Usage Meter CLI as root
- Run sql to enter the DB
- Run: select * from "VcopsServer";
- Identify the unwanted vROps node in the table and note its 'active' status and 'id' from the associated columns
- Run: update "VcopsServer" set "active" = 'f' where id = [id];
- Run the same query from step 3 to verify that the server has been deactivated
- Restart the tomcat services with: service tomcat restart
- Log back in to the Usage Meter web portal
- Delete and reactivate the relevant VC server endpoint to refresh the connection
- Force a data collection by changing the minute hand on the ‘Collections’ tab to validate the fix
You might need to reboot Usage Meter as well, but after that the problem should be resolved.
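The deactivate-and-verify part of the steps above can be tried out safely before touching the appliance. The sketch below uses an in-memory SQLite table as a stand-in for the Usage Meter database (which is PostgreSQL). Only the table name and the id and active columns come from the steps above; the hostname column, the sample rows and the function name are assumptions for illustration.

```python
import sqlite3


def deactivate_vrops_node(conn, node_id):
    """Set active = 'f' for one VcopsServer row and return the stored value."""
    cur = conn.execute(
        'UPDATE "VcopsServer" SET "active" = ? WHERE id = ?', ("f", node_id)
    )
    if cur.rowcount != 1:
        # Nothing (or too much) matched: undo and stop instead of guessing.
        conn.rollback()
        raise LookupError(f"expected one row with id={node_id}, got {cur.rowcount}")
    conn.commit()
    # Same idea as re-running the select to verify the change.
    row = conn.execute(
        'SELECT "active" FROM "VcopsServer" WHERE id = ?', (node_id,)
    ).fetchone()
    return row[0]


# Stand-in table with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "VcopsServer" (id INTEGER PRIMARY KEY, hostname TEXT, "active" TEXT)'
)
conn.executemany(
    'INSERT INTO "VcopsServer" VALUES (?, ?, ?)',
    [(1, "vrops-old.example.local", "t"), (2, "vrops-new.example.local", "t")],
)

print(deactivate_vrops_node(conn, 1))
```

Checking that exactly one row matched is the part worth carrying over: a mistyped id then fails loudly instead of silently changing nothing.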
Hope this helps!
This is a quick blog to show how a SEAT database failure can be cleared after sudden growth of the events part of the SEAT DB in VCSA. I’ll explain the origin of the issue in an upcoming blog, but in a nutshell 20 GB was reached within six days, which crashed the vCenter of a secondary site.
SSH into the VCSA, enable the shell, and then go to the vpostgres directory to complete the tasks. See the entries below for reference and testing:
shell.set --enabled true
./psql -d VCDB -U postgres
SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
AND C.relkind <> 'i'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) DESC
LIMIT 20;
After these commands you will see the top twenty entries of the database. In my case there were tables of 800 MB to over 1 GB that needed to be truncated.
For example, pick out the largest table and truncate it:
truncate table vc.vpx_event_1 cascade;
Repeat this until the largest sets of events are gone. I did all of this together with a sev1 support case engineer, so this is not something to do out of the blue. Hope this helps you out as well.
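To keep an overview while repeating this, the size listing can be turned into a review list of truncate statements. The following is a rough sketch that only builds the statements as text and never touches the database. It assumes sizes in the "856 MB" style that pg_size_pretty prints; the sample rows, the threshold and the function names are invented for illustration.

```python
# Multipliers for the units used in pg_size_pretty-style strings.
UNITS = {"bytes": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_pretty_size(text):
    """Convert a string such as '856 MB' into a number of bytes."""
    number, unit = text.split()
    return int(number) * UNITS[unit]


def truncate_candidates(rows, min_bytes, prefix="vc.vpx_event_"):
    """Build truncate statements for event tables at or above min_bytes.

    rows is a list of (relation, pretty_size) pairs; the result is ordered
    largest first and is meant to be reviewed, not executed blindly.
    """
    sized = [
        (relation, parse_pretty_size(size))
        for relation, size in rows
        if relation.startswith(prefix)
    ]
    big = sorted(
        (item for item in sized if item[1] >= min_bytes),
        key=lambda item: item[1],
        reverse=True,
    )
    return [f"truncate table {relation} cascade;" for relation, _ in big]


# Hypothetical listing, shaped like the output of the size query.
SAMPLE_ROWS = [
    ("vc.vpx_event_1", "856 MB"),
    ("vc.vpx_event_arg_1", "1168 MB"),
    ("vc.vpx_vm", "12 MB"),
    ("vc.vpx_event_2", "48 MB"),
]

for statement in truncate_candidates(SAMPLE_ROWS, 500 * 1024**2):
    print(statement)
```

The prefix filter keeps non-event tables such as inventory data off the list, which is the main thing to get right before handing the statements to a support engineer.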
For reference the following article for vPostgres DB interaction:
Quite recently I deployed a POC setup for a customer who wanted to leverage NVIDIA vGPU for their XenDesktop environment. Even though all the prerequisites were met, the VMs wouldn’t boot when testing this on the base build of vSphere 6 (the latest version that could be downloaded from the site) and the dedicated hardware.
After some troubleshooting, the issue turned out to be in the base build of VMware that was downloaded from the site. It included a hotfix which in turn broke the vGPU support/integration. The resolution was updating the host to the latest patch level. (The host was standalone at first, being prepped for insertion into the customer’s cluster; once joined, Update Manager could do the rest.)
For reference see the following articles:
Hope this helps!
I must say I’m very impressed by the simplicity and stability of the upgrade process VMware put in place for vRealize Log Insight.
First a little background on my deployment:
– Three node vRealize Log Insight 4.6.0 cluster
– Integrated Load Balancer (ILB) configured
– vSphere 6.7 as hypervisor platform
I had the deployment running for a while and saw that 4.6.1 was available. Simple as that: I downloaded the upgrade .pak file from myVMware, logged in to my Log Insight cluster address, started the upgrade and got prompted to redirect to the master node for the upgrade progress. Nothing else to do! Either it works and every node gets rebooted automatically, or it fails and rolls back all nodes.
For reference I’ve taken some screenshots of the process:
Hope this helps.
Quite recently I got a question from a customer who complained that two virtual machines had ghost NICs attached. Well, it doesn’t always have to be hard in our line of work :). After a quick look it was clear that there were snapshots in place for those VMs with old, deleted NICs attached.
After removing the snapshots, the ghost NICs were gone.
See the following reference screenshot of the ghost NIC and the distributed port group NIC:
Hope this helps.
Quite recently I came across an issue when deploying a VPX instance on VMware 6.5, which turned out to be a bug between the VPX image and the underlying physical hardware.
For reference the following hardware was backing the hypervisor:
Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
VMware ESXi, 6.5.0, 7967591 with vSAN
NetScaler VPX 12.0 57.24nc
When deploying the VPX appliance it gets the default VM version 7, which needs to be upgraded to VM version 11/13 to support VMXNET3 NICs. Easily said and done: I configured the setup, booted the appliance and got stumped by the following error:
Right.. that’s a beauty.. After some troubleshooting and migrating the appliance to another host/cluster it started booting. Migrated it back and it crashed again.. Reimported the appliance, let it stay at hardware level 7, then upgraded it to 11 and it still worked; went to 13 again and crash.. OK, so this is not a user error :).
Logged a case with Citrix support and after some time got the reply that this relates to a known issue with Intel Xeon Gold processors. The fix should be in the NetScaler 12.1 build, which is expected to be released in Q3 this year.
For now the workaround is to keep it at level 11. As a reference, case# 77116209 can be used to log the same issue if it applies to your setup.
With the release of NS12.1 49.23.nc for NetScaler and NMAS you can now upgrade to the latest hardware level and everything works as expected. The default dashboard view also no longer shows the undefined error message that used to pop up.
We all know it: the once-in-a-while “it’s slow logging on..” that then gets dropped at the escalation desk for a resolution. So I got the call to troubleshoot this issue. Since I knew from previous experience that uberAgent is the troubleshooting tool you want for this, I contacted them and requested the consulting license at https://uberagent.com/ (thanks to Helge Klein), installed Splunk and uberAgent, and got myself a monitoring baseline to work with. A little background on the setup:
- vSphere 6.0
- XenDesktop 7.15 / MCS – Windows 8.1 & Windows Server 2012 R2
- RES WorkspaceManager 10.1.300.1
The problem was that at times users would have a profile initialization of 90 seconds, and at times the user shell would hang.
After two weeks I had my baseline with uberAgent and could see that the issue occurred at random, either very early at the start of the day or just after break time. There was no funny business whatsoever in the environment and no lack of resources, e.g. IOPS or CPU/memory exhaustion. Drilling down into some user trending with uberAgent, I arrived at a somewhat recurring user base that experienced the issue. OK, that helps! After that I could reproduce it with the user accounts in question, which displayed the following screen:
I dropped this in the resrockstars.slack.com group and got a quick reply from Dennis van Dam regarding traceviewer, and came to the following:
This in turn pointed me to the following support article:
Problem resolved and a happy customer! Hope this helps you out as well.
HOWTO: Create a trace file
When I started to rebuild my lab I came across the strangest thing when configuring my NetScalers again. First a little background regarding my setup:
VMware ESXi 6.5u1 Hypervisors
NetScaler VPX 1000 Platinum Appliances
Distributed vSwitches with vlan trunks enabled
Dedicated NSVLAN for management (tagged)
Data transport vlan tagged
Whilst configuring and setting up the first and secondary nodes I left the default appliance import settings intact, that is 2 vCPU and 2 GB of RAM, changed the E1000 NICs to VMXNET3 and upgraded the VM compatibility level to the latest. Nothing wrong here, so I started configuring both appliances with their respective NSIPs. Created the HA set and all was well.
Then it was time to add the second NIC, which I’m going to use for my data transport with all VLAN-tagged interfaces and IPs. I shut down both appliances and configured the NICs accordingly (or so it seemed at the time, it was late :)).
The first node came back flawlessly but the second node wasn’t reachable anymore.. So I opened the hypervisor console and saw error messages regarding the NIC and that the instance had crashed. When I tried to log in with the nsroot account I got nsnet_connect prevents logon… Well OK, that one was familiar to me from switching between E1000 and VMXNET3 devices (I had this when upgrading a customer’s setup, and there it was the VM compatibility level: you need a recent level to be able to use VMXNET3, and the default appliance level isn’t enough), but I had both appliances up to date… I thought what the !%!@% and logged in with the nsrecover username to get to the shell and dig deeper. Thank god that worked, and I was able to run ns_hw_err.bash, which checks for hardware errors. And yes, I instantly got the message that the NIC was not present or reachable. I looked at the configuration of the NICs and had a nice Homer Simpson moment: the NIC in question was still an E1000.. Right… So I turned the appliance off, removed the NIC, re-added it as VMXNET3 with the same MAC and presto, all is well again.
Moral of the story: double-check your network settings when using VMXNET3!
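A quick way to do that double check is to scan the VM’s .vmx file for adapter types before powering on. The snippet below is a small sketch of that idea, assuming each NIC declares its type in an ethernetN.virtualDev entry. The sample content and the function name are hypothetical, and a NIC without a virtualDev entry will not show up in the result.

```python
import re

# Matches lines such as: ethernet1.virtualDev = "e1000"
NIC_LINE = re.compile(
    r'^[ \t]*(ethernet\d+)\.virtualDev[ \t]*=[ \t]*"([^"]*)"',
    re.IGNORECASE | re.MULTILINE,
)


def non_vmxnet3_nics(vmx_text):
    """Return {nic: device_type} for every declared NIC that is not vmxnet3."""
    return {
        name.lower(): device
        for name, device in NIC_LINE.findall(vmx_text)
        if device.lower() != "vmxnet3"
    }


# Hypothetical VMX fragment with one forgotten E1000 adapter.
SAMPLE_VMX = (
    'virtualHW.version = "13"\n'
    'ethernet0.virtualDev = "vmxnet3"\n'
    'ethernet0.present = "TRUE"\n'
    'ethernet1.virtualDev = "e1000"\n'
    'ethernet1.present = "TRUE"\n'
)

print(non_vmxnet3_nics(SAMPLE_VMX))
```

An empty result means every declared adapter is VMXNET3; anything else is worth a second look before the appliance boots.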