We all know it, the once in a while “it’s slow logging on..” and then it gets dropped at the escalation desk for a resolution. So I got the call for troubleshooting this issue. Since I knew from previous experiences that uberAgent is the troubleshooting tool you will want for this I contacted them and requested the consulting license at https://uberagent.com/ (thanks to Helge Klein) did the installation of Splunk / Uberagent and got myself a monitoring baseline to work with. A little background on the setup:
- vSphere 6.0
- XenDesktop 7.15 / MCS – Windows 8.1 & Windows Server 2012 R2
- RES WorkspaceManager 10.1.300.1
The problem was at times users would have a profile initialization of 90 seconds! and at times the user shell would hang..
After a period of two weeks I would have my baseline with uberAgent and filtered out that this would be random very early start of the day or just after break time. No funny business whatsoever in the environment and no lack of resources e.g. iops or cpu/memory exhaustion, drilling down in some user trending with uberAgent I came to a somewhat recurring user base that experienced the issue. Ok! That helps and after that I could reproduce it with the useraccounts in question displaying the following screen:Dropped this in the resrockstars.slack.com group and got a quick reply from Dennis van Dam in regards to traceviewer and came to the following:This in turn pointed me out to the following support article:Problem resolved and a happy customer! Hope this helps you out as well.
HOWTO: Create a trace file
When I started to rebuild my lab I came across the most strangest thing when configuring my NetScaler’s again. First a little background regarding my setup:
VMware ESXi 6.5u1 Hypervisors
NetScaler VPX 1000 Platinum Appliances
Distributed vSwitches with vlan trunks enabled
Dedicated NSVLAN for management (tagged)
Data transport vlan tagged
Whilst configuring and setting op the first and secondary nodes I’ve let the default appliance imports intact, that is 2vcpu and 2gb of ram and changed the E1000 nic’s to VMXNET3 and upgraded the VM compatibility format to the latest level. Nothing wrong here and started configuring both appliances with their NSIP’s respectively. Created the HA set and all was well.
Then it was time to put in the second nic which I’m going to use for my data transport with all vlan tagged interfaces and ip’s. Gave both appliances a shutdown and configured the nic’s accordingly (so it seemed at the time it was late 😊)
First node came back flawlessly but the second node wasn’t reachable anymore.. So put open the hypervisor console and I saw error messages regarding the nic and that the instance had crashed. When I would log in with the nsroot account I would get nsnet_connect prevents logon… Well ok.. that one was familiar to me with in mind the switch of E1000 and VMXNET3 devices (had this when upgrading a customer’s setup and that was the VM compatibility level, because you will need the latest build to be able to use VMXNET3, the default appliance level isn’t enough) but I’ve got both appliances up to date… I thought what the !%!@% and logged in with the nsrecover username to be able to login to the shell and dig in deeper. Thank god that worked and I was able to run the command ns_hw_err.bash which will check for any hardware error. And yes I instantly got the nic not present and reachable message. Looked at the configuration of the nic’s and a nice homer simpson moment the nic in question was still a E1000.. right… so turned it off and removed the nic, re-added it with the same MAC and presto all is well again.
Moral of the story double check your network settings when using VMXNET3!!!!