Friday, May 29, 2009

ESX host disconnected from VirtualCenter

I recently ran into a problem when checking a customer's VMware cluster. One of the ESX servers showed as disconnected in VirtualCenter and would not reconnect. The VI3 setup is fairly simple for this customer, just a two-node DRS cluster with 11 VMs in production.

Doing some research on Google, I found a similar problem/resolution here:
Malaysia VMware Communities

This solution did not work for me but it did get me pointed in the right direction.
  • Ran the command: service vmware-vpxa status --> indicated the service was offline
  • Attempted to restart the service: service vmware-vpxa start --> This failed with the following output - "Another process is already running for this config file"
  • Checked the running processes to find out whether it was hung: ps -elf | grep vpxa --> found the process was running with a D in the state column (uninterruptible sleep, usually disk wait). In this condition a process cannot be killed, even with SIGKILL.
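The process check above can be narrowed to just the stuck processes. This is a minimal sketch, assuming the standard `ps -elf` layout in which the state is the second column ("S"); the helper name filter_d_state is my own and not part of any VMware tooling.

```shell
# Keep the header line plus every row whose state column
# (2nd field of `ps -elf` output) is exactly "D" (uninterruptible sleep).
filter_d_state() {
    awk 'NR == 1 || $2 == "D"'
}

# Typical use on the service console:
#   ps -elf | filter_d_state
```

If vpxa shows up in that output, a kill -9 will not take effect until the blocked I/O call returns, which is why a reboot ended up being necessary here.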
Unfortunately, this meant my fix would incur some downtime for the 5 VMs that were currently running on the disconnected server. The first thing I needed to do was shut down the VMs and migrate them to the other cluster member. I performed the following steps to achieve this:
  • Logged into each VM's guest OS on the affected ESX server and shut it down
  • Logged into VirtualCenter via the VI Client to remove the ESX server and all of its VM guests from the inventory
  • Added the VM guests back into the cluster via the Datastore view. You can do this by right-clicking on the Datastore where the VM resides and selecting "browse datastore". Navigate to the .vmx file of the desired VM, right-click it and select "add to inventory".
  • Once the VM guests were back in the cluster I started them back up. This brought the customer's resources back online.
Now the ESX host issue could be addressed without impacting the production VMs further. Since the process could not be killed, I wanted to reboot the ESX host to see if it would clear the error. I took the following steps:
  • Connected via SSH to the problem ESX host and ran: vmware-cmd -l --> This listed all of the VM guests the ESX host thought it still owned.
  • Ran the command: vmware-cmd -s unregister /"path-to-vmx-found-in-previous-step" --> do this to clear each VM guest
  • Ran the command: vmware-cmd -l --> confirmed all VM associations were gone
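With more than a handful of VMs, the unregister step can be looped instead of typed once per guest. This is only a sketch, assuming vmware-cmd -l prints one .vmx path per line; the unregister_all function and the VMWARE_CMD override variable are names I made up for illustration.

```shell
# Command to call; overridable so the loop can be dry-run with a stub.
VMWARE_CMD="${VMWARE_CMD:-vmware-cmd}"

# Unregister every VM the host still lists, one .vmx path at a time.
unregister_all() {
    "$VMWARE_CMD" -l | while IFS= read -r vmx; do
        [ -n "$vmx" ] || continue          # skip blank lines
        "$VMWARE_CMD" -s unregister "$vmx"
    done
}

# Typical use on the service console:
#   unregister_all
#   vmware-cmd -l    # should now list nothing
```

Only run something like this after the VMs have been re-added to the inventory on the other host, as described above.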
At this point everything was prepared to safely reboot the ESX host.

In my case, the ESX host did not come back up cleanly after the reboot and indicated a corruption problem with initrd. To clear this I ran esxcfg-boot -b from the console and rebooted again. Once the ESX server was back online it could be added back to the cluster via the Inventory view in VirtualCenter, by right-clicking on the cluster name and selecting "add host". With the host back in the cluster I used VMotion to redistribute some of the load back to it.

These steps resolved the VC disconnect issue, but the root cause was not yet addressed. The "disk wait" status of the hung process pointed to either a hardware problem with the disk or controller, or a driver bug. No problem was found with either hardware device. Since applying the latest updates to the ESX host the problem has not recurred.

KB Article 1003409 from the VMware site was also quite useful in diagnosing the disconnect problem.



Thursday, May 28, 2009

Kicking things off...

The following are some definitions of the term "generalist". I think they apply nicely to those of us who bear this label in the IT world.