
I am having a serious problem with a server running ESXi 6.0. Everything worked great last week; now, out of nowhere, the whole thing is basically unusable. I am seeing datastore latency of up to 51 seconds! Nothing has changed between now and last week other than installing some software on a VM.

Datastore Lag

The server is an HP ProLiant DL360 G7 with two hexa-core Xeon X5650 CPUs at 2.67 GHz and 144 GB RAM, with 8x 300 GB HP 10k SAS hard drives in RAID 10.

I have six VMs on the machine, most with thin-provisioned VMDKs. Out of 1.6 TB, I have 600 GB free.

Two of the VMs seem to run fine; the others run like total crap.

I have tried rebooting the server and assigning more resources to the slow VMs (even though they already have plenty), but nothing is working.

Even with every VM powered off, moving the VMs off the server to a storage device on the network gives me spikes in the transfer rate. It moves at 20-30 MB/s for about 20 seconds, then drops to near zero for a few minutes, then climbs back up, in a constant pattern that suggests a bottleneck somewhere.

The same thing happens when I move data between virtual disks inside a powered-on VM. Right now I am trying to transfer a file and it is going at about 200 KB/s. The slow VMs take over 20 minutes to boot and are so slow you can't use them.

Disk transfer rate

I am at a total loss. Any help in resolving this would be much appreciated.

1 Answer


I would suggest that your issue is related to the health of your RAID controller's cache and its battery/flash-backed write cache module. If the controller has disabled the write cache due to a failed battery, for instance, write performance on the array will degrade severely.

There are a couple of ways to check this. Can you specify if this is a standalone host or part of a cluster managed by vCenter?


Edit:

This host does not appear to have the HP-specific version of ESXi installed.

Without this, or the HP add-ons for ESXi, there's no monitoring of the host hardware, nor any of the utilities needed to check system status.

Normally, you can see status graphically like this:

[Screenshots: hardware status as displayed in the vSphere client]

I suspect you have a storage (RAID cache) battery failure, considering the G7 line was introduced in 2011 and the batteries tend to last 3-5 years in production. If this is a used server, that is the likely cause. You should add them from here, here and here.

At the command line, running the following will show your battery's status (other handy commands):

/opt/hp/hpssacli/bin/hpssacli ctrl all show config detail | grep -i battery

Output:

[root@c2-esx1:~] /opt/hp/hpssacli/bin/hpssacli ctrl all show config detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
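If you want to script this check, the same report can be filtered for unhealthy indicators. A minimal sketch — the `check_cache` helper is mine, not HP's, and the sample report text is a made-up example of a failed battery; field wording varies by controller model and firmware, so adjust the patterns to match your actual output:

```shell
#!/bin/sh
# Sketch: flag unhealthy cache/battery indicators in an hpssacli report.
# In practice you would pipe the live command into the helper:
#   /opt/hp/hpssacli/bin/hpssacli ctrl all show config detail | check_cache
# A captured sample is used here so the logic is self-contained.
check_cache() {
    # Print each unhealthy line, stripped of indentation, with a WARNING prefix.
    grep -iE 'Battery/Capacitor Status: *(Failed|Recharging)|Cache Status: *Temporarily Disabled' \
        | sed 's/^ *//; s/^/WARNING: /'
}

# Hypothetical sample output from a controller with a dead battery:
sample='Cache Status: Temporarily Disabled
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: Failed (Replace Batteries)'

result=$(printf '%s\n' "$sample" | check_cache)
echo "$result"
```

A healthy controller produces no WARNING lines, so this drops cleanly into a cron job or monitoring check that alerts only on output.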

If the part is bad, we can force the controller to ignore the battery status using the following. There's risk here: without a working battery, any data sitting in the write cache is lost on power failure, so only do this if you have stable (ideally UPS-backed) power for your equipment:

/opt/hp/hpssacli/bin/hpssacli ctrl slot=0 modify nbwc=enable

This will at least restore performance while you arrange for the part's repair or replacement.
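After forcing the change, it's worth confirming the controller actually re-enabled the cache, and remembering to revert once the battery is replaced. A sketch, assuming `slot=0` matches your controller (check `ctrl all show` for the real slot number) — these commands run against live hardware:

```shell
#!/bin/sh
# Assumed standard install path for the HP utilities on ESXi.
HPSSACLI=/opt/hp/hpssacli/bin/hpssacli

# Confirm the no-battery write cache setting took effect:
$HPSSACLI ctrl slot=0 show detail | grep -i 'no-battery write cache'

# Once a healthy battery/capacitor is back in place, return to the safe
# default so the controller can protect itself on the next battery failure:
$HPSSACLI ctrl slot=0 modify nbwc=disable
```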

ewwhite