9

I am running two Dell R410 servers in the same rack of a data center (behind a load balancer). Both have the same hardware configuration, run Ubuntu 10.04, have the same packages installed, and run the same Java web servers (no other load), yet I'm seeing a substantial performance difference between the two.

The performance difference is most obvious in the average response times of both servers (measured in the Java app itself, without network latencies): one of them is 20-30% faster than the other, very consistently.
I used dstat to check whether there are more context switches, IO, swapping or anything else, but I see no reason for the difference. With the same workload (no swapping, virtually no IO), CPU usage and load are higher on one server.

So the difference appears to be mainly CPU bound, but while a simple CPU benchmark using sysbench (with all other load turned off) did yield a difference, it was only 6%. So maybe it is not only CPU but also memory performance.

So far I've checked:

  • Firmware revisions on all components (identical)
  • BIOS settings (I did a dump using dmidecode, and that showed no differences)
  • I compared /proc/cpuinfo, no difference.
  • I compared the output of cpufreq-info, no difference.
  • Java / JVM Parameters (same version and parameters on both systems)

Also, I completely replaced the RAM some months ago, without any effect.

I am lost. What can I do to figure out what is going on?

UPDATE: Yay! Both servers perform equally now. It was the "power CRAP" settings, as jim_m_somewhere named them in the comments. The BIOS option for "Power Management" was set to "Maximum Performance" on the fast server and to "Active Power Controller" (the default setting from Dell) on the other one. Obviously I forgot that I changed that setting two years ago, and that I didn't do it on all servers. Thanks to all for your very helpful input!

5 Answers

6

Two ideas, depending on how far you want to go with this:

  1. Swap the disks of both servers and see if the performance difference stays with the hardware or moves with the software.

  2. Compare the output of /opt/dell/toolkit/bin/syscfg -o complete-bios-config.out if you can somehow trick this package into installing.

chutz
3

More possibilities to output and diff:

  • sysctl -a (make sure kernel tunables are the same)
  • cat /proc/interrupts (Maybe there is some other piece of hardware messing up?)
  • ipmitool sensor list (long shot, but check for more low level differences, overheating, voltage problems, etc)
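The dumps listed above are easier to compare once they are sorted, since line order can differ between hosts. A minimal sketch, assuming each command's output has been saved to a file on its server and both files copied to one machine; the helper name diff_dumps and the file names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: sort two saved dumps and print only the lines
# that differ between them.
# Usage: diff_dumps fast.txt slow.txt
diff_dumps() {
    a=$(mktemp) || return 2
    b=$(mktemp) || return 2
    sort "$1" > "$a"
    sort "$2" > "$b"
    # diff exits 0 when the sorted dumps are identical, 1 when they differ
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}

# Example (run the first command on each server, then copy both files
# to one machine; file names are placeholders):
#   sysctl -a 2>/dev/null > sysctl.$(hostname).txt
#   diff_dumps sysctl.fast.txt sysctl.slow.txt
```

Expect some noise: the counters in /proc/interrupts and a few volatile sysctl values will differ even between healthy hosts, so look for keys or devices that appear on only one side, or settings that differ persistently.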
3

This sounds like it might be load-balancer related to me. When you say "same workload", how are you measuring this?
Are you directly benchmarking each server by applying a test load in isolation?
Or are you applying some load to the load balancer and looking at the results on both servers?

If you're doing the latter (measuring the load placed on both servers through the load balancer), your load balancer may not be splitting the workload evenly between the servers. A 20% skew for a pair of servers is not uncommon, depending on how your load balancer decides who gets which requests. The server that takes more load would then perform worse.
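One way to check for such a skew is to count the requests each server handled over the same time window and compare the shares. A rough sketch, assuming you can get those counts (for example from each server's access log); the function name request_share is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print each server's share of the total requests.
# Usage: request_share COUNT_A COUNT_B
request_share() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        total = a + b
        # Avoid dividing by zero when neither server saw any requests
        if (total == 0) { print "no requests"; exit 1 }
        printf "A: %.1f%%  B: %.1f%%\n", 100 * a / total, 100 * b / total
    }'
}

# Example with made-up counts:
#   request_share 6000 4000
#   prints "A: 60.0%  B: 40.0%"
```

If the shares are close to 50/50 and the response times still differ, the load balancer is unlikely to be the cause.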

(If you're directly benchmarking each server, in isolation, without using the load balancer as an intermediary, and you've verified that every component is identical (down to manufacturer revisions) between both systems then I'm at a loss -- I can't think of any other measurable reason for this kind of performance difference between otherwise identical servers)

voretaq7
3

Try some profiling tools, either system profiling like perf or Java profiling like VisualVM.

With perf you could profile either the running Java process by PID or profile a benchmark. Look at both systems, see where the slow system is spending its time.

apt-get install linux-tools-common linux-tools

Then something like:

perf record -e cpu-cycles -p <pid>

or

perf record -a -g <benchmark command>

then

perf report

A couple of ideas on how systems can perform differently:

Environment: Is the air temperature or airflow different? Are they in racks? I have seen systems perform differently in different rack positions, caused by vibration. There are different levels of vibration throughout each rack. It's unlikely, considering you said there is almost no I/O being used. But I have seen disks slow down to 2MB/sec sequential writes due to vibration in parts of a rack.

Hardware Faults: Any of the hardware could be faulty. Use the profiling to see what is slow. It could be a bad CPU or chipset, a heatsink not attached properly, out-of-balance fans causing vibration, failed fans, even a bad PSU. Try swapping things that are easy to swap.

Anton Cohen
1

Why has nobody suggested 'sysprof'..?

This is what it was designed for.

Or ummm second thought... try stuffing some limits in /etc/security/limits.conf

Try both.

If you get nothing.... you have a security problem most likely or a physical defect.

see also: My linux server "Number of processes created" and "Context switches" are growing incredibly fast

ArrowInTree