28

Related to: Current wisdom on SQL Server and Hyperthreading

Recently we upgraded our Windows 2008 R2 database server from an X5470 to an X5560. In theory both CPUs have very similar performance; if anything, the X5560 is slightly faster.

However, SQL Server 2008 R2 performance has been pretty bad over the last day or so and CPU usage has been pretty high.

Page life expectancy is massive and we are getting an almost 100% buffer cache hit ratio, so memory is not a problem.
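(For reference, both of those figures come from the buffer manager performance counters; a quick way to check them is a query along these lines -- note the `object_name` prefix differs on named instances:)

```sql
-- Check page life expectancy and buffer cache hit ratio.
-- object_name is 'SQLServer:Buffer Manager' on a default instance,
-- 'MSSQL$<instance>:Buffer Manager' on a named one.
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name IN ('Page life expectancy', 'Buffer cache hit ratio');
```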

When I ran:

SELECT * FROM sys.dm_os_wait_stats 
order by signal_wait_time_ms desc

I got:

wait_type                                                    waiting_tasks_count  wait_time_ms         max_wait_time_ms     signal_wait_time_ms
------------------------------------------------------------ -------------------- -------------------- -------------------- --------------------
XE_TIMER_EVENT                                               115166               2799125790           30165                2799125065
REQUEST_FOR_DEADLOCK_SEARCH                                  559393               2799053973           5180                 2799053973
SOS_SCHEDULER_YIELD                                          152289883            189948844            960                  189756877
CXPACKET                                                     234638389            2383701040           141334               118796827
SLEEP_TASK                                                   170743505            1525669557           1406                 76485386
LATCH_EX                                                     97301008             810738519            1107                 55093884
LOGMGR_QUEUE                                                 16525384             2798527632           20751319             4083713
WRITELOG                                                     16850119             18328365             1193                 2367880
PAGELATCH_EX                                                 13254618             8524515              11263                1670113
ASYNC_NETWORK_IO                                             23954146             6981220              7110                 1475699

(10 row(s) affected)

I also ran

-- Isolate top waits for server instance since last restart or statistics clear
WITH Waits AS (
    SELECT
        wait_type,
        wait_time_ms / 1000. AS [wait_time_s],
        100. * wait_time_ms / SUM(wait_time_ms) OVER() AS [pct],
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS [rn]
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE',
        'SLEEP_TASK','SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR','LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE','REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','BROKER_TO_FLUSH',
        'BROKER_TASK_STOP','CLR_MANUAL_EVENT','CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT','XE_DISPATCHER_WAIT','XE_DISPATCHER_JOIN'))

SELECT W1.wait_type, 
    CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
    CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
    CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2 ON W2.rn <= W1.rn
GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct
HAVING SUM(W2.pct) - W1.pct < 95; -- percentage threshold

And got

wait_type           wait_time_s     pct  running_pct
CXPACKET              554821.66   65.82        65.82
LATCH_EX              184123.16   21.84        87.66
SOS_SCHEDULER_YIELD    37541.17    4.45        92.11
PAGEIOLATCH_SH         19018.53    2.26        94.37
FT_IFTSHC_MUTEX        14306.05    1.70        96.07

That shows huge amounts of time spent synchronizing queries involving parallelism (high CXPACKET). Additionally, anecdotally, many of these problem queries are being executed across multiple cores (we have no MAXDOP hints anywhere in our code).
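(As a rough way to confirm that anecdote, a query along these lines -- my own sketch, not from the original post -- counts the tasks each running request is using; anything with more than one task is executing in parallel:)

```sql
-- Requests currently using more than one task, i.e. running in parallel
SELECT r.session_id, r.request_id, COUNT(*) AS task_count
FROM sys.dm_exec_requests AS r
JOIN sys.dm_os_tasks AS t
    ON t.session_id = r.session_id
   AND t.request_id = r.request_id
GROUP BY r.session_id, r.request_id
HAVING COUNT(*) > 1;
```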

The server has not been under load for more than a day or so. We are experiencing a large amount of variance in query execution times; many queries appear to be slower than they were on our previous DB server, and CPU is really high.

Will disabling Hyperthreading help reduce our CPU usage and increase throughput?

Sam Saffron
  • 2,019

8 Answers

12

I agree that

  • at best the recommendation is "try HyperThreading on your workload and see what happens". We are doing this right now as I type, and.. it ain't good!
  • you should probably always start with HyperThreading disabled, as that is safest

It looks like we should be tuning two things:

  1. MAXDOP (Maximum Degree of Parallelism). Everything I read indicates that leaving this unbounded is probably a bad idea, and the Microsoft documentation says:

    Setting this option [MAXDOP] to a larger value [than 8] often causes unwanted resource consumption and performance degradation.

    Anything higher than 8 is generally not recommended, so I set it to 4 for now; it was zero (unbounded) initially.

  2. Cost Threshold for Parallelism. Apparently the default of 5 is considered quite low according to a few SQL MVP posts I've found -- we can raise it to reduce how often a parallel plan is even attempted by the optimizer.
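For reference, both settings can be changed with `sp_configure` and take effect without a restart. Something like the following reflects the values discussed above (the cost threshold of 25 is just an illustrative value, not a recommendation):

```sql
-- Make the advanced options visible
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cap parallel queries at 4 workers (was 0 = unbounded)
EXEC sp_configure 'max degree of parallelism', 4;

-- Raise the estimated-cost threshold at which the optimizer
-- will even consider a parallel plan (default is 5)
EXEC sp_configure 'cost threshold for parallelism', 25;
RECONFIGURE;
```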

But honestly these feel like workarounds; I think the true solution for our workload (full-text index heavy) is to disable HT.

Jeff Atwood
  • 13,264
10

I still feel that testing your specific workload, as per the original answer, is the only way to be sure. It's not an ideal answer when you're trying to tune a production system (so I'd ask if it was possible to get an identical testbed in systems where both performance and availability really matter) but it's the only one I'm really comfortable with.

We can talk about the theory of whether or not Hyperthreading should hurt or improve things in general (I find it to be more likely to hurt than help on servers so for a "generic" deployment I'd probably disable it), but there is only one way to see for sure if it's going to make a difference in your specific case, and that is try it and see.

Rob Moir
  • 32,154
9

Anandtech found that with a pure read load it hurt a little, and with a write-heavy load it was a bit of a win. I've not seen anything to make me think it will give you a hit much worse than -5%, or a win much better than 15%. Note that with an Atom it is a huge win, but that is a very odd CPU.

Was the CPU all you changed? You went from 12 MB of cache and 4 threads (3 MB of cache per thread) to 8 MB of cache and 8 threads (1 MB per thread). That is oversimplifying, but I bet it is what is killing you: you used to run queries in cache, and now you run them from RAM because they need more than 1 MB but less than 3 MB of cache. Turning off HT will probably help, but I'd go back to the old CPU. Turn HT off and you get 2 MB per thread, but if your workload thrashes with that much cache, it will not help. It may well be that the old 12 MB cache CPU is hugely faster for your workload.

I would try turning HT off and see if that is an improvement, but I suspect that cache is king for your workload, and you may well need to go back to the 12 MB chip.

7

Hyperthreading is, at best, just a way of abstracting task switching away from the operating system and placing it on-die, with direct access to the L1 and L2 cache, which makes task switching a crapload faster.

Testing with VMware has indicated that disabling HT made no discernible difference under standard load, and gave a 5% improvement under heavy load, because ESXi is smart enough to know the difference between the "real" thread and the "fake" thread (there's a lot more to it than that, but that's it in layman's terms). SQL Server 2005 isn't quite that smart, but combined with an up-to-date operating system there should be little advantage to disabling HT.

All that said, I agree with Ronald that it's most likely going to be your L2 cache. A 33% drop in cache size is substantial, and when we spec our SQL Servers we go for cache over raw clock speed every time.

7

Based on my experience, HT was making I/O operations take forever on one of my active nodes on a Windows 2008 R2 cluster (running SQL Server 2008 R2). An interesting fact was that it was reflected neither in the wait stats nor in the pssdiag I ran for Microsoft support.

The way I noticed the low I/O was just by watching the OS counters for physical disk. As Sam pointed out, I wrote about it here and here.

If you do NOT experience I/O problems and are CPU bound I suggest you start this way:

Pinpoint which processes and T-SQL blocks are causing the most CPU utilization. In our experience, after we fixed the problem with I/O (by turning HT off) we identified code that was performing horribly in 2008 R2 and doing fine in 2005. I wrote about it here.

While under high load, run Adam Machanic's sp_whoisactive. You can download it from here. We were experiencing very high CPU utilization due to an excessive number of logical reads (20 million per query) caused by a really bad plan. Our processes were performing anti-semi joins against partitioned tables.
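(A typical invocation under load looks something like this; the two parameters are optional, and support for them varies slightly between sp_WhoIsActive versions:)

```sql
-- Snapshot of currently running work, heaviest CPU consumers first
EXEC sp_WhoIsActive
    @get_plans  = 1,             -- include each session's query plan
    @sort_order = '[CPU] DESC';  -- most CPU-hungry sessions at the top
```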

My next recommendation is to run Profiler to identify the T-SQL statements that are high in both CPU and logical reads.
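(If a Profiler trace is too heavy for the box, the plan cache DMVs give a similar picture. This is a sketch of the standard pattern, not the exact trace we used -- it lists cached statements by total CPU, with logical reads alongside:)

```sql
-- Top cached statements by total CPU time
SELECT TOP (20)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.total_logical_reads,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```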

With the steps above we were able to tune the offending processes and go from 85% sustained CPU utilization to almost nil.

Good Luck and please feel free to drop me a line if you find a fix as I would like to add the case to my blog.

Thanks

Oscar

ozamora
  • 171
2

Whether HT is good or bad is hard to pin down.

In my experience and reading, it really does depend on the server's load pattern: when HT hurts performance, it does so badly; otherwise you don't notice it.

The theory I read was that the two threads share cache, which means that under adverse conditions each thread can overwrite the other's cache entries. If you don't have much parallelism, or your load consists of many short queries, it may not affect you.

I've tried tuning MAXDOP and processor affinity (back in my last real DBA role, on SQL Server 2000) but could never find anything conclusive -- and that was only for my shop at that time.

As a quick test, you can set processor affinity to use only physical cores (the lower numbers) and see what happens.
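(On SQL Server 2008 R2 that quick test might look like the following. A caution on the assumption baked in here: which logical CPUs share a physical core depends on how the BIOS enumerates them, so verify the topology -- e.g. with Sysinternals Coreinfo -- before trusting "the lower numbers":)

```sql
-- Bind SQL Server to a subset of logical processors (2008 R2+ syntax).
-- CAUTION: whether CPUs 0-7 really are one thread per physical core
-- depends on the BIOS enumeration order -- check before relying on it.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 7;

-- Revert to letting SQL Server use all logical processors
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```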

However, you then use at most half your logical cores. Nowadays that may not matter compared to what I was playing with a few years ago, when it was 2 vs 4 or 4 vs 8; now it's 8 vs 16 or 16 vs 32.

Edit: A test by Slava Oks

Glorfindel
  • 1,213
gbn
  • 6,099
2

Unfortunately, I don't think you are going to get any more definitive answer than "try turning hyperthreading off and see if that helps".

Despite the helpful answer from Jonathan in my original thread (which you linked in your question), I was never able to get any definitive evidence about the impact of HT on the specific servers I was investigating. In my case, the servers were already scheduled for replacement, so we simply let those replacements "take care of the issue" so to speak.

My advice:

Try a server-level MAX Degree of Parallelism setting of 1. Parallelism in SQL Server is most useful for larger, longer-running queries, and your load (I assume) consists of a massively high number of smaller queries anyway. This should entirely eliminate CXPACKET waits. It could make certain individual queries run slightly longer, but should allow more total query throughput on the server.

I've had good results doing this on OLTP servers. Other kinds of servers (reporting servers, processing servers, data warehousing) definitely need the MAXDOP set higher.

And just to be clear, this setting would still allow SQL to use multiple threads for each individual table in a JOIN, so you're not really eliminating parallelism entirely.

At least worth a try, since this setting change takes effect immediately and doesn't even require you to restart the SQL service: http://msdn.microsoft.com/en-us/library/ms181007.aspx
This means you could switch it back immediately if things started going to hell.

Turning off hyperthreading in the BIOS would require a full server reboot, so is a bit more risky.

BradC
  • 2,230
0

For the record, we also had unexpectedly bad performance after a server upgrade. It turned out to be due to issues with the BIOS and CPU power saving. The default setting on the server (an HP) was to ignore the OS control of CPU speed and use its own algorithm. Changing this to OS control, and updating the BIOS, resulted in significant improvements. There were some release notes (which I can't find now) describing a BIOS bug that was locking the CPU at its lowest performance state.

https://serverfault.com/a/196329/6390

Mark Sowul
  • 1,859