
I have a SQL Server 2008 query that works on millions of records. The query lives within a proc that is run nightly by a job. When I first put it on the server the query can take a full day to run, but within a week or so it drops down to less than one hour, without any intervention from me. It FIXES ITSELF somehow.

The query runs in tempdb and, before it fixes itself, the performance stats look like this:

- CXPACKET: 20,700 sec (66% of wait time)
- PAGEIOLATCH_SH: 2,500 sec (8% of wait time)
- LOGBUFFER: 1,500 sec (5% of wait time)
- IO_COMPLETION: 1,500 sec (5% of wait time)

I tried tuning indexes, etc., and the stats above are somewhat of an improvement over my first run, when CXPACKET was 77% of wait time. I read troubleshooting tips saying I should split tempdb into one file per CPU. I have a dual-CPU, 32-bit Windows Server 2008 system, so I split tempdb into two files and greatly increased their size to 150 GB each with 10% autogrow. They aren't growing, so I think the size is sufficient.
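Roughly what the split looked like (the logical file name `tempdev` is the default; the path, second file name, and sizes here are placeholders, not my actual values):

```sql
-- Pre-size the existing tempdb data file and add a second one,
-- so tempdb work spreads across two files (one per CPU).
-- File path and sizes below are illustrative placeholders.
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 150GB, FILEGROWTH = 10%);

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'D:\SQLData\tempdb2.ndf',
          SIZE = 150GB, FILEGROWTH = 10%);
```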

When I watched the server while the query was running, I could see that the CPUs were NOT pinned; they were hovering below 10% of capacity. What was pinned was disk I/O. The machine has a single disk.

Without further ado, here are the two queries causing the trouble (the first query used to be a subquery of the second; see the explanation below):

insert into #ttcbm (tradeId1, tradeId2)
select distinct tp.tradeId tradeId1, tp1.tradeId tradeId2
from #tradeP tp
join #tradeP tp1
    on tp.cmbId = tp1.cmbId
    and tp.qs_plyrid = tp1.qs_plyrid
    and tp.tradeId > tp1.tradeId
OPTION (MAXDOP 1);


insert into #mergeT (tradeId1, tradeId2)
select distinct tp.tradeId tradeId1, tp1.tradeId tradeId2
from #tradeP tp
join #tradeP tp1
    on tp.cmbId = tp1.cmbId
    and tp.tradeId > tp1.tradeId
left join #ttcbm x
    on tp.tradeId = x.tradeId1
    and tp1.tradeId = x.tradeId2
where x.tradeId1 is null
    and x.tradeId2 is null
OPTION (MAXDOP 1);

I added MAXDOP 1 per a troubleshooting tip I read saying that CXPACKET is caused by parallelism. Perhaps it did help drive down my waits a bit, but nothing like the improvement that happens when the query fixes itself, i.e., from 24 hours to less than 1 hour.

The #ttcbm table has a PK of (tradeId1, tradeId2) and #tradeP has a PK of (cmbId, qs_plyrid, tradeId); both tables have row counts on the order of 100K to 500K. #ttcbm used to be a subquery of the 'insert into #mergeT' query, but I separated it out after reading that splitting up complicated queries can improve performance when parallelism is a problem.

atommc

6 Answers


There is a lot of misunderstanding about CXPACKET. CXPACKET isn't the cause of your problems; it is a side effect. What CXPACKET means is that one thread of a parallel query is waiting for another thread of that query to do something. Seeing a lot of CXPACKET waits doesn't mean there's a problem with the query itself; it means there's a problem somewhere else. When you see CXPACKET waits, you need to look at the other threads of the SPID and see what they are waiting on besides CXPACKET. That other wait is the problem.
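One way to see what those other threads are waiting on is the standard waiting-tasks DMV; a sketch (the session_id 55 is a placeholder for the SPID running your job):

```sql
-- Show per-task waits for a given session, so you can see what the
-- non-CXPACKET threads of the parallel query are actually waiting on.
-- Replace 55 with the SPID of your job.
SELECT wt.session_id,
       wt.exec_context_id,      -- 0 = coordinator thread, >0 = worker threads
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.session_id = 55
ORDER BY wt.wait_duration_ms DESC;
```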

To your specific issue: the run times are probably so erratic because SQL Server generates a different plan on some days, when the statistics are out of date (making the job run long). Then you update statistics manually (or via a job), or auto-update stats kicks in, the plan gets better, and the job runs quickly again.
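If that's the cause, refreshing statistics right before the big joins run should stabilise the plan; for example (table names taken from the question, FULLSCAN is optional but gives the most accurate stats):

```sql
-- Refresh statistics on the temp tables after they are populated,
-- so the optimiser has accurate row counts before the joins run.
UPDATE STATISTICS #tradeP WITH FULLSCAN;
UPDATE STATISTICS #ttcbm WITH FULLSCAN;

-- Or refresh everything in the database (coarser, but easy to schedule):
EXEC sp_updatestats;
```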

Once you've solved the stats problem you can start looking at other reasons the job is running slow. The fact that you only have a single disk certainly isn't helping.

mrdenny

Two things can cause the same query to improve over time:

  1. The data is changing
  2. The query plan is changing

The data you can't do much about. To verify that the query plan is changing:

  1. Run the query manually at a time when you know it runs at its fastest, and SAVE that (fast) query plan.
  2. Run the same query at a time when you know it will be slow, and save that (slow) plan.
  3. Compare the two plans. If they are different, you can force your query to use the saved fast plan: http://technet.microsoft.com/en-us/library/cc917694.aspx
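A sketch of that workflow using the `USE PLAN` query hint (the hint takes the saved XML showplan as a string literal; the XML body below is elided, and the query is just the one from the question):

```sql
-- 1. Capture the plan while the query is running fast:
SET STATISTICS XML ON;
-- ... run the query here and save the XML showplan it returns ...
SET STATISTICS XML OFF;

-- 2. Later, force the saved plan by pasting the XML into the hint:
select distinct tp.tradeId, tp1.tradeId
from #tradeP tp
join #tradeP tp1
    on tp.cmbId = tp1.cmbId
    and tp.qs_plyrid = tp1.qs_plyrid
    and tp.tradeId > tp1.tradeId
OPTION (USE PLAN N'<ShowPlanXML ...> ... </ShowPlanXML>');
```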

Jimbo

Chances are the CXPACKET waits are disproportionately represented because you have a lot of threads waiting on something that is I/O-bound. You can check this by setting MAXDOP 1 and re-running the query; see whether the proportion of wait time from PAGEIOLATCH waits increases significantly.

If your PAGEIOLATCH waits are a large proportion of the wait time after checking this it might indicate that your query is I/O bound.
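A sketch of that before/after check (note that clearing the cumulative wait stats affects the whole instance, so only do this on a box you control):

```sql
-- Reset the instance-wide cumulative wait stats.
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);

-- ... run the problem query with OPTION (MAXDOP 1) here ...

-- Then look at the wait-type breakdown for just that run:
SELECT TOP (10)
       wait_type,
       wait_time_ms,
       100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_total
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;
```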

Your query could be righting itself because something changes the query plan; the system takes pressure on resources such as memory into account when choosing an appropriate plan. The best way to check for this is to run Profiler and capture the actual execution plan from the query. If you have the option, set the application name on the connection (Application Name=foobar in the connection string) and filter on that in the trace.

If you can see what's slow, then you may be able to tune the query or index one of the tables.

ConcernedOfTunbridgeWells

Did you consider placing an index on those temp tables after they were created?

This might be a good starting point, order of the fields would depend on the value distribution:

CREATE CLUSTERED INDEX IX_tradeP ON #tradeP(cmbId, qs_plyrid, tradeId);

CXPACKET should not be on your worry-list at the moment. Most of the time it's the synchronising thread waiting for skewed worker threads to finish. The trouble starts when the optimiser expects the load to be evenly spread over worker threads, but in the actual execution one thread gets to do all the work.

András Váczi
mceda

Totally figured this out. I got rid of the clustered primary keys on every table involved in the issue and replaced each PK with a regular index added after the insert. The problem was inserting hundreds of millions of rows into a table with a clustered primary key; SQL Server couldn't cope with that for some reason. Once I removed the clustered PK, the insert took only minutes and my log didn't blow up by 100 GB; it was just a straight dump of data onto the heap. Creating the regular (nonclustered) index after the insert restored query performance without creating a locking situation.
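The pattern, roughly (table and column names from the queries above; the INT column types are assumptions):

```sql
-- Create the temp table as a heap (no clustered PK), do the bulk insert,
-- then add a nonclustered index only after the data is in place.
CREATE TABLE #ttcbm (tradeId1 INT NOT NULL, tradeId2 INT NOT NULL);

INSERT INTO #ttcbm (tradeId1, tradeId2)
SELECT DISTINCT tp.tradeId, tp1.tradeId
FROM #tradeP tp
JOIN #tradeP tp1
    ON tp.cmbId = tp1.cmbId
    AND tp.qs_plyrid = tp1.qs_plyrid
    AND tp.tradeId > tp1.tradeId;

-- Index added after the insert, so the load itself is a straight heap dump.
CREATE NONCLUSTERED INDEX IX_ttcbm ON #ttcbm (tradeId1, tradeId2);
```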

atommc

This problem continues to pop up now and again. I think it is related to the size I allocate for the db and log. My DB is >100 GB and I let it grow. I think what is happening is that when autogrow kicks in, the process hangs. I'm going to pre-allocate a huge size for tempdb, the temp log, and the db/log and see if that works.
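Pre-sizing so autogrow never fires mid-job would look something like this (the database name `MyDb`, the logical file names, and the sizes are all placeholders):

```sql
-- Pre-allocate data and log space up front so the nightly job never
-- stalls waiting on an autogrow. Names and sizes are placeholders.
ALTER DATABASE MyDb
MODIFY FILE (NAME = MyDb_data, SIZE = 200GB);

ALTER DATABASE MyDb
MODIFY FILE (NAME = MyDb_log, SIZE = 50GB);

-- tempdb likewise (tempdev is the default logical name of its data file):
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 150GB);
```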

Tom McDonald