Concatenating rows into a single string query running for 5 hrs and counting

Question

I have a table with 2.6m records. It looks like this:

email                           prject_name
rafael.nadal@xyz.com              lab1
rafael.nadal@xyz.com              lab2
rafael.nadal@xyz.com              lab3
TEST@TEST.COM                     shift1
TEST@TEST.COM                     shift2

But I want my table to look like this:

email                     project_name
rafael.nadal@xyz.com     lab1, lab2, lab3
TEST@TEST.COM            shift1, shift2

I have used this query:

select distinct email ,
STUFF((Select ','+project_name
from dbo.[UMG sent 2016] as  T1
where T1.email=T2.email
FOR XML PATH('')),1,1,'') from dbo.[UMG sent 2016] as T2;

It has been running for 5 hours already.
How do I speed up the process?

Martin Smith · Answer 1 · 2016-06-07T22:35:32.010

As you don't care about the order of the concatenated items, it would be quite easy to knock up a custom CLR aggregate to do this, and it would likely outperform the XML method. There is an example of one in this article.

There is a quick and easy change you can make to your existing code though.

Instead of

SELECT DISTINCT email,
                STUFF((SELECT ',' + project_name
                       FROM   dbo.[UMG sent 2016] AS T1
                       WHERE  T1.email = T2.email
                       FOR XML PATH('')), 1, 1, '')
FROM   dbo.[UMG sent 2016] AS T2;

You could use

SELECT email,
       STUFF((SELECT ',' + project_name
              FROM   dbo.[UMG sent 2016] AS T1
              WHERE  T1.email = T2.email
              FOR XML PATH('')), 1, 1, '')
FROM   dbo.[UMG sent 2016] AS T2
GROUP  BY email;

The difference is that the first one calculates the concatenated string for every row in [UMG sent 2016] and then removes duplicate (email, string) pairs. The second one finds the distinct email values first and then performs the string concatenation work only for those distinct values. So in your example data, instead of performing the work 5 times (3 times for Nadal and twice for TEST) and then throwing away three of the results, it will perform the work just 2 times, once for each email.
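To try the GROUP BY form on a small scale, here is a minimal sketch that loads the five sample rows from the question into a table variable. The @demo table and the project_names alias are hypothetical names used only for illustration, and the comment about evaluation order describes the plan you would typically expect, not a guarantee.

```sql
-- Hypothetical demo table holding the five sample rows from the question.
DECLARE @demo TABLE (email VARCHAR(100), project_name VARCHAR(50));

INSERT INTO @demo (email, project_name)
VALUES ('rafael.nadal@xyz.com', 'lab1'),
       ('rafael.nadal@xyz.com', 'lab2'),
       ('rafael.nadal@xyz.com', 'lab3'),
       ('TEST@TEST.COM',        'shift1'),
       ('TEST@TEST.COM',        'shift2');

-- GROUP BY reduces the outer rows to one per email, so the correlated
-- subquery is expected to run once per distinct email (2 times here)
-- rather than once per source row (5 times).
SELECT T2.email,
       STUFF((SELECT ',' + T1.project_name
              FROM   @demo AS T1
              WHERE  T1.email = T2.email
              FOR XML PATH('')), 1, 1, '') AS project_names
FROM   @demo AS T2
GROUP  BY T2.email;
```

This returns one row per email. Because the subquery has no ORDER BY, the order of the items within each concatenated list is not guaranteed.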

wBob · Answer 2 · 2016-06-07T21:07:19.340

That STUFF FOR XML PATH string concatenation technique sure is cute, but it does not scale very well and across millions of rows it is probably not a very good idea. For larger tables, you may have to write some good old-fashioned procedural SQL with a loop, something like this:

-- Create the working table ...
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp

SELECT ROW_NUMBER() OVER( PARTITION BY email ORDER BY prject_name ) rowId, email, CAST( prject_name AS VARCHAR(500 ) ) prject_name
INTO #tmp
FROM dbo.[UMG sent 2016]
GO

-- Index temp table
CREATE UNIQUE CLUSTERED INDEX _cdx ON #tmp ( rowId, email )
GO

SELECT TOP 100 'before' s, *
FROM #tmp
ORDER BY email


-- Loop through appending the projects
DECLARE @n INT = 1

WHILE @@ROWCOUNT != 0
BEGIN

    IF @n > 99 BEGIN RAISERROR( 'Too many loops!', 16, 1 ) BREAK END    -- Loop safety
    SET @n += 1

    UPDATE t
    SET t.prject_name = CONCAT( t.prject_name, ', ', s.prject_name )
    FROM #tmp t
        INNER JOIN #tmp s ON t.email = s.email
    WHERE t.rowId = 1
      AND s.rowId = @n

END
GO

SELECT TOP 100 'after' s, *
FROM #tmp
WHERE rowId = 1
ORDER BY email

The concatenated result all ends up in 'bucket 1', that is, the row with rowId = 1 for each email. In my simple repro, with 2.6 million records and between 1 and 26 projects per email, this script ran in a few minutes. Full repro script here.

Please bear in mind that this pattern is optimized for large tables with relatively few items to concatenate per email. It also relies on the email/project combinations being unique, hence the primary key in my repro. There will be a tipping point where the STUFF technique is faster. There are also other techniques, such as CLR or even a cursor, which might suit better depending on the distribution of your data.
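If you are not sure whether the email/project combinations are unique, a quick check along these lines would show any repeats. This is only a sketch: it assumes the table and column names used in the script above, and the copies alias is made up for illustration.

```sql
-- List email/project pairs that occur more than once.
-- An empty result means every combination is unique.
SELECT email,
       prject_name,
       COUNT(*) AS copies
FROM   dbo.[UMG sent 2016]
GROUP  BY email, prject_name
HAVING COUNT(*) > 1;
```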

Finally, can you please tell me more about your data so I can tweak my repro? For example, on average how many projects does each email have and what does the distribution look like?
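One way to answer that is a query like the following, which counts how many emails have 1 project, 2 projects, and so on. It is a sketch that assumes the table and column names used in the script above; the aliases are illustrative only.

```sql
-- Distribution of the number of projects per email.
SELECT c.projects_per_email,
       COUNT(*) AS email_count
FROM  (SELECT email,
              COUNT(*) AS projects_per_email
       FROM   dbo.[UMG sent 2016]
       GROUP  BY email) AS c
GROUP  BY c.projects_per_email
ORDER  BY c.projects_per_email;
```

The largest projects_per_email value also tells you how many passes the loop above needs, and whether the safety limit of 99 is high enough for your data.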

ubergeek · Answer 3 · 2018-02-23T21:33

I know this is an old post, but you almost certainly just need to add an index on email that includes prject_name:

CREATE INDEX ixEmail ON dbo.[UMG sent 2016] (email) INCLUDE (prject_name);
