Is there a SQL Server implementation of the Longest Common Substring problem?

Question

Is there a SQL Server implementation of the Longest Common Substring problem? A solution that checks with all rows of a column in SQL Server? I have seen solutions that take two strings as input, but no SQL Server solution that looks at all rows of a column in a table.

I did try a few things, but to be honest I think a solution goes over my head at the moment, so any suggestions are welcome.

There is no "real world" problem here, I'm just looking at programming problems and how they could be solved with SQL Server.

score 7 · Accepted Answer · edited Apr 13 '17 at 12:42

This can be done ~~rather easily~~ as a SQLCLR User-Defined Aggregate (UDA). An aggregate operates over a set of rows, so you would be able to operate over all rows or just a subset, based on a WHERE condition and optional GROUP BY (if wanting to operate over separate sets of rows).

BUT, whether or not you should do this depends on what you are planning on doing with the result. If this is a one-off project to do some research that won't be repeated, then it is probably best to just make a small Console App to read in the rows and process accordingly.

However, if you do have some need to use the returned value within a database-centered process, then SQLCLR should be fine (assuming you follow the two recommendations mentioned under "The following tricks can be used to reduce the memory usage of an implementation" in the Pseudocode section. You will just need to find a "creative" way of dealing with the situation of having multiple results for what is considered the longest common substring (i.e. if 2 or more common substrings tie for "first place"). Using the example from the Wikipedia page (which shows 2 matches for the 2 strings):

ABAB
BABA

Returns both:

BAB
ABA

Perhaps returning an XML document of matches since that is parsable and can contain any string (when properly escaped).

UPDATE (updated, and updated again)

The .NET C# source code for this can be found on Pastebin.com at:

SQLCLR UDA for Longest Common Substring - Source Code

And for anyone that wants to play with this UDA without compiling it, an installation T-SQL script (no external DLL) can be found on Pastebin.com at:

SQLCLR UDA for Longest Common Substring - Installer

The UDA should be fairly memory-efficient as it only stores substrings that match the current string and all previously encountered strings. As a new row calls the UDA, any substrings that are not found in the new string are removed from the collection.

It is also CPU-efficient in that, if at any point the number of "common" substrings goes to zero, it sets a flag indicating that no substrings are even possible and short-cicruits all future executions to simply exit upon being called. This happens immediately if an empty string is encountered. In any of these cases, an empty XML document (i.e. root element only) is returned. The meaning of the empty document is equivalent to an empty string, since the only thing the input strings have in common is that they were non-NULL strings.

NULL, in my interpretation, is ignored and does not indicate no possible matches like an empty string does.

A NULL is returned in the following two cases:

All inputs are NULL
Only a single non-NULL row was in the set and hence there was no other string to compare to, hence nothing to be considered "common".

I added a second input parameter which controls whether or not the return is just the longest common substrings or all common substrings. In the case of returning all, an attribute is added to each "item" to indicate whether or not it is one of the "longest" substrings:

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test1a]
FROM   (VALUES (N'ABAB'), (N'BABA')) tab(col);

Returns:

<Items Merged="False">
  <Item>ABA</Item>
  <Item>BAB</Item>
</Items>

And

SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test1b]
FROM   (VALUES (N'ABAB'), (N'BABA')) tab(col);

Returns:

<Items Merged="False">
  <Item IsLongest="True">ABA</Item>
  <Item IsLongest="True">BAB</Item>
  <Item IsLongest="False">AB</Item>
  <Item IsLongest="False">BA</Item>
  <Item IsLongest="False">A</Item>
  <Item IsLongest="False">B</Item>
</Items>

Also, the comparisons are now case-InSensitive to match the typical Collation.

Below are 16 more test cases that check functionality only, not performance. I will post additional tests later that operate over many rows of much longer strings. I intentionally left out Combining Characters and Supplementary Characters, for now, as they are bit a more complicated.

SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test2]
FROM   (VALUES (N'ABAB'), (N'BABA'), (N'2BAB5')) tab(col);
-- <Items><Item>BAB</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test3]
FROM   (VALUES (N'ABAB'), (N'BABA'), (NULL), (N'2BAB5')) tab(col);
-- <Items><Item>BAB</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test4]
FROM   (VALUES (NULL), (NULL), (NULL)) tab(col);
-- NULL

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test5]
FROM   (VALUES (N'ABAB'), (N'BABA'), (N''), (N'2BAB5')) tab(col);
-- <Items />

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test6]
FROM   (VALUES (N'ABAB'), (N'BABA'), (N'L'), (N'2BAB5')) tab(col);
-- <Items />

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test7]
FROM   (VALUES (N'ABAB')) tab(col);
-- NULL

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8a-DuplicatesAcross2Rows]
FROM   (VALUES (N'ABAB'), (N'ABAB')) tab(col);
-- <Items><Item>ABAB</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8b-DuplicatesAcross3Rows]
FROM   (VALUES (N'ABAB'), (N'ABAB'), (N'ABAB')) tab(col);
-- <Items><Item>ABAB</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8c-DuplicatesAcross4Rows]
FROM   (VALUES (N'ABAB'), (N'ABAB'), (N'ABAB'), (N'ABAB')) tab(col);
-- <Items Merged="False"><Item>ABAB</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test9-DuplicatesWithinOneString]
FROM   (VALUES (N'ABAB'), (N'zABABh2348923ABABf')) tab(col);
-- <Items Merged="False"><Item>ABAB</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test10-XmlEncodableCharacters]
FROM   (VALUES (N'ABA&B'), (N'zABA&Bh2348923ABA&Bf')) tab(col);
-- <Items Merged="False"><Item>ABA&amp;B</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11a-FinalMatchesShorterThanInitialSet]
FROM   (VALUES (N'ABCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'512tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11b-FinalMatchesShorterThanInitialSet]
FROM   (VALUES (N'BCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'512tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11c-FinalMatchesShorterThanInitialSet]
FROM   (VALUES (N'ABCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'5123tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item><Item>123</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11d-FinalMatchesShorterThanInitialSet]
FROM   (VALUES (N'BCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'5123tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item><Item>123</Item></Items>

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11e-FinalMatchesShorterThanInitialSet]
FROM   (VALUES (N'BCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'123tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item><Item>123</Item></Items>


SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test12-CaseInSensivity]
FROM   (VALUES (N'AbAB'), (N'BAbA')) tab(col);
/*
<Items Merged="False">
  <Item IsLongest="True">AbA</Item>
  <Item IsLongest="True">bAB</Item>
  <Item IsLongest="False">Ab</Item>
  <Item IsLongest="False">bA</Item>
  <Item IsLongest="False">A</Item>
  <Item IsLongest="False">b</Item>
</Items>
*/

Final Update (hopefully)

I made a couple of minor changes to the test code provided in @MisterMagoo's answer so that it would: a) return ties for "longest" common substring, and b) include the last character in each string as part of the search. I also changed, slightly, how the initial test rows are gathered such that there would be just over 1 million rows, and re-ran the tests. The results were that the T-SQL version took 1 minute and 13 seconds while the SQLCLR UDA took only 12 seconds (and even has the option of returning all common substrings, not just the longest).

I then modified the test data to include another common substring of the same length as the current winner (7 characters), a shorter but still common substring of 4 characters, and 3 random characters. This increased the max test string size from 32 characters to 46 characters. Running the test again (same code), the T-SQL version had to be killed at 23 minutes and had only tested the first 3 lengths: 27, 26, and 25. The SQLCLR UDA returned in about 1 minute and 10 seconds.

Is it time to celebrate? Has SQLCLR saved the day? Hold on..

On a hunch, I decided to see if the optimization I added to the SQLCLR version would help the T-SQL, namely:

Grab the shortest two strings (shortest would have the fewest number of possible substrings and would be the longest possible match anyway).
Extract all possible common substrings from the two short strings (any substrings that are "common" across all rows necessarily must be in the set derived from just these two strings, and no new substrings from other rows can be introduced as they wouldn't be "common").
Starting with the longest common substring, test to see if it is found in all rows. I used IF (NOT EXISTS (WHERE CHARINDEX(substring, test_row) > 0)) because the EXISTS clause will exit on the first row that returns a 0 (meaning substring not present) and hence not need to test all rows. This part is accomplished with a CURSOR (shh, don't tell anyone) because it allows for starting at the top of the list and just picking each new row off without needing to re-scan the list each time to find the next entry.
Once the first substring is found, save its length in a variable, and save the substring itself into a table variable.
Keep testing, but only for strings that are the same length as the first common substring found. as soon as the length of the next substring to search for is less than the length of the first substring to match, exit the loop and clean up the cursor.

All of this cursor stuff must make it painfully slow, right? Well, keeping in mind that the prior T-SQL version would have taken several hours to finish, and the SQLCLR UDA took 1 minute and 10 seconds, you might guess this updated version would take, what? A few minutes? 10 minutes? More? After all, just mentioning "cursor" is an automatic 5 minute hit, right? The actual time for the modified version:

22 seconds!!!

Of course, that is largely due to the substrings being on the longer side (compared to the size of the strings) and so the loop was able to exit earlier than if the longest common substring was only 3 or 4 characters long (i.e. it had less to test, but that is still a valid case and is not cheating). Also, one benefit of working in T-SQL is that you can test the entire set against a single value, whereas SQLCLR only has the current row and cannot see the entire set, hence the SQLCLR cannot short-circuit upon finding the longest common substring (because it won't know what is "common" in the first place until it has been executed across all rows).

A final change was to allow the new T-SQL version to return all "common" substrings, not just the longest ones, while indicating which ones were the longest in a second column of datatype BIT. When returning all substrings, mirroring the functionality of the SQLCLR UDA (even when the UDA only returns the longest common substrings, it still has the full list of all common substrings stored since it, again, has no ability to short-circuit), the T-SQL version returns in 2 minutes and 41 seconds.

So, there are conditions were the T-SQL can be faster, even at 1.2 million rows. But the performance is far more stable for the SQLCLR version, and definitely faster when you want all common substrings.

The test script, containing all 3 tests along with their results, can be found on Pastebin at:

SQLCLR UDA for Longest Common Substring - Testing

P.S. Even though the test was done over 1.2 million rows, the longest string tested was 46 characters. I am not sure how the performance of either approach would be affected by operating over strings that are much longer. That testing will have to wait until there is time to do it ;-).

Mister Magoo · Answer 2 · 2016-03-12T01:02:26.310

Solomon is probably right, but until a CLR solution shows up, here's a T-SQL version to play with.

/* Create some test data : on SQL 2016 this creates about 470K rows to test (with duplicates) */

 if object_id('tempdb..#Strings') is not null  drop table #Strings

   select a.Name as String, len(a.Name)+1 as StringLength
   into #Strings
   from sys.all_columns a, sys.all_columns b
   where a.Name like '%refer%';

 set nocount on;

 /* Any nulls mean there is not a "longest common string" */
 if exists(select 1 from #Strings where String is null)
 begin
   return;
 end;

/* We need to know number of rows in the sample and the length of the shortest string
   as the longest common substring cannot be longer than the shortest string in the set */

 declare @totalrows int;
 declare @minlen tinyint;
 declare @result varchar(50);

 select @minlen = min(StringLength-1), @totalrows = count(distinct String) from #Strings;

 raiserror(N'Maximum Possible Length: %d Total Distinct Rows: %d',0,0,@minlen,@totalrows) with nowait;

/* Check backwards from the longest possible string to the shortest and break when we find a match */
/* You might want to check the air conditioner is switched on here */

 while @minlen>1 and @result is null
 begin
   raiserror(N'Processing strings of length: %d',0,0,@minlen) with nowait;

   /* this method is "brute force" 
      1. find all substrings for each input string
      2. pick the first substring that appears in every input string
         we find this by grouping, counting and comparing to the number of input strings
   */
   select top(1) @result=match
   from (
     select String, substring(String, T.N, @minlen) match
     from #Strings
     cross apply ( select StringLength - @minlen ) a(L)
     cross apply (
       select top(a.L) V.N
       from (
         values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)
         ) V(N)
       ) T(N)
     ) matches(String, match)
   group by match
   having count(distinct String) = @totalrows;
--   order by match;

   /* Decrement so next time we check for a shorter match */
   set @minlen = @minlen -1;
 end

/* display the result */
select 'Longest Common Substring: '+isnull(@result,'*** no match found ***');

Alan Burstein · Answer 3 · 2017-11-28T05:57:37.093

UPDATED on 20171127 at 11:58PM CST TO HANDLE UNIGRAMS

I know I'm a bit late here (by more than a year) but, my newest t-sql Longest Common Substring function is several thousand times faster than anything I've seen posted anywhere, including the aforementioned CLR (I just tested it).

It's a semi brute force technique I came up that:

Breaks up the shorter of the two strings and searches the longer substring for a match.
Accepts a 3rd parameter called a "window" (it's more like an NTile parameter but I use the term "window" because few people understand Ntile.) This is the secret sauce that makes this bad dog so fast
The routine then uses a pure brute force to see if the substrings size 20 or less in the short string also exist in the longer string. Using a tally table - the brute force approach for substrings size 20 or less will be instantaneous (0 ms).
After that it searches the longer string for substrings that exist in the shorter string whose size is evenly divisible by @window. For example, if the @window = 100, it will search for matching substrings that are of length 100, 200.... up to the length of the longer string (e.g. if the longer string is 515 characters long it will search for matching substrings that are 100, 200, 300, 400 and 500 characters long.
Once we have identified which "window" the longest substring lives in, I use a tally table and a "gaps & islands on strings" trick I learned from Chris Morris (discussed here) to compare each string for matching substrings.
TOP 1 with ties is used to identify the longest substring.

The function(s):

CREATE FUNCTION dbo.getshortstring8k(@s1 varchar(8000), @s2 varchar(8000))
RETURNS TABLE WITH SCHEMABINDING AS RETURN 
SELECT
  s1 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s1 ELSE @s2 END,
  s2 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s2 ELSE @s1 END;
GO

CREATE FUNCTION dbo.lcssWindowAB(@s1 varchar(8000), @s2 varchar(8000), @window int)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
/*****************************************************************************************
Purpose
 Calculates the longest common substring between two varchar(n) strings up to 8000 
 characters each.

Developer Notes:
 1. Optimal performance gains will be seen on longer strings. All Longest Common Substring
    functions I have seen in SQL Server begin to choke at 500-1000. Set based Brute force 
    solutions that use a tally table are very fast for up to 1000 characters but begin to 
    slow dramatically after. 

    With N as the length of a string, The number of substrings is: N(N+1)/2
    For 1,000 character string: 1000(1000+1)/2 = 500,500   substrings
    For 2,000 characters:       2000(2000+1)/2 = 2,001,000 substrings

 2. Requires a materialized tally table beginning with 1 containing at least 8000 numbers;
    as written, this function will slow to a crawl using a cte tally table. This is due to
    the WHERE x.n BETWEEN 2 AND 10 clause. A CTE tally table will struggle with this but a
    correctly indexed materialized tally table will not. 

    For optimal performance your tally table should have a unique nonclustered index.
    THE DDL TO BUILD THE REQUIRED TALLY TABLE IS BELOW.

 3. Performance optimizations:

   3.1. The first major optimization is that the *shorter* of the two strings is broken up 
        into substrings which are then searched for in the larger string using CHARINDEX. 
        This reduces the number of substrings generated by:
          abs(len(@s1)-len(@s2)) * (abs(len(@s1)-len(@s2)) + 1) / 2  

        For example, if one is 20 characters longer than the other, 210 fewer substrings 
        will be evaluated; for 100 it's 5,050 fewer substrings, etc. 

   3.2. The second optimization is a technique I developed that I call "windowing". I use
        a @window parameter to break up the search specific windows for the presense of a
        matching substring. 1-10 legnth substrings in the smaller string are searched for 
        in the longer string. After that, only tokens with sizes evenly divisible by the
        @window parameter are evaluated. For example, say the short string length is 100
        and I set @window 20. All tokens sized 1-10 are searched for, the 20-grams, then
        40, 60... 100. This reduces the number of substrings from 5,050 to somewhere 
        beween 480 & 500 (depending on the size of the longest common substring.)

 4. The window parameter is for optimization only! It does not affect the final result set
    in any way. I strongly suggest that testing the function using different window sizes
    for different scenarios; the optimal size will vary. I have seen queries execute 3-10 
    faster when changing the window size. Start with 100, then try windows sizes of 
    20, 200, 300, 1000... 

 5. This function does not see a performance gain when run in parallel; 
    use option (maxdop 1) unless you're testing shows performance gains when running under 
    a parallel execution plan. 

Required Tally Table DDL (this will build exactly 8000 rows):
------------------------------------------------------------------------------------------
  -- (1) Create tally table
  IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
  CREATE TABLE dbo.tally (N int not null);

  -- (2) Add numbers to the tally table
  WITH DummyRows(V) AS ( 
    SELECT 1 FROM (VALUES -- 100 Dummy Rows
     ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
     ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N))
  --INSERT dbo.tally
  SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
  FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c;

  -- (3) Required constraints (and indexes) for performance
  ALTER TABLE dbo.tally ADD CONSTRAINT pk_cl_tally PRIMARY KEY CLUSTERED(N) 
    WITH FILLFACTOR = 100;

  ALTER TABLE dbo.tally ADD CONSTRAINT uq_nc_tally UNIQUE NONCLUSTERED(N);
  GO
------------------------------------------------------------------------------------------
Usage Examples:
  SELECT * FROM dbo.lcssWindowAB('abcxxx', 'yyyabczzz', 10);

  SELECT * FROM dbo.lcssWindowAB('123xxx', '!!123!!xxx!!', 10);

History:
 20171126 - Initial Development - Developed by Alan Burstein  
 20171127 - updated to return unigrams when longest common substring is 1;
            updated to use a materialized tally table; -- Alan Burstein
*****************************************************************************************/
SELECT TOP (1) WITH TIES itemIndex, itemLen = itemLen+addThis, item
FROM dbo.getshortstring8k(@s1, @s2) xs
CROSS APPLY
(
  SELECT
    itemIndex = MIN(position) over (partition by grouper order by (select $)),
    itemLen   = itemLen,
    addThis   = position-MIN(position) over (partition by grouper order by (select $)),
    item      = SUBSTRING
                (
                 xs.s1, 
                 MIN(position) over (partition by grouper order by (select $)),
                 itemLen+position-MIN(position) over (partition by grouper order by (select $))
                )
  FROM 
  (
    SELECT position - ROW_NUMBER() OVER (ORDER BY position), position, itemLen
    FROM
    (
      SELECT TOP (1) WITH TIES t.N, x.N -- Get the "longest" (including ties)
      FROM dbo.tally t                  -- all positions within the shorter string (s.s1)
      CROSS JOIN dbo.tally x            -- all sizes of substrings within the shorter string
      WHERE (t.N <= LEN(xs.s1) AND x.N <= LEN(xs.s1) AND LEN(xs.s1) - t.N + 1 - x.N >= 0)
      AND   (x.N BETWEEN 2 AND 10 OR x.N % @window = 0)      -- only 2-20 & @window-sized tokens
      AND   CHARINDEX(SUBSTRING(xs.s1, t.N, x.N), xs.s2) > 0 -- only tokens matched in both strings
      ORDER BY -x.N
    ) longSubstrings (position, itemLen)
    UNION ALL
    SELECT -position-1, position, 1
    FROM
    (
      SELECT t.N, 1             -- unigrams only
      FROM dbo.tally t          -- all positions within the shorter string (s.s1)
      WHERE (t.N <= LEN(xs.s1)) -- all valid unigrams
      AND   CHARINDEX(SUBSTRING(xs.s1, t.N, 1), xs.s2) > 0 -- only unigrams matched in both strings
    ) unigrams (position, itemLen)
  ) addGrouper (grouper, position, itemLen)
) lcssWindow (itemIndex, itemLen, addthis, item)
WHERE @window >= 10 AND @window%10 = 0 -- must be greater than 10 and divisible by 10
AND   CHARINDEX(item, xs.s2) > 0
ORDER BY -itemLen, -addThis;
GO

Here's an example of how the function calculating the longest common substring between two strings about 7700 characters long. It returns the correct answer in 16 milliseconds.

set statistics time on;

declare
  @s1 varchar(8000) = replicate('x',50)+replicate('Wow!',1900)+'!'+'junk,junk,junk...',
  @s2 varchar(8000) = replicate('z',95)+replicate('Wow!',1900)+'!'+'zzzzzzzzzzz......',
  @window int       = 100;

select len(@s1), len(@s2);

select * from dbo.lcssWindowAB(@s1, @s2, 100);
set statistics time off;

This is part of an article I'm working on. More to come!

Is there a SQL Server implementation of the Longest Common Substring problem?

3 Answers3

Linked