This can be done rather easily as a SQLCLR User-Defined Aggregate (UDA). An aggregate operates over a set of rows, so you would be able to operate over all rows or just a subset, based on a WHERE condition and optional GROUP BY (if wanting to operate over separate sets of rows).
BUT, whether or not you should do this depends on what you are planning on doing with the result. If this is a one-off project to do some research that won't be repeated, then it is probably best to just make a small Console App to read in the rows and process accordingly.
However, if you do have some need to use the returned value within a database-centered process, then SQLCLR should be fine (assuming you follow the two recommendations mentioned under "The following tricks can be used to reduce the memory usage of an implementation" in the Pseudocode section. You will just need to find a "creative" way of dealing with the situation of having multiple results for what is considered the longest common substring (i.e. if 2 or more common substrings tie for "first place"). Using the example from the Wikipedia page (which shows 2 matches for the 2 strings):
ABAB
BABA
Returns both:
Perhaps returning an XML document of matches since that is parsable and can contain any string (when properly escaped).
UPDATE (updated, and updated again)
The .NET C# source code for this can be found on Pastebin.com at:
SQLCLR UDA for Longest Common Substring - Source Code
And for anyone that wants to play with this UDA without compiling it, an installation T-SQL script (no external DLL) can be found on Pastebin.com at:
SQLCLR UDA for Longest Common Substring - Installer
The UDA should be fairly memory-efficient as it only stores substrings that match the current string and all previously encountered strings. As a new row calls the UDA, any substrings that are not found in the new string are removed from the collection.
It is also CPU-efficient in that, if at any point the number of "common" substrings goes to zero, it sets a flag indicating that no substrings are even possible and short-cicruits all future executions to simply exit upon being called. This happens immediately if an empty string is encountered. In any of these cases, an empty XML document (i.e. root element only) is returned. The meaning of the empty document is equivalent to an empty string, since the only thing the input strings have in common is that they were non-NULL strings.
NULL, in my interpretation, is ignored and does not indicate no possible matches like an empty string does.
A NULL is returned in the following two cases:
- All inputs are
NULL
- Only a single non-
NULL row was in the set and hence there was no other string to compare to, hence nothing to be considered "common".
I added a second input parameter which controls whether or not the return is just the longest common substrings or all common substrings. In the case of returning all, an attribute is added to each "item" to indicate whether or not it is one of the "longest" substrings:
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test1a]
FROM (VALUES (N'ABAB'), (N'BABA')) tab(col);
Returns:
<Items Merged="False">
<Item>ABA</Item>
<Item>BAB</Item>
</Items>
And
SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test1b]
FROM (VALUES (N'ABAB'), (N'BABA')) tab(col);
Returns:
<Items Merged="False">
<Item IsLongest="True">ABA</Item>
<Item IsLongest="True">BAB</Item>
<Item IsLongest="False">AB</Item>
<Item IsLongest="False">BA</Item>
<Item IsLongest="False">A</Item>
<Item IsLongest="False">B</Item>
</Items>
Also, the comparisons are now case-InSensitive to match the typical Collation.
Below are 16 more test cases that check functionality only, not performance. I will post additional tests later that operate over many rows of much longer strings. I intentionally left out Combining Characters and Supplementary Characters, for now, as they are bit a more complicated.
SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test2]
FROM (VALUES (N'ABAB'), (N'BABA'), (N'2BAB5')) tab(col);
-- <Items><Item>BAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test3]
FROM (VALUES (N'ABAB'), (N'BABA'), (NULL), (N'2BAB5')) tab(col);
-- <Items><Item>BAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test4]
FROM (VALUES (NULL), (NULL), (NULL)) tab(col);
-- NULL
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test5]
FROM (VALUES (N'ABAB'), (N'BABA'), (N''), (N'2BAB5')) tab(col);
-- <Items />
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test6]
FROM (VALUES (N'ABAB'), (N'BABA'), (N'L'), (N'2BAB5')) tab(col);
-- <Items />
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test7]
FROM (VALUES (N'ABAB')) tab(col);
-- NULL
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8a-DuplicatesAcross2Rows]
FROM (VALUES (N'ABAB'), (N'ABAB')) tab(col);
-- <Items><Item>ABAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8b-DuplicatesAcross3Rows]
FROM (VALUES (N'ABAB'), (N'ABAB'), (N'ABAB')) tab(col);
-- <Items><Item>ABAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8c-DuplicatesAcross4Rows]
FROM (VALUES (N'ABAB'), (N'ABAB'), (N'ABAB'), (N'ABAB')) tab(col);
-- <Items Merged="False"><Item>ABAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test9-DuplicatesWithinOneString]
FROM (VALUES (N'ABAB'), (N'zABABh2348923ABABf')) tab(col);
-- <Items Merged="False"><Item>ABAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test10-XmlEncodableCharacters]
FROM (VALUES (N'ABA&B'), (N'zABA&Bh2348923ABA&Bf')) tab(col);
-- <Items Merged="False"><Item>ABA&B</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11a-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'ABCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'512tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11b-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'BCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'512tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11c-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'ABCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'5123tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item><Item>123</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11d-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'BCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'5123tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item><Item>123</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11e-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'BCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'123tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item><Item>123</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test12-CaseInSensivity]
FROM (VALUES (N'AbAB'), (N'BAbA')) tab(col);
/*
<Items Merged="False">
<Item IsLongest="True">AbA</Item>
<Item IsLongest="True">bAB</Item>
<Item IsLongest="False">Ab</Item>
<Item IsLongest="False">bA</Item>
<Item IsLongest="False">A</Item>
<Item IsLongest="False">b</Item>
</Items>
*/
Final Update (hopefully)
I made a couple of minor changes to the test code provided in @MisterMagoo's answer so that it would: a) return ties for "longest" common substring, and b) include the last character in each string as part of the search. I also changed, slightly, how the initial test rows are gathered such that there would be just over 1 million rows, and re-ran the tests. The results were that the T-SQL version took 1 minute and 13 seconds while the SQLCLR UDA took only 12 seconds (and even has the option of returning all common substrings, not just the longest).
I then modified the test data to include another common substring of the same length as the current winner (7 characters), a shorter but still common substring of 4 characters, and 3 random characters. This increased the max test string size from 32 characters to 46 characters. Running the test again (same code), the T-SQL version had to be killed at 23 minutes and had only tested the first 3 lengths: 27, 26, and 25. The SQLCLR UDA returned in about 1 minute and 10 seconds.
Is it time to celebrate? Has SQLCLR saved the day? Hold on..
On a hunch, I decided to see if the optimization I added to the SQLCLR version would help the T-SQL, namely:
Grab the shortest two strings (shortest would have the fewest number of possible substrings and would be the longest possible match anyway).
Extract all possible common substrings from the two short strings (any substrings that are "common" across all rows necessarily must be in the set derived from just these two strings, and no new substrings from other rows can be introduced as they wouldn't be "common").
Starting with the longest common substring, test to see if it is found in all rows. I used IF (NOT EXISTS (WHERE CHARINDEX(substring, test_row) > 0)) because the EXISTS clause will exit on the first row that returns a 0 (meaning substring not present) and hence not need to test all rows. This part is accomplished with a CURSOR (shh, don't tell anyone) because it allows for starting at the top of the list and just picking each new row off without needing to re-scan the list each time to find the next entry.
Once the first substring is found, save its length in a variable, and save the substring itself into a table variable.
Keep testing, but only for strings that are the same length as the first common substring found. as soon as the length of the next substring to search for is less than the length of the first substring to match, exit the loop and clean up the cursor.
All of this cursor stuff must make it painfully slow, right? Well, keeping in mind that the prior T-SQL version would have taken several hours to finish, and the SQLCLR UDA took 1 minute and 10 seconds, you might guess this updated version would take, what? A few minutes? 10 minutes? More? After all, just mentioning "cursor" is an automatic 5 minute hit, right? The actual time for the modified version:
22 seconds!!!
Of course, that is largely due to the substrings being on the longer side (compared to the size of the strings) and so the loop was able to exit earlier than if the longest common substring was only 3 or 4 characters long (i.e. it had less to test, but that is still a valid case and is not cheating). Also, one benefit of working in T-SQL is that you can test the entire set against a single value, whereas SQLCLR only has the current row and cannot see the entire set, hence the SQLCLR cannot short-circuit upon finding the longest common substring (because it won't know what is "common" in the first place until it has been executed across all rows).
A final change was to allow the new T-SQL version to return all "common" substrings, not just the longest ones, while indicating which ones were the longest in a second column of datatype BIT. When returning all substrings, mirroring the functionality of the SQLCLR UDA (even when the UDA only returns the longest common substrings, it still has the full list of all common substrings stored since it, again, has no ability to short-circuit), the T-SQL version returns in 2 minutes and 41 seconds.
So, there are conditions were the T-SQL can be faster, even at 1.2 million rows. But the performance is far more stable for the SQLCLR version, and definitely faster when you want all common substrings.
The test script, containing all 3 tests along with their results, can be found on Pastebin at:
SQLCLR UDA for Longest Common Substring - Testing
P.S. Even though the test was done over 1.2 million rows, the longest string tested was 46 characters. I am not sure how the performance of either approach would be affected by operating over strings that are much longer. That testing will have to wait until there is time to do it ;-).