Capitalize first letter with exceptions

Question

I need to properly format some European addresses. One of the steps is to capitalize the first letter, but avoiding some specific words, like 'on', 'upon', 'von', 'van', 'di', 'in', 'sul'. Therefore, although my skills are scarce, I thought it was a good idea to use a RegEx-based function.

After some googling I found this here:

CREATE FUNCTION InitialCap
(
    @String nvarchar(max)
)
    RETURNS nvarchar(max)
AS
    BEGIN 
        DECLARE @Position INT;

        SELECT 
            @String   = STUFF(LOWER(@String),1,1,UPPER(LEFT(@String,1))) COLLATE Latin1_General_Bin,
            @Position = PATINDEX('%[^A-Za-z][a-z]%',@String COLLATE Latin1_General_Bin);

        WHILE @Position > 0
            SELECT 
                @String   = STUFF(@String,@Position,2,UPPER(SUBSTRING(@String,@Position,2))) COLLATE Latin1_General_Bin,
                @Position = PATINDEX('%[^A-Za-z][a-z]%',@String COLLATE Latin1_General_Bin);

        RETURN @String;
    END

That seems to search for a sequence of a 'non-letter' + a lowercase 'letter'

[^A-Za-z][a-z]

OK, I think I've understood how it works, and I modify it to best match my needs.

I supposed it was better to search for a space or ' or - and a lowercase letter, therefore I changed it to

[\s'-][\w]

Then, after several attempt, I built this RegEx on regexr.com, that seems to catch the required sequence:

[\s](?!di\s|in\s|sul\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]

But when I put that into the above function, the result is not as expected.

What's wrong?

Solomon Rutzky · Answer 1 · 2017-12-15T18:19:24.920

SQL Server does not internally support Regular Expressions. LIKE and PATINDEX both support very limited wildcards, including a single-character range match [...] or exclude [^...] that are of similar syntax to RegEx, and are to a degree functionally similar, but are certainly not RegEx.

If you want / need Regular Expressions in SQL Server, you need to use SQLCLR. You can either code this yourself, or use pre-built functions such as the ones available in SQL# (which I wrote). Most of the RegEx functions are available in the Free version. I think you could use RegEx_Matches to return a result set of words that are not in your exclude list, and then combine that with the String_ToTitleCase4k function (also available in the Free version) to do the InitCap.

For example:

DECLARE @Input NVARCHAR(MAX) =
               N'santacroce sull''arno o''sullivan suLL sUlLiVan gsantacroce',
        @Expression NVARCHAR(4000) =
               N'[\s](?!di\s|in\s|sull\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]';

-- show matches for debugging
SELECT word.[StartPos],
       word.[EndPos],
       word.[Value] AS [Original],
       SQL#.String_ToTitleCase4k(word.[Value], N'') AS [TitleCased]
FROM   SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;


SELECT @Input = STUFF(@Input,
                      word.[StartPos],
                      ((word.[EndPos] - word.[StartPos]) + 1),
                      SQL#.String_ToTitleCase4k(word.[Value], N'')
                                                COLLATE Latin1_General_100_BIN2)
FROM SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;


SELECT @Input AS [Fixed];

Returns:

StartPos    EndPos    Original    TitleCased
--------    ------    --------    ----------
11          12        s           S
21          22        o           O
32          33        s           S
37          38        s           S
46          47        g           G


Fixed
-------------------------
santacroce Sull'arno O'sullivan SuLL SUlLiVan Gsantacroce

The reason that it doesn't work entirely correctly is due to your regular expression being incorrect:

It only matches a single letter.
It won't exclude any of the fragments if they are at the end of the string, but that might be ok if that never occurs in actual use.
It won't include the first word of the string (due to the required white-space to the left), but that might be ok if that never occurs in actual use.

UPDATE:
I was able to fix your regular expression by changing it to be as follows:

\b(?!(?:di|in|sull|on|upon|von|uber|ueber)\b)\w+

Main differences from the original:

I am using \b (word boundary) instead of \s (white-space) as it handles beginning and ending of line/string. It also doesn't capture the white-space, which \s does, and if it isn't visually obvious, each of the matched strings above is prefixed with the space that matched. While that space wouldn't affect the replacement as it would still be a space, it did prevent the first word in the group from matching, unless the entire string was preceded by some white-space. In the case of using this with addresses, if they always start with a number then perhaps there will always be that preceding space, but best not to include it in the match.
I added the + (one or more) quantifier to the \w so that it will pick up more than just the first character
I simplified the exclusion list by moving the common \b at the end of each fragment to just outside of a new, inner non-capturing group. This is a non-functional difference. It just makes it easier to read and deal with.

New output:

StartPos    EndPos    Original      TitleCased
--------    ------    --------      ----------
1           10        santacroce    Santacroce
17          20        arno          Arno
22          22        o             O
24          31        sullivan      Sullivan
38          45        sUlLiVan      Sullivan
47          57        gsantacroce   Gsantacroce

Fixed
-------------------------
Santacroce sull'Arno O'Sullivan suLL Sullivan Gsantacroce

UPDATE 2:

If the desire is to have a list of exclude words that is updateable without needing to update the function containing the regular expression, that is fairly easy to accomplish by doing the following:

Create a table to hold the exclusion words:

CREATE TABLE #ExcludeWords ([Word] NVARCHAR(50) NOT NULL);

Populate / manage-over-time the words in that table:

INSERT INTO #ExcludeWords ([Word]) VALUES
   (N'di'), (N'in'), (N'sull'), (N'on'), (N'upon'), (N'von'), (N'uber'), (N'ueber');

In whatever code you have that does this data-cleansing, dynamically build the regular expression from the table of words to exclude:

DECLARE @Expression NVARCHAR(4000) = N'\b(?!(?:';

SELECT @Expression += ex.[Word] + N'|'
FROM   @ExcludeWords ex

SET @Expression = LEFT(@Expression, LEN(@Expression) - 1) + N')\b)\w+';

SELECT @Expression; -- JUST FOR DEBUG
--\b(?!(?:di|in|sull|on|upon|von|uber|ueber)\b)\w+

UPDATE 3:

Originally no sample data was provided, so both answers tested with a simple list of words, some of which were the exclusion words. But now some test cases have been provided and in testing with one of those, I found a few issues with my implementation:

I had forgotten to specify the "case-insensitive" option for the RegEx function
I was replacing all occurrences of the entire match within the original string. This works fine when no matches are substrings of other strings within the entire string. But when including "O'Sullivan", the "O" is a substring of a few items and hence produced erroneous results.

So, I have adjusted the code and test case and results above to account for these issues. The main differences are:

Added the RegEx option of N'ignorecase'
Switched the REPLACE function with STUFF which allows me to use the starting and ending positions of each match to replace just that one item

PLEASE NOTE:

There is a minor discrepancy between the exclusion list in the question and the sample data provided in a comment on the question: question uses sul while the comment uses sull. I have adjusted my answer to work with the sull (two "L"s) shown in the test case provided in the comment.
This is reason number 5,235,948,567 for why it is critically important to have and/or provide actual test data ;-).

score 1 · Accepted Answer · edited Dec 16 '17 at 18:56

IMHO, I think above UDF cannot be implemented with your requirement.

My approach will be to,

Create Split String UDF
Create Exclude Word dictionary table.

It will work like this. I have one split string UDF. You can create or download your own split string function.

        declare @Excludeword table(word varchar(50))--this should be permanenet table
insert into @Excludeword VALUES ('on'), ('upon'),('von'),('van'),('di'),('in'),('sul')

    DECLARE @String nvarchar(max)='santacroce sull''arno'--'caT  UPON the wet floor'

SELECT stuff((
            SELECT ' ' + CASE 
                    WHEN ca.word = item
                        THEN ca.word
                    WHEN charindex('''', item) > 0
                        THEN STUFF(LOWER(item), charindex('''', item) + 1, 1, UPPER(substring(item, charindex('''', item) + 1, 1))) COLLATE Latin1_General_Bin
                    ELSE STUFF(LOWER(item), 1, 1, UPPER(LEFT(item, 1))) COLLATE Latin1_General_Bin
                    END
            FROM dbo.DelimitedSplit8K(@String, ' ')
            OUTER APPLY (
                SELECT word
                FROM @Excludeword E
                WHERE e.word = item
                ) ca
            WHERE item <> ''
            FOR XML PATH('')
            ), 1, 1, '') Item

Output, Cat on The Wet Floor or Santacroce sull'Arno or Cat upon The Wet Floor

You should mention all your business rules in your question itself with suitable examples.

Also how are you are planning to use it? Like will you pass a variable or you will process whole table.

Note :

This is just rough idea.

If this is what you are looking for then Split String can be customized suiting your requirement and a few things can be encapsulated in UDf itself.

Download [dbo].[DelimitedSplit8K]

Capitalize first letter with exceptions

2 Answers2