Does this regex reliably remove square brackets from T-SQL?

Question

I'm using SQL Server Management Studio 17.4, which supports search-and-replace using a Microsoft version of regular expressions (regex).

I need a regex expression that can reliably remove square brackets from source code, such as large CREATE TABLE statements that have been scripted from existing an database.

I'm currently using this regex in the "find" dialog:

\[((?!\d+)(?!PRIMARY))(([A-Z]|_|[0-9])*?)\]

The "replacement" is $2.

It seems to work very well, however I'm concerned this may eliminate required square brackets under some conditions.

In the regex above, the (?!PRIMARY) piece ensures ON [PRIMARY] does not have the square brackets removed. Are there other exceptions I need to be aware of?

I'm willing to exclude the possibility of column names being reserved words, such as [GROUP]. The environment I work in has specific naming convention that typically eliminates that possibility.

Solomon Rutzky · Answer 1 · 2018-08-20T15:19:19.467

When I first answered this question the rules for what characters would determine valid or invalid were a bit hazy (the documentation isn't entirely clear on it, and is in some ways misleading). However, since that time I have done some research into the specific rules for valid regular identifiers (i.e. those that do not need to be delimited by either "name" or [name]) related to this DBA.SE question:

How to create Unicode parameter and variable names

and have discovered the exact rules. The research into getting those exact rules lead to finding a list of Unicode Categories (not blocks) that can be used in combination to determine both the "first" / "start" character and any possible "subsequent" / "continue" characters. That research is towards the end of this post (Steps 7 and 8):

The Uni-Code: The Search for the True List of Valid Characters for T-SQL Regular Identifiers, Part 2

Reworking your pattern I came up with the following:

(?!\[(?i:PRIMARY|GROUP)\])\[([\p{L}\p{Nl}\u0023\u005F\uFF3F][\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Cf}\u0023\u0024\u0040]*)\](?!\])

Some notes:

You don't need to "or" the separate items together with a | in a character class (i.e. the [stuff_here]) as it is already a list. In fact, including the pipe symbol actually means you were adding it to the list of allowed characters ;-).
You need to separate out the exclusion list (i.e. PRIMARY|GROUP...) such that they can be wrapped in [ and ]. Otherwise you allow for something like [PRIMARYa] to get skipped and keep its delimiters.
I added (?!\]) to the end to account for embedded escaped end-delimiters: ]].
This pattern is 99.5% effective. The only thing it does not catch is if you have both:
- an embedded leading delimiter: [, AND
- no embedded escaped end-delimiter: ]
Meaning, [Te[st] will erroneously un-delimit it to be [Test (oops), but both [Te[s]]t] and [Tes]]t] will remain delimited as expected. And really, how often are there embedded delimiters? And even then, how often is there just the beginning delimiter and not the ending delimiter? Very rare, I would think.

Of course, I said it was 99.5% effective but it only catches 2 reserved words. Easy enough to fix since we can get the list of reserved words. Adding in all 185 reserved words (well, 184 as I left out "WITHIN GROUP" since it has a space in it and is thus naturally excluded) gives us the following:

_{( I made it a comment instead of a code block so that it would be readable since it is a single line [that can be copied and pasted into the "Find" box] )}

(?!\[(?i:ADD|ALL|ALTER|AND|ANY|AS|ASC|AUTHORIZATION|BACKUP|BEGIN|BETWEEN|BREAK|BROWSE|BULK|BY|CASCADE|CASE|CHECK|CHECKPOINT|CLOSE|CLUSTERED|COALESCE|COLLATE|COLUMN|COMMIT|COMPUTE|CONSTRAINT|CONTAINS|CONTAINSTABLE|CONTINUE|CONVERT|CREATE|CROSS|CURRENT|CURRENT_DATE|CURRENT_TIME|CURRENT_TIMESTAMP|CURRENT_USER|CURSOR|DATABASE|DBCC|DEALLOCATE|DECLARE|DEFAULT|DELETE|DENY|DESC|DISK|DISTINCT|DISTRIBUTED|DOUBLE|DROP|DUMP|ELSE|END|ERRLVL|ESCAPE|EXCEPT|EXEC|EXECUTE|EXISTS|EXIT|EXTERNAL|FETCH|FILE|FILLFACTOR|FOR|FOREIGN|FREETEXT|FREETEXTTABLE|FROM|FULL|FUNCTION|GOTO|GRANT|GROUP|HAVING|HOLDLOCK|IDENTITY|IDENTITY_INSERT|IDENTITYCOL|IF|IN|INDEX|INNER|INSERT|INTERSECT|INTO|IS|JOIN|KEY|KILL|LEFT|LIKE|LINENO|LOAD|MERGE|NATIONAL|NOCHECK|NONCLUSTERED|NOT|NULL|NULLIF|OF|OFF|OFFSETS|ON|OPEN|OPENDATASOURCE|OPENQUERY|OPENROWSET|OPENXML|OPTION|OR|ORDER|OUTER|OVER|PERCENT|PIVOT|PLAN|PRECISION|PRIMARY|PRINT|PROC|PROCEDURE|PUBLIC|RAISERROR|READ|READTEXT|RECONFIGURE|REFERENCES|REPLICATION|RESTORE|RESTRICT|RETURN|REVERT|REVOKE|RIGHT|ROLLBACK|ROWCOUNT|ROWGUIDCOL|RULE|SAVE|SCHEMA|SECURITYAUDIT|SELECT|SEMANTICKEYPHRASETABLE|SEMANTICSIMILARITYDETAILSTABLE|SEMANTICSIMILARITYTABLE|SESSION_USER|SET|SETUSER|SHUTDOWN|SOME|STATISTICS|SYSTEM_USER|TABLE|TABLESAMPLE|TEXTSIZE|THEN|TO|TOP|TRAN|TRANSACTION|TRIGGER|TRUNCATE|TRY_CONVERT|TSEQUAL|UNION|UNIQUE|UNPIVOT|UPDATE|UPDATETEXT|USE|USER|VALUES|VARYING|VIEW|WAITFOR|WHEN|WHERE|WHILE|WITH|WRITETEXT)\])\[([\p{L}\p{Nl}\u0023\u005F\uFF3F][\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Cf}\u0023\u0024\u0040]*)\](?!\])

Also, with the new pattern you will need to change the replacement to be $1 instead of $2.

I tested with the following:

SELECT [f], [p$o],   -- match these
       [prImary], [12y], [a b], [g%g], [j^6], [f¾], [u²], [h⁊], [s|w]
FROM   dbo.[$a],
       dbo.[#a], dbo.[@d], dbo.[_a] -- match these
WHERE  f.[&col1] = N'it will also [match] <- that :-(';
/* and [this] */

--[PRIMARY]
--[PRIMARy]
--[PRIMARYg]
--[1PRIMARYg]
--[PRIMARY g]
--[GROUP]
--[GROUPBY]
--[@Test]
--[Te]]st]
--[Te]]st]]]
--SELECT 1 AS [Te]]st]]a[b];        --OOPS
--[Te]]s[t]]ab]
--[Te]]s[t]]ab]]
--SELECT 2 AS [Te]]st]]], 2.1 AS [p];
--SELECT 3 AS [Tes[tab];            --OOPS
--[Te[st]                           --OOPS
--[Te[s]]t]
--[Tes]]t]
--[T[e[s]]t]
--[dbo].[TableName]
--[dbo].[1TableName]
--[dbo.1TableName]
--[[t]                              --OOPS
--SELECT 4 AS []]t];
--[[t]]]
-- SELECT 1 AS [d], '---' AS [1f], '$' AS [g]
-- SELECT 1 AS d, '---' AS [1f], '$' AS g
--[OPTION]
--[OPTIONt]

That all being said, I will reiterate what I originally said in a comment on the question:

Is making the script "look nicer" worth any amount of risk of potential errors? I would say "no". I used to remove square brackets for pretty much the same reason, but have gotten to the place where I now always add them because "guaranteed working" beats aesthetics. For me. I realize not for everyone. Of course, if you are more comfortable with .NET RegEx, then SQL# (which I wrote) has RegEx_Replace for free .

Also, keep in mind that this is a global search and replace, so it will match strings of this pattern even if they are in comments or string literals. If you have such strings then this might not be desirable and you will either need to find a way to exclude those (especially string literals), or give up this fight ;).

Aaron Bertrand · Accepted Answer · 2018-02-17T00:57:25.907

Essentially what you've done with PRIMARY is started building a list of exceptions, like people do when they start trying to thwart SQL injection the hard way (by "cleansing" the SQL). Do you want to do this for all current and future reserved words? Even in a strict environment things will slip through, especially if things aren't reserved words yet but become reserved words later (think about people who named their string splitting function dbo.STRING_SPLIT() before SQL Server 2016).

Here are just a handful I thought of off the cuff; there are dozens and dozens more. Your RegEx probably captures some of these (like dashes and spaces) but, if you remove any of the square brackets that remain after your RegEx has run, this script will break:

CREATE TABLE dbo.[Create]
(
  [Alter]      int,
  [Drop]       int,
  [Begin]      int,
  [End]        int,
  [Grant]      int,
  [Deny]       int,
  [Revoke]     int,
  [Truncate]   int,
  [Identity]   int,
  [Select]     int,
  [Update]     int,
  [Insert]     int,
  [Delete]     int,
  [Merge]      int,
  [Procedure]  int,
  [Trigger]    int,
  [Table]      int,
  [Function]   int,
  [View]       int,
  [Raiserror]  int,
  [bad-name]   int,
  [worse name] int,
  [0xd83d]     int,
  []        int
);
GO

Get over your hatred for the square brackets and get used to them - they will someday protect you from someone who named something incorrectly.

Does this regex reliably remove square brackets from T-SQL?

2 Answers2