
TL;DR for readers with the same problem

If your external script is too slow, you may be doing something in the script that could be implemented in T-SQL instead, which is a big problem. If that is not the case, use CLR or process your data in an external application where the needed functionality is available. In general, you should not rely on external scripts to achieve high performance. CLR is also more secure.

I am not allowed to do any of the above because of workplace/interpersonal reasons.

Problem

I have to write a query that relies on regular expressions, and I am not allowed to use CLR or anything outside the SQL Server environment. So I must use an R or Python external script. I have experience with Python, so I went with that.

I wrote the script and it works, but the input data is extremely large, so the external script execution takes a long time. I have already heavily optimized the script, but it is still slow, so I plan to parallelize it.

Python's multiprocessing module doesn't work inside the external script, so I had a strange idea:

What if I horizontally partition the temp table that holds the input data, execute the external script on each partition using asynchronous procedure execution, and then take the union of the results?
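To make the idea concrete: every asynchronous call would run the same external script body over its own partition. Below is a minimal sketch of that body, assuming the default InputDataSet/OutputDataSet variable names of sp_execute_external_script; extract_row is a placeholder for my real extraction logic, and the mock input only makes the sketch runnable outside SQL Server.

```python
import pandas as pd

try:
    InputDataSet  # injected by sp_execute_external_script
except NameError:
    # Mock input so the sketch also runs outside SQL Server.
    InputDataSet = pd.DataFrame({"actual_name": ["Kovács Anna"]})

def extract_row(row):
    # Stand-in for the real regex-based extraction.
    surname, _, given = row["actual_name"].partition(" ")
    return pd.Series({"surname": surname, "given_name": given})

# Every partitioned call runs this identical body; only InputDataSet differs.
OutputDataSet = InputDataSet.apply(extract_row, axis=1)
```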

Or perhaps there is a better way to parallelize the execution of an external script?

I work on SQL Server 2019 Developer edition; we also have an Enterprise edition, which I guess doesn't matter here.

Details

@J.D rightly asked for more details. I am only allowed to share a limited amount:

The temp table providing the input has separate columns for each person's actual name, birth name, and mother's name.

I also have a DataFrame in the external script containing valid first names and the associated genders, parsed from a JSON document that is passed in as an nvarchar(max) parameter. The method is described here.
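For illustration, a minimal sketch of that parsing; the column names are assumptions, and the JSON is inlined here, whereas in reality it arrives as an nvarchar(max) parameter of the procedure.

```python
import json
import pandas as pd

# Hypothetical shape of the lookup data.
names_json = '[{"first_name": "Anna", "gender": "F"}, {"first_name": "Béla", "gender": "M"}]'
name_lookup = pd.DataFrame(json.loads(names_json))
print(name_lookup)
```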

My script extracts titles, surnames, and given names, and guesses biological sex and marital status. The names are not English names.

I apply a function along the first axis of the input Pandas DataFrame; this function does the extraction.

I use only Pandas-provided methods.
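Roughly, the extraction has this shape; the rules, column names, and lookup below are simplified placeholders for my real regex-based logic.

```python
import pandas as pd

# Illustrative lookup; in reality it is parsed from the JSON parameter.
gender_by_name = pd.Series({"Anna": "F", "Béla": "M"})

def extract(row):
    # Simplified placeholder rules.
    name = row["actual_name"]
    title = "Dr." if name.startswith("Dr.") else ""
    bare = name[len(title):].strip()
    surname, _, given = bare.partition(" ")  # surname-first name order
    return pd.Series({
        "title": title,
        "surname": surname,
        "given_name": given,
        "guessed_sex": gender_by_name.get(given, "unknown"),
    })

df = pd.DataFrame({"actual_name": ["Dr. Kovács Anna", "Szabó Béla"]})
print(df.apply(extract, axis=1))
```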

In a standard Python environment I would use the np.array_split function to split the DataFrame, use the multiprocessing module to apply the extraction function to the splits in parallel processes, and then concatenate the resulting DataFrames. As you can see, this is a text-processing problem and cannot be vectorized the way a mathematical one could be, so tools like Dask cannot be used here.
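For reference, a minimal sketch of that standard-environment approach; the sample data and extract_partition are placeholders.

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool

def extract_partition(chunk):
    # Stand-in for applying the real extraction function to one split.
    return chunk.assign(surname=chunk["actual_name"].str.split().str[0])

if __name__ == "__main__":
    df = pd.DataFrame({"actual_name": ["Kovács Anna", "Szabó Béla",
                                       "Nagy Éva", "Tóth Gábor"]})
    splits = np.array_split(df, 4)          # one split per worker process
    with Pool(processes=4) as pool:
        parts = pool.map(extract_partition, splits)
    result = pd.concat(parts, ignore_index=True)
    print(result)
```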

Why not CLR?

Using CLR would be the sane choice, so a colleague of mine and I recommended it. Our expert said using CLR is very dangerous. Yes, I know about the CLR strict security option, but our expert may not. So I am forced to use an external script against my better judgement.

Why not use multi-threading?

Python's Global Interpreter Lock allows only one thread to execute Python bytecode at a time, so multi-threading in Python will not provide parallel computation, at least with the standard CPython implementation. This limitation does not apply to Jython or IronPython, but that is a different story.
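A toy demonstration, in case anyone wants to verify the claim: under CPython, a CPU-bound task gains nothing from a thread pool, while a process pool scales with cores (exact timings vary by machine).

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    # CPU-bound busy work.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [5_000_000] * 4
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(burn, work))
        print(pool_cls.__name__, round(time.perf_counter() - start, 2), "s")
```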
