
I am trying to write a script that will take multiple paths to files on various servers, search them all simultaneously, and return a single list of results to a user. Initially I was just using Python threads to do this, but I quickly ran into some notable problems:

  1. I wasn't controlling how many threads could be spawned. If someone sent 100 files to query on one server, 100 threads would be kicked off on that machine, which was bad news.

  2. The results I was getting back were incomplete and varied drastically. If I ran the searches linearly (no threads), I would get complete results, but it would take a long time. I concluded based on this and some personal research that I was not taking a thread-safe approach and began looking into the Queue module.

I've ended up with something like this...

import subprocess
import threading
import Queue

def worker():
    while q.qsize != 0:
        cmd = q.get()
        # kick off a bash command that zgreps files from different servers
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
        results.extend(''.join(p.stdout.readlines()).split('\n')[:-1])
        q.task_done()

NUM_WORKER_THREADS = 10
results = []

q = Queue.Queue()
for i in range(NUM_WORKER_THREADS):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

""" Code to generate input commands needed here """

for c in commands:
    q.put(c)

q.join()

""" Post processing of collected *results* array"""

After putting some thread-pool constraints around my program, and also checking in each thread whether there is anything still in the queue, my results are in line with what I would expect. After testing, the results match the output of the single-threaded approach (except it is much faster).

My questions are as follows:

  1. Is my approach thread-safe? Is there any chance one of the 10 worker threads could overwrite another's attempt to extend the results array? I am worried that I have merely decreased the chance of an overwrite taking place by allocating a smaller thread pool to handle the input, but have not actually solved the problem.

  2. I understand from reading that queues are supposed to be thread-safe. However, when I eliminate thread pools and don't check for the queue size in my threads, I can reproduce the same problem I was having before with large volumes of input. Can someone explain why that is?


1 Answer


1a. Is my approach thread-safe?

No, because your results list isn't thread-safe, yet it is being read and written by many threads. Consider turning it into a Queue as well.
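
For instance, here is a minimal sketch of that idea (assuming a commands list built the same way as in the question): each worker puts its lines onto a second Queue instead of extending a shared list, and the main thread drains that Queue once all the work is done.

import subprocess
import threading
import Queue

NUM_WORKER_THREADS = 10

q = Queue.Queue()          # commands to run
results_q = Queue.Queue()  # thread-safe container for worker output

def worker():
    while True:
        cmd = q.get()
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
        # workers only ever call put(); no shared list is mutated
        for line in p.stdout.read().split('\n')[:-1]:
            results_q.put(line)
        q.task_done()

for i in range(NUM_WORKER_THREADS):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

for c in commands:  # `commands` generated as in the question
    q.put(c)
q.join()

# drain the results queue in the main thread once every task is done
results = []
while not results_q.empty():
    results.append(results_q.get())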

1b. Is there any chance one of the 10 worker threads could overwrite another's attempt to extend the results array?

Yes, that's exactly what could happen, which is why you should avoid using an array in that manner when threading.
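
If you did want to keep a shared list rather than a Queue, the usual way to make those extends safe is to guard them with a threading.Lock. A minimal sketch of just that piece (the lock and helper names here are illustrative, not from the question's code):

import threading

results = []
results_lock = threading.Lock()  # guards every mutation of the shared list

def record(lines):
    # hold the lock only while the list is actually being extended
    with results_lock:
        results.extend(lines)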

2. I understand from reading that queues are supposed to be thread-safe. However, when I eliminate thread pools and don't check for the queue size in my threads, I can reproduce the same problem I was having before with large volumes of input. Can someone explain why that is?

You really should be using a prebaked Pool from the multiprocessing module to do your work; it will make your life a lot simpler. It takes a function to call and a list of all the arguments it should be called with, then distributes that work among its workers. The only catch is that the function you are calling can only be passed one parameter, but since you're only using one, you can adapt your code fairly easily.

For example:

from multiprocessing import Pool
import subprocess

def do_work(cmd):
    # kick off a bash command that zgreps files from different servers
    # Not sure if this can be done better; not clear what command you're running
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    return ''.join(p.stdout.readlines()).split('\n')[:-1]

NUM_WORKER_THREADS = 10

p = Pool(NUM_WORKER_THREADS)
results = p.map(do_work, commands)
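
Since the work here is mostly waiting on zgrep subprocesses, a thread-backed pool would also fit; the standard library exposes one with the same map API under multiprocessing.dummy. As a sketch under the same assumption (a commands list built as in the question), you could swap in:

from multiprocessing.dummy import Pool  # same interface as multiprocessing.Pool, backed by threads
import subprocess

def do_work(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    return ''.join(p.stdout.readlines()).split('\n')[:-1]

NUM_WORKER_THREADS = 10

pool = Pool(NUM_WORKER_THREADS)
results = pool.map(do_work, commands)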

As a side note, you can always get around the one-argument limit by simply packing all your necessary arguments into a tuple and then unpacking them at the start of the function. Tuples are cheap in Python.

For example if you had a function do_work that took two commands as follows:

def do_work(cmd_one, cmd_two):
    p = subprocess.Popen(cmd_one, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    p2 = subprocess.Popen(cmd_two, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    return ''.join(p.stdout.readlines()) + ''.join(p2.stdout.readlines())

we could rewrite it as follows:

def do_work(commands):
    cmd_one, cmd_two = commands
    p = subprocess.Popen(cmd_one, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    p2 = subprocess.Popen(cmd_two, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    return ''.join(p.stdout.readlines()) + ''.join(p2.stdout.readlines())

and change our main method to look like the following:

from multiprocessing import Pool

NUM_WORKER_THREADS = 10

p = Pool(NUM_WORKER_THREADS)
results = p.map(do_work, [(x, y) for x in command1_list for y in command2_list])

In this way, we package both our variables into a tuple for the call, and then immediately unpack them when we need them.
