
I have a Python program for scraping that needs a lot of time to run. To parallelize it, I have modified the code so that it can run in parallel on different machines. I have also created a Docker image and pushed it to Docker Hub.

I tried to use Airflow and the KubernetesPodOperator to create 10 Kubernetes pods to achieve that, but I haven't had success so far, and the documentation is lacking in this regard (a rough sketch of what I have been trying is shown below). Is there another way I can achieve this? How about GCP, Spark and Airflow? Or just GCE machines somehow orchestrated by Airflow? Any other options?
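
Roughly, this is the shape of the DAG I have been trying; the image name, namespace and container arguments here are placeholders for my actual setup:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG("scraper_parallel", start_date=datetime(2020, 1, 1), schedule_interval=None) as dag:
    for i in range(10):
        KubernetesPodOperator(
            task_id="scrape_shard_{}".format(i),
            name="scrape-shard-{}".format(i),
            namespace="default",
            image="my-dockerhub-user/scraper:latest",  # the image I pushed to Docker Hub
            arguments=["--shard", str(i), "--num-shards", "10"],  # how my script splits the work
            get_logs=True,
        )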

Michael Hampton

1 Answer


I suggest you have a look at this thread; jug or Ray seem like easier options.
And here you will find a pretty complete list of parallel processing (cluster computing) solutions.
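
As for jug, the idea is to decorate your functions as tasks and then run several jug execute workers, on one machine or on several machines that share the working directory (or a redis backend). A minimal sketch, with a placeholder task, could look like this:

from jug import TaskGenerator

@TaskGenerator
def scrape_chunk(i):
    # placeholder for the real scraping work on chunk i
    return i * 2

results = [scrape_chunk(i) for i in range(100)]

You would then start the workers with jug execute script.py, once per process or machine.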
Here is a Ray example:

import ray
ray.init()

@ray.remote
def mapping_function(input):
    return input + 1

results = ray.get([mapping_function.remote(i) for i in range(100)])
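
Note that ray.init() with no arguments only starts Ray on the local machine. To spread the work over several machines you would first start a Ray cluster (roughly, ray start --head on one node and ray start --address=<head-node-ip>:6379 on the workers) and then attach to it from the script, for example:

import ray

# Attach to an already-running Ray cluster instead of starting a local one.
ray.init(address="auto")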

Or, if you are using Python multiprocessing, you can scale it out to a cluster simply by importing Pool from ray.util.multiprocessing.pool instead of from multiprocessing.pool.
Check out this post for details.

Example code you could run (Monte Carlo Pi Estimation):

import math
import random
import time

def sample(num_samples):
    num_inside = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
    return num_inside

def approximate_pi_distributed(num_samples):
    from ray.util.multiprocessing.pool import Pool  # NOTE: Only the import statement is changed.
    pool = Pool()

    start = time.time()
    num_inside = 0
    sample_batch_size = 100000
    for result in pool.map(sample, [sample_batch_size for _ in range(num_samples // sample_batch_size)]):
        num_inside += result

    print("pi ~= {}".format((4 * num_inside) / num_samples))
    print("Finished in: {:.2f}s".format(time.time() - start))
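
To try it out, you could then call the function with an arbitrary number of samples, for example:

if __name__ == "__main__":
    approximate_pi_distributed(1000000)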

Note: I am looking to achieve the same thing for one of my projects, and I haven't tried it yet, but I will soon. I'll post any interesting updates here; don't hesitate to do the same from your side.

Ksign